Add changelog and update readme.
alexlangberg committed May 14, 2015
1 parent 7335cab commit 7823924
Showing 5 changed files with 57 additions and 22 deletions.
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,12 @@
# 4.0.0 (2015-05-15)

## Features

- Goldwasher can now be used both for scraping and for converting between its own formats. The input can therefore be any of the following: HTML, XML, a cheerio object, an array of goldwasher items, goldwasher XML or even an RSS/Atom feed. The output can be JSON, XML, Atom or RSS. Note that feeds do not contain as much information as JSON or XML.
- The parameter "batch" has been added to the format. It contains a UUID that will be the same for all nuggets of a goldwasher batch.
- The parameter "source" has been added to the format. It contains the original URL of the scraped page.

## Breaking changes

- The flags for individual goldwasher format keys have been removed. You will thus always get full goldwasher formatted objects out. If you need to remove keys from them, do so afterwards.
- If upgrading from older versions, note that ```target``` has been renamed to the more proper ```selector```.
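
Since the per-key flags are gone, unwanted keys have to be stripped after goldwasher has returned. The following is a minimal sketch of one way to do that, not part of goldwasher itself: ```omitKeys``` is a hypothetical helper, and the sample nugget only mimics the goldwasher format with illustrative values.

```javascript
// Hypothetical helper, not part of goldwasher: returns copies of the nuggets
// without the listed keys and leaves the input array untouched.
function omitKeys(nuggets, keys) {
  return nuggets.map(function (nugget) {
    var copy = {};
    Object.keys(nugget).forEach(function (key) {
      if (keys.indexOf(key) === -1) {
        copy[key] = nugget[key];
      }
    });
    return copy;
  });
}

// Illustrative nugget shaped like the goldwasher format.
var sample = [{
  tag: 'h1',
  position: 0,
  total: 2,
  uuid: '808b7490-f743-11e4-90b2-df723554e9be',
  batch: '14eefda0-f762-11e4-a0b3-d5647c4f7651',
  source: 'http://www.oakisstrong.com'
}];

// Drop the identifiers, keep everything else.
console.log(omitKeys(sample, ['uuid', 'batch']));
```

Copying instead of deleting keys in place keeps the original nuggets intact, in case they are still needed elsewhere.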
30 changes: 21 additions & 9 deletions README.md
@@ -1,14 +1,12 @@
# node-goldwasher
[![npm version](http://img.shields.io/npm/v/goldwasher.svg)](https://www.npmjs.org/package/goldwasher)
[![Build Status](http://img.shields.io/travis/alexlangberg/node-goldwasher.svg)](https://travis-ci.org/alexlangberg/node-goldwasher)
[![Coverage Status](http://img.shields.io/coveralls/alexlangberg/node-goldwasher.svg)](https://coveralls.io/r/alexlangberg/node-goldwasher?branch=master)
[![Code Climate](http://img.shields.io/codeclimate/github/alexlangberg/node-goldwasher.svg)](https://codeclimate.com/github/alexlangberg/node-goldwasher)
[![npm version](http://img.shields.io/npm/v/goldwasher.svg)](https://www.npmjs.org/package/goldwasher)

[![Dependency Status](https://david-dm.org/alexlangberg/node-goldwasher.svg)](https://david-dm.org/alexlangberg/node-goldwasher)
[![devDependency Status](https://david-dm.org/alexlangberg/node-goldwasher/dev-status.svg)](https://david-dm.org/alexlangberg/node-goldwasher#info=devDependencies)

**NOTE:** Version 3 has been a complete rewrite. UUIDs have been added and all parts can be selectively turned off by passing e.g. ```href: false``` as an option. The only breaking change should be that you have to switch the html and options parameters and rename the ```targets``` parameter to ```selector```.

The purpose of this module is to extract text information from HTML, usually a website, which will often have to be sanitized and filtered to be useful. This module takes a pile of HTML and washes out the parts you need as small, golden nuggets of text and related metadata, the default options referred to as "goldwasher format":

JSON format (see additional formats at the bottom):
@@ -26,7 +24,9 @@ JSON format (see additional formats in the bottom):
tag: "h1",
position: 0,
total: 2,
uuid: "808b7490-f743-11e4-90b2-df723554e9be"
uuid: "808b7490-f743-11e4-90b2-df723554e9be",
batch: "14eefda0-f762-11e4-a0b3-d5647c4f7651",
source: "http://www.oakisstrong.com"
}
```

@@ -43,6 +43,8 @@ It works by passing it either pure HTML as a string (e.g. from [request](https:/
9. Assign a unique identifier (UUID V1).
10. Index the nugget position in the order it was found.
11. Add the total nugget count.
12. Add the URL of the original source.
13. Assign a unique identifier (UUID V1) that is the same for the entire batch of nuggets.

The returned nuggets include the object properties:

@@ -58,8 +60,11 @@ The returned nuggets include the object properties:
- ```position``` - the position of the object, indicating the order in which tags were found. 0-based.
- ```total``` - total number of nuggets in relation to the position. 1-based.
- ```uuid``` - a unique identifier (UUID V1).
- ```batch``` - a unique identifier (UUID V1) that is the same for the entire batch of nuggets.
- ```source``` - a URL that was scraped, also the same for all nuggets.


Alternatively, the output can be configured as XML, Atom or RSS format with the ```output``` option.
Alternatively, the output can be configured as XML, Atom or RSS format with the ```output``` option. Redundant information, such as the source, is included because each returned nugget is meant to be an atomic piece of information. As such, each nugget is to contain the information that "somewhere, at some point in time, something was written (with a link to some place)".
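
Because every nugget carries its own ```batch``` and ```position```, nuggets from several runs can be mixed (e.g. in a database) and regrouped later. The following is a minimal sketch of that idea; ```groupByBatch``` is a hypothetical helper, not part of goldwasher, and it assumes only the ```batch``` and ```position``` properties described above.

```javascript
// Hypothetical helper, not part of goldwasher: groups nuggets by their
// shared batch UUID and restores the order in which they were found.
function groupByBatch(nuggets) {
  var groups = {};
  nuggets.forEach(function (nugget) {
    if (!groups[nugget.batch]) {
      groups[nugget.batch] = [];
    }
    groups[nugget.batch].push(nugget);
  });
  // position is 0-based, so sorting on it recovers the original order.
  Object.keys(groups).forEach(function (batch) {
    groups[batch].sort(function (a, b) {
      return a.position - b.position;
    });
  });
  return groups;
}

// Illustrative nuggets from two different batches, stored out of order.
var mixed = [
  { batch: 'batch-a', position: 1, total: 2, tag: 'h2' },
  { batch: 'batch-b', position: 0, total: 1, tag: 'h1' },
  { batch: 'batch-a', position: 0, total: 2, tag: 'h1' }
];

console.log(groupByBatch(mixed));
```

Since ```total``` is 1-based, a group is complete when its length equals the ```total``` of any of its nuggets.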

## Installation
```
@@ -75,7 +80,6 @@ npm install goldwasher
- ```filterKeywords``` - stop words that should be excluded as keywords.
- ```filterLocale``` - stop words from external json file (see the folder stop_words).
- ```format``` - output format (```json```, ```xml```, ```atom``` or ```rss```).
- The rest can be selectively turned off by passing e.g. ```href: false```.

## Example
```javascript
@@ -110,7 +114,9 @@ var result = goldwasher(html, options);
tag: "h1",
position: 0,
total: 2,
uuid: "808b7490-f743-11e4-90b2-df723554e9be"
uuid: "808b7490-f743-11e4-90b2-df723554e9be",
batch: "14eefda0-f762-11e4-a0b3-d5647c4f7651",
source: "http://www.oakisstrong.com"
},
{
timestamp: 1402847736381,
@@ -124,7 +130,9 @@ var result = goldwasher(html, options);
tag: "h2",
position: 1,
total: 2,
uuid: "a48fbb30-f743-11e4-96e6-7b423a412011"
uuid: "a48fbb30-f743-11e4-96e6-7b423a412011",
batch: "14eefda0-f762-11e4-a0b3-d5647c4f7651",
source: "http://www.oakisstrong.com"
}
]
*/
@@ -146,7 +154,9 @@ var result = goldwasher(html, options);
tag: "h1",
position: 0,
total: 2,
uuid: "808b7490-f743-11e4-90b2-df723554e9be"
uuid: "808b7490-f743-11e4-90b2-df723554e9be",
batch: "14eefda0-f762-11e4-a0b3-d5647c4f7651",
source: "http://www.oakisstrong.com"
}
```

@@ -162,6 +172,8 @@ var result = goldwasher(html, options);
<timestamp>1431296135800</timestamp>
<uuid>14eefda0-f762-11e4-a0b3-d5647c4f7651</uuid>
<total>3</total>
<batch>14eefda0-f762-11e4-a0b3-d5647c4f7651</batch>
<source>http://www.oakisstrong.com</source>
<keyword>
<word>oak</word>
<count>1</count>
2 changes: 1 addition & 1 deletion docs/goldwasher.js.html
@@ -682,7 +682,7 @@ <h2><a href="index.html">Index</a></h2>
<br clear="both">

<footer>
Documentation generated by <a href="https://github.com/jsdoc3/jsdoc">JSDoc 3.3.0-alpha5</a> on Thu May 14 2015 14:59:49 GMT+0200 (CEST)
Documentation generated by <a href="https://github.com/jsdoc3/jsdoc">JSDoc 3.3.0-alpha5</a> on Thu May 14 2015 16:33:25 GMT+0200 (CEST)
</footer>

<script> prettyPrint(); </script>
34 changes: 23 additions & 11 deletions docs/index.html
@@ -42,13 +42,12 @@ <h3> </h3>

<section>
<article><h1 id="node-goldwasher">node-goldwasher</h1>
<p><a href="https://travis-ci.org/alexlangberg/node-goldwasher"><img src="http://img.shields.io/travis/alexlangberg/node-goldwasher.svg" alt="Build Status"></a>
<p><a href="https://www.npmjs.org/package/goldwasher"><img src="http://img.shields.io/npm/v/goldwasher.svg" alt="npm version"></a>
<a href="https://travis-ci.org/alexlangberg/node-goldwasher"><img src="http://img.shields.io/travis/alexlangberg/node-goldwasher.svg" alt="Build Status"></a>
<a href="https://coveralls.io/r/alexlangberg/node-goldwasher?branch=master"><img src="http://img.shields.io/coveralls/alexlangberg/node-goldwasher.svg" alt="Coverage Status"></a>
<a href="https://codeclimate.com/github/alexlangberg/node-goldwasher"><img src="http://img.shields.io/codeclimate/github/alexlangberg/node-goldwasher.svg" alt="Code Climate"></a>
<a href="https://www.npmjs.org/package/goldwasher"><img src="http://img.shields.io/npm/v/goldwasher.svg" alt="npm version"></a></p>
<a href="https://codeclimate.com/github/alexlangberg/node-goldwasher"><img src="http://img.shields.io/codeclimate/github/alexlangberg/node-goldwasher.svg" alt="Code Climate"></a></p>
<p><a href="https://david-dm.org/alexlangberg/node-goldwasher"><img src="https://david-dm.org/alexlangberg/node-goldwasher.svg" alt="Dependency Status"></a>
<a href="https://david-dm.org/alexlangberg/node-goldwasher#info=devDependencies"><img src="https://david-dm.org/alexlangberg/node-goldwasher/dev-status.svg" alt="devDependency Status"></a></p>
<p><strong>NOTE:</strong> Version 3 has been a complete rewrite. UUIDs have been added and all parts can be selectively turned off by passing e.g. <code>href: false</code> as an option. The only breaking change should be that you have to switch the html and options parameters and rename the <code>targets</code> parameter to <code>selector</code>.</p>
<p>The purpose of this module is to extract text information from HTML, usually a website, which will often have to be sanitized and filtered to be useful. This module takes a pile of HTML and washes out the parts you need as small, golden nuggets of text and related metadata, the default options referred to as &quot;goldwasher format&quot;:</p>
<p>JSON format (see additional formats at the bottom):</p>
<pre><code class="lang-javascript">{
@@ -64,7 +63,9 @@ <h3> </h3>
tag: &quot;h1&quot;,
position: 0,
total: 2,
uuid: &quot;808b7490-f743-11e4-90b2-df723554e9be&quot;
uuid: &quot;808b7490-f743-11e4-90b2-df723554e9be&quot;,
batch: &quot;14eefda0-f762-11e4-a0b3-d5647c4f7651&quot;,
source: &quot;http://www.oakisstrong.com&quot;
}
</code></pre>
<p>It works by passing it either pure HTML as a string (e.g. from <a href="https://www.npmjs.org/package/request">request</a>) or a <a href="https://www.npmjs.org/package/cheerio">cheerio</a> object, usually along with a <a href="https://www.npmjs.org/package/cheerio">cheerio</a> (jQuery) selector (html tags) from which the text should be extracted, along with other options. It will then return an array of nuggets (objects) of information - one per recognized tag. For each nugget, it will try to:</p>
@@ -80,6 +81,8 @@ <h3> </h3>
<li>Assign a unique identifier (UUID V1).</li>
<li>Index the nugget position in the order it was found.</li>
<li>Add the total nugget count.</li>
<li>Add the URL of the original source.</li>
<li>Assign a unique identifier (UUID V1) that is the same for the entire batch of nuggets.</li>
</ol>
<p>The returned nuggets include the object properties:</p>
<ul>
@@ -97,8 +100,10 @@ <h3> </h3>
<li><code>position</code> - the position of the object, indicating the order in which tags were found. 0-based.</li>
<li><code>total</code> - total number of nuggets in relation to the position. 1-based.</li>
<li><code>uuid</code> - a unique identifier (UUID V1).</li>
<li><code>batch</code> - a unique identifier (UUID V1) that is the same for the entire batch of nuggets.</li>
<li><code>source</code> - a URL that was scraped, also the same for all nuggets.</li>
</ul>
<p>Alternatively, the output can be configured as XML, Atom or RSS format with the <code>output</code> option.</p>
<p>Alternatively, the output can be configured as XML, Atom or RSS format with the <code>output</code> option. Redundant information, such as the source, is included because each returned nugget is meant to be an atomic piece of information. As such, each nugget is to contain the information that &quot;somewhere, at some point in time, something was written (with a link to some place)&quot;.</p>
<h2 id="installation">Installation</h2>
<pre><code>npm install goldwasher
</code></pre><h2 id="options">Options</h2>
@@ -111,7 +116,6 @@ <h2 id="installation">Installation</h2>
<li><code>filterKeywords</code> - stop words that should be excluded as keywords.</li>
<li><code>filterLocale</code> - stop words from external json file (see the folder stop_words).</li>
<li><code>format</code> - output format (<code>json</code>, <code>xml</code>, <code>atom</code> or <code>rss</code>).</li>
<li>The rest can be selectively turned off by passing e.g. <code>href: false</code>.</li>
</ul>
<h2 id="example">Example</h2>
<pre><code class="lang-javascript">var goldwasher = require(&#39;goldwasher&#39;);
@@ -145,7 +149,9 @@ <h2 id="example">Example</h2>
tag: &quot;h1&quot;,
position: 0,
total: 2,
uuid: &quot;808b7490-f743-11e4-90b2-df723554e9be&quot;
uuid: &quot;808b7490-f743-11e4-90b2-df723554e9be&quot;,
batch: &quot;14eefda0-f762-11e4-a0b3-d5647c4f7651&quot;,
source: &quot;http://www.oakisstrong.com&quot;
},
{
timestamp: 1402847736381,
@@ -159,7 +165,9 @@ <h2 id="example">Example</h2>
tag: &quot;h2&quot;,
position: 1,
total: 2,
uuid: &quot;a48fbb30-f743-11e4-96e6-7b423a412011&quot;
uuid: &quot;a48fbb30-f743-11e4-96e6-7b423a412011&quot;,
batch: &quot;14eefda0-f762-11e4-a0b3-d5647c4f7651&quot;,
source: &quot;http://www.oakisstrong.com&quot;
}
]
*/
@@ -179,7 +187,9 @@ <h2 id="output-formats">Output formats</h2>
tag: &quot;h1&quot;,
position: 0,
total: 2,
uuid: &quot;808b7490-f743-11e4-90b2-df723554e9be&quot;
uuid: &quot;808b7490-f743-11e4-90b2-df723554e9be&quot;,
batch: &quot;14eefda0-f762-11e4-a0b3-d5647c4f7651&quot;,
source: &quot;http://www.oakisstrong.com&quot;
}
</code></pre>
<p><strong>XML:</strong></p>
@@ -193,6 +203,8 @@ <h2 id="output-formats">Output formats</h2>
&lt;timestamp&gt;1431296135800&lt;/timestamp&gt;
&lt;uuid&gt;14eefda0-f762-11e4-a0b3-d5647c4f7651&lt;/uuid&gt;
&lt;total&gt;3&lt;/total&gt;
&lt;batch&gt;14eefda0-f762-11e4-a0b3-d5647c4f7651&lt;/batch&gt;
&lt;source&gt;http://www.oakisstrong.com&lt;/source&gt;
&lt;keyword&gt;
&lt;word&gt;oak&lt;/word&gt;
&lt;count&gt;1&lt;/count&gt;
@@ -303,7 +315,7 @@ <h2><a href="index.html">Index</a></h2>
<br clear="both">

<footer>
Documentation generated by <a href="https://github.com/jsdoc3/jsdoc">JSDoc 3.3.0-alpha5</a> on Thu May 14 2015 14:59:49 GMT+0200 (CEST)
Documentation generated by <a href="https://github.com/jsdoc3/jsdoc">JSDoc 3.3.0-alpha5</a> on Thu May 14 2015 16:33:25 GMT+0200 (CEST)
</footer>

<script> prettyPrint(); </script>
1 change: 0 additions & 1 deletion package.json
@@ -47,7 +47,6 @@
"dependencies": {
"cheerio": "^0.19.0",
"feed": "^0.2.6",
"istanbul": "^0.3.14",
"joi": "^6.4.1",
"js2xmlparser": "^0.1.9",
"node-uuid": "^1.4.3",
