Add changelog and update readme.

alexlangberg · May 14, 2015 · 7823924 · 7823924
1 parent 7335cab
commit 7823924
Showing 5 changed files with 57 additions and 22 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,12 @@
+# 4.0.0 (2015-05-15)
+
+## Features
+
+- goldwasher can now be used both for scraping and for converting between its own formats. The input can therefore be any of the following: html, xml, a cheerio object, an array of goldwasher items, goldwasher xml or even an RSS/Atom feed. The output can be json, xml, atom or rss. Note that feeds do not contain as much information as json or xml.
+- The parameter "batch" has been added to the format. It contains a UUID that will be the same for all nuggets of a goldwasher batch.
+- The parameter "source" has been added to the format. It contains the original URL of the scraped page.
+
+## Breaking changes
+
+- The flags for individual goldwasher format keys have been removed. You will thus always get full goldwasher formatted objects out. If you need to remove keys from them, do so afterwards.
+- If upgrading from older versions, note that ```target``` has been renamed to the more proper ```selector```.
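
The breaking change above leaves key removal to the caller. The following is a minimal sketch of one way to do that once the nuggets have been returned. `omitKeys` is a hypothetical helper and not part of goldwasher, and the sample nuggets only mimic the fields shown in the README.

```javascript
// Hypothetical helper (an assumption, not goldwasher API):
// returns copies of the nuggets without the listed keys.
function omitKeys(nuggets, keys) {
  return nuggets.map(function (nugget) {
    var copy = {};
    Object.keys(nugget).forEach(function (key) {
      if (keys.indexOf(key) === -1) {
        copy[key] = nugget[key];
      }
    });
    return copy;
  });
}

// Illustrative nuggets, shaped like the README example.
var nuggets = [
  {
    tag: "h1",
    position: 0,
    total: 2,
    uuid: "808b7490-f743-11e4-90b2-df723554e9be",
    batch: "14eefda0-f762-11e4-a0b3-d5647c4f7651",
    source: "http://www.oakisstrong.com"
  },
  {
    tag: "h2",
    position: 1,
    total: 2,
    uuid: "a48fbb30-f743-11e4-96e6-7b423a412011",
    batch: "14eefda0-f762-11e4-a0b3-d5647c4f7651",
    source: "http://www.oakisstrong.com"
  }
];

// Drop the identifiers but keep everything else.
var slim = omitKeys(nuggets, ["uuid", "batch"]);
console.log(slim);
```

The helper copies each nugget rather than deleting keys in place, so the original array stays in full goldwasher format for any later conversion.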
diff --git a/README.md b/README.md
@@ -1,14 +1,12 @@
 # node-goldwasher
+[![npm version](http://img.shields.io/npm/v/goldwasher.svg)](https://www.npmjs.org/package/goldwasher)
 [![Build Status](http://img.shields.io/travis/alexlangberg/node-goldwasher.svg)](https://travis-ci.org/alexlangberg/node-goldwasher)
 [![Coverage Status](http://img.shields.io/coveralls/alexlangberg/node-goldwasher.svg)](https://coveralls.io/r/alexlangberg/node-goldwasher?branch=master)
 [![Code Climate](http://img.shields.io/codeclimate/github/alexlangberg/node-goldwasher.svg)](https://codeclimate.com/github/alexlangberg/node-goldwasher)
-[![npm version](http://img.shields.io/npm/v/goldwasher.svg)](https://www.npmjs.org/package/goldwasher)
 
 [![Dependency Status](https://david-dm.org/alexlangberg/node-goldwasher.svg)](https://david-dm.org/alexlangberg/node-goldwasher)
 [![devDependency Status](https://david-dm.org/alexlangberg/node-goldwasher/dev-status.svg)](https://david-dm.org/alexlangberg/node-goldwasher#info=devDependencies)
 
-**NOTE:** Version 3 has been a complete rewrite. UUIDs have been added and all parts can be selectively turned off by passing e.g. ```href: false``` as an option. The only breaking change should be that you have to switch the html and options parameters and rename the ```targets``` parameter to ```selector```.
-
 The purpose of this module is to extract text information from HTML, usually a website, which will often have to be sanitized and filtered to be useful. This module takes a pile of HTML and washes out the parts you need as small, golden nuggets of text and related metadata, the default options referred to as "goldwasher format":
 
 JSON format (see additional formats in the bottom):
@@ -26,7 +24,9 @@ JSON format (see additional formats in the bottom):
     tag: "h1",
     position: 0,
     total: 2,
-    uuid: "808b7490-f743-11e4-90b2-df723554e9be"
+    uuid: "808b7490-f743-11e4-90b2-df723554e9be",
+    batch: "14eefda0-f762-11e4-a0b3-d5647c4f7651",
+    source: "http://www.oakisstrong.com"
 }
 ```
 
@@ -43,6 +43,8 @@ It works by passing it either pure HTML as a string (e.g. from [request](https:/
 9. Assign a unique identifier (UUID V1).
 10. Index the nugget position in the order it was found.
 11. Add the total nugget count.
+12. Add the URL of the original source.
+13. Assign a unique identifier (UUID V1) that is the same for the entire batch of nuggets.
 
 The returned nuggets include the object properties:
 
@@ -58,8 +60,11 @@ The returned nuggets include the object properties:
 - ```position``` - the position of the object, indicating the order in which tags were found. 0-based.
 - ```total``` - total number of nuggets in relation to the position. 1-based.
 - ```uuid``` - a unique identifier (UUID V1).
+- ```batch``` - a unique identifier (UUID V1) that is the same for the entire batch of nuggets.
+- ```source``` - a URL that was scraped, also the same for all nuggets.
+
 
-Alternatively, the output can be configured as XML, Atom or RSS format with the ```output``` option.
+Alternatively, the output can be configured as XML, Atom or RSS format with the ```output``` option. Redundant information, such as the source, is included because each returned nugget is supposed to be an atomic piece of information. As such, each nugget is to contain the information that "somewhere, at some point in time, something was written (with a link to some place)".
 
 ## Installation
 ```
@@ -75,7 +80,6 @@ npm install goldwasher
 - ```filterKeywords``` - stop words that should be excluded as keywords.
 - ```filterLocale``` - stop words from external json file (see the folder stop_words).
 - ```format``` - output format (```json```, ```xml```, ```atom``` or ```rss```).
-- The rest can be selectively turned off by passing e.g. ```href: false```.
 
 ## Example
 ```javascript
@@ -110,7 +114,9 @@ var result = goldwasher(html, options);
     tag: "h1",
     position: 0,
     total: 2,
-    uuid: "808b7490-f743-11e4-90b2-df723554e9be"
+    uuid: "808b7490-f743-11e4-90b2-df723554e9be",
+    batch: "14eefda0-f762-11e4-a0b3-d5647c4f7651",
+    source: "http://www.oakisstrong.com"
    },
   { 
     timestamp: 1402847736381,
@@ -124,7 +130,9 @@ var result = goldwasher(html, options);
     tag: "h2",
     position: 1,
     total: 2,
-    uuid: "a48fbb30-f743-11e4-96e6-7b423a412011"
+    uuid: "a48fbb30-f743-11e4-96e6-7b423a412011",
+    batch: "14eefda0-f762-11e4-a0b3-d5647c4f7651",
+    source: "http://www.oakisstrong.com"
   }
 ]
 */
@@ -146,7 +154,9 @@ var result = goldwasher(html, options);
     tag: "h1",
     position: 0,
     total: 2,
-    uuid: "808b7490-f743-11e4-90b2-df723554e9be"
+    uuid: "808b7490-f743-11e4-90b2-df723554e9be",
+    batch: "14eefda0-f762-11e4-a0b3-d5647c4f7651",
+    source: "http://www.oakisstrong.com"
 }
 ```
 
@@ -162,6 +172,8 @@ var result = goldwasher(html, options);
         <timestamp>1431296135800</timestamp>
         <uuid>14eefda0-f762-11e4-a0b3-d5647c4f7651</uuid>
         <total>3</total>
+        <batch>14eefda0-f762-11e4-a0b3-d5647c4f7651</batch>
+        <source>http://www.oakisstrong.com</source>
         <keyword>
             <word>oak</word>
             <count>1</count>
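
Because every nugget now carries `batch` and `source`, nuggets collected from several runs can be split back into the batches they came from. The following is only a sketch of that idea. `groupByBatch` is a hypothetical helper, not part of goldwasher, and the second batch UUID and URL are made up for illustration.

```javascript
// Hypothetical helper (an assumption, not goldwasher API):
// groups nuggets by their shared "batch" UUID.
function groupByBatch(nuggets) {
  var groups = {};
  nuggets.forEach(function (nugget) {
    if (!groups[nugget.batch]) {
      // All nuggets of a batch share the same source URL.
      groups[nugget.batch] = { source: nugget.source, nuggets: [] };
    }
    groups[nugget.batch].nuggets.push(nugget);
  });
  return groups;
}

var batchA = "14eefda0-f762-11e4-a0b3-d5647c4f7651";
var batchB = "00000000-0000-1000-8000-000000000000"; // made-up UUID

// Illustrative nuggets from two separate runs, mixed together.
var mixed = [
  { tag: "h1", position: 0, total: 2, batch: batchA, source: "http://www.oakisstrong.com" },
  { tag: "h1", position: 0, total: 1, batch: batchB, source: "http://example.com" },
  { tag: "h2", position: 1, total: 2, batch: batchA, source: "http://www.oakisstrong.com" }
];

var groups = groupByBatch(mixed);
console.log(Object.keys(groups).length);
console.log(groups[batchA].nuggets.length);
```

This is the practical payoff of repeating `batch` and `source` on every nugget: each one can be stored or forwarded on its own and still be traced back to its run.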

diff --git a/docs/goldwasher.js.html b/docs/goldwasher.js.html
@@ -682,7 +682,7 @@ <h2><a href="index.html">Index</a></h2>
 <br clear="both">
 
 <footer>
-    Documentation generated by <a href="https://github.com/jsdoc3/jsdoc">JSDoc 3.3.0-alpha5</a> on Thu May 14 2015 14:59:49 GMT+0200 (CEST)
+    Documentation generated by <a href="https://github.com/jsdoc3/jsdoc">JSDoc 3.3.0-alpha5</a> on Thu May 14 2015 16:33:25 GMT+0200 (CEST)
 </footer>
 
 <script> prettyPrint(); </script>

diff --git a/docs/index.html b/docs/index.html
@@ -42,13 +42,12 @@ <h3> </h3>
 
     <section>
         <article><h1 id="node-goldwasher">node-goldwasher</h1>
-<p><a href="https://travis-ci.org/alexlangberg/node-goldwasher"><img src="http://img.shields.io/travis/alexlangberg/node-goldwasher.svg" alt="Build Status"></a>
+<p><a href="https://www.npmjs.org/package/goldwasher"><img src="http://img.shields.io/npm/v/goldwasher.svg" alt="npm version"></a>
+<a href="https://travis-ci.org/alexlangberg/node-goldwasher"><img src="http://img.shields.io/travis/alexlangberg/node-goldwasher.svg" alt="Build Status"></a>
 <a href="https://coveralls.io/r/alexlangberg/node-goldwasher?branch=master"><img src="http://img.shields.io/coveralls/alexlangberg/node-goldwasher.svg" alt="Coverage Status"></a>
-<a href="https://codeclimate.com/github/alexlangberg/node-goldwasher"><img src="http://img.shields.io/codeclimate/github/alexlangberg/node-goldwasher.svg" alt="Code Climate"></a>
-<a href="https://www.npmjs.org/package/goldwasher"><img src="http://img.shields.io/npm/v/goldwasher.svg" alt="npm version"></a></p>
+<a href="https://codeclimate.com/github/alexlangberg/node-goldwasher"><img src="http://img.shields.io/codeclimate/github/alexlangberg/node-goldwasher.svg" alt="Code Climate"></a></p>
 <p><a href="https://david-dm.org/alexlangberg/node-goldwasher"><img src="https://david-dm.org/alexlangberg/node-goldwasher.svg" alt="Dependency Status"></a>
 <a href="https://david-dm.org/alexlangberg/node-goldwasher#info=devDependencies"><img src="https://david-dm.org/alexlangberg/node-goldwasher/dev-status.svg" alt="devDependency Status"></a></p>
-<p><strong>NOTE:</strong> Version 3 has been a complete rewrite. UUIDs have been added and all parts can be selectively turned off by passing e.g. <code>href: false</code> as an option. The only breaking change should be that you have to switch the html and options parameters and rename the <code>targets</code> parameter to <code>selector</code>.</p>
 <p>The purpose of this module is to extract text information from HTML, usually a website, which will often have to be sanitized and filtered to be useful. This module takes a pile of HTML and washes out the parts you need as small, golden nuggets of text and related metadata, the default options referred to as &quot;goldwasher format&quot;:</p>
 <p>JSON format (see additional formats in the bottom):</p>
 <pre><code class="lang-javascript">{ 
@@ -64,7 +63,9 @@ <h3> </h3>
     tag: &quot;h1&quot;,
     position: 0,
     total: 2,
-    uuid: &quot;808b7490-f743-11e4-90b2-df723554e9be&quot;
+    uuid: &quot;808b7490-f743-11e4-90b2-df723554e9be&quot;,
+    batch: &quot;14eefda0-f762-11e4-a0b3-d5647c4f7651&quot;,
+    source: &quot;http://www.oakisstrong.com&quot;
 }
 </code></pre>
 <p>It works by passing it either pure HTML as a string (e.g. from <a href="https://www.npmjs.org/package/request">request</a>) or a <a href="https://www.npmjs.org/package/cheerio">cheerio</a> object, usually along with a <a href="https://www.npmjs.org/package/cheerio">cheerio</a> (jQuery) selector (html tags) from which the text should be extracted, along with other options. It will then return an array of nuggets (objects) of information - one per recognized tag. For each nugget, it will try to:</p>
@@ -80,6 +81,8 @@ <h3> </h3>
 <li>Assign a unique identifier (UUID V1).</li>
 <li>Index the nugget position in the order it was found.</li>
 <li>Add the total nugget count.</li>
+<li>Add the URL of the original source.</li>
+<li>Assign a unique identifier (UUID V1) that is the same for the entire batch of nuggets.</li>
 </ol>
 <p>The returned nuggets include the object properties:</p>
 <ul>
@@ -97,8 +100,10 @@ <h3> </h3>
 <li><code>position</code> - the position of the object, indicating the order in which tags were found. 0-based.</li>
 <li><code>total</code> - total number of nuggets in relation to the position. 1-based.</li>
 <li><code>uuid</code> - a unique identifier (UUID V1).</li>
+<li><code>batch</code> - a unique identifier (UUID V1) that is the same for the entire batch of nuggets.</li>
+<li><code>source</code> - a URL that was scraped, also the same for all nuggets.</li>
 </ul>
-<p>Alternatively, the output can be configured as XML, Atom or RSS format with the <code>output</code> option.</p>
+<p>Alternatively, the output can be configured as XML, Atom or RSS format with the <code>output</code> option. Redundant information, such as the source, is included because each returned nugget is supposed to be an atomic piece of information. As such, each nugget is to contain the information that &quot;somewhere, at some point in time, something was written (with a link to some place)&quot;.</p>
 <h2 id="installation">Installation</h2>
 <pre><code>npm install goldwasher
 </code></pre><h2 id="options">Options</h2>
@@ -111,7 +116,6 @@ <h2 id="installation">Installation</h2>
 <li><code>filterKeywords</code> - stop words that should be excluded as keywords.</li>
 <li><code>filterLocale</code> - stop words from external json file (see the folder stop_words).</li>
 <li><code>format</code> - output format (<code>json</code>, <code>xml</code>, <code>atom</code> or <code>rss</code>).</li>
-<li>The rest can be selectively turned off by passing e.g. <code>href: false</code>.</li>
 </ul>
 <h2 id="example">Example</h2>
 <pre><code class="lang-javascript">var goldwasher = require(&#39;goldwasher&#39;);
@@ -145,7 +149,9 @@ <h2 id="example">Example</h2>
     tag: &quot;h1&quot;,
     position: 0,
     total: 2,
-    uuid: &quot;808b7490-f743-11e4-90b2-df723554e9be&quot;
+    uuid: &quot;808b7490-f743-11e4-90b2-df723554e9be&quot;,
+    batch: &quot;14eefda0-f762-11e4-a0b3-d5647c4f7651&quot;,
+    source: &quot;http://www.oakisstrong.com&quot;
    },
   { 
     timestamp: 1402847736381,
@@ -159,7 +165,9 @@ <h2 id="example">Example</h2>
     tag: &quot;h2&quot;,
     position: 1,
     total: 2,
-    uuid: &quot;a48fbb30-f743-11e4-96e6-7b423a412011&quot;
+    uuid: &quot;a48fbb30-f743-11e4-96e6-7b423a412011&quot;,
+    batch: &quot;14eefda0-f762-11e4-a0b3-d5647c4f7651&quot;,
+    source: &quot;http://www.oakisstrong.com&quot;
   }
 ]
 */
@@ -179,7 +187,9 @@ <h2 id="output-formats">Output formats</h2>
     tag: &quot;h1&quot;,
     position: 0,
     total: 2,
-    uuid: &quot;808b7490-f743-11e4-90b2-df723554e9be&quot;
+    uuid: &quot;808b7490-f743-11e4-90b2-df723554e9be&quot;,
+    batch: &quot;14eefda0-f762-11e4-a0b3-d5647c4f7651&quot;,
+    source: &quot;http://www.oakisstrong.com&quot;
 }
 </code></pre>
 <p><strong>XML:</strong></p>
@@ -193,6 +203,8 @@ <h2 id="output-formats">Output formats</h2>
         &lt;timestamp&gt;1431296135800&lt;/timestamp&gt;
         &lt;uuid&gt;14eefda0-f762-11e4-a0b3-d5647c4f7651&lt;/uuid&gt;
         &lt;total&gt;3&lt;/total&gt;
+        &lt;batch&gt;14eefda0-f762-11e4-a0b3-d5647c4f7651&lt;/batch&gt;
+        &lt;source&gt;http://www.oakisstrong.com&lt;/source&gt;
         &lt;keyword&gt;
             &lt;word&gt;oak&lt;/word&gt;
             &lt;count&gt;1&lt;/count&gt;
@@ -303,7 +315,7 @@ <h2><a href="index.html">Index</a></h2>
 <br clear="both">
 
 <footer>
-    Documentation generated by <a href="https://github.com/jsdoc3/jsdoc">JSDoc 3.3.0-alpha5</a> on Thu May 14 2015 14:59:49 GMT+0200 (CEST)
+    Documentation generated by <a href="https://github.com/jsdoc3/jsdoc">JSDoc 3.3.0-alpha5</a> on Thu May 14 2015 16:33:25 GMT+0200 (CEST)
 </footer>
 
 <script> prettyPrint(); </script>

diff --git a/package.json b/package.json
@@ -47,7 +47,6 @@
   "dependencies": {
     "cheerio": "^0.19.0",
     "feed": "^0.2.6",
-    "istanbul": "^0.3.14",
     "joi": "^6.4.1",
     "js2xmlparser": "^0.1.9",
     "node-uuid": "^1.4.3",