raw-html -- Apache Nutch plugin

An Apache Nutch plugin that captures the raw, unstripped HTML of the pages Nutch crawls.

By default, Apache Nutch strips all HTML tags before indexing. If you want to keep the tags, use the raw-html plugin to store the raw HTML instead.

To do that, download the built version from the build directory. Instructions are available in the README there.

The plugin can easily be modified to extract specific tags, filter them out, or both. To do that, go to the source directory, then download, modify, and build the plugin yourself. It's easy, and instructions are available in the README there.
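
As a rough illustration of what "extract" and "filter" could mean here, the following is a minimal standalone Java sketch. It is not the plugin's code and does not use any Nutch API; the class name `TagFilterSketch` and the methods `extractTag` and `stripTags` are hypothetical names made up for this example. It assumes simple, non-nested elements and uses regular expressions, which is fine for a demonstration but not a substitute for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of the raw-html plugin or of Apache Nutch.
public class TagFilterSketch {

    // Returns every <tagName>...</tagName> element found in the raw HTML,
    // with its tags kept intact. Assumes the element is not nested in itself.
    static List<String> extractTag(String rawHtml, String tagName) {
        String quoted = Pattern.quote(tagName);
        Pattern p = Pattern.compile(
                "<" + quoted + "\\b[^>]*>.*?</" + quoted + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(rawHtml);
        List<String> elements = new ArrayList<>();
        while (m.find()) {
            elements.add(m.group());
        }
        return elements;
    }

    // Crude approximation of tag stripping: drops anything that looks like
    // a tag and collapses whitespace, leaving only the text.
    static String stripTags(String rawHtml) {
        return rawHtml.replaceAll("<[^>]*>", " ")
                      .replaceAll("\\s+", " ")
                      .trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hi</title></head>"
                + "<body><p>One</p><p class=\"x\">Two</p></body></html>";
        System.out.println("Stripped text: " + stripTags(html));
        System.out.println("Raw <p> elements: " + extractTag(html, "p"));
    }
}
```

The stripped form loses the `class` attribute and the paragraph boundaries, while the extracted elements keep them, which is the kind of information the raw HTML preserves.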