Apache Nutch plugin to capture the raw, unstripped HTML content that Nutch crawls
By default, Apache Nutch strips all HTML tags before indexing. What if you want to keep the tags? Enter the raw-html plugin. Use this Nutch plugin to store the raw HTML.
To do that, download the built version from the build directory. Instructions are available in the readme there.
The plugin can easily be modified to extract specific tags, filter them out, or both. If you want to do that, go to the source directory; download, modify, and build it yourself. It's easy. Instructions are available in the readme there.
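
As a rough illustration of what extracting or filtering specific tags could look like, here is a minimal, standalone Java sketch. It is not the plugin's actual code and does not use any Nutch API: the class name TagFilterSketch and its methods are hypothetical, and it assumes the raw HTML is already available as a String. It uses a simple regular expression, so it does not handle nested elements of the same tag; a real modification would more likely work on the parsed document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the raw-html plugin or of Nutch.
final class TagFilterSketch {

    // Builds a pattern for <tag ...>...</tag>, case-insensitive and spanning line breaks.
    // Limitation: nested elements of the same tag are not matched correctly.
    private static Pattern elementPattern(String tag) {
        String name = Pattern.quote(tag);
        return Pattern.compile(
                "<" + name + "\\b[^>]*>.*?</" + name + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    // "Extract": return every element with the given tag name, markup included.
    static List<String> extractTag(String rawHtml, String tag) {
        List<String> found = new ArrayList<>();
        Matcher m = elementPattern(tag).matcher(rawHtml);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // "Filter": return the HTML with every element of the given tag name removed.
    static String removeTag(String rawHtml, String tag) {
        return elementPattern(tag).matcher(rawHtml).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>first</p><script>track()</script><p>second</p>";
        System.out.println(extractTag(html, "p"));     // the two <p> elements
        System.out.println(removeTag(html, "script")); // the HTML without the <script> element
    }
}
```

Combining the two methods (remove unwanted tags first, then extract the ones you need) covers the "both" case.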