Tutorial: Nutch

jpmckinney edited this page · 6 revisions

The following information was contributed by Praful Bagai and J. Gobel.

This tutorial assumes that you are customizing the Reuters tutorial. It has been tested with Solr 3.6 and Nutch 1.6.

Solr and Nutch

In $NUTCH_RUNTIME_HOME/conf/schema.xml, replace:

<field name="content" type="text" stored="false" indexed="true"/>

with:

<field name="content" type="text" stored="true" indexed="true"/>

to make the value of the content field retrievable during a search.

Check that the following properties in your nutch-default.xml are set as shown:

<property>
  <name>fetcher.store.content</name>
  <value>true</value>
  <description>If true, fetcher will store content.</description>
</property>
<property>
  <name>parser.caching.forbidden.policy</name>
  <value>content</value>
  <description>If a site (or a page) requests through its robot metatags
  that it should not be shown as cached content, apply this policy. Currently
  three keywords are recognized: "none" ignores any "noarchive" directives.
  "content" doesn't show the content, but shows summaries (snippets).
  "all" doesn't show either content or summaries.</description>
</property>

You may also need to copy fields from your Nutch schema to your Solr schema.

Next, follow this tutorial up to step 3.1. At step 3.1, run the command below instead of the one given there, then continue up to step 6. You should then be able to log in to your Solr server and search for what Nutch crawled.

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
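
One way to confirm that the crawl reached Solr, and that the content field is now returned, is to request a few documents directly. The sketch below only builds the request URL. It assumes the default /select handler on a single-core Solr at the address used in the command above, and the helper name solrSelectUrl is made up for illustration.

```javascript
// Build a Solr select URL that asks for specific stored fields as JSON.
// Assumption: the server exposes the default /select request handler.
function solrSelectUrl(baseUrl, query, fields) {
  // Accept a base URL with or without a trailing slash.
  var base = baseUrl.replace(/\/+$/, '');
  var pairs = [
    ['q', query],             // the search query
    ['fl', fields.join(',')], // stored fields to return
    ['wt', 'json']            // response format
  ];
  var queryString = pairs.map(function (pair) {
    return encodeURIComponent(pair[0]) + '=' + encodeURIComponent(pair[1]);
  }).join('&');
  return base + '/select?' + queryString;
}

// Match every document and ask for the fields Nutch is expected to index.
var url = solrSelectUrl('http://localhost:8983/solr/', '*:*', ['title', 'url', 'content']);
console.log(url);
```

Opening the printed URL in a browser should list the crawled documents. If content is missing from the results, the field is probably still declared with stored="false".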

AJAX Solr

Download AJAX Solr and unpack the ZIP file into its own directory where your web server can find it.

Widgets

In examples/reuters/js/reuters.js, set solrUrl to point to your Solr server, and update the Solr parameters in var params to reflect the structure of your Solr documents:

  • Update facet.field with the fields on which you want to facet, e.g. [ 'title' ]
  • Remove f.topics.facet.limit and f.countryCodes.facet.limit unless your Solr documents have topics or countryCodes fields
  • Remove all facet.date parameters unless your Solr documents have a date field on which you want to facet
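
The three changes above can be sketched as one small transformation of the parameter object. The starting object here is a hypothetical stand-in for var params, so the actual keys and values in reuters.js may differ, and adaptParams is an illustrative helper rather than part of AJAX Solr.

```javascript
// Parameters the steps above say to remove when the Solr documents
// have no topics or countryCodes fields.
var REUTERS_ONLY = ['f.topics.facet.limit', 'f.countryCodes.facet.limit'];

// Return a copy of params without the Reuters-specific and facet.date
// parameters, faceting on facetFields instead.
function adaptParams(params, facetFields) {
  var adapted = {};
  Object.keys(params).forEach(function (name) {
    var isDateFacet = name.indexOf('facet.date') === 0;
    var isReutersOnly = REUTERS_ONLY.indexOf(name) !== -1;
    if (!isDateFacet && !isReutersOnly) {
      adapted[name] = params[name];
    }
  });
  adapted['facet.field'] = facetFields;
  return adapted;
}

// Hypothetical starting point; check var params in your copy of reuters.js.
var reutersParams = {
  facet: true,
  'facet.field': ['topics', 'countryCodes'],
  'f.topics.facet.limit': 50,
  'f.countryCodes.facet.limit': -1,
  'facet.date': 'date',
  'facet.date.gap': '+1DAY',
  'json.nl': 'map'
};

var params = adaptParams(reutersParams, ['title']);
console.log(JSON.stringify(params));
```

In practice it is simpler to edit var params by hand. The function form only makes explicit which keys go and which stay.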

Either update or remove the tag cloud, autocomplete, country code and calendar widgets. For the tag cloud, you can set the associated Solr fields by changing the value of var fields, e.g. [ 'title', 'url', 'content' ].

Theme

Unlike the Reuters example, which uses a text field, Nutch uses a content field. In examples/reuters/js/reuters.theme.js, in the AjaxSolr.theme.prototype.snippet function, replace all occurrences of doc.text with doc.content. Nutch has no dateline field, so remove all occurrences of doc.dateline + ' ' +.
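
After those replacements, the snippet function might look roughly like the standalone sketch below. The 300-character cut-off and the "more" link markup are assumptions about the example theme, not a copy of it, and the guard for a missing content field is an addition. To use it in the theme, you would assign the function to AjaxSolr.theme.prototype.snippet.

```javascript
// Assumed cut-off: characters shown before the rest is hidden.
var SNIPPET_LENGTH = 300;

// Render a result snippet from the Nutch content field.
// There is no dateline prefix, since Nutch documents have no dateline field.
function snippet(doc) {
  // Guard against documents indexed without stored content.
  var content = doc.content || '';
  if (content.length <= SNIPPET_LENGTH) {
    return content;
  }
  // Show the first part and keep the remainder in a hidden span
  // that a "more" link can reveal.
  return content.substring(0, SNIPPET_LENGTH) +
    '<span style="display:none;">' + content.substring(SNIPPET_LENGTH) +
    '</span> <a href="#" class="more">more</a>';
}

console.log(snippet({ content: 'A short page body.' }));
```

The guard matters if the content field is still declared with stored="false" in the Solr schema: doc.content is then undefined, and reading its length would throw.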

Wrap up

You should now be able to open examples/reuters/index.html in a browser.
