sparkler 0.1

Sparkler v0.1

Quick Start Guide

Requirements

Apache Solr (tested on 6.4.0, which is recommended; older versions have bugs that affect the functionality of this system)

Steps

Download Apache Solr

# A place to keep all the files organized
mkdir ~/work/sparkler/ -p
cd ~/work/sparkler/
# Download Solr Binary
# On Mac:
curl -O http://archive.apache.org/dist/lucene/solr/6.4.0/solr-6.4.0.tgz
# On other systems (pick your version and mirror):
wget "http://archive.apache.org/dist/lucene/solr/6.4.0/solr-6.4.0.tgz"
# Extract Solr
tar xvzf solr-6.4.0.tgz
# Add crawldb config sets
cd solr-6.4.0/
cp -rv ${SPARKLER_GIT_SOURCE_PATH}/conf/solr/crawldb server/solr/configsets/
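
To confirm that the configset was copied correctly, a quick sanity check from the shell (a minimal sketch; it simply lists the copied directory):

# from the solr-6.4.0/ directory, the crawldb configset should now be present
ls server/solr/configsets/crawldb/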

Start Solr

Solr can be started in local mode or in cloud mode. Note: you must follow exactly one of the two modes below:

Local Mode

There are many ways to do this; here is a relatively easy way to start Solr with the crawldb core:

# from the solr extracted directory
cp -r server/solr/configsets/crawldb server/solr/
./bin/solr start

Wait a moment for Solr to start, then open http://localhost:8983/solr/#/~cores/ in your browser. Click Add Core, fill in 'crawldb' for both the name and instanceDir form fields, and click Add Core to finish.
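
Alternatively, if you prefer the command line to the Admin UI, Solr's create_core command should achieve the same result (a sketch, assuming the configset was copied in the step above):

# from the solr extracted directory: create the crawldb core non-interactively
./bin/solr create_core -c crawldb -d server/solr/configsets/crawldb
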
Verify Solr

After the above steps you should have a core named "crawldb" in Solr. You can verify it by opening http://localhost:8983/solr/crawldb/select?q=* in your browser. This link should return a valid Solr response with 0 documents.
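
The same check can be scripted. A minimal sketch using curl (on a fresh core, numFound in the response should be 0):

curl 'http://localhost:8983/solr/crawldb/select?q=*:*&rows=0&wt=json'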

Now that the crawldb core is ready, go to the Inject Seed URLs phase.

Cloud Mode

Once the crawldb configs are copied to the server/solr/configsets/ folder as described above, use the interactive shell to launch a SolrCloud cluster.

The section below shows the steps to create a cloud of 3 instances with 2 shards and 2 replicas of the crawldb collection. Hit Enter to accept the default values.

$  bin/solr -e cloud
Welcome to the SolrCloud example!
This interactive session will help you launch a SolrCloud cluster on your local workstation.
To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]:
3
Ok, let's start up 3 Solr nodes for your example SolrCloud cluster.
Please enter the port for node1 [8983]:

Please enter the port for node2 [7574]:

Please enter the port for node3 [8984]:

Creating Solr home directory /Users/tg/work/irds/sparkler/workspace/solr-6.4.0/example/cloud/node1/solr
Cloning /Users/tg/work/irds/sparkler/workspace/solr-6.4.0/example/cloud/node1 into
   /Users/tg/work/irds/sparkler/workspace/solr-6.4.0/example/cloud/node2
Cloning /Users/tg/work/irds/sparkler/workspace/solr-6.4.0/example/cloud/node1 into
   /Users/tg/work/irds/sparkler/workspace/solr-6.4.0/example/cloud/node3
...
Now let's create a new collection for indexing documents in your 3-node cluster.
Please provide a name for your new collection: [gettingstarted]
crawldb
How many shards would you like to split crawldb into? [2]

How many replicas per shard would you like to create? [2]

Please choose a configuration for the crawldb collection, available options are:
basic_configs, data_driven_schema_configs, or sample_techproducts_configs [data_driven_schema_configs]
crawldb

Connecting to ZooKeeper at localhost:9983 ...
....
SolrCloud example running, please visit: http://localhost:8983/solr

From the setup above we know that our collection name is crawldb and the ZooKeeper service is at localhost:9983; let's configure Sparkler accordingly. Open conf/sparkler-default.yaml and set crawldb.uri: crawldb::localhost:9983
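
For reference, the relevant line in conf/sparkler-default.yaml would then read as follows (a sketch; leave the rest of the file unchanged):

# conf/sparkler-default.yaml -- point the crawldb at the SolrCloud collection
crawldb.uri: crawldb::localhost:9983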

Inject Seed URLs

Create a file called seed.txt and enter your seed URLs, one per line. Example:

http://nutch.apache.org/
http://tika.apache.org/
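
If you prefer to create the file from the shell, a one-liner sketch that writes the same two URLs:

cat > seed.txt <<EOF
http://nutch.apache.org/
http://tika.apache.org/
EOF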

If you have not done so already, build the `sparkler-app` jar by following the Build and Deploy instructions.

To inject URLs, run the following command.

$ java -jar sparkler-app-0.1.jar inject -sf seed.txt
2016-06-07 19:22:49 INFO  Injector$:70 [main] - Injecting 2 seeds
>>jobId = sparkler-job-1465352569649

This step injected 2 URLs. In addition, we got a jobId, `sparkler-job-1465352569649`. To inject more seeds into the crawldb at a later phase, we can update the same job using this job id (a scripted sketch follows the example below). Usage:

$ java -jar sparkler-app-0.1.jar inject 
 -id (--job-id) VAL        : Id of an existing Job to which the urls are to be
                             injected. No argument will create a new job
 -sf (--seed-file) FILE    : path to seed file
 -su (--seed-url) STRING[] : Seed Url(s)

For example:

   bin/sparkler.sh inject -id sparkler-job-1465352569649 \
      -su http://www.bbc.com/news -su http://espn.go.com/
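
When scripting, the job id can be captured from the injector's output. A sketch that assumes the ">>jobId = ..." output format shown earlier:

# inject seeds and keep the generated job id for later rounds
JOB_ID=$(java -jar sparkler-app-0.1.jar inject -sf seed.txt | grep -oE 'sparkler-job-[0-9]+')
# add more seeds to the same job later
java -jar sparkler-app-0.1.jar inject -id "$JOB_ID" -su http://www.bbc.com/news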

To see these URLs in crawldb, open: http://localhost:8983/solr/crawldb/query?q=*:*&facet=true&facet.field=status&facet.field=depth&facet.field=group

Note: the Solr URL can be updated in the `sparkler-[default|site].properties` file.
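
The same query can be run from the shell; a sketch using curl (the facet counts show how many URLs fall under each status, depth, and group):

curl 'http://localhost:8983/solr/crawldb/query?q=*:*&facet=true&facet.field=status&facet.field=depth&facet.field=group'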

Run Crawl

To run a crawl:

$ java -jar sparkler-app-0.1.jar crawl
 -i (--iterations) N  : Number of iterations to run
 -id (--id) VAL       : Job id. When not sure, get the job id from injector
                        command
 -m (--master) VAL    : Spark Master URI. Ignore this if job is started by
                        spark-submit
 -o (--out) VAL       : Output path, default is job id
 -tg (--top-groups) N : Max Groups to be selected for fetch.
 -tn (--top-n) N      : Top urls per domain to be selected for a round

Example:

    bin/sparkler.sh crawl -id sparkler-job-1465352569649  -m local[*] -i 1
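
Putting the quick start together end to end, a sketch that chains injection and one crawl iteration (assumes the local-mode Solr core from above; $JOB_ID is the variable captured in the inject step):

# inject seeds, capture the job id, then run one crawl iteration on a local Spark master
JOB_ID=$(java -jar sparkler-app-0.1.jar inject -sf seed.txt | grep -oE 'sparkler-job-[0-9]+')
java -jar sparkler-app-0.1.jar crawl -id "$JOB_ID" -m 'local[*]' -i 1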