ConfigureSearch

Following the same logic of Nutchwax, the configuration files will be inside the JAR files, which requires that after changing configuration you must compile again.

Step-by-step

Edit file pwa-technologies/PwaArchive-access/projects/nutchwax/conf/wax-default.xml:

Change property collection.type according to the type of collection for indexing. This has implications mostly during the Linkdb and Index phases. There are three types of collections:
- normal - collection from one crawl. It will be handled as one snapshot, where a version is identified by a URL.
- multiple - collection from multiple crawls. It will be handled as multiple snapshots, where a version is identified by a URL and a day.
- trec - collection from TREC (Text REtrieval Conference)
If normal or trec are selected then exit this page.

NOTE: you can change the values of the other properties if necessary.

If multiple is selected then the database parameters must be configured to create virtual snapshots:
- Change property database.conection (e.g. //t2.tomba.fccn.pt/nutchwax)
- Change property database.username (e.g. nutchwax)
- Change property database.password (e.g. xxxxx)

Virtual snapshots guarantee that versions of documents link only to versions of other documents with the closeness timestamp. This is a requirement for link-based algorithms, such as Pagerank, to work well.

Prepare a PostgreSQL database if multiple is selected in the collection.type property:

Create database for the first time using it:
- createdb nutchwax
- ALTER USER nutchwax WITH PASSWORD 'xxxxxx';
Login in database:
- psql
Create table and trigger:

  DROP table files cascade;
  SET client_encoding TO 'LATIN1';

  CREATE LANGUAGE plpgsql;
  CREATE OR REPLACE FUNCTION before_insert() RETURNS trigger AS '
  DECLARE
   n integer;
  BEGIN
   IF tg_op = ''INSERT'' THEN
     select count(*) into n
      from files
      where date=new.date and url=new.url;  
      IF n > 0 THEN
        RETURN NULL;
      ELSE
        RETURN new;	
      END IF;   
   END IF;
  END
  ' LANGUAGE plpgsql;

  create table files
      (date TIMESTAMP, 
       url VARCHAR(4000),
       type VARCHAR(100),
       status INTEGER,
       size INTEGER NOT NULL,
       arcname VARCHAR(100) NOT NULL,
       PRIMARY KEY (url,date));

   CREATE TRIGGER before_insert_trigger BEFORE INSERT ON files
    FOR EACH ROW EXECUTE PROCEDURE before_insert();

Install heritrix-1.12.1 because of libraries to extract meta-data:
- download heritrix-1.12.1.zip
- unzip it to a heritrixDirectory directory
- export HERITRIX_HOME=heritrixDirectory (e.g. /opt/searcher/heritrix-1.12.1)
- export WAYBACK_HOME=waybackDirectory (e.g. /opt/searcher/pwa-technologies/PwaArchive-access/projects/wayback/wayback-webapp/target/wayback-1.2.1/WEB-INF/lib/)
Extract metadata from ARC files:
- ${HERITRIX_HOME}/src/scripts/arcreader.sh /_directoryArcs_ /_directoryArcs_/stats.csv
- ${HADOOP_HOME}/bin/hadoop jar pwa-technologies/PwaArchive-access/projects/nutchwax/nutchwax-job/target/nutchwax-job-0.11.0-SNAPSHOT.jar class org.apache.access.nutch.utils.UrlNormalizer /_directoryArcs_/stats.csv /_directoryArcs_/statsNormalized.csv 6 1
Load metadata into database:
- psql

  \COPY files FROM '/_directoryArcs_/statsNormalized.csv' DELIMITER ',' NULL AS '-' CSV

Create indexes:

  CREATE INDEX type_index ON files(type);
  CREATE INDEX status_index ON files(status);
  CREATE INDEX url_index ON files USING hash(url);

Errors and debugging

Check collection.type configured:
- ${HADOOP_HOME}/bin/hadoop jar pwa-technologies/PwaArchive-access/projects/nutchwax/nutchwax-job/target/nutchwax-job-0.11.0-SNAPSHOT.jar version
Test database:
- test postmaster:

psql -h t7.tomba.fccn.pt -d nutchwax * run postmaster: su service postgresql stop service postgresql start * accessing remotly to postgresql - add to file /data/postgres:/data/pg_hba.conf

   host all all 0.0.0.0/0 md5

* set /usr/local/pgsql/data/postgresql.conf:

   listen_addresses = '*'
   port = 5432

See http://www.cyberciti.biz/tips/postgres-allow-remote-access-tcp-connection.html for more info.

Query database just for debugging:

   java -classpath pwa-technologies/PwaArchive-access/projects/nutchwax/nutchwax-job/target/nutchwax-job-0.11.0-SNAPSHOT.jar:~/.m2/repository/postgresql/postgresql/8.3-604.jdbc4/postgresql-8.3-604.jdbc4.jar org.archive.access.nutch.jobs.sql.SqlSearcher [database.connection] [database.user] [database.password] [URL] [timestamp]
   e.g. org.archive.access.nutch.jobs.sql.SqlSearcher //t7.tomba.fccn.pt/nutchwax nutchwax xxxxx http://jn.sapo.pt/robots.txt "20070831100222"

If postgres requires a superuser to create the before_insert() function, then you should:
- su
- su postgres
- psql

   ALTER USER nutchwax WITH CREATEUSER CREATEDB;
   \du

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ConfigureSearch

Step-by-step

Errors and debugging

Clone this wiki locally