Skip to content
arquivo edited this page Jul 28, 2015 · 2 revisions

Following the same logic of Nutchwax, the configuration files will be inside the JAR files, which requires that after changing configuration you must compile again.

Step-by-step

Edit file pwa-technologies/PwaArchive-access/projects/nutchwax/conf/wax-default.xml:

  • Change property collection.type according to the type of collection for indexing. This has implications mostly during the Linkdb and Index phases. There are three types of collections:

    • normal - collection from one crawl. It will be handled as one snapshot, where a version is identified by a URL.
    • multiple - collection from multiple crawls. It will be handled as multiple snapshots, where a version is identified by a URL and a day.
    • trec - collection from TREC (Text REtrieval Conference)
  • If normal or trec are selected then exit this page.

NOTE: you can change the values of the other properties if necessary.

  • If multiple is selected then the database parameters must be configured to create virtual snapshots:
    • Change property database.conection (e.g. //t2.tomba.fccn.pt/nutchwax)
    • Change property database.username (e.g. nutchwax)
    • Change property database.password (e.g. xxxxx)

Virtual snapshots guarantee that versions of documents link only to versions of other documents with the closeness timestamp. This is a requirement for link-based algorithms, such as Pagerank, to work well.

Prepare a PostgreSQL database if multiple is selected in the collection.type property:

  • Create database for the first time using it:

    • createdb nutchwax
    • ALTER USER nutchwax WITH PASSWORD 'xxxxxx';
  • Login in database:

    • psql
  • Create table and trigger:

  DROP table files cascade;
  SET client_encoding TO 'LATIN1';

  CREATE LANGUAGE plpgsql;
  CREATE OR REPLACE FUNCTION before_insert() RETURNS trigger AS '
  DECLARE
   n integer;
  BEGIN
   IF tg_op = ''INSERT'' THEN
     select count(*) into n
      from files
      where date=new.date and url=new.url;  
      IF n > 0 THEN
        RETURN NULL;
      ELSE
        RETURN new;	
      END IF;   
   END IF;
  END
  ' LANGUAGE plpgsql;

  create table files
      (date TIMESTAMP, 
       url VARCHAR(4000),
       type VARCHAR(100),
       status INTEGER,
       size INTEGER NOT NULL,
       arcname VARCHAR(100) NOT NULL,
       PRIMARY KEY (url,date));

   CREATE TRIGGER before_insert_trigger BEFORE INSERT ON files
    FOR EACH ROW EXECUTE PROCEDURE before_insert();
  • Install heritrix-1.12.1 because of libraries to extract meta-data:

    • download heritrix-1.12.1.zip
    • unzip it to a heritrixDirectory directory
    • export HERITRIX_HOME=heritrixDirectory (e.g. /opt/searcher/heritrix-1.12.1)
    • export WAYBACK_HOME=waybackDirectory (e.g. /opt/searcher/pwa-technologies/PwaArchive-access/projects/wayback/wayback-webapp/target/wayback-1.2.1/WEB-INF/lib/)
  • Extract metadata from ARC files:

    • ${HERITRIX_HOME}/src/scripts/arcreader.sh /_directoryArcs_ /_directoryArcs_/stats.csv
    • ${HADOOP_HOME}/bin/hadoop jar pwa-technologies/PwaArchive-access/projects/nutchwax/nutchwax-job/target/nutchwax-job-0.11.0-SNAPSHOT.jar class org.apache.access.nutch.utils.UrlNormalizer /_directoryArcs_/stats.csv /_directoryArcs_/statsNormalized.csv 6 1
  • Load metadata into database:

    • psql
  \COPY files FROM '/_directoryArcs_/statsNormalized.csv' DELIMITER ',' NULL AS '-' CSV
  • Create indexes:
  CREATE INDEX type_index ON files(type);
  CREATE INDEX status_index ON files(status);
  CREATE INDEX url_index ON files USING hash(url);

Errors and debugging

  • Check collection.type configured:

    • ${HADOOP_HOME}/bin/hadoop jar pwa-technologies/PwaArchive-access/projects/nutchwax/nutchwax-job/target/nutchwax-job-0.11.0-SNAPSHOT.jar version
  • Test database:

    • test postmaster:

psql -h t7.tomba.fccn.pt -d nutchwax * run postmaster: su service postgresql stop service postgresql start * accessing remotly to postgresql - add to file /data/postgres:/data/pg_hba.conf

   host all all 0.0.0.0/0 md5
* set /usr/local/pgsql/data/postgresql.conf:
   listen_addresses = '*'
   port = 5432

See http://www.cyberciti.biz/tips/postgres-allow-remote-access-tcp-connection.html for more info.

  • Query database just for debugging:
   java -classpath pwa-technologies/PwaArchive-access/projects/nutchwax/nutchwax-job/target/nutchwax-job-0.11.0-SNAPSHOT.jar:~/.m2/repository/postgresql/postgresql/8.3-604.jdbc4/postgresql-8.3-604.jdbc4.jar org.archive.access.nutch.jobs.sql.SqlSearcher [database.connection] [database.user] [database.password] [URL] [timestamp]
   e.g. org.archive.access.nutch.jobs.sql.SqlSearcher //t7.tomba.fccn.pt/nutchwax nutchwax xxxxx http://jn.sapo.pt/robots.txt "20070831100222"
  • If postgres requires a superuser to create the before_insert() function, then you should:
    • su
    • su postgres
    • psql
   ALTER USER nutchwax WITH CREATEUSER CREATEDB;
   \du