The map/reduce file ingest tool of the full-scale E-ARK deployment. It unpacks TAR-packaged E-ARK information packages and initiates indexing of the individual files through the Lily API. The Java-based tool runs as a service and consumes RabbitMQ messages that announce new packages available in HDFS for indexing.
archive_search
scripts
src/main
.gitignore
LICENSE
README.md
archetype_info.txt
mrjob-assembly.xml
pom.xml


dm-file-ingest

eArk WP6 - index file contents from extracted archives

Text is extracted from PDF, Word, and other documents. Structural information (e.g. headings) is not parsed and cannot be used in search queries.
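As a rough illustration of the flow this tool automates (unpack a TAR package, then hand each contained file to indexing), here is a minimal local sketch. It is not the tool's implementation: the real service is Java-based, works on HDFS, and indexes through the Lily API. The function names `list_package_files`, `submit_for_indexing`, and `ingest_package` are hypothetical, and the indexing step is only a placeholder that prints the file name.

```shell
#!/usr/bin/env bash
# Local sketch only; all function names here are hypothetical.
set -euo pipefail

# list_package_files PACKAGE.tar WORKDIR
# Unpacks the TAR package into WORKDIR and prints one relative path
# per regular file, sorted so the order is stable.
list_package_files() {
  local package="$1" workdir="$2"
  mkdir -p "$workdir"
  tar -xf "$package" -C "$workdir"
  (cd "$workdir" && find . -type f | LC_ALL=C sort)
}

# Placeholder for the indexing call. The actual tool talks to the
# Lily API from Java; this stand-in only reports the file.
submit_for_indexing() {
  echo "would index: $1"
}

# ingest_package PACKAGE.tar WORKDIR
# Unpacks the package and submits every contained file.
ingest_package() {
  local package="$1" workdir="$2" file
  list_package_files "$package" "$workdir" | while IFS= read -r file; do
    submit_for_indexing "$file"
  done
}
```

Assuming a package at a path of your choosing, the functions could be called as `ingest_package /path/to/package.tar /tmp/unpacked`.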

How to: reset the Lily index and/or add new fields

reset Lily index

cd /srv/lily-2.4/bin

list indexes

./lily-list-indexes

set environment

LILY_CONFIG=/srv/dm/dm-file-ingest/src/main/config/lily

only if a new field should be added: edit the following files

$LILY_CONFIG/schema.json
$LILY_CONFIG/indexerconf.xml
/srv/apache-solr-4.0.0/example/solr/eark1/conf/schema.xml

load the schema

./lily-import -s $LILY_CONFIG/schema.json

delete the now outdated index

./lily-update-index -n eark1 --state DELETE_REQUESTED

add index

./lily-add-index -n eark1 -c $LILY_CONFIG/indexerconf.xml -sm classic -s shard1:http://localhost:8983/solr/eark1 -dt eark1

clear Solr index

curl "http://localhost:8983/solr/eark1/update/?commit=true" -d "<delete><query>*:*</query></delete>" -H "Content-Type: text/xml"

rebuild the index

./lily-update-index -n eark1 --build-state BUILD_REQUESTED
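The steps above can be collected into one script. The following is a minimal sketch, assuming the paths, index name, and Solr URL shown in this README. The `DRY_RUN` switch and the `run` and `reset_lily_index` helpers are additions for illustration and are not part of Lily. By default the script only prints the commands it would execute, so it can be reviewed before anything is changed.

```shell
#!/usr/bin/env bash
# Sketch of the reset procedure as one script. Dry-run by default.
set -euo pipefail

# Defaults are taken from this README; override via the environment.
LILY_BIN="${LILY_BIN:-/srv/lily-2.4/bin}"
LILY_CONFIG="${LILY_CONFIG:-/srv/dm/dm-file-ingest/src/main/config/lily}"
INDEX_NAME="${INDEX_NAME:-eark1}"
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/$INDEX_NAME}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print only, 0 = execute

# Print the command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

reset_lily_index() {
  # load the schema
  run "$LILY_BIN/lily-import" -s "$LILY_CONFIG/schema.json"
  # delete the now outdated index
  run "$LILY_BIN/lily-update-index" -n "$INDEX_NAME" --state DELETE_REQUESTED
  # add the index again
  run "$LILY_BIN/lily-add-index" -n "$INDEX_NAME" \
    -c "$LILY_CONFIG/indexerconf.xml" -sm classic \
    -s "shard1:$SOLR_URL" -dt "$INDEX_NAME"
  # clear the Solr index
  run curl "$SOLR_URL/update/?commit=true" \
    -d '<delete><query>*:*</query></delete>' \
    -H 'Content-Type: text/xml'
  # rebuild the index
  run "$LILY_BIN/lily-update-index" -n "$INDEX_NAME" --build-state BUILD_REQUESTED
}

reset_lily_index
```

Run it once as is to inspect the printed commands, then with `DRY_RUN=0` to execute them. The sketch does not wait between steps. The state name DELETE_REQUESTED suggests the deletion is carried out asynchronously, so if adding the index fails because the old one still exists, checking with `./lily-list-indexes` and retrying may be necessary.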