
Automation of Data Acquisition and Crawling


Introduction

This wiki entry provides a guide for setting up a daemon to monitor the data/staging folder for incoming data. When new data is detected, the crawler module kicks in and runs a metadata extraction task to generate client-side metadata, then sends all of that to the File Manager to be ingested.

Prerequisite(s)

Please ensure you have all of the prerequisite software installed. In particular, you should now have an unpacked coal-sds deployment available on your filesystem at /usr/local/coal-sds-deploy. The following documentation assumes you have cd'd into that directory. The unpacked coal-sds contents will look as follows:

[ec2-user@ip-172-31-28-45 coal-sds-deploy]$ ls -al
total 56
drwxr-xr-x 14 root root 4096 Apr 16 02:12 .
drwxr-xr-x 14 root root 4096 Apr 16 02:12 ..
drwxrwxrwx  2 root root 4096 Apr 16 01:51 bin
drwxr-xr-x  7 root root 4096 Apr 16 02:12 crawler
drwxrwxrwx  7 root root 4096 Apr 16 02:07 data
drwxr-xr-x  4 root root 4096 Apr 16 02:12 extensions
drwxr-xr-x  8 root root 4096 Apr 16 02:12 filemgr
drwxrwxrwx  2 root root 4096 Apr 16 02:07 logs
drwxr-xr-x  8 root root 4096 Apr 16 02:12 pcs
drwxr-xr-x  5 root root 4096 Apr 16 02:12 pge
drwxr-xr-x  8 root root 4096 Apr 16 02:12 resmgr
drwxr-xr-x  3 root root 4096 Apr 16 02:12 solr
drwxrwxrwx 11 root root 4096 Apr 16 02:06 tomcat
drwxr-xr-x  8 root root 4096 Apr 16 02:12 workflow

Crawl Controller

Start the COAL SDS Process

Instructions for starting the process can be found here.

Once OODT has been started, navigate to the bin folder in the crawler directory:

$ cd /usr/local/coal-sds-deploy/crawler/bin

Running the Crawler as a Daemon Process

Then, run the following command to start the daemon (listening on port 9003) that will monitor /usr/local/coal-sds-deploy/data/staging every 2 seconds and, upon detection of new staging data, will:

  1. Execute the TikaCmdLineMetExtractor, which uses Apache Tika to extract metadata from whatever it finds.
  2. Ingest the file into the File Manager running on http://localhost:9000.
  3. Upon successful ingestion, move the staging file to data/archive and delete the original data file from data/staging.
  4. Upon unsuccessful ingestion, move the staging file out of data/staging into data/failure.

$ ./crawlctl

That's it...
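To confirm the daemon is actually picking files up, a quick smoke test is to drop a file into the staging directory and watch where it lands. A minimal sketch, assuming a hypothetical test file named sample-granule.dat and the default two-second polling interval:

$ cp /tmp/sample-granule.dat /usr/local/coal-sds-deploy/data/staging/
$ sleep 5
$ ls /usr/local/coal-sds-deploy/data/archive
$ ls /usr/local/coal-sds-deploy/data/failure

On success the file should appear under data/archive and be gone from data/staging; on failure it will be moved to data/failure instead.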

In fact, the above script merely wraps the following execution:

./crawler_launcher --filemgrUrl http://localhost:9000 \
  --operation --launchMetCrawler \
  --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
  --productPath /usr/local/coal-sds-deploy/data/staging \
  --metExtractor org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
  --metExtractorConfig /usr/local/coal-sds-deploy/data/met/tika.conf \
  --failureDir /usr/local/coal-sds-deploy/data/failure \
  --daemonPort 9003 \
  --daemonWait 2 \
  --successDir /usr/local/coal-sds-deploy/data/archive \
  --actionIds DeleteDataFile
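Since the daemon listens on port 9003 and ingests into the File Manager on port 9000, a quick way to confirm both are up before staging any data is to probe those ports. A minimal sketch, assuming nc (netcat) is installed:

$ nc -z localhost 9000 && echo "File Manager (9000) is listening"
$ nc -z localhost 9003 && echo "Crawler daemon (9003) is listening"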

To see the above workflow in action, visit http://localhost:8080/opsui/status to view the ingested metadata.
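If you prefer a command-line check that the OPSUI is reachable before opening it in a browser, something like the following should return an HTTP 200:

$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/opsui/status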