Java Applications

Fabrizio Celli edited this page Jun 16, 2016 · 30 revisions

Requirements: the application needs at least Java 6 to work properly.

The package org.fao.oekc.autotagger.main contains the three main applications you can use to automatically tag documents. These applications should be executed sequentially, since the output of each application is the input of the next one. An example of usage is provided at the end of this guide.

  • DownloadFiles: Starting from a text file that lists the files to be downloaded (one per line), downloads these files into a specific directory and converts them to TXT (supported formats: PDF, TXT, HTML). From v1.2, files are downloaded by multiple Java threads in parallel.
    The constructor requires three parameters, plus an optional one:

    1. input text file location (e.g. data/sources/crawler_result.txt)
    2. the output directory without the ending slash (e.g. data/documents). The output directory will contain the output tar.gz file. The application checks whether this directory exists: if not, it creates the directory; if so, it does NOT delete its content. This directory is also used as a temporary container for downloaded files and RDF generation. After the execution of ProduceMappingTable, however, only the output tar.gz files remain there.
    3. the download mode; the mode "listURL" means that the source file contains a list of URLs, one per line; the mode "nutchOutput" means that the source file is the output file of a Nutch Crawler (you can see the two samples in data/sources). The application generates a mapping file DOC_URL=DOC_NAME, located in the same directory as the input text file (param. 1), with the same name plus the suffix "_mapping". This file will be deleted after the execution of ProduceMappingTable.
    4. (optional) extractTitles; a boolean flag stating whether titles and descriptions should also be extracted from the documents' metadata (default is true).
  • MauiAutoTaggerKey: Given the path containing the text files downloaded by DownloadFiles, it generates the .key files with AGROVOC English keywords. The parameter should therefore be the output directory of the DownloadFiles application (param. 2), which holds the downloaded files.
    From v1.2.1, the user can also specify two optional parameters: a new AGROVOC thesaurus and a model name. The thesaurus name must refer to a file in data/vocabularies called AGROVOC_NAME.rdf.gz. The model name refers to the file containing the MAUI model. If these two parameters are not specified, the system uses the default values "agrovoc_en" and "fao780". For more information about how to build a new model, please refer to New AGROVOC.

  • ProduceMappingTable: this application reads the outputs of MauiAutoTaggerKey and generates an output file, which is a tar.gz file of NTRIPLES RDF files. The application requires three mandatory command line parameters and two optional ones:

    1. input text file location (e.g. data/sources/crawler_result.txt), the same as DownloadFiles (param. 1)
    2. the output directory without the ending slash, the same as DownloadFiles (param. 2). At the end, the output directory contains only the output tar.gz file; all temporary files (downloaded files, .key, _mapping) are deleted.
    3. the output format {rdfnt,text}. The output mode "rdfnt" creates NTRIPLES: the subject is the document URL, the predicate is dcterms:subject, the object is an AGROVOC URI. Currently, an output file can contain at most 100,000 triples: additional triples are written to further output files, all gzipped into a single file. The output mode "text" creates a TXT file: each line of the file follows the pattern URL=LIST_OF_KEYWORDS, where "URL" is the URL of the document and "LIST_OF_KEYWORDS" is the list of AGROVOC terms (as strings) assigned to the document.
    4. (optional) the name of the output file. If this parameter is not given, the system will generate a timestamp as filename. The given filename must be the name of the file, not a path. It may or may not include the extension tar.gz: if the extension is not given, the system will add .tar.gz to the filename.
    5. (optional) a boolean flag stating whether the tagger should also extract titles (default: true). If you need to specify this parameter but you don't want to specify the previous one, you can set the previous one to null.
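
The "rdfnt" output described in parameter 3 can be pictured with a minimal Java sketch. It is not the project's actual code: the class and method names are hypothetical, the document URL and the AGROVOC concept identifier in main are placeholders, and no escaping of special characters in URLs is attempted. The only facts taken as given are the subject/predicate/object layout, the full IRI behind dcterms:subject, and the limit of 100,000 triples per file.

```java
/** Illustrative sketch of the "rdfnt" output; names are hypothetical. */
final class NTriplesSketch {

    // Full IRI behind the prefixed name dcterms:subject.
    static final String DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject";

    // Limit stated in this guide: 100,000 triples per output file.
    static final int MAX_TRIPLES_PER_FILE = 100000;

    /** Builds one N-Triples line: document URL, dcterms:subject, AGROVOC URI. */
    static String triple(String documentUrl, String agrovocUri) {
        return "<" + documentUrl + "> <" + DCTERMS_SUBJECT + "> <" + agrovocUri + "> .";
    }

    /** Number of output files needed for the given number of triples (ceiling division). */
    static int fileCount(int tripleCount) {
        return (tripleCount + MAX_TRIPLES_PER_FILE - 1) / MAX_TRIPLES_PER_FILE;
    }

    public static void main(String[] args) {
        // Placeholder document URL and placeholder AGROVOC concept URI.
        System.out.println(triple("http://example.org/doc1.pdf",
                "http://aims.fao.org/aos/agrovoc/c_0000"));
        // 250,000 triples do not fit in two files of 100,000, so a third is needed.
        System.out.println(fileCount(250000));
    }
}
```

With these assumptions, a run that yields 250,000 triples would be split across three NTRIPLES files inside the single gzipped output.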

It is important to run this application after DownloadFiles and MauiAutoTaggerKey: it relies on the mapping file DOC_URL=DOC_NAME (and, if extractTitles is true, on the mapping files DOC_URL=DOC_TITLE and DOC_URL=DOC_DESCRIPTION), and it needs to find in the output directory the .key files produced by MAUI in the application MauiAutoTaggerKey.
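
To inspect a DOC_URL=DOC_NAME mapping file by hand, a small reader such as the following sketch may help. It is an assumption-laden illustration, not the parser used by ProduceMappingTable: the class name is hypothetical, the file is assumed to be UTF-8, and each line is split on the last '=' because URLs can contain '=' in their query string while file names are assumed not to.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative reader for DOC_URL=DOC_NAME lines; names are hypothetical. */
final class MappingFileSketch {

    /** Parses mapping lines into a URL -> file name map, keeping the file order. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> urlToName = new LinkedHashMap<String, String>();
        for (String line : lines) {
            // Assumption: split on the LAST '=' since the URL itself may contain '='.
            int separator = line.lastIndexOf('=');
            if (separator <= 0) {
                continue; // skip blank or malformed lines
            }
            urlToName.put(line.substring(0, separator), line.substring(separator + 1));
        }
        return urlToName;
    }

    public static void main(String[] args) throws IOException {
        // With an argument, read that mapping file (assumed UTF-8); otherwise use a made-up sample.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)
                : Arrays.asList("http://example.org/report.pdf?id=7=report_7.txt");
        for (Map.Entry<String, String> entry : parse(lines).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Because the mapping file is deleted after ProduceMappingTable runs, such a check only makes sense between the first and the last step.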

In order to generate triples, since MAUI returns labels, ProduceMappingTable needs to read a mapping file that links AGROVOC URIs to labels. The mapping file is located in data/vocabularies and is named agrovocURILabelMappings.txt. The file provided in the package contains mappings for English, Spanish, French, Italian, and Portuguese. To use another language, please read New AGROVOC.


DIRECTORIES:

Mandatory:

  • data/stopwords: to use MAUI
  • data/vocabularies: vocabularies used by MAUI and the mapping of AGROVOC URIs to strings
  • fao780 : MAUI model trained on 780 FAO bibliographic documents

Optional:

  • data/sources: files with URL lists and mapping files from URLs to downloaded file names (two samples are provided)
  • data/test: for testing purposes (can be removed)

EXAMPLE OF USAGE:

  • java org.fao.oekc.autotagger.main.DownloadFiles data/sources/crawler_result.txt data/documents nutchOutput
    Downloads PDF, HTML, TXT files and converts them to TXT in the directory data/documents. Produces a mapping file DOC_URL=DOC_NAME and, because extractTitles defaults to true, the mapping files DOC_URL=DOC_TITLE and DOC_URL=DOC_DESCRIPTION.

  • java org.fao.oekc.autotagger.main.MauiAutoTaggerKey data/documents/docKeys
    Creates the .key files. It requires the TXT files downloaded by the previous application to be in the directory data/documents/docKeys.

  • java org.fao.oekc.autotagger.main.ProduceMappingTable data/sources/crawler_result.txt data/documents rdfnt
    Produces the output, a tar.gz file of NTRIPLES RDF files, in the output directory (data/documents in this example).
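
Since the three applications must run in sequence, the example above can be wrapped in a small driver. The following Java sketch shows one way to do it under several assumptions that this guide does not confirm: the class name is hypothetical, `java` is on the PATH, the CLASSPATH environment variable already covers the autotagger classes and their dependencies, and each application reports failure through a non-zero exit code. The docKeys subdirectory simply mirrors the example. By default the sketch only prints the commands; passing `run` as first argument executes them.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative driver for the three-step example; names are hypothetical. */
final class PipelineSketch {

    static final String PACKAGE = "org.fao.oekc.autotagger.main.";

    /** Returns the three command lines of the example, in execution order. */
    static List<List<String>> commands(String inputFile, String outputDir,
                                       String downloadMode, String outputFormat) {
        List<List<String>> steps = new ArrayList<List<String>>();
        steps.add(Arrays.asList("java", PACKAGE + "DownloadFiles",
                inputFile, outputDir, downloadMode));
        // The example passes the docKeys subdirectory of the output directory.
        steps.add(Arrays.asList("java", PACKAGE + "MauiAutoTaggerKey",
                outputDir + "/docKeys"));
        steps.add(Arrays.asList("java", PACKAGE + "ProduceMappingTable",
                inputFile, outputDir, outputFormat));
        return steps;
    }

    /** Runs each step in order and stops at the first non-zero exit code (assumed to mean failure). */
    static void run(List<List<String>> steps) throws IOException, InterruptedException {
        for (List<String> step : steps) {
            Process process = new ProcessBuilder(step).inheritIO().start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("Step failed with exit code " + exitCode + ": " + step);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> steps = commands("data/sources/crawler_result.txt",
                "data/documents", "nutchOutput", "rdfnt");
        for (List<String> step : steps) {
            System.out.println(String.join(" ", step)); // dry run: show each command
        }
        if (args.length > 0 && "run".equals(args[0])) {
            run(steps);
        }
    }
}
```

Stopping at the first failing step matters here because ProduceMappingTable deletes the temporary files, so running it after an incomplete download or tagging step would discard partial results.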


Example of different usages of DownloadFiles:

  • java org.fao.oekc.autotagger.main.DownloadFiles data/sources/input.txt data/documents listURL false
    Because the boolean flag is explicitly set to false, the application does NOT generate the mapping files DOC_URL=DOC_TITLE and DOC_URL=DOC_DESCRIPTION.

Example of different usages of MauiAutoTaggerKey:

  • java org.fao.oekc.autotagger.main.MauiAutoTaggerKey data/documents/docKeys my_agrovoc my_model
    The system expects to find my_agrovoc.rdf.gz in the data/vocabularies directory, and my_model in the root of the application, at the same level as the existing fao780 file.
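
Before launching MauiAutoTaggerKey with custom names, it can be useful to check that both files are where the application expects them. The sketch below is a hypothetical helper, not part of the package: it applies the naming rule described above (thesaurus NAME resolves to data/vocabularies/NAME.rdf.gz, the model sits in the application root) and falls back to the documented defaults "agrovoc_en" and "fao780". The default thesaurus path is inferred from that rule rather than stated in this guide.

```java
import java.io.File;

/** Illustrative path resolution for the optional MauiAutoTaggerKey parameters. */
final class MauiDefaultsSketch {

    static final String DEFAULT_THESAURUS = "agrovoc_en";
    static final String DEFAULT_MODEL = "fao780";
    static final String VOCABULARY_DIR = "data/vocabularies";

    /** Thesaurus NAME resolves to data/vocabularies/NAME.rdf.gz; null selects the default. */
    static String thesaurusPath(String thesaurusName) {
        String name = (thesaurusName == null) ? DEFAULT_THESAURUS : thesaurusName;
        return VOCABULARY_DIR + "/" + name + ".rdf.gz";
    }

    /** The model lives in the application root; null selects the default. */
    static String modelPath(String modelName) {
        return (modelName == null) ? DEFAULT_MODEL : modelName;
    }

    public static void main(String[] args) {
        // Arguments mirror the optional parameters: [thesaurus name] [model name].
        String thesaurus = thesaurusPath(args.length > 0 ? args[0] : null);
        String model = modelPath(args.length > 1 ? args[1] : null);
        for (String path : new String[] {thesaurus, model}) {
            System.out.println(path + (new File(path).exists() ? " (found)" : " (missing)"));
        }
    }
}
```

Run from the application root with `my_agrovoc my_model`, it reports whether data/vocabularies/my_agrovoc.rdf.gz and my_model are present.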

Example of different usages of ProduceMappingTable:

  • java org.fao.oekc.autotagger.main.ProduceMappingTable ../work/test/crawler_result.txt ../work/output rdfnt myoutput
    The system adds the extension .tar.gz.
  • java org.fao.oekc.autotagger.main.ProduceMappingTable ../work/test/crawler_result.txt ../work/output rdfnt myoutput.tar.gz
  • java org.fao.oekc.autotagger.main.ProduceMappingTable ../work/test/crawler_result.txt ../work/output rdfnt
    The output file name is a timestamp.
  • java org.fao.oekc.autotagger.main.ProduceMappingTable ../work/test/crawler_result.txt ../work/output rdfnt null false
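    The output file name is a timestamp, because the fourth parameter is set to null; titles are not extracted, because the fifth parameter is false.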
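
The naming rules illustrated by these four commands can be summarized in a short Java sketch. It is a reading of the documented behaviour, not the application's source: the class and method names are hypothetical, the timestamp pattern is an assumption (the real format is not documented here), and so is the idea that the timestamp name also receives the .tar.gz extension.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

/** Illustrative output-name resolution for ProduceMappingTable; names are hypothetical. */
final class OutputNameSketch {

    static final String EXTENSION = ".tar.gz";

    /** Resolves the output file name from the optional fourth parameter. */
    static String resolve(String requestedName, Date now) {
        String base;
        if (requestedName == null || "null".equals(requestedName)) {
            // Parameter omitted, or set to the literal "null" to skip it.
            // Assumption: this timestamp pattern is illustrative, not the real one.
            base = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ROOT).format(now);
        } else {
            base = requestedName;
        }
        // Add the extension only when the name does not already carry it.
        return base.endsWith(EXTENSION) ? base : base + EXTENSION;
    }

    public static void main(String[] args) {
        Date now = new Date();
        System.out.println(resolve("myoutput", now));
        System.out.println(resolve("myoutput.tar.gz", now));
        System.out.println(resolve(null, now));
        System.out.println(resolve("null", now));
    }
}
```

Under these assumptions the first two calls give the same name, myoutput.tar.gz, and the last two give a timestamp-based name.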