Preprocessing

The class PreprocessWdk reads a corpus in WdK format and, after performing some normalization, annotates sentences, tokens, lemmas, part-of-speech tags, and named entities. The results are stored as binary CAS files.

The input files should be stored in the following structure:

  • The meta.xml file contains the metadata for a book with a given bookid:
<basedir>/<bookid>/meta.xml
  • The pages of the book are stored like this:
<basedir>/<bookid>/ocr/<collection_ppn>_txt/<pageid>.txt

The pageid comprises 8 digits, e.g. 00000001.txt or 00000110.txt.
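
The zero-padded page-id convention above can be reproduced with standard formatting; a minimal sketch (the class and method names are illustrative, not part of the codebase):

```java
public class PageIdDemo {
    // Page ids are zero-padded to 8 digits, e.g. 1 -> "00000001.txt".
    static String pageFile(int page) {
        return String.format("%08d.txt", page);
    }

    public static void main(String[] args) {
        System.out.println(pageFile(1));   // 00000001.txt
        System.out.println(pageFile(110)); // 00000110.txt
    }
}
```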

Specify the following parameters:

  • --corpusDir (-c): the corpus base directory (basedir in the example above).
  • --glob (-g): the glob pattern used to match book ids; the default is [0-9]*, i.e. all directories whose names start with a digit.
  • --targetDir (-t): the target directory in which the output files are stored (in binary CAS format).
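
To check which directories a --glob value would select before running the pipeline, the pattern can be tested with Java's built-in glob support; a small sketch under the assumption that the same glob syntax applies (the class name is illustrative):

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {
    // The default --glob pattern [0-9]* matches directory names that
    // start with a digit (the '*' then matches any remainder).
    static boolean isBookDir(String name) {
        PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:[0-9]*");
        return m.matches(Paths.get(name));
    }

    public static void main(String[] args) {
        System.out.println(isBookDir("00000123"));  // true
        System.out.println(isBookDir("README.md")); // false
    }
}
```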

Topic Modelling

To estimate a topic model on (preprocessed) WdK documents, use the class EstimateTopicModel. It estimates a model from a collection of input documents stored in binary CAS files. The executable JAR takes a number of arguments:

usage: java -jar EstimateTopicModel.jar
 -a,--alpha <arg>        Alpha value (symmetric for all topics) for
                         Dirichlet process.
 -b,--beta <arg>         Beta value for Dirichlet process.
 -c,--CPUs <arg>         Number of CPUs/threads to use (default: 1).
 -d,--divType <arg>      Allowed div type (multiple allowed, e.g.
                         'Chapter').
 -f,--pos <arg>          POS tags to use.
 -i,--iterations <arg>   Number of iterations during model generation
                         (default: 500).
 -l,--lemma              Use lemma instead of original word form where
                         available.
 -m,--modelFile <arg>    Target file in which to store model.
 -n,--minLength <arg>    Minimum token (or other type) length
                         (default: 3).
 -p,--pattern <arg>      File pattern(s) for binary CAS files
                         (default: [+]*/ocr/*/*.bin).
 -r,--regex <arg>        Regular expression for filtering: if given, only
                         retain tokens that match this regex, e.g.
                         '[A-Z].{2,}' for tokens that start with a capital
                         letter and have at least a length of three.
 -s,--sourceDir <arg>    Base directory containing binary CAS'
                         (default: '.').
 -S,--sentences          Use sentences instead of the whole document for
                         model estimation.
 -t,--topics <arg>       Number of topics to generate (default: 50).
 -w,--stopWords <arg>    Stopwords file; if none specified, don't filter
                         stopwords.
 -W,--wordsFile <arg>    Use only words listed in this file.
 -y,--typeName <arg>     Type to use for model generation, e.g.
                         NamedEntity (default: Token)
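
The -r/--regex filter can be checked against sample tokens before starting a long estimation run. A minimal sketch of the documented semantics (only tokens matching the expression are retained; the class name and token list are illustrative):

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TokenFilterDemo {
    // Keep only tokens that fully match the given regex,
    // mirroring the documented behavior of -r/--regex.
    static List<String> filterTokens(List<String> tokens, String regex) {
        Pattern p = Pattern.compile(regex);
        return tokens.stream()
                .filter(t -> p.matcher(t).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // '[A-Z].{2,}': starts with a capital letter, total length >= 3.
        List<String> kept = filterTokens(
                List.of("Berlin", "und", "Ab", "Kirche"), "[A-Z].{2,}");
        System.out.println(kept); // [Berlin, Kirche]
    }
}
```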

The directory wdk-core-topicmodeling/src/main/java/de/tudarmstadt/ukp/experiments/wdk/topicmodeling/estimation/ contains a number of additional topic model estimation pipelines for more specific cases:

  • EstimateById: Reads a list of document IDs and generates a model based only on files matching those IDs. Use IdSearcher to generate a list of IDs based on a query.
  • EstimateByMetadataStringfield: Generates a model after selecting documents based on a value for a MetadataStringfield annotation in the CAS.
  • EstimateNEs: Estimate a model based on the named entity annotations in the documents only.
  • EstimateByCsvMetadata (deprecated)

Commit to Solr Index

Creating a complete Solr index comprises three steps:

  1. Index the texts and incorporated metadata with IndexWdkCollection.
  2. Index the topic-document distributions based on a previously estimated model: IndexMalletLDA uses an existing model to infer the topic distributions for all documents and adds them to the index.
  3. Add the current metadata export in a CSV file with IndexCsvMetadata. It takes a CSV file as input, where the header line defines the field names, and adds these fields to the index.

In all steps, all fields must be defined in the index's schema.xml.
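
The header-driven CSV layout expected by IndexCsvMetadata can be sketched as follows. This is a hedged illustration only: real CSV files need a proper parser that handles quoting and escaping, and the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CsvMetadataDemo {
    // Turn a header-led CSV into one field-name -> value map per document,
    // as IndexCsvMetadata's input format implies. Naive split on commas;
    // illustration only, not robust CSV parsing.
    static List<Map<String, String>> parse(List<String> lines) {
        String[] fields = lines.get(0).split(",");
        List<Map<String, String>> docs = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] values = line.split(",");
            Map<String, String> doc = new LinkedHashMap<>();
            for (int i = 0; i < fields.length; i++) {
                // Missing trailing values become empty strings.
                doc.put(fields[i], i < values.length ? values[i] : "");
            }
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = parse(List.of(
                "bookid,title,year",
                "00000123,Stadtgeschichte,1878"));
        System.out.println(docs.get(0).get("title")); // Stadtgeschichte
    }
}
```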

Update Metadata

IndexCsvMetadata can be called again at a later stage to update an existing index with newly updated metadata from a CSV file. Fields that exist in that metadata file are overwritten (possibly with empty values); all other document fields remain untouched.