Retraining CluProcessor

Mihai Surdeanu edited this page Jun 4, 2020 · 13 revisions

Note: these components are no longer supported as of processors v8. Please see Metal instead.

CluProcessor is the lab's internal suite of NLP tools, which includes only tools licensed under the Apache license. All these components (with the exception of the tokenizer and lemmatizer) are largely language and domain independent, and can be trained on other domains relatively quickly. Please follow these instructions to retrain the components in CluProcessor.

Retraining the part-of-speech (POS) tagger

Use the following command line to retrain the POS tagger:

sbt 'run-main org.clulab.processors.clu.sequences.PartOfSpeechTagger -train <YOUR_TRAIN_FILE> -model <FILE_WHERE_YOU_WANT_TO_SAVE_YOUR_MODEL> -test <YOUR_TEST_FILE> -conllu -bi 10'

The model file is simply a text file, where the classifier saves the statistics it learned from the data. The training file and the testing file share the same format: one word per line, where each line contains the word itself and its POS tag, separated by a tab. There is an empty line between sentences. For example, the beginning of the training file from the Penn Treebank looks like this:

Pierre	NNP
Vinken	NNP
,	,
61	CD
years	NNS
old	JJ
,	,
will	MD
join	VB
the	DT
board	NN
as	IN
a	DT
nonexecutive	JJ
director	NN
Nov.	NNP
29	CD
.	.

Mr.	NNP
Vinken	NNP
is	VBZ
chairman	NN
of	IN
Elsevier	NNP
N.V.	NNP
,	,
the	DT
Dutch	NNP
publishing	VBG
group	NN
.	.
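
Before launching a training run, it can help to check that a file really follows this layout. The snippet below is a minimal sketch in Python, not part of processors: the function name `read_tagged_sentences` is hypothetical, and it checks only the two-column, tab-separated layout described above.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # one (word, POS tag) pair per token


def read_tagged_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Group 'word<TAB>tag' lines into sentences split on empty lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            # An empty line closes the current sentence (if there is one).
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != 2 or not fields[0] or not fields[1]:
            raise ValueError(
                f"line {line_number}: expected 'word<TAB>tag', got {line!r}"
            )
        current.append((fields[0], fields[1]))
    if current:
        # The last sentence may not be followed by an empty line.
        sentences.append(current)
    return sentences
```

For example, `read_tagged_sentences(open("train.txt", encoding="utf-8"))` returns one list of (word, tag) pairs per sentence, or raises a `ValueError` naming the first malformed line. The file name here is only a placeholder.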

The features for the POS tagger are implemented in this file: main/src/main/scala/org/clulab/processors/clu/sequences/PartOfSpeechTagger.scala, in the method featureExtractor(). Most of these features are language independent. The only exception is FeatureExtractor.lemma(), which currently relies on the English lemmatizer in CluProcessor.

Once you have a POS model you are happy with, copy it to this directory: modelsmain/src/main/resources/org/clulab/processors/clu/. Then adjust the config file for CluProcessor to point to the new model file. Here is an example of a valid config file for CluProcessor: main/src/main/resources/cluprocessoropen.conf.

Retraining the maltparser models

First, download maltparser from here: http://www.maltparser.org/download.html

We are currently using version 1.9.0. If you change the version number, please copy the corresponding appdata/ directory from the malt distribution again to this location in processors: modelsmain/src/main/resources/appdata/.

When copying over a new appdata/ directory from a newer malt version, make sure to replace @version@ with the actual version number (e.g., 1.9.0). These two files must be edited for the @version@ change: appdata/options.xml and appdata/release.properties.
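
The @version@ substitution can also be scripted. The following is a minimal sketch in Python, assuming both files are UTF-8 text and sit directly under the appdata/ directory. The function name `set_malt_version` is hypothetical and not part of processors or maltparser.

```python
from pathlib import Path
from typing import Dict

PLACEHOLDER = "@version@"
# The two files named above that contain the placeholder.
FILES_TO_EDIT = ("options.xml", "release.properties")


def set_malt_version(appdata_dir: str, version: str) -> Dict[str, int]:
    """Replace @version@ in both files; return the replacement count per file."""
    counts: Dict[str, int] = {}
    for name in FILES_TO_EDIT:
        path = Path(appdata_dir) / name
        text = path.read_text(encoding="utf-8")
        counts[name] = text.count(PLACEHOLDER)
        path.write_text(text.replace(PLACEHOLDER, version), encoding="utf-8")
    return counts
```

For example, `set_malt_version("modelsmain/src/main/resources/appdata", "1.9.0")` edits both files in place. A count of 0 for either file suggests the placeholder was already replaced or the wrong directory was given.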

The parser in our CluProcessor consists of an ensemble of three malt models: arc-eager traversing the text left-to-right, arc-standard traversing the text left-to-right, and arc-standard traversing the text right-to-left. Below are instructions on how to train all these models:

Training the arc-eager, left-to-right model

Use the following commands to train the arc-eager forward (i.e., left-to-right) model:

mkdir -p output

java -jar maltparser-1.9.0/maltparser-1.9.0.jar -w output -c en-forward-nivre -i <COMBINED TRAIN FILE FROM WSJ AND GENIA> -a nivreeager -m learn -l liblinear -llo -s_4_-c_0.1 -d POSTAG -s Input[0] -T 1000 -F NivreEager.xml

where:

  • The combined train file is available on our servers at: corpora/processors/deps/combined/wsjtrain-wsjdev-geniatrain-geniadev.conllx
  • The NivreEager.xml is the one located under appdata/features/liblinear/conllx/NivreEager.xml

Training the arc-standard, left-to-right model

Use the following commands to train the arc-standard left-to-right model:

mkdir -p output

java -jar maltparser-1.9.0/maltparser-1.9.0.jar -w output -c en-forward-nivrestandard -i <COMBINED TRAIN FILE FROM WSJ AND GENIA> -a nivrestandard -m learn -l liblinear -llo -s_4_-c_0.1 -d POSTAG -s Input[0] -T 1000 -F NivreStandard.xml

Training the arc-standard, right-to-left model

Use the following commands to train the arc-standard right-to-left model:

mkdir -p output

java -jar maltparser-1.9.0/maltparser-1.9.0.jar -w output -c en-backward-nivrestandard -i <REVERSED TRAIN FILE> -a nivrestandard -m learn -l liblinear -llo -s_4_-c_0.1 -d POSTAG -s Input[0] -T 1000 -F NivreStandard.xml

where:

  • The reversed train file (i.e., the file with sentences written right-to-left) is available on our servers at: corpora/processors/deps/combined/wsjtrain-wsjdev-geniatrain-geniadev.conllx.righttoleft. For other languages or domains, you can generate such a reversed dataset from a file in the CoNLL-X format using the org.clulab.processors.clu.syntax.ReverseTreebank class.
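
To illustrate what writing a sentence right-to-left involves, here is a minimal sketch in Python. It is not the ReverseTreebank implementation, whose exact behavior is not described here. It assumes the standard CoNLL-X column order (ID first, HEAD seventh, PHEAD ninth), reverses the token order, and renumbers the token and head indices to match. The function name `reverse_sentence` is hypothetical.

```python
from typing import List

# 0-based positions of the index-bearing CoNLL-X columns (assumed standard layout).
ID, HEAD, PHEAD = 0, 6, 8


def reverse_sentence(rows: List[List[str]]) -> List[List[str]]:
    """Reverse one sentence and renumber ID/HEAD/PHEAD so arcs stay intact."""
    n = len(rows)

    def remap(index: str) -> str:
        # "0" is the artificial root and "_" is a missing value: keep both.
        if not index.isdigit() or index == "0":
            return index
        return str(n + 1 - int(index))

    reversed_rows = []
    for row in reversed(rows):
        new_row = list(row)  # copy so the input is left unchanged
        new_row[ID] = remap(row[ID])
        new_row[HEAD] = remap(row[HEAD])
        if len(row) > PHEAD:
            new_row[PHEAD] = remap(row[PHEAD])
        reversed_rows.append(new_row)
    return reversed_rows
```

Each token at position i in a sentence of n tokens moves to position n + 1 - i, and every head pointer is shifted the same way, so the dependency tree is preserved while the surface order is flipped.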

Testing a malt model

It is recommended that you test the accuracy of the models you train. We offer a class that does this with a single command line:

sbt 'run-main org.clulab.processors.clu.syntax.EvaluateMalt <MODEL FILE NAME> <TESTING TREEBANK IN CONLL-X FORMAT>'
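
For reference, dependency parser accuracy is commonly reported as the unlabeled attachment score (UAS, the fraction of tokens with the correct head) and the labeled attachment score (LAS, the fraction with both the correct head and the correct label). The snippet below is a minimal Python sketch of these two metrics. It is not the EvaluateMalt implementation, and the function name `attachment_scores` is hypothetical.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for token-aligned gold and predicted arcs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("no tokens to score")
    # UAS counts tokens whose head matches; LAS also requires the label to match.
    correct_heads = sum(1 for g, p in zip(gold, predicted) if g[0] == p[0])
    correct_arcs = sum(1 for g, p in zip(gold, predicted) if g == p)
    total = len(gold)
    return correct_heads / total, correct_arcs / total
```

Both lists hold one entry per token, in the same order, typically taken from the HEAD and DEPREL columns of the gold treebank and of the parser output.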

Once you have parser models you are happy with, copy them to this directory: modelsmain/src/main/resources/org/clulab/processors/clu/. Then adjust the config file for CluProcessor to point to the new model files. Here is an example of a valid config file for CluProcessor: main/src/main/resources/cluprocessoropen.conf.