This is a very prototypical API for creating text mining experiments using Java and Scala.
It is based on the idea that every text mining task essentially is an annotation of text with some entities (e.g. this is a word, a verb, a place, interesting, etc). Different text mining components typically create annotations of different types (e.g. token, type, part of speech, location, sentiment, etc). This can be modeled as an Annotation[T]
. The components or agents consume and produce these entities as input and output. This can be represented as Agent[I, O]
, e.g. a POS tagger consumes annotations of tokens and produces annotations of part of speech – it is an Agent[Token, POS]
. These agents interact in syntheses and analyses. An analysis is a linear interaction between agents, e.g. POS tagging involves analysing tokens: Analysis[Token]
. A synthesis combines two annotation types – e.g. to train a classifier: Synthesis[Token, Feature]
. By checking if the interaction types match the agent types, the compiler can check if experiment setups make any sense at all.
TM2 can be used as a regular API from Java, or in a very concise way from Scala, e.g. to define an analysis:
val a: Analysis[Token] = tokenizer -> gazetteer
In a similar way we can define syntheses:
val s: Synthesis[Token, Frequency] = (tokenizer, indexer) -> index
With |
, the interactions can be combined into experiments, which can be executed with !
:
val e: Experiment = corpus -> tokenizer | tokenizer -> ie !
We can combine analyses and syntheses, e.g. to add an evaluation:
val e = corpus -> tok | tok -> (ie, gold) | (ie, gold) -> eval !
Combining this Scala API with Scala’s general features, we can easily define and run experiments with variable configuration parameters, using this general form:
run { for { <configuration> } yield { <experiment> } }
A simple example with two different corpora and two different tokenizers (i.e. 2*2=4 runs) could look like this:
val ie = new Gazetteer
val gold = new GazetteerGoldStandard
val eval = new SimpleEvaluation
run {
for {
corpus <- List(new WorksOfShakespeare, new WorksOfGoethe)
tok <- List(new RuleBasedTokenizer, new TrainableTokenizer)
} yield {
corpus -> tok | tok -> (ie, gold) | (ie, gold) -> eval
}
}
With this syntax we can set up complex experiment series, e.g. training and evaluating a classifier:
val corpus = new Corpus
object trainData extends SensevalData("files/EnglishLS.train.xml")
object testData extends SensevalData("files/EnglishLS.test.xml")
object trainSense extends SensevalSense("files/EnglishLS.train.xml")
run {
/* Variable configuration parameters: */
for {
algo <- List(new NaiveBayes, new BayesNet, new SMO, new HyperPipes, new IBk);
feat <- List("3-gram", "7-gram", "word", "length");
grain <- List("fine", "mixed", "coarse");
context <- List(2, 4, 8, 16);
trainFeat = new TrainFeatures(feat, context)
testFeat = new TestFeatures(feat, context)
classifier = new SensevalClassifier(context, 2f, algo, "S0", "S1")
evaluation = new SensevalEval(grain)
}
/* Fixed agent interaction: */
yield {
/* Preprocessing: */
corpus -> (trainData, trainSense) |
/* Training: */
trainData -> (trainFeat, trainSense) |
(trainFeat , trainSense) -> classifier |
/* Classification: */
testData -> testFeat |
testFeat -> classifier |
/* Evaluation: */
classifier -> evaluation
}
}
This setup will run experiments with all permutations of the given configuration parameters (classifiers, features, context, etc.), i.e. here 5*4*3*4=240 runs. The definition above uses Scala’s type inference and omits explicit type declarations. They can be used optionally:
val corpus: Agent[String, String] = new Corpus
val trainData: Agent[String, Context] = new TrainData()
val testData: Agent[String, Context] = new TestData()
val trainSense: Agent[Context, Ambiguity] = new TrainSense()
run {
for {
/* Configurations: */
algo: weka.classifiers.Classifier <- List(new NaiveBayes, new BayesNet, new SMO, new HyperPipes, new IBk);
feat: String <- List("3-gram", "7-gram", "word", "length");
grain: String <- List("fine", "mixed", "coarse");
context: Int <- List(2, 4, 8, 16);
/* Agents: */
trainFeat: Agent[Context, FeatureVector] = new TrainFeatures(feat, context)
testFeat: Agent[Context, FeatureVector] = new TestFeatures(feat, context)
classifier = new SensevalClassifier(context, 2f, algo, "S0", "S1")
classifierAgent: Agent[FeatureVector, Sense] = classifier
classifierModel: Model[FeatureVector, Ambiguity] = classifier
evaluation: Agent[Sense, String] = new SensevalEval(grain)
/* Interactions: */
corpusData: Analysis[String] = corpus -> (trainData, testData)
corpusContext: Analysis[Context] = trainData -> (trainFeat, trainSense)
trainClassifier: Synthesis[FeatureVector, Ambiguity] = (trainFeat, trainSense) -> classifierModel
testContext: Analysis[Context] = testData -> testFeat
classify: Analysis[FeatureVector] = testFeat -> classifierAgent
evaluate: Analysis[Sense] = classifierAgent -> evaluation
} /* Workflow: */ yield corpusData | corpusContext | trainClassifier | testContext | classify | evaluate
}
TM will generate some documentation about the experiments and their setup, e.g. for the definitions above we get this overview:
For the complete run, an overview page with results is generated:
For a detailed description of the concepts and implementation of TM2 in German, check out this report: arXiv, PDF, TeX
Java 6, Ant (for building), GCC (to build Senseval evaluation scorer app)
- Copy
com.quui.tm2/src/tm2.properties.template
tocom.quui.tm2/src/tm2.properties
- Set the
project
property to the location of your localcom.quui.tm2
project - Set the
dot_home
property to the folder containing your localdot
binary file - Compile the
scorer2
app:cd com.quui.tm2.scala/files
;gcc -o scorer2 scorer2.c
- Build the code and run the tests:
cd com.quui.tm2.scala
;export ANT_OPTS=-Xmx1024m
;ant
- Test result reports are generated at
com.quui.tm2.scala/build/tests/scala/summary
- Batch documentation is generated at
com.quui.tm2/output/batch-result.html