TM2: Type-safe modeling in text mining

TM2 is a prototype API for setting up text mining experiments in Java and Scala.

Idea

TM2 is based on the idea that every text mining task is essentially an annotation of text with entities (e.g. this is a word, a verb, a place, interesting). Different text mining components typically create annotations of different types (e.g. token, type, part of speech, location, sentiment), which can be modeled as Annotation[T]. The components, or agents, consume and produce these annotations as input and output, which can be represented as Agent[I, O]: a POS tagger, for example, consumes token annotations and produces part-of-speech annotations, so it is an Agent[Token, POS]. Agents interact in analyses and syntheses. An analysis is a linear interaction between agents, e.g. POS tagging analyses tokens: Analysis[Token]. A synthesis combines two annotation types, e.g. to train a classifier: Synthesis[Token, Feature]. Because the interaction types must match the agent types, the compiler can check whether an experiment setup makes sense.
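The type model described above can be sketched in a few lines of Scala. This is a minimal illustration, not the TM2 source: the fields of Annotation, the process method, the Tokenizer and PosTagger objects and the toy tagging rule are all assumptions made for the example.

```scala
object TypeModelSketch {
  // An annotation of a text span with a value of type T (field names are assumed).
  final case class Annotation[T](start: Int, end: Int, value: T)

  // An agent consumes annotations of I and produces annotations of O.
  trait Agent[I, O] {
    def process(input: List[Annotation[I]]): List[Annotation[O]]
  }

  final case class Token(text: String)
  final case class POS(tag: String)

  // An analysis links a producer of T to a consumer of T.
  final case class Analysis[T](source: Agent[_, T], target: Agent[T, _])

  object Tokenizer extends Agent[String, Token] {
    // Splits each text annotation on whitespace and keeps the offsets.
    def process(input: List[Annotation[String]]): List[Annotation[Token]] =
      for {
        doc <- input
        m <- "\\S+".r.findAllMatchIn(doc.value).toList
      } yield Annotation(doc.start + m.start, doc.start + m.end, Token(m.matched))
  }

  object PosTagger extends Agent[Token, POS] {
    // Toy rule for illustration only: capitalised tokens are tagged as nouns.
    def process(input: List[Annotation[Token]]): List[Annotation[POS]] =
      input.map { a =>
        val tag = if (a.value.text.head.isUpper) "NOUN" else "OTHER"
        Annotation(a.start, a.end, POS(tag))
      }
  }

  // Type-checks: Tokenizer produces Token and PosTagger consumes Token.
  val tagging: Analysis[Token] = Analysis(Tokenizer, PosTagger)
  // Analysis(PosTagger, Tokenizer) is rejected by the compiler,
  // because PosTagger produces POS while Tokenizer consumes String.
}
```

In this sketch the mismatch is caught because Agent is invariant in both type parameters, so no single T fits both arguments of the reversed analysis.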

API

TM2 can be used as a regular API from Java, or in a very concise way from Scala, e.g. to define an analysis:

val a: Analysis[Token] = tokenizer -> gazetteer

In a similar way we can define syntheses:

val s: Synthesis[Token, Frequency] = (tokenizer, indexer) -> index
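To make the type constraint behind a synthesis concrete, here is a hedged sketch using a plain constructor instead of the tuple arrow syntax. The Model trait name is taken from the typed example later in this README; its shape, the name members and the three example objects are assumptions.

```scala
object SynthesisSketch {
  // Minimal stand-ins; the real TM2 traits are not shown in this README.
  trait Agent[I, O] { def name: String }
  trait Model[A, B] { def name: String }

  // A synthesis joins a producer of A and a producer of B in a model over (A, B).
  final case class Synthesis[A, B](first: Agent[_, A], second: Agent[_, B], model: Model[A, B])

  sealed trait Token
  sealed trait Frequency

  object Tokenizer extends Agent[String, Token] { val name = "tokenizer" }
  object Indexer extends Agent[Token, Frequency] { val name = "indexer" }
  object Index extends Model[Token, Frequency] { val name = "index" }

  // Compiles only because the produced types line up with the model's types.
  val s: Synthesis[Token, Frequency] = Synthesis(Tokenizer, Indexer, Index)
}
```

Swapping the two agents in the last line would not compile in this sketch, since the first agent would then produce Frequency where Token is expected.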

With |, the interactions can be combined into experiments, which can be executed with !:

val e: Experiment = corpus -> tokenizer | tokenizer -> ie !

We can combine analyses and syntheses, e.g. to add an evaluation:

val e = corpus -> tok | tok -> (ie, gold) | (ie, gold) -> eval !
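One way such operators could be defined in Scala is as ordinary methods with symbolic names. The sketch below is an assumption about the general technique, not TM2's implementation: Agent and Interaction are simplified stand-ins, the tuple form is left out, and `!` only returns the planned steps instead of running anything.

```scala
object DslSketch {
  // Phantom marker types for what flows between agents (illustrative only).
  sealed trait Token
  sealed trait Entity

  // Hypothetical stand-in for an agent; only the name and the types matter here.
  final case class Agent[I, O](name: String) {
    // A member named `->` is tried before Predef's tuple-building `->`.
    // It only applies if `next` consumes exactly what this agent produces.
    def ->[P](next: Agent[O, P]): Interaction =
      Interaction(List(s"$name -> ${next.name}"))
  }

  final case class Interaction(steps: List[String]) {
    // `|` chains interactions; it binds more loosely than `->`.
    def |(other: Interaction): Interaction = Interaction(steps ++ other.steps)
    // In this sketch `!` just returns the planned steps.
    def ! : List[String] = steps
  }

  val corpus = Agent[Unit, String]("corpus")
  val tokenizer = Agent[String, Token]("tokenizer")
  val ie = Agent[Token, Entity]("ie")

  // Parsed as (corpus -> tokenizer) | (tokenizer -> ie), then executed with `!`.
  val plan: List[String] = (corpus -> tokenizer | tokenizer -> ie).!
}
```

The sketch calls `!` with a dot because recent Scala versions require an explicit language import for postfix operator notation.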

Combining this Scala API with Scala’s general features, we can easily define and run experiments with variable configuration parameters, using this general form:

run { for { <configuration> } yield { <experiment> } }

A simple example with two different corpora and two different tokenizers (i.e. 2*2=4 runs) could look like this:

val ie = new Gazetteer
val gold = new GazetteerGoldStandard
val eval = new SimpleEvaluation
run {
  for {
    corpus <- List(new WorksOfShakespeare, new WorksOfGoethe)
    tok <- List(new RuleBasedTokenizer, new TrainableTokenizer)
  } yield {
    corpus -> tok | tok -> (ie, gold) | (ie, gold) -> eval
  }
}
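The run { for { ... } yield { ... } } form works because a Scala for-comprehension over several lists yields one element per combination. The following sketch shows this with plain strings in place of corpora and tokenizers; Experiment and the signature of run are hypothetical placeholders, not the TM2 definitions.

```scala
object GridSketch {
  // Hypothetical stand-in for an experiment definition: just a label.
  final case class Experiment(label: String)

  // Assumed shape of `run`: execute each experiment in the series and collect results.
  def run(series: List[Experiment]): List[String] =
    series.map(e => s"ran ${e.label}")

  val corpora = List("Shakespeare", "Goethe")
  val tokenizers = List("rule-based", "trainable")

  // One experiment per (corpus, tokenizer) pair.
  val viaFor: List[Experiment] =
    for {
      corpus <- corpora
      tok <- tokenizers
    } yield Experiment(s"$corpus / $tok")

  // The same series written with the flatMap/map calls the compiler generates.
  val viaFlatMap: List[Experiment] =
    corpora.flatMap(corpus => tokenizers.map(tok => Experiment(s"$corpus / $tok")))

  val results: List[String] = run(viaFor)
}
```

Both definitions produce the same four experiments, in the order of the outer generator first.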

With this syntax we can set up complex experiment series, e.g. training and evaluating a classifier:

val corpus = new Corpus
object trainData extends SensevalData("files/EnglishLS.train.xml")
object testData extends SensevalData("files/EnglishLS.test.xml")
object trainSense extends SensevalSense("files/EnglishLS.train.xml")
run {
  /* Variable configuration parameters: */
  for {
    algo <- List(new NaiveBayes, new BayesNet, new SMO, new HyperPipes, new IBk);
    feat <- List("3-gram", "7-gram", "word", "length");
    grain <- List("fine", "mixed", "coarse");
    context <- List(2, 4, 8, 16);
    trainFeat = new TrainFeatures(feat, context)
    testFeat = new TestFeatures(feat, context)
    classifier = new SensevalClassifier(context, 2f, algo, "S0", "S1")
    evaluation = new SensevalEval(grain)
  } 
  /* Fixed agent interaction: */
  yield {
    /* Preprocessing: */
    corpus -> (trainData, testData) |
    /* Training: */
    trainData -> (trainFeat, trainSense) |
    (trainFeat , trainSense) -> classifier |
    /* Classification: */
    testData -> testFeat |
    testFeat -> classifier |
    /* Evaluation: */
    classifier -> evaluation
  }
}

This setup runs experiments with all combinations of the given configuration parameters (classifiers, features, granularity, context size), i.e. here 5*4*3*4=240 runs. The definition above relies on Scala’s type inference and omits explicit type declarations, which can optionally be added:

val corpus: Agent[String, String] = new Corpus
val trainData: Agent[String, Context] = new TrainData()
val testData: Agent[String, Context] = new TestData()
val trainSense: Agent[Context, Ambiguity] = new TrainSense()
run {
  for {
    /* Configurations: */
    algo: weka.classifiers.Classifier <- List(new NaiveBayes, new BayesNet, new SMO, new HyperPipes, new IBk);
    feat: String <- List("3-gram", "7-gram", "word", "length");
    grain: String <- List("fine", "mixed", "coarse");
    context: Int <- List(2, 4, 8, 16);
    /* Agents: */
    trainFeat: Agent[Context, FeatureVector] = new TrainFeatures(feat, context)
    testFeat: Agent[Context, FeatureVector] = new TestFeatures(feat, context)
    classifier = new SensevalClassifier(context, 2f, algo, "S0", "S1")
    classifierAgent: Agent[FeatureVector, Sense] = classifier
    classifierModel: Model[FeatureVector, Ambiguity] = classifier
    evaluation: Agent[Sense, String] = new SensevalEval(grain)
    /* Interactions: */
    corpusData: Analysis[String] = corpus -> (trainData, testData)
    corpusContext: Analysis[Context] = trainData -> (trainFeat, trainSense)
    trainClassifier: Synthesis[FeatureVector, Ambiguity] = (trainFeat, trainSense) -> classifierModel
    testContext: Analysis[Context] = testData -> testFeat
    classify: Analysis[FeatureVector] = testFeat -> classifierAgent
    evaluate: Analysis[Sense] = classifierAgent -> evaluation
  } /* Workflow: */ yield corpusData | corpusContext | trainClassifier | testContext | classify | evaluate
}
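The number of runs in the experiment series above follows from the generators alone. As a small check, the sketch below rebuilds the configuration grid with strings in place of the Weka classifiers; Config and label are illustrative names that do not appear in TM2.

```scala
object ConfigCountSketch {
  final case class Config(algo: String, feat: String, grain: String, context: Int, label: String)

  val configs: List[Config] =
    for {
      algo <- List("NaiveBayes", "BayesNet", "SMO", "HyperPipes", "IBk")
      feat <- List("3-gram", "7-gram", "word", "length")
      grain <- List("fine", "mixed", "coarse")
      context <- List(2, 4, 8, 16)
      // A value definition: derived once per configuration, it adds no extra runs.
      label = s"$algo/$feat/$grain/$context"
    } yield Config(algo, feat, grain, context, label)
}
```

Only the lines with `<-` multiply the number of configurations. Lines with `=`, such as the agent definitions in the series above, are evaluated once for each combination.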

TM2 generates documentation about the experiments and their setup, including an overview of each experiment definition such as the ones above.

For a complete run, it also generates an overview page with the results.

For a detailed description of the concepts and implementation of TM2 in German, check out this report: arXiv, PDF, TeX

Prerequisites

Java 6, Ant (for building), GCC (to build Senseval evaluation scorer app)

Setup

Copy com.quui.tm2/src/tm2.properties.template to com.quui.tm2/src/tm2.properties
Set the project property to the location of your local com.quui.tm2 project
Set the dot_home property to the folder containing your local dot binary file
Compile the scorer2 app: cd com.quui.tm2.scala/files; gcc -o scorer2 scorer2.c
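As a quick sanity check of the configuration steps above, the following sketch parses properties text and reports which of the two keys are missing. It assumes that tm2.properties uses the standard Java properties format; the example paths are placeholders, and PropertiesSketch is not part of TM2.

```scala
import java.io.StringReader
import java.util.Properties

object PropertiesSketch {
  // Example content; both paths are placeholders, not values from the project.
  val example: String =
    """project=/home/me/workspace/com.quui.tm2
      |dot_home=/usr/bin
      |""".stripMargin

  // Parses properties text with the standard Java loader.
  def load(text: String): Properties = {
    val props = new Properties()
    props.load(new StringReader(text))
    props
  }

  // Returns the names of the required keys that are missing or empty.
  def missing(props: Properties): List[String] =
    List("project", "dot_home").filter { key =>
      Option(props.getProperty(key)).forall(_.trim.isEmpty)
    }
}
```

Calling PropertiesSketch.missing on the loaded example returns an empty list, while a file without a dot_home entry would be reported with that key name.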

Build

Build the code and run the tests: cd com.quui.tm2.scala; export ANT_OPTS=-Xmx1024m; ant
Test result reports are generated at com.quui.tm2.scala/build/tests/scala/summary
Batch documentation is generated at com.quui.tm2/output/batch-result.html
