TM2: Type-safe modeling in text mining

TM2 is a prototypical API for creating text mining experiments in Java and Scala.

Idea

TM2 is based on the idea that every text mining task is essentially the annotation of text with some entities (e.g. this is a word, a verb, a place, interesting, etc.). Different text mining components typically create annotations of different types (e.g. token, type, part of speech, location, sentiment, etc.), which can be modeled as an Annotation[T]. The components, or agents, consume and produce these entities as input and output, which can be represented as Agent[I, O]: a POS tagger, for example, consumes annotations of tokens and produces annotations of parts of speech, so it is an Agent[Token, POS]. These agents interact in analyses and syntheses. An analysis is a linear interaction between agents, e.g. POS tagging involves analysing tokens: Analysis[Token]. A synthesis combines two annotation types, e.g. to train a classifier: Synthesis[Token, Feature]. By checking that the interaction types match the agent types, the compiler can verify whether an experiment setup makes sense at all.
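
To make these type relationships concrete, here is a minimal, hypothetical sketch of what such abstractions could look like in Scala. The names Annotation, Agent, Token and POS follow the description above, but the signatures are illustrative assumptions, not the actual TM2 API:

/* Hypothetical sketch, not the actual TM2 signatures: */
case class Annotation[T](value: T, start: Int, end: Int)

/* An agent consumes annotations of type I and produces annotations of type O: */
trait Agent[I, O] {
  def process(input: List[Annotation[I]]): List[Annotation[O]]
}

/* Illustrative annotation types; a POS tagger is then an Agent[Token, POS]: */
class Token
class POS
class PosTagger extends Agent[Token, POS] {
  def process(input: List[Annotation[Token]]): List[Annotation[POS]] =
    input.map(a => Annotation(new POS, a.start, a.end))
}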

API

TM2 can be used as a regular API from Java, or in a very concise way from Scala, e.g. to define an analysis:

val a: Analysis[Token] = tokenizer -> gazetteer
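
Because the agent types flow into the interaction type, a mismatched setup is rejected by the compiler rather than failing at run time. As an illustration (assuming a hypothetical splitter of type Agent[String, Sentence], while the gazetteer expects Token annotations):

val ok: Analysis[Token] = tokenizer -> gazetteer
// val bad: Analysis[Token] = splitter -> gazetteer  // does not compile: splitter produces Sentence, not Token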

In a similar way we can define syntheses:

val s: Synthesis[Token, Frequency] = (tokenizer, indexer) -> index

With |, the interactions can be combined into experiments, which can be executed with !:

val e: Experiment = corpus -> tokenizer | tokenizer -> ie !

We can combine analyses and syntheses, e.g. to add an evaluation:

val e = corpus -> tok | tok -> (ie, gold) | (ie, gold) -> eval !

Combining this Scala API with Scala’s general features, we can easily define and run experiments with variable configuration parameters, using this general form:

run { for { <configuration> } yield { <experiment> } }

A simple example with two different corpora and two different tokenizers (i.e. 2*2=4 runs) could look like this:

val ie = new Gazetteer
val gold = new GazetteerGoldStandard
val eval = new SimpleEvaluation
run {
  for {
    corpus <- List(new WorksOfShakespeare, new WorksOfGoethe)
    tok <- List(new RuleBasedTokenizer, new TrainableTokenizer)
  } yield {
    corpus -> tok | tok -> (ie, gold) | (ie, gold) -> eval
  }
}

With this syntax we can set up complex experiment series, e.g. training and evaluating a classifier:

val corpus = new Corpus
object trainData extends SensevalData("files/EnglishLS.train.xml")
object testData extends SensevalData("files/EnglishLS.test.xml")
object trainSense extends SensevalSense("files/EnglishLS.train.xml")
run {
  /* Variable configuration parameters: */
  for {
    algo <- List(new NaiveBayes, new BayesNet, new SMO, new HyperPipes, new IBk);
    feat <- List("3-gram", "7-gram", "word", "length");
    grain <- List("fine", "mixed", "coarse");
    context <- List(2, 4, 8, 16);
    trainFeat = new TrainFeatures(feat, context)
    testFeat = new TestFeatures(feat, context)
    classifier = new SensevalClassifier(context, 2f, algo, "S0", "S1")
    evaluation = new SensevalEval(grain)
  } 
  /* Fixed agent interaction: */
  yield {
    /* Preprocessing: */
    corpus -> (trainData, trainSense) |
    /* Training: */
    trainData -> (trainFeat, trainSense) |
    (trainFeat , trainSense) -> classifier |
    /* Classification: */
    testData -> testFeat |
    testFeat -> classifier |
    /* Evaluation: */
    classifier -> evaluation
  }
}

This setup will run experiments for all combinations of the given configuration parameters (classifiers, features, context, etc.), i.e. here 5*4*3*4 = 240 runs. The definition above relies on Scala's type inference and omits explicit type declarations; they can optionally be added:

val corpus: Agent[String, String] = new Corpus
val trainData: Agent[String, Context] = new TrainData()
val testData: Agent[String, Context] = new TestData()
val trainSense: Agent[Context, Ambiguity] = new TrainSense()
run {
  for {
    /* Configurations: */
    algo: weka.classifiers.Classifier <- List(new NaiveBayes, new BayesNet, new SMO, new HyperPipes, new IBk);
    feat: String <- List("3-gram", "7-gram", "word", "length");
    grain: String <- List("fine", "mixed", "coarse");
    context: Int <- List(2, 4, 8, 16);
    /* Agents: */
    trainFeat: Agent[Context, FeatureVector] = new TrainFeatures(feat, context)
    testFeat: Agent[Context, FeatureVector] = new TestFeatures(feat, context)
    classifier = new SensevalClassifier(context, 2f, algo, "S0", "S1")
    classifierAgent: Agent[FeatureVector, Sense] = classifier
    classifierModel: Model[FeatureVector, Ambiguity] = classifier
    evaluation: Agent[Sense, String] = new SensevalEval(grain)
    /* Interactions: */
    corpusData: Analysis[String] = corpus -> (trainData, testData)
    corpusContext: Analysis[Context] = trainData -> (trainFeat, trainSense)
    trainClassifier: Synthesis[FeatureVector, Ambiguity] = (trainFeat, trainSense) -> classifierModel
    testContext: Analysis[Context] = testData -> testFeat
    classify: Analysis[FeatureVector] = testFeat -> classifierAgent
    evaluate: Analysis[Sense] = classifierAgent -> evaluation
  } /* Workflow: */ yield corpusData | corpusContext | trainClassifier | testContext | classify | evaluate
}

TM2 generates documentation about the experiments and their setup: an overview for the definitions above, and, for the complete run, an overview page with the results.

For a detailed description of the concepts and implementation of TM2 (in German), see this report: arXiv, PDF, TeX

Prerequisites

Java 6, Ant (for building), GCC (to build the Senseval evaluation scorer app)

Setup

  • Copy com.quui.tm2/src/tm2.properties.template to com.quui.tm2/src/tm2.properties
  • Set the project property to the location of your local com.quui.tm2 project
  • Set the dot_home property to the folder containing your local dot binary file
  • Compile the scorer2 app: cd com.quui.tm2.scala/files; gcc -o scorer2 scorer2.c

Build

  • Build the code and run the tests: cd com.quui.tm2.scala; export ANT_OPTS=-Xmx1024m; ant
  • Test result reports are generated at com.quui.tm2.scala/build/tests/scala/summary
  • Batch documentation is generated at com.quui.tm2/output/batch-result.html
