Quick Notes on Starting with Processors Classifiers

Mihai Surdeanu edited this page Jun 4, 2020 · 2 revisions

Getting Started

Classification with processors involves three major steps:

  1. create a dataset
  2. train a classifier with the dataset
  3. use the classifier to classify new data

Details:

Step 1: Create a Dataset

Note: this is the data preprocessing stage. Take your existing data and convert it to the format that processors understands.

A dataset is composed of datums. For example, let's say we want to classify spam and we have the following two documents:

  • doc1: "hello world"
  • doc2: "buy viagra"

The first one is not spam; the second one probably is. First, you need to create a Datum for each document. We will use BVFDatum [1], which represents boolean features. You can create your datums like this:

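import org.clulab.learning.{BVFDatum, BVFDataset, PerceptronClassifier}
// each datum takes the label first, then the features present in the document
val datum1 = new BVFDatum[String, String]("NotSpam", Array("hello", "world"))
val datum2 = new BVFDatum[String, String]("Spam", Array("buy", "viagra"))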

Note that the first argument to BVFDatum is the label and the second is the sequence of features.

Next, you want to build a dataset. We need a BVFDataset [2] since we are using BVFDatum. We also want to populate it with our datums.

val dataset = new BVFDataset[String, String]
dataset += datum1
dataset += datum2

Now that we have a dataset, let's make a classifier and train it. We will use a PerceptronClassifier [3].

val perceptron = new PerceptronClassifier[String, String]
perceptron.train(dataset)

This classifier can now be used to predict labels for new datums.

val datum3 = new BVFDatum[String, String]("Spam", Array("buy", "something", "else"))
val label = perceptron.classOf(datum3)

And hopefully the perceptron will predict that this new document is spam.
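To give some intuition for what training and prediction do with boolean features, here is a minimal sketch of a textbook multiclass perceptron in plain Scala. It is not the processors implementation: the object name PerceptronSketch, its methods, and the fixed number of epochs are all assumptions made for illustration, and it has no dependency on the library.

```scala
import scala.collection.mutable

// Illustrative only: a textbook multiclass perceptron over boolean features.
// This is NOT the code behind org.clulab.learning.PerceptronClassifier.
object PerceptronSketch {
  // One weight per (label, feature) pair; missing entries count as 0.
  type Weights = mutable.Map[(String, String), Double]

  // Score of a label: sum of the weights of the features that are present.
  def score(w: Weights, label: String, features: Seq[String]): Double =
    features.map(f => w.getOrElse((label, f), 0.0)).sum

  // Predict the highest-scoring label (ties go to the label seen first).
  def predict(w: Weights, labels: Seq[String], features: Seq[String]): String =
    labels.maxBy(l => score(w, l, features))

  // On every mistake, reward the gold label and penalize the wrong guess.
  def train(data: Seq[(String, Seq[String])], epochs: Int = 5): (Weights, Seq[String]) = {
    val labels = data.map(_._1).distinct
    val w: Weights = mutable.Map.empty
    for (_ <- 1 to epochs; (gold, features) <- data) {
      val guess = predict(w, labels, features)
      if (guess != gold) {
        for (f <- features) {
          w((gold, f)) = w.getOrElse((gold, f), 0.0) + 1.0
          w((guess, f)) = w.getOrElse((guess, f), 0.0) - 1.0
        }
      }
    }
    (w, labels)
  }
}
```

Trained on the two spam documents above, this sketch gives the feature "buy" a positive weight for "Spam", which is why a new document containing "buy" would be scored as spam.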

Other classifiers and other types of datums are available. Classifiers also expose their hyperparameters as constructor arguments.

Note: sample code for using a perceptron classifier can be found in [4].

For RVF Data

For real-valued features, first make a Counter [5] object and fill it with your features:

import org.clulab.struct.Counter
val counter = new Counter[String]
counter.setCount("feature1", 5.3)
counter.setCount("feature2", 8.5)
...

Then build your RVFDatum [6] from the counter:

val datum = new RVFDatum[String, String]("LABEL", counter)

Then build your RVFDataset [7]:

val dataset = new RVFDataset[String, String]
dataset += datum
// add all your datums to the dataset

Next, you probably want to scale your features to be between -1 and 1 [8]:

val scaleRanges = Datasets.svmScaleDataset(dataset, lower = -1, upper = 1)

Note that the dataset is scaled in place. scaleRanges contains the scale ranges of each feature, but you can probably ignore it.
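To make the scaling step concrete, the sketch below shows a linear min-max rescaling of one feature to a target range, which is the kind of transformation svm-scale-style tools apply. It is plain Scala and does not call processors; ScalingSketch, scale, and scaleFeature are hypothetical names, and the handling of constant features is an arbitrary choice made by this sketch, not a statement about what Datasets.svmScaleDataset does.

```scala
// Illustrative only: linear min-max scaling of a single feature.
object ScalingSketch {
  // Linearly maps value from [min, max] to [lower, upper].
  def scale(value: Double, min: Double, max: Double,
            lower: Double = -1.0, upper: Double = 1.0): Double =
    if (max == min) (lower + upper) / 2 // constant feature: choice made by this sketch
    else lower + (upper - lower) * (value - min) / (max - min)

  // Scales all observed values of one feature using their own min and max.
  // Assumes values is non-empty.
  def scaleFeature(values: Seq[Double],
                   lower: Double = -1.0, upper: Double = 1.0): Seq[Double] = {
    val (min, max) = (values.min, values.max)
    values.map(v => scale(v, min, max, lower, upper))
  }
}
```

For example, word frequencies of 0, 5, and 10 become -1, 0, and 1 with the default bounds. Passing lower = 0.0 to the sketch maps them to 0, 0.5, and 1 instead, so the choice of range is a parameter rather than a property of RVF data.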

Then, use the dataset to train a classifier as in the previous section.

Finally, use your classifier to classify new data.

Example implementation of RVF:

Here’s another “real world” example of using RVFDatum with a classifier: https://github.com/clulab/reach-assembly/blob/master/src/main/scala/org/clulab/assembly/relations/classifier/AssemblyRelationClassifier.scala

Building features: https://github.com/clulab/reach-assembly/blob/master/src/main/scala/org/clulab/assembly/relations/classifier/FeatureExtractor.scala

This method is probably relevant to you, as it demonstrates how to instantiate different classifiers: https://github.com/clulab/reach-assembly/blob/master/src/main/scala/org/clulab/assembly/relations/classifier/AssemblyRelationClassifier.scala#L43-L57

I defined a convenience method with the following signature for generating an RVF-based representation:

def mkRVFDatum(e1: Mention, e2: Mention, label: String): RVFDatum[String, String]

In your case, you'll probably be taking in a processors Document instead of an odin Mention, so the signature would be something like:

def mkRVFDatum(doc: Document, label: String): RVFDatum[String, String]

Alternatively, you could take in text that you first turn into a Document.
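The body of such a method mostly turns text into feature counts. As a minimal sketch of that part only, the plain-Scala helper below counts lowercased tokens. FeatureSketch and wordCounts are hypothetical names, the choice of lowercased word counts as features is an assumption, and how the tokens are obtained from a Document is left out.

```scala
// Illustrative only: turn a token sequence into real-valued count features.
object FeatureSketch {
  // Counts how often each lowercased token occurs; each count is one feature value.
  def wordCounts(tokens: Seq[String]): Map[String, Double] =
    tokens
      .map(_.toLowerCase)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size.toDouble }
}
```

Each (feature, value) pair produced this way could then be copied into a Counter with setCount, as in the RVF example above, before building the RVFDatum.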

FAQ:

Qn) What is a datum?

Ans: A way of defining data: each datum is a single example, made of a label and its features.

Qn) What is a BVF datum? What is the difference between an RVF datum and a BVF datum?

Ans: RVF stands for real-valued features and BVF for binary-valued features. This means that your feature vectors can take binary values or real values, depending on the usage.

Qn) What is a BVFDataset? A collection of BVFDatum?

Ans: Any dataset is a collection of the corresponding datums.

Qn) What if I want to use a classifier other than the perceptron?

Ans: Instantiate the corresponding classifier class instead.

Qn) Why the perceptron? Why not an SVM or something else?

Ans: Start with the perceptron. It's just mentioned as an example. If you can run the classification with other classifiers and get better F1 scores, go for it.

Qn) "Next, you probably want to scale your features to be between -1 and 1 [8]": what do you mean by scaling? Is it like normalization? Why -1? Isn't it better to scale between 0 and 1? Or is it some quirk of RVF data? In my data, my only feature is the frequency of word occurrences. How do I scale it?

Qn) This is a comment from the BVF data code: "Important note: to encode feature values > 1, simply store the same feature multiple times (equal to feature value)!"

So I have only frequency. Can that be considered a feature value greater than 1?
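The quoted comment suggests that an integer frequency can be encoded in a BVF datum by repeating the feature. A minimal sketch of that expansion in plain Scala follows; RepeatSketch and repeatFeatures are hypothetical names, and sorting by feature name is done only to make the output order predictable.

```scala
// Illustrative only: expand integer counts into a repeated-feature sequence.
object RepeatSketch {
  // Each feature appears as many times as its count, e.g. "viagra" -> 3 yields three copies.
  def repeatFeatures(counts: Map[String, Int]): Seq[String] =
    counts.toSeq
      .sortBy(_._1) // sort by feature name for a stable order
      .flatMap { case (feature, n) => Seq.fill(n)(feature) }
}
```

The resulting sequence could then be passed as the feature argument when constructing a BVFDatum, in the same position as Array("buy", "viagra") in the spam example.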

Qn) I have gradable adjectives picked from agiga, and I need to label them against the COBUILD corpus. In the example shown above, we write the labels and features ("Spam", "viagra") manually. How do I get code to do it?

def loadFrom[L, F](fileName:String):PerceptronClassifier[L, F] = {

Can I use the above line from this code: https://github.com/clulab/processors/blob/8df48f1f8f21bc53cd07e3d321405590d980ae7f/main/src/main/scala/org/clulab/learning/PerceptronClassifier.scala#L21

Qn) What is Counter in RVF data?

val counter = new Counter[String]
counter.setCount("feature1", 5.3)
counter.setCount("feature2", 8.5)

What if I have only one feature?

Qn) I am getting the error below. What does it mean?

Error: /ar:/usr/local/idea-IC-162.2032.8/lib/idea_rt.jar agiga.mainParser
Connected to the target VM, address: '127.0.0.1:57251', transport: 'socket'
Exception in thread "main" java.lang.ExceptionInInitializerError
    at agiga.mainParser.main(mainFile.scala)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
    at org.clulab.learning.PerceptronClassifier.org$clulab$learning$PerceptronClassifier$$update(PerceptronClassifier.scala:137)

Ans: You're only giving a single datum during training, so the classifier has no notion of the other (negative) class (e.g., "NONE"/"NOT GRADABLE").
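One way to catch this situation before training is to check that the training examples cover at least two distinct labels. The sketch below does this in plain Scala over (label, features) pairs; LabelCheck and its methods are hypothetical helpers, not part of processors.

```scala
// Illustrative only: sanity check on training data before calling train.
object LabelCheck {
  // Distinct labels that appear in the training examples.
  def distinctLabels(examples: Seq[(String, Seq[String])]): Set[String] =
    examples.map(_._1).toSet

  // A classifier needs at least two classes to learn a decision.
  def hasEnoughLabels(examples: Seq[(String, Seq[String])]): Boolean =
    distinctLabels(examples).size >= 2
}
```

If hasEnoughLabels returns false, adding examples of the negative class (for instance datums labeled "NOT GRADABLE") is the fix suggested by the answer above.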

References:

[1] https://github.com/clulab/processors/blob/8df48f1f8f21bc53cd07e3d321405590d980ae7f/main/src/main/scala/org/clulab/learning/Datum.scala#L42

[2] https://github.com/clulab/processors/blob/8df48f1f8f21bc53cd07e3d321405590d980ae7f/main/src/main/scala/org/clulab/learning/Dataset.scala#L59

[3] https://github.com/clulab/processors/blob/8df48f1f8f21bc53cd07e3d321405590d980ae7f/main/src/main/scala/org/clulab/learning/PerceptronClassifier.scala#L21

[4] https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/learning/LearningExample.scala

[5] https://github.com/clulab/processors/blob/8df48f1f8f21bc53cd07e3d321405590d980ae7f/main/src/main/scala/org/clulab/struct/Counter.scala#L16

[6] https://github.com/clulab/processors/blob/8df48f1f8f21bc53cd07e3d321405590d980ae7f/main/src/main/scala/org/clulab/learning/Datum.scala#L71

[7] https://github.com/clulab/processors/blob/8df48f1f8f21bc53cd07e3d321405590d980ae7f/main/src/main/scala/org/clulab/learning/Dataset.scala#L221

[8] https://github.com/clulab/processors/blob/8df48f1f8f21bc53cd07e3d321405590d980ae7f/main/src/main/scala/org/clulab/learning/Datasets.scala#L60