<a href="https://colab.research.google.com/github/dmcguire81/metapy/blob/master/tutorials/4-classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%%capture
# NOTE: this assumes you've uploaded a Python 3.7 build from our fork to Drive
# TODO: replace this with a stock install when it's published somewhere
%pip install /content/drive/MyDrive/metapy-0.2.13-cp37-cp37m-manylinux_2_24_x86_64.whl

First, let's import the Python bindings, as usual.

In [3]:
import metapy

Now, let's download a list of stopwords and a small dataset to begin playing around with classifiers in MeTA.

In [4]:
%%capture
!wget -N https://raw.githubusercontent.com/meta-toolkit/meta/master/data/lemur-stopwords.txt

In [5]:
%%capture
!wget -N https://meta-toolkit.org/data/2016-01-26/ceeaus.tar.gz
!tar xvf ceeaus.tar.gz

Now, let's index this dataset. Since we are doing classification experiments, we will most likely want a `ForwardIndex`, which maps document ids to their feature vector representations.

In [6]:
%%capture
!wget -N https://raw.githubusercontent.com/dmcguire81/metapy/master/tutorials/ceeaus-config.toml

In [7]:
fidx = metapy.index.make_forward_index('ceeaus-config.toml')

Note that the feature set used for classification depends on your settings in the configuration file _at the time of indexing_. If you want to play with different feature sets, remember to change your `analyzer` pipeline in the configuration file, and also to **reindex** your documents!

Here, we've just chosen simple unigram words. This is actually a surprisingly good baseline feature set for many text classification problems.
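To make "unigram words" concrete, here is a minimal pure-Python sketch of turning a piece of text into word counts. It does not use MeTA's analyzer pipeline; the `unigram_features` helper and the tiny stopword set are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical, tiny stopword set; the tutorial itself uses lemur-stopwords.txt.
STOPWORDS = {"the", "a", "is", "of", "to"}

def unigram_features(text, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, drop stopwords, and count each word."""
    tokens = [tok.strip(".,!?;:").lower() for tok in text.split()]
    return Counter(tok for tok in tokens if tok and tok not in stopwords)

features = unigram_features("The cat chased the other cat.")
print(features)
```

Each distinct word becomes one feature, and its count becomes the feature's weight. Word order is discarded, which is why this representation is often called a bag of words.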

Now that we have a `ForwardIndex` on disk, we need to load the documents we want to start playing with into memory. Since this is a small enough dataset, let's load the whole thing into memory at once.

We need to decide what kind of dataset we're using. MeTA has classes for binary classification (`BinaryDataset`) and multi-class classification (`MulticlassDataset`), which you should choose from depending on the kind of classification problem you're dealing with. Let's see how many labels we have in our corpus.

In [8]:
fidx.num_labels()

3

Since this is more than 2, we likely want a `MulticlassDataset` so we can learn a classifier that can predict which of these three labels a document should have. (But we might be interested in only determining one particular class from the rest, in which case we might actually want a `BinaryDataset`.)

For now, let's focus on the multi-class case, as that likely makes the most sense for this kind of data. Let's load our documents.

In [9]:
dset = metapy.classify.MulticlassDataset(fidx)
len(dset)

1008

We have 1008 documents, split across three labels. What are our labels?

In [10]:
set([dset.label(instance) for instance in dset])

{'chinese', 'english', 'japanese'}

This dataset is a small collection of essays written by a bunch of students with different first languages. Our goal will be to try to identify whether an essay was written by a native-Chinese speaker, a native-English speaker, or a native-Japanese speaker.

Because these in-memory datasets can be quite large, it's best to avoid making unnecessary copies of them, such as building a new shuffled list that contains the same documents. In most cases, you'll be operating with a `DatasetView` (either `MulticlassDatasetView` or `BinaryDatasetView`) so that you can do things like shuffle or rotate the contents of a dataset without having to actually modify it. Doing so is pretty easy: you can use Python's slicing API, or you can just construct one directly.

In [11]:
view = dset[0:len(dset)]  # the slice end is exclusive, so len(dset) covers every document
# or
view = metapy.classify.MulticlassDatasetView(dset)

Now we can, for example, shuffle this view without changing the underlying dataset.

In [12]:
view.shuffle()
print("{} vs {}".format(view[0].id, dset[0].id))

73 vs 0


The view has been shuffled and now has documents in random order (useful in many cases to make sure that you don't have clumps of the same-labeled documents together, or to just permute the documents in a stochastic learning algorithm), but the underlying dataset is still sorted by id.
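One way to picture a view is as a list of indices into the underlying data. The `IndexView` class below is a hypothetical pure-Python stand-in written for this explanation, not MeTA's `MulticlassDatasetView`; it only sketches why shuffling a view leaves the dataset untouched.

```python
import random

class IndexView:
    """Hypothetical stand-in for a dataset view: stores indices, not copies."""

    def __init__(self, data, indices=None):
        self.data = data
        self.indices = list(range(len(data))) if indices is None else list(indices)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Slicing a view yields another view over the same underlying data.
            return IndexView(self.data, self.indices[key])
        return self.data[self.indices[key]]

    def shuffle(self, seed=None):
        # Only the index list is permuted; self.data is never touched.
        random.Random(seed).shuffle(self.indices)

docs = ["doc-%d" % i for i in range(10)]
sketch_view = IndexView(docs)
sketch_view.shuffle(seed=42)
first_three = sketch_view[0:3]
print(docs[0], len(first_three))
```

Because only the small index list is rearranged, shuffling and slicing stay cheap even when the documents themselves are large.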

We can also use this slicing API to create a random training and testing set from our shuffled views (views also support slicing). Let's make a 75-25 split of training-testing data. (Note that it's really important that we already shuffled the view!)

In [13]:
training = view[0:int(0.75*len(view))]
testing = view[int(0.75*len(view)):len(view)]  # the slice end is exclusive
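As a quick check of the arithmetic behind this split, the following sketch applies the same cut point to a plain list of 1008 indices (the dataset size reported earlier). The names `n_docs`, `train_idx`, and `test_idx` are illustrative only.

```python
n_docs = 1008  # number of documents reported by len(dset) earlier
cut = int(0.75 * n_docs)

indices = list(range(n_docs))
train_idx = indices[:cut]   # first 75% of the (already shuffled) positions
test_idx = indices[cut:]    # remaining 25%

print(len(train_idx), len(test_idx))
```

The two parts do not overlap and together cover every document, so no test document leaks into training.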

Now, we're ready to train a classifier! Let's start with a very simple one: [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier).

In MeTA, construction of a classifier implies training of that model. Let's train a Naive Bayes classifier on our training view now.

In [14]:
nb = metapy.classify.NaiveBayes(training)

We can now classify individual documents like so.

In [15]:
nb.classify(testing[0].weights)

'japanese'
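For intuition about what a Naive Bayes classifier does with word counts, here is a minimal multinomial Naive Bayes sketch with add-one smoothing in pure Python. It is not MeTA's implementation; the function names, the smoothing constant, and the toy tokens are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, alpha=1.0):
    """Collect per-label document and word counts (alpha is the smoothing constant)."""
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab, alpha

def classify(model, tokens):
    """Return the label with the highest log-prior plus log-likelihood score."""
    label_counts, word_counts, vocab, alpha = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen during training
            count = word_counts[label][tok]
            score += math.log((count + alpha) / (total_words + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy data: one two-word "essay" per label.
toy_docs = [["alpha", "beta"], ["gamma", "delta"], ["epsilon", "zeta"]]
toy_labels = ["chinese", "english", "japanese"]
toy_model = train_naive_bayes(toy_docs, toy_labels)
print(classify(toy_model, ["gamma"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.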

We might be more interested in how well we classify the testing set.

In [16]:
mtrx = nb.test(testing)
print(mtrx)


            chinese   english   japanese
          ------------------------------
  chinese | 0.833     0.0417    0.125
  english | 0.025     0.9       0.075
 japanese | 0.0213    0.0106    0.968




The `test()` method of MeTA's classifiers returns a `ConfusionMatrix`, which contains useful information about what kinds of mistakes your classifier is making.

(Note that, due to the random shuffling, you might see different results than we do here.)

For example, we can see that this classifier tends to confuse native-Chinese students' essays with those of native-Japanese students. We can tell that by looking at the rows of the confusion matrix. Each row tells you what fraction of documents with that _true_ label were assigned the label for each column by the classifier. In the case of the native-Chinese label, we can see that 12.5% of the time they were miscategorized as being native-Japanese.
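The row-wise reading can be sketched with plain Python. The raw counts below are an assumption: they were back-calculated from the rounded values printed in this notebook (252 test documents), so treat them as approximate. `normalize_rows` is a hypothetical helper, not part of `metapy`.

```python
labels = ["chinese", "english", "japanese"]

# Assumed raw counts, inferred from the rounded output in this notebook.
# Rows are true labels, columns are predicted labels.
counts = [
    [20, 1, 3],
    [1, 36, 3],
    [4, 2, 182],
]

def normalize_rows(matrix):
    """Divide each row by its sum so it reads as fractions of one true label."""
    return [[cell / sum(row) for cell in row] for row in matrix]

fractions = normalize_rows(counts)
for label, row in zip(labels, fractions):
    print(label, [round(value, 3) for value in row])
```

With these counts, the native-Chinese row has 3 of its 24 essays in the native-Japanese column, which is the 0.125 shown in the printed matrix.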

The `ConfusionMatrix` also computes a lot of metrics that are commonly used in classifier evaluation.

In [17]:
mtrx.print_stats()

------------------------------------------------------------
[1mClass[22m       [1mF1 Score[22m    [1mPrecision[22m   [1mRecall[22m      [1mClass Dist[22m  
------------------------------------------------------------
chinese     0.816       0.8         0.833       0.0952      
english     0.911       0.923       0.9         0.159       
japanese    0.968       0.968       0.968       0.746       
------------------------------------------------------------
[1mTotal[22m       [1m0.945[22m       [1m0.945[22m       [1m0.944[22m       
------------------------------------------------------------
252 predictions attempted, overall accuracy: 0.944
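To see where these per-class numbers come from, the sketch below recomputes precision, recall, and F1 from raw confusion counts. The counts are an assumption, back-calculated from the rounded output above, and `per_class_stats` is a hypothetical helper rather than a `metapy` function.

```python
labels = ["chinese", "english", "japanese"]

# Assumed raw counts (rows: true label, columns: predicted label),
# inferred from the rounded values printed in this notebook.
counts = [
    [20, 1, 3],
    [1, 36, 3],
    [4, 2, 182],
]

def per_class_stats(matrix, k):
    """Return (precision, recall, f1) for class index k."""
    true_pos = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)  # column sum
    actually_k = sum(matrix[k])                     # row sum
    precision = true_pos / predicted_as_k
    recall = true_pos / actually_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

stats = {label: per_class_stats(counts, k) for k, label in enumerate(labels)}
accuracy = sum(counts[k][k] for k in range(len(labels))) / sum(map(sum, counts))

for label, (precision, recall, f1) in stats.items():
    print(label, round(f1, 3), round(precision, 3), round(recall, 3))
print("accuracy", round(accuracy, 3))
```

Precision is computed down a column (of everything predicted as a class, how much was right), while recall is computed along a row (of everything truly in a class, how much was found).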



If we want to make sure that the classifier isn't overfitting to our training data, a common approach is to do [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). Let's run CV for our Naive Bayes classifier across the whole dataset, using 5 folds, to get an idea of how well we might generalize to new data.

In [18]:
mtrx = metapy.classify.cross_validate(lambda fold: metapy.classify.NaiveBayes(fold), view, 5)

`cross_validate()` returns a `ConfusionMatrix` just like `test()` does. We give it a function to use to create the trained classifiers for each fold, and then pass in the dataset view containing all of our documents, and the number of folds we want to use.

Let's see how we did.

In [19]:
print(mtrx)
mtrx.print_stats()


            chinese   english   japanese
          ------------------------------
  chinese | 0.87      0.0217    0.109
  english | 0.0208    0.917     0.0625
 japanese | 0.0169    0.0104    0.973


------------------------------------------------------------
Class       F1 Score    Precision   Recall      Class Dist
------------------------------------------------------------
chinese     0.851       0.833       0.87        0.0915
english     0.923       0.93        0.917       0.143
japanese    0.974       0.975       0.973       0.765
------------------------------------------------------------
Total       0.955       0.956       0.955
------------------------------------------------------------
1005 predictions attempted, overall accuracy: 0.955
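The fold mechanics can be sketched in a few lines of plain Python. The `k_fold_indices` helper is hypothetical, and the choice to use equal-sized folds that leave out the remainder is an assumption of this sketch. It would be consistent with the 1005 predictions reported for 1008 documents, but the tutorial does not state how MeTA builds its folds.

```python
def k_fold_indices(n_items, k):
    """Yield (train, test) index lists for k equal-sized folds.

    Items that do not divide evenly into k folds are left out. That is an
    assumption of this sketch, not a documented property of MeTA.
    """
    fold_size = n_items // k
    indices = list(range(fold_size * k))
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

folds = list(k_fold_indices(1008, 5))
total_predictions = sum(len(test) for _, test in folds)
print(total_predictions)
```

Each document in a fold is predicted exactly once, by a model that never saw it during training, which is what makes the pooled confusion matrix a fairer estimate than a single split.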



Now let's do the same thing, but for an arguably stronger baseline: [SVM](https://en.wikipedia.org/wiki/Support_vector_machine).

MeTA's implementation of SVM is actually an approximation using [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) on the [hinge loss](https://en.wikipedia.org/wiki/Hinge_loss). It's implemented as a `BinaryClassifier`, so we will need to adapt it before it can be used to solve our multi-class classification problem.
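As a rough illustration of the idea (not MeTA's actual `SGD` classifier), here is a minimal pure-Python sketch of a linear binary classifier trained with stochastic gradient descent on an L2-regularized hinge loss. The function names, hyperparameters, toy data, and the omission of a bias term are all assumptions made to keep the sketch short.

```python
import random

def dot(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

def train_hinge_sgd(examples, n_features, epochs=20, lr=0.1, reg=0.01, seed=0):
    """Fit weights by SGD on the hinge loss max(0, 1 - y * dot(w, x)).

    examples: list of (feature_vector, label) pairs with label in {-1, +1}.
    """
    rng = random.Random(seed)
    weights = [0.0] * n_features
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a fresh random order
        for x, y in data:
            margin = y * dot(weights, x)
            # The L2 penalty shrinks the weights slightly on every step.
            weights = [w * (1.0 - lr * reg) for w in weights]
            if margin < 1.0:
                # Inside the margin the hinge subgradient is -y * x.
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
    return weights

def predict(weights, x):
    return 1 if dot(weights, x) >= 0 else -1

# Made-up toy data: feature 0 signals the positive class, feature 1 the negative.
toy = [([2.0, 0.0], 1), ([1.0, 0.0], 1), ([0.0, 1.0], -1), ([0.0, 2.0], -1)]
toy_weights = train_hinge_sgd(toy, n_features=2)
print([predict(toy_weights, x) for x, _ in toy])
```

Examples that are already classified with a margin of at least 1 contribute no update, so training effort concentrates on the documents near the decision boundary.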

MeTA provides two different adapters for this scenario: [One-vs-All](https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) and [One-vs-One](https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-one).

In [20]:
ova = metapy.classify.OneVsAll(training, metapy.classify.SGD, loss_id='hinge')

We construct the `OneVsAll` reduction by providing it the training documents, the binary classifier class to use, and then (as keyword arguments) any additional arguments for that classifier. In this case, we use `loss_id` to specify the loss function to use.
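The one-vs-all idea itself is small enough to sketch in plain Python: train one binary scorer per label, each separating that label from all others, then predict the label whose scorer is most confident. This is not how `metapy.classify.OneVsAll` is implemented internally; every function here is hypothetical, and the binary learner is a simple centroid-difference stand-in rather than SGD.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def train_binary(vectors, targets):
    """Hypothetical stand-in binary learner (difference of class centroids).

    targets holds True for the positive class and False for the rest.
    Returns a function mapping a vector to a real-valued score.
    """
    pos = mean([v for v, t in zip(vectors, targets) if t])
    neg = mean([v for v, t in zip(vectors, targets) if not t])
    weights = [p - q for p, q in zip(pos, neg)]
    midpoint = [(p + q) / 2 for p, q in zip(pos, neg)]
    return lambda x: sum(w * (xi - m) for w, xi, m in zip(weights, x, midpoint))

def train_one_vs_all(vectors, labels, binary_trainer):
    """Train one binary scorer per label: that label versus everything else."""
    return {
        label: binary_trainer(vectors, [lab == label for lab in labels])
        for label in sorted(set(labels))
    }

def classify_one_vs_all(scorers, x):
    """Predict the label whose binary scorer gives the highest score."""
    return max(scorers, key=lambda label: scorers[label](x))

# Made-up toy vectors: each label is associated with one dominant feature.
toy_vectors = [[3, 0, 0], [0, 3, 0], [0, 0, 3], [2, 1, 0], [0, 2, 1], [1, 0, 2]]
toy_labels = ["chinese", "english", "japanese", "chinese", "english", "japanese"]
scorers = train_one_vs_all(toy_vectors, toy_labels, train_binary)
print(classify_one_vs_all(scorers, [4, 0, 1]))
```

With three labels this trains three binary models. A one-vs-one reduction would instead train one model per pair of labels and combine their votes.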

We can now use `OneVsAll` just like any other classifier.

In [21]:
mtrx = ova.test(testing)
print(mtrx)
mtrx.print_stats()


            chinese   english   japanese
          ------------------------------
  chinese | 0.792     -         0.208
  english | -         0.9       0.1
 japanese | 0.00532   0.00532   0.989


------------------------------------------------------------
Class       F1 Score    Precision   Recall      Class Dist
------------------------------------------------------------
chinese     0.864       0.95        0.792       0.0952
english     0.935       0.973       0.9         0.159
japanese    0.971       0.954       0.989       0.746
------------------------------------------------------------
Total       0.956       0.957       0.956
------------------------------------------------------------
252 predictions attempted, overall accuracy: 0.956



In [22]:
mtrx = metapy.classify.cross_validate(lambda fold: metapy.classify.OneVsAll(fold, metapy.classify.SGD, loss_id='hinge'), view, 5)
print(mtrx)
mtrx.print_stats()


            chinese   english   japanese
          ------------------------------
  chinese | 0.772     0.0326    0.196
  english | -         0.903     0.0972
 japanese | 0.0026    0.0065    0.991


------------------------------------------------------------
Class       F1 Score    Precision   Recall      Class Dist
------------------------------------------------------------
chinese     0.861       0.973       0.772       0.0915
english     0.922       0.942       0.903       0.143
japanese    0.975       0.96        0.991       0.765
------------------------------------------------------------
Total       0.958       0.958       0.958
------------------------------------------------------------
1005 predictions attempted, overall accuracy: 0.958



That should be enough to get you started! Try looking at `help(metapy.classify)` for a list of what's included in the bindings.