<a href="https://colab.research.google.com/github/akaver/NLP2019/blob/master/Lab05_2019.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this lab, we will learn how to do text classification using sklearn.

The sklearn Python package (also known as scikit-learn) is a general purpose machine learning toolkit. It has a very nice API, a lot of features and is well documented. This lab is largely based on the Sklearn text classification tutorial: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html.

We will also look a bit at the pandas Python package which is excellent for working with tabular data, and works very well together with sklearn.

In [0]:
import sklearn
import sklearn.model_selection
import pandas as pd
import numpy as np

As a sample dataset, we will use Estonian data from the www.hinnavaatlus.ee website, which is a price comparison website, originally for IT components but now for other stuff also. The website allows users to give ratings (1-5 stars) to companies that sell stuff, together with comments. This data was used to create a sentiment analysis dataset: the comments with 4-5 stars were determined to be positive (`Positiivne`), comments with 1-2 stars negative (`Negatiivne`) and comments with 3 stars neutral (`Neutraalne`). We can load the dataset CSV file directly from the web using the pandas CSV reader:

In [0]:
data = pd.read_csv("https://raw.githubusercontent.com/estnltk/estnltk/master/docs/tutorials/data/hinnavaatlus.csv")
# let's explore the data
data.head()

Unnamed: 0,Kommentaari ID,Kommentaar,Meelsus
0,4,"Väike, aga tubli firma!",Neutraalne
1,8,väga hea firma,Positiivne
2,10,"Viimasel ajal pole midagi halba öelda, aga sam...",Negatiivne
3,11,"Fotode kvaliteet väga pro ja ""jjk"" seal töötam...",Positiivne
4,13,Ainult positiivsed kogemused,Positiivne


To experiment with sentiment classification, we will split the data into training and development sets. This can be easily accomplished using sklearn's `train_test_split()` function, which randomly shuffles the data and splits it into two sets according to the specified ratio. I also add the `random_state=0` argument so that the train/test split is deterministic and does not differ across runs.

In [0]:
train_data, dev_data = sklearn.model_selection.train_test_split(data, test_size=0.2, random_state=0)
train_data.shape, dev_data.shape

((650, 3), (163, 3))
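To see what `random_state` buys us, here is a minimal sketch on toy data (the `items` list is made up for illustration and is not the real dataset). Calling `train_test_split()` twice with the same seed should give identical splits:

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the comments; not the real dataset.
items = [f"comment {i}" for i in range(10)]

# Same random_state -> same shuffle -> identical splits.
train_a, dev_a = train_test_split(items, test_size=0.2, random_state=0)
train_b, dev_b = train_test_split(items, test_size=0.2, random_state=0)

print(len(train_a), len(dev_a))               # 8 2
print(train_a == train_b and dev_a == dev_b)  # True
```

Without `random_state`, each call would shuffle differently, and accuracy figures would not be comparable between runs.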

Next, we will use a Naive Bayes classifier to train a model that classifies comments to the three categories.

First, we have to convert each comment into the bag-of-words representation. This is very easy to do using sklearn's [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class. It converts a collection of text documents to a matrix of token counts, exactly as needed by the Naive Bayes classifier.

To convert training data to bag-of-words, we use the `fit_transform()` method of `CountVectorizer`. This method first discovers the vocabulary of the training data (fit), and then transforms the data from textual representation to matrix form.

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_data['Kommentaar'])


The result is a sparse matrix, where the number of rows is equal to the number of items in the training data, and the number of columns is equal to the vocabulary size:

In [0]:
X_train_counts.shape

(650, 4426)
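The same mechanics are easier to inspect on a tiny corpus. The sketch below uses three made-up comments (not taken from the dataset) and converts the sparse result to a dense array so that every count is visible:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up comments, for illustration only.
docs = ["hea firma", "väga hea firma", "halb teenindus"]

vect = CountVectorizer()
X = vect.fit_transform(docs)

print(X.shape)  # (3, 5): 3 documents, 5 distinct words
# Column order follows the alphabetically sorted vocabulary:
# firma=0, halb=1, hea=2, teenindus=3, väga=4
print(sorted(vect.vocabulary_.items(), key=lambda kv: kv[1]))
print(X.toarray())
```

Each row is one document and each column counts one vocabulary word, so the second row has ones in the `firma`, `hea` and `väga` columns.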

After the `fit_transform()` method is called, the `vocabulary_` attribute of `CountVectorizer` gets populated with the discovered vocabulary of the training data:

In [0]:
count_vect.vocabulary_

{'veljo': 3881,
 'ja': 812,
 'vabatahtlike': 3753,
 'abiga': 134,
 'on': 2266,
 'eesti': 430,
 'tõusnud': 3642,
 'üheks': 4124,
 'suurima': 3118,
 'wifi': 4060,
 'võrkude': 4040,
 'arvuga': 250,
 'riigiks': 2767,
 'maailmas': 1746,
 'saavutanud': 2841,
 'edumeelse': 400,
 'riigi': 2764,
 'maine': 1767,
 'huvitav': 723,
 'et': 531,
 'kuidas': 1399,
 'seda': 2880,
 'kohta': 1249,
 'ültse': 4173,
 'võimalik': 4019,
 'ülesse': 4159,
 'leida': 1615,
 'kui': 1397,
 'mitte': 1914,
 'kusagil': 1435,
 'pole': 2576,
 'ühtegi': 4139,
 'viita': 3932,
 'kus': 1434,
 'se': 2872,
 'kauplus': 1051,
 'asub': 303,
 'head': 632,
 'hinnad': 676,
 'konkreetne': 1303,
 'teenindus': 3298,
 'sellel': 2924,
 'aastal': 129,
 'fotokas': 580,
 'televiisor': 3398,
 'tellitud': 3431,
 'nende': 2088,
 'kaudu': 1039,
 'ei': 439,
 'saa': 2793,
 'midagi': 1856,
 'halba': 622,
 'öelda': 4111,
 'kaup': 1042,
 'kohal': 1232,
 'mõlemal': 2010,
 'korral': 1357,
 'järgmine': 895,
 'päev': 2678,
 'toodud': 3489,
 'otse': 2349

Let's check whether the computed counts correspond to the original data.

The first item of the training data looks like this in text form:

In [0]:
train_data['Kommentaar'].iloc[0]

'Veljo ja vabatahtlike abiga on Eesti tõusnud üheks suurima WiFi võrkude arvuga riigiks maailmas ja saavutanud edumeelse riigi maine!'

This means that the count vector corresponding to that item must have `1` at the position that corresponds to the word 'veljo' (note that CountVectorizer lower-cased our text; this can be changed with the `lowercase` parameter).

Let's first check what is the ID of the word 'veljo' in `count_vect`'s vocabulary:

In [0]:
count_vect.vocabulary_['veljo']

3881

So, the count vector corresponding to the first item (row 0) must have 1 at position 3881. Let's check:

In [0]:
X_train_counts[0, 3881]

1

CountVectorizer also has a method called `inverse_transform` that transforms count vectors back to textual representation. Let's try:

In [0]:
count_vect.inverse_transform(X_train_counts[0])

[array(['maine', 'riigi', 'edumeelse', 'saavutanud', 'maailmas', 'riigiks',
        'arvuga', 'võrkude', 'wifi', 'suurima', 'üheks', 'tõusnud',
        'eesti', 'on', 'abiga', 'vabatahtlike', 'ja', 'veljo'],
       dtype='<U22')]

Of course, the order of words is lost when transforming data to bag-of-words, so the inverse transform above returns the words of the original text, but in a different order.
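The loss of word order can be illustrated without sklearn at all. In the sketch below, `bag_of_words()` is a hypothetical helper that simplifies what `CountVectorizer` does (plain whitespace splitting instead of its token pattern), and the two sentences are made up:

```python
from collections import Counter

def bag_of_words(text):
    # Simplified tokenization: lower-case and split on whitespace.
    return Counter(text.lower().split())

# "company good, service bad" vs. "service good, company bad"
a = bag_of_words("firma hea teenindus halb")
b = bag_of_words("teenindus hea firma halb")

print(a == b)  # True: same counts, although the meaning differs
```

A bag-of-words classifier therefore cannot tell these two comments apart; it only sees which words occur and how often.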

Next, we have to convert the labels (Negatiivne, Neutraalne, Positiivne) to integers (0, 1, 2), as needed by sklearn's classifiers. This can be done using the `LabelEncoder` class, which numbers the classes in sorted order:

In [0]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(data['Meelsus'])
le.classes_


array(['Negatiivne', 'Neutraalne', 'Positiivne'], dtype=object)
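A small sketch of how the encoder maps labels in both directions, using a short made-up label list. The integer codes follow the sorted order of the class names:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up label sequence, for illustration only.
labels = ["Positiivne", "Neutraalne", "Negatiivne", "Positiivne"]

enc = LabelEncoder().fit(labels)
print(list(enc.classes_))   # ['Negatiivne', 'Neutraalne', 'Positiivne']

codes = enc.transform(labels)
print(codes.tolist())       # [2, 1, 0, 2]

# inverse_transform() maps integer codes back to label strings.
print(enc.inverse_transform([0, 2]).tolist())  # ['Negatiivne', 'Positiivne']
```

`inverse_transform()` is handy later for turning predicted integers back into readable labels.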

We will now create new arrays `y_train` and `y_dev` that hold the numeric categories of training and dev data:

In [0]:
y_train = le.transform(train_data['Meelsus'])
y_dev = le.transform(dev_data['Meelsus'])

Now we are ready to train a Naive Bayes classifier. It's very easy, since we have already converted the training data to a suitable form:

In [0]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_counts, y_train)
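To get a feeling for what `fit()` estimates, here is a minimal sketch on a made-up 2x3 count matrix with one document per class. With the default `alpha=1.0` (add-one smoothing), the per-class word probabilities should equal `(count + 1) / (total count + vocabulary size)`:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up counts: 2 documents (one per class), 3 vocabulary words.
X = np.array([[2, 1, 0],
              [0, 1, 2]])
y = np.array([0, 1])

toy_clf = MultinomialNB().fit(X, y)  # alpha=1.0 by default

# Add-one smoothed word probabilities, computed by hand per class.
alpha = 1.0
expected = (X + alpha) / (X + alpha).sum(axis=1, keepdims=True)

print(np.exp(toy_clf.feature_log_prob_))
print(np.allclose(np.exp(toy_clf.feature_log_prob_), expected))  # True
```

Thanks to smoothing, a word never seen with a class still gets a small non-zero probability instead of zeroing out the whole product.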

Let's now apply it on dev data. First, we need to convert the dev data into bag-of-words:

In [0]:
X_dev_counts = count_vect.transform(dev_data['Kommentaar'])
X_dev_counts.shape

(163, 4426)
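Note that `transform()` reuses the vocabulary learned from the training data, which is why the number of columns stays the same. A minimal sketch with made-up strings shows that words unseen during `fit()` are silently dropped:

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
vect.fit(["hea firma"])  # learned vocabulary: firma=0, hea=1

# 'pood' and 'halb' were never seen during fit(), so they are ignored.
X_new = vect.transform(["hea pood", "halb pood"])

print(X_new.shape)      # (2, 2)
print(X_new.toarray())  # first row counts 'hea' once, second row is all zeros
```

Calling `fit_transform()` on the dev data instead would build a different vocabulary, and the columns would no longer line up with what the classifier was trained on.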

Now, let's use the classifier to predict the classes for dev data:

In [0]:
predicted = clf.predict(X_dev_counts)

In [0]:
predicted

array([2, 0, 0, 2, 2, 2, 2, 0, 1, 2, 2, 1, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 0, 2, 2, 1, 2, 0, 0, 2, 2, 0, 2, 2, 0, 0, 2, 0, 2, 2, 0, 0,
       0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 1, 2, 0, 2, 2, 2, 0, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2,
       0, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 2, 2, 2, 0, 2,
       2, 2, 0, 2, 0, 2, 2, 2, 2])

The accuracy of the classifier can be calculated by comparing the predicted values with the actual values, and taking the mean of the resulting boolean array.

In [0]:
np.mean(predicted == y_dev)

0.6993865030674846

So, about 70% of the items in the dev set were classified into the right category. The same can be accomplished using the `score()` method of the classifier:

In [0]:
clf.score(X_dev_counts, y_dev)

0.6993865030674846
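Both computations above boil down to the fraction of positions where prediction and truth agree. A minimal sketch with made-up label arrays:

```python
import numpy as np

# Made-up true and predicted class IDs.
y_true = np.array([2, 0, 1, 2, 0])
y_pred = np.array([2, 0, 2, 2, 1])

# Element-wise comparison yields a boolean array ...
matches = y_pred == y_true
print(matches)         # [ True  True False  True False]

# ... and its mean treats True as 1 and False as 0.
print(matches.mean())  # 0.6
```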

Let's look at the classification results in more detail:

In [0]:
from sklearn import metrics
print(metrics.classification_report(y_dev, predicted, target_names=le.classes_))

              precision    recall  f1-score   support

  Negatiivne       0.55      0.67      0.60        33
  Neutraalne       0.40      0.06      0.11        32
  Positiivne       0.76      0.92      0.83        98

   micro avg       0.70      0.70      0.70       163
   macro avg       0.57      0.55      0.51       163
weighted avg       0.65      0.70      0.64       163



In [0]:
cm = metrics.confusion_matrix(y_dev, predicted)
cm

array([[22,  1, 10],
       [12,  2, 18],
       [ 6,  2, 90]])

As can be seen in the above confusion matrix, the main problem is with neutral comments: only 2 of the 32 neutral comments were classified as neutral. But this is quite expected, as distinguishing neutral from positive or negative would also be difficult for humans.
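The precision and recall figures in the classification report can be recomputed by hand from the confusion matrix (rows are true classes, columns are predicted classes). The sketch below hard-codes the matrix printed above; the variable names are just for illustration:

```python
labels = ["Negatiivne", "Neutraalne", "Positiivne"]
cm = [[22, 1, 10],
      [12, 2, 18],
      [6, 2, 90]]

report = {}
for k, name in enumerate(labels):
    true_pos = cm[k][k]
    support = sum(cm[k])                         # row sum: items truly in class k
    predicted_as_k = sum(row[k] for row in cm)   # column sum: items predicted as k
    report[name] = {
        "precision": true_pos / predicted_as_k,
        "recall": true_pos / support,
        "support": support,
    }
    print(f"{name}: precision={report[name]['precision']:.2f} "
          f"recall={report[name]['recall']:.2f} support={support}")

# Accuracy is the diagonal sum divided by the total number of items.
accuracy = sum(cm[k][k] for k in range(3)) / sum(map(sum, cm))
print(accuracy)
```

The very low recall for `Neutraalne` is the 2-out-of-32 figure discussed above.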


Next, we will look at how to use sklearn's Pipeline to make the classifier more streamlined.

Often, data needs many transforms before a classifier can be applied for training or prediction. This quickly gets quite messy. Pipeline can be used to embed all transforms, together with the actual classifier, into one object, so that we can feed the raw data directly to the pipeline. This way we can be sure that exactly the same transforms are applied to both training and dev data:

In [0]:
from sklearn.pipeline import Pipeline
clf_pipeline = Pipeline([('vect', CountVectorizer()), 
                         ('clf', MultinomialNB())])
clf_pipeline.fit(train_data['Kommentaar'], y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [0]:
clf_pipeline.score(dev_data['Kommentaar'], y_dev)

0.6993865030674846
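A fitted pipeline accepts raw strings directly, and its individual steps remain accessible by name. The following sketch uses four made-up training comments with toy labels (1 = positive, 0 = negative), not the real dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data, for illustration only.
train_texts = ["hea firma", "väga hea teenindus",
               "halb teenindus", "väga halb firma"]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative (toy labels)

toy_pipeline = Pipeline([("vect", CountVectorizer()),
                         ("clf", MultinomialNB())])
toy_pipeline.fit(train_texts, train_labels)

# Raw text in, class IDs out; vectorization happens inside the pipeline.
print(toy_pipeline.predict(["hea pood", "halb pood"]))    # [1 0]

# Individual steps can be inspected through named_steps.
print(len(toy_pipeline.named_steps["vect"].vocabulary_))  # 5
```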

Next, we will learn how to apply some morphological analysis prior to bag-of-words. In Estonian, words occur in many different inflections. The model doesn't know that, for example, the words 'halvad' and 'halb' refer to the same concept. One way to fix this is to first lemmatize the text data.

Let's first create a function that lemmatizes a text string using stanfordnlp, introduced in Lab 2. You can also use EstNLTK (which is actually more accurate than stanfordnlp for Estonian), or spaCy if working with some bigger language that spaCy supports.

In [0]:
! pip install stanfordnlp

Collecting stanfordnlp
  Downloading https://files.pythonhosted.org/packages/89/26/069d061ec8e3ea5cadffb00f941f6c79c7557821df8cc0439cbf1906cf53/stanfordnlp-0.1.1-py3-none-any.whl (133kB)
    100% |████████████████████████████████| 143kB 6.7MB/s 
Installing collected packages: stanfordnlp
Successfully installed stanfordnlp-0.1.1


In [0]:
import stanfordnlp
stanfordnlp.download('et')

Using the default treebank "et_edt" for language "et".
Would you like to download the models for: et_edt now? (Y/n)
y

Default download directory: /root/stanfordnlp_resources
Hit enter to continue or type an alternate directory.


Downloading models for: et_edt
Download location: /root/stanfordnlp_resources/et_edt_models.zip


100%|██████████| 540M/540M [01:10<00:00, 9.34MB/s]



Download complete.  Models saved to: /root/stanfordnlp_resources/et_edt_models.zip
Extracting models file for: et_edt
Cleaning up...Done.


In [0]:
nlp = stanfordnlp.Pipeline(lang="et")

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': '/root/stanfordnlp_resources/et_edt_models/et_edt_tokenizer.pt', 'lang': 'et', 'shorthand': 'et_edt', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': '/root/stanfordnlp_resources/et_edt_models/et_edt_tagger.pt', 'pretrain_path': '/root/stanfordnlp_resources/et_edt_models/et_edt.pretrain.pt', 'lang': 'et', 'shorthand': 'et_edt', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': '/root/stanfordnlp_resources/et_edt_models/et_edt_lemmatizer.pt', 'lang': 'et', 'shorthand': 'et_edt', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
---
Loading: depparse
With settings: 
{'model_path': '/root/stanfordnlp_resources/et_edt_models/et_edt_parser.pt', 'pretrain_path': '/root/stanfordnlp_resources/et_edt_models/et_edt.pretrain.pt', 'lang': 'et', 'shorthand

In [0]:
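def lemmatize(text):
    # Run the stanfordnlp pipeline and collect the lemma of every word.
    doc = nlp(text)
    lemmas = []
    for sentence in doc.sentences:
        for word in sentence.words:
            lemmas.append(word.lemma)
    # Join the lemmas back into one space-separated string.
    return " ".join(lemmas)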

lemmatize("Ma olen sellega väga rahul! See oli suurepärane.")

'mina olema see väga rahul ! see olema suure_pärane .'

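The `CountVectorizer` constructor conveniently has a `preprocessor` parameter that allows specifying a function that is used to preprocess the raw text. Note that a custom preprocessor replaces the built-in preprocessing step, so `CountVectorizer` no longer lower-cases the text itself. We will pass our `lemmatize()` function to it (it takes a few minutes to run):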

In [0]:
from sklearn.pipeline import Pipeline
clf_pipeline = Pipeline([('vect', CountVectorizer(preprocessor=lemmatize)), 
                         ('clf', MultinomialNB())])
clf_pipeline.fit(train_data['Kommentaar'], y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1),
        preprocessor=<function lemmatize at 0x7f9...enizer=None, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
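Since the real lemmatizer is slow, the effect of the `preprocessor` hook is easier to check with a stand-in. In the sketch below, `toy_lemmatize()` and its `TOY_LEMMAS` table are hypothetical and only cover two word forms. The function lower-cases the text itself, because a custom preprocessor replaces the vectorizer's built-in lower-casing:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-entry lemma table, for illustration only.
TOY_LEMMAS = {"halvad": "halb", "head": "hea"}

def toy_lemmatize(text):
    # Lower-case here: CountVectorizer skips its own lower-casing
    # when a custom preprocessor is supplied.
    words = text.lower().split()
    return " ".join(TOY_LEMMAS.get(w, w) for w in words)

vect = CountVectorizer(preprocessor=toy_lemmatize)
X = vect.fit_transform(["Halvad hinnad", "halb teenindus"])

print(sorted(vect.vocabulary_))  # ['halb', 'hinnad', 'teenindus']
print(X.toarray())               # both rows have a 1 in the 'halb' column
```

Both inflections end up in the single `halb` column, which is exactly the merging that lemmatization is meant to achieve.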

In [0]:
predicted = clf_pipeline.predict(dev_data['Kommentaar'])

In [0]:
np.mean(predicted == y_dev)

0.6687116564417178

It's quite common that lemmatization doesn't really help for text classification. This is because some information that could be useful for classification is lost when we convert surface words to lemmas.

## Exercise 1

But what if we also remove function words (like 'ja', 'et', etc.) from the text? Try filtering out words whose `xpos` attribute (assigned by stanfordnlp) is 'J'. Does it improve accuracy?



## Exercise 2
Your (ungraded) exercise: check whether using n-grams (e.g. of length 1 to 3) instead of single words improves the classification accuracy. Hint: study the documentation of `CountVectorizer`.