# Text classification

## *"Words. I know words. I have the best words!"*
*- Noam Chomsky*

# Sentiment Analysis

<video controls src="cartoon.m4v" type="video/mp4" />

# Sentiment Analysis: Implementation

<video controls src="machine_learning.m4v" type="video/mp4" />

# Overview

In order to train a machine learning model to classify text, we need:
1. a way to preprocess text
2. a label for each text, represented as a number
3. a way to represent each text as a vector input
4. a model to learn a function $f(input) = label$
5. a way to evaluate how well the model works
6. a way to predict new data

As an example, we will use review data and try to classify each review as $positive$ or $negative$, based only on its text.

The same method can be used for any other data, including more labels and other dependent variables (e.g., age or gender of the text author, social constructs expressed in the text, etc.).

# 1. Getting data

We use `pandas` to read in our data from a CSV file, but you can use almost any format (remember `read_excel()`, `read_sql()`, etc.)

In [1]:
import pandas as pd

data = pd.read_csv('sa_train.csv')
print(len(data), data['output'].unique())
data.head(2)

1800 ['neg' 'pos']


Unnamed: 0.1,Unnamed: 0,input,output,clean_text
0,0,shakespeare in love is quite possibly the most...,neg,shakespeare love be quite possibly most enjoya...
1,1,wizards is an animated feature that begins wit...,neg,wizard be animate feature that begin narration...


## Preprocessing

Text is messy. The goal of preprocessing is to reduce the amount of noise (= unnecessary variation), while maintaining the signal. There is no one-size-fits-all solution, but a good approximation is the following:

In [59]:
import spacy
# spaCy 3 removed the 'en' shortcut, so we load the model by its full name
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [104]:
def clean_text(text):
    '''reduce text to lower-case lexicon entry'''
    lemmas = [token.lemma_ for token in nlp(text) 
              if token.pos_ in {'NOUN', 'VERB', 'ADJ', 'ADV', 'PROPN'}]
    return ' '.join(lemmas)

clean_text('This is a test sentence. And here comes another one... Go me!')

'be test sentence here come one go'

Let's clean up the input data. This can take a while, so it's good to save it... I have done that here already.

In [65]:
# data['clean_text'] = data['input'].apply(clean_text)
data['clean_text']

0       shakespeare love be quite possibly most enjoya...
1       wizard be animate feature that begin narration...
2       gun wield arnold schwarzenegger have change he...
3       keep jane austen sense sensibility pride preju...
4       hollywood be pimp fat cigar smoke chump wear f...
5       look new version psycho come world didn t end ...
6       film adapt comic book have have plenty success...
7       capsule verma family be have wedding all relat...
8       watch battlefield earth be wallow misery be mo...
9       tommy lee jone chase innocent victim america w...
10      michael robbin hardball be quite cinematic ach...
11      s almost amusing watch year old christina ricc...
12      accord hollywood movie make last few decade li...
13      young einstein be embarrassingly lame didn t s...
14      kirk dougla be rare american actor who can say...
15      man be not man tael gold star sammo hang sylvi...
16      director nightmare christma say preview which ...
17      phil c

# 2. Labels

Here, we assume that we already have the labels. (In your task, you will have to label them yourself! Hint: use `input()` or a spreadsheet).

However, in order for the machine learning model to work with the labels, we need to translate them into a vector of numbers. We can use `sklearn.preprocessing.LabelEncoder`.

In [106]:
from sklearn.preprocessing import LabelEncoder

# transform labels into numbers
labels2numbers = LabelEncoder()

y = labels2numbers.fit_transform(data['output'])
print(data['output'][:10], y[:10], len(y))

0    neg
1    neg
2    neg
3    pos
4    pos
5    neg
6    pos
7    pos
8    neg
9    neg
Name: output, dtype: object [0 0 0 1 1 0 1 1 0 0] 1800


To get the original names back, use `inverse_transform()`:

In [107]:
labels2numbers.inverse_transform([1,1,1,0,0,1])

array(['pos', 'pos', 'pos', 'neg', 'neg', 'pos'], dtype=object)

# 3. Representing text

First, we need to transform the texts into a matrix, where each row represents one text instance. The columns are the **features**.


## Bags of words

The easiest way to represent features is as the counts of all words in the text. It takes two steps:
1. collect the counts for each word
2. transform the individual counts into one big matrix

The result is a matrix $X$ with one row for each instance, and one column for each word in the vocabulary.
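
To make the two steps concrete, here is a minimal sketch on two made-up texts. It uses `CountVectorizer` (plain counts, no weighting); the names `toy_texts`, `toy_vectorizer` and `toy_X` are only for illustration and are not part of the review data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# two toy "documents" (hypothetical, not from the review data)
toy_texts = ["good movie good plot", "bad movie"]

toy_vectorizer = CountVectorizer()
# step 1 and 2 in one call: count the words, then stack the counts into a matrix
toy_X = toy_vectorizer.fit_transform(toy_texts)

# vocabulary_ maps each word to its column index
print(sorted(toy_vectorizer.vocabulary_.items(), key=lambda item: item[1]))
# one row per text, one column per word in the vocabulary
print(toy_X.toarray())
```

The first row counts `good` twice, and both rows share the `movie` column.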

![Bag of words procedure](bow.png)

We can use the `TfidfVectorizer` object to get the weighted frequency of each word:

In [111]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=0.001, max_df=0.75, stop_words='english')

X = vectorizer.fit_transform(data.clean_text)
print(X.shape)

(1800, 69016)


We can now translate back and forth between columns and words:

In [115]:
vectorizer.vocabulary_['bad']

3819

In [116]:
vectorizer.get_feature_names()[4191]

'barely scratch'

In [151]:
import numpy as np
# [vectorizer.get_feature_names()[f] for f in (X.sum(axis=1)).A1.argsort()[:10]]
vectorizer.get_feature_names()[1244]

'adverse'

Let's see how many texts in the data contain the string `bad` (this also matches longer words such as `badly`):

In [69]:
len(data[data.input.str.contains('bad')])

749

# 4. Learning a classification model

A classification model is simply a function that takes a text representation as input, and returns an output label.

Inside that function is normally a set of weights. By multiplying the weight vector with the input vector, we get a score, which the model turns into a label.
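
As a rough illustration of that idea, the sketch below multiplies a hand-picked weight vector with an input vector, squashes the result with the logistic function, and thresholds it at 0.5. The weights, the bias and the three-word vocabulary are invented for this example; a fitted model learns them from data.

```python
import numpy as np

# hypothetical weights for a 3-word vocabulary: ['bad', 'good', 'movie']
weights = np.array([-2.0, 1.5, 0.0])
bias = 0.1

def predict_label(x):
    """Return (probability of the positive class, predicted label) for one input vector."""
    score = np.dot(weights, x) + bias          # weighted sum of the features
    prob_pos = 1.0 / (1.0 + np.exp(-score))    # logistic (sigmoid) function
    return prob_pos, int(prob_pos >= 0.5)

print(predict_label(np.array([0, 2, 1])))  # a text containing 'good' twice
print(predict_label(np.array([1, 0, 1])))  # a text containing 'bad' once
```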

## 4.1: Fitting a model

Fitting a model is the process of finding the right weights to map the training inputs to the training outputs. Fitting to data in `sklearn` is easy: we use the `fit()` function, giving it the input matrix and output vector.

In [152]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
%time classifier.fit(X, y)
print(classifier)

CPU times: user 181 ms, sys: 33.7 ms, total: 214 ms
Wall time: 70.4 ms
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)




The resulting fitted model has a coefficient (beta) for each word/feature in our vocabulary:

In [153]:
classifier.coef_.shape

(1, 69016)

We can now examine the weights/coefficients/betas for the individual words (note that each word has an ID):

In [155]:
k = vectorizer.vocabulary_['bad'] # column position for the word
print(vectorizer.get_feature_names()[k], classifier.coef_[0, k])

bad -3.461779370057597


NB: in a two-class problem, our coefficients are in a vector: positive values indicate the positive class, negative values the other class.
In a multi-class problem, we have one **row** of coefficients for each class: positive values indicate that this feature contributes to the class, negative values indicate that it contributes to other classes.
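
The difference in shape is easy to check on random toy data. This is only a sketch: the features below are random numbers, so the weights themselves are meaningless, and all variable names are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
toy_features = rng.rand(60, 4)            # 60 instances, 4 features (random toy data)

two_class_y = np.array([0, 1] * 30)       # labels for a two-class problem
three_class_y = np.array([0, 1, 2] * 20)  # labels for a three-class problem

binary_clf = LogisticRegression().fit(toy_features, two_class_y)
multi_clf = LogisticRegression().fit(toy_features, three_class_y)

print(binary_clf.coef_.shape)   # a single row of weights
print(multi_clf.coef_.shape)    # one row of weights per class
```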

# 5. Evaluating models

Having a model is great, but how well does it do? Can it classify what it has seen? We need a way to estimate how well the model will work on new data.

We need a metric to measure performance and a way to simulate new data.

## 5.1: Metrics

We use three measures:
1. precision
2. recall
3. F1

### Precision

Precision measures how many of our model's positive predictions were correct. We divide the number of true positives by the number of all predicted positives (true positives plus false positives):

$$
p = \frac{tp}{tp+fp}
$$

### Recall

Recall measures how many of the correct answers in the data our model managed to find. We divide the number of true positives by the sum of true positives (the instances our model got) and false negatives (the instances our model *should* have gotten):

$$
r = \frac{tp}{tp+fn}
$$

### F1

A model that classified everything as, say, "positive" would get a perfect recall (it does, after all, find all positive examples). However, such a model would obviously be useless, since its precision is bad.

We want to balance the two against each other. F1 does exactly that, by taking the harmonic mean.

$$
F_1 = \frac{2 \cdot p\cdot r}{p+r}
$$
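
To see the three measures at work, here is a small hand-rolled sketch that counts true positives, false positives and false negatives, and computes F1 as the harmonic mean $2pr/(p+r)$. The function name and the toy labels are made up; the "model" simply labels everything as positive, as in the example above.

```python
def precision_recall_f1(gold, predicted, positive=1):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [1, 0, 0, 0, 1, 0, 0, 0]          # toy gold standard: 2 positives, 6 negatives
everything_positive = [1] * len(gold)     # a "model" that always says positive

print(precision_recall_f1(gold, everything_positive))
```

Recall is perfect here, but precision is only 2 out of 8, and F1 is pulled down accordingly.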

Luckily, all of these metrics are implemented in `sklearn`. All we have to provide are the predictions of our model, and the actual correct answers (called the *gold standard*). 

In [73]:
from sklearn.metrics import classification_report
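
A minimal usage sketch of `classification_report`, with made-up gold labels and predictions (not the output of our classifier): the first argument is the gold standard, the second the predictions.

```python
from sklearn.metrics import classification_report

# toy gold labels and predictions (invented for illustration)
gold_labels = ['pos', 'pos', 'neg', 'neg', 'neg', 'pos']
predicted_labels = ['pos', 'neg', 'neg', 'neg', 'pos', 'pos']

# returns a formatted text table with precision, recall and F1 per class
report = classification_report(gold_labels, predicted_labels)
print(report)
```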

## 5.2: Cross-validation

How do we measure performance on new data, if we don't know what the correct outputs for those new data points are?

In **$k$-fold cross-validation**, we simulate new data by fitting our model on one part of the data and evaluating it on the rest. We can thereby measure the performance on the held-out part.

However, we have now reduced the amount of data we use to fit the model. In order to address this, we simply repeat the process $k$ times.
We separate the data into $k$ parts, fit the model on $k-1$ parts, and evaluate on the $k$th part. In the end, we have performance scores from $k$ models. The average of them tells us how well the model would work on new data.
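
Written out by hand, the procedure might look like the sketch below. It runs on synthetic data from `make_classification` as a stand-in for our feature matrix, and it uses `StratifiedKFold` with shuffling, which is one reasonable choice for splitting and not necessarily the exact split a helper function would produce.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# synthetic stand-in for the real feature matrix and labels
toy_X, toy_y = make_classification(n_samples=200, n_features=20, random_state=0)

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in folds.split(toy_X, toy_y):
    model = LogisticRegression()
    model.fit(toy_X[train_idx], toy_y[train_idx])          # fit on k-1 parts
    held_out_predictions = model.predict(toy_X[test_idx])  # predict the remaining part
    fold_scores.append(f1_score(toy_y[test_idx], held_out_predictions, average='micro'))

# one score per fold, and their average
print(len(fold_scores), np.mean(fold_scores))
```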

![3-fold cross-validation](3foldCV.png)

In [156]:
from sklearn.model_selection import cross_val_score

for k in [2,3,5,10]:
    cv = cross_val_score(LogisticRegression(), X, y=y, cv=k, n_jobs=-1, scoring="f1_micro")
    fold_size = X.shape[0]/k
    
    print("F1 with {} folds for bag-of-words is {}".format(k, cv.mean()))
    print("Training on {} instances/fold, testing on {}".format(fold_size*(k-1), fold_size))
    print()

F1 with 2 folds for bag-of-words is 0.8116880391210359
Training on 900.0 instances/fold, testing on 900.0

F1 with 3 folds for bag-of-words is 0.8161274444898149
Training on 1200.0 instances/fold, testing on 600.0

F1 with 5 folds for bag-of-words is 0.8244532236617053
Training on 1440.0 instances/fold, testing on 360.0

F1 with 10 folds for bag-of-words is 0.8327954052079797
Training on 1620.0 instances/fold, testing on 180.0



## Baselines
So, is that performance good? Let's compare to a **baseline**, i.e., a null-hypothesis. The simplest one is that all instances belong to the most frequent class in the data.

In [157]:
from sklearn.dummy import DummyClassifier

most_frequent = DummyClassifier(strategy='most_frequent')

# 10 folds, as in the last run above
print(cross_val_score(most_frequent, X, y=y, cv=10, n_jobs=-1, scoring="f1_micro").mean())

0.506111162553028


# Activity

See whether you can apply the previous steps to a new data set, a description of wines. Choose any of the descriptor columns as target variable. The text is already preprocessed, to save time.

In [102]:
wine = pd.read_csv('wine.csv', encoding='utf8')
wine

Unnamed: 0,description,country,province,variety,description_cleaned
0,This tremendous 100% varietal wine hails from ...,US,California,Cabernet Sauvignon,tremendous varietal wine hail be age year oak ...
1,Mac Watson honors the memory of a wine once ma...,US,California,Sauvignon Blanc,honor memory wine once make his mother tremend...
2,"This spent 20 months in 30% new French oak, an...",US,Oregon,Pinot Noir,spend month new french oak incorporate fruit v...
3,This re-named vineyard was formerly bottled as...,US,Oregon,Pinot Noir,re name vineyard be formerly bottle will find ...
4,The producer sources from two blocks of the vi...,US,California,Pinot Noir,producer source block vineyard wine high eleva...
5,"From 18-year-old vines, this supple well-balan...",US,Oregon,Pinot Noir,old vine supple well balance effort blend flav...
6,A standout even in this terrific lineup of 201...,US,Oregon,Pinot Noir,standout even terrific lineup release open bur...
7,"With its sophisticated mix of mineral, acid an...",US,Oregon,Pinot Noir,its sophisticated mix mineral acid tart fruit ...
8,"First made in 2006, this succulent luscious Ch...",US,Oregon,Chardonnay,first make succulent luscious be all mineralit...
9,"This blockbuster, powerhouse of a wine suggest...",US,California,Cabernet Sauvignon,blockbuster powerhouse wine suggest blueberry ...


In [99]:
# your code here

# 6. Classifying new data

Classifying new (**held-out**) data is called **prediction**. We reuse the weights we have learned before on a new data matrix to predict the new outcomes.
Important: the new data needs to have the same number of features!
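
The sketch below shows why this matters, using three made-up training texts and one made-up new text. Calling `transform()` on the fitted vectorizer keeps the training columns, while fitting a fresh vectorizer on the new text produces a different feature space that the classifier could not use. All names are for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["good movie", "bad movie", "great plot"]   # toy training texts
new_texts = ["good plot twist"]                           # 'twist' was never seen in training

toy_vectorizer = TfidfVectorizer()
train_X = toy_vectorizer.fit_transform(train_texts)   # learns the vocabulary
new_X_ok = toy_vectorizer.transform(new_texts)        # reuses it: unseen words are dropped

# fitting a second vectorizer on the new texts yields a different feature space
new_X_wrong = TfidfVectorizer().fit_transform(new_texts)

print(train_X.shape, new_X_ok.shape, new_X_wrong.shape)
```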

In [76]:
# read in new data set
new_data = pd.read_csv('sa_test.csv')
print(len(new_data))
new_data.head()

200


Unnamed: 0,input,output
0,robert redford ' s a river runs through it is ...,pos
1,if the 70 ' s nostalgia didn ' t make you feel...,neg
2,you think that these people only exist in the ...,neg
3,""" knock off "" is exactly that : a cheap knock ...",neg
4,brian depalma needs a hit * really * badly . s...,pos


Don't forget to clean it!

In [159]:
%time new_data['clean_text'] = new_data.input.apply(clean_text)

CPU times: user 15.7 s, sys: 1.91 s, total: 17.6 s
Wall time: 4.46 s


Let's see how well we do on this data:

In [78]:
# transform text into word counts
# IMPORTANT: use same vectorizer we fit on training data to create vectors!
new_X = vectorizer.transform(new_data.clean_text)

# use the old classifier to predict and evaluate
new_predictions = classifier.predict(new_X)
print(new_predictions)

[1 0 0 0 1 0 0 0 0 0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 1
 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 1 1 1 0 1
 1 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 1 0 1 1 1 0 0 0 0 1 0 0 0 0
 0 0 1 1 0 1 1 0 1 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 1 1 0 0
 1 1 1 0 1 1 0 1 0 1 1 0 0 1 1 0 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 1 0 1 1 0 0
 0 0 0 1 1 1 1 0 1 0 1 0 0 0 1]


Instead of hard labels, we can also predict the probabilities of belonging to each class:

In [79]:
new_probabilities = classifier.predict_proba(new_X)
print(new_probabilities)

[[0.31108778 0.68891222]
 [0.55471719 0.44528281]
 [0.53620949 0.46379051]
 [0.7502412  0.2497588 ]
 [0.35878419 0.64121581]
 [0.6253547  0.3746453 ]
 [0.53285443 0.46714557]
 [0.53272824 0.46727176]
 [0.639782   0.360218  ]
 [0.50654981 0.49345019]
 [0.39081759 0.60918241]
 [0.27007928 0.72992072]
 [0.36192997 0.63807003]
 [0.62977267 0.37022733]
 [0.366374   0.633626  ]
 [0.34421595 0.65578405]
 [0.30788486 0.69211514]
 [0.35909185 0.64090815]
 [0.48506903 0.51493097]
 [0.46259152 0.53740848]
 [0.72499626 0.27500374]
 [0.47746115 0.52253885]
 [0.35432731 0.64567269]
 [0.39762716 0.60237284]
 [0.50426902 0.49573098]
 [0.56830208 0.43169792]
 [0.68846081 0.31153919]
 [0.67481961 0.32518039]
 [0.45491445 0.54508555]
 [0.53817586 0.46182414]
 [0.30234197 0.69765803]
 [0.63989832 0.36010168]
 [0.70779382 0.29220618]
 [0.80626669 0.19373331]
 [0.64977241 0.35022759]
 [0.72614584 0.27385416]
 [0.3072466  0.6927534 ]
 [0.5791168  0.4208832 ]
 [0.59156876 0.40843124]
 [0.63534134 0.36465866]


For each instance (=row), we get a probability distribution over the classes (=columns).
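
Two properties of this output can be checked on toy data: every row sums to 1, and picking the most probable column gives the same label as `predict()`. The sketch uses synthetic data from `make_classification`, not our reviews.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

toy_X, toy_y = make_classification(n_samples=100, n_features=10, random_state=0)
toy_clf = LogisticRegression().fit(toy_X, toy_y)

toy_probabilities = toy_clf.predict_proba(toy_X[:5])

# each row is a distribution over the classes, so it sums to 1
print(toy_probabilities.sum(axis=1))
# the most probable class per row ...
most_probable = toy_clf.classes_[toy_probabilities.argmax(axis=1)]
print(most_probable)
# ... matches the hard predictions
print(toy_clf.predict(toy_X[:5]))
```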

## 6.1 Regularization

Typically, performance is lower on unseen data, because our model **overfit** the training data: it expects the new data to look *exactly* the same as the training data. That is almost never true.

In order to prevent the model from overfitting, we need to **regularize** it. Essentially, we make it harder to learn the training data.

A simple example of regularization is to "corrupt" the training data by adding a little bit of noise to each training instance. Since the noise is irregular, it becomes harder for the model to learn any patterns.

In [80]:
from scipy.sparse import random

num_instances, num_features = X.shape

for i in range(5):
    X_regularized = X + random(num_instances, num_features, density=0.01)

    print(cross_val_score(LogisticRegression(), X_regularized, y=y, cv=k, n_jobs=-1, scoring="f1_micro").mean())

KeyboardInterrupt: 

If you run the previous cell several times, you see different results (it gets even more varied if you change `density`). This variation arises because we add **random** noise. Not good...

Instead, it makes sense to force the model to spread the weights more evenly over all features, rather than bet on a few features, which might not be present in future data.

We can do this by training the model with the `C` parameter, which is the inverse of the regularization strength. The default is `1`. Lower values mean stricter regularization.
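
One way to see the effect of `C` is to look at the overall size of the learned weight vector: with stricter regularization (smaller `C`), the weights are pushed towards zero. The sketch below uses synthetic data and raises `max_iter` (an assumption, to help the weakly regularized model converge).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

toy_X, toy_y = make_classification(n_samples=200, n_features=20, random_state=0)

weight_norms = {}
for c in [100, 1.0, 0.01]:
    toy_clf = LogisticRegression(C=c, max_iter=1000).fit(toy_X, toy_y)
    weight_norms[c] = np.linalg.norm(toy_clf.coef_)   # overall size of the weight vector
    print(c, weight_norms[c])
```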

In [168]:
from sklearn.metrics import f1_score
best_c = None
best_f1_score = 0.0
for c in [50, 20, 10, 1.0, 0.5, 0.1, 0.05, 0.01]:
    clf = LogisticRegression(C=c)
    cv_reg = cross_val_score(clf, X, y=y, cv=5, n_jobs=-1, scoring="f1_micro").mean()

    print("5-CV on train at C={}: {}".format(c, cv_reg.mean()))
    print()

    if cv_reg > best_f1_score:
        best_f1_score = cv_reg
        best_c = c
        
print("best C parameter: {}".format(best_c))

5-CV on train at C=50: 0.8483421725647746

5-CV on train at C=20: 0.8500119428219183

5-CV on train at C=10: 0.8494594822833854

5-CV on train at C=1.0: 0.8244532236617053

5-CV on train at C=0.5: 0.8172356182446538

5-CV on train at C=0.1: 0.7855657622529667

5-CV on train at C=0.05: 0.7355763890496412

5-CV on train at C=0.01: 0.5083364497839918

best C parameter: 20


# Better features = better performance


We now have **a lot** of features! More than we have actual examples...

Not all of them will be helpful, though. Let's select the top 1500 based on how well they predict the outcome of the training data.

We use two tools from `sklearn.feature_selection`: `SelectKBest` (the selection algorithm) and `chi2` (the selection criterion).

In [178]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

selector = SelectKBest(chi2, k=1500).fit(X, y)
X_sel = selector.transform(X)
print(X_sel.shape)

(1800, 1500)


Let's see how well this new representation performs, by looking at the 5-fold cross-validation. We keep the best regularization value from before.

In [179]:
clf = LogisticRegression(C=best_c)

cv_reg = cross_val_score(clf, X_sel, y=y, cv=5, n_jobs=-1, scoring="f1_micro")
print("5-CV on train: {}".format(cv_reg.mean()))

5-CV on train: 0.8972358154341039


Not too bad! We have handily beaten our previous best! Let's fit a classifier on the whole data now.

In [170]:
clf.fit(X_sel, y)



LogisticRegression(C=20, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
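
One caveat about the cross-validation score above: the selector was fitted on all training texts before cross-validating, so the held-out part of each fold already influenced which features were kept, and the score may be somewhat optimistic. A possible way around this is to wrap selection and classifier in a `Pipeline`, so the selector is re-fitted inside every fold. The sketch below uses random non-negative toy features (since `chi2` needs non-negative values); the step names and `k=10` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# synthetic non-negative features; only the first column carries signal
rng = np.random.RandomState(0)
toy_X = rng.rand(200, 50)
toy_y = (toy_X[:, 0] > 0.5).astype(int)

selection_pipeline = Pipeline([
    ('select', SelectKBest(chi2, k=10)),       # re-fitted on the training part of every fold
    ('classify', LogisticRegression(C=20)),
])

pipeline_scores = cross_val_score(selection_pipeline, toy_X, toy_y, cv=5, scoring='f1_micro')
print(pipeline_scores.mean())
```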

Now, let's apply it to the held-out data set. 
We need to 
* vectorize the data with our vectorizer from before (otherwise, we get different features)
* select the top features (using our previously fitted selector)

In [85]:
# select features for new data
new_X_sel = selector.transform(new_X)
print(new_X_sel.shape)

(200, 1500)


Finally, we can use our new classifier to predict the new data labels, and compare them to the truth.

In [171]:
new_predictions_regularized = clf.predict(new_X_sel)
prediction_df = pd.DataFrame(data={'input': new_data['input'], 'prediction': labels2numbers.inverse_transform(new_predictions_regularized), 'truth':new_data['output']})
prediction_df

Unnamed: 0,input,prediction,truth
0,robert redford ' s a river runs through it is ...,pos,pos
1,if the 70 ' s nostalgia didn ' t make you feel...,neg,neg
2,you think that these people only exist in the ...,neg,neg
3,""" knock off "" is exactly that : a cheap knock ...",neg,neg
4,brian depalma needs a hit * really * badly . s...,pos,pos
5,attention moviegoers : you are about to enter ...,neg,neg
6,it used to be that not just anyone could becom...,neg,neg
7,expand the final fifteen minutes of home alone...,pos,pos
8,"capsule : godawful "" comedy "" that ' s amazing...",neg,neg
9,drew barrymore is beginning to corner the mark...,pos,neg


In [172]:
new_data['input'][9]

'drew barrymore is beginning to corner the market on playing the girl outside - the one who \' s the awkward klutz or the spunky do - it - yourselfer ; the one who just doesn \' t fit in with the others . she has perfected these characters in movies such as " the wedding singer " and , most notably , " ever after . " now she \' s back , starring in what could be called a modern - day cinderella fable - " never been kissed . " you know it \' s a fable because she plays a copy editor at a newspaper who has her own office as well as a secretary . trust me on this one , no copy editor has seen the inside of a private office since gutenberg ( and i don \' t mean steve ) invented the printing press . the premise is simple . barrymore \' s josie geller , at 25 the youngest copy editor ever to be hired by the chicago sun - times , is assigned to go undercover and return to high school to do an expose on what today \' s teens are feeling and doing . josie ( she says she was named after the \' 7

## Getting insights

In order to explore which features are most indicative, we pair each selected feature name with its coefficient:

In [173]:
features = vectorizer.get_feature_names() # get the names of the features
selected = selector.get_support(indices=True) # column indices kept by the selector, in ascending order
best_indicator_terms = [features[i] for i in selected] # names of the selected features, aligned with clf.coef_

top_indicator_scores = pd.DataFrame(data={'feature': best_indicator_terms, 'coefficient': clf.coef_[0]})
top_indicator_scores.sort_values('coefficient')

Unnamed: 0,feature,coefficient
81,bad,-12.740770
1449,waste,-8.606913
1307,suppose,-8.403378
70,attempt,-8.137514
145,boring,-7.687704
1007,plot,-7.204338
1400,unfortunately,-6.920251
1020,poor,-6.274391
849,mess,-5.804681
1120,ridiculous,-5.763368


# Checklist: how to classify my data

1. label at ***least 2000*** tweets in your data set as `positive`, `negative`, or `neutral`
2. preprocess the text of *all* tweets in your data (labeled and unlabeled)
3. read in the labeled tweets and their labels
4. transform the labels into numbers
5. use `TfidfVectorizer` to extract the features and transform them into feature vectors
6. select the top $N$ features (where $N$ is smaller than the number of labeled tweets)
7. create a classifier
8. use 5-fold CV to find the best regularization parameter, top $N$ feature selection, and maybe feature generation and preprocessing steps

Once you are satisfied with the results:
9. read in the rest of the (unlabeled) tweets
10. use the `TfidfVectorizer` from 5. to transform the new data into vectors
11. use the `SelectKBest` selector from 6. to get the top $N$ features
12. use the classifier from 7. to predict the labels for the new data
13. save the predicted labels or probabilities to your database or an Excel file
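
The checklist can be sketched end to end on a handful of invented, already preprocessed tweets. Everything here is a toy assumption: the texts, the two labels, `k=4` and `C=20`. The cross-validation search from step 8 is left out because six examples are too few for it.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# hypothetical, already preprocessed tweets; real data would come from your own files
labeled = pd.DataFrame({
    'clean_text': ['love great phone', 'hate bad service', 'great service love',
                   'bad phone hate', 'love service', 'hate phone'],
    'label': ['positive', 'negative', 'positive', 'negative', 'positive', 'negative'],
})
unlabeled = pd.DataFrame({'clean_text': ['love phone', 'bad service']})

# steps 4-7: labels to numbers, text to vectors, feature selection, classifier
encoder = LabelEncoder()
y_train = encoder.fit_transform(labeled['label'])
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(labeled['clean_text'])
top_n = SelectKBest(chi2, k=4).fit(X_train, y_train)
model = LogisticRegression(C=20).fit(top_n.transform(X_train), y_train)

# steps 10-12: reuse the fitted vectorizer and selector, then predict
X_unlabeled = top_n.transform(tfidf.transform(unlabeled['clean_text']))
unlabeled['prediction'] = encoder.inverse_transform(model.predict(X_unlabeled))

# step 13: the data frame can now be saved, e.g. with to_csv() or to_excel()
print(unlabeled)
```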
