## Classifying text

In [5]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV as gs
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score, accuracy_score
%matplotlib inline

We now turn to applying machine learning classification methods to text. No
new principles are involved: everything works the same way as it did when we
learned to classify irises.

1.  We need to find labeled data; each of the exemplars in the data should be represented with a fixed set of features.
2.  We need to split our data into training and test data.
3.  We need to train a learner on the training data and evaluate it (test it) on the test data.

The problem is that text data is not in a form that is compatible with
what we have learned about classifiers.  The text must be put in a suitable
form before a linear model can be trained on it.

**Training**

1.  Labeled data must be loaded (into Python).  It should be a sequence of documents T accompanied by a sequence of labels L.
2.  Split T and L into training and test groups, yielding T1 and T2 as well as L1 and L2.
3.  Train a **feature model** on the training data T1 (or, in scikit learn terminology, **fit** the model **to** the training data).  The feature model inputs the text sequence and outputs a **term-document** matrix suitable for training a linear classifier.  The feature model is called a **vectorizer** (because it turns a document into a vector, a column of numbers).
4.  Using the trained vectorizer, transform T1 into a term-document matrix M1.
5.  Train a linear model $\mu$ on M1 and L1.
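
As a minimal sketch of what fitting and applying a vectorizer looks like, the cell below runs `TfidfVectorizer` on a three-document toy corpus. The corpus `toy_T1` is made up for illustration and is not part of the insult data. Note that scikit-learn lays the matrix out with one row per document and one column per vocabulary term, and that with default settings it drops one-character tokens such as "a".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the training texts T1
toy_T1 = ["you are a moron",
          "thanks for the helpful answer",
          "you are helpful"]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights, then transform, in one call
toy_M1 = vectorizer.fit_transform(toy_T1)

# One row per document, one column per vocabulary term
print(toy_M1.shape)
# The learned vocabulary (the dict maps each term to its column index)
print(sorted(vectorizer.vocabulary_))
```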

**Evaluation**

1.  Transform the test data T2 into a term-document matrix M2 using the vectorizer fit during step 3 of training; in particular, this means that if there are words in the T2 data that were never seen during training, they are ignored in building M2.
2.  Use $\mu$  to classify the texts represented in M2; that is produce a set of predicted labels P2.
3.  Compare the actual labels L2 with the predicted labels P2 using standard evaluation metrics such as precision, accuracy, and recall.
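
The point about unseen words can be checked on toy data. The sketch below uses two made-up document lists (`toy_T1` and `toy_T2` are hypothetical, not the insult data): the vectorizer is fit on the first and only applied to the second, so the test matrix has exactly the same columns as the training matrix, and a test document made up entirely of unseen words gets an all-zero row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy training and test texts
toy_T1 = ["you are a moron", "thanks for the helpful answer"]
toy_T2 = ["you are a clown",    # "clown" was never seen in training
          "zebra xylophone"]    # no word here was seen in training

vectorizer = TfidfVectorizer()
toy_M1 = vectorizer.fit_transform(toy_T1)   # fit on training data only
toy_M2 = vectorizer.transform(toy_T2)       # transform only, no refitting

# Same number of columns: the vocabulary is fixed at training time
print(toy_M1.shape[1], toy_M2.shape[1])
# Number of nonzero entries in each test row
print(toy_M2.getnnz(axis=1))
```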


## Review the steps with insult detection

We looked at the insult detection data in  the text classification notebook.

### Training step 1: Loading the data

Let's load the CSV file.

In [3]:
import os.path
site = 'https://raw.githubusercontent.com/gawron/python-for-social-science/master/'\
'text_classification/'
#site = 'https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/_static/'
df = pd.read_csv(os.path.join(site,"troll.csv"))

Each row is a comment  taken from a blog or online forum. There are three columns: whether the comment is insulting (1) or not (0), the date, and the comment.

In [49]:
df.tail()

Unnamed: 0,Insult,Date,Comment
3942,1,20120502172717Z,"""you are both morons and that is never happening"""
3943,0,20120528164814Z,"""Many toolbars include spell check, like Yahoo..."
3944,0,20120620142813Z,"""@LambeauOrWrigley\xa0\xa0@K.Moss\xa0\nSioux F..."
3945,0,20120528205648Z,"""How about Felix? He is sure turning into one ..."
3946,0,20120515200734Z,"""You're all upset, defending this hipster band..."


Now we define the text sequences $\mathbf{T}$ and the label sequence  $\mathbf{L}$.

In [13]:
T = df['Comment']

In [14]:
L = df['Insult']

### Training step 2: Splitting the data and labels into training and test groups

In [53]:
T1, T2, L1, L2 = train_test_split(T,L)

### Training steps 3 and 4: Fitting the feature model (vectorizer) to the training data and transforming it

In [54]:
tf = text.TfidfVectorizer()
# Scikit learn has one function that does both fitting and transforming.
# M1 is the transformed data
# tf is the trained feature model (which will be used to transform the test data)
M1 = tf.fit_transform(T1)

### Training step 5: Training the classifier

Now we train a classifier as usual, using the training matrix M1 and the training labels L1 produced in the previous steps.

We use a **Bernoulli Naive Bayes classifier**.

In [55]:
# Create classifier
bnb = nb.BernoulliNB()

# Fit (train) the classifier  using the training data and labels
bnb.fit(M1, L1);

### Evaluation

Evaluate the classifier, first using accuracy (what `.score()` returns).

In [56]:
# Vectorize the test data using the vectorizer trained on T1.
# Notice we DON'T call .fit_transform() because that would retrain the vectorizer on the test data.
# We call .transform() using the trained model to transform the new data.
# Words not seen during training will be ignored.
M2 = tf.transform(T2)
# Classify the data using the trained classifier and report the accuracy
bnb.score(M2, L2)

0.7588652482269503

Now try re-executing steps 2 through 5.  (Just re-execute the cells)  The results should be the same, right?

Well, are they?  

What happens: each train/test split produces a different set of test data.  Sometimes the test is harder.
Sometimes it's easier.  Or, looking at it another way: sometimes the training data is a better preparation for the test than at other times.
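
The randomness is easy to see on toy data. In the sketch below, `toy_docs` and `toy_labels` are made-up stand-ins for T and L. Passing the same `random_state` to `train_test_split` reproduces the same split, while leaving it out (as in the cells above) gives a fresh shuffle on every call.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the documents T and labels L
toy_docs = [f"doc {i}" for i in range(20)]
toy_labels = [i % 2 for i in range(20)]

# Same random_state: the same split both times
_, test_a, _, _ = train_test_split(toy_docs, toy_labels, random_state=0)
_, test_b, _, _ = train_test_split(toy_docs, toy_labels, random_state=0)
# A different random_state shuffles differently
_, test_c, _, _ = train_test_split(toy_docs, toy_labels, random_state=1)

print(test_a == test_b)   # identical test sets
print(len(test_a))        # by default 25% of the data is held out
print(test_a, test_c)     # compare the two test sets by eye
```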

To get a realistic view of how our classifier is doing we take the average performance on a  number of 
train/test splits.  This is called **cross validation**.  We return to that point below.
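
One possible way to sketch this with scikit-learn's own tools is shown here, on a made-up toy corpus (`toy_docs` and `toy_labels` are hypothetical). Wrapping the vectorizer and classifier in a pipeline means the vectorizer is refit on the training portion of every split. `ShuffleSplit` with 10 random 80/20 splits is an assumption chosen to mirror repeated calls to `train_test_split`; other splitters would work too.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = insult, 0 = not an insult
toy_docs = ["you are a moron", "what an idiot", "you fool",
            "such a stupid idiot", "thanks for the answer", "great point",
            "nice helpful post", "thanks, great answer"] * 5
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# Vectorizer + classifier as one estimator, refit on each training portion
model = make_pipeline(TfidfVectorizer(), BernoulliNB())
# 10 random train/test splits, each holding out 20% of the data
splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

metrics = ["accuracy", "precision", "recall"]
results = cross_validate(model, toy_docs, toy_labels,
                         cv=splitter, scoring=metrics)
for metric in metrics:
    print(f"{metric}: {results['test_' + metric].mean():.2f}")
```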

#### Using all three evaluation metrics

First let's get more evaluation numbers, in particular precision and recall.  We do
that by calling a method that returns the predicted labels P2, so we can compare
L2 and P2 using different evaluation metrics.

In [58]:
P2 = bnb.predict(M2)
# scikit-learn metrics take the true labels first, then the predictions
scores = np.array([accuracy_score(L2, P2),
                   precision_score(L2, P2),
                   recall_score(L2, P2)])
print(f'Accuracy: {scores[0]:.2f} Precision: {scores[1]:.2f} Recall: {scores[2]:.2f}')

Accuracy: 0.76 Precision: 0.95 Recall: 0.14


We see that the accuracy is a bit misleading.  There is a serious recall problem.

What does that mean in the setting of insult detection?  It means the BNB classifier is too
reluctant to call something an insult.  When it flags something as an insult, it
is right 95% of the time, but it catches only 14% of the actual insults.

Why would that be?  Think about how the model is trained and what its weakness might be.
This is what it means to try to interpret or discuss a model's performance.  Zoom
in on the model's weakness. Talk about where that weakness comes from.
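
When interpreting these numbers it helps to compute them by hand once. The sketch below uses a hypothetical helper (`precision_recall_accuracy`) and made-up labels to do that. Like scikit-learn's `precision_score` and `recall_score`, it takes the true labels first and the predicted labels second. The order matters: swapping the two arguments exchanges precision and recall, while accuracy stays the same.

```python
def precision_recall_accuracy(y_true, y_pred, pos_label=1):
    """Return (precision, recall, accuracy) for paired true/predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)  # true positives
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)  # false positives
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, correct / len(pairs)

# Made-up labels: 4 real insults, but the classifier flags only one of them
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(precision_recall_accuracy(toy_true, toy_pred))  # (1.0, 0.25, 0.7)
# Arguments in the wrong order: precision and recall trade places
print(precision_recall_accuracy(toy_pred, toy_true))  # (0.25, 1.0, 0.7)
```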

#### Basic train and test loop

Here is how to get the average over a number of runs.

In [62]:
def split_fit_and_eval(T, L, test_size=.2):
    # This code just collects together training steps 2-5 plus the evaluation.
    # That is, it does one train, test, and eval run.
    (T1, T2, L1, L2) = train_test_split(T, L, test_size=test_size)
    tf = text.TfidfVectorizer()
    M1 = tf.fit_transform(T1)
    bnb = nb.BernoulliNB()
    bnb.fit(M1, L1)
    # Transform (but do not refit) on the test data, then predict
    M2 = tf.transform(T2)
    P2 = bnb.predict(M2)
    # scikit-learn metrics take the true labels first, then the predictions
    return np.array([accuracy_score(L2, P2),
                     precision_score(L2, P2),
                     recall_score(L2, P2)])

# Split, train, test and eval 10 times
num_runs = 10
# an accumulator for acc., pre., and rec.
scores = np.zeros((3,))
for test_run in range(num_runs):
    scores += split_fit_and_eval(T, L)
# Compute the average of the num_runs runs for all metrics
normed_stats = scores/num_runs

print(f'Accuracy: {normed_stats[0]:.2f} Precision: {normed_stats[1]:.2f} Recall: {normed_stats[2]:.2f}')

Accuracy: 0.77 Precision: 0.86 Recall: 0.16


## Homework

Read the online book draft chapter about text classification, especially the part
about movie review data.  Note that you will be using a different classifier implementation (`scikit_learn`) than the one used in the book
(`nltk`).  Therefore, when it comes to writing code for training the classifier, focus on the code examples in this notebook, which use `scikit_learn`.

Try using two classifiers on the movie review data: the one used in the textbook, an SVM, and
the Bernoulli Naive Bayes model used above. Be sure
to stick with scikit learn (it has an SVM implementation).
Some points of emphasis:

1.  Be sure to get the average of at least 10 runs for **both** classifiers.
2.  Be sure to get average accuracy, precision, and recall for both classifiers on those multiple runs. You will probably find `split_fit_and_eval` defined above useful, but you may need to modify it.
3.  For your first discussion post, turn in the new code you wrote, including the code that labels and shuffles the data (discussed further below).  If you have to do a new import, show that. If you have to rewrite `split_fit_and_eval`, turn in the new version.  Also show the output, which should be a single line giving the accuracy, precision, and recall.
4.  Discuss which classifier does better.  Discuss which metric the best classifier does the worst at and speculate as to why (this will require reviewing the definitions of precision and recall and thinking about what they mean in a movie review setting).

#### Help with getting the movie reviews data.

Execute the next two cells to get the movie review data.

In [2]:
import nltk
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/gawron/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

In [2]:
from nltk.corpus import movie_reviews as mr

def get_file_strings(corpus, file_ids):
    return [corpus.raw(file_id) for file_id in file_ids]

data = dict(pos = mr.fileids('pos'),
            neg = mr.fileids('neg'))

pos_file_ids = data['pos']
neg_file_ids = data['neg']

# Get all the positive and negative reviews.
pos_file_reviews = get_file_strings(mr, pos_file_ids)
neg_file_reviews = get_file_strings(mr, neg_file_ids)

Each review is a string.  In principle, a list of strings like `pos_file_reviews`  can be passed to `text.TfidfVectorizer()` via the `fit_transform` method to train a vectorizer for machine learning.
You could code that up.

What you'd really like to do is use `split_fit_and_eval`, defined above, which does a lot of the work for you.

But hold on. You have a coding problem. You don't have  a sequence of documents and labels.  Instead you have
one sequence of positive documents  and another sequence of negative documents.  

So you will need to turn those two sequences into a sequence of documents and a sequence of labels
because that's what `split_fit_and_eval` wants.  You also want the doc sequence
to contain a random mixture of positive and negative documents, because some machine
learning algorithms are sensitive to the order in which training data is presented to
them.

The next cell shows one approach: it pairs each review with its label, shuffles the
pairs, and then separates the reviews from the labels again.

In [3]:
# Let's work on documents
# There are 2 classes, pos and neg

from random import shuffle


# Now get pairs of reviews and labels
pos_pairs = [(rev,'pos') for rev in pos_file_reviews]
neg_pairs = [(rev,'neg') for rev in neg_file_reviews]



###########  Shuffling  ###########################
# Way too orderly, the classes aren't mixed yet.
data = pos_pairs + neg_pairs
shuffle(data)
###################  Now they're shuffled! ###############

# Separate the reviews from their labels
revs, lbls = zip(*data)
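
The same pair, shuffle, and unzip pattern can be tried out on something small before running it on 2000 reviews. The sketch below uses two sets of English letters in place of two sets of documents; the names `toy_pos`, `toy_neg`, and the seeded `random.Random(0)` are assumptions made for illustration so the shuffle is repeatable.

```python
import random
from string import ascii_lowercase

# Hypothetical stand-ins for the two document lists
toy_pos = list(ascii_lowercase[:13])   # 'a' .. 'm' play the positive docs
toy_neg = list(ascii_lowercase[13:])   # 'n' .. 'z' play the negative docs

# Pair every item with its class label, then mix the two classes
toy_pairs = ([(item, 'pos') for item in toy_pos] +
             [(item, 'neg') for item in toy_neg])
random.Random(0).shuffle(toy_pairs)    # seeded so the result is repeatable

# Separate the items from their labels; the two tuples stay aligned
toy_items, toy_lbls = zip(*toy_pairs)
print(toy_items[:5])
print(toy_lbls[:5])
```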

In [7]:
def split_vectorize_and_fit(docs,labels,clf,pos_label='pos'):
    T_train,T_test, y_train,y_test = train_test_split(docs,labels)
    tf = text.TfidfVectorizer()
    X_train = tf.fit_transform(T_train)
    clf_inst = clf()
    clf_inst.fit(X_train, y_train)
    X_test = tf.transform(T_test)
    predictions = clf_inst.predict(X_test)
    return precision_score(y_test,predictions,pos_label=pos_label), \
           recall_score(y_test, predictions,pos_label=pos_label), \
           accuracy_score(y_test,predictions)

Partial solution: running just one classifier:

Bernoulli Naive Bayes

In [8]:
num_test_runs = 10
scores = np.zeros((3,))
for test_run in range(num_test_runs):
    clf = nb.BernoulliNB
    scores += split_vectorize_and_fit(revs,lbls,clf)

p_score,r_score,a_score = scores/num_test_runs
print(f"{'BernoulliNB':<50} Precision {p_score:.3f} Recall {r_score:.3f}  Accuracy: {a_score:.3f}")

BernoulliNB                                        Precision 0.877 Recall 0.676  Accuracy: 0.792


In [9]:
from sklearn import linear_model
from sklearn.svm import LinearSVC

clfs = {"Logistic Regression": linear_model.LogisticRegression,
        "Ridge Classifier":   linear_model.RidgeClassifier,
        "Multinomial NB" : nb.MultinomialNB,
        "Passive Aggressive Classifier": linear_model.PassiveAggressiveClassifier,
        "SVM": LinearSVC,
       }

The next cell tests multiple classifiers in one loop, with 10 train/test splits for each classifier,
and reports the average precision, recall, and accuracy scores over the 10 test runs for each of
the classifiers.

In [10]:
num_test_runs = 10
scores = np.zeros((3,))

print(f"Clf: {'':<50} Precision   Recall  Accuracy ")
for clf in clfs:
    for test_run in range(num_test_runs):
        scores += split_vectorize_and_fit(revs,lbls,clfs[clf])
    p_score,r_score,a_score = scores/num_test_runs
    print(f"Clf: {clf:<50} {p_score:.3f}        {r_score:.3f}   {a_score:.3f}")
    scores = np.zeros((3,))
    print()

Clf:                                                    Precision   Recall  Accuracy 
Clf: Logistic Regression                                0.815        0.836   0.820

Clf: Ridge Classifier                                   0.841        0.851   0.842

Clf: Multinomial NB                                     0.858        0.722   0.795

Clf: Passive Aggressive Classifier                      0.853        0.858   0.853

Clf: SVM                                                0.847        0.873   0.860



The Multinomial Naive Bayes classifier had the best precision score, but it also had the worst accuracy
and by far the worst recall.  This is an excellent example of the precision-recall trade-off. The classifier
achieves its high precision at the cost of a lot of false negatives, lowering its recall and accuracy.

The highest accuracy and the best recall were achieved by the SVM, with the
Passive Aggressive Classifier just a smidge behind in accuracy and second in recall. Except for the Multinomial Naive Bayes
classifier, the precision scores all lagged behind their recall scores, meaning most of the classifiers
traded in a little precision for better recall.