# Text Classification: Insults with Naive Bayes

In [38]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV as gs
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score, accuracy_score
%matplotlib inline

## Loading and preparing the data

Let's open the CSV file with `pandas`.

In [3]:
import os.path
site = 'https://raw.githubusercontent.com/gawron/python-for-social-science/master/'\
'text_classification/'
#site = 'https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/_static/'
df = pd.read_csv(os.path.join(site,"troll.csv"))

Each row is a comment taken from a blog or online forum. There are three columns: whether the comment is insulting (1) or not (0), the date, and the unicode-encoded contents of the comment.

In [4]:
df[['Insult', 'Comment']].tail()

Unnamed: 0,Insult,Comment
3942,1,"""you are both morons and that is never happening"""
3943,0,"""Many toolbars include spell check, like Yahoo..."
3944,0,"""@LambeauOrWrigley\xa0\xa0@K.Moss\xa0\nSioux F..."
3945,0,"""How about Felix? He is sure turning into one ..."
3946,0,"""You're all upset, defending this hipster band..."


Write a pandas command to give you just the insults.

In [39]:
# Solution replaces df on the RHS
insult_df = df[df['Insult'] ==1].copy()

In [6]:
insult_df[:25]

Unnamed: 0,Insult,Date,Comment
0,1,20120618192155Z,"""You fuck your dad."""
7,1,,"""shut the fuck up. you and the rest of your fa..."
8,1,20120502173553Z,"""Either you are fake or extremely stupid...may..."
9,1,20120620160512Z,"""That you are an idiot who understands neither..."
15,1,20120611090207Z,"""FOR SOME REASON U SOUND RETARDED. LOL. DAMN. ..."
16,1,20120320162532Z,"""You with the 'racist' screen name\n\nYou are ..."
18,1,20120320075347Z,"""your such a dickhead..."""
19,1,20120320203947Z,"""Your a retard go post your head up your #%&*"""
34,1,20120515132156Z,"""Allinit123, your\xa0hypocrisy\xa0is sickening..."
37,1,20120620161958Z,"""I can't believe the stupid people on this sit..."


There are documents of a **variety** of lengths, from various kinds of social media.  From pretty long...

In [10]:
df['Comment'][79]

'"Fact : Georgia passed a strict immigration policy and most of the Latino farm workers left the area. Vidalia Georgia now has over 3000 agriculture job openings and they have been able to fill about 250 of them in past year. All you White Real Americans who are looking for work that the Latinos stole from you..Where are you ? The jobs are i Vadalia just waiting for you..Or maybe its the fact that you would rather collect unemployment like the rest of the Tea Klaners.. You scream..you complain..and you sit at home in your wife beaters and drink beer..Typical Real White Tea Klan...."'

To very very short:

In [12]:
insult_df.loc[755]

Insult                   1
Date       20120620121441Z
Comment           "Retard"
Size                     8
Name: 755, dtype: object

A look at the range.  This is part of the challenge of this dataset.

In [11]:
# Length in characters of each insulting comment
insult_df['Size'] = insult_df['Comment'].apply(len)
insult_df['Size'].sort_values(ascending=False)

3208    4016
3931    1600
581     1548
1348    1269
3924    1022
        ... 
3109      11
2180      11
3919       8
45         8
755        8
Name: Size, Length: 1049, dtype: int64

## Analyzing insults with Naive Bayes: pandas and sklearn

We want to use one of the linear classifiers in `sklearn`,
but the learners in `sklearn` only work with numerical arrays. How to convert text into a matrix of numbers?
Obtaining the feature matrix from the text is not trivial. 

The classical solution is to first extract a **vocabulary**: a list of words used throughout the corpus. Then, we can count, for each document in the sample, the frequency of each word. We end up with a **sparse matrix**: a huge matrix containing mostly zeros. Here, `sklearn` and `pandas` make it possible to do this in two lines. 

The text processing component that goes from text to sparse matrix is called a **vectorizer**.
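To make the vocabulary-and-counts idea concrete, here is a minimal pure-Python sketch. The three-sentence corpus and the helper name `tokenize` are made up for illustration, and the tokenizer (lowercase, words of two or more characters) only approximates what the `sklearn` vectorizers do by default.

```python
import re
from collections import Counter

# A made-up toy corpus (not taken from troll.csv)
docs = [
    "you are so rude",
    "thanks you are so kind",
    "so rude so very rude",
]

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# Step 1: the vocabulary is the sorted set of all words in the corpus
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})

# Step 2: one row per document, one column per vocabulary word
count_matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)

# Even in a toy corpus, many entries are zero
n_zero = sum(row.count(0) for row in count_matrix)
print(f"{n_zero} of {len(docs) * len(vocabulary)} entries are zero")
```

With a realistic vocabulary of thousands of words, the proportion of zeros becomes overwhelming, which is why the real matrix is stored in sparse form.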

We will use a `TfidfVectorizer`, a version that has been very successful in various Natural Language Processing tasks.

In [None]:
print(text.TfidfVectorizer.__doc__)

Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to :class:`CountVectorizer` followed by
    :class:`TfidfTransformer`.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

    Parameters
    ----------
    input : {'filename', 'file', 'content'}, default='content'
        - If `'filename'`, the sequence passed as an argument to fit is
          expected to be a list of filenames that need reading to fetch
          the raw content to analyze.

        - If `'file'`, the sequence items must have a 'read' method (file-like
          object) that is called to fetch the bytes in memory.

        - If `'content'`, the input is expected to be a sequence of items that
          can be of type string or byte.

    encoding : str, default='utf-8'
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}, default='strict'
        Instruction on what to do if a b

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
        'This is the first document.',
         'This document is the second document.',
         'And this is the third one.',
         'Is this the first document?',
    ]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [41]:
corpus[0].split()

['This', 'is', 'the', 'first', 'document.']

In [37]:
print(X.shape)
X.toarray()

(4, 9)


array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])

####  Explaining TFIDF

Here are some statistics from the **British National Corpus**:

```
BNC

Corpus size   51,994,153
 Vocab size      511,928
   Num docs        1,726
```

And here are some interesting cases where word frequency
is close and doc frequency isn't:

```
word                                            word freq       doc freq
social                                             18,419          1,083
want                                               18,284          1,415

allow                                               5,285          1,232*
computer                                            5,262            715
treatment                                           5,250            906
gives                                               5,258          1,191*
easily                                              5,218          1,212*
```

What we're seeing is that certain words are **clumpier** than others,
in fact, clumpier than you'd expect given their frequency.  Once
they occur once in a document, they are much more likely to occur
again in that same document than you'd expect given their frequency.  Take, for example,
*computer*.  Once you see this word, it's likely
that the document it occurs in deals with
some technical or computer-related topic, and
so chances of seeing the word again are high.
On the other hand, take *gives*, whose overall
frequency is nearly the same as *computer*.  This word
doesn't tell you nearly as much about the topic of the document
we're looking at, and the chances of seeing it again in the same document
are neither higher nor lower than you'd expect for
a word of that frequency: *computer* is clumpy (its 5K occurrences are distributed
over relatively few documents); *gives* is not.
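One rough way to see clumpiness in the BNC numbers above is to divide a word's corpus frequency by its document frequency, which gives the average number of occurrences per document that contains the word. This ratio is only an illustrative proxy chosen for this sketch; it is not the statistic TFIDF itself uses.

```python
# (word frequency, document frequency) pairs copied from the BNC table above
bnc_stats = {
    "allow": (5285, 1232),
    "computer": (5262, 715),
    "treatment": (5250, 906),
    "gives": (5258, 1191),
    "easily": (5218, 1212),
}

# Average occurrences per document that contains the word
per_doc = {word: word_freq / doc_freq
           for word, (word_freq, doc_freq) in bnc_stats.items()}

# Print the clumpiest words first
for word, ratio in sorted(per_doc.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {ratio:5.2f}")
```

By this measure *computer* averages over seven occurrences per document it appears in, against a bit over four for *gives*.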

The TFIDF statistic takes into account not just the relative frequency of a word
in a document (the **Term Frequency**). It also takes into account its clumpiness.  
Clumpiness is measured by **Inverse Document Frequency**.

The term frequency of a term in a document is just its **relative frequency** (frequency
divided by document size):


$$
(1) \; \text{tf}(t,d) = \frac{f_{t,d}}{\mid d \mid}
$$


The inverse document frequency of a term $t$ in a set of documents D
is the inverse of its relative document frequency in D, that is, of the fraction of documents in D that contain $t$:

$$
(2) \; \text{idf}(t, D) = \frac{\mid D \mid}{\mid\lbrace d \mid d \in \text{D} \text{ and } t \in d  \rbrace\mid}
$$

$$
\begin{array}[t]{ll}
\text{D}   & \text{the set of  documents in the training data}\\
\mid\text{D}\mid   & \text{ the number of docs in D}\\
t          & \text{the term or word}\\
\mid\lbrace d \mid d \in \text{D} \text{ and } t \in d  \rbrace\mid &
\text{the number of documents } t \text{ occurs in }\\
\end{array}
$$

An important refinement is 

$$
(3) \; \text{log-idf}(t,\text{D})  = \log (\text{idf}(t, \text{D}))
$$

The expression $\log \text{idf}(t, D)$ is $- \log \text{prob}_{D}(t)$,
which in information theory is the amount of the information
gained by knowing $t$ occurs in a document in the corpus.  So TFIDF
weights the term frequency by the information value of the term.


A very popular version of TFIDF is the product of 
the log inverse document frequency and the term frequency.

$$
(4)\; \text{TFIDF}(t,d)  = \text{tf}(t,d)  \cdot \text{log-idf}(t, D)
$$

Just weight the term frequency of $t$ in $d$ by the information value of $t$.

Another popular version of TFIDF is the product of 
the log inverse document frequency and the term count.

$$
(5)\; \text{TFIDF}(t,d)  = f_{t,d} \cdot \text{log-idf}(t, D)
$$

The raw term frequency is often used rather than the relative frequency
because the document vectors are going to be normalized to
unit length, so the document size will  be taken into
account, but in a slightly different way.


Equation (5) 
is essentially what scikit learn uses, although there are some technical details 
discussed [here](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
being left out.
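As a sketch of those technical details: assuming the default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` uses the raw count from equation (5), a smoothed idf of the form $\log\frac{1 + \mid D \mid}{1 + \text{df}(t)} + 1$, and then scales each document vector to unit length. The pure-Python version below, whose helper names are invented for illustration, applies that recipe to the four-sentence toy corpus used earlier and should reproduce the first row of the matrix shown there.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r'\b\w\w+\b', doc.lower())

token_lists = [tokenize(doc) for doc in corpus]
n_docs = len(corpus)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in token_lists for term in set(tokens))

def smoothed_idf(term):
    # Assumed sklearn default: add one to numerator and denominator, then add 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(tokens):
    counts = Counter(tokens)
    weights = {t: c * smoothed_idf(t) for t, c in counts.items()}
    # Scale the document vector to unit (Euclidean) length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row0 = tfidf_row(token_lists[0])
print({t: round(w, 4) for t, w in sorted(row0.items())})
# {'document': 0.4698, 'first': 0.5803, 'is': 0.3841, 'the': 0.3841, 'this': 0.3841}
```

Words that occur in every document (*is*, *the*, *this*) get the smallest weight, and *first*, which occurs in only two of the four documents, gets the largest.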


Let's finish with an example.  Suppose we have a document in which
the word *gives* and the word *computer* both occur 3 times.  Let's
use equation (5) and the statistics above to compute the two TFIDF values:

$$
(a)\; \text{TFIDF}(\text{computer},d)  = 
\begin{array}[t]{l}
3 \cdot \text{log-idf}(\text{computer}, D)\\
= 3 \cdot \log \frac{1726}{715}\\
\approx 2.64
\end{array}
$$

$$
(b)\; \text{TFIDF}(\text{gives},d)  = 
\begin{array}[t]{l}
3 \cdot \text{log-idf}(\text{gives}, D)\\
= 3 \cdot \log \frac{1726}{1191}\\
\approx 1.11
\end{array}
$$


So as desired the occurrence of *computer* is more significant, in fact more than twice
as significant.

In [31]:
3*np.log(1726/1191)

1.1130399068642194
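The same check can be run for both words at once. This is a small sketch using only the standard library, with the document counts taken from the BNC table above and natural logarithms as in the cell just shown.

```python
import math

n_docs = 1726                                 # number of documents in the BNC
doc_freq = {"computer": 715, "gives": 1191}   # documents containing each word
count_in_doc = 3                              # both words occur 3 times

# Equation (5): raw count times log inverse document frequency
tfidf = {word: count_in_doc * math.log(n_docs / n_containing)
         for word, n_containing in doc_freq.items()}

for word, value in tfidf.items():
    print(f"{word:9s} {value:.2f}")
print(f"ratio     {tfidf['computer'] / tfidf['gives']:.2f}")
```

The ratio comes out a little under 2.4, which is where "more than twice as significant" comes from.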

Here is an example of how we would use the `TfidfVectorizer` on the insult data.

In [44]:
# Split the data into training and test sets FIRST
T_train,T_test, y_train,y_test = train_test_split(df['Comment'],df['Insult'])

In [45]:
tf = text.TfidfVectorizer()
# Train your vectorizer ONLY on the training data.
X_train = tf.fit_transform(T_train)

`X_train` is a **term-document** matrix.  Each row represents a document.  Each column represents a **term** or vocabulary item.

In [46]:
X_train.shape

(2960, 13538)

The number of columns is over 10,000.  That means there are over 10,000 features in our representation of every document.  Most but not all of the words in the training set are being used as features.  We'll skip
over the details of how those decisions are made.  For now, let's get the main idea.


The TFIDF vectorizer uses a simple formula to assign a significance score --- called **TFIDF value** --- to the
count of each vocabulary item in each document. Our term document matrix  `X_train`
contains those TFIDF values.

Say the word "moron" occurs 3 times in a document.
TFIDF is a very popular measure of the significance of that fact
first proven to be useful in
document retrieval.  It has some competitors in classification, but
we have used it here mainly because it's the easiest **feature weighting scheme**
to use in `sklearn`.

Now the most important fact about the term-document matrix is that it consists mostly of 0s, because,
for any given document, most of the words in the vocabulary don't occur in it.

In [47]:
# Shape and Number of non zero entries
print(f'Shape: ({X_train.shape[0]:,} x {X_train.shape[1]:,})  Non-zero entries: {X_train.nnz:,}')

Shape: (2,960 x 13,538)  Non-zero entries: 73,754


Let's estimate the sparsity of this feature matrix.

In [48]:
print(("The document matrix X is ~{0:.2%} non-zero features.".format(
          X_train.nnz / float(X_train.shape[0] * X_train.shape[1]))))

The document matrix X is ~0.18% non-zero features.


In [49]:
X_train

<2960x13538 sparse matrix of type '<class 'numpy.float64'>'
	with 73754 stored elements in Compressed Sparse Row format>

So each word in the vocabulary has to be associated with a column number.  That's called its **encoding**.

A `TfidfVectorizer` instance stores its encoding dictionary in the attribute `vocabulary_` (note
the trailing underscore!):

In [50]:
moron_ind = tf.vocabulary_['moron']
moron_ind

7129

The `sklearn` module stores many of its internally computed arrays as **sparse matrices**.  This is basically a 
very clever computer science device for not wasting all the space that very sparse matrices 
waste.  Natural language representations are often **quite** sparse.  The 0.18% non-zero features
figure we just looked at is typical.  Sparse matrices come at a cost, however; although some
computations can be done while the matrix is in sparse form, some cannot, and to do those
you have to convert the matrix to a nonsparse matrix, do what you need to do, and then, probably,
convert it back.  This is costly.  We're going to do it now, but only because we're goofing
around. Conversion to non-sparse format should in general be avoided whenever possible.
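To give a feel for the idea, here is a minimal pure-Python sketch of the Compressed Sparse Row layout mentioned in the `X_train` printout above. It illustrates the storage scheme only; the function names are invented here, and this is not how the library implements it internally.

```python
def to_csr(dense_rows):
    """Store only the non-zero entries of a list-of-lists matrix."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)      # the non-zero value itself
                indices.append(col)     # the column it lives in
        indptr.append(len(data))        # where the next row starts in data
    return data, indices, indptr

def csr_get(csr, row, col):
    """Look up one entry; anything not stored is zero."""
    data, indices, indptr = csr
    for k in range(indptr[row], indptr[row + 1]):
        if indices[k] == col:
            return data[k]
    return 0

dense = [
    [0, 0, 3, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 2],
]
csr = to_csr(dense)
print(csr)
print(csr_get(csr, 2, 5), csr_get(csr, 1, 3))
```

Eighteen dense entries shrink to three stored values plus some bookkeeping. The saving grows with the matrix, but random access now needs a small search, which hints at why some operations are awkward in sparse form.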

In [52]:
XA = X_train.toarray()

In [17]:
T_train[0]

'"You fuck your dad."'

In [16]:
XA[0].sum()

4.2789228014509435

Let's find a comment that contains 'moron' and remember its
positional index in the training data so we can look up that doc in X_train.

In [51]:
for (i,comment) in enumerate(T_train):
    if 'morons' in comment:
        break

moron_comment = i
print(T_train.iloc[moron_comment])

"santorum is the only real conservative running.\\xc2\\xa0 \\n\\nAs to the occupiers..... what a bunch of selfish, spoiled brats who are morons. What have these occupiers accomplished / anything of real value.... let's see, murder, rape, robbery, vandalism, created a huge financial burden for the local taxpayers(clean up their messes) even child neglect/endangerment as been reported...\\xc2\\xa0 great accomplishment !\\xc2\\xa0 Mom must be so very proud. Lost all credibility."


Ok, now we can check the TFIDF matrix for the statistic for `'moron'` in this document:

In [20]:
XA[moron_comment][moron_ind]

0.0

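Oh. The comment contains *morons*, not *moron*, and the vectorizer treats those as two different vocabulary items.  Let's look up the right one: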

In [54]:
moron_ind = tf.vocabulary_['morons']

Totally different word, found at a totally different place in XA:

In [55]:
XA[moron_comment][moron_ind]

0.12517565622441323

Summary: In this part of the discussion, we have learned about **vectorization**, the computational
process of going from a sequence of documents to a term-document matrix.

The key point is that the term-document matrix is now exactly the sort of thing we used to
train a classifier to recognize iris types: a matrix whose rows represent exemplars
and whose columns represent features.  That means we can just pass the
term-document matrix `X_train` (along with some labels) to a classifier instance to
train it.

## Training

Now, we are going to train a classifier as usual. We already split the data into a training and a test set above, and we fit the vectorizer on the training set only.

We use a **Bernoulli Naive Bayes classifier**.
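For intuition, here is a minimal sketch of what a Bernoulli Naive Bayes classifier computes, written in plain Python on a made-up four-document training set. It assumes add-one smoothing and whitespace tokenization, and it is an illustration rather than `sklearn`'s implementation. The model only asks whether each vocabulary word is present or absent in a document.

```python
import math
from collections import defaultdict

# Made-up training data: 1 = insult, 0 = not an insult
train_docs = ["you are an idiot", "you are stupid",
              "thanks for the help", "have a nice day"]
train_labels = [1, 1, 0, 0]

vocab = sorted({w for doc in train_docs for w in doc.split()})

def fit_bernoulli_nb(docs, labels, alpha=1.0):
    """Estimate P(class) and P(word present | class) with add-alpha smoothing."""
    n_docs_in_class = defaultdict(int)
    n_docs_with_word = defaultdict(lambda: defaultdict(int))
    for doc, label in zip(docs, labels):
        n_docs_in_class[label] += 1
        for word in set(doc.split()):
            n_docs_with_word[label][word] += 1
    priors = {c: n / len(docs) for c, n in n_docs_in_class.items()}
    probs = {c: {w: (n_docs_with_word[c][w] + alpha) / (n + 2 * alpha)
                 for w in vocab}
             for c, n in n_docs_in_class.items()}
    return priors, probs

def predict(doc, priors, probs):
    present = set(doc.split())
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in vocab:
            p = probs[c][w]
            # Every vocabulary word contributes, whether present or absent
            score += math.log(p if w in present else 1 - p)
        scores[c] = score
    return max(scores, key=scores.get)

priors, probs = fit_bernoulli_nb(train_docs, train_labels)
print(predict("you idiot", priors, probs), predict("nice help", priors, probs))
```

Note that `sklearn`'s `BernoulliNB` binarizes its input by default (any value above 0 counts as present), so the TFIDF weights matter to it only through which entries are non-zero.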

In [56]:
bnb =nb.BernoulliNB()

bnb.fit(X_train, y_train);

And we're done.  How'd we do?  Now we  test on the test set.  Before we can do that we need to
vectorize the test set.  But don't just copy what we did with the training data:

```
X_test = tf.fit_transform(T_test)
```

That would retrain the vectorizer from scratch.  Any words that occurred in the training texts
but not in the test texts would be forgotten!  Plus training the vectorizer 
is part of the classifier training pipeline.  If we let the vectorizer see
the test data, we'd be compromising the whole
idea of splitting training and test data.  So what we want to do
with the test data is just apply the transform part of vectorizing:

```
X_test = tf.transform(T_test)
```

That is, build a representation of the test data using only the vocabulary you learned
about in training.  Ignore any new words.
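The difference between fitting and transforming can be sketched with a toy count vectorizer. `TinyVectorizer` is a hypothetical class written for this illustration, with method names chosen to mirror the `sklearn` ones; it is not part of any library.

```python
class TinyVectorizer:
    """A toy count vectorizer (hypothetical, for illustration only)."""

    def fit(self, docs):
        # Learn the vocabulary: one column number per word
        words = sorted({w for doc in docs for w in doc.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        # Count words using the vocabulary learned in fit, and nothing else
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for w in doc.lower().split():
                col = self.vocabulary_.get(w)
                if col is not None:      # words unseen in training are ignored
                    row[col] += 1
            rows.append(row)
        return rows

train_docs = ["you are rude", "you are kind"]
test_docs = ["you are very rude"]

vec = TinyVectorizer().fit(train_docs)
print(vec.vocabulary_)            # columns learned from the training data
print(vec.transform(test_docs))   # "very" is simply dropped

# Refitting on the test data would give the columns different meanings
refit = TinyVectorizer().fit(test_docs)
print(refit.vocabulary_)
```

After refitting, column 1 means *rude* instead of *kind*, so a classifier trained on the original columns would be reading the wrong features.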

In [103]:
X_test = tf.transform(T_test)
bnb.score(X_test, y_test)

0.7416413373860182

Let's summarize what we did by gathering the steps into one cell without all the discussion and re-executing it:

In [105]:
T_train,T_test, y_train,y_test = train_test_split(df['Comment'],df['Insult'])
tf = text.TfidfVectorizer()
X_train = tf.fit_transform(T_train)
bnb =nb.BernoulliNB()
bnb.fit(X_train, y_train)
X_test = tf.transform(T_test)
bnb.score(X_test, y_test)

0.7608915906788247

The result should be the same as when we stepped through it with lots of discussion, right?

Well, is it?  

Ok, re-execute the same cell above again.  Now one more time. 

Now try the following
piece of code:

### Basic train and test loop

In [25]:
def split_vectorize_and_fit(docs,labels,clf):
    T_train,T_test, y_train,y_test = train_test_split(docs,labels)
    tf = text.TfidfVectorizer()
    X_train = tf.fit_transform(T_train)
    clf_inst = clf()
    clf_inst.fit(X_train, y_train)
    X_test = tf.transform(T_test)
    return clf_inst.predict(X_test), y_test

In [57]:
num_runs = 10
for test_run in range(num_runs):
    predicted, actual = split_vectorize_and_fit(df['Comment'],df['Insult'], nb.BernoulliNB)
    # sklearn metrics take the true labels first, then the predictions
    print('{0:.3f}'.format(accuracy_score(actual, predicted)))

0.753
0.769
0.746
0.784
0.772
0.757
0.780
0.763
0.791
0.773


What's happening?  

The training test split function takes a random sample of all the data to use as training data.
Each time there's a train test split we get a different classifier.  Sometimes the
training data is a better preparation for the test than others.   And so the actual
variation in performance is significant.

How should we deal with this when we report our evaluations?
To get a realistic picture of how good our classifier is,
we need to take the average of multiple training runs, each with a different train/test split of our working
data set. This is called **cross validation.**
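As a small sketch of what such a report might look like, the ten accuracies printed by the loop above can be summarized by their mean and spread, using only the standard library:

```python
import statistics

# The ten accuracies printed by the loop above
accuracies = [0.753, 0.769, 0.746, 0.784, 0.772,
              0.757, 0.780, 0.763, 0.791, 0.773]

mean = statistics.mean(accuracies)
spread = statistics.stdev(accuracies)   # sample standard deviation
print(f"mean {mean:.3f}  std {spread:.3f}  "
      f"min {min(accuracies):.3f}  max {max(accuracies):.3f}")
```

Reporting the mean together with the spread makes it clear that a single run landing anywhere between roughly 0.75 and 0.79 says little on its own.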

### Refined train and test loop

Explain the purpose of the code in the next cell.

In [26]:
num_runs = 100

stats = np.zeros((4,))
for test_run in range(num_runs):
    predicted, actual = split_vectorize_and_fit(df['Comment'],df['Insult'],nb.BernoulliNB)
    y_array = actual.values
    prop_insults = y_array.sum()/len(y_array)
    # sklearn metrics take the true labels first, then the predictions
    stats = stats + np.array([accuracy_score(actual, predicted),
                              precision_score(actual, predicted),
                              recall_score(actual, predicted),
                              prop_insults])
normed_stats = stats/num_runs
labels = ['Accuracy','Precision','Recall','Pct Insults']
for (i,s) in enumerate(normed_stats):
    print(f'{labels[i]} {s:.2f}')

Accuracy 0.77
Precision 0.90
Recall 0.14
Pct Insults 0.27


### Most important features

Let's go back to the core code sequence and take a look at what features
are the most important in insult detection.

For this experiment we leave out the training test split;
in fact, we leave out anything to do with testing.

In [27]:
def print_topn(vectorizer, clf, top_n=10, class_labels=(True,)):
    """Prints the features with the highest log probability, per class"""
    feature_names = vectorizer.get_feature_names_out()
    classes = list(clf.classes_)
    for class_label in class_labels:
        # feature_log_prob_ has one row per class, in the order of clf.classes_
        row = classes.index(class_label)
        word_importance = np.argsort(clf.feature_log_prob_[row])
        top_inds = word_importance[-top_n:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top_inds)))


tf = text.TfidfVectorizer()
X_train = tf.fit_transform(df['Comment'])
bnb =nb.BernoulliNB()
bnb.fit(X_train, df['Insult'])

# Now find the most heavily weighted features [= words]
print_topn(tf,bnb)

True: it is that of and to the your are you

The model essentially consists of a set of weights attached to vocab items and stored in

```
bnb.feature_log_prob_
```
with one row of log probabilities per class.  (Older versions of scikit-learn exposed the insult-class row as `bnb.coef_`; on recent versions that attribute raises an `AttributeError`.)
We found the words with the top 10 weights for the insult class and printed them out.
So the top 10 best predictors in this model of being an insult are a bunch of function words.

Let's look at some more words.  We see some more natural insult candidates showing up.

In [144]:
print_topn(tf,bnb,top_n=100)

True: need still should her than right by dick nyou make were shut has time some see loser them ignorant too there even more say think from life then why now he moron at would when one off shit really yourself how we will out they my people was back fucking dumb here little me bitch because who about ass know if or but as no can stupid do don go fuck this up get what be idiot with all xa0 so an just re not for on have like in it is that of and to the your are you


### Running the classifier on a list of sentences

Finally, let's look at how to test our estimator on a few test sentences.


In [28]:
ps = bnb.predict(tf.transform(df['Comment']))

In [152]:
ps

array([0, 0, 0, ..., 0, 0, 0])

In [29]:
predicted = bnb.predict(tf.transform([
    "I totally agree with you",
    "You are so stupid",
    "That you are an idiot who understands neither taxation nor women\'s health."
    ]))

print(predicted)

[0 0 1]


Not real impressive.  The word *stupid* was not recognized as an insult.

Naive Bayes is not the best classifier.  On your homework assignment you will try some others.

### Precision, Accuracy, and Recall

The next cell takes the first step toward testing a classifier a little more seriously.  It defines some code for evaluating classifier output.  The evaluation metrics defined are precision, recall, and accuracy.  Call the examples the system predicts to be positive (whether correctly or not) ppos and the examples it predicts to be negative pneg; consider the following performance on 100 examples:

$$\begin{array}[t]{ccc} &  pos &neg\\ ppos& 31 & 5\\ pneg & 14  & 50 \end{array}$$

The performance of the system has been sorted into 4 classes:

$$\begin{array}[t]{ccc} &  pos &neg\\ ppos& tp & fp\\ pneg & fn & tn \end{array}$$

The $tp$ and $tn$ examples (true positive and true negative) are those the system labeled correctly,
while $fp$ and $fn$ (false positive and false negative) are those labeled incorrectly.
Let N stand for the total number of examples,
100 in our case. 

The three most important measures of system performance are:
  
  1. **Accuracy**: Accuracy is the percentage of correct examples out of the total corpus 

  $$Acc = \frac{tp+tn}{N} = \frac{31 +50}{100}$$ 
  
  This is .81 in our case.

  2. **Precision**: Precision is the percentage of true positives out of all positive guesses the system made 
  
  $$Prec= \frac{tp}{tp + fp} = \frac{31}{31+5}.$$
  
  This is .86 in our case.

  3. **Recall**: Recall is the percentage of true positives out of all positives 
  
  $$Rec = \frac{tp}{tp + fn} = \frac{31}{31+14}.$$
  
  This is .69 in our case.


  The function `do_evaluation`, defined in the next cell, computes precision, recall and accuracy for a test set
  using the scikit learn implementations of those metrics; `do_evaluation` takes as its argument a sequence of docs and labels, as well as a classifier creation function.
  
It also takes as an argument `pos_label`, the label we are trying to predict.  Changing the
label we are trying to "detect" (True or False for our insult detection data) has no effect
on accuracy but it computes different scores for precision and recall.
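To see the effect of switching the positive label, here is a short sketch that recomputes the three measures directly from the four counts in the 100-example table above, first treating pos as the class being detected and then neg. The variable names are chosen for this illustration.

```python
# Counts from the 100-example table above
tp, fp, fn, tn = 31, 5, 14, 50
n = tp + fp + fn + tn

# Accuracy does not depend on which class we call positive
accuracy = (tp + tn) / n

# Treating "pos" as the class we are detecting
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)

# Treating "neg" as the class we are detecting:
# tn and tp swap roles, and so do fn and fp
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy {accuracy:.2f}")
print(f"detecting pos: precision {precision_pos:.2f}, recall {recall_pos:.2f}")
print(f"detecting neg: precision {precision_neg:.2f}, recall {recall_neg:.2f}")
```

Accuracy stays at .81 either way, while precision and recall move from .86 and .69 to about .78 and .91.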

In [179]:
def do_evaluation(clf, docs, labels, num_runs=100, pos_label=True):
    stats = np.zeros((4,))
    for test_run in range(num_runs):
        # Use the docs and labels that were passed in
        predicted, actual = split_vectorize_and_fit(docs, labels, clf=clf)
        y_array = actual.values
        prop_insults = y_array.sum()/len(y_array)
        # sklearn metrics take the true labels first, then the predictions
        stats = stats + np.array([accuracy_score(actual, predicted),
                                  precision_score(actual, predicted,pos_label=pos_label),
                                  recall_score(actual, predicted,pos_label=pos_label),
                                  prop_insults])
    normed_stats = stats/num_runs
    stat_names = ['Accuracy','Precision','Recall','Pct Insults']
    for (i,s) in enumerate(normed_stats):
        print(f'{stat_names[i]} {s:.2f}')
    return normed_stats

The code in the next cell evaluates our NB classifier.  Note that precision and recall give different results depending on which class we think of ourselves as detecting (which class we think of as positive).  We give evaluation numbers with respect to detecting insults and detecting non insults.  These show that our classifier 
rarely calls something an insult when it isn't (high precision in detecting insults)
but misses most of the insults (low recall in detecting insults).  Seen from the other side, it finds nearly all the non insults (high recall in detecting non insults), but a fair number of the things it calls non insults are really insults (not so great precision in detecting non insults).

In [183]:
print('Evaluate our ability to detect insults')
print()
do_evaluation(nb.BernoulliNB, df['Comment'],df['Insult'],num_runs=10,pos_label=True)
print()
print('Now evaluate our ability to detect NON insults')
print()
do_evaluation(nb.BernoulliNB, df['Comment'],df['Insult'],num_runs=10,pos_label=False)

Evaluate our ability to detect insults

Accuracy 0.77
Precision 0.87
Recall 0.14
Pct Insults 0.26

Now evaluate our ability to detect NON insults

Accuracy 0.77
Precision 0.77
Recall 0.99
Pct Insults 0.26


array([0.77092199, 0.76514169, 0.99379989, 0.26443769])

So why does our classifier guess insult so much less often than it should, given that about a quarter of the data is insults?  Well, probably because it had trouble finding strong positive indicators: as our glance at the most informative features suggested, the words most strongly associated with insults are mostly function words, which are also common in non insults.  This is something we might want to worry about as we design good classifiers.