## Analyzing insults with Naive Bayes: pandas and sklearn

In [1]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV as gs
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score
%matplotlib inline

## Loading and preparing the data

Let's open the CSV file with `pandas`.

In [4]:
# os.path.join is meant for filesystem paths, so build the URL by
# plain string concatenation instead.
site = 'https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/_static/'
df = pd.read_csv(site + "troll.csv")

Each row is a comment taken from a blog or online forum. There are three columns: whether the comment is insulting (1) or not (0), the date, and the unicode-encoded contents of the comment.

In [5]:
df[['Insult', 'Comment']].tail()

Unnamed: 0,Insult,Comment
3942,1,"""you are both morons and that is never happening"""
3943,0,"""Many toolbars include spell check, like Yahoo..."
3944,0,"""@LambeauOrWrigley\xa0\xa0@K.Moss\xa0\nSioux F..."
3945,0,"""How about Felix? He is sure turning into one ..."
3946,0,"""You're all upset, defending this hipster band..."


Write a pandas command to give you just the insults.

In [6]:
# Solution replaces df on the RHS
insult_df = df[df['Insult'] == 1]

In [7]:
insult_df[:25]

Unnamed: 0,Insult,Date,Comment
0,1,20120618192155Z,"""You fuck your dad."""
7,1,,"""shut the fuck up. you and the rest of your fa..."
8,1,20120502173553Z,"""Either you are fake or extremely stupid...may..."
9,1,20120620160512Z,"""That you are an idiot who understands neither..."
15,1,20120611090207Z,"""FOR SOME REASON U SOUND RETARDED. LOL. DAMN. ..."
16,1,20120320162532Z,"""You with the 'racist' screen name\n\nYou are ..."
18,1,20120320075347Z,"""your such a dickhead..."""
19,1,20120320203947Z,"""Your a retard go post your head up your #%&*"""
34,1,20120515132156Z,"""Allinit123, your\xa0hypocrisy\xa0is sickening..."
37,1,20120620161958Z,"""I can't believe the stupid people on this sit..."


In [8]:
df['Comment'][79:85]

79    "Fact : Georgia passed a strict immigration po...
80              "Of course you would bottom feeder ..."
81    "M\xe1tenlos!!\nhttp://1.bp.blogspot.com/-YVSZ...
82    "You are\xa0 a fukin moron. \xa0\xa0 You are j...
83    "He is doing what any president doe's on this ...
84    "...yeah, and you're a f'ing expert.....go bac...
Name: Comment, dtype: object

In [9]:
df['Comment'][79]

'"Fact : Georgia passed a strict immigration policy and most of the Latino farm workers left the area. Vidalia Georgia now has over 3000 agriculture job openings and they have been able to fill about 250 of them in past year. All you White Real Americans who are looking for work that the Latinos stole from you..Where are you ? The jobs are i Vadalia just waiting for you..Or maybe its the fact that you would rather collect unemployment like the rest of the Tea Klaners.. You scream..you complain..and you sit at home in your wife beaters and drink beer..Typical Real White Tea Klan...."'

NB: `sort_values` returns a new Series, so `insult_df` itself is **not** reordered by the following sort.

In [10]:
# insult_df was created by filtering df, so work on an explicit copy;
# assigning a column to the filtered frame directly triggers pandas'
# SettingWithCopyWarning.
insult_df = insult_df.copy()
insult_df['Size'] = insult_df['Comment'].apply(len)
insult_df['Size'].sort_values(ascending = False)


3208    4016
3931    1600
581     1548
1348    1269
3924    1022
        ... 
3109      11
2180      11
3919       8
45         8
755        8
Name: Size, Length: 1049, dtype: int64

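Next we define the feature matrix $\mathbf{X}$ and the labels $\mathbf{y}$. First, a quick look at the longest and the shortest insults; the labels $\mathbf{y}$ are then simply the `Insult` column.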

In [11]:
len(insult_df.loc[3208]['Comment'].split())

703

In [12]:
insult_df.loc[755]

Insult                   1
Date       20120620121441Z
Comment           "Retard"
Size                     8
Name: 755, dtype: object

In [13]:
insult_df.loc[45]

Insult                   1
Date       20120619074710Z
Comment           "faggot"
Size                     8
Name: 45, dtype: object

In [14]:
insult_df.loc[3919]

Insult                   1
Date       20120610154957Z
Comment           "faggot"
Size                     8
Name: 3919, dtype: object

In [15]:
y = df['Insult']

We want to use one of the linear classifiers in `sklearn`,
but the learners in `sklearn` only work with numerical arrays. How do we convert text into a matrix of numbers?
As discussed in lecture and in our text,
obtaining the feature matrix from the text is not trivial.

The classical solution is to first extract a **vocabulary**: a list of words used throughout the corpus. Then, we can count, for each document in the sample, the frequency of each word. We end up with a **sparse matrix**: a huge matrix containing mostly zeros. Here, `sklearn` and `pandas` make it possible to do this in two lines. 
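
To make the vocabulary-and-counts idea concrete, here is a minimal pure-Python sketch on a three-comment toy corpus. The corpus and the helper `tokenize` are invented for illustration; the `sklearn` vectorizers do the same job with a more careful tokenizer and a sparse output.

```python
from collections import Counter

# Toy corpus (invented for illustration, not taken from troll.csv).
toy_docs = ["you are so stupid", "I love you", "you are you"]

def tokenize(doc):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return doc.lower().split()

# The vocabulary maps each distinct word to a column index
# (columns are assigned in alphabetical order).
vocab = {word: col for col, word in
         enumerate(sorted({w for d in toy_docs for w in tokenize(d)}))}

# One row per document, one column per vocabulary word.
count_matrix = []
for doc in toy_docs:
    counts = Counter(tokenize(doc))
    count_matrix.append([counts.get(word, 0) for word in vocab])

print(vocab)
for row in count_matrix:
    print(row)
```

Even in this tiny example most cells are zero; with thousands of comments and a vocabulary of many thousands of words, almost all of them are.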

In [50]:
print(text.TfidfVectorizer.__doc__)

Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to :class:`CountVectorizer` followed by
    :class:`TfidfTransformer`.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

    Parameters
    ----------
    input : string {'filename', 'file', 'content'}
        If 'filename', the sequence passed as an argument to fit is
        expected to be a list of filenames that need reading to fetch
        the raw content to analyze.

        If 'file', the sequence items must have a 'read' method (file-like
        object) that is called to fetch the bytes in memory.

        Otherwise the input is expected to be the sequence strings or
        bytes items are expected to be analyzed directly.

    encoding : string, 'utf-8' by default.
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'} (default='strict')
        Instruction on what to do if a byte sequen

In [22]:
tf = text.TfidfVectorizer()
X = tf.fit_transform(df['Comment'])



The TFIDF vectorizer uses a simple formula to assign a significance score to the
count of each vocabulary item in each document. Our TFIDF matrix is stored in `X`.

Say a word occurs $n$ times in a document.
TFIDF is a very popular measure of the significance of that fact;
it first proved useful in
document retrieval.  It has some competitors in classification, but
we use it here mainly because it is the easiest **feature weighting scheme**
to use in `sklearn`.
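
As a rough sketch of what that formula looks like: with its default settings, `TfidfVectorizer` multiplies the raw count of a word in a document by a smoothed inverse document frequency, $\ln\frac{1+N}{1+\mathrm{df}(w)} + 1$ (with $N$ documents, $\mathrm{df}(w)$ of which contain the word $w$), and then rescales each row to unit Euclidean length. The toy counts and the helper `tfidf_row` below are invented for illustration.

```python
import math

# Toy count matrix (invented): 3 documents x 3 words.
words = ["you", "are", "stupid"]
counts = [
    [1, 1, 1],   # "you are stupid"
    [2, 0, 0],   # "you you"
    [1, 1, 0],   # "you are"
]
n_docs = len(counts)

# Document frequency: in how many documents does each word occur?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(words))]

# Smoothed inverse document frequency (the smooth_idf=True default).
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit Euclidean length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted

tfidf = [tfidf_row(row) for row in counts]
for word, weight in zip(words, tfidf[0]):
    print(f"{word:>7}: {weight:.3f}")
```

In the first toy document every word occurs once, yet *stupid* gets the largest weight because it is the rarest word across the corpus, while *you*, which occurs in every document, gets the smallest.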

In [27]:
# Shape and number of non-zero entries
print(f'Shape: ({X.shape[0]:,} x {X.shape[1]:,})  Non-zero entries: {X.nnz:,}')

Shape: (3,947 x 16,469)  Non-zero entries: 100,269


There are 3,947 comments and 16,469 different words. Let's estimate the sparsity of this feature matrix.

In [29]:
print(("The document matrix X is ~{0:.2%} non-zero features.".format(
          X.nnz / float(X.shape[0] * X.shape[1]))))

The document matrix X is ~0.15% non-zero features.


A `TfidfVectorizer` instance stores its dictionary from words to column indices in the attribute `vocabulary_` (note
the trailing underscore!):

In [16]:
tf.vocabulary_['moron']

8704

The `sklearn` module stores many of its internally computed arrays as **sparse matrices**.  This is basically a
very clever computer science device for not wasting all the space that very sparse matrices
waste.  Natural language representations are often **quite** sparse.  The 0.15% non-zero features
figure we just looked at is typical.  Sparse matrices come at a cost, however; although some
computations can be done while the matrix is in sparse form, some cannot, and to do those
you have to convert the matrix to a non-sparse (dense) matrix, do what you need to do, and then, probably,
convert it back.  This is costly.  We're going to do it now, but only because we're goofing
around. Conversion to non-sparse format should in general be avoided whenever possible.
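
To illustrate the idea behind sparse storage, here is a hand-rolled sketch that keeps only the non-zero cells in a dictionary keyed by position. It is not how `scipy.sparse` is implemented internally, and the toy matrix and the helper `lookup` are invented.

```python
# Dense toy matrix (invented): mostly zeros, like a document-term matrix.
dense = [
    [0.0, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.0, 0.0, 0.3],
]
n_rows, n_cols = len(dense), len(dense[0])

# "Dictionary of keys" storage: keep only the non-zero cells.
sparse = {(i, j): v
          for i, row in enumerate(dense)
          for j, v in enumerate(row) if v != 0.0}

def lookup(i, j):
    """Cells that were never stored are implicitly zero."""
    return sparse.get((i, j), 0.0)

density = len(sparse) / (n_rows * n_cols)
print(f"stored {len(sparse)} of {n_rows * n_cols} cells ({density:.1%} non-zero)")
print(lookup(2, 5), lookup(1, 1))
```

A SciPy sparse matrix such as `X` also lets you read a single cell directly with `X[i, j]`, so a lookup like this does not by itself require a dense copy.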

In [8]:
XA = X.toarray()

Consider comment 3942:

In [31]:
insult_df.loc[3942]['Comment']

'"you are both morons and that is never happening"'

Ok, now we can check the TFIDF matrix for the statistic for `'moron'` in this comment:

In [9]:
XA[3942][8704]

0.0

Oh. The comment contains the plural *morons*, not *moron*, so let's look that up instead:

In [11]:
tf.vocabulary_['morons']

8707

As far as the vectorizer is concerned, this is a totally different word, with its own column in XA:

In [12]:
XA[3942][8707]

0.5139224706716653
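
The reason *moron* and *morons* get separate columns is that the vectorizer does no stemming: it lowercases the text and extracts tokens with a regular expression. The sketch below uses the default `token_pattern` of the `sklearn` text vectorizers; the helper `simple_tokenize` is a hypothetical stand-in for the real analyzer, and the second example sentence is invented.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def simple_tokenize(comment):
    """Lowercase, then extract tokens; roughly what the default analyzer does."""
    return TOKEN_PATTERN.findall(comment.lower())

tokens = simple_tokenize('"you are both morons and that is never happening"')
print(tokens)
print("moron" in tokens, "morons" in tokens)

# Contractions are split at the apostrophe and one-letter pieces are dropped.
print(simple_tokenize("You're a moron, don't post."))
```

This is also why fragments such as `re` and `don` end up as vocabulary items of their own.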

## Training

Now, we are going to train a classifier as usual. We first split the data into a train and test set.

In [35]:
(X_train, X_test,
 y_train, y_test) = train_test_split(X, y,
                                     test_size=.2)
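
A random split of this kind amounts to shuffling the row indices and holding out a fixed share of them. The sketch below is a simplified stand-in, not the `sklearn` implementation; the helper `toy_train_test_split` and its `seed` argument are hypothetical (the real `train_test_split` accepts a `random_state` parameter for the same purpose).

```python
import random

def toy_train_test_split(items, test_size=0.2, seed=None):
    """Hypothetical helper: shuffle the indices, hold out the last test_size share."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    cut = len(items) - n_test
    train = [items[i] for i in indices[:cut]]
    test = [items[i] for i in indices[cut:]]
    return train, test

comments = [f"comment {i}" for i in range(10)]
train_a, test_a = toy_train_test_split(comments, seed=0)
train_b, test_b = toy_train_test_split(comments, seed=0)
print(test_a)
print(test_a == test_b)  # same seed, same split
```

Without a seed, each call draws a different shuffle, so the rows that land in the test set change from call to call.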

We use a **Bernoulli Naive Bayes classifier**.

In [36]:
bnb = nb.BernoulliNB()

bnb.fit(X_train, y_train);

In [37]:
bnb.score(X_test, y_test)

0.7481012658227848
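
For intuition about what the classifier computes, here is a minimal from-scratch sketch of Bernoulli Naive Bayes on invented toy data. It is not the `sklearn` code: the names `fit_bernoulli_nb` and `predict_one` are hypothetical, and the features are assumed to be already binary (word present or absent). `BernoulliNB` gets there by thresholding its input with its `binarize` parameter, which defaults to 0.0, so the TFIDF weights in `X` only matter as zero versus non-zero.

```python
import math

# Toy binary data (invented): rows are comments, columns are "word present?" flags.
vocab = ["you", "idiot", "love"]
X_toy = [
    [1, 1, 0],   # insult
    [1, 1, 0],   # insult
    [1, 0, 1],   # not an insult
    [0, 0, 1],   # not an insult
    [1, 0, 0],   # not an insult
]
y_toy = [1, 1, 0, 0, 0]
ALPHA = 1.0  # Laplace smoothing; 1.0 is also the BernoulliNB default

def fit_bernoulli_nb(X, y, alpha=ALPHA):
    """Estimate log priors and smoothed P(word present | class)."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # (docs in class containing the word + alpha) / (docs in class + 2 * alpha)
        p_word = [(sum(col) + alpha) / (len(rows) + 2 * alpha)
                  for col in zip(*rows)]
        model[c] = (math.log(prior), p_word)
    return model

def predict_one(model, x):
    """Pick the class with the highest joint log-likelihood."""
    scores = {}
    for c, (log_prior, p_word) in model.items():
        score = log_prior
        for present, p in zip(x, p_word):
            # Absent words count too: they contribute log(1 - p).
            score += math.log(p) if present else math.log(1 - p)
        scores[c] = score
    return max(scores, key=scores.get)

toy_model = fit_bernoulli_nb(X_toy, y_toy)
print(predict_one(toy_model, [1, 1, 0]))  # "you idiot"
print(predict_one(toy_model, [1, 0, 1]))  # "you love"
```

Note that the absence of a word is evidence too, which is what distinguishes the Bernoulli model from the multinomial variant.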

Now try re-executing the previous cells.  The results should be the same, right?

Well, are they?  

Ok, re-execute the same three cells again.  Now one more time.  Now try the following
piece of code:

In [38]:
num_runs = 10
for test_run in range(num_runs):
    (X_train, X_test,
     y_train, y_test) = train_test_split(X, y,
                                         test_size=.2)
    bnb = nb.BernoulliNB()
    bnb.fit(X_train, y_train)
    print('{0}'.format(bnb.score(X_test, y_test)))

0.7734177215189874
0.740506329113924
0.7291139240506329
0.7354430379746836
0.7759493670886076
0.7367088607594937
0.7227848101265822
0.7455696202531645
0.7493670886075949
0.7468354430379747


What's happening?  How should we deal with this when we report our evaluations?

Explain the purpose of the code in the next cell.

In [17]:
num_runs = 100
total = 0
p_total = 0
r_total = 0
insults_total = 0
for test_run in range(num_runs):
    (X_train, X_test,
     y_train, y_test) = train_test_split(X, y,
                                         test_size=.2)
    bnb = nb.BernoulliNB()
    bnb.fit(X_train, y_train)
    score = bnb.score(X_test, y_test)
    predicted = bnb.predict(X_test)
    y_array = y_test.values
    prop_insults = float(y_array.sum())/len(y_array)
    # scikit-learn metrics take the true labels first, then the predictions.
    p_score = precision_score(y_test, predicted)
    r_score = recall_score(y_test, predicted)
    total += score
    p_total += p_score
    r_total += r_score
    insults_total += prop_insults
print('Accuracy {:.2%}'.format(total/num_runs))
print('Precision {:.2%}'.format(p_total/num_runs))
print('Recall {:.2%}'.format(r_total/num_runs))
print('Avg Pct Insults {:.2%}'.format(insults_total/num_runs))

Accuracy 75.12%
Precision 60.58%
Recall 15.74%
Avg Pct Insults 26.34%
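
As a reminder of what these numbers measure, here is a small pure-Python sketch of precision and recall for the positive class. The helper `precision_recall` and the toy labels are invented. Like the `sklearn` functions `precision_score` and `recall_score`, it takes the true labels first and the predictions second; the last two lines show why the order matters.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels (invented): 4 real insults, the classifier flags only 2 comments.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

p, r = precision_recall(y_true_toy, y_pred_toy)
accuracy = sum(t == q for t, q in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(f"accuracy={accuracy:.2f} precision={p:.2f} recall={r:.2f}")

# Swapping the two arguments swaps the two numbers.
p_swapped, r_swapped = precision_recall(y_pred_toy, y_true_toy)
print(p_swapped == r, r_swapped == p)
```

A classifier that rarely predicts the positive class can have decent accuracy on imbalanced data and still have low recall, which is why accuracy alone is not enough here.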


Let's take a look at the words corresponding to the largest coefficients, that is, the words with the highest log-probability of appearing in an insulting comment (the words we find frequently in insulting comments).

In [40]:
dir(bnb)

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_check_X',
 '_check_X_y',
 '_check_alpha',
 '_check_n_features',
 '_count',
 '_estimator_type',
 '_get_param_names',
 '_get_tags',
 '_init_counters',
 '_joint_log_likelihood',
 '_more_tags',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_update_class_log_prior',
 '_update_feature_log_prob',
 '_validate_data',
 'alpha',
 'binarize',
 'class_count_',
 'class_log_prior_',
 'class_prior',
 'classes_',
 'coef_',
 'feature_count_',
 'feature_log_prob_',
 'fit',
 'fit_prior',
 'get_params',
 'intercept_',
 'n_features_',
 'n_features_in_',
 '

In [42]:
bnb.feature_count_.shape

(2, 16469)

In [39]:
# We first get the words corresponding to each feature.
# (Older scikit-learn versions spell this tf.get_feature_names().)
names = np.asarray(tf.get_feature_names_out())
# Next, we display the 50 words with the largest coefficients.
# For a two-class BernoulliNB, the old coef_ attribute (dropped in recent
# scikit-learn releases) is row 1 of feature_log_prob_: log P(word | insult).
coefficient_matrix = bnb.feature_log_prob_[1, :]
print(coefficient_matrix.shape)
# argsort gives us smallest first; we reverse the order and take the top 50
top_fifty_feat_indices = np.argsort(coefficient_matrix)[::-1][:50]
print(','.join(names[top_fifty_feat_indices]))

(16469,)
you,your,are,to,the,and,of,that,is,it,in,like,on,have,for,not,just,re,an,xa0,all,idiot,with,what,be,fuck,so,don,get,this,up,go,do,no,as,stupid,but,can,or,know,if,because,about,ass,bitch,who,back,little,here,my




Finally, let's test our estimator on a few test sentences.


In [19]:
predicted = bnb.predict(tf.transform([
    "I totally agree with you.",
    "You are so stupid.",
    "I love you."
    ]))

print(predicted)

[0 0 0]


In [20]:
print(predicted)
print(y_test[:3])

[0 0 0]
1768    0
2378    0
350     1
Name: Insult, dtype: int64


Not very impressive.  The classifier did not flag the sentence containing *stupid* as an insult.

> [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages).


## Homework

Read the online book draft chapter on the movie review data,
and try the classifier used there, an SVM, on this data.  Be sure
to stick with scikit-learn (it has an SVM implementation).

Show your code, and print out results.  Which classifier does better?