# Profanity Classifier Using Naive Bayes

We will build a profanity (insult) classifier using Naive Bayes, and then using a combination of Logistic Regression and Naive Bayes. We will make extensive use of scikit-learn, Python's most popular machine learning library.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

We will use a dataset from the Kaggle competition [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge). Kaggle is a website that hosts machine learning competitions and is a very good source of datasets.

## 1. Dataset Creation

In [None]:
dfInit = pd.read_csv('data/train.csv')

In [None]:
dfInit.head()

The Toxic Comment dataset contains several types of toxic comments; we are only interested in insults. Notice that the dataset is quite unbalanced.

Let's compute the proportion of comments containing insults. Since the __insult__ column only contains 0 and 1, its [sum](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html) gives the number of rows with insults. Then divide by the number of rows:

In [None]:
<FILL IN> /len(dfInit)

For the algorithm to properly learn whether a comment is an insult or not, we want a balanced dataset. We will first split the dataframe on the __insult__ column into one dataframe with insults and one without. We will then take a sample from the non-insult one and merge the two back together, so that we get a dataset with 50% insults.

In [None]:
dfInsult = dfInit[dfInit['insult'] == 1]

In [None]:
noInsults = len(dfInsult)
noInsults

Similarly, construct a __dfNoInsult__ dataframe containing the non-insults:

In [None]:
dfNoInsult = <FILL IN>

In [None]:
dfNoInsult = dfNoInsult.iloc[:noInsults, :]  # keep only as many rows as there are in the insult dataframe

Use [pandas concat](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) to re-merge dfInsult and dfNoInsult

In [None]:
data = pd.concat(<FILL IN>).reset_index() 

We will select only the _comment_text_ and _insult_ columns; we don't need information on the other classes for our exercise.

In [None]:
data = data[['comment_text','insult']]

In [None]:
from sklearn.utils import shuffle
data = shuffle(data,random_state=23).reset_index(drop=True)

In [None]:
data.head()
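
The balancing steps above can be sketched on a toy dataframe. This is only an illustration: the `toy` frame and its values are made up, and the non-insults are drawn at random with `sample` rather than taken from the first rows with `iloc` as in the cells above.

```python
import pandas as pd

# Hypothetical toy data: 2 insults among 8 comments
toy = pd.DataFrame({
    'comment_text': ['c%d' % i for i in range(8)],
    'insult': [1, 0, 0, 1, 0, 0, 0, 0],
})

pos = toy[toy['insult'] == 1]
# Randomly downsample the majority class to the size of the minority class
neg = toy[toy['insult'] == 0].sample(n=len(pos), random_state=23)

# Stack the two halves and shuffle the rows
balanced = pd.concat([pos, neg]).sample(frac=1, random_state=23).reset_index(drop=True)

print(balanced['insult'].mean())
```

Random sampling avoids any bias that could come from the order of the rows in the original file, at the cost of depending on the chosen `random_state`.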

## 2. Features for text classification. Bag of words

In order to train a model, we need features describing the comment. One way to do this is the so-called __Bag of Words__ approach, where we consider the set of words within a document, without taking into account their order.
Let's take the following sentences, which we will call documents:

In [None]:
sentences = ['roses are red','violets are blue','this is your homework','some cats are red some cats are blue']

### Term Frequency Features

We construct the vocabulary, the set of all the words that appear in the documents:

In [None]:
vocab = list(set((' '.join(sentences)).split()))

In [None]:
vocab.sort()

In [None]:
vocab

We will encode each document with a vector whose length equals the length of the vocabulary. At each position, the value is the frequency of the corresponding term within the document. This model is called __term frequency__.

Let's apply this to the last document, `sentences[3]`:

Below, you need to count the occurrences of an element in a list. For example,
```python
[1, 2, 3, 4, 1, 4, 1].count(1)
```
will output 3

In [None]:
encoding = []
for word in vocab:
    encoding.append(sentences[3].<FILL IN>)

In [None]:
encoding

We have encoded a document as a vector. We can now also compare vector-encoded documents, to find out whether document B or document C is closer to document A. The most common measure used for text is the cosine similarity, the cosine of the angle between the two vectors.
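
As a minimal sketch of this comparison, the function below computes the cosine similarity between the term-frequency vectors of two documents. The helper name `cosine_similarity` and the use of `collections.Counter` are choices made for this illustration, not part of the notebook.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-frequency vectors of two documents."""
    tf_a, tf_b = Counter(doc_a.split()), Counter(doc_b.split())
    # Only words present in both documents contribute to the dot product
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b)

doc_a = 'roses are red'
doc_b = 'violets are blue'
doc_c = 'some cats are red some cats are blue'

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Here document C comes out closer to A (about 0.46) than B does (about 0.33), because C shares both 'are' and 'red' with A while B only shares 'are'.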

We can do this automatically with __CountVectorizer__ from scikit-learn. The usual way to use a scikit-learn class has multiple stages: first we fit a model, which corresponds to the training of an algorithm, where that applies. Next, we use the fitted model to transform data or to predict.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

For each model, we begin by creating an instance of the class:

In [None]:
tf1 = CountVectorizer()

Then, we fit the model to the training set. Here, we fit it to _sentences_. Notice that this is more of an explicit computation than a proper trainable algorithm.

In [None]:
tf1.fit(sentences)
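
To see what fitting computes here, the learned vocabulary can be inspected through the `vocabulary_` attribute. The sketch below repeats the example on its own copy of the sentences; the names `docs` and `vec` are only used for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['roses are red', 'violets are blue', 'this is your homework',
        'some cats are red some cats are blue']

vec = CountVectorizer()
vec.fit(docs)

# vocabulary_ maps each learned term to its column index
learned_terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(learned_terms)

# Words not seen during fit ('purple') are ignored at transform time
row = vec.transform(['roses are purple']).toarray()[0]
print(row)
```

The columns follow the alphabetical order of the terms, which is why the manual encoding built from the sorted `vocab` list lines up with the output of the vectorizer.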

Then, we transform the sentences:

In [None]:
transformed1 = tf1.transform(sentences)

In real problems, the vocabulary size is quite big. If the vocabulary has 10,000 words and a sentence has 5 words, the resulting encoding will have thousands of zeros that are not worth storing in memory. Therefore, the encoding is a [sparse matrix](https://docs.scipy.org/doc/scipy/reference/sparse.html) that only stores the values and positions of the non-zero entries. Since our example is based on a small vocabulary, we will transform the vectors to dense (i.e. non-sparse) form, to be able to visualize them.

In [None]:
transformed1.todense()

Notice the last row corresponds to the last document that we have manually encoded.
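
To illustrate why the sparse representation matters for large vocabularies, here is a small sketch with SciPy. The vocabulary size and the five word positions are made-up values.

```python
import numpy as np
from scipy.sparse import csr_matrix

vocab_size = 10000
dense_row = np.zeros((1, vocab_size))
dense_row[0, [3, 42, 500, 7000, 9999]] = 1  # a 5-word sentence (hypothetical positions)

sparse_row = csr_matrix(dense_row)

print(dense_row.size)      # number of values the dense row holds
print(sparse_row.nnz)      # number of values the sparse row stores
print(sparse_row.indices)  # column positions of the stored values
```

The dense row keeps all 10,000 entries, while the sparse one keeps only the five non-zero values together with their positions.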

### Term frequency-inverse document frequency

We can enhance the TF approach by increasing the values for rare words, on the grounds that sentences that share rare words are more similar than sentences that share frequent words like 'and', 'the', etc.
The resulting model is called __Term Frequency-Inverse Document Frequency__, TF-IDF for short.

We will divide the already computed __TF__ vector by a document frequency vector, which is the same as multiplying by its inverse, the __IDF__; the i-th component of the document frequency vector is the fraction of documents containing the i-th word in the vocabulary.

In [None]:
docFrec = []
for word in vocab:
    docsWithWord = sum(word in sentence.split() for sentence in sentences)  # match whole words, not substrings
    docFrec.append(docsWithWord / len(sentences))

In [None]:
docFrec

In [None]:
list(np.array(encoding) / np.array(docFrec))

Scikit-learn has a TF-IDF transformer, namely TfidfVectorizer:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

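The formula that scikit-learn uses to compute the tf-idf of term $t$ in document $d$ is a little more complicated, but it captures the same idea of documents being more similar when they share rare words: $\text{tf-idf}(t, d) = tf(t, d) \cdot idf(t)$, where $idf(t) = \log\left(\frac{n}{df(t)}\right) + 1$, $n$ is the number of documents and $df(t)$ is the number of documents containing $t$. With the default `smooth_idf=True`, one is added to both $n$ and $df(t)$ inside the logarithm, and each resulting row is then normalized to unit Euclidean length.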
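
As a check of the formula $\log(n / df) + 1$, the sketch below computes it by hand and compares it with `TfidfVectorizer`. It assumes smoothing and normalization are switched off (`smooth_idf=False`, `norm=None`), which is not the default configuration used in the next cell.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['roses are red', 'violets are blue', 'this is your homework',
        'some cats are red some cats are blue']

# Term counts, one row per document
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# df: number of documents in which each term appears
df = (counts > 0).sum(axis=0)
idf_manual = np.log(n_docs / df) + 1
tfidf_manual = counts * idf_manual

# Same computation with scikit-learn, smoothing and normalization disabled
vec = TfidfVectorizer(smooth_idf=False, norm=None)
tfidf_sklearn = vec.fit_transform(docs).toarray()

print(np.allclose(vec.idf_, idf_manual))
print(np.allclose(tfidf_sklearn, tfidf_manual))
```

With the default settings the numbers differ from this manual version, because of the smoothing term and the row normalization.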

In [None]:
tf2 = TfidfVectorizer()

Notice that fit and transform can be performed together, using the [fit-transform](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) method:

In [None]:
transformed2 =<FILL IN>

In [None]:
transformed2.todense()

## 3. Create features, train-test split

We will apply the above methods to create different sets of features. Let's first use CountVectorizer to create TF features:

In [None]:
tf = CountVectorizer(strip_accents='ascii',max_df=0.8,min_df=0.05)

Fit-Transform the __comment text__ column of the dataframe

In [None]:

Xtf = tf.fit_transform(<FILL IN>)

Now, we want a TF-IDF set of features. Moreover, we will use not only words, but also bigrams. An n-gram consists of n consecutive words. If we would like to use words and bigrams to encode the sentence "I have a cat", the corresponding vocabulary will be: I, have, a, cat, I have, have a, a cat.
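
The n-gram construction can be sketched in a few lines of plain Python. The helper `word_ngrams` is a hypothetical name introduced for this illustration; it is not how scikit-learn implements it.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of text for n_min <= n <= n_max, shortest first."""
    words = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n words over the sentence
        for start in range(len(words) - n + 1):
            grams.append(' '.join(words[start:start + n]))
    return grams

print(word_ngrams('I have a cat'))
```

In `TfidfVectorizer`, `ngram_range=(1, 2)` plays the role of `n_min` and `n_max`. Note that with its default settings the vectorizer also lowercases the text and drops one-character tokens such as "I" and "a", so its vocabulary for this sentence would be smaller.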

In [None]:
tfidf = TfidfVectorizer(ngram_range=(1, 2),
               min_df=3, max_df=0.9, strip_accents='ascii', use_idf=True,
               smooth_idf=True, sublinear_tf=True)

Fit-Transform the same column but with tfidf to create tfidf features with uni- and bi-grams:

In [None]:
Xtfidf = <FILL IN>


y will be a numpy array with 0 or 1:

In [None]:
y = data['insult'].values

In [None]:
data.head()

Also the train-test split can be easily performed with scikit-learn.

In [None]:
from sklearn.model_selection import train_test_split

## 4. Naive Bayes with tf features

Let's perform Train-Test split with 67% of the data for training and 33% testing:

In [None]:
Xtf_train, Xtf_test, ytf_train, ytf_test = train_test_split(Xtf, y, 
                                                            test_size=0.33, random_state=23)

If $y$ denotes the class and $x_1,...,x_n$ denote the features, via Bayes' Theorem, we have
$$
P(y\mid x_1,...,x_n) = \frac{P(y)P(x_1,...,x_n \mid y)}{P(x_1,...,x_n)}
$$

Using the naive independence hypothesis, 
$$
P(y\mid x_1,...,x_n) = \frac{P(y)\prod_{i=1}^{n} P(x_i \mid y)}{P(x_1,...,x_n)}
$$
which we want to maximize over $y$. Since the denominator does not depend on $y$, it is enough to maximize the numerator, so Naive Bayes will predict

$\hat y =\underset{y}{\mathrm{argmax}}\; P(y)\prod_{i=1}^{n} P(x_i \mid y)$
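
The decision rule can be sketched from scratch with NumPy. This is a simplified illustration on made-up counts, assuming a multinomial model with add-one smoothing; it works in log-space, where the product becomes a sum and very small probabilities do not underflow.

```python
import numpy as np

# Hypothetical toy term-count matrix: 4 documents, 3 terms
X = np.array([[2, 1, 0],
              [1, 2, 0],
              [0, 1, 2],
              [0, 0, 3]])
y = np.array([0, 0, 1, 1])
classes = np.unique(y)

# log P(y): relative frequency of each class
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# log P(x_i | y): add-one smoothed share of term i among all terms of class y
term_counts = np.array([X[y == c].sum(axis=0) for c in classes]) + 1
log_likelihood = np.log(term_counts / term_counts.sum(axis=1, keepdims=True))

def predict(X_new):
    # log P(y) + sum_i x_i * log P(x_i | y), maximized over y
    scores = X_new @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]

pred = predict(np.array([[1, 1, 0], [0, 0, 2]]))
print(pred)
```

The first test document uses the terms typical of class 0 and the second the term typical of class 1, so the predictions are 0 and 1 respectively.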

Since TF values are integers, we will use a multinomial Naive Bayes. Again, notice the order of the steps:

In [None]:
from sklearn.naive_bayes import MultinomialNB

Step 1: construct an instance of the class

In [None]:
mnb = MultinomialNB()

Step 2: fit

In [None]:
mnb.fit(Xtf_train,ytf_train)

Step 3: predict

In [None]:
ytf_pred_train = mnb.predict(Xtf_train)
ytf_pred = mnb.predict(Xtf_test)

Step 4: Evaluate
We will use accuracy, namely the proportion of correct predictions. 
We will compute the accuracy on both the training and the test set, to find out whether our model is overfitting.

In [None]:

(ytf_pred_train == ytf_train).sum() / len(ytf_train)

In [None]:
(ytf_pred == ytf_test).sum() / len(ytf_test)

80% accuracy is pretty good and training and test accuracy are similar.

## 5. Naive Bayes with tfidf features

Let's do the same with TF-IDF features and bigrams

In [None]:
Xtfidf_train, Xtfidf_test, ytfidf_train, ytfidf_test = train_test_split(Xtfidf, y, 
                                                            test_size=0.33, random_state=23)

This time we will use Gaussian Naive Bayes, since tf-idf values are real numbers.

In [None]:
from sklearn.naive_bayes import GaussianNB

We will repeat the above steps

In [None]:
gnb = GaussianNB()

In [None]:
gnb.fit(Xtfidf_train.toarray(),ytfidf_train) #dense is required

In [None]:
ytfidf_pred_train = gnb.predict(Xtfidf_train.toarray())
ytfidf_pred = gnb.predict(Xtfidf_test.toarray())


Same as we have done before, compute the train and test accuracy:

In [None]:
<FILL IN> / len(ytfidf_train)

In [None]:

<FILL IN> / len(ytfidf_test)

The training accuracy has increased, but the model overfits. I would prefer the first one.

## 6. Naive Bayes and Logistic regression

This approach is inspired by the article [Baselines and Bigrams: Simple, Good Sentiment and Topic Classification](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf).
We will compute Naive Bayes features and use logistic regression on top of them.

Notice that the decision function for naive Bayes can be rewritten as "predict class $0$ if the odds of $(0 \mid \mathbf {x} )$ exceed those of $(1\mid \mathbf {x} )$". 
Expressing this in log-space gives:

$${\displaystyle \log {\frac {p(0\mid \mathbf {x} )}{p(1 \mid \mathbf {x} )}}=\log p(0\mid \mathbf {x} )-\log p(1 \mid \mathbf {x} )>0}$$

We will multiply $X$ by a per-feature version of the ratio above, called the log-count ratio, to get Naive Bayes adjusted TF-IDF features.
Intuitively, we initially had the TF features. Then, we multiplied them by the IDF to put a larger weight on rare words. After that, we multiply these by the log-count ratio, which further increases the weights that matter most for the Naive Bayes classifier's decisions.
We will use a Logistic Regression classifier on these NB features.

We will compute count vectors _p1 = sum of all feature vectors with label 1_ and _p0 = sum of all feature vectors with label 0_

In [None]:
def pr(y_i, y):
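    '''
    Compute the smoothed count vector for class y_i.
    Relies on the global sparse feature matrix x defined in the next cell.
    ARGUMENTS: y_i, int, 0 or 1 and y, np.array of labels
    RETURNS: 1 x n_features matrix of smoothed feature sums
    '''
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)  # add-one smoothing, so no feature ends up with a zero count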

In [None]:
x = Xtfidf_train
r = np.log(pr(1,ytfidf_train) / pr(0,ytfidf_train)) # log count ratio
x_nb = x.multiply(r) #new X
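
To see what the log-count ratio does, here is a sketch on a small dense matrix. The data and the names `X_toy`, `y_toy` and `smoothed_mean` are made up for this illustration; the smoothing mirrors the `pr` function above.

```python
import numpy as np

# Hypothetical toy feature matrix: 4 documents, 3 features
X_toy = np.array([[1., 0., 1.],
                  [2., 0., 0.],
                  [0., 1., 1.],
                  [0., 3., 0.]])
y_toy = np.array([1, 1, 0, 0])

def smoothed_mean(X, y, label):
    # Same smoothing as pr above: (feature sum + 1) / (number of documents + 1)
    return (X[y == label].sum(axis=0) + 1) / ((y == label).sum() + 1)

r_toy = np.log(smoothed_mean(X_toy, y_toy, 1) / smoothed_mean(X_toy, y_toy, 0))
X_nb_toy = X_toy * r_toy  # scale every column by its log-count ratio

print(r_toy)
```

The ratio is positive for the first feature, which appears mostly in class 1, negative for the second, which appears mostly in class 0, and zero for the third, which is equally common in both classes and therefore carries no information.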

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:

nblog = LogisticRegression(C=4, dual=True, solver='liblinear')  # the dual formulation is only implemented for the liblinear solver



Fit the model

Fit the [Logistic regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model 
using the above instance and the datasets __x_nb__ and __ytfidf_train__

In [None]:
<FILL IN>

Predict and evaluate

In [None]:
y_nbtfidf_pred_train = <FILL IN>
(y_nbtfidf_pred_train == ytfidf_train).sum() / len(ytfidf_train)

In [None]:
y_nbtfidf_pred = nblog.predict(Xtfidf_test.multiply(r))
(y_nbtfidf_pred == ytfidf_test).sum() / len(ytfidf_test)

Congratulations! You should get a test accuracy of about 0.92, which is comparable to much more complex text classification models!

## 7. Evaluation

Let's introduce more evaluation metrics, besides accuracy.

In [None]:
from sklearn.metrics import f1_score, roc_curve, auc

There are some issues in using accuracy to assess classification. Let's say we are building a model for the automatic interpretation of an HIV test. Since the prevalence of HIV in Europe is less than 1%, the following function:

```python
def test(features):
    """
    ARGUMENTS features, set of features encoding microscopic sample
    RETURNS 0 for HIV negative, 1 for HIV positive
    """
    return 0
```

is an automatic HIV tester with 99% accuracy, although it never detects a single positive case. The __f1-score__ addresses this. It is closely related to two concepts: precision and recall.
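
To make the pitfall concrete, the sketch below evaluates the constant tester on a made-up population. The population size and the prevalence of exactly 1% are assumptions chosen for round numbers.

```python
# Hypothetical screening population: 10 positives among 1000 people (1% prevalence)
y_true = [1] * 10 + [0] * 990
y_pred = [0] * len(y_true)  # the constant tester: always answers "negative"

# Accuracy: share of people for whom the answer is correct
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of the actual positives that were detected
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(accuracy, recall)
```

The accuracy is 0.99 while not a single positive case is found, which is exactly what accuracy alone fails to show.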

Assume we are building an owl detection algorithm that identifies, in the following picture, the four red encircled owls:

<img src="images/cat-and-owls.jpg" width="600">

* 3 out of 4 identified owls are indeed owls; we say that our algorithm's precision is 3/4
* our algorithm has identified 3 out of 5 owls; we say that it has a recall of 3/5

A mean of the two measures would be a better candidate than the accuracy. Since both precision and recall are ratios, the harmonic mean is a better choice than the arithmetic mean:

$$f_1 =2 \frac{p \cdot r}{p+r}$$
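
Plugging the owl example into the formula gives a quick worked check. The counts below assume 4 circled objects of which 3 are owls, and 5 owls in the picture in total.

```python
true_positives = 3   # circled objects that are owls
false_positives = 1  # circled objects that are not owls
false_negatives = 2  # owls that were not circled

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

f1 = 2 * precision * recall / (precision + recall)
arithmetic_mean = (precision + recall) / 2

print(precision, recall, f1, arithmetic_mean)
```

The harmonic mean (about 0.667) is lower than the arithmetic mean (0.675): it is pulled towards the smaller of the two values, so a model cannot get a high f1-score by doing well on only one of precision and recall.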

Let's compare the f1 scores of the three models:

In [None]:
print('the f1 score using Multinomial Naive Bayes with TF features is %0.3f' % f1_score(ytf_test, ytf_pred))
print('the f1 score using Gaussian Naive Bayes with TFIDF features is %0.3f' % f1_score(ytfidf_test, ytfidf_pred))
print('the f1 score using Logistic Regression with Naive Bayes features is %0.3f' % f1_score(ytfidf_test, y_nbtfidf_pred))

One more tool used in evaluating a classification algorithm is the Receiver Operating Characteristic (ROC) curve. Let's assume we go through the test examples one by one, from the one the classifier scores as most likely to be a 1 downwards, and label each of them with 1. If an example really is a 1, it is a true positive; if not, it is a false positive. At each step, we can compute the true positive rate and the false positive rate, which gives one point of the curve.
A random classifier will give the red diagonal line; a perfect one will pass through the point $(0,1)$. The greater the area under the curve (the closer to 1), the better the classifier is.

In [None]:
fpr, tpr, threshold = roc_curve(ytfidf_test, y_nbtfidf_pred)
roc_auc = auc(fpr, tpr)

In [None]:
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
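
`roc_curve` is usually fed continuous scores rather than hard 0/1 predictions; with hard predictions the curve has a single interior point. The sketch below shows the difference on made-up labels and scores.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical labels and continuous classifier scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
hard_labels = (scores >= 0.5).astype(int)  # scores thresholded at 0.5

# One ROC point per useful threshold when scores are used
fpr_s, tpr_s, _ = roc_curve(y_true, scores)
# Only the two corners and one interior point with hard labels
fpr_h, tpr_h, _ = roc_curve(y_true, hard_labels)

auc_scores = auc(fpr_s, tpr_s)
print(len(fpr_s), len(fpr_h), auc_scores)
```

For the model in this notebook, continuous scores could probably be obtained with `nblog.decision_function(Xtfidf_test.multiply(r))` and passed to `roc_curve` in place of `y_nbtfidf_pred`.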

## 8. Use the model for our own predictions

Let's take an insult from [the French taunting scene in the movie Monty Python and the Holy Grail](https://www.youtube.com/watch?v=M9DCAFUerzs) and a non-insult, and see how well the classifier performs.

In [None]:
test1 = 'Your mother was a hamster, and your father smelt of elderberries'
test2 = 'roses are red, violets are blue, this was your homework, work done by you'

In [None]:
testlist = [test1, test2]

In [None]:
X_example = tfidf.transform(testlist)

In [None]:
nblog.predict(X_example.multiply(r))

Feel free to replace test1 and test2 with sentences of your choice.

Remember the scikit-learn workflow:
* class instance
* fit
* predict
* evaluate
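
The four steps can be seen end to end in the following compact sketch. The tiny training and test sets are made up and only serve to show the order of the calls.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hypothetical toy data, only to show the order of the steps
train_docs = ['you are an idiot', 'you stupid fool',
              'have a nice day', 'thank you for the help']
train_labels = [1, 1, 0, 0]
test_docs = ['stupid idiot', 'nice help']
test_labels = [1, 0]

vectorizer = CountVectorizer()   # class instance
model = MultinomialNB()          # class instance

X_train = vectorizer.fit_transform(train_docs)  # fit (and transform) the features
model.fit(X_train, train_labels)                # fit the classifier

pred = model.predict(vectorizer.transform(test_docs))  # predict
print(accuracy_score(test_labels, pred))               # evaluate
```

Note that the vectorizer is fitted on the training documents only and then reused, with `transform`, on the test documents, just as `tfidf.transform(testlist)` is used above.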