# Insult Classification

In this exercise, we would like to filter out insulting comments on a web forum. 

To train our models, we have a list of historic comments with a judgement of whether they're insulting or not.

In [1]:
import pandas as pd
import numpy as np
path_to_insults = '/Users/Vignesh/Dev/ML/students/data/'
data = pd.read_csv(path_to_insults + 'train-utf8.csv')
data.head(2)

Unnamed: 0,Insult,Date,Comment
0,1,20120618192155Z,You fuck your dad.
1,0,20120528192215Z,i really don't understand your point. It seem...


In [2]:
print ("%d comments, of which %d insults (%d%%)" % \
    (len(data), data.Insult.sum(), 100 * data.Insult.mean()))

3947 comments, of which 1049 insults (26%)


### Looking for known bad words

One way to do this is to load Google's bad word list and flag comments that contain one or more of its words.

- Load `google_badlist.txt` from `data/insults/`
- Add a column to `data` with a flag (0 or 1) if the comment contains a bad word
- Compute the accuracy of this method - does this look good?
- What would a naive classifier's score be (i.e., always predicting 0 or 1)?
- Also compute the precision, recall, F1 score and AUC score
- What is your verdict?

In [3]:
filename = path_to_insults + 'google_badlist.txt'
filename

'/Users/Vignesh/Dev/ML/students/data/google_badlist.txt'

Creating a **badwords set** by reading from the file '/Users/Vignesh/Dev/ML/students/data/google_badlist.txt'

In [4]:
with open(filename) as f:
    content = f.readlines()
content = [x.strip() for x in content]
badwords = set(content)

Function to check whether a comment contains a bad word or not

In [5]:
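def has_bad_words(comments):
    # Split on whitespace only: matching is case-sensitive and punctuation stays attached
    words = comments.strip().split()
    # 1 if at least one token is in the bad-word set, else 0
    return int(any(w in badwords for w in words))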
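Because `has_bad_words` splits on whitespace only, a token such as `IDIOT!` would not match a list entry `idiot`. A minimal sketch of a more tolerant matcher, assuming the list holds lowercase single words; the tiny `toy_badwords` set and the name `has_bad_words_normalized` are hypothetical stand-ins, not part of the original notebook:

```python
import re

# Hypothetical stand-in for the set loaded from google_badlist.txt
toy_badwords = {"idiot", "moron"}

def has_bad_words_normalized(comment, badwords=toy_badwords):
    """Return 1 if any lowercased word in `comment` is in `badwords`, else 0."""
    # \w+ keeps letters, digits and underscores, so surrounding punctuation is dropped
    tokens = re.findall(r"\w+", comment.lower())
    return int(any(t in badwords for t in tokens))

print(has_bad_words_normalized("You IDIOT!"))
print(has_bad_words_normalized("Thanks for the answer."))
```

Whether this changes the scores on the real data depends on the contents of the word list, so it would need to be re-evaluated rather than assumed to help.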

In [6]:
result = [has_bad_words(x) for x in data['Comment']]

Adding a new column, **Badwords**, to the data

In [7]:
data['Badwords'] = np.array(result)

In [8]:
data.shape

(3947, 4)

In [9]:
data.head(4)

Unnamed: 0,Insult,Date,Comment,Badwords
0,1,20120618192155Z,You fuck your dad.,1
1,0,20120528192215Z,i really don't understand your point. It seem...,0
2,0,,A majority of Canadians can and has been wrong...,0
3,0,,listen if you dont wanna get married to a man ...,0


In [27]:
from sklearn import metrics
print ('Accuracy = {0:5.4f}%'.format(np.mean(data['Insult'] == data['Badwords'])*100))

Accuracy = 70.8133%


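##### The Naive Classifier - (always predicting Insult or always predicting Not Insult)

* A naive classifier ignores the comment and **always predicts 0 or always predicts 1**. With 1049 insults among 3947 comments, always predicting 0 (Not Insult) is correct for 2898 comments (73.4%), and always predicting 1 (Insult) is correct for 1049 comments (26.6%).
* The bad-word classifier predicts Insult if there is a bad word in the comment and Not Insult otherwise. Its accuracy of 70.8% is below the always-0 baseline.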
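For comparison, the baseline the exercise asks about (always predicting a single class) can be computed directly from the class counts printed earlier. A minimal sketch using only those two numbers:

```python
# Class counts taken from the notebook output above
n_comments = 3947
n_insults = 1049

# Always predicting 0 is right for every non-insult; always predicting 1 for every insult
acc_always_0 = (n_comments - n_insults) / n_comments
acc_always_1 = n_insults / n_comments

print('Always 0: {0:5.4f}%'.format(acc_always_0 * 100))
print('Always 1: {0:5.4f}%'.format(acc_always_1 * 100))
```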

In [28]:
metrics.confusion_matrix(data['Insult'], data['Badwords'])

array([[2510,  388],
       [ 764,  285]])
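The accuracy, precision and recall for the insult class can be read off this matrix by hand. A short sketch, assuming scikit-learn's convention that rows are true labels and columns are predicted labels:

```python
# Confusion matrix from the cell above: rows are true labels, columns are predictions
tn, fp = 2510, 388
fn, tp = 764, 285

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_insult = tp / (tp + fp)   # of the comments flagged, how many are insults
recall_insult = tp / (tp + fn)      # of the insults, how many were flagged

print('Accuracy = {0:5.4f}'.format(accuracy))
print('Precision (insult) = {0:5.4f}'.format(precision_insult))
print('Recall (insult) = {0:5.4f}'.format(recall_insult))
```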

In [29]:
print(metrics.classification_report(data['Insult'], data['Badwords']))

             precision    recall  f1-score   support

          0       0.77      0.87      0.81      2898
          1       0.42      0.27      0.33      1049

avg / total       0.68      0.71      0.69      3947



In [30]:
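# average=None returns one score per class (index 0 = not insult, index 1 = insult);
# taking np.mean of these arrays later gives the macro (unweighted) average
precision = metrics.precision_score(data['Insult'], data['Badwords'],average=None)
recall = metrics.recall_score(data['Insult'], data['Badwords'],average=None)
f1 = metrics.f1_score(data['Insult'], data['Badwords'],average=None)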

In [31]:
fpr, tpr, thresholds = metrics.roc_curve(data['Insult'], data['Badwords'])
auc = metrics.auc(fpr, tpr)

In [32]:
print ('Accuracy = {0:5.4f}%'.format(np.mean(data['Insult'] == data['Badwords'])*100))
print ('Precision = {0:5.4f}'.format(np.mean(precision)))
print ('Recall = {0:5.4f}'.format(np.mean(recall)))
print ('F1 = {0:5.4f}'.format(np.mean(f1)))
print ('AUC = {0:5.4f}'.format(auc))

Accuracy = 70.8133%
Precision = 0.5951
Recall = 0.5689
F1 = 0.5722
AUC = 0.5689


#### Summary

For a random classifier the AUC is 0.5. Here the AUC is 0.5689, which is close to the random classifier's value.
Thus, a model that flags a comment as an insult based only on the presence of bad words is not a good predictor.
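With hard 0/1 predictions the ROC curve has a single interior point, so the area under it reduces to the mean of the true-positive and true-negative rates. That is why the AUC above equals the class-averaged recall. A short check, using the confusion-matrix counts reported earlier:

```python
tn, fp, fn, tp = 2510, 388, 764, 285

tpr = tp / (tp + fn)   # recall on the insult class
fpr = fp / (fp + tn)   # share of non-insults wrongly flagged

# Trapezoidal area under the points (0, 0), (fpr, tpr), (1, 1)
auc = fpr * tpr / 2 + (1 - fpr) * (tpr + 1) / 2

print('AUC = {0:5.4f}'.format(auc))
print('Mean of TPR and TNR = {0:5.4f}'.format((tpr + (1 - fpr)) / 2))
```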

---

### Learning bad words on the fly

Another way of doing this is to learn the insulting words on the fly using `CountVectorizer`.

Please refer to the scikit-learn tutorial at 'http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html' if you need some help.

Here is what you need to do:

- Import `CountVectorizer` from `sklearn.feature_extraction.text`
- Train the `CountVectorizer` on the insults and create a feature set $X$ representing words in the comments
- Train `MultinomialNB` and `BernoulliNB` from `sklearn.naive_bayes` on the new feature set $X$
- Using cross-validation, compute the accuracy, precision, recall, F1 and AUC of your model
- What is your verdict?
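To make the feature set $X$ concrete, here is a small sketch of what `CountVectorizer` produces with its default settings. The three toy comments are hypothetical and not taken from the data set:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy comments, not taken from the data set
toy_comments = ["you are an idiot",
                "thank you for the answer",
                "you idiot you"]

vect = CountVectorizer()
# Sparse matrix with one row per comment and one column per vocabulary word
X_toy = vect.fit_transform(toy_comments)

print(sorted(vect.vocabulary_))   # learned vocabulary
print(X_toy.toarray())            # word counts per comment
```

Each column corresponds to one learned word, and each cell holds how often that word occurs in the comment.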

NOTE: The F1 score is another useful score to compute when one of the two classes is very rare. We didn't go over it in class, but it's basically the harmonic mean of precision and recall and goes from 0 (min) to 1 (max). You can see more here: 'https://en.wikipedia.org/wiki/F1_score'
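Written out, with $P$ for precision and $R$ for recall, and using the rounded insult-class values from the classification report above ($P = 0.42$, $R = 0.27$) as a worked example:

$$
F_1 = \frac{2 P R}{P + R} = \frac{2 \times 0.42 \times 0.27}{0.42 + 0.27} = \frac{0.2268}{0.69} \approx 0.33
$$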

In [33]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

# Multinomial
----

* Splitting the data into training (80%) and test (20%) sets using the `train_test_split` method
* Using `Pipeline` and `CountVectorizer`
* Multinomial Naive Bayes is chosen as the classifier, and the model is trained on the training set
* The metrics of the model are checked on the test set
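Since only about a quarter of the comments are insults, one option worth considering is a stratified split, which keeps the class share the same in both parts. This is a sketch of the `stratify` argument of `train_test_split` on hypothetical labels (100 comments, 25 insults), not the split used in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 100 comments, 25 of them insults
y_toy = np.array([1] * 25 + [0] * 75)
X_toy = np.arange(len(y_toy))   # placeholder features, one id per comment

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# With stratify, both parts keep the 25% insult share
print(y_tr.mean(), y_te.mean())
```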


In [34]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

In [35]:
X_train, X_test, y_train, y_test = train_test_split(data['Comment'], data['Insult'], test_size=0.2, random_state=42)

In [36]:
X_train.shape, y_train.shape

((3157,), (3157,))

In [37]:
X_test.shape, y_test.shape

((790,), (790,))

In [38]:
Multi_clf = Pipeline([('vect', CountVectorizer()),
                     ('clf', MultinomialNB()), ])

In [39]:
Multi_clf = Multi_clf.fit(X_train,y_train)

In [40]:
predicted = Multi_clf.predict(X_test)
print ('Accuracy = {0:5.4f}%'.format(np.mean(predicted == y_test)*100))

Accuracy = 82.0253%


### Multinomial Cross Validation 
----

Doing cross-validation on the whole data set, assuming we have only a small data set.
Thus, the average score of the k-fold cross-validation is chosen as the metric for the chosen classifier.

In [42]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [43]:
count_vect = CountVectorizer()
X_train = count_vect.fit_transform(data['Comment'])
y = np.array(data['Insult'])
X_train.shape, y.shape

((3947, 15457), (3947,))

In [44]:
clf = MultinomialNB()

5-fold cross-validation is chosen here, and each metric is calculated per fold.


In [45]:
precision = cross_val_score(clf, X_train, y, scoring='precision',cv=5,n_jobs=1)
accuracy = cross_val_score(clf, X_train, y, scoring='accuracy',cv=5,n_jobs=1)
recall = cross_val_score(clf, X_train, y, scoring='recall',cv=5,n_jobs=1)
f1 = cross_val_score(clf, X_train, y, scoring='f1',cv=5,n_jobs=1)
roc_auc = cross_val_score(clf, X_train, y, scoring='roc_auc',cv=5,n_jobs=1)

In [46]:
print ('Accuracy = {0:5.4f}%'.format(np.mean(accuracy)*100))
print ('Precision = {0:5.4f}'.format(np.mean(precision)))
print ('Recall = {0:5.4f}'.format(np.mean(recall)))
print ('F1 = {0:5.4f}'.format(np.mean(f1)))
print ('Roc_AUC = {0:5.4f}'.format(np.mean(roc_auc)))

Accuracy = 78.6418%
Precision = 0.5903
Recall = 0.6425
F1 = 0.6149
Roc_AUC = 0.8084
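In the cells above, the `CountVectorizer` is fitted on all comments before cross-validation, so every fold sees the vocabulary of its own test comments. A possible alternative is to put the vectorizer inside a `Pipeline` and let `cross_validate` compute all five metrics in one pass. This is a sketch on a hypothetical toy corpus, so its scores say nothing about the real data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 10 insulting and 10 neutral comments
insults = ["you are an idiot number %d" % i for i in range(10)]
neutral = ["thanks for the helpful answer number %d" % i for i in range(10)]
texts = insults + neutral
labels = [1] * 10 + [0] * 10

# The vectorizer sits inside the pipeline, so its vocabulary is refit on each training fold
pipe = Pipeline([('vect', CountVectorizer()),
                 ('clf', MultinomialNB())])

metric_names = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
scores = cross_validate(pipe, texts, labels, cv=5, scoring=metric_names)

# cross_validate returns one array of per-fold scores under the key 'test_<metric>'
for name in metric_names:
    print('{0} = {1:5.4f}'.format(name, scores['test_' + name].mean()))
```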


# Bernoulli

***

* Splitting the data into training (80%) and test (20%) sets using the `train_test_split` method, assuming we have **plenty of data**
* Using `Pipeline` and `CountVectorizer`
* Bernoulli Naive Bayes is chosen as the classifier, and the model is trained on the training set
* The metrics of the model are checked on the test set
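One difference from the multinomial model is that `BernoulliNB` only looks at whether a word is present: with its default `binarize=0.0`, counts are thresholded to 0/1 before fitting and predicting. A small sketch on a hypothetical count matrix illustrates this:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical word-count matrix: 4 comments, 3 vocabulary words
X_counts = np.array([[3, 0, 1],
                     [0, 2, 0],
                     [1, 0, 0],
                     [0, 1, 4]])
y_toy = np.array([1, 0, 1, 0])

# Same model fitted on raw counts and on explicit presence/absence features
proba_counts = BernoulliNB().fit(X_counts, y_toy).predict_proba(X_counts)
X_binary = (X_counts > 0).astype(int)
proba_binary = BernoulliNB().fit(X_binary, y_toy).predict_proba(X_binary)

# The predicted probabilities agree, since counts are binarized internally
print(np.allclose(proba_counts, proba_binary))
```

Repeated words in a comment therefore carry no extra weight for `BernoulliNB`, whereas `MultinomialNB` uses the counts themselves.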

In [47]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

In [48]:
X_train, X_test, y_train, y_test = train_test_split(data['Comment'], data['Insult'], test_size=0.2, random_state=42)

In [49]:
X_train.shape, y_train.shape

((3157,), (3157,))

In [50]:
X_test.shape, y_test.shape

((790,), (790,))

In [51]:
Bernoulli_clf = Pipeline([('vect', CountVectorizer()),
                     ('clf', BernoulliNB()), ])

In [52]:
Bernoulli_clf = Bernoulli_clf.fit(X_train,y_train)
predicted = Bernoulli_clf.predict(X_test)
print ('Accuracy = {0:5.4f}%'.format(np.mean(predicted == y_test)*100))

Accuracy = 75.1899%


### Bernoulli Cross Validation
----

Doing cross-validation on the whole data set, assuming we have only a **small data set**.
Thus, the average score of the k-fold cross-validation is chosen as the metric for the chosen classifier.

In [53]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn import metrics

In [54]:
count_vect = CountVectorizer()
X_train = count_vect.fit_transform(data['Comment'])
y = np.array(data['Insult'])
X_train.shape, y.shape

((3947, 15457), (3947,))

In [55]:
clf = BernoulliNB()

5-fold cross-validation is chosen here, and each metric is calculated per fold.


In [56]:
precision = cross_val_score(clf, X_train, y, scoring='precision',cv=5,n_jobs=1)
accuracy = cross_val_score(clf, X_train, y, scoring='accuracy',cv=5,n_jobs=1)
recall = cross_val_score(clf, X_train, y, scoring='recall',cv=5,n_jobs=1)
f1 = cross_val_score(clf, X_train, y, scoring='f1',cv=5,n_jobs=1)
roc_auc = cross_val_score(clf, X_train, y, scoring='roc_auc',cv=5,n_jobs=1)

In [57]:
print ('Accuracy = {0:5.4f}%'.format(np.mean(accuracy)*100))
print ('Precision = {0:5.4f}'.format(np.mean(precision)))
print ('Recall = {0:5.4f}'.format(np.mean(recall)))
print ('F1 = {0:5.4f}'.format(np.mean(f1)))
print ('Roc_AUC = {0:5.4f}'.format(np.mean(roc_auc)))

Accuracy = 75.6273%
Precision = 0.6497
Recall = 0.1792
F1 = 0.2808
Roc_AUC = 0.8308


### Summary

---

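The Multinomial Naive Bayes classifier has higher accuracy than the Bernoulli Naive Bayes classifier, both on the held-out test set (82.0% versus 75.2%) and under 5-fold cross-validation (78.6% versus 75.6%).
K-fold cross-validation gives an estimate of model performance on the given data set. Under it, Bernoulli Naive Bayes has a slightly higher ROC AUC (0.8308 versus 0.8084), but its recall is only 0.1792 against 0.6425, so its F1 score is 0.2808 against 0.6149. Overall, the Multinomial Naive Bayes classifier performs better than the Bernoulli Naive Bayes classifier.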