# Naive Bayes Classification

### This notebook tests two word vectorization techniques, Count and TF-IDF, and three variations of the Naive Bayes classification algorithm: Multinomial, Gaussian and Bernoulli Naive Bayes. We used a dataset of WhatsApp messages from customers of an auto insurance company in Latin America. The messages were classified into two categories: consultations (*consulta*) and complaints (*reclamos*).

In [97]:
import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from nltk.corpus import stopwords

In [98]:
# uncomment on the first run to download the NLTK stop-word lists
#nltk.download('stopwords')
stop_words = stopwords.words("spanish")

In [99]:
df = pd.read_csv('mattermost_etiquetado.csv')

### Set up our test and train data sets

In [100]:
# change class type to binary
df.loc[df["type"]=='consulta',"type"]=0
df.loc[df["type"]=='reclamo',"type"]=1

In [101]:
# Separate messages and classes into two Series
df_x = df["text"]
df_y= df["type"]

In [102]:
#train-test-split the data (this is a commonly used line of code)
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)
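Because complaints are much rarer than consultations in this data, a plain random split can leave the test set with an unrepresentative share of complaints. The following is a minimal sketch, on made-up labels, of the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both parts. It is an optional variation, not the split used in this notebook; the arrays `X` and `y` are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 40 consultas (0) and 10 reclamos (1)
y = np.array([0] * 40 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, one column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=4, stratify=y
)

# Share of class 1 in the training part and in the test part
print(y_tr.mean(), y_te.mean())
```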

In [103]:
#for TYPE we need our binary values to be integers
y_train=y_train.astype('int')
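The relabelling and the integer cast above can also be done in one step. This is a small sketch on a toy DataFrame (the rows and the `label` column name are hypothetical, not taken from the real data set) using `Series.map` followed by `astype`.

```python
import pandas as pd

# Toy stand-in for the labelled messages (hypothetical rows, not the real data)
toy = pd.DataFrame({
    "text": ["Quería averiguar x un seguro", "Nadie me responde el reclamo"],
    "type": ["consulta", "reclamo"],
})

# Map the two class names to 0/1 and cast to an integer dtype in one step
label_map = {"consulta": 0, "reclamo": 1}
toy["label"] = toy["type"].map(label_map).astype("int")

print(toy["label"].tolist())  # [0, 1]
```

Applied before the train/test split, this would give integer labels to both `y_train` and `y_test`.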

# Vectorization: turn words into numbers

### Which performs better? Count Vectorizer or TF-IDF Vectorizer?

### Count Vectorizer

In [104]:
cv = CountVectorizer()

In [105]:
x_traincv=cv.fit_transform(x_train)

In [106]:
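# vectorize the test set with the vocabulary fitted on the training set
x_testcv = cv.transform(x_test)
# transform both sparse matrices to dense arrays
train_arrai = x_traincv.toarray()
test_arrai  = x_testcv.toarray()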

In [107]:
#feature extraction of the data
#cv.get_feature_names()

In [108]:
cv.inverse_transform(train_arrai[0])

[array(['averiguar', 'quería', 'seguro', 'un'], dtype='<U15')]

### TF-IDF Vectorizer

In [109]:
cv1 = TfidfVectorizer(min_df=1,stop_words=stop_words)

In [110]:
# fit the vectorizer on the training messages and vectorize them in one step with "fit_transform()"
x_traincv1=cv1.fit_transform(x_train)

In [111]:
arrai1 = x_traincv1.toarray()
arrai1

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [112]:
#feature extraction of the data
#cv1.get_feature_names()

In [113]:
cv1.inverse_transform(arrai1[0])

[array(['averiguar', 'quería', 'seguro'], dtype='<U15')]

In [114]:
x_train.iloc[0]

'Quería averiguar x un seguro'
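The two outputs above differ only in the word "un", and neither contains "x". A short sketch on this single message shows why: with default settings both vectorizers lowercase the text and keep only tokens of two or more characters, and the TF-IDF vectorizer here additionally drops stop words. The one-word stop list `["un"]` is a stand-in assumption for the NLTK Spanish list used in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

message = ["Quería averiguar x un seguro"]

# Default settings: lowercase, keep tokens of two or more word characters
count_vec = CountVectorizer()
count_vec.fit(message)
count_terms = sorted(count_vec.vocabulary_)

# Same tokenization plus a stop-word list; ["un"] is a tiny stand-in
# for the NLTK Spanish stop words used in the notebook
tfidf_vec = TfidfVectorizer(stop_words=["un"])
tfidf_vec.fit(message)
tfidf_terms = sorted(tfidf_vec.vocabulary_)

print(count_terms)  # ['averiguar', 'quería', 'seguro', 'un']
print(tfidf_terms)  # ['averiguar', 'quería', 'seguro']
```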

### Let's test the two vectorizations with the NB classification algorithm

### Count vectorizer w/ NB Classifier

In [115]:
mnb=MultinomialNB()

In [116]:
# for applying the Naive Bayes algorithm use "fit()"
mnb.fit(train_arrai, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [117]:
# the test data only needs "transform()", which reuses the vocabulary learned from the training data
x_testcv = cv.transform(x_test)

In [118]:
#predictions
pred = mnb.predict(x_testcv)
pred

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1])

In [119]:
actual=np.array(y_test)
actual

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [120]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"accuracy using Count Vectorizer.")

We have 49 correct predictions out of 55 .
Total 89.0909090909091 accuracy using Count Vectorizer.
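The counting loop can be cross-checked with `sklearn.metrics`. The sketch below rebuilds the two arrays printed above from the positions of their 1 entries and computes the accuracy together with a confusion matrix, which also shows where the six errors fall. The names `n`, `acc` and `cm` are chosen here for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

n = 55
# Positions of class 1 (reclamo), read off the two arrays printed above
actual = np.zeros(n, dtype=int)
actual[[8, 14, 28, 33, 37, 42, 52, 54]] = 1
pred = np.zeros(n, dtype=int)
pred[[6, 33, 52, 54]] = 1

acc = accuracy_score(actual, pred)
cm = confusion_matrix(actual, pred)  # rows: actual class, columns: predicted class

print(round(acc * 100, 2))  # 89.09
print(cm.tolist())          # [[46, 1], [5, 3]]
```

Read row by row: 46 consultations were classified correctly and 1 was flagged as a complaint, while 5 of the 8 complaints were missed and 3 were caught.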


### TFIDF Vectorizer w/ NB classification

In [121]:
mnb.fit(x_traincv1, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [122]:
x_testcv1 = cv1.transform(x_test)

In [123]:
pred1 = mnb.predict(x_testcv1)
pred1

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [124]:
actual1=np.array(y_test)
actual1

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [125]:
count1=0
for i in range (len(pred1)):
    if pred1[i]==actual1[i]:
        count1=count1+1
print("We have",count1,"correct predictions out of",len(pred1))
print("Total",(count1/len(pred1))*100,"accuracy using Tfidf Vectorizer")

We have 47 correct predictions out of 55
Total 85.45454545454545 accuracy using Tfidf Vectorizer


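### We have our answer: The Naive Bayes classifier performed better with the *Count Vectorizer* (89 percent accuracy) than with the TF-IDF Vectorizer (85 percent accuracy). Note that the TF-IDF model predicted *consulta* (0) for every test message, so its 47 correct predictions are simply the 47 *consulta* messages in the test set. The TF-IDF run also removed Spanish stop words while the Count run did not, so the comparison is not strictly like for like. As our data set grows, the vectorizer we want to use may change as well.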
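With imbalanced classes, an accuracy figure is easier to judge next to the accuracy of always guessing the most common class. The sketch below, built from the arrays printed in the TF-IDF section, computes that baseline and the recall on complaints. The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

n = 55
actual = np.zeros(n, dtype=int)
actual[[8, 14, 28, 33, 37, 42, 52, 54]] = 1  # the 8 reclamos in the test set
pred_tfidf = np.zeros(n, dtype=int)           # TF-IDF run: every message predicted as consulta

# Accuracy of always guessing the majority class
baseline = max(actual.mean(), 1 - actual.mean())
tfidf_acc = accuracy_score(actual, pred_tfidf)
tfidf_recall = recall_score(actual, pred_tfidf)  # share of reclamos that were found

print(round(baseline * 100, 2))   # 85.45
print(round(tfidf_acc * 100, 2))  # 85.45
print(tfidf_recall)               # 0.0
```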

### Now that we know the Count Vectorizer performs better, let's try a couple of variations of the NB algorithm

### Gaussian Naive Bayes

In [126]:
gnb=GaussianNB()

In [127]:
# for applying the Naive Bayes algorithm use "fit()"
gnb.fit(train_arrai, y_train)
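# test data use "transform()", then convert to a dense array because GaussianNB does not accept sparse input
x_testcv = cv.transform(x_test)
test_arrai = x_testcv.toarray()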

In [128]:
#predictions
pred = gnb.predict(test_arrai)
pred

array([0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

In [129]:
actual=np.array(y_test)
actual

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [130]:
# score the Gaussian NB predictions (pred) against the actual labels
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred))
print("Total",round((count/len(pred))*100,2),"accuracy using Gaussian NB")

We have 42 correct predictions out of 55
Total 76.36 accuracy using Gaussian NB


### Bernoulli Naive Bayes

In [131]:
bnb=BernoulliNB()

In [132]:
# for applying the Naive Bayes algorithm use "fit()"
bnb.fit(train_arrai, y_train)
# test data use "transform()", then convert to a dense array to match the training input
x_testcv = cv.transform(x_test)
test_arrai = x_testcv.toarray()

In [133]:
#predictions
pred = bnb.predict(test_arrai)
pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [134]:
actual=np.array(y_test)
actual

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [135]:
# score the Bernoulli NB predictions (pred) against the actual labels
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred))
print("Total",(count/len(pred))*100,"accuracy using Bernoulli NB")

We have 47 correct predictions out of 55
Total 85.45454545454545 accuracy using Bernoulli NB


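### Conclusion: With the Count Vectorizer, Multinomial NB performs best (49 of 55 correct, 89 percent). Bernoulli NB predicts *consulta* for every test message, so its 47 of 55 (85 percent) only matches the share of *consulta* messages in the test set. Comparing the Gaussian NB `pred` array with `actual` element by element gives 42 of 55 correct (76 percent), the lowest of the three.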
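The three evaluations repeat the same fit, predict and score steps, so they can be written as one loop that keeps each model's predictions paired with its own score. This is a minimal sketch on a hypothetical toy corpus (the messages and labels below are invented for illustration and are not from the real data set), so its scores say nothing about the real messages.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy messages (0 = consulta, 1 = reclamo), not the real data set
train_text = ["quiero cotizar un seguro", "cuanto cuesta el seguro",
              "nadie responde mi reclamo", "pésimo servicio quiero reclamar"]
train_y = [0, 0, 1, 1]
test_text = ["quiero un seguro", "reclamo por el servicio"]
test_y = [0, 1]

vec = CountVectorizer()
X_train = vec.fit_transform(train_text).toarray()  # dense, as GaussianNB requires
X_test = vec.transform(test_text).toarray()        # reuse the training vocabulary

models = {
    "Multinomial": MultinomialNB(),
    "Gaussian": GaussianNB(),
    "Bernoulli": BernoulliNB(),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, train_y)
    pred = model.predict(X_test)  # predictions of this model only
    scores[name] = accuracy_score(test_y, pred)

for name, score in scores.items():
    print(name, round(score * 100, 2))
```

Because `pred` is created and scored inside the same loop iteration, a score cannot accidentally be computed from another model's predictions.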