## TEXT CLASSIFIER TESTS
#### Multinomial Naive Bayes, Bernoulli Naive Bayes, Gaussian Naive Bayes, Logistic Regression, Stochastic Gradient Descent (SGD), Support Vector Classifier (SVC), Linear Support Vector Classifier, and SVC with control of the number of support vectors (NuSVC)

In [1]:
import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from nltk.corpus import stopwords

In [2]:
nltk.download('stopwords')
stop_words = stopwords.words("spanish")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/brandonjanes/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
#read dataset from csv
df = pd.read_csv('mattermost_etiquetado.csv')

In [4]:
# change type to binary: 'consulta' -> 0, 'reclamo' -> 1
df.loc[df["type"]=='consulta',"type"]=0
df.loc[df["type"]=='reclamo',"type"]=1

In [5]:
df_x = df["text"]
df_y = df["type"]

In [6]:
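# share of 'reclamos' among all messages
reclamo_total = df_y.sum()
total = df["type"].count()
# float() avoids Python 2 integer division, which truncated the recorded result below to 0
average = float(reclamo_total) / total
print(average*100,"percent of the messages are 'reclamos'.")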

(0, "percent of the messages are 'reclamos'.")


### If the share of 'reclamos' is very low, raw accuracy becomes a misleading measure of how good a classifier is. For example, if only 10 percent of the messages are 'reclamos', a dumb classifier that predicts 'consulta' every time would still be 90 percent accurate.
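To make that example concrete, here is a minimal sketch of the "dumb" baseline. The labels are made up (10 'reclamos' out of 100 messages) and are not taken from our dataset.

```python
# Hypothetical labels: 1 = 'reclamo', 0 = 'consulta' (illustrative only)
labels = [1] * 10 + [0] * 90

# A "dumb" classifier that always predicts the majority class 'consulta'
always_consulta = [0] * len(labels)

# Accuracy looks good even though no 'reclamo' is ever detected
correct = sum(p == y for p, y in zip(always_consulta, labels))
baseline_accuracy = correct / float(len(labels))
caught_reclamos = sum(p == 1 and y == 1 for p, y in zip(always_consulta, labels))

print("baseline accuracy:", baseline_accuracy)
print("'reclamos' caught:", caught_reclamos)
```

Any real classifier should be compared against this number rather than against zero.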

In [7]:
# split the data: 80 percent for training, 20 percent for testing (fixed seed so the split is reproducible)
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)
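Because 'reclamos' are rare, a purely random split can leave the test set with more or fewer of them than the full dataset has. One option, sketched here on made-up stand-ins for `df_x` and `df_y`, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. The notebook itself does not use it, so the results recorded below come from the plain split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for df_x / df_y: 100 messages, 10 of them 'reclamos'
texts = np.array(["mensaje %d" % i for i in range(100)])
labels = np.array([1] * 10 + [0] * 90)

# stratify=labels keeps the 10 percent 'reclamo' share in both splits
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=4, stratify=labels)

print("reclamos in train:", y_tr.sum(), "of", len(y_tr))
print("reclamos in test: ", y_te.sum(), "of", len(y_te))
```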

In [8]:
# the classifiers need the binary labels in 'type' to be integers
y_train=y_train.astype('int')

### Count Vectorizer

In [9]:
cv = CountVectorizer()

In [10]:
x_traincv = cv.fit_transform(x_train)
x_traincv

<216x611 sparse matrix of type '<type 'numpy.int64'>'
	with 1895 stored elements in Compressed Sparse Row format>

In [11]:
arrai=x_traincv.toarray()
arrai

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [12]:
#cv.get_feature_names()

In [13]:
cv.inverse_transform(arrai[0])

[array([u'averiguar', u'quer\xeda', u'seguro', u'un'],
       dtype='<U15')]

In [14]:
x_train.iloc[0]

'Quer\xc3\xada averiguar x un seguro'
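Note that the "x" in the original message is missing from the vectorizer's output above. By default `CountVectorizer` lowercases the text and only keeps tokens of two or more word characters. The sketch below mimics that default tokenization with the standard `re` module; it is an approximation for illustration, not the library's actual code path, and it assumes Python 3 strings.

```python
import re

# CountVectorizer's default token pattern: two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it the way the default pattern would."""
    return TOKEN_PATTERN.findall(text.lower())

message = "Quería averiguar x un seguro"
tokens = tokenize(message)

# The vectorizer keeps its vocabulary in sorted order
vocabulary = sorted(set(tokens))
print(vocabulary)
```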

### Tfidf Vectorizer

In [15]:
tfidf_v = TfidfVectorizer(min_df=1, stop_words=stop_words)

In [16]:
# use fit_transform() on the training text: it learns the vocabulary and returns the TF-IDF matrix
x_train_tfidf= tfidf_v.fit_transform(x_train)

In [17]:
arrai1 = x_train_tfidf.toarray()
arrai1

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [18]:
# list the features (vocabulary) extracted by the TF-IDF vectorizer; the output below is truncated
tfidf_v.get_feature_names()

[u'100',
 u'11',
 u'19',
 u'1983',
 u'2008',
 u'2009',
 u'2011',
 u'351',
 u'3515069733',
 u'430000',
 u'506973',
 u'690',
 u'790',
 u'794',
 u'80',
 u'936',
 u'97',
 u'ab235',
 u'abonaba',
 u'abonar',
 u'aca',
 u'accidente',
 u'acercar',
 u'actu\xf3',
 u'ac\xe1',
 u'adem\xe1s',
 u'adherirse',
 u'agradezco',
 u'aguardo',
 u'ahi',
 u'ahora',
 u'alberto',
 u'alguna',
 u'alta',
 u'andar',
 u'anterior',
 u'anulada',
 u'aparece',
 u'apenas',
 u'aproximadamente',
 u'aqui',
 u'arriba',
 u'as',
 u'asegurado',
 u'asegurar',
 u'asesoro',
 u'asi',
 u'assistance',
 u'as\xed',
 u'atenci\xf3n',
 u'atendian',
 u'atendieron',
 u'atras',
 u'atr\xe1s',
 u'atte',
 u'aumentan',
 u'auto',
 u'automovil',
 u'autos',
 u'averiguar',
 u'avise',
 u'aviso',
 u'ayer',
 u'bersion',
 u'bertuccelli_edo',
 u'bicicletas',
 u'bien',
 u'boleta',
 u'bonito',
 u'boton',
 u'breve',
 u'brindar',
 u'bs',
 u'bsas',
 u'buen',
 u'buena',
 u'buenas',
 u'buenisimo',
 u'bueno',
 u'buenos',
 u'bustos',
 u'cabrera',
 u'camioneta',
 u

In [19]:
tfidf_v.inverse_transform(arrai1[0])

[array([u'averiguar', u'quer\xeda', u'seguro'],
       dtype='<U15')]

In [20]:
x_train.iloc[0]

'Quer\xc3\xada averiguar x un seguro'
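Compared with the Count Vectorizer, "un" is now gone because it is in the Spanish stop word list, and the remaining words get weights instead of counts. The sketch below shows roughly how those weights arise, assuming the `TfidfVectorizer` defaults (smoothed IDF and L2 row normalization). The three tiny documents are made up for illustration.

```python
import math

# Hypothetical, already tokenized documents (illustrative only)
docs = [["quería", "averiguar", "seguro"], ["seguro", "auto"], ["gracias"]]
n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many documents each term appears
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1.0 + n_docs) / (1.0 + df[term])) + 1.0 for term in vocab}

def tfidf_row(doc):
    """Term count times IDF, then scaled so the row has unit L2 norm."""
    raw = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

row0 = tfidf_row(docs[0])
print(dict(zip(vocab, row0)))
```

Terms that appear in many documents (here "seguro") get a lower IDF than terms that appear in only one.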

### Multinomial Naive Bayes Classifier with Count Vectorizer

In [21]:
#mnb=MultinomialNB()

In [22]:
# for applying the Naive Bayes algorithm use "fit()" - this is the line where we are training our model
mnb_trained = MultinomialNB().fit(x_traincv, y_train)

In [23]:
# for test data use transform() only, so the vocabulary learned from the training set is reused
x_testcv = cv.transform(x_test)

In [24]:
#predictions
pred = mnb_trained.predict(x_testcv)
pred

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 1])

In [25]:
actual=np.array(y_test)
actual

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [26]:
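# count the predictions that match the hand-labeled types
count=0
for i in range(len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
# float() avoids Python 2 integer division, which truncated the recorded percentage below to 0
print("Total",(float(count)/len(pred))*100,"accuracy using Count Vectorizer.")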

('We have', 49, 'correct predictions out of', 55, '.')
('Total', 0, 'accuracy using Count Vectorizer.')
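Accuracy alone does not show which class the mistakes fall on. As a rough check, the sketch below tallies a confusion matrix from the `pred` and `actual` arrays printed above; the index sets are transcribed by hand from those outputs (positions where the value is 1).

```python
n_test = 55

# Positions of the 1s ('reclamo') in the recorded arrays above
pred_reclamo = {6, 33, 52, 54}
actual_reclamo = {8, 14, 28, 33, 37, 42, 52, 54}

tp = len(pred_reclamo & actual_reclamo)   # 'reclamos' correctly flagged
fp = len(pred_reclamo - actual_reclamo)   # 'consultas' flagged as 'reclamo'
fn = len(actual_reclamo - pred_reclamo)   # 'reclamos' that were missed
tn = n_test - tp - fp - fn                # 'consultas' correctly left alone

accuracy = (tp + tn) / float(n_test)
precision = tp / float(tp + fp)
recall = tp / float(tp + fn)

print("tp, fp, fn, tn:", tp, fp, fn, tn)
print("accuracy: %.3f  precision: %.3f  recall: %.3f" % (accuracy, precision, recall))
```

So the 49 correct predictions are mostly 'consultas'; only 3 of the 8 'reclamos' in the test set were found.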


In [27]:
# test messages as a plain list
messg = list(x_test)

In [28]:
# predictions as a plain list, and how many there are
predictions = list(pred)
len(predictions)

55

In [29]:
# hand-labeled types as a plain list
original = list(y_test)

In [30]:
# compare the original, hand-labeled types with the model's predictions
dataf = pd.DataFrame({'ORIGINAL_TYPE' : original,
                      'ML_PREDICT' : predictions,
                     'TEXT' : messg})
dataf

| | ML_PREDICT | ORIGINAL_TYPE | TEXT |
|---|---|---|---|
| 0 | 0 | 0 | Y en SanCor? |
| 1 | 0 | 0 | gracias |
| 2 | 0 | 0 | Me podes confirmar si voy a poderlo hacer? |
| 3 | 0 | 0 | Quería consultar sobre un seguro de accidente ... |
| 4 | 0 | 0 | Tengo un auto Ford falcon modelo 1983 y quería... |
| 5 | 0 | 0 | Me llamo Natalia |
| 6 | 1 | 0 | pense q seria 690 el valor de la cuota |
| 7 | 0 | 0 | Toco el boton donde dice otros medios de pagos... |
| 8 | 0 | 1 | Gente necesitamos ayuda |
| 9 | 0 | 0 | y asi como yo estoy interviniendo, también pod... |


### Multinomial Naive Bayes Classifier with TFIDF Vectorizer

In [33]:
# train a fresh model on the TF-IDF features only
mnb = MultinomialNB()
mnb.fit(x_train_tfidf, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [34]:
x_test_tfidf = tfidf_v.transform(x_test)

In [35]:
pred1 = mnb.predict(x_test_tfidf)
pred1

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0])

In [36]:
actual1=np.array(y_test)
actual1

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [37]:
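# count the predictions that match the hand-labeled types
count1=0
for i in range(len(pred1)):
    if pred1[i]==actual1[i]:
        count1=count1+1
print("We have",count1,"correct predictions out of",len(pred1),".")
# float() avoids Python 2 integer division, which truncated the recorded percentage below to 0
print("Total",(float(count1)/len(pred1))*100,"accuracy using Tfidf Vectorizer")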

('We have', 47, 'correct predictions out of', 55, '.')
('Total', 0, 'accuracy using Tfidf Vectorizer')


### Bernoulli-Naive Bayes Classifier 

In [None]:
from sklearn.naive_bayes import BernoulliNB

In [None]:
bnb=BernoulliNB()

In [None]:
# for applying the Naive Bayes algorithm use "fit()"
bnb.fit(x_traincv, y_train)

In [None]:
# test data use "transform()"
x_testcv = cv.transform(x_test)

In [None]:
#predictions
pred = bnb.predict(x_testcv)
pred

In [None]:
actual=np.array(y_test)
actual

In [None]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
# float() avoids integer division under Python 2
print("Total",(float(count)/len(pred))*100,"percent accuracy.")

### Gaussian-Naive Bayes Classifier

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
gnb=GaussianNB()

In [None]:
# for applying the Naive Bayes algorithm use "fit()"
dense_x_traincv = x_traincv.toarray()
gnb.fit(dense_x_traincv, y_train)

In [None]:
# test data use "transform()"
x_testcv = cv.transform(x_test)

In [None]:
#predictions
dense_x_testcv = x_testcv.toarray()
pred = gnb.predict(dense_x_testcv)
pred

In [None]:
actual=np.array(y_test)
actual

In [None]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
# float() avoids integer division under Python 2
print("Total",(float(count)/len(pred))*100,"percent accuracy.")

### Logistic Regression Classifier

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr_classifier=LogisticRegression()

In [None]:
# train the logistic regression model with "fit()"
dense_x_traincv = x_traincv.toarray()
lr_classifier.fit(dense_x_traincv, y_train)

In [None]:
# test data use "transform()"
x_testcv = cv.transform(x_test)

In [None]:
#predictions
dense_x_testcv = x_testcv.toarray()
pred = lr_classifier.predict(dense_x_testcv)
pred

In [None]:
actual=np.array(y_test)
actual

In [None]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
# float() avoids integer division under Python 2
print("Total",(float(count)/len(pred))*100,"percent accuracy.")

### Stochastic Gradient Descent (SGD) Classifier

In [None]:
from sklearn.linear_model import SGDClassifier

In [None]:
sgd_classifier=SGDClassifier()

In [None]:
# train the SGD classifier with "fit()"; it accepts the sparse matrix directly
#dense_x_traincv = x_traincv.toarray()
sgd_classifier.fit(x_traincv, y_train)

In [None]:
# test data use "transform()"
x_testcv = cv.transform(x_test)

In [None]:
#predictions
#dense_x_testcv = x_testcv.toarray()
pred = sgd_classifier.predict(x_testcv)
pred

In [None]:
actual=np.array(y_test)
actual

In [None]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
# float() avoids integer division under Python 2
print("Total",(float(count)/len(pred))*100,"percent accuracy.")

### Support Vector Classification (three types)

In [None]:
from sklearn.svm import SVC

In [None]:
svc_classifier=SVC()

In [None]:
svc_classifier.fit(x_traincv, y_train)
# test data use "transform()"
x_testcv = cv.transform(x_test)
#predictions
pred = svc_classifier.predict(x_testcv)
pred

In [None]:
#actual
actual=np.array(y_test)
actual

In [None]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
# float() avoids integer division under Python 2
print("Total",(float(count)/len(pred))*100,"percent accuracy.")

### Linear Support Vector Classifier

In [None]:
from sklearn.svm import LinearSVC

In [None]:
linear_svc_classifier=LinearSVC()

In [None]:
linear_svc_classifier.fit(x_traincv, y_train)
# test data use "transform()"
x_testcv = cv.transform(x_test)
#predictions
pred = linear_svc_classifier.predict(x_testcv)
pred

In [None]:
#actual
actual = np.array(y_test)
actual

In [None]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
# float() avoids integer division under Python 2
print("Total",(float(count)/len(pred))*100,"percent accuracy.")

### Support Vector Classifier with control of the number of support vectors (nu=0.1)

In [None]:
from sklearn.svm import NuSVC

In [None]:
num_svc_classifier=NuSVC(nu=.1)

In [None]:
num_svc_classifier.fit(x_traincv, y_train)
# test data use "transform()"
x_testcv = cv.transform(x_test)
#predictions
pred = num_svc_classifier.predict(x_testcv)
#actual
actual=np.array(y_test)

In [None]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
# float() avoids integer division under Python 2
print("Total",(float(count)/len(pred))*100,"percent accuracy.")
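The same fit, predict, and count steps are repeated for every classifier above. A small helper along these lines could replace the copies. The name `evaluate` and the `AlwaysConsulta` stand-in are hypothetical and exist only for this sketch; any object with `fit` and `predict` methods would work.

```python
def evaluate(classifier, x_train, y_train, x_test, y_test):
    """Fit the classifier, predict on the test set, return (correct, total, percent)."""
    classifier.fit(x_train, y_train)
    predictions = classifier.predict(x_test)
    # int() makes the comparison safe when labels are stored as objects
    correct = sum(int(p) == int(y) for p, y in zip(predictions, y_test))
    total = len(y_test)
    return correct, total, 100.0 * correct / total


class AlwaysConsulta(object):
    """Hypothetical stand-in classifier, used only to demonstrate evaluate()."""

    def fit(self, x, y):
        return self

    def predict(self, x):
        return [0] * len(x)


# Tiny made-up data: three training rows, four test rows
correct, total, percent = evaluate(
    AlwaysConsulta(), [[0], [1], [2]], [0, 0, 1], [[3], [4], [5], [6]], [0, 1, 0, 0])
print("We have %d correct predictions out of %d (%.1f percent)." % (correct, total, percent))
```

With the notebook's variables the call would look like `evaluate(BernoulliNB(), x_traincv, y_train, x_testcv, y_test)`; for `GaussianNB` the matrices would first need `.toarray()`, as in the cells above.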

### Conclusion: By raw accuracy, the "Multinomial Naive Bayes Classifier with Count Vectorizer" performed best, at 89 percent (49 of 55 test messages). Almost all classifiers perform above 80 percent with our dataset, but probably because the dataset is small and, within it, only a small percentage of messages are 'reclamos'. My suspicion is that the classifiers are mostly just guessing 'consulta' every time; the TF-IDF model predicted 'consulta' for all 55 test messages. Even a dumb classifier that guessed 'consulta' every time would be right on 47 of the 55 test messages, about 85 percent.
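One way to test that suspicion on the two runs that have recorded outputs is balanced accuracy, the average of the hit rate on 'reclamos' and the hit rate on 'consultas'. A classifier that always guesses 'consulta' scores exactly 0.5 on it. The sketch below uses index sets transcribed by hand from the `pred`, `pred1`, and `actual` arrays printed earlier.

```python
n_test = 55

# Positions of the 1s ('reclamo') in the recorded `actual` array
actual_reclamo = {8, 14, 28, 33, 37, 42, 52, 54}

# Positions predicted as 'reclamo' in each recorded run
runs = {
    "count vectorizer": {6, 33, 52, 54},
    "tfidf vectorizer": set(),  # predicted 'consulta' for every message
}

def balanced_accuracy(pred_reclamo):
    """Mean of recall on 'reclamo' and recall on 'consulta'."""
    n_reclamo = len(actual_reclamo)
    n_consulta = n_test - n_reclamo
    tp = len(pred_reclamo & actual_reclamo)
    tn = n_consulta - len(pred_reclamo - actual_reclamo)
    return 0.5 * (tp / float(n_reclamo) + tn / float(n_consulta))

scores = {name: balanced_accuracy(pred) for name, pred in runs.items()}
for name in sorted(scores):
    print("%s: %.3f" % (name, scores[name]))
```

By this measure the TF-IDF run is no better than always guessing 'consulta', while the Count Vectorizer run is somewhat better, although it still misses most 'reclamos'.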