## TEXT CLASSIFIERS: Multinomial Naive Bayes, Bernoulli Naive Bayes, Gaussian Naive Bayes, Logistic Regression, Stochastic Gradient Descent (SGD), Support Vector Classifier (SVC), Linear Support Vector Classifier (LinearSVC), and Nu-Support Vector Classifier (NuSVC)

In [14]:
import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from nltk.corpus import stopwords

In [15]:
nltk.download('stopwords')
stop_words = stopwords.words("spanish")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/brandonjanes/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [34]:
#read dataset from csv
df = pd.read_csv('mattermost_etiquetado.csv')

In [17]:
# change type to binary (0 = 'consulta', 1 = 'reclamo')
df.loc[df["type"]=='consulta',"type"]=0
df.loc[df["type"]=='reclamo',"type"]=1

In [18]:
df_x = df["text"]
df_y = df["type"]

In [38]:
# share of 'reclamos' among all messages (labels are 0/1, so the sum counts the 'reclamos')
reclamo_total = df_y.sum()
total = df.type.count()
average = reclamo_total / total
print(average*100,"percent of the messages are 'reclamos'.")

12.177121771217712 percent of the messages are 'reclamos'.


### If the percentage of 'reclamos' is very low, plain accuracy becomes a misleading measure of classifier quality. For example, if 10 percent of the messages are 'reclamos', a trivial classifier that predicts 'consulta' every time would be 90 percent accurate. With about 12.2 percent 'reclamos' in this dataset, that trivial baseline already reaches about 87.8 percent.
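The baseline argument can be checked directly. This is a minimal sketch on hypothetical labels (90 'consulta', 10 'reclamo'), not the notebook's data, using scikit-learn's `DummyClassifier` as the always-'consulta' predictor.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 90 'consulta' (0) and 10 'reclamo' (1), the 10 percent case.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features; the dummy model ignores them

# "most_frequent" always predicts the majority class seen during fit().
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_hat = dummy.predict(X)

acc = accuracy_score(y, y_hat)  # share of labels predicted correctly
rec = recall_score(y, y_hat)    # share of 'reclamos' that were found
print(acc, rec)
```

Accuracy looks high while recall on 'reclamos' is zero, which is why a per-class metric is worth reporting alongside accuracy for data this imbalanced.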

In [6]:
# split the data: 80% for training, 20% for testing (random_state fixes the shuffle)
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)

In [7]:
# the binary 'type' labels must be integers for the classifiers
y_train=y_train.astype('int')
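Only `y_train` is cast here, which is why the `actual` arrays printed later still show `dtype=object`. One possible alternative, sketched on a hypothetical toy frame that borrows the notebook's column names, is to map the labels to integers before splitting and to pass `stratify=` so both splits keep the same share of 'reclamos'. The toy data and the 0.25 test size are assumptions chosen to make the proportions easy to read.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with the same column names as the notebook's CSV.
toy = pd.DataFrame({
    "text": [f"mensaje {i}" for i in range(20)],
    "type": ["consulta"] * 16 + ["reclamo"] * 4,
})

# Map labels to integers once, before splitting, so train and test share a dtype.
labels = toy["type"].map({"consulta": 0, "reclamo": 1}).astype(int)

# stratify=labels keeps the 'reclamo' share the same in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy["text"], labels, test_size=0.25, random_state=4, stratify=labels
)
print(y_tr.mean(), y_te.mean())
```

With a class as rare as 'reclamo', an unstratified split of a small dataset can leave the test set with very few positive examples, which makes every accuracy figure noisier.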

### Count Vectorizer

In [8]:
cv = CountVectorizer()

In [9]:
x_traincv=cv.fit_transform(x_train)

In [10]:
arrai=x_traincv.toarray()
arrai

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [11]:
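# list the learned vocabulary
# (scikit-learn 1.2 removed get_feature_names(); newer versions use get_feature_names_out())
cv.get_feature_names()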

['100',
 '11',
 '19',
 '1983',
 '2008',
 '2009',
 '2011',
 '351',
 '3515069733',
 '430000',
 '506973',
 '690',
 '790',
 '794',
 '80',
 '936',
 '97',
 'ab235',
 'abonaba',
 'abonar',
 'aca',
 'accidente',
 'acercar',
 'actuó',
 'acá',
 'además',
 'adherirse',
 'agradezco',
 'aguardo',
 'ahi',
 'ahora',
 'al',
 'alberto',
 'algo',
 'alguna',
 'alta',
 'andar',
 'anterior',
 'antes',
 'anulada',
 'aparece',
 'apenas',
 'aproximadamente',
 'aqui',
 'arriba',
 'as',
 'asegurado',
 'asegurar',
 'asesoro',
 'asi',
 'assistance',
 'así',
 'atención',
 'atendian',
 'atendieron',
 'atras',
 'atrás',
 'atte',
 'aumentan',
 'auto',
 'automovil',
 'autos',
 'averiguar',
 'avise',
 'aviso',
 'ayer',
 'bersion',
 'bertuccelli_edo',
 'bicicletas',
 'bien',
 'boleta',
 'bonito',
 'boton',
 'breve',
 'brindar',
 'bs',
 'bsas',
 'buen',
 'buena',
 'buenas',
 'buenisimo',
 'bueno',
 'buenos',
 'bustos',
 'cabrera',
 'camioneta',
 'cancelar',
 'carol',
 'casa',
 'categoria',
 'cba',
 'cbu',
 'celular',
 'c

In [12]:
cv.inverse_transform(arrai[0])

[array(['averiguar', 'quería', 'seguro', 'un'], dtype='<U15')]

In [13]:
x_train.iloc[0]

'Quería averiguar x un seguro'

### Tfidf Vectorizer

In [8]:
tfidf_v = TfidfVectorizer(min_df=1, stop_words=stop_words)

In [9]:
# learn the vocabulary from the training text and vectorize it in one step with fit_transform()
x_train_tfidf= tfidf_v.fit_transform(x_train)

In [10]:
arrai1 = x_train_tfidf.toarray()
arrai1

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [11]:
# list the learned vocabulary
# (scikit-learn 1.2 removed get_feature_names(); newer versions use get_feature_names_out())
tfidf_v.get_feature_names()

['100',
 '11',
 '19',
 '1983',
 '2008',
 '2009',
 '2011',
 '351',
 '3515069733',
 '430000',
 '506973',
 '690',
 '790',
 '794',
 '80',
 '936',
 '97',
 'ab235',
 'abonaba',
 'abonar',
 'aca',
 'accidente',
 'acercar',
 'actuó',
 'acá',
 'además',
 'adherirse',
 'agradezco',
 'aguardo',
 'ahi',
 'ahora',
 'alberto',
 'alguna',
 'alta',
 'andar',
 'anterior',
 'anulada',
 'aparece',
 'apenas',
 'aproximadamente',
 'aqui',
 'arriba',
 'as',
 'asegurado',
 'asegurar',
 'asesoro',
 'asi',
 'assistance',
 'así',
 'atención',
 'atendian',
 'atendieron',
 'atras',
 'atrás',
 'atte',
 'aumentan',
 'auto',
 'automovil',
 'autos',
 'averiguar',
 'avise',
 'aviso',
 'ayer',
 'bersion',
 'bertuccelli_edo',
 'bicicletas',
 'bien',
 'boleta',
 'bonito',
 'boton',
 'breve',
 'brindar',
 'bs',
 'bsas',
 'buen',
 'buena',
 'buenas',
 'buenisimo',
 'bueno',
 'buenos',
 'bustos',
 'cabrera',
 'camioneta',
 'cancelar',
 'carol',
 'casa',
 'categoria',
 'cba',
 'cbu',
 'celular',
 'centro',
 'cerca',
 'chiqui

In [12]:
tfidf_v.inverse_transform(arrai1[0])

[array(['averiguar', 'quería', 'seguro'], dtype='<U15')]

In [13]:
x_train.iloc[0]

'Quería averiguar x un seguro'
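The two `inverse_transform` results differ from the raw message in two ways: "x" is missing from both, and "un" is missing from the TF-IDF version. A minimal sketch of why, assuming default tokenizer settings: scikit-learn's default token pattern keeps only tokens of two or more characters, and a stop-word list removes whatever it contains. `CountVectorizer` is used for both cases here to isolate the stop-word effect, and the one-word stop list is a hypothetical stand-in for NLTK's Spanish list.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["Quería averiguar x un seguro"]

# Default settings: lowercase the text and keep only tokens with 2+ word characters,
# so the single-letter token "x" never enters the vocabulary.
plain = CountVectorizer().fit(doc)
plain_vocab = sorted(plain.vocabulary_)

# Hypothetical one-word stop list standing in for NLTK's Spanish stop words.
filtered = CountVectorizer(stop_words=["un"]).fit(doc)
filtered_vocab = sorted(filtered.vocabulary_)

print(plain_vocab)
print(filtered_vocab)
```

If single-character tokens matter for a dataset, the `token_pattern` argument can be changed, though that is a design choice rather than something this notebook does.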

### Multinomial Naive Bayes Classifier with Count Vectorizer

In [14]:
mnb=MultinomialNB()

In [15]:
# for applying the Naive Bayes algorithm use "fit()"
mnb.fit(x_traincv, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [16]:
# vectorize the test data with transform() only, so the vocabulary learned from the training data is reused
x_testcv = cv.transform(x_test)

In [17]:
#predictions
pred = mnb.predict(x_testcv)
pred

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1])

In [18]:
actual=np.array(y_test)
actual

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [19]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"accuracy using Count Vectorizer.")

We have 49 correct predictions out of 55 .
Total 89.0909090909091 accuracy using Count Vectorizer.
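Because the classes are imbalanced, the hand-rolled count above can be complemented with per-class metrics. The sketch below rebuilds the `pred` and `actual` arrays printed above from the positions of their 1s (read off the output, so treat the indices as a transcription) and uses functions from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Rebuild the two 55-element arrays printed above from the positions of their 1s.
actual = np.zeros(55, dtype=int)
actual[[8, 14, 28, 33, 37, 42, 52, 54]] = 1   # 8 true 'reclamos'
pred = np.zeros(55, dtype=int)
pred[[6, 33, 52, 54]] = 1                      # 4 predicted 'reclamos'

acc = accuracy_score(actual, pred)             # same figure as the manual loop
cm = confusion_matrix(actual, pred)            # rows: true class, columns: predicted class
recall = recall_score(actual, pred)            # share of true 'reclamos' that were found
precision = precision_score(actual, pred)      # share of predicted 'reclamos' that were right
print(acc, cm, recall, precision)
```

Read this way, the 49 correct predictions consist of 46 'consultas' and only 3 of the 8 'reclamos', a detail the single accuracy figure hides.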


### Multinomial Naive Bayes Classifier with TFIDF Vectorizer

In [17]:
mnb.fit(x_train_tfidf, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [18]:
x_test_tfidf = tfidf_v.transform(x_test)

In [19]:
pred1 = mnb.predict(x_test_tfidf)
pred1

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [20]:
actual1=np.array(y_test)
actual1

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [21]:
count1=0
for i in range (len(pred1)):
    if pred1[i]==actual1[i]:
        count1=count1+1
print("We have",count1,"correct predictions out of",len(pred1),".")
print("Total",(count1/len(pred1))*100,"accuracy using Tfidf Vectorizer")

we have 47 correct predictions out of 55
Total 85.45454545454545 accuracy using Tfidf Vectorizer


### Bernoulli-Naive Bayes Classifier 

In [13]:
from sklearn.naive_bayes import BernoulliNB

In [14]:
bnb=BernoulliNB()

In [15]:
# for applying the Naive Bayes algorithm use "fit()"
bnb.fit(x_traincv, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [16]:
# test data use "transform()"
x_testcv = cv.transform(x_test)

In [19]:
#predictions
pred = bnb.predict(x_testcv)
pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [20]:
actual=np.array(y_test)
actual

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [21]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"accuracy using Count Vectorizer.")

We have 47 correct predictions out of 55 .
Total 85.45454545454545 accuracy using Count Vectorizer.


### Gaussian-Naive Bayes Classifier

In [14]:
from sklearn.naive_bayes import GaussianNB

In [15]:
gnb=GaussianNB()

In [17]:
# for applying the Naive Bayes algorithm use "fit()"
dense_x_traincv = x_traincv.toarray()
gnb.fit(dense_x_traincv, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [18]:
# test data use "transform()"
x_testcv = cv.transform(x_test)

In [20]:
#predictions
dense_x_testcv = x_testcv.toarray()
pred = gnb.predict(dense_x_testcv)
pred

array([0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

In [21]:
actual=np.array(y_test)
actual

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [22]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"accuracy using Count Vectorizer.")

We have 42 correct predictions out of 55 .
Total 76.36363636363637 accuracy using Count Vectorizer.


### Logistic Regression Classifier

In [14]:
from sklearn.linear_model import LogisticRegression

In [15]:
lr_classifier=LogisticRegression()

In [16]:
# train the logistic regression model with fit()
dense_x_traincv = x_traincv.toarray()
lr_classifier.fit(dense_x_traincv, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [17]:
# test data use "transform()"
x_testcv = cv.transform(x_test)

In [18]:
#predictions
dense_x_testcv = x_testcv.toarray()
pred = lr_classifier.predict(dense_x_testcv)
pred

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [19]:
actual=np.array(y_test)
actual

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [20]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"accuracy using Count Vectorizer.")

We have 45 correct predictions out of 55 .
Total 81.81818181818183 accuracy using Count Vectorizer.


### Stochastic Gradient Descent Classifier

In [14]:
from sklearn.linear_model import SGDClassifier

In [15]:
sgd_classifier=SGDClassifier()

In [16]:
# train the SGD classifier with fit() (it accepts the sparse matrix directly)
#dense_x_traincv = x_traincv.toarray()
sgd_classifier.fit(x_traincv, y_train)



SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

In [17]:
# test data use "transform()"
x_testcv = cv.transform(x_test)

In [18]:
#predictions
#dense_x_testcv = x_testcv.toarray()
pred = sgd_classifier.predict(x_testcv)
pred

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1])

In [19]:
actual=np.array(y_test)
actual

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [20]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"accuracy using Count Vectorizer.")

We have 47 correct predictions out of 55 .
Total 85.45454545454545 accuracy using Count Vectorizer.


### Support Vector Classification (three types)

In [15]:
from sklearn.svm import SVC

In [16]:
svc_classifier=SVC()

In [17]:
svc_classifier.fit(x_traincv, y_train)
# test data use "transform()"
x_testcv = cv.transform(x_test)
#predictions
pred = svc_classifier.predict(x_testcv)
#actual
actual=np.array(y_test)



In [18]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"percent accuracy using Count Vectorizer.")

We have 47 correct predictions out of 55 .
Total 85.45454545454545 percent accuracy using Count Vectorizer.


### Linear Support Vector Classifier

In [19]:
from sklearn.svm import LinearSVC

In [20]:
linear_svc_classifier=LinearSVC()

In [21]:
linear_svc_classifier.fit(x_traincv, y_train)
# test data use "transform()"
x_testcv = cv.transform(x_test)
#predictions
pred = linear_svc_classifier.predict(x_testcv)
#actual
actual=np.array(y_test)

In [22]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"percent accuracy using Count Vectorizer.")

We have 44 correct predictions out of 55 .
Total 80.0 percent accuracy using Count Vectorizer.


### Nu-Support Vector Classifier, which uses nu to control the number of support vectors (nu=0.1)

In [25]:
from sklearn.svm import NuSVC

In [26]:
num_svc_classifier=NuSVC(nu=.1)

In [27]:
num_svc_classifier.fit(x_traincv, y_train)
# test data use "transform()"
x_testcv = cv.transform(x_test)
#predictions
pred = num_svc_classifier.predict(x_testcv)
#actual
actual=np.array(y_test)



In [28]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"percent accuracy using Count Vectorizer.")

We have 44 correct predictions out of 55 .
Total 80.0 percent accuracy using Count Vectorizer.


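### Conclusion: All classifiers perform more or less the same on our dataset, probably because the dataset is small and 'reclamos' make up only a small share of it. The test set contains 8 'reclamos' out of 55 messages, so always predicting 'consulta' already gets 47 of 55 correct (85.45 percent). My suspicion is that several classifiers are simply predicting 'consulta' almost every time, which is correct most of the time; the predictions printed for Bernoulli Naive Bayes and for Multinomial Naive Bayes with TF-IDF are in fact all 0.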
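To make the comparison concrete, the correct-prediction counts printed in each section can be set against the always-'consulta' baseline. The sketch below only tabulates numbers copied from the outputs above; it does not retrain anything, and the SGD figure in particular may change between runs because no `random_state` was set.

```python
# Correct-prediction counts copied from the outputs above (55 test messages each).
correct = {
    "MultinomialNB (counts)": 49,
    "MultinomialNB (tf-idf)": 47,
    "BernoulliNB": 47,
    "GaussianNB": 42,
    "LogisticRegression": 45,
    "SGDClassifier": 47,
    "SVC": 47,
    "LinearSVC": 44,
    "NuSVC": 44,
}
n_test = 55
n_reclamos = 8                   # number of 1s in the printed `actual` array
baseline = n_test - n_reclamos   # always predicting 'consulta' gets these right

for name, hits in correct.items():
    verdict = "above" if hits > baseline else "at or below"
    print(f"{name:24s} {hits}/{n_test} = {hits / n_test:.1%}  ({verdict} baseline)")

# Classifiers that did strictly better than the trivial baseline.
beats_baseline = [name for name, hits in correct.items() if hits > baseline]
```

On these recorded counts, only Multinomial Naive Bayes with the Count Vectorizer scores above the baseline, and by just two messages, which is too small a margin to draw firm conclusions from a 55-message test set.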