## TEXT CLASSIFIER TESTS
#### Multinomial Naive Bayes, Bernoulli Naive Bayes, Gaussian Naive Bayes, Logistic Regression, Stochastic Gradient Descent (SGD), Support Vector Classifier (SVC), Linear Support Vector Classifier (LinearSVC), and Nu-Support Vector Classifier (NuSVC)

In [4]:
import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from nltk.corpus import stopwords

In [5]:
nltk.download('stopwords')
stop_words = stopwords.words("spanish")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/brandonjanes/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
#read dataset from csv
df = pd.read_csv('mattermost_etiquetado.csv')

In [7]:
# encode the type label as binary: 'consulta' -> 0, 'reclamo' -> 1
df.loc[df["type"]=='consulta',"type"]=0
df.loc[df["type"]=='reclamo',"type"]=1

In [8]:
df_x = df["text"]
df_y = df["type"]

In [9]:
# share of messages labeled 'reclamo' (label 1) in the whole data set
reclamo_total = df_y.sum()
total = df.type.count()
average = reclamo_total / total
print(average*100,"percent of the messages are 'reclamos'.")

12.177121771217712 percent of the messages are 'reclamos'.


### If the share of 'reclamos' (complaints) is very low, plain accuracy becomes a misleading measure of classifier quality. For example, if only 10 percent of the messages are 'reclamos', a dumb classifier that predicts 'consulta' (inquiry) every time would still be 90 percent accurate.
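
To make that baseline concrete, here is a minimal sketch that scores an always-the-majority-label predictor. The label list is a made-up example with 10 percent 'reclamos' (it mirrors the hypothetical in the note, not the real data), and `majority_baseline_accuracy` is a helper name introduced only for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    most_common_label, hits = Counter(labels).most_common(1)[0]
    return most_common_label, hits / len(labels)

# Hypothetical labels: 90 'consulta' (0) and 10 'reclamo' (1)
example_labels = [0] * 90 + [1] * 10
label, acc = majority_baseline_accuracy(example_labels)
print(f"always predicting {label} gives {acc:.0%} accuracy")
```

Any classifier trained below should be judged against this number rather than against zero.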

In [10]:
# split the data: 80% train, 20% test, with a fixed random_state so the split is reproducible
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)

In [11]:
# the binary 'type' labels must be integers for training
y_train=y_train.astype('int')

### Count Vectorizer

In [12]:
cv = CountVectorizer()

In [13]:
x_traincv = cv.fit_transform(x_train)
x_traincv

<216x611 sparse matrix of type '<class 'numpy.int64'>'
	with 1895 stored elements in Compressed Sparse Row format>

In [14]:
arrai=x_traincv.toarray()
arrai

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [15]:
#cv.get_feature_names()

In [16]:
cv.inverse_transform(arrai[0])

[array(['averiguar', 'quería', 'seguro', 'un'], dtype='<U15')]

In [17]:
x_train.iloc[0]

'Quería averiguar x un seguro'

### Tfidf Vectorizer

In [18]:
tfidf_v = TfidfVectorizer(min_df=1, stop_words=stop_words)

In [19]:
# fit the vectorizer on the training text and transform it in one step with fit_transform()
x_train_tfidf= tfidf_v.fit_transform(x_train)

In [20]:
arrai1 = x_train_tfidf.toarray()
arrai1

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
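
For readers who want to see where the non-zero weights in such a matrix come from, the following is a rough pure-Python sketch of TF-IDF weighting. It assumes the scikit-learn defaults as documented (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row). The three toy documents and the `tfidf_rows` helper are invented for illustration and are not part of the notebook's data.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed TF-IDF with L2-normalised rows for pre-tokenised documents."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # document frequency: in how many documents each term appears
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical, already tokenised messages
toy_docs = [["quería", "averiguar", "seguro"], ["seguro", "auto"], ["gracias"]]
vocab, rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In the first toy document, 'seguro' ends up with a lower weight than 'averiguar' because it also appears in the second document.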

In [21]:
#feature extraction of the data
#tfidf_v.get_feature_names()

In [22]:
tfidf_v.inverse_transform(arrai1[0])

[array(['averiguar', 'quería', 'seguro'], dtype='<U15')]

In [23]:
x_train.iloc[0]

'Quería averiguar x un seguro'
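
The two `inverse_transform` outputs differ from the raw message in two ways: 'x' is missing from both, and 'un' is missing from the TF-IDF one. A rough sketch of why, assuming scikit-learn's documented default token pattern (tokens of two or more word characters) and default lowercasing. The one-word stop list is a stand-in for the NLTK Spanish list, and `tokenize` is a hypothetical helper, not a library function.

```python
import re

# Assumed default pattern: keep only tokens with 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text, stop_words=()):
    """Lowercase, split into tokens, drop stop words, return sorted unique terms."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return sorted(t for t in set(tokens) if t not in stop_words)

message = "Quería averiguar x un seguro"
print(tokenize(message))                     # CountVectorizer-like: no stop words
print(tokenize(message, stop_words={"un"}))  # TfidfVectorizer-like: 'un' removed
```

The first call keeps 'un' but drops the single-character 'x'; the second also drops 'un', which matches the two outputs shown above.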

### Multinomial Naive Bayes Classifier with Count Vectorizer

In [24]:
#mnb=MultinomialNB()

In [25]:
# for applying the Naive Bayes algorithm use "fit()" - this is the line where we are training our model
mnb_trained = MultinomialNB().fit(x_traincv, y_train)

In [26]:
# test data: use transform() only, so the vocabulary learned from the training set is reused
x_testcv = cv.transform(x_test)

In [27]:
#predictions
pred = mnb_trained.predict(x_testcv)
pred

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1])

In [28]:
actual=np.array(y_test)
actual

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [29]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"accuracy using Count Vectorizer.")

We have 49 correct predictions out of 55 .
Total 89.0909090909091 accuracy using Count Vectorizer.
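
Because accuracy alone hides how the minority class is handled, it may help to break the same predictions into confusion counts. The sketch below copies the `actual` and `pred` arrays printed above into plain lists (0 = 'consulta', 1 = 'reclamo'). The names `actual_labels`, `mnb_cv_pred` and `confusion_counts` are introduced here for illustration only.

```python
actual_labels = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
]
mnb_cv_pred = [
    0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
]

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(actual_labels, mnb_cv_pred)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

On these arrays the counts are 3 true positives, 1 false positive, 5 false negatives and 46 true negatives, so the model finds only 3 of the 8 'reclamos' in the test set despite its 89 percent accuracy.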


In [30]:
messg = []
for msg in x_test:
    messg.append(msg)

In [31]:
count = 0
predictions = []
for x in pred:
    count = count + 1
    predictions.append(x)
count

55

In [32]:
original = []
for i in y_test:
    original.append(i)

In [33]:
# compare the original hand-labeled type with the model's prediction for each test message
dataf = pd.DataFrame({'ORIGINAL_TYPE' : original,
                      'ML_PREDICT' : predictions,
                     'TEXT' : messg})
dataf

| | ORIGINAL_TYPE | ML_PREDICT | TEXT |
|---|---|---|---|
| 0 | 0 | 0 | Y en SanCor? |
| 1 | 0 | 0 | gracias |
| 2 | 0 | 0 | Me podes confirmar si voy a poderlo hacer? |
| 3 | 0 | 0 | Quería consultar sobre un seguro de accidente ... |
| 4 | 0 | 0 | Tengo un auto Ford falcon modelo 1983 y quería... |
| 5 | 0 | 0 | Me llamo Natalia |
| 6 | 0 | 1 | pense q seria 690 el valor de la cuota |
| 7 | 0 | 0 | Toco el boton donde dice otros medios de pagos... |
| 8 | 1 | 0 | Gente necesitamos ayuda |
| 9 | 0 | 0 | y asi como yo estoy interviniendo, también pod... |


### Multinomial Naive Bayes Classifier with TFIDF Vectorizer

In [34]:
# train a fresh model on the TF-IDF features (no need to fit on the count features first)
mnb = MultinomialNB()
mnb.fit(x_train_tfidf, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [35]:
x_test_tfidf = tfidf_v.transform(x_test)

In [36]:
pred1 = mnb.predict(x_test_tfidf)
pred1

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [37]:
actual1=np.array(y_test)
actual1

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [38]:
count1=0
for i in range (len(pred1)):
    if pred1[i]==actual1[i]:
        count1=count1+1
print("We have",count1,"correct predictions out of",len(pred1),".")
print("Total",(count1/len(pred1))*100,"accuracy using Tfidf Vectorizer")

We have 47 correct predictions out of 55 .
Total 85.45454545454545 accuracy using Tfidf Vectorizer


### Bernoulli-Naive Bayes Classifier 

In [39]:
from sklearn.naive_bayes import BernoulliNB

In [40]:
bnb=BernoulliNB()

In [41]:
# for applying the Naive Bayes algorithm use "fit()"
bnb.fit(x_traincv, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [42]:
# test data use "transform()"
x_testcv = cv.transform(x_test)

In [43]:
#predictions
pred = bnb.predict(x_testcv)
pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [44]:
actual=np.array(y_test)
actual

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [45]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"accuracy.")

We have 47 correct predictions out of 55 .
Total 85.45454545454545 accuracy.


### Gaussian-Naive Bayes Classifier

In [46]:
from sklearn.naive_bayes import GaussianNB

In [47]:
gnb=GaussianNB()

In [48]:
# for applying the Naive Bayes algorithm use "fit()"
dense_x_traincv = x_traincv.toarray()
gnb.fit(dense_x_traincv, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [49]:
# test data use "transform()"
x_testcv = cv.transform(x_test)

In [50]:
#predictions
dense_x_testcv = x_testcv.toarray()
pred = gnb.predict(dense_x_testcv)
pred

array([0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

In [51]:
actual=np.array(y_test)
actual

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [52]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"accuracy.")

We have 42 correct predictions out of 55 .
Total 76.36363636363637 accuracy.


### Logistic Regression Classifier

In [53]:
from sklearn.linear_model import LogisticRegression

In [54]:
lr_classifier=LogisticRegression()

In [55]:
# train the logistic regression model with fit()
dense_x_traincv = x_traincv.toarray()
lr_classifier.fit(dense_x_traincv, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [56]:
# test data use "transform()"
x_testcv = cv.transform(x_test)

In [57]:
#predictions
dense_x_testcv = x_testcv.toarray()
pred = lr_classifier.predict(dense_x_testcv)
pred

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [58]:
actual=np.array(y_test)
actual

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [59]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"accuracy.")

We have 45 correct predictions out of 55 .
Total 81.81818181818183 accuracy.


### Stochastic Gradient Descent (SGD) Classifier

In [60]:
from sklearn.linear_model import SGDClassifier

In [61]:
sgd_classifier=SGDClassifier()

In [62]:
# train the SGD classifier with fit(); it accepts the sparse matrix directly
#dense_x_traincv = x_traincv.toarray()
sgd_classifier.fit(x_traincv, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [63]:
# test data use "transform()"
x_testcv = cv.transform(x_test)

In [64]:
#predictions
#dense_x_testcv = x_testcv.toarray()
pred = sgd_classifier.predict(x_testcv)
pred

array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1])

In [65]:
actual=np.array(y_test)
actual

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [66]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"accuracy.")

We have 46 correct predictions out of 55 .
Total 83.63636363636363 accuracy.


### Support Vector Classification (three types)

In [67]:
from sklearn.svm import SVC

In [68]:
svc_classifier=SVC()

In [69]:
svc_classifier.fit(x_traincv, y_train)
# test data use "transform()"
x_testcv = cv.transform(x_test)
#predictions
pred = svc_classifier.predict(x_testcv)
pred



array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [70]:
#actual
actual=np.array(y_test)
actual

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [71]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"percent accuracy.")

We have 47 correct predictions out of 55 .
Total 85.45454545454545 percent accuracy.


### Linear Support Vector Classifier

In [72]:
from sklearn.svm import LinearSVC

In [73]:
linear_svc_classifier=LinearSVC()

In [74]:
linear_svc_classifier.fit(x_traincv, y_train)
# test data use "transform()"
x_testcv = cv.transform(x_test)
#predictions
pred = linear_svc_classifier.predict(x_testcv)
pred

array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [75]:
#actual
actual = np.array(y_test)
actual

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=object)

In [76]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"percent accuracy.")

We have 44 correct predictions out of 55 .
Total 80.0 percent accuracy.


### Support Vector Classifier w/ control of the number of support vectors (NuSVC, nu=0.1)

In [77]:
from sklearn.svm import NuSVC

In [78]:
num_svc_classifier=NuSVC(nu=.1)

In [79]:
num_svc_classifier.fit(x_traincv, y_train)
# test data use "transform()"
x_testcv = cv.transform(x_test)
#predictions
pred = num_svc_classifier.predict(x_testcv)
#actual
actual=np.array(y_test)



In [80]:
count=0
for i in range (len(pred)):
    if pred[i]==actual[i]:
        count=count+1
print("We have",count,"correct predictions out of",len(pred),".")
print("Total",(count/len(pred))*100,"percent accuracy.")

We have 44 correct predictions out of 55 .
Total 80.0 percent accuracy.


### Conclusion: By accuracy, the "Multinomial Naive Bayes Classifier with Count Vectorizer" performed best, at 89 percent (49 of 55). Every classifier except Gaussian Naive Bayes scored 80 percent or higher, but this probably reflects our small data set and the small share of 'reclamos' within it rather than real skill. My suspicion is that the classifiers are mostly just guessing 'consulta' every time; the predictions above show that Multinomial Naive Bayes with TF-IDF, Bernoulli Naive Bayes, and SVC did exactly that. Even a dumb classifier that guessed 'consulta' every time would be accurate about 88 percent of the time on the full data set, and 85 percent (47 of 55) on this test split.
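
As a quick check of that suspicion, the sketch below lines up the correct-prediction counts reported in each section against the always-'consulta' baseline on the same 55-message test split. The counts are copied from the outputs above; the 8 'reclamos' in the test split is read off the printed `actual` array. The dictionary keys are just descriptive labels, and the SGD figure can change between runs because no `random_state` was set.

```python
# Correct predictions out of 55, as printed in each section above
correct_counts = {
    "MultinomialNB + CountVectorizer": 49,
    "MultinomialNB + TfidfVectorizer": 47,
    "BernoulliNB": 47,
    "GaussianNB": 42,
    "LogisticRegression": 45,
    "SGDClassifier": 46,
    "SVC": 47,
    "LinearSVC": 44,
    "NuSVC": 44,
}
n_test = 55
n_reclamos_in_test = 8                            # ones in the printed `actual` array
baseline_correct = n_test - n_reclamos_in_test    # always predict 'consulta'

for name, correct in sorted(correct_counts.items(), key=lambda kv: -kv[1]):
    delta = correct - baseline_correct
    print(f"{name:<34}{correct / n_test:6.1%}  ({delta:+d} vs. baseline)")

beats_baseline = [n for n, c in correct_counts.items() if c > baseline_correct]
print("Above baseline:", beats_baseline)
```

Only the count-vectorizer Multinomial Naive Bayes model lands above the baseline, and by just two messages, which is a thin margin on a test set this small.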