# Other Classifiers With Countvectorizer



In [0]:
import pandas as pd
import numpy as np
import random
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, roc_auc_score, average_precision_score, recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
import eli5
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier,VotingClassifier
from sklearn.svm import SVC
%matplotlib inline
import nltk

Below is code that will read either the entire dataset, or a stratified subset. Due to the size of the data and the computationally-expensive nature of some of these models, this notebook will be run

In [0]:
subset_amount = .05
raw = pd.read_csv(r'../data/train_preprocessed', skiprows=lambda i: i>0 and random.random() > subset_amount, 
                  usecols=['clean_question_text','target','qid'])
raw.dropna(inplace=True)

In [0]:
#raw = pd.read_csv('../data/train_preprocessed.csv', usecols=['clean_question_text','target','qid'])
#raw.dropna(inplace=True)

In [0]:
X, y = raw.clean_question_text, raw.target

### Instantiate the Vector and train/test/split

In [0]:
vectorizer = CountVectorizer(min_df=5, max_df=0.7, 
                             stop_words=stopwords.words('english'))  

In [0]:
vectorizer.fit(X)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.7, max_features=None, min_df=5,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u"you're", u"you've", u"you'll", u"you'd", u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u"she's", u'her', u'hers', u'herself', u'it', u"it's", u'its', u'itself', u'th...', u"shouldn't", u'wasn', u"wasn't", u'weren', u"weren't", u'won', u"won't", u'wouldn', u"wouldn't"],
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0) 

In [0]:
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

In [0]:
accuracy_df = pd.DataFrame()
f1_df = pd.DataFrame()

## Naive Bayes (multinomial)

In [0]:
MNB_classifier = MultinomialNB()
MNB_classifier.fit(X_train, y_train)

y_preds_test=MNB_classifier.predict(X_test)
y_probas_test=MNB_classifier.predict_proba(X_test)

y_preds_train=MNB_classifier.predict(X_train)
y_probas_train=MNB_classifier.predict_proba(X_train)

y_true_train = y_train.values
y_true_test = y_test.values

print("Train Accuracy Score: ", accuracy_score(y_true_train, y_preds_train))
print("Test Accuracy Score: ", accuracy_score(y_true_test, y_preds_test))
print("Train F1 Score: ", f1_score(y_true_train, y_preds_train))
print("Test F1 Score: ", f1_score(y_true_test, y_preds_test))


accuracy_df.set_value("NB_Multi" , "countVectorizer" , 
accuracy_score(y_true_test, y_preds_test))

f1_df.set_value("NB_Multi" , "countVectorizer" , 
f1_score(y_true_test, y_preds_test))

('Train Accuracy Score: ', 0.942145278357623)
('Test Accuracy Score: ', 0.9315970892378399)
('Train F1 Score: ', 0.6068965517241379)
('Test F1 Score: ', 0.5149375339489407)




Unnamed: 0,countVectorizer
NB_Multi,0.514938


In [0]:
f1_df

Unnamed: 0,countVectorizer
NB_Multi,0.514938


## Naive Bayes (Bernouli)

In [0]:
BNB_classifier = BernoulliNB()
BNB_classifier.fit(X_train, y_train)

y_preds_test=BNB_classifier.predict(X_test)
y_probas_test=BNB_classifier.predict_proba(X_test)

y_preds_train=BNB_classifier.predict(X_train)
y_probas_train=BNB_classifier.predict_proba(X_train)

y_true_train = y_train.values
y_true_test = y_test.values

print("Train Accuracy Score: ", accuracy_score(y_true_train, y_preds_train))
print("Test Accuracy Score: ", accuracy_score(y_true_test, y_preds_test))
print("Train F1 Score: ", f1_score(y_true_train, y_preds_train))
print("Test F1 Score: ", f1_score(y_true_test, y_preds_test))

accuracy_df.set_value("NB_Bern" , "countVectorizer" , 
accuracy_score(y_true_test, y_preds_test))

f1_df.set_value("NB_Bern" , "countVectorizer" , 
f1_score(y_true_test, y_preds_test))

('Train Accuracy Score: ', 0.942145278357623)
('Test Accuracy Score: ', 0.9324396782841823)
('Train F1 Score: ', 0.5653862753560639)
('Test F1 Score: ', 0.47562425683709875)




Unnamed: 0,countVectorizer
NB_Multi,0.514938
NB_Bern,0.475624


# SVC Classifier

In [0]:
Support Vector Classifiers (SVCs) are considered to be quite strong at 

In [0]:
#%time
#SVC_classifier = SVC(kernel="linear",probability=True,verbose=2)
#SVC_classifier.fit(X_train, y_train)

CPU times: user 0 ns, sys: 3 µs, total: 3 µs
Wall time: 8.11 µs
[LibSVM]

In [0]:
#y_preds_test=SVC_classifier.predict(X_test)
#y_probas_test=SVC_classifier.predict_proba(X_test)

In [0]:
#y_preds_train=SVC_classifier.predict(X_train)
#y_probas_train=SVC_classifier.predict_proba(X_train)

In [0]:
#y_true_train = y_train.values
#y_true_test = y_test.values

In [0]:
#print("Train Accuracy Score: ", accuracy_score(y_true_train, y_preds_train))
#print("Test Accuracy Score: ", accuracy_score(y_true_test, y_preds_test))
#print("Train F1 Score: ", f1_score(y_true_train, y_preds_train))
#print("Test F1 Score: ", f1_score(y_true_test, y_preds_test))

#accuracy_df.set_value("SVC_Classifier" , "countVectorizer" , 
#accuracy_score(y_true_test, y_preds_test))

#f1_df.set_value("SVC_Classifier" , "countVectorizer" , 
#f1_score(y_true_test, y_preds_test))

### Extra Trees

In [0]:
%time
ET_classifier = ExtraTreesClassifier(n_estimators=100,random_state=420,verbose=2,max_features=.05)
ET_classifier.fit(X_train, y_train)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.96 µs
building tree 1 of 100


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.7s remaining:    0.0s


building tree 2 of 100
building tree 3 of 100
building tree 4 of 100
building tree 5 of 100
building tree 6 of 100
building tree 7 of 100
building tree 8 of 100
building tree 9 of 100
building tree 10 of 100
building tree 11 of 100
building tree 12 of 100
building tree 13 of 100
building tree 14 of 100
building tree 15 of 100
building tree 16 of 100
building tree 17 of 100
building tree 18 of 100
building tree 19 of 100
building tree 20 of 100
building tree 21 of 100
building tree 22 of 100
building tree 23 of 100
building tree 24 of 100
building tree 25 of 100
building tree 26 of 100
building tree 27 of 100
building tree 28 of 100
building tree 29 of 100
building tree 30 of 100
building tree 31 of 100
building tree 32 of 100
building tree 33 of 100
building tree 34 of 100
building tree 35 of 100
building tree 36 of 100
building tree 37 of 100
building tree 38 of 100
building tree 39 of 100
building tree 40 of 100
building tree 41 of 100
building tree 42 of 100
building tree 43 of 100


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  2.9min finished


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features=0.05, max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=420, verbose=2, warm_start=False)

In [0]:
y_preds_test=ET_classifier.predict(X_test)
y_probas_test=ET_classifier.predict_proba(X_test)

y_preds_train=ET_classifier.predict(X_train)
y_probas_train=ET_classifier.predict_proba(X_train)

y_true_train = y_train.values
y_true_test = y_test.values

print("Train Accuracy Score: ", accuracy_score(y_true_train, y_preds_train))
print("Test Accuracy Score: ", accuracy_score(y_true_test, y_preds_test))
print("Train F1 Score: ", f1_score(y_true_train, y_preds_train))
print("Test F1 Score: ", f1_score(y_true_test, y_preds_test))

accuracy_df.set_value("ExtraTrees" , "countVectorizer" , 
accuracy_score(y_true_test, y_preds_test))

f1_df.set_value("ExtraTrees" , "countVectorizer" , 
accuracy_score(y_true_test, y_preds_test))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    3.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    3.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   11.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s


('Train Accuracy Score: ', 0.9996552846774039)
('Test Accuracy Score: ', 0.9348142474147836)
('Train F1 Score: ', 0.9972560975609756)
('Test F1 Score: ', 0.38908829863603733)


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   11.6s finished
  app.launch_new_instance()


Unnamed: 0,countVectorizer
NB_Multi,0.514938
NB_Bern,0.475624
ExtraTrees,0.934814


Even though ET is know as a "less interpretable" classifier than Logistic Regression, eli5 is still able to interpret its weights.

In [0]:
eli5.explain_prediction(ET_classifier, raw['clean_question_text'][189], vectorizer, 
                        target_names=["Sincere","Insincere"],top=(10,10))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.0s finished


Contribution?,Feature
+0.937,<BIAS>
+0.007,people
+0.007,secular
+0.006,muslims
+0.006,girls
+0.004,pakistani
+0.004,since
+0.003,political
… 1375 more positive …,… 1375 more positive …
… 178 more negative …,… 178 more negative …


In [0]:
Even though ETs are known as being less interpretable than 

In [0]:
RF_classifier = RandomForestClassifier(n_estimators=100, max_depth=15,random_state=42,max_features=.05,verbose=2)
RF_classifier.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s


building tree 1 of 100
building tree 2 of 100
building tree 3 of 100
building tree 4 of 100
building tree 5 of 100
building tree 6 of 100
building tree 7 of 100
building tree 8 of 100
building tree 9 of 100
building tree 10 of 100
building tree 11 of 100
building tree 12 of 100
building tree 13 of 100
building tree 14 of 100
building tree 15 of 100
building tree 16 of 100
building tree 17 of 100
building tree 18 of 100
building tree 19 of 100
building tree 20 of 100
building tree 21 of 100
building tree 22 of 100
building tree 23 of 100
building tree 24 of 100
building tree 25 of 100
building tree 26 of 100
building tree 27 of 100
building tree 28 of 100
building tree 29 of 100
building tree 30 of 100
building tree 31 of 100
building tree 32 of 100
building tree 33 of 100
building tree 34 of 100
building tree 35 of 100
building tree 36 of 100
building tree 37 of 100
building tree 38 of 100
building tree 39 of 100
building tree 40 of 100
building tree 41 of 100
building tree 42 of 100
b

[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    4.3s finished


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=15, max_features=0.05, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=42, verbose=2, warm_start=False)

In [0]:
y_preds_test=RF_classifier.predict(X_test)
y_probas_test=RF_classifier.predict_proba(X_test)

y_preds_train=RF_classifier.predict(X_train)
y_probas_train=RF_classifier.predict_proba(X_train)

y_true_train = y_train.values
y_true_test = y_test.values

print("Train Accuracy Score: ", accuracy_score(y_true_train, y_preds_train))
print("Test Accuracy Score: ", accuracy_score(y_true_test, y_preds_test))
print("Train F1 Score: ", f1_score(y_true_train, y_preds_train))
print("Test F1 Score: ", f1_score(y_true_test, y_preds_test))

accuracy_df.set_value("RandomForest" , "countVectorizer" , 
accuracy_score(y_true_test, y_preds_test))

f1_df.set_value("RandomForest" , "countVectorizer" , 
f1_score(y_true_test, y_preds_test))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


('Train Accuracy Score: ', 0.9398663270582377)
('Test Accuracy Score: ', 0.9377250095748755)
('Train F1 Score: ', 0.08614668218859138)
('Test F1 Score: ', 0.0378698224852071)


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.5s finished
  app.launch_new_instance()


Unnamed: 0,countVectorizer
NB_Multi,0.514938
NB_Bern,0.475624
ExtraTrees,0.934814
RandomForest,0.03787


In [0]:
eli5.explain_prediction(RF_classifier, raw['clean_question_text'][760], vectorizer, 
                        target_names=["Sincere","Insincere"], top=(10,10))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.0s finished


Contribution?,Feature
+0.937,<BIAS>
+0.003,people
+0.003,trump
+0.002,women
+0.001,muslims
+0.001,men
+0.001,indians
+0.001,white
+0.001,americans
+0.001,black


In [0]:
#eli5.explain_prediction(SVC_classifier, raw['clean_question_text'][980], vectorizer, target_names=["Sincere","Insincere"],top=(10,10))

In [0]:
eli5.explain_prediction(ET_classifier, raw['clean_question_text'][760], vectorizer, 
                        target_names=["Sincere","Insincere"],top=(10,10))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.0s finished


Contribution?,Feature
+0.937,<BIAS>
+0.005,people
+0.003,trump
+0.002,women
+0.002,men
+0.002,muslims
+0.001,americans
+0.001,indians
+0.001,quora
+0.001,black


In [0]:
#eli5.explain_prediction(SVC_classifier, raw['clean_question_text'][760], vectorizer, target_names=["Sincere","Insincere"],top=(10,10))

In [0]:
Let's check out each classifier's Accuracy and F1 Scores

### Accuracy

In [0]:
accuracy_df

Unnamed: 0,countVectorizer
NB_Multi,0.931597
NB_Bern,0.93244
ExtraTrees,0.934814
RandomForest,0.937725


### F1

In [0]:
f1_df

Unnamed: 0,countVectorizer
NB_Multi,0.514938
NB_Bern,0.475624
ExtraTrees,0.934814
RandomForest,0.03787


Between Random Forest and Extra Trees, it seems like the latter is a better fit for this task, though it was very overfit to the training data for f1 score. Both of the Naive Bayes models (as well as Logistic Regression) did pretty well, and avoided overfitting. Perhaps in the next notebook we should try "ensembling" multiple classifers together?

Note that the SVC also seems like a very promising option, however due to its high computational cost it was not used during this iteration. If time/computation is no object, it probably should be included in the list of possible models, even if it were to work on a dataset post-dimensionality reduction (although the model is also know to generalize well on high-dimensional datasets, despite the computational cost trade-off.)

Since too many transformations on high dimensional datasets has the potential to trigger a memory error, let's now look at these models with TFIDF vectorization, in the next notebook.