## Table of Contents
1. [TF-IDF Split](#tfidf)
2. [Baseline Model](#base)
3. [Initial Models](#initial)

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import metrics
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from custom import *  # personal functions

In [2]:
df = pd.read_pickle('reviews.pkl')

In [3]:
df

| | Review | Score |
|---|---|---|
| 0 | \n Keough's... | 1 |
| 1 | \n While Th... | 1 |
| 2 | \n If "The ... | 1 |
| 3 | \n "The Lod... | 1 |
| 4 | \n There's ... | 0 |
| ... | ... | ... |
| 9239 | \n a Pirand... | 1 |
| 9240 | \n A strang... | 1 |
| 9241 | \n A smart ... | 1 |
| 9242 | \n Resoluti... | 1 |


# TF-IDF Split <a id='tfidf'></a>

In [4]:
vectorizer = TfidfVectorizer()

tf_idf_data = vectorizer.fit_transform(df['Review'])

In [5]:
print(tf_idf_data[0])

  (0, 10186)	0.2922912827741062
  (0, 11043)	0.22076654975669596
  (0, 738)	0.1615446704525089
  (0, 7817)	0.13058808329115612
  (0, 1733)	0.2922912827741062
  (0, 6845)	0.12689512074690182
  (0, 16619)	0.16858733378838553
  (0, 6454)	0.2691250985331828
  (0, 10077)	0.11847799554827516
  (0, 5431)	0.25042176566097996
  (0, 15489)	0.24393273399761936
  (0, 8437)	0.2691250985331828
  (0, 16286)	0.3063277559713364
  (0, 15043)	0.1481049132738171
  (0, 14873)	0.14386319348916785
  (0, 2030)	0.11896653868650128
  (0, 14872)	0.08147888515176381
  (0, 4666)	0.17275675272683824
  (0, 14233)	0.196631551620643
  (0, 13607)	0.2587074193522821
  (0, 7910)	0.07111499828663494
  (0, 16592)	0.16858733378838553
  (0, 8161)	0.2922912827741062


In [6]:
tf_idf_data.shape

(9244, 16791)

Our vectorized data has one row per review and one column per vocabulary term: 9,244 reviews and 16,791 unique words.
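To make the weights printed above less opaque, here is a minimal pure-Python sketch of how a TF-IDF row can be computed. It assumes the settings that, as far as I know, `TfidfVectorizer()` uses by default: lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf_rows` and the toy reviews are made up for illustration.

```python
import math
import re
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    # Lowercase and keep only tokens with at least two word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows

# Toy stand-ins for reviews (not taken from reviews.pkl).
toy_reviews = ["A smart, strange film", "A strange and dull film", "Smart writing"]
rows = tfidf_rows(toy_reviews)
for term, weight in sorted(rows[1].items()):
    print(f"{term:>8}  {weight:.3f}")
```

In the second toy review, the terms that occur in only one document ("and", "dull") end up with larger weights than the ones shared with other reviews ("film", "strange"), which is the behavior we want from TF-IDF.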



In [7]:
# Average number of non-zero entries per row (one row per review)
avg_non_zero = tf_idf_data.nnz / tf_idf_data.shape[0]
print("Average Number of Non-Zero Elements in Vectorized Articles: {}".format(avg_non_zero))

# Fraction of each row's columns that are zero
percent_sparse = 1 - (avg_non_zero / tf_idf_data.shape[1])
print('Percentage of columns containing 0: {}'.format(percent_sparse))

Average Number of Non-Zero Elements in Vectorized Articles: 19.129489398528776
Percentage of columns containing 0: 0.9988607295933221


As we can see from the output above, the average vectorized review has about 19 non-zero entries out of 16,791 columns. This means that roughly 99.9% of each vector is zeroes!
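The arithmetic behind those two numbers can be checked by hand from the shape and the printed average. The sketch below also gives a rough idea of why the sparse format matters; the memory figures assume 8-byte values and 4-byte column indices, which is my assumption rather than something measured in this notebook.

```python
# Values copied from the outputs above.
n_rows, n_cols = 9244, 16791
avg_non_zero = 19.129489398528776

# Total number of stored (non-zero) entries in the matrix.
nnz = round(avg_non_zero * n_rows)

# Density is the share of entries that are non-zero; sparsity is the rest.
density = nnz / (n_rows * n_cols)
sparsity = 1 - density
print(f"non-zero entries: {nnz}")
print(f"sparsity: {sparsity:.6f}")

# Rough memory comparison (assumed sizes: 8 bytes per value, 4 bytes per index).
dense_mb = n_rows * n_cols * 8 / 1e6
sparse_mb = nnz * (8 + 4) / 1e6
print(f"dense: ~{dense_mb:.0f} MB, sparse: ~{sparse_mb:.1f} MB")
```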

# Baseline Model  <a id='base'></a>

First, let's make a dummy model which always predicts the majority class. We'll use this as a comparison for all of our actual models.

In [12]:
# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(tf_idf_data, df['Score'], random_state=13)

In [13]:
# Create dummy classifier
dummy = DummyClassifier(strategy='most_frequent', random_state=1)

# "Train" model
dummy.fit(X_train, y_train)

DummyClassifier(random_state=1, strategy='most_frequent')

In [16]:
predictions = dummy.predict(X_test)
# DataFrame.append was removed in pandas 2.0, so build the summary from a list of rows
summary_df = pd.DataFrame([{'Model': 'Dummy',
   'Accuracy': metrics.accuracy_score(y_test, predictions),
   'Recall': metrics.recall_score(y_test, predictions),
   'Precision': metrics.precision_score(y_test, predictions),
   'F1': metrics.f1_score(y_test, predictions)}],
   columns=['Model','Accuracy', 'Recall', 'Precision',  'F1'])
summary_df

| | Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|---|
| 0 | Dummy | 0.553007 | 1.0 | 0.553007 | 0.712176 |


In [36]:
print(f"The majority class is {predictions[0]}.")

The majority class is 1.


Here, we can see the dummy model is predicting only class "1". This gives us a recall of 1 (every actual positive is trivially caught), while accuracy and precision both equal the share of class 1 in the test set, 0.55. Our real models will need to do better than this!
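As a sanity check, the Dummy row can be reproduced from the class share alone. When every prediction is 1, there are no false negatives or true negatives, so accuracy and precision both equal the positive share `p`, recall is 1, and F1 simplifies to `2p / (1 + p)`. A small worked sketch, with `p` read off the table above:

```python
# Share of class 1 in the test set, taken from the Dummy row's accuracy.
p = 0.553007

# Always predicting 1 gives, as fractions of the test set:
true_pos, false_pos, false_neg, true_neg = p, 1 - p, 0.0, 0.0

accuracy = (true_pos + true_neg) / (true_pos + false_pos + false_neg + true_neg)
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.6f} precision={precision:.6f} recall={recall:.1f} f1={f1:.6f}")
```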

# Initial Models <a id='initial'></a>

Using my classification_models() function, let's create models with only the default parameters.

In [28]:
models = ['logistic','knn','tree','rf','AdaBoost','xgb','GrdBoost','svc','Bayes']
classification_models(tf_idf_data,df['Score'],models)

Using logistic
Using knn
Using tree
Using rf
Using AdaBoost
Using xgb
Using GrdBoost
Using svc
Using Bayes


| | Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|---|
| 0 | logistic | 0.756625 | 0.825919 | 0.759786 | 0.791474 |
| 1 | knn | 0.492158 | 0.302708 | 0.589454 | 0.4 |
| 2 | tree | 0.646295 | 0.672147 | 0.688119 | 0.680039 |
| 3 | rf | 0.718226 | 0.82882 | 0.713572 | 0.76689 |
| 4 | AdaBoost | 0.624121 | 0.807544 | 0.627348 | 0.706131 |
| 5 | xgb | 0.635479 | 0.844294 | 0.62987 | 0.721488 |
| 6 | GrdBoost | 0.654408 | 0.863636 | 0.641984 | 0.736495 |
| 7 | svc | 0.770146 | 0.837524 | 0.771149 | 0.802967 |
| 8 | Bayes | 0.728502 | 0.911025 | 0.696746 | 0.789606 |
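The source of `classification_models()` isn't reproduced in this notebook, so the following is only a rough sketch of what a helper like it might do: fit each named model with default parameters on one train/test split and collect the same four metrics. Everything here is an assumption for illustration, including the `ESTIMATORS` lookup, the `classification_models_sketch` name, the split settings, and the synthetic data.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical name-to-estimator lookup; the real helper's mapping is not shown.
ESTIMATORS = {
    'logistic': LogisticRegression,
    'tree': DecisionTreeClassifier,
    'Bayes': MultinomialNB,
}

def classification_models_sketch(X, y, names, random_state=13):
    """Fit each named model on a single split and tabulate its test metrics."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
    rows = []
    for name in names:
        print(f"Using {name}")
        model = ESTIMATORS[name]().fit(X_train, y_train)  # default parameters only
        pred = model.predict(X_test)
        rows.append({
            'Model': name,
            'Accuracy': metrics.accuracy_score(y_test, pred),
            'Recall': metrics.recall_score(y_test, pred, zero_division=0),
            'Precision': metrics.precision_score(y_test, pred, zero_division=0),
            'F1': metrics.f1_score(y_test, pred, zero_division=0),
        })
    return pd.DataFrame(rows, columns=['Model', 'Accuracy', 'Recall', 'Precision', 'F1'])

# Synthetic non-negative features standing in for the TF-IDF matrix.
rng = np.random.default_rng(0)
X_toy = rng.random((200, 10))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1).astype(int)

results = classification_models_sketch(X_toy, y_toy, ['logistic', 'tree', 'Bayes'])
print(results)
```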


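We want to prioritize accuracy in our text classification. By that measure the best of these default models is SVC (0.770), followed closely by logistic regression (0.757), while KNN (0.492) scores below the 0.553 accuracy of the dummy baseline. Let's check to see if using a grid search gets us any better results.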
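For readers who haven't used it, a grid search simply cross-validates every combination of the listed parameter values on the training data and keeps the best one. Here is a minimal sketch with scikit-learn's `GridSearchCV`; the synthetic data, the tiny grid, and the `cv=3` setting are assumptions chosen to keep it fast, not the settings used inside my helper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (the notebook itself uses TF-IDF review vectors).
X, y = make_classification(n_samples=200, n_features=10, random_state=13)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

# 3 values of C x 2 kernels = 6 candidate settings, each scored with 3-fold CV.
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), param_grid, scoring='accuracy', cv=3)
search.fit(X_train, y_train)

# The best setting is refit on the full training set and scored on held-out data.
test_accuracy = search.score(X_test, y_test)
print(f"Best parameters: {search.best_params_}")
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

Scoring the refit model on data the search never saw keeps the reported accuracy from being inflated by the parameter selection itself.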

In [26]:
grid_models = ['knn','tree','rf','AdaBoost','GrdBoost','svc']
param_grid = [#{'penalty': ['l1', 'l2'], 
               #'C': np.logspace(-4, 4, 10)}, #logistic
              {'n_neighbors': [1,3,5],
               'weights':['uniform','distance'],
               'metric': ['euclidean','manhattan']}, #knn
              {'criterion': ['gini', 'entropy'],
               'max_depth' : [2,5,50],
               'min_samples_leaf':[1,2,8],
               'min_samples_split':[2,4]}, #tree
              {'n_estimators': [10,50,200],
               'max_features': ['auto', 'sqrt','log2'],
               'max_depth': [2,5],
               'min_samples_split' : [2,4],
               'min_samples_leaf' : [1,4],
               'bootstrap': [True, False]}, #rf
             {'n_estimators': [10,50,200],
              'learning_rate': np.linspace(0.01,1,10)}, #adaboost
             {'learning_rate': np.linspace(0.01,1,10),
              'n_estimators': [10,50,100],
              'max_features': ['auto', 'sqrt','log2'],
              'max_depth': [2,4,8],'min_samples_split' : [2,4]}, #grdboost
             {'C' : np.linspace(0.01,10,6),
              'gamma' : np.linspace(0.01,1,6),
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}] #svc

In [27]:
classification_models(tf_idf_data,df['Score'],grid_models,param_grid=param_grid,grid=True)

knn's best parameters are {'metric': 'euclidean', 'n_neighbors': 1, 'weights': 'uniform'}
tree's best parameters are {'criterion': 'gini', 'max_depth': 50, 'min_samples_leaf': 1, 'min_samples_split': 4}
rf's best parameters are {'bootstrap': False, 'max_depth': 5, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 4, 'n_estimators': 10}
AdaBoost's best parameters are {'learning_rate': 0.56, 'n_estimators': 200}
GrdBoost's best parameters are {'learning_rate': 0.45, 'max_depth': 8, 'max_features': 'sqrt', 'min_samples_split': 4, 'n_estimators': 100}
svc's best parameters are {'C': 6.004, 'gamma': 0.6040000000000001, 'kernel': 'rbf'}


| | Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|---|
| 0 | knn | 0.614927 | 0.997099 | 0.592529 | 0.743331 |
| 1 | tree | 0.625203 | 0.659574 | 0.666667 | 0.663102 |
| 2 | rf | 0.570579 | 0.978723 | 0.567265 | 0.71824 |
| 3 | AdaBoost | 0.674419 | 0.748549 | 0.693548 | 0.72 |
| 4 | GrdBoost | 0.703624 | 0.794004 | 0.710208 | 0.749772 |
| 5 | svc | 0.775014 | 0.810445 | 0.79206 | 0.801147 |
