# Sitcoms 3: Classification with TF-IDF

This is the third of four notebooks completed on Sitcom Pilot Script Data. In this notebook, I will model the sitcom pilot data using classification techniques specific to Natural Language Processing. For those of you just joining us, this is the data that we obtained and cleaned in [Notebook 1 - Obtain/Clean](1_sitcoms_clean.ipynb) and explored in [Notebook 2 - EDA](2_sitcoms_EDA.ipynb). After I finish preprocessing the data and evaluating classifier performance, we will be ready for the grand finale - [Notebook 4 - Deep Learning](4_sitcoms_deep_learning.ipynb).

## Table of Contents


  * 1. [Import Libraries](#import)
      * 1.1 [Open and Read Data](#open)
  * 2. [Text Data Preprocessing](#text)  
      * 2.1. [Converting Scripts to Lower Case](#lower)  
      * 2.2. [Removing Punctuation from Script](#punct)  
      * 2.3. [Removing StopWords](#stop)  
      * 2.4. [Common Word Removal](#common)  
      * 2.5. [Tokenization](#token)  
  * 3. [TF-IDF](#tfidf)  
      * 3.1. [Frequency Distributions](#freqdist)  
      * 3.2. [TF-IDF Vectorization](#vector)  
      * 3.3. [Classification](#class)  
      * 3.4. [TF-IDF Performance](#tfperm)  
  * 4. [Latent Semantic Analysis](#lsa)  
      * 4.1. [LSA Performance](#lsaperm)   
  * 5. [Combining TF-IDF with Descriptive Features](#combo)  
      * 5.1. [Scaling Numerical Data](#scale)
      * 5.2. [SciPy Hstack](#combine)
      * 5.3. [TF-IDF/Feature Combination Performance](#comboperm)  
  * 6.  [Classification on Descriptive Feature Set](#feature)  
      * 6.1. [Descriptive Feature Performance ](#featperm)  
  * 7.  [Evaluation of Models](#eval)
      * 7.1. [Test Scores](#test)
      * 7.2. [Train Scores](#train)

Finally, our data is prepared for modeling! Well, almost. We still have a bit of pre-processing to do.

<a id='import'></a>

# 1. Import Libraries

In [1]:
#import necessary libraries
#dataframes
import pandas as pd

#math
import numpy as np
np.random.seed(0)

#visualizations
import seaborn as sns

#nltk
import nltk
from nltk import FreqDist
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim.models import Word2Vec

# spacy for lemmatization
import spacy

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.decomposition import TruncatedSVD

# machine learning classifiers
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

<a id='open'></a>

### 1.1. Open and Read Data

In [2]:
data_text = pd.read_csv('scripts.csv') #open and read clean script df
df_clean = pd.read_csv('pilotfeatures.csv') #open and read features script df

<a id='text'></a>

# 2. Text Data Pre-Processing

<a id='lower'></a>

### 2.1. Converting Scripts to Lower Case

We want to replace all instances of upper case letters with their lower case counterparts. This way, we will not have multiple copies of the same word (e.g., "Discombobulation" and "discombobulation").

In [3]:
df_clean['script'] = df_clean['script'].apply(lambda x: " ".join(x.lower() for x in x.split()))

<a id='punct'></a>

### 2.2. Removing Punctuation from Script

We have already extracted variables from our punctuation, so the next step in cleaning is to remove punctuation as it will make our text easier to work with.

In [4]:
df_clean['script'] = df_clean['script'].str.replace(r'[^\w\s]', '', regex=True) #removing punctuation (regex=True is needed in pandas 2.0+, where the default is a literal match)
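To see what this pattern does, here is a small standalone sketch using Python's built-in re module. The sample line is made up for illustration and is not from the scripts. The pattern matches any character that is neither a word character nor whitespace, so apostrophes disappear and contractions collapse into single tokens.

```python
import re

#hypothetical sample line, not taken from the dataset
sample = "Oh, I don't know... you're kidding, right?!"

#same pattern as in the cell above: drop anything that is not a word character or whitespace
no_punct = re.sub(r"[^\w\s]", "", sample.lower())
print(no_punct)
```

This is why tokens such as "dont" and "youre" show up in the word counts later in the notebook.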

<a id='stop'></a>

### 2.3. Removing StopWords

We will be using NLTK's predefined library of stopwords (super common words such as "a" or "on").

In [5]:
from nltk.corpus import stopwords
stop = stopwords.words('english') #define stopwords through NLTK's library

#dropping stopwords and re-joining the remaining words with single spaces
df_clean['script'] = df_clean['script'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

<a id='common'></a>

### 2.4. Common Word Removal

Let's take a look at the 10 most frequently occurring words in the script lines and then make a call to remove these words or retain them in the dataset.

In [6]:
#most common words
freq = pd.Series(' '.join(df_clean['script']).split()).value_counts()[:10] 
freq

im       1644
know     1224
oh       1102
dont     1016
like      994
yeah      869
youre     846
get       836
okay      835
right     773
dtype: int64

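One can definitely make an argument to leave these words in, but I think that for classification purposes, we are better off without them. They do not seem particularly informative. Several of them ("im", "dont", "youre") are contractions that slipped past the stopword filter because we stripped their apostrophes first, so they no longer match NLTK's entries such as "don't" and "you're".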

In [7]:
#creating list of most common words
freq = list(freq.index)
#removing them from script
df_clean['script'] = df_clean['script'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))

I'm going to go ahead and save our data as is because we may use a different type of tokenizer in our deep learning notebook.

In [8]:
df_clean.to_csv('cleandata.csv') #saving the file to access it later

<a id='token'></a>

### 2.5. Tokenization

Our data also needs to be tokenized, meaning that we need to divide the text into a sequence of words.

In [9]:
from nltk.tokenize import word_tokenize
#apply nltk's word tokenizer to our data
df_clean.script = df_clean['script'].apply(word_tokenize)

<a id='tfidf'></a>

# 3. TF-IDF

To understand TF-IDF, we should go back a step. Term frequency (which is the TF in TF-IDF) is the ratio of the count of a word in a row (here, a line of script) to the total number of words in that row.

In other words:

TF = (Number of times term T appears in the particular row) / (number of terms in that row)

The second part is IDF (Inverse Document Frequency), which can be defined as the log of the ratio of the total number of rows to the number of rows in which said word exists. We use IDF because a word isn't very informative to us if it’s appearing in each of the documents.

IDF = log(N/n), where N is the total number of rows and n is the number of rows in which the word was present

Finally, TF-IDF is the product of TF and IDF.
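The definitions above can be worked through by hand. The sketch below uses a made-up three-row corpus (not the sitcom data), and the helper names tf, idf, and tf_idf are my own, purely for illustration.

```python
import math

#toy corpus of three tokenized "rows" (hypothetical, for illustration only)
rows = [
    ["hey", "dad", "hey"],
    ["hey", "mom"],
    ["dad", "wait"],
]

def tf(term, row):
    #share of the row's tokens that are this term
    return row.count(term) / len(row)

def idf(term, rows):
    #log of (total rows / rows containing the term)
    n_containing = sum(1 for row in rows if term in row)
    return math.log(len(rows) / n_containing)

def tf_idf(term, row, rows):
    #product of the two pieces
    return tf(term, row) * idf(term, rows)

print(tf("hey", rows[0]))             #2 of the 3 tokens in the first row
print(idf("hey", rows))               #log(3 / 2): "hey" appears in 2 of 3 rows
print(tf_idf("wait", rows[2], rows))  #0.5 * log(3 / 1)
```

Keep in mind that scikit-learn's TfidfVectorizer, which we use later, applies a smoothed variant by default: it uses raw counts for TF, computes IDF as ln((1 + N) / (1 + n)) + 1, and then L2-normalizes each row. Its numbers will therefore differ from this textbook version.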

<a id='freqdist'></a>

### 3.1. Frequency Distributions

Let's continue to explore our data by using frequency distributions to see which terms pop up the most in our scripts.

In [11]:
#let's start with getting a list of all of the unique words in our data
#since it's a "set" it will only gather unique words, no duplicates
total_vocab = set()

for line in df_clean.script:
    total_vocab.update(line)
len(total_vocab) #let's see how long this list of unique words is!

10810

Looks like we have almost 11,000 unique words in our total vocabulary for this data! Now let's gather each line into a single list and use FreqDist() to examine the frequency distribution.

In [12]:
#concatenate each line of each script into a single list
lines_concat = []
for line in df_clean.script:
    lines_concat += line

In [13]:
#make a list of the FreqDist of our scripts list
lines_freqdist = FreqDist(lines_concat)
#print the most common 200 words in our scripts
lines_freqdist.most_common(200)

[('na', 734),
 ('got', 682),
 ('thats', 680),
 ('hey', 675),
 ('go', 670),
 ('gon', 602),
 ('well', 585),
 ('good', 521),
 ('one', 497),
 ('think', 473),
 ('want', 465),
 ('come', 425),
 ('really', 423),
 ('look', 414),
 ('see', 365),
 ('man', 354),
 ('time', 326),
 ('back', 321),
 ('going', 320),
 ('little', 315),
 ('say', 312),
 ('hes', 311),
 ('yes', 304),
 ('cant', 303),
 ('ill', 295),
 ('mean', 286),
 ('would', 285),
 ('uh', 280),
 ('need', 280),
 ('tell', 275),
 ('didnt', 267),
 ('love', 261),
 ('sorry', 255),
 ('guys', 252),
 ('people', 245),
 ('let', 241),
 ('take', 240),
 ('something', 237),
 ('make', 237),
 ('whats', 229),
 ('could', 228),
 ('god', 226),
 ('us', 225),
 ('never', 223),
 ('thing', 219),
 ('ive', 216),
 ('lets', 213),
 ('theres', 208),
 ('thank', 207),
 ('two', 206),
 ('dad', 206),
 ('great', 201),
 ('way', 199),
 ('even', 197),
 ('said', 191),
 ('mom', 189),
 ('wait', 189),
 ('work', 187),
 ('shes', 184),
 ('first', 179),
 ('sure', 170),
 ('call', 168),
 ...]

This is not too informative because we don't have the breakdown by class here, but it's still interesting to see which words are used the most in general. (The tokens "gon" and "na" near the top come from NLTK's tokenizer splitting contractions such as "gonna" into "gon" and "na".) Something that surprised me was that the word "dad" was actually used more often than the word "mom"! It's also funny that the word "guys" is right above the word "people". I wonder if those words are used interchangeably a lot, or if failed sitcoms tend to use one term more than the other. To get the information we really want, let's use TF-IDF Vectorization.
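If we did want a per-class breakdown, one option is to keep a separate counter for each label. This is only a sketch: the tokenized lines and the label values "success" and "fail" below are hypothetical stand-ins, and I use collections.Counter (which NLTK's FreqDist is built on) so that it runs without any downloads.

```python
from collections import Counter

#hypothetical tokenized lines with labels; the real notebook would pair
#df_clean.script with the label column instead
lines = [
    (["hey", "guys", "look"], "success"),
    (["people", "say", "hey"], "fail"),
    (["guys", "come", "guys"], "success"),
    (["people", "people", "wait"], "fail"),
]

#one Counter per class label
freq_by_label = {}
for tokens, label in lines:
    freq_by_label.setdefault(label, Counter()).update(tokens)

#show the two most common words for each label
for label, counts in freq_by_label.items():
    print(label, counts.most_common(2))
```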

<a id='vector'></a>

### 3.2. TF-IDF Vectorization

To properly use scikit-learn's TfidfVectorizer, we must pass in the data as raw text documents, because the TfidfVectorizer handles the count vectorization process on its own and then fits and transforms the data into TF-IDF format. Additionally, for the purposes of TF-IDF vectorization, we do not want to clean or "omit" anything from our data (such as stop words or common words) because we would be losing information.

Don't worry though, it was still necessary to pre-process the data to get our total vocabulary variable above and we will use our cleaned dataframe when we work on deep learning in the next notebook.

In [14]:
#splitting our target from our feature set
#normally, we would do an 80/20 split, but because of how small our data is we want to expand the test set as much as possible
X_train, X_test, y_train, y_test  = train_test_split(
    data_text.script, 
    data_text.label,
    train_size=0.75, 
    random_state=0)

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

#initialize vectorizer
vectorizer = TfidfVectorizer()

#fit the vectorizer on the train set only, then apply the same vocabulary and weights to the test set
tf_idf_data_train = vectorizer.fit_transform(X_train)
tf_idf_data_test = vectorizer.transform(X_test)
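To make the fit/transform distinction concrete, here is a tiny sketch on made-up documents (the toy_ names and sample sentences are hypothetical). The vectorizer learns its vocabulary from the training documents only, so a word that appears only in the test document is simply ignored, and both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

#hypothetical mini train/test split
train_docs = ["hey dad wait", "hey mom look"]
test_docs = ["hey grandma wait"]   #"grandma" never appears in training

toy_vectorizer = TfidfVectorizer()
toy_train = toy_vectorizer.fit_transform(train_docs)  #learns vocabulary and idf weights
toy_test = toy_vectorizer.transform(test_docs)        #reuses them, learns nothing new

print(sorted(toy_vectorizer.vocabulary_))
print(toy_train.shape, toy_test.shape)
```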

In [16]:
tf_idf_data_train.shape #check the size 

(21362, 8771)

Our vectorized training data contains 21,362 script lines, with 8,771 unique words in the vocabulary.

Since each line in our scripts only uses a tiny subset of our total vocabulary, the majority of our columns for any given line of script will be zero. These extremely zero-heavy vectors are called "Sparse Vectors" and are very common in NLP projects.

In [17]:
#get average of non-zero elements
non_zero_cols = tf_idf_data_train.nnz / float(tf_idf_data_train.shape[0])
print("Average Number of Non-Zero Elements in Vectorized Articles: {}".format(non_zero_cols))

#calculate percentage
percent_sparse = 1 - (non_zero_cols / float(tf_idf_data_train.shape[1]))
print('Percentage of columns containing 0: {}'.format(percent_sparse))

Average Number of Non-Zero Elements in Vectorized Articles: 5.308959835221422
Percentage of columns containing 0: 0.9993947144185131


<a id='class'></a>

### 3.3. Classification

We are going to start by running six classifiers on our TF-IDF data: 

Multinomial Naive Bayes, Random Forest, Logistic Regression, Decision Tree, K-Nearest Neighbors, and Support Vector Machine.

In [18]:
#initialize the classifiers
nb_classifier = MultinomialNB()
rf_classifier = RandomForestClassifier(n_estimators=100)
lr_classifier = LogisticRegression(penalty='l2', solver='liblinear')
dt_classifier = DecisionTreeClassifier()
knn_classifier = KNeighborsClassifier()
svc_classifier = SVC()

In [19]:
%%time

#fit support vector machine classifier and predict on train and test data
svc_classifier.fit(tf_idf_data_train, y_train)
svc_train_preds = svc_classifier.predict(tf_idf_data_train)
svc_test_preds = svc_classifier.predict(tf_idf_data_test)



Wall time: 56.3 s


In [20]:
%%time

#fit k-nearest neighbors classifier and predict on train and test data
knn_classifier.fit(tf_idf_data_train, y_train)
knn_train_preds = knn_classifier.predict(tf_idf_data_train)
knn_test_preds = knn_classifier.predict(tf_idf_data_test)

Wall time: 18.2 s


In [21]:
%%time

#fit decision tree classifier and predict on test and train data
dt_classifier.fit(tf_idf_data_train, y_train)
dt_train_preds = dt_classifier.predict(tf_idf_data_train)
dt_test_preds = dt_classifier.predict(tf_idf_data_test)

Wall time: 4.7 s


In [22]:
%%time

#fit logistic regression classifier and predict on test and train data
lr_classifier.fit(tf_idf_data_train, y_train)
lr_train_preds = lr_classifier.predict(tf_idf_data_train)
lr_test_preds = lr_classifier.predict(tf_idf_data_test)

Wall time: 154 ms


In [23]:
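%%time

#fit multinomial naive bayes classifier and predict on test and train data
#note: the "Wall time: 0 ns" recorded below came from a run that used the line magic %time,
#which timed an empty statement instead of this cell; %%time times the whole cell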
nb_classifier.fit(tf_idf_data_train, y_train)
nb_train_preds = nb_classifier.predict(tf_idf_data_train)
nb_test_preds = nb_classifier.predict(tf_idf_data_test)

Wall time: 0 ns


In [24]:
%%time

#fit random forest classifier and predict on test and train data
rf_classifier.fit(tf_idf_data_train, y_train)
rf_train_preds = rf_classifier.predict(tf_idf_data_train)
rf_test_preds = rf_classifier.predict(tf_idf_data_test)

Wall time: 42.6 s


Time to see how each of these classifiers performed!

<a id='tfperm'></a>

### 3.4. TF-IDF Performance

In [25]:
#compute accuracy scores on both test set and train set for each classifier
nb_train_score = accuracy_score(y_train, nb_train_preds)
nb_test_score = accuracy_score(y_test, nb_test_preds)
rf_train_score = accuracy_score(y_train, rf_train_preds)
rf_test_score = accuracy_score(y_test, rf_test_preds)
lr_train_score = accuracy_score(y_train, lr_train_preds)
lr_test_score = accuracy_score(y_test, lr_test_preds)
dt_train_score = accuracy_score(y_train, dt_train_preds)
dt_test_score = accuracy_score(y_test, dt_test_preds)
knn_train_score = accuracy_score(y_train, knn_train_preds)
knn_test_score = accuracy_score(y_test, knn_test_preds)
svc_train_score = accuracy_score(y_train, svc_train_preds)
svc_test_score = accuracy_score(y_test, svc_test_preds)


#print test/train accuracy scores for each classifier
print("Multinomial Naive Bayes")
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(nb_train_score, nb_test_score))
print("")
print('-'*70)
print("")
print('Random Forest')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(rf_train_score, rf_test_score))
print("")
print('-'*70)
print("")
print('Logistic Regression')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(lr_train_score, lr_test_score))
print("")
print('-'*70)
print("")
print('Decision Tree')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(dt_train_score, dt_test_score))
print("")
print('-'*70)
print("")
print('K-Nearest Neighbors')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(knn_train_score, knn_test_score))
print("")
print('-'*70)
print("")
print('Support Vector Machine')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(svc_train_score, svc_test_score))

Multinomial Naive Bayes
Training Accuracy: 0.7454 		 Testing Accuracy: 0.6249

----------------------------------------------------------------------

Random Forest
Training Accuracy: 0.9256 		 Testing Accuracy: 0.5855

----------------------------------------------------------------------

Logistic Regression
Training Accuracy: 0.7346 		 Testing Accuracy: 0.6192

----------------------------------------------------------------------

Decision Tree
Training Accuracy: 0.9256 		 Testing Accuracy: 0.5607

----------------------------------------------------------------------

K-Nearest Neighbors
Training Accuracy: 0.6505 		 Testing Accuracy: 0.5498

----------------------------------------------------------------------

Support Vector Machine
Training Accuracy: 0.5253 		 Testing Accuracy: 0.527


Our best classifier on the testing data was Multinomial Naive Bayes (0.6249), while the best training scores were a tie between Random Forest and Decision Tree (0.9256 each). Overall, most of the classifiers performed much better on the training set than they did on the test set, which means that our models are over-fitting. We may be able to reduce this through tuning, but because of the small size of the dataset, there might not be much that we can do.
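A quick way to keep an eye on over-fitting is to track the gap between train and test accuracy for each model in a loop. The sketch below runs on synthetic data from make_classification rather than the sitcom scripts, and the names models and gaps are my own, so treat it as an illustration of the bookkeeping and not as a reproduction of the scores above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

#synthetic stand-in data (not the sitcom scripts)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.75, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(solver="liblinear"),
}

#fit each model and record how far train accuracy sits above test accuracy
gaps = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    gaps[name] = train_acc - test_acc
    print("{}: train {:.3f}, test {:.3f}, gap {:.3f}".format(name, train_acc, test_acc, gaps[name]))
```

A large positive gap is the same symptom we see with Random Forest and Decision Tree above.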

<a id='lsa'></a>

# 4. Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA) is used when we want to perform dimensionality reduction on text data. As we saw from the shape of tf_idf_data_train, our vectors are long and unwieldy because each word in our total vocabulary has its own component. To make this data more manageable, we can use scikit-learn's "TruncatedSVD" class, which performs dimensionality reduction. Not only does LSA reduce computational load, it can also improve the type of features that we keep. In other words, it trades a large number of mixed-performance features for a smaller set of better features.

In [26]:
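#initialize SVD, fit it on the train tf-idf set, then apply the same projection to the test set
svd = TruncatedSVD(n_components=100, random_state=42) #dimensionality of output data, for LSA a value of 100 is recommended
lsa_train = svd.fit_transform(tf_idf_data_train) 
#use transform (not fit_transform) so the test set is projected onto the components learned from train
#note: the scores recorded in section 4.1 came from a run that called fit_transform here, so they may change on re-run
lsa_test = svd.transform(tf_idf_data_test)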

We are now going to go ahead and run the same classifiers that we used above, but this time we will omit Multinomial Naive Bayes. MNB requires non-negative feature values, and the components produced by SVD can be negative, so we would have to use NMF (non-negative matrix factorization) instead of SVD.
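Here is a small sketch of both points on a made-up corpus (the toy_ names and sentences are hypothetical). It learns the SVD components on the training matrix and reuses them on the test matrix, then shows that MultinomialNB refuses a matrix containing a negative value.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

#hypothetical mini corpus
train_docs = ["hey dad wait", "hey mom look", "dad look wait", "mom wait hey"]
test_docs = ["hey dad look"]

toy_vec = TfidfVectorizer()
toy_train = toy_vec.fit_transform(train_docs)
toy_test = toy_vec.transform(test_docs)

toy_svd = TruncatedSVD(n_components=2, random_state=42)
lsa_toy_train = toy_svd.fit_transform(toy_train)  #learn components on train only
lsa_toy_test = toy_svd.transform(toy_test)        #project test onto the same components

print(lsa_toy_train.shape, lsa_toy_test.shape)

#MultinomialNB rejects negative inputs, which SVD output may contain
try:
    MultinomialNB().fit(np.array([[0.5, -0.2], [0.1, 0.3]]), [0, 1])
    nb_error = None
except ValueError as err:
    nb_error = str(err)
print(nb_error)
```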

In [27]:
%%time

#fit support vector machine classifier and predict on train and test data
svc_classifier.fit(lsa_train, y_train)
svc_train_preds = svc_classifier.predict(lsa_train)
svc_test_preds = svc_classifier.predict(lsa_test)



Wall time: 2min 48s


In [28]:
%%time

#fit k-nearest neighbors classifier and predict on train and test data
knn_classifier.fit(lsa_train, y_train)
knn_train_preds = knn_classifier.predict(lsa_train)
knn_test_preds = knn_classifier.predict(lsa_test)

Wall time: 1min 35s


In [29]:
%%time

#fit decision tree classifier and predict on test and train data
dt_classifier.fit(lsa_train, y_train)
dt_train_preds = dt_classifier.predict(lsa_train)
dt_test_preds = dt_classifier.predict(lsa_test)

Wall time: 3.56 s


In [30]:
%%time

#fit logistic regression classifier and predict on test and train data
lr_classifier.fit(lsa_train, y_train)
lr_train_preds = lr_classifier.predict(lsa_train)
lr_test_preds = lr_classifier.predict(lsa_test)

Wall time: 300 ms


In [31]:
%%time

#fit random forest classifier and predict on test and train data
rf_classifier.fit(lsa_train, y_train)
rf_train_preds = rf_classifier.predict(lsa_train)
rf_test_preds = rf_classifier.predict(lsa_test)

Wall time: 22.1 s


<a id='lsaperm'></a>

### 4.1.  LSA Performance

In [32]:
#compute accuracy scores on both test set and train set for each classifier

rf_train_score2 = accuracy_score(y_train, rf_train_preds)
rf_test_score2 = accuracy_score(y_test, rf_test_preds)
lr_train_score2 = accuracy_score(y_train, lr_train_preds)
lr_test_score2 = accuracy_score(y_test, lr_test_preds)
dt_train_score2 = accuracy_score(y_train, dt_train_preds)
dt_test_score2 = accuracy_score(y_test, dt_test_preds)
knn_train_score2 = accuracy_score(y_train, knn_train_preds)
knn_test_score2 = accuracy_score(y_test, knn_test_preds)
svc_train_score2 = accuracy_score(y_train, svc_train_preds)
svc_test_score2 = accuracy_score(y_test, svc_test_preds)


#print test/train accuracy scores for each classifier

print('Random Forest')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(rf_train_score2, rf_test_score2))
print("")
print('-'*70)
print("")
print('Logistic Regression')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(lr_train_score2, lr_test_score2))
print("")
print('-'*70)
print("")
print('Decision Tree')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(dt_train_score2, dt_test_score2))
print("")
print('-'*70)
print("")
print('K-Nearest Neighbors')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(knn_train_score2, knn_test_score2))
print("")
print('-'*70)
print("")
print('Support Vector Machine')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(svc_train_score2, svc_test_score2))

Random Forest
Training Accuracy: 0.9203 		 Testing Accuracy: 0.5061

----------------------------------------------------------------------

Logistic Regression
Training Accuracy: 0.5453 		 Testing Accuracy: 0.5165

----------------------------------------------------------------------

Decision Tree
Training Accuracy: 0.9207 		 Testing Accuracy: 0.5088

----------------------------------------------------------------------

K-Nearest Neighbors
Training Accuracy: 0.6763 		 Testing Accuracy: 0.505

----------------------------------------------------------------------

Support Vector Machine
Training Accuracy: 0.5253 		 Testing Accuracy: 0.527


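Our training scores look broadly similar to what we saw on the TF-IDF data without LSA, with the exception of Logistic Regression, which dropped from 0.7346 to 0.5453. Meanwhile, all of our test scores are hovering around 50%, which is definitely a negative change. Once again, most classifiers score much higher on the training set than on the test set. One caveat: these scores were recorded with the SVD refit on the test set (fit_transform) rather than applied with transform, which puts the train and test vectors in different component spaces, so the LSA test scores should be re-checked before reading too much into them.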

<a id='combo'></a>

# 5. Combining TF-IDF with Our New Features

Well, now that we've explored TF-IDF a bit, let's see what the features that we created are capable of! Let's combine our text-descriptive features with our TF-IDF data and then run a few of our classifiers again.

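First things first, though - before we can model our data, we will standardize it using sklearn's "StandardScaler". Then we will perform our train/test split on the standardized numerical columns. Note that fitting the scaler before splitting lets test-set statistics leak into the training features; a stricter approach is to fit the scaler on the training split only and reuse it on the test split.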

In [33]:
#since we need to perform TF-IDF vectorization again, we will re-open our uncleaned df that still has our features included
scaled_data = pd.read_csv('pilotfeatures.csv') #open and read features script df again

<a id='scale'></a>

### 5.1. Scaling Numerical Data

In [34]:
#create list of columns that we want standardized
num_cols = ['numerics','upper', 'word_count', 'num_exclamation_marks', 'num_question_marks',
       'num_punctuation', 'avg_word', 'contro', 'names', 'nouns', 'adjectives',
       'verbs', 'nouns_vs_words', 'adjectives_vs_words', 'verbs_vs_words']


from sklearn.preprocessing import StandardScaler

#standardize our numerical features
scaled_data[num_cols] = StandardScaler().fit_transform(scaled_data[num_cols])

#splitting our target from our feature set
X_train, X_test, y_train, y_test  = train_test_split(
    scaled_data, 
    scaled_data.label,
    train_size=0.75, 
    random_state=0)
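The cell above fits the scaler on the full dataset before splitting. For comparison, here is a sketch of the split-first ordering, where the scaler only ever sees the training rows. It uses synthetic numbers in place of our num_cols columns, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

#synthetic numeric features standing in for the num_cols columns
rng = np.random.RandomState(0)
features = rng.normal(loc=10.0, scale=3.0, size=(200, 3))
labels = rng.randint(0, 2, size=200)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, train_size=0.75, random_state=0)

scaler = StandardScaler()
f_train_scaled = scaler.fit_transform(f_train)  #means/stds estimated from train only
f_test_scaled = scaler.transform(f_test)        #test reuses the train statistics

print(f_train_scaled.mean(axis=0).round(6))  #train columns are centered at zero
print(f_test_scaled.mean(axis=0).round(3))   #test columns are close to, but not exactly, zero
```

With a dataset this size the difference between the two orderings is probably small, but the split-first version keeps the test set untouched until evaluation.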

In [35]:
#run tf-idf vectorization on train and test sets again
tf_idf_data_train = vectorizer.fit_transform(X_train.script)
tf_idf_data_test = vectorizer.transform(X_test.script)

Recall that we have a few other columns tucked into our dataset, such as "title" and our handy-dandy script variable. When setting X_train features, we will make sure to only include our text descriptive features in our set. 

In [36]:
#define X_train/X_test features so that we can add them to our tf-idf data
X_train_features = X_train[num_cols] 
X_test_features = X_test[num_cols]

<a id='combine'></a>

### 5.2. SciPy Hstack

Great! Now that our TF-IDF data and feature data are both ready to go, we can combine them using SciPy's hstack function, which is used to horizontally combine sparse matrices without having to convert anything to a dense format.

In [37]:
from scipy.sparse import hstack
#combine our tf-idf data with self-made features
X_train = hstack([tf_idf_data_train, X_train_features])
X_test = hstack([tf_idf_data_test, X_test_features])

In [38]:
X_train.shape #check shape

(21362, 8786)

Our previous X-train shape (when it only included our TF-IDF data) was (21362, 8771). Our current X-train shape is now (21362, 8786) to account for the 15 descriptive features that we added.
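The same column arithmetic can be checked on a toy example. In this sketch the two small matrices are made up, and I convert the dense block to a sparse matrix explicitly before stacking, which is an extra step the cell above does not take.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, issparse

#toy stand-ins: a sparse "TF-IDF" block and a dense block of extra features
toy_tfidf = csr_matrix(np.array([[0.0, 0.7, 0.0],
                                 [0.5, 0.0, 0.5]]))
toy_features = np.array([[1.2, -0.3],
                         [0.4, 0.9]])

#hstack returns COO format by default; tocsr() gives efficient row slicing
combined = hstack([toy_tfidf, csr_matrix(toy_features)]).tocsr()

print(combined.shape)      #3 tf-idf columns + 2 feature columns
print(issparse(combined))  #the result stays sparse
```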

In [39]:
%%time

#fit k-nearest neighbors classifier and predict on train and test data
knn_classifier.fit(X_train, y_train)
knn_train_preds = knn_classifier.predict(X_train)
knn_test_preds = knn_classifier.predict(X_test)

Wall time: 46.9 s


In [40]:
%%time

#fit decision tree classifier and predict on test and train data
dt_classifier.fit(X_train, y_train)
dt_train_preds = dt_classifier.predict(X_train)
dt_test_preds = dt_classifier.predict(X_test)

Wall time: 5.62 s


In [41]:
%%time

#fit logistic regression classifier and predict on test and train data
lr_classifier.fit(X_train, y_train)
lr_train_preds = lr_classifier.predict(X_train)
lr_test_preds = lr_classifier.predict(X_test)

Wall time: 672 ms


In [42]:
%%time

#fit random forest classifier and predict on test and train data
rf_classifier.fit(X_train, y_train)
rf_train_preds = rf_classifier.predict(X_train)
rf_test_preds = rf_classifier.predict(X_test)

Wall time: 43.9 s


<a id='comboperm'></a>

### 5.3. TF-IDF/Feature Combination Performance

Fingers crossed...

In [43]:
rf_train_score3 = accuracy_score(y_train, rf_train_preds)
rf_test_score3 = accuracy_score(y_test, rf_test_preds)
lr_train_score3 = accuracy_score(y_train, lr_train_preds)
lr_test_score3 = accuracy_score(y_test, lr_test_preds)
dt_train_score3 = accuracy_score(y_train, dt_train_preds)
dt_test_score3 = accuracy_score(y_test, dt_test_preds)
knn_train_score3 = accuracy_score(y_train, knn_train_preds)
knn_test_score3 = accuracy_score(y_test, knn_test_preds)

print('Random Forest')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(rf_train_score3, rf_test_score3))
print("")
print('-'*70)
print("")
print('Logistic Regression')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(lr_train_score3, lr_test_score3))
print("")
print('-'*70)
print("")
print('Decision Tree')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(dt_train_score3, dt_test_score3))
print("")
print('-'*70)
print("")
print('K-Nearest Neighbors')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(knn_train_score3, knn_test_score3))
print("")
print('-'*70)
print("")

Random Forest
Training Accuracy: 0.9433 		 Testing Accuracy: 0.5828

----------------------------------------------------------------------

Logistic Regression
Training Accuracy: 0.7351 		 Testing Accuracy: 0.6141

----------------------------------------------------------------------

Decision Tree
Training Accuracy: 0.9433 		 Testing Accuracy: 0.5475

----------------------------------------------------------------------

K-Nearest Neighbors
Training Accuracy: 0.6865 		 Testing Accuracy: 0.5091

----------------------------------------------------------------------



The test accuracy scores for our TF-IDF/feature combination data are better than our LSA data, but they do not look much different from how our TF-IDF data performed alone. In fact, each test score is a bit lower than with TF-IDF alone (for example, Logistic Regression went from 0.6192 to 0.6141). That suggests that the features we generated do not add much predictive power, but I think it's still worth it to run a few classifiers on only our feature set.

<a id='feature'></a>

# 6. Classification on Descriptive Feature Set

Finally, let's see how our core four classifiers perform when just using the descriptive features that we generated. I do not have much hope for this run-through, but I would like to know for sure how their performance ranks. The only preparation that we will be doing that we haven't done yet is label encoding our target variable (mapping each class to an integer).

In [44]:
from sklearn import preprocessing
#initialize label encoder
le = preprocessing.LabelEncoder()

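#label encode our label column (each class becomes an integer)
scaled_data['label'] = le.fit_transform(scaled_data['label'])

#standardize our numerical features (already standardized in section 5.1, so this leaves them essentially unchanged)
scaled_data[num_cols] = StandardScaler().fit_transform(scaled_data[num_cols])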

#splitting our target variable from our train set one last time (at least in this notebook)
X_train, X_test, y_train, y_test  = train_test_split(
    scaled_data[num_cols], 
    scaled_data.label,
    train_size=0.75, 
    random_state=0)
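For anyone unsure what LabelEncoder actually produces, here is a minimal sketch. The label strings "success" and "fail" are hypothetical; the real values live in scaled_data['label']. Unlike one-hot encoding, which creates one column per class, label encoding keeps a single column of integers.

```python
from sklearn.preprocessing import LabelEncoder

#hypothetical label values, for illustration only
raw_labels = ["success", "fail", "fail", "success"]

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(raw_labels)

print(list(toy_le.classes_))  #classes are stored in sorted order
print(list(encoded))          #each label becomes the index of its class
```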

In [45]:
%%time

#fit k-nearest neighbors classifier and predict on train and test data
knn_classifier.fit(X_train, y_train)
knn_train_preds = knn_classifier.predict(X_train)
knn_test_preds = knn_classifier.predict(X_test)

Wall time: 5.61 s


In [46]:
%%time

#fit decision tree classifier and predict on test and train data
dt_classifier.fit(X_train, y_train)
dt_train_preds = dt_classifier.predict(X_train)
dt_test_preds = dt_classifier.predict(X_test)

Wall time: 110 ms


In [47]:
%%time

#fit logistic regression classifier and predict on test and train data
lr_classifier.fit(X_train, y_train)
lr_train_preds = lr_classifier.predict(X_train)
lr_test_preds = lr_classifier.predict(X_test)

Wall time: 86 ms


In [48]:
%%time

#fit random forest classifier and predict on test and train data
rf_classifier.fit(X_train, y_train)
rf_train_preds = rf_classifier.predict(X_train)
rf_test_preds = rf_classifier.predict(X_test)

Wall time: 2.72 s


<a id='featperm'></a>

### 6.1. Descriptive Feature Performance

In [49]:
rf_train_score4 = accuracy_score(y_train, rf_train_preds)
rf_test_score4 = accuracy_score(y_test, rf_test_preds)
lr_train_score4 = accuracy_score(y_train, lr_train_preds)
lr_test_score4 = accuracy_score(y_test, lr_test_preds)
dt_train_score4 = accuracy_score(y_train, dt_train_preds)
dt_test_score4 = accuracy_score(y_test, dt_test_preds)
knn_train_score4 = accuracy_score(y_train, knn_train_preds)
knn_test_score4 = accuracy_score(y_test, knn_test_preds)

print('Random Forest')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(rf_train_score4, rf_test_score4))
print("")
print('-'*70)
print("")
print('Logistic Regression')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(lr_train_score4, lr_test_score4))
print("")
print('-'*70)
print("")
print('Decision Tree')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(dt_train_score4, dt_test_score4))
print("")
print('-'*70)
print("")
print('K-Nearest Neighbors')
print("Training Accuracy: {:.4} \t\t Testing Accuracy: {:.4}".format(knn_train_score4, knn_test_score4))
print("")
print('-'*70)
print("")

Random Forest
Training Accuracy: 0.808 		 Testing Accuracy: 0.5036

----------------------------------------------------------------------

Logistic Regression
Training Accuracy: 0.53 		 Testing Accuracy: 0.5276

----------------------------------------------------------------------

Decision Tree
Training Accuracy: 0.808 		 Testing Accuracy: 0.5046

----------------------------------------------------------------------

K-Nearest Neighbors
Training Accuracy: 0.6533 		 Testing Accuracy: 0.504

----------------------------------------------------------------------



As predicted, the descriptive feature set did not do too well by itself. The test set scores for all of the classifiers hover around 50%, with the high being Logistic Regression at about 53%. Additionally, Random Forest, Decision Tree, and K-Nearest Neighbors are still over-fitting, as evident from their much higher scores on the training set than on the test set.

<a id='eval'></a>

# 7. Evaluation of Models

Let's compile all of our scores into one data frame so that we can easily compare everything.

In [50]:
#creating dataframe for TF-IDF
classifiers1 = ["Multinomial Naive Bayes", "Random Forest", "Logistic Regression", "Decision Tree",
                "K-Nearest Neighbors", "Support Vector Machine"] 
train1 = [nb_train_score, rf_train_score, lr_train_score, dt_train_score, knn_train_score, svc_train_score]
test1 = [nb_test_score, rf_test_score, lr_test_score, dt_test_score, knn_test_score, svc_test_score]
labels1 = ["TF-IDF", "TF-IDF", "TF-IDF", "TF-IDF", "TF-IDF", "TF-IDF"]

df1 = pd.DataFrame(list(zip(classifiers1, labels1, train1, test1)),
                   columns=['Classifier','Data', 'Train Score', 'Test Score'])

#create dataframe for TF-IDF with LSA
classifiers2 = [ "Random Forest", "Logistic Regression", "Decision Tree",
                "K-Nearest Neighbors", "Support Vector Machine"]
train2 = [rf_train_score2, lr_train_score2, dt_train_score2, knn_train_score2, svc_train_score2]
test2 =  [rf_test_score2, lr_test_score2, dt_test_score2, knn_test_score2, svc_test_score2]
labels2 = ["LSA", "LSA", "LSA", "LSA", "LSA"]

df2 = pd.DataFrame(list(zip(classifiers2, labels2, train2, test2)),
                   columns=['Classifier','Data', 'Train Score', 'Test Score'])

#create dataframe for combination of TF-IDF and Features
classifiers3 = [ "Random Forest", "Logistic Regression", "Decision Tree",
                "K-Nearest Neighbors"]
train3 = [rf_train_score3, lr_train_score3, dt_train_score3, knn_train_score3]
test3 =  [rf_test_score3, lr_test_score3, dt_test_score3, knn_test_score3]
labels3 = ["Combo", "Combo", "Combo", "Combo"]

df3 = pd.DataFrame(list(zip(classifiers3, labels3, train3, test3)),
                   columns=['Classifier','Data', 'Train Score', 'Test Score'])

#create dataframe for Features
classifiers4 = ["Random Forest", "Logistic Regression", "Decision Tree",
                "K-Nearest Neighbors"]
train4 = [rf_train_score4, lr_train_score4, dt_train_score4, knn_train_score4]
test4 =  [rf_test_score4, lr_test_score4, dt_test_score4, knn_test_score4]
labels4 = ["Features", "Features", "Features", "Features"]

df4 = pd.DataFrame(list(zip(classifiers4, labels4, train4, test4)),
                   columns=['Classifier','Data', 'Train Score', 'Test Score'])

<a id='test'></a>

### 7.1. Test Scores

Let's first take a look at the best test scores.

In [51]:
df_eval = pd.concat([df1, df2, df3, df4], axis=0)  #concatenate dataframes of scores
df_eval.sort_values(by=['Test Score'], ascending=False)  #sort by test scores

|   | Classifier | Data | Train Score | Test Score |
|---|---|---|---|---|
| 0 | Multinomial Naive Bayes | TF-IDF | 0.745389 | 0.624912 |
| 2 | Logistic Regression | TF-IDF | 0.734622 | 0.619155 |
| 1 | Logistic Regression | Combo | 0.735137 | 0.614099 |
| 1 | Random Forest | TF-IDF | 0.925616 | 0.585451 |
| 0 | Random Forest | Combo | 0.943311 | 0.582783 |
| 3 | Decision Tree | TF-IDF | 0.925616 | 0.560736 |
| 4 | K-Nearest Neighbors | TF-IDF | 0.650501 | 0.549782 |
| 2 | Decision Tree | Combo | 0.943311 | 0.547535 |
| 1 | Logistic Regression | Features | 0.52996 | 0.527594 |
| 5 | Support Vector Machine | TF-IDF | 0.525325 | 0.527033 |


Our top performing classifier was actually the first classifier that we ran: Multinomial Naive Bayes on TF-IDF, with a test score of about 0.625! As for the data, TF-IDF by itself easily had the highest scores. The next most successful dataset was the combination of TF-IDF and descriptive features, which has 2 of the top 5 scores. The Features-only dataset had the bottom three test scores, with the lowest being 0.5036. However, that doesn't say too much, seeing as 7 of the 19 models have test scores of about 50%! Let's move on to the training scores.
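
The per-dataset comparison can also be read off with a quick `groupby`. This is a minimal sketch that re-enters only the ten rows visible in the sorted output above, so the counts and averages describe those rows and not all 19 models; `top_rows` and `summary` are illustrative names. With the full `df_eval` in memory, the same `groupby` could be run on it directly.

```python
import pandas as pd

# The ten rows visible in the sorted output above, re-entered by hand
top_rows = pd.DataFrame(
    [
        ("Multinomial Naive Bayes", "TF-IDF", 0.745389, 0.624912),
        ("Logistic Regression", "TF-IDF", 0.734622, 0.619155),
        ("Logistic Regression", "Combo", 0.735137, 0.614099),
        ("Random Forest", "TF-IDF", 0.925616, 0.585451),
        ("Random Forest", "Combo", 0.943311, 0.582783),
        ("Decision Tree", "TF-IDF", 0.925616, 0.560736),
        ("K-Nearest Neighbors", "TF-IDF", 0.650501, 0.549782),
        ("Decision Tree", "Combo", 0.943311, 0.547535),
        ("Logistic Regression", "Features", 0.52996, 0.527594),
        ("Support Vector Machine", "TF-IDF", 0.525325, 0.527033),
    ],
    columns=["Classifier", "Data", "Train Score", "Test Score"],
)

# Best and average test score per dataset, plus how many of the
# visible rows each dataset contributes
summary = (
    top_rows.groupby("Data")["Test Score"]
    .agg(["max", "mean", "count"])
    .sort_values(by="max", ascending=False)
)
print(summary.round(4))
```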

<a id='train'></a>

### 7.2. Train Scores

In [52]:
df_eval.sort_values(by=['Train Score'], ascending = False) #sort by Train score

|   | Classifier | Data | Train Score | Test Score |
|---|---|---|---|---|
| 2 | Decision Tree | Combo | 0.943311 | 0.547535 |
| 0 | Random Forest | Combo | 0.943311 | 0.582783 |
| 3 | Decision Tree | TF-IDF | 0.925616 | 0.560736 |
| 1 | Random Forest | TF-IDF | 0.925616 | 0.585451 |
| 2 | Decision Tree | LSA | 0.9207 | 0.508777 |
| 0 | Random Forest | LSA | 0.920326 | 0.506109 |
| 2 | Decision Tree | Features | 0.808024 | 0.504564 |
| 0 | Random Forest | Features | 0.808024 | 0.503581 |
| 0 | Multinomial Naive Bayes | TF-IDF | 0.745389 | 0.624912 |
| 1 | Logistic Regression | Combo | 0.735137 | 0.614099 |


It's interesting that when you sort the data by training score, our dataframe looks a lot different than it did before. The top spot is a tie between Decision Tree and Random Forest on the combination data, both with training scores of 0.9433. There also isn't as clear a lead among the datasets we used; however, there is a pattern in which classifiers score highest. Decision Tree and Random Forest have nearly identical scores on the train sets and make up the top 8 scores. Those same models score only between 0.50 and 0.59 on the test set, so their high training scores point to over-fitting and not to better models. Our lowest training score is Support Vector Machine on the TF-IDF data.
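
To see the classifier pattern across datasets at a glance, the scores can be pivoted into a classifier-by-dataset grid. The sketch below again uses only the ten rows shown in the output above, entered by hand; `train_rows`, `train_pivot`, and `gap_pivot` are illustrative names, and `NaN` marks classifier and dataset pairs that are not among those ten rows.

```python
import pandas as pd

# The ten rows visible in the train-sorted output above, re-entered by hand
train_rows = pd.DataFrame(
    [
        ("Decision Tree", "Combo", 0.943311, 0.547535),
        ("Random Forest", "Combo", 0.943311, 0.582783),
        ("Decision Tree", "TF-IDF", 0.925616, 0.560736),
        ("Random Forest", "TF-IDF", 0.925616, 0.585451),
        ("Decision Tree", "LSA", 0.9207, 0.508777),
        ("Random Forest", "LSA", 0.920326, 0.506109),
        ("Decision Tree", "Features", 0.808024, 0.504564),
        ("Random Forest", "Features", 0.808024, 0.503581),
        ("Multinomial Naive Bayes", "TF-IDF", 0.745389, 0.624912),
        ("Logistic Regression", "Combo", 0.735137, 0.614099),
    ],
    columns=["Classifier", "Data", "Train Score", "Test Score"],
)

# One row per classifier, one column per dataset
train_pivot = train_rows.pivot(index="Classifier", columns="Data", values="Train Score")
print(train_pivot.round(4))

# Same layout for the difference between train and test scores
train_rows["Gap"] = train_rows["Train Score"] - train_rows["Test Score"]
gap_pivot = train_rows.pivot(index="Classifier", columns="Data", values="Gap")
print(gap_pivot.round(4))
```

In this layout the Decision Tree and Random Forest rows sit side by side with almost the same training score in every column, and the gap grid shows how little of that carries over to the test set.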

Well, that wraps up our classification notebook. Next up is deep learning! 