### Comparing Models and Vectorization Strategies for Text Classification

This try-it focuses on weighing the positives and negatives of different estimators and vectorization strategies for a text classification problem.  In order to consider each of these components, you should make use of the `Pipeline` and `GridSearchCV` objects in scikitlearn to try different combinations of vectorizers with different estimators.  For each of these, you also want to use the `.cv_results_` to examine the time for the estimator to fit the data.

### The Data

The dataset below is from [kaggle]() and contains a dataset named the "ColBert Dataset" created for this [paper](https://arxiv.org/pdf/2004.12765.pdf).  You are to use the text column to classify whether or not the text was humorous.  It is loaded and displayed below.

**Note:** The original dataset contains 200K rows of data. It is best to try to use the full dtaset. If the original dataset is too large for your computer, please use the 'dataset-minimal.csv', which has been reduced to 100K.

In [26]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

import pandas as pd

#NLP libraies
import nltk
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer,TfidfTransformer
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

[nltk_data] Downloading package wordnet to /Users/mma0812/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/mma0812/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /Users/mma0812/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [138]:
import time
from datetime import datetime as dt
import numpy as np

In [27]:
df = pd.read_csv('text_data/dataset.csv')

In [28]:
df.head()

Unnamed: 0,text,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False
2,What do you call a turtle without its shell? d...,True
3,5 reasons the 2016 election feels so personal,False
4,"Pasco police shot mexican migrant from behind,...",False


#### Task


**Text preprocessing:** As a pre-processing step, perform both `stemming` and `lemmatizing` to normalize your text before classifying. For each technique use both the `CountVectorize`r and `TfidifVectorizer` and use options for stop words and max features to prepare the text data for your estimator.

**Classification:** Once you have prepared the text data with stemming lemmatizing techniques, consider `LogisticRegression`, `DecisionTreeClassifier`, and `MultinomialNB` as classification algorithms for the data. Compare their performance in terms of accuracy and speed.

Share the results of your best classifier in the form of a table with the best version of each estimator, a dictionary of the best parameters and the best score.

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    200000 non-null  object
 1   humor   200000 non-null  bool  
dtypes: bool(1), object(1)
memory usage: 1.7+ MB


### Preprocessing

#### Stemming

In [30]:
def stemmer(text):
    '''
    This function takes in a string of text and returns
    a list of stemmered text.
    
    Arguments
    ---------
    text: str
        string of text to be stemmered
        
    Returns
    -------
    str
       string of stemmered words from text input
    '''
    stem = PorterStemmer()
    return ' '.join([stem.stem(w) for w in word_tokenize(text)])

def lemmatiz(text):
    '''
    This function takes in a string of text and returns
    a list of lemmatized text.
    
    Arguments
    ---------
    text: str
        string of text to be lemmatized
        
    Returns
    -------
    str
       string of lemmatized words from text input
    '''
    lemma = WordNetLemmatizer()
    return ' '.join([lemma.lemmatize(w) for w in word_tokenize(text)]) # Tokenize and then lemmatize. 

In [67]:
stemmed_df = df.copy()
stemmed_df['text'] = df['text'].apply(stemmer)

In [68]:
stem_df = pd.DataFrame(stemmed_df)
stem_df.head()

Unnamed: 0,text,humor
0,"joe biden rule out 2020 bid : 'guy , i 'm not ...",False
1,watch : darvish gave hitter whiplash with slow...,False
2,what do you call a turtl without it shell ? de...,True
3,5 reason the 2016 elect feel so person,False
4,"pasco polic shot mexican migrant from behind ,...",False


In [93]:
lemma_df = df.copy()
lemma_df['text'] = df['text'].apply(lemmatiz)
lemma_df.head()

Unnamed: 0,text,humor
0,"Joe biden rule out 2020 bid : 'guys , i 'm not...",False
1,Watch : darvish gave hitter whiplash with slow...,False
2,What do you call a turtle without it shell ? d...,True
3,5 reason the 2016 election feel so personal,False
4,Pasco police shot mexican migrant from behind ...,False


In [35]:
X = df.drop('humor', axis = 1)
y = df['humor']

Unnamed: 0,text
0,"Joe biden rules out 2020 bid: 'guys, i'm not r..."
1,Watch: darvish gave hitter whiplash with slow ...
2,What do you call a turtle without its shell? d...
3,5 reasons the 2016 election feels so personal
4,"Pasco police shot mexican migrant from behind,..."
...,...
199995,Conor maynard seamlessly fits old-school r&b h...
199996,How to you make holy water? you boil the hell ...
199997,How many optometrists does it take to screw in...
199998,Mcdonald's will officially kick off all-day br...


## Multinomial Naive Bayes

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X['text'], y, test_size=0.25, random_state = 42)

### W/o Stemming and Lemmatization

In [45]:
cvect = CountVectorizer()
cvect.fit_transform(X_train)
pd.DataFrame(dtm.toarray(), columns = cvect.get_feature_names()).tail()

Unnamed: 0,00,000,0000251,0000ff,0001,000th,001,00100,002,00463,...,αστυνομίας,διαδηλωτών,κάιρο,και,μεταξύ,νεκρός,σε,στο,συγκρούσεις,ᵒᴥᵒᶅ
149995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
149996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
149997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
149998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
149999,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [57]:
pipe = Pipeline([('cvect', CountVectorizer()),
                ('lgr', MultinomialNB())])
pipe.fit(X_train, y_train)
test_acc = pipe.score(X_test, y_test)

In [58]:
params = {'cvect__max_features': [100, 300, 1000, 2000, 4000, 7000],
         'cvect__stop_words': ['english', None]}

In [59]:
grid = GridSearchCV(pipe, param_grid=params)
grid.fit(X_train, y_train)
acc = grid.score(X_test, y_test)

In [62]:
grid.best_params_, grid.best_score_

({'cvect__max_features': 7000, 'cvect__stop_words': None}, 0.9084533333333333)

### With Stemming

In [179]:
X_s = stem_df.drop('humor', axis = 1)
y_s = stem_df['humor']

In [180]:
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_s['text'], y_s, test_size=0.25, random_state = 42)

In [181]:
pipe_stem_countvec_mnb = Pipeline([('cvect', CountVectorizer()),
                ('lgr', MultinomialNB())])
pipe_stem_countvec_mnb.fit(X_train_s, y_train_s)
test_acc_s = pipe_stem_countvec_mnb.score(X_test_s, y_test_s)

In [182]:
start = dt.now()

grid_s = GridSearchCV(pipe_stem_countvec_mnb, param_grid=params)
grid_s.fit(X_train_s, y_train_s)

running_secs_nb = (dt.now() - start).seconds

acc_s = grid_s.score(X_test_s, y_test_s)

In [183]:
grid_s.best_params_, grid_s.best_score_,grid_s, running_secs_nb

({'cvect__max_features': 7000, 'cvect__stop_words': None},
 0.9063333333333332,
 GridSearchCV(estimator=Pipeline(steps=[('cvect', CountVectorizer()),
                                        ('lgr', MultinomialNB())]),
              param_grid={'cvect__max_features': [100, 300, 1000, 2000, 4000,
                                                  7000],
                          'cvect__stop_words': ['english', None]}),
 88)

### With Lemmatization

In [129]:
X_lr = lemma_df.drop('humor', axis = 1)
y_lr = lemma_df['humor']

In [130]:
X_train_lr, X_test_lr, y_train_lr, y_test_lr = train_test_split(X_lr['text'], y_lr, test_size=0.25, random_state = 42)

In [132]:
pipe_stem_tfidf_mnb = Pipeline([('tfidf', TfidfVectorizer()),
                ('mnb', MultinomialNB())])
pipe_stem_tfidf_mnb.fit(X_train_lr, y_train_lr)
test_acc_lr_mnb = pipe_stem_tfidf_mnb.score(X_test_lr, y_test_lr)

In [135]:
params_s_mnb = {'tfidf__max_features': [1000, 2000, 4000, 7000],
         'tfidf__stop_words': ['english', None]}

In [136]:
start = dt.now()
# process stuff
grid_tf_mnb = GridSearchCV(pipe_stem_tfidf_mnb, param_grid=params_s_mnb)
grid_tf_mnb.fit(X_train_lr, y_train_lr)
acc_tf_mnb = grid_tf_mnb.score(X_test_lr, y_test_lr)

running_secs_nb = (dt.now() - start).seconds

grid_tf_mnb.best_params_, grid_tf_mnb.best_score_, test_acc_lr_mnb, running_secs_nb

({'tfidf__max_features': 7000, 'tfidf__stop_words': None},
 0.9040333333333332,
 0.90494,
 0.912,
 62)

#### With MultiNominal Hyper Tuning

In [139]:
params_s_mnb1 = {'tfidf__max_features': [1000, 2000, 4000, 7000],
         'tfidf__stop_words': ['english', None], 
                'mnb__alpha': np.linspace(0.5, 1.5, 6),
                'mnb__fit_prior': [True, False],}

In [140]:
start = dt.now()
# process stuff
grid_tf_mnb = GridSearchCV(pipe_stem_tfidf_mnb, param_grid=params_s_mnb1)
grid_tf_mnb.fit(X_train_lr, y_train_lr)
acc_tf_mnb = grid_tf_mnb.score(X_test_lr, y_test_lr)

running_secs_nb = (dt.now() - start).seconds

grid_tf_mnb.best_params_, grid_tf_mnb.best_score_, acc_tf_mnb, test_acc_lr_mnb, running_secs_nb

({'mnb__alpha': 1.5,
  'mnb__fit_prior': False,
  'tfidf__max_features': 7000,
  'tfidf__stop_words': None},
 0.9041266666666667,
 0.90472,
 0.912,
 733)

## Logistic Regression

### With Stemming

In [173]:
pipe_stem_countvec_lr = Pipeline([('cvect', CountVectorizer()),
                ('lgr', LogisticRegression(max_iter=10000))])
pipe_stem_countvec_lr.fit(X_train_s, y_train_s)
test_acc_s_lr = pipe_stem_countvec_lr.score(X_test_s, y_test_s)

In [174]:
params_s_lr = {'cvect__max_features': [100, 300, 1000, 2000, 4000, 7000],
         'cvect__stop_words': ['english', None]}

In [175]:
start = dt.now()

grid_s_lr = GridSearchCV(pipe_stem_countvec_lr, param_grid=params_s_lr)
grid_s_lr.fit(X_train_s, y_train_s)

running_secs = (dt.now() - start).seconds

acc_s_lr = grid_s.score(X_test_s, y_test_s)

In [176]:
grid_s_lr.best_params_, grid_s_lr.best_score_, running_secs
#,grid_s.cv_results_

({'cvect__max_features': 7000, 'cvect__stop_words': None},
 0.9230599999999999,
 153)

### With lemmitazation

In [98]:
X_lr = lemma_df.drop('humor', axis = 1)
y_lr = lemma_df['humor']

In [99]:
X_train_lr, X_test_lr, y_train_lr, y_test_lr = train_test_split(X_lr['text'], y_lr, test_size=0.25, random_state = 42)

In [103]:
pipe_stem_tfidf_lr = Pipeline([('tfidf', TfidfVectorizer()),
                ('lgr', LogisticRegression(max_iter=10000))])
pipe_stem_tfidf_lr.fit(X_train_lr, y_train_lr)
test_acc_s_lr = pipe_stem_tfidf_lr.score(X_test_lr, y_test_lr)

In [105]:
params_s_lr = {'tfidf__max_features': [100, 300, 1000, 2000, 4000, 7000],
         'tfidf__stop_words': ['english', None]}

In [110]:
start = dt.now()
# process stuff
grid_tf_lr = GridSearchCV(pipe_stem_tfidf_lr, param_grid=params_s_lr)
grid_tf_lr.fit(X_train_lr, y_train_lr)
acc_tf_lr = grid_tf_lr.score(X_test_lr, y_test_lr)

running_secs = (dt.now() - start).seconds

grid_tf_lr.best_params_, grid_tf_lr.best_score_, acc_tf_lr, running_secs

({'tfidf__max_features': 7000, 'tfidf__stop_words': None},
 0.9182733333333333,
 0.9214,
 132)

## Decision Tree

### With Stemmer 

In [141]:
pipe_stem_countvec_dt = Pipeline([('cvect', CountVectorizer()),
                ('dt', DecisionTreeClassifier())])
pipe_stem_countvec_dt.fit(X_train_s, y_train_s)
test_acc_s = pipe_stem_countvec_dt.score(X_test_s, y_test_s)

In [142]:
params_s_dt = {'cvect__max_features': [100, 500, 1000, 2000],
         'cvect__stop_words': ['english', None],
              'dt__criterion': ['gini', 'entropy']}

In [143]:
start = dt.now()

grid_s_dt = GridSearchCV(pipe_stem_countvec_dt, param_grid=params_s_dt, cv=5)
grid_s_dt.fit(X_train_s, y_train_s)
acc_s = grid_s_dt.score(X_test_s, y_test_s)

running_secs_dt = (dt.now() - start).seconds

grid_s_dt.best_params_, grid_s_dt.best_score_,grid_s_dt.cv_results_, running_secs_dt

({'cvect__max_features': 2000,
  'cvect__stop_words': None,
  'dt__criterion': 'entropy'},
 0.8654333333333334,
 {'mean_fit_time': array([ 3.61378303,  3.40246325,  8.97969937,  8.687149  , 13.92021894,
         13.91123772, 18.56151605, 17.50884995, 26.96916919, 24.99748368,
         21.57136016, 19.85271945, 21.35472646, 21.94333692, 25.96873889,
         20.94574633]),
  'std_fit_time': array([0.37074641, 0.11862086, 0.06792432, 0.1619717 , 0.16884839,
         0.25712764, 0.37489712, 1.23526311, 6.4511432 , 5.85126921,
         0.82489619, 0.61415141, 0.69251915, 1.1851206 , 0.64540662,
         0.66847812]),
  'mean_score_time': array([0.27465858, 0.26676159, 0.24453754, 0.24960918, 0.27030754,
         0.27179928, 0.2670742 , 0.26704717, 0.57527714, 0.37485366,
         0.28543701, 0.28565273, 0.27334518, 0.29322724, 0.2906312 ,
         0.28169866]),
  'std_score_time': array([0.06822856, 0.04402015, 0.00261507, 0.00969568, 0.00961976,
         0.01002208, 0.00352887, 0.00454542

### With Lemmatization

In [119]:
pipe_stem_tfidf_dt = Pipeline([('tfidf', TfidfVectorizer()),
                ('dt', DecisionTreeClassifier())])
pipe_stem_tfidf_dt.fit(X_train_lr, y_train_lr)
test_acc_s_lr = pipe_stem_tfidf_dt.score(X_test_lr, y_test_lr)

In [127]:
params_s_lr = {'tfidf__max_features': [2000, 4000, 7000],
         'tfidf__stop_words': ['english', None],
              'dt__criterion': ['gini', 'entropy'],
              'dt__max_depth': [2, 3, 5, 10, 20]}

In [128]:
start = dt.now()

grid_tf_dt = GridSearchCV(pipe_stem_tfidf_dt, param_grid=params_s_lr)
grid_tf_dt.fit(X_train_s, y_train_s)
acc_s = grid_tf_dt.score(X_test_s, y_test_s)

running_secs_dt = (dt.now() - start).seconds

grid_tf_dt.best_params_, grid_tf_dt.best_score_,grid_tf_dt.cv_results_, running_secs_dt

({'dt__criterion': 'gini',
  'dt__max_depth': 6,
  'tfidf__max_features': 7000,
  'tfidf__stop_words': None},
 0.7850400000000001,
 {'mean_fit_time': array([1.40153594, 1.76983294, 1.41835399, 1.87697358, 1.62249241,
         2.19045448, 1.67549906, 2.16936975, 1.74417038, 2.16049771,
         1.73502588, 2.37050366]),
  'std_fit_time': array([0.04386807, 0.01943262, 0.0130557 , 0.05669669, 0.11597427,
         0.07174355, 0.07480921, 0.0902962 , 0.07710206, 0.06695675,
         0.07785028, 0.08561773]),
  'mean_score_time': array([0.2611074 , 0.27077603, 0.26168928, 0.28648162, 0.28124671,
         0.32496138, 0.28095536, 0.30988789, 0.28768764, 0.29783177,
         0.30477991, 0.32540088]),
  'std_score_time': array([0.01183704, 0.00404138, 0.01185791, 0.0078617 , 0.01067146,
         0.01657743, 0.01434522, 0.01629979, 0.0204475 , 0.02024357,
         0.01733746, 0.00948767]),
  'param_dt__criterion': masked_array(data=['gini', 'gini', 'gini', 'gini', 'gini', 'gini',
               

In [None]:
#Calculate scores 
    # Training and test mean accuracy
    #train_error = np.round(decision_tree.score(train_features, train_targets), 2)
    #test_error = np.round(decision_tree.score(test_features, test_targets), 2)

In [184]:
pd.DataFrame({'model': ['Logistic_stem', 'Logistic_lemm','DecisionTree_stem', 'DecisionTree_lemm',
                        'Bayes_stem', 'Bayes_lemm'], 
             'best_params': ['cvect__max_feature: 7000,cvect__stop_words:None',
                             'tfidf__max_features: 7000,tfidf__stop_words: None',
                             'cvect__max_features: 2000,dt__criterion: entropy',
                             'criterion: gini,max_depth:6,max_features:7000',
                             'cvect__max_features: 7000,cvect__stop_words: None',
                             'tfidf__max_features: 7000,tfidf__stop_words: None'],
              
             'best_score': ['0.92305', '0.91827', '0.86543', '0.78504', '0.90633', '0.90472'],
             'test_time_sec': ['153','132','1471','131','88','62']}).set_index('model')

Unnamed: 0_level_0,best_params,best_score,test_time_sec
model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Logistic_stem,"cvect__max_feature: 7000,cvect__stop_words:None",0.92305,153
Logistic_lemm,"tfidf__max_features: 7000,tfidf__stop_words: None",0.91827,132
DecisionTree_stem,"cvect__max_features: 2000,dt__criterion: entropy",0.86543,1471
DecisionTree_lemm,"criterion: gini,max_depth:6,max_features:7000",0.78504,131
Bayes_stem,"cvect__max_features: 7000,cvect__stop_words: None",0.90633,88
Bayes_lemm,"tfidf__max_features: 7000,tfidf__stop_words: None",0.90472,62
