### Comparing Models and Vectorization Strategies for Text Classification

This try-it weighs the strengths and weaknesses of different estimators and vectorization strategies for a text classification problem.  To compare these components, use the `Pipeline` and `GridSearchCV` objects in scikit-learn to try different combinations of vectorizers and estimators.  For each combination, also use the `.cv_results_` attribute to examine how long the estimator takes to fit the data.
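
As a minimal sketch of that workflow, the block below runs a grid search over two vectorizers and reads the fit time and score of each candidate from `.cv_results_`. The `texts` and `labels` lists are a made-up toy corpus, not the humor dataset, and `cv=2` is chosen only because the corpus is tiny.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration only
texts = [
    "why did the chicken cross the road",
    "what do you call a fish with no eyes",
    "knock knock who is there",
    "a horse walks into a bar",
    "senate passes the annual budget bill",
    "stocks fall as markets react to rates",
    "city council approves new housing plan",
    "storm brings heavy rain to the coast",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Listing estimator objects under a step name swaps the whole step per candidate
param_grid = {"vectorizer": [CountVectorizer(), TfidfVectorizer()]}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)

# One row per candidate: which vectorizer, how long it took to fit, how it scored
timing = pd.DataFrame(search.cv_results_)[
    ["param_vectorizer", "mean_fit_time", "mean_test_score"]
]
print(timing)
```

On a corpus this small the timings are dominated by noise; the point is only where the numbers live.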

### The Data

The dataset below is from [kaggle]() and is named the "ColBert Dataset"; it was created for this [paper](https://arxiv.org/pdf/2004.12765.pdf).  Use the text column to classify whether or not the text is humorous.  The data is loaded and displayed below.

**Note:** The original dataset contains 200K rows of data. It is best to use the full dataset. If the original dataset is too large for your computer, please use 'dataset-minimal.csv', which has been reduced to 100K rows.
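
If even the reduced file is slow to work with, another option (not prescribed by the assignment) is to downsample while keeping the class proportions. The sketch below uses a made-up stand-in frame with the same column names; the `frac` value is an arbitrary choice.

```python
import pandas as pd

# Stand-in frame with the same columns as the humor data (values are made up)
toy = pd.DataFrame({
    "text": [f"document {i}" for i in range(10)],
    "humor": [True, False] * 5,
})

# Sample the same fraction from each class so the label balance is preserved
subset = toy.groupby("humor").sample(frac=0.4, random_state=42)

print(subset["humor"].value_counts())
```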

In [4]:
import pandas as pd

In [6]:
df = pd.read_csv('data/dataset.csv')

In [7]:
df.head()

,text,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False
2,What do you call a turtle without its shell? d...,True
3,5 reasons the 2016 election feels so personal,False
4,"Pasco police shot mexican migrant from behind,...",False


#### Task


**Text preprocessing:** As a pre-processing step, perform both `stemming` and `lemmatizing` to normalize your text before classifying. For each technique, use both the `CountVectorizer` and the `TfidfVectorizer`, and use the options for stop words and max features to prepare the text data for your estimator.

**Classification:** Once you have prepared the text data with the stemming and lemmatizing techniques, consider `LogisticRegression`, `DecisionTreeClassifier`, and `MultinomialNB` as classification algorithms for the data. Compare their performance in terms of accuracy and speed.
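
One way to get accuracy and speed side by side, sketched here on a made-up toy corpus, is `cross_validate`, which returns per-fold fit times along with test scores. The corpus, the `cv=2` setting, and the column names in the summary frame are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy corpus, made up for illustration only
texts = [
    "why did the chicken cross the road",
    "what do you call a fish with no eyes",
    "knock knock who is there",
    "a horse walks into a bar",
    "senate passes the annual budget bill",
    "stocks fall as markets react to rates",
    "city council approves new housing plan",
    "storm brings heavy rain to the coast",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

models = {
    "Logistic": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Bayes": MultinomialNB(),
}

rows = []
for name, model in models.items():
    # The default scorer for classifiers is accuracy
    cv = cross_validate(make_pipeline(CountVectorizer(), model), texts, labels, cv=2)
    rows.append({
        "model": name,
        "mean_accuracy": cv["test_score"].mean(),
        "mean_fit_time_s": cv["fit_time"].mean(),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison)
```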

Share your results as a table that lists, for the best version of each estimator, a dictionary of its best parameters and its best score.
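
A table of that shape can be assembled directly from fitted searches rather than typed by hand. The sketch below assumes a made-up toy corpus and a hypothetical grid that searches only `max_features`; it collects `best_params_` and `best_score_` for each estimator.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration only
texts = [
    "why did the chicken cross the road",
    "what do you call a fish with no eyes",
    "knock knock who is there",
    "a horse walks into a bar",
    "senate passes the annual budget bill",
    "stocks fall as markets react to rates",
    "city council approves new housing plan",
    "storm brings heavy rain to the coast",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

estimators = {
    "Logistic": LogisticRegression(max_iter=1000),
    "Bayes": MultinomialNB(),
}

# Hypothetical grid: only the vocabulary size is searched here
param_grid = {"vectorizer__max_features": [5, 20]}

rows = []
for name, estimator in estimators.items():
    pipe = Pipeline([("vectorizer", TfidfVectorizer()), ("classifier", estimator)])
    search = GridSearchCV(pipe, param_grid, cv=2).fit(texts, labels)
    rows.append({
        "model": name,
        "best_params": search.best_params_,
        "best_score": search.best_score_,
    })

table = pd.DataFrame(rows).set_index("model")
print(table)
```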

In [8]:
pd.DataFrame({'model': ['Logistic', 'Decision Tree', 'Bayes'], 
             'best_params': ['', '', ''],
             'best_score': ['', '', '']}).set_index('model')

model,best_params,best_score
Logistic,,
Decision Tree,,
Bayes,,


In [9]:
import nltk

In [69]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB


In [70]:
# Download necessary NLTK resources
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/datascience/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [85]:
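# Pre-processing functions
# As a pre-processing step, perform both stemming and lemmatizing to
# normalize your text before classifying.
def stemming(text):
    # Split a raw string into words; iterating over it directly would yield characters
    words = text.split() if isinstance(text, str) else text
    stemmer = PorterStemmer()
    return ' '.join([stemmer.stem(w) for w in words])

def lemmatizing(text):
    # Same handling as above: accept a raw string or a list of words
    words = text.split() if isinstance(text, str) else text
    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize(w) for w in words])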
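
A quick check of what word-level stemming produces, as a sketch on a made-up sentence and assuming simple whitespace tokenization. `WordNetLemmatizer` slots in the same way once the `wordnet` resource has been downloaded.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
sentence = "the dogs were running and jumping"

# Split into words first; iterating over the raw string would yield single characters
stemmed = " ".join(stemmer.stem(word) for word in sentence.split())
print(stemmed)
```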


In [108]:
# Split the data for Train and Test

X = df.drop('humor', axis = 1)
y = df['humor']
X_train, X_test, y_train, y_test = train_test_split(X['text'], y, random_state = 42)


In [109]:
# Vectorization
# For each technique use both the CountVectorizer and TfidfVectorizer
# and use options for stop words and max features to prepare the text
# data for your estimator

vectorizers = {
    'count': CountVectorizer(stop_words='english', max_features=5000),
    'tfidf': TfidfVectorizer(stop_words='english', max_features=5000)
}


In [148]:
# Classifiers
# Once you have prepared the text data with the stemming and lemmatizing
# techniques, consider LogisticRegression, DecisionTreeClassifier, and
# MultinomialNB as classification algorithms for the data.

classifiers = {
    'logistic': LogisticRegression(max_iter=1000),
    'tree' : DecisionTreeClassifier(),
    'naive_bayes': MultinomialNB()
}

for c in classifiers:
    print (c)

logistic
tree
naive_bayes


In [None]:
# Compare their performance in terms of accuracy and speed.
# Text-normalization functions to compare: stemming and lemmatizing
preprocessors = [
    ('stem', stemming),
    ('lemma', lemmatizing)
]

# Results storage
results = []
for process_name, process_func in preprocessors:
    # Normalize the training text once per technique, before vectorizing.
    # Each document is passed as a list of words.
    X_processed = [process_func(doc.split()) for doc in X_train]

    for clf_name, clf in classifiers.items():
        # Placeholder steps; param_grid replaces both during the search
        pipeline = Pipeline([
            ('vectorizer', CountVectorizer()),
            ('classifier', LogisticRegression())
        ])

        param_grid = {
            'vectorizer': [vectorizers['count'], vectorizers['tfidf']],
            'classifier': [clf]
        }

        grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1, return_train_score=True)
        grid_search.fit(X_processed, y_train)

        best_params = grid_search.best_params_
        best_score = grid_search.best_score_
        # Mean fit time of the best candidate, not the fastest one
        fit_time = grid_search.cv_results_['mean_fit_time'][grid_search.best_index_]

        results.append({
            'Model': clf_name,
            'Preprocessing': process_name,
            'Best Params': best_params,
            'Best Score': best_score,
            'Fit Time (s)': fit_time
        })

# Display results
results_df = pd.DataFrame(results)
pd.set_option('display.max_columns', 7)
print(results_df)



In [126]:
# Preprocessing options (defined here, but not applied in the searches below,
# which fit on the raw text)

preprocessing_steps = {
    'stem': stemming,
    'lemma': lemmatizing,
}

# Vectorizers
vectorizers = {
    'count': CountVectorizer(stop_words='english', max_features=5000),
    'tfidf': TfidfVectorizer(stop_words='english', max_features=5000),
}

# Classifiers
classifiers = {
    'logistic': LogisticRegression(max_iter=1000),
    'tree': DecisionTreeClassifier(),
    'naive_bayes': MultinomialNB(),
}

# Pipeline and GridSearch
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression())
])

param_grid = {
    'vectorizer': [vectorizers['count'], vectorizers['tfidf']],
    'classifier': [classifiers['tree']],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1, return_train_score=True)
grid_search.fit(X_train, y_train)

# Results
best_estimator = grid_search.best_estimator_
best_params = grid_search.best_params_
print('best_params : ' + str(best_params))
best_score = grid_search.best_score_

# Examining time to fit
cv_results = grid_search.cv_results_

# Extracting relevant information
results_table = {
    'Estimator': [str(best_estimator)],
    'Best Params': [best_params],
    'Best Score': [best_score]
    #'Fit Time (s)': [min(cv_results['mean_fit_time'])]
}

results_df = pd.DataFrame(results_table)
print(results_df)

Fitting 5 folds for each of 2 candidates, totalling 10 fits
best_params : {'classifier': DecisionTreeClassifier(), 'vectorizer': TfidfVectorizer(max_features=5000, stop_words='english')}
                                           Estimator  \
0  Pipeline(steps=[('vectorizer',\n              ...   

                                         Best Params  Best Score  
0  {'classifier': DecisionTreeClassifier(), 'vect...     0.82132  


In [127]:
# Pipeline and GridSearch for LogisticRegression
# The placeholder steps below are replaced by the estimators listed in param_grid
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', DecisionTreeClassifier())
])

param_grid = {
    'vectorizer': [vectorizers['count'], vectorizers['tfidf']],
    'classifier': [classifiers['logistic']],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1, return_train_score=True)
grid_search.fit(X_train, y_train)

# Results
best_estimator = grid_search.best_estimator_
best_params = grid_search.best_params_
print('best_params : ' + str(best_params))
best_score = grid_search.best_score_

# Examining time to fit
cv_results = grid_search.cv_results_

# Extracting relevant information
results_table = {
    'Estimator': [str(best_estimator)],
    'Best Params': [best_params],
    'Best Score': [best_score]
    #'Fit Time (s)': [min(cv_results['mean_fit_time'])]
}

results_df = pd.DataFrame(results_table)
print(results_df)

Fitting 5 folds for each of 2 candidates, totalling 10 fits
best_params : {'classifier': LogisticRegression(max_iter=1000), 'vectorizer': CountVectorizer(max_features=5000, stop_words='english')}
                                           Estimator  \
0  Pipeline(steps=[('vectorizer',\n              ...   

                                         Best Params  Best Score  
0  {'classifier': LogisticRegression(max_iter=100...    0.880573  


In [128]:
# Pipeline and GridSearch for MultinomialNB
# The placeholder steps below are replaced by the estimators listed in param_grid
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', DecisionTreeClassifier())
])

param_grid = {
    'vectorizer': [vectorizers['count'], vectorizers['tfidf']],
    'classifier': [classifiers['naive_bayes']],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1, return_train_score=True)
grid_search.fit(X_train, y_train)

# Results
best_estimator = grid_search.best_estimator_
best_params = grid_search.best_params_
print('best_params : ' + str(best_params))
best_score = grid_search.best_score_

# Examining time to fit
cv_results = grid_search.cv_results_

# Extracting relevant information
results_table = {
    'Estimator': [str(best_estimator)],
    'Best Params': [best_params],
    'Best Score': [best_score]
    #'Fit Time (s)': [min(cv_results['mean_fit_time'])]
}

results_df = pd.DataFrame(results_table)
print(results_df)

Fitting 5 folds for each of 2 candidates, totalling 10 fits
best_params : {'classifier': MultinomialNB(), 'vectorizer': CountVectorizer(max_features=5000, stop_words='english')}
                                           Estimator  \
0  Pipeline(steps=[('vectorizer',\n              ...   

                                         Best Params  Best Score  
0  {'classifier': MultinomialNB(), 'vectorizer': ...     0.87456  


In [130]:
best_results = pd.DataFrame({'model': ['Logistic', 'Decision Tree', 'Bayes'], 
             'best_params': ['{vectorizer: CountVectorizer(max_features=5000, stop_words=english)}', '{vectorizer: TfidfVectorizer(max_features=5000, stop_words=english)}', '{vectorizer: CountVectorizer(max_features=5000, stop_words=english)}'],
             'best_score': ['0.880573', '0.82132', '0.87456']}).set_index('model')

print(best_results)

                                                     best_params best_score
model                                                                      
Logistic       {vectorizer: CountVectorizer(max_features=5000...   0.880573
Decision Tree  {vectorizer: TfidfVectorizer(max_features=5000...    0.82132
Bayes          {vectorizer: CountVectorizer(max_features=5000...    0.87456
