### Comparing Models and Vectorization Strategies for Text Classification

This try-it weighs the strengths and weaknesses of different estimators and vectorization strategies for a text classification problem. To compare these components, use the `Pipeline` and `GridSearchCV` objects in scikit-learn to try different combinations of vectorizers and estimators. For each combination, also use the `.cv_results_` attribute to examine how long the estimator takes to fit the data.
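As a minimal sketch of reading fit times from `.cv_results_`, the example below runs a small grid search and pulls out the timing columns. The toy corpus and labels are hypothetical and exist only to keep the snippet self-contained; the timings on the real dataset will be very different.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, only to keep the sketch self-contained
texts = [
    "why did the chicken cross the road",
    "senate passes new budget bill",
    "what do you call a fish with no eyes",
    "stocks fall after earnings report",
    "knock knock who is there",
    "city council approves transit plan",
    "a horse walks into a bar",
    "storm expected to hit the coast",
]
labels = [True, False, True, False, True, False, True, False]

pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])
param_grid = {"classifier__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)

# cv_results_ is a dict of arrays with one entry per candidate;
# mean_fit_time is the average fit time in seconds across folds
timing = pd.DataFrame(search.cv_results_)[
    ["param_classifier__C", "mean_fit_time", "mean_score_time", "mean_test_score"]
]
print(timing)
```

Sorting this frame by `mean_fit_time` next to `mean_test_score` gives a per-candidate view of the accuracy and speed trade-off, which is finer-grained than timing the whole search.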

### The Data

The data below is from [kaggle]() and is the "ColBert Dataset" created for this [paper](https://arxiv.org/pdf/2004.12765.pdf). Use the text column to classify whether or not the text is humorous. The data is loaded and displayed below.

**Note:** The original dataset contains 200K rows of data. It is best to use the full dataset. If it is too large for your computer, use 'dataset-minimal.csv', which has been reduced to 100K rows.

In [2]:
import nltk
import pandas as pd
import vectorizers

In [3]:
df = pd.read_csv('text_data/dataset.csv')

In [4]:
df.head()

Unnamed: 0,text,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False
2,What do you call a turtle without its shell? d...,True
3,5 reasons the 2016 election feels so personal,False
4,"Pasco police shot mexican migrant from behind,...",False


In [5]:
# Download required NLTK data
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/george.li/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/george.li/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/george.li/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
from nltk import PorterStemmer, WordNetLemmatizer

# Get English stop words
stop_words = set(nltk.corpus.stopwords.words('english'))

# Create the normalizers once instead of on every call
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Preprocess function: lowercase, tokenize, drop stop words, then stem or lemmatize
def preprocess(text, method='stem'):
    tokens = nltk.word_tokenize(text.lower())
    tokens = [token for token in tokens if token not in stop_words]
    if method == 'stem':
        return ' '.join(stemmer.stem(token) for token in tokens)
    elif method == 'lemma':
        return ' '.join(lemmatizer.lemmatize(token) for token in tokens)
    # Fail loudly instead of silently returning None for an unknown method
    raise ValueError(f"Unknown method: {method!r}")
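To make the stemming branch concrete, here is a small sketch that needs no NLTK downloads. It assumes a hypothetical mini stop-word set and plain whitespace tokenization in place of `nltk.word_tokenize`, so it is a simplification of the notebook's function rather than a replacement for it.

```python
from nltk.stem import PorterStemmer

# Hypothetical mini stop-word list so the sketch needs no NLTK downloads
toy_stop_words = {"the", "a", "is", "are"}

stemmer = PorterStemmer()

def stem_text(text):
    # Whitespace tokenization stands in for nltk.word_tokenize here
    tokens = [t for t in text.lower().split() if t not in toy_stop_words]
    return " ".join(stemmer.stem(t) for t in tokens)

print(stem_text("The dogs are running"))
```

Stemming strips suffixes by rule and can produce tokens that are not dictionary words, while lemmatization maps tokens to dictionary forms, which is one reason the two columns are kept separate and compared.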

In [7]:
from sklearn.model_selection import train_test_split

# Apply preprocessing
df['stemmed'] = df['text'].apply(lambda x: preprocess(x, 'stem'))
df['lemmatized'] = df['text'].apply(lambda x: preprocess(x, 'lemma'))

# Split the data
X_stem = df['stemmed']
X_lemma = df['lemmatized']
y = df['humor']
X_stem_train, X_stem_test, X_lemma_train, X_lemma_test, y_train, y_test = train_test_split(X_stem, X_lemma, y, test_size=0.2, random_state=42)
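Passing several arrays to `train_test_split` in one call splits them on the same rows, which is what keeps the stemmed text, the lemmatized text, and the labels aligned. A minimal sketch with hypothetical stand-in series:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the stemmed and lemmatized columns
X_a = pd.Series([f"stem {i}" for i in range(10)])
X_b = pd.Series([f"lemma {i}" for i in range(10)])
y = pd.Series([i % 2 == 0 for i in range(10)])

X_a_train, X_a_test, X_b_train, X_b_test, y_train, y_test = train_test_split(
    X_a, X_b, y, test_size=0.2, random_state=42
)

# The same row indices land in the same partition for every input
assert X_a_train.index.equals(X_b_train.index)
assert X_a_train.index.equals(y_train.index)
print(len(X_a_train), len(X_a_test))  # prints: 8 2
```

Splitting the two text columns in separate calls would also work with the same `random_state`, but a single call makes the alignment explicit.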

In [8]:
import time
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Define pipelines and parameter grids for each model
pipelines = {
    'LogisticRegression': Pipeline([
        ('vectorizer', TfidfVectorizer()),
        ('classifier', LogisticRegression())
    ]),
    'DecisionTreeClassifier': Pipeline([
        ('vectorizer', TfidfVectorizer()),
        ('classifier', DecisionTreeClassifier())
    ]),
    'MultinomialNB': Pipeline([
        ('vectorizer', TfidfVectorizer()),
        ('classifier', MultinomialNB())
    ])
}

param_grids = {
    'LogisticRegression': {
        'vectorizer__max_features': [1000, 2000],
        'vectorizer__ngram_range': [(1, 1), (1, 2)],
        'classifier__C': [0.1, 1, 10],
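        # Note: 'l1' is not supported by LogisticRegression's default lbfgs solver, so those
        # candidates fail to fit; use solver='liblinear' or solver='saga' to search over 'l1'
        'classifier__penalty': ['l1', 'l2']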
    },
    'DecisionTreeClassifier': {
        'vectorizer__max_features': [1000, 2000],
        'vectorizer__ngram_range': [(1, 1), (1, 2)],
        'classifier__max_depth': [5, 10, None],
        'classifier__min_samples_split': [2, 5, 10]
    },
    'MultinomialNB': {
        'vectorizer__max_features': [1000, 2000],
        'vectorizer__ngram_range': [(1, 1), (1, 2)],
        'classifier__alpha': [0.1, 0.5, 1.0]
    }
}

# Function to perform grid search
def grid_search(X_train, y_train, pipeline, param_grid):
    # Use a distinct local name so the function name is not shadowed
    search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_, search.best_score_

# Perform grid search for each preprocessing method and model
results = []
for prep_method, X_train in [('Stemming', X_stem_train), ('Lemmatization', X_lemma_train)]:
    for model_name, pipeline in pipelines.items():
        print(f"\nPerforming grid search for {prep_method} + {model_name}")
        start_time = time.time()
        best_estimator, best_params, best_score = grid_search(X_train, y_train, pipeline, param_grids[model_name])
        duration = time.time() - start_time
        
        results.append({
            'Preprocessing': prep_method,
            'Model': model_name,
            'Best Score': best_score,
            'Best Parameters': best_params,
            'Duration': duration
        })

# Convert results to DataFrame and sort
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('Best Score', ascending=False)

# Print results
print("\nGrid Search Results:")
print(results_df.to_string(index=False))

# Get the best overall model
best_result = results_df.iloc[0]
best_prep = best_result['Preprocessing']
best_model = best_result['Model']
best_params = best_result['Best Parameters']

print(f"\nBest overall model: {best_prep} + {best_model}")
print(f"Best parameters: {best_params}")

# Train the best model on the full training set and evaluate on test set
X_train = X_stem_train if best_prep == 'Stemming' else X_lemma_train
X_test = X_stem_test if best_prep == 'Stemming' else X_lemma_test

best_pipeline = pipelines[best_model]
best_pipeline.set_params(**best_params)
best_pipeline.fit(X_train, y_train)

y_pred = best_pipeline.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print(f"\nTest set accuracy of best model: {test_accuracy:.4f}")

# Print top features for the best model
if best_model in ['LogisticRegression', 'DecisionTreeClassifier']:
    vectorizer = best_pipeline.named_steps['vectorizer']
    classifier = best_pipeline.named_steps['classifier']
    feature_names = vectorizer.get_feature_names_out()
    
    if best_model == 'LogisticRegression':
        importances = classifier.coef_[0]
    else:  # DecisionTreeClassifier
        importances = classifier.feature_importances_
    
    top_features = sorted(zip(feature_names, importances), key=lambda x: abs(x[1]), reverse=True)[:10]
    print(f"\nTop 10 features for {best_prep} + {best_model}:")
    for feature, importance in top_features:
        print(f"{feature}: {importance}")
else:
    print(f"\nFeature importance not available for {best_model}")

# Print scikit-learn version
import sklearn
print(f"\nscikit-learn version: {sklearn.__version__}")



Performing grid search for Stemming + LogisticRegression
Fitting 5 folds for each of 24 candidates, totalling 120 fits


60 fits failed out of a total of 120.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
60 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/george.li/Documents/code/homework/uci-classifiers/venv/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/george.li/Documents/code/homework/uci-classifiers/venv/lib/python3.12/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/george.li/Documents/code/homework/uci-classifiers/venv/lib/python3.12/site-packages/sklearn/pipeline.py", line 47


Performing grid search for Stemming + DecisionTreeClassifier
Fitting 5 folds for each of 36 candidates, totalling 180 fits

Performing grid search for Stemming + MultinomialNB
Fitting 5 folds for each of 12 candidates, totalling 60 fits

Performing grid search for Lemmatization + LogisticRegression
Fitting 5 folds for each of 24 candidates, totalling 120 fits


60 fits failed out of a total of 120.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
60 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/george.li/Documents/code/homework/uci-classifiers/venv/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/george.li/Documents/code/homework/uci-classifiers/venv/lib/python3.12/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/george.li/Documents/code/homework/uci-classifiers/venv/lib/python3.12/site-packages/sklearn/pipeline.py", line 47


Performing grid search for Lemmatization + DecisionTreeClassifier
Fitting 5 folds for each of 36 candidates, totalling 180 fits

Performing grid search for Lemmatization + MultinomialNB
Fitting 5 folds for each of 12 candidates, totalling 60 fits





Grid Search Results:
Preprocessing                  Model  Best Score                                                                                                                          Best Parameters   Duration
Lemmatization     LogisticRegression    0.862069                   {'classifier__C': 1, 'classifier__penalty': 'l2', 'vectorizer__max_features': 2000, 'vectorizer__ngram_range': (1, 2)}  35.754441
     Stemming     LogisticRegression    0.862056                  {'classifier__C': 10, 'classifier__penalty': 'l2', 'vectorizer__max_features': 2000, 'vectorizer__ngram_range': (1, 2)}  37.277516
Lemmatization          MultinomialNB    0.853237                                          {'classifier__alpha': 1.0, 'vectorizer__max_features': 2000, 'vectorizer__ngram_range': (1, 1)}  18.117764
     Stemming          MultinomialNB    0.852831                                          {'classifier__alpha': 1.0, 'vectorizer__max_features': 2000, 'vectorizer__ngram_range': (1, 1)}  16.
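
The traceback in the output above is cut off, so the exact error is not shown. The 60 failed fits do match the 12 `LogisticRegression` candidates with `penalty='l1'` times 5 folds, and the default `lbfgs` solver does not support an `l1` penalty. A minimal sketch, assuming that is the cause, switches to `solver='liblinear'`, which accepts both `l1` and `l2`. The toy data is hypothetical, and scores from a different solver would not match the table above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, only to keep the sketch self-contained
texts = [
    "why did the chicken cross the road",
    "senate passes new budget bill",
    "what do you call a fish with no eyes",
    "stocks fall after earnings report",
    "knock knock who is there",
    "city council approves transit plan",
    "a horse walks into a bar",
    "storm expected to hit the coast",
]
labels = [True, False, True, False, True, False, True, False]

pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    # liblinear supports both 'l1' and 'l2' penalties
    ("classifier", LogisticRegression(solver="liblinear")),
])
param_grid = {
    "classifier__C": [0.1, 1, 10],
    "classifier__penalty": ["l1", "l2"],
}

# error_score='raise' surfaces any failing candidate instead of recording NaN
search = GridSearchCV(pipe, param_grid, cv=2, error_score="raise")
search.fit(texts, labels)
print(len(search.cv_results_["params"]), "candidates fitted without errors")
```

Setting `error_score='raise'` during development is also the debugging step the warning itself suggests.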

#### Task


**Text preprocessing:** As a pre-processing step, perform both `stemming` and `lemmatizing` to normalize your text before classifying. For each technique, use both the `CountVectorizer` and `TfidfVectorizer`, and use the options for stop words and max features to prepare the text data for your estimator.
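
The pipelines in the cell above use only `TfidfVectorizer`. One way to cover both vectorizers in a single search is to treat the pipeline step itself as a grid parameter. This is a sketch on hypothetical toy data, not the notebook's actual run:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, only to keep the sketch self-contained
texts = [
    "why did the chicken cross the road",
    "senate passes new budget bill",
    "what do you call a fish with no eyes",
    "stocks fall after earnings report",
    "knock knock who is there",
    "city council approves transit plan",
    "a horse walks into a bar",
    "storm expected to hit the coast",
]
labels = [True, False, True, False, True, False, True, False]

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
param_grid = {
    # Swapping the whole step lets the search compare both vectorizers
    "vectorizer": [CountVectorizer(), TfidfVectorizer()],
    "vectorizer__stop_words": [None, "english"],
    "vectorizer__max_features": [1000, 2000],
}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)
print(type(search.best_params_["vectorizer"]).__name__)
```

Both vectorizers accept `stop_words` and `max_features`, so the nested parameters apply to whichever one is active for a given candidate.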

**Classification:** Once you have prepared the text data with the stemming and lemmatizing techniques, consider `LogisticRegression`, `DecisionTreeClassifier`, and `MultinomialNB` as classification algorithms for the data. Compare their performance in terms of accuracy and speed.

Share your results as a table listing the best version of each estimator, a dictionary of its best parameters, and its best score.

In [9]:
results_df

Unnamed: 0,Preprocessing,Model,Best Score,Best Parameters,Duration
3,Lemmatization,LogisticRegression,0.862069,"{'classifier__C': 1, 'classifier__penalty': 'l...",35.754441
0,Stemming,LogisticRegression,0.862056,"{'classifier__C': 10, 'classifier__penalty': '...",37.277516
5,Lemmatization,MultinomialNB,0.853237,"{'classifier__alpha': 1.0, 'vectorizer__max_fe...",18.117764
2,Stemming,MultinomialNB,0.852831,"{'classifier__alpha': 1.0, 'vectorizer__max_fe...",16.710195
4,Lemmatization,DecisionTreeClassifier,0.797962,"{'classifier__max_depth': None, 'classifier__m...",217.476588
1,Stemming,DecisionTreeClassifier,0.794512,"{'classifier__max_depth': None, 'classifier__m...",229.106681


In [12]:
# Summary table: best version of each estimator (scores stored as floats, not strings)
pd.DataFrame({'model': ['Logistic', 'Bayes', 'Decision Tree'],
              'best_params': ["{'classifier__C': 1, 'classifier__penalty': 'l2', 'vectorizer__max_features': 2000, 'vectorizer__ngram_range': (1, 2)}", "{'classifier__alpha': 1.0, 'vectorizer__max_features': 2000, 'vectorizer__ngram_range': (1, 1)}", "{'classifier__max_depth': None, 'classifier__min_samples_split': 5, 'vectorizer__max_features': 2000, 'vectorizer__ngram_range': (1, 2)}"],
              'best_score': [0.862069, 0.853237, 0.797962]}).set_index('model')

model,best_params,best_score
Logistic,"{'classifier__C': 1, 'classifier__penalty': 'l...",0.862069
Bayes,"{'classifier__alpha': 1.0, 'vectorizer__max_fe...",0.853237
Decision Tree,"{'classifier__max_depth': None, 'classifier__m...",0.797962
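
The summary table can also be derived from the search results instead of being typed by hand. The sketch below rebuilds a reduced `results_df` from the scores and durations shown in the output above (the parameter dictionaries are omitted for brevity) and keeps the best row per model:

```python
import pandas as pd

# Values copied from the results_df output above; 'Best Parameters' omitted for brevity
results_df = pd.DataFrame({
    "Preprocessing": ["Lemmatization", "Stemming", "Lemmatization",
                      "Stemming", "Lemmatization", "Stemming"],
    "Model": ["LogisticRegression", "LogisticRegression", "MultinomialNB",
              "MultinomialNB", "DecisionTreeClassifier", "DecisionTreeClassifier"],
    "Best Score": [0.862069, 0.862056, 0.853237, 0.852831, 0.797962, 0.794512],
    "Duration": [35.754441, 37.277516, 18.117764, 16.710195, 217.476588, 229.106681],
})

# Keep the highest-scoring row for each model
best_idx = results_df.groupby("Model")["Best Score"].idxmax()
summary = (
    results_df.loc[best_idx]
    .sort_values("Best Score", ascending=False)
    .set_index("Model")
)
print(summary)
```

Keeping `Duration` in the summary puts the speed comparison next to the accuracy comparison. In this run the decision tree searches took the longest and scored the lowest.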
