## Comparing Models and Vectorization Strategies for Text Classification

### The Objective 

The objective of this project is to preprocess and analyze text data using various techniques to improve classification performance. We will apply text preprocessing methods, including stemming and lemmatizing, and utilize CountVectorizer and TfidfVectorizer for feature extraction. We will then evaluate the performance of different classification algorithms, namely Logistic Regression, Decision Tree, and Multinomial Naive Bayes, by comparing their accuracy and computational efficiency. The final goal is to identify the best-performing model and present the results, including the best parameters and scores, in a clear and comprehensive format.

### Project Outline

1. Import required libraries
2. Connect to the data source and explore it 
3. Text preprocessing: Stemming, Lemmatizing, CountVectorizer, and TfidfVectorizer
4. Classification: LogisticRegression, DecisionTreeClassifier, and MultinomialNB
5. Performance analysis: accuracy and speed
6. Conclusion summary

### The Data

The data comes from [Kaggle]() and is the "ColBert Dataset" created for this [paper](https://arxiv.org/pdf/2004.12765.pdf). The project uses the `text` column to classify whether each text is humorous, with the `humor` column as the label.

### Libraries

In [24]:
# Import required libraries
import pandas as pd
import numpy as np
import time  # Used to measure training time
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
import nltk

# Download the 'omw-1.4' resource
nltk.download('omw-1.4')

# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\agnek\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\agnek\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\agnek\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
#Connect to the data source
df = pd.read_csv(r'C:\Users\agnek\OneDrive\Documents\Data\dataset.csv')

In [4]:
#Review data
df.head()

Unnamed: 0,text,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False
2,What do you call a turtle without its shell? d...,True
3,5 reasons the 2016 election feels so personal,False
4,"Pasco police shot mexican migrant from behind,...",False


In [11]:
# Define the stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def stemmed_tokenizer(text):
    tokens = text.split()
    return [stemmer.stem(token) for token in tokens]

def lemmatized_tokenizer(text):
    tokens = text.split()
    return [lemmatizer.lemmatize(token) for token in tokens]

In [12]:
# Prepare the features and target variable
X = df['text']
y = df['humor'] 

# Encode the target variable
le = LabelEncoder()
y_encoded = le.fit_transform(y)

In [13]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)

In [14]:
# Define vectorizers with stop words and max features
count_vectorizer = CountVectorizer(stop_words='english', max_features=5000)
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

In [25]:
# Define models
models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'DecisionTreeClassifier': DecisionTreeClassifier(),
    'MultinomialNB': MultinomialNB()
}

# Initialize results list
results = []

# Perform classification with different vectorizers and preprocessing techniques
for vectorizer, vec_name in zip([count_vectorizer, tfidf_vectorizer], ['CountVectorizer', 'TfidfVectorizer']):
    for stemmer_func, stemmer_name in zip([stemmed_tokenizer, lemmatized_tokenizer], ['Stemming', 'Lemmatizing']):
        print(f'\nUsing {vec_name} with {stemmer_name}:\n')
        
        # Transform text data
        X_train_vec = vectorizer.fit_transform(X_train.apply(lambda x: ' '.join(stemmer_func(x))))
        X_test_vec = vectorizer.transform(X_test.apply(lambda x: ' '.join(stemmer_func(x))))
        
        for model_name, model in models.items():
            print(f'\nTraining {model_name}:\n')
            
            # Measure training time
            start_time = time.time()
            model.fit(X_train_vec, y_train)
            end_time = time.time()
            
            # Predict and evaluate
            y_pred = model.predict(X_test_vec)
            accuracy = accuracy_score(y_test, y_pred)
            report = classification_report(y_test, y_pred, output_dict=True)
            training_time = end_time - start_time
            
            # Store results
            results.append({
                'Vectorizer': vec_name,
                'Stemming/Lemmatizing': stemmer_name,
                'Model': model_name,
                'Accuracy': accuracy,
                'Training Time (s)': training_time,
                'Classification Report': report
            })
            
            print(f'Accuracy: {accuracy}')
            print(f'Training Time: {training_time:.2f} seconds')
            print(f'Classification Report:\n{classification_report(y_test, y_pred)}')

# Convert results to DataFrame
results_df = pd.DataFrame(results)

# Best accuracy per (vectorizer, preprocessing, model) combination;
# each combination is run once, so this keeps every row
best_results = results_df.loc[results_df.groupby(['Vectorizer', 'Stemming/Lemmatizing', 'Model'])['Accuracy'].idxmax()]

# Display the best results
print("\nBest Results by Vectorizer, Stemming/Lemmatizing, and Model:")
print(best_results[['Vectorizer', 'Stemming/Lemmatizing', 'Model', 'Accuracy', 'Training Time (s)']])

# Create a summary table of the best classifiers
summary_table = best_results.loc[best_results.groupby('Model')['Accuracy'].idxmax()].set_index('Model')

# Display summary table
print("\nSummary of Best Classifiers:")
print(summary_table[['Vectorizer', 'Stemming/Lemmatizing', 'Accuracy', 'Training Time (s)']])

# Optionally, save results to CSV
results_df.to_csv('model_comparison_results.csv', index=False)


Using CountVectorizer with Stemming:


Training LogisticRegression:

Accuracy: 0.8878
Training Time: 1.80 seconds
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.88      0.89     29909
           1       0.88      0.89      0.89     30091

    accuracy                           0.89     60000
   macro avg       0.89      0.89      0.89     60000
weighted avg       0.89      0.89      0.89     60000


Training DecisionTreeClassifier:

Accuracy: 0.81685
Training Time: 36.64 seconds
Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.80      0.81     29909
           1       0.81      0.83      0.82     30091

    accuracy                           0.82     60000
   macro avg       0.82      0.82      0.82     60000
weighted avg       0.82      0.82      0.82     60000


Training MultinomialNB:

Accuracy: 0.8793833333333333
Training Time: 0.02 seconds
Classification Repor

### Summary of Results

#### Summary of Best Classifiers:
| Model | Vectorizer | Stemming/Lemmatizing | Accuracy | Training Time (s) |
|---|---|---|---|---|
| DecisionTreeClassifier | TfidfVectorizer | Stemming | 0.824233 | 41.925832 |
| LogisticRegression | CountVectorizer | Stemming | 0.887800 | 1.802860 |
| MultinomialNB | CountVectorizer | Stemming | 0.879383 | 0.019012 |

### Analysis

#### Accuracy
Logistic Regression with CountVectorizer and Stemming achieved the highest accuracy of 0.887800. It performed best at classifying the text data, although its lead over MultinomialNB is under one percentage point; the margin over DecisionTreeClassifier is much larger, at roughly six points.

MultinomialNB with CountVectorizer and Stemming had an accuracy of 0.879383, which is also high but slightly lower than Logistic Regression.

DecisionTreeClassifier with TfidfVectorizer and Stemming achieved an accuracy of 0.824233. While it still performed reasonably well, it was clearly less accurate than the other two models.
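
The gaps between the three best configurations can be checked directly from the reported accuracies. The snippet below is a small arithmetic check; the dictionary simply restates the values from the summary table, and the variable names are chosen here for illustration.

```python
# Best accuracy per model, copied from the summary table
best_accuracy = {
    'LogisticRegression': 0.887800,
    'MultinomialNB': 0.879383,
    'DecisionTreeClassifier': 0.824233,
}

top = max(best_accuracy.values())

# Gap to the best model, expressed in percentage points
gaps = {name: (top - acc) * 100 for name, acc in best_accuracy.items()}

for name, gap in gaps.items():
    print(f'{name}: {best_accuracy[name]:.4f} (gap to best: {gap:.2f} points)')
```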

#### Training Time
MultinomialNB had the shortest training time of 0.019012 seconds, making it the fastest classifier. This is expected: as noted earlier in the course, MultinomialNB is generally efficient on large datasets, particularly for text classification, because training only requires counting feature occurrences per class in a single pass over the data.

Logistic Regression took 1.802860 seconds to train, which is still quick in absolute terms but roughly 95 times longer than MultinomialNB.

DecisionTreeClassifier took 41.925832 seconds, making it the slowest of the three by a wide margin (about 23 times longer than Logistic Regression). This is expected: at each node the tree searches for the best split across up to 5,000 features, and with no depth limit set it can grow very deep.
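
The timings above come from a single run each, so small differences may partly reflect noise. One way to get steadier numbers is to repeat the call and keep the fastest run. The helper below is a minimal sketch of that idea; `time_call` is a hypothetical name, not part of the notebook, and the workload is a stand-in so the snippet runs on its own.

```python
import time

def time_call(func, repeats=3):
    """Return the fastest wall-clock duration of func() over `repeats` runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        durations.append(time.perf_counter() - start)
    return min(durations)

# Stand-in workload; in the notebook this could instead be
# time_call(lambda: model.fit(X_train_vec, y_train))
elapsed = time_call(lambda: sum(i * i for i in range(100_000)))
print(f'Fastest of 3 runs: {elapsed:.4f} seconds')
```

Taking the minimum rather than the mean reduces the influence of unrelated background activity on the machine.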

### Conclusion

If you prioritize accuracy and can afford slightly longer training times, Logistic Regression is the best choice. If you need the fastest training time and can accept a minor reduction in accuracy, MultinomialNB is preferable. DecisionTreeClassifier provides decent accuracy but at a much higher training cost, making it less suitable when speed is a concern.

In summary, for this case study I would recommend Logistic Regression for the best accuracy and MultinomialNB for the best speed. The choice of vectorizer (CountVectorizer vs. TfidfVectorizer) and text preprocessing technique (Stemming vs. Lemmatizing) had minimal impact on which model performed best overall, but both could still be worth exploring further, depending on the requirements and constraints of the next task.
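
As one possible way to explore these choices further, the vectorizer and a model hyperparameter can be searched together with a `Pipeline` and `GridSearchCV`. The sketch below is not part of the original experiment: the tiny corpus is invented so that the code runs on its own, and the `C` values and `cv=2` are arbitrary assumptions. In practice `toy_texts` and `toy_labels` would be replaced by the training split.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny invented corpus (1 = humorous, 0 = not), for illustration only
toy_texts = [
    "what do you call a fish with no eyes",
    "why did the scarecrow win an award",
    "knock knock who is there",
    "what do you call a bear with no teeth",
    "senate passes budget bill after long debate",
    "stocks fall as investors weigh rate outlook",
    "city council approves new transit plan",
    "storm brings heavy rain to the coast",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Swap the whole vectorizer step and tune C in a single search
param_grid = {
    'vect': [CountVectorizer(), TfidfVectorizer()],
    'clf__C': [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring='accuracy')
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(f'Best CV accuracy: {search.best_score_:.3f}')
```

Fitting the vectorizer inside the pipeline means it is refit on each training fold, so the cross-validated scores are not influenced by vocabulary from the held-out fold.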