## Natural Language Text Classification Models and Vectorization

This project compares three estimators (Logistic Regression, Decision Tree, and Naive Bayes) and two text vectorization strategies (TF-IDF and count) on a text classification problem.  
A pipeline is created with one of the estimators and either TF-IDF or count vectorization. `GridSearchCV` from scikit-learn is used to find the best hyperparameters for each estimator. Results are generated for each estimator and vectorization combination.

### Machine Learning Models
#### Logistic Regression
Logistic Regression is a linear model used for binary classification tasks, predicting the probability that a given input belongs to a particular class.
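
As a rough illustration of how a linear score becomes a class probability, the sketch below applies the logistic (sigmoid) function to a weighted sum of features. The weights, bias, and feature values are made up for the example and are not taken from the fitted models in this notebook.

```python
import math

def predict_proba(weights, bias, features):
    """Return P(class = 1) for one example under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical two-feature model: the first feature pushes toward "humor",
# the second pushes away from it.
weights = [1.5, -2.0]
bias = 0.0

p_humor = predict_proba(weights, bias, [1.0, 0.0])    # positive score
p_neutral = predict_proba(weights, bias, [0.0, 0.0])  # score of exactly zero

print(f"{p_humor:.3f} {p_neutral:.3f}")
```

A score of zero maps to a probability of 0.5, which is the usual decision boundary for the binary case.
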
#### Decision Tree Classifier
Decision Tree classifiers split the data into subsets based on the value of input features, creating a tree-like model of decisions. 
#### Naive Bayes
Naive Bayes classifiers are based on Bayes’ theorem, assuming independence between features.
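
To make the independence assumption concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with additive smoothing. It is only an illustration on a toy corpus; the function names `train_nb` and `predict_nb` are hypothetical, and the notebook itself relies on scikit-learn's `MultinomialNB`. The `alpha` argument plays the same role as the `classifier__alpha` values searched later.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    vocab = {word for doc in docs for word in doc.split()}
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    model = {}
    for label, n_docs in class_doc_counts.items():
        total_words = sum(word_counts[label].values())
        denominator = total_words + alpha * len(vocab)
        log_likelihood = {
            word: math.log((word_counts[label][word] + alpha) / denominator)
            for word in vocab
        }
        model[label] = (math.log(n_docs / len(docs)), log_likelihood)
    return model

def predict_nb(model, doc):
    """Sum per-word log likelihoods independently; unseen words are skipped."""
    scores = {
        label: log_prior + sum(ll[w] for w in doc.split() if w in ll)
        for label, (log_prior, ll) in model.items()
    }
    return max(scores, key=scores.get)

docs = ["funny joke laugh", "joke pun laugh", "stock market report", "market news report"]
labels = [True, True, False, False]

model = train_nb(docs, labels, alpha=1.0)
prediction = predict_nb(model, "laugh joke")
print(prediction)
```
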

### Text Vectorization

Text vectorization is a crucial step in Natural Language Processing (NLP): it converts text data into numerical representations that machine learning algorithms can process. This project compares two common vectorization techniques:

#### Bag of Words (BoW): 
Also known as count vectorization, this technique converts text into numerical features by counting the occurrences of each word in a document. It creates a sparse matrix where each row represents a document and each column represents a word from the vocabulary.
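
The document-term matrix described above can be sketched in plain Python. This is a dense, toy version for illustration only; `CountVectorizer` builds a sparse matrix and applies its own tokenization rules, which this whitespace split does not reproduce.

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]

# Columns: the sorted vocabulary across all documents.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Rows: one count vector per document.
matrix = [[Counter(doc.split())[word] for word in vocabulary] for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```
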

#### Term Frequency-Inverse Document Frequency (TF-IDF): 
This technique not only considers the frequency of words in a document but also their importance across the entire corpus. It reduces the weight of commonly occurring words and gives more importance to less frequent but more meaningful words. 
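
A small worked example may help show how the weighting behaves. The sketch below assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, which is the formula scikit-learn documents for its default settings, and it omits the row normalization that `TfidfVectorizer` applies by default. The toy corpus is invented for the example.

```python
import math
from collections import Counter

docs = [
    "the joke was funny",
    "the report was long",
    "the pun was funny",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each word appears in.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed inverse document frequency (assumed formula, see the text above).
idf = {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}

# Unnormalized TF-IDF weights for the first document: term count times idf.
tfidf_first = {word: count * idf[word] for word, count in Counter(tokenized[0]).items()}

for word, weight in sorted(tfidf_first.items(), key=lambda item: -item[1]):
    print(f"{word}\t{weight:.3f}")
```

Here "the" appears in every document and receives the lowest weight, while "joke" appears in only one and receives the highest.
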



### The Data

The dataset is from [kaggle]() and is named the "ColBert Dataset" (https://arxiv.org/pdf/2004.12765.pdf). The `text` column holds the input text, and the `humor` column is the label indicating whether or not the text is humorous.

In [1]:
import pandas as pd
import numpy  as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB

from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

import string



In [2]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords



In [3]:

#df = pd.read_csv('text_data/dataset-minimal.csv')
df = pd.read_csv('text_data/dataset.csv')

cvfolds      = 5
verbosity    = 0
max_features = 5000

#### Task


**Text preprocessing:** As a pre-processing step, perform both `stemming` and `lemmatizing` to normalize your text before classifying. For each technique, use both the `CountVectorizer` and the `TfidfVectorizer`, and use the options for stop words and max features to prepare the text data for your estimator.
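
The notebook defines `max_features = 5000`, but the vectorizers in `models__` are constructed with default arguments. One possible way to apply the stop-word and max-features options is sketched below; the tiny corpus and the cap of 3 are assumptions chosen only to make the effect visible.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the joke was very funny",
    "the report was very long",
    "a funny pun about the report",
]
max_features = 3  # tiny cap for the demo; the notebook sets 5000

vocabularies = {}
for vectorizer_cls in (CountVectorizer, TfidfVectorizer):
    # stop_words="english" uses scikit-learn's built-in English stop list.
    vectorizer = vectorizer_cls(stop_words="english", max_features=max_features)
    matrix = vectorizer.fit_transform(docs)
    vocabularies[vectorizer_cls.__name__] = sorted(vectorizer.vocabulary_)
    print(vectorizer_cls.__name__, matrix.shape, vocabularies[vectorizer_cls.__name__])
```
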

**Classification:** Once you have prepared the text data with stemming and lemmatizing, consider `LogisticRegression`, `DecisionTreeClassifier`, and `MultinomialNB` as classification algorithms for the data. Compare their performance in terms of accuracy and speed.

Share the results of your best classifier in the form of a table with the best version of each estimator, a dictionary of the best parameters and the best score.

models_df = pd.DataFrame({'model': ['Logistic', 'Decision Tree', 'Bayes'], 
                            'best_params': ['', '', ''],
                             'best_score': ['', '', '']}).set_index('model')

In [4]:

param_grids = {
    'Logistic Regression': {
        'classifier__C': [0.1, 1, 10],
        'classifier__solver': ['liblinear', 'lbfgs']
    },
    'DecisionTree': {
        'classifier__max_depth': [3, 5, 7],
        'classifier__min_samples_split': [2, 5, 10]
    },
    'Bayes': {
        'classifier__alpha' : [0.1, 1, 10]
    }  # alpha is MultinomialNB's additive smoothing parameter
}
models__ = [
  { 'classifier_name': 'Bayes',               'classifier' : MultinomialNB(),         'vectorizer_name' : 'Tfidf', 'vectorizer' : TfidfVectorizer()},
  { 'classifier_name': 'Bayes',               'classifier' : MultinomialNB(),         'vectorizer_name' : 'Count', 'vectorizer' : CountVectorizer()},
  { 'classifier_name': 'Logistic Regression', 'classifier' : LogisticRegression(),    'vectorizer_name' : 'Tfidf', 'vectorizer' : TfidfVectorizer()},
  { 'classifier_name': 'Logistic Regression', 'classifier' : LogisticRegression(),    'vectorizer_name' : 'Count', 'vectorizer' : CountVectorizer()},
  { 'classifier_name': 'DecisionTree',        'classifier' : DecisionTreeClassifier(),'vectorizer_name' : 'Tfidf', 'vectorizer' : TfidfVectorizer()},
  { 'classifier_name': 'DecisionTree',        'classifier' : DecisionTreeClassifier(),'vectorizer_name' : 'Count', 'vectorizer' : CountVectorizer()},

]
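
The `LogisticRegression()` entries above use the default iteration limit, and the run output further down shows a convergence warning that suggests raising `max_iter`. A minimal sketch of that change follows; the value 1000 and the toy data are assumptions, not settings used in this notebook.

```python
from sklearn.linear_model import LogisticRegression

# Toy, linearly separable data standing in for vectorized text features.
X = [[0, 1], [1, 0], [0, 2], [2, 0]]
y = [0, 1, 0, 1]

# A higher iteration cap gives the solver more room to converge.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# n_iter_ reports how many iterations the solver actually used.
print(clf.n_iter_[0], "iterations used out of", clf.max_iter)
```

The same keyword could be passed where `LogisticRegression()` is constructed in `models__`, at the cost of longer fit times for candidates that need the extra iterations.
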

In [5]:
from sklearn.base import BaseEstimator, TransformerMixin 
class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.stemmer    = PorterStemmer()
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))

    def normalize_text(self, text):
        text = text.lower()
        text = text.translate(str.maketrans('', '', string.punctuation))
        words = word_tokenize(text)
        words = [word for word in words if word not in self.stop_words]
        return " ".join(words)

    def stem_text(self, text):
        words = word_tokenize(text)
        stemmed_words = [self.stemmer.stem(word) for word in words]
        return " ".join(stemmed_words)

    def lemmatize_text(self, text):
        words = word_tokenize(text)
        lemmatized_words = [self.lemmatizer.lemmatize(word) for word in words]
        return " ".join(lemmatized_words)

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.apply(self.normalize_text).apply(self.stem_text).apply(self.lemmatize_text)
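
The lowercasing and punctuation stripping in `normalize_text` can be checked in isolation without NLTK. In this sketch a whitespace split stands in for `word_tokenize` and a four-word set stands in for NLTK's English stop list; both are simplifying assumptions.

```python
import string

STOP_WORDS = {"this", "is", "a", "the"}  # hypothetical stand-in for NLTK's list

def normalize_text(text):
    """Lowercase, drop punctuation, and remove stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = text.split()  # whitespace split stands in for word_tokenize
    return " ".join(word for word in words if word not in STOP_WORDS)

result = normalize_text("This is a test: the joke's punchline!")
print(result)
```

Note that deleting punctuation turns "joke's" into "jokes", so the later stemming step sees a different token than the original word.
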


In [6]:
from sklearn.pipeline import Pipeline

def gen_pipeline( vectorizer, classifier ):
    pipeline_obj = Pipeline([
        ('preprocess', TextPreprocessor()),  # Custom text preprocessing
        ('vectorize',  vectorizer),          # Count or TF-IDF vectorizer passed in by the caller
        ('classifier', classifier),          # Estimator passed in by the caller
    ])
    return pipeline_obj
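
The step names chosen here are what make keys such as `classifier__C` in `param_grids` work: scikit-learn routes `<step name>__<parameter>` to the matching pipeline step. A minimal sketch, leaving out the custom preprocessing step for brevity:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("vectorize", CountVectorizer()),
    ("classifier", LogisticRegression()),
])

# "<step>__<param>" sets a parameter on the named step.
pipe.set_params(classifier__C=10, vectorize__max_features=5000)

c_value = pipe.named_steps["classifier"].C
cap = pipe.named_steps["vectorize"].max_features
print(c_value, cap)
```
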

In [7]:
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['humor'], test_size=0.2, random_state=42)


In [8]:
from sklearn.model_selection import GridSearchCV


print(f'Model-Vectorization\t\tScore\t\tFit Time\t\tScore Time')
print(f'===================\t\t=====\t\t========\t\t==========')
for model in models__:
    model_pipeline   = gen_pipeline( model['vectorizer'], model['classifier'] )
    model_classifier = GridSearchCV( model_pipeline, param_grids[model['classifier_name']], cv=cvfolds, scoring='accuracy',verbose=verbosity)
    model_classifier.fit(X_train, y_train)

    # Held-out evaluation with the refit best estimator (computed but not printed below)
    y_pred          = model_classifier.predict(X_test)
    gs_accuracy     = accuracy_score(y_test, y_pred)
    report_gs_model = classification_report(y_test, y_pred)

    # These are averages over every grid candidate, not the best candidate's values
    fit_time   = np.mean( model_classifier.cv_results_['mean_fit_time']  )
    score_time = np.mean( model_classifier.cv_results_['mean_score_time'])
    mean_score = np.mean( model_classifier.cv_results_['mean_test_score'])
    print(f"{model['classifier_name']}-{model['vectorizer_name']}\t{mean_score:.2f}\t\t{fit_time:.2f}\t\t{score_time:.2f}")



Model-Vectorization		Score		Fit Time		Score Time
Bayes-Tfidf	0.89		20.67		5.15
Bayes-Count	0.89		23.10		5.88


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression-Tfidf	0.89		24.44		5.95


Logistic Regression-Count	0.89		25.29		5.98
DecisionTree-Tfidf	0.59		23.67		5.77
DecisionTree-Count	0.59		23.56		5.84
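
The scores printed above average `mean_test_score` over every grid candidate. For the summary table requested in the task, `GridSearchCV` also exposes `best_params_` and `best_score_` for the winning candidate. The sketch below shows one way to collect them into a DataFrame; the eight-sentence corpus, `cv=2`, and the row label are assumptions made so the example runs quickly on its own.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = [
    "funny joke laugh", "silly pun laugh", "joke about a cat", "a funny silly pun",
    "stock market report", "market news today", "quarterly earnings report", "news about earnings",
]
labels = [True, True, True, True, False, False, False, False]

pipe = Pipeline([("vectorize", CountVectorizer()), ("classifier", MultinomialNB())])
search = GridSearchCV(pipe, {"classifier__alpha": [0.1, 1, 10]}, cv=2, scoring="accuracy")
search.fit(texts, labels)

# One row per model-vectorizer combination; only one combination is shown here.
summary = pd.DataFrame([{
    "model": "Bayes-Count",
    "best_params": search.best_params_,
    "best_score": search.best_score_,
}]).set_index("model")
print(summary)
```

In the notebook's loop, the same two attributes could be read from `model_classifier` after each `fit` call and appended to a list of rows.
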
