# Text Analytics Pipeline for Text Classification

This notebook demonstrates how to build a text analytics pipeline that includes text processing, feature extraction, classification, and evaluation.


Install below dependancies/modules in order to to re-run this notebook:

In [None]:
%pip install pandas numpy nltk emoji spacy contractions scikit-learn imbalanced-learn
!python -m spacy download en_core_web_sm

To Download the dataset:

In [None]:
# Pipeline to load or download the dataset
"""
TODO
"""

To Download GloVe pretrained word vectors :

In [None]:
# Pipeline to load or download the dataset
"""
TODO
"""

Modules:

In [30]:
import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer  
from nltk.corpus import stopwords
import emoji
import spacy
import contractions

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.decomposition import TruncatedSVD
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.under_sampling import RandomUnderSampler

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Custom Text Preprocessor

The custom transformer below:

 - **Emoji Conversion:** Converts any emojis to their text descriptions.
 - **Normalization:** Lowercases the text.
 - **Punctuation Removal:** Removes punctuation using regex.
 - **Tokenization:** Uses NLTK’s `word_tokenize`.
 - **Stop-word Removal:** Filters out English stopwords.
 - **Stemming:** Applies Porter stemming.
 
 The transformer implements `fit` and `transform` so that it can be used inside a scikit-learn pipeline.

In [46]:
class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, do_stemming=True, do_lemmatization=False, remove_stopwords=True, 
                 do_emoji_conversion=True, use_spacy_tokenizer=True):
        """
        Parameters:
        - do_stemming: Apply stemming (reduces words to their root form)
        - do_lemmatization: Apply lemmatization (converts words to their canonical form)
          Note: When using the default (NLTK) tokenizer, if both do_lemmatization and do_stemming are enabled,
          lemmatization takes precedence.
        - remove_stopwords: Remove common stopwords
        - do_emoji_conversion: Convert emojis to text descriptions
        - use_spacy_tokenizer: Use a custom spaCy-based tokenizer (which already uses lemmatization)
        """
        self.do_stemming = do_stemming
        self.do_lemmatization = do_lemmatization
        self.remove_stopwords = remove_stopwords
        self.do_emoji_conversion = do_emoji_conversion
        self.use_spacy_tokenizer = use_spacy_tokenizer
        self.stemmer = PorterStemmer()
        if self.do_lemmatization:
            self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
        
        # Load the spaCy model if using the spaCy tokenizer
        if self.use_spacy_tokenizer:
            self.nlp = spacy.load("en_core_web_sm")
    
    def remove_links(self, text):
        """Remove URLs from text."""
        return re.sub(r'http[s]?://\S+|www\.\S+', '', text)
    
    def remove_user_mentions(self, text):
        """Remove user mentions from text."""
        return re.sub(r'u/\S+', '', text)
    
    def expand_contractions(self, text):
        """Expand contractions in the text."""
        return contractions.fix(text)
    
    def remove_non_ascii(self, text):
        """Remove non-ASCII characters from the text."""
        return text.encode("ascii", "ignore").decode()
    
    def remove_punctuations(self, text):
        """
        Remove or adjust punctuation in text.
        Replaces hyphens with space and ensures separation around punctuation.
        """
        text = re.sub(r'[-]', ' ', text)
        text = re.sub(r'(\S)[' + re.escape(string.punctuation) + r'](\S)', r'\1 \2', text)
        return text
    
    def remove_numbers(self, text):
        """Remove numbers from text."""
        return re.sub(r'[0-9]+', '', text)
    
    def emoji_to_text(self, text):
        """Convert emojis to text descriptions."""
        return emoji.demojize(text)
    
    def normalize(self, text):
        """Lowercase the text."""
        return text.lower()
    
    def tokenize(self, text):
        """
        Tokenize text using either a spaCy-based custom tokenizer or the default NLTK tokenizer.
        """
        if self.use_spacy_tokenizer:
            # Use spaCy's custom tokenization logic:
            doc = self.nlp(text)
            tokens = []
            # Add named entities as tokens
            for ent in doc.ents:
                tokens.append(ent.text)
            # Add non-entity tokens using their lemma
            non_entity_tokens = [token.lemma_.lower() for token in doc if not token.ent_type_ 
                                 and not token.is_punct and not token.is_space]
            tokens.extend(non_entity_tokens)
            if self.remove_stopwords:
                tokens = [token for token in tokens if token.lower() not in self.stop_words]
            if self.do_stemming:
                tokens = [self.stemmer.stem(token) for token in tokens]
            return tokens
        else:
            # Default NLTK-based tokenization:
            # Remove punctuation (if any remains) and then tokenize
            text = re.sub(r'[^\w\s]', '', text)
            tokens = word_tokenize(text)
            # Keep only alphabetic tokens
            tokens = [token for token in tokens if token.isalpha()]
            if self.remove_stopwords:
                tokens = [token for token in tokens if token.lower() not in self.stop_words]
            # Apply lemmatization if enabled; otherwise, apply stemming if enabled
            if self.do_lemmatization:
                tokens = [self.lemmatizer.lemmatize(token) for token in tokens]
            elif self.do_stemming:
                tokens = [self.stemmer.stem(token) for token in tokens]
            return tokens
    
    def preprocess(self, text):
        """Apply the complete preprocessing pipeline to the text."""
        text = self.remove_links(text)
        text = self.remove_user_mentions(text)
        text = self.expand_contractions(text)
        text = self.remove_non_ascii(text)
        text = self.remove_punctuations(text)
        text = self.remove_numbers(text)
        if self.do_emoji_conversion:
            text = self.emoji_to_text(text)
        text = self.normalize(text)
        tokens = self.tokenize(text)
        return ' '.join(tokens)
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return X.apply(self.preprocess)

## GloveVectorizer Class

In our project, the `GloveVectorizer` is used to transform text data into numerical features by leveraging pre-trained GloVe embeddings. This approach provides semantic-rich, dense vector representations of documents, which can improve model performance over traditional sparse representations.

 **GloVe Word Vector Used:**  glove.twitter.27B.50d.txt 

### Key Components

- **`__init__`:**  
  Initializes the vectorizer with the GloVe file path and embedding dimension.

- **`fit`:**  
  Loads the GloVe embeddings into a dictionary for quick lookup.

- **`transform`:**  
  Converts each document into an average embedding vector by:
  - Splitting the text into tokens.
  - Retrieving the corresponding embedding for each token.
  - Averaging these embeddings to form a single vector for the document.

This vectorizer is essential for capturing the contextual meaning of words, enhancing the classifier's ability to understand and process text data.

In [32]:
class GloveVectorizer(BaseEstimator, TransformerMixin):
    """
    Loading pre-trained GloVe embeddings and returns the average embedding vector for each document.
    """
    def __init__(self, glove_file='glove.twitter.27B.50d.txt', embedding_dim=50):
        self.glove_file = glove_file
        self.embedding_dim = embedding_dim

    def fit(self, X, y=None):
        self.embeddings_index = {}
        with open(self.glove_file, encoding="utf8") as f:
            for line in f:
                values = line.split()
                word = values[0]
                coefs = np.asarray(values[1:], dtype='float32')
                self.embeddings_index[word] = coefs
        return self

    def transform(self, X):
        vectors = []
        for doc in X:
            # Since TextPreprocessor returns a space-separated string of tokens,
            # we can simply split on spaces.
            tokens = doc.split()
            token_vecs = [self.embeddings_index[token] for token in tokens if token in self.embeddings_index]
            if token_vecs:
                doc_vec = np.mean(token_vecs, axis=0)
            else:
                doc_vec = np.zeros(self.embedding_dim)
            vectors.append(doc_vec)
        return np.array(vectors)

 ## Data Loading and Train/Test Split
 
 We load the dataset and split it into training (80%) and testing (20%) sets.

In [None]:
# Read the dataset 
df = pd.read_csv("../Data/labelled_data.csv")

# Check available columns
print("Columns in dataset:", df.columns.tolist())

# Select the important columns and drop any missing values
df = df[['text', 'label']].dropna()
X = df['text']
y = df['label']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Columns in dataset: ['post_id', 'subreddit', 'post_title', 'post_body', 'number_of_comments', 'readable_datetime', 'post_author', 'number_of_upvotes', 'query', 'text', 'comment_id', 'comment_body', 'comment_author', 'label']


___________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

## Stage 1: Basic Pipelines

## Model Pipelines for Sentiment Analysis

This section defines a variety of pipelines to preprocess text data and train classifiers using different vectorization methods. Each pipeline uses a custom text preprocessor and is configured to handle class imbalance via weighted class balancing (or random under-sampling for Naive Bayes).

## Pipeline Categories

- **Logistic Regression Pipelines**  
  Pipelines using both CountVectorizer (with unigrams and n-grams) and TfidfVectorizer (with unigrams and n-grams) paired with Logistic Regression.

- **SVM Pipelines**  
  Pipelines combining CountVectorizer or TfidfVectorizer (with unigrams and n-grams) with LinearSVC, with class weights balanced.

- **Random Forest Pipelines**  
  Similar to the above, these pipelines use CountVectorizer or TfidfVectorizer (with unigrams and n-grams) with a RandomForestClassifier configured for balanced classes.

- **Naive Bayes Pipelines**  
  For Naive Bayes, pipelines integrate random under-sampling to address imbalance, alongside either CountVectorizer or TfidfVectorizer (with unigrams and n-grams).

Each pipeline is built using scikit-learn’s `Pipeline` (or `ImbPipeline` for Naive Bayes) to streamline preprocessing, vectorization, sampling, and classification.

### Kewords:

- lr -> Logistic Regression
- SVM -> Support vector machine
- RF -> Random Forest
- NB -> Multinomial Naive Bayes

In [None]:
#  Model Pipelines (Binary and TF-IDF, Ngram and Unigram, with and without SVD)


# --- Logistic Regression Pipelines (with weighted balancing) ---
pipeline_lr_count_unigram = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', CountVectorizer(binary=True, ngram_range=(1,1))),
    ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced'))
])

pipeline_lr_count_ngram = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', CountVectorizer(binary=True, ngram_range=(1,2))),
    ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced'))
])

pipeline_lr_tfidf_unigram = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', TfidfVectorizer(ngram_range=(1,1))),
    ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced'))
])

pipeline_lr_tfidf_ngram = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', TfidfVectorizer(ngram_range=(1,2))),
    ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced'))
])


# --- SVM Pipelines (using LinearSVC with weighted balancing) ---
pipeline_svm_count_unigram = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', CountVectorizer(binary=True, ngram_range=(1,1))),
    ('classifier', LinearSVC(max_iter=1000, class_weight='balanced'))
])

pipeline_svm_count_ngram = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', CountVectorizer(binary=True, ngram_range=(1,2))),
    ('classifier', LinearSVC(max_iter=1000, class_weight='balanced'))
])

pipeline_svm_tfidf_unigram = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', TfidfVectorizer(ngram_range=(1,1))),
    ('classifier', LinearSVC(max_iter=1000, class_weight='balanced'))
])

pipeline_svm_tfidf_ngram = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', TfidfVectorizer(ngram_range=(1,2))),
    ('classifier', LinearSVC(max_iter=1000, class_weight='balanced'))
])


# --- Random Forest Pipelines (with weighted balancing) ---
pipeline_rf_count_unigram = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', CountVectorizer(binary=True, ngram_range=(1,1))),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced'))
])

pipeline_rf_count_ngram = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', CountVectorizer(binary=True, ngram_range=(1,2))),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced'))
])

pipeline_rf_tfidf_unigram = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', TfidfVectorizer(ngram_range=(1,1))),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced'))
])

pipeline_rf_tfidf_ngram = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', TfidfVectorizer(ngram_range=(1,2))),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced'))
])


# --- Naive Bayes Pipelines (with Random Under-Sampling) ---
pipeline_nb_count_unigram = ImbPipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', CountVectorizer(binary=True, ngram_range=(1,1))),
    ('sampler', RandomUnderSampler(random_state=42)),
    ('classifier', MultinomialNB())
])

pipeline_nb_count_ngram = ImbPipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', CountVectorizer(binary=True, ngram_range=(1,2))),
    ('sampler', RandomUnderSampler(random_state=42)),
    ('classifier', MultinomialNB())
])

pipeline_nb_tfidf_unigram = ImbPipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', TfidfVectorizer(ngram_range=(1,1))),
    ('sampler', RandomUnderSampler(random_state=42)),
    ('classifier', MultinomialNB())
])

pipeline_nb_tfidf_ngram = ImbPipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorizer', TfidfVectorizer(ngram_range=(1,2))),
    ('sampler', RandomUnderSampler(random_state=42)),
    ('classifier', MultinomialNB())
])


### Pipeline Augmentation and Glove Embeddings


1. **Inserting SVD for Dimensionality Reduction:**  
   - The helper function `add_svd` takes an existing pipeline and inserts a `TruncatedSVD` step (defaulting to 100 components) right after the vectorizer.  
   - This reduces the high-dimensional output from text vectorizers, which can speed up training and potentially improve performance.
   - Pipelines that use classifiers other than MultinomialNB (stored in `other_pipelines`) have their SVD versions created using this function. These SVD-augmented pipelines are then merged with the original pipelines (and those for Naive Bayes) into the `all_pipelines` dictionary.

2. **Incorporating Glove Embeddings:**  
   - Separate pipelines are defined that use the `GloveVectorizer` to convert text into fixed-length embeddings by averaging pre-trained GloVe word vectors (from `glove.twitter.27B.50d.txt`).  
   - These pipelines are paired with Logistic Regression, SVM, and Random Forest classifiers.
   - The resulting Glove-based pipelines are added to the `all_pipelines` dictionary, enabling comparison with traditional count/Tf-idf based pipelines.

Overall, this structure allows you to easily experiment with different vectorization methods (including dimensionality reduction via SVD and Glove embeddings) across various classifiers.


In [None]:
# Helper to Insert SVD
def add_svd(pipeline, n_components=100):
    """
    Inserts a TruncatedSVD step right after the vectorizer.
    Assumes the pipeline has steps: preprocessor, vectorizer, classifier.
    """
    steps = pipeline.steps.copy()
    # Insert SVD at position 2 (right after vectorizer)
    steps.insert(2, ('svd', TruncatedSVD(n_components=n_components)))
    return Pipeline(steps)


# Create pipelines without SVD for NB
pipelines_no_svd = {
    # For models that don't use SVD (for NB)
    "NB_Count_Binary_Unigram": pipeline_nb_count_unigram,
    "NB_Count_Binary_Ngram": pipeline_nb_count_ngram,
    "NB_Tfidf_Unigram": pipeline_nb_tfidf_unigram,
    "NB_Tfidf_Ngram": pipeline_nb_tfidf_ngram,
}

# For models other than MultinomialNB, with their SVD versions
other_pipelines = {
    "LR_Count_Binary_Unigram": pipeline_lr_count_unigram,
    "LR_Count_Binary_Ngram": pipeline_lr_count_ngram,
    "LR_Tfidf_Unigram": pipeline_lr_tfidf_unigram,
    "LR_Tfidf_Ngram": pipeline_lr_tfidf_ngram,
    
    "SVM_Count_Binary_Unigram": pipeline_svm_count_unigram,
    "SVM_Count_Binary_Ngram": pipeline_svm_count_ngram,
    "SVM_Tfidf_Unigram": pipeline_svm_tfidf_unigram,
    "SVM_Tfidf_Ngram": pipeline_svm_tfidf_ngram,
    
    "RF_Count_Binary_Unigram": pipeline_rf_count_unigram,
    "RF_Count_Binary_Ngram": pipeline_rf_count_ngram,
    "RF_Tfidf_Unigram": pipeline_rf_tfidf_unigram,
    "RF_Tfidf_Ngram": pipeline_rf_tfidf_ngram,
}

# Create SVD versions for non-NB pipelines
svd_pipelines = {name + "_SVD": add_svd(pipe) for name, pipe in other_pipelines.items()}

# Combine all pipelines
all_pipelines = {}
all_pipelines.update(pipelines_no_svd)
all_pipelines.update(other_pipelines)
all_pipelines.update(svd_pipelines)

In [None]:
# Return a fixed-length embedding for each document (by averaging word embeddings).
pipeline_glove_lr = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('glove', GloveVectorizer(glove_file='glove.twitter.27B.50d.txt', embedding_dim=50)),
    ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced'))
])

pipeline_glove_svm = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('glove', GloveVectorizer(glove_file='glove.twitter.27B.50d.txt', embedding_dim=50)),
    ('classifier', LinearSVC(max_iter=1000, class_weight='balanced'))
])

pipeline_glove_rf = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('glove', GloveVectorizer(glove_file='glove.twitter.27B.50d.txt', embedding_dim=50)),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced'))
])

# Add the Glove pipelines to all_pipelines dictionary
all_pipelines["Glove_LR"] = pipeline_glove_lr
all_pipelines["Glove_SVM"] = pipeline_glove_svm
all_pipelines["Glove_RF"] = pipeline_glove_rf

### Evaluating Stage 1 Pipelines
 
 We define a helper function that fits a pipeline and returns evaluation metrics:
 
 - **Accuracy**  
 - **Precision** (weighted)  
 - **Recall** (weighted)  
 - **F1 Score** (weighted)
 
 Then, we loop over all pipelines, evaluate them on the test set, and compile the results into a comparison table.

In [37]:
def evaluate_pipeline_metrics(pipeline, X_train, X_test, y_train, y_test):
    """Train the pipeline and return evaluation metrics."""
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
    metrics = {
        "Accuracy": accuracy_score(y_test, predictions),
        "Precision": precision_score(y_test, predictions, average='weighted', zero_division=0),
        "Recall": recall_score(y_test, predictions, average='weighted', zero_division=0),
        "F1 Score": f1_score(y_test, predictions, average='weighted', zero_division=0)
    }
    return metrics


# Evaluate each pipeline and store results
results = []
for name, pipe in all_pipelines.items():
    metrics = evaluate_pipeline_metrics(pipe, X_train, X_test, y_train, y_test)
    row = {"Pipeline": name}
    row.update(metrics)
    results.append(row)

# Create a DataFrame of results and sort by F1 Score
results_df = pd.DataFrame(results)
results_df = results_df.sort_values(by="F1 Score", ascending=False)
print("### Model Comparison Table")
results_df.reset_index(drop=True, inplace=True)
print(results_df)

### Model Comparison Table
                        Pipeline  Accuracy  Precision    Recall  F1 Score
0                SVM_Tfidf_Ngram  0.762849   0.742287  0.762849  0.738133
1          LR_Count_Binary_Ngram  0.747520   0.728635  0.747520  0.732432
2                 LR_Tfidf_Ngram  0.732191   0.729445  0.732191  0.730574
3              SVM_Tfidf_Unigram  0.738503   0.725265  0.738503  0.730212
4         SVM_Count_Binary_Ngram  0.746619   0.721449  0.746619  0.722890
5        LR_Count_Binary_Unigram  0.715059   0.723156  0.715059  0.718613
6       SVM_Count_Binary_Unigram  0.712353   0.707297  0.712353  0.709597
7               LR_Tfidf_Unigram  0.696123   0.722199  0.696123  0.706270
8          SVM_Tfidf_Unigram_SVD  0.699729   0.674637  0.699729  0.682471
9                      Glove_SVM  0.689811   0.667190  0.689811  0.675243
10  SVM_Count_Binary_Unigram_SVD  0.688909   0.660859  0.688909  0.667307
11           SVM_Tfidf_Ngram_SVD  0.687106   0.650704  0.687106  0.659032
12    SVM_C

## Stage 1 Analysis of Model Performance Metrics

The table above presents the performance of 31 different pipelines for a three-class sentiment analysis task, evaluated using weighted metrics. The key metrics reported are Accuracy, Precision, Recall, and F1 Score.

## Top Performing Pipelines

- **SVM_Tfidf_Ngram (Row 0)**
  - **Accuracy:** 0.7628
  - **Precision:** 0.7423
  - **Recall:** 0.7628
  - **F1 Score:** 0.7381
  - **Analysis:** This pipeline achieves the highest overall accuracy and consistently high performance across all metrics, making it a strong candidate for sentiment analysis.

- **LR_Count_Binary_Ngram (Row 1) and SVM_Count_Binary_Ngram (Row 4)**
  - **Accuracy:** ~0.7475 and 0.7466 respectively
  - **F1 Score:** ~0.7324 and 0.7229 respectively
  - **Analysis:** These pipelines also perform very well, with metrics that are very competitive with the top performer.

## Other Notable Pipelines

- **LR_Tfidf_Ngram (Row 2) and SVM_Tfidf_Unigram (Row 3)**
  - **Accuracy:** 0.7322 and 0.7385 respectively
  - **F1 Score:** 0.7306 and 0.7302 respectively
  - **Analysis:** These models yield competitive results with a good balance between precision and recall, suggesting robust performance across classes.

## Pipelines with Lower Performance

- **Pipelines Incorporating SVD:**  
  - Examples: SVM_Tfidf_Unigram_SVD (Row 8), SVM_Tfidf_Ngram_SVD (Row 11), LR_Tfidf_Ngram_SVD (Row 26), etc.
  - **Observation:** These models generally show a drop in performance (accuracy and F1 Score below 0.70), indicating that dimensionality reduction via SVD may not be beneficial in this setup.

- **Pipelines Using Glove Embeddings and Naive Bayes:**
  - Examples: Glove_SVM (Row 9), Glove_RF (Row 19), Glove_LR (Row 20), NB_Tfidf_Unigram (Row 21), etc.
  - **Observation:** These pipelines exhibit lower F1 Scores (generally in the range of 0.55–0.60), suggesting that alternative representations and simpler probabilistic models might be less effective for this task.

## Overall Insights

- **Best Approach:**  
  Traditional pipelines using TF-IDF or CountVectorizer in combination with SVM or Logistic Regression outperform more complex methods involving SVD or Glove embeddings. The top performers maintain a strong balance across all metrics.

- **Balanced Performance:**  
  The weighted evaluation metrics indicate that the best pipelines are robust across all sentiment classes, with only small differences among the top models. This balanced performance is essential for a multi-class sentiment analysis task.

  **Conclusion:**  
The analysis suggests that **SVM_Tfidf_Ngram** is the top-performing pipeline based on weighted metrics, with **LR_Count_Binary_Ngram** and **SVM_Count_Binary_Ngram** also showing strong performance. More complex methods involving SVD or alternative embeddings did not outperform these traditional approaches.
"""

___________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

## Stage 2: Parameter Tuning

### Parameter Tuning the top 5 pipelines, evaluating each on 20% test set (from 80 - 20 split)

Pipeline Tuning and Evaluation for Sentiment Analysis

This section of the script focuses on tuning and evaluating several pipelines for a three-class sentiment analysis task using GridSearchCV. Each pipeline employs different combinations of text vectorization (using TF-IDF or CountVectorizer) and classification (using SVM or Logistic Regression). The goal is to determine the best hyperparameters for each model and evaluate their performance across multiple metrics.

### Steps Followed

1. **Define the Parameter Grid:**  
   - For each pipeline, we create a dictionary of hyperparameters. 
2. **Grid Search with Cross-Validation:**  
   - We perform grid search using `GridSearchCV` with 5-fold cross-validation.
   - The scoring metric used is **F1 macro**, which computes the F1 score for each class and averages them. This is effective in multi-class sentiment analysis, as it treats all classes equally even if the data is balanced.

3. **Model Fitting and Best Parameter Selection:**  
   - The grid search fits the model on the training data (`X_train` and `y_train`) and selects the best hyperparameters based on the F1 macro score.
   - The best parameters are printed for review.

4. **Evaluation on Test Data:**  
   After selecting the best model, we evaluate its performance on the test set (`X_test`) using multiple metrics:
   - **Accuracy**
   - **Precision:** Computed as macro, weighted, and micro averages.
   - **Recall:** Computed as macro, weighted, and micro averages.
   - **F1 Score:** Computed as macro, weighted, and micro averages.

5. **Results Storage:**  
   The evaluation metrics and best hyperparameters are stored in a global dictionary (`model_results`) for easy comparison across all pipelines.

In [53]:
# Dictionary to hold results for all models
model_results = {}

In [55]:
# Define parameter grid for SVM_Tfidf_Ngram
param_grid = {
    'vectorizer__ngram_range': [(1, 2)],
    'vectorizer__use_idf': [True, False],
    'vectorizer__max_df': [0.9, 1.0],
    'vectorizer__min_df': [1, 2],
    'classifier__C': [0.1, 1, 10],
    'classifier__loss': ['hinge', 'squared_hinge']
}

# Run grid search
gs = GridSearchCV(pipeline_svm_tfidf_ngram, param_grid, cv=5, scoring='f1_macro', n_jobs=1, verbose=1)
gs.fit(X_train, y_train)
print("Best Parameters for SVM_Tfidf_Ngram:")
print(gs.best_params_)

# Evaluate on test data
y_pred = gs.predict(X_test)
acc = accuracy_score(y_test, y_pred)
precision_macro = precision_score(y_test, y_pred, average='macro')
precision_weighted = precision_score(y_test, y_pred, average='weighted')
precision_micro = precision_score(y_test, y_pred, average='micro')
recall_macro = recall_score(y_test, y_pred, average='macro')
recall_weighted = recall_score(y_test, y_pred, average='weighted')
recall_micro = recall_score(y_test, y_pred, average='micro')
f1_macro = f1_score(y_test, y_pred, average='macro')
f1_weighted = f1_score(y_test, y_pred, average='weighted')
f1_micro = f1_score(y_test, y_pred, average='micro')

print("Test Metrics for SVM_Tfidf_Ngram:")
print("Accuracy:", acc)
print("Precision - Macro:", precision_macro, "Weighted:", precision_weighted, "Micro:", precision_micro)
print("Recall    - Macro:", recall_macro, "Weighted:", recall_weighted, "Micro:", recall_micro)
print("F1 Score  - Macro:", f1_macro, "Weighted:", f1_weighted, "Micro:", f1_micro)

# Save results
model_results['SVM_Tfidf_Ngram'] = {
    'best_params': gs.best_params_,
    'accuracy': acc,
    'precision_macro': precision_macro,
    'precision_weighted': precision_weighted,
    'precision_micro': precision_micro,
    'recall_macro': recall_macro,
    'recall_weighted': recall_weighted,
    'recall_micro': recall_micro,
    'f1_macro': f1_macro,
    'f1_weighted': f1_weighted,
    'f1_micro': f1_micro
}


Fitting 5 folds for each of 48 candidates, totalling 240 fits




Best Parameters for SVM_Tfidf_Ngram:
{'classifier__C': 1, 'classifier__loss': 'squared_hinge', 'vectorizer__max_df': 0.9, 'vectorizer__min_df': 2, 'vectorizer__ngram_range': (1, 2), 'vectorizer__use_idf': False}
Test Metrics for SVM_Tfidf_Ngram:
Accuracy: 0.7448151487826871
Precision - Macro: 0.6197368951139476 Weighted: 0.7277096503850062 Micro: 0.7448151487826871
Recall    - Macro: 0.5674228818355823 Weighted: 0.7448151487826871 Micro: 0.7448151487826871
F1 Score  - Macro: 0.5882174800629737 Weighted: 0.7328421177320352 Micro: 0.7448151487826871


In [56]:
# Define parameter grid for LR_Count_Binary_Ngram
param_grid = {
    'vectorizer__ngram_range': [(1, 2)],
    'vectorizer__max_df': [0.9, 1.0],
    'vectorizer__min_df': [1, 2],
    'classifier__C': [0.1, 1, 10],
    'classifier__class_weight': ['balanced']
}

# Run grid search
gs = GridSearchCV(pipeline_lr_count_ngram, param_grid, cv=5, scoring='f1_macro', n_jobs=1, verbose=1)
gs.fit(X_train, y_train)
print("Best Parameters for LR_Count_Binary_Ngram:")
print(gs.best_params_)

# Evaluate on test data
y_pred = gs.predict(X_test)
acc = accuracy_score(y_test, y_pred)
precision_macro = precision_score(y_test, y_pred, average='macro')
precision_weighted = precision_score(y_test, y_pred, average='weighted')
precision_micro = precision_score(y_test, y_pred, average='micro')
recall_macro = recall_score(y_test, y_pred, average='macro')
recall_weighted = recall_score(y_test, y_pred, average='weighted')
recall_micro = recall_score(y_test, y_pred, average='micro')
f1_macro = f1_score(y_test, y_pred, average='macro')
f1_weighted = f1_score(y_test, y_pred, average='weighted')
f1_micro = f1_score(y_test, y_pred, average='micro')

print("Test Metrics for LR_Count_Binary_Ngram:")
print("Accuracy:", acc)
print("Precision - Macro:", precision_macro, "Weighted:", precision_weighted, "Micro:", precision_micro)
print("Recall    - Macro:", recall_macro, "Weighted:", recall_weighted, "Micro:", recall_micro)
print("F1 Score  - Macro:", f1_macro, "Weighted:", f1_weighted, "Micro:", f1_micro)

# Save results
model_results['LR_Count_Binary_Ngram'] = {
    'best_params': gs.best_params_,
    'accuracy': acc,
    'precision_macro': precision_macro,
    'precision_weighted': precision_weighted,
    'precision_micro': precision_micro,
    'recall_macro': recall_macro,
    'recall_weighted': recall_weighted,
    'recall_micro': recall_micro,
    'f1_macro': f1_macro,
    'f1_weighted': f1_weighted,
    'f1_micro': f1_micro
}


Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best Parameters for LR_Count_Binary_Ngram:
{'classifier__C': 0.1, 'classifier__class_weight': 'balanced', 'vectorizer__max_df': 0.9, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 2)}
Test Metrics for LR_Count_Binary_Ngram:
Accuracy: 0.7123534715960325
Precision - Macro: 0.568175326083589 Weighted: 0.7076506422245021 Micro: 0.7123534715960325
Recall    - Macro: 0.5618741995393995 Weighted: 0.7123534715960325 Micro: 0.7123534715960325
F1 Score  - Macro: 0.5643067712736646 Weighted: 0.7095117507230844 Micro: 0.7123534715960325


In [57]:
# Define parameter grid for LR_Tfidf_Ngram
param_grid = {
    'vectorizer__ngram_range': [(1, 2)],
    'vectorizer__use_idf': [True, False],
    'classifier__C': [0.1, 1, 10],
    'classifier__class_weight': ['balanced']
}

# Run grid search
gs = GridSearchCV(pipeline_lr_tfidf_ngram, param_grid, cv=5, scoring='f1_macro', n_jobs=1, verbose=1)
gs.fit(X_train, y_train)
print("Best Parameters for LR_Tfidf_Ngram:")
print(gs.best_params_)

# Evaluate on test data
y_pred = gs.predict(X_test)
acc = accuracy_score(y_test, y_pred)
precision_macro = precision_score(y_test, y_pred, average='macro')
precision_weighted = precision_score(y_test, y_pred, average='weighted')
precision_micro = precision_score(y_test, y_pred, average='micro')
recall_macro = recall_score(y_test, y_pred, average='macro')
recall_weighted = recall_score(y_test, y_pred, average='weighted')
recall_micro = recall_score(y_test, y_pred, average='micro')
f1_macro = f1_score(y_test, y_pred, average='macro')
f1_weighted = f1_score(y_test, y_pred, average='weighted')
f1_micro = f1_score(y_test, y_pred, average='micro')

print("Test Metrics for LR_Tfidf_Ngram:")
print("Accuracy:", acc)
print("Precision - Macro:", precision_macro, "Weighted:", precision_weighted, "Micro:", precision_micro)
print("Recall    - Macro:", recall_macro, "Weighted:", recall_weighted, "Micro:", recall_micro)
print("F1 Score  - Macro:", f1_macro, "Weighted:", f1_weighted, "Micro:", f1_micro)

# Save results
model_results['LR_Tfidf_Ngram'] = {
    'best_params': gs.best_params_,
    'accuracy': acc,
    'precision_macro': precision_macro,
    'precision_weighted': precision_weighted,
    'precision_micro': precision_micro,
    'recall_macro': recall_macro,
    'recall_weighted': recall_weighted,
    'recall_micro': recall_micro,
    'f1_macro': f1_macro,
    'f1_weighted': f1_weighted,
    'f1_micro': f1_micro
}


Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best Parameters for LR_Tfidf_Ngram:
{'classifier__C': 1, 'classifier__class_weight': 'balanced', 'vectorizer__ngram_range': (1, 2), 'vectorizer__use_idf': True}
Test Metrics for LR_Tfidf_Ngram:
Accuracy: 0.7321911632100991
Precision - Macro: 0.600318552601035 Weighted: 0.7294454546067841 Micro: 0.7321911632100991
Recall    - Macro: 0.5977719095294677 Weighted: 0.7321911632100991 Micro: 0.7321911632100991
F1 Score  - Macro: 0.5986477334344934 Weighted: 0.7305738680821262 Micro: 0.7321911632100991


In [58]:
# Define parameter grid for SVM_Tfidf_Unigram
param_grid = {
    'vectorizer__ngram_range': [(1, 1)],
    'vectorizer__use_idf': [True, False],
    'vectorizer__max_df': [0.9, 1.0],
    'vectorizer__min_df': [1, 2],
    'classifier__C': [0.1, 1, 10],
    'classifier__loss': ['hinge', 'squared_hinge'],
    'classifier__class_weight': ['balanced']
}

# Run grid search
gs = GridSearchCV(pipeline_svm_tfidf_unigram, param_grid, cv=5, scoring='f1_macro', n_jobs=1, verbose=1)
gs.fit(X_train, y_train)
print("Best Parameters for SVM_Tfidf_Unigram:")
print(gs.best_params_)

# Evaluate on test data
y_pred = gs.predict(X_test)
acc = accuracy_score(y_test, y_pred)
precision_macro = precision_score(y_test, y_pred, average='macro')
precision_weighted = precision_score(y_test, y_pred, average='weighted')
precision_micro = precision_score(y_test, y_pred, average='micro')
recall_macro = recall_score(y_test, y_pred, average='macro')
recall_weighted = recall_score(y_test, y_pred, average='weighted')
recall_micro = recall_score(y_test, y_pred, average='micro')
f1_macro = f1_score(y_test, y_pred, average='macro')
f1_weighted = f1_score(y_test, y_pred, average='weighted')
f1_micro = f1_score(y_test, y_pred, average='micro')

print("Test Metrics for SVM_Tfidf_Unigram:")
print("Accuracy:", acc)
print("Precision - Macro:", precision_macro, "Weighted:", precision_weighted, "Micro:", precision_micro)
print("Recall    - Macro:", recall_macro, "Weighted:", recall_weighted, "Micro:", recall_micro)
print("F1 Score  - Macro:", f1_macro, "Weighted:", f1_weighted, "Micro:", f1_micro)

# Save results
model_results['SVM_Tfidf_Unigram'] = {
    'best_params': gs.best_params_,
    'accuracy': acc,
    'precision_macro': precision_macro,
    'precision_weighted': precision_weighted,
    'precision_micro': precision_micro,
    'recall_macro': recall_macro,
    'recall_weighted': recall_weighted,
    'recall_micro': recall_micro,
    'f1_macro': f1_macro,
    'f1_weighted': f1_weighted,
    'f1_micro': f1_micro
}


Fitting 5 folds for each of 48 candidates, totalling 240 fits




Best Parameters for SVM_Tfidf_Unigram:
{'classifier__C': 1, 'classifier__class_weight': 'balanced', 'classifier__loss': 'squared_hinge', 'vectorizer__max_df': 0.9, 'vectorizer__min_df': 2, 'vectorizer__ngram_range': (1, 1), 'vectorizer__use_idf': False}
Test Metrics for SVM_Tfidf_Unigram:
Accuracy: 0.7276825969341749
Precision - Macro: 0.59049139842036 Weighted: 0.7173221303949213 Micro: 0.7276825969341749
Recall    - Macro: 0.5664867359537832 Weighted: 0.7276825969341749 Micro: 0.7276825969341749
F1 Score  - Macro: 0.5769678402386047 Weighted: 0.7213548930090299 Micro: 0.7276825969341749


In [59]:
# Define parameter grid for SVM_Count_Binary_Ngram
param_grid = {
    'vectorizer__ngram_range': [(1, 2)],
    'vectorizer__max_df': [0.9, 1.0],
    'vectorizer__min_df': [1, 2],
    'classifier__C': [0.1, 1, 10],
    'classifier__loss': ['hinge', 'squared_hinge'],
    'classifier__class_weight': ['balanced']
}

# Run grid search
gs = GridSearchCV(pipeline_svm_count_ngram, param_grid, cv=5, scoring='f1_macro', n_jobs=1, verbose=1)
gs.fit(X_train, y_train)
print("Best Parameters for SVM_Count_Binary_Ngram:")
print(gs.best_params_)

# Evaluate on test data
y_pred = gs.predict(X_test)
acc = accuracy_score(y_test, y_pred)
precision_macro = precision_score(y_test, y_pred, average='macro')
precision_weighted = precision_score(y_test, y_pred, average='weighted')
precision_micro = precision_score(y_test, y_pred, average='micro')
recall_macro = recall_score(y_test, y_pred, average='macro')
recall_weighted = recall_score(y_test, y_pred, average='weighted')
recall_micro = recall_score(y_test, y_pred, average='micro')
f1_macro = f1_score(y_test, y_pred, average='macro')
f1_weighted = f1_score(y_test, y_pred, average='weighted')
f1_micro = f1_score(y_test, y_pred, average='micro')

print("Test Metrics for SVM_Count_Binary_Ngram:")
print("Accuracy:", acc)
print("Precision - Macro:", precision_macro, "Weighted:", precision_weighted, "Micro:", precision_micro)
print("Recall    - Macro:", recall_macro, "Weighted:", recall_weighted, "Micro:", recall_micro)
print("F1 Score  - Macro:", f1_macro, "Weighted:", f1_weighted, "Micro:", f1_micro)

# Save results
model_results['SVM_Count_Binary_Ngram'] = {
    'best_params': gs.best_params_,
    'accuracy': acc,
    'precision_macro': precision_macro,
    'precision_weighted': precision_weighted,
    'precision_micro': precision_micro,
    'recall_macro': recall_macro,
    'recall_weighted': recall_weighted,
    'recall_micro': recall_micro,
    'f1_macro': f1_macro,
    'f1_weighted': f1_weighted,
    'f1_micro': f1_micro
}


Fitting 5 folds for each of 24 candidates, totalling 120 fits




Best Parameters for SVM_Count_Binary_Ngram:
{'classifier__C': 0.1, 'classifier__class_weight': 'balanced', 'classifier__loss': 'squared_hinge', 'vectorizer__max_df': 0.9, 'vectorizer__min_df': 2, 'vectorizer__ngram_range': (1, 2)}
Test Metrics for SVM_Count_Binary_Ngram:
Accuracy: 0.7385031559963932
Precision - Macro: 0.6149719615790351 Weighted: 0.7186568422593567 Micro: 0.7385031559963932
Recall    - Macro: 0.5514887912714003 Weighted: 0.7385031559963932 Micro: 0.7385031559963932
F1 Score  - Macro: 0.575187325515886 Weighted: 0.7234374944469549 Micro: 0.7385031559963932


1. f1_macro calculates the F1 score for each class independently and averages them, ensuring that all classes contribute equally.


In [60]:
# Convert the model_results dictionary into a DataFrame for easy comparison
results_df = pd.DataFrame(model_results).T.reset_index().rename(columns={'index': 'Model'})

pd.set_option('display.float_format', '{:.4f}'.format)
print(results_df)

                    Model                                        best_params  \
0         SVM_Tfidf_Ngram  {'classifier__C': 1, 'classifier__loss': 'squa...   
1   LR_Count_Binary_Ngram  {'classifier__C': 0.1, 'classifier__class_weig...   
2          LR_Tfidf_Ngram  {'classifier__C': 1, 'classifier__class_weight...   
3       SVM_Tfidf_Unigram  {'classifier__C': 1, 'classifier__class_weight...   
4  SVM_Count_Binary_Ngram  {'classifier__C': 0.1, 'classifier__class_weig...   

  accuracy precision_macro precision_weighted precision_micro recall_macro  \
0   0.7448          0.6197             0.7277          0.7448       0.5674   
1   0.7124          0.5682             0.7077          0.7124       0.5619   
2   0.7322          0.6003             0.7294          0.7322       0.5978   
3   0.7277          0.5905             0.7173          0.7277       0.5665   
4   0.7385          0.6150             0.7187          0.7385       0.5515   

  recall_weighted recall_micro f1_macro f1_weighte

## Stage 2 Analysis of Model Performance Metrics

## Key Observations

- **Overall Accuracy:**
  - The **SVM_Tfidf_Ngram** pipeline achieved the highest accuracy (0.7448), suggesting it predicts the correct class more often than the others.

- **F1 Macro Score:**
  - The **LR_Tfidf_Ngram** pipeline shows the highest macro F1 score (0.5986), indicating a better balance between precision and recall across all classes.
  - The macro F1 score is important in multi-class sentiment analysis as it treats each class equally, regardless of class frequencies.

- **Precision and Recall:**
  - While **SVM_Count_Binary_Ngram** has a relatively high macro precision (0.6150), its macro recall (0.5515) is lower compared to the others. This suggests it is more conservative—fewer false positives but possibly missing more true positives.
  - **LR_Count_Binary_Ngram** and **SVM_Tfidf_Unigram** show similar performance patterns with moderate precision and recall, leading to comparable macro F1 scores (around 0.56 to 0.58).

- **Weighted Averages:**
  - The weighted metrics (which account for class frequency) are quite similar across all pipelines, with F1 weighted scores ranging from approximately 0.7095 to 0.7328. This consistency reflects the balanced nature of the dataset.
  
## Conclusion

Each model has its own strengths:
- **SVM_Tfidf_Ngram** leads in overall accuracy.
- **LR_Tfidf_Ngram** achieves the best balance across classes as reflected in its higher macro F1 score.
- The differences among the models are relatively small, so the final model selection might depend on whether you prioritize overall accuracy or balanced class performance.

This comprehensive evaluation helps in understanding the trade-offs between different pipelines, ensuring that the final choice aligns with the specific requirements of the sentiment analysis task.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------