# __Financial News Sentiment Analysis__

### Introduction

Sentiment analysis of financial news is a powerful tool for understanding market trends and investor sentiment. By analyzing statements from financial experts, it provides insights into the market’s direction without requiring individuals to manually interpret every piece of news. Accurately classifying sentiment in financial statements can help investors, analysts, and decision-makers respond to market conditions more effectively.

### Motivation

The goal of this project is to develop a machine learning model that classifies the sentiment of financial news statements as positive, neutral, or negative. This serves as both a practical application and a learning opportunity to deepen my understanding of natural language processing (NLP) and its applications in finance. Through this project, I explore different text vectorization techniques, model selection, and ensemble learning strategies to improve sentiment classification performance.

### High-Level Summary of Results

The key takeaways from this project include:

- Gaining a deeper understanding of the pros and cons of different text vectorization methods (TF-IDF, Bag of Words, and Word Embeddings).
- Building a stack ensemble model to improve sentiment classification performance.
- Comparing the ensemble model’s accuracy to individual machine learning models and evaluating statistical significance using McNemar’s test.
- Addressing and mitigating data leakage, ensuring a robust and fair model evaluation.

### Data Overview

The dataset used for training the models is a financial phrase bank, where financial statements were annotated by 16 domain experts. Each phrase was classified into sentiment categories (positive, neutral, or negative) based on the level of agreement among the annotators. The dataset is split into four different files based on the percentage of agreement: 50%, 66%, 75%, and 100%.

__Initial Approach__

My initial plan was:

- __Training Data__: Combine the 50% and 66% agreement files to maximize the number of training examples, allowing the model to learn from a diverse set of financial statements.
- __Tuning Data__: Use the 75% agreement file for hyperparameter tuning.
- __Test Data__: Use the 100% agreement file to evaluate final model performance.

However, this plan ultimately failed due to data leakage, which I will discuss in detail later. As a result, I had to pivot to a more traditional approach—removing overlapping phrases and manually splitting the data into training, tuning, and testing sets.
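
The leakage arises when the same phrase appears in more than one agreement file, so a sentence used for training can reappear at test time. A minimal sketch of an overlap check follows, assuming each line has the form `sentence@label` (the delimiter used when the files are loaded). The sentences and the helper name `parse_phrase_lines` are made up for illustration and are not taken from the dataset or the notebook.

```python
def parse_phrase_lines(lines):
    """Parse 'sentence@label' lines into a {sentence: label} dict."""
    parsed = {}
    for line in lines:
        # Split on the last '@' so an '@' inside the sentence is preserved.
        sentence, label = line.rsplit("@", 1)
        parsed[sentence.strip()] = label.strip()
    return parsed


# Hypothetical stand-ins for two agreement files (not real dataset rows).
train_lines = [
    "Profit rose compared with last year@positive",
    "The company will publish its report on Friday@neutral",
    "Sales fell sharply in the third quarter@negative",
]
test_lines = [
    "Sales fell sharply in the third quarter@negative",
    "Operating loss widened@negative",
]

train = parse_phrase_lines(train_lines)
test = parse_phrase_lines(test_lines)

# Sentences present in both sets would leak test information into training.
overlap = set(train) & set(test)
print(f"{len(overlap)} of {len(test)} test sentences also appear in training")

# Remove the leaked sentences from the training side before fitting anything.
train_clean = {s: y for s, y in train.items() if s not in overlap}
```

Deduplicating the combined data before splitting, as the notebook does, achieves the same effect in one step.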

### Methodology

1. Data Preprocessing 
    
    I split each text file on the delimiter '@', which allowed for the creation of a pandas DataFrame containing the phrase in one column and the sentiment in another. I then combined the files into one dataset and removed duplicate phrases so that no phrase could appear in more than one split. Finally, I used 80% of the data as the training set, 10% as the tuning set, and the final 10% as the test set.

2. Text Vectorization 

    Text vectorization is the process of converting words (text data) into numerical representations so that machine learning models can process and analyze them. There are three main methods of text vectorization:

    - Bag of Words (BoW)

        BoW creates a vocabulary of unique words from the text, counts how often each word appears in each document, and represents each document as a vector of word counts.

        Pros

        - Simple and easy to understand.
        - Works well for basic text classification tasks.
        
        Cons

        - Produces sparse count vectors that do not capture word meaning or importance.
        - Ignores word order and context.
    
    - Term Frequency - Inverse Document Frequency (TF-IDF)

        TF-IDF builds upon BoW by weighting words based on their importance. Frequently occurring words (e.g., “the,” “is”) are given lower weights, while rarer words receive higher scores.

        Pros

        - Reduces the impact of common words while emphasizing more meaningful ones.
        - Useful for feature selection in NLP tasks.
        
        Cons

        - Does not capture the meaning of words or the context in which they appear.
        - Weights are purely based on frequency, which may not always be optimal.
        
    
    - Word Embeddings 

        Word embeddings represent words as dense vectors in a continuous space where similar words have similar representations. Examples include Word2Vec, GloVe, and FastText.
    
        Pros

        - Preserves the semantic relationships between words.
        - More effective for capturing meaning and context.
        
        Cons

        - Requires large datasets and significant computational resources for effective training.
        - Pretrained embeddings may not always generalize well to specialized domains.

    This project uses Term Frequency - Inverse Document Frequency (TF-IDF) because it captures the importance of words without requiring a large dataset.

    3. Machine Learning Models

        The machine learning models that I used for this project were **Multinomial Naive Bayes**, **Logistic Regression**, **K Nearest Neighbors**, and **Random Forests**. 

        - Multinomial Naive Bayes (MNB)
        
            A probabilistic machine learning algorithm primarily used for text classification tasks such as spam detection, document classification, and sentiment analysis. It is a specialized form of Naive Bayes designed for discrete data such as word counts, which makes it well suited to natural language processing applications.

            - Fast and Efficient: Works well with high-dimensional text data
            - Performs well with small datasets
            - Handles word frequency well

            However, MNB assumes independence between words, something that is not always true in language. 
        
        - Logistic Regression 

            A supervised learning algorithm used for classification tasks. Despite its name, it predicts categorical outcomes rather than continuous values. It is widely applied in areas such as sentiment analysis, spam detection, and medical diagnosis.

            - Interpretable: Coefficients can explain feature importance 
            - Efficient: Works well on small-to-medium datasets
            - Handles high-dimensional data well
            - Works with TF-IDF and BoW representations

            One of the main drawbacks of logistic regression is that it can only learn linear decision boundaries, so it struggles when relationships between the features are highly non-linear.
        
        - K-Nearest Neighbors (KNN)

            A simple, non-parametric machine learning algorithm used for classification and regression. It makes predictions based on the K most similar (nearest) data points in the training set.

            - Works well for problems with multiple labels.
            - Works with different vectorization techniques.

            However, one major drawback is that text data, especially from large datasets, leads to high-dimensional feature spaces, which makes distance calculations less meaningful.
        
        - Random Forests

            An ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting. 

            - Handles high-dimensional data well.
            - Robust to overfitting.
            - Can handle imbalanced data.

            One of the main drawbacks of Random Forests in the context of text analysis is that they do not capture word meaning. Another drawback is that they are computationally expensive compared to simpler machine learning methods like Logistic Regression and MNB.
        
        - Stacked Ensemble Learning 

            An advanced machine learning technique that combines multiple base models through a meta-model to make more accurate predictions. Instead of relying on a single algorithm, stacking takes the strengths of multiple models and learns how best to weigh their predictions. 

            - Combines the strengths of multiple models, reducing bias and variance by leveraging different models' unique advantages.
            - Handles edge cases better than individual models.
            - Reduces overfitting by ensuring the final model does not overly rely on one pattern. 

            Some drawbacks of stack ensemble learning include increased computational complexity, since training multiple models leads to longer processing times, and more complex hyperparameter tuning. 
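
How stacking learns to weigh its base models can be seen in a small sketch. It uses scikit-learn's `StackingClassifier` on a synthetic dataset from `make_classification`. The data, the choice of base models, and all hyperparameters below are illustrative assumptions, not the project's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class problem standing in for vectorized text.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# The meta-model is trained on out-of-fold predictions of the base models
# (cv=5), so it never sees predictions made on data a base model was fit on.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

# One row of coefficients per class; one column per base-model class probability.
print("meta-model coefficient shape:", stack.final_estimator_.coef_.shape)
print(f"stacked test accuracy: {stack.score(X_test, y_test):.3f}")
```

The meta-model's coefficients are where the "weighing" happens: each one scales a class probability produced by one base model.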
    
### Model Evaluation Metrics 

To evaluate the performance of our models, we used precision, recall, and F1-score, with a greater emphasis on overall classification accuracy. Since the dataset has three sentiment classes (positive, neutral, negative), we also examined confusion matrices to understand model misclassifications. 

- Precision -> Out of all phrases predicted as one class, how many actually belong to that class?
- Recall -> Out of all phrases that actually belong to one class, how many were correctly predicted?
- F1-Score -> Harmonic mean of precision and recall (balances both).
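
These three definitions can be checked by hand on a small example. The sketch below computes them for one class at a time in plain Python; the function name `per_class_metrics` and the eight labels are made up for illustration and are not taken from the dataset.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treating it as 'positive'."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels for eight phrases (not taken from the dataset).
y_true = ["neg", "neg", "neg", "neg", "neu", "neu", "neu", "pos"]
y_pred = ["neg", "neu", "neu", "neg", "neu", "neu", "neu", "pos"]

for label in ("neg", "neu", "pos"):
    p, r, f = per_class_metrics(y_true, y_pred, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy example the negative class has perfect precision but only 50% recall, because two negative phrases were predicted as neutral. That is the same kind of imbalance between precision and recall that a per-class breakdown is meant to expose.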

| Model | Accuracy (Tune) | Accuracy (Test) | Precision (Neg, Neu, Pos) | Recall (Neg, Neu, Pos) | F1-Score (Neg, Neu, Pos) |
| ----- | --------------- | --------------- | --------- | ------ | -------- |
| Multinomial NB | 0.7039 | 0.6921 | (0.90, 0.70, 0.69) | (0.15, 0.98, 0.36) | (0.25, 0.81, 0.47) |
| Multinomial NB (Tuned) | 0.7391 | 0.7087 | (0.74, 0.75, 0.67) | (0.46, 0.92, 0.47) | (0.57, 0.83, 0.55) |
| Logistic Regression | 0.7723 | 0.7376 | (0.87, 0.74, 0.71) | (0.33, 0.93, 0.51) | (0.48, 0.82, 0.60) | 
| KNN | 0.7226 | 0.6818 | (0.50, 0.75, 0.57) | (0.49, 0.84, 0.43) | (0.50, 0.79, 0.49) |
| KNN (Tuned) | 0.7143 | 0.7293 | (0.67, 0.77, 0.66) | (0.43, 0.91, 0.50) | (0.52, 0.83, 0.57) |
| Random Forest | 0.7536 | 0.7748 | (0.87, 0.77, 0.77) | (0.44, 0.95, 0.56) | (0.59, 0.85, 0.65) | 
| Random Forest (Tuned) | 0.7619 | 0.7707 | (0.87, 0.76, 0.76) | (0.44, 0.94, 0.55) | (0.59, 0.84, 0.64) |
| Stack Ensemble | 0.7826 | 0.7810 | (0.76, 0.81, 0.74) | (0.52, 0.91, 0.65) | (0.62, 0.86, 0.70) |

<br>
<br>

### Results Interpretation & Discussion

The primary objective of this project was to develop a machine learning model capable of classifying financial news sentiment as positive, neutral, or negative. The results show that the stack ensemble model achieved the highest accuracy on the test set, closely followed by the general (untuned) Random Forest model. However, certain trends emerged when evaluating performance across the sentiment categories. 

- Model Performance Breakdown by Sentiment 

    Analyzing the classification report, we observe clear disparities in model performance across sentiment categories:

    - Negative Sentiment 

        >All models struggled with recall for negative sentiment, meaning they missed many negative phrases. Negative statements tend to be more nuanced, which can mislead a model into treating a phrase as neutral or positive. Consider "Despite record revenue, uncertainty looms over regulatory changes." The words "record revenue" may lead the model to classify this phrase as positive or neutral. The most accurate individual model, the general Random Forest, identified 44% of negative cases correctly, while the Stack Ensemble improved this to 52%. This increase in recall suggests that combining models like MNB (which handles word frequency well) and RF (which finds patterns in high-dimensional data) might help to mitigate misclassification. 
    
    - Neutral Sentiment

        >Every model performed best on neutral sentiment, with recall of 0.91 or higher for all but the untuned KNN (0.84). Neutral statements often contain more common, factual, and non-emotional words that are easier for models to classify, for example "The company announced its quarterly earnings report today." Even simpler models like Naive Bayes performed well, which suggests that the words associated with neutral statements are more distinct and consistent. The Stack Ensemble and RF performed similarly here, with 0.91 recall for Stack and 0.95 recall for RF. Since positive and negative phrases often drive action, correctly classifying neutral phrases is less impactful in sentiment-driven decision-making.

    - Positive Sentiment

        >Recall and F1-score were generally lower for positive sentiment than for neutral sentiment but, for most models, somewhat better than for negative sentiment. The reason for this could be that positive phrases often share characteristics with neutral phrases but can be more subtle. Consider "The company exceeded expectations but remains cautious about future performance." The words "exceeded expectations" sound positive, but "remains cautious" introduces uncertainty, making classification harder. The Stack Ensemble helps by balancing recall and precision better than individual models, boosting the F1-Score from 0.65 with the RF model to 0.70, but precision and recall remain suboptimal, indicating room for improvement. 
    
    To determine whether the Stack Ensemble's improvement over Random Forest was statistically significant, we performed McNemar's test, which evaluates whether two models have meaningfully different misclassification patterns. The resulting test statistic was 2.8 with a p-value of 0.0906, indicating no statistically significant difference between the stack ensemble model and the general RF model at the 0.05 level. While the stack ensemble model achieved slightly higher accuracy, the evidence is not strong enough to conclude that it is meaningfully better.
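
For reference, the uncorrected chi-square form of McNemar's test (the variant selected by `exact=False, correction=False` in the notebook) depends only on the two discordant counts: the statistic is `(b - c)**2 / (b + c)`, compared against a chi-square distribution with one degree of freedom. A minimal sketch using only the standard library follows. The function name `mcnemar_chi2` and the counts `b=25`, `c=34` are illustrative assumptions, not values read from the project's predictions.

```python
import math


def mcnemar_chi2(b, c):
    """McNemar's test without continuity correction.

    b: cases the first model got right and the second got wrong
    c: cases the first model got wrong and the second got right
    Returns (statistic, p_value) using a chi-square distribution with 1 d.o.f.
    """
    if b + c == 0:
        raise ValueError("No discordant pairs; the test is undefined.")
    statistic = (b - c) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Illustrative discordant counts (assumed, not read from the project's data).
stat, p = mcnemar_chi2(b=25, c=34)
print(f"statistic={stat:.4f}, p-value={p:.4f}")
```

Phrases that both models classify correctly, or that both misclassify, do not enter the statistic at all, which is why two models with similar accuracy can still be compared meaningfully.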

<br>

### Practical Implications of the Results

Since there is no statistically significant difference between the general RF model and the stack ensemble model, it would make sense to stick with the RF model, which is less computationally expensive and simpler than a stack ensemble. In a real-world financial application, these models would need further refinement. Introducing deep learning models could also improve the accuracy of sentiment analysis, but this would come with its own challenges, a major one being the need for a much larger and more diverse dataset.
        
### Limitations of the Approach 

1. Data Limitations

    The dataset was relatively small, so there may be patterns that the machine learning models did not fully learn due to a lack of examples. 

    Since the dataset was annotated by humans, some bias may have been introduced. The files containing phrases with different levels of annotator agreement were combined into one larger dataset. This step introduced more subjectivity into the overall dataset, and the lack of agreement on the 50% phrases could have confused the machine learning models.

2. Feature Representation Limitations

    The TF-IDF vectorization method used in this project does not capture word relationships, meaning, or context.

    Phrase-level sentiment is often more complex than word-level sentiment because a phrase can contain both positive and negative elements, which increases the difficulty of classification. 

    Word embeddings might have enhanced model performance by capturing the semantic meaning of phrases. However, a larger dataset would have been needed to use them properly. 

3. Modeling Limitations

    Hyperparameter tuning did not consistently improve performance and even lowered test accuracy slightly for the Random Forest (0.7748 to 0.7707). This suggests that either the default hyperparameters were already well-optimized, or the dataset size was too small for tuning to provide meaningful results. 

    No deep learning models were incorporated in this project, so the models rely only on traditional machine learning approaches and may struggle with nuanced sentiment. 

### Future Work 

1. Exploring Advanced NLP Models 

    Implement word embeddings such as Word2Vec, FastText, or GloVe instead of TF-IDF.

    Experiment with transformer-based models such as BERT, RoBERTa, or FinBERT. These models can understand context better and may perform significantly better for financial sentiment analysis tasks. 

    Test LSTMs or GRUs, which are recurrent neural networks designed to handle sequential text data.

2. Annotation Subjectivity and Bias 

    The dataset was annotated by humans, meaning there is inherent subjectivity in how phrases were classified as positive, neutral, or negative. 

    Combining the datasets with different levels of annotator agreement (50%, 66%, 75%, and 100%) likely introduced some inconsistencies, as mentioned previously, making it harder for models to learn clear distinctions between sentiment categories. 

    Specifically, phrases in the 50% agreement category may have had more ambiguous sentiment, potentially confusing the models and leading to lower performance on negative sentiment classification. 

3. Financial Domain-Specific Language 

    The dataset consists only of short financial phrases, rather than full financial reports or news articles. 

    Financial text often contains complex language, technical jargon, and context-dependent phrases, which the models may struggle to interpret correctly. 

    Expanding the dataset to include long-form financial news or earnings reports could improve model performance in real-world applications. 


### Conclusion 

The goal of this project was to teach myself to develop a machine learning model capable of accurately classifying financial news statements into positive, neutral, or negative sentiments. 

The results showed that while the Stack Ensemble model had the highest accuracy, McNemar's test revealed that its improvement over the Random Forest model was not statistically significant. Given the computational efficiency and relative simplicity of Random Forest, it would likely be the preferred model in real-world applications unless further improvements to ensemble techniques demonstrate a meaningful advantage. 

Additionally, the analysis highlighted challenges in sentiment classification, particularly in distinguishing negative statements, which were frequently misclassified as neutral or positive. The models performed best on neutral statements, likely due to their consistent linguistic patterns. Future work could focus on handling nuanced language, incorporating deep learning methods, and expanding the dataset to improve model robustness. 

Overall, this project provided valuable insights into sentiment analysis, natural language processing (NLP), and machine learning techniques, reinforcing the importance of data preprocessing, model evaluation, and statistical validation in predictive modeling. 

In [66]:
import re  # needed by preprocess_text for regex-based cleaning

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from transformers import pipeline
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from statsmodels.stats.contingency_tables import mcnemar



In [67]:
data_50 = pd.read_csv('Phrases/Sentences_50Agree.txt', delimiter='@', header=None, encoding='ISO-8859-1')
data_60 = pd.read_csv('Phrases/Sentences_66Agree.txt', delimiter='@', header=None, encoding='ISO-8859-1')
data_70 = pd.read_csv('Phrases/Sentences_75Agree.txt', delimiter='@', header=None, encoding='ISO-8859-1')
data_100 = pd.read_csv('Phrases/Sentences_AllAgree.txt', delimiter='@', header=None, encoding='ISO-8859-1')

In [68]:
def preprocess_text(df, text_column):
    """
    Cleans and preprocesses text data in a DataFrame.
    
    Steps:
    - Strips whitespace
    - Removes multiple spaces
    - Removes special characters (except alphanumeric & spaces)
    - Converts to lowercase
    - Tokenizes by splitting on spaces
    - Removes stopwords
    - Joins back into a cleaned string
    
    Args:
    df (pd.DataFrame): DataFrame containing the text data.
    text_column (str): The column name that contains text data.
    
    Returns:
    pd.DataFrame: Updated DataFrame with a new 'clean_text' column.
    """
    
    # Define stopwords
    stop_words = set(["the", "is", "in", "it", "this", "to", "and", "for", "of", "on", "at", "a", "an"])
    
    # 1. Strip whitespace
    df[text_column] = df[text_column].str.strip()
    
    # 2. Remove multiple spaces
    df[text_column] = df[text_column].str.replace(r"\s+", " ", regex=True)
    
    # 3. Remove special characters (except letters, numbers, spaces)
    df[text_column] = df[text_column].apply(lambda x: re.sub(r"[^a-zA-Z0-9\s]", "", x))
    
    # 4. Convert to lowercase
    df[text_column] = df[text_column].str.lower()
    
    # 5. Tokenize (split on spaces)
    df["tokens"] = df[text_column].apply(lambda x: x.split())
    
    # 6. Remove stopwords
    df["tokens"] = df["tokens"].apply(lambda words: [w for w in words if w not in stop_words])
    
    # 7. Join tokens back into cleaned text
    df["clean_text"] = df["tokens"].apply(lambda words: " ".join(words))
    
    return df

In [69]:
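# ignore_index=True gives the combined frame a fresh 0..n-1 index instead of repeated per-file labels.
data = pd.concat([data_50, data_60, data_70, data_100], ignore_index=True)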

In [70]:
data.rename(columns={0: 'text', 1: 'sentiment'}, inplace=True)

In [71]:
data = preprocess_text(data, 'text')

In [72]:
data = data.drop_duplicates(subset=['clean_text'])

In [73]:
le = LabelEncoder()
vectorizer = TfidfVectorizer(max_features=5000)
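
What TF-IDF adds over plain word counts can be illustrated in pure Python. This is a sketch of the textbook weighting `count * log(N / df)` on three made-up phrases. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ, but the ranking idea is the same.

```python
import math
from collections import Counter

# Three made-up phrases standing in for financial statements.
docs = [
    "profit rose and revenue rose",
    "profit fell",
    "revenue was flat",
]
tokenized = [doc.split() for doc in docs]

# Bag of Words: raw term counts per document.
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)


def tfidf(counts):
    """Textbook TF-IDF: count * log(N / df). Library variants smooth and normalize."""
    return {term: tf * math.log(n_docs / df[term]) for term, tf in counts.items()}


weights = tfidf(bow[0])
for term in sorted(weights, key=weights.get, reverse=True):
    print(f"{term:8s} count={bow[0][term]} tfidf={weights[term]:.3f}")
```

In the first phrase, "profit" and "and" both occur once, yet "profit" gets a lower weight because it also appears in another phrase.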

In [74]:
data['sentiment_encoded'] = le.fit_transform(data['sentiment'])

In [75]:
train_data, temp_data = train_test_split(data, test_size=0.2, random_state=30, stratify=data['sentiment'])
tune_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=30, stratify=temp_data['sentiment'])
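
The two-stage call above yields an 80/10/10 split: the first call holds out 20%, and the second halves that held-out portion. A small sketch on toy data makes the sizes and the effect of `stratify` visible. The class proportions here are assumed for illustration and are not the dataset's.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumed proportions, not the dataset's).
labels = ["neutral"] * 60 + ["positive"] * 25 + ["negative"] * 15
texts = [f"phrase {i}" for i in range(len(labels))]

# First cut: 80% train, 20% held out, preserving class proportions.
X_train, X_temp, y_train, y_temp = train_test_split(
    texts, labels, test_size=0.2, random_state=30, stratify=labels
)
# Second cut: split the held-out 20% evenly into tuning and test sets.
X_tune, X_test, y_tune, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=30, stratify=y_temp
)

print(len(X_train), len(X_tune), len(X_test))
print(Counter(y_train))
```

Stratifying keeps the minority classes represented in every split, which matters when one class (here, neutral) dominates.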

In [76]:
X_train_tfidf = vectorizer.fit_transform(train_data['clean_text'])
X_tune_tfidf = vectorizer.transform(tune_data['clean_text'])
X_test_tfidf = vectorizer.transform(test_data['clean_text'])
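
Fitting the vectorizer on the training split only, and merely transforming the tuning and test splits, keeps vocabulary and IDF statistics from leaking out of held-out data. A short sketch with made-up phrases shows what that means in practice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up phrases; "lawsuit" occurs only in the held-out text.
train_texts = ["profit rose strongly", "sales fell", "profit fell slightly"]
held_out_texts = ["profit fell after lawsuit"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from training text only.
X_train = vectorizer.fit_transform(train_texts)
# transform reuses that vocabulary; unseen words are silently dropped.
X_held_out = vectorizer.transform(held_out_texts)

print(sorted(vectorizer.vocabulary_))
print("lawsuit" in vectorizer.vocabulary_)
print(X_train.shape[1] == X_held_out.shape[1])
```

Words that never appeared in training contribute nothing to the held-out vectors, so the model is evaluated only on features it could have learned.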

In [77]:
y_train = train_data['sentiment']
y_tune = tune_data['sentiment']
y_test = test_data['sentiment']

In [78]:
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

In [79]:
nb_y_tune_pred = nb_model.predict(X_tune_tfidf)
nb_tune_acc = accuracy_score(y_tune, nb_y_tune_pred)

In [80]:
print(f"Naive Bayes Accuracy: {nb_tune_acc:.4f}")

Naive Bayes Accuracy: 0.7039


In [81]:
nb_y_test_pred = nb_model.predict(X_test_tfidf)
nb_test_acc = accuracy_score(y_test, nb_y_test_pred)

In [82]:
print(f"Naive Bayes Test Accuracy: {nb_test_acc:.4f}")

Naive Bayes Test Accuracy: 0.6921


In [83]:
lr_model = LogisticRegression(max_iter=200)
lr_model.fit(X_train_tfidf, y_train)

In [84]:
lr_y_tune_pred = lr_model.predict(X_tune_tfidf)
lr_tune_acc = accuracy_score(y_tune, lr_y_tune_pred)

In [85]:
print(f"Logistic Regression Accuracy: {lr_tune_acc:.4f}")

Logistic Regression Accuracy: 0.7723


In [86]:
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train_tfidf, y_train)

In [87]:
knn_y_tune_pred = knn_model.predict(X_tune_tfidf)
knn_tune_acc = accuracy_score(y_tune, knn_y_tune_pred)

In [88]:
print(f"K Nearest Neighbors Accuracy: {knn_tune_acc:.4f}")

K Nearest Neighbors Accuracy: 0.7226


In [89]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=30)
rf_model.fit(X_train_tfidf, y_train)

In [90]:
rf_y_tune_pred = rf_model.predict(X_tune_tfidf)
rf_tune_acc = accuracy_score(y_tune, rf_y_tune_pred)

In [91]:
print(f"Random Forest Accuracy: {rf_tune_acc:.4f}")

Random Forest Accuracy: 0.7536


In [92]:
rf_param_grid = {
    'n_estimators': [50, 100, 200], 
    'max_depth': [None, 10, 20, 30],  
    'min_samples_split': [2, 5, 10],  
    'min_samples_leaf': [1, 2, 4]     
}

rf_grid_search = GridSearchCV(estimator=rf_model, param_grid=rf_param_grid,
                              cv=5, scoring='accuracy', n_jobs=-1, verbose=2)
rf_grid_search.fit(X_train_tfidf, y_train)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   2.4s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   2.4s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   2.4s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   2.4s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   2.6s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   4.2s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   4.3s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   4.4s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=5, n_estimators=50; total time=   2.8s
[CV] END max_dep

In [93]:
print(f"Best RF Parameters: {rf_grid_search.best_params_}")
print(f"Best RF Accuracy: {rf_grid_search.best_score_:.4f}")


Best RF Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
Best RF Accuracy: 0.7525


In [94]:
rf_best_model = rf_grid_search.best_estimator_

In [95]:
rf_best_y_tune_pred = rf_best_model.predict(X_tune_tfidf)
rf_best_tune_acc = accuracy_score(y_tune, rf_best_y_tune_pred)

In [96]:
print(f"Hypertuned Random Forest {rf_best_tune_acc:.4f}")

Hypertuned Random Forest 0.7619


In [97]:
nb_param_grid = {
    'alpha': [0.1, 0.5, 1.0, 2.0, 5.0],  
    'fit_prior': [True, False]  
}

nb_grid_search = GridSearchCV(estimator=nb_model, param_grid=nb_param_grid,
                              cv=5, scoring='accuracy', n_jobs=-1, verbose=2)
nb_grid_search.fit(X_train_tfidf, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END ..........................alpha=0.5, fit_prior=True; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END ..........................alpha=0.5, fi

In [98]:
print(f"Best NB Parameters: {nb_grid_search.best_params_}")
print(f"Best NB Accuracy: {nb_grid_search.best_score_:.4f}")

Best NB Parameters: {'alpha': 0.1, 'fit_prior': True}
Best NB Accuracy: 0.7054


In [99]:
nb_best_model = nb_grid_search.best_estimator_

In [100]:
nb_best_y_tune_pred = nb_best_model.predict(X_tune_tfidf)
nb_best_tune_acc = accuracy_score(y_tune, nb_best_y_tune_pred)

In [101]:
print(f"Hypertuned Naive Bayes Tuning Accuracy : {nb_best_tune_acc:.4f}")

Hypertuned Naive Bayes Tuning Accuracy : 0.7391


In [102]:
knn_param_grid = {
    'n_neighbors': [3, 5, 10],  # Try different numbers of neighbors
    'metric': ['euclidean', 'cosine'],  # Test different distance metrics
    'weights': ['uniform', 'distance']  # Weighting schemes
}

In [103]:
knn_grid_search = GridSearchCV(knn_model, knn_param_grid, 
                               cv=5, scoring='accuracy', 
                               n_jobs=-1, verbose=2)

In [104]:
knn_grid_search.fit(X_train_tfidf, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END ...metric=euclidean, n_neighbors=3, weights=uniform; total time=   0.2s
[CV] END ...metric=euclidean, n_neighbors=3, weights=uniform; total time=   0.2s
[CV] END ...metric=euclidean, n_neighbors=3, weights=uniform; total time=   0.2s
[CV] END ..metric=euclidean, n_neighbors=3, weights=distance; total time=   0.1s
[CV] END ..metric=euclidean, n_neighbors=3, weights=distance; total time=   0.1s
[CV] END ..metric=euclidean, n_neighbors=3, weights=distance; total time=   0.1s
[CV] END ...metric=euclidean, n_neighbors=3, weights=uniform; total time=   0.2s


[CV] END ...metric=euclidean, n_neighbors=3, weights=uniform; total time=   0.2s
[CV] END ...metric=euclidean, n_neighbors=5, weights=uniform; total time=   0.1s
[CV] END ..metric=euclidean, n_neighbors=3, weights=distance; total time=   0.1s
[CV] END ..metric=euclidean, n_neighbors=5, weights=distance; total time=   0.1s
[CV] END ..metric=euclidean, n_neighbors=3, weights=distance; total time=   0.1s
[CV] END ...metric=euclidean, n_neighbors=5, weights=uniform; total time=   0.1s
[CV] END ...metric=euclidean, n_neighbors=5, weights=uniform; total time=   0.1s
[CV] END ...metric=euclidean, n_neighbors=5, weights=uniform; total time=   0.1s
[CV] END ...metric=euclidean, n_neighbors=5, weights=uniform; total time=   0.2s
[CV] END ..metric=euclidean, n_neighbors=5, weights=distance; total time=   0.1s
[CV] END ..metric=euclidean, n_neighbors=5, weights=distance; total time=   0.2s
[CV] END ..metric=euclidean, n_neighbors=10, weights=uniform; total time=   0.2s
[CV] END .metric=euclidean, 

In [105]:
print("Best KNN Parameters:", knn_grid_search.best_params_)

Best KNN Parameters: {'metric': 'cosine', 'n_neighbors': 10, 'weights': 'distance'}


In [106]:
knn_best_model = knn_grid_search.best_estimator_

In [107]:
knn_best_y_tune_pred = knn_best_model.predict(X_tune_tfidf)
knn_best_tune_acc = accuracy_score(y_tune, knn_best_y_tune_pred)

In [108]:
print(f"K Nearest Neighbors Tuning Accuracy : {knn_best_tune_acc:.4f}")

K Nearest Neighbors Tuning Accuracy : 0.7143


In [109]:
nb_best_y_test_pred = nb_best_model.predict(X_test_tfidf)
nb_best_test_acc = accuracy_score(y_test, nb_best_y_test_pred)

In [110]:
print(f"Hypertuned Naive Bayes Test Accuracy : {nb_best_test_acc:.4f}")

Hypertuned Naive Bayes Test Accuracy : 0.7087


In [111]:
lr_y_test_pred = lr_model.predict(X_test_tfidf)
lr_test_acc = accuracy_score(y_test, lr_y_test_pred)

In [112]:
print(f"Logistic Test Accuracy : {lr_test_acc:.4f}")

Logistic Test Accuracy : 0.7376


In [113]:
rf_best_y_test_pred = rf_best_model.predict(X_test_tfidf)
rf_best_test_acc = accuracy_score(y_test, rf_best_y_test_pred)

In [114]:
print(f"Hypertuned Random Forest Test Accuracy : {rf_best_test_acc:.4f}")

Hypertuned Random Forest Test Accuracy : 0.7707


In [115]:
knn_best_y_test_pred = knn_best_model.predict(X_test_tfidf)
knn_best_test_acc = accuracy_score(y_test, knn_best_y_test_pred)

In [116]:
print(f"Hypertuned K Nearest Neighbors Test Accuracy : {knn_best_test_acc:.4f}")

Hypertuned K Nearest Neighbors Test Accuracy : 0.7293


In [117]:
base_models = [
    ('nb', MultinomialNB(**nb_grid_search.best_params_)),
    ('rf', RandomForestClassifier(**rf_grid_search.best_params_)),
    ('knn', KNeighborsClassifier(**knn_grid_search.best_params_)),
]

In [118]:
meta_model = LogisticRegression()

In [119]:
stacking_model = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,  
    passthrough=False, 
    n_jobs=-1
)

In [120]:
stacking_model.fit(X_train_tfidf, y_train)

In [121]:
stacking_accuracy = stacking_model.score(X_tune_tfidf, y_tune)
print(f"Stacking Model Accuracy: {stacking_accuracy:.4f}")

Stacking Model Accuracy: 0.7764


In [122]:
# score() returns accuracy, so name the variable accordingly.
stacking_test_acc = stacking_model.score(X_test_tfidf, y_test)
print(f"Stacking Model Test Accuracy: {stacking_test_acc:.4f}")

Stacking Model Test Accuracy: 0.7769


In [123]:
rf_preds = rf_model.predict(X_test_tfidf)  
stacking_preds = stacking_model.predict(X_test_tfidf) 

In [124]:
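# Build the 2x2 contingency table for McNemar's test on the test set.
A = np.sum((rf_preds == y_test) & (stacking_preds == y_test))  # both models correct
B = np.sum((rf_preds == y_test) & (stacking_preds != y_test))  # only Random Forest correct
C = np.sum((rf_preds != y_test) & (stacking_preds == y_test))  # only stacking correct
D = np.sum((rf_preds != y_test) & (stacking_preds != y_test))  # both models wrong

# Only the discordant cells B and C affect the test statistic.
table = np.array([[A, B], [C, D]])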

In [125]:
result = mcnemar(table, exact=False, correction=False)
print(f"McNemar’s test statistic: {result.statistic}")
print(f"McNemar’s test p-value: {result.pvalue}")

if result.pvalue < 0.05:
    print("There is a statistically significant difference between the models.")
else:
    print("No statistically significant difference between the models.")

McNemar’s test statistic: 1.3728813559322033
McNemar’s test p-value: 0.2413174431838346
No statistically significant difference between the models.
