# __Financial News Sentiment Analysis__

In [104]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk 
from transformers import pipeline
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from scipy.stats import chi2_contingency
from statsmodels.stats.contingency_tables import mcnemar
import os


>> The data used to train this model is a financial phrase bank that includes a total of 4,840 sentences. The sentences were annotated by 16 people with adequate background knowledge of financial markets. The data comes as four files, split by the strength of majority agreement: phrases on which at least 50%, 66%, 75%, and 100% of the annotators agreed, respectively. Given this breakdown, the initial plan is as follows: we combine the 50% and 66% agreement files and use the combined file for model training, we use the 75% agreement file to tune the model, and finally we test the model on the file where all annotators agreed on the sentiment.

In [2]:
data_50 = pd.read_csv('Phrases/Sentences_50Agree.txt', delimiter='@', header=None, encoding='ISO-8859-1')
data_60 = pd.read_csv('Phrases/Sentences_66Agree.txt', delimiter='@', header=None, encoding='ISO-8859-1')
data_70 = pd.read_csv('Phrases/Sentences_75Agree.txt', delimiter='@', header=None, encoding='ISO-8859-1')
data_100 = pd.read_csv('Phrases/Sentences_AllAgree.txt', delimiter='@', header=None, encoding='ISO-8859-1')

In [3]:
data_50.head()

Unnamed: 0,0,1
0,"According to Gran , the company has no plans t...",neutral
1,Technopolis plans to develop in stages an area...,neutral
2,The international electronic industry company ...,negative
3,With the new production plant the company woul...,positive
4,According to the company 's updated strategy f...,positive


In [4]:
data_50_60 = pd.concat([data_50, data_60])

In [5]:
data_50_60.columns

Index([0, 1], dtype='int64')

In [6]:
data_50_60.rename(columns={0: 'text', 1: 'sentiment'}, inplace=True)
data_70.rename(columns={0: 'text', 1: 'sentiment'}, inplace=True)
data_100.rename(columns={0: 'text', 1: 'sentiment'}, inplace=True)

In [7]:
data_50_60["text"] = data_50_60["text"].astype(str)

>> Next we define a function that strips leading and trailing whitespace, collapses repeated spaces, removes special characters (everything except letters, digits, and spaces), lowercases the text, tokenizes by splitting on spaces, and removes stopwords. The cleaned tokens are then joined back into a string and stored in a new `clean_text` column of the dataframe.

In [8]:
def preprocess_text(df, text_column):
    """
    Cleans and preprocesses text data in a DataFrame.
    
    Steps:
    - Strips whitespace
    - Removes multiple spaces
    - Removes special characters (except alphanumeric & spaces)
    - Converts to lowercase
    - Tokenizes by splitting on spaces
    - Removes stopwords
    - Joins back into a cleaned string
    
    Args:
    df (pd.DataFrame): DataFrame containing the text data.
    text_column (str): The column name that contains text data.
    
    Returns:
    pd.DataFrame: Updated DataFrame with a new 'clean_text' column.
    """
    
    # Define stopwords
    stop_words = set(["the", "is", "in", "it", "this", "to", "and", "for", "of", "on", "at", "a", "an"])
    
    # 1. Strip whitespace
    df[text_column] = df[text_column].str.strip()
    
    # 2. Remove multiple spaces
    df[text_column] = df[text_column].str.replace(r"\s+", " ", regex=True)
    
    # 3. Remove special characters (except letters, numbers, spaces)
    df[text_column] = df[text_column].apply(lambda x: re.sub(r"[^a-zA-Z0-9\s]", "", x))
    
    # 4. Convert to lowercase
    df[text_column] = df[text_column].str.lower()
    
    # 5. Tokenize (split on spaces)
    df["tokens"] = df[text_column].apply(lambda x: x.split())
    
    # 6. Remove stopwords
    df["tokens"] = df["tokens"].apply(lambda words: [w for w in words if w not in stop_words])
    
    # 7. Join tokens back into cleaned text
    df["clean_text"] = df["tokens"].apply(lambda words: " ".join(words))
    
    return df

In [9]:
data_50_60 = preprocess_text(data_50_60, "text")
data_70 = preprocess_text(data_70, "text")
data_100 = preprocess_text(data_100, "text")

In [10]:
data_50_60.head()

Unnamed: 0,text,sentiment,tokens,clean_text
0,according to gran the company has no plans to...,neutral,"[according, gran, company, has, no, plans, mov...",according gran company has no plans move all p...
1,technopolis plans to develop in stages an area...,neutral,"[technopolis, plans, develop, stages, area, no...",technopolis plans develop stages area no less ...
2,the international electronic industry company ...,negative,"[international, electronic, industry, company,...",international electronic industry company elco...
3,with the new production plant the company woul...,positive,"[with, new, production, plant, company, would,...",with new production plant company would increa...
4,according to the company s updated strategy fo...,positive,"[according, company, s, updated, strategy, yea...",according company s updated strategy years 200...


>> Next we convert the `sentiment` column from string labels (`neutral`, `negative`, `positive`) to numeric values so that the machine learning models can process them. We use scikit-learn's `LabelEncoder`, whose `fit_transform` sorts the labels alphabetically and therefore maps `negative`, `neutral`, and `positive` to 0, 1, and 2, respectively.

In [11]:
le = LabelEncoder()

In [12]:
# Fit the encoder once so that all three datasets share the same label-to-integer mapping.
data_50_60['sentiment_encoded'] = le.fit_transform(data_50_60['sentiment'])
data_70['sentiment_encoded'] = le.transform(data_70['sentiment'])
data_100['sentiment_encoded'] = le.transform(data_100['sentiment'])
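
>> As a quick check of the label mapping described above, the sketch below encodes a few toy labels (illustrative values, not rows from the phrase bank) and reads the mapping back from the encoder's `classes_` attribute.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'sentiment' column (illustrative only).
labels = ["neutral", "negative", "positive", "neutral"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; the position of each label is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)           # {'negative': 0, 'neutral': 1, 'positive': 2}
print(encoded.tolist())  # [1, 0, 2, 1]
```

>> Because the codes depend only on the sorted label names, any dataset that contains the same three labels is encoded consistently.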

>> Bag of Words (BoW)
>>> Creates a vocabulary of unique words from the text, counting how often each word appears in each document, and then representing each document as a vector of word counts.

>>> Pros
>>>> Simple to understand and easy to implement.

>>> Cons
>>>> Produces sparse count vectors that ignore word order and fail to capture the meaning or relative importance of words.


>> Term Frequency-Inverse Document Frequency (TF-IDF)
>>> Builds upon BoW by weighting words based on their importance. Words that occur in many documents (e.g. "the", "is") are given lower weights, while rarer words receive higher scores.

>>> Pros
>>>> Reduces the impact of common words while emphasizing more distinctive ones.

>>> Cons
>>>> Still does not capture the meaning of a word or the context in which it appears.


>> Word Embeddings
>>> Represents words as dense vectors in a continuous vector space, where similar words have similar vector representations.

>>> Pros
>>>> Preserves the semantic relationships between words.

>>> Cons
>>>> Requires large datasets and significant computational resources for effective training.

>> For this project, we use TF-IDF vectorization because we have limited data. Each word receives a weight based on how often it appears in a given document and how rare it is across the corpus, but the representation carries no meaning or context.

>> We create the vectorizer and set `max_features`, which caps the number of unique terms the vectorizer keeps when converting text into numerical form; only the most frequent terms across the corpus are retained. This reduces memory usage, drops very rare terms, and improves model efficiency, since too many features slow down training.

In [13]:
vectorizer = TfidfVectorizer(max_features = 5000)

>> As mentioned previously, we use the combined 50% and 66% agreement datasets to train the model. Hence, we call ```fit_transform``` on them, which lets the TfidfVectorizer learn the vocabulary and the word weights. This does not train the ML model yet; it only creates a numerical representation of the text.

>> For the tuning and testing datasets (75% agreement and 100% agreement), we use ```transform``` instead of ```fit_transform```. This distinction is important. The vectorizer has already learned the vocabulary from the training data, and it now applies those same rules to the tuning and testing datasets without learning anything new. If we were to fit the vectorizer on the tuning and testing datasets as well, there would be data leakage, in which the tuning and test data influence the training. Using ```transform``` also ensures consistency: the same vocabulary and word weights are used for all datasets. Otherwise, the tuning and test data could introduce new words and shift the weights, which could inflate the measured performance.


In [14]:
X_tfidf_train, y_train = vectorizer.fit_transform(data_50_60['clean_text']), data_50_60['sentiment_encoded']
X_tfidf_tune, y_tune = vectorizer.transform(data_70['clean_text']), data_70['sentiment_encoded']
X_tfidf_test, y_test = vectorizer.transform(data_100['clean_text']), data_100['sentiment_encoded']


In [15]:
print("Vocabulary Size:", len(vectorizer.get_feature_names_out()))
print("Selected Features:", vectorizer.get_feature_names_out())

Vocabulary Size: 5000
Selected Features: ['00' '000' '001' ... 'zinc' 'zinclead' 'zone']


In [16]:
print("TF-IDF Train Matrix Shape:", X_tfidf_train.shape)
print("TF-IDF Tune Matrix Shape:", X_tfidf_tune.shape)
print("TF-IDF Test Matrix Shape:", X_tfidf_test.shape)

TF-IDF Train Matrix Shape: (9063, 5000)
TF-IDF Tune Matrix Shape: (3453, 5000)
TF-IDF Test Matrix Shape: (2264, 5000)


>> Normally, we would have to split the data into a training set and a test set. However, as mentioned previously, we already have different sets of data that correspond to the training, tuning, and test sets. Thus, we move directly on to training the model.

>> For the training portion of this project, we will be using `Multinomial Naive Bayes`, `Logistic Regression`, `K Nearest Neighbors`, and `Random Forests` to first identify which of these individual models works best at predicting the tuning data sentiment.

In [17]:
nb_model = MultinomialNB()
lr_model = LogisticRegression(max_iter=1000)
knn_model = KNeighborsClassifier(n_neighbors=5)
rf_model = RandomForestClassifier(n_estimators=100, random_state=30)


In [18]:
nb_model.fit(X_tfidf_train, y_train)
lr_model.fit(X_tfidf_train, y_train)
knn_model.fit(X_tfidf_train, y_train)
rf_model.fit(X_tfidf_train, y_train)

In [19]:
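def model_accuracy(model, X, y):
    """Return the accuracy of an already fitted model on the given features and labels."""
    return accuracy_score(y, model.predict(X))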

In [20]:
nb_y_tune_pred = nb_model.predict(X_tfidf_tune)
nb_tuning_accuracy = accuracy_score(y_tune, nb_y_tune_pred)
print(f"Accuracy: {nb_tuning_accuracy:.4f}")

Accuracy: 0.8900


In [21]:
lr_y_tune_pred = lr_model.predict(X_tfidf_tune)
lr_tuning_accuracy = accuracy_score(y_tune, lr_y_tune_pred)
print(f"Accuracy: {lr_tuning_accuracy:.4f}")

Accuracy: 0.9580


In [22]:
knn_y_tune_pred = knn_model.predict(X_tfidf_tune)
knn_tuning_accuracy = accuracy_score(y_tune, knn_y_tune_pred)
print(f"Accuracy: {knn_tuning_accuracy:.4f}")

Accuracy: 0.8781


In [23]:
rf_y_tune_pred = rf_model.predict(X_tfidf_tune)
rf_tuning_accuracy = accuracy_score(y_tune, rf_y_tune_pred)
print(f"Accuracy: {rf_tuning_accuracy:.4f}")

Accuracy: 1.0000


In [24]:
rf_y_test_pred = rf_model.predict(X_tfidf_test)
rf_testing_accuracy = accuracy_score(y_test, rf_y_test_pred)
print(f"Accuracy: {rf_testing_accuracy:.4f}")

Accuracy: 1.0000


>> __Data Leakage Detected: 100% Accuracy is a Red Flag__

>>> During model evaluation, we discovered a major issue: the __Random Forest classifier achieved 100% accuracy__ on both the tuning and the testing datasets. This is a huge red flag that indicates data leakage: the model is memorizing the phrases rather than learning general patterns.

>> __Evidence of Data Overlap__
>>> To investigate, we checked for overlapping phrases across the datasets. The results were staggering:
>- __Train vs. Tune Overlap__: 3,446 phrases
>- __Train vs. Test Overlap__: 2,258 phrases
>- __Tune vs. Test Overlap__: 2,258 phrases

>> This means that nearly every phrase in the tuning set (3,453 rows) and the test set (2,264 rows) already exists in the training set, leading to misleadingly high accuracy scores.

>> __Why This is a Big Problem__
>- No true performance measurement -> The model is just matching phrases instead of learning patterns.
>- Overfitting to seen data -> If deployed, the model will fail on truly unseen financial news.
>- False confidence in model performance -> The reported high accuracy is meaningless in real-world applications.

>> __Solution__
>>> Since our initial plan (using the 50% and 66% files for training, the 75% file for tuning, and the 100% file for testing) led to data leakage, we are pivoting to a better approach.
>1. Combine all datasets into a single unified dataset
>2. Remove all duplicate phrases
>3. Perform a proper stratified split (80% train, 20% held out)
>4. Split the held-out 20% evenly into a tuning set and a test set

>>> This ensures that the test set only contains truly unseen phrases, allowing us to measure real-world model performance.

In [25]:
print("Train-Tune Overlap:", len(set(data_50_60['clean_text']) & set(data_70['clean_text'])))
print("Train-Test Overlap:", len(set(data_50_60['clean_text']) & set(data_100['clean_text'])))
print("Tune-Test Overlap:", len(set(data_70['clean_text']) & set(data_100['clean_text'])))

Train-Tune Overlap: 3446
Train-Test Overlap: 2258
Tune-Test Overlap: 2258


>> After combining the three datasets, removing duplicate phrases, and re-splitting the data, we checked for duplicate phrases across the new training, tuning, and test sets. The results confirm that there is no overlap between them, so data leakage due to identical inputs is no longer a concern. With this issue resolved, we can now proceed to the modeling phase.

In [26]:
data = pd.concat([data_100, data_50_60, data_70])
data = data.drop_duplicates(subset=['clean_text'])

In [27]:
train_data, temp_data = train_test_split(data, test_size=0.2, random_state=30, stratify=data['sentiment'])
tune_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=30, stratify=temp_data['sentiment'])

In [28]:
print(f"Train Size: {len(train_data)}")
print(f"Tune Size: {len(tune_data)}")
print(f"Test Size: {len(test_data)}")

# Check for new overlap
overlap_train_tune = set(train_data['clean_text']).intersection(set(tune_data['clean_text']))
overlap_train_test = set(train_data['clean_text']).intersection(set(test_data['clean_text']))
overlap_tune_test = set(tune_data['clean_text']).intersection(set(test_data['clean_text']))

print(f"Overlap between Train and Tune: {len(overlap_train_tune)}")
print(f"Overlap between Train and Test: {len(overlap_train_test)}")
print(f"Overlap between Tune and Test: {len(overlap_tune_test)}")

Train Size: 3866
Tune Size: 483
Test Size: 484
Overlap between Train and Tune: 0
Overlap between Train and Test: 0
Overlap between Tune and Test: 0


>> Now that we have the new and improved datasets, we can circle back to what we were working on before encountering the data leakage issue: vectorization.

In [29]:
vectorizer = TfidfVectorizer(max_features=5000)

In [30]:
X_train_tfidf = vectorizer.fit_transform(train_data['clean_text'])
X_tune_tfidf = vectorizer.transform(tune_data['clean_text'])
X_test_tfidf = vectorizer.transform(test_data['clean_text'])

In [31]:
y_train = train_data['sentiment']
y_tune = tune_data['sentiment']
y_test = test_data['sentiment']

>> We move on to training multiple models and comparing their accuracy on the tuning set.

| __Model__ | __Accuracy__ |
| --------- | ------------ |
| Naive Bayes | 0.6832 |
| Logistic Regression | 0.7350 |
| K Nearest Neighbors | 0.6936 |
| Random Forest | 0.7495 |

<br>
>> We observe that Random Forest performs best in terms of accuracy on the tuning set. The next steps are to tune the hyperparameters and build a stacked ensemble, before finally running the model on the test set to see how it performs.

In [32]:
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

In [33]:
nb_y_tune_pred = nb_model.predict(X_tune_tfidf)
nb_tune_acc = accuracy_score(y_tune, nb_y_tune_pred)

In [34]:
print(f"Naive Bayes Accuracy: {nb_tune_acc:.4f}")

Naive Bayes Accuracy: 0.6832


In [35]:
lr_model = LogisticRegression(max_iter=200)
lr_model.fit(X_train_tfidf, y_train)

In [36]:
lr_y_tune_pred = lr_model.predict(X_tune_tfidf)
lr_tune_acc = accuracy_score(y_tune, lr_y_tune_pred)

In [37]:
print(f"Logistic Regression Accuracy: {lr_tune_acc:.4f}")

Logistic Regression Accuracy: 0.7350


In [38]:
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train_tfidf, y_train)

In [39]:
knn_y_tune_pred = knn_model.predict(X_tune_tfidf)
knn_tune_acc = accuracy_score(y_tune, knn_y_tune_pred)

In [40]:
print(f"K Nearest Neighbors Accuracy: {knn_tune_acc:.4f}")

K Nearest Neighbors Accuracy: 0.6936


In [41]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=30)
rf_model.fit(X_train_tfidf, y_train)

In [42]:
rf_y_tune_pred = rf_model.predict(X_tune_tfidf)
rf_tune_acc = accuracy_score(y_tune, rf_y_tune_pred)

In [43]:
print(f"Random Forest Accuracy: {rf_tune_acc:.4f}")

Random Forest Accuracy: 0.7495


>> After tuning the Random Forest hyperparameters with the grid search below, the best cross-validated accuracy on the training set is 0.7483, and the resulting model reaches an accuracy of 0.7516 on the tuning set. This is only slightly higher than the 0.7495 of the untuned Random Forest, a gain of one additional correctly classified phrase out of 483. The parameters that resulted in the highest accuracy were `{'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}`

In [44]:
rf_param_grid = {
    'n_estimators': [50, 100, 200], 
    'max_depth': [None, 10, 20, 30],  
    'min_samples_split': [2, 5, 10],  
    'min_samples_leaf': [1, 2, 4]     
}

rf_grid_search = GridSearchCV(estimator=rf_model, param_grid=rf_param_grid,
                              cv=5, scoring='accuracy', n_jobs=-1, verbose=2)
rf_grid_search.fit(X_train_tfidf, y_train)


Fitting 5 folds for each of 108 candidates, totalling 540 fits


[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.3s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.4s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.3s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.4s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.4s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   3.1s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   3.1s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   3.2s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   3.5s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators

In [45]:
print(f"Best RF Parameters: {rf_grid_search.best_params_}")
print(f"Best RF Accuracy: {rf_grid_search.best_score_:.4f}")


Best RF Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
Best RF Accuracy: 0.7483


In [46]:
rf_best_model = rf_grid_search.best_estimator_

In [47]:
rf_best_y_tune_pred = rf_best_model.predict(X_tune_tfidf)
rf_best_tune_acc = accuracy_score(y_tune, rf_best_y_tune_pred)

In [48]:
print(f"Hypertuned Random Forest {rf_best_tune_acc:.4f}")

Hypertuned Random Forest 0.7516


>> We move on to tuning the hyperparameters of the Multinomial Naive Bayes model.

>> For the Multinomial Naive Bayes model, the tuning-set accuracy was 0.6832 before hyperparameter tuning and 0.7184 afterwards. The parameters that resulted in the highest accuracy were `{'alpha': 0.1, 'fit_prior': True}`

In [49]:
nb_param_grid = {
    'alpha': [0.1, 0.5, 1.0, 2.0, 5.0],  
    'fit_prior': [True, False]  
}

nb_grid_search = GridSearchCV(estimator=nb_model, param_grid=nb_param_grid,
                              cv=5, scoring='accuracy', n_jobs=-1, verbose=2)
nb_grid_search.fit(X_train_tfidf, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END ..........................alpha=0.5, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.5, fi

In [50]:
print(f"Best NB Parameters: {nb_grid_search.best_params_}")
print(f"Best NB Accuracy: {nb_grid_search.best_score_:.4f}")

Best NB Parameters: {'alpha': 0.1, 'fit_prior': True}
Best NB Accuracy: 0.7116


In [51]:
nb_best_model = nb_grid_search.best_estimator_

In [52]:
nb_best_y_tune_pred = nb_best_model.predict(X_tune_tfidf)
nb_best_tune_acc = accuracy_score(y_tune, nb_best_y_tune_pred)

In [53]:
print(f"Hypertuned Naive Bayes Accuracy : {nb_best_tune_acc:.4f}")

Hypertuned Naive Bayes Accuracy : 0.7184


In [54]:
knn_param_grid = {
    'n_neighbors': [3, 5, 10],  # Try different numbers of neighbors
    'metric': ['euclidean', 'cosine'],  # Test different distance metrics
    'weights': ['uniform', 'distance']  # Weighting schemes
}

In [55]:
knn_grid_search = GridSearchCV(knn_model, knn_param_grid, 
                               cv=5, scoring='accuracy', 
                               n_jobs=-1, verbose=2)

In [56]:
knn_grid_search.fit(X_train_tfidf, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END ..metric=euclidean, n_neighbors=3, weights=distance; total time=   0.1s
[CV] END ...metric=euclidean, n_neighbors=3, weights=uniform; total time=   0.2s
[CV] END ...metric=euclidean, n_neighbors=3, weights=uniform; total time=   0.2s
[CV] END ...metric=euclidean, n_neighbors=3, weights=uniform; total time=   0.2s
[CV] END ...metric=euclidean, n_neighbors=3, weights=uniform; total time=   0.2s
[CV] END ...metric=euclidean, n_neighbors=3, weights=uniform; total time=   0.2s
[CV] END ..metric=euclidean, n_neighbors=3, weights=distance; total time=   0.1s
[CV] END ..metric=euclidean, n_neighbors=3, weights=distance; total time=   0.2s
[CV] END ...metric=euclidean, n_neighbors=5, weights=uniform; total time=   0.1s
[CV] END ..metric=euclidean, n_neighbors=5, weights=distance; total time=   0.1s
[CV] END ..metric=euclidean, n_neighbors=3, weights=distance; total time=   0.1s
[CV] END ..metric=euclidean, n_neighbors=3, weig

[CV] END .....metric=cosine, n_neighbors=3, weights=distance; total time=   0.1s
[CV] END .....metric=cosine, n_neighbors=3, weights=distance; total time=   0.1s
[CV] END ......metric=cosine, n_neighbors=3, weights=uniform; total time=   0.1s
[CV] END ......metric=cosine, n_neighbors=3, weights=uniform; total time=   0.2s
[CV] END ......metric=cosine, n_neighbors=5, weights=uniform; total time=   0.1s
[CV] END ......metric=cosine, n_neighbors=3, weights=uniform; total time=   0.2s
[CV] END ......metric=cosine, n_neighbors=5, weights=uniform; total time=   0.1s
[CV] END ......metric=cosine, n_neighbors=5, weights=uniform; total time=   0.1s
[CV] END .....metric=cosine, n_neighbors=5, weights=distance; total time=   0.1s
[CV] END .....metric=cosine, n_neighbors=3, weights=distance; total time=   0.1s
[CV] END .....metric=cosine, n_neighbors=3, weights=distance; total time=   0.1s
[CV] END .....metric=cosine, n_neighbors=5, weights=distance; total time=   0.1s
[CV] END ......metric=cosine

In [57]:
print("Best KNN Parameters:", knn_grid_search.best_params_)

Best KNN Parameters: {'metric': 'cosine', 'n_neighbors': 10, 'weights': 'distance'}


In [58]:
knn_best_model = knn_grid_search.best_estimator_

In [75]:
knn_best_y_tune_pred = knn_best_model.predict(X_tune_tfidf)
knn_best_tune_acc = accuracy_score(y_tune, knn_best_y_tune_pred)

print(knn_best_tune_acc)

0.7267080745341615


In [60]:
print(f"Hypertuned K Nearest Neighbors Accuracy: {knn_best_tune_acc:.4f}")

Hypertuned K Nearest Neighbors Accuracy: 0.7267


| Model | Tuning accuracy (before tuning) | Tuning accuracy (after tuning) |
| ---------------- | ------------ | ------------ |
| Naive Bayes | 0.6832 | 0.7184 |
| Random Forest | 0.7495 | 0.7516 |
| K Nearest Neighbors | 0.6936 | 0.7267 |

>> We now move on to creating a stacked ensemble model.

In [61]:
base_models = [
    ('nb', MultinomialNB(**nb_grid_search.best_params_)),
    ('rf', RandomForestClassifier(**rf_grid_search.best_params_)),
    ('knn', KNeighborsClassifier(**knn_grid_search.best_params_)),
]

In [79]:
nb_best_test_pred = nb_best_model.predict(X_test_tfidf)
nb_test_pred = nb_model.predict(X_test_tfidf)
nb_best_test_acc = accuracy_score(y_test, nb_best_test_pred)
nb_test_acc = accuracy_score(y_test, nb_test_pred)

In [80]:
print(nb_best_test_acc)
print(nb_test_acc)

0.737603305785124
0.7004132231404959


In [92]:
print(classification_report(y_test, nb_best_test_pred))

              precision    recall  f1-score   support

    negative       0.74      0.46      0.57        61
     neutral       0.75      0.92      0.83       287
    positive       0.67      0.47      0.55       136

    accuracy                           0.74       484
   macro avg       0.72      0.62      0.65       484
weighted avg       0.73      0.74      0.72       484



In [91]:
print(classification_report(y_test, nb_test_pred))

              precision    recall  f1-score   support

    negative       0.90      0.15      0.25        61
     neutral       0.70      0.98      0.81       287
    positive       0.69      0.36      0.47       136

    accuracy                           0.70       484
   macro avg       0.76      0.50      0.51       484
weighted avg       0.72      0.70      0.65       484



In [81]:
lr_model_test = lr_model.predict(X_test_tfidf)
lr_model_test_acc = accuracy_score(y_test, lr_model_test)

In [82]:
print(lr_model_test_acc)

0.737603305785124


In [93]:
print(classification_report(y_test, lr_model_test))

              precision    recall  f1-score   support

    negative       0.87      0.33      0.48        61
     neutral       0.74      0.93      0.82       287
    positive       0.71      0.51      0.60       136

    accuracy                           0.74       484
   macro avg       0.77      0.59      0.63       484
weighted avg       0.75      0.74      0.72       484



In [84]:
knn_best_model_test = knn_best_model.predict(X_test_tfidf)
knn_model_test = knn_model.predict(X_test_tfidf)
knn_best_acc = accuracy_score(y_test, knn_best_model_test)
knn_test_acc = accuracy_score(y_test, knn_model_test)

In [85]:
print(knn_best_acc)
print(knn_test_acc)

0.7355371900826446
0.6818181818181818


In [96]:
print(classification_report(y_test, knn_model_test))

              precision    recall  f1-score   support

    negative       0.50      0.49      0.50        61
     neutral       0.75      0.84      0.79       287
    positive       0.57      0.43      0.49       136

    accuracy                           0.68       484
   macro avg       0.61      0.59      0.59       484
weighted avg       0.67      0.68      0.67       484



In [95]:
print(classification_report(y_test, knn_best_model_test))

              precision    recall  f1-score   support

    negative       0.67      0.43      0.52        61
     neutral       0.77      0.91      0.83       287
    positive       0.66      0.50      0.57       136

    accuracy                           0.74       484
   macro avg       0.70      0.61      0.64       484
weighted avg       0.72      0.74      0.72       484



In [86]:
rf_best_model_test = rf_best_model.predict(X_test_tfidf)
rf_test = rf_model.predict(X_test_tfidf)
rf_best_acc = accuracy_score(y_test, rf_best_model_test)
rf_acc = accuracy_score(y_test, rf_test)


In [87]:
print(rf_best_acc)
print(rf_acc)

0.768595041322314
0.7747933884297521


In [98]:
print(classification_report(y_test, rf_test))

              precision    recall  f1-score   support

    negative       0.87      0.44      0.59        61
     neutral       0.77      0.95      0.85       287
    positive       0.77      0.56      0.65       136

    accuracy                           0.77       484
   macro avg       0.80      0.65      0.69       484
weighted avg       0.78      0.77      0.76       484



In [97]:
print(classification_report(y_test, rf_best_model_test))

              precision    recall  f1-score   support

    negative       0.87      0.44      0.59        61
     neutral       0.76      0.94      0.84       287
    positive       0.76      0.55      0.64       136

    accuracy                           0.77       484
   macro avg       0.80      0.64      0.69       484
weighted avg       0.77      0.77      0.75       484



In [62]:
meta_model = LogisticRegression()

In [63]:
stacking_model = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,  
    passthrough=False, 
    n_jobs=-1
)

In [64]:
stacking_model.fit(X_train_tfidf, y_train)

In [65]:
stacking_accuracy = stacking_model.score(X_tune_tfidf, y_tune)
print(f"Stacking Model Accuracy: {stacking_accuracy:.4f}")

Stacking Model Accuracy: 0.7764


| Model | Tuning accuracy (after tuning) |
| ---------------- | ------------ |
| Naive Bayes | 0.7184 |
| Random Forest | 0.7516 |
| K Nearest Neighbors | 0.7267 |
| Stacked Ensemble | 0.7764 |

<br>

>> We observe that the stacked ensemble has a prediction accuracy of 77.6% on the tuning set, an improvement over each of the individual models. To test whether this improvement is statistically significant, we run McNemar's test on the test-set predictions of the Random Forest (`rf_model`) and the stacked ensemble. McNemar's test compares two classification models evaluated on the same dataset and, based on the cases where exactly one of the two models is correct, determines whether their error rates differ significantly.
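
>> As a rough illustration of how McNemar's test works, the sketch below computes the uncorrected chi-square form of the statistic from the two discordant counts. The helper name `mcnemar_chi2` and the counts `b=7`, `c=3` are hypothetical and chosen only for illustration; they are not taken from the models in this notebook.

```python
import math


def mcnemar_chi2(b, c):
    """McNemar statistic (no continuity correction) and its chi-square p-value.

    b and c are the discordant counts: the cases where only the first model
    is correct and the cases where only the second model is correct.
    """
    statistic = (b - c) ** 2 / (b + c)
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical discordant counts, for illustration only.
stat, p = mcnemar_chi2(b=7, c=3)
print(f"statistic={stat:.2f}, p-value={p:.4f}")  # statistic=1.60, p-value=0.2059
```

>> Only the discordant cases enter the statistic. The cases where both models are right or both are wrong carry no information about which model is better, so they are ignored.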

In [101]:
rf_preds = rf_model.predict(X_test_tfidf)  
stacking_preds = stacking_model.predict(X_test_tfidf) 

In [102]:
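# 2x2 table of agreement between the two models on the test set.
A = np.sum((rf_preds == y_test) & (stacking_preds == y_test))  # both correct
B = np.sum((rf_preds == y_test) & (stacking_preds != y_test))  # RF correct, stacking wrong
C = np.sum((rf_preds != y_test) & (stacking_preds == y_test))  # RF wrong, stacking correct
D = np.sum((rf_preds != y_test) & (stacking_preds != y_test))  # both wrong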

table = np.array([[A, B], [C, D]])

In [105]:
result = mcnemar(table, exact=False, correction=False)
print(f"McNemar’s test statistic: {result.statistic}")
print(f"McNemar’s test p-value: {result.pvalue}")

if result.pvalue < 0.05:
    print("There is a statistically significant difference between the models.")
else:
    print("No statistically significant difference between the models.")

McNemar’s test statistic: 1.6
McNemar’s test p-value: 0.20590321073206466
No statistically significant difference between the models.


In [100]:
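# Chi-square test of independence: measures association between the models' outcomes, not which model is more accurate.
chi2, p_value, _, _ = chi2_contingency(table, correction=False)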

In [69]:
print(f"Chi-Square Value: {chi2}")
print(f"P-Value: {p_value}")
print(f"RF correct, Stacking wrong: {B}")
print(f"Stacking correct, RF wrong: {C}")

Chi-Square Value: 267.89945201893374
P-Value: 3.2562747159825463e-60
RF correct, Stacking wrong: 29
Stacking correct, RF wrong: 16


>> McNemar's test gives a statistic of 1.6 and a p-value of about 0.21, which is above the 0.05 threshold, so we cannot conclude that the stacking model and the Random Forest model differ in accuracy on this data. The very small p-value from the chi-square test (3.26e-60) answers a different question: it shows that the two models' correct and incorrect predictions are strongly associated, meaning they tend to succeed and fail on the same phrases, not that one model outperforms the other. The stacking model's higher accuracy may therefore be due to chance.
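
>> As a rough cross-check, the uncorrected McNemar statistic depends only on the two discordant counts, (b - c)^2 / (b + c), and is compared against a chi-square distribution with one degree of freedom. The sketch below computes it by hand with the standard library. The function name `mcnemar_chi2` and the counts 24 and 16 are illustrative assumptions, not values taken from this notebook's table.

```python
import math

def mcnemar_chi2(b, c, correction=False):
    """McNemar's chi-square test from the two discordant counts.

    b: cases where only the first model is correct
    c: cases where only the second model is correct
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, so no evidence of a difference
    # Optional continuity correction: shrink |b - c| by 1 (never below 0).
    diff = max(abs(b - c) - (1.0 if correction else 0.0), 0.0)
    statistic = diff ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical discordant counts, chosen only for illustration.
stat, p = mcnemar_chi2(b=24, c=16)
print(f"statistic={stat:.3f}, p-value={p:.4f}")  # statistic=1.600, p-value=0.2059
```

>> Cases where both models are right or both are wrong do not enter the statistic at all, which is why two models that mostly agree can still be compared this way.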

In [70]:
final_test_predictions = stacking_model.predict(X_test_tfidf)

In [99]:
print(classification_report(y_test, final_test_predictions))

              precision    recall  f1-score   support

    negative       0.76      0.52      0.62        61
     neutral       0.81      0.91      0.86       287
    positive       0.74      0.65      0.70       136

    accuracy                           0.79       484
   macro avg       0.77      0.70      0.73       484
weighted avg       0.79      0.79      0.78       484



In [71]:
test_data.head()

Unnamed: 0,text,sentiment,tokens,clean_text,sentiment_encoded
557,following the registration the number of issu...,neutral,"[following, registration, number, issued, outs...",following registration number issued outstandi...,1
1777,the executive said that countries such as braz...,neutral,"[executive, said, that, countries, such, as, b...",executive said that countries such as brazil c...,1
150,metso expects its net sales to increase by abo...,positive,"[metso, expects, its, net, sales, increase, by...",metso expects its net sales increase by about ...,2
2423,capman made its initial investment in onemed i...,neutral,"[capman, made, its, initial, investment, oneme...",capman made its initial investment onemed june...,1
970,synergy benefits will start to materialise in ...,positive,"[synergy, benefits, will, start, materialise, ...",synergy benefits will start materialise second...,2


In [72]:
final_test_accuracy = accuracy_score(y_test, final_test_predictions)
print(f"Final Test Accuracy: {final_test_accuracy:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, final_test_predictions))

Final Test Accuracy: 0.7913

Classification Report:
              precision    recall  f1-score   support

    negative       0.76      0.52      0.62        61
     neutral       0.81      0.91      0.86       287
    positive       0.74      0.65      0.70       136

    accuracy                           0.79       484
   macro avg       0.77      0.70      0.73       484
weighted avg       0.79      0.79      0.78       484



>> The final test accuracy is 79.1%, about 1.5 percentage points higher than on the tuning set (77.6%). There are a couple of possible reasons. The test set may contain phrases whose sentiment is easier to identify, while the tuning set may contain more challenging or borderline cases. Since the final dataset combines files with different annotator-agreement levels, this explanation is plausible. With only 484 phrases in the test set, a difference of this size could also be sampling noise. Either way, the similar accuracy on the two held-out sets suggests that the model generalizes reasonably well and is not simply memorizing the training phrases.

# Conclusion

From the start, the goal of this project was to develop a deeper understanding of sentiment analysis and natural language processing (NLP). I chose financial phrases as the dataset because I’ve always been interested in determining whether financial reports can be classified as positive, neutral, or negative, with the ultimate goal of using sentiment analysis for financial decision-making.

## Data Preparation & Initial Challenges

The dataset initially consisted of four separate files, each containing phrases and their corresponding sentiment labels. The files differ in the share of annotators who agreed on the sentiment of each phrase: at least 50%, 66%, 75%, or 100%.

Originally, I planned to:

- Train the model on the combined 50% and 66% agreement datasets.
- Tune the model on the 75% agreement dataset.
- Test the model on the 100% agreement dataset.

However, I did not initially account for the possibility of duplicate phrases across different files. My assumption was that the datasets were mutually exclusive and purely split based on agreement levels, but this turned out to be incorrect. The issue became clear when my first Random Forest model achieved 100% accuracy—a strong indication of data leakage.

Upon further analysis, I found that thousands of phrases overlapped between the training, tuning, and test datasets, confirming the leakage. To address this issue, I pivoted my approach:
1. Identified and removed overlapping phrases across datasets.
2. Combined all remaining unique phrases into a single dataset.
3. Manually split the dataset into training (80%), tuning (10%), and testing (10%) sets.
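
The three steps above can be sketched roughly as follows. This is a minimal illustration on toy data, not the notebook's actual code: the frames `low` and `high`, the column names, the fixed `random_state`, and the use of stratified splitting are all assumptions made for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the agreement-level files (hypothetical data).
low = pd.DataFrame({"text": [f"phrase {i}" for i in range(100)],
                    "sentiment": ["neutral", "positive"] * 50})
high = low.iloc[:40].copy()  # a higher-agreement file repeats some phrases

# Stack the files and keep one copy of each phrase.
combined = pd.concat([low, high], ignore_index=True)
unique = combined.drop_duplicates(subset="text").reset_index(drop=True)

# 80% train, then split the remaining 20% evenly into tune and test.
train, rest = train_test_split(unique, test_size=0.2, random_state=42,
                               stratify=unique["sentiment"])
tune, test = train_test_split(rest, test_size=0.5, random_state=42,
                              stratify=rest["sentiment"])

# No phrase should appear in more than one split.
assert set(train["text"]).isdisjoint(tune["text"])
assert set(train["text"]).isdisjoint(test["text"])
print(len(train), len(tune), len(test))  # 80 10 10
```

Deduplicating before splitting is what rules out leakage: once every phrase is unique, any partition of the rows keeps the three sets disjoint.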

## Model Training & Feature Engineering

For feature engineering, I used Term Frequency - Inverse Document Frequency (TF-IDF) to vectorize the text. TF-IDF helps highlight important words in phrases while reducing the impact of commonly used terms. Although it does not capture meaning or context, it was well-suited for this project given the dataset’s limited size.
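
To illustrate how TF-IDF down-weights common terms, here is a small sketch on made-up phrases (the example sentences are hypothetical and the vectorizer uses scikit-learn's default settings, which may differ from those used in this project). The vectorizer is fitted on the training phrases only and then reused for new text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["net sales increased", "net sales decreased", "operating profit rose"]
test_texts = ["net profit rose sharply"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary and IDF weights
X_test = vectorizer.transform(test_texts)        # reuse them; unseen words are ignored

vocab = vectorizer.vocabulary_
idf = vectorizer.idf_
# "net" occurs in two training phrases and "rose" in one, so "rose" gets the larger IDF.
print(idf[vocab["net"]] < idf[vocab["rose"]])  # True
print(X_train.shape, X_test.shape)             # (3, 7) (1, 7)
```

Fitting only on the training split keeps vocabulary and document-frequency information from the tuning and test phrases out of the features.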

I initially trained the following models:
- Multinomial Naive Bayes
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Random Forests

Among these, Random Forests achieved the highest prediction accuracy before hyperparameter tuning.

## Hyperparameter Tuning & Stacking Model

To further improve performance, I performed hyperparameter tuning for both:
- Random Forests (adjusting n_estimators, max_depth, min_samples_split, etc.)
- Multinomial Naive Bayes (adjusting smoothing parameters)

After tuning, I built a stacking ensemble model using:
- Base models: Random Forest, Multinomial Naive Bayes, and KNN.
- Meta-model: Logistic Regression.

The stacking model scored about 2.5 percentage points higher on the tuning set than the best single model, Random Forest (0.7764 vs. 0.7516). To check whether this improvement was statistically significant, I ran McNemar’s test, which returned a p-value of about 0.21, above the 0.05 threshold. The difference between the stacking model and the Random Forest model is therefore not statistically significant, and the gain could be due to random chance.

## Final Testing & Results

I then tested the final stacking model on the test dataset, where it achieved 79.1% accuracy, about 1.5 percentage points higher than on the tuning dataset. The per-class performance was:
- Negative Sentiment: Precision = 0.76, Recall = 0.52, F1-score = 0.62
- Neutral Sentiment: Precision = 0.81, Recall = 0.91, F1-score = 0.86
- Positive Sentiment: Precision = 0.74, Recall = 0.65, F1-score = 0.70

## Key Takeaways
1. Data leakage is a major issue in machine learning. Removing phrases duplicated across the splits was necessary to get a realistic estimate of how well the model generalizes.
2. Stacking gave a small accuracy gain over the best individual model (about 2.5 percentage points on the tuning set), but the gain was not statistically significant.
3. McNemar’s test is useful for checking whether a difference between two models is statistically significant or could simply be due to luck.
4. TF-IDF is a simple yet effective vectorization technique, though a potential future step would be exploring word embeddings to better capture semantic meaning.

With these insights, this project provided a strong practical understanding of sentiment analysis in NLP, text preprocessing, model evaluation, and ensemble learning.