## 1. Introduction

This project develops a Financial Sentiment Analysis Model that uses natural language processing and machine learning to classify the sentiment of financial text. Quantifying the sentiment expressed in financial sources can support investment decisions, risk management, and market forecasting. The objective is a model that predicts sentiment accurately enough to help investors, traders, and other financial professionals make better-informed decisions.

## 2. Methodology

The methodology covers the following stages. Data is collected from Kaggle, split into training, validation, and test sets, and preprocessed (sentence cleaning, tokenization, stopword removal, stemming, and lemmatization), followed by exploratory analysis. Class imbalance is then addressed, and the labels are encoded. The modeling phase trains Naive Bayes with TF-IDF, a Bidirectional LSTM with Word2Vec, and the RoBERTa model, and concludes with an evaluation of each model.

## 3. Development of Financial Sentiment Analysis Model

### 3.1 Import Libraries

In [1]:
# Operating System and File Handling
import os

# Data Manipulation and Analysis
import pandas as pd
import numpy as np
from wordcloud import WordCloud
from sklearn.utils import resample
import random

# Data Visualization
import matplotlib.pyplot as plt

# Machine Learning - Data Splitting
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit

# Natural Language Processing (NLP) - Tokenization and Text Processing
import nltk
import re
import emoji
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer, SnowballStemmer
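nltk.download('omw-1.4')  # Open Multilingual WordNet data used alongside the WordNet lemmatizer
# The 'punkt', 'stopwords', and 'wordnet' NLTK resources are assumed to be installed already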

# Label Encoding
from sklearn.preprocessing import LabelEncoder

# Modeling (Deep Learning)
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import backend as K

# Machine Learning - Naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import RandomizedSearchCV

# Natural Language Processing (NLP) - Word Embeddings
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional, BatchNormalization
from sklearn.feature_extraction.text import TfidfVectorizer
import gensim
from gensim.models import Word2Vec
import multiprocessing
from transformers import RobertaTokenizer, TFRobertaForSequenceClassification

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\aswin\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


### 3.2 Data Collection

This code loads 'data.csv' from the current working directory into a pandas DataFrame named 'data', removes exact duplicate rows, and prints the resulting shape and the first rows.

In [2]:
# Get the data path
root_folder = os.getcwd()  
data_filename = 'data.csv' 
data_path = os.path.join(root_folder, data_filename)

# Load the CSV file
data = pd.read_csv(data_path)

# Find duplicate rows based on all columns
duplicate_rows = data[data.duplicated()]

# Remove duplicate rows and update the DataFrame
data = data.drop_duplicates()

# Print data shape and df
print("Shape of df:", data.shape)
data.head()

Shape of df: (5836, 2)


| | Sentence | Sentiment |
|---|---|---|
| 0 | The GeoSolutions technology will leverage Bene... | positive |
| 1 | $ESI on lows, down $1.50 to $2.50 BK a real po... | negative |
| 2 | For the last quarter of 2010 , Componenta 's n... | positive |
| 3 | According to the Finnish-Russian Chamber of Co... | neutral |
| 4 | The Swedish buyout firm has sold its remaining... | neutral |


### 3.3 Data Splitting

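This code splits the data into training (90%), validation (5%), and test (5%) sets and prints the size of each subset as a percentage of the total. As written, the calls to train_test_split do not pass the stratify argument, so the split is random rather than stratified by sentiment.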

In [3]:
# Split the data into features (X) and labels (y)
X = data.drop(columns=['Sentiment'])  # Assuming you have other features besides 'Sentiment'
y = data['Sentiment']

# Random 90/5/5 split (not stratified as written; pass stratify=y and stratify=y_temp to stratify by "Sentiment")
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.1, random_state=7)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=7)

# Calculate the percentages of each subset
total_samples = len(data)
train_percentage = len(X_train) / total_samples * 100
val_percentage = len(X_val) / total_samples * 100
test_percentage = len(X_test) / total_samples * 100

# Print the sizes and percentages of the subsets
print("Train data size:", len(X_train), f"({train_percentage:.0f}%)")
print("Validation data size:", len(X_val), f"({val_percentage:.0f}%)")
print("Test data size:", len(X_test), f"({test_percentage:.0f}%)")

Train data size: 5252 (90%)
Validation data size: 292 (5%)
Test data size: 292 (5%)
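To illustrate what a stratified split would do, here is a minimal standard-library sketch. The helper `stratified_split` and the toy labels are hypothetical and are not part of the notebook; in scikit-learn the same effect is requested by passing `stratify=y` to `train_test_split`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed=7):
    """Return (train_idx, test_idx) so that each class is split by test_fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical toy labels: 60% neutral, 30% positive, 10% negative
labels = ["neutral"] * 60 + ["positive"] * 30 + ["negative"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.1)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both subsets keep the 6:3:1 class ratio of the toy data, which a purely random split only achieves approximately.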


In [4]:
from imblearn.under_sampling import RandomUnderSampler

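# Create an instance of RandomUnderSampler ('auto' reduces every class except the minority to the minority size)
rus = RandomUnderSampler(sampling_strategy='auto', random_state=62)

# Undersample the training data only; the validation and test sets keep their original distribution
X_train, y_train = rus.fit_resample(X_train, y_train)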
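The effect of random undersampling can be sketched with the standard library alone. The helper `random_undersample` and the toy data below are hypothetical; the sketch only mirrors the idea behind RandomUnderSampler with sampling_strategy='auto', where every class except the minority is reduced to the minority class size.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=62):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())          # size of the minority class

    kept = []
    for cls in sorted(counts):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        kept.extend(sorted(rng.sample(idx, target)))
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Hypothetical toy data: 5 neutral, 3 positive, 2 negative sentences
rows = [f"sentence {i}" for i in range(10)]
labels = ["neutral"] * 5 + ["positive"] * 3 + ["negative"] * 2

rows_bal, labels_bal = random_undersample(rows, labels)
print(Counter(labels_bal))
```

In the notebook, this step reduces the training set from 5252 rows to the 2358 rows reported later in the reshaping output, which is consistent with three classes of 786 rows each.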

### 3.4 Model Training and Evaluation

### A. Naive Bayes with TF-IDF

Naive Bayes with TF-IDF is a common and effective approach for text classification tasks. It leverages TF-IDF scores to represent text data and employs the probabilistic framework of Naive Bayes for classification.

##### Data Preprocessing

Text preprocessing cleans and transforms raw text so that it is suitable for modeling. The function preprocess_text removes URLs and hashtags, replaces '@' with 'at', replaces every character other than letters and a small set of punctuation marks with a space, lowercases the text, tokenizes it, removes stopwords (NLTK's English list plus a few custom tokens), and lemmatizes the remaining tokens. A Snowball stemming step is also computed, but the function returns the lemmatized tokens, as the output below shows. The function is applied to the 'Sentence' column of X_train, X_val, and X_test, and the original column is then dropped from each DataFrame. Lastly, the sentiment classes are label encoded.

In [5]:
X_train_nb = X_train.copy()
X_val_nb = X_val.copy()
X_test_nb = X_test.copy()

y_train_nb = y_train[:]
y_val_nb = y_val[:]
y_test_nb = y_test[:]

In [6]:
def preprocess_text(text):
    # Step 1: Clean the text
    # Replace URLs with a space
    text = re.sub(r"http\S+", " ", text)
    text = re.sub(r"http", " ", text)
    # Replace '@' with 'at'
    text = re.sub(r"@", "at", text)
    # Replace hashtags with a space
    text = re.sub(r"#[A-Za-z0-9_]+", ' ', text)
    # Replace everything except letters and a few punctuation marks (digits included) with a space
    text = re.sub(r"[^A-Za-z(),!?@\'\"_\n]", " ", text)
    # Convert to lowercase
    text = text.lower()
    # Convert any remaining emoji to their text aliases (demojize does not delete them)
    text = emoji.demojize(text)
    # Remove non-ASCII characters
    text = ''.join([c for c in text if ord(c) < 128])

    # Step 2: Tokenization
    tokens = word_tokenize(text)

    # Step 3: Stopword Removal
    stop_words = set(stopwords.words('english'))
    additional_stopwords = ['rt', 'mkr', 'didn', 'bc', 'n', 'm', 'im', 'll', 'y', 've', 'u', 'ur', 'don', 'p', 't', 's', 'aren', 'kp', 'o', 'kat', 'de', 're', 'amp', 'will']
    stop_words.update(additional_stopwords)
    filtered_tokens = [token for token in tokens if token not in stop_words]
    
    # Step 4: Lemmatization with WordNet Lemmatizer
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    
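    # Step 5: Stemming with Snowball Stemmer
    # Note: the stemmed tokens are computed but not returned; the function returns the lemmatized tokens
    stemmer = SnowballStemmer("english")
    stemmed_tokens = [stemmer.stem(token) for token in lemmatized_tokens]

    return lemmatized_tokens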


# Apply the preprocess_text function to the 'Sentence' column of X_train, X_val, and X_test
X_train_nb['Processed_Tokens'] = X_train_nb['Sentence'].apply(preprocess_text)
X_val_nb['Processed_Tokens'] = X_val_nb['Sentence'].apply(preprocess_text)
X_test_nb['Processed_Tokens'] = X_test_nb['Sentence'].apply(preprocess_text)

# Remove the original 'Sentence' column from X_train, X_val, and X_test
X_train_nb.drop(columns=['Sentence'], inplace=True)
X_val_nb.drop(columns=['Sentence'], inplace=True)
X_test_nb.drop(columns=['Sentence'], inplace=True)

# Print the processed DataFrame
X_train_nb.head()

| | Processed_Tokens |
|---|---|
| 2979 | [sale, unit, slumped, last, year, industry, hi... |
| 4511 | [disappointment, see, plan, folded] |
| 1665 | [nokia, 's, share, price, fell, le, one, perce... |
| 2469 | [operating, profit, month, period, decreased, ... |
| 2037 | [operating, profit, ,, excluding, non, recurri... |
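The cleaning regexes can be tried in isolation. The following is a minimal sketch that reuses the patterns from preprocess_text but, as a simplifying assumption, splits on whitespace instead of calling NLTK's word_tokenize; the sample sentence and the example.com URL are made up for illustration.

```python
import re

def clean_text(text):
    text = re.sub(r"http\S+", " ", text)                 # drop URLs
    text = re.sub(r"http", " ", text)                    # drop any leftover 'http'
    text = re.sub(r"@", "at", text)                      # '@' becomes 'at'
    text = re.sub(r"#[A-Za-z0-9_]+", " ", text)          # drop hashtags
    # Keep letters and a few punctuation marks; digits and '$' become spaces
    text = re.sub(r"[^A-Za-z(),!?@\'\"_\n]", " ", text)
    return text.lower()

# Hypothetical sample in the style of the dataset
sample = "$ESI down $1.50, see http://example.com #stocks"
tokens = clean_text(sample).split()
print(tokens)
```

The result shows why tokens such as a lone comma appear in the processed output above: commas survive the character filter, while digits, '$', and '.' do not.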


In [7]:
# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the labels for training data
y_train_nb = label_encoder.fit_transform(y_train_nb)

# Encode the validation and test labels with the encoder fitted on the training labels
y_val_nb = label_encoder.transform(y_val_nb)
y_test_nb = label_encoder.transform(y_test_nb)

# Print the first few encoded labels for training data
print(y_train_nb[:10])

[0 0 0 0 0 0 0 0 0 0]
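LabelEncoder assigns integer codes to the class names in sorted order. The sketch below reproduces that mapping with plain Python, assuming the three labels seen in the data preview; the helper name `fit_label_mapping` is hypothetical.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted (alphabetical) order."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}

mapping = fit_label_mapping(["positive", "negative", "neutral", "negative"])
encoded = [mapping[label] for label in ["neutral", "positive", "negative"]]

print(mapping)
print(encoded)
```

Under this mapping, code 0 stands for 'negative', so the ten zeros printed above would correspond to negative examples placed first in the resampled training set.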


##### TF-IDF

This code performs TF-IDF vectorization on text data, creating TF-IDF representations that can be used as input features for machine learning models.

In [8]:
def tfidf_vectorize(train_set, val_set, test_set):
    
    # Join the tokenized words back into sentences
    train_set['Processed_Sentences'] = train_set['Processed_Tokens'].apply(lambda tokens: ' '.join(tokens))
    val_set['Processed_Sentences'] = val_set['Processed_Tokens'].apply(lambda tokens: ' '.join(tokens))
    test_set['Processed_Sentences'] = test_set['Processed_Tokens'].apply(lambda tokens: ' '.join(tokens))
    
    # Drop the Processed_Tokens column
    train_set.drop(columns=['Processed_Tokens'], inplace=True)
    val_set.drop(columns=['Processed_Tokens'], inplace=True)
    test_set.drop(columns=['Processed_Tokens'], inplace=True)
    
    # Initialize the TF-IDF vectorizer
    tfidf_vectorizer = TfidfVectorizer()
    
    # Fit and transform the TF-IDF vectorizer on the train set
    X_train_tfidf = tfidf_vectorizer.fit_transform(train_set['Processed_Sentences'])
    
    # Transform the validation and test sets using the same vectorizer
    X_val_tfidf = tfidf_vectorizer.transform(val_set['Processed_Sentences'])
    X_test_tfidf = tfidf_vectorizer.transform(test_set['Processed_Sentences'])
    
    return X_train_tfidf.toarray(), X_val_tfidf.toarray(), X_test_tfidf.toarray()

# Usage
X_train_tfidf, X_val_tfidf, X_test_tfidf = tfidf_vectorize(X_train_nb, X_val_nb, X_test_nb)

# Print the TF-IDF matrix for the first sentence on X_train
print(X_train_tfidf[0])

[0. 0. 0. ... 0. 0. 0.]
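To make the TF-IDF weights concrete, here is a small standard-library sketch. It assumes scikit-learn's default weighting (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 row normalization) and uses whitespace splitting rather than TfidfVectorizer's own tokenizer; the three documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Hypothetical preprocessed sentences
docs = ["profit rose", "profit fell", "sale fell sharply"]
vocab, rows = tfidf(docs)

print(vocab)
print([round(w, 3) for w in rows[0]])
```

In the first document, 'rose' receives a higher weight than 'profit' because it occurs in fewer documents. Most entries of each row are zero, which matches the sparse-looking vector printed above.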


##### Naive Bayes Model

In this code, a Multinomial Naive Bayes classifier is tuned with RandomizedSearchCV over a small hyperparameter grid (alpha and fit_prior), using 5-fold cross-validation and macro F1 as the scoring metric. Because the grid contains only 10 combinations and n_iter is 10, the search evaluates every combination. The best hyperparameters are printed, and the accuracy and macro F1-score of the best model are reported on the training and validation sets.

In [9]:
# Define the hyperparameter grid for random search
param_dist = {
    'alpha': [0.1, 0.5, 1.0, 1.5, 2.0],
    'fit_prior': [True, False]
}

# Create a Multinomial Naive Bayes classifier
nb_model = MultinomialNB()

# Define the scoring metric (macro F1-score in this case)
scorer = make_scorer(f1_score, average='macro')

# Create RandomizedSearchCV object with n-fold cross-validation
random_search = RandomizedSearchCV(estimator=nb_model, param_distributions=param_dist, scoring=scorer, cv=5, n_iter=10, random_state=72)

# Fit the random search to your data 
random_search.fit(X_train_tfidf, y_train_nb)

# Print the best hyperparameters and corresponding macro F1-score
best_hyperparameters = random_search.best_params_
best_macro_f1_score = random_search.best_score_
print("Best Hyperparameters:", best_hyperparameters)

# Predict on the training and validation sets using the best model
best_model = random_search.best_estimator_
train_predictions = best_model.predict(X_train_tfidf)
val_predictions = best_model.predict(X_val_tfidf)

# Calculate and print the accuracy on the training and validation sets using the best model
train_accuracy = accuracy_score(y_train_nb, train_predictions)
val_accuracy = accuracy_score(y_val_nb, val_predictions)

print(f"Train Accuracy using Best Model: {train_accuracy:.2f}")
print(f"Validation Accuracy using Best Model: {val_accuracy:.2f}")

# Calculate and print the f1 score on the training and validation sets using the best model
train_macro_f1 = f1_score(y_train_nb, train_predictions, average='macro')
val_macro_f1 = f1_score(y_val_nb, val_predictions, average='macro')

print(f"Train F1-score using Best Model: {train_macro_f1:.2f}")
print(f"Validation F1-score using Best Model: {val_macro_f1:.2f}")

Best Hyperparameters: {'fit_prior': False, 'alpha': 2.0}
Train Accuracy using Best Model: 0.85
Validation Accuracy using Best Model: 0.60
Train F1-score using Best Model: 0.85
Validation F1-score using Best Model: 0.56
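Macro F1 averages the per-class F1-scores with equal weight, so it can differ from accuracy when some classes are predicted worse than others, as in the validation results above. A minimal sketch with made-up labels (the function `macro_f1` is an illustrative reimplementation, not the scikit-learn code):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, macro F1={score:.3f}")
```

Here the per-class F1-scores are 0.5, 0.8, and 2/3, so the macro average is slightly below the accuracy of 4/6.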


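### B. Bidirectional LSTM with Word2Vec

This section repeats the preprocessing and label encoding steps (here with contraction expansion), trains a Word2Vec model on the tokenized text, converts each sentence into a sequence of Word2Vec embeddings, and feeds the padded sequences to a Bidirectional LSTM.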

In [10]:
# Create copies of the DataFrames
X_train_bi_lstm = X_train.copy()
X_val_bi_lstm = X_val.copy()
X_test_bi_lstm = X_test.copy()

y_train_bi_lstm = y_train[:]
y_val_bi_lstm = y_val[:]
y_test_bi_lstm = y_test[:]

In [11]:
def preprocess_text(text):
    # Step 1: Clean the text
    # Replace URLs with a space
    text = re.sub(r"http\S+", " ", text)
    text = re.sub(r"http", " ", text)
    # Replace '@' with 'at'
    text = re.sub(r"@", "at", text)
    # Replace hashtags with a space
    text = re.sub(r"#[A-Za-z0-9_]+", ' ', text)
    # Replace everything except letters and a few punctuation marks (digits included) with a space
    text = re.sub(r"[^A-Za-z(),!?@\'\"_\n]", " ", text)
    # Convert to lowercase
    text = text.lower()
    
    # Replace contractions with their expanded forms
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'s", " is", text)
    text = re.sub(r"'d", " would", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'t", " not", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"'m", " am", text)

    # Step 2: Tokenization
    tokens = word_tokenize(text)

    # Step 3: Stopword Removal
    stop_words = set(stopwords.words('english'))
    additional_stopwords = ['rt', 'mkr', 'didn', 'bc', 'n', 'm', 'im', 'll', 'y', 've', 'u', 'ur', 'don', 'p', 't', 's', 'aren', 'kp', 'o', 'kat', 'de', 're', 'amp', 'will']
    stop_words.update(additional_stopwords)
    filtered_tokens = [token for token in tokens if token not in stop_words]
    
    # Step 4: Lemmatization with WordNet Lemmatizer
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    
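    # Step 5: Stemming with Snowball Stemmer
    # Note: the stemmed tokens are computed but not returned; the function returns the lemmatized tokens
    stemmer = SnowballStemmer("english")
    stemmed_tokens = [stemmer.stem(token) for token in lemmatized_tokens]

    return lemmatized_tokens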


# Apply the preprocess_text function to the 'Sentence' column of X_train, X_val, and X_test
X_train_bi_lstm['Processed_Tokens'] = X_train_bi_lstm['Sentence'].apply(preprocess_text)
X_val_bi_lstm['Processed_Tokens'] = X_val_bi_lstm['Sentence'].apply(preprocess_text)
X_test_bi_lstm['Processed_Tokens'] = X_test_bi_lstm['Sentence'].apply(preprocess_text)

# Remove the original 'Sentence' column from X_train, X_val, and X_test
X_train_bi_lstm.drop(columns=['Sentence'], inplace=True)
X_val_bi_lstm.drop(columns=['Sentence'], inplace=True)
X_test_bi_lstm.drop(columns=['Sentence'], inplace=True)

# Print the processed DataFrame
X_train_bi_lstm.head()

| | Processed_Tokens |
|---|---|
| 2979 | [sale, unit, slumped, last, year, industry, hi... |
| 4511 | [disappointment, see, plan, folded] |
| 1665 | [nokia, share, price, fell, le, one, percent, ... |
| 2469 | [operating, profit, month, period, decreased, ... |
| 2037 | [operating, profit, ,, excluding, non, recurri... |


In [12]:
# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the labels for training data
y_train_bi_lstm = label_encoder.fit_transform(y_train_bi_lstm)

# Encode the validation and test labels with the encoder fitted on the training labels
y_val_bi_lstm = label_encoder.transform(y_val_bi_lstm)
y_test_bi_lstm = label_encoder.transform(y_test_bi_lstm)

# Print the first few encoded labels for training data
print(y_train_bi_lstm[:10])

[0 0 0 0 0 0 0 0 0 0]


##### Word2Vec

In [13]:
def train_word2vec_model(df):
    # Ensure that the 'Processed_Tokens' column contains lists of tokens
    sentences = df['Processed_Tokens'].tolist()
    
    # Create and train the Word2Vec model
    model = Word2Vec(sentences, 
                     vector_size=100,  # You can adjust the vector size as needed
                     window=3,          # Context window size
                     min_count=2,       # Minimum word frequency
                     sg=0,              # CBOW model (sg=1 for Skip-gram)
                     workers=multiprocessing.cpu_count())
    
    return model

def transform_to_word2vec(df, model):
    # Tokenize sentences
    sentences = df['Processed_Tokens'].tolist()
    
    # Transform sentences to Word2Vec vectors
    transformed_data = []
    for sentence in sentences:
        word_vectors = []
        for word in sentence:
            if word in model.wv:
                word_vectors.append(model.wv[word])
        transformed_data.append(word_vectors)
    
    return transformed_data

# Combine train, validation, and test sets into one DataFrame
combined_df = pd.concat([X_train_bi_lstm, X_val_bi_lstm, X_test_bi_lstm], ignore_index=True)

# Train the Word2Vec model on the combined data
word2vec_model = train_word2vec_model(combined_df)

# Transform the data using the trained model
X_train_word2vec = transform_to_word2vec(X_train_bi_lstm, word2vec_model)
X_val_word2vec = transform_to_word2vec(X_val_bi_lstm, word2vec_model)
X_test_word2vec = transform_to_word2vec(X_test_bi_lstm, word2vec_model)

# Print the word2vec embedding for first row
print(X_train_word2vec[0])

[array([-0.17835921,  0.15375838,  0.15511887,  0.15005662,  0.07745324,
       -0.3121139 ,  0.32480076,  0.7818806 , -0.30706242, -0.27127847,
       -0.03503095, -0.45709884, -0.16601317,  0.24844386,  0.09457346,
       -0.3154079 ,  0.11175033, -0.39157084, -0.058618  , -0.8477492 ,
        0.14082128,  0.21532547,  0.28559312, -0.11156128,  0.00475884,
       -0.10021405, -0.11515267, -0.07942633, -0.48533   ,  0.01235928,
        0.43980172,  0.06331208,  0.36226758, -0.32920563, -0.13420144,
        0.26673862,  0.0842419 , -0.29874036, -0.15484992, -0.48903206,
        0.17389674, -0.21252078, -0.20924143,  0.17631343,  0.26387066,
       -0.05864932, -0.11540092, -0.00688855,  0.2586503 ,  0.25666988,
        0.21254857, -0.22767673, -0.04937591,  0.00699232, -0.20692594,
        0.24967958,  0.21334487, -0.13408518, -0.23693162,  0.09742279,
       -0.11466881,  0.28453517, -0.21527539,  0.0255277 , -0.4660117 ,
        0.43069145,  0.14200063,  0.10814288, -0.46602917, ...

##### Padding

In [14]:
def pad_sequences_custom(train_data, val_data, test_data):
    # Determine the maximum sequence length for each dataset
    max_sequence_length_train = max(len(seq) for seq in train_data)
    max_sequence_length_val = max(len(seq) for seq in val_data)
    max_sequence_length_test = max(len(seq) for seq in test_data)
    
    # Calculate the overall maximum sequence length
    max_sequence_length = max(max_sequence_length_train, max_sequence_length_val, max_sequence_length_test)
    
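    # Pad sequences to match the calculated maximum sequence length.
    # Note: pad_sequences defaults to dtype='int32', which truncates the float embeddings
    # toward zero (consistent with the all-zero output below); dtype='float32' would preserve them.
    train_padded = pad_sequences(train_data, maxlen=max_sequence_length)
    val_padded = pad_sequences(val_data, maxlen=max_sequence_length)
    test_padded = pad_sequences(test_data, maxlen=max_sequence_length)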
    
    return train_padded, val_padded, test_padded, max_sequence_length

# Usage
X_train_padded, X_val_padded, X_test_padded, max_sequence_length = pad_sequences_custom(X_train_word2vec, X_val_word2vec, X_test_word2vec)

# Print padded sentence for the first row on X_train
print(X_train_padded[0])

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
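The all-zero output is worth a closer look. Keras pad_sequences pads at the front by default and casts values to int32 unless another dtype is given. The standard-library sketch below imitates both behaviors on a made-up two-word sentence with two-dimensional embeddings; the helper `pad_pre` is hypothetical.

```python
def pad_pre(sequence, max_len, dim, cast=float):
    """Left-pad a list of vectors with zero vectors and cast every value."""
    n_pad = max(max_len - len(sequence), 0)
    padded = [[cast(0)] * dim for _ in range(n_pad)]
    # Keep the last max_len vectors if the sequence is too long
    padded += [[cast(v) for v in vec] for vec in sequence[-max_len:]]
    return padded

# Hypothetical embeddings with values between -1 and 1
seq = [[0.25, -0.75], [0.5, 0.9]]

as_float = pad_pre(seq, max_len=3, dim=2, cast=float)
as_int = pad_pre(seq, max_len=3, dim=2, cast=int)

print(as_float)
print(as_int)
```

With an integer cast, every embedding value between -1 and 1 becomes 0, so the padded array carries no information. Passing dtype='float32' to pad_sequences would avoid this, under the assumption that float inputs are intended.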


##### Reshaping

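The padded data is already a 3D array of shape (samples, sequence length, embedding dimension). This code copies it into a floating-point NumPy array of the same shape for use as input to the deep learning model.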

In [15]:
def reshape_for_deep_learning(data):
    # Determine the maximum number of words in a sentence
    max_len = max(len(sentence) for sentence in data)
    
    # Create a 3D array to hold the data
    num_samples = len(data)
    embedding_dim = len(data[0][0])  # Assuming all embeddings have the same dimension
    reshaped_data = np.zeros((num_samples, max_len, embedding_dim))
    
    # Fill the 3D array with data
    for i, sentence in enumerate(data):
        for j, word_vector in enumerate(sentence):
            reshaped_data[i, j, :] = word_vector
    
    return reshaped_data

# Reshape the data for deep learning models
X_train_reshaped = reshape_for_deep_learning(X_train_padded)
X_val_reshaped = reshape_for_deep_learning(X_val_padded)
X_test_reshaped = reshape_for_deep_learning(X_test_padded)

# Print the shape of the reshaped data to verify
print("Shape of X_train_reshaped:", X_train_reshaped.shape)
print("Shape of X_val_reshaped:", X_val_reshaped.shape)
print("Shape of X_test_reshaped:", X_test_reshaped.shape)

Shape of X_train_reshaped: (2358, 54, 100)
Shape of X_val_reshaped: (292, 54, 100)
Shape of X_test_reshaped: (292, 54, 100)


##### Bidirectional LSTM Model 

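This code defines and trains a Bidirectional LSTM network: five stacked Bidirectional LSTM layers followed by three dense layers and a softmax output. The model is compiled with the RMSprop optimizer (learning rate 0.0001), categorical cross-entropy loss, and two metrics: accuracy and a custom batch-wise F1. Early stopping on the validation loss (patience of 7 epochs) limits overfitting, and the training history is stored in the 'history' variable.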

In [None]:
# Set a specific random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Determine the number of unique classes (labels)
num_classes = len(np.unique(y_train_bi_lstm))

# Create a stacked Bidirectional LSTM model (five recurrent layers followed by dense layers)
bi_lstm_model = Sequential()
bi_lstm_model.add(Bidirectional(LSTM(128, return_sequences=True), input_shape=(X_train_reshaped.shape[1], X_train_reshaped.shape[2])))
bi_lstm_model.add(Bidirectional(LSTM(128, return_sequences=True)))
bi_lstm_model.add(Bidirectional(LSTM(128, return_sequences=True)))
bi_lstm_model.add(Bidirectional(LSTM(128, return_sequences=True)))
bi_lstm_model.add(Bidirectional(LSTM(128, kernel_initializer='glorot_uniform')))
bi_lstm_model.add(Dense(128, activation='relu'))
bi_lstm_model.add(Dense(64, activation='relu'))
bi_lstm_model.add(Dense(32, activation='relu'))
bi_lstm_model.add(Dense(num_classes, activation='softmax'))

# Define custom metrics
def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2 * ((precision * recall) / (precision + recall + K.epsilon()))

# Define the learning rate
learning_rate = 0.0001

# Create an optimizer with the specified learning rate
optimizer = tf.keras.optimizers.RMSprop(learning_rate=learning_rate)

# Compile the model with categorical cross-entropy loss and custom metrics
bi_lstm_model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy', f1_m])

# Define early stopping to prevent overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=7, restore_best_weights=True)

# Train the model
history = bi_lstm_model.fit(
    X_train_reshaped,
    tf.keras.utils.to_categorical(y_train_bi_lstm, num_classes=num_classes),
    epochs=30,
    batch_size=3,
    validation_data=(X_val_reshaped, tf.keras.utils.to_categorical(y_val_bi_lstm, num_classes=num_classes)),
    callbacks=[early_stopping]
)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
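The early stopping rule used above can be sketched without Keras. This is a simplified illustration of stopping after `patience` epochs without improvement in validation loss and remembering the best epoch. The function name and the loss values are hypothetical, and details of the real callback such as min_delta are ignored.

```python
def early_stopping_epoch(val_losses, patience=7):
    """Return (stop_epoch, best_epoch), both 1-based."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0   # improvement: reset counter
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch                   # stop training here
    return len(val_losses), best_epoch

# Hypothetical validation losses, one per epoch
losses = [0.9, 0.8, 0.85, 0.82, 0.81]
stop, best = early_stopping_epoch(losses, patience=3)
print(f"stopped after epoch {stop}, best epoch was {best}")
```

With restore_best_weights=True, as in the notebook, the model would be rolled back to the weights from the best epoch rather than keeping those from the last one.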

### C. RoBERTa Model with RoBERTa Preprocessing

This approach uses the RoBERTa tokenizer and the pre-trained RoBERTa model for sentiment analysis. Pre-trained RoBERTa models can be fine-tuned for a downstream task or used as feature extractors; here the model is fine-tuned for sentiment classification. The RoBERTa tokenizer converts the text into the input format that the model expects.

##### Data Preprocessing

In [None]:
# Create copies of the DataFrames
X_train_roberta = X_train.copy()
X_val_roberta = X_val.copy()
X_test_roberta = X_test.copy()

y_train_roberta = y_train[:]
y_val_roberta = y_val[:]
y_test_roberta = y_test[:]

In [None]:
def preprocess_text(text):
    # Step 1: Clean the text
    # Replace contractions with their expanded forms
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'s", " is", text)
    text = re.sub(r"'d", " would", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'t", " not", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"'m", " am", text)

    # Replace URLs with a space
    text = re.sub(r"http\S+", " ", text)
    text = re.sub(r"http", " ", text)
    # Replace '@' with 'at'
    text = re.sub(r"@", "at", text)
    # Replace hashtags with a space
    text = re.sub(r"#[A-Za-z0-9_]+", ' ', text)
    # Replace everything except letters and a few punctuation marks (digits included) with a space
    text = re.sub(r"[^A-Za-z(),!?@\'\"_\n]", " ", text)
    # Convert to lowercase
    text = text.lower()
    # Convert any remaining emoji to their text aliases (demojize does not delete them)
    text = emoji.demojize(text)
    # Remove non-ASCII characters
    text = ''.join([c for c in text if ord(c) < 128])

    # Step 2: Stopword Removal
    stop_words = set(stopwords.words('english'))
    additional_stopwords = ['rt', 'mkr', 'didn', 'bc', 'n', 'm', 'im', 'll', 'y', 've', 'u', 'ur', 'don', 'p', 't', 's', 'aren', 'kp', 'o', 'kat', 'de', 're', 'amp', 'will']
    stop_words.update(additional_stopwords)

    # Remove stopwords from the text
    cleaned_text = ' '.join([word for word in text.split() if word not in stop_words])

    return cleaned_text

# Apply the preprocess_text function to the 'Sentence' column of X_train, X_val, and X_test
X_train_roberta['Processed_Tokens'] = X_train_roberta['Sentence'].apply(preprocess_text)
X_val_roberta['Processed_Tokens'] = X_val_roberta['Sentence'].apply(preprocess_text)
X_test_roberta['Processed_Tokens'] = X_test_roberta['Sentence'].apply(preprocess_text)

# Remove the original 'Sentence' column from X_train, X_val, and X_test
X_train_roberta.drop(columns=['Sentence'], inplace=True)
X_val_roberta.drop(columns=['Sentence'], inplace=True)
X_test_roberta.drop(columns=['Sentence'], inplace=True)

# Print the processed DataFrame
X_train_roberta.head()

In [None]:
# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the labels for training data
y_train_roberta = label_encoder.fit_transform(y_train_roberta)

# Encode the validation and test labels with the encoder fitted on the training labels
y_val_roberta = label_encoder.transform(y_val_roberta)
y_test_roberta = label_encoder.transform(y_test_roberta)

# Print the first few encoded labels for training data
print(y_train_roberta[:10])

##### RoBERTa Preprocessing

These steps tokenize the text with the RoBERTa tokenizer, one-hot encode the labels, and wrap each split in a batched TensorFlow Dataset so that the model can be trained and evaluated.

In [None]:
# Load RoBERTa tokenizer
model_name = "roberta-base"  # Choose the appropriate variant
tokenizer = RobertaTokenizer.from_pretrained(model_name)

# Tokenization and preprocessing function
def preprocess_data(X, y, max_seq_length=128):
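    # Flatten the single-column DataFrame into a Series of cleaned sentences
    X = X.squeeze()

    # Each entry is already a cleaned string (preprocess_text returns text, not tokens),
    # so it is passed on as-is; joining it with " " would split it into single characters
    sentences = X.tolist()

    # Tokenize, truncate, and pad every sentence to max_seq_length token ids
    tokenized_sentences = [tokenizer.encode(sentence, max_length=max_seq_length, padding='max_length', truncation=True) for sentence in sentences]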

    # Convert labels to categorical if needed
    num_classes = len(np.unique(y))
    y = tf.keras.utils.to_categorical(y, num_classes=num_classes)

    return tokenized_sentences, y

# Preprocess the training data
X_train_processed, y_train_processed = preprocess_data(X_train_roberta, y_train_roberta)

# Create a TensorFlow Dataset for training data
batch_size = 32  # Adjust this based on your preference
train_dataset = tf.data.Dataset.from_tensor_slices((X_train_processed, y_train_processed))
train_dataset = train_dataset.shuffle(buffer_size=10000).batch(batch_size)
train_dataset = train_dataset.prefetch(buffer_size=-1)

# Preprocess the validation data
X_val_processed, y_val_processed = preprocess_data(X_val_roberta, y_val_roberta)

# Create a TensorFlow Dataset for validation data
val_dataset = tf.data.Dataset.from_tensor_slices((X_val_processed, y_val_processed))
val_dataset = val_dataset.batch(batch_size)
val_dataset = val_dataset.prefetch(buffer_size=-1)

# Preprocess the test data
X_test_processed, y_test_processed = preprocess_data(X_test_roberta, y_test_roberta)

# Create a TensorFlow Dataset for test data
test_dataset = tf.data.Dataset.from_tensor_slices((X_test_processed, y_test_processed))
test_dataset = test_dataset.batch(batch_size)
test_dataset = test_dataset.prefetch(buffer_size=-1)
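Padding to a fixed length is what lets sentences of different lengths be stacked into one tensor. The sketch below shows the idea on made-up token ids, assuming the RoBERTa convention that the padding token has id 1. The helper `pad_or_truncate` is hypothetical and simpler than the real tokenizer, which also keeps the end-of-sequence token when truncating.

```python
def pad_or_truncate(token_ids, max_len=8, pad_id=1):
    """Return (input_ids, attention_mask), both of length max_len."""
    trimmed = token_ids[:max_len]
    n_pad = max_len - len(trimmed)
    input_ids = trimmed + [pad_id] * n_pad
    attention_mask = [1] * len(trimmed) + [0] * n_pad   # 0 marks padding positions
    return input_ids, attention_mask

# Hypothetical ids: 0 = <s>, 2 = </s>, the rest stand in for word pieces
ids, mask = pad_or_truncate([0, 100, 200, 2], max_len=6)
print(ids)
print(mask)
```

The notebook passes only the token ids to the model. Supplying an attention mask as well would tell the model to ignore the padding positions, which may matter when most sentences are much shorter than 128 tokens.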

##### RoBERTa Model

This code fine-tunes a RoBERTa model for sentiment prediction and evaluates it on the validation set. Accuracy and a custom batch-wise F1 metric are tracked during training, and the training history is stored in the 'history' variable.

In [None]:
# Load RoBERTa tokenizer and model
model_name = "roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = TFRobertaForSequenceClassification.from_pretrained(model_name)

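# Freeze all but the last four top-level Keras layers.
# Note: TFRobertaForSequenceClassification exposes only a few top-level layers,
# so this slice may be empty and leave the whole model trainable.
for layer in model.layers[:-4]:
    layer.trainable = False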

# Add dropout and regularization
classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(num_classes, activation='softmax', kernel_regularizer=tf.keras.regularizers.l2(0.01))
])

# Combine the RoBERTa model and the classifier
input_ids = tf.keras.layers.Input(shape=(128,), dtype=tf.int32)
outputs = model(input_ids)[0]  # Output from RoBERTa
predictions = classifier(outputs)

# Define custom metrics
def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2 * ((precision * recall) / (precision + recall + K.epsilon()))

# Create the combined model
classification_model = tf.keras.Model(inputs=input_ids, outputs=predictions)

# Compile the model
classification_model.compile(optimizer='adam',
                              loss='categorical_crossentropy',
                              metrics=['accuracy', f1_m])

# Train the model on the training data
history = classification_model.fit(train_dataset, validation_data=val_dataset, epochs=5)

# Evaluate the model on the validation data
score = classification_model.evaluate(val_dataset)

# Print the evaluation results
print("Validation Loss:", score[0])
print("Validation Accuracy:", score[1])