### Introduction

In this project, we develop and evaluate a sentiment analysis model trained on financial articles about specific companies. Our goal is to predict the sentiment of these articles (positive or negative) and see how these predictions correlate with the stock performance of a relevant Exchange Traded Fund (ETF). We use the Hugging Face `transformers` library to fine-tune pre-trained language models such as DistilBERT for this task.

The process begins with data loading and preprocessing, followed by training the model on a portion of the data. We then evaluate the model on a separate test set using F1 score, precision, recall, and accuracy, and analyze the results with a confusion matrix and classification report to understand how well the model captures sentiment in financial texts. We also explore training variations, such as adjusting the text input length and enabling mixed precision training to reduce memory use.


This code cell sets up the environment and imports necessary libraries for data processing, model training, and evaluation. We use `pandas` for data handling, `scikit-learn` for splitting data and computing evaluation metrics, and `transformers` for leveraging pre-trained language models. The `TEXT_COLUMN` variable is defined to specify whether we are working with summaries or full articles for sentiment analysis.

To run this code, you may need to install the following libraries:
- `pandas`: `pip install pandas`
- `scikit-learn`: `pip install scikit-learn`
- `transformers`: `pip install transformers`
- `torch`: `pip install torch`
- `matplotlib`: `pip install matplotlib`


In [None]:
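import os
# Debugging aid: makes CUDA kernel launches synchronous so errors surface at the failing call.
# It slows training, so consider removing it once the pipeline runs cleanly.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'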

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, f1_score, recall_score, precision_score, accuracy_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from transformers import pipeline
import torch
import matplotlib.pyplot as plt

# TEXT_COLUMN = "Summary"
TEXT_COLUMN = "full_article"

These functions handle data preprocessing and splitting for model training and evaluation. The `load_and_clean_data` function reads the input CSV file, removes rows with missing text data, and converts sentiment labels to numeric format (`POSITIVE` to 1, `NEGATIVE` to 0). The `split_data` function divides the cleaned data into training (70%), validation (15%), and test (15%) sets using a fixed random seed for reproducibility. Note that the split is random rather than stratified, so class proportions can differ slightly between the three sets.
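
If keeping the positive/negative ratio identical across the three sets matters, a stratified variant of the split is one option. The sketch below is an illustration under stated assumptions, not the method used in this notebook: `stratified_split` is a hypothetical helper, the toy data is made up, and the only library feature relied on is the `stratify` argument of scikit-learn's `train_test_split`.

```python
from sklearn.model_selection import train_test_split

def stratified_split(texts, labels, seed=42):
    """Hypothetical 70/15/15 split that preserves the class ratio in each set."""
    X_train, X_temp, y_train, y_temp = train_test_split(
        texts, labels, test_size=0.3, random_state=seed, stratify=labels
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=0.5, random_state=seed, stratify=y_temp
    )
    return X_train, X_val, X_test, y_train, y_val, y_test

# Toy data (made up): 200 texts, 60% positive
texts = [f"article {i}" for i in range(200)]
labels = [1] * 120 + [0] * 80
X_train, X_val, X_test, y_train, y_val, y_test = stratified_split(texts, labels)

# Report set sizes and the share of positive labels in each set
for name, y in (("train", y_train), ("val", y_val), ("test", y_test)):
    print(name, len(y), sum(y) / len(y))
```

With stratification, each set should show the same share of positive labels as the full toy dataset.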


In [None]:
def load_and_clean_data(file_path):
    """
    Load and clean the data.
    Parameters:
    - file_path: Path to the CSV file
    Returns: Cleaned DataFrame with relevant columns
    """
    df = pd.read_csv(file_path)
    # Drop rows that have no text in the selected column
    df = df.dropna(subset=[TEXT_COLUMN])
    # Convert string labels to integers (POSITIVE -> 1, NEGATIVE -> 0)
    df['label'] = df['label'].map({'POSITIVE': 1, 'NEGATIVE': 0})
    return df

def split_data(df):
    """
    Split data into training, validation, and testing sets (70/15/15).
    Parameters:
    - df: DataFrame with the cleaned data
    Returns: Splits of training, validation, and testing sets
    """
    X_train, X_temp, y_train, y_temp = train_test_split(df[TEXT_COLUMN], df['label'], test_size=0.3, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
    return X_train, X_val, X_test, y_train, y_val, y_test

The `train_model` function is designed to build and train a sentiment analysis model using a specified pre-trained transformer model from Hugging Face. The function accepts several key parameters, including `model_name` (the name of the pre-trained model to be used), the training and validation datasets (`X_train`, `y_train`, `X_val`, `y_val`), and various hyperparameters such as `epochs`, `batch_size`, `warmup_steps`, and `weight_decay`.

The function starts by initializing the tokenizer and model using the provided `model_name`. Depending on whether the input text is a summary or a full article, it tokenizes the data, truncating and padding sequences as necessary. A custom `SentimentDataset` class is defined to create PyTorch datasets for training and validation, which are then passed to a `Trainer` instance to manage the training process.

The `TrainingArguments` class is used to configure various aspects of the training, including:
- `output_dir`: Specifies where the trained model and other outputs will be saved.
- `num_train_epochs`: Defines the number of epochs for training.
- `per_device_train_batch_size` and `per_device_eval_batch_size`: Set the batch sizes for training and evaluation, respectively.
- `warmup_steps`: The number of warmup steps for the learning rate scheduler.
- `weight_decay`: The strength of weight decay to prevent overfitting.
- `logging_dir` and `logging_steps`: Control where and how frequently training logs are stored.
- `evaluation_strategy`: Determines when to evaluate the model, in this case, at the end of each epoch.
- `fp16`: Enables mixed precision training, which reduces memory usage and can speed up training on a CUDA GPU. This matters most when using full articles, where sequences are longer.
- `learning_rate`: Specifies the learning rate for model optimization.
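
To make the interaction between `warmup_steps` and `learning_rate` concrete, the sketch below computes a linear warmup followed by a linear decay, which we understand to be the default schedule shape used by `Trainer`. It is a hand-written illustration rather than the library's code: `lr_at_step` is a hypothetical helper, and `total_steps=21000` is an assumed value chosen only for the example.

```python
def lr_at_step(step, base_lr=1e-5, warmup_steps=500, total_steps=21000):
    """Learning rate under linear warmup then linear decay (illustrative helper)."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup phase
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Sample the schedule at the start, mid-warmup, peak, mid-decay, and end
for step in (0, 250, 500, 10750, 21000):
    print(step, lr_at_step(step))
```

The rate peaks at `learning_rate` exactly when warmup ends, so a larger `warmup_steps` delays the point at which the model trains at full speed.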

After training, the model and tokenizer are saved for future use. The function also returns the training history, which includes detailed logs of the training process, allowing for analysis and troubleshooting.


In [None]:
def train_model(model_name, X_train, y_train, X_val, y_val, epochs=10, batch_size=8, warmup_steps=500, weight_decay=0.01):
    """
    Build and train the model.
    Parameters:
    - model_name: Name of the pre-trained model from Hugging Face
    - X_train: Training data
    - y_train: Training labels
    - X_val: Validation data
    - y_val: Validation labels
    - epochs: Number of training epochs (default: 10)
    - batch_size: Batch size for training and evaluation (default: 8)
    - warmup_steps: Number of warmup steps for learning rate scheduler (default: 500)
    - weight_decay: Strength of weight decay (default: 0.01)
    Returns: Trained model, tokenizer, and training history
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    if TEXT_COLUMN == "full_article":
        train_encodings = tokenizer(X_train.tolist(), truncation=True, padding=True, max_length=256)
        val_encodings = tokenizer(X_val.tolist(), truncation=True, padding=True, max_length=256)
    else:
        train_encodings = tokenizer(X_train.tolist(), truncation=True, padding=True)
        val_encodings = tokenizer(X_val.tolist(), truncation=True, padding=True)

    class SentimentDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels
        def __getitem__(self, idx):
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item['labels'] = torch.tensor(self.labels[idx])
            return item
        def __len__(self):
            return len(self.labels)

    train_dataset = SentimentDataset(train_encodings, y_train.tolist())
    val_dataset = SentimentDataset(val_encodings, y_val.tolist())
    # Mixed precision is enabled only for full articles, and only when a CUDA GPU is available
    fp16 = TEXT_COLUMN == "full_article" and torch.cuda.is_available()
    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        warmup_steps=warmup_steps,
        weight_decay=weight_decay,
        logging_dir='./logs',
        logging_steps=10,
        evaluation_strategy="epoch",
        fp16=fp16,
        learning_rate=1e-5
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset
    )
    train_result = trainer.train()
    model_save_path = f'./{model_name}_{TEXT_COLUMN}_sentiment_model_high_relevance'
    model.save_pretrained(model_save_path)
    tokenizer.save_pretrained(model_save_path)
    training_history = trainer.state.log_history
    return model, tokenizer, training_history

The `evaluate_model` function is designed to assess the performance of a trained sentiment analysis model on a test dataset. It takes several parameters:

- `model`: The trained model to be evaluated.
- `tokenizer`: The tokenizer used with the model for text processing.
- `X_test`: The test data containing the input texts.
- `y_test`: The true labels for the test data.
- `batch_size`: The batch size used during evaluation (default is 8).

The function begins by moving the model to the appropriate device (GPU if available, otherwise CPU) to optimize evaluation performance. It then processes the test data in batches to manage memory usage effectively, particularly when working with large datasets. Each batch of test data is tokenized and passed through the model to generate predictions.

The predictions are then converted to class labels (0 or 1), which are compared to the true labels (`y_test`) to compute various evaluation metrics. The function calculates the confusion matrix and generates a classification report, providing detailed insights into the model's performance across different classes. Additionally, key performance metrics such as F1 score, recall, precision, and accuracy are computed and returned.

These metrics allow for a comprehensive evaluation of the model's effectiveness in predicting sentiment, helping to identify areas of strength and potential improvement.
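
The batching and label-conversion steps can be illustrated without a model. The sketch below is a toy stand-in under stated assumptions: `iter_batches` and `logits_to_labels` are hypothetical helpers, and the logits are made-up numbers rather than model output.

```python
def iter_batches(items, batch_size=8):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def logits_to_labels(logits):
    """Pick the index of the largest logit in each row (argmax over classes)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# 19 placeholder texts split into batches of at most 8
texts = [f"article {i}" for i in range(19)]
batch_sizes = [len(batch) for batch in iter_batches(texts)]
print(batch_sizes)

# Made-up logits for three texts; columns are [negative, positive]
fake_logits = [[2.1, -0.3], [-1.0, 0.4], [0.2, 0.9]]
labels = logits_to_labels(fake_logits)
print(labels)
```

The last batch is simply shorter than the others, which is why the evaluation loop needs no special handling for dataset sizes that are not a multiple of `batch_size`.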


In [5]:
def evaluate_model(model, tokenizer, X_test, y_test, batch_size=8):
    """
    Evaluate the model on the test set.
    Parameters:
    - model: Trained model
    - tokenizer: Tokenizer used with the model
    - X_test: Test data
    - y_test: Test labels
    - batch_size: Batch size used during evaluation (default: 8)
    Returns: Confusion matrix, classification report, and performance metrics (f1 score, recall, precision, accuracy)
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()  # disable dropout so predictions are deterministic
    predictions = []
    for i in range(0, len(X_test), batch_size):
        batch = X_test[i:i + batch_size]
        # Truncation here defaults to the model's maximum length, whereas training capped full articles at 256 tokens
        batch_encodings = tokenizer(batch.tolist(), truncation=True, padding=True, return_tensors="pt").to(device)
        
        with torch.no_grad():
            outputs = model(**batch_encodings)
        
        batch_predictions = torch.argmax(outputs.logits, dim=1).cpu().numpy()
        predictions.extend(batch_predictions)

    y_pred = [1 if pred == 1 else 0 for pred in predictions]
    cm = confusion_matrix(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    
    return cm, report, f1, recall, precision, accuracy

In this section, we bring together the functions defined earlier to build, train, and evaluate a sentiment analysis model using a pre-trained transformer. The process begins by loading and cleaning the dataset using the `load_and_clean_data` function, which reads the data from the specified file path and prepares it for training. If the input text is the full article (`TEXT_COLUMN` set to `"full_article"`), a sample of 80% of the data is taken to reduce the dataset size and optimize training time.

Next, the dataset is split into training, validation, and test sets using the `split_data` function. These splits ensure that the model is trained on one portion of the data, validated on another during training, and finally evaluated on a separate test set.

A list of pre-trained model names is provided, and in this example, we select the first model, `distilbert-base-uncased`, for training. The `train_model` function is then called with the selected model and the training and validation datasets. This function handles the entire training process, including setting up the tokenizer, configuring training arguments, and saving the trained model and tokenizer.

Finally, the model is evaluated on the test set using the `evaluate_model` function defined earlier (the call appears in a later cell), which calculates various performance metrics to assess how well the model predicts sentiment in unseen data.


In [8]:
file_path = "Full_Data_Sentiment_Analysis_LLM_Output_ETF_value_Label.csv"

df = load_and_clean_data(file_path)
if TEXT_COLUMN == "full_article":
    # Fixed seed so the 80% sample is the same on every run
    df = df.sample(frac=0.8, random_state=42)

X_train, X_val, X_test, y_train, y_val, y_test = split_data(df)

model_names = [
    "distilbert-base-uncased",
#         "bert-base-uncased",
#         "roberta-base",
#         "albert-base-v2",
#         "xlnet-base-cased"
]

model_name = model_names[0]
print(f"Training and evaluating model: {model_name}")
model, tokenizer, training_history = train_model(model_name, X_train, y_train, X_val, y_val)

Training and evaluating model: distilbert-base-uncased


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifi

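| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.6969        | 0.692704        |
| 2     | 0.6462        | 0.671218        |
| 3     | 0.501         | 0.594909        |
| 4     | 0.4256        | 0.618861        |
| 5     | 0.4647        | 0.940498        |
| 6     | 0.2666        | 1.265939        |
| 7     | 0.1775        | 1.458818        |
| 8     | 0.0276        | 1.598887        |
| 9     | 0.0281        | 1.672779        |
| 10    | 0.1236        | 1.71754         |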


Saving model checkpoint to ./results\checkpoint-500
Configuration saved in ./results\checkpoint-500\config.json
Model weights saved in ./results\checkpoint-500\pytorch_model.bin
Saving model checkpoint to ./results\checkpoint-1000
Configuration saved in ./results\checkpoint-1000\config.json
Model weights saved in ./results\checkpoint-1000\pytorch_model.bin
Saving model checkpoint to ./results\checkpoint-1500
Configuration saved in ./results\checkpoint-1500\config.json
Model weights saved in ./results\checkpoint-1500\pytorch_model.bin
Saving model checkpoint to ./results\checkpoint-2000
Configuration saved in ./results\checkpoint-2000\config.json
Model weights saved in ./results\checkpoint-2000\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3667
  Batch size = 8
Saving model checkpoint to ./results\checkpoint-2500
Configuration saved in ./results\checkpoint-2500\config.json
Model weights saved in ./results\checkpoint-2500\pytorch_model.bin
Saving model checkpoint to .

Model weights saved in ./results\checkpoint-21000\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 3667
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)


Configuration saved in ./distilbert-base-uncased_full_article_sentiment_model_high_relevance\config.json
Model weights saved in ./distilbert-base-uncased_full_article_sentiment_model_high_relevance\pytorch_model.bin
tokenizer config file saved in ./distilbert-base-uncased_full_article_sentiment_model_high_relevance\tokenizer_config.json
Special tokens file saved in ./distilbert-base-uncased_full_article_sentiment_model_high_relevance\special_tokens_map.json
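
The per-epoch losses reported in the training output show validation loss reaching its minimum early and then rising while training loss keeps falling, a typical sign of overfitting. A minimal sketch of locating that epoch from the returned `training_history` follows. It assumes that evaluation entries in the history are dictionaries carrying `epoch` and `eval_loss` keys; the list here is rebuilt by hand from the reported validation losses rather than read from a real run, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(history):
    """Return (epoch, eval_loss) for the evaluation entry with the lowest loss."""
    evals = [entry for entry in history if "eval_loss" in entry]
    best = min(evals, key=lambda entry: entry["eval_loss"])
    return best["epoch"], best["eval_loss"]

# Validation losses copied from the training output, epochs 1 to 10
val_losses = [0.692704, 0.671218, 0.594909, 0.618861, 0.940498,
              1.265939, 1.458818, 1.598887, 1.672779, 1.71754]
history = [{"epoch": float(i), "eval_loss": loss}
           for i, loss in enumerate(val_losses, start=1)]
# Training-step entries lack "eval_loss" and are skipped by the filter
history.insert(0, {"epoch": 0.5, "loss": 0.6969})

result = best_epoch(history)
print(result)
```

On these numbers the minimum falls at epoch 3, so a checkpoint saved near that point may generalize better than the final weights.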


After training the model, we evaluate its performance on the test dataset using the `evaluate_model` function. This function returns several key metrics, including the confusion matrix, classification report, F1 score, recall, precision, and accuracy. These metrics provide a comprehensive understanding of how well the model performs in predicting sentiment.

The confusion matrix shows the number of correct and incorrect predictions for each class, helping to visualize the model's strengths and weaknesses. The classification report provides additional details, including precision, recall, and F1 scores for each class. These metrics are critical for understanding the balance between precision and recall and how well the model handles both positive and negative sentiments.

Finally, the overall F1 score, recall, precision, and accuracy are printed as a quick summary of the model's performance. These results indicate how well the model generalizes to unseen articles and guide adjustments in future iterations.


In [9]:
cm, report, f1, recall, precision, accuracy = evaluate_model(model, tokenizer, X_test, y_test)
print(f"Confusion Matrix for {model_name}:\n", cm)
print(f"Classification Report for {model_name}:\n", report)
print(f"F1 Score for {model_name}: {f1}")
print(f"Recall for {model_name}: {recall}")
print(f"Precision for {model_name}: {precision}")
print(f"Accuracy for {model_name}: {accuracy}")


Confusion Matrix for distilbert-base-uncased:
 [[1215  490]
 [ 441 1522]]
Classification Report for distilbert-base-uncased:
               precision    recall  f1-score   support

           0       0.73      0.71      0.72      1705
           1       0.76      0.78      0.77      1963

    accuracy                           0.75      3668
   macro avg       0.75      0.74      0.74      3668
weighted avg       0.75      0.75      0.75      3668

F1 Score for distilbert-base-uncased: 0.7657861635220126
Recall for distilbert-base-uncased: 0.7753438614365766
Precision for distilbert-base-uncased: 0.7564612326043738
Accuracy for distilbert-base-uncased: 0.7461832061068703
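
The summary metrics can be re-derived by hand from the confusion matrix, which is a useful sanity check. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, class 1 is the positive class); `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(cm):
    """Compute precision, recall, F1 and accuracy for the positive class."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of true positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# Confusion matrix copied from the output above
cm = [[1215, 490], [441, 1522]]
precision, recall, f1, accuracy = binary_metrics(cm)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The values agree with the printed scores, which confirms that the reported F1, recall, and precision refer to the positive class only, while the classification report also lists the negative class.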




In [12]:
model_save_name = f"financial_sentiment_analysis_model_{model_name}.pt"
torch.save(model.state_dict(), model_save_name)

This block of code initializes the model architecture using the pre-trained transformer specified by `model_name` and prepares it for classification with two output labels. The model's state is then loaded from a previously saved file, restoring the trained parameters. Finally, the model is moved to the appropriate device (either GPU if available or CPU) to ensure optimal performance during inference or further training.


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# map_location lets weights saved on a GPU load on a CPU-only machine
model.load_state_dict(torch.load(model_save_name, map_location=device))
model.to(device)
