# Text Mining Project - Stock Sentiment - Final Notebook

## *Predicting market behavior from tweets*

### Group 42

Carolina Pinto - 20240494 <br>
Fábio dos Santos - 20240678 <br>
Guilherme Cordeiro - 20240527 <br>
Mariana Sousa - 20240516 <br>

Remarks: <br>
- This notebook is meant to be run in Google Colab.
- It assumes that the train.csv and test.csv datasets are stored in your Google Drive.
- When loading the datasets, adapt the file paths to where the datasets are located.

## Table of Contents
- [1. Import Libraries](#1-import-libraries)
- [2. Data Integration](#2-data-integration)
- [3. Corpus Split](#3-corpus-split)
- [4. Data Preprocessing](#4-data-preprocessing)
- [5. Model Training](#5-model-training)
- [6. Deployment](#6-deployment)

# 1. Import Libraries

`Step 1` Import the required libraries.

In [None]:
# Get our custom functions
from tm_utils_42 import *

# Library for Google Drive connection
from google.colab import drive

# Libraries for data manipulation
import pandas as pd
import numpy as np
from copy import deepcopy

# Libraries for corpus split and preprocessing
from sklearn.model_selection import train_test_split
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer, SnowballStemmer
from nltk.tokenize import word_tokenize
from collections import Counter
import re
import string
from tqdm import tqdm
from transformers import pipeline
import torch

# Libraries for the BART model
from transformers import AutoModel
import torch.nn as nn
from torch.nn import CrossEntropyLoss
from transformers import AutoTokenizer
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, ConfusionMatrixDisplay


# Download required NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

# 2. Data Integration

Our best model is BART. Because it is computationally expensive to run it on a laptop without a GPU, our group ran this notebook in Google Colab.

`Step 2` Set up the notebook to run in Google Colab.

In [2]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Sat Jun 14 23:15:12 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   42C    P8              9W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


Make sure the train and test datasets are in your Google Drive, then adapt the file paths in the following code cell to where the datasets are actually stored.

`Step 3` Import the datasets __train.csv__ and __test.csv__ using the method **read_csv()** from pandas.

In [4]:
df_train = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/train.csv')
df_test = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/test.csv')
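
Since the dataset location has to be adapted by each user, one option is to keep it in a single variable and check that both files exist before loading them. The sketch below is only an illustration: `DATA_DIR` and `dataset_paths` are hypothetical names that do not appear elsewhere in this notebook, and the folder shown is simply the one used in the cell above.

```python
from pathlib import Path

# Hypothetical constant: adapt this folder to your own Google Drive layout.
DATA_DIR = Path("/content/drive/MyDrive/Colab_Notebooks")

def dataset_paths(data_dir: Path) -> dict:
    """Build the expected paths of train.csv and test.csv inside data_dir."""
    return {name: data_dir / f"{name}.csv" for name in ("train", "test")}

paths = dataset_paths(DATA_DIR)

# Report missing files early instead of failing later inside pd.read_csv.
missing = [str(p) for p in paths.values() if not p.exists()]
if missing:
    print("Missing files, adapt DATA_DIR:", missing)
```

The resulting paths could then be passed to `pd.read_csv(paths["train"])` and `pd.read_csv(paths["test"])`.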

`Step 4` Check the first 10 rows of the datasets to verify that the import occurred successfully.

In [5]:
df_train.head(10)

Unnamed: 0,text,label
0,$BYND - JPMorgan reels in expectations on Beyo...,0
1,$CCL $RCL - Nomura points to bookings weakness...,0
2,"$CX - Cemex cut at Credit Suisse, J.P. Morgan ...",0
3,$ESS: BTIG Research cuts to Neutral https://t....,0
4,$FNKO - Funko slides after Piper Jaffray PT cu...,0
5,$FTI - TechnipFMC downgraded at Berenberg but ...,0
6,$GM - GM loses a bull https://t.co/tdUfG5HbXy,0
7,$GM: Deutsche Bank cuts to Hold https://t.co/7...,0
8,$GTT: Cowen cuts to Market Perform,0
9,$HNHAF $HNHPD $AAPL - Trendforce cuts iPhone e...,0


In [6]:
df_test.head(10)

Unnamed: 0,id,text
0,0,ETF assets to surge tenfold in 10 years to $50...
1,1,Here’s What Hedge Funds Think Evolution Petrol...
2,2,$PVH - Phillips-Van Heusen Q3 2020 Earnings Pr...
3,3,China is in the process of waiving retaliatory...
4,4,"Highlight: “When growth is scarce, investors s..."
5,5,Marvell Technology (MRVL) Gains As Market Dips...
6,6,UPDATE 1-Italian airline Alitalia's rescue in ...
7,7,why macro funds are shutting down left and rig...
8,8,Uber's arrival caused binge drinking to increa...
9,9,New Dungeons & Dragons game announced


# 3. Corpus Split

Since our corpus has fewer than 10,000 rows, we split it into train, validation and test sets in an 80%/10%/10% split.

`Step 5` Create a copy of the original dataframe named **data_train**.

In [7]:
data_train = deepcopy(df_train)
data_train

Unnamed: 0,text,label
0,$BYND - JPMorgan reels in expectations on Beyo...,0
1,$CCL $RCL - Nomura points to bookings weakness...,0
2,"$CX - Cemex cut at Credit Suisse, J.P. Morgan ...",0
3,$ESS: BTIG Research cuts to Neutral https://t....,0
4,$FNKO - Funko slides after Piper Jaffray PT cu...,0
...,...,...
9538,The Week's Gainers and Losers on the Stoxx Eur...,2
9539,Tupperware Brands among consumer gainers; Unil...,2
9540,vTv Therapeutics leads healthcare gainers; Myo...,2
9541,"WORK, XPO, PYX and AMKR among after hour movers",2


__`Step 6`__ Create a variable `X` that stores the values of the input features, and a variable `y` that stores the values of the target feature.

In [8]:
X = data_train.drop(columns=['label'])
y = data_train['label']

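__`Step 7`__ Split the data into train, validation and test sets in an 80/10/10 split, with random_state = 42, stratification by the target, and shuffling of the dataset. The split is done in two steps: first 10% of the data is held out as the test set, then the remaining 90% is split with `test_size=1/9`, so that the validation set is also 10% of the full corpus.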

In [9]:
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y,
                                                    test_size=0.1,
                                                    random_state=42,
                                                    stratify=y,
                                                    shuffle=True
                                                    )

In [10]:
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val,
                                                    test_size=1/9,
                                                    random_state=42,
                                                    stratify=y_train_val,
                                                    shuffle=True
                                                    )
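
The arithmetic behind the two calls above can be checked with a short sketch. It assumes a corpus of 9,543 rows (the sum of the three progress-bar totals reported in the preprocessing outputs, 7633 + 955 + 955) and assumes that `train_test_split` rounds a fractional test size up to the next integer. `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_rows: int, test_fraction: float = 0.1, val_fraction_of_rest: float = 1 / 9):
    """Return (n_train, n_val, n_test) for a two-step hold-out split."""
    # Step 1: hold out the test set from the full corpus (rounded up, by assumption).
    n_test = math.ceil(test_fraction * n_rows)
    n_rest = n_rows - n_test
    # Step 2: carve the validation set out of the remaining rows.
    n_val = math.ceil(val_fraction_of_rest * n_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(9543)
print(n_train, n_val, n_test)  # 7633 955 955
```

Taking 1/9 of the remaining 90% gives roughly 10% of the full corpus, which is why the second call does not use `test_size=0.1`.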

# 4. Data Preprocessing

__`Step 8`__ Create a function to do the data preprocessing. It includes:

| Preprocessing step |
|------------------------------------|
| Lowercasing |
| Remove emojis |
| Remove unknown character � |
| Remove unwanted regular expressions (URLs, mentions, hashtags, digits) |
| Remove punctuation |
| Tokenization |
| Remove stop words |
| Lemmatization (optional) |
| Stemming (optional) |
In [11]:
lemma = WordNetLemmatizer()
stemmer = SnowballStemmer('english')

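def preprocess(text_list, lemma = None, stemmer = None, word2vec=False):
    """
    Return the preprocessed texts in a list "updates".

    Parameters:
    text_list : iterable of str
        Texts to be preprocessed.
    lemma : object with a lemmatize() method, optional
        Lemmatizer (e.g. WordNetLemmatizer()) applied to each token.
        Default is None (no lemmatization).
    stemmer : object with a stem() method, optional
        Stemmer (e.g. SnowballStemmer('english')) applied to each token.
        Default is None (no stemming).
    word2vec : bool, optional
        If True, each text is returned as a list of tokens; otherwise the
        tokens are joined into a single string. Default is False.
    """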

    stop_words = set(stopwords.words('english'))

    updates = []

    for j in tqdm(text_list):

        text = j

        # Lower case text
        text = text.lower()

        # Remove emojis
        text = re.sub(r'[\U00010000-\U0010ffff]', '', text)

        # Remove unknown character �
        text = text.replace("�", "")

        # Remove Regular Unwanted Expressions
        text = re.sub(r"http\S+|www\S+|https\S+", '', text)
        text = re.sub(r'@\w+|#\w+', '', text)
        text = re.sub(r'\d+', '', text)

        # Remove Punctuation
        text = re.sub(rf"[{re.escape(string.punctuation)}]", '', text)

        # Tokenize the text
        tokens = word_tokenize(text)

        #Remove Stopwords
        tokens = [word for word in tokens if word not in stop_words]

        #Lemmatize
        if lemma:
            tokens = [lemma.lemmatize(word) for word in tokens]

        #Stemming
        if stemmer:
            tokens = [stemmer.stem(word) for word in tokens]

        # Rejoin tokens

        if word2vec:
            cleaned_text=tokens
        else:
            cleaned_text = " ".join(tokens)

        updates.append(cleaned_text)

    return updates

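We chose lemmatization over stemming to reduce the vocabulary size, and because it was the technique used most often in class. Note, however, that `preprocess` only lemmatizes when a lemmatizer is passed through the `lemma` argument (for example `preprocess(texts, lemma=lemma)`); the calls in the following steps rely on the defaults, so the tokens shown in the outputs are neither lemmatized nor stemmed.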
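
To make the effect of the regex-based cleaning steps easier to follow, here is a minimal sketch that applies them to a single tweet-like string. It reuses the patterns from `preprocess`, but swaps NLTK tokenization for a plain whitespace split and skips stop-word removal, lemmatization and stemming, so it is not equivalent to the full function. `clean_basic` and the sample text (including its URL) are hypothetical.

```python
import re
import string

def clean_basic(text: str) -> str:
    """Apply only the regex-based cleaning steps of the pipeline to one string."""
    text = text.lower()
    text = re.sub(r'[\U00010000-\U0010ffff]', '', text)          # emojis and other astral characters
    text = text.replace("\ufffd", "")                            # unknown character
    text = re.sub(r"http\S+|www\S+|https\S+", '', text)          # URLs
    text = re.sub(r'@\w+|#\w+', '', text)                        # mentions and hashtags
    text = re.sub(r'\d+', '', text)                              # digits
    text = re.sub(rf"[{re.escape(string.punctuation)}]", '', text)  # ASCII punctuation
    # Simplification: whitespace split instead of nltk.word_tokenize.
    return " ".join(text.split())

sample = "$GM: Deutsche Bank cuts to Hold https://t.co/abc123 #stocks 🚀"
cleaned = clean_basic(sample)
print(cleaned)  # gm deutsche bank cuts to hold
```

Note that hashtags are removed together with their text, and that the ticker symbol survives only as a plain word once `$` is stripped.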

__`Step 9`__ Apply the preprocessing to X_train and X_val.

In [12]:
X_train['tokens'] = preprocess(X_train['text'])
X_train

100%|██████████| 7633/7633 [00:00<00:00, 9302.07it/s]


Unnamed: 0,text,tokens
447,Brazil's central bank stepped in to prop up th...,brazils central bank stepped prop currency
5073,Singapore Frees Listed Local Developers From H...,singapore frees listed local developers homesa...
5941,$RPAY - Repay Holdings buys Ventanex for up to...,rpay repay holdings buys ventanex
5479,WHO Pushes Countries to Share More Patient Det...,pushes countries share patient details combat ...
4654,How clean hydrogen could make the steel indust...,clean hydrogen could make steel industry less ...
...,...,...
6226,JPMorgan anticipates ‘disorderly’ year-end fun...,jpmorgan anticipates ‘ disorderly ’ yearend fu...
9186,$IMMU (+3.2% pre) FDA GRANTS FAST TRACK DESIGN...,immu pre fda grants fast track designation sac...
3590,Hero MotoCorp Q3 Results: Profit Beats Estimat...,hero motocorp q results profit beats estimates...
1261,Applied DNA Announces Issuance of U.S. Patent ...,applied dna announces issuance us patent prote...
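
The output above still contains typographic quotes (‘ and ’) as separate tokens, because `string.punctuation` only covers ASCII punctuation. A possible extension, which is not part of this notebook's pipeline, is to drop characters by Unicode category instead. The sketch below assumes that removing every punctuation and symbol character is acceptable; `strip_unicode_punct` and the sample string are hypothetical.

```python
import unicodedata

def strip_unicode_punct(text: str) -> str:
    """Drop every character whose Unicode category is punctuation (P*) or symbol (S*)."""
    kept = [ch for ch in text if unicodedata.category(ch)[0] not in ("P", "S")]
    # Collapse the whitespace left behind by the removed characters.
    return " ".join("".join(kept).split())

example = "“growth” is ‘disorderly’ at year-end"
stripped = strip_unicode_punct(example)
print(stripped)  # growth is disorderly at yearend
```

Applying such a change would alter the model inputs, so the metrics reported later in this notebook would no longer correspond to it.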


In [13]:
X_val['tokens'] = preprocess(X_val['text'])
X_val

100%|██████████| 955/955 [00:00<00:00, 6518.85it/s]


Unnamed: 0,text,tokens
742,What the Fed meeting minutes could say about i...,fed meeting minutes could say interest rates p...
1218,Alibaba's books close early in $13.4 billion H...,alibabas books close early billion hong kong l...
497,Bank of Japan : Accounts (March 20) #BankofJap...,bank japan accounts march
4430,Europe's richest man is spending $1 billion on...,europes richest man spending billion departmen...
5829,$EFX - Four Chinese military hackers charged i...,efx four chinese military hackers charged equi...
...,...,...
6615,$AMTD: TD Ameritrade Investor Movement Index: ...,amtd td ameritrade investor movement index imx...
6387,President Trump reportedly walks away from vap...,president trump reportedly walks away vaping ban
8125,Why Hecla Mining Is a Buy,hecla mining buy
3209,News Highlights : Top Energy News of the Day #...,news highlights top energy news day


In [14]:
X_train_cleaned=preprocess(X_train['text'])
X_val_cleaned=preprocess(X_val['text'])
X_test_cleaned=preprocess(X_test['text'])

100%|██████████| 7633/7633 [00:01<00:00, 6412.80it/s]
100%|██████████| 955/955 [00:00<00:00, 6364.35it/s]
100%|██████████| 955/955 [00:00<00:00, 6729.69it/s]


__`Step 10`__ Apply the preprocessing to df_test.

In [15]:
df_test['tokens'] = preprocess(df_test['text'])
df_test

100%|██████████| 2388/2388 [00:00<00:00, 6667.74it/s]


Unnamed: 0,id,text,tokens
0,0,ETF assets to surge tenfold in 10 years to $50...,etf assets surge tenfold years trillion bank a...
1,1,Here’s What Hedge Funds Think Evolution Petrol...,’ hedge funds think evolution petroleum corpor...
2,2,$PVH - Phillips-Van Heusen Q3 2020 Earnings Pr...,pvh phillipsvan heusen q earnings preview
3,3,China is in the process of waiving retaliatory...,china process waiving retaliatory tariffs impo...
4,4,"Highlight: “When growth is scarce, investors s...",highlight “ growth scarce investors seem willi...
...,...,...,...
2383,2383,$IVC - Invacare Corporation (IVC) CEO Matthew ...,ivc invacare corporation ivc ceo matthew monag...
2384,2384,"Domtar EPS misses by $0.05, revenue in-line",domtar eps misses revenue inline
2385,2385,India Plans Incentives to Bring In Foreign Man...,india plans incentives bring foreign manufactu...
2386,2386,$NVCR shows institutional accumulation with bl...,nvcr shows institutional accumulation blue sky...


In [16]:
df_test_cleaned=preprocess(df_test['text'])

100%|██████████| 2388/2388 [00:00<00:00, 6486.94it/s]


# 5. Model Training

We start by training BART on the train dataset before predicting labels for the test dataset.

__`Step 11`__ Set Up the Model (BART + Custom Classifier).

In [None]:
class BARTSentimentClassifier(nn.Module):
    def __init__(self, model_name: str = "facebook/bart-large", num_labels: int = 3, dropout: float = 0.1):
        super(BARTSentimentClassifier, self).__init__()
        self.bart = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bart.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bart(input_ids=input_ids, attention_mask=attention_mask, return_dict=True,)
        pooled = outputs.last_hidden_state[:, 0]
        logits = self.classifier(self.dropout(pooled))
        return logits
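
As a reading of the `forward` method above (not an additional modeling choice), the classification head can be written compactly. Here $H \in \mathbb{R}^{T \times d}$ stands for `last_hidden_state` of one example with $T$ tokens, and $d$ is `self.bart.config.hidden_size`, which should be 1024 for `facebook/bart-large`:

$$
h = H_{0,:} \in \mathbb{R}^{d}, \qquad
z = W \,\mathrm{Dropout}(h) + b, \qquad
W \in \mathbb{R}^{3 \times d},\; b \in \mathbb{R}^{3}
$$

The vector $z$ holds one logit per class (0, 1, 2). The predicted label is $\hat{y} = \arg\max_k z_k$, which is what the evaluation and inference cells compute with `torch.argmax`.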

__`Step 12`__ Tokenization and dataset preparation.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")

class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=max_length)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            "input_ids": torch.tensor(self.encodings["input_ids"][idx]),
            "attention_mask": torch.tensor(self.encodings["attention_mask"][idx]),
            "labels": torch.tensor(self.labels[idx])
        }

train_dataset = TextDataset(X_train_cleaned, y_train.tolist(), tokenizer)
val_dataset = TextDataset(X_val_cleaned, y_val.tolist(), tokenizer)
test_dataset = TextDataset(X_test_cleaned, y_test.tolist(), tokenizer)
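
The `truncation=True, padding=True` arguments make every sequence in a dataset the same length and produce an attention mask that marks the padded positions. The toy sketch below illustrates the idea without the Hugging Face tokenizer: the token ids are made up, `pad_id=1` is an assumption about BART's padding id, and the truncation simply cuts the list, whereas a real tokenizer preserves the special tokens. `pad_and_mask` is a hypothetical helper.

```python
def pad_and_mask(sequences, max_length=128, pad_id=1):
    """Truncate to max_length, pad to the longest sequence, and build the attention mask."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up token-id sequences of different lengths.
batch = pad_and_mask([[0, 713, 2], [0, 713, 16, 10, 2]], max_length=4)
print(batch["input_ids"])       # [[0, 713, 2, 1], [0, 713, 16, 10]]
print(batch["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```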

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



__`Step 13`__ Instantiate the model and the data loaders, and define the function that trains the classifier.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_bart = BARTSentimentClassifier().to(device)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)
test_loader = DataLoader(test_dataset, batch_size=16)

def train_transformer(train_loader, val_loader, model):

  optimizer = AdamW(model.parameters(), lr=2e-5)
  criterion = CrossEntropyLoss()

  for epoch in range(3):
      model.train()
      total_loss = 0
      for batch in train_loader:
          input_ids = batch["input_ids"].to(device)
          attention_mask = batch["attention_mask"].to(device)
          labels = batch["labels"].to(device)

          optimizer.zero_grad()
          outputs = model(input_ids, attention_mask)
          loss = criterion(outputs, labels)
          loss.backward()
          optimizer.step()

          total_loss += loss.item()

      print(f"Epoch {epoch + 1} — Loss: {total_loss / len(train_loader):.4f}")
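
# The loss printed per epoch is the mean of the batch losses, and each batch
# loss is the mean cross-entropy over its examples (the default reduction of
# CrossEntropyLoss). A plain-Python sketch of that quantity follows; the
# logits and labels are made-up values and `cross_entropy` is a hypothetical
# helper, not the PyTorch implementation.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class under a softmax over the logits."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]

# Made-up logits for a batch of two examples with three classes each.
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
batch_labels = [0, 2]

# Mean over the batch, mirroring the default "mean" reduction.
batch_loss = sum(cross_entropy(z, y) for z, y in zip(batch_logits, batch_labels)) / len(batch_labels)
print(round(batch_loss, 4))
```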


__`Step 14`__ Define the function to get the metrics of BART.

In [None]:
def get_metrics_transformers(data_loader, model):
  model.eval()
  all_preds, all_labels = [], []
  with torch.no_grad():
      for batch in data_loader:
          input_ids = batch["input_ids"].to(device)
          attention_mask = batch["attention_mask"].to(device)
          labels = batch["labels"].to(device)

          outputs = model(input_ids, attention_mask)
          preds = torch.argmax(outputs, dim=1)

          all_preds.extend(preds.cpu().numpy())
          all_labels.extend(labels.cpu().numpy())
  # Build the report once all batches have been collected
  report = classification_report(all_labels, all_preds, target_names=["0", "1", "2"], output_dict=True, digits=4)

  # Keep only precision, recall and F1 for each class and for the macro average
  filtered_report = {
      label: {
          "precision": report[label]["precision"],
          "recall": report[label]["recall"],
          "f1-score": report[label]["f1-score"]
      }
      for label in ["0", "1", "2", "macro avg"]
  }

  df_metrics = pd.DataFrame.from_dict(filtered_report, orient="index")
  return df_metrics
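
For readers unfamiliar with the "macro avg" row, the sketch below recomputes per-class precision, recall and F1 from two label lists and then averages them without class weights. It is a plain-Python illustration with made-up labels, not the scikit-learn implementation used above; `per_class_metrics` is a hypothetical helper.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred, labels):
    """Precision, recall and F1 per class, plus their unweighted (macro) mean."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    rows = {}
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        rows[c] = {"precision": prec, "recall": rec, "f1-score": f1}
    # Macro average: every class counts equally, regardless of its support.
    rows["macro avg"] = {
        k: sum(rows[c][k] for c in labels) / len(labels)
        for k in ("precision", "recall", "f1-score")
    }
    return rows

# Made-up labels for eight examples.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]
metrics = per_class_metrics(y_true, y_pred, labels=[0, 1, 2])
print(metrics["macro avg"])
```

Because the macro average weights all classes equally, weaker performance on a smaller class lowers it as much as the same drop on a larger class would.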

__`Step 15`__ Train BART.

In [21]:
train_transformer(train_loader, val_loader, model_bart)

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.58.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch 1 — Loss: 0.6502
Epoch 2 — Loss: 0.4285
Epoch 3 — Loss: 0.3282


__`Step 16`__ Get the BART metrics on the training set.

In [22]:
print(get_metrics_transformers(train_loader, model_bart))

           precision    recall  f1-score
0           0.926920  0.868284  0.896644
1           0.878602  0.932336  0.904672
2           0.964032  0.959935  0.961979
macro avg   0.923184  0.920185  0.921098


__`Step 17`__ Get the BART metrics on the validation set.

In [23]:
print(get_metrics_transformers(val_loader, model_bart))

           precision    recall  f1-score
0           0.768000  0.666667  0.713755
1           0.737374  0.756477  0.746803
2           0.890823  0.911003  0.900800
macro avg   0.798732  0.778049  0.787119


__`Step 18`__ Get the BART metrics on the held-out test split.

In [24]:
print(get_metrics_transformers(test_loader, model_bart))

           precision    recall  f1-score
0           0.776978  0.750000  0.763251
1           0.821622  0.787565  0.804233
2           0.900158  0.919094  0.909528
macro avg   0.832920  0.818886  0.825670


# 6. Deployment

__`Step 19`__ Make predictions on the test dataset (test.csv) and save them in a CSV file with the id and the label.

In [None]:
class InferenceDataset(Dataset):
    """Dataset that carries only the model inputs (no labels), for inference."""
    def __init__(self, texts, tokenizer, max_length=128):
        enc = tokenizer(
            texts,
            truncation=True,
            padding=True,
            max_length=max_length,
            return_tensors="pt"
        )
        self.input_ids = enc["input_ids"]
        self.attention_mask = enc["attention_mask"]

    def __len__(self):
        return self.input_ids.size(0)

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_mask[idx],
        }

# Build the dataset and loader
infer_ds      = InferenceDataset(df_test["tokens"].tolist(), tokenizer)
infer_loader  = DataLoader(infer_ds, batch_size=32)

# Run the model
model_bart.eval()
preds = []

with torch.no_grad():
    for batch in infer_loader:
        input_ids      = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        logits   = model_bart(input_ids, attention_mask)
        batch_pm = torch.argmax(logits, dim=1)
        preds.extend(batch_pm.cpu().tolist())

# Attach predictions & save
df_test["label"] = preds
out_cols = ["id", "label"]
df_test.to_csv("pred_42.csv", columns=out_cols, index=False)

print("Saved", len(df_test), "pred_42.csv")

success()

Saved 2388 pred_42.csv
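
Before submitting, the saved file can be sanity-checked. The sketch below is a hypothetical check (`check_submission` is not part of `tm_utils_42`): it assumes the file must contain exactly the columns `id` and `label`, one row per test example, and labels in {0, 1, 2}. It is demonstrated on a small temporary file rather than on `pred_42.csv` itself.

```python
import csv
import os
import tempfile

def check_submission(path, expected_rows, valid_labels=(0, 1, 2)):
    """Validate column names, row count and label range of a prediction file."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    if not rows or list(rows[0].keys()) != ["id", "label"]:
        raise ValueError("unexpected columns")
    if len(rows) != expected_rows:
        raise ValueError("unexpected number of rows")
    if any(int(row["label"]) not in valid_labels for row in rows):
        raise ValueError("label out of range")
    return len(rows)

# Demo on a tiny stand-in file with made-up predictions.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "pred_demo.csv")
    with open(demo_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "label"])
        writer.writerows([[0, 2], [1, 0], [2, 1]])
    checked = check_submission(demo_path, expected_rows=3)

print("rows checked:", checked)
```

For the file produced above, the call would be `check_submission("pred_42.csv", expected_rows=len(df_test))`.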
