## Advanced Model: BERT for Financial News Classification

This notebook sets up a BERT-based text classification model that assigns
financial news tweets to one of 20 categories. The objective is to evaluate
a deep learning approach and compare it with the classical models from the
previous notebooks.


## Environment Setup

This step installs and imports the required libraries for training and
evaluating a BERT-based text classification model.


In [None]:
# If running locally and not installed, uncomment the next lines:
# !pip install transformers datasets torch accelerate scikit-learn

import numpy as np
import pandas as pd
import torch

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

from transformers import (
    BertTokenizerFast,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments
)
print("Libraries imported successfully.")


## Tokenizer Initialization (BERT)

This step loads the pretrained BERT tokenizer and verifies that the
environment is correctly set up for tokenization.


In [2]:
import os
import warnings

# Silence Python warnings and the Hugging Face symlink-cache warning
warnings.filterwarnings("ignore")
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

from transformers import BertTokenizerFast

# Load pretrained tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Quick sanity check on a single sentence
sample_text = "Markets reacted positively to the Federal Reserve announcement."
encoded = tokenizer(sample_text, truncation=True, padding=True, max_length=128)

print("Tokenizer loaded successfully.")
print("Sample tokenized keys:", encoded.keys())


Tokenizer loaded successfully.
Sample tokenized keys: KeysView({'input_ids': [101, 6089, 14831, 13567, 2000, 1996, 2976, 3914, 8874, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})


## Tokenization of Training and Testing Data

This step loads the cleaned dataset, creates a stratified 80/20 train-test
split, and then tokenizes both splits with the pretrained BERT tokenizer.


In [3]:
from sklearn.model_selection import train_test_split

# Load cleaned dataset (if not already loaded)
data_path = "../data/train_clean.csv"
df = pd.read_csv(data_path)

# Remove invalid or low-information records
df = df.dropna(subset=["clean_text"])
df = df[df["clean_text"].str.len() > 10]

# Define features and target
X = df["clean_text"]
y = df["label"]

# Stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train samples:", X_train.shape[0])
print("Test samples :", X_test.shape[0])


Train samples: 13559
Test samples : 3390
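
The effect of `stratify=y` can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not scikit-learn's actual algorithm: the function name `stratified_split` and the toy labels are hypothetical, and it only shows the idea of splitting each class separately so that class proportions are preserved in both parts.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    # Group example positions by their label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test part
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels: an imbalanced three-class problem (hypothetical data)
toy_labels = [0] * 50 + [1] * 30 + [2] * 20
train_idx, test_idx = stratified_split(toy_labels)

train_counts = Counter(toy_labels[i] for i in train_idx)
test_counts = Counter(toy_labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test :", dict(test_counts))
```

With 20 labels of uneven frequency, a plain random split could leave rare classes under-represented in the test set; splitting per class avoids that.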


In [4]:
# Tokenize training data
train_encodings = tokenizer(
    list(X_train),
    truncation=True,
    padding=True,
    max_length=128
)

# Tokenize testing data
test_encodings = tokenizer(
    list(X_test),
    truncation=True,
    padding=True,
    max_length=128
)

print("Tokenization completed successfully.")
print("Training samples:", len(train_encodings["input_ids"]))
print("Testing samples :", len(test_encodings["input_ids"]))


Tokenization completed successfully.
Training samples: 13559
Testing samples : 3390
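
To make the `truncation=True`, `padding=True`, and `max_length` arguments more concrete, here is a minimal pure-Python sketch of what they do to a batch. It is an illustration only, not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper that starts from word-piece ids that are already known, and it relies on the `bert-base-uncased` special-token ids 101 (`[CLS]`), 102 (`[SEP]`), and 0 (`[PAD]`).

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased


def pad_and_truncate(token_id_lists, max_length=128):
    """Mimic truncation=True, padding=True for lists of word-piece ids."""
    # Truncate so that [CLS] + tokens + [SEP] fits into max_length
    wrapped = [
        [CLS_ID] + ids[: max_length - 2] + [SEP_ID] for ids in token_id_lists
    ]
    # padding=True pads to the longest sequence in the batch
    longest = max(len(seq) for seq in wrapped)

    input_ids, attention_mask = [], []
    for seq in wrapped:
        n_pad = longest - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# A tiny max_length is used here only to make truncation visible
batch = pad_and_truncate([[6089, 14831], [2976, 3914, 8874, 1012]], max_length=5)
for key, rows in batch.items():
    print(key, rows)
```

Because padding goes only as far as the longest example, a batch of short tweets stays well below the 128-token limit.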


## Train–Test Split for BERT

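This step reloads the cleaned dataset and recreates the stratified split.
Because the cleaning rules and `random_state` are unchanged, the resulting
split is identical to the one used for tokenization.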


In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load cleaned dataset
data_path = "../data/train_clean.csv"
df = pd.read_csv(data_path)

# Basic cleaning (same as previous notebooks)
df = df.dropna(subset=["clean_text"])
df = df[df["clean_text"].str.len() > 10]

# Define features and target
X = df["clean_text"]
y = df["label"]

# Stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train samples:", X_train.shape[0])
print("Test samples :", X_test.shape[0])


Train samples: 13559
Test samples : 3390


## Load BERT Tokenizer

This step loads the pretrained BERT tokenizer and verifies that the
tokenization pipeline is working correctly.


In [6]:
from transformers import BertTokenizerFast

# Load pretrained tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Sanity check with a sample sentence
sample_text = "Markets reacted positively to the Federal Reserve announcement."
encoded = tokenizer(
    sample_text,
    truncation=True,
    padding=True,
    max_length=128
)

print("Tokenizer loaded successfully.")
print("Tokenized keys:", encoded.keys())


Tokenizer loaded successfully.
Tokenized keys: KeysView({'input_ids': [101, 6089, 14831, 13567, 2000, 1996, 2976, 3914, 8874, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})


## Tokenize Training and Testing Data

The training and testing text was already tokenized in the earlier
tokenization cell. The resulting `train_encodings` and `test_encodings`
are reused in the steps that follow.


## Custom PyTorch Dataset for BERT

This step creates a PyTorch-compatible dataset required by the
HuggingFace Trainer API.


In [7]:
import torch

class FinancialNewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels.reset_index(drop=True)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels.iloc[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create dataset objects
train_dataset = FinancialNewsDataset(train_encodings, y_train)
test_dataset = FinancialNewsDataset(test_encodings, y_test)

print("PyTorch datasets created successfully.")
print("Train dataset size:", len(train_dataset))
print("Test dataset size :", len(test_dataset))


PyTorch datasets created successfully.
Train dataset size: 13559
Test dataset size : 3390
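
The layout of a single dataset item can be sketched without PyTorch. The class below is a hypothetical plain-Python stand-in (`ToyEncodedDataset`, with made-up encodings and labels) that returns lists instead of tensors; it only illustrates how column-wise encodings are turned into one dictionary per example with an added `labels` entry.

```python
class ToyEncodedDataset:
    """Plain-Python stand-in for the tensor-based dataset (no torch needed)."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        # Positional list, so index 0 is always the first example
        self.labels = list(labels)

    def __getitem__(self, idx):
        # Pick row `idx` from every encoding column
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)


# Hypothetical encodings for two short examples
toy_encodings = {
    "input_ids": [[101, 6089, 102], [101, 2976, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy_dataset = ToyEncodedDataset(toy_encodings, labels=[3, 7])

print(len(toy_dataset))
print(toy_dataset[1])
```

The `reset_index(drop=True)` call in `FinancialNewsDataset` serves the same purpose as `list(labels)` here: after `train_test_split`, the label Series keeps its original row index, so labels need to be addressed by position to stay aligned with the encodings.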


## Data Preparation for Model Initialization

This step reloads the cleaned dataset and recreates the stratified split,
so that `y_train` is available for counting the labels.


In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load cleaned dataset
data_path = "../data/train_clean.csv"
df = pd.read_csv(data_path)

# Basic cleaning
df = df.dropna(subset=["clean_text"])
df = df[df["clean_text"].str.len() > 10]

# Features and target
X = df["clean_text"]
y = df["label"]

# Stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train samples:", X_train.shape[0])
print("Test samples :", X_test.shape[0])


Train samples: 13559
Test samples : 3390


## BERT Model Initialization

This step initializes a pretrained BERT model for multi-class text
classification, with one output per label found in `y_train`.


In [9]:
import os
import warnings

# Silence Python warnings and Hugging Face Hub notices
warnings.filterwarnings("ignore")
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
os.environ["HF_HUB_DISABLE_XET_WARNING"] = "1"

from transformers import BertForSequenceClassification

# Number of unique labels
num_labels = y_train.nunique()

# Initialize pretrained BERT model with a new classification head
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=num_labels
)

print(f"BERT model initialized successfully with {num_labels} labels.")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERT model initialized successfully with 20 labels.
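
Setting `num_labels = y_train.nunique()` implicitly assumes that the labels are the integers 0 to `num_labels - 1`, since the classification loss treats each label as a class index. The sketch below shows one way to check that assumption and to build a remapping if it does not hold. The helper name `build_label_maps` and the toy labels are hypothetical.

```python
def build_label_maps(labels):
    """Map arbitrary label values to contiguous ids 0..K-1 and back."""
    unique = sorted(set(labels))
    label2id = {label: i for i, label in enumerate(unique)}
    id2label = {i: label for label, i in label2id.items()}
    # True when the labels can be used as class indices without remapping
    is_contiguous = unique == list(range(len(unique)))
    return label2id, id2label, is_contiguous


# Toy labels with gaps (hypothetical data): 3 classes, but a maximum value of 5
toy_labels = [0, 2, 5, 2]
label2id, id2label, is_contiguous = build_label_maps(toy_labels)

print("contiguous:", is_contiguous)
print("label2id  :", label2id)
print("remapped  :", [label2id[label] for label in toy_labels])
```

If the check passes on the real `label` column, no remapping is needed and the model can be initialized as shown above.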


## Training Configuration for BERT

This step defines the training arguments used to fine-tune the BERT model.
The configuration keeps the run small (batch size 8, two epochs) so that it
remains feasible on CPU-only hardware.


In [10]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./bert_results",
    eval_strategy="epoch",          # evaluate once per epoch (older transformers releases name this evaluation_strategy)
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir="./bert_logs",
    logging_steps=100,
    load_best_model_at_end=True,
    report_to="none"
)

print("Training arguments configured successfully.")


Training arguments configured successfully.
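
A quick back-of-the-envelope calculation shows how much work this configuration implies. The sketch assumes a single device and no gradient accumulation (the `TrainingArguments` default); `count_batches` is a hypothetical helper, and the sample counts come from the split printed earlier.

```python
import math


def count_batches(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


n_train, n_test = 13559, 3390   # sizes of the train and test splits
batch_size, epochs = 8, 2       # values from training_args

steps_per_epoch = count_batches(n_train, batch_size)
total_steps = steps_per_epoch * epochs
eval_batches = count_batches(n_test, batch_size)

print("Optimizer steps per epoch:", steps_per_epoch)   # 1695
print("Total optimizer steps    :", total_steps)       # 3390
print("Eval batches per epoch   :", eval_batches)      # 424
```

Each of these steps is a full forward and backward pass through BERT, which is why the run is slow without a GPU.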


## BERT Model Training

This step re-initializes the BERT model and fine-tunes it on the training
dataset, evaluating on the test dataset after each epoch.



In [11]:
from transformers import BertForSequenceClassification

# Number of unique labels
num_labels = y_train.nunique()

# Initialize pretrained BERT model
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=num_labels
)

print(f"BERT model initialized successfully with {num_labels} labels.")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERT model initialized successfully with 20 labels.


In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer  # recent transformers releases rename this argument to processing_class
)

print("Trainer initialized. Training started...")

trainer.train()
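
As configured, the `Trainer` reports only the evaluation loss, because no metric function is supplied. A metric function could look like the sketch below, which computes accuracy and macro-averaged F1 with NumPy. It assumes the evaluation output unpacks into `(logits, labels)`, and it averages F1 only over classes that occur in the reference labels, which can differ slightly from scikit-learn's default. The toy logits and labels are hypothetical.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy and macro F1 from (logits, labels)."""
    logits, labels = eval_pred
    logits, labels = np.asarray(logits), np.asarray(labels)
    preds = np.argmax(logits, axis=-1)

    accuracy = float((preds == labels).mean())

    f1_scores = []
    for cls in np.unique(labels):
        tp = np.sum((preds == cls) & (labels == cls))
        fp = np.sum((preds == cls) & (labels != cls))
        fn = np.sum((preds != cls) & (labels == cls))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return {"accuracy": accuracy, "macro_f1": float(np.mean(f1_scores))}


# Toy check: four examples, three classes (hypothetical values)
toy_logits = np.array([
    [2.0, 0.1, 0.1],
    [0.2, 1.5, 0.3],
    [0.1, 0.2, 3.0],
    [1.0, 0.9, 0.1],
])
toy_labels = np.array([0, 1, 2, 1])
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

The function would be passed to the trainer as `Trainer(..., compute_metrics=compute_metrics)`, so that accuracy and macro F1 appear next to the loss at each evaluation.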


## Note on BERT Training

A BERT-based model was set up to explore a deep learning approach to
financial news classification. However, fine-tuning on CPU-only hardware
proved too slow, so the training run was intentionally stopped before
completion and no BERT evaluation results are reported here.

Classical models such as the Linear Support Vector Machine were
prioritized instead, which allowed the project to finish on time while
maintaining strong and reliable performance. This reflects a practical
engineering trade-off between model complexity, computational resources,
and project timelines.
