### **FINANCE NEWS - SENTIMENT ANALYSIS**

**Financial PhraseBank:** A dataset containing financial news sentences annotated with sentiment labels.

**sentences_allagree:** The subset of this dataset that keeps only the sentences for which every annotator chose the same sentiment label. It contains 2,264 entries.

For the sentences_allagree subset, the labels are encoded as follows:


*   0: Negative sentiment
*   1: Neutral sentiment
*   2: Positive sentiment
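
The encoding above can be written down as a pair of lookup tables. This is only a minimal sketch: the names `ID2LABEL` and `LABEL2ID` are illustrative and are not part of the dataset or of any library API.

```python
# Label encoding of the sentences_allagree subset, as listed above.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

# Reverse lookup, derived from the forward mapping so the two stay in sync.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

print(ID2LABEL[2])
print(LABEL2ID["negative"])
```

Keeping one mapping as the single source of truth avoids mismatches between training labels and the words shown to a reader later on.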

In [None]:
!pip install datasets


In [None]:
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of the PyTorch one
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import torch
from datasets import load_dataset
import pandas as pd

# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

import os
from google.colab import drive
drive.mount('/content/drive')

# Define a directory to save the model in Google Drive
output_dir = '/content/drive/MyDrive/bert_model_save/'

# Create output directory if needed
os.makedirs(output_dir, exist_ok=True)

Mounted at /content/drive


In [None]:
# Load dataset
dataset = load_dataset('financial_phrasebank', 'sentences_allagree')
df = pd.DataFrame(dataset['train'])

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.



In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 2264
    })
})

In [None]:
# Extract model input and output
texts = dataset['train']['sentence']
labels = dataset['train']['label']

# Split the dataset into 90% training and 10% validation
train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.1, random_state=42)
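
The sizes of the two splits follow from simple arithmetic. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to the next whole sample and gives the remainder to the training set; the variable names are illustrative.

```python
import math

n_samples = 2264   # rows in sentences_allagree
test_size = 0.1    # fraction passed to train_test_split

# Assumption: the validation size is rounded up, the rest goes to training.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
```

Under that assumption the split is 2037 training sentences and 227 validation sentences.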

### **Create a new model and train**

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
model = model.to(device)

optimizer = AdamW(model.parameters(), lr=1e-5)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Tokenize input
print('Tokenizing the input...')
train_encodings = tokenizer(train_texts, truncation=True, padding=True, return_tensors='pt')

#Convert to tensors
print('Converting to tensors...')
train_inputs = train_encodings['input_ids'].to(device)
train_masks = train_encodings['attention_mask'].to(device)
train_outputs = torch.tensor(train_labels).to(device)

#Create DataLoader
print('Loading the data...')
train_dataset = TensorDataset(train_inputs, train_masks, train_outputs)
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=32)

Tokenizing the input...
Converting to tensors...
Loading the data...
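
With `padding=True` the tokenizer pads every sequence to the longest one in the call and returns an `attention_mask` that marks real tokens with 1 and padding with 0. The following pure-Python sketch mimics that behavior on toy data; `pad_batch` is a hypothetical helper, not a tokenizer method, and the token IDs are made up for illustration.

```python
def pad_batch(token_id_lists, pad_id=0):
    """Pad token ID lists to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)              # right-pad with pad_id
        attention_mask.append([1] * len(ids) + [0] * n_pad)   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Toy token IDs (illustrative only)
batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

The mask is what lets the model ignore padded positions, which is why it should be passed to the model together with `input_ids`.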


In [None]:
# Training loop
print('Training...')
model.train()
for epoch in range(30):
  total_loss = 0
  for step, batch in enumerate(train_dataloader):
    b_input_ids, b_input_mask, b_labels = batch
    optimizer.zero_grad()
    # Pass the attention mask so that padding tokens are ignored.
    # The run logged below omitted it, which triggered a transformers padding warning.
    outputs = model(input_ids=b_input_ids, attention_mask=b_input_mask, labels=b_labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    total_loss += loss.item()
  print(f"Epoch {epoch + 1} --> Total Loss: {total_loss}")


Training...
Epoch 1 --> Total Loss: 52.95922142267227
Epoch 2 --> Total Loss: 31.36957322061062
Epoch 3 --> Total Loss: 19.933651193976402
Epoch 4 --> Total Loss: 8.969113893806934
Epoch 5 --> Total Loss: 5.068694304674864
Epoch 6 --> Total Loss: 2.8888102816417813
Epoch 7 --> Total Loss: 2.0480386009439826
Epoch 8 --> Total Loss: 2.117433732841164
Epoch 9 --> Total Loss: 1.5583045086823404
Epoch 10 --> Total Loss: 0.8383911603596061
Epoch 11 --> Total Loss: 1.516265967860818
Epoch 12 --> Total Loss: 0.9270119604188949
Epoch 13 --> Total Loss: 0.45259864535182714
Epoch 14 --> Total Loss: 0.48733214871026576
Epoch 15 --> Total Loss: 0.2946990287164226
Epoch 16 --> Total Loss: 0.22030778019689023
Epoch 17 --> Total Loss: 0.11928629758767784
Epoch 18 --> Total Loss: 0.11239819612819701
Epoch 19 --> Total Loss: 0.1630220192600973
Epoch 20 --> Total Loss: 0.7645423716749065
Epoch 21 --> Total Loss: 1.6661932612769306
Epoch 22 --> Total Loss: 0.5597841199487448
Epoch 23 --> Total Loss: 0.159
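
The totals above are sums over all batches in an epoch, so they are easier to interpret once divided by the number of batches. The sketch below does that for the logged values (rounded to three decimals, epochs 1 to 22 only, since the epoch 23 entry is cut off). It assumes 2037 training sentences and the batch size of 32 used above; the last batch is smaller, so the result is an approximation of the per-sample loss.

```python
import math

# Total loss per epoch from the log above, rounded to three decimals (epochs 1-22).
total_losses = [
    52.959, 31.370, 19.934, 8.969, 5.069, 2.889, 2.048, 2.117, 1.558, 0.838, 1.516,
    0.927, 0.453, 0.487, 0.295, 0.220, 0.119, 0.112, 0.163, 0.765, 1.666, 0.560,
]

n_train, batch_size = 2037, 32                       # assumed split size, batch size from the notebook
batches_per_epoch = math.ceil(n_train / batch_size)

# Mean loss per batch for each epoch
mean_losses = [total / batches_per_epoch for total in total_losses]

# Epoch (1-based) with the lowest logged training loss
best_epoch = min(range(len(total_losses)), key=total_losses.__getitem__) + 1

print(batches_per_epoch)
print(round(mean_losses[0], 3), round(math.log(3), 3))
print(best_epoch)
```

For reference, a three-class classifier that guesses uniformly has a cross-entropy of ln 3. The training loss reaches its minimum around epoch 18 and fluctuates afterwards, which suggests that far fewer than 30 epochs may be enough, although only a validation curve could confirm that.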

In [None]:
# Save a trained model, configuration and tokenizer
print("Saving model to %s" % output_dir)

model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

Saving model to /content/drive/MyDrive/bert_model_save/


('/content/drive/MyDrive/bert_model_save/tokenizer_config.json',
 '/content/drive/MyDrive/bert_model_save/special_tokens_map.json',
 '/content/drive/MyDrive/bert_model_save/vocab.txt',
 '/content/drive/MyDrive/bert_model_save/added_tokens.json')

### **TESTING THE MODEL WITH THE VALIDATION DATA**

In [None]:
# Tokenize validation input
print('Tokenizing the validation input...')
val_encodings = tokenizer(val_texts, truncation=True, padding=True, return_tensors='pt')

# Convert to tensors
print('Converting to tensors...')
val_inputs = val_encodings['input_ids'].to(device)
val_masks = val_encodings['attention_mask'].to(device)
val_outputs = torch.tensor(val_labels).to(device)

# Create DataLoader
print('Loading the data...')
val_dataset = TensorDataset(val_inputs, val_masks, val_outputs)
val_sampler = SequentialSampler(val_dataset)
val_dataloader = DataLoader(val_dataset, sampler=val_sampler, batch_size=32)

Tokenizing the validation input...
Converting to tensors...
Loading the data...


In [None]:
# Evaluation
print('Evaluating...')
model.eval()
val_pred_labels, val_true_labels = [], []
with torch.no_grad():
    for batch in val_dataloader:
        b_input_ids, b_input_mask, b_labels = batch
        outputs = model(input_ids=b_input_ids, attention_mask=b_input_mask)
        logits = outputs.logits
        val_pred_labels.extend(torch.argmax(logits, dim=1).cpu().numpy())
        val_true_labels.extend(b_labels.cpu().numpy())
accuracy = accuracy_score(val_true_labels, val_pred_labels)

Evaluating...


In [None]:
print(f'Validation Accuracy: {accuracy}')

Validation Accuracy: 0.9427312775330396
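
The reported accuracy can be turned back into a count of correct predictions. This sketch assumes the validation split holds 227 sentences (10% of 2,264, rounded up); the variable names are illustrative.

```python
n_val = 227                      # assumed size of the validation split
accuracy = 0.9427312775330396    # value printed above

n_correct = round(accuracy * n_val)
n_wrong = n_val - n_correct

print(n_correct, n_wrong)
```

Under that assumption the model labels 214 of the 227 validation sentences correctly and misses 13. With so few errors, a per-class breakdown would say more than the single accuracy figure.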


### **TESTING THE MODEL WITH CUSTOM DATA**

In [None]:
# Function to predict labels for custom inputs
def predict_custom_sentences(sentences, model, tokenizer, device):
    # Tokenize the input sentences
    encodings = tokenizer(sentences, truncation=True, padding=True, return_tensors='pt')

    # Move the encodings to the appropriate device
    input_ids = encodings['input_ids'].to(device)
    attention_mask = encodings['attention_mask'].to(device)

    # Put the model in evaluation mode
    model.eval()

    # Make predictions
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits

    # Get the predicted labels
    preds = torch.argmax(logits, dim=1).cpu().numpy()

    return preds

# Function to map label integers to words
def label_to_word(label):
    label_dict = {0: 'negative', 1: 'neutral', 2: 'positive'}
    return label_dict.get(label, 'unknown')
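
`predict_custom_sentences` returns only the arg-max label. If a confidence score is also wanted, the logits can be passed through a softmax first. The sketch below shows the idea in plain Python with made-up logits; in the notebook itself the equivalent would be `torch.softmax(logits, dim=1)`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for one sentence, in the order (negative, neutral, positive)
example_logits = [-1.2, 0.3, 2.5]
probs = softmax(example_logits)
pred = max(range(len(probs)), key=probs.__getitem__)

print(pred, [round(p, 3) for p in probs])
```

A low top probability can flag predictions that deserve a manual check.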

In [None]:
custom_sentences = [
    "The company reported a significant increase in revenue.",
    "There are concerns about the sustainability of the growth.",
    "The new product launch has been very successful.",
    "Despite a challenging market environment, the company's strategic decisions have led to considerable improvements in their financial performance.",
    "The recent partnership with a major tech firm is expected to drive innovation and increase market share in the coming years.",
    "My profit last year was $10. This year it is $8.",
    "My profit last year was $10. This year it is reduced to $8"
]

# Predict labels for custom sentences
predicted_labels = predict_custom_sentences(custom_sentences, model, tokenizer, device)

# Map predicted labels to words
predicted_labels_words = [label_to_word(label) for label in predicted_labels]

# DataFrame to store the results
df_predictions = pd.DataFrame({
    'Financial News': custom_sentences,
    'Predicted Label': predicted_labels_words
})

print(df_predictions)

                                      Financial News Predicted Label
0  The company reported a significant increase in...        positive
1  There are concerns about the sustainability of...        negative
2   The new product launch has been very successful.        positive
3  Despite a challenging market environment, the ...        positive
4  The recent partnership with a major tech firm ...        positive
5   My profit last year was $10. This year it is $8.        positive
6  My profit last year was $10. This year it is r...        negative


The last two rows show a limitation. Both sentences describe profit falling from $10 to $8, yet the model labels the first one positive and labels the second negative only once the phrase "reduced to" appears. BERT and other transformer-based models are pre-trained on large corpora and capture many nuances of language, but they are not explicitly designed for numerical reasoning or arithmetic. When a sentence contains numerical data, the model tends to rely on the surrounding words rather than on the relationship between the numbers.
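
One way to compensate is to extract the numbers explicitly and compare them outside the model. The sketch below is a crude heuristic, not part of the notebook: `amount_trend` is a hypothetical helper that assumes dollar amounts appear in chronological order and that only the first and last amounts matter.

```python
import re

# Matches dollar amounts such as $10 or $8.50
AMOUNT_RE = re.compile(r"\$(\d+(?:\.\d+)?)")

def amount_trend(sentence):
    """Return 'down', 'up', 'flat', or None, comparing the first and last dollar amount."""
    amounts = [float(value) for value in AMOUNT_RE.findall(sentence)]
    if len(amounts) < 2:
        return None          # nothing to compare
    first, last = amounts[0], amounts[-1]
    if last < first:
        return "down"
    if last > first:
        return "up"
    return "flat"

print(amount_trend("My profit last year was $10. This year it is $8."))
print(amount_trend("The company reported a significant increase in revenue."))
```

Such a signal could be shown next to the model's prediction or used to flag sentences where the two disagree.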