<a href="https://colab.research.google.com/github/Incredible88/FinBERT-FOMC/blob/main/Finbert-finetuned.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Get Started

First, create the environment from the `environment.yml`:
```bash
conda env create -f environment.yml         # created using `conda env export > environment.yml`
```
Running this command in a terminal:
- Creates a new Conda environment (with the name defined in the YAML file)
- Installs the specified packages (both Conda and pip dependencies)
- Resolves compatible versions for your OS automatically

Once the packages are installed, run:
```bash
conda activate finbert-sec-env
```
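
To confirm from inside the notebook that the kernel runs in the intended environment, a quick check like the one below may help. It is a minimal sketch: it relies on the `CONDA_DEFAULT_ENV` variable that Conda sets on activation, and the helper name `active_conda_env` is only illustrative.

```python
import os

def active_conda_env(environ=os.environ):
    """Return the name of the active Conda environment, or None if none is active."""
    # Conda exports CONDA_DEFAULT_ENV when an environment is activated.
    return environ.get("CONDA_DEFAULT_ENV")

expected = "finbert-sec-env"  # name used in the `conda activate` command
current = active_conda_env()
if current != expected:
    print(f"Warning: active environment is {current!r}, expected {expected!r}")
```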


In [None]:
# Import different python libraries
import pandas as pd
import numpy as np
from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt

To remove Jupyter Notebook output when committing to GitHub:
```bash
pip install nbstripout      # install the tool first
nbstripout --install        # then register the Git filter in this repository
```

In [None]:
# !pip install chardet
## chardet is used for detecting file encoding before reading raw bytes.
## No need for chardet — Parquet stores text as UTF-8 internally.

# import chardet
# result = chardet.detect(parquet["text_pr"])
# encoding = result['encoding']

## To find what encoding type of data
# encoding

# Load SEC Press Releases

In [None]:
# Read Parquet File `sec.parquet`
parquet = pd.read_parquet('sec.parquet')
parquet.head()

# Focus on SEC press releases: the `text_pr` column
parquet_text = parquet["text_pr"]
parquet_text.head()

In [None]:
parquet.info()

# Load FinBERT

In [None]:
# pip install transformers==4.28.

from transformers import BertTokenizer, BertForSequenceClassification, pipeline

finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone',num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')
finbert = pipeline("text-classification", model=finbert, tokenizer=tokenizer)

## Using spaCy to split sentences

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')

sentences = []
for text in tqdm(parquet_text):
    doc = nlp(text)
    for sent in doc.sents:
        sentences.append(sent.text)
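
The loop above passes every entry of `text_pr` straight to `nlp`, which expects a string, so missing values in the column would typically raise an error. One possible guard is sketched below; the helper name `iter_valid_texts` and the sample values are illustrative assumptions, not part of the original pipeline.

```python
def iter_valid_texts(texts):
    """Yield stripped, non-empty strings; skip None, NaN, and blank entries."""
    for text in texts:
        # NaN is a float, so the isinstance check filters it out along with None.
        if isinstance(text, str) and text.strip():
            yield text.strip()

sample = ["The SEC today charged a firm.  ", None, float("nan"), "   "]
print(list(iter_valid_texts(sample)))
```

In the notebook, this could wrap the input of the splitting loop, for example `for text in tqdm(list(iter_valid_texts(parquet_text)))`.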


# Predict the original dataset


In [None]:
nlp(parquet_text[1])

In [None]:
results = []
for sentence in tqdm(sentences):
    # Truncate over-length inputs so they fit within BERT's 512-token limit
    prediction = finbert(sentence, truncation=True, max_length=512)[0]
    results.append({
        'sentence': sentence,
        'label': prediction['label'],
        'score': prediction['score'],
    })

df = pd.DataFrame(results)
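
Scoring one sentence per call can be slow on a large corpus. A common alternative is to group sentences into batches first. The helper below is a dependency-free sketch of that grouping step; the name `chunked` and the batch size are assumptions chosen for illustration.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]

demo = [f"sentence {i}" for i in range(5)]
for batch in chunked(demo, 2):
    print(batch)
```

With the `finbert` pipeline defined earlier, each batch could then be passed as a list, for example `finbert(batch, truncation=True, max_length=512)`, which returns one prediction per sentence.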

In [None]:
df.head()

In [None]:
# Save the results
df.to_csv('SEC_results.csv', index=False)

# Text Simplification (sentiment focus)

**Note on Text Simplification**

SEC enforcement press releases differ fundamentally from narrative policy texts like FOMC statements. They are shorter, more direct, and written in a formal legal register, typically featuring constructions such as “The SEC today charged…”, “Without admitting or denying…”, or “The firm agreed to pay…”. 

These sentences rarely use hedged or contrastive connectors (e.g., “although”, “but”, “while”) in ways that affect sentiment. Applying preprocessing steps like `remove_comma()` or `sentiment_focus()` could inadvertently strip legally significant clauses, such as names, charges, or disclaimers, which contain essential context.

As a result, such preprocessing would not enhance sentiment extraction accuracy and could, in fact, distort the legal tone central to SEC communications.
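
One way to check the claim about connectors on your own data is to measure how often they appear. The sketch below counts the connectors named in this note, plus "though", which the FOMC helper also checks. The function name `connector_rate` and the three example sentences are illustrative assumptions, not SEC data.

```python
import re

# Contrastive connectors, matched as whole words and case-insensitively.
CONNECTOR_PATTERN = re.compile(r"\b(although|though|but|while)\b", re.IGNORECASE)

def connector_rate(sentences):
    """Return the fraction of sentences containing at least one connector."""
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if CONNECTOR_PATTERN.search(s))
    return hits / len(sentences)

examples = [
    "The SEC today charged the firm with fraud.",
    "The firm agreed to pay a penalty.",
    "Although the firm cooperated, it agreed to pay a penalty.",
]
print(connector_rate(examples))
```

A low rate on the full `sentences` list would support skipping the connector-based preprocessing; a high rate would suggest revisiting that decision.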


---
**Note on Sentiment Focus - Why `sentiment_focus()` Was Not Applied to SEC Press Releases**

The `sentiment_focus()` filtering step, used in FOMC sentiment analysis, was omitted for SEC enforcement press releases. Unlike policy statements, nearly every sentence in an enforcement release carries informational or evaluative weight. Clauses that appear factual, such as “The SEC today charged…”, “Without admitting or denying…”, or “The firm agreed to pay…”, are legally meaningful and influence perceived tone and market response. Filtering them out could remove crucial signals about enforcement severity, cooperation, or settlement outcomes. 

Therefore, all sentences were retained, and sentiment was evaluated at the sentence level before aggregation, preserving the full legal and contextual nuance of SEC communications.
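
The aggregation rule is not spelled out here. One common choice, shown purely as an assumption, is a net-tone score: the share of positive sentences minus the share of negative ones. The function name `net_sentiment` and the sample labels are hypothetical.

```python
from collections import Counter

def net_sentiment(labels):
    """Return (positive - negative) / total for a list of sentence labels."""
    if not labels:
        return 0.0
    counts = Counter(label.lower() for label in labels)
    # Neutral sentences count toward the total but not toward the numerator.
    return (counts["positive"] - counts["negative"]) / len(labels)

release_labels = ["Negative", "Neutral", "Neutral", "Positive", "Negative"]
print(net_sentiment(release_labels))
```

The score ranges from -1 (all negative) to 1 (all positive), so releases of different lengths remain comparable.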

In [None]:
# import spacy

# nlp = spacy.load("en_core_web_sm")

# def remove_comma(sentence):
#     doc = nlp(sentence)
#     indices = []
#     for i, token in enumerate(doc):
#         if token.dep_ == "punct":
#             try:               
#                 next_token = doc[i+1]
#                 if next_token.dep_ == "ROOT" or next_token.dep_ == "conj":
#                     indices.append(i)
#             except IndexError:
#                 pass
#     if not indices:
#         return sentence
#     else:
#         parts = []
#         last_idx = 0
#         for idx in indices:
#             parts.append(doc[last_idx:idx].text.strip())

#             last_idx = idx+1
#         parts.append(doc[last_idx:].text.strip())
#         return " ".join(parts)
    
# # Example of remove_comma
# remove_comma("The personal saving rate--while still slightly negative,moved up in October.")


In [None]:
# def sentiment_focus(sentence):
#     doc = nlp(sentence)
#     focus = ""
#     focus_changed = 1
#     for token in doc[:-1]:
#       if token.lower_ == "but":
#           focus = doc[token.i + 1:]
#           return str(focus).strip(),focus_changed

#     for sent in doc.sents:
#         sent_tokens = [token for token in sent]
#         for token in sent_tokens:
#             if token.lower_ == "although" or token.lower_ == "though":
#                 try:
#                     comma_index_back = [token1.i for token1 in doc[token.i:] if token1.text == ','][0]
#                 except IndexError:
#                     try:
#                       comma_index_front = [token1.i for token1 in doc[:token.i] if token1.text == ','][-1]
#                     except IndexError:
#                       return str(doc).strip(),focus_changed
#                     focus = doc[:comma_index_front].text
#                     return str(focus).strip(),focus_changed
#                 try:
#                       comma_index_front = [token1.i for token1 in doc[:token.i] if token1.text == ','][-1]
#                 except IndexError:
#                   focus = doc[comma_index_back+1:].text
#                   return str(focus).strip(),focus_changed
#                 focus = doc[:comma_index_front].text+doc[comma_index_back:].text
#                 return str(focus).strip(),focus_changed

#     if doc[0].lower_ == "while":
#       try:
#         comma_index_back1 = [token2.i for token2 in doc if token2.text == ','][0]
#       except IndexError:
#         return str(doc).strip(),focus_changed
#       focus = doc[comma_index_back1+1:].text
#       return str(focus).strip(),focus_changed

#     focus_changed = 0 
#     return str(doc).strip(),focus_changed

# Fine-tuning FinBERT



Import the libraries needed for fine-tuning FinBERT.

In [None]:
!pip install transformers==4.28.1
!pip install datasets
from transformers import BertTokenizer, Trainer, BertForSequenceClassification, TrainingArguments
import torch
from datasets import Dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import transformers
torch.__version__, transformers.__version__

In [None]:
torch.cuda.is_available()

In [None]:
# Split CSV file into training and testing sets

In [None]:
# load training data
df = pd.read_csv('/content/training_data.csv') 
df.head()

In [None]:
# We only need new labels
df = df[['sentence', 'label_new']].rename(columns={'label_new': 'label'})
df.head()

In [None]:
df['label'] = df['label'].replace({'Neutral': 0, 'Positive': 1, 'Negative': 2})
df.head()
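
Because the model only sees integer ids, it helps to keep the inverse mapping around for reading predictions later. The sketch below mirrors the mapping used in the cell above; the names `LABEL2ID`, `ID2LABEL`, and `decode` are illustrative.

```python
# Same mapping as in the replace() call above.
LABEL2ID = {"Neutral": 0, "Positive": 1, "Negative": 2}
# Inverse mapping, used to turn predicted ids back into readable labels.
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}

def decode(label_ids):
    """Map a sequence of integer class ids to their label names."""
    return [ID2LABEL[i] for i in label_ids]

print(decode([2, 0, 1]))
```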

## Preparing training/validation/testing

In [None]:
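# Hold out 10% of the data for testing, stratified by label
df_train, df_test = train_test_split(df, stratify=df['label'], test_size=0.1, random_state=42)
# Hold out 10% of the remaining data for validation
df_train, df_val = train_test_split(df_train, stratify=df_train['label'], test_size=0.1, random_state=42)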
print(df_train.shape, df_test.shape, df_val.shape)
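
Since the second split takes 10% of the remaining 90%, the overall proportions come out to roughly 81% train, 9% validation, and 10% test. The arithmetic can be sketched as follows, assuming the held-out size is rounded up; the row count of 1000 is a made-up example, not the size of the actual training file.

```python
import math

def split_sizes(n_rows, test_frac=0.1, val_frac=0.1):
    """Return approximate (train, val, test) sizes for two successive splits."""
    n_test = math.ceil(test_frac * n_rows)        # first split: hold out the test set
    n_remaining = n_rows - n_test
    n_val = math.ceil(val_frac * n_remaining)     # second split: hold out validation
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

print(split_sizes(1000))
```

Note that the `print` call in the cell above reports the shapes in the order train, test, validation.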

## Load FinBERT pretrained model
The pretrained FinBERT model is available on Hugging Face at https://huggingface.co/yiyanghkust/finbert-pretrain


In [None]:
model = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-pretrain',num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-pretrain')

## Prepare Dataset for Fine-tuning

In [None]:
dataset_train = Dataset.from_pandas(df_train)
dataset_val = Dataset.from_pandas(df_val)
dataset_test = Dataset.from_pandas(df_test)

dataset_train = dataset_train.map(lambda e: tokenizer(e['sentence'], truncation=True, padding='max_length', max_length=128), batched=True)
dataset_val = dataset_val.map(lambda e: tokenizer(e['sentence'], truncation=True, padding='max_length', max_length=128), batched=True)
dataset_test = dataset_test.map(lambda e: tokenizer(e['sentence'], truncation=True, padding='max_length' , max_length=128), batched=True)

dataset_train.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
dataset_val.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
dataset_test.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
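
The `truncation=True, padding='max_length'` options make every example exactly `max_length` tokens long. The sketch below illustrates the idea in plain Python. It is a simplification: the real tokenizer also keeps special tokens in place when truncating, and the token ids and `pad_id=0` here are assumptions for illustration.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Return (input_ids, attention_mask), each exactly `max_length` long."""
    ids = list(token_ids)[:max_length]   # cut sequences that are too long
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)      # number of pad positions to append
    return ids + [pad_id] * padding, mask + [0] * padding

ids, mask = pad_or_truncate([101, 2023, 2003, 102], max_length=6)
print(ids)
print(mask)
```

The attention mask lets the model ignore the padded positions, which is why it is listed alongside `input_ids` in `set_format`.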

## Define Training Options

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    # accuracy_score expects (y_true, y_pred)
    return {'accuracy': accuracy_score(labels, predictions)}

args = TrainingArguments(
        output_dir = 'temp/',
        evaluation_strategy = 'epoch',
        save_strategy = 'epoch',
        learning_rate=2e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=5,
        weight_decay=0.005,
        load_best_model_at_end=True,
        metric_for_best_model='accuracy',
)

trainer = Trainer(
        model=model,                        
        args=args,                  
        train_dataset=dataset_train,         
        eval_dataset=dataset_val,           
        compute_metrics=compute_metrics
)

trainer.train()  
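
For readers unfamiliar with the metric, `compute_metrics` takes the highest-scoring class per example and compares it with the true label. A dependency-free sketch of the same computation is shown below; the logits and labels are made-up values, and the helper names are illustrative.

```python
def argmax(row):
    """Return the index of the largest value in `row` (first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Return the fraction of rows whose top-scoring class equals the label."""
    predicted = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predicted, labels))
    return correct / len(labels)

logits = [[2.0, 0.1, -1.0], [0.2, 0.3, 1.5], [0.0, 1.2, 0.4]]
labels = [0, 2, 2]
print(accuracy_from_logits(logits, labels))
```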

## Evaluate on Testing Set

In [None]:
model.eval()
trainer.predict(dataset_test).metrics

In [None]:
dataset_test

## Save the fine-tuned model

In [None]:
trainer.save_model('finbert-sentiment/')

# Evaluate on All Press Releases