<a href="https://colab.research.google.com/github/chloetychang/UWA-FinBert-SEC/blob/main/Finbert-finetuned.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Get Started

First, create the environment from the `environment.yml`:
```bash
conda env create -f environment.yml         # created using `conda env export > environment.yml`
```
Running this command in a terminal:
- Creates a new Conda environment (with the name defined in the YAML file)
- Installs the specified packages (both Conda and pip dependencies)
- Resolves compatible versions for your OS automatically

Once the packages are installed, run:
```bash
conda activate finbert-sec-env
```


In [None]:
# Import general-purpose Python libraries
import pandas as pd
import numpy as np
from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt

To strip Jupyter Notebook output when committing to GitHub, install `nbstripout` and then register it as a Git filter in this repository:
```bash
pip install nbstripout
nbstripout --install
```

In [None]:
# !pip install chardet
## chardet detects a file's encoding from raw bytes.
## It is not needed here because Parquet stores text as UTF-8 internally.

# import chardet
# result = chardet.detect(parquet["text_pr"])
# encoding = result['encoding']

## Inspect the detected encoding
# encoding

# Load SEC Press Releases

In [None]:
# Read Parquet File `sec.parquet`
parquet = pd.read_parquet('sec.parquet')          # In Google Colab, change to "/content/sec.parquet"
parquet.head()

# Focus on SEC Press Releases - `text_pr` column
parquet_text = parquet["text_pr"]
parquet_text.head()

In [None]:
# Get basic info of the dataframe
parquet.info()


# Load FinBERT

In [None]:
# Colab command to install the library "transformers": !pip install transformers==4.28.

from transformers import BertTokenizer, BertForSequenceClassification, pipeline

finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone',num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')
finbert = pipeline("text-classification", model=finbert, tokenizer=tokenizer)

## Using spaCy to split sentences

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')

sentences = []
for text in tqdm(parquet_text):
    doc = nlp(text)
    for sent in doc.sents:
        sentences.append(sent.text)


# Predict the original dataset


In [None]:
nlp(parquet_text[1])

In [None]:
results = []
for sentence in tqdm(sentences):
    doc = finbert(sentence)
    label = doc[0]['label']
    score = doc[0]['score']
    results.append({'sentence': sentence, 'label': label, 'score': score})

df = pd.DataFrame(results)
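
The loop above indexes `doc[0]` because a text-classification pipeline called on a single string returns a list containing one dictionary with `label` and `score` keys. The sketch below mimics that shape with a hand-written value; `mock_output` and its score are assumptions for illustration, not real model output.

```python
# Hypothetical pipeline output for one sentence; real scores will differ.
mock_output = [{"label": "Neutral", "score": 0.98}]

# The first (and only) entry holds the top prediction for the input sentence.
top = mock_output[0]

# Build the same kind of record that the loop appends to `results`.
record = {
    "sentence": "The SEC today charged...",
    "label": top["label"],
    "score": top["score"],
}
print(record)
```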

In [None]:
df.head()

In [None]:
# Save the results
# Optional for future use; a CSV with sentences grouped by original dataframe entry is generated later in the notebook.
# df.to_csv('SEC_results.csv', index=False)

# Text Simplification (sentiment focus)

**Note on Text Simplification**

SEC enforcement press releases differ fundamentally from narrative policy texts like FOMC statements. They are shorter, more direct, and written in a formal legal register, typically featuring constructions such as “The SEC today charged…”, “Without admitting or denying…”, or “The firm agreed to pay…”.

These sentences rarely use hedged or contrastive connectors (e.g., “although”, “but”, “while”) in ways that affect sentiment. Applying preprocessing steps like `remove_comma()` or `sentiment_focus()` could inadvertently strip legally significant clauses such as names, charges, or disclaimers, which contain essential context.

As a result, such preprocessing would not enhance sentiment extraction accuracy and could, in fact, distort the legal tone central to SEC communications.
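
To make the risk concrete, the sketch below applies a deliberately crude version of the "but" rule (keep only the text after the first "but") to an invented sentence written in the style of an enforcement release. `keep_after_but` is a hypothetical helper, not the `sentiment_focus()` function itself, and the example sentence is made up for illustration.

```python
def keep_after_but(sentence: str) -> str:
    """Toy 'but' rule: keep only the text after the first ' but '."""
    head, sep, tail = sentence.partition(" but ")
    # If there is no ' but ', return the sentence unchanged.
    return tail.strip() if sep else sentence


# Invented example sentence (not taken from the dataset).
example = (
    "The firm agreed to pay a $5 million penalty, "
    "but it did not admit or deny the findings."
)

print(keep_after_but(example))
```

In this toy case the penalty clause disappears and only the disclaimer remains, which is the kind of loss described above.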


---
**Note on Sentiment Focus - Why `sentiment_focus()` Was Not Applied to SEC Press Releases**

The `sentiment_focus()` filtering step, used in FOMC sentiment analysis, was omitted for SEC enforcement press releases. Unlike policy statements, nearly every sentence in an enforcement release carries informational or evaluative weight. Clauses that appear factual, such as “The SEC today charged…”, “Without admitting or denying…”, or “The firm agreed to pay…”, are legally meaningful and influence perceived tone and market response. Filtering them out could remove crucial signals about enforcement severity, cooperation, or settlement outcomes.

Therefore, all sentences were retained, and sentiment was evaluated at the sentence level before aggregation, preserving the full legal and contextual nuance of SEC communications.

In [None]:
'''
Function to remove the comma before a root or conjunction token using spaCy.
Not used in the current implementation; kept commented out for potential future reference.
'''

# import spacy

# nlp = spacy.load("en_core_web_sm")

# def remove_comma(sentence):
#     doc = nlp(sentence)
#     indices = []
#     for i, token in enumerate(doc):
#         if token.dep_ == "punct":
#             try:
#                 next_token = doc[i+1]
#                 if next_token.dep_ == "ROOT" or next_token.dep_ == "conj":
#                     indices.append(i)
#             except IndexError:
#                 pass
#     if not indices:
#         return sentence
#     else:
#         parts = []
#         last_idx = 0
#         for idx in indices:
#             parts.append(doc[last_idx:idx].text.strip())

#             last_idx = idx+1
#         parts.append(doc[last_idx:].text.strip())
#         return " ".join(parts)

# # Example of remove_comma
# remove_comma("The personal saving rate--while still slightly negative,moved up in October.")


In [None]:
'''
Function to identify the sentiment focus of a sentence using spaCy.
Not used in the current implementation, but it can help isolate the main sentiment-bearing part of a sentence.
Code commented out for potential future reference.
'''

# def sentiment_focus(sentence):
#     doc = nlp(sentence)
#     focus = ""
#     focus_changed = 1
#     for token in doc[:-1]:
#       if token.lower_ == "but":
#           focus = doc[token.i + 1:]
#           return str(focus).strip(),focus_changed

#     for sent in doc.sents:
#         sent_tokens = [token for token in sent]
#         for token in sent_tokens:
#             if token.lower_ == "although" or token.lower_ == "though":
#                 try:
#                     comma_index_back = [token1.i for token1 in doc[token.i:] if token1.text == ','][0]
#                 except IndexError:
#                     try:
#                       comma_index_front = [token1.i for token1 in doc[:token.i] if token1.text == ','][-1]
#                     except IndexError:
#                       return str(doc).strip(),focus_changed
#                     focus = doc[:comma_index_front].text
#                     return str(focus).strip(),focus_changed
#                 try:
#                       comma_index_front = [token1.i for token1 in doc[:token.i] if token1.text == ','][-1]
#                 except IndexError:
#                   focus = doc[comma_index_back+1:].text
#                   return str(focus).strip(),focus_changed
#                 focus = doc[:comma_index_front].text+doc[comma_index_back:].text
#                 return str(focus).strip(),focus_changed

#     if doc[0].lower_ == "while":
#       try:
#         comma_index_back1 = [token2.i for token2 in doc if token2.text == ','][0]
#       except IndexError:
#         return str(doc).strip(),focus_changed
#       focus = doc[comma_index_back1+1:].text
#       return str(focus).strip(),focus_changed

#     focus_changed = 0
#     return str(doc).strip(),focus_changed

# Fine-tuning FinBERT



Import the libraries needed for fine-tuning FinBERT.

In [None]:
'''
Colab commands to install the necessary libraries:

!pip install transformers==4.28.1
!pip install datasets
'''

from transformers import BertTokenizer, Trainer, BertForSequenceClassification, TrainingArguments
import torch
from datasets import Dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import transformers
torch.__version__, transformers.__version__

In [None]:
torch.cuda.is_available()

In [None]:
# Keep only high-confidence predictions to use as training labels

# Filter to sentences with a score above 0.95 (copy to avoid pandas chained-assignment warnings)
df_filtered = df[df['score'] > 0.95].copy()
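
The 0.95 cut-off keeps only the sentences on which `finbert-tone` is most confident, presumably so that its predictions can stand in for training labels. A minimal plain-Python sketch of the same filter follows; the rows are made up for illustration and are not taken from the dataset.

```python
# Hypothetical rows shaped like the `results` list built earlier.
rows = [
    {"sentence": "The SEC today charged the firm with fraud.", "label": "Negative", "score": 0.99},
    {"sentence": "The firm agreed to pay a penalty.", "label": "Negative", "score": 0.80},
    {"sentence": "The case was filed in federal court.", "label": "Neutral", "score": 0.97},
]

THRESHOLD = 0.95  # same cut-off as in the cell above

# Keep only rows whose confidence is strictly above the threshold.
kept = [row for row in rows if row["score"] > THRESHOLD]

print(f"{len(kept)} of {len(rows)} rows kept")
```

A stricter threshold gives cleaner labels but fewer training sentences, so it is worth checking how many rows per class survive the filter.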

In [None]:
# Reset the index of the filtered data and preview it
df_filtered.reset_index(drop=True, inplace=True)
df_filtered.head()

In [None]:
# Copy the filtered dataframe to a new dataframe
df = df_filtered[['sentence', 'label']].copy()

# Convert text labels to integer class ids (the score column was already dropped above)
df['label'] = df['label'].map({'Neutral': 0, 'Positive': 1, 'Negative': 2})

df.head()

## Preparing Training/Validation/Testing

In [None]:
df_train, df_test = train_test_split(df, stratify=df['label'], test_size=0.1, random_state=42)
df_train, df_val = train_test_split(df_train, stratify=df_train['label'],test_size=0.1, random_state=42)
print(df_train.shape, df_test.shape, df_val.shape)
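
Because the second split takes 10% of what is left after the first, the three sets end up at roughly 81% / 9% / 10% of the filtered data (scikit-learn rounds to whole rows, so actual counts can differ slightly). The arithmetic can be checked with exact fractions:

```python
from fractions import Fraction

test_size = Fraction(1, 10)  # first split: test_size=0.1
val_size = Fraction(1, 10)   # second split: test_size=0.1 of the remainder

test_share = test_size
val_share = (1 - test_size) * val_size
train_share = 1 - test_share - val_share

for name, share in [("train", train_share), ("val", val_share), ("test", test_share)]:
    print(f"{name}: {float(share):.0%}")
```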

## Load FinBERT Pre-trained Model
The pre-trained FinBERT model is available on Hugging Face at https://huggingface.co/yiyanghkust/finbert-pretrain


In [None]:
model = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-pretrain',num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-pretrain')

## Prepare Dataset for Fine-tuning

In [None]:
dataset_train = Dataset.from_pandas(df_train)
dataset_val = Dataset.from_pandas(df_val)
dataset_test = Dataset.from_pandas(df_test)

dataset_train = dataset_train.map(lambda e: tokenizer(e['sentence'], truncation=True, padding='max_length', max_length=128), batched=True)
dataset_val = dataset_val.map(lambda e: tokenizer(e['sentence'], truncation=True, padding='max_length', max_length=128), batched=True)
dataset_test = dataset_test.map(lambda e: tokenizer(e['sentence'], truncation=True, padding='max_length' , max_length=128), batched=True)

dataset_train.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
dataset_val.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
dataset_test.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
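
With `truncation=True`, `padding='max_length'`, and `max_length=128`, every sentence becomes a sequence of exactly 128 token ids plus an attention mask that marks real tokens with 1 and padding with 0. The toy function below illustrates the idea only: `pad_or_truncate`, the token ids, and `pad_id=0` are assumptions for illustration, and the real tokenizer additionally keeps its special end-of-sequence token when truncating.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]   # cut sequences that are too long
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)      # number of pad positions needed
    return ids + [pad_id] * padding, mask + [0] * padding


# A short sequence is padded; a long one is truncated (max_length=8 for readability).
short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=8)
long_ids, long_mask = pad_or_truncate(range(200), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```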

## Define Training Options

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {'accuracy': accuracy_score(labels, predictions)}  # argument order is (y_true, y_pred)

args = TrainingArguments(
        output_dir = 'temp/',
        eval_strategy = 'epoch',          # named `evaluation_strategy` in transformers releases before 4.41
        save_strategy = 'epoch',
        learning_rate=2e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=5,
        weight_decay=0.005,
        load_best_model_at_end=True,
        metric_for_best_model='accuracy',
)

trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset_train,
        eval_dataset=dataset_val,
        compute_metrics=compute_metrics
)

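# Make every parameter tensor contiguous in memory; this works around errors
# that can occur when checkpoints are saved with non-contiguous tensors.
for name, param in model.named_parameters():
    if not param.is_contiguous():
        param.data = param.data.contiguous()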

trainer.train()
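
The `compute_metrics` function above picks the highest-scoring class for each row of logits and reports the share of rows where that class matches the true label. The same arithmetic in plain Python, on made-up logits and labels chosen purely for illustration:

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits for four sentences over the three classes (0, 1, 2).
logits = [
    [2.0, 0.1, -1.0],
    [0.2, 0.3, 1.5],
    [0.0, 1.2, 0.4],
    [1.1, 0.9, 0.2],
]
labels = [0, 2, 2, 0]  # hypothetical true class ids

predictions = [argmax(row) for row in logits]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(predictions, accuracy)
```

Since the training labels here come from `finbert-tone` predictions, this accuracy measures agreement with that model rather than with human annotation.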



## Evaluate on Testing Set

In [None]:
model.eval()
trainer.predict(dataset_test).metrics

In [None]:
dataset_test

## Save the Fine-tuned Model

In [None]:
trainer.save_model('finbert-sentiment-sec-press-releases/')

# Evaluate on All Press Releases

In [None]:
'''
Dataframe with sentences grouped by original dataframe entry.
`df_sentences` contains: unique_id (index), sentence
'''
# Imports repeated here so that this cell can be run on its own
import spacy
from tqdm import tqdm
import pandas as pd

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Load original dataframe with column 'text_pr' featured
df = pd.read_parquet("sec.parquet")               # In Google Colab, change to "/content/sec.parquet"
df = df.reset_index(drop=True)
df["unique_id"] = df.index  # give each press release a unique ID

# Store sentence-level data
sentence_records = []

for _, row in tqdm(df.iterrows(), total=len(df)):
    uid = row["unique_id"]
    text = row["text_pr"]

    doc = nlp(text)
    for sent in doc.sents:
        sentence_records.append({
            "unique_id": uid,
            "sentence": sent.text.strip()
        })

# Convert to DataFrame
df_sentences = pd.DataFrame(sentence_records)
df_sentences.to_csv("sec_sentences_with_ids.csv", index=False)

In [None]:
'''
Produces a dataframe with sentences grouped by original dataframe entry, with the label and score for each sentence.
df_sentences contains: unique_id (index), sentence, label, score (the fine-tuned FinBERT model's confidence in its prediction)
'''

# Imports repeated here so that this cell can be run on its own
from transformers import pipeline, BertTokenizer

model_path = "./finbert-sentiment-sec-press-releases"           # In Google Colab, change to "/content/finbert-sentiment-sec-press-releases"
# Use the same tokenizer that the model was fine-tuned with
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-pretrain')

# Load model and tokenizer
sec_model = pipeline("text-classification", model=model_path, tokenizer=tokenizer)

# Loop through each sentence to generate a label and a confidence score
results = []
for sentence in tqdm(df_sentences["sentence"], total=len(df_sentences)):
    doc = sec_model(sentence)
    label = doc[0]['label']
    score = doc[0]['score']
    results.append({"label": label, "score": score})

# Add predictions into DataFrame
df_sentences[["label", "score"]] = pd.DataFrame(results, index=df_sentences.index)

# Replace labels with descriptive adjectives
df_sentences["label"] = df_sentences["label"].replace({
    "LABEL_0": "Neutral",
    "LABEL_1": "Positive",
    "LABEL_2": "Negative"
})

# Save Generated DataFrame into CSV format
df_sentences.to_csv("sec_sentences_with_labels.csv", index=False)
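
The pipeline returns generic names such as `LABEL_0` here, most likely because no `id2label` mapping was stored with the fine-tuned model. The decode table therefore has to mirror the integer encoding used when the training labels were prepared. One way to keep the two in sync is to derive the decode table from the encoding; the names `label2id` and `pipeline_names` are illustrative.

```python
# Encoding used when the training labels were converted to integers.
label2id = {"Neutral": 0, "Positive": 1, "Negative": 2}

# Derive the pipeline's generic names from the same encoding.
pipeline_names = {f"LABEL_{idx}": name for name, idx in label2id.items()}

print(pipeline_names)
```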

In [None]:
df_sentences.head()

In [None]:
# Convert labels to numbers so that sentence scores can be averaged per press release
df_sentences["numerical_score"] = df_sentences["label"].map({
    "Neutral": 0,
    "Positive": 1,
    "Negative": -1
})

'''
df_average contains: 
- unique_id (index)
- avg_score (the average of all numerical scores obtained from the sentences in the same row of the original dataframe)
- avg_label (the sentiment label assigned based on the avg_score) 
'''

# Compute the average numerical score per document
df_average = (
    df_sentences
    .groupby("unique_id", as_index=False)["numerical_score"]
    .mean()
    .rename(columns={"numerical_score": "avg_score"})
)

# Assign a sentiment label based on the averaged score
def assign_label(x):
    if x < 0:
        return "Negative"
    elif x == 0:
        return "Neutral"
    else:
        return "Positive"

df_average["avg_label"] = df_average["avg_score"].apply(assign_label)

# Inspect the result
df_average.head()
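
A small worked example of the aggregation rule, in plain Python with invented sentence labels, shows how the thresholds behave. The `doc_labels` data is hypothetical; `assign_label` follows the same rule as the cell above.

```python
from statistics import mean

SCORE = {"Neutral": 0, "Positive": 1, "Negative": -1}


def assign_label(x):
    if x < 0:
        return "Negative"
    elif x == 0:
        return "Neutral"
    return "Positive"


# Hypothetical sentence labels for three press releases, keyed by unique_id.
doc_labels = {
    0: ["Negative", "Neutral", "Neutral", "Neutral"],
    1: ["Neutral", "Neutral"],
    2: ["Positive", "Negative", "Positive"],
}

summary = {}
for uid, labels in doc_labels.items():
    avg = mean(SCORE[label] for label in labels)
    summary[uid] = (avg, assign_label(avg))

for uid, (avg, label) in summary.items():
    print(uid, round(avg, 2), label)
```

Because any non-zero average moves the label away from "Neutral", a single negative sentence among otherwise neutral ones is enough to mark the whole release as "Negative", as document 0 shows.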

In [None]:
'''
Final DataFrame: Appending original text_pr column to df_average
- unique_id (index)
- text_pr (the original press release text)
- avg_score (the average of all numerical scores obtained from the sentences in the same row of the original dataframe)
- avg_label (the sentiment label assigned based on the avg_score)
'''

deliverable = pd.DataFrame(
    {"unique_id": df_average["unique_id"],
     "text_pr": df.loc[df_average["unique_id"], "text_pr"].values,
     "avg_score": df_average["avg_score"],
     "avg_label": df_average["avg_label"]}
)

deliverable.to_csv("sec_press_releases_with_avg_sentiment.csv", index=False)

In [None]:
deliverable.to_parquet("sec_press_releases_with_avg_sentiment.parquet", index=False)
deliverable.head()  # last expression in the cell, so the preview is displayed