<a href="https://colab.research.google.com/github/chloetychang/UWA-FinBert-SEC/blob/main/Finbert-finetuned.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Get Started

First, create the environment from the `environment.yml`:
```bash
conda env create -f environment.yml         # created using `conda env export > environment.yml`
```
Running this command in a terminal:
- Creates a new Conda environment (with the name defined in the YAML file)
- Installs the specified packages (both Conda and pip dependencies)
- Resolves package versions compatible with your OS

Once the packages are installed, run:
```bash
conda activate finbert-sec-env
```
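As a quick sanity check after activating the environment, the sketch below reports which packages cannot be found. The package list is an assumption based on the imports used later in this notebook, not on the contents of `environment.yml`.

```python
import importlib.util

# Assumed package list, taken from the imports used in this notebook.
REQUIRED_PACKAGES = [
    "pandas", "numpy", "tqdm", "seaborn", "matplotlib",
    "spacy", "transformers", "torch", "datasets", "sklearn",
]

def find_missing(packages):
    """Return the top-level packages that cannot be located in the active environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED_PACKAGES)
print("Missing packages:", missing or "none")
```

An empty result suggests the environment was created and activated correctly; any listed name can be installed separately.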


In [5]:
# Import core Python libraries
import pandas as pd
import numpy as np
from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt

To strip Jupyter Notebook output when committing to GitHub, install `nbstripout` first and then register it as a Git filter for this repository:
```bash
pip install nbstripout
nbstripout --install
```
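To illustrate what output stripping means, here is a minimal sketch that clears outputs and execution counts from a notebook represented as a Python dict. It is not how `nbstripout` is implemented; the `strip_outputs` helper and the tiny example notebook are hypothetical.

```python
import copy
import json

def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs and execution counts cleared."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped

# Tiny hand-written notebook used only for illustration.
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Get Started"], "metadata": {}},
        {
            "cell_type": "code",
            "source": ["1 + 1"],
            "metadata": {},
            "execution_count": 5,
            "outputs": [{"output_type": "execute_result",
                         "data": {"text/plain": ["2"]},
                         "metadata": {},
                         "execution_count": 5}],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

clean = strip_outputs(example)
print(json.dumps(clean["cells"][1], indent=2))
```

Committing the stripped form keeps diffs small, since only the cell sources change between commits.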

In [6]:
# !pip install chardet
## chardet is used for detecting file encoding before reading raw bytes.
## No need for chardet — Parquet stores text as UTF-8 internally.

# import chardet
# result = chardet.detect(parquet["text_pr"])
# encoding = result['encoding']

## To find what encoding type of data
# encoding

# Load SEC Press Releases

In [8]:
# Read Parquet File `sec.parquet`
parquet = pd.read_parquet('sec.parquet')
parquet.head()

# Focus on SEC press releases - `text_pr` column
parquet_text = parquet["text_pr"]
parquet_text.head()

Unnamed: 0,text_pr
0,The Securities and Exchange Commission today c...
1,The Securities and Exchange Commission today s...
2,The Securities and Exchange Commission today c...
3,For Immediate Release 99-70 SEC Charges 11 Ind...
4,The Securities and Exchange Commission institu...


In [None]:
# Get basic info of the dataframe
parquet.info()


# Load FinBERT

In [9]:
# pip install transformers==4.28.

from transformers import BertTokenizer, BertForSequenceClassification, pipeline

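# Load the FinBERT tone classifier (3 labels) and its tokenizer
finbert_model = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone', num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')
# Wrap model and tokenizer in a text-classification pipeline
finbert = pipeline("text-classification", model=finbert_model, tokenizer=tokenizer)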

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cuda:0


## Using spaCy to split sentences

In [10]:
import spacy

nlp = spacy.load('en_core_web_sm')

sentences = []
for text in tqdm(parquet_text):
    doc = nlp(text)
    for sent in doc.sents:
        sentences.append(sent.text)



  0%|          | 0/959 [00:00<?, ?it/s]

# Predict the original dataset


In [11]:
nlp(parquet_text[1])

The Securities and Exchange Commission today sued Livent, Inc. and nine former senior officers, directors, and members of the accounting staff of Livent, Inc. for engaging in a multi-faceted and pervasive accounting fraud spanning eight years from 1990 through the first quarter of 1998. Five individuals are also alleged to have engaged in insider trading. Also today, the U.S. Attorney for the Southern District of New York announced the indictment of former Livent officials Garth Drabinsky and Myron Gottlieb for sixteen felony counts each, for violations of the federal securities laws. In addition, the U.S. Attorney announced that former Livent officials Gordon C. Eckstein and Maria Messina pled guilty to one felony count each, for violations of the federal securities laws. Richard H. Walker, Director of the SEC's Division of Enforcement, said, "Accounting fraud strikes at the heart of the integrity of the securities markets and will not be tolerated by the Commission. This case should 

In [12]:
results = []
for sentence in tqdm(sentences):
    doc = finbert(sentence)
    label = doc[0]['label']
    score = doc[0]['score']
    results.append({'sentence': sentence, 'label': label, 'score': score})

df = pd.DataFrame(results)

  0%|          | 2/22482 [00:00<1:29:23,  4.19it/s]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 22482/22482 [03:41<00:00, 101.72it/s]
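The warning above suggests avoiding one pipeline call per sentence. A minimal sketch of batching is shown below, assuming the classifier accepts a list of strings and returns one `{'label', 'score'}` dict per input, which is how a Hugging Face text-classification pipeline is expected to behave. The helper names and the stand-in classifier are hypothetical, so the example runs without downloading a model.

```python
from typing import Callable, Dict, Iterator, List

def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def classify_in_batches(sentences: List[str],
                        classifier: Callable[[List[str]], List[Dict]],
                        batch_size: int = 32) -> List[Dict]:
    """Classify sentences batch by batch and collect one result row per sentence."""
    results = []
    for batch in batched(sentences, batch_size):
        predictions = classifier(batch)  # one dict per sentence in the batch
        for sentence, prediction in zip(batch, predictions):
            results.append({"sentence": sentence,
                            "label": prediction["label"],
                            "score": prediction["score"]})
    return results

# Stand-in classifier (hypothetical); in the notebook, `finbert` would be passed instead.
def dummy_classifier(batch: List[str]) -> List[Dict]:
    return [{"label": "Neutral", "score": 1.0} for _ in batch]

demo_sentences = [f"Sentence {i}" for i in range(5)]
demo_results = classify_in_batches(demo_sentences, dummy_classifier, batch_size=2)
print(len(demo_results))
```

The result rows have the same keys as the loop above, so `pd.DataFrame(demo_results)` would produce the same column layout.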


In [13]:
df.head()

Unnamed: 0,sentence,label,score
0,The Securities and Exchange Commission today c...,Neutral,0.852387
1,Five of the firms settled the charges and four...,Neutral,0.992689
2,The Commission's rules required firms to file ...,Neutral,0.730382
3,Three of the nine transfer agents agreed to ce...,Neutral,0.828894
4,"They are: CSJ, LLC, Houston, TX; Davidson Trus...",Neutral,0.999998


In [14]:
# Save the results
df.to_csv('SEC_results.csv', index=False)

# Text Simplification (sentiment focus)

**Note on Text Simplification**

SEC enforcement press releases differ fundamentally from narrative policy texts like FOMC statements. They are shorter, more direct, and written in a formal legal register, typically featuring constructions such as “The SEC today charged…”, “Without admitting or denying…”, or “The firm agreed to pay…”.

These sentences rarely use hedging or contrastive connectors (e.g., “although”, “but”, “while”) in ways that affect sentiment. Applying preprocessing steps like `remove_comma()` or `sentiment_focus()` could inadvertently strip legally significant clauses such as names, charges, or disclaimers, which contain essential context.

As a result, such preprocessing would not enhance sentiment extraction accuracy and could, in fact, distort the legal tone central to SEC communications.


---
**Note on Sentiment Focus - Why `sentiment_focus()` Was Not Applied to SEC Press Releases**

The `sentiment_focus()` filtering step, used in FOMC sentiment analysis, was omitted for SEC enforcement press releases. Unlike policy statements, nearly every sentence in an enforcement release carries informational or evaluative weight. Clauses that appear factual, such as “The SEC today charged…”, “Without admitting or denying…”, or “The firm agreed to pay…”, are legally meaningful and influence perceived tone and market response. Filtering them out could remove crucial signals about enforcement severity, cooperation, or settlement outcomes.

Therefore, all sentences were retained, and sentiment was evaluated at the sentence level before aggregation, preserving the full legal and contextual nuance of SEC communications.
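The aggregation step is not spelled out here, so the sketch below shows one possible way to roll sentence-level labels up to a single score per press release. The net-tone formula, (positive minus negative) divided by the total sentence count, is an assumption for illustration and not necessarily the method used in this project.

```python
from collections import Counter

def net_tone(labels):
    """Assumed aggregation: (positive - negative) / total sentences, in [-1, 1]."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no sentences: treat the release as neutral
    return (counts["Positive"] - counts["Negative"]) / total

# Hypothetical sentence labels for one press release.
release_labels = ["Neutral", "Negative", "Negative", "Neutral", "Positive"]
print(net_tone(release_labels))  # (1 - 2) / 5 = -0.2
```

Because every sentence is retained, neutral sentences enter the denominator and dampen the score rather than being dropped.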

In [None]:
'''
Function to remove punctuation (such as a comma) before a root or conjunction using spaCy.
This function is not used in the current implementation but can be useful for preprocessing sentences.
Code commented out for potential future reference.
'''

# import spacy

# nlp = spacy.load("en_core_web_sm")

# def remove_comma(sentence):
#     doc = nlp(sentence)
#     indices = []
#     for i, token in enumerate(doc):
#         if token.dep_ == "punct":
#             try:
#                 next_token = doc[i+1]
#                 if next_token.dep_ == "ROOT" or next_token.dep_ == "conj":
#                     indices.append(i)
#             except IndexError:
#                 pass
#     if not indices:
#         return sentence
#     else:
#         parts = []
#         last_idx = 0
#         for idx in indices:
#             parts.append(doc[last_idx:idx].text.strip())

#             last_idx = idx+1
#         parts.append(doc[last_idx:].text.strip())
#         return " ".join(parts)

# # Example of remove_comma
# remove_comma("The personal saving rate--while still slightly negative,moved up in October.")


In [None]:
'''
Function to identify sentiment focus in a sentence using spaCy.
This function is not used in the current implementation but can be useful for identifying the main sentiment-bearing part of a sentence.
Code commented out for potential future reference.
'''

# def sentiment_focus(sentence):
#     doc = nlp(sentence)
#     focus = ""
#     focus_changed = 1
#     for token in doc[:-1]:
#       if token.lower_ == "but":
#           focus = doc[token.i + 1:]
#           return str(focus).strip(),focus_changed

#     for sent in doc.sents:
#         sent_tokens = [token for token in sent]
#         for token in sent_tokens:
#             if token.lower_ == "although" or token.lower_ == "though":
#                 try:
#                     comma_index_back = [token1.i for token1 in doc[token.i:] if token1.text == ','][0]
#                 except IndexError:
#                     try:
#                       comma_index_front = [token1.i for token1 in doc[:token.i] if token1.text == ','][-1]
#                     except IndexError:
#                       return str(doc).strip(),focus_changed
#                     focus = doc[:comma_index_front].text
#                     return str(focus).strip(),focus_changed
#                 try:
#                       comma_index_front = [token1.i for token1 in doc[:token.i] if token1.text == ','][-1]
#                 except IndexError:
#                   focus = doc[comma_index_back+1:].text
#                   return str(focus).strip(),focus_changed
#                 focus = doc[:comma_index_front].text+doc[comma_index_back:].text
#                 return str(focus).strip(),focus_changed

#     if doc[0].lower_ == "while":
#       try:
#         comma_index_back1 = [token2.i for token2 in doc if token2.text == ','][0]
#       except IndexError:
#         return str(doc).strip(),focus_changed
#       focus = doc[comma_index_back1+1:].text
#       return str(focus).strip(),focus_changed

#     focus_changed = 0
#     return str(doc).strip(),focus_changed

# Fine-tuning FinBERT



Import the libraries needed for fine-tuning FinBERT.

In [15]:
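# Note: in this run the transformers==4.28.1 pin failed while building `tokenizers` (see the log below),
# so the session used the transformers version reported by the last line of this cell.
!pip install transformers==4.28.1
!pip install datasets
from transformers import BertTokenizer, Trainer, BertForSequenceClassification, TrainingArguments
import torch
from datasets import Dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import transformers
# Report the versions actually in use
torch.__version__, transformers.__version__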

Collecting transformers==4.28.1
  Using cached transformers-4.28.1-py3-none-any.whl.metadata (109 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.28.1)
  Using cached tokenizers-0.13.3.tar.gz (314 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Using cached transformers-4.28.1-py3-none-any.whl (7.0 MB)
Building wheels for collected packages: tokenizers
  error: subprocess-exited-with-error

  × Building wheel for tokenizers (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for tokenizers (pyproject.toml) ... error
  ERROR: Failed building wheel for tokenizers

('2.8.0+cu126', '4.56.2')

In [2]:
torch.cuda.is_available()

True

In [16]:
# Keep only high-confidence predictions (score above 0.95) to use as training labels
df_filtered = df[df['score'] > 0.95]
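Because the labels come from the model's own predictions, a confidence filter can change the class balance. The sketch below shows one way to compare label counts before and after filtering; the rows and helper names are hypothetical and use plain Python rather than the notebook's DataFrame.

```python
from collections import Counter

def filter_confident(rows, threshold=0.95):
    """Keep rows whose prediction score is strictly above the threshold."""
    return [row for row in rows if row["score"] > threshold]

def label_counts(rows):
    """Count how many rows carry each label."""
    return Counter(row["label"] for row in rows)

# Hypothetical predictions, for illustration only.
rows = [
    {"label": "Neutral", "score": 0.99},
    {"label": "Neutral", "score": 0.85},
    {"label": "Negative", "score": 0.97},
    {"label": "Positive", "score": 0.60},
]

kept = filter_confident(rows)
print(label_counts(rows))
print(label_counts(kept))
```

If a class shrinks sharply or disappears after filtering, the fine-tuned model will see few or no examples of it.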

In [17]:
# Reset the index of the filtered data and preview it
df_filtered = df_filtered.reset_index(drop=True)
df_filtered.head()

Unnamed: 0,sentence,label,score
0,Five of the firms settled the charges and four...,Neutral,0.992689
1,"They are: CSJ, LLC, Houston, TX; Davidson Trus...",Neutral,0.999998
2,"The two are Corporate Planners, Inc., Fountain...",Neutral,0.999998
3,"They are: Alpha Tech Stock Transfer Trust, Dra...",Neutral,0.999783
4,Investors and other market participants have t...,Neutral,0.997276


In [18]:
df = df_filtered[['sentence', 'label']].copy()

# Convert text labels to numeric values
# (`map` is used in place of `replace`, which emitted a warning in this run)
df['label'] = df['label'].map({'Neutral': 0, 'Positive': 1, 'Negative': 2})

df.head()


Unnamed: 0,sentence,label
0,Five of the firms settled the charges and four...,0
1,"They are: CSJ, LLC, Houston, TX; Davidson Trus...",0
2,"The two are Corporate Planners, Inc., Fountain...",0
3,"They are: Alpha Tech Stock Transfer Trust, Dra...",0
4,Investors and other market participants have t...,0


## Preparing training/validation/testing

In [19]:
# Hold out 10% for testing, then 10% of the remainder for validation (both stratified by label)
df_train, df_test = train_test_split(df, stratify=df['label'], test_size=0.1, random_state=42)
df_train, df_val = train_test_split(df_train, stratify=df_train['label'], test_size=0.1, random_state=42)
print(df_train.shape, df_test.shape, df_val.shape)

(13185, 2) (1628, 2) (1465, 2)
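The printed shapes follow from applying the two 10% splits in sequence. The sketch below reproduces the arithmetic, assuming the test portion of each split is rounded up and the rest goes to training, which is my understanding of how scikit-learn sizes fractional splits. The `split_sizes` helper is hypothetical.

```python
import math

def split_sizes(n_samples, test_size=0.1, val_size=0.1):
    """Return (train, val, test) row counts for two successive fractional splits."""
    n_test = math.ceil(test_size * n_samples)    # first split: held-out test set
    n_remaining = n_samples - n_test
    n_val = math.ceil(val_size * n_remaining)    # second split: validation set
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

# Total rows in the filtered DataFrame, recovered from the shapes printed above.
total_rows = 13185 + 1628 + 1465
print(split_sizes(total_rows))  # (13185, 1465, 1628)
```

Note that the validation set is 10% of the remaining 90%, so it covers about 9% of the filtered data.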


## Load FinBERT pretrained model
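The pretrained FinBERT model is hosted on Hugging Face at https://huggingface.co/yiyanghkust/finbert-pretrain. This checkpoint has no trained classification head, so the classifier weights are newly initialized on load and are learned during fine-tuning.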


In [20]:
model = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-pretrain',num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-pretrain')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at yiyanghkust/finbert-pretrain and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



## Prepare Dataset for Fine-tuning

In [21]:
dataset_train = Dataset.from_pandas(df_train)
dataset_val = Dataset.from_pandas(df_val)
dataset_test = Dataset.from_pandas(df_test)

dataset_train = dataset_train.map(lambda e: tokenizer(e['sentence'], truncation=True, padding='max_length', max_length=128), batched=True)
dataset_val = dataset_val.map(lambda e: tokenizer(e['sentence'], truncation=True, padding='max_length', max_length=128), batched=True)
dataset_test = dataset_test.map(lambda e: tokenizer(e['sentence'], truncation=True, padding='max_length' , max_length=128), batched=True)

dataset_train.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
dataset_val.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
dataset_test.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
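The tokenizer call above pads or truncates every sentence to 128 tokens. As a simplified illustration of what `padding='max_length'` and `truncation=True` produce, the sketch below works on toy token IDs. It ignores special tokens, and the helper name and the pad ID of 0 are assumptions for illustration, not the BERT tokenizer's actual implementation.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    ids = list(token_ids[:max_length])           # truncation: drop tokens past max_length
    attention_mask = [1] * len(ids)              # 1 marks real tokens
    padding = max_length - len(ids)
    input_ids = ids + [pad_id] * padding         # padding: fill up to max_length
    attention_mask = attention_mask + [0] * padding  # 0 marks padding positions
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7, 8], max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Fixed-length inputs let every batch share one tensor shape; the attention mask tells the model which positions to ignore.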


## Define Training Options

In [25]:
def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the true labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    # accuracy_score expects (y_true, y_pred)
    return {'accuracy': accuracy_score(labels, predictions)}

args = TrainingArguments(
        output_dir = 'temp/',
        # `eval_strategy` is the argument name in recent transformers releases (older ones call it `evaluation_strategy`)
        eval_strategy = 'epoch',
        save_strategy = 'epoch',
        learning_rate=2e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=5,
        weight_decay=0.005,
        load_best_model_at_end=True,
        metric_for_best_model='accuracy',
)

trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset_train,
        eval_dataset=dataset_val,
        compute_metrics=compute_metrics
)

trainer.train()

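# Make every parameter tensor contiguous in memory; this is a common workaround for
# errors raised when saving non-contiguous tensors (assumed to be the reason for this loop)
for name, param in model.named_parameters():
    if not param.is_contiguous():
        param.data = param.data.contiguous()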


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.089453,0.976109
2,0.119000,0.044409,0.989761
3,0.023900,0.033219,0.993174
4,0.005500,0.036201,0.991126
5,0.001700,0.036038,0.990444
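With `load_best_model_at_end=True` and `metric_for_best_model='accuracy'`, the checkpoint with the highest validation accuracy should be the one restored after training. The sketch below applies that selection rule to the values in the table above; the `best_checkpoint` helper is hypothetical and only mimics the rule, not the `Trainer` internals.

```python
# Validation results copied from the table above.
eval_history = [
    {"epoch": 1, "eval_loss": 0.089453, "eval_accuracy": 0.976109},
    {"epoch": 2, "eval_loss": 0.044409, "eval_accuracy": 0.989761},
    {"epoch": 3, "eval_loss": 0.033219, "eval_accuracy": 0.993174},
    {"epoch": 4, "eval_loss": 0.036201, "eval_accuracy": 0.991126},
    {"epoch": 5, "eval_loss": 0.036038, "eval_accuracy": 0.990444},
]

def best_checkpoint(history, metric="eval_accuracy", greater_is_better=True):
    """Pick the row with the best value of `metric`."""
    chooser = max if greater_is_better else min
    return chooser(history, key=lambda row: row[metric])

best = best_checkpoint(eval_history)
print(best["epoch"])  # 3
```

Epoch 3 also has the lowest validation loss, so both criteria agree here; the later epochs lower the training loss without improving validation results.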


## Evaluate on Testing Set

In [26]:
model.eval()
trainer.predict(dataset_test).metrics

{'test_loss': 0.030718736350536346,
 'test_accuracy': 0.992014742014742,
 'test_runtime': 12.0364,
 'test_samples_per_second': 135.257,
 'test_steps_per_second': 4.237}
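Overall accuracy can hide weak performance on rare classes if the labels are imbalanced. As a complementary check, the sketch below computes recall per class from true and predicted label IDs. The label lists are hypothetical; in the notebook they could be taken from the `label_ids` and the argmax of `predictions` returned by `trainer.predict(dataset_test)`.

```python
from collections import defaultdict

def per_class_recall(y_true, y_pred):
    """Return {label: fraction of that label's examples predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in sorted(total)}

# Hypothetical labels (0 = Neutral, 1 = Positive, 2 = Negative), for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                          # 0.9
print(per_class_recall(y_true, y_pred))  # {0: 1.0, 1: 0.0, 2: 1.0}
```

In this toy case accuracy is 0.9 even though the single Positive example is missed, which is the pattern per-class recall is meant to expose.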

In [27]:
dataset_test

Dataset({
    features: ['sentence', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1628
})

## Save the fine-tuned model

In [28]:
trainer.save_model('finbert-sentiment-sec-press-releases/')

# Evaluate on All Press Releases