# FinBERT Example Notebook

This notebook shows how to train and use the FinBERT pre-trained language model for financial sentiment analysis.

## Modules 

In [1]:
from pathlib import Path
import shutil
import os
import logging
import sys
import pandas as pd

from textblob import TextBlob
from pprint import pprint
from sklearn.metrics import classification_report

from transformers import AutoModelForSequenceClassification

from finbert.finbert import *
import finbert.utils as tools

%load_ext autoreload
%autoreload 2

project_dir = Path.cwd()
# None disables column-width truncation; -1 is deprecated in recent pandas versions
pd.set_option('display.max_colwidth', None)

import warnings
warnings.filterwarnings("ignore")

In [2]:
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.ERROR)

## Prepare the model

### Setting path variables:
1. `lm_path`: the path for the pre-trained language model (If vanilla Bert is used then no need to set this one).
2. `cl_path`: the path where the classification model is saved.
3. `cl_data_path`: the path of the directory that contains the data files of `train.csv`, `validation.csv`, `test.csv`.
---

When initializing `bertmodel`, we can either use the original pre-trained weights from Google by passing `bm = 'bert-base-uncased'`, or our further pre-trained language model by passing `bm = lm_path`.
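
The choice between the two sources can be sketched as follows. `resolve_base_model` is a hypothetical helper, not part of FinBERT, and it assumes that a saved model directory contains a `config.json` file, as directories written by `save_pretrained` normally do.

```python
from pathlib import Path


def resolve_base_model(lm_path, fallback="bert-base-uncased"):
    """Hypothetical helper: prefer a local further pre-trained model directory."""
    lm_path = Path(lm_path)
    # Assumption: a usable local model directory contains config.json.
    if (lm_path / "config.json").is_file():
        return str(lm_path)
    # Otherwise fall back to the vanilla BERT identifier.
    return fallback


bm = resolve_base_model(Path.cwd() / "Models" / "language_model" / "finbertTRC2")
print(bm)
```

The returned value can then be passed wherever the notebook expects a model name or path.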


---
All of the model's configuration is controlled through the `config` variable.

In [3]:
lm_path = project_dir/'Models'/'language_model'/'finbertTRC2'
cl_path = project_dir/'Models'/'classifier_model'/'finbert-sentiment'
cl_data_path = project_dir/'Data'/'sentiment_data_stocktwits_finbert'
cl_data_path_financial_phrase_bank = project_dir/'Data'/'sentiment_data_finbert'

###  Configuring training parameters

You can find explanations of the training parameters in the class docstrings.

In [4]:
# Remove any previous classifier output so that training starts from a clean cl_path
shutil.rmtree(cl_path, ignore_errors=True)

bertmodel = AutoModelForSequenceClassification.from_pretrained(lm_path,cache_dir=None, num_labels=3)


config = Config(   data_dir=cl_data_path,
                   bert_model=bertmodel,
                   num_train_epochs=9,
                   model_dir=cl_path,
                   max_seq_length = 96,
                   train_batch_size = 64,
                   learning_rate = 2e-5,
                   output_mode='classification',
                   warm_up_proportion=0.1,
                   local_rank=-1,
                   discriminate=True,
                   gradual_unfreeze=True)

Some weights of the model checkpoint at C:\Users\eikde\source\repos\help-dissertation\Models\language_model\finbertTRC2 were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceCl

`FinBert` is the main class that encapsulates all the functionality; it is instantiated here as `finbert`. The list of class labels must be passed to the `prepare_model` method through the `label_list` parameter.
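
As a small illustration of how the label list relates to the integer classes reported later, the sketch below builds both lookup tables. It assumes that class indices follow the order of `label_list`, which appears consistent with the outputs shown further down but is not taken from the FinBERT source.

```python
label_list = ["positive", "negative", "neutral"]

# Assumption: class index i corresponds to label_list[i].
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for label, i in label2id.items()}

print(label2id)
print(id2label[1])
```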

In [5]:
finbert = FinBert(config)
finbert.base_model = 'bert-base-uncased'
finbert.config.discriminate=True
finbert.config.gradual_unfreeze=True

In [6]:
finbert.prepare_model(label_list=['positive','negative','neutral'])

08/11/2022 20:05:57 - INFO - finbert.finbert -   device: cuda n_gpu: 1, distributed training: False, 16-bits training: False


## Fine-tune the model

In [7]:
train_data = finbert.get_data('train')

In [8]:
model = finbert.create_the_model()

### [Optional] Fine-tune only a subset of the model
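The variable `freeze` sets how many of the 12 encoder layers are frozen, counting from the bottom: with `freeze = 6`, the embeddings and encoder layers 0 to 5 are not updated during training. You can skip this part if you want to fine-tune the whole model.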

<span style="color:red">Important: </span>
Execute this step if you want a shorter training time at the expense of accuracy.
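
To make the effect of `freeze` concrete, here is a plain-Python sketch that lists which components stay trainable. `freeze_plan` is a hypothetical helper for illustration only; it mirrors the loop in the next cell but does not touch a real model.

```python
def freeze_plan(freeze, num_layers=12):
    """Return {component name: is_trainable} for a BERT-base-like encoder."""
    if not 0 <= freeze <= num_layers:
        raise ValueError("freeze must be between 0 and num_layers")
    # The embeddings are always frozen in this scheme.
    plan = {"embeddings": False}
    for i in range(num_layers):
        # Layers 0 .. freeze-1 are frozen, the remaining layers are fine-tuned.
        plan[f"encoder.layer[{i}]"] = i >= freeze
    return plan


plan = freeze_plan(6)
trainable = [name for name, is_trainable in plan.items() if is_trainable]
print(trainable)
```

With `freeze = 6`, only encoder layers 6 to 11 (and, in the real model, the classification head) receive gradient updates.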

In [9]:
# This is for fine-tuning a subset of the model.

freeze = 6

for param in model.bert.embeddings.parameters():
    param.requires_grad = False
    
for i in range(freeze):
    for param in model.bert.encoder.layer[i].parameters():
        param.requires_grad = False

In [10]:
# Free cached GPU memory before training; run this after a CUDA out-of-memory error
import torch
torch.cuda.empty_cache()
import gc
gc.collect()
torch.cuda.memory_summary(device=None, abbreviated=False)



### Training

In [11]:
trained_model = finbert.train(train_examples = train_data, model = model)

08/11/2022 20:06:10 - INFO - finbert.utils -   *** Example ***
08/11/2022 20:06:10 - INFO - finbert.utils -   guid: train-1
08/11/2022 20:06:10 - INFO - finbert.utils -   tokens: [CLS] form k entry material definitive agreement september flu ##or entered amendment lend ##ers amended rest ##ated revolving loan l [SEP]
08/11/2022 20:06:10 - INFO - finbert.utils -   input_ids: 101 2433 1047 4443 3430 15764 3820 2244 19857 2953 3133 7450 18496 2545 13266 2717 4383 24135 5414 1048 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/11/2022 20:06:10 - INFO - finbert.utils -   attention_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/11/2022 20:06:10 - INFO - finbert.utils -   token_type_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Iteration:   0%|          | 0/1301 [00:00<?, ?it/s]

08/11/2022 20:14:16 - INFO - finbert.utils -   *** Example ***
08/11/2022 20:14:16 - INFO - finbert.utils -   guid: validation-1
08/11/2022 20:14:16 - INFO - finbert.utils -   tokens: [CLS] misses good ask microsoft [SEP]
08/11/2022 20:14:16 - INFO - finbert.utils -   input_ids: 101 22182 2204 3198 7513 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/11/2022 20:14:16 - INFO - finbert.utils -   attention_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/11/2022 20:14:16 - INFO - finbert.utils -   token_type_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0

Validating:   0%|          | 0/163 [00:00<?, ?it/s]

Validation losses: [0.8264636042659268]
No best model found


Epoch:   8%|▊         | 1/12 [08:38<1:35:00, 518.23s/it]

Iteration:   0%|          | 0/1301 [00:00<?, ?it/s]

Validating:   0%|          | 0/163 [00:00<?, ?it/s]

Validation losses: [0.8264636042659268, 0.7320586856157502]


Epoch:  17%|█▋        | 2/12 [21:51<1:53:17, 679.76s/it]

Iteration:   0%|          | 0/1301 [00:00<?, ?it/s]

Validating:   0%|          | 0/163 [00:00<?, ?it/s]

Validation losses: [0.8264636042659268, 0.7320586856157502, 0.675052142764893]


Epoch:  25%|██▌       | 3/12 [39:28<2:07:49, 852.16s/it]

Iteration:   0%|          | 0/1301 [00:00<?, ?it/s]

Validating:   0%|          | 0/163 [00:00<?, ?it/s]

Validation losses: [0.8264636042659268, 0.7320586856157502, 0.675052142764893, 0.6709313586445674]


Epoch:  33%|███▎      | 4/12 [58:15<2:08:06, 960.83s/it]

Iteration:   0%|          | 0/1301 [00:00<?, ?it/s]

Validating:   0%|          | 0/163 [00:00<?, ?it/s]

Validation losses: [0.8264636042659268, 0.7320586856157502, 0.675052142764893, 0.6709313586445674, 0.6702866307431203]


Epoch:  42%|████▏     | 5/12 [1:17:36<2:00:31, 1033.07s/it]

Iteration:   0%|          | 0/1301 [00:00<?, ?it/s]

Validating:   0%|          | 0/163 [00:00<?, ?it/s]

Epoch:  50%|█████     | 6/12 [1:36:44<1:47:11, 1071.97s/it]

Validation losses: [0.8264636042659268, 0.7320586856157502, 0.675052142764893, 0.6709313586445674, 0.6702866307431203, 0.678784914360456]


Iteration:   0%|          | 0/1301 [00:00<?, ?it/s]

Validating:   0%|          | 0/163 [00:00<?, ?it/s]

Epoch:  58%|█████▊    | 7/12 [1:55:44<1:31:11, 1094.39s/it]

Validation losses: [0.8264636042659268, 0.7320586856157502, 0.675052142764893, 0.6709313586445674, 0.6702866307431203, 0.678784914360456, 0.6729712009064259]


Iteration:   0%|          | 0/1301 [00:00<?, ?it/s]

Validating:   0%|          | 0/163 [00:00<?, ?it/s]

Epoch:  67%|██████▋   | 8/12 [2:14:48<1:13:59, 1109.91s/it]

Validation losses: [0.8264636042659268, 0.7320586856157502, 0.675052142764893, 0.6709313586445674, 0.6702866307431203, 0.678784914360456, 0.6729712009064259, 0.69882751480202]


Iteration:   0%|          | 0/1301 [00:00<?, ?it/s]

Validating:   0%|          | 0/163 [00:00<?, ?it/s]

Epoch:  75%|███████▌  | 9/12 [2:33:49<55:59, 1119.68s/it]  

Validation losses: [0.8264636042659268, 0.7320586856157502, 0.675052142764893, 0.6709313586445674, 0.6702866307431203, 0.678784914360456, 0.6729712009064259, 0.69882751480202, 0.7079824185444533]


Iteration:   0%|          | 0/1301 [00:00<?, ?it/s]

Validating:   0%|          | 0/163 [00:00<?, ?it/s]

Epoch:  83%|████████▎ | 10/12 [2:52:40<37:26, 1123.21s/it]

Validation losses: [0.8264636042659268, 0.7320586856157502, 0.675052142764893, 0.6709313586445674, 0.6702866307431203, 0.678784914360456, 0.6729712009064259, 0.69882751480202, 0.7079824185444533, 0.7161743022912851]


Iteration:   0%|          | 0/1301 [00:00<?, ?it/s]

Validating:   0%|          | 0/163 [00:00<?, ?it/s]

Epoch:  92%|█████████▏| 11/12 [3:11:26<18:44, 1124.01s/it]

Validation losses: [0.8264636042659268, 0.7320586856157502, 0.675052142764893, 0.6709313586445674, 0.6702866307431203, 0.678784914360456, 0.6729712009064259, 0.69882751480202, 0.7079824185444533, 0.7161743022912851, 0.7357265718875488]


Iteration:   0%|          | 0/1301 [00:00<?, ?it/s]

Validating:   0%|          | 0/163 [00:00<?, ?it/s]

Epoch: 100%|██████████| 12/12 [3:30:09<00:00, 1050.83s/it]

Validation losses: [0.8264636042659268, 0.7320586856157502, 0.675052142764893, 0.6709313586445674, 0.6702866307431203, 0.678784914360456, 0.6729712009064259, 0.69882751480202, 0.7079824185444533, 0.7161743022912851, 0.7357265718875488, 0.736610238720303]
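
The validation losses above stop improving after a few epochs. A minimal sketch for locating the best epoch from such a list is shown below; the values are copied from the final log line, and whether `finbert.train` itself restores the best checkpoint is not shown here.

```python
validation_losses = [
    0.8264636042659268, 0.7320586856157502, 0.675052142764893,
    0.6709313586445674, 0.6702866307431203, 0.678784914360456,
    0.6729712009064259, 0.69882751480202, 0.7079824185444533,
    0.7161743022912851, 0.7357265718875488, 0.736610238720303,
]

# Index of the smallest validation loss (0-based), then the 1-based epoch number.
best_index = min(range(len(validation_losses)), key=validation_losses.__getitem__)
best_epoch = best_index + 1

# Number of epochs trained after the best one without any improvement.
epochs_since_best = len(validation_losses) - best_epoch

print(f"Best epoch: {best_epoch}, validation loss: {validation_losses[best_index]:.4f}")
print(f"Epochs without improvement afterwards: {epochs_since_best}")
```

In this run the loss bottoms out at epoch 5 and rises afterwards, which suggests that the later epochs mainly overfit the training data.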





## Test the model

`finbert.evaluate` returns a DataFrame that contains the true label and the logit values for each example.

In [12]:
test_data = finbert.get_data('test')

In [13]:
results = finbert.evaluate(examples=test_data, model=trained_model)

08/11/2022 23:36:45 - INFO - finbert.utils -   *** Example ***
08/11/2022 23:36:45 - INFO - finbert.utils -   guid: test-1
08/11/2022 23:36:45 - INFO - finbert.utils -   tokens: [CLS] loving vol ##ati ##lity stock great hold huge long term potential great day ##tra ##ding short swing finds pockets consolidate pocket seems [SEP]
08/11/2022 23:36:45 - INFO - finbert.utils -   input_ids: 101 8295 5285 10450 18605 4518 2307 2907 4121 2146 2744 4022 2307 2154 6494 4667 2460 7370 4858 10306 24939 4979 3849 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/11/2022 23:36:45 - INFO - finbert.utils -   attention_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/11/2022 23:36:45 - INFO - finbert.utils -   token_type_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

Testing:   0%|          | 0/163 [00:00<?, ?it/s]

### Prepare the classification report

In [14]:
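import torch
from torch.nn import CrossEntropyLoss

def report(df, cols=('label', 'prediction', 'logits')):
    """Print loss, accuracy and a per-class report. cols = (true label, predicted label, class scores)."""
    label_col, pred_col, score_col = cols
    # Class-weighted cross-entropy between the model outputs and the true labels
    cs = CrossEntropyLoss(weight=finbert.class_weights)
    loss = cs(torch.tensor(list(df[score_col])), torch.tensor(list(df[label_col])))
    print("Loss:{0:.2f}".format(loss))
    # Accuracy is the share of rows where the predicted label equals the true label
    print("Accuracy:{0:.2f}".format((df[label_col] == df[pred_col]).sum() / df.shape[0]))
    print("\nClassification Report:")
    print(classification_report(df[label_col], df[pred_col]))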

In [15]:
results['prediction'] = results.predictions.apply(lambda x: np.argmax(x,axis=0))

In [16]:
report(results,cols=['labels','prediction','predictions'])

Loss:0.67
Accuracy:0.73

Classification Report:
              precision    recall  f1-score   support

           0       0.64      0.74      0.69      3308
           1       0.71      0.75      0.73      3475
           2       0.87      0.69      0.77      3620

    accuracy                           0.73     10403
   macro avg       0.74      0.73      0.73     10403
weighted avg       0.74      0.73      0.73     10403
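
The two average rows in the report can be reproduced from the per-class rows. The sketch below does this with the rounded values printed above, so the results agree with the report only up to rounding.

```python
# (precision, recall, f1, support) per class, copied from the report above
rows = {
    0: (0.64, 0.74, 0.69, 3308),
    1: (0.71, 0.75, 0.73, 3475),
    2: (0.87, 0.69, 0.77, 3620),
}

total_support = sum(support for *_, support in rows.values())

# Macro average: unweighted mean over classes.
macro = [sum(r[k] for r in rows.values()) / len(rows) for k in range(3)]

# Weighted average: mean over classes weighted by support.
weighted = [sum(r[k] * r[3] for r in rows.values()) / total_support for k in range(3)]

print("support:", total_support)
print("macro avg:   ", [round(v, 2) for v in macro])
print("weighted avg:", [round(v, 2) for v in weighted])
```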



### Get predictions

With the `predict` function, given a piece of text, we split it into a list of sentences and then predict sentiment for each sentence. The output is written into a dataframe. Predictions are represented in three different columns: 

1) `logit`: probabilities for each class (despite the column name, the values are softmax outputs that sum to 1)

2) `prediction`: predicted label

3) `sentiment_score`: sentiment score calculated as: probability of positive - probability of negative
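
The relationship between the three columns can be sketched in plain Python. The helper names `softmax` and `summarize` are illustrative, not FinBERT functions, and the label order (positive, negative, neutral) is an assumption based on the `label_list` used earlier in this notebook.

```python
import math

LABELS = ["positive", "negative", "neutral"]  # assumed class order


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(probs, labels=LABELS):
    """Return (predicted label, sentiment score) for one sentence."""
    prediction = labels[max(range(len(probs)), key=probs.__getitem__)]
    # Sentiment score: probability of positive minus probability of negative.
    sentiment_score = probs[0] - probs[1]
    return prediction, sentiment_score


# Probabilities taken from the first row of the first result table below.
probs = [0.066904604, 0.5872885, 0.3458069]
prediction, score = summarize(probs)
print(prediction, round(score, 6))
```

A score near -1 indicates strongly negative text, a score near +1 strongly positive text, and a sentence dominated by the neutral class stays close to 0.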

Below we analyze a paragraph taken out of [this](https://www.economist.com/finance-and-economics/2019/01/03/a-profit-warning-from-apple-jolts-markets) article from The Economist. For comparison purposes, we also put the sentiments predicted with TextBlob.
> Later that day Apple said it was revising down its earnings expectations in the fourth quarter of 2018, largely because of lower sales and signs of economic weakness in China. The news rapidly infected financial markets. Apple’s share price fell by around 7% in after-hours trading and the decline was extended to more than 10% when the market opened. The dollar fell by 3.7% against the yen in a matter of minutes after the announcement, before rapidly recovering some ground. Asian stockmarkets closed down on January 3rd and European ones opened lower. Yields on government bonds fell as investors fled to the traditional haven in a market storm.

In [17]:
text = "Later that day Apple said it was revising down its earnings expectations in \
the fourth quarter of 2018, largely because of lower sales and signs of economic weakness in China. \
The news rapidly infected financial markets. Apple’s share price fell by around 7% in after-hours \
trading and the decline was extended to more than 10% when the market opened. The dollar fell \
by 3.7% against the yen in a matter of minutes after the announcement, before rapidly recovering \
some ground. Asian stockmarkets closed down on January 3rd and European ones opened lower. \
Yields on government bonds fell as investors fled to the traditional haven in a market storm."

In [18]:
cl_path = project_dir/'Models'/'classifier_model'/'finbert-sentiment'
model = AutoModelForSequenceClassification.from_pretrained(cl_path, cache_dir=None, num_labels=3)

In [19]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\eikde\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [20]:
result = predict(text,model)

08/11/2022 23:37:40 - INFO - root -   Using device: cpu 
08/11/2022 23:37:40 - INFO - finbert.utils -   *** Example ***
08/11/2022 23:37:40 - INFO - finbert.utils -   guid: 0
08/11/2022 23:37:40 - INFO - finbert.utils -   tokens: [CLS] later that day apple said it was rev ##ising down its earnings expectations in the fourth quarter of 2018 , largely because of lower sales and signs of economic weakness in china . [SEP]
08/11/2022 23:37:40 - INFO - finbert.utils -   input_ids: 101 2101 2008 2154 6207 2056 2009 2001 7065 9355 2091 2049 16565 10908 1999 1996 2959 4284 1997 2760 1010 4321 2138 1997 2896 4341 1998 5751 1997 3171 11251 1999 2859 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/11/2022 23:37:40 - INFO - finbert.utils -   attention_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/11/2022 23:37:40 - INFO - finbert.utils -   token_type_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

In [21]:
blob = TextBlob(text)
result['textblob_prediction'] = [sentence.sentiment.polarity for sentence in blob.sentences]
result

|   | sentence | logit | prediction | sentiment_score | textblob_prediction |
|---|---|---|---|---|---|
| 0 | Later that day Apple said it was revising down its earnings expectations in the fourth quarter of 2018, largely because of lower sales and signs of economic weakness in China. | [0.066904604, 0.5872885, 0.3458069] | negative | -0.520384 | 0.051746 |
| 1 | The news rapidly infected financial markets. | [0.090098254, 0.66293967, 0.24696203] | negative | -0.572841 | 0.0 |
| 2 | Apple’s share price fell by around 7% in after-hours trading and the decline was extended to more than 10% when the market opened. | [0.06670103, 0.7469901, 0.18630888] | negative | -0.680289 | 0.5 |
| 3 | The dollar fell by 3.7% against the yen in a matter of minutes after the announcement, before rapidly recovering some ground. | [0.10917396, 0.076080896, 0.8147452] | neutral | 0.033093 | 0.0 |
| 4 | Asian stockmarkets closed down on January 3rd and European ones opened lower. | [0.008379536, 0.3944989, 0.5971216] | neutral | -0.386119 | -0.051111 |
| 5 | Yields on government bonds fell as investors fled to the traditional haven in a market storm. | [0.19705588, 0.36530006, 0.4376441] | neutral | -0.168244 | 0.0 |


In [22]:
print(f'Average sentiment is {result.sentiment_score.mean():.2f}.')

Average sentiment is -0.38.


Here is another example:

In [23]:
text2 = "Shares in the spin-off of South African e-commerce group Naspers surged more than 25% \
in the first minutes of their market debut in Amsterdam on Wednesday. Bob van Dijk, CEO of \
Naspers and Prosus Group poses at Amsterdam's stock exchange, as Prosus begins trading on the \
Euronext stock exchange in Amsterdam, Netherlands, September 11, 2019. REUTERS/Piroschka van de Wouw \
Prosus comprises Naspers’ global empire of consumer internet assets, with the jewel in the crown a \
31% stake in Chinese tech titan Tencent. There is 'way more demand than is even available, so that’s \
good,' said the CEO of Euronext Amsterdam, Maurice van Tilburg. 'It’s going to be an interesting \
hour of trade after opening this morning.' Euronext had given an indicative price of 58.70 euros \
per share for Prosus, implying a market value of 95.3 billion euros ($105 billion). The shares \
jumped to 76 euros on opening and were trading at 75 euros at 0719 GMT."

In [24]:
result2 = predict(text2,model)
blob = TextBlob(text2)
result2['textblob_prediction'] = [sentence.sentiment.polarity for sentence in blob.sentences]

08/11/2022 23:37:44 - INFO - root -   Using device: cpu 
08/11/2022 23:37:44 - INFO - finbert.utils -   *** Example ***
08/11/2022 23:37:44 - INFO - finbert.utils -   guid: 0
08/11/2022 23:37:44 - INFO - finbert.utils -   tokens: [CLS] shares in the spin - off of south african e - commerce group nas ##pers surged more than 25 % in the first minutes of their market debut in amsterdam on wednesday . [SEP]
08/11/2022 23:37:44 - INFO - finbert.utils -   input_ids: 101 6661 1999 1996 6714 1011 2125 1997 2148 3060 1041 1011 6236 2177 17235 7347 18852 2062 2084 2423 1003 1999 1996 2034 2781 1997 2037 3006 2834 1999 7598 2006 9317 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/11/2022 23:37:44 - INFO - finbert.utils -   attention_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/11/2022 23:37:44 - INFO - finbert.utils -   token_type_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

In [25]:
result2

|   | sentence | logit | prediction | sentiment_score | textblob_prediction |
|---|---|---|---|---|---|
| 0 | Shares in the spin-off of South African e-commerce group Naspers surged more than 25% in the first minutes of their market debut in Amsterdam on Wednesday. | [0.04696756, 0.003628546, 0.9494039] | neutral | 0.043339 | 0.25 |
| 1 | Bob van Dijk, CEO of Naspers and Prosus Group poses at Amsterdam's stock exchange, as Prosus begins trading on the Euronext stock exchange in Amsterdam, Netherlands, September 11, 2019. | [0.30016178, 0.027376967, 0.6724612] | neutral | 0.272785 | 0.0 |
| 2 | REUTERS/Piroschka van de Wouw Prosus comprises Naspers’ global empire of consumer internet assets, with the jewel in the crown a 31% stake in Chinese tech titan Tencent. | [0.08334563, 0.004112478, 0.9125419] | neutral | 0.079233 | 0.0 |
| 3 | There is 'way more demand than is even available, so that’s good,' said the CEO of Euronext Amsterdam, Maurice van Tilburg. | [0.13846976, 0.0485448, 0.8129854] | neutral | 0.089925 | 0.533333 |
| 4 | 'It’s going to be an interesting hour of trade after opening this morning.' | [0.36207414, 0.0576006, 0.58032525] | neutral | 0.304474 | 0.5 |
| 5 | Euronext had given an indicative price of 58.70 euros per share for Prosus, implying a market value of 95.3 billion euros ($105 billion). | [0.067955956, 0.025940917, 0.9061031] | neutral | 0.042015 | 0.0 |
| 6 | The shares jumped to 76 euros on opening and were trading at 75 euros at 0719 GMT. | [0.21840131, 0.15594982, 0.62564886] | neutral | 0.062451 | 0.0 |


In [26]:
print(f'Average sentiment is {result2.sentiment_score.mean():.2f}.')

Average sentiment is 0.13.
