# 7CCMFM18 Machine Learning
King's College London <br>
Academic year 2023-2024 <br>
Lecturer: Mario Martone <br>
Tutor: Akmal Rafiq


Version Date: <i>8th April 2023</i>

You will need to install: 

1. SentenceTransformers
2. NLTK
3. Sklearn
4. english-words

In [None]:
!pip install sentence-transformers nltk scikit-learn english-words

First let's load in a few libraries:

In [1]:
import scipy.sparse.linalg
import sklearn
from sklearn.metrics import pairwise_distances

## NLP: sentence transformer

Then load our model (you can try comparing different models!):

In [2]:
from sentence_transformers import SentenceTransformer
model_name = 'all-distilroberta-v1'
model = SentenceTransformer(model_name)

You can find a list of pre-trained models here: https://www.sbert.net/docs/pretrained_models.html

Now let's embed two simple sentences:

In [3]:
sent1=model.encode('Mario studies hard')
sent2=model.encode('Mario goes to school')

**Cosine Similarity** is a measure used to gauge how similar two vectors are, irrespective of their size. Mathematically, it calculates the cosine of the angle between two vectors projected in a multi-dimensional space. This metric is widely used in various fields, including data analysis, natural language processing, and machine learning, particularly in systems involving text comparison.

The cosine similarity between two vectors is calculated by taking the dot product of the vectors and then dividing that by the product of the magnitudes (or lengths) of the vectors. 

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

This results in a value between -1 and 1, where 1 indicates that the vectors point in the same direction, 0 indicates that they are orthogonal (no similarity), and -1 indicates that they point in opposite directions.
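
As a quick check of the formula, here is a minimal pure-Python sketch that evaluates it on three pairs of 2D vectors, one for each of the reference cases just described. The helper name `cosine` and the toy vectors are illustrative assumptions, not part of the notebook.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel, different lengths
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # anti-parallel

print(same_direction, orthogonal, opposite)
```

Note that the first pair has different magnitudes but still scores (up to rounding) 1: only the direction matters.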

In the context of text analysis, vectors often represent word counts or tf-idf scores (which reflect how important a word is within a document in a collection of documents). Here the vectors are instead the embeddings produced by the sentence transformer. In either case, by calculating the cosine similarity between these vectors, it's possible to determine how similar the documents are in terms of their content.
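
To illustrate the word-count case, the sketch below builds count vectors for the example sentences with `collections.Counter` and compares them. The helpers `count_vector` and `cosine_counts` are hypothetical names introduced here, and the tokenisation (lower-casing and splitting on whitespace) is a deliberately crude assumption.

```python
import math
from collections import Counter

def count_vector(text):
    """Map a text to its word counts (lower-cased, split on whitespace)."""
    return Counter(text.lower().split())

def cosine_counts(c1, c2):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

doc_a = count_vector("Mario studies hard")
doc_b = count_vector("Mario goes to school")
doc_c = count_vector("Tomorrow the sky will be blue")

sim_ab = cosine_counts(doc_a, doc_b)  # the two sentences share only the word "mario"
sim_ac = cosine_counts(doc_a, doc_c)  # the two sentences share no words at all
print(sim_ab, sim_ac)
```

Count vectors only register shared tokens, so two sentences with related meaning but different wording score low. Sentence embeddings are meant to capture meaning rather than exact word overlap.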

Let's compute the similarity score:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [4]:
sklearn.metrics.pairwise.cosine_similarity([sent1,sent2])

array([[1.        , 0.76597095],
       [0.76597095, 0.99999994]], dtype=float32)

The off-diagonal terms are the ones that matter, and at about 0.77 they are indeed fairly close to 1! Now let's see what happens with a sentence that has a different meaning:

In [5]:
sent3=model.encode('Tomorrow the sky will be blue')
sklearn.metrics.pairwise.cosine_similarity([sent1,sent2,sent3])

array([[1.        , 0.76597095, 0.0752472 ],
       [0.76597095, 0.99999994, 0.14101554],
       [0.0752472 , 0.14101554, 0.99999976]], dtype=float32)

And you can see that the similarity of sentence 3 with sentences 1 and 2 is far smaller.

### Look for similarities:

Now let's play with a large English vocabulary: we shall grab a sample word list from the english_words package.

In [None]:
from english_words import english_words_alpha_set

In [6]:
len(list(english_words_alpha_set))

25474

In [7]:
list(english_words_alpha_set)[:10]

['gneiss',
 'Dobbs',
 'boil',
 'dogwood',
 'Ainu',
 'parsimonious',
 'Coriolanus',
 'jenny',
 'Sicily',
 'gravid']

Let's compute the similarity of every word in the list with the word football:

In [9]:
similarity={}
encode_football=model.encode('football')

for word in list(english_words_alpha_set):
    word_encode=model.encode(word)
    similarity[word]=sklearn.metrics.pairwise.cosine_similarity([encode_football,word_encode])[0,1]
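
The loop above calls the model once per word, which is slow for a list of about 25,000 words. `model.encode` also accepts a list of strings and returns one embedding per row, so the scoring can be done with a single matrix-vector product. The sketch below shows that scoring step on toy 2D vectors standing in for real embeddings; `rank_by_cosine` is a hypothetical helper and the vector values are made up.

```python
import numpy as np

def rank_by_cosine(query_vec, word_vecs, words):
    """Return (word, similarity) pairs sorted from most to least similar."""
    query = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(word_vecs, dtype=float)
    # Normalise to unit length so that a dot product equals the cosine.
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = matrix @ query
    order = np.argsort(-scores)  # indices from highest to lowest score
    return [(words[i], float(scores[i])) for i in order]

# Toy 2D vectors standing in for real embeddings (assumed values).
words = ["soccer", "ballet", "gneiss"]
word_vecs = [[0.9, 0.1], [0.5, 0.5], [-0.2, 1.0]]
ranking = rank_by_cosine([1.0, 0.0], word_vecs, words)
print(ranking)
```

With the notebook's objects, a call along the lines of `rank_by_cosine(encode_football, model.encode(word_list), word_list)`, with `word_list = list(english_words_alpha_set)`, should produce the same ranking as the loop.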

And now let's check the words that are the most similar to football:

In [10]:
sorted(similarity.items(), key=lambda item: item[1],reverse=True)

[('football', 0.99999976),
 ('soccer', 0.86900795),
 ('basketball', 0.81611115),
 ('sport', 0.70091105),
 ('hockey', 0.6746446),
 ('cricket', 0.67358565),
 ('baseball', 0.670014),
 ('volleyball', 0.6460309),
 ('softball', 0.6318909),
 ('sportswriting', 0.6087425),
 ('ball', 0.60338104),
 ('volley', 0.58906734),
 ('tennis', 0.5827517),
 ('lacrosse', 0.5714959),
 ('athlete', 0.57062495),
 ('golf', 0.5705202),
 ('chess', 0.5621939),
 ('athletic', 0.55184615),
 ('snowball', 0.5504932),
 ('sportsmen', 0.54786897),
 ('badminton', 0.54423267),
 ('ballet', 0.5431351),
 ('pong', 0.5393827),
 ('madden', 0.5365356),
 ('knuckleball', 0.53595364),
 ('polo', 0.52711654),
 ('jockey', 0.5201056),
 ('sporty', 0.5108114),
 ('television', 0.50035995),
 ('karate', 0.499893),
 ('sportswriter', 0.4973939),
 ('skate', 0.4969545),
 ('quarterback', 0.49339512),
 ('sportsman', 0.49299866),
 ('touchdown', 0.4866844),
 ('Olympic', 0.4849062),
 ('sadden', 0.48392475),
 ('hooligan', 0.48306835),
 ('wrestle', 0.4815

### Now fine-tune the model:

Notice that two words have surprising similarity scores with football: 'stadium' scores unexpectedly low, while 'ballet' scores unexpectedly high:

In [13]:
stadium_emb=model.encode('stadium')
ballet_emb=model.encode('ballet')

print(sklearn.metrics.pairwise.cosine_similarity([encode_football,stadium_emb])[0,1])
print(sklearn.metrics.pairwise.cosine_similarity([encode_football,ballet_emb])[0,1])

0.4538225
0.5431351


So let's fine-tune the model to increase the similarity of stadium and decrease that of ballet:
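
The fine-tuning cell further down uses `losses.CosineSimilarityLoss`, which by default penalises the squared difference between the cosine similarity of the two embeddings and the target label. Here is a minimal sketch of that objective in plain Python. The helper `cosine_similarity_loss` is a hypothetical name; the inputs are the pre-fine-tuning scores printed above and the target labels used in the training examples.

```python
def cosine_similarity_loss(similarities, targets):
    """Mean squared error between predicted cosine similarities and target scores."""
    squared_errors = [(s - t) ** 2 for s, t in zip(similarities, targets)]
    return sum(squared_errors) / len(squared_errors)

# Similarities of 'football' with 'stadium' and 'ballet' before fine-tuning.
before = [0.4538225, 0.5431351]
# Target scores: push 'stadium' up towards 0.7 and 'ballet' down towards 0.1.
targets = [0.7, 0.1]

loss_before = cosine_similarity_loss(before, targets)
loss_at_target = cosine_similarity_loss(targets, targets)
print(loss_before, loss_at_target)
```

Training lowers this quantity by moving the embeddings, so the loss reaches zero only when both similarities equal their targets.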

In [None]:
from torch.utils.data import DataLoader
from sentence_transformers import  InputExample, losses

In [14]:
# Fine-tune the model: adjust embeddings for specific words to modify their similarity scores
# Define the model, either from scratch or by loading a pre-trained model
model = SentenceTransformer('all-distilroberta-v1')

#Define your train examples. You need more than just two examples...
train_examples = [InputExample(texts=['football', 'ballet'], label=0.1),
    InputExample(texts=['football', 'stadium'], label=0.7)]

#Define your train dataset, the dataloader and the train loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# Fine-tune the model with the specified training objectives
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=10, warmup_steps=100)

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

After fine-tuning, check the new similarity scores between 'football' and 'stadium'/'ballet':

In [15]:
# Re-encode all three words with the fine-tuned model
encode_football=model.encode('football')
stadium_emb=model.encode('stadium')
ballet_emb=model.encode('ballet')

print(sklearn.metrics.pairwise.cosine_similarity([encode_football,stadium_emb])[0,1])
print(sklearn.metrics.pairwise.cosine_similarity([encode_football,ballet_emb])[0,1])

0.48246533
0.45706216



## Sentiment analysis in finance
First version: <i>29th March 2023</i>

FinBERT is a BERT model pre-trained on financial text. It has been fine-tuned on several financial NLP tasks, on all of which it outperforms traditional machine learning models, deep learning models, and fine-tuned BERT models.

All the fine-tuned FinBERT models are publicly hosted on Hugging Face 🤗. Here we will look at two specific instances:
- **FinBERT-Sentiment**: for sentiment classification task
- **FinBERT-FLS**: for forward-looking statement (FLS) classification task

## Import the transformers:

First import the required classes from the transformers library:

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification, pipeline

In [None]:
# tested in transformers==4.18.0 
import transformers
transformers.__version__

'4.26.1'

## Sentiment Analysis
Analyzing financial text sentiment is valuable as it can capture the views and opinions of managers, information intermediaries and investors. FinBERT-Sentiment is a FinBERT model fine-tuned on 10,000 manually annotated sentences from analyst reports of S&P 500 firms.

**Input**: A financial text.

**Output**: Positive, Neutral or Negative.

Import FinBERT:

In [None]:
finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone',num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')

And let's see a small demo:

In [None]:
# Demonstrate FinBERT's ability to classify financial texts into Positive, Neutral, or Negative sentiments
nlp = pipeline("text-classification", model=finbert, tokenizer=tokenizer)
results = nlp(['growth is strong and we have plenty of liquidity.',
               'there is a shortage of capital, and we need extra financing.',
               'formulation patents might protect Vasotec to a limited extent.'])

In [None]:
results

[{'label': 'Positive', 'score': 1.0},
 {'label': 'Negative', 'score': 0.9952379465103149},
 {'label': 'Neutral', 'score': 0.9979718327522278}]
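
The `score` attached to each label is, under the pipeline's default behaviour for a single-label classifier, the softmax probability of the winning class. The sketch below reproduces that step on made-up logits. The label order and the logit values are assumptions chosen for illustration, not values read from FinBERT.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

label_names = ["Neutral", "Positive", "Negative"]  # assumed order, for illustration only
logits = [0.5, 4.0, -1.0]                          # made-up model outputs
probs = softmax(logits)

# Pick the class with the highest probability, as the pipeline does by default.
best = max(range(len(label_names)), key=lambda i: probs[i])
prediction = {"label": label_names[best], "score": probs[best]}
print(prediction)
```

A score close to 1 therefore means the model puts almost all of its probability mass on a single class.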

For the rest of the homework:

1. Download and import the sentiment_analysis dataset.
2. Using FinBERT compute the label for each entry of the dataset.
3. Compute the F1 score, using as y_true the labels that come with the dataset and as y_pred the predictions from FinBERT.

## Import dataset:

Let's import pandas to manage our dataset as well as the f1 score from sklearn

In [None]:
import pandas as pd
from sklearn.metrics import f1_score

To run the lines below, you should replace "dataset_dir" with the name of the folder you have downloaded the dataset in and "dataset_file" with the name of the file (default is sentiment_analysis.txt).

In [None]:
dataset_dir = 'NLP_finance/'
dataset_file = 'sentiment_analysis.txt'
# Each line has the form "sentence@label"; a plain '@' separator needs no backslash
finance_df = pd.read_csv(dataset_dir+dataset_file,
                     sep='@',
                     header=None,
                     names=['sentence','label'])

The dataset is now imported as a pandas DataFrame, which is an extremely handy format!

In [None]:
finance_df

| | sentence | label |
|---|---|---|
| 0 | According to Gran , the company has no plans t... | neutral |
| 1 | With the new production plant the company woul... | positive |
| 2 | For the last quarter of 2010 , Componenta 's n... | positive |
| 3 | In the third quarter of 2010 , net sales incre... | positive |
| 4 | Operating profit rose to EUR 13.1 mn from EUR ... | positive |
| ... | ... | ... |
| 3448 | Operating result for the 12-month period decre... | negative |
| 3449 | HELSINKI Thomson Financial - Shares in Cargote... | negative |
| 3450 | LONDON MarketWatch -- Share prices ended lower... | negative |
| 3451 | Operating profit fell to EUR 35.4 mn from EUR ... | negative |


Now compute the predictions:

In [None]:
nlp = pipeline("text-classification", model=finbert, tokenizer=tokenizer)
results = nlp(finance_df['sentence'].tolist())
y_pred = [item['label'].lower() for item in results]
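
Passing all sentences in one call works, but with a few thousand sentences it can help to process them in chunks, for instance to report progress or to limit memory use. The sketch below shows one way to do that. Both `predict_in_chunks` and `fake_classifier` are hypothetical names, and the stand-in classifier only mimics the list-of-dicts shape that `nlp(...)` returns.

```python
def predict_in_chunks(classifier, sentences, chunk_size=64):
    """Apply `classifier` to successive slices of `sentences` and collect lower-cased labels."""
    labels = []
    for start in range(0, len(sentences), chunk_size):
        chunk = sentences[start:start + chunk_size]
        labels.extend(item["label"].lower() for item in classifier(chunk))
    return labels

def fake_classifier(texts):
    """Stand-in for the pipeline: same output shape, trivial keyword rule."""
    return [{"label": "Positive" if "rose" in t else "Neutral", "score": 1.0}
            for t in texts]

demo = ["Operating profit rose.", "The company has no plans.", "Sales rose sharply."]
demo_pred = predict_in_chunks(fake_classifier, demo, chunk_size=2)
print(demo_pred)
```

With the real pipeline, the equivalent call would be `predict_in_chunks(nlp, finance_df['sentence'].tolist())`.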

Now let's also collect the list of labels as well as the true values from the dataset:

In [None]:
labels = list(set(finance_df['label'].tolist()))
y_true = finance_df['label'].tolist()

And finally compute the f1 score:

In [None]:
f1_score(y_true, y_pred,labels=labels,average='macro')

0.8473605746028382

A macro-averaged F1 score of about 85% is pretty remarkable!
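
For reference, `average='macro'` means the F1 score is computed separately for each class and the per-class scores are then averaged with equal weight. The sketch below spells this out in plain Python on a toy example. `macro_f1` is a hypothetical helper, and the toy labels are made up rather than taken from the dataset.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (assumed values, not taken from the dataset).
toy_true = ["positive", "positive", "neutral", "negative"]
toy_pred = ["positive", "neutral", "neutral", "negative"]
toy_score = macro_f1(toy_true, toy_pred, ["positive", "neutral", "negative"])
print(toy_score)
```

Because every class counts equally, a rare class that is predicted poorly lowers the macro average as much as a frequent one would.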

## FLS-Classification
Forward-looking statements (FLS) inform investors of managers' beliefs and opinions about a firm's future events or results. Identifying forward-looking statements in corporate reports can assist investors in financial analysis. FinBERT-FLS is a FinBERT model fine-tuned on 3,500 manually annotated sentences from the Management Discussion and Analysis sections of annual reports of Russell 3000 firms.

**Input**: A financial text.

**Output**: Specific FLS, Non-specific FLS, or Not FLS.

In [None]:
finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-fls',num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-fls')

In [None]:
nlp = pipeline("text-classification", model=finbert, tokenizer=tokenizer)
results = nlp(['we expect the age of our fleet to enhance availability and reliability due to reduced downtime for repairs.',
               'on an equivalent unit of production basis, general and administrative expenses declined 24 percent from 1994 to $.67 per boe.',
               'we will continue to assess the need for a valuation allowance against deferred tax assets considering all available evidence obtained in future reporting periods.'])

In [None]:
results

[{'label': 'Specific FLS', 'score': 0.77278733253479},
 {'label': 'Not FLS', 'score': 0.9905241131782532},
 {'label': 'Non-specific FLS', 'score': 0.975904107093811}]