<a href="https://colab.research.google.com/github/chngchngchng/FinTech-Project-2/blob/main/BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1


In [2]:
!pip install pandas

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import transformers
import pandas as pd
import numpy

In [4]:
tokeniser = transformers.BertTokenizer.from_pretrained("bert-base-uncased")



In [5]:
sample_text = "When was I last outside? I have been stuck at home for what feels like 10 years."
tokens = tokeniser.tokenize(sample_text)
print(tokens)

['when', 'was', 'i', 'last', 'outside', '?', 'i', 'have', 'been', 'stuck', 'at', 'home', 'for', 'what', 'feels', 'like', '10', 'years', '.']


In [6]:
token_ids = tokeniser.convert_tokens_to_ids(tokens)
print(token_ids)

[2043, 2001, 1045, 2197, 2648, 1029, 1045, 2031, 2042, 5881, 2012, 2188, 2005, 2054, 5683, 2066, 2184, 2086, 1012]


In [7]:
print(f"Sentence: {sample_text}")
print(f"Tokens: {tokens}")
print(f"Length of tokens: {len(tokens)}")
print(f"Token IDs: {token_ids}")
print(f"Length of token IDs: {len(token_ids)}")

Sentence: When was I last outside? I have been stuck at home for what feels like 10 years.
Tokens: ['when', 'was', 'i', 'last', 'outside', '?', 'i', 'have', 'been', 'stuck', 'at', 'home', 'for', 'what', 'feels', 'like', '10', 'years', '.']
Length of tokens: 19
Token IDs: [2043, 2001, 1045, 2197, 2648, 1029, 1045, 2031, 2042, 5881, 2012, 2188, 2005, 2054, 5683, 2066, 2184, 2086, 1012]
Length of token IDs: 19


# Special Tokens

`BERT` uses a few special tokens to mark the structure of its input, such as where a sentence or a line of text starts and ends.

First, we have the separator token, which marks the end of a sentence and separates the two sentences of a pair:

In [8]:
tokeniser.sep_token, tokeniser.sep_token_id

('[SEP]', 102)

Next, we have the classification token, which is placed at the start of every input sequence. Its final hidden state serves as a summary of the whole sequence, and it is what the classification head reads in a sequence classification task:

In [9]:
tokeniser.cls_token, tokeniser.cls_token_id

('[CLS]', 101)

We also have the padding token, which fills up shorter sequences so that every sequence in a batch has the same length:

In [10]:
tokeniser.pad_token, tokeniser.pad_token_id

('[PAD]', 0)

There is also a token that stands in for any word or character that is not in the vocabulary:

In [11]:
tokeniser.unk_token, tokeniser.unk_token_id

('[UNK]', 100)
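To see when `[UNK]` shows up, here is a rough sketch of the greedy longest-match-first splitting that WordPiece tokenisers are based on. The vocabulary `TOY_VOCAB` and the helper `wordpiece` are made up for illustration; the real `BertTokenizer` also lowercases, splits punctuation and works with a vocabulary of roughly 30,000 entries.

```python
# Toy vocabulary (hypothetical); continuation pieces carry a "##" prefix.
TOY_VOCAB = {"stuck", "at", "home", "play", "##ing", "##ed", "[UNK]"}

def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into sub-word pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("home", TOY_VOCAB))
print(wordpiece("playing", TOY_VOCAB))
print(wordpiece("zebra", TOY_VOCAB))
```

A word that cannot be covered by any pieces from the vocabulary collapses into a single `[UNK]`, which is why a larger vocabulary with many sub-word pieces rarely needs it.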

In [12]:
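# padding = True pads to the longest sequence in the batch, so a single
# sentence is left unpadded; padding = "max_length" would pad up to 32
encoding = tokeniser.encode_plus(
    sample_text,
    max_length = 32,
    truncation = True,            # clip anything longer than max_length
    add_special_tokens = True,    # add [CLS] and [SEP]
    padding = True,
    return_attention_mask = True,
    return_token_type_ids = False,
    return_tensors = "pt"         # return PyTorch tensors
)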

encoding.keys()



dict_keys(['input_ids', 'attention_mask'])

In [13]:
encoding['input_ids']

tensor([[ 101, 2043, 2001, 1045, 2197, 2648, 1029, 1045, 2031, 2042, 5881, 2012,
         2188, 2005, 2054, 5683, 2066, 2184, 2086, 1012,  102]])
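The tensor above holds 21 IDs: the 19 token IDs from earlier, wrapped in `[CLS]` (101) and `[SEP]` (102). As a hand-rolled sketch of what padding to a fixed length would look like, the helper below (`build_inputs`, a hypothetical name and not the library's implementation) builds the padded IDs and the matching attention mask with plain lists.

```python
# Special-token IDs as reported by the tokeniser cells above.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

# Token IDs of the sample sentence, copied from the notebook output.
token_ids = [2043, 2001, 1045, 2197, 2648, 1029, 1045, 2031, 2042, 5881,
             2012, 2188, 2005, 2054, 5683, 2066, 2184, 2086, 1012]

def build_inputs(ids, max_length):
    """Wrap ids in [CLS] ... [SEP], then truncate or pad to max_length."""
    body = ids[: max_length - 2]              # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 marks a real token
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad             # 0 marks padding
    return input_ids, attention_mask

input_ids, attention_mask = build_inputs(token_ids, max_length=32)
print(input_ids)
print(attention_mask)
```

The attention mask is what tells the model to ignore the padded positions, which is why it is worth requesting alongside the IDs.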

### Choosing Sequence Lengths

In [14]:
'''
import seaborn as sns

token_lens = []

for txt in df.content: # assuming we have a dataframe with a content column (string datatype)
  tokens = tokeniser.encode(txt, max_length = 512, truncation = True) # 512 is the most bert-base-uncased accepts
  token_lens.append(len(tokens))

# Get max length of tokens
print(max(token_lens))

# OR plot the distribution of lengths, and then pick a suitable value
sns.histplot(token_lens); # distplot is deprecated in recent seaborn releases

# a value below the maximum clips the longest texts, but lets the model train faster
'''
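Instead of reading a value off the plot, one option is to pick the length that covers a chosen share of the texts. This is only a sketch: the lengths in `token_lens` are invented, and `length_at_percentile` is a hypothetical helper using the nearest-rank percentile.

```python
import math

# Hypothetical token counts per text; in practice these would come from
# len(tokeniser.encode(text)) for every row of the dataset.
token_lens = [12, 18, 25, 31, 40, 44, 52, 67, 90, 310]

def length_at_percentile(lengths, pct):
    """Smallest length that covers at least pct percent of the texts."""
    ordered = sorted(lengths)
    rank = math.ceil(pct * len(ordered) / 100)   # nearest-rank percentile
    return ordered[max(rank, 1) - 1]

max_len = length_at_percentile(token_lens, 90)
clipped = sum(n > max_len for n in token_lens)
print(f"max_len={max_len}, texts clipped={clipped}")
```

With a long-tailed distribution like this one, a single outlier would otherwise force every sequence to be padded to its length.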

In [15]:
# Let's actually try to create the model now

from google.colab import files
uploaded = files.upload()

Saving stock_data.csv to stock_data.csv


In [16]:
import io
from transformers import BertTokenizer, BertForSequenceClassification, pipeline

df = pd.read_csv(io.BytesIO(uploaded['stock_data.csv']))
df


# For this project, we'll use a pretrained FinBERT model and its tokeniser
PRETRAINED_MODEL_NAME = 'yiyanghkust/finbert-tone'
finbert = BertForSequenceClassification.from_pretrained(PRETRAINED_MODEL_NAME, num_labels = 3)
tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)


In [17]:
# Create pipeline for sentiment analysis
nlp = pipeline("sentiment-analysis", model = finbert, tokenizer = tokenizer)

In [19]:
def generate_labels(value):
  if value == 1:
    return "positive"
  elif value == 0:
    return "neutral"
  return "negative"

In [22]:
df['Labels'] = df['Sentiment'].apply(generate_labels)
df

| | Text | Sentiment | Labels |
|---|---|---|---|
| 0 | Kickers on my watchlist XIDE TIT SOQ PNK CPW B... | 1 | positive |
| 1 | user: AAP MOVIE. 55% return for the FEA/GEED i... | 1 | positive |
| 2 | user I'd be afraid to short AMZN - they are lo... | 1 | positive |
| 3 | MNTA Over 12.00 | 1 | positive |
| 4 | OI Over 21.37 | 1 | positive |
| ... | ... | ... | ... |
| 5786 | Industry body CII said #discoms are likely to ... | -1 | negative |
| 5787 | #Gold prices slip below Rs 46,000 as #investor... | -1 | negative |
| 5788 | Workers at Bajaj Auto have agreed to a 10% wag... | 1 | positive |
| 5789 | #Sharemarket LIVE: Sensex off day’s high, up 6... | 1 | positive |


In [28]:
results = nlp(df['Text'].tolist())
print(results)

[{'label': 'Neutral', 'score': 0.9999955892562866}, {'label': 'Neutral', 'score': 0.9996181726455688}, {'label': 'Neutral', 'score': 0.998150110244751}, {'label': 'Neutral', 'score': 0.9243690967559814}, {'label': 'Neutral', 'score': 0.513309121131897}, {'label': 'Neutral', 'score': 0.6940017938613892}, {'label': 'Negative', 'score': 0.8330612182617188}, {'label': 'Negative', 'score': 0.99880051612854}, {'label': 'Neutral', 'score': 0.9998936653137207}, {'label': 'Neutral', 'score': 0.9999995231628418}, {'label': 'Neutral', 'score': 0.8573377728462219}, {'label': 'Negative', 'score': 0.999962568283081}, {'label': 'Positive', 'score': 0.9961819648742676}, {'label': 'Neutral', 'score': 0.8869619965553284}, {'label': 'Positive', 'score': 0.9999998807907104}, {'label': 'Neutral', 'score': 0.9999772310256958}, {'label': 'Neutral', 'score': 0.9999884366989136}, {'label': 'Neutral', 'score': 0.9360244274139404}, {'label': 'Neutral', 'score': 0.9698817133903503}, {'label': 'Neutral', 'score': 

In [30]:
labels = list(map(lambda x : 0 if x['label'] == "Neutral" else (1 if x['label'] == "Positive" else -1), results))
df['Predicted'] = labels
df

| | Text | Sentiment | Labels | Predicted |
|---|---|---|---|---|
| 0 | Kickers on my watchlist XIDE TIT SOQ PNK CPW B... | 1 | positive | 0 |
| 1 | user: AAP MOVIE. 55% return for the FEA/GEED i... | 1 | positive | 0 |
| 2 | user I'd be afraid to short AMZN - they are lo... | 1 | positive | 0 |
| 3 | MNTA Over 12.00 | 1 | positive | 0 |
| 4 | OI Over 21.37 | 1 | positive | 0 |
| ... | ... | ... | ... | ... |
| 5786 | Industry body CII said #discoms are likely to ... | -1 | negative | -1 |
| 5787 | #Gold prices slip below Rs 46,000 as #investor... | -1 | negative | -1 |
| 5788 | Workers at Bajaj Auto have agreed to a 10% wag... | 1 | positive | 0 |
| 5789 | #Sharemarket LIVE: Sensex off day’s high, up 6... | 1 | positive | 0 |
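The conversion from pipeline labels to integers can also be written as a dictionary lookup. This is a small sketch of that alternative, with a hypothetical helper `to_int_labels` and a few hand-written entries in the shape the pipeline returns (scores rounded for brevity).

```python
# Mapping from the pipeline's label strings to the dataset's integer codes.
LABEL_TO_INT = {"Positive": 1, "Neutral": 0, "Negative": -1}

def to_int_labels(results):
    """Convert pipeline outputs to integer labels; unknown labels raise KeyError."""
    return [LABEL_TO_INT[r["label"]] for r in results]

# A few entries in the shape the pipeline returns.
sample_results = [
    {"label": "Neutral", "score": 0.99},
    {"label": "Negative", "score": 0.83},
    {"label": "Positive", "score": 0.99},
]
print(to_int_labels(sample_results))
```

Unlike a chained conditional that sends every unrecognised label to -1, the lookup fails loudly if the model ever returns a label that was not anticipated.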


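# Precision and Recall
$precision = \frac{number\ of\ true\ positives}{number\ of\ true\ positives\ +\ number\ of\ false\ positives}$

We'll first look at precision for the positive class, which is defined by the formula above: of all the rows the model predicted as positive, it is the fraction that are actually positive.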

In [32]:
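# Rows that are truly positive and were also predicted positive
true_positives = df[(df['Sentiment'] == 1) & (df['Predicted'] == 1)]
# count() returns one value per column, so the same ratio is printed for each column
precision = true_positives.count() / df[df['Predicted'] == 1].count()
print(precision)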

Text         0.874262
Sentiment    0.874262
Labels       0.874262
Predicted    0.874262
dtype: float64


Next, let's look at recall (also called the true positive rate), which is defined by the following formula:

$recall = \frac{number\ of\ true\ positives}{number\ of\ true\ positives\ +\ number\ of\ false\ negatives}$

In [33]:
# Rows that are truly positive but were predicted as neutral or negative
false_negatives = df[(df['Sentiment'] == 1) & (df['Predicted'] != 1)]
recall = true_positives.count() / (true_positives.count() + false_negatives.count())
print(recall)

Text         0.28114
Sentiment    0.28114
Labels       0.28114
Predicted    0.28114
dtype: float64
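Because `count()` works column by column, the cells above print the same ratio once per column. As a sketch of how to get a single number for each metric, the helper below computes precision and recall for one class from two plain lists. The labels are invented and `precision_recall` is a hypothetical helper; with the notebook's DataFrame, calling `len(...)` on the filtered frames in place of `.count()` gives scalars in the same way.

```python
# Hypothetical gold labels and predictions, using the notebook's coding:
# 1 = positive, 0 = neutral, -1 = negative.
y_true = [1, 1, 1, 1, -1, -1, 1, -1]
y_pred = [1, 0, 0, 1, -1, 1, 0, 0]

def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for one class as plain floats."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy data the model is usually right when it says "positive" but misses most of the positive rows, which is the same pattern of high precision and low recall seen in the outputs above.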


# Conclusion

Let's look at the overall accuracy of the model to gain some insights into its performance.

In [36]:
# Number of rows where the prediction matches the label
correct = df[df['Sentiment'] == df['Predicted']].count()
accuracy = correct / df.shape[0]
print(accuracy)

Text         0.25436
Sentiment    0.25436
Labels       0.25436
Predicted    0.25436
dtype: float64
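A single accuracy figure does not show where the errors come from. In the rows displayed earlier, the labels are 1 or -1 while many predictions are 0, so a breakdown by (true, predicted) pair is a natural next step. The sketch below does this with `collections.Counter` on invented labels; it is not run on the notebook's data.

```python
from collections import Counter

# Hypothetical labels in the notebook's coding: 1, 0, -1.
y_true = [1, 1, 1, 1, -1, -1, 1, -1]
y_pred = [1, 0, 0, 1, -1, 1, 0, 0]

# Count each (true, predicted) pair; this is a confusion matrix in dict form.
confusion = Counter(zip(y_true, y_pred))

# Accuracy is the share of pairs where the two labels agree.
correct = sum(n for (t, p), n in confusion.items() if t == p)
accuracy = correct / len(y_true)

for (t, p), n in sorted(confusion.items()):
    print(f"true={t:>2} predicted={p:>2} count={n}")
print(f"accuracy={accuracy:.3f}")
```

If most of the mass sits in the "predicted 0" pairs, the low accuracy is driven by neutral predictions on rows that the dataset labels as positive or negative.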

