<a href="https://colab.research.google.com/github/dleegithub/bb/blob/main/5_llm_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Instructions for Colab:
# 1. First CHANGE RUNTIME TYPE to GPU
# 2. Run install commands
# 3. You might need to RESTART RUNTIME
# 4. Run the rest of the cells below

In [1]:
%%capture

# install required libraries
!pip3 install transformers[torch]                  # HuggingFace library for interacting with BERT (and multiple other models)
!pip3 install datasets                      # HuggingFace library to process dataframes
!pip3 install sentence-transformers         # library to use Sentence Similarity BERT
!pip3 install bertviz                       # visualize BERT's attention weights
!pip3 install annoy                         # Spotify's library for finding nearest neighbours
!pip3 install ipywidgets

In [None]:
# (COLAB) you might need to restart RUNTIME after installing packages!

In [2]:
# import libraries
import gdown
import pandas as pd
import numpy as np
import random
from tqdm.auto import tqdm
import seaborn as sns
import matplotlib.pyplot as plt
import torch

from transformers import AutoModel, BertModel, AutoTokenizer, BertForSequenceClassification, pipeline, TrainingArguments, Trainer, utils
from transformers.pipelines.base import KeyDataset
from datasets import load_dataset, Dataset, DatasetDict   # load_metric is not used here and was removed from recent datasets releases

from gensim.models import Word2Vec
import gensim.downloader as api

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

from google.colab import output
output.enable_custom_widget_manager()

# test GPU
print(f"GPU: {torch.cuda.is_available()}")

GPU: True


In [3]:
# define dictionary with paths to data in Google Drive
urls_dict = {"10k_sent_2019":        ("https://drive.google.com/uc?id=17PQbZ6EotMxyhpt2Laqqh9z-EKUbwDhX", "parquet"),
             "covariates_2019":      ("https://drive.google.com/uc?id=1ELRq69FOiFvNpSvXOijGeGKZB5DiNXt4", "csv"),
            }

In [4]:
# download all files
for file_name, attributes in urls_dict.items():
    url = attributes[0]
    extension = attributes[1]
    gdown.download(url, f"./{file_name}.{extension}", quiet=False)

Downloading...
From: https://drive.google.com/uc?id=17PQbZ6EotMxyhpt2Laqqh9z-EKUbwDhX
To: /content/10k_sent_2019.parquet
100%|██████████| 162M/162M [00:01<00:00, 84.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=1ELRq69FOiFvNpSvXOijGeGKZB5DiNXt4
To: /content/covariates_2019.csv
100%|██████████| 428k/428k [00:00<00:00, 90.4MB/s]


# 1. Load and prepare the data

This tutorial uses text data from the **10-K reports** filed by publicly traded firms in the U.S. in 2019. 10-K reports are a rich source of data because firms include information about their organizational structure, financial performance and risk factors. We will use a version of the data where the risk factors section of each report has been split into sentences and each sentence has been assigned an ID that combines the firm identifier (i.e. **CIK**) and a sentence number. The raw data has a total of 1,744,131 sentences for 4,033 firms.

More on the 10-K reports [here](https://www.investor.gov/introduction-investing/getting-started/researching-investments/how-read-10-k).

In [5]:
# read data
df = pd.read_parquet("10k_sent_2019.parquet")

In [6]:
# firm-level additional data
covariates = pd.read_csv("covariates_2019.csv")

In [7]:
# we will choose some specific firms of interest using their TIC identifiers
tics = ["AAPL", "GOOGL", "TWTR", "ORCL", "TSLA", "GM", "F", "BAC",
        "COF", "JPM", "AXP", "HBC2", "TGT", "M", "WMT", "COST",
        "BNED", "DIS", "FOXA", "ADSK", "CAT", "BA"]

# subset covariates
covariates_focus = covariates.loc[covariates["tic"].isin(tics)]
covariates_focus = covariates_focus.groupby("tic", as_index=False).max()


In [8]:
# generate a dictionary mapping from CIK to name
cik2name = {row["cik"]: row["conm"] for i, row in covariates_focus.iterrows()}

# select the 10-K reports from the chosen firms only
df = df.loc[df["cik"].isin(covariates_focus["cik"])]
df.reset_index(drop=True, inplace=True)

In [9]:
# merge each sentence with the NAICS2 code and name from its corresponding firm
df = pd.merge(df, covariates_focus[["cik", "at","emp","naics2", "naics2_name"]], how="left", on="cik")


In [10]:
# drop empty sentences or sentences with very few words
min_words = 3
df["sentence_len"] = df["sentences"].apply(lambda x: len(x.split()))
df["keep_sent"] = df["sentence_len"].apply(lambda x: x > min_words)
df = df.loc[df["keep_sent"]]
df.reset_index(drop=True, inplace=True)


In [11]:
# save the dataset
df.to_parquet("10k_sent_2019_firms.parquet", index=False)

# 2. Accessing BERT through HuggingFace

### HuggingFace

We will use the ```transformers``` library developed by HuggingFace to access and interact with BERT. This library provides convenient classes (e.g. ```Tokenizer```, ```Model```, ```Pipeline```) that make it easy to pass our text through BERT (or any other transformer model we wish).

> As a starting point, we will use a basic version of the original BERT model in English that is not case sensitive. We access this model with the name ```bert-base-uncased```. You can read more about the model [here](https://huggingface.co/bert-base-uncased).

> Through the [Model Hub](https://huggingface.co/models) you can browse all the models currently hosted by HuggingFace. There you will find other types of language models and many more languages (including [multilingual models](https://huggingface.co/bert-base-multilingual-cased) and [models in Spanish](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased)).


### Text tokenization

We will start by using the ```AutoTokenizer``` class to load the tokenizer from ```bert-base-uncased```. BERT's tokenizer was trained on English Wikipedia and the BookCorpus and has a vocabulary of 30,522 unique tokens.

In [12]:
# load a tokenizer using the name of the model we want to use
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


Passing a list of sequences to the tokenizer object will apply the following steps to each sequence:

1. Break down the sequence into individual tokens that are part of BERT's vocabulary
2. Transform tokens into their ids
3. Add special tokens
4. Apply truncation and padding (optional)
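
The four steps above can be sketched with a toy WordPiece-style tokenizer. This is only an illustration: the vocabulary below is hypothetical and tiny (the ids do not match those of ```bert-base-uncased```), ```wordpiece``` and ```toy_encode``` are made-up helper names that are not part of ```transformers```, and words are split on white space only, which is simpler than what the real tokenizer does.

```python
# Hypothetical miniature vocabulary; the real BERT vocabulary and ids differ.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "risk": 4, "factor": 5, "##s": 6, "may": 7, "change": 8}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word tokens."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                      # no piece matches: the whole word is unknown
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

def toy_encode(text, vocab, max_length=8):
    # step 1: break the sequence into tokens that are part of the vocabulary
    tokens = [p for w in text.lower().split() for p in wordpiece(w, vocab)]
    # step 4 (truncation): leave room for the two special tokens
    tokens = tokens[:max_length - 2]
    # step 3: add special tokens
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # step 2: transform tokens into their ids
    input_ids = [vocab[t] for t in tokens]
    attention_mask = [1] * len(input_ids)
    # step 4 (padding): fill up to max_length and mask the padded positions
    n_pad = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * n_pad
    attention_mask += [0] * n_pad
    return tokens, input_ids, attention_mask

tokens, input_ids, attention_mask = toy_encode("Risk factors may change", vocab)
print(tokens)
print(input_ids)
print(attention_mask)
```

In the toy output, "factors" is split into ```factor``` and ```##s```, and the final position is padding, so its attention mask is 0.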

In [13]:
# pass all sequences through the tokenizer
encoded_sentences = tokenizer(list(df["sentences"].values),     # list of sequences we want to tokenize
                              truncation=True,                  # truncate sequences longer than specified length
                              max_length=60,                    # maximum number of tokens per sequence
                              padding="max_length",             # pad all sequences to the same size
                              return_tensors='pt'               # data type of results
                              )


### Loading and using a model

We will now use the ```AutoModel``` class to load our model and transform our tokenized sequences into their embedded representations.


In [None]:
# HuggingFace's generic class for working with language models out-of-the-box
AutoModel

transformers.models.auto.modeling_auto.AutoModel

In [14]:
# load a model using its name and explore its configuration
model = AutoModel.from_pretrained("bert-base-uncased",          # our choice of model
                                  output_hidden_states=True,    # output all hidden states so that we can fully explore the model
                                  output_attentions=True        # output attention weights so that we can fully explore the model
                                  )

# put model in evaluation mode (we will not do any training)
model = model.eval()


In [None]:
# if we wish to further inspect the model's configuration in detail we can use the config attribute
#print(model.config)

### Passing a sequence through the model

Generating an embedded representation of a sequence with BERT requires passing its tokens through multiple layers of trained weights.

In [15]:
# let's first get a single sentence as an example
sent_position = 7
sent = df.loc[sent_position, "sentences"]

# tokenize
sent_encoded = tokenizer(sent, max_length=60, padding="max_length", truncation=True, return_tensors='pt')
sent_encoded["input_ids"]   # token ids of the sentence (only displayed if this is the last line of the cell)

# apply forward pass through the model (do not accumulate gradients; we are not training)
with torch.no_grad():
    result = model(**sent_encoded)

In [None]:
# what is "result" ?

In [17]:
# get only the NAICS2 code for each sentence to use as the labels for our regression
labels = df[["naics2"]]

### Word embeddings features

We will also generate an embedded representation of sequences using word embeddings from a pre-trained model. This will give us a reference point with respect to which we can assess the quality of the features generated with BERT.


We will use word embeddings estimated with the GloVe algorithm on Wikipedia and a large news corpus. In order to generate a single representation for a whole sentence, we will average the individual word embeddings of all words from the sentence.
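
The averaging step can be sketched on toy data. The three-dimensional vectors and the helper name ```average_embedding``` below are hypothetical and only for illustration; the GloVe model used in this tutorial has 300 dimensions. Words that are not in the vocabulary are skipped.

```python
# Hypothetical 3-dimensional embeddings (real GloVe vectors are much longer).
toy_vectors = {
    "credit": [0.2, 0.4, 0.0],
    "risk":   [0.6, 0.0, 0.2],
    "rises":  [0.1, 0.2, 0.4],
}

def average_embedding(text, vectors):
    """Average the vectors of all in-vocabulary words; return None if no word is known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    # element-wise mean across the word vectors
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# "sharply" is not in the toy vocabulary, so only three words are averaged
emb = average_embedding("Credit risk rises sharply", toy_vectors)
print(emb)
```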


In [18]:
# download the model and return as an object ready to use (takes a couple of minutes)
# other available models can be found here: https://kavita-ganesan.com/easily-access-pre-trained-word-embeddings-with-gensim/
#w2v_model = api.load("word2vec-google-news-300")    # model is too large to load in Colab
w2v_model = api.load("glove-wiki-gigaword-300")



In [19]:
def w2v_embed(text, w2v_model):
    """ function to generate a single embedded representation of a text
        by averaging individual word embeddings
    """

    # lowercase all text
    lower_text = text.lower()
    # split sentence into words by splitting on white spaces
    words = lower_text.split(" ")

    # get the embedding of every word that is in the model's vocabulary
    # (key_to_index is a dict, so the membership test is fast;
    #  index_to_key is a list and would be scanned for every word)
    word_embeddings = []
    for w in words:
        if w in w2v_model.key_to_index:
            word_emb = w2v_model[w]
            word_embeddings.append(word_emb)

    if word_embeddings:
        # generate a sequence embedding by averaging all embeddings
        word_embeddings = np.array(word_embeddings)
        seq_embedding = np.mean(word_embeddings, axis=0)
        return seq_embedding
    else:
        return np.nan

In [20]:
# transform all sequences into their w2v embedded representation (takes a couple of minutes)
all_w2v = []
valid_sents = []
for sent in df["sentences"].values:
    seq_emb = w2v_embed(sent, w2v_model)
    if isinstance(seq_emb, np.ndarray):
        all_w2v.append(seq_emb)
        valid_sents.append(True)
    else:
        valid_sents.append(False)

In [21]:
# get only the labels for those sentences with w2v embeddings
labels_w2v = labels[valid_sents]
labels_w2v.reset_index(drop=True, inplace=True)


### Estimate regressions

In [22]:
# create list with all the indexes of available sentences
# (this assumes every sentence received a w2v embedding, so that positions in labels_w2v match the rows of df)
sent_idxs = list(range(0, len(labels_w2v)))

# split into 70% train, 15% validation and 15% test
trn_idxs, tst_idxs = train_test_split(sent_idxs, test_size=0.3, random_state=92)
val_idxs, tst_idxs = train_test_split(tst_idxs, test_size=0.5, random_state=92)
# print(f"For Regressions:\n Train sentences: {len(trn_idxs)}\n", f"Validation sentences: {len(val_idxs)}\n", f"Test sentences: {len(tst_idxs)}")

# BERT
raw_train_ds = Dataset.from_pandas(df.loc[trn_idxs])
raw_val_ds = Dataset.from_pandas(df.loc[val_idxs])
raw_test_ds = Dataset.from_pandas(df.loc[tst_idxs])

ds = {"train": raw_train_ds, "validation": raw_val_ds, "test": raw_test_ds}

def preprocess_function(examples):
    label = examples["naics2"]
    examples = tokenizer(examples["sentences"],    # sequence we want to tokenize
                         truncation=True,          # truncate sequences longer than specified length
                         max_length=60,            # maximum number of tokens per sequence
                         padding="max_length",     # pad all sequences to the same size
                         # return_tensors='pt'     # data type of results
                         )
    # store the NAICS2 code as a float so that it can be used as a regression target
    examples["label"] = float(label)
    return examples


for split in ds:
    ds[split] = ds[split].map(preprocess_function, remove_columns=['__index_level_0__','sentences', 'cik', 'year', 'sent_no', 'sent_id', 'at', 'emp', 'naics2', 'naics2_name', 'sentence_len', 'keep_sent'])



Map:   0%|          | 0/4758 [00:00<?, ? examples/s]

Map:   0%|          | 0/1020 [00:00<?, ? examples/s]

Map:   0%|          | 0/1020 [00:00<?, ? examples/s]

In [23]:


from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

def compute_metrics_for_regression(eval_pred):
    logits, labels = eval_pred
    labels = labels.reshape(-1, 1)

    mse = mean_squared_error(labels, logits)
    mae = mean_absolute_error(labels, logits)
    r2 = r2_score(labels, logits)
    single_squared_errors = ((logits - labels).flatten()**2).tolist()

    # Compute accuracy
    # A prediction rounds to the true label only if its absolute error is below 0.5,
    # i.e. its squared error is below 0.25
    accuracy = sum([1 for e in single_squared_errors if e < 0.25]) / len(single_squared_errors)

    return {"mse": mse, "mae": mae, "r2": r2, "accuracy": accuracy}
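
The accuracy rule used in ```compute_metrics_for_regression``` can be checked on a few toy numbers. The labels and predictions below are hypothetical; the sketch only illustrates that, for integer labels, thresholding the squared error at 0.25 gives the same result as rounding each prediction and comparing it with the label (ignoring exact ties at 0.5).

```python
# Hypothetical integer labels and real-valued predictions
toy_labels = [33, 52, 44, 51]
toy_preds = [33.4, 51.2, 44.6, 50.9]

squared_errors = [(p - y) ** 2 for p, y in zip(toy_preds, toy_labels)]

# accuracy via the squared-error threshold (squared error < 0.25)
acc_threshold = sum(e < 0.25 for e in squared_errors) / len(toy_labels)

# accuracy via explicit rounding of each prediction
acc_rounding = sum(round(p) == y for p, y in zip(toy_preds, toy_labels)) / len(toy_labels)

print(acc_threshold, acc_rounding)
```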



In [24]:
# regression
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding
from torch.utils.data import DataLoader


BASE_MODEL = "bert-base-uncased"
LEARNING_RATE = 2e-4   # note: larger than the 2e-5 to 5e-5 range suggested in the BERT paper for fine-tuning
MAX_LENGTH = 60
BATCH_SIZE = 16
EPOCHS = 20


tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=1)  # a single output unit gives a regression head (a multi-class classifier would use one label per class)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:

from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir="../fine-tuned-regression",
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    metric_for_best_model="accuracy",
    load_best_model_at_end=True,
    weight_decay=0.01,
)
# Custom class that extends Trainer (RegressionTrainer) and overrides compute_loss
# with torch.nn.functional.mse_loss to compute the mean-squared loss.
# This reimplements the MSE loss.
class RegressionTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments (e.g. num_items_in_batch) passed by newer transformers versions
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs[0][:, 0]
        loss = torch.nn.functional.mse_loss(logits, labels)
        return (loss, outputs) if return_outputs else loss

trainer = RegressionTrainer(
    model=model,
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    compute_metrics=compute_metrics_for_regression,
)

trainer.train()
out = trainer.evaluate()
trainer.save_model("./result")   # save the fine-tuned model and its config (pickling the whole Trainer with torch.save is fragile)

| Epoch | Training Loss | Validation Loss | Mse | Mae | R2 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 56.012756 | 56.012753 | 6.480037 | -0.010579 | 0.0 |
| 2 | 286.850700 | 55.474777 | 55.474777 | 6.341988 | -0.000873 | 0.0 |
| 3 | 286.850700 | 55.44561 | 55.445618 | 6.321418 | -0.000347 | 0.0 |
| 4 | 58.754700 | 55.427219 | 55.427219 | 6.279056 | -1.5e-05 | 0.0 |
| 5 | 58.754700 | 55.502872 | 55.502872 | 6.3563 | -0.00138 | 0.0 |
| 6 | 58.564000 | 55.756504 | 55.7565 | 6.431678 | -0.005956 | 0.0 |
| 7 | 58.268600 | 55.469654 | 55.469654 | 6.233739 | -0.00078 | 0.0 |
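
One possible reading of this log: an R2 that stays close to zero is what a model obtains when it predicts roughly the same value for every sentence. The sketch below uses hypothetical labels to show that always predicting the mean label gives an R2 of exactly 0 and an MSE equal to the variance of the labels. Whether the fine-tuned model has actually collapsed to a constant prediction is an assumption that would need to be checked by inspecting its predictions.

```python
# Hypothetical NAICS2 codes used as regression targets
toy_labels = [33.0, 52.0, 44.0, 51.0, 45.0]

# a "model" that always predicts the mean label
mean_label = sum(toy_labels) / len(toy_labels)
toy_preds = [mean_label] * len(toy_labels)

ss_res = sum((y - p) ** 2 for y, p in zip(toy_labels, toy_preds))   # residual sum of squares
ss_tot = sum((y - mean_label) ** 2 for y in toy_labels)             # total sum of squares

mse = ss_res / len(toy_labels)
variance = ss_tot / len(toy_labels)
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse}, label variance: {variance}, R2: {r2}")
```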
