<a href="https://colab.research.google.com/github/debasmita-das-econ/NLP_LLM_GT_Workshop/blob/main/AI_Future_Finance_Part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning PLM and Zero-shot LLM

### This notebook was presented at the Artificial Intelligence and the Future of Finance Conference [Nov 3, 2023]

#### Author: Agam Shah
#### Edits: Michael Galarnyk

## Introduction

Training a state-of-the-art large language model (LLM) from scratch can cost [millions of dollars](https://www.forbes.com/sites/craigsmith/2023/09/08/what-large-models-cost-you--there-is-no-free-ai-lunch/?sh=5b4457a4af7a). This is one major reason why practitioners typically start from a pretrained model and fine-tune it, often updating only the last few layers of the network, rather than train from scratch. Fine-tuning is often something Google Colab or your local machine can handle.

This notebook covers the following:

* Data import from Hugging Face
* Fine-tune RoBERTa model
* Deploy model on Hugging Face
* Use already deployed model on Hugging Face
* Zero-shot LLaMA-2-7B


## Install and import libraries

In [3]:
!pip install transformers
!pip install datasets
!pip install accelerate


In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json
import torch

from datasets import load_dataset
from datasets import load_metric

from transformers import AutoConfig
from transformers import AutoModelForCausalLM # Zero-shot LLaMA-2-7B
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import pipeline
from transformers import TrainingArguments
from transformers import Trainer

# Modify and push files
from huggingface_hub import login, logout
from huggingface_hub import HfApi

## Data import from Hugging Face

This cell downloads the [fomc-example-dataset](https://huggingface.co/datasets/gtfintechlab/fomc-example-dataset) from the paper [Trillion Dollar Words: A New Financial Dataset, Task & Market Analysis](https://arxiv.org/pdf/2305.07972.pdf). This dataset is the largest tokenized and annotated dataset of FOMC speeches, meeting minutes, and press conference transcripts. It was developed to better understand how monetary policy influences financial markets.

In [5]:
data_files = {"train": "train.csv", "test": "test.csv"}
dataset = load_dataset("gtfintechlab/fomc-example-dataset", data_files=data_files)
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['index', 'sentence', 'year', 'label', 'orig_index'],
        num_rows: 1984
    })
    test: Dataset({
        features: ['index', 'sentence', 'year', 'label', 'orig_index'],
        num_rows: 496
    })
})


## Fine-Tune RoBERTa model

[RoBERTa](https://arxiv.org/abs/1907.11692) is a transformer model pretrained on a large corpus of English text in a self-supervised fashion. That is, it was pretrained on raw text only, with no human labeling: an automatic process generates the inputs and labels from the text itself, which is why it can use large amounts of publicly available data.

### Data processing and tokenization

The code below uses `AutoTokenizer` to load the tokenizer that belongs to `'roberta-base'`. Using the model's own tokenizer is important for several reasons:

* <b>Model-Specific Tokenization</b>: Different models can have different tokenization approaches. For example:

  * RoBERTa uses Byte-Pair Encoding (BPE).
  * [BERT](https://arxiv.org/abs/1810.04805) uses WordPiece tokenization.

* <b>Vocabulary Matching</b>: Each pretrained model comes with a specific vocabulary that it has been trained on.

* <b>Model Configuration and Special Tokens</b>: Pretrained models often come with specific configurations, including special tokens (like padding tokens, mask tokens, etc.).

* <b>Preprocessing Consistency</b>: If you are fine-tuning a pretrained model on a new task or dataset, it's important to preprocess the new data the same way the original training data was processed.
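To make the tokenization point more concrete, here is a toy sketch of the merge step behind Byte-Pair Encoding. It is an illustration only: the three-word corpus and the helper names (`count_pairs`, `merge_pair`) are made up for this example, and RoBERTa's real tokenizer operates on bytes with a much larger learned merge table.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across a corpus of {symbols: frequency}."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical toy corpus: each word split into characters, with a count.
corpus = {tuple("rate"): 5, tuple("rates"): 3, tuple("rapid"): 2}

for step in range(3):
    pairs = count_pairs(corpus)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    corpus = merge_pair(corpus, best)
    print(step, best, list(corpus))
```

After three merges the frequent word "rate" has become a single symbol, while the rarer "rapid" is still split into smaller pieces. This is why a model must be paired with the exact vocabulary and merge rules it was pretrained with.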

In [6]:
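# Load the tokenizer that matches the checkpoint being fine-tuned
tokenizer = AutoTokenizer.from_pretrained('roberta-base')

def tokenize_data(example):
    # Pad every sentence to the model's maximum length and truncate longer ones
    return tokenizer(example['sentence'], padding='max_length', truncation=True)

dataset = dataset.map(tokenize_data, batched=True)

# Drop the columns the model does not consume; the label and tokenizer outputs remain
remove_columns = ['index', 'sentence', 'year', 'orig_index']
dataset = dataset.map(remove_columns=remove_columns)

print(dataset)

train_dataset = dataset['train']
eval_dataset = dataset['test']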

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 1984
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 496
    })
})
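The `input_ids` and `attention_mask` columns in the output above come from the `padding='max_length'` setting. As a rough sketch of what that padding does, the hypothetical helper below right-pads a short list of token ids and builds the matching mask. The token ids are illustrative, `pad_id=1` is assumed to be the `<pad>` id of `roberta-base`, and the real tokenizer also takes care to keep the special start and end tokens when truncating.

```python
def pad_to_max_length(token_ids, max_length, pad_id):
    """Right-pad (or cut) token ids and build the matching attention mask."""
    ids = list(token_ids)[:max_length]
    mask = [1] * len(ids)            # 1 = real token the model should attend to
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 = padding

# Illustrative ids only; max_length is 8 here instead of the model's 512.
input_ids, attention_mask = pad_to_max_length([0, 713, 16, 2], max_length=8, pad_id=1)
print(input_ids)
print(attention_mask)
```

Padding every sentence to the full model length keeps all examples the same shape, at the cost of extra computation on padding positions.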


### Set training arguments

`learning_rate`: Anywhere from 1e-5 to 1e-7 typically works well.

`num_train_epochs`: Set to 1 here, meaning a single pass over the training data, to keep the demo fast. In practice this value is typically larger.

`push_to_hub`: False for now because we don't want to push to Hugging Face until we are happy with the model.

## Note: please replace "debasmita" with your HF username

In [7]:
training_args = TrainingArguments(output_dir="debasmita/trial-model",
                                  num_train_epochs=1,
                                  learning_rate=1e-6,
                                  per_device_train_batch_size=4,
                                  hub_model_id="debasmita/trial-model",
                                  push_to_hub=False)

### Load Pre-trained Language Model (PLM)

The code below loads the pretrained model "roberta-base" from Hugging Face's models hub. Note that sequence classification is a task where a model assigns a label to an entire sequence (or sentence) rather than individual tokens.

In [8]:
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Load and create function to compute metric

This F1 score is weighted by the number of true instances for each label. It accounts for class imbalance by giving more weight to the classes with more instances.

In [9]:
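# Note: `load_metric` is deprecated in recent `datasets` releases in favor of
# `evaluate.load("f1")` from the separate `evaluate` package.
metric = load_metric("f1")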
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels, average="weighted")
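To show what the weighted average does, the sketch below recomputes a support-weighted F1 with NumPy alone. The helper name `weighted_f1` and the toy logits and labels are made up for illustration. It is meant to mirror the idea behind `average="weighted"`, not to replace the library metric.

```python
import numpy as np

def weighted_f1(predictions, references, num_labels):
    """Support-weighted average of per-class F1 scores."""
    predictions = np.asarray(predictions)
    references = np.asarray(references)
    total = 0.0
    for label in range(num_labels):
        tp = np.sum((predictions == label) & (references == label))
        fp = np.sum((predictions == label) & (references != label))
        fn = np.sum((predictions != label) & (references == label))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        support = np.sum(references == label)  # number of true instances
        total += support * f1
    return float(total / len(references))

# Hypothetical logits for six sentences and three classes: the model always
# prefers class 0, which is also the majority class in the toy labels.
logits = np.array([[2.0, 0.1, 0.3]] * 6)
labels = np.array([0, 0, 0, 0, 1, 2])
predictions = np.argmax(logits, axis=-1)

score = weighted_f1(predictions, labels, num_labels=3)
print(round(score, 4))
```

In this toy case the majority class gets an F1 of 0.8 and the two minority classes get 0, so the weighted score is 0.8 × 4/6 ≈ 0.533. An unweighted (macro) average would be 0.8/3 ≈ 0.267, which shows how much the weighting favors the larger class.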


### Create trainer object

The shard method is used to divide the dataset into multiple smaller "shards" and then select one of those shards. In this case, both the training and evaluation datasets are divided into 10 shards, and only the first shard (index=0) is selected for training and evaluation. This is useful when you want to train or evaluate on a subset of the data, possibly for faster experimentation.

In [10]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset.shard(num_shards=10, index=0),
    eval_dataset=eval_dataset.shard(num_shards=10, index=0),
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

### Train (Fine-tune) the model

In [11]:
trainer.train()

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



TrainOutput(global_step=50, training_loss=1.1414989471435546, metrics={'train_runtime': 20.4977, 'train_samples_per_second': 9.708, 'train_steps_per_second': 2.439, 'total_flos': 52359570127872.0, 'train_loss': 1.1414989471435546, 'epoch': 1.0})
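The `global_step=50` reported above can be checked by hand from the shard size and the batch size. The sketch below assumes a single device, no gradient accumulation, and shards that split the rows as evenly as possible. `shard_size` is a hypothetical helper, not part of the `datasets` API.

```python
import math

def shard_size(num_rows, num_shards, index):
    """Rows in one shard when rows are split as evenly as possible."""
    base, extra = divmod(num_rows, num_shards)
    return base + (1 if index < extra else 0)

train_rows = shard_size(1984, num_shards=10, index=0)
eval_rows = shard_size(496, num_shards=10, index=0)

# per_device_train_batch_size=4, one epoch, one device (assumed)
steps = math.ceil(train_rows / 4)
print(train_rows, eval_rows, steps)
```

With 199 training sentences and a batch size of 4, one epoch takes 50 optimizer steps, which matches the training output. It also shows how little data this demo run actually sees.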

### Evaluate the model

In [12]:
evaluate_output = trainer.evaluate()
print(evaluate_output)

{'eval_loss': 1.13997483253479, 'eval_f1': 0.10730158730158731, 'eval_runtime': 1.4179, 'eval_samples_per_second': 35.262, 'eval_steps_per_second': 4.937, 'epoch': 1.0}


## Deploy model on Hugging Face

### Login to Hugging Face

This function call prompts the user to log in to their Hugging Face account. Once authenticated, an access token will be saved locally, allowing the user to interact with the Hugging Face Hub (e.g., push models, datasets) programmatically without needing to re-authenticate every time.

In [13]:
login()

### Push tokenizer and trained model

After pushing, you can check out the model on Hugging Face. It is also possible to do some inference on Hugging Face (test the model).

In [15]:
tokenizer.push_to_hub("debasmita/trial-model")
trainer.push_to_hub()

'https://huggingface.co/debasmita/trial-model/tree/main/'

### Modify and push additional files

The code below modifies a local tokenizer configuration file, then uploads the updated configuration to a specified repository on the Hugging Face Model Hub.

In [16]:
with open("/content/debasmita/trial-model/tokenizer_config.json", "r") as f:
  config = json.load(f)

# Make the necessary changes to the config file.

config["name_or_path"] = "roberta-base"

with open("/content/debasmita/trial-model/tokenizer_config.json", "w") as f:
  json.dump(config, f, indent=4)

api = HfApi()
api.upload_file(
    path_or_fileobj="/content/debasmita/trial-model/tokenizer_config.json",
    path_in_repo="tokenizer_config.json",
    repo_id="debasmita/trial-model",
    repo_type="model",
)

'https://huggingface.co/debasmita/trial-model/blob/main/tokenizer_config.json'
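The read-modify-write step can be rehearsed on a throwaway file before touching the real configuration. The sketch below covers only the local edit, not the upload. The helper `set_json_field` and the temporary file are hypothetical stand-ins for illustration.

```python
import json
import tempfile
from pathlib import Path

def set_json_field(path, key, value):
    """Load a JSON file, set one top-level key, and write it back."""
    path = Path(path)
    config = json.loads(path.read_text())
    config[key] = value
    path.write_text(json.dumps(config, indent=4))
    return config

# Hypothetical stand-in for tokenizer_config.json in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "tokenizer_config.json"
    demo_path.write_text(json.dumps({"name_or_path": "some/local/path"}))
    set_json_field(demo_path, "name_or_path", "roberta-base")
    reloaded = json.loads(demo_path.read_text())

print(reloaded)
```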

## Use already deployed model on Hugging Face

The code below loads a <b>RoBERTa model fine-tuned for the FOMC classification task</b> (LABEL_2: Neutral, LABEL_1: Hawkish, LABEL_0: Dovish) from the Hugging Face Model Hub, creates a text-classification pipeline, and classifies two example sentences, printing the results. If you don't want to run this code, you can try the model through [Hugging Face's hosted inference API](https://huggingface.co/gtfintechlab/FOMC-RoBERTa).

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gtfintechlab/FOMC-RoBERTa", do_lower_case=True, do_basic_tokenize=True)

model = AutoModelForSequenceClassification.from_pretrained("gtfintechlab/FOMC-RoBERTa", num_labels=3)

config = AutoConfig.from_pretrained("gtfintechlab/FOMC-RoBERTa")

# device=0 runs the pipeline on the first GPU; on a CPU-only machine, use the commented-out line below instead.
classifier = pipeline('text-classification', model=model, tokenizer=tokenizer, config=config, device=0, framework="pt")
# classifier = pipeline('text-classification', model=model, tokenizer=tokenizer, config=config, framework="pt")
results = classifier(["Such a directive would imply that any tightening should be implemented promptly if developments were perceived as pointing to rising inflation.",
                      "The International Monetary Fund projects that global economic growth in 2019 will be the slowest since the financial crisis."],
                      batch_size=4, truncation="only_first")

print(results)

[{'label': 'LABEL_1', 'score': 0.999393105506897}, {'label': 'LABEL_0', 'score': 0.9979877471923828}]
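The raw `LABEL_n` names are easier to read once they are mapped to the stances listed for this model. The sketch below applies that mapping to the two results printed above, copied in so it runs offline. The names `LABEL_NAMES` and `readable` are hypothetical.

```python
# Label names as stated for gtfintechlab/FOMC-RoBERTa in this section.
LABEL_NAMES = {"LABEL_0": "Dovish", "LABEL_1": "Hawkish", "LABEL_2": "Neutral"}

def readable(results):
    """Attach a human-readable stance to each pipeline result."""
    return [
        {"stance": LABEL_NAMES[r["label"]], "score": round(r["score"], 4)}
        for r in results
    ]

# The two results from the output above.
sample_results = [
    {"label": "LABEL_1", "score": 0.999393105506897},
    {"label": "LABEL_0", "score": 0.9979877471923828},
]
decoded = readable(sample_results)
for row in decoded:
    print(row)
```

Read this way, the sentence about prompt tightening is classified as Hawkish and the sentence about slowing global growth as Dovish.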


## Zero-shot LLaMA-2-7B

[Llama 2](https://arxiv.org/pdf/2307.09288.pdf), from Meta, outperforms other open source language models on many external benchmarks, including reasoning, coding, proficiency, and knowledge tests. It lives in a gated repository, which means Meta has to approve your access before you can use the model. Make sure to request access at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf and pass a token that has permission for this repo, either by logging in with `huggingface-cli login` or by passing `token=<your_token>`.

Note that this cell will take some time to run: the model weights are downloaded in two shards (9.98 GB and 3.50 GB, roughly 13.5 GB in total) before generation can start.

In [None]:
# Get model and tokenizer from Hugging Face
model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)

# Set pipeline for text generation
pipeline_obj = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Create prompts
# Zero-shot classification prompt for an FOMC sentence
classification_prompt = "Behave like you are an expert sentence classifier. Classify the following sentence from FOMC into 'HAWKISH', 'DOVISH', or 'NEUTRAL' class. Label 'HAWKISH' if it is corresponding to tightening of the monetary policy, 'DOVISH' if it is corresponding to easing of the monetary policy, or 'NEUTRAL' if the stance is neutral. Provide the label in the first line and provide a short explanation in the second line. The sentence: " + "Such a directive would imply that any tightening should be implemented promptly if developments were perceived as pointing to rising inflation."
# General-purpose demo prompt
demo_prompt = "Tell me something interesting about Georgia Institute of Technology."

# The output shown below was generated with demo_prompt; swap in
# classification_prompt to run the zero-shot FOMC classification.
prompts_list = [demo_prompt]

# Chat with model through prompt
res = pipeline_obj(
        prompts_list,
        max_new_tokens=64,
        do_sample=True,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        )

print(res)

[[{'generated_text': "Tell me something interesting about Georgia Institute of Technology.\n\nGeorgia Institute of Technology, commonly referred to as Georgia Tech, is a public research university located in Atlanta, Georgia, United States. It was founded in 1885 and is one of the top 10 engineering schools in the country, with a strong focus on science, technology, engineering, and mathematics (STEM) fields.\n\nHere are some interesting facts about Georgia Institute of Technology:\n\n1. Georgia Tech is a part of the University System of Georgia and is ranked among the top 10 public universities in the country by U.S. News & World Report.\n2. The university offers over 100 undergraduate and graduate degree programs in fields such as engineering, computer science, business, and the arts and sciences.\n3. Georgia Tech has a strong research program, with over $1 billion in annual research expenditures. The university is ranked among the top 10 universities in the country for research expe
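For the classification prompt defined in the cell above, two small helpers tend to be useful: one that builds the prompt for any sentence, and one that pulls the label out of the generated text. The sketch below is one possible way to do this. `build_prompt` and `parse_label` are hypothetical names, and the sample response is invented for illustration rather than taken from the model. Because the pipeline output includes the prompt, as the output above shows, the prompt is stripped before searching for a label.

```python
import re

def build_prompt(sentence):
    """Zero-shot instruction from this notebook followed by the sentence."""
    return (
        "Behave like you are an expert sentence classifier. Classify the "
        "following sentence from FOMC into 'HAWKISH', 'DOVISH', or 'NEUTRAL' "
        "class. Label 'HAWKISH' if it is corresponding to tightening of the "
        "monetary policy, 'DOVISH' if it is corresponding to easing of the "
        "monetary policy, or 'NEUTRAL' if the stance is neutral. Provide the "
        "label in the first line and provide a short explanation in the "
        "second line. The sentence: " + sentence
    )

def parse_label(generated_text, prompt):
    """Return the first label mentioned after the prompt, or None if absent."""
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    match = re.search(r"\b(HAWKISH|DOVISH|NEUTRAL)\b", completion.upper())
    return match.group(1) if match else None

example_prompt = build_prompt("Any tightening should be implemented promptly.")
# Invented response for illustration only; not actual model output.
fake_generation = example_prompt + "\nHAWKISH\nThe sentence calls for prompt tightening."
parsed = parse_label(fake_generation, example_prompt)
print(parsed)
```

Returning `None` when no label is found makes it easy to count unparseable generations separately instead of silently assigning them to a class.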

## Logout from Hugging Face
Log out so your token can't be used by someone else.

In [None]:
logout() # logout completely

Successfully logged out.
