<a href="https://colab.research.google.com/github/llk010502/Sentiment_Scorer/blob/main/FinGPT_Training_LoRA_with_LLama_3_1_8B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Part 1: Preparing the Data



In [1]:
!pip install datasets torch torchvision torchaudio tqdm pandas huggingface_hub
!pip install transformers
!pip install sentencepiece
!pip install protobuf cpm_kernels gradio mdtex2html sentencepiece accelerate
!pip install loguru
!pip install datasets
!pip install peft
!pip install bitsandbytes
!pip install tensorboard

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m10.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl 

### 1.1 Initialize Directories:

In [None]:
import os
import shutil

jsonl_path = "../data/dataset_new.jsonl"
save_path = '../data/dataset_new'


if os.path.exists(jsonl_path):
    os.remove(jsonl_path)

if os.path.exists(save_path):
    shutil.rmtree(save_path)

directory = "../data"
if not os.path.exists(directory):
    os.makedirs(directory)


### 1.2 Load and Prepare Dataset:

* Import necessary libraries from the datasets package: https://huggingface.co/docs/datasets/index
* Load the Twitter Financial News Sentiment (TFNS) dataset and convert it to a Pandas dataframe. https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment
* Map numerical labels to their corresponding sentiments (negative, positive, neutral).
* Add instruction for each data entry, which is crucial for Instruction Tuning.
* Convert the Pandas dataframe back to a Hugging Face Dataset object.




In [None]:
from datasets import load_dataset
import datasets

dic = {
    0:"negative",
    1:'positive',
    2:'neutral',
}

tfns = load_dataset('llk010502/fingpt-sentiment')
tfns = tfns['train']
tfns = tfns.to_pandas()
tfns['label'] = tfns['label'].apply(lambda x:dic[x])
tfns['instruction'] = 'What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.'
tfns.columns = ['input', 'output', 'instruction']
tfns = datasets.Dataset.from_pandas(tfns)
tfns

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/392 [00:00<?, ?B/s]

(…)-00000-of-00001-37899fcd89b3e3f8.parquet:   0%|          | 0.00/6.30M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/76772 [00:00<?, ? examples/s]

Dataset({
    features: ['input', 'output', 'instruction'],
    num_rows: 76772
})

### 1.3 Concatenate and Shuffle Dataset

In [None]:
# tmp_dataset = datasets.concatenate_datasets([tfns]*2)
# train_dataset = tmp_dataset
# print(tmp_dataset.num_rows)

# all_dataset = train_dataset.shuffle(seed = 42)
# all_dataset.shape

19086


(19086, 3)

In [None]:
train_dataset = tfns
all_dataset = train_dataset.shuffle(seed = 42)
all_dataset.shape

(76772, 3)

Now that your training data is loaded and prepared.

## Part 2: Dataset Formatting and Tokenization
Once your data is prepared, the next steps involve formatting the dataset for model ingestion and tokenizing the input data. Below, we provide a step-by-step breakdown of the code snippets shared.



### 2.1 Dataset Formatting:
You need to structure your data in a specific format that aligns with the training process.



In [None]:
import json
from tqdm.notebook import tqdm

In [None]:
def format_example(example: dict) -> dict:
    context = f"Instruction: {example['instruction']}\n"
    if example.get("input"):
        context += f"Input: {example['input']}\n"
    context += "Answer: "
    target = example["output"]
    return {"context": context, "target": target}

In [None]:
data_list = []
for item in all_dataset.to_pandas().itertuples():
    tmp = {}
    tmp["instruction"] = item.instruction
    tmp["input"] = item.input
    tmp["output"] = item.output
    data_list.append(tmp)

In [None]:
# save to a jsonl file
with open("../data/dataset_new.jsonl", 'w') as f:
    for example in tqdm(data_list, desc="formatting.."):
        f.write(json.dumps(format_example(example)) + '\n')

formatting..:   0%|          | 0/76772 [00:00<?, ?it/s]

### 2.2 Tokenization
Tokenization is the process of converting input text into tokens that can be fed into the model.



In [None]:
import datasets
from transformers import AutoTokenizer, AutoConfig

model_name = "meta-llama/Llama-3.1-8B"
jsonl_path = "../data/dataset_new.jsonl"  # updated path
save_path = '../data/dataset_new'  # updated path
max_seq_length = 256
skip_overlength = True

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

In [None]:
# The preprocess function tokenizes the prompt and target, combines them into input IDs,
# and then trims or pads the sequence to the maximum sequence length.
def preprocess(tokenizer, config, example, max_seq_length):
    prompt = example["context"]
    target = example["target"]
    prompt_ids = tokenizer.encode(prompt, max_length=max_seq_length, truncation=True)
    target_ids = tokenizer.encode(
        target,
        max_length=max_seq_length,
        truncation=True,
        add_special_tokens=False)
    input_ids = prompt_ids + target_ids + [config.eos_token_id]
    return {"input_ids": input_ids, "seq_len": len(prompt_ids)}

# The read_jsonl function reads each line from the JSONL file, preprocesses it using the preprocess function,
# and then yields each preprocessed example.
def read_jsonl(path, max_seq_length, skip_overlength=False):
    tokenizer = AutoTokenizer.from_pretrained(
        model_name, trust_remote_code=True)
    config = AutoConfig.from_pretrained(
        model_name, trust_remote_code=True, device_map='auto')
    with open(path, "r") as f:
        for line in tqdm(f.readlines()):
            example = json.loads(line)
            feature = preprocess(tokenizer, config, example, max_seq_length)
            if skip_overlength and len(feature["input_ids"]) > max_seq_length:
                continue
            feature["input_ids"] = feature["input_ids"][:max_seq_length]
            yield feature

### 2.3 Save the dataset

In [None]:
from huggingface_hub import login
#login("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")
login('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')
# The script then creates a Hugging Face Dataset object from the generator and saves it to disk.
save_path = '../data/dataset_new'

dataset = datasets.Dataset.from_generator(
    lambda: read_jsonl(jsonl_path, max_seq_length, skip_overlength)
    )
dataset.save_to_disk(save_path)


Generating train split: 0 examples [00:00, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/826 [00:00<?, ?B/s]

  0%|          | 0/76772 [00:00<?, ?it/s]

Saving the dataset (0/1 shards):   0%|          | 0/76770 [00:00<?, ? examples/s]

## Part 3: Setup FinGPT training parameters with LoRA on Baichuan2-7B



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import transformers
print(transformers.__version__)

4.46.3


In [None]:
# Ensure CUDA is accessible in the system path
# Only for Windows Subsystem for Linux (WSL)
import os
os.environ["PATH"] = f"{os.environ['PATH']}:/usr/local/cuda/bin"
os.environ['LD_LIBRARY_PATH'] = "/usr/lib/wsl/lib:/usr/local/cuda/lib64"

In [None]:
os.chdir('/content/drive/MyDrive/')

In [None]:
os.getcwd()

'/content/drive/MyDrive'

### 3.1 Training Arguments Setup:
Initialize and set training arguments.



In [None]:
from typing import List, Dict, Optional
import torch
import bitsandbytes
from loguru import logger
from transformers import (
    AutoModel,
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig
)
from peft import (
    TaskType,
    LoraConfig,
    get_peft_model,
    set_peft_model_state_dict,
    prepare_model_for_kbit_training,
)
from peft.utils import TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING

In [None]:
training_args = TrainingArguments(
        output_dir='./finetuned_model_llama3.1',    # saved model path
        logging_steps = 500,
        # max_steps=10000,
        num_train_epochs = 2,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        learning_rate=1e-4,
        weight_decay=0.01,
        warmup_steps=1000,
        lr_scheduler_type='linear',
        save_steps=500,
        fp16=True,
        # bf16=True,
        optim="adamw_8bit",
        load_best_model_at_end = True,
        evaluation_strategy="steps",
        remove_unused_columns=False,

    )



### 3.2 Quantization Config Setup:
Set quantization configuration to reduce model size without losing significant precision.



In [None]:
 # Quantization
q_config = BitsAndBytesConfig(
        load_in_8bit=True,
        bnb_8bit_compute_dtype=torch.bfloat16
    )

Unused kwargs: ['bnb_8bit_compute_dtype']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


### 3.3 Model Loading & Preparation:
Load the base model and tokenizer, and prepare the model for INT8 training.

* **Runtime -> Change runtime type -> A100 GPU**
* retart runtime and run again if not working


In [None]:
# Load tokenizer & model
# need massive space
model_name = 'meta-llama/Llama-3.1-8B'
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=q_config,
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
`low_cpu_mem_usage` was None, now default to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [None]:
# pip install accelerate bitsandbytes

### 3.4 LoRA Config & Setup:
Implement Low-Rank Adaptation (LoRA) and print trainable parameters.



In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"]
# target_modules = TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING['llama']
print(target_modules)

['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']


In [None]:
target_modules

['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']

In [None]:
# LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=target_modules,
    bias='none',
)
model = get_peft_model(model, lora_config)
print_trainable_parameters(model)

trainable params: 20971520 || all params: 8051232768 || trainable%: 0.26047588741133265


In [None]:
resume_from_checkpoint = None
if resume_from_checkpoint is not None:
    checkpoint_name = os.path.join(resume_from_checkpoint, 'pytorch_model.bin')
    if not os.path.exists(checkpoint_name):
        checkpoint_name = os.path.join(
            resume_from_checkpoint, 'adapter_model.bin'
        )
        resume_from_checkpoint = False
    if os.path.exists(checkpoint_name):
        logger.info(f'Restarting from {checkpoint_name}')
        adapters_weights = torch.load(checkpoint_name)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        logger.info(f'Checkpoint {checkpoint_name} not found')

In [None]:
model.print_trainable_parameters()

trainable params: 20,971,520 || all params: 8,051,232,768 || trainable%: 0.2605


## Part 4: Loading Data and Training FinGPT
In this segment, we'll delve into the loading of your pre-processed data, and finally, launch the training of your FinGPT model. Here's a stepwise breakdown of the script provided:
* Need to purchase Google Colab GPU plans, Colab Pro is sufficient or just buy 100 compute units for $10


### 4.1 Loading Your Data:


In [None]:
# load data
from datasets import load_from_disk
import datasets

dataset = datasets.load_from_disk("/data/dataset_new")
dataset = dataset.train_test_split(0.2, shuffle=True, seed = 42)

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'seq_len'],
        num_rows: 61416
    })
    test: Dataset({
        features: ['input_ids', 'seq_len'],
        num_rows: 15354
    })
})

### 4.2 Training Configuration and Launch:
* Customize the Trainer class for specific loss computation, prediction step, and model-saving methods.

* Define a data collator function to process batches of data during training.

* Set up TensorBoard for logging, instantiate your modified trainer, and begin training.



In [None]:
# class ModifiedTrainer(Trainer):
#     def compute_loss(self, model, inputs, return_outputs=False):
#         return model(
#             input_ids=inputs["input_ids"],
#             labels=inputs["labels"],
#         ).loss

#     def prediction_step(self, model: torch.nn.Module, inputs, prediction_loss_only: bool, ignore_keys = None):
#         with torch.no_grad():
#             res = model(
#                 input_ids=inputs["input_ids"].to(model.device),
#                 labels=inputs["labels"].to(model.device),
#             ).loss
#         return (res, None, None)

#     def save_model(self, output_dir=None, _internal_call=False):
#         from transformers.trainer import TRAINING_ARGS_NAME

#         os.makedirs(output_dir, exist_ok=True)
#         torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))
#         saved_params = {
#             k: v.to("cpu") for k, v in self.model.named_parameters() if v.requires_grad
#         }
#         torch.save(saved_params, os.path.join(output_dir, "adapter_model.bin"))

def data_collator(features: list) -> dict:
    len_ids = [len(feature["input_ids"]) for feature in features]
    longest = max(len_ids)
    input_ids = []
    labels_list = []
    for ids_l, feature in sorted(zip(len_ids, features), key=lambda x: -x[0]):
        ids = feature["input_ids"]
        seq_len = feature["seq_len"]
        labels = (
            [tokenizer.pad_token_id] * (seq_len - 1) + ids[(seq_len - 1) :] + [tokenizer.pad_token_id] * (longest - ids_l)
        )
        ids = ids + [tokenizer.pad_token_id] * (longest - ids_l)
        _ids = torch.LongTensor(ids)
        labels_list.append(torch.LongTensor(labels))
        input_ids.append(_ids)
    input_ids = torch.stack(input_ids)
    labels = torch.stack(labels_list)
    return {
        "input_ids": input_ids,
        "labels": labels,
    }

In [None]:
# Train
# Took about 10 compute units
trainer = Trainer(
    model=model,
    args=training_args,             # Trainer args
    train_dataset=dataset["train"], # Training set
    eval_dataset=dataset["test"],   # Testing set
    data_collator=data_collator,    # Data Collator

)
trainer.train()
# save model
model.save_pretrained(training_args.output_dir)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mll3713[0m. Use [1m`wandb login --relogin`[0m to force relogin


Step,Training Loss,Validation Loss
500,1.5914,0.004728
1000,0.004,0.003503
1500,0.0032,0.002852
2000,0.0024,0.002104
2500,0.0016,0.001776
3000,0.0013,0.0015
3500,0.0012,0.00138


### 4.3 Model Saving and Download:
After training, save and download your model. You can also check the model's size.



In [None]:
!zip -r /content/saved_model.zip /content/drive/MyDrive/{training_args.output_dir}


In [None]:
# download to local
from google.colab import files
files.download('/content/saved_model.zip')

In [None]:
# save to google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# save the finetuned model to google drive
!cp -r "/content/finetuned_model_llama3.1" "/content/drive/MyDrive"

In [None]:
training_args.output_dir

'./finetuned_model'

In [None]:
def get_folder_size(folder_path):
    total_size = 0
    for dirpath, _, filenames in os.walk(folder_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            print(f, os.path.getsize(fp)/ 1024 / 1024)
            total_size += os.path.getsize(fp)
    return total_size / 1024 / 1024  # Size in MB

model_size = get_folder_size(training_args.output_dir)
print(f"Model size: {model_size} MB")

adapter_config.json 0.000690460205078125
README.md 0.004860877990722656
adapter_model.safetensors 80.05647277832031
rng_state.pth 0.013584136962890625
trainer_state.json 0.0021963119506835938
adapter_config.json 0.000690460205078125
README.md 0.004860877990722656
training_args.bin 0.00505828857421875
scheduler.pt 0.00101470947265625
optimizer.pt 41.12532424926758
adapter_model.safetensors 80.05647277832031
events.out.tfevents.1732653121.b18d068b2957.3886.0 0.00399017333984375
events.out.tfevents.1732653231.b18d068b2957.4790.0 0.007687568664550781
Model size: 201.28290367126465 MB


Now your model is trained and saved! You can download it and use it for generating financial insights or any other relevant tasks in the finance domain. The usage of TensorBoard allows you to deeply understand and visualize the training dynamics and performance of your model in real-time.

Happy FinGPT Training! 🚀

## Part 5: Inference and Benchmarks using FinGPT
Now that your model is trained, let’s understand how to use it to infer and run benchmarks.
* Took about 10 compute units



### 5.1 Load the model

In [5]:
#clone the FinNLP repository
!git clone https://github.com/AI4Finance-Foundation/FinNLP.git

import sys
sys.path.append('/content/FinNLP/')

Cloning into 'FinNLP'...
remote: Enumerating objects: 1424, done.[K
remote: Counting objects: 100% (184/184), done.[K
remote: Compressing objects: 100% (36/36), done.[K
remote: Total 1424 (delta 168), reused 148 (delta 148), pack-reused 1240 (from 1)[K
Receiving objects: 100% (1424/1424), 4.53 MiB | 18.40 MiB/s, done.
Resolving deltas: 100% (655/655), done.


In [8]:
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login
login("API")

from peft import PeftModel
import torch
import os

# Load benchmark datasets from FinNLP
from finnlp.benchmarks.fpb import test_fpb
from finnlp.benchmarks.fiqa import test_fiqa , add_instructions
from finnlp.benchmarks.tfns import test_tfns
from finnlp.benchmarks.nwgi import test_nwgi

In [2]:
# load model from google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Define the path you want to check
path_to_check = "/content/drive/MyDrive/finetuned_model_llama3.1"

# Check if the specified path exists
if os.path.exists(path_to_check):
    print("Path exists.")
else:
    print("Path does not exist.")


Path exists.


In [12]:
## load our finetuned model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

base_model = "meta-llama/Llama-3.1-8B"
peft_model = "/content/drive/MyDrive/finetuned_model_llama3.1"

model_name = 'meta-llama/Llama-3.1-8B'
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # load_in_8bit=True,
    trust_remote_code=True,
    device_map='cuda'
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = PeftModel.from_pretrained(model, peft_model)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [None]:
# load benchmark model
model_name = 'meta-llama/Llama-3.1-8B-Instruct'
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    trust_remote_code=True,
    device_map='auto'
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = model.eval()

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

### 5.2 Run Benchmarks:

In [10]:
batch_size = 16
import logging
logging.getLogger("transformers").setLevel(logging.ERROR)

In [13]:
# TFNS Test Set, len 2388
res = test_tfns(model, tokenizer, batch_size = batch_size)



Prompt example:
Instruction: What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.
Input: $ALLY - Ally Financial pulls outlook https://t.co/G9Zdi1boy5
Answer: 


Total len: 2388. Batchsize: 16. Total steps: 150


100%|██████████| 150/150 [02:45<00:00,  1.10s/it]

Acc: 0.9053601340033501. F1 macro: 0.8827360740576906. F1 micro: 0.9053601340033501. F1 weighted (BloombergGPT): 0.9053469716592242. 





In [14]:
# FPB, len 1212
res = test_fpb(model, tokenizer, batch_size = batch_size)

README.md:   0%|          | 0.00/8.88k [00:00<?, ?B/s]

financial_phrasebank.py:   0%|          | 0.00/6.04k [00:00<?, ?B/s]

The repository for financial_phrasebank contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/financial_phrasebank.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


FinancialPhraseBank-v1.0.zip:   0%|          | 0.00/682k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/4846 [00:00<?, ? examples/s]



Prompt example:
Instruction: What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}.
Input: L&T has also made a commitment to redeem the remaining shares by the end of 2011 .
Answer: 


Total len: 1212. Batchsize: 16. Total steps: 76


100%|██████████| 76/76 [01:36<00:00,  1.27s/it]

Acc: 0.8712871287128713. F1 macro: 0.8656606842321128. F1 micro: 0.8712871287128713. F1 weighted (BloombergGPT): 0.8702876866561449. 





In [15]:
# FiQA, len 275
res = test_fiqa(model, tokenizer, prompt_fun = add_instructions, batch_size = batch_size)

train.csv:   0%|          | 0.00/161k [00:00<?, ?B/s]

validation.csv:   0%|          | 0.00/16.7k [00:00<?, ?B/s]

test.csv:   0%|          | 0.00/25.3k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/961 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/102 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/150 [00:00<?, ? examples/s]



Prompt example:
Instruction: What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.
Input: This $BBBY stock options trade would have more than doubled your money https://t.co/Oa0loiRIJL via @TheStreet
Answer: 


Total len: 275. Batchsize: 16. Total steps: 18


100%|██████████| 18/18 [00:18<00:00,  1.05s/it]

Acc: 0.7890909090909091. F1 macro: 0.7015081480869161. F1 micro: 0.7890909090909091. F1 weighted (BloombergGPT): 0.8244744078450086. 





In [16]:
# NWGI, len 4047
res = test_nwgi(model, tokenizer, batch_size = batch_size)

README.md:   0%|          | 0.00/682 [00:00<?, ?B/s]

(…)-00000-of-00001-dd971e407aecb39b.parquet:   0%|          | 0.00/10.8M [00:00<?, ?B/s]

(…)-00000-of-00001-c551483ebf365496.parquet:   0%|          | 0.00/2.73M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/16184 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/4047 [00:00<?, ? examples/s]



Prompt example:
Instruction: What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}.
Input: In the latest trading session, Adobe Systems (ADBE) closed at $535.98, marking a +0.31% move from the previous day.
Answer: 


Total len: 4047. Batchsize: 16. Total steps: 253


100%|██████████| 253/253 [06:38<00:00,  1.57s/it]

Acc: 0.6236718556955769. F1 macro: 0.5878600322352122. F1 micro: 0.6236718556955769. F1 weighted (BloombergGPT): 0.5606266523563215. 





#### Inference at length

In [None]:
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM

from peft import PeftModel
import torch
import os

base_model = "meta-llama/Llama-3.1-8B"
peft_model = "llk010502/llama3.1-8B-financial_sentiment"

model_name = 'meta-llama/Llama-3.1-8B'
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    trust_remote_code=True,
    device_map='cuda'
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = PeftModel.from_pretrained(model, peft_model)

model = model.eval()

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [None]:
text = """Disney+ subscribers will now see an ESPN tile on the streaming service's homepage, part of Walt Disney Co.'s continued efforts to increase subscribers and reduce churn.

Starting Wednesday, bundle subscribers to Disney+, Hulu and ESPN+ will be able to access ESPN content from the Disney+ app.

Those who subscribe only to Disney+ will also be able to watch some Hulu and ESPN+ content through the app, including certain live NBA games, the first day of the Australian Open and some "30 for 30" sports documentaries, as well as series and movies such as FX's Emmy-winning "Shogun," crime procedural "Will Trent" and the film "Dawn of the Planet of the Apes."


Read more: 'Deadpool & Wolverine' and 'Inside Out 2' propel Disney studio earnings

The idea, Disney officials said, is to whet people's appetites and encourage upgrades to the full bundle.

"There are opportunities to use the sampling experiences [as] lead-in to a more fulsome experience," said Alisa Bowen, president of Disney+.

The addition of ESPN content to Disney+ is similar to the roll-out of the Hulu tile earlier this year. By integrating all three of its streaming services into one platform, Disney is betting that a more seamless experience will keep subscribers engaged and increase retention, Bowen said.

"This strategy is really about making it easier for them to consume everything that they're paying for, giving them less friction in terms of navigating between the different apps and better ability for us to personalize the content from all those different services," she said.

Disney's streaming business is key for its growth plans. The company has projected that its entertainment streaming business, which includes just Disney+ and Hulu, will have a 10% operating margin by 2025.

On the sports front, the company is planning to launch its ESPN flagship product in August.

Sign up for our Wide Shot newsletter to get the latest entertainment business news, analysis and insights.

This story originally appeared in Los Angeles Times."""

In [None]:
test_text = "Instruction: What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}." + '\n' + 'Input: '+ text + '\n' + 'Answer:'
print(test_text)
tokens = tokenizer(test_text, return_tensors='pt', truncation=True, max_length=512)
for k in tokens.keys():
    tokens[k] = tokens[k].cuda()
res = model.generate(**tokens, max_new_tokens=64, use_cache=True)
res_sentences = [tokenizer.decode(i) for i in res]

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Instruction: What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}.
Input: Disney+ subscribers will now see an ESPN tile on the streaming service's homepage, part of Walt Disney Co.'s continued efforts to increase subscribers and reduce churn.

Starting Wednesday, bundle subscribers to Disney+, Hulu and ESPN+ will be able to access ESPN content from the Disney+ app.

Those who subscribe only to Disney+ will also be able to watch some Hulu and ESPN+ content through the app, including certain live NBA games, the first day of the Australian Open and some "30 for 30" sports documentaries, as well as series and movies such as FX's Emmy-winning "Shogun," crime procedural "Will Trent" and the film "Dawn of the Planet of the Apes."


Read more: 'Deadpool & Wolverine' and 'Inside Out 2' propel Disney studio earnings

The idea, Disney officials said, is to whet people's appetites and encourage upgrades to the full bundle.

"There are opportunities to use th

In [None]:
test = [test_text, 'NVDA is about to rise']

In [None]:
def scorer(prompts, model, tokenizer):

    sentiments = ['positive', 'neutral', 'negative']

    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True, max_length=512)
    input_ids = inputs["input_ids"].cuda()

    with torch.no_grad():
        outputs = model(input_ids)
        logits = outputs.logits # shape:[batch_size, output_length, vocab_size]

    print(logits.shape)
    # Get logits for the last token (the next token to be predicted)
    last_token_logits = logits[:, -1, :]
    last_token_logits = last_token_logits.to(torch.float32)
    probabilities = torch.softmax(last_token_logits, dim=-1)  # Shape: [batch_size, vocab_size]
    sentiment_scores = []
    for i in range(len(prompts)):

        sentiments_prob = [probabilities[i, tokenizer.convert_tokens_to_ids(s)].item() for s in sentiments]

        # Standarized Positive - Standarized Negative
        sentiment_score = (sentiments_prob[0] - sentiments_prob[2])/sum(sentiments_prob)
        sentiment_scores.append(sentiment_score)
    return sentiment_scores

In [None]:
scorer(test, model, tokenizer)

torch.Size([2, 432, 128256])


[0.6716399840563979, 0.3003987538355544]

## Under this scoring rule, the threshold range would be: -1 to 0(negative); 0 to 0.5(neutral); 0.5 to 1(positive)