## Fine-tune Gemma 7B/2B it for Sentiment Analysis
[More Resources](https://huggingface.co/blog/codegemma?latest)

This hands-on tutorial fine-tunes a Gemma model on a sentiment analysis task over financial and economic text. Sentiment analysis of this kind matters to businesses for several reasons: market insight (trends, investor confidence, and consumer behavior), risk management (identifying potential reputational risks), and investment decisions (by gauging the sentiment of stakeholders, investors, and the general public, businesses can assess the potential success of various investment opportunities).

Before getting into the technicalities of fine-tuning a large language model like Gemma, we need a suitable dataset to demonstrate what fine-tuning can do.

Particularly within the realm of finance and economic texts, annotated datasets are notably rare, with many being exclusively reserved for proprietary purposes. To address the issue of insufficient training data, scholars from the Aalto University School
of Business introduced in 2014 a set of approximately 5000 sentences. This collection aimed to establish human-annotated benchmarks, serving as a standard for evaluating alternative modeling techniques. The involved annotators (16 people with
adequate background knowledge on financial markets) were instructed to assess the sentences solely from the perspective of an investor, evaluating whether the news potentially holds a positive, negative, or neutral impact on the stock price.

The FinancialPhraseBank dataset is a comprehensive collection that captures the sentiments of financial news headlines from the viewpoint of a retail investor. Comprising two key columns, namely "Sentiment" and "News Headline," the dataset effectively classifies sentiments as either negative, neutral, or positive. This structured dataset serves as a valuable resource for analyzing and understanding the complex dynamics of sentiment in the domain of financial news. It has been used in various studies and research initiatives, since its inception in the work by Malo, P., Sinha, A., Korhonen, P., Wallenius, J., and Takala, P.  "Good debt or bad debt: Detecting semantic orientations in economic texts.", published in the Journal of the Association for Information Science and Technology in 2014.

As a first step, we install the specific libraries necessary to make this example work.

* accelerate is a distributed training library for PyTorch by HuggingFace. It allows you to train your models on multiple GPUs or CPUs in parallel (distributed configurations), which can significantly speed up training in presence of multiple GPUs (we won't use it in our example).
* peft is a Python library by HuggingFace for efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. PEFT methods only fine-tune a small number of (extra) model parameters, thereby greatly decreasing the computational and storage costs.
* bitsandbytes, by Tim Dettmers, is a lightweight wrapper around custom CUDA functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions. It allows models to be stored in 4-bit precision: the weights are kept in 4 bits, while the computation still happens in 16-bit or 32-bit, and any combination can be chosen (float16, bfloat16, float32, and so on).
* transformers is a Python library for natural language processing (NLP). It provides a number of pre-trained models for NLP tasks such as text classification, question answering, and machine translation.
* trl is a full stack library by HuggingFace providing a set of tools to train transformer language models with Reinforcement Learning, from the Supervised Fine-tuning step (SFT), Reward Modeling step (RM) to the Proximal Policy Optimization (PPO) step.
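
Before running the install cell, it can help to see which of these libraries are already present in the active environment. The following is a minimal sketch using only the standard library; the `PACKAGES` list and the `installed_versions` helper are names chosen here for illustration, not part of the notebook.

```python
from importlib import metadata

# Libraries named in the list above, plus datasets, which the notebook also imports.
PACKAGES = ["accelerate", "peft", "bitsandbytes", "transformers", "trl", "datasets"]

def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

versions = installed_versions(PACKAGES)
for name, version in versions.items():
    print(f"{name:<14} {version or 'not installed'}")
```

Comparing this output with the pinned versions in the install cell shows quickly whether an upgrade or downgrade is needed.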

In [1]:
import os
os.environ['CONDA_DEFAULT_ENV']

'py_torching'

In [2]:
# !pip3 install -U bitsandbytes==0.43.0 https://github.com/TimDettmers/bitsandbytes/issues/1140
# !pip3 install -q -U peft==0.8.2
# !pip3 install -q -U trl==0.7.10
# !pip3 install -q -U accelerate==0.27.1
# !pip3 install -q -U datasets==2.17.0
# !pip3 install -q -U transformers==4.38.0

```bash
# conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0  
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
https://pytorch.org/get-started/locally/

In [1]:
import torch  
  
# Check if CUDA is available and print the number of GPUs available  
if torch.cuda.is_available():   
    print('CUDA is available!  Training on GPU.')  
    print('Number of GPUs available:', torch.cuda.device_count())  
else:   
    print('CUDA is not available. Training on CPU.')  
for i in range(torch.cuda.device_count()):  
    print(f"Device {i}: {torch.cuda.get_device_name(i)}")  

print(torch.__version__)  
print(torch.version.cuda)  

CUDA is available!  Training on GPU.
Number of GPUs available: 1
Device 0: NVIDIA GeForce RTX 3060 Laptop GPU
2.2.2+cu121
12.1


In [4]:
# %env LD_LIBRARY_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\lib\x64
%env BNB_CUDA_VERSION=121

env: BNB_CUDA_VERSION=121


In [5]:
os.environ['BNB_CUDA_VERSION']

'121'

In [6]:
# !pip3 freeze > requirements.txt

In [7]:
# !pip install -q -U git+https://github.com/huggingface/trl@a46cd84a6405312837f0d0e56fd1cf4d45585770
# !pip install -q -U git+https://github.com/huggingface/peft@2efc36ccdf130f03dae8e8468fa4caadc7b5d6f9

The next cell imports the os module and sets one environment variable; a second one is left commented out:
* CUDA_VISIBLE_DEVICES: This environment variable tells PyTorch which GPUs it may use. Setting it to 0 exposes only the first GPU.
* TOKENIZERS_PARALLELISM: This environment variable tells the Hugging Face tokenizers library whether to parallelize tokenization. Setting it to false disables parallel tokenization. The line is commented out here, so the library default applies.

In [4]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [5]:
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'True'

The next cell imports the warnings module and calls warnings.filterwarnings("ignore"), which suppresses all warnings. Training emits many warnings that do not prevent fine-tuning but can be distracting and make you wonder whether something is wrong.

In [6]:
import warnings
warnings.filterwarnings("ignore")
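
Silencing every warning for the whole session can also hide messages that matter. As a narrower alternative (a sketch, not what this notebook does), warnings can be suppressed only around a specific call with `warnings.catch_warnings()`. The `noisy_step` function below is a hypothetical stand-in for any library call that warns.

```python
import warnings

def noisy_step():
    """Stand-in for a library call that emits a warning (hypothetical)."""
    warnings.warn("demo warning from a training step", UserWarning)
    return "done"

# Warnings are silenced only inside this block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_step()

# Outside the block the previous filters are restored; record=True captures
# the warnings that would otherwise be displayed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_step()

print(result, len(caught))
```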

The following cell contains all the other imports needed to run the notebook.

In [7]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm

import torch
import torch.nn as nn

import transformers
from transformers import (AutoModelForCausalLM,
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          TrainingArguments,
                          pipeline,
                          logging)
from datasets import Dataset
from peft import LoraConfig, PeftConfig
import bitsandbytes as bnb
from trl import SFTTrainer

from sklearn.metrics import (accuracy_score,
                             classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

In [8]:
from watermark import watermark
print(watermark())
print(watermark(iversions=True, globals_=globals()))

Last updated: 2024-04-19T15:48:47.616093-04:00

Python implementation: CPython
Python version       : 3.10.14
IPython version      : 8.23.0

Compiler    : MSC v.1916 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 141 Stepping 1, GenuineIntel
CPU cores   : 16
Architecture: 64bit

transformers: 4.38.0
bitsandbytes: 0.43.0
numpy       : 1.26.3
pandas      : 2.2.1
torch       : 2.2.2+cu121



The code in the next cell performs the following steps:

1. Reads the input dataset from the gzip-compressed CSV file firas_lm_dataset.csv and maps the numeric Signal column to labels (1 becomes buy, 0 becomes sell).
2. Splits the dataset using its Partition column: rows marked Train become the training set, rows marked Test become the evaluation set, and rows marked Inference become the held-out test set.
3. Transforms the headlines into prompts to be used by Gemma: the train prompts contain the expected answer we want to fine-tune the model with, while the evaluation and test prompts end right after the equals sign.
4. Wraps the train and eval data in the Dataset class from Hugging Face (https://huggingface.co/docs/datasets/index). Note that eval_data is built from the raw evaluation rows, before they are converted to prompts.

The commented-out lines keep an earlier variant that read a two-column (sentiment, text) file, drew 300 train and 300 test rows per sentiment, shuffled the train data in a replicable order (random_state=10), and sampled the remaining rows with replacement to obtain a 50/50/50 evaluation set.

This prepares, in a single cell, the train_data and eval_data datasets, together with the X_eval and X_test prompts and their labels (y_eval and y_true), to be used in our fine-tuning.
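
To make the partition-based split concrete, here is a small sketch on made-up data. It assumes only the column names that the notebook's code uses (headline, Signal, Partition); the rows themselves are invented, and `Series.map` is used here for the label mapping.

```python
import pandas as pd

# Toy data with the column names used in this notebook (values are made up).
toy = pd.DataFrame({
    "headline": ["A beats estimates", "B cuts outlook", "C wins contract",
                 "D recalls product", "E raises dividend", "F misses revenue"],
    "Signal": [1, 0, 1, 0, 1, 0],
    "Partition": ["Train", "Train", "Train", "Test", "Inference", "Inference"],
})
toy["Signal"] = toy["Signal"].map({1: "buy", 0: "sell"})

# One DataFrame per partition, each with a fresh 0..n-1 index.
splits = {
    name: part.reset_index(drop=True)
    for name, part in toy.groupby("Partition")
}

for name, part in splits.items():
    balance = part["Signal"].value_counts(normalize=True).to_dict()
    print(name, len(part), balance)
```

Printing the label balance per partition is a cheap check that no split is dominated by a single class.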

In [9]:
# from google.colab import files

# Prompt for file upload (will open a dialog box)
# uploaded = files.upload()

In [19]:
filename = "firas_lm_dataset.csv"

df = pd.read_csv(filename,
                 header = 0,
                 encoding="utf-8", encoding_errors="replace", compression = 'gzip')
df.Signal = df.Signal.replace({1:'buy', 0:'sell'})

df.reset_index(drop = 1, inplace = True)
# df = pd.read_csv(filename,
#                  names=["sentiment","text"],
#                  encoding="utf-8", encoding_errors="replace")
# X_train = list()
# X_test = list()
# for sentiment in ["positive", "neutral", "negative"]:
#     train, test  = train_test_split(df[df.sentiment==sentiment],
#                                     train_size=300,
#                                     test_size=300,
#                                     random_state=42)
#     X_train.append(train)
#     X_test.append(test)

# X_train = pd.concat(X_train).sample(frac=1, random_state=10)
# X_test = pd.concat(X_test)


X_train = df.loc[df.Partition == 'Train'].reset_index(drop = 1)
X_eval = df.loc[df.Partition == 'Test'].reset_index(drop = 1)
X_test = df.loc[df.Partition == 'Inference'].reset_index(drop = 1)

# eval_idx = [idx for idx in df.index if idx not in list(train.index) + list(test.index)]
# X_eval = df[df.index.isin(eval_idx)]
# X_eval = (X_eval
#           .groupby('sentiment', group_keys=False)
#           .apply(lambda x: x.sample(n=50, random_state=10, replace=True)))
# X_train = X_train.reset_index(drop=True)

def generate_prompt(data_point):
    return f"""
            Analyze the signal of the news headlines enclosed in square brackets,
            determine if it is buy or sell, and return the answer as
            the corresponding signal label "buy" or "sell"

            [{data_point["headline"]}] = {data_point["Signal"]}
            """.strip()

def generate_test_prompt(data_point):
    return f"""
            Analyze the signal of the news headlines enclosed in square brackets,
            determine if it is buy or sell, and return the answer as
            the corresponding signal label "buy" or "sell"

            [{data_point["headline"]}] =

            """.strip()

X_train = pd.DataFrame(X_train.apply(generate_prompt, axis=1),
                       columns=["headline"])

train_data = Dataset.from_pandas(X_train)
eval_data = Dataset.from_pandas(X_eval)

y_eval = X_eval.Signal
X_eval = pd.DataFrame(X_eval.apply(generate_test_prompt, axis=1),
                      columns=["headline"])

y_true = X_test.Signal
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["headline"])



In [20]:
X_train.iloc[56][0]

'Analyze the signal of the news headlines enclosed in square brackets,\n            determine if it is buy or sell, and return the answer as\n            the corresponding signal label "buy" or "sell"\n\n            [Agilent-Next-generation Sequencing Advancements in the Diagnostic Lab NYSE ORDER IMBALANCE <A.N> 65842.0 SHARES ON SELL SIDE] = sell'
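
The output above shows that the indentation inside the triple-quoted f-string ends up in the prompt itself (the runs of spaces after each `\n`). If that is not wanted, one option is to dedent the instruction text before formatting. This is a sketch of a variant, not the prompt the notebook actually trained on; `INSTRUCTION` and `build_prompt` are illustrative names, and the same builder would have to be used for both training and inference prompts.

```python
import textwrap

INSTRUCTION = textwrap.dedent("""\
    Analyze the signal of the news headlines enclosed in square brackets,
    determine if it is buy or sell, and return the answer as
    the corresponding signal label "buy" or "sell"
    """).strip()

def build_prompt(headline, signal=None):
    """Return the training prompt (with signal) or the test prompt (without)."""
    answer = "" if signal is None else f" {signal}"
    return f"{INSTRUCTION}\n\n[{headline}] ={answer}"

print(repr(build_prompt("Example headline", "sell")))
```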

Next we create a function to evaluate the predictions of the model. The function performs the following steps:

1. Maps the labels to a numerical representation, where 1 represents buy and 0 represents sell. A prediction of none (no label found in the model output) is also mapped to 0, and any other unexpected value is mapped to 1.
2. Calculates the accuracy of the model on the test data.
3. Generates an accuracy report for each label.
4. Generates a classification report for the model.
5. Generates a confusion matrix for the model.

In [21]:
def evaluate(y_true, y_pred):
    labels = ['buy', 'sell']
    mapping = {'buy': 1, 'sell': 0, 'none':0}
    def map_func(x):
        return mapping.get(x, 1)

    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)

    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f'Accuracy: {accuracy:.3f}')

    # Generate accuracy report
    unique_labels = set(y_true)  # Get unique labels

    for label in unique_labels:
        label_indices = [i for i in range(len(y_true))
                         if y_true[i] == label]
        label_y_true = [y_true[i] for i in label_indices]
        label_y_pred = [y_pred[i] for i in label_indices]
        accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {label}: {accuracy:.3f}')

    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred)
    print('\nClassification Report:')
    print(class_report)

    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred)
    print('\nConfusion Matrix:')
    print(conf_matrix)

Next we need to take care of the model. The code loads google/gemma-2b from the Hugging Face Hub and quantizes it; the 7B and instruction-tuned checkpoints appear as commented-out alternatives.

Model loading and quantization:

* First the code sets the checkpoint name to google/gemma-2b on the Hugging Face Hub.
* Then the code gets the float16 data type from the torch library. This is the data type that will be used for the computations.
* Next, it creates a BitsAndBytesConfig object with the following settings:
    1. load_in_4bit: Load the model weights in 4-bit format.
    2. bnb_4bit_quant_type: Use the "nf4" quantization type. 4-bit NormalFloat (NF4) is a data type that is information-theoretically optimal for normally distributed weights.
    3. bnb_4bit_compute_dtype: Use the float16 data type for computations.
    4. bnb_4bit_use_double_quant: Set to False, so double quantization is not used (when enabled, it reduces the average memory footprint by also quantizing the quantization constants, saving roughly an additional 0.4 bits per parameter).
* Then the code creates an AutoModelForCausalLM object from the pre-trained checkpoint, using the BitsAndBytesConfig object for quantization.
* After that, the code disables the key/value cache for the model (use_cache = False).
* Finally the code sets pretraining_tp to 1. This field is not a token probability: in Llama-style configurations it records the tensor-parallelism degree used during pre-training, and a value of 1 selects the standard computation.

Tokenizer loading:

* The code loads the tokenizer that matches the model checkpoint.
* This cell does not change the padding token or the padding side, so the tokenizer defaults are used. Setting the padding token to the end-of-sequence (EOS) token and the padding side to "left" is a common choice when batching prompts for generation, but it is not done here.
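
To get a feel for why 4-bit loading matters on a laptop GPU, the weight storage can be estimated with simple arithmetic. This is a rough sketch: the parameter count is derived from figures reported later in this notebook, and the estimate ignores the quantization constants and any layers that stay in higher precision (such as the embeddings), so the real 4-bit footprint will be larger than the number printed here. The `weight_memory_gib` helper is an illustrative name.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate storage for n_params weights at the given bit width, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

# Base-model parameters: the total reported later in this notebook
# (2,584,619,008) minus the 78,446,592 LoRA parameters added by PEFT.
N_PARAMS = 2_584_619_008 - 78_446_592

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```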

In [54]:
from huggingface_hub import notebook_login
# import os
# from google.colab import userdata
os.environ["HF_TOKEN"] = "<YOUR_HF_TOKEN>"  # token redacted; never hard-code real access tokens
# notebook_login()

In [23]:
model_name = "google/gemma-2b"
# model_id = "google/gemma-7b-it"
# model_id = "google/gemma-7b"
# model_id = "google/gemma-2b-it"
# model_id = "google/gemma-2b"

compute_dtype = getattr(torch, "float16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, #GPU
    # llm_int4_enable_fp32_cpu_offload=True, #CPU
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name)

Loading checkpoint shards: 100%|█████| 2/2 [00:14<00:00,  7.22s/it]


In the next cell, we define a function that predicts the signal of a news headline using the loaded Gemma model. The function takes three arguments:

* X_test: A Pandas DataFrame whose headline column contains the prompts to be completed.
* model: The pre-trained Gemma language model.
* tokenizer: The tokenizer for that model.

The function works as follows:

1. For each prompt in the DataFrame:
    * Tokenize the prompt and move the tensors to the GPU.
    * Call model.generate() to produce the continuation of the prompt.
    * Decode the output and extract the predicted label from the text after the last equals sign.
    * Append the predicted label to the y_pred list.
2. Return the y_pred list.

The max_new_tokens argument limits generation to a single new token, so the answer is whatever token the model places right after the equals sign. The temperature argument controls the randomness of sampled text: a lower temperature produces more predictable text, while a higher temperature produces more varied text. It only takes effect when sampling is enabled; unless the generation configuration enables sampling, generate() decodes greedily.

The if statement checks whether the answer contains the word "buy". If it does, the predicted label is "buy". Otherwise, it checks whether the answer contains the word "sell", in which case the predicted label is "sell". If neither word is found, the prediction is recorded as "none".

In [24]:
def predict(X_test, model, tokenizer):
    y_pred = []
    for i in tqdm(range(len(X_test))):
        prompt = X_test.iloc[i]["headline"]
        input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = model.generate(**input_ids, max_new_tokens=1, temperature=0.0)
        result = tokenizer.decode(outputs[0])
        answer = result.split("=")[-1].lower()
        if "buy" in answer:
            y_pred.append("buy")
        elif "sell" in answer:
            y_pred.append("sell")
        # elif "hold" in answer:
        #     y_pred.append("hold")
        else:
            y_pred.append("none")
    return y_pred
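
The answer-parsing step of predict() can be tried in isolation, without a model or GPU. The sketch below repeats the same split-and-match logic in a helper named `extract_label` (a name introduced here) and runs it on a few invented strings, including the cases that end up as "none".

```python
def extract_label(generated_text):
    """Map decoded model output to 'buy', 'sell', or 'none'."""
    answer = generated_text.split("=")[-1].lower()  # text after the last '='
    if "buy" in answer:
        return "buy"
    if "sell" in answer:
        return "sell"
    return "none"

examples = [
    "[Company X raises guidance] = buy",
    "[Company Y cuts dividend] = Sell",
    "[Company Z files report] = the",
    "[EPS = 1.20 vs 1.10 expected] =",
]
for text in examples:
    print(repr(text), "->", extract_label(text))
```

Because only the text after the last equals sign is inspected, an equals sign inside the headline does not affect the result as long as the prompt ends with its own equals sign.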

At this point, we are ready to test the base Gemma model and see how it performs on our problem without any fine-tuning. This gives us insight into the model itself and establishes a baseline.

In [25]:
y_pred = predict(X_test, model, tokenizer)

100%|██████████████████████████| 2860/2860 [11:09<00:00,  4.27it/s]


In the following cell, we evaluate the results. The baseline is poor: overall accuracy is 0.530 and the macro-averaged F1 score is 0.50, which is below the 0.608 that always predicting the majority label (buy, 1739 of 2860 examples) would achieve.

In [26]:
evaluate(y_true, y_pred)

Accuracy: 0.530
Accuracy for label 0: 0.378
Accuracy for label 1: 0.629

Classification Report:
              precision    recall  f1-score   support

           0       0.40      0.38      0.39      1121
           1       0.61      0.63      0.62      1739

    accuracy                           0.53      2860
   macro avg       0.50      0.50      0.50      2860
weighted avg       0.53      0.53      0.53      2860


Confusion Matrix:
[[ 424  697]
 [ 646 1093]]
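
The headline numbers in this report can be re-derived from the confusion matrix alone, which is a useful sanity check when reading such output. The sketch below assumes the scikit-learn layout (true labels in rows, predicted labels in columns); `summarize` is an illustrative helper, not part of the notebook.

```python
def summarize(conf_matrix):
    """Derive accuracy, per-label recall, and the majority-class baseline
    from a confusion matrix with true labels in rows and predictions in columns."""
    total = sum(sum(row) for row in conf_matrix)
    correct = sum(conf_matrix[i][i] for i in range(len(conf_matrix)))
    recalls = [row[i] / sum(row) for i, row in enumerate(conf_matrix)]
    majority_baseline = max(sum(row) for row in conf_matrix) / total
    return correct / total, recalls, majority_baseline

# Baseline confusion matrix printed above (label 0 = sell, label 1 = buy).
accuracy, recalls, baseline = summarize([[424, 697], [646, 1093]])
print(f"accuracy={accuracy:.3f}")
print("per-label accuracy (recall):", [f"{r:.3f}" for r in recalls])
print(f"majority-class baseline={baseline:.3f}")
```

The "accuracy for label" lines printed by evaluate() are per-label recalls, which is why they match the recall column of the classification report.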


In the next cells we set everything up for fine-tuning. We configure and initialize a Supervised Fine-tuning Trainer (SFTTrainer) for training a large language model using a Parameter-Efficient Fine-Tuning (PEFT) method, which should save time as it operates on a reduced number of parameters compared to the model's overall size. The PEFT method focuses on refining a limited set of (additional) model parameters, while keeping the majority of the pre-trained LLM parameters fixed. This significantly reduces both computational and storage expenses. Additionally, this strategy helps mitigate catastrophic forgetting, which often occurs during the complete fine-tuning of LLMs.

PEFTConfig:

The peft_config object specifies the parameters for PEFT. The following are some of the most important parameters:

* lora_alpha: The scaling factor for the LoRA update; the update is scaled by lora_alpha / r.
* lora_dropout: The dropout probability applied in the LoRA layers.
* r: The rank of the LoRA update matrices.
* bias: Which bias parameters to train. The possible values are none, all, and lora_only.
* task_type: The type of task that the model is being trained for, for example CAUSAL_LM, SEQ_CLS, or SEQ_2_SEQ_LM.
* target_modules: The names of the modules to which the LoRA adapters are applied.

TrainingArguments:

The training_arguments object specifies the parameters for training the model. The following are some of the most important parameters:

* output_dir: The directory where the training logs and checkpoints will be saved.
* num_train_epochs: The number of epochs to train the model for.
* per_device_train_batch_size: The number of samples in each batch on each device.
* gradient_accumulation_steps: The number of batches to accumulate gradients before updating the model parameters.
* optim: The optimizer to use for training the model.
* save_steps: The number of steps after which to save a checkpoint.
* logging_steps: The number of steps after which to log the training metrics.
* learning_rate: The learning rate for the optimizer.
* weight_decay: The weight decay parameter for the optimizer.
* fp16: Whether to use 16-bit floating-point precision.
* bf16: Whether to use BFloat16 precision.
* max_grad_norm: The maximum gradient norm.
* max_steps: The maximum number of steps to train the model for. A positive value overrides num_train_epochs.
* warmup_ratio: The proportion of the training steps to use for warming up the learning rate.
* group_by_length: Whether to group the training samples by length.
* lr_scheduler_type: The type of learning rate scheduler to use.
* report_to: The tools to report the training metrics to.
* evaluation_strategy: The strategy for evaluating the model during training.

SFTTrainer:

The SFTTrainer is a trainer class from the TRL library. It is used here to run supervised fine-tuning of a large language model together with a PEFT configuration.

The SFTTrainer object is initialized with the following arguments:

* model: The model to be trained.
* train_dataset: The training dataset.
* eval_dataset: The evaluation dataset (not passed in this notebook, where evaluation during training is disabled).
* peft_config: The PEFT configuration.
* dataset_text_field: The name of the text field in the dataset.
* tokenizer: The tokenizer to use.
* args: The training arguments.
* packing: Whether to pack the training samples.
* max_seq_length: The maximum sequence length.

Once the SFTTrainer object is initialized, it can be used to train the model by calling the train() method.

In [27]:
## Apply Lora  
# Here comes the magic with peft! Let's load a PeftModel and specify that we are going to use low-rank adapters (LoRA) 
# using get_peft_model utility function and  the prepare_model_for_kbit_training method from PEFT.

from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [28]:
print(model)

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=16384, out_features=2048, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
     

In [29]:
import bitsandbytes as bnb
def find_all_linear_names(model):
  cls = bnb.nn.Linear4bit #if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
  lora_module_names = set()
  for name, module in model.named_modules():
    if isinstance(module, cls):
      names = name.split('.')
      lora_module_names.add(names[0] if len(names) == 1 else names[-1])
  if 'lm_head' in lora_module_names: # needed for 16-bit
    lora_module_names.remove('lm_head')
  return list(lora_module_names)

In [30]:
modules = find_all_linear_names(model)
print(modules)

['down_proj', 'v_proj', 'gate_proj', 'q_proj', 'k_proj', 'o_proj', 'up_proj']


In [31]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules,
)

In [32]:
from peft import get_peft_model


model = get_peft_model(model, peft_config)

In [33]:
trainable, total = model.get_nb_trainable_parameters()
print(f"Trainable: {trainable} | total: {total} | Percentage: {trainable/total*100:.4f}%")

Trainable: 78446592 | total: 2584619008 | Percentage: 3.0351%
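
The trainable-parameter count reported above can be cross-checked by hand. For a linear layer with `in_features` inputs and `out_features` outputs, a rank-`r` LoRA adapter adds two matrices with `r * in_features` and `out_features * r` entries. The sketch below plugs in the layer shapes from the `print(model)` output and the values from `LoraConfig`; the constant and function names are illustrative.

```python
# (in_features, out_features) of each adapted layer, read from print(model).
LAYER_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384),
    "down_proj": (16384, 2048),
}
NUM_LAYERS = 18  # "(0-17): 18 x GemmaDecoderLayer"
RANK = 64        # r in LoraConfig
LORA_ALPHA = 16  # lora_alpha in LoraConfig

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in LAYER_SHAPES.values())
trainable = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / RANK

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling (lora_alpha / r): {scaling}")
```

The result agrees with the 78446592 trainable parameters reported by get_nb_trainable_parameters(), and the scaling of 0.25 shows that this configuration down-weights the adapter update relative to a setup where lora_alpha equals r.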


In [34]:
training_arguments = TrainingArguments(
    output_dir="logs",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim="paged_adamw_8bit",
    save_steps=0,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=100,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="tensorboard",
    do_eval=False,
    evaluation_strategy="no", save_strategy="epoch"
)
torch.cuda.empty_cache()
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    peft_config=peft_config,
    dataset_text_field="headline",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
    max_seq_length=1024,
)

Map: 100%|███████| 101900/101900 [00:07<00:00, 12802.52 examples/s]
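
A quick back-of-the-envelope calculation shows how much of the training set this configuration actually visits. The values are copied from the TrainingArguments cell and the progress bar above; a single GPU is assumed, as reported earlier in the notebook.

```python
PER_DEVICE_BATCH = 1      # per_device_train_batch_size
GRAD_ACCUM_STEPS = 8      # gradient_accumulation_steps
NUM_DEVICES = 1           # assumption: the single GPU reported earlier
MAX_STEPS = 100           # max_steps
TRAIN_EXAMPLES = 101_900  # size of the mapped training set

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
examples_seen = effective_batch * MAX_STEPS
fraction_of_epoch = examples_seen / TRAIN_EXAMPLES

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {MAX_STEPS} steps: {examples_seen}")
print(f"fraction of one epoch: {fraction_of_epoch:.2%}")
```

With max_steps capping the run, the model sees well under one percent of a single epoch, which is worth keeping in mind when reading the evaluation results.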


In [35]:
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'True'

The following code trains the model using the trainer.train() method. Because max_steps is set to 100, training stops after 100 optimizer steps. The adapter weights are saved in a later cell to the gemma2b-Finetune-test directory.

In [36]:
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
# Train model
trainer.train()

Step,Training Loss
25,2.1589
50,0.9454
75,1.7928


ConnectionError: (ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)), '(Request ID: 6c18c02a-9acf-4896-81f4-f82fbb137331)')

Afterwards, you can load the TensorBoard extension and start TensorBoard, pointing it to the logs/runs directory, which is assumed to contain the training logs for your model. This lets you inspect how the model fits during training.

In [None]:
# %load_ext tensorboard
# %tensorboard --logdir logs/runs

## Share adapters on the 🤗 Hub

In [41]:
trainer.model.save_pretrained("gemma2b-Finetune-test")

In [42]:
model.config.use_cache = True

In [46]:
new_model = "gemma2b-Finetune-test"

In [None]:
model_name = "google/gemma-2b"
# model_id = "google/gemma-7b-it"
# model_id = "google/gemma-7b"
# model_id = "google/gemma-2b-it"
# model_id = "google/gemma-2b"

compute_dtype = getattr(torch, "float16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, #GPU
    # llm_int4_enable_fp32_cpu_offload=True, #CPU
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name)

In [51]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)

merged_model= PeftModel.from_pretrained(base_model, new_model)
merged_model= merged_model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model",safe_serialization=True)
tokenizer.save_pretrained("merged_model")

Loading checkpoint shards: 100%|█████| 2/2 [00:03<00:00,  1.87s/it]


('merged_model\\tokenizer_config.json',
 'merged_model\\special_tokens_map.json',
 'merged_model\\tokenizer.json')

In [55]:
os.environ["HF_TOKEN"] = "<YOUR_HF_TOKEN>"  # token redacted; never hard-code real access tokens

In [107]:
# Push the model and tokenizer to the Hugging Face Model Hub
merged_model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

ConnectionError: (ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)), '(Request ID: b43bea7d-e03a-43c3-a6fb-bd4c6ce0c775)')

The following code first predicts the labels for the test set using the predict() function and then evaluates the predictions using the evaluate() function. In this run the fine-tuned model does not improve on the baseline: accuracy is 0.473 on the Inference partition (down from 0.530 before fine-tuning) and 0.514 on the Test partition, both close to chance. Training was capped at 100 optimizer steps of 8 examples each, out of 101,900 training examples, so these numbers are better read as a check that the pipeline runs end to end than as a measure of what fine-tuning can achieve.

### Model Evaluation

```python
X_train = df.loc[df.Partition == 'Train'].reset_index(drop = 1)
X_eval = df.loc[df.Partition == 'Test'].reset_index(drop = 1)
X_test = df.loc[df.Partition == 'Inference'].reset_index(drop = 1)
```

In [57]:
y_pred = predict(X_test, merged_model, tokenizer)
evaluate(y_true, y_pred)

100%|██████████████████████████| 2860/2860 [59:53<00:00,  1.26s/it]

Accuracy: 0.473
Accuracy for label 0: 0.651
Accuracy for label 1: 0.359

Classification Report:
              precision    recall  f1-score   support

           0       0.40      0.65      0.49      1121
           1       0.61      0.36      0.45      1739

    accuracy                           0.47      2860
   macro avg       0.51      0.51      0.47      2860
weighted avg       0.53      0.47      0.47      2860


Confusion Matrix:
[[ 730  391]
 [1115  624]]





In [58]:
y_pred2 = predict(X_eval, merged_model, tokenizer)
evaluate(y_eval, y_pred2)

100%|████████████████████████| 5662/5662 [2:02:13<00:00,  1.30s/it]

Accuracy: 0.514
Accuracy for label 0: 0.622
Accuracy for label 1: 0.391

Classification Report:
              precision    recall  f1-score   support

           0       0.54      0.62      0.58      3009
           1       0.48      0.39      0.43      2653

    accuracy                           0.51      5662
   macro avg       0.51      0.51      0.50      5662
weighted avg       0.51      0.51      0.51      5662


Confusion Matrix:
[[1872 1137]
 [1615 1038]]





```python
import torch
from tqdm import tqdm


def predict_proba(X_test, model, tokenizer):
    """For each prompt, return the label of the most likely next token and its probability."""
    y_pred = []
    for i in tqdm(range(len(X_test))):
        prompt = X_test.iloc[i]["headline"]
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
        # Instead of generating, run one forward pass and read the logits directly
        with torch.no_grad():
            outputs = model(input_ids)
        # Softmax over the vocabulary at the last position gives the next-token distribution
        probs = torch.nn.functional.softmax(outputs.logits[0, -1], dim=-1)
        next_token_id = torch.argmax(probs).item()     # ID of the most likely next token
        next_token_prob = probs[next_token_id].item()  # probability of that token

        answer = tokenizer.decode([next_token_id])     # convert the ID back to text

        # Map the decoded token to a label; anything else is recorded as "none"
        if "buy" in answer:
            y_pred.append({"buy": next_token_prob})
        elif "sell" in answer:
            y_pred.append({"sell": next_token_prob})
        else:
            y_pred.append({"none": next_token_prob})
    return y_pred
```
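
Note that the probability returned by `predict_proba` is taken over the whole vocabulary, so it is not directly a "buy versus sell" probability. One possible refinement, sketched here as an assumption rather than as part of the original notebook, is to keep only the logits of the candidate answer tokens and renormalize them. The helper name `label_probabilities` and the example logit values are hypothetical; in practice the logits would be read from the last position of `outputs.logits` at the token IDs of the answer words.

```python
import math


def label_probabilities(label_logits):
    """Softmax restricted to the candidate labels.

    label_logits maps each label to the logit of its answer token at the last position.
    """
    # Subtract the maximum logit for numerical stability
    max_logit = max(label_logits.values())
    exps = {label: math.exp(logit - max_logit) for label, logit in label_logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hypothetical logits for the two answer tokens
probs = label_probabilities({"buy": 1.2, "sell": 0.8})
print(probs)
```

With two labels the values sum to one, so a value near 0.5 reads directly as an undecided model.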


```python
y_pred_proba = predict_proba(X_test, merged_model, tokenizer)
```

```
100%|████████████████████████| 2860/2860 [1:09:31<00:00,  1.46s/it]
```

```python
y_pred_proba[:5]
```

```
[{'sell': 0.498291015625},
 {'buy': 0.56005859375},
 {'sell': 0.51220703125},
 {'buy': 0.505859375},
 {'sell': 0.51806640625}]
```

```python
y_pred2_proba = predict_proba(X_eval, merged_model, tokenizer)
```

```
100%|████████████████████████| 5662/5662 [2:39:52<00:00,  1.69s/it]
```

```python
y_pred2_proba[:5]
```

```
[{'sell': 0.6123046875},
 {'sell': 0.50732421875},
 {'sell': 0.57177734375},
 {'buy': 0.5537109375},
 {'buy': 0.505859375}]
```

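```python
# Sanity check: the labels from predict_proba agree with those from predict.
# analyze_infer_data and analyze_test_data are the DataFrames built in the cells that follow.
all(np.array([list(d.keys())[0] for d in y_pred_proba]) == analyze_infer_data.y_pred.values)
```

```
True
```

```python
all(np.array([list(d.keys())[0] for d in y_pred2_proba]) == analyze_test_data.y_pred.values)
```

```
True
```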

```python
# Collect length, labels and probabilities for the 'Test' partition in one DataFrame
analyze_test_data = pd.DataFrame()
analyze_test_data['n_Characters'] = df.loc[df.Partition == 'Test', 'n_Characters'].reset_index(drop=True).values
analyze_test_data['y_real'] = y_eval
analyze_test_data['y_pred'] = y_pred2
analyze_test_data['Prediction_Probability'] = [list(d.values())[0] for d in y_pred2_proba]
analyze_test_data['Prediction_Match_Rate'] = analyze_test_data.y_real == analyze_test_data.y_pred
analyze_test_data.head()
```

| | n_Characters | y_real | y_pred | Prediction_Probability | Prediction_Match_Rate |
|---|---|---|---|---|---|
| 0 | 1441 | buy | sell | 0.612305 | False |
| 1 | 278 | buy | sell | 0.507324 | False |
| 2 | 78 | sell | sell | 0.571777 | True |
| 3 | 159 | buy | buy | 0.553711 | True |
| 4 | 53 | sell | buy | 0.505859 | False |


```python
# Same structure for the 'Inference' partition
analyze_infer_data = pd.DataFrame()
analyze_infer_data['n_Characters'] = df.loc[df.Partition == 'Inference', 'n_Characters'].reset_index(drop=True).values
analyze_infer_data['y_real'] = y_true
analyze_infer_data['y_pred'] = y_pred
analyze_infer_data['Prediction_Probability'] = [list(d.values())[0] for d in y_pred_proba]
analyze_infer_data['Prediction_Match_Rate'] = analyze_infer_data.y_real == analyze_infer_data.y_pred
analyze_infer_data.head()
```

| | n_Characters | y_real | y_pred | Prediction_Probability | Prediction_Match_Rate |
|---|---|---|---|---|---|
| 0 | 33 | sell | sell | 0.498291 | True |
| 1 | 149 | buy | buy | 0.560059 | True |
| 2 | 164 | buy | sell | 0.512207 | False |
| 3 | 104 | sell | buy | 0.505859 | False |
| 4 | 54 | sell | sell | 0.518066 | True |


```python
analyze_test_data.describe(include='all').T
```

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| n_Characters | 5662.0 | | | | 475.433592 | 801.460538 | 10.0 | 81.0 | 193.0 | 467.0 | 8863.0 |
| y_real | 5662.0 | 2.0 | sell | 3009.0 | | | | | | | |
| y_pred | 5662.0 | 3.0 | sell | 3470.0 | | | | | | | |
| Prediction_Probability | 5662.0 | | | | 0.540809 | 0.051875 | 0.071228 | 0.509766 | 0.529297 | 0.560059 | 0.90332 |
| Prediction_Match_Rate | 5662.0 | 2.0 | True | 2901.0 | | | | | | | |

```python
analyze_infer_data.describe(include='all').T
```

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| n_Characters | 2860.0 | | | | 371.121678 | 630.572034 | 14.0 | 72.0 | 157.5 | 358.0 | 7649.0 |
| y_real | 2860.0 | 2.0 | buy | 1739.0 | | | | | | | |
| y_pred | 2860.0 | 3.0 | sell | 1841.0 | | | | | | | |
| Prediction_Probability | 2860.0 | | | | 0.539052 | 0.047911 | 0.107117 | 0.510254 | 0.527832 | 0.557617 | 0.911133 |
| Prediction_Match_Rate | 2860.0 | 2.0 | False | 1508.0 | | | | | | | |


```python
analyze_test_data.y_pred.unique()
```

```
array(['sell', 'buy', 'none'], dtype=object)
```

```python
analyze_test_data.groupby(['y_real','Prediction_Match_Rate'])['n_Characters'].describe()
```

| y_real | Prediction_Match_Rate | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| buy | False | 1615.0 | 456.767802 | 629.262683 | 17.0 | 74.0 | 227.0 | 596.0 | 8863.0 |
| buy | True | 1038.0 | 471.55684 | 931.755951 | 14.0 | 93.25 | 166.0 | 298.0 | 7154.0 |
| sell | False | 1146.0 | 538.075916 | 1129.321453 | 16.0 | 94.0 | 160.5 | 295.0 | 7645.0 |
| sell | True | 1863.0 | 455.241009 | 582.026518 | 10.0 | 76.0 | 232.0 | 599.5 | 5529.0 |

```python
analyze_infer_data.groupby(['y_real','Prediction_Match_Rate'])['n_Characters'].describe()
```

| y_real | Prediction_Match_Rate | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| buy | False | 1115.0 | 374.183857 | 550.151772 | 14.0 | 61.0 | 157.0 | 440.0 | 6779.0 |
| buy | True | 624.0 | 352.304487 | 727.342665 | 22.0 | 88.75 | 153.0 | 243.5 | 6427.0 |
| sell | False | 393.0 | 413.92112 | 906.439914 | 20.0 | 91.0 | 160.0 | 259.0 | 7649.0 |
| sell | True | 728.0 | 359.456044 | 449.534405 | 18.0 | 63.75 | 162.5 | 467.75 | 3632.0 |


```python
analyze_test_data.groupby(['y_real','Prediction_Match_Rate'])['Prediction_Probability'].describe()
```

| y_real | Prediction_Match_Rate | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| buy | False | 1615.0 | 0.545717 | 0.05291 | 0.071228 | 0.51416 | 0.534668 | 0.567383 | 0.855957 |
| buy | True | 1038.0 | 0.532869 | 0.04923 | 0.18457 | 0.505859 | 0.521973 | 0.546875 | 0.90332 |
| sell | False | 1146.0 | 0.529743 | 0.055154 | 0.096497 | 0.503906 | 0.519531 | 0.54541 | 0.859863 |
| sell | True | 1863.0 | 0.547786 | 0.048523 | 0.357178 | 0.51416 | 0.534668 | 0.570312 | 0.784668 |
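
The grouped statistics show that prediction probability barely separates correct from incorrect predictions: the group means all sit at roughly 0.53 to 0.55. A complementary view is the match rate above and below a confidence threshold. The sketch below is illustrative only: the helper `match_rate_by_confidence` and the 0.55 threshold are assumptions, and the five rows are the ones displayed by `analyze_test_data.head()`, which is far too few to draw conclusions from.

```python
def match_rate_by_confidence(rows, threshold):
    """Split (probability, is_match) pairs at a threshold and return the match rate of each side."""
    high = [match for prob, match in rows if prob >= threshold]
    low = [match for prob, match in rows if prob < threshold]

    def rate(matches):
        # Fraction of correct predictions, or None when the bucket is empty
        return sum(matches) / len(matches) if matches else None

    return {"high": rate(high), "low": rate(low)}


# (Prediction_Probability, Prediction_Match_Rate) for the first five test rows
rows = [
    (0.612305, False),
    (0.507324, False),
    (0.571777, True),
    (0.553711, True),
    (0.505859, False),
]
rates = match_rate_by_confidence(rows, threshold=0.55)
print(rates)
```

On the full DataFrame the same pairs could be taken from the `Prediction_Probability` and `Prediction_Match_Rate` columns.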


### Soft Inference

```python
index = np.random.randint(0, 2000)

print(X_test.iloc[index, 0])
print('Real Label:', y_true[index])

# Tokenize the prompt and greedily generate a single new token
inputs = tokenizer(X_test.iloc[index, 0], return_tensors="pt").to("cuda")
print(tokenizer.decode(merged_model.generate(**inputs, max_new_tokens=1, do_sample=False)[0]))
```

```
Analyze the signal of the news headlines enclosed in square brackets,
            determine if it is buy or sell, and return the answer as
            the corresponding signal label "buy" or "sell"

            [Sermonix Doses First Patient in Phase 2 Clinical Trial Collaboration Studying Lasofoxifene in Combination With Eli Lilly and Company’s Abemaciclib Eli Lilly and Company - Mirikizumab Shows Continued Symptom Improvement and Reduction of Intestinal Inflammation in Patients with Crohn's Disease i... ELI LILLY ANTIBODY TRIAL IS PAUSED BECAUSE OF POTENTIAL SAFETY- NYT REPORTER GOVT-SPONSORED CLINICAL TRIAL TESTING ANTIBODY TREATMENT MADE BY ELI LILLY HAS BEEN PAUSED OVER “POTENTIAL SAFETY CONCERN” - NYT BRIEF-Eli Lilly Antibody Trial Is Paused Because Of Potential Safety Concern- NYT ELI LILLY <LLY.N> SAYS CLINICAL TRIAL OF ITS COVID-19 ANTIBODY TREATMENT PAUSED FOR SAFETY CONCERN ELI LILLY <LLY.N> SAYS INDEPENDENT DATA SAFETY MONITORING BOARD RECOMMENDED A PAUSE IN ENROLLMENT IN AC
```

```python
input_ids = tokenizer(X_test.iloc[index, 0], return_tensors="pt").input_ids.to("cuda")

# Instead of using generate, run a forward pass to get the logits
with torch.no_grad():
    output = merged_model(input_ids)
# Softmax over the vocabulary turns the logits into probabilities
probs = torch.nn.functional.softmax(output.logits, dim=-1)
# ID of the most likely next token (the distribution at the last position)
next_token_id = torch.argmax(probs[0][-1]).item()
# Convert the ID back to a token
next_token = tokenizer.decode([next_token_id])
# Probability of that token
next_token_prob = probs[0][-1][next_token_id].item()

print(f"'{next_token}': {next_token_prob}")
```

```
' sell': 0.5576171875
```


```python
# Uncomment to inspect the token IDs that ' buy' (with its leading space) maps to
# tokenizer(' buy', return_tensors="pt").input_ids.to("cuda")
```

The following code writes the two analysis DataFrames, which hold the headline length, the true label, the predicted label and the prediction probability for each example, to CSV files. This is especially useful for understanding the errors that the fine-tuned model makes and for getting insights on how to improve the prompt.

```python
analyze_test_data.to_csv("test_predictions.csv", index=False)
analyze_infer_data.to_csv("inference_predictions.csv", index=False)
```