<a href="https://colab.research.google.com/github/balnarendrasapa/faq-llm/blob/master/Finetuning_Bloom_7b1_for_FAQ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About

- This project fine-tunes a pretrained language model (BLOOM-7b1) on a custom dataset.

- Here, the model is fine-tuned on an FAQ dataset from an e-commerce site.

- The dataset is taken from Kaggle.

- Link to the dataset: [click here](https://www.kaggle.com/datasets/saadmakhdoom/ecommerce-faq-chatbot-dataset)


#### Note:

- GPU memory in Google Colab is limited to about 15 GB, so you cannot train and evaluate the model in the same session.

- After training the model, restart the runtime to clear the GPU memory, then run the loading and evaluation cells. These load the model onto the GPU again.
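
A rough estimate shows why memory is tight. The sketch below is only a back-of-the-envelope calculation: it uses the total parameter count printed later in this notebook and counts weights only. Activations, gradients, optimizer state and CUDA overhead are ignored (an assumption), so real usage is higher.

```python
# Weight-only memory estimate. Assumption: activations, gradients and
# optimizer state are ignored, so this is a lower bound.
GIB = 1024 ** 3

def weight_memory_gib(num_params: int, bytes_per_param: float) -> float:
    """Return the memory needed to store the weights, in GiB."""
    return num_params * bytes_per_param / GIB

# "all params" as printed by print_trainable_parameters in this notebook
NUM_PARAMS = 7_076_880_384

for label, nbytes in [("int8", 1), ("fp16", 2), ("fp32", 4)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, nbytes):.1f} GiB")
```

In 8-bit the weights alone come to roughly 6.6 GiB, and fp16 doubles that, which leaves little room on a 15 GB GPU for a second copy of the model.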

# Set Up the Environment

## Install the required libraries

In [1]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git


## Ignore Warnings

In [1]:
import warnings
warnings.filterwarnings("ignore")

## Log in to Hugging Face

- Create a Hugging Face access token with write permission and enter it when prompted.

In [2]:
from huggingface_hub import notebook_login

notebook_login()


## Check for GPU

In [2]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-b84b6201-00ad-82dc-bc4d-193f275f5cce)


# Set up the model

In [1]:
import torch
import torch.nn as nn
import bitsandbytes as bnb  # backend used for 8-bit loading
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load BLOOM-7b1 with 8-bit weights so that it fits into the Colab GPU's memory
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1",
    load_in_8bit=True,
    device_map='auto',  # place the layers on the available devices automatically
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-7b1")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Freezing the original weights


In [2]:
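for param in model.parameters():
    param.requires_grad = False  # freeze the base model; only the adapters will train
    if param.ndim == 1:
        # cast 1-D parameters (e.g. layer norms) to fp32 for numerical stability
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # trade extra compute for lower activation memory
model.enable_input_require_grads()  # keep gradients flowing through the frozen embeddings

class CastOutputToFloat(nn.Sequential):
    """Wrap the LM head so that its logits are returned in fp32."""
    def forward(self, x):
        return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)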

## Setting up the LoRA adapters

In [3]:
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}")

In [4]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 7864320 || all params: 7076880384 || trainable%: 0.11112693126452029
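
The trainable-parameter count can be reproduced by hand. The sketch below rests on assumptions that are not stated in this notebook: that the adapters are attached only to the fused `query_key_value` projection of each block (the usual PEFT default for BLOOM when `target_modules` is not given), that the hidden size is 4096, and that the model has 30 transformer blocks. Each adapted linear layer gains two matrices, `A` of shape `r × in_features` and `B` of shape `out_features × r`.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)

HIDDEN = 4096      # assumed hidden size of BLOOM-7b1
NUM_LAYERS = 30    # assumed number of transformer blocks
RANK = 16          # r from the LoraConfig above

# The fused query/key/value projection maps HIDDEN -> 3 * HIDDEN
per_layer = lora_param_count(HIDDEN, 3 * HIDDEN, RANK)
total = per_layer * NUM_LAYERS

print(f"per layer: {per_layer}, total: {total}")
print(f"trainable%: {100 * total / 7_076_880_384}")
```

Under these assumptions the total matches the 7864320 trainable parameters printed above.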


# Data Preprocessing

## Download the dataset from Kaggle with the Kaggle CLI

In [8]:
# Download kaggle.json from your Kaggle account and upload it to the runtime before running this cell
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

In [9]:
!kaggle datasets download -d saadmakhdoom/ecommerce-faq-chatbot-dataset

Downloading ecommerce-faq-chatbot-dataset.zip to /content
  0% 0.00/4.30k [00:00<?, ?B/s]
100% 4.30k/4.30k [00:00<00:00, 11.1MB/s]


In [10]:
!unzip ecommerce-faq-chatbot-dataset.zip

Archive:  ecommerce-faq-chatbot-dataset.zip
  inflating: Ecommerce_FAQ_Chatbot_dataset.json  


In [5]:
import json

# Use a context manager so that the file is closed after reading
with open("Ecommerce_FAQ_Chatbot_dataset.json") as f:
    data = json.load(f)

## Separating questions and answers

In [6]:
questions = []
answers = []

for item in data["questions"]:
    questions.append(item["question"])
    answers.append(item["answer"])

In [7]:
questions[0], answers[0], data["questions"][0]

('How can I create an account?',
 "To create an account, click on the 'Sign Up' button on the top right corner of our website and follow the instructions to complete the registration process.",
 {'question': 'How can I create an account?',
  'answer': "To create an account, click on the 'Sign Up' button on the top right corner of our website and follow the instructions to complete the registration process."})

## Converting the data into a Hugging Face `Dataset` for easier training

In [8]:
from datasets import Dataset, Features, Value

dataset = Dataset.from_dict(
    {
        # ids are stored as strings to match the declared feature type
        "id": [str(i) for i in range(len(questions))],
        "questions": questions,
        "answers": answers,
    },
    features=Features({
        "id": Value(dtype="string"),
        "questions": Value(dtype="string"),
        "answers": Value(dtype="string"),
    }),
)

## Split the dataset into train and test sets

- The dataset is split into 85% train and 15% test.

In [9]:
dataset = dataset.train_test_split(test_size = 0.15)
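
The split sizes can be checked with simple arithmetic. The sketch below assumes that the test share is rounded up and the remainder goes to the training set; the helper `split_sizes` is hypothetical and not part of the `datasets` library. The total of 79 rows is taken from the progress output further down, which reports 67 training and 12 test rows.

```python
import math

def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

n_train, n_test = split_sizes(79, 0.15)
print(f"train: {n_train}, test: {n_test}")
```

Because the split is shuffled, the rows that land in each set change between runs; passing a fixed `seed` argument to `train_test_split` makes the split reproducible.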

In [10]:
def merge_columns(example):
    example["prediction"] = example["questions"] + " ->: " + example["answers"]
    return example

dataset = dataset.map(merge_columns)
dataset["train"]["prediction"][0]

Map:   0%|          | 0/67 [00:00<?, ? examples/s]

Map:   0%|          | 0/12 [00:00<?, ? examples/s]

'Can I use multiple promo codes on a single order? ->: Usually, only one promo code can be applied per order. During the checkout process, enter the promo code in the designated field to apply the discount to your order.'

In [11]:
dataset["train"][0]

{'id': '20',
 'questions': 'Can I use multiple promo codes on a single order?',
 'answers': 'Usually, only one promo code can be applied per order. During the checkout process, enter the promo code in the designated field to apply the discount to your order.',
 'prediction': 'Can I use multiple promo codes on a single order? ->: Usually, only one promo code can be applied per order. During the checkout process, enter the promo code in the designated field to apply the discount to your order.'}

In [12]:
dataset = dataset.map(lambda samples: tokenizer(samples['prediction']), batched=True)

In [13]:
print(dataset["train"][0])

{'id': '20', 'questions': 'Can I use multiple promo codes on a single order?', 'answers': 'Usually, only one promo code can be applied per order. During the checkout process, enter the promo code in the designated field to apply the discount to your order.', 'prediction': 'Can I use multiple promo codes on a single order? ->: Usually, only one promo code can be applied per order. During the checkout process, enter the promo code in the designated field to apply the discount to your order.', 'input_ids': [16454, 473, 2971, 15289, 62240, 54311, 664, 267, 10546, 7092, 34, 11953, 29, 170014, 15, 3804, 2592, 62240, 4400, 1400, 722, 25392, 604, 7092, 17, 49262, 368, 139100, 4451, 15, 14749, 368, 62240, 4400, 361, 368, 112447, 6608, 427, 22240, 368, 114785, 427, 2632, 7092, 17], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


# Train the Model

In [19]:
import transformers

trainer = transformers.Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=transformers.TrainingArguments(
        evaluation_strategy = "epoch",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=2,
        num_train_epochs=5,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.3437        | 1.222561        |
| 2     | 0.2795        | 1.149662        |
| 3     | 0.1292        | 1.243928        |
| 4     | 0.1189        | 1.290086        |
| 5     | 0.095         | 1.313331        |
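
The logged losses are worth a quick look before deciding how many epochs to train. The sketch below simply copies the numbers from the log above and picks the epoch with the lowest validation loss; the variable names are illustrative.

```python
# (epoch, training loss, validation loss), copied from the training log above
history = [
    (1, 0.3437, 1.222561),
    (2, 0.2795, 1.149662),
    (3, 0.1292, 1.243928),
    (4, 0.1189, 1.290086),
    (5, 0.095, 1.313331),
]

# Pick the row with the smallest validation loss
best_epoch, best_train, best_val = min(history, key=lambda row: row[2])
print(f"lowest validation loss {best_val} at epoch {best_epoch}")
```

The validation loss is lowest at epoch 2 and rises afterwards while the training loss keeps falling, which may indicate overfitting on this small dataset. Training for fewer epochs could be worth trying.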


TrainOutput(global_step=85, training_loss=0.17137297374360702, metrics={'train_runtime': 249.7013, 'train_samples_per_second': 1.342, 'train_steps_per_second': 0.34, 'total_flos': 613759533219840.0, 'train_loss': 0.17137297374360702, 'epoch': 5.0})
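
The `global_step=85` in the output follows from the batch settings. The sketch below assumes a single GPU and the 67 training rows reported in the preprocessing output; the helper `total_optimizer_steps` is hypothetical, and the exact rounding used by the trainer may differ between library versions (here everything divides evenly).

```python
import math

def total_optimizer_steps(n_train: int, per_device_batch: int,
                          grad_accum: int, epochs: int) -> int:
    """Estimate the number of optimizer updates on a single device."""
    batches_per_epoch = math.ceil(n_train / per_device_batch)
    # One optimizer update happens every `grad_accum` batches
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

steps = total_optimizer_steps(n_train=67, per_device_batch=2, grad_accum=2, epochs=5)
print(f"optimizer steps: {steps}")
print(f"effective batch size: {2 * 2}")
```

With a per-device batch size of 2 and 2 accumulation steps, each update sees 4 examples, giving 17 updates per epoch.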

# Upload the model to the Hugging Face Hub

In [None]:
# The repository id matches the one in the commit URL below and in the loading cell
model.push_to_hub("bnsapa/faq_llm",
                  use_auth_token=True,
                  commit_message="Trained on the dataset",
                  private=False)



adapter_model.safetensors:   0%|          | 0.00/31.5M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/bnsapa/faq_llm/commit/760808c5745c2b74a4ac9bdb361198de2fb75e79', commit_message='Trained on the dataset', commit_description='', oid='760808c5745c2b74a4ac9bdb361198de2fb75e79', pr_url=None, pr_revision=None, pr_num=None)

# Load adapters from the Hub

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "bnsapa/faq_llm"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the LoRA adapters on top of the base model
model = PeftModel.from_pretrained(model, peft_model_id)

# Example using the trained model

In [None]:
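# Use the same "question ->: " format that the model saw during training
batch = tokenizer("“how to reset my account” ->: ", return_tensors='pt')

# Generate under mixed precision, with at most 50 new tokens
with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=50)

print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))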





 “how to reset my account” ->:  “If you have forgotten your password, please enter your email address and we will send you instructions on how to reset it.” ->:  “If you have forgotten your password, please enter your email address and we will send you instructions on how
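
The model keeps generating after the first answer and starts a second ` ->: ` segment. One simple way to handle this is to post-process the decoded text. The sketch below is an assumption about how this could be done, not part of the original notebook: the helpers `build_prompt` and `extract_answer` are hypothetical, and the separator is the one used by `merge_columns` during preprocessing.

```python
SEPARATOR = " ->: "  # same separator that merge_columns used for training

def build_prompt(question: str) -> str:
    """Format a question the way the training examples were formatted."""
    return question + SEPARATOR

def extract_answer(generated: str, prompt: str) -> str:
    """Drop the echoed prompt and keep the text before the next separator."""
    text = generated[len(prompt):] if generated.startswith(prompt) else generated
    # Split on the separator without its surrounding spaces
    return text.split(SEPARATOR.strip(), 1)[0].strip()

# Decoded output copied (shortened) from the cell above
raw = (
    "“how to reset my account” ->:  “If you have forgotten your password, "
    "please enter your email address and we will send you instructions on "
    "how to reset it.” ->:  “If you have forgotten your password"
)
prompt = build_prompt("“how to reset my account”")
print(extract_answer(raw, prompt))
```

This keeps only the first answer and leaves the generation call itself unchanged.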
