# About

- This project fine-tunes a pretrained language model (`tiiuae/falcon-7b`) on a custom dataset using LoRA adapters.

- The model is fine-tuned on an FAQ dataset from an e-commerce site.

- The dataset comes from Kaggle: [Ecommerce FAQ Chatbot Dataset](https://www.kaggle.com/datasets/saadmakhdoom/ecommerce-faq-chatbot-dataset)


# Set Up the Environment

## Install the required libraries

In [1]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git

## Ignore Warnings

In [2]:
import warnings
warnings.filterwarnings("ignore")

## Check for GPU

In [3]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-7d373fdc-a55e-e47b-3344-d37065908d77)
GPU 1: Tesla T4 (UUID: GPU-941c2926-955b-0c48-3aa1-7f8c5268d546)


# Set Up the Model

In [4]:
import os
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

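# Load Falcon-7B with 8-bit weights (via bitsandbytes) and let accelerate
# place the layers across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    load_in_8bit=True,
    device_map='auto',
)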

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
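
As a rough sanity check on why 8-bit loading is used here, the sketch below estimates the memory taken by the weights alone at different precisions. The base parameter count is derived from the totals this notebook prints after the LoRA step (6,926,439,296 parameters, of which 4,718,592 belong to the adapters). The helper name `weight_memory_gib` is hypothetical, and the estimate ignores activations, optimizer state, and any modules that stay in higher precision.

```python
# Back-of-the-envelope estimate of weight memory at different precisions.
# Parameter counts come from the trainable-parameter printout in this notebook.
TOTAL_WITH_LORA = 6_926_439_296
LORA_PARAMS = 4_718_592
BASE_PARAMS = TOTAL_WITH_LORA - LORA_PARAMS


def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB (hypothetical helper)."""
    return num_params * bits_per_param / 8 / 2**30


for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8)]:
    print(f"{label}: {weight_memory_gib(BASE_PARAMS, bits):.2f} GiB")
```

Under these assumptions, half-precision weights alone would take up most of a single 16 GB T4, while 8-bit weights leave headroom for activations and the adapter's optimizer state.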

## Freezing the original weights


In [5]:
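for param in model.parameters():
    # Freeze the base model; only the LoRA adapters added later will train.
    param.requires_grad = False
    if param.ndim == 1:
        # Keep 1-D parameters (e.g. layer norms) in fp32 for numerical stability.
        param.data = param.data.to(torch.float32)

# Trade compute for memory by recomputing activations in the backward pass.
model.gradient_checkpointing_enable()
# Make the embedding outputs require grad so gradients can flow back to the
# adapters even though the base weights are frozen.
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
    """Wrap a module so that its output is cast to fp32."""

    def forward(self, x):
        return super().forward(x).to(torch.float32)

# Produce the logits (and therefore the loss) in fp32.
model.lm_head = CastOutputToFloat(model.lm_head)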

## Setting up the LoRA adapters

In [6]:
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}")

In [7]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 4718592 || all params: 6926439296 || trainable%: 0.06812435363037071
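
The reported count can be reproduced by hand. The sketch below assumes that PEFT's default for Falcon attaches LoRA only to the fused `query_key_value` projection in each of the 32 decoder layers, and that this projection maps 4544 inputs to 4672 outputs (4544 for the queries plus one shared 64-dimensional key head and value head). These dimensions are taken from the public Falcon-7B configuration rather than from this notebook, so treat them as assumptions.

```python
# Reproduce the trainable-parameter count printed above.
# Assumption: LoRA is applied only to `query_key_value` in every decoder layer.
HIDDEN_SIZE = 4544   # assumed Falcon-7B hidden size
HEAD_DIM = 64        # assumed per-head dimension (multi-query attention)
NUM_LAYERS = 32      # assumed number of decoder layers
RANK = 16            # r from the LoraConfig above
ALL_PARAMS = 6_926_439_296  # "all params" from the printout above


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA pair is A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


qkv_out = HIDDEN_SIZE + 2 * HEAD_DIM  # queries + one key head + one value head
trainable = NUM_LAYERS * lora_params(HIDDEN_SIZE, qkv_out, RANK)
trainable_pct = 100 * trainable / ALL_PARAMS

print(f"trainable params: {trainable} ({trainable_pct:.4f}% of all params)")
```

That the arithmetic lands exactly on 4,718,592 suggests the assumption about the targeted module holds for this run.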


# Dataset Preprocessing

In [8]:
import json

# Use a context manager so the file is closed after loading.
with open("/kaggle/input/ecommerce-faq-chatbot-dataset/Ecommerce_FAQ_Chatbot_dataset.json") as f:
    data = json.load(f)

## Separating questions and answers

In [9]:
questions = []
answers = []

for item in data["questions"]:
    questions.append(item["question"])
    answers.append(item["answer"])

In [10]:
questions[0], answers[0], data["questions"][0]

('How can I create an account?',
 "To create an account, click on the 'Sign Up' button on the top right corner of our website and follow the instructions to complete the registration process.",
 {'question': 'How can I create an account?',
  'answer': "To create an account, click on the 'Sign Up' button on the top right corner of our website and follow the instructions to complete the registration process."})
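
The loop above relies on the file having a top-level `questions` list whose items carry `question` and `answer` keys. A minimal, self-contained sketch of that layout, using the one record shown in the output above and an inline JSON string in place of the Kaggle file:

```python
import json

# Inline stand-in for the Kaggle JSON file; only the first record is shown.
raw = """
{
  "questions": [
    {
      "question": "How can I create an account?",
      "answer": "To create an account, click on the 'Sign Up' button on the top right corner of our website and follow the instructions to complete the registration process."
    }
  ]
}
"""

data = json.loads(raw)

# Split the records into two parallel lists, as the notebook does.
questions = [item["question"] for item in data["questions"]]
answers = [item["answer"] for item in data["questions"]]

print(questions[0])
print(answers[0])
```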

## Converting the data into a Hugging Face `Dataset` for easier training

In [11]:
from datasets import Dataset, Features, Value

dataset = Dataset.from_dict(
    {
        # The "id" feature is declared as a string below, so build string ids.
        "id": [str(i) for i in range(len(questions))],
        "questions": questions,
        "answers": answers,
    },
    features=Features({
        "id": Value(dtype="string"),
        "questions": Value(dtype="string"),
        "answers": Value(dtype="string"),
    }),
)

## Split the dataset into train and test sets

- Splitting the dataset into 85% train and 15% test (67 and 12 examples, respectively)

In [12]:
dataset = dataset.train_test_split(test_size = 0.15)
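
To see what `test_size = 0.15` means on a dataset this small, the sketch below computes the split sizes. It assumes the rounding convention of scikit-learn's `train_test_split` (test count rounded up, train count rounded down), which `datasets` appears to follow. The 79-example total is the sum of the 67 training and 12 test examples shown in the `map` progress output later in this notebook; `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test size (assumed rounding)."""
    n_test = math.ceil(test_size * n_samples)
    n_train = math.floor((1 - test_size) * n_samples)
    return n_train, n_test


n_train, n_test = split_sizes(79, 0.15)
print(n_train, n_test)
```

Because `train_test_split` shuffles the rows, passing a `seed` argument makes the split reproducible across runs.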

In [13]:
def merge_columns(example):
    example["prediction"] = example["questions"] + " ->: " + example["answers"]
    return example

dataset = dataset.map(merge_columns)
dataset["train"]["prediction"][0]

  0%|          | 0/67 [00:00<?, ?ex/s]

  0%|          | 0/12 [00:00<?, ?ex/s]

"What is your price matching policy? ->: We have a price matching policy where we will match the price of an identical product found on a competitor's website. Please contact our customer support team with the details of the product and the competitor's offer."

In [14]:
dataset["train"][0]

{'id': '11',
 'questions': 'What is your price matching policy?',
 'answers': "We have a price matching policy where we will match the price of an identical product found on a competitor's website. Please contact our customer support team with the details of the product and the competitor's offer.",
 'prediction': "What is your price matching policy? ->: We have a price matching policy where we will match the price of an identical product found on a competitor's website. Please contact our customer support team with the details of the product and the competitor's offer."}
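
Because the model learns to continue text of the form `question ->: answer`, the same separator has to be used when prompting at inference time. A small sketch that keeps the two in sync (the helper names `build_prompt` and `build_training_text` are hypothetical, not part of the notebook):

```python
SEPARATOR = " ->: "  # same separator that merge_columns uses


def build_prompt(question: str) -> str:
    """Text given to the model at inference time."""
    return question + SEPARATOR


def build_training_text(question: str, answer: str) -> str:
    """Text the model is trained on: the prompt followed by the answer."""
    return build_prompt(question) + answer


question = "How can I create an account?"
answer = (
    "To create an account, click on the 'Sign Up' button on the top right "
    "corner of our website and follow the instructions to complete the "
    "registration process."
)

training_text = build_training_text(question, answer)
print(training_text)
```

Deriving both strings from one constant avoids a mismatch between the training format and the inference prompt.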

In [15]:
dataset = dataset.map(lambda samples: tokenizer(samples['prediction']), batched=True)

In [16]:
print(dataset["train"][0])

{'id': '11', 'questions': 'What is your price matching policy?', 'answers': "We have a price matching policy where we will match the price of an identical product found on a competitor's website. Please contact our customer support team with the details of the product and the competitor's offer.", 'prediction': "What is your price matching policy? ->: We have a price matching policy where we will match the price of an identical product found on a competitor's website. Please contact our customer support team with the details of the product and the competitor's offer.", 'input_ids': [1562, 304, 402, 2073, 11575, 3244, 42, 204, 1579, 37, 703, 413, 241, 2073, 11575, 3244, 881, 360, 451, 3185, 248, 2073, 275, 267, 12840, 1114, 1217, 313, 241, 25271, 18, 94, 1857, 25, 4012, 2072, 568, 3080, 1164, 1255, 335, 248, 2861, 275, 248, 1114, 273, 248, 25271, 18, 94, 1880, 25], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

# Train the Model

In [17]:
# The data collator pads batches, so the tokenizer needs a pad token; reuse EOS.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
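
The pad token matters because `DataCollatorForLanguageModeling` pads every batch to its longest sequence. The pure-Python sketch below mimics what that collator is understood to do with `mlm=False`: pad the inputs, mark the padding in the attention mask, and copy the inputs into `labels` with pad positions set to -100 so the loss ignores them. Right padding is assumed, the pad id 11 is the EOS id reported in the generation log at the end of this notebook, and `collate` is a hypothetical stand-in rather than the library function.

```python
PAD_ID = 11          # EOS id reused as the pad id (from the generation log)
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def collate(batch: list[list[int]], pad_id: int = PAD_ID) -> dict[str, list[list[int]]]:
    """Pad token-id lists to a common length and build causal-LM labels."""
    max_len = max(len(ids) for ids in batch)
    out = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_id] * n_pad
        out["input_ids"].append(padded)
        out["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Labels mirror the inputs; every pad-id position is masked out.
        out["labels"].append([IGNORE_INDEX if t == pad_id else t for t in padded])
    return out


example = collate([[1562, 304, 402], [1579, 37]])
print(example["input_ids"])
print(example["attention_mask"])
print(example["labels"])
```

One side effect of reusing EOS as the pad token under this scheme is that genuine EOS tokens in the text would be masked out of the loss as well.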

In [18]:
import transformers

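trainer = transformers.Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=transformers.TrainingArguments(
        evaluation_strategy="epoch",    # evaluate on the test split after each epoch
        per_device_train_batch_size=2,
        report_to="tensorboard",
        gradient_accumulation_steps=2,  # effective batch size: 2 * 2 = 4
        num_train_epochs=5,
        learning_rate=2e-4,
        fp16=True,                      # mixed-precision training
        logging_steps=1,
        output_dir='outputs'
    ),
    # mlm=False selects causal-LM collation: labels are a copy of input_ids.
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
# Disable the KV cache during training; it is only useful for generation.
model.config.use_cache = False
trainer.train()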

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
The current implementation of Falcon calls `torch.scaled_dot_product_attention` directly, this will be deprecated in the future in favor of the `BetterTransformer` API. Please install the latest optimum library with `pip install -U optimum` and call `model.to_bettertransformer()` to benefit from `torch.scaled_dot_product_attention` and future performance optimizations.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.2759        | 1.315653        |
| 2     | 0.7962        | 0.944973        |
| 3     | 0.5999        | 0.847633        |
| 4     | 0.6959        | 0.821839        |
| 5     | 0.5703        | 0.804843        |


TrainOutput(global_step=85, training_loss=0.9287772010354435, metrics={'train_runtime': 245.4665, 'train_samples_per_second': 1.365, 'train_steps_per_second': 0.346, 'total_flos': 692352864698880.0, 'train_loss': 0.9287772010354435, 'epoch': 5.0})
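
The numbers in the training log can be cross-checked with a little arithmetic. The sketch below assumes the `Trainer` performs one optimizer step per `gradient_accumulation_steps` batches, and that the validation loss is the mean per-token cross-entropy in nats, so that its exponential is the perplexity. The 67-example training set size is taken from the `map` progress output earlier in the notebook.

```python
import math

# Values copied from the notebook's configuration and training log.
train_examples = 67
per_device_batch_size = 2
grad_accum_steps = 2
epochs = 5
train_runtime_s = 245.4665
final_val_loss = 0.804843

# Assumed step accounting: batches per epoch, then one update per
# `grad_accum_steps` batches.
batches_per_epoch = math.ceil(train_examples / per_device_batch_size)
steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
total_steps = steps_per_epoch * epochs

samples_per_second = train_examples * epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

# Perplexity under the assumption that the loss is mean cross-entropy in nats.
perplexity = math.exp(final_val_loss)

print(f"total optimizer steps: {total_steps}")
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.3f}")
print(f"validation perplexity after epoch 5: {perplexity:.2f}")
```

With these assumptions the step count and throughput agree with `global_step=85`, `train_samples_per_second` and `train_steps_per_second` in the `TrainOutput` above.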

## View metrics in TensorBoard

In [19]:
%load_ext tensorboard

In [20]:
%tensorboard --logdir outputs

# Load adapters from the Hub

In [21]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "bnsapa/faq-llm"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the LoRA adapter weights on top of the base model
model = PeftModel.from_pretrained(model, peft_model_id)

adapter_config.json:   0%|          | 0.00/479 [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/18.9M [00:00<?, ?B/s]
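
The size of the downloaded adapter file is consistent with the LoRA parameter count reported earlier in this notebook. A quick check, assuming the published adapter uses the same LoRA configuration, that its weights are stored as 32-bit floats, and that the safetensors header is negligible:

```python
# Estimate the adapter file size from the LoRA parameter count.
LORA_PARAMS = 4_718_592   # trainable params reported after get_peft_model
BYTES_PER_PARAM = 4       # assumption: weights saved as float32

size_bytes = LORA_PARAMS * BYTES_PER_PARAM
size_mb = size_bytes / 1e6

print(f"estimated adapter size: {size_mb:.1f} MB")
```

The estimate matches the 18.9M shown for `adapter_model.safetensors`, a small fraction of the roughly 14 GB base checkpoint.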

# Example using the trained model

In [22]:
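# Use the same "question ->: " prompt format that the model was trained on.
batch = tokenizer("how to reset my account ->: ", return_tensors='pt')

# Run generation under automatic mixed precision.
with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=50)

print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))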

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.




 how to reset my account ->: (How do I reset my account password or username?) If you have forgotten your password or username, please visit the "Forgot Password" or "Forgot Username" page to reset your account. If you are still unable to access your
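
The decoded text still contains the prompt, and generation stopped mid-sentence because it reached `max_new_tokens=50`. One possible post-processing step, sketched below with hypothetical helper names, strips everything up to the separator and trims the answer back to its last complete sentence:

```python
SEPARATOR = "->:"  # separator used in the training and inference prompts


def extract_answer(decoded: str, separator: str = SEPARATOR) -> str:
    """Drop the prompt and separator, keeping only the generated answer."""
    _, _, answer = decoded.partition(separator)
    return answer.strip()


def trim_to_last_sentence(text: str) -> str:
    """Cut the text after its last '.', '!' or '?'; return it unchanged if none."""
    cut = max(text.rfind(ch) for ch in ".!?")
    return text[: cut + 1] if cut != -1 else text


# Decoded output copied from the cell above.
decoded = (
    'how to reset my account ->: (How do I reset my account password or '
    'username?) If you have forgotten your password or username, please visit '
    'the "Forgot Password" or "Forgot Username" page to reset your account. '
    'If you are still unable to access your'
)

answer = trim_to_last_sentence(extract_answer(decoded))
print(answer)
```

Raising `max_new_tokens` is the more direct way to avoid truncated answers; the trimming step only tidies what was generated.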
