## Installing Necessary Libraries

In [14]:
# !pip install -q -U transformers accelerate trl git+https://github.com/huggingface/peft.git
# !pip install -q datasets bitsandbytes einops wandb
# !pip install xformers

## Dataset details
The Instacart data can be downloaded from [here](https://www.kaggle.com/competitions/instacart-market-basket-analysis/data). We only need the product and department CSV files (`products.csv` and `departments.csv`).


In [7]:
import pandas as pd

df_product = pd.read_csv("./products.csv")
df_dept = pd.read_csv('./departments.csv')

df_joined = pd.merge(df_product, df_dept, on = ['department_id'])
df_joined['text'] = df_joined.apply(lambda row: row['product_name'] + " ->: " + row['department'], axis = 1)
df_joined.sample(5)

Unnamed: 0,product_id,product_name,aisle_id,department_id,department,text
42925,4850,Original Scent Ultra Dishwashing Liquid,100,21,missing,Original Scent Ultra Dishwashing Liquid ->: mi...
3470,27613,45% Cacao Barcelona Bar,45,19,snacks,45% Cacao Barcelona Bar ->: snacks
10912,43244,"Seasoning, All-Purpose",104,13,pantry,"Seasoning, All-Purpose ->: pantry"
34231,3439,"Thick Sliced Canadian, Bacon Natural Hickory S...",106,12,meat seafood,"Thick Sliced Canadian, Bacon Natural Hickory S..."
49652,5161,Dried Mango,18,10,bulk,Dried Mango ->: bulk
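
The `text` column above pairs each product name with its department using the ` ->: ` separator. As a minimal sketch of that format, the two helpers below (hypothetical names, not part of the notebook) build a training string and the matching inference prompt. The prompt stops at the separator so that the department is left for the model to generate.

```python
SEPARATOR = " ->: "  # same separator the notebook uses in the 'text' column


def make_training_text(product_name: str, department: str) -> str:
    """Full training example: product name, separator, department."""
    return f"{product_name}{SEPARATOR}{department}"


def make_prompt(product_name: str) -> str:
    """Inference prompt: ends right after '->:' and carries no department."""
    return f"{product_name}{SEPARATOR}".rstrip()


print(make_training_text("Dried Mango", "bulk"))
print(make_prompt("Bread Rolls"))
```

Keeping the two builders side by side makes it harder to pass a labelled training string to the model by accident when we only want a prompt.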


In [8]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df_joined, test_size=0.2, random_state=42)

In [9]:
from datasets import Dataset , DatasetDict

train_dataset_dict = DatasetDict({
    "train": Dataset.from_pandas(train_df),
})

train_dataset_dict

DatasetDict({
    train: Dataset({
        features: ['product_id', 'product_name', 'aisle_id', 'department_id', 'department', 'text', '__index_level_0__'],
        num_rows: 39750
    })
})
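
The 39750 training rows reported above and the 9938 test rows printed later in this notebook should add up to the full product table. The sketch below checks that arithmetic, assuming `train_test_split` rounds the test share up to a whole row and gives the remainder to the training set.

```python
import math

n_train_reported = 39750  # num_rows of the train split shown above
n_test_reported = 9938    # length of the test list printed later in the notebook
n_total = n_train_reported + n_test_reported

test_size = 0.2
# Assumption: the test share is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

print(n_total, n_train, n_test)
```

Note that only the training split goes into the `DatasetDict`; the test rows stay in `test_df` for the evaluation at the end.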

## Loading the model

In this section we will load the [Falcon 7B model](https://huggingface.co/tiiuae/falcon-7b), quantize it in 4bit and attach LoRA adapters on it. Let's get started!

In [11]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "ybelkada/falcon-7b-sharded-bf16"      # model_name = "tiiuae/falcon-7b"

# bitsandbytes (bnb) quantization settings
bnb_config = BitsAndBytesConfig(
    # Enable 4-bit quantization: the Linear layers are replaced with FP4/NF4 layers.
    load_in_4bit = True,
    # Quantization data type ("fp4" or "nf4"). NF4 (4-bit NormalFloat) is a 4-bit data type designed for weights that follow a normal distribution.
    bnb_4bit_quant_type = "nf4",
    # Compute dtype: the dtype used during computation, which may differ from the input dtype.
    # For example, hidden states could be in float32 while computation runs in a 16-bit type (float16 here) for speed.
    bnb_4bit_compute_dtype = torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config = bnb_config,
    # allows the custom modeling code shipped in the model repository to be executed while loading; only enable this for repositories you trust
    trust_remote_code = True
)

# disable the key/value cache of past attention states; it only speeds up generation and is not needed during training
model.config.use_cache = False

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]
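
To get a feel for why we quantize to 4 bits, here is a back-of-the-envelope estimate of the memory taken by the weights alone at different precisions. The parameter count is an assumed round figure for a "7B" model, and the estimate ignores quantization constants, activations, optimizer state and the LoRA adapters, so treat it as a rough lower bound rather than a measurement.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Memory needed to store the weights only, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7_000_000_000  # assumption: round figure, not read from the checkpoint

for label, bits in [("float32", 32), ("float16 / bfloat16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>20}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Going from 16-bit to 4-bit weights cuts this part of the footprint by a factor of four, which is what lets the model fit on a single consumer or Colab GPU.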

Let's also load the tokenizer. Falcon's tokenizer has no padding token, so we reuse the end-of-sequence token for padding.

In [15]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

**Let's check what the base model predicts before finetuning. :)**

In [17]:
from transformers import pipeline

model_pipeline = pipeline(
    "text-generation",
    model = model,
    tokenizer = tokenizer,
    torch_dtype = torch.bfloat16,
    trust_remote_code = True,
    device_map = "auto",
)

In [19]:
sequences = model_pipeline(
    ["“Free & Clear Stage 4 Overnight Diapers” ->:","Bread Rolls ->:","French Milled Oval Almond Gourmande Soap ->:"],
    max_length = 200,
    # do_sample=True indicates that the model will use sampling for text generation rather than deterministic methods
    # top_k=10 controls the diversity of the generated output. It restricts the model to consider only the top-k most probable tokens at each step during sampling. Higher top_k values increase diversity.
    do_sample = True,
    top_k = 10,
    # number of generated sequences
    num_return_sequences = 1,
    eos_token_id = tokenizer.eos_token_id,
)

for seq in sequences:
    print(f"Result: {seq[0]['generated_text']}")
    print()

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Result: “Free & Clear Stage 4 Overnight Diapers” ->:
Free & Clear Diapers
Free & Clear Diapers
Free & Clear Diapers
Free & Clear Diapers
Free & Clear Diapers
Free & Clear Diapers
Free & Clear Diapers
Free & Clear Diapers
Free & Clear Diapers
Free & Clear Diapers
Free & Clear Diapers
Free & Clear Diapers:
Free & Clear Diapers
Free & Clear Diapers Reviews
Free & Clear Diapers Price Chart
Free & Clear Diapers Reviews
Free & Clear Diapers Price Chart


Result: Bread Rolls ->: ->
Ingredients :- ->
-> -> ->
1) 1/2 Cup Maida (Plain Flour)
1/2 Cup Wheat Flour
1/4 Cup Milk Powder
1 Cup Lukewarm Water
Salt As Required
2) 1/2 Tspn Active Dry Yeast (Dried Yeast)
1/2 Cup Lukewarm Milk
1/2 tspn Sugar
->
-> -> ->
Directions :- ->
1) To a large mixing bowl, add maida, wheat flour and milk powder.
2) Add active dry yeast and mix it in milk.
3) Let this mixture sit for 15-20 mins.
4) To the above mixture, add 1/4 tspn of salt and mix it in.
5) Add 1/2 cup lukewarm water and mix it in.
6) Knead the

Resu
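
The generation call above uses `do_sample = True` with `top_k = 10`. As a toy illustration of what top-k sampling does at a single step, the sketch below keeps the k highest-scoring tokens, renormalises them with a softmax and draws one. The token scores are made up for the example, and this is not the code path `transformers` runs internally.

```python
import math
import random


def top_k_filter(logits: dict, k: int) -> dict:
    """Keep the k highest-scoring tokens and turn their scores into probabilities."""
    top = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]
    max_score = top[0][1]  # subtract the max for numerical stability
    weights = {token: math.exp(score - max_score) for token, score in top}
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}


def sample_top_k(logits: dict, k: int, rng: random.Random) -> str:
    """Draw one token from the renormalised top-k distribution."""
    probs = top_k_filter(logits, k)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical next-token scores, for illustration only
toy_logits = {"snacks": 2.0, "bakery": 1.5, "pantry": 0.5, "pets": -1.0}

rng = random.Random(0)
print(top_k_filter(toy_logits, k=2))
print([sample_top_k(toy_logits, k=2, rng=rng) for _ in range(5)])
```

With `k = 1` this reduces to greedy decoding, while larger values let lower-ranked tokens through and make the output more varied.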

Below we define the configuration used to create the LoRA model. According to the QLoRA paper, it is important to target all linear layers in the transformer block for maximum performance. Therefore we add the `dense`, `dense_h_to_4h` and `dense_4h_to_h` layers to the target modules, in addition to the fused `query_key_value` layer.

In [22]:
# Parameter-Efficient Fine-Tuning (PEFT) methods

from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)
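
As a rough check on how much these adapters add, recall that LoRA attaches two low-rank matrices to each targeted linear layer, for `r * (d_in + d_out)` extra weights per layer, and scales their product by `lora_alpha / r`. The layer shapes and block count below are assumptions about Falcon-7B's architecture and are not read from the loaded model, so the total is only an estimate.

```python
lora_r = 64
lora_alpha = 16

# Assumed Falcon-7B shapes (d_in, d_out) for each targeted module
HIDDEN = 4544
assumed_shapes = {
    "query_key_value": (HIDDEN, 4672),   # fused QKV with multi-query attention
    "dense": (HIDDEN, HIDDEN),
    "dense_h_to_4h": (HIDDEN, 4 * HIDDEN),
    "dense_4h_to_h": (4 * HIDDEN, HIDDEN),
}
NUM_BLOCKS = 32  # assumed number of transformer blocks


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Weights in the A (r x d_in) and B (d_out x r) matrices of one adapter."""
    return r * (d_in + d_out)


per_block = sum(lora_params(d_in, d_out, lora_r) for d_in, d_out in assumed_shapes.values())
total_lora_params = per_block * NUM_BLOCKS
scaling = lora_alpha / lora_r

print(f"LoRA weights per block: {per_block:,}")
print(f"LoRA weights in total:  {total_lora_params:,}")
print(f"LoRA scaling factor:    {scaling}")
```

Under these assumptions the adapters hold on the order of 130 million trainable weights, a small fraction of the frozen base model.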

## Loading the trainer

Here we use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), a wrapper around the transformers `Trainer` that makes it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first define the training arguments below.

In [23]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 1
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 120 #500
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir = output_dir,
    per_device_train_batch_size = per_device_train_batch_size,
    gradient_accumulation_steps = gradient_accumulation_steps,
    optim = optim,
    save_steps = save_steps,
    logging_steps = logging_steps,
    learning_rate = learning_rate,
    fp16 = True,
    max_grad_norm = max_grad_norm,
    max_steps = max_steps,
    warmup_ratio = warmup_ratio,
    group_by_length = True,
    lr_scheduler_type = lr_scheduler_type,
)
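
It helps to translate these arguments into how much data the run actually sees. The sketch below multiplies the per-device batch size by the gradient accumulation steps to get the effective batch size, then works out what share of the 39750 training rows is covered in `max_steps` optimizer steps. It assumes a single GPU.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 120
num_devices = 1         # assumption: single-GPU run
num_train_rows = 39750  # size of the train split shown earlier

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
samples_seen = effective_batch_size * max_steps
epoch_fraction = samples_seen / num_train_rows

print(f"effective batch size: {effective_batch_size}")
print(f"samples seen:         {samples_seen}")
print(f"fraction of an epoch: {epoch_fraction:.3f}")
```

So this short run covers only about 5% of the training set, which is consistent with the `'epoch': 0.05` value that `trainer.train()` reports further down.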

Then finally pass everything to the trainer.

In [24]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model = model,
    train_dataset = train_dataset_dict['train'],
    peft_config = peft_config,
    dataset_text_field = "text",                       # dataset_text_field="prediction",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = training_arguments,
)



Map:   0%|          | 0/39750 [00:00<?, ? examples/s]

We will also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [25]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        # nn.Module.to() converts the module's parameters in place, so no reassignment is needed
        module.to(torch.float32)

## Train the model

Now let's train the model! Simply call `trainer.train()`

In [26]:
trainer.train()

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,3.9383
2,4.2964
3,4.4498
4,4.1203
5,3.9512
6,3.4342
7,2.5358
8,2.794
9,2.3794
10,2.3704


TrainOutput(global_step=120, training_loss=2.2620727290709812, metrics={'train_runtime': 1164.8734, 'train_samples_per_second': 1.648, 'train_steps_per_second': 0.103, 'total_flos': 479793488348160.0, 'train_loss': 2.2620727290709812, 'epoch': 0.05})

In [27]:
sample_size = 25

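# Each 'text' entry has the form "<product_name> ->: <department>", so it already contains the true department
lst_test_data = list(test_df['text'])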
print(len(lst_test_data))

lst_test_data_short = lst_test_data[:sample_size]
print(lst_test_data_short)

9938
['Free & Clear Stage 4 Overnight Diapers ->: babies', 'Beef pot roast with roasted potatoes, carrots, sweet onions, green beans, and a rich gravy Beef Pot Roast ->: frozen', 'Coffee Liquer ->: alcohol', 'Bread Rolls ->: bakery', 'French Milled Oval Almond Gourmande Soap ->: personal care', 'Dust Pan ->: household', 'Roasted Pine Nut Hommus ->: deli', 'Cranberry Raspberry Juice Cocktail ->: beverages', 'Sweet Cream Butter Salted ->: dairy eggs', 'Traditional Chicken Barley Soup ->: canned goods', 'Vanilla Unsweetened Cashewmilk ->: dairy eggs', 'Minis Size Chocolate Candy Bars Variety Mix ->: snacks', 'Cheesy Cheddar Rotini Pasta Sides ->: dry goods pasta', 'Multi Purpose Solution ->: personal care', 'Juice, Raw & Cold-Pressed, Purity ->: beverages', 'Chinese Five Spices Powder ->: international', 'Ancient Grain Original Granola ->: breakfast', 'Crunch Lemon Shortbread Flavor 0% Fat with Toppings Greek Yogurt ->: dairy eggs', 'Flounder Fillets ->: meat seafood', 'Lilac Votive ->: h

In [29]:
import transformers

model_pipeline = transformers.pipeline(
    "text-generation",
    model = model,
    tokenizer = tokenizer,
    torch_dtype = torch.float16,      # torch_dtype=torch.bfloat16,
    trust_remote_code = True,
    device_map = "auto",
)

sequences = model_pipeline(
    lst_test_data_short,
    max_length = 100,  #200,
    do_sample = True,
    top_k = 10,
    num_return_sequences = 1,
    eos_token_id = tokenizer.eos_token_id,
)

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.

In [31]:
for ix , seq in enumerate(sequences):
    print(ix , ' : ' , seq[0]['generated_text'])
    print()

0  :  Free & Clear Stage 4 Overnight Diapers ->: babies personal care babies diapers ->: babies personal care babies diapers other babies household babies household diapers ->: babies personal care babies diapers other babies household babies household diapers babies personal care babies diapers other babies household babies household diapers babies personal care diapers other babies household babies household supplies babies other diapers ->: babies personal care babies diapers other babies household babies household products babies other diapers ->: babies personal care babies diapers other babies household babies household

1  :  Beef pot roast with roasted potatoes, carrots, sweet onions, green beans, and a rich gravy Beef Pot Roast ->: frozen meals -> canned goods -> canned meat -> canned meats -> stews -> international foods -> canned international goods -> beef dry goods -> canned goods -> soups, canned goods -> canned meats -> canned beef canned goods international -> canned go

In [32]:
def correct_answer(ans):
    # Keep the text after the first "->:" separator, up to the next "->:" (or the end of the string)
    return (ans.split("->:")[1]).strip()

answers = []
for seq in sequences:
    answers.append(correct_answer(seq[0]['generated_text']))

In [33]:
df_evaluate = test_df.iloc[:sample_size][['product_name','department']]
df_evaluate = df_evaluate.reset_index(drop=True)

df_evaluate['department_predicted'] = answers
df_evaluate

Unnamed: 0,product_name,department,department_predicted
0,Free & Clear Stage 4 Overnight Diapers,babies,babies personal care babies diapers
1,"Beef pot roast with roasted potatoes, carrots,...",frozen,frozen meals -> canned goods -> canned meat ->...
2,Coffee Liquer,alcohol,alcohol beverage alcohol other alcohol other s...
3,Bread Rolls,bakery,bakery snacks
4,French Milled Oval Almond Gourmande Soap,personal care,personal care personal care missing: personal ...
5,Dust Pan,household,household -> household supplies dry goods hous...
6,Roasted Pine Nut Hommus,deli,deli missing ingredients
7,Cranberry Raspberry Juice Cocktail,beverages,beverages international specialty other beverages
8,Sweet Cream Butter Salted,dairy eggs,dairy eggs other produce pantry snacks canned ...
9,Traditional Chicken Barley Soup,canned goods,canned goods -> canned goods international can...
