<a href="https://colab.research.google.com/github/astromad/GeekyMad/blob/main/LLMTraining_Part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


What is Task Specific Fine-tuning

* Fine Tuning and Instruction Fine Tuning (IFT)
  * Expensive
  * All Model parameters modified
  * Catastrophic forgetting
* Parameter Efficient Fine Tuning (PEFT)
  * Only a few (new) parameters get modified
  * Cost efficient
  * Original weights frozen

[AutoModelForSequenceClassification](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification)

[GPT-2](https://huggingface.co/transformers/v3.0.2/model_doc/gpt2.html#gpt2doubleheadsmodel)

In [50]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).



**Finetune GPT-2 for product Classification**

Product classification is a challenging task for many companies. With thousands of new products added to e-commerce sites, products that are not categorized properly will not show up for the right customers and will not sell as a result. This is an active field of study in machine learning. In this project, I will walk you through fine-tuning GPT-2 to solve this classification problem a bit more efficiently.



In [51]:
# Clean up previous training data
!rm -rf Classification_cache
!rm -rf results_PT
!rm -rf logs_PT

In [52]:
import transformers

In [53]:
!pip install accelerate -U



First we load the training data, which in our case is an Amazon product dataset. The dataset is in CSV format and has 3 columns:

* Product Category
* Product Label
* Product Description

In this section, we will

* Read the dataset and load it as a Pandas DataFrame
* Clean the data by removing entries with a null category

In [54]:
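import pandas as pd
# on_bad_lines="skip" replaces error_bad_lines=False, which newer pandas releases no longer accept
df = pd.read_csv("/content/drive/My Drive/ColabData/Amazon.csv",
                encoding="ISO-8859-1", on_bad_lines="skip")

# Keep the three columns we need and drop rows without a category.
# .copy() gives an independent frame, so later column assignments do not touch a view of df.
data = df[['category', 'label_title', 'label_description']].dropna(subset=['category']).copy()
print(data.head(3))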





                category                                        label_title  \
0  Headphone Accessories                  Koss EQ50 3-Band Stereo Equalizer   
1     Inkjet Printer Ink              Kodak Black Ink Cartridge 10B 1163641   
2  Computers Accessories  Kingston 128MX64 PC2700 COMPAQ Evo D320 KTC-D3...   

                                   label_description  
0  The pocket-size Koss 3-Band Equalizer delivers...  
1  Kodak Black Ink Cartridge 10B is a standard bl...  
2  1GB - 333MHz DDR333 PC2700 - DDR SDRAM - 184-p...  


Now it's time to do some cleanup. The data has 706 unique categories, but many of them have 20 or fewer products. We drop those categories, purely to reduce training time by focusing on categories with a larger number of products. With this, our category count drops below 200.

In [55]:
print(data.groupby('category').count() )

value_counts = data['category'].value_counts()
to_remove = value_counts[value_counts <= 20].index
data = data[~data.category.isin(to_remove)]

print(data.groupby('category').count() )

                           label_title  label_description
category                                                 
12V                                  1                  1
6V                                   4                  4
9V                                   6                  6
A                                    2                  2
AA                                  22                 22
...                                ...                ...
Wires                                1                  1
Wiring Harnesses                    20                 20
Wrist Rests                         17                 17
eBook Readers                       12                 12
eBook Readers Accessories            6                  6

[706 rows x 2 columns]
                        label_title  label_description
category                                              
AA                               22                 22
AC Adapters                      38                 38
Ac

To use this data, the target class must be numerical before we can feed it into ML/DL models, so we convert each category to an integer.

In [56]:

encode_dict={}
def encode_label(x):
    # Assign the next unused integer the first time a category is seen
    if x not in encode_dict:
        encode_dict[x]=len(encode_dict)
    return encode_dict[x]

data['encoded_category'] = data['category'].apply(encode_label)

Our data has two text fields, the label title and the label description. We merge them into a single text field to feed to the model for classification.

In [57]:
newData=pd.DataFrame()
newData['desc']=data['label_title'] +' '+ data['label_description']
newData['encoded_category']=data['encoded_category']


Reset the index of our data, since the rows we removed (null categories and small categories) left gaps in it


In [58]:
print(newData[:21])
newData = newData.reset_index(drop=True)
print(newData[:21])

                                                 desc  encoded_category
0   Koss EQ50 3-Band Stereo Equalizer The pocket-s...                 0
1   Kodak Black Ink Cartridge 10B 1163641 Kodak Bl...                 1
2   Kingston 128MX64 PC2700 COMPAQ Evo D320 KTC-D3...                 2
3   Kinamax MS-UES2 Mini High Precision USB 3-Butt...                 3
4   Kensington K72349US Wireless Mouse for Netbook...                 3
5   Kensington BlackBelt Protection Band for iPad ...                 4
6   JUST5 J509 Easy to Use Unlocked Cell Phone wit...                 5
7   Imation Corp 50PK CDR 700MB 80MIN 52X-SPINDLE ...                 6
8   16x DVD-R Media Imation 16x DVD-R Media 17340 ...                 7
9   iGo Arctic Laptop Cooling Pad AC05065-0001 Eve...                 8
10  HP TouchPad Custom Fit Case Protect your HP To...                 9
11  HP LaserJet Pro P1606dn Printer CE749A BGJ WHY...                10
12  HP 85A LaserJet Black Toner Print Cartridge - ...           

In [59]:
newData.dropna(subset=['desc'], inplace=True)
nan_rows = newData[newData.isnull().T.any()]
print(nan_rows)

Empty DataFrame
Columns: [desc, encoded_category]
Index: []


Preprocess the description text: remove punctuation, lowercase, and remove stop words. Note: we do not lemmatize, since the GPT-2 subword tokenizer works directly on the surface word forms.

In [60]:
newData.loc[20,'desc']

'EDGE SD Gaming Cards - Flash memory card - 1 GB - 130x - SD Edge Tech Corp 1GB Secure Digital SD Gaming Card EDGDM-222666-PE Flash Memory'

In [61]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [62]:
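# regex=True is needed in newer pandas, where str.replace treats the pattern as a literal by default
newData['desc']=newData.desc.str.replace(r"[^\w\s]", "", regex=True).str.lower()
stop_set = set(stop)  # set membership is faster than scanning a list
newData['desc']=newData['desc'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop_set]))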



In [63]:
newData.loc[20,'desc']

'edge sd gaming cards flash memory card 1 gb 130x sd edge tech corp 1gb secure digital sd gaming card edgdm222666pe flash memory'
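To make the cleaning step easier to reason about outside of pandas, here is a minimal pure-Python sketch of the same three operations. The `clean_description` helper and the tiny `STOP_WORDS` set are assumptions for illustration only; the notebook itself uses the full NLTK English stop-word list.

```python
import re

# Hypothetical, tiny stop-word set for illustration; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "a", "an", "and", "of", "for", "is", "to", "in"}

def clean_description(text, stop_words=STOP_WORDS):
    """Strip punctuation, lowercase, and drop stop words."""
    no_punct = re.sub(r"[^\w\s]", "", text)   # same character class as the pandas call
    tokens = no_punct.lower().split()          # split() also collapses repeated spaces
    return " ".join(tok for tok in tokens if tok not in stop_words)

sample = "EDGE SD Gaming Cards - Flash memory card - 1 GB - 130x"
print(clean_description(sample))
```

Note that removing punctuation also glues hyphenated model numbers together (for example `EDGDM-222666-PE` becomes `edgdm222666pe`), which is visible in the cleaned record above.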

In [64]:

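# encode_dict maps category name -> integer id in insertion order,
# so enumerate() reproduces the same ids; dict.items() replaces the Python 2 era iteritems helper
label2idx = {t: i for i, t in enumerate(encode_dict)}
idx2label = {v: k for k, v in label2idx.items()}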
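As a small illustration of how the two lookup tables relate, the following sketch builds both directions from a toy list of categories and checks the round trip. `build_label_maps` and the `demo_*` names are hypothetical helpers, not part of the notebook.

```python
def build_label_maps(categories):
    """Assign each distinct category an integer id in first-seen order."""
    name_to_id = {}
    for name in categories:
        if name not in name_to_id:
            name_to_id[name] = len(name_to_id)
    id_to_name = {idx: name for name, idx in name_to_id.items()}
    return name_to_id, id_to_name

# Toy input: category names taken from the sample rows shown earlier, with one repeat
demo_categories = ["Headphone Accessories", "Inkjet Printer Ink",
                   "Computers Accessories", "Inkjet Printer Ink"]
demo_label2idx, demo_idx2label = build_label_maps(demo_categories)

# Every id maps back to the name it came from
for name, idx in demo_label2idx.items():
    assert demo_idx2label[idx] == name
print(demo_label2idx)
```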

In [65]:
# ClassMax is the largest encoded label; the number of distinct categories is ClassMax + 1
ClassMax=newData['encoded_category'].max()
print('Number of Categories of products',ClassMax)


Number of Categories of products 187


In [66]:
train_size = 0.8
train_dataset=newData.sample(frac=train_size,random_state=200)
test_dataset=newData.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)


print("FULL Dataset: {}".format(newData.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

FULL Dataset: (18046, 2)
TRAIN Dataset: (14437, 2)
TEST Dataset: (3609, 2)


In [70]:
MAX_LEN = 128
LEARNING_RATE = 3e-02  # defined but unused: learning_rate is commented out in TrainingArguments below

Using the GPT-2 model and its tokenizer

In [69]:
from transformers import (
    AutoConfig,
    AutoTokenizer
)
model_args = dict()
model_args['model_name'] = 'gpt2'
model_args['cache_dir'] = "Classification_cache/"
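# Note: do_basic_tokenize / is_pretokenized are BERT-era options; GPT-2's BPE tokenizer most likely ignores them
model_args['do_basic_tokenize'] = False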

tokenizer = AutoTokenizer.from_pretrained(
    model_args['model_name'],
    cache_dir=model_args['cache_dir'],
    is_pretokenized=model_args['do_basic_tokenize'],
    do_basic_tokenize = model_args['do_basic_tokenize']
)
config = AutoConfig.from_pretrained(
    model_args['model_name'],
    cache_dir=model_args['cache_dir'],
    return_dict=True,
    pad_token_id=tokenizer.eos_token_id,
    id2label=idx2label,
    label2id=label2idx,
    num_labels=ClassMax+1
)


In [None]:
# import torch.nn as nn

# Adding a custom classification head on top of the pre-trained model
# num_classes = x
# classification_head = nn.Linear(model.config.hidden_size, num_classes)

# # Replace the pre-trained model's classification head with our custom head
# model.classifier = classification_head

Let's define a helper function and a dataset class that create the input dataset in the form a transformer model understands. The class reads each description and category and arranges them into 4 fields:

* input_ids
* token_type_ids
* attention_mask
* labels

We use tokenizer.encode_plus to tokenize each description, and we attach the corresponding label to each item.

We do this for both the training and test datasets
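To make the `input_ids` and `attention_mask` fields concrete, here is a minimal pure-Python sketch of fixed-length padding and truncation. It is a simplification of what the tokenizer does with `padding='max_length'` and `truncation=True`, assuming right-side padding; `pad_or_truncate` is a hypothetical helper, and the pad id 50256 is GPT-2's end-of-text token, which this notebook reuses for padding.

```python
def pad_or_truncate(token_ids, max_len, pad_id):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = list(token_ids[:max_len])          # truncate anything past max_len
    mask = [1] * len(ids)                    # 1 marks a real token
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 marks padding

# Toy token ids; real ids come from the GPT-2 tokenizer
input_ids, attention_mask = pad_or_truncate([9487, 1958, 262], max_len=6, pad_id=50256)
print(input_ids)
print(attention_mask)
```

The attention mask is what lets the model tell real tokens from padding, even though the pad id and the end-of-text id are the same number.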

In [71]:
def decorate(description):
    start_prompt = 'Classify the following product description.\n\n'
    end_prompt = '\n\nProduct Category: '
    prompt = start_prompt + description + end_prompt
    return prompt

In [72]:
import torch
class TorchClassificationDataset(torch.utils.data.Dataset):
    def __init__(self,dataset,max_len):
        self.len = len(dataset)
        self.data = dataset
        self.max_len=max_len
    def __getitem__(self, idx):
        description = str(self.data.desc[idx])
        # Cut the raw text to max_len characters (not tokens) before building the prompt
        description = description[:self.max_len]
        # Tokenize the decorated prompt, then pad/truncate to exactly max_len tokens
        inputs = tokenizer.encode_plus(
            decorate(description),
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            return_token_type_ids=True,
            truncation=True
        )
        # Return one record as a dictionary of tensors, keyed the way the Trainer expects
        item ={}
        item['input_ids']=torch.tensor(inputs['input_ids'], dtype=torch.long)
        item['token_type_ids']=torch.tensor(inputs['token_type_ids'], dtype=torch.long)
        item['attention_mask']=torch.tensor(inputs['attention_mask'], dtype=torch.long)
        item['labels'] = torch.tensor(self.data.encoded_category[idx], dtype=torch.long)
        return item

    def __len__(self):
        return self.len

In [73]:
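def createDataset(framework='pt'):
  # Only PyTorch datasets are implemented; fail loudly for anything else
  if framework != 'pt':
    raise ValueError(f"Unsupported framework: {framework}")
  train_ds = TorchClassificationDataset(train_dataset,MAX_LEN)
  test_ds= TorchClassificationDataset(test_dataset,MAX_LEN)
  return train_ds,test_ds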

Now that the data is in the format a sequence classification model expects, let's prepare for training. The data is fed to the model in batches, so each record must be converted to tensors that a PyTorch data loader can read. The dataset class above returns each record as a dictionary of tensors; the next cell instantiates it for the training and test sets.

In [74]:
train_ds,test_ds = createDataset('pt')
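As a rough picture of what "feeding in batches" means, the sketch below groups per-record dictionaries into dict-of-lists batches. It is a simplified stand-in, under the assumption of no shuffling and plain Python lists instead of tensors; the Trainer handles this internally with a PyTorch data loader. `iter_batches` and the toy dataset are hypothetical.

```python
def iter_batches(dataset, batch_size):
    """Yield dict-of-lists batches from any indexable dataset of dict items."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        items = [dataset[i] for i in range(start, stop)]
        # Gather each field across the records in this batch
        yield {key: [item[key] for item in items] for key in items[0]}

# Toy dataset with the same kind of keys the real records use
toy_dataset = [{"input_ids": [i, i + 1], "labels": i % 3} for i in range(5)]
toy_batches = list(iter_batches(toy_dataset, batch_size=2))
print(len(toy_batches))
print(toy_batches[0])
```

The last batch is smaller than the others whenever the dataset size is not a multiple of the batch size.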

In [75]:
tokenizer.pad_token = tokenizer.eos_token

In [76]:
print('One record of Training dataset')
print(train_ds[0])
print('One record of Test dataset')
print(test_ds[0])


One record of Training dataset
{'input_ids': tensor([ 9487,  1958,   262,  1708,  1720,  6764,    13,   198,   198,  6404,
        45396, 26250,    79, 49823,   269, 44928, 49823,   269, 44928, 30902,
         3033, 11711, 26250,    79,  5861,   266,  1460, 32060,   362, 29034,
          289,    67, 12694,   374,  1031,   669,    71,  5117,  2008,  4882,
         4352,   461,   198,   198, 15667, 21743,    25,   220, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 5025

We evaluate the model with the usual classification metrics: accuracy, precision, recall and F1 score. Here I am using the sklearn metrics library to compute them.

In [77]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
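Because `average='macro'` matters a lot with nearly 200 unevenly sized categories, here is a minimal pure-Python sketch of what macro averaging computes: each class gets its own precision, recall and F1, and the three are then averaged with equal weight per class. `macro_scores` is a hypothetical helper for illustration; it treats an undefined ratio as 0.0, which is an assumption about edge-case handling.

```python
def macro_scores(labels, preds):
    """Per-class precision/recall/F1, averaged with equal weight per class."""
    classes = sorted(set(labels) | set(preds))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

toy_labels = [0, 0, 1, 1, 2]
toy_preds = [0, 1, 1, 1, 2]
toy_accuracy = sum(y == p for y, p in zip(toy_labels, toy_preds)) / len(toy_labels)
toy_precision, toy_recall, toy_f1 = macro_scores(toy_labels, toy_preds)
print(toy_accuracy, toy_precision, toy_recall, toy_f1)
```

With macro averaging, a category with 21 products counts as much as one with hundreds, so the macro scores can sit well below accuracy when small categories are misclassified.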

OK, as you have seen, the majority of a machine learning task is getting the data ready for the model to train on. Now let's use Hugging Face's Trainer module to train the model

In [78]:
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments
)
model = AutoModelForSequenceClassification.from_pretrained(
    model_args['model_name'],
    config=config,
    cache_dir=model_args['cache_dir'],
)


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [79]:
!pip install peft



In [80]:
from peft import LoraConfig,get_peft_model

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    fan_in_fan_out=True,
    task_type="SEQ_CLS"
)
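To show what `r` and `lora_alpha` control, here is a conceptual pure-Python sketch of a LoRA-adapted linear layer: the frozen weight `W` is left untouched, and a low-rank product `A @ B`, scaled by `alpha / r`, is added on top. This is a toy illustration with plain lists, not the PEFT implementation; dropout is omitted and the function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha, r):
    """Frozen projection x @ W plus the scaled low-rank update x @ A @ B."""
    base = matmul(x, W)                    # frozen pre-trained path
    update = matmul(matmul(x, A), B)       # trainable low-rank path (d x r, then r x k)
    scale = alpha / r                      # with lora_alpha=16 and r=64 this is 0.25
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

x = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [1.0]]                         # d=2, r=1

# B starts at zero, so the adapted layer initially matches the frozen layer
out_initial = lora_forward(x, W, A, [[0.0, 0.0]], alpha=2, r=1)
# After training moves B away from zero, the output shifts
out_trained = lora_forward(x, W, A, [[1.0, 0.0]], alpha=2, r=1)
print(out_initial, out_trained)
```

Only `A` and `B` would receive gradients; `W` stays frozen, which is why the trainable fraction reported after wrapping the model is so small.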

In [81]:
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )

In [82]:
print_trainable_parameters(model)

trainable params: 124584192 || all params: 124584192 || trainable%: 100.00


In [83]:
peftmodel = get_peft_model(model, peft_config)

In [84]:
# print_trainable_parameters() prints its own report and returns None, so call it on its own line
print('After Peft:')
peftmodel.print_trainable_parameters()


After Peft:
trainable params: 2,503,680 || all params: 127,087,872 || trainable%: 1.9700384943104563
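The trainable-parameter count can be reproduced by hand. The sketch below is a back-of-the-envelope check under stated assumptions: GPT-2 small has 12 layers with hidden size 768, LoRA is attached to each layer's `c_attn` projection (768 to 2304), the classification head `score` has no bias and 188 outputs (`ClassMax + 1`), and PEFT keeps a trainable copy of that head next to the frozen original. None of these constants are read from the model here.

```python
HIDDEN = 768                 # GPT-2 small hidden size (assumption)
N_LAYERS = 12                # GPT-2 small transformer blocks (assumption)
QKV_OUT = 3 * HIDDEN         # c_attn projects to concatenated query, key, value
RANK = 64                    # r from LoraConfig
NUM_LABELS = 188             # ClassMax + 1
BASE_PARAMS = 124_584_192    # count printed before wrapping the model

# Each adapted layer adds A (HIDDEN x RANK) and B (RANK x QKV_OUT)
lora_per_layer = RANK * HIDDEN + RANK * QKV_OUT
lora_total = N_LAYERS * lora_per_layer

# Trainable copy of the classification head (weight only, no bias)
score_head = NUM_LABELS * HIDDEN

trainable = lora_total + score_head
all_params = BASE_PARAMS + lora_total + score_head
trainable_pct = 100 * trainable / all_params
print(trainable, all_params, round(trainable_pct, 2))
```

Under these assumptions the arithmetic matches the reported 2,503,680 trainable parameters out of 127,087,872.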


In [85]:
training_args = TrainingArguments(
    output_dir='./results_PT',
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs_PT',
    logging_steps=3,
    #learning_rate=LEARNING_RATE
)

trainer = Trainer(
    model=peftmodel,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)
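It can help to translate these arguments into optimizer steps. The sketch below is simple arithmetic, assuming a single device, no gradient accumulation, and a final partial batch that is kept rather than dropped; the row count comes from the train/test split printed earlier.

```python
import math

TRAIN_ROWS = 14437     # size of the training split
BATCH_SIZE = 32        # per_device_train_batch_size
EPOCHS = 10            # num_train_epochs
WARMUP_STEPS = 500     # warmup_steps

steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_fraction = WARMUP_STEPS / total_steps
print(steps_per_epoch, total_steps, round(warmup_fraction, 3))
```

With `logging_steps=3`, a run of this length would log well over a thousand times, so a larger value may be more comfortable to read.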


In [None]:
trainer.train()

In [None]:
trainer.evaluate()

In [86]:
peftmodelDir="/content/drive/MyDrive/ColabData/SAVED-MODELS/gpt2-Classify-PEFT-model"
basemodelDir="/content/drive/MyDrive/ColabData/SAVED-MODELS/gpt2-Classify-BASE-model"


In [None]:
peftmodel.save_pretrained(peftmodelDir,push_to_hub=False)
peftmodel.base_model.save_pretrained(basemodelDir,push_to_hub=False)
tokenizer.save_pretrained(basemodelDir, push_to_hub=False)

In [87]:
from peft import PeftModel,PeftConfig
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    AutoTokenizer,
    TrainingArguments
)
import torch

In [89]:
def load_model(basemodelDir,peftmodelDir):
  bModel=AutoModelForSequenceClassification.from_pretrained(basemodelDir)
  peftModel=PeftModel.from_pretrained(bModel,peftmodelDir,strict=False,remove_module=True)
  return peftModel
def load_tokenizer(basemodelDir):
  tokenizer = AutoTokenizer.from_pretrained(basemodelDir)
  return tokenizer

In [90]:
trained_model = load_model(basemodelDir,peftmodelDir)
tokenizer = load_tokenizer(basemodelDir)

Some weights of the model checkpoint at /content/drive/MyDrive/ColabData/SAVED-MODELS/gpt2-Classify-BASE-model were not used when initializing GPT2ForSequenceClassification: ['transformer.h.7.attn.c_attn.lora_B.default.weight', 'transformer.h.11.attn.c_attn.lora_A.default.weight', 'transformer.h.8.attn.c_attn.lora_B.default.weight', 'score.modules_to_save.default.weight', 'transformer.h.2.attn.c_attn.lora_B.default.weight', 'transformer.h.8.attn.c_attn.lora_A.default.weight', 'transformer.h.3.attn.c_attn.lora_A.default.weight', 'transformer.h.4.attn.c_attn.lora_A.default.weight', 'transformer.h.0.attn.c_attn.lora_A.default.weight', 'transformer.h.2.attn.c_attn.lora_A.default.weight', 'transformer.h.6.attn.c_attn.lora_B.default.weight', 'transformer.h.7.attn.c_attn.lora_A.default.weight', 'score.original_module.weight', 'transformer.h.6.attn.c_attn.lora_A.default.weight', 'transformer.h.5.attn.c_attn.lora_A.default.weight', 'transformer.h.3.attn.c_attn.lora_B.default.weight', 'transform

In [91]:
def generate_text(model,sequence, max_length):
    # Despite the name, this runs a single classification forward pass; max_length is unused
    ids = tokenizer(sequence, return_tensors='pt')
    model.to('cpu')
    model.eval()
    with torch.no_grad():
        final_outputs = model(**ids)  # no labels are needed at inference time
    # The index of the largest logit is the predicted class id
    print('Category is',idx2label[final_outputs.logits.argmax(-1).item()])

In [92]:
desc="TRENDnet 150 Mbps Mini Wireless N USB 2.0 Adapter TEW-648UB Black"
prompt = decorate(desc)
print(prompt)

Classify the following product description.

TRENDnet 150 Mbps Mini Wireless N USB 2.0 Adapter TEW-648UB Black

Product Category: 


In [93]:
generate_text(trained_model,prompt,500)

Category is USB Network Adapters
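The prediction step boils down to picking the largest logit and looking up its name. As an illustrative sketch, the pure-Python code below also converts the logits to probabilities with a softmax, which gives a rough confidence for the chosen category. The logits and the three-entry label map are made up for the example; `softmax` and `top_prediction` are hypothetical helpers.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(logits, id_to_label):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id_to_label[best], probs[best]

# Hypothetical label map and logits, for illustration only
demo_id_to_label = {0: "Mice", 1: "USB Network Adapters", 2: "Inkjet Printer Ink"}
demo_logits = [0.2, 3.1, -1.0]
demo_label, demo_prob = top_prediction(demo_logits, demo_id_to_label)
print(demo_label, round(demo_prob, 3))
```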


In [94]:
from transformers import pipeline
classification_pipe = pipeline("text-classification", model=trained_model, tokenizer=tokenizer ,device='cpu')
output=classification_pipe(prompt)
print(output[0]['label'])

The model 'PeftModelForSequenceClassification' is not supported for text-classification. Supported models are ['AlbertForSequenceClassification', 'BartForSequenceClassification', 'BertForSequenceClassification', 'BigBirdForSequenceClassification', 'BigBirdPegasusForSequenceClassification', 'BioGptForSequenceClassification', 'BloomForSequenceClassification', 'CamembertForSequenceClassification', 'CanineForSequenceClassification', 'LlamaForSequenceClassification', 'ConvBertForSequenceClassification', 'CTRLForSequenceClassification', 'Data2VecTextForSequenceClassification', 'DebertaForSequenceClassification', 'DebertaV2ForSequenceClassification', 'DistilBertForSequenceClassification', 'ElectraForSequenceClassification', 'ErnieForSequenceClassification', 'ErnieMForSequenceClassification', 'EsmForSequenceClassification', 'FalconForSequenceClassification', 'FlaubertForSequenceClassification', 'FNetForSequenceClassification', 'FunnelForSequenceClassification', 'GPT2ForSequenceClassification',

USB Network Adapters
