# Fine-Tuning a Pre-Trained BERT Transformer Multi-Class Classifier for the <b>QueryTypeDetector</b>
This script prepares, trains and saves a transformer-based model (a fine-tuned pre-trained BERT, 'bert-base-uncased') that will <b>categorize the query</b> into one of the following types:<br>
- 0) <b>Retailer</b>: Query is about a retailer (Target, Walmart, ...)
- 1) <b>Brand</b>: Query is about a specific brand (Huggies, Gatorade, ..)
- 2) <b>Category</b>: Query is about an open category (diapers, hand bags, phones, ...)

The purpose of this model is to route each query to the treatment that matches its predicted type. The output is the <b>predicted category</b> and the <b>confidence</b> in that prediction.
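
As an illustration of how a predicted category and a confidence can be derived from a classifier's raw scores, here is a minimal sketch in plain Python. The function name `predict_from_logits` and the example logits are assumptions made for this example; the label names come from the list above.

```python
import math

# Label ids follow the list above.
LABELS = {0: "Retailer", 1: "Brand", 2: "Category"}

def predict_from_logits(logits):
    """Return (label_name, confidence) for one query's raw class scores."""
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The prediction is the class with the highest softmax probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Hypothetical logits for a single query (not real model output).
label, confidence = predict_from_logits([2.0, 0.5, -1.0])
print(label, f"{confidence:.2%}")
```

The confidence here is simply the softmax probability of the winning class, which is also what the manual test cell later in this notebook reports.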

### Notes/Steps
- The <b>data is loaded</b> along with the required libraries (transformers, torch, nltk...). The retailers are in 'offer_retailer.RETAILER', the brands in 'offer_retailer.BRAND' and the categories in 'categories.PRODUCT_CATEGORY' and 'categories.IS_CHILD_CATEGORY_TO'.
- After <b>cleaning the data</b>, I realized that we need many more samples for 'categories', since it is the most generic option and the model benefits from a wide range of observations.
- For this I semi-manually created a 'retail_product_nouns' list of <b>potential category-related words</b> and extended that list with the synonyms of every word (using wordnet from nltk).
- Then the data was <b>balanced</b> (by upsampling the smaller classes via bootstrap) and <b>upscaled</b> (by resampling with replacement from the balanced data).
- After this, the data is <b>tokenized</b> using the pre-trained BERT tokenizer (from HuggingFace) and converted into a convenient format (torch tensors).
- Then, after a few iterations on parameters, I settled on a configuration that lets the <b>model learn reasonably well</b>.
- Finally, the model is <b>saved</b> for production.
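
The steps above do not include a held-out set. Because balancing and upscaling resample with replacement, the same snippet ends up repeated many times, so a split made after resampling would put copies of one snippet on both sides. A minimal sketch of one way to split on unique snippets before any resampling, assuming a hypothetical helper `split_unique` and invented toy data:

```python
import random

def split_unique(texts, labels, test_fraction=0.2, seed=99):
    """Split (text, label) pairs so that no text appears on both sides."""
    # Collapse duplicates first; the first label seen for a text wins.
    unique = {}
    for text, label in zip(texts, labels):
        unique.setdefault(text, label)
    items = list(unique.items())
    # A seeded shuffle keeps the split reproducible.
    random.Random(seed).shuffle(items)
    n_test = max(1, int(len(items) * test_fraction))
    return items[n_test:], items[:n_test]

# Toy data, invented for illustration.
texts = ["TARGET", "HUGGIES", "diapers", "TARGET", "phones", "GATORADE"]
labels = [0, 1, 2, 0, 2, 1]
train_pairs, test_pairs = split_unique(texts, labels)
print(len(train_pairs), len(test_pairs))
```

Balancing and upscaling would then be applied to the training pairs only, so the held-out snippets stay unseen.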

In [1]:
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, TensorDataset
import torch.nn.functional as F
import torch.optim as optim
import torch.nn as nn
import pandas as pd
import numpy as np
import torch
import nltk

nltk.download('wordnet')
from nltk.corpus import wordnet

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [2]:
# A couple of useful functions for later
def isNaN(num):
    return num != num

def getSynonyms(word):
    synonyms = []
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.append(lemma.name().replace('_', ' '))
    # Remove duplicates
    synonyms = list(set(synonyms))
    return synonyms
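
One thing to be aware of in `getSynonyms`: `list(set(...))` does not guarantee a stable order for strings, because Python randomizes string hashing between interpreter runs by default. Since the synonyms are later appended to the training list and sampled by position, the sampled rows may differ from run to run. A minimal sketch of an order-preserving alternative, written against a plain list of lemma names so it runs without wordnet (the helper name `unique_in_order` and the sample names are assumptions):

```python
def unique_in_order(names):
    """Drop duplicates while keeping first-seen order."""
    # Same underscore cleanup as getSynonyms.
    cleaned = (name.replace('_', ' ') for name in names)
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(cleaned))

# Hypothetical lemma names, standing in for what wordnet might return.
lemma_names = ['sofa', 'couch', 'lounge', 'couch', 'love_seat', 'sofa']
print(unique_in_order(lemma_names))
```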

In [3]:
# Loading the data

df_offer_retailer = pd.read_csv('data/offer_retailer.csv')
df_offer_retailer['BRAND'] = [r['BRAND'] if r['BRAND'] != r['RETAILER'] else np.nan for i, r in df_offer_retailer.iterrows()]
df_offer_retailer.head(5)

Unnamed: 0,OFFER,RETAILER,BRAND
0,Spend $50 on a Full-Priced new Club Membership,SAMS CLUB,
1,"Beyond Meat® Plant-Based products, spend $25",,BEYOND MEAT
2,Good Humor Viennetta Frozen Vanilla Cake,,GOOD HUMOR
3,"Butterball, select varieties, spend $10 at Dil...",DILLONS FOOD STORE,BUTTERBALL
4,"GATORADE® Fast Twitch®, 12-ounce 12 pack, at A...",AMAZON,GATORADE
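
The row-by-row `iterrows` comprehension above blanks out BRAND whenever it equals RETAILER. The same rule can be sketched as a vectorized operation with `Series.where`, which keeps a value where the condition is true and writes NaN elsewhere. The toy frame below is invented for illustration; only the column names are taken from the data.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration; column names match the data above.
df = pd.DataFrame({
    'RETAILER': ['AMAZON', np.nan, 'SAMS CLUB'],
    'BRAND': ['GATORADE', 'GOOD HUMOR', 'SAMS CLUB'],
})

# Keep BRAND where it differs from RETAILER; otherwise replace it with NaN.
df['BRAND'] = df['BRAND'].where(df['BRAND'] != df['RETAILER'])
print(df)
```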


In [4]:
df_cat = pd.read_csv('data/categories.csv')
df_cat.head(5)

Unnamed: 0,CATEGORY_ID,PRODUCT_CATEGORY,IS_CHILD_CATEGORY_TO
0,1f7d2fa7-a1d7-4969-aaf4-1244f232c175,Red Pasta Sauce,Pasta Sauce
1,3e48a9b3-1ab2-4f2d-867d-4a30828afeab,Alfredo & White Pasta Sauce,Pasta Sauce
2,09f3decc-aa93-460d-936c-0ddf06b055a3,Cooking & Baking,Pantry
3,12a89b18-4c01-4048-94b2-0705e0a45f6b,Packaged Seafood,Pantry
4,2caa015a-ca32-4456-a086-621446238783,Feminine Hygeine,Health & Wellness


In [5]:
# Augmenting the data

# Collect the non-missing values of each source column
retailers = df_offer_retailer['RETAILER'].dropna().tolist()
brands = df_offer_retailer['BRAND'].dropna().tolist()
categories = df_cat['PRODUCT_CATEGORY'].dropna().tolist()
parent_categories = df_cat['IS_CHILD_CATEGORY_TO'].dropna().tolist()

# Labels: 0 = retailer, 1 = brand, 2 = category (both category columns)
text_snippets = retailers + brands + categories + parent_categories
labels = [0] * len(retailers) + [1] * len(brands) + [2] * (len(categories) + len(parent_categories))

# Let's use these AI-generated retail-associated nouns
retail_product_nouns = ['shampoo','conditioner','soap','toothpaste','toothbrush','lotion','perfume','deodorant','razor','clothes','shoes','sandals','hat','sunglasses',
'watch','jewelry','necklace','bracelet','earrings','handbag','wallet','backpack','laptop','smartphone','camera','charger','headphones','television','furniture','table','chair','sofa',
'bed','mattress','pillow','blanket','cookware','utensils','plates','glasses','mugs','dishwasher','microwave','oven','fridge','freezer','groceries','vegetables','fruits','meat',
'fish','bread','milk','cheese','eggs','cereal','coffee','tea','soda','juice','wine','beer','liquor','snacks','chips','chocolate','candy','toys','games','puzzles','bicycle',
'scooter','treadmill','weights','yoga_mat','books','magazines','stationery','pen','pencil','notebook','printer','ink','lawnmower','grill','fertilizer','plant','flower','pet_food',
'leash','aquarium','birdcage','cleaning_supplies','broom','mop','vacuum','detergent','bleach','paper_towels'
]

# And let's enhance them with synonyms using wordnet
new_nouns = []
for word in retail_product_nouns:
    new_nouns += getSynonyms(word.replace('_', ' '))
    
retail_product_nouns += new_nouns

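# Replace underscores so multi-word nouns ('yoga_mat') read like real queries ('yoga mat')
retail_product_nouns = [w.replace('_', ' ') for w in retail_product_nouns]
text_snippets += retail_product_nouns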
labels += [2 for _ in range(len(retail_product_nouns))]

In [6]:
# Balancing the data

df_data = pd.DataFrame({'text_snippets' : text_snippets, 'labels' : labels})

class_counts = {
    0:len([1 for l in labels if l == 0]),
    1:len([1 for l in labels if l == 1]),
    2:len([1 for l in labels if l == 2])
}

# print(class_counts)

# Sort the dictionary by value in descending order
sorted_items = sorted(class_counts.items(), key=lambda x: x[1], reverse=True)
largest_value = sorted_items[0][1]
keys_of_rest = [key for key, value in sorted_items if value != largest_value]

for key in keys_of_rest:
    df_data = pd.concat([df_data, df_data[df_data['labels'] == key].sample(largest_value - class_counts[key], replace=True, random_state=99)], ignore_index=True)

# print(df_data['labels'].value_counts())

data_augm = df_data.sample(10000, replace=True, random_state=99).reset_index(drop=True)

text_snippets = [t for t in data_augm['text_snippets']]
labels = [l for l in data_augm['labels']]

print(data_augm.shape)
data_augm.head()

(10000, 2)


Unnamed: 0,text_snippets,labels
0,Deli & Bakery,2
1,ballock,2
2,BEYOND MEAT,1
3,SARGENTO,1
4,specs,2
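
The balancing step in the cell above can also be written as a small reusable function. This is a sketch of the same bootstrap idea (upsample every smaller class with replacement until it matches the largest one), using `value_counts` instead of a hand-built dictionary so it works for any number of classes. The function name `balance_by_upsampling` and the toy frame are assumptions made for this example.

```python
import pandas as pd

def balance_by_upsampling(df, label_col='labels', seed=99):
    """Upsample smaller classes with replacement to the size of the largest."""
    counts = df[label_col].value_counts()
    target = counts.max()
    parts = [df]
    for label, count in counts.items():
        if count < target:
            # Bootstrap the missing rows from the existing rows of this class.
            extra = df[df[label_col] == label].sample(
                n=target - count, replace=True, random_state=seed)
            parts.append(extra)
    return pd.concat(parts, ignore_index=True)

# Toy data, invented for illustration.
toy = pd.DataFrame({
    'text_snippets': ['TARGET', 'HUGGIES', 'GATORADE', 'diapers', 'phones', 'soap', 'tea'],
    'labels': [0, 1, 1, 2, 2, 2, 2],
})
balanced = balance_by_upsampling(toy)
print(balanced['labels'].value_counts().to_dict())
```

Note that bootstrapping only repeats existing rows, so it evens out class frequencies without adding new information about the smaller classes.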


In [7]:
# Tokenizing the text_snippets

# Initialize a pre-trained BertTokenizer (uncased) from Hugging Face.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

input_ids = [tokenizer.encode(text, add_special_tokens=True) for text in text_snippets]
input_ids = [torch.tensor(ids) for ids in input_ids]

Downloading: 100%|██████████| 232k/232k [00:00<00:00, 1.66MB/s]
Downloading: 100%|██████████| 28.0/28.0 [00:00<00:00, 39.6kB/s]
Downloading: 100%|██████████| 466k/466k [00:00<00:00, 2.07MB/s]
Downloading: 100%|██████████| 570/570 [00:00<00:00, 977kB/s]
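
The training cell pads every sequence with zeros up to the longest one. BERT models from Hugging Face also accept an `attention_mask` argument that marks which positions are real tokens (1) and which are padding (0). A minimal sketch of building both at once in plain Python, assuming a hypothetical helper `pad_with_mask` and made-up token ids:

```python
def pad_with_mask(token_id_lists, pad_id=0):
    """Right-pad each id list to the longest length and build matching masks."""
    max_len = max(len(ids) for ids in token_id_lists)
    padded, masks = [], []
    for ids in token_id_lists:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks

# Hypothetical token ids; 101 and 102 are [CLS] and [SEP] in bert-base-uncased.
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
padded, masks = pad_with_mask(batch)
print(padded)
print(masks)
```

The masks could then be converted to a tensor and passed to the model alongside the ids. Calling the tokenizer directly on a list of texts with `padding=True` is another way to obtain both in one step.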


In [8]:
# TRAINING (FINE TUNING) THE MODEL
# My CUDA memory is too small, so the device is forced to CPU for now; the commented-out expression is ready for when a big enough card is available.

device = "cpu" #torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Tokenize and pad tokens
input_ids = [tokenizer.encode(text, add_special_tokens=True) for text in text_snippets]
max_len = max(len(ids) for ids in input_ids)
input_ids = [ids + [0] * (max_len - len(ids)) for ids in input_ids]

# Convert to PyTorch tensors and move to the device
input_ids = torch.tensor(input_ids).to(device)
labels = torch.tensor(labels).to(device)

# Create DataLoader
train_data = TensorDataset(input_ids, labels)
train_loader = DataLoader(train_data, batch_size=2)  

# Initialize a sequence classification model with 3 labels (0=retailer, 1=brand, 2=category)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3).to(device)
optimizer = optim.AdamW(model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()

# Fine-tuning the model
for epoch in range(1):  # We can increase epochs later when there is more data.
    # Use batch-local names so the full input_ids/labels tensors are not overwritten
    for batch_input_ids, batch_labels in train_loader:
        optimizer.zero_grad()
        logits = model(batch_input_ids)[0]  # first element of the output holds the logits
        loss = criterion(logits, batch_labels)
        loss.backward()
        optimizer.step()
        print(f"Epoch {epoch}, Loss {loss.item()}")

Using device: cpu


Downloading: 100%|██████████| 440M/440M [01:31<00:00, 4.80MB/s] 
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenc

Epoch 0, Loss 1.0223584175109863
Epoch 0, Loss 1.12919020652771
Epoch 0, Loss 0.9884907007217407
Epoch 0, Loss 1.5102605819702148
Epoch 0, Loss 1.2039930820465088
Epoch 0, Loss 1.1520724296569824
Epoch 0, Loss 0.9880199432373047
Epoch 0, Loss 1.001156210899353
Epoch 0, Loss 1.1761229038238525
Epoch 0, Loss 0.9695389270782471
Epoch 0, Loss 1.1482722759246826
Epoch 0, Loss 0.9538224935531616
Epoch 0, Loss 1.116461157798767
Epoch 0, Loss 1.0817985534667969
Epoch 0, Loss 1.3178666830062866
Epoch 0, Loss 1.0828416347503662
Epoch 0, Loss 1.2218756675720215
Epoch 0, Loss 1.01663339138031
Epoch 0, Loss 1.2555409669876099
Epoch 0, Loss 1.247986078262329
Epoch 0, Loss 1.0428704023361206
Epoch 0, Loss 1.048269510269165
Epoch 0, Loss 1.0629730224609375
Epoch 0, Loss 1.1511908769607544
Epoch 0, Loss 1.1426758766174316
Epoch 0, Loss 0.9759700298309326
Epoch 0, Loss 1.1915857791900635
Epoch 0, Loss 1.0539774894714355
Epoch 0, Loss 1.062255620956421
Epoch 0, Loss 0.989680290222168
Epoch 0, Loss 1.0513

KeyboardInterrupt: 
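
With a batch size of 2, the per-batch loss printed above is noisy, which makes it hard to tell from the raw log whether training is improving. A minimal sketch of smoothing the log with a running mean over the most recent batches (the helper name `running_mean` and the window size are assumptions; the sample values are the first five losses from the log above):

```python
from collections import deque

def running_mean(values, window=50):
    """Yield the mean of the last `window` values after each new value."""
    recent = deque(maxlen=window)  # old values fall out automatically
    for value in values:
        recent.append(value)
        yield sum(recent) / len(recent)

# First five batch losses copied from the log above.
losses = [1.0223584175109863, 1.12919020652771, 0.9884907007217407,
          1.5102605819702148, 1.2039930820465088]
smoothed = list(running_mean(losses, window=3))
print([round(s, 4) for s in smoothed])
```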

In [13]:
# A script to test the model manually
model.to(device)
model.eval()  # make sure dropout is disabled for inference

# Test the fine-tuned model
with torch.no_grad():
    test_text = "walmart"
    input_ids = tokenizer.encode(test_text, add_special_tokens=True)
    input_ids = input_ids + [0] * (max_len - len(input_ids))
    
    # Move the input tensor to the chosen device
    input_ids = torch.tensor([input_ids]).to(device)  # Add batch dimension and move to device
    
    output = model(input_ids)[0]
    probabilities = torch.softmax(output, dim=1)
    
    # Move tensors back to CPU for NumPy operations or native Python numerical operations, if needed
    probabilities = probabilities.to("cpu")
    
    prediction = torch.argmax(probabilities, dim=1)
    confidence = probabilities[0][prediction].item()  # Get the value at the predicted index
    
    print("Prediction:", prediction.item())  # Outputs 0 for "Retailer", 1 for "Brand", 2 for "Category"
    print(f"Confidence: {confidence * 100:.2f}%")  # Outputs the confidence of the prediction in percentage


Prediction: 0
Confidence: 99.80%


In [14]:
# Save model parameters
torch.save(model.state_dict(), "models/classify_request.pth")
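
A `state_dict` stores only the weights, so loading it later means rebuilding a model with the same architecture first (here, a `BertForSequenceClassification` with 3 labels) and then calling `load_state_dict`. The sketch below shows that round trip with a tiny `nn.Linear` layer as a stand-in for the BERT model and an in-memory buffer instead of a file, so it runs without the checkpoint; both substitutions are assumptions made for this example.

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(0)
trained = nn.Linear(4, 3)  # stand-in for the fine-tuned model

# Save only the parameters, as in the cell above (buffer instead of a file path).
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# Rebuild the same architecture, then load the saved parameters into it.
restored = nn.Linear(4, 3)
restored.load_state_dict(torch.load(buffer, map_location="cpu"))
restored.eval()  # inference mode

print(torch.equal(trained.weight, restored.weight))
```

`map_location="cpu"` lets weights saved on a GPU machine be loaded on a CPU-only one.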