<a href="https://colab.research.google.com/github/astromad/MyDeepLearningRepo/blob/master/ProductClassification_Hackathon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Product Classification:**
Product classification is a challenging task for many companies. With thousands of new products being added to e-commerce sites, products that are not categorized properly will not show up for the right customers and, as a result, will not sell. Automating this categorization is an active area of machine learning. In this project, I would like to explore how deep learning and Transformers can solve the classification problem more efficiently.
This involves:
* Preparing the training data
* Building the transformer model
* Training the model
* Testing and measuring the accuracy of the model




Clean up the cache, results and logs left over from previous training runs


In [1]:
!rm -rf Classification_cache
!rm -rf results_PT
!rm -rf logs_PT


Installing the Hugging Face Transformers library

In [2]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.10.3-py3-none-any.whl (2.8 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: Py

Loading the training data, which in our case is an Amazon product dataset.
The dataset is in CSV format and has 3 columns:


*   Product Category
*   Product Label
*   Product Description

In this section, we will

*   Read the dataset and load it as a pandas DataFrame
*   Clean the data by removing entries with a null category



In [3]:
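import pandas as pd

# error_bad_lines=False skips malformed rows. pandas 1.3+ deprecates it in
# favour of on_bad_lines="skip", and pandas 2.0 removes it.
df = pd.read_csv("/content/drive/My Drive/ColabData/Amazon.csv",
                encoding="ISO-8859-1", error_bad_lines=False)

# Keep the three columns we need; .copy() avoids a SettingWithCopyWarning below
data = df[['category', 'label_title', 'label_description']].copy()
data.dropna(subset=['category'], inplace=True)
print(data.head(3))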


                category  ...                                  label_description
0  Headphone Accessories  ...  The pocket-size Koss 3-Band Equalizer delivers...
1     Inkjet Printer Ink  ...  Kodak Black Ink Cartridge 10B is a standard bl...
2  Computers Accessories  ...  1GB - 333MHz DDR333 PC2700 - DDR SDRAM - 184-p...

[3 rows x 3 columns]


Now it's time to do some cleanup and remove rare categories. The data currently has 706 unique categories, but many of them contain 20 or fewer products. We drop those categories, mainly to improve training time by focusing on categories with a larger number of products.
With this, our category count drops to fewer than 200.
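
As a small illustration of this filtering step, the sketch below applies the same `value_counts` pattern to a made-up frame. The category counts and the threshold of 2 are assumptions chosen only to keep the example tiny.

```python
import pandas as pd

# Toy data (hypothetical): one frequent category and two rare ones.
toy = pd.DataFrame({
    "category": ["USB Cables"] * 5 + ["9V"] * 2 + ["Wires"],
    "label_title": [f"item {i}" for i in range(8)],
})

min_count = 2  # assumed threshold for the toy data; the notebook uses 20
counts = toy["category"].value_counts()
rare = counts[counts <= min_count].index      # categories to drop
kept = toy[~toy["category"].isin(rare)]       # rows in the remaining categories

print(kept["category"].unique())
```

Only the rows of the frequent category survive; the same three lines scale unchanged to the full dataset.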

In [4]:
print(data.groupby('category').count() )

value_counts = data['category'].value_counts()
to_remove = value_counts[value_counts <= 20].index
data = data[~data.category.isin(to_remove)]

print(data.groupby('category').count() )

                           label_title  label_description
category                                                 
12V                                  1                  1
6V                                   4                  4
9V                                   6                  6
A                                    2                  2
AA                                  22                 22
...                                ...                ...
Wires                                1                  1
Wiring Harnesses                    20                 20
Wrist Rests                         17                 17
eBook Readers                       12                 12
eBook Readers Accessories            6                  6

[706 rows x 2 columns]
                        label_title  label_description
category                                              
AA                               22                 22
AC Adapters                      38                 38
Ac

To use this data, the target class must be numerical before we can feed it into ML/DL models, so we convert each category to an integer ID.

In [5]:

encode_dict = {}

def encode_label(x):
    # Assign the next free integer the first time a category is seen
    if x not in encode_dict:
        encode_dict[x] = len(encode_dict)
    return encode_dict[x]

data['encoded_category'] = data['category'].apply(encode_label)
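
For comparison, pandas has a built-in that numbers labels in order of first appearance, which is the same scheme `encode_label` implements by hand. The sketch below is only an alternative on made-up values (the `Mice` category and the `label2idx_demo` name are assumptions), not what the notebook ran.

```python
import pandas as pd

# Hypothetical category column with a repeated value
categories = pd.Series(["Headphone Accessories", "Inkjet Printer Ink",
                        "Headphone Accessories", "Mice"])

# factorize returns integer codes plus the unique labels, in order of first appearance
codes, uniques = pd.factorize(categories)

# The name -> ID table falls out of the uniques directly
label2idx_demo = {label: i for i, label in enumerate(uniques)}

print(codes.tolist())
print(label2idx_demo)
```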

Our data has two text fields, the label title and the label description. We merge them into a single text field to feed to the model for classification.

In [6]:
newData=pd.DataFrame()
newData['desc']=data['label_title'] +' '+ data['label_description'] 
newData['encoded_category']=data['encoded_category']
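
One detail worth knowing here: in pandas, `+` between text columns yields a missing value whenever either side is missing, so a product with a title but no description ends up with no `desc` at all. The sketch below shows that behaviour on made-up rows, together with one possible alternative (treating missing parts as empty text) in case you would rather keep such rows. Whether that is desirable for this dataset is an assumption.

```python
import pandas as pd

# Hypothetical rows: complete, missing description, missing title
toy = pd.DataFrame({
    "label_title": ["Koss EQ50", "Kodak Ink", None],
    "label_description": ["3-band equalizer", None, "1GB DDR module"],
})

# Plain concatenation: any missing part makes the whole result missing
plain = toy["label_title"] + " " + toy["label_description"]

# Alternative: treat missing parts as empty text, then trim stray spaces
filled = (toy["label_title"].fillna("") + " " +
          toy["label_description"].fillna("")).str.strip()

print(plain.isna().tolist())
print(filled.tolist())
```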


Resetting the index of our data, since the rows we removed (null and rare categories) left gaps in it

In [7]:
print(newData[:21])
newData = newData.reset_index(drop=True)
print(newData[:21])

                                                 desc  encoded_category
0   Koss EQ50 3-Band Stereo Equalizer The pocket-s...                 0
1   Kodak Black Ink Cartridge 10B 1163641 Kodak Bl...                 1
2   Kingston 128MX64 PC2700 COMPAQ Evo D320 KTC-D3...                 2
3   Kinamax MS-UES2 Mini High Precision USB 3-Butt...                 3
4   Kensington K72349US Wireless Mouse for Netbook...                 3
5   Kensington BlackBelt Protection Band for iPad ...                 4
6   JUST5 J509 Easy to Use Unlocked Cell Phone wit...                 5
7   Imation Corp 50PK CDR 700MB 80MIN 52X-SPINDLE ...                 6
8   16x DVD-R Media Imation 16x DVD-R Media 17340 ...                 7
9   iGo Arctic Laptop Cooling Pad AC05065-0001 Eve...                 8
10  HP TouchPad Custom Fit Case Protect your HP To...                 9
11  HP LaserJet Pro P1606dn Printer CE749A BGJ WHY...                10
12  HP 85A LaserJet Black Toner Print Cartridge - ...           

Delete any rows whose description is null

In [8]:
newData.dropna(subset=['desc'], inplace=True)
nan_rows = newData[newData.isnull().any(axis=1)]
print(nan_rows)

Empty DataFrame
Columns: [desc, encoded_category]
Index: []


Preprocessing the description text: remove punctuation, lowercase, and remove stop words.
Note: we do not lemmatize, since BERT's WordPiece tokenizer already breaks words into subword units.

In [9]:
newData.loc[20,'desc']

'EDGE SD Gaming Cards - Flash memory card - 1 GB - 130x - SD Edge Tech Corp 1GB Secure Digital SD Gaming Card EDGDM-222666-PE Flash Memory'

In [10]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [11]:

# Strip punctuation and lowercase; regex=True is required on newer pandas, where the default changed
newData['desc'] = newData.desc.str.replace(r"[^\w\s]", "", regex=True).str.lower()
#newData['desc']=newData.desc.str.replace('\d+', '')
# Drop English stop words; split()/join also collapses repeated spaces
stop_set = set(stop)
newData['desc'] = newData['desc'].apply(lambda x: ' '.join(item for item in x.split() if item not in stop_set))

In [12]:
newData.loc[20,'desc']

'edge sd gaming cards flash memory card 1 gb 130x sd edge tech corp 1gb secure digital sd gaming card edgdm222666pe flash memory'
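
The same cleaning can be wrapped in a small function, which is handy if you want to apply it to a single string (for example a user query at prediction time) rather than a whole column. This is a minimal sketch: `clean_text` and the tiny `STOP_WORDS` set are assumptions standing in for the NLTK list used above.

```python
import re

# Tiny stand-in stop-word list (assumption); the notebook uses NLTK's English list
STOP_WORDS = {"a", "an", "the", "for", "of", "and", "is", "to", "in"}

def clean_text(text: str) -> str:
    """Drop punctuation, lowercase, and remove stop words."""
    no_punct = re.sub(r"[^\w\s]", "", text).lower()
    # split() with no argument also collapses runs of whitespace
    return " ".join(w for w in no_punct.split() if w not in STOP_WORDS)

print(clean_text("The pocket-size Koss 3-Band Equalizer"))
```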

Lookup tables to convert a category name to its numerical ID and back

In [13]:

# encode_dict already maps category name -> integer ID
label2idx = dict(encode_dict)
idx2label = {v: k for k, v in label2idx.items()}

Find out the number of product categories in our dataset after pre-processing. The value printed is the largest encoded ID, so the number of categories is this value plus one.

In [14]:
ClassMax=newData['encoded_category'].max()
print(ClassMax)


187


Now let's split the dataset into Training and Test by 80/20

In [15]:
train_size = 0.8
train_dataset=newData.sample(frac=train_size,random_state=200)
test_dataset=newData.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)


print("FULL Dataset: {}".format(newData.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

FULL Dataset: (18046, 2)
TRAIN Dataset: (14437, 2)
TEST Dataset: (3609, 2)
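
`DataFrame.sample` draws rows at random, so nothing guarantees that every category is represented proportionally in both splits. If that matters, a stratified split is one option. The sketch below uses scikit-learn's `train_test_split` on a made-up frame; the toy data and the `train_df`/`test_df` names are assumptions, and this is not what the notebook ran.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (hypothetical): 3 classes with 10 rows each
toy = pd.DataFrame({
    "desc": [f"product {i}" for i in range(30)],
    "encoded_category": [i % 3 for i in range(30)],
})

train_df, test_df = train_test_split(
    toy,
    test_size=0.2,
    random_state=200,
    stratify=toy["encoded_category"],   # keep class proportions in both splits
)
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

print(test_df["encoded_category"].value_counts().to_dict())
```

Stratification needs at least two rows per class, which the earlier filter (more than 20 products per category) already ensures.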


Defining model parameters. Here we set the maximum sequence length to 128 tokens, truncating anything beyond that, and define a learning rate. Note that `LEARNING_RATE` is not actually passed to the trainer: the corresponding line in the training arguments is commented out, so the Trainer default applies.

We will also define the tokenizer and model details. Here we use the pre-trained BERT uncased model and, through transfer learning, fine-tune it on our own training data.

In [16]:
MAX_LEN = 128
LEARNING_RATE = 3e-02

In [17]:
from transformers import (
    AutoConfig,
    AutoTokenizer
)
model_args = dict()
model_args['model_name'] = 'bert-base-uncased' 
model_args['cache_dir'] = "Classification_cache/"
model_args['do_basic_tokenize'] = False

config = AutoConfig.from_pretrained(
    model_args['model_name'],
    cache_dir=model_args['cache_dir'],
    return_dict=True,
    num_labels=ClassMax+1
)

tokenizer = AutoTokenizer.from_pretrained(
    model_args['model_name'],
    cache_dir=model_args['cache_dir'],
    is_pretokenized=model_args['do_basic_tokenize'],
    do_basic_tokenize = model_args['do_basic_tokenize']
)


Let's define a dataset class that produces input in the form the transformer model understands. For each description and category it returns 4 fields:
* input_ids
* token_type_ids
* attention_mask
* labels

We use tokenizer.encode_plus to tokenize each description into word pieces, pad or truncate it to the maximum length, and attach the corresponding label.

We do this for both the Training and Test datasets
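
To make the padding and truncation concrete, here is a plain-Python sketch of what happens to a list of token IDs. It does not call the tokenizer; `pad_or_truncate` is a hypothetical helper, and the IDs 101, 102 and 0 are the `[CLS]`, `[SEP]` and padding IDs visible in the record printed later in this notebook.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0   # special-token ids for bert-base-uncased

def pad_or_truncate(token_ids, max_len):
    """Wrap ids in [CLS]/[SEP], cut to max_len, then pad and build the mask."""
    body = token_ids[: max_len - 2]          # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)    # 1 = real token
    n_pad = max_len - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad            # 0 = padding, ignored by attention
    token_type_ids = [0] * max_len           # single-sentence input: all zeros
    return {"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_mask}

example = pad_or_truncate([6522, 2047, 1051], max_len=8)   # made-up short input
print(example["input_ids"])
print(example["attention_mask"])
```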

In [18]:
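import torch

class TorchClassificationDataset(torch.utils.data.Dataset):
    """Wraps a DataFrame with 'desc' and 'encoded_category' columns for the Trainer."""
    def __init__(self,dataset,max_len):
        self.len = len(dataset)
        self.data = dataset
        self.max_len=max_len
    def __getitem__(self, idx):
        description = str(self.data.desc[idx])
        # NOTE: this slice cuts the text to max_len *characters*, not tokens;
        # the token limit itself is enforced by truncation=True below.
        description = description[:self.max_len]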
        inputs = tokenizer.encode_plus(
            description,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            return_token_type_ids=True,
            truncation=True
        )
        item ={}
        item['input_ids']=torch.tensor(inputs['input_ids'], dtype=torch.long)
        item['token_type_ids']=torch.tensor(inputs['token_type_ids'], dtype=torch.long)
        item['attention_mask']=torch.tensor(inputs['attention_mask'], dtype=torch.long)
        item['labels'] = torch.tensor(self.data.encoded_category[idx], dtype=torch.long)
        return item

    def __len__(self):
        return self.len

In [19]:
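def createDataset(framework='pt'):
  # Only PyTorch is supported; fail clearly instead of hitting an unbound variable
  if framework != 'pt':
    raise ValueError("Unsupported framework: {}".format(framework))
  train_ds = TorchClassificationDataset(train_dataset,MAX_LEN)
  test_ds = TorchClassificationDataset(test_dataset,MAX_LEN)
  return train_ds,test_ds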

Now that the data is in the format the sequence classification model expects, let's prepare for training. The data needs to be fed to the model in batches, so each record is converted to tensors and returned as a dictionary that the PyTorch data loader can read. The cell below builds both datasets and prints one training record.

In [20]:
train_ds,test_ds = createDataset('pt')
print('One record of Training dataset')
print(train_dataset.loc[1,'desc'])
print('----')
print(train_ds[1])


One record of Training dataset
hp new oem 3500 3700 fuser kit q3655a q3655a hp oem 3500 3700 fuser kit hp oem genuine sold 90 day warranty
----
{'input_ids': tensor([  101,  6522,  2047,  1051,  6633,  8698,  2692, 16444,  2692, 19976,
         2099,  8934,  1053, 21619, 24087,  2050,  1053, 21619, 24087,  2050,
         6522,  1051,  6633,  8698,  2692, 16444,  2692, 19976,  2099,  8934,
         6522,  1051,  6633, 10218,  2853,  3938,  2154, 10943,  2100,   102,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0

In [21]:
!pip install seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16181 sha256=c64e653a18b458db73dfca6f57cdaf4488c17e12ce93c8d72b59aebf2333fe4d
  Stored in directory: /root/.cache/pip/wheels/05/96/ee/7cac4e74f3b19e3158dce26a20a1c86b3533c43ec72a549fd7
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


We evaluate the performance of the model with the Accuracy, Precision, Recall and F1 Score metrics.
Here I am using the sklearn metrics library to compute them, macro-averaged over the categories.

In [22]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
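
Macro averaging gives every category the same weight regardless of how many products it has, so mistakes on rare categories can pull macro F1 below accuracy. The toy labels below are made up purely to show that effect with the same two sklearn calls.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 0, 0, 0, 1, 1]   # toy labels: class 0 is common, class 1 is rare
y_pred = [0, 0, 0, 0, 0, 1]   # one rare-class item is misclassified

acc = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")

# Accuracy counts every sample equally; macro F1 averages the per-class F1 scores
print(round(acc, 3), round(f1, 3))
```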

In [23]:
# from torch import cuda
# device = 'cuda' if cuda.is_available() else 'cpu'

OK. As you have seen, the majority of a machine learning task is getting the data ready for the model to train on. Now let's use Hugging Face's Trainer module to train the model.

In [24]:
from transformers import (
    AutoModelForSequenceClassification,
    #BertForSequenceClassification,
    Trainer,
    TrainingArguments
)
model = AutoModelForSequenceClassification.from_pretrained(
    model_args['model_name'],
    config=config,
    cache_dir=model_args['cache_dir'],
)
training_args = TrainingArguments(
    output_dir='./results_PT',          
    num_train_epochs=20,              
    per_device_train_batch_size=32,  
    per_device_eval_batch_size=32,   
    warmup_steps=500,                
    weight_decay=0.01,               
    logging_dir='./logs_PT',            
    logging_steps=3,
    #learning_rate=LEARNING_RATE
)

trainer = Trainer(
    model=model,                         
    args=training_args,                  
    train_dataset=train_ds,        
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,  
)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [25]:
# Let's train the model now
trainer.train()

***** Running training *****
  Num examples = 14437
  Num Epochs = 20
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 9040
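
The step count in this log can be checked by hand from the dataset size, batch size and epoch count. The sketch below also shows how the 500 warmup steps would shape the learning rate, assuming a linear warmup followed by linear decay and a base rate of 5e-5. As far as I know these are the Trainer defaults that apply here since `learning_rate` is commented out, but treat both as assumptions; `linear_schedule` is a hypothetical helper, not a library function.

```python
import math

num_examples = 14437      # from the training log
batch_size = 32
epochs = 20
warmup_steps = 500        # from TrainingArguments

steps_per_epoch = math.ceil(num_examples / batch_size)   # last batch is partial
total_steps = steps_per_epoch * epochs

def linear_schedule(step, base_lr=5e-5):
    """Linear warmup to base_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(steps_per_epoch, total_steps)
print(linear_schedule(250), linear_schedule(500), linear_schedule(total_steps))
```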


Step,Training Loss
3,5.3769
6,5.3431
9,5.3435
12,5.4049
15,5.3234
18,5.3504
21,5.3063
24,5.3284
27,5.2469
30,5.3003


Saving model checkpoint to ./results_PT/checkpoint-500
Configuration saved in ./results_PT/checkpoint-500/config.json
Model weights saved in ./results_PT/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ./results_PT/checkpoint-1000
Configuration saved in ./results_PT/checkpoint-1000/config.json
Model weights saved in ./results_PT/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to ./results_PT/checkpoint-1500
Configuration saved in ./results_PT/checkpoint-1500/config.json
Model weights saved in ./results_PT/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to ./results_PT/checkpoint-2000
Configuration saved in ./results_PT/checkpoint-2000/config.json
Model weights saved in ./results_PT/checkpoint-2000/pytorch_model.bin
Saving model checkpoint to ./results_PT/checkpoint-2500
Configuration saved in ./results_PT/checkpoint-2500/config.json
Model weights saved in ./results_PT/checkpoint-2500/pytorch_model.bin
Saving model checkpoint to ./results_PT/checkpoint-30

TrainOutput(global_step=9040, training_loss=0.6762181623101974, metrics={'train_runtime': 3796.067, 'train_samples_per_second': 76.063, 'train_steps_per_second': 2.381, 'total_flos': 1.902438965250048e+16, 'train_loss': 0.6762181623101974, 'epoch': 20.0})

In [26]:
# modelTest = AutoModelForSequenceClassification.from_pretrained("results_PT/checkpoint-1000")
# modelTest.eval()

In [27]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 3609
  Batch size = 32


  _warn_prf(average, modifier, msg_start, len(result))


{'epoch': 20.0,
 'eval_accuracy': 0.7071210861734553,
 'eval_f1': 0.6602830095279952,
 'eval_loss': 2.056774616241455,
 'eval_precision': 0.6669142467070474,
 'eval_recall': 0.6809370707502767,
 'eval_runtime': 15.4124,
 'eval_samples_per_second': 234.163,
 'eval_steps_per_second': 7.332}

In [28]:
predictions, label_ids, metrics = trainer.predict(test_ds)
for key, value in metrics.items():
    print( key, value)

***** Running Prediction *****
  Num examples = 3609
  Batch size = 32


test_loss 2.056774616241455
test_accuracy 0.7071210861734553
test_f1 0.6602830095279952
test_precision 0.6669142467070474
test_recall 0.6809370707502767
test_runtime 15.4072
test_samples_per_second 234.241
test_steps_per_second 7.334


  _warn_prf(average, modifier, msg_start, len(result))


In [32]:
#inputs = tokenizer("Any good flash memory for my house", return_tensors="pt")
inputs = tokenizer("I am looking for a 6 feet long USB Cable", return_tensors="pt")

print(inputs)
labels = torch.tensor([10]).unsqueeze(0)
print(labels)

{'input_ids': tensor([[  101,  1045,  2572,  2559,  2005,  1037,  1020,  2519,  2146, 18833,
          5830,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
tensor([[10]])


In [33]:
model.to('cpu')
outputs = model(**inputs, labels=labels)
print(outputs.loss)
pred=outputs.logits.argmax(-1)
print('prediction=',pred,idx2label[pred.item()])

tensor(10.6371, grad_fn=<NllLossBackward>)
prediction= tensor([29]) USB Cables
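
`argmax` returns only the single best category. To see how confident the model is, the logits can be turned into probabilities with a softmax and the top few candidates listed. The sketch below does this in plain Python on made-up logits; `top_k` is a hypothetical helper, and with the model above one could pass `outputs.logits[0].tolist()` and map the indices through `idx2label`.

```python
import math

def top_k(logits, k=3):
    """Softmax over a list of logits, then return the k most likely (index, prob) pairs."""
    m = max(logits)                                   # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(enumerate(probs), key=lambda p: p[1], reverse=True)
    return ranked[:k]

# Hypothetical logits for a 4-class problem
ranked = top_k([0.2, 3.1, -1.0, 1.5], k=2)
for idx, prob in ranked:
    print(idx, round(prob, 3))
```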
