# NER model to recognise brands in product titles

The goal is to build an NER model that recognises brands in product titles, so that it can be used to generate product catalogues from noisy purchase orders.

We will try two approaches. The first uses the spaCy NER framework with a custom training set.

The second uses a hand-built model, Embedding (Word2Vec) > BiLSTM > CRF, implemented with Keras (in a second notebook).

We will compare both models and try to deploy the most robust and efficient one.

## 1 - Using spaCy custom NER

### 1.1 Create a training set

Starting from a CSV file of English eBay products with title and brand columns, let's generate a training set in the format expected by the spaCy NER model.

In [1]:
import pandas as pd
import numpy as np
import json
import tqdm

In [2]:
# Train path of csv file with title,brand
train_path = 'train.csv'


In [3]:

raw_data = pd.read_csv(train_path).sample(10000)
raw_data = raw_data[['title','brand','root_cat']]

In [4]:
raw_data

Unnamed: 0,title,brand,root_cat
11257,Guardog Glitz Glitter Bladeguards Blade Guards...,Guardog,888
11643,VAT Free Dress It Up Red Love Hearts 30 Button...,Dress It Up,14339
3271,Beer Mat / Coaster..Breweriana - APPLETISER AP...,Appletiser,1
15221,"HP Proliant DL380 G3 EON 2.8Ghz, 1Gb RAM 2U Ra...",HP,58058
16041,Levis 513 Mens Blue Skinny Stretch Jeans W32...,Levis,11450
...,...,...,...
9645,2x PUKKA PAD THINGS TO DO TODAY DESK MEMO PADS...,PUKKA,12576
18658,BASE LONDON ASH MEN'S WAXY TAN LEATHER UPPER L...,Base,11450
9605,Xerox DocuColor DC 250 Fast 65ppm Digital Colo...,Xerox,12576
1539,Mole collectable soft toy animal by Hansa - 23...,Hansa,220


In [5]:
# Build a [start_char, end_char, label] span for the given label; tagname is the entity label name
# spaCy training format:
# [('Dell Laptop 3GB', {'entities': [(0, 4, 'ORG')]}), ('Jackie love Paris', {'entities': [(0, 6, 'PERSON')]})]
def get_start_end_label(title, label, tagname="BRAND"):
    # Case-insensitive search for the label inside the title
    title = title.lower()
    label = label.lower()
    start_index = title.find(label)
    if start_index == -1:
        print(title)
        print(f"{label} not found in the given sentence")
        return None  # the caller must handle titles without a match
    return [start_index, start_index + len(label), tagname]

In [6]:
# Create training datapoints in spaCy format, skipping titles where the brand was not found
ner_training = []
for i,row in raw_data.iterrows():
    title = row['title']
    brand = row['brand']
    brand_entity = get_start_end_label(title,brand)
    if brand_entity is None:
        continue
    train_datapoint = (title,{'entities': [brand_entity]})
    ner_training.append(train_datapoint)
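
`str.find` matches the brand anywhere in the title, including inside a longer word, and entity offsets that fall in the middle of a token generally cannot be aligned by spaCy. A minimal sketch of a stricter lookup, assuming a regular expression with word-boundary lookarounds is good enough for this data. `find_brand_span` is a hypothetical helper, not part of the original notebook, and the titles are made up:

```python
import re

def find_brand_span(title, brand, tagname="BRAND"):
    """Return [start, end, tagname] for the first whole-word match of brand, or None."""
    # (?<!\w) and (?!\w) reject matches glued to other word characters.
    pattern = r"(?<!\w)" + re.escape(brand) + r"(?!\w)"
    match = re.search(pattern, title, flags=re.IGNORECASE)
    if match is None:
        return None
    return [match.start(), match.end(), tagname]

# Made-up titles, for illustration only.
print(find_brand_span("Touchpad for HP laptop", "HP"))   # [13, 15, 'BRAND']
print(find_brand_span("Levis 513 Mens Blue Jeans", "levis"))  # [0, 5, 'BRAND']
```

On the first title a plain substring search would return the "hp" inside "Touchpad" (offset 4), whereas the sketch returns the standalone token.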

In [8]:
with open('ner_train.json','w') as f:
  json.dump(ner_training,f)

In [9]:
# Load the NER training set saved above and use the last 2000 titles of the raw train.csv as validation set
ner_training_path = 'ner_train.json'

with open(ner_training_path,'r') as f:
    ner_training = json.load(f)
test_set = pd.read_csv(train_path).tail(2000)

In [10]:
# Sanity check
print(test_set.shape)
print(len(ner_training))

(2000, 5)
10000
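
Because the 10,000 training rows are sampled from the whole file while the validation set is its last 2,000 rows, some validation titles may also appear in the training set. A minimal sketch of a disjoint split by row position, assuming it is acceptable to hold out the tail before sampling. `split_positions` and its parameters are hypothetical names, and the file size of 20,000 rows is an assumption made only for the example:

```python
import random

def split_positions(n_rows, n_valid, n_train, seed=0):
    """Hold out the last n_valid row positions, then sample training rows from the rest."""
    valid = list(range(n_rows - n_valid, n_rows))
    candidates = range(n_rows - n_valid)
    rng = random.Random(seed)
    train = rng.sample(candidates, min(n_train, len(candidates)))
    return train, valid

# n_rows is an assumed file size, for illustration only.
train_pos, valid_pos = split_positions(n_rows=20000, n_valid=2000, n_train=10000)
print(len(train_pos), len(valid_pos), len(set(train_pos) & set(valid_pos)))
```

The resulting positions could then be passed to `DataFrame.iloc` to build the two sets.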


In [12]:
num_of_starts = 0
for m in ner_training:
  start_char = m[1]["entities"][0][0]
  if start_char == 0:
    num_of_starts +=1


print("% of sentences with brand in start of sentence : ",num_of_starts/len(ner_training) *100)


% of sentences with brand in start of sentence :  70.04


Because about 70% of the titles start with the brand, the model will be biased toward predicting the first words of a title as the brand, as we will see later.
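
One way to put this bias in context is a trivial baseline that always predicts the first word of the title as the brand. A minimal sketch, assuming whitespace tokenisation and a case-insensitive comparison. `first_token_baseline` is a hypothetical helper and the rows are made-up examples, not taken from the dataset:

```python
def first_token_baseline(rows):
    """Accuracy of always predicting the first whitespace token as the brand."""
    hits = 0
    for title, brand in rows:
        tokens = title.split()
        guess = tokens[0] if tokens else ""
        if guess.lower() == brand.lower():
            hits += 1
    return hits / len(rows) if rows else 0.0

# Made-up (title, brand) pairs, for illustration only.
rows = [
    ("Levis 513 jeans", "Levis"),
    ("Dress It Up buttons", "Dress It Up"),  # multi-word brand: the baseline misses it
    ("wireless mouse Acme", "Acme"),         # brand at the end: the baseline misses it
    ("Xerox colour printer", "Xerox"),
]
print(first_token_baseline(rows))  # 0.5
```

A trained model is only useful to the extent that it beats this kind of baseline, especially on titles where the brand is not in first position.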

### 1.2 Train the spaCy NER model

In [13]:
def test_model(model,data):
  # Count the titles for which at least one predicted BRAND entity
  # matches the expected brand (case-insensitive)
  score = 0
  for i,row in data.iterrows():
    doc = model(row['title'])
    expected = row['brand'].upper()
    if any(ent.label_ == 'BRAND' and ent.text.upper() == expected for ent in doc.ents):
      score += 1
  accuracy = score/len(data) * 100
  print("Validation accuracy is ",accuracy)
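
`test_model` reports a single exact-match score. Since a title can yield zero or several entities, precision and recall may be more informative. A minimal sketch, assuming the predictions are available as one list of strings per title. `brand_precision_recall` is a hypothetical helper and the data below is made up:

```python
def brand_precision_recall(predictions, gold_brands):
    """predictions: one list of predicted brand strings per title.
    gold_brands: the expected brand for each title."""
    true_pos = 0
    n_predicted = 0
    for preds, gold in zip(predictions, gold_brands):
        n_predicted += len(preds)
        # A title counts as a hit if any prediction matches, ignoring case.
        if any(p.lower() == gold.lower() for p in preds):
            true_pos += 1
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / len(gold_brands) if gold_brands else 0.0
    return precision, recall

# Made-up predictions and gold brands, for illustration only.
predictions = [["hp"], [], ["smart"], ["BMW", "hat"]]
gold_brands = ["HP", "Mactech", "Omega", "BMW"]
print(brand_precision_recall(predictions, gold_brands))  # (0.5, 0.5)
```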
    

In [14]:
import random
import spacy
from tqdm import tqdm
def train_spacy(data,old_model="",iterations=10,dropout=0.2,save=False,test_set=None):
  train_data = data
  if old_model == "":
    nlp = spacy.blank("en")
  else:
    nlp = spacy.load(old_model)
  if "ner" not in nlp.pipe_names:
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
  else:
    ner = nlp.get_pipe("ner")
  for _,annot in train_data:
    for ent in annot["entities"]:
      ner.add_label(ent[2])
  other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
  with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    
    for i in range(iterations):
      print("Starting iteration " + str(i))
      random.shuffle(train_data)
      losses = {}
      k = 0
      total_loss = 0
      with tqdm(total=len(train_data)) as pbar:
        for text, annotations in train_data:
          nlp.update(
            [text],
            [annotations],
            drop = dropout,
            sgd = optimizer,
            losses = losses
            )
          k += 1
          pbar.update(1)
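          # losses['ner'] is a running total in spaCy 2.x, so total_loss is a sum
          # of running totals and the value printed below is not a per-example average
          total_loss += losses['ner']
      print("Average loss for epoch N° ",i," : ",total_loss/1000)
      if test_set is not None:
        test_model(nlp,test_set)
      if save:
        nlp.to_disk(old_model)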
      
  return nlp



ModuleNotFoundError: No module named 'spacy'
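
In spaCy 2.x the `losses` dictionary passed to `nlp.update` is accumulated across calls, so adding `losses['ner']` to `total_loss` after every update sums running totals rather than per-example losses. The arithmetic can be sketched without spaCy. `fake_update` below is a stand-in assumed to mimic that accumulation, and the loss values are made up:

```python
def fake_update(loss_value, losses):
    """Stand-in for nlp.update: adds this example's loss to the running total."""
    losses["ner"] = losses.get("ner", 0.0) + loss_value

per_example = [4.0, 2.0, 1.0, 1.0]  # made-up per-example losses

losses = {}
sum_of_running_totals = 0.0
for value in per_example:
    fake_update(value, losses)
    sum_of_running_totals += losses["ner"]  # what the training loop accumulates

# Per-example average for the epoch: final running total over number of examples.
mean_loss = losses["ner"] / len(per_example)
print(sum_of_running_totals, mean_loss)  # 25.0 2.0
```

Under that assumption, reading `losses['ner']` once at the end of the epoch and dividing by `len(train_data)` would give a per-example average that is comparable across epochs.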

In [None]:
!pip install "cupy-cuda110<8.0.0"



In [None]:
print(spacy.__version__)

2.2.4


In [None]:
gpu = spacy.prefer_gpu()
print('GPU:', gpu)

GPU: True


In [None]:
#spacy.require_gpu()
model = train_spacy(ner_training,old_model = "/content/drive/MyDrive/DS Projects/brand-ner/ner_en_base_10k",
                    iterations=30,
                    dropout=0.5,
                    save=True,
                    test_set = test_set
                    )


Starting iteration 0


100%|██████████| 10000/10000 [06:45<00:00, 24.65it/s]


Average loss for epoch N°  0  :  24185.167010193487
Validation accuracy is  84.05
Starting iteration 1


100%|██████████| 10000/10000 [06:54<00:00, 24.10it/s]


Average loss for epoch N°  1  :  19365.609143367346
Validation accuracy is  86.35000000000001
Starting iteration 2


100%|██████████| 10000/10000 [07:00<00:00, 23.79it/s]


Average loss for epoch N°  2  :  20725.779652611083
Validation accuracy is  86.4
Starting iteration 3


100%|██████████| 10000/10000 [06:46<00:00, 24.58it/s]


Average loss for epoch N°  3  :  17339.190669058713
Validation accuracy is  87.45
Starting iteration 4


100%|██████████| 10000/10000 [06:42<00:00, 24.85it/s]


Average loss for epoch N°  4  :  16077.09049739907
Validation accuracy is  86.6
Starting iteration 5


100%|██████████| 10000/10000 [07:08<00:00, 23.34it/s]


Average loss for epoch N°  5  :  15280.723468573242
Validation accuracy is  86.6
Starting iteration 6


100%|██████████| 10000/10000 [06:57<00:00, 23.93it/s]


Average loss for epoch N°  6  :  16482.78445236063
Validation accuracy is  86.2
Starting iteration 7


100%|██████████| 10000/10000 [07:14<00:00, 23.02it/s]


Average loss for epoch N°  7  :  15524.70628086061
Validation accuracy is  88.75
Starting iteration 8


100%|██████████| 10000/10000 [07:10<00:00, 23.25it/s]


Average loss for epoch N°  8  :  14348.882085719682
Validation accuracy is  88.05
Starting iteration 9


100%|██████████| 10000/10000 [06:54<00:00, 24.11it/s]


Average loss for epoch N°  9  :  14490.933149543189
Validation accuracy is  88.64999999999999
Starting iteration 10


100%|██████████| 10000/10000 [06:51<00:00, 24.29it/s]


Average loss for epoch N°  10  :  13854.32574263998
Validation accuracy is  87.3
Starting iteration 11


100%|██████████| 10000/10000 [06:45<00:00, 24.63it/s]


Average loss for epoch N°  11  :  13689.143499843358
Validation accuracy is  88.4
Starting iteration 12


100%|██████████| 10000/10000 [06:54<00:00, 24.12it/s]


Average loss for epoch N°  12  :  14593.685551750135
Validation accuracy is  89.2
Starting iteration 13


100%|██████████| 10000/10000 [06:54<00:00, 24.10it/s]


Average loss for epoch N°  13  :  13558.266313288688
Validation accuracy is  88.1
Starting iteration 14


100%|██████████| 10000/10000 [06:47<00:00, 24.52it/s]


Average loss for epoch N°  14  :  12922.25885396521
Validation accuracy is  88.7
Starting iteration 15


100%|██████████| 10000/10000 [06:46<00:00, 24.61it/s]


Average loss for epoch N°  15  :  13490.84282549885
Validation accuracy is  89.60000000000001
Starting iteration 16


100%|██████████| 10000/10000 [06:52<00:00, 24.24it/s]


Average loss for epoch N°  16  :  12626.717902704535
Validation accuracy is  88.85
Starting iteration 17


 16%|█▋        | 1625/10000 [01:09<05:57, 23.44it/s]


KeyboardInterrupt: ignored
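
The log shows validation accuracy fluctuating from one iteration to the next, while `save=True` overwrites the model after every iteration. A minimal sketch of tracking the best iteration instead, using the accuracies printed above (rounded to two decimals). `best_iteration` is a hypothetical helper; in a real loop the model would be written to disk only when the score improves:

```python
def best_iteration(accuracies):
    """Return (index, score) of the highest accuracy; the first one wins on ties."""
    best_index, best_score = -1, float("-inf")
    for i, acc in enumerate(accuracies):
        if acc > best_score:
            best_index, best_score = i, acc
    return best_index, best_score

# Validation accuracies of iterations 0 to 16, copied from the log above.
val_accuracy = [84.05, 86.35, 86.4, 87.45, 86.6, 86.6, 86.2, 88.75, 88.05,
                88.65, 87.3, 88.4, 89.2, 88.1, 88.7, 89.6, 88.85]
print(best_iteration(val_accuracy))  # (15, 89.6)
```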

In [None]:
nlp = spacy.load('/content/drive/MyDrive/DS Projects/brand-ner/ner_en_base_10k')

In [None]:
texts = [
         'computer hp intel core i5',
         'WIRELESS MOUSE MACTECH USB',
         'pen black 16mm vertex',
         'hard drive sandisk 2 tb usb ',
         'fusion hard drive',
         'smart watch omega',
         'black hat BMW '
]

for t in texts:
  doc = nlp(t)
  for ent in doc.ents:
    print(ent)


hp
sandisk
fusion
smart
BMW


In [None]:
test_model(nlp,test_set)

Validation accuracy is  88.85


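The spaCy model, trained for 45 epochs on 10,000 examples, reaches 88.85% accuracy on a validation set of only 2,000 titles, which is quite good given that it was trained from scratch. The sample predictions above also show the first-word bias: "fusion" and "smart" are returned as entities, while MACTECH, vertex and omega are missed.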