Candidate details:

Name: Natraj Vikram

email: vikram.iisccamp@gmail.com

### Problem Statement

Given: an XLS file with the following columns:

- line_item_name: this is the raw line item name extracted from the invoice
- line_item_description: this is the raw line item description extracted from the invoice
- canonical_vendor_name: this is the canonical vendor name to which the invoice has been mapped
- canonical_line_item_name: this is the canonical line item name to which the raw line item name has been mapped / will be mapped


### Task and objective:
- Map raw line item names (with descriptions) on invoices to canonical line item names.

This is a common NLP task known as entity linking.

### Approach used:

- The given dataset is labeled.

- Because the dataset is very small, a model trained from scratch is unlikely to perform well: the corpus is not large enough for the model to learn from and generalize.

- Therefore, we train a custom NER model by fine-tuning existing pretrained language models from spaCy.

- For this, I fine-tuned three spaCy language models: en_core_web_sm, en_core_web_md and en_core_web_lg.


### Adding custom labels to the existing spaCy nlp model:

- The spaCy library offers pretrained entity extractors (person, organization, country, etc.).

- For custom business use cases in the financial and marketing domains, these pre-existing entities may not be enough. Therefore, we add a new entity called "CLI", short for canonical line item, and provide it in the training data.
- After training, we use the test/eval data to find which parts of the text belong to the entity type CLI.
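The evaluation step is not shown in this part of the notebook, so the following is only a minimal sketch of how predicted CLI spans could be scored against the canonical names. The helper names, the normalisation rule (collapse whitespace, ignore case) and the sample predictions are all assumptions made for illustration.

```python
def normalise(name):
    """Collapse runs of whitespace and ignore case (an assumption, not a rule from the task)."""
    return " ".join(name.split()).casefold()


def exact_match_accuracy(predicted, gold):
    """Fraction of rows whose predicted name equals the gold canonical name."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(normalise(p) == normalise(g) for p, g in zip(predicted, gold))
    return hits / len(gold)


# Hypothetical predictions; an empty string stands for "no CLI span found".
predicted = ["Web Media Fee", "seo  services", ""]
gold = ["Web Media Fee", "SEO Services", "Business Package"]
accuracy = exact_match_accuracy(predicted, gold)
```

Exact match is a strict metric; a partial-overlap score may be more informative when the model finds only part of a long canonical name.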

In [1]:
#Import all the necessary libraries
import pandas as pd
from tqdm import tqdm,trange
import numpy as np
import re
import spacy
import random

In [2]:
# Load pre-existing spacy language model with english corpus

# nlp=spacy.load('en_core_web_sm')
# nlp=spacy.load('en_core_web_md')
nlp=spacy.load('en_core_web_lg')


In [3]:
nlp.pipe_names
#predefined pipe names already present in the model

['tagger', 'parser', 'ner']

In [4]:
# Getting the pipeline component
ner=nlp.get_pipe("ner")

In [5]:
df = pd.read_excel('Task.xlsx',sheet_name='train')
df = df.fillna('')
df['line_item_name'] = df['line_item_name'].astype(str) 
df['line_item_description'] = df['line_item_description'].astype(str) 


In [6]:
# Since we have to map raw line item names (with descriptions) on invoices to canonical line item names,
# we create a new column name_desc that combines the item name with its description.
# Vectorized concatenation avoids chained assignment (df[...].iloc[i] = ...), which triggers
# pandas' SettingWithCopyWarning and is not guaranteed to write into df.
df['name_desc'] = df['line_item_name'] + ' ' + df['line_item_description']


### Prepare train data
- As per the spaCy documentation,the train data should be present in the following format.
- The format of the training data is a list of tuples. Each tuple contains the example text and a dictionary. The dictionary will have the key entities , that stores the start and end indices along with the label of the entitties present in the text.

- For example , To pass “Pizza is a common fast food” as example the format will be : 
("Pizza is a common fast food",{"entities" : [(0, 5, "FOOD")]})
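To make the offsets concrete, here is a small sketch that reads the annotated spans back out of one training tuple. The start index is inclusive and the end index is exclusive, so slicing the text with them returns the entity string. The helper name `spans_to_text` is hypothetical.

```python
def spans_to_text(example):
    """Return (entity_text, label) pairs for one (text, annotations) training tuple."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


example = ("Pizza is a common fast food", {"entities": [(0, 5, "FOOD")]})
decoded = spans_to_text(example)  # [("Pizza", "FOOD")]
```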


In [7]:
def find_start_end(string, substring):
    # Return the start (inclusive) and end (exclusive) character indices of substring in string.
    # str.index raises ValueError when substring is not present.
    start = string.index(substring)
    end = start + len(substring)
    return start, end

In [8]:
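ix = []      # row positions where the canonical name was not found in name_desc
ent = []     # entity annotation per row, or 0 when no match was found
ent_ix = []  # row positions where a match was found
for i in trange(len(df)):
    try:
        ent.append([find_start_end(df['name_desc'].iloc[i], df['canonical_line_item_name'].iloc[i]) + ("CLI",)])
        ent_ix.append(i)
    except (ValueError, TypeError):
        # ValueError: the canonical name is not an exact substring of name_desc
        # TypeError: the canonical name is not a string
        ent.append(0)
        ix.append(i)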

100%|██████████| 659/659 [00:00<00:00, 59172.08it/s]
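The loop above keeps a row only when the canonical name occurs verbatim in name_desc; every other row is marked with 0 and dropped in the next cells. If some of those misses are due to capitalisation alone, a case-insensitive search could recover them. The sketch below is an assumption about what might help, not something the notebook does, and `find_start_end_ignore_case` is a hypothetical helper.

```python
import re


def find_start_end_ignore_case(string, substring):
    """Return (start, end) of the first case-insensitive match, or None if there is none."""
    if not substring:
        return None  # an empty canonical name would otherwise match at (0, 0)
    match = re.search(re.escape(substring), string, flags=re.IGNORECASE)
    return match.span() if match else None


span = find_start_end_ignore_case("JUNE WEB MEDIA FEE", "Web Media Fee")
```

The offsets still refer to the original text, so they can be used directly in the "entities" list.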


In [9]:
df['entities'] = ent

In [10]:
df = df[df.entities != 0]

In [11]:
df = df[['line_item_name', 'line_item_description', 'canonical_vendor_name',
       'name_desc', 'canonical_line_item_name','entities']]

In [12]:
df

| | line_item_name | line_item_description | canonical_vendor_name | name_desc | canonical_line_item_name | entities |
|---|---|---|---|---|---|---|
| 0 | Management Services | April 2019 Services | 10 Minute Ventures | Management Services April 2019 Services | Management Services | [(0, 19, CLI)] |
| 1 | June Web Media Fee | ($249,300 x 12% Commission) | Acqcom Digital Marketing | June Web Media Fee ($249,300 x 12% Commission) | Web Media Fee | [(5, 18, CLI)] |
| 2 | June Web Media Fee | ($180,000 x 12% Commission) | Acqcom Digital Marketing | June Web Media Fee ($180,000 x 12% Commission) | Web Media Fee | [(5, 18, CLI)] |
| 3 | Business Package | | Adjust | Business Package | Business Package | [(0, 16, CLI)] |
| 4 | SEO Services | | AdLift | SEO Services | SEO Services | [(0, 12, CLI)] |
| ... | ... | ... | ... | ... | ... | ... |
| 545 | Domain Privacy Registration Fees via $261.70 | | Waltz Media Group | Domain Privacy Registration Fees via $261.70 | Domain Privacy Registration Fees | [(0, 32, CLI)] |
| 547 | Business Hosting (Year, USD) (at $432.00 / yea... | | Webflow | Business Hosting (Year, USD) (at $432.00 / yea... | Business Hosting | [(0, 16, CLI)] |
| 598 | Service address charge x4 | Service address charge x4 | Wilson Stevens | Service address charge x4 Service address char... | Service address charge | [(0, 22, CLI)] |
| 604 | Workable Annual Plan - Platform fee | | Workable | Workable Annual Plan - Platform fee | Workable Annual Plan - Platform fee | [(0, 35, CLI)] |


In [13]:
TRAIN_DATA = []
for i in range(len(df)):
    TRAIN_DATA.append((df['name_desc'].iloc[i],{'entities':df['entities'].iloc[i]}))

In [14]:
TRAIN_DATA[0:5]
#first 5 points from the train data

[('Management Services April 2019 Services', {'entities': [(0, 19, 'CLI')]}),
 ('June Web Media Fee ($249,300 x 12% Commission)',
  {'entities': [(5, 18, 'CLI')]}),
 ('June Web Media Fee ($180,000 x 12% Commission)',
  {'entities': [(5, 18, 'CLI')]}),
 ('Business Package ', {'entities': [(0, 16, 'CLI')]}),
 ('SEO Services ', {'entities': [(0, 12, 'CLI')]})]
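Before training, it can be worth sanity-checking the spans in TRAIN_DATA. The sketch below flags spans that are out of range, overlap an earlier span, or start or end in the middle of a word. The word-boundary test is only a rough stand-in for spaCy's own tokenization, which it does not reproduce, and `check_spans` is a hypothetical helper.

```python
def check_spans(text, entities):
    """Return a list of human-readable problems found in (start, end, label) spans."""
    problems = []
    previous_end = 0
    for start, end, label in sorted(entities):
        if not (0 <= start < end <= len(text)):
            problems.append(f"{label} ({start}, {end}): out of range")
            continue
        if start < previous_end:
            problems.append(f"{label} ({start}, {end}): overlaps an earlier span")
        # A boundary between two alphanumeric characters cuts a word in half.
        if start > 0 and text[start - 1].isalnum() and text[start].isalnum():
            problems.append(f"{label} ({start}, {end}): starts inside a word")
        if end < len(text) and text[end - 1].isalnum() and text[end].isalnum():
            problems.append(f"{label} ({start}, {end}): ends inside a word")
        previous_end = max(previous_end, end)
    return problems


clean = check_spans("June Web Media Fee ($249,300 x 12% Commission)", [(5, 18, "CLI")])
cut_word = check_spans("Management Services April 2019 Services", [(0, 18, "CLI")])
```

Here `clean` is empty, while `cut_word` reports that the span ends inside "Services". Spans like the latter generally cannot be aligned to whole tokens, so they are worth fixing or dropping before training.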

In [15]:
#Adding labels to the `ner`
# Adding custom label CLI - canonical line item
for _, annotations in TRAIN_DATA:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

In [15]:
# ner.add_label("CLI")
# assert "CLI" in ner.labels

In [16]:
ner.labels
#here we see that the CLI has been added. Just a sanity check.

('CARDINAL',
 'CLI',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART')

In [17]:
ner.labels[1]

'CLI'

In [18]:
# Disable the pipeline components that should not be updated during training
# pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
pipe_exceptions = ["ner"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

In [19]:
unaffected_pipes

['tagger', 'parser']

### Train model

In [20]:
# Import requirements
import random
from spacy.util import minibatch, compounding
from pathlib import Path

# TRAINING THE MODEL
# Only the NER component is updated; the other pipes are disabled inside this block.
with nlp.disable_pipes(*unaffected_pipes):

    # Continue training from the pretrained weights
    optimizer = nlp.resume_training()

    # Training for 30 iterations
    for iteration in trange(30):

        # Shuffle the examples before every iteration
        random.shuffle(TRAIN_DATA)
        losses = {}
        # Batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,  # batch of texts
                annotations,  # batch of annotations
                drop=0.5,  # dropout - make it harder to memorise data
                losses=losses,
                sgd=optimizer,
            )
            # losses accumulates across batches, so this prints a running total
            # that resets at the start of each iteration
            print("Losses", losses)
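To see what the batching schedule does in practice, the sketch below re-implements the two helpers in plain Python. These are stand-ins written to mimic how `compounding` and `minibatch` appear to behave in the spaCy version used here, not the library code itself, and the training-set size of 500 is a hypothetical figure.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(items, size):
    """Cut items into consecutive batches; each batch holds int(next(size)) items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, int(next(size))))
        if not batch:
            return
        yield batch


examples = list(range(500))  # hypothetical number of training examples
batch_sizes = [
    len(batch)
    for batch in minibatch_sketch(examples, compounding_sketch(4.0, 32.0, 1.001))
]
```

If the real helpers behave like these stand-ins, a growth factor of 1.001 needs more than 200 batches before the size reaches 5. Because the training loop creates a fresh schedule in every iteration, a data set of a few hundred examples is processed in batches of 4 throughout; a larger factor would be needed for the batch size to grow noticeably.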

  0%|          | 0/30 [00:00<?, ?it/s]

Losses {'ner': 66.65388107299805}
Losses {'ner': 150.91592979431152}
Losses {'ner': 196.2850785255432}
Losses {'ner': 253.73339128494263}
Losses {'ner': 309.7413754463196}
Losses {'ner': 387.25587224960327}
Losses {'ner': 453.43138456344604}
Losses {'ner': 571.7008635997772}
Losses {'ner': 597.9463069438934}
Losses {'ner': 617.8466880321503}
Losses {'ner': 733.9416263103485}
Losses {'ner': 775.9578865766525}
Losses {'ner': 828.2640651464462}
Losses {'ner': 877.222648024559}
Losses {'ner': 914.7626446485519}
Losses {'ner': 945.2591818571091}
Losses {'ner': 981.9784973859787}
Losses {'ner': 1047.838891863823}
Losses {'ner': 1119.6160868406296}
Losses {'ner': 1171.1442540884018}
Losses {'ner': 1207.6202725172043}
Losses {'ner': 1230.5415757894516}
Losses {'ner': 1288.4488800764084}
Losses {'ner': 1485.6891325712204}
Losses {'ner': 1531.9893645048141}
Losses {'ner': 1598.8106735944748}
Losses {'ner': 1634.714642882347}
Losses {'ner': 1662.8551174402237}
Losses {'ner': 1713.4850422143936}
...


Losses {'ner': 2303.08339035511}
Losses {'ner': 2339.5543640851974}
Losses {'ner': 2385.0630229711533}
Losses {'ner': 2421.089947581291}
Losses {'ner': 2471.572889685631}
Losses {'ner': 2526.786007285118}
Losses {'ner': 2629.4723876714706}
Losses {'ner': 2659.3899418115616}
Losses {'ner': 2706.067607522011}
Losses {'ner': 2726.164668083191}
Losses {'ner': 2776.4133501052856}
Losses {'ner': 2824.0950179100037}
Losses {'ner': 2911.9591059684753}
Losses {'ner': 2966.032901287079}
Losses {'ner': 3041.515498161316}
Losses {'ner': 3101.9174480438232}
Losses {'ner': 3196.8952226638794}
Losses {'ner': 3255.536959886551}
Losses {'ner': 3311.2325117588043}
Losses {'ner': 3325.8400418758392}
Losses {'ner': 3377.094755411148}
Losses {'ner': 3457.766608476639}
Losses {'ner': 3514.2773110866547}
Losses {'ner': 3575.86719250679}
Losses {'ner': 3617.903891801834}
Losses {'ner': 3687.084585428238}
Losses {'ner': 3711.1476607322693}
Losses {'ner': 3755.176158428192}
Losses {'ner': 3813.024025440216}
...

  3%|▎         | 1/30 [00:02<01:04,  2.21s/it]

Losses {'ner': 4187.384821295738}
Losses {'ner': 4230.452259421349}
Losses {'ner': 4267.179277777672}
Losses {'ner': 4321.95971763134}
Losses {'ner': 4359.007039904594}
Losses {'ner': 4398.061471819878}
Losses {'ner': 4429.946092963219}
Losses {'ner': 4459.445915579796}
Losses {'ner': 4462.679759085178}
Losses {'ner': 77.05341529846191}
Losses {'ner': 121.60512065887451}
Losses {'ner': 138.45220017433167}
Losses {'ner': 182.4047544002533}
Losses {'ner': 218.69641017913818}
Losses {'ner': 275.36989736557007}
Losses {'ner': 295.5197193622589}
Losses {'ner': 355.14693570137024}
Losses {'ner': 401.83169198036194}
Losses {'ner': 478.5235664844513}
Losses {'ner': 544.2939827442169}
Losses {'ner': 566.8995645046234}
Losses {'ner': 628.6152820587158}
Losses {'ner': 649.0062036514282}
Losses {'ner': 675.9119784832001}
Losses {'ner': 708.1405746936798}
Losses {'ner': 748.8798825740814}
Losses {'ner': 796.2994349002838}
Losses {'ner': 841.9002459049225}
Losses {'ner': 887.0365545749664}
...

  7%|▋         | 2/30 [00:04<01:01,  2.18s/it]

Losses {'ner': 3673.4222955703735}
Losses {'ner': 3712.908148765564}
Losses {'ner': 3787.8390007019043}
Losses {'ner': 3833.3921003341675}
Losses {'ner': 3835.6118890047073}
Losses {'ner': 38.045775413513184}
Losses {'ner': 114.58676338195801}
Losses {'ner': 167.83127665519714}
Losses {'ner': 198.86746191978455}
Losses {'ner': 235.03663420677185}
Losses {'ner': 261.65164613723755}
Losses {'ner': 285.8048801422119}
Losses {'ner': 314.08821535110474}
Losses {'ner': 336.71798300743103}
Losses {'ner': 372.0652883052826}
Losses {'ner': 396.9462459087372}
Losses {'ner': 454.06443905830383}
Losses {'ner': 502.31716656684875}
Losses {'ner': 554.9614169597626}
Losses {'ner': 619.3514301776886}
Losses {'ner': 644.7422244548798}
Losses {'ner': 680.4769076108932}
Losses {'ner': 700.0927475690842}
Losses {'ner': 727.0580798387527}
Losses {'ner': 779.5323177576065}
Losses {'ner': 819.9395691156387}
Losses {'ner': 884.5601612329483}
Losses {'ner': 925.174228310585}
Losses {'ner': 970.7500959634781}
...

 10%|█         | 3/30 [00:06<00:57,  2.12s/it]

Losses {'ner': 3596.2090064287186}
Losses {'ner': 3665.172471165657}
Losses {'ner': 3686.862131714821}
Losses {'ner': 3732.026575922966}
Losses {'ner': 3771.8377269506454}
Losses {'ner': 3777.7292196154594}
Losses {'ner': 31.8829984664917}
Losses {'ner': 100.54536533355713}
Losses {'ner': 139.47314739227295}
Losses {'ner': 161.82160568237305}
Losses {'ner': 236.31753158569336}
Losses {'ner': 280.80714225769043}
Losses {'ner': 307.33110094070435}
Losses {'ner': 346.6356439590454}
Losses {'ner': 388.796217918396}
Losses {'ner': 435.41332817077637}
Losses {'ner': 465.96775579452515}
Losses {'ner': 486.3019161224365}
Losses {'ner': 511.3589165210724}
Losses {'ner': 540.5083885192871}
Losses {'ner': 566.7284569740295}
Losses {'ner': 595.5425765514374}
Losses {'ner': 696.0575664043427}
Losses {'ner': 740.4460790157318}
Losses {'ner': 766.2649359703064}
Losses {'ner': 816.2030720710754}
Losses {'ner': 859.8265309333801}
Losses {'ner': 897.6026844978333}
Losses {'ner': 974.2193403244019}
...

 13%|█▎        | 4/30 [00:08<00:54,  2.09s/it]

Losses {'ner': 3360.517107129097}
Losses {'ner': 3406.8970490694046}
Losses {'ner': 3460.611067891121}
Losses {'ner': 3517.545748114586}
Losses {'ner': 3566.4169577360153}
Losses {'ner': 3619.452423930168}
Losses {'ner': 3673.7846573591232}
Losses {'ner': 3676.9965127706528}
Losses {'ner': 60.2909951210022}
Losses {'ner': 80.49496078491211}
Losses {'ner': 105.6138846874237}
Losses {'ner': 137.63207364082336}
Losses {'ner': 196.30195450782776}
Losses {'ner': 242.44850420951843}
Losses {'ner': 288.2049310207367}
Losses {'ner': 357.11305594444275}
Losses {'ner': 382.35424304008484}
Losses {'ner': 448.72151827812195}
Losses {'ner': 492.99997115135193}
Losses {'ner': 591.13711810112}
Losses {'ner': 697.2983558177948}
Losses {'ner': 721.1297142505646}
Losses {'ner': 760.8173644542694}
Losses {'ner': 830.6943719387054}
Losses {'ner': 872.3503739833832}
Losses {'ner': 906.5609848499298}
Losses {'ner': 941.0153720378876}
Losses {'ner': 981.6093866825104}
Losses {'ner': 1006.0804328918457}
...

 17%|█▋        | 5/30 [00:10<00:51,  2.04s/it]

Losses {'ner': 3568.40194106102}
Losses {'ner': 3606.725503206253}
Losses {'ner': 3641.1242830753326}
Losses {'ner': 3665.962392807007}
Losses {'ner': 3691.0301337242126}
Losses {'ner': 3721.7863753763563}
Losses {'ner': 41.50513029098511}
Losses {'ner': 84.14811134338379}
Losses {'ner': 130.66074991226196}
Losses {'ner': 155.7610626220703}
Losses {'ner': 191.3574652671814}
Losses {'ner': 255.84901475906372}
Losses {'ner': 309.7556576728821}
Losses {'ner': 403.60060930252075}
Losses {'ner': 442.04689502716064}
Losses {'ner': 489.0456771850586}
Losses {'ner': 534.9567847251892}
Losses {'ner': 580.628960609436}
Losses {'ner': 613.5992133617401}
Losses {'ner': 625.1118556261063}
Losses {'ner': 675.1152478456497}
Losses {'ner': 693.7783282995224}
Losses {'ner': 727.5618838071823}
Losses {'ner': 804.777495265007}
Losses {'ner': 822.8280482590199}
Losses {'ner': 861.494353801012}
Losses {'ner': 920.9731927216053}
Losses {'ner': 969.8083949387074}
Losses {'ner': 989.1732421219349}
...

 20%|██        | 6/30 [00:12<00:48,  2.04s/it]

Losses {'ner': 3314.0571475327015}
Losses {'ner': 3343.6864552795887}
Losses {'ner': 3356.3652426302433}
Losses {'ner': 3394.4926902353764}
Losses {'ner': 3414.316734343767}
Losses {'ner': 3454.3590493500233}
Losses {'ner': 3522.129478007555}
Losses {'ner': 3619.643285304308}
Losses {'ner': 3621.4626800492406}
Losses {'ner': 30.5863618850708}
Losses {'ner': 70.99048328399658}
Losses {'ner': 103.34330749511719}
Losses {'ner': 191.19861030578613}
Losses {'ner': 223.27784061431885}
Losses {'ner': 250.31727075576782}
Losses {'ner': 367.523766040802}
Losses {'ner': 428.5978207588196}
Losses {'ner': 456.07649660110474}
Losses {'ner': 482.00325560569763}
Losses {'ner': 511.70080065727234}
Losses {'ner': 542.5627348423004}
Losses {'ner': 565.0033097267151}
Losses {'ner': 602.3874182701111}
Losses {'ner': 655.4824488162994}
Losses {'ner': 669.0885165333748}
Losses {'ner': 722.2763102650642}
Losses {'ner': 740.8224254250526}
Losses {'ner': 786.1608009934425}
Losses {'ner': 834.415543615818}
...

 23%|██▎       | 7/30 [00:14<00:46,  2.04s/it]

Losses {'ner': 3204.0238896012306}
Losses {'ner': 3242.5457568764687}
Losses {'ner': 3292.620688021183}
Losses {'ner': 3329.601714193821}
Losses {'ner': 3359.5168638825417}
Losses {'ner': 3394.992374956608}
Losses {'ner': 3465.435996592045}
Losses {'ner': 3498.831685125828}
Losses {'ner': 3561.0417447686195}
Losses {'ner': 3566.5674148201942}
Losses {'ner': 21.458033561706543}
Losses {'ner': 59.68788433074951}
Losses {'ner': 140.42764282226562}
Losses {'ner': 174.75246143341064}
Losses {'ner': 260.81217193603516}
Losses {'ner': 292.74109077453613}
Losses {'ner': 317.46889543533325}
Losses {'ner': 378.3962345123291}
Losses {'ner': 435.49664306640625}
Losses {'ner': 458.79866766929626}
Losses {'ner': 488.1246283054352}
Losses {'ner': 508.1904377937317}
Losses {'ner': 565.2463283538818}
Losses {'ner': 627.227686882019}
Losses {'ner': 668.9464030265808}
Losses {'ner': 717.5093472003937}
Losses {'ner': 731.6850965619087}
Losses {'ner': 759.3328407406807}
Losses {'ner': 806.2611902356148}
...

 27%|██▋       | 8/30 [00:16<00:44,  2.01s/it]

Losses {'ner': 3263.7706084251404}
Losses {'ner': 3319.471579551697}
Losses {'ner': 3410.6406989097595}
Losses {'ner': 3445.2319283485413}
Losses {'ner': 3461.7041836977005}
Losses {'ner': 3515.551314473152}
Losses {'ner': 3562.7819472551346}
Losses {'ner': 3567.291514188051}
Losses {'ner': 47.022525787353516}
Losses {'ner': 70.63394784927368}
Losses {'ner': 111.49938106536865}
Losses {'ner': 163.251784324646}
Losses {'ner': 195.52587842941284}
Losses {'ner': 228.48943901062012}
Losses {'ner': 270.0211305618286}
Losses {'ner': 320.63705348968506}
Losses {'ner': 350.25790882110596}
Losses {'ner': 367.7453052997589}
Losses {'ner': 407.43885350227356}
Losses {'ner': 464.73663544654846}
Losses {'ner': 499.7507493495941}
Losses {'ner': 525.2593648433685}
Losses {'ner': 556.9517588615417}
Losses {'ner': 597.5497159957886}
Losses {'ner': 628.3790998458862}
Losses {'ner': 674.6985926628113}
Losses {'ner': 695.9867103099823}
Losses {'ner': 716.4692056179047}
Losses {'ner': 757.3041269779205}
...

 30%|███       | 9/30 [00:18<00:42,  2.01s/it]

Losses {'ner': 3473.754462838173}
Losses {'ner': 3494.1829406023026}
Losses {'ner': 3504.244627058506}
Losses {'ner': 34.27374219894409}
Losses {'ner': 69.9326229095459}
Losses {'ner': 99.08102941513062}
Losses {'ner': 138.77847623825073}
Losses {'ner': 160.50655364990234}
Losses {'ner': 208.97334337234497}
Losses {'ner': 240.06045150756836}
Losses {'ner': 270.63592648506165}
Losses {'ner': 293.2889370918274}
Losses {'ner': 368.0540370941162}
Losses {'ner': 426.2944598197937}
Losses {'ner': 524.1811337471008}
Losses {'ner': 570.4632468223572}
Losses {'ner': 603.4471445083618}
Losses {'ner': 635.7690391540527}
Losses {'ner': 669.4115228652954}
Losses {'ner': 738.6134924888611}
Losses {'ner': 757.022777557373}
Losses {'ner': 802.5352947711945}
Losses {'ner': 859.0759251117706}
Losses {'ner': 878.7702894210815}
Losses {'ner': 914.1952068805695}
Losses {'ner': 938.098881483078}
Losses {'ner': 956.3740000724792}
Losses {'ner': 1009.6931495666504}
Losses {'ner': 1048.6083641052246}
...

 33%|███▎      | 10/30 [00:20<00:39,  1.97s/it]

Losses {'ner': 3455.7605713102967}
Losses {'ner': 62.80032920837402}
Losses {'ner': 75.17356872558594}
Losses {'ner': 160.2215073108673}
Losses {'ner': 198.63148880004883}
Losses {'ner': 234.19692397117615}
Losses {'ner': 281.7830741405487}
Losses {'ner': 312.62509083747864}
Losses {'ner': 369.74916410446167}
Losses {'ner': 395.2543535232544}
Losses {'ner': 421.87411880493164}
Losses {'ner': 467.2712574005127}
Losses {'ner': 489.1861091852188}
Losses {'ner': 535.2851368188858}
Losses {'ner': 574.3620349168777}
Losses {'ner': 645.977309346199}
Losses {'ner': 672.2224563360214}
Losses {'ner': 695.900293469429}
Losses {'ner': 724.2129579782486}
Losses {'ner': 771.2068907022476}
Losses {'ner': 819.2306624650955}
Losses {'ner': 855.0573788881302}
Losses {'ner': 931.3142491579056}
Losses {'ner': 972.3690110445023}
Losses {'ner': 999.254634976387}
Losses {'ner': 1042.319723725319}
Losses {'ner': 1075.5160156488419}
Losses {'ner': 1102.8966370821}
Losses {'ner': 1135.8577276468277}
...

 37%|███▋      | 11/30 [00:22<00:36,  1.95s/it]

Losses {'ner': 3163.062889933586}
Losses {'ner': 3200.2297986745834}
Losses {'ner': 3234.5019429922104}
Losses {'ner': 3265.874337077141}
Losses {'ner': 3300.6377695798874}
Losses {'ner': 3361.222787976265}
Losses {'ner': 3428.369276046753}
Losses {'ner': 3454.292578458786}
Losses {'ner': 3458.0115653276443}
Losses {'ner': 35.03681993484497}
Losses {'ner': 75.42933893203735}
Losses {'ner': 110.8360047340393}
Losses {'ner': 157.5507788658142}
Losses {'ner': 181.84518611431122}
Losses {'ner': 201.45748507976532}
Losses {'ner': 238.36803996562958}
Losses {'ner': 265.7338823080063}
Losses {'ner': 300.4464200735092}
Losses {'ner': 355.3369730710983}
Losses {'ner': 409.6919959783554}
Losses {'ner': 449.31300818920135}
Losses {'ner': 473.0233072042465}
Losses {'ner': 512.630986571312}
Losses {'ner': 540.7937172651291}
Losses {'ner': 552.682746052742}
Losses {'ner': 622.1306244134903}
Losses {'ner': 665.5789490938187}
Losses {'ner': 747.2432080507278}
Losses {'ner': 771.0503851175308}
...

 40%|████      | 12/30 [00:23<00:34,  1.93s/it]

Losses {'ner': 3283.9465611577034}
Losses {'ner': 3352.341703236103}
Losses {'ner': 3375.583912909031}
Losses {'ner': 3407.468859255314}
Losses {'ner': 3437.7218247056007}
Losses {'ner': 3443.444353222847}
Losses {'ner': 31.96689462661743}
Losses {'ner': 55.752360582351685}
Losses {'ner': 80.61775135993958}
Losses {'ner': 108.47761416435242}
Losses {'ner': 128.1361175775528}
Losses {'ner': 173.01777565479279}
Losses {'ner': 215.46358215808868}
Losses {'ner': 255.61807310581207}
Losses {'ner': 280.1442281007767}
Losses {'ner': 323.50832283496857}
Losses {'ner': 346.96959149837494}
Losses {'ner': 406.78370702266693}
Losses {'ner': 425.6014095544815}
Losses {'ner': 455.2189303636551}
Losses {'ner': 484.5922762155533}
Losses {'ner': 520.2719274759293}
Losses {'ner': 547.3507763147354}
Losses {'ner': 578.7800914049149}
Losses {'ner': 656.802658200264}
Losses {'ner': 687.229208111763}
Losses {'ner': 736.6164857149124}
Losses {'ner': 780.1079536676407}
Losses {'ner': 815.5546232461929}
...

 43%|████▎     | 13/30 [00:25<00:33,  1.94s/it]

Losses {'ner': 3169.6271971464157}
Losses {'ner': 3253.3925684690475}
Losses {'ner': 3294.0354984998703}
Losses {'ner': 3359.7735098600388}
Losses {'ner': 3383.428566336632}
Losses {'ner': 3433.4487792253494}
Losses {'ner': 20.988449573516846}
Losses {'ner': 47.34898805618286}
Losses {'ner': 76.36200213432312}
Losses {'ner': 105.70622897148132}
Losses {'ner': 148.6663920879364}
Losses {'ner': 237.83149409294128}
Losses {'ner': 262.4003186225891}
Losses {'ner': 316.33031845092773}
Losses {'ner': 358.6586899757385}
Losses {'ner': 392.33489990234375}
Losses {'ner': 455.6787705421448}
Losses {'ner': 479.3494656085968}
Losses {'ner': 524.7929217815399}
Losses {'ner': 553.5341236591339}
Losses {'ner': 608.0550148487091}
Losses {'ner': 674.4164803028107}
Losses {'ner': 689.8514499664307}
Losses {'ner': 728.13205742836}
Losses {'ner': 779.5354425907135}
Losses {'ner': 805.2314252853394}
Losses {'ner': 821.3908927440643}
Losses {'ner': 848.8093647956848}
Losses {'ner': 879.9504976272583}
...

 47%|████▋     | 14/30 [00:28<00:32,  2.00s/it]

Losses {'ner': 3150.562068104744}
Losses {'ner': 3179.4797538518906}
Losses {'ner': 3201.6833089590073}
Losses {'ner': 3241.9342535734177}
Losses {'ner': 3264.521306872368}
Losses {'ner': 3307.019996523857}
Losses {'ner': 3364.3199096918106}
Losses {'ner': 3390.8305283784866}
Losses {'ner': 3393.0161765404046}
Losses {'ner': 22.598001956939697}
Losses {'ner': 45.536977648735046}
Losses {'ner': 64.27596688270569}
Losses {'ner': 118.78824353218079}
Losses {'ner': 160.18266224861145}
Losses {'ner': 189.07855367660522}
Losses {'ner': 218.2122664451599}
Losses {'ner': 241.78362107276917}
Losses {'ner': 291.3841998577118}
Losses {'ner': 310.9106984138489}
Losses {'ner': 341.4759774208069}
Losses {'ner': 365.14781975746155}
Losses {'ner': 411.79502415657043}
Losses {'ner': 447.1104152202606}
Losses {'ner': 471.33788084983826}
Losses {'ner': 505.98211550712585}
Losses {'ner': 563.2400124073029}
Losses {'ner': 627.9334003925323}
Losses {'ner': 674.1872885227203}
...

 50%|█████     | 15/30 [00:30<00:30,  2.01s/it]

Losses {'ner': 3450.125892817974}
Losses {'ner': 59.99864912033081}
Losses {'ner': 86.95303773880005}
Losses {'ner': 124.82992649078369}
Losses {'ner': 147.30217504501343}
Losses {'ner': 177.07745695114136}
Losses {'ner': 206.13266372680664}
Losses {'ner': 224.93329906463623}
Losses {'ner': 256.5517249107361}
Losses {'ner': 304.03088903427124}
Losses {'ner': 335.36981534957886}
Losses {'ner': 354.2018132209778}
Losses {'ner': 388.01704955101013}
Losses {'ner': 418.7383725643158}
Losses {'ner': 452.49334478378296}
Losses {'ner': 507.9005436897278}
Losses {'ner': 540.749062538147}
Losses {'ner': 583.1064081192017}
Losses {'ner': 628.7551789283752}
Losses {'ner': 663.8646121025085}
Losses {'ner': 707.5041708946228}
Losses {'ner': 728.4321596622467}
Losses {'ner': 776.5898790359497}
Losses {'ner': 805.0190753936768}
Losses {'ner': 881.4980888366699}
Losses {'ner': 917.5342266559601}
Losses {'ner': 974.4969561100006}
Losses {'ner': 1015.1091206073761}
Losses {'ner': 1077.141271352768}
...

 53%|█████▎    | 16/30 [00:31<00:27,  1.98s/it]

Losses {'ner': 3128.5435432195663}
Losses {'ner': 3156.5089606046677}
Losses {'ner': 3210.2785276174545}
Losses {'ner': 3271.5907233953476}
Losses {'ner': 3293.370663523674}
Losses {'ner': 3353.126252055168}
Losses {'ner': 3370.9958218336105}
Losses {'ner': 3397.871732543259}
Losses {'ner': 38.809014320373535}
Losses {'ner': 111.38795948028564}
Losses {'ner': 129.44527339935303}
Losses {'ner': 157.12302589416504}
Losses {'ner': 199.71881246566772}
Losses {'ner': 246.8945450782776}
Losses {'ner': 275.47050881385803}
Losses {'ner': 313.7963545322418}
Losses {'ner': 346.2827479839325}
Losses {'ner': 404.3166329860687}
Losses {'ner': 449.3043954372406}
Losses {'ner': 470.08376240730286}
Losses {'ner': 498.81053042411804}
Losses {'ner': 538.6031801700592}
Losses {'ner': 568.9266865253448}
Losses {'ner': 616.5817973613739}
Losses {'ner': 659.3729555606842}
Losses {'ner': 715.7605345249176}
Losses {'ner': 730.2188024520874}
Losses {'ner': 784.8477916717529}
Losses {'ner': 819.5055160522461}
...

 57%|█████▋    | 17/30 [00:33<00:25,  1.94s/it]

Losses {'ner': 3415.667089815368}
Losses {'ner': 54.44766092300415}
Losses {'ner': 120.12988424301147}
Losses {'ner': 170.7716965675354}
Losses {'ner': 230.76298260688782}
Losses {'ner': 287.02551436424255}
Losses {'ner': 303.4304099082947}
Losses {'ner': 372.3435597419739}
Losses {'ner': 414.43001890182495}
Losses {'ner': 450.6311435699463}
Losses {'ner': 499.44589376449585}
Losses {'ner': 551.4869666099548}
Losses {'ner': 602.7758355140686}
Losses {'ner': 637.6613621711731}
Losses {'ner': 663.4526979923248}
Losses {'ner': 688.7048037052155}
Losses {'ner': 731.0714361667633}
Losses {'ner': 783.2579853534698}
Losses {'ner': 804.5117781162262}
Losses {'ner': 878.4596445560455}
Losses {'ner': 947.4305326938629}
Losses {'ner': 1021.5045785903931}
Losses {'ner': 1033.895282626152}
Losses {'ner': 1055.130385518074}
Losses {'ner': 1099.297133564949}
Losses {'ner': 1127.9659751653671}
Losses {'ner': 1191.3973692655563}
Losses {'ner': 1223.9603275060654}
Losses {'ner': 1264.2248529195786}
...

 60%|██████    | 18/30 [00:35<00:23,  1.94s/it]

Losses {'ner': 3361.42020932016}
Losses {'ner': 45.62174081802368}
Losses {'ner': 87.5808162689209}
Losses {'ner': 117.14848804473877}
Losses {'ner': 147.5897216796875}
Losses {'ner': 242.56191539764404}
Losses {'ner': 263.9515600204468}
Losses {'ner': 317.43269205093384}
Losses {'ner': 357.3978009223938}
Losses {'ner': 372.84617042541504}
Losses {'ner': 418.8856978416443}
Losses {'ner': 456.2517213821411}
Losses {'ner': 482.142023563385}
Losses {'ner': 506.3577923774719}
Losses {'ner': 528.9455466270447}
Losses {'ner': 574.2125034332275}
Losses {'ner': 584.059663027525}
Losses {'ner': 645.907715767622}
Losses {'ner': 665.9847409427166}
Losses {'ner': 692.3500139415264}
Losses {'ner': 745.6177732646465}
Losses {'ner': 767.4714348018169}
Losses {'ner': 785.1856396496296}
Losses {'ner': 839.4584081470966}
Losses {'ner': 876.7707913219929}
Losses {'ner': 932.0752671062946}
Losses {'ner': 958.568689852953}
Losses {'ner': 984.870339423418}
Losses {'ner': 1005.4599180519581}
...

 63%|██████▎   | 19/30 [00:37<00:21,  1.94s/it]

Losses {'ner': 3300.691687077284}
Losses {'ner': 29.007493019104004}
Losses {'ner': 80.12339973449707}
Losses {'ner': 115.92132139205933}
Losses {'ner': 154.97104597091675}
Losses {'ner': 186.87573337554932}
Losses {'ner': 222.6387710571289}
Losses {'ner': 243.49217343330383}
Losses {'ner': 270.0918719768524}
Losses {'ner': 314.4103524684906}
Losses {'ner': 340.57670760154724}
Losses {'ner': 412.53685450553894}
Losses {'ner': 445.4813959598541}
Losses {'ner': 474.4645848274231}
Losses {'ner': 521.5586671829224}
Losses {'ner': 569.1851854324341}
Losses {'ner': 618.8029155731201}
Losses {'ner': 670.3193578720093}
Losses {'ner': 706.183274269104}
Losses {'ner': 774.0036859512329}
Losses {'ner': 810.5897951126099}
Losses {'ner': 851.8101329803467}
Losses {'ner': 872.4312446117401}
Losses {'ner': 897.6045658588409}
Losses {'ner': 928.0305831432343}
Losses {'ner': 948.1594212055206}
Losses {'ner': 979.1900908946991}
Losses {'ner': 1018.028101682663}
Losses {'ner': 1045.4254372119904}
...

 67%|██████▋   | 20/30 [00:39<00:19,  1.92s/it]

Losses {'ner': 3110.7931089401245}
Losses {'ner': 3154.667935371399}
Losses {'ner': 3176.496922016144}
Losses {'ner': 3213.781154155731}
Losses {'ner': 3252.2386512756348}
Losses {'ner': 3287.2880239486694}
Losses {'ner': 3316.701186656952}
Losses {'ner': 3323.6875966750085}
Losses {'ner': 23.349059104919434}
Losses {'ner': 91.49931716918945}
Losses {'ner': 133.32896900177002}
Losses {'ner': 165.67342901229858}
Losses {'ner': 191.37561774253845}
Losses {'ner': 228.61958026885986}
Losses {'ner': 243.1561107635498}
Losses {'ner': 267.123562335968}
Losses {'ner': 300.5987796783447}
Losses {'ner': 340.6380395889282}
Losses {'ner': 394.72505044937134}
Losses {'ner': 436.033034324646}
Losses {'ner': 451.07024109363556}
Losses {'ner': 480.38570511341095}
Losses {'ner': 518.8818525075912}
Losses {'ner': 573.5798071622849}
Losses {'ner': 605.1564403772354}
Losses {'ner': 636.3358694314957}
Losses {'ner': 683.5438410043716}
Losses {'ner': 706.4894744157791}
Losses {'ner': 726.4502772092819}
...

 70%|███████   | 21/30 [00:41<00:17,  1.96s/it]

Losses {'ner': 3260.8127826452255}
Losses {'ner': 3273.61378826329}
Losses {'ner': 37.908355712890625}
Losses {'ner': 85.41356611251831}
Losses {'ner': 109.90008592605591}
Losses {'ner': 136.11408233642578}
Losses {'ner': 166.85336637496948}
Losses {'ner': 200.70789241790771}
Losses {'ner': 223.37747383117676}
Losses {'ner': 264.78953790664673}
Losses {'ner': 307.7143363952637}
Losses {'ner': 337.0572438240051}
Losses {'ner': 375.37920808792114}
Losses {'ner': 401.29946517944336}
Losses {'ner': 461.577130317688}
Losses {'ner': 485.6731173992157}
Losses {'ner': 520.6772119998932}
Losses {'ner': 558.5144509077072}
Losses {'ner': 604.0800641775131}
Losses {'ner': 658.8558338880539}
Losses {'ner': 683.9316018819809}
Losses {'ner': 700.5859454870224}
Losses {'ner': 709.6591874044389}
Losses {'ner': 807.2946005742997}
Losses {'ner': 829.5332568567246}
Losses {'ner': 870.1972206514329}
Losses {'ner': 880.1232894938439}
Losses {'ner': 949.3390726130456}
Losses {'ner': 985.2120260279626}
Losses

 73%|███████▎  | 22/30 [00:43<00:16,  2.02s/it]

Losses {'ner': 3275.1377780716866}
Losses {'ner': 3288.3264545258135}
Losses {'ner': 7.994938850402832}
Losses {'ner': 53.49371385574341}
Losses {'ner': 75.34959197044373}
Losses {'ner': 103.33885979652405}
Losses {'ner': 132.25471878051758}
Losses {'ner': 164.36872482299805}
Losses {'ner': 208.7516188621521}
Losses {'ner': 220.00112371705472}
Losses {'ner': 248.1520712878555}
Losses {'ner': 268.1346040274948}
Losses {'ner': 310.4746709372848}
Losses {'ner': 348.970539143309}
Losses {'ner': 390.566383888945}
Losses {'ner': 449.72375922463834}
Losses {'ner': 479.9139233138412}
Losses {'ner': 521.5044556166977}
Losses {'ner': 561.7708073165268}
Losses {'ner': 604.1963832881302}
Losses {'ner': 645.8485234286636}
Losses {'ner': 661.4759259726852}
Losses {'ner': 686.07306628488}
Losses {'ner': 721.1178727652878}
Losses {'ner': 776.5257907416672}
Losses {'ner': 815.372656872496}
Losses {'ner': 859.2407508399338}
Losses {'ner': 910.8715758826584}
Losses {'ner': 962.9919777419418}
Losses {'ner

 77%|███████▋  | 23/30 [00:45<00:14,  2.04s/it]

Losses {'ner': 3011.9997059758753}
Losses {'ner': 3027.0254537519068}
Losses {'ner': 3065.1740357335657}
Losses {'ner': 3125.3456751760095}
Losses {'ner': 3150.804706996307}
Losses {'ner': 3202.2732941564173}
Losses {'ner': 3280.9031161721796}
Losses {'ner': 3293.968570658937}
Losses {'ner': 39.985326290130615}
Losses {'ner': 54.91316318511963}
Losses {'ner': 79.40249156951904}
Losses {'ner': 121.5438141822815}
Losses {'ner': 141.10058331489563}
Losses {'ner': 163.38752365112305}
Losses {'ner': 230.77154064178467}
Losses {'ner': 274.0993995666504}
Losses {'ner': 341.41400814056396}
Losses {'ner': 388.19212341308594}
Losses {'ner': 417.88055419921875}
Losses {'ner': 454.8675870895386}
Losses {'ner': 492.39141273498535}
Losses {'ner': 525.4610815048218}
Losses {'ner': 581.3092186450958}
Losses {'ner': 672.5580599308014}
Losses {'ner': 702.4329810142517}
Losses {'ner': 744.8778600692749}
Losses {'ner': 779.6976447701454}
Losses {'ner': 843.4977682232857}
Losses {'ner': 865.1895154118538}


 80%|████████  | 24/30 [00:48<00:12,  2.10s/it]

Losses {'ner': 3009.112007558346}
Losses {'ner': 3048.1198791861534}
Losses {'ner': 3121.5287155508995}
Losses {'ner': 3189.120963037014}
Losses {'ner': 3220.3957914710045}
Losses {'ner': 3268.6075717806816}
Losses {'ner': 3288.3889284729958}
Losses {'ner': 20.067973613739014}
Losses {'ner': 88.53604936599731}
Losses {'ner': 137.00099802017212}
Losses {'ner': 162.56491470336914}
Losses {'ner': 214.55653190612793}
Losses {'ner': 244.95367217063904}
Losses {'ner': 291.77968859672546}
Losses {'ner': 347.6146295070648}
Losses {'ner': 376.63986372947693}
Losses {'ner': 427.3367021083832}
Losses {'ner': 467.8008668422699}
Losses {'ner': 481.56275737285614}
Losses {'ner': 521.5850952863693}
Losses {'ner': 541.782421708107}
Losses {'ner': 578.9165645837784}
Losses {'ner': 628.8036066293716}
Losses {'ner': 656.3529163599014}
Losses {'ner': 680.3189202547073}
Losses {'ner': 713.9572521448135}
Losses {'ner': 751.1818321943283}
Losses {'ner': 769.7400144338608}
Losses {'ner': 822.2569023370743}
Lo

 83%|████████▎ | 25/30 [00:50<00:10,  2.17s/it]

Losses {'ner': 3099.661181151867}
Losses {'ner': 3124.8743545413017}
Losses {'ner': 3164.0115997195244}
Losses {'ner': 3235.816602408886}
Losses {'ner': 3261.1637168973684}
Losses {'ner': 12.542425155639648}
Losses {'ner': 53.78811264038086}
Losses {'ner': 81.29830074310303}
Losses {'ner': 127.8487696647644}
Losses {'ner': 156.73563599586487}
Losses {'ner': 200.16259849071503}
Losses {'ner': 223.09651148319244}
Losses {'ner': 250.54201710224152}
Losses {'ner': 275.5309954881668}
Losses {'ner': 312.1773806810379}
Losses {'ner': 343.45622050762177}
Losses {'ner': 430.31161296367645}
Losses {'ner': 502.30327451229095}
Losses {'ner': 515.6100980788469}
Losses {'ner': 581.4692513495684}
Losses {'ner': 613.271336749196}
Losses {'ner': 645.4502288848162}
Losses {'ner': 671.9813563376665}
Losses {'ner': 719.4721381217241}
Losses {'ner': 748.4705483466387}
Losses {'ner': 770.4912480860949}
Losses {'ner': 855.9632730036974}
Losses {'ner': 872.8864383250475}
Losses {'ner': 923.7956938296556}
Loss

 87%|████████▋ | 26/30 [00:52<00:08,  2.20s/it]

 {'ner': 2989.916302099824}
Losses {'ner': 3017.470877543092}
Losses {'ner': 3042.4091774374247}
Losses {'ner': 3067.6864429861307}
Losses {'ner': 3116.5499798208475}
Losses {'ner': 3185.3670905977488}
Losses {'ner': 3223.304928675294}
Losses {'ner': 3235.554701194167}
Losses {'ner': 25.325541019439697}
Losses {'ner': 50.62668442726135}
Losses {'ner': 78.00642895698547}
Losses {'ner': 107.59056115150452}
Losses {'ner': 150.17328584194183}
Losses {'ner': 190.33767449855804}
Losses {'ner': 223.06584346294403}
Losses {'ner': 268.1337283849716}
Losses {'ner': 290.07387149333954}
Losses {'ner': 338.5032209157944}
Losses {'ner': 366.28063809871674}
Losses {'ner': 395.4033397436142}
Losses {'ner': 435.2382105588913}
Losses {'ner': 460.3128880262375}
Losses {'ner': 486.6644209623337}
Losses {'ner': 506.5540713071823}
Losses {'ner': 578.788710474968}
Losses {'ner': 627.5060161352158}
Losses {'ner': 695.322275519371}
Losses {'ner': 732.1759423017502}
Losses {'ner': 745.251748919487}
Losses {'ner

 90%|█████████ | 27/30 [00:55<00:06,  2.23s/it]

Losses {'ner': 3144.0585799217224}
Losses {'ner': 3180.6678700447083}
Losses {'ner': 3226.8580298423767}
Losses {'ner': 3235.663334579227}
Losses {'ner': 29.962072372436523}
Losses {'ner': 54.88195538520813}
Losses {'ner': 115.68853068351746}
Losses {'ner': 159.8488872051239}
Losses {'ner': 214.42247462272644}
Losses {'ner': 231.3358974456787}
Losses {'ner': 256.3485732078552}
Losses {'ner': 296.48485374450684}
Losses {'ner': 312.0313811302185}
Losses {'ner': 346.6190114021301}
Losses {'ner': 370.60469818115234}
Losses {'ner': 395.74453043937683}
Losses {'ner': 445.6007421016693}
Losses {'ner': 469.3809551000595}
Losses {'ner': 498.3712352514267}
Losses {'ner': 532.6081422567368}
Losses {'ner': 622.7400192022324}
Losses {'ner': 658.7659219503403}
Losses {'ner': 691.398709654808}
Losses {'ner': 706.0216228961945}
Losses {'ner': 743.7141745090485}
Losses {'ner': 808.7209894657135}
Losses {'ner': 851.691654920578}
Losses {'ner': 892.8465840816498}
Losses {'ner': 920.8409504890442}
Losses 

 93%|█████████▎| 28/30 [00:57<00:04,  2.25s/it]

Losses {'ner': 35.28943920135498}
Losses {'ner': 71.07767653465271}
Losses {'ner': 100.95375561714172}
Losses {'ner': 121.05743551254272}
Losses {'ner': 155.36130475997925}
Losses {'ner': 190.82702159881592}
Losses {'ner': 234.85257387161255}
Losses {'ner': 266.5381543636322}
Losses {'ner': 284.5590073466301}
Losses {'ner': 316.96600979566574}
Losses {'ner': 358.6381451487541}
Losses {'ner': 426.7458578944206}
Losses {'ner': 445.4650943875313}
Losses {'ner': 506.63967686891556}
Losses {'ner': 547.6566050648689}
Losses {'ner': 576.5760787129402}
Losses {'ner': 628.9701860547066}
Losses {'ner': 654.8783071637154}
Losses {'ner': 686.8313749432564}
Losses {'ner': 719.6971605420113}
Losses {'ner': 754.0935462117195}
Losses {'ner': 765.6820469498634}
Losses {'ner': 794.7797678112984}
Losses {'ner': 816.1762242913246}
Losses {'ner': 846.5466161370277}
Losses {'ner': 873.5193579792976}
Losses {'ner': 895.7432926297188}
Losses {'ner': 921.4738066792488}
Losses {'ner': 957.4160278439522}
Losses 

 97%|█████████▋| 29/30 [00:59<00:02,  2.26s/it]

Losses {'ner': 3225.2156196832657}
Losses {'ner': 3269.5864320993423}
Losses {'ner': 3275.028653898298}
Losses {'ner': 22.42706561088562}
Losses {'ner': 55.790393590927124}
Losses {'ner': 117.97295212745667}
Losses {'ner': 145.53264784812927}
Losses {'ner': 183.33447790145874}
Losses {'ner': 224.95560312271118}
Losses {'ner': 253.74021935462952}
Losses {'ner': 275.9021580219269}
Losses {'ner': 321.6686751842499}
Losses {'ner': 345.09603452682495}
Losses {'ner': 364.56579756736755}
Losses {'ner': 428.48092770576477}
Losses {'ner': 445.14167380332947}
Losses {'ner': 484.9329068660736}
Losses {'ner': 527.0433223247528}
Losses {'ner': 553.6901714801788}
Losses {'ner': 602.5732724666595}
Losses {'ner': 629.7142231464386}
Losses {'ner': 645.2917490005493}
Losses {'ner': 685.1467385292053}
Losses {'ner': 712.7895369529724}
Losses {'ner': 734.3780164718628}
Losses {'ner': 764.9751133918762}
Losses {'ner': 811.2402873039246}
Losses {'ner': 847.6402697563171}
Losses {'ner': 920.5125842094421}
Lo

100%|██████████| 30/30 [01:01<00:00,  2.07s/it]

Losses {'ner': 2947.815644264221}
Losses {'ner': 2999.6393508911133}
Losses {'ner': 3045.1209292411804}
Losses {'ner': 3070.401665687561}
Losses {'ner': 3098.9978079795837}
Losses {'ner': 3135.0347352027893}
Losses {'ner': 3203.44092464447}
Losses {'ner': 3210.0688272938132}
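
A note on reading the log above: the printed values appear to accumulate within each epoch and reset at the start of the next, since they climb steadily and then drop back to double digits. In the portion of the log shown here, the value just before each reset stays between roughly 3,200 and 3,330, which suggests the loss changes little over these last epochs. The sketch below is one way to pull those per-epoch totals out of the raw text. The helper name `epoch_final_losses` and the sample lines are illustrative assumptions, not part of the original notebook.

```python
import re

# Matches complete log lines such as: Losses {'ner': 3323.6875966750085}
LOSS_RE = re.compile(r"Losses \{'ner': ([0-9.]+)\}")


def epoch_final_losses(log_lines):
    """Return the last loss value seen before each reset in a cumulative loss log.

    Hypothetical helper: it assumes the loss only grows within an epoch,
    so any drop marks the start of a new epoch.
    """
    finals = []
    previous = None
    for line in log_lines:
        match = LOSS_RE.search(line)
        if match is None:
            continue  # skip progress bars and truncated lines
        value = float(match.group(1))
        if previous is not None and value < previous:
            finals.append(previous)  # the previous value closed an epoch
        previous = value
    if previous is not None:
        # The last entry may belong to an unfinished epoch if the log is cut off.
        finals.append(previous)
    return finals


# A few lines in the same shape as the log above (progress bar simplified to ASCII).
sample_log = [
    "Losses {'ner': 3316.701186656952}",
    "Losses {'ner': 3323.6875966750085}",
    "Losses {'ner': 23.349059104919434}",
    "Losses {'ner': 91.49931716918945}",
    " 70%| 21/30 [00:41<00:17,  1.96s/it]",
    "Losses {'ner': 3260.8127826452255}",
    "Losses {'ner': 3273.61378826329}",
    "Losses {'ner': 37.908355712890625}",
]
print(epoch_final_losses(sample_log))
```

Printing one number per epoch inside the training loop, instead of one per batch, would give the same information without this post-processing.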





In [21]:
# Save the fine-tuned model to disk
nlp.to_disk("./model_lg")

### Test predictions on unseen data using the trained model

In [1]:
#### Load the trained spaCy NER model

In [1]:
import pandas as pd
from tqdm import tqdm,trange
import numpy as np
import re
import spacy
import random

In [2]:
output_dir = './model_lg'
print("Loading from", output_dir)
nlp = spacy.load(output_dir)

Loading from ./model_lg


In [None]:
### Find out the predicted entities from the eval tab

In [3]:
df = pd.read_excel('Task.xlsx', sheet_name='eval')
df = df.fillna('')
df['line_item_name'] = df['line_item_name'].astype(str)
df['line_item_description'] = df['line_item_description'].astype(str)
# Join name and description, then collapse runs of whitespace into single spaces.
# Assigning the whole column at once avoids the chained-indexing writes
# (df[col].iloc[i] = ...) that trigger pandas' SettingWithCopyWarning.
df['name_desc'] = (
    df['line_item_name'] + ' ' + df['line_item_description']
).str.split().str.join(' ')


In [4]:
df = df[['line_item_name', 'line_item_description', 'canonical_vendor_name',
       'name_desc', 'canonical_line_item_name']]

In [5]:
# Run the model on each eval row and keep the first predicted entity
predicted_entities = []
for i in trange(len(df)):
    doc = nlp(df['name_desc'].iloc[i])
    output = [(ent.text, ent.label_) for ent in doc.ents]
    print("Entities:", output)
    # Keep the text of the first entity; use an empty string when nothing is predicted
    predicted_entities.append(output[0][0] if output else '')

  8%|▊         | 28/337 [00:00<00:02, 132.53it/s]

Entities: [('Management Services May', 'CLI')]
Entities: [('Acrobat Pro DC', 'CLI')]
Entities: [('AIEX 96 Pieces', 'CLI')]
Entities: [('AmazonBasics AAA', 'CLI')]
Entities: [('AmazonBasics Mesh Trash Can Waste Basket 1', 'CLI')]
Entities: [('AmazonFresh Mediterranean Extra Virgin Olive Oil, 68 Fl Oz (2L) B01N3LCEDL', 'CLI')]
Entities: [('Apple 87W USB-C Power Adapter (for MacBook Pro)', 'CLI')]
Entities: [('Apple iPad 2', 'CLI')]
Entities: [('Apple iPad with Retina Display', 'CLI')]
Entities: [('Apple iPad with Retina Display', 'CLI')]
Entities: [('Assurant B2B 4YR Kitchen Protection Plan with Accidental Damage $50-74', 'CLI')]
Entities: [('Bose QuietComfort 35 (Series I) Wireless Headphones,', 'CLI')]
Entities: [('Calculator,Vilcome 12-Digit Solar Battery Office Calculator with Large LCD Display Big Sensitive Button, Dual Power Desktop Calculators (Black) Calculator,Vilcome 12-Digit Solar Battery Office Calculator with Large LCD Display Big Sensitive Button, Dual Power Desktop Calcula

 17%|█▋        | 58/337 [00:00<00:01, 139.78it/s]

Entities: [('Tilting TV Wall Mount Bracket Low Profile for Most 23-55 Inch LED, LCD, OLED, Plasma Flat Screen TVs with VESA 400x400mm Weight up to 115lbs by PERLES', 'CLI')]
Entities: [('TRENDnet USB-C HD Docking Cube, HDMI, Gigabit Port, USB Docking Station, Supports up to 3840 x 2160 at 30 Hz, TUC-DS1', 'CLI')]
Entities: [('Whaline 120 Pack Halloween Gift Tags Trick or Treat Craft Label Tags with 130ft Cotton Rope for Halloween Party Favor, Candy Goodie Bag Art', 'CLI')]
Entities: [('Xbox Wireless Controller - Black', 'CLI')]
Entities: [('E-mail gift card to: christopher.byrum@thimble.com', 'CLI')]
Entities: []
Entities: [('Lenovo 500 Wireless Combo Keyboard & Mouse, Black (GX30H55793)', 'CLI')]
Entities: []
Entities: [('1,500-2,000 words', 'CLI')]
Entities: [('1,500-2,000 words', 'CLI')]
Entities: [('900-1,500 words', 'CLI')]
Entities: [('900-1,500 words @', 'CLI')]
Entities: [('900-1,500 words @', 'CLI')]
Entities: [('Trial Assignment @ $350 Miscellaneous Professional Liability Ins

 27%|██▋       | 92/337 [00:00<00:01, 152.69it/s]

Entities: []
Entities: [('Legislative Services in New York State', 'CLI')]
Entities: []
Entities: [('Amy Farrer', 'CLI')]
Entities: [('Andrew Rink', 'CLI')]
Entities: [('Colin Kendon', 'CLI')]
Entities: []
Entities: [('CONS Consulting - Verifly Greybox Security Assessment; 50% at completion; pmt due 3/6/2020', 'CLI')]
Entities: [('CONS Sales Tax', 'CLI')]
Entities: [("UK Full Availability Trade Mark Register Search for THIMBLE in class 36 Payment of external searchers' fees", 'CLI')]
Entities: [('Sidhu, V', 'CLI')]
Entities: [('Wilner, J', 'CLI')]
Entities: [('Microsoft Dynamics 365 Business Central Enhancement Plan', 'CLI')]
Entities: [('48" Cable Tray - TRIMMED TO 36" OVERALL', 'CLI')]
Entities: [('Brace Leg - Back2Back', 'CLI')]
Entities: [('Brace Leg - Double', 'CLI')]
Entities: [('Brace Leg - Single', 'CLI')]
Entities: [('Cable Floor Box', 'CLI')]
Entities: [('Jumper 36" (USED FOR 42" TOP), 3-Circ', 'CLI')]
Entities: [('Single Leg - Center', 'CLI')]
Entities: [('Universal Screen/M

 38%|███▊      | 127/337 [00:00<00:01, 159.93it/s]

Entities: [('FOREIGN FILING', 'CLI')]
Entities: [('FOREIGN FILING', 'CLI')]
Entities: [('FOREIGN FILING', 'CLI')]
Entities: [('FOREIGN FILING', 'CLI')]
Entities: [('FOREIGN FILING', 'CLI')]
Entities: [('FOREIGN FILING', 'CLI')]
Entities: [('FOREIGN FILING', 'CLI')]
Entities: [('FOREIGN FILING', 'CLI')]
Entities: [('FOREIGN FILING', 'CLI')]
Entities: [('FOREIGN FILING', 'CLI')]
Entities: [('FOREIGN FILING', 'CLI')]
Entities: [('FOREIGN FILING', 'CLI')]
Entities: []
Entities: [('SERVICE FEE - PREPARE & FILE ANNUAL REPORT/TAX RETURN - ANNUAL REPORT MONITORING SERVICE', 'CLI')]
Entities: [('SERVICE FEE - PREPARE & FILE ANNUAL REPORT/TAX RETURN - ANNUAL REPORT MONITORING SERVICE', 'CLI')]
Entities: [('SERVICE FEE - PREPARE & FILE ANNUAL REPORT/TAX RETURN - ANNUAL REPORT MONITORING SERVICE', 'CLI')]
Entities: [('SERVICE FEE - PREPARE & FILE ANNUAL REPORT/TAX RETURN - ANNUAL REPORT MONITORING SERVICE 2020', 'CLI')]
Entities: [('SERVICE FEE - PREPARE & FILE ANNUAL REPORT/TAX RETURN - ANNUAL RE

 42%|████▏     | 143/337 [00:00<00:01, 155.07it/s]

Entities: [('Head of Product Second Retainer Invoice Second Retainer Invoice', 'CLI')]
Entities: [('Vice President Insurance For securing services of Daversa Partners First Retainer Invoice', 'CLI')]
Entities: [('VP Business Development For securing services of Daversa Partners First Retainer Invoice', 'CLI')]
Entities: []
Entities: [('Total Expenses Billing for professional services in connection with the estimation of a range of value of certain intellectual property assets of Verifly, as of November 30, 2018.', 'CLI')]
Entities: [('Campaign Management March 2019 Campaign Management', 'CLI')]
Entities: [('2196269673816849-4106986', 'CLI')]
Entities: []
Entities: []
Entities: [('Service:', 'CLI')]
Entities: [('Service:', 'CLI')]
Entities: [('Service:', 'CLI')]
Entities: [('Service:', 'CLI')]
Entities: [('Service:', 'CLI')]
Entities: [('Service:', 'CLI')]
Entities: [('Estate Day pass', 'CLI')]
Entities: [('Freshdesk - Estate Monthly plan', 'CLI')]
Entities: [('Freshchat - Agents', 'CLI

 52%|█████▏    | 176/337 [00:01<00:01, 157.10it/s]

Entities: [('Google Ads', 'CLI')]
Entities: [('Google Ads', 'CLI')]
Entities: []
Entities: []
Entities: []
Entities: [('Strategic Finance Team', 'CLI')]
Entities: [('Strategic Finance Team:J_Lazarus', 'CLI')]
Entities: [('Strategic Finance Team:L_Smith', 'CLI')]
Entities: []
Entities: []
Entities: []
Entities: []
Entities: []
Entities: []
Entities: []
Entities: []
Entities: [('Expense Hogan Assessment for JB', 'CLI')]
Entities: []
Entities: []
Entities: []
Entities: []
Entities: []
Entities: []
Entities: []
Entities: [('Unlimited InVision Enterprise Users', 'CLI')]
Entities: [('Promotion 3M9PC46N applied', 'CLI')]
Entities: [('professional services rendered Schwartz, B', 'CLI')]
Entities: [('DS Review master lease, telephone conference with client regarding changes, edit and revise sublease, circulate comments', 'CLI')]
Entities: [('DS Review revised Guaranty, email with client regarding same, circulate redraft', 'CLI')]
Entities: [('DataGrip Commercial annual subscription', 'CLI')]
En

 62%|██████▏   | 208/337 [00:01<00:00, 147.59it/s]

Entities: [('Progress billing for tax consulting services per our engagement letter dated April 3, 2019.', 'CLI')]
Entities: [('A Comprehensive Guide to Self Employed Business Insurance', 'CLI')]
Entities: [('Dance Studio and Instructor Insurance', 'CLI')]
Entities: []
Entities: [('Short-form landing pages', 'CLI')]
Entities: []
Entities: [('Staff photography w/ post production Will produce head shots for all staff members, along with a team photo.', 'CLI')]
Entities: [('DISBURSEMENTS ADVANCED', 'CLI')]
Entities: [('DISBURSEMENTS ADVANCED', 'CLI')]
Entities: [('DISBURSEMENTS ADVANCED', 'CLI')]
Entities: [('DISBURSEMENTS ADVANCED', 'CLI')]
Entities: [('DISBURSEMENTS ADVANCED', 'CLI')]
Entities: [('DISBURSEMENTS ADVANCED', 'CLI')]
Entities: [('DISBURSEMENTS ADVANCED', 'CLI')]
Entities: [('PROFESSIONAL SERVICES', 'CLI')]
Entities: [('Professional Services Kwok W Lee', 'CLI')]
Entities: [('Seth H Ostrow', 'CLI')]
Entities: [('Advisory Service', 'CLI')]
Entities: [('Office 365 Advanced Thre

 72%|███████▏  | 243/337 [00:01<00:00, 156.97it/s]

Entities: [('Storage Standard Page Blob LRS Data Stored', 'CLI')]
Entities: [('Storage Standard SSD Managed Disks Disk Operations', 'CLI')]
Entities: [('Storage Tables Batch Write Operations', 'CLI')]
Entities: [('Storage Tables Batch Write Operations', 'CLI')]
Entities: [('Storage Tables', 'CLI')]
Entities: [('Virtual Machines D/DS Series D2/DS2 US East', 'CLI')]
Entities: [('Virtual Machines BS Series Windows', 'CLI')]
Entities: [('Virtual Machines Ev3/', 'CLI')]
Entities: [('Base charge (4 GB)', 'CLI')]
Entities: [('Trademark 4862340 renewal 1 class included', 'CLI')]
Entities: [('ISO Comm.', 'CLI')]
Entities: [('Care, Custody and Control Pricing Non-Accredited Actuary', 'CLI')]
Entities: []
Entities: [('Pingdom Advanced - Monthly 26 Jun 2019 - 26 Jul 2019', 'CLI')]
Entities: [('Discount code: ANALOGUE', 'CLI')]
Entities: [('Marketing Services Refer to the contract for a complete breakdown of services', 'CLI')]
Entities: [('Marketing Services', 'CLI')]
Entities: [('Marketing Service

 82%|████████▏ | 275/337 [00:01<00:00, 153.92it/s]

Entities: [('Staples Carpet Chair Mat, 36" x 48\'\', Crystal Clear (STP-17436) Staples Carpet Chair Mat, 36" x 48\'\', Crystal Clear (STP-17436)', 'CLI')]
Entities: [('Internet Access Internet Access : New York - Chinatown / LES - Stealth Fiber 1', 'CLI')]
Entities: []
Entities: [('m Limited Fixed Cost: Barracuda Sentinel BSENTS001a:', 'CLI'), ('Barracuda Sentinel for Office 365 - subscription license (1 year):', 'CLI')]
Entities: [('m Limited Fixed Cost: Barracuda Sentinel', 'CLI')]
Entities: [('jabra:', 'CLI')]
Entities: [('Billable Expenses Computer Accessories', 'CLI')]
Entities: [('Engineer Eric Gross', 'CLI')]
Entities: [('Engineer Michael Kerr', 'CLI')]
Entities: []
Entities: [('Billable Services Engineer', 'CLI')]
Entities: [('Engineer Alex Lora', 'CLI')]
Entities: [('Engineer Revor Vicencio', 'CLI')]
Entities: [('Engineer Ajoykumar Rajasekaran', 'CLI')]
Entities: []
Entities: [('Leads 43952', 'CLI')]
Entities: [('Junk hauling', 'CLI')]
Entities: [('Scotch 11.25" x 8.75', 'CLI'

 91%|█████████▏| 308/337 [00:02<00:00, 148.07it/s]

Entities: [('Filing Fees J. Smith license applications in 27 states.', 'CLI')]
Entities: [('Filing Fees Verifly entity renewal in Washington.', 'CLI')]
Entities: [('Filing Fees Verifly entity renewal in the following states with the accompanying state fees: Alaska ($80), California ($193), Colorado ($152), District of Columbia ($105), Idaho ($65), Illinois ($155), New Hampshire ($155), New Jersey ($155), Oregon ($155), Pennsylvania ($115), South Dakota ($5), Utah ($90), and West Virginia ($205).', 'CLI')]
Entities: [('Filing Fees Secretary of State DBA Name Registration in Vermont.', 'CLI')]
Entities: [('Lexis Fees Monthly Lexis', 'CLI')]
Entities: [('UPS Fees - EX', 'CLI')]
Entities: [('UPS Fees Monthly UPS shipment(s).', 'CLI')]
Entities: [('General Pulled New Mexico affiliation renewal form and list of affiliations.', 'CLI')]
Entities: [('General CC', 'CLI')]
Entities: [('KGL Correspondence with New York regarding DP Chnage.', 'CLI')]
Entities: [('Professional Services KGL', 'CLI')]

100%|██████████| 337/337 [00:02<00:00, 152.33it/s]

Entities: [('Arch 1.', 'CLI')]
Entities: [('Mike Data 1.', 'CLI')]
Entities: [('2 Mike Data 1.', 'CLI')]
Entities: [('Misc Rent (per seat x 17)', 'CLI')]
Entities: []
Entities: [('Misc Rent (per seat x 11)', 'CLI')]
Entities: [('Misc LCD ($425)', 'CLI')]
Entities: [('misc xiamen ->', 'CLI')]
Entities: [('Misc nyc', 'CLI')]
Entities: [('Misc Isaac’s trip for visa interview (round-trip flight and one night hotel)', 'CLI')]
Entities: [('Misc Rent (per seat x 16)', 'CLI')]
Entities: [('Misc Rent (per seat x 8)', 'CLI')]
Entities: [('Misc macbook ($1,500)', 'CLI')]
Entities: []
Entities: [('Maggie PM 1.', 'CLI')]
Entities: [('Tina PM 1.', 'CLI')]
Entities: []
Entities: [('Jesse Web 1.', 'CLI')]
Entities: [('Jesse Web 1.', 'CLI')]
Entities: [('Web 1.', 'CLI')]
Entities: [('Sam Web 1.', 'CLI')]
Entities: [('Sam Web 1.', 'CLI')]
Entities: [('Sam Web 1.', 'CLI')]
Entities: [('Conducted a 10-15 minute email survey of up to 6,000 Verifly members Frequencies and cross-tabulation tables: a breakd




In [6]:
predicted_entities

['Management Services May',
 'Acrobat Pro DC',
 'AIEX 96 Pieces',
 'AmazonBasics AAA',
 'AmazonBasics Mesh Trash Can Waste Basket 1',
 'AmazonFresh Mediterranean Extra Virgin Olive Oil, 68 Fl Oz (2L) B01N3LCEDL',
 'Apple 87W USB-C Power Adapter (for MacBook Pro)',
 'Apple iPad 2',
 'Apple iPad with Retina Display',
 'Apple iPad with Retina Display',
 'Assurant B2B 4YR Kitchen Protection Plan with Accidental Damage $50-74',
 'Bose QuietComfort 35 (Series I) Wireless Headphones,',
 'Calculator,Vilcome 12-Digit Solar Battery Office Calculator with Large LCD Display Big Sensitive Button, Dual Power Desktop Calculators (Black) Calculator,Vilcome 12-Digit Solar Battery Office Calculator with Large LCD Display Big Sensitive Button, Dual Power Desktop Calculators (Black)',
 'Dell 130-WATT 3-Prong AC Adapter with 6 FT Power Cord',
 'Logitech K400 Plus Wireless Touch TV Keyboard with Easy Media Control and Built-',
 'Microsoft Natural Ergonomic Keyboard 4000 for Business - Wired',
 'Mouthwash Di

In [7]:
df['predicted_canonical_line_item_name'] = predicted_entities

In [8]:
df = df.drop(['canonical_line_item_name'],axis=1)

In [9]:
df

| | line_item_name | line_item_description | canonical_vendor_name | name_desc | predicted_canonical_line_item_name |
|---|---|---|---|---|---|
| 0 | Management Services | May 2019 Services | 10 Minute Ventures | Management Services May 2019 Services | Management Services May |
| 1 | Acrobat Pro DC | | Adobe | Acrobat Pro DC | Acrobat Pro DC |
| 2 | AIEX 96 Pieces Adhesive Poster Tacky Putty Sti... | | Amazon Business | AIEX 96 Pieces Adhesive Poster Tacky Putty Sti... | AIEX 96 Pieces |
| 3 | AmazonBasics AAA 1.5 Volt Performance Alkaline... | | Amazon Business | AmazonBasics AAA 1.5 Volt Performance Alkaline... | AmazonBasics AAA |
| 4 | AmazonBasics Mesh Trash Can Waste Basket | 1 | Amazon Business | AmazonBasics Mesh Trash Can Waste Basket 1 | AmazonBasics Mesh Trash Can Waste Basket 1 |
| ... | ... | ... | ... | ... | ... |
| 332 | Web | 1. Thimble Monthly Webapp 2. Customer Referral... | Xiamen ZhiZhi Tech | Web 1. Thimble Monthly Webapp 2. Customer Refe... | Web 1. |
| 333 | Sam Web | 1. Web app 2.0 mobile version 2. Purchase wid... | Xiamen ZhiZhi Tech | Sam Web 1. Web app 2.0 mobile version 2. Purch... | Sam Web 1. |
| 334 | Sam Web | 1. Segment Integration 2. Iterable Integration | Xiamen ZhiZhi Tech | Sam Web 1. Segment Integration 2. Iterable Int... | Sam Web 1. |
| 335 | Sam Web | 1. Broker Purchase Project 2. New Insurance Pr... | Xiamen ZhiZhi Tech | Sam Web 1. Broker Purchase Project 2. New Insu... | Sam Web 1. |


In [10]:
df.to_csv('eval_predictions_lg.csv',index=False)

### Observation and Results:

- We fine-tuned the model on only a few hundred data points, so the results are far from perfect. The model has not generalized fully, but it does a decent job, and this approach gives us a good starting point for identifying and mapping custom entities.
- The predicted entities obtained from the 3 models are saved in 3 separate CSV files inside the folder output_eval.
- Results may be further improved by using more data, providing better annotations, or changing the model's parameters during training.
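
To put a number on "far from perfect", one option is an exact-match score between the predicted and the labeled canonical names. The following is a minimal sketch, assuming the eval sheet carries ground-truth values in `canonical_line_item_name`; the notebook drops that column before saving, so the comparison would have to run before that step. The function names and the sample labels below are hypothetical and chosen only for illustration.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are not counted as errors."""
    return " ".join(str(text).lower().split())


def exact_match_accuracy(predicted, expected):
    """Fraction of rows where the normalized prediction equals the normalized label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(normalize(p) == normalize(e) for p, e in zip(predicted, expected))
    return hits / len(expected)


# Hypothetical labels for illustration only; the real eval labels are not shown in this notebook.
predicted = ["Acrobat Pro DC", "Management Services May", ""]
expected = ["Acrobat Pro DC", "Management Services", "Google Ads"]
print(exact_match_accuracy(predicted, expected))
```

In the notebook this could be called as `exact_match_accuracy(predicted_entities, df['canonical_line_item_name'].tolist())` once per model, which would allow the sm, md and lg variants to be compared on the same footing. Exact match is a strict measure, so a partial-overlap score may be a fairer alternative for long line items.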