<a href="https://colab.research.google.com/github/YanjunLin-Andrie/NLP_SpaCy_eBay/blob/main/eBay.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Goal: Investigate the eBay challenge with the intention of integrating machine learning and AI techniques from class. Using the eBay datasets, we learn, build, test, evaluate, and optimize our NLP product.

Plan:

1. Slice both datasets for higher efficiency and accuracy

2. Start Google Colab for group collaboration

3. Keep learning and researching NLP, spaCy, and similar projects

4. Use the new Train Tagged Titles dataset to build our pipeline with custom NER labels and patterns

5. Test the created pipeline on a small subset of the Listing Titles dataset

  5.1 Load the original training dataset for a fuller entity list and better results

6. Get the classification report of the current model and identify what we need to do to improve model accuracy (e.g., improve the pattern list)

7. Optimize the model

8. Optional: explore other libraries, e.g., sklearn NLP models

9. Prepare the presentation and README file

# We are on step 6.

In [31]:
# Import all libraries and dependencies

import pandas as pd
import spacy
import numpy as np
from spacy.tokens import Span
from collections import Counter
from string import punctuation
from random import shuffle
from spacy.scorer import Scorer
from spacy.tokens import Doc
from spacy.training.example import Example

In [32]:
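# Import the training and test datasets
from google.colab import files
# uploaded = files.upload()
# Read the tagged training titles (tab-separated), skipping malformed lines
ttt = pd.read_csv('Train_Tagged_Titles.tsv', on_bad_lines = 'skip', sep = '\t')
# Fill missing values with the 'Brand' tag
ttt = ttt.replace(np.nan, 'Brand', regex=True)
# Read the small sample of listing titles used for testing
lt_sm = pd.read_csv('Listing_Titles_sm.tsv')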

In [33]:
print(ttt.shape)
display(ttt)

(38209, 4)


,Record Number,Title,Token,Tag
0,1,LOUIS VUITTON M40096 Handbag Priscilla Multi-c...,LOUIS,Brand
1,1,LOUIS VUITTON M40096 Handbag Priscilla Multi-c...,VUITTON,Brand
2,1,LOUIS VUITTON M40096 Handbag Priscilla Multi-c...,M40096,MPN
3,1,LOUIS VUITTON M40096 Handbag Priscilla Multi-c...,Handbag,Type
4,1,LOUIS VUITTON M40096 Handbag Priscilla Multi-c...,Priscilla,Model
...,...,...,...,...
38204,5000,Botkier Sasha Medium Duffel Bag Coral Leather ...,Top,No Tag
38205,5000,Botkier Sasha Medium Duffel Bag Coral Leather ...,Closure,No Tag
38206,5000,Botkier Sasha Medium Duffel Bag Coral Leather ...,Retail,No Tag
38207,5000,Botkier Sasha Medium Duffel Bag Coral Leather ...,$,No Tag


In [34]:
lt_sm.drop(columns = 'Unnamed: 0', inplace = True)
print(lt_sm.shape)
display(lt_sm)

(9985, 2)


,Record Number,Title
0,19989937,Authentic Louis Vuitton Monogram Bucket PM Sho...
1,19989938,Will Leather Canvas tote
2,19989939,Brand New with tags Ness Small Tweed bag
3,19989940,Dooney & Bourke Suede Leather Gia Satchel Hand...
4,19989941,Pirates of the Caribbean Canvas String Drawstr...
...,...,...
9980,19999917,' Wire Frame Octopus ' Tote Shopping Bag For L...
9981,19999918,Vintage Etienne Aigner Medium Oxblood Red Leat...
9982,19999919,Nine West Purse Shoulder Bag Handbag Camel Bro...
9983,19999920,New 2021 Ladies Tote Bag Women Crossbody Shoul...


## Strategies:

* Rule-based approach, following https://youtu.be/wpyCzodvO3A
* Train a machine learning model to catch misspelled tokens: https://www.youtube.com/watch?v=YBRF7tq1V-Q

* Custom NER training for specific use case
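
As a small illustration of the rule-based strategy, here is a minimal sketch assuming spaCy v3's `entity_ruler` component. The two patterns are illustrative and not read from the dataset files. Setting `phrase_matcher_attr` to `"LOWER"` is one possible way to let a single pattern match both `GUCCI` and `Gucci`; the notebook itself does not use this option.

```python
import spacy

# Blank English pipeline with only an entity ruler
nlp_demo = spacy.blank("en")
# Assumption: compare lowercase token text, so string patterns match regardless of case
ruler_demo = nlp_demo.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})
# Illustrative patterns (hypothetical, not taken from Train_Tagged_Titles.tsv)
ruler_demo.add_patterns([
    {"label": "Brand", "pattern": "Louis Vuitton"},
    {"label": "Type", "pattern": "handbag"},
])

doc = nlp_demo("LOUIS VUITTON M40096 Handbag Priscilla")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

A multi-word string pattern such as `"Louis Vuitton"` yields one entity spanning both tokens, whereas single-token patterns label each word separately.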


In [35]:
# Get all of the unique tags from ttt dataframe and change them to a list format
all_tags = ttt["Tag"].unique().tolist()

In [36]:
# all_tags

In [37]:
# Create a blank English model
nlp = spacy.blank('en')
# Create the entity ruler and add to entity pipeline
ruler = nlp.add_pipe('entity_ruler')
# Create an empty list to collect all the patterns of the entity
patterns = []


In [38]:
# Loop through all of the tag names
for tag in all_tags:
  # Collect the list of tokens annotated with this tag
  items = ttt.loc[ttt["Tag"] == tag, "Token"].tolist()
  # Loop through the collected tokens
  for item in items:
    # Add a new pattern to the patterns list
    patterns.append({'label': tag, 'pattern': item})

In [39]:
print(patterns[:10])
print(patterns[-10:])

[{'label': 'Brand', 'pattern': 'LOUIS'}, {'label': 'Brand', 'pattern': 'VUITTON'}, {'label': 'Brand', 'pattern': 'LOUIS'}, {'label': 'Brand', 'pattern': 'VUITTON'}, {'label': 'Brand', 'pattern': 'Noe'}, {'label': 'Brand', 'pattern': 'LOUIS'}, {'label': 'Brand', 'pattern': 'VUITTON'}, {'label': 'Brand', 'pattern': 'LV'}, {'label': 'Brand', 'pattern': 'GUCCI'}, {'label': 'Brand', 'pattern': 'Gucci'}]
[{'label': 'Lining Material', 'pattern': 'lamb'}, {'label': 'Lining Material', 'pattern': 'Satin'}, {'label': 'Strap Drop', 'pattern': '12CM'}, {'label': 'Strap Drop', 'pattern': '18'}, {'label': 'Strap Drop', 'pattern': '23'}, {'label': 'Handle Drop', 'pattern': '24CM'}, {'label': 'Handle Drop', 'pattern': 'SHORT'}, {'label': 'Handle Drop', 'pattern': 'Long'}, {'label': 'Handle Drop', 'pattern': '11-inch'}, {'label': 'Handle Drop', 'pattern': 'Strap'}]
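
The printed patterns contain repeats (`LOUIS` and `VUITTON` appear several times), and a token can carry more than one tag across the training data. One possible cleanup, sketched below under the assumption that the most frequent tag is the best label for a token, keeps a single pattern per distinct token. The function name `majority_patterns` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def majority_patterns(token_tag_pairs):
    """Build one pattern per distinct token, labeled with its most frequent tag."""
    counts = defaultdict(Counter)
    for token, tag in token_tag_pairs:
        counts[token][tag] += 1
    patterns = []
    for token, tag_counts in counts.items():
        # most_common(1) returns [(tag, count)]; ties go to the tag seen first
        best_tag = tag_counts.most_common(1)[0][0]
        patterns.append({"label": best_tag, "pattern": token})
    return patterns

# Hypothetical rows in the same (Token, Tag) shape as the ttt dataframe
pairs = [
    ("LOUIS", "Brand"), ("VUITTON", "Brand"), ("Bag", "Type"),
    ("LOUIS", "Brand"), ("Bag", "Type"), ("Bag", "Brand"),
]
deduped = majority_patterns(pairs)
print(deduped)
```

With the notebook's dataframe, the same function could be called as `majority_patterns(zip(ttt["Token"], ttt["Tag"]))`.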


In [40]:
# add patterns list to model
ruler.add_patterns(patterns)

# Save model with patterns
nlp.to_disk('./ebay')

In [41]:
# Load model
nlp = spacy.load('./ebay')

In [42]:
# Create lists
tokens = ttt["Token"]
tags = ttt["Tag"]
TRAIN_DATA = []

In [43]:
# tokens

In [44]:
# tags

In [45]:
# Create list with all ttt tokens and tags grouped together
for token, tag in zip(tokens, tags):
  TRAIN_DATA.append([token, tag])
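
The flat `[token, tag]` list does not record which title each token came from. For the custom NER strategy, spaCy's `Example.from_dict` accepts entities as `(start, end, label)` character offsets into the text. The sketch below shows one way to derive such offsets from a title and its token rows. The helper name `to_spacy_example` and the choice to drop `No Tag` tokens are assumptions, not part of the notebook.

```python
def to_spacy_example(title, token_tag_pairs, skip_tags=("No Tag",)):
    """Convert a title and its (token, tag) rows to (text, {"entities": [...]})."""
    entities = []
    cursor = 0  # search position, so repeated tokens map to successive occurrences
    for token, tag in token_tag_pairs:
        start = title.find(token, cursor)
        if start == -1:
            # Token not found after the cursor: skip it rather than guess an offset
            continue
        end = start + len(token)
        cursor = end
        # Assumption: tokens tagged "No Tag" are not entities
        if tag not in skip_tags:
            entities.append((start, end, tag))
    return title, {"entities": entities}

# Illustrative title prefix and rows modeled on the first record shown above
title = "LOUIS VUITTON M40096 Handbag"
rows = [("LOUIS", "Brand"), ("VUITTON", "Brand"), ("M40096", "MPN"), ("Handbag", "Type")]
example = to_spacy_example(title, rows)
print(example)
```

Grouping `ttt` by `Record Number` would give one `(title, rows)` pair per listing to pass into this helper.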

In [46]:
# TRAIN_DATA

In [47]:
# Collect the test titles and convert them to a list
sentance_list = lt_sm['Title'].tolist()

In [48]:
# sentance_list

In [49]:
# Run the saved eBay model over each test title and collect [entity text, label] pairs
dta = []
for title in sentance_list:
  doc = nlp(title)
  for ent in doc.ents:
    dta.append([ent.text, ent.label_])
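
Before scoring, it can help to summarize what the ruler predicted. The sketch below counts labels and lists texts that received more than one label, using a hypothetical sample in the same `[text, label]` shape as `dta`. Lowercasing the text before grouping is an assumption made here so that `Bag` and `BAG` are treated as the same token.

```python
from collections import Counter, defaultdict

# Hypothetical sample in the same [text, label] shape as dta
dta_sample = [["Shoulder", "Brand"], ["Bag", "Brand"], ["Black", "Color"], ["BAG", "Type"]]

# How often each label was predicted
label_counts = Counter(label for _, label in dta_sample)

# Which texts (case-insensitive) were given more than one label
labels_by_text = defaultdict(set)
for text, label in dta_sample:
    labels_by_text[text.lower()].add(label)
conflicts = {text: sorted(labels) for text, labels in labels_by_text.items() if len(labels) > 1}

print(label_counts.most_common())
print(conflicts)
```

Texts that show up in `conflicts` are candidates for review in the pattern list.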

In [50]:
# dta

In [51]:
# Create a function to get the eBay NLP model's performance evaluation metrics
def my_evaluate(model, examples):
  # Create Scorer class
  scorer = Scorer()
  # Create empty list to collect one Example per (text, label) pair
  example = []
  # Loop through examples
  for input_, annotations in examples:
    # Predict
    pred = model(input_)
    print(pred, annotations)
    # Gold annotation: the whole input text is one entity with the given label
    gold = {"entities": [(0, len(input_), annotations)]}
    # Create Example instance from the prediction and the gold annotation
    temp = Example.from_dict(pred, gold)
    # Append to example list
    example.append(temp)
  # Get scores
  scores = scorer.score(example)
  return scores

In [52]:
# Get score of model from my_evaluate
score = my_evaluate(nlp,dta)
print(score)

Streaming output truncated to the last 5000 lines.
Shoulder Brand
Bag Brand
Coach Product Line
Black Color
Jacquard Fabric Type
Gold Color
Hardware No Tag
Cross Type
Body Brand
Small Obscure
Shoulder Handle Style
Bag Type
Purse Brand
SATCHEL Type
CROSSBODY Type
SHOULDER Handle Style
BAG Brand
W No Tag
POUCH Type
PURSE Type
set No Tag
by Brand
Radley Brand
Red Color
Leather Trim Material
Hobo Type
/ Handle/Strap Material
Shoulder Handle Style
Grab Model
Bag Type
Medium Size
/ No Tag
Large Brand
Size No Tag
Gucci Product Line
Brown Color
Hobo Type
Web Product Line
Canvas Brand
Bag Brand
With Obscure
Entrupy No Tag
Auth No Tag
Louis Brand
Vuitton Brand
Speedy Brand
25 Measurement, dimension
Monogram Brand
Hand Features
Bag Type
Mini Model
Boston Type
Bag Brand
M41528 MPN
Los Brand
Angeles Brand
Bright Brand
Red Product Line
Hobo Type
Sling Handle Style
Handbag Type
Harajuku No Tag
Lovers No Tag
Crossbody Type
shoulder Brand
Bag Brand
Gwen Model
girls Brand
black Color
white 
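
Step 6 of the plan calls for a classification report. A minimal hand-rolled sketch is shown below, assuming gold tags (for example from Train_Tagged_Titles) and predicted tags that are aligned token by token. The function name `per_tag_report` and the sample tags are hypothetical. `sklearn.metrics.classification_report` would be an alternative if sklearn is added to the project.

```python
from collections import Counter

def per_tag_report(gold_tags, pred_tags):
    """Token-level precision, recall, and F1 per tag for two aligned tag lists."""
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold_tags and pred_tags must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_tags, pred_tags):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted tag was wrong for this token
            fn[gold] += 1  # gold tag was missed for this token
    report = {}
    for tag in sorted(set(gold_tags) | set(pred_tags)):
        p_den = tp[tag] + fp[tag]
        r_den = tp[tag] + fn[tag]
        precision = tp[tag] / p_den if p_den else 0.0
        recall = tp[tag] / r_den if r_den else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[tag] = {"precision": precision, "recall": recall, "f1": f1, "support": r_den}
    return report

# Hypothetical aligned tags for five tokens
gold = ["Brand", "Brand", "Type", "Color", "Type"]
pred = ["Brand", "Type", "Type", "Color", "Brand"]
report = per_tag_report(gold, pred)
for tag, metrics in report.items():
    print(tag, metrics)
```

Tags with low precision point to patterns that fire too often, and tags with low recall point to tokens missing from the pattern list.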

In [53]:
print(nlp.pipe_names)

['entity_ruler']
