# NER: Dataset Generation

The main purpose of this notebook is to create a training dataset for Named Entity Recognition (NER) while avoiding manual annotation as much as possible. This makes it possible to produce enough data for the task in a short amount of time. The tradeoff is lower label quality, which is somewhat apparent in the performance of the NER model, though it is good enough for the prototype. The general approach is:
* Token-based similarity, measured as the cosine similarity of GloVe vectors (100d) loaded through Gensim
* Hand-picked words are gathered for specific entity categories; a token is assigned a category if its similarity to one of these words passes a specific threshold
* This accounts for
  * PROD: products
  * BRND: brand names
  * MATR: Materials
  * TIME: Time
  * MISC: Miscellaneous
* PERS: person names are difficult to gather manually, so spaCy is used to tag this type
  * spaCy's medium English model is used since the large version takes too long to process
* The main entities are PROD and BRND while the others are in place in order to better distinguish the tokens
  * Problematic cases are person names being classified as brand names, e.g. Tommy Hilfiger
  * Other problematic cases are product words having high similarity values to words such as 'leather' or 'cloth', which is why a separate entity MATR is in place.




In [None]:
import nltk
import spacy
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
!python -m spacy download en_core_web_md

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4MB)
     |████████████████████████████████| 96.4MB 1.2MB/s
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-cp36-none-any.whl size=98051304 sha256=fe9d69ddc4bb8a8007426bbb8486039999a5627e9538112132cf95ba90e005bf
  Stored in directory: /tmp/pip-ephem-wheel-cache-ulo6knzg/wheels/df/94/ad/f5cf59224cea6b5686ac4fd1ad19c8a07bc026e13c36502d81
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [None]:
spacy_nlp = spacy.load("en_core_web_md")

In [None]:
import gensim.downloader as api
glove_vectors = api.load("glove-wiki-gigaword-100") # 128 MB

In [None]:
# The GloVe vocabulary is lower-cased, so words need to be lower-cased before lookup
print("Word exists" if "JAZZ" in glove_vectors.vocab else "Word does not exist")

Word does not exist


In [None]:
import pandas as pd
import numpy as np
import itertools
from tqdm import tqdm

# Configs

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
PROJECT_PATH = '/content/drive/MyDrive/Colab/data/ma_data/'
DATA_PATH = PROJECT_PATH + 'ma_feedback_all_doublecheck.csv'
OUT_PATH = PROJECT_PATH + 'ner_train_feedback04.csv'

# Load Data

* Only rows whose text has been identified as product-related are kept, since we want as many of the entities relating to products as possible while keeping the amount of data reasonable for faster experimentation cycles

In [None]:
df_raw = pd.read_csv(DATA_PATH)
columns = ["id", "feedback_text_en", "product"]
df = df_raw[columns]
df_products = df[df["product"] == True]
print(f"df_products length: {len(df_products)}") 

df_products length: 6552


In [None]:
df_products.to_csv(PROJECT_PATH + "ma_data_ner_products.csv")

# Generate Entities

* Noun words are extracted from the example texts to make the entity generation process somewhat more efficient
* Using NLTK's word pos tagger to retrieve nouns
* Token and Entity label pairs are defined (ents2tag) as domain specific entities
* Each noun is compared with the domain words, and it is tagged with the entity label of the most similar domain word if the similarity exceeds the threshold value
* GloVe word vectors with 100 dimensions are chosen for their good performance and small size
* The threshold value was chosen empirically over multiple iterations and is currently set to 0.75 (a value of 1 means the chosen model considers the two words identical)
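
The thresholding step described above can be sketched without GloVe. The snippet below is only an illustration: the 3-dimensional vectors are made up and stand in for the 100-dimensional embeddings, and `tag_by_similarity` is a hypothetical helper, not a function from this notebook.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def tag_by_similarity(word_vec, domain_vectors, domain_tags, threshold=0.75):
    """Return the tag of the most similar domain word, or None below the threshold."""
    best_word, best_score = None, -1.0
    for domain_word, vec in domain_vectors.items():
        score = cosine_similarity(word_vec, vec)
        if score > best_score:
            best_word, best_score = domain_word, score
    return domain_tags[best_word] if best_score > threshold else None

# Toy vectors, invented for illustration only (assumption, not real GloVe values)
toy_vectors = {"shirt": [1.0, 0.1, 0.0], "leather": [0.0, 1.0, 0.2]}
toy_tags = {"shirt": "PROD", "leather": "MATR"}

close_to_shirt = tag_by_similarity([0.9, 0.2, 0.0], toy_vectors, toy_tags)
close_to_nothing = tag_by_similarity([0.5, 0.5, 0.7], toy_vectors, toy_tags)
print(close_to_shirt, close_to_nothing)
```

A word that is not close enough to any domain word stays untagged, which is what later allows it to receive the `O` tag.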


In [None]:
def nltk_tag_to_wordnet_tag(nltk_tag):
    # Map a Penn Treebank tag prefix to the matching WordNet POS constant
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

In [None]:
def get_nouns_nltk(texts, exceptions):
    nouns_list = []
    for text in texts:
        nouns = []    
        nltk_tagged = nltk.pos_tag(nltk.word_tokenize(text)) 
        wordnet_tagged = [(word, nltk_tag_to_wordnet_tag(tag)) for word, tag in nltk_tagged]
        for word, tag in wordnet_tagged:
            # keep nouns, plus any word explicitly listed as an exception
            if tag == wordnet.NOUN or word in exceptions:
                nouns.append(word)
        
        nouns_list.append(nouns)
    return nouns_list

In [None]:
product_entities = ["hat", "sunglasses", "jewelery", "scarf", "shirt", "jacket",
               "dress", "bra", "pants", "skirt", "legging", "shoes", "bag"]

brand_entities = ["nike", "adidas", "esprit", "armani", "eastpak", "quicksilver", 
                  "diesel", "hilfiger", "vans", "lacoste", "next", "zalando", "garmin", 
                  "gant", "puma", "mango", "woden", "spencer", "tamaris"]     

material_entities = ["fabric", "silk", "cloth", "plastic", "leather", "seam", "thread"]

time_entities = ["time", "month", "year", "day"]

misc_entities = ["zipper", "button"]

product_tags = ["PROD"] * len(product_entities)
brand_tags = ["BRND"] * len(brand_entities)
material_tags = ["MATR"] * len(material_entities)
misc_tags = ["MISC"] * len(misc_entities)
time_tags = ["TIME"] * len(time_entities)
tags_combined = product_tags + brand_tags + material_tags + misc_tags + time_tags

entities_combined = product_entities + brand_entities + material_entities + misc_entities + time_entities
ents2tag = dict(zip(entities_combined, tags_combined))

In [None]:
# For words unknown to GloVe, save a separate dict
unknown_ent2tag = {key: tag for key, tag in ents2tag.items()
                   if key not in glove_vectors.vocab}

In [None]:
unknown_ent2tag

{'tamaris': 'BRND', 'zalando': 'BRND'}

In [None]:
for unknown_tag in unknown_ent2tag.keys():
    ents2tag.pop(unknown_tag)

In [None]:
bigrams = ["north face", "reebok classic", "tommy hilfiger", "adidas continental", 
           "adidas performance", "street one", "tom tailor", "massimo dutti", 
           "tommy jeans", "pier one", "ralph lauren", "nike body"]
first_words = [pair.split(" ")[0] for pair in bigrams]
second_words = [pair.split(" ")[1] for pair in bigrams]
second2first = dict(zip(second_words, first_words))

In [None]:
second2first

{'body': 'nike',
 'classic': 'reebok',
 'continental': 'adidas',
 'dutti': 'massimo',
 'face': 'north',
 'hilfiger': 'tommy',
 'jeans': 'tommy',
 'lauren': 'ralph',
 'one': 'pier',
 'performance': 'adidas',
 'tailor': 'tom'}
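
Note that the mapping printed above has 11 entries for 12 bigrams: "street one" and "pier one" share the second word, and a dict keeps only the last value per key, so "street one" is never recognised as a bigram. One possible way around this, shown here only as a sketch with the hypothetical names `second2firsts` and `continues_brand`, is to map each second word to a set of first words:

```python
from collections import defaultdict

# Subset of the bigram list, enough to show the collision
example_bigrams = ["street one", "pier one", "tommy hilfiger", "tommy jeans"]

# dict(zip(...)) keeps only the last first word for a repeated second word
first = [pair.split(" ")[0] for pair in example_bigrams]
second = [pair.split(" ")[1] for pair in example_bigrams]
naive = dict(zip(second, first))

# Alternative (assumption, not the notebook's approach): keep every first word
second2firsts = defaultdict(set)
for pair in example_bigrams:
    first_word, second_word = pair.split(" ")
    second2firsts[second_word].add(first_word)

def continues_brand(prev_word, word):
    """True if `word` may follow `prev_word` as the second half of a brand bigram."""
    return prev_word in second2firsts.get(word, set())

print(naive["one"])
print(continues_brand("street", "one"), continues_brand("nike", "one"))
```

With this structure the check `prev_word == second2first[word_lower]` would become a set membership test.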

In [None]:
texts = df_products["feedback_text_en"].astype(str).tolist()

nouns_list = get_nouns_nltk(texts, first_words + second_words)

In [None]:
def get_word_tag_pairs(model, ents2tag, nouns_list, threshold=0.75):
    word_tag_pairs_list = []
    entities_list = list(ents2tag.keys())

    for nouns in tqdm(nouns_list):
        word_tag_pairs = []
        prev_word = ""
        for word in nouns:
            word_lower = word.lower()
            doc = spacy_nlp(word)
            persons = [ent for ent in doc.ents if ent.label_ == "PERSON"]

            # Check bi-grams for brands
            if word_lower in second2first and prev_word == second2first[word_lower]:
                word_tag_pairs.append([word, "I-BRND"])
                prev_word = ""

            elif word_lower in first_words:
                word_tag_pairs.append([word, "B-BRND"])
                prev_word = word_lower

            # try pre-defined list
            elif word_lower in ents2tag:
                tag = "B-" + ents2tag[word_lower]
                word_tag_pairs.append([word, tag])
                prev_word = ""

            # try words unknown to GloVe
            elif word_lower in unknown_ent2tag:
                tag = "B-" + unknown_ent2tag[word_lower]
                word_tag_pairs.append([word, tag])
                prev_word = ""

            # Spacy tagger for "person" tagging
            elif len(persons):
                tag = "B-PERS"
                # continue the span if the previously tagged word started a person name
                if word_tag_pairs and word_tag_pairs[-1][1] == "B-PERS":
                    tag = "I-PERS"
                word_tag_pairs.append([word, tag])
                prev_word = word_lower

            # try similarity with GloVe
            elif word_lower in model.vocab:
                similar_entities = [model.similarity(word_lower, entity) for entity in entities_list]

                max_idx = np.argmax(similar_entities)
                max_score = similar_entities[max_idx]

                if max_score > threshold:
                    pred_tag = ents2tag[entities_list[max_idx]]
                    word_tag_pairs.append([word, "B-" + pred_tag])
                    prev_word = ""
                else:
                    prev_word = word_lower
                
            else:
                prev_word = word_lower

        word_tag_pairs_list.append(word_tag_pairs)
    return word_tag_pairs_list

In [None]:
%%time

# Quick test
word_tag_pairs_list_test = get_word_tag_pairs(glove_vectors, ents2tag, [['sports',
 'Pants',
 'Seam',
 'Ralph',
 'Lauren',
 'Something',
 'opinion',
 'Massimo',
 'Dutti',
 'monthly',
 'Article',
 'Tommy',
 'Hilfiger',
 'Money',
 'John',
 'Smith']])



100%|██████████| 1/1 [00:00<00:00,  3.79it/s]

CPU times: user 205 ms, sys: 6.75 ms, total: 212 ms
Wall time: 270 ms





In [None]:
word_tag_pairs_list_test

[[['Pants', 'B-PROD'],
  ['Seam', 'B-MATR'],
  ['Ralph', 'B-BRND'],
  ['Lauren', 'I-BRND'],
  ['Massimo', 'B-BRND'],
  ['Dutti', 'I-BRND'],
  ['Tommy', 'B-BRND'],
  ['Hilfiger', 'I-BRND'],
  ['John', 'B-PERS']]]

In [None]:
%%time
word_tag_pairs_list = get_word_tag_pairs(glove_vectors, ents2tag, nouns_list)

100%|██████████| 6552/6552 [12:21<00:00,  8.84it/s]

CPU times: user 12min 3s, sys: 4.42 s, total: 12min 8s
Wall time: 12min 21s





In [None]:
len(word_tag_pairs_list)

6552

In [None]:
word_tag_pairs_list[12]

[['Klein', 'B-PERS'],
 ['one', 'B-TIME'],
 ['seam', 'B-MATR'],
 ['fabric', 'B-MATR'],
 ['adidas', 'B-BRND'],
 ['leather', 'B-MATR'],
 ['seam', 'B-MATR']]

# Convert to NER training data format
* So that the model can tell the desired entities apart from all other tokens, every token that is not part of an entity receives the `O` (outside) tag, as proposed by BIO notation.
* This way the complete sentence can be included, not only the nouns filtered above
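
As a minimal sketch of this alignment, assuming a tokenised sentence and a list of `[word, tag]` pairs like the ones produced earlier, every token without a predicted tag falls back to `O`. The helper name `to_bio_rows` and the example sentence are hypothetical.

```python
def to_bio_rows(tokens, word_tag_pairs):
    """Pair every token with its predicted tag, defaulting to "O"."""
    word2tag = {}
    for word, tag in word_tag_pairs:
        # the first tag wins if the same word was tagged more than once
        word2tag.setdefault(word, tag)
    return [(token, word2tag.get(token, "O")) for token in tokens]

example_tokens = ["I", "ordered", "Ralph", "Lauren", "pants", "."]
example_pairs = [["Ralph", "B-BRND"], ["Lauren", "I-BRND"], ["pants", "B-PROD"]]

bio_rows = to_bio_rows(example_tokens, example_pairs)
for token, tag in bio_rows:
    print(token, tag)
```

Matching is done on the token string alone, so every occurrence of a tagged word in the same text receives the same tag.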

In [None]:
ids = df_products["id"].astype(int).tolist()
grouping_ids_list, tokens_list, tags_list = [], [], []

for text_id, text, word_tag_pairs in tqdm(zip(ids, texts, word_tag_pairs_list)):
    grouping_ids, tokens, tags = [], [], []

    # tokenize texts
    texts_tokenized = word_tokenize(text)
    grouping_ids.extend([text_id] * len(texts_tokenized))

    for token in texts_tokenized:
        tokens.append(token)

        # use the tag of the first predicted pair whose word equals the token;
        # taking the first match keeps the result deterministic, otherwise use "O"
        tag = next((pair_tag for pair_word, pair_tag in word_tag_pairs if pair_word == token), "O")
        tags.append(tag)

    grouping_ids_list.append(grouping_ids)
    tokens_list.append(tokens)
    tags_list.append(tags)

6552it [00:04, 1581.29it/s]


In [None]:
len(grouping_ids_list)

6552

In [None]:
len(tokens_list)

6552

In [None]:
len(tags_list)

6552

In [None]:
flat_ids = list(itertools.chain(*grouping_ids_list))
flat_words = list(itertools.chain(*tokens_list))
flat_tags = list(itertools.chain(*tags_list))

df_out = pd.DataFrame({ "sentence_idx": flat_ids, "word": flat_words, "tag": flat_tags })

# Results
* Sentences are now grouped by their sentence id so that each sentence can be distinguished and recombined
* Each word now also has an appropriate entity tag
* Lastly, we can see below the distribution of all entities
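
To illustrate how the flat rows can be recombined, the sketch below groups `(sentence_idx, word, tag)` rows by sentence and merges `B-`/`I-` tags into entity spans. It uses plain Python and invented example rows; `group_sentences` and `extract_entities` are hypothetical helpers, not part of this notebook.

```python
from collections import defaultdict

# Invented rows in the same layout as the output file: (sentence_idx, word, tag)
example_rows = [
    (0, "Nike", "B-BRND"),
    (0, "shoes", "B-PROD"),
    (1, "Ralph", "B-BRND"),
    (1, "Lauren", "I-BRND"),
    (1, "shirt", "B-PROD"),
    (1, ".", "O"),
]

def group_sentences(rows):
    """Collect (word, tag) pairs per sentence id, preserving row order."""
    sentences = defaultdict(list)
    for sentence_idx, word, tag in rows:
        sentences[sentence_idx].append((word, tag))
    return dict(sentences)

def extract_entities(tagged):
    """Merge a B- tag and directly following I- tags of the same type into one span."""
    entities, current = [], None
    for word, tag in tagged:
        if tag.startswith("B-"):
            current = [tag[2:], [word]]
            entities.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            current[1].append(word)
        else:
            current = None  # an "O" tag or a mismatched I- tag closes the span
    return [(label, " ".join(words)) for label, words in entities]

grouped = group_sentences(example_rows)
print(extract_entities(grouped[1]))
```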

In [None]:
df_out.iloc[30:40]

    sentence_idx      word     tag
30             0   ordered       O
31             0      Nike  B-BRND
32             0       Air       O
33             0       Max  B-PERS
34             0       xxx       O
35             0     shoes  B-PROD
36             0         .       O
37             0       And       O
38             0  although       O
39             0       the       O


In [None]:
df_out.to_csv(OUT_PATH, index=False)

In [None]:
df_out["tag"].value_counts()

O         337992
B-PROD      7744
B-TIME      5245
B-PERS      2707
B-BRND      1927
B-MATR      1553
B-MISC       475
I-PERS       474
I-BRND       148
Name: tag, dtype: int64