# Overview of Design
Goal: Message + Intent --> Params

Named Entity Recognition (NER) is a Natural Language Processing (NLP) task in which a model learns to find and label the parts of a sentence that represent 'entities', such as names, dates, songs, places, and artists.
Here, NER identifies where the entities are in the user's message and what type each one is.
It works in conjunction with the intent model: the intent model determines what the user wants, and this model specifies what they are talking about.
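
As a rough illustration of the "Message + Intent --> Params" goal, the sketch below groups labeled character spans into a params dictionary. The helper name `spans_to_params`, the intent string `"play_music"`, and the hard-coded spans are all hypothetical stand-ins for what the two models would produce.

```python
def spans_to_params(text, entities):
    """Group the text of each labeled character span under its label."""
    params = {}
    for start, end, label in entities:
        params.setdefault(label, []).append(text[start:end])
    return params


message = "play hey jude by the beatles"

# Hypothetical NER output: (start_char, end_char, label) triples.
entities = [(5, 13, "song"), (17, 28, "artist")]

# Hypothetical intent label; the real one would come from the intent model.
request = {"intent": "play_music", "params": spans_to_params(message, entities)}
print(request)
```

Each label maps to a list because a single message may mention more than one entity of the same type.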

#### Potential Avenues
1. HuggingFace Transformers: May consider in the future, but seems like overkill for now given the GPU training involved
2. OpenAI/GPT: Requires an internet connection, which is not ideal
3. spaCy: Well suited to smaller datasets; this is the chosen library

#### NER with spaCy Steps
1. Tokenize the input (split into words)
2. The embedding layer converts each token (word) into a vector representation
3. The vectors feed into a neural network (convolutional / transition-based feature extractor --> feedforward layers for tagging each token --> trained via backpropagation)
4. Network outputs a label for each token
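
To make "a label for each token" concrete, here is a minimal sketch that turns character-offset annotations into per-token BIO tags (B = beginning of an entity, I = inside, O = outside). It is an illustration only: it splits on whitespace rather than using spaCy's tokenizer, the helper name `bio_tags` is made up for this example, and spaCy's transition-based NER does not literally emit these tags.

```python
def bio_tags(text, entities):
    """Return (token, tag) pairs using whitespace tokens and BIO tags."""
    tagged = []
    position = 0
    for token in text.split():
        # Locate the token in the original string to recover its offsets.
        start = text.index(token, position)
        end = start + len(token)
        position = end

        tag = "O"
        for ent_start, ent_end, label in entities:
            # A token belongs to an entity if it lies fully inside the span.
            if ent_start <= start and end <= ent_end:
                tag = ("B-" if start == ent_start else "I-") + label
                break
        tagged.append((token, tag))
    return tagged


print(bio_tags("play something by luis fonsi", [(18, 28, "artist")]))
```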

In [23]:
import pandas as pd
import numpy as np

In [24]:
df = pd.read_csv("Annotated_Intent_Dataset.csv")
print(df.head())

                  input_sentence                         annotations
0   Play something by Luis Fonsi  {'entities': [(18, 28, 'artist')]}
1    I want to hear Shape of You    {'entities': [(15, 27, 'song')]}
2  Put on some Bohemian Rhapsody    {'entities': [(12, 29, 'song')]}
3                 Play Despacito     {'entities': [(5, 14, 'song')]}
4            Start playing Queen  {'entities': [(14, 19, 'artist')]}


In [25]:
import ast

TRAIN_DATA = []
for _, row in df.iterrows():
    text = row["input_sentence"].lower()
    # ast.literal_eval safely parses a string that holds a Python literal (here, a dict);
    # lowercasing the string first also lowercases the entity labels
    annotations = ast.literal_eval(row["annotations"].lower())
    TRAIN_DATA.append((text, annotations))

TRAIN_DATA[:5]

[('play something by luis fonsi', {'entities': [(18, 28, 'artist')]}),
 ('i want to hear shape of you', {'entities': [(15, 27, 'song')]}),
 ('put on some bohemian rhapsody', {'entities': [(12, 29, 'song')]}),
 ('play despacito', {'entities': [(5, 14, 'song')]}),
 ('start playing queen', {'entities': [(14, 19, 'artist')]})]
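
Because the annotations are character offsets, a quick check that each span slices out the intended text can catch off-by-one errors before training. It also guards against the rare case where lowercasing changes a string's length (possible for some non-ASCII characters). The sketch below is one way to do this; `span_texts` and `SAMPLE` are hypothetical names, and the sample rows are copied from the output above.

```python
def span_texts(train_data):
    """Return (label, substring) for every span, raising on suspicious offsets."""
    result = []
    for text, annotations in train_data:
        for start, end, label in annotations["entities"]:
            if not (0 <= start < end <= len(text)):
                raise ValueError(f"span ({start}, {end}) is out of range for {text!r}")
            snippet = text[start:end]
            # Leading or trailing whitespace usually means the offsets are shifted.
            if snippet != snippet.strip():
                raise ValueError(f"span ({start}, {end}) in {text!r} has stray whitespace")
            result.append((label, snippet))
    return result


SAMPLE = [
    ("play something by luis fonsi", {"entities": [(18, 28, "artist")]}),
    ("play despacito", {"entities": [(5, 14, "song")]}),
]

for label, snippet in span_texts(SAMPLE):
    print(label, "->", snippet)
```

In the notebook, the same check could be run on the full `TRAIN_DATA` list instead of `SAMPLE`.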

In [26]:
import spacy  # NLP library
from spacy.training.example import Example
from random import shuffle

# Step 1: Create a blank English model
nlp = spacy.blank("en")

# Step 2: Add the Named Entity Recognizer (NER) component (this is what gets trained to find the params)
ner = nlp.add_pipe("ner")

# Step 3: Add your custom labels to the NER component
for _, annotations in TRAIN_DATA:
    for ent in annotations["entities"]:
        ner.add_label(ent[2])

# Step 4: Initialize the model weights (in spaCy v3, nlp.initialize() replaces the deprecated nlp.begin_training())
nlp.initialize()

# Step 5: Train the model
for i in range(20):  # 20 epochs
    shuffle(TRAIN_DATA)  # shuffle data each time
    losses = {}
    for text, annotations in TRAIN_DATA:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)
        nlp.update([example], losses=losses)
    print(f"Loss after epoch {i+1}: {losses['ner']:.4f}")


Loss after epoch 1: 239.2330
Loss after epoch 2: 60.9828
Loss after epoch 3: 19.0806
Loss after epoch 4: 6.6742
Loss after epoch 5: 22.4596
Loss after epoch 6: 6.0620
Loss after epoch 7: 0.0624
Loss after epoch 8: 4.1384
Loss after epoch 9: 32.0226
Loss after epoch 10: 36.4206
Loss after epoch 11: 4.5083
Loss after epoch 12: 4.2346
Loss after epoch 13: 0.0000
Loss after epoch 14: 0.0000
Loss after epoch 15: 0.0000
Loss after epoch 16: 0.0000
Loss after epoch 17: 0.0000
Loss after epoch 18: 0.0000
Loss after epoch 19: 0.0000
Loss after epoch 20: 0.0000


In [None]:
# Test the trained model
doc = nlp("Play hey jude by the beatles")
for ent in doc.ents:
    print(ent.label_, ent.text)


song hey jude
artist the beatles


In [6]:
# Note: joblib/pickle is not a good fit here. A spaCy pipeline bundles a vocabulary, config, and component weights, which is more than the simple Python objects joblib is designed for (NumPy arrays, sklearn estimators, etc.), so spaCy's own to_disk() is used instead

nlp.to_disk("param_classifier")
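
A directory written by `to_disk` can be read back with `spacy.load`. The sketch below shows the round trip under a few assumptions: it uses an untrained stand-in pipeline with the two labels from this notebook, and it writes to a temporary directory rather than the real `param_classifier` folder.

```python
import tempfile
from pathlib import Path

import spacy

# Stand-in for the trained pipeline: a blank English model with an NER
# component and the two labels used in this notebook (weights are untrained).
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for label in ("song", "artist"):
    ner.add_label(label)
nlp.initialize()

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "param_classifier"
    nlp.to_disk(model_dir)          # writes config, vocab, and component data
    loaded = spacy.load(model_dir)  # rebuilds the pipeline from that directory

print(loaded.pipe_names)
print(loaded.get_pipe("ner").labels)
```

In the application itself, calling `spacy.load("param_classifier")` on the saved directory should be enough to restore the trained model.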
