# Overview of Design
Goal: Message + Intent --> Params

Named Entity Recognition (NER) is a Natural Language Processing (NLP) task in which a model learns to find and label the parts of a sentence that represent entities, such as names, dates, songs, places, and artists.
Here, NER identifies where the entities are in the user's message and what type each one is.
It works alongside the intent model: the intent model determines what the user wants, and this model determines what they are talking about.

#### Potential Avenues
1. HuggingFace Transformers: worth considering in the future, but seems like overkill for now given the GPU training involved
2. OpenAI/GPT: requires an internet connection, which is not ideal
3. spaCy: well suited to smaller datasets; this is the option that was chosen

#### NER with spaCy Steps
1. Tokenize the input (split it into words)
2. An embedding layer converts each token (word) into a vector representation
3. The vectors feed into a neural network (convolutional / transition-based feature extractor --> feedforward layers for tagging each token --> trained via backpropagation)
4. The network outputs a label for each token
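
To make step 4 concrete, the sketch below shows how per-token labels can be merged into entity spans. It is only an illustration: the whitespace tokenizer, the BIO-style tag names, and the `bio_to_spans` helper are assumptions made for this example and are not spaCy's internals, whose tokenizer and tagging scheme are more elaborate.

```python
def bio_to_spans(tokens, tags):
    """Merge per-token BIO tags into (label, text) entity spans."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that was still open
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the currently open entity
            current_tokens.append(token)
        else:
            # "O" (outside) or an inconsistent tag: close the open entity
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


tokens = "play godzilla by eminem".split()  # naive whitespace tokenization
tags = ["O", "B-song", "O", "B-artist"]     # hand-written tags, not model output
print(bio_to_spans(tokens, tags))
# [('song', 'godzilla'), ('artist', 'eminem')]
```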

In [87]:
import pandas as pd
import numpy as np
# import os
# os.chdir("src/intent_model/")

In [93]:
df_shuffled = pd.read_csv("NER_Intent_Dataset.csv")
# df_shuffled = df.sample(frac=1).reset_index(drop=True)
print(df_shuffled.head())

                           input_sentence  \
0    Start The Beatles's Someone Like You   
1                             Make it max   
2  Alarm for 6:30 with note call with mom   
3        Delete the morning workout alarm   
4   Turn on the track Sweet Child O' Mine   

                                         annotations  
0  {"entities": [(20, 36, "song"), (6, 17, "artis...  
1        {"entities": [(8, 11, "percent_of_total")]}  
2  {"entities": [(10, 14, "time"), (25, 38, "labe...  
3                  {"entities": [(11, 26, "label")]}  
4                   {"entities": [(18, 37, "song")]}  


In [95]:
import ast

TRAIN_DATA = []

for _, row in df_shuffled.iterrows():
    text = row["input_sentence"].lower()
    # ast.literal_eval safely parses a string that holds a Python literal (here, a dict)
    annotations = ast.literal_eval(row["annotations"].lower())

    TRAIN_DATA.append((text, annotations))

(TRAIN_DATA[:5])


[("start the beatles's someone like you",
  {'entities': [(20, 36, 'song'), (6, 17, 'artist')]}),
 ('make it max', {'entities': [(8, 11, 'percent_of_total')]}),
 ('alarm for 6:30 with note call with mom',
  {'entities': [(10, 14, 'time'), (25, 38, 'label')]}),
 ('delete the morning workout alarm', {'entities': [(11, 26, 'label')]}),
 ("turn on the track sweet child o' mine", {'entities': [(18, 37, 'song')]})]
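
Each annotation is a `(start, end, label)` triple of character offsets into the sentence, with `end` exclusive. As a quick sanity check, the sketch below slices the first training sentence with its offsets; the `spans_from_offsets` helper is a hypothetical name introduced for this example. Lowercasing does not shift the offsets here because `str.lower()` keeps ASCII text the same length.

```python
def spans_from_offsets(text, annotations):
    """Return (label, substring) for each (start, end, label) annotation."""
    return [(label, text[start:end]) for start, end, label in annotations["entities"]]


# First entry of TRAIN_DATA as printed above
sample = ("start the beatles's someone like you",
          {"entities": [(20, 36, "song"), (6, 17, "artist")]})

for label, span in spans_from_offsets(*sample):
    print(label, "->", repr(span))
# song -> 'someone like you'
# artist -> 'the beatles'
```

Running the same check over every row is a cheap way to catch offsets that cut a word in half before training.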

In [96]:
import spacy  # NLP library
from spacy.training.example import Example
from random import shuffle

# Step 1: Create a blank English model
nlp = spacy.blank("en")

# Step 2: Add the Named Entity Recognizer (NER) component (this is what gets trained to find the params)
ner = nlp.add_pipe("ner")

# Step 3: Add your custom labels to the NER component
for _, annotations in TRAIN_DATA:
    for ent in annotations["entities"]:
        ner.add_label(ent[2])

# Step 4: Initialize the model weights (nlp.begin_training() is the deprecated name for this in spaCy v3)
nlp.initialize()

# Step 5: Train the model
for i in range(20):  # 20 epochs
    shuffle(TRAIN_DATA)  # shuffle data each time
    losses = {}
    for text, annotations in TRAIN_DATA:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)
        nlp.update([example], losses=losses)

    print(f"Loss after epoch {i+1}: {losses['ner']:.4f}")


Loss after epoch 1: 503.2669
Loss after epoch 2: 25.0873
Loss after epoch 3: 12.2763
Loss after epoch 4: 11.0867
Loss after epoch 5: 0.6833
Loss after epoch 6: 118.1117
Loss after epoch 7: 48.2729
Loss after epoch 8: 14.2991
Loss after epoch 9: 14.9271
Loss after epoch 10: 20.2093
Loss after epoch 11: 8.7728
Loss after epoch 12: 16.2256
Loss after epoch 13: 50.6962
Loss after epoch 14: 103.4880
Loss after epoch 15: 15.7139
Loss after epoch 16: 7.7269
Loss after epoch 17: 5.0962
Loss after epoch 18: 4.0778
Loss after epoch 19: 31.7038
Loss after epoch 20: 36.7922


In [101]:
# Test helper: prints each entity the model finds as "<label> <text>"
def test(test_str):
    output = nlp(test_str)
    for ent in output.ents:
        print(ent.label_, ent.text)
    print("-------------------")

test("set an alarm for 8am tomorrow for breakfast")
test("wake me up tomorrow at 9am for running")
test("play godzilla by eminem")
test("play godzilla")




time 8am
label breakfast
-------------------
song tomorrow at
time 9am
label running
-------------------
song godzilla
artist eminem
-------------------
song godzilla
-------------------
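
Since the goal is Message + Intent --> Params, the predicted entities eventually need to be collected into a parameter structure. The following is a minimal sketch of one way to do that, not the project's actual implementation: `ents_to_params` is a hypothetical helper, and the `Ent` namedtuple is a stand-in for spaCy's entity spans (which expose `.label_` and `.text`) so the example runs without a trained model.

```python
from collections import namedtuple

# Stand-in for spaCy entity spans, which expose .label_ and .text
Ent = namedtuple("Ent", ["label_", "text"])


def ents_to_params(ents):
    """Collect entities into {label: [text, ...]}, keeping repeated labels."""
    params = {}
    for ent in ents:
        params.setdefault(ent.label_, []).append(ent.text)
    return params


# Entities taken from the first test output above
ents = [Ent("time", "8am"), Ent("label", "breakfast")]
print(ents_to_params(ents))
# {'time': ['8am'], 'label': ['breakfast']}
```

With the trained pipeline, the same helper could be called as `ents_to_params(nlp(text).ents)`. Values are kept in lists so that a message with two entities of the same type does not silently drop one.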


In [99]:
# Note: joblib/pkl is not used here because a spaCy pipeline is much more complex than the simple Python objects joblib is designed for (NumPy arrays, sklearn estimators, etc.); spaCy's own to_disk is used instead

nlp.to_disk("param_classifier")  # Save


