## Installing and Using spaCy
Make sure that every package installed correctly and that the versions match. Installation details are in the spaCy usage guide: [spaCy 101: Everything you need to know](https://spacy.io/usage). This part can be frustrating, so validate the install with the code block below before moving on.

In [35]:
'''
Validate spaCy installs and model compatibility for the version used
'''
# Activate the venv for the models (not available through conda)
!python -m spacy validate
!echo %CUDA_PATH%
import spacy


✔ Loaded compatibility table

ℹ spaCy installation:
C:\Users\elvis\anaconda3\envs\jupyter\lib\site-packages\spacy

NAME              SPACY            VERSION    
en_core_web_lg    >=3.1.0,<3.2.0   3.1.0     ✔
en_core_web_sm    >=3.1.0,<3.2.0   3.1.0     ✔
en_core_web_trf   >=3.1.0,<3.2.0   3.1.0     ✔

C:\Users\elvis\anaconda3\envs\jupyter\Library
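
As a complement to `spacy validate`, a similar check can be sketched from inside Python using only the standard library. This is a minimal sketch and not part of the original notebook: the helper name `installed_version` is hypothetical, and it assumes the models were installed as regular pip packages whose names resolve as written in the table above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the validation table above
for name in ["spacy", "en_core_web_sm", "en_core_web_lg", "en_core_web_trf"]:
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

A missing package is reported rather than raising an error, so the cell can run before the models are downloaded.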


## Demo spaCy: NER and Visualization Basics
In this section, we check how well spaCy's pre-trained pipelines do what we need. First, let's see what the base NER can do, and then visualize it nicely with `displacy`.

Let's take a two-sentence text and run the entity recognizer from both the small and the large English models. The comparison teaches us a few things about these spaCy models and about how we might create our own.

In [36]:
# Text taken from Sonic Wikipedia page
texts = [
    "Sonic the Hedgehog is a Japanese video game series and media franchise created and owned by Sega.",
    "The franchise follows Sonic, an anthropomorphic blue hedgehog who battles the evil Doctor Eggman, a mad scientist.",
]

# Example using small English pre-trained model
print("Small Model:")
nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])

# Example using large English pre-trained model
print("Large Model:")
nlp = spacy.load("en_core_web_lg")
for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])

Small Model:
[('Hedgehog', 'PRODUCT'), ('Japanese', 'NORP'), ('Sega', 'PERSON')]
[('Sonic', 'ORG'), ('Eggman', 'PERSON')]
Large Model:
[('Japanese', 'NORP'), ('Sega', 'ORG')]
[('Sonic', 'ORG'), ('Eggman', 'PERSON')]


From the start, we see some issues with the way entities are found and labeled. The **small English model** labels *Hedgehog* as a PRODUCT, *Sega* as a PERSON rather than an ORG, and *Sonic* as an ORG even though it refers to the character. The **large English model**, however, correctly labels *Sega* as an ORG and does not label *Hedgehog* as a PRODUCT. *Sonic* is still mislabeled, possibly because *Sonic* is also the name of a fast-food chain that may have appeared in the training data.

Comparing the two, we can see that training a sufficiently large model will be very important for this project to work well!
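
The comparison between the two models can also be made mechanical. The following is a minimal sketch in plain Python, with a hypothetical helper `compare_labels`, that reports where two entity lists disagree. It uses the printed results for the first sentence rather than re-running the models.

```python
def compare_labels(first, second):
    """Return entities whose label differs, or that only one model found.

    Note: dict() keeps one label per entity text, so repeated mentions
    with different labels would collapse. That is fine for this small demo.
    """
    first_map, second_map = dict(first), dict(second)
    differences = {}
    for text in sorted(set(first_map) | set(second_map)):
        labels = (first_map.get(text), second_map.get(text))
        if labels[0] != labels[1]:
            differences[text] = labels
    return differences


# Entity lists copied from the printed output above (first sentence only)
small = [("Hedgehog", "PRODUCT"), ("Japanese", "NORP"), ("Sega", "PERSON")]
large = [("Japanese", "NORP"), ("Sega", "ORG")]

print(compare_labels(small, large))
# {'Hedgehog': ('PRODUCT', None), 'Sega': ('PERSON', 'ORG')}
```

A `None` in the tuple means that model did not tag the text at all.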

Next, let's create some pretty graphics representing the entity labeling using `displacy`! We're going to use a new, short sentence.

In [53]:
'''
Example with Visualizations
'''
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# Uncomment to inspect each entity's text, character offsets, and label
# for ent in doc.ents:
#     print(ent.text, ent.start_char, ent.end_char, ent.label_)

# TODO: we should define nice colors to reuse for all our entity types
colors = {"ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)"}
options = {"colors": colors}
doc.user_data["title"] = "Example Render of Entity Recognizer"
displacy.render(doc, style="ent", jupyter=True, options=options)

## Create Training Data
Now that we have run some smaller tests to see how spaCy works, let's compile our data into useful training data. That training data will then be used to build the model we need to handle new text. We're doing this with the **recipe** text we eventually want to use!

### Step 1: See What spaCy Already Labels
First, let's see how it works with text from a recipe. This will help us know how to build our model and how to format our dataset. With this, we can see what entities spaCy looks for when given a recipe.

In [65]:
'''
Example using Recipe Text
'''
import spacy
test = "add flour and stir to cook and absorb the oil, then add in 1 cup vegan\
        creamer or plant milk and vegan chicken bouillon paste/powder/cube, then whisk/stir\
        vigorously until the flour has cooked and thickened in the sauce (to make the roux).\
         add the grated vegan parm and stir to melt, then thin the sauce with either more\
        plant milk or the reserved cooking pasta water."
nlp = spacy.load("en_core_web_lg")
doc = nlp(test)
for ent in doc.ents:
    print(ent.text, ent.label_)

1 CARDINAL


It seems that spaCy only picks out cardinals from a text like this one. That's fine! In the next step, we're going to look at how to create our own entities and labels, heading towards the dataset we need.

### Step 2: Build a Small Training Dataset Example

We want our dataset to detect three things: **ingredients**, **quantities**, and **processes**. Let's label them manually in the example above, starting from a blank spaCy pipeline so that built-in labels such as CARDINAL are not applied. Otherwise, **quantities** that include numbers would be labeled as cardinals instead.

Examples of each label:
* **Ingredients**: Flour, sugar, apple sauce...
* **Quantities**: 1 cup, 1/4 tablespoon, 1 teaspoon...
* **Processes**: Mix, stir, fry...

In [70]:
'''
Create and visualize training text and entities.
'''

import spacy
from spacy import displacy

# Retype test text to make sure indent spacing does not show up
test = "add flour and stir to cook and absorb the oil, then add in 1 cup vegan creamer or plant milk and vegan chicken bouillon paste/powder/cube, then whisk/stir vigorously until the flour has cooked and thickened in the sauce (to make the roux). add the grated vegan parm and stir to melt, then thin the sauce with either more plant milk or the reserved cooking pasta water."
# To fill with corpus, body of data
corpus = []

# Blank English-based spacy nlp
nlp = spacy.blank("en")

#Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler") # entity ruler
nlp.add_pipe("sentencizer") # to split into sentences

#List of all entities in above example
patterns = [
                {"label": "INGREDIENT", "pattern": "flour"},
                {"label": "INGREDIENT", "pattern": "oil"},
                {"label": "INGREDIENT", "pattern": "vegan creamer"},
                {"label": "INGREDIENT", "pattern": "plant milk"},
                {"label": "INGREDIENT", "pattern": "vegan chicken bouillon"},
                {"label": "INGREDIENT", "pattern": "vegan parm"},
                {"label": "QUANTITY", "pattern": "1 cup"},
                {"label": "PROCESS", "pattern": "add"},
                {"label": "PROCESS", "pattern": "stir"},
                {"label": "PROCESS", "pattern": "whisk"},
                {"label": "PROCESS", "pattern": "thin"},
            ]

# Include these patterns in entity ruler
ruler.add_patterns(patterns)

# Create nlp and add to corpus
doc = nlp(test)
for sent in doc.sents:
    corpus.append(sent.text)

# Use print() to check this, but also visualized below
ents = [(ent.text,ent.label_) for ent in doc.ents]

# Make some pretty label colorings for each entity label
sentence_spans = list(doc.sents)
colors = {"INGREDIENT": "linear-gradient(90deg, #aa9cfc, #fc9ce7)",
          "PROCESS": "linear-gradient(90deg, #3fc9a2, #a2f7b5)",
          "QUANTITY": "linear-gradient(90deg, #787ff6, #4adede)"}
options = {"colors": colors}

# Render using displacy
displacy.render(sentence_spans, style="ent", jupyter=True, options=options)

We created our text, the `nlp` pipeline, and the patterns for the *EntityRuler*. Rendering with `displacy` and a few custom color gradients looks quite nice, and it also makes it easy to spot any issues by eye. Because the EntityRuler matches the exact words we listed, every label here is correct!

Next, we create a small dataset from this text. We aren't going to train on these two sentences right now (although we will later scrape the page they come from), but this is a simple way to show how spaCy training data is formatted and used.

In [71]:
'''
Create the training data following the code block above.
'''

TRAIN_DATA = [] # Empty data beforehand

# Iterate through sentences in body of data
for sentence in corpus:
    doc = nlp(sentence)
    entities = []
    # For every entity found...
    for ent in doc.ents:
        # Append the location and label of each entity in the text
        entities.append([ent.start_char, ent.end_char, ent.label_])
    TRAIN_DATA.append([sentence, {"entities": entities}])

print("\n Train Data formatted for use of spaCy: \n")
print(TRAIN_DATA)


 Train Data formatted for use of spaCy: 

[['add flour and stir to cook and absorb the oil, then add in 1 cup vegan creamer or plant milk and vegan chicken bouillon paste/powder/cube, then whisk/stir vigorously until the flour has cooked and thickened in the sauce (to make the roux).', {'entities': [[0, 3, 'PROCESS'], [4, 9, 'INGREDIENT'], [14, 18, 'PROCESS'], [42, 45, 'INGREDIENT'], [52, 55, 'PROCESS'], [59, 64, 'QUANTITY'], [65, 78, 'INGREDIENT'], [82, 92, 'INGREDIENT'], [97, 119, 'INGREDIENT'], [144, 149, 'PROCESS'], [150, 154, 'PROCESS'], [176, 181, 'INGREDIENT']]}], ['add the grated vegan parm and stir to melt, then thin the sauce with either more plant milk or the reserved cooking pasta water.', {'entities': [[0, 3, 'PROCESS'], [15, 25, 'INGREDIENT'], [30, 34, 'PROCESS'], [49, 53, 'PROCESS'], [81, 91, 'INGREDIENT']]}]]


This is how spaCy training data is formatted. The outer list holds one `[]` per sentence, and each of those contains the sentence text followed by a dictionary of entity locations (start and end character offsets) and their labels. Using this, spaCy can train a model starting from a blank English pipeline! This is what we're using for our **Recipe NER**.
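
To make the offsets concrete, here is a minimal sketch in plain Python (no spaCy needed) that slices the sentence with each `[start, end, label]` triple. The helper name `spans_as_text` is hypothetical, and the example is the second entry of the printed `TRAIN_DATA`.

```python
def spans_as_text(example):
    """Turn one [sentence, {"entities": [...]}] pair into (text, label) tuples."""
    sentence, annotations = example
    return [(sentence[start:end], label) for start, end, label in annotations["entities"]]


# Second training example, copied from the printed TRAIN_DATA above
example = [
    "add the grated vegan parm and stir to melt, then thin the sauce with "
    "either more plant milk or the reserved cooking pasta water.",
    {"entities": [[0, 3, "PROCESS"], [15, 25, "INGREDIENT"], [30, 34, "PROCESS"],
                  [49, 53, "PROCESS"], [81, 91, "INGREDIENT"]]},
]

for text, label in spans_as_text(example):
    print(f"{label:<10} {text!r}")
```

Each slice `sentence[start:end]` recovers exactly the labeled words, which is a quick way to sanity-check hand-edited training data.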

### Step 3: Build the Real Training Dataset

Above, we saw how to create training data for spaCy to use. Now let's go through how to do this for the real dataset.

In the **label_data** folder, we create three *json* files: `processes.json`, `quantities.json`, and `ingredients.json`.

In [61]:
'''
Create quantities
'''
import json

# create bases list for quantities
quantity_bases = ["tsp","teaspoon","tbsp","tablespoon","cup","pt","pint","qt","quart","gal","gallon","ml","milliliter","oz","ounce","g","gram"]
# add plural cases to list
quantity_bases = quantity_bases + [quantity_base+"s" for quantity_base in quantity_bases]
# we're combining these cases, because we never know when there'll be a typo :)

#print(quantity_bases)

# create a list of possible fractions to use
possible_fractions = ["1/2",
                      "1/3","2/3",
                      "1/4", "3/4",
                      "1/8", "5/8","7/8",
                      "1/16"]

# finally, complete the quantity list
quantities_list = []
frac_quantities_list = []
int_quantities_list = []
comb_quantities_list = []

LARGEST_QUANTITY = 800

for quantity_base in quantity_bases:
  # into case of each int portion
  for i in range(LARGEST_QUANTITY):
    int_quantity = f"{i+1} {quantity_base}"
    int_quantities_list.append(int_quantity)
  # into case of each fraction portion
  for frac in possible_fractions:
    portion_quantity = f"{frac} {quantity_base}"
    frac_quantities_list.append(portion_quantity)
    # add combined cases for each (ie 1 1/2 tsp)
    for i in range(LARGEST_QUANTITY):
      comb_quantity = f"{i+1} {frac} {quantity_base}"
      comb_quantities_list.append(comb_quantity)
# combine all cases
quantities_list = int_quantities_list+frac_quantities_list+comb_quantities_list
# sort after combining (sorting the empty list beforehand had no effect)
quantities_list.sort()

with open('./label_data/quantities.json', 'w', encoding='utf-8') as f:
  json.dump(quantities_list, f)
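
One way this list could later feed an entity ruler is to wrap each phrase in the same `{"label": ..., "pattern": ...}` dictionaries used in Step 2. The following is a minimal sketch under that assumption. The helper `to_ruler_patterns` is hypothetical, and a three-item JSON string stands in for the real `quantities.json`.

```python
import json


def to_ruler_patterns(phrases, label):
    """Wrap plain phrases in the {"label", "pattern"} dicts used in Step 2."""
    return [{"label": label, "pattern": phrase} for phrase in phrases]


# A tiny stand-in for the contents of quantities.json
sample_json = json.dumps(["1 cup", "1/2 tsp", "1 1/2 tbsp"])
quantity_patterns = to_ruler_patterns(json.loads(sample_json), "QUANTITY")
print(quantity_patterns[0])
# {'label': 'QUANTITY', 'pattern': '1 cup'}
```

With the settings above, the full list holds 34 × (800 + 9 + 7,200) = 272,306 strings: 34 singular and plural units, each with 800 whole numbers, 9 fractions, and 9 × 800 combined forms.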
