## Installing and Using spaCy
Make sure every package is installed correctly and that the model versions match your spaCy version. Installation details are in the spaCy usage guide: [spaCy 101: Everything you need to know](https://spacy.io/usage). This step can be frustrating, so validate the install with the code cell below before moving on.

In [5]:
'''
Validate spaCy installs and compatibility for the version used
'''
# Activate the venv for the models (not available through conda)
!python -m spacy validate
!echo %CUDA_PATH%
import spacy


✔ Loaded compatibility table
ℹ spaCy installation:
C:\Users\elvis\anaconda3\envs\jupyter\lib\site-packages\spacy

NAME              SPACY            VERSION    
en_core_web_lg    >=3.1.0,<3.2.0   3.1.0     ✔
en_core_web_sm    >=3.1.0,<3.2.0   3.1.0     ✔
en_core_web_trf   >=3.1.0,<3.2.0   3.1.0     ✔

C:\Users\elvis\anaconda3\envs\jupyter\Library
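
The check above can also be scripted from Python, which is handy when the notebook kernel and the shell might point at different environments. The following is a minimal sketch using only the standard library. The helper name `installed_version` is hypothetical, and the package names are simply the ones listed in the validate output above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the validate output above
for name in ["spacy", "en_core_web_sm", "en_core_web_lg", "en_core_web_trf"]:
    print(f"{name:<18} {installed_version(name) or 'not installed'}")
```

If a model shows up here but `spacy validate` does not list it, the kernel is probably running in a different environment than the shell.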


## Demo spaCy: NER and Visualization Basics
This section explores how to get spaCy to do what we need. First, let's see what the pretrained NER component can do out of the box, then visualize the result with `displacy`.

In [6]:
texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])

[('$9.4 million', 'MONEY'), ('the prior year', 'DATE'), ('$2.7 million', 'MONEY')]
[('twelve billion dollars', 'MONEY'), ('1b', 'MONEY')]
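
The cell above skips every component except the entity recognizer by passing `disable` to `nlp.pipe`. As a rough sketch of the same idea, `nlp.select_pipes` can switch components off temporarily inside a `with` block. To keep this example runnable without downloading a model, it uses a blank pipeline and a hand-written `MONEY` rule as a stand-in for the pretrained NER. That rule is an assumption for illustration only, not what `en_core_web_sm` does internally.

```python
import spacy

# Blank pipeline so the sketch needs no pretrained model
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")

# Illustrative rule: "$" + number + "million"/"billion"
ruler.add_patterns([
    {
        "label": "MONEY",
        "pattern": [
            {"ORTH": "$"},
            {"LIKE_NUM": True},
            {"LOWER": {"IN": ["million", "billion"]}},
        ],
    }
])

texts = ["Net income was $9.4 million compared to the prior year of $2.7 million."]

# Only the entity ruler runs inside this block; the sentencizer is skipped
with nlp.select_pipes(enable=["entity_ruler"]):
    active = list(nlp.pipe_names)
    docs = list(nlp.pipe(texts))

print(active)
print([(ent.text, ent.label_) for ent in docs[0].ents])
# No sentence boundaries were set because the sentencizer did not run
print(docs[0].has_annotation("SENT_START"))
```

Once the `with` block exits, the full pipeline is restored, so later cells are unaffected.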


In [7]:
'''
Example with Visualizations
'''
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

# TODO: we should define nice colors to reuse for all our entity types
colors = {"ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)"}
options = {"colors": colors}
doc.user_data["title"] = "Example Render of Entity Recognizer"
displacy.render(doc, style="ent", jupyter=True, options=options)
#displacy.render(doc, style="dep") # dependency parse

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
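
One way to address the TODO about reusable colors is to define the palette once and route every render through a small helper. This is only a sketch: `ENTITY_COLORS` and `render_entities` are hypothetical names, and the color assignments are arbitrary. It uses displaCy's manual mode with the character offsets printed above, so it runs without loading a model.

```python
from spacy import displacy

# Hypothetical shared palette: one gradient per entity label
ENTITY_COLORS = {
    "ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)",
    "GPE": "linear-gradient(90deg, #3fc9a2, #a2f7b5)",
    "MONEY": "linear-gradient(90deg, #787ff6, #4adede)",
}


def render_entities(doc_or_dict, manual=False):
    """Return displaCy entity HTML that always uses the shared palette."""
    options = {"ents": list(ENTITY_COLORS), "colors": ENTITY_COLORS}
    return displacy.render(
        doc_or_dict, style="ent", options=options, manual=manual, jupyter=False
    )


# Manual input built from the offsets printed by the cell above
example = {
    "text": "Apple is looking at buying U.K. startup for $1 billion",
    "ents": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 27, "end": 31, "label": "GPE"},
        {"start": 44, "end": 54, "label": "MONEY"},
    ],
    "title": "Example Render of Entity Recognizer",
}
html = render_entities(example, manual=True)
print(html[:80])
```

With `jupyter=False` the helper returns the markup as a string. Inside a notebook, passing `jupyter=True` to `displacy.render` displays it inline instead.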


## Create Training Data
Now that a few small tests have shown how spaCy works, let's compile the datasets into usable training data. That data will then be used to build what we need to handle new text.

### Step 1: See What spaCy Already Labels
First, let's see how spaCy handles text from a recipe. This shows which entities the pretrained model looks for in a recipe, and it will guide how we build our model and format our dataset.

In [None]:
'''
Example using Recipe Text
'''
import spacy
test = "add flour and stir to cook and absorb the oil, then add in 1 cup vegan\
        creamer or plant milk and vegan chicken bouillon paste/powder/cube, then whisk/stir\
        vigorously until the flour has cooked and thickened in the sauce (to make the roux).\
         add the grated vegan parm and stir to melt, then thin the sauce with either more\
        plant milk or the reserved cooking pasta water."
nlp = spacy.load("en_core_web_lg")
doc = nlp(test)
for ent in doc.ents:
    print(ent.text, ent.label_)

1 CARDINAL


For text like this one, spaCy picks out only a cardinal number. That's fine! In the next step, we'll look at how to create our own entities and labels, working toward the dataset we need.

### Step 2: Build a Small Training Dataset Example

We want our dataset to capture three things: **ingredients**, **quantities**, and **processes**. Let's label them manually in the example above, starting from a blank spaCy pipeline so that pretrained labels such as `CARDINAL` are not applied. Otherwise, **quantities** that include numbers would be labeled as cardinals instead.

Examples of each label:
* **Ingredients**: Flour, sugar, apple sauce...
* **Quantities**: 1 cup, 1/4 tablespoon, 1 teaspoon...
* **Processes**: Mix, stir, fry...

In [17]:
'''
Create and visualize training text and entities.
'''

import spacy
from spacy import displacy

# Retype test text to make sure indent spacing does not show up
test = "add flour and stir to cook and absorb the oil, then add in 1 cup vegan creamer or plant milk and vegan chicken bouillon paste/powder/cube, then whisk/stir vigorously until the flour has cooked and thickened in the sauce (to make the roux). add the grated vegan parm and stir to melt, then thin the sauce with either more plant milk or the reserved cooking pasta water."
# Will be filled with the corpus: one sentence per entry
corpus = []

# Blank English spaCy pipeline (no pretrained components)
nlp = spacy.blank("en")

# Create the EntityRuler, then add a sentencizer to split sentences
ruler = nlp.add_pipe("entity_ruler")
nlp.add_pipe("sentencizer")

# List of all entities in the above example
patterns = [
                {"label": "INGREDIENT", "pattern": "flour"},
                {"label": "INGREDIENT", "pattern": "oil"},
                {"label": "INGREDIENT", "pattern": "vegan creamer"},
                {"label": "INGREDIENT", "pattern": "plant milk"},
                {"label": "INGREDIENT", "pattern": "vegan chicken bouillon"},
                {"label": "INGREDIENT", "pattern": "vegan parm"},
                {"label": "QUANTITY", "pattern": "1 cup"},
                {"label": "PROCESS", "pattern": "add"},
                {"label": "PROCESS", "pattern": "stir"},
                {"label": "PROCESS", "pattern": "whisk"},
                {"label": "PROCESS", "pattern": "thin"},
            ]

ruler.add_patterns(patterns)

doc = nlp(test)
for sent in doc.sents:
    corpus.append(sent.text)

ents = [(ent.text,ent.label_) for ent in doc.ents]

# Make some pretty label colorings for each entity label
sentence_spans = list(doc.sents)
colors = {"INGREDIENT": "linear-gradient(90deg, #aa9cfc, #fc9ce7)",
          "PROCESS": "linear-gradient(90deg, #3fc9a2, #a2f7b5)",
          "QUANTITY": "linear-gradient(90deg, #787ff6, #4adede)"}
options = {"colors": colors}

# Render using displacy
displacy.render(sentence_spans, style="ent", jupyter=True, options=options)
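
The string patterns above match exact text, so "Flour" at the start of a sentence would not match the pattern "flour". A possible workaround, sketched below, is to use token patterns on the `LOWER` attribute. The helper `lower_pattern` is hypothetical, and it assumes each whitespace-separated word is a single token, which does not hold for phrases such as "1/4 tablespoon".

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")


def lower_pattern(label, phrase):
    """Build a token pattern that matches `phrase` regardless of letter case."""
    return {
        "label": label,
        "pattern": [{"LOWER": word} for word in phrase.lower().split()],
    }


ruler.add_patterns([
    lower_pattern("INGREDIENT", "flour"),
    lower_pattern("INGREDIENT", "plant milk"),
    lower_pattern("PROCESS", "add"),
])

doc = nlp("Add Flour, then add Plant Milk.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

The matched spans keep their original casing, so the character offsets still line up with the source text.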

In [20]:
'''
Create the training data following the code block above.
'''

TRAIN_DATA = [] # Empty data beforehand

# Iterate through sentences in body of data
for sentence in corpus:
    doc = nlp(sentence)
    entities = []
    # For every entity found...
    for ent in doc.ents:
        # Append the character offsets and label of each entity in the text
        entities.append([ent.start_char, ent.end_char, ent.label_])
    TRAIN_DATA.append([sentence, {"entities": entities}])

print("\n Train Data formatted for use of spaCy: \n")
print(TRAIN_DATA)


 Train Data formatted for use of spaCy: 

[['add flour and stir to cook and absorb the oil, then add in 1 cup vegan creamer or plant milk and vegan chicken bouillon paste/powder/cube, then whisk/stir vigorously until the flour has cooked and thickened in the sauce (to make the roux).', {'entities': [[0, 3, 'PROCESS'], [4, 9, 'INGREDIENT'], [14, 18, 'PROCESS'], [42, 45, 'INGREDIENT'], [52, 55, 'PROCESS'], [59, 64, 'QUANTITY'], [65, 78, 'INGREDIENT'], [82, 92, 'INGREDIENT'], [97, 119, 'INGREDIENT'], [144, 149, 'PROCESS'], [150, 154, 'PROCESS'], [176, 181, 'INGREDIENT']]}], ['add the grated vegan parm and stir to melt, then thin the sauce with either more plant milk or the reserved cooking pasta water.', {'entities': [[0, 3, 'PROCESS'], [15, 25, 'INGREDIENT'], [30, 34, 'PROCESS'], [49, 53, 'PROCESS'], [81, 91, 'INGREDIENT']]}]]
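
A possible next step, assuming the data is headed for spaCy v3 training, is to pack these offset annotations into a `DocBin`. The sketch below is not part of the original notebook: `to_docbin` is a hypothetical helper and the output path is a placeholder. It also acts as a sanity check, because `doc.char_span` returns `None` when offsets do not line up with token boundaries.

```python
import spacy
from spacy.tokens import DocBin


def to_docbin(train_data, nlp):
    """Convert [text, {"entities": [[start, end, label], ...]}] items to a DocBin."""
    doc_bin = DocBin()
    for text, annotations in train_data:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annotations["entities"]:
            span = doc.char_span(start, end, label=label)
            if span is None:
                # Offsets that cut through a token cannot become an entity span
                raise ValueError(
                    f"Misaligned entity {text[start:end]!r} at {start}-{end}"
                )
            spans.append(span)
        doc.ents = spans
        doc_bin.add(doc)
    return doc_bin


# One shortened sentence in the same format as TRAIN_DATA above
sample = [
    [
        "add the grated vegan parm and stir to melt",
        {"entities": [[0, 3, "PROCESS"], [15, 25, "INGREDIENT"], [30, 34, "PROCESS"]]},
    ]
]
nlp = spacy.blank("en")
doc_bin = to_docbin(sample, nlp)
print(len(doc_bin))
# doc_bin.to_disk("train.spacy")  # placeholder path, uncomment to save
```

Calling `to_docbin(TRAIN_DATA, nlp)` on the full list works the same way, since each item already has the `[text, {"entities": ...}]` shape.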
