# **spaCy**

spaCy is a widely used open-source Python library for natural language processing (NLP).

We will use [spaCy](https://spacy.io/) to perform the following tasks:
1. Basic text processing and pattern matching
2. Building machine learning models with text
3. Representing text with word embeddings that numerically capture the meaning of words and documents

Calling nlp on a string returns a doc object, and there's a lot you can do with it.

**Tokenizing** <br>
Processing text with nlp returns a document object that contains tokens. A token is a unit of text in the document, such as an individual word or punctuation mark. spaCy splits contractions like "don't" into two tokens, "do" and "n't". You can see the tokens by iterating through the document.

**Text preprocessing** <br>
A few types of preprocessing can improve how we model text. The first is "lemmatizing." The "lemma" of a word is its base form. For example, "walk" is the lemma of the word "walking". So, when you lemmatize the word "walking", you convert it to "walk".

It's also common to remove stopwords. Stopwords are words that occur frequently in the language and don't contain much information. English stopwords include "the", "is", "and", "but", "not".

With a spaCy token, token.lemma_ returns the lemma, while token.is_stop returns a boolean True if the token is a stopword (and False otherwise). <br>
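
As a rough sketch of the stopword check described above, the snippet below filters a sentence down to its content words using token.is_stop and token.is_punct. It assumes a blank English pipeline from spacy.blank('en'), which needs no model download; the variable names are illustrative. Lemmas are left out on purpose, because token.lemma_ is generally only filled in when the pipeline includes a lemmatizer (for example, a trained model such as en_core_web_sm).

```python
import spacy

# Blank English pipeline: tokenizer plus language defaults (including the stopword list)
nlp = spacy.blank("en")
doc = nlp("Tea is healthy and calming, don't you think?")

# Keep tokens that are neither stopwords nor punctuation
content_words = [token.text for token in doc
                 if not token.is_stop and not token.is_punct]
print(content_words)
```

Dropping stopwords and punctuation this way leaves the words that carry most of the sentence's meaning, which is usually what a downstream model needs.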

**Pattern Matching**<br>
Another common NLP task is matching tokens or phrases within chunks of text or whole documents. You can do pattern matching with regular expressions, but spaCy's matching capabilities tend to be easier to use.

To match individual tokens, you create a Matcher. When you want to match a list of terms, it's easier and more efficient to use PhraseMatcher. For example, if you want to find where different smartphone models show up in some text, you can create patterns for the model names of interest.
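
The smartphone example above might look something like the following sketch. The sentence and the list of model names are made up for illustration, and the call matcher.add("PHONE_MODELS", patterns) assumes a spaCy version that accepts the patterns as a list (v3, or v2.2.2 and later).

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# attr="LOWER" compares lowercased token text, so matching ignores capitalization
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Hypothetical terms of interest, each converted to a Doc to serve as a pattern
terms = ["Galaxy Note", "iPhone 11", "iPhone XS", "Google Pixel"]
patterns = [nlp(term) for term in terms]
matcher.add("PHONE_MODELS", patterns)

# Made-up text to search
doc = nlp("The iPhone 11 has a sharper camera than the Google Pixel, "
          "but I still miss my old Galaxy Note.")

# Each match is (match_id, start, end); start and end are token positions
matches = sorted(matcher(doc), key=lambda m: m[1])
found = [doc[start:end].text for _, start, end in matches]
print(found)
```

Because the patterns are whole phrases rather than per-token rules, adding another model name only means appending a string to the list.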

In [5]:
import pandas as pd
import spacy
from spacy.matcher import PhraseMatcher
from collections import defaultdict

In [2]:
# Load in the data from JSON file
data = pd.read_json('restaurant.json')
data.head()

| | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 109 | lDJIaF4eYRF4F7g6Zb9euw | lb0QUR5bc4O-Am4hNq9ZGg | r5PLDU-4mSbde5XekTXSCA | 4 | 2 | 0 | 0 | I used to work food service and my manager at ... | 2013-01-27 17:54:54 |
| 1013 | vvIzf3pr8lTqE_AOsxmgaA | MAmijW4ooUzujkufYYLMeQ | r5PLDU-4mSbde5XekTXSCA | 4 | 0 | 0 | 0 | We have been trying Eggplant sandwiches all ov... | 2015-04-15 04:50:56 |
| 1204 | UF-JqzMczZ8vvp_4tPK3bQ | slfi6gf_qEYTXy90Sw93sg | r5PLDU-4mSbde5XekTXSCA | 5 | 1 | 0 | 0 | Amazing Steak and Cheese... Better than any Ph... | 2011-03-20 00:57:45 |
| 1251 | geUJGrKhXynxDC2uvERsLw | N_-UepOzAsuDQwOUtfRFGw | r5PLDU-4mSbde5XekTXSCA | 1 | 0 | 0 | 0 | Although I have been going to DeFalco's for ye... | 2018-07-17 01:48:23 |
| 1354 | aPctXPeZW3kDq36TRm-CqA | 139hD7gkZVzSvSzDPwhNNw | r5PLDU-4mSbde5XekTXSCA | 2 | 0 | 0 | 0 | Highs: Ambience, value, pizza and deserts. Thi... | 2018-01-21 10:52:58 |


In [3]:
menu = ["Cheese Steak", "Cheesesteak", "Steak and Cheese", "Italian Combo", "Tiramisu", "Cannoli",
        "Chicken Salad", "Chicken Spinach Salad", "Meatball", "Pizza", "Pizzas", "Spaghetti",
        "Bruchetta", "Eggplant", "Italian Beef", "Purista", "Pasta", "Calzones",  "Calzone",
        "Italian Sausage", "Chicken Cutlet", "Chicken Parm", "Chicken Parmesan", "Gnocchi",
        "Chicken Pesto", "Turkey Sandwich", "Turkey Breast", "Ziti", "Portobello", "Reuben",
        "Mozzarella Caprese",  "Corned Beef", "Garlic Bread", "Pastrami", "Roast Beef",
        "Tuna Salad", "Lasagna", "Artichoke Salad", "Fettuccini Alfredo", "Chicken Parmigiana",
        "Grilled Veggie", "Grilled Veggies", "Grilled Vegetable", "Mac and Cheese", "Macaroni",  
         "Prosciutto", "Salami"]

In [8]:
index_of_review_to_test_on = 14
text_to_test_on = data.text.iloc[index_of_review_to_test_on]

# Create a blank English pipeline (tokenizer only, no trained components)
nlp = spacy.blank('en')
review_doc = nlp(text_to_test_on)

In [9]:
nlp

<spacy.lang.en.English at 0x7f2e560d2160>

In [10]:
# Example: tokenize a sentence and print each token
doc = nlp("Tea is healthy and calming, don't you think?")
for token in doc:
    print(token)

Tea
is
healthy
and
calming
,
do
n't
you
think
?


In [11]:
# Create the PhraseMatcher object. The vocabulary (nlp.vocab) is the first argument.
# Use attr='LOWER' to match phrases case-insensitively
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

# Convert each menu item into a Doc object to use as a pattern
menu_tokens_list = [nlp(item) for item in menu]
menu_tokens_list

[Cheese Steak,
 Cheesesteak,
 Steak and Cheese,
 Italian Combo,
 Tiramisu,
 Cannoli,
 Chicken Salad,
 Chicken Spinach Salad,
 Meatball,
 Pizza,
 Pizzas,
 Spaghetti,
 Bruchetta,
 Eggplant,
 Italian Beef,
 Purista,
 Pasta,
 Calzones,
 Calzone,
 Italian Sausage,
 Chicken Cutlet,
 Chicken Parm,
 Chicken Parmesan,
 Gnocchi,
 Chicken Pesto,
 Turkey Sandwich,
 Turkey Breast,
 Ziti,
 Portobello,
 Reuben,
 Mozzarella Caprese,
 Corned Beef,
 Garlic Bread,
 Pastrami,
 Roast Beef,
 Tuna Salad,
 Lasagna,
 Artichoke Salad,
 Fettuccini Alfredo,
 Chicken Parmigiana,
 Grilled Veggie,
 Grilled Veggies,
 Grilled Vegetable,
 Mac and Cheese,
 Macaroni,
 Prosciutto,
 Salami]

In [None]:
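# Build the matcher and register every menu item under the "MENU" key
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
menu_tokens_list = [nlp(item) for item in menu]
# spaCy v3 takes the patterns as a list (the older v2 call was matcher.add("MENU", None, *menu_tokens_list))
matcher.add("MENU", menu_tokens_list)
# Each match is a (match_id, start, end) tuple of token positions in review_doc
matches = matcher(review_doc)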
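
To read the result, each match can be turned back into the span of text it covers. The sketch below shows one way to do that on a self-contained example: the shortened menu and the review sentence are made up for illustration (they are not rows from restaurant.json), and the list form of matcher.add again assumes spaCy v3 or v2.2.2 and later.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Shortened menu and a made-up review, for illustration only
sample_menu = ["Cheesesteak", "Garlic Bread", "Tiramisu"]
matcher.add("MENU", [nlp(item) for item in sample_menu])
sample_doc = nlp("The cheesesteak was great, and the garlic bread was even better.")

found_items = []
for match_id, start, end in matcher(sample_doc):
    # start and end are token indices; slicing the doc gives the matched span
    span = sample_doc[start:end]
    found_items.append((start, span.text))
    print(f"Token number {start}: {span.text}")
```

The same loop should work on review_doc and matches from the cell above; span.text preserves the capitalization used in the review, so lowercasing it is a reasonable step before counting items.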