# **spaCy**

spaCy is a leading open-source library for natural language processing (NLP), and it has quickly become one of the most popular Python frameworks for working with text.

We will use [spaCy](https://spacy.io/) to perform the following tasks:
1. Basic text processing and pattern matching
2. Building machine learning models with text
3. Representing text with word embeddings that numerically capture the meaning of words and documents

Text is processed by passing it to a spaCy language object, conventionally named nlp, which returns a doc object. There's a lot you can do with that doc object.

**Tokenizing** <br>
The doc object contains tokens. A token is a unit of text in the document, such as an individual word or punctuation mark. spaCy splits contractions like "don't" into two tokens, "do" and "n't". You can see the tokens by iterating through the document.<br>

**Text preprocessing** <br>
There are a few types of preprocessing to improve how we model with words. The first is "lemmatizing." The "lemma" of a word is its base form. For example, "walk" is the lemma of the word "walking". So, when you lemmatize the word walking, you would convert it to walk.

It's also common to remove stopwords. Stopwords are words that occur frequently in the language and don't contain much information. English stopwords include "the", "is", "and", "but", "not".

With a spaCy token, token.lemma_ returns the lemma, while token.is_stop returns a boolean True if the token is a stopword (and False otherwise). <br>
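As a rough illustration of these token attributes, the sketch below filters stopwords out of a short sentence. It assumes a blank English pipeline (tokenizer only), so no trained model needs to be downloaded; the sentence is made up. With a blank pipeline there is no lemmatizer, so token.lemma_ may come back empty, and a trained pipeline is needed for real lemmas.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), so no model download is needed.
nlp = spacy.blank("en")
doc = nlp("the tea is healthy and calming")

# Keep only the tokens that are not stopwords.
content_words = [token.text for token in doc if not token.is_stop]
print(content_words)

for token in doc:
    # lemma_ may be an empty string here, since a blank pipeline has no lemmatizer.
    print(f"{token.text:10} lemma={token.lemma_!r:12} stopword={token.is_stop}")
```

Stopword flags are lexical attributes, which is why they are available even without a trained model.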

**Pattern Matching**<br>
Another common NLP task is matching tokens or phrases within chunks of text or whole documents. You can do pattern matching with regular expressions, but spaCy's matching capabilities tend to be easier to use.

To match individual tokens, you create a Matcher. When you want to match a list of terms, it's easier and more efficient to use PhraseMatcher. For example, if you want to find where different smartphone models show up in some text, you can create patterns for the model names of interest.
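The smartphone example could look something like the sketch below. The list of model names and the sample sentence are hypothetical, and the add call uses the spaCy v3 signature (a key followed by a list of pattern docs).

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# attr="LOWER" compares lowercased token text, so matching ignores case.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Hypothetical list of model names to look for.
terms = ["Galaxy Note", "iPhone 11", "Google Pixel"]
patterns = [nlp(term) for term in terms]

# spaCy v3 signature: add(key, list_of_docs).
matcher.add("PHONES", patterns)

doc = nlp("I compared the iPhone 11 with the Galaxy Note and the google pixel.")

# Each match is a (match_id, start_token, end_token) tuple; sort by start position.
matches = sorted(matcher(doc), key=lambda m: m[1])
found = [doc[start:end].text for _, start, end in matches]
print(found)
```

Slicing the doc with the start and end token indices recovers the matched text as it appears in the document, including its original capitalization.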

In [1]:
import pandas as pd
import spacy
from spacy.matcher import PhraseMatcher
from collections import defaultdict

In [2]:
# Load in the data from JSON file
data = pd.read_json('restaurant.json')
data.head()

| Unnamed: 0 | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 109 | lDJIaF4eYRF4F7g6Zb9euw | lb0QUR5bc4O-Am4hNq9ZGg | r5PLDU-4mSbde5XekTXSCA | 4 | 2 | 0 | 0 | I used to work food service and my manager at ... | 2013-01-27 17:54:54 |
| 1013 | vvIzf3pr8lTqE_AOsxmgaA | MAmijW4ooUzujkufYYLMeQ | r5PLDU-4mSbde5XekTXSCA | 4 | 0 | 0 | 0 | We have been trying Eggplant sandwiches all ov... | 2015-04-15 04:50:56 |
| 1204 | UF-JqzMczZ8vvp_4tPK3bQ | slfi6gf_qEYTXy90Sw93sg | r5PLDU-4mSbde5XekTXSCA | 5 | 1 | 0 | 0 | Amazing Steak and Cheese... Better than any Ph... | 2011-03-20 00:57:45 |
| 1251 | geUJGrKhXynxDC2uvERsLw | N_-UepOzAsuDQwOUtfRFGw | r5PLDU-4mSbde5XekTXSCA | 1 | 0 | 0 | 0 | Although I have been going to DeFalco's for ye... | 2018-07-17 01:48:23 |
| 1354 | aPctXPeZW3kDq36TRm-CqA | 139hD7gkZVzSvSzDPwhNNw | r5PLDU-4mSbde5XekTXSCA | 2 | 0 | 0 | 0 | Highs: Ambience, value, pizza and deserts. Thi... | 2018-01-21 10:52:58 |


In [3]:
menu = ["Cheese Steak", "Cheesesteak", "Steak and Cheese", "Italian Combo", "Tiramisu", "Cannoli",
        "Chicken Salad", "Chicken Spinach Salad", "Meatball", "Pizza", "Pizzas", "Spaghetti",
        "Bruchetta", "Eggplant", "Italian Beef", "Purista", "Pasta", "Calzones",  "Calzone",
        "Italian Sausage", "Chicken Cutlet", "Chicken Parm", "Chicken Parmesan", "Gnocchi",
        "Chicken Pesto", "Turkey Sandwich", "Turkey Breast", "Ziti", "Portobello", "Reuben",
        "Mozzarella Caprese",  "Corned Beef", "Garlic Bread", "Pastrami", "Roast Beef",
        "Tuna Salad", "Lasagna", "Artichoke Salad", "Fettuccini Alfredo", "Chicken Parmigiana",
        "Grilled Veggie", "Grilled Veggies", "Grilled Vegetable", "Mac and Cheese", "Macaroni",
        "Prosciutto", "Salami"]

In [4]:
index_of_review_to_test_on = 14
text_to_test_on = data.text.iloc[index_of_review_to_test_on]

# Create a blank English pipeline (tokenizer only, no trained components)
nlp = spacy.blank('en')
review_doc = nlp(text_to_test_on)

In [7]:
text_to_test_on

"The Il Purista sandwich has become a staple of my life. Mozzarella, basil, prosciutto, roasted red peppers and balsamic vinaigrette blend into a front runner for the best sandwich in the valley. Goes great with sparkling water or a beer. \n\nDeFalco's also has other Italian fare such as a delicious meatball sub and classic pastas."

In [5]:
nlp

<spacy.lang.en.English at 0x7ff74cba00b8>

In [6]:
review_doc

The Il Purista sandwich has become a staple of my life. Mozzarella, basil, prosciutto, roasted red peppers and balsamic vinaigrette blend into a front runner for the best sandwich in the valley. Goes great with sparkling water or a beer. 

DeFalco's also has other Italian fare such as a delicious meatball sub and classic pastas.

In [8]:
# Example
doc = nlp("Tea is healthy and calming, don't you think?")
for token in doc:
    print(token)

Tea
is
healthy
and
calming
,
do
n't
you
think
?


In [10]:
for t in review_doc:
    print(t)

The
Il
Purista
sandwich
has
become
a
staple
of
my
life
.
Mozzarella
,
basil
,
prosciutto
,
roasted
red
peppers
and
balsamic
vinaigrette
blend
into
a
front
runner
for
the
best
sandwich
in
the
valley
.
Goes
great
with
sparkling
water
or
a
beer
.



DeFalco
's
also
has
other
Italian
fare
such
as
a
delicious
meatball
sub
and
classic
pastas
.


In [11]:
# Create the PhraseMatcher object. The shared vocabulary (nlp.vocab) is the first argument.
# Use attr='LOWER' to compare lowercased text, so matching ignores capitalization
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

# Convert each menu item into a Doc object to use as a pattern
menu_tokens_list = [nlp(item) for item in menu]
menu_tokens_list

[Cheese Steak,
 Cheesesteak,
 Steak and Cheese,
 Italian Combo,
 Tiramisu,
 Cannoli,
 Chicken Salad,
 Chicken Spinach Salad,
 Meatball,
 Pizza,
 Pizzas,
 Spaghetti,
 Bruchetta,
 Eggplant,
 Italian Beef,
 Purista,
 Pasta,
 Calzones,
 Calzone,
 Italian Sausage,
 Chicken Cutlet,
 Chicken Parm,
 Chicken Parmesan,
 Gnocchi,
 Chicken Pesto,
 Turkey Sandwich,
 Turkey Breast,
 Ziti,
 Portobello,
 Reuben,
 Mozzarella Caprese,
 Corned Beef,
 Garlic Bread,
 Pastrami,
 Roast Beef,
 Tuna Salad,
 Lasagna,
 Artichoke Salad,
 Fettuccini Alfredo,
 Chicken Parmigiana,
 Grilled Veggie,
 Grilled Veggies,
 Grilled Vegetable,
 Mac and Cheese,
 Macaroni,
 Prosciutto,
 Salami]

In [17]:
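# Add the item patterns to the matcher.
# "MENU" is just a name for the set of rules we're matching against.
# Note: this is the spaCy v2 call signature; spaCy v3 expects
# matcher.add("MENU", menu_tokens_list) instead.
matcher.add("MENU", None, *menu_tokens_list)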

# Find matches in the review_doc
matches = matcher(review_doc)
matches

[(8291075388056826051, 2, 3),
 (8291075388056826051, 16, 17),
 (8291075388056826051, 58, 59)]

In [18]:
for match in matches:
    print(f"Token number {match[1]}: {review_doc[match[1]:match[2]]}")

Token number 2: Purista
Token number 16: prosciutto
Token number 58: meatball


In [19]:
# item_ratings is a dictionary of lists. If a key doesn't exist in item_ratings,
# the key is added with an empty list as the value.
item_ratings = defaultdict(list)
item_ratings

defaultdict(list, {})

In [20]:
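for idx, review in data.iterrows():
    doc = nlp(review.text)
    matches = matcher(doc)

    # Collect the matched spans. Spans at different positions are distinct set
    # members, so an item mentioned twice in one review contributes two ratings.
    found_items = set([doc[match[1]:match[2]] for match in matches])

    # Record the review's star rating under each matched item (lowercased text)
    for item in found_items:
        item_ratings[str(item).lower()].append(review.stars)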
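One caveat with the loop above: a set of spans does not merge repeated mentions, because spans at different positions are distinct even when their text is identical. If each review should count at most once per item, a sketch like the following collects lowercased strings instead. The two sample reviews, their star ratings, and the two-item menu are made up for illustration, and the add call assumes the spaCy v3 signature.

```python
from collections import defaultdict

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("MENU", [nlp("pizza"), nlp("calzone")])

# Made-up (text, stars) pairs standing in for rows of the reviews DataFrame.
reviews = [
    ("Great pizza. Seriously, the best Pizza in town!", 5),
    ("The calzone was cold.", 2),
]

item_ratings = defaultdict(list)
for text, stars in reviews:
    doc = nlp(text)
    # A set of lowercased strings collapses repeated mentions within one review.
    found_items = {doc[start:end].text.lower() for _, start, end in matcher(doc)}
    for item in found_items:
        item_ratings[item].append(stars)

print(dict(item_ratings))
```

Whether repeated mentions should count once or several times is a modelling choice; counting once per review keeps a single enthusiastic or angry reviewer from weighing more heavily on an item's average.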

In [22]:
item_ratings

defaultdict(list,
            {'artichoke salad': [5, 5, 5, 5, 5],
             'calzone': [3,
              5,
              5,
              4,
              4,
              4,
              5,
              5,
              5,
              5,
              3,
              3,
              5,
              5,
              4,
              4,
              4,
              4,
              5,
              5,
              5,
              5,
              5,
              4,
              5,
              5,
              3,
              5,
              5,
              5,
              4,
              5,
              4,
              4,
              5,
              5,
              5,
              5,
              5,
              5,
              5,
              5,
              1,
              1,
              1,
              1,
              4,
              4,
              4,
              3,
              4,
              4,
              5,
              5,
    

In [23]:
type(item_ratings)

collections.defaultdict

In [24]:
# Calculate the mean ratings for each menu item as a dictionary
mean_ratings = {item: sum(ratings)/len(ratings) for item, ratings in item_ratings.items()}
mean_ratings

{'artichoke salad': 5.0,
 'calzone': 4.263636363636364,
 'calzones': 4.552631578947368,
 'cannoli': 4.337078651685394,
 'cheese steak': 4.454545454545454,
 'cheesesteak': 4.335616438356165,
 'chicken cutlet': 3.5454545454545454,
 'chicken parm': 4.155172413793103,
 'chicken parmesan': 4.238095238095238,
 'chicken parmigiana': 4.444444444444445,
 'chicken pesto': 4.566666666666666,
 'chicken salad': 4.666666666666667,
 'chicken spinach salad': 4.5,
 'corned beef': 5.0,
 'eggplant': 3.968421052631579,
 'fettuccini alfredo': 5.0,
 'garlic bread': 4.021739130434782,
 'gnocchi': 4.488888888888889,
 'grilled veggie': 4.5,
 'italian beef': 4.0,
 'italian combo': 3.909090909090909,
 'italian sausage': 4.2105263157894735,
 'lasagna': 4.409638554216867,
 'mac and cheese': 4.444444444444445,
 'macaroni': 4.166666666666667,
 'meatball': 4.079754601226994,
 'pasta': 4.392156862745098,
 'pastrami': 4.6875,
 'pizza': 4.304469273743017,
 'pizzas': 4.393939393939394,
 'portobello': 4.111111111111111,
 

In [25]:
# Find the worst-rated item and store its name as a string in worst_item.
worst_item = sorted(mean_ratings, key=mean_ratings.get)[0]
worst_item

'chicken cutlet'

In [26]:
# Print the worst item along with its average rating.
print(worst_item)
print(mean_ratings[worst_item])

chicken cutlet
3.5454545454545454


In [27]:
# calculate the number of reviews for each item
counts = {item: len(ratings) for item, ratings in item_ratings.items()}

item_counts = sorted(counts, key=counts.get, reverse=True)
for item in item_counts:
    print(f"{item:>25}{counts[item]:>5}")

                    pizza  358
                    pasta  255
                 meatball  163
              cheesesteak  146
                  calzone  110
                 eggplant   95
                  cannoli   89
             cheese steak   88
                  lasagna   83
                  purista   67
               prosciutto   63
             chicken parm   58
          italian sausage   57
             garlic bread   46
                  gnocchi   45
                spaghetti   41
                 calzones   38
                   pizzas   33
                   salami   32
            chicken pesto   30
             italian beef   29
                 tiramisu   27
                     ziti   26
            italian combo   22
         chicken parmesan   21
       chicken parmigiana   18
           mac and cheese   18
               portobello   18
                 pastrami   16
           chicken cutlet   11
         steak and cheese    9
               roast beef    7
       f

In [28]:
# print the 10 best and 10 worst rated items
sorted_ratings = sorted(mean_ratings, key=mean_ratings.get)

print("Worst rated menu items:")
for item in sorted_ratings[:10]:
    print(f"{item:20} Ave rating: {mean_ratings[item]:.2f} \tcount: {counts[item]}")
    
print("\n\nBest rated menu items:")
for item in sorted_ratings[-10:]:
    print(f"{item:20} Ave rating: {mean_ratings[item]:.2f} \tcount: {counts[item]}")

Worst rated menu items:
chicken cutlet       Ave rating: 3.55 	count: 11
turkey sandwich      Ave rating: 3.80 	count: 5
spaghetti            Ave rating: 3.85 	count: 41
italian combo        Ave rating: 3.91 	count: 22
eggplant             Ave rating: 3.97 	count: 95
italian beef         Ave rating: 4.00 	count: 29
tuna salad           Ave rating: 4.00 	count: 5
garlic bread         Ave rating: 4.02 	count: 46
meatball             Ave rating: 4.08 	count: 163
portobello           Ave rating: 4.11 	count: 18


Best rated menu items:
prosciutto           Ave rating: 4.62 	count: 63
purista              Ave rating: 4.64 	count: 67
chicken salad        Ave rating: 4.67 	count: 6
pastrami             Ave rating: 4.69 	count: 16
reuben               Ave rating: 4.80 	count: 5
steak and cheese     Ave rating: 4.89 	count: 9
artichoke salad      Ave rating: 5.00 	count: 5
fettuccini alfredo   Ave rating: 5.00 	count: 6
turkey breast        Ave rating: 5.00 	count: 1
corned beef          Ave ra

The less data you have for any specific item, the less you can trust that the average rating reflects the "real" sentiment of the customers. This is fairly common sense: if more people tell you the same thing, you're more likely to believe it. For example, the perfect 5.00 average for turkey breast above comes from a single review. It's also mathematically sound: as the number of data points n increases, the standard error of the mean decreases as 1 / sqrt(n).
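To make the 1 / sqrt(n) behaviour concrete, here is a small sketch using only the standard library. The ratings are made up, and standard_error is a hypothetical helper (sample standard deviation divided by sqrt(n)), not something taken from the notebook above.

```python
import math
import statistics


def standard_error(ratings):
    """Sample standard deviation divided by sqrt(n)."""
    return statistics.stdev(ratings) / math.sqrt(len(ratings))


# Made-up ratings with the same mix of 4s and 5s but different sample sizes.
few_reviews = [4, 5, 4, 5]
many_reviews = few_reviews * 25  # 100 ratings

se_few = standard_error(few_reviews)
se_many = standard_error(many_reviews)
print(f"n={len(few_reviews):>3}: standard error = {se_few:.3f}")
print(f"n={len(many_reviews):>3}: standard error = {se_many:.3f}")
```

Both samples have the same mean of 4.5, but the larger one pins that mean down far more tightly, which is why an item's review count is worth checking before trusting its average.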