In [1]:
import spacy
#!python3 -m spacy download en_core_web_sm

## Doc container in spaCy
### The first step of a spaCy text processing pipeline is to convert a text string into a Doc container, which stores the processed text. In this exercise, you'll practice loading a spaCy model, creating an nlp object, and processing a provided text string into a Doc container.

### The en_core_web_sm model is already downloaded.

### Instructions
-    Load en_core_web_sm and create an nlp object.
-    Create a doc container of the text string.
-    Create a list containing the text of each token in the doc container.

In [2]:
text = 'NLP is becoming increasingly popular for providing business solutions.'

In [3]:
# Load en_core_web_sm and create an nlp object
nlp = spacy.load("en_core_web_sm")

# Create a Doc container for the text object
doc = nlp(text)

# Create a list containing the text of each token in the Doc container
print([token.text for token in doc])

['NLP', 'is', 'becoming', 'increasingly', 'popular', 'for', 'providing', 'business', 'solutions', '.']
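
A Doc container behaves like a sequence of tokens, which the following minimal sketch illustrates. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) rather than en_core_web_sm so that it runs without a model download; the variable names are illustrative only.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) stands in for en_core_web_sm.
nlp = spacy.blank("en")
doc = nlp("NLP is becoming increasingly popular for providing business solutions.")

num_tokens = len(doc)     # a Doc has a length: its number of tokens
first_token = doc[0]      # integer indexing returns a Token
middle_span = doc[1:3]    # slicing returns a Span, a view into the Doc

print(num_tokens)
print(first_token.text, "|", middle_span.text)

# The original string can be rebuilt from the tokens and their trailing whitespace.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == doc.text)
```

Because the Doc keeps each token's trailing whitespace, tokenization does not lose any of the original text.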


## Tokenization with spaCy
### In this exercise, you'll practice tokenizing text. You'll use the first review from the Amazon Fine Food Reviews dataset for this exercise. You can access this review by using the text object provided.

### The en_core_web_sm model is already loaded for you. You can access it by calling nlp(). You can use list comprehension to compile output lists.

### Instructions
-    Store the Doc container for the pre-loaded review in a document object.
-    Store the text of every token of the document in the variable first_text_tokens and review it.

In [4]:
text = 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.'

In [5]:
# Create a Doc container of the given text
document = nlp(text)
    
# Store and review the token text values of tokens for the Doc container
first_text_tokens = [token.text for token in document]
print("First text tokens:\n", first_text_tokens, "\n")

First text tokens:
 ['I', 'have', 'bought', 'several', 'of', 'the', 'Vitality', 'canned', 'dog', 'food', 'products', 'and', 'have', 'found', 'them', 'all', 'to', 'be', 'of', 'good', 'quality', '.', 'The', 'product', 'looks', 'more', 'like', 'a', 'stew', 'than', 'a', 'processed', 'meat', 'and', 'it', 'smells', 'better', '.', 'My', 'Labrador', 'is', 'finicky', 'and', 'she', 'appreciates', 'this', 'product', 'better', 'than', ' ', 'most', '.'] 
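
Note the `' '` entry near the end of the output: the review contains a double space before "most", and the tokenizer keeps the extra space as its own token. A minimal sketch of filtering such tokens with `token.is_space`, assuming a blank English pipeline (tokenizer only) in place of en_core_web_sm:

```python
import spacy

# Assumption: a blank English pipeline is enough here, since only the tokenizer is used.
nlp = spacy.blank("en")
doc = nlp("she appreciates this product better than  most.")

# All tokens, including the token created by the extra space
all_tokens = [token.text for token in doc]

# Tokens that are not pure whitespace
visible_tokens = [token.text for token in doc if not token.is_space]

print(all_tokens)
print(visible_tokens)
```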



## Running a spaCy pipeline
### You've already run a spaCy NLP pipeline on a single piece of text and extracted the tokens from the resulting Doc container. In this exercise, you'll practice the initial steps of running a spaCy pipeline on texts, a list of text strings.

### You will use the en_core_web_sm model for this purpose. The spaCy package has already been imported for you.

### Instructions
-    Load the en_core_web_sm model as nlp.
-    Run the nlp model on each item of texts, and append each corresponding Doc container to a documents list.
-    Print the token texts for each Doc container of the documents list.

In [6]:
texts = ['A loaded spaCy model can be used to compile documents list!',
 'Tokenization is the first step in any spacy pipeline.']

In [7]:
# Load en_core_web_sm model as nlp
nlp = spacy.load('en_core_web_sm')

# Run an nlp model on each item of texts and append the Doc container to documents
documents = []
for text in texts:
  documents.append(nlp(text))
  
# Print the token texts for each Doc container
for doc in documents:
  print([token.text for token in doc])

['A', 'loaded', 'spaCy', 'model', 'can', 'be', 'used', 'to', 'compile', 'documents', 'list', '!']
['Tokenization', 'is', 'the', 'first', 'step', 'in', 'any', 'spacy', 'pipeline', '.']
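
As an alternative to the explicit loop, spaCy also provides `nlp.pipe()`, which processes an iterable of texts as a stream and yields Doc containers in order. The sketch below is not part of the exercise solution; it assumes a blank English pipeline so that no model download is required.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) stands in for en_core_web_sm.
nlp = spacy.blank("en")

texts = ['A loaded spaCy model can be used to compile documents list!',
         'Tokenization is the first step in any spacy pipeline.']

# nlp.pipe returns a generator of Doc containers; list() materializes it
documents = list(nlp.pipe(texts))

for doc in documents:
    print([token.text for token in doc])
```

For a handful of short strings the difference is negligible, but batching through `nlp.pipe()` tends to pay off on larger collections.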


## Lemmatization with spaCy
### In this exercise, you will practice lemmatization. Lemmatization reduces inflected or derived words to their root form (the lemma). Every token still receives exactly one lemma, so for any sentence we expect the number of distinct lemmas to be less than or equal to the number of distinct tokens.

### The first Amazon food review is provided for you in a string called text. en_core_web_sm is loaded as nlp, and has been run on the text to compile document, a Doc container for the text string.

### tokens, a list containing the tokens of the text, is also already loaded for your use.

### Instructions
-    Collect the lemma of every token in the document in a list, then print the list of lemmas.
-    Print the tokens list and observe the differences between tokens and lemmas.

In [8]:
text = 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.'

In [9]:
document = nlp(text)
tokens = [token.text for token in document]

# Append the lemma for all tokens in the document
lemmas = [token.lemma_ for token in document]
print("Lemmas:\n", lemmas, "\n")

# Print tokens and compare with lemmas list
print("Tokens:\n", tokens)

Lemmas:
 ['I', 'have', 'buy', 'several', 'of', 'the', 'Vitality', 'can', 'dog', 'food', 'product', 'and', 'have', 'find', 'they', 'all', 'to', 'be', 'of', 'good', 'quality', '.', 'the', 'product', 'look', 'more', 'like', 'a', 'stew', 'than', 'a', 'process', 'meat', 'and', 'it', 'smell', 'well', '.', 'my', 'Labrador', 'be', 'finicky', 'and', 'she', 'appreciate', 'this', 'product', 'well', 'than', ' ', 'most', '.'] 

Tokens:
 ['I', 'have', 'bought', 'several', 'of', 'the', 'Vitality', 'canned', 'dog', 'food', 'products', 'and', 'have', 'found', 'them', 'all', 'to', 'be', 'of', 'good', 'quality', '.', 'The', 'product', 'looks', 'more', 'like', 'a', 'stew', 'than', 'a', 'processed', 'meat', 'and', 'it', 'smells', 'better', '.', 'My', 'Labrador', 'is', 'finicky', 'and', 'she', 'appreciates', 'this', 'product', 'better', 'than', ' ', 'most', '.']
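
To make the comparison concrete, the sketch below pairs each token with its lemma and counts distinct values. It uses plain Python on the two lists printed above (copied verbatim), so it needs neither spaCy nor a model; the names `changed`, `distinct_tokens` and `distinct_lemmas` are illustrative.

```python
# Token and lemma lists copied from the output above
tokens = ['I', 'have', 'bought', 'several', 'of', 'the', 'Vitality', 'canned', 'dog', 'food', 'products', 'and', 'have', 'found', 'them', 'all', 'to', 'be', 'of', 'good', 'quality', '.', 'The', 'product', 'looks', 'more', 'like', 'a', 'stew', 'than', 'a', 'processed', 'meat', 'and', 'it', 'smells', 'better', '.', 'My', 'Labrador', 'is', 'finicky', 'and', 'she', 'appreciates', 'this', 'product', 'better', 'than', ' ', 'most', '.']
lemmas = ['I', 'have', 'buy', 'several', 'of', 'the', 'Vitality', 'can', 'dog', 'food', 'product', 'and', 'have', 'find', 'they', 'all', 'to', 'be', 'of', 'good', 'quality', '.', 'the', 'product', 'look', 'more', 'like', 'a', 'stew', 'than', 'a', 'process', 'meat', 'and', 'it', 'smell', 'well', '.', 'my', 'Labrador', 'be', 'finicky', 'and', 'she', 'appreciate', 'this', 'product', 'well', 'than', ' ', 'most', '.']

# Pairs where lemmatization changed the surface form
changed = [(tok, lem) for tok, lem in zip(tokens, lemmas) if tok != lem]
print(changed[:5])

# The lists have the same length, but several tokens collapse onto one lemma
distinct_tokens = len(set(tokens))
distinct_lemmas = len(set(lemmas))
print(len(tokens) == len(lemmas), distinct_lemmas <= distinct_tokens)
```

Forms such as "products"/"product" and "is"/"be" map to a single lemma, which is why the distinct count shrinks while the list length stays the same.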


## Sentence segmentation with spaCy
### In this exercise, you will practice sentence segmentation. In NLP, segmenting a document into its sentences is a useful basic operation. It is one of the first steps in many NLP tasks that are more elaborate, such as detecting named entities. Additionally, capturing the number of sentences may provide some insight into the amount of information provided by the text.

### You can access ten food reviews in the list called texts.

### The en_core_web_sm model has already been loaded for you as nlp.

### Instructions
-    Run the spaCy model on each item in the texts list to compile documents, a list of all Doc containers.
-    Extract the sentences of each doc container by iterating through the documents list, and append them to a list called sentences.
-    Count the number of sentences in each doc container using the sentences list.

In [10]:
texts = ['I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.',
 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".',
 'This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.',
 'If you are looking for the secret ingredient in Robitussin I believe I have found it.  I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda.  The flavor is very medicinal.',
 'Great taffy at a great price.  There was a wide assortment of yummy taffy.  Delivery was very quick.  If your a taffy lover, this is a deal.',
 'I got a wild hair for taffy and ordered this five pound bag. The taffy was all very enjoyable with many flavors: watermelon, root beer, melon, peppermint, grape, etc. My only complaint is there was a bit too much red/black licorice-flavored pieces (just not my particular favorites). Between me, my kids, and my husband, this lasted only two weeks! I would recommend this brand of taffy -- it was a delightful treat.',
 "This saltwater taffy had great flavors and was very soft and chewy.  Each candy was individually wrapped well.  None of the candies were stuck together, which did happen in the expensive version, Fralinger's.  Would highly recommend this candy!  I served it at a beach-themed party and everyone loved it!",
 'This taffy is so good.  It is very soft and chewy.  The flavors are amazing.  I would definitely recommend you buying it.  Very satisfying!!',
 "Right now I'm mostly just sprouting this so my cats can eat the grass. They love it. I rotate it around with Wheatgrass and Rye too",
 'This is a very healthy dog food. Good for their digestion. Also good for small puppies. My dog eats her required amount at every feeding.']

In [11]:
# Generating a documents list of all Doc containers
documents = [nlp(text) for text in texts]

# Iterate through documents and append sentences in each doc to the sentences list
sentences = []
for doc in documents:
  sentences.append(list(doc.sents))

# Find the number of sentences in each Doc container
print([len(s) for s in sentences])

[3, 2, 7, 3, 4, 5, 5, 5, 3, 4]
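
For comparison, sentence boundaries can also be set without a trained model. The sketch below assumes a blank English pipeline with spaCy's rule-based `sentencizer` component, which splits on sentence-final punctuation. This is a swapped-in technique, not what en_core_web_sm does in the cell above, so counts on other texts may differ from the model's.

```python
import spacy

# Assumption: blank pipeline + rule-based sentencizer instead of en_core_web_sm.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = ('This is a very healthy dog food. Good for their digestion. '
        'Also good for small puppies. My dog eats her required amount at every feeding.')
doc = nlp(text)

# doc.sents is a generator of Span objects, one per sentence
sentence_texts = [sent.text for sent in doc.sents]
for sentence in sentence_texts:
    print(sentence)
```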


## POS tagging with spaCy
### In this exercise, you will practice POS tagging. POS tagging is a useful tool in NLP because it lets algorithms analyze the grammatical structure of a sentence and disambiguate words that have multiple meanings, such as watch and play, which can be either a noun or a verb.

### For this exercise, en_core_web_sm has been loaded for you as nlp. Three comments from the Airline Travel Information System (ATIS) dataset have been provided for you in a list called texts.

### Instructions
-    Compile documents, a list of Doc containers for the texts in the texts list, using a list comprehension.
-    For each doc container, print each token's text and its corresponding POS tag by iterating through documents and tokens of each doc container using a nested for loop.

In [12]:
texts = ['What is the arrival time in San francisco for the 7:55 AM flight leaving Washington?',
 'Cheapest airfare from Tacoma to Orlando is 650 dollars.',
 'Round trip fares from Pittsburgh to Philadelphia are under 1000 dollars!']

In [13]:
# Compile a list of all Doc containers of texts
documents = [nlp(text) for text in texts]

# Print token texts and POS tags for each Doc container
for doc in documents:
    for token in doc:
        print("Text: ", token.text, "| POS tag: ", token.pos_)
    print("\n")

Text:  What | POS tag:  PRON
Text:  is | POS tag:  AUX
Text:  the | POS tag:  DET
Text:  arrival | POS tag:  NOUN
Text:  time | POS tag:  NOUN
Text:  in | POS tag:  ADP
Text:  San | POS tag:  PROPN
Text:  francisco | POS tag:  PROPN
Text:  for | POS tag:  ADP
Text:  the | POS tag:  DET
Text:  7:55 | POS tag:  NUM
Text:  AM | POS tag:  PROPN
Text:  flight | POS tag:  NOUN
Text:  leaving | POS tag:  VERB
Text:  Washington | POS tag:  PROPN
Text:  ? | POS tag:  PUNCT


Text:  Cheapest | POS tag:  ADJ
Text:  airfare | POS tag:  NOUN
Text:  from | POS tag:  ADP
Text:  Tacoma | POS tag:  PROPN
Text:  to | POS tag:  ADP
Text:  Orlando | POS tag:  PROPN
Text:  is | POS tag:  AUX
Text:  650 | POS tag:  NUM
Text:  dollars | POS tag:  NOUN
Text:  . | POS tag:  PUNCT


Text:  Round | POS tag:  ADJ
Text:  trip | POS tag:  NOUN
Text:  fares | POS tag:  NOUN
Text:  from | POS tag:  ADP
Text:  Pittsburgh | POS tag:  PROPN
Text:  to | POS tag:  ADP
Text:  Philadelphia | POS tag:  PROPN
Text:  are | POS
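
The tag abbreviations printed above can be looked up with `spacy.explain()`, which returns a short description for a tag or label. A minimal sketch; the `tags` list simply collects tags that appear in the output above.

```python
import spacy

# POS tags that appear in the output above
tags = ["PRON", "AUX", "DET", "NOUN", "ADP", "PROPN", "NUM", "VERB", "ADJ", "PUNCT"]

# spacy.explain works from a built-in glossary, so no model needs to be loaded
descriptions = {tag: spacy.explain(tag) for tag in tags}
for tag, description in descriptions.items():
    print(f"{tag:<6}{description}")
```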

## NER with spaCy
### Named entity recognition (NER) helps you to easily identify key elements of a given document, like names of people and places. It helps sort unstructured data and detect important information, which is crucial if you are dealing with large datasets. In this exercise, you will practice Named Entity Recognition.

### en_core_web_sm has been loaded for you as nlp. Three comments from the Airline Travel Information System (ATIS) dataset have been provided for you in a list called texts.

### Instructions
-    Compile documents, a list of all Doc containers for each text in the texts using list comprehension.
-    For each doc container, print each entity's text and corresponding label by iterating through doc.ents.
-    Print the text and entity type of the sixth token of the second Doc container.

In [14]:
texts = ['I want to fly from Boston at 8:38 am and arrive in Denver at 11:10 in the morning',
 'What flights are available from Pittsburgh to Baltimore on Thursday morning?',
 'What is the arrival time in San francisco for the 7:55 AM flight leaving Washington?']

In [15]:
# Compile a list of all Doc containers of texts
documents = [nlp(text) for text in texts]

# Print the entity text and label for the entities in each document
for doc in documents:
    print([(ent.text, ent.label_) for ent in doc.ents])
    
# Print the 6th token's text and entity type of the second document
print("\nText:", documents[1][5].text, "| Entity type: ", documents[1][5].ent_type_)

[('Boston', 'GPE'), ('8:38 am', 'TIME'), ('Denver', 'GPE'), ('11:10 in the morning', 'TIME')]
[('Pittsburgh', 'GPE'), ('Baltimore', 'GPE'), ('Thursday', 'DATE'), ('morning', 'TIME')]
[('San francisco', 'GPE'), ('7:55 AM', 'TIME'), ('Washington', 'GPE')]

Text: Pittsburgh | Entity type:  GPE
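
The output above shows two views of the same annotation: `doc.ents` yields entity spans, while `token.ent_type_` exposes the entity label on each token. The sketch below illustrates that relationship under a stated assumption: it uses a blank English pipeline and annotates one entity by hand with `Span`, so the label is not a model prediction.

```python
import spacy
from spacy.tokens import Span

# Assumption: blank pipeline; the entity below is set manually, not predicted.
nlp = spacy.blank("en")
doc = nlp("What flights are available from Pittsburgh to Baltimore on Thursday morning?")

# Tokens 5 to 6 (end exclusive) cover "Pittsburgh"
doc.ents = [Span(doc, 5, 6, label="GPE")]

# Span-level view
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)

# Token-level view: tokens outside an entity have an empty ent_type_
for token in doc[4:7]:
    print(token.text, "|", repr(token.ent_type_), "|", token.ent_iob_)
```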


## Text processing with spaCy
### Every NLP application consists of several text processing steps. You have already learned some of these steps, including tokenization, lemmatization, sentence segmentation and named entity recognition.

### spaCy NLP Pipeline 
<img src='./images/spacy_convention.jpeg' />

### In this exercise, you'll continue to practice with text processing steps in spaCy, such as breaking the text into sentences and extracting named entities. You will use the first five reviews from the Amazon Fine Food Reviews dataset for this exercise. You can access these reviews by using the texts object.

### The en_core_web_sm model has already been loaded for you to use, and you can access it by using nlp. The list of Doc containers for each item in texts is also pre-loaded and accessible at documents.
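
A pipeline is an ordered list of components applied to each Doc after tokenization. The following sketch shows the idea by assembling a small rule-based pipeline by hand. It is an illustration under assumptions, not the composition of en_core_web_sm: the `sentencizer` and `entity_ruler` components and the `PRODUCT` pattern for "taffy" are chosen here purely as examples.

```python
import spacy

# Assumption: start from a blank English pipeline and add two rule-based components.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PRODUCT", "pattern": "taffy"}])  # hypothetical pattern

# Components run in the order listed here
pipe_names = nlp.pipe_names
print(pipe_names)

doc = nlp("Great taffy at a great price.  There was a wide assortment of yummy taffy.  "
          "Delivery was very quick.  If your a taffy lover, this is a deal.")

num_sentences = len(list(doc.sents))
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(num_sentences, entities)
```

With a loaded model, `nlp.pipe_names` can be inspected in the same way to see which trained components produce the annotations used in this exercise.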

### Instructions 1/2
-    Create sentences, a list of lists of all sentences in each doc container in documents, using a list comprehension.
-    Print num_sentences, a list containing the number of sentences for each doc container, by using the len() function.

In [16]:
documents = [nlp("I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most."),
 nlp("Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as \"Jumbo\"."),
 nlp("This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' \"The Lion, The Witch, and The Wardrobe\" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch."),
 nlp("If you are looking for the secret ingredient in Robitussin I believe I have found it.  I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda.  The flavor is very medicinal."),
 nlp("Great taffy at a great price.  There was a wide assortment of yummy taffy.  Delivery was very quick.  If your a taffy lover, this is a deal.")]

In [17]:
# Create a list to store sentences of each Doc container in documents
sentences = [[sent for sent in doc.sents] for doc in documents]

# Print number of sentences in each Doc container in documents
num_sentences = [len(s) for s in sentences]
print("Number of sentences in documents:\n", num_sentences)

Number of sentences in documents:
 [3, 2, 7, 3, 4]


### Instructions 2/2
-    Create a list of tuples of the format (entity text, entity label) for the third doc container and store it in third_text_entities.
-    Create a list of tuples of the format (token text, POS tag) for the first ten tokens of the third doc container and store it in third_text_10_pos.

In [18]:
# Create a list to store sentences of each Doc container in documents
sentences = [[sent for sent in doc.sents] for doc in documents]

# Create a list to track number of sentences per Doc container in documents
num_sentences = [len([sent for sent in doc.sents]) for doc in documents]
print("Number of sentences in documents:\n", num_sentences, "\n")

# Record entities text and corresponding label of the third Doc container
third_text_entities = [(ent.text, ent.label_) for ent in documents[2].ents]
print("Third text entities:\n", third_text_entities, "\n")

# Record first ten tokens and corresponding POS tag for the third Doc container
third_text_10_pos = [(token.text, token.pos_) for token in documents[2]][:10]
print("First ten tokens of third text:\n", third_text_10_pos)

Number of sentences in documents:
 [3, 2, 7, 3, 4] 

Third text entities:
 [('citrus gelatin', 'PERSON'), ('Filberts', 'PERSON'), ("C.S. Lewis'", 'ORG'), ('The Lion, The Witch', 'WORK_OF_ART'), ('The Wardrobe', 'WORK_OF_ART'), ('Edmund', 'GPE'), ('Sisters', 'PERSON'), ('Witch', 'LOC')] 

First ten tokens of third text:
 [('This', 'PRON'), ('is', 'AUX'), ('a', 'DET'), ('confection', 'NOUN'), ('that', 'PRON'), ('has', 'AUX'), ('been', 'AUX'), ('around', 'ADP'), ('a', 'DET'), ('few', 'ADJ')]
