In [16]:
import spacy
#!python3 -m spacy download en_core_web_sm

import numpy as np

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

nlp = spacy.load("en_core_web_sm")

## Word-sense disambiguation with spaCy
### WSD is a classical problem of deciding in which sense a word is used in a sentence. Determining the sense of the word can be crucial in search engines, machine translation, and question-answering systems. In this exercise, you will practice using POS tagging for word-sense disambiguation.

### Two sentences contain the word jam, each in a different sense. Your task is to identify the POS tags that help determine which sense is used in each sentence.

### The two sentences are available in the texts list. The en_core_web_sm model is already loaded and available for your use as nlp.

### Instructions 1/2
-    Create a documents list containing the Doc containers of each element in the texts list.
-    For each Doc container, print a tuple of the token's text and POS tag, but only if the word jam is in the token text.

In [2]:
texts = ["This device is used to jam the signal.",
         "I am stuck in a traffic jam"]

# Create a list of Doc containers in the texts list
documents = [nlp(t) for t in texts]

# Print a token's text and POS tag if the word jam is in the token's text
for i, doc in enumerate(documents):
    print(f"Sentence {i+1}: ", [(token.text, token.pos_) for token in doc if "jam" in token.text], "\n")

Sentence 1:  [('jam', 'VERB')] 

Sentence 2:  [('jam', 'NOUN')] 
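### Once the POS tag is known, it can be used as a key to pick a sense. The following is a minimal sketch of that idea: the `SENSES` table and the `lookup_sense` helper are hypothetical and not part of spaCy, and the tagged pairs are copied from the output above.

```python
# (text, POS) pairs copied from the output above
tagged = [("jam", "VERB"), ("jam", "NOUN")]

# Hypothetical sense inventory keyed by (word, POS tag); not part of spaCy
SENSES = {
    ("jam", "VERB"): "to block or interfere with something",
    ("jam", "NOUN"): "a congestion, or a fruit preserve",
}

def lookup_sense(word, pos):
    """Return a gloss for (word, pos), or a fallback when none is listed."""
    return SENSES.get((word.lower(), pos), "unknown sense")

for i, (word, pos) in enumerate(tagged, start=1):
    print(f"Sentence {i}: {word} ({pos}) -> {lookup_sense(word, pos)}")
```

### Note that a POS tag only separates senses that differ in part of speech. Two noun senses of jam (traffic versus fruit preserve) would need additional context to tell apart.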



## Dependency parsing with spaCy
### Dependency parsing analyzes the grammatical structure of a sentence: it identifies which words are related and the type of relationship between them. One application of dependency parsing is identifying the subject and object of a sentence. In this exercise, you will practice extracting dependency labels for given texts.

### Three comments from the Airline Travel Information System (ATIS) dataset have been provided for you in a list called texts. The en_core_web_sm model is already loaded and available for your use as nlp.

### Instructions
-    Create a documents list containing the Doc containers of each element in the texts list.
-    For each Doc container, print a tuple of (the token's text, dependency label, and label's explanation).

In [3]:
texts = ['I want to fly from Boston at 8:38 am and arrive in Denver at 11:10 in the morning',
 'What flights are available from Pittsburgh to Baltimore on Thursday morning?',
 'What is the arrival time in San francisco for the 7:55 AM flight leaving Washington?']

In [4]:
# Create a list of Doc containers for the texts list
documents = [nlp(t) for t in texts]

# Print each token's text, dependency label and its explanation
for doc in documents:
    print([(token.text, token.dep_, spacy.explain(token.dep_)) for token in doc], "\n")

[('I', 'nsubj', 'nominal subject'), ('want', 'ROOT', 'root'), ('to', 'aux', 'auxiliary'), ('fly', 'xcomp', 'open clausal complement'), ('from', 'prep', 'prepositional modifier'), ('Boston', 'pobj', 'object of preposition'), ('at', 'prep', 'prepositional modifier'), ('8:38', 'nummod', 'numeric modifier'), ('am', 'pobj', 'object of preposition'), ('and', 'cc', 'coordinating conjunction'), ('arrive', 'conj', 'conjunct'), ('in', 'prep', 'prepositional modifier'), ('Denver', 'pobj', 'object of preposition'), ('at', 'prep', 'prepositional modifier'), ('11:10', 'pobj', 'object of preposition'), ('in', 'prep', 'prepositional modifier'), ('the', 'det', 'determiner'), ('morning', 'pobj', 'object of preposition')] 

[('What', 'det', 'determiner'), ('flights', 'nsubj', 'nominal subject'), ('are', 'ROOT', 'root'), ('available', 'acomp', 'adjectival complement'), ('from', 'prep', 'prepositional modifier'), ('Pittsburgh', 'pobj', 'object of preposition'), ('to', 'prep', 'prepositional modifier'), ('B
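### To pull out the subject or the objects from output like this, the tuples can be grouped by dependency label. The sketch below works on plain (text, label) pairs copied from the first sentence's output above; the `group_by_label` helper is a hypothetical name introduced for illustration.

```python
from collections import defaultdict

# (text, dependency label) pairs copied from the first output above
parsed = [
    ("I", "nsubj"), ("want", "ROOT"), ("to", "aux"), ("fly", "xcomp"),
    ("from", "prep"), ("Boston", "pobj"), ("at", "prep"), ("8:38", "nummod"),
    ("am", "pobj"), ("and", "cc"), ("arrive", "conj"), ("in", "prep"),
    ("Denver", "pobj"), ("at", "prep"), ("11:10", "pobj"), ("in", "prep"),
    ("the", "det"), ("morning", "pobj"),
]

def group_by_label(pairs):
    """Collect token texts under their dependency label, preserving order."""
    by_label = defaultdict(list)
    for text, dep in pairs:
        by_label[dep].append(text)
    return dict(by_label)

groups = group_by_label(parsed)
print("Subject:", groups.get("nsubj", []))
print("Root:", groups.get("ROOT", []))
print("Objects of prepositions:", groups.get("pobj", []))
```

### With a live Doc, the same grouping could be built directly from token.text and token.dep_ instead of a hard-coded list.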

## spaCy vocabulary
### Word vectors, or word embeddings, are numerical representations of words that allow computers to perform complex tasks using text data. Word vectors are part of many spaCy models, but a few models do not include them.

### In this exercise, you will practice accessing spaCy vocabulary information. Some meta information about word vectors is stored in each spaCy model. You can access this information to learn more about the vocabulary size, word vector dimensions, etc.

### The spaCy package is already imported for your use. In a spaCy model's metadata, the number of words is stored as an element with the "vectors" key and the dimension of word vectors is stored as an element with the "width" key.

### Instructions
-    Load the en_core_web_md model.
-    Print the number of words in the en_core_web_md model's vocabulary.
-    Print the dimensions of word vectors in the en_core_web_md model.

In [5]:
# Load the en_core_web_md model
md_nlp = spacy.load("en_core_web_md")

# Print the number of words in the model's vocabulary
print("Number of words: ", md_nlp.meta["vectors"]["vectors"], "\n")

# Print the dimensions of word vectors in en_core_web_md model
print("Dimension of word vectors: ", md_nlp.meta["vectors"]["width"])

Number of words:  20000 

Dimension of word vectors:  300


## Word vectors in spaCy vocabulary
### Word vectors give a computer a numerical representation of words that it can compare and compute with. In this exercise, you will practice extracting word vectors for a given list of words.

### A list of words is compiled as words. The en_core_web_md model is already imported and available as nlp.

### The vocabulary of the en_core_web_md model contains 20,000 words. If a word does not exist in the vocabulary, you will not be able to extract its corresponding word vector. In this exercise, for simplicity, it is ensured that all the given words exist in this model's vocabulary.

### Instructions
-    Extract the IDs of all the given words and store them in an ids list.
-    For each ID from ids, store the first ten elements of the word vector in the word_vectors list.
-    Print the first ten elements of the first word vector from word_vectors.

In [9]:
nlp = spacy.load("en_core_web_md")

In [10]:
words = ["like", "love"]

# IDs of all the given words
ids = [nlp.vocab.strings[w] for w in words]

# Store the first ten elements of the word vectors for each word
word_vectors = [nlp.vocab.vectors[i][:10] for i in ids]

# Print the first ten elements of the first word vector
print(word_vectors[0])

[-2.3334  -1.3695  -1.133   -0.68461 -1.8482  -0.63712  2.6791   4.1433
 -2.5616  -1.8061 ]


## Word vectors projection

### You can visualize word vectors in a scatter plot to help you understand how the vocabulary words are grouped. In order to visualize word vectors, you need to project them into a two-dimensional space. You can project vectors by extracting the two principal components via Principal Component Analysis (PCA).

### In this exercise, you will practice extracting word vectors and projecting them into two-dimensional space using the PCA class from sklearn.

### A short list of words is stored in the words list, and the en_core_web_md model is loaded as nlp. All necessary libraries and packages are already imported for your use (PCA, numpy as np).

### Instructions
-    Extract the word IDs from the given words and store them in the word_ids list.
-    Extract the first five elements of the word vectors of the words and then stack them vertically using np.vstack() in word_vectors.
-    Given a pca object, calculate the transformed word vectors using its .fit_transform() method.
-    Print the first component of the transformed word vectors using [:, 0] indexing.

In [14]:
import numpy as np

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

In [15]:
words = ["tiger", "bird"]

# Extract word IDs of given words
word_ids = [nlp.vocab.strings[w] for w in words]

# Extract word vectors and stack the first five elements vertically
word_vectors = np.vstack([nlp.vocab.vectors[i][:5] for i in word_ids])

# Calculate the transformed word vectors using the pca object
pca = PCA(n_components=2)
word_vectors_transformed = pca.fit_transform(word_vectors)

# Print the first component of the transformed word vectors
print(word_vectors_transformed[:, 0])

[ 3.4203901 -3.420389 ]
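### The two printed values are nearly mirror images of each other. With only two points, all the variance lies along the line joining them, so after centering on their mean each point sits half the distance from the origin on opposite sides. The pure-Python sketch below illustrates this special case; the vectors are hypothetical (not taken from en_core_web_md), `first_component_two_points` is an illustrative helper, and the sign of each value may differ from what sklearn reports.

```python
import math

# Hypothetical 5-dimensional vectors, not taken from en_core_web_md
vec_a = [1.0, 2.0, 0.5, -1.0, 3.0]
vec_b = [0.0, -1.0, 2.5, 1.0, 1.0]

def first_component_two_points(u, v):
    """Project two points onto the line through them after centering on their mean."""
    mean = [(x + y) / 2 for x, y in zip(u, v)]
    diff = [x - y for x, y in zip(u, v)]
    norm = math.sqrt(sum(d * d for d in diff))
    axis = [d / norm for d in diff]  # unit vector along the only direction of variance

    def project(point):
        return sum((x - m) * w for x, m, w in zip(point, mean, axis))

    return project(u), project(v)

proj_a, proj_b = first_component_two_points(vec_a, vec_b)
print(proj_a, proj_b)
```

### A scatter plot only becomes informative with more words, since two points always collapse onto a single axis.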


## Similar words in a vocabulary

### Finding semantically similar terms has various applications in information retrieval. In this exercise, you will practice finding the most semantically similar term to the word computer from the en_core_web_md model's vocabulary.

### The word vector for computer is already extracted and stored as word_vector. The en_core_web_md model is already loaded as nlp, and the NumPy package is loaded as np.

### You can use the .most_similar() function of the nlp.vocab.vectors object to find the most semantically similar terms. Using [0][0] to index the output of this function will return the word IDs of the semantically similar terms. nlp.vocab.strings[<a given word>] can be used to find the word ID of a given word and it can similarly return the word associated with a given word ID.

### Instructions
-    Find the most semantically similar term from the en_core_web_md vocabulary.
-    Find the list of similar words given the word IDs of the similar terms.

In [27]:
# Look up the full 300-dimensional vector for the word computer
word_vector = nlp.vocab.vectors[nlp.vocab.strings["computer"]]

In [25]:
# Find the most similar word to the word computer
# (the query must be a 2D array with one row per query vector)
most_similar_words = nlp.vocab.vectors.most_similar(np.asarray([word_vector]), n=1)

# Find the list of similar words given the word IDs
words = [nlp.vocab.strings[w] for w in most_similar_words[0][0]]
print(words)
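### A nearest-neighbor lookup of this kind can be sketched in pure Python, assuming cosine similarity as the measure. The three-dimensional vector table and the `cosine` and `most_similar` helpers below are hypothetical and only mimic the idea; they are not spaCy's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vector table
table = {
    "computer": [0.9, 0.1, 0.0],
    "laptop": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

def most_similar(query, vectors, n=1, exclude=()):
    """Return the n words whose vectors are closest to the query vector."""
    candidates = [w for w in vectors if w not in exclude]
    ranked = sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return ranked[:n]

print(most_similar(table["computer"], table, n=1))
print(most_similar(table["computer"], table, n=1, exclude=("computer",)))
```

### In this toy table, querying with a word's own vector returns that word first, so a larger n or an exclusion list is needed to see its neighbors.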

## Doc similarity with spaCy

### Semantic similarity is the process of analyzing multiple sentences to identify similarities between them. In this exercise, you will practice calculating semantic similarities of documents to a given document. The goal is to identify which of the given reviews are relevant to canned dog food.

### The canned dog food category is stored in category. A sample of five food reviews has been provided for you in a list called texts. The en_core_web_md model is loaded as nlp.

### Instructions
-    Create a documents list containing Doc containers of all texts.
-    Create a Doc container of the category and store it as category_document.
-    Iterate through documents and print the similarity scores of each Doc container and the category_document, rounded to three digits.

In [28]:
# Create a documents list containing Doc containers
documents = [nlp(t) for t in texts]

# Create a Doc container of the category
category = "canned dog food"
category_document = nlp(category)

# Print similarity scores of each Doc container and the category_document
for i, doc in enumerate(documents):
  print(f"Semantic similarity with document {i+1}:", round(doc.similarity(category_document), 3))

Semantic similarity with document 1: 0.231
Semantic similarity with document 2: 0.329
Semantic similarity with document 3: 0.227
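### A document-level score like the ones above can be understood as a cosine similarity between averaged token vectors, which is how spaCy's default Doc.vector and similarity behave for models with static word vectors. The sketch below reproduces that idea in pure Python; the three-dimensional token vectors and the helper names are hypothetical.

```python
import math

# Hypothetical 3-dimensional token vectors, for illustration only
token_vectors = {
    "canned": [0.2, 0.7, 0.1],
    "dog": [0.9, 0.1, 0.0],
    "food": [0.1, 0.9, 0.2],
    "great": [0.3, 0.3, 0.8],
}

def mean_vector(tokens):
    """Average the vectors of the given tokens, dimension by dimension."""
    rows = [token_vectors[t] for t in tokens]
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

category_vec = mean_vector(["canned", "dog", "food"])
review_vec = mean_vector(["great", "dog", "food"])
print(round(cosine(review_vec, category_vec), 3))
```

### Because every token contributes equally to the average, long reviews with many unrelated words tend to score lower against a short category phrase.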


## Span similarity with spaCy

### Determining semantic similarity can help you categorize texts into predefined categories, detect relevant texts, or flag duplicate content. In this exercise, you will practice calculating the semantic similarities of spans of a document to a given document. The goal is to find the Span of three tokens that is most relevant to canned dog food.

### The given category of canned dog food is stored in category. A text string is already stored in the text object and the en_core_web_md model is loaded as nlp. The Doc container of the text is also already created and stored in document.

### Instructions
-    Create a Doc container for the category and store it in category_document.
-    Print the similarity score of a given Span and the category_document, rounded to three digits.

In [32]:
document = nlp("canned food products.")

In [33]:
# Create a Doc container for the category
category = "canned dog food"
category_document = nlp(category)

# Print similarity score of a given Span and category_document
document_span = document[0:3]
print("Semantic similarity with", document_span.text, ":", round(document_span.similarity(category_document), 3))

Semantic similarity with canned food products : 0.795
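### Finding the most relevant three-token span in a longer text means scoring every window of three consecutive tokens and keeping the best one. The sketch below shows only the windowing logic. The token list is hypothetical, and `overlap_score` is a simple token-overlap stand-in for Span.similarity, chosen so the example runs without a model.

```python
def token_windows(tokens, size=3):
    """Return every run of `size` consecutive tokens."""
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

def overlap_score(window, category_tokens):
    """Stand-in score: fraction of category tokens that appear in the window."""
    return len(set(window) & set(category_tokens)) / len(set(category_tokens))

# Hypothetical tokenized text and category
tokens = ["we", "love", "canned", "food", "products"]
category_tokens = ["canned", "dog", "food"]

windows = token_windows(tokens)
best = max(windows, key=lambda w: overlap_score(w, category_tokens))
print(best, round(overlap_score(best, category_tokens), 3))
```

### With spaCy, the windows would be Span objects such as document[i:i+3], and the score would come from their .similarity() method instead of the stand-in.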


## Semantic similarity for categorizing text

### The main objective of semantic similarity is to measure the distance between the semantic meanings of a pair of words, phrases, sentences, or documents. For example, the word “car” is more similar to “bus” than it is to “cat”. In this exercise, you will find sentences similar to the word sauce in an example text from Amazon Fine Food Reviews. You can use spaCy to calculate the similarity score between the word sauce and each sentence in a given texts string and report the most similar sentence's score.

### A texts string is pre-loaded that contains all reviews' Text data. You'll use the en_core_web_md English model for this exercise, which is already available as nlp.

### Instructions
-    Use nlp to generate Doc containers for the word sauce and for texts and store them in key and sentences respectively.
-    Calculate similarity scores of the word sauce with each sentence in the texts string (rounded to two digits).

In [34]:
texts = 'This hot sauce is amazing! We picked up a bottle on a trip! '

In [35]:
# Populate Doc containers for the word "sauce" and for "texts" string 
key = nlp("sauce")
sentences = nlp(texts)

# Calculate similarity score of each sentence and a Doc container for the word sauce
semantic_scores = []
for sent in sentences.sents:
    semantic_scores.append({"score": round(sent.similarity(key), 2)})
print(semantic_scores)

[{'score': 0.5}, {'score': 0.17}]
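### To report the most similar sentence, the scores can be paired with their sentences and the maximum selected. The sketch below uses the scores from the output above and assumes the texts string was split into two sentences at the exclamation marks; the `sentence_texts` list is written out by hand for illustration.

```python
# Sentences assumed from the texts string; scores copied from the output above
sentence_texts = ["This hot sauce is amazing!", "We picked up a bottle on a trip!"]
semantic_scores = [{"score": 0.5}, {"score": 0.17}]

# Index of the highest-scoring sentence
best_index = max(range(len(semantic_scores)), key=lambda i: semantic_scores[i]["score"])

print("Most similar sentence:", sentence_texts[best_index])
print("Score:", semantic_scores[best_index]["score"])
```

### Storing the sentence text next to its score in each dictionary would avoid keeping two parallel lists in sync.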
