**This notebook is an exercise in the [Natural Language Processing](https://www.kaggle.com/learn/natural-language-processing) course.  You can reference the tutorial at [this link](https://www.kaggle.com/matleonard/word-vectors).**

---


# Vectorizing Language

Embeddings are both conceptually clever and practically effective. 

So let's try them for the sentiment analysis model you built for the restaurant. Then you can find the most similar review in the data set given some example text. It's a task where you can easily judge for yourself how well the embeddings work.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import spacy

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.nlp.ex3 import *
print("\nSetup complete")


Setup complete


In [2]:
# Load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')

review_data = pd.read_csv('../input/nlp-course/yelp_ratings.csv')
review_data.head()

| Unnamed: 0 | text | stars | sentiment |
|---|---|---|---|
| 0 | Total bill for this horrible service? Over $8G... | 1.0 | 0 |
| 1 | I *adore* Travis at the Hard Rock's new Kelly ... | 5.0 | 1 |
| 2 | I have to say that this office really has it t... | 5.0 | 1 |
| 3 | Went in for a lunch. Steak sandwich was delici... | 5.0 | 1 |
| 4 | Today was my second out of three sessions I ha... | 1.0 | 0 |


Here's an example of loading some document vectors. 

Calculating 44,500 document vectors takes about 20 minutes, so we'll get only the first 100. To save time, we'll load pre-saved document vectors for the hands-on coding exercises.

In [3]:
reviews = review_data[:100]
# We just want the vectors so we can turn off other models in the pipeline
with nlp.disable_pipes():
    vectors = np.array([nlp(review.text).vector for idx, review in reviews.iterrows()])
    
vectors.shape

(100, 300)

The result is a matrix of 100 rows and 300 columns. 

Why 100 rows?
Because we have 1 row for each review.

Why 300 columns?
This is the same length as word vectors. See if you can figure out why document vectors have the same length as word vectors (some knowledge of linear algebra or vector math would be needed to figure this out).
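
If you want to check your reasoning: by default, spaCy computes a document vector as the average of its token vectors, and an average of vectors has the same number of dimensions as the vectors themselves. Here is a minimal sketch using NumPy only. The three 4-dimensional "word vectors" are made-up values for illustration, not real spaCy embeddings.

```python
import numpy as np

# Toy 4-dimensional "word vectors" (made-up values for illustration only)
word_vectors = np.array([
    [1.0, 0.0, 2.0, 4.0],   # stands in for a first word
    [3.0, 2.0, 0.0, 0.0],   # stands in for a second word
    [2.0, 4.0, 1.0, 2.0],   # stands in for a third word
])

# Averaging over the word axis gives one vector with the same number of dimensions
doc_vector = word_vectors.mean(axis=0)

print(word_vectors.shape)  # (3, 4)
print(doc_vector.shape)    # (4,)
print(doc_vector)          # [2. 2. 1. 2.]
```

The same logic applies with 300-dimensional vectors: however many words a review has, their average is still a single 300-dimensional vector.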

Go ahead and run the following cell to load in the rest of the document vectors.

In [4]:
# Loading all document vectors from file
vectors = np.load('../input/nlp-course/review_vectors.npy')

# 1) Training a Model on Document Vectors

Next you'll train a `LinearSVC` model using the document vectors. It runs quickly and works well in high-dimensional settings like the one you have here.

After running the `LinearSVC` model, you might try experimenting with other types of models to see whether they improve your results.

In [5]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectors, review_data.sentiment, 
                                                    test_size=0.1, random_state=1)

# Create the LinearSVC model
model = LinearSVC(random_state=1, dual=False)
# Fit the model
model.fit(X_train, y_train)

# Print the model's accuracy on the test set
print(f'Model test accuracy: {model.score(X_test, y_test)*100:.3f}%')

# Check your work
q_1.check()

Model test accuracy: 93.847%


<span style="color:#33cc33">Correct</span>

In [6]:
# Lines below will give you a hint or solution code
#q_1.hint()
#q_1.solution()

In [7]:
# Scratch space in case you want to experiment with other models
from sklearn.neural_network import MLPClassifier

second_model = MLPClassifier(hidden_layer_sizes=(128,32,),
                             early_stopping=True, random_state=1)
second_model.fit(X_train, y_train)
print(f'Model test accuracy: {second_model.score(X_test, y_test)*100:.3f}%')

Model test accuracy: 94.229%


# Document Similarity

For an example tea house review, find the most similar review in the dataset using cosine similarity.

# 2) Centering the Vectors

Sometimes people center document vectors when calculating similarities. That is, they calculate the mean vector from all documents, and they subtract this from each individual document's vector. Why do you think this could help with similarity metrics?

Run the following line after you've decided your answer.

In [8]:
# Check your answer (Run this code cell to receive credit!)
#q_2.solution()
q_2.check()

<span style="color:#33cc33">Correct:</span> 


    Sometimes your documents will already be fairly similar. For example, this data set
    is all reviews of businesses. There will be strong similarities between the documents
    compared to news articles, technical manuals, and recipes. You end up with all the
    similarities between 0.8 and 1 and no anti-similar documents (similarity < 0). When the
    vectors are centered, you are comparing documents within your dataset as opposed to all
    possible documents.
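
To make the effect concrete, here is a small sketch with toy 2-D vectors (made-up values, not real review vectors) that all share a large common component, the way reviews of businesses share a common "review-like" direction. The `cosine_similarity` helper uses the standard formula; `docs` and the other names are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / np.sqrt(a.dot(a) * b.dot(b))

# Toy 2-D "document vectors" with a large shared component (made-up values)
docs = np.array([
    [10.0, 1.0],
    [10.0, -1.0],
    [11.0, 0.5],
    [9.0, -0.5],
])

# Before centering, the shared component dominates the similarity
raw_sim = cosine_similarity(docs[0], docs[1])

# Center by subtracting the mean vector of all documents
doc_mean = docs.mean(axis=0)
centered = docs - doc_mean
centered_sim = cosine_similarity(centered[0], centered[1])

print(f"raw similarity:      {raw_sim:.3f}")
print(f"centered similarity: {centered_sim:.3f}")
```

Before centering, the first two documents look nearly identical (similarity close to 1). After centering, the shared component is gone and the similarity turns negative, which reveals that they differ in opposite directions relative to the rest of the set.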
    

# 3) Find the most similar review

Given the example review below, find the most similar document within the Yelp dataset using cosine similarity.

In [9]:
review = """I absolutely love this place. The 360 degree glass windows with the 
Yerba buena garden view, tea pots all around and the smell of fresh tea everywhere 
transports you to what feels like a different zen zone within the city. I know 
the price is slightly more compared to the normal American size, however the food 
is very wholesome, the tea selection is incredible and I know service can be hit 
or miss often but it was on point during our most recent visit. Definitely recommend!

I would especially recommend the butternut squash gyoza."""

def cosine_similarity(a, b):
    return np.dot(a, b)/np.sqrt(a.dot(a)*b.dot(b))

review_vec = nlp(review).vector

## Center the document vectors
# Calculate the mean for the document vectors, should have shape (300,)
vec_mean = vectors.mean(axis=0)
# Subtract the mean from the vectors
centered = vectors - vec_mean

# Calculate similarities for each document in the dataset
# Make sure to subtract the mean from the review vector
review_centered = review_vec - vec_mean
sims = np.array([cosine_similarity(v, review_centered) for v in centered])

# Get the index for the most similar document
most_similar = sims.argmax()

# Check your work
q_3.check()

<span style="color:#33cc33">Correct</span>

In [10]:
# Lines below will give you a hint or solution code
#q_3.hint()
#q_3.solution()

In [11]:
print(review_data.iloc[most_similar].text)

After purchasing my final christmas gifts at the Urban Tea Merchant in Vancouver, I was surprised to hear about Teopia at the new outdoor mall at Don Mills and Lawrence when I went back home to Toronto for Christmas.
Across from the outdoor skating rink and perfect to sit by the ledge to people watch, the location was prime for tea connesieurs... or people who are just freezing cold in need of a drinK!
Like any gourmet tea shop, there were large tins of tea leaves on the walls, and although the tea menu seemed interesting enough, you can get any specialty tea as your drink. We didn't know what to get... so the lady suggested the Goji Berries... it smelled so succulent and juicy... instantly SOLD! I got it into a tea latte and watched the tea steep while the milk was steamed, and surprisingly, with the click of a button, all the water from the tea can be instantly drained into the cup (see photo).. very fascinating!

The tea was aromatic and tasty, not over powering. The price was also 

Even though there are many different sorts of businesses in our Yelp dataset, you should have found another tea shop. 

# 4) Looking at similar reviews

If you look at other similar reviews, you'll see many coffee shops. Why do you think reviews for coffee shops are similar to the example review, which mentions only tea?
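
One way to look at more than the single best match is to sort the similarity scores and keep the top few indices. The sketch below uses a small made-up array, `toy_sims`, as a stand-in for the `sims` array from question 3; the name and values are hypothetical.

```python
import numpy as np

# Hypothetical similarity scores, standing in for the `sims` array from question 3
toy_sims = np.array([0.12, 0.87, -0.30, 0.55, 0.91])

k = 3
# argsort is ascending, so reverse it and keep the first k indices
top_k = np.argsort(toy_sims)[::-1][:k]

print(top_k)            # [4 1 3]
print(toy_sims[top_k])  # [0.91 0.87 0.55]
```

In the notebook, you could apply the same indexing to the real scores and then read the matching reviews with something like `review_data.iloc[top_k].text`.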

In [12]:
# Check your answer (Run this code cell to receive credit!)
#q_4.solution()
q_4.check()

<span style="color:#33cc33">Correct:</span> 


    Reviews for coffee shops will also be similar to our tea house review because
    coffee and tea are semantically similar. Most cafes serve both coffee and tea
    so you'll see the terms appearing together often.
    

# Congratulations!

You've finished the NLP course. It's an exciting field that will help you make use of vast amounts of data you didn't know how to work with before.

This course should be just your introduction. Try a project **[with text](https://www.kaggle.com/datasets?tags=14104-text+data)**. You'll have fun with it, and your skills will continue growing.

---




*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/161466) to chat with other Learners.*