# **Word embeddings**

Word embeddings (also called word vectors) represent each word numerically in such a way that the vector corresponds to how that word is used or what it means. Vector encodings are learned by considering the context in which the words appear. Words that appear in similar contexts will have similar vectors. For example, vectors for "leopard", "lion", and "tiger" will be close together, while they'll be far away from "planet" and "castle".
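
To make "close together" concrete, here is a minimal sketch that uses cosine similarity, a common way to compare embeddings. The three-dimensional vectors are made-up values for illustration only, not real embeddings, which typically have hundreds of dimensions.

```python
import numpy as np

# Made-up 3-dimensional vectors for illustration only; not real embeddings
toy_vectors = {
    "lion":   np.array([0.9, 0.8, 0.1]),
    "tiger":  np.array([0.8, 0.9, 0.2]),
    "planet": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between a and b: 1 means same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim_cats = cosine_similarity(toy_vectors["lion"], toy_vectors["tiger"])
sim_far = cosine_similarity(toy_vectors["lion"], toy_vectors["planet"])
print(f"lion vs tiger:  {sim_cats:.3f}")
print(f"lion vs planet: {sim_far:.3f}")
```

In this toy setup the two big cats point in nearly the same direction, so their similarity is much higher than the similarity between "lion" and "planet".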

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import spacy

spaCy provides embeddings learned from a model called Word2Vec. You can access them by loading one of spaCy's large models, such as `en_core_web_lg`. The vectors are then available on each token through the `.vector` attribute.

In [2]:
# Load a spaCy model to get the vectors.
# Note: 'en' is a shortcut name; the word vectors described above ship with 'en_core_web_lg'.
nlp = spacy.load('en')

review_data = pd.read_csv('yelp_ratings.csv')
review_data.head()

| Unnamed: 0 | text | stars | sentiment |
|---|---|---|---|
| 0 | Total bill for this horrible service? Over $8G... | 1.0 | 0 |
| 1 | I *adore* Travis at the Hard Rock's new Kelly ... | 5.0 | 1 |
| 2 | I have to say that this office really has it t... | 5.0 | 1 |
| 3 | Went in for a lunch. Steak sandwich was delici... | 5.0 | 1 |
| 4 | Today was my second out of three sessions I ha... | 1.0 | 0 |


Here's an example of computing some document vectors.

Calculating 44,500 document vectors takes about 20 minutes, so we'll get only the first 100. To save time, we'll load pre-saved document vectors for the hands-on coding exercises.

In [14]:
reviews = review_data[:100]
# We just want the vectors so we can turn off other models in the pipeline
with nlp.disable_pipes():
    vectors = np.array([nlp(review.text).vector for idx, review in reviews.iterrows()])
    
vectors.shape

(100, 96)

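Why 100 rows? Because we have one row for each of the 100 reviews.

Why 96 columns? A document vector has the same length as the word vectors of the loaded model, which is 96 here. The pre-saved vectors loaded below have 300 columns, which is the word-vector length of `en_core_web_lg`.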
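
The column count matches the word-vector length because, by default, spaCy's `Doc.vector` is an average of the document's token vectors. A minimal NumPy sketch of that averaging follows; the token vectors are made-up values and the width of 5 is arbitrary.

```python
import numpy as np

# Hypothetical token vectors for a 4-token document (values are made up).
# Each row stands in for one token's vector.
token_vectors = np.array([
    [0.2, 0.1, 0.0, 0.4, 0.3],
    [0.0, 0.3, 0.2, 0.0, 0.1],
    [0.4, 0.1, 0.2, 0.2, 0.1],
    [0.2, 0.3, 0.0, 0.2, 0.3],
])

# Averaging over the token axis gives one vector per document,
# with the same length as each token vector
doc_vector = token_vectors.mean(axis=0)

print(token_vectors.shape)  # (4, 5): 4 tokens, 5 dimensions each
print(doc_vector.shape)     # (5,): one vector for the whole document
```

Stacking one such vector per review gives a matrix with one row per document and one column per vector dimension.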

In [4]:
# Loading all document vectors from file
vectors = np.load('review_vectors.npy')
vectors

array([[-0.20143504,  0.1837154 , -0.01327053, ..., -0.05922916,
         0.01440009,  0.09077955],
       [-0.02590548,  0.1517007 , -0.11389936, ..., -0.04916738,
         0.03085417,  0.07205424],
       [-0.07666641,  0.19274631, -0.14321738, ..., -0.04575825,
         0.0689992 ,  0.09280958],
       ...,
       [-0.03841371,  0.16862842, -0.24175283, ..., -0.10739233,
         0.14741549,  0.12238124],
       [-0.01221176,  0.11620302, -0.09448893, ..., -0.06332556,
         0.02805696,  0.13142744],
       [ 0.01070178,  0.1630349 , -0.06763948, ..., -0.08762769,
         0.00377347,  0.15404755]], dtype=float32)

In [5]:
vectors.shape

(44530, 300)

# **Training a Model on Document Vectors**

In [9]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectors, review_data.sentiment, 
                                                    test_size=0.1, random_state=1)

# Create the LinearSVC model
model = LinearSVC(random_state=1, dual=False)
# Fit the model
model.fit(X_train, y_train)

# to see model accuracy
print(f'Model test accuracy: {model.score(X_test, y_test)*100:.3f}%')

Model test accuracy: 93.847%


**Centering the Vectors**

Sometimes people center document vectors when calculating similarities. That is, they calculate the mean vector from all documents, and they subtract this from each individual document's vector. Why do you think this could help with similarity metrics?

Sometimes your documents will already be fairly similar. For example, this data set is all reviews of businesses. These documents will be much more similar to one another than they are to news articles, technical manuals, or recipes. You end up with all the similarities between 0.8 and 1 and no anti-similar documents (similarity < 0). When the vectors are centered, you are comparing documents within your dataset as opposed to all possible documents.
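
The effect can be sketched with synthetic data. The vectors below are not review vectors: they are random stand-ins built from a large shared component plus small per-document noise, which is an assumption meant to mimic documents from a single domain.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "document vectors": a large shared component plus small
# per-document variation (assumed, to mimic single-domain documents)
shared = np.full(50, 1.0)
docs = shared + 0.1 * rng.standard_normal((200, 50))

def pairwise_cosine(X):
    # Normalize each row to unit length, then take all dot products
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

def off_diagonal(S):
    # Drop the self-similarities (always 1) on the diagonal
    return S[~np.eye(len(S), dtype=bool)]

raw_sims = off_diagonal(pairwise_cosine(docs))
centered_sims = off_diagonal(pairwise_cosine(docs - docs.mean(axis=0)))

print(f"raw:      min={raw_sims.min():.2f}, max={raw_sims.max():.2f}")
print(f"centered: min={centered_sims.min():.2f}, max={centered_sims.max():.2f}")
```

Before centering, the shared component dominates and every pair looks almost identical. After centering, the similarities spread out on both sides of zero, so differences within the dataset become visible.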

In [17]:
review = """I absolutely love this place. The 360 degree glass windows with the 
Yerba buena garden view, tea pots all around and the smell of fresh tea everywhere 
transports you to what feels like a different zen zone within the city. I know 
the price is slightly more compared to the normal American size, however the food 
is very wholesome, the tea selection is incredible and I know service can be hit 
or miss often but it was on point during our most recent visit. Definitely recommend!

I would especially recommend the butternut squash gyoza."""

def cosine_similarity(a, b):
    return np.dot(a, b)/np.sqrt(a.dot(a)*b.dot(b))

review_vec = nlp(review).vector

## Center the document vectors
# Calculate the mean for the document vectors
vec_mean = vectors.mean(axis=0)
# Subtract the mean from the vectors
centered = vectors - vec_mean

# Calculate similarities for each document in the dataset
# Make sure to subtract the mean from the review vector
sims = np.array([cosine_similarity(review_vec - vec_mean, vec) for vec in centered])

# Get the index for the most similar document
most_similar = sims.argmax()
most_similar

68
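
The list comprehension above calls `cosine_similarity` once per document. For the full set of 44,530 vectors, a vectorized version may be faster. The sketch below uses random stand-in data rather than the review vectors (`doc_vecs`, `query`, and `cosine_similarity_all` are illustrative names, not part of the original notebook) and checks that both approaches agree.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / np.sqrt(a.dot(a) * b.dot(b))

def cosine_similarity_all(query, matrix):
    # One matrix-vector product gives every dot product at once;
    # dividing by the norms turns them into cosine similarities
    dots = matrix @ query
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return dots / norms

# Random stand-in data, not the review vectors
rng = np.random.default_rng(1)
doc_vecs = rng.standard_normal((100, 96)).astype(np.float32)
query = rng.standard_normal(96).astype(np.float32)

# Center both the documents and the query with the dataset mean
mean = doc_vecs.mean(axis=0)
loop_sims = np.array([cosine_similarity(query - mean, v) for v in doc_vecs - mean])
fast_sims = cosine_similarity_all(query - mean, doc_vecs - mean)

print(np.allclose(loop_sims, fast_sims, atol=1e-5))
```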

In [24]:
print(centered.shape)
centered

(100, 96)


array([[-0.5606673 , -0.42731857,  0.5784769 , ..., -0.91992617,
        -0.00921389,  0.120713  ],
       [ 0.00657977, -0.1744476 ,  0.19410634, ...,  0.09970576,
         0.12338543,  0.2745846 ],
       [-0.05806408,  0.03195786,  0.09399292, ...,  0.18129015,
         0.06183219, -0.11148781],
       ...,
       [-0.02900801, -0.20155254, -0.01432611, ...,  0.11464477,
         0.2899717 ,  0.06350684],
       [-0.4703021 ,  0.05068153, -0.42928925, ...,  0.34041506,
         0.17901558, -0.11759883],
       [ 0.21898696,  0.22086507, -0.01466776, ..., -0.15677482,
        -0.04762721,  0.03696245]], dtype=float32)

In [18]:
sims

array([ 0.11615878,  0.20520343,  0.1199115 , -0.29613823, -0.2651352 ,
       -0.14336504, -0.0837178 , -0.02431078, -0.14451663, -0.15439798,
       -0.23878428, -0.11076656, -0.17707983,  0.12605008,  0.02963337,
        0.15706927, -0.03548702,  0.13628466,  0.2666304 ,  0.06678802,
        0.12929681,  0.03012088, -0.02520203, -0.1389964 ,  0.07685189,
        0.23875962, -0.16238788, -0.21943115,  0.11658009,  0.04152437,
       -0.19360548, -0.27243188,  0.10683729,  0.02188935, -0.02496911,
       -0.09520426,  0.33202317, -0.08510762,  0.24514355, -0.03882213,
       -0.10121612,  0.12764323, -0.01910499,  0.1520111 ,  0.04102394,
       -0.194206  , -0.00082984, -0.27407327, -0.08929842,  0.32867005,
        0.22730705,  0.02374783,  0.26474845,  0.01061294,  0.07035393,
       -0.1035321 , -0.45238316, -0.02635742,  0.10386878,  0.00223781,
        0.15506822,  0.07237454, -0.0901114 , -0.06198448, -0.23009619,
       -0.08190632, -0.01945008, -0.14834529,  0.46943936, -0.24

In [22]:
len(sims)

100

In [16]:
print(review_data.iloc[most_similar].text)

Yes... the Boba Tea explosion is in full force. I have been to Lee Lee International Supermarket in Chandler many times, but I never noticed this little gem next to it until a couple years ago. Boba Tea House has serving up some of the best boba tea in the Valley long before it became a big thing. They have a fantastic array of flavors and drink choices to choose like fruit slushes, snow, milk tea, pudding, mango jelly, coffee jelly, etc. They even have snacks like popcorn chicken, fried tofu, and fries. The staff is super friendly and the prices are reasonable. I still laugh at my friends who have no idea what Boba Tea is or are too afraid to suck up one of those chewy ball things. LOL. In case you didn't know, Boba Tea is a flavored tea (usually with milk) to which chewy tapioca balls or fruit jellies are added. I think they are super delicious. Today I got the Blueberry Milk Boba Tea and it made for the perfect snack in the middle of my day. Another favorite of my mine is the honeyd

**Looking at similar reviews**

Reviews for coffee shops will also be similar to our tea house review because coffee and tea are semantically similar. Most cafes serve both coffee and tea so you'll see the terms appearing together often.
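
To look at more than the single best match, the similarities can be ranked with `np.argsort`. A minimal sketch, using a short made-up array in place of the real `sims` (`top_k` is an illustrative name):

```python
import numpy as np

# Made-up similarity scores standing in for the real `sims` array
sims = np.array([0.12, -0.30, 0.47, 0.05, 0.33, -0.10])

top_k = 3
# argsort sorts ascending, so reverse it and keep the first k indices
top_indices = np.argsort(sims)[::-1][:top_k]

for idx in top_indices:
    print(idx, sims[idx])
```

With the real data, `review_data.iloc[idx].text` would print the review at each of these indices.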
