In [1]:
import pandas as pd
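import scipy as sc
import scipy.linalg  # explicit import so sc.linalg is always available
import sklearn
import sklearn.neighbors      # used below as sklearn.neighbors.KDTree
import sklearn.preprocessing  # used below as sklearn.preprocessing.normalize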
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
import statsmodels.api as sm
import sys
### Gensim is outside the anaconda distribution ###
### uncomment to install Gensim ###
#!{sys.executable} -m pip install gensim
import gensim
import gensim.downloader as model_api


In [2]:
# Load pretrained word embeddings
# This downloads about 60 MB of data the first time it is loaded
word_vectors = model_api.load("glove-wiki-gigaword-50")

# An interactive introduction to word embeddings

**Goals:**

- Demystify text-based AI models


- Convince you that this is very cool!

**Applications:**

- Translation (e.g. Google Translate)


- Text recommendation (autocomplete)


- Chatbots (automatic customer service)


- Much much more!


- [See here for state of the art on tasks](https://github.com/sebastianruder/NLP-progress)

First, a **magic trick**: 


$(Paris - France) + Russia = x$ 


which should give us $x = Moscow$

In [3]:
# Get the most similar word to an expression
word_vectors.most_similar_cosmul(positive=['paris', 'russia'], negative=['france'])

[('moscow', 0.9656671285629272),
 ('russian', 0.8811113834381104),
 ('prague', 0.8772416710853577),
 ('vienna', 0.8710137009620667),
 ('putin', 0.8702147006988525),
 ('warsaw', 0.8692157864570618),
 ('kiev', 0.8671694993972778),
 ('tokyo', 0.8649566173553467),
 ('berlin', 0.8640562295913696),
 ('47-42-17-11', 0.8622288107872009)]

**NLP ADVANTAGE:** It's easy to generate datasets in NLP if you're clever!

In [4]:
### Generate a dataset
### Takes Apple app IDs (found in the App Store URL)
### and fetches the reviews for each app
### AppReviews is a tiny library that queries the App Store JSON API
# import AppReviews
# df = pd.concat([
#     AppReviews.get_reviews(1057889290),
#     AppReviews.get_reviews(1402499966),
#     AppReviews.get_reviews(1417799395),
#     AppReviews.get_reviews(1403455040),
#     AppReviews.get_reviews(585027354),
#     AppReviews.get_reviews(454638411),
# ])
### Clean up NaN values ###
# df.loc[df['review'].isna(), 'review'] = "."
# df.loc[df['app_name'].isna(), "app_name"] = "."
# df.loc[df['title'].isna(), 'title'] = "."
# df.loc[df['vote_count'].isna(), 'vote_count'] = 0.0
# df.loc[df['version'].isna(), 'version'] = -1
# df.loc[df['rating'].isna(), 'rating'] = 2.5
# df.to_csv('app_reviews.csv', index=False)

df = pd.read_csv('app_reviews.csv')
df.head()

Unnamed: 0,app_name,title,version,rating,review,vote_count
0,Tomb of the Mask,Overall fun game...but,1.6,2.0,ENOUGH of the ads. I know the creator wants to...,0.0
1,Tomb of the Mask,Would be good if not for the ads,1.6,1.0,"Everything you do requires an ad, it’s really ...",0.0
2,Tomb of the Mask,AMAZING,1.6,5.0,This game is so indicting and I love it,0.0
3,Tomb of the Mask,So many ads,1.6,4.0,The game itself is really fun but you go throu...,0.0
4,Tomb of the Mask,Terrible app,1.6,1.0,This app has a cool game in it. However the ap...,0.0


# Fundamental Problem:

If we want to use the text to produce predictions or suggestions, we first need to translate it into a mathematical form.

**Naive solution:** Turn each document (each review) into a count of the words it contains. This is known as a bag-of-words representation.
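To make the idea concrete, here is a minimal sketch of bag-of-words counting done by hand with the standard library. The two toy reviews and the helper name `bag_of_words` are made up for illustration, and the tokenization (lowercasing, keeping tokens of two or more word characters) only approximates the defaults of `CountVectorizer`, which is used in the next cell.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count rows) for a list of strings."""
    # Lowercase and keep tokens of 2+ word characters,
    # roughly mirroring CountVectorizer's default tokenization.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # One row per document, one column per vocabulary word.
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocab])
    return vocab, rows

# Hypothetical toy reviews, not taken from the dataset
toy_reviews = ["Fun game, too many ads", "Ads ads ads"]
vocab, rows = bag_of_words(toy_reviews)
print(vocab)
for row in rows:
    print(row)
```

Each row is one review and each column is one word, which is the same layout as the matrix `X` built below.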

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(df['review'])
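# get_feature_names() was removed in scikit-learn 1.2; wrap in list() so .index() works later
wordLabels = list(vectorizer.get_feature_names_out())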

# Print example of the bag-of-words matrix
pd.DataFrame(data=X.toarray(), columns=wordLabels).loc[:, 'game':].head()

Unnamed: 0,game,gameeee,gamemodes,gameplay,gameplay10,gamer,gamers,games,gamification,gaming,...,너무,네비게이션의,사용하기,안내,좋은지도,좋음,차량,파악하여,편함,해주니
0,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


It is easy to try to **predict a review's rating** with this approach:

In [7]:
# Use PCA to compress X matrix (for speedup)
# Information on it here: https://jeremykun.com/2012/06/28/principal-component-analysis/
# Reference here: https://onlinecourses.science.psu.edu/stat505/node/51/
# Note model won't work without compression because so many words have only 1 entry
COMPRESSED_SIZE = 200

Xd = X.toarray()
Xd = PCA(COMPRESSED_SIZE).fit(Xd.T).components_.T
Xd = sm.add_constant(Xd)

# OLS computed by hand for convenience
# Ypred = X(X'X)^-1 X'Y  ----- Reference here:
# https://web.stanford.edu/~mrosenfe/soc_meth_proj3/matrix_OLS_NYU_notes.pdf
pred = Xd @ sc.linalg.inv(Xd.T @ Xd) @ Xd.T @ df['rating'].values
print("OLS R^2: ", r2_score(df['rating'], pred))

OLS R^2:  0.36698404366771886
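The hand-written formula inverts $X'X$ explicitly, which can be numerically fragile when columns are nearly collinear. As a hedged alternative, the sketch below compares it with `np.linalg.lstsq` on synthetic data. The data, the coefficients in `true_beta`, and all variable names are assumptions made for illustration; this is not the review dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not the review dataset):
# 100 samples, 3 features plus an intercept column of ones.
X_toy = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
true_beta = np.array([2.0, 1.0, -0.5, 0.25])
y_toy = X_toy @ true_beta + rng.normal(scale=0.1, size=100)

# Normal equations, as in the cell above: pred = X (X'X)^-1 X'y
pred_normal = X_toy @ np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ y_toy

# Least-squares solver: minimizes ||X b - y|| without forming the inverse
beta_hat, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)
pred_lstsq = X_toy @ beta_hat

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_toy - pred_lstsq) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("max difference between methods:", np.max(np.abs(pred_normal - pred_lstsq)))
print("R^2:", r2)
```

On well-conditioned data like this the two approaches agree to within floating-point error, so the choice mostly matters when the feature matrix is close to singular.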


# What are the problems with this approach?

- Doesn't associate similar/same words


- No information about words themselves


- No word importance information

Some fixes for the bag-of-words approach are detailed in the first week of [this free NLP course](https://www.coursera.org/learn/language-processing/).

# A better approach to NLP

A better approach is to get a [mathematical representation of each word](https://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/) and combine those representations into sentences.

Currently, the most popular way to do this exploits which words appear together in sentences.

First, let's get embeddings from the [co-occurrence matrix](https://web.stanford.edu/class/cs224n/reports/2758144.pdf)

In [8]:
# We want a symmetric matrix where each row and column is a word
# and each entry measures how often the two words appear together in a review
# Multiplying the transposed bag-of-words matrix by the matrix itself gives this
Xc = (X.T @ X).todense()
# To zero out same-word co-occurrence, call .setdiag(0) on the sparse product before .todense()
Xc

matrix([[8, 0, 0, ..., 0, 0, 0],
        [0, 2, 0, ..., 0, 0, 0],
        [0, 0, 1, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 1, 1, 1],
        [0, 0, 0, ..., 1, 1, 1],
        [0, 0, 0, ..., 1, 1, 1]], dtype=int64)
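A tiny worked example may help show what this product contains. The 3x3 bag-of-words matrix below is hypothetical (three made-up reviews over three words), and the names `toy_vocab`, `X_toy`, and `cooc` are illustrative only.

```python
import numpy as np

# Hypothetical bag-of-words matrix: 3 documents (rows) x 3 words (columns)
toy_vocab = ["ads", "fun", "game"]
X_toy = np.array([
    [1, 1, 1],   # "fun game ads"
    [2, 0, 1],   # "ads ads game"
    [0, 1, 0],   # "fun"
])

# Entry (i, j) sums count_i * count_j over all documents
cooc = X_toy.T @ X_toy
print(cooc)

# Optionally zero the diagonal so a word does not co-occur with itself
cooc_no_diag = cooc.copy()
np.fill_diagonal(cooc_no_diag, 0)
print(cooc_no_diag)
```

Note that each entry is a sum of products of counts, so a word repeated within one review contributes more than once. That is why "ads" and "game" score 3 here even though they share only two reviews.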

In [9]:
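# .getA() converts the np.matrix to a plain ndarray, which newer scikit-learn versions require
my_embeddings = PCA(COMPRESSED_SIZE).fit(Xc.getA()).components_.T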

We can use this to find the "closest" word to another word. The standard way to do this is a "nearest neighbors" search.

There is a good article on it [here](https://www.quantamagazine.org/universal-method-to-sort-complex-information-found-20180813/). A simple method is the [K-D tree](https://en.wikipedia.org/wiki/K-d_tree), but there are [more sophisticated libraries](https://github.com/spotify/annoy).

The metric we use to judge how close two words are is the **cosine distance** between their vectors. It depends only on the angle between the vectors, so it measures the difference in "where they point", not in their lengths.
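As a small sketch of the metric, the snippet below computes cosine similarity for two made-up vectors and checks the identity that lets a Euclidean-distance structure stand in for cosine distance: for unit-length vectors, $\|a - b\|^2 = 2(1 - \cos\theta)$. Cosine distance is taken here as one minus cosine similarity, which is the usual convention but an assumption about what the text intends. The helper name `cosine_similarity` and the example vectors are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Arbitrary example vectors (made up for illustration)
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

cos_sim = cosine_similarity(a, b)
cos_dist = 1.0 - cos_sim

# After scaling to unit length, squared Euclidean distance
# equals 2 * cosine distance: ||a - b||^2 = 2 * (1 - cos)
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
sq_euclid = float(np.sum((a_unit - b_unit) ** 2))

print(cos_sim, cos_dist, sq_euclid)
```

Because the squared Euclidean distance is an increasing function of the cosine distance, ranking neighbors by one gives the same order as ranking by the other.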

In [10]:
# For unit-length vectors, squared Euclidean distance is 2 * (cosine distance),
# so Euclidean nearest neighbors are also cosine nearest neighbors
my_embeddings = sklearn.preprocessing.normalize(my_embeddings)
# KD-tree uses Euclidean distance
tree = sklearn.neighbors.KDTree(my_embeddings)

In [11]:
evalWord = 'game'
k = 5

dist, ind = tree.query([my_embeddings[wordLabels.index(evalWord)]], k=k)

for i in range(k):
    print(wordLabels[ind[0][i]], ":  ", dist[0][i])

game :   0.0
awsome :   0.8607745824279769
jimmy :   0.945669374465239
interuppting :   1.0764969308474335
bhen :   1.0796239962633714


**Now we understand the magic trick used in the beginning!**
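Put together, the trick is vector arithmetic followed by a nearest-neighbor lookup. The sketch below shows this on hand-picked 3-D vectors. The vectors, their axis meanings, and the helper `analogy` are all hypothetical, chosen so the analogy works exactly. It uses the additive form (add and subtract vectors, then rank by cosine similarity), which is simpler than the multiplicative scoring behind `most_similar_cosmul` used at the start.

```python
import numpy as np

# Hypothetical 3-D embeddings, hand-picked so the analogy works exactly.
# Axes (loosely): "French-ness", "Russian-ness", "capital-city-ness".
toy_vectors = {
    "france": np.array([1.0, 0.0, 0.0]),
    "paris":  np.array([1.0, 0.0, 1.0]),
    "russia": np.array([0.0, 1.0, 0.0]),
    "moscow": np.array([0.0, 1.0, 1.0]),
    "putin":  np.array([0.0, 1.0, 0.2]),
}

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    target = target / np.linalg.norm(target)
    scores = {}
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # skip the query words themselves
        scores[word] = float(target @ vec / np.linalg.norm(vec))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = analogy(positive=["paris", "russia"], negative=["france"], vectors=toy_vectors)
print(ranked)
```

With real embeddings the match is never exact, which is why the pretrained model returns a ranked list of candidates with scores below 1.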

There are many better methods to generate embeddings. The most popular are [word2vec](https://www.tensorflow.org/tutorials/representation/word2vec) and [GloVe](https://nlp.stanford.edu/projects/glove/). There are also methods based on [matrix factorization](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/), like the one we used. Modern techniques use recurrent neural network models that [predict words](https://thegradient.pub/nlp-imagenet/) to generate better embeddings.