# <center>Understanding Word2Vec with PySpark</center>
<center>Gabriel Fair</center>


Today we are going to look at how Word2Vec uses word embeddings to create numeric vectors that represent the meaning of words. This is an important part of natural language processing (NLP). The goal of NLP is to extract meaning from human language, which is often provided in the form of text. This meaning can be found in many components of language.

### Some components of language
  - Pragmatics
  - Semantics
  - Syntax
  - Morphology
  - Phonology

## Distributional Semantic Models
Word embeddings are word representations used in NLP. They are a subclass of distributional semantic models because they rely on the distributional hypothesis. The distributional hypothesis, introduced by Zellig Harris in his 1954 paper [“Distributional structure”](http://www.tandfonline.com/doi/pdf/10.1080/00437956.1954.11659520), is the assumption that words that occur in the same contexts tend to have similar meanings. As a result, synonyms have similar representations in a collection of texts. Word embeddings are represented as vectors of values produced by training a neural network.
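
To make the distributional hypothesis concrete, here is a minimal sketch that counts which words share a sentence with "cat" and with "dog". The toy corpus and the variable names are made up for illustration and are not part of the gab.ai data used later.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for this illustration only
sentences = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cats",
]

# For every word, count the other words that appear in the same sentence
context_counts = defaultdict(Counter)
for sentence in sentences:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j, other in enumerate(tokens):
            if i != j:
                context_counts[word][other] += 1

# Words that appear in the contexts of both "cat" and "dog"
shared = set(context_counts["cat"]) & set(context_counts["dog"])
print(sorted(shared))
```

"cat" and "dog" never appear together here, yet they share most of their context words, which is the kind of signal a distributional model relies on.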




### We are going to start with some imports that we will need later

In [102]:
import os, sys, codecs, json, datetime

from time import time
import pyspark
print("pyspark version and install location: " + str(sc.version))
print(str(pyspark.version))


from pyspark.mllib.feature import Word2Vec as Word2Vec #https://spark.apache.org/docs/2.2.0/mllib-feature-extraction.html#word2vec
from pyspark.mllib.clustering import KMeans as KMeans
from pyspark.mllib.linalg import Vectors as Vectors

#https://spark.apache.org/docs/2.2.0/ml-features.html#word2vec
from pyspark.ml.feature import Word2Vec as Word2Vec2 #https://spark.apache.org/docs/2.2.0/ml-feature-extraction.html#word2vec
from pyspark.ml.clustering import KMeans as KMeans2
from pyspark.ml.linalg import Vectors as Vectors2

#from pyspark.ml.feature import PCA

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import math

from IPython.display import display


pyspark version and install location: 2.4.0
<module 'pyspark.version' from '/opt/spark-2.4.0-bin-hadoop2.7/python/pyspark/version.py'>


### Determine which version of scikit-learn we are using.
#### [As of version 0.20.0 scikit-learn supports Pandas dataframes. ](https://medium.com/dunder-data/from-pandas-to-scikit-learn-a-new-exciting-workflow-e88e2271ef62)

In [13]:
import sklearn
print("sklearn's version is: " + str(sklearn.__version__))
print("My python version is: "+ str(sys.version))

sklearn's version is: 0.20.1
My python version is: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0]


In [14]:
#A method to monitor progress
def update_progress(current_progress, total, current):
    text = str(current_progress) + "/" + str(total) + " At: " + str(current)
    sys.stdout.write('\r' + text)
    sys.stdout.flush()

### Input 1 million gab.ai posts

In [94]:
path_to_text_data = '1mill_posts_unique_body_only.csv'
#filter_words = 'filter_word.csv' # needs to be one word per line

post_data = sc.textFile(path_to_text_data)
totalposts = post_data.count()
print("There are posts total: " + str(totalposts) )

There are posts total: 1000001


#### Example gab.ai posts

In [95]:
# take(7) fetches only the first rows instead of collecting the whole RDD to the driver
for y in post_data.take(7):
    print(y)

data.body
"Probably because I see the faint hint of 'horns' holding that halo up... "
https://youtu.be/YMQRFT4bZuc
http://www.epochtimes.de/politik/europa/zahl-der-toten-nach-londoner-hochhausbrand-auf-79-gestiegen-2-a2146594.html
https://t.co/LTMBeXvHrC
"Ps 37:14 Die Gottlosen ziehen das Schwert aus und spannen ihren Bogen, daß sie fällen den Elenden und Armen und schlachten die Frommen.\nPs 37:15 Aber ihr Schwert wird in ihr Herz gehen, und ihr Bogen wird zerbrechen.\n\n"
At least 25 killed in airstrike on market in Yemen – reports\nhttps://www.rt.com/news/392838-saudi-yemen-market-airstrike/ #saudiarabia #yemen


### Clean the text

In [96]:
# clean characters by removing some characters and transform text to lower case
posts_RDD = post_data.map(lambda x: x.replace(";"," ").replace(":"," ").replace('"',' ').replace('-',' ').replace(',',' ').replace('.',' ').lower())

# tokenize into separate words
posts_RDD = posts_RDD.map(lambda row: row.split(" ")) 
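
The chained `replace` calls above leave empty strings and URL fragments in the token lists. As a hedged alternative, the sketch below uses a regular expression to drop URLs and literal `\n` sequences before tokenizing. The function name `tokenize` and the exact patterns are assumptions for illustration, not what the original notebook ran.

```python
import re

def tokenize(text):
    """Lower-case a post, drop URLs and literal backslash-n sequences, and split into word tokens."""
    text = text.lower()
    text = text.replace("\\n", " ")             # the CSV stores newlines as the two characters \ and n
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs entirely
    return re.findall(r"[#\w']+", text)         # keep words, numbers, hashtags and apostrophes

print(tokenize('Probably because I see "horns"... '))
print(tokenize("reports\\nhttps://example.com/news #yemen"))
```

In Spark this could be applied with `post_data.map(tokenize)`, assuming the function is defined in the same notebook.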


#### Example after cleaning

In [97]:
# There is no need to worry about the blank string elements, they will be ignored anyway
for y in posts_RDD.take(7):
    print(y)

['data', 'body']
['', 'probably', 'because', 'i', 'see', 'the', 'faint', 'hint', 'of', "'horns'", 'holding', 'that', 'halo', 'up', '', '', '', '', '']
['https', '//youtu', 'be/ymqrft4bzuc']
['http', '//www', 'epochtimes', 'de/politik/europa/zahl', 'der', 'toten', 'nach', 'londoner', 'hochhausbrand', 'auf', '79', 'gestiegen', '2', 'a2146594', 'html']
['https', '//t', 'co/ltmbexvhrc']
['', 'ps', '37', '14', 'die', 'gottlosen', 'ziehen', 'das', 'schwert', 'aus', 'und', 'spannen', 'ihren', 'bogen', '', 'daß', 'sie', 'fällen', 'den', 'elenden', 'und', 'armen', 'und', 'schlachten', 'die', 'frommen', '\\nps', '37', '15', 'aber', 'ihr', 'schwert', 'wird', 'in', 'ihr', 'herz', 'gehen', '', 'und', 'ihr', 'bogen', 'wird', 'zerbrechen', '\\n\\n', '']
['at', 'least', '25', 'killed', 'in', 'airstrike', 'on', 'market', 'in', 'yemen', '–', 'reports\\nhttps', '//www', 'rt', 'com/news/392838', 'saudi', 'yemen', 'market', 'airstrike/', '#saudiarabia', '#yemen']


## Building a Vector
To use the distributional hypothesis to build a vector, we have to decide what it means for words to be near each other. This measure of “nearness” is known as a window. In the image below, taken from [Chris McCormick’s tutorial](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/), the target word is highlighted in blue, and the window around it extends two words to either side of the target.
This means the window size is equal to 2.
<img style="display: block; margin: auto;" alt="photo" src="http://mccormickml.com/assets/word2vec/training_data.png">

Word pairs are created between the target word and all other words in the window, which extends both forwards and backwards. The target is then moved to the next word and the process repeats. Some embedding models treat words to the left of the target word differently than words to the right, but for now we will treat them both equally.
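
The pairing procedure can be sketched in a few lines of plain Python. The helper name `skipgram_pairs` and the example sentence are assumptions for illustration; Spark's Word2Vec builds its training pairs internally.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not paired with itself
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, window=2)
for pair in pairs[:5]:
    print(pair)
```

Words at the edges of the sentence produce fewer pairs because part of their window falls outside the text.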


These word pairs become the training samples for the model. Each word in a pair is encoded as a one-hot vector: a vector with one entry per vocabulary word that is 1 at that word's position and 0 everywhere else. The pairs are in the form (target word, context-word-in-window), and they are used as the input for a simple neural network with one hidden layer. The hidden layer has a pre-determined number of neurons that we specify as a hyperparameter. The number of hidden-layer neurons has a large effect on the accuracy of the model and on its running time; 300 is widely used in practice since it was used by word2vec’s creators. This simple network is introduced below by way of the Restricted Boltzmann Machine (RBM), a related two-layer model; strictly speaking, word2vec's network is a shallow feed-forward network rather than an RBM.
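
A short NumPy sketch shows why the hidden layer acts as a lookup table: multiplying a one-hot vector by the input weight matrix simply selects one row of that matrix. The five-word vocabulary, the hidden size of 3 (instead of 300) and the names `W_in` and `one_hot` are assumptions chosen to keep the example small.

```python
import numpy as np

vocab = ["the", "quick", "brown", "fox", "jumps"]
word_to_index = {word: i for i, word in enumerate(vocab)}
hidden_size = 3  # the text uses 300; 3 keeps the output readable

rng = np.random.default_rng(0)
W_in = rng.normal(size=(len(vocab), hidden_size))  # one row of weights per vocabulary word

def one_hot(word):
    """Vector of zeros with a single 1 at the word's vocabulary position."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

hidden = one_hot("fox") @ W_in  # hidden-layer activations for the word "fox"
print(hidden)
print(W_in[word_to_index["fox"]])  # the same numbers: the row for "fox"
```

After training, these rows are the word vectors that the model returns.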

## Restricted Boltzmann Machines (RBMs)

<img style="display: block; margin: auto;" alt="photo" src="https://raw.githubusercontent.com/gabefair/gabefair.github.io/master/images/threelayers%5B1%5D.png">

In the image above, there are three columns that are known in descriptions of neural networks as layers. These diagrams show cause and effect between the layers of a neural network and are read from left to right. Each circle represents a neuron and is called a **node**. A node is where a calculation is performed to determine whether it will send a 0 or a 1 to a node in the next layer, which is to the right. This communication is known as **firing** and only happens in one direction, left to right.


<img style="display: block; margin: auto;" alt="photo" src="https://raw.githubusercontent.com/gabefair/gabefair.github.io/master/images/three_layers_connected.png">


In our Restricted Boltzmann Machine, nodes are not linked to, and do not communicate with, other nodes within the same layer. This restriction gives the RBM its name. Every node in the input layer is linked to each node in the hidden layer. The nodes in the input layer are considered to be different from the nodes in the hidden layer, which is why they are in different layers.

I stress this point because this structure is known as a bipartite graph. And not just any bipartite graph, but a complete bipartite graph, because these two layers are fully linked. Note that some texts call this a symmetrical bipartite graph. It is also important to notice in the graphic above that the hidden layer has fewer nodes than the input or output layers. This is an important quality of RBMs: squeezing the input through a smaller hidden layer is a form of dimensionality reduction.

When an RBM is initialized, four things are determined in advance and are built into the construction of the neural network. The three layer sizes are known as hyperparameters; the weights are starting values that training adjusts later.

  - Number of nodes in the input layer
  - Number of nodes in the hidden layer
  - Number of nodes in the output layer
  - The weights of the nodes in the hidden layer


With RBMs a special step happens when the hidden layer is created. Each link into a node is randomly assigned a weight. A weight is the influence that a node has on the nodes it is linked to in the next layer. This random assignment is only the starting point and is known as random initialization. It is often confused with Stochastic Gradient Descent (SGD), the procedure that later adjusts the weights: stochastic means “random”, and in SGD the randomness lies in which training samples are used for each update, not in how the weights are first set.

These weights are important to Word2Vec. Word2Vec does not leave them at random values; it builds them up over time as the neural network is fed the word pairs we created previously as input. This is known as training the neural network, and the trained hidden-layer weights are the word vectors.
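
To illustrate what "training" means here, the sketch below applies repeated gradient updates for a single (target, context) pair using a full softmax over a five-word vocabulary. It is a simplified illustration under stated assumptions: the function name `sgd_step`, the sizes and the learning rate are made up, and real implementations typically replace the full softmax with cheaper approximations such as hierarchical softmax or negative sampling.

```python
import numpy as np

def sgd_step(W_in, W_out, target, context, lr=0.1):
    """Apply one gradient update for a single (target, context) index pair.

    Returns the cross-entropy loss measured before the update.
    """
    h = W_in[target].copy()                  # hidden layer = the target word's row of W_in
    scores = h @ W_out                       # one score per vocabulary word
    exp_scores = np.exp(scores - scores.max())
    probs = exp_scores / exp_scores.sum()    # softmax over the vocabulary
    loss = -np.log(probs[context])

    grad_scores = probs.copy()
    grad_scores[context] -= 1.0              # gradient of the loss with respect to the scores
    grad_h = W_out @ grad_scores             # computed before W_out is changed

    W_out -= lr * np.outer(h, grad_scores)   # in-place update of hidden-to-output weights
    W_in[target] -= lr * grad_h              # in-place update of the target word's vector
    return loss

rng = np.random.default_rng(42)
vocab_size, hidden_size = 5, 3
W_in = rng.normal(scale=0.1, size=(vocab_size, hidden_size))    # random starting weights
W_out = rng.normal(scale=0.1, size=(hidden_size, vocab_size))

# Train repeatedly on one pair: word index 1 should learn to predict word index 3
losses = [sgd_step(W_in, W_out, target=1, context=3) for _ in range(100)]
print(losses[0], losses[-1])
```

The loss falls as the weights move away from their random starting values, which is the "building up over time" described above.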

### Set up and train the neural network used by Word2Vec

In [None]:
#Word2Vec model param: wordIndex maps each word to an index, 
# which can retrieve the corresponding vector from wordVectors 
# param: wordVectors array of length numWords * vectorSize, vector corresponding to the word mapped with index i can be retrieved by the slice (i * vectorSize, i * vectorSize + vectorSize)

k = 300         # vector dimensionality (The number of nodes in the hidden layer)
word2vec = Word2Vec().setVectorSize(k) #Uses skip-gram model

In [None]:
## train Word2vec
#dir(word2vec)
model = word2vec.fit(posts_RDD)  # `word2vec` is the estimator configured in the previous cell

model_vectors = model.getVectors()
## Get the list of words in the w2v matrix
vocabsize = len(model_vectors)
print("Size of vocab list: " + str(vocabsize))

#### The results of training

In [None]:

print(model_vectors)

#### Just the output vector for the word "trump"

In [None]:
print(model_vectors['trump'])

Size of vocab list: 1027


In [85]:
print("Size of vector for the word 'looks': " + str(len(model_vectors['looks'])))

Size of vector for the word 'looks': 100


In [91]:
word_to_look_up_final_score = 'trump'
another_word_to_compare_it_against = 'president'

#Find synonyms of a word; do not include the word itself in results
word_cosine_similarity_arry_word1 = model.findSynonyms(word_to_look_up_final_score, vocabsize-1) #returns an array of (word, cosineSimilarity)
word_cosine_similarity_arry_word2 = model.findSynonyms(another_word_to_compare_it_against, vocabsize-1) #returns an array of (word, cosineSimilarity)

list_words = []
for l in word_cosine_similarity_arry_word1:
    list_words.append(l[0])
list_words.append(word_to_look_up_final_score)

nwords = len(list_words)
nfeatures = model.transform(word_to_look_up_final_score).array.shape[0]
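
`findSynonyms` ranks words by cosine similarity. As a reminder of what that number measures, here is a minimal plain-Python sketch; the function name `cosine_similarity` and the example vectors are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, unrelated directions score near 0, and opposite directions score close to -1, regardless of vector length.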

In [92]:
print("=================================================")
print("Number of total posts processed: ", totalposts)
#print("=================================================")
#print("Number of filtered posts used: ", twcount)
print("=================================================")
print("Number of words in the model:", nwords)
print("=================================================")
print("Number of features per word: ", nfeatures)
print("=================================================")

Number of total posts processed:  2001
Number of words in the model: 1027
Number of features per word:  100


In [None]:
## Construct the feature matrix; each row is associated with one word in list_words
feature_matrix = []
found_words = 0
for word in list_words:
    found_words += 1
    feature_matrix.append(model.transform(word).array)
    update_progress(found_words, nwords, word)

In [None]:
# file names match the ones the later cells load
np.save('Gab_ai_posts_W2Vmatrix.npy',feature_matrix)
np.save('Gab_ai_posts_WordList.npy',list_words)

In [None]:
num_of_clusters = int(math.floor(math.sqrt(float(nwords)/2)))
# Clusters ~ sqrt(n/2) is a fast approximation according to: http://infolab.stanford.edu/~ullman/mmds/ch7.pdf
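
As a quick worked example of the sqrt(n/2) rule of thumb, the sketch below plugs in the vocabulary size of 1027 reported earlier. The helper name `rule_of_thumb_clusters` is an assumption for illustration.

```python
import math

def rule_of_thumb_clusters(n_points):
    """Number of clusters suggested by the sqrt(n/2) heuristic, rounded down."""
    return int(math.floor(math.sqrt(n_points / 2.0)))

# sqrt(1027 / 2) = sqrt(513.5), which lies between 22 and 23
print(rule_of_thumb_clusters(1027))
```

The heuristic only gives a starting point; the number of clusters usually still needs tuning against the data.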


In [None]:
Feature_Matrix = np.load('Gab_ai_posts_W2Vmatrix.npy')    # reads model generated by Word2Vec
words = np.load('Gab_ai_posts_WordList.npy')    # reads list of words
Featshape = Feature_Matrix.shape

  
## K-means clustering with Spark  
maxiters=100
feature_rdd = sc.parallelize(Feature_Matrix)  # KMeans.train expects an RDD of vectors, not a NumPy array
clusters = KMeans.train(feature_rdd, k = num_of_clusters, maxIterations = maxiters) 

## Getting Cluster Labels for each Word and saving to a numpy file
labels = feature_rdd.map(lambda point: clusters.predict(point)) # add labels to each vector (word)
list_labels = labels.collect()
np.save('k_Clusters.npy',list_labels)

print("="*70)
print("Size of the Word2vec matrix (words, features) is: ", Featshape)
print("="*70)
print("Number of clusters used: ", num_of_clusters)
print("="*70)

In [None]:
# The helper name and signature are our own choice; nwords + 1 must not exceed the vocabulary size
def find_similar_words(word, nwords):
    Feat = np.load('Gab_ai_posts_W2Vmatrix.npy')    # reads matrix generated by Word2Vec
    words = np.load('Gab_ai_posts_WordList.npy')    # reads list of words
    labels = np.load('k_Clusters.npy')              # cluster labels saved by the K-means cell

    wstar = Feat[np.where(words == word)[0][0]]     # vector corresponding to the chosen word
    nwstar = math.sqrt(np.dot(wstar, wstar))        # norm of that vector

    # absolute cosine similarity between the chosen word and every word
    dist = np.abs(Feat.dot(wstar)) / (np.linalg.norm(Feat, axis=1) * nwstar)

    # indexes of the nwords+1 largest similarities (the chosen word itself is included)
    indexes = np.argpartition(dist, -(nwords + 1))[-(nwords + 1):]
    di = [(words[j], dist[j], labels[j]) for j in indexes]

    result = pd.DataFrame(di, columns = ["word","similarity","cluster"])
    return result.sort_values("similarity", ascending=False) # order results from closest to farthest

In [None]:
maxWordsVis = 15

Feat = np.load('Gab_ai_posts_W2Vmatrix.npy')  
words = np.load('Gab_ai_posts_WordList.npy')
# to rdd, avoid this with big matrices by reading them directly from hdfs
Feat = sc.parallelize(Feat) 
Feat = Feat.map(lambda vec: (Vectors.dense(vec),))
# to dataframe
dfFeat = sqlContext.createDataFrame(Feat,["features"])



In [None]:
dfFeat.head()

In [None]:
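# Fit a Spark ML PCA model first; the Word2Vec `model` cannot produce "pcaFeatures"
from pyspark.ml.feature import PCA
from pyspark.mllib.util import MLUtils

dfFeatML = MLUtils.convertVectorColumnsToML(dfFeat, "features")  # spark.ml PCA needs ml vectors
pca_model = PCA(k=3, inputCol="features", outputCol="pcaFeatures").fit(dfFeatML)
dfComp = pca_model.transform(dfFeatML).select("pcaFeatures")
# get the first three components as lists to be plotted
compX = dfComp.rdd.map(lambda row: float(row[0][0])).take(maxWordsVis)
compY = dfComp.rdd.map(lambda row: float(row[0][1])).take(maxWordsVis)
compZ = dfComp.rdd.map(lambda row: float(row[0][2])).take(maxWordsVis)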

In [None]:
#print(pyspark.__version()__) #
#print(pyspark.version()) #
import pyspark 


In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
result = np.array(dfFeat.select('features').collect())
result = np.reshape(result, (result.shape[0], -1))  # (words, features) without hard-coding 1027 x 100

pca.fit(result)
model = pca.transform(result)
number_of_words = model.shape[0]
assert number_of_words == len(words)
print(model.shape)
print(model[0])
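
For intuition about what the PCA step does, here is a minimal NumPy sketch that centers a matrix and projects it onto its top three principal directions using the singular value decomposition. The random matrix `X` is a stand-in for the (words, features) matrix and is not real Word2Vec output; scikit-learn's `PCA` yields the same projection up to the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))              # stand-in for the (words, features) matrix

X_centered = X - X.mean(axis=0)            # PCA works on mean-centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:3]                        # top three principal directions
projected = X_centered @ components.T      # each row is now a 3-D point

print(projected.shape)
print(projected.var(axis=0))               # variance captured by each component
```

The first component captures the most variance and each later one captures less, which is why the first three are the ones plotted.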

In [None]:
#%matplotlib inline
#%matplotlib qt5
fs=20 #fontsize
w = words[0:number_of_words]
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
height = 10
width = 10
fig.set_size_inches(width, height)

for i, word in enumerate(words):
    ax.scatter(model[i, 0], model[i, 1], model[i, 2], color='red', marker='o', edgecolors='black')
    ax.text(model[i, 0], model[i, 1], model[i, 2], word)
    #plt.scatter(model[i, 0], model[i, 1], color='red', marker='o', edgecolors='black')
    
# rotate the view; this needs an interactive backend such as qt5
for angle in range(0, 360):
    ax.view_init(30, angle)
    plt.draw()
    plt.pause(.001)

In [None]:
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(1, figsize=(4, 3))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

for word_num , word in enumerate(words):
    for vector_num , vector in enumerate(model[0]):
        word_label = words[word_num]
       