The text corpus can be downloaded from here: http://mattmahoney.net/dc/text8.zip
There are many ways to measure similarity between words (for example set-based measures such as Jaccard similarity, or measures based on word vectors).
In this project I decided to explore word2vec to compute similarities between words, implemented with the Spark MLlib package. The idea behind word2vec is that similar words end up with similar vectors in a high-dimensional space.
Example: Consider two sentences:
a. Mike is a good man
b. John is a good man too
In the above sentences both Mike and John occur in the same context of words, i.e.
(Mike, is),(Mike, a), (Mike, good), (Mike, man)
(John, is), (John, a), (John, good), (John, man), (John, too)
So word2vec ends up assigning similar (close) vectors to both Mike and John in the Euclidean embedding space.
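For illustration, here is a small sketch (my own helper, not part of word2vec itself) that generates such (target, context) pairs from a tokenized sentence with a symmetric window:

def context_pairs(tokens, window=5):
    # collect (target, context) pairs within +/- window positions, skip-gram style
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(context_pairs("mike is a good man".split()))
# [('mike', 'is'), ('mike', 'a'), ('mike', 'good'), ('mike', 'man'), ('is', 'mike'), ...]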
There are two models that I know of for generating the vectors with word2vec:
- Skip-gram model
- CBOW (continuous bag-of-words) model
Please use pip to set up the environment from the requirements.txt file:
pip install -r requirements.txt
Also, to use DeepDist, please install the deepdist library from the third_party folder:
python setup.py build
python setup.py install
Please find the Plotly visualization below. The model can also be visualized using the notebook mllib_word2vec_visualization.ipynb in the repo. The model is generated using the spark.mllib.feature.Word2Vec implementation.
https://plot.ly/create/?fid=chandan-u:5
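A rough sketch of how such a visualization can be produced (this is not the notebook itself; it assumes w2v is a dict mapping each word to its vector, e.g. one of the word-to-vector dictionaries built further below):

from sklearn.decomposition import PCA
import plotly.graph_objs as go
import plotly.offline as py

words = list(w2v.keys())[:200]                                         # plot a subset of the vocabulary
coords = PCA(n_components=2).fit_transform([w2v[w] for w in words])    # project vectors to 2-D

trace = go.Scatter(x=coords[:, 0], y=coords[:, 1], mode="markers+text", text=words)
py.plot(go.Figure(data=[trace]), filename="word2vec_2d.html")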
Stochastic Gradient Descent (SGD) is a sequential process: every gradient update depends on the previous updates. A single machine can therefore give good results, since SGD converges well when updates are applied in order, but its resources won't be sufficient for larger datasets. For a very large corpus and vocabulary it is logical to use distributed computing. However, if the communication between the subproblems (i.e. the gradients computed on each partition of the data) is not handled well, distributed training results in poor convergence.
ref : http://ruder.io/optimizing-gradient-descent/index.html#downpoursgd
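To make the communication pattern concrete, here is a toy sketch (a 1-D linear model, not word2vec) of synchronous gradient descent in Spark, where each partition computes a partial gradient on its shard and the driver combines them at every step:

from pyspark import SparkContext

sc = SparkContext(appName="sync-gradient-sketch")

# toy data for y = 2x + 1 with x in [0, 1); we fit (w, b) by full-batch gradient descent
data = sc.parallelize([(x / 1000.0, 2.0 * (x / 1000.0) + 1.0) for x in range(1000)], 8)

w, b, lr = 0.0, 0.0, 0.5
n = data.count()
for step in range(100):
    # each partition computes partial gradients of the squared error on its shard ...
    gw, gb = data.map(lambda p, w=w, b=b: ((w * p[0] + b - p[1]) * p[0],
                                           w * p[0] + b - p[1])
                      ).reduce(lambda a, c: (a[0] + c[0], a[1] + c[1]))
    # ... and the driver combines them and applies the update; this driver <-> executor
    # round trip on every step is the communication cost discussed above
    w, b = w - lr * gw / n, b - lr * gb / n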
Please refer to this blog post I have written (it shows how to use the TensorFlow core API to build a word2vec skip-gram model): http://chandan-u.github.io/implementing-and-visualizing-word2vec-using-tensorflow.html#implementing-and-visualizing-word2vec-using-tensorflow
Gensim can parallelize within a single node: it runs on multiple cores and uses C under the hood.
import gensim

# example tokenized sentences; sg=1 selects the skip-gram model (names follow gensim < 4.0)
sentences = [["mike", "is", "a", "good", "man"], ["john", "is", "a", "good", "man", "too"]]
model = gensim.models.Word2Vec(sg=1, sentences=sentences, size=100, window=5, min_count=1)
# map each word to its learned vector
w2v = dict(zip(model.wv.index2word, model.wv.syn0))
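Once the model is trained, nearest neighbours can be queried directly, e.g. (the query word is just for illustration):

print(model.wv.most_similar("man", topn=5))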
Spark MLlib uses the skip-gram model of word2vec, with hierarchical softmax in the output layer.
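A minimal sketch of training with the PySpark MLlib Word2Vec on the text8 corpus (the path and parameters are illustrative):

from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

sc = SparkContext(appName="mllib-word2vec")

# text8 is space-separated tokens; split it into an RDD of token lists
corpus = sc.textFile("text8").map(lambda line: line.split(" "))

model = Word2Vec().setVectorSize(100).setMinCount(5).fit(corpus)

print(model.findSynonyms("man", 5))   # [(word, cosine similarity), ...]
w2v = model.getVectors()              # dict: word -> vector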
Disadvantage: in each iteration the Spark driver sends the latest vectors to all Spark executors. Each executor modifies its local copy of the vectors based on its partition of the training data, and the driver then combines the local modifications to update the global vectors. This requires all vectors to be stored in the memory of every Spark executor and, like its single-machine counterparts, it is therefore unsuitable for large vocabularies.
ref: https://arxiv.org/pdf/1606.08495.pdf
Alternative for better and faster convergence: a distributed solution using Spark with DeepDist (which uses Downpour SGD to compute gradients).
It uses the concept of Downpour SGD, in which:
a. Each of the workers has a replica of the master's model and works on a partition of the data.
b. At the end of each computation step, the gradient updates are collected and transferred to the master.
c. The master and the workers work independently (asynchronously).
d. The difference between the MLlib implementation and DeepDist is that each model replica requests an updated copy of the model parameters from the master server; gradient updates are frequent, and hence the model converges faster and better. Please refer to the link below for more details, and see the usage sketch after it.
ref: http://deepdist.com/
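The usage pattern, roughly following the example on deepdist.com (the corpus path is illustrative; syn0/syn1 are older gensim attribute names, so newer gensim versions may need a slightly different gradient/descent pair):

from deepdist import DeepDist
from gensim.models.word2vec import Word2Vec
from pyspark import SparkContext

sc = SparkContext(appName="deepdist-word2vec")
corpus = sc.textFile("text8").map(lambda line: line.split(" "))

def gradient(model, sentences):        # executes on the workers
    syn0, syn1 = model.syn0.copy(), model.syn1.copy()   # snapshot of the current weights
    model.train(sentences)                              # train on this partition's sentences
    return {"syn0": model.syn0 - syn0, "syn1": model.syn1 - syn1}

def descent(model, update):            # executes on the master (parameter server)
    model.syn0 += update["syn0"]
    model.syn1 += update["syn1"]

with DeepDist(Word2Vec(corpus.collect())) as dd:        # initial model/vocabulary built on the driver
    dd.train(corpus, gradient, descent)
    print(dd.model.most_similar("man", topn=5))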
For production scenarios we can use a Lambda Architecture. In a Lambda Architecture, two processes run simultaneously:
- One that handles batch data, and
- one that handles streaming data
The batch handler does the actual training of the word2vec model. It takes the entire batch of text collected so far from the stream, processes it, and gives us the trained model, which can then be deployed to production. This is done iteratively (periodic retraining).
All the streaming text is processed by the streaming handler. In our case the streaming handler is supposed to serve similarities using the trained model, and the streamed text should be saved for further batch processing.
In Spark Streaming we get DStreams, which are blocks of RDDs collected over a certain window of time. We can use the trained model to predict similarities on the DStreams and, at the same time, save the text to a database/filesystem for batch-training the word2vec model.
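A rough sketch with Spark Streaming (the socket source, the HDFS paths, and loading the batch-trained MLlib model are all assumptions for illustration):

import time
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.feature import Word2VecModel

sc = SparkContext(appName="w2v-streaming")
ssc = StreamingContext(sc, 10)                                # 10-second micro-batches (DStreams)
model = Word2VecModel.load(sc, "hdfs:///models/word2vec")     # produced by the batch layer

lines = ssc.socketTextStream("localhost", 9999)               # assumed streaming text source

def handle(rdd):
    if rdd.isEmpty():
        return
    # 1) keep the raw text so the batch layer can retrain the model on it later
    rdd.saveAsTextFile("hdfs:///data/stream/%d" % int(time.time()))
    # 2) serve similarities for a few words seen in this micro-batch
    for word in rdd.flatMap(lambda l: l.split()).distinct().take(10):
        try:
            print(word, model.findSynonyms(word, 5))
        except Exception:
            pass      # word not in the trained vocabulary

lines.foreachRDD(handle)
ssc.start()
ssc.awaitTermination()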
Some libraries can use the GPU for computation; Spark MLlib doesn't. Computing gradients for skip-gram models is faster on a GPU, so a combination of distributed computing and GPUs could speed up training considerably (distributed TensorFlow needs to be explored).
Spark can also be used to run replicas of TensorFlow models locally on the nodes and merge the results; this could be done too.
Spark offers disk spill-over, but this will slow down training. I also found an interesting paper (the arXiv reference above) that talks about using parameter servers for this problem (the DeepDist approach also uses a parameter server, but the implementation explained in that paper is quite different: it uses a column-based rather than a word-based split). Yet to explore.