# SGNS using Word2Vec as embedding

Disclaimer: This script doesn't contain the cleaning of the data. 

It takes the cleaned data and trains a model using Skip-Gram with Negative Sampling (SGNS). Most of our implementation follows this paper: https://arxiv.org/pdf/1605.09096.pdf. The paper also discusses why SGNS is a practical alternative to SVD on large datasets.

These two articles give good insight into SGNS and why negative sampling is a good idea on big datasets:
- http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
- http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/
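
As a rough illustration of the idea described in those articles, here is a minimal NumPy sketch of the negative-sampling loss for a single (center word, context word) pair. It is not the code used in this project (the models here are built with gensim's `Word2Vec`); the function name `sgns_pair_loss` and the random toy vectors are assumptions made for the example. The point it shows is that each training pair only touches the vectors of one observed context word and `k` sampled "noise" words, rather than the whole vocabulary.

```python
import numpy as np


def sgns_pair_loss(center_vec, context_vec, negative_vecs):
    """Negative-sampling loss for one (center, context) training pair.

    center_vec:    (d,) input vector of the center word
    context_vec:   (d,) output vector of the observed context word
    negative_vecs: (k, d) output vectors of k sampled noise words
    """
    positive_score = context_vec @ center_vec
    negative_scores = negative_vecs @ center_vec
    # -log(sigmoid(x)) equals log(1 + exp(-x)), computed stably with logaddexp
    positive_term = np.logaddexp(0.0, -positive_score)
    # -log(sigmoid(-x)) equals log(1 + exp(x))
    negative_term = np.sum(np.logaddexp(0.0, negative_scores))
    return positive_term + negative_term


# Toy example (hypothetical values): 300-dimensional vectors, 5 negatives
rng = np.random.default_rng(0)
dim, k = 300, 5
center = rng.normal(scale=0.1, size=dim)
context = rng.normal(scale=0.1, size=dim)
negatives = rng.normal(scale=0.1, size=(k, dim))
loss = sgns_pair_loss(center, context, negatives)
```

Minimizing this loss pushes the center word towards its observed context and away from the sampled noise words, so each update involves only `k + 1` output vectors.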

In our case, we want to train one model per time period and compare how the word vectors evolve over time.

Because each model is trained independently, its word vectors end up expressed along different "axes", so we needed a method to "align" those axes. The orthogonal Procrustes method was chosen and implemented. For computational convenience, only the words present in the vocabularies of both periods were kept. An explanatory paper on Procrustes is available here: http://winvector.github.io/xDrift/orthApprox.pdf.
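
The alignment step can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the project's `createTransformationMatrix`: the function name `align_procrustes` and the dictionary-of-vectors input format are hypothetical. It keeps only the shared words, then finds the orthogonal matrix `R` minimizing `||A R - B||` (Frobenius norm) from the SVD of `A.T @ B`.

```python
import numpy as np


def align_procrustes(old_vectors, new_vectors):
    """Rotate the old embedding space onto the new one.

    old_vectors, new_vectors: dict mapping word -> 1-D array (same dimension).
    Returns (shared_words, aligned_old_matrix, new_matrix, rotation).
    """
    # Keep only the words that appear in both vocabularies
    shared = sorted(set(old_vectors) & set(new_vectors))
    A = np.vstack([old_vectors[w] for w in shared])
    B = np.vstack([new_vectors[w] for w in shared])
    # Orthogonal Procrustes solution: R = U V^T with A^T B = U S V^T
    U, _, Vt = np.linalg.svd(A.T @ B)
    R = U @ Vt
    return shared, A @ R, B, R


# Toy check (hypothetical data): the "new" space is a rotated copy of the old
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
base = rng.normal(size=(20, 5))
old = {f"w{i}": base[i] for i in range(20)}
new = {f"w{i}": base[i] @ Q for i in range(20)}
old["only_old"] = rng.normal(size=5)  # dropped: not in both vocabularies
new["only_new"] = rng.normal(size=5)  # dropped: not in both vocabularies

shared, aligned_old, new_matrix, R = align_procrustes(old, new)
```

Because `R` is orthogonal, it only rotates or reflects the old space; distances and cosine similarities between words of the same period are unchanged.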

To visualize our 300-dimensional word vectors in a 2-dimensional space, we reduce their dimensionality with t-Distributed Stochastic Neighbor Embedding (t-SNE). The paper and code are available here:

https://lvdmaaten.github.io/tsne/ 

Here is also an interesting site for understanding how t-SNE works and how it can go wrong:

http://distill.pub/2016/misread-tsne/
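
For reference, a minimal sketch of such a projection using scikit-learn's `TSNE` is shown below. Using scikit-learn is an assumption made for this example (the original t-SNE code is the one linked above), and the random matrix stands in for real word vectors. Fixing `random_state` makes a run repeatable, and `perplexity` is the main parameter discussed in the article linked above; it has to be smaller than the number of points.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder data: 40 random "word vectors" of 300 dimensions
rng = np.random.default_rng(0)
vectors = rng.normal(size=(40, 300))

# Project to 2 dimensions; random_state fixes the otherwise stochastic result
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
points_2d = tsne.fit_transform(vectors)  # array of shape (40, 2)
```

Each row of `points_2d` can then be plotted with `matplotlib` and labelled with the corresponding word.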

## Some code for using the functions

In [1]:
%matplotlib inline

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
from collections import Counter
from gensim.models import Word2Vec

# import our own helper functions
from load import *
from output import *



In [None]:
# create two models from two different periods by loading the pickle files
# that were created after cleaning the data.

# note that there are many more articles in recent years, so the periods
# need not be of the same length.
data_old = loadYears("../data/Cleaned/GDL", range(1798,1860))
data_new = loadYears("../data/Cleaned/GDL", range(1950,1960))

model_old = createModel(data_old)
model_new = createModel(data_new)

In [None]:
# now, we create the transformation matrix using Procrustes, and apply it
# to the earlier period. We then return the modified first model (with the
# transformation applied) and the new model.

modZ, modB = createTransformationMatrix(model_old,model_new)
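
Once both periods share the same axes, one common way to quantify how much a word has moved is the cosine distance between its two vectors. The sketch below is an assumption-laden illustration: `rank_by_shift` is a hypothetical helper, and it works on plain dictionaries of NumPy vectors rather than on the objects returned by `createTransformationMatrix`, whose exact type is not shown here.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def rank_by_shift(aligned_old, new):
    """Rank shared words by cosine distance between the two periods.

    aligned_old, new: dict mapping word -> 1-D array, in the same space.
    Returns a list of (word, distance) pairs, largest shift first.
    """
    shared = set(aligned_old) & set(new)
    shifts = {
        w: 1.0 - cosine_similarity(aligned_old[w], new[w]) for w in shared
    }
    return sorted(shifts.items(), key=lambda item: item[1], reverse=True)


# Toy example (hypothetical vectors): one word keeps its direction, one rotates
old_toy = {"stable": np.array([1.0, 0.0]), "moved": np.array([1.0, 0.0])}
new_toy = {"stable": np.array([2.0, 0.0]), "moved": np.array([0.0, 1.0])}
ranking = rank_by_shift(old_toy, new_toy)
```

Words at the top of such a ranking are natural candidates to inspect visually.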

In [None]:
# we can call this function to compare the shift of a selected word, with a
# certain number of neighbours for each period, to get an idea of the related
# words and context.
# t-SNE is used to project the 300-dimensional vectors
# into a 2-dimensional space.

# the red dots are from the earlier period and the blue dots from the
# most recent one. The arrow shows the shift estimated using t-SNE.

# Note: t-SNE is stochastic, so it may give a different result on
#   each run. Just run it again if the result is not satisfactory.

visualizeWord(modZ, modB, 'bâtiment', 4)

In [None]:
# TODO: here insert just a couple of interesting words
#    some ideas: armée, transport, ...
visualizeWord(modZ, modB, 'vapeur', 4) 

## Conclusion

The tools for quantifying semantic evolution are evolving quite rapidly, and many solutions exist. Only some of them could be tried during this project. In a relatively short time we obtained some interesting results, and more could emerge with deeper pre-processing of the data, parameter tuning, and so on. [TODO: insert the Facebook material that Sylvain found]