### Generating and using embeddings

We will use the pre-trained GloVe dataset to convert some sentences into N-dimensional vectors.

First, we will use the "gensim" package to load the dataset into an easily consumable model.

In [1]:
# Load pre-trained GloVe vectors
import csv

import numpy as np
import pandas as pd

def load_glove_model_v1(dim=50):
    """Load a GloVe dataset into a Pandas dataframe.
    Returns: a dataframe indexed by word, with one row per embedding vector"""
    glove_data_file = 'glove.6B.%dd.txt' % dim
    embedding_df = pd.read_table(glove_data_file, sep=" ", index_col=0, 
                                 header=None, quoting=csv.QUOTE_NONE,
                                 na_values=None, keep_default_na=False)
    return embedding_df

def load_glove_model_v2(dim=50):
    """Load GloVe vectors into a gensim KeyedVectors model, converting
    the data file into word2vec text format first if necessary.
    Adapted from: https://stackoverflow.com/a/47465278
    """
    from gensim.scripts.glove2word2vec import glove2word2vec
    from gensim.models.keyedvectors import KeyedVectors
    from pathlib import Path

    glove_data_file = 'glove.6B.%dd.txt' % dim
    word2vec_output_file = '%s.w2v' % glove_data_file

    if not Path(word2vec_output_file).exists():
        glove2word2vec(glove_input_file=glove_data_file, word2vec_output_file=word2vec_output_file)
    glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
    return glove_model

# We will use v2, because it is more versatile
model = load_glove_model_v2()
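
As far as I understand it, the conversion step in `load_glove_model_v2` is needed because the word2vec text format expects a first line holding the vector count and the dimension, which the GloVe files lack. The sketch below illustrates that idea on a toy file with made-up vectors. The helper `add_word2vec_header` and the file names are hypothetical and are not part of gensim.

```python
import os
import tempfile

def add_word2vec_header(glove_path, w2v_path):
    """Copy a GloVe text file to w2v_path, prepending a 'count dim' header line."""
    # Reads the whole file into memory, which is fine for a toy file only.
    with open(glove_path, encoding="utf-8") as src:
        lines = src.readlines()
    num_vectors = len(lines)
    # Each line is "<word> <v1> ... <vN>", so the dimension is the field count minus one.
    num_dims = len(lines[0].rstrip("\n").split(" ")) - 1
    with open(w2v_path, "w", encoding="utf-8") as dst:
        dst.write("%d %d\n" % (num_vectors, num_dims))
        dst.writelines(lines)
    return num_vectors, num_dims

# Toy two-word "GloVe" file with made-up 3-dimensional vectors.
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy.glove.txt")
toy_w2v = toy_glove + ".w2v"
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

shape = add_word2vec_header(toy_glove, toy_w2v)
with open(toy_w2v, encoding="utf-8") as f:
    header = f.readline().strip()
print(shape, header)
```

For the real data files, `glove2word2vec` remains the safer choice; this sketch only shows what the converted file looks like.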

Let's try to run this embedding model on a sample sentence within TensorFlow.

In [None]:
# Adapted from: http://adventuresinmachinelearning.com/gensim-word2vec-tutorial/

import tensorflow as tf

tf.reset_default_graph()
sentences = tf.placeholder(tf.int32, shape=[None,None])

# Build an embedding matrix out of embedding vectors.
# `vocab` and `index2word` are gensim 3.x attribute names; gensim 4 renamed them.
vocab_size = len(model.vocab)
embedding_matrix = np.zeros((vocab_size, model.vector_size))
for i in range(vocab_size):
    # Row i holds the vector of the word with ID i
    embedding_matrix[i] = model[model.index2word[i]]

# Print a lookup for a sample sentence with word IDs [1, 5, 10]
saved_embeddings = tf.constant(embedding_matrix)
embedding = tf.Variable(initial_value=saved_embeddings, trainable=False)
embedding_lookup = tf.nn.embedding_lookup(embedding, sentences)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(embedding_lookup,
                   feed_dict={sentences: [[1, 5, 10]]}))
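
For intuition, the embedding lookup in the cell above amounts to selecting rows of the embedding matrix by word ID. A minimal NumPy sketch of the same operation follows; the matrix values and the names `toy_matrix`, `toy_sentences`, and `toy_lookup` are made up for illustration.

```python
import numpy as np

# Made-up 4-word vocabulary with 3-dimensional vectors (illustrative values only).
toy_matrix = np.arange(12, dtype=np.float64).reshape(4, 3)
# A batch of one "sentence" holding word IDs 1 and 3.
toy_sentences = np.array([[1, 3]])

# Integer-array indexing picks one row per ID; the result has shape (batch, length, dim).
toy_lookup = toy_matrix[toy_sentences]
print(toy_lookup.shape)
```

The result has one vector per word ID, arranged in the same batch and sequence layout as the input.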

The lookup works. Next, we will convert our dataset into sequences of word IDs that can be fed into this embedding lookup for training. We should also experiment with the [GoogleNews word2vec dataset](http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/), because it was trained on news text and is therefore more relevant to a Fake News project.
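
One possible shape for that conversion, sketched under assumptions: split each sentence on whitespace, lowercase it (the `glove.6B` vocabulary is uncased), map each token to its word ID, and pad the rows to equal length. The helper `sentences_to_ids` and the toy vocabulary are hypothetical. In particular, reserving ID 0 for unknown words and padding is a choice made for this sketch, not something the GloVe vocabulary provides by itself.

```python
import numpy as np

def sentences_to_ids(sentences, word_to_id, unk_id=0, pad_id=0):
    """Map whitespace-tokenized sentences to a padded 2-D array of word IDs."""
    rows = [[word_to_id.get(tok, unk_id) for tok in s.lower().split()]
            for s in sentences]
    max_len = max(len(r) for r in rows)
    # Start from an array filled with the padding ID, then copy each row in.
    ids = np.full((len(rows), max_len), pad_id, dtype=np.int32)
    for i, r in enumerate(rows):
        ids[i, :len(r)] = r
    return ids

# Toy vocabulary; with the real model the mapping would come from its vocabulary.
toy_vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
id_matrix = sentences_to_ids(["The cat sat", "the dog"], toy_vocab)
print(id_matrix)
```

An array of this shape could be fed to the `sentences` placeholder used earlier, provided the IDs index into the same embedding matrix.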

### TODO: Convert our dataset into embeddings