# Word2Vec

Try gensim

After time series, another common source of sequential data is text: a sentence is a sequence of words.

In NLP, words are often replaced by numbers that indicate how frequently they occur in a document. This encoding discards information about how words relate to one another.
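
As a small illustration of what pure frequency counts lose, the sketch below compares two made-up sentences (they are not from the dataset used later). Both produce identical counts even though their meanings differ.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order.
sentence_a = "dog bites man".split()
sentence_b = "man bites dog".split()

counts_a = Counter(sentence_a)
counts_b = Counter(sentence_b)

# The count representations are indistinguishable, so word order
# (and with it, who bites whom) is lost.
same_counts = counts_a == counts_b
print(counts_a)
print(same_counts)
```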

There are two families of approaches:

- Count-based: compute statistics of how often each word co-occurs with its neighbors, then map those statistics to a small, dense vector for each word.
- Predictive: try to predict a word from its neighbors, using small, dense embedding vectors.
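
A minimal sketch of the count-based idea, assuming a toy corpus, a window of one word on each side, and a truncated SVD as the mapping to dense vectors. All three choices are illustrative assumptions, not something this notebook uses later.

```python
import numpy as np

# Toy corpus (hypothetical); the notebook itself works with text8.
corpus = "the dog chews the bone the cat chews the toy".split()
vocab = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocab)}

# Count how often each word appears next to each other word.
window = 1
cooc = np.zeros((len(vocab), len(vocab)))
for pos, word in enumerate(corpus):
    start = max(0, pos - window)
    stop = min(len(corpus), pos + window + 1)
    for ctx_pos in range(start, stop):
        if ctx_pos != pos:
            cooc[index[word], index[corpus[ctx_pos]]] += 1

# Truncated SVD: keep the `dim` strongest directions as dense vectors.
dim = 2
u, s, _ = np.linalg.svd(cooc)
embeddings = u[:, :dim] * s[:dim]
print(embeddings.shape)
```

Each row of `embeddings` is a dense vector for one vocabulary word, derived only from co-occurrence counts.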

The goal of word2vec is to learn word embeddings by modeling each word as a vector in n-dimensional space.

Instead of just counting occurrences, word2vec builds vector space models that represent (embed) word information in a continuous vector space. For example, the embedding of the word "cat" could carry information related to "animal", "pet", "four legs", etc.

Because words become vectors, vector arithmetic can be applied to them. A commonly cited example is that `king - man + woman` lands close to `queen`.
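
The following sketch shows what arithmetic on word vectors looks like. The vectors are hand-made for illustration (the axes are assumed to mean royalty, male, female, fruit); real embeddings are learned and their dimensions are not interpretable like this.

```python
import numpy as np

# Hand-made toy vectors, NOT learned embeddings.
# Assumed axes: (royalty, male, female, fruit).
toy = {
    "king":  np.array([1.0, 1.0, 0.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Analogy query: king - man + woman.
query = toy["king"] - toy["man"] + toy["woman"]

# Search among words that were not part of the query.
candidates = {w: v for w, v in toy.items() if w not in ("king", "man", "woman")}
nearest = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(nearest)
```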

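Depending on the prediction target, word2vec models come in two variants (the bold words are the ones being predicted):

- Skip-gram: **the dog chews the** bone. Predicts the context words from the target word; tends to do better on large datasets.
- Continuous Bag of Words (CBOW): the dog chews the **bone**. Predicts the target word from its context; tends to do better on small datasets, because it smooths over distributional information by treating the whole context as one observation.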
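
To make the difference concrete, here is a sketch that builds training examples for both variants from the example sentence. The symmetric window of one word and the helper names `skipgram_pairs` and `cbow_examples` are assumptions made for this illustration.

```python
sentence = "the dog chews the bone".split()
window = 1  # hypothetical: one context word on each side

def context_of(tokens, i, window):
    """Words within `window` positions of index i, excluding i itself."""
    start = max(0, i - window)
    stop = min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(start, stop) if j != i]

def skipgram_pairs(tokens, window):
    """One (input word, context word to predict) pair per context word."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_of(tokens, i, window)]

def cbow_examples(tokens, window):
    """One (whole context, target word to predict) example per position."""
    return [(context_of(tokens, i, window), target)
            for i, target in enumerate(tokens)]

sg = skipgram_pairs(sentence, window)
cbow = cbow_examples(sentence, window)
print(sg)
print(cbow)
```

Skip-gram yields more, smaller examples from the same text, while CBOW yields one example per position.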

### CBOW

Noise-contrastive training contrasts the actual target word with noise words. Training is based on a loss function that combines the probability of seeing the true word in a context with the probability of seeing each of *k* words, drawn as noise, in the same context. The model learns to assign a high probability to the true word and a low probability to the noise words.
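
The shape of such a loss can be sketched in plain Python. This is the simplified negative-sampling form, not a full noise-contrastive estimation implementation. The function name and the example scores are assumptions for illustration, where a score stands for the dot product between a context vector and a word vector.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def negative_sampling_loss(true_score, noise_scores):
    """Loss for one context.

    true_score: score of the real target word in this context.
    noise_scores: scores of the k noise words in the same context.
    """
    # Reward a high probability for the true word ...
    loss = -math.log(sigmoid(true_score))
    # ... and a low probability for each noise word.
    loss -= sum(math.log(sigmoid(-s)) for s in noise_scores)
    return loss

# Hypothetical scores: a model that ranks the true word high,
# and one that prefers the noise words.
good = negative_sampling_loss(3.0, [-2.0, -1.5, -3.0])
bad = negative_sampling_loss(-3.0, [2.0, 1.5, 3.0])
print(good, bad)
```

The loss is small when the true word scores high and the noise words score low, and it grows when the model confuses them.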

To visualize the relationships between word vectors, reduce their dimensionality with t-distributed Stochastic Neighbor Embedding (t-SNE).
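
A minimal sketch of that reduction step, assuming scikit-learn is available and using random vectors as a stand-in for trained embeddings. The sizes and the perplexity value are arbitrary choices for the example.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for learned embeddings: 30 "words", 16 dimensions each.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(30, 16))

# Project to 2-D; perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=5, init="random", random_state=0)
coords = tsne.fit_transform(fake_embeddings)
print(coords.shape)
```

Each row of `coords` can then be plotted as a point and labeled with its word.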



In [1]:
import collections
import math
import os
import errno
import random
import zipfile

import numpy as np
from six.moves import urllib
from six.moves import xrange
import tensorflow as tf

In [2]:
data_dir = "./04-Recurrent-Neural-Networks/word2vec_data/words"
data_url = "http://mattmahoney.net/dc/text8.zip"

In [4]:
def fetch_words_data(url=data_url, words_data=data_dir):
    
    # Make the Dir if it does not exist
    os.makedirs(words_data, exist_ok = True)
    
    # Path to zip file
    zip_path = os.path.join(words_data, "words.zip")
    
    # If the zip file isn't there, download it from the data url
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
        
    # Now that the zip file is there, get the data from it
    with zipfile.ZipFile(zip_path) as f:
        data = f.read(f.namelist()[0])
        
    # Return a list of all the words in the data source.
    return data.decode("ascii").split()

In [5]:
words = fetch_words_data()

In [6]:
len(words)

17005207

In [7]:
words[9000:9040]

['feelings',
 'and',
 'the',
 'auditory',
 'system',
 'of',
 'a',
 'person',
 'without',
 'autism',
 'often',
 'cannot',
 'sense',
 'the',
 'fluctuations',
 'what',
 'seems',
 'to',
 'non',
 'autistic',
 'people',
 'like',
 'a',
 'high',
 'pitched',
 'sing',
 'song',
 'or',
 'flat',
 'robot',
 'like',
 'voice',
 'is',
 'common',
 'in',
 'autistic',
 'children',
 'some',
 'autistic',
 'children']

In [8]:
for w in words[9000:9040]:
    print(w, end=' ')

feelings and the auditory system of a person without autism often cannot sense the fluctuations what seems to non autistic people like a high pitched sing song or flat robot like voice is common in autistic children some autistic children 

In [9]:
from collections import Counter

In [10]:
my_list = ["one", "two", "two"]

In [15]:
Counter(my_list).most_common(2)

[('two', 2), ('one', 1)]

In [16]:
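def create_counts(vocab_size=50000):
    # Keep the `vocab_size` most frequent words from the global `words` list
    vocab = Counter(words).most_common(vocab_size)
    vocab = np.array([word for word, _ in vocab])
    # Map each word to its frequency rank (0 = most frequent)
    dictionary = {word: code for code, word in enumerate(vocab)}
    # Encode the corpus; out-of-vocabulary words fall back to code 0,
    # which is also the code of the most frequent word
    data = np.array([dictionary.get(word, 0) for word in words])
    return data, vocab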
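
Note that in the function above, words outside the vocabulary share code 0 with the most frequent word. A possible alternative, sketched here on a toy corpus, reserves code 0 for an explicit `UNK` token. The helper name `encode_with_unk` and the corpus are assumptions for illustration, and adopting this would shift the codes shown in the outputs below.

```python
from collections import Counter

def encode_with_unk(tokens, vocab_size):
    """Encode tokens as integer ids, reserving id 0 for unknown words."""
    # One vocabulary slot is taken by the UNK token.
    most_common = Counter(tokens).most_common(vocab_size - 1)
    vocab = ["UNK"] + [word for word, _ in most_common]
    word_to_id = {word: i for i, word in enumerate(vocab)}
    ids = [word_to_id.get(word, 0) for word in tokens]
    return ids, vocab

tokens = "the dog chews the dog the bone".split()
ids, vocab = encode_with_unk(tokens, vocab_size=3)
print(vocab)
print(ids)
```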

In [17]:
data, vocabulary = create_counts()

In [20]:
data.shape

(17005207,)

In [22]:
vocabulary.shape

(50000,)

In [23]:
words[100]

'interpretations'

In [24]:
data[100]

4192

In [None]:
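# Position in `data` where the next batch starts; updated by generate_batch
data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
    """Return one skip-gram batch: center words in `batch`, one context word each in `labels`."""
    global data_index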
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)  # pylint: disable=redefined-builtin
    if data_index + span > len(data):
        data_index = 0
    buffer.extend(data[data_index:data_index + span])
    data_index += span
    for i in range(batch_size // num_skips):
        context_words = [w for w in range(span) if w != skip_window]
        words_to_use = random.sample(context_words, num_skips)
        for j, context_word in enumerate(words_to_use):
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[context_word]
        if data_index == len(data):
            buffer.extend(data[0:span])
            data_index = span
        else:
            buffer.append(data[data_index])
            data_index += 1
    # Backtrack a little bit to avoid skipping words at the end of a batch
    data_index = (data_index + len(data) - span) % len(data)
    return batch, labels
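
The sliding-window mechanics inside `generate_batch` rely on a `collections.deque` with a fixed `maxlen`: appending a new element pushes the oldest one out. The sketch below isolates that behavior on a short list of made-up word codes.

```python
import collections

# Hypothetical word codes standing in for `data`.
toy_data = [10, 11, 12, 13, 14, 15]
skip_window = 1
span = 2 * skip_window + 1  # [ skip_window target skip_window ]

# Fill the buffer with the first window.
buffer = collections.deque(toy_data[:span], maxlen=span)
windows = [list(buffer)]

# Each append drops the oldest element, sliding the window by one word.
for token in toy_data[span:]:
    buffer.append(token)
    windows.append(list(buffer))

# The center word of each window sits at index `skip_window`.
centers = [w[skip_window] for w in windows]
print(windows)
print(centers)
```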