Pair Problem  

This should be fun. Try to implement SkipGrams in sklearn using the sklearn.neural_network.MLPClassifier class.

Remember:  

The input to SkipGrams is a one-hot encoded vector for the word under examination in a context window.
The output is the set of one-hot vectors for the remaining words in the window.

Effectively:

You can treat this as a multiclass, multilabel problem. That is, each training observation (a context window) will have a 1-hot encoded vector as input (representing the middle word in the window) and 4 integers as output (representing the "classes" for that observation, with the classes here being the indexes of the context words in your vocabulary).  
With this formulation, you should be able to make it work with the MLPClassifier, although start with very small amounts of data.
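
One way to hand 4 context labels per window to MLPClassifier is to encode them as a binary indicator matrix with one column per vocabulary word. The block below is a minimal sketch on made-up data: the names `V`, `centers`, `contexts`, `X_toy`, `Y_toy` and `mlp_toy` are assumptions for illustration, not part of the exercise. Note that with indicator targets sklearn fits one sigmoid output per label rather than a single softmax, so this only approximates the original SkipGram objective.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MultiLabelBinarizer

V = 6  # toy vocabulary size (assumption for illustration)

# One entry per context window: the center word index and its 4 context word indexes
centers = [2, 3, 2, 4]
contexts = [[0, 1, 3, 4], [1, 2, 4, 5], [0, 1, 3, 5], [2, 3, 5, 0]]

# One-hot inputs: row i has a single 1 in column centers[i]
X_toy = np.eye(V)[centers]

# Multilabel targets: binary indicator matrix of shape (n_windows, V)
mlb = MultiLabelBinarizer(classes=list(range(V)))
Y_toy = mlb.fit_transform(contexts)

# Tiny hidden layer, just to show that the fit accepts this target format
mlp_toy = MLPClassifier(hidden_layer_sizes=(3,), max_iter=300, random_state=0)
mlp_toy.fit(X_toy, Y_toy)

print(Y_toy[0])                   # indicator row for the first window
print(mlp_toy.coefs_[0].shape)    # input-to-hidden weights, one row per word
```
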

Data:  
 - Use the 20 newsgroups data from sklearn but only take a small sample of the data. word2vec is not quick.

Suggested Steps:

 - Set a Window size (the number of words before and after the target word included in the context window, I suggest 2 for now)

 - Use a CountVectorizer to establish the term indexes in your vocabulary

 - For each context window, use the CountVectorizer to get the appropriate 1-hot vector as input and 4 integers (labels) as output (representing the 4 context words for a window size of 2)

 - Plug in your observations (X=1-hot vector, and y=vector of 4 integers) and fit an MLPClassifier and see how it does!

 - Once the model is fit, use the coefs_ attribute of your MLPClassifier (coefs_[0] is the input-to-hidden weights matrix W) to map the word vectors into the new word2vec space. Remember, the transformation is W'x, where W' is the transpose of W and x is the 1-hot vector for a word.

 - Compute some cosine similarities between a few terms using the new W'x vectors. Any luck at all?
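
The last two steps can be sketched with plain NumPy. This is only an illustration under stated assumptions: `W` is a random stand-in for the fitted weights matrix, and `one_hot`, `embed` and `cosine` are hypothetical helper names. It shows that W'x with a 1-hot x simply selects one row of W, and how a cosine similarity between two such vectors is computed.

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 5, 3                  # toy vocabulary and hidden sizes (assumptions)
W = rng.normal(size=(V, dim))  # random stand-in for the fitted weights matrix

def one_hot(index, size):
    """Return a vector of zeros with a single 1 at position `index`."""
    x = np.zeros(size)
    x[index] = 1.0
    return x

def embed(index):
    """Compute W'x for the one-hot x of word `index`."""
    return W.T @ one_hot(index, V)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# W'x with a one-hot x is just row `index` of W
print(np.allclose(embed(2), W[2]))
print(cosine(embed(0), embed(3)))
```

Because W'x is a row lookup, in practice the word vector for index i can be read directly as `W[i]` without building the 1-hot vector.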

Feel free to collaborate extensively on this one. I don't want it to take too too long but I understand it is involved.

May the best vectors win!

In [14]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

In [15]:
# Set a Window size (the number of words before and after the target word included 
# in the context window, I suggest 2 for now)

# Set Network Parameters
window_size = 2
dim = 100 #dimensions of hidden layer

In [16]:
# Retrieve Data
ng = fetch_20newsgroups().data[0:50]

In [17]:
# Use a CountVectorizer to establish the term indexes in your vocabulary
cv = CountVectorizer()
ng_vecs = cv.fit_transform(ng)

# Store those indices here
vocab = cv.vocabulary_ #word to index

# And the reverse mapping
id2word = {v:k for (k,v) in vocab.items()} #reversed mapping of index to word (reverse of vocab)

# The total unique words, aka size of vocabulary
V = len(vocab) #Output layer size


In [18]:
# Use CountVectorizer to turn our list of documents into a Series of lists of terms
tokenizer = cv.build_tokenizer()
tokenized_docs = pd.Series(ng).map(tokenizer).map(lambda x: [a.lower() for a in x])

In [19]:
ng[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [20]:
tokenized_docs[0][0:10]

['from',
 'lerxst',
 'wam',
 'umd',
 'edu',
 'where',
 'my',
 'thing',
 'subject',
 'what']

In [21]:
# Generate the X data matrix and y vector for MLP
# X: A matrix of one-hot encoded vectors (dimension V) for each center word over all context windows (size 2+2+1=5)
# y: A matrix over all context windows where the outputs are the 4 "labels", aka the indices of the 4 "other" context words
X = []
y = []
# Step thru tokenized document list
for doc in tokenized_docs:
    # For each document, step thru the words
    for index, word in enumerate(doc): 
        # Skip if at the edge of a document (can handle differently)
        if index < window_size or index > (len(doc) - window_size - 1):
            continue
        # Retrieve the one-hot V-dimensional input vector and add it to X
        one_hot_input = [0]*V 
        one_hot_input[vocab[word]] = 1
        # HACK: Add one (center, context) pair per context word. MLPClassifier only treats y as
        # multilabel when it is a binary indicator matrix, not 4 integer labels per row
        X.append(one_hot_input)
        X.append(one_hot_input)
        X.append(one_hot_input)
        X.append(one_hot_input)
        # Retrieve the 4 outputs for the context window and add them to y
        # Same HACK here
        context1 = doc[index-2]
        y.append(vocab[context1])
        context2 = doc[index-1]
        y.append(vocab[context2])
        context3 = doc[index+1]
        y.append(vocab[context3])
        context4 = doc[index+2]
        y.append(vocab[context4])
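
The loop above skips words near the edge of a document. As the comment says, this can be handled differently; the sketch below keeps edge words and simply truncates their windows. `skipgram_pairs` is a hypothetical helper, not part of sklearn, and it yields word pairs that would still need to be mapped to indexes through `vocab`.

```python
def skipgram_pairs(tokens, window_size=2):
    """Yield (center, context) pairs, truncating windows at document edges."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                yield center, tokens[j]

# Tiny example with a window of one word on each side
pairs = list(skipgram_pairs(["what", "car", "is", "this"], window_size=1))
print(pairs)
```
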


In [22]:
# Convert to Numpy arrays
X = np.array(X)
y = np.array(y)

In [23]:
X.shape

(50112, 3612)
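
A dense array of this shape holds about 181 million entries, almost all of them zero. Since each row has exactly one non-zero value, a sparse matrix is a natural fit, and MLPClassifier.fit accepts sparse input. The block below is a sketch on toy indexes; `sparse_one_hot` is a hypothetical helper name.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_one_hot(indices, vocab_size):
    """Build a sparse (n_rows, vocab_size) matrix with a single 1 per row."""
    indices = np.asarray(indices)
    n_rows = len(indices)
    data = np.ones(n_rows, dtype=np.float64)
    rows = np.arange(n_rows)
    return csr_matrix((data, (rows, indices)), shape=(n_rows, vocab_size))

# Four observations whose center words have vocabulary indexes 2, 0, 2, 1
X_sparse = sparse_one_hot([2, 0, 2, 1], vocab_size=4)
print(X_sparse.shape, X_sparse.nnz)
```

With this approach the loop only needs to collect the center word index for each observation, instead of a full V-length list.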

In [24]:
### CHECK THIS: fitting takes too long to run (X is a dense 50112 x 3612 array and max_iter defaults to 200 epochs)

# Fit the MLP Classifier
mlp = MLPClassifier(hidden_layer_sizes=(dim,))
mlp.fit(X, y)



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)
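
If the fit is too slow to iterate on, one option is to try the settings on a much smaller problem first. The sketch below uses synthetic (center, context) index pairs, a linear hidden layer (`activation="identity"`, which is closer to word2vec than the default ReLU) and a low `max_iter`. All names and values here (`V_toy`, `dim_toy`, `mlp_small`, the toy pairs) are assumptions for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

V_toy, dim_toy = 8, 4  # toy vocabulary and hidden sizes (assumptions)
rng = np.random.default_rng(0)

# Synthetic pairs: each context index is 1 or 2 positions after its center index
centers = rng.integers(0, V_toy, size=200)
contexts = (centers + rng.integers(1, 3, size=200)) % V_toy
X_toy = np.eye(V_toy)[centers]  # one-hot rows for the center words

mlp_small = MLPClassifier(
    hidden_layer_sizes=(dim_toy,),
    activation="identity",  # linear hidden layer
    max_iter=20,            # cap the number of epochs
    batch_size=32,
    random_state=0,
)
mlp_small.fit(X_toy, contexts)

word_vecs_toy = mlp_small.coefs_[0]  # one row per vocabulary word
print(word_vecs_toy.shape, mlp_small.n_iter_)
```

sklearn may emit a ConvergenceWarning with so few epochs; that is expected here, since the goal is only a quick end-to-end check.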

In [None]:
# Here are the word vectors!! Row i of coefs_[0] (shape V x dim) is the vector for word index i
word_vecs = mlp.coefs_[0]
word_vecs

In [None]:
# Build a brute-force nearest-neighbors index over the word vectors using cosine distance
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(metric='cosine', algorithm='brute')
nn.fit(word_vecs)

In [None]:
# Look at the closest words to a query using a K-Nearest Neighbors search with cosine
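def get_similar(query, n=10):
    # .get avoids a KeyError for unknown words; compare with None because index 0 is valid
    query_index = vocab.get(query)
    if query_index is None:
        return "Query not in the dataset!"
    # kneighbors expects a 2D array, so pass a single-row slice
    dist, index = nn.kneighbors(word_vecs[query_index:query_index + 1, :], n_neighbors=n)
    return [(id2word[i], d) for (d, i) in zip(dist[0], index[0])]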

In [None]:
# Try it out!
get_similar("bat", 20)