## Extracting word paths (sequences of embeddings)

First, download the data from the [Yelp recruiting contest](https://www.kaggle.com/c/yelp-recruiting) on [Kaggle](https://www.kaggle.com/) into the same directory as this note:
* https://www.kaggle.com/c/yelp-recruiting/download/yelp_training_set.zip
* https://www.kaggle.com/c/yelp-recruiting/download/yelp_test_set.zip

You'll need to sign up for a Kaggle account.

Then we define a very simple parser and a generator that yields sentences from the reviews.

In [14]:
import re
# apostrophes (plus any trailing s), hyphens and double quotes
contractions = re.compile(r"'s*|-|\"")
# all non alphanumeric
symbols = re.compile(r'(\W+)', re.U)
# separators (any whitespace)
seps = re.compile(r'\s+')
# some stops to remove
stops = re.compile(r'(\s[,:\)\(]\s)')
# for sentence splitter
alteos = re.compile(r'([!\?])')


# cleaner (order matters)
def clean(text): 
    text = text.lower()
    text = contractions.sub('', text)
    text = symbols.sub(r' \1 ', text)
    text = stops.sub(' ', text)
    text = seps.sub(' ', text)
    text = alteos.sub(r' \1 .', text)
    return text

from zipfile import ZipFile
import json

def YelpSentences(label, stars=(1,2,3,4,5)):
    with ZipFile("yelp_%s_set.zip"%label, 'r') as zf:
        with zf.open("yelp_%s_set/yelp_%s_set_review.json"%(label,label)) as f:
            for line in f:
                rev = json.loads(line)
                if rev['stars'] in stars:
                    text = clean(rev['text'])
                    for s in text.split("."):
                        yield s.split()
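
To make the order of the substitutions concrete, here is a minimal sketch that runs the same patterns and `clean` function on one made-up review (the sample text is an assumption, not taken from the Yelp data). It should show that apostrophes are dropped ("we've" becomes "weve"), that "!" survives as its own token, and that a trailing full stop leaves an empty token list behind.

```python
import re

# Same patterns as in the cell above.
contractions = re.compile(r"'s*|-|\"")
symbols = re.compile(r'(\W+)', re.U)
seps = re.compile(r'\s+')
stops = re.compile(r'(\s[,:\)\(]\s)')
alteos = re.compile(r'([!\?])')


def clean(text):
    text = text.lower()
    text = contractions.sub('', text)   # drop apostrophes, hyphens, quotes
    text = symbols.sub(r' \1 ', text)   # pad every non-word run with spaces
    text = stops.sub(' ', text)         # remove isolated , : ( )
    text = seps.sub(' ', text)          # collapse whitespace
    text = alteos.sub(r' \1 .', text)   # make ! and ? end a sentence
    return text


# Hypothetical review text, for illustration only.
sample = "My wife's horchata is handmade, and delicious! We've been twice."

# Same splitting logic as the generator: sentences on ".", tokens on whitespace.
tokenized = [s.split() for s in clean(sample).split(".")]

# The final "." leaves an empty token list; filter it if that matters downstream.
nonempty = [s for s in tokenized if s]

for sent in nonempty:
    print(sent)
```

Whether to keep or drop the empty lists is a design choice; they matter mostly when a sentence is later passed to a function that expects at least one word.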


This is a small dataset, so we bring everything into memory.

In [2]:
sentences = list(YelpSentences("training"))
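
The generator reads the archive one JSON line at a time, and `list(...)` is what pulls everything into memory. As a rough sketch of that pattern without the Kaggle download, the snippet below builds a tiny in-memory zip with the same internal layout and filters it by star rating. The archive name `yelp_demo_set`, the two reviews, and the helper `iter_reviews` are all hypothetical stand-ins.

```python
import io
import json
from zipfile import ZipFile

# Hypothetical reviews, one JSON object per line as in the Yelp files.
reviews = [
    {"stars": 5, "text": "great tacos. friendly staff"},
    {"stars": 1, "text": "cold food. never again"},
]
member = "yelp_demo_set/yelp_demo_set_review.json"  # assumed layout

# Write the fake archive into memory instead of onto disk.
buf = io.BytesIO()
with ZipFile(buf, "w") as zf:
    zf.writestr(member, "\n".join(json.dumps(r) for r in reviews))


def iter_reviews(fileobj, member, stars=(1, 2, 3, 4, 5)):
    """Lazily yield the text of each review whose rating is in `stars`."""
    with ZipFile(fileobj, "r") as zf:
        with zf.open(member) as f:
            for line in f:
                rev = json.loads(line.decode("utf-8"))
                if rev["stars"] in stars:
                    yield rev["text"]


# Nothing is read until the generator is consumed by list().
all_reviews = list(iter_reviews(buf, member))
one_star = list(iter_reviews(buf, member, stars=(1,)))
print(len(all_reviews), one_star)
```

For a corpus that does not fit in memory, the same generator could be wrapped in a re-iterable object rather than converted to a list.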

Use gensim to train word2vec.

In [3]:
from gensim.models import Word2Vec
import logging 
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

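# create a w2v learner (passing the sentences trains it on construction)
# note: gensim >= 4.0 renamed the `iter` argument to `epochs`
w2v = Word2Vec(sentences, workers=8, iter=3)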

Define a function that saves a sentence's vector path to a file. Each row is the embedding of one word, and the sentence reads from the top row down.


In [4]:
import numpy as np
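# Throws a KeyError if any of the words is not in the vocab.
# The "sentences/" output directory must already exist.
# (gensim >= 4.0 needs w2v.wv[words] instead of w2v[words].)
def savepath(words):
    print(words)
    np.savetxt("sentences/"+"_".join(words)+".txt", w2v[words], fmt="%.6f")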
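
As a quick check of the file layout, the sketch below writes a path with the same `np.savetxt` call and reads it back. Random vectors stand in for the trained model, the 100 dimensions are an assumption matching gensim's default vector size, and a temporary directory replaces `sentences/`.

```python
import os
import tempfile

import numpy as np

words = ["the", "horchata", "is", "handmade", "and", "delicious"]
dim = 100  # assumed embedding size

# Stand-in for w2v[words]: one random vector per word.
rng = np.random.RandomState(0)
fake_vectors = rng.normal(size=(len(words), dim))

# Save with the same naming scheme and format as savepath.
outdir = tempfile.mkdtemp()
path_file = os.path.join(outdir, "_".join(words) + ".txt")
np.savetxt(path_file, fake_vectors, fmt="%.6f")

# Load it back: row i is the embedding of words[i].
path = np.loadtxt(path_file)
print(path.shape)
```

Because the values are written with six decimals, the loaded array matches the original only up to that rounding.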

Cherry-pick a few sentences and print them.

In [5]:
savepath(sentences[0])

[u'my', u'wife', u'took', u'me', u'here', u'on', u'my', u'birthday', u'for', u'breakfast', u'and', u'it', u'was', u'excellent']


In [6]:
savepath(sentences[97])

[u'the', u'horchata', u'is', u'handmade', u'and', u'delicious']


In [9]:
badsentences = list(YelpSentences("test", [1]))

In [16]:
savepath(badsentences[100])

[u'we', u'thought', u'this', u'was', u'a', u'little', u'strange', u'since', u'every', u'single', u'other', u'kennel', u'weve', u'ever', u'been', u'to', u'was', u'willing', u'and', u'wanted', u'to', u'give', u'us', u'a', u'tour']


In [20]:
savepath(badsentences[800])

[u'thanks', u'for', u'lying', u'to', u'my', u'face', u'dude']
