# word2vec

created on 3/2/15; 3/13/15; 6/3/15; updated on 11/25/15 (reviewing word representations for ebay; last week at clarapath; bay is in taiwan)

script trains a word2vec model on a corpus (file is train.tsv). 

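in word2vec, each word is represented as a dense, high dimensional vector (100 dimensions in this notebook). the individual dimensions are learned features and usually have no clean interpretation on their own, so treat the vector as a whole as the word's "concept" profile. to compare words, simply compare their corresponding vectors.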

similar words will have similar vectors, as determined by cosine similarity. similar words may then form concepts (unlabeled, of course). for example, eggs and chorizo may have a high similarity score, and may form the concept "breakfast". 
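to make the comparison concrete, here is a minimal sketch of cosine similarity in plain python. the three-dimensional vectors are invented for illustration (real word2vec vectors in this notebook have 100 dimensions), and `cosine` is a hypothetical helper, not a gensim function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only.
eggs = [0.9, 0.1, 0.3]
chorizo = [0.8, 0.2, 0.4]
laptop = [-0.2, 0.9, -0.5]

print(cosine(eggs, chorizo))  # close to 1: the vectors point the same way
print(cosine(eggs, laptop))   # negative: the vectors point in different directions
```

only the angle between the vectors matters here, so scaling a vector by a positive constant leaves its score unchanged.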

in contrast, latent dirichlet allocation (lda) generates concepts (usually called topics), where each topic is represented by a weighted list of words. for example, the topic "sports" may consist of 0.1 * baseball + 0.2 * fans + 0.1 * yankees... 

### Step 1: load and clean training data

In [1]:
import pandas as pd
import json

In [2]:
data = pd.read_csv('train.tsv', sep = '\t')

In [3]:
# need to strip body text from boilerplate
# this body will be used to create word vectors
data['body'] = data['boilerplate'].map(lambda x: json.loads(x)['body'])

In [4]:
data['body'].head()

0    A sign stands outside the International Busine...
1    And that can be carried on a plane without the...
2    Apples The most popular source of antioxidants...
3    There was a period in my life when I had a lot...
4    Jersey sales is a curious business Whether you...
Name: body, dtype: object

In [5]:
# eliminate NAs
# each document is tokenized
# 1 collection of bloomberg articles => list of 1053 articles => each article consisting of a list of tokens


text = data['body'].dropna().map(lambda x: x.split())
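note that `x.split()` only splits on whitespace, so case and punctuation stay attached to the tokens (which is why tokens like 'GREAT' and 'Somebody' show up later). a possible alternative, not what this notebook ran, is a small regex tokenizer. `tokenize` below is a hypothetical helper.

```python
import re

# Runs of word characters (letters, digits, underscore) and apostrophes.
TOKEN_RE = re.compile(r"[\w']+")

def tokenize(text):
    """Lowercase the text, then keep runs of word characters and apostrophes."""
    return TOKEN_RE.findall(text.lower())

sample = "Jersey sales is a curious business."
print(sample.split())    # whitespace split: keeps case and the trailing period
print(tokenize(sample))  # lowercased, punctuation dropped
```

whether lowercasing helps depends on the corpus; it merges 'Paris' and 'paris' but also loses the distinction between names and common nouns.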

### Step 2: Train word2vec model

In [8]:
import gensim

In [9]:
# load pre-trained word2vec object

model = gensim.models.Word2Vec.load('w2v_classifier.pkl') 

In [19]:
# this step is not needed if the pre-trained word2vec object was loaded above

# model = gensim.models.Word2Vec(text, size=100, window=5, min_count=5, workers=4)

# min_count: words that occur fewer than this many times in the corpus are ignored
# size: dimensionality of each feature vector, usually between 50 and 300
#       (gensim 4.0+ renamed this parameter to vector_size)
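
to see what the `min_count` cut-off does, here is a minimal sketch that keeps only tokens above a frequency threshold. `build_vocab` and the toy corpus are made up for illustration; gensim's real vocabulary building does more than this (for example subsampling of frequent words).

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only tokens that occur at least `min_count` times in the corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}

# Toy corpus in the same shape as `text`: a list of token lists.
toy_corpus = [
    ["eggs", "and", "chorizo"],
    ["eggs", "and", "pancakes"],
    ["eggs", "benedict"],
]
vocab = build_vocab(toy_corpus, min_count=2)
print(sorted(vocab))  # only 'and' and 'eggs' occur at least twice
```

words below the threshold never get a vector, which is why looking up a rare word in a trained model raises a KeyError.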

### Step 3: Examine feature vectors for each word

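word2vec trains a shallow neural network to "predict" words from their neighbors: the skip-gram variant predicts the surrounding context words from an input word, and the cbow variant predicts a word from its surrounding context. the sentences that are fed to word2vec supply their own labels, so no manual labeling is needed. this is usually described as self-supervised learning rather than purely unsupervised learning.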

in other words, you've trained a neural network to predict which words occur near each other, which are really just word associations.
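
a minimal sketch of how (input word, neighboring word) training pairs might be built, using the "college student is in library on laptop with coffee" sentence from the quora answer quoted later in this section. the stop-word list, the helper name `context_pairs`, and the window size of 1 are assumptions for illustration; the sampling inside word2vec itself is more involved.

```python
# Tiny stop-word list, just enough for this one sentence.
STOP_WORDS = {"is", "in", "on", "with", "the", "a"}

def context_pairs(tokens, window=1):
    """Return (input_word, context_word) pairs within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "College student is in library on laptop with coffee"
tokens = [t.lower() for t in sentence.split() if t.lower() not in STOP_WORDS]
pairs = context_pairs(tokens, window=1)
print(tokens)
print(pairs)
```

with a window of 1, "student" is paired with "college" (previous word) and "library" (next word), matching the description in the quote.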

you can interpret the weights connecting each input word to the hidden layer as that word's feature vector!
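
one way to see why: if the input word is encoded as a one-hot vector, multiplying it by the input-to-hidden weight matrix just selects one row of that matrix. the sketch below uses an invented 3-word vocabulary and a 2-unit hidden layer; `W`, `one_hot`, and `hidden_layer` are illustrative names, not gensim internals.

```python
# Toy vocabulary and input-to-hidden weights (values invented for illustration).
vocab = ["london", "paris", "steak"]
W = [
    [0.2, -0.1],   # row for "london"
    [0.3, -0.2],   # row for "paris"
    [-0.4, 0.5],   # row for "steak"
]

def one_hot(word):
    """1.0 at the word's vocabulary position, 0.0 everywhere else."""
    return [1.0 if w == word else 0.0 for w in vocab]

def hidden_layer(word):
    """Multiply the one-hot input by W; the result is the word's row of W."""
    x = one_hot(word)
    n_hidden = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(vocab))) for j in range(n_hidden)]

print(hidden_layer("paris"))  # same values as W[1]
```

so "looking up a word vector" and "reading the hidden layer activations for that word" are the same operation.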

intuitive description of feature vectors: http://www.quora.com/How-does-word2vec-work

"One sentence you may have in your dataset is "College student is in library on laptop with coffee". You could turn this into a training example, where (ignoring "stop words" like "the", "in", "a", etc.) the input word is "student", the previous word is "college", and the next word is "library".  The task of the neural network is to see many examples of input words and corresponding "next" and "previous" words from the sentences in your dataset, and adjust its parameters (the "weights" of the connections between units) by gradient descent, until it's able to predict the next word really well."

In [10]:
model['london'].shape # numpy array of feature weights

# individual weight values are not meaningful on their own, since they are really learned NN weights; only comparisons between vectors are informative

(100,)

In [11]:
# each word is a vector in 100 dimensional space where each dimension is a learned feature
model['cover']

array([-0.01798411, -0.22022806,  0.03624697, -0.05326099,  0.087598  ,
       -0.19318575,  0.19579409,  0.38363433, -0.04080451,  0.00303119,
        0.34474367,  0.20686239,  0.27119562,  0.3131083 , -0.20390284,
        0.1150891 ,  0.0292746 ,  0.32440165,  0.03998555,  0.05431781,
        0.23208222, -0.00732663,  0.21856338, -0.04406519,  0.18706802,
        0.14253987, -0.22453867,  0.12971422,  0.13067588,  0.15929456,
        0.00923921, -0.10648965, -0.408389  ,  0.04919467,  0.05963369,
       -0.24016431, -0.27650422, -0.15956043,  0.23993519,  0.00789068,
        0.07275969,  0.00070268,  0.2326823 ,  0.05196571, -0.04761857,
       -0.10696487, -0.16221803, -0.03003974, -0.00444298, -0.00513762,
       -0.34660926,  0.0951684 ,  0.12023665,  0.61390495,  0.04786177,
        0.02856054, -0.06897122, -0.01904007,  0.02815092,  0.04311292,
       -0.06407554, -0.13684776, -0.1619869 ,  0.37621379, -0.20205018,
        0.11821651,  0.11658648,  0.50736421,  0.04840638, -0.39

you can compare similarity between two words by calculating cosine similarity between their word2vec vectors!

In [22]:
a = model['london']
b = model['paris']
c = model['steak']

In [23]:
from sklearn.metrics.pairwise import cosine_similarity

In [24]:
cosine_similarity(a.reshape(1, -1), b.reshape(1, -1)) # newer scikit-learn versions require 2-d inputs

array([[ 0.63628358]], dtype=float32)

In [25]:
model.similarity('london', 'paris') # cosine similarity scores match!

0.63628360408711426

In [26]:
cosine_similarity(a.reshape(1, -1), c.reshape(1, -1)) # newer scikit-learn versions require 2-d inputs

array([[-0.23756182]], dtype=float32)

In [27]:
model.similarity('london', 'steak') # cosine similarity scores match!

-0.23756182666546211

### Step 4: Generate "shared concepts"!

concepts may be derived from high cosine similarity scores between two word vectors. 

for example, the words most similar to "london" may include "accessories", "capital", and "city". (this means these words tend to appear in contexts similar to those of "london".) meanwhile, the words most similar to "paris" may include "restaurant", "fashion", and "accessories". "concepts" are words with a high degree of cosine similarity with both "london" and "paris". 

the concept "accessories" is created because "accessories" has a high cosine similarity score with both "london" and "paris".
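
a minimal sketch of this idea: score every other word by the weaker of its two similarities, so a word only ranks high if it is close to both query words. the 2-d vectors are invented, and `shared_concepts` is a hypothetical helper. this is not how `most_similar(positive=[...])` scores words; as far as i understand, gensim compares candidates against the mean of the query vectors instead.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 2-d vectors, invented for illustration only.
vectors = {
    "london": [1.0, 0.2],
    "paris": [0.8, 0.6],
    "accessories": [0.9, 0.4],
    "kitchen": [-0.5, 1.0],
}

def shared_concepts(word_a, word_b, vectors):
    """Rank the other words by their weaker similarity to the two query words."""
    scores = []
    for word, vec in vectors.items():
        if word in (word_a, word_b):
            continue
        score = min(cosine(vec, vectors[word_a]), cosine(vec, vectors[word_b]))
        scores.append((word, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranked = shared_concepts("london", "paris", vectors)
print(ranked)  # "accessories" ranks first in this toy data
```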

reference: http://radimrehurek.com/gensim/models/word2vec.html

##### Similarity Method

In [28]:
model.most_similar(positive=['eggs', 'pancakes']) # scores are cosine similarities

[(u'scrambled', 0.6806254982948303),
 (u'buttermilk', 0.6671367287635803),
 (u'whisked', 0.643000602722168),
 (u'peaches', 0.6395407915115356),
 (u'lemonade', 0.6366180181503296),
 (u'pureed', 0.631975531578064),
 (u'poached', 0.6306062936782837),
 (u'pur\xe9e', 0.626115083694458),
 (u'boiled', 0.6236461400985718),
 (u'shortcakes', 0.6233633160591125)]

In [29]:
# accessories feature vector is similar to both lond

print model.similarity('london', 'accessories'), model.similarity('london', 'paris')

0.383001988603 0.636283604087


In [30]:
print model.similarity('london', 'styles'), model.similarity('london', 'styles')

0.313966917715 0.313966917715


##### Doesn't Match Method

In [31]:
# given a list of words, which word doesn't match the others?

In [32]:
model.doesnt_match('london kitchen paris'.split())

'kitchen'

In [33]:
print model.similarity('kitchen', 'london'), model.similarity('kitchen', 'paris') # low cosine scores

0.00212391896327 0.0494093253568
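
one plausible way to pick the odd word out, in the spirit of `doesnt_match`: normalize each vector, average them, and return the word least aligned with that average. this is a sketch under my own assumptions about the approach, with invented 2-d vectors; `unit` and `odd_one_out` are hypothetical helpers.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(units[words[0]])
    mean = [sum(units[w][i] for w in words) / len(words) for i in range(dim)]
    # Dot product with the mean vector; the lowest score marks the outlier.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Toy vectors, invented for illustration only.
toy = {"london": [1.0, 0.2], "paris": [0.8, 0.6], "kitchen": [-0.5, 1.0]}
print(odd_one_out(["london", "kitchen", "paris"], toy))
```

with only three words the outlier itself pulls the mean toward it, so results on very short lists should be read with some caution.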


##### Most Similar Method

for each word, find words that are most similar

In [34]:
model.most_similar('hello') 

[(u'uphold', 0.6465566158294678),
 (u'Somebody', 0.6247789859771729),
 (u'archived', 0.6211997866630554),
 (u'silly', 0.6135072112083435),
 (u'saying', 0.6095913648605347),
 (u'yeah', 0.6038674116134644),
 (u'dizzy', 0.6024335622787476),
 (u'fucking', 0.5981181263923645),
 (u'GREAT', 0.5940213799476624),
 (u'Mayer', 0.5923054218292236)]

In [36]:
model.most_similar('awful') # neighbors lean negative: a rough sentiment signal, though not a full sentiment analysis

[(u'desperate', 0.7431875467300415),
 (u'inherited', 0.7413910627365112),
 (u'unhappy', 0.736996591091156),
 (u'considering', 0.7366149425506592),
 (u'downside', 0.7344501614570618),
 (u'calling', 0.7344481348991394),
 (u'realization', 0.7288010716438293),
 (u'Fortunately', 0.7211899757385254),
 (u'undoubtedly', 0.720962643623352),
 (u'odd', 0.7180483341217041)]

### Deploy w2v classifier

In [37]:
from yhat import Yhat, YhatModel

class model_yhat(YhatModel):
    
    def execute(self, data):
        '''
        all yhat class objects must contain an execute function.
        returned values must be in the form of a dictionary.

        http://help.yhathq.com/v1.0/docs/testing-your-model
        
        '''
        
        prediction = model.most_similar(data)
        
        # predictions must be returned in dict
        return {'output': prediction}
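
before deploying, the `execute` contract (take the input, return a dict) can be checked without yhat at all. the sketch below is a hypothetical stand-in: `LocalModel` and the hard-coded neighbor list are invented, and it only checks that the returned dict can be serialized to json, which i assume a hosted api needs.

```python
import json

class LocalModel(object):
    """Hypothetical stand-in that mimics the execute() contract without yhat."""

    def __init__(self, neighbors):
        # dict: word -> list of (neighbor, score) tuples, like most_similar output
        self.neighbors = neighbors

    def execute(self, data):
        prediction = self.neighbors.get(data, [])
        # predictions must be returned in a dict
        return {'output': prediction}

# Fake neighbor data, invented for illustration only.
fake_neighbors = {"hello": [("yeah", 0.60), ("saying", 0.61)]}
result = LocalModel(fake_neighbors).execute("hello")
payload = json.dumps(result)  # tuples are serialized as json arrays
print(payload)
```

scores that come back as numpy float types may need converting to plain floats before `json.dumps` accepts them.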

In [38]:
# local test

print model_yhat().execute("hello")

{'output': [(u'uphold', 0.6465566158294678), (u'Somebody', 0.6247789859771729), (u'archived', 0.6211997866630554), (u'silly', 0.6135072112083435), (u'saying', 0.6095913648605347), (u'yeah', 0.6038674116134644), (u'dizzy', 0.6024335622787476), (u'fucking', 0.5981181263923645), (u'GREAT', 0.5940213799476624), (u'Mayer', 0.5923054218292236)]}


In [39]:
# configure api call

yh = Yhat("YOUR_USERNAME", "YOUR_APIKEY", "http://cloud.yhathq.com/") # credentials redacted; supply your own

In [40]:
yh.deploy("w2v_classifier", model_yhat, globals()) # you deploy the class not the object

Are you sure you want to deploy? (y/N): y




extracting model


Transfering Model: |##                          |  9% ETA:  00:35:47  17.64 K/s

KeyboardInterrupt: 

In [32]:
query = 'paris'

In [33]:
# query yhat-hosted classifier
# the model name must match the name passed to yh.deploy ("w2v_classifier")

prediction = yh.predict("w2v_classifier", query)

Exception: 
        Could not unpack response values.
        Please visit "http://cloud.yhathq.com"
        to make sure your model is online and not still building.

# Latent Dirichlet Allocation

let's explore latent dirichlet allocation (lda). in contrast to word2vec, it represents each concept (topic) as a probability distribution over the words in the vocabulary.

lda is often used to identify latent "topics", which makes it useful for clustering analysis. for example, hulu uses lda to categorize movies.
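
a minimal sketch of the representation, with invented numbers: each topic is a distribution over words, and a document is a mixture of topics. the sketch only illustrates how the pieces combine; it does not fit anything, and `top_words` and `word_probability` are hypothetical helpers.

```python
# Toy topics: each is a probability distribution over words (values invented).
topics = {
    "sports": {"baseball": 0.5, "fans": 0.3, "yankees": 0.2},
    "breakfast": {"eggs": 0.6, "chorizo": 0.4},
}

def top_words(topic, n=2):
    """Return the n highest-probability (word, probability) pairs of a topic."""
    return sorted(topic.items(), key=lambda kv: kv[1], reverse=True)[:n]

def word_probability(word, doc_topic_mix):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(weight * topics[name].get(word, 0.0)
               for name, weight in doc_topic_mix.items())

# A document that is mostly about sports, with a little breakfast.
doc_mix = {"sports": 0.75, "breakfast": 0.25}
print(top_words(topics["sports"]))
print(word_probability("eggs", doc_mix))
```

the topic printouts at the end of this notebook are the same thing as `top_words`, applied to topics learned from the corpus.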

In [42]:
from sklearn.feature_extraction.text import CountVectorizer


In [43]:
# instantiate countvectorizer object

count_vec = CountVectorizer(binary=False, stop_words='english', min_df=3)

In [54]:
# convert each document into count vector matrix

docs = count_vec.fit_transform(data['body'].dropna())

In [None]:
# t

id2word = dict(enumerate(count_vec.get_feature_names()))
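
to show what the count matrix and `id2word` mapping look like, here is a pure-python sketch of a bag-of-words count matrix with a document-frequency cut-off. it is only an approximation of `CountVectorizer`: the real class tokenizes with a regex and removes stop words, and `count_matrix` is a hypothetical helper.

```python
from collections import Counter

def count_matrix(docs, min_df=1):
    """Return (vocabulary, rows) where rows[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(t for t, df in doc_freq.items() if df >= min_df)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[t] for t in vocabulary])
    return vocabulary, rows

toy_docs = ["eggs and chorizo", "eggs and eggs", "steak and fries"]
vocabulary, rows = count_matrix(toy_docs, min_df=2)
toy_id2word = dict(enumerate(vocabulary))  # column index -> term

print(vocabulary)
print(rows)
print(toy_id2word)
```

each row is one document and each column is one term, which is the documents-as-rows layout that `documents_columns = False` describes.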

In [60]:

lda = gensim.models.ldamodel.LdaModel(corpus=gensim.matutils.Sparse2Corpus(docs, documents_columns = False), id2word=id2word, num_topics=15)



In [61]:
n_topics = 15 # matches num_topics used to train the lda model above
n_words_per_topic = 5
for ti, topic in enumerate(lda.show_topics(num_topics = n_topics, num_words = n_words_per_topic)):
    print "Topic: %d - %s" % (ti, topic)
    print

Topic: 0 - 0.007*world + 0.005*best + 0.004*new + 0.004*year + 0.003*sports

Topic: 1 - 0.011*com + 0.009*flashvars + 0.008*http + 0.005*video + 0.005*www

Topic: 2 - 0.010*cancer + 0.007*cup + 0.006*funny + 0.006*95 + 0.005*world

Topic: 3 - 0.005*function + 0.004*food + 0.004*div + 0.004*like + 0.003*make

Topic: 4 - 0.009*just + 0.008*like + 0.007*time + 0.006*chocolate + 0.005*make

Topic: 5 - 0.009*swimsuit + 0.008*news + 0.008*si + 0.006*models + 0.005*photo

Topic: 6 - 0.017*raw + 0.008*dress + 0.006*food + 0.005*like + 0.005*la

Topic: 7 - 0.019*2009 + 0.018*2010 + 0.012*2008 + 0.012*10 + 0.011*2011

Topic: 8 - 0.005*new + 0.005*said + 0.004*like + 0.004*just + 0.004*time

Topic: 9 - 0.012*cup + 0.009*recipe + 0.008*add + 0.008*minutes + 0.007*butter

Topic: 10 - 0.012*food + 0.006*chocolate + 0.006*just + 0.005*make + 0.005*recipes

Topic: 11 - 0.013*image + 0.013*images + 0.013*future + 0.012*small + 0.012*link

Topic: 12 - 0.009*pie + 0.008*chocolate + 0.006*butter + 0.006*c