
# Lesson 14 - Latent Variables and Natural Language Processing

---

## Guided practice and demos

In [1]:
# Imports
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline

# Config
np.random.seed(1)

In [2]:
# spacy is used for pre-processing and traditional NLP
import spacy
from spacy.en import English

# Gensim is used for LDA and word2vec
from gensim.models.word2vec import Word2Vec

In [3]:
# Import data
df = pd.read_csv('../dataset/stumbleupon.tsv', sep='\t')
df['title'] = df.boilerplate.map(lambda x: json.loads(x).get('title', ''))
df['body'] = df.boilerplate.map(lambda x: json.loads(x).get('body', ''))

## Demo: LDA in gensim

Gensim is a library of language processing tools focused on latent variable models for text. It was originally developed by grad students dissatisfied with the implementations of latent models available at the time. Documentation and tutorials are available on the [package’s website](https://radimrehurek.com/gensim/index.html).


Let’s first translate a set of documents (articles) into a matrix representation with a row per document and a column per feature (word or n-gram).

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

body_text = df.body.dropna()
vectorizer = CountVectorizer(binary=False,
                             stop_words='english',
                             min_df=3)
vectorizer.fit(body_text)
docs = vectorizer.transform(body_text)
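
To make the matrix layout concrete, here is a minimal pure-Python sketch of the same idea on a hypothetical three-document corpus. The `build_count_matrix` helper and the toy documents are assumptions made for illustration; `CountVectorizer` does considerably more (tokenisation rules, stop-word removal, `min_df` filtering, sparse storage).

```python
from collections import Counter


def build_count_matrix(documents):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    # Naive tokenisation: lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in documents]
    # Columns are the sorted set of all words seen in the corpus
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical toy corpus, not taken from the StumbleUpon data
toy_docs = ["chocolate cake recipe", "leg workout", "cake cake workout"]
vocabulary, matrix = build_count_matrix(toy_docs)

print(vocabulary)
for row in matrix:
    print(row)
```

Each row of `matrix` is one document and each column is one vocabulary word, so `matrix[2][0]` is the number of times `cake` appears in the third document.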

In [5]:
# Build a mapping of numerical ID to word
id2word = dict(enumerate(vectorizer.get_feature_names()))
print id2word[5000], id2word[10000], id2word[20000]

cheers flagship recounting


- We want to learn which columns (words) tend to occur together, i.e. are likely to come from the same topic. This gives us each topic's word distribution.
- We can also determine which topics are present in each document. This is the document's topic distribution.
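
The two distributions combine in a simple way: under a topic model, the probability of seeing a word in a document is a topic-weighted mix of each topic's word probabilities. Here is a toy sketch with made-up numbers (the topic names, words and probabilities are all hypothetical, not output from the model fitted in this lesson):

```python
# p(word | topic): one word distribution per topic (hypothetical values)
topic_word = {
    "baking": {"cake": 0.6, "chocolate": 0.3, "workout": 0.1},
    "fitness": {"cake": 0.1, "chocolate": 0.1, "workout": 0.8},
}

# p(topic | document): the topic distribution of a single document (hypothetical)
doc_topic = {"baking": 0.75, "fitness": 0.25}


def word_probability(word, doc_topic, topic_word):
    """p(word | document) = sum over topics of p(topic | document) * p(word | topic)."""
    return sum(weight * topic_word[topic][word]
               for topic, weight in doc_topic.items())


for word in ["cake", "chocolate", "workout"]:
    print(word, word_probability(word, doc_topic, topic_word))
```

Fitting LDA amounts to estimating both tables from the word counts alone; here they are simply written down by hand.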

In [6]:
from gensim.models.ldamodel import LdaModel
from gensim.matutils import Sparse2Corpus

# First we convert our word-matrix into gensim's format
corpus = Sparse2Corpus(docs, documents_columns=False)

# Then we fit an LDA model
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=15)

In this model, we need to explicitly specify the number of topics we want the model to uncover (`num_topics`). This is a critical parameter, but there isn’t much guidance on how to choose it. Try to use domain expertise where possible.


Now we need to assess the goodness of fit for our model. As with other unsupervised learning techniques, validation is mostly a matter of interpretation.

#### Use the following questions to guide you:

- Did we learn reasonable topics?
- Do the words that make up a topic make sense?
- Is this topic helpful towards our goal?

#### We can evaluate fit by viewing the top words in each topic.

- Gensim has a `show_topics()` function for this.

In [7]:
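# show_topics() returns (topic_id, top_words) tuples.
# Note: `ti` is only the loop position; the actual topic ID is the first element of each tuple.
for ti, topic in enumerate(lda_model.show_topics(num_topics=5, num_words=10)):
    print "Topic: {}".format(ti)
    print topic
    print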

Topic: 0
(3, u'0.007*"10" + 0.006*"said" + 0.006*"just" + 0.005*"2009" + 0.005*"like" + 0.005*"2010" + 0.004*"12" + 0.004*"11" + 0.004*"game" + 0.004*"pm"')

Topic: 1
(1, u'0.006*"news" + 0.006*"workout" + 0.005*"exercises" + 0.005*"muscle" + 0.004*"like" + 0.004*"said" + 0.004*"just" + 0.004*"leg" + 0.004*"body" + 0.004*"make"')

Topic: 2
(10, u'0.017*"cake" + 0.011*"dress" + 0.010*"sleep" + 0.008*"chocolate" + 0.007*"00" + 0.004*"time" + 0.004*"make" + 0.003*"wedding" + 0.003*"carnival" + 0.003*"und"')

Topic: 3
(8, u'0.007*"health" + 0.005*"people" + 0.005*"body" + 0.005*"like" + 0.004*"food" + 0.004*"cancer" + 0.004*"day" + 0.004*"time" + 0.004*"help" + 0.003*"water"')

Topic: 4
(13, u'0.013*"com" + 0.009*"http" + 0.008*"www" + 0.006*"new" + 0.005*"href" + 0.005*"best" + 0.005*"content" + 0.004*"left" + 0.004*"cbssports" + 0.004*"like"')



#### Let's now use our fitted model to predict topics for some new data

(examples taken from http://www.buzzfeed.com/babymantis/25-stupid-newspaper-headlines-1opu)

In [8]:
new_text = [
    "Japanese scientists grow frog eyes and ears",
    "Statistics show that teen pregnancy drops of significantly after age 25",
    "Bugs flying around with wings are flying bugs",
    "Federal agents raid gun shop, find weapons",
    "Marijuana issue sent to a joint committee"
]

# Transform the text into the bag-of-words (bow) space using our vectorizer
new_bow = vectorizer.transform(new_text)

# Transform into format expected by gensim
new_corpus = Sparse2Corpus(new_bow, documents_columns=False)

# Print out first entry + matching words
print list(new_corpus)[0]
print [(id2word[word_id], count) for word_id, count in list(new_corpus)[0]]

[(8461, 1), (9477, 1), (10556, 1), (11436, 1), (13429, 1), (21460, 1)]
[(u'ears', 1), (u'eyes', 1), (u'frog', 1), (u'grow', 1), (u'japanese', 1), (u'scientists', 1)]


#### Transform into LDA space by applying fitted LDA model to the corpus

In [9]:
lda_vector = lda_model[new_corpus]

#### For each entry we can extract a list of (topic ID, probability) tuples indicating how strongly it belongs to each topic

In [10]:
[list(lda_vec) for lda_vec in lda_vector]

[[(8, 0.63181306448865238), (10, 0.24437718882330839)],
 [(5, 0.3336190844178899), (8, 0.55804729894211291)],
 [(0, 0.011111132890035902),
  (1, 0.011111138793326703),
  (2, 0.011111128325934191),
  (3, 0.011111116928335189),
  (4, 0.011111116289867936),
  (5, 0.22317720606119099),
  (6, 0.011111132516016461),
  (7, 0.011111116850011184),
  (8, 0.011111114308550711),
  (9, 0.011111119703747542),
  (10, 0.011111128722263669),
  (11, 0.011111121504129736),
  (12, 0.6323781684114721),
  (13, 0.011111128392975156),
  (14, 0.011111130302142516)],
 [(12, 0.86666618543879392)],
 [(0, 0.011111122439168298),
  (1, 0.011111133366836626),
  (2, 0.011111183859447316),
  (3, 0.011111136356965843),
  (4, 0.01111113363404132),
  (5, 0.011111128024102992),
  (6, 0.011111125249237261),
  (7, 0.84444412400051205),
  (8, 0.011111136845392298),
  (9, 0.011111116052744388),
  (10, 0.011111129296949211),
  (11, 0.011111127448963865),
  (12, 0.011111137238477207),
  (13, 0.01111112862796741),
  (14, 0.011111

#### Extract most prominent LDA topics for each entry

In [11]:
top_topics = [max(x, key=lambda item: item[1]) for x in list(lda_vector)]
top_topics

[(8, 0.63178064645276466),
 (8, 0.5567098314508202),
 (12, 0.63233605463090126),
 (12, 0.86666626907355926),
 (7, 0.84444391572062794)]

#### Print out text + topic

In [12]:
for i, topic_tuple in enumerate(top_topics):
    print new_text[i]
    print "{0:.1f}% as topic #{1}:".format(100 * topic_tuple[1], topic_tuple[0])
    print lda_model.print_topic(topic_tuple[0], topn=10), "\n"

Japanese scientists grow frog eyes and ears
63.2% as topic #8:
0.007*"health" + 0.005*"people" + 0.005*"body" + 0.005*"like" + 0.004*"food" + 0.004*"cancer" + 0.004*"day" + 0.004*"time" + 0.004*"help" + 0.003*"water" 

Statistics show that teen pregnancy drops of significantly after age 25
55.7% as topic #8:
0.007*"health" + 0.005*"people" + 0.005*"body" + 0.005*"like" + 0.004*"food" + 0.004*"cancer" + 0.004*"day" + 0.004*"time" + 0.004*"help" + 0.003*"water" 

Bugs flying around with wings are flying bugs
63.2% as topic #12:
0.005*"world" + 0.005*"new" + 0.004*"000" + 0.003*"year" + 0.003*"just" + 0.003*"like" + 0.003*"said" + 0.003*"time" + 0.003*"years" + 0.002*"says" 

Federal agents raid gun shop, find weapons
86.7% as topic #12:
0.005*"world" + 0.005*"new" + 0.004*"000" + 0.003*"year" + 0.003*"just" + 0.003*"like" + 0.003*"said" + 0.003*"time" + 0.003*"years" + 0.002*"says" 

Marijuana issue sent to a joint committee
84.4% as topic #7:
0.008*"2010" + 0.007*"2008" + 0.007*"2007" +

For more examples of using LDA with gensim, see: http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html

## Demo: Word2Vec in gensim

We will build a Word2Vec model using the text body of the articles available in the StumbleUpon dataset.

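The `Word2Vec` class has many arguments, including:

- `size` is the dimensionality of the learned word vectors (renamed `vector_size` in gensim 4.0)
- `window` is the maximum distance, in words, between the current word and the context words used to predict it within a sentence
- `min_count` is the minimum number of times a word must appear in the corpus to be kept in the vocabulary
- `workers` is the number of worker threads used to train the model (more threads can speed up training on multi-core machines)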
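
Here is a rough sketch of what `window` and `min_count` control, assuming the usual Word2Vec setup in which each word is paired with its neighbours inside a sentence. The helper names and toy sentences are hypothetical, and gensim's real training loop adds details (such as subsampling of frequent words) that are ignored here.

```python
from collections import Counter


def filter_rare_words(sentences, min_count):
    """Drop words whose total count across all sentences is below min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[word for word in sentence if counts[word] >= min_count]
            for sentence in sentences]


def context_pairs(tokens, window):
    """Return (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sentences, already split into tokens
toy_sentences = [["bake", "the", "chocolate", "cake"], ["eat", "the", "cake"]]

# With min_count=2 only "the" and "cake" survive
filtered = filter_rare_words(toy_sentences, min_count=2)
print(filtered)

# With window=1 each word is paired only with its immediate neighbours
pairs = context_pairs(toy_sentences[0], window=1)
print(pairs)
```

A larger `window` produces more pairs per word and tends to capture broader, more topical relationships; a smaller one focuses on immediate neighbours.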

In [13]:
from gensim.models import Word2Vec

# Set up the text body: split each article into a list of whitespace-separated tokens
text = df.body.dropna().map(lambda x: x.split())
model = Word2Vec(text,
                 size=100,      # dimensionality of the word vectors
                 window=5,      # max distance between a word and its context words
                 min_count=5,   # ignore words that appear fewer than 5 times
                 workers=4)     # number of worker threads (can speed up model training)

The model has a `most_similar` function that finds the words most similar to the ones you query.
It returns the words whose vectors are closest to the query, which tend to be words used in similar contexts.
Because the model was trained on this dataset, the neighbours reflect how words are used in these articles.

In [14]:
model.most_similar(positive=['cookie', 'brownie'])

[(u'cupcake', 0.9082363843917847),
 (u'pie', 0.8527567982673645),
 (u'candy', 0.8522288799285889),
 (u'crust', 0.8520104885101318),
 (u'tart', 0.828007698059082),
 (u'cheesecake', 0.8274497389793396),
 (u'cake', 0.8255833387374878),
 (u'icing', 0.8201949596405029),
 (u'mini', 0.8160595297813416),
 (u'buttercream', 0.8136184811592102)]

#### Word vector maths: 

- "man - boy $\approx$ person"

In [15]:
model.most_similar(positive=['man'], negative=['boy'])

[(u'sportsmanship', 0.5067369937896729),
 (u'person', 0.45713984966278076),
 (u'people', 0.3930201530456543),
 (u'example', 0.38871055841445923),
 (u'difference', 0.38625895977020264),
 (u'Americans', 0.3604770302772522),
 (u'those', 0.3589774966239929),
 (u'solution', 0.34824103116989136),
 (u'consumers', 0.3280426263809204),
 (u'anyone', 0.32233816385269165)]

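#### Analogy-style query: "man + boy - woman $\approx$ girl"

- The analogy "man is to woman as boy is to girl" is conventionally written as "woman + boy - man $\approx$ girl", i.e. `positive=['woman', 'boy'], negative=['man']`. The query below instead adds `man` and `boy` and subtracts `woman`; `girl` still ranks near the top of its results.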

In [16]:
model.most_similar(positive=['man', 'boy'], negative=['woman'])

[(u'daughter', 0.768936038017273),
 (u'girl', 0.7630077004432678),
 (u'brother', 0.7546021938323975),
 (u'wife', 0.7309377193450928),
 (u'girlfriend', 0.7301152944564819),
 (u'father', 0.7230810523033142),
 (u'son', 0.7200206518173218),
 (u'guy', 0.6960198879241943),
 (u'shot', 0.6947152018547058),
 (u'caught', 0.6867551207542419)]
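
Here is a minimal sketch of how such positive/negative queries can be computed: add the unit vectors of the positive words, subtract those of the negative words, then rank the remaining words by cosine similarity to the result. The two-dimensional vectors below are hypothetical (dimension 0 loosely stands for gender, dimension 1 for age), and the `most_similar` helper illustrates the general approach rather than gensim's actual implementation.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    # The query words themselves are excluded from the results
    exclude = set(positive) | set(negative)
    scores = [(word, sum(q * x for q, x in zip(query, unit(vec))))
              for word, vec in vectors.items() if word not in exclude]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Hypothetical toy embedding: [gender, age]
toy_vectors = {
    "man": [1.0, 1.0],
    "woman": [-1.0, 1.0],
    "boy": [1.0, -1.0],
    "girl": [-1.0, -1.0],
    "cake": [0.0, 0.5],
}

# "man is to woman as boy is to ?" written as woman + boy - man
result = most_similar(toy_vectors, positive=["woman", "boy"], negative=["man"])
print(result)
```

In this toy space the arithmetic lands exactly on `girl`; with real embeddings the match is only approximate, which is why the outputs above contain other nearby words as well.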

#### "cheesecake + cake - frosting = pie"

In [17]:
model.most_similar(positive=['cheesecake', 'cake'], negative=['frosting'])

[(u'pie', 0.7631522417068481),
 (u'crust', 0.738057553768158),
 (u'tart', 0.7293705940246582),
 (u'pizza', 0.7125093340873718),
 (u'brownie', 0.7053690552711487),
 (u'dessert', 0.6859625577926636),
 (u'cookie', 0.6765167117118835),
 (u'cupcake', 0.6735799312591553),
 (u'brownies', 0.6691722273826599),
 (u'recipe', 0.6645998954772949)]

#### data + science - statistics = ?

In [18]:
model.most_similar(positive=['data', 'science'], negative=['statistics'])

[(u'device', 0.725189745426178),
 (u'technology', 0.7006625533103943),
 (u'product', 0.681377649307251),
 (u'company', 0.6799347996711731),
 (u'industry', 0.6791709661483765),
 (u'material', 0.6722677946090698),
 (u'development', 0.6717053651809692),
 (u'human', 0.6707926392555237),
 (u'research', 0.6674562692642212),
 (u'virus', 0.6576155424118042)]

#### technology + entrepreneur - hipster = ?

In [19]:
model.most_similar(positive=['technology', 'entrepreneur'], negative=['hipster'])

[(u'design', 0.8124440312385559),
 (u'electronics', 0.807840883731842),
 (u'innovative', 0.7850945591926575),
 (u'concept', 0.7809066772460938),
 (u'manufacturing', 0.7697283029556274),
 (u'technologies', 0.7624959945678711),
 (u'phones', 0.7618154287338257),
 (u'solar', 0.7546247839927673),
 (u'military', 0.7517671585083008),
 (u'international', 0.750795841217041)]

#### Which one of these doesn't fit?

In [20]:
print model.doesnt_match("breakfast cereal lunch dinner".split())
print model.doesnt_match("facebook twitter tumblr myspace".split())

cereal
myspace
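
One plausible way to pick the odd word out is to average the unit vectors of all the words and return the word least similar to that average. The sketch below follows that idea with hypothetical two-dimensional vectors; the `doesnt_match` helper here is an illustration under that assumption, not gensim's code.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def doesnt_match(vectors, words):
    """Return the word whose unit vector is least similar to the mean unit vector."""
    units = {word: unit(vectors[word]) for word in words}
    dims = len(next(iter(units.values())))
    mean = [sum(units[word][i] for word in words) / len(words)
            for i in range(dims)]
    return min(words,
               key=lambda word: sum(a * b for a, b in zip(units[word], mean)))


# Hypothetical toy vectors: the three meals point in nearly the same direction
toy_vectors = {
    "breakfast": [1.0, 0.1],
    "lunch": [1.0, 0.0],
    "dinner": [1.0, -0.1],
    "cereal": [0.0, 1.0],
}

odd_one = doesnt_match(toy_vectors, ["breakfast", "cereal", "lunch", "dinner"])
print(odd_one)
```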


#### Similarity between two words

In [21]:
print model.similarity('man', 'woman')
print model.similarity('man', 'monkey')
print model.similarity('apple', 'pear')
print model.similarity('man', 'apple')

0.903090449011
0.226632053199
0.620058241646
-0.0905850081333
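
`similarity` reports the cosine similarity between two word vectors: 1 means the vectors point the same way, 0 means they are orthogonal, and negative values mean they point in opposing directions. Here is a minimal sketch with hypothetical two-dimensional vectors (the numbers are made up, not taken from the trained model):

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors
apple = [1.0, 2.0]
pear = [2.0, 4.0]      # same direction as apple, different length
man = [-2.0, 1.0]      # orthogonal to apple
anti = [-1.0, -2.0]    # opposite direction to apple

print(cosine_similarity(apple, pear))
print(cosine_similarity(apple, man))
print(cosine_similarity(apple, anti))
```

Because only direction matters, `apple` and `pear` score as identical here even though one vector is twice as long as the other.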


#### Inspect a single vector

In [22]:
model['man']

array([-1.48491478,  0.82271338, -0.66203606,  1.47417819, -0.83262259,
       -0.58856583,  0.70644802,  0.6183418 ,  0.83599722, -0.32860628,
        1.01383781,  0.27052182, -0.91036075,  0.86578614, -1.10889387,
        0.32364827, -0.07446388, -0.83384198, -0.85951984, -0.73046398,
       -0.06517454, -1.00799024,  1.32200027,  1.04098773, -0.21395648,
        0.21387762,  1.23757994, -1.03043449, -0.27736598,  1.13346863,
        0.2824986 ,  0.82115883,  0.99513644, -0.20105761, -0.07797236,
       -0.56093013,  0.61872196, -0.6510182 , -0.85838211,  0.15905283,
       -0.955392  , -0.63124895, -0.2876226 ,  0.84796757,  0.40757838,
       -1.34304857,  0.37937742,  0.83164185, -1.11950517,  0.01291989,
        0.78509122, -0.43319365,  0.93026209, -0.17269517, -0.33601609,
       -0.23730457,  1.97847748, -0.54382747, -0.35669383, -0.86995107,
       -0.15285942, -0.13247661, -0.91331434, -1.90528333,  0.37306562,
       -0.33279184,  0.66351765, -1.06753397,  1.18373585,  0.60