---
Exercises: Topic Modeling with LDA
----

![](http://www.thewrap.com/wp-content/uploads/2015/12/New-York-Times-paper.jpg)

Today you will apply latent Dirichlet allocation (LDA) to a corpus of NYT articles to discover latent topics.
_Yes, the same NYT articles_ as in the previous lab.

Load the data
----

Same as last lab

In [1]:
# Read the nyt_articles.pkl file with pandas' pd.read_pickle function; the data is transformed into a structure for scikit-learn later on.

import pandas as pd
df = pd.read_pickle("../../corpora/nyt_articles.pkl")

# Look at the data...
# What are the columns?

# columns are the features of the article

# What are the rows?

# individual articles

df.head(n=2)

Unnamed: 0,document_type,web_url,lead_paragraph,abstract,snippet,news_desk,word_count,source,section_name,subsection_name,_id,pub_date,print_page,headline,content
0,article,http://www.nytimes.com/2013/10/03/sports/footb...,You would think that in a symmetric zero-sum s...,,You would think that in a symmetric zero-sum s...,Sports,347,The New York Times,Sports,Pro Football,524d4e3a38f0d8198974001f,2013-10-03T00:00:00Z,,Week 5 Probabilities: Why Offense Is More Impo...,the original goal building model football fore...
1,article,http://www.nytimes.com/2013/10/03/us/new-immig...,House Democrats on Wednesday unveiled an immig...,House Democrats unveil immigration bill that p...,House Democrats on Wednesday unveiled an immig...,National,83,The New York Times,U.S.,,524cf71338f0d8198973ff7b,2013-10-03T00:00:00Z,21.0,New Immigration Bill Put Forward,house unveiled immigration bill provides path ...


In [2]:
# Look at a sample of the data - content of the news stories
df.content[0][:100]

# content of the news article

'the original goal building model football forecasting weigh importance facet game in particular want'

Vectorize data
-----

Let's use tf-idf.

[Whether that is theoretically correct is debated, but it works better in practice](https://groups.google.com/forum/#!topic/gensim/OESG1jcaXaQ)

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

# document-term matrix A (rows: articles, columns: tf-idf weights for up to 1000 terms)
tfidf_vectorizer = TfidfVectorizer(max_features=1000, max_df=0.95, min_df=2, stop_words='english')
vectorized = tfidf_vectorizer.fit_transform(df.content)
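
The tf-idf question comes up because LDA's generative model is defined over word counts. As a rough sketch of the difference, the snippet below builds both a count matrix and a tf-idf matrix. The three `toy_docs` are made-up stand-ins for `df.content`, not the NYT data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-in for df.content (hypothetical documents, not the NYT data)
toy_docs = [
    "team wins game in final inning",
    "senate vote ends government shutdown",
    "team loses game after senate vote",
]

# Raw term counts: the kind of input LDA's generative model formally assumes
count_matrix = CountVectorizer(stop_words="english").fit_transform(toy_docs)

# tf-idf weights: same vocabulary, but the counts are re-weighted and normalized
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

print(count_matrix.shape, tfidf_matrix.shape)  # both are (n_documents, n_terms)
print(count_matrix.toarray()[0])               # integer counts for the first document
print(tfidf_matrix.toarray()[0].round(2))      # fractional weights for the first document
```

Either matrix can be passed to `LatentDirichletAllocation.fit`, so swapping the vectorizer is an easy experiment to run on the real corpus.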

---
Scikit-learn's LDA
------

Use [Scikit-learn's LDA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) to find topics.

In [14]:
# Import from the public module; sklearn.decomposition.online_lda is a private path
from sklearn.decomposition import LatentDirichletAllocation

In [36]:
# Note: scikit-learn 0.19+ calls this parameter n_components; n_topics was removed in 0.21
lda = LatentDirichletAllocation(n_topics=4,
                                max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=42)

lda.fit(vectorized)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=5, mean_change_tol=0.001,
             n_jobs=1, n_topics=4, perp_tol=0.1, random_state=42,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

Write a function to print topics and words:

It should look like this:  
`Topic #0:
said mr year game new season team like time government state people ms company percent republican work million city party`

<br>
<details><summary>
Click here for a hint…
</summary>
lda.components_   
tf_feature_names = tfidf_vectorizer.get_feature_names()
</details>

<br>
<details><summary>
Click here for the solution…
</summary>
```
def print_top_words(model, feature_names, n_top_words=20):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
    
print("Topics in LDA model:")
tf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names)
```
</details>

In [46]:
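# components_ has one row per topic and one column per vocabulary term.
# This cell was run out of order: the 10 rows come from the 10-topic model fitted further below.
lda.components_.shape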

(10, 1000)
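
To see how a topic-by-term matrix of this shape turns into a list of top words, here is a small sketch of the `argsort` slice used by `print_top_words`. The `components` array and `feature_names` list are made-up values that only mimic the layout of `lda.components_`.

```python
import numpy as np

# Hypothetical topic-word weights: 2 topics (rows) x 5 vocabulary terms (columns),
# the same layout as lda.components_
feature_names = ["game", "team", "senate", "vote", "season"]
components = np.array([
    [5.0, 4.0, 0.1, 0.2, 3.0],   # a sports-like topic
    [0.2, 0.1, 6.0, 5.0, 0.3],   # a politics-like topic
])

n_top_words = 3
top_words = []
for topic in components:
    # argsort() returns indices in ascending order of weight; the slice
    # [:-n_top_words - 1:-1] walks backwards from the end, so it yields the
    # indices of the n_top_words largest weights, largest first
    top_idx = topic.argsort()[:-n_top_words - 1:-1]
    top_words.append([feature_names[i] for i in top_idx])

print(top_words)
# [['game', 'team', 'season'], ['senate', 'vote', 'season']]
```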

In [35]:
def print_top_words(model, feature_names, n_top_words=20):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

print("Topics in LDA model:")

# print(type(tfidf_vectorizer))
# print(type(vectorized))

vectorizer = tfidf_vectorizer

tf_feature_names = vectorizer.get_feature_names()  # TfidfVectorizer
print_top_words(lda, tf_feature_names)

Topics in LDA model:
Topic #0:
said mr year game new season team like time government state people ms company percent republican work million city party
Topic #1:
party merkel said european government mr euro zone germany german political bank new europe percent coalition election official design union
Topic #2:
iran mr said attack rouhani united syria nuclear official weapon iranian security nation killed israel al military state chemical president
Topic #3:
twitter follow visit web view site international work merkel europe judge opera mr congress sunday giant india british said political



Experiment with the number of topics. What patterns emerge?

What is the best number of topics?

In [43]:
# Note: scikit-learn 0.19+ calls this parameter n_components; n_topics was removed in 0.21
lda = LatentDirichletAllocation(n_topics=10,
                                max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=42)

lda.fit(vectorized)


print("Topics in LDA model:")

# print(type(tfidf_vectorizer))
# print(type(vectorized))

vectorizer = tfidf_vectorizer

tf_feature_names = vectorizer.get_feature_names()  # TfidfVectorizer
print_top_words(lda, tf_feature_names)

Topics in LDA model:
Topic #0:
republican cruz debt house shutdown percent senate bond government investor senator ceiling stock congress said democrat vote boehner market spending
Topic #1:
twitter follow visit web view site international work police chinese internet merkel party europe pakistan mr attack said government golden
Topic #2:
mr said new ms like music work year art people time sept company city york world life night series school
Topic #3:
game season team said player league yankee cup yard play coach year win second run rivera inning time race sunday
Topic #4:
korea south north oil said music festival government award defense military giant economy gas game nuclear mr cut meeting new
Topic #5:
song race film bank mr said music new year art like time government design history people republican political work ms
Topic #6:
miller drug federal county rule law department said court final judge state race employee group year vice nbc company play
Topic #7:
mr said government st

`n_topics=0` is invalid; with `n_topics=1` the single topic mixes words about sports and domestic affairs.

The best is probably 10, because that is the number of distinct values in `set(df.section_name)`. On inspection, however, it is not always clear which topic matches which `df.section_name`.
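
Reading the topics by eye is one way to pick the number of topics. A numeric complement is to compare perplexity across several values, where lower is better. The sketch below is only an illustration under some assumptions: it uses a made-up toy corpus and raw counts rather than the tf-idf matrix, it scores the training data instead of a held-out set, and it uses the newer `n_components` parameter name.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for df.content (hypothetical documents, not the NYT data)
toy_docs = [
    "team wins game in final inning",
    "coach praises team after season opener",
    "senate vote ends government shutdown",
    "house and senate debate spending bill",
    "museum opens new art exhibition",
    "orchestra plays new music at festival",
]
counts = CountVectorizer(stop_words="english").fit_transform(toy_docs)

perplexity_by_k = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k,
                                      max_iter=10,
                                      learning_method="batch",
                                      random_state=42)
    model.fit(counts)
    # Perplexity on the training data; a held-out set gives a fairer comparison
    perplexity_by_k[k] = model.perplexity(counts)

for k, value in perplexity_by_k.items():
    print(k, round(value, 1))
```

On a corpus this small the numbers mean little; the point is the loop, which carries over to `vectorized` and a wider range of topic counts.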

How do the LDA topics compare to the NMF topics?

Here were the NMF topics:

* Topic 0: mr, said, court, case, judge, state, justice, lawyer, prison, official
* Topic 1: game, season, yard, team, said, league, player, coach, play, touchdown
* Topic 2: republican, house, health, care, government, senate, shutdown, obama, law, democrat
* Topic 3: iran, rouhani, nuclear, iranian, obama, israel, united, mr, netanyahu, president
* Topic 4: ms, music, art, new, work, like, dance, york, museum, song
* Topic 5: company, percent, said, market, year, million, bank, china, price, state
* Topic 6: yankee, rivera, pettitte, inning, game, season, baseball, run, pitch, stadium
* Topic 7: attack, said, official, syria, killed, people, government, police, security, mall
* Topic 8: party, merkel, government, germany, european, election, political, europe, german, ms
* Topic 9: cup, team, race, said, club, player, year, america, won, ve


Topics in LDA model:
* Topic #0:
republican cruz debt house shutdown percent senate bond government investor senator ceiling stock congress said democrat vote boehner market spending
* Topic #1:
twitter follow visit web view site international work police chinese internet merkel party europe pakistan mr attack said government golden
* Topic #2:
mr said new ms like music work year art people time sept company city york world life night series school
* Topic #3:
game season team said player league yankee cup yard play coach year win second run rivera inning time race sunday
* Topic #4:
korea south north oil said music festival government award defense military giant economy gas game nuclear mr cut meeting new
* Topic #5:
song race film bank mr said music new year art like time government design history people republican political work ms
* Topic #6:
miller drug federal county rule law department said court final judge state race employee group year vice nbc company play
* Topic #7:
mr said government state republican year party united official people iran country president law percent new court obama american health
* Topic #8:
mr opera drug wednesday starting bomb clinton live said lawyer big republican network united concert access meeting court case man
* Topic #9:
syria chemical weapon mr syrian state said united government council resolution nation russia security coalition group gas opposition war al
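
One rough way to put a number on the comparison is the Jaccard overlap between the top-word sets of an NMF topic and an LDA topic. The sketch below uses word lists copied from a few of the topics above. The `jaccard` helper and the choice of topics are illustrative assumptions, not part of the lab.

```python
# Top words copied from the topic lists above (only a few topics, for brevity)
nmf_topics = {
    1: "game season yard team said league player coach play touchdown",
    2: "republican house health care government senate shutdown obama law democrat",
}
lda_topics = {
    0: ("republican cruz debt house shutdown percent senate bond government investor "
        "senator ceiling stock congress said democrat vote boehner market spending"),
    3: ("game season team said player league yankee cup yard play coach year win "
        "second run rivera inning time race sunday"),
    9: ("syria chemical weapon mr syrian state said united government council "
        "resolution nation russia security coalition group gas opposition war al"),
}


def jaccard(a, b):
    """Overlap of two space-separated word lists: |A & B| / |A | B|."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)


best_match = {}
for nmf_id, nmf_words in nmf_topics.items():
    scores = {lda_id: jaccard(nmf_words, lda_words)
              for lda_id, lda_words in lda_topics.items()}
    # Keep the LDA topic with the highest overlap for this NMF topic
    best_match[nmf_id] = max(scores, key=scores.get)
    print("NMF topic", nmf_id, "-> LDA topic", best_match[nmf_id])
```

Here the NMF sports topic pairs with LDA topic #3 and the NMF shutdown topic with LDA topic #0. Since the NMF lists have 10 words and the LDA lists have 20, the scores are only a rough guide.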



How do the LDA topics compare to the NYT section labels?

The answer to this question is subjective.

In the LDA model, the word "republican" shows up in multiple topics. The same word can show up in multiple topics for either LDA or NMF.
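
A quick way to check this kind of overlap is to count how many topics list each word. The sketch below uses the top words of four LDA topics copied from the output above; picking only these four is an arbitrary choice to keep it short.

```python
from collections import Counter

# Top words of four LDA topics, copied from the output above
lda_topics = {
    0: ("republican cruz debt house shutdown percent senate bond government investor "
        "senator ceiling stock congress said democrat vote boehner market spending"),
    5: ("song race film bank mr said music new year art like time government design "
        "history people republican political work ms"),
    7: ("mr said government state republican year party united official people iran "
        "country president law percent new court obama american health"),
    8: ("mr opera drug wednesday starting bomb clinton live said lawyer big republican "
        "network united concert access meeting court case man"),
}

# For every word, count how many topics list it among their top words
topic_count = Counter(word
                      for words in lda_topics.values()
                      for word in set(words.split()))

# Keep only the words that appear in more than one topic
shared = {word: n for word, n in topic_count.items() if n > 1}
print(sorted(shared.items(), key=lambda item: (-item[1], item[0])))
```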

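A nuanced distinction to keep in mind is that within a document, each occurrence of a word is assigned to a single topic. So inside a document, if an occurrence of "bank" gets labeled with the "finance" topic, the other occurrences of "bank" in that document tend to get the "finance" topic too, although the LDA generative model assigns a topic to each occurrence separately.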

There are no gold labels, so whatever you think is better is better.
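
To make the comparison with the section labels a little less subjective, you can cross-tabulate each article's dominant topic against its section. The sketch below uses made-up values: `doc_topic` stands in for the output of `lda.transform(vectorized)` and `sections` stands in for `df.section_name`.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights (rows: documents, columns: topics),
# shaped like the output of lda.transform(vectorized)
doc_topic = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.3, 0.7],
    [0.1, 0.9],
])

# Hypothetical section labels standing in for df.section_name
sections = pd.Series(["Sports", "Sports", "U.S.", "U.S."], name="section_name")

# The dominant topic of a document is the column with the largest weight
dominant_topic = pd.Series(doc_topic.argmax(axis=1), name="dominant_topic")

# Rows: sections, columns: topics, cells: number of articles
table = pd.crosstab(sections, dominant_topic)
print(table)
```

A section whose articles concentrate in one column lines up well with a topic; a row spread across many columns suggests that section mixes several topics.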

---
Challenge Exercises
----

1) Try the same analysis with the `lda` package.

[RTFM](http://pythonhosted.org/lda/)

In [52]:
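try:
    import lda
except ImportError:
    # pip.main() was removed in pip 10; run pip through the current interpreter instead
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "lda"])
    import lda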

2) Try with [gensim](https://radimrehurek.com/gensim/tut2.html)

3) Try the same analysis with [GraphLab's API](https://dato.com/products/create/docs/generated/graphlab.topic_model.create.html)

Check out this notebook for a great [lda visualization](http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/GraphLab.ipynb#topic=2&lambda=0.41&term=).

__NOTE:__ GraphLab only supports Python 2.7

In [53]:
%%bash
# Install graphlab
sudo pip2 install graphlab-create
echo '[Product]
product_key=D868-7DBE-AC8A-0343-45F3-E250-34B4-24CA' > ~/.graphlab/config

sudo: no tty present and no askpass program specified
bash: line 4: /Users/justw/.graphlab/config: No such file or directory


In [54]:
import graphlab as gl

ImportError: No module named 'graphlab'

<br>
<br>
<br>

---