----
Exercises: Topic Modeling with Non-Negative Matrix Factorization (NMF)
----

Today we will apply the Non-Negative Matrix Factorization (NMF) algorithm to discover latent topics in New York Times articles.

![](http://1.bp.blogspot.com/_JNTikHKvtnY/S6tLPRWmxjI/AAAAAAAABcQ/-eszxl-WIQ0/s1600/New-York-Times.jpg)

---
Preprocessing
----

1) Read the articles.pkl file using the `pd.read_pickle` function in Pandas and transform the data into a structure for scikit-learn.

In [1]:
reset -fs

In [2]:
import pandas as pd

In [4]:
df = pd.read_pickle("../../corpora/nyt_articles.pkl")

In [6]:
# Look at the data...
# What are the columns?

# columns are the features of the article

# What are the rows?

# individual articles

df.head(n=2)

Unnamed: 0,document_type,web_url,lead_paragraph,abstract,snippet,news_desk,word_count,source,section_name,subsection_name,_id,pub_date,print_page,headline,content
0,article,http://www.nytimes.com/2013/10/03/sports/footb...,You would think that in a symmetric zero-sum s...,,You would think that in a symmetric zero-sum s...,Sports,347,The New York Times,Sports,Pro Football,524d4e3a38f0d8198974001f,2013-10-03T00:00:00Z,,Week 5 Probabilities: Why Offense Is More Impo...,the original goal building model football fore...
1,article,http://www.nytimes.com/2013/10/03/us/new-immig...,House Democrats on Wednesday unveiled an immig...,House Democrats unveil immigration bill that p...,House Democrats on Wednesday unveiled an immig...,National,83,The New York Times,U.S.,,524cf71338f0d8198973ff7b,2013-10-03T00:00:00Z,21.0,New Immigration Bill Put Forward,house unveiled immigration bill provides path ...


In [7]:
# Look at a sample of the data - content of the news stories
df.content[0][:100]

# content of the news article

'the original goal building model football forecasting weigh importance facet game in particular want'

2) Use the [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from scikit-learn to turn the content of the news stories into the document-term matrix $\textbf{A}$ 

(I call it "vectorized")

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
# Here are some good defaults
max_features = 1000
max_df = 0.95
min_df = 2
stop_words = 'english'

What is the size of the document-term matrix?

In [45]:
# document-term matrix A
vectorized = CountVectorizer(max_features=1000, max_df=0.95, min_df=2, stop_words='english')

a = vectorized.fit_transform(df.content)
a.shape

# so n is 1405, and m is 1000

(1405, 1000)
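To see what `max_df` and `min_df` do to the shape of the document-term matrix, here is a minimal sketch on a made-up three-document corpus (the `toy_*` names and the sentences are illustrative assumptions, not part of the NYT data). Stop-word removal is left out so the effect of the two frequency cutoffs is easy to trace by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for illustration
toy_docs = [
    "the team won the game",
    "the senate passed the law",
    "the team lost the game",
]

# No filtering: every token of 2+ characters becomes a column
toy_vectorizer = CountVectorizer()
toy_A = toy_vectorizer.fit_transform(toy_docs)
print(toy_A.shape)                        # (documents, vocabulary size)
print(sorted(toy_vectorizer.vocabulary_))

# min_df=2 drops terms seen in fewer than 2 documents;
# max_df=0.95 drops terms seen in more than 95% of documents ("the")
filtered = CountVectorizer(max_df=0.95, min_df=2)
filtered_A = filtered.fit_transform(toy_docs)
print(sorted(filtered.vocabulary_))
print(filtered_A.toarray())
```

Rows are always documents and columns are always terms, so the first dimension stays at the number of documents while the cutoffs (and `max_features`) shrink the second.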

----
NMF with scikit-learn 
------

Hint: [Here is an example](http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html)

In [46]:
import sklearn

assert sklearn.__version__ == '0.18' # Make sure we are in the modern age

In [47]:
from sklearn.decomposition import NMF

Apply NMF with SVD-based initialization to the document-term matrix $\text{A}$ to generate 4 topics.

In [48]:
model = NMF(init="nndsvd",
            n_components=4,
            max_iter=200)

Get the factors $\text{W}$ and $\text{H}$ from the resulting model.

In [51]:
W = model.fit_transform(a)
H = model.components_

What are the sizes of W and H?

In [80]:
print("W:", W.shape)
print("H:", H.shape)

W: (1405, 4)
H: (4, 1000)
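The shapes follow from the factorization itself: W (documents x topics) times H (topics x terms) has to give back something the size of A. A small sketch on a made-up matrix (the `toy_*` names and values are assumptions for illustration) checks this and measures how close the product is to the original.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical 5 x 4 count matrix with two obvious groups of columns
toy_A = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
    [1, 1, 1, 1],
], dtype=float)

toy_model = NMF(init="nndsvd", n_components=2, max_iter=500)
toy_W = toy_model.fit_transform(toy_A)   # shape (5, 2): documents x topics
toy_H = toy_model.components_            # shape (2, 4): topics x terms

approx = toy_W @ toy_H                   # same shape as toy_A
rel_err = np.linalg.norm(toy_A - approx) / np.linalg.norm(toy_A)
print(toy_W.shape, toy_H.shape, approx.shape)
print("relative reconstruction error:", rel_err)
```

Both factors are non-negative, which is what lets us read each row of H as a topic and each row of W as a document's mix of topics.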


Get the list of all terms whose indices correspond to the columns of the document-term matrix.

In [68]:
type(vectorized)

sklearn.feature_extraction.text.CountVectorizer

In [67]:
vectorizer = vectorized

terms = [""] * len(vectorizer.vocabulary_)
for term in vectorizer.vocabulary_.keys():
    terms[vectorizer.vocabulary_[term]] = term

In [71]:
# Have a look at some of the terms
terms[-5:]
# terms

['yard', 'year', 'york', 'young', 'zone']
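The loop above inverts the `vocabulary_` mapping (term to column index) into a list indexed by column. As a sketch of an equivalent one-liner, sorting the terms by their index gives the same list; the dictionary below is a made-up stand-in for `vectorizer.vocabulary_`.

```python
# Hypothetical stand-in for vectorizer.vocabulary_ (term -> column index)
toy_vocabulary = {"year": 1, "zone": 3, "yard": 0, "young": 2}

# Loop version, as in the cell above
terms_loop = [""] * len(toy_vocabulary)
for term, index in toy_vocabulary.items():
    terms_loop[index] = term

# Equivalent: sort the terms by their column index
terms_sorted = sorted(toy_vocabulary, key=toy_vocabulary.get)

print(terms_loop)
print(terms_sorted)
```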

Print the top 10 ranked terms for each topic, by sorting the values in the rows of the $\text{H}$ factor 

<br>

<details><summary>
Click here for a hint…
</summary>
```
for topic_index in None:
    top_indices = np.argsort(None)[None][None]
    term_ranking = [None[i] for i in None]
    print("Topic {}: {}".format(topic_index, ", ".join(term_ranking)))
```
</details>

<br>

<details><summary>
Click here for the answer…
</summary>
```
for topic_index in range(H.shape[0]):
    top_indices = np.argsort(H[topic_index,:])[::-1][0:10]
    term_ranking = [terms[i] for i in top_indices]
    print("Topic {}: {}".format(topic_index, ", ".join(term_ranking)))
```
</details>

In [78]:
import numpy as np

for topic_index in range(H.shape[0]):  # H.shape[0] is k
    top_indices = np.argsort(H[topic_index,:])[::-1][0:10]
    term_ranking = [terms[i] for i in top_indices]
    print("Topic {}: {}".format(topic_index, ", ".join(term_ranking)))

Topic 0: said, year, new, people, state, company, gun, work, like, percent
Topic 1: game, season, said, team, year, player, time, play, yankee, league
Topic 2: republican, government, house, health, law, care, party, shutdown, senate, president
Topic 3: mr, said, iran, rouhani, united, nuclear, president, obama, state, netanyahu
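If the `argsort(...)[::-1][0:10]` chain is hard to read, here is the same idea on a single made-up row of H with four terms (values and names are illustrative assumptions). `argsort` returns indices in ascending order of value, `[::-1]` reverses them, and the final slice keeps the top few.

```python
import numpy as np

toy_terms = ["game", "iran", "law", "team"]
toy_row = np.array([0.7, 0.0, 0.1, 0.4])   # one hypothetical topic (row of H)

ascending = np.argsort(toy_row)    # indices from smallest to largest weight
descending = ascending[::-1]       # indices from largest to smallest weight
top_two = descending[:2]           # keep the two heaviest terms

print([toy_terms[i] for i in top_two])
```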


Look at the words in the numbered topics. For each one, make up a label that describes it.

For example:  
`Topic 3: people, mobile, said, phone, technology, music, digital, users, microsoft, software`  
is about "The Singularity" 😉

Are there any topics that don't make sense (i.e., the words don't go together)?

These are all a mix of the 10 section_names.

Topic 0: said, year, new, people, state, company, gun, work, like, percent

Business Day, business news, or front page headlines

Topic 1: game, season, said, team, year, player, time, play, yankee, league

sports

Topic 2: republican, government, house, health, law, care, party, shutdown, senate, president

domestic politics - U.S.

Topic 3: mr, said, iran, rouhani, united, nuclear, president, obama, state, netanyahu

foreign politics - World


Topic 0 is not immediately obvious - if we try to categorize it as business news, the word "gun" doesn't seem to fit.  Maybe it's Opinion.

Change the number of topics to match the number of topics in NYT section labels

In [90]:
set(df.section_name)

{'Arts',
 'Books',
 'Business Day',
 'Magazine',
 'Opinion',
 'Real Estate',
 'Sports',
 'Travel',
 'U.S.',
 'World'}

In [92]:
len(set(df.section_name))  # so there are 10 topics in NYT section labels

10

In [95]:
model = NMF(init="nndsvd",
            n_components=10,
            max_iter=200)

W = model.fit_transform(a)
H = model.components_


vectorizer = vectorized

terms = [""] * len(vectorizer.vocabulary_)
for term in vectorizer.vocabulary_.keys():
    terms[vectorizer.vocabulary_[term]] = term

for topic_index in range(H.shape[0]):  # H.shape[0] is k
    top_indices = np.argsort(H[topic_index,:])[::-1][0:10]
    term_ranking = [terms[i] for i in top_indices]
    print("Topic {}: {}".format(topic_index, ", ".join(term_ranking)))

Topic 0: said, year, day, people, official, case, time, decision, court, added
Topic 1: game, season, team, year, player, time, league, yankee, run, play
Topic 2: republican, house, government, health, law, care, party, president, shutdown, obama
Topic 3: mr, year, party, political, case, like, leader, state, court, night
Topic 4: new, work, company, like, york, people, ms, job, worker, executive
Topic 5: gun, child, death, year, law, state, time, shooting, old, killed
Topic 6: iran, rouhani, nuclear, obama, iranian, netanyahu, president, israel, united, mr
Topic 7: davis, state, story, texas, woman, democratic, city, new, republican, candidate
Topic 8: percent, year, government, market, company, million, month, country, bank, economy
Topic 9: united, government, syria, state, weapon, chemical, security, attack, official, nation


How do the NMF topics compare to the NYT section labels?

Some match well, like sports. The politics topics are harder to map to a single section, although the domestic ones are obvious.  Books is an enigma.
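One way to make this comparison less impressionistic is to assign each article to its heaviest topic in W and cross-tabulate that against `section_name`. The sketch below uses a made-up W and made-up section labels (all `toy_*` values are assumptions); on the real data you would pass `df.section_name` and the fitted `W` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights for 5 articles and 3 topics
toy_W = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.1, 0.7, 0.1],
    [0.0, 0.6, 0.3],
    [0.2, 0.1, 0.9],
])
toy_sections = pd.Series(
    ["Sports", "Sports", "U.S.", "U.S.", "World"], name="section_name"
)

# Dominant topic = column with the largest weight in each row of W
dominant_topic = pd.Series(toy_W.argmax(axis=1), name="dominant_topic")

# Rows are NYT sections, columns are NMF topics, cells are article counts
comparison = pd.crosstab(toy_sections, dominant_topic)
print(comparison)
```

A section whose row is concentrated in one column lines up with a single topic; a row spread over several columns suggests the section mixes topics.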

Which would you use to filter your news?

I would use the NMF topics to choose sections that interested me.

Repeat the same modeling with [`tf-idf`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). 

In [96]:
from sklearn.feature_extraction.text import TfidfVectorizer

# document-term matrix A
tfidf_vectorizer = TfidfVectorizer(max_features=1000, max_df=0.95, min_df=2, stop_words='english')

vectorized = tfidf_vectorizer

a = vectorized.fit_transform(df.content)

model = NMF(init="nndsvd",
            n_components=10,
            max_iter=200)

W = model.fit_transform(a)
H = model.components_


vectorizer = vectorized

terms = [""] * len(vectorizer.vocabulary_)
for term in vectorizer.vocabulary_.keys():
    terms[vectorizer.vocabulary_[term]] = term

for topic_index in range(H.shape[0]):  # H.shape[0] is k
    top_indices = np.argsort(H[topic_index,:])[::-1][0:10]
    term_ranking = [terms[i] for i in top_indices]
    print("Topic {}: {}".format(topic_index, ", ".join(term_ranking)))

Topic 0: mr, said, court, case, judge, state, justice, lawyer, prison, official
Topic 1: game, season, yard, team, said, league, player, coach, play, touchdown
Topic 2: republican, house, health, care, government, senate, shutdown, obama, law, democrat
Topic 3: iran, rouhani, nuclear, iranian, obama, israel, united, mr, netanyahu, president
Topic 4: ms, music, art, new, work, like, dance, york, museum, song
Topic 5: company, percent, said, market, year, million, bank, china, price, state
Topic 6: yankee, rivera, pettitte, inning, game, season, baseball, run, pitch, stadium
Topic 7: attack, said, official, syria, killed, people, government, police, security, mall
Topic 8: party, merkel, government, germany, european, election, political, europe, german, ms
Topic 9: cup, team, race, said, club, player, year, america, won, ve


How does that change the topics?

Are they "tighter"? Easier to describe?

yes!  they're way tighter, look at topic 6 (sports); tfidf is taking 
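For intuition on why tf-idf can tighten topics, here is a hand-rolled sketch of the weighting on a made-up count matrix. It uses the smoothed idf formula that scikit-learn documents as the `TfidfVectorizer` default (`smooth_idf=True`) followed by L2 row normalization; the counts and column names are assumptions for illustration.

```python
import numpy as np

# Hypothetical counts; columns stand for "said", "yankee", "merkel"
counts = np.array([
    [2, 1, 0],
    [3, 0, 1],
    [1, 0, 0],
    [2, 0, 0],
], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # documents containing each term

# Smoothed idf: terms found in every document get the minimum weight of 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

tfidf = counts * idf                             # reweight each column
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize rows

print("idf:", idf)
print(tfidf.round(3))
```

A word like "said" that appears everywhere keeps the lowest weight, while rarer, more specific words are boosted, so they have a better chance of ranking high within a topic.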

---
Challenge Exercises
====

Rolling Your Own (RYO) NMF
-----

With the document matrix (our bags of words), we can begin implementing the NMF algorithm.  

1. Create an NMF class that is initialized with a document matrix (bag of words or tf-idf) __V__.  As arguments (in addition to the document matrix) it should also take parameters __k__ (# of latent topics) and the maximum # of iterations to perform. 
  
  First we need to initialize our weights (__W__) and features (__H__) matrices.  

2. Initialize the weights matrix (W) with (positive) random values to be a __n x k__ matrix, where __n__ is the number of documents and __k__ is the number of latent topics.

3. Initialize the feature matrix (H) to be __k x m__ where __m__ is the number of words in our vocabulary (i.e. length of bag).  Our original document matrix (__V__) is a __n x m__ matrix.  __NOTICE: shape(V) = shape(W * H)__

4. Now that we have initialized our matrices and defined our cost, we can begin iterating. Update your weights and features matrices accordingly. Repeat this update until convergence (i.e. change in __cost(V, W*H)__ close to 0) or until we reach our max # of iterations.

5. Assume we want to use a least-squares error metric when we update the matrices __W__ and __H__. This allows us to use the numpy.linalg.lstsq solver. 
Update __H__ by calling lstsq, holding __W__ fixed and minimizing the sum of squared errors predicting the document matrix. Since these values should all be at least 0, clip all the values in __H__ after the call to lstsq.

6. Use the lstsq solver to update __W__ while holding __H__ fixed. The lstsq solver assumes it is optimizing the right matrix of the multiplication (e.g. x in the equation __ax=b__). So you will need to get creative so you can use it and have the dimensions line up correctly.  Brainstorm on paper or a whiteboard how to manipulate the matrices so lstsq can get the dimensionality correct and optimize __W__. __hint: it involves transposes.__ Clip __W__ appropriately after updating it with lstsq to ensure it is at least 0.  
`from numpy.linalg import lstsq`

7. Repeat steps 5 and 6 for a fixed number of iterations.

8. Return the computed weights matrix and features matrix.
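The steps above can be sketched as follows. This is one possible alternating least-squares layout, not a reference solution: the class name `SketchNMF`, the `seed` and `tol` parameters, and the toy matrix are all assumptions made for illustration. It expects a dense array, so a sparse matrix from a vectorizer would need `.toarray()` first.

```python
import numpy as np
from numpy.linalg import lstsq


class SketchNMF:
    """Minimal alternating least-squares NMF sketch (illustrative only)."""

    def __init__(self, V, k, max_iter=50, seed=0):
        self.V = np.asarray(V, dtype=float)   # n x m document matrix
        self.k = k
        self.max_iter = max_iter
        rng = np.random.RandomState(seed)
        n, m = self.V.shape
        self.W = rng.rand(n, k)               # positive random weights, n x k
        self.H = rng.rand(k, m)               # positive random features, k x m

    def cost(self):
        # Sum of squared errors between V and its reconstruction W H
        return np.sum((self.V - self.W @ self.H) ** 2)

    def fit(self, tol=1e-6):
        previous = self.cost()
        for _ in range(self.max_iter):
            # Hold W fixed: solve W H = V for H, then clip negatives to 0
            self.H = lstsq(self.W, self.V, rcond=None)[0].clip(min=0)
            # Hold H fixed: transpose so W is on the right, H.T W.T = V.T
            self.W = lstsq(self.H.T, self.V.T, rcond=None)[0].T.clip(min=0)
            current = self.cost()
            if abs(previous - current) < tol:  # cost stopped changing
                break
            previous = current
        return self.W, self.H


# Hypothetical rank-2 non-negative matrix to try it on
toy_rng = np.random.RandomState(1)
toy_V = toy_rng.rand(6, 2) @ toy_rng.rand(2, 5)
toy_nmf = SketchNMF(toy_V, k=2, max_iter=100)
toy_W, toy_H = toy_nmf.fit()
print(toy_W.shape, toy_H.shape, toy_nmf.cost())
```

The transpose trick works because (W H)ᵀ = Hᵀ Wᵀ, which puts the unknown on the right-hand side where `lstsq` expects it.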



Using Your NMF Function
----

1. Write a function that takes __W__, __H__ and the document matrix as arguments, and returns the mean-squared error (of __document matrix - WH__).

2. Using argsort on each topic in __H__, find the index values of the words most associated with that topic.  Combine these index values with the word-names you stored in the __Preliminaries__ section to print out the most common words for each topic.
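A minimal sketch of both helpers, assuming dense NumPy arrays; the function names `mean_squared_error` and `top_terms` and the toy inputs are hypothetical choices, not names required by the exercise.

```python
import numpy as np


def mean_squared_error(W, H, V):
    """Mean of the squared entries of (V - W H)."""
    residual = np.asarray(V, dtype=float) - W @ H
    return np.mean(residual ** 2)


def top_terms(H, terms, n_top=10):
    """For each topic (row of H), list the n_top terms with the largest weights."""
    return [[terms[i] for i in np.argsort(row)[::-1][:n_top]] for row in H]


# Hypothetical toy factors and vocabulary
toy_W = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_H = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]])
toy_V = toy_W @ toy_H
toy_V[0, 0] += 1.0                      # introduce a single error of size 1
toy_terms = ["game", "team", "iran"]

print(mean_squared_error(toy_W, toy_H, toy_V))
for topic_index, words in enumerate(top_terms(toy_H, toy_terms, n_top=2)):
    print("Topic {}: {}".format(topic_index, ", ".join(words)))
```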




Run the code you wrote for the __Using Your NMF Function__ section on the scikit-learn NMF model.  How close is the output to what you found with your own NMF implementation?

__Design an API__:
1. Put your nmf function in an nmf class.
2. Define a function that displays the headlines/titles of the top 10 documents for each topic.
3. Define a function that takes as input a document and displays the top 3 topics it belongs to.
4. Define a function that ensures consistent ordering between your nmf function and the sklearn nmf class.
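As a starting point for items 2 to 4, here is a rough sketch built only on the factor matrices. All function names and toy values are hypothetical, `top_topics` takes a document's row index in W rather than raw text (a simplification), and `match_topics` uses a simple cosine-similarity argmax that can map two reference topics to the same row, so treat it as a heuristic rather than a proper one-to-one assignment.

```python
import numpy as np


def top_documents(W, titles, n_top=10):
    """For each topic (column of W), titles of the n_top heaviest documents."""
    return [
        [titles[i] for i in np.argsort(W[:, topic])[::-1][:n_top]]
        for topic in range(W.shape[1])
    ]


def top_topics(W, doc_index, n_top=3):
    """Indices of the n_top heaviest topics for one document (row of W)."""
    return np.argsort(W[doc_index])[::-1][:n_top].tolist()


def match_topics(H_mine, H_reference):
    """For each reference topic, index of the most similar row in H_mine."""
    eps = 1e-12  # avoids division by zero for all-zero rows
    mine = np.asarray(H_mine, dtype=float)
    ref = np.asarray(H_reference, dtype=float)
    mine = mine / (np.linalg.norm(mine, axis=1, keepdims=True) + eps)
    ref = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + eps)
    similarity = ref @ mine.T            # cosine similarity, reference x mine
    return similarity.argmax(axis=1).tolist()


# Hypothetical toy inputs
toy_W = np.array([
    [0.9, 0.1, 0.0, 0.3],
    [0.2, 0.8, 0.1, 0.0],
    [0.1, 0.3, 0.7, 0.2],
])
toy_titles = ["a", "b", "c"]
print(top_documents(toy_W, toy_titles, n_top=1))
print(top_topics(toy_W, 0))

toy_reference = [[1, 0, 0], [0, 1, 0]]
toy_mine = [[0, 2, 0.1], [3, 0, 0.1]]
order = match_topics(toy_mine, toy_reference)
print(order)                              # reorder with np.asarray(toy_mine)[order]
```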

<br>
<br>
<br>

---