Clustering and Topic Modeling using NMF

0) Get the pre-processed [BBC dataset here](http://mlg.ucd.ie/files/datasets/bbc.zip)
1) The articles were collected from 5 different topics, and the data is already pre-processed
2) Use the data to build a sparse matrix (or a regular matrix)
3) Run NMF to first cluster the articles
4) Use NMF to attempt topic modeling

More details on each step:

0) bbc.terms is a list of words, and bbc.docs is a list of articles, listed by topic. bbc.mtx is a list of triples: the first column is the wordID, the second is the articleID, and the third is the number of times that word appeared in that article.

1) To read a file into a list (each line will be a string in that list), you can do this:
```python
    with open(filename) as f:
        content = f.readlines()
```
2) To build a matrix, all you need is the bbc.mtx file. Look up the COO (coordinate) sparse matrix format, which stores exactly this kind of (row, column, value) triple. Or just build a regular dense matrix.
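As a rough illustration of step 2, here is a minimal sketch that builds a COO matrix from a few made-up (wordID, articleID, count) triples. The triples, `n_terms`, and `n_docs` are hypothetical toy values, and the sketch assumes the IDs are 1-based and that we want one row per article.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical (wordID, articleID, count) triples in the bbc.mtx layout.
# Assumption: IDs start at 1, so we shift them to 0-based indices.
triples = np.array([[1, 1, 2],
                    [1, 2, 1],
                    [3, 1, 4],
                    [2, 2, 3]])
n_terms, n_docs = 3, 2  # toy sizes, not the real dataset dimensions

rows = triples[:, 1] - 1    # articleID -> row index
cols = triples[:, 0] - 1    # wordID    -> column index
counts = triples[:, 2]

# One row per article, one column per word.
doc_term = coo_matrix((counts, (rows, cols)), shape=(n_docs, n_terms))
dense = doc_term.toarray()
print(dense)
```

Putting articles on the rows means that a later `fit_transform` call returns one row of topic weights per article.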

3) Once you have the matrix, it's just a matter of calling NMF with the number of components set to five (luckily, we know that there are 5 types of articles in our dataset). Given a matrix with one row per article, NMF will return an N x 5 matrix, where N is the number of articles. For each article, it says how strongly the article belongs to each of the 5 topics. Take the max and assign each article to the corresponding cluster. Since we have the topic for each article (in the bbc.docs file), we can compare and see how well our clustering performed.
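The assignment and comparison in step 3 can be sketched as follows. The weight matrix `W` and the `true_labels` array are made-up toy values standing in for the NMF output and the topics from bbc.docs, and purity is just one possible way to score the clustering, not necessarily the one intended here.

```python
import numpy as np

# Toy article-by-component weights (4 articles x 2 components),
# standing in for the N x 5 matrix returned by NMF.
W = np.array([[0.9, 0.1],
              [0.8, 0.3],
              [0.2, 0.7],
              [0.6, 0.5]])
# Hypothetical true topics for the same 4 articles.
true_labels = np.array(["sport", "sport", "tech", "tech"])

# Assign each article to the component with the largest weight.
clusters = W.argmax(axis=1)

# Purity: each cluster votes for its most common true topic, and we
# count the fraction of articles that agree with their cluster's vote.
correct = 0
for k in np.unique(clusters):
    _, counts = np.unique(true_labels[clusters == k], return_counts=True)
    correct += counts.max()
purity = correct / len(true_labels)

print(clusters, purity)
```

Because the component order returned by NMF is arbitrary, the cluster numbers cannot be compared to topic names directly; a majority vote per cluster sidesteps that.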

4) NMF.components_ is a 5 x M matrix, where M is the number of words. Each component (topic) has a long vector of word weights associated with it. Find the 5 to 10 words with the largest weights for each topic, print them out, and see if we can come up with a topic name based on each set of words.
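One way to pull out the top words in step 4 is to sort each row of the components matrix. In this sketch, `H`, `terms`, and the helper `top_terms` are hypothetical stand-ins for `NMF.components_`, the contents of bbc.terms, and whatever helper you write yourself.

```python
import numpy as np

# Toy vocabulary and a toy 2 x 5 components matrix (topics x words).
terms = ["goal", "match", "chip", "software", "team"]
H = np.array([[0.9, 0.7, 0.0, 0.1, 0.8],
              [0.0, 0.1, 0.9, 0.8, 0.05]])

def top_terms(H, terms, n):
    """Return the n highest-weighted words for each row (topic) of H."""
    result = []
    for row in H:
        idx = np.argsort(row)[::-1][:n]  # indices of the largest weights first
        result.append([terms[i] for i in idx])
    return result

top = top_terms(H, terms, 3)
for k, words in enumerate(top):
    print(f"topic {k}: {', '.join(words)}")
```

If the terms are read with `readlines()`, strip the trailing newline from each word before printing.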

In [1]:
with open('bbc/bbc.classes') as f:
    classes = f.readlines()
    
with open('bbc/bbc.docs') as f:
    docs = f.readlines()
    
with open('bbc/bbc.mtx') as f:
    mtx = f.readlines()
    
with open('bbc/bbc.terms') as f:
    terms = f.readlines()

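# Drop the first two lines of bbc.mtx, which are header lines
# rather than (wordID, articleID, count) entries
mtx = mtx[2:]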

In [2]:
mat = []
for line in mtx:
    mat.append(line.split())

In [3]:
import pandas as pd

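data = pd.DataFrame(mat)
# Use a flat list of names; a nested list would create a MultiIndex
data.columns = ['r', 'c', 'data']

# Convert the strings to integers, going through float first
# so that values written like "1.0" also parse
data = data.astype(float).astype(int)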

In [4]:
data.head()

|   | r | c | data |
|---|---|---|------|
| 0 | 1 | 1 | 1 |
| 1 | 1 | 7 | 2 |
| 2 | 1 | 11 | 1 |
| 3 | 1 | 14 | 1 |
| 4 | 1 | 15 | 2 |


In [6]:
from scipy.sparse import coo_matrix
import numpy as np

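# r holds the wordIDs and c the articleIDs, so rows are words and columns are articles
r = np.array(data['r'])
c = np.array(data['c'])
d = np.array(data['data'])

# The IDs appear to be 1-based; used as-is, they leave row 0 and column 0 empty,
# which makes each dimension one larger than the number of words/articles
coo_mat = coo_matrix((d, (r, c)))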

In [7]:
coo_mat.shape

(9636, 2226)
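If bbc.mtx is a standard Matrix Market file (the two header lines skipped earlier suggest so, but that is an assumption), `scipy.io.mmread` can load it directly and handles the 1-based indices for us. The sketch below writes a tiny made-up file to a temporary directory so that it runs on its own; with the real data you would pass the path to bbc.mtx instead.

```python
import os
import tempfile
from scipy.io import mmread

# A tiny made-up Matrix Market file: 3 words x 2 articles, 4 non-zero counts.
mm_text = """%%MatrixMarket matrix coordinate real general
3 2 4
1 1 1.0
1 2 2.0
3 1 1.0
2 2 5.0
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.mtx")
    with open(path, "w") as f:
        f.write(mm_text)
    term_doc = mmread(path)  # sparse COO matrix, words x articles

# Transpose so that each row is an article, then convert to CSR
# for efficient row access.
doc_term = term_doc.T.tocsr()
print(term_doc.shape, doc_term.shape)
```

Read this way, the matrix has no empty row 0 or column 0, so its shape matches the number of words and articles exactly.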

In [8]:
from sklearn.decomposition import NMF

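# fit_transform returns the transformed matrix, not the fitted model.
# The rows of coo_mat are words, so this gives one row of weights per word;
# pass coo_mat.T instead to get one row per article.
# Note: newer scikit-learn releases replace `alpha` with `alpha_W` and `alpha_H`.
nmf = NMF(n_components=5, random_state=1, alpha=.1, l1_ratio=.5).fit_transform(coo_mat)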

In [9]:
nmf.shape

(9636, 5)
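The result above has one row per row of the input matrix. To get the article-by-topic weights described in step 3 together with the word weights from step 4, one option is to keep the fitted model and feed it a matrix with articles on the rows. The sketch below uses a made-up 4 x 4 count matrix and 2 components; the `init` and `max_iter` settings are illustrative choices, not values taken from this notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

# Toy article-by-word counts (4 articles x 4 words), made up for illustration.
X = csr_matrix(np.array([[2, 1, 0, 0],
                         [4, 2, 0, 0],
                         [0, 0, 1, 3],
                         [0, 0, 2, 6]], dtype=float))

# Keep the model object so that components_ is still available afterwards.
model = NMF(n_components=2, init="nndsvd", random_state=1, max_iter=500)
W = model.fit_transform(X)   # (n_articles, n_components): article weights
H = model.components_        # (n_components, n_words): word weights

clusters = W.argmax(axis=1)  # cluster assignment per article
print(W.shape, H.shape, clusters)
```

For the BBC data, the same pattern would use `n_components=5` and the article-by-word version of the matrix.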