<a href="https://colab.research.google.com/github/biku1998/NLP-Notebooks/blob/master/NLP_03_Topic_Modeling_NMF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### In this notebook we will apply topic modeling using NMF, i.e. Non-negative Matrix Factorization

Recall that the Vh matrix we get from the SVD decomposition contains some `negative` values. NMF instead constrains every entry of its factors to be `non-negative` (zero or positive). Also, **NMF** results in 2 matrices instead of 3.

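Suppose we have a non-negative data matrix $V$. After NMF decomposition we get 2 matrices, $W$ and $H$.
So $V \approx WH$, where both $W$ and $H$ are `non-negative` matrices. For our documents, $W$ has shape (documents × topics) and $H$ has shape (topics × vocabulary).
Also, **NMF** is a `non-unique` decomposition, whereas the SVD is essentially unique (up to sign flips, when the singular values are distinct).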

<img src = "./NMF_rep.png"></img>
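The non-uniqueness mentioned above can be checked on a toy example. The sketch below uses made-up matrices (`W`, `H`, and the scaling matrix `D` are illustrative values, not taken from the newsgroups data) and shows that rescaling the factors gives a different non-negative pair with the same product.

```python
import numpy as np

# Toy non-negative factors (hypothetical values, chosen only for illustration)
W = np.array([[1.0, 0.0],
              [0.5, 2.0],
              [0.0, 1.0]])
H = np.array([[2.0, 0.0, 1.0, 3.0],
              [0.0, 4.0, 1.0, 0.5]])
V = W @ H  # the matrix we pretend to have factorized

# Any positive diagonal matrix D yields another valid non-negative pair,
# because (W D)(D^-1 H) = W H
D = np.diag([2.0, 0.5])
W2 = W @ D
H2 = np.linalg.inv(D) @ H

print(np.allclose(W2 @ H2, V))  # True: same product
print(np.allclose(W, W2))       # False: different factors
```

Rescaling is only the simplest source of ambiguity; permuting the topics (columns of $W$ together with rows of $H$) is another.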

### Let's implement NMF using sklearn

**We will use the same newsgroups data as before.**

In [0]:
# basic imports

import nltk
nltk.download('wordnet')
from nltk import stem

import spacy

import numpy as np
import matplotlib.pyplot as plt

from scipy import linalg

from sklearn.datasets import fetch_20newsgroups
from sklearn import decomposition
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

[nltk_data] Downloading package wordnet to /Users/biku/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [0]:
# load the data

# we will only work with 4 categories to keep things simple and easy to understand
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

# we will remove below attributes, as we only want articles text
remove = ('headers', 'footers', 'quotes')

# load the data for train and test mode
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

In [0]:
# explore the data a little bit

# how many data points we have ?
print(newsgroups_train.filenames.shape,newsgroups_train.target.shape)

(2034,) (2034,)


In [0]:
target_names = newsgroups_train.target_names
target_names

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

In [0]:
# look at some samples

idx = np.random.choice(2034)
# idx = 0

print(f"article / document : {newsgroups_train.data[idx]}\
        \n=================================\ncategory\
      : {target_names[newsgroups_train.target[idx]]}")

article / document : 
This sounds wonderful, but it seems no one either wants to spend time doing
this, or they don't have the power to do so.  For example, I would like
to see a comp.graphics architecture like this:

comp.graphics.algorithms.2d
comp.graphics.algorithms.3d
comp.graphics.algorithms.misc
comp.graphics.hardware
comp.graphics.misc
comp.graphics.software/apps

However, that is almost overkill.  Something more like this would probably
make EVERYONE a lot happier:

comp.graphics.programmer
comp.graphics.hardware
comp.graphics.apps
comp.graphics.misc

It would be nice to see specialized groups devote to 2d, 3d, morphing,
raytracing, image processing, interactive graphics, toolkits, languages,
object systems, etc. but these could be posted to a relevant group or
have a mailing list organized.

That way when someone reads news they don't have to see these subject
headings, which are rather disparate:

System specific stuff ( should be under comp.sys or comp.os.???.programmer ):


In [0]:
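# TF-IDF

# stop_words='english' drops common English stop words from the vocabulary
vectorizer_tfidf = TfidfVectorizer(stop_words='english')
# each row of the TF-IDF matrix is L2-normalized by default
# we keep the result sparse: NMF accepts sparse input, and the np.matrix
# returned by .todense() is rejected by newer scikit-learn versions
vectors_tfidf = vectorizer_tfidf.fit_transform(newsgroups_train.data)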
vectors_tfidf.shape

(2034, 26576)
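To make the TF-IDF weighting concrete, here is a minimal NumPy sketch on a made-up count matrix. It assumes the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`); the `counts` values are hypothetical and only meant to show the arithmetic.

```python
import numpy as np

# Hypothetical term counts: 3 documents x 4 terms
counts = np.array([[2, 1, 0, 0],
                   [0, 1, 3, 0],
                   [1, 1, 0, 2]], dtype=float)

n_docs = counts.shape[0]

# document frequency: in how many documents does each term appear?
df = (counts > 0).sum(axis=0)

# smoothed inverse document frequency (assumed default formula)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# weight the counts, then scale every row to unit L2 length
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

print(idf)
print(np.linalg.norm(tfidf, axis=1))
```

A term that appears in every document (the second column here) gets the smallest idf, so rarer terms dominate each row after weighting.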

**Using sklearn implementation of NMF**

In [0]:
from sklearn.decomposition import NMF

In [0]:
vectors_tfidf.shape

(2034, 26576)

In [0]:
# in NMF we have to choose the number of topics; treat it as a hyperparameter

no_of_topics = 5

nmf = NMF(n_components = no_of_topics,random_state = 1)

W1 = nmf.fit_transform(vectors_tfidf)
H1 = nmf.components_

In [0]:
H1.shape

(5, 26576)
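To get a feel for what a fit like this does, below is a minimal sketch of NMF with multiplicative updates for the Frobenius loss. This is an illustration under stated assumptions, not sklearn's implementation: to my knowledge sklearn's `NMF` uses a coordinate descent solver by default. The helper `nmf_multiplicative` and the toy matrix `V` are hypothetical.

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate V (n x m, non-negative) as W (n x k) @ H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # a non-negative random start; the updates below never flip a sign
    W = rng.random((n, k))
    H = rng.random((k, m))
    errors = []
    for _ in range(n_iter):
        # multiplicative updates; eps guards against division by zero
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))  # Frobenius error
    return W, H, errors

# toy non-negative matrix of rank at most 3 (6 "documents" x 8 "terms")
rng = np.random.default_rng(42)
V = rng.random((6, 3)) @ rng.random((3, 8))

W, H, errors = nmf_multiplicative(V, k=3)
print(W.shape, H.shape)       # (6, 3) (3, 8)
print(errors[0], errors[-1])  # error after the first and the last iteration
```

The shapes mirror the notebook: `W` has one row per document and `H` has one row per topic, just like `W1` and `H1` above.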

In [0]:
# get_feature_names() was removed in newer scikit-learn versions; use get_feature_names_out()
vocab = np.array(vectorizer_tfidf.get_feature_names_out())

print(len(vocab))

# look at some samples
print(vocab[8000:8010])

26576
['detects' 'deter' 'deteriorated' 'deterioration' 'determinant'
 'determination' 'determinations' 'determine' 'determined' 'determines']


In [0]:
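def print_topics(Vh,vocab,no_of_words):
    # Vh holds one row per topic (for NMF we pass H); each row scores every vocab word
    res = []
    for v in Vh:
        # pair each word with its weight in this topic
        vocab_components = zip(vocab,v)

        # sort the pairs by weight (descending) and keep the top no_of_words
        sorted_components = sorted(vocab_components,key = lambda x:x[1],reverse = True)[:no_of_words]
        for c in sorted_components:
            res.append(c[0])
    # res is one flat list: the first no_of_words entries belong to topic 0,
    # the next no_of_words to topic 1, and so on
    print(res)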

In [0]:
no_of_words = 8

print_topics(H1,vocab,no_of_words)

['people', 'don', 'think', 'just', 'like', 'objective', 'say', 'morality', 'graphics', 'thanks', 'files', 'image', 'file', 'program', 'windows', 'know', 'space', 'nasa', 'launch', 'shuttle', 'orbit', 'moon', 'lunar', 'earth', 'ico', 'bobbe', 'tek', 'beauchaine', 'bronx', 'manhattan', 'sank', 'queens', 'god', 'jesus', 'bible', 'believe', 'christian', 'atheism', 'does', 'belief']
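The flat list above is easier to read one topic at a time, and $W$ can be read the same way to find each document's strongest topic. The sketch below shows both with NumPy on small made-up factors; `vocab`, `H` and `W` here are hypothetical stand-ins, not the fitted `H1` and `W1`.

```python
import numpy as np

# Hypothetical stand-ins for the fitted factors (3 topics, 6 words, 3 documents)
vocab = np.array(["god", "image", "orbit", "bible", "file", "nasa"])
H = np.array([[0.9, 0.0, 0.1, 0.8, 0.0, 0.0],   # topic 0
              [0.0, 0.7, 0.0, 0.0, 0.9, 0.1],   # topic 1
              [0.1, 0.0, 0.8, 0.0, 0.0, 0.9]])  # topic 2
W = np.array([[0.6, 0.1, 0.0],   # document 0
              [0.0, 0.2, 0.7],   # document 1
              [0.1, 0.9, 0.0]])  # document 2

top_n = 2

# indices of the top_n largest weights in every row of H, largest first
top_idx = np.argsort(H, axis=1)[:, ::-1][:, :top_n]
top_words = vocab[top_idx]

for topic, words in enumerate(top_words):
    print(topic, list(words))

# the topic with the largest weight in each row of W
dominant_topic = W.argmax(axis=1)
print(dominant_topic)  # [0 2 1]
```

With the real factors the same indexing would apply to `H1`, `W1` and the notebook's `vocab`, with `top_n` playing the role of `no_of_words`.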


**Some more points about NMF**
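- NMF is an approximation: with only a few topics, $WH$ is close to, but generally not equal to, the original matrix. Because the decomposition is also not unique, a different initialization (e.g. another `random_state`) can give different $W$ and $H$.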