# TP4 - Non-negative Matrix Factorization
The goal is to study the use of nonnegative matrix factorisation (NMF) for topic extraction from a dataset of text documents. The rationale is to interpret each extracted NMF component as being associated with a specific topic. 

Study and test the following script (introduced  on [scikit](http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html))

In [1]:
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

In [2]:
def vectorizeFeatures(_vectorizer=None, _random_state=None):
    # Set default params
    if _vectorizer is None:
        vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')
    else:
        vectorizer = _vectorizer
    random_state = 1 if _random_state is None else _random_state
    # Fetch data and vectorize
    print("Loading dataset...")
    dataset = fetch_20newsgroups(shuffle=True, random_state=random_state,
                                 remove=('headers', 'footers', 'quotes'))
    data_samples = dataset.data[:2000]        
    t0 = time()
    features = vectorizer.fit_transform(data_samples)
    feature_names = vectorizer.get_feature_names()
    print("done in %0.3fs." % (time() - t0))
    return features, feature_names

In [3]:
def NMFModel(features, _vectorizerName=None, _random_state=None, 
             _beta_loss=None, _init=None, _W=None, _H=None, _K = None):
    
    n_samples = 2000
    n_features = 1000
    n_top_words = 20
    n_components = 10 if _K is None else _K
    vectorizerName = "tf_idf" if _vectorizerName is None else _vectorizerName
    random_state = 1 if _random_state is None else _random_state
    solver = 'cd' if _beta_loss is None else 'mu'
    beta_loss = 'frobenius' if _beta_loss is None else _beta_loss
    init = 'random' if _init is None else _init
    
    print("Fitting the NMF model ("+beta_loss+" norm) with "+vectorizerName+" features, "
          "n_samples=%d and n_features=%d..." % (n_samples, n_features))
    
    t0 = time()
    if _init is None:
        nmf = NMF(n_components=n_components, 
                  random_state=_random_state,
                  solver = solver,
                  beta_loss = beta_loss,
                  alpha=.1, l1_ratio=.5).fit(features)
    else:
        nmf = NMF(n_components=n_components, 
                  random_state=_random_state,
                  solver = solver,
                  beta_loss = beta_loss,
                  alpha=.1, l1_ratio=.5)
        nmf.fit_transform(features, W=_W, H=_H)
    print("done in %0.3fs." % (time() - t0))

    print("\nTopics in NMF model ("+beta_loss+" norm):")
    return nmf, n_top_words

In [4]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

In [5]:
def runExample(_vectorizer=None, _vectorizerName=None, _random_state=None, _beta_loss=None, 
               _init=None, _W=None, _H=None, _K=None):
    features, feature_names = vectorizeFeatures()
    nmf, n_top_words = NMFModel(features, _vectorizerName, _random_state, _beta_loss, _init, _W, _H, _K)
    print_top_words(nmf, feature_names, n_top_words)

### Q1. Test and comment on the effect of varying the initialisation, especially using random nonnegative values as initial guesses (for W and H coefficients, using the notations introduced during the lecture).

In [6]:
runExample()

Loading dataset...
done in 0.354s.
Fitting the NMF model (frobenius norm) with tf_idf features, n_samples=2000 and n_features=1000...
done in 0.330s.

Topics in NMF model (frobenius norm):
Topic #0: just people don think like know time good make way really say right ve want did ll new use years
Topic #1: windows use dos using window program os drivers application help software pc running ms screen files version card code work
Topic #2: god jesus bible faith christian christ christians does heaven sin believe lord life church mary atheism belief human love religion
Topic #3: thanks know does mail advance hi info interested email anybody looking card help like appreciated information send list video need
Topic #4: car cars tires miles 00 new engine insurance price condition oil power speed good 000 brake year models used bought
Topic #5: edu soon com send university internet mit ftp mail cc pub article information hope program mac email home contact blood
Topic #6: file problem files for

In [7]:
runExample(_random_state=29)

Loading dataset...
done in 0.387s.
Fitting the NMF model (frobenius norm) with tf_idf features, n_samples=2000 and n_features=1000...
done in 0.344s.

Topics in NMF model (frobenius norm):
Topic #0: just people don think like know time good make way really say right ve want did ll new use years
Topic #1: windows use dos using window program os drivers application help software pc running ms screen files version card code work
Topic #2: god jesus bible faith christian christ christians does heaven sin believe lord life church mary atheism belief human love religion
Topic #3: thanks know does mail advance hi info interested email anybody looking card help like appreciated information send list video need
Topic #4: car cars tires miles 00 new engine insurance price condition oil power speed good 000 brake year models used bought
Topic #5: edu soon com send university internet mit ftp mail cc pub article information hope program mac email home contact blood
Topic #6: file problem files for

In [8]:
runExample(_random_state=69)

Loading dataset...
done in 0.365s.
Fitting the NMF model (frobenius norm) with tf_idf features, n_samples=2000 and n_features=1000...
done in 0.333s.

Topics in NMF model (frobenius norm):
Topic #0: just people don think like know time good make way really say right ve want did ll new use years
Topic #1: windows use dos using window program os drivers application help software pc running ms screen files version card code work
Topic #2: god jesus bible faith christian christ christians does heaven sin believe lord life church mary atheism belief human love religion
Topic #3: thanks know does mail advance hi info interested email anybody looking card help like appreciated information send list video need
Topic #4: car cars tires miles 00 new engine insurance price condition oil power speed good 000 brake year models used bought
Topic #5: edu soon com send university internet mit ftp mail cc pub article information hope program mac email home contact blood
Topic #6: file problem files for

### Q2. Compare and comment on the difference between the results obtained with 2 cost compared to the generalised Kullback-Liebler cost.

In [9]:
runExample(_beta_loss='kullback-leibler')

Loading dataset...
done in 0.370s.
Fitting the NMF model (kullback-leibler norm) with tf_idf features, n_samples=2000 and n_features=1000...
done in 1.459s.

Topics in NMF model (kullback-leibler norm):
Topic #0: people like just don time make right think know way say want really look said ve probably thing work things
Topic #1: windows thanks using need help use hi work know software looking mail used pc does video running available card info
Topic #2: god does read true know say subject believe says point religion question jesus mean people book mind matter christian life
Topic #3: thanks know like mail interested want just send edu new does list thing post bike email hear reply heard wondering
Topic #4: new year 10 sale old time good offer 20 15 16 30 weeks great test model condition 11 14 power
Topic #5: use number government new university information states data phone provide right 1993 security state large long note edu com used
Topic #6: edu try file soon com remember problem p

### Q3. Test and comment on the results obtained using a simpler term-frequency representation as input (as opposed to the TF-IDF representation considered in the code above) when considering the Kullback-Liebler cost.

In [10]:
_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')    
runExample(_beta_loss='kullback-leibler', _vectorizer=_vectorizer, _vectorizerName="CountVectorizer")

Loading dataset...
done in 0.380s.
Fitting the NMF model (kullback-leibler norm) with CountVectorizer features, n_samples=2000 and n_features=1000...
done in 1.222s.

Topics in NMF model (kullback-leibler norm):
Topic #0: people just like time make don right really know say way think did want ve work things said years let
Topic #1: windows thanks using help need use work hi know software looking mail does used pc available running video advance info
Topic #2: god does true read say know point believe subject religion says mean question jesus people life christian matter fact mind
Topic #3: thanks know mail interested like just edu new send want list does email bike hear reply thing wondering price com
Topic #4: new time year 10 sale old offer 15 16 20 good 30 great high weeks test model 11 condition 14
Topic #5: number government use data states university control state information phone used talk right new provide security 1993 note research support
Topic #6: edu com wrong soon rememb

____________________________________
## Custom NFM Implementation

In [11]:
###### CUSTOM NMF IMPLEMENTATION ######
# Multiplicative Update Rules for NMF #
# estimation with beta divergences    #
import numpy

# TODO: translate slides 59 [beta-divergence] & 47 [error and special cases]

def custom_NMF(V, K, W=None, H=None, steps=50, beta=0, toll=0.1, show_div=False):
    
    F = len(V) #Number of V rows
    N = len(V[0]) #Number of V columns

    if W is None:
        W = numpy.random.rand(F,K)
        
    if H is None:
        H = numpy.random.rand(K,N)
        
    if N != len(H[0]):
        raise ValueError("Size for H[0] is different - found "+str(len(H[0]))+" in place of "+str(N))
    if F != len(W):
        raise ValueError("Size for F is different - found "+str(len(F))+" in place of "+str(N))
        
    #Setup n_iter
    n_iter = 1
    
    # Setup initial error
    init_error = _beta_div(V,W,H,beta,F,N,K)
    if show_div:
        print("Initial error: "+str(init_error))
    error = init_error
    
    for step in range(steps):
    
#         Tests with whole matrix : multiply = O | dot = *
        upd_UP = numpy.dot(W.T, numpy.multiply(pow(numpy.dot(W,H),beta-2), V))
        upd_DOWN = numpy.dot(W.T, pow(numpy.dot(W,H),beta-1))
        upd = upd_UP / upd_DOWN
        H = numpy.multiply(H, upd)
        
        upd_UP = numpy.dot(numpy.multiply(pow(numpy.dot(W,H),beta-2), V),H.T)
        upd_DOWN = numpy.dot(pow(numpy.dot(W,H),beta-1), H.T)
        upd = upd_UP / upd_DOWN
        W = numpy.multiply(W, upd)
        
        if toll > 0:
            new_error = _beta_div(V,W,H,beta,F,N,K)
            if show_div:
                print("Error on iteration "+str(n_iter)+": " +str(new_error))
            # Check if approximation error relative decrease is below the desired threshold
            if ((error - new_error) / init_error) < toll:
                break
            error = new_error
            
        n_iter += 1
            
    return W, H

def _beta_div(V,W,H,beta,F,N,K):
    div = 0
    # Update beta_divergence
    WH = numpy.dot(W, H)
    for i in range(F):
        for j in range(N):
                x = V[i][j] if V[i][j] != 0 else numpy.finfo(numpy.double).tiny
                y = WH[i][j]
                if beta == 1: # generalized Kullback-Leibler divergence. x log(x/y) - x + y
                    div += x*numpy.log(x/y) - x + y
                elif beta == 0: # Itakura-Saito divergence. (x/y) - log(x/y) -1
                    div += (x/y) * numpy.log(x/y) - 1
                else: # Euclidean distance. (1/beta(beta-1))(x^beta + (beta-1)y^beta - beta*x*y^beta-1)
                    div += 1/(beta*(beta-1))*(pow(x,beta) + (beta-1)*pow(y,beta) - beta*x*pow(y,beta-1))
    return div

#######

In [12]:
features, feature_names = vectorizeFeatures()

V = numpy.random.rand(features.shape[0], features.shape[1])
V = numpy.array(V) # Data matrix F x N 
K = 10

W, H = custom_NMF(V, K, beta = 1, toll = 0.001, show_div = True)

Loading dataset...
done in 0.357s.
Initial error: 2612699.3198669804
Error on iteration 1: 198298.79901250819
Error on iteration 2: 197609.62437157347


In [13]:
runExample(_init='custom', _W=W, _H=H, _K=K)

Loading dataset...
done in 0.356s.
Fitting the NMF model (frobenius norm) with tf_idf features, n_samples=2000 and n_features=1000...
done in 0.318s.

Topics in NMF model (frobenius norm):
Topic #0: just people don think like know time good make way really say right ve want did ll new use years
Topic #1: windows use dos using window program os drivers application help software pc running ms screen files version card code work
Topic #2: god jesus bible faith christian christ christians does heaven sin believe lord life church mary atheism belief human love religion
Topic #3: thanks know does mail advance hi info interested email anybody looking card help like appreciated information send list video need
Topic #4: car cars tires miles 00 new engine insurance price condition oil power speed good 000 brake year models used bought
Topic #5: edu soon com send university internet mit ftp mail cc pub article information hope program mac email home contact blood
Topic #6: file problem files for