## Natural language processing and non-negative matrix factorization (NMF) for topic modeling

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import NMF, LatentDirichletAllocation
import TCD19_utils as TCD
TCD.set_plotting_style_2()

# Magic command to enable plotting inside notebook
%matplotlib inline
# Magic command to enable svg format in plots
%config InlineBackend.figure_format = 'svg'
# Set random seed for reproducibility (np.random.seed returns None, so store the value separately)
seed = 42
np.random.seed(seed)

## Natural language processing (NLP) vocabulary

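Text is by itself an unstructured data type. The following terms describe how it is turned into structured data:

* corpus: a set of documents.
* stop words: very common words (e.g. "the", "and", "of") that carry little information about the content of a document and are usually removed before analysis.
* tokenization: breaks unstructured data, in this case text, into chunks of information (tokens) that can be counted as discrete elements. The counts of token occurrences in a document can be used directly as a vector representing that document. This turns an unstructured string (a text document) into a structured, numerical data structure suitable for machine learning.
* n-gram: a contiguous sequence of n tokens (n = 1 gives single words, or unigrams; n = 2 gives pairs of adjacent words, or bigrams).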
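The vocabulary above can be illustrated with a small sketch. The two sentences are a hypothetical toy corpus (not taken from the newsgroups data), and `CountVectorizer` is used here only to show tokenization, stop-word removal, and n-grams side by side.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: two short "documents"
toy_corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Tokenize into single words (unigrams) and drop English stop words
unigram_vectorizer = CountVectorizer(stop_words="english")
unigram_counts = unigram_vectorizer.fit_transform(toy_corpus)
unigram_vocab = sorted(unigram_vectorizer.vocabulary_)

# Keep stop words and extract both unigrams and bigrams
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(toy_corpus)
bigram_vocab = sorted(bigram_vectorizer.vocabulary_)

print(unigram_vocab)             # tokens that survive stop-word removal
print(unigram_counts.toarray())  # one row of token counts per document
print(bigram_vocab)              # includes two-word tokens such as "the cat"
```

Each row of `unigram_counts` is the count vector of one document, which is the structured representation described in the tokenization entry.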

In [2]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
documents = dataset.data

In [3]:
type(documents)

list

In [5]:
documents[-1]

'                                      ^^^^^^\nNo argument at all with Murphy.  He scared the hell out of me when he came in\nlast year.  On the other hand, the club though enough of Boever to put him into\nan awful lot of games (he may have led the league in appearances - he did at\nleast at some point).  He seemed to be a very viable setup guy - but I guess\nthat\'s not considered that crucial by the club.  I can just remember two years\nago so well, though...\n...\n\nI\'m not that concerned.  Those guys have been relatively consistent over the\nyears and they have no good reasons to decline (no injuries, not old, ...).\nI expect them to come through just fine.  It\'s those guys that have not\nbeen consistently good that are the worrisome part, even if they are coming\nthrough right now.\n\nThis sounds like their old road unis.  Pretty dull.  Buttons or pullovers?\nI\'ll check through my uniform book to see if they\'ve always had some orange.\n\n\nWell, we\'ll see.  I\'ve got a Astro

We can see that each element of the list is a document containing a newsgroup post. We will now transform our document list into a term frequency-inverse document frequency (TF-IDF) matrix. This matrix has shape (`docs`, `words`), where the words are our features. The values inside the matrix are TF-IDF scores, defined by the following formula.

\begin{align}
\text{TF-IDF} = \text{TF} \times \log{ \frac{N}{DF}}
\end{align}


Where: 
$\text{TF}$ = number of times word $i$ appears in document $j$, $\text{DF}$ (document frequency) = number of documents that contain word $i$, and $N$ = total number of documents.

Notice that if a word appears in every document (as stop words typically do), then $N / DF = 1$ and the logarithmic term on the right becomes zero. In this sense, TF-IDF gives a robust assessment of word frequency in a document, because it removes the bias towards highly repeated words. This makes sense, as words that are ubiquitous across documents are unlikely to carry information about any particular document.
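As a worked example of the formula, here is a minimal NumPy sketch. The word list and count matrix are hypothetical values chosen so that one word ("the") appears in every document.

```python
import numpy as np

# Hypothetical term-count matrix: rows are documents, columns are words
words = ["the", "nmf", "topic", "bike"]
counts = np.array([
    [5, 2, 1, 0],
    [4, 0, 3, 0],
    [6, 0, 0, 2],
])

n_docs = counts.shape[0]
# Document frequency: number of documents that contain each word
doc_freq = (counts > 0).sum(axis=0)
# Inverse document frequency, log(N / DF)
idf = np.log(n_docs / doc_freq)
# TF-IDF: scale each column of counts by the IDF of its word
tfidf_toy = counts * idf

print(dict(zip(words, idf.round(3))))
print(tfidf_toy.round(3))
```

The column for "the" is zero in every row because the word appears in all three documents. Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant of the IDF by default and normalizes each row, so its values will differ from this plain formula.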

The scikit-learn package has a great implementation in the `TfidfVectorizer` class. Some of its important arguments are: `max_df`, the upper cutoff, which ignores words that appear in more than the given proportion of documents; `min_df`, the lower cutoff (if it is an `int`, it is interpreted as a document count); `stop_words`, which lets you pick a language whose stop words are removed; and `max_features`, which keeps only the top words ranked by term frequency across the corpus.
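To make the effect of the cutoffs concrete, here is a small sketch on a hypothetical four-document corpus. It compares the vocabulary kept with and without `max_df` and `min_df`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "python" appears in every document
toy_corpus = [
    "python code for matrix factorization",
    "python code for plotting",
    "python recipes and dinner",
    "python snakes in the garden",
]

# No pruning: every token is kept
full = TfidfVectorizer().fit(toy_corpus)

# Drop words found in more than 75% of the documents (max_df)
# and words found in fewer than 2 documents (min_df)
pruned = TfidfVectorizer(max_df=0.75, min_df=2).fit(toy_corpus)

print(sorted(full.vocabulary_))
print(sorted(pruned.vocabulary_))
```

The ubiquitous word "python" is removed by `max_df`, while words that occur in a single document, such as "dinner", are removed by `min_df`.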

Let's compute the TF-IDF vectorizer on our documents. 

In [None]:
no_features = 1000

# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=no_features,
                                   stop_words='english')

tfidf = tfidf_vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()


In [None]:
type(tfidf)

We can see that our TF-IDF matrix is a SciPy sparse matrix and that we cannot readily visualize its components.

In [None]:
tfidf_feature_names[-1]

We can see that our features are words.

In [None]:
df_tfidf = pd.DataFrame(tfidf.toarray(), columns=tfidf_feature_names)

df_tfidf.head()

In [None]:
df_tfidf.shape

Now we are ready to extract the topics from our documents.

## Non-negative matrix factorization (NMF)

NMF can be thought of as a clustering technique: it finds a decomposition of the samples $\matrix{X}$ into two matrices $\matrix{W}$ and $\matrix{H}$ of non-negative elements by minimizing the distance $d$ between $\matrix{X}$ and the matrix product $\matrix{W}\matrix{H}$. More specifically, with the conventions used by scikit-learn, each matrix represents the following:

* $\matrix{X}$ (document-word matrix) = the input, which records which words appear in which documents.
* $\matrix{W}$ (coefficient matrix) = the membership weights of each document for the topics.
* $\matrix{H}$ (topics) = the topics (clusters) discovered from the documents, each expressed as a set of weights over the words.

\begin{align}
\matrix{X} \approx \matrix{W}\matrix{H}
\end{align}

We won't go into the mathematical details, but note that the scikit-learn implementation minimizes the Frobenius norm by default, the matrix analog of the Euclidean distance. You can use other divergence measures, such as the Kullback-Leibler divergence, by changing the `beta_loss` parameter (this requires `solver='mu'`).
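For intuition about what minimizing the Frobenius norm looks like, here is a minimal NumPy sketch of the classic multiplicative update rules for NMF. It is an illustration under simplifying assumptions (random initialization, fixed number of iterations), not scikit-learn's solver, which uses coordinate descent by default. The function name `nmf_multiplicative` and the toy data are hypothetical.

```python
import numpy as np

def nmf_multiplicative(X, n_components, n_iter=200, eps=1e-10, seed=0):
    """Factorize a non-negative X of shape (n_samples, n_features) as W @ H."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    W = rng.random((n_samples, n_components))
    H = rng.random((n_components, n_features))
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative because only
        # non-negative quantities are multiplied and divided
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        # Frobenius norm of the reconstruction error after this iteration
        errors.append(np.linalg.norm(X - W @ H, "fro"))
    return W, H, errors

# Hypothetical data: a non-negative matrix of rank 2
rng = np.random.default_rng(1)
X_toy = rng.random((6, 2)) @ rng.random((2, 5))

W_toy, H_toy, errors = nmf_multiplicative(X_toy, n_components=2)
print(W_toy.shape, H_toy.shape)
print(errors[0], errors[-1])
```

The reconstruction error shrinks as the updates proceed, while every entry of `W_toy` and `H_toy` stays non-negative.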

In [None]:
import sklearn

In [None]:
sklearn.__version__

In [None]:
n_topics = 10

# Run NMF on the TF-IDF matrix (fixed random_state so that results are reproducible)
nmf = NMF(n_components=n_topics, random_state=42).fit(tfidf)

In [None]:
# W holds the document-topic weights, H holds the topic-word weights
nmf_W = nmf.transform(tfidf)
nmf_H = nmf.components_


In [None]:
print('Document-topic (W) matrix has a', nmf_W.shape, 'shape')
print('Topic-word (H) matrix has a', nmf_H.shape, 'shape')

We can see that the $\matrix{W}$ matrix has a (`n_documents`, `n_topics`) shape.

We can readily see that the NMF $\matrix{H}$ matrix has a shape of (`n_topics`, `n_features`) .

Therefore, to get the topic assigned to each document, we take the index of the largest weight (the argmax) in each row of $\matrix{W}$.
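The argmax step can be sketched on a small weight matrix. The values below are hypothetical; the last row is all zeros to show an edge case worth keeping in mind.

```python
import numpy as np

# Hypothetical document-topic weights: 4 documents, 4 topics
W_toy = np.array([
    [0.0, 0.7, 0.1, 0.2],
    [0.5, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.0],  # e.g. a document with no words in the vocabulary
])

# Index of the largest weight in each row
dominant_topic = W_toy.argmax(axis=1)
# Rows with at least one non-zero weight
has_signal = W_toy.sum(axis=1) > 0

print(dominant_topic)  # [1 0 3 0]
print(has_signal)      # [ True  True  True False]
```

A row of zeros is assigned index 0 by `argmax`, so documents with no weight on any topic end up grouped with topic 0 unless they are filtered out first.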

In [None]:
# Dominant topic per document: index of the largest weight in each row of W
nmf_topics = nmf_W.argmax(axis=1)

In [None]:
def display_topics(model, feature_names, no_top_words):
    """Print the highest-weighted words of each topic in model.components_."""
    for topic_idx, topic in enumerate(model.components_):
        # Indices of the no_top_words largest weights, in descending order
        top_indices = topic.argsort()[::-1][:no_top_words]
        print("Topic %d:" % topic_idx)
        print(" ".join(feature_names[i] for i in top_indices))

In [None]:
no_top_words = 15
display_topics(nmf, tfidf_feature_names, no_top_words)

Voilà. We have our topics and the most important words associated with them. From these words we can write a descriptive label for each topic by hand; the resulting list will serve for visualization purposes.

In [None]:
classes = ['random','video','catholic church','gamers', 'bike / car selling',
           'email','windows','computer science','cybersecurity',
           'hardware']

Now, let's see whether applying PCA to the document-topic matrix $\matrix{W}$ lets us visualize the documents in a two-dimensional space.

In [None]:
palette = TCD.palette(cmap = True)

In [None]:
from sklearn.decomposition import PCA

pca = PCA(random_state = 42)

doc_pca = pca.fit_transform(nmf_W)

In [None]:
plt.figure(figsize = (8,6))

plt.scatter(doc_pca[:,0], doc_pca[:, 1], alpha = 0.8,
            c = nmf_topics, cmap = palette.reversed())

plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
cbar = plt.colorbar(boundaries=np.arange(11)-0.5)
cbar.set_ticks(np.arange(10))
cbar.set_ticklabels(classes)

#plt.savefig('news_pca.png', dpi = 500, bbox_inches = 'tight')
plt.tight_layout();


In [None]:
from umap import UMAP

In [None]:
reducer = UMAP(random_state = 42)

doc_umap = reducer.fit_transform(nmf_W)

In [None]:
plt.figure(figsize = (8,6))

plt.scatter(doc_umap[:,0], doc_umap[:, 1], alpha = 0.8, 
            c = nmf_topics, cmap = palette.reversed())

plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
cbar = plt.colorbar(boundaries=np.arange(11)-0.5)
cbar.set_ticks(np.arange(10))
cbar.set_ticklabels(classes)

#plt.savefig('news_UMAP.png', dpi = 500, bbox_inches = 'tight')
plt.tight_layout();

In [None]:
# Supervised UMAP: the labels are passed to fit_transform, not to the constructor
reducer = UMAP(random_state = 42)

doc_umap = reducer.fit_transform(nmf_W, y = nmf_topics)

In [None]:
plt.figure(figsize = (8,6))

plt.scatter(doc_umap[:,0], doc_umap[:, 1], alpha = 0.8, 
            c = nmf_topics, cmap = palette.reversed())

plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
cbar = plt.colorbar(boundaries=np.arange(11)-0.5)
cbar.set_ticks(np.arange(10))
cbar.set_ticklabels(classes)

#plt.savefig('news_UMAP_learn.png', dpi = 500, bbox_inches = 'tight')
plt.tight_layout();

We have seen an end-to-end clustering and visualization pipeline using NMF, PCA, and UMAP. We could also use another method for topic modeling in text called Latent Dirichlet Allocation (LDA). The implementation is very similar; to see how to do it, follow this [great post](https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730).
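As a rough idea of what the LDA version could look like, here is a minimal sketch on a hypothetical four-sentence corpus. It assumes raw word counts from `CountVectorizer` as input, since LDA models counts rather than TF-IDF weights; the parameter values are illustrative only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus with two loose themes
toy_corpus = [
    "the pitcher threw a fastball in the ninth inning",
    "the team won the baseball game in extra innings",
    "install the graphics driver before running windows",
    "the windows driver crashed after the update",
]

# Raw term counts with English stop words removed
count_vectorizer = CountVectorizer(stop_words="english")
counts = count_vectorizer.fit_transform(toy_corpus)

# Fit LDA and get the per-document topic distribution
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)

print(doc_topic.shape)        # (n_documents, n_topics)
print(doc_topic.sum(axis=1))  # each row is a distribution over topics
print(lda.components_.shape)  # (n_topics, n_words)
```

As with NMF, `lda.components_` holds the topic-word weights, so a helper that prints the top words of each row of `components_` can be reused unchanged.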

Other than clustering, NMF can be applied to collaborative filtering and image analysis.