# NMF Decomposition and K-Means Clustering on tf-idf Wikipedia Text with scikit-learn

## By Christopher Hauman
<br>

### This example of NMF Decomposition was adapted from DataCamp's [Unsupervised Learning in Python](https://www.datacamp.com/courses/unsupervised-learning-in-python) course. This is a sequel to my walkthrough of [NMF Decomposition on LCD Digits Data](https://nbviewer.jupyter.org/github/chrisman1015/Unsupervised-Learning/blob/master/NMF%20Decomposition%20on%20LCD%20Digits/NMF%20Decomposition%20on%20LCD%20Digits.ipynb). This will add some complexity to that example, so please check that out if you need an introduction or a refresher. 

### Note: This assumes basic knowledge of Python data science tools. If you don't have that, or you encounter something you're not familiar with, don't worry! You can get a crash course in my guide, [Cleaning MLB Statcast Data using pandas DataFrames and seaborn Visualization](https://nbviewer.jupyter.org/github/chrisman1015/Cleaning-Statcast-Data/blob/master/Cleaning%20Statcast%20Data/Cleaning%20Statcast%20Data.ipynb).
<br>

Just as NMF can deconstruct images into their component patterns, it can deconstruct text data into common themes or topics.
<br>
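
To make the idea of "topics" concrete before turning to the Wikipedia data, here is a minimal sketch on a made-up corpus. The four sentences, the variable names, and the choice of two components are all assumptions for illustration and are not part of the original example. It only shows the shapes NMF produces: one row of topic weights per document, and one row of word weights per topic.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: two film-themed and two music-themed documents
docs = [
    "the actor starred in the film",
    "the film won an award for best actor",
    "the band released a new album",
    "the album features the band on tour",
]

# Build a tf-idf matrix: rows are documents, columns are words
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Factor the matrix into 2 non-negative components ("topics")
toy_nmf = NMF(n_components=2, random_state=10, max_iter=500)
W = toy_nmf.fit_transform(X)   # document-by-topic weights
H = toy_nmf.components_        # topic-by-word weights

print(W.shape, H.shape)
```

Each row of `W` says how strongly a document expresses each topic, and each row of `H` says which words make up that topic. All entries in both are non-negative, which is what makes the components readable as additive themes.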

Let's import the data:

In [49]:
import pandas as pd
from scipy.sparse import csr_matrix

df = pd.read_csv('wikipedia-vectors.csv', index_col=0)
articles = csr_matrix(df.transpose())
titles = list(df.columns)

The csv we're working with contains a tf-idf array in which each row is a word and each column is a document. Transposing it gives `articles`, where each row is a document and each column is a word. Each entry is the tf-idf weight of that word in that document. Let's quickly look at the first 2 values in the Skrillex column of `df`.

In [50]:
df['Skrillex'][0:2]

0    0.049502
1    0.000000
Name: Skrillex, dtype: float64

We see that the first word, aaron (the first name of a keyboard player who toured with Skrillex), appears in [Skrillex's Wikipedia page](https://en.wikipedia.org/wiki/Skrillex), while the word abandon does not. You can look at the words in the vocabulary text file to see which words correspond to the frequencies (aaron and abandon are the first two words in the file).
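
The layout described above (words as rows in the csv, documents as rows after transposing) can be sketched on a tiny made-up table. The values, the `toy` names, and the third word are hypothetical; only the transpose-then-`csr_matrix` pattern mirrors the import cell.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy data: 3 words (rows) by 2 documents (columns),
# the same orientation as the csv
toy = pd.DataFrame(
    {"Doc A": [0.5, 0.0, 0.2], "Doc B": [0.0, 0.7, 0.0]},
    index=["aaron", "abandon", "ability"],
)

# Transpose so each row is a document, then store it sparsely
toy_articles = csr_matrix(toy.transpose().values)
toy_titles = list(toy.columns)

print(toy_articles.shape)  # (2, 3): 2 documents, 3 words
print(toy_articles.nnz)    # 3: only the non-zero weights are stored
```

Most words never appear in most articles, so a tf-idf array is mostly zeros. That is why a sparse format such as `csr_matrix` is a natural fit here.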
<br>

On to the NMF! This time we're going to add a normalization step in a pipeline:

In [51]:
# Perform the necessary imports
from sklearn.decomposition import NMF
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline


# Create an NMF model: nmf
nmf = NMF(n_components=20, random_state=10)

# Create a Normalizer: normalizer
normalizer = Normalizer()

# Create a pipeline: pipeline
pipeline = make_pipeline(nmf, normalizer)

# Apply fit_transform to articles: norm_features
norm_features = pipeline.fit_transform(articles)

We added this normalization because it lets us compute the cosine similarity between points with a simple dot product: once every row has unit length, the dot product of two rows equals the cosine of the angle between them. This means we'll be able to look through our articles and see which have common themes or topics. [This website](https://python-data-science.readthedocs.io/en/latest/normalisation.html) has a nice image of normalization which may help you understand why normalization is needed for cosine similarity. We'll now compute cosine similarities to the actor Russell Crowe and see which articles the NMF features place closest to him:

In [52]:
# Create a DataFrame: df
df = pd.DataFrame(norm_features, index=titles)

# Select the row for 'Russell Crowe': article
article = df.loc['Russell Crowe']

# Compute cosine similarities: similarities with dot
similarities = df.dot(article)

# Display the 10 articles with the highest cosine similarity
print(similarities.nlargest(10))

Russell Crowe           1.000000
Denzel Washington       0.848427
Dakota Fanning          0.779560
Michael Fassbender      0.756301
Jessica Biel            0.753638
Anne Hathaway           0.753422
Catherine Zeta-Jones    0.752894
Mila Kunis              0.752350
Jennifer Aniston        0.751033
Angelina Jolie          0.742789
dtype: float64


Amazingly, all ten of the closest matches are actors (the first being Russell Crowe himself, with a similarity of 1)! The NMF model was able to pick out patterns that actors' Wikipedia pages have in common and use them to find similar articles. You should be able to see why this is useful: the same kind of similarity score can drive recommendations for similar articles or videos.
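
The claim that normalization turns a dot product into a cosine similarity can be checked on a few made-up vectors. This is a minimal sketch: the `features` array is a hypothetical stand-in for NMF output, not data from the Wikipedia example.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical feature rows standing in for NMF output
features = np.array([[3.0, 4.0], [6.0, 8.0], [4.0, 3.0]])

# Scale each row to unit L2 length, as the pipeline's Normalizer does
unit = Normalizer().fit_transform(features)

# Dot product of every unit row with the first unit row
dots = unit @ unit[0]

# Cosine similarity computed directly from the raw rows
cos = features @ features[0] / (
    np.linalg.norm(features, axis=1) * np.linalg.norm(features[0])
)

print(np.allclose(dots, cos))  # True
print(dots)                    # approximately [1.0, 1.0, 0.96]
```

The second row points in the same direction as the first and is only longer, so its similarity is 1 once length is normalized away. That is the behavior we want for articles: two pages on the same topic should match even if one is much longer than the other.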