#### Clustering Wikipedia Articles
In this analysis, I'll use `TruncatedSVD` to reduce the dimensionality of sparse arrays in `csr_matrix` format, in this case word-frequency arrays. `TruncatedSVD` plays the role of Principal Component Analysis here; unlike scikit-learn's `PCA`, it does not center the data, so it can work on sparse matrices directly. First, I'll build a machine learning pipeline that combines `TruncatedSVD` with k-means. Then, I'll apply it to the word-frequency array of some popular Wikipedia articles to cluster them.
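
To make the idea concrete before touching the real data, here is a minimal sketch of the same kind of pipeline on a tiny sparse matrix. The matrix `toy_counts` and the names `toy_pipeline` and `toy_labels` are made up for illustration and are not part of the Wikipedia dataset; the parameter values are assumptions chosen to suit the toy data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# Hypothetical toy word-count matrix: 6 documents (rows) x 5 words (columns).
# The first three documents use words 0-1, the last three use words 3-4.
toy_counts = csr_matrix(np.array([
    [3, 2, 0, 0, 0],
    [2, 3, 0, 0, 0],
    [4, 1, 0, 0, 0],
    [0, 0, 0, 2, 3],
    [0, 0, 0, 3, 2],
    [0, 0, 0, 1, 4],
], dtype=float))

# TruncatedSVD accepts the sparse matrix directly; no dense conversion is needed.
toy_pipeline = make_pipeline(
    TruncatedSVD(n_components=2, random_state=0),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)

# Fit the pipeline and get one cluster label per document.
toy_labels = toy_pipeline.fit_predict(toy_counts)
print(toy_labels)
```

The two groups of documents share no words, so they should end up in different clusters. Which group gets label 0 and which gets label 1 is arbitrary.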

The data are available from [Lateral.io](https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia/).

In [40]:
# Perform the necessary imports
import pandas as pd
import numpy as np

from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from scipy.sparse import csr_matrix

In [69]:
# Read in the Wikipedia articles as a dataframe
df = pd.read_csv('datasets/wikipedia-vectors.csv', index_col=0)

# The file stores words as rows and articles as columns, so transpose it:
# each row is then an article and each column one of the ~13,000 words
articles = csr_matrix(df.transpose())
titles = list(df.columns)

In [42]:
# Create a TruncatedSVD instance
svd = TruncatedSVD(n_components=50)

# Create a KMeans instance
kmeans = KMeans(n_clusters=6)

# Create a pipeline
pipeline = make_pipeline(svd, kmeans)

# Fit the pipeline to articles
pipeline.fit(articles)

# Calculate the cluster labels
labels = pipeline.predict(articles)

# Create a DataFrame aligning labels and titles
df = pd.DataFrame({'label': labels, 'article': titles})

# Display df sorted by cluster label
print(df.sort_values('label'))


                                          article  label
0                                        HTTP 404      0
8                                         Firefox      0
7                                   Social search      0
6                     Hypertext Transfer Protocol      0
5                                          Tumblr      0
9                                        LinkedIn      0
3                                     HTTP cookie      0
2                               Internet Explorer      0
1                                  Alexa Internet      0
4                                   Google Search      0
21                             Michael Fassbender      1
28                                  Anne Hathaway      1
27                                 Dakota Fanning      1
26                                     Mila Kunis      1
25                                  Russell Crowe      1
24                                   Jessica Biel      1
23                           Ca

#### Discovering interpretable features
As you can see above, the articles fall into clearly distinguishable clusters. One cluster looks like 'internet technologies', another like 'football', another like 'actors', and so on.

Next, I'll use a dimension reduction technique called non-negative matrix factorization (NMF), which expresses samples as combinations of interpretable parts. For example, it expresses documents as combinations of topics. NMF requires all sample features to be non-negative, which holds for word frequencies. I'll apply NMF to the tf-idf word-frequency array of the Wikipedia articles from earlier, given as a csr matrix: I'll fit the model, transform the articles, and then explore the result.
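
As a small illustration of what NMF takes in and returns, here is a sketch that builds a tf-idf array from a handful of sentences and factorizes it. The sentences in `toy_docs` are invented, and the use of `TfidfVectorizer` is an assumption for this example; the Wikipedia array used in this notebook comes precomputed.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the Wikipedia articles.
toy_docs = [
    "the actor won an award for the film",
    "the actress received a film award",
    "the striker scored a goal in the match",
    "the team won the match with a late goal",
]

# TfidfVectorizer returns a sparse tf-idf array with one row per document.
vectorizer = TfidfVectorizer()
toy_tfidf = vectorizer.fit_transform(toy_docs)

# NMF needs non-negative input, which tf-idf values satisfy.
toy_model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
toy_features = toy_model.fit_transform(toy_tfidf)

# One row of features per document, one row of components per topic.
print(toy_tfidf.shape, toy_features.shape, toy_model.components_.shape)
```

Both the features and the components contain only non-negative values, which is what makes them readable as "how much of each topic" and "how much of each word".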

In [43]:
# Import NMF
from sklearn.decomposition import NMF

# Create an NMF instance
model = NMF(n_components=6)

# Fit the model to articles
model.fit(articles)

# Transform the articles
nmf_features = model.transform(articles)

# Print the NMF features of the first 10 articles
print(nmf_features[0:10])


[[0.         0.         0.         0.         0.         0.44044633]
 [0.         0.         0.         0.         0.         0.56658049]
 [0.00382077 0.         0.         0.         0.         0.39862949]
 [0.         0.         0.         0.         0.         0.38172339]
 [0.         0.         0.         0.         0.         0.48549606]
 [0.0129298  0.01378914 0.00776279 0.03344405 0.         0.33450788]
 [0.         0.         0.02067311 0.         0.00604506 0.35904555]
 [0.         0.         0.         0.         0.         0.49095545]
 [0.01542826 0.01428189 0.00376613 0.023708   0.02626278 0.48075388]
 [0.01117449 0.03136815 0.03094688 0.06569109 0.01966827 0.33827458]]


In [44]:
# Create a pandas DataFrame
df = pd.DataFrame(nmf_features, index=titles)

# Print the row for 'Anne Hathaway'
print(df.loc['Anne Hathaway'])

# Print the row for 'Denzel Washington'
print(df.loc['Denzel Washington'])


0    0.003846
1    0.000000
2    0.000000
3    0.575634
4    0.000000
5    0.000000
Name: Anne Hathaway, dtype: float64
0    0.000000
1    0.005601
2    0.000000
3    0.422324
4    0.000000
5    0.000000
Name: Denzel Washington, dtype: float64


#### NMF features of the Wikipedia articles
Looking at the features above, notice that for both actors NMF feature 3 has by far the highest value. This means that both articles are reconstructed mainly from NMF component 3 (counting from zero, so the fourth component). The reason is that, when NMF is applied to documents, the components correspond to topics (for instance, acting), and the NMF features say how much of each topic is needed to reconstruct each document.
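
The reconstruction idea can be checked on a toy example: multiplying the NMF features by the components should give back approximately the original samples. This is a minimal sketch on invented data (`topics`, `weights`, and `samples` are hypothetical), not on the Wikipedia model.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative data: 4 samples built from 2 underlying "topics".
topics = np.array([
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
])
weights = np.array([
    [1.0, 0.0],
    [2.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
])
samples = weights @ topics

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
features = toy_nmf.fit_transform(samples)

# Each sample is approximated by its feature values times the components.
reconstruction = features @ toy_nmf.components_
print(np.round(reconstruction, 2))

# Largest absolute difference between the samples and their reconstruction.
print(np.abs(samples - reconstruction).max())
```

The learned components need not equal `topics` exactly, since NMF solutions are only determined up to scaling and ordering, but the product of features and components should stay close to `samples`.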

I can verify this with the NMF model fitted above. We saw that the value of NMF feature 3 was high for the articles about both actors, Denzel Washington and Anne Hathaway. Now, I'll show how to identify the topic of the corresponding NMF component.

In [70]:
# Import the words csv and transform it into a list of words.
words_df = pd.read_csv('datasets/wikipedia-words.csv', names=['words'], index_col=False)
words = words_df.words.tolist()

# Create a DataFrame
components_df = pd.DataFrame(model.components_, columns=words)

# Print the shape of the DataFrame
print(components_df.shape)

# Select row 3, which corresponds to NMF feature 3
component = components_df.iloc[3]

# Call nlargest on the component and print the result:
# by default this gives the five words with the highest values
print(component.nlargest())


(6, 13125)
filming    0.627960
award      0.253165
stated     0.245317
romance    0.211479
actress    0.186422
Name: 3, dtype: float64


#### NMF features are topics
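It's now easy to recognise the topic that the articles about Anne Hathaway and Denzel Washington have in common: words like 'filming', 'award', and 'actress' point to acting. Because articles on similar topics have similar NMF features, I can also use those features to find articles similar to a given one, comparing them with the cosine similarity.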

In [68]:
# Perform the necessary imports
import pandas as pd
from sklearn.preprocessing import normalize

# Normalize the NMF features
norm_features = normalize(nmf_features)

# Create a DataFrame
df = pd.DataFrame(norm_features, index=titles)

# Select the row corresponding to 'Cristiano Ronaldo'
article = df.loc["Cristiano Ronaldo"]

# Compute the dot products
similarities = df.dot(article)

# Display those with the largest cosine similarity
print(similarities.nlargest())

Cristiano Ronaldo                1.000000
Franck Ribéry                    0.999972
Radamel Falcao                   0.999942
Zlatan Ibrahimović               0.999942
France national football team    0.999923
dtype: float64


#### Conclusion: Finding Similar Articles
Finally, I've quickly demonstrated how to use NMF features and the cosine similarity to find similar articles. I applied this to the NMF model for popular Wikipedia articles by finding the articles most similar to the one about the footballer Cristiano Ronaldo. As you can see, we get back mostly other footballers and an article about a football team. Pretty cool!
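
For readers wondering why normalizing and then taking dot products gives the cosine similarity, here is a minimal sketch on invented feature vectors. The article names "A" to "D" and their values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

# Hypothetical feature vectors for four made-up articles.
toy_features = np.array([
    [0.9, 0.1, 0.0],
    [0.45, 0.05, 0.0],  # same direction as "A", half the length
    [0.0, 0.2, 0.8],
    [0.1, 0.0, 0.9],
])

# normalize rescales every row to unit (L2) length.
toy_df = pd.DataFrame(normalize(toy_features), index=["A", "B", "C", "D"])

# With unit-length rows, the dot product equals the cosine similarity.
toy_similarities = toy_df.dot(toy_df.loc["A"])
print(toy_similarities.nlargest(2))
```

Articles "A" and "B" point in the same direction, so their cosine similarity is 1 even though "B" has smaller feature values. This is why the comparison is insensitive to how strongly an article expresses its topics overall.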