#### Discovering interpretable features
In this analysis, I apply non-negative matrix factorization (NMF), a dimension reduction technique that expresses each sample as a non-negative combination of interpretable parts. For example, it expresses documents as combinations of topics, and images as combinations of commonly occurring visual patterns. The data are available from [somewhere](somewhere.com).
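To make "combinations of parts" concrete, here is a minimal NumPy sketch of NMF using the classic multiplicative update rules. It is an illustration only: the function name `nmf_multiplicative`, the toy array `V`, and the iteration count are assumptions made for this example, and scikit-learn's `NMF` does not necessarily use this solver.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=1000, seed=0, eps=1e-9):
    """Approximate a non-negative array V as W @ H with W, H >= 0.

    Hypothetical helper for illustration; uses multiplicative updates
    that minimise the Frobenius norm of V - W @ H.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = V.shape
    # Start from strictly positive random factors
    W = rng.random((n_samples, n_components)) + eps
    H = rng.random((n_components, n_features)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so signs never flip
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Made-up word counts: 4 "documents" x 5 "words", built from two topics
V = np.array([
    [3.0, 2.0, 0.0, 0.0, 0.0],   # topic A only
    [6.0, 4.0, 0.0, 0.0, 0.0],   # topic A, twice as long
    [0.0, 0.0, 1.0, 2.0, 3.0],   # topic B only
    [3.0, 2.0, 1.0, 2.0, 3.0],   # mixture of A and B
])

W, H = nmf_multiplicative(V, n_components=2)
relative_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(np.round(W, 2))   # rows: how much of each part every document uses
print(np.round(H, 2))   # rows: the parts, expressed over the words
```

The rows of `W` play the role of the NMF features used later in this notebook, and the rows of `H` play the role of `model.components_`.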

In [2]:
# Perform the necessary imports
import pandas as pd
import numpy as np

from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from scipy.sparse import csr_matrix

In [4]:
# Read the Wikipedia word-frequency vectors into a DataFrame
df = pd.read_csv('datasets/wikipedia-vectors.csv', index_col=0)

# The file has one column per article and one row per word.
# Transpose so that each row is an article and each column is one of the roughly 13,000 words.
articles = csr_matrix(df.transpose())
titles = list(df.columns)

#### NMF applied to Wikipedia articles
NMF is now applied to the tf-idf word-frequency array of the Wikipedia articles, stored as the csr matrix articles. The cell below fits the model and transforms the articles; later cells explore the result.

In [11]:
# Import NMF
from sklearn.decomposition import NMF

# Create an NMF instance: model
model = NMF(n_components=6)

# Fit the model to articles
model.fit(articles)

# Transform the articles: nmf_features
nmf_features = model.transform(articles)

# Print the NMF features
print(nmf_features)

[[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 4.40472201e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 5.66613446e-01]
 [3.82097250e-03 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 3.98652678e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 3.81745752e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 4.85524386e-01]
 [1.29304110e-02 1.37895457e-02 7.76352681e-03 3.34464609e-02
  0.00000000e+00 3.34527009e-01]
 [0.00000000e+00 0.00000000e+00 2.06748995e-02 0.00000000e+00
  6.04447369e-03 3.59066705e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 4.90984053e-01]
 [1.54290254e-02 1.42823486e-02 3.76647032e-03 2.37095929e-02
  2.62603335e-02 4.80781839e-01]
 [1.11750119e-02 3.13690968e-02 3.09496286e-02 6.56956240e-02
  1.96664991e-02 3.38293907e-01]
 [0.00000000e+00 0.00000000e+00 5.30737300e-01 0.0

In [None]:
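# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# NOTE: this cell assumes a DataFrame with a raw-text column named 'articles'.
# The vectors loaded from wikipedia-vectors.csv above are already tf-idf weighted.
# A separate name is used so the csr matrix articles is not overwritten.
documents = df['articles'].tolist()

# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer()

# Apply fit_transform to the documents: csr_mat
csr_mat = tfidf.fit_transform(documents)

# Print the tf-idf row of the first document as a dense array
print(csr_mat.toarray()[0])

# Get the words: words
# (get_feature_names() was removed from scikit-learn; get_feature_names_out() replaces it)
words = tfidf.get_feature_names_out()

# Print words
#print(words)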
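As a rough picture of what a tf-idf row contains, the sketch below recomputes the weights by hand in plain Python. It assumes the defaults documented for scikit-learn's `TfidfVectorizer` (lowercasing, tokens of two or more word characters, smoothed idf, L2-normalized rows); the helper `tfidf_rows` and the three toy documents are made up for this example.

```python
import math
import re
from collections import Counter

def tfidf_rows(documents):
    """Hypothetical helper: tf-idf with smoothed idf and L2-normalised rows."""
    # Lowercase and keep tokens of at least two word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0
        rows.append([x / norm for x in weights])
    return vocabulary, rows

# Made-up documents for illustration
docs = ["cats say meow", "dogs say woof", "dogs chase cats"]
vocabulary, rows = tfidf_rows(docs)
print(vocabulary)
print([round(x, 3) for x in rows[0]])
```

In the first document, "meow" occurs in only one document while "cats" occurs in two, so "meow" receives the larger weight even though both appear once.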


In [6]:
# Create a TruncatedSVD instance
svd = TruncatedSVD(n_components=50)

# Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters=6)

# Create a pipeline
pipeline = make_pipeline(svd, kmeans)
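The pipeline reduces the word-frequency array with a truncated SVD before clustering. The NumPy sketch below shows the underlying projection on a small dense array: keep the top `k` right singular vectors and project the rows onto them, without centering the data first. The random array `X` and the choice `k = 2` are assumptions for illustration, and this is not a reproduction of scikit-learn's solver.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))   # toy stand-in for a (samples x features) array
k = 2                    # number of components to keep

# Thin SVD: X = U @ diag(S) @ Vt, singular values sorted in decreasing order
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Project each row of X onto the top-k right singular vectors
X_reduced = X @ Vt[:k].T

# The same coordinates can be read off as the scaled left singular vectors
same_coordinates = np.allclose(X_reduced, U[:, :k] * S[:k])
print(X_reduced.shape, same_coordinates)
```

Because no mean is subtracted, zero entries stay zero in the input, which is why this kind of reduction suits sparse arrays such as the csr matrix used here.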


In [None]:
# articles is the tf-idf csr matrix built above (one row per article);
# TruncatedSVD and KMeans need numeric input, not raw text

# Fit the pipeline to articles
pipeline.fit(articles)

# Calculate the cluster labels
labels = pipeline.predict(articles)

# Create a DataFrame aligning labels and titles
# (named clusters_df so the original df is not overwritten)
clusters_df = pd.DataFrame({'label': labels, 'article': titles})

# Display clusters_df sorted by cluster label
print(clusters_df.sort_values('label'))


In [None]:
# Create a pandas DataFrame: df
df = pd.DataFrame(nmf_features, index=titles)

# Print the row for 'Anne Hathaway'
print(df.loc['Anne Hathaway'])

# Print the row for 'Denzel Washington'
print(df.loc['Denzel Washington'])
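One way to compare rows like these is to check which NMF feature is largest in each. The sketch below does that on a tiny made-up feature array; the titles and values are hypothetical and are not taken from the Wikipedia data.

```python
import numpy as np

# Hypothetical NMF features: 3 made-up documents (rows) x 3 components (columns)
toy_titles = ["doc_a", "doc_b", "doc_c"]
toy_features = np.array([
    [0.00, 0.42, 0.01],
    [0.02, 0.57, 0.00],
    [0.38, 0.00, 0.05],
])

# Index of the largest feature in every row
dominant = dict(zip(toy_titles, toy_features.argmax(axis=1).tolist()))
print(dominant)
```

Documents that share a dominant feature, such as `doc_a` and `doc_b` here, are reconstructed mainly from the same component.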


In [None]:
# Create a DataFrame: components_df
# (words must hold the vocabulary, one word per column of articles, in the same order)
components_df = pd.DataFrame(model.components_, columns=words)

# Print the shape of the DataFrame
print(components_df.shape)

# Select row 3: component
component = components_df.iloc[3]

# Print result of nlargest
print(component.nlargest())
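The same "largest entries of a component" lookup can be sketched without pandas. The vocabulary and component values below are invented for illustration, and `top_words` is a hypothetical helper name.

```python
import numpy as np

# Made-up vocabulary and components: 2 components x 5 words
toy_words = np.array(["film", "award", "goal", "league", "music"])
toy_components = np.array([
    [0.9, 0.6, 0.0, 0.1, 0.2],
    [0.0, 0.1, 0.8, 0.7, 0.0],
])

def top_words(component, words, n=2):
    """Return the n words with the largest weights in one component."""
    order = np.argsort(component)[::-1][:n]   # indices sorted by decreasing weight
    return [str(w) for w in words[order]]

top_per_component = [top_words(row, toy_words) for row in toy_components]
print(top_per_component)
```

Reading the top words of each component is what lets a component be interpreted as a topic.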


In [None]:
# Perform the necessary imports
import pandas as pd
from sklearn.preprocessing import normalize

# Normalize the NMF features: norm_features
norm_features = normalize(nmf_features)

# Create a DataFrame: df
df = pd.DataFrame(norm_features, index=titles)

# Select the row corresponding to 'Cristiano Ronaldo': article
article = df.loc["Cristiano Ronaldo"]

# Compute the dot products: similarities
similarities = df.dot(article)

# Display those with the largest cosine similarity
print(similarities.nlargest())
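The dot products above are cosine similarities because every row was first scaled to unit length. A small NumPy check on made-up vectors illustrates this; `l2_normalize_rows` is a hypothetical helper that mirrors what `normalize` does with its default settings (L2 norm, row-wise).

```python
import numpy as np

def l2_normalize_rows(X):
    """Scale each row to unit Euclidean length (all-zero rows are left as is)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms

# Made-up feature rows: the second points the same way as the first,
# the third is orthogonal to both
toy = np.array([
    [1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
])

unit_rows = l2_normalize_rows(toy)
sims = unit_rows @ unit_rows[0]   # dot product with the first row

# Cosine similarity computed directly from the definition, for comparison
direct = (toy @ toy[0]) / (np.linalg.norm(toy, axis=1) * np.linalg.norm(toy[0]))
print(sims, direct)
```

Rows that differ only in length get similarity 1, so an article is matched on the mix of features it uses rather than on how strongly it uses them.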