#### Scaling fish data for clustering (no pun intended ;)
Suppose I have an array of samples of fish measurements, where each row represents an individual fish. The features, such as weight (g), length (cm), and the percentage ratio of height to length, are on very different scales, so a distance-based method like k-means would be dominated by whichever feature has the largest variance. To cluster this data effectively, I'll need to standardize the features first. In this analysis, I'll build a pipeline that standardizes and then clusters the data.

These fish measurement data were sourced from the [Journal of Statistics Education](http://jse.amstat.org/jse_data_archive.htm).
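
To make the idea of "standardize, then cluster" concrete, here is a minimal sketch. The fish file itself is not loaded in this notebook, so the `samples` array below is synthetic and purely hypothetical (two made-up groups of fish), and the choice of two clusters is an assumption for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements, NOT the real fish data:
# columns are weight (g), length (cm), height/length ratio (%)
rng = np.random.default_rng(0)
small = rng.normal(loc=[100.0, 20.0, 25.0], scale=[10.0, 1.0, 1.0], size=(20, 3))
large = rng.normal(loc=[900.0, 40.0, 40.0], scale=[50.0, 2.0, 1.5], size=(20, 3))
samples = np.vstack([small, large])

# StandardScaler rescales each feature to mean 0 and variance 1,
# so weight (in grams) no longer dominates the distance computation
scaler = StandardScaler()
fish_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
fish_pipeline = make_pipeline(scaler, fish_kmeans)

# Fit the pipeline and get one cluster label per fish
fish_labels = fish_pipeline.fit_predict(samples)
print(fish_labels)
```

Putting the scaler inside the pipeline means the same scaling is applied automatically whenever the pipeline is later used to predict labels for new samples.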

In [1]:
# Perform the necessary imports
import pandas as pd
import numpy as np

from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

In [7]:
cols = ['url', 'articles']
df = pd.read_csv('datasets/wikipedia_utf8_filtered_20pageviews.csv', names=cols)
df.head(2)

| Unnamed: 0 | url | articles |
|---|---|---|
| 0 | wikipedia-23885690 | Research Design and Standards Organization T... |
| 1 | wikipedia-23885928 | The Death of Bunny Munro The Death of Bunny ... |


In [None]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

articles = df['articles'].tolist()

# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer() 

# Apply fit_transform to document: csr_mat
csr_mat = tfidf.fit_transform(articles)

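# Print the first row as a dense array (converting only that row
# avoids materializing the whole matrix in memory)
print(csr_mat[0].toarray())

# Get the words: words
# (get_feature_names() was removed in scikit-learn 1.2)
words = tfidf.get_feature_names_out()

# Print the first few words
print(words[:10])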


In [6]:
# Create a TruncatedSVD instance
svd = TruncatedSVD(n_components=50)

# Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters=6)

# Create a pipeline
pipeline = make_pipeline(svd, kmeans)


In [None]:
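# Fit the pipeline to the tf-idf matrix; TruncatedSVD needs numeric
# features, not the raw article strings
pipeline.fit(csr_mat)

# Calculate the cluster labels
labels = pipeline.predict(csr_mat)

# NOTE: `titles` is not defined anywhere in this notebook. As an assumption,
# the 'url' column is used here as a stand-in identifier for each article.
# Replace it with the real article titles if they are available; the
# lookups by name further down require them.
titles = df['url'].tolist()

# Create a DataFrame aligning labels and titles
df = pd.DataFrame({'label': labels, 'article': titles})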

# Display df sorted by cluster label
print(df.sort_values('label'))
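
An alternative I find easier to reason about is to put the vectorizer inside the pipeline, so the raw strings go in at one end and the cluster labels come out at the other. The sketch below is not the notebook's pipeline: the four `toy_docs` sentences are hypothetical, and the component and cluster counts are shrunk to 2 to fit such a tiny corpus.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus: two sports sentences, two film sentences
toy_docs = [
    "the striker scored a goal in the football match",
    "the goalkeeper saved a penalty in the football match",
    "the actor won an award for the film",
    "the director praised the actor in the film",
]

# Raw text -> tf-idf -> dimension reduction -> k-means
text_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)

# One cluster label per document
toy_labels = text_pipeline.fit_predict(toy_docs)
print(toy_labels)
```

With this layout there is no separate tf-idf matrix to keep in sync with the pipeline, at the cost of recomputing tf-idf every time the pipeline is refit.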


In [None]:
# Import NMF
from sklearn.decomposition import NMF

# Create an NMF instance: model
model = NMF(n_components=6)

# Fit the model to the tf-idf matrix (NMF needs non-negative numeric
# input, not the raw article strings)
model.fit(csr_mat)

# Transform the tf-idf matrix: nmf_features
nmf_features = model.transform(csr_mat)

# Print the NMF features
print(nmf_features)
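
To show what the NMF features and components represent, here is a small sketch on a made-up non-negative matrix. The array `X`, the choice of 2 components, and the `init` and `max_iter` settings are all assumptions for illustration, not values taken from the Wikipedia data.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative data: 4 samples, 3 features
X = np.array([
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 4.0],
    [0.0, 3.0, 0.0],
    [0.0, 1.0, 0.0],
])

# Factor X into non-negative W (samples x components)
# and H (components x features)
toy_model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = toy_model.fit_transform(X)
H = toy_model.components_
print(W.shape, H.shape)

# The product W @ H approximates the original matrix
print(np.round(W @ H, 2))
```

Each row of `W` plays the role of `nmf_features` for one sample, and each row of `H` is one component expressed in terms of the original features.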


In [None]:
# Create a pandas DataFrame: df
df = pd.DataFrame(nmf_features, index=titles)

# Print the row for 'Anne Hathaway'
print(df.loc['Anne Hathaway'])

# Print the row for 'Denzel Washington'
print(df.loc['Denzel Washington'])


In [None]:
# Create a DataFrame: components_df
components_df = pd.DataFrame(model.components_, columns=words)

# Print the shape of the DataFrame
print(components_df.shape)

# Select row 3: component
component = components_df.iloc[3]

# Print result of nlargest
print(component.nlargest())


In [None]:
# Perform the necessary imports
import pandas as pd
from sklearn.preprocessing import normalize

# Normalize the NMF features: norm_features
norm_features = normalize(nmf_features)

# Create a DataFrame: df
df = pd.DataFrame(norm_features, index=titles)

# Select the row corresponding to 'Cristiano Ronaldo': article
article = df.loc["Cristiano Ronaldo"]

# Compute the dot products: similarities
similarities = df.dot(article)

# Display those with the largest cosine similarity
print(similarities.nlargest())
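
The reason the dot product above gives cosine similarity is that `normalize` rescales every row to unit length. Here is a minimal sketch of that step in isolation; the `toy_features` values and the single-letter `toy_titles` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

# Hypothetical feature rows for three items
toy_features = np.array([
    [1.0, 0.0],
    [2.0, 0.1],
    [0.0, 3.0],
])
toy_titles = ["A", "B", "C"]

# After normalization every row has unit L2 norm
toy_df = pd.DataFrame(normalize(toy_features), index=toy_titles)

# Dot product with a unit-length row equals cosine similarity
toy_sims = toy_df.dot(toy_df.loc["A"])

# Most similar items first; "A" matches itself exactly
print(toy_sims.nlargest())
```

Item "B" points in almost the same direction as "A" despite being twice as long, so it ranks well above "C", which is what makes this measure useful for comparing documents of different lengths.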