# K-Means Clustering

## Prepare Data

In [20]:
import numpy as np

points = np.load("data/points.npy")
new_points = np.load("data/new_points.npy")
seeds = np.loadtxt("data/seeds.txt")
# fish is defined as a literal array in a later cell (In [21])

## Clustering 2D points

You'll create a KMeans model to find 3 clusters, and fit it to the data points from the previous exercise. After the model has been fit, you'll obtain the cluster labels for some new points using the .predict() method.

You are given an array points of size 300x2, where each row gives the (x, y) coordinates of a point on a map. You are also given an array new_points.

In [None]:
# Import KMeans
from sklearn.cluster import KMeans

# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3)

# Fit model to points
model.fit(points)

# Determine the cluster labels of new_points: labels
labels = model.predict(new_points)

# Print cluster labels of new_points
print(labels)
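
For a fitted KMeans model, .predict() assigns each new point to the cluster whose centroid is closest. The sketch below checks this on synthetic data; the arrays demo_points and demo_new_points are made-up stand-ins for points and new_points, and n_init and random_state are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-ins for `points` and `new_points` (assumption: three 2D blobs)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
demo_points = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])
demo_new_points = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

# Fit on the first array, then label the second one
demo_model = KMeans(n_clusters=3, n_init=10, random_state=0)
demo_model.fit(demo_points)
demo_labels = demo_model.predict(demo_new_points)

# Distance from every new point to every centroid, shape (30, 3)
diffs = demo_new_points[:, None, :] - demo_model.cluster_centers_[None, :, :]
dists = np.linalg.norm(diffs, axis=2)

# Index of the closest centroid for each new point
nearest = dists.argmin(axis=1)

# Compare the manual assignment with the labels from .predict()
print(np.array_equal(demo_labels, nearest))
```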

## Inspect your clustering
Let's now inspect the clustering you performed in the previous exercise!

A solution to the previous exercise has already run, so new_points is an array of points and labels is the array of their cluster labels.

In [None]:
# Import pyplot
import matplotlib.pyplot as plt

# Assign the columns of new_points: xs and ys
xs = new_points[:,0]
ys = new_points[:,1]

# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs,ys, c=labels, alpha=0.5)

# Assign the cluster centers: centroids
centroids = model.cluster_centers_

# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]

# Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x, centroids_y, marker='D', s=50)
plt.show()

## How many clusters of grain?
In the video, you learnt how to choose a good number of clusters for a dataset using the k-means inertia graph. You are given an array samples containing the measurements (such as area, perimeter, length, and several others) of samples of grain. What's a good number of clusters in this case?

KMeans and PyPlot (plt) have already been imported for you.

In [None]:
ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    
    # Fit model to samples
    model.fit(samples)
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
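
The inertia plotted above is the sum of squared distances from each sample to the centroid of its assigned cluster. As a minimal sketch, the value can be recomputed by hand and compared with model.inertia_; demo_samples is a synthetic stand-in for samples, not the grain data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for `samples` (assumption: 200 rows, 4 features)
rng = np.random.default_rng(1)
demo_samples = rng.normal(size=(200, 4))

demo_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(demo_samples)

# Look up the centroid assigned to each sample
assigned = demo_model.cluster_centers_[demo_model.labels_]

# Sum of squared distances between samples and their centroids
manual_inertia = ((demo_samples - assigned) ** 2).sum()

print(manual_inertia, demo_model.inertia_)
```

Inertia always decreases as k grows, which is why the plot is read for an "elbow" rather than a minimum.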

## Evaluating the grain clustering
In the previous exercise, you observed from the inertia plot that 3 is a good number of clusters for the grain data. In fact, the grain samples come from a mix of 3 different grain varieties: "Kama", "Rosa" and "Canadian". In this exercise, cluster the grain samples into three clusters, and compare the clusters to the grain varieties using a cross-tabulation.

You have the array samples of grain samples, and a list varieties giving the grain variety for each sample. Pandas (pd) and KMeans have already been imported for you.

In [None]:
# Create a KMeans model with 3 clusters: model
model = KMeans(n_clusters=3)

# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(samples)

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'],df['varieties'])

# Display ct
print(ct)
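
To see how to read such a cross-tabulation, here is a small sketch with hand-written labels. The values in toy_labels and toy_varieties are invented for illustration and are not the grain results.

```python
import pandas as pd

# Invented cluster labels and varieties for six samples
toy_labels = [0, 0, 1, 1, 1, 2]
toy_varieties = ["Kama", "Kama", "Rosa", "Rosa", "Kama", "Canadian"]

toy_df = pd.DataFrame({"labels": toy_labels, "varieties": toy_varieties})

# Rows are cluster labels, columns are varieties, cells are counts
toy_ct = pd.crosstab(toy_df["labels"], toy_df["varieties"])
print(toy_ct)
```

A clustering matches the varieties well when each row has one dominant column. In this toy table, cluster 1 mixes two "Rosa" samples with one "Kama" sample.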

## Scaling fish data for clustering
You are given an array samples giving measurements of fish. Each row represents an individual fish. The measurements, such as weight in grams, length in centimeters, and the percentage ratio of height to length, have very different scales. In order to cluster this data effectively, you'll need to standardize these features first. In this exercise, you'll build a pipeline to standardize and cluster the data.

In [None]:
# Perform the necessary imports
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Create scaler: scaler
scaler = StandardScaler()

# Create KMeans instance: kmeans
kmeans = KMeans(n_clusters=4)

# Create pipeline: pipeline
pipeline = make_pipeline(scaler,kmeans)
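
make_pipeline names each step after its lowercased class name, so the fitted scaler and model can be retrieved from the pipeline afterwards. The sketch below uses synthetic two-column data (demo_samples is an assumption, not the fish measurements) to confirm that the scaler step leaves every feature with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: column 0 in the hundreds, column 1 in the tens
rng = np.random.default_rng(2)
demo_samples = np.column_stack([
    rng.normal(500, 200, 80),
    rng.normal(30, 5, 80),
])

demo_pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
demo_pipeline.fit(demo_samples)

# Step names are generated from the class names
step_names = list(demo_pipeline.named_steps)
print(step_names)

# Apply only the fitted scaler and check the per-feature statistics
scaled = demo_pipeline.named_steps["standardscaler"].transform(demo_samples)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```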

## Clustering the fish data
You'll now use your standardization and clustering pipeline from the previous exercise to cluster the fish by their measurements, and then create a cross-tabulation to compare the cluster labels with the fish species.

As before, samples is the 2D array of fish measurements. Your pipeline is available as pipeline, and the species of every fish sample is given by the list species.

In [None]:
# Import pandas
import pandas as pd

# Fit the pipeline to samples
pipeline.fit(samples)

# Calculate the cluster labels: labels
labels = pipeline.predict(samples)

# Create a DataFrame with labels and species as columns: df
df = pd.DataFrame({'labels': labels, 'species': species})

# Create crosstab: ct
ct = pd.crosstab(df['labels'],df['species'])

# Display ct
print(ct)

## Clustering stocks using KMeans
In this exercise, you'll cluster companies using their daily stock price movements (i.e. the dollar difference between the closing and opening prices for each trading day). You are given a NumPy array movements of daily price movements from 2010 to 2015, where each row corresponds to a company, and each column corresponds to a trading day.

Some stocks are more expensive than others. To account for this, include a Normalizer at the beginning of your pipeline. The Normalizer will separately transform each company's stock price to a relative scale before the clustering begins.

Note that Normalizer() is different to StandardScaler(), which you used in the previous exercise. While StandardScaler() standardizes features (such as the features of the fish data from the previous exercise) by removing the mean and scaling to unit variance, Normalizer() rescales each sample - here, each company's stock price - independently of the others.

KMeans and make_pipeline have already been imported for you.

In [None]:
# Import Normalizer
from sklearn.preprocessing import Normalizer

# Create a normalizer: normalizer
normalizer = Normalizer()

# Create a KMeans model with 10 clusters: kmeans
kmeans = KMeans(n_clusters=10)

# Make a pipeline chaining normalizer and kmeans: pipeline
pipeline = make_pipeline(normalizer,kmeans)

# Fit pipeline to the daily price movements
pipeline.fit(movements)
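
The difference between the two transformers is easiest to see on a tiny array. In this sketch, toy is an invented 3x3 matrix; Normalizer() with its default L2 norm rescales each row to unit length, while StandardScaler() works column by column.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Invented data: row 1 is row 0 multiplied by 10
toy = np.array([
    [1.0, 2.0, 2.0],
    [10.0, 20.0, 20.0],
    [3.0, 0.0, 4.0],
])

# Each row is divided by its own L2 norm
row_scaled = Normalizer().fit_transform(toy)

# Each column is shifted to mean 0 and scaled to unit variance
col_scaled = StandardScaler().fit_transform(toy)

print(np.linalg.norm(row_scaled, axis=1))
print(col_scaled.mean(axis=0), col_scaled.std(axis=0))

# Rows that differ only by a constant factor become identical
print(np.allclose(row_scaled[0], row_scaled[1]))
```

This is why Normalizer() suits the stock data: two companies whose prices move in the same pattern but at different price levels end up with similar rows.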

## Which stocks move together?
In the previous exercise, you clustered companies by their daily stock price movements. So which companies have stock prices that tend to change in the same way? You'll now inspect the cluster labels from your clustering to find out.

Your solution to the previous exercise has already been run. Recall that you constructed a Pipeline pipeline containing a KMeans model and fit it to the NumPy array movements of daily stock movements. In addition, a list companies of the company names is available.

In [None]:
# Import pandas
import pandas as pd

# Predict the cluster labels: labels
labels = pipeline.predict(movements)

# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'labels': labels, 'companies': companies})

# Display df sorted by cluster label
print(df.sort_values('labels'))
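
Sorting by label works, but grouping can make cluster membership easier to scan. The following is one possible sketch using groupby; toy_df and its single-letter company names are hypothetical placeholders for the df built above.

```python
import pandas as pd

# Hypothetical labels and company names for five companies
toy_df = pd.DataFrame({
    "labels": [1, 0, 1, 2, 0],
    "companies": ["A", "B", "C", "D", "E"],
})

# Collect the company names belonging to each cluster label
by_cluster = toy_df.groupby("labels")["companies"].apply(list)
print(by_cluster)
```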

In [21]:
fish = np.array([[  242. ,    23.2,    25.4,    30. ,    38.4,    13.4],
       [  290. ,    24. ,    26.3,    31.2,    40. ,    13.8],
       [  340. ,    23.9,    26.5,    31.1,    39.8,    15.1],
       [  363. ,    26.3,    29. ,    33.5,    38. ,    13.3],
       [  430. ,    26.5,    29. ,    34. ,    36.6,    15.1],
       [  450. ,    26.8,    29.7,    34.7,    39.2,    14.2],
       [  500. ,    26.8,    29.7,    34.5,    41.1,    15.3],
       [  390. ,    27.6,    30. ,    35. ,    36.2,    13.4],
       [  450. ,    27.6,    30. ,    35.1,    39.9,    13.8],
       [  500. ,    28.5,    30.7,    36.2,    39.3,    13.7],
       [  475. ,    28.4,    31. ,    36.2,    39.4,    14.1],
       [  500. ,    28.7,    31. ,    36.2,    39.7,    13.3],
       [  500. ,    29.1,    31.5,    36.4,    37.8,    12. ],
       [  600. ,    29.4,    32. ,    37.2,    40.2,    13.9],
       [  600. ,    29.4,    32. ,    37.2,    41.5,    15. ],
       [  700. ,    30.4,    33. ,    38.3,    38.8,    13.8],
       [  700. ,    30.4,    33. ,    38.5,    38.8,    13.5],
       [  610. ,    30.9,    33.5,    38.6,    40.5,    13.3],
       [  650. ,    31. ,    33.5,    38.7,    37.4,    14.8],
       [  575. ,    31.3,    34. ,    39.5,    38.3,    14.1],
       [  685. ,    31.4,    34. ,    39.2,    40.8,    13.7],
       [  620. ,    31.5,    34.5,    39.7,    39.1,    13.3],
       [  680. ,    31.8,    35. ,    40.6,    38.1,    15.1],
       [  700. ,    31.9,    35. ,    40.5,    40.1,    13.8],
       [  725. ,    31.8,    35. ,    40.9,    40. ,    14.8],
       [  720. ,    32. ,    35. ,    40.6,    40.3,    15. ],
       [  714. ,    32.7,    36. ,    41.5,    39.8,    14.1],
       [  850. ,    32.8,    36. ,    41.6,    40.6,    14.9],
       [ 1000. ,    33.5,    37. ,    42.6,    44.5,    15.5],
       [  920. ,    35. ,    38.5,    44.1,    40.9,    14.3],
       [  955. ,    35. ,    38.5,    44. ,    41.1,    14.3],
       [  925. ,    36.2,    39.5,    45.3,    41.4,    14.9],
       [  975. ,    37.4,    41. ,    45.9,    40.6,    14.7],
       [  950. ,    38. ,    41. ,    46.5,    37.9,    13.7],
       [   40. ,    12.9,    14.1,    16.2,    25.6,    14. ],
       [   69. ,    16.5,    18.2,    20.3,    26.1,    13.9],
       [   78. ,    17.5,    18.8,    21.2,    26.3,    13.7],
       [   87. ,    18.2,    19.8,    22.2,    25.3,    14.3],
       [  120. ,    18.6,    20. ,    22.2,    28. ,    16.1],
       [    0. ,    19. ,    20.5,    22.8,    28.4,    14.7],
       [  110. ,    19.1,    20.8,    23.1,    26.7,    14.7],
       [  120. ,    19.4,    21. ,    23.7,    25.8,    13.9],
       [  150. ,    20.4,    22. ,    24.7,    23.5,    15.2],
       [  145. ,    20.5,    22. ,    24.3,    27.3,    14.6],
       [  160. ,    20.5,    22.5,    25.3,    27.8,    15.1],
       [  140. ,    21. ,    22.5,    25. ,    26.2,    13.3],
       [  160. ,    21.1,    22.5,    25. ,    25.6,    15.2],
       [  169. ,    22. ,    24. ,    27.2,    27.7,    14.1],
       [  161. ,    22. ,    23.4,    26.7,    25.9,    13.6],
       [  200. ,    22.1,    23.5,    26.8,    27.6,    15.4],
       [  180. ,    23.6,    25.2,    27.9,    25.4,    14. ],
       [  290. ,    24. ,    26. ,    29.2,    30.4,    15.4],
       [  272. ,    25. ,    27. ,    30.6,    28. ,    15.6],
       [  390. ,    29.5,    31.7,    35. ,    27.1,    15.3],
       [    6.7,     9.3,     9.8,    10.8,    16.1,     9.7],
       [    7.5,    10. ,    10.5,    11.6,    17. ,    10. ],
       [    7. ,    10.1,    10.6,    11.6,    14.9,     9.9],
       [    9.7,    10.4,    11. ,    12. ,    18.3,    11.5],
       [    9.8,    10.7,    11.2,    12.4,    16.8,    10.3],
       [    8.7,    10.8,    11.3,    12.6,    15.7,    10.2],
       [   10. ,    11.3,    11.8,    13.1,    16.9,     9.8],
       [    9.9,    11.3,    11.8,    13.1,    16.9,     8.9],
       [    9.8,    11.4,    12. ,    13.2,    16.7,     8.7],
       [   12.2,    11.5,    12.2,    13.4,    15.6,    10.4],
       [   13.4,    11.7,    12.4,    13.5,    18. ,     9.4],
       [   12.2,    12.1,    13. ,    13.8,    16.5,     9.1],
       [   19.7,    13.2,    14.3,    15.2,    18.9,    13.6],
       [   19.9,    13.8,    15. ,    16.2,    18.1,    11.6],
       [  200. ,    30. ,    32.3,    34.8,    16. ,     9.7],
       [  300. ,    31.7,    34. ,    37.8,    15.1,    11. ],
       [  300. ,    32.7,    35. ,    38.8,    15.3,    11.3],
       [  300. ,    34.8,    37.3,    39.8,    15.8,    10.1],
       [  430. ,    35.5,    38. ,    40.5,    18. ,    11.3],
       [  345. ,    36. ,    38.5,    41. ,    15.6,     9.7],
       [  456. ,    40. ,    42.5,    45.5,    16. ,     9.5],
       [  510. ,    40. ,    42.5,    45.5,    15. ,     9.8],
       [  540. ,    40.1,    43. ,    45.8,    17. ,    11.2],
       [  500. ,    42. ,    45. ,    48. ,    14.5,    10.2],
       [  567. ,    43.2,    46. ,    48.7,    16. ,    10. ],
       [  770. ,    44.8,    48. ,    51.2,    15. ,    10.5],
       [  950. ,    48.3,    51.7,    55.1,    16.2,    11.2],
       [ 1250. ,    52. ,    56. ,    59.7,    17.9,    11.7],
       [ 1600. ,    56. ,    60. ,    64. ,    15. ,     9.6],
       [ 1550. ,    56. ,    60. ,    64. ,    15. ,     9.6],
       [ 1650. ,    59. ,    63.4,    68. ,    15.9,    11. ]])

In [22]:
np.save("fish.npy",fish)