# Transforming features for better clusterings
Let's now look at another dataset, fish.csv, in which each row holds a fish's species followed by several numeric measurements.

In [8]:
import numpy as np
import pandas as pd
import warnings
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

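# Ignore warnings
warnings.filterwarnings('ignore')

# fish.csv appears to have no header row, so read_csv uses the first record
# as column names (see the output below); header=None would keep it as data.
df = pd.read_csv('fish.csv')
df.head()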

   Bream  242.0  23.2  25.4  30.0  38.4  13.4
0  Bream  290.0  24.0  26.3  31.2  40.0  13.8
1  Bream  340.0  23.9  26.5  31.1  39.8  15.1
2  Bream  363.0  26.3  29.0  33.5  38.0  13.3
3  Bream  430.0  26.5  29.0  34.0  36.6  15.1
4  Bream  450.0  26.8  29.7  34.7  39.2  14.2


In [13]:
df[df.columns[1:]].head()

   242.0  23.2  25.4  30.0  38.4  13.4
0  290.0  24.0  26.3  31.2  40.0  13.8
1  340.0  23.9  26.5  31.1  39.8  15.1
2  363.0  26.3  29.0  33.5  38.0  13.3
3  430.0  26.5  29.0  34.0  36.6  15.1
4  450.0  26.8  29.7  34.7  39.2  14.2


In [58]:
# Columns 1 to 5 only: the slice 1:6 stops before the last numeric column
sample = df[df.columns[1:6]].values

In [59]:
sample.shape

(84, 5)

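In [50]:
# Left disabled: flattening the (84, 5) array into one column of 420 values
# would break the one-row-per-fish alignment with the species labels below.
# sample = sample.reshape(-1, 1)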

In [18]:
df[df.columns[:1]].head()

   Bream
0  Bream
1  Bream
2  Bream
3  Bream
4  Bream


In [55]:
species = df[df.columns[0]].values

Let's take the array of samples and use KMeans to find 3 clusters.

In [60]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
labels = model.fit_predict(sample)

In [61]:
df = pd.DataFrame({'labels':labels,'species':species})

In [62]:
ct = pd.crosstab(df['labels'],df['species'])
print(ct)

species  Bream  Pike  Roach  Smelt
labels                            
0            1     4     19     14
1            7     5      0      0
2           25     8      1      0


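This time things haven't worked out so well: the KMeans clusters don't correspond well to the fish species. Cluster 0, for instance, mixes Roach, Smelt, a few Pike and a Bream, while Bream and Pike are each split across all three clusters. The problem is that the features of the dataset have very different variances.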
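To make the variance problem concrete, here is a minimal sketch on synthetic data. The two features (a weight-like column and a length-like column) and their scales are assumptions chosen for illustration, not values from fish.csv. It shows how one feature can carry almost all of the total variance.

```python
import numpy as np

# Hypothetical measurements, NOT the fish data:
# a weight-like feature with a large spread and a length-like one with a small spread.
rng = np.random.default_rng(0)
weight = rng.normal(loc=400.0, scale=200.0, size=200)
length = rng.normal(loc=30.0, scale=5.0, size=200)
X = np.column_stack([weight, length])

# Variance of each feature (column)
variances = X.var(axis=0)
print(variances)

# Share of the total variance carried by each feature
share = variances / variances.sum()
print(share)
```

On the notebook's own array, `sample.var(axis=0)` would give the corresponding per-feature variances.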


## StandardScaler
In KMeans clustering, the variance of a feature determines how much influence it has on the result: distances between samples are dominated by the features with the largest spread. To give every feature a chance, the data needs to be transformed so that all features have equal variance. This can be achieved with the StandardScaler from scikit-learn, which transforms every feature to have mean 0 and variance 1. The resulting "standardized" features can be very informative.
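As a quick illustration of what standardization does, the sketch below compares StandardScaler with the same computation written by hand: subtract each column's mean and divide by its standard deviation. The small array reuses the first two numeric columns of the rows printed earlier, purely as an example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First two numeric columns of the first four rows shown in the table earlier
X = np.array([[242.0, 23.2],
              [290.0, 24.0],
              [340.0, 23.9],
              [363.0, 26.3]])

# Standardize with scikit-learn
scaled = StandardScaler().fit_transform(X)

# The same transformation by hand (population standard deviation, ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))  # the two results agree
print(scaled.mean(axis=0))          # approximately 0 for each column
print(scaled.var(axis=0))           # approximately 1 for each column
```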

In [64]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(sample)

StandardScaler()

In [65]:
# StandardScaler defaults: copy=True, with_mean=True, with_std=True
sample_scaled = scaler.transform(sample)

In [66]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
labels = model.fit_predict(sample_scaled)

In [67]:
df = pd.DataFrame({'labels':labels,'species':species})

In [68]:
ct_scaled = pd.crosstab(df['labels'],df['species'])
print(ct_scaled)

species  Bream  Pike  Roach  Smelt
labels                            
0            0     1     19     14
1           33     0      1      0
2            0    16      0      0


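## Using Pipelines
A pipeline chains the scaling and clustering steps into a single estimator, so the data flows from StandardScaler into KMeans with one call to fit.
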
In [74]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Create scaler: scaler
scaler = StandardScaler()

# Create KMeans instance: kmeans
kmeans = KMeans(n_clusters=3)

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, kmeans)

In [75]:
pipeline.fit(sample)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kmeans', KMeans(n_clusters=3))])
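As a rough check that the pipeline does nothing more than scale and then cluster, the sketch below runs both routes on synthetic data and compares the labels. The blobs from `make_blobs`, the factor of 100 applied to one feature, and the fixed `random_state` and `n_init` values are all assumptions made for reproducibility. They are not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not fish.csv), with one feature on a much larger scale
X, _ = make_blobs(n_samples=90, centers=3, n_features=2, random_state=0)
X[:, 0] *= 100.0

# Route 1: scaler and KMeans chained in a pipeline
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
pipe_labels = pipeline.fit_predict(X)

# Route 2: scale by hand, then cluster the scaled array
X_scaled = StandardScaler().fit_transform(X)
manual_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# With the same seed, both routes should assign identical labels
same = np.array_equal(pipe_labels, manual_labels)
print(same)

# make_pipeline names each step after its lowercased class name
print(list(pipeline.named_steps))
```

For a pipeline ending in KMeans, `fit_predict` combines the separate `fit` and `predict` calls used in this notebook into one step.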

In [76]:
labels = pipeline.predict(sample)

In [77]:
df = pd.DataFrame({'labels':labels,'species':species})

In [78]:
ct_pip = pd.crosstab(df['labels'],df['species'])
print(ct_pip)

species  Bream  Pike  Roach  Smelt
labels                            
0            0    16      0      0
1            0     1     19     14
2           33     0      1      0


Checking the correspondence between the cluster labels and the species shows that this new clustering, which incorporates standardization, is a huge improvement on the clustering without it. Bream and Pike each fall almost entirely into a cluster of their own, while Roach and Smelt share the third. Since three clusters cannot separate four species, some sharing is unavoidable.
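The crosstab in this section and the one from the StandardScaler section list the same three groups under different label numbers. KMeans numbers its clusters arbitrarily, so only the grouping matters. The sketch below uses two small made-up label arrays (assumptions for illustration) to show one way of checking that two labelings describe the same partition.

```python
import numpy as np
import pandas as pd

# Two hypothetical labelings of the same seven samples:
# the grouping is identical, only the cluster numbers differ.
labels_a = np.array([0, 0, 1, 1, 2, 2, 2])
labels_b = np.array([1, 1, 2, 2, 0, 0, 0])

# Cross-tabulate one labeling against the other
ct = pd.crosstab(labels_a, labels_b)
print(ct)

# The partitions match when every row and every column has exactly one non-zero cell
nonzero = ct > 0
one_to_one = bool((nonzero.sum(axis=0) == 1).all() and (nonzero.sum(axis=1) == 1).all())
print(one_to_one)
```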