## Clustering 📦📦📦

#### Import the libraries

In [1]:
import numpy as np
import pandas as pd
# scikit-learn is an open-source machine learning library for Python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#### 1. Read the data

In [4]:
df = pd.read_csv('penguins_cluster.csv')
df

Unnamed: 0,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
0,Adelie,39.1,18.7,181.0,3750.0
1,Adelie,39.5,17.4,186.0,3800.0
2,Adelie,40.3,18.0,195.0,3250.0
3,Adelie,36.7,19.3,193.0,3450.0
4,Adelie,39.3,20.6,190.0,3650.0
...,...,...,...,...,...
301,Gentoo,47.2,13.7,214.0,4925.0
302,Gentoo,46.8,14.3,215.0,4850.0
303,Gentoo,50.4,15.7,222.0,5750.0
304,Gentoo,45.2,14.8,212.0,5200.0


#### 2. Show the first 10 rows of the data

In [5]:
df.head(10)

Unnamed: 0,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
0,Adelie,39.1,18.7,181.0,3750.0
1,Adelie,39.5,17.4,186.0,3800.0
2,Adelie,40.3,18.0,195.0,3250.0
3,Adelie,36.7,19.3,193.0,3450.0
4,Adelie,39.3,20.6,190.0,3650.0
5,Adelie,38.9,17.8,181.0,3625.0
6,Adelie,39.2,19.6,195.0,4675.0
7,Adelie,41.1,17.6,182.0,3200.0
8,Adelie,38.6,21.2,191.0,3800.0
9,Adelie,34.6,21.1,198.0,4400.0


In [None]:
df['species'].unique()

#### 3. We are going to work with the numerical data. Filter out the species column, name the dataset df_num and show the dataset.

In [None]:
df_num = 

#### 4. Use the `describe()` function to see if the variables in the data set have large differences between their ranges.

#### 5. Do you see any large differences? If so, in which features?


#### 6. If you think one or more features may dominate over the others, you need to standardize the data. Name the scaled data df_num_scaled.

**Feature scaling is an important technique that is usually applied during the pre-processing step in Machine Learning.**
 
We use feature scaling when the variables in the data set differ widely in order of magnitude, or when they are of similar magnitude but measured in different units, such as meters vs. kilometers.
These differences cause problems for many models. For example, if one of the features has a much higher order of magnitude, this feature will dominate over the others.

To avoid this issue, we perform feature scaling, which brings all of the measurements into a similar range of values. There are different approaches to feature scaling:
- normalization - it maps the data to the range between 0 and 1 (the minimal data point is mapped to 0 and the maximal one to 1). Note that if the data contain outliers, they will heavily influence the new distribution.
- standardization - it maps the data so that the new values are centered around 0 with unit standard deviation. In this case, the mapped values are not restricted to a particular range. Standardization is widely used when the data has a Gaussian distribution.
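
Both approaches can be written out by hand, which makes the difference concrete. The sketch below uses a small made-up array (not the penguin data) and plain NumPy; the names `x`, `x_norm` and `x_std` are illustrative only.

```python
import numpy as np

# Toy feature with one outlier (made-up values, not from the penguin data)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Normalization (min-max): the minimum maps to 0 and the maximum to 1
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0 and standard deviation 1, no fixed range
x_std = (x - x.mean()) / x.std()

print(x_norm.round(3))
print(x_std.round(3))
```

Note how the outlier squeezes the other normalized values close to 0. `np.std` uses the population standard deviation by default, which is also what `StandardScaler` uses.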

Imagine you have a two-dimensional dataset with the body measurements of a group of adults: height in meters and weight in kilograms. Height ranges from 1 to 2 and weight from 40 to 200. For a distance-based model such as k-means, the weight feature will dominate over height and contribute far more to the computation.
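
The effect can be checked on a few made-up people. This is a minimal sketch with invented values and a hypothetical helper `dist`; it compares Euclidean distances before and after standardizing each column.

```python
import numpy as np

# Made-up measurements: columns are height (m) and weight (kg)
people = np.array([
    [1.60, 60.0],
    [1.95, 61.0],   # much taller than person 0, almost the same weight
    [1.61, 75.0],   # almost the same height as person 0, but heavier
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return np.sqrt(((a - b) ** 2).sum())

# Raw units: the weight difference alone decides who is "closer"
raw_01 = dist(people[0], people[1])
raw_02 = dist(people[0], people[2])

# Standardize each column (mean 0, std 1), then measure again
scaled = (people - people.mean(axis=0)) / people.std(axis=0)
scaled_01 = dist(scaled[0], scaled[1])
scaled_02 = dist(scaled[0], scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

In raw units the 35 cm height gap barely registers next to the 15 kg weight gap. After standardization the two distances are of comparable size, so both features contribute.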

In Python we can use scikit-learn to scale the data.

In [None]:
scaler = StandardScaler()
scaler.fit(df_num)
df_num_scaled = scaler.transform(df_num)
df_num_scaled

#### 7. The standardized data is an array. Convert the array to a pandas dataframe and name it df_penguins. (Hint: columns = df_num.columns)

In [None]:
df_penguins = 

#### 8. Check what the scaled data looks like.

#### 9. Let's imagine that we don't know anything about the data and we assume there might be only two groups of penguins.

In [None]:
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(df_penguins)

#### 10. Let's check which labels we have

In [None]:
# assign a cluster to each example
clusters = kmeans.predict(df_penguins)
clusters

In [None]:

# retrieve the unique cluster labels
labels = np.unique(clusters)
labels


In [None]:
pd.Series(clusters).value_counts().sort_index()

#### 11. Add the cluster labels to the dataframe

#### 12. Add the real penguin species back to the dataframe

#### 13. Let's check the mapping between the species and clusters
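
One possible way to inspect such a mapping is a contingency table. The sketch below uses a made-up dataframe `toy` with invented labels, not the actual clustering result, and assumes the columns are called `species` and `cluster`.

```python
import pandas as pd

# Made-up labels for illustration only (not the real clustering output)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "cluster": [0, 0, 1, 1, 1],
})

# Rows: species, columns: cluster label, cells: number of penguins
mapping = pd.crosstab(toy["species"], toy["cluster"])
print(mapping)
```

A species whose counts fall almost entirely in one column was captured cleanly by that cluster; counts spread over several columns indicate a mix.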

#### 14. Let's use the elbow method to see how many clusters are recommended for this dataset (we know that there are 3 species in the dataset)

**The Elbow Method**

The elbow method is a widely used heuristic for choosing the number of clusters. We compute the **Within-Cluster Sum of Squared Errors ([WSS](https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means-708505d204eb))** for different values of k and choose the k at which the decrease in WSS starts to level off. WSS always shrinks as k grows, so we look for the point of diminishing returns rather than the minimum. In a plot, this point looks like the bend of an elbow.
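
To make WSS concrete, it can be computed by hand for a given assignment of points to clusters. This is a minimal sketch on made-up points with a hypothetical helper `wss`; it corresponds to the quantity that a fitted scikit-learn `KMeans` exposes as `inertia_`.

```python
import numpy as np

def wss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

# Made-up 2-D points forming two obvious groups
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_cluster = wss(points, np.array([0, 0, 0, 0]))   # everything in one group
two_clusters = wss(points, np.array([0, 0, 1, 1]))  # left pair vs right pair
print(one_cluster, two_clusters)
```

Going from one to two clusters cuts WSS sharply here because the data really has two groups; adding further clusters would only reduce it a little, which is what produces the elbow.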


In [None]:
K = range(2, 10) # candidate values of k: 2 to 9
inertia = []

for k in K:
    kmeans = KMeans(n_clusters=k,
                    random_state=1234)
    kmeans.fit(df_penguins)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(16,8))
plt.plot(K, inertia, 'bx-') # blue line with an x marker at each k
plt.xlabel('k')
plt.ylabel('inertia')
plt.xticks(list(K)) # one tick per tested k, including the last one
plt.title('Elbow Method showing the optimal k')

We can see a slight elbow at k = 3, which matches our knowledge of the dataset.

#### 15. Repeat k-means clustering with k = 3

In [None]:
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(df_penguins)

clusters = kmeans.predict(df_penguins)
df_clustered_3 = df_penguins.copy() 
df_clustered_3["cluster"] = clusters
df_clustered_3

In [None]:
df_clustered_3['species'] = df['species']
df_clustered_3

In [None]:
adelie_3 = df_clustered_3.loc[df_clustered_3['species'] == 'Adelie']
adelie_3['cluster'].unique() # which cluster label(s) did Adelie get?