### Clustering
KMeans clustering with scikit-learn on the auto-mpg dataset

***
#### Environment
`conda activate sklearn-env`

***
#### Goals
- Compute 3 clusters from the training split of the input dataset
- Assign each test sample to a cluster using the trained model
- Visualize pairs of features and highlight cluster membership

***
#### References
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

#### Basic python imports

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


# Make numpy printouts easier to read.
np.set_printoptions(precision=3, suppress=True)

#### Load the dataset from the CSV file hosted on the UCI website

http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data  
If the URL does not work, the dataset can be loaded from the local data folder instead: `./data/auto-mpg.data`.
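The fallback described above can be automated. The following is a minimal sketch, not part of the original notebook: `load_auto_mpg` and `COLUMN_NAMES` are hypothetical names, and the function simply tries each source in order with the same `pd.read_csv` arguments used in the next cell.

```python
import pandas as pd

# Same column names as in the notebook cell below.
COLUMN_NAMES = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']


def load_auto_mpg(sources):
    """Return a DataFrame from the first source that can be read.

    `sources` is an ordered list of URLs or file paths (hypothetical helper).
    """
    last_error = None
    for source in sources:
        try:
            return pd.read_csv(source, names=COLUMN_NAMES, na_values='?',
                               comment='\t', sep=' ', skipinitialspace=True)
        except (OSError, ValueError) as error:
            # Network failures and missing files are OSError subclasses;
            # pandas parsing problems are ValueError subclasses.
            last_error = error
    raise RuntimeError(f'none of the sources could be read: {sources}') from last_error
```

A call such as `load_auto_mpg([url, './data/auto-mpg.data'])` would then prefer the UCI copy and fall back to the local file.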

In [None]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

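# '?' marks missing values, the tab-prefixed car name is skipped as a comment,
# and columns are separated by runs of spaces.
raw_dataset = pd.read_csv(url, names=column_names,
                          na_values='?', comment='\t',
                          sep=' ', skipinitialspace=True)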
dataset = raw_dataset.copy()
dataset.tail()

#### Data preparation

- Drop rows with missing values
- Choose only continuous features
- Split the data into `training` and `test` datasets

In [None]:
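# Drop rows with missing values ('?' in the source file).
dataset = dataset.dropna()
# Keep only the continuous features.
continuous_dataset = dataset[['MPG', 'Displacement', 'Horsepower', 'Weight']]
# Sample 80% of the rows for training; the remaining 20% form the test dataset.
train_dataset = continuous_dataset.sample(frac=0.8, random_state=0)
test_dataset = continuous_dataset.drop(train_dataset.index)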
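The `sample` plus `drop` split works by index label, so it only yields disjoint parts when the index labels are unique. A small sketch on a made-up toy frame illustrates this; `split_by_fraction` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd


def split_by_fraction(frame, train_fraction=0.8, seed=0):
    """Split a DataFrame into disjoint train/test parts by index label."""
    if not frame.index.is_unique:
        # With duplicate labels, drop() would remove more rows than sampled.
        raise ValueError('index labels must be unique')
    train = frame.sample(frac=train_fraction, random_state=seed)
    test = frame.drop(train.index)
    return train, test


# Made-up values, only to show the mechanics of the split.
toy = pd.DataFrame({
    'MPG': [18.0, 15.0, 36.0, 26.0, 31.0, 22.0, 14.0, 29.0, 24.0, 20.0],
    'Weight': [3504.0, 3693.0, 2130.0, 2265.0, 1985.0,
               2833.0, 4312.0, 2110.0, 2430.0, 3012.0],
})
toy_train, toy_test = split_by_fraction(toy)
print(len(toy_train), len(toy_test))
```

The two parts share no index labels and together cover every row of the toy frame.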

#### Train the sklearn KMeans algorithm (on the training dataset)

In [None]:
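from sklearn.cluster import KMeans

# Number of clusters to compute.
clusters_no = 3
# Fit on the training dataset; random_state makes the centroid initialization reproducible.
kmeans = KMeans(n_clusters=clusters_no, random_state=0).fit(train_dataset)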
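KMeans groups samples by Euclidean distance, so a feature with a large numeric range (here `Weight`, in the thousands) can dominate the others. The notebook fits on the raw values; the sketch below shows one possible variation that standardizes the features first. It is an assumption, not the notebook's method: `fit_scaled_kmeans` is a hypothetical helper, `n_init=10` is set explicitly for illustration, and the data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_scaled_kmeans(features, n_clusters=3, seed=0):
    """Fit a pipeline that standardizes features, then runs KMeans."""
    model = make_pipeline(
        StandardScaler(),
        KMeans(n_clusters=n_clusters, n_init=10, random_state=seed),
    )
    return model.fit(features)


# Synthetic stand-in for (MPG, Weight): three groups on very different scales.
rng = np.random.default_rng(0)
group_centers = np.array([[15.0, 4500.0], [25.0, 3000.0], [35.0, 2000.0]])
toy_features = np.vstack([
    center + rng.normal(scale=[1.0, 100.0], size=(20, 2))
    for center in group_centers
])

scaled_model = fit_scaled_kmeans(toy_features)
toy_labels = scaled_model.predict(toy_features)
```

Because the scaler lives inside the pipeline, `predict` applies the same scaling that was learned during fitting to any new samples.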

#### Determine which cluster each test sample belongs to

In [None]:
predicted_clusters = kmeans.predict(test_dataset)
clustered_test_dataset = test_dataset.copy()
clustered_test_dataset['Cluster'] = predicted_clusters
clustered_test_dataset.head(10)
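For KMeans, `predict` assigns each sample to the cluster with the closest centroid. The following NumPy sketch reproduces that rule on made-up points; `assign_to_nearest` is a hypothetical helper, and in the notebook the centroids would come from `kmeans.cluster_centers_`.

```python
import numpy as np


def assign_to_nearest(samples, centroids):
    """Return, for each sample, the index of the closest centroid."""
    samples = np.asarray(samples, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Pairwise Euclidean distances, shape (n_samples, n_centroids).
    distances = np.linalg.norm(
        samples[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


# Made-up centroids and samples, only to illustrate the assignment rule.
toy_centroids = [[0.0, 0.0], [10.0, 10.0]]
toy_samples = [[1.0, 1.0], [9.0, 8.0], [4.0, 4.0]]
print(assign_to_nearest(toy_samples, toy_centroids))  # → [0 1 0]
```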

#### Visualize pairs of features and highlight cluster membership

In [None]:
sns.pairplot(clustered_test_dataset, hue='Cluster', diag_kind='kde',
             palette=sns.color_palette('hls', clusters_no), corner=True)