### The data set contains features of silhouettes extracted from images of different vehicles

Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.



### 1. Read the dataset and call .dropna() to drop rows with missing values for now

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats

In [None]:
df_Vehicle=pd.read_csv('../input/vehicle/vehicle.csv').dropna()

In [None]:
df_Vehicle.tail(5)

In [None]:
df_Vehicle.isnull().sum()

In [None]:
sns.pairplot(df_Vehicle,diag_kind='kde',hue='class')

### 2. Print/ Plot the dependent (categorical variable) - Class column

Since the variable is categorical, you can use the value_counts() method.

In [None]:
df_Vehicle['class'].value_counts()

In [None]:
sns.countplot(x="class", data=df_Vehicle)

### Check for any missing values in the data 

In [None]:
# Count missing values per column (all zeros are expected after dropna())
df_Vehicle.isna().sum()
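As a minimal sketch of how missing values can be counted and then dropped, consider a small made-up frame. The column names and values here are illustrative assumptions and are not taken from vehicle.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; not the vehicle data
toy = pd.DataFrame({
    "compactness": [95.0, np.nan, 104.0, 93.0],
    "circularity": [48.0, 41.0, np.nan, 41.0],
    "class": ["van", "van", "car", "bus"],
})

missing_per_column = toy.isna().sum()   # number of NaN entries in each column
toy_clean = toy.dropna()                # keep only the rows without any NaN

print(missing_per_column)
print(len(toy), "rows before,", len(toy_clean), "rows after dropna()")
```

Comparing the row counts before and after dropna() is a quick way to see how much data the shortcut costs.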

### 3. Standardize the data 

Since the units and scales of the features are not known to us, it is wise to standardize the data with z-scores before applying any clustering method.
You can use the zscore function from scipy.stats to do this.

In [None]:
from scipy.stats import zscore
# Keep only the numeric columns, then standardize each column
df_Vehicle_numeric_cols = df_Vehicle.select_dtypes(include=[np.number])
df_scale = df_Vehicle_numeric_cols.apply(zscore)
df_scale
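To make the standardization step concrete, the sketch below checks on a small made-up frame that zscore matches subtracting the column mean and dividing by the population standard deviation (ddof=0, which is zscore's default). The frame and its column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical toy frame with two columns on very different scales
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 60.0]})

scaled = toy.apply(zscore)                       # column-wise z-scores
manual = (toy - toy.mean()) / toy.std(ddof=0)    # same thing written out by hand

print(np.allclose(scaled, manual))
print(scaled.mean().round(6).tolist(), scaled.std(ddof=0).round(6).tolist())
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances that K-means relies on.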

## K-Means Clustering

### Initialize an empty list called cluster_errors

In [None]:
cluster_errors = []
X=np.array(df_scale)

### 5. Calculate errors for each K

Iterate over values of k from 1 to 9 and fit a K-means model for each.
Use the model's inertia (the within-cluster sum of squared distances) as the error.

In [None]:
from sklearn.cluster import KMeans
# Fit K-means for each candidate number of clusters and record the inertia
cluster_range = range(1, 10)
for num_clusters in cluster_range:
    clusters = KMeans(n_clusters=num_clusters, n_init=5, max_iter=100)
    clusters.fit(X)
    labels = clusters.labels_                   # capture the cluster labels
    centroids = clusters.cluster_centers_       # capture the centroids
    cluster_errors.append(clusters.inertia_)    # capture the inertia
# Combine cluster_range and cluster_errors into a dataframe
clusters_df = pd.DataFrame({"num_clusters": cluster_range, "cluster_errors": cluster_errors})
clusters_df[0:10]
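The error recorded for each k is the inertia, which scikit-learn documents as the sum of squared distances of samples to their closest cluster center. A minimal NumPy sketch of that quantity follows; the helper name `inertia` and the toy points are assumptions made for illustration.

```python
import numpy as np

def inertia(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    # Pairwise squared distances, shape (n_points, n_centroids)
    sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.min(axis=1).sum()

# Hypothetical toy data: two tight pairs of points
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_centroid = points.mean(axis=0, keepdims=True)      # k = 1: the overall mean
two_centroids = np.array([[0.0, 1.0], [10.0, 1.0]])    # k = 2: one centroid per pair

print(inertia(points, one_centroid))    # 104.0
print(inertia(points, two_centroids))   # 4.0
```

The sharp drop from one centroid to two is the kind of change an elbow plot is meant to expose.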

Optimal value of K = 4

### 6. Plotting Elbow/ Scree Plot

Use Matplotlib to draw the scree plot. Note: a scree plot shows the error against the number of clusters.

In [None]:
# Elbow plot
from matplotlib import pyplot as plt
plt.figure(figsize=(12,6))
plt.plot(clusters_df.num_clusters, clusters_df.cluster_errors, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")

### Find out the optimal value of K

In [None]:
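# Approximate the slope of the elbow curve between consecutive values of k:
# slope = change in error / change in number of clusters (the latter is always 1)
errors = clusters_df['cluster_errors']
for i in range(len(errors) - 1):
    print(errors[i+1] - errors[i])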
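One rough way to turn these slopes into a suggested K is to look for the point where the slope flattens most sharply, that is, the largest second difference. This is only a heuristic sketch, and the inertia values below are made up rather than taken from the vehicle data.

```python
import numpy as np

# Hypothetical inertia values for k = 1..6 (not results from the vehicle data)
ks = np.arange(1, 7)
errors = np.array([100.0, 50.0, 20.0, 15.0, 12.0, 10.0])

first_diff = np.diff(errors)        # change in inertia from k to k + 1
second_diff = np.diff(first_diff)   # how much the slope flattens at each interior k

# second_diff[i] is centred on ks[i + 1]
elbow_k = ks[np.argmax(second_diff) + 1]
print(first_diff, second_diff, elbow_k)
```

The result should still be checked against the elbow plot, since a noisy curve can produce a misleading maximum.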

### Using the optimal value of K, cluster the data
Note: Since the data has more than two dimensions, we cannot plot it directly. As an alternative, we can inspect the centroids and note how they are distributed across the different dimensions.

In [None]:
# Number of clusters
kmeans = KMeans(n_clusters=4)
# Fit the model to the standardized data
kmeans = kmeans.fit(X)
# Get the cluster label of each row
labels = kmeans.predict(X)
# Centroid values
centroids = kmeans.cluster_centers_
print("Centroid values")
print(centroids)

You can use the kmeans.cluster_centers_ attribute to pull the centroid information from the fitted instance.

In [None]:
colnames = df_scale.columns

### 7. Store the centroids in a dataframe with column names from the original dataset given 

In [None]:
df_centroids=pd.DataFrame(centroids,columns=colnames)

Hint: Use the pd.DataFrame constructor.

In [None]:
df_centroids

### Use the kmeans.labels_ attribute to print the cluster label of each row

In [None]:
kmeans.labels_

In [None]:
prediction = kmeans.predict(X)
X_df = pd.DataFrame(X, columns=colnames)
X_df["group"] = prediction

In [None]:
X_df.head()

In [None]:
sns.pairplot(X_df,diag_kind='kde',hue='group')
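Because the data also carries a known class column, one way to see how the clusters line up with the vehicle types is a contingency table. The sketch below uses pd.crosstab on made-up labels; the eight labels are assumptions for illustration, not output from the notebook.

```python
import pandas as pd

# Hypothetical labels for eight samples; not output from the notebook
true_class = pd.Series(["bus", "bus", "van", "van", "car", "car", "car", "car"], name="class")
cluster_id = pd.Series([0, 0, 1, 1, 2, 3, 2, 3], name="group")

# Rows are the known classes, columns are the cluster ids
table = pd.crosstab(true_class, cluster_id)
print(table)
```

In this toy case the bus and van each fall into a single cluster while the cars split across two, which is the sort of pattern such a table makes easy to spot.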

## Hierarchical Clustering 

### 8. Variable creation

For hierarchical clustering, we will create a dataset from three multivariate normal distributions so that we can visually observe how the clusters are formed at the end.

In [None]:
np.random.seed(101)  # for repeatability of this dataset
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
c = np.random.multivariate_normal([10, 20], [[3, 1], [1, 4]], size=[100,])

https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.multivariate_normal.html
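As a quick sanity check on what multivariate_normal produces, the sketch below draws a large sample with the same mean and covariance as the first group and compares the sample statistics with the parameters. It uses NumPy's newer Generator interface and an arbitrary seed, both of which are choices made for this sketch rather than part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)          # seed chosen arbitrarily for this sketch
mean = [10, 0]
cov = [[3, 1], [1, 4]]

# Larger sample than in the notebook so that the estimates are stable
sample = rng.multivariate_normal(mean, cov, size=10_000)

print(sample.shape)                     # (10000, 2)
print(sample.mean(axis=0))              # close to [10, 0]
print(np.cov(sample, rowvar=False))     # close to [[3, 1], [1, 4]]
```

Each row is one two-dimensional point, so the three groups can be stacked row-wise into a single array.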

### 9. Combine all three arrays a,b,c into a dataframe

In [None]:
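# Stack the three samples row-wise without overwriting a, so the cell can be re-run
combined = np.concatenate([a, b, c])
df = pd.DataFrame(combined)
df.head()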

In [None]:
df.info()

### 10. Use a scatter matrix to plot all 3 distributions

In [None]:
sns.pairplot(df,diag_kind='kde')

In [None]:
# Observations:
# 1. range = 4
# 2. number of peaks in the KDE for column 0 = 2
# 3. number of peaks in the KDE for column 1 = 2

### 11. Find out the linkage matrix

https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.linkage.html

Use Ward as the linkage method and Euclidean as the distance metric.

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage

In [None]:
Z = linkage(df, method='ward', metric='euclidean')
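The linkage matrix has one row per merge and four columns: the two cluster indices being merged, the distance at which they merge, and the number of original points in the new cluster. The sketch below prints it for five made-up points; the points and the name `Z_toy` are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five hypothetical 2-D points: two tight pairs and one outlier
pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0], [50.0, 50.0]])
Z_toy = linkage(pts, method="ward", metric="euclidean")

print(Z_toy.shape)        # (4, 4): one row per merge
for left, right, height, size in Z_toy:
    print(int(left), int(right), round(height, 3), int(size))
```

Indices below the number of points refer to original observations, and larger indices refer to clusters created by earlier rows.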

### 12. Plot the dendrogram for the consolidated dataframe

In [None]:
plt.figure(figsize=(18, 16))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.0, p=25, color_threshold=12, leaf_font_size=10, truncate_mode='level')
plt.tight_layout()

### 13. Recreate the dendrogram for the last 12 merged clusters

https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html

Hint: Use the truncate_mode='lastp' argument of the dendrogram function to produce the truncated dendrogram.

In [None]:
plt.figure(figsize=(18, 16))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.0, p=12, color_threshold=12, leaf_font_size=10, truncate_mode='lastp')
plt.tight_layout()
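To inspect what the truncated dendrogram contains without drawing it, dendrogram can be called with no_plot=True, which returns the layout as a dictionary. The sketch below runs on made-up data with an arbitrary seed; `Z_toy` and `info` are names introduced here for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)          # arbitrary seed for this sketch
pts = rng.normal(size=(60, 2))          # hypothetical data, 60 points
Z_toy = linkage(pts, method="ward")

# no_plot=True returns the layout dictionary without drawing anything
info = dendrogram(Z_toy, truncate_mode="lastp", p=12, no_plot=True)

print(len(info["ivl"]))   # number of leaves in the truncated tree
print(info["ivl"])        # a label such as '(7)' is a collapsed cluster of 7 points
```

With truncate_mode='lastp' the tree is cut back to p leaves, so the x-axis shows collapsed clusters rather than individual samples.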

### 14. From the truncated dendrogram, find the optimal distance between clusters to use as an input for clustering the data

https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.cluster.hierarchy.fcluster.html

Optimal distance: a threshold of 50 separates the data into 3 clusters.
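The range of thresholds that yields three clusters can also be read off the linkage matrix rather than the plot: with Ward linkage the merge distances are non-decreasing, so any cut between the third-last and second-last merge heights leaves exactly three clusters. The sketch below shows this on made-up blobs; the data, seed and variable names are assumptions, and the printed heights will differ from those of the notebook's data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Three hypothetical, well-separated blobs (seed chosen arbitrarily)
rng = np.random.default_rng(1)
centres = np.array([[10.0, 0.0], [0.0, 20.0], [10.0, 20.0]])
pts = np.vstack([rng.normal(c, 1.0, size=(40, 2)) for c in centres])

Z_toy = linkage(pts, method="ward")
heights = Z_toy[:, 2]                    # merge distances, one per row

# Cutting anywhere in [heights[-3], heights[-2]) leaves exactly three clusters
lower, upper = heights[-3], heights[-2]
t = (lower + upper) / 2
labels = fcluster(Z_toy, t=t, criterion="distance")

print(round(lower, 2), round(upper, 2), len(np.unique(labels)))
```

A wide gap between the two heights suggests that three clusters is a stable choice.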

### 15. Use this distance and the fcluster function to cluster the data into 3 different groups

In [None]:
from scipy.cluster.hierarchy import fcluster
# z holds the flat cluster label of each row (not to be confused with the linkage matrix Z)
z = fcluster(Z, t=50, criterion='distance')

In [None]:
z
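If the number of groups is known in advance, fcluster can also be asked for it directly with criterion='maxclust' instead of a distance threshold. The sketch below is an alternative to the distance-based call, shown on made-up blobs with the same group sizes as a, b and c; the data, seed and names are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(2)           # arbitrary seed for this sketch
centres = np.array([[10.0, 0.0], [0.0, 20.0], [10.0, 20.0]])
sizes = [100, 50, 100]                   # same group sizes as a, b, c
pts = np.vstack([rng.normal(c, 1.0, size=(n, 2)) for c, n in zip(centres, sizes)])

Z_toy = linkage(pts, method="ward")
labels = fcluster(Z_toy, t=3, criterion="maxclust")   # ask for at most 3 clusters

print(np.unique(labels, return_counts=True))
```

This avoids having to read a threshold off the dendrogram, at the cost of fixing the number of clusters up front.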

### Use matplotlib to visually observe the clusters in 2D space 

In [None]:
plt.scatter(df.iloc[:,0],df.iloc[:,1],c=z)