
# How to Choose the Number of Clusters 

Using the same code as in the previous exercise, find the WCSS for clustering solutions with 1 to 10 clusters (you can try with more if you wish).

Find the most suitable solutions, run them and compare the results.

## Import the relevant libraries

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans

## Load the data

Load data from the csv file: <i> 'Countries_exercise.csv'</i>.

In [0]:
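# Load the data
raw_data = pd.read_csv('Countries_exercise.csv')
# Work on a copy so the raw data stays untouched; the cells below refer to it as 'data'
data = raw_data.copy()
# Check the data
data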

## Plot the data

Plot the <i>'Longitude'</i> and <i>'Latitude'</i> columns.

In [0]:
plt.scatter(data['Longitude'], data['Latitude'])
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.show()

## Select the features

Make sure to select the appropriate features, since we are no longer clustering on the categorical variable but on 'Longitude' and 'Latitude'.

In [0]:
# Select the two coordinate columns by name rather than by position
x = data[['Longitude', 'Latitude']]
x

## Clustering

Use 4 clusters initially. 

In [0]:
kmeans = KMeans(4)

In [0]:
kmeans.fit(x)

### Clustering Results

In [0]:
identified_clusters = kmeans.fit_predict(x)
identified_clusters

In [0]:
data_with_clusters = data.copy()
data_with_clusters['Cluster'] = identified_clusters
data_with_clusters

Plot the data once again. This time use the <i> c </i> parameter to separate the data by the clusters we defined.  

In [0]:
plt.scatter(data['Longitude'], data['Latitude'], c=data_with_clusters['Cluster'], cmap = 'rainbow')
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.show()

## Selecting the number of clusters

### WCSS

Use the integrated <i>sklearn</i> attribute <i>'inertia_'</i>.

In [0]:
kmeans.inertia_
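
To make the value of <i>'inertia_'</i> less of a black box, here is a minimal sketch of the quantity it reports: the sum of squared distances from each point to the centroid of its cluster. The function name `wcss_by_hand` and the toy points are hypothetical and not part of the exercise data. For a converged K-means fit, where each centroid is the mean of its cluster, the result should agree with `kmeans.inertia_` up to floating-point error.

```python
import numpy as np

def wcss_by_hand(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for cluster_id in np.unique(labels):
        members = points[labels == cluster_id]
        centroid = members.mean(axis=0)  # centroid = mean of the cluster's points
        total += ((members - centroid) ** 2).sum()
    return total

# Toy data (hypothetical, not taken from Countries_exercise.csv)
toy_points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
toy_labels = [0, 0, 1, 1]
toy_wcss = wcss_by_hand(toy_points, toy_labels)
print(toy_wcss)
```

Putting all four toy points into a single cluster gives a much larger value, which is the effect the elbow curve visualises.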

Write a loop that calculates and saves the WCSS for any number of clusters from 1 up to 10 (or more if you wish).

In [0]:
wcss = []
# 'cl_num' is the exclusive upper bound of the loop: one more than the largest number of clusters we want the WCSS for.
# Note that 'range' doesn't include the upper boundary
cl_num = 11
for i in range(1, cl_num):
    kmeans = KMeans(i)
    kmeans.fit(x)
    wcss_iter = kmeans.inertia_
    wcss.append(wcss_iter)

In [0]:
wcss

### The Elbow Method

In [0]:
number_clusters = range(1,cl_num)
plt.plot(number_clusters, wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster Sum of Squares')
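
Reading the elbow off a plot is a judgment call. One rough way to support it is to look at how much the WCSS falls with each added cluster: the elbow sits roughly where the relative drop becomes small. The sketch below is only a heuristic, not part of <i>sklearn</i>; `relative_drops` is a hypothetical helper, the numbers are made up for illustration, and it assumes all WCSS values are positive.

```python
def relative_drops(wcss_values):
    """Fraction by which WCSS falls when going from k to k+1 clusters."""
    return [
        (previous - current) / previous
        for previous, current in zip(wcss_values, wcss_values[1:])
    ]

# Made-up WCSS values for k = 1..5, for illustration only
example_wcss = [1000.0, 400.0, 300.0, 270.0, 250.0]
example_drops = relative_drops(example_wcss)
for k, drop in enumerate(example_drops, start=2):
    print(f"k={k}: WCSS fell by {drop:.0%}")
```

With the notebook's own list, `relative_drops(wcss)` would give one value per step from 1 to 10 clusters.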

Based on the Elbow Curve, plot several graphs with the number of clusters you believe would best fit the data.

In this specific case, that would be 2 or 3.
Compare the scatter plots to determine which one to use in any further analysis. 

<i>Hint: we already created the scatter plot for 4 clusters, so we only have to slightly alter our code.</i>

In [0]:
kmeans = KMeans(2)
kmeans.fit(x)

In [0]:
identified_clusters = kmeans.fit_predict(x)

In [0]:
data_with_clusters = data.copy()
data_with_clusters['Cluster'] = identified_clusters

In [0]:
plt.scatter(data['Longitude'], data['Latitude'], c=data_with_clusters['Cluster'], cmap = 'rainbow')
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.show()

In [0]:
kmeans = KMeans(3)
kmeans.fit(x)

In [0]:
identified_clusters = kmeans.fit_predict(x)

In [0]:
data_with_clusters = data.copy()
data_with_clusters['Cluster'] = identified_clusters

In [0]:
plt.scatter(data['Longitude'], data['Latitude'], c=data_with_clusters['Cluster'], cmap = 'rainbow')
plt.xlim(-180,180)
plt.ylim(-90, 90)
plt.show()
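
The cells for 2 and 3 clusters repeat the same fit-and-label steps, so they could be wrapped in a small helper. The following is a sketch under assumptions rather than the exercise's official solution: `add_cluster_labels` is a hypothetical name, the toy coordinates are invented, and `n_init` and `random_state` are set only so the result is reproducible.

```python
import pandas as pd
from sklearn.cluster import KMeans

def add_cluster_labels(frame, k, feature_columns=("Longitude", "Latitude")):
    """Return a copy of 'frame' with a 'Cluster' column from a k-cluster K-means fit."""
    features = frame[list(feature_columns)]
    # n_init and random_state are assumptions, chosen to make the sketch reproducible
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labelled = frame.copy()
    labelled["Cluster"] = model.fit_predict(features)
    return labelled

# Toy coordinates (hypothetical), forming two well-separated groups
toy = pd.DataFrame({
    "Longitude": [-100.0, -95.0, -90.0, 100.0, 105.0, 110.0],
    "Latitude": [40.0, 42.0, 38.0, -20.0, -25.0, -22.0],
})
toy_two = add_cluster_labels(toy, 2)
print(toy_two)
```

With the notebook's `data`, calling `add_cluster_labels(data, 2)` and `add_cluster_labels(data, 3)` would give the two labelled frames to pass to the scatter plot.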

Think about the geological development of Earth and how it split from one big continent (Pangea) into multiple smaller ones.