# Indian Student Clustering with Plotly

The file `Indian_Student_Data.csv` lists countries together with the number and percentage of Indian students studying in each. Every country is identified by its name, latitude, longitude, and a country code.

The task is to cluster the countries by student numbers and location, then plot the clusters on Plotly's geographical map.
Plotly's choropleth tutorial and documentation are available at https://plot.ly/python/choropleth-maps/. A short Jupyter notebook introducing Plotly is also available at https://drive.google.com/drive/folders/1SCm-y25dhegPJDfr6tWrRgxlbHI-4XJu?usp=sharing

We use world choropleth maps like the following one:

![Image](StudentClustering_plotly.png)

Solve the following questions:
1. Choose the number of clusters by observing the data distribution on Plotly. Justify your choice.
2. Run the KMeans clustering algorithm with the number of clusters chosen in question 1. Use the features latitude, longitude, and percentage for the clustering.
3. Plot the clusters on Plotly's geographical map. Countries in the same cluster should share a color, and each cluster should have a different color.
4. For every cluster, compute the sum of squared errors.
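One common way to support the choice in question 1, alongside visual inspection, is an elbow check: fit KMeans for several values of k and look for the point where the inertia stops dropping sharply. The sketch below is illustrative only. The helper name `inertia_by_k` is hypothetical, and the synthetic array `X_demo` stands in for `df[['percentage', 'latitude', 'longitude']]` so that the block runs without the CSV file.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_by_k(X, k_values, random_state=0):
    """Fit KMeans once per k and return a dict {k: inertia}."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X).inertia_
        for k in k_values
    }


# Synthetic stand-in data (an assumption): three well-separated blobs
# with three features each, mimicking percentage, latitude, longitude.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=center, scale=1.0, size=(30, 3))
    for center in ([0, 0, 0], [20, 20, 20], [-20, 20, -20])
])

inertias = inertia_by_k(X_demo, range(1, 7))
for k, value in inertias.items():
    print(k, round(value, 1))
```

On the real data, replacing `X_demo` with the three feature columns and plotting the printed values against k would show whether three clusters is a reasonable choice.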

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("Indian_Student_Data.csv")
df.head()

| Unnamed: 0 | country | no_of_indian_students | percentage | latitude | longitude | code |
|---|---|---|---|---|---|---|
| 0 | United States of America | 165918 | 37.134985 | 39.78373 | -100.445882 | USA |
| 1 | Australia | 66886 | 14.970109 | -24.776109 | 134.755 | AUS |
| 2 | Canada | 50000 | 11.190764 | 61.066692 | -107.991707 | CAN |
| 3 | New Zealand | 32000 | 7.162089 | -41.500083 | 172.834408 | NZL |
| 4 | Bahrain | 27000 | 6.043013 | 35.207801 | 72.547397 | BHR |


In [2]:
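from sklearn.cluster import KMeans

# Cluster on three features. With init='random' and no fixed random_state,
# the labels and centroids can differ from run to run.
kmeans = KMeans(n_clusters=3, init='random')
y_kmeans = kmeans.fit_predict(df[['percentage','latitude','longitude']])
print(y_kmeans)
# Centroids: one row per cluster, columns ordered as percentage, latitude, longitude
c = kmeans.cluster_centers_
c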

[2 1 2 1 1 0 2 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 0 1 1 0 0 1 0 1 0 0 0 0
 0 0 0 0 0 0 0 2 0 0 0 0 0 2 1 1 0 1 0 0 0 2 0 0 2 2 1 0]


array([[  0.31766276,  40.41470957,  24.60008055],
       [  1.8991349 ,  16.22711237, 107.31606639],
       [  6.51151418,  20.87950446, -86.0408836 ]])

In [3]:
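# In plotly 4 and later, this online module lives in the separate
# chart_studio package (chart_studio.plotly, chart_studio.tools).
import plotly.plotly as py
import plotly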
# Supply your own credentials here; avoid committing a real API key.
plotly.tools.set_credentials_file(username='junaidshaikh98', api_key="YOUR_API_KEY")

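data = [dict(
        type = 'choropleth',
        locations = df['code'],
        text = df['country'],
        autocolorscale = False,
        z = y_kmeans,  # cluster label (0, 1 or 2) drives the color
        # colorscale stops must be normalized to the range 0..1
        colorscale = [[0, "rgb(255, 0, 0)"], [0.5, "rgb(255, 255, 0)"], [1, "rgb(127, 255, 255)"]],
        marker = dict(
            line = dict (
                color = 'rgb(0,0,0)',
                width = 1) ),
        colorbar = dict(
            tickmode = 'linear',  # replaces the obsolete autotick = False
            tick0 = 0,
            dtick = 1,
            title = 'Color Distribution')
      )]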

layout = dict(
    title = 'Indian students abroad: KMeans clusters',
    geo = dict(
        showframe = True,
        showcoastlines = True
    )
)

fig = dict( data=data, layout=layout )
py.iplot( fig, validate=False, filename='lotly' )


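Question 3 asks for one distinct color per cluster. Because Plotly colorscale stops are positions between 0 and 1, a stepped scale with one flat band per cluster avoids any blending between neighbouring labels. The following is a minimal sketch: the helper name `discrete_colorscale` is hypothetical, and the three colors are the ones used in the cell above.

```python
def discrete_colorscale(colors):
    """Return a stepped Plotly colorscale with one flat band per color."""
    k = len(colors)
    scale = []
    for i, color in enumerate(colors):
        # Each color starts and ends its own band, so there is no gradient.
        scale.append([i / k, color])
        scale.append([(i + 1) / k, color])
    return scale


cluster_colors = ["rgb(255, 0, 0)", "rgb(255, 255, 0)", "rgb(127, 255, 255)"]
colorscale = discrete_colorscale(cluster_colors)
print(colorscale)
```

For integer labels 0 to k-1, this scale would be paired with `zmin = -0.5` and `zmax = k - 0.5` in the choropleth trace so that each label falls in the middle of its band.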



In [4]:
print(kmeans.inertia_) # sum of squared distances of all points to their closest cluster center

60000.068878341626
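
For reference, the inertia can be written in terms of per-cluster sums of squared errors. The notation is introduced here and is not part of the original notebook: $C_k$ is the set of rows assigned to cluster $k$, $\mu_k$ is its centroid, and $K$ is the number of clusters.

$$
\mathrm{SSE}_k = \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\text{inertia} = \sum_{k=1}^{K} \mathrm{SSE}_k
$$

The per-cluster values asked for in question 4 should therefore add up to the total printed above.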


In [5]:
# Feature columns, in the same order as the columns of the centroid array c
features = ['percentage', 'latitude', 'longitude']
for k, centroid in enumerate(c):
    # Rows assigned to cluster k, restricted to the clustering features
    cluster_points = df.loc[y_kmeans == k, features].to_numpy()
    # Sum of squared Euclidean distances to the cluster's own centroid
    sse = ((cluster_points - centroid) ** 2).sum()
    print(k, sse)
# The per-cluster values add up to kmeans.inertia_
