# Exercise: Geographical Cluster Analysis of Taxi Rides
Using the NY Taxi data set (see Use Case Block I) and the use case from the lecture...

In [194]:
!pip install folium



In [195]:
import pandas as pd
import numpy as np
import folium

In [196]:
# we load the data we have saved after wrangling and pre-processing in block I
train=pd.read_csv('../../DATA/train_cleaned.csv')

In [197]:
#select only the culumns with the ride coordinates
coordinates = train[ ['pickup_latitude','pickup_longitude','dropoff_latitude' , 'dropoff_longitude' ] ]

## Clustering approach from the lecture
we will be using simple K-Means:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

In [198]:
from sklearn.cluster import KMeans

In [199]:
#define number of clusters and create instance
clusters=100
myKMeans=KMeans(n_clusters=clusters, n_jobs=-1)#parallelize to all cores

In [200]:
%%time
#train model
myKMeans.fit(coordinates.to_numpy()[:100000,:])#use only subset of the data to make it faster



Wall time: 29.9 s


KMeans(n_clusters=100, n_jobs=-1)

In [202]:
#get cluster centers
centers=myKMeans.cluster_centers_
    

In [203]:
#draw map: green: start, red: end
cluster_map = folium.Map(location = [40.730610,-73.935242],zoom_start = 12,)
for i in range(clusters):
    folium.CircleMarker([centers[i,0], centers[i,1]], radius=3,                
                        color="green", 
                        fill_opacity=0.9
                       ).add_to(cluster_map)
    folium.CircleMarker([centers[i,2], centers[i,3]], radius=3,                
                        color="red", 
                        fill_opacity=0.9
                       ).add_to(cluster_map)
    folium.PolyLine([ [centers[i,0],centers[i,1]] , [centers[i,2],centers[i,3]]  ], color="black", weight=2.5, opacity=1).add_to(cluster_map)

In [204]:
cluster_map

## Exercise 1
Write a function ```show_cluster(cluster_number,...)``` that draws the cluster centers and all start and end points of a given cluster in the map.
* use the ```predict()``` method to map all data in ```train_data``` to a cluster center
* use ```folium.CircleMarker``` to draw all members of a given cluster


In [205]:
%%time

coordinates_array=coordinates.to_numpy()[:100000,:]

#define number of clusters and create instance
clusters_nb=100
my_Clustering=KMeans(n_clusters=clusters, n_jobs=-1)#parallelize to all cores

#train model
my_Clustering.fit(coordinates_array)#use only subset of the data to make it faster
my_Clustering.predict(coordinates_array)



Wall time: 29.2 s


array([76, 61, 53, ..., 59, 68, 25])

In [206]:
#get cluster centers and cluster labels for each points
centers_Array=my_Clustering.cluster_centers_
labels_Array=my_Clustering.labels_

In [207]:
def show_cluster(cluster_number,centers,labels,data_array):
    index_array=np.where(labels == cluster_number)[0]
    
    clustered_map = folium.Map(location = [40.730610,-73.935242],zoom_start = 12,)
    
    for j in range(len(index_array)):
        folium.CircleMarker([data_array[index_array[j],0], data_array[index_array[j],1]], radius=3,                
                        color="green", 
                        fill_opacity=0.9
                       ).add_to(clustered_map)
        folium.CircleMarker([data_array[index_array[j],2], data_array[index_array[j],3]], radius=3,                
                        color="red", 
                        fill_opacity=0.9
                       ).add_to(clustered_map)
    folium.CircleMarker([centers[cluster_number,0], centers[cluster_number,1]], radius=3,                
                    color="blue", 
                    fill_opacity=0.9
                    ).add_to(clustered_map)
    folium.CircleMarker([centers[cluster_number,2], centers[cluster_number,3]], radius=3,                
                    color="yellow", 
                    fill_opacity=0.9
                    ).add_to(clustered_map)
    return clustered_map

In [208]:
clust_map=show_cluster(0,centers_Array,labels_Array,coordinates_array)

In [209]:
clust_map

## Exercise 2
Write a function ```cluster_var(cluster_number,...)``` that computes the intra- and extra cluster variance for a given cluster. Apply it to all clusters and compare the results for k=100 and k=10.

![image.png](attachment:4041781a-5503-4d43-af09-a0e59ab34c29.png)

https://stats.stackexchange.com/questions/86645/variance-within-each-cluster

![image.png](attachment:daad0791-cf4d-45a2-8f5d-ca0df5047e8a.png)

https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation

To calculate the centroid from the cluster table just get the position of all points of a single cluster, sum them up and divide by the number of points.

In [210]:
def intra_var(cluster_number,centers,labels,data_array):
    index_array=np.where(labels == cluster_number)[0]
    
    intra_variance_pickup=0
    intra_variance_dropoff=0
    for i in range(len(index_array)):
        intra_variance_pickup += np.linalg.norm(coordinates_array[index_array[i],0:2]-centers_Array[1,0:2])
        intra_variance_dropoff += np.linalg.norm(coordinates_array[index_array[i],2:4]-centers_Array[1,2:4])
        
    return intra_variance_pickup/len(index_array),intra_variance_dropoff/len(index_array) #we divice by the length of index_array array in order not to take into account the amount of point in the cluster

In [211]:
int_var=intra_var(2,centers_Array,labels_Array,coordinates_array)
print(int_var)

(0.1390121252921762, 0.2255769127806139)


In [212]:
def centroid(myArray):
    return myArray[:,0].mean(), myArray[:,1].mean(), myArray[:,2].mean(), myArray[:,3].mean()

In [213]:
myCentroid=centroid(centers_Array)
myCentroid

(40.784647257965624, -73.91629217386031, 40.77927556773993, -73.91886198599983)

In [214]:
def extra_var(centers,labels):
    
    myCentroid=centroid(centers)
    
    extra_variance_pickup=0
    extra_variance_dropoff=0
    for i in range(len(centers)):
        index_array=np.where(labels == i)[0]
        extra_variance_pickup += np.linalg.norm(centers[i,0:2]-myCentroid[0:2])
        extra_variance_dropoff += np.linalg.norm(centers[i,2:4]-myCentroid[2:4])
        
    return extra_variance_pickup/len(centers),extra_variance_dropoff/len(centers) #we divice by the length of center array in order not to take into account the amount of cluster

In [215]:
extra_variance=extra_var(centers_Array,labels_Array)
extra_variance

(0.13751156662762926, 0.1324006773215603)

In [216]:
def total_intra_var(centers,labels,data_array):
    total_intra_variance_pickup=0
    total_intra_variance_dropoff=0
    for i in range(len(centers)):
        total_intra_variance_pickup += intra_var(i,centers,labels,data_array)[0]
        total_intra_variance_dropoff += intra_var(i,centers,labels,data_array)[1]
    return total_intra_variance_pickup/len(centers),total_intra_variance_dropoff/len(centers) #we divice by the length of center array in order not to take into account the amount of cluster

In [217]:
resultat=total_intra_var(centers_Array,labels_Array,coordinates_array)
resultat

(0.13009951953890256, 0.12473483530643502)

In [218]:
def cluster_var(data_array):
    #my_array=np.arrange(10,1000,10)
    my_array=[10,100]
    int_variance=[]
    extra_variance=[]
    for i in range(len(my_array)):
        centers,labels=my_kmeans(data_array,my_array[i])
        int_variance.append(total_intra_var(centers,labels,data_array))
        extra_variance.append(extra_var(centers,labels))
    return int_variance,extra_variance

In [219]:
def my_kmeans(array,nb_cluster):
    clusters=KMeans(n_clusters=nb_cluster, n_jobs=-1, random_state=0)#parallelize to all cores

    #train model
    clusters.fit(array)#use only subset of the data to make it faster
    clusters.predict(array)
    centers=clusters.cluster_centers_
    labels=clusters.labels_
    
    return centers,labels

In [220]:
resultats=cluster_var(coordinates_array)
centers_10,labels_10=my_kmeans(coordinates_array,10)
int_variance=total_intra_var(centers_10,labels_10,coordinates_array)
extra_variance=extra_var(centers_10,labels_10)
print(int_variance)
print(extra_variance)
print(resultats)



(0.18127460528564027, 0.14429063074224494)
(0.20033229189659973, 0.1451546027988987)
([(0.18127460528564027, 0.14429063074224494), (0.12726165382547255, 0.123938058375302)], [(0.20033229189659973, 0.1451546027988987), (0.1347430554922511, 0.13225470070391043)])


In [221]:
centers_100,labels_100=my_kmeans(coordinates_array,100)
int_variance=total_intra_var(centers_100,labels_100,coordinates_array)
extra_variance=extra_var(centers_100,labels_100)
print(int_variance)
print(extra_variance)



(0.12726165382547255, 0.123938058375302)
(0.1347430554922511, 0.13225470070391043)


## Answers to exercice 2

* k=10 -> total intracluster variance = (0.18127460528564027, 0.14429063074224494)
* k=10 -> extracluster variance = (0.20033229189659973, 0.1451546027988987)


* k=100 -> total intracluster variance = (0.12726165382547255, 0.123938058375302)
* k=100 -> extracluster variance = (0.1347430554922511, 0.13225470070391043)

Improving quality is minimizing intracluster variance and maximizing extracluster variance.

Here we can see that the intravariance indeed is droping within k=10 and k=100

However the extracluster is diminushing which shoudn't



Concerning the formulas from the course I didn't understood it so I used as you can saw the formulas from the Calinski-Harabasz Index.

However these formulas don't take into acount when used separatly the amount of clusters and datas point so I decided to divide the total sum calculated by the cluster/nb of points length. Which help taking into account the cluster and data point amount.

By the results I obtained, it seems that my formulas are not right and should be change with the right one.

In [222]:
from sklearn import metrics
K10_score=metrics.calinski_harabasz_score(coordinates_array,labels_10)
K100_score=metrics.calinski_harabasz_score(coordinates_array,labels_100)
print(K10_score,K100_score)

28923.22365514256 16678.147944559227


Knowing that the calinski_harabasz_score is higher with k=10 it seem better than k=100 with this method.