## BikeMi Stalls K-Means Clustering

In [14]:
# path manipulation
from pathlib import Path

# data manipulation
import pandas as pd

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

# connecting to a database
import psycopg2

from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

# set settings for seaborn
sns.set_style(style="whitegrid", rc={"grid.color": ".9"})
sns.set_palette(palette="deep")

# customise matplotlib and sns plot dimensions
plt.rcParams["figure.figsize"] = [12, 6]
plt.rcParams["figure.dpi"] = 100
title_font = {"fontname": "DejaVu Sans Mono"}

# create paths
milan_data = Path("../data/milan")

# establish connection with the database
conn = psycopg2.connect("dbname=bikemi user=luca")

### Correlations Between Series

After the first round of data selection, which removed outliers and restricted the spatial area of the analysis, we are still left with more than 200 stations, spread across 25 neighbourhoods out of 88 (identified by the acronym NIL, i.e. *nuclei d'identità locale*). This figure might still be too high, especially for multivariate models, where shrinkage will be necessary to deal with highly correlated features (*multicollinearity*). Reducing the number of series is also in our interest for univariate forecasting: fitting twenty series is a very different task from fitting two hundred. Even inspecting the correlations across series becomes daunting with so many features.

In [17]:
pd.read_csv(milan_data / "bikemi-selected_stalls.csv").nil.unique().shape[0]

25

K-means clustering is widely used in the sharing-services literature, especially to identify "virtual stations" in free-float services <cite id="54l2o">(Ma et al., 2018)</cite> or to "visualize the spatial distribution of DBS [Dockless Bike Sharing] and taxis around metro stations" <cite id="ai9ag">(Li et al., 2019)</cite>.

In a few words, with K-means clustering we "want to partition the observations into $K$ clusters such that the total within-cluster variation, summed over all $K$ clusters, is as small as possible" <cite id="is7ue">(Sohil et al., 2021)</cite>. The within-cluster variation is usually measured by the squared Euclidean distance between each observation and the centre of its cluster. Put differently, K-means "aims to partition n observations into $K$ clusters, represented by their centres or means. The centre of each cluster is calculated as the mean of all the instances belonging to that cluster" <cite id="z2z8d">(Li et al., 2019)</cite> and "is extremely efficient and concise for the classification of equivalent multidimensional data" such as sharing services data <cite id="ter8l">(Li et al., 2019)</cite>.
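Written out, the objective with squared Euclidean distance can be sketched as follows. The notation is ours rather than the cited authors': $C_k$ is the set of observations assigned to cluster $k$, $x_i$ is an observation and $\mu_k$ is the centre (mean) of cluster $k$.

$$
\min_{C_1, \dots, C_K} \; \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i
$$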

The algorithm begins by choosing the cluster centres at random. At each iteration, every observation is assigned to its nearest centre and the centres are then re-calculated to reduce the partitioning error, which never increases from one iteration to the next. In this second step the algorithm "creates new centroids by taking the mean value of all of the samples assigned to each previous centroid [...] until the centroids do not move significantly" <cite id="wx6bz">(<i>Clustering</i>, n.d.)</cite>. The partitioning error also decreases as $K$ increases, but large values of $K$ deprive the classification task of its meaning. To deal with this problem, the so-called Elbow method is used: $K$ is chosen as the value beyond which adding another cluster yields only marginal improvements.
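The two steps described above, together with the Elbow heuristic, can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the implementation used in this notebook: the function name `lloyd_kmeans`, the synthetic blobs, the range of $K$ and the number of restarts are all assumptions made for the example.

```python
import numpy as np


def lloyd_kmeans(X, k, rng, max_iter=100, tol=1e-8):
    """Plain K-means iterations (hypothetical helper).

    Returns the centres, the labels and the partitioning error per iteration.
    """
    # random initialisation: pick k distinct observations as starting centres
    centres = X[rng.choice(len(X), size=k, replace=False)]
    history = []
    for _ in range(max_iter):
        # assignment step: each observation goes to its nearest centre
        sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        history.append(sq_dist[np.arange(len(X)), labels].sum())
        # update step: each centre moves to the mean of its observations
        new_centres = centres.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centre
                new_centres[j] = members.mean(axis=0)
        shift = np.linalg.norm(new_centres - centres)
        centres = new_centres
        if shift < tol:  # centres no longer move significantly
            break
    return centres, labels, history


rng = np.random.default_rng(0)
# synthetic data (illustrative only): three well-separated blobs in 2D
blob_centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centres])

# Elbow heuristic: final partitioning error for each candidate K,
# keeping the best of a few random restarts
inertia_by_k = {}
for k in range(1, 7):
    runs = [lloyd_kmeans(X, k, rng)[2][-1] for _ in range(5)]
    inertia_by_k[k] = min(runs)

for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 1))
```

Plotting `inertia_by_k` against $K$ gives the curve on which the "elbow" is read off by eye. With data like these, one would expect the improvements to flatten out once $K$ reaches the number of blobs.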

K-means clustering scales well with the number of samples $n$, but assumes convex shapes (i.e., it performs worse where the "true" clusters have elongated or irregular shapes) <cite id="hqe2q">(<i>Clustering</i>, n.d.)</cite>. Besides, since the initial positions of the centres are random, the algorithm may converge to a local minimum, so it is usually run several times with different initialisations. Most importantly, however, K-means is sensitive to the scales of the variables in the data, so normalising the feature matrix is a crucial step.
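The sensitivity to scale can be illustrated with a small synthetic experiment. This is a sketch under stated assumptions: the data are made up (two informative features and one noise feature on a much larger scale) and the adjusted Rand index is used only because the true groups are known by construction.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# synthetic data (illustrative only): two groups that differ on the first
# two features, plus a pure-noise feature measured on a far larger scale
true_groups = np.repeat([0, 1], 100)
informative = true_groups[:, None] * 5.0 + rng.normal(scale=0.5, size=(200, 2))
noise = rng.normal(scale=1000.0, size=(200, 1))
X = np.hstack([informative, noise])

# K-means on the raw features: distances are dominated by the noise column
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# K-means after standardising every feature to zero mean and unit variance
scaled_labels = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
).fit_predict(X)

ari_raw = adjusted_rand_score(true_groups, raw_labels)
ari_scaled = adjusted_rand_score(true_groups, scaled_labels)
print(f"raw: {ari_raw:.2f}, scaled: {ari_scaled:.2f}")
```

On the raw data the partition follows the noise feature, because it contributes almost all of the Euclidean distance. After scaling, the informative features carry comparable weight and the two groups can be recovered.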

In [1]:
import pandas as pd
import geopandas
import numpy as np

from pathlib import Path

from sklearn import metrics
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

data_path = Path("../data/milan/")

In [8]:
bikemi_stalls = geopandas.read_file(
    Path(data_path / "bikemi-selected_stalls.csv"),
    geometry="stalls_geometry",
).set_index("numero_stazione")

bikemi_stalls.head()

| Unnamed: 0 | numero_stazione | nome | stalls_geometry | anno | nil | id_nil | municipio | geometry |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Duomo | POINT (9.189141462641917 45.46474597340785) | 2008 | DUOMO | 1 | 1 | |
| 1 | 3 | Cadorna 1 | POINT (9.175661673055156 45.46800286489534) | 2008 | MAGENTA - S. VITTORE | 7 | 1 | |
| 2 | 4 | Lanza | POINT (9.181970059045605 45.47227398001715) | 2008 | BRERA | 2 | 1 | |
| 3 | 5 | Universita' Cattolica | POINT (9.176411553596575 45.46312096737842) | 2008 | DUOMO | 1 | 1 | |
| 4 | 6 | San Giorgio | POINT (9.18366605858828 45.46088788086489) | 2008 | DUOMO | 1 | 1 | |


In [9]:
kmeans = KMeans(init="random", n_clusters=10, n_init=4, random_state=0)

kmeans_plus = KMeans(init="k-means++", n_clusters=10, n_init=4, random_state=0)

def compute_kmeans(kmeans, data):
    """Standardise the features, then fit the given K-means estimator on them."""
    return make_pipeline(StandardScaler(), kmeans).fit(data)

In [10]:
compute_kmeans(kmeans, bikemi_stalls)

ValueError: could not convert string to float: 'Duomo'
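
The error arises because `StandardScaler` is handed the whole frame, including text columns such as `nome`. One possible way forward, assuming the aim is to cluster the stations on their coordinates (an assumption: the feature set is not stated above), is to extract longitude and latitude from the WKT strings and pass only those numeric columns to the pipeline. The sketch below rebuilds the five rows from the preview by hand. The names `stalls`, `coords`, `lon` and `lat` are hypothetical, and `n_clusters=2` is chosen only because the toy sample has five rows.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# the five rows shown in the preview, rebuilt by hand for illustration
stalls = pd.DataFrame(
    {
        "numero_stazione": [1, 3, 4, 5, 6],
        "nome": [
            "Duomo",
            "Cadorna 1",
            "Lanza",
            "Universita' Cattolica",
            "San Giorgio",
        ],
        "stalls_geometry": [
            "POINT (9.189141462641917 45.46474597340785)",
            "POINT (9.175661673055156 45.46800286489534)",
            "POINT (9.181970059045605 45.47227398001715)",
            "POINT (9.176411553596575 45.46312096737842)",
            "POINT (9.18366605858828 45.46088788086489)",
        ],
    }
).set_index("numero_stazione")

# pull longitude and latitude out of the WKT strings and cast them to float
coords = (
    stalls["stalls_geometry"]
    .str.extract(r"POINT \(([-\d.]+) ([-\d.]+)\)")
    .astype(float)
)
coords.columns = ["lon", "lat"]

# assumption: two clusters, only because this toy sample is so small
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=4, random_state=0),
)
labels = model.fit_predict(coords)
print(dict(zip(stalls["nome"], labels)))
```

Only numeric columns reach the scaler here, so the conversion error no longer occurs. The same selection step would apply to the full `bikemi_stalls` frame.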