In this notebook we analyze the Geolife dataset, restricted to the Beijing area and to a specific timeframe, in order to detect stop places, cluster them together, and compute the medoid of each cluster.

We begin by loading the dataset:

In [26]:
import pandas as pd

cols = ["date_time", "lat", "lon", "uid"]
df = pd.read_csv("/Users/tommasocolella/Desktop/Tesi/datasets/complete.csv",
                 usecols=cols, parse_dates=["date_time"])

df.head()

Unnamed: 0,date_time,lat,lon,uid
0,2009-01-03 01:21:34,39.974294,116.399741,135
1,2009-01-03 01:21:35,39.974292,116.399592,135
2,2009-01-03 01:21:36,39.974309,116.399523,135
3,2009-01-03 01:21:38,39.97432,116.399588,135
4,2009-01-03 01:21:39,39.974365,116.39973,135


We're interested in a subset comprising only coordinates in the Beijing area within a specific timeframe, so we restrict the dataset to a latitude/longitude bounding box (39.54 to 40.3, 115.75 to 117.13) and to a date range.

In [27]:
#restricting to the Beijing area
df = df[(df['lat'].between(39.54, 40.3)) & (df['lon'].between(115.75, 117.13))]

#restricting to June-August 2008 (boundaries excluded by the strict comparisons below)
start_time = "2008-06-01 00:00:00"
end_time = "2008-08-31 23:59:00"

df = df[(df.date_time > start_time) & (df.date_time < end_time)]

df.head()

Unnamed: 0,date_time,lat,lon,uid
107402,2008-08-16 07:47:56,39.930748,116.306143,104
107403,2008-08-16 07:47:57,39.930792,116.306167,104
107404,2008-08-16 07:48:00,39.931093,116.306342,104
107405,2008-08-16 07:48:05,39.93095,116.306313,104
107406,2008-08-16 07:48:10,39.930963,116.306383,104
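
The same bounding-box and date filter can be checked on a small synthetic frame. The sketch below is only illustrative: the sample rows and the helper name `restrict` are assumptions made for the example, not part of the notebook's data or code.

```python
import pandas as pd

def restrict(df, lat_range, lon_range, start, end):
    """Hypothetical helper: keep rows inside a lat/lon box and strictly between start and end."""
    # between() is inclusive on both ends by default
    in_box = df["lat"].between(*lat_range) & df["lon"].between(*lon_range)
    in_time = (df["date_time"] > start) & (df["date_time"] < end)
    return df[in_box & in_time]

# Made-up sample: only the first row is inside both the box and the time window.
sample = pd.DataFrame({
    "date_time": pd.to_datetime(["2008-08-16 07:47:56",
                                 "2009-01-03 01:21:34",
                                 "2008-07-01 12:00:00"]),
    "lat": [39.930748, 39.974294, 31.2304],
    "lon": [116.306143, 116.399741, 121.4737],
    "uid": [104, 135, 1],
})

subset = restrict(sample, (39.54, 40.3), (115.75, 117.13),
                  "2008-06-01 00:00:00", "2008-08-31 23:59:00")
print(subset)
```

The second row is dropped by the time filter and the third by the bounding box, so each condition can be verified independently.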


Now we convert df to a TrajDataFrame and set the uid column to 0 for every row, in order to detect and cluster stops across all the trajectories, independently of the user who recorded each of them.

In [28]:
import skmob
from skmob.preprocessing import detection

#converting to tdf
tdf = skmob.TrajDataFrame(df, latitude='lat', longitude='lon', datetime='date_time', user_id='uid')
tdf.head()

#setting uid to 0
tdf["uid"] = 0

#detecting stops
stdf = detection.stops(tdf, stop_radius_factor=0.5, minutes_for_a_stop=60.0, spatial_radius_km=0.5, leaving_time=True)
stdf.head()

Unnamed: 0,datetime,lat,lng,uid,leaving_datetime
0,2008-06-01 14:07:46,39.984939,116.352704,0,2008-06-01 15:11:14
1,2008-06-01 15:24:04,39.984915,116.352534,0,2008-06-01 23:03:27
2,2008-06-02 01:48:56,39.97619,116.330269,0,2008-06-02 03:46:06
3,2008-06-02 15:21:05,39.971958,116.324943,0,2008-06-02 19:04:58
4,2008-06-02 19:22:02,39.987358,116.451792,0,2008-06-02 22:56:50
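
As a quick sanity check on `minutes_for_a_stop=60.0`, the duration of each stop can be derived from the `datetime` and `leaving_datetime` columns. The sketch below copies the five rows shown above by hand into a plain pandas DataFrame; the column name `duration_min` is an assumption introduced for the example.

```python
import pandas as pd

# The first five stops from the output above, copied by hand.
stops = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2008-06-01 14:07:46", "2008-06-01 15:24:04", "2008-06-02 01:48:56",
        "2008-06-02 15:21:05", "2008-06-02 19:22:02",
    ]),
    "leaving_datetime": pd.to_datetime([
        "2008-06-01 15:11:14", "2008-06-01 23:03:27", "2008-06-02 03:46:06",
        "2008-06-02 19:04:58", "2008-06-02 22:56:50",
    ]),
})

# Stop duration in minutes (hypothetical column name).
stops["duration_min"] = (
    stops["leaving_datetime"] - stops["datetime"]
).dt.total_seconds() / 60

print(stops["duration_min"].round(1))
```

Every duration in this sample is at least 60 minutes, which is what the chosen threshold should guarantee.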


Now we show the detected stops on a folium map.

In [29]:
stdf.plot_stops()



Next we cluster all the stopping points and plot the clustered points on a folium map.

In [30]:
from skmob.preprocessing import clustering

#clustering the stopping points
cstdf = clustering.cluster(stdf, cluster_radius_km=0.5, min_samples=1)

#printing on a folium map
cstdf.plot_stops()
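
To make the role of a cluster radius and of `min_samples=1` more concrete, the sketch below groups a few made-up points with scikit-learn's `DBSCAN` and a haversine metric. It is an illustration under stated assumptions, not a reproduction of what `clustering.cluster` does internally: the sample coordinates and the Earth radius of 6371 km are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

# Made-up stops as [lat, lon] in degrees: two a few meters apart, one several km away.
points_deg = np.array([
    [39.984939, 116.352704],
    [39.984915, 116.352534],
    [39.987358, 116.451792],
])

radius_km = 0.5

# The haversine metric works on radians, so eps is the radius expressed as an angle.
db = DBSCAN(eps=radius_km / EARTH_RADIUS_KM, min_samples=1,
            metric="haversine", algorithm="ball_tree")
labels = db.fit_predict(np.radians(points_deg))

print(labels)
```

With `min_samples=1` every point is a core point, so no point is labeled as noise and isolated stops end up in singleton clusters.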



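Now we compute the medoid of each cluster, i.e. the stop whose summed distance to all the other stops of the same cluster is smallest, and collect the medoids in a dataframe so that it can be serialized to disk for later use. Distances are Euclidean on the raw lat/lng values, which is only an approximation of the distance on the ground.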

In [58]:
import sklearn.metrics  #the submodule has to be imported explicitly
import numpy

medoids = pd.DataFrame()

#for each cluster
for i in range(0, cstdf.cluster.max() + 1):
    cluster_i = cstdf[(cstdf["cluster"] == i)].copy()
    cluster_i.reset_index(inplace=True)
    pairwise = sklearn.metrics.pairwise_distances(cluster_i[["lat", "lng"]], metric='euclidean')
    medoid = numpy.argmin(pairwise.sum(axis=0))
    #print(medoid)
    #print(cluster_i.loc[medoid])
    #DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
    medoids = pd.concat([medoids, cluster_i.loc[[medoid]]])

#setting back the index
medoids = medoids.set_index("index")

#we expect 81 rows
print(medoids.shape)

#we plot the medoids on a folium map
medoids.plot_stops()

(81, 6)
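
The cell above measures distances with a Euclidean metric on raw degrees. As a possible alternative, the sketch below picks the medoid using great-circle (haversine) distances computed with NumPy only. The function names `haversine_km` and `medoid_index`, the Earth radius of 6371 km, and the sample coordinates are all assumptions made for the example; this is not the method used in the notebook.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between points given in degrees (assumed Earth radius)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

def medoid_index(lat, lon):
    """Position of the point with the smallest summed distance to all the others."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    # Broadcasting a column against a row yields the full pairwise distance matrix.
    dist = haversine_km(lat[:, None], lon[:, None], lat[None, :], lon[None, :])
    return int(np.argmin(dist.sum(axis=0)))

# Made-up cluster: five points on the same meridian, the last one farther north.
lat = [39.9840, 39.9850, 39.9860, 39.9870, 39.9950]
lon = [116.3500, 116.3500, 116.3500, 116.3500, 116.3500]

print(medoid_index(lat, lon))
```

Within clusters of a few hundred meters the two metrics usually select the same or a nearby point; the haversine version mainly avoids the distortion caused by a degree of longitude being shorter than a degree of latitude at Beijing's latitude.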


