# Table of Contents
- [1 Location History](#Location-History)
  - [1.1 Loading Google Location History](#Loading-Google-Location-History)
  - [1.2 Per day clustering and centroid determination](#Per-day-clustering-and-centroid-determination)
  - [1.3 Clustering of all days](#Clustering-of-all-days)
  - [1.4 Calculation of the overall centroids](#Calculation-of-the-overall-centroids)
  - [1.5 Calculation of the presence time per day of the week](#Calculation-of-the-presence-time-per-day-of-the-week)
  - [1.6 Display the clusters locations on the map](#Display-the-clusters-locations-on-the-map)

# Location History
In this notebook I show how I used clustering techniques to find the locations where I spend the most time. I also show, for each day of the week, the hours of the day during which I was at those locations.

## Loading Google Location History
First, I downloaded my location history from Google as JSON at: https://takeout.google.com/settings/takeout
The next step is to load that data: in my case, 500547 location data points.
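
As a rough illustration of what the loading cell does with each record, here is a minimal sketch using only the standard library. The field names (`timestampMs`, `latitudeE7`, `longitudeE7`) are the ones read by the notebook; the `decode_location` helper and the sample record are hypothetical, with the sample made up to correspond to the first row of the output shown further down. The sketch stops at UTC, whereas the notebook then converts to US/Eastern.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def decode_location(record):
    """Turn one raw record into a UTC timestamp and decimal degrees."""
    # timestampMs is a string holding milliseconds since the Unix epoch
    ts = EPOCH + timedelta(milliseconds=int(record['timestampMs']))
    # latitudeE7 / longitudeE7 are degrees multiplied by 10^7, stored as integers
    return {
        'timestamp': ts,
        'lat': record['latitudeE7'] / 1e7,
        'lon': record['longitudeE7'] / 1e7,
    }

# Hypothetical record, for illustration only
sample = {'timestampMs': '1427571928653',
          'latitudeE7': 459092980,
          'longitudeE7': -749175950}
decoded = decode_location(sample)
print(decoded)
```

Storing coordinates as scaled integers avoids floating-point noise in the export; dividing by 10^7 recovers degrees with roughly centimetre-level resolution.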

In [1]:
'''
Loading the location data
'''

import json
import pandas as pd

filename = 'LocationHistory.json'

# Open the location history
with open(filename) as data_file:
    loc_data = json.load(data_file)

# Creating the data frame of locations
locations=[]
for loc in loc_data['locations']:
    tmp = {}
    tmp['timestamp'] = pd.to_datetime(int(loc['timestampMs']), unit='ms').tz_localize('utc').tz_convert('US/Eastern')
    tmp['lat'] = loc['latitudeE7']/10000000
    tmp['lon'] = loc['longitudeE7']/10000000
    locations.append(tmp)
data = pd.DataFrame(locations)
data = data.set_index('timestamp')

print('{} locations loaded in "data".'.format(len(data)))
data.head(4)

500547 locations loaded in "data".


timestamp,lat,lon
2015-03-28 15:45:28.653000-04:00,45.909298,-74.917595
2015-03-28 15:44:27.996000-04:00,45.909298,-74.917595
2015-03-28 15:43:26.793000-04:00,45.909298,-74.917595
2015-03-28 15:42:25.765000-04:00,45.909298,-74.917595


## Per day clustering and centroid determination
Since that is too many data points to cluster all at once, I decided to first cluster the data on a per-day basis. In a second step, we will cluster all days together. To be able to analyze the times spent at each location, we also record the hours of the day at which each location was measured (this is crude but sufficient for the purpose). With that time information more could eventually be done, such as clustering or analyzing when I am at those locations, or checking whether clusters move over time. But let's stick to the basics for now.
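
The clustering cells divide a distance in kilometres by `kms_per_radian` to obtain `epsilon`. This is because the haversine metric works on coordinates in radians and measures distance as an angle. The sketch below, which uses only the standard library, illustrates that conversion; the helper names (`km_to_radians`, `haversine_km`, `within_eps`) are hypothetical and not part of the notebook.

```python
import math

KMS_PER_RADIAN = 6371.0088  # mean Earth radius in km, same constant as the notebook

def km_to_radians(km):
    """Convert a ground distance in km to an angle in radians on the sphere."""
    return km / KMS_PER_RADIAN

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * KMS_PER_RADIAN * math.asin(math.sqrt(a))

def within_eps(p, q, eps):
    """True if points p and q (lat, lon in degrees) are at most eps radians apart."""
    return haversine_km(*p, *q) / KMS_PER_RADIAN <= eps

eps_day = km_to_radians(0.25)  # 250 m radius, as in the per-day pass
eps_all = km_to_radians(1.5)   # 1.5 km radius, as in the all-days pass

a = (45.5, -73.8)
b = (45.51, -73.8)  # about 1.1 km north of a
print(within_eps(a, b, eps_day), within_eps(a, b, eps_all))
```

With these radii, two points about 1.1 km apart are too far to be neighbours in the per-day pass but close enough in the all-days pass, which is why the second pass can merge nearby daily centroids.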

In [2]:
'''
As we have too much data for our VM to handle it, we will do a per day clustering, then
another clustering for all days on top of it.
'''

from datetime import timedelta
from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
import numpy as np

kms_per_radian = 6371.0088
epsilon = .25 / kms_per_radian

tmin = data.index.min()
tmin = pd.to_datetime('{}-{}-{}'.format(tmin.year,tmin.month,tmin.day))
tmax = data.index.max()
tmax = pd.to_datetime('{}-{}-{}'.format(tmax.year,tmax.month,tmax.day))+timedelta(days=1)

rng = pd.date_range(tmin,tmax)

rows = []  # one record per (day, cluster) centroid
for d in rng:
    ds = '{}-{}-{}'.format(d.year, d.month, d.day)
    X = data.loc[ds].reset_index()  # all points recorded on that day
    if len(X) > 0:
        coordinates = X[['lat', 'lon']].values
        # A cluster must hold at least 10% of the day's points (DBSCAN expects an integer)
        minSamples = int(np.ceil(len(X) / 10))
        db = DBSCAN(eps=epsilon, min_samples=minSamples, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
        res = X.assign(label=db.labels_, hour=X['timestamp'].dt.hour)
        # Label -1 marks noise points and is not a cluster
        n_clusters = len(set(res['label'])) - (1 if -1 in set(res['label']) else 0)
        for i in range(n_clusters):
            cluster = res[res['label'] == i]
            el = cluster[['lat', 'lon', 'hour']].mean().to_dict()
            el['times'] = set(cluster['hour'].values)  # hours of the day seen in this cluster
            el['timestamp'] = ds
            rows.append(el)
# Build the frame once; DataFrame.append was removed in pandas 2.0
centroids = pd.DataFrame(rows, columns=['timestamp', 'lat', 'lon', 'hour', 'times'])
centroids = centroids.set_index('timestamp')
centroids.index = pd.to_datetime(centroids.index)

print('{} centroids of locations kept in "centroids".'.format(len(centroids)))
centroids.head(4)

1721 centroids of locations kept in "centroids".


timestamp,lat,lon,hour,times
2011-12-14,45.507178,-73.831019,20.9,"{18, 20, 21, 22, 23}"
2011-12-14,45.531191,-73.81856,20.333333,"{17, 19, 20, 21, 22, 23}"
2011-12-14,45.50396,-73.664508,11.1,"{7, 9, 10, 11, 13, 15}"
2011-12-15,45.507178,-73.831019,8.642857,"{0, 2, 3, 4, 5, 6, 8, 20, 22, 23}"


## Clustering of all days
Now we cluster the centroids of all the per-day clusters determined in the previous step.

In [3]:
'''
DBSCAN clustering taking into account the spherical earth
source: http://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/
'''

from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
import numpy as np

kms_per_radian = 6371.0088
epsilon = 1.5 / kms_per_radian
minSamples = int(np.ceil(len(centroids) / 20))  # DBSCAN expects an integer.
                               # Since this is the second pass, we might want to detect vacation spots or such;
                               # in that case we might have to lower the minimum number of samples to 1-2

coordinates=centroids[['lat','lon']].values

db = DBSCAN(eps=epsilon, min_samples=minSamples, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
y = db.labels_
print ('List of generated labels for {} clusters: {}'.format(len(set(y)),set(y)))

centroids['label'] = y
centroids.head(4)

List of generated labels for 4 clusters: {0, 1, 2, -1}


timestamp,lat,lon,hour,times,label
2011-12-14,45.507178,-73.831019,20.9,"{18, 20, 21, 22, 23}",0
2011-12-14,45.531191,-73.81856,20.333333,"{17, 19, 20, 21, 22, 23}",0
2011-12-14,45.50396,-73.664508,11.1,"{7, 9, 10, 11, 13, 15}",1
2011-12-15,45.507178,-73.831019,8.642857,"{0, 2, 3, 4, 5, 6, 8, 20, 22, 23}",0


## Calculation of the overall centroids
Next, we calculate the centroid of each of those overall clusters.
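
Here the centroid of a cluster is simply the mean latitude and mean longitude of its members, with the noise label `-1` left out. The following is a minimal standard-library sketch of that idea on toy data; the `cluster_centroids` helper and the sample points are hypothetical, while the notebook itself does the same with pandas.

```python
from collections import defaultdict

def cluster_centroids(points, labels, noise_label=-1):
    """Return {label: (mean_lat, mean_lon)}, skipping points labelled as noise."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # running lat sum, lon sum, count
    for (lat, lon), label in zip(points, labels):
        if label == noise_label:
            continue
        acc = sums[label]
        acc[0] += lat
        acc[1] += lon
        acc[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}

# Toy data: two points in cluster 0, one in cluster 1, one noise point
points = [(45.50, -73.80), (45.52, -73.82), (45.90, -74.90), (40.00, -70.00)]
labels = [0, 0, 1, -1]
centroids_by_label = cluster_centroids(points, labels)
print(centroids_by_label)
```

Averaging raw latitudes and longitudes is only an approximation of a true spherical centroid, but for clusters a few kilometres wide and far from the poles the difference should be negligible.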

In [4]:
X = pd.DataFrame(coordinates, columns=['lat','lon'])
Y = pd.DataFrame(y, columns=['label'])
res = pd.concat([X,Y], axis=1, join='inner')
overall_n_clusters = len(set(res['label'])) - (1 if -1 in set(res['label']) else 0)
# Mean position of each cluster, noise (label -1) excluded; row i is the centroid of cluster i.
# groupby replaces the row-by-row DataFrame.append, which was removed in pandas 2.0.
overallCentroids = res[res['label'] != -1].groupby('label')[['lat','lon']].mean().reset_index(drop=True)

print('{} overall centroids of locations kept in "overallCentroids".'.format(len(overallCentroids)))
overallCentroids

3 overall centroids of locations kept in "overallCentroids".


index,lat,lon
0,45.516337,-73.833416
1,45.49986,-73.667323
2,45.919177,-74.915494


## Calculation of the presence time per day of the week
We calculate when we were present at a clustered location, per hour and day of the week.
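
The `present` function in the next cell relies on two thresholds, `tintra` and `tinter`, whose roles are easy to miss. The sketch below restates the same rule on toy hourly counts, under my reading of that function; the `presence_mask` name and the sample numbers are hypothetical.

```python
def presence_mask(cluster_counts, all_counts, tintra, tinter):
    """Mark the slots where presence at one cluster is considered significant.

    cluster_counts: per-slot counts for the cluster of interest
    all_counts:     per-slot counts over all clusters (reference level)
    tintra:         fraction of the cluster's own peak a slot must reach
    tinter:         fraction of the overall peak the cluster's peak must reach
    """
    h_max = max(cluster_counts)
    t_max = max(all_counts)
    # A cluster that is rarely visited compared to the overall peak is ignored entirely
    if h_max < t_max * tinter:
        return [0] * len(cluster_counts)
    # Otherwise keep only the slots close enough to the cluster's own peak
    return [1 if c >= h_max * tintra else 0 for c in cluster_counts]

# Toy example with 5 slots instead of 24 hours
busy = presence_mask([0, 2, 8, 10, 4], [5, 5, 10, 12, 6], 0.5, 0.5)
rare = presence_mask([1, 2, 1, 0, 0], [5, 5, 10, 12, 6], 0.5, 0.5)
print(busy, rare)
```

In the first call the cluster's peak (10) is at least half of the overall peak (12), so slots with a count of 5 or more are kept. In the second call the peak (2) is too low, so every slot is cleared.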

In [5]:
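def present(df, label, tintra, tinter):
    # For each hour of the day, count the daily centroids of this cluster seen at that hour
    hours_here = [hr for times in df[df.label==label].times for hr in times]
    h = [hours_here.count(x) for x in range(24)]
    # Same count over all clusters, used as the reference level
    hours_all = [hr for times in df.times for hr in times]
    t = [hours_all.count(x) for x in range(24)]
    # An hour is marked present (1) when it reaches tintra of this cluster's busiest hour,
    # provided that busiest hour itself reaches tinter of the busiest hour overall
    h_max, t_max = max(h), max(t)
    return [0 if x < h_max*tintra or h_max < t_max*tinter else 1 for x in h]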

ln=['Home', 'Work', 'Cottage']
dw=['MON',
    'TUE',
    'WED',
    'THU',
    'FRI',
    'SAT',
    'SUN']
presence = []
ptext = []
for l in range(len(ln)):
    print('{}:'.format(ln[l]))
    p = []
    pt = ''
    for dow in range(7):
        h = present(centroids[(centroids.index.dayofweek==dow)], l, 0.5, 0.5)
        p.append(h)
        t = ''.join(['X' if i else '.' for i in h])
        pt += '{} {}\n'.format(dw[dow],t)
    print(pt)
    presence.append(p)
    ptext.append(pt)

Home:
MON XXXXXXX.........XXXXXXXX
TUE XXXXXXX.........XXXXXXXX
WED XXXXXXX.........XXXXXXXX
THU XXXXXXX.........XXXXXXXX
FRI XXXXXXX........XXXXX....
SAT ........................
SUN .................XXXXXXX

Work:
MON ......XXXXXXXXXX........
TUE ......XXXXXXXXXX........
WED ......XXXXXXXXXX........
THU ......XXXXXXXXXX........
FRI ......XXXXXXXXX.........
SAT ........................
SUN ........................

Cottage:
MON ........................
TUE ........................
WED ........................
THU ........................
FRI ........................
SAT XXXXXXXXXXXXXXXXXXXXXXXX
SUN XXXXXXXXXXXXXXXXXXXX....



## Display the clusters locations on the map
We can now plot those cluster centroids on a map and verify that the places where I spend the most time are my home, my workplace and my cottage. Because of the way older Android phones estimated location, my house appears to be at two distinct spots, something I had noticed before while looking at my location history. But overall, the method correctly found my three main locations.

In [6]:
import folium
import numpy as np

colorsList = ['red',
            'blue',
            'green',
            'orange',
            'purple',
            'pink',
            'gray',
            'cadetblue',
            'darkred',
            'darkblue',
            'darkgreen',
            'darkpurple',
            'lightgray',
            'lightred',
            'beige',
            'lightgreen',
            'lightblue',
            'white',
            'black']

centCoordinates=overallCentroids[['lat','lon']].values

m = folium.Map(location=[45.6, -73.8], zoom_start=9)

for i, r in enumerate(centCoordinates):
    # Popup showing the presence grid computed above for this cluster
    html = '<code>' + ptext[i].replace('\n', '<br>') + '</code>'
    folium.Marker(
        location=[r[0], r[1]],
        # folium.IFrame replaces the older folium.element.IFrame
        popup=folium.Popup(folium.IFrame(html=html, width=270, height=150), max_width=2650),
        icon=folium.Icon(color=colorsList[i])
    ).add_to(m)

m