# Discretizing Individual GeoLife Trajectory 1.3 sample data into a spatial grid

*Part of the COVID19risk project*  
*http://covid19risk.com/*  
*2020-03-04*  

*Copyright (C) 2020 Mikhail Voloshin, Mighty Data Inc.*  
*All rights reserved.*  

The objective of this Notebook is to convert the sample GeoLife trajectory data from Microsoft (https://www.microsoft.com/en-us/download/details.aspx?id=52367) into a form that can be marshalled to a browser for purposes of being rendered client-side as a heatmap that can be navigated across space and time.

The GeoLife data is hundreds of megabytes ZIPped, and over 1.5 GB expanded into CSV-based "trajectory" files. This is far too big and too detailed to be usable by a client.

As such, the purpose of this Notebook is to simplify each individual trajectory record for easy network transmission.

This Notebook discretizes individual trajectories into a spatial grid, but not a temporal one. It does, however, "collapse" temporal information -- that is, when an indivudal is in the same cell for multiple time ticks, it will only export the time of the individual's entry into the cell, and will assume that the individual is still in that same cell until further notified.

## Define parameters

In [1]:
# Define parameters

# The number of kilometers covered by each grid cell, in both the latitude and longitude direction.
# Discretization is zeroed on the equator and prime meridian for latitude and longitude respectively.
# For now, the number of kilometers per degree is fixed at a conversion rate optimized for 40oN -100oW,
# which lies in northern Kansas near the geospatial center of the US. Future versions of this
# script will employ an algorithm that takes the Earth's curvature into account. For now, this will lead
# to a slight illusory increase in data points per cell (i.e. deceptively "hotter" cells) near
# the equator, and an illusory decrease in data points per cell (i.e. deceptively "colder" cells)
# near the poles. The distortion near the poles will be significant, but because those are far from
# human population centers this should be acceptable.
GRID_SPATIAL_KM = 1

# The number of days for each time step into which to discretize the data.
# Discretization is zeroed on the Unix epoch.
GRID_TEMPORAL_DAYS = 7

# Path to the unzipped Geolife data folder.
DATA_DIR = "../../../../data/proof-of-concept/Geolife Trajectories 1.3"

## Search for trajectory files

In [2]:
import os

if not os.path.isdir(DATA_DIR):
    raise ValueError(DATA_DIR + " doesn't appear to be a directory that exists on this filesystem.")
if not os.path.isdir(DATA_DIR+'/Data'):
    raise ValueError(DATA_DIR + " doesn't appear to be the unzipped Geolife dataset. It doesn't contain a /Data subdirectory.")

# A trajectory folder path is any subfolder of the Data directory that has a Trajectory subfolder.
trajectory_folder_paths = [f.path for f in os.scandir(DATA_DIR+'/Data') if f.is_dir() and os.path.isdir(f.path+'/Trajectory')]

print(f'Success. {len(trajectory_folder_paths)} trajectories found.')

Success. 182 trajectories found.


## Precompute some values that will be useful to us during the computation

Spatial grid degrees were computed with the help of https://andrew.hedges.name/experiments/haversine/. Future versions of this script should use trigonometry to discretize by traversing a longitudinal line in intervals of *GRID_SPATIAL_KM* km steps from the equator, and then using the Earth's cross-section at that latitude to walk by *GRID_SPATIAL_KM* km steps from the prime meridian. It's not hard, but a little too tedious to go through the trouble of writing said function for this proof-of-concept stage.

In [3]:
CELLSIZE_SECONDS = GRID_TEMPORAL_DAYS * 24 * 60 * 60

CELLSIZE_DEGREES_LAT = GRID_SPATIAL_KM * 0.00899
CELLSIZE_DEGREES_LONG = GRID_SPATIAL_KM * 0.01174
CELLSIZE_DEGREES_PRECISION = 5


## Define some helper functions

In [4]:
import time
import datetime

MICROSOFT_EPOCH_START = time.mktime(datetime.datetime.strptime('1899-12-30', '%Y-%m-%d').timetuple())

def convert_microsoft_epoch_to_unixtime(dayscount):
    # Microsoft Research encoded this dataset to include a field
    # that contains the number of days that have elapsed since
    # Dec 30 (not 31), 1899. No word on whether this is midnight
    # at the *beginning* or *end* of said date, but we can validate
    # against the other date and time columns to make sure we
    # got it right (which we did).

    secondscount = dayscount * 24 * 60 * 60
    retval = MICROSOFT_EPOCH_START + secondscount
    return int(retval)
    


In [5]:
latlong_fstring_part = f'{{:.{CELLSIZE_DEGREES_PRECISION}f}}'
latlong_fstring = f'{latlong_fstring_part},{latlong_fstring_part}'

def determine_spatial_grid_cell(df):
    df['unixtime'] = df['daysSinceMicrosoftEpoch'].apply(convert_microsoft_epoch_to_unixtime)    
    
    # We're using int as a de facto math.floor function
    df['cell_latitude'] = (df['latitude'] / CELLSIZE_DEGREES_LAT).apply(int) * CELLSIZE_DEGREES_LAT
    df['cell_longitude'] = (df['longitude'] / CELLSIZE_DEGREES_LONG).apply(int) * CELLSIZE_DEGREES_LONG

    df['cell_key_spatial'] = df.apply(lambda x: latlong_fstring.format(x['cell_latitude'], x['cell_longitude']), axis=1)
    
    return df
    

## Populate our data structures
This script should take about 20 minutes to run on a 2.8 Ghz Lenovo laptop with 16 GB RAM.

In [6]:
%%time
print(f'Started at {datetime.datetime.now()}')

import math
import json
import pandas as pd

# The column meanings come from the User Guide PDF that comes with the Geolife dataset.
GEOLIFE_COLUMNS = [
    'latitude',
    'longitude',
    'reserved0',
    'altitude',
    'daysSinceMicrosoftEpoch',
    'date',
    'time'    
]

trajectories_time_in_cell = {}


for traj_path in trajectory_folder_paths:
    traj_id = traj_path.split('/')[-1]
    print(f'Processing Trajectory ID: {traj_id}')

    cells_hit_by_this_traj = set()
    
    traj_filenames = [f.path for f in os.scandir(traj_path+'/Trajectory') if f.is_file() and f.path.endswith('.plt')]
    print(f'{len(traj_filenames)} trajectory plots found.')
    
    trajdf = None
    for traj_filename in traj_filenames:
        print('.', end='')
        df = pd.read_csv(traj_filename, skiprows=6, names=GEOLIFE_COLUMNS)
        determine_spatial_grid_cell(df)
        # print(f'Read {traj_filename} into a dataframe')
        
        trimdf = df[['unixtime', 'cell_latitude', 'cell_longitude', 'cell_key_spatial']]
        if trajdf is None:
            trajdf = trimdf
        else:
            trajdf = trajdf.append(trimdf)

    trajdf = trajdf.sort_values(by='unixtime')
    
    # Select only the cells that represent a change in the traveler's position.
    # Without such a change, we can assume that the traveler is stationary.
    trajdf['entered_cell'] = trajdf['cell_key_spatial'] != trajdf['cell_key_spatial'].shift()
    
    # A gap of more than an hour indicates that the user has gone offline.
    trajdf['gone_offline'] = trajdf['unixtime'].shift(-1).isna() | ((trajdf['unixtime'].shift(-1) - trajdf['unixtime']) > (60 * 60))

    trajdf_enterexits = trajdf[trajdf['entered_cell'] | trajdf['gone_offline']].copy()
    trajdf_enterexits['unixtime_end'] = trajdf_enterexits['unixtime'].shift(-1)
    trajdf_enterexits = trajdf_enterexits[trajdf_enterexits['gone_offline'] == False].copy()
    trajdf_enterexits['unixtime_end'] = trajdf_enterexits['unixtime_end'].astype('int64')
    
    trajdf_enterexits['cell_lat_fixedpt'] = (trajdf_enterexits['cell_latitude'] * math.pow(10, CELLSIZE_DEGREES_PRECISION)).astype('int64')
    trajdf_enterexits['cell_long_fixedpt'] = (trajdf_enterexits['cell_longitude'] * math.pow(10, CELLSIZE_DEGREES_PRECISION)).astype('int64')
        
    trajdf_min = trajdf_enterexits.loc[:, ['unixtime', 'unixtime_end', 'cell_lat_fixedpt', 'cell_long_fixedpt']]

    # It's a little silly that the most efficient way to turn a pandas dataframe
    # into a list of lists is to go through json, but it works.
    trajectories_time_in_cell[traj_id] = json.loads(trajdf_min.to_json(orient='values'))
            
    print(f'\nTrajectory ID {traj_id} done.\n')


Started at 2020-03-13 02:40:11.427768
Processing Trajectory ID: 000
171 trajectory plots found.
...........................................................................................................................................................................
Trajectory ID 000 done.

Processing Trajectory ID: 001
71 trajectory plots found.
.......................................................................
Trajectory ID 001 done.

Processing Trajectory ID: 002
175 trajectory plots found.
...............................................................................................................................................................................
Trajectory ID 002 done.

Processing Trajectory ID: 003
322 trajectory plots found.
............................................................................................................................................................................................................................................

....................................................................................................................................................
Trajectory ID 037 done.

Processing Trajectory ID: 038
110 trajectory plots found.
..............................................................................................................
Trajectory ID 038 done.

Processing Trajectory ID: 039
227 trajectory plots found.
...................................................................................................................................................................................................................................
Trajectory ID 039 done.

Processing Trajectory ID: 040
27 trajectory plots found.
...........................
Trajectory ID 040 done.

Processing Trajectory ID: 041
557 trajectory plots found.
.........................................................................................................................................................

...........................................................................................................................................................................................................................................................................................................................................................................................................................................
Trajectory ID 085 done.

Processing Trajectory ID: 086
6 trajectory plots found.
......
Trajectory ID 086 done.

Processing Trajectory ID: 087
8 trajectory plots found.
........
Trajectory ID 087 done.

Processing Trajectory ID: 088
59 trajectory plots found.
...........................................................
Trajectory ID 088 done.

Processing Trajectory ID: 089
64 trajectory plots found.
................................................................
Trajectory ID 089 done.

Processing Trajectory ID: 090
8 trajectory plots found.
........
Trajectory ID 0

....................
Trajectory ID 130 done.

Processing Trajectory ID: 131
21 trajectory plots found.
.....................
Trajectory ID 131 done.

Processing Trajectory ID: 132
6 trajectory plots found.
......
Trajectory ID 132 done.

Processing Trajectory ID: 133
5 trajectory plots found.
.....
Trajectory ID 133 done.

Processing Trajectory ID: 134
75 trajectory plots found.
...........................................................................
Trajectory ID 134 done.

Processing Trajectory ID: 135
13 trajectory plots found.
.............
Trajectory ID 135 done.

Processing Trajectory ID: 136
17 trajectory plots found.
.................
Trajectory ID 136 done.

Processing Trajectory ID: 137
1 trajectory plots found.
.
Trajectory ID 137 done.

Processing Trajectory ID: 138
18 trajectory plots found.
..................
Trajectory ID 138 done.

Processing Trajectory ID: 139
19 trajectory plots found.
...................
Trajectory ID 139 done.

Processing Trajectory ID: 140
380 t

....................................
Trajectory ID 169 done.

Processing Trajectory ID: 170
5 trajectory plots found.
.....
Trajectory ID 170 done.

Processing Trajectory ID: 171
5 trajectory plots found.
.....
Trajectory ID 171 done.

Processing Trajectory ID: 172
21 trajectory plots found.
.....................
Trajectory ID 172 done.

Processing Trajectory ID: 173
6 trajectory plots found.
......
Trajectory ID 173 done.

Processing Trajectory ID: 174
70 trajectory plots found.
......................................................................
Trajectory ID 174 done.

Processing Trajectory ID: 175
4 trajectory plots found.
....
Trajectory ID 175 done.

Processing Trajectory ID: 176
8 trajectory plots found.
........
Trajectory ID 176 done.

Processing Trajectory ID: 177
1 trajectory plots found.
.
Trajectory ID 177 done.

Processing Trajectory ID: 178
1 trajectory plots found.
.
Trajectory ID 178 done.

Processing Trajectory ID: 179
71 trajectory plots found.
....................

In [7]:
trajdf_min

Unnamed: 0,unixtime,unixtime_end,cell_lat_fixedpt,cell_long_fixedpt
0,1197039964,1197040560,3996954,11631992
2,1197040560,1197040676,3996954,11633166
3,1197040676,1197040865,3995156,11633166
6,1197040865,1197041288,3996055,11633166
9,1197052942,1197053061,3998751,11633166
...,...,...,...,...
32,1203239758,1203240513,3997853,11630818
41,1203240513,1203243900,3997853,11629644
55,1203262930,1203263135,3998751,11629644
0,1205481474,1205484182,4091348,11170610


## Trim garbage time slots
There appear to be a small number of errors in the timestamps of the Geolife data. As such, throw out any data that occurs before there are at least 3 active users.

In [8]:
traj_with_starttime = [[traj_id, trajectory[0][0]] for traj_id, trajectory in trajectories_time_in_cell.items()]
traj_with_starttime.sort(key = lambda x: x[1])

earliest_acceptable_time = traj_with_starttime[2][1]
print(f'Filtering out all records with a start time before {earliest_acceptable_time}')

def trajectory_filtered(trajectory):
    return [trecord for trecord in trajectory if trecord[0]>=earliest_acceptable_time]

trajectories_time_in_cell = {
    traj_id:trajectory_filtered(trajectory) for traj_id, trajectory in trajectories_time_in_cell.items()
}

traj_with_starttime = [[traj_id, trajectory[0][0]] for traj_id, trajectory in trajectories_time_in_cell.items()]
traj_with_starttime.sort(key = lambda x: x[1])
traj_with_starttime

Filtering out all records with a start time before 1176391133


[['161', 1176391133],
 ['163', 1176391276],
 ['142', 1176394485],
 ['097', 1176403581],
 ['111', 1176495809],
 ['128', 1176530188],
 ['076', 1176746284],
 ['134', 1177406516],
 ['140', 1177783389],
 ['021', 1177853671],
 ['086', 1177912445],
 ['154', 1177951657],
 ['115', 1178724420],
 ['058', 1178740107],
 ['118', 1179008395],
 ['138', 1179902920],
 ['082', 1180021176],
 ['098', 1180192904],
 ['117', 1182446500],
 ['080', 1182955761],
 ['072', 1184945325],
 ['153', 1185012427],
 ['090', 1185080621],
 ['171', 1185179513],
 ['055', 1185897073],
 ['061', 1185900349],
 ['157', 1185900476],
 ['057', 1185900523],
 ['094', 1185900523],
 ['150', 1185900523],
 ['146', 1185907852],
 ['162', 1185944998],
 ['047', 1185947239],
 ['010', 1186216231],
 ['091', 1186687843],
 ['087', 1187156226],
 ['060', 1187429165],
 ['139', 1188821753],
 ['105', 1191309371],
 ['107', 1191328409],
 ['108', 1191353115],
 ['106', 1191826605],
 ['056', 1192047812],
 ['175', 1192789395],
 ['114', 1192789468],
 ['174', 1

In [9]:
print(f'Earliest time in recordset: {traj_with_starttime[0][1]}')
print(f'  Latest time in recordset: {traj_with_starttime[-1][1]}')


Earliest time in recordset: 1176391133
  Latest time in recordset: 1333000710


## Count how many grid cells we're using
For sake of reference, count how many cells our data set covers. This might be useful for the client to know.

In [10]:
all_cells_used = set()

# We've shaved off our cell key, but we can reconstruct something just as unique.
# Remember we're in the integer domain, so some things are easier now.
for trajectory in trajectories_time_in_cell.values():
    for trecord in trajectory:
        celldesignator = f'{trecord[1:]}'
        all_cells_used.add(celldesignator)

len(all_cells_used)

532361

## Save output
Here's what we've been waiting for! Save our output, along with the parameters we used for creating it.

In [11]:
results = {}

# Record the grid parameters.
results['gridparams'] = {
    'spatial-cell-size-km': GRID_SPATIAL_KM,
    'fixed-point-precision': CELLSIZE_DEGREES_PRECISION,
}

results['ranges'] = {
    'num-cells': len(all_cells_used),
    'time-start': traj_with_starttime[0][1],
    'time-end': traj_with_starttime[-1][1]
}

# And of course, the trajectories!
results['trajectories'] = trajectories_time_in_cell

In [12]:

with open('trajectories_in_spatial_grid.json', 'w') as f:
    json.dump(results, f)