# Ghost Sequences From GeoLife Trajectory 1.3 sample data

*Part of the COVID19risk project*  
*http://covid19risk.com/*  
*2020-03-04*  

*Copyright (C) 2020 Mikhail Voloshin, Mighty Data Inc.*  
*All rights reserved.*  

The objective of this Notebook is to convert the sample GeoLife trajectory data from Microsoft (https://www.microsoft.com/en-us/download/details.aspx?id=52367) into a form that can be marshalled to a browser and rendered client-side as a heatmap that can be navigated across space and time.

The Geolife data is hundreds of megabytes ZIPped, and over 1.5 GB expanded into CSV-based "trajectory" files. This is far too big and too detailed to be usable by a client, and needs to be simplified and summarized in order to be usable.

The Geolife data is also very sparse. It's very rare for two Geolife users to be anywhere near one another at the same time, or even exhibiting activity within the same small number of days as one another.

The Geolife data also has relatively few users. There are only 182 people represented in the entire dataset.

This Notebook seeks to solve all of the above problems by imagining that each continuous record from each of these 182 people is a "ghost" that loops through the same series of actions continuously. This creates the impression of thousands of simultaneous points of continuous activity, while still permitting us to say that we're using real GPS data.


## Define parameters

In [1]:
# Define parameters

# The distance, in km, that a user has to travel before their GPS position is updated in the
# summarized record. This is just a handy way to prevent the record from getting too big:
# a user only counts as having moved once they've traveled more than roughly
# GRID_SPATIAL_KM from their last recorded position.
# The bigger this number is, the smaller the resulting output file.
GRID_SPATIAL_KM = 2

# Path to the unzipped Geolife data folder.
DATA_DIR = "../../../../data/proof-of-concept/Geolife Trajectories 1.3"

## Search for trajectory files

In [2]:
import os

if not os.path.isdir(DATA_DIR):
    raise ValueError(DATA_DIR + " doesn't appear to be a directory that exists on this filesystem.")
if not os.path.isdir(DATA_DIR+'/Data'):
    raise ValueError(DATA_DIR + " doesn't appear to be the unzipped Geolife dataset. It doesn't contain a /Data subdirectory.")

# A trajectory folder path is any subfolder of the Data directory that has a Trajectory subfolder.
trajectory_folder_paths = [f.path for f in os.scandir(DATA_DIR+'/Data') if f.is_dir() and os.path.isdir(f.path+'/Trajectory')]

print(f'Success. {len(trajectory_folder_paths)} trajectories found.')

Success. 182 trajectories found.


## Precompute some values that will be useful to us during the computation

Spatial grid degrees were computed with the help of https://andrew.hedges.name/experiments/haversine/. The longitude factor below corresponds to a latitude of roughly 40°N (about the latitude of Beijing, where most of the Geolife data was recorded), so cells are increasingly distorted east-west away from that latitude. Future versions of this script should use trigonometry to discretize by traversing a meridian in steps of *GRID_SPATIAL_KM* km from the equator, and then using the Earth's cross-section at that latitude to walk in steps of *GRID_SPATIAL_KM* km from the prime meridian. It's not hard, but it's a little too tedious to write that function at this proof-of-concept stage.

In [3]:
CELLSIZE_DEGREES_LAT = GRID_SPATIAL_KM * 0.00899
CELLSIZE_DEGREES_LONG = GRID_SPATIAL_KM * 0.01174
CELLSIZE_DEGREES_PRECISION = 5
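
The two factors above can be reproduced with a little trigonometry. The following is a minimal sketch rather than the method this Notebook uses: it assumes a spherical Earth with a radius of 6371 km, and the helper names `km_to_degrees_lat` and `km_to_degrees_long` are hypothetical. Under that assumption, the longitude factor of 0.01174 corresponds to a latitude of about 40°N.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean radius of a spherical Earth


def km_to_degrees_lat(km):
    """Degrees of latitude spanned by `km` kilometres along a meridian."""
    return math.degrees(km / EARTH_RADIUS_KM)


def km_to_degrees_long(km, latitude_deg):
    """Degrees of longitude spanned by `km` kilometres along the parallel at `latitude_deg`."""
    # The circle of constant latitude shrinks by cos(latitude).
    parallel_radius_km = EARTH_RADIUS_KM * math.cos(math.radians(latitude_deg))
    return math.degrees(km / parallel_radius_km)


print(round(km_to_degrees_lat(1), 5))         # 0.00899
print(round(km_to_degrees_long(1, 40.0), 5))  # 0.01174
```

Passing each trajectory's own latitude to a function like `km_to_degrees_long` would keep the cells close to *GRID_SPATIAL_KM* wide away from 40°N, at the cost of a grid whose longitude step varies with latitude.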


## Define some helper functions

In [4]:
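import datetime

# Build the epoch explicitly in UTC. time.mktime() interprets its argument in the machine's
# local timezone, so its result varies from machine to machine, and it can fail outright
# for pre-1970 dates on some platforms. Only differences between timestamps reach the
# output, so a constant offset here would not change the results either way.
MICROSOFT_EPOCH_START = datetime.datetime(1899, 12, 30, tzinfo=datetime.timezone.utc).timestamp()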

def convert_microsoft_epoch_to_unixtime(dayscount):
    # Microsoft Research encoded this dataset to include a field
    # that contains the number of days that have elapsed since
    # Dec 30 (not 31), 1899. No word on whether this is midnight
    # at the *beginning* or *end* of said date, but we can validate
    # against the other date and time columns to make sure we
    # got it right (which we did).

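    secondscount = dayscount * 24 * 60 * 60
    retval = MICROSOFT_EPOCH_START + secondscount
    # Round rather than truncate: the day count is a float, so a whole second can come
    # out as x.999999, and int() alone would then land one second early.
    return int(round(retval))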
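
As a sanity check on the epoch, here is a small standalone sketch of the same conversion. It assumes the day count starts at midnight UTC at the beginning of 1899-12-30. The helper name `days_to_utc_datetime` is hypothetical, and the second input is an illustrative value chosen to correspond to 2008-10-23 02:53:04, not a row quoted from the dataset.

```python
import datetime

# Assumption: day 0 is midnight UTC at the start of 1899-12-30.
EPOCH = datetime.datetime(1899, 12, 30, tzinfo=datetime.timezone.utc)


def days_to_utc_datetime(dayscount):
    """Convert a fractional day count since 1899-12-30 to an aware UTC datetime."""
    # Round to the nearest second to absorb floating-point error in the fraction.
    seconds = round(dayscount * 24 * 60 * 60)
    return EPOCH + datetime.timedelta(seconds=seconds)


print(days_to_utc_datetime(25569))             # 1970-01-01 00:00:00+00:00
print(days_to_utc_datetime(39744.1201851852))  # 2008-10-23 02:53:04+00:00
```

Day 25569 landing exactly on the Unix epoch is a quick way to confirm that the start date is Dec 30 rather than Dec 31.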
    


In [5]:
# We'll keep the last lat/long position memoed, and determine when the current position
# has wandered away from the memo by a distance of about GRID_SPATIAL_KM.
# This is probably an abuse of the dataframe apply method, but as long as it runs in
# order, we'll be okay.
# Note that we test each axis separately: a point counts as having wandered once it
# leaves a box of +/- one cell size around the memo. Strictly speaking that's a
# Chebyshev-style metric rather than a Manhattan one. In the real world, we'd use
# a Euclidean (or great-circle) metric, but this is good enough.

def check_wander(df):
    lastpos = [0, 0]
    def did_wander(row):
        if abs(row.latitude - lastpos[0]) > CELLSIZE_DEGREES_LAT or abs(row.longitude - lastpos[1]) > CELLSIZE_DEGREES_LONG:
            lastpos[0] = row.latitude
            lastpos[1] = row.longitude
            return True
        return False
        
    return df.apply(did_wander, axis=1)
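
For comparison, the same "wander" test can be written against great-circle distance instead of a per-axis box. This is only a sketch of that alternative, not what the Notebook runs: it assumes a spherical Earth of radius 6371 km, works on a plain list of `(latitude, longitude)` tuples rather than a dataframe, and the names `haversine_km` and `wander_mask` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def wander_mask(points, threshold_km):
    """Flag each point that lies more than threshold_km from the last flagged point."""
    mask = []
    last = None
    for lat, lon in points:
        # The first point is always kept, mirroring check_wander's behaviour.
        moved = last is None or haversine_km(last[0], last[1], lat, lon) > threshold_km
        if moved:
            last = (lat, lon)
        mask.append(moved)
    return mask


points = [(39.90, 116.30), (39.90, 116.31), (39.90, 116.33), (39.95, 116.33)]
print(wander_mask(points, 2))  # [True, False, True, True]
```

The trade-off is speed: the haversine call is more expensive per row than two subtractions, which matters when the full run already takes on the order of minutes.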


## Populate our data structures
This script should take about 20 minutes to run on a 2.8 GHz Lenovo laptop with 16 GB RAM.

In [6]:
%%time
print(f'Started at {datetime.datetime.now()}')

import math
import json
import pandas as pd

# The column meanings come from the User Guide PDF that comes with the Geolife dataset.
GEOLIFE_COLUMNS = [
    'latitude',
    'longitude',
    'reserved0',
    'altitude',
    'daysSinceMicrosoftEpoch',
    'date',
    'time'    
]

trajectories_by_ghostid = {}


for traj_path in trajectory_folder_paths:
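    # basename works whichever path separator the operating system uses.
    traj_id = os.path.basename(traj_path)
    print(f'Processing Trajectory ID: {traj_id}')

    # Sort the filenames so that ghost IDs are assigned in the same order on every run;
    # os.scandir yields directory entries in arbitrary order.
    traj_filenames = sorted(f.path for f in os.scandir(traj_path+'/Trajectory') if f.is_file() and f.path.endswith('.plt'))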
    print(f'{len(traj_filenames)} trajectory plots found.')
    
    for traj_subnum, traj_filename in enumerate(traj_filenames):
        ghost_id = f'{traj_id}-{traj_subnum:05}'

        df = pd.read_csv(traj_filename, skiprows=6, names=GEOLIFE_COLUMNS)

        df['unixtime'] = df['daysSinceMicrosoftEpoch'].apply(convert_microsoft_epoch_to_unixtime)
        
        df['wander'] = check_wander(df)
        
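        # Copy the selection so the column assignments below act on an independent frame.
        wanderdf = df.loc[df['wander'], ['latitude', 'longitude', 'unixtime']].copy()
        # Dwell time at each recorded position: seconds until the next recorded position
        # (0 for the last one).
        wanderdf['seconds'] = (wanderdf['unixtime'].shift(-1) - wanderdf['unixtime']).fillna(0).astype('int64')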
        

        wanderdf['lat_fixedpt'] = (wanderdf['latitude'] * math.pow(10, CELLSIZE_DEGREES_PRECISION)).astype('int64')
        wanderdf['long_fixedpt'] = (wanderdf['longitude'] * math.pow(10, CELLSIZE_DEGREES_PRECISION)).astype('int64')
        
        trimdf = wanderdf.loc[:, ['lat_fixedpt','long_fixedpt', 'seconds']]
        
        ghost_traj = json.loads(trimdf.to_json(orient='values'))
        trajectories_by_ghostid[ghost_id] = ghost_traj
            
    print(f'\nTrajectory ID {traj_id} done.\n')


Started at 2020-03-16 01:58:33.595645
Processing Trajectory ID: 000
171 trajectory plots found.

Trajectory ID 000 done.

Processing Trajectory ID: 001
71 trajectory plots found.

Trajectory ID 001 done.

Processing Trajectory ID: 002
175 trajectory plots found.

Trajectory ID 002 done.

Processing Trajectory ID: 003
322 trajectory plots found.

Trajectory ID 003 done.

Processing Trajectory ID: 004
395 trajectory plots found.

Trajectory ID 004 done.

Processing Trajectory ID: 005
86 trajectory plots found.

Trajectory ID 005 done.

Processing Trajectory ID: 006
28 trajectory plots found.

Trajectory ID 006 done.

Processing Trajectory ID: 007
54 trajectory plots found.

Trajectory ID 007 done.

Processing Trajectory ID: 008
34 trajectory plots found.

Trajectory ID 008 done.

Processing Trajectory ID: 009
49 trajectory plots found.

Trajectory ID 009 done.

Processing Trajectory ID: 010
161 trajectory plots found.

Trajectory ID 010 done.

Processing Trajectory ID: 011
201 trajectory plots found.


Trajectory ID 099 done.

Processing Trajectory ID: 100
7 trajectory plots found.

Trajectory ID 100 done.

Processing Trajectory ID: 101
68 trajectory plots found.

Trajectory ID 101 done.

Processing Trajectory ID: 102
38 trajectory plots found.

Trajectory ID 102 done.

Processing Trajectory ID: 103
48 trajectory plots found.

Trajectory ID 103 done.

Processing Trajectory ID: 104
115 trajectory plots found.

Trajectory ID 104 done.

Processing Trajectory ID: 105
10 trajectory plots found.

Trajectory ID 105 done.

Processing Trajectory ID: 106
3 trajectory plots found.

Trajectory ID 106 done.

Processing Trajectory ID: 107
3 trajectory plots found.

Trajectory ID 107 done.

Processing Trajectory ID: 108
9 trajectory plots found.

Trajectory ID 108 done.

Processing Trajectory ID: 109
4 trajectory plots found.

Trajectory ID 109 done.

Processing Trajectory ID: 110
25 trajectory plots found.

Trajectory ID 110 done.

Processing Trajectory ID: 111
44 trajectory plots found.

Trajectory ID 111 done.

## Examine our results
Take a look at how much data we've produced.

In [7]:
print(f'Number of ghosts: {len(trajectories_by_ghostid)}')

Number of ghosts: 18670


In [8]:
ghostlens = [len(traj) for traj in trajectories_by_ghostid.values()]

In [9]:
print(f'Average number of entries per ghost: {sum(ghostlens)/len(ghostlens)}')

Average number of entries per ghost: 9.562131762185324
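
The mean alone can hide a skewed distribution. In particular, the first row of every trajectory file counts as a "wander", so a ghost that never leaves its starting cell ends up with exactly one entry. A possible way to look a little deeper is sketched below; the helper name `summarize_lengths` is hypothetical and the example list is made up for illustration, not taken from the dataset.

```python
import statistics


def summarize_lengths(lengths):
    """Summarize a list of per-ghost entry counts."""
    return {
        'count': len(lengths),
        'mean': statistics.mean(lengths),
        'median': statistics.median(lengths),
        'max': max(lengths),
        # Ghosts with a single entry never move from their starting position.
        'single_entry': sum(1 for n in lengths if n == 1),
    }


example_lengths = [1, 1, 2, 4, 40]  # made-up values, for illustration only
print(summarize_lengths(example_lengths))
```

In the Notebook itself, the same function could be called as `summarize_lengths(ghostlens)`.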


## Save output
Here's what we've been waiting for! Save our output, along with the parameters we used for creating it.

In [10]:
results = {}

# Record the grid parameters.
results['gridparams'] = {
    'spatial-cell-size-km': GRID_SPATIAL_KM,
    'fixed-point-precision': CELLSIZE_DEGREES_PRECISION,
}

# And of course, the trajectories!
results['trajectories'] = trajectories_by_ghostid

In [11]:

with open('trajectories_ghosts.json', 'w') as f:
    json.dump(results, f)
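
To illustrate how a consumer of this file might replay a ghost, here is a minimal sketch. It assumes each trajectory entry is `[lat_fixedpt, long_fixedpt, seconds]` as produced above, and that a ghost restarts from its first entry once the dwell times are used up. The function name `ghost_position` and the sample `results` dict are hypothetical; in practice `results` would come from `json.load` on `trajectories_ghosts.json`, and the real client would do the equivalent in the browser.

```python
def ghost_position(traj, t, precision):
    """Return (lat, long) in degrees for a looping ghost, t seconds after it started."""
    if not traj:
        return None
    scale = 10 ** precision
    loop_seconds = sum(seconds for _, _, seconds in traj)
    # A ghost with no dwell time at all (e.g. a single entry) simply stays put.
    elapsed = t % loop_seconds if loop_seconds > 0 else 0
    for lat_fixedpt, long_fixedpt, seconds in traj:
        if elapsed < seconds:
            return lat_fixedpt / scale, long_fixedpt / scale
        elapsed -= seconds
    # Only reached when the whole loop has zero duration.
    return traj[0][0] / scale, traj[0][1] / scale


# Made-up sample in the same shape as the saved output.
results = {
    'gridparams': {'spatial-cell-size-km': 2, 'fixed-point-precision': 5},
    'trajectories': {
        '000-00000': [[3990000, 11630000, 60], [3990000, 11633000, 30], [3995000, 11633000, 0]],
    },
}

precision = results['gridparams']['fixed-point-precision']
traj = results['trajectories']['000-00000']
print(ghost_position(traj, 0, precision))   # (39.9, 116.3)
print(ghost_position(traj, 75, precision))  # (39.9, 116.33)
print(ghost_position(traj, 95, precision))  # (39.9, 116.3), after looping
```

Because the last entry of every trajectory has a dwell time of 0, the final position is passed through instantly before the loop restarts; a client that wants the ghost to linger there would need to add its own pause.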