Want:
- reproduce visualization from pptk tutorial
- get summary statistics
    - number of users
    - median/quartiles of number of locations visited
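
A minimal sketch of the two summary statistics listed above, on toy data. The `user` column matches the dataframe used later in this notebook; the `location` column is hypothetical and stands in for whatever discretized location ID we end up with.

```python
import pandas as pd

# Toy data; `location` is a hypothetical discretized location ID.
toy = pd.DataFrame({
    "user": ["000", "000", "000", "001", "001", "002"],
    "location": ["a", "b", "a", "a", "a", "c"],
})

# Number of distinct users
n_users = toy["user"].nunique()

# Number of distinct locations visited by each user
locs_per_user = toy.groupby("user")["location"].nunique()

# Quartiles (25%, median, 75%) of locations visited per user
quartiles = locs_per_user.quantile([0.25, 0.5, 0.75])

print(n_users)
print(quartiles)
```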

In [None]:
# import packages and dependencies
import pandas as pd
import matplotlib.pyplot as plt
import folium
from PIL import Image, ImageDraw
import sys
sys.path.insert(1, '../src')

import read_geolife

In [None]:

# read data and assign labels for all users
df_all_users = read_geolife.read_all_users('../data/')

In [None]:
# save raw data in pickle format 
df_all_users.to_pickle('../results/geolife.pkl')

In [None]:
# read pickled data
df = pd.read_pickle('../results/geolife.pkl')

In [None]:
df.describe()

In [None]:
# inspect the row(s) with the extreme altitude value seen in describe()
df[df['alt'] == -3.264760e+04]

Some observations: 
- latitude: some values fall outside the possible range of [-90, 90], e.g. min = 104.40, max = 400.17?

- longitude: Ok at first glance

- altitude: This is trickier. The codebook gives altitude in feet but doesn't say what it is measured relative to (mean sea level? some other reference point?). Many smartphones/GPS trackers rely on satellite data (based on WGS84 https://en.wikipedia.org/wiki/World_Geodetic_System#A_new_World_Geodetic_System:_WGS_84) or measure altitude from air pressure (which may not be accurate in pressure-controlled areas, e.g. inside an airplane).

There is also a value of about -32,648 ft, which looks awfully low no matter what the reference is. It corresponds to one measurement from user 42 and could have been a measurement error.

Codebook does say altitude values of '-777' are 'invalid'. At this point we have not processed them. 
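
One possible way to handle the '-777' sentinel, sketched on toy data: mark it as missing instead of dropping the row, so the lat/lon of those points stay usable. This is an assumption about how we might process it, not something the codebook prescribes. The `alt` column name follows the dataframe in this notebook.

```python
import numpy as np
import pandas as pd

INVAL_ALT = -777  # sentinel for invalid altitude, from the codebook

toy = pd.DataFrame({"alt": [492.0, -777.0, 150.0]})

# Replace the sentinel with NaN so it is excluded from describe(), mean(), etc.
toy["alt"] = toy["alt"].replace(INVAL_ALT, np.nan)

n_invalid = int(toy["alt"].isna().sum())
print(n_invalid)
```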

Elevation on Earth's land surface ranges from about 1385 ft below sea level at the Dead Sea to about 29035 ft at the summit of Mt. Everest. We will lower bound altitude using the elevation at the Dead Sea, although given that the codebook flags invalid altitudes with -777, that might be a generous bound.

We will also filter latitude values falling outside of the [-90 deg, 90 deg] range. 

In [None]:
# Let's filter out "unrealistic" values. We'll make two filtered dataframes, one
# lower bounded by -777 (number given by codebook) and the other lower bounded by the Dead Sea,
# and see how much data we lose in each case.

LOW_ALT = -1385
INVAL_ALT = -777
LOWER_LAT = -90
UPPER_LAT = 90

# lower bounded by codebook threshold
df_valid_cb = df[(df['alt'] > INVAL_ALT) & (df['lat'] > LOWER_LAT) & (df['lat'] < UPPER_LAT)]
# lower bounded by the Dead Sea
df_valid_ds = df[(df['alt'] > LOW_ALT) & (df['lat'] > LOWER_LAT) & (df['lat'] < UPPER_LAT)]

In [None]:
df_valid_cb.describe()

In [None]:
# overall summary
df_valid_ds.describe()

In [None]:
# user-level summary
pd.set_option('display.max_rows', None)
df_valid_ds.groupby('user').describe()

In [None]:
df_valid_cb.shape[0]/df.shape[0]
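
To compare how much data each altitude bound keeps, the same ratio can be wrapped in a small helper. This is a sketch on toy data; `retained_fraction` is a hypothetical helper name, and the filter mirrors the strict inequalities used above.

```python
import pandas as pd

def retained_fraction(df, alt_floor, lat_bounds=(-90, 90)):
    """Fraction of rows with alt > alt_floor and lat strictly inside lat_bounds."""
    keep = (
        (df["alt"] > alt_floor)
        & (df["lat"] > lat_bounds[0])
        & (df["lat"] < lat_bounds[1])
    )
    return keep.mean()

toy = pd.DataFrame({
    "lat": [39.9, 40.0, 400.17, 39.8],
    "alt": [492.0, -777.0, 100.0, -1000.0],
})

frac_cb = retained_fraction(toy, alt_floor=-777)   # codebook threshold
frac_ds = retained_fraction(toy, alt_floor=-1385)  # Dead Sea threshold
print(frac_cb, frac_ds)
```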

Codebook excerpt on features

"Line 1…6 are useless in this dataset, and can be ignored. Points are described in following lines, one for each line.
Field 1: Latitude in decimal degrees.
Field 2: Longitude in decimal degrees.
Field 3: All set to 0 for this dataset.
Field 4: Altitude in feet (-777 if not valid).
Field 5: Date - number of days (with fractional part) that have passed since 12/30/1899.
Field 6: Date as a string.
Field 7: Time as a string.
Note that field 5 and field 6&7 represent the same date/time in this dataset. You may use either of them.
Example:
39.906631,116.385564,0,492,40097.5864583333,2009-10-11,14:04:30
39.906554,116.385625,0,492,40097.5865162037,2009-10-11,14:04:35"
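
As a check on Field 5, the day counts from the two example rows can be converted to timestamps with the 12/30/1899 origin and compared against Fields 6 and 7. A minimal sketch; rounding to the nearest second is my own addition to absorb floating-point error.

```python
import pandas as pd

# Field 5 values from the two example rows in the codebook
days = pd.Series([40097.5864583333, 40097.5865162037])

# Days (with fractional part) since 1899-12-30, rounded to the nearest second
timestamps = pd.to_datetime(days, unit="D", origin="1899-12-30").dt.round("s")

print(timestamps.tolist())
```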

Sanity checks on features i.e. known possible range
Latitude: [-90 deg., +90 deg.]
Longitude: [-180 deg., 180 deg.]
Altitude: 
https://en.wikipedia.org/wiki/List_of_lowest_airports
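
The latitude and longitude ranges above can be checked per column before filtering. A sketch on toy data; `count_out_of_range` is a hypothetical helper, and the `lat`/`lon` column names follow this notebook. Note that `between` treats NaN as out of range.

```python
import pandas as pd

# Known valid ranges in decimal degrees (closed intervals)
VALID_RANGES = {"lat": (-90.0, 90.0), "lon": (-180.0, 180.0)}

def count_out_of_range(df, ranges=VALID_RANGES):
    """Return the number of rows outside the valid range, per column."""
    return {
        col: int((~df[col].between(lo, hi)).sum())
        for col, (lo, hi) in ranges.items()
    }

toy = pd.DataFrame({
    "lat": [39.9, 104.40, 400.17],
    "lon": [116.4, 116.3, -181.0],
})

counts = count_out_of_range(toy)
print(counts)
```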

In [None]:
# Plot raw lat/lon trajectories for nine users (indices 1 to 9 of the unique user list)
unique_users = df_valid_cb['user'].unique()[1:10]
plt.figure(figsize=(10, 8))

for user in unique_users:
    user_data = df_valid_cb[df_valid_cb['user'] == user]
    plt.plot(user_data['lon'], user_data['lat'], marker='o', label=f'User {user}')

plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Trajectories of Multiple Users')
plt.legend()
plt.show()

2024-02-20-tues
TODO: 
- clean up notebook lol
- continue cross-referencing with the data documentation + keep filtering to deal with nonsensical values
- discretize the surface covered by these GPS coordinates (e.g. Google S2 or something else) such that each element of that partition corresponds to a “location”
- visualize trajectories for a random subset of the users. Do you know good Python packages for visualizing GPS data? I was thinking Folium and GeoPandas. This tutorial also seems promising: https://courses.spatialthoughts.com/python-dataviz.html
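
For the discretization item in the list above, a simple fixed-size lat/lon grid is one possible starting point before trying S2. This is a sketch, not S2: `CELL_DEG`, `add_grid_cell`, and the `cell_row`/`cell_col` columns are hypothetical names, and the cell size is an arbitrary choice.

```python
import numpy as np
import pandas as pd

CELL_DEG = 0.01  # hypothetical cell size in degrees

def add_grid_cell(df, cell_deg=CELL_DEG):
    """Return a copy of df with integer grid indices for each point."""
    out = df.copy()
    out["cell_row"] = np.floor(out["lat"] / cell_deg).astype(int)
    out["cell_col"] = np.floor(out["lon"] / cell_deg).astype(int)
    return out

toy = pd.DataFrame({
    "lat": [39.906631, 39.906554, 39.9566],
    "lon": [116.385564, 116.385625, 116.3856],
})

gridded = add_grid_cell(toy)

# Each distinct (cell_row, cell_col) pair is one "location"
n_locations = gridded.groupby(["cell_row", "cell_col"]).ngroups
print(n_locations)
```

A 0.01 degree cell is roughly 1.1 km north to south, but its east-west width shrinks with latitude, which is one reason an equal-area scheme like S2 may be preferable later.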

https://geopandas.org/en/stable/docs/user_guide/mapping.html
https://python-visualization.github.io/folium/latest/getting_started.html

Thanks for the suggestions! Looks like folium and geopandas are the way to go w/r/t visualization (and partitioning locations). I’ll check these out once I’m done cleaning the data and plot some sample trajectories. I’ll also generate some user-level summaries.

General processing of GPS data:
- https://jovian.com/jonpappalord/skmob03-preprocessing#C11
