In [1]:
# notebook setup
import datetime as dt
from io import StringIO
import requests
import os
import sys

from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display, HTML
import numpy as np
import pandas as pd
from pathlib import Path
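# keep a handle on stdout so it can be restored after importing gscap
stdout = sys.stdout
# add the directory two levels above the working directory to the import path
p = Path().resolve().parents[1]
sys.path.append(str(p))
import gscap.gps as gps
sys.stdout = stdout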
InteractiveShell.ast_node_interactivity = 'all'
np.random.seed(42)


GSCAP configuration file not found. See the docs at https://uw-creativ-lab.github.io/gSCAP/ for more details.


## The GSCAP GPS scripts
The GPS scripts provide a number of miscellaneous, clustering, and API wrapper methods. The main input to each call is a Pandas dataframe containing three columns: lat, lon, and ts, as shown in the next cell.

In [2]:
r = requests.get('https://raw.githubusercontent.com/UW-Creativ-Lab/gSCAP/master/notebooks/demos/some_data.csv')

if r.ok:
    # decode the response body and parse it; ts is read as datetimes
    sio = StringIO(r.content.decode('utf-8'))
    df = pd.read_csv(sio, parse_dates=['ts'])
    sio.close()
    df.head()

Unnamed: 0,lat,lon,ts
0,45.50473,-122.7829,2015-06-01 00:01:04.561
1,45.50473,-122.7829,2015-06-01 00:02:16.037
2,45.50471,-122.78284,2015-06-01 00:03:01.134
3,45.50471,-122.78284,2015-06-01 00:07:51.359
4,45.50469,-122.78291,2015-06-01 00:09:10.460
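
If your data does not come from a CSV file, a frame of the same shape can be built by hand. The sketch below uses made-up readings and assumes only the three column names shown above. Converting `ts` to datetimes mirrors `parse_dates=['ts']` in the cell above; whether gSCAP accepts other dtypes for `ts` is not confirmed here.

```python
import pandas as pd

# hypothetical raw readings; the values are illustrative only
raw = [
    (45.50473, -122.78290, '2015-06-01 00:01:04.561'),
    (45.50471, -122.78284, '2015-06-01 00:03:01.134'),
]

records = pd.DataFrame(raw, columns=['lat', 'lon', 'ts'])
# parse the timestamp strings into datetimes
records['ts'] = pd.to_datetime(records['ts'])

print(records.dtypes)
```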


#### `resample_gps_intervals(records)`
Currently, this method resamples at one-minute intervals, taking the mean of each lat, lon within the interval. The timestamp returned is the minute in which the samples were taken.

In [3]:
df = gps.resample_gps_intervals(df)
df.head()

Unnamed: 0,lat,lon,ts
0,45.50473,-122.7829,2015-06-01 00:01:00
1,45.50473,-122.7829,2015-06-01 00:02:00
2,45.50471,-122.78284,2015-06-01 00:03:00
3,45.50471,-122.78284,2015-06-01 00:07:00
4,45.50469,-122.78291,2015-06-01 00:09:00
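
The behavior described above can be approximated in plain Pandas. This is a minimal sketch and not the gSCAP implementation: it assumes that minutes without any sample are simply absent from the result, as the output above suggests, and the helper name `resample_to_minutes` is hypothetical.

```python
import pandas as pd

def resample_to_minutes(records):
    """Average lat/lon within each minute that has at least one sample."""
    # truncate every timestamp to the start of its minute
    minute = records['ts'].dt.floor('min')
    out = records.groupby(minute)[['lat', 'lon']].mean().reset_index()
    return out[['lat', 'lon', 'ts']]

sample = pd.DataFrame({
    'lat': [45.0, 45.2, 46.0],
    'lon': [-122.0, -122.2, -123.0],
    'ts': pd.to_datetime(['2015-06-01 00:01:04', '2015-06-01 00:01:50',
                          '2015-06-01 00:03:01']),
})
resampled = resample_to_minutes(sample)
print(resampled)
```

The two samples taken during 00:01 collapse into one averaged row, and the empty minute 00:02 produces no row.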


#### `impute_stationary_coordinates(records, freq)` 
This method upsamples the number of points at a stationary location. It accounts for data collection methods that don't take samples when the research participant hasn't changed location, and it is beneficial when used with the density-based clustering techniques defined below.
* freq - the imputation interval, given as an offset string as defined in the Pandas timeseries methods. Read more in the [Pandas Docs](https://pandas.pydata.org/pandas-docs/stable/timeseries.html)

In [4]:
df = gps.impute_stationary_coordinates(df, freq='10Min')
df.head()

imputing coordinates: 100%|█████████████████████████████████████████████████████| 46927/46927 [01:47<00:00, 438.21it/s]
processing metrics: 100%|███████████████████████████████████████████████████████| 47556/47556 [01:37<00:00, 486.71it/s]


Unnamed: 0,lat,lon,ts,displacement,time_delta,velocity,binning,date
0,45.50473,-122.7829,2015-06-01 00:01:00,,,,,2015-06-01
1,45.50473,-122.7829,2015-06-01 00:02:00,0.0,60.0,0.0,stationary,2015-06-01
2,45.50471,-122.78284,2015-06-01 00:03:00,5.2,60.0,0.086,stationary,2015-06-01
3,45.50471,-122.78284,2015-06-01 00:07:00,0.0,240.0,0.0,stationary,2015-06-01
4,45.50469,-122.78291,2015-06-01 00:09:00,5.9,120.0,0.049,stationary,2015-06-01
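
To make the idea of imputation concrete, here is a simplified sketch. It is not the gSCAP algorithm: the stationarity test (both coordinates changing by at most `max_move_deg` degrees) and the function name `impute_stationary` are assumptions made for illustration, and the derived columns such as `displacement` and `velocity` are not computed.

```python
import pandas as pd

def impute_stationary(records, freq='10min', max_move_deg=1e-4):
    """Fill long gaps between near-identical fixes with repeated points."""
    step = pd.Timedelta(freq)
    recs = records.sort_values('ts').reset_index(drop=True)
    if recs.empty:
        return recs[['lat', 'lon', 'ts']]
    rows = []
    for cur, nxt in zip(recs.itertuples(), recs.iloc[1:].itertuples()):
        rows.append((cur.lat, cur.lon, cur.ts))
        # assumed rule: the participant is stationary if neither
        # coordinate moved by more than max_move_deg
        stationary = (abs(nxt.lat - cur.lat) <= max_move_deg
                      and abs(nxt.lon - cur.lon) <= max_move_deg)
        if stationary and nxt.ts - cur.ts > step:
            # timestamps strictly between the two fixes, spaced by freq
            fill = pd.date_range(cur.ts + step, nxt.ts, freq=step)
            rows.extend((cur.lat, cur.lon, t) for t in fill[fill < nxt.ts])
    last = recs.iloc[-1]
    rows.append((last['lat'], last['lon'], last['ts']))
    return pd.DataFrame(rows, columns=['lat', 'lon', 'ts'])

sample = pd.DataFrame({
    'lat': [45.5, 45.5, 46.0],
    'lon': [-122.78, -122.78, -122.0],
    'ts': pd.to_datetime(['2015-06-01 00:00', '2015-06-01 00:35',
                          '2015-06-01 01:35']),
})
filled = impute_stationary(sample, freq='10min')
print(filled)
```

The 35-minute gap at the first location gains three imputed points, while the gap that follows a change of location is left untouched.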


### Clustering

#### `get_clusters_with_context(records)`
This takes the same records dataframe as the methods above. It partitions the records into three sets (home, work, and everything else) and then performs separate clustering on each partition. Two variables are returned: the first is the records with an additional `cid` column assigning each record to a cluster, and the second is the clusters themselves.

This function applies the unsupervised clustering algorithm DBSCAN to the provided records. Parameters should be a dict containing both min_samples and eps. See the [scikit-learn docs](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) for more information. Default parameters are used if none are supplied. However, the defaults were tuned to generalize well on a specific dataset, and I recommend tuning them on your own data in the same way. Some good sources for tuning DBSCAN hyperparameters are:

* x
* y
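
One widely used heuristic for choosing eps is the sorted k-distance curve: compute each point's distance to its k-th nearest neighbor, sort the values, and look for the "elbow" where they jump. The sketch below uses only NumPy. The helper names, the haversine distance, and the Earth radius of 6,371,000 m are assumptions for illustration, and the units gSCAP expects for eps are not confirmed here.

```python
import numpy as np

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius in meters

def haversine_matrix(coords_deg):
    """Pairwise great-circle distances in meters for (lat, lon) in degrees."""
    rad = np.radians(np.asarray(coords_deg, dtype=float))
    lat, lon = rad[:, 0], rad[:, 1]
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0, 1)))

def k_distances(coords_deg, k):
    """Sorted distance from each point to its k-th nearest other point."""
    d = np.sort(haversine_matrix(coords_deg), axis=1)
    # column 0 holds the zero distance from each point to itself
    return np.sort(d[:, k])

# four tightly grouped points and one far-away outlier (made-up values)
coords = [
    (45.5000, -122.78), (45.5001, -122.78),
    (45.5002, -122.78), (45.5003, -122.78),
    (45.6000, -122.60),
]
kd = k_distances(coords, k=2)
print(np.round(kd, 1))
```

The values stay small for the grouped points and jump sharply for the outlier; a candidate eps sits just below that jump. The matrix is quadratic in the number of points, so subsample large record sets before trying this.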

In [5]:
df, clusters = gps.get_clusters_with_context(df)
df.head()

Unnamed: 0,lat,lon,ts,displacement,time_delta,velocity,binning,date,cid,distance_from_home
0,45.50473,-122.7829,2015-06-01 00:01:00,,,,,2015-06-01,xNot,4.515553
0,45.50473,-122.7829,2015-06-01 00:02:00,0.0,60.0,0.0,stationary,2015-06-01,home,4.515553
1,45.50471,-122.78284,2015-06-01 00:03:00,5.2,60.0,0.086,stationary,2015-06-01,home,8.618027
2,45.50471,-122.78284,2015-06-01 00:07:00,0.0,240.0,0.0,stationary,2015-06-01,home,8.618027
3,45.50469,-122.78291,2015-06-01 00:09:00,5.9,120.0,0.049,stationary,2015-06-01,home,8.895594


In [6]:
clusters = clusters.sort_values(by='cid')
clusters.head()

Unnamed: 0,lat,lon,cid,name,categories
18,45.50477,-122.78291,home,home,home
19,45.51163,-122.68454,work,work,work
0,45.51224,-122.68435,x0,nap,nap
1,45.54832,-122.65123,x1,nap,nap
10,45.46099,-122.70129,x10,nap,nap


### Place lookup

The place methods are currently built to allow calls to either Yelp or Google for contextual lookup. Both are processed via the same call, which takes requests in the form of a list of `gps.PlaceRequest`. Each request is cached in a backend sqlite file so that the same request doesn't have to be submitted more than once. Results are returned as a dictionary containing three keys:
1. `hits` tells you how many of your requests were returned from the cache
2. `misses` tells you how many went through the API endpoint 
3. `request` is a dataframe of the resulting places
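
The caching idea can be illustrated with the standard library alone. This is a sketch of the general pattern, not gSCAP's cache: the table layout, the string keys, and the `fake_api` stand-in are all hypothetical, and `request` is a plain list here rather than a dataframe.

```python
import json
import sqlite3

def cached_lookup(conn, keys, fetch):
    """Return results for keys, calling fetch only on cache misses."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)')
    hits, misses, results = 0, 0, []
    for key in keys:
        row = conn.execute(
            'SELECT value FROM cache WHERE key = ?', (key,)).fetchone()
        if row is not None:
            hits += 1
            value = json.loads(row[0])
        else:
            misses += 1
            value = fetch(key)  # stand-in for the remote API call
            conn.execute('INSERT INTO cache VALUES (?, ?)',
                         (key, json.dumps(value)))
        results.append(value)
    conn.commit()
    return {'hits': hits, 'misses': misses, 'request': results}

def fake_api(key):
    """Hypothetical response; a real lookup would query Yelp or Google."""
    return {'place': 'example near ' + key}

conn = sqlite3.connect(':memory:')  # a file path would persist across sessions
first = cached_lookup(conn, ['45.512,-122.684'], fake_api)
second = cached_lookup(conn, ['45.512,-122.684'], fake_api)
print(first['misses'], second['hits'])
```

The first call misses and stores the result; the repeated call is served from the cache without touching the stand-in API.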

An enum is implemented within the gps module that allows easy selection of an API source. The possibilities are: 
1. `gps.ApiSource.YELP` 
2. `gps.ApiSource.GOOGLE`

With either, a value for `rankby` should also be supplied.  For Google, only prominence is currently implemented - `gps.GmapsRankBy.PROMINENCE`. A number of options are available for Yelp:
* `gps.YelpRankBy.BEST_MATCH`
* `gps.YelpRankBy.RATING`
* `gps.YelpRankBy.REVIEW_COUNT`
* `gps.YelpRankBy.DISTANCE`

#### `request_nearby_places(request)`

In [None]:
# take the first cluster row with cid 'x0' as a namedtuple
xone = next(clusters.loc[clusters.cid == 'x0'].itertuples())
request = gps.PlaceRequest(
    lat=xone.lat,
    lon=xone.lon, 
    radius=50, 
    source=gps.ApiSource.YELP,
    rankby=gps.YelpRankBy.BEST_MATCH,
)
results = gps.request_nearby_places(request)
results['request']

### Gathering metrics

#### `get_cluster_times(records, clusters)`
Calculates the daily entry and exit times for each visit a research participant makes to their defined clusters.

In [15]:
entries = gps.get_cluster_times(df)
entries.head()

Unnamed: 0,cid,date,tod,tod_bin,time_in,midpoint,time_out,duration,pos,n_points,lat,lon,local_tz
0,home,2015-06-01,00:36:00,,2015-06-01 00:02:00,2015-06-01 00:36:00,2015-06-01 01:10:00,0 days 01:08:00,2,43,45.50472,-122.7829,America/Los_Angeles
1,x0,2015-06-01,02:03:30,,2015-06-01 01:30:00,2015-06-01 02:03:30,2015-06-01 02:37:00,0 days 01:07:00,2,53,45.51877,-122.67947,America/Los_Angeles
2,xNot,2015-06-01,02:48:30,,2015-06-01 02:46:00,2015-06-01 02:48:30,2015-06-01 02:51:00,0 days 00:05:00,3,2,45.53233,-122.652173,America/Los_Angeles
3,x1,2015-06-01,03:29:00,,2015-06-01 02:58:00,2015-06-01 03:29:00,2015-06-01 04:00:00,0 days 01:02:00,4,53,45.54831,-122.65259,America/Los_Angeles
4,home,2015-06-01,13:42:00,,2015-06-01 04:17:00,2015-06-01 13:42:00,2015-06-01 23:07:00,0 days 18:50:00,5,635,45.50478,-122.78284,America/Los_Angeles
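
The core of such a table can be derived by collapsing consecutive records that share a `cid` into one visit. The sketch below is an illustration, not the gSCAP implementation: the function name `cluster_visits` is hypothetical, and it does not split visits at midnight or compute the `tod`, `pos`, or timezone columns.

```python
import pandas as pd

def cluster_visits(records):
    """Collapse consecutive records sharing a cid into one visit per run."""
    recs = records.sort_values('ts').reset_index(drop=True)
    # a new run starts whenever the cid differs from the previous row
    run = (recs['cid'] != recs['cid'].shift()).cumsum().rename('run')
    visits = recs.groupby(run).agg(
        cid=('cid', 'first'),
        time_in=('ts', 'min'),
        time_out=('ts', 'max'),
        n_points=('ts', 'size'),
    ).reset_index(drop=True)
    visits['duration'] = visits['time_out'] - visits['time_in']
    visits['midpoint'] = visits['time_in'] + visits['duration'] / 2
    return visits

sample = pd.DataFrame({
    'cid': ['home', 'home', 'x0', 'x0', 'home'],
    'ts': pd.to_datetime(['2015-06-01 00:00', '2015-06-01 00:10',
                          '2015-06-01 00:20', '2015-06-01 00:30',
                          '2015-06-01 00:40']),
})
visits = cluster_visits(sample)
print(visits)
```

Returning to `home` after `x0` yields a second `home` visit rather than extending the first, which matches the repeated `home` rows in the output above.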


#### `get_daily_metrics(records, entries)`
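Uses the records set and the cluster entries to compute daily features, such as location variance, hours spent at home and at work, and the time and distance covered in each movement category.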

In [16]:
daily_metrics = gps.get_daily_metrics(df)
daily_metrics.head()

Unnamed: 0,date,location_variance,hours_accounted_for,hours_stationary_nhw,came_to_work,number_of_clusters,hours_stationary,hours_walking,hours_active,hours_powered_vehicle,hours_high_speed_transportation,distance_stationary,distance_walking,distance_active,distance_powered_vehicle,distance_high_speed_transportation,hours_in_home,hours_in_work,hours_in_top3,hours_of_sleep
0,2015-06-01,-2.821,24,2.567,False,4,22.55,0.367,0.517,0.533,0.0,4311.3,1451.6,5323.5,31096.6,0.0,19.983,0.0,1.55,0.017
1,2015-06-02,-2.704,21,4.167,True,4,21.533,0.417,0.983,0.933,0.0,18110.0,1442.0,10688.3,48537.7,0.0,12.633,4.733,2.767,0.017
2,2015-06-03,-4.189,21,0.0,True,3,21.517,2.117,0.017,0.3,0.0,2825.9,8234.8,103.8,7559.5,0.0,18.9,2.617,0.0,0.167
3,2015-06-04,-2.825,24,4.033,True,5,22.517,0.567,0.5,0.4,0.0,4175.2,2226.6,5944.8,18912.2,0.0,13.5,4.983,2.45,0.033
4,2015-06-05,-9.256,24,0.0,False,1,23.983,0.0,0.0,0.0,0.0,2339.5,0.0,0.0,0.0,0.0,23.983,0.0,0.0,0.017
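
Some of these features can be approximated directly from the `date`, `binning`, and `time_delta` columns of the records. The sketch below sums `time_delta`, assumed to be in seconds, per day and movement bin. The function name is hypothetical, the `walking` label is inferred from the column names above, and gSCAP's own calculation may differ.

```python
import pandas as pd

def hours_by_binning(records):
    """Sum time_delta (seconds) per day and movement bin, in hours."""
    hours = (records.dropna(subset=['binning'])
                    .groupby(['date', 'binning'])['time_delta']
                    .sum()
                    .div(3600)
                    .unstack(fill_value=0.0))
    return hours.add_prefix('hours_')

# made-up records: two days, two movement bins
sample = pd.DataFrame({
    'date': ['2015-06-01', '2015-06-01', '2015-06-01', '2015-06-02'],
    'binning': ['stationary', 'stationary', 'walking', 'stationary'],
    'time_delta': [3600.0, 1800.0, 1800.0, 7200.0],
})
daily = hours_by_binning(sample)
print(daily)
```

Days on which a bin never occurs are filled with zero hours instead of being left missing.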


### Miscellaneous

`dd_from_zip(zipcode)`. This method allows you to get a latitude and longitude pair by providing a zipcode.

In [17]:
gps.dd_from_zip(47579)

(38.11094, -86.91513)

`zip_from_dd(lat, lon)` provides the opposite conversion, from lat, lon back to a zipcode.

In [18]:
gps.zip_from_dd(32.3788, -84.90685)

31905

Another useful method gets the timezone name, in the form Pandas accepts, from a lat, lon pair or a zipcode. This functionality is provided by `tz_from_dd` and `tz_from_zip`.

In [19]:
gps.tz_from_dd((32.3788, -84.90685))
gps.tz_from_zip(98115)

array(['America/New_York'], dtype=object)

array(['America/Los_Angeles'], dtype=object)

Often you want to process entire columns of lat, lon pairs from a dataframe at once.

In [20]:
gps.tz_from_dd(df[['lat', 'lon']])

array(['America/Los_Angeles', 'America/Los_Angeles',
       'America/Los_Angeles', ..., 'America/Los_Angeles',
       'America/Los_Angeles', 'America/Los_Angeles'], dtype=object)

This also works for a series of zipcodes.

In [21]:
gps.tz_from_zip([47579, 98115, 2345])

array(['America/Chicago', 'America/Los_Angeles', 'America/New_York'],
      dtype=object)
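
Timezone names like the ones returned above can be passed straight to Pandas to express timestamps in local time. The sketch below assumes the stored timestamps are naive UTC, which is an assumption about your data and not something gSCAP guarantees.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2015-06-01 00:01:00',
                               '2015-06-01 12:00:00']))

# assumption: the naive timestamps were recorded in UTC
local = ts.dt.tz_localize('UTC').dt.tz_convert('America/Los_Angeles')
print(local)
```

In June, Los Angeles is seven hours behind UTC, so the first reading falls on the evening of the previous day in local time.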

#### `geo_distance(lat1, lon1, lat2, lon2)` 
Computes the geographic distance in meters between two latitude, longitude pairs.

In [22]:
pair1 = (47.1, 122.1)
pair2 = (47.2, 122.2)

gps.geo_distance(*pair1, *pair2)

13447.27857197738
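
For readers curious how such a distance can be computed, the haversine formula is one common choice. The sketch below is an illustration and may not be what `geo_distance` does internally; the Earth radius of 6,371,000 m and the function name `haversine_m` are assumptions.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

d = haversine_m(47.1, 122.1, 47.2, 122.2)
print(round(d))
```

With the same inputs as the cell above, the sketch gives roughly 13.4 km, in line with the value shown.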

#### `geo_pairwise_distances(x, n_jobs)`
Accepts a list of coordinate pairs and computes the pairwise geographic distance between each pair. Set `n_jobs` to -1 to use all available cores.

In [23]:
x = [
    (47.111, 122), 
    (47.112, 122),
    (47.113, 122)
]

gps.geo_pairwise_distances(x, as_array=True, n_jobs=-1)

array([111.2, 111.2, 222.4])
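
A serial equivalent can be written with `itertools.combinations`. This is a sketch under the same assumptions as any haversine calculation (a spherical Earth with a radius of 6,371,000 m); the function names are hypothetical, and the order of the results follows `combinations`, which may differ from the order gSCAP returns.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def pairwise_distances_m(points):
    """Distance for every unordered pair of points, in combinations order."""
    return [haversine_m(*a, *b) for a, b in combinations(points, 2)]

x = [(47.111, 122), (47.112, 122), (47.113, 122)]
dists = pairwise_distances_m(x)
print([round(v, 1) for v in dists])
```

Three points give three unordered pairs; in general n points give n(n-1)/2 distances, which is why spreading the work across cores with `n_jobs` helps on large inputs.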