Geospatial Context Analysis Pipeline (gSCAP)

Full documentation can be found in these GitHub pages.

Installation and Configuration

The module is currently compatible with Python versions >= 3.6. It can be installed from the source repository using pip:

pip install git+https://github.com/UW-Creativ-Lab/gSCAP

Several methods within the module use external APIs to extract semantic context. A configuration file titled '.gscapConfig' must be present in the user's home directory, or be loaded explicitly via utils.load_config(file_path).
The file should list each API key on its own line as a name=value pair, as shown in the example below.

GooglePlacesAPI=YourKey
YelpAPI=YourKey
DarkSkyAPI=YourKey
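
If the file lives somewhere other than the home directory, it can be loaded explicitly with the loader mentioned above. A minimal sketch, assuming the loader is exposed as gscap.utils.load_config:

from gscap import utils   # import path assumed

# Point the package at a config file outside the home directory
utils.load_config('/path/to/.gscapConfig')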

More information for each API can be found in the providers' documentation (Google Places, Yelp, and Dark Sky).

Use Cases and Examples

The following sections show use cases and examples for the GPS and Weather modules. Demos have also been prepared as Jupyter notebooks and can be viewed in the demos directory.

GPS Data (demo)

The major use cases of this package provide clustering mechanisms for GPS data. For this we use a density-based approach to unsupervised clustering: DBSCAN. Density-based clustering algorithms are good candidates for GPS data, which is inherently noisy, with clusters forming wherever a participant is stationary. However, picking appropriate hyperparameters across a cohort of participants using a wide array of devices can be difficult. In our study, some devices undersampled and others oversampled, leading to inconsistent sampling and location densities across the cohort. To alleviate this, we over- or under-sample each stream depending on its native sampling frequency.
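
For intuition, a raw scikit-learn DBSCAN over latitude/longitude points looks like the sketch below. gSCAP wraps this step for you (see the Clustering section), and the eps/min_samples values here are purely illustrative.

import numpy as np
from sklearn.cluster import DBSCAN

coords = np.array([
    [45.5047, -122.783],   # two nearby points -> one dense cluster
    [45.5048, -122.783],
    [45.5116, -122.685],   # an isolated point -> labeled as noise
])
labels = DBSCAN(eps=0.001, min_samples=2).fit_predict(coords)
print(labels)   # [ 0  0 -1 ]; -1 marks noise, other integers are cluster ids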

Down-sample the stream to a predefined frequency

Some devices collect far more points than others, perhaps because of different location-services settings or because multiple applications are requesting coordinates. The following method resamples the stream, taking the mean of the points within each minute. Consider the following dataframe of points: too many points were collected within a single minute, which would likely lead to a higher-than-normal density at that location.

lat lon ts
45.5047 -122.783 2015-06-01 00:01:04.561000
45.5047 -122.783 2015-06-01 00:01:08
45.5047 -122.783 2015-06-01 00:01:10
45.5047 -122.783 2015-06-01 00:01:12
45.5047 -122.783 2015-06-01 00:01:15
45.5047 -122.783 2015-06-01 00:02:16.037000
45.5047 -122.783 2015-06-01 00:03:01.134000
45.5047 -122.783 2015-06-01 00:07:51.359000
45.5047 -122.783 2015-06-01 00:09:10.460000
45.5047 -122.783 2015-06-01 00:10:12.570000
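
For reference, a frame like the one above can be built directly with pandas. The column names (lat, lon, ts) follow the tables in this README; ts is built as a datetime column here so the per-minute resampling has something to bin on.

import pandas as pd

df = pd.DataFrame({
    'lat': [45.5047] * 5,
    'lon': [-122.783] * 5,
    'ts': pd.to_datetime([
        '2015-06-01 00:01:04.561',
        '2015-06-01 00:01:08',
        '2015-06-01 00:02:16.037',
        '2015-06-01 00:03:01.134',
        '2015-06-01 00:07:51.359',
    ]),
})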

To systematically reduce oversampling, resample the stream as in the following example:

df = gps.resample_gps_intervals(df)

The method returns a dataframe containing the mean latitude and longitude for each minute present in the input.

lat lon ts
45.5047 -122.783 2015-06-01 00:01:00
45.5047 -122.783 2015-06-01 00:02:00
45.5047 -122.783 2015-06-01 00:03:00
45.5047 -122.783 2015-06-01 00:07:00
45.5047 -122.783 2015-06-01 00:09:00

Up-sample the stream to a defined frequency

To adjust for undersampling, we can impute stationary coordinates between points. This accounts for collection methods that do not record samples when the participant has not changed location, has been indoors for many hours, or for any other reason the phone is not reporting fresh locations. Imputation frequencies are specified using the standard pandas offset aliases described in the Pandas Docs.

df = gps.impute_stationary_coordinates(df, freq='30S')

lat lon ts binning displacement time_delta velocity
45.50471 -122.78298 2015-06-01 00:01:00 null nan nan nan
45.50471 -122.78298 2015-06-01 00:01:30 stationary 0.00000 30.00000 0.00000
45.50473 -122.78290 2015-06-01 00:02:00 stationary 6.80000 30.00000 0.22600
45.50473 -122.78290 2015-06-01 00:02:30 stationary 0.00000 30.00000 0.00000
45.50471 -122.78284 2015-06-01 00:03:00 stationary 5.20000 30.00000 0.17300
45.50471 -122.78284 2015-06-01 00:03:30 stationary 0.00000 30.00000 0.00000
45.50471 -122.78284 2015-06-01 00:04:00 stationary 0.00000 30.00000 0.00000
45.50471 -122.78284 2015-06-01 00:04:30 stationary 0.00000 30.00000 0.00000
45.50471 -122.78284 2015-06-01 00:05:00 stationary 0.00000 30.00000 0.00000
45.50471 -122.78284 2015-06-01 00:05:30 stationary 0.00000 30.00000 0.00000

As shown above, the method computes displacement, time delta, and velocity, and estimates a velocity bin for each point. These features can also be computed directly, without imputation, via gps.process_velocities(df).
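
A minimal sketch of that direct path, assuming the module is importable as gscap.gps (import path not confirmed by this README):

from gscap import gps   # import path assumed

df = gps.resample_gps_intervals(df)   # optional down-sampling first
df = gps.process_velocities(df)       # adds displacement, time_delta, velocity, and the velocity bin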

Clustering

This method partitions the records into three sets - home, work, and everything else - and then performs separate clustering on each partition. Two objects are returned: the records with an additional column assigning cluster ids (cid), and a table of the clusters themselves.

This function applies the unsupervised clustering algorithm DBSCAN to the provided records. Parameters should be a dict containing both min_samples and eps; see the scikit-learn docs for more information. Default parameters are used if none are supplied, but they were tuned to generalize well on a specific dataset, so we recommend retuning them for the dataset being used.

df, clusters = gps.get_clusters_with_context(df)
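
To override the defaults, the DBSCAN hyperparameters can be supplied as a dict. The keyword name used below is an assumption; check the function signature in the API docs.

params = {'min_samples': 5, 'eps': 0.001}   # illustrative values; retune for your dataset
df, clusters = gps.get_clusters_with_context(df, params=params)   # 'params' keyword assumed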

The altered GPS records.

lat lon ts binning displacement time_delta velocity cid distance_from_home
45.5047 -122.783 2015-06-01 00:01:00 null nan nan nan xNot 8.96678
45.5047 -122.783 2015-06-01 00:02:00 stationary 6.8 60 0.113 home 4.51555
45.5047 -122.783 2015-06-01 00:03:00 stationary 5.2 60 0.086 home 8.61803
45.5047 -122.783 2015-06-01 00:07:00 stationary 0 240 0 home 8.61803
45.5047 -122.783 2015-06-01 00:09:00 stationary 5.9 120 0.049 home 8.89559

The table of clusters.

cid lat lon name categories
home 45.5048 -122.783 home home
work 45.5116 -122.685 work work
x0 45.5122 -122.684 nap nap
x1 45.5483 -122.651 nap nap

Here we see 'nap' in the name and categories columns of clusters. At this point, we haven't yet called Google Places or Yelp to get the semantic context of the cluster. To do this, we use the method shown in the following example.

# Select the cluster with cid 'x0' from the clusters table
xone = list(clusters.loc[clusters.cid == 'x0'].itertuples())[0]

# Build a request for places near the cluster centroid, within the given
# radius, ranked by Yelp's best-match ordering
request = gps.PlaceRequest(
    lat=xone.lat,
    lon=xone.lon,
    radius=50,
    source=gps.ApiSource.YELP,
    rankby=gps.YelpRankBy.BEST_MATCH,
)

# Query the API for nearby places
results = gps.request_nearby_places(request)

A transposed version of the results looks like:

cols vals
dtRetrieved 2019-03-06 12:30:37.329597
lat 45.51224
lon -122.68435
radius 50
source Yelp
name Basha's Mediterranean Cuisine
rank_order 0.0
categories mideastern, foodstands
major_categories dining_out

Weather lookups (demo)

The weather module allows you to look up an hourly or daily weather summary for a given coordinate pair (or zipcode) and datetime. See the demo for more information and example uses.

import datetime as dt
import pandas as pd
from gscap import weather   # import path assumed

# Truncate the current time to midnight so the report covers today's date
now = dt.datetime.now()
today = dt.datetime(year=now.year, month=now.month, day=now.day)

# Request the daily weather summary for the given (lat, lon, date) tuple
report = weather.weather_report((47.6062, 122.3321, today))
pd.DataFrame(report.get('report')).head()
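
A small follow-on sketch for pulling individual summary fields out of the daily report; the column names are taken from the transposed response below.

daily = pd.DataFrame(report.get('report'))
print(daily[['date', 'temp_mean', 'humidity_mean', 'precip_sum']])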

A transposed version of the response:

col val
cloud_cover_IQR 0.0
cloud_cover_mean 0.01
cloud_cover_median 0.0
cloud_cover_std 0.02
date 2019-03-06 00:00:00
dew_point_IQR 9.04
dew_point_mean 6.02
dew_point_median 6.04
dew_point_std 4.63
humidity_IQR 0.18
humidity_mean 0.34
humidity_median 0.35
humidity_std 0.11
lat 47.6062
lon 122.3321
precip_sum 0
temp_IQR 16.53
temp_mean 32.22
temp_med 29.61
temp_std 9.88
zipcode 4631

Issues and enhancements

For bugs or feature requests, please fill out the respective issue ticket.
