# Demo notebook for analysing location data

## Introduction

GPS location data contain rich information about people's behavioral and mobility patterns. However, working with such data is challenging because it is often noisy and has missing values. Designing relevant features that capture subjects' mobility patterns is also crucial. To address these problems, `niimpy` provides the following main functions to clean, downsample, and extract features from GPS location data:

- `niimpy.preprocessing.location.filter_location`: removes low-quality location data points
- `niimpy.util.aggregate`: downsamples data points to reduce noise
- `niimpy.preprocessing.location.extract_features_location`: extracts features from location data

In the following, we analyse a subset of the location data provided in the [StudentLife](https://studentlife.cs.dartmouth.edu/dataset.html) dataset.

## Read data

In [None]:
import niimpy
import niimpy.preprocessing.location as nilo

In [None]:
data = niimpy.read_csv(niimpy.sampledata.LOCATION_FILE, tz='et')
data.shape

In [None]:
data.head()

The necessary columns for further analysis are `double_latitude`, `double_longitude`, `double_speed`, and `user`. `user` refers to a unique identifier for a subject.
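
A quick sanity check along these lines can catch a missing column early. This is only a sketch: `REQUIRED_COLUMNS` and `missing_location_columns` are hypothetical names introduced here, not part of `niimpy`.

```python
# Column names taken from the text above; the helper itself is hypothetical.
REQUIRED_COLUMNS = {"double_latitude", "double_longitude", "double_speed", "user"}


def missing_location_columns(columns):
    """Return the required column names that are absent from `columns`."""
    return sorted(REQUIRED_COLUMNS - set(columns))


# Toy example: a table that lacks the speed column.
missing = missing_location_columns(["user", "double_latitude", "double_longitude"])
print(missing)
```

With the notebook's data, the same helper could be called as `missing_location_columns(data.columns)`.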

## Filter data

Three different methods for filtering low-quality data points are implemented in `niimpy`:

- `remove_disabled`: removes data points whose `disabled` column is `True`.
- `remove_network`: removes data points whose `provider` column is `network`. This method keeps only `gps`-derived data points.
- `remove_zeros`: removes data points close to the point \<lat=0, lon=0\>.

In the cell below, only `remove_zeros` is enabled. There are no data points near \<lat=0, lon=0\> in this dataset, so the dataset does not change after this step.

In [None]:
data = nilo.filter_location(data, remove_disabled=False, remove_network=False, remove_zeros=True)
data.shape
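
To make the `remove_zeros` idea concrete, here is a minimal pandas sketch of dropping points near the origin. It is not `niimpy`'s implementation: the helper name `drop_near_origin` and the tolerance `tol` (in degrees) are assumptions chosen for illustration.

```python
import pandas as pd


def drop_near_origin(df, tol=0.001):
    """Hypothetical helper: drop rows within `tol` degrees of <lat=0, lon=0>."""
    near_zero = (df["double_latitude"].abs() < tol) & (df["double_longitude"].abs() < tol)
    return df[~near_zero]


# Toy data: the middle row is a spurious (0, 0) fix.
toy = pd.DataFrame({
    "double_latitude": [43.7, 0.0, 43.8],
    "double_longitude": [-72.3, 0.0, -72.2],
})
filtered = drop_near_origin(toy)
print(filtered.shape)
```

Comparing the shape before and after the filter, as the notebook does with `data.shape`, is a simple way to see how many points were removed.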

## Downsample

Because GPS records are not always accurate and contain random errors, it is good practice to downsample or aggregate data points that are recorded close together in time. In other words, all the records in the same time window are aggregated to form one GPS record associated with that time window. A few parameters adjust the aggregation:

- `freq`: the length of the time window. This parameter follows the format of the `freq` parameter in the pandas `resample` function. For example, '5T' means 5-minute intervals.
- `method_numerical`: specifies how numerical columns should be aggregated. Options are 'mean', 'median', 'sum'.
- `method_categorical`: specifies how categorical columns should be aggregated. Options are 'first', 'mode' (most frequent), 'last'.

The aggregation is performed for each `user` (subject) separately.

In [None]:
binned_data = niimpy.util.aggregate(data, freq='5T', method_numerical='median')
binned_data = binned_data.reset_index(0).dropna()
binned_data.shape
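
The per-user binning can be pictured with plain pandas. The sketch below is an illustration under assumptions, not the code behind `niimpy.util.aggregate`: it uses a toy dataframe, groups with `pd.Grouper` instead of calling `resample`, and writes the window as `'5min'`, the spelling that recent pandas versions prefer over `'5T'`.

```python
import pandas as pd

# Toy data: four GPS fixes for one user, indexed by timestamp.
idx = pd.to_datetime([
    "2013-03-27 00:00:30",
    "2013-03-27 00:02:00",
    "2013-03-27 00:04:10",
    "2013-03-27 00:06:00",
])
toy = pd.DataFrame(
    {
        "user": ["u00"] * 4,
        "double_latitude": [43.70, 43.72, 43.71, 43.80],
        "double_longitude": [-72.30, -72.28, -72.29, -72.20],
    },
    index=idx,
)

# Group by user and by 5-minute window, then take the median of each column.
binned = toy.groupby(["user", pd.Grouper(freq="5min")]).median()
print(binned)
```

The first three fixes fall into the same 5-minute window and collapse into one record, while the fourth forms its own bin. Using the median makes each bin less sensitive to a single outlying fix.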

## Feature extraction

Here is the list of features `niimpy` extracts from location data:

1. Distance-based features (`niimpy.preprocessing.location.location_distance_features`):
 - `dist_total`: total distance a person traveled in meters.
 - `variance`, `log_variance`: variance is defined as the sum of the variances of latitudes and longitudes.
 - `speed_average`, `speed_variance`, and `speed_max`: statistics of speed (m/s). Speed, if not given, can be calculated by dividing the distance between two consecutive bins by their time difference.
 - `n_bins`: number of location bins that a user recorded in the dataset.

2. Significant Place related features (`niimpy.preprocessing.location.location_significant_place_features`):
 - `n_static`: number of static points. Static points are defined as bins whose speed is lower than a threshold.
 - `n_moving`: number of moving points. Equivalent to `n_bins - n_static`.
 - `n_home`: number of static bins which are close to the person's home. Home is defined as the place most visited at night. More formally, all the locations recorded between 12 AM and 6 AM are clustered, and the center of the largest cluster is assumed to be home.
 - `max_dist_home`: maximum distance from home.
 - `n_sps`: number of significant places. All of the static bins are clustered using the DBSCAN algorithm. Each cluster represents a Significant Place (SP) for a user.
 - `n_rare`: number of rarely visited places (referred to as outliers in DBSCAN).
 - `n_transitions`: number of transitions between significant places.
 - `n_top1`, `n_top2`, `n_top3`, `n_top4`, `n_top5`: number of bins in each of the top `N` clusters. In other words, `n_top1` shows the number of times the person has visited the most frequently visited place.
 - `entropy`, `normalized_entropy`: entropy of time spent in clusters. Normalized entropy is the entropy divided by the number of clusters.
 

In [None]:
import warnings
warnings.filterwarnings('ignore', category=RuntimeWarning)

# extract all the available features
all_features = nilo.extract_features_location(binned_data)
all_features
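
Several of the significant-place features reduce to simple counts over cluster labels. The sketch below works on a hypothetical, hand-written sequence of labels (one per static bin, in time order) rather than on DBSCAN output, and it assumes the natural logarithm for entropy. The normalization follows the definition given in the feature list; `niimpy`'s own computation may differ in details.

```python
import math
from collections import Counter

# Hypothetical cluster label for each static bin, in time order.
labels = ["home", "home", "campus", "campus", "campus", "gym", "home"]

counts = Counter(labels)
n_sps = len(counts)                             # number of distinct places
n_top1 = counts.most_common(1)[0][1]            # bins in the most visited place

# Entropy of the share of bins spent in each place (natural log assumed).
total = sum(counts.values())
entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
normalized_entropy = entropy / n_sps            # as defined in the feature list

# A transition is any change of label between consecutive bins.
n_transitions = sum(a != b for a, b in zip(labels, labels[1:]))

print(n_sps, n_top1, n_transitions, round(entropy, 3))
```

Low entropy indicates that most time is spent in one or two places, while higher entropy indicates time spread more evenly across places.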

In [None]:
# extract only distance-related features
distance_features = nilo.extract_features_location(
    binned_data,
    feature_functions=[nilo.location_distance_features])
distance_features
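
One common way to compute the distance between two GPS points is the haversine formula. The sketch below uses it to illustrate `dist_total` and a speed derived from consecutive bins. Whether `niimpy` uses this exact formula is an assumption, and `haversine_m`, `total_distance_m`, and `EARTH_RADIUS_M` are hypothetical names.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed constant)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def total_distance_m(points):
    """Sum of distances between consecutive (lat, lon) points."""
    return sum(haversine_m(*p, *q) for p, q in zip(points, points[1:]))


# Toy track: three consecutive 5-minute bins.
track = [(43.7000, -72.2900), (43.7010, -72.2890), (43.7030, -72.2870)]
dist_total = total_distance_m(track)

# Speed between the first two bins: distance divided by the time difference.
bin_seconds = 5 * 60
speed = haversine_m(*track[0], *track[1]) / bin_seconds
print(dist_total, speed)
```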

## Implementing your own features

If you want to implement a customized feature, you can do so by defining a function that accepts a dataframe and returns a dataframe or a series. The returned object should be indexed by `user`. Then, when calling `extract_features_location`, add the newly implemented function to the `feature_functions` argument. The default feature functions implemented in `niimpy` are listed in this variable:

In [None]:
nilo.ALL_FEATURE_FUNCTIONS

You can add your new function to the `nilo.ALL_FEATURE_FUNCTIONS` list and call `extract_features_location`. Alternatively, if you only want to extract your own feature, you can pass a list containing just that function, as shown here:

In [None]:
# Customized feature function. The function definition must accept **kwargs.
def max_speed(df, **kwargs):
    grouped = df.groupby('user')
    return grouped['double_speed'].max()

customized_features = nilo.extract_features_location(
    binned_data,
    feature_functions=[max_speed]
)
customized_features
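
A feature function can also return several columns at once by building a dataframe indexed by `user`. The following is a sketch on toy data: `speed_summary` and its column names are hypothetical, and the function is exercised directly with pandas rather than through `niimpy`.

```python
import pandas as pd


def speed_summary(df, **kwargs):
    """Hypothetical feature function returning two speed features per user."""
    grouped = df.groupby("user")["double_speed"]
    return pd.DataFrame({
        "speed_median": grouped.median(),
        "speed_range": grouped.max() - grouped.min(),
    })


# Toy data with two users.
toy = pd.DataFrame({
    "user": ["u00", "u00", "u01", "u01"],
    "double_speed": [0.0, 2.0, 1.0, 4.0],
})
summary = speed_summary(toy)
print(summary)
```

Because the result is indexed by `user` and the signature accepts `**kwargs`, a function of this shape should fit the `feature_functions` argument in the same way as `max_speed` above.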