# Stop location detection
This notebook runs the Infostop stop detection algorithm (https://github.com/ulfaslak/infostop) on the time-ordered location data of each user.
The goal is to identify the locations where a user stayed for at least a minimum amount of time.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [213]:
# Load every per-user file in /data/raw/sensible/location/byuser/ into a list of NumPy arrays
import os
data_dir = '/data/raw/sensible/location/byuser/'
data = []
for file in os.listdir(data_dir):
    data.append(np.load(os.path.join(data_dir, file)))

In [15]:
len(data)

NameError: name 'data' is not defined

In [215]:
# transform data to a dataframe
df = pd.DataFrame(np.concatenate(data), columns=['user', 'timestamp', 'latitude', 'longitude', 'accuracy'])

In [216]:
df

Unnamed: 0,user,timestamp,latitude,longitude,accuracy
0,153.0,1.386241e+09,55.783641,12.518414,30.689
1,153.0,1.386241e+09,55.783681,12.518372,28.185
2,153.0,1.386241e+09,55.783646,12.518382,30.841
3,153.0,1.386241e+09,55.783648,12.518408,21.368
4,153.0,1.386242e+09,55.783680,12.518435,21.000
...,...,...,...,...,...
245304792,99.0,1.406627e+09,55.740993,12.494146,20.000
245304793,99.0,1.406627e+09,55.740994,12.494182,24.000
245304794,99.0,1.406627e+09,55.740994,12.494161,15.364
245304795,99.0,1.406627e+09,55.740950,12.494317,25.065


In [6]:
# From each per-user array, take the columns at indices 1 to 3 (timestamp, latitude, longitude).
# Infostop expects a list of NumPy arrays with the columns ordered (lat, lon, time).
data2 = [d[:,1:4] for d in data]
# Shift the columns one position to the left so that the timestamp becomes the last column
data2 = [np.roll(d, -1, axis=1) for d in data2]
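
To make the column reordering concrete, here is a minimal sketch that applies the same slice and `np.roll` call to a small made-up array. The values and the name `toy` are illustrative only and are not taken from the dataset.

```python
import numpy as np

# Hypothetical rows in the raw column order:
# user, timestamp, latitude, longitude, accuracy
toy = np.array([
    [1.0, 1000.0, 55.70, 12.50, 20.0],
    [1.0, 1060.0, 55.71, 12.51, 25.0],
])

# Keep timestamp, latitude, longitude (column indices 1 to 3)
subset = toy[:, 1:4]

# Shift the columns one position to the left: (time, lat, lon) -> (lat, lon, time)
reordered = np.roll(subset, -1, axis=1)

print(reordered)
```

With `axis=1` the roll acts on columns only, so each row keeps its own values and only their order changes.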


Run the stop location algorithm with default parameters on all users.

The parameters are:
- r1 = 10: Max distance between time-consecutive points to label them as stationary.
- r2 = 10: Max distance between stationary points to form an edge.
- label_singleton = True: If True, give stationary locations that were only visited once their own label. If False, label them as non-stationary (-1).
- min_staying_time = 300: The shortest duration that can constitute a stop.
- max_time_between = 86400: The longest duration that can constitute a stop.
- min_size = 2: The minimum number of points in a cluster to be considered a stop.
- min_spacial_resolution = 0: The minimal difference allowed between points before they are considered the same point. Highly useful for spatially downsampling the data, which dramatically reduces runtime. Higher values yield more downsampling. For geo data, it is not recommended to increase it beyond 1e-4; 1e-5 typically works well and has little impact on results.
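
To give a feel for the scale of the `r1` threshold, the sketch below computes the great-circle (haversine) distance between the first two points of user 153 shown in the dataframe output earlier. It assumes that `r1` is expressed in meters and uses a mean Earth radius of 6,371 km. The helper `haversine_m` is written for this illustration and is not Infostop's internal implementation, which may differ in detail.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed value)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# First two rows of the dataframe output (user 153)
d = haversine_m(55.783641, 12.518414, 55.783681, 12.518372)

r1 = 10  # assumed to be in meters
is_within_r1 = d <= r1
print(f"distance: {d:.1f} m, within r1: {is_within_r1}")
```

Under these assumptions the two consecutive points lie only a few meters apart, so they would count as close enough to be candidates for a stationary label.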







In [2]:
from infostop import Infostop
model = Infostop(r1=10,
        r2=10,
        label_singleton=True,
        min_staying_time=300,
        max_time_between=86400,
        min_size=2,
        min_spacial_resolution=0,
        distance_metric="haversine",
        weighted=False,
        weight_exponent=1,
        verbose=False)


In [17]:
# Run Infostop on each array in data2 and collect the labels in a list; tqdm shows a progress bar
from tqdm import tqdm
labels = []
for d in tqdm(data2):
    labels.append(model.fit_predict(d))

100%|███████████████████████████████████████████████| 840/840 [3:41:02<00:00, 15.79s/it]
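
As a rough illustration of how one user's label array can be read, the following sketch summarizes a made-up array. It relies on the convention stated in the parameter description that non-stationary points are labeled -1. That stop locations carry non-negative integer labels is an assumption here, and `toy_labels` is hypothetical.

```python
import numpy as np

# Hypothetical labels for one user: one integer per input point,
# with -1 marking non-stationary points
toy_labels = np.array([-1, 0, 0, 0, -1, -1, 1, 1, 0, 0, -1])

# Points that belong to some stop location
stationary = toy_labels >= 0
share_stationary = stationary.mean()

# Distinct stop locations and the number of points assigned to each
stop_ids, counts = np.unique(toy_labels[stationary], return_counts=True)
points_per_stop = dict(zip(stop_ids.tolist(), counts.tolist()))

print(f"share of stationary points: {share_stationary:.2f}")
print(f"points per stop location: {points_per_stop}")
```

A summary like this per user is a cheap sanity check before merging the labels back into the full dataset.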


In [21]:
# merge the labels with the original data
data3 = []
for i in range(len(data)):
    data3.append(np.hstack((data[i], labels[i].reshape(-1,1))))
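
The merge step can be checked on a toy example. The sketch below appends a hypothetical label vector as an extra column, mirroring the `np.hstack` call above. All values are made up, and the explicit length check is an addition for illustration.

```python
import numpy as np

# Hypothetical per-user array: user, timestamp, latitude, longitude, accuracy
toy_data = np.array([
    [1.0, 1000.0, 55.70, 12.50, 20.0],
    [1.0, 1060.0, 55.70, 12.50, 25.0],
    [1.0, 5000.0, 55.80, 12.60, 30.0],
])
toy_labels = np.array([0, 0, -1])  # one label per row

# hstack needs exactly one label per data row
if len(toy_labels) != len(toy_data):
    raise ValueError("labels and data must have the same number of rows")

# reshape(-1, 1) turns the 1-D label vector into a single column
merged = np.hstack((toy_data, toy_labels.reshape(-1, 1)))

print(merged.shape)
print(merged[:, -1])
```

Note that NumPy arrays hold a single dtype, so the integer labels are stored as floats once they are stacked next to the float columns.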

In [31]:
# add labels to the dataframe
df['stoplocation'] = np.concatenate(labels)


In [32]:
# Remove the 'label' column if it is present (errors='ignore' avoids a KeyError when it is absent)
df = df.drop(columns=['label'], errors='ignore')


In [36]:
# save dataframe as pickle
df.to_pickle('../data/stoplocation.pkl')