This notebook explores the selection of different radii for the clustering of start and end points. This is a follow-up to `radius selection exploration unrolled` and aims to address the issue of clusters being too large and merging together smaller, unrelated clusters. 

Start-of-trip detection is a harder task than end-of-trip detection, so its error may be larger. By using separate radii for start and end clustering, we hope to make the end clusters tighter and produce better predictions. This notebook explores hardcoding values for the radius parameter, based on the distances between distinct locations in the real world. For reference, American city blocks tend to be ~100-200m long. The current radius of 500m was chosen to handle noise from start-of-trip detection. However, it is quite large and can, for example, end up clustering destinations on opposite sides of a highway.

### imports

In [None]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np

from uuid import UUID

###
# hack because jupyter notebook doesn't work properly through my vscode for
# some reason and therefore can't import stuff from emission?
# TODO: remove this before any pull requests
import sys

sys.path.append('/Users/hlu2/Documents/GitHub/e-mission-server/')
###

import emission.storage.timeseries.abstract_timeseries as esta
import emission.storage.decorations.trip_queries as esdtq
import emission.core.get_database as edb

import mapping

### load data

In [None]:
# to see the same outputs I described, put in the unique tokens for these users
email0 = "replace this"  # shankari
email1 = "replace this"  # tom
user0 = list(edb.get_uuid_db().find({"user_email": email0}))[0]['uuid']
user1 = list(edb.get_uuid_db().find({"user_email": email1}))[0]['uuid']
user2 = UUID('replace this')  # hannah


In [None]:
all_users = esta.TimeSeries.get_uuid_list()
user_list = np.append([user0, user1, user2, user0],
                      np.random.choice(all_users, size=10, replace=False))
confirmed_trip_df_map = {}
labeled_trip_df_map = {}
expanded_trip_df_map = {}
for i in range(len(user_list)):
    u = user_list[i]
    print(u)
    ts = esta.TimeSeries.get_time_series(u)
    ct_df = ts.get_data_df("analysis/confirmed_trip")

    # shuffle user0's data to see if the order in which oursim sees trips
    # affects the binning results
    if i == 3:
        ct_df = ct_df.sample(frac=1).reset_index(drop=True)
    confirmed_trip_df_map[i] = ct_df
    labeled_trip_df_map[i] = esdtq.filter_labeled_trips(ct_df)
    expanded_trip_df_map[i] = esdtq.expand_userinputs(labeled_trip_df_map[i])

In [None]:
# check trip counts
n_trips_df = pd.DataFrame([[
    user_list[i],
    len(confirmed_trip_df_map[i]),
    len(labeled_trip_df_map[i])
] for i in range(len(user_list))],
                          columns=["user_id", "all_trips", "labeled_trips"])
n_trips_df

In [None]:
# check trip counts by purpose for user0
expanded_trip_df_map[0].purpose_confirm.value_counts()

### playing with radius selection - end points only

Let's see the clusters of end points for user0 using DBSCAN at a variety of radii. 

The plot_clusters() function will plot points and clusters on a folium map, which will provide us with a sense of scale and allow us to intuitively check if a cluster looks 'good'. Points with the same purpose will have the same color (unless there are more purposes than available colors in folium, in which case some colors may be duplicated). Hovering over a point will also reveal the purpose in the tooltip. The clusters are visualized as convex hulls; their color doesn't mean anything right now, it's simply so we can distinguish between distinct clusters (which will be helpful when there are overlapping clusters). 
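The implementation of `mapping.plot_clusters()` is not shown in this notebook. Purely as an illustration of the convex-hull idea, the sketch below computes a hull in plain Python with Andrew's monotone chain algorithm. The function `convex_hull` and the sample points are hypothetical and are not part of the `mapping` module.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # drop the last vertex while it would make a clockwise or straight turn
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # the last vertex of each chain is the first vertex of the other
    return lower[:-1] + upper[:-1]


# hypothetical cluster: four corner points plus one interior point (metres)
cluster_points = [(0, 0), (40, 0), (40, 30), (0, 30), (20, 15)]
hull = convex_hull(cluster_points)
print(hull)
```

Only the outermost points end up as hull vertices, so a single far-away outlier in a cluster is enough to stretch the drawn polygon noticeably.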

In [None]:
fig = mapping.plot_clusters(expanded_trip_df_map[0],
                            alg='DBSCAN',
                            loc_type='end',
                            plot_unlabeled=True,
                            cluster_unlabeled=False,
                            radii=[50, 100, 150, 200, 500])
fig

With a 100m radius, DBSCAN is able to distinguish between the 'home', 'library', and 'shopping' clusters, which are all very close together (2-4 blocks apart). Once we increase the radius to 150m, the 'library' and 'shopping' clusters get combined.

One issue with DBSCAN is that as we get more and more data, we risk getting more and more noisy points (for example, a point right in between the library and shopping clusters). The noisy point may then form a 'dendrite'/bridge, causing two distinct locations to merge into a single cluster. If we use DBSCAN, perhaps a naive correction would be to vary the radius according to user dataset size? Alternatively, we should explore using user labels to inform cluster size. This is also why I'm interested in a clustering function with dynamic scanning radius, which would hopefully be able to identify sparse noise between two dense but nearby clusters.
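To make the 'dendrite' concern concrete, here is a minimal sketch using a tiny pure-Python DBSCAN written only for illustration. It is not the implementation behind `mapping.plot_clusters()`, and the coordinates (planar, in metres), `eps`, and `min_samples` values are made-up assumptions.

```python
import math


def dbscan(points, eps, min_samples):
    """Tiny O(n^2) DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    # each neighbourhood includes the point itself
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point, keep expanding
    return labels


# hypothetical 'library' and 'shopping' end points, roughly 180m apart
library = [(0, 0), (10, 0), (0, 10), (10, 10)]
shopping = [(180, 0), (190, 0), (180, 10), (190, 10)]
bridge = [(95, 5)]  # a single noisy point halfway between the two

n_without = len(set(dbscan(library + shopping, eps=100, min_samples=3)))
n_with = len(set(dbscan(library + shopping + bridge, eps=100, min_samples=3)))
print(n_without, n_with)
```

In this toy setup the two groups stay separate until the single in-between point is added, at which point that point becomes a core point and chains both groups into one cluster.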

Let's see how oursim performs:

In [None]:
fig = mapping.plot_clusters(expanded_trip_df_map[0],
                            alg='oursim',
                            loc_type='end',
                            radii=[100, 150, 200, 250])
fig

Even though oursim is mostly able to distinguish between 'home', 'library', and 'shopping' at 250m, we already see a concerning 'triangulation' issue: home, exercise, and library points form the vertices of a triangular cluster that groups unrelated but nearby locations together. I think this is partly due to the order in which oursim sees trips, so we should try shuffling the trips to see if the result changes. Regardless, it is still slightly worrying. The advantage of DBSCAN is that it can identify the dense cores of clusters, which would help to prevent this somewhat.

At 200m, this did not seem to be as much of a problem. ~~Also nice is that oursim is able to cluster the two related trips at a nearby college, which it was unable to do at 150m.~~ (*** Actually, it turns out that the trips to the college were to two separate locations. Foreshadowing some issues we will have with varying density levels...)

Rerunning oursim on shuffled data from user 0 to see if the triangulation still occurs.

In [None]:
fig = mapping.plot_clusters(expanded_trip_df_map[3],
                            alg='oursim',
                            loc_type='end',
                            radii=[100, 150, 200, 250])
fig

Ok. So the triangulation did not occur in the same place, but there are definitely weird cluster shapes that basically split up a distinct cluster and include far-away outliers. (Note that the results in the above cell won't appear the same every time since the trips are shuffled randomly.)
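The order dependence can be reproduced with a toy model. The sketch below assumes that oursim behaves like a greedy first-fit binning, in which a point joins the first bin whose members are all within the radius and otherwise starts a new bin. That rule is an assumption made for illustration; `greedy_bins` and the coordinates are hypothetical and this is not the actual oursim code.

```python
import math


def greedy_bins(points, radius):
    """First-fit binning: a point joins the first bin whose members are ALL
    within `radius` of it; otherwise it starts a new bin."""
    bins = []
    for p in points:
        for b in bins:
            if all(math.dist(p, q) <= radius for q in b):
                b.append(p)
                break
        else:
            # no existing bin accepted the point
            bins.append([p])
    return bins


# hypothetical end points along a street, in metres
a, b, c = (0, 0), (60, 0), (120, 0)

west_first = greedy_bins([a, b, c], radius=100)
east_first = greedy_bins([c, b, a], radius=100)

print(west_first)  # b is grouped with a
print(east_first)  # b is grouped with c
```

Under this rule no two points in a bin are ever more than one radius apart, but which neighbour the middle point ends up with depends entirely on which trip is seen first.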

Looking at the next user now:

In [None]:
fig = mapping.plot_clusters(expanded_trip_df_map[1],
                            alg='DBSCAN',
                            loc_type='end',
                            radii=[50, 100, 150, 200, 500])
fig

For user1 (who is from the same place as user0), DBSCAN is able to distinguish between the 3 home/library/shopping clusters that were within 2-4 blocks of each other, using a radius of 150m. (For the previous user, the max radius to produce distinct clusters was 100m.) Again, the worry is that additional data may produce noisy points that create 'dendrites' between distinct clusters.

In [None]:
fig = mapping.plot_clusters(expanded_trip_df_map[1],
                            alg='oursim',
                            loc_type='end',
                            radii=[100, 150, 200, 250, 300])
fig

Here, oursim is able to distinguish the nearby clusters of home, library, and store at a radius of 200m. At 250m, it's still mostly able to distinguish between the 3 clusters, but the shopping clusters at the top right start to get a little weird. At 300m it clumps distinctly unrelated clusters together.

Looking at user2:

In [None]:
fig = mapping.plot_clusters(expanded_trip_df_map[2],
                            alg='DBSCAN',
                            loc_type='end',
                            radii=[50, 100, 150, 200, 500])
fig

This user is a college student and has work, meal, and home clusters in very close proximity to each other due to the density of buildings on college campuses. While DBSCAN is able to distinguish between those at 50m, it is already merging clusters at 100m.

In [None]:
fig = mapping.plot_clusters(expanded_trip_df_map[2],
                            alg='oursim',
                            loc_type='end',
                            radii=[100, 150, 200, 250])
fig

With oursim, distinct clusters for school/work/meal/home are found at 100m. By 150m, the clustering is already starting to get wonky (inherent clusters being split into separate clusters with far-away outliers).

Looking at a random user:

In [None]:
fig = mapping.plot_clusters(expanded_trip_df_map[4],
                            alg='DBSCAN',
                            loc_type='end',
                            radii=[50, 100, 150, 200, 500])
fig

In [None]:
fig = mapping.plot_clusters(expanded_trip_df_map[4],
                            alg='oursim',
                            loc_type='end',
                            radii=[50, 100, 150, 200, 500])
fig

### playing with radius selection - start points only

In [None]:
fig = mapping.plot_clusters(expanded_trip_df_map[0],
                            alg='DBSCAN',
                            loc_type='start',
                            radii=[100, 150, 200, 250])
fig

DBSCAN identified distinct home/library/shopping departure clusters at 100m but started merging clusters at 150m.

In [None]:
fig = mapping.plot_clusters(expanded_trip_df_map[0],
                            alg='oursim',
                            loc_type='start',
                            radii=[100, 150, 200, 250])
fig

Oursim clusters are decently distinct at 200-250m. 

In [None]:
fig = mapping.plot_clusters(expanded_trip_df_map[1],
                            alg='DBSCAN',
                            loc_type='start',
                            radii=[100, 150, 200, 250])
fig

In [None]:
fig = mapping.plot_clusters(expanded_trip_df_map[1],
                            alg='oursim',
                            loc_type='start',
                            radii=[100, 150, 200, 250])
fig

In [None]:
fig = mapping.plot_clusters(expanded_trip_df_map[2],
                            alg='DBSCAN',
                            loc_type='start',
                            radii=[100, 150, 200, 250])
fig

DBSCAN has created a big fat cluster for school/meal/home already at 100m. Not looking great.

In [None]:
fig = mapping.plot_clusters(expanded_trip_df_map[2],
                            alg='oursim',
                            loc_type='start',
                            radii=[100, 150, 200, 250])
fig

Oursim found distinct school/meal/home clusters at 100m. The clusters start blurring by 150m, and some distinct clusters are fully merged by 200m.

Verdict: 
- oursim produces clusters with a constrained maximum diameter. If we use it, we should decrease the radius to 100-200m, or perhaps 300m if we want to build large clusters and then separate them further via SVM or other methods. One major issue is the problem of varying densities and varying scales. For instance, people who frequently visit college campuses may do different things in adjacent buildings. I imagine this problem will also surface when comparing people who live in the suburbs/exurbs with people who live in city downtowns (for example, in the suburbs someone may drive out to a big grocery store, whereas in the city they may just walk around the corner to a produce store, which may be located right next to a laundromat, the library, the park, etc.).
- However, oursim also sometimes splits the density core of a real-world cluster into multiple artificial clusters. DBSCAN doesn't have this issue as much because it finds these density cores. The main issue with DBSCAN is the dendrites. If we use sub-clustering techniques like SVM, we should use DBSCAN as the base clusterer rather than oursim so that the density cores stay intact.

TODO: graph the cluster count/request pct/homogeneity/etc as radius grows, then see if there is a region where cluster count/etc stabilizes, and use that to aid radius selection. Also, I would like to test out OPTICS, which is supposed to be slightly better at handling clusters of varying densities. 
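As a starting point for the radius sweep, here is a minimal sketch that covers only the cluster-count part of the TODO. It uses a stand-in clusterer that links any two points within the radius (equivalent to DBSCAN with `min_samples=1`), not oursim or the DBSCAN configuration used above. The function `count_clusters`, the points, and the radii are all hypothetical.

```python
import math


def count_clusters(points, radius):
    """Number of connected components when points within `radius` are linked
    (equivalent to DBSCAN with min_samples=1)."""
    parent = list(range(len(points)))

    def find(i):
        # union-find lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= radius:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})


# hypothetical end points in metres: three tight groups about 300m apart
points = [(0, 0), (20, 0), (0, 20),
          (300, 0), (320, 0), (300, 20),
          (0, 300), (20, 300), (0, 320)]

radii = [10, 25, 50, 100, 150, 200, 250, 300, 400]
counts = {r: count_clusters(points, r) for r in radii}
for r, n in counts.items():
    print(f"radius={r:>3}m  clusters={n}")
```

On this toy data the count stays flat across a wide band of radii before the groups merge, which is the kind of stable region the TODO proposes to look for. Real trip data with varying densities may not show such a clean plateau.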