# Evaluate the use of the similarity code for trip clustering

In this notebook, we are going to dig deeper into the use of the similarity code for trip clustering. The original goal of the similarity code was to find and display common trips to users in a user interface. We plan to use it to automatically label common trips.

The goals seem very similar, but are they? Or should our usage patterns for this new use case be subtly different?

This notebook is intended to be run against the participant-only version of the CanBikeCO Jan 31 minipilot dataset.
If you have the older version that includes data from non-participants as well, please replace

```
all_users = esta.TimeSeries.get_uuid_list()
```

with 

```
participant_uuid_obj = list(edb.get_profile_db().find({"install_group": "participant"}, {"user_id": 1, "_id": 0}))
all_users = [u["user_id"] for u in participant_uuid_obj]
```

### First, we read the data and extract the most common purpose labels

In [None]:
import pandas as pd
import numpy as np
import geojson as gj
import sklearn.cluster as sc
import sklearn.metrics.pairwise as smp

In [None]:
import json
import copy

In [None]:
import folium
import branca.element as bre

In [None]:
import matplotlib.pyplot as plt
import matplotlib.colors as pltc
import seaborn as sns

In [None]:
from IPython import display
from uuid import UUID

import bson.json_util as bju
import bson.objectid as boi

In [None]:
import emission.storage.timeseries.abstract_timeseries as esta
import emission.storage.decorations.trip_queries as esdtq
import emission.analysis.modelling.tour_model.similarity as eamts

In [None]:
import emission.core.wrapper.entry as ecwe
import emission.core.wrapper.confirmedtrip as ecwct

### Read data and setup variables

In [None]:
all_users = esta.TimeSeries.get_uuid_list()
confirmed_trip_df_map = {}
labeled_trip_df_map = {}
expanded_trip_df_map = {}
for u in all_users:
    ts = esta.TimeSeries.get_time_series(u)
    ct_df = ts.get_data_df("analysis/confirmed_trip")
    confirmed_trip_df_map[u] = ct_df
    labeled_trip_df_map[u] = esdtq.filter_labeled_trips(ct_df)
    expanded_trip_df_map[u] = esdtq.expand_userinputs(labeled_trip_df_map[u])

In [None]:
FINAL_RADIUS = 500
FINAL_POINT_DBSCAN = sc.DBSCAN(FINAL_RADIUS, min_samples=2, metric="precomputed")
FINAL_TRIP_DBSCAN = sc.DBSCAN(FINAL_RADIUS * 2, min_samples=2, metric="precomputed")

### Standard functions (currently copied over from other notebooks; should be refactored into a python file)

In [None]:
def get_loc_df(loc_series):
    loc_df = pd.DataFrame(loc_series.apply(lambda p: p["coordinates"]).to_list(), columns=["longitude", "latitude"])
    # display.display(loc_df.head())
    return loc_df

In [None]:
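def get_distance_matrix(loc_df):
    # Mean Earth radius in meters; haversine_distances returns angular distances in radians
    EARTH_RADIUS = 6371000
    # sklearn expects [latitude, longitude] pairs in radians
    radians_lat_lon = np.radians(loc_df[["latitude", "longitude"]])
    dist_matrix_meters = pd.DataFrame(smp.haversine_distances(radians_lat_lon, radians_lat_lon) * EARTH_RADIUS)
    return dist_matrix_meters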
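
To make the units explicit, here is a minimal standard-library sketch of the same computation for a handful of points. It assumes the `[longitude, latitude]` ordering of the GeoJSON coordinates used above and the same mean Earth radius; `haversine_m`, `distance_matrix_m` and the sample points are illustrative and are not part of the notebook or of sklearn.

```python
import math

EARTH_RADIUS_M = 6371000  # same mean Earth radius as get_distance_matrix

def haversine_m(p1, p2):
    """Great-circle distance in meters between two [longitude, latitude] points."""
    lon1, lat1 = map(math.radians, p1)
    lon2, lat2 = map(math.radians, p2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def distance_matrix_m(points):
    """Symmetric matrix of pairwise distances, usable with metric="precomputed"."""
    return [[haversine_m(p, q) for q in points] for p in points]

# Hypothetical points: the second is 0.001 degrees of latitude north of the first
points = [[-105.0, 39.0], [-105.0, 39.001], [-105.1, 39.0]]
matrix = distance_matrix_m(points)
print(round(matrix[0][1], 1))  # roughly 111 m, well inside a 500 m radius
```

Because the matrix is in meters, the `eps` passed to DBSCAN (`FINAL_RADIUS`) can be read directly as a distance in meters.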

In [None]:
def add_loc_clusters(user_id, modeling_support_objects, trip_df):
    user_trip_df = trip_df[trip_df.user_id == user_id]
    start_distance_matrix = get_distance_matrix(get_loc_df(user_trip_df.start_loc))
    end_distance_matrix = get_distance_matrix(get_loc_df(user_trip_df.end_loc))
    start_loc_model = copy.copy(FINAL_POINT_DBSCAN).fit(start_distance_matrix)
    end_loc_model = copy.copy(FINAL_POINT_DBSCAN).fit(end_distance_matrix)
    trip_df.loc[user_trip_df.index, "start_loc_cluster"] = start_loc_model.labels_
    trip_df.loc[user_trip_df.index, "end_loc_cluster"] = end_loc_model.labels_

    curr_model_support = modeling_support_objects.get(user_id)
    if curr_model_support is None:
        modeling_support_objects[user_id] = {}
        curr_model_support = modeling_support_objects[user_id]
    curr_model_support["start_distance_matrix"] = start_distance_matrix
    curr_model_support["end_distance_matrix"] = end_distance_matrix   
    curr_model_support["start_loc_model"] = start_loc_model
    curr_model_support["end_loc_model"] = end_loc_model

    return trip_df

In [None]:
def add_trip_clusters_dbscan(user_id, trip_df):
    user_trip_df = trip_df[trip_df.user_id == user_id]
    all_combos = user_trip_df.groupby(["start_loc_cluster", "end_loc_cluster"])
    valid_combos = [p for p in all_combos.groups if p[0] != -1 and p[1] != -1]
    print(f"After validating, all_combos {len(all_combos.groups)} -> {len(valid_combos)}")
    all_combos_dict = dict(all_combos.groups)
    valid_combos_series = pd.Series(valid_combos)
    for g, idxlist in all_combos_dict.items():
        print(g, idxlist)
        match = valid_combos_series[valid_combos_series == g]
        if len(match) == 0:
            print(f"invalid combo {g} found for entries {idxlist}, trip is not in a cluster")
            trip_df.loc[idxlist, "trip_cluster_dbscan"] = -1
        else:
            print(f"valid combo {g} found for entries {idxlist}, setting trip cluster to {match.index[0]}")
            trip_df.loc[idxlist, "trip_cluster_dbscan"] = int(match.index[0])
    return trip_df

### First, we pick a participant to work with

In [None]:
n_trips_df = pd.DataFrame([[u, len(confirmed_trip_df_map[u]), len(labeled_trip_df_map[u])] for u in all_users], columns=["user_id", "all_trips", "labeled_trips"]); n_trips_df

In [None]:
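# Pick the user closest to the median, since an exact match may not exist for an even number of users
median_user = n_trips_df.loc[(n_trips_df.labeled_trips - n_trips_df.labeled_trips.median()).abs().idxmin(), "user_id"]; median_user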

In [None]:
median_user_df = expanded_trip_df_map[median_user]

In [None]:
# add_trip_clusters_oursim(participant_df.index[0], all_expanded_df)
user_id = median_user
user_trip_df = median_user_df
user_trip_list = [ecwe.Entry({"data": ecwct.Confirmedtrip(tr), "_id": tr["_id"], "metadata": {"key": "analysis/confirmed_trip"}}) for tr in user_trip_df.to_dict("records")]
curr_sim = eamts.similarity(user_trip_list, FINAL_RADIUS)
curr_sim.filter_trips()
curr_sim.bin_data()

### Step 1: filtering too short trips

This is the first step of the binning, where we have not yet determined the cutoff and deleted the bins below it. However, we *have* filtered out trips that are too short. Let's examine if this filtering makes sense to us.

In [None]:
filtered_trip_df = pd.DataFrame([e["data"] for e in curr_sim.data])
removed_trips = user_trip_df[~user_trip_df._id.isin(filtered_trip_df._id)]
removed_trips.distance

Let's now characterise this in multiple ways.

In [None]:
removed_trips.distance.describe()

In [None]:
removed_trips.distance[removed_trips.distance > FINAL_RADIUS]; len(removed_trips.distance[removed_trips.distance > FINAL_RADIUS])

In [None]:
removed_trips.boxplot("distance")

### Filtering does not make sense

It turns out that this filtering does not make sense.

1. Even if the start and end are close to each other, the actual trip may not be very short because it may be a round trip.
2. For example, around half of the filtered trips actually have a distance greater than our filter radius.
3. The others may be "not a trip" aka fake trips. But those are still valuable! If the user has been labeling short trips to and from a particular location as "Not a trip", then automatically labeling those as "Not a trip" will still save the user a ton of time!
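
Point 1 can be illustrated with a small sketch. It assumes, as the discussion above suggests, that the filter compares the straight-line distance between start and end against the radius. The path uses made-up planar coordinates in meters, and `displacement` and `path_length` are hypothetical helpers, not part of the similarity code.

```python
import math

FINAL_RADIUS = 500  # meters, as configured earlier in this notebook

def displacement(path):
    """Straight-line distance between the first and last point of a path."""
    return math.dist(path[0], path[-1])

def path_length(path):
    """Distance actually travelled along the path."""
    return sum(math.dist(p, q) for p, q in zip(path, path[1:]))

# A made-up round trip: 2 km out, 1.9 km back, ending 100 m from where it started
round_trip = [(0, 0), (2000, 0), (100, 0)]

looks_too_short = displacement(round_trip) < FINAL_RADIUS
travelled = path_length(round_trip)
print(looks_too_short, travelled)  # True 3900.0
```

A trip like this would be dropped as "too short" even though the user travelled almost 4 km.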

### Step 2: binning

We now bin the trips to effectively create "clusters". Let us compare these clusters against the DBSCAN clusters.

In [None]:
# Create DBSCAN clusters
modeling_support_objects = {}
add_trip_clusters_dbscan(user_id, add_loc_clusters(user_id, modeling_support_objects, user_trip_df))

In [None]:
# Create oursim bin
# curr_sim.bin_data()
all_bins = curr_sim.bins
for b in all_bins:
    print(b)

In [None]:
len(curr_sim.data), len(curr_sim.filtered_data), len(curr_sim.all_data)

Let's try to convert the bins into labels to be consistent with the sklearn clustering algorithms.

**TODO: This should really go into the sim implementation**

Again, to be consistent with the sklearn configuration, we use a `min_samples` of 2, so trips that are in a bin of length 1 are noise. The others are in a labeled cluster.
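
The conversion described above can be sketched on toy data. The bins below are made up, and `bins_to_labels` is a hypothetical helper rather than part of the similarity module; it follows the sklearn convention of using -1 for noise.

```python
def bins_to_labels(bins, n_trips, min_samples=2):
    """Convert a list of bins (lists of trip indices) into sklearn-style labels.

    Trips in bins smaller than min_samples, or in no bin at all, are noise (-1).
    """
    labels = [-1] * n_trips
    for bin_id, trip_indices in enumerate(bins):
        if len(trip_indices) >= min_samples:
            for idx in trip_indices:
                labels[idx] = bin_id
    return labels

# Made-up bins over six trips: two real clusters and one singleton bin
toy_bins = [[0, 3, 4], [1, 5], [2]]
toy_labels = bins_to_labels(toy_bins, n_trips=6)
print(toy_labels)  # [0, 1, -1, 0, 0, 1]
```

The bin position is reused as the cluster label, so label numbers can have gaps wherever a singleton bin was skipped.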

In [None]:
removed_trips.index

In [None]:
len(removed_trips), len(filtered_trip_df), len(user_trip_df)

In [None]:
import itertools

In [None]:
flat_all_bins = list(itertools.chain(*all_bins)); flat_all_bins[:5]

In [None]:
len(flat_all_bins), max(flat_all_bins)

In [None]:
# Let's start by assuming that everything is noise, and set the ones that are not
# This is a bit tricky wrt indexing, since the indices of the trips in the bin are after filtering,
# so don't match up 1:1 with the indices in the trip dataframe
# since we create a new dataframe for the filtered trips, they should match up with the filtered dataframe
# but the index of the filtered dataframe is a new RangeIndex, so it doesn't work for indexing into the result series
# so we need to follow a two-step process as below
def get_result_labels(user_trip_df, filtered_trip_df):
    """
    user_trip_df: all trips that we are determining similarity for
    filtered_trip_df: trips retained after the "too_short" filter; the bin
    indices refer to rows of this dataframe. The bins are read from the global all_bins.
    """
    result_labels = pd.Series([-1] * len(user_trip_df))
    for i, curr_bin in enumerate(all_bins):
        if len(curr_bin) > 1:
            # get the trip ids of matching filtered trips for the current bin
            matching_filtered_trip_ids = filtered_trip_df.loc[curr_bin]._id
            # then, match by tripid to find the corresponding entries in the all_trips dataframe
            matching_all_trip_ids = user_trip_df[user_trip_df._id.isin(matching_filtered_trip_ids)].index
            result_labels.loc[matching_all_trip_ids] = i
            
    removed_trips = user_trip_df[~user_trip_df._id.isin(filtered_trip_df._id)]
    # For now, we also mark the "too short" labels with -2 to help with our understanding
    result_labels.loc[removed_trips.index] = -2
    return result_labels
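
The two-step lookup is easier to see on toy data. In this sketch, made-up string ids stand in for the trip `_id` values and plain lists stand in for the dataframes; `map_bin_to_full_positions` is a hypothetical helper.

```python
def map_bin_to_full_positions(curr_bin, filtered_ids, all_ids):
    """Translate positions in the filtered list into positions in the full list."""
    # Step 1: positions in the filtered list -> trip ids
    bin_ids = {filtered_ids[pos] for pos in curr_bin}
    # Step 2: trip ids -> positions in the full list
    return [pos for pos, trip_id in enumerate(all_ids) if trip_id in bin_ids]

all_ids = ["a", "b", "c", "d", "e"]   # every trip for the user
filtered_ids = ["a", "c", "e"]        # "b" and "d" were dropped as too short
curr_bin = [1, 2]                     # positions in the filtered list: "c" and "e"

full_positions = map_bin_to_full_positions(curr_bin, filtered_ids, all_ids)
print(full_positions)  # [2, 4]
```

Using the bin positions directly against the full list would have picked "b" and "c" instead, which is the mismatch the comments above warn about.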

In [None]:
user_trip_df[user_trip_df._id.isin(filtered_trip_df.loc[[15, 38, 45, 70, 98, 133]]._id)].index

In [None]:
result_labels = get_result_labels(user_trip_df, filtered_trip_df)

In [None]:
# Quick inline unit test
all_bins[3], result_labels.loc[all_bins[3]], result_labels[result_labels == 3]

In [None]:
np.count_nonzero(result_labels == -1), np.count_nonzero(result_labels == -2), np.count_nonzero(result_labels >= 0), len(result_labels)

In [None]:
user_trip_df["trip_cluster_oursim"] = get_result_labels(user_trip_df, filtered_trip_df)

In [None]:
# Moment of truth: which has more noise? The custom binning algo, which is not surprising since it is fairly naive
np.count_nonzero(user_trip_df.trip_cluster_dbscan == -1), np.count_nonzero(user_trip_df.trip_cluster_oursim == -1), np.count_nonzero(user_trip_df.trip_cluster_oursim < 0)

### Our similarity code marks more trips as noisy

This is not super surprising, since our implementation is a fairly naive one written by an undergrad over the summer, while DBSCAN is a (hopefully) more sophisticated and better performing algorithm from the literature.

At this point, between the too short filtering (unnecessary and somewhat erroneous), which removed 49 trips, and the noisier binning (67 vs. 31 noisy trips), we have removed 116 trips out of our original 208 (around 56%). In contrast, the DBSCAN implementation has removed only 31 / 208 = 15%. This is a huge difference in terms of ongoing modeling.
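
The arithmetic behind these percentages, using only the counts quoted above, can be checked as follows:

```python
total_trips = 208
too_short = 49      # removed by the "too short" filter
oursim_noise = 67   # marked as noise by the binning
dbscan_noise = 31   # marked as noise by DBSCAN

oursim_removed = too_short + oursim_noise
oursim_pct = 100 * oursim_removed / total_trips
dbscan_pct = 100 * dbscan_noise / total_trips
print(oursim_removed, round(oursim_pct, 1), round(dbscan_pct, 1))  # 116 55.8 14.9
```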

### Step 3: Determining the cutoff

Since we have already marked all bins of length 1 as noise, I don't anticipate this making much of a difference. But let's finish it anyway.

In [None]:
curr_sim.delete_bins()
above_cutoff_bins = curr_sim.bins

In [None]:
above_cutoff_bins

In [None]:
flat_all_bins = list(itertools.chain(*all_bins)); flat_all_bins[:5]

In [None]:
len(curr_sim.data), len(curr_sim.newdata), len(flat_all_bins)

In [None]:
# It looks like the trips were not re-indexed here, so let's just go ahead with generating labels in the same way 
user_trip_df["trip_cluster_oursim_above_cutoff"] = get_result_labels(user_trip_df, filtered_trip_df)

In [None]:
# As expected, for this user, there is no difference
np.count_nonzero(user_trip_df.trip_cluster_dbscan == -1), np.count_nonzero(user_trip_df.trip_cluster_oursim_above_cutoff == -1), np.count_nonzero(user_trip_df.trip_cluster_oursim_above_cutoff < 0)

### Step 4: Spot checking the noise

Let's pick a trip that is noise in oursim, find its cluster in DBSCAN and plot it

In [None]:
def get_geojson_for_trip_cluster(cluster_trips):
    print(len(cluster_trips))
    # One [start, end] coordinate pair per trip:
    # [[[X1, Y1], [X2, Y2]],
    #  [[X3, Y3], [X4, Y4]]]
    clistarray = cluster_trips[["start_loc", "end_loc"]].apply(
                    lambda se: [p["coordinates"] for p in se]).to_numpy().tolist()
    print([len(clist) for clist in clistarray])
    linestrings = [gj.LineString(coordinates=clist) for clist in clistarray]
    purpose_locs = gj.FeatureCollection(cluster_trips.start_loc.to_list() +
                                        cluster_trips.end_loc.to_list() +
                                        linestrings)
    return folium.features.GeoJson(purpose_locs)

In [None]:
extra_noise_in_oursim = user_trip_df.query("(trip_cluster_oursim == -1) and (trip_cluster_dbscan != -1)")
extra_clusters = extra_noise_in_oursim.trip_cluster_dbscan.unique(); extra_clusters

In [None]:
user_trip_df[user_trip_df.trip_cluster_dbscan == extra_clusters[3]]

In [None]:
x = user_trip_df[user_trip_df.trip_cluster_dbscan == extra_clusters[-3]]; x

### Most of the "non-noisy" clusters in the DBSCAN method have only one trip!!

Since we cluster the start and end locations separately (with min_samples = 2) but then just find the unique combinations of them without requiring min_samples = 2 for the pair (the trip), we can end up with a cluster of one trip and have it be non-noisy. This still seems principled - if both the start and the end are known places, even if there is only one trip between them, it is unlikely to represent an overlap with a different trip.

But this does mean that our original criterion for what is noisy is too strict.
On the other hand, step 3 would have filtered all those out as noisy anyway.
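
A toy example makes the effect concrete. The labels below are made up, and the pairing logic only mirrors the idea behind `add_trip_clusters_dbscan` with plain lists so that it stands alone; the cluster numbering may differ from what the pandas `groupby` produces.

```python
from collections import Counter

# Made-up location cluster labels for six trips (-1 means noise).
# Every location cluster has at least two members, as min_samples = 2 requires.
start_clusters = [0, 0, 1, 1, 0, -1]
end_clusters = [1, 1, 0, 1, 0, 0]

# A trip is clustered when both of its endpoints fall in a location cluster
valid_combos = []
trip_labels = []
for combo in zip(start_clusters, end_clusters):
    if -1 in combo:
        trip_labels.append(-1)
        continue
    if combo not in valid_combos:
        valid_combos.append(combo)
    trip_labels.append(valid_combos.index(combo))

cluster_sizes = Counter(label for label in trip_labels if label != -1)
print(trip_labels)          # [0, 0, 1, 2, 3, -1]
print(dict(cluster_sizes))  # {0: 2, 1: 1, 2: 1, 3: 1}
```

Three of the four trip clusters contain a single trip, even though no start or end location is noise for those trips.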

So I don't think it changes the results for this user, but we should change our implementation to check for other users. Since this is unrolled, let's preserve the original result and copy-paste the new implementation below.

In [None]:
# Run all the similarity code again
curr_sim = eamts.similarity(user_trip_list, FINAL_RADIUS)
curr_sim.filter_trips()
curr_sim.bin_data()
filtered_trip_df = pd.DataFrame([e["data"] for e in curr_sim.data])

In [None]:
curr_sim.data[0]["_id"], curr_sim.data[0]["data"]["_id"]

In [None]:
# Since this is an unrolled notebook, we redefine the function instead of modifying the previous implementation
def get_result_labels_new(user_trip_df, filtered_trip_df):
    """
    user_trip_df: all trips that we are determining similarity for
    filtered_trip_df: trips retained after the "too_short" filter.
    We need this because the model is a list of bins, and stores trip
    indices based on these filtered trips. But we want to set the results into the full user_trip_df.
    """
    result_labels = pd.Series([-1] * len(user_trip_df))
    for i, curr_bin in enumerate(curr_sim.bins):
        # get the trip ids of matching filtered trips for the current bin
        matching_filtered_trip_ids = filtered_trip_df.loc[curr_bin]._id
        # then, match by tripid to find the corresponding entries in the all_trips dataframe
        matching_all_trip_ids = user_trip_df[user_trip_df._id.isin(matching_filtered_trip_ids)].index
        result_labels.loc[matching_all_trip_ids] = i
            
    removed_trips = user_trip_df[~user_trip_df._id.isin(filtered_trip_df._id)]
    # For now, we also mark the "too short" labels with -2 to help with our understanding
    result_labels.loc[removed_trips.index] = -2
    return result_labels

In [None]:
user_trip_df["trip_cluster_oursim_single_trip_clusters"] = get_result_labels_new(user_trip_df, filtered_trip_df)

In [None]:
# At this point, there is no noise from oursim because every trip is in at least its own cluster (a cluster of one)
np.count_nonzero(user_trip_df.trip_cluster_dbscan == -1), np.count_nonzero(user_trip_df.trip_cluster_oursim_single_trip_clusters == -1), np.count_nonzero(user_trip_df.trip_cluster_oursim_single_trip_clusters < 0)

In [None]:
curr_sim.delete_bins()
above_cutoff_bins = curr_sim.bins

In [None]:
user_trip_df["trip_cluster_oursim_single_trip_clusters_above_cutoff"] = get_result_labels_new(user_trip_df, filtered_trip_df)

In [None]:
# After finding the cutoff
np.count_nonzero(user_trip_df.trip_cluster_dbscan == -1), np.count_nonzero(user_trip_df.trip_cluster_oursim_single_trip_clusters_above_cutoff == -1), np.count_nonzero(user_trip_df.trip_cluster_oursim_single_trip_clusters_above_cutoff < 0)

### Spot checking the noise again

In [None]:
extra_noise_in_oursim = user_trip_df.query("(trip_cluster_oursim_single_trip_clusters_above_cutoff == -1) and (trip_cluster_dbscan != -1)")
extra_clusters = extra_noise_in_oursim.trip_cluster_dbscan.unique(); extra_clusters

In [None]:
# Find which of these clusters has > 1 trip (if any)
two_trip_clusters = [c for c in extra_clusters if np.count_nonzero(user_trip_df.trip_cluster_dbscan == c) > 1]; two_trip_clusters

In [None]:
fig = bre.Figure()
for i, c in enumerate(two_trip_clusters):
    print(i,c)
    fig.add_subplot(3,2,i+1).add_child(folium.Map().add_child(get_geojson_for_trip_cluster(user_trip_df[user_trip_df.trip_cluster_dbscan == c])))
fig

Most of the mismatches are for really small clusters, but cluster 0 is a really big one. Let's see how the similarity code dealt with it by visualizing the outputs

In [None]:
user_trip_df[user_trip_df.trip_cluster_dbscan == 0].trip_cluster_oursim_single_trip_clusters_above_cutoff.value_counts()

In [None]:
fig = bre.Figure()
fig.add_subplot(1,3,1).add_child(folium.Map().add_child(get_geojson_for_trip_cluster(user_trip_df.query("trip_cluster_dbscan == 0 and trip_cluster_oursim_single_trip_clusters_above_cutoff >= 0"))))
fig.add_subplot(1,3,2).add_child(folium.Map().add_child(get_geojson_for_trip_cluster(user_trip_df.query("trip_cluster_dbscan == 0 and trip_cluster_oursim_single_trip_clusters_above_cutoff == -1"))))
fig.add_subplot(1,3,3).add_child(folium.Map().add_child(get_geojson_for_trip_cluster(user_trip_df.query("trip_cluster_dbscan == 0 and trip_cluster_oursim_single_trip_clusters_above_cutoff == -2"))))
fig

The DBSCAN algorithm is in fact more promiscuous than the classic similarity algorithm in finding clusters, but the effect is minimal. We find only a handful of cases, at least for this user (6 trips; 2%), where the similarity code did not find a cluster above the cutoff but DBSCAN found a cluster with more than one trip. Looking at them exhaustively, the trip quality is not that great.

Specifically, focusing on cluster 0, which had 34 trips in it, the similarity code found a 7-trip cluster, a single-trip cluster (below cutoff) and 25 "too short" trips. Looking at the 25 too short trips, most of them seem to be to a shopping center across the street from where the user lives and/or works. Would the user like to have these trips automatically labeled? You betcha!

## Conclusion: we need to make significant changes in the way the similarity code is used for auto-labeling


The similarity code was designed to surface common trips for user review and focus. It was not designed to automatically label trips. As we adapt it for this new use case, we need to make some simple changes to it. Note that the original goal was to surface the "big ticket items" among the trips; the new goal is to automatically handle the "small ticket items" so they don't bother the user.

Concretely:

- The filtering of short trips is unnecessary and sometimes incorrect, and it already removes 49 of the trips (about 24%). See the shopping center example above. While we may not need to surface these short shopping trips as part of the larger tour model graph as a significant part of the user's travel, it is precisely these trips that we should autolabel to reduce user burden.
- The filtering of infrequent trips (below cutoff) from the model does not help at all. We instituted a cutoff because showing all these infrequent trips to the user would increase their cognitive burden. But we are now not showing these infrequent trips to the user; we are building a computer model from them. The computer can handle such a cognitive load without any problems. If the user took one trip and then took the same trip again only a month later instead of every day, why would we not want to automatically label the trip? Again, because our goal is to take the cognitive burden off the user, it is better for the computer to remember the trip from a month ago and autolabel it.
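
The second point can be sketched with a toy label lookup. Everything here is hypothetical: the bin keys, the purposes and the `predict_purpose` helper are illustrative, and this is not how the similarity code or any eventual auto-labeling model is implemented.

```python
from collections import Counter

# Made-up history: (start cluster, end cluster) -> purposes the user chose
history = {
    (0, 1): ["work", "work", "work"],   # frequent trip
    (1, 2): ["shopping"],               # taken only once, a month ago
}

def predict_purpose(trip_bin, history, min_bin_size=1):
    """Return the most common past purpose for this bin, or None if unknown."""
    past = history.get(trip_bin, [])
    if not past or len(past) < min_bin_size:
        return None
    return Counter(past).most_common(1)[0][0]

# Keeping infrequent bins lets the model label the repeat of a rare trip
with_rare_bins = predict_purpose((1, 2), history, min_bin_size=1)
# Dropping bins below a cutoff of 2 leaves the same trip for the user to label
with_cutoff = predict_purpose((1, 2), history, min_bin_size=2)
print(with_rare_bins, with_cutoff)  # shopping None
```

In this toy setup the cutoff only ever removes predictions; it never improves the ones that remain, which is the intuition behind dropping it for auto-labeling.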

With these two changes, the results from this user indicate that the performance of the current similarity code can match or surpass the DBSCAN code. If these results hold against the multi-dataset, we can stick with the current similarity code and only make these changes to improve the overall results.