# Experimenting with bin radius

In this notebook, we are going to experiment with various values for the radius used to bin trips together. It is critical that we get bins that are large enough for appropriate modeling, without clumping together trips that are not actually related.

We are going to do this by focusing on purpose labels that are likely to be repeated, such as home and work. Prior work has indicated that these are repeated for the vast majority of users. We can then use the empirical data from that to determine the radius. We may even be able to come up with a data-driven method to determine the radius for an individual phone/context.

We focus on the purpose since it is most likely to have a 1:1 mapping with unique locations. The mode labels, in contrast, can easily be repeated across multiple trips whether or not the locations are unique.

This notebook is intended to be run on the **CanBikeCO dataset, participant only, until Jan 31**. It will probably work for other datasets as well - I have tried to avoid hardcoding values - but I have not tested against them.

### First, we read the data and extract the most common purpose labels

In [None]:
import pandas as pd
import geojson as gj
import sklearn
import scipy

In [None]:
import folium
import branca.element as bre

In [None]:
import matplotlib.pyplot as plt
import matplotlib.colors as pltc

In [None]:
from IPython import display

In [None]:
import emission.storage.timeseries.abstract_timeseries as esta
import emission.storage.decorations.trip_queries as esdtq

In [None]:
all_users = esta.TimeSeries.get_uuid_list()
confirmed_trip_df_map = {}
labeled_trip_df_map = {}
expanded_trip_df_map = {}
for u in all_users:
    ts = esta.TimeSeries.get_time_series(u)
    ct_df = ts.get_data_df("analysis/confirmed_trip")
    confirmed_trip_df_map[u] = ct_df
    labeled_trip_df_map[u] = esdtq.filter_labeled_trips(ct_df)
    expanded_trip_df_map[u] = esdtq.expand_userinputs(labeled_trip_df_map[u])

### Let's focus on a single user first, before expanding the analysis to all users

Let's pick a user with the median number of labeled trips, so we don't get super excited about the results

In [None]:
n_trips_df = pd.DataFrame([[u, len(confirmed_trip_df_map[u]), len(labeled_trip_df_map[u])] for u in all_users], columns=["user_id", "all_trips", "labeled_trips"]); n_trips_df

In [None]:
median_user = n_trips_df[n_trips_df.labeled_trips == n_trips_df.labeled_trips.median()].user_id.iloc[0]; median_user

So we end up with a user with ~ 200 confirmed and labeled trips, which is not too bad
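The exact-match selection in the cell above works when some user has exactly the median number of labeled trips. With an even number of users, the median may fall between two counts and match nobody. A minimal sketch of a more tolerant pick, assuming a hypothetical `toy_n_trips_df` with the same columns as `n_trips_df`, is to take the user whose count is closest to the median:

```python
import pandas as pd

# Hypothetical per-user counts; with an even number of users the median (25.0)
# does not match any actual row
toy_n_trips_df = pd.DataFrame({"user_id": ["a", "b", "c", "d"],
                               "labeled_trips": [10, 20, 30, 400]})

median_labeled = toy_n_trips_df.labeled_trips.median()

# Pick the row whose labeled trip count is closest to the median
# (idxmin returns the first such row when there is a tie)
closest_row = (toy_n_trips_df.labeled_trips - median_labeled).abs().idxmin()
toy_median_user = toy_n_trips_df.loc[closest_row, "user_id"]
print(median_labeled, toy_median_user)
```

This is only an illustration on made-up counts; the notebook itself keeps the exact-match selection.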

#### Visualizing three purposes, but not necessarily the top 3

In [None]:
median_user_df = expanded_trip_df_map[median_user]

In [None]:
median_user_df.purpose_confirm.value_counts()

In [None]:
def get_geojson_for_purpose(exp_df, purpose):
    purpose_trips = exp_df[exp_df.purpose_confirm == purpose]
    # purpose_locs = gj.FeatureCollection([gj.Feature(geometry=p, properties=p) for p in list(median_user_df.head().end_loc)])
    # return folium.features.GeoJson(purpose_locs, popup=folium.features.GeoJsonPopup(fields=["coordinates"]))
    purpose_locs = gj.FeatureCollection(purpose_trips.end_loc.to_list())
    return folium.features.GeoJson(purpose_locs)

This user does have "home" as the most common purpose, but "transit transfer" and "personal med" are above "work". Let's start with focusing on home.

In [None]:
fig = bre.Figure()
fig.add_subplot(1,3,1).add_child(folium.Map().add_child(get_geojson_for_purpose(median_user_df, "home")))
fig.add_subplot(1,3,2).add_child(folium.Map().add_child(get_geojson_for_purpose(median_user_df, "transit_transfer")))
fig.add_subplot(1,3,3).add_child(folium.Map().add_child(get_geojson_for_purpose(median_user_df, "work")))

The visualizations are interesting.
- Home seems to have a very clear clustering with a couple of outliers.
- Transit transfers are more spread out, although there are also some clear clusters, around Denver Union Station, for example.
- And while work is more spread out, there are also some clear clusters visible.

Let's see if we can plot this out in 2-D but not as a map

In [None]:
def get_loc_df_for_purpose(exp_df, purpose, loc_field="end_loc"):
    # Reuse the same function to get the loc_df
    purpose_trips = exp_df
    if purpose is not None:
        purpose_trips = exp_df[exp_df.purpose_confirm == purpose]
    end_loc_df = pd.DataFrame(purpose_trips[loc_field].apply(lambda p: p["coordinates"]).to_list(), columns=["longitude", "latitude"])
    # display.display(end_loc_df.head())
    return end_loc_df

In [None]:
fig = plt.Figure(figsize=(20,5))
ax = fig.add_subplot(1,3,1)
get_loc_df_for_purpose(median_user_df, "home").plot(kind="scatter", x="longitude", y="latitude", ax=ax)
ax = fig.add_subplot(1,3,2)
get_loc_df_for_purpose(median_user_df, "transit_transfer").plot(kind="scatter", x="longitude", y="latitude", ax=ax)
ax = fig.add_subplot(1,3,3)
get_loc_df_for_purpose(median_user_df, "work").plot(kind="scatter", x="longitude", y="latitude", ax=ax)
fig

#### Using these to get an estimate of the radius

It looks like we can use at least the "home" purpose to get an estimate of the radius to use. Let's calculate the distance matrix for each type of purpose separately and explore them.

In [None]:
home_df = median_user_df[median_user_df.purpose_confirm == "home"]
end_loc_df = pd.DataFrame(home_df.end_loc.apply(lambda p: p["coordinates"]).to_list(), columns=["longitude", "latitude"])

In [None]:
import sklearn
import numpy as np

In [None]:
import sklearn.metrics.pairwise as smp

In [None]:
def get_distance_matrix(loc_df):
    # Mean Earth radius in meters; haversine_distances returns distances in radians
    EARTH_RADIUS = 6371000
    radians_lat_lon = np.radians(loc_df[["latitude", "longitude"]])
    dist_matrix_meters = pd.DataFrame(smp.haversine_distances(radians_lat_lon, radians_lat_lon) * EARTH_RADIUS)
    return dist_matrix_meters
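As a quick sanity check of the haversine approach used above, here is a minimal sketch on two made-up points (the coordinates and the `toy_` names are assumptions for illustration, not values from the dataset). One degree of latitude is roughly 111 km, so two points that differ by 0.001 degrees of latitude should come out roughly 111 m apart.

```python
import numpy as np
import pandas as pd
import sklearn.metrics.pairwise as smp

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters

# Two hypothetical points 0.001 degrees of latitude apart (same longitude)
toy_loc_df = pd.DataFrame({"longitude": [-105.0, -105.0],
                           "latitude": [39.750, 39.751]})

# haversine_distances expects [latitude, longitude] pairs in radians
radians_lat_lon = np.radians(toy_loc_df[["latitude", "longitude"]].to_numpy())
toy_dist_m = smp.haversine_distances(radians_lat_lon) * EARTH_RADIUS_M

# Symmetric 2x2 matrix with zeros on the diagonal
print(toy_dist_m.round(1))
```

Note the column order: the location dataframes in this notebook are built as `["longitude", "latitude"]`, so the columns have to be reordered before the call.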

In [None]:
home_dist_matrix_meters = get_distance_matrix(end_loc_df)
home_dist_series_meters = pd.Series(np.ravel(home_dist_matrix_meters))

In [None]:
home_dist_series_meters.quantile([0.9,0.95,0.99])
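One thing to keep in mind when reading these quantiles: flattening the full matrix with `np.ravel` includes the zero diagonal and counts every pair twice. The sketch below, on a made-up distance matrix (all `toy_` names are hypothetical), shows how the quantiles change if we restrict ourselves to the unique pairs above the diagonal. It is an alternative to consider, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical symmetric distance matrix (meters) for four trip ends
toy_dist_matrix = pd.DataFrame([[0, 50, 60, 900],
                                [50, 0, 40, 880],
                                [60, 40, 0, 870],
                                [900, 880, 870, 0]], dtype=float)

# All n*n entries, as in the cell above
all_entries = pd.Series(np.ravel(toy_dist_matrix))

# Only the n*(n-1)/2 unique pairs above the diagonal
upper_i, upper_j = np.triu_indices(len(toy_dist_matrix), k=1)
unique_pairs = pd.Series(toy_dist_matrix.to_numpy()[upper_i, upper_j])

print(all_entries.quantile(0.5), unique_pairs.quantile(0.5))
```

With few points the diagonal zeros make a visible difference; with a few hundred trip ends the effect on the upper quantiles is likely to be small.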

In [None]:
import sklearn.cluster as sc

In [None]:
end_loc_df["90%"] = sc.DBSCAN(home_dist_series_meters.quantile(0.90), metric="precomputed").fit(home_dist_matrix_meters).labels_
end_loc_df["95%"] = sc.DBSCAN(home_dist_series_meters.quantile(0.95), metric="precomputed").fit(home_dist_matrix_meters).labels_
end_loc_df["99%"] = sc.DBSCAN(home_dist_series_meters.quantile(0.99), metric="precomputed").fit(home_dist_matrix_meters).labels_

In [None]:
end_loc_df["90%"].unique()

In [None]:
plt.get_cmap("tab20", 2).colors.tolist()

In [None]:
def get_colormap(labels):
    # If we have any outliers, noise etc
    if np.count_nonzero(labels.unique() == -1) > 0:
        # we have some noise
        # always put red first to make sure that outliers are red
        return pltc.ListedColormap(["red"] + plt.get_cmap("tab20", np.count_nonzero(labels.unique() != -1)).colors.tolist())
    else:
        # we don't have any noise, so no need to add red
        return plt.get_cmap("tab20", np.count_nonzero(labels.unique() != -1))

In [None]:
fig = plt.Figure(figsize=(20,5))
ax = fig.add_subplot(1,3,1)
# end_loc_df.plot(kind="scatter", x="longitude", y="latitude", color = end_loc_df["90%"].apply(lambda c: colors[c]), ax=ax, colorbar=False)
end_loc_df.plot(kind="scatter", x="longitude", y="latitude", c = "90%", cmap=get_colormap(end_loc_df["90%"]), ax=ax, colorbar=False)
ax = fig.add_subplot(1,3,2)
end_loc_df.plot(kind="scatter", x="longitude", y="latitude", c = "95%", cmap=get_colormap(end_loc_df["95%"]), ax=ax, colorbar=False)
ax = fig.add_subplot(1,3,3)
end_loc_df.plot(kind="scatter", x="longitude", y="latitude", c = "99%", cmap=get_colormap(end_loc_df["99%"]), ax=ax, colorbar=False)
fig

Based on this, the 90% radius of around 100 meters looks good. Let's see if that is generalizable to the two other categories

####  Generalizing to other purposes

In [None]:
def add_loc_clusters(loc_df, quantiles, radii, min_samples=5):
    print("in add_loc_clusters %d" % len(loc_df))
    dist_matrix_meters = get_distance_matrix(loc_df)
    dist_series_meters = pd.Series(np.ravel(dist_matrix_meters))
    if radii is None:
        assert quantiles is not None
        radii = dist_series_meters.quantile(quantiles)
    for r in radii:
        loc_df[f"{r}_m"] = pd.Categorical(sc.DBSCAN(r, metric="precomputed", min_samples=min_samples).fit(dist_matrix_meters).labels_)
    # Map to categorical variables so the plotting is easier
    # for r in radii:
    #    loc_df[str(r)] = loc_df[str(r)].apply(lambda l: "c%d"%l if l !=-1 else "N")
    return radii

In [None]:
fig = plt.Figure(figsize=(20,10))
quantiles = [0.9, 0.95, 0.99]
fig_index = 0
for p in ["home", "transit_transfer", "work"]: 
    p_loc_df = get_loc_df_for_purpose(median_user_df, p)
    radii = add_loc_clusters(p_loc_df, quantiles, None)
    # print(radii)
    for r in radii:
        fig_index = fig_index + 1
        ax = fig.add_subplot(3,3,fig_index)
        p_loc_df.plot(title="Radius = %.2f meters" % r, kind="scatter", x="longitude", y="latitude", c = f"{r}_m", cmap=get_colormap(p_loc_df[f"{r}_m"]), ax=ax)
fig

This is interesting as well, but the quantile values for the other purposes are so high that the resulting clusters are too big. For example, for the work purpose, all the trips are in the same cluster. We want the clusters to look like one dot in the visualization, even at the 0.9 quantile. Let's hardcode the evaluation radii and retry.

In [None]:
fig = plt.Figure(figsize=(20,10))
radii = [100, 300, 500]
fig_index = 0
for p in ["home", "transit_transfer", "work"]:
    p_loc_df = get_loc_df_for_purpose(median_user_df, p)
    radii = add_loc_clusters(p_loc_df, None, radii)
    # print(radii)
    for r in radii:
        fig_index = fig_index + 1
        ax = fig.add_subplot(3,3,fig_index)
        p_loc_df.plot(title="Radius = %.2f meters" % r, kind="scatter", x="longitude", y="latitude", c = f"{r}_m", cmap=get_colormap(p_loc_df[f"{r}_m"]), ax=ax)
fig

There is a max of three clusters above. All the green points are outliers. The other color points are actual clusters. 100 seems to be too tight - for "transit transfer" above, we can see that there are green points really close to both label 0 and label 1 that look like they are overlapping, but that are not in the same cluster. There doesn't seem to be a huge difference between 300 and 500, at least for these examples.

Note that because DBSCAN takes the maximum distance between **any two points** as the parameter, we can have some linear clusters (as for work) as opposed to circular clusters.
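The note above on linear clusters can be illustrated with a small synthetic example. The sketch assumes made-up trip ends spaced 40 m apart along a straight line (positions in meters, not real coordinates). With a radius of 100 m and the default `min_samples`, neighboring points chain together into one cluster even though the two ends are far more than 100 m apart.

```python
import numpy as np
import sklearn.cluster as sc

# Hypothetical trip ends along a straight road, 40 m apart (positions in meters)
positions_m = np.arange(0, 800, 40, dtype=float)   # 20 points, 0 .. 760

# Pairwise distances along the line, used as a precomputed distance matrix
chain_dist_matrix = np.abs(positions_m[:, None] - positions_m[None, :])

chain_labels = sc.DBSCAN(eps=100, metric="precomputed").fit(chain_dist_matrix).labels_

print(chain_labels)               # every point gets the same label
print(chain_dist_matrix.max())    # yet the two ends are 760 m apart
```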

#### Generalize to all purposes

In [None]:
median_user_df.purpose_confirm.value_counts()

In [None]:
median_user_df.purpose_confirm.unique()

In [None]:
fig = plt.Figure(figsize=(20,40))
radii = [300, 500, 700]
fig_index = 0
purpose_list = pd.Series(median_user_df.purpose_confirm.unique()).dropna()
for i, p in enumerate(purpose_list):
    p_loc_df = get_loc_df_for_purpose(median_user_df, p)
    radii = add_loc_clusters(p_loc_df, quantiles=None, radii=radii)
    # print(radii)
    for r in radii:
        fig_index = fig_index + 1
        ax = fig.add_subplot(len(purpose_list), 3,fig_index)
        p_loc_df.plot(title="%s: Radius = %.2f meters" % (p, r), kind="scatter", x="longitude", y="latitude", c = f"{r}_m", cmap=get_colormap(p_loc_df[f"{r}_m"]), ax=ax)
fig

The differences are slight, but based on this, 500 seems to be the best option overall.
- It is better than 300 in shopping
- It is better than 700 in work and exercise

I would be OK with 300 because it is fairly close

### Summarize and generalize for all users

Now, let's summarize and generalize for all users. The summary should make it easier for us to identify differences and look at them more carefully. We use the number of points in clusters and the number of outliers as our summary metrics.
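To make the summary metrics concrete, here is a minimal sketch on a made-up label array (the `toy_` names are hypothetical). It uses the same `np.count_nonzero` expressions as the summary functions below, where `-1` is the label DBSCAN assigns to noise.

```python
import numpy as np
import pandas as pd

# Hypothetical DBSCAN labels for eight trip ends; -1 marks noise
toy_labels = pd.Series([0, 0, 0, 1, 1, -1, -1, 0])

toy_summary = {
    "n_locs": len(toy_labels),
    # number of distinct cluster labels, ignoring noise
    "n_clusters": np.count_nonzero(toy_labels.unique() != -1),
    # number of points that landed in some cluster
    "n_valid": np.count_nonzero(toy_labels != -1),
    # number of points marked as noise
    "n_noise": np.count_nonzero(toy_labels == -1),
}
print(toy_summary)
```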

In [None]:
def get_cluster_summary(user_df, radii, min_samples=5):
    cluster_summary = []
    curr_user_id = user_df.user_id.iloc[0]
    print(f"About to get loc_df for user_df of length {len(user_df)}")
    purpose_list = pd.Series(user_df.purpose_confirm.unique()).dropna()
    for p in purpose_list:
        print(f"For {curr_user_id}, in get_cluster_summary, getting loc_df for {p}")
        p_loc_df = get_loc_df_for_purpose(user_df, p)
        radii = add_loc_clusters(p_loc_df, None, radii, min_samples)
        curr_entry = {"user_id": curr_user_id, "purpose": p, "n_locs": len(p_loc_df)}
        for r in radii:
            curr_col = p_loc_df[f"{r}_m"]
            # print(curr_col)
            curr_entry.update({"n_clusters_%s" % r: np.count_nonzero(curr_col.unique() != -1),
                                "n_valid_%s" % r: np.count_nonzero(curr_col != -1),
                                "n_noise_%s" % r: np.count_nonzero(curr_col == -1)})
        cluster_summary.append(curr_entry)
    return pd.DataFrame(cluster_summary)

In [None]:
# This is sufficiently different that we create a new function for it
# TODO: See if it is possible to unify both these functions
def get_cluster_summary_no_grouping(user_df, radii, min_samples=5):
    cluster_summary = []
    # We need to reset the index here because the user_df is actually filtered from a larger dataset
    # confirmed trip -> labeled trip
    # so there can be gaps in the index and can have entries beyond the length of the dataframe
    # while creating get_loc_df_for_purpose, we create a new dataframe from the end_loc
    # so this index has no gaps and is limited to the length of the dataframe
    # so when we try to select the loc entries based on the user_df index, we can fail
    # resetting the index allows us to ensure that the indices match up
    # not sure if we need to copy the dataframe first (changes to the view don't propagate backwards anyway)
    # but better to be safe than sorry
    user_df = user_df.copy()
    user_df.reset_index(inplace=True)
    curr_user_id = user_df.user_id.iloc[0]
    print(f"About to get loc_df for user_df {curr_user_id} of length {len(user_df)}")
    all_loc_df = get_loc_df_for_purpose(user_df, purpose=None)
    print(f"all_loc_df length {len(all_loc_df)}")
    radii = add_loc_clusters(all_loc_df, None, radii, min_samples)
    purpose_list = pd.Series(user_df.purpose_confirm.unique()).dropna()
    for p in purpose_list:
        # print(f"user_df {len(user_df)}, all_loc_df {len(all_loc_df)}")
        # print(f"Going to select index {user_df[user_df.purpose_confirm == p].index} of length {len(user_df[user_df.purpose_confirm == p])} from {all_loc_df.index}")
        # print(f"Before filtering, index {user_df.index}")
        p_loc_df = all_loc_df.loc[user_df[user_df.purpose_confirm == p].index]
        # print(f"For {curr_user_id}, in get_cluster_summary, getting loc_df for {p}")
        curr_entry = {"user_id": curr_user_id, "purpose": p, "n_locs": len(p_loc_df)}
        for r in radii:
            curr_col = p_loc_df[f"{r}_m"]
            # print(curr_col)
            curr_entry.update({"n_clusters_%s" % r: np.count_nonzero(curr_col.unique() != -1),
                              "n_valid_%s" % r: np.count_nonzero(curr_col != -1),
                               "n_noise_%s" % r: np.count_nonzero(curr_col == -1)})
        cluster_summary.append(curr_entry)
    print(f"Returning dataframe of length {len(cluster_summary)}, matching purpose length {len(purpose_list)}")
    return pd.DataFrame(cluster_summary)
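The index issue described in the comments of the function above can be reproduced on a tiny example. This is a sketch with made-up frames (all `toy_` names are hypothetical), and it uses `reset_index(drop=True)` for brevity, whereas the function above keeps the old index as a column.

```python
import pandas as pd

# Hypothetical labeled trips filtered from a larger frame, so the index has gaps
toy_user_df = pd.DataFrame({"purpose_confirm": ["home", "work", "home"]}, index=[0, 2, 5])

# A frame rebuilt from a plain list gets a fresh 0..n-1 index
toy_loc_df = pd.DataFrame({"longitude": [-105.0, -104.9, -105.0]})

home_index = toy_user_df[toy_user_df.purpose_confirm == "home"].index
try:
    toy_loc_df.loc[home_index]
    lookup_failed = False
except KeyError:
    lookup_failed = True   # label 5 does not exist in toy_loc_df

# After resetting, both frames share the same 0..n-1 labels
reset_user_df = toy_user_df.reset_index(drop=True)
home_locs = toy_loc_df.loc[reset_user_df[reset_user_df.purpose_confirm == "home"].index]
print(lookup_failed, list(home_locs.index))
```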

In [None]:
median_user_cluster_summary = get_cluster_summary(median_user_df, radii = [300, 500, 700])
median_user_cluster_summary

In [None]:
diff_median_user = median_user_cluster_summary.query("(n_valid_500 - n_valid_300) > 1 or (n_valid_700 - n_valid_500) > 1")
diff_median_user

In [None]:
fig = plt.Figure(figsize=(20,10))
radii = [300, 500, 700]
fig_index = 0
purpose_list = pd.Series(diff_median_user.purpose.unique()).dropna()
for i, p in enumerate(purpose_list):
    p_loc_df = get_loc_df_for_purpose(median_user_df, p)
    radii = add_loc_clusters(p_loc_df, None, radii)
    # print(radii)
    for r in radii:
        fig_index = fig_index + 1
        ax = fig.add_subplot(len(purpose_list), 3,fig_index)
        p_loc_df.plot(title="%s: Radius = %.2f meters" % (p, r), kind="scatter", x="longitude", y="latitude", c = f"{r}_m", cmap=get_colormap(p_loc_df[f"{r}_m"]), ax=ax)
fig

### Now for all users

In [None]:
all_user_diffs = []
for u in all_users:
    curr_user_df = expanded_trip_df_map[u]
    if len(curr_user_df) == 0:
        continue
    # print("Continuing with %s" % (len(curr_user_df)))
    curr_cluster_summary = get_cluster_summary(curr_user_df, [300,500,700])
    curr_user_diff = curr_cluster_summary.query("(n_valid_500 - n_valid_300) > 1 or (n_valid_700 - n_valid_500) > 1")
    all_user_diffs.append(curr_user_diff)

In [None]:
all_user_diffs_df = pd.concat(all_user_diffs)
all_user_diffs_df

In [None]:
# unique_users_with_diffs = pd.Series(all_user_diffs.user_id)
ax = all_user_diffs_df[["n_valid_300", "n_valid_500", "n_valid_700", "n_locs"]].plot(kind="bar", figsize=(20,10))
ax.set_xticklabels(all_user_diffs_df.user_id.apply(lambda u: str(u)[:2]+"_") + all_user_diffs_df.purpose)
ax

##### With separate axes so we can see the details

In [None]:
high_diff_df = all_user_diffs_df.query("(n_valid_500 - n_valid_300) > 5 or (n_valid_700 - n_valid_500) > 5")
fig = plt.Figure(figsize=(20,50))
radii = [300, 500, 700]
fig_index = 0
user_purpose_list = high_diff_df[["user_id", "purpose"]]
for i, row in user_purpose_list.iterrows():
    # print(row)
    p_loc_df = get_loc_df_for_purpose(expanded_trip_df_map[row.user_id], row.purpose)
    radii = add_loc_clusters(p_loc_df, None, radii)
    # print(radii)
    for r in radii:
        fig_index = fig_index + 1
        ax = fig.add_subplot(len(high_diff_df), 3,fig_index)
        p_loc_df.plot(title="%s: Radius = %.2f meters" % (row.purpose, r), kind="scatter", x="longitude", y="latitude", c = f"{r}_m", cmap=get_colormap(p_loc_df[f"{r}_m"]), ax=ax)
fig

##### With shared x and y, so we can compare them better

In [None]:
high_diff_df = all_user_diffs_df.query("(n_valid_500 - n_valid_300) > 5 or (n_valid_700 - n_valid_500) > 5")
fig = plt.Figure(figsize=(20,50))
axarr = fig.subplots(len(high_diff_df), 3)
radii = [300, 500, 700]
fig_index = 0
user_purpose_list = high_diff_df[["user_id", "purpose"]]
for (i, row), carr in zip(user_purpose_list.iterrows(),axarr):
    # print(row)
    p_loc_df = get_loc_df_for_purpose(expanded_trip_df_map[row.user_id], row.purpose)
    radii = add_loc_clusters(p_loc_df, None, radii)
    # print(radii)
    for r, ax in zip(radii, carr):
        fig_index = fig_index + 1
        # ax = fig.add_subplot(len(high_diff_df), 3,fig_index, sharex=figax[0], sharey=figax[0])
        p_loc_df.plot(title="%s: Radius = %.2f meters" % (row.purpose, r), kind="scatter", x="longitude", y="latitude", c = f"{r}_m", cmap=get_colormap(p_loc_df[f"{r}_m"]), ax=ax)
fig

This is hard to work with. The 700 meters is clearly better than 500 in terms of grouping, but that is only to be expected. Is that going to be too large? Let's **visualize on a map to confirm**.

Picking row 1, which seems to have quite a few changes.

In [None]:
to_disp_entry = user_purpose_list.iloc[1]
fig = bre.Figure()
fig.add_child(folium.Map().add_child(get_geojson_for_purpose(expanded_trip_df_map[to_disp_entry.user_id], to_disp_entry.purpose)))

Looking at three groups that changed between 500 and 700 meters:

#### cluster 4 from 700m (brown, top left)

#### cluster 8 from 700m (light blue, middle)

#### cluster 7 from 700m (yellow, bottom right)

In all of these, the blue oval indicates the cluster at 300m, while the red oval indicates the cluster at 700m.
Based on these, we should almost certainly not go to 700m - in all the maps here, the red ovals are way too large.
However, I am surprised that some of the points are not clustered. Concretely, the two points in the bottom right of the map seem like they should be clustered. There are similar clusters in other locations as well.

Let's look at the actual distances.

In [None]:
to_disp_entry = user_purpose_list.iloc[1]
radii = [500,700]
p_loc_df = get_loc_df_for_purpose(expanded_trip_df_map[to_disp_entry.user_id], to_disp_entry.purpose)
radii = add_loc_clusters(p_loc_df, None, radii)
p_loc_df

#### checking distances: cluster 4 from 700m (brown, top left)

In [None]:
to_disp_distance_matrix = get_distance_matrix(p_loc_df)

In [None]:
cluster_4_700_m = p_loc_df[p_loc_df["700_m"] == 4]
matching_500_m = p_loc_df.loc[cluster_4_700_m.index, "500_m"]
matching_500_m

In [None]:
unclustered_500_m = matching_500_m[matching_500_m == -1]
to_disp_distance_matrix.loc[unclustered_500_m.index, unclustered_500_m.index]

Although the third distance is > 300m, it is still < 500m, and the first two points are pretty close. Why didn't we get another cluster with just points 20 and 81? Why didn't point 93 get into the cluster? I guess the reason could be that the core sample was far enough away from it.

Let's get the distance matrix for all the points in the 700m cluster.

In [None]:
to_disp_distance_matrix.loc[cluster_4_700_m.index, cluster_4_700_m.index]

Based on that, the core sample is probably one of 87, 99, 101, 119 or 128, since those are > 500m from points that are not in the cluster.

Can we get the core samples from the DBSCAN model?

In [None]:
testmodel = sc.DBSCAN(500, metric="precomputed").fit(to_disp_distance_matrix.loc[cluster_4_700_m.index, cluster_4_700_m.index])
print(testmodel.core_sample_indices_)
to_disp_distance_matrix.loc[cluster_4_700_m.index, cluster_4_700_m.index].iloc[testmodel.core_sample_indices_.tolist(),]

In [None]:
testmodel = sc.DBSCAN(500, metric="precomputed", min_samples=2).fit(to_disp_distance_matrix.loc[cluster_4_700_m.index, cluster_4_700_m.index])
print(testmodel.core_sample_indices_)
to_disp_distance_matrix.loc[cluster_4_700_m.index, cluster_4_700_m.index].iloc[testmodel.core_sample_indices_.tolist(),]

In [None]:
def get_500m_clusters(c700m):
    cluster_700_m = p_loc_df[p_loc_df["700_m"] == c700m]
    return p_loc_df.loc[cluster_700_m.index]

In [None]:
saved_changed_clusters_700_m = [4,7,8]
get_500m_clusters(4)

In [None]:
fig = plt.Figure(figsize=(10,3))
axarr = fig.subplots(1,3)
for cm700m, ax in zip(saved_changed_clusters_700_m, axarr):
    get_500m_clusters(cm700m).plot.scatter(x="longitude", y="latitude", c="500_m", cmap=get_colormap(p_loc_df["500_m"]), ax=ax)
fig

### Important findings about DBSCAN

- we can get the list of core samples from the model (we can reuse this as the cluster centroid)
- ah!! the default parameters for DBSCAN have min_samples = 5, so we won't create clusters of size 2. Let's experiment with other values for it
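Both findings can be checked on a tiny made-up distance matrix (the matrix and variable names below are assumptions for illustration). Two points that are 50 m apart are treated as noise with the default `min_samples`, but form a cluster, and are both reported as core samples, once `min_samples=2`.

```python
import numpy as np
import sklearn.cluster as sc

# Hypothetical distances (meters): points 0 and 1 are 50 m apart, point 2 is far away
pair_dist_matrix = np.array([[0.0, 50.0, 5000.0],
                             [50.0, 0.0, 5000.0],
                             [5000.0, 5000.0, 0.0]])

# Default min_samples (5): a pair of points can never reach the threshold
default_model = sc.DBSCAN(eps=500, metric="precomputed").fit(pair_dist_matrix)

# min_samples=2: a point plus one neighbor within eps is enough
pair_model = sc.DBSCAN(eps=500, metric="precomputed", min_samples=2).fit(pair_dist_matrix)

print(default_model.labels_, default_model.core_sample_indices_)
print(pair_model.labels_, pair_model.core_sample_indices_)
```

Note that `min_samples` counts the point itself, so `min_samples=2` means "at least one other point within the radius".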

### Now for all users with min_samples = 2 (for comparison)

In [None]:
all_user_diffs = []
for u in all_users:
    curr_user_df = expanded_trip_df_map[u]
    if len(curr_user_df) == 0:
        continue
    # print("Continuing with %s" % (len(curr_user_df)))
    curr_cluster_summary = get_cluster_summary(curr_user_df, [300,500,700], min_samples=2)
    curr_user_diff = curr_cluster_summary.query("(n_valid_500 - n_valid_300) > 1 or (n_valid_700 - n_valid_500) > 1")
    all_user_diffs.append(curr_user_diff)

In [None]:
all_user_diffs_df = pd.concat(all_user_diffs)
all_user_diffs_df

In [None]:
# unique_users_with_diffs = pd.Series(all_user_diffs.user_id)
ax = all_user_diffs_df[["n_valid_300", "n_valid_500", "n_valid_700", "n_locs"]].plot(kind="bar", figsize=(20,10))
ax.set_xticklabels(all_user_diffs_df.user_id.apply(lambda u: str(u)[:2]+"_") + all_user_diffs_df.purpose)
ax

In [None]:
# We have so many clusters now that tab20 isn't going to cut it any more. Let's try something like Pastel1 instead
def get_colormap(labels):
    # If we have any outliers, noise etc
    if np.count_nonzero(labels.unique() == -1) > 0:
        # we have some noise
        # always put red first to make sure that outliers are red
        return pltc.ListedColormap(["red"] + plt.get_cmap("Pastel1", np.count_nonzero(labels.unique() != -1)).colors.tolist())
    else:
        # we don't have any noise, so no need to add red
        return plt.get_cmap("Pastel1", np.count_nonzero(labels.unique() != -1))

In [None]:
high_diff_df = all_user_diffs_df.query("(n_valid_500 - n_valid_300) > 5 or (n_valid_700 - n_valid_500) > 5")
fig = plt.Figure(figsize=(20,10))
radii = [300, 500, 700]
fig_index = 0
user_purpose_list = high_diff_df[["user_id", "purpose"]]
for i, row in user_purpose_list.iterrows():
    # print(row)
    p_loc_df = get_loc_df_for_purpose(expanded_trip_df_map[row.user_id], row.purpose)
    radii = add_loc_clusters(p_loc_df, None, radii, min_samples=2)
    # print(radii)
    for r in radii:
        fig_index = fig_index + 1
        ax = fig.add_subplot(len(high_diff_df), 3,fig_index)
        p_loc_df.plot(title="%s: Radius = %.2f meters" % (row.purpose, r), kind="scatter", x="longitude", y="latitude", c = f"{r}_m", cmap=get_colormap(p_loc_df[f"{r}_m"]), ax=ax)
fig

Checking the same points as before, we now see that way more points are in clusters. There are only two entries for which there is a large difference in the number of valid entries. Fortunately, our test case is still one of them. It's hard to see the clusters now because this categorical list doesn't have nice labels. But we can clearly see that there are now clusters in all the parts where there were outliers before.

- old cluster 4 from 700m:
  - 500m: the point (pink) and the line next to it (green), both of these seem to be reasonable from the map
  - 700m: all the points are in one cluster, which seems too large
- old cluster 7 from 700m:
  - 500m: two separate clusters, both of which seem reasonable from the map
  - 700m: they are merged, which seems bad
- old cluster 8 from 700m:
  - 500m: the obvious cluster is a cluster; the point below the cluster is not part of it
  - 700m: the point below is part of it, which seems bad


Overall, compared to the earlier run with the default `min_samples`:
- The % of valid trips has shot up significantly
- There are a lot more entries where there is a difference of 1 valid trip
- There are a lot fewer entries where there is a difference > 5 valid trips; only 2

# Final result of tuning: 500m radius

Also, if we choose to use DBSCAN, the parameters need to be:
- distance matrix (metric=precomputed)
- min_samples=2

Before we end, let's generate some quick graphs on the comparison between min_samples and the overall dataset quality for the selected values. If we want to plot the entire dataset, and not just the diff values, we probably want to use a scatter plot.

In [None]:
# Let's just get the cluster summary for all users and bung it into a giant dataframe
all_user_summaries = []
for u in all_users:
    curr_user_df = expanded_trip_df_map[u]
    if len(curr_user_df) == 0:
        continue
    # print("Continuing with %s" % (len(curr_user_df)))
    # dataframe mapping user id and purpose to n locs, n clusters, etc
    curr_cluster_summary = get_cluster_summary(curr_user_df, [300,500,700], min_samples=2)
    curr_cluster_summary["min_samples"] = 2
    # curr_cluster_summary = get_cluster_summary(curr_user_df, [300,500,700], min_samples=5)
    # curr_cluster_summary["min_samples"] = 5
    all_user_summaries.append(curr_cluster_summary)

In [None]:
len(all_user_summaries)

In [None]:
all_user_summary_df = pd.concat(all_user_summaries)
all_user_summary_df.reset_index(inplace=True)
all_user_summary_df["user_id_trunc"] = all_user_summary_df.user_id.apply(lambda u: str(u)[:2])
all_user_summary_df["pct_valid"] = all_user_summary_df.n_valid_500 / all_user_summary_df.n_locs
all_user_summary_df

### Validity per user and purpose

In [None]:
colors = dict(zip(all_user_summary_df["user_id_trunc"].unique(), plt.get_cmap("tab20", len(all_user_summary_df.user_id.unique())).colors))
# print(colors)
ax = all_user_summary_df.plot(kind="scatter", x="purpose", y="pct_valid", color = all_user_summary_df["user_id_trunc"].apply(lambda c: colors[c]), figsize=(20,6))
ax.set_xticklabels(range(len(all_user_summary_df)))
# ax.legend(colors.keys())
ax

In [None]:
all_user_summary_df.pct_valid.hist()

In [None]:
all_zero_df = all_user_summary_df[all_user_summary_df.pct_valid == 0]
all_zero_df[["user_id", "purpose", "n_locs", "n_clusters_500", "n_valid_500"]]

### Validity by user only

In [None]:
agg_user_summary_df = all_user_summary_df[["user_id", "n_locs", "n_valid_500"]].groupby("user_id").agg('sum')
agg_user_summary_df["pct_valid"] = agg_user_summary_df.n_valid_500 / agg_user_summary_df.n_locs
agg_user_summary_df.pct_valid

In [None]:
agg_user_summary_df.n_locs

In [None]:
agg_user_summary_df["user_id_trunc"] = [str(u)[:2] for u in agg_user_summary_df.index]
ax = agg_user_summary_df.pct_valid.plot(kind="bar")
ax.set_xticklabels(agg_user_summary_df.user_id_trunc.to_list())
ax

This means that, even with the current, somewhat wonky dataset, more than half of the trip ends can fit into some cluster, which means that if a new trip comes in, we should be able to match at least for the purpose. Of course, for the mode, etc, we really need both ends to match, but that is a separate notebook.

# Checking final results without separating by purpose

So far, we have been grouping by the purpose before clustering. We can still do this for model building, but we won't necessarily be able to do this for model prediction because we won't (by definition) have a purpose then. Does binning by the purpose matter? Will we get dramatically different results if we don't bin by purpose first?

In [None]:
# Let's just get the cluster summary for all users and bung it into a giant dataframe
all_user_summaries = []
for u in all_users:
    curr_user_df = expanded_trip_df_map[u]
    if len(curr_user_df) == 0:
        continue
    # print("Continuing with %s" % (len(curr_user_df)))
    # dataframe mapping user id and purpose to n locs, n clusters, etc
    # display.display(curr_user_df.head())
    curr_cluster_summary = get_cluster_summary_no_grouping(curr_user_df, [300,500,700], min_samples=2)
    curr_cluster_summary["min_samples"] = 2
    # curr_cluster_summary = get_cluster_summary(curr_user_df, [300,500,700], min_samples=5)
    # curr_cluster_summary["min_samples"] = 5
    all_user_summaries.append(curr_cluster_summary)

In [None]:
all_user_summary_df = pd.concat(all_user_summaries)
all_user_summary_df.reset_index(inplace=True)
all_user_summary_df["user_id_trunc"] = all_user_summary_df.user_id.apply(lambda u: str(u)[:2])
all_user_summary_df["pct_valid"] = all_user_summary_df.n_valid_500 / all_user_summary_df.n_locs
all_user_summary_df

In [None]:
colors = dict(zip(all_user_summary_df["user_id_trunc"].unique(), plt.get_cmap("tab20", len(all_user_summary_df.user_id.unique())).colors))
# print(colors)
ax = all_user_summary_df.plot(kind="scatter", x="purpose", y="pct_valid", color = all_user_summary_df["user_id_trunc"].apply(lambda c: colors[c]), figsize=(20,6))
ax.set_xticklabels(range(len(all_user_summary_df)))
# ax.legend(colors.keys())
ax

In [None]:
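# Select only the numeric columns we need, as in the per-purpose version above, so that string columns such as purpose are not summed
agg_user_summary_df = all_user_summary_df[["user_id", "n_locs", "n_valid_500"]].groupby("user_id").agg('sum')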
agg_user_summary_df["pct_valid"] = agg_user_summary_df.n_valid_500 / agg_user_summary_df.n_locs
agg_user_summary_df.pct_valid

In [None]:
agg_user_summary_df["user_id_trunc"] = [str(u)[:2] for u in agg_user_summary_df.index]
ax = agg_user_summary_df.pct_valid.plot(kind="bar")
ax.set_xticklabels(agg_user_summary_df.user_id_trunc.to_list())
ax

Aha! Clustering without separating by purpose actually gives significantly better results. Almost everybody has almost all their trips in clusters.

There are ~ 6 (user-cluster pairs) with no mappings. This result indicates that there are multiple labels in each cluster. So when we first group by the purpose, there are end locations that don't have any nearby end locations. But when we first cluster, we do have nearby end locations, just labeled with a different purpose. This can either be a genuine problem, or an issue with incorrect labels in the dataset.
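One way to check the "multiple labels in each cluster" interpretation is to count the distinct purposes inside each cluster. The sketch below does this on a made-up frame; the `cluster` column and the `toy_clustered_df` name are assumptions for illustration, standing in for a purpose-agnostic DBSCAN label joined back to the trips.

```python
import pandas as pd

# Hypothetical trip ends: cluster label from a purpose-agnostic DBSCAN run,
# plus the user-confirmed purpose for each trip
toy_clustered_df = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, -1],
    "purpose_confirm": ["home", "home", "pick_drop", "work", "meal", "shopping"],
})

# Ignore noise, then count distinct purposes within each cluster
in_cluster = toy_clustered_df[toy_clustered_df.cluster != -1]
purposes_per_cluster = in_cluster.groupby("cluster").purpose_confirm.nunique()
print(purposes_per_cluster)
```

Clusters with a count above 1 are the ones where grouping by purpose first would have split nearby end locations apart.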

Note also that the total points (`n_locs`) seem to be different in the log statements and in the sum of the locations counted - e.g. for `e7b24d99-324d-4d6d-b247-9edc87d3c848`, the printed length is 81 but there are only 77 entries when we sum up the dataframe (both ways!).

```
About to get loc_df for user_df 576e37c7-ab7e-4c03-add7-02486bc3f42e of length 226
About to get loc_df for user_df 8b563348-52b3-4e3e-b046-a0aaf4fcea15 of length 1
About to get loc_df for user_df 5079bb93-c9cf-46d7-a643-dfc86bb05605 of length 212
About to get loc_df for user_df feabfccd-dd6c-4e8e-8517-9d7177042483 of length 217
About to get loc_df for user_df 113aef67-400e-4e21-a29f-d04e50fc42ea of length 33
About to get loc_df for user_df c8b9fe22-86f8-449a-b64f-c18a8d20eefc of length 163
About to get loc_df for user_df e7b24d99-324d-4d6d-b247-9edc87d3c848 of length 81
About to get loc_df for user_df 1044195f-af9e-43d4-9407-60594e5e9938 of length 380
About to get loc_df for user_df 898b1a5e-cdd4-4a0c-90e4-942fa298e456 of length 264
About to get loc_df for user_df 1d292b85-c549-409a-a10d-746e957582a0 of length 208
About to get loc_df for user_df cb3222a7-1e72-4a92-8b7b-2c4795402497 of length 22
About to get loc_df for user_df 960835ac-9d8a-421d-8b8a-bf816f8a4b92 of length 618
```

In [None]:
agg_user_summary_df.n_locs

In [None]:
(all_user_summary_df[all_user_summary_df.user_id == agg_user_summary_df.index[-2]][["purpose", "n_locs"]].n_locs.sum(),
expanded_trip_df_map[agg_user_summary_df.index[-2]].groupby(by="purpose_confirm").agg('count').source.sum(),
len(expanded_trip_df_map[agg_user_summary_df.index[-2]]))

Aha! There are 81 entries in total, but grouping by purpose and summing up gets us to 77. What is the list of purpose entries for this user? Can that give us a clue? If that doesn't work, we can also create two dataframes and generate a set diff to quickly determine them.

In [None]:
expanded_trip_df_map[agg_user_summary_df.index[-2]].purpose_confirm.unique()

In [None]:
expanded_trip_df_map[agg_user_summary_df.index[-2]].purpose_confirm

In [None]:
expanded_trip_df_map[agg_user_summary_df.index[-2]].purpose_confirm.dropna()

OK! So this is in fact because of nan entries for the purpose.
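The mismatch is easy to reproduce on a small made-up frame (the `toy_trips_df` name and its contents are hypothetical). By default, pandas `groupby` drops rows whose key is NaN, so the grouped counts add up to fewer rows than `len()` reports; passing `dropna=False` (available in recent pandas versions) keeps them as their own group.

```python
import numpy as np
import pandas as pd

# Hypothetical trips, one of which has no confirmed purpose
toy_trips_df = pd.DataFrame({
    "purpose_confirm": ["home", "work", np.nan, "home", "work"],
    "source": ["s"] * 5,
})

total_rows = len(toy_trips_df)

# Default groupby silently drops the row with a NaN purpose
grouped_rows = toy_trips_df.groupby("purpose_confirm").source.count().sum()

# dropna=False keeps the NaN purpose as a separate group
grouped_with_nan = toy_trips_df.groupby("purpose_confirm", dropna=False).source.count().sum()
print(total_rows, grouped_rows, grouped_with_nan)
```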

### Next steps:

- Let's run this same code against a couple of other datasets, including the staging environment.
- We also need to see how clustering based on both the start and end locations will work

Will probably do that in a separate notebook, generated by simplifying the code in here, and potentially pulling out into separate functions.