This notebook investigates how well our prediction confidence matches reality. 

In [None]:
import pandas as pd
import numpy as np
import logging

from data_wrangling import expand_df_dict, expand_df_list, add_top_pred
import emission.storage.timeseries.abstract_timeseries as esta
import emission.analysis.modelling.tour_model_first_only.data_preprocessing as pp

In [None]:
all_users = esta.TimeSeries.get_uuid_list()

In [None]:
# set up dataframe with our desired columns

excluded_user_count = 0
df_list = []

for user in all_users:
    trips = pp.read_data(user)
    trip_df = pd.DataFrame(trips)
    trip_df = expand_df_dict(trip_df, 'data')
    if 'inferred_labels' in trip_df.columns:
        trip_df = trip_df.drop(columns=['source', 'end_ts', 'end_local_dt', 'raw_trip', 'start_ts', 
                                        'start_local_dt', 'start_place', 'end_place', 'cleaned_trip'])
        trip_df = expand_df_dict(trip_df, 'user_input').rename(columns={'mode_confirm':'mode_true', 
                                                                        'purpose_confirm':'purpose_true', 
                                                                        'replaced_mode':'replaced_true'})
        try:
            trip_df = expand_df_list(trip_df, 'inferred_labels')
            trip_df = expand_df_dict(trip_df, 'inferred_labels')
            trip_df = expand_df_dict(trip_df, 'labels').rename(columns={'mode_confirm':'mode_pred', 
                                                                        'purpose_confirm':'purpose_pred',
                                                                        'replaced_mode':'replaced_pred'})
        except Exception as e:
            logging.debug('{} excluded'.format(user))
            excluded_user_count += 1
            logging.debug(str(e))
            continue
            
        # indicates whether the predicted label was chosen to be presented to the user
        trip_df = add_top_pred(trip_df, trip_id_column='_id', pred_conf_column='p')
        df_list += [trip_df]

    else:
        logging.debug('{} excluded, no inferred labels'.format(user))
        excluded_user_count += 1

all_trips = pd.concat(df_list, ignore_index=True)

In [None]:
print('{} users included, {} users excluded from dataset with {} total users'.format(len(all_users) - excluded_user_count, excluded_user_count, len(all_users)))

In [None]:
all_trips[['_id','user_id', 'mode_true', 'purpose_true', 'mode_pred', 'purpose_pred', 'p', 'top_pred']].sort_values('_id')

In [None]:
# sum the confidence of duplicate mode predictions within each trip
all_trips.groupby(['_id','user_id', 'mode_pred'], as_index=False).agg({'p':'sum', 'top_pred':'any'})

Let's check with the raw trip dictionaries to make sure data was added correctly.

In [None]:
# trips = pp.read_data(all_users[1])
# for t in trips:
#     for e in t:
#         print(e)
#         print(t[e])
#         print()
#     print()

In [None]:
from uuid import UUID

def conf_verification(trips_df, label_type, n, user=None, only_top_pred=False):
    """ Returns a data frame with n rows showing the predicted and actual accuracy for n quantiles. 
    
        Args:
            trips_df: dataframe containing true labels, predicted labels, and predicted confidence
                Should have the columns '<label_type>_true', '<label_type>_pred', 'p'
            label_type (str): 'mode', 'purpose', or 'replaced'
            n (int): number of quantiles
            user (str or UUID): UUID, if we only want the confidence table for a single user.
            only_top_pred (bool): whether or not we only look at top predictions, or look at all predictions
                (including alternative labels that may not have been suggested to the user)
    """
    conf_quantiles = []
    
    if user:
        if isinstance(user, str):
            user = UUID(user)
        assert isinstance(user, UUID)
        trips_df = trips_df[trips_df.user_id==user]

    # ignore trips that don't have confirmed user input
    trips_df = trips_df.dropna(subset=[label_type + '_true'])
    
    # merge rows with duplicates of this label_type 
    # e.g. if a trip has two predictions, (car, work, bike) at 50% and 
    # (car, shopping, no travel) at 30%, and we want to get the confidence 
    # of mode labels, we want to combine the confidence of these two predictions 
    # to yield the true confidence for 'car'
    trips_df = trips_df.groupby(['_id','user_id', label_type + '_true', label_type + '_pred'], as_index=False).agg({'p':'sum', 'top_pred':'any'})
    
    if only_top_pred:
        trips_df = trips_df[(trips_df.top_pred)]
                                      
    for i in range(n):
        quantile_min = i / n
        quantile_max = (i + 1)/n
        
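        # get trips in this quantile (the last bin also includes p == 1.0, which would otherwise be dropped)
        in_range = (trips_df['p'] >= quantile_min) & ((trips_df['p'] < quantile_max) | (i == n - 1))
        trips_in_range = trips_df[in_range]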
        
        num_predictions = len(trips_in_range)
        num_correct = len(trips_in_range[trips_in_range[label_type + '_true'] == trips_in_range[label_type + '_pred']])

        conf_quantiles.append(['{:.2f} - {:.2f}'.format(quantile_min, quantile_max), num_predictions, num_correct])

    columns=['stated confidence range', 'num_predictions', 'num_correct']
    conf_df = pd.DataFrame(conf_quantiles, columns=columns)
    conf_df['accuracy'] = np.round(conf_df['num_correct'] / conf_df['num_predictions'], 3)

    return conf_df

In [None]:
print('confidence for mode predictions')
conf_verification(all_trips, 'mode', 10, only_top_pred=False)

In [None]:
print('confidence for purpose predictions')
conf_verification(all_trips, 'purpose', 10, only_top_pred=False)

In [None]:
print('confidence for replaced mode predictions')
conf_verification(all_trips, 'replaced', 10, only_top_pred=False)

This is the data for 38 users who had received label suggestions. We evaluate *all* of our label predictions for these users, including backup/alternative predictions that may not have been shown. 

This is pretty good! The confidences are fairly realistic, though there is some variability (we are underconfident for some and overconfident for some). 
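
One way to make "underconfident" and "overconfident" concrete is to compare each bin's observed accuracy against the midpoint of its stated confidence range. The sketch below assumes a table shaped like the output of `conf_verification`; the helper name `add_calibration_gap` and the toy numbers are made up for illustration and are not results from this dataset.

```python
import pandas as pd

def add_calibration_gap(conf_df):
    """Hypothetical helper: add the bin midpoint and the accuracy-minus-midpoint gap.

    A positive gap means we were underconfident in that bin,
    a negative gap means we were overconfident.
    """
    out = conf_df.copy()
    # '0.50 - 0.60' -> two float columns (0.50, 0.60)
    bounds = out['stated confidence range'].str.split(' - ', expand=True).astype(float)
    out['bin_midpoint'] = (bounds[0] + bounds[1]) / 2
    out['gap'] = out['accuracy'] - out['bin_midpoint']
    return out

# toy table with made-up counts, only to show the shape of the result
toy_conf_df = pd.DataFrame({
    'stated confidence range': ['0.50 - 0.60', '0.90 - 1.00'],
    'num_predictions': [100, 200],
    'num_correct': [62, 180],
    'accuracy': [0.62, 0.90],
})
print(add_calibration_gap(toy_conf_df))
```

The midpoint is only a rough stand-in for the mean stated confidence in a bin, so small gaps should not be over-interpreted.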

Now I want to see what happens if we break this up into 0.05 increments.

In [None]:
conf_verification(all_trips, 'mode', 20, only_top_pred=False)

The fact that our confidence is pretty realistic makes intuitive sense. Imagine if we threw all trips from every single user into a single giant cluster, gave everybody the same prediction based on the probability distribution of that giant cluster, and set the confidence to the frequency of that label. The confidence would be pretty realistic if people continued taking the same kinds of trips. So it's good that our confidence is fairly realistic, but the more important part is that we want more predictions to be at the higher confidence levels. Accurately stating low confidence in a bad prediction doesn't help the users much.
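
The giant-cluster thought experiment can be checked with a quick simulation. This is a minimal sketch with made-up label frequencies (not our data): every trip gets the same predictions, each label's confidence is its frequency, and future trips are assumed to follow the same distribution. The function name is hypothetical.

```python
import numpy as np

def simulate_base_rate_calibration(labels, freqs, n_trips=100_000, seed=0):
    """Return {label: (stated confidence, observed accuracy)} for a predictor
    that gives every trip the same label distribution."""
    rng = np.random.default_rng(seed)
    # simulate future trips drawn from the same distribution as the giant cluster
    true_labels = rng.choice(labels, size=n_trips, p=freqs)
    return {label: (conf, float(np.mean(true_labels == label)))
            for label, conf in zip(labels, freqs)}

# made-up label frequencies for the hypothetical giant cluster
result = simulate_base_rate_calibration(['car', 'walk', 'bike'], [0.6, 0.3, 0.1])
for label, (conf, acc) in result.items():
    print('{}: stated confidence {:.2f}, observed accuracy {:.3f}'.format(label, conf, acc))
```

The stated confidence and observed accuracy should land close to each other for every label, yet no prediction is ever more confident than the most common label's frequency. That is the sense in which calibration alone is not enough.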

I wonder how it would look if we only looked at the confidence of the top prediction.

In [None]:
conf_verification(all_trips, 'mode', 10, only_top_pred=True)

Let's also look at the confidence for some random users to see how much variation there is: 

In [None]:
random_user = np.random.choice(all_users)
# most users actually didn't have predicted labels so I had to run the RNG a bunch of
# times until I actually got a valid user
print(random_user)
conf_verification(all_trips, 'mode', 10, user=random_user, only_top_pred=False)

In [None]:
random_user = np.random.choice(all_users)
# again: most users actually didn't have predicted labels so I had to run the RNG a bunch
# of times until I actually got a valid user
print(random_user)
conf_verification(all_trips, 'mode', 10, user=random_user, only_top_pred=False)

Purely anecdotally, it appears that we tend to be slightly underconfident in the trips that we say we are 50-80% confident in. 

Also, there appears to be more variability in confidence agreement for individual users than for the entire dataset, which is expected due to smaller sample size.
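
To get a rough feel for how much of that variability is just sample size, we can treat each bin's accuracy as a binomial proportion and look at its standard error, sqrt(a(1 - a) / n). This is a back-of-the-envelope sketch that assumes predictions within a bin are independent, which trips from the same user may not be; the helper name and the numbers are illustrative only.

```python
import math

def accuracy_standard_error(accuracy, num_predictions):
    """Standard error of an observed accuracy, treating it as a binomial proportion."""
    return math.sqrt(accuracy * (1 - accuracy) / num_predictions)

# illustrative counts: one user's bin vs. the same bin over the whole dataset
print(accuracy_standard_error(0.7, 20))
print(accuracy_standard_error(0.7, 2000))
```

With 100 times fewer predictions in a bin, the standard error is 10 times larger, so single-user tables will naturally look noisier.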

(I went back and updated the code to address the following comments.)

~~I have a hypothesis for why our confidence was underestimated: we provide confidences for an entire tuple rather than on a mode/purpose/replaced mode basis. Consider an example where we have the following predictions:~~

- ~~prediction 1, 50% confidence: mode car, purpose recreation, replaced walk~~
- ~~prediction 2, 30% confidence: mode car, purpose shopping, replaced bike~~

~~Notice that the combined confidence for the mode car prediction is 80%, even though we listed it as two separate predictions of 50% and 30%. Thus, in the way that I'm currently assessing confidence, these would be treated as two separate predictions, and be placed in the 0.3-0.4 and 0.5-0.6 deciles, even though it should be a single prediction in the 0.8-0.9 decile.~~

~~This also explains why we're 'overconfident' for lower deciles but it becomes more realistic for higher deciles: at higher deciles, it is less likely that I counted a prediction in the wrong decile (there are simply fewer deciles above it that it could belong in).~~

~~It is slightly more concerning that we were overconfident for a few users, because that can't be explained by this bug. (Well, maybe it's possible that fixing this bug and shifting predictions to their correct decile will smooth things out.)~~
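
The 50%/30% example above can be reproduced on a toy dataframe. This sketch mirrors the per-label `groupby` used in `conf_verification`, but the frame itself (including the `trip_1` id) is hypothetical and only holds the two predictions from the example.

```python
import pandas as pd

# the two tuple-level predictions from the example, for a single hypothetical trip
toy = pd.DataFrame({
    '_id': ['trip_1', 'trip_1'],
    'mode_pred': ['car', 'car'],
    'purpose_pred': ['recreation', 'shopping'],
    'replaced_pred': ['walk', 'bike'],
    'p': [0.5, 0.3],
})

# summing p per (trip, label) merges duplicate labels into one prediction
mode_conf = toy.groupby(['_id', 'mode_pred'], as_index=False).agg({'p': 'sum'})
purpose_conf = toy.groupby(['_id', 'purpose_pred'], as_index=False).agg({'p': 'sum'})

print(mode_conf)
print(purpose_conf)
```

For mode, the two rows collapse into a single 'car' prediction whose confidence is the sum of the two tuples. For purpose, the labels differ, so the two predictions stay separate at their original confidences.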

There are two things I can do: 
1. ~~fix the confidence assessment to account for this~~ (yep, did this)
2. update label predictions to have separated confidences for mode, purpose, and replaced

Option 2 is going to take more work, since it requires refactoring the code base, but it also seems like the better long-term option. We plan to merge label assist with sensed-mode in the ensemble algorithm (which means some trips may have a predicted mode but no predicted purpose), and we can try to extend the label assist algorithm to predict a trip's purpose even when the start location is not known (which would not yield a predicted mode or replaced mode).

Another thought: because there might be correlation between mode and purpose, it may make sense to assess the confidence at the tuple level. (The interesting thing about this tuple-packaging is that we may end up with a mode label, for example, that is the suggested label even if it does not have the highest cumulative confidence.)
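
Here is a small made-up example of that last point: the tuple with the highest confidence can carry a mode label that is not the mode with the highest cumulative confidence. The predictions and their confidences below are hypothetical.

```python
from collections import defaultdict

# hypothetical tuple-level predictions for one trip: (labels, confidence)
predictions = [
    ({'mode': 'bike', 'purpose': 'work'}, 0.4),
    ({'mode': 'car', 'purpose': 'shopping'}, 0.3),
    ({'mode': 'car', 'purpose': 'recreation'}, 0.3),
]

# the tuple that would be suggested: the one with the highest confidence
top_labels, top_p = max(predictions, key=lambda pred: pred[1])

# cumulative confidence per mode, summed across tuples
mode_conf = defaultdict(float)
for labels, p in predictions:
    mode_conf[labels['mode']] += p
best_mode = max(mode_conf, key=mode_conf.get)

print('mode in top tuple:', top_labels['mode'])
print('mode with highest cumulative confidence:', best_mode)
```

Here the suggested tuple says 'bike', while 'car' has the larger combined confidence across the two remaining tuples.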