# Pilot Before After -- Pass Three
This is the same as Pass Two, with an additional "sensitivity analysis" section at the end and possibly some minor modifications to improve compatibility with different types of database dumps.

## Pass Two
This reimplements the elements of `Pilot Before After -- Pass One` that we actually use in a more streamlined manner and also does some pilot-specific analysis. This is the main analysis code used to generate the results in the paper with the working title "A configurable approach to requesting user input and validation of low-confidence trip label inferences."

## Usage
To use, starting with a database dump tarball for each of the pilot programs plus stage:

  1. Expand each tarball into its own folder
  2. Make a new empty folder and start `mongod`, using the `--dbpath` option to tell it to store the database in that empty folder
  3. `mongorestore` each of the dumps, one at a time, verifying that there are no weird errors. I did this in ascending order of dump size except ending with stage.
  4. Ideally you would have "0 document(s) failed to restore" for all of the `mongorestore`s, but I consistently get `1400` documents failing to restore for `fc` and `1402` for `cc`.
  5. Set the proper environment variables so this notebook can find the `emission` scripts (I do this for myself with a slightly hacky `sys.path.insert` below)
  6. Run the notebook top to bottom. It should take on the order of 20 minutes (benchmarked on a 2015 MacBook Air) and run without errors. It uses a large amount of RAM (a few gigabytes), both in the `python` process and in `mongod`.
  7. The notebook is structured to load everything it needs from the database in the first few cells and then does not rely on the database again, so if you are tweaking later analyses and want to recover some RAM you can terminate the `mongod` process.

tl;dr: this relies on a rather specific database configuration.

## Desired Output

### All the useful information we want to keep from the previous notebook
Dataset 1 -- no "after update" condition necessary:
 * Number of participants
 * Unlabeled trips that users need not interact with at all
 * Trips that would be in To Label with no red labels

Dataset 2 (a subset of Dataset 1) -- yes "after update" condition necessary:
 * Number of participants
 * Frequency of app opens
 * Taps actually avoided by verify button
 * Taps actually avoided by hiding high confidence trips
 * Overall taps avoided (total, per trip, percentage of taps)
 * Fraction of users who used the verify button
 * Fraction of trips finalized using the verify button

### New features
 * Graphs of weekly labeling percentage and number of days per week the app was used over time
 * Comparisons of pilot programs that started before the update was released to pilot programs that started with the update already installed
 * Histogram of expectation confidences, segmented by how they are presented to the user
 * Weekly labeling visualizations at per-user granularity
 * Visualization of how we save taps
 * Summary of all the numbers used in the draft paper

### Still to do
 * What happens if we were to change the confidence thresholds? Can we save users more taps? This should be explored by figuring out how much of the current tapping is correcting the algorithm vs. filling in red labels vs. simply not using the new features -- which can be done with the "select_label" instrumentation event. If we lower the low threshold, we would expect the number of taps used to fill in red labels to decrease, but not the other two categories of taps.
 * Try to measure the time spent on each of the screens from which it is possible to label (To Label, All Unlabeled, Diary, etc.) -- this might be difficult, but it would be useful both to do a comparison between screens and also to see if this update changes the amount of time people spend labeling.
 * Maybe some more per-program breakdowns would be useful?
 * Statistical tests might be useful.
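
The threshold question in the first item above can be explored with a simple bucket count. The following is a minimal sketch, not part of the analysis itself: `bucket_counts` is a hypothetical helper, the confidence values are made up for illustration, and the strict `>` comparisons are assumed to match the ones used in `high_stats` later in this notebook.

```python
def bucket_counts(confidences, low, high):
    """Count trips per confidence bucket for one (low, high) threshold pair."""
    return {
        "high": sum(1 for c in confidences if c > high),        # would be hidden from To Label
        "mid": sum(1 for c in confidences if low < c <= high),  # in To Label, no red labels
        "low": sum(1 for c in confidences if c <= low),         # in To Label, some red label
    }

# Made-up trip confidences, for illustration only
confidences = [0.1, 0.2, 0.3, 0.5, 0.9, 0.95]

# Compare the production low threshold (0.25) with a hypothetical lower one (0.15)
for low in (0.25, 0.15):
    print(low, bucket_counts(confidences, low, high=0.89))
```

Lowering the low threshold only moves trips from the "low" bucket into the "mid" bucket; the "high" bucket is untouched, which is consistent with the expectation that only red-label taps would decrease.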

In [None]:
TESTINGMODE = False
USING_LABEL_TEST_DB = False
USING_MINI_TRIP_ANALYSIS_VALIDATION_DB = False
USING_ANALYSIS_VALIDATION_DB = False

# Declarations -- we declare variables here so that re-running later cells is less likely to clear them accidentally
EXCLUDE_UUIDS = []
stats = {}             # dictionary with the following database keys:
                       # "time": "stats/client_time", "error": "stats/client_error", "nav": "stats/client_nav_event"
                       # These are found in the Stage_timeseries collection
user_info = {}
confirmed_trip_df_map = {}
user_before_start = {}  # When the "before" period starts for each user
user_after_start = {}  # When the "after" period starts for each user
filter1_users = []  # Users with enough total trips
filter2_users = []  # Users that have installed the update
filter3_users = []  # Users with enough before trips
filter4_users = [] # Users with enough after trips
server_filtered_users = []
filtered_users = []
match_histogram = {}
g_high_confidence_n_after_unlabeled = None  # Hack to give old code access to this later. TODO: rework this.

## Choices
 * Let "before" be from June 1 until the user installed the update
 * Let "after" be from when the user installed the update until the most recent data available (as of writing, October 18)
 * Require 30 total trips for inclusion in Dataset 1
 * Require 10 trips after the switch for inclusion in Dataset 2
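 * For looking at frequency of app opens, analyze the entire Dataset 2 and then look at those who have opened the app at least 5 times after the switch (the earlier requirement of at least 5 opens before the switch was dropped on 2021-12-18; see `REQUIRED_OPENS_BEFORE` in the next cell)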
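
As a rough illustration of these choices, the sketch below counts one user's "before" and "after" trips and applies the trip-count requirements. It is a simplified stand-in for the filtering cells later in the notebook: `count_before_after` and `dataset_membership` are hypothetical names, plain `datetime` objects in UTC replace the `arrow` timestamps, and the example dates are made up.

```python
from datetime import datetime, timezone

REQUIRED_TRIPS_TOTAL = 30  # Dataset 1 requirement
REQUIRED_TRIPS_AFTER = 10  # Dataset 2 requirement

def count_before_after(trip_end_times, before_start, after_start):
    """Count trips in the "before" and "after" windows.

    Both comparisons are inclusive, as in the notebook's filters, so a trip
    ending exactly at after_start is counted in both windows.
    """
    n_before = sum(1 for t in trip_end_times if before_start <= t <= after_start)
    n_after = sum(1 for t in trip_end_times if t >= after_start)
    return n_before, n_after

def dataset_membership(n_total, n_after, has_update):
    """Return (in_dataset_1, in_dataset_2) for one user."""
    in_dataset_1 = n_total >= REQUIRED_TRIPS_TOTAL
    in_dataset_2 = in_dataset_1 and has_update and n_after >= REQUIRED_TRIPS_AFTER
    return in_dataset_1, in_dataset_2

# Made-up user: 20 trips in July, update installed August 1, 12 trips afterwards
june1 = datetime(2021, 6, 1, tzinfo=timezone.utc)
install = datetime(2021, 8, 1, tzinfo=timezone.utc)
trips = [datetime(2021, 7, d, tzinfo=timezone.utc) for d in range(1, 21)]
trips += [datetime(2021, 8, d, tzinfo=timezone.utc) for d in range(2, 14)]

n_before, n_after = count_before_after(trips, june1, install)
print(n_before, n_after)                                           # 20 12
print(dataset_membership(len(trips), n_after, has_update=True))    # (True, True)
```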

In [None]:
REQUIRED_TRIPS_TOTAL = 30 if not USING_MINI_TRIP_ANALYSIS_VALIDATION_DB else 1 # Must be at least 1 to prevent division by zero
if TESTINGMODE: REQUIRED_TRIPS_TOTAL = 1
REQUIRED_TRIPS_BEFORE = 0  # Changed from 10 on 2021-12-18 -- this is a significant change in methodology!
REQUIRED_TRIPS_AFTER = 10 if not USING_MINI_TRIP_ANALYSIS_VALIDATION_DB else 1  # Changed from 0 to 10 on 2021-12-18
REQUIRED_OPENS_TOTAL = 0
REQUIRED_OPENS_BEFORE = 0  # Changed from 5 on 2021-12-18 -- this is a significant change in methodology!
REQUIRED_OPENS_AFTER = 5
import arrow
MY_TZ = "America/Denver"  # Timezone we use when that information is absent (TODO this can be inferred from other data structures)
BEFORE_START = arrow.get("2021-06-01T00:00-06:00")
AFTER_END = arrow.get("2021-11-15T09:00-08:00")
weeks = list(arrow.Arrow.span_range("week", BEFORE_START, AFTER_END))[:-1]  # The weeks we care about when doing weekly analyses
from uuid import UUID  # This part is for if you want to manually exclude certain users
# EXCLUDE_UUIDS = [UUID(s) for s in input("Enter UUIDs to exclude, separated by spaces: ").split(" ") if len(s) > 0]
print(EXCLUDE_UUIDS)

## Settings
The below settings correctly mirror the production configuration; however, **the confidence thresholds are different for stage**. This must be kept in mind when doing analysis of stage data.
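
To make the role of the two thresholds concrete, here is a minimal standalone sketch of how a trip's confidence is derived and compared against them. It mirrors the partial reimplementation of the client-side algorithm inside `high_stats` later in this notebook; `trip_confidence` and `presentation` are hypothetical names, the example inference is made up, and reading "at or below the low threshold" as "at least one red label" is my interpretation of the categories used there.

```python
LABEL_CATEGORIES = ["mode_confirm", "purpose_confirm", "replaced_mode"]
HIGH_CONFIDENCE_THRESHOLD = 0.89
LOW_CONFIDENCE_THRESHOLD = 0.25

def trip_confidence(inferred_labels):
    """Confidence of the least confident label type.

    For each label type, sum p over tuples sharing a label value and keep the
    largest sum; the trip's confidence is the minimum across label types.
    """
    confidences = {}
    for label_type in LABEL_CATEGORIES:
        totals = {}
        for tup in inferred_labels:
            if label_type not in tup["labels"]:
                continue  # incomplete tuples do occur
            value = tup["labels"][label_type]
            totals[value] = totals.get(value, 0.0) + tup["p"]
        confidences[label_type] = max(totals.values()) if totals else 0.0
    return min(confidences.values())

def presentation(confidence):
    """Map a trip confidence to how the trip would be presented."""
    if confidence > HIGH_CONFIDENCE_THRESHOLD:
        return "hidden from To Label"
    if confidence > LOW_CONFIDENCE_THRESHOLD:
        return "To Label, no red labels"
    return "To Label, at least one red label"

# Made-up inference: mode and replaced mode agree, purpose is split
inference = [
    {"labels": {"mode_confirm": "bike", "purpose_confirm": "work", "replaced_mode": "drove_alone"}, "p": 0.5},
    {"labels": {"mode_confirm": "bike", "purpose_confirm": "shopping", "replaced_mode": "drove_alone"}, "p": 0.25},
]
c = trip_confidence(inference)
print(c, presentation(c))  # 0.5 To Label, no red labels
```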

In [None]:
LABEL_CATEGORIES = ["mode_confirm", "purpose_confirm", "replaced_mode"]
HIGH_CONFIDENCE_THRESHOLD_PRODUCTION = 0.89  # Confidence we need to not put a trip in To Label
# if TESTINGMODE: HIGH_CONFIDENCE_THRESHOLD_PRODUCTION = 0.80
LOW_CONFIDENCE_THRESHOLD_PRODUCTION = 0.25  # confidenceThreshold from the config file
OLD_TAPS = 2*len(LABEL_CATEGORIES)  # Number of taps each trip required to fully label under the old UI

## Other Setup

### Imports, aliases, logging, config

In [None]:
import sys
sys.path.insert(0, "/Users/mallen2/All_Label_Data/e-mission-server")  # Works for my configuration; might be different for you

import emission.storage.timeseries.abstract_timeseries as esta
import emission.core.get_database as edb
import emission.storage.timeseries.aggregate_timeseries as estag
import emission.storage.timeseries.timequery as estt
from statistics import mean
import json
import logging
import numpy as np
import pandas as pd

# Various ways to play with how large a firehose of information you want
_default_log_level = logging.DEBUG
def set_log_level(level):
    logging.getLogger().setLevel(level)
def reset_log_level():
    global _default_log_level
    logging.getLogger().setLevel(_default_log_level)
set_log_level(logging.WARNING)

agts = estag.AggregateTimeSeries()       

db_keys = {
    "time": "stats/client_time",
    "error": "stats/client_error",
    "nav": "stats/client_nav_event"
}


### Helper functions

In [None]:
filter_between = lambda dataset, key, start, end: dataset[(dataset[key] >= start) & (dataset[key] <= end)]
fu = lambda d, users=filtered_users: {k: d[k] for k in d if k in users}  # Filter users, builds a dictionary

def filter_update(new, old, reason):
    print(f"Excluded {len(old)-len(new)} users, left with {len(new)}: {reason}")

format_frac_percent = lambda num, denom: f"{num}/{denom}={(num/denom if denom != 0 else float('NaN')):.2%}"

delta2days = lambda d: d.days+d.seconds/86400

format_arrow_comma = lambda a, b, c: f"{a:.2f}->{b:.2f}, {c}"

## Get data

### Load stats and user databases

In [None]:
# This may take a while -- clocked at between 6m and 10m on an early-2015 MacBook Air
for key in db_keys:
    print(f'Adding "{db_keys[key]}" to stats as "{key}"')
    stats[key] = agts.get_data_df(db_keys[key])  # agts is the AggregateTimeSeries object defined earlier
    print(f"-> Done; found {stats[key].shape[0]} entries")

In [None]:
programs_all = {"ens": [], "before_after": [], "only_after": []}  
# This will be a dict where keys are all pilot programs plus 
# "ens" for "ensemble, no stage" and "stage" for stage; 
# and values are the users corresponding to each
# Late in this process, I added "before_after" and "only_after": "only_after" is an ensemble for the programs 
# where users were given the app with the update already installed, 
# and "before_after" is an ensemble for the remaining programs
only_after_programs = ["vail", "pc"]
for u in edb.get_uuid_db().find():         # add users to proper locations in programs 
    program = u["user_email"].split("_")[0]    # This info is in the Stage_uuids collection of the database
    uuid = u["uuid"]
    u["program"] = program
    if program not in programs_all.keys(): programs_all[program] = []
    if program != "stage":
        programs_all["ens"].append(uuid)
        if program in only_after_programs:
            programs_all["only_after"].append(uuid)
        else:
            programs_all["before_after"].append(uuid)
    programs_all[u["program"]].append(uuid)
    user_info[uuid] = u
print("Programs all: "+str({k: len(programs_all[k]) for k in programs_all}))

# Ignore the small ensembles in certain cases
programs_some = programs_all.copy()
programs_some.pop("before_after")
programs_some.pop("only_after")
print("Programs some: "+str({k: len(programs_some[k]) for k in programs_some}))

programs = programs_all  # For backwards compatibility
# The upside to this way of doing ensembles is it's really easy. The downside is a given user's data is calculated as many times as that user appears in the list of programs -- so with our current ensembles, we're doing most calculations three times when we only really need to do them once.

### Get Dataset 1

In [None]:
# This may take a while -- clocked at 3m10 on an early-2015 MacBook Air
set_log_level(logging.INFO)
all_users = esta.TimeSeries.get_uuid_list()
print(f"Working with {len(all_users)} initial users")

filter0_users = [u for u in all_users if u not in EXCLUDE_UUIDS]  # Users that we don't explicitly exclude
filter_update(filter0_users, all_users, "presence on exclusion list")

if TESTINGMODE: gooduser = filter0_users[1]

# filter out users that don't have enough trips
for u in filter0_users:
    ts = esta.TimeSeries.get_time_series(u if not TESTINGMODE else gooduser)
    ct_df = ts.get_data_df("analysis/confirmed_trip")
    confirmed_trip_df_map[u] = ct_df
    if ct_df.shape[0] >= REQUIRED_TRIPS_TOTAL: filter1_users.append(u)
filter_update(filter1_users, filter0_users, "not enough total trips")   # (new, old, reason)

# To find a user's UUID based on the end date of their first trip:
# for u in filter1_users:
#     ct_df = confirmed_trip_df_map[u].copy()
#     ct_df.sort_values("end_ts", ascending=True, inplace=True)
#     print(u)
#     print(arrow.get(ct_df.iloc[0]["end_ts"]).to("America/Chicago"))
#     print()

for u in filter1_users:
    # Convert timestamps to more usable values; find per-user starting points
    # Makes new columns of converted timestamps for each timestamp type 
    # I used to do this later in the process, but it turns out end_arrow and user_before_start 
    # are useful for server_confirmed_users too
    ct_df = confirmed_trip_df_map[u]
    if "end_ts" in ct_df:
        ct_df["end_arrow"] = ct_df["end_ts"].apply(arrow.get)
        ct_df.sort_values("end_arrow", ascending=True, inplace=True)
    else:
        print("end_ts not in dataframe for "+str(u))
    if "metadata_write_ts" in ct_df:
        ct_df["write_arrow"] = ct_df["metadata_write_ts"].apply(arrow.get)
    else:
        print("metadata_write_ts not in dataframe for "+str(u))
    this_before_start = max(ct_df.iloc[0]["end_arrow"], BEFORE_START)
    user_before_start[u] = this_before_start

for user in filter1_users: 
    if user not in server_filtered_users:
        server_filtered_users.append(user)
print(f"For metrics that don't need user interaction, working with {len(server_filtered_users)} filtered users")

reset_log_level()

### Get Dataset 2

In [None]:
# This may take a while -- clocked at between 30s and 50s on an early-2015 MacBook Air
for u in filter1_users:
    lts = stats["time"][(stats["time"]["user_id"] == (u if not TESTINGMODE else gooduser)) & (stats["time"]["name"] == "label_tab_switch")]
    if len(lts) > 0:
        filter2_users.append(u)
        lts = lts.copy()
        lts.sort_values("ts", ascending=True, inplace=True)
        this_after_start = arrow.get(lts.iloc[0]["ts"])   
        user_after_start[u] = this_after_start
filter_update(filter2_users, filter1_users, "have not installed the update")

for u in filter2_users:
    ct_df = confirmed_trip_df_map[u]
    n_before = filter_between(ct_df, "end_arrow", user_before_start[u], user_after_start[u]).shape[0]
    if n_before >= REQUIRED_TRIPS_BEFORE: filter3_users.append(u)
filter_update(filter3_users, filter2_users, "not enough before trips")

for u in filter3_users:
    ct_df = confirmed_trip_df_map[u]
    n_after = ct_df[(ct_df["end_arrow"] >= user_after_start[u])].shape[0]
    if n_after >= REQUIRED_TRIPS_AFTER: filter4_users.append(u)
filter_update(filter4_users, filter3_users, "not enough after trips")

for user in filter4_users:
    if user not in filtered_users:
        filtered_users.append(user)
print(f"For metrics that do need user interaction, working with {len(filtered_users)} filtered users")

## Results from Dataset 1!
To review, we want:
 * Number of participants
 * Unlabeled trips that users need not interact with at all
 * Trips that would be in To Label with no red labels

and breakdowns of (all of? some of?) the above for each individual pilot program region ("program")

### Number of participants

In [None]:
print(f"NUMBER OF USERS IN DATASET 1: {len(server_filtered_users)}")
print("Breakdown by program: "+str({k: len([u for u in programs[k] if u in server_filtered_users]) for k in programs}))

### Unlabeled trips that users need not interact with at all
Note: for this one, we calculate a stat across All of time, one for Before, and one for After. Currently, Before includes all of Dataset 1 (and obviously After only includes Dataset 2). TODO: figure out whether it might be a better comparison to report only Dataset 2 for Before.
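
The Before figure is not counted directly; it is obtained by subtracting the After counts from the All counts, which is why it ends up covering all of Dataset 1. A small sketch of that arithmetic follows, using the counts from the mini validation database that appear in the assertions further down; `before_pair` is a hypothetical helper, and `format_frac_percent` is rewritten here as a plain function with the same output format as the notebook's lambda.

```python
def format_frac_percent(num, denom):
    """Render a (numerator, denominator) pair as 'num/denom=percent'."""
    frac = num / denom if denom != 0 else float("nan")
    return f"{num}/{denom}={frac:.2%}"

def before_pair(all_pair, after_pair):
    """Derive the Before (numerator, denominator) as All minus After."""
    return (all_pair[0] - after_pair[0], all_pair[1] - after_pair[1])

all_pair = (10, 68)   # high-confidence unlabeled trips, total unlabeled trips
after_pair = (2, 24)  # the same two counts, restricted to the "after" period
before = before_pair(all_pair, after_pair)

for name, pair in (("All", all_pair), ("Before", before), ("After", after_pair)):
    print(f"{name}: {format_frac_percent(*pair)}")
# All: 10/68=14.71%
# Before: 8/44=18.18%
# After: 2/24=8.33%
```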

In [None]:
# This may take a while -- clocked at between 30s and 50s on an early-2015 MacBook Air
# Load the user confirmation data from the database
# These are in Stage_timeseries (metadata.key: "manual/label")
manuals = {label: agts.get_data_df("manual/"+label) for label in LABEL_CATEGORIES}

In [None]:
# This seems to work but is way too slow; see below for the faster way 
set_log_level(logging.WARNING)
import emission.storage.decorations.trip_queries as esdt
import emission.core.wrapper.entry as ecwe
###trip_to_manuals = {}  # Dictionary by user of (dictionary by trip ID of (dictionary by label type of ()))
trip_write_times = {} # Dictionary by user of (dictionary by trip ID)

n_mislabel_events = 0

malleable = lambda: type('', (), {})  # An object we can do anything with
n_confirmed_trips = sum([confirmed_trip_df_map[u].shape[0] for u in all_users])
print(n_confirmed_trips)
print(sum([confirmed_trip_df_map[u].shape[1] for u in all_users]))  

def final_candidate(filter_fn, potential_candidates):
    potential_candidate_objects = [ecwe.Entry(c) for c in potential_candidates]
    extra_filtered_potential_candidates = list(filter(filter_fn, potential_candidate_objects))
    if len(extra_filtered_potential_candidates) == 0:
        return None

    # In general, all candidates will have the same start_ts, so no point in
    # sorting by it. Only exception to general rule is when user first provides
    # input before the pipeline is run, and then overwrites after pipeline is
    # run
    sorted_pc = sorted(extra_filtered_potential_candidates, key=lambda c:c["metadata"]["write_ts"])
    #print('Printing sorted_pc write time differences:')
    #write_time_diff = sorted_pc[1]['metadata']['write_ts'] - sorted_pc[0]['metadata']['write_ts'] if len(sorted_pc) > 1 else 'NA'

    # Need to count the number of sorted pc candidates and output it.
    n_good_candidates = len(sorted_pc)
    #print(f'len(sorted_pc) = {n_good_candidates}, write time difference is {write_time_diff}')

    entry_detail = lambda c: c.data.label if "label" in c.data else c.data.start_fmt_time
    logging.debug("sorted candidates are %s" %
        [{"write_fmt_time": c.metadata.write_fmt_time, "detail": entry_detail(c)} for c in sorted_pc])
    most_recent_entry = sorted_pc[-1]
    logging.debug("most recent entry is %s, %s" %
        (most_recent_entry.metadata.write_fmt_time, entry_detail(most_recent_entry)))
    return most_recent_entry, n_good_candidates

def get_user_input_for_trip_object(ts, trip_obj, user_input_key):
    tq = estt.TimeQuery("data.start_ts", trip_obj.data.start_ts, trip_obj.data.end_ts)
    potential_candidates = ts.find_entries([user_input_key], tq)
    return final_candidate(esdt.valid_user_input(ts, trip_obj), potential_candidates)

for u in all_users:
    #print(u)
    ###trip_to_manuals[u] = {}
    trip_write_times[u] = {}
    ts = esta.TimeSeries.get_time_series(u)
    print(confirmed_trip_df_map[u].shape)

    for i, trip in confirmed_trip_df_map[u].iterrows():
        #print(trip.keys())
        #break
        #print(i, end=" ")
        ###trip_to_manuals[u][trip._id] = {}
        single_trip_write_times = {}
        for label in manuals:
            # ui = esdt.get_user_input_for_trip("analysis/confirmed_trip", u, trip._id, "manual/"+label)
            # trip_obj = ts.get_entry_from_id("analysis/confirmed_trip", trip._id)
            trip_obj = malleable()
            trip_obj.data = trip
            trip_obj.metadata = malleable()
            trip_obj.metadata.time_zone = trip.start_local_dt_timezone

            user_input_info = get_user_input_for_trip_object(ts, trip_obj, "manual/"+label)

            if user_input_info is not None: 
                ui = user_input_info[0]
                n_mislabel_events += user_input_info[1] - 1
                single_trip_write_times[label] = ui['metadata']['write_ts']

            ###trip_to_manuals[u][trip._id][label] = ui  #old code
        
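        if len(single_trip_write_times) > 0:
            trip_write_times[u][trip._id] = min(single_trip_write_times.values())
        else:
            # No matching manual entry. NaN marks the trip as never labeled, so filter_unlabeled keeps it;
            # only assume the worst (-inf) if the trip claims to have user input that we failed to match.
            trip_write_times[u][trip._id] = float("-inf") if len(trip["user_input"]) > 0 else float("NaN")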

mislabel_proportion = n_mislabel_events/n_confirmed_trips
print(mislabel_proportion)

In [None]:
ui

In [None]:
def do_matching(users):
    for user in users:
        ct_df = confirmed_trip_df_map[user]
        ct_df["label_write_time"] = ct_df['_id'].map(trip_write_times[user])  # index by the loop variable, not a stale global u
        
do_matching(server_filtered_users)
'''
# This may take a while -- clocked at 5m27 on an early-2015 MacBook Air
# Calculate which trips were manually labeled before the inference algorithm ran. 
# These trips are part of the "train" dataset, so we need to exclude them from the "test" dataset.
# To do this, we must match confirmed trip entries with entries from the user input database.
# TODO My matcher is a rather blunt tool, and it misses a lot of matches. 
# Can we get something with success rates approaching that of esdt.get_user_input_for_trip_object 
# without sacrificing so much time?
import time
def filter_time_permissive(df, trip, threshold=15):       # 15 seconds threshold
    start = trip["start_ts"]
    end = trip["end_ts"]
    before_g = df["start_ts"] >= start-threshold
    before_l = df["start_ts"] <= start+threshold
    after_g = df["end_ts"] >= end-threshold
    after_l = df["end_ts"] <= end+threshold
    result = df[before_g & before_l & after_g & after_l]
    return result

def get_write_time(df, trip):
    if len(trip["user_input"]) == 0: return float("NaN")
    candidates = filter_time_permissive(df, trip)
    if len(candidates) not in match_histogram: match_histogram[len(candidates)] = 0
    match_histogram[len(candidates)] += 1
    write_times = candidates["metadata_write_ts"].values
    return min(write_times) if len(write_times) > 0 else float("-inf") # If we can't find a match, assume the worst

def get_write_times(trip):
    times = [get_write_time(df, trip) for df in manuals.values()]    # for a given trip, get the write time for each label category
    return min(times)

def explore_matching(user):
    print(user)
    ct_df = confirmed_trip_df_map[user].copy()
    print(ct_df.shape)
    to_match = manuals["mode_confirm"]
    print(to_match.keys())

    t1 = time.time()
    for i, trip in ct_df.iterrows():
        if len(trip["user_input"]) == 0: continue
        t_start = trip["start_ts"]
        t_end = trip["end_ts"]
        # print(t_start)
        THRESHOLD = 60
        filtered = filter_time_permissive(to_match, trip, threshold=THRESHOLD)
        if (filtered.shape[0] != 1):
            print(filtered["metadata_write_ts"].values)
            print(filtered.shape[0])
    print(time.time()-t1)

# explore_matching(filtered_users[1])

def do_matching(users):
    for user in users:
        # print(user)
        ct_df = confirmed_trip_df_map[user]
        # print(ct_df.keys())
        # print(ct_df.shape)
        ct_df["label_write_time"] = ct_df.apply(lambda trip: get_write_times(trip), axis=1)

do_matching(server_filtered_users)
print(sorted(match_histogram.items()))  # We want as many items as possible to have exactly one match. Zero matches means we will be forced to exclude the trip, and multiple matches means we must take the most pessimistic match.
'''

In [None]:
# This may take a while -- clocked at 19s on an early-2015 MacBook Air

# Previously, we only operated on trips that were *actually* unlabeled. 
# Now, we operate on trips that were unlabeled at the time of expectation generation.
def filter_unlabeled(df):
    # return df[df["user_input"].apply(len) == 0]
    # Check first for label_write_time is NaN and then for label_write_time after expectation generation
    return df[(df["label_write_time"].isna()) | (df["label_write_time"] > df["metadata_write_ts"])]

def is_unlabeled(trip):
    # return len(trip["user_input"]) == 0
    # print("Labeled at: "+str(trip["label_write_time"]))
    # print("Inferred at: "+str(trip["metadata_write_ts"]))
    # tried .isna on the left but got the error: 'float' object has no attribute 'isna'
    return (trip["label_write_time"] != trip["label_write_time"]) | (trip["label_write_time"] > trip["metadata_write_ts"])

def high_stats(users):
    global g_high_confidence_n_after_unlabeled  # see the early variable declarations near the top.
    # if this is None, it gets set to the value we find for high_confidence_n_after_unlabeled
    
    total_trip_n = {}
    total_trip_n_after = {}
    total_trip_n_unlabeled = {}
    total_trip_n_after_unlabeled = {}
    high_confidence_n = {}  # Trips with inferences so confident they don't need to go in To Label
    high_confidence_n_after = {}
    high_confidence_n_unlabeled = {}
    high_confidence_n_after_unlabeled = {}
    mid_confidence_n = {}  # Trips that need to go in To Label but have no red labels
    mid_confidence_n_after = {}
    mid_confidence_n_unlabeled = {}
    mid_confidence_n_after_unlabeled = {}
    high_confidence_frac = {}
    mid_confidence_frac = {}
    mid_confidence_any = {}  # Trips that need to go in To Label but have at least one yellow label
    mid_confidence_any_after = {}
    all_confidences = []

    for u in users:
        ct_df = confirmed_trip_df_map[u]
        total_trip_n[u] = ct_df.shape[0]
        high_confidence_n[u] = 0
        mid_confidence_n[u] = 0

        total_trip_n_unlabeled[u] = filter_unlabeled(ct_df).shape[0]
        high_confidence_n_unlabeled[u] = 0
        mid_confidence_n_unlabeled[u] = 0

        mid_confidence_any[u] = 0
        mid_confidence_any_after[u] = 0

        if u in filtered_users:
            high_confidence_n_after[u] = 0
            high_confidence_n_after_unlabeled[u] = 0
            mid_confidence_n_after[u] = 0
            mid_confidence_n_after_unlabeled[u] = 0
            this_after_start = user_after_start[u]
            trips_after = ct_df[(ct_df["end_arrow"] >= this_after_start)]
            total_trip_n_after[u] = trips_after.shape[0]
            total_trip_n_after_unlabeled[u] = filter_unlabeled(trips_after).shape[0]
            ids = []
            for _, trip in trips_after.iterrows():
                ids.append(trip["_id"])

        # Get each trip's inference tuples.
        # Add to a confidence sum for each label value using the p value field of the inference tuples
        # For a given label type, choose the highest confidence sum as the confidence value for the label type.
        # Choose the confidence of the least confident label type as the trip's confidence.
        for _, trip in ct_df.iterrows():
            inference = trip["inferred_labels"]
            # Here goes a quick and partial reimplementation of the on-the-fly (client-side) inference algorithm
            confidences = {}
            for label_type in LABEL_CATEGORIES:
                counter = {}
                for line in inference:
                    if label_type not in line["labels"]: continue  # Seems we have some incomplete tuples!
                    val = line["labels"][label_type]
                    if val not in counter: counter[val] = 0
                    counter[val] += line["p"]
                confidences[label_type] = max(counter.values()) if len(counter) > 0 else 0 # This needs to be max, not sum!!! A major bug in a previous version.
            trip_confidence = min(confidences.values())
            all_confidences.append(trip_confidence)
            # if (trip_confidence >= 0.01 and trip_confidence <= 0.99): print(trip_confidence)
            if trip_confidence > HIGH_CONFIDENCE_THRESHOLD_PRODUCTION:
                high_confidence_n[u] += 1
                if u in filtered_users and trip["_id"] in ids:
                    high_confidence_n_after[u] += 1
                    if is_unlabeled(trip): high_confidence_n_after_unlabeled[u] += 1
                if is_unlabeled(trip): high_confidence_n_unlabeled[u] += 1
            if trip_confidence > LOW_CONFIDENCE_THRESHOLD_PRODUCTION:
                mid_confidence_n[u] += 1
                if u in filtered_users and trip["_id"] in ids:
                    mid_confidence_n_after[u] += 1
                    if is_unlabeled(trip): mid_confidence_n_after_unlabeled[u] += 1
                if is_unlabeled(trip): mid_confidence_n_unlabeled[u] += 1
            
            # What's happening with mid_any?
            if max(confidences.values()) > LOW_CONFIDENCE_THRESHOLD_PRODUCTION and trip_confidence <= LOW_CONFIDENCE_THRESHOLD_PRODUCTION:
                mid_confidence_any[u] += 1
                if u in filtered_users and trip["_id"] in ids:
                        mid_confidence_any_after[u] += 1
        high_confidence_frac[u] = high_confidence_n[u] / total_trip_n[u]
        in_to_label = total_trip_n[u]-high_confidence_n[u]
        mid_confidence_frac[u] = (mid_confidence_n[u]-high_confidence_n[u]) / in_to_label if in_to_label != 0 else float("NaN")
        
    results = {"high": {
                "All": (sum(high_confidence_n_unlabeled.values()), sum(total_trip_n_unlabeled.values())),
                "Before": (sum(high_confidence_n_unlabeled.values())-sum(high_confidence_n_after_unlabeled.values()), sum(total_trip_n_unlabeled.values())-sum(total_trip_n_after_unlabeled.values())),
                "After": (sum(high_confidence_n_after_unlabeled.values()), sum(total_trip_n_after_unlabeled.values()))
               },
               "mid": {  # For mid confidence, we subtract the high confidence counts from both numerator and denominator to only capture what's going on in To Label
                "All": (sum(mid_confidence_n_unlabeled.values())-sum(high_confidence_n_unlabeled.values()), sum(total_trip_n_unlabeled.values())-sum(high_confidence_n_unlabeled.values())),
                "Before": ((sum(mid_confidence_n_unlabeled.values())-sum(mid_confidence_n_after_unlabeled.values()))-(sum(high_confidence_n_unlabeled.values())-sum(high_confidence_n_after_unlabeled.values())), (sum(total_trip_n_unlabeled.values())-sum(total_trip_n_after_unlabeled.values()))-(sum(high_confidence_n_unlabeled.values())-sum(high_confidence_n_after_unlabeled.values()))),
                "After": (sum(mid_confidence_n_after_unlabeled.values())-sum(high_confidence_n_after_unlabeled.values()), sum(total_trip_n_after_unlabeled.values())-sum(high_confidence_n_after_unlabeled.values()))
               },
               # Should there be denominators for these?
               "mid_any": {
                   "All": sum(mid_confidence_any.values()),
                   "Before": sum(mid_confidence_any.values())-sum(mid_confidence_any_after.values()),
                   "After": sum(mid_confidence_any_after.values())
               },
               "all_confidences": all_confidences}
    # Let's rework this hack? Does it only compute for the first program?
    if g_high_confidence_n_after_unlabeled is None: g_high_confidence_n_after_unlabeled = high_confidence_n_after_unlabeled

    return results, high_confidence_n_after_unlabeled

# Get the inference confidences for each program
high_stats_each_program = {k: high_stats([u for u in programs[k] if u in server_filtered_users]) for k in programs}

complete_results = {k: high_stats_each_program[k][0] for k in programs}
all_high_confidence_n_after_unlabeled = {k: high_stats_each_program[k][1] for k in programs}

if len(all_high_confidence_n_after_unlabeled['ens']) > 0:
    g_high_confidence_n_after_unlabeled = all_high_confidence_n_after_unlabeled['ens'] 
else:   # if there's nothing in ensemble, we are probably working with stage data only
    g_high_confidence_n_after_unlabeled = all_high_confidence_n_after_unlabeled['stage'] 

# Info on mini-trip-analysis-validation-db:
# counts of each confirmed trip in Stage_analysis_timeseries: low/mid/high/labeled/incomplete_inference: 32/24/10/5/2
# total confirmed trips: 73. number of unlabeled trips: 68
# Totals after: All,high,mid,low,incomplete,labeled:
#   [24, 2, 6, 16, 0, 0]
if USING_MINI_TRIP_ANALYSIS_VALIDATION_DB: 
    test_results = complete_results['stage'] 
    # high
    assert sum(g_high_confidence_n_after_unlabeled.values()) == 2
    assert test_results['high']['All'] == (10,68)
    assert test_results['high']['Before'] == (10-2,68-24)  # the total number of unlabeled 'after' trips is 24
    assert test_results['high']['After'] == (2,24)

    # mid
    assert test_results['mid']['All'] == (24,68-10)
    assert test_results['mid']['Before'] == (24-6,68-24-8)
    assert test_results['mid']['After'] == (6,24-2)

print("Considering only unlabeled data, we calculate the average percentage of trips that users do not need to interact with at all:")
for k in complete_results:
    print(f"{k} ({len([u for u in programs[k] if u in server_filtered_users])} users for \"all\" and \"before\"; {len([u for u in programs[k] if u in filtered_users])} for \"after\"):")
    for s in complete_results[k]["high"]:
        print(f"\t{s}: {format_frac_percent(*complete_results[k]['high'][s])}")

In [None]:
i = 0
for u in server_filtered_users:
    ct_df = confirmed_trip_df_map[u]
    for _,row in ct_df.iterrows():
        if is_unlabeled(row):
            i += 1

print(i)

### Trips that would be in To Label with no red labels
Same note as above applies here.

Numerator is number of trips in To Label with no red labels, denominator is number of trips in To Label at all.
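
As a small worked example of this numerator/denominator convention, here is a sketch with illustrative counts. The helper `frac_percent` is hypothetical; it only demonstrates the arithmetic and is not the notebook's `format_frac_percent`.

```python
def frac_percent(numerator, denominator):
    """Hypothetical helper: render a fraction with its percentage.

    This is a stand-in for illustration, not the notebook's format_frac_percent.
    """
    if denominator == 0:
        # Avoid dividing by zero when no trips would appear in To Label
        return f"{numerator}/{denominator} (N/A)"
    return f"{numerator}/{denominator} ({numerator / denominator:.2%})"

# Illustrative counts only
n_to_label = 22  # unlabeled trips that would appear in To Label at all
n_no_red = 6     # of those, trips with no red labels

print(frac_percent(n_no_red, n_to_label))
```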

In [None]:
print("Considering only unlabeled data, for trips that would appear in To Label, we calculate the percent of trips with no red labels")
for k in complete_results:
    print(f"{k} ({len([u for u in programs[k] if u in server_filtered_users])} users for \"all\" and \"before\"; {len([u for u in programs[k] if u in filtered_users])} for \"after\"):")
    for s in complete_results[k]["mid"]:
        print(f"\t{s}: {format_frac_percent(*complete_results[k]['mid'][s])}")

In [None]:
print("Considering only unlabeled data, we calculate the number of trips appearing in To Label with some red labels but with at least yellow labels, just to reassure ourselves that it is a lot:")
for k in complete_results:
    print(f"{k} ({len([u for u in programs[k] if u in server_filtered_users])} users for \"all\" and \"before\"; {len([u for u in programs[k] if u in filtered_users])} for \"after\"):")
    for s in complete_results[k]["mid_any"]:
        print(f"\t{s}: {complete_results[k]['mid_any'][s]}")

# Construct the test database
##### Collect trips with each confidence level from the analysis-validation database.
I also used `mongodump` to dump the full Stage_timeseries and Stage_uuids collections from analysis-validation-db\
and `mongorestore` to load them into mini-trip-analysis-validation-db.\
For example: `docker exec analysis-validation-db sh -c 'mongodump --archive --db=Stage_database --collection=Stage_uuids --query='{}' ' > db_uuids.dump`\
And for `mongorestore`:\
`docker exec -i mini-trip-analysis-validation-db sh -c "cd ~/mini_analysis_validation && mongorestore --archive" < ~/All_Label_Data/mini_analysis_validation/db_uuids.dump`
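
When only a subset of documents is needed, the `--query` argument for `mongodump` can be assembled with `json.dumps` instead of hand-escaped strings. The following is a minimal sketch, assuming the IDs are available as values whose `str()` is the 24-character hex string; `build_id_query` and the sample IDs are hypothetical and not part of this notebook.

```python
import json

def build_id_query(object_ids):
    """Return a MongoDB Extended JSON query matching documents whose _id is in object_ids.

    Assumes str() of each element yields the ObjectId's hex string.
    """
    return json.dumps({"_id": {"$in": [{"$oid": str(oid)} for oid in object_ids]}})

# Hypothetical ObjectId hex strings, for illustration only
sample_ids = ["5f1a2b3c4d5e6f7a8b9c0d1e", "5f1a2b3c4d5e6f7a8b9c0d1f"]
query = build_id_query(sample_ids)
print(query)
```

The resulting string still needs shell quoting before it is embedded in a `docker exec ... sh -c` command; `shlex.quote` from the standard library is one option.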


In [None]:
#if USING_ANALYSIS_VALIDATION_DB:
HIGHS_PER_USER = 2
MIDS_PER_USER = 3
LOWS_PER_USER = 4
LABELED_TRIPS_PER_USER = 5
N_INCOMPLETES = 2

high_list = []
mid_list = []
low_list = []
labeled_list = []
incomplete_inference_list = []  # I'm calling an inference incomplete if it has any incomplete tuples. It looks like they all have low confidence
incompletes = 0

for u in server_filtered_users:
    #print(f"user id: {u}")
    ct_df = confirmed_trip_df_map[u]

    # Reset the counts that keep track of the number of each trip confidence per user
    highs = 0
    mids = 0
    lows = 0
    labeled_count = 0

    # Get each trip's inference tuples.
    # Add to a confidence sum for each label value using the p field of the inference tuples
    # For a given label type, choose the highest confidence sum as the confidence value for the label type.
    # Choose the confidence of the least confident label type as the trip's confidence.
    for j, trip in ct_df.iterrows():

        if not is_unlabeled(trip): 
            if labeled_count < LABELED_TRIPS_PER_USER:    # side note: it looks like only the first user in analysis-validation-db 
                                                        # has trips that are not "unlabeled"
                labeled_list.append(ct_df['_id'].iloc[j])
            labeled_count += 1
            continue    # prevents labeled trips from showing up in the other lists

        inference = trip["inferred_labels"]
        confidences = {}
            
        incomplete_inference = False
        for label_type in LABEL_CATEGORIES:
            label_confidence_sums = {}

            for line in inference:
                if label_type not in line["labels"]: 
                    incomplete_inference = True
                    continue 
                label_value = line["labels"][label_type]
                if label_value not in label_confidence_sums: 
                    label_confidence_sums[label_value] = 0
                label_confidence_sums[label_value] += line["p"]
            confidences[label_type] = max(label_confidence_sums.values()) if len(label_confidence_sums) > 0 else 0    

        trip_confidence = min(confidences.values())

        # if the inference is incomplete, add to the incomplete list and move to the next trip so it doesn't go in another list.
        if incomplete_inference:
            if incompletes < N_INCOMPLETES:
                incomplete_inference_list.append(ct_df['_id'].iloc[j])
                if trip_confidence > HIGH_CONFIDENCE_THRESHOLD_PRODUCTION:
                    conf_type = 'high'
                elif trip_confidence > LOW_CONFIDENCE_THRESHOLD_PRODUCTION:
                    conf_type = 'mid'
                else:
                    conf_type = 'low'
                print(f'Trip confidence for this incomplete inference tuple is: {conf_type}')

            incompletes += 1
            continue

        if trip_confidence > HIGH_CONFIDENCE_THRESHOLD_PRODUCTION:
            if highs < HIGHS_PER_USER:
                high_list.append(ct_df['_id'].iloc[j])
            highs += 1            
        elif trip_confidence > LOW_CONFIDENCE_THRESHOLD_PRODUCTION:
            if mids < MIDS_PER_USER:
                mid_list.append(ct_df['_id'].iloc[j])
            mids += 1
        
        else:
            if lows < LOWS_PER_USER:
                low_list.append(ct_df['_id'].iloc[j])
            lows += 1

    # Use the dataframe length as the trip count; j is only the last row's index label
    print(f'{len(ct_df)} trips for this user. low/mid/high/labeled: {lows}/{mids}/{highs}/{labeled_count}')
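
The per-trip confidence rule used in the loop above can be isolated as a small function. This is a sketch under stated assumptions: `trip_confidence`, the category names, and the sample inference are hypothetical stand-ins for `LABEL_CATEGORIES` and real `inferred_labels` data.

```python
def trip_confidence(inference, label_categories):
    """Return (confidence, incomplete) for one trip's inference tuples.

    For each label category, sum `p` per label value and keep the largest sum;
    the trip's confidence is the smallest of those per-category maxima.
    """
    incomplete = False
    per_category = {}
    for category in label_categories:
        sums = {}
        for line in inference:
            if category not in line["labels"]:
                incomplete = True  # this tuple does not cover every category
                continue
            value = line["labels"][category]
            sums[value] = sums.get(value, 0) + line["p"]
        per_category[category] = max(sums.values()) if sums else 0
    return min(per_category.values()), incomplete

# Hypothetical categories and inference, for illustration only
categories = ["mode_confirm", "purpose_confirm"]
inference = [
    {"labels": {"mode_confirm": "bike", "purpose_confirm": "work"}, "p": 0.5},
    {"labels": {"mode_confirm": "bike", "purpose_confirm": "shopping"}, "p": 0.25},
    {"labels": {"mode_confirm": "walk", "purpose_confirm": "shopping"}, "p": 0.25},
]
print(trip_confidence(inference, categories))
```

In this example the mode is fairly certain (0.75 for one value) but the purpose is split evenly, so the least confident category determines the trip's confidence.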

In [None]:
# Get the numbers of each type of trip that we have. 
# Compare these with the numbers when we run on the notebook on only the trips in full_list.
# add if using analysis validation db:
#if USING_ANALYSIS_VALIDATION_DB:
full_list = low_list + mid_list + high_list + labeled_list + incomplete_inference_list
print(f"list lengths low/mid/high/labeled/incomplete: {len(low_list)}/{len(mid_list)}/{len(high_list)}/{len(labeled_list)}/{len(incomplete_inference_list)}")
print(f'The incomplete inference trips in mini-db: {incomplete_inference_list}')
print(f'length of full list: {len(full_list)}. Number of unlabeled in full list: {len(full_list) - len(labeled_list)}')

after_count = {"user": ('all,high,mid,low,incomplete,labeled')}
after_users = []
for u in filter1_users:
    lts = stats["time"][(stats["time"]["user_id"] == u) & (stats["time"]["name"] == "label_tab_switch")]
    if len(lts) > 0:
        after_users.append(u)
        lts = lts.copy()
        lts.sort_values("ts", ascending=True, inplace=True)
        this_after_start = arrow.get(lts.iloc[0]["ts"])   
        user_after_start[u] = this_after_start
        ct_df = confirmed_trip_df_map[u]
        ct_sub = ct_df[ct_df['_id'].isin(full_list)]

        n_after = ct_sub[(ct_sub["end_arrow"] >= user_after_start[u])].shape[0]
        n_high_after = ct_sub[(ct_sub["end_arrow"] >= user_after_start[u]) & ct_sub["_id"].isin(high_list)].shape[0]
        n_mid_after = ct_sub[(ct_sub["end_arrow"] >= user_after_start[u]) & ct_sub["_id"].isin(mid_list)].shape[0]
        n_low_after = n_after - n_high_after - n_mid_after   # since I know my incomplete inferences are low confidence
        n_incomplete_after = ct_sub[(ct_sub["end_arrow"] >= user_after_start[u]) & ct_sub["_id"].isin(incomplete_inference_list)].shape[0]
        n_labeled_after = ct_sub[(ct_sub["end_arrow"] >= user_after_start[u]) & ct_sub["_id"].isin(labeled_list)].shape[0]
        after_count[u] = [n_after,n_high_after,n_mid_after, n_low_after, n_incomplete_after,n_labeled_after]

print("Users that will have \'after\' trips in the mini-db?")
print(after_count.values())

In [None]:
i = 0
for u in filter1_users:
    ct_df = confirmed_trip_df_map[u]
    ct_sub = ct_df[ct_df['_id'].isin(full_list)]

    for _,row in ct_sub.iterrows():
        if is_unlabeled(row):
            i += 1

print(i)

In [None]:
#if USING_ANALYSIS_VALIDATION_DB:
after_count_df = pd.DataFrame(after_count)
totals = []

for _, row in after_count_df.iterrows():
    totals.append(sum(row[1:len(row)]))
print('Totals after: All,high,mid,low,incomplete,labeled')
print(totals)

In [None]:
# Write the query to a text file or print it
# You also need to include Stage_timeseries and Stage_uuids to test properly. 
# The matching uses Stage_timeseries in get_user_input_for_trip_object(ts, trip_obj, "manual/"+label)
if USING_ANALYSIS_VALIDATION_DB:
    full_list = low_list + mid_list + high_list + labeled_list + incomplete_inference_list
    trip_query_file = open("trip_query.txt","w")
    trip_query_file.write("docker exec analysis-validation-db sh -c 'mongodump --archive --db=Stage_database --collection=Stage_analysis_timeseries --query=\"")
    trip_query_file.write("{\\\"_id\\\":{\\\"\\$in\\\":[") 

    # should look something like:
    #query="{\"_id\": {\"\$in\": [{\"\$oid\":\"<objectId>\"}]} }"' > db_testing.dump

    for k in range(0,len(full_list)):
        if k != len(full_list)-1:
            trip_query_file.write(f"{{\\\"\\$oid\\\":\\\"{str(full_list[k])}\\\"}},")
        else:
            trip_query_file.write(f"{{\\\"\\$oid\\\":\\\"{str(full_list[k])}\\\"}} ] }} }}\"\' > db_trip_queried.dump")
            #trip_query_file.write(f"ObjectId(\\\"{str(full_list[k])}\\\")]}} }}\" \' > db_trip_queried.dump")
    trip_query_file.close()
    #print(full_list)

## Results from Dataset 2!
To review, we want:
 * Number of participants
 * Frequency of app opens
 * Taps actually avoided by verify button
 * Taps actually avoided by hiding high confidence trips
 * Overall taps avoided (total, per trip, percentage of taps)
 * Fraction of users who used the verify button
 * Fraction of trips finalized using the verify button
 * How much To Label is used vs. other tabs of Label vs. Diary

### Number of participants

In [None]:
print(f"NUMBER OF USERS IN DATASET 2: {len(filtered_users)}")
print("Breakdown by program: "+str({k: len([u for u in programs[k] if u in filtered_users]) for k in programs}))

### Frequency of app opens

In [None]:
def compute_frequencies(users):
    days_before = {}
    days_after = {}
    opens_before = {}
    opens_after = {}
    opens_per_day_before = {}
    opens_per_day_after = {}
    
    for u in users:
        ct_df = confirmed_trip_df_map[u]
        this_before_start = user_before_start[u]
        this_before_end = user_after_start[u]
        this_after_start = user_after_start[u]
        this_after_end = AFTER_END

        days_before[u] = delta2days(this_before_end-this_before_start)
        days_after[u] = delta2days(this_after_end-this_after_start)

        opens = stats["nav"][(stats["nav"]["user_id"] == u) & (stats["nav"]["name"] == "opened_app")].copy()
        opens["ts_arrow"] = opens["ts"].apply(arrow.get)
        opens_before[u] = filter_between(opens, "ts_arrow", this_before_start, this_before_end).shape[0]
        opens_after[u] = filter_between(opens, "ts_arrow", this_after_start, this_after_end).shape[0]

        opens_per_day_before[u] = opens_before[u]/days_before[u]
        opens_per_day_after[u] = opens_after[u]/days_after[u]
    
    print("Everybody in Dataset 2:")
    output_results(users, opens_per_day_before, opens_per_day_after, opens_before, opens_after)
    
    opens_filtered_users = [u for u in users if opens_before[u] >= REQUIRED_OPENS_BEFORE and opens_after[u] >= REQUIRED_OPENS_AFTER and opens_before[u]+opens_after[u] >= REQUIRED_OPENS_TOTAL]
    print(f"\nOnly those with >={REQUIRED_OPENS_TOTAL} opens total, >={REQUIRED_OPENS_BEFORE} opens Before, >={REQUIRED_OPENS_AFTER} opens After:")
    output_results(opens_filtered_users, opens_per_day_before, opens_per_day_after, opens_before, opens_after)
    
def output_results(users, opens_per_day_before, opens_per_day_after, opens_before, opens_after, do_breakdown=True):
    opens_per_day_before, opens_per_day_after, opens_before, opens_after = map(lambda d: {k: d[k] for k in d if k in users}, [opens_per_day_before, opens_per_day_after, opens_before, opens_after])
    n_users = len(users)
    print("App opens per day before->after, total opens before+after:")
    print("SUM:")
    print(format_arrow_comma(sum(opens_per_day_before.values()), sum(opens_per_day_after.values()), sum(opens_before.values())+sum(opens_after.values())))
    print("AVERAGE:")
    if n_users > 0:
        print(format_arrow_comma(sum(opens_per_day_before.values())/n_users, sum(opens_per_day_after.values())/n_users, (sum(opens_before.values())+sum(opens_after.values()))/n_users))
    else: print("N/A")
    if not do_breakdown: return
    print("User breakdown:")
    if n_users > 0:
        for u in users:
            print(format_arrow_comma(opens_per_day_before[u], opens_per_day_after[u], opens_before[u]+opens_after[u]))
    else: print("N/A")
    
compute_frequencies(filtered_users)

In [None]:
if USING_ANALYSIS_VALIDATION_DB:
    from pymongo import MongoClient
    client = MongoClient() 
    db = client.Stage_database
    Stage_timeseries = db.Stage_timeseries

    # Make a list of objectIds
    # I counted 6 verify fully labels events and 2 label fully labels events
    event_list = []
    for u in sorted(filtered_users)[:4]:
        print('new user')
        verify_docs = Stage_timeseries.find({  
                    "$and": [ {"metadata.key":"stats/client_time"},{"user_id": u},{"data.name": "verify_trip"}],
                    }).limit(3)
        for doc in verify_docs:
            event_list.append(doc['_id'])
            #print(doc['data']['reading'])

        label_docs = Stage_timeseries.find({  
                "$and": [ {"metadata.key":"stats/client_time"},{"user_id": u},{"data.name": "select_label"}],
                }).limit(3)
        for doc in label_docs:
            event_list.append(doc['_id'])


    trip_query_file = open("label_and_verify_events_query.txt","w")
    trip_query_file.write("docker exec analysis-validation-db sh -c 'mongodump --archive --db=Stage_database --collection=Stage_timeseries --query=\"")
    trip_query_file.write("{\\\"_id\\\":{\\\"\\$in\\\":[") 

    # should look something like:
    #query="{\"_id\": {\"\$in\": [{\"\$oid\":\"<objectId>\"}]} }"' > db_testing.dump

    for k in range(0,len(event_list)):
        if k != len(event_list)-1:
            trip_query_file.write(f"{{\\\"\\$oid\\\":\\\"{str(event_list[k])}\\\"}},")
        else:
            trip_query_file.write(f"{{\\\"\\$oid\\\":\\\"{str(event_list[k])}\\\"}} ] }} }}\"\' > label_and_verify_events.dump")
    trip_query_file.close()

    # get the object ids, print them to a file in a query
    # Include all confirmed trips in this test db so you can get the correct users in filtered users
    # and uuids

### Taps actually avoided by verify button
### Taps actually avoided by hiding high confidence trips
### Overall taps avoided (total, per trip, percentage of taps)
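
The tap accounting used in this section can be illustrated with a small worked example. It is a sketch with made-up numbers: the value passed as `old_taps_per_trip` is an assumption for illustration (the notebook's `OLD_TAPS` is defined elsewhere), and `taps_avoided` here is a hypothetical helper. As in the analysis code, a verify press counts as one tap and a label selection as two.

```python
def taps_avoided(n_verify_events, n_label_events, n_trips_labeled, old_taps_per_trip):
    """Return (taps, avoided, avoided_per_trip) under the one-tap-verify, two-tap-label model."""
    taps = n_verify_events + 2 * n_label_events
    avoided = old_taps_per_trip * n_trips_labeled - taps
    # NaN when no trips were labeled, so the per-trip figure is never a division by zero
    per_trip = avoided / n_trips_labeled if n_trips_labeled else float("nan")
    return taps, avoided, per_trip

# Made-up numbers; old_taps_per_trip=6 is an assumed baseline, not the notebook's OLD_TAPS
taps, avoided, per_trip = taps_avoided(
    n_verify_events=6, n_label_events=4, n_trips_labeled=8, old_taps_per_trip=6
)
print(taps, avoided, per_trip)
```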

In [None]:
if USING_ANALYSIS_VALIDATION_DB:
    from pymongo import MongoClient
    client = MongoClient() 
    db = client.Stage_database
    Stage_timeseries = db.Stage_timeseries

# This may take a while -- clocked at between 10s and 30s on an early-2015 MacBook Air
# Whether or not a press of the verify button fully labels a trip
def verify_fully_labels(event):
    if not event["reading"]["verifiable"]: return False  # Forgot about this case until working with real pilot program data...
    user_input = json.loads(event["reading"]["userInput"])
    final_inference = json.loads(event["reading"]["finalInference"])

    # If the user has only labeled some of the labels AND the inference fills in the rest, 
    # then hitting the verify button fully labeled the trip. We make sets to get only unique inputs.
    return len(user_input) < len(LABEL_CATEGORIES) and len(set(user_input.keys()) | set(final_inference.keys())) == len(LABEL_CATEGORIES)

# Whether or not a given label dropdown menu selection fully labels a trip
def label_fully_labels(event):
    user_input = json.loads(event["reading"]["userInput"])

    # If the trip was not fully labeled before  
    # and the label the user is currently inputting is not an input that they have already made, return true
    return len(user_input) == len(LABEL_CATEGORIES)-1 and event["reading"]["inputKey"] not in event["reading"]["userInput"]

verifieds = {}
taps = {}
trips_labeled = {}
taps_avoided = {}
taps_avoided_per_trip = {}
verifiers = set()

vevents_total = 0
levents_total = 0
for u in filtered_users:
    trips_labeled[u] = 0
    verifieds[u] = 0
    # TODO: maybe only consider unlabeled-at-time-of-inference-generation trips here?
    # below is basically stats[time][user id is u & the event is verify trip]

    # if is_unlabeled() 

    verify_events = stats["time"][(stats["time"]["user_id"] == (u if not TESTINGMODE else gooduser)) & (stats["time"]["name"] == "verify_trip")]
    vevents_total += len(verify_events)
    if len(verify_events) > 0: verifiers.add(u)
    label_events = stats["time"][(stats["time"]["user_id"] == (u if not TESTINGMODE else gooduser)) & (stats["time"]["name"] == "select_label")]
    levents_total += len(label_events)
    taps[u] = len(verify_events)+2*len(label_events)
    # The testing user seems to have an unusually high number of mistaps. When crunching real data, we will let this count against taps saved,
    # but to have useful data to debug the sensitivity analysis that appears later, let's artificially assume that 1 in 10 label_events is a mistap and ignore those.
    if TESTINGMODE: taps[u] = len(verify_events)+(1-0.10)*(2*len(label_events))
    if verify_events.shape[0] > 0:
        for _, ve in verify_events.iterrows():
            if verify_fully_labels(ve):
                verifieds[u] += 1
                trips_labeled[u] += 1
    if label_events.shape[0] > 0:
        for _, le in label_events.iterrows():
            if label_fully_labels(le): trips_labeled[u] += 1
    taps_avoided[u] = OLD_TAPS*trips_labeled[u]-taps[u]
    taps_avoided_per_trip[u] = taps_avoided[u]/trips_labeled[u] if trips_labeled[u] != 0 else float("NaN")
    # print(f"User tapped {taps[u]} times, avoided {taps_avoided[u]} taps to label {trips_labeled[u]} trips")

    if USING_ANALYSIS_VALIDATION_DB:
        # Use pymongo to get the number of verify and label events for the current user
        # Check that the numbers match those found above
        # find returns a cursor object that has to be iterated through to use
        ver_events_mongo = Stage_timeseries.find({    
            "$and": [ {"metadata.key":"stats/client_time"},{"user_id": u},{"data.name": "verify_trip"}]
            })

        verify_count = 0
        mongo_trips_verified = 0
        mongo_trips_labeled = 0
        for doc in ver_events_mongo:
            verify_count += 1
            event = doc['data']

            # Reimplements verify_fully_labels
            if not event['reading']['verifiable']: continue
            user_input_m= json.loads(event['reading']['userInput'])  # m for mongo
            final_inference_m = json.loads(event['reading']['finalInference'])  
            if len(user_input_m) < len(LABEL_CATEGORIES) and len(set(user_input_m.keys()) | set(final_inference_m.keys())) == len(LABEL_CATEGORIES):
                mongo_trips_verified += 1
                mongo_trips_labeled += 1
        
        label_events_mongo = Stage_timeseries.find({
            "$and": [ {"metadata.key":"stats/client_time"},{"user_id": u},{"data.name": "select_label"}]
            })
        
        label_count = 0
        for doc in label_events_mongo:
            label_count += 1
            event = doc['data']

            # Reimplements label_fully_labels
            user_input_m = json.loads(event["reading"]["userInput"])
            if len(user_input_m) == len(LABEL_CATEGORIES)-1 and event["reading"]["inputKey"] not in event["reading"]["userInput"]:
                mongo_trips_labeled += 1

        assert verify_count == len(verify_events)
        assert label_count == len(label_events)
        assert mongo_trips_verified == verifieds[u]
        assert mongo_trips_labeled == trips_labeled[u]

if USING_LABEL_TEST_DB:
    assert sum(trips_labeled.values()) == 8
    assert sum(verifieds.values()) == 6

def print_tap_summary(users):
    total_taps = sum(fu(taps, users).values())
    total_taps_avoided = sum(fu(taps_avoided, users).values()) 
    total_trips_labeled = sum(fu(trips_labeled, users).values())
    avoided_per_labeled = total_taps_avoided/total_trips_labeled
    print(f"Overall, users tapped {total_taps} times to label {total_trips_labeled} trips.")
    print(f"Overall, {total_taps_avoided} taps were avoided, {avoided_per_labeled:.2f} per trip -- that's {avoided_per_labeled/OLD_TAPS:.2%} of taps")
    print(f"We also saved {OLD_TAPS*sum(fu(g_high_confidence_n_after_unlabeled, users).values())} taps across {sum(fu(g_high_confidence_n_after_unlabeled, users).values())} trips by not soliciting user input on very confident trips")

    total_taps_avoided_high = total_taps_avoided+OLD_TAPS*sum(fu(g_high_confidence_n_after_unlabeled, users).values())
    total_trips_labeled_high = total_trips_labeled+sum(fu(g_high_confidence_n_after_unlabeled, users).values())
    avoided_per_labeled_high = total_taps_avoided_high/total_trips_labeled_high
    print(f"If we also count the taps we avoided by not putting high-confidence inferences on the To Label screen:")
    print(f"Overall, {total_taps_avoided_high} taps were avoided across {total_trips_labeled+sum(fu(g_high_confidence_n_after_unlabeled, users).values())} trips, {avoided_per_labeled_high:.2f} per trip -- that's {avoided_per_labeled_high/OLD_TAPS:.2%} of taps")

print_tap_summary(filtered_users)
print(f"number of verify,label events: {vevents_total},{levents_total}")  # should be 495, 1022 for analysis-validation-db 
# (in mongo: db.Stage_timeseries.count({"data.name":"verify_trip"}),  db.Stage_timeseries.count({"data.name":"select_label"}))


In [None]:
# making sense of the 'fully labels' functions
event = stats['time'][(stats["time"]["user_id"] == u) & (stats["time"]["name"] == "verify_trip")].iloc[30]
user_input = json.loads(event["reading"]["userInput"])
final_inference = json.loads(event["reading"]["finalInference"])

user_input
#len(set(user_input.keys()) | set(final_inference.keys())) == 3
    #final_inference = json.loads(event["reading"]["finalInference"])
    #return len(user_input) < len(LABEL_CATEGORIES) and len(set(user_input.keys()) | set(final_inference.keys())) == len(LABEL_CATEGORIES)

In [None]:
stats['time'][(stats["time"]["user_id"] == u) & (stats["time"]["name"] == "select_label")].iloc[20]
#user_input = json.loads(event["reading"]["userInput"])

### Fraction of users who used the verify button

In [None]:
# For each program, prints number of filtered users who used the verify button over total number of filtered users
print("Users who have used the verify button at least once: "+str({k: str(len([u for u in programs[k] if u in filtered_users and u in verifiers]))+"/"+str(len([u for u in programs[k] if u in filtered_users])) for k in programs}))

### Fraction of trips finalized using the verify button

In [None]:
print("Number of trips finalized using the verify button as a fraction of total number of user-confirmed trips:\n"+str({k: format_frac_percent(sum([verifieds[u] for u in programs[k] if u in filtered_users]), sum([trips_labeled[u] for u in programs[k] if u in filtered_users])) for k in programs}))

### How much To Label is used vs. other tabs of Label vs. Diary

First, let's (re)acquaint ourselves with what the instrumentation data can provide.

In [None]:
def explore_instrumentation(u):
    print(stats.keys())

    nav_stats = stats["nav"][(stats["nav"]["user_id"] == u) & (stats["nav"]["name"] != "sync_launched")].copy()
    # print(nav_stats.shape)
    # print(nav_stats.keys())
    # print(nav_stats.head()[["name","reading","ts"]])
    # print(nav_stats[["name","reading","ts"]].to_csv())

    time_stats = stats["time"][(stats["time"]["user_id"] == u) & (stats["time"]["name"] != "push_duration") & (stats["time"]["name"] != "pull_duration") & (stats["time"]["name"] != "sync_duration")].copy()
    # print(time_stats.shape)
    # print(time_stats.keys())
    # print(time_stats.head()[["name","reading","ts"]])
    # print(time_stats[["name","reading","ts"]].to_csv())
    # print(time_stats[time_stats["name"] == "label_tab_switch"][["name","reading","ts"]].to_csv())


explore_instrumentation(filtered_users[1])

Based on the results of this brief investigation, it seems that it would be hard to measure time spent on To Label vs. the other screens because the stats I have don't seem to be accurately monitoring when the app stops being used. This should be revisited in the future, though -- there are probably some other stats elsewhere that could help.

## Graphs

In [None]:
set_log_level(logging.WARNING)
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates

### Weekly labeling percentage over time for each program

In [None]:
# This may take a while -- clocked at 18s on an early-2015 MacBook Air
# This cell rather clumsily calculates per-week labeling statistics for each user and each program.
# TODO could use some major refactoring

def filter_before_or_to_label(df, user):
    # empty = (df["end_arrow"] < user_after_start[user]) & (df["end_arrow"] > user_after_start[user])
    # (df["end_arrow"] < user_after_start[user])
    if user not in filter2_users: return df  # Everything is before update if you haven't installed the update! TODO filter2_users was not meant to be referred to so very globally

    # Get the data before the after start for this user or where the expectation is to_label?
    return df[(df["end_arrow"] < user_after_start[user]) | (df["expectation"].apply(lambda val: "to_label" in val and val["to_label"]))]

# This could almost certainly be done more efficiently, but it's not worth worrying about right now
def count_weekly(users):
    total_count_weekly = {}
    labeled_count_weekly = {}
    labeling_frac_weekly = {}
    total_count_weekly_u = {}
    labeled_count_weekly_u = {}
    labeling_frac_weekly_u = {}
    for u in users:
        total_count_weekly_u[u] = {}
        labeled_count_weekly_u[u] = {}
        labeling_frac_weekly_u[u] = {}
    for program in programs.keys():
        total_count_weekly[program] = {}
        labeled_count_weekly[program] = {}
        labeling_frac_weekly[program] = {}
        for week in weeks:
            this_total_count = 0
            this_labeled_count = 0
            for u in users:
                if u in programs[program]:
                    if program == "stage": print(u)  # debug output; compare the loop variable, not the programs dict
                    total = filter_between(confirmed_trip_df_map[u], "end_arrow", *week)
                    total = filter_before_or_to_label(total, u)
                    fully_labeled = total[(total["user_input"].apply(len) == len(LABEL_CATEGORIES))]    
                        # does this account for the cases where the user used the verify button? does user_input get populated after that?
                    this_total_count += total.shape[0]
                    this_labeled_count += fully_labeled.shape[0]

                    total_count_weekly_u[u][week] = total.shape[0]
                    labeled_count_weekly_u[u][week] = fully_labeled.shape[0]
                    labeling_frac_weekly_u[u][week] = labeled_count_weekly_u[u][week]/total_count_weekly_u[u][week] if total_count_weekly_u[u][week] != 0 else float("nan")
            total_count_weekly[program][week] = this_total_count
            labeled_count_weekly[program][week] = this_labeled_count
            labeling_frac_weekly[program][week] = this_labeled_count/this_total_count if this_total_count != 0 else float("nan")
    return total_count_weekly, labeled_count_weekly, labeling_frac_weekly, total_count_weekly_u, labeled_count_weekly_u, labeling_frac_weekly_u

total_count_weekly, labeled_count_weekly, labeling_frac_weekly, _, _, _ = count_weekly(filtered_users)
total_count_weekly_ds1, labeled_count_weekly_ds1, labeling_frac_weekly_ds1, total_count_weekly_u, labeled_count_weekly_u, labeling_frac_weekly_u = count_weekly(server_filtered_users)

In [None]:
from pymongo import MongoClient
client = MongoClient() 
db = client.Stage_database

sat = db.Stage_analysis_timeseries  ## how do I get certain times?
# maybe to get data from certain weeks I could reuse Gabriel's code

In [None]:
sat.find_one({"metadata.key":"analysis/confirmed_trip"})['data'].keys()  # we want end_ts

In [None]:
# how about this: figure out the weeks you need, then for each trip, if the end_arrow is in the week you want, add the object Id to a list
weeks

In [None]:
# Getting some vail trips
# Let's do week indices 16, 17, and 22
for u in sorted(programs['only_after'])[:2]:
    i = 0
    have_started = False
    for week in weeks:
        total = filter_between(confirmed_trip_df_map[u], "end_arrow", *week)
        total = filter_before_or_to_label(total, u)
        if (total.shape[0] > 0):
            have_started = True
            print(f'week index: {i} nrows: {total.shape[0]}')

        # Save some trip object ids
        if have_started:
            print(total['_id'].iloc[0])
            print(f"length of user input for this trip: {total['user_input'].apply(len).iloc[0]}")
        i +=1

In [None]:
# Do the same for a before_after site. I picked
for u in sorted(programs['before_after'])[:2]:
    i = 0
    have_started = False
    for week in weeks:
        total = filter_between(confirmed_trip_df_map[u], "end_arrow", *week)
        total = filter_before_or_to_label(total, u)
        if (total.shape[0] > 0):
            have_started = True
            print(f'week index: {i} nrows: {total.shape[0]}')

        # Save some trip object ids
        if have_started:
            print(total['_id'].iloc[0])
            print(f"length of user input for this trip: {total['user_input'].apply(len).iloc[0]}")
        i +=1

In [None]:
weeks[23]

In [None]:
data = {'a':[1,2,3],'b': [4,5,6]}
df = pd.DataFrame(data)
df

In [None]:
# Count how many of our users have installed the update at the end of a given week
def count_users_weekly(users):
    updated_users_weekly = {}
    denom = {}
    for program in programs:
        updated_users_weekly[program] = {}
        denom[program] = {}
        for week in weeks:
            updated_users_weekly[program][week] = 0
            denom[program][week] = 0
            for u in users:
                if u in programs[program]:
                    if filter_between(confirmed_trip_df_map[u], "end_arrow", *week).shape[0] > 0:
                        denom[program][week] += 1
                        if u in user_after_start and user_after_start[u] < week[1]: updated_users_weekly[program][week] += 1
        
        for week in weeks:
            n = updated_users_weekly[program][week]
            # labeled = labeling_frac_weekly[program][week] #NaN-ify points where there were no labeled trips
            updated_users_weekly[program][week] = n/denom[program][week] if denom[program][week] != 0 else float("NaN")
    return updated_users_weekly

updated_users_weekly = count_users_weekly(filtered_users)
updated_users_weekly_ds1 = count_users_weekly(server_filtered_users)

In [None]:
# Count the number of days users have used the app per week
def count_days_per_week(users):
    days_per_week_u = {u: {} for u in users}
    for u in users:
        time_stats = stats["time"][(stats["time"]["user_id"] == u) & (stats["time"]["name"] != "push_duration") & (stats["time"]["name"] != "pull_duration") & (stats["time"]["name"] != "sync_duration")]
        nav_stats = stats["nav"][(stats["nav"]["user_id"] == u) & (stats["nav"]["name"] != "sync_launched")].copy()
        # print(pd.unique(nav_stats["name"]))
        switches = nav_stats[nav_stats["name"] == "opened_app"].copy()
        switches["ts_arrow"] = switches["ts"].apply(lambda ts: arrow.get(ts).to(MY_TZ))
        for week in weeks:
            these_switches = filter_between(switches, "ts_arrow", *week)
            days_per_week_u[u][week] = pd.unique(these_switches["ts_arrow"].apply(lambda ts: (ts-BEFORE_START).days)).shape[0]
        # print(switches["ts_arrow"])
    return days_per_week_u

days_per_week_u = count_days_per_week(server_filtered_users)
# days_per_week_u = count_days_per_week([filtered_users[5]])

In [None]:
week_labels = [week[0].datetime for week in weeks]  # f"{week[0].month}/{week[0].day}"

def setup_weekly_axis(ax):
    ax.xaxis_date()
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%m/%d"))
    ax.set_xticks(week_labels)
    ax.yaxis.set_major_formatter(mtick.PercentFormatter(1))
    plt.setp(ax.get_xticklabels(), visible=True)
    ax.tick_params(labelbottom=True)
    ax.set_ylim([0, 1])

fig, axs = plt.subplots(2, figsize=(15,8), gridspec_kw = {"height_ratios": [1, 3]}, sharex=True)
for ax in axs:
    setup_weekly_axis(ax)

for program in programs_some:
    n = len([u for u in programs[program] if u in filtered_users])
    if n == 0: continue
    label = f"{program if program != 'ens' else 'ensemble'} (n={n})"
    lines = []
    lines.append(axs[1].plot(week_labels, list(labeling_frac_weekly[program].values()), label=label, zorder = 1 if program == "ens" else 0)[0])
    lines.append(axs[0].plot(week_labels, list(updated_users_weekly[program].values()), label=label, zorder = 1 if program == "ens" else 0)[0])
    for line in lines:
        if program == "ens":
            line.set_color("black")
            line.set_linewidth(3)

axs[0].set_ylabel("% installed update")
axs[1].set_ylabel("% of expected trips labeled")

for ax in axs:
    ax.legend()

So that's not entirely encouraging, but it's also kind of hard to tell what's going on. Let's try something else:

### Weekly labeling but it's Vail and Pueblo County vs. all the others
(and we skip the first week and only do four weeks after that)

In [None]:
# Indices of the weeks we care about. TODO make this not dependent on starting date!!!
first_start = 7
first_end = 11
second_start = 19
second_end = 23
for i in (first_start, first_end, second_start, second_end): print(weeks[i])

fig, axs = plt.subplots(2, figsize=(15,8), gridspec_kw = {"height_ratios": [1, 3]}, sharex=True)
for ax in axs:
    setup_weekly_axis(ax)

for program in programs_all:
    if program == "ens": continue
    n = len([u for u in programs[program] if u in server_filtered_users])
    if n == 0: continue

    if program == "before_after": label = "Before/after ensemble"
    elif program == "only_after": label = "Only after ensemble"
    else: label = program
    label += f" n={n}"

    y1 = list(labeling_frac_weekly_ds1[program].values())
    y2 = list(updated_users_weekly_ds1[program].values())
    not_in_first = lambda x: x < first_start or x >= first_end
    not_in_second = lambda x: x < second_start or x >= second_end
    for i in range(len(weeks)):  # NaN out the data we don't care about
        if program in ["only_after", *only_after_programs]:
            if not_in_second(i): y1[i] = y2[i] = float("NaN")
        else:
            if not_in_first(i) and not_in_second(i): y1[i] = y2[i] = float("NaN")
    lines = []
    lines.append(axs[1].plot(week_labels, y1, label=label, zorder = 1 if program in ("before_after", "only_after") else 0)[0])
    lines.append(axs[0].plot(week_labels, y2, label=label, zorder = 1 if program in ("before_after", "only_after") else 0)[0])
    for line in lines:
        if program == "before_after":
            line.set_color("black")
            line.set_linewidth(3)
        elif program == "only_after":
            line.set_color("black")
            line.set_linewidth(3)
            line.set_linestyle("dashed")


axs[0].set_ylabel("% installed update")
axs[1].set_ylabel("% of expected trips labeled")

axs[1].legend(loc="center")


Let's do a second version of that with three sections:
  1. Before/after group when they've just started (no update)
  2. Before/after group just after they install the update (meaning calculate an offset for each user) (100% update)
  3. Only after group when they've just started (100% update)

and then I suppose we could have a fourth plot with the three ensembles from that.

Then, let's use the same type of graph to figure out whether the update has any effect on labeling cadence. If so, that might be a sign of reduction (or increase) in burden.
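To see the per-user offset idea in isolation, here is a minimal sketch that uses only the standard library. The names `average_from_offset`, `weekly_values`, and `start_week` are made up for illustration, and the data is invented; the sketch only assumes each user has a list of weekly values and a known starting week. Missing weeks are skipped when averaging, in the spirit of `np.nanmean`.

```python
import math

def average_from_offset(weekly_values, start_week, n_weeks):
    """Average across users, counting weeks from each user's own start week.

    weekly_values: {user: [value for week 0, week 1, ...]} (hypothetical layout)
    start_week:    {user: index of the week in which the event happened}
    Weeks past the end of a user's series, or NaN values, count as missing.
    """
    averages = []
    for offset in range(n_weeks):
        observed = []
        for user, series in weekly_values.items():
            idx = start_week[user] + offset
            if idx < len(series) and not math.isnan(series[idx]):
                observed.append(series[idx])
        # Ignore missing values; report NaN if no user has data at this offset
        averages.append(sum(observed) / len(observed) if observed else float("nan"))
    return averages

# Invented data: user_a starts in week 1, user_b in week 2
weekly_values = {
    "user_a": [0.0, 0.8, 0.6, 0.4],
    "user_b": [0.0, 0.0, 1.0, 0.5],
}
start_week = {"user_a": 1, "user_b": 2}
aligned = average_from_offset(weekly_values, start_week, 3)
print(aligned)
```

In the last offset only `user_a` still has data, so the average there rests on a single user. That is worth keeping in mind when reading the right-hand end of these plots.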

In [None]:
week_of_before_start = {}  # Index of week that user installed the app to begin with
for u in server_filtered_users:
    for i, week in enumerate(weeks):
        if user_before_start[u] >= week[0] and user_before_start[u] < week[1]:
            week_of_before_start[u] = i
            break

week_of_start = {}  # Index of week that user installed the update
for u in user_after_start:
    for i, week in enumerate(weeks):
        if user_after_start[u] >= week[0] and user_after_start[u] < week[1]:
            assert u not in week_of_start
            week_of_start[u] = i
# print(week_of_start)

def weekly_average_per_user_offset(tseries, users, start_func, n_weeks, valid_func = lambda user, week: True):
    augmented_weeks = weeks+[None]*n_weeks  # Account for overflow
    per_user = []  # This could all be written as a massive comprehension; wouldn't that be fun.
    for user in users:
        start = start_func(user)
        user_weeks = [augmented_weeks[i] if valid_func(user, i) else None for i in range(start, start+n_weeks)]
        user_results = [tseries[user][week] if week is not None else float("nan") for week in user_weeks]
        assert len(user_results) == n_weeks, len(user_results)
        per_user.append(user_results)
    return np.nanmean(per_user, axis=0)

# print(weekly_average_per_user_offset(labeling_frac_weekly_u, filtered_users, lambda u: 5, 4))

def style_lines(program, lines):
    for line in lines:
        if program == "before_after":
            line.set_color("black")
            line.set_linewidth(3)
        elif program == "only_after":
            line.set_color("black")
            line.set_linewidth(3)
            # line.set_linestyle("dashed")

def plot_labeling_breakdown_v2(tseries, y_percent):
    labels = {program: "ensemble" if program in ("before_after", "only_after") else program for program in programs_all}
    fig, axs = plt.subplots(1, 4, figsize=(22,8))
    if y_percent: plt.subplots_adjust(wspace=0.3)
    x = [1.5, 2.5, 3.5, 4.5]
    axs[0].set_title("Before/after group at initial installation")
    axs[1].set_title("Before/after group at updating")
    axs[2].set_title("Only after group at installation")
    axs[3].set_title("Inter-scenario comparison")
    for ax in axs:
        ax.xaxis.set_major_locator(plt.MultipleLocator(1))
        ax.set_xlabel("Weeks after event")
        ax.set_xlim([1, 5])
        if (y_percent):
            ax.yaxis.set_major_formatter(mtick.PercentFormatter(1))
            ax.set_ylim([0, 1])
            ax.set_ylabel("% of expected trips labeled")
        else:
            ax.set_ylim([0, 7])
            ax.yaxis.set_major_locator(plt.MultipleLocator(1))
            ax.set_ylabel("Number of days per week users open app")
    
    ensembles = []

    # Actual plotting #1
    lines = []
    for program in programs_all:
        if program in ["ens", "stage", "only_after", *only_after_programs]: continue
        users = [u for u in programs[program] if u in server_filtered_users and u in tseries and u in week_of_before_start]
        if len(users) == 0:
            print(f"No before users for program: {program}")
            continue
        y = weekly_average_per_user_offset(tseries, users, lambda u: week_of_before_start[u], 4, valid_func = lambda u, week: u not in week_of_start or week < week_of_start[u])
        if program == "before_after": ensembles.append(y)
        lines.append(axs[0].plot(x, y, label=labels[program]+f" n={len(users)}", zorder = 1 if program == "before_after" else 0)[0])
        style_lines(program, lines[-1:])  # Style only the line just added; `lines` accumulates across programs
    
    # Actual plotting #2
    lines = []
    for program in programs_all:
        if program in ["ens", "stage", "only_after", *only_after_programs]: continue
        users = [u for u in programs[program] if u in week_of_start]
        if len(users) == 0:
            print(f"No after users for program: {program}")
            continue
        y = weekly_average_per_user_offset(tseries, users, lambda u: week_of_start[u], 4)
        if program == "before_after": ensembles.append(y)
        lines.append(axs[1].plot(x, y, label=labels[program]+f" n={len(users)}", zorder = 1 if program == "before_after" else 0)[0])
        style_lines(program, lines[-1:])  # Style only the line just added; `lines` accumulates across programs

    # Actual plotting #3
    lines = []
    for program in programs_all:
        if program not in ["only_after", *only_after_programs]: continue
        users = [u for u in programs[program] if u in week_of_start]
        if len(users) == 0:
            print(f"No after users for program: {program}")
            continue
        y = weekly_average_per_user_offset(tseries, users, lambda u: week_of_start[u], 4)
        if program == "only_after": ensembles.append(y)
        lines.append(axs[2].plot(x, y, label=labels[program]+f" n={len(users)}", zorder = 1 if program == "only_after" else 0)[0])
        style_lines(program, lines[-1:])  # Style only the line just added; `lines` accumulates across programs
    
    # Actual plotting #4
    for i in range(len(ensembles)):
        axs[3].plot(x, ensembles[i], label=axs[i].get_title(), linewidth=3)

    for ax in axs:
        ax.legend(loc = ("lower left" if y_percent else "best"))
    
plot_labeling_breakdown_v2(labeling_frac_weekly_u, True)
plot_labeling_breakdown_v2(days_per_week_u, False)


Let's do a histogram of expectation confidences. To display the data usefully, we have to cap the y-axis, which clips some bars (e.g., the first), so we annotate the graph with the true heights of the clipped bars.
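As a rough illustration of the clipping logic, the sketch below bins values by hand and reports which bins exceed a y-axis cap. It is not the plotting code itself: `clipped_bins` is a hypothetical helper, the confidences are invented, and the simple floor binning may place a value that sits exactly on a bin edge differently than `np.histogram` would.

```python
def clipped_bins(values, n_bins, max_y):
    """Bin values from [0, 1] into n_bins equal bins; return the bins taller than max_y.

    Each result is (left edge, right edge, count).
    """
    counts = [0] * n_bins
    for v in values:
        # The last bin is closed on the right, so 1.0 belongs to bin n_bins - 1
        i = min(int(v * n_bins), n_bins - 1)
        counts[i] += 1
    return [(i / n_bins, (i + 1) / n_bins, c) for i, c in enumerate(counts) if c > max_y]

# Invented confidences: a spike at 0, a spike at 1, and a few values in between
confidences = [0.0] * 5 + [0.05] * 2 + [0.5] + [1.0] * 4
overflows = clipped_bins(confidences, n_bins=10, max_y=3)
for left, right, count in overflows:
    print(f"bin from {left:.0%} to {right:.0%} holds {count} trips, more than the cap")
```

The annotation step then only needs the true count for each clipped bin, which is what the overflow note in the next cell prints.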

In [None]:
n_bins = 100
# y_step = 250
# allowed_overflow = 1
max_y = 1700  # TODO determine this algorithmically (for now, tune it manually per dataset)
if TESTINGMODE: max_y = 20

if len(all_high_confidence_n_after_unlabeled['ens']) > 0:
    all_confidences = complete_results["ens"]["all_confidences"].copy()
else:  # if there's nothing in ensemble, we are probably working with stage data only
    all_confidences = complete_results["stage"]["all_confidences"].copy()


all_confidences.sort()
# from collections import Counter
# conf_count = Counter(all_confidences)
# print(sorted(conf_count.values(), reverse=True)[:4])
# max_y = sorted(conf_count.values(), reverse=True)[allowed_overflow]  # fails to take into account binning
# print(max_y)
# max_y = np.ceil(max_y/y_step)*y_step
# print(max_y)

bins = np.arange(0, 1+0.1/n_bins, 1/n_bins)
assert len(bins) == n_bins+1

range1 = [x for x in all_confidences if x <= LOW_CONFIDENCE_THRESHOLD_PRODUCTION]
range2 = [x for x in all_confidences if x > LOW_CONFIDENCE_THRESHOLD_PRODUCTION and x <= HIGH_CONFIDENCE_THRESHOLD_PRODUCTION]
range3 = [x for x in all_confidences if x > HIGH_CONFIDENCE_THRESHOLD_PRODUCTION]

fig, ax = plt.subplots(1, figsize=(15,8))
p1 = plt.hist(range1, bins=bins, label=f"User input required, 1 or more red labels (total of {len(range1)} trips)")
p2 = plt.hist(range2, bins=bins, label=f"User input required, all yellow labels (total of {len(range2)} trips)")
p3 = plt.hist(range3, bins=bins, label=f"No user input required (total of {len(range3)} trips)")
ax.set_ylim([0, max_y])
ax.set_xlim([0, 1])
ax.xaxis.set_major_formatter(mtick.PercentFormatter(1))
ax.xaxis.set_major_locator(plt.MultipleLocator(0.1))
ax.yaxis.set_major_locator(plt.MultipleLocator(250))
if TESTINGMODE: ax.yaxis.set_major_locator(plt.MultipleLocator(5))

ax.set_title("Histogram of trip confidence, segmented by presentation to user")
ax.set_ylabel("Number of trips")
ax.set_xlabel("Confidence level of final inference")
ax.legend()
ax.grid(True)

# Mark overflows on the graph
for p in (p1, p2, p3):
    for i, bar in enumerate(zip(*p)):
        rect = bar[2]
        height = int(rect.get_height())
        if height > max_y:
            ax.text(rect.get_x()+rect.get_width()*1.1, max_y*0.99, f"⬆{height}", ha="left", va="top")
            # The last bar has index len(bins)-2 and is the only one whose interval is closed on the right
            print(f"Overflow note: the bar for the region [{bins[i]:.1%}, {bins[i+1]:.1%}{']' if i == len(bins)-2 else ')'} contains {height} items.")

Debugging break!

In [None]:
# There was a bug in the confidence algorithm; I used this code to figure it out.
def explore_confidences(u):
    ct_df = confirmed_trip_df_map[u]
    print(ct_df.shape)
    # """
    for i, trip in ct_df.iterrows():
        if i != 179: continue
        inference = trip["inferred_labels"]
        if len(inference) > 0:
            for option in inference:
                print(f"{len(option['labels'])}, {option['p']:.2f}", end="; ")
            print(i)

        confidences = {}
        for label_type in LABEL_CATEGORIES:
            print(label_type)
            counter = {}
            for line in inference:
                if label_type not in line["labels"]: continue
                val = line["labels"][label_type]
                if val not in counter: counter[val] = 0
                counter[val] += line["p"]
            print(counter)
            confidences[label_type] = max(counter.values())  # THIS WAS SUM BEFORE!!! THAT WAS THE PROBLEM!!!
        print("CONFIDENCES:")
        print(confidences)
        trip_confidence = min(confidences.values())
        print(trip_confidence)
    # """
    # print(ct_df.iloc[338]["inferred_labels"])
    print(ct_df.iloc[179]["inferred_labels"])

# explore_confidences(filtered_users[1])

Let's do the weekly labeling thing but per-user, and let's line things up so everyone installs the update at the same point on the graph. We do this first as a scatterplot; then, we see if it might be clearer as a bar graph displaying three-week averages.
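The alignment step can be sketched on its own as follows. This is only an illustration with invented data and hypothetical names (`align_to_event`, `symmetric_offsets`, `update_week`): each user's weeks are re-keyed relative to the week they updated, and the x-axis is then trimmed to a symmetric range around zero.

```python
def align_to_event(series, event_week):
    """Re-key a weekly list so that offset 0 is the week of the event."""
    return {i - event_week: value for i, value in enumerate(series)}

def symmetric_offsets(aligned_by_user):
    """Offsets -k..k, where k is the smaller of the widest reach before and after the event."""
    lowest = min(min(a) for a in aligned_by_user.values())
    highest = max(max(a) for a in aligned_by_user.values())
    k = min(-lowest, highest)
    return list(range(-k, k + 1))

# Invented data: six weeks of labeling fractions and the week each user updated
series = {
    "user_a": [0.2, 0.4, 0.9, 0.8, 0.7, 0.6],
    "user_b": [0.5, 0.5, 0.4, 0.3, 0.9, 1.0],
}
update_week = {"user_a": 2, "user_b": 4}

aligned = {u: align_to_event(series[u], update_week[u]) for u in series}
offsets = symmetric_offsets(aligned)
# Users without data at an offset get NaN so they can be skipped when averaging
rows = {u: [aligned[u].get(o, float("nan")) for o in offsets] for u in aligned}
print(offsets)
print(rows)
```

After alignment, both users' update week sits at offset 0, so their values in that column can be compared directly even though the updates happened in different calendar weeks.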

In [None]:
# Apologies for the rather awful code to follow. TODO at some point this should be cleaned up.
aligned_labeling_frac_u = {}
for u in week_of_start.keys():  # It's possible not everyone in filtered_users has a week_of_start within the weeks we're graphing, in which case drop that user
    aligned_labeling_frac_u[u] = {}
    for i, week in enumerate(weeks):
        aligned_labeling_frac_u[u][i-week_of_start[u]] = labeling_frac_weekly_u[u][week]
# aligned_labeling_frac_u = {1: {-8: 0, 0: 0}, 2: {-6: 0, 0: 0}, 3: {0: 0, 0: 0}, 4: {3: 0, 0: 0}, 5: {5: 0, 0: 0}, 6: {7: 0, 0: 0}}
# min_offset = min([min(a.keys()) for a in aligned_labeling_frac_u.values()])
# max_offset = max([max(a.keys()) for a in aligned_labeling_frac_u.values()])
min_offset, max_offset = map(lambda f: (f([f(a.keys()) for a in aligned_labeling_frac_u.values()]) if len(aligned_labeling_frac_u) > 0 else float("nan")), (min, max))
min_offset, max_offset = (lambda x: (-x,x))(min(-min_offset, max_offset))  # fun fun
x = np.array(range(min_offset, max_offset+1)) if len(aligned_labeling_frac_u) > 0 else []

wk_u_y = {u: [aligned_labeling_frac_u[u][i] if i in aligned_labeling_frac_u[u] else float("nan") for i in x] for u in week_of_start.keys()}
wk_ens_u = [np.nanmean([wk_u_y[u][i] for u in week_of_start.keys()]) for i in range(len(x))]
# print(wk_ens_u[x.index(3)])
# print(wk_u_y[server_filtered_users[5]])
# print([wk_u_y[u][x.index(3)] for u in week_of_start.keys()])

def plot_weekly_per_user_line():
    fig, ax = plt.subplots(1, figsize=(15,8))
    ax.set_xlim([min_offset, max_offset])
    ax.yaxis.set_major_formatter(mtick.PercentFormatter(1))
    ax.set_ylim([0, 1])

    for u in week_of_start.keys():
        ax.scatter(x, wk_u_y[u])
        ax.plot(x, wk_u_y[u], alpha=0.25)

    line = ax.plot(x, wk_ens_u)[0]
    line.set_color("black")
    line.set_linewidth(3)

def plot_weekly_per_user_bar():
    x2 = np.array([-9, -6, -3, 0, 3, 6, 9])
    y2 = {u: [] for u in week_of_start.keys()}
    for i in range(len(x2)-1):
        xstart = list(x).index(x2[i])
        xend = list(x).index(x2[i+1])  # If you are reading this code, my condolences
        for u in week_of_start.keys():
            # print(wk_u_y[u][xstart:xend])
            y2[u].append(np.mean(wk_u_y[u][xstart:xend]))
    x2 = x2[:-1]
    # print(y2[server_filtered_users[5]])

    width = 3/len(week_of_start.keys())*0.9
    fig, ax = plt.subplots(1, figsize=(20,8))
    ax.set_xlim([-9, 9])
    ax.xaxis.set_major_locator(plt.MultipleLocator(3))
    ax.yaxis.set_major_formatter(mtick.PercentFormatter(1))
    ax.set_ylim([0, 1])
    ax.xaxis.grid(True, color="black", linewidth=2, linestyle="dotted")

    for i, u in enumerate(week_of_start.keys()):
        ax.bar(x2+width*i, y2[u], width)

    ax.plot(x, wk_ens_u, color="black", linewidth=3)

def plot_weekly_per_user():
    if TESTINGMODE: return  # This just isn't going to work in TESTINGMODE
    plot_weekly_per_user_line()
    plot_weekly_per_user_bar()


Let's construct an infographic visualizing how we eliminated taps:

In [None]:
def draw_stacked_bars(title, labels, scenarios, max_taps, width=6, hook=None):
    user_burden = [max_taps-scenario[0]-scenario[1] for scenario in scenarios]
    due_to_confirm = [scenario[0] for scenario in scenarios]
    due_to_expectations = [scenario[1] for scenario in scenarios]

    fig, ax = plt.subplots(1, figsize=(width,6))
    p1 = ax.bar(labels, user_burden, bottom = list(map(lambda x: x[0]+x[1], zip(due_to_expectations, due_to_confirm))), label = "User burden")
    p2 = ax.bar(labels, due_to_confirm, bottom = due_to_expectations, label = "Eliminated due to confirm button")
    p3 = ax.bar(labels, due_to_expectations, label = "Eliminated due to expectations")
    # Label user burden and all nonzero eliminations
    rects_to_label = list(p1)+[r for i, r in enumerate(p2) if due_to_confirm[i] != 0]+[r for i, r in enumerate(p3) if due_to_expectations[i] != 0]
    for rect in rects_to_label:
        height = rect.get_height()
        ax.text(rect.get_x()+rect.get_width()/2, rect.get_y()+rect.get_height()/2, f"{height:.2f}", ha="center", va="center")

    ax.legend()  # loc="upper left"
    ax.set_ylabel("Average taps per user-labeled or confidently auto-labeled trip")
    ax.set_title(title)
    if hook is not None: hook(ax)
    print()

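# OLD_TAPS is part of stacked_denom, so the two saved_* values below are fractions of the pre-update tap count rather than taps per trip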
stacked_denom = OLD_TAPS*(sum(fu(trips_labeled).values())+sum(fu(g_high_confidence_n_after_unlabeled).values()))
saved_due_to_confirm = sum(fu(taps_avoided).values())/stacked_denom  # Usually we take the denominator for this number to be only trips in To Label, but here it has to be all user- or confidently auto-labeled trips
saved_due_to_expectations = (OLD_TAPS*sum(fu(g_high_confidence_n_after_unlabeled).values()))/stacked_denom
np.testing.assert_almost_equal(saved_due_to_confirm+saved_due_to_expectations, ((sum(fu(taps_avoided).values())+OLD_TAPS*sum(fu(g_high_confidence_n_after_unlabeled).values()))/(OLD_TAPS*(sum(fu(trips_labeled).values())+sum(fu(g_high_confidence_n_after_unlabeled).values())))))
draw_stacked_bars("Reducing user burden without sacrificing data quality", ["Before", "After"], [(0, 0), (saved_due_to_confirm, saved_due_to_expectations)], 6)

## Numbers of Note
Here is an attempt to put all the numbers I actually use in the paper in one place. All of these numbers should be merely restating what is above.

In [None]:
def tprint(label, value):  # Tabularly print
    print(label.ljust(35, ' ')+" "+str(value).rjust(20, ' '))

tprint("Size of Dataset 1", len(server_filtered_users))
tprint("Size of Dataset 2", len(filtered_users))
print()

tprint("% need not label ensemble all", format_frac_percent(*complete_results["ens"]["high"]["All"]))
tprint("% To Label all yellow ens all", format_frac_percent(*complete_results["ens"]["mid"]["All"]))
tprint("% need not label ens after", format_frac_percent(*complete_results["ens"]["high"]["After"]))
tprint("% To Label all yellow ens after", format_frac_percent(*complete_results["ens"]["mid"]["After"]))
print()

tprint("# taps after", sum(fu(taps).values()))
tprint("# trips labeled after", sum(fu(trips_labeled).values()))
tprint("# taps saved due to Confirm", sum(fu(taps_avoided).values()))
tprint("taps saved per trip due to Confirm", f"{sum(fu(taps_avoided).values())/sum(fu(trips_labeled).values()):.2f}")
tprint("% taps saved due to Confirm", f"{sum(fu(taps_avoided).values())/sum(fu(trips_labeled).values())/OLD_TAPS:.2%}")
tprint("# of users who used Confirm", len([u for u in programs["ens"] if u in filtered_users and u in verifiers]))
tprint("% of trips finalized using Confirm", format_frac_percent(sum([verifieds[u] for u in programs["ens"] if u in filtered_users]), sum([trips_labeled[u] for u in programs["ens"] if u in filtered_users])))
print()

tprint("# trips not in To Label", sum(fu(g_high_confidence_n_after_unlabeled).values()))
tprint("# taps saved due to To Label", OLD_TAPS*sum(fu(g_high_confidence_n_after_unlabeled).values()))
tprint("# taps saved total", sum(fu(taps_avoided).values())+OLD_TAPS*sum(fu(g_high_confidence_n_after_unlabeled).values()))

trips_relevant_total = sum(fu(trips_labeled).values())+sum(fu(g_high_confidence_n_after_unlabeled).values())
tprint("# trips relevant total", trips_relevant_total)  
tprint("# taps saved per trip total", f"{(sum(fu(taps_avoided).values())+OLD_TAPS*sum(fu(g_high_confidence_n_after_unlabeled).values()))/(trips_relevant_total):.2f}")
tprint("% taps saved per trip total", f"{(sum(fu(taps_avoided).values())+OLD_TAPS*sum(fu(g_high_confidence_n_after_unlabeled).values()))/(trips_relevant_total)/OLD_TAPS:.2%}")

## Sensitivity Analysis
Here's the plan:

ASSUME as we have been doing that people don't misclick (i.e., once you label a label, you don't relabel it)

Perhaps TODO test this assumption

Limit consideration to labeled trips during the after period that were unlabeled when the inference algorithm ran on them.

For each trip:
 1. Calculate pre-update idealized number of taps (always 6)
 2. Calculate post-update actual number of taps (verify_events+2*label_events as before)
 3. Calculate post-update idealized number of taps if user had followed intended algorithm:
    1. If all the yellow labels are correct, press verify button
    2. Repeat substep 1 until no more correct yellow labels
    3. Input true value for ~most certain~ first non-green label
    4. Repeat substeps 1-3 until trip completely labeled
    - Note that this is NOT the optimal algorithm -- the optimal algorithm would have people clicking the verify button if ANY of the yellow labels are correct, but we don't teach that you should do that

We've already shown stacked-bars graphs of 1 vs. 2. Now, show graphs of 1 vs. 2 vs. 3. Is 3 sufficiently close to 2 that this is a useful approximation? If so, continue. If not, a much more complicated algorithm is needed.

ASSUME that people follow the intended algorithm (see above). Now for a given low and high confidence, we can easily calculate a counterfactual stacked-bar graph.
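For reference, the per-trip arithmetic behind steps 1 and 2 is small enough to show directly. This is a sketch with invented event counts; `actual_taps` and `taps_saved` are hypothetical helper names, and the value 6 is the pre-update tap count stated in step 1.

```python
OLD_TAPS = 6  # pre-update idealized taps per trip (step 1)

def actual_taps(verify_events, label_events):
    """Post-update taps for one trip: one tap per verify press, two per manual label (step 2)."""
    return verify_events + 2 * label_events

def taps_saved(verify_events, label_events):
    """Taps saved on one trip relative to the pre-update baseline."""
    return OLD_TAPS - actual_taps(verify_events, label_events)

# Invented trips as (verify_events, label_events)
trips = [(1, 0), (1, 1), (0, 3)]
saved = [taps_saved(v, lab) for v, lab in trips]
average_saved = sum(saved) / len(trips)
print(saved, average_saved)
```

A trip confirmed with a single verify press saves five taps, while a trip where all three labels are entered by hand saves none. Step 3 replaces the observed event counts with the counts the intended algorithm would have produced.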

In [None]:
test_labelstruct = [
    {"labels": {"mode_confirm": "walk", "purpose_confirm": "shopping", "replaced_mode": "placeholder"}, "p": 0.15},
    {"labels": {"mode_confirm": "walk", "purpose_confirm": "entertainment", "replaced_mode": "placeholder"}, "p": 0.05},
    {"labels": {"mode_confirm": "drove_alone", "purpose_confirm": "work", "replaced_mode": "placeholder"}, "p": 0.45},
    {"labels": {"mode_confirm": "shared_ride", "purpose_confirm": "work", "replaced_mode": "placeholder"}, "p": 0.35}
]

test_groundtruth = {"mode_confirm": "walk", "purpose_confirm": "shopping", "replaced_mode": "placeholder"}

# confidences = {}
# for label_type in LABEL_CATEGORIES:
#     counter = {}
#     for line in inference:
#         if label_type not in line["labels"]: continue  # Seems we have some incomplete tuples!
#         val = line["labels"][label_type]
#         if val not in counter: counter[val] = 0
#         counter[val] += line["p"]
#     confidences[label_type] = max(counter.values()) if len(counter) > 0 else 0 # This needs to be max, not sum!!! A major bug in a previous version.
# trip_confidence = min(confidences.values())

# Basically a copypaste of the reimplementation above -- this is fiddly stuff we'd like to touch as little as possible
# TODO refactor to eliminate code duplication, but only do this in the presence of testing data sufficient to ensure we don't mess it up

# Labelstruct is the inferred probabilities for different label tuples for 1 trip.

def sum_confidences(labelstruct):
    confidences = {}
    for label_type in LABEL_CATEGORIES:
        counter = {}
        for line in labelstruct:  # line is one of the sublists
            if label_type not in line["labels"]: continue   # eg mode_confirm not in the value list associated with "labels"
            val = line["labels"][label_type]    # eg "walk"
            if val not in counter: counter[val] = 0
            counter[val] += line["p"]         # add the associated confidence to the sum for the current label value
        confidences[label_type] = counter # place counter as the value for the current label_type
    return confidences

def best_confidences(labelstruct):
    confidences = sum_confidences(labelstruct)
    # Take the largest confidence labels
    return {k: max(confidences[k].items(), key = lambda item: item[1]) if len(confidences[k]) > 0 else (None, 0) for k in confidences}
    
# Calculates the label categories and values we can display as yellow
# Assumes we've already filtered out the non-viable options and renormalized
def get_yellows(labelstruct, low_thresh):
    confidences = best_confidences(labelstruct)
    # print("cs")
    # print(confidences)
    return {k: confidences[k][0] for k in confidences if confidences[k][1] > low_thresh}

def next_green(labelstruct, established, ground_truth):
    # Figure out the most probable thing for the user to next fill in
    # confidences = sum_confidences(labelstruct)
    # confidences = {k: confidences[k] for k in confidences if k not in established}  # Eliminate categories we've already greened
    # choice = max(confidences.items(), key = lambda item: item[1][ground_truth[item[0]]] if len(item[1]) > 0 else 0)  # Pick the item whose probability matching the actual truth is highest
    # return choice[0]

    # Actually let's just say the user fills in the labels in order (this also seems plausible)
    return next(filter(lambda category: category not in established, LABEL_CATEGORIES))

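def filter_and_renormalize(labelstruct, established, certainty):
    # Keep only the rows consistent with every label established so far
    for label_type in established:
        labelstruct = [row for row in labelstruct if row["labels"][label_type] == established[label_type]]
    new_certainty = sum([row["p"] for row in labelstruct])
    # Build new row dicts instead of rescaling "p" in place, so the caller's data
    # (e.g. the inferred_labels stored in the dataframe) is not modified between runs
    return [{**row, "p": row["p"]*certainty/new_certainty} for row in labelstruct]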

def is_above_high(labelstruct, high_thresh):
    confidences = best_confidences(labelstruct)
    confidences = {k: confidences[k][1] for k in confidences}  # store the best confidence value for each label category
    trip_confidence = min(confidences.values())
    return trip_confidence > high_thresh

def intended_taps(labelstruct, ground_truth, low_thresh):
    # print(ground_truth)
    certainty = sum([row["p"] for row in labelstruct])
    established = {}
    taps = 0
    
    while(len(established) < len(LABEL_CATEGORIES)):
        # print("eold")
        # print(established)
        candidates = get_yellows(labelstruct, low_thresh)
        # print("c")
        # print(candidates)
        # If all the yellow labels are correct, press the verify button and loop again
        # (note the choice of all instead of any)
        # print("g")
        # print(ground_truth)
        if len(candidates) > 0 and all([candidates[k] == ground_truth[k] for k in candidates]):
            taps += 1
            established.update(candidates)
        # Otherwise manually label the next unlabeled category (see next_green)
        else:
            taps += 2
            selected = next_green(labelstruct, established, ground_truth)
            established[selected] = ground_truth[selected]
        # print("enew")
        # print(established)
        labelstruct = filter_and_renormalize(labelstruct, established, certainty)
        # print("ren")
        # print(labelstruct)
        # print()
    return taps

assert intended_taps(test_labelstruct, test_groundtruth, LOW_CONFIDENCE_THRESHOLD_PRODUCTION) == 3
# It works!

In [None]:
import time
# This gets `apply`ed to the ct_df dataframe
def is_saved_due_to_expectations(high_thresh):
    # NOTE: this yields 1 per trip above the high threshold (a trip count), while is_saved_due_to_confirm yields taps.
    # TODO decide whether this should be multiplied by 6 so that both are in taps
    return lambda row: int(is_above_high(row["inferred_labels"], high_thresh))

def is_saved_due_to_confirm(low_thresh, high_thresh):
    # TODO this is not right yet
    return lambda row: 0 if is_above_high(row["inferred_labels"], high_thresh) \
        else 6 - intended_taps(row["inferred_labels"], row["user_input"], low_thresh)

def get_eliminations(low_thresh, high_thresh):
    print(f"{low_thresh},{high_thresh}")
    u_due_to_expectations = {}
    u_due_to_confirm = {}
    u_n_trips = {}
    for user in filtered_users:
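        ct_df = confirmed_trip_df_map[user]  # was [u], a stale variable from an earlier cell

        # Filter to get only unlabeled trips or trips that were labeled after the inference.
        ct_df = filter_unlabeled(ct_df)
        ct_df = ct_df[ct_df["user_input"].apply(len) != 0].copy()  # copy so the new columns below are not written to a view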
        # print(ct_df["user_input"])
        
        # Make new columns that have a 1 if the algorithm saved the user a label
        ct_df["due_to_expectations"] = ct_df.apply(is_saved_due_to_expectations(high_thresh), axis=1)
        ct_df["due_to_confirm"] = ct_df.apply(is_saved_due_to_confirm(low_thresh, high_thresh), axis=1)
        
        # per user taps saved due to expectations?
        u_due_to_expectations[user] = ct_df["due_to_expectations"].sum()  
        
        u_due_to_confirm[user] = ct_df["due_to_confirm"].sum()
        u_n_trips[user] = ct_df.shape[0]
    eliminations = sum(u_due_to_confirm.values())/sum(u_n_trips.values()), sum(u_due_to_expectations.values())/sum(u_n_trips.values())


    # Order matters: draw_stacked_bars expects (due to confirm, due to expectations) for each scenario,
    # which is the order of the tuple built above.


    return eliminations
        
t1 = time.time()
(idealized_saved_due_to_confirm, idealized_saved_due_to_expectations) = \
    get_eliminations(LOW_CONFIDENCE_THRESHOLD_PRODUCTION, HIGH_CONFIDENCE_THRESHOLD_PRODUCTION)
print(time.time()-t1)
draw_stacked_bars("Graph of 1 vs. 2 vs. 3 as described above", \
                  ["1 = Before", "2 = After", "3 = Idealized"], \
                  [(0, 0), (saved_due_to_confirm, saved_due_to_expectations), \
                   (idealized_saved_due_to_confirm, idealized_saved_due_to_expectations)], 6)

Now we can actually do the sensitivity analysis!

In [None]:
scenarios = [
    (LOW_CONFIDENCE_THRESHOLD_PRODUCTION, HIGH_CONFIDENCE_THRESHOLD_PRODUCTION),
    
    (0.05, 0.89),
    (0.33, 0.89),
    (0.40, 0.89),
    (0.60, 0.89),
    
    (0.25, 0.75),
    (0.25, 0.80),
    (0.25, 0.95),
    (0.25, 0.99) #.95
]
labels = list(map(lambda scenario: f"({scenario[0]:.2f}, {scenario[1]:.2f})", scenarios))
labels[0] = "curr="+labels[0]
draw_stacked_bars("Sensitivity analysis!", labels, \
                  list(map(lambda t: get_eliminations(*t), scenarios)), \
                  6, width=12, \
                  hook = lambda ax: ax.set_xlabel("Scenario: (lower threshold, upper threshold)"))


The above graphs look rather strange to me, but that might be because I'm not working with the full dataset. This analysis should be re-run with the full dataset.