# Pilot Before After -- Pass One

This began as a copypaste of `Explore stage before vs after.ipynb` and remains a fairly faithful adaptation thereof.

## Planning

### Users to include in the study
All except:
 * Users that don't correspond to real people
 * Users that don't have X confirmed trips both before and after
 * Users that don't have Y labeled trips both before and after?
 * Myself and possibly others who had very close contact with the updates (Shankari)

### Time period
Let "**before**" be from June 1 (to capture only summer patterns and to exclude old versions of the app and old behaviors) or whenever the user started using the app (measured by the user's first `OPENED_APP` event), whichever is later, until July 19, the day before a meeting in which users may have been encouraged to label their trips more than usual.

Let "**after**" be from whenever the user installed the update (measured by the user's first `LABEL_TAB_SWITCH` event) until the new confidence threshold went into effect (let's say 3:20 PM MDT on August 11).
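
The window definitions above can be sketched in a few lines. This is only an illustration using the standard library: `study_windows` and its arguments are hypothetical names, and the year 2021 and the fixed UTC-6 offset for MDT are taken from the constants defined later in this notebook.

```python
from datetime import datetime, timedelta, timezone

MDT = timezone(timedelta(hours=-6))  # fixed offset; assumed adequate for a summer-only study
BEFORE_START = datetime(2021, 6, 1, 0, 0, tzinfo=MDT)
BEFORE_END = datetime(2021, 7, 20, 0, 0, tzinfo=MDT)  # i.e. through the end of July 19
AFTER_END = datetime(2021, 8, 11, 15, 20, tzinfo=MDT)

def study_windows(first_app_open, first_label_tab_switch):
    """Return ((before_start, before_end), (after_start, after_end)) for one user.

    Hypothetical helper: both arguments are timezone-aware datetimes of the
    user's first event of each kind.
    """
    before = (max(first_app_open, BEFORE_START), BEFORE_END)
    after = (first_label_tab_switch, AFTER_END)
    return before, after

# A made-up user who first opened the app on June 10 and got the update on July 25
before, after = study_windows(
    datetime(2021, 6, 10, 9, 0, tzinfo=MDT),
    datetime(2021, 7, 25, 18, 30, tzinfo=MDT),
)
print(before[0].isoformat(), "->", before[1].isoformat())
print(after[0].isoformat(), "->", after[1].isoformat())
```

A user whose first app open predates June 1 would simply get `BEFORE_START` as the start of their "before" window.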

### Things to measure
 * Fraction of trips labeled before vs. after
 * Frequency of interaction with app before vs. after
 * Some metric representing the amount of work we saved the user
   * Number of times "verify" was used
   * Number of taps avoided
   * Fraction of trips with no red labels
   * Fraction of trips we didn't need to put in To Label
 * How much To Label is used vs. other tabs/screens

### Instrumentation that exists for both before and after
~~strikethrough = irrelevant to us~~
 * `STATE_CHANGED`: Not sure, guessing when a page is switched using the buttons at the bottom of the screen?
 * ~~`BUTTON_FORCE_SYNC`: When a force sync is activated~~
 * `OPENED_APP`: When the app is opened
 * `CHECKED_DIARY`: When the Diary page is opened
 * `EXPANDED_TRIP`: When a map for a trip is opened?
 * `NOTIFICATION_OPEN`: When user clicks on a push notification?
 * ~~`METRICS_TIME`: Time spent on the metrics page~~
 * ~~`BEAR_TIME`: Time spent on the bear page~~
 * `DIARY_TIME`: Time spent on the Diary page, plus maybe something about the detail view?
 * `CHECKED_INF_SCROLL`: When we open the Label page
 * `INF_SCROLL_TIME`: Time spent on the Label page

### Instrumentation that exists for after but not before
 * `VERIFY_TRIP`: When user presses the confirm button and what the labels look like when they do
 * `LABEL_TAB_SWITCH`: When user switches between filter tabs on the Label page
 * `SELECT_LABEL`: When user manually selects a label and what the labels look like when they do

### How to measure things
 * Fraction of trips labeled before vs. after
   * We can do this just using confirmed trip data
   * If we do that, it will be based on trip time, rather than labeling time
   * That might not actually be okay: since our "after" period is so short, people might very well be labeling "before" trips during the "after" period
   * Maybe let's skip this metric for now
 * Frequency of interaction with app before vs. after
   * Measure frequency of `OPENED_APP`
 * Number of times "verify" was used
   * Measure frequency of `VERIFY_TRIP`
 * Fraction of trips with no red labels
   * Just use confirmed trip data
 * Fraction of trips we didn't need to put in To Label
   * Just use confirmed trip data plus expectations
 * Number of taps avoided per trip
   * Let `taps` be the number of occurrences of `VERIFY_TRIP` plus twice the number of occurrences of `SELECT_LABEL` (one tap to verify, two taps to pick a label from a dropdown)
   * Let `trips_labeled` be the sum of the occurrences of `VERIFY_TRIP` for which none of the labels were red, the occurrences of `SELECT_LABEL` for which all labels except the one being selected were green, and the number of trips we didn't need to put in To Label
   * Final value is `6-taps/trips_labeled`
 * How much To Label is used vs. other tabs/screens
   * Count the time spent on each of the screens using `LABEL_TAB_SWITCH` and `INF_SCROLL_TIME`
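
As a worked example of the taps-avoided formula: the sketch below assumes three label categories and two taps per category under the old UI (hence the 6) and, like the analysis code further down, counts each dropdown selection as two taps. The function name and the sample counts are made up for illustration.

```python
OLD_TAPS_PER_TRIP = 6  # 3 label categories x 2 taps each under the old UI

def taps_avoided_per_trip(n_verify, n_select, trips_labeled):
    """Average taps saved per fully labeled trip (hypothetical helper)."""
    if trips_labeled == 0:
        return float("nan")  # no fully labeled trips, so the average is undefined
    taps = n_verify + 2 * n_select  # one tap to verify, two to pick from a dropdown
    return OLD_TAPS_PER_TRIP - taps / trips_labeled

# Made-up counts: 8 verify presses and 5 dropdown selections fully labeled 10 trips
print(taps_avoided_per_trip(8, 5, 10))  # 6 - 18/10
```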

### Given limited time to write analysis code, how to prioritize each of the metrics
 1. Frequency of interaction with app before vs. after
   * Probably not going to be the most useful metric, especially given how much we have influenced this, but it's a nice easy one to take as a "warmup"
 2. Fraction of trips we didn't need to put in To Label
   * Good context for the rest to come
 3. Fraction of trips with no red labels
   * Good context for the rest to come
 4. Number of taps avoided per trip
   * Hard to calculate, but guaranteed to show that we saved the users some work -- just nice to know exactly how much
 5. Number of times "verify" was used
   * Easy to calculate and nice to know
 6. How much To Label is used vs. other tabs/screens
   * This can be skipped if necessary -- for now, qualitative feedback probably fills this niche better anyway

## Settings

### HARDCODED THINGS THAT SHOULD BE FETCHED PROGRAMMATICALLY IF WHEN I REVISIT THIS IT ISN'T 2AM
Please don't forget about this and botch future analyses by using old constants

In [None]:
LABEL_CATEGORIES = ["mode_confirm", "purpose_confirm", "replaced_mode"]
HIGH_CONFIDENCE_THRESHOLD = 0.95  # Confidence we need to not put a trip in To Label
LOW_CONFIDENCE_THRESHOLD = 0.5  # confidenceThreshold from the config file (or from last time I thought hardcoding was a good idea...)
# Note that this will become more complicated if future analyses are done under a regime with multiple collection modes.

(end ugly hardcoding)

### Defining the sample

In [None]:
import arrow
from uuid import UUID

REQUIRED_TRIPS_TOTAL = 1  # previously 30
REQUIRED_TRIPS_BEFORE = 0  # previously 10
REQUIRED_TRIPS_AFTER = 0
BEFORE_START = arrow.get("2021-06-01T00:00-06:00")
BEFORE_END = arrow.get("2021-07-20T00:00-06:00")  # i.e., through the end of July 19
AFTER_END = arrow.get("2021-08-11T15:20-06:00")  # 3:20 PM MDT
EXCLUDE_UUIDS = [] # EXCLUDE_UUIDS = [UUID(s) for s in input("Enter UUIDs to exclude, separated by spaces: ").split(" ") if len(s) > 0]
print(EXCLUDE_UUIDS)

### Relevant instrumentation keys

In [None]:
db_keys = {
    "time": "stats/client_time",
    "error": "stats/client_error",
    "nav": "stats/client_nav_event"
}

### Do not truncate dataframes to print

In [None]:
import pandas as pd
pd.set_option("display.max_rows", None)
pd.set_option("display.width", None)

### Logging

In [None]:
import logging
_default_log_level = logging.DEBUG
def set_log_level(level):
    logging.getLogger().setLevel(level)
def reset_log_level():
    global _default_log_level
    logging.getLogger().setLevel(_default_log_level)
reset_log_level()
# If we wanted to be even more elaborate we could push and pop from a stack of logging levels...
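
The stack idea in the last comment could look roughly like the following context manager. This is a sketch, not part of the original analysis: `temporary_log_level` is a hypothetical name, and because it restores whatever level was active on entry, nested uses behave like pushes and pops.

```python
import logging
from contextlib import contextmanager

@contextmanager
def temporary_log_level(level):
    """Set the root logger to `level` inside the block, then restore the old level."""
    root = logging.getLogger()
    previous = root.level
    root.setLevel(level)
    try:
        yield
    finally:
        root.setLevel(previous)  # runs even if the block raises

logging.getLogger().setLevel(logging.DEBUG)
with temporary_log_level(logging.INFO):
    with temporary_log_level(logging.ERROR):
        print(logging.getLevelName(logging.getLogger().level))  # innermost level
    print(logging.getLevelName(logging.getLogger().level))      # back to the outer one
print(logging.getLevelName(logging.getLogger().level))          # original level restored
```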

### Imports and aliases

In [None]:
import emission.storage.timeseries.abstract_timeseries as esta
import emission.core.get_database as edb
import emission.storage.timeseries.aggregate_timeseries as estag
agts = estag.AggregateTimeSeries()
import emission.storage.timeseries.timequery as estt
from statistics import mean
import json

## Setup

### Helper functions

In [None]:
def filter_between(dataset, key, start, end):
    """Rows of dataset whose `key` column lies in [start, end] (inclusive)."""
    return dataset[(dataset[key] >= start) & (dataset[key] <= end)]
def print_div_zero(f):
    try:
        print(f())
    except ZeroDivisionError:
        print("[denominator is zero]")

### Get the instrumentation data

In [None]:
# This is very much slower if I provide total_time_query to get_data_df.
# total_time_query = estt.TimeQuery("data.ts", BEFORE_START.timestamp, AFTER_END.timestamp)
stats = {}
for key in db_keys:
    print(f'Adding "{db_keys[key]}" to stats as "{key}"')
    stats[key] = agts.get_data_df(db_keys[key])
    print(f"-> Done; found {stats[key].shape[0]} entries")

### Get the trip data and filter a bit

In [None]:
def filter_update(new, old, reason):
    print(f"Excluded {len(old)-len(new)} users, left with {len(new)}: {reason}")

set_log_level(logging.INFO)
all_users = esta.TimeSeries.get_uuid_list()
confirmed_trip_df_map = {}
print(f"Working with {len(all_users)} initial users")

filter0_users = [u for u in all_users if u not in EXCLUDE_UUIDS]  # Users that we don't explicitly exclude
filter_update(filter0_users, all_users, "presence on exclusion list")

filter1_users = []  # Users with enough total trips
for u in filter0_users:
    ts = esta.TimeSeries.get_time_series(u)
    ct_df = ts.get_data_df("analysis/confirmed_trip")
    confirmed_trip_df_map[u] = ct_df
    if ct_df.shape[0] >= REQUIRED_TRIPS_TOTAL: filter1_users.append(u)
filter_update(filter1_users, filter0_users, "not enough total trips")

# To find a user's UUID based on the end date of their first trip:
# for u in filter1_users:
#     ct_df = confirmed_trip_df_map[u].copy()
#     ct_df.sort_values("end_ts", ascending=True, inplace=True)
#     print(u)
#     print(arrow.get(ct_df.iloc[0]["end_ts"]).to("America/Chicago"))
#     print()

server_filtered_users = filter1_users
print(f"For metrics that don't need user interaction, working with {len(server_filtered_users)} filtered users")

reset_log_level()

In [None]:
def explore_stats_existence(name, dataset=filter1_users):
    print(f"Exploring {name}")
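    m = 0  # Number of users with at least one matching event under any key
    for u in dataset:
        found = False
        for k in db_keys:
            n = len(stats[k][(stats[k]["user_id"] == u) & (stats[k]["name"] == name)])
            if n > 0:
                print(f"{k}: {n}; ", end="", flush=True)
                found = True
        if found: m += 1  # Count each user once, even if they appear under several keys
    print(f"\nTotal: {m} out of {len(dataset)}")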
explore_stats_existence("opened_app")
explore_stats_existence("label_tab_switch")

In [None]:
filter2_users = []  # Users with enough before trips
user_before_start = {}
for u in filter1_users:
    ct_df = confirmed_trip_df_map[u]
    ct_df["end_arrow"] = ct_df["end_ts"].apply(arrow.get)
    ct_df["write_arrow"] = ct_df["metadata_write_ts"].apply(arrow.get)
    ct_df.sort_values("end_arrow", ascending=True, inplace=True)
    this_before_start = max(ct_df.iloc[0]["end_arrow"], BEFORE_START)  # Uses the end of the user's first trip as the date they started using the app
    n_before = filter_between(ct_df, "end_arrow", this_before_start, BEFORE_END).shape[0]
    user_before_start[u] = this_before_start
    if n_before >= REQUIRED_TRIPS_BEFORE: filter2_users.append(u)
filter_update(filter2_users, filter1_users, "not enough before trips")

filter3_users = []  # Users that have installed the update
for u in filter2_users:
    lts = len(stats["time"][(stats["time"]["user_id"] == u) & (stats["time"]["name"] == "label_tab_switch")])  # (or haven't used the update enough to generate any of these events, which is highly unlikely if they do have the update)
    if lts > 0: filter3_users.append(u)
filter_update(filter3_users, filter2_users, "have not installed the update")

filter4_users = [] # Users with enough after trips
user_after_start = {}
for u in filter3_users:
    lts = stats["time"][(stats["time"]["user_id"] == u) & (stats["time"]["name"] == "label_tab_switch")].copy()
    lts.sort_values("ts", ascending=True, inplace=True)
    this_after_start = arrow.get(lts.iloc[0]["ts"])
    user_after_start[u] = this_after_start
    ct_df = confirmed_trip_df_map[u]
    n_after = filter_between(ct_df, "end_arrow", this_after_start, AFTER_END).shape[0]
    if n_after >= REQUIRED_TRIPS_AFTER: filter4_users.append(u)
filter_update(filter4_users, filter3_users, "not enough after trips")

filtered_users = filter4_users
print(f"For metrics that do need user interaction, working with {len(filtered_users)} filtered users")

## Analysis!

### 1. Frequency of interaction with app before vs. after

In [None]:
days_before = {}
days_after = {}
opens_before = {}
opens_after = {}
opens_per_day_before = {}
opens_per_day_after = {}

delta2days = lambda d: d.days+d.seconds/86400

for u in filtered_users:
    ct_df = confirmed_trip_df_map[u]
    this_before_start = user_before_start[u]
    this_before_end = BEFORE_END
    this_after_start = user_after_start[u]
    this_after_end = AFTER_END
    
    days_before[u] = delta2days(this_before_end-this_before_start)
    days_after[u] = delta2days(this_after_end-this_after_start)
    
    opens = stats["nav"][(stats["nav"]["user_id"] == u) & (stats["nav"]["name"] == "opened_app")].copy()
    opens["ts_arrow"] = opens["ts"].apply(arrow.get)
    opens_before[u] = filter_between(opens, "ts_arrow", this_before_start, this_before_end).shape[0]
    opens_after[u] = filter_between(opens, "ts_arrow", this_after_start, this_after_end).shape[0]
    
    opens_per_day_before[u] = opens_before[u]/days_before[u]
    opens_per_day_after[u] = opens_after[u]/days_after[u]

for u in filtered_users:
    print(f"App opens per day before->after, opens before+after: {opens_per_day_before[u]:.2f}->{opens_per_day_after[u]:.2f}, {opens_before[u]+opens_after[u]:.2f}")

Well, with the `stage_2021-08-07` dump dataset, I don't think that tells us anything at all conclusive. Let's move on...
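
For reference, the rate computed above is simply the number of events inside a window divided by the window length in fractional days. A minimal standalone sketch, with a hypothetical helper name and made-up timestamps:

```python
from datetime import datetime, timedelta, timezone

def events_per_day(event_times, start, end):
    """Count events with start <= t <= end and divide by the window length in days."""
    days = (end - start).total_seconds() / 86400
    n = sum(1 for t in event_times if start <= t <= end)
    return n / days

utc = timezone.utc
start = datetime(2021, 7, 1, tzinfo=utc)
end = start + timedelta(days=4)
# Offsets in hours from the window start; the last one falls outside the window
opens = [start + timedelta(hours=h) for h in (1, 30, 50, 200)]
print(events_per_day(opens, start, end))  # 3 events over 4 days
```

Users with very short windows get noisy rates from this calculation, which is one reason the per-user numbers are hard to interpret on a small sample.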

### 2. Fraction of trips we didn't need to put in To Label
This one is good because we don't need the users to have done anything, even install the update.

In [None]:
total_trip_n = {}
total_trip_n_after = {}
total_trip_n_unlabeled = {}
total_trip_n_after_unlabeled = {}
high_confidence_n = {}  # Trips with inferences so confident they don't need to go in To Label
high_confidence_n_after = {}
high_confidence_n_unlabeled = {}
high_confidence_n_after_unlabeled = {}
mid_confidence_n = {}  # Trips that need to go in To Label but have no red labels
high_confidence_frac = {}
mid_confidence_frac = {}

for u in server_filtered_users:
    ct_df = confirmed_trip_df_map[u]
    total_trip_n[u] = ct_df.shape[0]
    high_confidence_n[u] = 0
    mid_confidence_n[u] = 0
    
    total_trip_n_unlabeled[u] = ct_df[ct_df["user_input"].apply(len) == 0].shape[0]
    high_confidence_n_unlabeled[u] = 0
    
    if u in filtered_users:
        high_confidence_n_after[u] = 0
        high_confidence_n_after_unlabeled[u] = 0
        this_after_start = user_after_start[u]
        this_after_end = AFTER_END
        trips_after = filter_between(ct_df, "end_arrow", this_after_start, this_after_end)
        total_trip_n_after[u] = trips_after.shape[0]
        total_trip_n_after_unlabeled[u] = trips_after[trips_after["user_input"].apply(len) == 0].shape[0]
        ids = set(trips_after["_id"])  # IDs of this user's "after" trips, for membership tests below
    
    for _, trip in ct_df.iterrows():
        inference = trip["inferred_labels"]
        # Here goes a quick and partial reimplementation of the on-the-fly (client-side) inference algorithm
        confidences = {}
        for label_type in LABEL_CATEGORIES:
            counter = {}
            for line in inference:
                if label_type not in line["labels"]: continue  # Seems we have some incomplete tuples!
                val = line["labels"][label_type]
                if val not in counter: counter[val] = 0
                counter[val] += line["p"]
            confidences[label_type] = max(counter.values(), default=0)  # Confidence of the most probable value for this category (0 if none)
        trip_confidence = min(confidences.values())
        # if (trip_confidence >= 0.01 and trip_confidence <= 0.99): print(trip_confidence)
        if trip_confidence > HIGH_CONFIDENCE_THRESHOLD:
            high_confidence_n[u] += 1
            if u in filtered_users and trip["_id"] in ids:
                high_confidence_n_after[u] += 1
                if len(trip["user_input"]) == 0: high_confidence_n_after_unlabeled[u] += 1
            if len(trip["user_input"]) == 0: high_confidence_n_unlabeled[u] += 1
        if trip_confidence > LOW_CONFIDENCE_THRESHOLD:
            mid_confidence_n[u] += 1
    high_confidence_frac[u] = high_confidence_n[u] / total_trip_n[u]
    in_to_label = total_trip_n[u]-high_confidence_n[u]
    mid_confidence_frac[u] = (mid_confidence_n[u]-high_confidence_n[u]) / in_to_label if in_to_label != 0 else float("NaN")

for u in server_filtered_users:
    print(f"{high_confidence_frac[u]:.2%}")
print(f"Average percentage of trips users do not need to interact at all with: {sum(high_confidence_n.values())}/{sum(total_trip_n.values())}={sum(high_confidence_n.values())/sum(total_trip_n.values()):.2%}")
print("Considering only those in the \"after\" period:")
print_div_zero(lambda: f"Average percentage of trips users do not need to interact at all with: {sum(high_confidence_n_after.values())}/{sum(total_trip_n_after.values())}={sum(high_confidence_n_after.values())/sum(total_trip_n_after.values()):.2%}")
print("Considering all unlabeled:")
print(f"Average percentage of trips users do not need to interact at all with: {sum(high_confidence_n_unlabeled.values())}/{sum(total_trip_n_unlabeled.values())}={sum(high_confidence_n_unlabeled.values())/sum(total_trip_n_unlabeled.values()):.2%}")
print("Considering only those both unlabeled and in the \"after\" period:")
print_div_zero(lambda: f"Average percentage of trips users do not need to interact at all with: {sum(high_confidence_n_after_unlabeled.values())}/{sum(total_trip_n_after_unlabeled.values())}={sum(high_confidence_n_after_unlabeled.values())/sum(total_trip_n_after_unlabeled.values()):.2%}")
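
The bucketing logic in the cell above can be checked on a handful of made-up trip confidences. The sketch below reuses the two thresholds from the Settings section; the helper name and the sample values are hypothetical.

```python
HIGH_CONFIDENCE_THRESHOLD = 0.95
LOW_CONFIDENCE_THRESHOLD = 0.5

def confidence_fractions(trip_confidences):
    """Return (high_frac, mid_frac) for a list of per-trip confidences.

    high_frac: share of all trips confident enough to stay out of To Label.
    mid_frac: share of the remaining (To Label) trips that have no red labels.
    """
    total = len(trip_confidences)
    high = sum(1 for c in trip_confidences if c > HIGH_CONFIDENCE_THRESHOLD)
    above_low = sum(1 for c in trip_confidences if c > LOW_CONFIDENCE_THRESHOLD)
    in_to_label = total - high
    high_frac = high / total if total else float("nan")
    mid_frac = (above_low - high) / in_to_label if in_to_label else float("nan")
    return high_frac, mid_frac

high_frac, mid_frac = confidence_fractions([0.99, 0.97, 0.8, 0.6, 0.3, 0.0, 0.0, 0.0])
print(f"{high_frac:.2%} skip To Label; {mid_frac:.2%} of the rest have no red labels")
```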




### 3. Fraction of trips with no red labels

In [None]:
for u in server_filtered_users:
    print(f"{mid_confidence_frac[u]:.2%}")
print(f"Average percentage of trips in To Label with no red labels: {mean(filter(lambda v: v == v, mid_confidence_frac.values())):.2%}")  # The filter excludes NaNs

### 4. Number of taps avoided per trip

In [None]:
OLD_TAPS = 2*len(LABEL_CATEGORIES)  # Number of taps each trip requires to fully label under the old UI

verifieds = {}
taps = {}
trips_labeled = {}
taps_avoided = {}
taps_avoided_per_trip = {}

# Whether or not a press of the verify button fully labels a trip
def verify_fully_labels(event):
    if not event["reading"]["verifiable"]: return False  # Forgot about this case until working with real pilot program data...
    user_input = json.loads(event["reading"]["userInput"])
    final_inference = json.loads(event["reading"]["finalInference"])
    return len(user_input) < len(LABEL_CATEGORIES) and len(set(user_input.keys()) | set(final_inference.keys())) == len(LABEL_CATEGORIES)

# Whether or not a given label dropdown menu selection fully labels a trip
def label_fully_labels(event):
    user_input = json.loads(event["reading"]["userInput"])
    return len(user_input) == len(LABEL_CATEGORIES)-1 and event["reading"]["inputKey"] not in user_input  # Check against the parsed dict's keys, not the raw JSON string

for u in filtered_users:
    trips_labeled[u] = 0
    verifieds[u] = 0
    verify_events = stats["time"][(stats["time"]["user_id"] == u) & (stats["time"]["name"] == "verify_trip")]
    label_events = stats["time"][(stats["time"]["user_id"] == u) & (stats["time"]["name"] == "select_label")]
    taps[u] = len(verify_events)+2*len(label_events)
    if verify_events.shape[0] > 0:
        for _, ve in verify_events.iterrows():  # Not entirely sure why apply() wasn't working for this, don't have time to figure out
            if verify_fully_labels(ve):
                verifieds[u] += 1
                trips_labeled[u] += 1
    if label_events.shape[0] > 0:
        for _, le in label_events.iterrows():
            if label_fully_labels(le): trips_labeled[u] += 1
    taps_avoided[u] = OLD_TAPS*trips_labeled[u]-taps[u]
    taps_avoided_per_trip[u] = taps_avoided[u]/trips_labeled[u] if trips_labeled[u] != 0 else float("NaN")

print(f"We assume that under the old design, users must tap {OLD_TAPS} times to label each trip -- two taps per label category")
for u in filtered_users:
    print(f"User tapped {taps[u]} times, avoided {taps_avoided[u]} taps to label {trips_labeled[u]} trips")

total_taps_avoided = sum(taps_avoided.values()) 
total_trips_labeled = sum(trips_labeled.values())
avoided_per_labeled = total_taps_avoided/total_trips_labeled if total_trips_labeled != 0 else float("NaN")
print(f"Overall, {total_taps_avoided} taps were avoided, {avoided_per_labeled:.2f} per trip -- that's {avoided_per_labeled/OLD_TAPS:.2%} of taps")

# The following stats seem very impressive but aren't really fair because there are so many unlabeled trips:
# total_taps_avoided_high = total_taps_avoided+OLD_TAPS*sum(high_confidence_n.values())
# total_trips_labeled_high = total_trips_labeled+sum(high_confidence_n.values())
# avoided_per_labeled_high = total_taps_avoided_high/total_trips_labeled_high
# print(f"If we also count the taps we avoided by not putting high-confidence inferences on the To Label screen:")
# print(f"Overall, {total_taps_avoided_high} taps were avoided, {avoided_per_labeled_high:.2f} per trip -- that's {avoided_per_labeled_high/OLD_TAPS:.2%} of taps")

# Here's a fairer way of saying it:
print(f"We also saved {OLD_TAPS*sum(high_confidence_n_after_unlabeled.values())} taps across {sum(high_confidence_n_after_unlabeled.values())} trips by not soliciting user input on very confident trips")

total_taps_avoided_high = total_taps_avoided+OLD_TAPS*sum(high_confidence_n_after_unlabeled.values())
total_trips_labeled_high = total_trips_labeled+sum(high_confidence_n_after_unlabeled.values())
avoided_per_labeled_high = total_taps_avoided_high/total_trips_labeled_high if total_trips_labeled_high != 0 else float("NaN")
print(f"If we also count the taps we avoided by not putting high-confidence inferences on the To Label screen:")
print(f"Overall, {total_taps_avoided_high} taps were avoided across {sum(trips_labeled.values())+sum(high_confidence_n_after_unlabeled.values())} trips, {avoided_per_labeled_high:.2f} per trip -- that's {avoided_per_labeled_high/OLD_TAPS:.2%} of taps")
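
To sanity-check `verify_fully_labels`, it helps to run it on hand-built events. The event layout below (a `reading` dict whose `userInput` and `finalInference` fields are JSON strings) is inferred from the fields the function accesses, and the label keys and values are invented for the example; real events may carry more fields.

```python
import json

LABEL_CATEGORIES = ["mode_confirm", "purpose_confirm", "replaced_mode"]

def verify_fully_labels(event):
    """Copy of the predicate above, repeated so this example runs on its own."""
    if not event["reading"]["verifiable"]:
        return False
    user_input = json.loads(event["reading"]["userInput"])
    final_inference = json.loads(event["reading"]["finalInference"])
    all_keys = set(user_input.keys()) | set(final_inference.keys())
    return len(user_input) < len(LABEL_CATEGORIES) and len(all_keys) == len(LABEL_CATEGORIES)

def make_event(verifiable, user_input, final_inference):
    """Build a hypothetical verify_trip event with JSON-encoded label dicts."""
    return {"reading": {"verifiable": verifiable,
                        "userInput": json.dumps(user_input),
                        "finalInference": json.dumps(final_inference)}}

# One label set by hand, the other two inferred: verifying completes the trip
complete = make_event(True, {"MODE": "bike"}, {"PURPOSE": "work", "REPLACED_MODE": "drive"})
# Only two of three categories covered: verifying leaves the trip partly unlabeled
partial = make_event(True, {}, {"MODE": "bike", "PURPOSE": "work"})
# The verify button was not available for this trip
not_verifiable = make_event(False, {}, {})

for name, ev in [("complete", complete), ("partial", partial), ("not verifiable", not_verifiable)]:
    print(name, verify_fully_labels(ev))
```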



### 5. Number of times "verify" was used

In [None]:
for u in filtered_users:
    print(verifieds[u])
print(f"Overall, {len([v for v in verifieds.values() if v > 0])} users used the verify button for a total of {sum(verifieds.values())} uses")
print(f"Total number of trips labeled: {sum(trips_labeled.values())}")

### 6. How much To Label is used vs. other tabs/screens

I'm not going to implement this one for now. I don't think we will get any meaningful results from it given the extremely limited sample we're working with, so for now qualitative feedback answers the "Was the To Label tab a good idea?" question better.

## Lingering questions
### Is this all... correct?
I wrote this all in one evening, very quickly. It should be tested more extensively.

### Train vs. test
Currently, I sometimes use unlabeled trips as a "test" dataset to remove the effect of the algorithm "cheating" by predicting data it trained on. Might there be a better way to do this?

## A note on multiple programs
Using `mongod --dbpath`, it is possible to create many different databases. If one database is created for each of a handful of programs (each restored using `mongorestore`) and the results from those databases are compared to results from an aggregate database created by `mongorestore`-ing all the dumps into the same database, certain statistics (e.g., total number of trips) turn out to be the same. Thus, it seems to be feasible to analyze multiple programs using only one aggregate database.

## Making sense of data from multiple programs
In the final analysis, we will want to aggregate data from multiple pilot programs but also to be able to analyze each one individually. `mongorestore`-ing all the dumps into one database works well for the ensemble; for now, let's just make sure we can figure out which program each user in that ensemble came from so we can disaggregate.

In [None]:
user_info = {}
for u in edb.get_uuid_db().find():
    u["program"] = u["user_email"].split("_")[0]
    user_info[u["uuid"]] = u
    

for u in server_filtered_users:
    print(u)
    print(user_info[u]["program"])

Success!
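
With the program attached to each user, any per-user metric can be split by program. A sketch of that disaggregation, assuming (as the cell above does) that the program name is the part of `user_email` before the first underscore; the helper names, identifiers, and metric values are invented.

```python
from collections import defaultdict

def program_of(user_email):
    """Program prefix of a user identifier, e.g. 'stage_abc123' -> 'stage'."""
    return user_email.split("_")[0]

def group_by_program(metric_by_email):
    """Map program -> list of metric values for users in that program."""
    grouped = defaultdict(list)
    for email, value in metric_by_email.items():
        grouped[program_of(email)].append(value)
    return dict(grouped)

# Made-up identifiers and values
trips_labeled = {"stage_abc123": 12, "stage_def456": 3, "pilotA_xyz789": 7}
for program, values in group_by_program(trips_labeled).items():
    print(program, sum(values), "trips labeled across", len(values), "users")
```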