## Template to get started with data exploration

The other notebooks show the results of existing analyses. Most of the code behind those results has been moved from the notebooks into the associated Python modules, so that the notebooks can focus on showcasing the results. But that makes it harder to experiment with the data and come up with new analyses, particularly because the current data structures that store the data are a little complicated. Maybe after we switch to xarrays in the future, we will no longer need this!

But for now, users can use this exploration template and plug in their code/analyses here. Once an analysis works, they can move the code into a module for re-use elsewhere.

## Set up the dependencies

In [None]:
# for reading and validating data
import emeval.input.spec_details as eisd
import emeval.input.phone_view as eipv
import emeval.input.eval_view as eiev
from emeval.input.tabularize import tabularize_pv_map

In [None]:
# Visualization helpers
import emeval.viz.phone_view as ezpv
import emeval.viz.eval_view as ezev
import emeval.viz.geojson as ezgj

In [None]:
# For plots
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# For maps
import folium
import branca.element as bre

In [None]:
# For easier debugging while working on modules
import importlib
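The `importlib` import matters when you iterate on a module: after editing it on disk, `importlib.reload` re-executes it in the running kernel without a restart. A minimal sketch, using the standard-library `json` module purely as a stand-in for whichever module you are editing (for example, the one aliased as `eipv`):

```python
import importlib
import json  # stand-in for a module under development

# reload() re-executes the module's source and returns the module object,
# so existing aliases keep pointing at the refreshed code.
reloaded = importlib.reload(json)
same_object = reloaded is json
print(same_object)
```

Objects created before the reload still use the old class definitions, so recreate them after reloading.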

In [None]:
import arrow
from IPython.display import display

## The spec

The spec defines what experiments were done, and over which time ranges. Once the experiment is complete, most of the structure is read back from the data, but we use the spec to validate that it all worked correctly. The spec also contains the ground truth for the legs. Here, we read the spec for the trip to UC Berkeley.

In [None]:
AUTHOR_EMAIL = "shankari@eecs.berkeley.edu"

# If using ServerSpecDetails, data can alternatively be retrieved as such:
# DATASTORE_LOC = "http://localhost:8080"
# sd = eisd.ServerSpecDetails(DATASTORE_LOC, AUTHOR_EMAIL, "train_bus_ebike_mtv_ucb")

# You must run `cd bin/ && python dump_data_to_file.py --spec-id train_bus_ebike_mtv_ucb`
# before using this notebook!

DATASTORE_LOC = "bin/data/"
sd = eisd.FileSpecDetails(DATASTORE_LOC, AUTHOR_EMAIL, "train_bus_ebike_mtv_ucb")

## The views

There are two main views for the data: the phone view and the evaluation view.

### Phone view

In the phone view, the phone is primary, and then there is a tree that you can traverse to get the data that you want. Traversing that tree typically involves nested for loops; here's an example of loading the phone view and traversing it. You can replace the print statements with real code. When you are ready to check this in, please move the function to one of the Python modules so that we can invoke it more generally.

In [None]:
pv = eipv.PhoneView(sd)

The `tabularize_pv_map` function in `emeval.input.tabularize` can be used to convert a phone view tree into a series of dataframes for more intuitive querying. The dataframes are organized by the operating systems of the test phones and the type of data they hold.

In [None]:
pv_dfs = tabularize_pv_map(pv.map())

In [None]:
for phone_os, df_map in pv_dfs.items():
    for df_label, df in df_map.items():
        print(f"{phone_os=}, {df_label=}")
        print(df.columns)

### Eval view

In the eval view, the experiment is primary, and then there is a similar tree that you can traverse to get the data that you want. Traversing that tree typically involves nested for loops; here's an example of manipulating the eval view and traversing it. You can replace the print statements with real code. When you are ready to check this in, please move the function to one of the Python modules so that we can invoke it more generally.

In [None]:
ev = eiev.EvaluationView()
ev.from_view_eval_trips(pv, "", "")

We see these evaluation trips have labels such as `HAHFDC_0`, or `HAMFDC_0`. What do these mean?

`HA` or `MA` refer to high accuracy or medium accuracy, respectively. Trips with `HA` will tend to favor GPS utilization and result in high power consumption.

`HF` or `MF` refer to high frequency or medium frequency, respectively. Trips with `HF` will sense and process more often, and are likely to have higher spatiotemporal accuracy (e.g. will hug corners more accurately), albeit with higher power consumption.
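The label convention can be decoded programmatically. The sketch below assumes a layout of two accuracy characters, two frequency characters, a remainder, and a trailing `_<index>`. `parse_run_label` is a hypothetical helper, not part of `emeval`, and the remainder (e.g. `DC`) is returned verbatim because it is not described here.

```python
def parse_run_label(label):
    """Split a label such as 'HAHFDC_0' into its documented parts.

    Assumes the layout <accuracy:2><frequency:2><remainder>_<index>.
    """
    accuracy_codes = {"HA": "high accuracy", "MA": "medium accuracy"}
    frequency_codes = {"HF": "high frequency", "MF": "medium frequency"}
    config, _, index = label.rpartition("_")
    return {
        "accuracy": accuracy_codes.get(config[:2], "unknown"),
        "frequency": frequency_codes.get(config[2:4], "unknown"),
        "remainder": config[4:],  # not interpreted; kept as-is
        "index": int(index),
    }

parsed = parse_run_label("HAMFDC_0")
print(parsed)
```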

In [None]:
for phone_os, eval_map in ev.map("evaluation").items():
    print(15 * "=*")
    print(phone_os, eval_map.keys())
    for (curr_calibrate, curr_calibrate_trip_map) in eval_map.items():
        print(4 * ' ', 15 * "-*")
        print(4 * ' ', curr_calibrate, curr_calibrate_trip_map.keys())
        for trip_id, trip_map in curr_calibrate_trip_map.items():
            print(8 * ' ', 30 * "=")
            print(8 * ' ', trip_id, trip_map.keys())
            for run_label, tr in trip_map.items():
                print(12 * ' ', 30 * "=")
                print(12 * ' ', run_label, tr.keys())
                for sr in tr["evaluation_section_ranges"]:
                    print(16 * ' ', 30 * "~")
                    print(16 * ' ',sr["trip_id"], sr.keys())
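When the nesting depth is not known in advance, a small recursive generator can replace the hand-written nested loops. This is a sketch on an invented toy dictionary, not on the real eval view, and `walk_tree` is a hypothetical helper:

```python
def walk_tree(node, path=()):
    """Yield (path, leaf) pairs for every non-dict value in a nested dict."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk_tree(child, path + (key,))
    else:
        yield path, node

# Toy tree shaped like phone_os -> range label -> trip id; values are invented.
toy_map = {
    "android": {"HAHFDC_0": {"trip_a": 3, "trip_b": 5}},
    "ios": {"HAMFDC_0": {"trip_a": 4}},
}
leaves = list(walk_tree(toy_map))
for path, leaf in leaves:
    print(" / ".join(path), "->", leaf)
```

Lists (such as `evaluation_section_ranges`) are treated as leaves here, so descending into them would need an extra branch.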

### Ground truth

The ground truth is stored in the spec, and we can retrieve it from there. Once we have retrieved the ground truth for a trip, there are many possible analyses that use it. Please see `get_concat_trajectories` for an example.

We can confirm that there are valid timestamp (`ts`) values for each trip range:

In [None]:
for phone_os, df_map in pv_dfs.items():
    for df_label, df in df_map.items():
        print("-"*10, f"{phone_os=}, {df_label=}", "-"*10)
        if "trip_range_id" in df.columns:
            tr_ids = df["trip_range_id_base"].dropna().unique()
            for tr_id in tr_ids:
                tr_id_rows = df.query(f"trip_range_id_base == '{tr_id}'")[["trip_range_id_base", "trip_range_gt_start_ts", "trip_range_gt_end_ts"]].drop_duplicates()
                for _, _, gt_start_ts, gt_end_ts in (tr_id_rows.itertuples()):
                    assert not df.query(f"{gt_start_ts} < ts < {gt_end_ts}")["ts"].isnull().values.any()
        print("Has valid `ts` values for each trip range ✅ ")

As well as each section range:

In [None]:
for phone_os, df_map in pv_dfs.items():
    for df_label, df in df_map.items():
        print("-"*10, f"{phone_os=}, {df_label=}", "-"*10)
        if "section_range_id" in df.columns:
            sr_ids = df["section_range_id_base"].dropna().unique()
            for sr_id in sr_ids:
                sr_id_rows = df.query(f"section_range_id_base == '{sr_id}'")[["section_range_id_base", "section_range_gt_start_ts", "section_range_gt_end_ts"]].drop_duplicates()
                for _, _, gt_start_ts, gt_end_ts in (sr_id_rows.itertuples()):
                    assert not df.query(f"{gt_start_ts} < ts < {gt_end_ts}")["ts"].isnull().values.any()
        print("Has valid `ts` values for each section range ✅ ")
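One property of these checks is worth keeping in mind: an ordering comparison against a missing timestamp (NaN) evaluates to False, so a range filter of the form `start < ts < end` already drops rows whose `ts` is missing. A complementary check is that each range contains at least one point. The sketch below illustrates both with plain Python values. The rows and the `count_in_range` helper are hypothetical and not part of `emeval`.

```python
import math

def count_in_range(rows, start_ts, end_ts):
    """Count rows whose `ts` lies strictly between start_ts and end_ts."""
    return sum(1 for row in rows if start_ts < row["ts"] < end_ts)

# Invented rows; one has a missing timestamp.
rows = [{"ts": 10.0}, {"ts": 15.0}, {"ts": math.nan}, {"ts": 42.0}]

# NaN fails every ordering comparison, so it never passes the range filter.
nan_passes = 0.0 < math.nan < 100.0
in_range = count_in_range(rows, 5.0, 20.0)
print(nan_passes, in_range)
```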

### For trips

Using the phone view dataframes:

In [None]:
tr_subset = ["trip_range_id_base", "trip_range_gt_start_ts", "trip_range_gt_end_ts"]
for phone_os, df_map in pv_dfs.items():
    for df_label, df in df_map.items():
        print("-"*10, f"{phone_os=}, {df_label=}", "-"*10)
        if "trip_range_id" in df.columns:
            tr_info = df[tr_subset].drop_duplicates().dropna()
            for _, tr_id_base, gt_start_ts, gt_end_ts in tr_info.itertuples():
                print(f"{tr_id_base=}, {gt_start_ts=}, {gt_end_ts=}")
                gt_trip = sd.get_ground_truth_for_trip(tr_id_base, gt_start_ts, gt_end_ts)
                print(eisd.SpecDetails.get_concat_trajectories(gt_trip)["properties"], end="\n\n")

### For sections

Same as above:

In [None]:
sr_subset = ["trip_range_id_base", "section_range_gt_start_ts", "section_range_gt_end_ts"]
for phone_os, df_map in pv_dfs.items():
    for df_label, df in df_map.items():
        print("-"*10, f"{phone_os=}, {df_label=}", "-"*10)
        if "section_range_id" in df.columns:
            sr_info = df[sr_subset].drop_duplicates().dropna()
            for _, tr_id_base, gt_start_ts, gt_end_ts in sr_info.itertuples():
                print(f"{tr_id_base=}, {gt_start_ts=}, {gt_end_ts=}")
                gt_trip = sd.get_ground_truth_for_trip(tr_id_base, gt_start_ts, gt_end_ts)
                print(eisd.SpecDetails.get_concat_trajectories(gt_trip)["properties"], end="\n\n")

### Work with a single trip

You can also work with the details of a single trip. Here, we look at the battery drain across phones for the third repetition. The code is inspired by `plot_all_power_drain`, located in `emeval.viz.phone_view`.

In [None]:
ifig, ax = plt.subplots(ncols=1, nrows=1, figsize=(10,5))
for phone_os, df_map in pv_dfs.items():
    battery_df = df_map["battery_df"]
    for i, phone_label in enumerate(battery_df["phone_label"].unique()):
        third_rep = battery_df.query(
            f"phone_label == '{phone_label}' and range_type == 'evaluation' and range_index == 2")
        role = third_rep["role"].unique()[0]
        third_rep.plot(ax=ax, x="hr", y="battery_level_pct", label=f"{phone_label} ({role})", ylim=(0, 100))
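To compare phones numerically as well as visually, one option is to reduce each curve to an average drain rate. The sketch below works on plain `(hr, battery_level_pct)` pairs and assumes that `hr` is elapsed time in hours. `drain_per_hour` is a hypothetical helper and the readings are invented.

```python
def drain_per_hour(readings):
    """Return percentage points lost per hour between first and last reading.

    `readings` is a list of (hr, battery_level_pct) pairs.
    """
    ordered = sorted(readings)
    (start_hr, start_pct), (end_hr, end_pct) = ordered[0], ordered[-1]
    elapsed = end_hr - start_hr
    if elapsed <= 0:
        raise ValueError("need at least two readings at different times")
    return (start_pct - end_pct) / elapsed

# Invented readings for illustration only.
sample = [(0.0, 100.0), (1.0, 96.0), (2.5, 90.0)]
rate = drain_per_hour(sample)
print(rate)
```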

### Work with a single leg

You can also work with the details of a single leg. This is not likely to be useful for power estimates because a single leg has so few points, but a single leg is easier to work with for trajectory estimates.

In [None]:
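# Adjacent string literals are concatenated as-is, so each piece ends with a
# space to keep the clauses of the query expression separated.
bart_leg = pv_dfs["ios"]["location_df"].query(
    "phone_label == 'ucb-sdb-ios-1' "
    "and range_type == 'evaluation' "
    "and range_index == 2 "
    "and trip_range_index == 0 "
    "and section_range_index == 5"
)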

bart_leg

In [None]:
gt_subset = ["trip_range_id_base", "section_range_id_base", "section_range_gt_start_ts", "section_range_gt_end_ts"]

tr_id_base, sr_id_base, gt_start_ts, gt_end_ts = tuple(bart_leg[gt_subset].drop_duplicates().iloc[0].values)

gt_leg = sd.get_ground_truth_for_leg(tr_id_base, sr_id_base, gt_start_ts, gt_end_ts)

gt_leg

#### Display the leg

Note the layer control on the map, which allows you to toggle the lines separately.

In [None]:
curr_map = folium.Map()
gt_leg_gj = sd.get_geojson_for_leg(gt_leg)
sensed_section_gj = ezgj.get_geojson_for_loc_df(bart_leg)
gt_leg_gj_feature = folium.GeoJson(gt_leg_gj, name="ground_truth")
sensed_leg_gj_feature = folium.GeoJson(sensed_section_gj, name="sensed_values")
curr_map.add_child(gt_leg_gj_feature)
curr_map.add_child(sensed_leg_gj_feature)
curr_map.fit_bounds(sensed_leg_gj_feature.get_bounds())
folium.LayerControl().add_to(curr_map)
curr_map
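GeoJSON stores coordinates as `[longitude, latitude]`, while `fit_bounds` expects `[south-west, north-east]` corners as `[latitude, longitude]` pairs. If you need bounds for raw coordinates rather than for a folium feature, a minimal sketch looks like this. `line_bounds` is a hypothetical helper and the coordinates are invented.

```python
def line_bounds(coordinates):
    """Return [[south, west], [north, east]] for GeoJSON [lon, lat, ...] points."""
    lons = [point[0] for point in coordinates]
    lats = [point[1] for point in coordinates]
    return [[min(lats), min(lons)], [max(lats), max(lons)]]

# Invented LineString coordinates in GeoJSON order: [longitude, latitude].
coords = [[-122.27, 37.87], [-122.26, 37.85], [-122.25, 37.86]]
bounds = line_bounds(coords)
print(bounds)
```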

#### Display the leg with points

In this case, the points are in a separate layer so they can be toggled independently of the underlying lines.

In [None]:
curr_map = folium.Map()
gt_leg_gj = sd.get_geojson_for_leg(gt_leg)
sensed_section_gj = ezgj.get_geojson_for_loc_df(bart_leg)
gt_leg_gj_feature = folium.GeoJson(gt_leg_gj, name="ground_truth")
gt_leg_gj_points = ezgj.get_point_markers(gt_leg_gj[2], name="ground_truth_points", color="green")
sensed_leg_gj_feature = folium.GeoJson(sensed_section_gj, name="sensed_values")
sensed_leg_gj_points = ezgj.get_point_markers(sensed_section_gj, name="sensed_points", color="red")
curr_map.add_child(gt_leg_gj_feature)
curr_map.add_child(gt_leg_gj_points)
curr_map.add_child(sensed_leg_gj_feature)
curr_map.add_child(sensed_leg_gj_points)
curr_map.fit_bounds(sensed_leg_gj_feature.get_bounds())
folium.LayerControl().add_to(curr_map)
curr_map