In this notebook, we will attempt to federate multiple datasets, so that we can put the results into context.
The datasets used are:
- CanBikeCO mini-pilot
- NREL location history
- CanBikeCO staging

This notebook assumes that the datasets are loaded into separate Docker containers with ports exposed at 27017, 27018 and 27019. It relies on a new commit that allows the database connection to be reloaded (used below via `esta.TimeSeries._reset_url`).
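
Before running the cells below, it can help to confirm that each container is actually reachable. The following is a minimal sketch using only the standard library. The `is_port_open` helper and the `PROGRAM_PORTS` mapping are hypothetical names introduced for illustration, and the port numbers are assumptions (in particular, that the first container uses MongoDB's default port 27017), so adjust them to match your own setup.

```python
import socket

# Assumed mapping of dataset -> exposed port; adjust to your own setup.
PROGRAM_PORTS = {"minipilot": 27017, "nrel_lh": 27018, "stage": 27019}

def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False

for program, port in PROGRAM_PORTS.items():
    status = "reachable" if is_port_open("localhost", port) else "NOT reachable"
    print(f"{program}: localhost:{port} is {status}")
```

A successful TCP connection only shows that something is listening on the port, not that the expected dataset has finished loading.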

Note that I had to bump up my docker resource limits to 200GB of disk space and 20GB of RAM to get this to work.
With the previous 50GB and 2GB limits, the containers crashed consistently.

Because of the high resource requirements for this notebook, and the fact that we are not currently using trajectories for this analysis, we will simply save the dataframe to a JSON file for now. The real analysis can read that file and move on from there. This will make it easier for others (aka interns) to run the analysis scripts, improve the outputs and generate results.

This doesn't need to be a notebook, but I will leave it as one for now since all the other top-level scripts here are notebooks.

In [None]:
import pandas as pd

In [None]:
import emission.core.get_database as edb

In [None]:
import emission.storage.timeseries.abstract_timeseries as esta
import emission.storage.decorations.trip_queries as esdtq

In [None]:
all_expanded_df = pd.DataFrame()

In [None]:
def get_expanded_df_list(uuid_list):
    """Return one expanded labeled-trip dataframe per user who has confirmed trips."""
    expanded_df_list = []

    # Keep only users with at least one confirmed trip in the analysis timeseries
    def has_confirmed_trips(u):
        return edb.get_analysis_timeseries_db().count_documents(
            {"user_id": u, "metadata.key": "analysis/confirmed_trip"}) > 0

    valid_user_list = list(filter(has_confirmed_trips, uuid_list))
    print(f"After filtering, went from {len(uuid_list)} -> {len(valid_user_list)}")
    for u in valid_user_list:
        ts = esta.TimeSeries.get_time_series(u)
        ct_df = ts.get_data_df("analysis/confirmed_trip")
        print(u, len(ct_df))
        # Drop unlabeled trips, then expand the user inputs
        lt_df = esdtq.filter_labeled_trips(ct_df)
        expanded_df_list.append(esdtq.expand_userinputs(lt_df))
    return expanded_df_list

In [None]:
def get_program_df(program, uuid_list):
    # Combine the per-user dataframes and tag every row with the program name
    program_expanded_df = pd.concat(get_expanded_df_list(uuid_list))
    program_expanded_df["program"] = pd.Categorical([program] * len(program_expanded_df))
    return program_expanded_df
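
The tagging pattern used by `get_program_df` can be illustrated on toy data without a database. This is only a sketch: the `tag_program` helper and the sample rows are hypothetical. It also shows one thing to be aware of: when pandas concatenates categorical columns whose categories differ, the result is no longer categorical, so the column needs to be converted again if a categorical is wanted after federation.

```python
import pandas as pd

def tag_program(df, program):
    """Return a copy of df with a categorical 'program' column (hypothetical helper)."""
    tagged = df.copy()
    tagged["program"] = pd.Categorical([program] * len(tagged))
    return tagged

# Toy stand-ins for the per-program expanded trip dataframes
minipilot_df = tag_program(pd.DataFrame({"user_id": ["a", "a", "b"]}), "minipilot")
stage_df = tag_program(pd.DataFrame({"user_id": ["c"]}), "stage")

combined = pd.concat([minipilot_df, stage_df])
print(combined["program"].dtype)   # no longer categorical after the concat

# Restore the categorical dtype once all programs have been appended
combined["program"] = combined["program"].astype("category")
print(combined.groupby("program", observed=True)["user_id"].nunique())
```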

In [None]:
all_expanded_df = pd.concat([all_expanded_df, get_program_df("minipilot", esta.TimeSeries.get_uuid_list())])
all_expanded_df.tail()

In [None]:
esta.TimeSeries._reset_url("localhost:27018")

In [None]:
all_expanded_df = pd.concat([all_expanded_df, get_program_df("nrel_lh", esta.TimeSeries.get_uuid_list())])
all_expanded_df.tail()

In [None]:
edb.get_profile_db().distinct("client")

In [None]:
esta.TimeSeries._reset_url("localhost:27019")

In [None]:
all_expanded_df = pd.concat([all_expanded_df, get_program_df("stage", esta.TimeSeries.get_uuid_list())])
all_expanded_df.tail()

In [None]:
print(all_expanded_df[all_expanded_df.program == "minipilot"].user_id.nunique(),
      all_expanded_df[all_expanded_df.program == "nrel_lh"].user_id.nunique(),
      all_expanded_df[all_expanded_df.program == "stage"].user_id.nunique())

In [None]:
all_expanded_df.reset_index(inplace=True)

In [None]:
all_expanded_df.head()

In [None]:
all_expanded_df.columns

In [None]:
import bson.json_util as bju

In [None]:
all_expanded_df.to_json("/tmp/federated_trip_only_dataset.json", orient="records", default_handler=bju.default)
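
A downstream analysis script could then load the saved file without any database access. The following is a minimal sketch on toy data, assuming the file was written with `orient="records"` as above. The temporary path and the sample columns are hypothetical. Values that `bson.json_util` encodes in a special form (such as ObjectIds) may come back as nested structures rather than plain strings, so it is worth inspecting the columns after loading.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the federated dataframe (hypothetical columns and values)
toy_df = pd.DataFrame({
    "user_id": ["a", "b", "c"],
    "program": ["minipilot", "nrel_lh", "stage"],
    "distance": [1200.5, 830.0, 4100.25],
})

# Hypothetical temporary location; the notebook itself writes to /tmp
path = os.path.join(tempfile.mkdtemp(), "federated_trip_only_dataset.json")
toy_df.to_json(path, orient="records")

# Read the records back into a dataframe for analysis
loaded_df = pd.read_json(path, orient="records")
print(loaded_df.groupby("program")["distance"].mean())
```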