Author: Wen Zhang

Timeline: 
1. 06.27.2023 - 07.10.2023 Coding EC
2. 07.10.2023 - 07.11.2023 Coding CE
3. 07.12.2023 - 07.14.2023 Coding Mode count
4. 07.17.2023 - 07.20.2023 Coding Mode shared by distance
5. 07.24.2023   Coding Mode count/distance new method 
6. 07.25.2023 - 07.28.2023 Analysing variance of Mode distance

Sections of this notebook:
1. Imports: importing the classes from other files
2. Loading data: loading the data from docker
3. Running model and getting the testing and validation dataset and their predictions
4. Establishing each user's confusion matrix
5. Calculating mean and variance of energy intensity, carbon intensity, energy consumption and carbon emission
6. Plotting the results of energy consumption and carbon emission
7. Mode count and Mode shared by distance


### Imports

In [None]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
from uuid import UUID
import os

import matplotlib.pyplot as plt

import sys 
sys.path.append(os.path.abspath(os.path.join(os.getcwd(),"../..")) + '/e-mission-server')
import emission.storage.timeseries.abstract_timeseries as esta
import emission.storage.decorations.trip_queries as esdtq

sys.path.append(os.path.abspath(os.path.dirname(os.getcwd())) + "/TRB_label_assist")
import performance_eval
import models
sys.path.append(os.path.abspath(os.path.dirname(os.getcwd())) + '/Error_bars/Public_Dashboard/auxiliary_files')


sys.path.append(os.path.abspath(os.path.dirname(os.getcwd())) + '/Error_bars')
import confusion_matrix_handling as cm_handling
import get_EC,helper_functions
import uuid
import math


In [2]:
# Set the display options to show more rows and columns
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

### Loading data

In [2]:
all_users = esta.TimeSeries.get_uuid_list()
confirmed_trip_df_map = {}
labeled_trip_df_map = {}
expanded_labeled_trip_df_map = {}
expanded_all_trip_df_map = {}


In [3]:
# Load the data from Docker (takes about 5-6 minutes)
for u in all_users:
    ts = esta.TimeSeries.get_time_series(u)
    ct_df = ts.get_data_df("analysis/confirmed_trip")

    confirmed_trip_df_map[u] = ct_df
    labeled_trip_df_map[u] = esdtq.filter_labeled_trips(ct_df)
    expanded_labeled_trip_df_map[u] = esdtq.expand_userinputs(
        labeled_trip_df_map[u])
    expanded_all_trip_df_map[u] = esdtq.expand_userinputs(
        confirmed_trip_df_map[u])

In [None]:
# Length of the total dataset (number of labeled trips across all users)
total_len = sum(len(expanded_labeled_trip_df_map[u]) for u in all_users)
total_len

Check how many labeled/unlabeled trips there are:

In [None]:
n_trips_df = pd.DataFrame(
    [[u, len(confirmed_trip_df_map[u]),
      len(labeled_trip_df_map[u])] for u in all_users],
    columns=["user_id", "all_trips", "labeled_trips"])

all_trips = n_trips_df.all_trips.sum()
labeled_trips = n_trips_df.labeled_trips.sum()
unlabeled_trips = all_trips - labeled_trips
n_users = len(n_trips_df)

# '{:.2%}' multiplies the fraction by 100 and appends the percent sign
print('{} ({:.2%}) unlabeled, {} ({:.2%}) labeled, {} total trips'.format(
    unlabeled_trips, unlabeled_trips / all_trips, labeled_trips,
    labeled_trips / all_trips, all_trips))

### Running model and getting the testing and validation dataset and their predictions

The following cell loads the cross-validation results for the listed models. 
The original TRB_label_assist/performance_eval.py contains a list of models. I commented out all of them except
"random forests (coordinates)", because it is the best model we found and we now only need to evaluate the downstream metrics based on that model.

'cv_results' contains the results of "random forests (coordinates)", split into a testing dataset and a validation dataset.
The validation dataset contains 20% of the total data and the testing dataset contains the remaining 80%.
The testing dataset can cover the full 80% because we use k-fold cross-validation over the training and testing data.
For example, if the total dataset contains 100 trips, we split it into temp_dataset (80%) and a validation dataset (20%). We then split temp_dataset into 4 parts of 20 trips each, labeled 1, 2, 3 and 4. 
1. First fold: part 1 is the testing dataset and parts 2, 3, 4 are the training dataset. 
2. Second fold: part 2 is the testing dataset and parts 1, 3, 4 are the training dataset.  
3. Third fold: part 3 is the testing dataset and parts 1, 2, 4 are the training dataset.  
4. Fourth fold: part 4 is the testing dataset and parts 1, 2, 3 are the training dataset.  
After 4-fold cross-validation, we have a prediction for every trip in temp_dataset, so we treat this 80% of the total dataset as the test dataset.
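
The split described above can be sketched with plain NumPy. This is only an illustration of the idea, not the code in performance_eval.py: the function name `split_validation_and_folds` and its parameters are hypothetical, and the real pipeline may shuffle or stratify differently.

```python
import numpy as np

def split_validation_and_folds(n_trips, k=4, validation_frac=0.2, seed=42):
    """Return (validation_idx, folds): a held-out validation set plus k test folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_trips)
    n_validation = int(round(n_trips * validation_frac))
    validation_idx = shuffled[:n_validation]   # the 20% validation dataset
    temp_idx = shuffled[n_validation:]         # the 80% temp_dataset
    folds = np.array_split(temp_idx, k)        # k nearly equal parts
    return validation_idx, folds

validation_idx, folds = split_validation_and_folds(100, k=4)

for i, test_idx in enumerate(folds):
    # Train on the other k-1 parts, predict on part i.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"fold {i + 1}: train={len(train_idx)}, test={len(test_idx)}")

# Each trip in temp_dataset appears in exactly one test fold, so the union
# of the test folds covers the whole 80%.
all_test_idx = np.concatenate(folds)
```

Because every trip in temp_dataset is predicted by a model that did not see it during training, the concatenated fold predictions can be used as one test dataset.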

Note: If the cross-validation results for the model have already been generated, they are loaded from the CSV file to avoid the time-consuming process of re-running the cross-validation. Otherwise, the cross-validation runs from scratch. (This behavior is controlled by the override_prior_runs parameter: if True, existing CSV files are ignored and the cross-validation is re-run from scratch.)
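
A minimal sketch of this load-or-recompute pattern is shown below. It is not the implementation in performance_eval.py: `load_or_run`, `fake_cross_validation` and the file name are hypothetical stand-ins, and only the `override_prior_runs` flag comes from the notebook.

```python
import os
import tempfile
import pandas as pd

def load_or_run(csv_path, run_fn, override_prior_runs=False):
    """Load cached results from csv_path, or call run_fn() and cache its output."""
    if os.path.exists(csv_path) and not override_prior_runs:
        return pd.read_csv(csv_path)
    result_df = run_fn()
    result_df.to_csv(csv_path, index=False)
    return result_df

calls = []  # records how many times the expensive step actually ran

def fake_cross_validation():
    # Hypothetical stand-in for the slow cross-validation run.
    calls.append(1)
    return pd.DataFrame({"mode_true": ["walk", "bike"],
                         "mode_pred": ["walk", "walk"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cv_results.csv")
    first = load_or_run(path, fake_cross_validation)   # runs and writes the CSV
    second = load_or_run(path, fake_cross_validation)  # loads the CSV, no re-run
    third = load_or_run(path, fake_cross_validation,
                        override_prior_runs=True)      # ignores the CSV, re-runs
```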

In [None]:
# Doing 4-fold cross validation for the temp dataset, k=4. 
# Also, getting the testing and validation dataset and their predictions

model_names = list(performance_eval.PREDICTORS.keys())
cv_results = performance_eval.cv_for_all_algs(
    uuid_list=all_users,
    expanded_trip_df_map=expanded_labeled_trip_df_map,
    model_names=model_names,
    override_prior_runs=False,
    k=4, # 4-fold 
    raise_errors=False,
    random_state=42,
)

In [20]:
# cv_results contains test_trips and validation_trips
RFc_df = pd.DataFrame(cv_results['random forests (coordinates)'])

# get distance in 'miles'
METERS_TO_MILES = 0.000621371 # 1 meter = 0.000621371 miles
RFc_df['distance_miles'] = RFc_df.distance*METERS_TO_MILES

# get distance in 'km'
RFc_df['distance_km'] = RFc_df.distance/1000

# get validation_trips
validation_trips = RFc_df[RFc_df['dataset'] == 'validation_dataset']

# get test_trips
test_trips = RFc_df[RFc_df['dataset'] != 'validation_dataset']

#### The three cells below are an attempt at combined confusion matrices (can be ignored)

In [21]:
def get_user_confusion_matrices_dic(mode_results_df):
    attribute = 'distance'
        
    user_confusion_matrices_dic0 = {}

    grouped_user = mode_results_df.groupby('user_id')
    for user_id, group_df in grouped_user:
        predicted_values = group_df['predicted_value']
        true_values = group_df['true_value']
        if attribute == 'distance':
            sample_weight = group_df['distance_km']
            confusion_matrix = pd.crosstab(true_values, predicted_values, sample_weight, aggfunc='sum') # weight: trip distance
        elif attribute == 'duration':
            sample_weight = group_df['duration']
            confusion_matrix = pd.crosstab(true_values, predicted_values, sample_weight, aggfunc='sum') # weight: trip duration
        else: # attribute: 'tripCount'
            confusion_matrix = pd.crosstab(true_values, predicted_values) # weight: trip count

        confusion_matrix[confusion_matrix.isnull()] = 0

        user_confusion_matrices_dic0[user_id] = confusion_matrix
        
    return user_confusion_matrices_dic0

In [22]:

def is_square_matrix(matrix):
    # Get the number of rows and columns of the matrix
    num_rows, num_cols = matrix.shape
    
    # Check if the matrix is square (number of rows equals number of columns)
    return num_rows == num_cols

In [23]:
# Create multiple confusion matrices based on the k-fold results.
# Create the combined confusion matrices.
# Build nn_test_trips, which contains the test trips of the users whose summed_matrix is N*N.
test_trip_user_list = list(test_trips['user_id'].unique())
not_square_matrix_num = 0
square_matrix_num = 0
nn_test_trips = pd.DataFrame()
valid_sum_count = 0
for user in test_trip_user_list:
    user_trips = test_trips[test_trips['user_id']==user]
    user_validation_trips = validation_trips[validation_trips['user_id']==user]
    user_validation_trips_pred_mode = list(user_validation_trips.mode_pred.unique())

    grouped = user_trips.groupby('fold_number_list')
    grouped_dataframes = {}
    for group_name, group_data in grouped:
        grouped_dataframes[group_name] = group_data

    user_confusion_matrices_dic0 = {}
    user_confusion_matrices_dic1 = {}
    user_confusion_matrices_dic2 = {}
    user_confusion_matrices_dic3 = {}

    for gd_index in range(len(grouped_dataframes)): # len(grouped_dataframes) = 4
        mode_results = {}
        # get results
        results = performance_eval.get_clf_metrics(grouped_dataframes[gd_index],
                                    "mode",
                                    # weight='distance',
                                    keep_nopred=True,
                                    ignore_custom=False)

        mode_results['predicted_value'] = results['label_pred']
        mode_results['true_value'] = results['label_true']
        mode_results['user_id'] = results['user_id']
        mode_results['duration'] = results['duration']
        mode_results['distance_meter'] = results['trip_dists']
        mode_results['distance_km'] = results['trip_dists']/1000

        METERS_TO_MILES = 0.000621371 # 1 meter = 0.000621371 miles
        mode_results['distance_miles'] = results['trip_dists']*METERS_TO_MILES
        mode_results_df = pd.DataFrame(mode_results)
        
        if gd_index==0:
            user_confusion_matrices_dic0 = get_user_confusion_matrices_dic(mode_results_df)

        if gd_index==1:
            user_confusion_matrices_dic1 = get_user_confusion_matrices_dic(mode_results_df)

        if gd_index==2:
            user_confusion_matrices_dic2=get_user_confusion_matrices_dic(mode_results_df)

        if gd_index==3:
            user_confusion_matrices_dic3=get_user_confusion_matrices_dic(mode_results_df)

    CM0 = user_confusion_matrices_dic0.get(user)
    CM1 = user_confusion_matrices_dic1.get(user)

    CM2 = user_confusion_matrices_dic2.get(user)
    CM3 = user_confusion_matrices_dic3.get(user)

    summed_matrix = pd.DataFrame()
    mean_matrix = pd.DataFrame()
    median_matrix = pd.DataFrame()

    all_matrices = [CM0, CM1, CM2, CM3]

    for matrix in all_matrices:
        summed_matrix = summed_matrix.add(matrix, fill_value=0)

    mean_matrix = summed_matrix / len(all_matrices)

    combined_matrix = pd.concat(all_matrices)
    median_matrix = combined_matrix.groupby(level=0).median()


    mean_matrix = mean_matrix.fillna(0)
    median_matrix = median_matrix.fillna(0)
    summed_matrix = summed_matrix.fillna(0)
    CM0 =CM0.fillna(0)
    CM1 =CM1.fillna(0)
    CM2 =CM2.fillna(0)
    CM3 =CM3.fillna(0)

    if is_square_matrix(summed_matrix) and list(summed_matrix.index)==list(summed_matrix.columns) and set(summed_matrix.index).issubset(set(user_validation_trips_pred_mode)):
        square_matrix_num +=1
        # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
        nn_test_trips = pd.concat([nn_test_trips, user_trips])
        if len(set(summed_matrix.index))<len(set(user_validation_trips_pred_mode)):
            valid_sum_count += 1
        
    else:
        not_square_matrix_num +=1
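
The combination step in the cell above relies on `DataFrame.add(..., fill_value=0)` to align per-fold matrices whose mode labels differ. The toy example below is a sketch with made-up numbers (the two matrices and the mode names are hypothetical) showing how the summed matrix ends up on the union of the labels.

```python
from functools import reduce
import pandas as pd

# Hypothetical per-fold confusion matrices for one user
# (rows: true mode, columns: predicted mode).
cm_fold_0 = pd.DataFrame([[5.0, 1.0], [0.0, 3.0]],
                         index=["bike", "walk"], columns=["bike", "walk"])
cm_fold_1 = pd.DataFrame([[2.0, 0.0], [1.0, 4.0]],
                         index=["bike", "car"], columns=["bike", "car"])
fold_matrices = [cm_fold_0, cm_fold_1]

# add() aligns on the union of labels; fill_value=0 treats a cell that is
# missing in one matrix as 0. Cells missing in both stay NaN, hence fillna(0).
summed_matrix = reduce(lambda a, b: a.add(b, fill_value=0), fold_matrices).fillna(0)
mean_matrix = summed_matrix / len(fold_matrices)

# The combined matrix is N*N only if rows and columns carry the same labels.
is_nn = list(summed_matrix.index) == list(summed_matrix.columns)
```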



In [31]:
# get the user_id list in the test list
# if we want to use nn_test_trips, we need to change the below code 'test_trips' to 'nn_test_trips'.
test_trips_user_list = list(test_trips.user_id.unique())

#### clean validation_trips and add device information

In [32]:
# clean validation_trips 
validation_trips = validation_trips.reset_index(drop=True)
validation_trips = validation_trips.rename(columns={"mode_initial": "mode_confirm"})

# add device information for validation_trips. device can be 'ios' or 'android'
validation_trips['os'] = ['ios' if x == 'DwellSegmentationDistFilter' else 'android' for x in validation_trips['source']]
validation_trips['user_id'] = validation_trips['user_id'].astype(str)


In [None]:
# To analyze only a subset of the users in test_trips, filter validation_trips to those users as well.
# validation_trips['user_id'] was cast to str above, so compare against string ids.
validation_trips = validation_trips[validation_trips['user_id'].isin([str(u) for u in test_trips_user_list])]

### Establishing each user's confusion matrix

In [None]:
# Regularize the form of testing dataset
mode_results = {}

for model_name in cv_results.keys(): # cv_results holds only one model's results, because we only use the best model, random forests (coordinates)
    print(f'now getting: {model_name}')
    # get 'label_pred', 'label_true', 'user_id','duration', 'trip_dists' of test_trips
    results = performance_eval.get_clf_metrics(test_trips,
                                "mode",
                                keep_nopred=True,
                                ignore_custom=False)

    mode_results['predicted_value'] = results['label_pred']
    mode_results['true_value'] = results['label_true']
    mode_results['user_id'] = results['user_id']
    mode_results['duration'] = results['duration']
    mode_results['distance_meter'] = results['trip_dists']

    # Get distance in miles
    METERS_TO_MILES = 0.000621371 # 1 meter = 0.000621371 miles
    mode_results['distance_miles'] = results['trip_dists']*METERS_TO_MILES

    # Get distance in km
    mode_results['distance_km'] = results['trip_dists']/1000


In [42]:
# save the prediction and the ground truth of the testing dataset
pd.DataFrame(mode_results).to_csv("CSVs/compare_true_pred_mode.csv")

In [None]:
# read the prediction and the ground truth of the testing dataset
mode_df = pd.DataFrame(pd.read_csv("CSVs/compare_true_pred_mode.csv"))
mode_df = mode_df.drop(['Unnamed: 0'], axis=1)

In [47]:
# Build a confusion matrix for each user based on the prediction and the ground truth of 'attribute'
# 'attribute' can be 'distance', 'duration' and 'tripCount'. The value of 'attribute' will also show in the plot and the file name (later we will save some files).
attribute = 'distance'

In [None]:
# get each user's confusion matrices and store in a dictionary
user_confusion_matrices_dic = {}

grouped_user = mode_df.groupby('user_id')
for user_id, group_df in grouped_user:
    predicted_values = group_df['predicted_value']
    true_values = group_df['true_value']
    if attribute == 'distance':
        sample_weight = group_df['distance_km']
        confusion_matrix = pd.crosstab(true_values, predicted_values, sample_weight, aggfunc='sum') # weight: trip distance
    elif attribute == 'duration':
        sample_weight = group_df['duration']
        confusion_matrix = pd.crosstab(true_values, predicted_values, sample_weight, aggfunc='sum') # weight: trip duration
    else: # attribute: 'tripCount'
        confusion_matrix = pd.crosstab(true_values, predicted_values) # weight: trip count

    confusion_matrix[confusion_matrix.isnull()] = 0
    user_confusion_matrices_dic[user_id] = confusion_matrix
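
To make the effect of the `attribute` setting concrete, here is a small sketch on made-up trips (the data frame `toy_df` is hypothetical). With `values` and `aggfunc='sum'`, each cell holds the total distance of the trips in that (true, predicted) pair; without them, each cell holds the trip count.

```python
import pandas as pd

# Hypothetical trips for one user.
toy_df = pd.DataFrame({
    "true_value":      ["walk", "walk", "bike", "bike", "car"],
    "predicted_value": ["walk", "bike", "bike", "bike", "car"],
    "distance_km":     [1.0,    0.5,    3.0,    2.0,    10.0],
})

# Distance-weighted confusion matrix: cells are summed kilometres.
distance_cm = pd.crosstab(toy_df["true_value"], toy_df["predicted_value"],
                          values=toy_df["distance_km"], aggfunc="sum").fillna(0)

# Count-based confusion matrix: cells are numbers of trips.
count_cm = pd.crosstab(toy_df["true_value"], toy_df["predicted_value"])

print(distance_cm)
print(count_cm)
```

In this toy data the single mislabeled trip (a 0.5 km walk predicted as bike) is one of five trips by count but only a small share of the total distance, which is why the choice of weight can change the downstream estimates.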
    


### Calculating mean and variance of energy intensity, carbon intensity, energy consumption and carbon emission

In [49]:
# get mean and variance of different devices(unit_distance_MCS.csv)
unit_dist_MCS_df = pd.read_csv(os.path.abspath(os.path.dirname(os.getcwd())) + '/Error_bars/unit_distance_MCS.csv').set_index("moment")

# get energy intensity file (energy_intensity.csv)
df_EI = pd.read_csv(os.path.abspath(os.path.dirname(os.getcwd())) + '/Error_bars/Public_Dashboard/auxiliary_files/energy_intensity.csv')

unit_distance_MCS.csv:

| moment | android   | ios   |
|------|------|------|
| mean | ...| ...|
| var  | ...| ... |

energy_intensity.csv:

|mode|fuel|(kWH)/trip|EI(kWH/PMT)|energy_intensity_factor|energy_intensity_units|CO2_factor|CO2_factor_units|
|------|------|------|------|------|------|------|------|
|"Gas Car, drove alone"|gasoline|0||...|BTU/PMT|...|lb_CO2/MMBTU|
|...|...|...|...|...|...|...|...|

In [50]:
def get_elt_with_errors(valid_trips_user, user_moments_df, unit_dist_MCS_df, elt_with_errors_all, intensity_dict):
    '''
        Inputs:
            valid_trips_user: one user's validation dataset
            user_moments_df: one user's mean and variance of the energy intensity or carbon intensity
            unit_dist_MCS_df: mean and variance of the devices error
            elt_with_errors_all: an empty DataFrame
            intensity_dict: dictionary by mode of energy intensities or by mode of carbon intensities

        Outputs:
            1. dictionary contains:
            (1) dif_expected_user_laberd_mean: difference between the expected mean and the user-labeled mean
            (2) expected_mean: expected mean
            (3) user_labeled_mean: user-labeled mean
            (4) all_mode_expected_SD_EC: standard deviation over all of this user's trips. The variance of each mode is calculated first, then the variances of all modes are combined.
            2. elt_with_errors_all: each trip's information, e.g. expected value, user-labeled value, variance of the expected value, variance of the user-labeled value, and so on.
    '''

    expected = []
    user_labeled = []
    confusion_based_variance = []
    user_based_variance = []
    expected_error_list = []

    EI_length_covariance = 0
    
    # iterate each trip of this user
    for _,ct in valid_trips_user.iterrows():
        # Calculate expected energy consumption
        ct["section_modes"] =  [ct["mode_true"]]

        # Assign the device-specific mean and variance to this trip according to the OS it was recorded on.
        ct["section_distances"] =  [ct["distance"]]
        if ct['os']=='ios':
            ios_EI_moments = user_moments_df
            android_EI_moments = pd.DataFrame()
        elif ct['os']=='android':
            android_EI_moments = user_moments_df
            ios_EI_moments = pd.DataFrame()

        # calculate the mean and the variance of the energy consumption and carbon emission. 
        # get_EC.get_expected_EC_for_one_trip() was created for EC, but it also works for CE. Same logic, the only difference is that user_moments_df comes from EI or CI.
        trip_expected, trip_confusion_based_variance = get_EC.get_expected_EC_for_one_trip(ct,unit_dist_MCS_df,android_EI_moments,ios_EI_moments, EI_length_covariance)

        # Calculate the mean and the variance of the user labeled energy consumption
        trip_user_labeled, trip_user_based_variance = get_EC.get_user_labeled_EC_for_one_trip(ct,unit_dist_MCS_df,intensity_dict)

        expected.append(trip_expected)
        user_labeled.append(trip_user_labeled)

        confusion_based_variance.append(trip_confusion_based_variance)
        user_based_variance.append(trip_user_based_variance)
            
        expected_error = trip_expected - trip_user_labeled

        expected_error_list.append(expected_error)
        # if (abs(expected_error) > 100):
        #     print(f"Large EC error: EC user labeled, EC expected: {trip_user_labeled:.2f}, {trip_expected:.2f}")
        #     print(f"\tTrip info: mode_confirm,sensed,distance (mi): {ct['mode_confirm'],ct['section_modes']},{ct['distance']*METERS_TO_MILES:.2f}")

    total_expected = sum(expected)
    total_user_labeled = sum(user_labeled)

    percent_error_expected = helper_functions.relative_error(total_expected,total_user_labeled)*100

    # Append the values to expanded_labeled_trips
    elt_with_errors = valid_trips_user.copy()  # elt: expanded labeled trips
    elt_with_errors['error_for_confusion'] = expected_error_list
    elt_with_errors['expected'] = expected
    elt_with_errors['user_labeled'] = user_labeled

    # Append variances
    elt_with_errors['confusion_var'] = confusion_based_variance
    elt_with_errors['user_var'] = user_based_variance
    elt_with_errors['confusion_sd'] = np.sqrt(np.array(confusion_based_variance))
    elt_with_errors['user_sd'] = np.sqrt(np.array(user_based_variance))

    # expected mean, user-labeled mean (both computed as sums over this user's trips)   # e.g. (0.22, 0.4)
    expected_mean = elt_with_errors.expected.sum()
    user_labeled_mean = elt_with_errors.user_labeled.sum()
    
    # difference between the expected mean and the user-labeled mean
    dif_expected_user_laberd_mean =  expected_mean - user_labeled_mean # e.g. 0.18
   
    os_EI_moments_map = {'ios': user_moments_df, 'android': user_moments_df} # assume the trips in the train and test dataset use same OS 
    valid_trips_user['primary_mode'] = valid_trips_user['mode_true']
    all_mode_expected_variance_EC = get_EC.compute_aggregate_variance_by_primary_mode(valid_trips_user, os_EI_moments_map, unit_dist_MCS_df)

    # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
    elt_with_errors_all = pd.concat([elt_with_errors_all, elt_with_errors])

    # standard deviation of expected   # e.g. 0.17

    return {'dif_expected_user_laberd_mean': dif_expected_user_laberd_mean, 'expected_mean': expected_mean, 'user_labeled_mean': user_labeled_mean,
            'all_mode_expected_SD_EC': np.sqrt(all_mode_expected_variance_EC)}, elt_with_errors_all

    # # standard deviation of user_laberd   # e.g. 0.104
    # print(np.sqrt(elt_with_errors.user_var.sum()))

### Calculate expected metrics and uncertainties

In [52]:
def user_elt_with_errors_method(intensity_dict, user_confusion_matrices_dic, unit_dist_MCS_df, validation_trips):

    ''' Inputs:
            intensity_dict: energy/carbon intensity by mode
            user_confusion_matrices_dic: dictionary containing every user's confusion matrix 
            unit_dist_MCS_df: mean and variance of different devices (ios and android)
            validation_trips: validation dataset 

        Returns:
            user_elt_with_errors and elt_with_errors_all
            user_elt_with_errors is a dictionary. The key is the user's id; the value is a dictionary which contains:
                (1) dif_expected_user_laberd_mean: difference between the expected mean and the user-labeled mean
                (2) expected_mean: expected mean
                (3) user_labeled_mean: user-labeled mean
                (4) all_mode_expected_SD_EC: standard deviation over all of this user's trips. The variance of each mode is calculated first, then the variances of all modes are combined.
                
            elt_with_errors_all is a DataFrame containing all trips of all users. Each row is a trip with its information, e.g. expected value, user-labeled value, variance of the expected value, variance of the user-labeled value, and so on.        
    '''

    user_elt_with_errors = {}
    elt_with_errors_all = pd.DataFrame()
    
    # iterate each user's confusion matrix.
    for user_id, confusion_matrix in user_confusion_matrices_dic.items():
        # confusion_matrix_df is each user's confusion matrix, predicted_value VS true_value
        confusion_matrix_df = pd.DataFrame(confusion_matrix) 

        # get the mean and variance of the expected energy/carbon intensity based on the confusion matrix and the ground-truth energy/carbon intensity
        user_EI_moments_df = cm_handling.get_conditional_EI_expectation_and_variance(confusion_matrix_df, intensity_dict) 

        # get user's validation dataset.
        valid_trips_user = validation_trips[validation_trips['user_id']== user_id]
        
        user_elt_with_errors[user_id], elt_with_errors_all = get_elt_with_errors(valid_trips_user, user_EI_moments_df, unit_dist_MCS_df, elt_with_errors_all, intensity_dict)
        
    return user_elt_with_errors, elt_with_errors_all

In [53]:
# Calculating carbon intensity
def get_carbon_dict(energy_carbon_dataframe):
    '''
    energy_carbon_dataframe: dataframe based on energy_intensity.csv
    units: lb/PMT

    Returns a dictionary by mode of carbon intensity in lb/PMT
    '''
    carbon_dict = {}
    for _,row in energy_carbon_dataframe.iterrows():
        # if BTU/PMT -> MMBTU/PMT -> lb_CO_2/PMT (diesel or gas)
        # else kWH/PMT -> MWH/PMT -> lb_CO_2/PMT (electric)
        if row["fuel"] not in ["electric","human_powered"] :
            carbon_intensity_lb = row["energy_intensity_factor"] * 0.000001 * row["CO2_factor"] #btu
        else:
            carbon_intensity_lb = row["energy_intensity_factor"] * 0.001 * row["CO2_factor"]
        carbon_dict[row['mode']] = carbon_intensity_lb

    # Add 'no_gt'
    carbon_dict['no_gt'] = 0
    return carbon_dict
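
The unit conversion in `get_carbon_dict` can be checked by hand. The sketch below repeats the same arithmetic on made-up inputs: the numeric factors are hypothetical, not values from energy_intensity.csv, and the helper name `carbon_intensity_lb_per_pmt` is introduced only for illustration. It assumes, as the comments in the function suggest, that the CO2 factor is given per MMBTU for combustion fuels and per MWH otherwise.

```python
BTU_TO_MMBTU = 1e-6  # 1 BTU = 0.000001 MMBTU
KWH_TO_MWH = 1e-3    # 1 kWH = 0.001 MWH

def carbon_intensity_lb_per_pmt(energy_intensity_factor, co2_factor, fuel):
    """Convert an energy intensity and a CO2 factor into lb CO2 per passenger mile."""
    if fuel in ("electric", "human_powered"):
        scale = KWH_TO_MWH    # kWH/PMT -> MWH/PMT
    else:
        scale = BTU_TO_MMBTU  # BTU/PMT -> MMBTU/PMT
    return energy_intensity_factor * scale * co2_factor

# Hypothetical gasoline mode: 5000 BTU/PMT and 150 lb CO2/MMBTU
gas_ci = carbon_intensity_lb_per_pmt(5000.0, 150.0, "gasoline")

# Hypothetical electric mode: 0.25 kWH/PMT and 1000 lb CO2/MWH
electric_ci = carbon_intensity_lb_per_pmt(0.25, 1000.0, "electric")

print(gas_ci, electric_ci)
```

With these inputs the gasoline mode works out to about 0.75 lb/PMT (5000 × 0.000001 × 150) and the electric mode to about 0.25 lb/PMT (0.25 × 0.001 × 1000).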


In [54]:
# Select the metric to evaluate (stored in the variable 'matrix')
# 'matrix' can be 'energy consumption' or 'carbon emission'
matrix = 'carbon emission'

In [None]:
intensity_dict = {}

if matrix == 'carbon emission':
    unit = 'lb/PMT'
    intensity_dict = get_carbon_dict(df_EI)
elif matrix == 'energy consumption':
    unit = 'MWH'
    intensity_dict = cm_handling.get_energy_dict(df_EI, units='MWH')

user_elt_with_errors, elt_with_errors_all = user_elt_with_errors_method(intensity_dict, user_confusion_matrices_dic, unit_dist_MCS_df, validation_trips)
user_elt_with_errors_df = pd.DataFrame(data = user_elt_with_errors)


### Plotting the results of energy consumption and carbon emission

#### Difference between expected and user-labeled metric vs. standard deviation of the expected metric, for all modes of each user, ordered by descending standard deviation of the expected metric --- confusion matrix based on different attributes

In [None]:
df_T = user_elt_with_errors_df.T
df_T = df_T.rename_axis('user_id').reset_index()
df_sorted = df_T.sort_values('all_mode_expected_SD_EC', ascending=False)
df_sorted.reset_index(drop=True, inplace=True)
df_sorted.head()


In [None]:
# double-check attribute and matrix before storing the results and plotting
attribute, matrix

In [61]:
df_sorted.to_csv('CSVs/' + matrix + '_' + attribute + '_diff_SD.csv')
df_sorted = pd.read_csv('CSVs/' + matrix + '_' + attribute + '_diff_SD.csv')
df_sorted = df_sorted.drop(['Unnamed: 0'], axis=1)

In [62]:
# pip install -U kaleido

import plotly.graph_objects as go

def diff_SD_plot(df, matrix, attribute, unit):

    fig = go.Figure()
    x = df.index
    title = ''
    y1 = df['dif_expected_user_laberd_mean'].apply(abs)
    fig.add_trace(go.Scatter(x=x, y=y1, mode='markers', name='Difference between Expected and User labeled ' + matrix, 
                            text=df['user_id']))

    y2 = df['all_mode_expected_SD_EC']
    fig.add_trace(go.Scatter(x=x, y=y2, mode='markers', name='Standard Deviation of Expected ' + matrix,
                            text=df['user_id']))

    fig.update_layout(title='Difference between Expected and User labeled '+ matrix +' VS Standard Deviation of Expected '+ matrix +' <br /> for all Modes each User Ordered by the Descending of Standard Deviation of Expected '+ matrix +'  <br /> --- confusion matrix based on ' + attribute, #+ ' ' + unit,
                    xaxis_title='user', yaxis_title= matrix + ' (' + unit +')' #'Energy (MWH)', lb/PMT
                    , legend=dict(orientation='h'),
                    width=1200,height=800)
    fig.write_image("plots/"+matrix+"_" + attribute + "_diff_SD_4_folds.png", format="png")

    fig.show()


In [None]:
diff_SD_plot(df_sorted, matrix, attribute, unit)

#### Cumulative EC/CE by user --- confusion matrix based on different attributes

In [None]:
df_plot = elt_with_errors_all.copy()
grouped_df = df_plot.groupby('user_id').sum()
grouped_df = grouped_df.reset_index()

grouped_df_sorted = grouped_df.sort_values('expected', ascending=False)
grouped_df_sorted.reset_index(drop=True, inplace=True)
grouped_df_sorted['user_id'] = grouped_df_sorted['user_id'].apply(str)


In [66]:
grouped_df_sorted.to_csv('CSVs/' + matrix + '_'+ attribute +'_user_cumulative.csv')

grouped_df_sorted = pd.read_csv('CSVs/' + matrix + '_' + attribute + '_user_cumulative.csv')
grouped_df_sorted = grouped_df_sorted.drop(['Unnamed: 0'], axis=1)

In [67]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import plot

def cumulative_matrix_plot(df, matrix, attribute, unit):
    # expected	user_labeled	confusion_var	user_var	confusion_sd	user_sd
    df = df.reset_index()
    df['index'] = df.index+1
    # df = df[df['user_id']== uuid.UUID(user_id)]

    # Create the figure as two stacked subplots with the specified row heights
    fig = make_subplots(rows=2, cols=1, row_heights=[1, 0.3])

    # Add the first trace to the first row
    fig.add_trace(go.Bar(
        x=df['index'],
        y=df['expected'],
        name='Inferred',
        marker_color='#1f77b4',  # blue
        error_y=dict(
            type='data',
            array=df['confusion_sd'],
            visible=True,
            color='#000000', 
            thickness=0.5  
        )
        # ,texttemplate='%{text}',
        # text= df['user_id']
    ),row=1, col=1)

    # Add the second trace to the first row as well, so the two bars are grouped per user
    fig.add_trace(go.Bar(
        x=df['index'],
        y=df['user_labeled'],
        name='user labeled',
        marker_color='#2ca02c',  # green
        error_y=dict(
            type='data',
            array=df['user_sd'],
            visible=True,
            color='#000000',  
            thickness=0.5 
        )
    ), row=1, col=1)

    data1 = {'inferred': df['expected'].sum(), 'user labeled': df['user_labeled'].sum()}
    df1 = pd.DataFrame(data1.items(), columns=['sum_name', 'sum_value'])

    data2 = {'confusion_sd_sum': math.sqrt(df['confusion_var'].sum()), 'user_sd_sum': math.sqrt(df['user_var'].sum())}
    df2 = pd.DataFrame(data2.items(), columns=['sd_sum_name', 'sd_sum_value'])

    merged_df = pd.concat([df1, df2], axis=1)
    title = ''
    fig.add_trace(go.Bar(
        x=merged_df['sum_value'],
        y=merged_df['sum_name'],
        marker_color=['#1f77b4', '#2ca02c'],  
        orientation='h',
        showlegend=False,
        error_x=dict(
            type='data',
            array=merged_df['sd_sum_value'],
            visible=True,
            color='#000000',  
            thickness=1  
        ), width=[0.4, 0.4, 0.4], 
    ), row=2, col=1)


    fig.update_layout(
        title='Cumulative ' + matrix + ' by user --- confusion matrix based on ' + attribute, # + ' (second)',
        xaxis_title= 'user ID',
        yaxis_title= matrix + ' (' + unit + ')',
        font=dict(
            family='Arial',  
            size=12,  
            color='#333333'  
        ),
        barmode='group', 
        bargap=0.1,
        bargroupgap=0.1
    )

    fig.update_layout(
        width=1200,  
        height=800 
    )

    fig.update_xaxes(title_text= matrix + ' (' + unit + ')', row=2, col=1)
    # fig.update_yaxes(title_text="Cumulative energy consumption", showgrid=False, row=2, col=1)


    # plot(fig, filename='user_Cumulative_EC.html', auto_open=True)
    fig.write_image("plots/" + matrix + "_" + attribute + "_cumulative_4_folds.png", format="png")

    # Show the figure
    fig.show()



In [None]:
cumulative_matrix_plot(grouped_df_sorted, matrix, attribute, unit)
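
The bottom panel of this plot uses `math.sqrt(df['confusion_var'].sum())` as the error bar of the overall total. The sketch below, on made-up per-user numbers, shows what that computes: the variances are summed first and the square root is taken afterwards, which is the standard deviation of a sum only under the assumption that the per-user errors are uncorrelated. Adding the standard deviations directly gives a larger value.

```python
import math
import pandas as pd

# Hypothetical per-user totals and variances.
per_user = pd.DataFrame({
    "expected":      [12.0, 8.0, 5.0],
    "confusion_var": [9.0, 16.0, 0.0],
})

total_expected = per_user["expected"].sum()

# SD of the total, assuming uncorrelated per-user errors: sqrt(9 + 16 + 0)
total_sd = math.sqrt(per_user["confusion_var"].sum())

# For comparison: adding the per-user SDs (3 + 4 + 0) overstates the spread
sum_of_sds = per_user["confusion_var"].apply(math.sqrt).sum()

print(total_expected, total_sd, sum_of_sds)
```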

##### Initial example (can be ignored)

In [None]:
import pandas as pd
import plotly.graph_objects as go
df = grouped_df_sorted
expected_sum = df['expected'].sum()
user_labeled_sum = df['user_labeled'].sum()
confusion_sd_sum = math.sqrt(df['confusion_var'].sum())
user_sd_sum = math.sqrt(df['user_var'].sum())

summary_df = pd.DataFrame({'Metric': ['Expected', 'User Labeled', 'Confusion SD', 'User SD'],
                           'Sum': [expected_sum, user_labeled_sum, confusion_sd_sum, user_sd_sum]})

fig = go.Figure(data=[go.Bar(y=summary_df['Metric'], x=summary_df['Sum'], orientation='h')])

fig.update_layout(title='Sum of Metrics', xaxis_title='Sum', yaxis_title='Metric')

fig.show()


### Mode count and Mode shared by distance

#### Grace's methods

##### old method 

Sampling and probability functions for both counts and distances

In [22]:
# # sampling and probability calculation fcns which you need to do for both counts and distances.

# import pandas as pd
# import numpy as np
# from numpy.random import default_rng

# '''
# Sample a CM (as a DF) and create n # of CMs based on it.
# input: a CM to sample, number of times to sample
# output: a list of n CMs
# '''
# def sampling(samplingCM, n):
#     # dirichlet sampling
#     v = [] # from the CM, going left to right by row
#     for row in samplingCM.index:
#         for col in samplingCM.columns:
#             v.append(samplingCM.loc[row][col])
#     a = np.ones(samplingCM.size)

#     rng = default_rng()
#     dirichlet_samples = rng.dirichlet((v+a), n)

#     # multinomial sampling
#     multinomial_samples = []
#     n_trips = 0
#     for col in samplingCM:
#         n_trips += samplingCM[col].sum()

#     for params in dirichlet_samples:
#         s = rng.multinomial(n_trips, params)
#         multinomial_samples.append(s)

#     # put each of these into their own CM, same dimensions as samplingCM (do row by row)
#     output_CMs = []
#     for samples in multinomial_samples:
#         samples2D = np.reshape(samples, (len(samplingCM.index), len(samplingCM.columns)))
#         outputCM = pd.DataFrame(samples2D, columns = samplingCM.columns, index = samplingCM.index)
#         output_CMs.append(outputCM)
#     return output_CMs
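
The commented-out `sampling` function above draws Dirichlet parameters from the confusion matrix counts (plus a prior of ones) and then draws multinomial counts from each parameter vector. A minimal, more compact sketch of that idea follows; the helper name `sample_confusion_matrices`, the `seed` argument and the toy matrix are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def sample_confusion_matrices(cm, n_samples, seed=0):
    """Draw n_samples count matrices shaped like cm (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    counts = cm.to_numpy().ravel()          # row-major: left to right, row by row
    alpha = counts + 1                      # observed counts plus a prior of ones
    n_trips = int(counts.sum())
    # One probability vector per sample, then one multinomial draw per vector
    probs = rng.dirichlet(alpha, size=n_samples)
    draws = [rng.multinomial(n_trips, p) for p in probs]
    return [
        pd.DataFrame(d.reshape(cm.shape), index=cm.index, columns=cm.columns)
        for d in draws
    ]

# Toy confusion matrix (rows = actual, columns = sensed); values are made up
toy_cm = pd.DataFrame([[8, 2], [1, 9]], index=["walk", "bike"], columns=["walk", "bike"])
sampled = sample_confusion_matrices(toy_cm, n_samples=5)
```

Every sampled matrix keeps the original labels and the original total number of trips; only the split across cells varies.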


In [18]:
# # '''
# # Finding P(actual|sensed) by dividing each cell by column sum, for a list of DFs
# # input: list of DFs of values
# # output: list of DF of P(actual|sensed)
# # '''
# def actual_given_sensed_CM(list_of_value_CMs):
#     actual_given_sensed = []
#     for cm in list_of_value_CMs:
#         probs = cm.div(cm.sum(axis=0), axis=1)
#         actual_given_sensed.append(probs)
#     return actual_given_sensed
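
As a quick check of the column normalization described above, here is a small sketch on a made-up 2x2 count matrix (the mode names and values are assumptions for illustration). Dividing each cell by its column sum gives P(actual|sensed), so every column should sum to 1.

```python
import pandas as pd

# Toy counts: rows = actual mode, columns = sensed mode (values are made up)
counts = pd.DataFrame([[6, 1], [2, 3]], index=["walk", "bike"], columns=["walk", "bike"])

# Divide every column by its own sum, as in actual_given_sensed_CM
p_actual_given_sensed = counts.div(counts.sum(axis=0), axis=1)
column_sums = p_actual_given_sensed.sum(axis=0)
```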

Count estimation function

In [29]:
# # count estimate and variance calculation function

# import pandas as pd
# import numpy as np


# '''
# Finding estimated values and variances FOR COUNTS.
# input:
#     predicted_counts: a dictionary {"mode1": # of trips predicted in mode1...}
#     actual_given_sensed: list of DFs, which have probabilities in each cell
# output: a single estimated count for each mode, and a single estimated variance for each mode count. 
#     (prints some other interesting stuff out too)
# '''
# def count_estimate(predicted_counts, actual_given_sensed):
#     # find expected counts based on each actual_given_sensed CM and predicted counts
#     expected_counts = [] #list of dfs (one df per cm)
#     for df in actual_given_sensed:
#         expected_value = df.mul(pd.Series(predicted_counts), axis = 'columns') # multiply row by row
#         expected_value = expected_value.sum(axis='columns') # sum of each row
#         expected_counts.append(expected_value)

#     # average expected values: concat dfs and find mean
#     all_expected = pd.concat(expected_counts, axis='columns')
#     average_ev = all_expected.mean(axis='columns')

#     #VARIANCES
#     # variance of each cell
#     df_list = []
#     for df in actual_given_sensed:
#         df_list.append(df.to_numpy())
#     cell_variance = pd.DataFrame(np.square(pd.DataFrame(np.dstack((df_list)).std(axis=2), columns = actual_given_sensed[0].columns, index = actual_given_sensed[0].index))) # calculate variance for each entry  changed

#     # multiply each row of cell variances by  the row of n_i^2
#     predicted_counts = pd.Series(predicted_counts)

#     n_squared = predicted_counts ** 2
#     # n_squared = np.square(predicted_counts) #row of n^2s
#     n2_times_var = cell_variance.mul(n_squared, axis = 'columns') # var(ax) = (a^2)*var(x) ///changed

#     # sum up rows
#     variance = n2_times_var.sum(axis='columns') 
#     return (average_ev, variance)

Distance estimation function

In [254]:
# def sort_dict(initial_dict, order):
#     sorted_dict = {}
#     for ele in order:
#         if ele not in initial_dict:
#             sorted_dict[ele] = 0
#         else:
#             sorted_dict[ele] = initial_dict[ele]    
#     return sorted_dict

In [None]:
# # distance estimation and variance calculation fcn. very similar to the count fcn, not sure whether to combine them or not.

# import pandas as pd
# import numpy as np

# '''
# Finding estimated values and variances FOR DISTANCES.
# input: 
#     predicted_distances: a dictionary {"mode1": distance predicted in mode1...}
#     actual_given_sensed: list of DFs, which have probabilities in each cell
#     os: which OS we're using, either "ios" or "android"
# output: a single estimated distance for each mode, and a single estimated variance for each mode. 
#     (prints some other interesting stuff out too)
# '''
# def distance_estimate(predicted_distances, actual_given_sensed, os, sampling):
#     os_unit_info = unit_dist_MCS_df #////changed

#     # adjusting using os unit info
#     adjusted_predicted_distances = {}
#     for mode in predicted_distances.keys(): # e.g.mode = ('Gas Car, with others', 'android'); os = 'multiple_devices'
#         if os == 'multiple_devices':
#             mode_os =  mode[1]
#             mode_name = mode[0]
#             if  mode_os == 'ios' or mode_os == 'android':
#                 if mode_name in adjusted_predicted_distances:
#                     adjusted_predicted_distances[mode_name] += predicted_distances[mode] * os_unit_info[mode_os][0]
#                 else: 
#                     adjusted_predicted_distances[mode_name] = predicted_distances[mode] * os_unit_info[mode_os][0]
#             else:
#                 raise Exception("multiple_devices: New device discovered.")

#         else:
#             adjusted_predicted_distances[mode] = predicted_distances[mode] * os_unit_info[os][0]

#     adjusted_predicted_distances = sort_dict(adjusted_predicted_distances, actual_given_sensed[0].columns)

#     # find expected counts based on each actual_given_sensed CM and predicted counts
#     expected_counts = [] #list of dfs (one df per cm)
#     for df in actual_given_sensed:
#         expected_value = df.mul(adjusted_predicted_distances, axis = 'columns') # multiply row by row
#         expected_value = expected_value.sum(axis='columns') # sum of each row
#         expected_counts.append(expected_value)
    
#     # average expected values: concat dfs and find mean
#     all_expected = pd.concat(expected_counts, axis='columns')
#     average_ev = all_expected.mean(axis='columns')

#     #VARIANCES
#     df_list = []
#     for df in actual_given_sensed:
#         df_list.append(df.to_numpy())

#     # Note: If sampling=False, variance1 is 0. There is only one CM, so we don't have cell_variance.

#     # variance of each cell in prob CMs    
#     cell_variance = np.square(pd.DataFrame(np.dstack((df_list)).std(axis=2), columns = actual_given_sensed[0].columns, index = actual_given_sensed[0].index)) 
#     # result = np.nanvar(np.dstack(df_list), axis=2)
#     # cell_variance = pd.DataFrame(result, columns = actual_given_sensed[0].columns, index = actual_given_sensed[0].index)
    
#     # multiply each row of cell variances by the row of L_i^2
#     adjusted_predicted_distances = pd.DataFrame([adjusted_predicted_distances])
#     n_squared = np.square(adjusted_predicted_distances) #row of L^2s
#     variance1 = cell_variance.mul(n_squared.values, axis = 'columns') # E(L_mode)^2 V(p)
    

#     # Note: If sampling=False, avg_actual_given_sensed is the CM itself.

#     # extra variance term since distance has its own uncertainty
#     avg_actual_given_sensed = pd.DataFrame(np.dstack((df_list)).mean(axis=2), columns = actual_given_sensed[0].columns, index = actual_given_sensed[0].index)
    

#     dist_variance_dic = {}
#     for mode in predicted_distances.keys(): # e.g.mode = ('Gas Car, with others', 'android'); os = 'multiple_devices'
#         if os == 'multiple_devices':
#             mode_os =  mode[1]
#             mode_name = mode[0]
#             if  mode_os == 'ios' or mode_os == 'android':
#                 if mode_name in dist_variance_dic:
#                     dist_variance_dic[mode_name] += np.square(predicted_distances[mode]) * os_unit_info[mode_os][1]
#                 else:
#                     dist_variance_dic[mode_name] = np.square(predicted_distances[mode]) * os_unit_info[mode_os][1] # row of L_i^2s
#         else:
#             dist_variance_dic[mode] = np.square(predicted_distances[mode]) * os_unit_info[os][1] # row of L_i^2s
        
#     dist_variance_dic = sort_dict(dist_variance_dic, avg_actual_given_sensed.columns)
    
#     dist_variance = pd.Series(dist_variance_dic)
#     # dist_variance = np.square(pd.Series(predicted_distances)) * os_unit_info[os][1] # row of L_i^2s
    
#     variance2 = dist_variance.mul(np.square(avg_actual_given_sensed)) # E(p)^2*V(L_mode)]

#     # sum up rows
#     variance = variance1.add(variance2)
#     variance = variance.sum(axis='columns')

#     return (average_ev, variance)


find_counts and find_distances


In [None]:
# # find counts
# # Note: the flag is named use_sampling so that it does not shadow the sampling() function.
# def find_counts(trainingCM, data, use_sampling):
#     output_CMs = []
#     if use_sampling:
#         output_CMs = sampling(trainingCM, 2000) # you can change 2000 to anything!
#     else:
#         output_CMs.append(trainingCM)
#     countProbs = actual_given_sensed_CM(output_CMs)
#     countEstimates = count_estimate(data, countProbs)
#     return countEstimates

# # find distances
# def find_distances(trainingCM, data, os, use_sampling):
#     output_CMs = []
#     if use_sampling:
#         output_CMs = sampling(trainingCM, 2000) # you can change 2000 to anything!
#     else:
#         output_CMs.append(trainingCM)
#     distanceProbs = actual_given_sensed_CM(output_CMs)
#     distanceEstimates = distance_estimate(data, distanceProbs, os, use_sampling)
#     return distanceEstimates

##### new method

Count estimation function

In [70]:
# sort the dictionary by the order
def sort_dict(initial_dict, order):
    sorted_dict = {}
    for ele in order:
        if ele not in initial_dict:
            sorted_dict[ele] = 0
        else:
            sorted_dict[ele] = initial_dict[ele]    
    return sorted_dict
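
For reference, the same alignment (put the modes in a given order and fill missing modes with 0) can be sketched with `Series.reindex`. The dictionary and order below are made-up examples.

```python
import pandas as pd

predicted = {"bike": 3, "walk": 5}   # toy predicted counts per mode
order = ["walk", "bike", "bus"]      # e.g. the confusion matrix columns

# Reorder to match `order`; modes missing from `predicted` get 0
aligned = pd.Series(predicted).reindex(order, fill_value=0)
```

Like `sort_dict`, this drops any key that is not listed in `order`.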

The next cell was shared by Grace.

In [71]:
import pandas as pd
import numpy as np

def new_method(inputCM, predictions):
    '''
    inputCM: a dataframe of column-normalized counts, i.e. P(actual|predicted); columns = sensed modes, rows = actual modes
    predictions: a Series of predicted counts (or distances) per mode, indexed by the sensed modes
    returns: (mean, variance), each a Series indexed by actual mode
    '''
    # inputCM already holds the column probabilities
    column_probabilities = inputCM
    print("probabilities:\n", column_probabilities)
    mean = {}
    variance = {}
    for index, row in column_probabilities.iterrows():
        # mean per mode: sum over sensed modes of p * n
        mean[index] = (row*predictions).sum()
        # variance per mode: sum over sensed modes of n * p * (1 - p)
        variance[index] = (predictions*row*(1-row)).sum()

    return (pd.Series(mean), pd.Series(variance))
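
To make the two formulas concrete, the sketch below recomputes them in vectorized form on toy numbers. For each actual mode i, the mean is the sum over sensed modes j of p_ij * n_j and the variance is the sum over j of n_j * p_ij * (1 - p_ij). This has the shape of a sum of binomial variances; that reading, the helper name `mean_and_variance`, and the toy values are assumptions for illustration.

```python
import pandas as pd

def mean_and_variance(prob_cm, predictions):
    """Vectorized version of the per-row sums (hypothetical helper)."""
    mean = prob_cm.dot(predictions)                       # sum_j p_ij * n_j
    variance = (prob_cm * (1 - prob_cm)).dot(predictions) # sum_j n_j * p_ij * (1 - p_ij)
    return mean, variance

# Toy P(actual|sensed): rows = actual, columns = sensed; each column sums to 1
prob_cm = pd.DataFrame(
    [[0.75, 0.25], [0.25, 0.75]], index=["walk", "bike"], columns=["walk", "bike"]
)
predictions = pd.Series({"walk": 8, "bike": 4})  # toy predicted counts

mean, variance = mean_and_variance(prob_cm, predictions)
# walk: mean = 0.75*8 + 0.25*4 = 7, variance = 8*0.1875 + 4*0.1875 = 2.25
```

The expected counts sum to the number of predicted trips (7 + 5 = 12), since each column of the probability matrix sums to 1.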

#### Mode Counts & Mode shared by distance

In [72]:
import pickle

def confusion_matrices_to_mode_count_distance_mean_variance(user_confusion_matrices_dic, validation_trips, matrix, sampling):
    mean_ev_dic = {}
    variance_dic = {}
    combined_mean_ev_dic = {}
    combined_variance_dic = {}

    for user_id, confusion_matrix in user_confusion_matrices_dic.items():
        print(user_id)
        valid_trips_user = validation_trips[validation_trips['user_id']== user_id]
        
        if matrix=='trip_count':
            valid_trips_user_dic_temp = valid_trips_user.groupby('mode_pred').size().to_dict()
        elif (matrix=='distance'):
            # Sum only the 'distance' column so that non-numeric columns are not aggregated
            valid_trips_user_dic_temp = valid_trips_user.groupby('mode_pred')['distance'].sum().to_dict()
            
            # Divide each value by 1000 (meters to kilometers, assuming 'distance' is in meters)
            valid_trips_user_dic_temp_1 = {key: value / 1000 for key, value in valid_trips_user_dic_temp.items()}

            # Truncate the values to integers (this drops the fractional kilometers)
            valid_trips_user_dic_temp_2 = {key: int(value) for key, value in valid_trips_user_dic_temp_1.items()}

        cm = pd.DataFrame(confusion_matrix, index= list(confusion_matrix.index))

        if matrix=='trip_count':
            valid_trips_user_dic_temp = sort_dict(valid_trips_user_dic_temp, cm.columns)

            predictions = pd.Series(valid_trips_user_dic_temp)
            (mean_ev, variance) = new_method(cm/cm.sum(axis=0), predictions)
            
            # mean_ev, variance = find_counts(confusion_matrix, valid_trips_user_dic_temp, sampling) # old method

        elif matrix=='distance':
            valid_trips_user_dic_temp_2 = sort_dict(valid_trips_user_dic_temp_2, cm.columns)

            predictions = pd.Series(valid_trips_user_dic_temp_2)
            (mean_ev, variance) = new_method(cm/cm.sum(axis=0), predictions)

            # mean_ev, variance = find_distances(confusion_matrix, valid_trips_user_dic_temp_2, sampling)  # old method

        mean_ev_d = mean_ev.to_dict()
        variance_d = variance.to_dict()
        mean_ev_dic[user_id] = mean_ev_d
        variance_dic[user_id] = variance_d

        for key, value in mean_ev_d.items():
            if key in combined_mean_ev_dic:
                combined_mean_ev_dic[key] += value
            else:
                combined_mean_ev_dic[key] = value

        for key, value in variance_d.items():
            if key in combined_variance_dic:
                combined_variance_dic[key] += value
            else:
                combined_variance_dic[key] = value

    # Save the dictionary to a file
    with open('mean_ev_dic_' + matrix +'.pickle', 'wb') as file:
        pickle.dump(mean_ev_dic, file)
    
    with open('variance_dic_' + matrix +'.pickle', 'wb') as file:
        pickle.dump(variance_dic, file)
    
    with open('combined_mean_ev_dic_' + matrix +'.pickle', 'wb') as file:
        pickle.dump(combined_mean_ev_dic, file)
    
    with open('combined_variance_dic_' + matrix +'.pickle', 'wb') as file:
        pickle.dump(combined_variance_dic, file)
        
    return mean_ev_dic, variance_dic, combined_mean_ev_dic, combined_variance_dic


In [None]:
# 'matrix' can be 'trip_count' or 'distance'
matrix = 'distance'
sampling = False

average_ev_dic, variance_dic, combined_average_ev_dic, combined_variance_dic = confusion_matrices_to_mode_count_distance_mean_variance(user_confusion_matrices_dic, validation_trips, matrix, sampling)
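
The combined dictionaries are built by adding each user's per-mode value into a running total. A compact sketch of that accumulation is shown below; `combine_user_dicts` and the toy data are hypothetical names and values used only for illustration.

```python
from collections import defaultdict

def combine_user_dicts(per_user):
    """Sum the per-mode values over all users (hypothetical helper)."""
    combined = defaultdict(float)
    for mode_values in per_user.values():
        for mode, value in mode_values.items():
            combined[mode] += value
    return dict(combined)

# Toy per-user expected values; a user may be missing some modes
toy_means = {"user_a": {"walk": 2.0, "bike": 1.0}, "user_b": {"walk": 3.0}}
combined_means = combine_user_dicts(toy_means)
```

Applying the same summation to variances is only valid if the per-user estimates are treated as independent, which is an assumption rather than something checked in this notebook.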


The cell below checks the distribution of mode_pred and mode_true for one specific mode (can ignore)


In [None]:
# check the distribution of mode_pred and mode_true for a specific mode 

validation_trip_car_alone_pred = validation_trips[validation_trips['mode_pred']=='Gas Car, drove alone']
# Sum only the plotted column so that non-numeric columns are not aggregated
validation_trip_car_alone_users_pred = validation_trip_car_alone_pred.groupby('user_id')['distance_miles'].sum()
validation_trip_car_alone_users_pred = validation_trip_car_alone_users_pred.reset_index()

validation_trip_car_alone = validation_trips[validation_trips['mode_true']=='Gas Car, drove alone']
validation_trip_car_alone_users = validation_trip_car_alone.groupby('user_id')['distance_miles'].sum()
validation_trip_car_alone_users = validation_trip_car_alone_users.reset_index()

import pandas as pd
import plotly.express as px

# Per-user total distance (miles) over trips whose true mode is 'Gas Car, drove alone'
fig = px.histogram(validation_trip_car_alone_users, x='distance_miles', nbins=500)  # Adjust the number of bins as needed

# Update layout if desired
fig.update_layout(
    xaxis_title='Values',
    yaxis_title='Frequency',
    title='Histogram'
)

# Show the plot
fig.show()

# Per-user total distance (miles) over trips whose predicted mode is 'Gas Car, drove alone'
fig = px.histogram(validation_trip_car_alone_users_pred, x='distance_miles', nbins=500)  # Adjust the number of bins as needed

# Update layout if desired
fig.update_layout(
    xaxis_title='Values',
    yaxis_title='Frequency',
    title='Histogram'
)

# Show the plot
fig.show()



In [76]:
def clean_df(dic_df, expected_value):
    # Input: columns are user ids, rows are modes. Transpose so that each row is a user.
    dic_df = dic_df.T
    dic_df.reset_index(inplace=True)
    dic_df.rename(columns={'index': 'user_id'}, inplace=True)
    # Output df columns: user_id, mode, <expected_value>
    rows = []
    for _, row in dic_df.iterrows():
        user = row['user_id']
        for column in dic_df.columns:
            if column != 'user_id' and pd.notnull(row[column]):
                rows.append({'user_id': user, 'mode': column, expected_value: row[column]})
    # DataFrame.append was removed in pandas 2.0, so collect the rows and build the frame once
    new_df = pd.DataFrame(rows, columns=['user_id', 'mode', expected_value])
    return new_df
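
The reshaping done by `clean_df` (one row per user and mode, skipping missing values) can also be sketched with `DataFrame.melt`. The helper name `to_long` and the toy dictionary are assumptions for illustration; the row order differs from `clean_df`, so sort afterwards if order matters.

```python
import pandas as pd

def to_long(per_user, value_name):
    """Turn {user: {mode: value}} into user_id / mode / value rows (hypothetical helper)."""
    wide = pd.DataFrame(per_user).T.rename_axis("user_id").reset_index()
    long_df = wide.melt(id_vars="user_id", var_name="mode", value_name=value_name)
    # Drop user/mode pairs that had no value for that user
    return long_df.dropna(subset=[value_name]).reset_index(drop=True)

toy = {"u1": {"walk": 1.0, "bike": 2.0}, "u2": {"walk": 3.0}}
long_df = to_long(toy, "expected_distance")
```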

    

In [None]:
# check the matrix
matrix

In [None]:
# expected_true_df contains the 'user_id', 'mode', 'expected_distance' and 'variance_distance'

variance_dic_df = pd.DataFrame(pd.read_pickle('variance_dic_' + matrix +'.pickle'))
variance_dic_df = clean_df(variance_dic_df,'variance_distance')

expected_dic_df = pd.DataFrame(pd.read_pickle('mean_ev_dic_' + matrix +'.pickle'))
expected_dic_df = clean_df(expected_dic_df,'expected_distance')

expected_true_df = pd.merge(expected_dic_df, variance_dic_df, on=['user_id', 'mode'])
expected_true_df


In [None]:
# In the validation dataset, get the total distance of each mode for each user
valid_mode_true_distance = validation_trips.groupby(['user_id', 'mode_true'])['distance_km'].sum()
valid_mode_true_distance = valid_mode_true_distance.reset_index()
valid_mode_true_distance.rename(columns={'mode_true': 'mode'}, inplace=True)
valid_mode_true_distance.head(10)

In [82]:
# calculate the accuracy based on the confusion_matrix.
# input: confusion_matrix, each row is true values, each column is predicted values
def get_accuracy(confusion_matrix):
    column_predictions = pd.DataFrame(confusion_matrix, index= list(confusion_matrix.index))

    # Initialize the sum of correct predictions (where row index matches the column)
    correct_sum = 0

    # Iterate over each row and each column
    for index, row in column_predictions.iterrows():
        for column_name, value in row.items():
            # Check if the row index matches the column name
            if index == column_name:
                correct_sum += value
                break

    # Calculate the total sum of all values in the DataFrame
    total_sum = column_predictions.to_numpy().sum()

    # Calculate accuracy by dividing correct_sum by total_sum
    accuracy = correct_sum / total_sum
    print("Accuracy:", accuracy)
    return accuracy
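
A shorter sketch of the same calculation is given below. It sums the cells whose row label equals the column label and divides by the grand total, and it also works when the matrix is not square (for example, a true mode that was never predicted). The helper name `diagonal_accuracy` and the toy matrix are assumptions.

```python
import pandas as pd

def diagonal_accuracy(cm):
    """Share of the total weight that sits where row label == column label."""
    shared = cm.index.intersection(cm.columns)
    correct = sum(cm.loc[label, label] for label in shared)
    return correct / cm.to_numpy().sum()

# Toy matrix: rows = true modes, columns = predicted modes; 'bus' is never predicted
toy_cm = pd.DataFrame(
    [[6, 2], [1, 9], [2, 0]], index=["walk", "bike", "bus"], columns=["walk", "bike"]
)
toy_accuracy = diagonal_accuracy(toy_cm)  # (6 + 9) / 20
```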


In [None]:
#  If we want to calculate the accuracy of trip mode, we need to check the mode_true and mode_pred
#  Get the accuracy of each user based on the ground truth value and predicted value
def get_user_accuracyDf_from_CM(dataset_trips, column_name):
    dataset_trips_user_id_list = list(dataset_trips['user_id'].unique())

    user_dataset_CM_dic = {}
    user_dataset_accuracy_CM_dic = {}
    for user_id in dataset_trips_user_id_list:
        user_dataset_trips = dataset_trips[dataset_trips['user_id'] == user_id]
        
        predicted_values = user_dataset_trips['mode_pred']
        true_values = user_dataset_trips['mode_true']
        sample_weight = user_dataset_trips['distance']

        confusion_matrix = pd.crosstab(true_values, predicted_values, sample_weight, aggfunc='sum') # weight: trip distance
        confusion_matrix[confusion_matrix.isnull()] = 0
        accuracy = get_accuracy(confusion_matrix)
        user_dataset_accuracy_CM_dic[user_id] = accuracy
        user_dataset_CM_dic[user_id] = confusion_matrix
        
        
    sorted_user_dataset_accuracy_CM_dic = dict(sorted(user_dataset_accuracy_CM_dic.items(), key=lambda item: item[1], reverse=True))
    sorted_user_dataset_accuracy_CM_df = pd.DataFrame(data = sorted_user_dataset_accuracy_CM_dic, index = [0])
    sorted_user_dataset_accuracy_CM_df = sorted_user_dataset_accuracy_CM_df.T.reset_index()
    sorted_user_dataset_accuracy_CM_df = sorted_user_dataset_accuracy_CM_df.rename(columns = {'index':'user_id', 0:column_name})
    return sorted_user_dataset_accuracy_CM_df

sorted_user_testing_accuracy_CM_df = get_user_accuracyDf_from_CM(test_trips,'model_testing_prediction_accuracy')
sorted_user_validation_accuracy_CM_df = get_user_accuracyDf_from_CM(validation_trips,'model_validation_prediction_accuracy')
sorted_user_validation_accuracy_CM_list = list(sorted_user_validation_accuracy_CM_df['user_id'])
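
To show what the distance-weighted confusion matrix looks like, here is a small sketch with `pd.crosstab` on four made-up trips. Each cell holds the summed distance of the trips with that (true, predicted) pair, and pairs that never occur are filled with 0.

```python
import pandas as pd

# Toy trips; modes and distances are made up
trips = pd.DataFrame({
    "mode_true": ["walk", "walk", "bike", "bike"],
    "mode_pred": ["walk", "bike", "bike", "bike"],
    "distance": [1000.0, 500.0, 2000.0, 1500.0],
})

weighted_cm = pd.crosstab(
    trips["mode_true"], trips["mode_pred"], values=trips["distance"], aggfunc="sum"
).fillna(0)
```

With these weights a long misclassified trip lowers the accuracy more than a short one.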



In [None]:
# accuracy_GT_ED is the accuracy of predicted mode distance, not the accuracy of trip mode.
# If we want to calculate the accuracy of trip mode, we need to check the mode_true and mode_pred

GT_mode_df = pd.DataFrame({})
user_id_list = list(expected_true_df.user_id.unique())

for user_id in user_id_list:
    user_expected_true_df = expected_true_df[expected_true_df['user_id'] == user_id]
    user_valid_mode_true_distance = valid_mode_true_distance[valid_mode_true_distance['user_id'] == user_id]
    df_result = user_expected_true_df.merge(user_valid_mode_true_distance[['mode','distance_km']], on='mode', how='outer')
    df_result = df_result.replace(np.nan, 0)
    df_result['user_id'] = user_id
    df_result['diff_GT_ED'] = abs(df_result['expected_distance'] - df_result['distance_km'])
    GT_mode_df= pd.concat([GT_mode_df, df_result], axis=0)

GT_mode_df_users = pd.DataFrame(GT_mode_df.groupby('user_id').sum())
GT_mode_df_users = GT_mode_df_users.reset_index()

GT_mode_df_users['accuracy_GT_ED'] = 1-(GT_mode_df_users['diff_GT_ED'] / GT_mode_df_users['distance_km']) 
GT_mode_df_sorted = GT_mode_df_users.sort_values('accuracy_GT_ED', ascending=False)
GT_mode_df_sorted.reset_index(drop=True, inplace=True)

user_id_list_basedOnAccuracy = list(GT_mode_df_sorted['user_id'])

GT_mode_df_sorted.head(10)
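
A worked example of `accuracy_GT_ED` for a single user, using made-up distances: the absolute errors are summed over modes and divided by the user's total ground truth distance.

```python
import pandas as pd

# Toy per-mode distances for one user (km); values are made up
user_df = pd.DataFrame({
    "mode": ["walk", "bike"],
    "expected_distance": [6.0, 4.0],
    "distance_km": [5.0, 5.0],
})

abs_diff = (user_df["expected_distance"] - user_df["distance_km"]).abs()
accuracy_gt_ed = 1 - abs_diff.sum() / user_df["distance_km"].sum()  # 1 - 2/10
```

Note that this measure becomes negative when the summed absolute error exceeds the total ground truth distance, and it is undefined when the ground truth total is 0.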


In [None]:
# Sort the df based on the accuracy of each user's validation dataset
expected_true_df['user_id'] = expected_true_df['user_id'].astype('category')
# The inplace argument of reorder_categories was removed in pandas 2.0, so assign the result back
expected_true_df['user_id'] = expected_true_df['user_id'].cat.reorder_categories(sorted_user_validation_accuracy_CM_list)
expected_true_df.sort_values('user_id', inplace=True)
expected_true_df.head(10)
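
The sorting trick used here, turning `user_id` into a categorical whose category order is the desired row order, can be seen on a toy frame. The user ids and the order below are made up.

```python
import pandas as pd

toy_df = pd.DataFrame({"user_id": ["u1", "u2", "u3", "u1"], "value": [1, 2, 3, 4]})
accuracy_order = ["u3", "u1", "u2"]  # e.g. users sorted by validation accuracy

# Sorting a categorical column follows the category order, not the alphabet
toy_df["user_id"] = pd.Categorical(toy_df["user_id"], categories=accuracy_order, ordered=True)
toy_df = toy_df.sort_values("user_id", kind="stable").reset_index(drop=True)
```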

In [89]:
# Define a function to calculate the square root
def calculate_square_root(x):
    return x ** 0.5

In [None]:
df_result_text = expected_true_df.merge(GT_mode_df_sorted[['user_id','accuracy_GT_ED']], on='user_id', how='left')
df_result_text = df_result_text.merge(sorted_user_validation_accuracy_CM_df[['user_id','model_validation_prediction_accuracy']], on='user_id', how='left')
df_result_text = df_result_text.merge(sorted_user_testing_accuracy_CM_df[['user_id','model_testing_prediction_accuracy']], on='user_id', how='left')
df_result_text['SD_distance'] = df_result_text['variance_distance'].apply(calculate_square_root)

df_result_text.tail(10)

Columns:

1. 'expected_distance': the expected distance for each user and mode
2. 'variance_distance': the variance of the expected distance
3. 'accuracy_GT_ED': per user, 1 - sum over modes of abs(ground truth distance - expected distance) / total ground truth distance
4. 'model_validation_prediction_accuracy': the distance-weighted accuracy on the validation dataset. Calculation: sum of the confusion matrix diagonal / sum of all cells
5. 'model_testing_prediction_accuracy': the distance-weighted accuracy on the testing dataset. Calculation: sum of the confusion matrix diagonal / sum of all cells
6. 'SD_distance': the standard deviation of the expected distance
7. 'distance_km': the ground truth distance in kilometers
8. 'SD_Count': abs(ground truth distance - expected distance) / standard deviation of the expected distance, i.e. how many standard deviations separate the two
9. 'var_Count': abs(ground truth distance - expected distance) / variance of the expected distance

In [None]:
df_result_text2 = df_result_text.merge(GT_mode_df[['user_id','distance_km','mode']], on=['user_id','mode'], how='left')

df_result_text2['SD_Count'] = abs(df_result_text2['expected_distance']-df_result_text2['distance_km'])/df_result_text2['SD_distance']
df_result_text2['var_Count'] = abs(df_result_text2['expected_distance']-df_result_text2['distance_km'])/df_result_text2['variance_distance']

# Some rows have variance_distance == 0, which makes SD_Count and var_Count "inf" (or NaN).
# For those rows, set SD_Count and var_Count to 0.
df_result_text2['SD_Count'] = df_result_text2.apply(lambda row: 0 if row['variance_distance'] == 0 else row['SD_Count'], axis=1)
df_result_text2['var_Count'] = df_result_text2.apply(lambda row: 0 if row['variance_distance'] == 0 else row['var_Count'], axis=1)
df_result_text2.head()
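
The zero-variance guard can also be written without a row-wise `apply`. The sketch below uses `Series.where` on toy numbers; the values are made up and the column names mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "expected_distance": [10.0, 5.0],
    "distance_km": [13.0, 5.0],
    "variance_distance": [9.0, 0.0],
})

abs_err = (toy["expected_distance"] - toy["distance_km"]).abs()
sd = np.sqrt(toy["variance_distance"])
has_variance = toy["variance_distance"] > 0

# Keep the ratio where the variance is positive, otherwise use 0
toy["SD_Count"] = (abs_err / sd).where(has_variance, 0.0)
toy["var_Count"] = (abs_err / toy["variance_distance"]).where(has_variance, 0.0)
```

In the first row the ground truth is 3 km away from the expected value with a standard deviation of 3 km, so `SD_Count` is 1.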


In [97]:
df_result_text3 = df_result_text2[df_result_text2['SD_distance']>1]


In [None]:
import pandas as pd
import plotly.express as px

# Create the DataFrame
df = df_result_text2

# Sort the DataFrame by validation accuracy in descending order
df = df.sort_values(by='model_validation_prediction_accuracy', ascending=False)

# Create the box plot using Plotly
fig = px.box(df, x='model_validation_prediction_accuracy', y='var_Count', color='user_id', points='all')

# Update the layout
fig.update_layout(
    xaxis=dict(autorange='reversed'),
    xaxis_title='Accuracy',
    yaxis_title='variance_Count',
    title='Box Plot of the counts of variance of expected distance  <br /> from the expected distance to the ground truth distance for different Users and Modes',
    legend_title='User'
)

# Show the plot
fig.show()


In [None]:
import pandas as pd
import plotly.express as px

# Create the DataFrame
df = df_result_text3

# Sort the DataFrame by validation accuracy in descending order
df = df.sort_values(by='model_validation_prediction_accuracy', ascending=False)

# Create the box plot using Plotly
fig = px.box(df, x='model_validation_prediction_accuracy', y='var_Count', color='user_id', points='all')

# Update the layout
fig.update_layout(
    xaxis=dict(autorange='reversed'),
    xaxis_title='Accuracy',
    yaxis_title='variance_Count',
    title='Plot of the counts of variance of expected distance  <br /> from the expected distance to the ground truth distance for different Modes and Users with n*n confusion matrix <br /> (standard deviation > 1 data points)',
    legend_title='User'
)

# Show the plot
fig.show()


In [None]:
import pandas as pd
import plotly.express as px

# Create the DataFrame
df = df_result_text2

# Sort the DataFrame by validation accuracy in descending order
df = df.sort_values(by='model_validation_prediction_accuracy', ascending=False)

# Create the box plot using Plotly
fig = px.box(df, x='model_validation_prediction_accuracy', y='SD_Count', color='user_id', points='all')

# Update the layout
fig.update_layout(
    xaxis=dict(autorange='reversed'),
    xaxis_title='Accuracy',
    yaxis_title='SD_Count',
    title='Plot of the counts of standard deviation of expected distance  <br /> from the expected distance to the ground truth distance for different Users and Modes',
    legend_title='User'
)

# Show the plot
fig.show()


In [101]:
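# Ground truth trip count per mode (for matrix == 'trip_count')
validation_trips_mode_true_dic =  validation_trips.groupby('mode_true').size().to_dict()


In [102]:
# Ground truth distance (km) per mode (for matrix == 'distance'); this overwrites the count dictionary from the previous cell
validation_trips_mode_true_dic =  validation_trips.groupby('mode_true')['distance_km'].sum().to_dict()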


In [103]:
def drop_mode_except_in_list(dictionary, except_list):
    keys = list(dictionary.keys())
    dictionary['others'] = 0

    for key in keys:
        if key not in except_list:
            dictionary['others'] += dictionary[key]
            dictionary.pop(key)

# only keep the mode in the energy_intensity.csv, otherwise count in 'others'
drop_mode_except_in_list(combined_average_ev_dic, list(df_EI['mode']))
drop_mode_except_in_list(combined_variance_dic, list(df_EI['mode']))
drop_mode_except_in_list(validation_trips_mode_true_dic, list(df_EI['mode']))            
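
`drop_mode_except_in_list` edits the dictionary in place, so running the cell a second time resets 'others' to 0 and then removes that key unless 'others' is itself in the keep list. A non-mutating sketch that returns a new dictionary is shown below; `group_other_modes` and the toy values are hypothetical.

```python
def group_other_modes(values, keep):
    """Return a new dict where modes not in `keep` are summed under 'others'."""
    grouped = {}
    for mode, value in values.items():
        key = mode if mode in keep else "others"
        grouped[key] = grouped.get(key, 0) + value
    grouped.setdefault("others", 0)  # always report an 'others' entry
    return grouped

toy_values = {"walk": 5, "bike": 3, "scooter": 2, "skate": 1}
grouped = group_other_modes(toy_values, keep=["walk", "bike"])
regrouped = group_other_modes(grouped, keep=["walk", "bike"])  # safe to apply again
```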

In [None]:
import plotly.graph_objects as go

# Input dictionaries: expected values, ground truth values, and variances for the error bars
dict1 = combined_average_ev_dic
dict2 = validation_trips_mode_true_dic
dict3 = {key:val for key,val in combined_variance_dic.items()}

# Merge dictionaries and keep only the common keys
common_keys = set(dict1.keys()).intersection(dict2.keys()).intersection(dict3.keys())
merged_dict = {key: (dict1[key], dict2[key], dict3[key]) for key in common_keys}

# Sort the merged_dict
merged_dict = dict(sorted(merged_dict.items(), key=lambda x:x[1], reverse=True))

# Extract keys and values from the merged dictionary
keys = list(merged_dict.keys())
values_dict1 = [merged_dict[key][0] for key in keys]
values_dict2 = [merged_dict[key][1] for key in keys]
values_dict3 = [merged_dict[key][2] for key in keys]


# Create bar traces for each dictionary
trace1 = go.Bar(x=keys, y=values_dict1, name='Expected mode count',
                error_y=dict(
                type='data',
                array=values_dict3,
                visible=True,
                color='#000000',
                thickness=0.5
            ))
trace2 = go.Bar(x=keys, y=values_dict2, name='Ground truth mode count')

# Create the layout
layout = go.Layout(
    title='Expected mode count VS Ground truth mode count',
    xaxis=dict(title='Mode name'),
    yaxis=dict(title='Mode count'),
    barmode='group'
)

# Create the figure and add the traces
fig = go.Figure(data=[trace1, trace2], layout=layout)
fig.update_layout(width = 900, height = 600)

fig.write_image("plots/Expected mode count VS Ground truth mode count --- with one variance.png", format="png")

# Show the figure
fig.show()


In [None]:
import plotly.graph_objects as go

# Input dictionaries: expected values, ground truth values, and standard deviations for the error bars
dict1 = combined_average_ev_dic
dict2 = validation_trips_mode_true_dic
dict3 = {key:math.sqrt(val) for key,val in combined_variance_dic.items()}

# Merge dictionaries and keep only the common keys
common_keys = set(dict1.keys()).intersection(dict2.keys()).intersection(dict3.keys())
merged_dict = {key: (dict1[key], dict2[key], dict3[key]) for key in common_keys}

# Sort the merged_dict
merged_dict = dict(sorted(merged_dict.items(), key=lambda x:x[1], reverse=True))

# Extract keys and values from the merged dictionary
keys = list(merged_dict.keys())
values_dict1 = [merged_dict[key][0] for key in keys]
values_dict2 = [merged_dict[key][1] for key in keys]
values_dict3 = [merged_dict[key][2] for key in keys]


# Create bar traces for each dictionary
trace1 = go.Bar(x=keys, y=values_dict1, name='Expected mode shared by distance',
                error_y=dict(
                type='data',
                array=values_dict3,
                visible=True,
                color='#000000',
                thickness=0.5
            ))
trace2 = go.Bar(x=keys, y=values_dict2, name='Ground truth mode shared by distance')

# Create the layout
layout = go.Layout(
    title='Expected mode share by distance VS Ground truth mode shared by distance   <br /> --- with one standard deviation error bar',
    xaxis=dict(title='Mode name'),
    yaxis=dict(title='Distance'),
    barmode='group'
)

# Create the figure and add the traces
fig = go.Figure(data=[trace1, trace2], layout=layout)
fig.update_layout(width = 900, height = 600)

fig.write_image("plots/Expected mode shared by distance VS Ground truth mode shared by distance --- with one standard deviation.png", format="png")

# Show the figure
fig.show()
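
The error bars in this plot take the square root of the combined variance, i.e. the per-user variances are added first and the square root is taken once. The toy numbers below show why that differs from adding per-user standard deviations; treating users as independent is an assumption here.

```python
import math

per_user_variances = [9.0, 16.0]  # toy variances for one mode, two users

# Square root of the summed variances (what the plot uses)
combined_sd = math.sqrt(sum(per_user_variances))
# Sum of the individual standard deviations (larger, and not what the plot uses)
sum_of_sds = sum(math.sqrt(v) for v in per_user_variances)
```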


In [None]:
import plotly.graph_objects as go

# Input dictionaries: expected values, ground truth values, and variances for the error bars
dict1 = combined_average_ev_dic
dict2 = validation_trips_mode_true_dic
dict3 = {key:val for key,val in combined_variance_dic.items()}

# Merge dictionaries and keep only the common keys
common_keys = set(dict1.keys()).intersection(dict2.keys()).intersection(dict3.keys())
merged_dict = {key: (dict1[key], dict2[key], dict3[key]) for key in common_keys}

# Sort the merged_dict
merged_dict = dict(sorted(merged_dict.items(), key=lambda x:x[1], reverse=True))

# Extract keys and values from the merged dictionary
keys = list(merged_dict.keys())
values_dict1 = [merged_dict[key][0] for key in keys]
values_dict2 = [merged_dict[key][1] for key in keys]
values_dict3 = [merged_dict[key][2] for key in keys]


# Create bar traces for each dictionary
trace1 = go.Bar(x=keys, y=values_dict1, name='Expected mode shared by distance',
                error_y=dict(
                type='data',
                array=values_dict3,
                visible=True,
                color='#000000',
                thickness=0.5
            ))
trace2 = go.Bar(x=keys, y=values_dict2, name='Ground truth mode shared by distance')

# Create the layout
layout = go.Layout(
    title='Expected mode share by distance VS Ground truth mode shared by distance <br /> --- with one variance error bar',
    xaxis=dict(title='Mode name'),
    yaxis=dict(title='Distance'),
    barmode='group'
)

# Create the figure and add the traces
fig = go.Figure(data=[trace1, trace2], layout=layout)
fig.update_layout(width = 900, height = 600)

fig.write_image("plots/Expected mode shared by distance VS Ground truth mode shared by distance --- with one variance_NN.png", format="png")

# Show the figure
fig.show()
