# Risk Model Backtesting

This notebook backtests our risk model. For a given trip, we compute its historical success rate and compare it with the risk predicted by the model.
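As a rough illustration of the comparison described above, the sketch below computes the difference between an observed success rate and a predicted probability. The function name `backtest_gap` and the numbers are hypothetical and are not taken from the model or the data.

```python
def backtest_gap(successes: int, total: int, predicted: float) -> float:
    """Return the observed success rate minus the predicted probability."""
    if total <= 0:
        raise ValueError("total must be positive")
    observed = successes / total
    return observed - predicted


# Hypothetical numbers: 95 on-time arrivals out of 100 runs,
# while the model predicts a 0.90 probability of success.
gap = backtest_gap(95, 100, 0.90)
print(f"observed - predicted = {gap:+.3f}")
```

A gap close to zero suggests that the model agrees with history for this trip; a large gap in either direction is worth investigating.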

## Initialize the environment

In [6]:
%load_ext sparkmagic.magics

The sparkmagic.magics extension is already loaded. To reload it, use:
  %reload_ext sparkmagic.magics


In [7]:
import os
from IPython import get_ipython
username = os.environ['RENKU_USERNAME']
server = "http://iccluster044.iccluster.epfl.ch:8998"

# set the application name as "<your_gaspar_id>-final-project"
get_ipython().run_cell_magic(
    'spark',
    line='config', 
    cell="""{{ "name": "{0}-final-projectt", "executorMemory": "4G", "executorCores": 4, "numExecutors": 10, "driverMemory": "4G" }}""".format(username)
)
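The doubled braces in the cell above are needed because `str.format` treats single braces as placeholders. One possible alternative, shown here only as a sketch, is to build the same configuration as a dictionary and serialize it with `json.dumps`. The helper name `session_config` and the username `alice` are hypothetical.

```python
import json


def session_config(username: str) -> str:
    """Build the session configuration as a JSON string."""
    config = {
        "name": f"{username}-final-projectt",  # same name as in the cell above
        "executorMemory": "4G",
        "executorCores": 4,
        "numExecutors": 10,
        "driverMemory": "4G",
    }
    return json.dumps(config)


# "alice" is a placeholder; the notebook reads the name from RENKU_USERNAME.
cell = session_config("alice")
print(cell)
```

The resulting string could then be passed as the `cell` argument of `run_cell_magic`.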

In [8]:
get_ipython().run_line_magic(
    "spark", f"""add -s {username}-final-projectt -l python -u {server} -k"""
)

Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | User | Current session? |
|---|---|---|---|---|---|---|---|
| 7268 | application_1680948035106_6677 | pyspark | idle | Link | Link | | ✔ |



SparkSession available as 'spark'.


In [66]:
%%spark

# Imports
import pyspark.sql.functions as F
from pyspark.sql import Row
from math import radians, cos, sin, asin, sqrt, floor
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, ArrayType, FloatType
import matplotlib.pyplot as plt
from scipy.stats import gamma
from scipy.stats import norm
from scipy.optimize import curve_fit
import numpy as np


## Test 1

Here we test the trip from Zürich Oerlikon to Zürich HB that departs at 18h11 and is scheduled to arrive at 18h16. We require the arrival to be at or before 18h17, and we compute the historical success rate of the trip under that constraint.
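The success criterion can be sketched in plain Python as a comparison between the observed arrival and the deadline. The function name `arrived_in_time` and the sample timestamps are hypothetical; only the 18h17 deadline comes from the text.

```python
from datetime import datetime


def arrived_in_time(actual_arrival: datetime, deadline: datetime) -> bool:
    """Return True when the observed arrival is at or before the deadline."""
    return actual_arrival <= deadline


# Hypothetical observations on an arbitrary day.
deadline = datetime(2022, 5, 2, 18, 17, 0)
on_time = arrived_in_time(datetime(2022, 5, 2, 18, 16, 40), deadline)
late = arrived_in_time(datetime(2022, 5, 2, 18, 17, 30), deadline)
print(on_time, late)
```

This sketch treats the deadline as exactly 18:17:00. A check that only looks at the minute, such as `minute <= 17`, also accepts arrivals up to 18:17:59, so the two conventions can differ by up to a minute.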

In [92]:
%%spark

# Path to the ORC data
path_istdaten = "/data/sbb/part_orc/istdaten"
path_allstops = "/data/sbb/orc/allstops"

# Loading ORC data into a Spark dataframe
df_istdaten = spark.read.orc(path_istdaten)
df_allstops = spark.read.orc(path_allstops)

# Sampling is currently disabled (the .sample calls are commented out), so the full dataset is used
df_istdaten = df_istdaten#.sample(1, 0)
df_allstops = df_allstops#.sample(1, 0)


In [93]:
%%spark

# Drop the rows we do not need, then keep and rename the relevant columns
df_istdaten_clean = df_istdaten.filter(
    (df_istdaten['ZUSATZFAHRT_TF'] == False)     # drop additional (unscheduled) trips
    & (df_istdaten['DURCHFAHRT_TF'] == False)    # drop pass-throughs where the vehicle does not stop
    & (df_istdaten['FAELLT_AUS_TF'] == False)    # drop cancelled trips
).select(
    df_istdaten['FAHRT_BEZEICHNER'].alias('trip_id'),
    df_istdaten['BETRIEBSTAG'].alias('date'),
    df_istdaten['PRODUKT_ID'].alias('transport_type'),
    df_istdaten['HALTESTELLEN_NAME'].alias('stop_name'),
    # Scheduled times have minute precision; observed times include seconds
    F.to_timestamp(df_istdaten['ANKUNFTSZEIT'], 'dd.MM.yyyy HH:mm').alias('arrival_time'),
    F.to_timestamp(df_istdaten['AN_PROGNOSE'], 'dd.MM.yyyy HH:mm:ss').alias('actual_arrival_time'),
    df_istdaten['AN_PROGNOSE_STATUS'].alias('arrival_time_status'),
    F.to_timestamp(df_istdaten['ABFAHRTSZEIT'], 'dd.MM.yyyy HH:mm').alias('departure_time'),
    F.to_timestamp(df_istdaten['AB_PROGNOSE'], 'dd.MM.yyyy HH:mm:ss').alias('actual_departure_time'),
    df_istdaten['AB_PROGNOSE_STATUS'].alias('departure_time_status'),
)

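The two timestamp layouts used in the cell above (minute precision for scheduled times, second precision for observed times) can be illustrated with the standard library. This is only a sketch: the raw strings are hypothetical sample values, and the `strptime` format codes are the Python equivalents of the Spark patterns, not the patterns themselves.

```python
from datetime import datetime

SCHEDULED_FORMAT = "%d.%m.%Y %H:%M"     # layout assumed for ANKUNFTSZEIT / ABFAHRTSZEIT
OBSERVED_FORMAT = "%d.%m.%Y %H:%M:%S"   # layout assumed for AN_PROGNOSE / AB_PROGNOSE

# Hypothetical raw values for a single stop event.
scheduled = datetime.strptime("02.05.2022 18:16", SCHEDULED_FORMAT)
observed = datetime.strptime("02.05.2022 18:16:42", OBSERVED_FORMAT)

# Delay of the observed arrival relative to the schedule, in seconds.
delay_seconds = (observed - scheduled).total_seconds()
print(delay_seconds)
```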

In [134]:
%%spark
# Departures from Zürich Oerlikon (stop A) and arrivals at Zürich HB (stop B);
# for stop B we keep only the rows whose arrival time status is 'REAL'
stopA = df_istdaten_clean.filter((df_istdaten_clean['stop_name'] == 'Zürich Oerlikon')).select('trip_id','departure_time', 'date').cache()
stopB = df_istdaten_clean.filter((df_istdaten_clean['stop_name'] == 'Zürich HB') & (df_istdaten_clean['arrival_time_status'] == "REAL")).cache()


In [135]:
%%spark
# Keep every past run of the trip: scheduled departure at 18:11 from A, scheduled arrival at 18:16 at B
trip_by_A = stopA.filter((F.hour(stopA['departure_time']) == 18) & (F.minute(stopA['departure_time']) == 11))
trip_by_B = stopB.filter((F.hour(stopB['arrival_time']) == 18) & (F.minute(stopB['arrival_time']) == 16))


In [136]:
%%spark
# We want the trip to pass by both A and B on the same day
inter = trip_by_A.join(trip_by_B, (trip_by_A.trip_id == trip_by_B.trip_id) & (trip_by_A.date == trip_by_B.date))

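The join above keeps only the runs that appear at both stops with the same trip identifier on the same day. A minimal plain-Python sketch of the same matching logic, using hypothetical rows keyed by `(trip_id, date)`:

```python
# Hypothetical departures from stop A, as (trip_id, date) keys.
departures = [
    ("t1", "01.05.2022"),
    ("t1", "02.05.2022"),
    ("t2", "01.05.2022"),
]

# Hypothetical observed arrivals at stop B, indexed by the same key.
arrivals = {
    ("t1", "01.05.2022"): "18:16:30",
    ("t1", "03.05.2022"): "18:18:05",
}

# Inner join: keep a departure only if an arrival exists for the same key.
matched = [(key, arrivals[key]) for key in departures if key in arrivals]
print(matched)
```

Only the run of `t1` on 01.05.2022 survives, because it is the only key present on both sides.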

In [137]:
%%spark
count_tot = inter.count()
print("We have a total of {} trips.".format(count_tot))


We have a total of 4839 trips.

In [140]:
%%spark
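# Count the runs whose observed arrival falls in hour 18 with minute <= 17
# (any arrival up to 18:17:59 counts as on time)
we_are_on_time = inter.filter((F.hour(inter['actual_arrival_time']) == 18) & (F.minute(inter['actual_arrival_time']) <= 17)).count()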


In [141]:
%%spark
score = we_are_on_time / count_tot
# score is a fraction in [0, 1]; the '%' format spec multiplies it by 100 for display
print("The trip has a success rate of {:.2%}.".format(score))

The trip has a success rate of 97.64%.
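When comparing this rate with the model's risk, it may help to know how precise the estimate is. The sketch below computes a Wilson score interval for a proportion; it is one common choice and not something the notebook itself does. The total of 4839 comes from the output above, while the success count of 4725 is inferred from the printed rate and total rather than read from the notebook.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# 4725 is inferred (0.9764414135151891 * 4839), not printed by the notebook.
low, high = wilson_interval(4725, 4839)
print(f"95% interval: [{low:.4f}, {high:.4f}]")
```

If the model's predicted success probability falls inside this interval, the backtest gives no strong evidence against the model for this trip.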