# Base Score Calculation

## General Approach

To understand the performance of our base model, we first need to calculate a base score.
The goal of the model is to forecast the delay category between the creation of a shipment and its first hub scan. In other words, the target is the binned difference between the `created_at` and `first_hub_scan` timestamps of a shipment. We work with the following bins, which refer to the number of days between the two timestamps:
- 0 (days)
- 1 (days)
- 2 (days)
- 3 (days)
- 4 (days)
- 5 (days)
- 6 (days)
- 7 (days)
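
The binning described above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the helper name `delay_bin` is hypothetical, and it assumes that "days" means whole elapsed 24-hour periods rather than the difference in calendar dates.

```python
from datetime import datetime

VALID_BINS = range(8)  # bins 0..7 days, as listed above

def delay_bin(created_at, first_hub_scan):
    """Return the whole number of elapsed days, or None if it falls outside the bins."""
    # timedelta.days floors to whole days (assumption: this is the intended binning)
    days = (first_hub_scan - created_at).days
    return days if days in VALID_BINS else None

created = datetime(2021, 3, 1, 9, 30)
print(delay_bin(created, datetime(2021, 3, 1, 18, 0)))   # scanned the same day -> 0
print(delay_bin(created, datetime(2021, 3, 4, 8, 0)))    # 2 days 22.5 hours -> 2
print(delay_bin(created, datetime(2021, 3, 12, 9, 30)))  # 11 days, outside the bins -> None
```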

To calculate the base score, we use the average delay between `created_at` and `first_hub_scan` as a constant prediction for every shipment.

## Loading and Preparing the Data

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from dispatcher.data.ticket import Ticket
from dispatcher.data.shipment import Shipment

In [3]:
ticket = Ticket.get_ticket_features(Ticket)
ticket.head()

ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

In [None]:
shipment = Shipment.get_shipment_features(Shipment)
shipment.head()

We only need the shipment ID (`ID`), `CREATED_AT`, and `First hub scan`.

In [None]:
created = shipment[['ID','CREATED_AT']]
fhs = ticket[['First hub scan']]
fhs.head()

In [None]:
merged = created.merge(fhs, how='left',left_on='ID',right_index=True)
merged.head()

In [None]:
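# Difference between first hub scan and creation, as a timedelta
merged['DIFF_TRUE'] = merged['First hub scan'] - merged['CREATED_AT']
# Whole days; .dt.days is more portable across pandas versions than astype('timedelta64[D]')
merged['DIFF_TRUE'] = merged['DIFF_TRUE'].dt.days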
merged.head()
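
To illustrate the day conversion on data we can inspect by hand, here is a small sketch with made-up timestamps. It assumes both columns are already parsed as datetimes and uses the `.dt.days` accessor to get whole days. The third row has no first hub scan, as can happen after a left merge, so its difference is missing and it will not match any bin later on.

```python
import pandas as pd

# Toy data (invented for illustration, not taken from the real shipment tables)
toy = pd.DataFrame({
    "CREATED_AT": pd.to_datetime(
        ["2021-03-01 09:30", "2021-03-01 09:30", "2021-03-02 12:00"]
    ),
    "First hub scan": pd.to_datetime(
        ["2021-03-01 18:00", "2021-03-04 08:00", None]
    ),
})

# Timedelta between the two timestamps, reduced to whole elapsed days
toy["DIFF_TRUE"] = (toy["First hub scan"] - toy["CREATED_AT"]).dt.days
print(toy)
```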

We exclude all differences that are not in the bins we are interested in. 

In [None]:
# Keep only the bins listed above (0 to 7 days); rows with a missing difference are dropped as well
clean_df = merged[merged['DIFF_TRUE'].isin(list(range(8)))].copy()

## Calculating the Average Delay

In [None]:
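# Mean delay in days, rounded to the nearest whole-day bin (a scalar)
avg_diff = round(clean_df['DIFF_TRUE'].mean())
avg_diff

In [None]:
# Use the rounded mean as the constant prediction for every shipment
clean_df['DIFF_PRED'] = avg_diff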
clean_df.head()

## Calculating the Base Score `Accuracy`

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
base_score = round(accuracy_score(y_true=clean_df['DIFF_TRUE'], y_pred=clean_df['DIFF_PRED']),2)
base_score
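
Because every shipment receives the same prediction, this accuracy is simply the share of shipments whose true bin equals the rounded mean. The following sketch shows the arithmetic on invented delays; the function name `constant_baseline_accuracy` and the numbers are hypothetical and only meant to make the calculation concrete.

```python
def constant_baseline_accuracy(delays):
    """Predict the rounded mean delay for every sample and return (prediction, accuracy)."""
    prediction = round(sum(delays) / len(delays))
    hits = sum(1 for d in delays if d == prediction)
    return prediction, hits / len(delays)

toy_delays = [1, 2, 2, 2, 2, 3, 3, 4]  # invented delays in days
prediction, accuracy = constant_baseline_accuracy(toy_delays)
print(prediction, accuracy)  # mean is 2.375, so the prediction is 2; 4 of 8 match -> 0.5
```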