## Zego Data Science Telematics Technical Test
### Samuel Dolman, August 16th 2021

In [14]:
import pandas as pd
import numpy as np
import os

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

## Overview of Approach

* 37 trips are available - we will take **one observation for each labelled trip**
* an alternative would be to split each trip into e.g. minute-long segments ('sub-trips') and apply the trip's label to every sub-trip.
    * rejected this approach due to time constraints (with more time, it would very likely be superior)
    * potential advantage: many more observations for training / testing, which allows the use of more powerful algorithms (although sub-trips from the same trip must be kept within the same validation split, to avoid leakage)
    * potential disadvantage: no guarantee that the 'true' label is consistent across the entire trip - e.g. a customer may walk for the first 2 minutes to the vehicle, then drive for the remainder
    * potential disadvantage: many sub-trips would not contain 'pulling away' and 'braking to a stop' manoeuvres - these manoeuvres are likely to be important for achieving good classification accuracy
* binary classification task with a small/moderate amount of class imbalance - **AUC** should be a good choice of performance metric (alternatives could be Accuracy / F1 Score, etc)
* ideal validation approach would be fixed **train / validate / test** sets, but there are not enough trips to do this robustly. So, all results are presented as **mean out-of-fold AUC using stratified 4-fold CV** (with only four folds, we can guarantee that at least two trips of the minority class appear in each fold)
* algorithm of choice will be a simple **logistic regression** (with **L1 and L2 elastic-net** penalties to minimise overfitting)
    * I would definitely prefer an xgboost/lgbm/catboost classifier (or even just a random forest), but the decision to take one observation per trip severely limits the number of training observations
    * 1D Convolutional Neural Network - I have seen these used successfully, so would investigate further if given time
    * ARIMA - often used for time-domain problems, so worth investigating. However, I suspect it may not perform well - usually more useful for projection/extrapolation tasks than for classification
* feature engineering*:
    * statistical
    * time based
    * spectral - split each trip into segments, apply a Fourier transform to each segment to generate a spectrum, then average over all segments





\* Note that the feature engineering approach was partly inspired by two papers:

    Samuli Hemminki - https://www.cs.helsinki.fi/u/shemmink/Transportation/hemminki13transportation.pdf
    
    Nguyen, Linh Vuong - https://dspace.mit.edu/handle/1721.1/120606
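
The sub-trip alternative described in the overview needs a grouped split, so that segments of one trip never end up on both sides of a validation split. A minimal sketch of that idea, using scikit-learn's `GroupKFold`; the helper name `split_into_subtrips`, the segment length of 100 samples and the synthetic signals are illustrative assumptions, not part of the analysis above.

```python
import numpy as np
from sklearn.model_selection import GroupKFold


def split_into_subtrips(signal, segment_len):
    """Cut a 1-D signal into non-overlapping segments, dropping any remainder."""
    n_segments = len(signal) // segment_len
    return signal[: n_segments * segment_len].reshape(n_segments, segment_len)


# Synthetic stand-in for real trips: 8 trips of varying length (assumption).
rng = np.random.default_rng(0)
trip_signals = {trip_id: rng.normal(size=rng.integers(300, 600)) for trip_id in range(8)}

segments, groups = [], []
for trip_id, signal in trip_signals.items():
    subtrips = split_into_subtrips(signal, segment_len=100)
    segments.append(subtrips)
    groups.extend([trip_id] * len(subtrips))  # every sub-trip remembers its parent trip

X = np.vstack(segments)
groups = np.array(groups)

# GroupKFold keeps all sub-trips of a trip on the same side of each split.
splits = list(GroupKFold(n_splits=4).split(X, groups=groups))
for train_idx, test_idx in splits:
    shared = set(groups[train_idx]) & set(groups[test_idx])
    print(f"test trips: {sorted(set(groups[test_idx]))}, shared with train: {len(shared)}")
```

`GroupKFold` does not stratify by class; with an imbalanced label, a stratified and grouped splitter would probably be the better fit.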

In [8]:

os.getcwd()

'D:\\Data_Science\\Projects\\zego_telematics'

## Load Data

In [9]:
data_location = 'data'
trip_location = 'data/trips'

In [27]:
trip_classes = pd.read_csv(os.path.join(data_location, 'trip_classes.csv'))

# one DataFrame per trip, keyed by trip_id
trips = {row['trip_id']: pd.read_csv(os.path.join(trip_location, row['trip_data']))
         for _, row in trip_classes.iterrows()}

In [23]:
trip_classes.head()

Unnamed: 0,trip_type,trip_id,trip_data
0,car,0,trip_0.csv
1,walk_bicycle_still,1,trip_1.csv
2,car,2,trip_2.csv
3,car,3,trip_3.csv
4,car,4,trip_4.csv


In [None]:
trips


## EDA

In [30]:
## timestamp lengths per trip?
## trip type value counts?
## visualise a few examples?
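
The first two questions could be answered with a small summary table. A rough sketch, run here on synthetic data: `trip_id` and `trip_type` come from `trip_classes.csv`, while the helper name `summarise_trips` and the sensor column `x` are assumptions.

```python
import numpy as np
import pandas as pd


def summarise_trips(trip_classes, trips):
    """Return one row per trip: its label and number of sensor readings."""
    summary = trip_classes[['trip_id', 'trip_type']].copy()
    summary['n_rows'] = summary['trip_id'].map(lambda trip_id: len(trips[trip_id]))
    return summary


# Tiny synthetic stand-in for the real data (assumption).
demo_classes = pd.DataFrame({
    'trip_type': ['car', 'walk_bicycle_still', 'car'],
    'trip_id': [0, 1, 2],
})
demo_trips = {
    0: pd.DataFrame({'x': np.zeros(120)}),
    1: pd.DataFrame({'x': np.zeros(80)}),
    2: pd.DataFrame({'x': np.zeros(200)}),
}

summary = summarise_trips(demo_classes, demo_trips)
print(summary['n_rows'].describe())         # distribution of trip lengths
print(summary['trip_type'].value_counts())  # class balance
```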

## Process Single Trip (Feature Engineering)

In [None]:
## remove nulls
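
One possible way to handle nulls is to drop any reading with a missing sensor value. The sketch below assumes hypothetical column names (`timestamp`, `x`, `y`, `z`); the real trip files may be laid out differently, and interpolation could be preferable if gaps are frequent.

```python
import numpy as np
import pandas as pd


def drop_null_rows(trip, sensor_cols):
    """Drop readings where any sensor column is missing and reset the index."""
    return trip.dropna(subset=sensor_cols).reset_index(drop=True)


# Synthetic trip with two incomplete readings (column names are assumptions).
demo_trip = pd.DataFrame({
    'timestamp': [0.0, 0.1, 0.2, 0.3],
    'x': [0.1, np.nan, 0.3, 0.2],
    'y': [0.0, 0.1, 0.1, np.nan],
    'z': [9.8, 9.8, 9.7, 9.8],
})
clean_trip = drop_null_rows(demo_trip, sensor_cols=['x', 'y', 'z'])
print(len(demo_trip), '->', len(clean_trip))
```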

In [None]:
## normalise timestamps?
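
If timestamps are absolute, one option is to sort each trip by time and express every reading as elapsed time since the start of the trip. A small sketch, assuming a numeric column named `timestamp` (the name and the units are assumptions):

```python
import pandas as pd


def normalise_timestamps(trip, time_col='timestamp'):
    """Sort by time and re-express timestamps as elapsed time since trip start."""
    out = trip.sort_values(time_col).reset_index(drop=True)
    out[time_col] = out[time_col] - out[time_col].iloc[0]
    return out


# Synthetic trip with out-of-order readings (assumption).
demo_trip = pd.DataFrame({'timestamp': [1005.0, 1000.0, 1002.5], 'x': [0.3, 0.1, 0.2]})
normalised = normalise_timestamps(demo_trip)
print(normalised)
```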

In [None]:
## normalise out gravity direction?? e.g. do the rest on X/Y only?
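
One way to make the features independent of how the phone is held is to estimate the gravity vector and split the remaining acceleration into a vertical and a horizontal part. The sketch below estimates gravity as the mean acceleration vector, which is a simplifying assumption (it only holds if the device orientation is roughly fixed over the window); the function name and the synthetic signal are illustrative.

```python
import numpy as np


def split_vertical_horizontal(acc):
    """Split (n, 3) accelerometer readings into vertical and horizontal parts.

    Gravity is estimated as the mean acceleration vector (an assumption that
    only holds if the device orientation is roughly fixed during the window).
    """
    gravity = acc.mean(axis=0)
    g_unit = gravity / np.linalg.norm(gravity)
    dynamic = acc - gravity                        # remove the static component
    vertical = dynamic @ g_unit                    # signed component along gravity
    horizontal = dynamic - np.outer(vertical, g_unit)
    return vertical, np.linalg.norm(horizontal, axis=1)


# Synthetic check: phone tilted so gravity is not aligned with any axis.
g_dir = np.array([1.0, 2.0, 2.0]) / 3.0            # unit vector
bounce = np.sin(np.linspace(0, 20 * np.pi, 1000))  # motion along gravity only
acc = 9.81 * g_dir + np.outer(bounce, g_dir)

vertical, horizontal_mag = split_vertical_horizontal(acc)
print(np.abs(vertical - bounce).max(), horizontal_mag.max())
```

For long trips where the phone is moved around, a windowed or low-pass gravity estimate would likely be more appropriate than a single global mean.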

In [None]:
## statistical features
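
Statistical features could be plain summary statistics of each signal. A minimal sketch; the choice of statistics and the helper name `statistical_features` are assumptions rather than a fixed feature set.

```python
import numpy as np


def statistical_features(signal, prefix):
    """Summary statistics of a 1-D signal, returned as a flat dict."""
    signal = np.asarray(signal, dtype=float)
    q05, q25, q50, q75, q95 = np.percentile(signal, [5, 25, 50, 75, 95])
    return {
        f'{prefix}_mean': signal.mean(),
        f'{prefix}_std': signal.std(),
        f'{prefix}_min': signal.min(),
        f'{prefix}_max': signal.max(),
        f'{prefix}_median': q50,
        f'{prefix}_iqr': q75 - q25,
        f'{prefix}_range_5_95': q95 - q05,
    }


features = statistical_features(np.arange(101), prefix='acc')
print(features)
```

Percentile-based spreads are included because they are less sensitive to isolated spikes than min and max.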

In [None]:
## spectral features
    ## moving timeframes (overlapping)
    ## calc spectral metrics for each timeframe
    ## summarise metrics across all timeframes
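
The three steps above can be sketched with NumPy alone: slide an overlapping window over the signal, take the power spectrum of each window, average the spectra, then summarise the average. The window length, overlap, the 3 Hz band edge and the feature names are all assumptions chosen for illustration, and the sampling rate `fs` must be known (here it is set for a synthetic signal).

```python
import numpy as np


def spectral_features(signal, fs, window_len=256, overlap=0.5):
    """Average the power spectrum over overlapping windows, then summarise it.

    Assumes len(signal) >= window_len and a roughly uniform sampling rate fs.
    """
    signal = np.asarray(signal, dtype=float)
    step = int(window_len * (1 - overlap))
    taper = np.hanning(window_len)  # reduces spectral leakage at window edges
    spectra = []
    for start in range(0, len(signal) - window_len + 1, step):
        window = signal[start:start + window_len]
        window = (window - window.mean()) * taper  # drop the DC offset per window
        spectra.append(np.abs(np.fft.rfft(window)) ** 2)
    mean_power = np.mean(spectra, axis=0)
    freqs = np.fft.rfftfreq(window_len, d=1.0 / fs)
    total = mean_power.sum()
    return {
        'dominant_freq': freqs[np.argmax(mean_power)],
        'spectral_centroid': (freqs * mean_power).sum() / total,
        'low_band_share': mean_power[freqs < 3.0].sum() / total,
    }


# Synthetic check: a 5 Hz oscillation sampled at 50 Hz plus a little noise.
fs = 50.0
t = np.arange(0, 60, 1 / fs)
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 5.0 * t) + 0.1 * rng.normal(size=t.size)
features = spectral_features(signal, fs=fs)
print(features)
```

The dominant frequency is only resolved to the nearest FFT bin (`fs / window_len`), so longer windows give finer frequency resolution at the cost of fewer windows to average over.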

In [None]:
## time based features
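
Time-based features could describe how the signal evolves from sample to sample rather than its overall distribution. A small sketch with three such descriptors; which features belong here is an assumption, as are the names used.

```python
import numpy as np


def time_features(timestamps, signal):
    """Simple time-domain descriptors of a 1-D signal."""
    timestamps = np.asarray(timestamps, dtype=float)
    signal = np.asarray(signal, dtype=float)
    centred = signal - signal.mean()
    # count sign changes of the mean-centred signal between neighbouring samples
    sign_changes = np.sum(np.signbit(centred[:-1]) != np.signbit(centred[1:]))
    duration = timestamps[-1] - timestamps[0]
    return {
        'duration': duration,
        'mean_crossing_rate': sign_changes / duration,    # crossings per time unit
        'mean_abs_diff': np.abs(np.diff(signal)).mean(),  # sample-to-sample roughness
    }


t = np.arange(0, 10, 0.01)
features = time_features(t, np.sin(2 * np.pi * 1.0 * t))  # 1 Hz sine
print(features)
```

A 1 Hz sine crosses its mean about twice per second, which gives a quick sanity check on the crossing rate.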

## Process All Trips
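
With one observation per trip, processing all trips amounts to applying a per-trip feature function to every entry in `trips` and joining the result to the labels on `trip_id`. A sketch under assumptions: `extract_features` is a hypothetical placeholder for the feature engineering steps, and the sensor columns `x`, `y`, `z` are assumed names.

```python
import numpy as np
import pandas as pd


def extract_features(trip):
    """Hypothetical per-trip feature function; stands in for the real pipeline."""
    magnitude = np.sqrt(trip['x'] ** 2 + trip['y'] ** 2 + trip['z'] ** 2)
    return {'mag_mean': magnitude.mean(), 'mag_std': magnitude.std(), 'n_rows': len(trip)}


def build_feature_table(trips, trip_classes):
    """One row of features per trip, joined to its label on trip_id."""
    rows = {trip_id: extract_features(trip) for trip_id, trip in trips.items()}
    features = pd.DataFrame.from_dict(rows, orient='index').rename_axis('trip_id')
    labels = trip_classes.set_index('trip_id')['trip_type']
    return features.join(labels)


# Synthetic stand-in for the loaded data (assumption).
rng = np.random.default_rng(0)
demo_trips = {i: pd.DataFrame(rng.normal(size=(50 + 10 * i, 3)), columns=['x', 'y', 'z'])
              for i in range(3)}
demo_classes = pd.DataFrame({'trip_id': [0, 1, 2],
                             'trip_type': ['car', 'walk_bicycle_still', 'car']})

feature_table = build_feature_table(demo_trips, demo_classes)
print(feature_table)
```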

## Train Model
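
The validation scheme from the overview (elastic-net logistic regression, stratified 4-fold CV, mean out-of-fold AUC) could look roughly like the sketch below. It runs on synthetic data shaped like the real problem (37 observations); the class split, the feature count and the hyperparameters `l1_ratio=0.5` and `C=1.0` are assumptions, not tuned values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 37 trips, 5 features, ~25% minority class (assumption).
rng = np.random.default_rng(0)
y = np.array([1] * 9 + [0] * 28)
X = rng.normal(size=(37, 5))
X[:, 0] += 1.5 * y  # make one feature informative

model = make_pipeline(
    StandardScaler(),  # penalised models need features on a common scale
    LogisticRegression(penalty='elasticnet', solver='saga',
                       l1_ratio=0.5, C=1.0, max_iter=5000),
)
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_auc = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f"mean out-of-fold AUC: {fold_auc.mean():.3f} (+/- {fold_auc.std():.3f})")
```

Putting the scaler inside the pipeline means it is refit on the training part of each fold, so no information from the held-out fold leaks into preprocessing.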

## Analysis