<h1>Safety Challenge - Preprocessing</h1>

This is the preprocessing step of my submission for the [Grab AI for SEA - Safety Challenge](https://www.aiforsea.com/safety). Given a dataset, it produces a file of per-trip feature data that is used in the training or testing step.
For training, assume this [dataset](https://s3-ap-southeast-1.amazonaws.com/grab-aiforsea-dataset/safety.zip) has already been extracted into the same folder as this notebook. For testing, the config below may need to be changed.

In [1]:
INPUT_DATASET_FEATURES_DIR = './safety/features/'
INPUT_DATASET_LABEL_DIR = './safety/labels/'
OUTPUT_FEATURES = 'dataset-ready.csv'

<h3>Import Libraries</h3>

In [2]:
import pandas as pd
import numpy as np
import glob, os

<h3>Load Data</h3>

In [3]:
telematics = pd.concat(map(pd.read_csv, glob.glob(os.path.join(INPUT_DATASET_FEATURES_DIR, "*.csv"))))
labels = pd.concat(map(pd.read_csv, glob.glob(os.path.join(INPUT_DATASET_LABEL_DIR, "*.csv"))))

<h3>Exploration and Cleansing</h3>

<h4>Telematics</h4>

In [4]:
telematics.head()

Unnamed: 0,bookingID,Accuracy,Bearing,acceleration_x,acceleration_y,acceleration_z,gyro_x,gyro_y,gyro_z,second,Speed
0,1202590843006,3.0,353.0,1.228867,8.9001,3.986968,0.008221,0.002269,-0.009966,1362.0,0.0
1,274877907034,9.293,17.0,0.032775,8.659933,4.7373,0.024629,0.004028,-0.010858,257.0,0.19
2,884763263056,3.0,189.0,1.139675,9.545974,1.951334,-0.006899,-0.01508,0.001122,973.0,0.667059
3,1073741824054,3.9,126.0,3.871543,10.386364,-0.136474,0.001344,-0.339601,-0.017956,902.0,7.913285
4,1056561954943,3.9,50.0,-0.112882,10.55096,-1.56011,0.130568,-0.061697,0.16153,820.0,20.419409


In [5]:
telematics.shape[0]

16135561

In [6]:
telematics['bookingID'].unique().size

20000

Remove rows with invalid speed (< 0 or > 300 km/h, which is about 83 m/s)

In [7]:
filtered_telematics = telematics[(telematics['Speed'] >= 0) & (telematics['Speed'] <= 83)]
filtered_telematics.shape[0]

15895172
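The cutoff of 83 in the filter above matches 300 km/h only if the `Speed` column is recorded in metres per second, which is assumed here. A minimal sketch of that conversion and of the same filter on a toy frame (the constant names and sample values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Assumption: Speed is in m/s, so the 300 km/h bound must be converted first.
MAX_SPEED_KMH = 300
MAX_SPEED_MS = MAX_SPEED_KMH / 3.6  # roughly 83.3 m/s

toy = pd.DataFrame({'Speed': [-1.0, 0.0, 12.5, 83.0, 120.0]})

# Keep only rows whose speed lies within [0, 83], as in the cell above.
valid = toy[(toy['Speed'] >= 0) & (toy['Speed'] <= 83)]
print(round(MAX_SPEED_MS, 1), valid['Speed'].tolist())
```

The negative reading and the 120 m/s reading are dropped; the boundary value 83.0 is kept because the comparison is inclusive.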

Remove rows with low accuracy (`Accuracy` value above 50)

In [8]:
filtered_telematics = filtered_telematics[filtered_telematics['Accuracy'] <= 50]
filtered_telematics.shape[0]

15875645

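Remove invalid trips: those whose largest `second` value exceeds 43200 (12 hours) or that have fewer than 100 remaining records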

In [9]:
trips = filtered_telematics[['bookingID','second']].groupby('bookingID').agg(['max','count'])
trips.head()

bookingID,second (max),second (count)
0,1589.0,1003
1,1034.0,838
2,825.0,195
4,1094.0,1094
6,1094.0,1095


In [10]:
bookingID_to_remove = trips[(trips[('second', 'max')] > 43200) | (trips[('second', 'count')] < 100)].index.tolist()
filtered_telematics = filtered_telematics[~filtered_telematics['bookingID'].isin(bookingID_to_remove)]
filtered_telematics.shape[0]

15874802

In [11]:
filtered_telematics['bookingID'].unique().size

19959
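The trip-level rule above can be checked on a small example. This is a sketch with made-up data; `MIN_RECORDS` is lowered to 2 so the toy frame stays small, whereas the notebook uses 100:

```python
import pandas as pd

toy = pd.DataFrame({
    'bookingID': [1, 1, 1, 2, 2, 3],
    'second':    [0, 1, 2, 0, 50000, 0],
})

# Per-trip maximum timestamp and number of records.
stats = toy.groupby('bookingID')['second'].agg(['max', 'count'])

# Illustrative thresholds: 43200 s as in the notebook, 2 records instead of 100.
MAX_SECOND, MIN_RECORDS = 43200, 2
bad_ids = stats[(stats['max'] > MAX_SECOND) | (stats['count'] < MIN_RECORDS)].index

# Drop every row that belongs to a rejected trip.
kept = toy[~toy['bookingID'].isin(bad_ids)]
print(bad_ids.tolist(), kept['bookingID'].unique().tolist())
```

Booking 2 is rejected for its implausible timestamp and booking 3 for having too few records, so only booking 1 remains.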

<h4>Labels</h4>

In [12]:
labels.head()

Unnamed: 0,bookingID,label
0,111669149733,0
1,335007449205,1
2,171798691856,0
3,1520418422900,0
4,798863917116,0


In [13]:
filtered_labels = labels[labels['bookingID'].isin(filtered_telematics['bookingID'].unique())]
filtered_labels.shape[0]

19977

In [14]:
filtered_labels['bookingID'].unique().size

19959

Some bookings appear more than once in the labels. Keep one row per booking, preferring label 1 (dangerous) if any of its rows has it.

In [15]:
filtered_labels = filtered_labels.sort_values(by='label', ascending=False)
filtered_labels = filtered_labels.drop_duplicates(subset='bookingID', keep='first')
filtered_labels.shape[0]

19959
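For 0/1 labels, sorting and keeping the first row should give the same result as taking the maximum label per booking. A small sketch on invented data comparing the two approaches (the `groupby` variant is an alternative, not what the notebook uses):

```python
import pandas as pd

toy_labels = pd.DataFrame({
    'bookingID': [10, 11, 11, 12, 12],
    'label':     [0, 0, 1, 1, 1],
})

# Approach used above: put label 1 first, then keep the first row per booking.
by_sort = (toy_labels.sort_values(by='label', ascending=False)
                     .drop_duplicates(subset='bookingID', keep='first')
                     .sort_values(by='bookingID'))

# Alternative for binary labels: the maximum label per booking.
by_max = toy_labels.groupby('bookingID')['label'].max()

print(by_sort['label'].tolist(), by_max.tolist())
```

Booking 11 has conflicting labels and ends up as 1 under both approaches.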

<h3>Feature Extraction</h3>

Calculate magnitude of acceleration and gyro

In [16]:
filtered_telematics['acceleration'] = np.sqrt(filtered_telematics['acceleration_x']**2 \
                                              + filtered_telematics['acceleration_y']**2 \
                                              + filtered_telematics['acceleration_z']**2)
filtered_telematics['gyro'] = np.sqrt(\
                                      filtered_telematics['gyro_x']**2 \
                                      + filtered_telematics['gyro_y']**2 \
                                      + filtered_telematics['gyro_z']**2)
# filtered_telematics.head()
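The magnitude is the Euclidean norm of the three axis readings. As a sketch on toy values, the same quantity can also be computed with `np.linalg.norm` over the rows, which is an equivalent formulation rather than the notebook's own code:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'acceleration_x': [3.0, 0.0],
    'acceleration_y': [4.0, 9.81],
    'acceleration_z': [0.0, 0.0],
})

axes = ['acceleration_x', 'acceleration_y', 'acceleration_z']

# Euclidean norm across the three axes, one value per row.
toy['acceleration'] = np.linalg.norm(toy[axes].values, axis=1)

# Same result written out explicitly, as in the cell above.
explicit = np.sqrt((toy[axes] ** 2).sum(axis=1))
print(toy['acceleration'].tolist())
```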

Extract features:
- Speed (max, mean, IQR, max change)
- Acceleration (max, mean, IQR, max change)
- Gyro (max, mean, IQR, max change)
- Duration
- Distance
- Rotation

In [17]:
def iqr():
    def iqr_(x):
        return x.quantile(0.75) - x.quantile(0.25)
    iqr_.__name__ = 'iqr'
    return iqr_

# String aggregator names ('max', 'mean') give the same column labels on every pandas/NumPy version,
# whereas np.max is labelled 'amax' or 'max' depending on the version installed.
aggregated_telematics = filtered_telematics[['bookingID','Speed','acceleration','gyro','second']]\
    .groupby('bookingID')\
    .agg({'Speed': ['max', 'mean', iqr()],
          'acceleration': ['max', 'mean', iqr()],
          'gyro': ['max', 'mean', iqr()],
          'second': ['max']})
# aggregated_telematics.head()

In [18]:
df = pd.DataFrame()
df['speed_max'] = aggregated_telematics[('Speed','max')]
df['speed_mean'] = aggregated_telematics[('Speed','mean')]
df['speed_iqr'] = aggregated_telematics[('Speed','iqr')]

df['acceleration_max'] = aggregated_telematics[('acceleration','max')]
df['acceleration_mean'] = aggregated_telematics[('acceleration','mean')]
df['acceleration_iqr'] = aggregated_telematics[('acceleration','iqr')]

df['gyro_max'] = aggregated_telematics[('gyro','max')]
df['gyro_mean'] = aggregated_telematics[('gyro','mean')]
df['gyro_iqr'] = aggregated_telematics[('gyro','iqr')]

# Duration is approximated by the largest timestamp of the trip;
# distance and rotation are duration times the mean speed / mean gyro magnitude.
df['duration'] = aggregated_telematics[('second','max')]
df['distance'] = df['duration'] * df['speed_mean']
df['rotation'] = df['duration'] * df['gyro_mean']
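The IQR aggregator used in the aggregation above is the difference between the 75th and 25th percentiles of each trip's values. A minimal sketch on toy data, using a plain named function instead of the factory (pandas takes the column label from the function's `__name__`):

```python
import pandas as pd

def iqr(x):
    # Interquartile range: 75th percentile minus 25th percentile.
    return x.quantile(0.75) - x.quantile(0.25)

toy = pd.DataFrame({
    'bookingID': [1, 1, 1, 1, 1, 2, 2, 2],
    'Speed':     [0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 10.0, 10.0],
})

# One row per booking, one column per aggregator.
result = toy.groupby('bookingID')['Speed'].agg(['max', 'mean', iqr])
print(result)
```

Booking 1 has quartiles at 1.0 and 3.0, so its IQR is 2.0; booking 2 has constant speed, so its IQR is 0.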

In [19]:
sorted_labels = filtered_labels.sort_values(by='bookingID')
filtered_telematics = filtered_telematics.sort_values(by=['bookingID','second'])

def max_change(series):
    # Largest absolute difference between consecutive samples of one trip
    return series.diff().abs().max()

# One groupby pass per column instead of filtering the full frame once per booking;
# the result is indexed by bookingID and aligns with df's index on assignment.
grouped = filtered_telematics.groupby('bookingID')
df['speed_max_change'] = grouped['Speed'].apply(max_change)
df['acceleration_max_change'] = grouped['acceleration'].apply(max_change)
df['gyro_max_change'] = grouped['gyro'].apply(max_change)
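The "max change" features depend on the rows being ordered by `second` within each trip, because `diff()` compares each row with the one before it. A small sketch with invented readings shows how the result differs when the rows are out of order:

```python
import pandas as pd

# One toy trip whose rows arrive out of time order.
trip = pd.DataFrame({
    'second': [2, 0, 1],
    'Speed':  [9.0, 1.0, 4.0],
})

# Without sorting, diff() compares rows that are not adjacent in time.
unsorted_change = trip['Speed'].diff().abs().max()

# After sorting by timestamp, diff() gives the change between consecutive seconds.
sorted_change = trip.sort_values(by='second')['Speed'].diff().abs().max()

print(unsorted_change, sorted_change)
```

In time order the speeds are 1.0, 4.0, 9.0, so the largest step is 5.0; the unsorted rows would wrongly report 8.0.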

In [20]:
df['label'] = sorted_labels['label'].tolist()

In [21]:
df.head()

bookingID,speed_max,speed_mean,speed_iqr,acceleration_max,acceleration_mean,acceleration_iqr,gyro_max,gyro_mean,gyro_iqr,duration,distance,rotation,speed_max_change,acceleration_max_change,gyro_max_change,label
0,22.946083,9.004787,14.04531,12.988328,9.885882,0.514161,0.749086,0.10075,0.09647,1589.0,14308.606323,160.091793,6.581142,4.325513,0.687351,0
1,21.882141,8.019369,13.779108,12.790147,9.865608,0.50892,0.717864,0.065834,0.057357,1034.0,8292.027559,68.072162,4.188555,3.155147,0.410784,1
2,9.360483,3.157213,5.299983,13.40341,9.92959,0.254258,0.463685,0.097433,0.099728,825.0,2604.700695,80.382189,4.439833,4.343012,0.348618,1
4,19.780001,6.150996,8.0325,21.053265,9.813434,0.374268,0.661675,0.108875,0.085304,1094.0,6729.190006,119.109484,4.91,12.351788,0.574858,1
6,16.394695,4.628921,9.21706,14.498268,9.91809,0.531936,0.626294,0.089589,0.086697,1094.0,5064.040117,98.009978,3.744509,4.896011,0.431296,0


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19959 entries, 0 to 1709396983975
Data columns (total 16 columns):
speed_max                  19959 non-null float64
speed_mean                 19959 non-null float64
speed_iqr                  19959 non-null float64
acceleration_max           19959 non-null float64
acceleration_mean          19959 non-null float64
acceleration_iqr           19959 non-null float64
gyro_max                   19959 non-null float64
gyro_mean                  19959 non-null float64
gyro_iqr                   19959 non-null float64
duration                   19959 non-null float64
distance                   19959 non-null float64
rotation                   19959 non-null float64
speed_max_change           19959 non-null float64
acceleration_max_change    19959 non-null float64
gyro_max_change            19959 non-null float64
label                      19959 non-null int64
dtypes: float64(15), int64(1)
memory usage: 2.6 MB


In [23]:
df.to_csv(OUTPUT_FEATURES)
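Because `df` is indexed by `bookingID`, `to_csv` writes the booking ID as the first column of the output file. A later step could restore that index when reading the file back; the sketch below uses an in-memory buffer and toy values in place of the real file:

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {'speed_max': [22.9, 21.8], 'label': [0, 1]},
    index=pd.Index([0, 1], name='bookingID'),
)

# Write to an in-memory buffer; the named index becomes the first CSV column.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Read it back and restore bookingID as the index.
restored = pd.read_csv(buffer, index_col='bookingID')
print(restored)
```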