**Enter the path of the folder containing the feature files in `file_path`.**

**Download the model ("finalized_model.sav") and save it in the same folder.**

**Training AUC is 0.7012 using hyperparameter optimization.**

In [0]:
file_path = "../data/features"

In [0]:
%matplotlib inline
import pandas as pd
import numpy as np
from numpy import array
import glob


In [0]:
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score,  roc_curve, auc, confusion_matrix

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

import lightgbm as lgb
import pickle

Reading the ***Label*** file.

In [0]:
# df_label = pd.read_csv("/content/drive/My Drive/data/data/labels/part-00000-e9445087-aa0a-433b-a7f6-7f4c19d78ad6-c000.csv", dtype = {'bookingID': 'uint64', 'label' : 'category'})

Reading the ***Features*** file.

In [0]:
all_files = glob.glob(file_path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, dtype = {'bookingID':'uint64', 'Accuracy':'float32', 'Bearing':'float32', 
                                                                  'acceleration_x':'float32', 'acceleration_y':'float32',
                                                                  'acceleration_z':'float32', 'gyro_x':'float32', 
                                                                  'gyro_y':'float32', 'gyro_z':'float32', 'second':'float32', 
                                                                  'Speed':'float32'})
    li.append(df)

df_sort = pd.concat(li, axis=0, ignore_index=True)

Removing duplicate booking IDs from the label table.  
If the same booking ID is labelled both Rash and Non-Rash, it is treated as Rash.

In [0]:
# df_label = df_label.sort_values(['bookingID'])
# df_label = df_label.sort_values(['label'])
# df_label = df_label.drop_duplicates(subset='bookingID', keep='last')
# df_label = df_label.set_index('bookingID')
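
The rule above can be sketched on a toy label table. The data is made up for illustration, and the sketch assumes labels are encoded as 0 (Non-Rash) and 1 (Rash); taking the per-booking maximum is one possible way to apply the rule, not necessarily the exact method used for training.

```python
import pandas as pd

# Hypothetical label table: booking 2 appears with both labels.
toy_labels = pd.DataFrame({
    "bookingID": [1, 2, 2, 3],
    "label": [0, 0, 1, 1],
})

# One row per booking; a booking with conflicting labels resolves to 1 (Rash).
dedup_labels = toy_labels.groupby("bookingID")["label"].max()
print(dedup_labels.to_dict())
```

The result is indexed by `bookingID`, matching the `set_index('bookingID')` step in the commented-out cell.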

Data Massaging for the Features Table.

In [0]:
# Sort each booking's rows chronologically, then index the frame by bookingID.
df_sort = df_sort.sort_values(['bookingID', 'second'])
df_sort = df_sort.set_index('bookingID')

Combining the Feature and the Label tables

## Feature Engineering

Rash driving can be characterised as a combination of Speed and Movement, where Speed is the speed along the direction of the car and Movement is the angular change in the GPS Bearing.

Neither signal is sufficient on its own: high speed alone does not imply rash driving, and neither does a large angular change. A combination of both Speed and Movement is needed to classify a ride as Rash.

1. Movement: Change in the GPS Bearing from one second to the next, i.e. the angular change per second in the direction of the car.

2. Sign: Non-zero whenever the sign of Movement differs from the previous second (for example clockwise followed by anticlockwise, or the reverse); zero otherwise.

3. Abs_Movement: Absolute Movement per second, irrespective of direction (clockwise or anticlockwise).

4. Total_Movement: Sum of Abs_Movement over the duration of the booking.

5. Duration: Number of seconds recorded for the booking. Analysis showed that a higher proportion of longer bookings are classified as Rash.

6. Acceleration: Absolute change in Speed per second.

7. gyro_xyz: Total gyro movement, computed as the square root of the sum of squares of the gyro readings in the x, y and z directions. A high value indicates a bumpy ride.

8. Rolling_movement: Rolling average of Abs_Movement.

9. Rash_Movement: 1 if the rolling average of Movement is above the threshold for permissible movement, else 0.

10. Total_Movement_by_duration: Average Movement for the booking, calculated as Total_Movement divided by the Duration of the booking.

11. Rolling_Speed: Rolling average of Speed.

12. Average Speed: Average speed of the booking.

13. Median Speed: Median speed of the booking.

14. Rash_Speed: 1 if Rolling_Speed is above a certain threshold, else 0.

15. acc_movement: 1 if Acceleration is above a certain threshold, else 0.

16. gyro_movement: 1 if gyro_xyz is above a certain threshold, else 0.

17. sign_s: Instances where both Rash_Speed and a sudden change in direction occur.

18. a_m: Instances where both acc_movement and gyro_movement occur.

19. a_s: Instances where both acc_movement and Rash_Speed occur.

20. sign_m: Instances where both gyro_movement and a sudden change in direction occur.

21. a_s_m: Instances where acc_movement, Rash_Speed and gyro_movement all occur.

22. sign_a_s_m: Instances where a sudden change in direction occurs together with acc_movement, Rash_Speed and gyro_movement.
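
To make the first few per-second features concrete, here is a minimal sketch on a made-up four-second trip. The frame `toy` and its values are hypothetical; the column names follow the feature files, and the expressions mirror those in the `PreProcessing` class.

```python
import numpy as np
import pandas as pd

# Made-up telemetry for a single booking, one row per second.
toy = pd.DataFrame({
    "bookingID": [1, 1, 1, 1],
    "second": [0, 1, 2, 3],
    "Bearing": [10.0, 25.0, 5.0, 5.0],
    "Speed": [3.0, 5.0, 4.0, 4.0],
    "gyro_x": [0.0, 3.0, 0.0, 0.0],
    "gyro_y": [0.0, 4.0, 0.0, 0.0],
    "gyro_z": [0.0, 0.0, 1.0, 0.0],
})

# Movement: per-second change in Bearing within a booking (first row has no predecessor).
toy["Movement"] = toy.groupby("bookingID")["Bearing"].diff().fillna(0)
# sign: difference of the turning direction between consecutive rows.
# It is +/-2 on a clockwise/anticlockwise flip and +/-1 when turning starts or stops.
toy["sign"] = np.sign(toy["Movement"]).diff().fillna(0)
toy["Abs_Movement"] = toy["Movement"].abs()
# Acceleration: absolute per-second change in Speed.
toy["Acceleration"] = toy.groupby("bookingID")["Speed"].diff().fillna(0).abs()
# gyro_xyz: magnitude of the gyro vector.
toy["gyro_xyz"] = np.sqrt(toy["gyro_x"]**2 + toy["gyro_y"]**2 + toy["gyro_z"]**2)

print(toy[["Movement", "sign", "Abs_Movement", "Acceleration", "gyro_xyz"]])
```

In this toy trip the bearing swings from +15 to -20 degrees between seconds 1 and 2, so `sign` takes the value -2 at second 2.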

In [0]:
from sklearn.base import BaseEstimator, TransformerMixin

class PreProcessing(BaseEstimator, TransformerMixin):
  """Custom Pre-Processing estimator for our use-case
  """

  def __init__(self):
      pass
  
  def feature_creation(self, df):
    ## Movement Features
    df['Movement'] = df.groupby(['bookingID'])['Bearing'].diff().fillna(0)
    df['sign'] = df.Movement.map(np.sign).diff(periods=1).fillna(0)
    df['Abs_Movement'] = abs(df['Movement'])
    df['Delta_Movement'] = df.groupby(['bookingID'])['Abs_Movement'].cumsum()
    df['Total_Movement'] = df.groupby('bookingID')['Abs_Movement'].transform('sum')

    ## Time Features
    # Rows (seconds) per booking; counted on 'second' because bookingID is the index, not a column.
    df['duration'] = df.groupby('bookingID')['second'].transform('count')


    ## Speed Features
    df['Acceleration'] = abs(df.groupby(['bookingID'])['Speed'].diff().fillna(0))


    ## Gyro Features
    df['gyro_xyz'] = np.sqrt(df['gyro_x']**2 + df['gyro_y']**2 + df['gyro_z']**2)

    return df


  def feature_eng(self, df, period_of_movement = 2, thresh_gyro = 0.65, thresh_acc = 5, thresh_delta = 15, thresh_roll_speed = 20):

    ## Movement Features
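    # transform() returns a result aligned row-for-row with df, so the assignment is safe.
    df['Rolling_movement'] = df.groupby(['bookingID'])['Abs_Movement'].transform(lambda x: x.rolling(period_of_movement, min_periods=1).mean())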
    df['rash_movement'] = 0
    df.loc[df.Delta_Movement >= thresh_delta, 'rash_movement'] = 1

    ## Time Features
    df['Total_Movement_by_duration'] = df['Total_Movement']/df['duration']

    ## Speed Features
    df['Rolling_Speed'] = df.groupby(['bookingID'])['Speed'].transform(lambda x: x.rolling(period_of_movement, min_periods=1).mean())
    df['Avg_Speed'] = df.groupby(['bookingID'])['Rolling_Speed'].transform('mean')
    df['Median_Speed'] = df.groupby(['bookingID'])['Rolling_Speed'].transform('median')
    df['rash_speed'] = 0
    df.loc[df.Rolling_Speed >= thresh_roll_speed, 'rash_speed'] = 1

    df[['Rolling_Speed', 'Avg_Speed','Median_Speed',
              'Rolling_movement','Total_Movement_by_duration']] = df[['Rolling_Speed', 'Avg_Speed','Median_Speed',
                                                                            'Rolling_movement','Total_Movement_by_duration']].apply(lambda x: x.astype('float32'))

    df['acc_movement'] = 0
    df.loc[df.Acceleration >= thresh_acc, 'acc_movement'] = 1

    df['gyro_movement'] = 0
    df.loc[df.gyro_xyz >= thresh_gyro, 'gyro_movement'] = 1

    df['sign_s'] = df['sign'] * df['rash_speed']
    df['a_m'] = df['acc_movement'] * df['gyro_movement']
    df['a_s'] = df['acc_movement'] * df['rash_speed']
    df['sign_m'] = df['sign'] * df['gyro_movement']
    df['a_s_m'] = df['acc_movement'] * df['rash_speed'] * df['gyro_movement']

    df['sign_a_s_m'] = df['sign'] * df['a_s_m']
    
    return df


  def data_prep(self, df):

    p = df.drop(['Accuracy', 'Bearing','acceleration_x', 'acceleration_y',
                 'acceleration_z', 'gyro_x', 'gyro_y', 'gyro_z', 'second',
                 'Speed'], axis = 1)

    # Per-booking means of the continuous features (a list of columns needs double brackets).
    q = p.groupby('bookingID')[['Movement', 'Delta_Movement',
         'Rolling_movement', 'duration', 'Total_Movement',
         'Total_Movement_by_duration', 'Acceleration', 'Rolling_Speed',
         'Avg_Speed', 'Median_Speed']].mean().reset_index()

    # Per-booking sums of the flag columns and their interactions.
    flag_cols = ['sign', 'rash_movement', 'rash_speed', 'gyro_movement',
                 'sign_s', 'a_m', 'a_s', 'sign_m', 'a_s_m', 'sign_a_s_m']
    q = pd.merge(left = q,
             right = df.groupby('bookingID')[flag_cols].sum().reset_index(),
             left_on = 'bookingID',
             right_on = 'bookingID'
            )

#     q = pd.merge(left = q,
#                  right = df_label.reset_index(),
#                  left_on = 'bookingID',
#                  right_on='bookingID'
#                 )

    q['rm_freq_per_ride'] = np.round(np.log(q.rash_movement/q.duration + 1)*100, decimals = 2)
    q['rs_freq_per_ride'] = np.round(np.log(q.rash_speed/q.duration + 1)*100, decimals = 2)
    q['gm_freq_per_ride'] = np.round(np.log(q.gyro_movement/q.duration + 1)*1000 , decimals = 2)
    q['sign_freq_per_ride'] = np.round(np.log(q.sign/q.duration + 1)*100 , decimals = 2)


    #   q.duration = np.log(q.duration)
    #   q.gyro_movement = np.log(q.gyro_movement + 1)
    #   q.rash_movement = np.log(q.rash_movement + 1)
    #   q.rash_speed = np.log(q.rash_speed + 1)

    
    return q



  def transform(self, df):
      """Build the per-booking feature table from raw, time-sorted telemetry.
         Applies feature_creation, feature_eng and data_prep in order; the same
         steps are used for training, validation and test data.
      """
      
      df = self.feature_creation(df)
      
      q = self.feature_eng(df,
                period_of_movement = 2,
                thresh_acc = 1.3, 
                thresh_delta = 5, 
                thresh_roll_speed = 8
               )
      q = self.data_prep(q)
      
#       scaler = MinMaxScaler()
#       q = scaler.fit_transform(q)
      
      return q

  def fit(self, df, y=None, **fit_params):
      """No-op fit: nothing is learned from the training data, because all
         thresholds are fixed arguments of feature_eng.
      """
      return self
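
As an illustration of how a per-second flag becomes a per-ride feature, the sketch below follows one signal (Speed) through the rolling average, the threshold flag, and the log-scaled frequency. The frames `toy` and `per_ride` and the speed values are hypothetical; the window of 2 and the threshold of 8 are the values passed in `transform()`.

```python
import numpy as np
import pandas as pd

# Made-up speeds for two bookings, one row per second.
toy = pd.DataFrame({
    "bookingID": [1, 1, 1, 1, 2, 2],
    "Speed": [2.0, 10.0, 12.0, 4.0, 1.0, 3.0],
})

# Rolling mean over 2 seconds within each booking; min_periods=1 keeps the first row.
toy["Rolling_Speed"] = toy.groupby("bookingID")["Speed"].transform(
    lambda s: s.rolling(2, min_periods=1).mean()
)
# Flag seconds where the rolling speed reaches the threshold.
toy["rash_speed"] = (toy["Rolling_Speed"] >= 8).astype(int)

# Aggregate to one row per booking: number of flagged seconds and ride duration.
per_ride = toy.groupby("bookingID").agg(
    rash_speed=("rash_speed", "sum"),
    duration=("Speed", "count"),
).reset_index()

# Log-scaled share of flagged seconds, as in data_prep.
per_ride["rs_freq_per_ride"] = np.round(
    np.log(per_ride["rash_speed"] / per_ride["duration"] + 1) * 100, decimals=2
)
print(per_ride)
```

Booking 1 has 2 flagged seconds out of 4, giving `log(1.5) * 100`, roughly 40.55; booking 2 has none and scores 0.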

In [0]:
preprocess = PreProcessing()
X = preprocess.transform(df_sort)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
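
Note that `fit_transform` learns the minimum and maximum from whatever data it is given. If the saved model was trained on features scaled with training-set statistics (an assumption; the training-time scaling is not shown here), refitting the scaler on new data would map the same raw value to a different scaled value. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [10.0]])   # hypothetical training feature
new = np.array([[5.0], [20.0]])     # hypothetical data seen at prediction time

scaler = MinMaxScaler().fit(train)          # learn min/max from the training data only
reused = scaler.transform(new)              # approx. [[0.5], [2.0]]
refit = MinMaxScaler().fit_transform(new)   # approx. [[0.0], [1.0]]

print(reused.ravel(), refit.ravel())
```

One option, if the training scaler is available, is to persist it with `joblib.dump` next to the model and call only `transform` at prediction time.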

In [0]:
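import os

# Join with a path separator; plain concatenation would yield "../data/featuresfinalized_model.sav".
filename = os.path.join(file_path, 'finalized_model.sav')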

In [79]:
loaded_model = joblib.load(filename)
result = loaded_model.predict(X)
# print(result)


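# y must hold the ground-truth labels, one per row of X and in the same bookingID order.
# It is not defined in this notebook because the label-loading cells are commented out.
false_positive_rate, true_positive_rate, thresholds = roc_curve(y, result)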
roc_auc = auc(false_positive_rate, true_positive_rate)
print(roc_auc)

0.5067901147419026
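
`roc_curve` is designed for continuous scores; with hard 0/1 predictions the curve collapses to a single operating point and the AUC usually drops. Whether `loaded_model.predict` returns labels or probabilities depends on the model type stored in the `.sav` file, which is not shown here, so the following is only a sketch of the effect on made-up numbers, using scikit-learn's `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.45])  # hypothetical probabilities
hard = (scores >= 0.5).astype(int)                        # what a 0/1 predict() would return

auc_scores = roc_auc_score(y_true, scores)  # ranks every positive/negative pair
auc_hard = roc_auc_score(y_true, hard)      # ties within each class count as half

print(auc_scores, auc_hard)
```

Here the scores rank 8 of the 9 positive/negative pairs correctly, while the thresholded labels reduce the AUC to 2/3. If the loaded model is a scikit-learn classifier, `predict_proba(X)[:, 1]` would be the usual source of such scores.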


In [0]:
# X = q.drop('label', axis = 1)
# y = q.label.astype('int')

# X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, random_state = 42, stratify = y )




# d_train = lgb.Dataset(X_train, label=y_train)

# params = {}
# params['learning_rate'] = 0.003
# params['boosting_type'] = 'gbdt'
# params['objective'] = 'binary'
# params['metric'] = 'auc'
# params['sub_feature'] = 0.5
# params['num_leaves'] = 10
# params['min_data'] = 50
# params['max_depth'] = 10

# clf = lgb.train(params, d_train, 100)

# # print_auc(clf, X_test, y_test)

# predictions=clf.predict(X_test)

In [50]:
#   false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions)
#   roc_auc = auc(false_positive_rate, true_positive_rate)
#   print(roc_auc)

0.7012745925925925
