<a href="https://colab.research.google.com/github/blessondensil294/AV-Mobility-Analytics-Competition/blob/master/Janatha_Hack_Mobility_Analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# JanataHack: Mobility Analytics

https://datahack.analyticsvidhya.com/contest/janatahack-mobility-analytics

With the rise of cab aggregators and growing demand for mobility solutions, the past decade has seen immense growth in data collected from commercial vehicles, with major contributors such as Uber, Lyft and Ola.

Many innovative data science and machine learning solutions are being built on such data, and they have created tremendous business value for these organizations.

This weekend we bring to you another JanataHack, this time relating to mobility business. Participate, compete and earn bragging rights against the best hackers globally.

Welcome to Sigma Cab Private Limited, a cab aggregator service. Customers can download the app on their smartphones and book a cab from anywhere in the cities where the company operates. Sigma Cabs, in turn, searches for cabs from various service providers and offers the client the best of the available options. The company has been in operation for a little less than a year. During this period, it has captured surge_pricing_type from the service providers.

You have been hired by Sigma Cabs as a Data Scientist and have been asked to build a predictive model, which could help them in predicti

Data Dictionary
Variable - Definition

Trip_ID - ID for the trip (cannot be used for modelling)

Trip_Distance - The distance for the trip requested by the customer

Type_of_Cab - Category of the cab requested by the customer

Customer_Since_Months - Customer using cab services since n months; 0 month means current month

Life_Style_Index - Proprietary index created by Sigma Cabs showing lifestyle of the customer based on their behaviour

Confidence_Life_Style_Index - Category showing confidence in the index mentioned above

Destination_Type - Sigma Cabs classifies every destination into one of 14 categories

Customer_Rating - Average of the customer's lifetime ratings to date

Cancellation_Last_1Month - Number of trips cancelled by the customer in the last month

Var1, Var2 and Var3 - Continuous variables masked by the company. Can be used for modelling purposes

Gender - Gender of the customer

Surge_Pricing_Type - Target variable; can be one of 3 types

## Load the Data

In [0]:
import numpy as np
import pandas as pd
df_Test_url = 'https://raw.githubusercontent.com/blessondensil294/AV-Mobility-Analytics-Competition/master/Data/test_VsU9xXK.csv'
df_Train_url = 'https://raw.githubusercontent.com/blessondensil294/AV-Mobility-Analytics-Competition/master/Data/train_Wc8LBpr.csv'
df_Train = pd.read_csv(df_Train_url)
df_Test = pd.read_csv(df_Test_url)
#Drop the Trip ID Columns
df_Train.drop(["Trip_ID"],axis = 1,inplace=True)

## Exploratory Data Analysis

In [0]:
df_Train.columns

Index(['Trip_Distance', 'Type_of_Cab', 'Customer_Since_Months',
       'Life_Style_Index', 'Confidence_Life_Style_Index', 'Destination_Type',
       'Customer_Rating', 'Cancellation_Last_1Month', 'Var1', 'Var2', 'Var3',
       'Gender', 'Surge_Pricing_Type'],
      dtype='object')

In [0]:
df_Train.head()

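| | Trip_Distance | Type_of_Cab | Customer_Since_Months | Life_Style_Index | Confidence_Life_Style_Index | Destination_Type | Customer_Rating | Cancellation_Last_1Month | Var1 | Var2 | Var3 | Gender | Surge_Pricing_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.77 | B | 1.0 | 2.42769 | A | A | 3.905 | 0 | 40.0 | 46 | 60 | Female | 2 |
| 1 | 29.47 | B | 10.0 | 2.78245 | B | A | 3.45 | 0 | 38.0 | 56 | 78 | Male | 2 |
| 2 | 41.58 | NaN | 10.0 | NaN | NaN | E | 3.50125 | 2 | NaN | 56 | 77 | Male | 2 |
| 3 | 61.56 | C | 10.0 | NaN | NaN | A | 3.45375 | 0 | NaN | 52 | 74 | Male | 3 |
| 4 | 54.95 | C | 10.0 | 3.03453 | B | A | 3.4025 | 4 | 51.0 | 49 | 102 | Male | 2 |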


In [0]:
df_Train.describe()

| | Trip_Distance | Customer_Since_Months | Life_Style_Index | Customer_Rating | Cancellation_Last_1Month | Var1 | Var2 | Var3 | Surge_Pricing_Type |
|---|---|---|---|---|---|---|---|---|---|
| count | 131662.0 | 125742.0 | 111469.0 | 131662.0 | 131662.0 | 60632.0 | 131662.0 | 131662.0 | 131662.0 |
| mean | 44.200909 | 6.016661 | 2.802064 | 2.849458 | 0.782838 | 64.202698 | 51.2028 | 75.099019 | 2.155747 |
| std | 25.522882 | 3.626887 | 0.225796 | 0.980675 | 1.037559 | 21.820447 | 4.986142 | 11.578278 | 0.738164 |
| min | 0.31 | 0.0 | 1.59638 | 0.00125 | 0.0 | 30.0 | 40.0 | 52.0 | 1.0 |
| 25% | 24.58 | 3.0 | 2.65473 | 2.1525 | 0.0 | 46.0 | 48.0 | 67.0 | 2.0 |
| 50% | 38.2 | 6.0 | 2.79805 | 2.895 | 0.0 | 61.0 | 50.0 | 74.0 | 2.0 |
| 75% | 60.73 | 10.0 | 2.94678 | 3.5825 | 1.0 | 80.0 | 54.0 | 82.0 | 3.0 |
| max | 109.23 | 10.0 | 4.87511 | 5.0 | 8.0 | 210.0 | 124.0 | 206.0 | 3.0 |


In [0]:
df_Train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131662 entries, 0 to 131661
Data columns (total 14 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Trip_ID                      131662 non-null  object 
 1   Trip_Distance                131662 non-null  float64
 2   Type_of_Cab                  111452 non-null  object 
 3   Customer_Since_Months        125742 non-null  float64
 4   Life_Style_Index             111469 non-null  float64
 5   Confidence_Life_Style_Index  111469 non-null  object 
 6   Destination_Type             131662 non-null  object 
 7   Customer_Rating              131662 non-null  float64
 8   Cancellation_Last_1Month     131662 non-null  int64  
 9   Var1                         60632 non-null   float64
 10  Var2                         131662 non-null  int64  
 11  Var3                         131662 non-null  int64  
 12  Gender                       131662 non-null  object 
 13 

In [0]:
#Shape of the Train Data
df_Train.shape

(131662, 14)

In [0]:
#Missing Values
df_Train.isnull().sum()

Trip_ID                            0
Trip_Distance                      0
Type_of_Cab                    20210
Customer_Since_Months           5920
Life_Style_Index               20193
Confidence_Life_Style_Index    20193
Destination_Type                   0
Customer_Rating                    0
Cancellation_Last_1Month           0
Var1                           71030
Var2                               0
Var3                               0
Gender                             0
Surge_Pricing_Type                 0
dtype: int64
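
The raw counts above are easier to compare as percentages of the 131,662 rows: `Var1` is missing in roughly 54% of rows, and `Type_of_Cab` in roughly 15%. A minimal sketch of that calculation follows; the `toy` frame is a hypothetical stand-in for `df_Train`, and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_Train (illustrative values only).
toy = pd.DataFrame({
    "Trip_Distance": [6.77, 29.47, 41.58, 61.56],
    "Type_of_Cab": ["B", "B", None, "C"],
    "Var1": [40.0, 38.0, np.nan, np.nan],
})

# isnull().mean() is the fraction of missing rows per column;
# multiply by 100 and sort so the worst columns come first.
missing_pct = (toy.isnull().mean() * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```

On the real data, the same expression with `df_Train` in place of `toy` should rank `Var1` first.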

## Feature Engineering

### Remove Duplicate Rows

In [0]:
df_Train.shape

(131662, 13)

In [0]:
df_Train.drop_duplicates(keep='first', inplace=True)

In [0]:
df_Train.shape

(131662, 13)
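
The shape is (131662, 13) both before and after the call, so no fully duplicated rows were found. `drop_duplicates` compares whole rows, not columns. A small sketch of how the number of duplicates could be checked before dropping them; the `toy` frame is a hypothetical example, not the competition data.

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first.
toy = pd.DataFrame({
    "Trip_Distance": [6.77, 29.47, 6.77],
    "Gender": ["Female", "Male", "Female"],
})

# duplicated() flags every row that repeats an earlier row.
n_dupes = int(toy.duplicated(keep="first").sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates(keep="first")
print(n_dupes, deduped.shape)
```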

## PyCaret Modeling

In [0]:
#Missing Values
df_Train.isnull().sum()

Trip_Distance                      0
Type_of_Cab                    20210
Customer_Since_Months           5920
Life_Style_Index               20193
Confidence_Life_Style_Index    20193
Destination_Type                   0
Customer_Rating                    0
Cancellation_Last_1Month           0
Var1                           71030
Var2                               0
Var3                               0
Gender                             0
Surge_Pricing_Type                 0
dtype: int64

In [0]:
pip install pycaret

Collecting pycaret
  Downloading https://files.pythonhosted.org/packages/c7/41/f7fa05b6ce3cb3096a35fb5ac6dc0f2bb23e8304f068618fb2501be0a562/pycaret-1.0.0-py3-none-any.whl (188kB)
Collecting yellowbrick==1.0.1
  Downloading https://files.pythonhosted.org/packages/d1/cf/6d6ab47c0759d246262f9bdb53e89be3814bf1774bc51fffff995f5859f9/yellowbrick-1.0.1-py3-none-any.whl (378kB)
Collecting DateTime==4.3
  Downloading https://files.pythonhosted.org/packages/73/22/a5297f3a1f92468cc737f8ce7ba6e5f245fcfafeae810ba37bd1039ea01c/DateTime-4.3-py2.py3-none-any.whl (60kB)
Collecting awscli
  Downloading https://files.pythonhosted.org/packages/db/79/d35f7a279ca09ede8b219549fee3c3195fef3d242d089aa8b2a678c8dc06/awscli-1.18.61-py2.py3-none-any.whl (3.0MB)


In [0]:
from pycaret.classification import *
clf1 = setup(data = df_Train, target = 'Surge_Pricing_Type', numeric_imputation='mean', categorical_imputation='mode',
             train_size=0.9, normalize=True, normalize_method='robust', remove_outliers=True)

 
Setup Succesfully Completed!


| | Description | Value |
|---|---|---|
| 0 | session_id | 7256 |
| 1 | Target Type | Multiclass |
| 2 | Label Encoded | |
| 3 | Original Data | (131662, 13) |
| 4 | Missing Values | True |
| 5 | Numeric Features | 6 |
| 6 | Categorical Features | 6 |
| 7 | Ordinal Features | False |
| 8 | High Cardinality Features | False |
| 9 | High Cardinality Method | |
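
The `setup` call above asks for mean imputation of numeric columns, mode imputation of categorical columns, and robust normalization. The sketch below illustrates those three ideas in plain pandas. It is an approximation for intuition, not PyCaret's internal implementation, and it assumes that "robust" scaling means centring on the median and dividing by the interquartile range. The `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical gap.
toy = pd.DataFrame({
    "Var1": [40.0, 38.0, np.nan, 51.0, 61.0],
    "Type_of_Cab": ["B", "B", None, "C", "B"],
})

# Mean imputation: fill numeric gaps with the column mean.
var1 = toy["Var1"].fillna(toy["Var1"].mean())

# Mode imputation: fill categorical gaps with the most frequent value.
cab = toy["Type_of_Cab"].fillna(toy["Type_of_Cab"].mode().iloc[0])

# Robust scaling (assumed definition): (x - median) / IQR,
# which is less sensitive to outliers than mean/std scaling.
q1, q3 = var1.quantile(0.25), var1.quantile(0.75)
var1_scaled = (var1 - var1.median()) / (q3 - q1)
print(var1_scaled.round(3).tolist(), cab.tolist())
```

With more than half of `Var1` missing in the real data, mean imputation replaces most of that column with a single constant, which may be worth keeping in mind when reading the model scores.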


In [0]:
clf1

(        Trip_Distance  ...  Gender_Male
 0           -0.869433  ...          0.0
 1           -0.241494  ...          1.0
 2            0.093499  ...          1.0
 3            0.646196  ...          1.0
 4            0.463347  ...          1.0
 ...               ...  ...          ...
 131657      -0.732503  ...          1.0
 131658       1.012725  ...          1.0
 131659       0.054495  ...          0.0
 131660       0.240111  ...          1.0
 131661      -0.172614  ...          1.0
 
 [125078 rows x 39 columns], 0         2
 1         2
 2         2
 3         3
 4         2
          ..
 131657    3
 131658    2
 131659    2
 131660    2
 131661    1
 Name: Surge_Pricing_Type, Length: 125078, dtype: int64,         Trip_Distance  ...  Gender_Male
 24105        0.726418  ...          0.0
 17081       -0.471923  ...          1.0
 60495       -0.410788  ...          1.0
 8767        -0.337206  ...          1.0
 128582      -0.678285  ...          1.0
 ...               ...  ...      

Compare the models based on the score

In [0]:
compare_models()

IntProgress(value=0, description='Processing: ', max=170)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.6814 | 0.0 | 0.6501 | 0.6949 | 0.6753 | 0.487 |


KeyboardInterrupt: ignored

Creating the CatBoost Model

In [0]:
catboost_cl = create_model('catboost')

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|
| 0 | 0.6925 | 0.0 | 0.66 | 0.7071 | 0.6875 | 0.5044 |
| 1 | 0.6965 | 0.0 | 0.6652 | 0.7108 | 0.6919 | 0.5112 |
| 2 | 0.6929 | 0.0 | 0.6595 | 0.7101 | 0.6883 | 0.504 |
| 3 | 0.6942 | 0.0 | 0.6621 | 0.7091 | 0.6893 | 0.5073 |
| 4 | 0.6918 | 0.0 | 0.6587 | 0.7055 | 0.687 | 0.5034 |
| 5 | 0.694 | 0.0 | 0.6613 | 0.7097 | 0.689 | 0.5066 |
| 6 | 0.6987 | 0.0 | 0.6659 | 0.7131 | 0.6939 | 0.5144 |
| 7 | 0.6943 | 0.0 | 0.6621 | 0.7076 | 0.6897 | 0.5077 |
| 8 | 0.6923 | 0.0 | 0.6604 | 0.7069 | 0.6876 | 0.5039 |
| 9 | 0.6918 | 0.0 | 0.6574 | 0.7089 | 0.6863 | 0.5022 |


In [0]:
catboost_cl.get_params

<bound method BaseEstimator.get_params of OneVsRestClassifier(estimator=<catboost.core.CatBoostClassifier object at 0x7f821bff5668>,
                    n_jobs=None)>

In [0]:
catboost_cl.feature_importance_

AttributeError: ignored
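
The `get_params` output above shows that `catboost_cl` is a `OneVsRestClassifier` wrapping the CatBoost estimator, which is the likely reason the attribute lookup fails: scikit-learn-style estimators expose `feature_importances_` (plural), and the one-vs-rest wrapper does not forward it. One possible workaround is to read the importances from each fitted per-class estimator in `estimators_` and average them. The sketch below shows the pattern on synthetic data, with a `DecisionTreeClassifier` as a stand-in for CatBoost so that it runs without the notebook's environment; whether the same attribute is available on the wrapped CatBoost models in this PyCaret version is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (stand-in for the cab dataset).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3,
    n_redundant=0, n_classes=3, random_state=0,
)

# Stand-in base estimator; the notebook's model wraps CatBoost instead.
ovr = OneVsRestClassifier(DecisionTreeClassifier(max_depth=3, random_state=0))
ovr.fit(X, y)

# One fitted binary estimator per class lives in ovr.estimators_.
per_class = np.array([est.feature_importances_ for est in ovr.estimators_])

# Average across the per-class models for a single ranking.
mean_importance = per_class.mean(axis=0)
print(per_class.shape, mean_importance.round(3))
```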

Tuning the CatBoost Model

In [0]:
tuned_catboost_cl = tune_model('catboost')

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa |
|---|---|---|---|---|---|---|
| 0 | 0.6909 | 0.0 | 0.6582 | 0.7053 | 0.6857 | 0.5018 |
| 1 | 0.697 | 0.0 | 0.6647 | 0.7128 | 0.692 | 0.5115 |
| 2 | 0.6941 | 0.0 | 0.6602 | 0.7119 | 0.6893 | 0.5057 |
| 3 | 0.6949 | 0.0 | 0.6624 | 0.7107 | 0.6898 | 0.5081 |
| 4 | 0.6915 | 0.0 | 0.6578 | 0.7056 | 0.6864 | 0.5025 |
| 5 | 0.694 | 0.0 | 0.661 | 0.7103 | 0.6889 | 0.5063 |
| 6 | 0.6983 | 0.0 | 0.6651 | 0.7132 | 0.6933 | 0.5136 |
| 7 | 0.6949 | 0.0 | 0.6623 | 0.7085 | 0.69 | 0.5084 |
| 8 | 0.6925 | 0.0 | 0.6603 | 0.7072 | 0.6876 | 0.5042 |
| 9 | 0.6906 | 0.0 | 0.6565 | 0.7075 | 0.6852 | 0.5003 |
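
To judge whether tuning helped, the ten fold accuracies of the base and tuned models can be summarized by their mean and standard deviation. The sketch below copies the accuracy columns from the two fold tables in this notebook; nothing else is assumed.

```python
from statistics import mean, stdev

# Fold accuracies copied from the create_model output.
base_acc = [0.6925, 0.6965, 0.6929, 0.6942, 0.6918,
            0.6940, 0.6987, 0.6943, 0.6923, 0.6918]

# Fold accuracies copied from the tune_model output.
tuned_acc = [0.6909, 0.6970, 0.6941, 0.6949, 0.6915,
             0.6940, 0.6983, 0.6949, 0.6925, 0.6906]

base_mean, tuned_mean = mean(base_acc), mean(tuned_acc)
print(f"base : mean={base_mean:.4f} sd={stdev(base_acc):.4f}")
print(f"tuned: mean={tuned_mean:.4f} sd={stdev(tuned_acc):.4f}")
print(f"difference (tuned - base): {tuned_mean - base_mean:+.5f}")
```

Both means come out at about 0.6939, so on these folds the tuned model is not measurably better than the default one.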


Ensembling the CatBoost Model

In [0]:
# ensembling the tuned CatBoost model
catboost_bagged = ensemble_model(tuned_catboost_cl)

Plotting the CatBoost Model

In [0]:
# create a model
#catboost_cl = create_model('catboost')
# AUC plot
plot_model(catboost_cl, plot = 'auc')
# Decision Boundary
plot_model(catboost_cl, plot = 'boundary')
# Precision Recall Curve
plot_model(catboost_cl, plot = 'pr')
# Validation Curve
plot_model(catboost_cl, plot = 'vc')

SystemExit: ignored

Evaluate the Model

In [0]:
evaluate_model(catboost_cl)

Interpret the Model

In [0]:
# summary plot
interpret_model(catboost_cl)
# correlation plot
interpret_model(catboost_cl, plot = 'correlation')

Predict with the Model

In [0]:
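# Features and target from the training data (kept for reference; predict_model below does not use them)
x = df_Train.drop(['Surge_Pricing_Type'], axis=1)
y = df_Train['Surge_Pricing_Type']
# Test features: drop the identifier, which cannot be used for modelling
x_pred = df_Test.drop(['Trip_ID'], axis=1)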

In [0]:
y_pred = predict_model(tuned_catboost_cl, data = x_pred)

In [0]:
submission_df = pd.DataFrame({'Trip_ID':df_Test['Trip_ID'], 'Surge_Pricing_Type':y_pred['Label']})
submission_df.to_csv('Sample Submission_Pycaret_catboost_tuned.csv', index=False)
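
Before uploading, it may be worth checking that the submission has one row per test trip, unique IDs, and only the three valid surge classes (1, 2 and 3, per the data summary above). A minimal sketch of such checks follows; `test_ids` and `preds` are hypothetical stand-ins for `df_Test['Trip_ID']` and the `predict_model` output.

```python
import pandas as pd

# Hypothetical stand-ins for df_Test['Trip_ID'] and y_pred (illustrative values).
test_ids = pd.Series(["T001", "T002", "T003"], name="Trip_ID")
preds = pd.DataFrame({"Label": [2, 3, 1]})

# Both inputs share the same default index, so rows line up one to one.
submission = pd.DataFrame({
    "Trip_ID": test_ids,
    "Surge_Pricing_Type": preds["Label"],
})

# Sanity checks: row count, unique IDs, valid class labels, no gaps.
checks = {
    "row_count_matches": len(submission) == len(test_ids),
    "ids_unique": bool(submission["Trip_ID"].is_unique),
    "labels_valid": bool(submission["Surge_Pricing_Type"].isin([1, 2, 3]).all()),
    "no_missing": not bool(submission.isnull().any().any()),
}
print(checks)
```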