# Problem Statement - Mobility Analytics

Welcome to Sigma Cab Private Limited, a cab aggregator service. Customers can download the app on their smartphones and book a cab from anywhere in the cities where the company operates. Sigma Cabs, in turn, searches for cabs from various service providers and offers the client the best of the available options. The company has been in operation for a little less than a year. During this period, it has captured the surge_pricing_type from the service providers.

You have been hired by Sigma Cabs as a Data Scientist and asked to build a predictive model that helps them predict the surge_pricing_type proactively. This would, in turn, help them match the right cabs with the right customers quickly and efficiently.

In [1]:
import pandas as pd
import numpy as np

In [2]:
train = pd.read_csv('train_Wc8LBpr.csv')
test = pd.read_csv('test_VsU9xXK.csv')

In [3]:
train.shape, test.shape

((131662, 14), (87395, 13))

In [4]:
train.head()

Unnamed: 0,Trip_ID,Trip_Distance,Type_of_Cab,Customer_Since_Months,Life_Style_Index,Confidence_Life_Style_Index,Destination_Type,Customer_Rating,Cancellation_Last_1Month,Var1,Var2,Var3,Gender,Surge_Pricing_Type
0,T0005689460,6.77,B,1.0,2.42769,A,A,3.905,0,40.0,46,60,Female,2
1,T0005689461,29.47,B,10.0,2.78245,B,A,3.45,0,38.0,56,78,Male,2
2,T0005689464,41.58,,10.0,,,E,3.50125,2,,56,77,Male,2
3,T0005689465,61.56,C,10.0,,,A,3.45375,0,,52,74,Male,3
4,T0005689467,54.95,C,10.0,3.03453,B,A,3.4025,4,51.0,49,102,Male,2


In [5]:
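# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
# sort=True keeps the columns in alphabetical order.
combine = pd.concat([train, test], sort=True)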
combine.shape

(219057, 14)

In [6]:
combine.isnull().sum()

Cancellation_Last_1Month            0
Confidence_Life_Style_Index     33520
Customer_Rating                     0
Customer_Since_Months            9886
Destination_Type                    0
Gender                              0
Life_Style_Index                33520
Surge_Pricing_Type              87395
Trip_Distance                       0
Trip_ID                             0
Type_of_Cab                     33368
Var1                           117819
Var2                                0
Var3                                0
dtype: int64

In [7]:
combine.dtypes

Cancellation_Last_1Month         int64
Confidence_Life_Style_Index     object
Customer_Rating                float64
Customer_Since_Months          float64
Destination_Type                object
Gender                          object
Life_Style_Index               float64
Surge_Pricing_Type             float64
Trip_Distance                  float64
Trip_ID                         object
Type_of_Cab                     object
Var1                           float64
Var2                             int64
Var3                             int64
dtype: object

In [8]:
combine.columns

Index(['Cancellation_Last_1Month', 'Confidence_Life_Style_Index',
       'Customer_Rating', 'Customer_Since_Months', 'Destination_Type',
       'Gender', 'Life_Style_Index', 'Surge_Pricing_Type', 'Trip_Distance',
       'Trip_ID', 'Type_of_Cab', 'Var1', 'Var2', 'Var3'],
      dtype='object')

In [9]:
combine['Cancellation_Last_1Month'].value_counts()

0    114212
1     61297
2     27077
3     11875
4      3053
5      1081
6       432
7        22
8         8
Name: Cancellation_Last_1Month, dtype: int64

In [10]:
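# With right=False the bins are [0,1), [1,2), [2,3) and [3,8), so the last
# label covers three or more cancellations. A count of exactly 8 falls
# outside every bin and becomes NaN.
bins = [0, 1, 2, 3, 8]
labels = ['None', 'Once', 'Twice', 'More_Than_Thrice']
combine['Cancellation_Last_1Month'] = pd.cut(combine['Cancellation_Last_1Month'], bins=bins, labels=labels, right=False)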
combine['Cancellation_Last_1Month'].value_counts()

None                114212
Once                 61297
Twice                27077
More_Than_Thrice     16463
Name: Cancellation_Last_1Month, dtype: int64
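The binned counts above sum to 219049, which is 8 fewer than the 219057 rows: the 8 rows with exactly 8 cancellations are not assigned to any bin. A minimal sketch of one way to keep them, assuming an open-ended last bin is acceptable. The toy `counts` series and the label `Thrice_or_More` are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy cancellation counts, including the maximum value 8 seen in the data.
counts = pd.Series([0, 1, 2, 3, 4, 7, 8])

# Binning as in the notebook: the last interval is [3, 8), so 8 is dropped.
original = pd.cut(counts, bins=[0, 1, 2, 3, 8],
                  labels=['None', 'Once', 'Twice', 'More_Than_Thrice'],
                  right=False)

# Open-ended last bin [3, inf): every count of 3 or more gets a label.
# 'Thrice_or_More' is a hypothetical label name chosen for this sketch.
bins = [0, 1, 2, 3, np.inf]
labels = ['None', 'Once', 'Twice', 'Thrice_or_More']
binned = pd.cut(counts, bins=bins, labels=labels, right=False)

print(original.isnull().sum(), binned.isnull().sum())
```

Renaming the label would also rename the corresponding dummy column later on, so the notebook's own label could be kept if downstream names matter.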

In [11]:
combine['Confidence_Life_Style_Index'].value_counts()

B    67265
C    59736
A    58536
Name: Confidence_Life_Style_Index, dtype: int64

In [12]:
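# Assign the result back; calling fillna(inplace=True) on a selected column is
# deprecated in recent pandas and may not modify the frame.
combine['Confidence_Life_Style_Index'] = combine['Confidence_Life_Style_Index'].fillna('Unknown')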
combine['Confidence_Life_Style_Index'].value_counts()

B          67265
C          59736
A          58536
Unknown    33520
Name: Confidence_Life_Style_Index, dtype: int64

In [13]:
combine['Customer_Rating'].describe()

count    219057.000000
mean          2.848632
std           0.981100
min           0.001250
25%           2.152500
50%           2.895000
75%           3.581250
max           5.000000
Name: Customer_Rating, dtype: float64

In [14]:
combine['Customer_Since_Months'].value_counts()

10.0    70817
2.0     19445
3.0     17074
0.0     16885
5.0     14405
1.0     13965
4.0     13035
7.0     12332
6.0     12279
8.0     10525
9.0      8409
Name: Customer_Since_Months, dtype: int64

In [15]:
from sklearn.preprocessing import scale
# Missing tenure is encoded as -1 before standardizing.
combine['Customer_Since_Months'] = combine['Customer_Since_Months'].fillna(-1)
combine['Customer_Since_Months'] = scale(combine['Customer_Since_Months'])
combine['Customer_Since_Months'].describe()

count    2.190570e+05
mean     1.662343e-15
std      1.000002e+00
min     -1.746288e+00
25%     -9.631837e-01
50%      8.095576e-02
75%      1.125095e+00
max      1.125095e+00
Name: Customer_Since_Months, dtype: float64
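The standard deviation reported above is 1.000002 rather than exactly 1. A short sketch of the likely reason: `sklearn.preprocessing.scale` divides by the population standard deviation (`ddof=0`), while pandas `describe()` reports the sample standard deviation (`ddof=1`). The toy series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy tenure values; -1 stands in for a filled missing value.
x = pd.Series([1.0, 10.0, -1.0, 10.0, 3.0])

# Standardize the way sklearn's scale does: population std (ddof=0).
z = (x - x.mean()) / x.std(ddof=0)
n = len(z)

# pandas' default std uses ddof=1, which is larger by sqrt(n / (n - 1)).
print(z.std(ddof=0), z.std(), np.sqrt(n / (n - 1)))

# For the 219057 rows in this notebook the same factor is:
factor_full = np.sqrt(219057 / 219056)
print(factor_full)
```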

In [16]:
combine['Destination_Type'].value_counts()

A    129010
B     49193
C     12397
D     11085
E      4549
F      3222
G      2513
H      2124
I      1334
J      1166
K      1102
L      1052
M       160
N       150
Name: Destination_Type, dtype: int64

In [17]:
combine['Gender'].value_counts()

Male      156128
Female     62929
Name: Gender, dtype: int64

In [18]:
combine['Life_Style_Index'].describe()

count    185537.000000
mean          2.802594
std           0.226323
min           1.317850
25%           2.654620
50%           2.798280
75%           2.947650
max           4.875110
Name: Life_Style_Index, dtype: float64

In [19]:
# Mean imputation; assign the result back instead of using inplace=True.
combine['Life_Style_Index'] = combine['Life_Style_Index'].fillna(combine['Life_Style_Index'].mean())
combine['Life_Style_Index'].describe()

count    219057.000000
mean          2.802594
std           0.208288
min           1.317850
25%           2.688050
50%           2.802594
75%           2.913910
max           4.875110
Name: Life_Style_Index, dtype: float64

In [20]:
combine['Trip_Distance'].describe()

count    219057.000000
mean         44.158725
std          25.507368
min           0.310000
25%          24.560000
50%          38.140000
75%          60.720000
max         109.230000
Name: Trip_Distance, dtype: float64

In [21]:
combine['Trip_Distance'] = np.log(combine['Trip_Distance'])
combine['Trip_Distance'].describe()

count    219057.000000
mean          3.594550
std           0.671321
min          -1.171183
25%           3.201119
50%           3.641264
75%           4.106273
max           4.693456
Name: Trip_Distance, dtype: float64
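The minimum above is negative because the shortest trip (0.31) is below 1, and the logarithm of a value below 1 is negative. A minimal sketch comparing `np.log` with `np.log1p`, which stays non-negative for non-negative inputs. The four distances are the minimum, the first training row, the median and the maximum reported above; using `log1p` is an alternative offered for comparison, not what the notebook does.

```python
import numpy as np

# min, first training row, median, and max of Trip_Distance from the outputs above
distances = np.array([0.31, 6.77, 38.14, 109.23])

log_plain = np.log(distances)      # what the notebook applies
log_shifted = np.log1p(distances)  # log(1 + x), non-negative for x >= 0

print(log_plain)
print(log_shifted)
```

Tree-based models such as LightGBM split on feature order only, so either monotonic transform should lead to the same splits; the choice matters more for linear or distance-based models.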

In [22]:
combine['Type_of_Cab'].value_counts()

B    51585
C    46732
A    35878
D    31885
E    19609
Name: Type_of_Cab, dtype: int64

In [23]:
# Treat a missing cab type as its own category.
combine['Type_of_Cab'] = combine['Type_of_Cab'].fillna('Unknown')
combine['Type_of_Cab'].value_counts()

B          51585
C          46732
A          35878
Unknown    33368
D          31885
E          19609
Name: Type_of_Cab, dtype: int64

In [24]:
combine['Var1'].describe()

count    101238.000000
mean         64.095972
std          21.747037
min          30.000000
25%          46.000000
50%          61.000000
75%          79.000000
max         210.000000
Name: Var1, dtype: float64

In [25]:
# Mean imputation; assign the result back instead of using inplace=True.
combine['Var1'] = combine['Var1'].fillna(combine['Var1'].mean())
combine['Var1'] = np.log(combine['Var1'])
combine['Var1'].describe()

count    219057.000000
mean          4.133547
std           0.235032
min           3.401197
25%           4.158883
50%           4.160382
75%           4.160382
max           5.347108
Name: Var1, dtype: float64
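More than half of Var1 is missing (117819 of 219057 rows), so after mean imputation the median and upper quartile coincide at log(64.096) ≈ 4.160382, with the lower quartile just below. One common way to preserve the information that a value was missing is an indicator column. The sketch below assumes median imputation and a hypothetical column name `Var1_missing`; the toy values are the Var1 entries of the first five training rows.

```python
import numpy as np
import pandas as pd

# Var1 values from train.head(): two of the five are missing.
toy = pd.DataFrame({'Var1': [40.0, 38.0, np.nan, np.nan, 51.0]})

# Flag missing rows before imputing, so the model can still tell them apart.
toy['Var1_missing'] = toy['Var1'].isnull().astype(int)

# Impute with the median of the observed values (40.0 for this toy data).
toy['Var1'] = toy['Var1'].fillna(toy['Var1'].median())

print(toy)
```

LightGBM can also handle NaN inputs natively, so leaving Var1 unimputed is another option worth comparing.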

In [26]:
combine['Var2'].describe()

count    219057.000000
mean         51.186586
std           4.974497
min          40.000000
25%          48.000000
50%          50.000000
75%          54.000000
max         124.000000
Name: Var2, dtype: float64

In [27]:
combine['Var2'] = np.log(combine['Var2'])
combine['Var2'].describe()

count    219057.000000
mean          3.931010
std           0.093373
min           3.688879
25%           3.871201
50%           3.912023
75%           3.988984
max           4.820282
Name: Var2, dtype: float64

In [28]:
combine['Var3'].describe()

count    219057.000000
mean         75.065777
std          11.580112
min          52.000000
25%          67.000000
50%          74.000000
75%          82.000000
max         206.000000
Name: Var3, dtype: float64

In [29]:
combine['Var3'] = np.log(combine['Var3'])
combine['Var3'].describe()

count    219057.000000
mean          4.307021
std           0.149217
min           3.951244
25%           4.204693
50%           4.304065
75%           4.406719
max           5.327876
Name: Var3, dtype: float64

In [30]:
combine.isnull().sum()

Cancellation_Last_1Month           8
Confidence_Life_Style_Index        0
Customer_Rating                    0
Customer_Since_Months              0
Destination_Type                   0
Gender                             0
Life_Style_Index                   0
Surge_Pricing_Type             87395
Trip_Distance                      0
Trip_ID                            0
Type_of_Cab                        0
Var1                               0
Var2                               0
Var3                               0
dtype: int64

In [31]:
combine = pd.get_dummies(combine.drop('Trip_ID', axis=1))
combine.shape

(219057, 38)

In [32]:
combine.head()

Unnamed: 0,Customer_Rating,Customer_Since_Months,Life_Style_Index,Surge_Pricing_Type,Trip_Distance,Var1,Var2,Var3,Cancellation_Last_1Month_None,Cancellation_Last_1Month_Once,...,Destination_Type_M,Destination_Type_N,Gender_Female,Gender_Male,Type_of_Cab_A,Type_of_Cab_B,Type_of_Cab_C,Type_of_Cab_D,Type_of_Cab_E,Type_of_Cab_Unknown
0,3.905,-1.224219,2.42769,2.0,1.912501,3.688879,3.828641,4.094345,1,0,...,0,0,1,0,0,1,0,0,0,0
1,3.45,1.125095,2.78245,2.0,3.383373,3.637586,4.025352,4.356709,1,0,...,0,0,0,1,0,1,0,0,0,0
2,3.50125,1.125095,2.802594,2.0,3.727619,4.160382,4.025352,4.343805,0,0,...,0,0,0,1,0,0,0,0,0,1
3,3.45375,1.125095,2.802594,3.0,4.120012,4.160382,3.951244,4.304065,1,0,...,0,0,0,1,0,0,1,0,0,0
4,3.4025,1.125095,3.03453,2.0,4.006424,3.931826,3.89182,4.624973,0,0,...,0,0,0,1,0,0,1,0,0,0


In [33]:
# Rows with a target come from train; rows without one come from test.
has_target = combine['Surge_Pricing_Type'].notnull()
X = combine[has_target].drop(['Surge_Pricing_Type'], axis=1)
y = combine[has_target]['Surge_Pricing_Type']

X_test = combine[~has_target].drop(['Surge_Pricing_Type'], axis=1)

X.shape, y.shape, X_test.shape

((131662, 37), (131662,), (87395, 37))

In [34]:
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

In [35]:
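from lightgbm import LGBMClassifier, early_stopping, log_evaluation
model = LGBMClassifier(boosting_type='gbdt',
                       max_depth=5,
                       learning_rate=0.05,
                       n_estimators=5000,
                       min_child_weight=0.01,
                       colsample_bytree=0.5,
                       random_state=1994,
                       objective='multiclass')

# LightGBM 4.0 removed the early_stopping_rounds and verbose arguments of
# fit(); the callbacks below provide the same behavior.
model.fit(x_train, y_train,
          eval_set=[(x_train, y_train), (x_val, y_val.values)],
          callbacks=[early_stopping(stopping_rounds=100),
                     log_evaluation(period=200)])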

pred_y = model.predict(x_val)

Training until validation scores don't improve for 100 rounds.
[200]	training's multi_logloss: 0.683379	valid_1's multi_logloss: 0.704866
[400]	training's multi_logloss: 0.662537	valid_1's multi_logloss: 0.697089
[600]	training's multi_logloss: 0.649791	valid_1's multi_logloss: 0.696075
[800]	training's multi_logloss: 0.639005	valid_1's multi_logloss: 0.696177
Early stopping, best iteration is:
[742]	training's multi_logloss: 0.641806	valid_1's multi_logloss: 0.695972


In [36]:
from sklearn.metrics import accuracy_score, confusion_matrix
print(accuracy_score(y_val, pred_y))
confusion_matrix(y_val,pred_y)

0.7005658299472145


array([[3083, 1746,  611],
       [ 571, 9054, 1757],
       [ 436, 2764, 6311]], dtype=int64)
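The overall accuracy hides how unevenly the three classes are predicted. A short worked sketch that derives accuracy, per-class recall and per-class precision from the confusion matrix printed above. It assumes the rows and columns correspond to surge types 1, 2 and 3 in that order, which is how `confusion_matrix` orders sorted labels.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([[3083, 1746,  611],
               [ 571, 9054, 1757],
               [ 436, 2764, 6311]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per true class
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class

for label, r, p in zip([1, 2, 3], recall, precision):
    print(f"class {label}: recall={r:.3f} precision={p:.3f}")
print(f"accuracy={accuracy:.4f}")
```

The first class has the lowest recall of the three, with most of its errors predicted as class 2.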

In [37]:
err = []             # despite the name, this collects per-fold accuracy
y_pred_tot_lgm = []  # test-set predictions from each fold's model

from sklearn.model_selection import StratifiedKFold
from lightgbm import early_stopping, log_evaluation

fold = StratifiedKFold(n_splits=15, shuffle=True, random_state=2020)
i = 1
for train_index, test_index in fold.split(X, y):
    # split() yields positional indices, so index by position with iloc
    x_train, x_val = X.iloc[train_index], X.iloc[test_index]
    y_train, y_val = y.iloc[train_index], y.iloc[test_index]
    m = LGBMClassifier(boosting_type='gbdt',
                       max_depth=5,
                       learning_rate=0.05,
                       n_estimators=5000,
                       min_child_weight=0.01,
                       colsample_bytree=0.5,
                       random_state=1994,
                       objective='multiclass')
    # LightGBM 4.0 removed early_stopping_rounds and verbose from fit();
    # the callbacks below provide the same behavior.
    m.fit(x_train, y_train,
          eval_set=[(x_train, y_train), (x_val, y_val)],
          callbacks=[early_stopping(stopping_rounds=200),
                     log_evaluation(period=200)])
    pred_y = m.predict(x_val)
    print(i, " err_lgm: ", accuracy_score(y_val, pred_y))
    err.append(accuracy_score(y_val, pred_y))
    pred_test = m.predict(X_test)
    i = i + 1
    y_pred_tot_lgm.append(pred_test)

Training until validation scores don't improve for 200 rounds.
[200]	training's multi_logloss: 0.686314	valid_1's multi_logloss: 0.699159
[400]	training's multi_logloss: 0.667057	valid_1's multi_logloss: 0.690539
[600]	training's multi_logloss: 0.655568	valid_1's multi_logloss: 0.688687
[800]	training's multi_logloss: 0.645897	valid_1's multi_logloss: 0.687987
Early stopping, best iteration is:
[778]	training's multi_logloss: 0.646945	valid_1's multi_logloss: 0.687947
1  err_lgm:  0.7105592892128944
Training until validation scores don't improve for 200 rounds.
[200]	training's multi_logloss: 0.68534	valid_1's multi_logloss: 0.712401
[400]	training's multi_logloss: 0.666275	valid_1's multi_logloss: 0.706187
[600]	training's multi_logloss: 0.654529	valid_1's multi_logloss: 0.705286
Early stopping, best iteration is:
[527]	training's multi_logloss: 0.658427	valid_1's multi_logloss: 0.705231
2  err_lgm:  0.701104909442989
Training until validation scores don't improve for 200 rounds.
...

In [38]:
# Mean validation accuracy across the folds (err holds accuracy, not error).
np.mean(err)

0.7066883077455753

In [44]:
err[3]

0.712495728442875

In [45]:
submission = pd.DataFrame()
submission['Trip_ID'] = test['Trip_ID']
submission['Surge_Pricing_Type'] = y_pred_tot_lgm[3]
submission.to_csv('LGBM.csv', index=False, header=True)
submission.shape

(87395, 2)

In [46]:
submission.head()

Unnamed: 0,Trip_ID,Surge_Pricing_Type
0,T0005689459,1.0
1,T0005689462,2.0
2,T0005689463,2.0
3,T0005689466,2.0
4,T0005689468,2.0


In [47]:
submission['Surge_Pricing_Type'].value_counts()

2.0    44122
3.0    29718
1.0    13555
Name: Surge_Pricing_Type, dtype: int64
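The submission above uses the test predictions of a single fold model (index 3) and writes the labels as floats (1.0, 2.0, 3.0). A minimal sketch of one common alternative, not what the notebook does: a majority vote over all fold predictions, cast to integers. The helper name `majority_vote` and the toy arrays are hypothetical; in the notebook the input would be `y_pred_tot_lgm`.

```python
import numpy as np

def majority_vote(fold_predictions):
    """Return the most frequent label per sample across folds.

    Ties go to the smallest label, because argmax returns the first maximum.
    """
    preds = np.asarray(fold_predictions).astype(int)   # (n_folds, n_samples)
    classes = np.unique(preds)
    # votes[k, j] = number of folds that predicted classes[k] for sample j
    votes = np.stack([(preds == c).sum(axis=0) for c in classes])
    return classes[votes.argmax(axis=0)]

# Toy stand-in for y_pred_tot_lgm: three folds, four test rows.
toy_fold_predictions = [
    np.array([1.0, 2.0, 3.0, 2.0]),
    np.array([1.0, 2.0, 2.0, 2.0]),
    np.array([2.0, 3.0, 3.0, 2.0]),
]
final_labels = majority_vote(toy_fold_predictions)
print(final_labels)
```

Averaging `predict_proba` outputs across folds before taking the argmax is another option, and it avoids ties.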