# Predicting cab booking cancellation - Final Report
## Capstone Project One : Springboard Data Science career track
### Notebook by Debisree Ray

This is the final report for Capstone Project One. For a more detailed version of the EDA and modelling, along with the source code, please see the following notebooks:

https://github.com/debisree/Springboard_Debisree/blob/master/Capstone_1_predicting_cab_booking_cancellation/predicting-cab-booking-cancellations-milestone.ipynb

https://github.com/debisree/Springboard_Debisree/blob/master/Capstone_1_predicting_cab_booking_cancellation/predicting-cab-booking-cancellations-ML.ipynb

For the slides:

https://github.com/debisree/Springboard_Debisree/blob/master/Capstone_1_predicting_cab_booking_cancellation/capstone_1.pdf

## Introduction:

### 1. The Problem statement:

The business problem addressed here is to improve customer service for **YourCabs**, a cab company based in the Bangalore metropolitan area (India). A certain percentage of bookings get canceled by the company due to the unavailability of a car, and these cancellations occur when the trip is about to start. This inconveniences passengers and damages the company's reputation. The challenge is to build a predictive model that classifies upcoming bookings by whether or not they will eventually be canceled due to car unavailability. This is therefore a binary classification problem.



### 2. The Data: 

**Kaggle** hosts the original problem and the dataset on its website as one of its competitions. I downloaded the data from the following link:

https://www.kaggle.com/c/predicting-cab-booking-cancellations2/data

These are the data fields in the dataset, which we are going to read into a Pandas DataFrame.


* **id** - booking ID<br />

* **user_id** - the ID of the customer (based on mobile number)<br />

* **vehicle_model_id** - vehicle model type.<br />

* **package_id** - type of package (1=4hrs & 40kms, 2=8hrs & 80kms, 3=6hrs & 60kms, 4=10hrs & 100kms, 5=5hrs & 50kms, 6=3hrs & 30kms, 7=12hrs & 120kms)<br />

* **travel_type_id** - type of travel (1=long distance, 2=point to point, 3=hourly rental).<br />

* **from_area_id** - unique identifier of area. Applicable only for point-to-point travel and packages <br />

* **to_area_id** - unique identifier of area. Applicable only for point-to-point travel <br />

* **from_city_id** - unique identifier of city <br />

* **to_city_id** - unique identifier of city (only for intercity) <br />

* **from_date** - time stamp of requested trip start <br />

* **to_date** - time stamp of trip end <br />

* **online_booking** - if booking was done on desktop website <br />

* **mobile_site_booking** - if booking was done on mobile website <br />

* **booking_created** - time stamp of booking <br />

* **from_lat** - latitude of from area <br />

* **from_long** - longitude of from area <br />

* **to_lat** - latitude of to area <br />

* **to_long** - longitude of to area <br />

* **Car_Cancellation** (available only in training data) - whether the booking was cancelled (1) or not (0) due to unavailability of a car. <br />

* **Cost_of_error** (available only in training data) - the cost incurred if the booking is misclassified. Misclassifying an uncancelled booking as cancelled costs 1 unit. The cost of misclassifying a cancelled booking as uncancelled is a function of how close the cancellation occurs to the trip start time: the closer to the trip, the higher the cost. Cancellations occurring less than 15 minutes before the trip start incur a fixed penalty of 100 units. <br />



### 3. The questions of interest:

The data analysis and story-telling report is organized around the following questions of interest:

* How many unique users are there? Are there any returning customers? Did the returning customers get their rides canceled?

* What are the different package IDs? Is there any relationship with the cancellations?

* What are the different travel types, vehicle IDs and modes of booking (mobile/website/phone)? How are they related to the cancellations?

* Is there any connection between the drop-off location/city/area ID/latitude-longitude info and cancellations? What about the pick-up locations/city/area IDs?

* In which areas/neighborhoods is the cab service most popular?

* What is the busiest hour of the day? Does it have any connection with the cancellations?

* Which day of the week is the most popular among cab users? Is there any connection between the day of the week and the cancellations?

### 4. Executive Summary:

* To predict the cab booking cancellations, we have considered 17 features, either taken directly from the dataset or engineered/derived from the data. Interestingly, the engineered features are the most important ones in terms of relative importance.

* This is a classification problem. Here we have used the following classification models:
  * Logistic Regression
  * K-Nearest Neighbor (KNN)
  * Support vector machine (SVM)
  * Random Forest
  * Naive Bayes
  * Gradient Boost

* Evaluating a model by training and testing on the same dataset gives an overly optimistic score and hides overfitting. Hence the model evaluation is based on splitting the dataset into a train set and a validation set. However, the result then depends on the random choice of the (train, validation) pair. To overcome this, k-fold cross-validation is used: the training set is split into k smaller sets (folds), a model is trained on k-1 of the folds, and it is validated on the remaining fold.

* We have evaluated each model in terms of accuracy and ROC-AUC score for both the training and test data, and plotted them. The two best-performing models are Random Forest and Gradient Boost. Both are ensemble models based on decision trees.

* We performed hyperparameter tuning through a cross-validated grid search for both models separately. This step was the most time-consuming one in terms of computation (the RF model took much longer). With the optimized hyperparameters, we refitted the two models and obtained their predictions separately.

* We evaluated the ROC-AUC scores with the optimized hyperparameters. The model performance improved with the optimized parameters. The final ROC-AUC scores for RF and GB are 0.886 and 0.899, respectively.

### 5. Data wrangling:

This notebook uses several Python packages that come standard with the Anaconda Python distribution. The primary libraries that we'll be using are:

* **NumPy**: Provides a fast numerical array structure and helper functions.
* **pandas**: Provides a DataFrame structure to store data in memory and work with it easily and efficiently.
* **scikit-learn**: The essential Machine Learning package in Python.
* **matplotlib**: Basic plotting library in Python; most other Python plotting libraries are built on top of it.
* **Seaborn**: Advanced statistical plotting library.


The data were downloaded as a CSV file with 43431 rows and 20 columns and read into a single Pandas DataFrame. The dataset is mostly clean, although some fields (package_id, to_area_id, to_city_id, from_city_id, etc.) have missing values. Most of the columns are categorical (IDs and flags), alongside the timestamps and the latitude/longitude coordinates. The target column is 'Car_Cancellation', which takes the value "1" if the ride gets canceled and "0" otherwise. Our goal is to build a model that can predict this variable.
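As a minimal sketch of this loading step, the snippet below reads a CSV into a DataFrame and counts the missing values per column. The three-row inline sample is made up for illustration and stands in for the real Kaggle file; only the column names come from the data field list above.

```python
import io
import pandas as pd

# Tiny made-up stand-in for the Kaggle CSV (the real file has 43431 rows
# and 20 columns). Empty fields become NaN when parsed.
sample_csv = io.StringIO(
    "id,user_id,package_id,to_city_id,Car_Cancellation\n"
    "1,101,,,0\n"
    "2,102,1,,1\n"
    "3,101,2,32,0\n"
)

# With the real data, pass the path of the downloaded CSV file instead.
df = pd.read_csv(sample_csv)

# Shape of the frame, then the number of missing values in each column.
n_rows, n_cols = df.shape
missing_per_column = df.isnull().sum()

print(n_rows, n_cols)
print(missing_per_column)
```

On the real dataset, the same `isnull().sum()` call is a quick way to see which fields have gaps.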

The first 5 lines of the raw dataset can be seen as follows:

![Data](head.png)



**Data/column engineering:**

  
There are two essential timestamps in the data: (1) 'booking_created', which gives the time the ride was booked, and (2) 'from_date', which gives the requested start time of the trip. We have split these datetime values into separate day-of-week, date, month, and hour columns.
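The split described above can be sketched with the pandas `.dt` accessor. The two rows are made-up examples, and the derived column names follow the engineered feature list in section 7; the exact code in the notebooks may differ.

```python
import pandas as pd

# Two made-up bookings; the column names follow the data field list.
df = pd.DataFrame({
    "booking_created": ["2013-10-31 10:30:00", "2013-07-04 08:05:00"],
    "from_date": ["2013-11-02 06:00:00", "2013-07-04 22:15:00"],
})

# Parse the timestamp strings into datetime values.
for col in ["booking_created", "from_date"]:
    df[col] = pd.to_datetime(df[col])

# Features derived from the booking timestamp.
df["booking_date"] = df["booking_created"].dt.day
df["booking_month"] = df["booking_created"].dt.month
df["booking_time_new"] = df["booking_created"].dt.hour

# Features derived from the trip start timestamp.
df["dayofweek"] = df["from_date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["date"] = df["from_date"].dt.day
df["month"] = df["from_date"].dt.month
df["time_new"] = df["from_date"].dt.hour

print(df[["dayofweek", "date", "month", "time_new"]])
```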
  
**Class imbalance:**

There is a major class imbalance in the data: cancellations are rare compared with non-cancellations. Only ~7% of the bookings (3132 out of 43431) were canceled.
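A short arithmetic check of these counts also shows why plain accuracy is a weak metric here: a trivial model that always predicts "not canceled" would already be right for roughly 93% of the bookings. The always-zero baseline is an illustration added here, not a model from the notebooks.

```python
# Counts quoted in the class-imbalance paragraph above.
n_total = 43431
n_cancelled = 3132

cancel_rate = n_cancelled / n_total

# Accuracy of a trivial model that always predicts "not cancelled" (class 0).
baseline_accuracy = 1 - cancel_rate

print(f"cancellation rate: {cancel_rate:.1%}")
print(f"always-0 baseline accuracy: {baseline_accuracy:.1%}")
```

This is one reason a ranking metric such as ROC-AUC is more informative than accuracy on this dataset.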



<img src="1.png" align="left" width="50%"/><img src="2.png" align="left" width="45%"/>

### 6. Exploratory Data Analysis:

In the EDA, each feature has been studied and plotted against the cancellations, so as to infer any relationship between them.

### 6.1 User ID:


* Each user has been assigned a unique ‘User ID’ in their booking information. 
* A total of 22267 unique user IDs have been recorded. 
* The user with user_id '29648' is the most frequent user, with 471 bookings. So, some users are ‘returning customers’ and some are 'one-time users'. 
* The number of one-time (non-returning) users is 15935, and the number of returning customers is 6332.

* There are some gaps/missing data in the user ID column.
* The most frequent user (user_id '29648') also had the most cancellations: 55. The next most affected user had 25 rides canceled, and so on.

* 1049 returning customers had at least one ride canceled, which is roughly 16.6% of all returning customers. The remaining 5283 returning customers did not experience any cancellation.
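The returning-customer figures above can be derived with `value_counts` and a `groupby`. The sketch below uses a made-up seven-row frame, so its numbers are illustrative only; on the real data the same steps would give the 6332 / 15935 split and the 16.6% share.

```python
import pandas as pd

# Made-up bookings: user 1 books three times, user 3 twice, users 2 and 4 once.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3, 4],
    "Car_Cancellation": [0, 1, 0, 0, 0, 0, 1],
})

# Number of bookings per user, split into returning and one-time users.
bookings_per_user = df["user_id"].value_counts()
returning = bookings_per_user[bookings_per_user > 1].index
one_time = bookings_per_user[bookings_per_user == 1].index

# Returning customers with at least one cancelled ride.
cancelled_per_user = df.groupby("user_id")["Car_Cancellation"].sum()
n_affected = int((cancelled_per_user.loc[returning] > 0).sum())
share_affected = n_affected / len(returning)

print(len(returning), len(one_time), n_affected, share_affected)
```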



<img src="4.png" align="left" width="50%"/>
<img src="5.png" align="right" width="50%"/>

### 6.2 Package ID:

* The package IDs are the various travel (booking) plans from which customers can choose. We want to evaluate whether 'package_id' has any effect on cancellation, so we have plotted the frequency of canceled vs. not canceled rides across the different package IDs.

* The plot below shows that people mostly opt for a journey of 4hrs and around 40kms, followed by 8hrs and 80kms. (The package IDs are described above, in the list of data fields.) Package ID '1' is also the one that gets canceled most often.

<img src="6.png" align="center" width="60%"/>

### 6.3 Travel type ID:

* Travel type ID is another feature of a similar kind. There are three different travel types to choose from (each is described in the data field list above). The following figure shows that travel type '2' (i.e. point-to-point travel) is the most popular. 


<img src="7.png" align="center" width="60%"/>

### 6.4 Vehicle model ID:


* 27 different types of vehicles are listed. 
* The most popular one is the vehicle with model ID '12'. It has been used 31859 times. 
* At the same time, vehicle model ID '12' has the maximum number of cancellations (2668).
* Note that the y-axis uses a logarithmic scale, to give a clear picture of the entire data.

<img src="8.png" align="center" width="60%"/>

### 6.5 Different methods of booking:

* There are three different types of booking method. 
* Only two types are listed explicitly in the data: **'mobile booking'** and **'desktop/website booking'**.

* The remaining bookings, which carry neither flag, are therefore grouped as the **'Other method'** of booking. 
* 1878 bookings were made from the mobile website and 15270 from the desktop website, so the remaining 26283 bookings were made some other way (total number of bookings = 43431). Other methods of booking are thus the most common, although the dataset does not say what they are.

* In the same figure, the canceled bookings are shown in deeper shades. Interestingly, the maximum frequency of cancellations corresponds to the bookings made from the desktop website.
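One way to derive the three booking methods from the two flags is sketched below. The `booking_method` column name and the four sample rows are assumptions made for illustration; the last lines simply re-check the arithmetic quoted in the bullets above.

```python
import numpy as np
import pandas as pd

# Made-up rows: one desktop booking, one mobile booking, two with neither flag.
df = pd.DataFrame({
    "online_booking": [1, 0, 0, 0],
    "mobile_site_booking": [0, 1, 0, 0],
})

# Hypothetical helper column: label each booking by its method.
df["booking_method"] = np.select(
    [df["online_booking"] == 1, df["mobile_site_booking"] == 1],
    ["desktop", "mobile"],
    default="other",
)
counts = df["booking_method"].value_counts()
print(counts)

# Arithmetic check of the counts reported in the text.
n_other = 43431 - 1878 - 15270
print(n_other)
```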


<img src="9.png" align="center" width="60%"/>

### 6.6 Pick-up/Drop-off Area ID:

* There are two features describing the pick-up and drop-off area IDs (in and around the Bangalore metropolitan area) for the booked rides. 
* 598 unique origin and 568 unique destination area IDs are listed. 
* The most popular origin area is the one with area ID '393', which is also the most popular destination area. The 5 most popular pick-up area IDs and the corresponding booking frequencies are as follows:

| Pick-up area ID  |  Booking frequency |         
|-----------------:|-------------------:|
|    393           |    3858            |
|    571           |    1631            |
|   1010           |    768             |
|    142           |    727             |
|     83           |    719             |


* 559 area IDs appear as both pick-up and drop-off locations. 

* The five most popular destination area IDs and the corresponding booking frequencies are as follows:
   
| Drop-off area ID |  Booking frequency |         
|-----------------:|-------------------:|
|    393           |    8777            |
|    585           |    2339            |
|   1384           |    1237            |
|    571           |    664             |
|    293           |    555             |

   

* The violin-plots show both the pick-up and drop-off area distributions.


<img src="10.png" align="center" width="60%"/>


* In the first two tables below, the left columns show some area IDs (origin/destination), and the right columns show the percentage of canceled bookings for those area IDs.

* The rightmost table below shows the cancellations for some specific routes. Some routes are notorious for cancellations, with a very high cancellation rate. For example, on the route from area ID 626 to area ID 122, almost 91% of the bookings were canceled.

* In the graph below, we have plotted the percentage of canceled rides over area IDs (for both the pick-up and drop-off locations).


<img src="from_area.png" align="left" width="25%"/><img src="to_area.png" align="left" width="21%"/><img src="from_to.png" align="left" width="43%"/>

<img src="10a.png" align="center" width="60%"/>
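Per-route cancellation rates like the ones tabulated above can be computed with a two-key `groupby`. In this sketch the area IDs are borrowed from the text, but the five rows and the resulting rates are made up.

```python
import pandas as pd

# Made-up bookings on two routes (area IDs reused from the text, counts invented).
df = pd.DataFrame({
    "from_area_id": [626, 626, 626, 393, 393],
    "to_area_id": [122, 122, 122, 585, 585],
    "Car_Cancellation": [1, 1, 0, 0, 0],
})

# Number of bookings and share of cancelled bookings per (origin, destination).
route_stats = (
    df.groupby(["from_area_id", "to_area_id"])["Car_Cancellation"]
      .agg(bookings="count", cancel_rate="mean")
      .sort_values("cancel_rate", ascending=False)
)
print(route_stats)
```

Keeping the `bookings` count next to the rate is useful, since a high rate on a route with only a handful of bookings is much less reliable than the same rate on a busy route.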


### 6.7 Origin/Destination city ID :


* Similar information is listed in the feature set as the ‘city ID’ (cities in and around the Bangalore metropolitan area).
* Only 3 origin cities have been recorded. The most popular origin city is the city with ID '15'.
* The destination cities, in contrast, are far more numerous: there are 116 unique destination cities. 
* The most popular destination city is the city with ID '32' (475 rides have this city as their destination).

* Five most popular drop-off city IDs and their corresponding bookings are as follows:

| Drop-off city ID |  Booking frequency |         
|-----------------:|-------------------:|
|    32            |    475             |
|    55            |    174             |
|    29            |    116             |
|    146           |     89             |
|    108           |     64             |

* However, only 16345 non-null values are available in 'from_city_id' and 1588 in 'to_city_id', so most of this information is missing.
* The y-axis uses a log scale.

<img src="11.png" align="center" width="60%"/>

### 6.8 Latitude-Longitude information:

* GPS information about the pick-up and drop-off locations is also given, in the form of latitude-longitude coordinates.  
* There are certain locations (latitude-longitude combinations) for which the pick-up/drop-off cancellation rate is high (more than 50%).

<img src="13.png" align="left" width="50%"/> <img src="14.png" align="right" width="50%"/>

### 6.9 Booking time:


Booking time is an interesting feature, which records the timestamp of the booking (when somebody booked the cab). The maximum number of bookings made at a single timestamp is 18, and the corresponding date-time is 2013-10-31 10:30:00. The day, month, and hour can be extracted from this single timestamp.

* We have plotted the booking frequencies over the days of the week, with the canceled bookings shown on the same graph. The maximum number of bookings was made on Fridays.

* Next, we have plotted the booking frequencies over the dates of the month, again with the canceled ones on the same figure. The bookings were made almost equally throughout the month.

* We plotted the booking frequencies over the months of the year. The maximum number of bookings was made in August.

* We plotted the booking frequencies over the hours of the day, along with the canceled rides.

<img src="15.png" align="left" width="50%"/><img src="15a.png" align="right" width="50%"/>
<img src="16.png" align="left" width="50%"/>
<img src="17a.png" align="right" width="50%"/>



### 6.10 Timestamp of the actual ride:


* This is one of the most important features in the dataset, and it might show some connection with the cancellations. 
* This column records the timestamp of the requested trip start. 
* The maximum number of trips starting at a single timestamp is 20, and the corresponding date-times are 2013-10-12 06:00:00 and 2013-07-04 22:15:00.

* Here we have extracted the ride frequencies over the days of the week. They are almost equally distributed. The maximum frequency (6990) of rides corresponds to 'Saturday', followed by 'Friday', so people book cabs somewhat more towards the weekend. On the same figure, we have plotted the canceled ride frequencies, which also appear almost equally distributed over the days of the week. However, the maximum number of cancellations (578) corresponds to 'Friday', followed by 'Sunday'.

<img src="19.png" align="left" width="50%"/><img src="19a.png" align="left" width="50%"/>



* Next, we have extracted the ride frequency over the months of the year. The maximum frequency (5445) corresponds to 'August', followed by 'July'. On the same figure, we have plotted the canceled ride frequencies. The maximum number of cancellations (650) corresponds to 'October', followed by 'November'.


<img src="20.png" align="left" width="50%"/><img src="20a.png" align="left" width="50%"/>

* These are the frequencies of the rides across different times of the day. The distribution shows two humps: most rides are booked for two typical periods of the day, one in the morning and another in the evening. These are the busiest hours, the 'office time' rush. The ride cancellation distribution follows the same trend: the maximum number of rides got canceled in these two peak periods, which is when cars are most likely to be unavailable.

<img src="21.png" align="left" width="50%"/><img src="21b.png" align="right" width="50%"/><img src="21a.png" align="center" width="60%"/>

### 6.11 Time difference (between the timestamp of booking time and the trip starting time) :

* This is a numerical feature created by taking the difference between the ‘booking created’ and ‘trip start’ timestamps, to explore whether it has any connection with the cancellations.

* In 42 entries of the dataset, the time difference is negative. This is unphysical, since a ride cannot be booked after it has already started, so we dropped these entries from the dataset. 

* The final data frame therefore has 43389 entries.

* The time difference (in hours) is a numeric feature. Its descriptive statistics and a histogram of its distribution are shown below.

<img src="time_diff.png"  width="25%" align="center"/>


<img src="22.png" align="left" width="50%"/><img src="23.png" align="left" width="50%"/>
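The time-difference feature and the removal of negative lead times can be sketched as follows. The three rows are made up; `time_difference` and `df_new` follow the names used in this report, and keeping rows with a difference of exactly zero is an assumption.

```python
import pandas as pd

# Made-up bookings; the second one "starts" before it was created.
df = pd.DataFrame({
    "booking_created": pd.to_datetime(
        ["2013-01-01 08:00", "2013-01-01 12:00", "2013-01-02 09:30"]),
    "from_date": pd.to_datetime(
        ["2013-01-01 10:30", "2013-01-01 11:00", "2013-01-03 09:30"]),
})

# Lead time in hours between booking creation and trip start.
df["time_difference"] = (
    (df["from_date"] - df["booking_created"]).dt.total_seconds() / 3600
)

# Drop the unphysical rows where the trip starts before the booking exists.
df_new = df[df["time_difference"] >= 0].reset_index(drop=True)

print(df["time_difference"].tolist())
print(len(df_new))
```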

### 7. Data preparation and feature selection for applying machine learning:

'df_new' (with 43389 entries) is the final dataframe to be used in the analysis. The following features will be considered.

The features below are taken directly from the data set. Their descriptions can be found above.

* **vehicle_model_id**  
* **package_id**
* **travel_type_id**
* **from_area_id**
* **to_area_id**
* **from_city_id**
* **to_city_id**
* **online_booking**
* **mobile_site_booking**

The following features are engineered from the given data columns.

* **booking_date**:  Date of the booking timestamp of the ride.
* **booking_month**: Month of the booking timestamp of the ride.
* **booking_time_new**:  Hour (of a day) of the booking timestamp of the ride.
* **dayofweek**:  Day of the week of the actual trip.
* **date**:  Date of the timestamp of the actual ride.
* **month**:  Month of the timestamp of the actual ride.
* **time_new**:   Hour (of a day) of the timestamp of the actual ride.
* **time_difference**:  Difference (in hours) between the booking time and the actual trip start time.

### 8. Applying Machine Learning models and comparing their performances:

This is a classification problem, in supervised learning. Here we have used the following classification models:
* Logistic Regression 
* K-Nearest Neighbor (KNN)
* Support vector machine (SVM)
* Random Forest 
* Naive Bayes
* Gradient Boost

Evaluating the performance of a model by training and testing on the same dataset gives an overly optimistic score and hides overfitting. Hence the model evaluation is based on splitting the dataset into a train set and a validation set. However, the result then depends on the random choice of the (train, validation) pair. To overcome this, k-fold cross-validation is used: the training set is split into k smaller sets (folds), and each fold is used once for validation while a model is trained on the remaining k-1 folds.
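A minimal sketch of k-fold cross-validation with scikit-learn is shown below. It runs on synthetic, imbalanced data from `make_classification` rather than on the cab dataset, and the stratified splitter, k=5 and the model settings are assumptions chosen for illustration, not the settings used in the notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data with roughly 10% positives, to mimic the imbalance.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Stratified folds keep the class ratio similar in every fold (an assumption;
# the text above only specifies k-fold CV).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# One ROC-AUC score per fold: train on k-1 folds, validate on the held-out fold.
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())
```

Reporting the mean together with the spread across folds shows how much the score depends on the particular split.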

**Classification/Confusion Matrix:**  This matrix summarizes the correct and incorrect classifications that a classifier produced for a given dataset. The rows and columns of the matrix correspond to the true and predicted classes, respectively. The two diagonal cells (upper left, lower right) give the number of correct classifications, where the predicted class coincides with the actual class of the observation. The off-diagonal cells give the counts of misclassifications. The matrix thus gives estimates of the true classification and misclassification rates.
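As a small worked example, the snippet below builds the confusion matrix for eight made-up labels with scikit-learn, using the same layout as described above (rows are true classes, columns are predicted classes).

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (1 = cancelled, 0 = not cancelled).
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# For a binary problem the four cells unpack in this order.
tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / cm.sum()

print(cm)
print(tn, fp, fn, tp, accuracy)
```

The two off-diagonal cells are not equally costly in this problem: according to the Cost_of_error field, a false positive costs 1 unit, while a missed cancellation (false negative) can cost up to 100 units.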


We applied different ML models above and evaluated their performances in terms of ROC-AUC score for both the training and test data. Here we have tabulated the scores and plotted them.

<img src="comp.png" alt="Drawing" width="50%" align="center"/>
 

<img src="26.png" align="left" width="50%"/><img src="27.png" align="left" width="50%"/>

Clearly, **Gradient Boost** and **Random Forest** are the two best-performing models. 


Both models are similar in nature: they are ensemble learning methods based on decision trees, and they predict (regression or classification) by combining the outputs of individual trees. The two main differences are:

* How trees are built: a random forest builds each tree independently, while gradient boosting builds one tree at a time. Gradient boosting is an additive model that works in a forward stage-wise manner, with each new weak learner correcting the shortcomings of the existing ones. 
* Combining results: a random forest combines results at the end of the process (by averaging or "majority rules"), while gradient boosting combines results along the way.

Ref:  https://www.datasciencecentral.com/profiles/blogs/decision-tree-vs-random-forest-vs-boosted-trees-explained
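The two differences can be illustrated on synthetic data with scikit-learn. In this sketch, the random forest's predicted probabilities are compared with the plain average over its individual trees, and the gradient boosting model is inspected stage by stage through `staged_predict_proba`. The data and the model settings are assumptions chosen only for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic stand-in data, not the cab dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Random forest: independent trees whose outputs are averaged at the end.
rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
per_tree = np.array([tree.predict_proba(X) for tree in rf.estimators_])
rf_matches_average = np.allclose(per_tree.mean(axis=0), rf.predict_proba(X))

# Gradient boosting: trees are added one at a time, so there is a prediction
# after every stage; the last stage is the final model.
gb = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)
stages = list(gb.staged_predict_proba(X))
gb_final_matches = np.allclose(stages[-1], gb.predict_proba(X))

print(rf_matches_average, len(stages), gb_final_matches)
```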

### 9. Hyperparameter Tuning:

In machine learning, hyperparameter optimization (or tuning) is the problem of choosing a set of optimal hyperparameters for a learning algorithm. Grid search is one way of doing this: it evaluates every combination of values from a specified grid and keeps the best-scoring one. This matters because the performance of the model depends on the hyperparameter values specified.


We performed hyperparameter tuning through grid search for the two ML models (Random Forest and Gradient Boost) that performed best in the first run. After fitting these models with the optimized hyperparameters found through the grid search, we evaluated the model performance in terms of the **ROC-AUC** score. The scores are as follows:


| Model         |  ROC-AUC          |
|--------------:|-----------------: |
| Random forest |0.8860217314758018 |
| Gradient Boost|0.8987293089109146 |  
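A minimal sketch of such a grid search with scikit-learn's `GridSearchCV` is given below. The parameter grid, the 3-fold setting and the synthetic data are illustrative assumptions; the grids actually searched are in the ML notebook linked at the top of this report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the cab dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid only: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 3]}

# Every candidate is scored by cross-validated ROC-AUC.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

The cost grows with the product of the grid sizes and the number of folds, which is consistent with this step being the most time-consuming one.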
 
 
A feature importance analysis reveals that the engineered features are the most important ones. 
 <img src="feature_rf.png" align="left" width="35%"/> <img src="feature_gb.png" align="right" width="35%"/>
 <img src="28.png" align="left" width="50%"/> <img src="29.png" align="right" width="50%"/>
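For tree ensembles in scikit-learn, such a ranking is typically read from the fitted model's `feature_importances_` attribute. The sketch below uses synthetic data and generic feature names (`f0` to `f3`), so the ranking it prints says nothing about the cab dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with generic, made-up feature names.
feature_names = ["f0", "f1", "f2", "f3"]
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalised to sum to 1, sorted for display.
importances = (
    pd.Series(model.feature_importances_, index=feature_names)
      .sort_values(ascending=False)
)
print(importances)
```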
 


### 10. Conclusion and Future work:

There is plenty of room to improve the model: 
  * Here we have used only one year of data. The model could be improved by using data from at least one more year.
  * Use ensembles of the machine learning models to average out bias and improve performance.
  * Do more feature engineering. In particular, we have neglected the latitude/longitude (GPS) information here; route information could be extracted from it and used as a feature. 
  * Try to fit and predict using the Extreme Gradient Boosting (XGBoost) classifier.

In conclusion, there are two final prediction (result) files, **'final_result_gb.csv'** and **'final_result_rf.csv'**. Each has two columns, named **User ID** and **Car_cancellation**, and gives the predicted cab booking cancellation (0 for no cancellation, 1 for cancellation) for the corresponding user ID. 

What the company can do is: 

* Run the model at every one-hour interval
* Call the customer who is flagged by the model
* Confirm with the customer if the booking will be canceled or not
* Send cab only after the confirmation from the customer