## The Problem

![alt text](https://kaggle2.blob.core.windows.net/competitions/kaggle/4378/media/portugal_map4.png "Logo Title Text 1")

# Can we predict how long a taxi ride will last?

#### To improve the efficiency of electronic taxi dispatching systems, it is important to be able to predict how long a driver will have their taxi occupied. If a dispatcher knew approximately when a driver would end their current ride, they could better identify which driver to assign to each pickup request.


## The Tools

We need data, compute, and algorithms!

#### (1) Data https://www.kaggle.com/c/nyc-taxi-trip-duration/data 

##### Our Features
- id - a unique identifier for each trip
- vendor_id - a code indicating the provider associated with the trip record
- pickup_datetime - date and time when the meter was engaged
- dropoff_datetime - date and time when the meter was disengaged
- passenger_count - the number of passengers in the vehicle (driver-entered value)
- pickup_longitude - the longitude where the meter was engaged
- pickup_latitude - the latitude where the meter was engaged
- dropoff_longitude - the longitude where the meter was disengaged
- dropoff_latitude - the latitude where the meter was disengaged
- store_and_fwd_flag - a flag indicating whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle did not have a connection to the server
- trip_duration - duration of the trip in seconds
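
To make the link between the timestamp columns and the target concrete, here is a minimal sketch. It assumes that `trip_duration` is simply the gap between `dropoff_datetime` and `pickup_datetime`, and that timestamps look like `2016-03-14 17:24:55`. The two rows are made up for illustration, and plain Python is used only to keep the example dependency-free.

```python
from datetime import datetime

# Two made-up rows shaped like the feature list above (not real data).
rows = [
    {"id": "trip_a", "pickup_datetime": "2016-03-14 17:24:55",
     "dropoff_datetime": "2016-03-14 17:32:30", "passenger_count": 1},
    {"id": "trip_b", "pickup_datetime": "2016-06-12 00:43:35",
     "dropoff_datetime": "2016-06-12 00:54:38", "passenger_count": 2},
]

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def duration_seconds(row):
    """Seconds between meter engaged and meter disengaged."""
    pickup = datetime.strptime(row["pickup_datetime"], TIME_FORMAT)
    dropoff = datetime.strptime(row["dropoff_datetime"], TIME_FORMAT)
    return int((dropoff - pickup).total_seconds())


durations = [duration_seconds(r) for r in rows]
print(durations)  # [455, 663]
```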


#### (2) Compute - Google Colab (GPU + Python Environment in the cloud)

#### (3) Algorithms - Pandas for data preprocessing, XGBoost for learning, matplotlib for visualization


## The Steps

- Split into training and testing data
- Visualize the data (What do the features look like? How long is a trip? How much overlap is there between the training and testing data?)
- Use XGBoost to train a model on the data
- Save the trained model 

## XGBoost



### First, let's talk Ensembles

![alt text](http://images.slideplayer.com/37/10747814/slides/slide_2.jpg "Logo Title Text 1")

- An ensemble is just a collection of predictors whose outputs are combined (e.g. by taking the mean of all predictions) to give a final prediction.
- We use ensembles because many different predictors trying to predict the same target variable, once combined, usually do a better job than any single predictor alone, provided their errors are not strongly correlated.
- Ensembling techniques are commonly classified into Bagging and Boosting.
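
As a quick illustration of why combining helps, the sketch below simulates 25 hypothetical predictors whose errors are independent (an idealized assumption, and the numbers are made up) and compares the best single one against the mean of all of them.

```python
import numpy as np

rng = np.random.default_rng(0)

truth = rng.uniform(300, 3000, size=1000)  # made-up trip durations in seconds
n_predictors = 25

# Each row is one hypothetical predictor: the truth plus its own independent noise.
predictions = truth + rng.normal(0, 200, size=(n_predictors, truth.size))


def rmse(pred):
    """Root mean squared error against the simulated truth."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))


single_rmse = [rmse(p) for p in predictions]
ensemble_rmse = rmse(predictions.mean(axis=0))  # ensemble = mean of all predictions

print(f"best single predictor RMSE: {min(single_rmse):.1f}")
print(f"ensemble RMSE:              {ensemble_rmse:.1f}")
```

With real models the errors are correlated to some degree, so the gain is usually smaller than in this idealized setup.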

### Wait, what's bagging?

- Bagging is a simple ensembling technique in which we build many independent predictors/models/learners and combine them using a model averaging technique (e.g. weighted average, majority vote, or simple average).
- We typically draw a random sub-sample/bootstrap sample of the data for each model, so that all the models are a little different from each other.
- Each observation is drawn with replacement to be used as input for each model, so each model sees a different set of observations depending on its bootstrap sample.
- Because this technique combines many largely uncorrelated learners into a final model, it reduces error by reducing variance. Random Forest models are an example of a bagging ensemble.
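
The bullets above can be sketched in a few lines of NumPy. This is only a toy version: the data is a made-up noisy sine curve, and the base learner is a degree-5 polynomial chosen for brevity (a Random Forest would use decision trees instead).

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up 1-D regression data: a noisy sine curve.
x = np.sort(rng.uniform(0, 6, size=80))
y = np.sin(x) + rng.normal(0, 0.3, size=x.size)


def fit_one_model(x_sample, y_sample, degree=5):
    """Base learner (an assumption of this sketch): a polynomial fit."""
    return np.polyfit(x_sample, y_sample, degree)


n_models = 50
models = []
unique_fractions = []
for _ in range(n_models):
    # Bootstrap: draw len(x) row indices with replacement.
    idx = rng.integers(0, x.size, size=x.size)
    unique_fractions.append(np.unique(idx).size / x.size)
    models.append(fit_one_model(x[idx], y[idx]))

x_grid = np.linspace(0.5, 5.5, 20)
all_preds = np.array([np.polyval(m, x_grid) for m in models])  # shape (50, 20)
bagged_pred = all_preds.mean(axis=0)  # simple average over the models

print(f"average share of distinct rows per bootstrap: {np.mean(unique_fractions):.2f}")
print(f"average spread between single models: {all_preds.std(axis=0).mean():.3f}")
```

Each bootstrap sample contains only part of the distinct rows, which is what makes the individual models differ. Averaging them smooths out that model-to-model spread.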


![alt text](https://cdn-images-1.medium.com/max/592/1*i0o8mjFfCn-uD79-F1Cqkw.png "Logo Title Text 1")


![alt text](https://www.researchgate.net/profile/Nazar_Zaki/publication/269359645/figure/fig2/AS:295037326905349@1447353788598/Random-Forest-algorithm.png "Logo Title Text 1")


### Ok, then what's boosting?

- Boosting is an ensemble technique in which the predictors are not made independently, but sequentially.
- Subsequent predictors learn from the mistakes of the previous predictors.
- As a result, observations do not have an equal probability of appearing in subsequent models: the ones with the highest error appear most often.
- In other words, the observations are chosen based on the error rather than by the bootstrap process.
- The predictors can be chosen from a range of models, such as decision trees, regressors, classifiers, etc.
- Because new predictors learn from the mistakes committed by previous predictors, it takes fewer iterations (and less time) to get close to the actual values.
- But we have to choose the stopping criteria carefully, or boosting could overfit the training data.
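
To picture the error-based selection described above, here is a small sketch. The error values are hypothetical, and making the sampling probability directly proportional to the absolute error is just one possible weighting, assumed here for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical absolute errors of the current predictor on five observations.
abs_errors = np.array([5.0, 1.0, 40.0, 2.0, 2.0])

# Sampling probability proportional to error (one possible weighting, assumed here).
probs = abs_errors / abs_errors.sum()

# Draw rows for the next predictor with replacement, weighted by error.
# 1000 draws are used only to make the resulting frequencies easy to read.
next_sample = rng.choice(abs_errors.size, size=1000, replace=True, p=probs)
counts = np.bincount(next_sample, minlength=abs_errors.size)

print(probs)   # selection probability per observation
print(counts)  # how often each observation was drawn
```

Observation 2, which has the largest error, dominates the next training sample, so the next predictor spends most of its effort there.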

#### Gradient Boosting is a type of boosting algorithm!

![alt text](https://cdn-images-1.medium.com/max/1600/1*8T4HEjzHto_V8PrEFLkd9A.png "Logo Title Text 1")

![alt text](https://cdn-images-1.medium.com/max/1600/1*PaXJ8HCYE9r2MgiZ32TQ2A.png "Logo Title Text 1")


The objective of any supervised learning algorithm is to define a loss function and minimize it. Let’s see how the maths works out for the Gradient Boosting algorithm. Say we use mean squared error (MSE) as the loss, defined as:

![alt text](https://cdn-images-1.medium.com/max/1600/1*fHenn7NVqcWvw25D3-zRiQ.png "Logo Title Text 1")

We want predictions that make our loss function (MSE) as small as possible. By using gradient descent and updating our predictions based on a learning rate, we can find the values where MSE is at its minimum.

![alt text](https://cdn-images-1.medium.com/max/1600/1*LLbC4TstqzXQ3hzA8wCmeg.png "Logo Title Text 1")

So, we are basically updating the predictions such that our residuals shrink toward 0 and the predicted values are sufficiently close to the actual values.
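
The update can be tried out numerically. The sketch below assumes the loss is the mean of the squared residuals and folds the constant factor of its gradient into `learning_rate`. The target values are made up.

```python
import numpy as np

y = np.array([420.0, 900.0, 1500.0, 260.0])  # made-up actual durations (seconds)
y_pred = np.full_like(y, y.mean())            # start from a constant guess
learning_rate = 0.1

history = []
for step in range(50):
    residuals = y - y_pred
    history.append(float(np.mean(residuals ** 2)))  # MSE before this update
    # d(MSE)/d(y_pred) is proportional to -(y - y_pred), so stepping against
    # the gradient moves each prediction toward its target by a fraction
    # of its residual (the constant factor is folded into learning_rate).
    y_pred = y_pred + learning_rate * residuals

print(f"MSE at step 0:  {history[0]:.1f}")
print(f"MSE at step 49: {history[-1]:.4f}")
```

Every step shrinks each residual by the same fraction, so the MSE falls steadily toward zero. In gradient boosting the predictions are not updated directly like this. Instead, a new weak learner is fitted to approximate the residuals.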


#### The intuition behind the gradient boosting algorithm is to repeatedly leverage the patterns in the residuals to strengthen a model with weak predictions and make it better. Once we reach a stage where the residuals no longer have any pattern that could be modeled, we can stop modeling residuals (otherwise it might lead to overfitting). Algorithmically, we keep minimizing our loss function and stop at the point where the test loss reaches its minimum.


![alt text](https://qph.fs.quoracdn.net/main-qimg-dc06543fbfbcd10c58659a42cac16dc9 "Logo Title Text 1")
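
To tie the pieces together, here is a from-scratch sketch of that loop, not XGBoost's actual implementation. It assumes one-split regression trees ("stumps") as the weak learner, made-up sine data, and a simple `patience` rule on a held-out validation set as the stopping criterion. All names in it are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: one feature, noisy sine target, split into train and validation.
x = rng.uniform(0, 6, size=300)
y = np.sin(x) + rng.normal(0, 0.3, size=x.size)
x_train, y_train = x[:200], y[:200]
x_valid, y_valid = x[200:], y[200:]


def fit_stump(x, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is non-empty
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def mse(y_true, y_hat):
    return float(np.mean((y_true - y_hat) ** 2))


learning_rate = 0.1
patience = 10          # stop after this many rounds without a new validation best
base = y_train.mean()  # initial prediction: a constant
pred_train = np.full_like(y_train, base)
pred_valid = np.full_like(y_valid, base)

stumps, valid_history = [], []
best_valid, best_round = np.inf, 0
for round_no in range(500):
    residuals = y_train - pred_train       # what the ensemble still gets wrong
    stump = fit_stump(x_train, residuals)  # weak learner models the residuals
    stumps.append(stump)
    pred_train += learning_rate * predict_stump(stump, x_train)
    pred_valid += learning_rate * predict_stump(stump, x_valid)

    valid_history.append(mse(y_valid, pred_valid))
    if valid_history[-1] < best_valid:
        best_valid, best_round = valid_history[-1], round_no
    elif round_no - best_round >= patience:
        break

stumps = stumps[: best_round + 1]  # keep the model at the validation minimum
print(f"kept {len(stumps)} stumps, validation MSE {best_valid:.3f}")
```

The training loss keeps falling for as long as we add stumps, so the held-out loss decides when to stop.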