# Modeling Used In Kaggle
  
Time to bring everything together and build some models! In this last chapter, you will build a base model before tuning some hyperparameters and improving your results with ensembles. You will then get some final tips and tricks to help you compete more efficiently.

## Resources
  
**Notebook Syntax**
  
<span style='color:#7393B3'>NOTE:</span>  
- Denotes additional information deemed to be *contextually* important
- Colored in blue, HEX #7393B3
  
<span style='color:#E74C3C'>WARNING:</span>  
- Significant information that is *functionally* critical  
- Colored in red, HEX #E74C3C
  
---
  
**Links**
  
[NumPy Documentation](https://numpy.org/doc/stable/user/index.html#user)  
[Pandas Documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)  
[Matplotlib Documentation](https://matplotlib.org/stable/index.html)  
[Seaborn Documentation](https://seaborn.pydata.org)  
[Scikit-Learn Documentation](https://scikit-learn.org/stable/)  
  
---
  
**Notable Functions**
  
<table>
  <tr>
    <th>Index</th>
    <th>Operator</th>
    <th>Use</th>
  </tr>
  <tr>
    <td>1</td>
    <td>sklearn.model_selection.train_test_split</td>
    <td>Split dataset into train and test sets for machine learning.</td>
  </tr>
  <tr>
    <td>2</td>
    <td>sklearn.metrics.mean_squared_error</td>
    <td>Calculate the mean squared error between true and predicted values.</td>
  </tr>
  <tr>
    <td>3</td>
    <td>math.sqrt</td>
    <td>Compute the square root of a number.</td>
  </tr>
  <tr>
    <td>4</td>
    <td>sklearn.ensemble.RandomForestRegressor</td>
    <td>Create a Random Forest Regressor model.</td>
  </tr>
  <tr>
    <td>5</td>
    <td>sklearn.model_selection.KFold</td>
    <td>Split dataset into K consecutive folds for cross-validation.</td>
  </tr>
  <tr>
    <td>6</td>
    <td>sklearn.ensemble.GradientBoostingRegressor</td>
    <td>Create a Gradient Boosting Regressor model.</td>
  </tr>
  <tr>
    <td>7</td>
    <td>itertools.product</td>
    <td>Generate cartesian product of input iterables.</td>
  </tr>
  <tr>
    <td>8</td>
    <td>sklearn.linear_model.LinearRegression</td>
    <td>Create a Linear Regression model.</td>
  </tr>
</table>
  
---
  
**Language and Library Information**  
  
Python 3.11.0  
  
Name: numpy  
Version: 1.24.3  
Summary: Fundamental package for array computing in Python  
  
Name: pandas  
Version: 2.0.3  
Summary: Powerful data structures for data analysis, time series, and statistics  
  
Name: matplotlib  
Version: 3.7.2  
Summary: Python plotting package  
  
Name: seaborn  
Version: 0.12.2  
Summary: Statistical data visualization  
  
Name: scikit-learn  
Version: 1.3.0  
Summary: A set of python modules for machine learning and data mining  
  
---
  
**Miscellaneous Notes**
  
<span style='color:#7393B3'>NOTE:</span>  
  
`python3.11 -m IPython` : Runs the IPython interactive shell under Python 3.11 in the terminal.
  
`nohup ./relo_csv_D2S.sh > ./output/relo_csv_D2S.log &` : Runs csv data pipeline in headless log.  
  
`print(inspect.getsourcelines(test))` : Prints the source lines of a self-defined function `test` (requires `import inspect`)  
  
<span style='color:#7393B3'>NOTE:</span>  
  
Snippet to plot all built-in matplotlib styles :
  
```python
import numpy as np
import matplotlib.pyplot as plt

# Sample cubic curve to draw in every style
x = np.arange(-2, 8, .1)
y = 0.1 * x ** 3 - x ** 2 + 3 * x + 2

fig = plt.figure(dpi=100, figsize=(10, 20), tight_layout=True)
available = ['default'] + plt.style.available

# One subplot per style, laid out on a 10 x 3 grid
for i, style in enumerate(available):
    with plt.style.context(style):
        ax = fig.add_subplot(10, 3, i + 1)
        ax.plot(x, y)
    ax.set_title(style)
```
  

```python
import numpy as np                  # Numerical Python:         Arrays and linear algebra
import pandas as pd                 # Panel Datasets:           Dataset manipulation
import matplotlib.pyplot as plt     # MATLAB Plotting Library:  Visualizations
import seaborn as sns               # Seaborn:                  Visualizations

# Setting a standard figure size
plt.rcParams['figure.figsize'] = (8, 8)

# Setting a standard style
plt.style.use('ggplot')

# Set the maximum number of columns to be displayed
pd.set_option('display.max_columns', 50)
```

## Baseline model
  
After going through the initial steps in the competition and feature engineering, it's time to train some machine learning models.
  
**Modeling stage**
  
Recall the modeling stage we've introduced in the previous chapter. We've already covered some data preprocessing techniques, like missing data imputation and categorical encoding, as well as creating new features. In this chapter, we will talk about model creation and some additional tricks to apply.
  
<center><img src='../_images/baseline-model-kaggle.png' alt='img' width='740'></center>
  
To start this loop, we should establish the baseline model. It's usually a very simple model that allows us to check the whole pipeline we've written, review the local validation process, and generate the first submissions for the test data.
  
<center><img src='../_images/baseline-model-kaggle1.png' alt='img' width='740'></center>
  
**New York City taxi validation**
  
Let’s again work with New York City Taxi competition data. We need to predict the fare amount for a taxi ride in New York City. The competition metric is root mean squared error. For the sake of simplicity, we will use the 30% holdout sample as a local validation set. So, we create a simple holdout split using the `train_test_split()` function from scikit-learn.
  
<center><img src='../_images/baseline-model-kaggle2.png' alt='img' width='740'></center>
  
**Baseline model I**
  
The simplest model is to assign the average fare value to all the test observations. For this purpose, we take the mean of the 'fare_amount' column over the whole train set and just assign this number to all the observations in the test set. Then, we select the id and fare_amount columns and write the predictions to the submission file. Such an approach gives about 10 dollars RMSE in both Local Validation and Public Leaderboard. Also, it achieves the 1449th position on the Leaderboard out of 1500 participants.
  
<center><img src='../_images/baseline-model-kaggle3.png' alt='img' width='740'></center>
  
**Baseline model II**
  
We could make the model a bit more complex by taking the mean grouped by the number of passengers. The idea is the same: assign the average fare amount to the whole group. First, create a group object based on the train data. Then make predictions on the test set using `pandas`' `map()` method, mapping each passenger count to the corresponding average fare amount. Then, again, write the predictions to the file. Such a model achieves slightly better results, with a 30-place improvement on the Public Leaderboard.
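The grouping baseline described above can be sketched as follows. This is a minimal illustration on a tiny made-up DataFrame, not the competition data; the name `naive_prediction_groups` and the toy values are assumptions chosen for the example.

```python
import pandas as pd

# Tiny made-up stand-ins for the competition train and test data
train = pd.DataFrame({
    'passenger_count': [1, 1, 2, 2, 3],
    'fare_amount': [10.0, 12.0, 8.0, 10.0, 20.0],
})
test = pd.DataFrame({'id': [0, 1, 2], 'passenger_count': [2, 1, 3]})

# Mean fare for each passenger count, computed on the train data only
naive_prediction_groups = train.groupby('passenger_count')['fare_amount'].mean()

# Map every test ride to the mean fare of its passenger-count group
test['fare_amount'] = test['passenger_count'].map(naive_prediction_groups)

print(test[['id', 'fare_amount']])
```

A passenger count that appears in the test data but not in the train data would map to `NaN`, so in practice one might fill such gaps with the overall train mean.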
  
<center><img src='../_images/baseline-model-kaggle4.png' alt='img' width='740'></center>
  
**Baseline model III**
  
Finally, we could apply an out-of-the-box sklearn Gradient Boosting model to all the numeric features available. We use only these features in order to skip preprocessing and keep the baseline model simple. The features include latitudes and longitudes together with the number of passengers. We then fit the `GradientBoostingRegressor` on the train data and make predictions on the test data.
  
<center><img src='../_images/baseline-model-kaggle5.png' alt='img' width='740'></center>
  
We write predictions to the file and submit to Kaggle. And here are the results. Wow! It is a huge jump: we advanced 300 positions on the Public Leaderboard dropping the errors to about 5-6 dollars.
  
<center><img src='../_images/baseline-model-kaggle6.png' alt='img' width='740'></center>
  
**Intermediate results**
  
Now we have three simple submissions with local and Public Leaderboard scores. Let's take a look at them. One can easily see that the local score correlates with the Public Leaderboard score (the correlation is not perfect, but the better the local score, the better the Public Leaderboard score). It is a good sign, and we can proceed with our naive validation strategy.
  
<center><img src='../_images/baseline-model-kaggle7.png' alt='img' width='740'></center>
  
**Correlation with Public Leaderboard**
  
Generally, the ideal situation is to observe such correlation between local validation and Public Leaderboard scores. The values should not be absolutely the same, but if the local score is improving, then we want to see improvements on the Leaderboard. Let's compare the results of two different validation strategies. Results of the first validation strategy are presented in the table on the left. We see some improvements in the validation error with different models, but no improvements on the Public Leaderboard. It's a sign that something could be wrong with our models or validation scheme. Now, look at another validation strategy on the right. With the improvement in the validation error, the Public Leaderboard error is also improving. So, this strategy is more reliable.
  
<center><img src='../_images/baseline-model-kaggle8.png' alt='img' width='740'></center>
  
**Let's practice!**
  
All right, it's time to create a couple of baseline models of your own!

### Replicate validation score
  
You've seen both validation and Public Leaderboard scores in the video. However, the code examples are available only for the test data. To get the validation scores you have to repeat the same process on the holdout set.
  
Throughout this chapter, you will work with New York City Taxi competition data. The problem is to predict the fare amount for a taxi ride in New York City. The competition metric is the root mean squared error.
  
The first goal is to evaluate the Baseline model on the validation data. You will replicate the simplest Baseline based on the mean of `"fare_amount"`. Recall that as a validation strategy we used a 30% holdout split with `validation_train` as train and `validation_test` as holdout DataFrames. Both of them are available in your workspace.
  
---
  
1. Calculate the mean of `"fare_amount"` over the whole `validation_train` DataFrame.
2. Assign this naive prediction value to all the holdout predictions. Store them in the `"pred"` column.

```python
from sklearn.model_selection import train_test_split

train = pd.read_csv('../_datasets/taxi_train_chapter_4.csv')
test = pd.read_csv('../_datasets/taxi_test_chapter_4.csv')

validation_train, validation_test = train_test_split(train, test_size=0.3)
```

```python
from sklearn.metrics import mean_squared_error
from math import sqrt

# Calculate the mean fare_amount on the validation_train data
naive_prediction = np.mean(validation_train['fare_amount'])

# Assign naive prediction to all the holdout observations
validation_test = validation_test.copy()
validation_test['pred'] = naive_prediction

# Measure the local RMSE
rmse = sqrt(mean_squared_error(validation_test['fare_amount'], validation_test['pred']))
print('Validation RMSE for Baseline I model: {:.3f}'.format(rmse))
```

```
Validation RMSE for Baseline I model: 9.692
```


It's exactly the same number you've seen in the slides, well done! So, to avoid overfitting you should fully replicate your models using the validation data. Go forward to create a couple of other baselines!

### Baseline based on the date
  
We've already built 3 different baseline models. To get more practice, let's build a couple more. The first model is based on the grouping variables. It's clear that the ride fare could depend on the part of the day. For example, prices could be higher during the rush hours.
  
Your goal is to build a baseline model that will assign the average `"fare_amount"` for the corresponding hour. For now, you will create the model for the whole `train` data and make predictions for the `test` dataset.
  
The `train` and `test` DataFrames are available in your workspace. Moreover, the `"pickup_datetime"` column in both DataFrames is already converted to a `datetime` object for you.
  
---
  
1. Get the hour from the `"pickup_datetime"` column for the `train` and `test` DataFrames.
2. Calculate the mean `"fare_amount"` for each hour on the `train` data.
3. Make `test` predictions using `pandas`' `map()` method and the grouping obtained.
4. Write predictions to the file.

```python
train['pickup_datetime'] = pd.to_datetime(train['pickup_datetime'])
test['pickup_datetime'] = pd.to_datetime(test['pickup_datetime'])

# Get pickup hour from the pickup_datetime column
train['hour'] = train['pickup_datetime'].dt.hour
test['hour'] = test['pickup_datetime'].dt.hour

# Calculate average fare_amount grouped by pickup hour
hour_groups = train.groupby('hour')['fare_amount'].mean()

# Make predictions on the test set
test['fare_amount'] = test.hour.map(hour_groups)

# Write predictions
test[['id', 'fare_amount']].to_csv('../_output/hour_mean_sub.csv', index=False)
```

```python
!head ../_output/hour_mean_sub.csv
```

```
id,fare_amount
0,11.199879638916752
1,11.199879638916752
2,11.241585365853659
3,10.96488908606921
4,10.96488908606921
5,10.96488908606921
6,11.09468875502008
7,11.09468875502008
8,11.09468875502008
```


Great! Such a baseline achieves 1409th place on the Public Leaderboard, which is slightly better than grouping by the number of passengers. Also, remember to replicate all the results for the validation set, as was done in the previous exercise.

### Baseline based on the gradient boosting
  
Let's build a final baseline based on the Random Forest. You've seen a huge score improvement moving from the grouping baseline to the Gradient Boosting in the video. Now, you will use `sklearn`'s Random Forest to further improve this score.
  
The goal of this exercise is to take numeric features and train a Random Forest model without any tuning. After that, you could make test predictions and validate the result on the Public Leaderboard. Note that you've already got an `"hour"` feature which could also be used as an input to the model.
  
---
  
1. Add the `"hour"` feature to the list of numeric features.
2. Fit the `RandomForestRegressor` on the train data with numeric features and `"fare_amount"` as a target.
3. Use the trained Random Forest model to make predictions on the test data.

```python
from sklearn.ensemble import RandomForestRegressor

# Select only numeric features
features = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
            'dropoff_latitude', 'passenger_count', 'hour']

# Train a Random Forest model
rf = RandomForestRegressor()
rf.fit(train[features], train.fare_amount)

# Make predictions on the test data
test['fare_amount'] = rf.predict(test[features])

# Write predictions
test[['id', 'fare_amount']].to_csv('../_output/rf_sub.csv', index=False)
```

```python
!head ../_output/rf_sub.csv
```

```
id,fare_amount
0,9.418
1,8.581000000000003
2,5.3309999999999995
3,8.769
4,13.541000000000004
5,8.285000000000004
6,5.632999999999998
7,54.8275
8,11.686999999999998
```


Congratulations! This final baseline achieves the 1051st place on the Public Leaderboard which is slightly better than the Gradient Boosting from the video. So, now you know how to build fast and simple baseline models to validate your initial pipeline.

## Hyperparameter tuning
  
We've covered the process of creating baseline models. Now, let's talk about the subsequent steps and hyperparameter optimization.
  
**Iterations**
  
Here is a table with the results of our three baseline models for the Taxi Fare Prediction competition. Once we have a baseline and observe some correlation with the Leaderboard, we start creating new features. For example, we could add the hour of the ride to our model. It improves the results and we advance 40 positions on the Leaderboard. Adding distance feature improves our rank by another 60 positions. And this process is endless. We keep creating new features trying to improve our local validation and Public Leaderboard scores. However, it's impossible to check every new feature or small change on the Leaderboard, because the number of submissions to Kaggle is limited, usually to 5 attempts per day.
  
<center><img src='../_images/hyper-parameter-tuning-kaggle.png' alt='img' width='740'></center>
  
That is why we generally make submissions only after a couple of changes, just to check that our local validation score moves in the same direction as the Public Leaderboard score. For example, we could make one submission after the gradient boosting baseline and the next only after two new features have been created.
  
<center><img src='../_images/hyper-parameter-tuning-kaggle1.png' alt='img' width='740'></center>
  
**Hyperparameter optimization**
  
Feature engineering is the main source of score improvement in classic Machine Learning competitions (with tabular or time series data). Once we are out of ideas for feature engineering, we move on to hyperparameter optimization. It means that we try to find a set of model parameters that further improves the validation score. On the other hand, in Deep Learning competitions with text or image data, there is no need for feature engineering. Neural nets generate features on their own, while we need to specify the architecture and a list of hyperparameters. So, generally speaking, in Deep Learning competitions we only have to optimize the hyperparameters.
  
<center><img src='../_images/hyper-parameter-tuning-kaggle2.png' alt='img' width='740'></center>
  
**Ridge regression**
  
The simplest hyperparameter example can be found in Ridge regression. In basic least squares linear regression we minimize the residual sum of squares, where $y_i$ are the true values and $\hat{y}_i$ are the model predictions.
  
$\Large\text{Loss} = \sum_{i=1}^N (y_i - \hat{y}_i)^2 \rightarrow \text{min}$
  
Ridge regression adds a regularization term to the loss function: the sum of squares of the model coefficients $w_j$. In other words, we add a penalty on the size of the coefficients. The strength of this penalty is controlled by a hyperparameter alpha. It is not estimated by the model; instead, we have to specify it manually from outside the model, which is why it is called a "hyperparameter". Hyperparameter tuning helps us find the best alpha value automatically.
  
$\Large \text{Loss} = \sum_{i=1}^N (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^K w_j^2 \rightarrow \text{min}$
  
- alpha is a hyperparameter
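To make the role of alpha concrete, here is a small sketch that fits `sklearn.linear_model.Ridge` with three alpha values and evaluates the penalized loss above. The synthetic data from `make_regression`, the helper `ridge_loss()`, and the chosen alpha values are assumptions made for illustration only. Note that scikit-learn's `Ridge` does not penalize the intercept, which matches the formula because the sum runs over the coefficients only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=123)

def ridge_loss(model, X, y, alpha):
    """Residual sum of squares plus alpha times the sum of squared coefficients."""
    residuals = y - model.predict(X)
    return np.sum(residuals ** 2) + alpha * np.sum(model.coef_ ** 2)

coef_norms = {}
for alpha in [0.1, 10.0, 1000.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    # A larger alpha shrinks the coefficients towards zero
    coef_norms[alpha] = np.sum(model.coef_ ** 2)
    print(alpha, round(coef_norms[alpha], 2), round(ridge_loss(model, X, y, alpha), 2))
```

The printed sum of squared coefficients gets smaller as alpha grows, which is the penalty at work.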
  
**Hyperparameter optimization strategies**
  
There are a number of different strategies to find the best set of hyperparameters. 
  
The most popular include: 
  
- Grid Search: We select a discrete grid of possible hyperparameter values and loop through each possible combination. 
  
- Random Search: In this case, we just specify the search range for each parameter, for example, 'from' and 'to' values. On each iteration, we sample hyperparameter values at random from this range.
    
- Bayesian optimization: We also need to specify the search space for the hyperparameters. The difference from random or grid search is that Bayesian optimization uses past evaluation results to choose the next hyperparameter values to evaluate.
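As a rough illustration of the random search strategy from the list above, the sketch below samples the Ridge alpha at random from a range instead of walking a fixed grid. The synthetic data, the log-uniform range from 1e-3 to 1e3, and the budget of 10 iterations are all assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=123)
rng = np.random.default_rng(123)

results = {}
for _ in range(10):
    # Sample alpha log-uniformly between 1e-3 and 1e3
    alpha = 10 ** rng.uniform(-3, 3)

    # 3-fold cross-validated RMSE (scikit-learn reports it negated)
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=3,
                             scoring='neg_root_mean_squared_error')
    results[alpha] = -scores.mean()

# The candidate with the lowest RMSE wins
best_alpha = min(results, key=results.get)
print(best_alpha, results[best_alpha])
```

Sampling on a log scale is a common choice for regularization strengths, because useful values often span several orders of magnitude.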
  
<center><img src='../_images/hyper-parameter-tuning-kaggle3.png' alt='img' width='740'></center>
  
**Grid search**
  
In this course we'll cover only grid search. For instance, let's tune the alpha value for Ridge regression using grid search. First, we create a grid of possible alpha values. Then, we create a results dictionary to store the scores. For each value in the grid, we create a ridge regression model with that alpha value, calculate its validation score, and store the result in the dictionary. Once we've looped through all the grid values, we can select the alpha value that achieves the best validation score. It is our optimal hyperparameter value.
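The loop just described might look like the following sketch. It uses synthetic data and a simple 30% holdout in place of the competition data; the grid values and variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data with a 30% holdout (illustrative only)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=123)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=123)

# Grid of candidate alpha values and a dictionary for the scores
alpha_grid = [0.01, 0.1, 1, 10]
results = {}

for candidate_alpha in alpha_grid:
    # Fit a ridge regression with this particular alpha
    ridge = Ridge(alpha=candidate_alpha).fit(X_train, y_train)

    # Validation RMSE for this alpha
    pred = ridge.predict(X_val)
    results[candidate_alpha] = np.sqrt(mean_squared_error(y_val, pred))

# Select the alpha with the lowest validation RMSE
best_alpha = min(results, key=results.get)
print(results, best_alpha)
```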
  
<center><img src='../_images/hyper-parameter-tuning-kaggle4.png' alt='img' width='740'></center>
  
**Let's practice!**
  
All right, now it's your turn to tune hyperparameters!

### Grid search
  
Recall that we've created a baseline Gradient Boosting model in the previous lesson. Your goal now is to find the best `max_depth=` hyperparameter value for this Gradient Boosting model. This hyperparameter limits the number of nodes in each individual tree. You will be using K-fold cross-validation to measure the local performance of the model for each hyperparameter value.
  
You're given a function `get_cv_score()`, which takes the train dataset and a dictionary of model parameters as arguments and returns a validation score over 3-fold cross-validation: the mean fold RMSE plus the standard deviation of the fold RMSEs.
  
---
  
1. Specify the grid for possible `max_depth=` values with 3, 6, 9, 12 and 15.
2. Pass each hyperparameter candidate in the grid to the model `params` dictionary.

```python
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor

def get_cv_score(train, params):
    # Create KFold object
    kf = KFold(n_splits=3, shuffle=True, random_state=123)

    rmse_scores = []

    # Loop through each split
    for train_index, test_index in kf.split(train):
        cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]

        # Train a Gradient Boosting model
        gb = GradientBoostingRegressor(random_state=123, **params).fit(cv_train[features], cv_train.fare_amount)

        # Make predictions on the test data
        pred = gb.predict(cv_test[features])

        fold_score = np.sqrt(mean_squared_error(cv_test['fare_amount'], pred))
        rmse_scores.append(fold_score)

    return np.round(np.mean(rmse_scores) + np.std(rmse_scores), 5)
```

```python
# Possible max depth values
max_depth_grid = [3, 6, 9, 12, 15]
results = {}

# For each value in the grid
for max_depth_candidate in max_depth_grid:
    # Specify parameters for the model
    params = {'max_depth': max_depth_candidate}

    # Calculate validation score for a particular hyperparameter
    validation_score = get_cv_score(train, params)

    # Save the results for each max depth value
    results[max_depth_candidate] = validation_score
print(results)
```

```
{3: 5.67296, 6: 5.36925, 9: 5.35641, 12: 5.49932, 15: 5.70246}
```


Nice! We have a validation score for each value in the grid. The lowest score is at a max depth of 9, with 6 very close behind, so the optimal value is located somewhere between 6 and 12. The next step could be to use a finer grid over that range, for example [6, 7, 8, 9, 10, 11, 12], and repeat the same process. Moving from coarser to finer grids lets us home in on better values. Keep going to try optimizing 2 hyperparameters simultaneously!

### 2D grid search
  
The drawback of tuning each hyperparameter independently is a potential dependency between different hyperparameters. The better approach is to try all the possible hyperparameter combinations. However, in such cases, the grid search space is rapidly expanding. For example, if we have 2 parameters with 10 possible values, it will yield 100 experiment runs.
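The growth of the search space is easy to check with `itertools.product()`. In this small sketch the grids `grid_a`, `grid_b`, and `grid_c` are hypothetical placeholders with 10 values each.

```python
import itertools

# Two hypothetical hyperparameters with 10 candidate values each
grid_a = list(range(1, 11))
grid_b = [x / 10 for x in range(1, 11)]

combinations = list(itertools.product(grid_a, grid_b))
print(len(combinations))  # 100

# A third hyperparameter with 10 values multiplies the count again
grid_c = list(range(10))
triples = list(itertools.product(grid_a, grid_b, grid_c))
print(len(triples))  # 1000
```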
  
Your goal is to find the best hyperparameter couple of `max_depth=` and `subsample` for the Gradient Boosting model. `subsample` is a fraction of observations to be used for fitting the individual trees.
  
You're given a function `get_cv_score()`, which takes the train dataset and a dictionary of model parameters as arguments and returns a validation score over 3-fold cross-validation: the mean fold RMSE plus the standard deviation of the fold RMSEs.
  
---
  
1. Specify the grids for possible `max_depth=` and `subsample` values. For `max_depth=`: 3, 5 and 7. For `subsample`: 0.8, 0.9 and 1.0.
2. Apply the `product()` function from the `itertools` package to the hyperparameter grids. It returns all possible combinations for these two grids.
3. Pass each hyperparameters candidate couple to the model `params` dictionary.

```python
import itertools

# Hyperparameter grids
max_depth_grid = [3, 5, 7]
subsample_grid = [0.8, 0.9, 1.0]
results = {}

# For each couple in the grid
for max_depth_candidate, subsample_candidate in itertools.product(max_depth_grid, subsample_grid):
    params = {'max_depth': max_depth_candidate,
              'subsample': subsample_candidate}
    validation_score = get_cv_score(train, params)
    # Save the results for each couple
    results[(max_depth_candidate, subsample_candidate)] = validation_score
print(results)
```

```
{(3, 0.8): 5.65813, (3, 0.9): 5.65228, (3, 1.0): 5.67296, (5, 0.8): 5.34947, (5, 0.9): 5.44506, (5, 1.0): 5.3132, (7, 0.8): 5.38994, (7, 0.9): 5.40631, (7, 1.0): 5.3591}
```


Great! You can see that tuning multiple hyperparameters simultaneously achieves better results. In the previous exercise, tuning only the `max_depth=` parameter gave a best validation score of 5.35641 (with `max_depth=` equal to 9). With `max_depth=` equal to 5 and `subsample` equal to 1.0, the best validation score is now 5.3132. However, do not spend too much time on hyperparameter tuning at the beginning of the competition! Another approach that almost always improves your solution is model ensembling. Go on for it!

## Model ensembling
  
So far, we've been talking only about individual models. Now it's time to combine multiple models together.
  
**Model ensembling**
  
Top Kaggle solutions are usually not a single model, but a combination of a large number of various models. The different ways of combining models are called model ensembling. For example, here is an ensemble design for a winning solution in the Homesite Quote Conversion challenge. We can see hundreds of models with multi-level stacking and blending. Let's learn about these 'blending' and 'stacking' terms.
  
<center><img src='../_images/model-ensembling-kaggle.png' alt='img' width='740'></center>
  
**Model blending**
  
The idea of ensemble learning is to build a prediction model by combining the strengths of a collection of simpler base models. The so-called blending approach is to simply average the predictions of multiple models. Say we're solving a regression problem with a continuous target variable, and we have trained two models: A and B. So, for each test observation, there are model A and model B predictions available.
  
<center><img src='../_images/model-ensembling-kaggle1.png' alt='img' width='740'></center>
  
To combine the models we can simply take the mean of the predictions: sum them and divide by two. As we see, this lets us adjust the predictions and take both model A's and model B's opinions into account. That's it: such a simple ensembling method will, in the majority of cases, yield some improvement over our single models.
  
<center><img src='../_images/model-ensembling-kaggle2.png' alt='img' width='740'></center>
  
Arithmetic mean works for both regression and classification problems. However, for the classification, it's better to use a geometric mean of the class probabilities predicted.
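The two kinds of mean can be compared on a few made-up probabilities. In this sketch `model_a_proba` and `model_b_proba` are hypothetical positive-class probabilities from two classifiers, not outputs of any model in this chapter.

```python
import numpy as np

# Made-up positive-class probabilities from two hypothetical classifiers
model_a_proba = np.array([0.9, 0.2, 0.6])
model_b_proba = np.array([0.7, 0.4, 0.1])

# Arithmetic mean: sum the predictions and divide by the number of models
arithmetic_blend = (model_a_proba + model_b_proba) / 2

# Geometric mean: n-th root of the product (here n = 2)
geometric_blend = np.sqrt(model_a_proba * model_b_proba)

print(arithmetic_blend)
print(geometric_blend)
```

The geometric mean is never larger than the arithmetic mean, and it is pulled down more strongly when one model assigns a low probability. With more than two classes, geometric means of the class probabilities no longer sum to one, so they may need to be renormalized.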
  
<center><img src='../_images/model-ensembling-kaggle3.png' alt='img' width='740'></center>
  
**Model stacking**
  
A more advanced ensembling approach is called model stacking. The idea is to train multiple single models, take their predictions, and use these predictions as features in a 2nd level model. So, we need to perform the following steps. Split the train data into two parts: Part 1 and Part 2. Train multiple single models on the first part. Make predictions on the second part of the train data and on the test data. Now we have model predictions for both Part 2 of the train data and the test data, which means we can create a new model using these predictions as features. This model is called the 2nd level model or meta-model. Its predictions on the test data give us the stacking output.
  
1. Split train data into two parts
2. Train multiple models on Part 1
3. Make predictions on Part 2
4. Make predictions on the test data
5. Train a new model on Part 2 using predictions as features
6. Make predictions on the test data using the 2nd level model
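The six steps above can be sketched end to end on synthetic data. This is only an illustration under assumptions: the data comes from `make_regression`, the base models are a Gradient Boosting and a Random Forest regressor, and the 2nd level model is a plain Linear Regression.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the train and test data (illustrative only)
X, y = make_regression(n_samples=400, n_features=4, noise=5.0, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)

# 1. Split train data into two parts
X_part_1, X_part_2, y_part_1, y_part_2 = train_test_split(
    X_train, y_train, test_size=0.5, random_state=123)

# 2. Train multiple models on Part 1
gb = GradientBoostingRegressor(random_state=123).fit(X_part_1, y_part_1)
rf = RandomForestRegressor(n_estimators=50, random_state=123).fit(X_part_1, y_part_1)

# 3. Make predictions on Part 2
part_2_preds = np.column_stack([gb.predict(X_part_2), rf.predict(X_part_2)])

# 4. Make predictions on the test data
test_preds = np.column_stack([gb.predict(X_test), rf.predict(X_test)])

# 5. Train a new model on Part 2 using the predictions as features
meta_model = LinearRegression().fit(part_2_preds, y_part_2)

# 6. Make predictions on the test data using the 2nd level model
stacking_pred = meta_model.predict(test_preds)

# The coefficients show how the meta-model weights each base model
print(meta_model.coef_)
```

The base models never see Part 2 during training, so their Part 2 predictions behave like predictions on new data, which is what the 2nd level model will receive at test time.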
  
**Stacking example**
  
Let's walk through all these steps with an example. Suppose we are given a binary classification problem with a bunch of numerical features: feature_1, feature_2 and so on to feature_N. For the train data, the target variable is known. And we need to make predictions on the test data, where the target variable is unknown.
  
<center><img src='../_images/model-ensembling-kaggle5.png' alt='img' width='740'></center>
  
First of all, we split train data into two separate parts: Part 1 and Part 2.
  
<center><img src='../_images/model-ensembling-kaggle6.png' alt='img' width='740'></center>
  
Then we train multiple single models only on the first part of the train data. For example, we've trained three different models denoting them as A, B and C.
  
<center><img src='../_images/model-ensembling-kaggle7.png' alt='img' width='740'></center>
  
Having these 3 models we make the predictions on part 2 of the train data. The columns with the predictions are denoted as A_pred, B_pred and C_pred. Then make the predictions on the test data as well.
  
<center><img src='../_images/model-ensembling-kaggle8.png' alt='img' width='740'></center>
  
So, now we have model predictions for both Part 2 of the train data and for the test data.
  
It's now possible to create a second level model using these predictions as features. It's trained on the Part 2 train data and is used to make predictions on the test data.
  
<center><img src='../_images/model-ensembling-kaggle9.png' alt='img' width='740'></center>
  
As a result, we obtain stacking predictions for the test data. Thus, we combined individual model predictions into a single number using a 2nd level model.
  
<center><img src='../_images/model-ensembling-kaggle10.png' alt='img' width='740'></center>
  
**Let's practice!**
  
OK, having learned the theory, move on to build your own model ensembles!

### Model blending
  
You will start creating model ensembles with a blending technique.
  
Your goal is to train 2 different models on the New York City Taxi competition data. Make predictions on the test data and then blend them using a simple arithmetic mean.
  
The `train` and `test` DataFrames are already available in your workspace. `features` is a list of columns to be used for training and it is also available in your workspace. The target variable name is "fare_amount".
  
---
  
1. Train a Gradient Boosting model on the `train` data using `features` list, and the "fare_amount" column as a target variable.
2. Train a Random Forest model in the same manner.
3. Make predictions on the `test` data using both Gradient Boosting and Random Forest models.
4. Find the average of both models' predictions.

```python
train = pd.read_csv('../_datasets/taxi_train_distance.csv')
test = pd.read_csv('../_datasets/taxi_test_distance.csv')

features = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
            'passenger_count', 'distance_km', 'hour']
```

```python
# Train a Gradient Boosting model
gb = GradientBoostingRegressor().fit(train[features], train.fare_amount)

# Train a Random Forest model
rf = RandomForestRegressor().fit(train[features], train.fare_amount)

# Make predictions on the test data
test['gb_pred'] = gb.predict(test[features])
test['rf_pred'] = rf.predict(test[features])

# Find mean of model predictions
test['blend'] = (test['gb_pred'] + test['rf_pred']) / 2
test[['gb_pred', 'rf_pred', 'blend']].head(3)
```

```
    gb_pred  rf_pred     blend
0  9.661374    9.732  9.696687
1  9.304288    8.011  8.657644
2   5.79514    4.615   5.20507
```


That was not too hard! Blending allows you to get additional score improvements almost for free, just by averaging the predictions of multiple models. Now, let's explore model stacking!

### Model stacking I
  
Now it's time for stacking. To implement the stacking approach, you will follow the 6 steps we've discussed in the previous video:
  
1. Split train data into two parts
2. Train multiple models on Part 1
3. Make predictions on Part 2
4. Make predictions on the test data
5. Train a new model on Part 2 using predictions as features
6. Make predictions on the test data using the 2nd level model
  
The `train` and `test` DataFrames are already available in your workspace. `features` is a list of columns to be used for training on the Part 1 data and it is also available in your workspace. The target variable name is "fare_amount".
  
---
  
1. Split the `train` DataFrame into two equal parts: `part_1` and `part_2`. Use the `train_test_split()` function with `test_size=` equal to 0.5.
2. Train Gradient Boosting and Random Forest models on the `part_1` data.
3. Make Gradient Boosting and Random Forest predictions on the `part_2` data.
4. Make Gradient Boosting and Random Forest predictions on the `test` data.

In [15]:
# Split train data into two parts
part_1, part_2 = train_test_split(train, test_size=0.5, random_state=123)

# Train a Gradient Boosting model on Part 1
gb = GradientBoostingRegressor().fit(part_1[features], part_1.fare_amount)

# Train a Random Forest model on Part 1
rf = RandomForestRegressor().fit(part_1[features], part_1.fare_amount)

# Make predictions on the Part 2 data
part_2 = part_2.copy()
part_2['gb_pred'] = gb.predict(part_2[features])
part_2['rf_pred'] = rf.predict(part_2[features])

# Make predictions on the test data
test = test.copy()
test['gb_pred'] = gb.predict(test[features])
test['rf_pred'] = rf.predict(test[features])

Great! You've covered 4 out of 6 steps to create a stacking ensemble. The only steps left are to train a new model on the Part 2 data using predictions as features and to apply it to the test data. Go on to implement it!

### Model stacking II
  
OK, what you've done so far in the stacking implementation:
  
1. Split train data into two parts
2. Train multiple models on Part 1
3. Make predictions on Part 2
4. Make predictions on the test data
  
Now, your goal is to create a second level model using predictions from steps 3 and 4 as features. So, this model is trained on Part 2 data and then you can make stacking predictions on the test data.
  
`part_2` and `test` DataFrames are already available in your workspace. Gradient Boosting and Random Forest predictions are stored in these DataFrames under the names "gb_pred" and "rf_pred", respectively.
  
---
  
1. Train a Linear Regression model on the Part 2 data using the Gradient Boosting and Random Forest models' predictions as features.
2. Make predictions on the `test` data using the Gradient Boosting and Random Forest models' predictions as features.

In [16]:
from sklearn.linear_model import LinearRegression

# Create linear regression model without the intercept
lr = LinearRegression(fit_intercept=False)

# Train 2nd level model on the Part 2 data
lr.fit(part_2[['gb_pred', 'rf_pred']], part_2.fare_amount)

# Make stacking predictions on the test data
test['stacking'] = lr.predict(test[['gb_pred', 'rf_pred']])

# Look at the model coefficients
print(lr.coef_)

[0.0977768  0.90005824]


Congratulations, now your toolbox contains ensembling techniques! Usually, the 2nd level model is a simple one such as Linear or Logistic Regression. Also, note that the Linear Regression was fit without an intercept so that it combines only the pure model predictions. The coefficients follow the feature order `['gb_pred', 'rf_pred']`, so in this run the 2nd level model puts far more weight on the Random Forest: about 0.90 versus about 0.10 for the Gradient Boosting model. Because the base models were trained without a fixed `random_state`, the exact values may vary between runs. Now, move forward to the last lesson in order to learn some final tips and tricks!
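To see what the 2nd level model is doing, note that a linear regression without an intercept is ordinary least squares on the two prediction columns, so the stacked prediction is a weighted sum of the base predictions. The following is a minimal sketch on synthetic data, not the taxi dataset: the noise levels, sample size, and seed are assumptions made for illustration, and `np.linalg.lstsq` stands in for `LinearRegression(fit_intercept=False)`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for Part 2: a target and two base-model predictions
y = rng.normal(loc=10.0, scale=3.0, size=200)
gb_pred = y + rng.normal(scale=2.0, size=200)  # noisier base model (assumed)
rf_pred = y + rng.normal(scale=0.5, size=200)  # more accurate base model (assumed)

# Least squares without an intercept: minimize ||X @ w - y||^2
X = np.column_stack([gb_pred, rf_pred])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# The stacked prediction is a weighted sum of the base predictions
stacking = w[0] * gb_pred + w[1] * rf_pred

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print('weights (gb, rf):', w)
print('RMSE gb:', rmse(y, gb_pred))
print('RMSE rf:', rmse(y, rf_pred))
print('RMSE stacking:', rmse(y, stacking))
```

On the data it was fit on, the stacked prediction cannot have a higher RMSE than either base model, since using a single model alone is one of the weightings least squares could have chosen. Whether that advantage carries over to unseen data still has to be checked on a holdout set.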

## Final tips
  
All right, we're almost done. In this lesson, we'll just cover some tips that haven't been mentioned throughout the course.
  
**Save information**
  
The first tip is to save all the information we can. To begin, save the fold assignments to files: we want to track the validation score throughout the competition, and that score should always be calculated on the same folds. We should also save model runs, which lets us reproduce experiments or roll back if needed. One option is to create a separate git commit for each model run or submission. It is also a good idea to save model predictions. If we save validation and test predictions from the very beginning of the competition, building ensembles near the end becomes simple, because the stored predictions serve both as inputs for blending and as features for stacking. Finally, we should keep a log of model results to track progress. This could be done in the git commit messages or as notes in a separate document.
  
1. Save folds to the disk
2. Save model runs
3. Save model predictions to the disk
4. Save performance results
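One possible way to follow items 1, 3, and 4 of this checklist with only the standard library is sketched below. All names here (`assign_folds`, `log_result`, the file names, and the run names) are hypothetical, the prediction values are placeholders, and a temporary directory stands in for a real project folder. Item 2 (saving model runs, for example as git commits) is not shown.

```python
import csv
import json
import random
import tempfile
from pathlib import Path

def assign_folds(n_rows, n_splits=3, seed=123):
    """Assign each row index to a fold id in a reproducible, balanced way."""
    fold_ids = [i % n_splits for i in range(n_rows)]
    random.Random(seed).shuffle(fold_ids)
    return fold_ids

def log_result(log_path, run_name, cv_score):
    """Append one model run to a CSV log, writing a header for a new file."""
    is_new = not log_path.exists()
    with log_path.open('a', newline='') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(['run_name', 'cv_score'])
        writer.writerow([run_name, cv_score])

out_dir = Path(tempfile.mkdtemp())  # stand-in for a project directory

# 1. Save folds to disk once, then always reload the same file
folds = assign_folds(n_rows=10)
folds_path = out_dir / 'folds.json'
folds_path.write_text(json.dumps(folds))
reloaded_folds = json.loads(folds_path.read_text())

# 3. Save validation predictions for later blending or stacking
preds_path = out_dir / 'gb_baseline_val_preds.json'
preds_path.write_text(json.dumps([9.66, 9.30, 5.80]))  # placeholder values

# 4. Keep a running log of validation results
log_path = out_dir / 'results.csv'
log_result(log_path, 'gb_baseline', 6.49932)
log_result(log_path, 'gb_no_passenger_count', 6.42315)
```

Reloading the fold file, instead of regenerating folds in every script, is what keeps validation scores comparable across experiments.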
   
**Kaggle forum and kernels**
  
Now let's talk about the Kaggle forum and kernels. They are among the strongest sources of knowledge on Kaggle.
  
Each competition has an open forum where all the participants can start topics sharing their thoughts and ideas, asking questions and so on.
  
Kaggle kernels are another source of knowledge. They are the scripts and notebooks that participants share during the competition. So, we have an opportunity not only to discuss the competition, but also to look at the code. Moreover, kernels provide a computational environment: an interactive session running in a Docker container with pre-installed packages, the ability to mount data sources, access to GPU resources, and more.
  
**Forum and kernels usage**
  
The forum and kernels can bring us lots of benefits at different stages of a competition. Suppose we decide to join one of the current Kaggle competitions. First of all, it is useful to find similar past competitions on Kaggle. Top teams usually share their approaches on the forum once a competition has finished, so we can read through the best performing solutions and learn what could work for similar problem types. During the rest of the competition, we should closely follow the forum topics and the most popular kernels. This keeps us up to date and exposes us to lots of new ideas and approaches. Finally, once the competition ends, it's time to learn from the top participants. Winners usually share their solutions a couple of days after the competition finishes. This is very valuable information that we should use to determine what we could have done better.
  
<center><img src='../_images/final-tips-kaggle.png' alt='img' width='740'></center>
  
**Select final submissions**
  
The last few words are devoted to the final submissions. Kaggle competitions have different durations, but they generally last about 2 or 3 months. As we already know, we have a limited number of submissions to the Leaderboard each day. So, if we have a 2-month competition with 5 submissions per day, we could make up to 300 submissions to the Public Leaderboard.
  
<center><img src='../_images/final-tips-kaggle1.png' alt='img' width='740'></center>
  
However, for the final evaluation on the Private Leaderboard, we have to choose only 2 submissions. We mark them in the list of all submissions made.
  
<center><img src='../_images/final-tips-kaggle2.png' alt='img' width='740'></center>
  
And only these are used for the final standings. Our result is the best score out of these two final submissions.
  
<center><img src='../_images/final-tips-kaggle3.png' alt='img' width='740'></center>
  
**Final submissions**
  
A strategy that works well in practice is to select one submission that is the best on local validation and another that is the best on the Public Leaderboard.
  
**Let's practice!**
  
Let's now review these final tips before saying good-bye!

### Testing Kaggle forum ideas
  
Unfortunately, not all the Forum posts and Kernels are necessarily useful for your model. So instead of blindly incorporating ideas into your pipeline, you should test them first.
  
You're given a function `get_cv_score()`, which takes a `train` dataset as an argument and returns the validation score over 3-fold cross-validation: the mean fold root mean squared error plus one standard deviation of the fold scores. The `train` DataFrame is already available in your workspace.
  
You should try different suggestions from the Kaggle Forum and check whether they improve your validation score.
  
---
  
1. Suggestion 1: the `passenger_count` feature is useless. Let's see! Drop this feature and compare the scores.
2. This first suggestion worked. Suggestion 2: Sum of `pickup_latitude` and `distance_km` is a good feature. Let's try it!

In [17]:
def get_cv_score(train):
    features = ['pickup_longitude', 'pickup_latitude',
            'dropoff_longitude', 'dropoff_latitude',
            'passenger_count', 'distance_km', 'hour', 'weird_feature']
    
    features = [x for x in features if x in train.columns]
    
    # Create KFold object
    kf = KFold(n_splits=3, shuffle=True, random_state=123)

    rmse_scores = []
    
    # Loop through each split
    for train_index, test_index in kf.split(train):
        cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    
        # Train a Gradient Boosting model
        gb = GradientBoostingRegressor(random_state=123).fit(cv_train[features], cv_train.fare_amount)
    
        # Make predictions on the test data
        pred = gb.predict(cv_test[features])
    
        fold_score = np.sqrt(mean_squared_error(cv_test['fare_amount'], pred))
        rmse_scores.append(fold_score)
    
    return np.round(np.mean(rmse_scores) + np.std(rmse_scores), 5)

In [18]:
# Drop passenger_count column
new_train_1 = train.drop('passenger_count', axis=1)

# Compare validation scores
initial_score = get_cv_score(train)
new_score = get_cv_score(new_train_1)

print('Initial score is {} and the new score is {}'.format(initial_score, new_score))

Initial score is 6.49932 and the new score is 6.42315


In [19]:
# Create copy of the initial train DataFrame
new_train_2 = train.copy()

# Find sum of pickup latitude and ride distance
new_train_2['weird_feature'] = new_train_2['pickup_latitude'] + new_train_2['distance_km']

# Compare validation scores
initial_score = get_cv_score(train)
new_score = get_cv_score(new_train_2)

print('Initial score is {} and the new score is {}'.format(initial_score, new_score))

Initial score is 6.49932 and the new score is 6.50495


Be aware that not all publicly shared ideas will work for you! In this particular case, dropping the "passenger_count" feature helped, while adding the sum of pickup latitude and ride distance did not. The last action you perform in any Kaggle competition is selecting final submissions. Go on to practice it!
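The `get_cv_score()` function above returns the mean of the fold RMSEs plus their standard deviation, so a model whose score swings between folds is penalized. The sketch below illustrates that scoring rule and the comparison made in the two experiments. The helper names `penalized_score` and `is_improvement` are hypothetical, and the fold scores are made-up numbers; only the final four scores come from the outputs above.

```python
import numpy as np

def penalized_score(fold_scores):
    """Mean of the fold scores plus one standard deviation (lower is better)."""
    fold_scores = np.asarray(fold_scores, dtype=float)
    return float(np.round(np.mean(fold_scores) + np.std(fold_scores), 5))

def is_improvement(initial_score, new_score):
    """An idea is kept only if it lowers the validation score."""
    return new_score < initial_score

# Hypothetical fold RMSEs with the same mean but a different spread
stable_folds = [6.0, 6.1, 6.2]
unstable_folds = [5.0, 6.1, 7.2]
print(penalized_score(stable_folds), penalized_score(unstable_folds))

# Scores reported in the two experiments above
print(is_improvement(6.49932, 6.42315))  # dropping passenger_count
print(is_improvement(6.49932, 6.50495))  # adding weird_feature
```

Small differences such as 6.49932 versus 6.50495 may be within run-to-run noise, so it can be worth repeating the comparison with other fold seeds before drawing conclusions.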

### Select final submissions
  
The last action in every competition is selecting final submissions. Your goal is to select 2 final submissions based on the local validation and Public Leaderboard scores. Suppose that the competition metric is RMSE (the lower the metric, the better). Follow the selection strategy we've discussed in the slides:
  
1. Local validation: 1.25; Leaderboard: 1.35.
2. Local validation: 1.32; Leaderboard: 1.39.
3. Local validation: 1.10; Leaderboard: 1.29.
4. Local validation: 1.17; Leaderboard: 1.25.
5. Local validation: 1.21; Leaderboard: 1.32.
   
---
  
Possible Answers
  
- [ ] 1 and 2.
- [ ] 2 and 3.
- [x] 3 and 4.
- [ ] 4 and 5.
- [ ] 1 and 5.
  
Correct! Submission 3 is the best on local validation and submission 4 is the best on Public Leaderboard. So, it's the best choice for the final submissions!
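The same selection rule can be written as a few lines of code. This is a minimal sketch using the five submissions from the exercise; `select_final_submissions` is a hypothetical helper, and the fallback to the next-best leaderboard entry when one submission wins on both scores is an assumption, not something the course specifies. It expects at least two submissions.

```python
def select_final_submissions(submissions):
    """Pick the best submission on local validation and the best on the Public LB.

    submissions: dict mapping submission id -> (local_score, leaderboard_score),
    where lower is better (as with RMSE).
    """
    best_local = min(submissions, key=lambda s: submissions[s][0])
    # Rank by leaderboard score and take the best one that is not already chosen
    by_leaderboard = sorted(submissions, key=lambda s: submissions[s][1])
    best_public = next(s for s in by_leaderboard if s != best_local)
    return best_local, best_public

# (local validation RMSE, Public Leaderboard RMSE) from the exercise
submissions = {
    1: (1.25, 1.35),
    2: (1.32, 1.39),
    3: (1.10, 1.29),
    4: (1.17, 1.25),
    5: (1.21, 1.32),
}

final = select_final_submissions(submissions)
print(final)  # (3, 4)
```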

## Final thoughts
  
OK, we're at the finish line! Congratulations on going through all the lessons!
  
**What we've learned**
  
Let's quickly recap what we have learned in this course. First of all, we've learned what Kaggle is and how the Machine Learning competition process works. Then, we've covered the steps we need to perform in any competition: define the problem and explore the data, develop a reliable validation strategy, test hypotheses, generate new features, and finally build model ensembles. These topics have been covered using relatively accessible examples. Now it's your turn to start competing on Kaggle and expand your knowledge and experience.
  
- What is Kaggle
- Understand the problem
- Perform EDA
- Develop local validation
- Generate new features
- Build model ensembles
  
**Kaggle vs Data Science**
  
The final note I'd like to emphasize is that Kaggle competitions are not equal to Data Science. They cover only some subareas of the real Data Science job.
  
First of all, Data Science is not only about building Machine Learning models. Tasks can also include data analytics work: providing insights, preparing analytical reports, and helping decision makers. Kaggle does not help here.
  
But even Machine Learning projects in a Data Science job include steps that Kaggle does not cover. Let's list the usual stages of model development. We start by talking to business stakeholders to gather the model requirements. Then we collect the data needed for the model, whether it lives in .csv files, in databases, or on a Hadoop cluster. Based on the problem, we choose the metric to optimize and make a fair train/test split for model evaluation without leakage. In competitions, Kaggle does all of these steps for us. What we do on Kaggle is create the best performing model itself. Of course, that is a long process and includes everything you've seen in this course. Finally, in a real project, we need to deploy the model into the production environment. Kaggle does not help here either.
  
1. Talk to Business, Define the problem
2. Collect the data
3. Select the metric
4. Make train and test split
5. Create the model
6. Move the model to production
   
However, Kaggle does give practical skills and teaches tricks that are rarely covered in online courses or books. One can get hands-on experience with many problem types alongside a community that includes some of the best Data Scientists in the world.
  
**Start competing on Kaggle!**
  
Congratulations again on finishing this course, and thank you for taking it! I hope you've enjoyed this course as much as I've enjoyed teaching it. Now go straight to [kaggle.com](https://www.kaggle.com) and take part in Machine Learning competitions! Good luck!