# Kaggle

#### Understanding the problem
* **Data type:** tabular data, time series, images, text, etc.; the data may be structured, unstructured, or a mix of both
* **Problem type:** classification, regression, ranking, etc.
* **Evaluation metric:** ROC AUC, F1-score, MAE, MSE, etc.

#### Metric definition
* Generally, the majority of the metrics can be found in the `sklearn.metrics` library
* However, **there are some special competition metrics that are not available in scikit-learn**
    * In such cases, we have to create metrics manually
    
### Kaggle Solution Workflow

<img src='data/solution_workflow.png' width="600" height="300" align="center"/>

```
import pandas as pd
import numpy as np
```

```
def rmsle(y_true, y_pred):
    diffs = np.log(y_true + 1) - np.log(y_pred + 1)
    squares = np.power(diffs, 2)
    err = np.sqrt(np.mean(squares))
    return err
```
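
A hand-written metric such as `rmsle` is easiest to trust after checking it on inputs whose answer is known in advance. The sketch below restates the function with `np.log1p`, which computes `log(1 + x)`, and evaluates it on made-up toy arrays chosen only so that the result can be verified by hand (the log differences are 0 and 1, so the score is the square root of 0.5).

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(1 + x)
    diffs = np.log1p(y_true) - np.log1p(y_pred)
    return np.sqrt(np.mean(diffs ** 2))

# Toy values (illustrative only): the log differences are 0 and 1
y_true = np.array([0.0, np.e - 1.0])
y_pred = np.array([0.0, 0.0])

score = rmsle(y_true, y_pred)
print(score)
```

Identical predictions and targets should give a score of exactly zero, which is another cheap check worth running.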

* Before building any models, we should perform some preliminary steps to understand the data and the problem we're facing. 

#### Goals of EDA
* Size of the data
* Properties of the target variable
    * high class imbalance in classification problem?
    * skewed distribution in regression problem?
* Properties of the features
* Generate ideas for feature engineering
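
The EDA goals above map onto a handful of pandas calls. The following is a minimal sketch on a hypothetical toy DataFrame; the column names `feature`, `target_class`, and `target_value` are invented for illustration and do not come from any competition.

```python
import pandas as pd

# Hypothetical toy data standing in for a competition train set
train = pd.DataFrame({
    'feature': [1.0, 2.0, None, 4.0, 5.0, 6.0],
    'target_class': ['a', 'a', 'a', 'a', 'a', 'b'],
    'target_value': [1.0, 1.0, 2.0, 2.0, 3.0, 50.0],
})

# Size of the data
n_rows, n_cols = train.shape

# Class balance of a classification target (share of each class)
class_share = train['target_class'].value_counts(normalize=True)

# Skewness of a regression target (positive means a long right tail)
target_skew = train['target_value'].skew()

# Missing values per column
missing_counts = train.isnull().sum()

print(n_rows, n_cols)
print(class_share)
print(target_skew)
print(missing_counts)
```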

### K-fold cross-validation

```
from sklearn.model_selection import KFold
```

```
# Create a KFold object
kf = KFold(n_splits=5, shuffle=True, random_state=123)
```

* Now we need to train one model per cross-validation split, `K` models in total.
* To obtain the splits, we call the `split()` method of the KFold object with the `train` data as an argument.
* It returns a generator that yields a pair of index arrays (training and testing observations) for each split.
* The observations are given as integer positions in the train data.
* These indices can be used inside the loop to select the training and testing folds for the corresponding cross-validation split.
* For a pandas DataFrame, this can be done with the `iloc` indexer, for example.

```
# Loop through each cross-validation split
for train_index, test_index in kf.split(train):
    # Get training and testing data for the corresponding split
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
```

#### Stratified K-fold
* As demonstrated in the image below, each fold has the same class distribution as the initial dataset.
* It is useful for classification problems with high class imbalance in the target variable, or when the dataset is very small.

<img src='data/stratified_kfold.png' width="600" height="300" align="center"/>

```
# Import StratifiedKFold
from sklearn.model_selection import StratifiedKFold

# Create a StratifiedKFold object
str_kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

# Loop through each cross-validation split
for train_index, test_index in str_kf.split(train, train['target']):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
```

```
# Import KFold
from sklearn.model_selection import KFold

# Create a KFold object
kf = KFold(n_splits=3, shuffle=True, random_state=123)

# Loop through each split
fold = 0
for train_index, test_index in kf.split(train):
    # Obtain training and testing folds
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    print('Fold: {}'.format(fold))
    print('CV train shape: {}'.format(cv_train.shape))
    print('Medium interest listings in CV train: {}\n'.format(sum(cv_train.interest_level == 'medium')))
    fold += 1
    
# Import StratifiedKFold
from sklearn.model_selection import StratifiedKFold

# Create a StratifiedKFold object
str_kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)

# Loop through each split
fold = 0
for train_index, test_index in str_kf.split(train, train['interest_level']):
    # Obtain training and testing folds
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    print('Fold: {}'.format(fold))
    print('CV train shape: {}'.format(cv_train.shape))
    print('Medium interest listings in CV train: {}\n'.format(sum(cv_train.interest_level == 'medium')))
    fold += 1
```

#### Validation usage
* **Leakage** causes a model to seem accurate until we start making predictions in a real-world environment
* **Types of data leakage:**
    * Leak in **features:** using data that will not be available in the real (production) setting
        * Example: predicting sales in US dollars, while having exactly the same sales in UK pounds as a feature.
    * Leak in **validation strategy:** validation strategy differs from the real-world situation
        * Example: using standard k-fold for time series data, which lets the model train on observations from the future
        * Instead, time series k-fold should be done as demonstrated in the image below:
        
<img src='data/time_series_kfold.png' width="600" height="300" align="center"/>        


* The underlying idea of time series k-fold cross-validation is to provide multiple splits in such a manner that we train only on past data while always predicting the future
* **Time k-fold cross-validation** is also available in `sklearn.model_selection`, as `TimeSeriesSplit`

```
# Import TimeSeriesSplit
from sklearn.model_selection import TimeSeriesSplit

# Create a TimeSeriesSplit object
time_kfold = TimeSeriesSplit(n_splits=5)
```

* **Note** that **before applying it to the data, we need to sort the train DataFrame by date.**
    * Then, as usual, iterate through each cross-validation split

```
# Sort train by date
train = train.sort_values('date')

# Loop through each cross-validation split
for train_index, test_index in time_kfold.split(train):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
```
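
To see that this scheme really trains only on the past, the sketch below runs `TimeSeriesSplit` on a hypothetical toy DataFrame of eight daily observations (the data and the `value` column are made up for illustration) and records, for every split, whether the latest training date precedes the earliest test date.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy data: 8 daily observations, deliberately shuffled
train = pd.DataFrame({
    'date': pd.date_range('2021-01-01', periods=8, freq='D'),
    'value': np.arange(8),
}).sample(frac=1, random_state=0)

# Sort by date before splitting
train = train.sort_values('date')

time_kfold = TimeSeriesSplit(n_splits=3)
checks = []
for train_index, test_index in time_kfold.split(train):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    # Every training date should precede every test date
    checks.append(cv_train['date'].max() < cv_test['date'].min())

print(checks)
```

If the sorting step is skipped, the same check is a quick way to notice that future rows have leaked into a training fold.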

### Validation pipeline
* Firstly, create an empty list where we will store the model's results
* Split train data into folds 
* For each cross-validation split, we perform the following steps:
    * Train a model using all except for a single fold
    * Make predictions on this single unseen fold
    * Calculate the competition metric and append it to the list of folds metrics
* As a result, we have a list of K numbers representing model quality for each fold 

```
# List for the results
fold_metrics = []
for train_index, test_index in CV_STRATEGY.split(train):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    # Train a model
    model.fit(cv_train)
    # Make predictions
    predictions = model.predict(cv_test)
    # Calculate the metric
    metric = evaluate(cv_test, predictions)
    fold_metrics.append(metric)
```
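
The pseudocode above leaves `CV_STRATEGY`, `model`, and `evaluate` unspecified. One possible concrete version is sketched below, assuming a `KFold` strategy, a linear regression model, and MSE as the metric; the data is synthetic and all of these choices are illustrative rather than prescribed by any competition.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (illustrative only): y is a noisy linear function of X
rng = np.random.default_rng(123)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=123)

# List for the results
fold_metrics = []
for train_index, test_index in kf.split(X):
    # Train a fresh model on all folds except one
    model = LinearRegression()
    model.fit(X[train_index], y[train_index])
    # Make predictions on the held-out fold
    predictions = model.predict(X[test_index])
    # Calculate the metric and store it
    fold_metrics.append(mean_squared_error(y[test_index], predictions))

print(fold_metrics)
```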
* Now, we could train two different models and for each model get a list of K numbers

<img src='data/mod_comp.png' width="300" height="150" align="center"/>

* For example, above we have models A and B, each with mean squared errors in four folds
* Our goal is to select the model with better quality 
* The next step is to transform the K fold scores into a single overall validation score
* The simplest way to obtain a single number is to find the mean over all fold scores
    * However, the mean is not always a good choice, as it does *not* take into account **score deviation** from one fold to another
    * For example, we could get a very good score for a single fold, while the performance on the remaining K-1 folds is poor.

```
import numpy as np

# Simple mean over the folds
mean_score = np.mean(fold_metrics)
```
* **A more reliable overall validation score:** take the worst-case scenario, i.e. the validation score one standard deviation away from the mean in the unfavorable direction
* **Note** that we **add** standard deviation if the competition metric is being *minimized* and **subtract** standard deviation if the metric is being *maximized*.

```
# Overall validation score
overall_score_minimizing = np.mean(fold_metrics) + np.std(fold_metrics)

# Or
overall_score_maximizing = np.mean(fold_metrics) - np.std(fold_metrics)
```
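
A small worked example shows why the deviation matters. The fold scores below are hypothetical MSE values invented for illustration: both models have the same mean, but model B is far less stable across folds, so its overall validation score is worse.

```python
import numpy as np

# Hypothetical MSE scores over four folds (illustrative values only)
model_a = np.array([2.0, 2.0, 2.0, 2.0])
model_b = np.array([1.0, 1.0, 1.0, 5.0])

# MSE is minimized, so one standard deviation is added to the mean
overall_a = np.mean(model_a) + np.std(model_a)
overall_b = np.mean(model_b) + np.std(model_b)

print('Mean A:', np.mean(model_a), 'Mean B:', np.mean(model_b))
print('Overall A:', overall_a, 'Overall B:', overall_b)
```

Here the plain mean cannot separate the two models, while the adjusted score prefers model A.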

#### Exercises: Time K-Fold

```
# Create TimeSeriesSplit object
time_kfold = TimeSeriesSplit(n_splits=3)

# Sort train data by date
train = train.sort_values(by='date')

# Iterate through each split
fold = 0
for train_index, test_index in time_kfold.split(train):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    
    print('Fold :', fold)
    print('Train date range: from {} to {}'.format(cv_train.date.min(), cv_train.date.max()))
    print('Test date range: from {} to {}\n'.format(cv_test.date.min(), cv_test.date.max()))
    fold += 1
```

#### Exercises: Overall Validation Score

```
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

# Sort train data by date
train = train.sort_values('date')

# Initialize 3-fold time cross-validation
kf = TimeSeriesSplit(n_splits=3)

# Get MSE scores for each cross-validation split
mse_scores = get_fold_mse(train, kf)

print('Mean validation MSE: {:.5f}'.format(np.mean(mse_scores)))
print('MSE by fold: {}'.format(mse_scores))
print('Overall validation MSE: {:.5f}'.format(np.mean(mse_scores) + np.std(mse_scores)))
```

# $\star$ Chapter 3: Feature Engineering
You will now get exposure to different types of features. You will modify existing features and create new ones. Also, you will treat the missing data accordingly.

### Feature engineering

<img src='data/mod_process.png' width="600" height="300" align="center"/>

* **Important rule:** tweak only a single thing at a time, because changing multiple things at once makes it impossible to tell what actually works and what doesn't
* **Feature engineering** helps our ML models to get additional information and consequently to better predict the target variable
* The ideas for new features can come from prior experience working with similar data.
* Also, having looked at the data, we could potentially generate ideas for new valuable features
* One more source is domain knowledge of the problem we're solving

#### Feature types
* Numerical
* Categorical
* Datetime
* Coordinates
* Text
* Images

#### Creating features 
* There are some situations when we need to generate features for the train and test sets independently, and separately for each validation split in the k-fold cross-validation
* However, in the majority of cases features are created for train and test sets simultaneously
    * For this purpose, we concatenate train and test DataFrames from Kaggle into a single DF using pandas
    
```
# Concatenate the train and test data
data = pd.concat([train, test])

# Generate new features for the full DataFrame

# Get the original train and test split back
train = data[data.id.isin(train.id)]
test = data[data.id.isin(test.id)]
```

#### Arithmetical features

```
# Arithmetical features
two_sigma['price_per_bedroom'] = two_sigma.price / two_sigma.bedrooms
two_sigma['rooms_number'] = two_sigma.bedrooms + two_sigma.bathrooms
```

#### Datetime features 

```
# Convert date to the datetime object
dem['date'] = pd.to_datetime(dem['date'])

# Year features
dem['year'] = dem['date'].dt.year

# Month features
dem['month'] = dem['date'].dt.month

# Week features (dt.weekofyear was removed in pandas 2.0)
dem['week'] = dem['date'].dt.isocalendar().week

# Day features
dem['dayofyear'] = dem['date'].dt.dayofyear
dem['dayofmonth'] = dem['date'].dt.day
dem['dayofweek'] = dem['date'].dt.dayofweek
```

#### Exercises: Arithmetical features

```
# Look at the initial RMSE
print('RMSE before feature engineering:', get_kfold_rmse(train))

# Find the total area of the house
train['TotalArea'] = train['TotalBsmtSF'] + train['FirstFlrSF'] + train['SecondFlrSF']

# Look at the updated RMSE
print('RMSE with total area:', get_kfold_rmse(train))

# Find the area of the garden
train['GardenArea'] = train['LotArea'] - train['FirstFlrSF']
print('RMSE with garden area:', get_kfold_rmse(train))

# Find total number of bathrooms
train['TotalBath'] = train.FullBath + train.HalfBath
print('RMSE with number of bathrooms:', get_kfold_rmse(train))

# Concatenate train and test together
taxi = pd.concat([train, test])

# Convert pickup date to datetime object
taxi['pickup_datetime'] = pd.to_datetime(taxi['pickup_datetime'])

# Create a day of week feature
taxi['dayofweek'] = taxi['pickup_datetime'].dt.dayofweek

# Create an hour feature
taxi['hour'] = taxi['pickup_datetime'].dt.hour

# Split back into train and test
new_train = taxi[taxi['id'].isin(train['id'])]
new_test = taxi[taxi['id'].isin(test['id'])]
```

### Categorical features
* The majority of ML models do not handle string values and categorical features automatically

#### Label Encoding

```
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder object
le = LabelEncoder()

# Encode a categorical feature
df['cat_encoded'] = le.fit_transform(df['cat'])
```
* The problem with Label Encoding is that we implicitly assume that there is a ranking dependency between the categories
* Such an approach can be harmful to linear models, although it still works for tree-based models.
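
The implicit ranking is easy to see on a tiny example. In the sketch below (the color values are hypothetical), `LabelEncoder` sorts the classes alphabetically and assigns consecutive integers, so the codes suggest an ordering that has no meaning for the feature itself.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical values with no natural order
colors = ['red', 'green', 'blue', 'green']

le = LabelEncoder()
codes = le.fit_transform(colors)

# Classes are sorted alphabetically, so the codes imply blue < green < red
print(list(le.classes_))
print(list(codes))
```

A linear model would treat `red` as twice `green`, which is the harmful assumption mentioned above; a tree-based model can still isolate each code with threshold splits.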

#### One-Hot encoding

```
# Create One-Hot encoded features
ohe = pd.get_dummies(df['cat'], prefix='ohe_cat')

# Drop the initial feature
df.drop('cat', axis = 1, inplace = True)

# Concatenate OHE features to the dataframe
df = pd.concat([df, ohe], axis =1)
```

#### Binary Feature
* One special case of categorical features is binary features 
    * For example: Yes/No, On/Off
* For such features, we **always apply label encoding**

```
le = LabelEncoder()
binary_feature['binary_encoded'] = le.fit_transform(binary_feature['binary_feat'])
```

### Other encoding approaches
* Backward Difference Coding
* BaseN
* Binary
* CatBoost Encoder
* Hashing
* Helmert Coding
* James-Stein Encoder
* M-estimate
* One Hot
* Ordinal
* Polynomial Coding
* Sum Coding
* Target Encoder $\Leftarrow$ **most widely used on Kaggle**
* Weight of Evidence

#### Exercises: Label Encoding

```
# Concatenate train and test together
houses = pd.concat([train, test])

# Label encoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Create new features
houses['RoofStyle_enc'] = le.fit_transform(houses['RoofStyle'])
houses['CentralAir_enc'] = le.fit_transform(houses['CentralAir'])

# Look at new features
print(houses[['RoofStyle', 'RoofStyle_enc', 'CentralAir', 'CentralAir_enc']].head())
```
#### Exercises: One-Hot Encoding

```
# Concatenate train and test together
houses = pd.concat([train, test])

# Look at feature distributions
print(houses['RoofStyle'].value_counts(), '\n')
print(houses['CentralAir'].value_counts())

# Concatenate train and test together
houses = pd.concat([train, test])

# Label encode binary 'CentralAir' feature
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
houses['CentralAir_enc'] = le.fit_transform(houses['CentralAir'])

# Create One-Hot encoded features
ohe = pd.get_dummies(houses['RoofStyle'], prefix='RoofStyle')

# Concatenate OHE features to houses
houses = pd.concat([houses, ohe], axis=1)

# Look at OHE features
print(houses[[col for col in houses.columns if 'RoofStyle' in col]].head(3))
```

### Target Encoding

#### High cardinality categorical features 
* These are categorical features with a large number of distinct category values (roughly 10 or more).
    * Example: zipcode
* **Target encoding:**
    * Like a label encoder, it creates only a single column, but it also captures the relationship between each category and the target variable
    * There are various options for the encoding function
    
#### Mean target encoding
* To apply mean target encoding, we follow these steps:
    * Calculate mean on the train, apply to the test
    * Split train into K folds. Calculate mean on (K-1) folds, apply to the K-th fold
    * Add mean target encoded feature to the model 
    
<img src='data/mte1.png' width="300" height="150" align="center"/>    
    
* **In this case, for Category A, the mean target code is 0.66** (2/3, truncated to two decimals)
    * 2 of the 3 A observations have target 1; 1 of the 3 has target 0
* **Category B's mean target code is 0.25**
    * 1 of the 4 B observations has target 1; 3 of the 4 have target 0
* **Next:** Apply mean target codes to test data; as a result, we've obtained a new feature
  
<img src='data/mean_target_encoding.png' width="300" height="150" align="center"/>
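
The same calculation can be reproduced with a `groupby`. The sketch below rebuilds the category counts described above in a toy train DataFrame; the test rows and the column names `category`, `target`, and `category_enc` are hypothetical.

```python
import pandas as pd

# Toy train data matching the counts described above
train = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'target':   [1, 1, 0, 1, 0, 0, 0],
})
# Hypothetical test data
test = pd.DataFrame({'category': ['A', 'B', 'B', 'A']})

# Mean of the target for each category, computed on train only
category_means = train.groupby('category')['target'].mean()

# Map the train statistics onto the test categories
test['category_enc'] = test['category'].map(category_means)
print(test)
```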

#### Train encoding using out-of-fold
* Now we need to calculate this mean target encoded feature for the train data
* We will be using out-of-fold statistics
* **First** we'll split the data into 2 folds:

<img src='data/mte2.png' width="300" height="150" align="center"/>

* Take fold number 1:
    * We take the target mean out of this fold, so using only fold # 2 observations
    
<img src='data/mte4.png' width="300" height="150" align="center"/>

* Now we calculate out-of-fold target means for the second fold using only the first fold observations:

<img src='data/mte5.png' width="300" height="150" align="center"/>

#### Practical guides
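* Two techniques are commonly applied together with mean target encoding:
    * **Smoothing:** shrink each category mean toward the global mean, so that rare categories do not get extreme values
    * **New categories:** categories that appear in test but not in train are filled with the global mean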
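
Written out, the smoothed statistic computed in the exercise code below (`train_statistics`) is the following, where $n_c$ is the number of train observations in category $c$, $\bar{y}$ is the global target mean on train, and $\alpha$ is the smoothing strength:

$$
\text{enc}(c) = \frac{\sum_{i:\, x_i = c} y_i + \alpha \cdot \bar{y}}{n_c + \alpha}
$$

With $\alpha = 0$ this reduces to the plain category mean, and larger $\alpha$ pulls the encoding toward $\bar{y}$. As an illustrative calculation with assumed numbers: for $\bar{y} = 0.5$ and $\alpha = 5$, a category seen only once with target 1 is encoded as $(1 + 5 \cdot 0.5) / (1 + 5) \approx 0.58$ instead of $1.0$.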
    
#### Exercises: Mean Target Encoding

```
def test_mean_target_encoding(train, test, target, categorical, alpha=5):
    # Calculate global mean on the train data
    global_mean = train[target].mean()
    
    # Group by the categorical feature and calculate its properties
    train_groups = train.groupby(categorical)
    category_sum = train_groups[target].sum()
    category_size = train_groups.size()
    
    # Calculate smoothed mean target statistics
    train_statistics = (category_sum + global_mean * alpha) / (category_size + alpha)
    
    # Apply statistics to the test data and fill new categories
    test_feature = test[categorical].map(train_statistics).fillna(global_mean)
    return test_feature.values
```

```
def train_mean_target_encoding(train, target, categorical, alpha=5):
    # Create 5-fold cross-validation
    kf = KFold(n_splits=5, random_state=123, shuffle=True)
    # Explicit float dtype avoids an object-dtype empty Series
    train_feature = pd.Series(index=train.index, dtype=float)
    
    # For each folds split
    for train_index, test_index in kf.split(train):
        cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
      
        # Calculate out-of-fold statistics and apply to cv_test
        cv_test_feature = test_mean_target_encoding(cv_train, cv_test, target, categorical, alpha)
        
        # Save new feature for this particular fold
        train_feature.iloc[test_index] = cv_test_feature       
    return train_feature.values
```

```
def mean_target_encoding(train, test, target, categorical, alpha=5):
  
    # Get the train feature
    train_feature = train_mean_target_encoding(train, target, categorical, alpha)
  
    # Get the test feature
    test_feature = test_mean_target_encoding(train, test, target, categorical, alpha)
    
    # Return new features to add to the model
    return train_feature, test_feature
```

#### Exercises: K-fold cross-validation

```
# Create 5-fold cross-validation
kf = KFold(n_splits=5, random_state=123, shuffle=True)

# For each folds split
for train_index, test_index in kf.split(bryant_shots):
    # Copy the folds so that adding a column does not trigger SettingWithCopyWarning
    cv_train, cv_test = bryant_shots.iloc[train_index].copy(), bryant_shots.iloc[test_index].copy()

    # Create mean target encoded feature
    cv_train['game_id_enc'], cv_test['game_id_enc'] = mean_target_encoding(train=cv_train,
                                                                           test=cv_test,
                                                                           target='shot_made_flag',
                                                                           categorical='game_id',
                                                                           alpha=5)
    # Look at the encoding
    print(cv_train[['game_id', 'shot_made_flag', 'game_id_enc']].sample(n=1))
    
# Create mean target encoded feature
train['RoofStyle_enc'], test['RoofStyle_enc'] = mean_target_encoding(train=train,
                                                                     test=test,
                                                                     target='SalePrice',
                                                                     categorical='RoofStyle',
                                                                     alpha=10)

# Look at the encoding
print(test[['RoofStyle', 'RoofStyle_enc']].drop_duplicates())
```

### Missing data
* Some machine learning algorithms like XGBoost or LightGBM can treat missing data without any preprocessing 
* However, it is often worth implementing your own missing value imputation, since it can improve the model.

<img src='data/misdata1.png' width="300" height="150" align="center"/>

#### Numerical data
* Mean/median imputation
* Constant value imputation
    * **To emphasize that a value was missing, sometimes a special constant value is used.**
    * Not a good choice for linear models, but works well for tree-based models
    
#### Categorical data
* Most frequent category imputation
* New category imputation
    * Create a new category for the missing values 
    
#### Find missing data
* `df.isnull()`
    * returns a DataFrame with booleans as cell values
    * If missing, `True`
* `df.isnull().sum()`
    * returns the number of missing values in each column
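
A minimal sketch of both calls on a hypothetical toy DataFrame (the `num` and `cat` columns and their values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with a few missing cells
df = pd.DataFrame({
    'num': [1.0, np.nan, 3.0, np.nan],
    'cat': ['a', None, 'b', 'a'],
})

# Boolean mask: True where a value is missing
mask = df.isnull()

# Number of missing values per column
missing_per_column = df.isnull().sum()

print(mask)
print(missing_per_column)
```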


#### Numerical missing data

```
# Import SimpleImputer
from sklearn.impute import SimpleImputer

# Different types of imputers
mean_imputer = SimpleImputer(strategy='mean')
constant_imputer = SimpleImputer(strategy='constant', fill_value=-999)

# Imputation
df[['num']] = mean_imputer.fit_transform(df[['num']])
```
* **Note that even when imputing a single column, we have to select it with double brackets** (`df[['num']]`), because `SimpleImputer` expects 2-D input; the same syntax accepts a list of several columns.

#### Categorical missing data

```
# Import SimpleImputer
from sklearn.impute import SimpleImputer

# Different types of imputers
frequent_imputer = SimpleImputer(strategy='most_frequent')
constant_imputer = SimpleImputer(strategy='constant', fill_value='MISS')

# Imputation 
df[['cat']] = constant_imputer.fit_transform(df[['cat']])
```

# $\star$ Chapter 4: Modeling
Time to bring everything together and build some models! In this last chapter, you will build a base model before tuning some hyperparameters and improving your results with ensembles. You will then get some final tips and tricks to help you compete more efficiently.

### Baseline model

<img src='data/baseline_model.png' width="500" height="250" align="center"/>

* To start this loop, we should establish the baseline model.
* It's usually a very simple model that allows us to check the whole pipeline we've written, review the local validation process, and generate the first submissions for the test data

### Baseline model I: mean

```
import numpy as np

# Assign the mean fare amount to all the test observations
taxi_test['fare_amount'] = np.mean(taxi_train.fare_amount)

# Write the predictions to the file
taxi_test[['id', 'fare_amount']].to_csv('mean_sub.csv', index=False)
```

### Baseline model II: mean grouped by the number of passengers

```
# Calculate the mean fare amount by group
naive_prediction_groups = taxi_train.groupby('passenger_count').fare_amount.mean()

# Make predictions on the test set
taxi_test['fare_amount'] = taxi_test.passenger_count.map(naive_prediction_groups)

# Write predictions to the file
taxi_test[['id', 'fare_amount']].to_csv('mean_group_sub.csv', index=False)
```
* The idea is the same: assign the average value of fare amount to the whole group

### Baseline III: out-of-the-box Gradient Boosting model on all numeric features

```
# Select only numeric features
features = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count']

from sklearn.ensemble import GradientBoostingRegressor

# Train a Gradient Boosting model
gb = GradientBoostingRegressor()
gb.fit(taxi_train[features], taxi_train.fare_amount)

# Make predictions on the test data
taxi_test['fare_amount'] = gb.predict(taxi_test[features])

# Write predictions to the file
taxi_test[['id', 'fare_amount']].to_csv('gb_sub.csv', index=False)
```

### Correlation with Public Leaderboard
* Ideally, local validation scores and Public Leaderboard scores are correlated.
* The values do not need to match exactly, but when the local score improves, we want to see an improvement on the Leaderboard as well.
    * If not, it is a sign that something could either be wrong with our models or validation scheme.
 
```
import numpy as np
from sklearn.metrics import mean_squared_error
from math import sqrt

# Calculate the mean fare_amount on the validation_train data
naive_prediction = np.mean(validation_train['fare_amount'])

# Assign naive prediction to all the holdout observations
validation_test['pred'] = naive_prediction

# Measure the local RMSE
rmse = sqrt(mean_squared_error(validation_test['fare_amount'], validation_test['pred']))
print('Validation RMSE for Baseline I model: {:.3f}'.format(rmse))
```

```
# Get pickup hour from the pickup_datetime column
train['hour'] = train['pickup_datetime'].dt.hour
test['hour'] = test['pickup_datetime'].dt.hour

# Calculate average fare_amount grouped by pickup hour 
hour_groups = train.groupby('hour')['fare_amount'].mean()

# Make predictions on the test set
test['fare_amount'] = test.hour.map(hour_groups)

# Write predictions
test[['id','fare_amount']].to_csv('hour_mean_sub.csv', index=False)

from sklearn.ensemble import RandomForestRegressor

# Select only numeric features
features = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
            'dropoff_latitude', 'passenger_count', "hour"]

# Train a Random Forest model
rf = RandomForestRegressor()
rf.fit(train[features], train.fare_amount)

# Make predictions on the test data
test['fare_amount'] = rf.predict(test[features])

# Write predictions
test[['id','fare_amount']].to_csv('rf_sub.csv', index=False)
```

### Hyperparameter tuning
* Generally, we make Kaggle Leaderboard submissions only after a couple of changes, just to check that our local validation score moves in the same direction as the Public Leaderboard score (there is usually a limit of 5 submissions per day).

<img src='data/ml_v_dl.png' width="600" height="300" align="center"/>

#### Grid search

```
from sklearn.linear_model import Ridge

# Possible alpha values
alpha_grid = [0.01, 0.1, 1, 10]
results = {}

# For each value in the grid
for candidate_alpha in alpha_grid:
    # Create a model with a specific alpha value
    ridge_regression = Ridge(alpha=candidate_alpha)
    # Find the validation score for this model
    # (validation_score is a placeholder for the output of the validation pipeline)
    # Save the results for each alpha value
    results[candidate_alpha] = validation_score
```
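
The loop above leaves the validation score as a placeholder. One way to fill it in is sketched below, assuming 5-fold cross-validated MSE computed with scikit-learn's `cross_val_score`; the data is synthetic and the choice of metric is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(123)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.0, -1.5, 2.0, 0.5]) + rng.normal(scale=0.5, size=200)

alpha_grid = [0.01, 0.1, 1, 10]
kf = KFold(n_splits=5, shuffle=True, random_state=123)
results = {}

for candidate_alpha in alpha_grid:
    ridge_regression = Ridge(alpha=candidate_alpha)
    # scikit-learn scorers are maximized, so MSE is returned negated
    neg_mse = cross_val_score(ridge_regression, X, y, cv=kf,
                              scoring='neg_mean_squared_error')
    results[candidate_alpha] = -neg_mse.mean()

# MSE is minimized, so the best alpha has the smallest score
best_alpha = min(results, key=results.get)
print(results)
print(best_alpha)
```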

```
# Possible max depth values
max_depth_grid = [3, 6, 9, 12, 15]
results = {}

# For each value in the grid
for max_depth_candidate in max_depth_grid:
    # Specify parameters for the model
    params = {'max_depth': max_depth_candidate}

    # Calculate validation score for a particular hyperparameter
    validation_score = get_cv_score(train, params)

    # Save the results for each max depth value
    results[max_depth_candidate] = validation_score   
print(results)
```

```
import itertools

# Hyperparameter grids
max_depth_grid = [3, 5, 7]
subsample_grid = [0.8, 0.9, 1.0]
results = {}

# For each couple in the grid
for max_depth_candidate, subsample_candidate in itertools.product(max_depth_grid, subsample_grid):
    params = {'max_depth': max_depth_candidate,
              'subsample': subsample_candidate}
    validation_score = get_cv_score(train, params)
    # Save the results for each couple
    results[(max_depth_candidate, subsample_candidate)] = validation_score   
print(results)
```

### Model Ensembling

<img src='data/model_ensembling.png' width="600" height="300" align="center"/>

### Model blending
* The idea of ensemble learning is to build a prediction model by combining the strength of a collection of simpler base models
* The so-called blending approach is to just find an average of our multiple models' predictions

#### Arithmetic mean
* **Arithmetic mean** works for both regression and classification problems.
    * However, for classification, it is better to use the geometric mean of the class probabilities predicted

<img src='data/argeo_means.png' width="400" height="200" align="center"/>

#### Geometric mean
* The geometric mean is the $n$-th root of the product of the $n$ models' predictions; for classification, it is the better choice for combining predicted class probabilities
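
Both blends take one line of NumPy each. The sketch below uses hypothetical class-1 probabilities from two models (the values are invented for illustration) to show how the two means differ when the models disagree.

```python
import numpy as np

# Hypothetical class-1 probabilities from two models (illustrative values)
model_a_probs = np.array([0.9, 0.4, 0.1])
model_b_probs = np.array([0.4, 0.9, 0.1])

# Arithmetic mean of the predictions
arithmetic_blend = (model_a_probs + model_b_probs) / 2

# Geometric mean of the predictions (square root of the product for two models)
geometric_blend = np.sqrt(model_a_probs * model_b_probs)

print(arithmetic_blend)
print(geometric_blend)
```

The geometric mean never exceeds the arithmetic mean, and the two coincide when the models agree. Note that geometric means of per-class probabilities do not necessarily sum to one, so they may need renormalizing when calibrated probabilities are required.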

### Model stacking
* Stacking is a more advanced ensembling approach
* The idea is to take multiple single models, take their predictions and use these predictions as features in the 2nd level model.
    * 1) Split train data into two parts: Part 1 & Part 2
    * 2) Train multiple single models on the first part (Part 1)
    * 3) Make predictions on Part 2
    * 4) Make predictions on the test data
        * (Now we have model predictions for both Part 2 of the train data and for the test data)
    * 5) Train a new model on Part 2 using the predictions as features
        * This model is called the **2nd level model** or **meta-model**
    * 6) Make predictions on the test data using the 2nd level model

#### Exercises: Model blending

```
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Train a Gradient Boosting model
gb = GradientBoostingRegressor().fit(train[features], train.fare_amount)

# Train a Random Forest model
rf = RandomForestRegressor().fit(train[features], train.fare_amount)

# Make predictions on the test data
test['gb_pred'] = gb.predict(test[features])
test['rf_pred'] = rf.predict(test[features])

# Find mean of model predictions
test['blend'] = (test['gb_pred'] + test['rf_pred']) / 2
print(test[['gb_pred', 'rf_pred', 'blend']].head(3))
```

#### Exercises: Model Stacking I

```
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Split train data into two parts
part_1, part_2 = train_test_split(train, test_size=0.5, random_state=123)

# Train a Gradient Boosting model on Part 1
gb = GradientBoostingRegressor().fit(part_1[features], part_1.fare_amount)

# Train a Random Forest model on Part 1
rf = RandomForestRegressor().fit(part_1[features], part_1.fare_amount)

# Make predictions on the Part 2 data
part_2['gb_pred'] = gb.predict(part_2[features])
part_2['rf_pred'] = rf.predict(part_2[features])

# Make predictions on the test data
test['gb_pred'] = gb.predict(test[features])
test['rf_pred'] = rf.predict(test[features])
```

#### Exercises: Model Stacking II

```
from sklearn.linear_model import LinearRegression

# Create linear regression model without the intercept
lr = LinearRegression(fit_intercept=False)

# Train 2nd level model on the Part 2 data
lr.fit(part_2[['gb_pred', 'rf_pred']], part_2.fare_amount)

# Make stacking predictions on the test data
test['stacking'] = lr.predict(test[['gb_pred', 'rf_pred']])

# Look at the model coefficients
print(lr.coef_)
```

### Final Tips
#### Save all the information
* **Save folds distributions to files**
* **Save model runs**
* **Save model predictions to the disk**
* **Save performance results**

#### Kaggle forum and kernels
* **Kaggle forum:**
    * Competition discussed by the participants
    * Open forum
* **Kaggle kernels:**
    * Scripts and notebooks shared by the participants
    * Cloud computational environment
    

* So, we have an opportunity not only to discuss the competition, but also to look at the code. 

<img src='data/kaggle_forums_kernels.png' width="600" height="300" align="center"/>

#### Final submissions (usually 2)
* 1) Best submission on the local validation
* 2) Best submission on the Public Leaderboard

```
# Drop passenger_count column
new_train_1 = train.drop('passenger_count', axis=1)

# Compare validation scores
initial_score = get_cv_score(train)
new_score = get_cv_score(new_train_1)

print('Initial score is {} and the new score is {}'.format(initial_score, new_score))

# Create copy of the initial train DataFrame
new_train_2 = train.copy()

# Find sum of pickup latitude and ride distance
new_train_2['weird_feature'] = new_train_2['pickup_latitude'] + new_train_2['distance_km']

# Compare validation scores
initial_score = get_cv_score(train)
new_score = get_cv_score(new_train_2)

print('Initial score is {} and the new score is {}'.format(initial_score, new_score))