In [1]:
import pandas as pd

# Missing Values

Most machine learning libraries (including scikit-learn) raise an error if you try to build a model using data with missing values.  
There are three common strategies for dealing with them:

### 1) Simple: Drop columns with missing values.  
Not a very good approach unless most values in the column are missing, because the model loses access to the information in the dropped columns.

### 2) Imputation: Fill in the missing values with some number.  
For example, we can replace a missing value with the mean value along that column. 

### 3) Extension to Imputation: Filling the missing value, but keeping a record of the entries that were replaced.  
This is achieved by creating a new column with booleans, which are `True` for the location of missing values. 
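
Before turning to the real data, the three strategies can be sketched on a tiny table. This is only an illustration using plain pandas; the column names and values below are made up and are not part of the Melbourne data.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration); NaN marks a missing value
df = pd.DataFrame({
    "rooms": [2, 3, 4, 3],
    "building_area": [80.0, np.nan, 150.0, np.nan],
    "year_built": [1990.0, 2005.0, np.nan, 1970.0],
})

# 1) Drop every column that contains at least one missing value
dropped = df.dropna(axis=1)

# 2) Imputation: replace each missing value with the mean of its column
imputed = df.fillna(df.mean())

# 3) Extended imputation: record where values were missing, then impute
cols_with_missing = [c for c in df.columns if df[c].isnull().any()]
extended = df.copy()
for c in cols_with_missing:
    extended[c + "_was_missing"] = df[c].isnull()
extended[cols_with_missing] = extended[cols_with_missing].fillna(
    df[cols_with_missing].mean()
)

print(dropped.columns.tolist())
print(imputed)
print(extended)
```

Only `rooms` survives the first strategy, which shows how much information dropping can cost when several columns have a few gaps each.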


Let's test those three strategies on the same Melbourne data:

In [2]:
melbourne_file_path = "C:\\Users\\fdoli\\github\\Kaggle\\DataMelbourne\\melb_data.csv"

In [3]:
melbourne_data = pd.read_csv(melbourne_file_path)

In [4]:
melbourne_data.describe()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


In [5]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# select target
y = melbourne_data.Price

# we drop the target column
melb_data_predicts = melbourne_data.drop(['Price'], axis=1)
# we keep only numeric values
X = melb_data_predicts.select_dtypes(exclude=['object'])

In [6]:
X.head()

Unnamed: 0,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
0,2,2.5,3067.0,2.0,1.0,1.0,202.0,,,-37.7996,144.9984,4019.0
1,2,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,-37.8079,144.9934,4019.0
2,3,2.5,3067.0,3.0,2.0,0.0,134.0,150.0,1900.0,-37.8093,144.9944,4019.0
3,3,2.5,3067.0,3.0,2.0,1.0,94.0,,,-37.7969,144.9969,4019.0
4,4,2.5,3067.0,3.0,1.0,2.0,120.0,142.0,2014.0,-37.8072,144.9941,4019.0


In [7]:
# We split the data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

### Define Function to Measure Quality of Each Approach

In [8]:
from sklearn.ensemble import RandomForestRegressor

# function for comparing diff. approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

Score from approach **1)**

In [9]:
# get names of columns with missing values
cols_with_miss = [col for col in X_train.columns if X_train[col].isnull().any()]

# now that we have them, let's drop them from the datasets
reduced_X_train = X_train.drop(cols_with_miss, axis=1)
reduced_X_valid = X_valid.drop(cols_with_miss, axis=1)

# let's check the model!
print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop columns with missing values):
183550.22137772635


In [17]:
cols_with_miss

['Car', 'BuildingArea', 'YearBuilt']

Score from approach **2)**

In [14]:
from sklearn.impute import SimpleImputer

# imputer definition
my_imp = SimpleImputer()
# imputation with "fit_transform"
imputed_X_train = pd.DataFrame(my_imp.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imp.transform(X_valid))

# The imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

# let's check the model!
print('MAE from approach 2 (Imputation)')
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

MAE from approach 2 (Imputation)
178166.46269899711


**Approach 2** performed better than **Approach 1** (lower MAE).

Score from approach **3)**

In [15]:
# we make a copy to avoid changing the original data when imputing
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# create the new columns to indicate what will be imputed
for col in cols_with_miss:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()
    
# imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# As before, imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print('MAE from Approach 3 (Extended Imputation)')
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

MAE from Approach 3 (Extended Imputation)
178927.503183954


**Approach 3** performed slightly worse than **Approach 2**.

### Discussion: Why was imputation better than dropping?

Dropping the three columns (`Car`, `BuildingArea`, `YearBuilt`) also discards the useful information they contain.

In [18]:
# shape of training data
print(X_train.shape)

# Nr of missing values in each column of the training data
missing_val_count_by_columns = X_train.isnull().sum()
print(missing_val_count_by_columns[missing_val_count_by_columns > 0])

(10864, 12)
Car               49
BuildingArea    5156
YearBuilt       4307
dtype: int64


As we can see, at most about half of a column's values are missing: `BuildingArea` lacks 5156 of 10864 values (about 47%), `YearBuilt` lacks 4307 (about 40%), and `Car` only 49.

When imputation does not improve the results even though only a small share of the data is missing, one can change the `strategy` parameter of `SimpleImputer`. For some types of features, `median` might work better than `mean`.
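
A minimal sketch of how the `strategy` parameter changes the filled-in value. The single-column array below is made up; it contains one outlier so that the mean and the median differ noticeably.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up skewed column: one very large value and one missing entry
X_demo = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

filled_mean = mean_imputer.fit_transform(X_demo)
filled_median = median_imputer.fit_transform(X_demo)

print(filled_mean[-1, 0])    # 26.5, the mean of 1, 2, 3, 100
print(filled_median[-1, 0])  # 2.5, the median of 1, 2, 3, 100
```

With skewed features such as land size, the median is usually closer to a "typical" value than the mean, which is one reason it can work better.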

## Categorical Variable

These variables take only a limited number of values. We need to preprocess them, because most machine learning models raise an error if we plug in raw categorical data.  
We will look at three approaches for preparing categorical data:

### 1) Drop Categorical Data  
It only helps if columns do not contain useful information. 

### 2) Label Encoding  
It assigns a different integer to each *unique* categorical value. This assumes an **order** of the categories ('never' < 'rarely' < 'most days' < 'every day'). Categorical variables with such a natural order are called **ordinal variables**. 

### 3) One-Hot Encoding  
It creates a new column indicating the presence/absence of each possible value in the original data, i.e. from 1 column containing 3 distinct categorical values, you get 3 columns with binary values.  
It does not assume any ordering of the categories. Categorical variables without an intrinsic order are called **nominal variables**.   
It is not recommended when the variable takes a *large* number (>15) of distinct values.
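
The difference between the two encodings can be sketched with plain pandas. The survey answers and colours below are made up, and `pd.Categorical` / `pd.get_dummies` are used here only for illustration; the notebook itself uses the scikit-learn encoders further down.

```python
import pandas as pd

# Made-up survey answers with a natural order
freq = pd.Series(["never", "every day", "rarely", "most days", "rarely"])
order = ["never", "rarely", "most days", "every day"]

# Ordinal (label) encoding: integer codes that follow the stated order
ordinal_codes = pd.Categorical(freq, categories=order, ordered=True).codes

# One-hot encoding: one indicator column per distinct value, no order implied
color = pd.Series(["red", "green", "blue", "green"])
one_hot = pd.get_dummies(color)

print(list(ordinal_codes))
print(one_hot.astype(int))
```

Each row of the one-hot table has exactly one column switched on, and the number of new columns equals the number of distinct values, which is why high-cardinality variables blow up the feature count.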

Let's work with an example:

In [58]:
# we have the target "y" and the data "melbourne_data"
X = melbourne_data.drop(['Price'], axis=1)

# divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, 
                                                                test_size=0.2, random_state=0)

# Drop column with missing values (Simple approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()]
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)

# Select categorical columns with rel. low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 
                       and X_train_full[cname].dtype=='object']

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype 
                  in ['int64', 'float64']]

# Keep selected columns only 
my_cols = categorical_cols + numerical_cols
X_train0 = X_train_full[my_cols].copy()
X_valid0 = X_valid_full[my_cols].copy()

In [31]:
X_train0.head()

Unnamed: 0,Type,Method,Regionname,Rooms,Distance,Postcode,Bedroom2,Bathroom,Landsize,Lattitude,Longtitude,Propertycount
12167,u,S,Southern Metropolitan,1,5.0,3182.0,1.0,1.0,0.0,-37.85984,144.9867,13240.0
6524,h,SA,Western Metropolitan,2,8.0,3016.0,2.0,2.0,193.0,-37.858,144.9005,6380.0
8413,h,S,Western Metropolitan,3,12.6,3020.0,3.0,1.0,555.0,-37.7988,144.822,3755.0
2919,u,SP,Northern Metropolitan,3,13.0,3046.0,3.0,1.0,265.0,-37.7083,144.9158,8870.0
6043,h,S,Western Metropolitan,3,13.3,3020.0,3.0,1.0,673.0,-37.7623,144.8272,4217.0


Next, we build a list of the categorical variables in the training data:

In [32]:
s = (X_train0.dtypes == 'object')
object_cols = list(s[s].index)

print('Categorical variables: {}'.format(object_cols))

Categorical variables: ['Type', 'Method', 'Regionname']


Define a function to measure the quality of each approach using MAE

In [43]:
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

### Score from **Approach 1** (Dropping categorical variables)  
We drop the `object` columns using the `select_dtypes` method

In [45]:
drop_X_train = X_train0.select_dtypes(exclude=['object'])
drop_X_valid = X_valid0.select_dtypes(exclude=['object'])

print('MAE from Approach 1 (dropping cat. variables) = {}'.format(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid)))

MAE from Approach 1 (dropping cat. variables) = 175703.48185157913


### Score from Approach 2 (Label encoding)  
Using `LabelEncoder` from scikit-learn. 

In [46]:
from sklearn.preprocessing import LabelEncoder

# make a copy to avoid changing original data
label_X_train = X_train0.copy()
label_X_valid = X_valid0.copy()

# Apply label encoder to each categorical column
label_encoder = LabelEncoder()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train0[col])
    label_X_valid[col] = label_encoder.transform(X_valid0[col])

score = score_dataset(label_X_train, label_X_valid, y_train, y_valid)
print('MAE from Approach 2 (Label Encoding) = {}'.format(score))

MAE from Approach 2 (Label Encoding) = 165936.40548390493


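Here we haven't chosen the order of the labels ourselves: `LabelEncoder` numbers the values of each column in sorted (alphabetical) order, which is arbitrary for these variables. We might expect a performance improvement if the integers reflected a meaningful order.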

### Score from Approach 3 (One-Hot Encoding)  
Using the `OneHotEncoder` class from scikit-learn. Useful parameters for customizing it:
- `handle_unknown='ignore'` to avoid errors when the validation data contains categories that are not present in the training data
- `sparse=False` to ensure the encoded columns are returned as a *numpy array* (instead of a *sparse matrix*). In scikit-learn 1.2 and later this parameter is named `sparse_output`.

To use the encoder, we supply only the categorical columns we want to encode.
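
A small sketch of what `handle_unknown='ignore'` does, on made-up data. The `try`/`except` is an assumption about the installed scikit-learn version: it tries the newer parameter name `sparse_output` first and falls back to `sparse` on older releases.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_demo = pd.DataFrame({"Type": ["h", "u", "t", "h"]})
valid_demo = pd.DataFrame({"Type": ["u", "x"]})  # "x" never appears in training

# Newer scikit-learn releases name the dense-output flag `sparse_output`;
# older ones call it `sparse`. Try the new name first.
try:
    encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
except TypeError:
    encoder = OneHotEncoder(handle_unknown="ignore", sparse=False)

train_encoded = encoder.fit_transform(train_demo)
valid_encoded = encoder.transform(valid_demo)

print(encoder.categories_)  # categories learned from the training data
print(valid_encoded)        # the unseen "x" becomes a row of zeros
```

The unseen category does not raise an error; it is simply encoded as all zeros, so the validation matrix keeps the same columns as the training matrix.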

In [48]:
from sklearn.preprocessing import OneHotEncoder

# Apply it to each categorical column
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train0[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid0[object_cols]))

# The encoding removed the index, put it back
OH_cols_train.index = X_train0.index
OH_cols_valid.index = X_valid0.index

# Remove the original categorical columns, as they will be replaced by one-hot encoded ones
num_X_train = X_train0.drop(object_cols, axis=1)
num_X_valid = X_valid0.drop(object_cols, axis=1)

# Add the new "one-hot encoded" columns
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

score = score_dataset(OH_X_train, OH_X_valid, y_train, y_valid)
print('MAE from Approach 3 (One-Hot Encoding) = {}'.format(score))

MAE from Approach 3 (One-Hot Encoding) = 166089.4893009678


**Approach 1** is the worst. We cannot conclude anything meaningful about **2** versus **3**, because their scores are very close.

In general, **Approach 3** (One-Hot Encoding) typically performs best and **Approach 1** (dropping the categorical columns) worst.  

In [49]:
object_cols

['Type', 'Method', 'Regionname']

In [52]:
len(object_cols)

3

## Pipelines  

A pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step. The benefits are:  
1. **Cleaner Code**
2. **Fewer Bugs**
3. **Easier to Productionize**
4. **More Options for Model Validation**

Let's see an Example:  
We rebuild *X_train, X_valid, y_train, y_valid* from the same Melbourne data, this time keeping the columns with missing values.  

In [72]:
# read the data again
data = pd.read_csv(melbourne_file_path)

y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, 
                                                                random_state=0)

# Select categorical cols with rel. low cardinality
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                   X_train_full[cname].dtype=='object']

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

We look at the data and we find some missing values for both categorical and numerical data.  
With a pipeline, it's easy to deal with both.

In [73]:
X_train.head()

Unnamed: 0,Type,Method,Regionname,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
12167,u,S,Southern Metropolitan,1,5.0,3182.0,1.0,1.0,1.0,0.0,,1940.0,-37.85984,144.9867,13240.0
6524,h,SA,Western Metropolitan,2,8.0,3016.0,2.0,2.0,1.0,193.0,,,-37.858,144.9005,6380.0
8413,h,S,Western Metropolitan,3,12.6,3020.0,3.0,1.0,1.0,555.0,,,-37.7988,144.822,3755.0
2919,u,SP,Northern Metropolitan,3,13.0,3046.0,3.0,1.0,1.0,265.0,,1995.0,-37.7083,144.9158,8870.0
6043,h,S,Western Metropolitan,3,13.3,3020.0,3.0,1.0,2.0,673.0,673.0,1970.0,-37.7623,144.8272,4217.0


We will construct the full pipeline in three steps
### Step 1: Define Preprocessing Steps  

The `ColumnTransformer` class bundles different preprocessing steps. This code:
- imputes missing values in **numerical** data, and
- imputes missing values and applies a one-hot encoding to **categorical** data

In [74]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

### Step 2: Define the Model  
Next, we define a random forest model with the familiar `RandomForestRegressor` class

In [75]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

### Step 3: Create and Evaluate the Pipeline  

Finally, we use the `Pipeline` class to define a pipeline that bundles the preprocessing and modeling steps.  
Important:
- With the pipeline, we preprocess the training data and fit the model in a single line of code. This makes the whole process cleaner and faster. 
- With the pipeline, we supply the unprocessed features in `X_valid` to the `predict()` command, and the pipeline automatically preprocesses the features before generating the predictions. 

In [76]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(
    steps=[('preprocessor', preprocessor),
           ('model', model)
          ])

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE (with pipeline) = {}'.format(score))

MAE (with pipeline) = 160679.18917034855


### Conclusion

Pipelines keep the code very clean. They are especially useful for workflows with sophisticated data preprocessing.
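
As a compact recap, the sketch below rebuilds the same kind of pipeline on a handful of made-up rows and then feeds it raw, unprocessed rows. The data, the `median` strategy and the small forest are assumptions chosen for the illustration, not the settings used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training data with gaps in both column types
train_demo = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 2, 5],
    "Landsize": [150.0, np.nan, 600.0, 420.0, np.nan, 800.0],
    "Type": pd.Series(["u", "h", "h", np.nan, "u", "h"], dtype=object),
})
prices_demo = pd.Series([500_000, 900_000, 1_400_000, 1_000_000, 450_000, 1_800_000])

preprocessor_demo = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["Rooms", "Landsize"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Type"]),
])
pipeline_demo = Pipeline(steps=[
    ("preprocessor", preprocessor_demo),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])
pipeline_demo.fit(train_demo, prices_demo)

# New raw rows: a missing number and a category unseen during training
new_rows = pd.DataFrame({
    "Rooms": [3, 4],
    "Landsize": [np.nan, 500.0],
    "Type": pd.Series(["t", "h"], dtype=object),
})
new_preds = pipeline_demo.predict(new_rows)
print(new_preds)
```

No manual imputation or encoding is needed at prediction time; the fitted pipeline applies the statistics learned from the training rows.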

## Cross Validation

*For a better measure of model performance*

For a fixed amount of data, there is a tradeoff between the size of the training set and the size of the validation set, and both benefit from being large. A larger validation set means less randomness (noise) in the measured quality of the model. However, enlarging the validation set leaves fewer rows for training, and thus a worse model. 

### What is the procedure then?

We divide the whole dataset into several parts (folds) and rotate which part serves as the validation data, repeating the training-validation process each time. This way, every row is used for validation exactly once, so 100% of the data contributes to the measure of model quality. 
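
The rotation can be made visible with scikit-learn's `KFold` on ten made-up rows. This sketch only shows which rows land in the validation part of each fold; no model is trained.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 10
X_demo = np.arange(n_rows).reshape(-1, 1)  # made-up data: ten rows, one feature

times_validated = np.zeros(n_rows, dtype=int)
kfold = KFold(n_splits=5)
for fold, (train_idx, valid_idx) in enumerate(kfold.split(X_demo)):
    # Within a fold, no row is in both the training and the validation part
    assert set(train_idx).isdisjoint(valid_idx)
    times_validated[valid_idx] += 1
    print("fold", fold, "validation rows:", valid_idx.tolist())

# Every row has served as validation data exactly once
print(times_validated.tolist())
```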

### When to use it?

Since cross-validation is computationally expensive, there are two cases:
- For small datasets, the extra computational burden isn't a big deal, so cross-validation should be applied.
- For larger datasets, a single validation subset might be enough. 

But when is a dataset small or large? As a rule of thumb, if the model takes a couple of minutes or less to run, treat the dataset as small and use cross-validation.

Alternatively, run cross-validation once and compare the scores of the individual folds. If every fold gives nearly the same result, a single validation set is probably sufficient. 

### Example:

Using the same data as in the pipeline section.  
We use an Imputer to fill in missing values and a Random Forest model to make predictions.  
Cross-validation is easier with pipelines than without them.

In [83]:
import pandas as pd

# Read the data
data = pd.read_csv(melbourne_file_path)

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

In [84]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(
    steps=[
        ('preprocessor', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators=50, random_state=0))
    ])

We obtain the **cross-validation scores** from the `cross_val_score` function from scikit-learn.  
The `cv` parameter sets the number of folds (**experiments**).

In [87]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y, 
                              cv=5, 
                              scoring='neg_mean_absolute_error')

print('MAE scores:\n {}'.format(scores))

MAE scores:
 [301628.7893587  303164.4782723  287298.331666   236061.84754543
 260383.45111427]


The value `'neg_mean_absolute_error'` is one of the measures of model quality that the `scoring` parameter can take. The list of options can be found in the scikit-learn documentation. 

*(scikit-learn has a convention where all metrics are defined so that a higher number is better. That's why it reports the negative MAE.)*

To get a single number, we take the mean over all the **experiments**.

In [88]:
print('Average MAE score (across experiments): {}'.format(scores.mean()))

Average MAE score (across experiments): 277707.3795913405


### Conclusion

Better measure of model quality. Cleaner code.  
Note: We don't need to keep track of **separate** training and validation sets! This is a good improvement for small datasets. 

## XGBoost

*A gradient boosting method that performs very well*

It is an *ensemble method*, just like random forest: it combines the predictions of several models, which improves on the performance of any *single model*.

**Gradient Boosting** goes through cycles to iteratively add models into an ensemble. 

The method starts with a single *naïve* model, which might be wildly inaccurate (subsequent additions to the ensemble will address those errors). After that, we start the cycle:  
- First, we make predictions with the current ensemble.
- These predictions are used to calculate a *loss function* (like *mean squared error*, for instance).
- We use the loss function to fit a new model. This fitting determines the model parameters so that adding the new model to the ensemble will reduce the loss (*gradient descent* on the loss function is used to determine the parameters of this new model).
- We add the new model to the ensemble.
- Repeat
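
The cycle above can be sketched from scratch for the squared-error loss, where the negative gradient is simply the residual. This is a simplified illustration on made-up data using small scikit-learn decision trees; it is not how XGBoost is implemented internally, and `learning_rate` and `n_rounds` are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up 1-D regression problem
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Naive starting model: always predict the mean of the target
base_prediction = y_demo.mean()
ensemble = []
current = np.full_like(y_demo, base_prediction)
mse_history = [np.mean((y_demo - current) ** 2)]

for _ in range(n_rounds):
    # For squared error, the negative gradient of the loss is the residual
    residuals = y_demo - current
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residuals)
    ensemble.append(tree)
    # Add the new model's (scaled) predictions to the ensemble's predictions
    current += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - current) ** 2))

def predict(X_new):
    """Prediction of the whole ensemble: start value plus scaled tree outputs."""
    total = np.full(X_new.shape[0], base_prediction)
    for fitted_tree in ensemble:
        total += learning_rate * fitted_tree.predict(X_new)
    return total

print(mse_history[0], mse_history[-1])
```

The training error shrinks with every round, which is exactly why too many rounds eventually overfit and why validation data is needed to decide when to stop.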


### Example

In [122]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv(melbourne_file_path)

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

# Separate data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y)

We'll work with the **XGBoost** library, which stands for **extreme gradient boosting**. (Scikit-learn has a version of *gradient boosting*, but *XGBoost* has some technical advantages.)

We need to use the scikit-learn API for XGBoost (`xgboost.XGBRegressor`).

In [129]:
from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

We make predictions and evaluate the model

In [130]:
from sklearn.metrics import mean_absolute_error

predictions = my_model.predict(X_valid)
print('MAE: {}'.format(mean_absolute_error(y_valid, predictions)))

MAE: 263981.7765463918


### Parameter Tuning

This model is highly sensitive to parameter adjustments. The first ones to keep in mind are:  
**`n_estimators`**  
It specifies how many times to go through the modeling *cycle* described above. It is equal to the number of models that we include in the ensemble. 
- Too small a value -> underfitting -> inaccurate predictions on both training and test data
- Too large a value -> overfitting -> accurate predictions on training data, but inaccurate predictions on test data (which is the important part)  

Typical values range from `100` to `1000`, though this depends *a lot* on the `learning_rate` parameter.  
Let's set the number of models in the ensemble:

In [131]:
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=500,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

---
**`early_stopping_rounds`**  
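It offers a way to automatically find the ideal value for `n_estimators`. It causes the model to stop iterating when the validation score stops improving, even if we haven't reached the limit set by `n_estimators`. The usual practice is to set a high value for the number of models and use `early_stopping_rounds` to find the optimal time to stop iterating. 

It's better to set a value of around `5` instead of `1`: a single round without improvement can happen by chance, so waiting for several consecutive rounds gives more confidence that the validation score has really stopped improving.

When using `early_stopping_rounds`, we need to put aside some data for calculating the validation scores, by using the `eval_set` parameter. Note that the code below matches the XGBoost version used in this notebook; newer XGBoost releases may expect `early_stopping_rounds` in the `XGBRegressor(...)` constructor rather than in `fit()`.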

Let's modify the previous code:

In [132]:
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train, 
            early_stopping_rounds=5, 
            eval_set=[(X_valid, y_valid)],
            verbose=False)



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=500,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)
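
To make the stopping rule itself concrete, here is a sketch that applies a patience of 5 rounds by hand. It swaps in scikit-learn's `GradientBoostingRegressor` and its `staged_predict` method instead of XGBoost, uses made-up data, and fits all trees first before scanning them, so it illustrates the rule rather than the time savings of real early stopping.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Made-up regression data
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(400, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=400)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

gb_model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1, random_state=0)
gb_model.fit(X_tr, y_tr)

patience = 5  # plays the role of early_stopping_rounds
best_mae, best_n, rounds_without_gain = np.inf, 0, 0
# staged_predict yields the validation predictions after 1, 2, 3, ... trees
for n_trees, staged_preds in enumerate(gb_model.staged_predict(X_va), start=1):
    mae = mean_absolute_error(y_va, staged_preds)
    if mae < best_mae:
        best_mae, best_n, rounds_without_gain = mae, n_trees, 0
    else:
        rounds_without_gain += 1
        if rounds_without_gain >= patience:
            break  # validation score has not improved for `patience` rounds

print("best number of trees:", best_n)
print("validation MAE at that point:", best_mae)
```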

---
**`learning_rate`**  
It's a *small* scaling factor by which the predictions from each component model are multiplied before they are added to the ensemble.  
This means each tree we add to the ensemble helps us less. We can thus set a higher value for `n_estimators` without *overfitting*! If we use `early_stopping_rounds`, the optimal number of trees will be determined automatically.  
In general, a small learning rate combined with a large number of estimators works best. This is computationally expensive, though.  
In the XGBoost version used here, the default value is `learning_rate=0.1`.  

Let's modify the code again:

In [133]:
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)], 
             verbose=False)



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.05, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

---
**`n_jobs`**  
It controls parallel computing. It's useful for larger datasets where runtime is a consideration. It's common to set `n_jobs` equal to the number of cores on the machine.  
It does not make the model any better, only faster to fit, so it is only worth setting on *larger* datasets. 

Modifying the model:

In [134]:
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train, 
            early_stopping_rounds=5, 
            eval_set=[(X_valid, y_valid)], 
            verbose=False)



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.05, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
             n_jobs=4, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

### Conclusion

**XGBoost** is a leading software library for working with standard tabular data.  
It is highly sensitive to *parameter tuning*.

## Data Leakage  
*How to prevent it?*

**Leakage** happens when the training data contains information about the target that will not be available when the model is used for prediction. The model then performs well on the training (and even the validation) data, but poorly in *production*.

In other words, leakage makes a model look good until it is actually used to make predictions, at which point it turns out to be very inaccurate.  
There are two main types of leakage: **target leakage** and **train-test contamination**.

### Target leakage  
It occurs when the predictors include data that will not be available at the time you make predictions. Such predictors are strongly correlated with the target only because they are a consequence of it.  
An example is the relation between having *pneumonia* and taking *antibiotic medicine*: people who get pneumonia are usually prescribed antibiotics, so the two are strongly related. 

Everything boils down to the *time* at which those features are updated. If the target is `got_pneumonia`, then `took_antibiotic_medicine` is usually updated after the target value is known. This is a problem, because at prediction time we do not yet know whether the patient will take antibiotics. 

To prevent target leakage, any variable that is updated after the target value is realized should be excluded. 
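
The effect can be sketched on synthetic data. Everything below is made up: `age` and `weight` are random noise, and `took_antibiotic_medicine` is generated as a copy of the target with 5% of the entries flipped, mimicking a feature recorded after the diagnosis.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n = 500
got_pneumonia = rng.randint(0, 2, size=n)

# Leaky feature: recorded after the diagnosis, so it mirrors the target (5% noise)
flip = rng.rand(n) < 0.05
took_antibiotic_medicine = np.where(flip, 1 - got_pneumonia, got_pneumonia)

X_demo = pd.DataFrame({
    "age": rng.randint(20, 80, size=n),
    "weight": rng.normal(75, 12, size=n),
    "took_antibiotic_medicine": took_antibiotic_medicine,
})
y_demo = pd.Series(got_pneumonia)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
acc_with_leak = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy").mean()

# Exclude the feature that is only known after the target is realized
X_no_leak = X_demo.drop(columns=["took_antibiotic_medicine"])
acc_without_leak = cross_val_score(clf, X_no_leak, y_demo, cv=5, scoring="accuracy").mean()

print("accuracy with leaky feature:   ", acc_with_leak)
print("accuracy without leaky feature:", acc_without_leak)
```

The high score with the leaky feature is an illusion: validation cannot detect the problem, because the validation rows contain the same leaked information.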

### Train-Test Contamination  
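It occurs when one is not careful to distinguish training data from validation data. Validation is meant to measure how the model does on data it has not seen before, so the validation data must not influence the preprocessing.

Performing preprocessing steps (for example, fitting an imputer) before calling `train_test_split()` to split the data into training and validation sets can cause this contamination. 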
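
A minimal sketch of the difference, on made-up data with roughly 20% of the values missing. The "leaky" imputer learns its mean from all rows, while the "clean" one is fitted after the split on the training rows only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 1))
X_demo[rng.rand(200) < 0.2, 0] = np.nan  # knock out about 20% of the values
y_demo = rng.normal(size=200)

# Contaminated: the mean is computed from ALL rows, validation rows included
leaky_imputer = SimpleImputer().fit(X_demo)

# Clean: split first, then learn the mean from the training rows only
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
clean_imputer = SimpleImputer().fit(X_tr)
X_tr_imputed = clean_imputer.transform(X_tr)
X_va_imputed = clean_imputer.transform(X_va)

print("mean learned from all rows:     ", leaky_imputer.statistics_[0])
print("mean learned from training rows:", clean_imputer.statistics_[0])
```

Putting the imputer inside a scikit-learn `Pipeline`, as in the pipeline section above, gives the clean behaviour automatically, including inside each fold of `cross_val_score`.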