<a href="https://colab.research.google.com/github/zerotodeeplearning/ztdl-masterclasses/blob/master/notebooks/Gradient_Boosting_with_XGBoost_and_LightGBM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learn with us: www.zerotodeeplearning.com

Copyright © 2021: Zero to Deep Learning ® Catalit LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Gradient Boosting with XGBoost and LightGBM

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/zerotodeeplearning/ztdl-masterclasses/master/data/australian_credit.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
y = df.pop('class')
y.value_counts()

In [None]:
numerical_features = list(df.select_dtypes(include='number').columns)
numerical_features

In [None]:
categorical_features = list(df.select_dtypes(exclude='number').columns)
categorical_features

## Baselines

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
  df[numerical_features], y, test_size=0.2, random_state=0)

In [None]:
def train_eval(model):
  """Fit the model on the training split and return (train, test) accuracy."""
  model.fit(X_train, y_train)

  train_score = model.score(X_train, y_train)
  test_score = model.score(X_test, y_test)
  return train_score, test_score

In [None]:
models = [DummyClassifier(strategy='most_frequent'),
          LogisticRegression(solver='liblinear'),
          DecisionTreeClassifier()]

res = []

for model in models:
  mname = model.__class__.__name__
  tr, te = train_eval(model)
  res.append([mname, tr, te])

df_results = pd.DataFrame(res, columns=['model_name',
                                        'train_accuracy',
                                        'test_accuracy'])

df_results.sort_values('test_accuracy', ascending=False)

## Exercise 1: Scikit-Learn

Extend the comparison above with the following scikit-learn models:

- Random Forest
- Extra Trees
- AdaBoost


## Exercise 2: XGBoost with one-hot encoded variables

Let's use XGBoost to classify our data.

- Import `XGBClassifier` from `xgboost`
- Create a new dataset called `df_one_hot` where all categorical variables are one-hot encoded
- Perform a new train/test split on `df_one_hot`
- Retrain all the previous models on the new dataset
- Include `XGBClassifier` in the list of models
- Compare their scores
- BONUS: use `GridSearchCV` to optimize the hyperparameters of `XGBClassifier`
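
The steps above could look roughly like the following sketch. It uses a small toy DataFrame with hypothetical column names (`amount`, `age`, `job`, `housing`) rather than the credit data, and it treats `xgboost` as an optional install so the snippet still runs without it. `pd.get_dummies` is one way to one-hot encode; scikit-learn's `OneHotEncoder` would be another.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

try:  # xgboost is a separate package: pip install xgboost
  from xgboost import XGBClassifier
except ImportError:
  XGBClassifier = None

# Toy stand-in for the credit data (hypothetical columns and labels)
rng = np.random.default_rng(0)
n = 300
df_toy = pd.DataFrame({
  'amount': rng.normal(size=n),
  'age': rng.integers(18, 70, size=n),
  'job': rng.choice(['a', 'b', 'c'], size=n),
  'housing': rng.choice(['own', 'rent'], size=n),
})
y_toy = ((df_toy['amount'] + (df_toy['job'] == 'a')) > 0.3).astype(int)

# One-hot encode every non-numeric column; numeric columns pass through
df_one_hot = pd.get_dummies(df_toy, dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
  df_one_hot, y_toy, test_size=0.2, random_state=0)

models = [LogisticRegression(solver='liblinear'),
          DecisionTreeClassifier(random_state=0)]
if XGBClassifier is not None:
  # Illustrative hyperparameters, not tuned values
  models.append(XGBClassifier(n_estimators=50, max_depth=3))

scores = {}
for model in models:
  model.fit(X_train, y_train)
  scores[model.__class__.__name__] = model.score(X_test, y_test)
print(scores)
```

Because `XGBClassifier` follows the scikit-learn estimator interface (`fit`, `predict`, `score`), it can sit in the same list as the other models, and it can also be passed to `GridSearchCV` for the bonus step.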

## Exercise 3: LightGBM

Let's use LightGBM to classify our data.

- Import `LGBMClassifier` from `lightgbm`
- Train your best model on the one-hot encoded features
- Compare the results with the previous models

- BONUS:
  - Create a new dataset called `df_cat_enc` where all categorical variables are encoded with the `OrdinalEncoder` from `sklearn.preprocessing`, while the numerical features are preserved
  - Perform a new train/test split
  - Train a LightGBM model on this data using the native training API. You will need code along these lines:
    ```python
    import lightgbm as lgb

    # Example parameters (an assumption: binary 0/1 labels); tune for your data
    params = {'objective': 'binary'}
    # Declare the categorical columns on the Dataset itself
    ds_train = lgb.Dataset(X_train, label=y_train,
                           categorical_feature=categorical_features)
    model3 = lgb.train(params, ds_train)
    ```
    Refer to the [documentation](https://lightgbm.readthedocs.io/en/latest/Python-Intro.html) if you're unsure about how to proceed for this step.
  - Compare the scores with those of the previous models
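
For the encoding step of the bonus, a minimal sketch is shown below. It assumes a tiny toy DataFrame with hypothetical columns (`amount`, `job`, `housing`) in place of the credit data. The categorical columns are replaced by integer codes, since LightGBM expects categorical features to be encoded as non-negative integers, and the numerical column is left untouched.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the credit data (hypothetical columns)
df_toy = pd.DataFrame({
  'amount': [1.5, 0.2, 3.1, 2.4],
  'job': ['a', 'b', 'a', 'c'],
  'housing': ['own', 'rent', 'rent', 'own'],
})

# Same selection logic as earlier in the notebook
categorical_features = list(df_toy.select_dtypes(exclude='number').columns)

# Encode each categorical column as integer codes (categories are sorted,
# so 'a' -> 0, 'b' -> 1, 'c' -> 2); numerical columns are preserved as-is
encoder = OrdinalEncoder()
codes = encoder.fit_transform(df_toy[categorical_features]).astype(int)

df_cat_enc = df_toy.copy()
df_cat_enc[categorical_features] = codes
print(df_cat_enc)
```

The fitted `encoder` keeps the category-to-code mapping in `encoder.categories_`, which is useful when the same encoding has to be applied to new data.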
