# Gradient Boosting Decision Trees with LightGBM and XGBoost

In [1]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import sklearn
import lightgbm as lgb
import xgboost as xgb
import json

from utils import (Timer, load_airline, convert_related_cols_categorical_to_numeric, 
                  convert_cols_categorical_to_numeric, binarize_prediction,
                  classification_metrics_binary, classification_metrics_binary_prob)

print(f"Numpy version:{np.__version__}")
print(f"Pandas version:{pd.__version__}")
print(f"Sklearn version:{sklearn.__version__}")
print(f"LightGBM version:{lgb.__version__}")
print(f"XGBoost version:{xgb.__version__}")

Numpy version:1.22.3
Pandas version:1.4.1
Sklearn version:1.0.2
LightGBM version:3.3.2
XGBoost version:1.6.0rc1



## XGBoost vs LightGBM

XGBoost started in 2014 and became popular through its use in many winning Kaggle competition entries. It was originally based on a level-wise growth algorithm; an option for leaf-wise growth, which approximates splits using histograms, was added later. We refer to this version as XGBoost hist.

LightGBM is a more recent arrival, started in March 2016 and open-sourced in August 2016. It is based on a leaf-wise algorithm and histogram approximation, and has attracted a lot of attention due to its speed.

Apart from multithreaded CPU implementations, GPU acceleration is now available in both XGBoost and LightGBM.
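
To make the histogram approximation more concrete, here is a minimal, dependency-free sketch of a split search that scans bin boundaries instead of every distinct feature value. It is an illustration only, not the code either library runs: the function name `best_histogram_split`, the equal-width binning, and the unregularized squared-error gain are all assumptions made for this example.

```python
def best_histogram_split(values, targets, n_bins=4):
    """Return (threshold, gain) of the best split found over histogram bins.

    Illustrative assumptions: equal-width bins and a squared-error gain
    without regularization. Real libraries use more elaborate schemes.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant features

    # Build the histogram: per-bin sum of targets and number of samples.
    bin_sum = [0.0] * n_bins
    bin_cnt = [0] * n_bins
    for v, t in zip(values, targets):
        b = min(int((v - lo) / width), n_bins - 1)
        bin_sum[b] += t
        bin_cnt[b] += 1

    total_sum, total_cnt = sum(bin_sum), sum(bin_cnt)
    parent_score = total_sum ** 2 / total_cnt

    best_threshold, best_gain = None, 0.0
    left_sum, left_cnt = 0.0, 0
    # Only the n_bins - 1 bin boundaries are candidate split points.
    for b in range(n_bins - 1):
        left_sum += bin_sum[b]
        left_cnt += bin_cnt[b]
        right_sum, right_cnt = total_sum - left_sum, total_cnt - left_cnt
        if left_cnt == 0 or right_cnt == 0:
            continue
        gain = (left_sum ** 2 / left_cnt
                + right_sum ** 2 / right_cnt
                - parent_score)
        if gain > best_gain:
            best_threshold, best_gain = lo + (b + 1) * width, gain
    return best_threshold, best_gain


# Toy feature and 0/1 target: the target flips between 4 and 5.
feature = [1, 2, 3, 4, 5, 6, 7, 8]
target = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, gain = best_histogram_split(feature, target, n_bins=4)
print(threshold, gain)
```

With 4 bins the search evaluates 3 candidate thresholds rather than 7, which is where the speed-up comes from when a feature has millions of distinct values.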

## Airline dataset

The Airline dataset contains flight arrival and departure details for all commercial flights within the USA from October 1987 to April 2008. It has around 116 million records and takes 5.76 GB of memory. It has 13 features plus the target. The target attribute is the arrival delay (`ArrDelay`), a positive or negative value measured in minutes.

To download the dataset:

```bash
cd data
wget http://kt.ijs.si/elena_ikonomovska/datasets/airline/airline_14col.data.bz2
bzip2 -dk airline_14col.data.bz2
```

In this notebook, we set up a classification problem where the goal is to **classify whether a flight arrived late or not.**

In [2]:
N_ROWS = 10000000

In [3]:
%%time
df_plane = load_airline(nrows=N_ROWS)
print(df_plane.shape)

(10000000, 14)
CPU times: user 9.2 s, sys: 4.25 s, total: 13.4 s
Wall time: 13.4 s


In [4]:
df_plane.head()

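| | Year | Month | DayofMonth | DayofWeek | CRSDepTime | CRSArrTime | UniqueCarrier | FlightNum | ActualElapsedTime | Origin | Dest | Distance | Diverted | ArrDelay |
|--:|--:|--:|--:|--:|--:|--:|:--|--:|--:|:--|:--|--:|--:|--:|
| 0 | 1987 | 10 | 1 | 4 | 1 | 556 | AA | 190 | 247 | SFO | ORD | 1846 | 0 | 27 |
| 1 | 1987 | 10 | 1 | 4 | 5 | 114 | EA | 57 | 74 | LAX | SFO | 337 | 0 | 5 |
| 2 | 1987 | 10 | 1 | 4 | 5 | 35 | HP | 351 | 167 | ICT | LAS | 987 | 0 | 17 |
| 3 | 1987 | 10 | 1 | 4 | 5 | 40 | DL | 251 | 35 | MCO | PBI | 142 | 0 | -2 |
| 4 | 1987 | 10 | 1 | 4 | 8 | 517 | UA | 500 | 208 | LAS | ORD | 1515 | 0 | 17 |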


The first step is to convert the categorical features to numeric features.

In [5]:
%%time
df_plane_numeric = convert_related_cols_categorical_to_numeric(df_plane, col_list=['Origin','Dest'])
del df_plane

CPU times: user 8.84 s, sys: 1.18 s, total: 10 s
Wall time: 9.99 s


In [6]:
df_plane_numeric.head()

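| | Year | Month | DayofMonth | DayofWeek | CRSDepTime | CRSArrTime | UniqueCarrier | FlightNum | ActualElapsedTime | Origin | Dest | Distance | Diverted | ArrDelay |
|--:|--:|--:|--:|--:|--:|--:|:--|--:|--:|--:|--:|--:|--:|--:|
| 0 | 1987 | 10 | 1 | 4 | 1 | 556 | AA | 190 | 247 | 0 | 33 | 1846 | 0 | 27 |
| 1 | 1987 | 10 | 1 | 4 | 5 | 114 | EA | 57 | 74 | 1 | 0 | 337 | 0 | 5 |
| 2 | 1987 | 10 | 1 | 4 | 5 | 35 | HP | 351 | 167 | 2 | 4 | 987 | 0 | 17 |
| 3 | 1987 | 10 | 1 | 4 | 5 | 40 | DL | 251 | 35 | 3 | 41 | 142 | 0 | -2 |
| 4 | 1987 | 10 | 1 | 4 | 8 | 517 | UA | 500 | 208 | 4 | 33 | 1515 | 0 | 17 |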


In [7]:
%%time
df_plane_numeric = convert_cols_categorical_to_numeric(df_plane_numeric, col_list='UniqueCarrier')

CPU times: user 5.12 s, sys: 674 ms, total: 5.8 s
Wall time: 5.78 s
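
In the converted table, `Origin` and `Dest` share one encoding: SFO is `0` and LAS is `4` in both columns. The helper that does this lives in `utils` and is not shown in this notebook, so the following is only a hypothetical sketch of the idea, written with plain dictionaries. The name `encode_related_columns` and the order in which codes are assigned are assumptions.

```python
def encode_related_columns(rows, related_cols):
    """Replace string categories with integer codes shared across columns.

    Hypothetical stand-in for the notebook's helper; the real
    implementation may assign codes in a different order.
    """
    codes = {}
    # Build a single mapping over all related columns, column by column,
    # so the same airport gets the same code wherever it appears.
    for col in related_cols:
        for row in rows:
            codes.setdefault(row[col], len(codes))
    encoded = [
        {**row, **{col: codes[row[col]] for col in related_cols}}
        for row in rows
    ]
    return encoded, codes


flights = [
    {"Origin": "SFO", "Dest": "ORD"},
    {"Origin": "LAX", "Dest": "SFO"},
    {"Origin": "ICT", "Dest": "LAS"},
    {"Origin": "MCO", "Dest": "PBI"},
    {"Origin": "LAS", "Dest": "ORD"},
]
encoded, codes = encode_related_columns(flights, ["Origin", "Dest"])
print(encoded[1])  # LAX -> SFO, with SFO encoded identically in both columns
```

A shared mapping keeps the two columns comparable: a tree split on `Origin == 4` and one on `Dest == 4` refer to the same airport.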


To simplify the pipeline, we set up a classification problem where the goal is to classify whether a flight arrived late or not. For that we need to binarize the variable `ArrDelay`.

If you want to extend this experiment, you can set up a regression problem and try to predict the number of minutes of delay a flight has. Both XGBoost and LightGBM have regression classes.

In [8]:
df_plane_numeric = df_plane_numeric.astype('int16')

In [9]:
%%time
df_plane_numeric['ArrDelayBinary'] = 1*(df_plane_numeric['ArrDelay'] > 0)

CPU times: user 57.3 ms, sys: 140 ms, total: 197 ms
Wall time: 57 ms


In [10]:
df_plane_numeric.head()

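| | Year | Month | DayofMonth | DayofWeek | CRSDepTime | CRSArrTime | UniqueCarrier | FlightNum | ActualElapsedTime | Origin | Dest | Distance | Diverted | ArrDelay | ArrDelayBinary |
|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| 0 | 1987 | 10 | 1 | 4 | 1 | 556 | 0 | 190 | 247 | 0 | 33 | 1846 | 0 | 27 | 1 |
| 1 | 1987 | 10 | 1 | 4 | 5 | 114 | 1 | 57 | 74 | 1 | 0 | 337 | 0 | 5 | 1 |
| 2 | 1987 | 10 | 1 | 4 | 5 | 35 | 2 | 351 | 167 | 2 | 4 | 987 | 0 | 17 | 1 |
| 3 | 1987 | 10 | 1 | 4 | 5 | 40 | 3 | 251 | 35 | 3 | 41 | 142 | 0 | -2 | 0 |
| 4 | 1987 | 10 | 1 | 4 | 8 | 517 | 4 | 500 | 208 | 4 | 33 | 1515 | 0 | 17 | 1 |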
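
The `ArrDelayBinary` column follows the rule `1*(ArrDelay > 0)` from the cell above. As a small sketch in plain Python (the function name `binarize_delay` is made up for this example), the same rule applied to the five delays shown, plus an on-time flight, looks like this:

```python
def binarize_delay(arr_delay_minutes):
    """Return 1 if the flight arrived late (delay > 0 minutes), else 0."""
    # Multiplying a boolean by 1 turns True/False into 1/0,
    # mirroring the expression used on the DataFrame column.
    return 1 * (arr_delay_minutes > 0)


arr_delays = [27, 5, 17, -2, 17, 0]
labels = [binarize_delay(d) for d in arr_delays]
print(labels)
```

Note that a delay of exactly 0 minutes falls into the "not delayed" class, as do early arrivals.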


Once the features are prepared, let's split the dataset into a training set and a test set. We won't use a validation set in this example, although the split function supports one if you want to add it.

In [11]:
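def split_train_val_test_df(df, val_size=0.2, test_size=0.2):
    """Shuffle the rows of df and split them into train, validation and test sets."""
    train, validate, test = np.split(
        df.sample(frac=1),  # frac=1 returns all rows in random order
        # Cut positions: end of the train set, end of the validation set
        [int((1 - val_size - test_size) * len(df)), int((1 - test_size) * len(df))],
    )
    return train, validate, test

def generate_feables(df):
    """Return the features X (all columns except the targets) and the binary label y."""
    X = df[df.columns.difference(['ArrDelay', 'ArrDelayBinary'])]
    y = df['ArrDelayBinary']
    return X, y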

In [12]:
%%time
train, validate, test = split_train_val_test_df(df_plane_numeric, val_size=0, test_size=0.2)
print(train.shape)
print(validate.shape)
print(test.shape)

(8000000, 15)
(0, 15)
(2000000, 15)
CPU times: user 4.04 s, sys: 480 ms, total: 4.52 s
Wall time: 4.49 s
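
The shapes printed above follow directly from the two cut positions passed to `np.split`. The sketch below reproduces only that arithmetic in plain Python; the helper names `split_points` and `split_sizes` are introduced here for illustration and are not part of the notebook.

```python
def split_points(n_rows, val_size=0.2, test_size=0.2):
    """Return the two cut positions used to slice a shuffled dataset."""
    train_end = int((1 - val_size - test_size) * n_rows)
    val_end = int((1 - test_size) * n_rows)
    return train_end, val_end


def split_sizes(n_rows, val_size=0.2, test_size=0.2):
    """Return the (train, validation, test) sizes implied by the cut positions."""
    train_end, val_end = split_points(n_rows, val_size, test_size)
    return train_end, val_end - train_end, n_rows - val_end


# Same settings as the notebook: no validation set, 20% test.
sizes_no_val = split_sizes(10_000_000, val_size=0, test_size=0.2)
# Default settings on a tiny dataset: 60% / 20% / 20%.
sizes_default = split_sizes(10)
print(sizes_no_val, sizes_default)
```

With `val_size=0` both cut positions coincide, which is why the validation set comes out empty.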


In [13]:
%%time
X_train, y_train = generate_feables(train)
X_val, y_val = generate_feables(validate)
X_test, y_test = generate_feables(test)

CPU times: user 119 ms, sys: 42.5 ms, total: 161 ms
Wall time: 155 ms


In [14]:
del train, validate, test

## Training

Now we are going to create two pipelines, one for XGBoost and one for LightGBM. The two libraries grow trees differently, so it is difficult to compare them in exactly the same model setting. XGBoost grows trees depth-wise and controls model complexity with `max_depth`, whereas LightGBM uses a leaf-wise algorithm and controls model complexity with `num_leaves`. As a compromise, we use XGBoost with `max_depth=8`, which allows at most 2^8 = 256 leaves per tree, and compare it with LightGBM with `num_leaves=255`.
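
The relationship between a depth limit and a leaf budget is simple arithmetic on binary trees, sketched below. The function names are made up for this example. The second function shows why the two settings are not equivalent: a leaf budget alone does not bound the depth tightly.

```python
def max_leaves_for_depth(max_depth):
    """Upper bound on the number of leaves of a binary tree of a given depth."""
    return 2 ** max_depth


def max_depth_for_leaves(num_leaves):
    """Deepest binary tree that fits a leaf budget.

    Splitting one leaf adds exactly one leaf, so a tree that always
    splits its newest leaf reaches depth num_leaves - 1.
    """
    return num_leaves - 1


leaf_bound = max_leaves_for_depth(8)         # depth-wise tree with max_depth=8
worst_case_depth = max_depth_for_leaves(255) # leaf-wise tree with num_leaves=255
print(leaf_bound, worst_case_depth)
```

A depth-wise tree with `max_depth=8` is always shallow and balanced, while a leaf-wise tree with a similar leaf budget may spend its leaves on a few very deep branches.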

In [15]:
results_dict = dict()
num_rounds = 200

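Let's start with XGBoost. Note that the pipeline uses `XGBRegressor` fitted on the 0/1 target rather than a classifier class; its outputs are clipped and used as scores in the evaluation below.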

In [16]:
xgb_clf_pipeline = xgb.XGBRegressor(max_depth=8,
                                    n_estimators=num_rounds,
                                    min_child_weight=30,
                                    learning_rate=0.1,
                                    scale_pos_weight=2,
                                    gamma=0.1,
                                    reg_lambda=1,
                                    subsample=1,
                                    n_jobs=-1,
                                    random_state=77)

In [17]:
with Timer() as t:
    xgb_clf_pipeline.fit(X_train, y_train)

In [18]:
results_dict['xgb']={ 'train_time': t.interval }
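
The `Timer` class is imported from `utils` and its source is not shown in this notebook. A context manager with the same usage pattern (a `with` block followed by reading `.interval`) could look like the sketch below. `SimpleTimer` is a hypothetical name, and the real `Timer` may differ, for example in the clock it uses.

```python
import time


class SimpleTimer:
    """Context manager that stores elapsed wall-clock seconds in `interval`."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.interval = time.perf_counter() - self._start
        return False  # do not swallow exceptions raised inside the block


# Usage mirrors the training cells: time a block, then read the interval.
with SimpleTimer() as t:
    total = sum(range(100_000))
elapsed_seconds = t.interval
print(f"{elapsed_seconds:.6f} s")
```

Because `interval` is set in `__exit__`, it is only available after the `with` block has finished, which is why the notebook stores `t.interval` in a separate cell.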

Next, we train the XGBoost model with leaf-wise growth (XGBoost hist).

In [19]:
xgb_hist_clf_pipeline = xgb.XGBRegressor(max_depth=0,
                                        max_leaves=255,
                                        n_estimators=num_rounds,
                                        min_child_weight=30,
                                        learning_rate=0.1,
                                        scale_pos_weight=2,
                                        gamma=0.1,
                                        reg_lambda=1,
                                        subsample=1,
                                        grow_policy='lossguide',
                                        tree_method='hist',
                                        n_jobs=-1,
                                        random_state=77)

In [20]:
with Timer() as t:
    xgb_hist_clf_pipeline.fit(X_train, y_train)

In [21]:
results_dict['xgb_hist']={ 'train_time': t.interval }

Training LightGBM model.

In [22]:
lgbm_clf_pipeline = lgb.LGBMRegressor(num_leaves=255,
                                      n_estimators=num_rounds,
                                      min_child_weight=30,
                                      learning_rate=0.1,
                                      scale_pos_weight=2,
                                      min_split_gain=0.1,
                                      reg_lambda=1,
                                      subsample=1,
                                      n_jobs=-1,
                                      seed=77)

In [23]:
with Timer() as t:
    lgbm_clf_pipeline.fit(X_train, y_train)

In [24]:
results_dict['lgbm']={ 'train_time': t.interval }

## Evaluation

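Now let's evaluate the models on the test set. Each prediction is clipped to the interval [0.0001, 0.9999] so that the regressor outputs can be used as probabilities.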

In [25]:
with Timer() as t:
    y_prob_xgb = np.clip(xgb_clf_pipeline.predict(X_test), 0.0001, 0.9999)

In [26]:
results_dict['xgb']['test_time'] = t.interval

In [27]:
with Timer() as t:
    y_prob_xgb_hist = np.clip(xgb_hist_clf_pipeline.predict(X_test), 0.0001, 0.9999)

In [28]:
results_dict['xgb_hist']['test_time'] = t.interval

In [29]:
with Timer() as t:
    y_prob_lgbm = np.clip(lgbm_clf_pipeline.predict(X_test), 0.0001, 0.9999)

In [30]:
results_dict['lgbm']['test_time'] = t.interval

## Metrics

We are going to obtain some metrics to evaluate the performance of each of the models.

In [31]:
y_pred_xgb = binarize_prediction(y_prob_xgb)
y_pred_xgb_hist = binarize_prediction(y_prob_xgb_hist)
y_pred_lgbm = binarize_prediction(y_prob_lgbm)
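
The two post-processing steps, clipping the raw scores and turning them into class labels, can be sketched in plain Python as follows. `binarize_prediction` comes from `utils` and is not shown here, so the 0.5 threshold and the strict `>` comparison below are assumptions, as are the function names.

```python
def clip(values, low, high):
    """Limit each value to the closed interval [low, high]."""
    return [min(max(v, low), high) for v in values]


def binarize(scores, threshold=0.5):
    """Map scores to 0/1 labels; the 0.5 threshold is an assumed default."""
    return [1 if s > threshold else 0 for s in scores]


# A regressor trained on 0/1 labels can output values outside [0, 1].
raw_scores = [-0.12, 0.31, 0.5, 0.74, 1.08]
probs = clip(raw_scores, 0.0001, 0.9999)
preds = binarize(probs)
print(probs, preds)
```

Clipping does not change which side of the threshold a score falls on; it only keeps the values in a valid range for probability-based metrics.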

In [32]:
report_xgb = classification_metrics_binary(y_test, y_pred_xgb)
report2_xgb = classification_metrics_binary_prob(y_test, y_prob_xgb)
report_xgb.update(report2_xgb)

In [33]:
results_dict['xgb']['performance'] = report_xgb

In [34]:
report_xgb_hist = classification_metrics_binary(y_test, y_pred_xgb_hist)
report2_xgb_hist = classification_metrics_binary_prob(y_test, y_prob_xgb_hist)
report_xgb_hist.update(report2_xgb_hist)

In [35]:
results_dict['xgb_hist']['performance'] = report_xgb_hist

In [36]:
report_lgbm = classification_metrics_binary(y_test, y_pred_lgbm)
report2_lgbm = classification_metrics_binary_prob(y_test, y_prob_lgbm)
report_lgbm.update(report2_lgbm)

In [37]:
results_dict['lgbm']['performance'] = report_lgbm
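
The metric helpers are imported from `utils` and not shown, so the sketch below assumes they compute the standard definitions. The F1 values in the results are consistent with the harmonic mean of the reported precision and recall. The function names here are illustrative, and the pairwise AUC is written for clarity rather than speed.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "Accuracy": (tp + tn) / len(pairs),
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }


def auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This is O(n_pos * n_neg), fine for a toy example only.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s > 0.5 else 0 for s in y_score]
report = binary_metrics(y_true, y_pred)
report["AUC"] = auc(y_true, y_score)
print(report)
```

Accuracy, precision, recall and F1 depend on the chosen threshold, whereas AUC depends only on how the scores rank positives against negatives.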

## Results

In [38]:
print(json.dumps(results_dict, indent=4, sort_keys=True))

{
    "lgbm": {
        "performance": {
            "AUC": 0.8870619223357775,
            "Accuracy": 0.8036615,
            "F1": 0.8198376470987667,
            "Precision": 0.832062582943354,
            "Recall": 0.807966735335025
        },
        "test_time": 24.33524683199994,
        "train_time": 305.5830000979995
    },
    "xgb": {
        "performance": {
            "AUC": 0.8711924262702456,
            "Accuracy": 0.736158,
            "F1": 0.7948506216871666,
            "Precision": 0.6971206959102453,
            "Recall": 0.9244500351782152
        },
        "test_time": 5.3016615609994915,
        "train_time": 348.40394902200023
    },
    "xgb_hist": {
        "performance": {
            "AUC": 0.8846706773183659,
            "Accuracy": 0.7576015,
            "F1": 0.8072960825770015,
            "Precision": 0.7202188993611889,
            "Recall": 0.91832504670835
        },
        "test_time": 9.886299425000288,
        "train_time": 340.21929162499964
    }
}

In [39]:
del xgb_clf_pipeline, xgb_hist_clf_pipeline, lgbm_clf_pipeline, X_train, X_test, X_val

## Summary

The experiments were run on an Intel i7 @ 1.30GHz with 32 GB of RAM.

| Airline subsample size | Lib | Training time (s) | Test time (s) | AUC | F1 |
|:-----------------------|:----|:-----------------:|:-------------:|:---:|:--:|
| 10,000     | xgb      |    0.9712 |   0.0174 | 0.8396 | 0.8121 |
| 10,000     | xgb_hist |    0.8773 |   0.0090 | 0.8312 | 0.8131 |
| 10,000     | lgb      |    0.1645 |   0.0161 | 0.8379 | 0.8113 |
| 100,000    | xgb      |    5.4288 |   0.0227 | 0.9023 | 0.8330 |
| 100,000    | xgb_hist |    1.7466 |   0.0283 | 0.9074 | 0.8405 |
| 100,000    | lgb      |    0.8431 |   0.0478 | 0.9095 | 0.8553 |
| 1,000,000  | xgb      |  113.1993 |   0.3321 | 0.9104 | 0.8407 |
| 1,000,000  | xgb_hist |   38.8424 |   0.5896 | 0.9227 | 0.8554 |
| 1,000,000  | lgb      |   17.8804 |   1.4159 | 0.9255 | 0.8746 |
| 10,000,000 | xgb      |  348.4039 |   5.3016 | 0.8711 | 0.7948 |
| 10,000,000 | xgb_hist |  340.2192 |   9.8862 | 0.8846 | 0.8072 |
| 10,000,000 | lgb      |  305.5830 |  24.3352 | 0.8870 | 0.8198 |

## References
* Lessons Learned From Benchmarking Fast Machine Learning Algorithms. https://docs.microsoft.com/en-us/archive/blogs/machinelearning/lessons-learned-benchmarking-fast-machine-learning-algorithms
* Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. *KDD*. https://github.com/dmlc/xgboost
* Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. *NIPS*. https://github.com/Microsoft/LightGBM
* Gradient Boosted Decision Trees-Explained. https://towardsdatascience.com/gradient-boosted-decision-trees-explained-9259bd8205af
* Can one do better than XGBoost? PyData 2017. https://www.youtube.com/watch?v=5CWwwtEM2TA