# LightGBM (Light Gradient Boosting Machine)

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, and its documentation lists the following advantages:
- Faster training speed and higher efficiency.
- Lower memory usage.
- Better accuracy.
- Support for parallel and GPU learning.
- Ability to handle large-scale data.

## Importing libraries

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Importing classifier and regressor
from lightgbm import LGBMClassifier, LGBMRegressor

from sklearn import metrics
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

## LightGBM regression

### Importing data

In [2]:
from sklearn.datasets import load_diabetes

# Load the diabetes data
X, y = load_diabetes(return_X_y = True)

print(X[:10])
print('---')
print(y[:10])

[[ 0.03807591  0.05068012  0.06169621  0.02187239 -0.0442235  -0.03482076
  -0.04340085 -0.00259226  0.01990749 -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 -0.02632753 -0.00844872 -0.01916334
   0.07441156 -0.03949338 -0.06833155 -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 -0.00567042 -0.04559945 -0.03419447
  -0.03235593 -0.00259226  0.00286131 -0.02593034]
 [-0.08906294 -0.04464164 -0.01159501 -0.03665608  0.01219057  0.02499059
  -0.03603757  0.03430886  0.02268774 -0.00936191]
 [ 0.00538306 -0.04464164 -0.03638469  0.02187239  0.00393485  0.01559614
   0.00814208 -0.00259226 -0.03198764 -0.04664087]
 [-0.09269548 -0.04464164 -0.04069594 -0.01944183 -0.06899065 -0.07928784
   0.04127682 -0.0763945  -0.04117617 -0.09634616]
 [-0.04547248  0.05068012 -0.04716281 -0.01599898 -0.04009564 -0.02480001
   0.00077881 -0.03949338 -0.06291688 -0.03835666]
 [ 0.06350368  0.05068012 -0.00189471  0.06662945  0.09061988  0.10891438
   0.02286863  0.01770335 -0.03581619  0.00306441]


### Creating training and testing sets

In [3]:
# Divide into train and test sets
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state = 23)

print(train_x.shape, train_y.shape)
print(test_x.shape, test_y.shape)

(331, 10) (331,)
(111, 10) (111,)
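
The shapes above follow from the default behaviour of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows are held out, and the test count is rounded up. A minimal sketch of that arithmetic (the helper name `split_sizes` is hypothetical and not part of scikit-learn):

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a float test_size, rounding the test count up."""
    n_test = math.ceil(test_size * n_samples)  # test count is rounded up
    n_train = n_samples - n_test               # the remaining rows go to training
    return n_train, n_test

# The diabetes data has 331 + 111 = 442 rows
print(split_sizes(442))  # (331, 111)
```

Passing an explicit `test_size` to `train_test_split` changes the proportion; the rounding works the same way.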


### Building the LightGBM regressor

In [4]:
# Create a LightGBM regressor. `metric` sets the evaluation metric (RMSE);
# the training objective stays at its default (L2 regression).
model = LGBMRegressor(metric = 'rmse')

# Train the model using the training data.
model.fit(train_x, train_y)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000208 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 561
[LightGBM] [Info] Number of data points in the train set: 331, number of used features: 10
[LightGBM] [Info] Start training from score 147.996979


In [6]:
# R2 score on the training and test sets (`score` returns R2 for regressors)
print('Training Score:', round(model.score(train_x, train_y), 3))
print('Testing Score:', round(model.score(test_x, test_y), 3))

Training Score: 0.922
Testing Score: 0.341


In [7]:
from sklearn.metrics import r2_score, mean_squared_error

# Calculate R2 score and RMSE
y_train_pred = model.predict(train_x)
y_pred = model.predict(test_x)

print('Training R2-Score:', round(r2_score(train_y, y_train_pred), 3))
print('Testing R2-Score:', round(r2_score(test_y, y_pred), 3))
print('Root Mean Square Error:', round(np.sqrt(mean_squared_error(test_y, y_pred)), 3))

Training R2-Score: 0.922
Testing R2-Score: 0.341
Root Mean Square Error: 61.015
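
These R2 values match the earlier `model.score` output because `score` on a regressor returns R2. To make the definitions concrete, here is a small NumPy sketch that computes R2, RMSE and MAE by hand. The function name `regression_report` and the toy arrays are made up for illustration; they are not taken from the diabetes data.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Compute R2, RMSE and MAE directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                 # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }

# Toy values for illustration only
report = regression_report([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print({name: round(value, 3) for name, value in report.items()})
# {'r2': 0.949, 'rmse': 0.612, 'mae': 0.5}
```

RMSE and MAE are expressed in the units of the target, so the test RMSE of 61.015 above is an error in target units, while R2 is unitless.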


## LightGBM classification

### Importing data

In [8]:
# Reading the data
data = pd.read_csv('datasets/data_cleaned.csv')

# Check the data
data.head()

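| | Survived | Age | Fare | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | SibSp_0 | SibSp_1 | ... | Parch_0 | Parch_1 | Parch_2 | Parch_3 | Parch_4 | Parch_5 | Parch_6 | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 7.25 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 71.2833 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 26.0 | 7.925 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 53.1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 35.0 | 8.05 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |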


### Separating independent and dependent variables

In [9]:
# Independent variables
x = data.drop(['Survived'], axis=1)

# Dependent variable
y = data['Survived']

print(x.shape, y.shape)

(891, 24) (891,)


### Creating the training and testing sets

In [10]:
# Divide into train and test sets
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state = 101, stratify = y)

print(train_x.shape, train_y.shape)
print(test_x.shape, test_y.shape)

(668, 24) (668,)
(223, 24) (223,)


### Building the LightGBM classifier

In [11]:
# Create a LightGBM classifier with default hyperparameters
clf = LGBMClassifier()

# Train the model on the training data
clf.fit(train_x, train_y)

[LightGBM] [Info] Number of positive: 256, number of negative: 412
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000323 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 207
[LightGBM] [Info] Number of data points in the train set: 668, number of used features: 16
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383234 -> initscore=-0.475846
[LightGBM] [Info] Start training from score -0.475846
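
The `BoostFromScore` line in this log can be reproduced by hand. For a binary objective, boosting starts from a constant prediction equal to the log-odds of the positive class rate. The sketch below uses the class counts reported in the log and assumes LightGBM's default sigmoid scaling of 1:

```python
import math

n_pos, n_neg = 256, 412                       # class counts from the training log
p_avg = n_pos / (n_pos + n_neg)               # fraction of positive samples
init_score = math.log(p_avg / (1.0 - p_avg))  # log-odds of the positive class

print(f"pavg={p_avg:.6f} -> initscore={init_score:.6f}")
# pavg=0.383234 -> initscore=-0.475846
```

Every tree added afterwards adjusts this starting score, and the final probability is the sigmoid of the accumulated score.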


In [13]:
# Accuracy on the training and test sets (`score` returns mean accuracy for classifiers)
print('Training Score:', round(clf.score(train_x, train_y), 3))
print('Testing Score:', round(clf.score(test_x, test_y), 3))

Training Score: 0.949
Testing Score: 0.789
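
For classifiers, `score` returns mean accuracy: the fraction of predictions that equal the true label. A useful reference point is the accuracy of always predicting the majority class. The sketch below shows both calculations; the toy labels are made up for illustration, while the class counts come from the training log above. Because the split is stratified, the test set should have a similar class balance, though that is an inference rather than something the output shows.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels for illustration only
toy_true = [0, 1, 1, 0, 1]
toy_pred = [0, 1, 0, 0, 1]
print(accuracy(toy_true, toy_pred))  # 0.8

# Majority-class baseline on the training set, using counts from the log
n_pos, n_neg = 256, 412
baseline = max(n_pos, n_neg) / (n_pos + n_neg)
print(round(baseline, 3))  # 0.617
```

Against a baseline of roughly 0.617, the testing accuracy of 0.789 shows the model has learned something useful, while the gap to the training accuracy of 0.949 points to some overfitting.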
