# Assignment - Boosting with LightGBM

<b>(1)</b> Build regression and classification models with the LightGBM algorithm, reusing the data sets from your earlier projects, and determine the most appropriate parameter values for each model. Compare the results with the models in your projects.

## 1.1. Classification with LightGBM

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn.tree as tree
import scipy
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

df_heart_attack = pd.read_csv('risk_of_heart_attack2.csv')
df_heart_attack.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
2,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
3,56.0,1.0,2.0,120.0,236.0,0.0,0.0,178.0,0.0,0.8,1.0,0.0,3.0,0
4,57.0,0.0,4.0,120.0,354.0,0.0,0.0,163.0,1.0,0.6,1.0,0.0,3.0,0


In [2]:
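import lightgbm as lgb

# Separate the features from the target column 'num'.
# Note: 'Unnamed: 0' (the saved row index) stays in X as a feature here.
X = df_heart_attack.drop('num', axis=1)
y = df_heart_attack['num']

# Stratified 80/20 split keeps the class ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=11, stratify=y)

d_train = lgb.Dataset(data=X_train, label=y_train)

params = {'boosting_type': 'gbdt',
          'objective': 'binary',
          'metric': 'binary_logloss',
          'feature_fraction': 0.5,   # canonical name of 'sub_feature'
          'num_leaves': 10,
          'min_data_in_leaf': 50,    # canonical name of 'min_data'
          'max_depth': 10,
          'verbose': -1,
          'force_row_wise': True,
         }

lgb_model = lgb.train(params, d_train, num_boost_round=100)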

In [3]:
y_prob = lgb_model.predict(X_test)
# Convert predicted probabilities to class labels with a 0.5 threshold
y_pred = [0 if p < 0.5 else 1 for p in y_prob]

In [4]:
from sklearn.metrics import accuracy_score
print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))

Accuracy: 0.944


## 1.2. Regression with LightGBM

In [5]:
df_house_prices = pd.read_csv('house_prices2.csv')
df_house_prices.head()

Unnamed: 0,SalePrice,BsmtQual,PoolQC,YearBuilt,OverallQual,GrLivArea,GarageCars
0,208500,3,0,2003,7,1710,2
1,181500,3,0,1976,6,1262,2
2,223500,3,0,2001,7,1786,2
3,140000,2,0,1915,7,1717,3
4,250000,3,0,2000,8,2198,3


In [6]:
X = df_house_prices.drop("SalePrice", axis=1)
y = df_house_prices["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)

d_train = lgb.Dataset(data=X_train, label=y_train)

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',  # built-in metric; 'rmsle' is not one of LightGBM's metrics
    'max_depth': 3,
    'learning_rate': 0.2,
    'force_row_wise': True,
    'verbose': -1}

lgb_reg_model = lgb.train(params, d_train, num_boost_round = 100)
y_pred = lgb_reg_model.predict(X_test)
y_pred_train = lgb_reg_model.predict(X_train)

In [7]:
from sklearn.metrics import mean_squared_error as mse

rmse = mse(y_test, y_pred)**(1/2)
print("RMSE: {:.3f}".format(rmse))

RMSE: 35337.503


<b>(2)</b> Which parameters can be used to speed up training in LightGBM?

* bagging_fraction : Trains each iteration on a random subset of the rows, which gives faster results. It only takes effect when bagging_freq is set to a value greater than 0.
* feature_fraction : Sets the fraction of features randomly selected at each iteration; values below 1.0 speed up training.
* max_bin : A smaller value of max_bin can save a lot of time, because feature values are bucketed into fewer discrete bins and fewer candidate split points have to be evaluated.
* Use parallel learning.
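
To make the effect of max_bin concrete, the sketch below buckets one continuous feature into bins using plain NumPy. It is a simplified equal-frequency illustration, not LightGBM's actual bin-construction routine, and `bin_feature` is a hypothetical helper written for this example.

```python
import numpy as np


def bin_feature(values, max_bin):
    """Map continuous values to integer bin indices using quantile edges."""
    # max_bin bins need max_bin - 1 interior cut points.
    quantiles = np.linspace(0, 1, max_bin + 1)[1:-1]
    edges = np.unique(np.quantile(values, quantiles))
    # np.digitize returns, for each value, the index of the bin it falls into.
    return np.digitize(values, edges), edges


rng = np.random.default_rng(11)
feature = rng.normal(loc=200.0, scale=50.0, size=10_000)

for max_bin in (255, 15):
    bins, edges = bin_feature(feature, max_bin)
    print("max_bin={}: {} distinct bins, {} candidate split points".format(
        max_bin, len(np.unique(bins)), len(edges)))
```

With fewer bins, a histogram-based learner has fewer thresholds to scan per feature, which is where the time saving comes from; the price is a coarser view of the feature.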

<b>(3)</b> Which parameters can be used for better accuracy in LightGBM?

* Use more training data.
* num_leaves : A high value produces deeper, more complex trees, which can increase accuracy but may lead to overfitting. Very high values are therefore not preferred.
* max_bin : A high value has an effect similar to increasing num_leaves and also slows down training.