This example uses a dataset that contains information about individuals from various countries. The goal is to predict whether a person earns <=50K or >50K annually, based on the other information available. The dataset consists of 32,561 observations and 14 features describing each individual.

Here is the link to the dataset: http://archive.ics.uci.edu/ml/datasets/Adult.

Look through the dataset first to build an intuition for the predictor variables; this will make the code below easier to follow.

## Important Parameters of LightGBM
- task: default=train; options=train, prediction. Specifies the task to perform, either training or prediction.
- application: default=regression; type=enum; options:
    - regression: regression task
    - binary: binary classification
    - multiclass: multiclass classification
    - lambdarank: lambdarank application
- data: type=string. The training data; LightGBM trains from this data.
- num_iterations: default=100; type=int. Number of boosting iterations to perform.
- num_leaves: default=31; type=int. Number of leaves in one tree.
- device: default=cpu; options=gpu, cpu. Device on which the model is trained. A GPU can speed up training.
- max_depth: Maximum depth to which a tree may grow. Used to limit overfitting.
- min_data_in_leaf: Minimum number of data points in one leaf.
- feature_fraction: default=1. Fraction of features to use in each iteration.
- bagging_fraction: default=1. Fraction of the data to use in each iteration; generally used to speed up training and avoid overfitting.
- min_gain_to_split: default=0. Minimum gain required to perform a split.
- max_bin: Maximum number of bins used to bucket the feature values.
- min_data_in_bin: Minimum number of data points in one bin.
- num_threads: default=OpenMP_default; type=int. Number of threads for LightGBM.
- label: type=string. Specifies the label column.
- categorical_feature: type=string. Specifies the categorical features to use for training the model.
- num_class: default=1; type=int. Used only for multiclass classification.
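
As a rough illustration of how the parameters above fit together, here is a minimal sketch of a parameter dictionary for a binary classification task. The values are illustrative assumptions, not tuned recommendations. `objective` is used here as the documented alias of `application`, and `bagging_freq` (not listed above) is included because bagging only takes effect when it is non-zero.

```python
# Hypothetical parameter set built from the names listed above.
# All values are illustrative assumptions, not tuned settings.
params = {
    "objective": "binary",    # alias of `application`
    "num_leaves": 31,         # leaves per tree
    "max_depth": 7,           # cap on tree depth to limit overfitting
    "min_data_in_leaf": 20,   # minimum data points per leaf
    "feature_fraction": 0.8,  # fraction of features sampled per iteration
    "bagging_fraction": 0.8,  # fraction of rows sampled per iteration
    "bagging_freq": 1,        # assumption: needed for bagging to take effect
    "max_bin": 255,           # maximum number of histogram bins per feature
}

# Simple sanity check: both sampling fractions must lie in (0, 1].
fraction_keys = ("feature_fraction", "bagging_fraction")
fractions_ok = all(0 < params[k] <= 1 for k in fraction_keys)
print(fractions_ok)
```

A dictionary like this would be passed as the first argument to `lgb.train`, in the same way as `param` later in this notebook.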

In [7]:
#importing standard libraries 
import numpy as np 
import pandas as pd 
from pandas import Series, DataFrame 

#import lightgbm and xgboost 
import lightgbm as lgb 
import xgboost as xgb 

#loading our training dataset 'adult.data' into 'data' using pandas 
data=pd.read_csv('../data/raw/adult/adult.data',header=None) 

#Assigning names to the columns 
data.columns=['age','workclass','fnlwgt','education','education-num','marital_Status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','Income'] 

#glimpse of the dataset 
data.head() 


| | age | workclass | fnlwgt | education | education-num | marital_Status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [12]:

# Label Encoding our target variable 
from sklearn.preprocessing import LabelEncoder
l=LabelEncoder() 
l.fit(data.Income) 

print(l.classes_)
data.Income=Series(l.transform(data.Income))  #replace the income labels with 0/1 codes 
data.Income.value_counts() 

[0 1]


In [13]:

#One Hot Encoding of the Categorical features 
one_hot_workclass=pd.get_dummies(data.workclass) 
one_hot_education=pd.get_dummies(data.education) 
one_hot_marital_Status=pd.get_dummies(data.marital_Status) 
one_hot_occupation=pd.get_dummies(data.occupation)
one_hot_relationship=pd.get_dummies(data.relationship) 
one_hot_race=pd.get_dummies(data.race) 
one_hot_sex=pd.get_dummies(data.sex) 
one_hot_native_country=pd.get_dummies(data.native_country) 

#removing categorical features 
data.drop(['workclass','education','marital_Status','occupation','relationship','race','sex','native_country'],axis=1,inplace=True) 

 

#Merging one hot encoded features with our dataset 'data' 
data=pd.concat([data,one_hot_workclass,one_hot_education,one_hot_marital_Status,one_hot_occupation,one_hot_relationship,one_hot_race,one_hot_sex,one_hot_native_country],axis=1) 

#removing duplicate columns (the same category name can occur in more than one feature) 
_, i = np.unique(data.columns, return_index=True) 
data=data.iloc[:, i] 

#Here our target variable is 'Income' with values as 1 or 0.  
#Separating our data into features dataset x and our target dataset y 
x=data.drop('Income',axis=1) 
y=data.Income 



#Imputing missing values in our target variable 
y.fillna(y.mode()[0],inplace=True) 

#Now splitting our dataset into test and train 
from sklearn.model_selection import train_test_split 
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.3)
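
The duplicate-column removal in the cell above is needed because unprefixed dummy columns from different features can share a name (in this dataset, presumably the missing-value marker `?`), and dropping duplicates by name merges them into one column. A minimal alternative sketch, using a hypothetical toy frame rather than the real data, lets `pd.get_dummies` prefix each dummy with its source column so every name stays unique:

```python
import pandas as pd

# Toy frame (hypothetical values) where two categorical columns share the level '?'
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["Sales", "?", "?"],
    "age": [39, 50, 38],
})

# Passing `columns=` encodes only those columns and prefixes each dummy
# with its source column name, e.g. 'workclass_?' and 'occupation_?'.
encoded = pd.get_dummies(toy, columns=["workclass", "occupation"])
print(sorted(encoded.columns))
```

With prefixed names, no deduplication step is required and the two `?` indicators remain separate features.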

# Applying xgboost
 

In [16]:
#The data is stored in a DMatrix object 
#label is used to define our outcome variable
dtrain=xgb.DMatrix(x_train,label=y_train)
dtest=xgb.DMatrix(x_test)

In [17]:
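#setting parameters for xgboost
#'eta' is an alias of 'learning_rate', so only one of the two is set;
#'verbosity' replaces the deprecated 'silent' flag in recent xgboost versions
parameters={'max_depth':7, 'verbosity':0,'objective':'binary:logistic','eval_metric':'auc','learning_rate':.05}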

In [22]:
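#training our model 
num_round=50
xg=xgb.train(parameters,dtrain,num_round) 

#timing the training separately; %timeit does not keep variables assigned inside the timed statement 
%timeit xgb.train(parameters,dtrain,num_round) 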


1.73 s ± 57.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
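
Outside IPython, where the `%time` and `%timeit` magics are unavailable, a similar measurement can be sketched with the standard library. The helper name `time_call` and the stand-in workload are assumptions made for illustration; in this notebook the timed callable would be something like `lambda: xgb.train(parameters, dtrain, num_round)`.

```python
import statistics
import time

def time_call(fn, repeats=7):
    """Run fn() `repeats` times; return (last_result, mean_seconds, stdev_seconds)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        durations.append(time.perf_counter() - start)
    return result, statistics.mean(durations), statistics.stdev(durations)

# Stand-in workload (hypothetical) so the sketch runs without any ML library.
result, mean_s, std_s = time_call(lambda: sum(range(100_000)))
print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e3:.3f} ms per run")
```

Unlike `%timeit`, this helper returns the result of the last run, so the trained model can be kept while it is being timed.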


In [24]:
#now predicting on the test set with our model 
ypred=xg.predict(dtest) 
ypred

array([ 0.95747328,  0.30730107,  0.18299842, ...,  0.95786685,
        0.46510884,  0.05661203], dtype=float32)

In [25]:
#Converting probabilities into 1 or 0 with a threshold of .5 
#(vectorized, so it does not depend on the hard-coded test-set size) 
ypred=np.where(ypred>=.5,1,0) 

In [26]:
#calculating accuracy of our model 
from sklearn.metrics import accuracy_score 
accuracy_xgb = accuracy_score(y_test,ypred) 
accuracy_xgb


0.86446923943085274

# Applying LightGBM

In [29]:
train_data=lgb.Dataset(x_train,label=y_train)

In [30]:
#setting parameters for lightgbm
param = {'num_leaves':150, 'objective':'binary','max_depth':7,'learning_rate':.05,'max_bin':200}
param['metric'] = ['auc', 'binary_logloss']

###### Here we have set max_depth in xgb and LightGBM to 7 to have a fair comparison between the two.
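
One detail worth checking when combining `num_leaves` with `max_depth`: a binary tree of depth d has at most 2^d leaves, so the depth cap also bounds the number of leaves. The small calculation below uses the values from `param` above and is only a sanity check, not part of the training code.

```python
max_depth = 7
num_leaves = 150

# A binary tree of depth d can have at most 2**d leaves.
max_leaves_at_depth = 2 ** max_depth

# The smaller of the two limits is the one that can actually be reached.
effective_leaves = min(num_leaves, max_leaves_at_depth)
print(max_leaves_at_depth, effective_leaves)  # 128 128
```

With `max_depth` set to 7, the `num_leaves` value of 150 can never be reached; each tree is limited to 128 leaves.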

In [38]:
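#training our model using LightGBM
num_round=50
lgbm=lgb.train(param,train_data,num_round)

#timing the training separately; %timeit does not keep variables assigned inside the timed statement
%timeit lgb.train(param,train_data,num_round)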


208 ms ± 37.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [37]:
#predicting on test set
ypred2=lgbm.predict(x_test)
ypred2[0:5]  # showing first 5 predictions

array([ 0.95909121,  0.31404598,  0.18614527,  0.13907408,  0.37017331])

In [39]:
#converting probabilities into 0 or 1 with a threshold of .5
#(vectorized, so it does not depend on the hard-coded test-set size)
ypred2=np.where(ypred2>=.5,1,0)

In [42]:
#calculating accuracy (true labels first, matching the xgboost call)
accuracy_lgbm = accuracy_score(y_test,ypred2)
print('lgbm: %f' % accuracy_lgbm)
y_test.value_counts()

lgbm: 0.863957


0    7384
1    2385
Name: Income, dtype: int64
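
The class counts printed above give a useful reference point for the accuracy figures: a model that always predicts the majority class already scores well above 50%. The sketch below only reuses the counts from that output.

```python
# Class counts copied from the y_test.value_counts() output above
counts = {0: 7384, 1: 2385}

total = sum(counts.values())
# Accuracy of always predicting the most frequent class
baseline = max(counts.values()) / total
print(f"majority-class baseline accuracy: {baseline:.4f}")
```

The baseline is roughly 0.756, so accuracies of about 0.864 are an improvement of around 11 percentage points over always predicting `<=50K`.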

In [43]:
from sklearn.metrics import roc_auc_score

In [44]:
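#calculating roc_auc_score for xgboost
#note: ypred was thresholded to 0/1 above, so this scores the hard labels, not the predicted probabilities
auc_xgb =  roc_auc_score(y_test,ypred)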
auc_xgb

0.77721670289435363
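
The AUC values in this section are computed from predictions that were already thresholded to 0 or 1. AUC is a ranking metric, so it is normally computed on the predicted probabilities; thresholding collapses the ranking into two groups and usually changes the score. The sketch below illustrates this with a small hand-written AUC function (`auc` is a helper defined here, not a library call) and hypothetical toy values.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and probabilities (hypothetical values)
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.1, 0.4, 0.45, 0.8, 0.3, 0.9]
hard = [1 if p >= 0.5 else 0 for p in probs]  # same .5 threshold as above

auc_probs = auc(y_true, probs)
auc_hard = auc(y_true, hard)
print(auc_probs, round(auc_hard, 4))  # 1.0 0.8333
```

In this toy case the probabilities rank every positive above every negative, yet the thresholded labels score lower because the positive at 0.45 becomes tied with all the negatives.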

In [49]:
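#calculating roc_auc_score for LightGBM
#note: ypred2 was thresholded to 0/1 above, so this scores the hard labels, not the predicted probabilities
auc_lgbm = roc_auc_score(y_test,ypred2)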
comparison_dict = {'accuracy score':(accuracy_lgbm,accuracy_xgb),'auc score':(auc_lgbm,auc_xgb)}

In [50]:
#Creating a dataframe 'comparison_df' to compare the performance of LightGBM and xgboost. 
comparison_df = DataFrame(comparison_dict) 
comparison_df.index= ['LightGBM','xgboost'] 
comparison_df

| | accuracy score | auc score |
|---|---|---|
| LightGBM | 0.863957 | 0.776027 |
| xgboost | 0.864469 | 0.777217 |
