## Content

- LightGBM
    - GOSS (Gradient based one side sampling)
    - Exclusive Feature Bundling
    - Code walkthrough

- Cascading

- Stacking

- Comparison
    - RF vs GBDT
    - Cascading vs Stacking

# **LightGBM**

It was built at Microsoft, primarily as a faster implementation of GBDT.

It is typically faster than XGBoost because of its optimizations.


There are two main strategies for optimization:

#### 1.  GOSS - Gradient based one side sampling

When we are building the $m^{th}$ model, the training points are $(x_i, res_{i,m})$.


Instead of considering all of these points, GOSS samples them:

- we keep the points for which $res_{i,m}$ is large 
- we keep only a random sample of the points for which $res_{i,m}$ is small, and drop the rest
- i.e. smart sampling (a point with a large residual has a higher probability of being selected)

So, when we are building the $m^{th}$ model, we'll have fewer rows.

The key idea is to reduce the number of data points, which makes training faster.
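
The sampling step can be sketched in a few lines of NumPy. This is only an illustration, not LightGBM's actual code: the function name `goss_sample`, the toy residuals, and the rates (`top_rate`, `other_rate`) are assumptions chosen for the example. It keeps every large-residual row, keeps a random subset of the small-residual rows, and up-weights that subset so it still stands in for all the small-residual rows.

```python
import numpy as np

def goss_sample(residuals, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (row indices, per-row weights) for one boosting round."""
    rng = np.random.default_rng(seed)
    n = len(residuals)
    n_top = int(top_rate * n)      # rows with the largest |residual|
    n_other = int(other_rate * n)  # rows sampled from the remainder

    # Sort rows by |residual|, largest first.
    order = np.argsort(-np.abs(residuals))
    top_idx = order[:n_top]        # always kept
    rest_idx = order[n_top:]       # small-residual rows
    sampled_idx = rng.choice(rest_idx, size=n_other, replace=False)

    # Up-weight the sampled small-residual rows so that together they
    # approximate the contribution of all small-residual rows.
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate

    return np.concatenate([top_idx, sampled_idx]), weights

# Toy residuals (assumed data, for illustration only).
residuals = np.random.default_rng(42).normal(size=1000)
idx, w = goss_sample(residuals)
print(f"rows used: {len(idx)} out of {len(residuals)}")
```

With these rates, the $m^{th}$ tree would be fit on 30% of the rows instead of all of them.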
  



 <img src='https://drive.google.com/uc?id=1NGKKYLSyz8MzFKQtzxnh9PTUbGQoM9JP' >

#### 2.  Exclusive Feature Bundling (EFB)

Let us assume we have a categorical feature with 3 categories 
- if we do **one hot encoding** (worst thing to do), we get 3 sparse columns
    - for each row, exactly one of them will be set, i.e. equal to 1. 

<br>

#### What does Exclusive feature bundling do? (intuition, not the details)

- It looks at all the dimensions 
- It tries to find features that are exclusive of each other

<br>

#### What does exclusive mean? 
Say we have features $f_1$ and $f_2$.

When we say $f_1, f_2$ are exclusive, we mean
- if $f_1$ takes a non-zero value on a row, $f_2$ doesn't
- i.e. the two features are (almost) never non-zero at the same time

<br>

It tries to find these exclusive features (using a graph-based algorithm) and
- groups them into a single feature 
   
So, the key objective of EFB is to reduce the number of features and hence the dimensionality.
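
To make the bundling idea concrete, here is a small sketch with two toy features that are never non-zero on the same row. The data and the helper names (`are_exclusive`, `bundle`) are assumptions for illustration; LightGBM's real implementation works on histogram bins and tolerates a small number of conflicts. The trick is to shift the values of the second feature by an offset so both features fit in one column without colliding.

```python
import numpy as np

# Two toy features that are never non-zero on the same row (assumed data).
f1 = np.array([3, 0, 0, 1, 0, 2])   # values in 0..3
f2 = np.array([0, 2, 0, 0, 1, 0])   # values in 0..2

def are_exclusive(a, b):
    """True if no row has both features non-zero."""
    return not np.any((a != 0) & (b != 0))

def bundle(a, b):
    """Merge two exclusive features into one column.

    Non-zero values of `b` are shifted by max(a), so the two original
    features occupy disjoint value ranges inside the bundle.
    """
    offset = a.max()
    return np.where(b != 0, b + offset, a)

assert are_exclusive(f1, f2)
bundled = bundle(f1, f2)
print(bundled)   # [3 5 0 1 4 2]
```

Values 1 to 3 in the bundle come from `f1` and values 4 to 5 come from `f2`, so a tree can still split on either original feature while scanning a single column.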

 <img src='https://drive.google.com/uc?id=1G2Ilv5_nhQ0rrX0KUp60v9UoqlDApWl4' >


### Code walkthrough

In [None]:
import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
#Refer: https://lightgbm.readthedocs.io/en/latest/Parameters.html
import datetime as dt
gridParams = {
    'learning_rate': [0.1, 0.5, 0.8],
    'boosting_type' : ['gbdt'],
    'objective' : ['multiclass'],
    'max_depth' : [5,6,7,8],
    'colsample_bytree' : [0.5,0.7],
    'subsample' : [0.5,0.7],
    'metric':['multi_error'],
    'random_state' : [501]
    }

clf = lgb.LGBMClassifier(num_classes=20)
grid = RandomizedSearchCV(clf,gridParams,verbose=3,cv=3,n_jobs = -1,n_iter=10,)

start = dt.datetime.now()
grid.fit(X_train,Y_train)
end = dt.datetime.now()


In [None]:
grid.best_params_

In [None]:
best_lgbm = lgb.LGBMClassifier(boosting_type = 'gbdt',
                              objective = 'multiclass',
                              num_class=20, 
                              colsample_bytree=0.7, 
                              subsample=0.7, 
                              max_depth=8, 
                              learning_rate=0.5, 
                              random_state = 501)
best_lgbm.fit(X_train, Y_train)

In [None]:
print(f"Time taken for the randomized search: {end - start}\nTraining accuracy: {best_lgbm.score(X_train, Y_train)}\nTest accuracy: {best_lgbm.score(X_test, Y_test)}")

Notice the time taken for 30 fits (10 sampled configurations × 3 CV folds): about 5 mins with XGBoost vs about 2 mins with LightGBM.

# **Cascading**

Let's assume we have to detect whether a transaction is fraudulent or not.
 
Let the dataset be $D_1$, which will be imbalanced, with

- $y=1$ for fraudulent transaction
- $y =0$ for non fraudulent transaction

For a query point $x_q$, 
- we will pass this point through the first model $M_1$
- Model $M_1$ will return the probability of the query point being a fraud


Based on this probability, we split the data into 2 parts:
- if the probability of $y$ being 1 is extremely low, say $< 0.001$, then 
    - we consider the transaction as not fraudulent; let this data be $D_1'$.

#### What happens to rest of the data? 

The rest of the points ($D_1-D_1'$), i.e. data with probability $\geq 0.001$ that we are not sure about, 
- will be passed through the next model $M_2$ 
- Model $M_2$ will be stricter, i.e. it'll penalize mistakes more.

Again, model $M_2$ will split the data into 2 parts
- non-fraud (say, $P(y =1 | x_q) < 0.001$)
- possibly fraudulent transactions ($P(y =1 | x_q) \geq 0.001$)

We can add another model after $M_2$, which will work on the same principle.





<img src='https://drive.google.com/uc?id=1TwfqSCDjjS3MsXaBadxNBIvBfF9LQ0JC' >

#### Did you notice the structure of the model? 
We are cascading one model after another.

In the first model, we are just removing all the clearly genuine transactions
- in the second model, we are trying to find the possibly fraudulent points in the remaining data,

and we continue this **cascading**.

Every model is trained on a different dataset ($D_n - D_n'$).

If we are still not sure even after all these models, a human at the end verifies the transaction.
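
The routing logic of a cascade at prediction time can be sketched as below. The functions `m1` and `m2` are hypothetical stand-ins for trained models (their rules and the feature names `amount` and `known_device` are made up for illustration); only the control flow is the point here: a query leaves the cascade as soon as one stage is confident it is not fraud, and anything that no stage clears goes to a human.

```python
def cascade_predict(x, stages, threshold=0.001):
    """Route one query point through a sequence of models.

    `stages` is a list of functions, each returning P(y=1 | x).
    A point leaves the cascade as soon as a stage is confident that
    it is not fraudulent; otherwise it moves on to the next stage.
    """
    for stage_no, predict_proba in enumerate(stages, start=1):
        if predict_proba(x) < threshold:
            return f"not fraud (cleared by M{stage_no})"
    # No stage could clear the point: hand it to a human reviewer.
    return "needs human review"

# Hypothetical stand-ins for the trained models M1 and M2.
def m1(x):
    return 0.0005 if x["amount"] < 100 else 0.2

def m2(x):
    return 0.0002 if x["known_device"] else 0.6

stages = [m1, m2]
small = cascade_predict({"amount": 20, "known_device": False}, stages)
large_known = cascade_predict({"amount": 5000, "known_device": True}, stages)
large_unknown = cascade_predict({"amount": 5000, "known_device": False}, stages)
print(small)          # not fraud (cleared by M1)
print(large_known)    # not fraud (cleared by M2)
print(large_unknown)  # needs human review
```

In a real cascade each stage would be a model trained only on the data that the previous stages could not clear.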

<img src='https://drive.google.com/uc?id=16cjT66tLCnFRuGzGPVmw4KUOgz0k85Ot' >



# **Stacking**

Let's assume we have a dataset of $n$ points.

#### What do we do in stacking?
We train $m$ **individual** models (base learners) on this dataset, 
- these models can be of different types, like Decision Tree, GBDT, Random Forest etc.

Do note that these $m$ models can be optimal models,
- i.e. well-fitted models with minimum CV error.
 
Let these base learners be $c_1,c_2,\ldots,c_m$
* Now, given a datapoint, each of these models will give a prediction ($p_1,p_2,\ldots,p_m$) 

In RF, we train $m$ models and aggregate their predictions using the mean/median (regression) or a majority vote (classification).

**In Stacking, we instead build a meta-classifier on the predictions of the base learners**

#### What model will we use as Meta classifier? 
The meta-classifier can be any model
* This meta-classifier gives the final output for the data 

#### What is happening in Stacking intuitively?
Here, we are taking the outputs of the well-built models and stacking them together to train a meta-classifier that gives the final output

BAM!!

 

<img src='https://drive.google.com/uc?id=14Ishp1beh9iHH1TrOICRomMKRMpgRrLQ' >

#### How is this implemented? 

This is implemented using the **StackingClassifier** class from the **mlxtend.classifier** module
* In the code, we import all the libraries first
* Then we **create the base classifiers**, pass these as **inputs to the stacking classifier** and **set any model as the meta-classifier.**
* Now, **we train the model** and it's done
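
The three steps above can be sketched as follows. Note the swap: this sketch uses scikit-learn's own `StackingClassifier` (from `sklearn.ensemble`) instead of the library named above, and the synthetic dataset, the choice of base learners, and all hyperparameters are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: create the base classifiers (different model types).
base_learners = [
    ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("gbdt", GradientBoostingClassifier(n_estimators=50, random_state=0)),
]

# Step 2: pass them to the stacking classifier and pick a meta-classifier.
# The meta-classifier is trained on cross-validated predictions of the
# base learners (cv=5), not on the raw features.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),
    cv=5,
)

# Step 3: train and evaluate.
stack.fit(X_tr, y_tr)
test_accuracy = stack.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Logistic regression is used as the meta-classifier here only because it is simple; as noted, any model could take its place.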

Double BAM!!



<img src='https://drive.google.com/uc?id=17wgLrV5LKmISQw7kzLrW2rzuDiolFDM5' >

#### The idea looks interesting, so why don't we use it?
Ans: because Deep Learning came into existence.

## Comparison



### RF vs GBDT

GBDT is used more often than RF.

#### Question: Why is GBDT used more often than RF?

1. Because with GBDT we can choose any differentiable loss function, but we cannot do this for Random Forest. 
2. Training time varies, but training is done only once, so it doesn't matter much; run-time is important because queries come in all the time.
3. GBDT has a cheaper run-time because 
    - its base learners are shallow, whereas Random Forest has deeper trees, and 
    - the number of trees in GBDT is usually smaller than in Random Forest  
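
Point 1 can be made concrete with a short derivation. As a sketch, assume $F_{m-1}$ denotes the ensemble built from the first $m-1$ models (a notation not used above) and $L$ is the chosen loss. The target that the $m^{th}$ model is fit on is then the negative gradient of the loss:

$$
res_{i,m} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}
$$

For the squared loss $L = \frac{1}{2}(y_i - F(x_i))^2$ this gives

$$
res_{i,m} = y_i - F_{m-1}(x_i)
$$

which is the ordinary residual. Choosing any other differentiable $L$ only changes these targets, while the rest of the boosting procedure stays the same.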







<img src='https://drive.google.com/uc?id=1kBqeWQ-y71bi6ndWQPUn58IGo4FmiTjT' >

### Cascading vs Stacking


**Cascading** is used when the risk or cost of mistakes is high, and the data is highly imbalanced.
 * Like fraudulent transaction detection at Amazon

#### What about the explainability of the model?
We make sure that every model is explainable, so that we can explain the output using these models 
  * We will see a few algorithms, like **LIME and SHAP**, which can explain any black-box algorithm, after a few lectures in Deep Learning.



**Stacking** is mostly seen in Kaggle competitions, not so much in the real world.



<img src='https://drive.google.com/uc?id=1kzRryuOAFc5_HYtWcHxSGDwml-dHtGQk' >
