## Explainable Boosting Machine 
### Author: Francesca Naretto
### Dataset: Adult dataset (already cleaned)
#### Download EBM at https://github.com/interpretml/interpret
EBM is a Generalized Additive Model (GAM) with automatic interaction detection, based on trees and cyclic gradient boosting.
EBM learns a function f_j for each feature j. Each function is learned by boosting in a round-robin fashion, that is, one feature at a time. EBM can also automatically detect pairwise interaction terms. The formula is:

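$g(E[y]) = \beta_0 + \sum_j f_j(x_j) + \sum_{i \neq j} f_{i,j}(x_i, x_j)$

where g is the link function, $\beta_0$ is the intercept, f_j is the function learned for feature j, and f_{i,j} is the function learned for a pair of features; the second sum runs only over the detected pairwise interactions.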
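The round-robin boosting idea can be illustrated with a toy sketch. This is not the `interpret` implementation: it assumes features that are already binned into integer indices, a squared-error loss with an identity link, and it replaces the shallow trees with per-bin mean residuals. The names `fit_cyclic_gam` and `predict_additive` are hypothetical.

```python
def fit_cyclic_gam(X, y, n_bins, rounds=200, learning_rate=0.1):
    """Toy round-robin boosting of an additive model (squared loss).

    X: list of rows, each a list of integer bin indices (one per feature).
    n_bins: number of bins of each feature.
    Returns (intercept, tables), where tables[j][b] is f_j for bin b.
    """
    n_rows, n_features = len(X), len(n_bins)
    intercept = sum(y) / n_rows
    tables = [[0.0] * n_bins[j] for j in range(n_features)]
    predictions = [intercept] * n_rows

    for _ in range(rounds):
        for j in range(n_features):  # round-robin: one feature at a time
            residual_sums = [0.0] * n_bins[j]
            counts = [0] * n_bins[j]
            for row, target, prediction in zip(X, y, predictions):
                residual_sums[row[j]] += target - prediction
                counts[row[j]] += 1
            # Take a small step towards the mean residual of each bin.
            steps = [
                learning_rate * total / count if count else 0.0
                for total, count in zip(residual_sums, counts)
            ]
            for b, step in enumerate(steps):
                tables[j][b] += step
            predictions = [p + steps[row[j]] for p, row in zip(predictions, X)]
    return intercept, tables


def predict_additive(intercept, tables, row):
    """Sum the intercept and one table entry per feature."""
    return intercept + sum(table[b] for table, b in zip(tables, row))


# Toy data: two binned features whose effects add up exactly.
X = [[a, b] for a in range(3) for b in range(2)]
y = [[1.0, 2.0, 4.0][a] + [0.0, 10.0][b] for a, b in X]
intercept, tables = fit_cyclic_gam(X, y, n_bins=[3, 2])
```

Because the toy target is purely additive, the learned tables reproduce it almost exactly; the small learning rate means each feature only absorbs a fraction of the residual on every pass, which is what lets the features share the signal rather than the first one claiming all of it.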

Two kinds of explanation are available: global and local.

For global explanations, we can visualize the overall behaviour of the model and the behaviour of each feature, which is read directly from the function f_j learned for that feature.

For local explanations and predictions, each function f_j acts as a per-feature lookup table and returns a term contribution. To predict, the term contributions are summed and passed through the inverse of the link function g.
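The lookup-table view of prediction can be sketched as follows. The bin edges, scores and intercept are invented for illustration and do not come from a fitted EBM; a logit link is assumed, as is usual for binary classification. The names `SHAPE_FUNCTIONS`, `term_contributions` and `predict_proba` are hypothetical.

```python
import bisect
import math

# Hypothetical shape functions: bin edges plus one contribution per bin.
# These numbers are invented for illustration only.
SHAPE_FUNCTIONS = {
    "age": {"edges": [25, 40, 60], "scores": [-0.9, -0.1, 0.4, 0.2]},
    "hours-per-week": {"edges": [30, 45], "scores": [-0.5, 0.0, 0.3]},
}
INTERCEPT = -1.2  # hypothetical beta_0


def term_contributions(record):
    """Look up the contribution f_j(x_j) of every feature in the record."""
    contributions = {}
    for name, table in SHAPE_FUNCTIONS.items():
        bin_index = bisect.bisect_right(table["edges"], record[name])
        contributions[name] = table["scores"][bin_index]
    return contributions


def predict_proba(record):
    """Add intercept and contributions, then apply the inverse logit link."""
    logit = INTERCEPT + sum(term_contributions(record).values())
    return 1.0 / (1.0 + math.exp(-logit))


record = {"age": 29, "hours-per-week": 40}
contribs = term_contributions(record)
probability = predict_proba(record)
print(contribs, probability)
```

The dictionary returned by `term_contributions` is the local explanation of the record, and the probability is a deterministic function of it, which is why the explanation is exact rather than an approximation of the model.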

In [2]:
!pip install interpret

Collecting interpret
  Downloading interpret-0.2.7-py3-none-any.whl (1.4 kB)
Collecting interpret-core[dash,debug,decisiontree,ebm,lime,linear,notebook,plotly,required,sensitivity,shap,skoperules,treeinterpreter]>=0.2.7
  Downloading interpret_core-0.2.7-py3-none-any.whl (6.6 MB)

In [1]:
from interpret.glassbox import ExplainableBoostingClassifier
import pickle
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [4]:
# Load the Adult dataset.
# This is a cleaned version: the education feature has been removed
# and the categorical variables have been mapped to numerical ones.
dataset = pd.read_csv('/content/adult_clean.csv')

In [5]:
dataset.head()

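| | age | workclass | fnlwgt | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 0 | 77516 | 13 | 7 | 14 | 5 | 1 | 0 | 2174 | 0 | 40 | 41 | 0 |
| 1 | 50 | 1 | 83311 | 13 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 13 | 41 | 0 |
| 2 | 38 | 2 | 215646 | 9 | 2 | 2 | 5 | 1 | 0 | 0 | 0 | 40 | 41 | 0 |
| 3 | 53 | 2 | 234721 | 7 | 1 | 2 | 1 | 2 | 0 | 0 | 0 | 40 | 41 | 0 |
| 4 | 28 | 2 | 338409 | 13 | 1 | 3 | 2 | 2 | 1 | 0 | 0 | 40 | 1 | 0 |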


In [6]:
test_size = 0.3
random_state = 42
labels = dataset.pop('class')
features = list(dataset.columns)
X_train, X_test, Y_train, Y_test = train_test_split(dataset, labels,
                                                        test_size=test_size,
                                                        random_state=random_state,
                                                        stratify=labels)
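The `stratify=labels` argument asks for a split that keeps the class proportions the same in the training and test sets. The idea can be sketched in plain Python; this is a simplification for illustration, not scikit-learn's algorithm, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_indices, test_indices) with per-class proportions kept."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    rng = random.Random(seed)
    train_indices, test_indices = [], []
    for label in sorted(indices_by_class):
        indices = indices_by_class[label]
        rng.shuffle(indices)
        # Send the same fraction of every class to the test set.
        n_test = round(len(indices) * test_size)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return train_indices, test_indices


# Toy labels: 70% class 0, 30% class 1.
toy_labels = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(toy_labels, test_size=0.3, seed=42)
test_positive_rate = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_rate)
```

For an imbalanced target such as the income class of the Adult dataset, this keeps the minority class from being over- or under-represented in the test set by chance.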

### Creation of the EBM model

In [7]:
import time 
start = time.time()
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, Y_train)
end = time.time()
print('Time for the creation of EBM model ', end - start)

Time for the creation of EBM model  55.20486283302307


In [18]:
X_test.values[0].reshape(1,-1)

array([[    29,      2, 169683,     11,      1,      3,      1,      1,
             0,      0,      0,     40,     41]])

In [19]:
ebm.predict(X_test.values[0].reshape(1,-1))

array([0])

In [20]:
ebm.predict_proba(X_test.values[0].reshape(1,-1))

array([[0.63985526, 0.36014474]])

In [21]:
ebm.predict_and_contrib(X_test.values[0].reshape(1,-1), output='probabilities')

(array([[0.63985526, 0.36014474]]),
 array([[-0.18511864,  0.03513952, -0.00553996,  0.10400734,  0.63873291,
          0.40526434,  0.39629074,  0.04245792,  0.30480956, -0.23528792,
         -0.05455911,  0.01356569,  0.02432705,  0.04442561,  0.07356974,
          0.01105724,  0.02561942,  0.01527057,  0.07117249, -0.0077097 ,
         -0.01237292, -0.0163955 , -0.00256915]]))
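The additive structure can be checked against this output. A minimal sketch, assuming the returned contributions are on the log-odds scale and that the link is the logit (the output itself does not state this): under that assumption, the intercept is the log-odds of the predicted probability minus the sum of the 23 contributions (13 features plus 10 interaction terms).

```python
import math

# Values copied from the predict_and_contrib output for the first test record.
probability_class_1 = 0.36014474
contributions = [
    -0.18511864, 0.03513952, -0.00553996, 0.10400734, 0.63873291,
    0.40526434, 0.39629074, 0.04245792, 0.30480956, -0.23528792,
    -0.05455911, 0.01356569, 0.02432705, 0.04442561, 0.07356974,
    0.01105724, 0.02561942, 0.01527057, 0.07117249, -0.0077097,
    -0.01237292, -0.0163955, -0.00256915,
]


def sigmoid(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


total_contribution = sum(contributions)
# Assumption: whatever the contributions do not explain is the intercept.
implied_intercept = logit(probability_class_1) - total_contribution
reconstructed = sigmoid(implied_intercept + total_contribution)
print(total_contribution, implied_intercept, reconstructed)
```

Under these assumptions the implied intercept is about -2.26; if the fitted model exposes its intercept, comparing the two values is a quick way to confirm that the contributions really are additive on the log-odds scale.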

### EBM offers interactive plots showing a summary of the overall behaviour of the model, the behaviour of single features, and the interactions among features

In [8]:
from interpret import show

ebm_global = ebm.explain_global(name='EBM Adult Global')
show(ebm_global)



### Local explanation of the first 10 records of the test set

In [9]:
X_test.iloc[:10]

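| | age | workclass | fnlwgt | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21746 | 29 | 2 | 169683 | 11 | 1 | 3 | 1 | 1 | 0 | 0 | 0 | 40 | 41 |
| 28575 | 37 | 2 | 238433 | 2 | 1 | 6 | 1 | 1 | 0 | 0 | 0 | 40 | 1 |
| 23613 | 41 | 2 | 204415 | 13 | 1 | 5 | 1 | 1 | 0 | 0 | 0 | 48 | 41 |
| 15533 | 18 | 2 | 362302 | 8 | 7 | 4 | 3 | 1 | 0 | 0 | 0 | 15 | 41 |
| 1257 | 20 | 2 | 444554 | 6 | 7 | 2 | 3 | 1 | 0 | 0 | 0 | 40 | 41 |
| 2721 | 38 | 2 | 183279 | 10 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 44 | 41 |
| 28528 | 21 | 2 | 145119 | 10 | 7 | 4 | 3 | 3 | 0 | 0 | 0 | 20 | 41 |
| 3080 | 29 | 2 | 129856 | 11 | 7 | 14 | 3 | 1 | 1 | 0 | 0 | 40 | 41 |
| 25974 | 30 | 2 | 161599 | 9 | 1 | 2 | 1 | 1 | 0 | 0 | 0 | 40 | 41 |
| 16268 | 58 | 4 | 489085 | 13 | 1 | 3 | 1 | 2 | 0 | 0 | 0 | 40 | 41 |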


In [10]:
ebm_local = ebm.explain_local(X_test.iloc[:10], Y_test.iloc[:10])
show(ebm_local)

### Several performance plots are available, such as the ROC curve

In [11]:
from interpret.perf import ROC

ebm_perf = ROC(ebm.predict_proba).explain_perf(X_test, Y_test, name='EBM Adult')

show(ebm_perf)
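For readers unfamiliar with what the ROC plot summarizes, the curve and its area can be computed by hand from predicted scores. This is a plain-Python sketch of the standard definition, not the code used by `interpret.perf.ROC`; `roc_points` and `auc` are hypothetical helpers and the four scores are toy values.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    true_positives = false_positives = 0
    for index, (score, label) in enumerate(ranked):
        if label == 1:
            true_positives += 1
        else:
            false_positives += 1
        # Emit a point only once per distinct score, so ties share a point.
        if index + 1 == len(ranked) or ranked[index + 1][0] != score:
            points.append((false_positives / negatives,
                           true_positives / positives))
    return points


def auc(points):
    """Area under the ROC points, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy example: two negatives and two positives.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_curve = roc_points(toy_labels, toy_scores)
toy_auc = auc(toy_curve)
print(toy_curve, toy_auc)
```

On the real model, the scores would be the second column of `ebm.predict_proba(X_test)` and the labels would be `Y_test`.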
