## 0. Imports

In [1]:
import pandas as pd
import numpy as np

import interpret_extension
from interpret_extension import show
from interpret_extension.glassbox import GaussianNB
from interpret_extension.glassbox import CategoricalNB as CategoricalNaiveBayesClassifier

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

from tqdm.autonotebook import tqdm


## 1. Loading IRIS Dataset

Let's load the well-known IRIS Dataset.

In [2]:
iris = pd.read_csv('data/iris.csv')
iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

To keep things simple, let's turn it into a binary problem and separate the features (**X**) from the target (**y**):

In [3]:
iris['species'] = np.where(iris['species'] == 'Iris-setosa', 1, 0)

X = iris.drop('species', axis=1)
y = iris['species']

Now Iris-versicolor and Iris-virginica together form the **negative class** (0), and Iris-setosa is the **positive class** (1).

Finally, let's hold out 20% of the rows as a test set:

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 2. Naive Bayes Models

Let's use both Gaussian NB and Categorical NB to solve this classification problem.

### 2.1 Gaussian Naive Bayes

In [5]:
X_train.sample(3)

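|     | sepal_length | sepal_width | petal_length | petal_width |
|-----|--------------|-------------|--------------|-------------|
| 105 | 7.6          | 3.0         | 6.6          | 2.1         |
| 52  | 6.9          | 3.1         | 4.9          | 1.5         |
| 93  | 5.0          | 2.3         | 3.3          | 1.0         |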


In [6]:
gaussian_nb = GaussianNB()
gaussian_nb.fit(X_train, y_train)

<interpret_extension.glassbox._naivebayes.GaussianNB at 0x1e9ac3df2b0>

In [7]:
pred = gaussian_nb.predict(X_test)
pred

array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 1, 1], dtype=int64)

In [8]:
print(gaussian_nb.score(X_test, y_test))

1.0


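The model classifies every test sample correctly, but how does it arrive at these predictions? Let's see how the model can be interpreted.

These are the main parameters of the model: `theta_` holds the per-class mean of each feature and `var_` the per-class variance. Row 0 is the negative class, row 1 is the positive class, and the columns follow the feature order. Together they describe the distribution the model assumes for each class, so they are key to interpreting it.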

In [9]:
print(gaussian_nb._model().theta_)
print(gaussian_nb._model().var_)

[[6.21875 2.86625 4.865   1.6525 ]
 [4.99    3.44    1.4525  0.2425 ]]
[[0.44427344 0.10923594 0.663775   0.17599375]
 [0.1239     0.1549     0.03299375 0.01144375]]
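To make the role of these parameters concrete, here is a minimal sketch that rebuilds a Gaussian NB prediction by hand from the printed means and variances. It is not the code used by `interpret_extension`. The class priors are an assumption (the notebook does not print them; the full-dataset proportions are used instead), and the helper names are hypothetical.

```python
import math

# Per-class means (theta_) and variances (var_) copied from the output above.
# Row 0 is the negative class, row 1 is the positive class (Iris-setosa).
THETA = [[6.21875, 2.86625, 4.865, 1.6525],
         [4.99, 3.44, 1.4525, 0.2425]]
VAR = [[0.44427344, 0.10923594, 0.663775, 0.17599375],
       [0.1239, 0.1549, 0.03299375, 0.01144375]]

# ASSUMPTION: the priors are not shown in the notebook. These are the class
# proportions of the full Iris dataset (100 non-setosa, 50 setosa).
PRIOR = [2 / 3, 1 / 3]


def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def joint_log_likelihood(sample):
    """Return log P(c) + sum_i log P(x_i | c) for c = 0 and c = 1."""
    return [
        math.log(PRIOR[c])
        + sum(gaussian_log_pdf(x, m, v) for x, m, v in zip(sample, THETA[c], VAR[c]))
        for c in range(2)
    ]


def predict(sample):
    """Pick the class with the highest joint log-likelihood."""
    scores = joint_log_likelihood(sample)
    return max(range(2), key=lambda c: scores[c])


# Row 105 from the X_train sample shown earlier.
print(predict([7.6, 3.0, 6.6, 2.1]))
```

Because each feature contributes its own additive term to the sum, the per-feature terms can be inspected separately, which is what makes the model interpretable.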


Firstly, let's see global explanations:

In [10]:
gaussian_nb_global = gaussian_nb.explain_global()
show(gaussian_nb_global)

We can see a trend in each of the four variables.

For example, **sepal_width** shows an almost linear function (it is actually quadratic, but over this small range it looks linear). From it we can see that the higher the **sepal_width**, the more likely the sample is to belong to the positive class (**Iris-setosa**).
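The quadratic shape can be checked numerically. The sketch below assumes that the per-feature score is the difference between the two class-conditional Gaussian log densities; this is an assumption about what the plot shows, not a confirmed detail of `interpret_extension`. The means and variances are the **sepal_width** entries printed in cell 9, and the function names are hypothetical.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


# sepal_width parameters copied from theta_ and var_ (second column).
MEAN_NEG, VAR_NEG = 2.86625, 0.10923594  # negative class (row 0)
MEAN_POS, VAR_POS = 3.44, 0.1549         # positive class (row 1)


def sepal_width_log_odds(x):
    """Contribution of sepal_width to the log-odds of the positive class."""
    return gaussian_log_pdf(x, MEAN_POS, VAR_POS) - gaussian_log_pdf(x, MEAN_NEG, VAR_NEG)


# Evaluate over roughly the range of sepal widths found in Iris.
widths = [2.0, 2.5, 3.0, 3.5, 4.0, 4.4]
scores = [sepal_width_log_odds(w) for w in widths]
for w, s in zip(widths, scores):
    print(f"sepal_width={w:.1f}  log-odds contribution={s:+.3f}")
```

The difference of two Gaussian log densities with different variances is a quadratic in `x`. With these parameters its minimum lies below the plotted range, so the score only increases with **sepal_width** there.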

Looking at the local explanations we can obtain other conclusions:

In [11]:
gaussian_nb_local = gaussian_nb.explain_local(X_test, y_test)
show(gaussian_nb_local)

This way we can see which variables have the most influence on each individual prediction.

### 2.2 Categorical Naive Bayes

Categorical Naive Bayes expects discrete features, so we first need to discretize the continuous ones. scikit-learn's KBinsDiscretizer does this; here it splits each feature into five equal-width bins.

In [12]:
kbd = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform', subsample=200)
X_train_discrete = pd.DataFrame(kbd.fit_transform(X_train).astype(int), columns=X_train.columns)
X_test_discrete = pd.DataFrame(kbd.transform(X_test).astype(int), columns=X_test.columns)
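As a rough illustration of what equal-width (`uniform`) binning does, the sketch below splits a value range into five bins and assigns values to them. It is a simplified stand-in for `KBinsDiscretizer`, not its implementation. The range 4.3 to 7.9 and the helper names are hypothetical; in the notebook the edges come from the training data.

```python
from bisect import bisect_right


def uniform_bin_edges(lo, hi, n_bins):
    """Return n_bins + 1 equally spaced edges covering [lo, hi]."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def to_bin(x, edges):
    """Map a value to its ordinal bin index.

    Only the interior edges are searched, so values below the first edge
    land in bin 0 and values above the last edge land in the last bin.
    """
    return bisect_right(edges[1:-1], x)


# Hypothetical feature range, split into five bins as in the cell above.
edges = uniform_bin_edges(4.3, 7.9, 5)
bins = [to_bin(x, edges) for x in (4.3, 5.1, 6.0, 7.9)]
print(bins)
```

Clipping out-of-range values into the first or last bin matters for the test set, whose values may fall slightly outside the range seen during training.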

In [13]:
X_train_discrete.sample(3)

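|    | sepal_length | sepal_width | petal_length | petal_width |
|----|--------------|-------------|--------------|-------------|
| 8  | 1            | 3           | 0            | 0           |
| 91 | 0            | 0           | 0            | 0           |
| 62 | 2            | 1           | 3            | 3           |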


Let's fit the model:

In [14]:
categorical_nb = CategoricalNaiveBayesClassifier()
categorical_nb.fit(X_train_discrete, y_train)

<interpret_extension.glassbox._naivebayes.CategoricalNB at 0x1e9ae2821d0>

In [15]:
pred = categorical_nb.predict(X_test_discrete)
pred

array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 1, 1], dtype=int64)

In [16]:
print(categorical_nb.score(X_test_discrete, y_test))

1.0


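These are the main parameters used to explain the model. `feature_log_prob_` is a list with one array per feature; each array has one row per class and one column per bin, and stores the log-probability of that bin given the class: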

In [17]:
categorical_nb.model.feature_log_prob_

[array([[-3.34403897, -1.73460106, -0.91629073, -1.3516088 , -2.04475598],
        [-1.03407377, -0.62860866, -2.7080502 , -3.80666249, -3.80666249]]),
 array([[-2.14006616, -0.8873032 , -0.85913232, -3.34403897, -4.44265126],
        [-3.11351531, -3.11351531, -0.76214005, -1.09861229, -2.19722458]]),
 array([[-4.44265126, -3.74950408, -1.18455472, -0.83173334, -1.49821228],
        [-0.09309042, -3.80666249, -3.80666249, -3.80666249, -3.80666249]]),
 array([[-4.44265126, -2.36320971, -0.91629073, -1.22377543, -1.60943791],
        [-0.11778304, -3.11351531, -3.80666249, -3.80666249, -3.80666249]])]
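To read these numbers, it helps to turn them into a score per bin. The sketch below takes the third array (petal_length, assuming the arrays follow the column order) and computes the log-ratio between the positive and the negative class for each bin. Whether `interpret_extension` plots exactly this quantity is an assumption; the variable names are hypothetical.

```python
import math

# petal_length entry of feature_log_prob_, copied from the output above.
PETAL_LENGTH_LOG_PROB = [
    [-4.44265126, -3.74950408, -1.18455472, -0.83173334, -1.49821228],  # class 0
    [-0.09309042, -3.80666249, -3.80666249, -3.80666249, -3.80666249],  # class 1
]

# Each row is a log-probability distribution over the five bins,
# so exponentiating and summing should give (almost exactly) 1.
row_sums = [sum(math.exp(lp) for lp in row) for row in PETAL_LENGTH_LOG_PROB]

# Per-bin score: log P(bin | positive) - log P(bin | negative).
# Positive values push towards Iris-setosa, negative values away from it.
bin_scores = [pos - neg for neg, pos in zip(*PETAL_LENGTH_LOG_PROB)]

for i, score in enumerate(bin_scores):
    print(f"petal_length bin {i}: {score:+.3f}")
```

Only the lowest petal_length bin favours the positive class; every other bin counts against it.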

Let's see the global explanations:

In [18]:
categorical_nb_global = categorical_nb.explain_global()
show(categorical_nb_global)

In this case the explanation is no longer a continuous function, because the model treats every feature as categorical. Instead, we see one score per bin of each variable, which shows how that bin affects the prediction.

In [19]:
categorical_nb_local = categorical_nb.explain_local(X_test_discrete, y_test)
show(categorical_nb_local)

As before, the local explanation breaks down individual predictions.

If you compare the Categorical NB explanations with the Gaussian NB ones, the length and direction of the bars are very similar. They are not identical, because discretization discards some information, but they are close.