<a href="https://colab.research.google.com/github/czanalytics/czanalytics/blob/main/explainable_ai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Explainable AI

We discuss generic eXplainable AI tools.
[XAI](https://en.wikipedia.org/wiki/Explainable_artificial_intelligence) helps users of machine learning systems by improving their understanding of how the models reason.

[LIME and SHAP build trust](https://www.datacamp.com/tutorial/explainable-ai-understanding-and-trusting-machine-learning-models) by highlighting transparency and fairness in AI-driven decisions.

## Boston dataset

The Boston housing prices dataset has an ethical problem: its `B` feature is derived from the proportion of Black residents in each town.
We use the LIME explainer to better understand models trained on this data.

In [None]:
!pip install lime &> /dev/null

In [None]:
import lime

In [None]:
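# The scikit-learn maintainers strongly discourage the use of this dataset
# unless the purpose of the code is to study and educate about ethical issues.
# load_boston was removed in scikit-learn 1.2, so the import below stays commented out.
# from sklearn.datasets import load_boston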

In [None]:
# We proceed with the ethically problematic dataset, loaded from its original source
import pandas as pd
import numpy as np

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)  # raw string avoids an invalid-escape warning

data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
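
The slicing above suggests that each record in the raw file spans two physical rows: the even rows carry the first 11 values, and the odd rows carry two more features followed by the target. The sketch below reproduces that reshaping on a small synthetic array. The names `raw`, `demo_data`, and `demo_target` and all the numbers are made up for illustration and assume this two-row layout.

```python
import numpy as np

# Hypothetical miniature of the raw layout: every record spans two rows.
# Even rows hold the first 11 values; odd rows hold 3 more, padded with NaN.
n_records = 3
first = np.arange(n_records * 11, dtype=float).reshape(n_records, 11)
second = np.full((n_records, 11), np.nan)
second[:, :3] = np.array([[100.0, 101.0, 24.0],
                          [200.0, 201.0, 21.6],
                          [300.0, 301.0, 34.7]])

# Interleave the two kinds of rows, as they would appear after parsing.
raw = np.empty((2 * n_records, 11))
raw[::2] = first
raw[1::2] = second

# Same slicing as in the notebook: 11 + 2 feature columns, then the target.
demo_data = np.hstack([raw[::2, :], raw[1::2, :2]])
demo_target = raw[1::2, 2]

print(demo_data.shape)
print(demo_target)
```

Checking `data.shape` and `target.shape` on the real arrays in the same way is a cheap guard against a changed file layout.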

In [None]:
# Variables in order:
# CRIM     per capita crime rate by town
# ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
# INDUS    proportion of non-retail business acres per town
# CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
# NOX      nitric oxides concentration (parts per 10 million)
# RM       average number of rooms per dwelling
# AGE      proportion of owner-occupied units built prior to 1940
# DIS      weighted distances to five Boston employment centres
# RAD      index of accessibility to radial highways
# TAX      full-value property-tax rate per $10,000
# PTRATIO  pupil-teacher ratio by town
# B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
# LSTAT    % lower status of the population
# MEDV     Median value of owner-occupied homes in $1000's

In [None]:
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 
                 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

In [None]:
import sklearn.ensemble  # needed here: this cell runs before the later import cell

rf = sklearn.ensemble.RandomForestRegressor(n_estimators=1000)

In [None]:
import sklearn.ensemble
import sklearn.model_selection
import numpy as np

# Hold out 20% of the rows for testing
train, test, labels_train, labels_test = sklearn.model_selection.train_test_split(
    data, target, train_size=0.80, test_size=0.20)

In [None]:
rf.fit(train, labels_train)

In [None]:
print('Random Forest MSE:', np.mean((rf.predict(test) - labels_test) ** 2))

In [None]:
print('MSE when predicting the training mean:', np.mean((labels_train.mean() - labels_test) ** 2))
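
The two numbers above are easier to read side by side as a single ratio. The sketch below computes both errors on a tiny made-up example and combines them into a skill score, `1 - model_mse / baseline_mse`, which is 0 for a model no better than the mean and approaches 1 for a near-perfect one. The helper `mse` and all the values are hypothetical and not taken from the notebook's data.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two equally shaped arrays."""
    return float(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))

# Made-up targets and predictions, for illustration only.
y_train = np.array([20.0, 22.0, 24.0, 30.0])
y_test = np.array([21.0, 25.0, 29.0])
y_model = np.array([22.0, 24.0, 28.0])

# Baseline: always predict the mean of the training targets.
baseline_mse = mse(y_test, np.full_like(y_test, y_train.mean()))
model_mse = mse(y_test, y_model)

# Fraction of the baseline error that the model removes.
skill = 1.0 - model_mse / baseline_mse

print(baseline_mse, model_mse, skill)
```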

In [None]:
# Treat every column with at most 10 distinct values as categorical
n_unique = np.array([len(set(data[:, x])) for x in range(data.shape[1])])
categorical_features = np.argwhere(n_unique <= 10).flatten()
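
The heuristic above flags a column as categorical when it has at most 10 distinct values, which catches a binary indicator such as `CHAS`. The following sketch applies the same rule to a synthetic array so the result can be checked by eye. The array `demo` and the names `demo_unique` and `demo_categorical` are invented for this illustration.

```python
import numpy as np

# Synthetic columns with a known number of distinct values.
demo = np.column_stack([
    np.linspace(0.0, 1.0, 50),    # 50 distinct values: continuous
    np.tile([0.0, 1.0], 25),      # 2 distinct values: categorical
    np.tile(np.arange(5.0), 10),  # 5 distinct values: categorical
])

# Count distinct values per column and keep indices at or below the threshold.
demo_unique = np.array([len(np.unique(demo[:, j])) for j in range(demo.shape[1])])
demo_categorical = np.flatnonzero(demo_unique <= 10)

print(demo_unique)
print(demo_categorical)
```

The threshold of 10 is a judgment call: a low-cardinality numeric column would also be treated as categorical, so it is worth printing the selected indices before passing them to the explainer.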

In [None]:
import lime
import lime.lime_tabular

explainer = lime.lime_tabular.LimeTabularExplainer(train, feature_names=feature_names, 
                                                   class_names=['price'], 
                                                   categorical_features=categorical_features, 
                                                   verbose=True, mode='regression')


In [None]:
# Explain the prediction for one test instance, keeping the 5 most influential features
i = 25
exp = explainer.explain_instance(test[i], rf.predict, num_features=5)

In [None]:
exp.show_in_notebook(show_table=True)

In [None]:
exp.as_list()
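
The weights returned above come from a simple local model fitted around the chosen instance. The sketch below illustrates that general idea with NumPy only: sample points near the instance, weight them by proximity, and fit a weighted linear model whose coefficients act as the explanation. It is a simplified illustration and not the `lime` library's implementation. The function `local_linear_explanation`, its parameters, the Gaussian sampling, and the default kernel width are all assumptions made for this example.

```python
import numpy as np

def local_linear_explanation(predict_fn, x, scale, n_samples=2000,
                             kernel_width=None, seed=0):
    """Fit a proximity-weighted linear surrogate around x (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    scale = np.asarray(scale, dtype=float)
    if kernel_width is None:
        # Assumed default: grows with the number of features.
        kernel_width = 0.75 * np.sqrt(x.size)

    # Perturb the instance in standardized units, then map back to feature units.
    z = rng.normal(size=(n_samples, x.size))
    samples = x + z * scale
    y = predict_fn(samples)

    # Exponential kernel: samples close to x get weights near 1.
    dist = np.linalg.norm(z, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)

    # Weighted least squares with an intercept column.
    design = np.hstack([np.ones((n_samples, 1)), z])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1:]

# Usage on a toy black box that is exactly linear, so the surrogate can recover it.
def toy_model(X):
    return 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0

intercept, coefs = local_linear_explanation(toy_model, x=[1.0, 2.0], scale=[2.0, 0.5])
print(intercept, coefs)
```

The coefficients are expressed per standardized unit of each feature, so they depend on the `scale` passed in. For a non-linear model such as the random forest, the surrogate is only a local approximation, and the result varies with the sampling seed and the kernel width.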