# Explaining Machine Learning Decisions Using AIX360
## Using the "Protodash" Algorithm
---

## Introduction

Black-box machine learning models that people cannot readily understand, such as deep neural networks and large ensembles, are achieving impressive accuracy on various tasks. However, as machine learning is increasingly used to inform high-stakes decisions, the explainability and interpretability of these models are becoming essential.

AI Explainability 360 (AIX360) is an open source toolkit developed by IBM Research that can help explain why a machine learning model came to a decision. The toolkit can help explain decisions to data scientists, application users, and the people whom these decisions impact directly. It includes algorithms that span the different dimensions of explanation, along with proxy explainability metrics.

For more information, see the links below:

- AIX360 Demo: https://aix360.mybluemix.net
- AIX360 GitHub: https://github.com/IBM/AIX360/
- AIX360 API Docs: https://aix360.readthedocs.io/en/latest/

## Dataset

The dataset used for this lab is the Heart Disease dataset, which is freely available from the UCI Machine Learning Repository. The **Heart Disease Data Set** is hosted [here](http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data).

## Objective

Different user roles present different requirements for explanations. In this medical scenario, there are three types of users:

1. Data scientists: who are interested in very technical explanations of why a model behaves the way it does.
2. Doctors: who are interested in knowing which characteristics a current patient shares with patients diagnosed with heart disease, to better understand why the patient is predicted to have heart disease.
3. Patients: who are interested in knowing what led to their heart disease and what they could have done to prevent it.

For this reason, AI Explainability 360 offers a collection of algorithms that provide diverse ways of explaining decisions generated by machine learning models. 

In this notebook you will utilize AIX360 to explain the decisions made to the second group, the doctors. You will use the "Protodash" algorithm for this purpose.

Upon completing this lab you will know how to:

- Load a dataset using a download link
- Create, train and evaluate an XGBoost model
- Use the Protodash algorithm to extract similar examples and compare them with the current patient's case

## 1. Setup

In order to download the data from UCI Machine Learning Repository, use the `wget` library. Use the following command to install the `wget` library: `!pip install wget` 

In [None]:
!pip install wget 


Now, the code in the cell below downloads the data set and saves it in the local filesystem. The name of downloaded file containing the data will be displayed in the output of this cell.

In [None]:
import wget
import pandas as pd

link_to_data = 'http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data'

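# remove any earlier download to make sure there are no duplicates
# (-f keeps the command quiet when no such file exists yet)
!rm -f processed.cleveland*.data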

ClevelandDataSet = wget.download(link_to_data)

print(ClevelandDataSet)

The downloaded data set contains the following attributes pertaining to heart disease.

### Data set Details:
1. age - age in years
2. sex - sex (1 = male; 0 = female)
3. cp - chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
4. trestbps - resting blood pressure (in mm Hg on admission to the hospital); loaded as `restbp` in this notebook
5. chol - serum cholesterol in mg/dl
6. fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg - resting electrocardiographic results (0 = normal; 1 = having ST-T wave abnormality; 2 = hypertrophy)
8. thalach - maximum heart rate achieved
9. exang - exercise induced angina (1 = yes; 0 = no)
10. oldpeak - ST depression induced by exercise relative to rest
11. slope - the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
12. ca - number of major vessels (0-3) colored by fluoroscopy
13. thal - 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num - number of major blood vessels > 50% blocked (angiographic disease status)  

## 2. Load and explore data

In this section you will load the data as a Pandas data frame and perform a basic exploration.

Load the data in the .csv file, **processed.cleveland.data**, into a Pandas data frame by running the code below. Note that the dataset does not contain header information, so the column names are supplied through the `col_names` variable. The first 5 rows are displayed using the `.head()` method.


In [None]:
col_names = ['age','sex','cp','restbp','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']

heart_data_df = pd.read_csv(ClevelandDataSet, sep=',', header=None, names=col_names, na_filter= True, na_values= {'ca': '?', 'thal': '?'})
heart_data_df.head()
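
To illustrate what the `na_values` argument does in the cell above, here is a minimal sketch on made-up data. The rows, the three-column layout and the names `raw`, `demo_cols` and `demo_df` are assumptions for illustration only and are not part of the Cleveland file.

```python
import io
import pandas as pd

# Two made-up rows; the second one uses "?" to mark a missing `ca` value.
raw = "63.0,1.0,0.0\n67.0,1.0,?\n"
demo_cols = ["age", "sex", "ca"]  # hypothetical subset of the real columns

# Treat "?" in the `ca` column as a missing value, as the notebook does.
demo_df = pd.read_csv(io.StringIO(raw), header=None, names=demo_cols,
                      na_values={"ca": "?"})

print(demo_df)
print(demo_df.isnull().sum())  # count of missing values per column
```

Because the `?` marker is turned into a missing value, the `ca` column stays numeric and the incomplete row can later be found with `isnull()` and removed with `dropna()`.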

In [None]:
(samples, attributes) = heart_data_df.shape
print("No. of Sample data =", samples )
print("No. of Attributes  =", attributes)

We will now create a derived attribute that will serve as our target. The goal of the model is to predict whether a patient has a heart problem. The data set as currently constructed does not directly have this information. However, it can be derived from the `num` attribute, which records the number of major vessels with more than 50% narrowing (values 0, 1, 2, 3 or 4) for each sample.

Therefore, the target column `diagnosed` can be derived in the following way:
- `diagnosed` is 0 when `num` = 0, indicating normal heart functioning
- `diagnosed` is 1 when `num` > 0, indicating a heart problem



In [None]:
heart_data_df['diagnosed'] = heart_data_df['num'].map(lambda d: 1 if d > 0 else 0)

In [None]:
heart_data_df.describe()

<a id="create"></a>
## 3. Create an XGBoost model

In recent years, ensemble learning models have taken the lead and become popular among machine learning practitioners.

An ensemble learning model employs multiple machine learning algorithms to overcome the potential weaknesses of a single model. For example, if you are picking a destination for your next vacation, you probably ask your family and friends and read reviews and blog posts. Based on all the information you have gathered, you make your final decision.

This phenomenon is referred to as the Wisdom of Crowds (WOC) in the social sciences. It states that averaging the answers (predictions or probabilities) of a group will often give a better result than the answer of any single member. The idea is that the collective knowledge of diverse and independent individuals will exceed the knowledge of any one of those individuals, helping to eliminate the noise.
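
The averaging idea can be illustrated with a small simulation. This is only a sketch under simple assumptions (independent guesses with Gaussian noise around a made-up true value); it is not how XGBoost combines its trees, and all names and numbers are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

true_value = 100.0  # hypothetical quantity the crowd tries to estimate

# Each "member" makes an independent, noisy guess around the true value.
guesses = [random.gauss(true_value, 15.0) for _ in range(50)]

# Error of the averaged guess versus the average error of individual guesses.
crowd_estimate = statistics.mean(guesses)
crowd_error = abs(crowd_estimate - true_value)
avg_individual_error = statistics.mean([abs(g - true_value) for g in guesses])

print(f"Average individual error:    {avg_individual_error:.2f}")
print(f"Error of the averaged guess: {crowd_error:.2f}")
```

The error of the averaged guess can never exceed the average individual error, and with independent noise it is usually much smaller, because individual over- and under-estimates partly cancel out.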

XGBoost is an open source library for ensemble based algorithms. It can be used for classification, regression and ranking type of problems. XGBoost supports multiple languages, such as C++, Python, R, and Java. 

The Python library of XGBoost supports the following API interfaces to train a model (also referred to as a `Booster`) and make predictions with it:
- XGBoost's native APIs pertaining to the `xgboost` package, such as `xgboost.train()` or `xgboost.Booster`
- Scikit-Learn based Wrapper APIs: `xgboost.sklearn.XGBClassifier` and `xgboost.sklearn.XGBRegressor`

In this section you will learn how to train and test an XGBoost model using the scikit-learn based Wrapper APIs.  

First, you must import the required libraries.

In [None]:
import xgboost
from xgboost.sklearn import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split 

from xgboost import plot_importance
from matplotlib import pyplot
import pprint
%matplotlib inline

### 3.1: Prepare Data
In this section, clean and transform the data in the Pandas data frame into the data that can be given as input for training the model. 

#### 3.1.1: Cleanse the data
First, check whether there are any null values in the dataset so that the corresponding rows can be removed.

In [None]:
print("List of features with their corresponding count of null values : ")
print("---------------------------------------------------------------- ")
print(heart_data_df.isnull().sum())

The output of the above cell shows that the data set contains 6 null values. The rows containing them can be removed so that the data set does not have any incomplete data. The cell below contains the command to remove these rows.

In [None]:
heart_data_df = heart_data_df.dropna(how='any',axis=0)

#### 3.1.2: Prepare the target data and feature columns
The next step is to select the attributes in the current data set that can be used for training the model. Here, all the attributes other than `num` and the derived target `diagnosed` are chosen as the features.


In [None]:
feature_cols = ['age','sex','cp','restbp','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal']
features_df = heart_data_df[feature_cols]

#### 3.1.3: Split the data set for training and testing
Now that the target and feature columns have been defined, you can split the data set into two sets: one for training the model and one for testing the trained model.

In [None]:
heart_train, heart_test, target_train, target_test = train_test_split(features_df, heart_data_df.loc[:,'diagnosed'], test_size=0.33, random_state=0)


### 3.2 Create the XGBoost Model
In the cell below, we create our pipeline which contains the XGBoost classifier:

In [None]:
pipeline = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])

After we have set up our pipeline with the XGBoost classifier, we can train it by invoking the fit method.

In [None]:
model = pipeline.fit(heart_train,target_train)

We can now make predictions on test data and evaluate the model.

In [None]:
y_pred = model.predict(heart_test.values)
accuracy = accuracy_score(target_test, y_pred)

print("Accuracy: %.2f%%" % (accuracy * 100.0))

## 4. Using AIX360

In this section, you will install the aix360 library. This may take a few minutes.

In [None]:
!pip install aix360 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML
from aix360.algorithms.protodash import ProtodashExplainer

## 4.1 Doctor: ProtoDash Explainer - using similar examples

We now show how to generate explanations by selecting prototypical or similar patient profiles for a patient in question that the doctor may be interested in. This may help the doctor understand the patient's diagnosis by the AI system in the context of other similar patients. Note that the selected prototypical patients are profiles from the training set that was used to train the AI model that predicts whether or not a patient has heart disease. In fact, the method used in this notebook also works when we are given not just one but a set of patient profiles for which we want to find similar profiles in a training dataset. Additionally, the method computes a weight for each prototype that indicates its similarity to the patient(s) in question.

The prototypical explanations in AIX360 are obtained using the Protodash algorithm developed in the following work: [ProtoDash: Fast Interpretable Prototype Selection](https://arxiv.org/abs/1707.01212)
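
To give a feel for what "similar profiles with weights" means, the sketch below ranks candidate rows by their similarity to a query row and normalizes the similarities into weights. It is a deliberately simplified illustration and not the Protodash algorithm: Protodash selects prototypes and their weights jointly, whereas this sketch scores each candidate on its own. The Gaussian kernel, the `sigma` value, the function name and the random data are all assumptions made for this example.

```python
import numpy as np

def rank_by_kernel_similarity(x, candidates, m=3, sigma=2.0):
    """Return indices and normalized weights of the m candidates most similar to x.

    Simplified illustration only: each candidate is scored independently with a
    Gaussian kernel; this is NOT the joint selection performed by Protodash.
    """
    sq_dist = np.sum((candidates - x) ** 2, axis=1)      # squared Euclidean distances
    similarity = np.exp(-sq_dist / (2.0 * sigma ** 2))   # Gaussian kernel, in (0, 1]
    top = np.argsort(similarity)[::-1][:m]               # m most similar, best first
    weights = similarity[top] / similarity[top].sum()    # normalize so weights sum to 1
    return top, weights

rng = np.random.default_rng(0)
candidates = rng.normal(size=(20, 4))  # hypothetical standardized patient profiles
query = candidates[7] + 0.01           # a query that is almost identical to row 7

idx_top, w_top = rank_by_kernel_similarity(query, candidates, m=3)
print("Most similar rows:", idx_top)
print("Normalized weights:", np.round(w_top, 3))
```

As with the weights returned by the explainer later in this notebook, a larger weight here indicates a candidate that is closer to the patient in question.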


### 4.1.1. Obtain similar samples as explanations for a patient predicted as "Has Heart Disease" 

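The following cell uses the trained model to predict labels for the training data and stores the instances that were predicted as "Has Heart Disease". The first rows of the test set, from which a patient is chosen in the next step, are then displayed.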

In [None]:
p_train = model.predict(heart_train) # Use trained model to predict train points
p_train = p_train.reshape((p_train.shape[0],1))

z_train = np.hstack((heart_train, p_train)) # Append the predicted label as the last column
z_train_hd = z_train[z_train[:,-1]>=1, :] # Store instances that were predicted as Has Heart Disease

In [None]:
heart_test.head(10)

### 4.1.2. Let us now consider patient number 10, who has been diagnosed with heart disease.

Please note that patient number 10 may not appear in the table in the cell above because the data is randomized. We select patient number 10 because we know that, in the data, they are diagnosed with heart disease.
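
The difference between a row's position and the index label shown in the table can be seen in a small sketch. The frame below is made up, and `demo`, `shuffled` and `pos` are hypothetical names; the point is only that indexing a NumPy array, as in `heart_test_np[idx]`, selects by position, while the left-hand column of a shuffled data frame shows the original row labels.

```python
import pandas as pd

# Made-up frame whose index labels (10..14) stand in for original row numbers.
demo = pd.DataFrame({"age": [63, 67, 41, 56, 57]}, index=[10, 11, 12, 13, 14])

# Shuffle the rows; each row keeps the label it had before shuffling.
shuffled = demo.sample(frac=1.0, random_state=0)

pos = 2                                 # a position, like idx in the notebook
by_position = shuffled.to_numpy()[pos]  # positional lookup on the NumPy array
same_row = shuffled.iloc[pos]           # the same row via positional DataFrame lookup
original_label = shuffled.index[pos]    # the label displayed in the index column

print("Row at position", pos, "carries original label", original_label)
```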

In [None]:
idx = 10
class_names = ['Negative', 'Positive']
heart_test_np = heart_test.to_numpy()

X = heart_test_np[idx].reshape((1,) + heart_test_np[idx].shape)
print("Chosen Sample:", idx)
print("Prediction made by the model:", class_names[np.argmax(model.predict_proba(X))])
print("Prediction probabilities:", model.predict_proba(X))
print("")

# attach the prediction made by the model to X
X = np.hstack((X, model.predict(X).reshape((1,1))))

dfx = pd.DataFrame.from_records(X.astype('double')) # Create dataframe with original feature values
dfx.head()

dfx[15] = class_names[X[0, -1].astype(int)]
dfx.columns = heart_data_df.columns
dfx.transpose()

### 4.1.3. Find similar patients predicted as "Has Heart Disease" using the Protodash explainer.

In [None]:
explainer = ProtodashExplainer()
(W, S, setValues) = explainer.explain(X, z_train_hd, m=5) # Return weights W, Prototypes S and objective function values

### 4.1.4. Display similar patient user profiles and the extent to which they are similar to the chosen patient as indicated by the last row in the table below labelled as "Weight".

In [None]:
dfs = pd.DataFrame.from_records(z_train_hd[S, 0:-1].astype('double'))
df = pd.read_csv(ClevelandDataSet, sep=',', header=None, names=col_names, na_filter= True, na_values= {'ca': '?', 'thal': '?'})

RP=[]
for i in range(S.shape[0]):
    RP.append(class_names[z_train_hd[S[i], -1].astype(int)]) # Append class names
dfs[13] = RP
dfs.columns = df.columns  
dfs["Weight"] = np.around(W, 5)/np.sum(np.around(W, 5)) # Calculate normalized importance weights
dfs.transpose()

### 4.1.5. Compute how similar a feature of a prototypical user is to the chosen patient

The more similar a feature of a prototypical patient is to that of the chosen patient, the closer its weight is to 1. We can see below that several features of the prototypes are quite similar to those of the chosen patient.
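
Written as a formula, the per-feature similarity computed in the next cell is the following, where $x_j$ is feature $j$ of the chosen patient, $z_{ij}$ is feature $j$ of prototype $i$, $\sigma_j$ is the standard deviation of feature $j$ across the selected prototypes, and $\epsilon = 10^{-10}$ guards against division by zero. The symbol names are chosen here for illustration.

$$
\mathrm{fwt}_{ij} = \exp\left(-\frac{|x_j - z_{ij}|}{\sigma_j + \epsilon}\right)
$$

Since the exponent is never positive, each value lies in $(0, 1]$ and equals 1 exactly when the prototype and the patient have the same value for that feature.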

In [None]:
z = z_train_hd[S, 0:-1] # Store chosen prototypes
eps = 1e-10 # Small constant defined to eliminate divide-by-zero errors
fwt = np.zeros(z.shape)
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        fwt[i, j] = np.exp(-1 * abs(X[0, j] - z[i,j])/(np.std(z[:, j])+eps)) # Compute feature similarity in [0,1]
                
# move wts to a dataframe to display
dfw = pd.DataFrame.from_records(np.around(fwt.astype('double'), 2))
dfw.columns = df.columns[:-1]
dfw.transpose()
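
The nested loop that computes the feature similarities can also be written with NumPy broadcasting. The sketch below checks on made-up data that both forms give the same result; `x_demo` and `z_demo` are hypothetical stand-ins for the feature columns of `X` and for `z`.

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(1, 4))  # hypothetical query row (feature columns only)
z_demo = rng.normal(size=(5, 4))  # hypothetical prototypes, one per row
eps = 1e-10                       # small constant to avoid division by zero

# Loop form, mirroring the notebook cell.
fwt_loop = np.zeros(z_demo.shape)
for i in range(z_demo.shape[0]):
    for j in range(z_demo.shape[1]):
        fwt_loop[i, j] = np.exp(-abs(x_demo[0, j] - z_demo[i, j])
                                / (np.std(z_demo[:, j]) + eps))

# Broadcast form: the (1, 4) query row is compared with every prototype row,
# and each column is scaled by that column's standard deviation.
fwt_vec = np.exp(-np.abs(x_demo - z_demo) / (z_demo.std(axis=0) + eps))

print(np.allclose(fwt_loop, fwt_vec))
```

In the notebook itself, `X` carries the model's prediction as an extra last column, so the broadcast form would need to use only its feature columns, for example `X[:, :-1]`.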