# Explaining Machine Learning Decisions Using AIX360
## Using "Protodash" Algorithm
---

## Introduction

Black box machine learning models that cannot be understood by people, such as deep neural networks and large ensembles, are achieving impressive accuracy on various tasks. However, as machine learning is increasingly used to inform high stakes decisions, explainability and interpretability of the models are becoming essential. There are many ways to explain: data vs. model, directly interpretable vs. post hoc explanation, local vs. global, static vs. interactive; the appropriate choice depends on the persona of the consumer of the explanation.

The AI Explainability 360 Python package includes algorithms that span the different dimensions of ways of explaining along with proxy explainability metrics. The AI Explainability 360 interactive demo provides a gentle introduction to the concepts and capabilities by walking through an example use case from the perspective of different consumer personas. The tutorials and other notebooks offer a deeper, data scientist-oriented introduction. The complete API is also available. 

For reference information see links below:

- AIX360 Demo: https://aix360.mybluemix.net
- AIX360 GitHub: https://github.com/IBM/AIX360/
- AIX360 API Docs: https://aix360.readthedocs.io/en/latest/

## Tutorial Objective

Different user roles have different requirements for explanations. In this medical scenario, there are three types of users:

1. Data scientists: who are interested in highly technical explanations of why a model behaves the way it does.
2. Doctors: who want to know which characteristics a current patient shares with patients diagnosed with heart disease, to better understand why the patient is predicted to have heart disease.
3. Patients: who want to know what they did that led to heart disease and what they could have done to prevent it.

For this reason, AI Explainability 360 offers a collection of algorithms that provide diverse ways of explaining decisions generated by machine learning models. 

In this notebook you will use AIX360 to explain the model's decisions to the second group, the doctors. You will use the "Protodash" algorithm for this purpose.

Upon completing this lab you will learn how to:

- Load a dataset using a download link
- Create, train, and evaluate an XGBoost model
- Use the Protodash algorithm to extract similar examples and compare them with the current patient's case

## Environment
This tutorial uses a Jupyter Notebook, an open-source web application that allows you to create and share documents that contain instructions as well as live code.

The Jupyter Notebook we are using today runs in a Watson Studio environment, a set of open source packages that provides us with standardized data analysis tools. At multiple points during the demo, we will import additional tools we need for specific steps:

E.g. `import pandas as pd` -> to import the "pandas" tool for data manipulation.

E.g. `!pip install wget` -> to install a tool "wget" to download data from external webpages.

Watson Studio also contains a set of functionality that allows a user to pre-define a set of environments down to the package version level as well as define the hardware configurations available to certain users, allowing teams to easily standardize toolsets and resources. If needed, we can also connect our notebooks to GPUs, Apache Spark, and external clusters for higher performance.


## 1. Setup

In order to download the data from UCI Machine Learning Repository, use the `wget` library. Use the following command to install the `wget` library: `!pip install wget` 

In [None]:
!pip install wget 


Now, the code in the cell below downloads the data set and saves it in the local filesystem. The name of the downloaded file containing the data will be displayed in the output of this cell.

In [None]:
import wget
import pandas as pd

link_to_data = 'http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data'

# Remove earlier copies so wget does not save a renamed duplicate (-f avoids an error when none exist)
!rm -f processed.cleveland*.data

ClevelandDataSet = wget.download(link_to_data)

if (ClevelandDataSet is not None):
    print("ClevelandDataSet loaded successfully.")

## Dataset

The heart disease dataset, pulled from the UC Irvine Machine Learning Repository, contains anonymized information on factors contributing to heart disease. The names and social security numbers from the raw data were removed and replaced with dummy values, allowing for unique identifiers without personally identifiable information. While the full database contains data from multiple locations, the lab today will focus on the Cleveland database, concentrating on either the presence or absence of heart disease.

It is a freely available data set on the UCI Machine Learning Repository portal. The Heart Disease Data Set is hosted [here](http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data).

The overall heart disease database, including data from other locations, is [here](https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/).

### Column Details:
1. age - age in years
2. sex - sex (1 = male; 0 = female)
3. cp - chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
4. trestbps - resting blood pressure (in mm Hg on admission to the hospital); named `restbp` in the code below
5. chol - serum cholesterol in mg/dl
6. fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg - resting electrocardiographic results (0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy)
8. thalach - maximum heart rate achieved
9. exang - exercise induced angina (1 = yes; 0 = no)
10. oldpeak - ST depression induced by exercise relative to rest
11. slope - the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
12. ca - number of major vessels (0-3) colored by fluoroscopy
13. thal - 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num - diagnosis of heart disease (angiographic disease status), an integer from 0 to 4

Values 1, 2, 3, and 4 are considered "presence" of heart disease and 0 is "absence."

## 2. Explore data

In this section you will load the data as a Pandas data frame and perform a basic exploration.

Load the data in the comma-separated file, **processed.cleveland.data**, into a Pandas data frame by running the code below. The dataset has no header row, so the column names are supplied through the `col_names` variable. The `.head()` method displays the first 5 rows.


In [None]:
col_names = ['age','sex','cp','restbp','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']

heart_data_df = pd.read_csv(ClevelandDataSet, sep=',', header=None, names=col_names, na_filter= True, na_values= {'ca': '?', 'thal': '?'})
heart_data_df.head()
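
To see what the `na_values` argument does without downloading anything, the sketch below parses three rows laid out like the file. The row values are invented for illustration and are not taken from the Cleveland data; `sample_text` and `sample_df` are names introduced only for this example.

```python
import io
import pandas as pd

col_names = ['age', 'sex', 'cp', 'restbp', 'chol', 'fbs', 'restecg',
             'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']

# Three invented rows; '?' marks a missing value, as in the real file.
sample_text = (
    "60.0,1.0,4.0,130.0,250.0,0.0,2.0,140.0,1.0,1.2,2.0,?,7.0,2\n"
    "45.0,0.0,2.0,120.0,210.0,0.0,0.0,170.0,0.0,0.0,1.0,0.0,?,0\n"
    "52.0,1.0,3.0,125.0,230.0,1.0,0.0,160.0,0.0,0.5,1.0,1.0,3.0,0\n"
)

sample_df = pd.read_csv(io.StringIO(sample_text), header=None, names=col_names,
                        na_values={'ca': '?', 'thal': '?'})

# The '?' entries are read as NaN, so both columns stay numeric.
print(sample_df[['ca', 'thal']].isnull().sum())
```

Because the `?` markers are converted to NaN at load time, the later null checks can find them with `isnull()`.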

In [None]:
(samples, attributes) = heart_data_df.shape
print("No. of Sample data =", samples )
print("No. of Attributes  =", attributes)

We will now create a derived attribute that will serve as our target. The goal of the model is to predict whether a patient has a heart problem. The data set as currently constructed does not directly have this information as a yes/no value. However, it can be derived from the `num` attribute, which records the angiographic disease status of each sample as an integer (0, 1, 2, 3, or 4), where 0 means no heart disease was found.

Therefore, the target column `diagnosed` can be derived in the following way:
- 'diagnosed' is '0' when 'num' = 0, indicating normal heart functioning
- 'diagnosed' is '1' when 'num' > 0, indicating a heart problem.



In [None]:
heart_data_df['diagnosed'] = heart_data_df['num'].map(lambda d: 1 if d > 0 else 0)
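
As a small illustration of the rule above, the sketch below applies it to a handful of invented `num` values and compares the `map` form with a vectorized comparison. The Series contents are made up for the example.

```python
import pandas as pd

# Invented `num` values covering the full 0-4 range.
num = pd.Series([0, 2, 0, 4, 1, 3])

# Same rule as the notebook cell, written with map and a lambda.
diagnosed_map = num.map(lambda d: 1 if d > 0 else 0)

# Vectorized equivalent: compare, then cast the booleans to integers.
diagnosed_vec = (num > 0).astype(int)

print(diagnosed_vec.tolist())  # [0, 1, 0, 1, 1, 1]
```

Both forms give the same result; the vectorized one avoids calling a Python function per row, which matters little at this data size.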

In [None]:
heart_data_df.describe()

<a id="create"></a>
## 3. Create an XGBoost model

In recent years, ensemble learning models have taken the lead and become popular among machine learning practitioners.

Ensemble learning models employ multiple machine learning algorithms to overcome the potential weaknesses of a single model. For example, if you are going to pick a destination for your next vacation, you probably ask your family and friends, and read reviews and blog posts. Based on all the information you have gathered, you make your final decision.

This phenomenon is referred to as the Wisdom of Crowds (WOC) in the social sciences. It states that averaging the answers (predictions or probabilities) of a group will often give a better result than the answer of any one of its members. The idea is that the collective knowledge of diverse and independent individuals will exceed the knowledge of any one of those individuals, helping to eliminate the noise.
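
The averaging effect can be illustrated with a small simulation. This is only a sketch under idealized assumptions (independent, unbiased guesses with invented noise levels); real ensemble members are usually correlated, so the gain is smaller in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0

# 1000 trials, each with 25 independent noisy guesses of the same quantity.
guesses = true_value + rng.normal(loc=0.0, scale=2.0, size=(1000, 25))

# Average absolute error of a single guesser.
individual_error = np.abs(guesses - true_value).mean()

# Absolute error of the crowd's averaged guess, averaged over the trials.
crowd_error = np.abs(guesses.mean(axis=1) - true_value).mean()

print(f"individual: {individual_error:.2f}, crowd: {crowd_error:.2f}")
```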

XGBoost is an open source library for ensemble-based algorithms. It can be used for classification, regression, and ranking problems. XGBoost supports multiple languages, such as C++, Python, R, and Java.

The XGBoost Python library supports the following API interfaces for training a model (also referred to as a `Booster`) and making predictions with it:
- XGBoost's native APIs pertaining to the `xgboost` package, such as `xgboost.train()` or `xgboost.Booster`
- Scikit-Learn based Wrapper APIs: `xgboost.sklearn.XGBClassifier` and `xgboost.sklearn.XGBRegressor`

In this section you will learn how to train and test an XGBoost model using the scikit-learn based Wrapper APIs.  

To start off, we will add to the default environment by importing additional packages that will provide us with the specific tools we need for building models using XGBoost and displaying results using matplotlib.

XGBoost: https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/

matplotlib: https://matplotlib.org/

In [None]:
import xgboost
from xgboost.sklearn import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split 

from xgboost import plot_importance
from matplotlib import pyplot
import pprint
%matplotlib inline

### 3.1: Prepare Data
In order for our machine learning model to yield accurate results, we need to clean our data. Additionally, the data must be structured in a specific way to be used as an input in building our machine learning models.

We will use Pandas, a software library purpose-built for data manipulation and analysis.

#### 3.1.1: Cleanse the data
Null (empty) values would adversely affect the performance and accuracy of our machine learning model.
Below, we check whether any rows in our dataset contain null values and remove those rows.

`heart_data_df.isnull().sum()` - Count the number of null values for each feature (input variable)

In [None]:
print("List of features with their corresponding count of null values : ")
print("---------------------------------------------------------------- ")
print(heart_data_df.isnull().sum())

From the output of the above cell, there are 6 occurrences where there are null values. The rows containing these null values can be removed so that the data set does not have any incomplete data. The cell below contains the command to remove the rows that contain these null values.

`heart_data_df = heart_data_df.dropna(how='any',axis=0)` - For our dataset (heart_data_df) drop any 'NA' (null value).

In [None]:
heart_data_df = heart_data_df.dropna(how='any',axis=0)
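
The effect of these two calls can be seen on a tiny invented frame. The values and the names `toy_df` and `toy_clean` are made up for this sketch.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'ca':   [0.0, np.nan, 2.0, 1.0],
    'thal': [3.0, 7.0, np.nan, 6.0],
    'age':  [63, 45, 58, 51],
})

# Null count per column: one in 'ca', one in 'thal', none in 'age'.
print(toy_df.isnull().sum())

# Drop every row that has at least one null value.
toy_clean = toy_df.dropna(how='any', axis=0)
print(len(toy_df), "->", len(toy_clean))  # 4 -> 2
```

Note that `dropna` keeps the original index labels, so the cleaned frame here retains labels 0 and 3.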

#### 3.1.2: Prepare the target data and feature columns
A large part of the model building process is choosing which features (inputs) we want to use in prediction (e.g. Does high cholesterol cause heart disease? Or would age be a stronger predictor?). Choosing irrelevant features can decrease the accuracy of our models.

For brevity, we have already identified the most relevant features in our data: all our attributes other than `num` (and the `diagnosed` target derived from it) should be used.

In the code below, we create `feature_cols` as a list of the features we will use. That is, we will predict on age, sex, cp, and the other columns in the list below.

In [None]:
feature_cols = ['age','sex','cp','restbp','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal']
features_df = heart_data_df[feature_cols]

#### 3.1.3: Split the data set for training and testing
As the target and feature columns have been defined, you can now split the data set into two sets that will be used for training the model and for testing the trained model. 

A training set is used to build our machine learning model. The test data set is used to assess the performance of our model. The statement below uses 67% of our data to train our model and 33% of our data to test the model.

In [None]:
heart_train, heart_test, target_train, target_test = train_test_split(features_df, heart_data_df.loc[:,'diagnosed'], test_size=0.33, random_state=0)
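
The sketch below runs the same kind of split on 100 invented samples to show the resulting sizes and the role of `random_state`. The arrays are placeholders, not the heart data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

demo_features = np.arange(200).reshape(100, 2)   # 100 invented samples, 2 features
demo_target = np.array([0, 1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_features, demo_target, test_size=0.33, random_state=0)
print(len(X_tr), len(X_te))

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    demo_features, demo_target, test_size=0.33, random_state=0)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```

Fixing `random_state` is what makes the accuracy reported later repeatable from run to run.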


### 3.2 Create the XGBoost Model
In the cell below, we create our pipeline which contains the XGBoost classifier. We will use this pipeline to train our model.

In [None]:
pipeline = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])

After we have set up our pipeline with the XGBoost classifier, we can train it by invoking the fit method.

In [None]:
model = pipeline.fit(heart_train,target_train)

We can now evaluate our model using the testing dataset we created earlier. If we find that our model has low accuracy, we can make adjustments to our model's form or input parameters and retrain the model.

In [None]:
y_pred = model.predict(heart_test.values)
accuracy = accuracy_score(target_test, y_pred)

print("Accuracy: %.2f%%" % (accuracy * 100.0))
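
To make the accuracy figure concrete, the sketch below computes it by hand on eight invented labels and predictions, and also counts the four possible outcomes. All values are made up for illustration.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # invented labels
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])    # invented predictions

# Accuracy is the fraction of predictions that match the labels.
toy_accuracy = np.mean(y_true == y_hat)

# Counting the four outcomes shows which kind of mistake is being made.
tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # disease, predicted disease
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # healthy, predicted healthy
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # healthy, predicted disease
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # disease, predicted healthy

print("Accuracy: %.2f%%" % (toy_accuracy * 100.0))  # Accuracy: 75.00%
print(tp, tn, fp, fn)                               # 3 3 1 1
```

In a medical setting the two kinds of error are not equally costly, so it can be worth looking at these counts in addition to the single accuracy number.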

## 4. Using AIX360


In this section, you will install the aix360 library into our environment, allowing us to use the prebuilt algorithms to explain how our model makes decisions. 

**Note:** This step may take a few minutes.

In [None]:
#!pip install aix360 
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='bb85b849-190f-40a1-85b7-cc59bef96005', project_access_token='p-b0ce539bc2fb99e43c428c7c2ea1261802f48fdf')
pc = project.project_context

In [None]:
# Initialize the project variable
from project_lib import Project

# Fetch the file, for example the tar.gz or whatever installable distribution you created
with open("aix360-0.2.0.tar.gz","wb") as f:
    f.write(project.get_file("aix360-0.2.0.tar.gz").read())

# Install and import the library
!pip install "aix360-0.2.0.tar.gz"

Next, import the remaining libraries.
The line `from aix360.algorithms.protodash import ProtodashExplainer` imports `ProtodashExplainer`, the algorithm we will use for this lab, from the aix360 library we installed in the previous cell.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML
from aix360.algorithms.protodash import ProtodashExplainer

### 4.1 Doctor: ProtoDash Explainer using similar examples

We now show how to generate explanations by selecting prototypical user profiles that are similar to the patient in question and that the doctor may be interested in. This may help the doctor understand the diagnosis made by our machine learning model in context.

In other words, we identify example profiles of patients that our model diagnosed with heart disease and their specific attributes (e.g. age 65, chol 248). We can then understand why another patient was diagnosed by our model with heart disease by comparing his/her attributes (age, chol, etc) with that example 'prototypical' patient and measuring the level of similarity.

Note that the selected prototypical patients are profiles that are part of the training set that has been used to train an AI model that predicts whether or not a patient has heart disease. In fact, the method used in this notebook can work even if we are given not just one but a set of patient profiles for which we want to find similar profiles from a training dataset. Additionally, the method computes weights for each prototype showcasing its similarity to the patient(s) in question.

The prototypical explanations in AIX360 are obtained using the Protodash algorithm, developed in the following work: [ProtoDash: Fast Interpretable Prototype Selection](https://arxiv.org/abs/1707.01212).
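
To give a feel for how prototypes and weights can be chosen, here is a simplified greedy sketch in the spirit of the approach: repeatedly add the candidate with the largest gradient of a kernel-matching objective, then re-fit non-negative weights. It is not the AIX360 implementation, which differs in details such as how the weights are optimized. The helper names `gaussian_kernel` and `select_prototypes`, the parameters `sigma` and `steps`, and the toy data are all assumptions made for this example.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=2.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def select_prototypes(query, candidates, m, sigma=2.0, steps=500):
    """Greedily pick m candidates with non-negative weights (illustrative helper)."""
    K = gaussian_kernel(candidates, candidates, sigma)            # candidate-to-candidate similarity
    mu = gaussian_kernel(candidates, query, sigma).mean(axis=1)   # mean similarity to the query set
    chosen, w = [], np.zeros(0)
    for _ in range(m):
        # Gradient of  mu^T w - 0.5 * w^T K w  for every candidate.
        grad = mu - K[:, chosen] @ w
        grad[chosen] = -np.inf                 # never pick the same candidate twice
        chosen.append(int(np.argmax(grad)))
        # Re-fit the weights of the chosen set by projected gradient ascent (w >= 0).
        K_s, mu_s = K[np.ix_(chosen, chosen)], mu[chosen]
        w = np.append(w, 0.0)
        step = 1.0 / np.linalg.eigvalsh(K_s).max()
        for _ in range(steps):
            w = np.maximum(0.0, w + step * (mu_s - K_s @ w))
    return np.array(chosen), w

# Invented candidate profiles on a small grid, and a query close to candidate 4.
candidates = np.array([[i, i % 3, (i * i) % 5] for i in range(12)], dtype=float)
query = candidates[[4]] + 0.1

proto_idx, proto_w = select_prototypes(query, candidates, m=3)
print(proto_idx, np.round(proto_w / proto_w.sum(), 3))
```

With a single query, the first prototype picked is the candidate most similar to it; later picks are traded off against similarity to the prototypes already chosen, which is why the selected set tends to be diverse rather than a list of near-duplicates.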


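#### 4.1.1. Obtain similar samples as explanations for a patient predicted as "Has Heart Disease"

The following cell uses the trained model to predict a label for every training sample and keeps only the samples predicted as "Has Heart Disease". These samples form the pool from which similar patients will be selected.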

In [None]:
p_train = model.predict(heart_train) # Use trained model to predict train points
p_train = p_train.reshape((p_train.shape[0],1))

z_train = np.hstack((heart_train, p_train)) 
z_train_hd = z_train[z_train[:,-1]>=1, :]  # Store instances that were predicted as Has Heart Disease
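
The stacking and filtering in this cell can be followed on a three-row example. The feature values and predictions below are invented, and the `toy_` names exist only in this sketch.

```python
import numpy as np

toy_rows = np.array([[63.0, 1.0],
                     [45.0, 0.0],
                     [58.0, 1.0]])        # invented feature rows
toy_pred = np.array([1, 0, 1])            # invented model predictions

# Turn the predictions into a column and append it to the features.
toy_z = np.hstack((toy_rows, toy_pred.reshape(-1, 1)))

# Keep only the rows whose last column (the prediction) is 1.
toy_z_hd = toy_z[toy_z[:, -1] >= 1, :]
print(toy_z_hd)
```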

#### 4.1.2. Consider the patient at index 10, who has been diagnosed with heart disease

We select the sample at index 10 of the test set because the data records this patient as diagnosed with heart disease.

We then display:
- The patient (sample) we chose.
- The prediction made: 'Positive' (has heart disease).
- Our prediction probabilities (['probability does not have heart disease', 'probability has heart disease']).

The associated patient information (e.g. age=65) is shown in the table that follows.

In [None]:
idx = 10
class_names = ['Negative', 'Positive']
heart_test_np = heart_test.to_numpy()

X = heart_test_np[idx].reshape((1,) + heart_test_np[idx].shape)
print("Chosen Sample:", idx)
print("Prediction made by the model:", class_names[np.argmax(model.predict_proba(X))])
print("Prediction probabilities:", model.predict_proba(X))
print("")

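# Attach the prediction made by the model to X as an extra last column
X = np.hstack((X, model.predict(X).reshape((1,1))))

dfx = pd.DataFrame.from_records(X.astype('double')) # Create dataframe with original feature values

# Add the predicted class name as one more column, so the column count matches heart_data_df
dfx[15] = class_names[X[0, -1].astype(int)]
# After renaming, `num` holds the model's 0/1 prediction and `diagnosed` holds its class name
dfx.columns = heart_data_df.columns
dfx.transpose()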
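
Two steps in the cell above are easy to misread: the reshape that turns one sample into a single-row 2-D array, and the `argmax` that turns probabilities into a class name. The sketch below shows both on invented numbers; the `toy_` names are made up for the example.

```python
import numpy as np

toy_test = np.arange(12.0).reshape(4, 3)     # 4 invented samples, 3 features
k = 2

row_a = toy_test[k].reshape((1,) + toy_test[k].shape)   # form used in the notebook
row_b = toy_test[k:k + 1]                                # slicing also keeps the 2-D shape
print(row_a.shape, np.array_equal(row_a, row_b))         # (1, 3) True

# predict_proba-style output: one row per sample, one column per class.
toy_proba = np.array([[0.2, 0.8]])
toy_class_names = ['Negative', 'Positive']
print(toy_class_names[np.argmax(toy_proba)])             # Positive
```

The single-row 2-D shape matters because scikit-learn style `predict` and `predict_proba` methods expect one row per sample.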

#### 4.1.3. Find similar patients predicted as "Has Heart Disease" using the Protodash explainer

Similar to how we created our machine learning pipeline, we now create a Protodash explainer and then find similar 'example' patients.

In [None]:
explainer = ProtodashExplainer()
(W, S, setValues) = explainer.explain(X, z_train_hd, m=5) # Return weights W, Prototypes S and objective function values

#### 4.1.4. Display similar patient profiles and the extent to which they are similar to the chosen patient
The last row of the table below, labelled "Weight", shows each prototype's normalized weight. The weights sum to 1, and a larger weight indicates a prototype that is more similar to the chosen patient.

In [None]:
dfs = pd.DataFrame.from_records(z_train_hd[S, 0:-1].astype('double'))
df = pd.read_csv(ClevelandDataSet, sep=',', header=None, names=col_names, na_filter= True, na_values= {'ca': '?', 'thal': '?'})

RP=[]
for i in range(S.shape[0]):
    RP.append(class_names[z_train_hd[S[i], -1].astype(int)]) # Append class names
dfs[13] = RP
dfs.columns = df.columns  
dfs["Weight"] = np.around(W, 5)/np.sum(np.around(W, 5)) # Calculate normalized importance weights
dfs.transpose()
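
The normalization in the cell above can be checked on a few invented raw weights. The numbers are made up; only the expression is taken from the notebook.

```python
import numpy as np

raw_w = np.array([0.61234567, 0.2, 0.1, 0.05, 0.04])   # invented raw weights

# Same expression as the notebook cell: round, then divide by the rounded total.
norm_w = np.around(raw_w, 5) / np.sum(np.around(raw_w, 5))

print(norm_w.round(3), norm_w.sum())
```

Because the normalized weights share a total of 1, a weight can only approach 1 when the other prototypes contribute almost nothing, so the values are best read relative to each other.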

#### 4.1.5. Compute how similar a feature of a prototypical user is to the chosen patient

In the code below we switch the raw values to a measure of similarity (e.g. age for column 0 switched from 58 to 0.18).

The more similar a feature of a prototypical user is to that of the patient, the closer its weight is to 1.
We can see below that several features of the prototypes are quite similar to those of the chosen patient. We can use these values to infer why a machine learning model diagnosed an individual patient with heart disease based on the level of similarity in each row.


In [None]:
z = z_train_hd[S, 0:-1] # Store chosen prototypes
eps = 1e-10 # Small constant defined to eliminate divide-by-zero errors
fwt = np.zeros(z.shape)
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        # exp(-|difference| / std): equals 1 for identical values and decays toward 0
        fwt[i, j] = np.exp(-1 * abs(X[0, j] - z[i,j])/(np.std(z[:, j])+eps)) # Compute feature similarity in [0,1]

# move wts to a dataframe to display
dfw = pd.DataFrame.from_records(np.around(fwt.astype('double'), 2))
dfw.columns = df.columns[:-1]
dfw.transpose()
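
The double loop above can also be written with NumPy broadcasting. The sketch below checks that the two forms agree on invented values; the arrays `proto` and `patient` are placeholders, not the notebook's data.

```python
import numpy as np

proto = np.array([[58.0, 230.0],
                  [65.0, 248.0],
                  [49.0, 266.0]])        # invented prototypes (rows) by features (columns)
patient = np.array([[60.0, 250.0]])      # invented chosen patient, one row
eps = 1e-10                              # avoids division by zero for constant columns

# Loop form, as in the notebook cell.
sim_loop = np.zeros(proto.shape)
for i in range(proto.shape[0]):
    for j in range(proto.shape[1]):
        sim_loop[i, j] = np.exp(-abs(patient[0, j] - proto[i, j]) / (np.std(proto[:, j]) + eps))

# Broadcast form: compare the patient row with every prototype at once.
sim_vec = np.exp(-np.abs(patient - proto) / (proto.std(axis=0) + eps))

print(np.allclose(sim_loop, sim_vec))  # True
```

In the notebook, `X` carries the model's prediction as an extra last column, so the broadcast form there would compare `X[:, :-1]` with `z`.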