<table style="border: none" align="left">
   <tr style="border: none">
      <th style="text-align: left;border: none"><font face="verdana" size="5" color="black"><b>Train and deploy a heart disease prediction model using XGBoost and IBM Watson Machine Learning APIs</b></th>
      <th style="border: none"><img src="https://github.com/pmservice/customer-satisfaction-prediction/blob/master/app/static/images/ml_icon_gray.png?raw=true" alt="Watson Machine Learning icon" height="40" width="40"></th>
   </tr>
   <tr style="border: none">
       <th style="border: none"><img src="https://github.com/pmservice/drug-selection/raw/master/images/heart_banner.png" width="600" alt="Icon"> </th>
   </tr>
</table>

This notebook demonstrates how to train a model with the XGBoost library to classify whether a person has heart disease. It also explains how to persist the trained model to the IBM Watson Machine Learning repository and deploy it as a REST service.

To train and test the heart disease prediction model, you will use an open source data set published in the University of California, Irvine (UCI) Machine Learning Repository.

This notebook uses the Python 3.6 runtime, XGBoost 0.82, and scikit-learn 0.20.



## Notebook Tips

If you are new to notebooks, read this section. Otherwise, skip to the next section.<br><br> 
A notebook consists of a series of cells. Each cell contains either code or documentation (called markdown).<br>
Cells prefixed with **In [ ]** contain code. Cells prefixed with **Out** show the output that appears after the code is executed.<br>
To execute a code cell, move the cursor into the cell and then either click the Run icon in the toolbar at the top<br>
or press Shift-Enter. Once the cell has been executed, a number appears in the prefix (for example, **In [1]**).

## Learning goals

The learning goals of this notebook are:

-  Load a CSV file into Pandas DataFrame.
-  Prepare data for training and evaluation.
-  Create, train and evaluate an XGBoost model.
-  Visualize the decision trees used by the model.
-  Visualize the importance of features that were used to train the model.
-  Use cross validation to select optimal model hyperparameters based on a parameter grid.
-  Persist a model in the Watson Machine Learning repository using the Python client library.
-  Deploy a model for online scoring using the Watson Machine Learning REST APIs.



## Table of contents
This notebook contains the following sections:

1.	[Setup](#setup)
2.	[Load and explore data](#load)
3.	[Create XGBoost model](#create)
4.	[Persist model](#persistence)
5.	[Deploy to the Cloud](#deploy)
6.	[Summary and next steps](#summary)

<a id="setup"></a>
## 1. Setup

Before you execute the sample code in this notebook, you must download the **Heart Disease Data Set** to the notebook's local filesystem.

### Download the Heart Disease Data Set to the notebook's local filesystem
The Heart Disease Data Set is a freely available data set on the UCI Machine Learning Repository portal. The **Heart Disease Data Set** is hosted [here](http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data).



To download the data from the UCI Machine Learning Repository, use the `wget` Python library. Install it with the following command: `!pip install wget`

In [None]:
!pip install wget 

Now, the code in the cell below downloads the data set and saves it in the local filesystem. The name of downloaded file containing the data will be displayed in the output of this cell.

In [None]:
import wget

link_to_data = 'http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data'

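# Remove any earlier downloads so that wget does not create a renamed duplicate.
# The -f flag keeps rm from reporting an error when no such file exists yet.
!rm -f processed.cleveland*.data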

ClevelandDataSet = wget.download(link_to_data)

print(ClevelandDataSet)

The comma-separated file, **processed.cleveland.data**, that contains the heart disease data set is now available on your local GPFS filesystem.

The downloaded data set contains the following attributes pertaining to heart disease.

### Data set Details:
1. age - age in years
2. sex - sex (1 = male; 0 = female)
3. cp - chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
4. trestbps - resting blood pressure (in mm Hg on admission to the hospital)
5. chol - serum cholesterol in mg/dl
6. fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg - resting electrocardiographic results (0 = normal; 1 = having ST-T; 2 = hypertrophy)
8. thalach - maximum heart rate achieved
9. exang - exercise induced angina (1 = yes; 0 = no)
10. oldpeak - ST depression induced by exercise relative to rest
11. slope - the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
12. ca - number of major vessels (0-3) colored by fluoroscopy
13. thal - 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num - number of major blood vessels > 50% blocked (angiographic disease status)  

<a id="load"></a>
## 2. Load and explore data

In this section you will load the data as a Pandas data frame and perform a basic exploration.


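Load the data in the comma-separated file, **processed.cleveland.data**, into a Pandas data frame by running the code below. The data set does not contain header information, so the column names are provided in the `col_names` variable. Missing values are marked with `?` in the file, and the `na_values` argument tells Pandas to treat them as nulls in the `ca` and `thal` columns. The first 5 rows are displayed by the `.head()` method.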

In [None]:
import pandas as pd

In [None]:
col_names = ['age','sex','cp','restbp','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']

heart_data_df = pd.read_csv(ClevelandDataSet, sep=',', header=None, names=col_names,
                            na_filter=True, na_values={'ca': '?', 'thal': '?'})
heart_data_df.head()
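
To see what `na_values` does, the following sketch parses three made-up rows in the same layout from an in-memory string instead of the downloaded file. The sample values and the names `sample` and `sample_df` are illustrative only and are not taken from the data set.

```python
import io
import pandas as pd

col_names = ['age','sex','cp','restbp','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']

# Three made-up rows in the processed.cleveland.data layout; '?' marks a missing value.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0\n"
    "52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0\n"
)

sample_df = pd.read_csv(sample, header=None, names=col_names,
                        na_values={'ca': '?', 'thal': '?'})

# The '?' entries are parsed as NaN, so both columns stay numeric.
print(sample_df[['ca', 'thal']])
print(sample_df[['ca', 'thal']].isnull().sum())
```

Without `na_values`, a column that contains `?` would be read as text, and numeric operations on it would fail later.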

Let us see how many attributes and samples we have in this data set by using the .shape attribute. 

In [None]:
(samples, attributes) = heart_data_df.shape
print("No. of Sample data =", samples )
print("No. of Attributes  =", attributes)

We have 303 rows of sample data with 14 columns of data per sample. 

We will now create a derived attribute that will serve as our target. The goal of the model is to predict whether a patient has a heart problem. The data set does not contain this information directly, but it can be derived from the `num` attribute. The `num` column records the number of major vessels with more than 50% narrowing (0, 1, 2, 3, or 4) for each sample.

Therefore, the target column `diagnosed` can be derived in the following way: 
- `diagnosed` is 0 when `num` = 0, indicating normal heart functioning.
- `diagnosed` is 1 when `num` > 0, indicating a heart problem.


In [None]:
heart_data_df['diagnosed'] = heart_data_df['num'].map(lambda d: 1 if d > 0 else 0)
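
The same rule can be checked on a small example. This sketch uses a made-up `num` column (the name `target_demo_df` is illustrative) and also shows a vectorized comparison that should produce the same result as the `map` call.

```python
import pandas as pd

# Made-up `num` values covering the full 0-4 range.
target_demo_df = pd.DataFrame({'num': [0, 1, 2, 3, 4, 0]})

# Same rule as in the cell above: 0 stays 0, anything greater than 0 becomes 1.
target_demo_df['diagnosed'] = target_demo_df['num'].map(lambda d: 1 if d > 0 else 0)

# Vectorized alternative: compare the whole column at once, then cast to int.
vectorized = (target_demo_df['num'] > 0).astype(int)

print(target_demo_df)
print(target_demo_df['diagnosed'].value_counts())
```

Calling `value_counts()` on the derived column is a quick way to check how balanced the two classes are before training.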

Use the Pandas `describe` method to get summary statistics for the data set.

In [None]:
heart_data_df.describe()

The count row of the `describe` output shows that the `ca` and `thal` fields have missing values. These are handled in section 3.1.1. Next, use IBM's open source pixiedust library to visualize the data. Pixiedust is a front end to a number of visualization libraries and provides an interactive interface for creating visualizations. Instead of writing code, you can select different charts using the chart icon (the second icon in the output below) and then click Options to select the variables to visualize. Experiment with pixiedust to get a better understanding of the data characteristics and relationships.

In [None]:
import pixiedust
display(heart_data_df)

<a id="create"></a>
## 3. Create an XGBoost model

In recent years, ensemble learning models have become popular among machine learning practitioners.

An ensemble learning model combines multiple machine learning models to overcome the potential weaknesses of any single model. For example, when you pick a destination for your next vacation, you probably ask your family and friends and read reviews and blog posts. You then make your final decision based on all the information you have gathered.

This phenomenon is referred to as the Wisdom of Crowds (WOC) in the social sciences. It states that averaging the answers (predictions or probabilities) of a group often gives a better result than the answer of any single member. The idea is that the collective knowledge of diverse and independent individuals exceeds the knowledge of any one of those individuals, which helps to eliminate noise.
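
The effect can be quantified for a simplified case. The sketch below assumes an odd number of fully independent binary voters that are each correct with the same probability, and computes the exact probability that the majority vote is correct. The function names are illustrative. Real ensemble members are rarely fully independent, and XGBoost builds its trees sequentially rather than by voting, so treat this as intuition only.

```python
from math import factorial

def n_choose_k(n, k):
    """Number of ways to choose k items out of n."""
    return factorial(n) // (factorial(k) * factorial(n - k))

def majority_vote_accuracy(n_voters, p_correct):
    """Probability that more than half of n independent voters are correct."""
    majority = n_voters // 2 + 1
    return sum(
        n_choose_k(n_voters, k) * p_correct**k * (1 - p_correct)**(n_voters - k)
        for k in range(majority, n_voters + 1)
    )

# Each voter is right 60% of the time; the group does better as it grows.
for n in (1, 5, 11, 51):
    print(n, "voters:", round(majority_vote_accuracy(n, 0.6), 3))
```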

XGBoost is an open source library for ensemble-based algorithms. It can be used for classification, regression, and ranking problems. XGBoost supports multiple languages, such as C++, Python, R, and Java. 

The XGBoost Python library supports the following API interfaces for training a model (also referred to as a `Booster`) and making predictions with it: 
- XGBoost's native APIs pertaining to the `xgboost` package, such as `xgboost.train()` or `xgboost.Booster`
- Scikit-Learn based Wrapper APIs: `xgboost.sklearn.XGBClassifier` and `xgboost.sklearn.XGBRegressor`

In this section you will learn how to train and test an XGBoost model using the scikit-learn based Wrapper APIs.  

First, you must import the required libraries.

In [None]:
import xgboost
from xgboost.sklearn import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split 

from xgboost import plot_importance
from matplotlib import pyplot
import pprint
%matplotlib inline

### 3.1: Prepare data

In this section, you clean the data in the Pandas data frame and transform it into a form that can be used as input for training the model. 


#### 3.1.1: Cleanse the data
First, check whether the data set contains any null values, and then remove the corresponding rows.

In [None]:
print("List of features with their corresponding count of null values : ")
print("---------------------------------------------------------------- ")
print(heart_data_df.isnull().sum())

The output of the cell above shows 6 null values in total. Rows that contain these null values can be removed so that the data set has no incomplete records. The cell below removes them.

In [None]:
heart_data_df = heart_data_df.dropna(how='any',axis=0)
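
The following sketch shows the effect of `dropna(how='any', axis=0)` on a tiny made-up frame. The names `missing_demo_df` and `cleaned_demo_df` and all values are illustrative.

```python
import numpy as np
import pandas as pd

missing_demo_df = pd.DataFrame({
    'ca':   [0.0, np.nan, 2.0, 1.0],
    'thal': [3.0, 7.0, np.nan, 6.0],
    'num':  [0, 2, 1, 0],
})

# how='any' drops a row if at least one of its columns is NaN;
# axis=0 means that rows (not columns) are dropped.
cleaned_demo_df = missing_demo_df.dropna(how='any', axis=0)

print(len(missing_demo_df), "rows before,", len(cleaned_demo_df), "rows after")
print(cleaned_demo_df.index.tolist())  # the surviving rows keep their original labels
```

Because the original row labels are kept, the index of the cleaned frame has gaps. That is harmless here, since the later cells select data by column rather than by row position.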

#### 3.1.2: Prepare the target data and feature columns

The next step is to select the attributes in the current data set that can be used for training the model. Here, all attributes other than `num` and the derived target `diagnosed` are chosen as features.

In [None]:
feature_cols = ['age','sex','cp','restbp','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal']
features_df = heart_data_df[feature_cols]

#### 3.1.3: Split the data set for training and testing

Now that the target and feature columns have been defined, you can split the data set into two sets: one for training the model and one for testing the trained model. 

In [None]:
heart_train, heart_test, target_train, target_test = train_test_split(features_df, heart_data_df.loc[:,'diagnosed'], test_size=0.33, random_state=0)
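
As a rough check of the resulting set sizes: with a fractional `test_size`, scikit-learn rounds the number of test rows up and assigns the remaining rows to training. The sketch below reproduces that arithmetic with the standard library. It assumes that 297 rows remain after cleansing, which holds only if each of the 6 null values sits in a different row.

```python
from math import ceil

n_samples = 297   # assumption: 303 rows minus 6 rows that each held one null value
test_size = 0.33  # same fraction as in the cell above

n_test = ceil(test_size * n_samples)  # test rows, rounded up
n_train = n_samples - n_test          # everything else is used for training

print("train rows:", n_train, "test rows:", n_test)
```

You can compare these numbers with `heart_train.shape` and `heart_test.shape` in the notebook. Setting `random_state=0` makes the shuffle reproducible, so repeated runs produce the same split.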


### 3.2 Create the XGBoost Model

In the cell below, we create our pipeline which contains the XGBoost classifier:

In [None]:
pipeline = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
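
The `scaler` step standardizes each feature column to zero mean and unit variance before the classifier sees it. Here is a minimal sketch of that computation with the standard library, using made-up resting blood pressure values. It uses the population standard deviation, which is what `StandardScaler` uses.

```python
from statistics import mean, pstdev

restbp_values = [120.0, 130.0, 140.0, 150.0, 160.0]  # made-up values

mu = mean(restbp_values)
sigma = pstdev(restbp_values)  # population standard deviation (divides by n)

# z-score: subtract the column mean, then divide by the column standard deviation.
scaled = [(v - mu) / sigma for v in restbp_values]
print([round(v, 3) for v in scaled])
```

Inside the pipeline, the mean and standard deviation are learned from the training data during `fit` and then reused unchanged when `predict` is called on test data.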

We can see the default parameters of the stages in our pipeline:

In [None]:
pipeline

After we have set up our pipeline with the XGBoost classifier, we can train it by invoking the fit method.

In [None]:
pipeline.fit(heart_train,target_train)

We can now make predictions on test data and evaluate the model.

In [None]:
y_pred = pipeline.predict(heart_test.values)

In [None]:
accuracy = accuracy_score(target_test, y_pred)

print("Accuracy: %.2f%%" % (accuracy * 100.0))
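
Accuracy is the fraction of predictions that match the true labels. The sketch below computes it by hand for made-up labels. The helper `accuracy_by_hand` is illustrative and is not part of scikit-learn.

```python
def accuracy_by_hand(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

y_true_demo = [0, 1, 1, 0, 1, 0, 0, 1]  # made-up true labels
y_pred_demo = [0, 1, 0, 0, 1, 1, 0, 1]  # made-up predictions, two of them wrong

print("Accuracy: %.2f%%" % (accuracy_by_hand(y_true_demo, y_pred_demo) * 100.0))
```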

We plot the feature importance based on the fitted trees, which shows the features that were most useful in constructing the boosted trees.

In [None]:
xgboost.plot_importance(pipeline.steps[1][1])

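The plot labels each feature by its column position because the scaler passes a plain array, without column names, to the classifier. For example: f0 = age, f3 = restbp, f9 = oldpeak. <br>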
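
To translate any label in the plot back to a column name, you can build a lookup from the feature list. This is a small sketch, and `label_to_name` is an illustrative name.

```python
feature_cols = ['age','sex','cp','restbp','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal']

# Unnamed input columns are labeled f0, f1, ... in input order.
label_to_name = {"f%d" % i: name for i, name in enumerate(feature_cols)}

for label in ("f0", "f3", "f9"):
    print(label, "=", label_to_name[label])
```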

We can tune our model now to achieve better accuracy by using grid search and cross validation.

XGBoost has an extensive catalog of hyperparameters which provides great flexibility to shape the algorithm’s desired behavior. Let’s have a look at the most important ones:
- learning_rate (default=0.1): Boosting learning rate (xgb’s “eta”)
- n_estimators (default=100): Number of boosted trees to fit
- max_depth (default=3): Maximum tree depth for base learners
- objective (default='binary:logistic'): Specify the learning task and the corresponding learning objective or a custom objective function to be used

In the cell below, we create our XGBoost pipeline and set up the parameter space.

In [None]:
pipeline_gs = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
parameters = {'classifier__learning_rate': [0.01, 0.03], 'classifier__n_estimators': [50, 200]}
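
The `classifier__` prefix (step name plus a double underscore) routes each parameter to the `classifier` step of the pipeline. The grid search tries every combination of the listed values. The sketch below enumerates those combinations with the standard library to show how much work the search involves. The fold count of 3 is an assumption based on the scikit-learn 0.20 default; newer releases default to 5.

```python
from itertools import product

parameters = {'classifier__learning_rate': [0.01, 0.03],
              'classifier__n_estimators': [50, 200]}

# Build one dictionary per combination of parameter values.
names = sorted(parameters)
combinations = [dict(zip(names, values))
                for values in product(*(parameters[n] for n in names))]

for combo in combinations:
    print(combo)

cv_folds = 3  # assumption: default number of folds in scikit-learn 0.20
print("model fits during the search:", len(combinations) * cv_folds)
```

Each extra value in the grid multiplies the number of fits, so the grid is best kept small on a first pass.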

`GridSearchCV` searches for the best combination of the specified parameter values. To see which hyperparameters are available for the search, call `get_params().keys()` on the estimator, for example `pipeline_gs.get_params().keys()`.

In [None]:
clf = GridSearchCV(pipeline_gs, parameters, return_train_score=True)

In [None]:
clf.fit(heart_train.values, target_train.values)

In [None]:
print("Best score: %s" % (clf.best_score_))
print("Best parameter set: %s" % (clf.best_params_))

We can now check the accuracy of the best parameter combination on the test set.

In [None]:
y_pred = clf.predict(heart_test.values)

accuracy_opt = accuracy_score(target_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy_opt * 100.0))

In [None]:
print("Improvement: %.2f%%" % ((accuracy_opt-accuracy)/accuracy*100))
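
The improvement printed above is relative to the baseline accuracy, not a difference in percentage points. The sketch below shows both measures for made-up accuracies. The helper name `relative_improvement` is illustrative.

```python
def relative_improvement(baseline, tuned):
    """Relative change of `tuned` over `baseline`, in percent."""
    return (tuned - baseline) / baseline * 100

baseline_acc, tuned_acc = 0.80, 0.84  # made-up accuracies

print("Relative improvement: %.2f%%" % relative_improvement(baseline_acc, tuned_acc))
print("Absolute improvement: %.2f percentage points" % ((tuned_acc - baseline_acc) * 100))
```

A negative value means that the tuned model scored lower on the test set than the default model, which can happen with a small test set or a narrow grid.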

<a id="persistence"></a>
## 4. Persist the model

In this section, you store the XGBoost model in the Watson Machine Learning repository by using the Watson Machine Learning Python client library.

In [None]:
# Install the WML client API

!pip install watson-machine-learning-client-V4

In [None]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

### Authenticate to the Watson Machine Learning service on IBM Cloud



### Action: PASTE CREDENTIALS FROM YOUR INSTANCE OF WATSON MACHINE LEARNING INTO THE FOLLOWING CELL. 



In [None]:
# Instantiate a client using credentials
wml_credentials = {
  "apikey": "",
  "instance_id": "",
  "url": "https://us-south.ml.cloud.ibm.com"
}

client = WatsonMachineLearningAPIClient(wml_credentials)
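
If you prefer not to paste secrets into the notebook, one option is to read them from environment variables. This is a minimal sketch: the helper `load_wml_credentials` and the variable names `WML_APIKEY`, `WML_INSTANCE_ID`, and `WML_URL` are hypothetical and are not defined by the Watson Machine Learning client.

```python
import os

def load_wml_credentials(env=os.environ):
    """Build the credentials dictionary from environment variables.

    WML_APIKEY, WML_INSTANCE_ID and WML_URL are hypothetical variable names.
    """
    missing = [name for name in ("WML_APIKEY", "WML_INSTANCE_ID") if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "apikey": env["WML_APIKEY"],
        "instance_id": env["WML_INSTANCE_ID"],
        "url": env.get("WML_URL", "https://us-south.ml.cloud.ibm.com"),
    }

# Demo with a stand-in mapping; in the notebook, call load_wml_credentials() with no arguments.
demo_credentials = load_wml_credentials({"WML_APIKEY": "my-key", "WML_INSTANCE_ID": "my-instance"})
print(sorted(demo_credentials))
```

The returned dictionary has the same keys as `wml_credentials` above, so it can be passed to `WatsonMachineLearningAPIClient` in the same way.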



### 4.1: Save the model in the Machine Learning Repository

In this subsection you will learn how to save a model artifact to your Watson Machine Learning instance by using the Watson Machine Learning repository Python client package.



In [None]:
# All available runtimes

client.runtimes.list(pre_defined=True)

In [None]:
heart_metadata = {
    client.repository.ModelMetaNames.NAME: "Heart Disease",
    client.repository.ModelMetaNames.DESCRIPTION: "Model for Heart Disease",
    client.repository.ModelMetaNames.TYPE: "scikit-learn_0.20",
    client.repository.ModelMetaNames.RUNTIME_UID: "scikit-learn_0.20-py3.6"    
}

model_details = client.repository.store_model(model=clf, meta_props=heart_metadata)

model_uid = client.repository.get_model_uid(model_details)

print( model_uid )

### 4.2: Load the saved model

In [None]:
model=client.repository.load(model_uid)

In [None]:
y_lpredict = model.predict(heart_test)
print(y_lpredict)

<a id="deploy"></a>
## 5. Deploy to the Cloud

In [None]:
meta_props = {
    client.deployments.ConfigurationMetaNames.NAME: "Heart Disease Deployment",
    client.deployments.ConfigurationMetaNames.DESCRIPTION: "Heart Disease Deployment",
    client.deployments.ConfigurationMetaNames.ONLINE: {}
}

deployment_details = client.deployments.create(model_uid, meta_props=meta_props)

deployment_uid = client.deployments.get_uid(deployment_details)

print( deployment_uid )
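
Once the deployment exists, it can be scored with new patient records. The sketch below only builds a request payload and does not call the service. The `input_data` / `fields` / `values` layout and the helper `build_scoring_payload` are assumptions, so check the client documentation for the version you installed before sending the payload to the deployment.

```python
import json

feature_cols = ['age','sex','cp','restbp','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal']

def build_scoring_payload(rows, fields=feature_cols):
    """Wrap feature rows in a fields/values structure (assumed payload shape)."""
    for row in rows:
        if len(row) != len(fields):
            raise ValueError("each row needs %d values" % len(fields))
    return {"input_data": [{"fields": list(fields),
                            "values": [list(row) for row in rows]}]}

# One made-up patient record, with values in feature_cols order.
payload = build_scoring_payload(
    [[57.0, 1.0, 4.0, 140.0, 192.0, 0.0, 0.0, 148.0, 0.0, 0.4, 2.0, 0.0, 6.0]]
)
print(json.dumps(payload, indent=2))
```

Validating the row length up front catches a missing or extra feature before the request reaches the service.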


<a id="summary"></a>
## 6. Summary and next steps     

You successfully completed this notebook! You learned how to use XGBoost to create a model and Watson Machine Learning to persist and deploy it. Check out our [Online Documentation](https://console.ng.bluemix.net/docs/services/PredictiveModeling/pm_service_api_spark.html) for more samples, tutorials, documentation, how-tos, and blog posts. 

### Original Author

**Krishnamurthy Arthanarisamy** is a senior technical lead in the IBM Watson Machine Learning team. Krishna works on developing cloud services that cater to different stages of the machine learning and deep learning modeling life cycle. 

### Additional Authors (modified the notebook for use in the Proof of Technology)
**Bernie Beekman** is an Executive I/T Architect in IBM's Federal Analytical Solution Center. <br>
**Joel Patterson** is on the Federal Big Data Engineering team. 

Copyright © 2017, 2018 IBM. This notebook and its source code are released under the terms of the MIT License.