# Heart Attack Dataset
This dataset was downloaded from https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset.<br>
This notebook is based on the tutorials at https://microsoftlearning.github.io/mslearn-dp100/.
## Connect to a workspace

In [1]:
from azureml.core import Workspace, Dataset
ws = Workspace.from_config()
print(ws.name, "loaded")

wsp-ai-915-002 loaded


Check available compute resources.

In [2]:
print("Compute Resources:")
for compute_name in ws.compute_targets:
    compute = ws.compute_targets[compute_name]
    print("\t", compute.name, ':', compute.type)

Compute Resources:
	 CLAI915002 : AmlCompute
	 CI20211026 : ComputeInstance


## Specify a compute cluster

Use an existing cluster, or create a new one if none exists.

In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "CLAI915002"

try:
    # Get existing compute target
    training_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found cluster!")
except ComputeTargetException:
    # Create one if it does not exist
    try:
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2', max_nodes=2)
        training_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
        training_cluster.wait_for_completion(show_output=True)
    except Exception as ex:
        print(ex)


Found cluster!


## Load and register dataset
**Data Description**<br>

*age*: Age of the person<br>
*sex*: Gender of the person<br>
*cp*: chest pain type<br>
*trtbps*: resting blood pressure (mm Hg)<br>
*chol*: cholesterol (mg/dl)<br>
*fbs*: fasting blood sugar > 120 mg/dl<br>
*restecg*: resting electrocardiographic results<br>
*thalachh*: maximum heart rate achieved<br>
*exng*: exercise induced angina (1 = yes, 0 = no)<br>
*oldpeak*: ST depression induced by exercise relative to rest<br>
*slp*: slope of the peak exercise ST segment<br>
*caa*: number of major vessels (0-3)<br>
*thall*: Thal rate <br>
*output*: had heart attack (target)



In [4]:
# Load default datastore
default_ds = ws.get_default_datastore()

# Upload datasets to the datastore
default_ds.upload_files(['./data/heart.csv'],
                        target_path='heart-data/',
                        overwrite=True,
                        show_progress=True)

Uploading an estimated of 1 files
Uploading ./data/heart.csv
Uploaded ./data/heart.csv, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_95b4490356e241279422f512e20d581c

In [5]:
# Create tabular dataset with heart data
heart_tab = Dataset.Tabular.from_delimited_files(path=(default_ds, 'heart-data/heart.csv'))
heart_tab.to_pandas_dataframe()


Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [6]:
# Register heart dataset
heart_tab = heart_tab.register(workspace=ws,
                            name='heart',
                            description='heart attack data',
                            tags={'format':'CSV'},
                            create_new_version=True)


Check existing datasets and versions.

In [7]:
print("Datasets:")
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print("\t", dataset.name, 'version', dataset.version)

Datasets:
	 o2 version 1
	 heart version 2


## Check data

Make sure there are no missing values, and take a look at the descriptive statistics for the dataset.

In [8]:
# Check for Null values
heart_tab.to_pandas_dataframe().isnull().sum()

age         0
sex         0
cp          0
trtbps      0
chol        0
fbs         0
restecg     0
thalachh    0
exng        0
oldpeak     0
slp         0
caa         0
thall       0
output      0
dtype: int64

In [9]:
# Look inside
heart_tab.to_pandas_dataframe().describe()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0
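
The summary above can also be compared with the ranges stated in the data description. The following is a minimal sketch in plain Python, with the minimum and maximum values copied by hand from the `describe()` output. The `documented` and `observed` dictionaries and the `out_of_range` helper are illustrative names, not part of the original notebook.

```python
# Documented ranges, taken from the data description in this notebook.
documented = {
    "exng": (0, 1),   # exercise induced angina (1 = yes, 0 = no)
    "caa": (0, 3),    # number of major vessels (0-3)
}

# Observed (min, max) values, copied from the describe() output above.
observed = {
    "exng": (0.0, 1.0),
    "caa": (0.0, 4.0),
}

def out_of_range(documented, observed):
    """Return the columns whose observed min or max lies outside the documented range."""
    flagged = []
    for column, (low, high) in documented.items():
        obs_min, obs_max = observed[column]
        if obs_min < low or obs_max > high:
            flagged.append(column)
    return flagged

flagged_columns = out_of_range(documented, observed)
print(flagged_columns)
```

With these inputs `caa` is flagged, because its observed maximum of 4 exceeds the documented 0-3 range. The source does not say what that value means, so it may deserve a closer look before training.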


## Train a model from script

Create an experiment folder.

In [10]:
import os

# Create a folder for the experiment files
experiment_folder = 'heart_training_hyperdrive'
os.makedirs(experiment_folder, exist_ok=True)

Create an environment file.

In [11]:
%%writefile $experiment_folder/hyperdrive_env.yml
name: hyperdrive_env
dependencies:
- python=3.6.2
- scikit-learn
- ipykernel
- matplotlib
- pandas
- numpy
- pip
- pip:
  - azureml-defaults

Overwriting heart_training_hyperdrive/hyperdrive_env.yml


Create the experiment script, which uses a gradient boosting classifier. The hyperparameters to be tuned are the learning rate and the number of estimators.

In [12]:
%%writefile $experiment_folder/heart_training.py
# Import libraries
import os
import argparse
from azureml.core import Run, Dataset
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Get script argument input dataset and hyperparameters
parser = argparse.ArgumentParser()
parser.add_argument("--ds", type=str, dest='ds_id')
parser.add_argument("--learning_rate", type=float, dest='learning_rate', default=0.1)
parser.add_argument("--n_estimators", type=int, dest='n_estimators', default=100)
args = parser.parse_args()

# Get experiment run context
run = Run.get_context()

# Log hyperparameter values
run.log('learning_rate', float(args.learning_rate))
run.log('n_estimators', float(args.n_estimators))

# Get training dataset
print("Loading Data...")
ws = run.experiment.workspace
heart = run.input_datasets['heart_dataset'].to_pandas_dataframe()

# Separate features and labels
y = heart['output'].values
X = heart.drop(['output'], axis=1).values

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Train a gradient boosting model with the supplied hyperparameters
print('Training a Gradient Boosting Classifier model with the supplied hyperparameters.')
model = GradientBoostingClassifier(learning_rate=args.learning_rate,
                                    n_estimators=args.n_estimators).fit(X_train, y_train)

# Calculate accuracy
y_hat = model.predict(X_test)
model_accuracy = np.average(y_hat == y_test)
print('Accuracy: ', model_accuracy)
run.log('Accuracy', float(model_accuracy))

os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/heart_model_hyperdrive.pkl')

run.complete()

Overwriting heart_training_hyperdrive/heart_training.py
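
To see how the script receives its hyperparameters, the sketch below re-creates the same parser and feeds it the argument list reported for the best run at the end of this notebook. It uses only `argparse`. Passing the list explicitly to `parse_args` is done here for illustration; the training script itself reads the command line.

```python
import argparse

# Same argument definitions as in heart_training.py
parser = argparse.ArgumentParser()
parser.add_argument("--ds", type=str, dest="ds_id")
parser.add_argument("--learning_rate", type=float, dest="learning_rate", default=0.1)
parser.add_argument("--n_estimators", type=int, dest="n_estimators", default=100)

# Argument list as reported for the best run in this notebook
argv = ["--ds", "DatasetConsumptionConfig:heart_dataset",
        "--learning_rate", "0.1450678457816757",
        "--n_estimators", "60"]
args = parser.parse_args(argv)
print(args.learning_rate, args.n_estimators)

# Omitted hyperparameters fall back to the defaults declared above
defaults = parser.parse_args(["--ds", "DatasetConsumptionConfig:heart_dataset"])
print(defaults.learning_rate, defaults.n_estimators)
```

The string values are converted by the `type=` callables, so the model receives a `float` learning rate and an `int` number of estimators.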


Running the experiment script. The hyperparameter tuning configuration includes the following:
- Two hyperparameters to be tuned: learning rate and number of estimators;
- Random sampling is used to select hyperparameter values;
- An early termination policy based on running averages of the primary metric;
- The goal is to maximize accuracy.
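
The search space described above can be pictured with a few lines of standard-library Python. This is only a sketch of what random sampling over `uniform(0.10, 0.15)` and `choice(60, 70, 80)` produces, not HyperDrive's own implementation. The `sample_configuration` helper and the fixed seed are assumptions made for illustration.

```python
import random

def sample_configuration(rng):
    """Draw one hyperparameter combination from the search space."""
    return {
        "learning_rate": rng.uniform(0.10, 0.15),   # continuous range
        "n_estimators": rng.choice([60, 70, 80]),   # discrete options
    }

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_configuration(rng) for _ in range(20)]  # one per planned run
for sample in samples[:3]:
    print(sample)
```

Each run gets an independent draw, so the same `n_estimators` value can appear many times while the learning rate is almost never repeated.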

In [None]:
from azureml.core import Experiment, ScriptRunConfig, Environment
from azureml.widgets import RunDetails
from azureml.train.hyperdrive import RandomParameterSampling, HyperDriveConfig, PrimaryMetricGoal, uniform, choice, MedianStoppingPolicy

# Create python environment for the experiment (from a .yml file)
env = Environment.from_conda_specification("hyperdrive_env", experiment_folder + "/hyperdrive_env.yml")

# Get training dataset
heart_ds = ws.datasets.get("heart")

# Get a script config
script_config = ScriptRunConfig(source_directory=experiment_folder,
                                script='heart_training.py',
                                arguments=['--ds', heart_ds.as_named_input('heart_dataset')],
                                environment=env,
                                compute_target=cluster_name)

params = RandomParameterSampling({
                                    "learning_rate": uniform(0.10, 0.15),
                                    "n_estimators": choice(60, 70, 80)
                                   })

early_termination_policy = MedianStoppingPolicy(evaluation_interval=1, delay_evaluation=5)

# Configure hyperdrive settings
hyperdrive = HyperDriveConfig(run_config=script_config,
                            hyperparameter_sampling=params,
                            policy=early_termination_policy,
                            primary_metric_name='Accuracy',
                            primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                            max_total_runs=20,
                            max_concurrent_runs=2
                            )

# Submit the experiment
experiment_name = 'train-heart-hyperdrive'
experiment = Experiment(workspace=ws, name=experiment_name)
run = experiment.submit(config=hyperdrive)
RunDetails(run).show()
run.wait_for_completion()
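
The median stopping rule used above can be sketched in plain Python. This is a simplified illustration rather than the Azure ML implementation: a run is cancelled at a given interval if the best metric it has reported so far is worse than the median of the running averages of all runs. The `should_stop` helper and the accuracy histories are hypothetical, and `delay_evaluation` is not modeled.

```python
from statistics import median

def should_stop(run_id, histories, interval):
    """Return True if run_id would be cancelled at the given interval (metric is maximized)."""
    # Best value this run has reported up to the interval
    best_so_far = max(histories[run_id][:interval])
    # Running average of each run over the same intervals
    running_averages = [
        sum(values[:interval]) / len(values[:interval])
        for values in histories.values()
    ]
    return best_so_far < median(running_averages)

# Hypothetical accuracy values reported at intervals 1..3 (not from this experiment)
histories = {
    "run_a": [0.70, 0.74, 0.78],
    "run_b": [0.60, 0.62, 0.63],
    "run_c": [0.72, 0.75, 0.80],
}

stopped = [run_id for run_id in histories if should_stop(run_id, histories, interval=3)]
print(stopped)
```

Note that `heart_training.py` logs `Accuracy` only once, at the end of training, so in this notebook the policy likely has little opportunity to stop a run early.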

List the metrics of each child run, then retrieve the best-performing run.

In [27]:
for child in run.get_children():
    print(child.get_metrics())

{'learning_rate': 0.14647713520351516, 'n_estimators': 60.0, 'Accuracy': 0.8157894736842105}
{'learning_rate': 0.11618236886733278, 'n_estimators': 60.0, 'Accuracy': 0.8289473684210527}
{'learning_rate': 0.10759394589582, 'n_estimators': 70.0, 'Accuracy': 0.8026315789473685}
{'learning_rate': 0.12361103871615381, 'n_estimators': 60.0, 'Accuracy': 0.8026315789473685}
{'learning_rate': 0.12055256990308783, 'n_estimators': 70.0, 'Accuracy': 0.8157894736842105}
{'learning_rate': 0.12445324622760633, 'n_estimators': 80.0, 'Accuracy': 0.8157894736842105}
{'learning_rate': 0.14707108694134272, 'n_estimators': 60.0, 'Accuracy': 0.8157894736842105}
{'learning_rate': 0.13072844007802958, 'n_estimators': 80.0, 'Accuracy': 0.8157894736842105}
{'learning_rate': 0.14245353096359964, 'n_estimators': 70.0, 'Accuracy': 0.7894736842105263}
{'learning_rate': 0.12785117783486208, 'n_estimators': 80.0, 'Accuracy': 0.8026315789473685}
{'learning_rate': 0.1450678457816757, 'n_estimators': 60.0, 'Accuracy': 0

In [29]:
# Get the best run
best_run = run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
script_arguments = best_run.get_details()['runDefinition']['arguments']

print('  -Accuracy:', best_run_metrics['Accuracy'])
print('  -Arguments:', script_arguments)

  -Accuracy: 0.8421052631578947
  -Arguments: ['--ds', 'DatasetConsumptionConfig:heart_dataset', '--learning_rate', '0.1450678457816757', '--n_estimators', '60']
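
When comparing these results, it helps to keep the size of the test set in mind. The sketch below assumes the 303 rows reported by `describe()` and that scikit-learn rounds a fractional `test_size` up to the next whole sample. Under those assumptions the script evaluates on 76 rows, so every accuracy value corresponds to a whole number of correct predictions.

```python
import math

n_rows = 303        # row count from the describe() output
test_size = 0.25    # value used in heart_training.py

n_test = math.ceil(test_size * n_rows)   # assumed rounding rule
n_train = n_rows - n_test

best_accuracy = 0.8421052631578947       # best run reported above
typical_accuracy = 0.8157894736842105    # most common value among the listed child runs

# Convert accuracies back to counts of correctly classified test rows
best_correct = round(best_accuracy * n_test)
typical_correct = round(typical_accuracy * n_test)
print(n_train, n_test)
print(best_correct, typical_correct, best_correct - typical_correct)
```

Under these assumptions the best run classified 64 of 76 test rows correctly, against 62 for the most common result. The spread between runs therefore amounts to a handful of test rows and should be read with caution.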
