# Exercise: Heart Disease / PyTorch

Version: 4.0, Summer Semester 2025

Follow the steps to load the heart disease database and to train & score a Neural Network classifier with it.
The database is a slightly modified version from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/heart+Disease).

## Your Name

Enter your name in the block below:


In [None]:
# @title Enter your name here {"run":"auto","vertical-output":true}
student_name = '' # @param {type:"string"}
import uuid
import hashlib
import os
from datetime import datetime

notebook_version = 4.0

def getData():
    mid = hashlib.sha256(str(uuid.getnode()).encode()).hexdigest()[:10]
    execution_time = datetime.now().isoformat()
    return mid, execution_time

mid, execution_time = getData()

# Store metadata in the notebook itself
print(f"Your name: {student_name} ({notebook_version} - {mid} - {execution_time})")


The blocks with assert perform automated tests so that you know if your solution is correct, without giving away how to code it. Simply execute these blocks. If you don't get any output from them, you know that your code is correct.

In [None]:
assert student_name != '', "Please enter your name in the block above"

## 1: Read about the Dataset

Check out the dataset description of [UCI](https://archive.ics.uci.edu/ml/datasets/heart+Disease). Our dataset is the Cleveland set and uses the smaller variant with 14 attributes. Take a look at column 3 ("chest pain" / "cp"). The values are either 1, 2, 3 or 4 in the dataset.

Change the code below so that the Python dictionary contains the names of the four different types of chest pain that the dataset distinguishes.

*Hint: you'll find the answer in the "Attribute information" section of the linked web page.*

In [None]:
# Replace xxx with the textual names of the different types of chest pain
pain_types = {1: "xxx",
             2: "xxx",
             3: "xxx",
             4: "xxx"}
# When you have finished replacing the values, delete the exception in the line
# below and run the next block with the asserts to see if you got it right.
raise NotImplementedError


In [None]:
import hashlib
assert hashlib.blake2b(pain_types[1].casefold().encode()).hexdigest() == '10de5546a056feb6522445a2e94492fa20419030d70694bd97fc67016906936dd8a8791e286becb0bb5f8087986bd2621f2f06d639523f653f955a2d45e78d6f'
assert hashlib.blake2b(pain_types[2].casefold().encode()).hexdigest() == '884a2e513fec27229c985dc4b01d4935d42e90e6c71d1f5d519b38e7369ae23be525d803dadf9e102ddb91beb6feef4a4d71d8a9636c1a4106f87c1aee1c2640'
assert hashlib.blake2b(pain_types[3].casefold().encode()).hexdigest() == 'cb8436129905c921a1930a01e2ae2ebe74d5562826802d6ea27f183c6703a3a58d9e8cc94f5eb53e48928abcefdd88e8ac50e153c99f0870005e327fcd4d1c89'
assert hashlib.blake2b(pain_types[4].casefold().encode()).hexdigest() == '56f3734df8320938c6c2e66fec6ea97f04e6f5f7e5cc27c57da98fc99ba277442240d7c0a9d4aa12816199274de33843c591d6d61037cd5a12f347f50850bd5b'

## 2: Imports

Use the following block for all the imports you need for your notebook.
This will include `pandas` (as `pd`), `numpy` (as `np`), `seaborn` (as `sns`) and all the parts of `sklearn` you will need. Also include all necessary imports for PyTorch (PyTorch itself, `nn`, `optim`, `TensorDataset` and `DataLoader`).
Configure `matplotlib` to draw graphics inline to ensure the figures are visible inside the notebook. If during this notebook you discover that you need additional imports, just come back to this cell, add the import here and execute the cell again. It's often easier to have all imports in one place.

In the rest of the notebook, simply replace `raise NotImplementedError` with your own code and then execute the cell.

In [None]:
# Place all necessary imports here
raise NotImplementedError


In [None]:
# Predefined imports - just execute this line
# This is needed for some of the automated tests to help you check if your
# code is correct
import unittest
test_case = unittest.TestCase()

## 3: Load the CSV

Load the `heart_disease-fhstp-nomissing.csv` file into a Pandas dataframe variable called `df_heart` and print its head to check if importing worked. The URL for the file is: `https://raw.githubusercontent.com/andijakl/MachineLearning/refs/heads/main/lab%20-%20pytorch%20-%20heart%20disease/heart_disease-fhstp-nomissing.csv`

You can read CSV files with `read_csv` from the pandas library you imported as `pd`. The dataset we are using has been modified from the original to remove lines where data was missing. Therefore, it has 297 lines instead of the original 303.
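To illustrate how `read_csv` treats placeholder characters, here is a minimal sketch on a made-up three-row CSV. The column names, the values and the variable `df_demo` are assumptions for illustration only and are not part of the lab dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content; not taken from the heart disease file.
csv_text = "age,chol\n63,233\n67,?\n41,204\n"

# na_values="?" tells pandas to parse the string "?" as a missing value (NaN).
df_demo = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(df_demo.head())
print(df_demo.isna().sum())  # number of missing values per column
```

Without `na_values`, the `chol` column would be read as text, because `"?"` cannot be converted to a number.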

In [None]:
file_url = 'https://raw.githubusercontent.com/andijakl/MachineLearning/refs/heads/main/lab%20-%20pytorch%20-%20heart%20disease/heart_disease-fhstp-nomissing.csv'
# Load the dataset into a variable called df_heart
# Specify that ? should be recognized as na value
# Print the head of the dataframe to see if importing worked correctly.
raise NotImplementedError


In [None]:
# Tests if data has been loaded
assert len(df_heart) == 297, "Imported data should have 297 samples"

## 4: Describe

Use the `describe` function of the Pandas DataFrame to find out the count, mean, standard deviation & more from the dataset for each column.

In [None]:
# Call method to describe the dataset
raise NotImplementedError


Based on the printed information you see about the dataset, answer a few questions and assign the numbers to the corresponding variables. Round your answers (up/down) to the nearest integer.

In [None]:
# What is the mean age?
mean_age = 0
# What is the maximum heart rate?
max_hr = 0
# What is the standard deviation of the cholesterol level?
std_cholesterol = 0
# Simply assign the values to the variables above and when finished,
# remove the line throwing the not implemented error.
raise NotImplementedError


In [None]:
assert mean_age > 0
assert max_hr > 0
assert std_cholesterol > 0

## 5: Age Histogram

Print the histogram of the "age" column. Use `10` bins. This gives you a good understanding of the patients that were tested for heart diseases.

In [None]:
# Call method to print the age histogram with 10 bins
raise NotImplementedError


According to the age bins above (middle number), which age is most frequent in the dataset? Look at the histogram and assign what you think should be approximately in the middle of the highest bin to the following variable:

In [None]:
# Change the variable to the approximate middle of the highest bin in the histogram above.
# This answers: what age are most people in the dataset? Assign that age to most_age.
most_age = -1
# Again, assign your answer to the variable above and then delete the exception
raise NotImplementedError


In [None]:
assert most_age != -1

## 6: Violin Plot

Use the Violin Plot of Seaborn (add the import statement at the beginning of the file!) to print the target class (`"diameter narrowing"` -> *diagnosis of heart disease (angiographic disease status), Value 0: < 50% diameter narrowing*) on the `x` axis and `"max HR"` on the `y` axis.

In [None]:
# Plot a violin plot with the parameters stated above
raise NotImplementedError


Based on the violin plot, answer: is the max HR higher if the target class is 0 or 1?

In [None]:
# Change the variable max_hr_higher to
# 0 if the maximum heart rate is higher if there is no heart disease;
# 1 if the maximum heart rate is higher if there is heart disease
max_hr_higher = -1
# Assign the value to the variable above and delete the exception
raise NotImplementedError


In [None]:
assert max_hr_higher != -1

## 7: Prepare Data for Classification

Our CSV contains all the data plus the target class values. Remove (`pop`) the target class into an extra variable (`y`).
Then split the dataframe into the 4 variables for training & test data. Use the random state `10`. You don't need to specify the size of the training and testing set; according to the [train_test_split documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), the default size of the test data will be 0.25, the training data 0.75.
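As a rough illustration of the default split sizes, the following sketch calls `train_test_split` on a made-up array of 20 samples. All variable names and the random state `0` are assumptions chosen for the demo; the lab itself uses different data and random state `10`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples with 2 features each.
X_split_demo = np.arange(40).reshape(20, 2)
y_split_demo = np.arange(20) % 2

# No test_size given, so the default of 0.25 applies.
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_split_demo, y_split_demo, random_state=0
)

print(Xtr_demo.shape, Xte_demo.shape)  # (15, 2) (5, 2)
```

The number of test samples is rounded up, which is why 297 rows result in 75 test and 222 training samples.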

In [None]:
# Pop the target class from df_heart, convert it to a numpy array using to_numpy()
# And store the resulting target array in a variable y
raise NotImplementedError


In [None]:
assert y.shape == (297,)
assert y[1] == 1
assert df_heart.shape == (297,13)

In [None]:
# Split the dataframe and the target classes using the random state 10
# into the variables X_train, X_test, y_train and y_test
raise NotImplementedError


In [None]:
assert X_train.shape == (222,13)
assert y_train.shape == (222,)
assert X_test.iloc[0]['age'] == 60.0
assert X_train.iloc[5]['age'] == 46.0
assert X_test.shape == (75, 13)
assert y_test.shape == (75,)
assert y_test[1] == 1

## 8: Scaling

Many classifiers like Neural Networks work best with scaled features. Therefore, we need to scale these to make the features more comparable.

Create the `StandardScaler`. Then use `fit_transform()` to analyze the training data and to scale it in one step. Afterwards, use `transform()` to scale the test data (this applies the same scaling that was learned from the training data).
Make sure to save the scaled test + train data into new variables, e.g., `X_train_scaled` and `X_test_scaled`.
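The difference between `fit_transform()` and `transform()` can be seen in a small sketch with a single made-up feature. The arrays and variable names below are hypothetical and only meant to show that the test data is scaled with the mean and standard deviation learned from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, not from the lab dataset.
train_demo = np.array([[1.0], [2.0], [3.0]])
test_demo = np.array([[2.0], [4.0]])

scaler_demo = StandardScaler()
# fit_transform: learns mean (2.0) and std from train_demo, then scales it.
train_demo_scaled = scaler_demo.fit_transform(train_demo)
# transform: reuses the learned mean and std; nothing is re-fitted.
test_demo_scaled = scaler_demo.transform(test_demo)

print(scaler_demo.mean_)         # mean learned from the training data
print(train_demo_scaled.ravel())
print(test_demo_scaled.ravel())  # 2.0 maps to 0.0, the training mean
```

Note that the scaled test data does not need to have zero mean itself, because it is measured against the training statistics.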

In [None]:
# Create the standard scaler and store it in a variable called scaler
raise NotImplementedError


In [None]:
# Correctly scale the train and test data into X_train_scaled and X_test_scaled
raise NotImplementedError


In [None]:
assert X_train_scaled.shape == (222,13)
assert X_test_scaled.shape == (75,13)
assert np.isclose(X_train_scaled[0][0], 1.0153, rtol=0.1), "Should be close to 1.0153"
assert np.isclose(X_test_scaled[0][0], 0.5599, rtol=0.1), "Should be close to 0.5599"
assert np.isclose(X_test_scaled[1][1], 0.7143, rtol=0.1), "Should be close to 0.7143"

## 9: PyTorch Setup

In this block, we will convert the data to tensors and into data loaders so that they can be used by PyTorch.

First, convert the four data variables (`X_train_scaled` etc.) to tensors, using similar names but indicating that these have the tensor form (`X_train_tensor` etc.).

Also specify the `dtype` as `torch.float32`, to ensure classification using the standard neural network settings will work afterwards.

For the `y` target classes, you also need to add the following at the end of the conversion: `.view(-1,1)`. What does this do?

- Without the conversion, the tensors would have the shape of (number_of_samples,), meaning 1-dimensional vectors
- With the conversion, the tensors are reshaped to (number_of_samples, 1). This makes them 2-dimensional tensors with a single column.

Why is this needed? Most neural network architectures and standard functions expect the target tensor to have two dimensions. Also, the model output will be in the same shape; to compare the predictions, both arrays need to have the same shape.
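The effect of `.view(-1, 1)` can be checked on a tiny made-up label tensor. The names `labels_demo` and `column_demo` are hypothetical. The last line sketches what can go wrong when shapes do not match: comparing a column with a 1-dimensional vector broadcasts to a full matrix instead of comparing element by element.

```python
import torch

# Hypothetical labels for four samples.
labels_demo = torch.tensor([0, 1, 1, 0], dtype=torch.float32)
print(labels_demo.shape)   # torch.Size([4])

# -1 lets PyTorch infer the number of rows; 1 fixes a single column.
column_demo = labels_demo.view(-1, 1)
print(column_demo.shape)   # torch.Size([4, 1])

# Mismatched shapes broadcast: (4, 1) compared with (4,) gives (4, 4).
mismatch_demo = column_demo == labels_demo
print(mismatch_demo.shape)  # torch.Size([4, 4])
```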

In [None]:
# Convert to PyTorch tensors. Make sure you use the scaled values!
raise NotImplementedError


In [None]:
test_case.assertEqual(X_train_tensor.shape, (222, 13))
test_case.assertEqual(X_test_tensor.shape, (75, 13))
test_case.assertEqual(y_train_tensor.shape, (222, 1))
test_case.assertEqual(y_test_tensor.shape, (75, 1))

test_case.assertEqual(X_train_tensor.dtype, torch.float32)
test_case.assertEqual(X_test_tensor.dtype, torch.float32)
test_case.assertEqual(y_train_tensor.dtype, torch.float32)
test_case.assertEqual(y_test_tensor.dtype, torch.float32)

test_case.assertTrue(torch.isclose(X_train_tensor[0][0], torch.tensor(1.0153, dtype=torch.float32), rtol=0.1))

Next, create two `TensorDataset` variables (`train_dataset` and `test_dataset`) that combine the `X` and `y` for each dataset. This makes it easier for PyTorch to handle the data.

Afterwards, construct a `DataLoader` for each dataset. Name these variables: `train_loader` and `test_loader`. They will take care of loading the next batch into memory for training. As such, set a `batch_size` of `32`. It's good if you let the `DataLoader` shuffle the training dataset between each epoch, to avoid any effects the order of data might have on the training. For consistency during evaluation, set `shuffle` to `False` for the test data.
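To see how a `DataLoader` cuts a dataset into batches, here is a small sketch with ten made-up samples and a batch size of 4. The tensors and the names ending in `_demo` are assumptions for illustration, and the batch size deliberately differs from the one used in the lab.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical data: 10 samples, 2 features, one target column.
X_loader_demo = torch.arange(20, dtype=torch.float32).view(10, 2)
y_loader_demo = torch.zeros(10, 1)

dataset_demo = TensorDataset(X_loader_demo, y_loader_demo)
loader_demo = DataLoader(dataset_demo, batch_size=4, shuffle=False)

# Each iteration yields one (features, targets) pair for a batch.
batch_sizes_demo = [xb.shape[0] for xb, yb in loader_demo]
print(len(loader_demo), batch_sizes_demo)  # 3 [4, 4, 2]
```

`len()` of a loader is the number of batches, not the number of samples, and the last batch is smaller when the dataset size is not a multiple of the batch size.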

In [None]:
# Create TensorDataset and DataLoader
raise NotImplementedError


In [None]:
test_case.assertEqual(len(train_dataset), 222)
test_case.assertEqual(len(test_dataset), 75)

# Check the first batch to ensure data loaders are working as intended.
# The loop variables are named x_batch / y_batch so that they do not
# overwrite the target array y created in step 7.
for x_batch, y_batch in train_loader:
    test_case.assertEqual(x_batch.shape[1], 13)  # Check the number of features
    test_case.assertEqual(y_batch.shape[1], 1)  # Check the number of target variables
    break

for x_batch, y_batch in test_loader:
    test_case.assertEqual(x_batch.shape[1], 13)  # Check the number of features
    test_case.assertEqual(y_batch.shape[1], 1)  # Check the number of target variables
    break

## 10: Define Neural Network

Now, define the structure of the neural network using `nn.Sequential`. The resulting variable that stores the network should be called `model`.

Layers:

1. **Linear:** this fully connected layer needs two attributes: the features of our input data (which is 13, according to the 13 different health indicators). To make the code more dynamic, you can also use `X_train_scaled.shape[1]` to retrieve the number. The second attribute specifies the number of nodes (neurons) in the first hidden layer. Set it to 16.
2. **ReLU:** Add a ReLU (Rectified Linear Unit) activation function. This is crucial for learning complex patterns in data, as it introduces non-linearity into the network.
3. **Linear:** the second hidden layer. Take the 16 outputs from before as inputs, and provide 8 outputs for the following layer.
4. **ReLU:** another ReLU activation function is applied after the second hidden layer.
5. **Linear** output layer: it takes 8 outputs from the previous layer as inputs and produces a single output value. This will represent the model's prediction.
6. **Sigmoid:** it squashes the output value between 0 and 1, representing the probability of the input belonging to one of the two classes (heart disease or no heart disease).



In [None]:
# Define the neural network model using Sequential
raise NotImplementedError


In [None]:
test_case.assertEqual(len(list(model.children())), 6)  # Check for 6 layers
test_case.assertIsInstance(model[0], nn.Linear) # Check first layer is Linear
test_case.assertIsInstance(model[1], nn.ReLU) # Check second layer is ReLU
test_case.assertIsInstance(model[2], nn.Linear) # Check third layer is Linear
test_case.assertIsInstance(model[3], nn.ReLU) # Check fourth layer is ReLU
test_case.assertIsInstance(model[4], nn.Linear) # Check fifth layer is Linear
test_case.assertIsInstance(model[5], nn.Sigmoid) # Check sixth layer is Sigmoid

test_case.assertEqual(model[0].in_features, 13) #check input size of first layer
test_case.assertEqual(model[0].out_features, 16) #check output size of first layer
test_case.assertEqual(model[2].in_features, 16) #check input size of second layer
test_case.assertEqual(model[2].out_features, 8) #check output size of second layer
test_case.assertEqual(model[4].in_features, 8) #check input size of third layer
test_case.assertEqual(model[4].out_features, 1) #check output size of third layer

Now print the summary of your created model. Import `summary` from `torchsummary` (pre-installed on Google Colab).

Then, call `summary` and pass it the model. The second parameter is a tuple that specifies the input size that the model expects.
- The first entry of the tuple is the batch size. You defined this above when creating the `DataLoader`s.
- The second entry is the number of features. You can dynamically get the number like you did previously when defining the input shape of the first neural network layer.
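The parameter counts reported in a model summary can be cross-checked by hand. A fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. The sketch below applies this to the three linear layers described above; the variable names are hypothetical.

```python
# (inputs, outputs) of the three Linear layers described in the text.
layer_sizes = [(13, 16), (16, 8), (8, 1)]

# Weights (n_in * n_out) plus one bias per output neuron.
params_per_layer = [n_in * n_out + n_out for n_in, n_out in layer_sizes]
total_params = sum(params_per_layer)

print(params_per_layer)  # [224, 136, 9]
print(total_params)      # 369
```

The activation functions (ReLU, Sigmoid) have no trainable parameters, so they do not add to the total.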

In [None]:
raise NotImplementedError


The last preparation steps are the loss function and the optimizer:

* For the loss, use `BCELoss`, which is the binary cross entropy for classification. We are performing a binary classification.
* For the optimizer, use `Adam`. You can define the learning rate yourself.
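To get a feeling for what `BCELoss` computes, the following sketch compares it with a manual calculation of the binary cross entropy, the mean of `-(y * log(p) + (1 - y) * log(1 - p))`. The three predictions and targets are made up, and all names ending in `_demo` are assumptions.

```python
import math
import torch
from torch import nn

# Hypothetical predicted probabilities and true labels for three samples.
probs_demo = torch.tensor([[0.9], [0.2], [0.6]])
targets_demo = torch.tensor([[1.0], [0.0], [0.0]])

loss_fn_demo = nn.BCELoss()
loss_torch_demo = loss_fn_demo(probs_demo, targets_demo).item()

# Manual version: for y=1 use log(p), for y=0 use log(1 - p), then average.
loss_manual_demo = -(math.log(0.9) + math.log(1 - 0.2) + math.log(1 - 0.6)) / 3

print(loss_torch_demo, loss_manual_demo)
```

The third sample contributes most to the loss, since a probability of 0.6 for a true label of 0 is the worst of the three predictions.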

In [None]:
# Define loss function and optimizer
raise NotImplementedError


## 11: Train the Neural Network

First, let's define two variables.

* The first is `num_epochs`. Store the number of training epochs you would like to use, e.g., 50. It's good practice to have this in a variable to make changes in configuration easier in a central place, without directly messing around in code of the for-loop.

* Also define a variable `train_loss_history` as an empty list. We'll append values during training.

In [None]:
# Define num_epochs and train_loss_history
raise NotImplementedError


In [None]:
assert num_epochs > 0
assert len(train_loss_history) == 0

Set the model to **train mode**. We don't evaluate within the loop, so it's enough to only switch to train mode once before the loop. This is important because some layers, like dropout, behave differently during training and evaluation.

In [None]:
raise NotImplementedError


In [None]:
assert model.training

Now it's time for the large training loop. Use the standard procedure for PyTorch:


1. Create a `for`-loop over the range of `num_epochs`.
2. Set a `total_loss` variable to `0`, so that the loss of each epoch can be summed while going through batches.
3. Create an inner `for`-loop over the `train_loader`. It returns two variables that you have as loop variables. Call them: `batch_X`, `batch_y`.
   1. Within the inner loop, first **reset the optimizer** using `zero_grad()`.
   2. Next, send your batch data to the **model** and store its **predictions** in a `y_pred` variable.
   3. Use the predicted and the true labels to calculate the **loss function** you defined above. Store its results in a variable called `loss`.
   4. Call `backward()` on `loss` to compute the gradients of the loss function with respect to all trainable parameters.
   5. **Update the parameters** of the model by calling `optimizer.step()` on the Adam optimizer you defined before.
   6. **For statistics:** add the `loss.item()` to the `total_loss` variable so that we can get the sum of all batch losses of the whole epoch. *Note:* together with step 4, this yields the average loss per batch rather than per item, which is sufficient as we want to monitor the trend and not the exact loss per item.

4. Outside of the inner loop: **Compute the average loss**. Divide the `total_loss` by the number of batches we used for training (--> `len(train_loader)`).
5. Append the average loss you computed to the `train_loss_history` array.

6. Every 5 epochs, **print** the current epoch number as well as the current training loss.
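As a quick sanity check on the averaging step, the number of batches per epoch follows from the sample count and the batch size. The sketch below assumes the default `DataLoader` behavior of keeping the last, smaller batch (`drop_last=False`); the variable names are hypothetical.

```python
import math

n_samples = 222   # training samples after the split
batch_size = 32   # batch size chosen for the DataLoader

# With drop_last=False, the last incomplete batch is kept.
n_batches = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - (n_batches - 1) * batch_size

print(n_batches)        # 7
print(last_batch_size)  # 30
```

Dividing `total_loss` by this number of batches gives the mean of the per-batch losses for one epoch.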

In [None]:
raise NotImplementedError


In [None]:
# Assert statements to check training loop correctness
assert len(train_loss_history) == num_epochs, "Training loss history should have an entry for each epoch."

# Check for decreasing loss trend (not strictly decreasing, allowing for minor fluctuations)
for i in range(1, len(train_loss_history) - 5):  # Check for a trend over several epochs
    assert train_loss_history[i] <= train_loss_history[i-1] * 1.5, f"Loss should be non-increasing. Epoch {i}: {train_loss_history[i]}, Epoch {i-1}: {train_loss_history[i - 1]}"

# Check final loss value
assert train_loss_history[-1] < 0.5, f"Final training loss is unexpectedly high: {train_loss_history[-1]}"

## 12: Plot the training history

Execute the following block of code to see the visualization of the training loss. It should decrease during training over the epochs.

In [None]:
# Plot the training loss
plt.figure(figsize=(10, 4))

plt.plot(train_loss_history, label="Training Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training Loss Over Epochs")
plt.legend()

mid, execution_time = getData()
plt.text(.01, .01, f'{notebook_version}, {mid}, {execution_time}', ha='left', va='bottom', transform=plt.gca().transAxes)

plt.show()

## 13: Evaluation

As the last step, we also want to know how well our model performs. We'll need to evaluate the trained model with the separate test data, which was not used for training. Because the model has never seen this data before, it is a more reliable indicator of the model's performance.

First, we need to set up a few things. Create two variables, `correct` and `total`, and set both to 0. We'll use these for counting how many predictions we got correct, compared to the ground truth.

Also, set the `model` to *evaluation mode*. Among other things, this disables dropout features if defined in the structure.

In [None]:
raise NotImplementedError


In [None]:
assert not model.training
assert correct == 0
assert total == 0

The evaluation is again a loop. This time, we iterate over the `test_loader`. We count the number of correctly classified items.

First, we use the `no_grad` context manager of PyTorch, which disables gradient calculation in its block. This reduces memory consumption for computations. When the code execution leaves the block, it will automatically re-enable gradient calculation.

Within this block, we iterate over the `test_loader` with a `for`-loop, similar to the training loop. As loop variables, we get both the data for the batch (`batch_X`), as well as the labels (`batch_y`).

Within the loop:
1. Let the `model` **predict** based on the batch data, and store the results in `y_pred`.
2. **Convert the probabilities to binary labels.** How? The last layer of our neural network is a sigmoid, so it returns float values between 0 and 1. If we don't weight one class more than the other, we take the middle (`0.5`) as the separation point. Therefore, we can just check whether `y_pred >= 0.5` to get the class output. This comparison yields boolean values (`True` / `False`); convert them with `float()` so that they match our `y`, which contains 0/1 stored as floats. Store this in `y_pred_class`.
3. **Count how many items are classified correctly.** To do this, compare if `y_pred_class == batch_y`. This returns a tensor of `True` / `False` values, depending on whether the classes are equal or not; `True` counts as `1` when summed. To calculate the accuracy, we need to know how many are correct, so use `sum()` to count how many correct predictions we have in that tensor. Finally, we're now leaving the tensor-world and returning to plain Python numbers, so convert the result to a number with `item()`. Add this result to the `correct` variable.
4. The final step is to **count how many items** we have processed in total so far, which will later be required for the percentage. Add `batch_y.size(0)` to the `total` variable.
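The thresholding and counting steps can be tried out on a handful of made-up values before applying them to real batches. The tensors below are hypothetical stand-ins for model outputs and labels, and the names ending in `_demo` are assumptions.

```python
import torch

# Hypothetical sigmoid outputs and true labels for four samples.
y_pred_demo = torch.tensor([[0.91], [0.35], [0.50], [0.08]])
batch_y_demo = torch.tensor([[1.0], [1.0], [1.0], [0.0]])

# Threshold at 0.5; .float() turns True/False into 1.0/0.0.
y_pred_class_demo = (y_pred_demo >= 0.5).float()

# Element-wise comparison, summed and converted to a plain Python number.
correct_demo = (y_pred_class_demo == batch_y_demo).sum().item()
total_demo = batch_y_demo.size(0)

print(y_pred_class_demo.view(-1))  # tensor([1., 0., 1., 0.])
print(correct_demo, total_demo)    # 3 4
```

Only the second sample is misclassified here, so the accuracy of this toy batch would be 3 / 4.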

In [None]:
# Evaluate accuracy on test set
raise NotImplementedError


In [None]:
# Assertions for the evaluation part
assert total == 75, "Total number of test samples should be 75."
accuracy = correct / total
assert accuracy >= 0.7, "Accuracy should be above 70%." # Adjust threshold as needed

Now we have all the numbers, and we just need to calculate the accuracy: divide `correct` by `total`, store the result in a variable called `test_accuracy` and print it. With the default architecture, you should reach around 90% accuracy. The exact number will differ every time you run the example, as the solution the neural network finds depends on the random values initially assigned to the weights and biases when constructing the network. Another big influence would be the split between the training and test set, but we made sure this is consistent by supplying the initial value for the random number generator.

In [None]:
raise NotImplementedError


In [None]:
assert test_accuracy > 0.80
assert test_accuracy < 0.95
execution_time = datetime.now().isoformat()
print(f"Execution time: {execution_time}")

## 14: Final Checks

To make sure that your whole Jupyter Notebook executes without issues, choose *Runtime -> Restart Session and Run All ...*. Make sure it actually restarts executing all cells – if not, select the command again. Make sure all the lines you wrote execute and all automated tests pass.

Jupyter automatically saves the cell outputs into the notebook itself, so make sure the cell outputs (including the plotted graphs) are visible. To submit your notebook with all the outputs included, follow these steps:
1. Rename the file to include your name, e.g.: "lab_heart_disease_25ss_**jakl**.ipynb"
2. Make sure your file is saved (press Ctrl+S)
3. Export the notebook to your computer: File --> Download --> Download .ipynb
4. Go to eCampus and upload your notebook file to the assignment.