# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Mini-Project Notebook: Structured Data Classification

**DISCLAIMER:** THIS NOTEBOOK IS PROVIDED ONLY AS A REFERENCE SOLUTION NOTEBOOK FOR THE MINI-PROJECT. THERE MAY BE OTHER POSSIBLE APPROACHES/METHODS TO ACHIEVE THE SAME RESULTS.

## Problem Statement

To predict whether a patient has a heart disease.

## Learning Objectives

At the end of the experiment, you will be able to

* understand the Cleveland Clinic Foundation for Heart Disease dataset
* pre-process this dataset
* build a neural network architecture/model using the Keras Sequential or Functional API
* perform model training
* perform inference on unseen data
* build a Gradio interface for this application

## Introduction

This example demonstrates structured data classification, starting from a raw
CSV file. Our data includes both numerical and categorical features. We preprocess it by normalizing the numerical features and vectorizing the categorical
ones.
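As a rough sketch of these two preprocessing ideas (not the notebook's actual pipeline), the snippet below standardizes one numerical column and one-hot encodes one categorical column with pandas. The toy values are made up for illustration, and one-hot encoding is only one common way to vectorize a category; the notebook itself later maps `thal` to integer labels instead.

```python
import pandas as pd

# Toy frame for illustration; values are made up, not taken from heart.csv.
toy = pd.DataFrame({
    "chol": [200.0, 240.0, 280.0],
    "thal": ["normal", "fixed", "normal"],
})

# Normalize: subtract the mean and divide by the (population) standard deviation.
chol_scaled = (toy["chol"] - toy["chol"].mean()) / toy["chol"].std(ddof=0)

# Vectorize: one indicator column per category.
thal_onehot = pd.get_dummies(toy["thal"], prefix="thal", dtype=int)

print(chol_scaled.round(3).tolist())
print(thal_onehot.columns.tolist())
```

After scaling, the column has mean 0 and unit standard deviation, which keeps features measured in different units on a comparable scale.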

### Dataset

[Our dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) is provided by the
Cleveland Clinic Foundation for Heart Disease.
It's a CSV file with 303 rows. Each row contains information about a patient (a
**sample**), and each column describes an attribute of the patient (a **feature**). We
use the features to predict whether a patient has a heart disease (**binary
classification**).

Here's the description of each feature:

Column| Description| Feature Type
------------|--------------------|----------------------
Age | Age in years | Numerical
Sex | (1 = male; 0 = female) | Categorical
CP | Chest pain type (0, 1, 2, 3, 4) | Categorical
Trestbps | Resting blood pressure (in mm Hg on admission) | Numerical
Chol | Serum cholesterol in mg/dl | Numerical
FBS | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) | Categorical
RestECG | Resting electrocardiogram results (0, 1, 2) | Categorical
Thalach | Maximum heart rate achieved | Numerical
Exang | Exercise induced angina (1 = yes; 0 = no) | Categorical
Oldpeak | ST depression induced by exercise relative to rest | Numerical
Slope | Slope of the peak exercise ST segment | Numerical
CA | Number of major vessels (0-3) colored by fluoroscopy | Both numerical & categorical
Thal | 3 = normal; 6 = fixed defect; 7 = reversible defect | Categorical
Target | Diagnosis of heart disease (1 = true; 0 = false) | Target

In [None]:
#@title Download the data
!wget -qq https://cdn.iisc.talentsprint.com/AIandMLOps/Datasets/heart.csv
print("Data Downloaded Successfully!!")
!ls | grep '.csv'

Data Downloaded Successfully!!
heart.csv


## Grading = 10 Points

### Import Required Packages

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers

## Load the data and pre-process it [3 Marks]

### Load data into a Pandas dataframe

Hint: `pd.read_csv`

In [None]:
file_url = "/content/heart.csv"
df = pd.read_csv(file_url)

Check the shape of the dataset:

In [None]:
df.shape

(303, 14)

Check the preview of a few samples:

Hint: `head()`

In [None]:
df.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,1,145,233,1,2,150,0,2.3,3,0,fixed,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3,normal,1
2,67,1,4,120,229,0,2,129,1,2.6,2,2,reversible,0
3,37,1,3,130,250,0,0,187,0,3.5,3,0,normal,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0,normal,0


Draw some inference from the data. What does the target column indicate?

The last column, "target", indicates whether the patient has a heart disease (1) or not
(0).
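One inference worth drawing early is how balanced the two classes are, since accuracy is harder to interpret when one class dominates. The sketch below uses made-up labels purely for illustration; on the real data the equivalent call would be `df['target'].value_counts()`.

```python
import pandas as pd

# Hypothetical labels for illustration only; not the real target column.
target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="target")

# Absolute counts and relative frequencies per class.
counts = target.value_counts().sort_index()
fractions = target.value_counts(normalize=True).sort_index()

print(counts.to_dict())
print(fractions.to_dict())
```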

### Missing values

In [None]:
# Check whether any missing values are present
df.isna().sum()

age,0
sex,0
cp,0
trestbps,0
chol,0
fbs,0
restecg,0
thalach,0
exang,0
oldpeak,0


### Show the unique values present in each categorical column

- Remove the rows which have '1' or '2' as the value in the `thal` column

In [None]:
df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [None]:
# Print the unique values present in each categorical column

categorical_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'ca', 'thal']

print(f"{'Col_name':<10} Unique values")
print("="*40)
for col in categorical_cols:
    print(f"{col:<10} {df[col].unique()}")

Col_name   Unique values
sex        [1 0]
cp         [1 4 3 2 0]
fbs        [1 0]
restecg    [2 0 1]
exang      [0 1]
ca         [0 3 2 1]
thal       ['fixed' 'normal' 'reversible' '1' '2']


In [None]:
# Print the unique values present in each categorical column along with their counts

for col in categorical_cols:
    print(f"{col:<10} {df[col].value_counts()}")
    print("="*40)

sex        sex
1    205
0     98
Name: count, dtype: int64
cp         cp
4    142
3     84
2     49
1     24
0      4
Name: count, dtype: int64
fbs        fbs
0    258
1     45
Name: count, dtype: int64
restecg    restecg
0    149
2    146
1      8
Name: count, dtype: int64
exang      exang
0    204
1     99
Name: count, dtype: int64
ca         ca
0    176
1     67
2     40
3     20
Name: count, dtype: int64
thal       thal
normal        168
reversible    115
fixed          18
1               1
2               1
Name: count, dtype: int64


- Remove the rows which have '1' or '2' as the value in the `thal` column

In [None]:
df[df['thal'] == '1']

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
247,59,1,0,164,176,1,0,90,0,1.0,1,2,1,0


In [None]:
df[df['thal'] == '2']

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
252,57,0,1,130,236,0,0,174,0,0.0,1,1,2,0


In [None]:
idx = [df[df['thal'] == '1'].index[0], df[df['thal'] == '2'].index[0]]
idx

[247, 252]

In [None]:
df = df.drop(idx, axis=0)
df.shape

(301, 14)
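The same clean-up can also be written as a single boolean filter instead of looking up row indices. This is a sketch on a toy frame with made-up rows; `valid_thal` is a name introduced here for illustration.

```python
import pandas as pd

# Toy frame mimicking the `thal` column, including the two stray codes.
toy = pd.DataFrame({
    "age": [63, 59, 57, 41],
    "thal": ["fixed", "1", "2", "normal"],
})

valid_thal = ["fixed", "normal", "reversible"]

# Keep only rows whose `thal` value is one of the expected categories.
clean = toy[toy["thal"].isin(valid_thal)].copy()

print(clean.shape)
print(clean["thal"].unique())
```

Filtering on the allowed set has the advantage of also dropping any other unexpected code, not just '1' and '2'.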

In [None]:
df['thal'].value_counts()

thal,count
normal,168
reversible,115
fixed,18


In [None]:
# Recheck the unique values present in each categorical column

print(f"{'Col_name':<10} Unique values")
print("="*40)
for col in categorical_cols:
    print(f"{col:<10} {df[col].unique()}")

Col_name   Unique values
sex        [1 0]
cp         [1 4 3 2 0]
fbs        [1 0]
restecg    [2 0 1]
exang      [0 1]
ca         [0 3 2 1]
thal       ['fixed' 'normal' 'reversible']


### Convert the categorical values present in `thal` column to numerical labels

Hint: Create a dictionary mapping

In [None]:
thal_mapping = {'fixed': 0, 'normal': 1, 'reversible': 2}
df['thal'] = df['thal'].map(thal_mapping)

In [None]:
print(f"{'Col_name':<10} Unique values")
print("="*40)
for col in categorical_cols:
    print(f"{col:<10} {df[col].unique()}")

Col_name   Unique values
sex        [1 0]
cp         [1 4 3 2 0]
fbs        [1 0]
restecg    [2 0 1]
exang      [0 1]
ca         [0 3 2 1]
thal       [0 1 2]
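`Series.map` with a dictionary turns any value that is not a key into `NaN`, which is one reason to remove unexpected codes before mapping. The sketch below shows this on a toy column (the values are illustrative) and how to list whatever failed to map.

```python
import pandas as pd

thal_mapping = {"fixed": 0, "normal": 1, "reversible": 2}

# Toy column with one unexpected value ("1") to show what `map` does with it.
thal = pd.Series(["fixed", "normal", "1", "reversible"])
encoded = thal.map(thal_mapping)

# Values missing from the dictionary come back as NaN.
unmapped = thal[encoded.isna()]
print(encoded.tolist())
print(unmapped.tolist())
```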


### Split the dataset into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=df['target'])

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((240, 13), (61, 13), (240,), (61,))
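The `stratify` argument keeps the class ratio roughly the same in the training and test sets. The following is a simplified NumPy sketch of that idea, not scikit-learn's actual implementation; `stratified_split_indices` is a hypothetical helper and the labels are made up.

```python
import numpy as np

def stratified_split_indices(y, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly equal class ratios in both."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        # Shuffle the rows of this class, then cut off the test share.
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)
        n_test = int(round(len(idx) * test_size))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Made-up labels: 70 negatives and 30 positives.
y = np.array([0] * 70 + [1] * 30)
train_idx, test_idx = stratified_split_indices(y)

# Both splits keep the same share of positives.
print(y[train_idx].mean(), y[test_idx].mean())
```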

### Scale the numerical features

In [None]:
numerical_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope']

df[numerical_cols].head()

,age,trestbps,chol,thalach,oldpeak,slope
0,63,145,233,150,2.3,3
1,67,160,286,108,1.5,2
2,67,120,229,129,2.6,2
3,37,130,250,187,3.5,3
4,41,130,204,172,1.4,1


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

In [None]:
X_train.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
101,1.362636,0,3,-0.963595,5.868071,0,2,0.498048,0,0.429344,0.691529,0,2
55,-0.282397,1,2,-0.679558,1.414375,0,0,1.014755,0,-0.744452,-0.935599,0,1
107,0.594954,1,4,0.456589,0.818064,0,2,0.928637,0,0.093973,0.691529,2,2
183,0.156278,1,4,-0.111484,0.631717,1,2,-1.95631,1,0.429344,2.318657,0,2
17,-0.06306,1,4,0.456589,-0.188211,0,0,0.498048,0,0.093973,-0.935599,0,1
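The scaler is fitted on the training split only and then reused on the test split, so no information from the test rows leaks into the preprocessing. A small NumPy sketch of that pattern on made-up values for a single feature (`StandardScaler` likewise uses the population standard deviation, i.e. `ddof=0`):

```python
import numpy as np

# Made-up values for one feature; not taken from heart.csv.
train = np.array([120.0, 130.0, 140.0, 150.0])
test = np.array([135.0, 160.0])

# "fit": estimate mean and standard deviation from the training split only.
mean, std = train.mean(), train.std()   # np.std defaults to ddof=0

# "transform": apply the same training statistics to both splits.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.round(3))
print(test_scaled.round(3))
```

The scaled test values need not have zero mean or unit variance, which is expected: they are expressed relative to the training distribution.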


## Building the model [3 Marks]

* Use tf.keras.layers.Input() for input layer
* Add dense layers
* Add dropout layers
* Add a classification layer at the end


In [None]:
# Create model
model = keras.Sequential()
model.add(layers.Input((X_train.shape[1],)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(1, activation='sigmoid'))

model.summary()
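The parameter counts reported by `model.summary()` can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output, and dropout layers add no parameters. The sketch below assumes the 13 feature columns of `X_train`; `dense_params` is a helper name introduced here.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Layer widths of the model above: 13 input features -> 128 -> 64 -> 1.
widths = [13, 128, 64, 1]
per_layer = [dense_params(a, b) for a, b in zip(widths, widths[1:])]

print(per_layer)        # [1792, 8256, 65]
print(sum(per_layer))   # 10113
```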

In [None]:
# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
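For intuition about the `binary_crossentropy` loss, here is a minimal NumPy version of the formula. It is a sketch rather than Keras's implementation; the clipping constant `eps` and the example probabilities are assumptions made for illustration.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    per_sample = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(per_sample))

# Made-up labels and predicted probabilities.
loss = binary_crossentropy([1, 0, 1], [0.9, 0.2, 0.6])
print(round(loss, 4))
```

Confident correct predictions contribute little to the loss, while an uninformative prediction of 0.5 costs about 0.693 (the natural log of 2) per sample.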

In [None]:
# Perform training
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)

Epoch 1/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 3s 115ms/step - accuracy: 0.5778 - loss: 0.6853 - val_accuracy: 0.7500 - val_loss: 0.5517
Epoch 2/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 1s 42ms/step - accuracy: 0.7378 - loss: 0.5353 - val_accuracy: 0.7083 - val_loss: 0.5023
Epoch 3/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.7542 - loss: 0.4861 - val_accuracy: 0.8125 - val_loss: 0.4649
Epoch 4/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step - accuracy: 0.8074 - loss: 0.4425 - val_accuracy: 0.8333 - val_loss: 0.4287
Epoch 5/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8044 - loss: 0.4279 - val_accuracy: 0.8958 - val_loss: 0.4075
Epoch 6/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.8267 - loss: 0.3795 - val_accuracy: 0.8958 - val_loss: 0.3912
Epoch 7/50
6/6 ━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x7f8b5b658820>
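The `6/6` in the training log is the number of batches per epoch, and it follows from the split sizes: with `validation_split=0.2`, Keras sets aside the last 20% of the 240 training rows for validation and trains on the rest. A quick arithmetic check (the variable names are introduced here for illustration):

```python
import math

n_train_rows = 240        # rows in X_train
validation_split = 0.2
batch_size = 32

# Rows held out for validation and rows actually used for fitting.
n_val = int(n_train_rows * validation_split)
n_fit = n_train_rows - n_val
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_fit, n_val, steps_per_epoch)   # 192 48 6
```

With only 48 validation rows, each additional correct prediction moves `val_accuracy` by about 0.02, so small fluctuations between epochs are expected.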

In [None]:
# Performance on test set
model.evaluate(X_test, y_test)

2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.7954 - loss: 0.5682


[0.590534508228302, 0.7868852615356445]
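`model.evaluate` returns `[loss, accuracy]` here because the model was compiled with `metrics=['accuracy']`; on 61 test rows, an accuracy of 0.7869 corresponds to 48 correct predictions. Accuracy alone does not show which kind of error the model makes, so a confusion-matrix style breakdown can help. The sketch below uses made-up labels and probabilities, and `confusion_counts` is a hypothetical helper.

```python
import numpy as np

def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, fn, tn) after thresholding predicted probabilities."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    return tp, fp, fn, tn

# Made-up labels and probabilities, for illustration only.
y_true = [1, 0, 1, 0, 1, 0]
y_prob = [0.8, 0.3, 0.4, 0.6, 0.9, 0.1]

tp, fp, fn, tn = confusion_counts(y_true, y_prob)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn)
print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

For a screening task like this one, recall (the share of diseased patients that are caught) is often the figure to watch alongside accuracy.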

## Inference on new data [1 Mark]

To get a prediction for a new sample, you can simply call `model.predict()`.

In [None]:
# Inference on new data

sample = {
    "age": 60,
    "sex": 1,
    "cp": 1,
    "trestbps": 145,
    "chol": 233,
    "fbs": 1,
    "restecg": 2,
    "thalach": 150,
    "exang": 0,
    "oldpeak": 2.3,
    "slope": 3,
    "ca": 0,
    "thal": "fixed",
}

data = pd.DataFrame(sample, index=[0])

data['thal'] = data['thal'].map(thal_mapping)

data[numerical_cols] = scaler.transform(data[numerical_cols])

data

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,0.594954,1,1,0.740626,-0.300019,1,2,0.067459,0,1.016241,2.318657,0,0


In [None]:
pred = model.predict(data, verbose=0)
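# The sigmoid output is a single probability, so threshold it at 0.5
# (np.argmax on a one-element array is always 0).
label = 'Has disease' if pred[0][0] >= 0.5 else 'Does not have disease'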
label

'Does not have disease'
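Because the final layer is a single sigmoid unit, `model.predict` returns an array of shape `(n_samples, 1)` holding the probability of the positive class. Taking `np.argmax` over that single column yields 0 for every row regardless of the probability, so the label should come from a threshold instead. A small sketch with made-up probabilities:

```python
import numpy as np

# Made-up sigmoid outputs with shape (n_samples, 1), as a one-unit output layer returns.
pred = np.array([[0.91], [0.12]])

# argmax over a single column is 0 for every row, whatever the probability.
print(np.argmax(pred, axis=1))      # [0 0]

# Thresholding the probability gives the intended labels.
labels = (pred[:, 0] >= 0.5).astype(int)
print(labels)                       # [1 0]
```

The 0.5 cut-off is a conventional default; a lower threshold trades more false alarms for fewer missed cases.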

## Gradio Implementation [3 Marks]

Create a Gradio interface for this `Heart Disease Prediction` application. For the feature values given by the user as input, perform prediction using the trained model, and return the result to the user.

Make use of gradio elements such as Textbox, Radio buttons, etc.

In [None]:
%%capture
!pip -q install gradio

In [None]:
import gradio
import gradio as gr

In [None]:
data.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,0.594954,1,1,0.740626,-0.300019,1,2,0.067459,0,1.016241,2.318657,0,0


In [None]:
# UI - Input components
in_age = gradio.Textbox(lines=1, placeholder=None, value="34", label='Age of the patient in yrs')
in_sex = gradio.Radio(["Female", "Male"], type="index", label='Gender')
in_cp = gradio.Radio([0, 1, 2, 3, 4], type="value", label='Chest pain type')
in_trestbps = gradio.Textbox(lines=1, placeholder=None, value="120", label='Resting blood pressure (in mm Hg)')
in_chol = gradio.Textbox(lines=1, placeholder=None, value="236", label='Serum cholesterol in mg/dl')
in_fbs = gradio.Radio([0, 1], type="value", label='Fasting blood sugar > 120 mg/dl')
in_restecg = gradio.Radio([0, 1, 2], type="value", label='Resting electrocardiographic results')
in_thalach = gradio.Textbox(lines=1, placeholder=None, value="150", label='Maximum heart rate achieved')
in_exang = gradio.Radio([0, 1], type="value", label='Exercise induced angina')
in_oldpeak = gradio.Textbox(lines=1, placeholder=None, value="2.3", label='ST depression induced by exercise relative to rest')
# Slope takes the values 1, 2 and 3 in the dataset
in_slope = gradio.Radio([1, 2, 3], type="value", label='The slope of the peak exercise ST segment')
in_ca = gradio.Radio([0, 1, 2, 3], type="value", label='Number of major vessels (0-3) colored by fluoroscopy')
in_thal = gradio.Radio(["fixed", "normal", "reversible"], type="value", label='Thallium Stress Test result')

# UI - Output component
out_label = gradio.Textbox(type="text", label='Prediction', elem_id="out_textbox")

In [None]:
# Label prediction function

def get_output_label(in_age, in_sex, in_cp, in_trestbps, in_chol, in_fbs, in_restecg, in_thalach, in_exang, in_oldpeak, in_slope, in_ca, in_thal):
    input_df = pd.DataFrame({"age": [in_age],
                             "sex": [in_sex],
                             "cp": [in_cp],
                             "trestbps": [in_trestbps],
                             "chol": [in_chol],
                             "fbs": [in_fbs],
                             "restecg": [in_restecg],
                             "thalach": [in_thalach],
                             "exang": [in_exang],
                             "oldpeak": [in_oldpeak],
                             "slope": [in_slope],
                             "ca": [in_ca],
                             "thal": [in_thal]
    })

    input_df['thal'] = input_df['thal'].map(thal_mapping)
    input_df[numerical_cols] = scaler.transform(input_df[numerical_cols])
    pred = model.predict(input_df, verbose=0)
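    # The sigmoid output is a single probability, so threshold it at 0.5
    # (np.argmax on a one-element array is always 0).
    label = 'Has disease' if pred[0][0] >= 0.5 else 'Does not have disease'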
    return label
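A `gradio.Textbox` hands its value to the function as a string, so the numeric fields arrive as text such as `"120"`. One way to make the conversion explicit, and to fail with a readable message on bad input, is a small parsing helper. `parse_number` is a hypothetical function introduced here; it is not part of the notebook or of Gradio.

```python
def parse_number(text, field):
    """Convert a Textbox string to float, with a readable error on bad input."""
    try:
        return float(str(text).strip())
    except ValueError:
        raise ValueError(f"{field} must be a number, got {text!r}") from None

print(parse_number(" 120 ", "trestbps"))   # 120.0
```

Inside `get_output_label`, a call such as `parse_number(in_age, "age")` could be applied to each Textbox value before the DataFrame is built.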


In [None]:
# Create Gradio interface object

iface = gradio.Interface(fn = get_output_label,
                         inputs = [in_age, in_sex, in_cp, in_trestbps, in_chol, in_fbs, in_restecg, in_thalach, in_exang, in_oldpeak, in_slope, in_ca, in_thal],
                         outputs = [out_label],
                         title = "Heart Disease Prediction",
                         description = "To predict whether a patient has a heart disease.",
                         flagging_mode = "never"
                         )

iface.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://117e93f749042a3eee.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


