# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Mini-Project Notebook: Structured Data Classification

## Problem Statement

To predict whether a patient has heart disease.

## Learning Objectives

At the end of the experiment, you will be able to

* understand the Cleveland Clinic Foundation for Heart Disease dataset
* pre-process this dataset
* build a neural network architecture/model using the Keras Sequential or Functional API
* perform model training
* perform inference on unseen data
* build a Gradio interface for this application

## Introduction

This example demonstrates structured data classification, starting from a raw
CSV file. The data includes both numerical and categorical features. We will
preprocess it by normalizing the numerical features and vectorizing the
categorical ones.
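
As a rough illustration of these two preprocessing steps, the sketch below standardizes a numerical column and one-hot encodes a categorical one on a tiny made-up frame. The values, the column subset, and the use of `pd.get_dummies` are assumptions for illustration only; the notebook itself may preprocess the data differently.

```python
import pandas as pd

# Tiny made-up frame; values are illustrative, not taken from heart.csv
toy = pd.DataFrame({
    "age": [40, 50, 60],
    "thal": ["normal", "fixed", "normal"],
})

# Normalize: subtract the mean and divide by the (population) standard deviation
toy["age"] = (toy["age"] - toy["age"].mean()) / toy["age"].std(ddof=0)

# Vectorize: replace the categorical column with one indicator column per category
toy = pd.get_dummies(toy, columns=["thal"], dtype=int)

print(toy)
```

After this step every column is numeric, which is the form a dense neural network expects.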

### Dataset

[Our dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) is provided by the
Cleveland Clinic Foundation for Heart Disease.
It's a CSV file with 303 rows. Each row contains information about a patient (a
**sample**), and each column describes an attribute of the patient (a **feature**). We
use the features to predict whether a patient has heart disease (**binary
classification**).

Here's the description of each feature:

Column| Description| Feature Type
------------|--------------------|----------------------
Age | Age in years | Numerical
Sex | (1 = male; 0 = female) | Categorical
CP | Chest pain type (0, 1, 2, 3, 4) | Categorical
Trestbps | Resting blood pressure (in mm Hg on admission) | Numerical
Chol | Serum cholesterol in mg/dl | Numerical
FBS | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) | Categorical
RestECG | Resting electrocardiogram results (0, 1, 2) | Categorical
Thalach | Maximum heart rate achieved | Numerical
Exang | Exercise induced angina (1 = yes; 0 = no) | Categorical
Oldpeak | ST depression induced by exercise relative to rest | Numerical
Slope | Slope of the peak exercise ST segment | Numerical
CA | Number of major vessels (0-3) colored by fluoroscopy | Both numerical & categorical
Thal | 3 = normal; 6 = fixed defect; 7 = reversible defect | Categorical
Target | Diagnosis of heart disease (1 = true; 0 = false) | Target

In [None]:
#@title Download the data
!wget -qq https://cdn.iisc.talentsprint.com/AIandMLOps/Datasets/heart.csv
print("Data Downloaded Successfully!!")
!ls | grep '.csv'

## Grading = 10 Points

### Import Required Packages

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers

## Load the data and pre-process it [3 Marks]

### Load data into a Pandas dataframe

Hint: `pd.read_csv`

In [None]:
file_url = "/content/heart.csv"
## YOUR CODE HERE

Check the shape of the dataset:

In [None]:
## YOUR CODE HERE

Check the preview of a few samples:

Hint: `head()`

In [None]:
## YOUR CODE HERE

Draw some inference from the data. What does the target column indicate?

The last column, "target", indicates whether the patient has heart disease (1) or not
(0).
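
For a first look at a frame like this one, the following sketch loads a small made-up CSV from memory and prints its shape, a preview, the class balance of the target, and the per-column missing-value counts. The rows and the three-column layout are invented for illustration and are not taken from `heart.csv`.

```python
import io
import pandas as pd

# Made-up rows standing in for heart.csv; only three columns for brevity
csv_text = """age,chol,target
63,233,0
67,286,1
41,,0
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                     # (rows, columns)
print(df.head(2))                   # first two rows
print(df["target"].value_counts())  # how many samples per class
print(df.isna().sum())              # missing values per column
```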

### Missing values

In [None]:
# Check whether any missing values are present
## YOUR CODE HERE

### Show the unique values present in each categorical column

- Remove the rows that have '1' or '2' as the value in the `thal` column

In [None]:
# Show all the columns in dataframe
## YOUR CODE HERE

In [None]:
# Print the unique values present in each categorical column

categorical_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'ca', 'thal']

## YOUR CODE HERE


In [None]:
# Print the unique values present in each categorical column along with their counts

## YOUR CODE HERE


- Remove the rows that have '1' or '2' as the value in the `thal` column

In [None]:
# Find the indices of the rows that have '1' or '2' as the value in the `thal` column

idx = ## YOUR CODE HERE
idx

In [None]:
# Drop the above indexed rows

## YOUR CODE HERE
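
One possible way to locate and drop such rows is a boolean mask built with `isin`, sketched here on a made-up frame. It assumes the `thal` values are stored as strings; the frame `toy` and its contents are hypothetical.

```python
import pandas as pd

# Made-up frame; assumes `thal` holds string labels
toy = pd.DataFrame({
    "thal": ["normal", "1", "fixed", "2", "reversible"],
    "target": [0, 0, 1, 1, 1],
})

# Boolean mask of rows whose `thal` value is one of the unwanted labels
mask = toy["thal"].isin(["1", "2"])
idx = toy[mask].index

# Drop those rows and renumber the remaining ones from 0
cleaned = toy.drop(idx).reset_index(drop=True)
print(cleaned["thal"].unique())
```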

In [None]:
# Recheck the unique values present in each categorical column

## YOUR CODE HERE

### Convert the categorical values present in `thal` column to numerical labels

Hint: Create a dictionary mapping
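
A dictionary mapping can be applied with `Series.map`, as in this minimal sketch. The category names and the integer codes assigned to them are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up categorical values; the label names are an assumption
thal = pd.Series(["normal", "fixed", "reversible", "fixed"])

# Hypothetical mapping from category label to integer code
thal_mapping = {"normal": 0, "fixed": 1, "reversible": 2}

encoded = thal.map(thal_mapping)
print(encoded.tolist())
```

Any value that is missing from the dictionary becomes `NaN` after `map`, so it is worth rechecking for missing values afterwards.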

In [None]:
## YOUR CODE HERE

### Split the dataset into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split

## YOUR CODE HERE (perform stratified sampling/splitting)
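
Stratified splitting keeps the class proportions the same in the training and test sets. The sketch below shows the idea on synthetic arrays; the data, the 80/20 split, and the `random_state` value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 2 features, 25% positive labels
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# stratify=y preserves the 75/25 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())
```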

### Scale the numerical features

In [None]:
numerical_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope']

In [None]:
from sklearn.preprocessing import StandardScaler

## YOUR CODE HERE
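
When scaling, a common practice is to fit the scaler on the training data only and reuse its statistics for the test data, so that no information from the test set leaks into training. A minimal sketch on made-up arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test features (two numerical columns)
X_train = np.array([[40.0, 200.0], [50.0, 220.0], [60.0, 240.0]])
X_test = np.array([[50.0, 260.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_test_scaled)
```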

## Building the model [3 Marks]

* Use tf.keras.layers.Input() for input layer
* Add dense layers
* Add dropout layers
* Add a classification layer at the end
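
The four steps above could be sketched with the Functional API as follows. The number of input features, the layer width, and the dropout rate are hypothetical values chosen for illustration, and the name `sketch_model` is used to keep it distinct from the `model` you build below.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 13  # hypothetical; depends on how the features are preprocessed

inputs = keras.Input(shape=(n_features,))            # input layer
x = layers.Dense(32, activation="relu")(inputs)      # dense hidden layer
x = layers.Dropout(0.5)(x)                           # dropout for regularization
outputs = layers.Dense(1, activation="sigmoid")(x)   # binary classification layer

sketch_model = keras.Model(inputs, outputs)
sketch_model.summary()
```

A single sigmoid unit outputs a probability between 0 and 1, which is usually paired with a binary cross-entropy loss.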


In [None]:
# Create model

## YOUR CODE HERE

model.summary()

In [None]:
# Compile model with 'adam' optimizer, appropriate loss and metric

## YOUR CODE HERE

In [None]:
# Perform training
epochs=50
batch_size=32
validation_split=0.2

model.fit(## YOUR CODE HERE)

In [None]:
# Performance on test set

model.evaluate(## YOUR CODE HERE)

## Inference on new data [1 Mark]

To get a prediction for a new sample, you can simply call `model.predict()`.
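
Note that `model.predict()` expects a batch of samples that has gone through the same preprocessing as the training data. The sketch below shows one way to turn a dictionary into a single-row array and to convert a predicted probability into a label. The shortened sample, the `thal_mapping` dictionary, the 0.5 threshold, and the helper `to_label` are all hypothetical.

```python
import pandas as pd

# Shortened, made-up sample; a real one needs every feature used in training
sample = {"age": 60, "oldpeak": 2.3, "thal": "fixed"}

# Hypothetical mapping; it must match the one applied to the training data
thal_mapping = {"normal": 0, "fixed": 1, "reversible": 2}

row = pd.DataFrame([sample])                 # one-row frame, one column per feature
row["thal"] = row["thal"].map(thal_mapping)  # encode the categorical value
x = row.to_numpy(dtype="float32")            # shape (1, n_features), a batch of one

def to_label(probability, threshold=0.5):
    """Convert a predicted probability into a readable label."""
    return "heart disease" if probability >= threshold else "no heart disease"

print(x.shape)
```

Any scaling applied to the numerical features during training would also have to be applied to `x` before prediction.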

In [None]:
# Inference on new data

sample = {
    "age": 60,
    "sex": 1,
    "cp": 1,
    "trestbps": 145,
    "chol": 233,
    "fbs": 1,
    "restecg": 2,
    "thalach": 150,
    "exang": 0,
    "oldpeak": 2.3,
    "slope": 3,
    "ca": 0,
    "thal": "fixed",
}


In [None]:
## YOUR CODE HERE

## Gradio Implementation [3 Marks]

Create a Gradio interface for this `Heart Disease Prediction` application. For the feature values given by the user as input, perform prediction using the trained model and return the result to the user.

Make use of gradio elements such as Textbox, Radio buttons, etc.

In [None]:
%%capture
!pip -q install gradio

In [None]:
import gradio as gr

In [None]:
# UI - Input components
## YOUR CODE HERE ...


# UI - Output component
## YOUR CODE HERE ...


In [None]:
# Label prediction function

## YOUR CODE HERE


In [None]:
# Create Gradio interface object and launch it with (share=True)

## YOUR CODE HERE
