# A Short Guide to Documenting Code
Chaitanya Madduri, June 5, 2022


> <i> This code is about ML model deployment using Dash </i>


There are four elements to documenting an IPython notebook:

1. _Type Annotation_ of functions
2. _Docstrings_
3. _Commenting_
4. _Markdown_ cells

Let's take each in order.
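
As a quick illustration of the first three elements, the sketch below combines type annotations, a docstring, and inline comments in one small function. The function `scale_to_unit_range` is a hypothetical example chosen for illustration; it is not part of this notebook's pipeline.

```python
from typing import List


def scale_to_unit_range(values: List[float]) -> List[float]:
    """Rescale a list of numbers linearly onto the interval [0, 1].

    Args:
        values: Raw measurements, for example resting blood pressures.

    Returns:
        A new list in which the smallest input maps to 0.0 and the
        largest input maps to 1.0.

    Raises:
        ValueError: If ``values`` is empty or all entries are equal.
    """
    if not values:
        raise ValueError("values must not be empty")
    lowest, highest = min(values), max(values)
    if lowest == highest:
        # A constant column has no spread, so the scaling is undefined.
        raise ValueError("values must contain at least two distinct numbers")
    span = highest - lowest
    return [(v - lowest) / span for v in values]


scaled = scale_to_unit_range([100, 150, 200])
```

The annotations state what goes in and what comes out, the docstring explains the contract, and the comment explains a decision that the code alone does not make obvious.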

### Detailed Context
- Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of 5 CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.

- People with cardiovascular disease or at high cardiovascular risk (due to one or more risk factors such as hypertension, diabetes, hyperlipidaemia, or already established disease) need early detection and management, where a machine learning model can be of great help.

### Attribute Information

- Age: age of the patient [years]
- Sex: sex of the patient [M: Male, F: Female]
- ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP: resting blood pressure [mm Hg]
- Cholesterol: serum cholesterol [mg/dl]
- FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
- ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
- Oldpeak: ST depression induced by exercise relative to rest [numeric value]
- ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- HeartDisease: output class [1: heart disease, 0: Normal]
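
The categorical codes listed above can be captured in a small lookup table and used to check a record before it is encoded. This is only a minimal sketch: `ALLOWED_VALUES` and `find_invalid_fields` are illustrative names and do not appear in the original notebook.

```python
# Allowed codes for each categorical attribute, copied from the attribute list.
ALLOWED_VALUES = {
    "Sex": {"M", "F"},
    "ChestPainType": {"TA", "ATA", "NAP", "ASY"},
    "RestingECG": {"Normal", "ST", "LVH"},
    "ExerciseAngina": {"Y", "N"},
    "ST_Slope": {"Up", "Flat", "Down"},
}


def find_invalid_fields(record: dict) -> list:
    """Return the categorical fields whose value is missing or not an allowed code."""
    return [
        name
        for name, allowed in ALLOWED_VALUES.items()
        if record.get(name) not in allowed
    ]


good = {"Sex": "M", "ChestPainType": "ATA", "RestingECG": "Normal",
        "ExerciseAngina": "N", "ST_Slope": "Up"}
bad = dict(good, Sex="X")

good_result = find_invalid_fields(good)
bad_result = find_invalid_fields(bad)
```

A check like this is useful because a fitted label encoder typically raises an error on a category it has never seen.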

### Source
- This dataset was created by combining different datasets already available independently but not combined before. In this dataset, 5 heart datasets are combined over 11 common features which makes it the largest heart disease dataset available so far for research purposes. The five datasets used for its curation are:

    - Cleveland: 303 observations
    - Hungarian: 294 observations
    - Switzerland: 123 observations
    - Long Beach VA: 200 observations
    - Statlog (Heart) Data Set: 270 observations
    - Total: 1190 observations
    - Duplicated: 272 observations

- Final dataset: 918 observations

- Every dataset used can be found under the Index of heart disease datasets from UCI Machine Learning Repository on the following link: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/
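
The observation counts quoted above can be verified with a few lines of arithmetic. The dictionary below simply restates the numbers from the list; the variable names are illustrative.

```python
# Observation counts per source dataset, as listed above.
source_counts = {
    "Cleveland": 303,
    "Hungarian": 294,
    "Switzerland": 123,
    "Long Beach VA": 200,
    "Statlog (Heart)": 270,
}

total = sum(source_counts.values())  # all rows before de-duplication
duplicated = 272                     # duplicated rows reported above
final = total - duplicated           # rows left in the final dataset

print(total, final)
```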

# 1. Tools and Packages

In [29]:
# Standard library
import os
import pickle
import time
import warnings

# Numerical, data-handling, and plotting libraries
import numpy as np
import pandas as pd
from matplotlib import cm

# scikit-learn: preprocessing, metrics, and the SVM model used for prediction
from sklearn import metrics, svm
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.svm import SVC

%matplotlib inline
warnings.filterwarnings("ignore")

In [30]:
# Display the output of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

# 2. Data Preparation

In this stage we clean the dataset and prepare the functions needed later to train and test the model.

### 2.1 Cleaning Data

In [51]:
data = pd.read_csv('heart.csv')
data

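| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 913 | 45 | M | TA | 110 | 264 | 0 | Normal | 132 | N | 1.2 | Flat | 1 |
| 914 | 68 | M | ASY | 144 | 193 | 1 | Normal | 141 | N | 3.4 | Flat | 1 |
| 915 | 57 | M | ASY | 130 | 131 | 0 | Normal | 115 | Y | 1.2 | Flat | 1 |
| 916 | 57 | F | ATA | 130 | 236 | 0 | LVH | 174 | N | 0.0 | Flat | 1 |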


In [54]:
categorical_columns = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
for col in categorical_columns:
    print(' Column Name : {}'.format(col))
    data[col].unique()

 Column Name : Sex

array(['M', 'F'], dtype=object)

 Column Name : ChestPainType

array(['ATA', 'NAP', 'ASY', 'TA'], dtype=object)

 Column Name : RestingECG

array(['Normal', 'ST', 'LVH'], dtype=object)

 Column Name : ExerciseAngina

array(['N', 'Y'], dtype=object)

 Column Name : ST_Slope

array(['Up', 'Flat', 'Down'], dtype=object)

In [47]:
data.head(0).to_dict()

{'Age': {},
 'Sex': {},
 'ChestPainType': {},
 'RestingBP': {},
 'Cholesterol': {},
 'FastingBS': {},
 'RestingECG': {},
 'MaxHR': {},
 'ExerciseAngina': {},
 'Oldpeak': {},
 'ST_Slope': {}}

In [49]:
pd.DataFrame({'Age': {},
 'Sex': {},
 'ChestPainType': {},
 'RestingBP': {},
 'Cholesterol': {},
 'FastingBS': {},
 'RestingECG': {},
 'MaxHR': {},
 'ExerciseAngina': {},
 'Oldpeak': {},
 'ST_Slope': {}})

| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope |
|---|---|---|---|---|---|---|---|---|---|---|---|


In [46]:
data.columns

Index(['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS',
       'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope'],
      dtype='object')
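
The column index above fixes the order of the 11 features that the model expects. One possible way to turn a single patient record (for example, form input collected by a Dash app) into a one-row frame with that column order is sketched below. `FEATURE_COLUMNS` and `record_to_frame` are hypothetical names, and the sample record reuses the values of row 0.

```python
import pandas as pd

# Feature order taken from the data.columns output above.
FEATURE_COLUMNS = ['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol',
                   'FastingBS', 'RestingECG', 'MaxHR', 'ExerciseAngina',
                   'Oldpeak', 'ST_Slope']


def record_to_frame(record: dict) -> pd.DataFrame:
    """Build a one-row DataFrame whose columns follow FEATURE_COLUMNS."""
    missing = [c for c in FEATURE_COLUMNS if c not in record]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    # Passing columns= selects and orders the keys of the record.
    return pd.DataFrame([record], columns=FEATURE_COLUMNS)


# Keys deliberately given in a different order than FEATURE_COLUMNS.
record = {'ST_Slope': 'Up', 'Oldpeak': 0.0, 'ExerciseAngina': 'N', 'MaxHR': 172,
          'RestingECG': 'Normal', 'FastingBS': 0, 'Cholesterol': 289,
          'RestingBP': 140, 'ChestPainType': 'ATA', 'Sex': 'M', 'Age': 40}
frame = record_to_frame(record)
```

Keeping the column order in one place guards against a silent mismatch between the order used in training and the order used at prediction time.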

In [32]:
data.drop(['HeartDisease'], inplace=True, axis=1)

### 2.2 Categorical Values 

In [33]:
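# Numerical columns of the raw dataset; 'HeartDisease' is the target and has already been dropped from `data`
num_ix = ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak',
       'HeartDisease']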

### 2.3 Extracting the Categorical Columns and the Numerical columns 

### 2.4 Label Encoding for categorical columns 

##### We also tried one-hot encoding, which added 10 more columns to the dataset. With one-hot encoding, the test-set accuracy was 0.69, precision 0.64, recall 0.76, and F1 0.69. Label encoding gave better results, so it is the encoding used here.
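
To make the difference between the two encodings concrete, the sketch below applies both to a tiny hypothetical frame with two of the categorical columns. It only illustrates how the column count changes; it does not reproduce the scores quoted above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only; the real notebook works on heart.csv.
toy = pd.DataFrame({
    "Sex": ["M", "F", "M", "F"],
    "ChestPainType": ["ATA", "NAP", "ASY", "TA"],
})

# Label encoding: one integer column per categorical column.
label_encoded = toy.copy()
for column in toy.columns:
    label_encoded[column] = LabelEncoder().fit_transform(toy[column])

# One-hot encoding: one indicator column per distinct category.
one_hot = pd.get_dummies(toy)

print(label_encoded.shape, one_hot.shape)
```

Label encoding keeps the frame at two columns, while one-hot encoding produces one column for each of the six distinct categories in this toy frame.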

In [34]:
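categorical_columns = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']

# Load the dictionary of fitted encoders (one per categorical column); 'rb' reads the pickle in binary mode
with open('label_dict.pkl', 'rb') as f:
    label_dict = pickle.load(f)

# Note: this is an alias, not a copy, so `data` itself is encoded in place as well
encoded_data = data
for col in categorical_columns:
    encoded_data[col] = label_dict[col].transform(data[col])
encoded_data[categorical_columns].head()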

| | Sex | ChestPainType | RestingECG | ExerciseAngina | ST_Slope |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 2 |
| 1 | 0 | 2 | 1 | 0 | 1 |
| 2 | 1 | 1 | 2 | 0 | 2 |
| 3 | 0 | 0 | 1 | 1 | 1 |
| 4 | 1 | 2 | 1 | 0 | 2 |
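
The notebook loads `label_dict.pkl` but does not show how it was created. A plausible way to build such a dictionary, assuming one scikit-learn `LabelEncoder` per column, is sketched below. The training frame is a hypothetical stand-in, and the pickle round trip is done in memory rather than through a file.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the training data.
train = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Down"],
})

# Fit one encoder per categorical column and keep them in a dictionary.
label_dict = {col: LabelEncoder().fit(train[col]) for col in train.columns}

# Serialize and restore; writing to a file opened with 'wb' and reading
# it back with 'rb' works the same way.
restored = pickle.loads(pickle.dumps(label_dict))

# Apply the restored encoders to unseen rows.
new_rows = pd.DataFrame({"Sex": ["F", "M"], "ST_Slope": ["Up", "Down"]})
encoded = new_rows.copy()
for col in new_rows.columns:
    encoded[col] = restored[col].transform(new_rows[col])
```

`LabelEncoder` assigns integers in sorted order of the class names, which is consistent with the codes in the output above (for example `F` becomes 0 and `M` becomes 1).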


In [50]:
for col in categorical_columns:
    data[col].value_counts()

1    725
0    193
Name: Sex, dtype: int64

0    496
2    203
1    173
3     46
Name: ChestPainType, dtype: int64

1    552
0    188
2    178
Name: RestingECG, dtype: int64

0    547
1    371
Name: ExerciseAngina, dtype: int64

1    460
2    395
0     63
Name: ST_Slope, dtype: int64

### 2.5 Scaling the data.
We apply the previously fitted min-max scaler so that all numerical columns are on a comparable scale.

In [35]:
# Load the previously fitted min-max scaler; 'rb' reads the pickle in binary mode
with open('min_max_scaler.pkl', 'rb') as f:
    min_max_scaler = pickle.load(f)

numerical_columns = ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak']

# Scale only the numerical columns; the categorical columns keep their label codes
scale_encoded_data = encoded_data[numerical_columns]
scale_encoded_data = min_max_scaler.transform(scale_encoded_data)


In [36]:
encoded_data[numerical_columns] = scale_encoded_data

In [37]:
encoded_data

| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.244898 | 1 | 1 | 0.70 | 0.479270 | 0.0 | 1 | 0.788732 | 0 | 0.295455 | 2 |
| 1 | 0.428571 | 0 | 2 | 0.80 | 0.298507 | 0.0 | 1 | 0.676056 | 0 | 0.409091 | 1 |
| 2 | 0.183673 | 1 | 1 | 0.65 | 0.469320 | 0.0 | 2 | 0.267606 | 0 | 0.295455 | 2 |
| 3 | 0.408163 | 0 | 0 | 0.69 | 0.354892 | 0.0 | 1 | 0.338028 | 1 | 0.465909 | 1 |
| 4 | 0.530612 | 1 | 2 | 0.75 | 0.323383 | 0.0 | 1 | 0.436620 | 0 | 0.295455 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 913 | 0.346939 | 1 | 3 | 0.55 | 0.437811 | 0.0 | 1 | 0.507042 | 0 | 0.431818 | 1 |
| 914 | 0.816327 | 1 | 0 | 0.72 | 0.320066 | 1.0 | 1 | 0.570423 | 0 | 0.681818 | 1 |
| 915 | 0.591837 | 1 | 0 | 0.65 | 0.217247 | 0.0 | 1 | 0.387324 | 1 | 0.431818 | 1 |
| 916 | 0.591837 | 0 | 1 | 0.65 | 0.391376 | 0.0 | 0 | 0.802817 | 0 | 0.295455 | 1 |
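
Min-max scaling maps each column with `x_scaled = (x - min) / (max - min)`, where `min` and `max` come from the data the scaler was fitted on. The sketch below, which assumes the pickled object is a scikit-learn `MinMaxScaler` and uses hypothetical training values, shows that `transform` applies the stored minimum and maximum to new data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training values for two columns (say, Age and RestingBP).
train = np.array([[30.0, 100.0],
                  [50.0, 140.0],
                  [70.0, 180.0]])
scaler = MinMaxScaler().fit(train)

# New rows are scaled with the minimum and maximum learned from `train`.
new = np.array([[40.0, 120.0],
                [80.0, 90.0]])
scaled = scaler.transform(new)

# The same computation written out by hand.
col_min, col_max = train.min(axis=0), train.max(axis=0)
manual = (new - col_min) / (col_max - col_min)
```

Because the minimum and maximum are fixed at fit time, values outside the training range fall outside [0, 1]; here the second row scales to 1.25 and -0.125.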


### 2.6 Prepare data for training

### 2.7  Creation of Key Functions

In [38]:
# Wrap the F2 score (recall weighted more than precision) as a scorer for model validation.
# pos_label=0 means that class 0 is treated as the positive class.
def f2_scorer(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return fbeta_score(y_true, y_pred, beta=2, pos_label=0, average='binary')

f2 = make_scorer(f2_scorer, greater_is_better=True)

# Likewise for the F0.5 score, which weights precision more than recall.
def f05_scorer(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return fbeta_score(y_true, y_pred, beta=0.5, pos_label=0, average='binary')

f05 = make_scorer(f05_scorer, greater_is_better=True)
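
The F-beta score combines precision `P` and recall `R` as `(1 + beta^2) * P * R / (beta^2 * P + R)`. The sketch below computes it by hand on hypothetical labels, with class 0 as the positive class as in the scorers above, and compares the result with `fbeta_score`. The helper `manual_fbeta` is an illustrative name.

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Hypothetical labels for illustration.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0, 0, 0])


def manual_fbeta(y_true, y_pred, beta, pos_label):
    """Compute F-beta from the counts of true/false positives and false negatives."""
    tp = np.sum((y_pred == pos_label) & (y_true == pos_label))
    fp = np.sum((y_pred == pos_label) & (y_true != pos_label))
    fn = np.sum((y_pred != pos_label) & (y_true == pos_label))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


f2_manual = manual_fbeta(y_true, y_pred, beta=2, pos_label=0)
f05_manual = manual_fbeta(y_true, y_pred, beta=0.5, pos_label=0)
f2_sklearn = fbeta_score(y_true, y_pred, beta=2, pos_label=0, average='binary')
f05_sklearn = fbeta_score(y_true, y_pred, beta=0.5, pos_label=0, average='binary')
```

In this example precision is 0.6 and recall is 0.75, so F2 (which favors recall) comes out higher than F0.5 (which favors precision).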

## Building the model with the best parameters

- Based on the detailed analysis, the model loaded below uses the parameters that worked best. They were taken from the SVM grid search and from manual analysis.
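
The grid search itself is not part of this notebook. As a rough sketch of what such a search could look like, the example below tunes an `SVC` with an F2 scorer on synthetic data. The parameter grid, the synthetic dataset, and the number of folds are all assumptions made for illustration and are not the values used by the author.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in with 11 features, mirroring the shape of the real data.
X, y = make_classification(n_samples=200, n_features=11, random_state=0)

# F2 scorer with class 0 as the positive class, as in the scorers defined earlier.
f2 = make_scorer(fbeta_score, beta=2, pos_label=0)

# Hypothetical search space.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

search = GridSearchCV(SVC(), param_grid, scoring=f2, cv=5)
search.fit(X, y)

best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

A model found this way could then be pickled and loaded for prediction in the same manner as `dash_svm.pkl`.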

In [39]:

with open('dash_svm.pkl', 'rb') as f:  # 'rb': read the pickled model in binary mode
    svm_bestf2 = pickle.load(f)

#### Confusion matrix

In [40]:
svm_bestf2.predict(encoded_data)

array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0,
       1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
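
The cell above only produces predictions: the `HeartDisease` column was dropped earlier, so no confusion matrix is computed in this notebook. If the true labels are kept aside, a confusion matrix can be built as in the sketch below, which uses small hypothetical label arrays.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for illustration.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes, in the order given by labels.
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])

# With class 1 read as "positive", ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = matrix.ravel()
print(matrix)
```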

### End of the Notebook