# ML Zoomcamp - Midterm Project

## Dataset : UCI Heart Disease Data

For the ML Zoomcamp midterm project I chose a subset of the Heart Disease Data Set from the UCI Machine Learning Repository. It contains 14 patient attributes, which I'll use to predict whether a patient has heart disease (target values 1, 2, 3, 4) or not (target value 0).

The dataset is included in the project directory, or can be downloaded from Kaggle:

[https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data/download?datasetVersionNumber=6](https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data/download?datasetVersionNumber=6)

The feature names are:

0. **id**
1. **age**
2. **sex**
3. **dataset**: the Cleveland database is the only one used
4. **cp**: chest pain type
    - typical angina
    - atypical angina
    - non-anginal pain
    - asymptomatic
5. **trestbps**: resting blood pressure (in mm Hg on admission to the hospital)
6. **chol**: serum cholesterol in mg/dl
7. **fbs**: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
8. **restecg**: resting electrocardiographic results
    - normal
    - lv hypertrophy
    - st-t abnormality
9. **thalch**: maximum heart rate achieved
10. **exang**: exercise induced angina
11. **oldpeak**: ST depression induced by exercise relative to rest
12. **slope**: the slope of the peak exercise ST segment
    - upsloping
    - flat
    - downsloping
13. **ca**: number of major vessels (0-3) colored by fluoroscopy
14. **thal**:
    - normal
    - fixed defect
    - reversable defect
15. **num**: the predicted target
    - normal = 0
    - heart disease = 1,2,3,4

### Data preparation and exploratory data analysis

In [36]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

After downloading the dataset, I load it with pandas
to inspect the features and check for missing values.

In [80]:
df = pd.read_csv("../heart_disease_uci.csv")
df.head()

|   | id | age | sex | dataset | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 63 | Male | Cleveland | typical angina | 145.0 | 233.0 | True | lv hypertrophy | 150.0 | False | 2.3 | downsloping | 0.0 | fixed defect | 0 |
| 1 | 2 | 67 | Male | Cleveland | asymptomatic | 160.0 | 286.0 | False | lv hypertrophy | 108.0 | True | 1.5 | flat | 3.0 | normal | 2 |
| 2 | 3 | 67 | Male | Cleveland | asymptomatic | 120.0 | 229.0 | False | lv hypertrophy | 129.0 | True | 2.6 | flat | 2.0 | reversable defect | 1 |
| 3 | 4 | 37 | Male | Cleveland | non-anginal | 130.0 | 250.0 | False | normal | 187.0 | False | 3.5 | downsloping | 0.0 | normal | 0 |
| 4 | 5 | 41 | Female | Cleveland | atypical angina | 130.0 | 204.0 | False | lv hypertrophy | 172.0 | False | 1.4 | upsloping | 0.0 | normal | 0 |


In [81]:
df.shape

(920, 16)

In [82]:
df.dtypes

id            int64
age           int64
sex          object
dataset      object
cp           object
trestbps    float64
chol        float64
fbs          object
restecg      object
thalch      float64
exang        object
oldpeak     float64
slope        object
ca          float64
thal         object
num           int64
dtype: object

Column names are all lower case and contain no spaces.

Checking for missing values:

In [83]:
df.isnull().sum()

id            0
age           0
sex           0
dataset       0
cp            0
trestbps     59
chol         30
fbs          90
restecg       2
thalch       55
exang        55
oldpeak      62
slope       309
ca          611
thal        486
num           0
dtype: int64

There are missing values in quite a few columns. For the columns with fewer than 50% missing values
(the highest is `slope`, at about 34%), I'll fill the NaN values with the most frequent value in the column.

I don't think it makes sense to keep features where values are missing in more than 50% of the rows
(`ca` and `thal`), so I'll remove them from the dataframe.

In [84]:
columns = ["trestbps", "chol", "fbs", "restecg", "thalch", "exang", "oldpeak", "slope"]

for col in columns:
    # mode() returns a Series of the most frequent values; take the first one
    df[col] = df[col].fillna(df[col].mode()[0])

del df["ca"]
del df["thal"]

In [85]:
df.isnull().sum()

id          0
age         0
sex         0
dataset     0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalch      0
exang       0
oldpeak     0
slope       0
num         0
dtype: int64
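
The rule used above (drop columns that are mostly empty, fill the rest with the most frequent value) can also be derived from the data instead of a hard-coded column list. This is a minimal sketch on a made-up frame: `toy` and the `max_missing` cut-off of 0.5 are assumptions for illustration, not part of the project code.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "chol": [233.0, np.nan, 229.0, 250.0],
    "slope": ["flat", "flat", np.nan, "upsloping"],
    "ca": [np.nan, np.nan, np.nan, 0.0],
})

max_missing = 0.5  # assumed cut-off: drop columns missing in more than half the rows

# Fraction of missing values per column.
missing_share = toy.isnull().mean()

# Drop the columns above the cut-off.
to_drop = missing_share[missing_share > max_missing].index
toy = toy.drop(columns=to_drop)

# Fill the remaining gaps with each column's most frequent value.
for col in toy.columns:
    if toy[col].isnull().any():
        toy[col] = toy[col].fillna(toy[col].mode()[0])
```

Note that when several values tie for most frequent, `mode()` returns all of them in sorted order, so `[0]` picks the smallest.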

My objective is to use this dataset to train a binary classification model,
that is, to predict whether a patient has heart disease or not.
There are 5 unique values in the target: 0 indicates that the patient has no heart disease
and 1, 2, 3, 4 indicate the presence of heart disease at varying degrees of severity.
To convert the target into a binary column, all values that indicate the presence of disease are set to 1.

In [86]:
df.num.value_counts()

0    411
1    265
2    109
3    107
4     28
Name: num, dtype: int64

In [87]:
df.loc[df.num != 0, "num"] = 1
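
An equivalent way to binarize the target is a boolean comparison cast to integers, which only ever touches the target column. A small sketch on made-up severity codes (the `num` Series here is illustrative, not the real data):

```python
import pandas as pd

# Made-up severity codes: 0 = no disease, 1-4 = disease.
num = pd.Series([0, 2, 1, 0, 4, 3], name="num")

# True where disease is present, then cast True/False to 1/0.
disease = (num > 0).astype(int)
```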

I also rename the target column to something more descriptive.

In [101]:
df["disease"] = df.num
df.drop("num", axis=1, inplace=True)
df.disease.value_counts()

1    509
0    411
Name: disease, dtype: int64

The classes are reasonably balanced: about 55% of the patients have heart disease and 45% do not.
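
The class shares can be checked directly from the counts printed above. A short sketch that rebuilds the counts by hand (on the real dataframe, `df.disease.value_counts(normalize=True)` would give the same proportions):

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({1: 509, 0: 411}, name="disease")

# Share of each class in the full dataset.
shares = counts / counts.sum()
```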

## Setting up the validation framework

In [103]:
from sklearn.model_selection import train_test_split

In [104]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)
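
The two calls above produce a 60/20/20 split: the first holds out 20% for testing, and the second takes 25% of the remaining 80% (that is, 20% of the total) for validation. A sketch that checks the proportions on a made-up frame with the same number of rows (`toy` and the column `x` are illustrative only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same row count as the dataset, made-up values.
toy = pd.DataFrame({"x": range(920)})

# Same two-step split as in the project code.
full_train, test = train_test_split(toy, test_size=0.2, random_state=42)
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

# Size and share of each part.
sizes = {"train": len(train), "val": len(val), "test": len(test)}
shares = {name: n / len(toy) for name, n in sizes.items()}
```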

In [None]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [105]:
y_train = df_train.disease.values
y_val = df_val.disease.values
y_test = df_test.disease.values

In [108]:
del df_train['disease']
del df_val['disease']
del df_test['disease']
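
Extracting the target and removing it from the feature frame can also be done in one step with `DataFrame.pop`, which makes it harder to take the target from the wrong dataframe. A minimal sketch on a made-up frame (`df_part` and `y_part` are hypothetical names):

```python
import pandas as pd

# Made-up rows standing in for one of the splits.
df_part = pd.DataFrame({"age": [63, 67, 37], "disease": [0, 1, 0]})

# pop() returns the column and removes it from the frame in place.
y_part = df_part.pop("disease").values
```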