### Project Proposal - Heart Disease dataset

#### Introduction

Cardiovascular disease (also known as heart disease) is a class of diseases that affect the heart or blood vessels. Patients with this disease have compromised circulatory systems; contributing risk factors include age, sex, and cholesterol levels.

In this project, our goal is to identify which chest pain type is most likely to indicate heart disease. For this classification, we have chosen to use the Heart Disease dataset from the UCI Machine Learning Repository (UCI). The dataset contains many variables that may affect how strongly a chest pain type indicates heart disease. For our data analysis, we will focus on the following variables:

- Age - age in years
- Sex - 1 = male, 0 = female
- Chest pain type (cp) - 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
- Fasting blood sugar (fbs) - whether fasting blood sugar exceeds 120 mg/dL (1 = true, 0 = false)
- Resting ECG - resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- Maximum heart rate achieved - measured in beats per min (BPM)
- Exercise induced angina - 1 = yes; 0 = no
- Resting heart rate (thalrest) - resting heart rate in beats per min (BPM)
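
As a rough illustration of the stated goal, the sketch below maps the chest pain codes listed above to readable labels and computes the share of patients with heart disease for each type. The `toy` DataFrame and the names `cp`, `has_disease`, and `CHEST_PAIN_LABELS` are hypothetical and exist only for this example; the values are not taken from the real dataset.

```python
import pandas as pd

# Code-to-label mapping for chest pain type (codes 1 to 4 from the variable list).
CHEST_PAIN_LABELS = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}

# Hypothetical toy records, NOT taken from the real dataset.
toy = pd.DataFrame(
    {
        "cp": [1, 2, 3, 4, 4, 4, 3, 2],
        "has_disease": [False, False, False, True, True, False, True, False],
    }
)

# Replace the numeric codes with readable labels.
toy["cp_label"] = toy["cp"].map(CHEST_PAIN_LABELS)

# The mean of a boolean column is the fraction of True values in each group.
disease_rate = (
    toy.groupby("cp_label")["has_disease"].mean().sort_values(ascending=False)
)
print(disease_rate)
```

On the real data, the same `groupby` would run on the chest pain and diagnosis columns once the diagnosis has been reduced to presence or absence of disease.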

In [3]:
# Packages needed for classification on our dataset
import random

import altair as alt
import pandas as pd
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.metrics.pairwise import euclidean_distances

### Data analysis

In [29]:
# Loading the dataset using the URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

heart_disease = pd.read_csv(url, header=None)

heart_disease.columns = ("Age",
           "Sex",
           "Chest_Pain_Type",
           "Resting_Blood_Pressure",
           "Serum_Cholesterol",
           "Fasting_Blood_Sugar",
           "Resting_ECG",
           "Max_Heart_Rate",
           "Exercise_Induced_Angina",
           "ST_Depression_Exercise",
           "Peak_Exercise_ST_Segment",
           "Num_Major_Vessels_Flouro",
           "Thalassemia",
           "Diagnosis") # assigning human-readable column headings

heart_disease


Unnamed: 0,Age,Sex,Chest_Pain_Type,Resting_Blood_Pressure,Serum_Cholesterol,Fasting_Blood_Sugar,Resting_ECG,Max_Heart_Rate,Exercise_Induced_Angina,ST_Depression_Exercise,Peak_Exercise_ST_Segment,Num_Major_Vessels_Flouro,Thalassemia,Diagnosis
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3
301,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1


In [30]:
# Removing NAs; dropna() returns a new DataFrame, so the result must be reassigned
heart_disease = heart_disease.dropna()


# Age is measured in whole years, so we store it as an integer
heart_disease.Age = heart_disease.Age.astype('int64')

# Sex is coded as 1 (male) or 0 (female), so we convert it to bool (True = male)
heart_disease.Sex = heart_disease.Sex.astype(bool)

# Diagnosis is 0 for no heart disease and 1-4 for disease present;
# converting to bool collapses it to presence (True) or absence (False)
heart_disease.Diagnosis = heart_disease.Diagnosis.astype(bool)

# Exercise-induced angina is coded as 0 or 1, so we convert it to bool
heart_disease.Exercise_Induced_Angina = heart_disease.Exercise_Induced_Angina.astype(bool)

# Chest pain type can only be one of 1, 2, 3, or 4, so we store it as int64
heart_disease.Chest_Pain_Type = heart_disease.Chest_Pain_Type.astype('int64')

heart_disease.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       303 non-null    int64  
 1   Sex                       303 non-null    bool   
 2   Chest_Pain_Type           303 non-null    int64  
 3   Resting_Blood_Pressure    303 non-null    float64
 4   Serum_Cholesterol         303 non-null    float64
 5   Fasting_Blood_Sugar       303 non-null    float64
 6   Resting_ECG               303 non-null    float64
 7   Max_Heart_Rate            303 non-null    float64
 8   Exercise_Induced_Angina   303 non-null    bool   
 9   ST_Depression_Exercise    303 non-null    float64
 10  Peak_Exercise_ST_Segment  303 non-null    float64
 11  Num_Major_Vessels_Flouro  303 non-null    object 
 12  Thalassemia               303 non-null    object 
 13  Diagnosis                 303 non-null    bool   
dtypes: bool(3), float64(7), int64(2), object(2)
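
The `object` dtype reported for `Num_Major_Vessels_Flouro` and `Thalassemia` suggests that these columns contain non-numeric tokens, which `dropna()` will not remove. A minimal sketch of one way to handle this follows, assuming missing values in the file are encoded as `?`. It reads from an in-memory sample rather than the URL: the first row is copied from the preview above, and the second row is hypothetical, added only to show a `?` placeholder.

```python
import io

import pandas as pd

columns = [
    "Age", "Sex", "Chest_Pain_Type", "Resting_Blood_Pressure",
    "Serum_Cholesterol", "Fasting_Blood_Sugar", "Resting_ECG",
    "Max_Heart_Rate", "Exercise_Induced_Angina", "ST_Depression_Exercise",
    "Peak_Exercise_ST_Segment", "Num_Major_Vessels_Flouro", "Thalassemia",
    "Diagnosis",
]

# Illustrative sample in the same comma-separated layout as the data file.
# Row 1 comes from the preview above; row 2 is hypothetical.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0\n"
)

# na_values="?" tells pandas to parse "?" as NaN, so the affected
# columns are read as float64 instead of object.
df = pd.read_csv(sample, header=None, names=columns, na_values="?")

# With real NaN values in place, dropna() now removes incomplete rows.
cleaned = df.dropna().reset_index(drop=True)
print(df.isna().sum().sum(), len(cleaned))
```

With the real file, the same `na_values="?"` argument could be passed to the `pd.read_csv(url, header=None)` call, after which the row count reported by `info()` would be expected to drop if any placeholders are present.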
