## Exploratory Data Analysis - 1 (EDA)

In [77]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

Note: Data is from the UCI Machine Learning Repository:

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

In [78]:
# data: https://archive.ics.uci.edu/ml/datasets/heart+disease
heart = pd.read_csv('processed.cleveland.data.csv')

### Data Description
- **age:** age in years
- **sex:** 1=male, 0=female
- **cp:** chest pain type
 - **Value 1:** typical angina
 - **Value 2:** atypical angina
 - **Value 3:** non-anginal pain
 - **Value 4:** asymptomatic
- **trestbps:** resting blood pressure (in mm Hg on admission to the hospital)
- **chol:** serum cholesterol in mg/dl
- **fbs:** (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- **restecg:** resting electrocardiographic results
 - **Value 0:** normal
 - **Value 1:** having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) 
 - **Value 2:** showing probable or definite left ventricular hypertrophy by Estes' criteria
- **thalach:** maximum heart rate achieved in an exercise test
- **exang:** exercise induced angina (1 = yes; 0 = no)
- **oldpeak:** ST depression induced by exercise relative to rest
- **slope:** the slope of the peak exercise ST segment
 - **Value 1:** upsloping
 - **Value 2:** flat
 - **Value 3:** downsloping
- **ca:** number of major vessels (0-3) colored by fluoroscopy
- **thal:** 
 - **Value 3:** normal
 - **Value 6:** fixed defect
 - **Value 7:** reversable defect
- **heart_disease:** diagnosis of heart disease (angiographic disease status)
 - **Value 0:** < 50% diameter narrowing
 - **Value 1:** > 50% diameter narrowing
 
"\[This field\] refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0)."


In [79]:
heart.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,heart_disease
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0


In [80]:
heart.describe(include="all").T

,count,unique,top,freq,mean,std,min,25%,50%,75%,max
age,303.0,,,,54.438944,9.038662,29.0,48.0,56.0,61.0,77.0
sex,303.0,,,,0.679868,0.467299,0.0,0.0,1.0,1.0,1.0
cp,303.0,,,,3.158416,0.960126,1.0,3.0,3.0,4.0,4.0
trestbps,303.0,,,,131.689769,17.599748,94.0,120.0,130.0,140.0,200.0
chol,303.0,,,,246.693069,51.776918,126.0,211.0,241.0,275.0,564.0
fbs,303.0,,,,0.148515,0.356198,0.0,0.0,0.0,0.0,1.0
restecg,303.0,,,,0.990099,0.994971,0.0,0.0,1.0,2.0,2.0
thalach,303.0,,,,149.607261,22.875003,71.0,133.5,153.0,166.0,202.0
exang,303.0,,,,0.326733,0.469794,0.0,0.0,0.0,1.0,1.0
oldpeak,303.0,,,,1.039604,1.161075,0.0,0.0,0.8,1.6,6.2


In [81]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            303 non-null    float64
 1   sex            303 non-null    float64
 2   cp             303 non-null    float64
 3   trestbps       303 non-null    float64
 4   chol           303 non-null    float64
 5   fbs            303 non-null    float64
 6   restecg        303 non-null    float64
 7   thalach        303 non-null    float64
 8   exang          303 non-null    float64
 9   oldpeak        303 non-null    float64
 10  slope          303 non-null    float64
 11  ca             303 non-null    object 
 12  thal           303 non-null    object 
 13  heart_disease  303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB


### Inspect Object Type Columns

In [82]:
heart["ca"].unique()

array(['0.0', '3.0', '2.0', '1.0', '?'], dtype=object)

In [83]:
heart["thal"].unique()

array(['6.0', '3.0', '7.0', '?'], dtype=object)
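
The two manual checks above can also be run over every non-numeric column at once. The following is a minimal sketch on a made-up toy frame (the name `toy` and its values are illustrative, not taken from the dataset); it assumes the columns of interest are exactly those that pandas did not parse as numbers.

```python
import pandas as pd

# Toy frame standing in for `heart`; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [63.0, 67.0, 37.0],
    "ca": ["0.0", "3.0", "?"],
    "thal": ["6.0", "?", "3.0"],
})

# Columns that were not parsed as numbers are the ones worth inspecting by hand.
non_numeric_cols = toy.select_dtypes(exclude="number").columns

# Collect the distinct values of each such column so stray tokens like "?" stand out.
distinct_values = {col: sorted(toy[col].unique()) for col in non_numeric_cols}
print(distinct_values)
```

On the real data, the same two lines applied to `heart` should surface the `'?'` placeholder in both `ca` and `thal`.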

In [84]:
heart.replace("?", np.nan, inplace=True)

In [85]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            303 non-null    float64
 1   sex            303 non-null    float64
 2   cp             303 non-null    float64
 3   trestbps       303 non-null    float64
 4   chol           303 non-null    float64
 5   fbs            303 non-null    float64
 6   restecg        303 non-null    float64
 7   thalach        303 non-null    float64
 8   exang          303 non-null    float64
 9   oldpeak        303 non-null    float64
 10  slope          303 non-null    float64
 11  ca             299 non-null    object 
 12  thal           301 non-null    object 
 13  heart_disease  303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB


In [86]:
heart["ca"] = heart["ca"].astype("float")

In [87]:
heart["thal"] = heart["thal"].astype("float")

In [88]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            303 non-null    float64
 1   sex            303 non-null    float64
 2   cp             303 non-null    float64
 3   trestbps       303 non-null    float64
 4   chol           303 non-null    float64
 5   fbs            303 non-null    float64
 6   restecg        303 non-null    float64
 7   thalach        303 non-null    float64
 8   exang          303 non-null    float64
 9   oldpeak        303 non-null    float64
 10  slope          303 non-null    float64
 11  ca             299 non-null    float64
 12  thal           301 non-null    float64
 13  heart_disease  303 non-null    int64  
dtypes: float64(13), int64(1)
memory usage: 33.3 KB
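
The replace-then-cast sequence above can be collapsed into a single step with `pd.to_numeric(..., errors="coerce")`. This is a sketch on toy data, not the notebook's own method; note that coercion turns every unparseable token into NaN, not only `"?"`, so it is best used after the unique values have been inspected. (`pd.read_csv` also accepts an `na_values` argument that can handle the placeholder at load time.)

```python
import pandas as pd

# Toy columns mimicking the string-typed `ca` and `thal` columns.
toy = pd.DataFrame({"ca": ["0.0", "3.0", "?"], "thal": ["6.0", "?", "3.0"]})

# errors="coerce" converts anything that cannot be parsed (here "?") to NaN
# and returns float columns, so no separate replace/astype steps are needed.
cleaned = toy.apply(pd.to_numeric, errors="coerce")

print(cleaned.dtypes)
print(cleaned.isna().sum())
```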


### Inspect Null Values

In [124]:
heart[heart.isnull().any(axis=1)]

,age,sex,cp,trestbps,chol,fbs,thalach,exang,oldpeak,slope,ca,thal,heart_disease,restecg_normal
87,53.0,female,non-anginal pain,128.0,216.0,0.0,115.0,0.0,0.0,upsloping,0.0,,absence,0
166,52.0,male,non-anginal pain,138.0,223.0,0.0,169.0,0.0,0.0,upsloping,,normal,absence,1
192,43.0,male,asymptomatic,132.0,247.0,1.0,143.0,1.0,0.1,flat,,reversable defect,presence,0
266,52.0,male,asymptomatic,128.0,204.0,1.0,156.0,1.0,1.0,flat,0.0,,presence,1
287,58.0,male,atypical angina,125.0,220.0,0.0,144.0,0.0,0.4,flat,,reversable defect,absence,1
302,38.0,male,non-anginal pain,138.0,175.0,0.0,173.0,0.0,0.0,upsloping,,normal,absence,1


In [122]:
# Looking at this output, we note that there is no overlap between the rows with missing ca data and missing thal data.
# This suggests that these patients are missing ca and thal information for different reasons.
# We don’t see any immediate clues as to why the data is missing in the first place,
# but we can inspect this further once we start digging into individual features.
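
The "no overlap" observation can be checked programmatically rather than by eye. The sketch below uses a small made-up frame (`toy`) with the same missingness pattern; on the real data the same expressions would be applied to `heart["ca"]` and `heart["thal"]`.

```python
import numpy as np
import pandas as pd

# Toy data: some rows lack `ca`, others lack `thal`, none lack both.
toy = pd.DataFrame({
    "ca":   [0.0, np.nan, np.nan, 0.0, 2.0],
    "thal": [np.nan, 3.0, 7.0, np.nan, 6.0],
})

missing_ca = toy["ca"].isna()
missing_thal = toy["thal"].isna()

# Number of rows where both values are missing; zero means no overlap.
both_missing = int((missing_ca & missing_thal).sum())

# 2x2 table of missingness: rows are "ca missing?", columns are "thal missing?".
overlap_table = pd.crosstab(missing_ca, missing_thal)

print(both_missing)
print(overlap_table)
```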

### sex Column Replace
- **sex:** 1=male, 0=female

In [125]:
# Map the numeric codes to labels and assign the result back to the column.
heart["sex"] = heart["sex"].replace({1.0: "male", 0.0: "female"})

In [126]:
heart["sex"]

0        male
1        male
2        male
3        male
4      female
        ...  
298      male
299      male
300      male
301    female
302      male
Name: sex, Length: 303, dtype: object
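
One thing to keep in mind with `replace` is that codes missing from the mapping are left untouched, which can hide data-entry errors. `Series.map` behaves differently: unmapped codes become NaN and are therefore easy to count. The sketch below illustrates the difference on a made-up series; the code `9.0` is a hypothetical invalid value and does not occur in the dataset.

```python
import pandas as pd

# Toy codes; 9.0 is a made-up invalid code used only to show the difference.
sex_codes = pd.Series([1.0, 0.0, 1.0, 9.0])
labels = {1.0: "male", 0.0: "female"}

replaced = sex_codes.replace(labels)  # the unmapped 9.0 is kept as-is
mapped = sex_codes.map(labels)        # the unmapped 9.0 becomes NaN

# Counting NaN after map() reveals how many codes had no label.
unmapped_count = int(mapped.isna().sum())

print(replaced.tolist())
print(unmapped_count)
```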

### cp Column Replace
- **cp:** chest pain type
 - **Value 1:** typical angina
 - **Value 2:** atypical angina
 - **Value 3:** non-anginal pain
 - **Value 4:** asymptomatic

In [89]:
# Map the numeric chest pain codes to descriptive labels.
heart["cp"] = heart["cp"].replace(
    {1.0: "typical angina", 2.0: "atypical angina", 3.0: "non-anginal pain", 4.0: "asymptomatic"}
)

In [90]:
heart["cp"]

0        typical angina
1          asymptomatic
2          asymptomatic
3      non-anginal pain
4       atypical angina
             ...       
298      typical angina
299        asymptomatic
300        asymptomatic
301     atypical angina
302    non-anginal pain
Name: cp, Length: 303, dtype: object

### slope Column Replace
- **slope:** the slope of the peak exercise ST segment
 - **Value 1:** upsloping
 - **Value 2:** flat
 - **Value 3:** downsloping

In [91]:
heart["slope"]

0      3.0
1      2.0
2      2.0
3      3.0
4      1.0
      ... 
298    2.0
299    2.0
300    2.0
301    2.0
302    1.0
Name: slope, Length: 303, dtype: float64

In [92]:
# Map the numeric slope codes to labels and assign the result back to the column.
heart["slope"] = heart["slope"].replace({1.0: "upsloping", 2.0: "flat", 3.0: "downsloping"})

In [93]:
heart["slope"] = pd.Categorical(heart["slope"], ["upsloping", "flat", "downsloping"], ordered=True)

In [94]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   age            303 non-null    float64 
 1   sex            303 non-null    float64 
 2   cp             303 non-null    object  
 3   trestbps       303 non-null    float64 
 4   chol           303 non-null    float64 
 5   fbs            303 non-null    float64 
 6   restecg        303 non-null    float64 
 7   thalach        303 non-null    float64 
 8   exang          303 non-null    float64 
 9   oldpeak        303 non-null    float64 
 10  slope          303 non-null    category
 11  ca             299 non-null    float64 
 12  thal           301 non-null    float64 
 13  heart_disease  303 non-null    int64   
dtypes: category(1), float64(11), int64(1), object(1)
memory usage: 31.3+ KB


In [95]:
heart.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,heart_disease
0,63.0,1.0,typical angina,145.0,233.0,1.0,2.0,150.0,0.0,2.3,downsloping,0.0,6.0,0
1,67.0,1.0,asymptomatic,160.0,286.0,0.0,2.0,108.0,1.0,1.5,flat,3.0,3.0,2
2,67.0,1.0,asymptomatic,120.0,229.0,0.0,2.0,129.0,1.0,2.6,flat,2.0,7.0,1
3,37.0,1.0,non-anginal pain,130.0,250.0,0.0,0.0,187.0,0.0,3.5,downsloping,0.0,3.0,0
4,41.0,0.0,atypical angina,130.0,204.0,0.0,2.0,172.0,0.0,1.4,upsloping,0.0,3.0,0


In [96]:
heart["slope"].cat.codes

0      2
1      1
2      1
3      2
4      0
      ..
298    1
299    1
300    1
301    1
302    0
Length: 303, dtype: int8
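
Declaring the category order is what makes the integer codes above meaningful: each code is the position of the label in the category list. A small sketch on made-up values shows what an ordered categorical allows, namely comparisons and sorting that follow the declared order rather than alphabetical order.

```python
import pandas as pd

# Toy series using the same category order as the notebook.
slope = pd.Series(pd.Categorical(
    ["downsloping", "flat", "upsloping", "flat"],
    categories=["upsloping", "flat", "downsloping"],
    ordered=True,
))

# Each code is the index of the value within `categories`.
codes = slope.cat.codes.tolist()

# Comparisons respect the declared order, not alphabetical order.
at_least_flat = (slope >= "flat").tolist()

# Sorting also follows the declared order.
sorted_values = slope.sort_values().tolist()

print(codes, at_least_flat, sorted_values)
```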

### restecg Column Replace
- **restecg:** resting electrocardiographic results
 - **Value 0:** normal
 - **Value 1:** having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) 
 - **Value 2:** showing probable or definite left ventricular hypertrophy by Estes' criteria

In [99]:
heart["restecg"]

0      2.0
1      2.0
2      2.0
3      0.0
4      2.0
      ... 
298    0.0
299    0.0
300    0.0
301    2.0
302    0.0
Name: restecg, Length: 303, dtype: float64

In [100]:
heart["restecg"] = heart["restecg"].replace({0.0: "normal", 1.0: "ST-T wave abnormality", 2.0: "left ventricular hypertrophy"})

In [101]:
heart["restecg"]

0      left ventricular hypertrophy
1      left ventricular hypertrophy
2      left ventricular hypertrophy
3                            normal
4      left ventricular hypertrophy
                   ...             
298                          normal
299                          normal
300                          normal
301    left ventricular hypertrophy
302                          normal
Name: restecg, Length: 303, dtype: object

### restecg Column Get Dummies 

In [102]:
heart = pd.get_dummies(data=heart, columns=["restecg"], drop_first=True)

In [107]:
heart.head(3)

,age,sex,cp,trestbps,chol,fbs,thalach,exang,oldpeak,slope,ca,thal,heart_disease,restecg_normal
0,63.0,male,typical angina,145.0,233.0,1.0,150.0,0.0,2.3,downsloping,0.0,6.0,0,0
1,67.0,male,asymptomatic,160.0,286.0,0.0,108.0,1.0,1.5,flat,3.0,3.0,2,0
2,67.0,male,asymptomatic,120.0,229.0,0.0,129.0,1.0,2.6,flat,2.0,7.0,1,0


In [105]:
heart.drop(columns="restecg_left ventricular hypertrophy", inplace=True)

In [106]:
heart.head(3)

,age,sex,cp,trestbps,chol,fbs,thalach,exang,oldpeak,slope,ca,thal,heart_disease,restecg_normal
0,63.0,male,typical angina,145.0,233.0,1.0,150.0,0.0,2.3,downsloping,0.0,6.0,0,0
1,67.0,male,asymptomatic,160.0,286.0,0.0,108.0,1.0,1.5,flat,3.0,3.0,2,0
2,67.0,male,asymptomatic,120.0,229.0,0.0,129.0,1.0,2.6,flat,2.0,7.0,1,0
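
It may help to see which column `drop_first=True` removes. `get_dummies` orders the categories by sorted label, and uppercase letters sort before lowercase ones, so the first (dropped) category here is the ST-T label. The sketch below demonstrates this on a made-up four-row frame; `dtype=int` is passed as an assumption so that the dummies print as 0/1, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy frame with the three restecg labels.
toy = pd.DataFrame({"restecg": [
    "normal",
    "ST-T wave abnormality",
    "left ventricular hypertrophy",
    "normal",
]})

# Full one-hot encoding versus the version with the first category dropped.
all_dummies = pd.get_dummies(toy, columns=["restecg"], dtype=int)
reduced = pd.get_dummies(toy, columns=["restecg"], drop_first=True, dtype=int)

# The column that drop_first removed.
dropped = sorted(set(all_dummies.columns) - set(reduced.columns))

print(dropped)
print(list(reduced.columns))
```

Dropping the hypertrophy column afterwards, as done above, leaves a single `restecg_normal` indicator, which merges the two abnormal categories into one "not normal" group.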


### thal Column Replace
- **thal:** 
 - **Value 3:** normal
 - **Value 6:** fixed defect
 - **Value 7:** reversable defect

In [111]:
heart["thal"]

0      6.0
1      3.0
2      7.0
3      3.0
4      3.0
      ... 
298    7.0
299    7.0
300    7.0
301    3.0
302    3.0
Name: thal, Length: 303, dtype: float64

In [113]:
# Map the numeric thal codes to labels; missing values stay NaN.
heart["thal"] = heart["thal"].replace({3.0: "normal", 6.0: "fixed defect", 7.0: "reversable defect"})

In [114]:
heart["thal"].unique()

array(['fixed defect', 'normal', 'reversable defect', nan], dtype=object)
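
The `nan` in the output above shows that the missing values survive the relabeling. To see how many rows fall into each label, including the missing ones, `value_counts(dropna=False)` can be used. This is a sketch on made-up values, not the actual column.

```python
import numpy as np
import pandas as pd

# Toy thal codes with one missing value.
thal = pd.Series([6.0, 3.0, 7.0, np.nan, 3.0])
labels = {3.0: "normal", 6.0: "fixed defect", 7.0: "reversable defect"}

thal_labeled = thal.replace(labels)

# dropna=False keeps NaN as its own row in the counts.
counts = thal_labeled.value_counts(dropna=False)
n_missing = int(thal_labeled.isna().sum())

print(counts)
print(n_missing)
```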

### heart_disease Column
- **heart_disease:** diagnosis of heart disease (angiographic disease status)
 - **Value 0:** < 50% diameter narrowing
 - **Value 1:** > 50% diameter narrowing

"\[This field\] refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0)."

In [115]:
heart["heart_disease"]

0      0
1      2
2      1
3      0
4      0
      ..
298    1
299    2
300    3
301    1
302    0
Name: heart_disease, Length: 303, dtype: int64

In [120]:
heart["heart_disease"] = np.where(heart["heart_disease"] == 0, "absence", "presence")

In [121]:
heart["heart_disease"]

0       absence
1      presence
2      presence
3       absence
4       absence
         ...   
298    presence
299    presence
300    presence
301    presence
302     absence
Name: heart_disease, Length: 303, dtype: object
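
After collapsing values 1 to 4 into a single "presence" label, a cross-tabulation of the original codes against the new labels is a quick sanity check that each code landed in exactly one group. The sketch below uses made-up diagnosis codes; the series names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy diagnosis codes in the original 0-4 range.
diagnosis = pd.Series([0, 2, 1, 0, 3, 4], name="heart_disease")

# Same binarization as in the notebook: 0 is absence, anything else is presence.
status = pd.Series(
    np.where(diagnosis == 0, "absence", "presence"),
    index=diagnosis.index,
    name="status",
)

# Rows are the original codes, columns are the new labels.
summary = pd.crosstab(diagnosis, status)

# Every original code should map to exactly one label.
one_label_per_code = bool((summary > 0).sum(axis=1).eq(1).all())

print(summary)
print(one_label_per_code)
```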