# UCI Heart Disease Dataset

In this notebook, we read the data from the raw `.data` file and save it as a CSV file. Along the way, we also learn about the attributes in this dataset.

This notebook uses the Cleveland dataset, and the later notebooks of this lab use the same data.

In [1]:
# Importing the required libraries

import pandas as pd

## 1. Reading the Raw data

In [2]:
# Read the raw comma-separated `.data` file (it has no header row)

df = pd.read_csv('./raw_data/processed.cleveland.data', sep=',', header=None)
df.head()

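|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |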


The column names are not present in the file, so we refer to the dataset documentation for this information.
The column details are:
* `age` : age of the patient in years
* `gender` : gender of the patient
    * 1 : male
    * 0 : female
* `cp` : chest pain type
    * Value 1: typical angina
    * Value 2: atypical angina
    * Value 3: non-anginal pain
    * Value 4: asymptomatic
* `trestbps` : resting blood pressure (in `mm Hg` on admission to the hospital)
* `chol` : serum cholesterol in `mg/dl`
* `fbs` : whether fasting blood sugar is > 120 mg/dl
    * 1 : true
    * 0 : false
* `restecg` : resting electrocardiographic results
    * Value 0: normal
    * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * Value 2: showing probable or definite left ventricular hypertrophy
* `thalach` : maximum heart rate achieved
* `exang` : exercise induced angina
    * 1 : yes
    * 0 : no
* `oldpeak` : ST depression induced by exercise relative to rest
* `slope` : the slope of the peak exercise ST segment
    * Value 1: upsloping
    * Value 2: flat
    * Value 3: downsloping
* `ca` : number of major vessels (`0-3`) colored by fluoroscopy. Coronary arteries are the blood vessels of the coronary circulation, which transport oxygenated blood to the heart muscle itself.
* `thal` : exercise thallium heart scan (exercise thallium scintigraphy)
    * 3 : normal
    * 6 : fixed defect
    * 7 : reversible defect
* `num` : diagnosis of heart disease (angiographic disease status)
    * Value 0: < 50% diameter narrowing
    * Value 1: > 50% diameter narrowing
    * In the processed Cleveland file this field takes integer values from 0 to 4, where 0 means no disease and values 1 to 4 indicate the presence of disease.
In [3]:
# Assign names to the columns of the dataset

df.columns = ['age', 'gender', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']

In [5]:
df.head()

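|  | age | gender | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |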


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    float64
 1   gender    303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    float64
 7   thalach   303 non-null    float64
 8   exang     303 non-null    float64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    float64
 11  ca        303 non-null    object 
 12  thal      303 non-null    object 
 13  num       303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB


## 2. Pre-Processing

Notice that the fields `ca` and `thal` are read as `object` rather than as numbers. `df.info()` reports no null values, so this means that some entries in these columns are not numeric. Let us take steps to process them.
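
A column is typically read with a non-numeric dtype when at least one entry cannot be parsed as a number. One way to locate such entries is `pd.to_numeric` with `errors='coerce'`. The sketch below uses a hand-made series; `ca_sample`, `as_numbers`, and `bad_entries` are hypothetical names.

```python
import pandas as pd

# Hand-made column mixing numeric strings with a '?' placeholder
ca_sample = pd.Series(['0.0', '3.0', '?', '1.0'])
print(ca_sample.dtype)  # a non-numeric dtype, because the values are strings

# errors='coerce' converts what it can and replaces the rest with NaN
as_numbers = pd.to_numeric(ca_sample, errors='coerce')

# Entries that failed to convert are the ones blocking a numeric dtype
bad_entries = ca_sample[as_numbers.isna()]
print(bad_entries.tolist())
```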

### 2.1 Null Values in `ca` 
The number of major vessels colored by fluoroscopy.

In [12]:
# To get all the unique values of the `ca` column

df['ca'].unique()

array(['0.0', '3.0', '2.0', '1.0', '?'], dtype=object)

Notice that there is a question mark `?` indicating missing values.
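
An alternative that this notebook does not use is to declare the marker when reading the file, so that pandas parses `?` as a missing value and the column stays numeric. The sketch below is illustrative only: it reads two rows copied from this notebook out of an in-memory string rather than the real file, and `raw_text`, `column_names`, and `sample` are hypothetical names.

```python
import io
import pandas as pd

# Two rows copied from the outputs in this notebook; the second has '?' in `ca`
raw_text = (
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0\n"
)

column_names = ['age', 'gender', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']

# names= assigns the headers at read time; na_values= turns '?' into NaN
sample = pd.read_csv(io.StringIO(raw_text), header=None,
                     names=column_names, na_values='?')

print(sample['ca'].dtype)         # numeric, because '?' is no longer a string
print(sample['ca'].isna().sum())  # number of missing entries in `ca`
```

With this approach the missing entries could then be removed with `dropna` instead of comparing against the string `'?'`.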

In [9]:
# The rows where we have '?' as the data

df[df['ca'] == '?']

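|  | age | gender | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 166 | 52.0 | 1.0 | 3.0 | 138.0 | 223.0 | 0.0 | 0.0 | 169.0 | 0.0 | 0.0 | 1.0 | ? | 3.0 | 0 |
| 192 | 43.0 | 1.0 | 4.0 | 132.0 | 247.0 | 1.0 | 2.0 | 143.0 | 1.0 | 0.1 | 2.0 | ? | 7.0 | 1 |
| 287 | 58.0 | 1.0 | 2.0 | 125.0 | 220.0 | 0.0 | 0.0 | 144.0 | 0.0 | 0.4 | 2.0 | ? | 7.0 | 0 |
| 302 | 38.0 | 1.0 | 3.0 | 138.0 | 175.0 | 0.0 | 0.0 | 173.0 | 0.0 | 0.0 | 1.0 | ? | 3.0 | 0 |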


As there are only a few records, we will remove these.

In [14]:
# Get the indices of these instances

indices = df[df['ca'] == '?'].index

In [15]:
# Remove these rows from the dataset

df.drop(indices, axis=0, inplace=True)
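
The same rows can also be removed with a boolean mask instead of collecting index labels first. The following sketch compares both approaches on a hand-made frame; `sample`, `dropped`, and `filtered` are hypothetical names.

```python
import pandas as pd

# Hand-made frame with two '?' entries in `ca`
sample = pd.DataFrame({
    'age': [52.0, 43.0, 63.0],
    'ca': ['?', '?', '0.0'],
})

# Approach used above: collect the index labels, then drop them
indices = sample[sample['ca'] == '?'].index
dropped = sample.drop(indices, axis=0)

# Alternative: keep only the rows whose `ca` is not '?'
filtered = sample[sample['ca'] != '?']

print(dropped.equals(filtered))
print(len(sample), len(filtered))
```

Both versions return a new frame and keep the original index labels, which is why the index of the cleaned dataset has gaps.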

In [40]:
# Now convert the column `ca` to integer
# (via float, because the values are strings such as '0.0')

df = df.astype({'ca' : 'float64'}).astype({'ca' : 'int64'})
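
The values in `ca` are strings such as `'0.0'`, and a string with a decimal point is not a valid integer literal, so a cast to integer generally has to pass through float first. This is a minimal sketch on a hand-made series; `ca_text` and `ca_int` are hypothetical names, and the exact error raised by the direct cast may differ between pandas versions.

```python
import pandas as pd

# Hand-made series with the same string values as the cleaned `ca` column
ca_text = pd.Series(['0.0', '3.0', '2.0', '1.0'])

# Casting the strings straight to int is expected to fail,
# because int('0.0') is not a valid integer literal in Python
try:
    ca_text.astype('int64')
except (ValueError, TypeError) as err:
    print('direct cast failed:', err)

# Casting to float first, then to int, handles the trailing '.0'
ca_int = ca_text.astype('float64').astype('int64')
print(ca_int.tolist())
```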

In [41]:
# Now we can notice that the 'ca' field is of integer datatype
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 297 entries, 0 to 301
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       297 non-null    float64
 1   gender    297 non-null    float64
 2   cp        297 non-null    float64
 3   trestbps  297 non-null    float64
 4   chol      297 non-null    float64
 5   fbs       297 non-null    float64
 6   restecg   297 non-null    float64
 7   thalach   297 non-null    float64
 8   exang     297 non-null    float64
 9   oldpeak   297 non-null    float64
 10  slope     297 non-null    float64
 11  ca        297 non-null    int64  
 12  thal      297 non-null    float64
 13  num       297 non-null    int64  
dtypes: float64(12), int64(2)
memory usage: 34.8 KB


### 2.2 Null Values in `thal`

In [28]:
# To get all the unique values of the `thal` column

df['thal'].unique()

array(['6.0', '3.0', '7.0', '?'], dtype=object)

Notice that there is a question mark `?` indicating missing values.

In [29]:
# The rows where we have '?' as the data

df[df['thal'] == '?']

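|  | age | gender | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 87 | 53.0 | 0.0 | 3.0 | 128.0 | 216.0 | 0.0 | 2.0 | 115.0 | 0.0 | 0.0 | 1.0 | 0.0 | ? | 0 |
| 266 | 52.0 | 1.0 | 4.0 | 128.0 | 204.0 | 1.0 | 0.0 | 156.0 | 1.0 | 1.0 | 2.0 | 0.0 | ? | 2 |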


As there are only a few records, we will remove these.

In [30]:
# Get the indices of these instances

indices = df[df['thal'] == '?'].index

In [31]:
# Remove these rows from the dataset

df.drop(indices, axis=0, inplace=True)

In [42]:
# Now convert the column `thal` to integer
# (via float, because the values are strings such as '6.0')

df = df.astype({'thal' : 'float64'}).astype({'thal' : 'int64'})

In [43]:
# Now we can notice that the 'thal' field is also of integer datatype
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 297 entries, 0 to 301
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       297 non-null    float64
 1   gender    297 non-null    float64
 2   cp        297 non-null    float64
 3   trestbps  297 non-null    float64
 4   chol      297 non-null    float64
 5   fbs       297 non-null    float64
 6   restecg   297 non-null    float64
 7   thalach   297 non-null    float64
 8   exang     297 non-null    float64
 9   oldpeak   297 non-null    float64
 10  slope     297 non-null    float64
 11  ca        297 non-null    int64  
 12  thal      297 non-null    int64  
 13  num       297 non-null    int64  
dtypes: float64(11), int64(3)
memory usage: 34.8 KB


### 2.3 Convert the remaining columns into appropriate data types

In [50]:
df = df.astype({
    'age' : 'int64',
    'gender' : 'int64',
    'cp' : 'int64',
    'trestbps' : 'int64',
    'chol' : 'int64',
    'fbs' : 'int64',
    'restecg' : 'int64',
    'thalach' : 'int64',
    'exang' : 'int64',
    'oldpeak' : 'float64',
    'slope' : 'int64',
    'ca' : 'int64',
    'thal' : 'int64',
    'num' : 'int64',
})
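
Casting a float column to `int64` truncates any fractional part, so it can be worth confirming that a column holds only whole numbers before casting it. The sketch below runs that check on a hand-made frame; `sample`, `is_whole`, `int_columns`, and `converted` are hypothetical names.

```python
import pandas as pd

# Hand-made frame with values taken from the first rows of the dataset
sample = pd.DataFrame({
    'age': [63.0, 67.0, 37.0],
    'chol': [233.0, 286.0, 250.0],
    'oldpeak': [2.3, 1.5, 3.5],
})

# A float column can be cast to int without loss only if every value is whole
is_whole = sample.apply(lambda col: (col % 1 == 0).all())
print(is_whole)

# Cast just the columns that passed the check
int_columns = is_whole[is_whole].index.tolist()
converted = sample.astype({col: 'int64' for col in int_columns})
print(converted.dtypes)
```

In this sample `oldpeak` fails the check, which matches the choice above to keep it as `float64`.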

In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 297 entries, 0 to 301
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       297 non-null    int64  
 1   gender    297 non-null    int64  
 2   cp        297 non-null    int64  
 3   trestbps  297 non-null    int64  
 4   chol      297 non-null    int64  
 5   fbs       297 non-null    int64  
 6   restecg   297 non-null    int64  
 7   thalach   297 non-null    int64  
 8   exang     297 non-null    int64  
 9   oldpeak   297 non-null    float64
 10  slope     297 non-null    int64  
 11  ca        297 non-null    int64  
 12  thal      297 non-null    int64  
 13  num       297 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 34.8 KB


## 3. Saving the Dataset

In [52]:
# We save this modified dataset into a CSV file for faster access

df.to_csv('cleveland_data.csv', index=False)
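
A quick way to confirm that a saved file reproduces the frame is to read it back and compare. The following sketch does this with a hand-made frame and a temporary directory, so it does not touch `cleveland_data.csv`; `sample`, `path`, and `reloaded` are hypothetical names.

```python
import os
import tempfile
import pandas as pd

# Hand-made frame with integer and float columns, like the cleaned dataset
sample = pd.DataFrame({
    'age': [63, 67, 37],
    'oldpeak': [2.3, 1.5, 3.5],
    'num': [0, 2, 0],
})

# Write to a temporary directory so no real file is overwritten
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.csv')
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises AssertionError if values, dtypes, or column names differ
pd.testing.assert_frame_equal(sample, reloaded)
print(reloaded.dtypes)
```

Because `index=False` is used when saving, the reloaded frame gets a fresh index starting at 0 rather than the original row labels.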