#### **Import libraries**
We first import the libraries needed for the analysis:
- `numpy`: for array manipulation
- `pandas`: for arranging data in a dataframe format and manipulating its rows and columns
- `pyplot` from `matplotlib`: for data visualization
- `sklearn`: for ML model implementation, data processing and inference

In [64]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings 
warnings.filterwarnings("ignore")

#### **Import dataset**
This stage consists of importing the dataset and previewing its first rows, which gives an overview of what the data looks like.

In [65]:
# import dataset (raw string, so the backslashes in the Windows path are not treated as escapes)
file_path = r'B:\_GITHUB\Machine-Learning-project-series\logistic regression\heart disease prediction\dataset\heart_disease.csv'
data = pd.read_csv(file_path)
data.head(10)

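| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
| 5 | 58 | 0 | 0 | 100 | 248 | 0 | 0 | 122 | 0 | 1.0 | 1 | 0 | 2 | 1 |
| 6 | 58 | 1 | 0 | 114 | 318 | 0 | 2 | 140 | 0 | 4.4 | 0 | 3 | 1 | 0 |
| 7 | 55 | 1 | 0 | 160 | 289 | 0 | 0 | 145 | 1 | 0.8 | 1 | 1 | 3 | 0 |
| 8 | 46 | 1 | 0 | 120 | 249 | 0 | 0 | 144 | 0 | 0.8 | 2 | 0 | 3 | 0 |
| 9 | 54 | 1 | 0 | 122 | 286 | 0 | 0 | 116 | 1 | 3.2 | 1 | 2 | 2 | 0 |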
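
The Windows path in the cell above contains backslashes, which Python treats as escape characters in ordinary string literals. The sketch below uses a hypothetical file name, `data\train.csv` (not part of this project), to show how a plain literal can silently change such a path and how a raw string avoids it.

```python
from pathlib import PureWindowsPath

# Hypothetical path, chosen only because it contains a backslash followed by "t".
plain = "data\train.csv"   # the backslash-t pair becomes a tab character
raw = r"data\train.csv"    # the r prefix keeps every backslash literal

print(repr(plain))
print(repr(raw))

# PureWindowsPath splits on backslashes on any operating system.
parts = PureWindowsPath(raw).parts
print(parts)
```

The path used in this notebook happens to contain no recognized escape sequence, so it loads either way, but a raw string (or forward slashes) removes the risk if a folder is later renamed.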


#### **Data info**
It's a clean, easy-to-understand dataset. However, the meaning of some of the column headers is not obvious. Here's what they mean:

- **age**: The person's age in years
- **sex**: The person's sex (1 = male, 0 = female)
- **cp**: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
- **trestbps**: The person's resting blood pressure (mm Hg on admission to the hospital)
- **chol**: The person's cholesterol measurement in mg/dl
- **fbs**: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
- **restecg**: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
- **thalach**: The person's maximum heart rate achieved
- **exang**: Exercise induced angina (1 = yes; 0 = no)
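- **oldpeak**: ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot)
- **slope**: The slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
- **ca**: The number of major vessels (0-3)
- **thal**: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
- **target**: Heart disease (0 = no, 1 = yes)

Note that the codes in this file do not always match the list above: according to the `data.describe()` output further down, `cp` ranges from 0 to 3, `slope` from 0 to 2, `ca` from 0 to 4 and `thal` from 0 to 3. The value-to-label mapping for those columns should therefore be checked against the dataset's own documentation.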
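
Labels are only useful if the codes in the file match the documented ones, so it can help to compare the two before relabeling. The sketch below is one way to do that, assuming the ten rows from the `data.head(10)` output stand in for the full file; `documented_codes` and `undocumented_values` are names introduced here for illustration.

```python
import io
import pandas as pd

# The ten rows shown in the data.head(10) output, used as a stand-in for the full file.
SAMPLE_CSV = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
58,0,0,100,248,0,0,122,0,1.0,1,0,2,1
58,1,0,114,318,0,2,140,0,4.4,0,3,1,0
55,1,0,160,289,0,0,145,1,0.8,1,1,3,0
46,1,0,120,249,0,0,144,0,0.8,2,0,3,0
54,1,0,122,286,0,0,116,1,3.2,1,2,2,0
"""
sample = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Codes exactly as listed in the column descriptions above.
documented_codes = {
    "sex": {0, 1},
    "cp": {1, 2, 3, 4},
    "fbs": {0, 1},
    "restecg": {0, 1, 2},
    "exang": {0, 1},
    "slope": {1, 2, 3},
    "ca": {0, 1, 2, 3},
    "thal": {3, 6, 7},
    "target": {0, 1},
}


def undocumented_values(df, codes):
    """Return {column: sorted values present in df but missing from the documented codes}."""
    result = {}
    for column, allowed in codes.items():
        extra = set(df[column].unique().tolist()) - allowed
        if extra:
            result[column] = sorted(extra)
    return result


mismatches = undocumented_values(sample, documented_codes)
print(mismatches)
```

On these ten rows the check flags `cp`, `slope` and `thal`; running it on the full dataframe may flag more columns.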

In [33]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB
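
The `info()` summary is meant to be read by eye. If the same facts need to be checked in code (no missing values, a single float column), a sketch like the following could be used. It runs on five rows copied from the `data.head(10)` output rather than on the full file, so it only illustrates the calls.

```python
import io
import pandas as pd

# Five rows copied from the data.head(10) output; the full file has 1025 rows.
SAMPLE_CSV = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
"""
sample = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Number of missing values in each column, then in the whole frame.
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())

# Columns grouped by dtype, mirroring the "dtypes" line of info().
float_columns = sample.select_dtypes(include="float64").columns.tolist()
int_columns = sample.select_dtypes(include="int64").columns.tolist()

print(total_missing)
print(float_columns)
print(len(int_columns))
```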


In [63]:
# Replace numeric codes with readable labels.
# Work on a copy so that `data` stays numeric for describe() and mean() below.
labeled = data.copy()

# sex: 1 = male, 0 = female
labeled['sex'] = np.where(labeled['sex'] == 1, 'male', 'female')

# cp: chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
# Code 0 is not in the documented list; 'symptomatic' is a placeholder label for it.
labeled['cp'] = labeled['cp'].replace({
    1: 'typical angina',
    2: 'atypical angina',
    3: 'non-anginal pain',
    4: 'asymptomatic',
    0: 'symptomatic',
})

# fbs: the person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
labeled['fbs'] = np.where(labeled['fbs'] == 1, 'true', 'false')

# exang: exercise induced angina (1 = yes; 0 = no)
labeled['exang'] = np.where(labeled['exang'] == 1, 'true', 'false')

# slope: the slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
# Code 0 is not in the documented list; 'nosloping' is a placeholder label for it.
labeled['slope'] = labeled['slope'].replace({
    0: 'nosloping',
    1: 'upsloping',
    2: 'flat',
    3: 'downsloping',
})

# thal: a blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
# Codes 1 and 2 are not in the documented list; their labels are placeholders.
labeled['thal'] = labeled['thal'].replace({
    3: 'normal',
    6: 'fixed defect',
    7: 'reversible defect',
    2: 'defect',
    1: 'abnormal',
})

# target: heart disease (0 = no, 1 = yes)
labeled['target'] = np.where(labeled['target'] == 1, 'heart disease', 'no heart disease')

labeled.head(50)
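
The relabeling above can also be driven by a single dictionary of mappings, which keeps the codes and their labels in one place. The following is a minimal sketch under stated assumptions: the rows are illustrative values, `LABELS` and `with_labels` are names introduced here, and only the columns whose codes are unambiguous in the column list (`sex`, `fbs`, `target`) are included.

```python
import pandas as pd

# A few illustrative rows restricted to three coded columns.
sample = pd.DataFrame({
    "sex": [1, 1, 0, 0, 1],
    "fbs": [0, 1, 1, 0, 0],
    "target": [0, 0, 0, 1, 0],
})

# One mapping per column, taken from the column descriptions.
LABELS = {
    "sex": {1: "male", 0: "female"},
    "fbs": {1: "true", 0: "false"},
    "target": {1: "heart disease", 0: "no heart disease"},
}


def with_labels(df, labels):
    """Return a labeled copy of df; the input frame is left untouched."""
    labeled = df.copy()
    for column, mapping in labels.items():
        # Series.map turns any code missing from the mapping into NaN,
        # which makes undocumented codes easy to spot afterwards.
        labeled[column] = labeled[column].map(mapping)
    return labeled


labeled_sample = with_labels(sample, LABELS)
print(labeled_sample)
```

Unlike `replace`, `map` does not pass unmapped codes through unchanged, so a follow-up `labeled_sample.isna().sum()` shows whether any code lacked a label.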

In [76]:
pd.set_option("display.max_rows",11)
pd.set_option("display.float_format",'{:.3f}'.format)
data.describe()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 |
| mean | 54.434 | 0.696 | 0.942 | 131.612 | 246.0 | 0.149 | 0.53 | 149.114 | 0.337 | 1.072 | 1.385 | 0.754 | 2.324 | 0.513 |
| std | 9.072 | 0.46 | 1.03 | 17.517 | 51.593 | 0.357 | 0.528 | 23.006 | 0.473 | 1.175 | 0.618 | 1.031 | 0.621 | 0.5 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 48.0 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 132.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 56.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 152.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 275.0 | 0.0 | 1.0 | 166.0 | 1.0 | 1.8 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


In [77]:
data.mean()

age         54.434
sex          0.696
cp           0.942
trestbps   131.612
chol       246.000
             ...  
oldpeak      1.072
slope        1.385
ca           0.754
thal         2.324
target       0.513
Length: 14, dtype: float64
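
For a column coded as 0/1, the mean is simply the share of rows coded 1, so the `sex` mean of 0.696 says that about 69.6% of rows are coded 1 (male), and the `target` mean of 0.513 says that about 51.3% of rows are coded 1. For nominal codes such as `cp`, the mean has no such reading and relative frequencies are more informative. A small sketch, using the `sex` values of the ten rows shown in the `data.head(10)` output:

```python
import pandas as pd

# sex values of the ten rows shown in the data.head(10) output
sex = pd.Series([1, 1, 1, 1, 0, 0, 1, 1, 1, 1], name="sex")

# Mean of a 0/1 column = fraction of rows coded 1.
share_of_ones = sex.mean()

# Relative frequency of every code; also works for nominal columns such as cp.
shares = sex.value_counts(normalize=True)

print(share_of_ones)
print(shares)
```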