## Data Wrangling for Capstone Two

### Data Loading

In [1]:
#load python packages
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [2]:
# List the contents of the current working directory
os.listdir(os.getcwd())

['.ipynb_checkpoints',
 'DataWrangling.ipynb',
 'heart_data.csv',
 'meteorites.ipynb',
 'meteorites.py',
 'tmp']

In [3]:
loaded_data = pd.read_csv('heart_data.csv',index_col=False)
loaded_data.head()

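|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |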


### Data Organisation
Create subfolders to keep the project's data, figures, and models organised.

<font color='teal'> **Create a subfolder called `data`.**</font>

In [4]:
# exist_ok=True lets this cell be re-run without raising FileExistsError
os.makedirs('data', mode=0o777, exist_ok=True)

<font color='teal'> **Create a subfolder called `figures`.**</font>

In [5]:
# exist_ok=True lets this cell be re-run without raising FileExistsError
os.makedirs('figures', mode=0o777, exist_ok=True)

<font color='teal'> **Create a subfolder called `models`.**</font>

In [6]:
# exist_ok=True lets this cell be re-run without raising FileExistsError
os.makedirs('models', mode=0o777, exist_ok=True)
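
The three folders can also be created in a single step that is safe to re-run. A minimal sketch using `pathlib`; the helper name `make_project_dirs` and its `base` argument are hypothetical and not part of the original notebook, and the demo runs in a temporary directory rather than the project folder:

```python
import tempfile
from pathlib import Path


def make_project_dirs(base, names=("data", "figures", "models")):
    """Create each subfolder under `base`, skipping any that already exist."""
    created = []
    for name in names:
        path = Path(base) / name
        # exist_ok=True means an existing folder is left untouched
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demo in a throwaway directory; in the notebook, `base` would be "."
with tempfile.TemporaryDirectory() as tmp:
    make_project_dirs(tmp)
    make_project_dirs(tmp)  # second call does nothing and raises no error
    print(sorted(p.name for p in Path(tmp).iterdir()))
```

Calling `make_project_dirs(".")` in the notebook would create `data`, `figures`, and `models` next to the notebook file.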

### Data Definition

In [7]:
loaded_data.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [9]:
loaded_data.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

In [10]:
loaded_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


### Description of Data Columns

* age: age in years
* sex: sex (1 = male; 0 = female)
* cp: chest pain type
 * Value 1: typical angina
 * Value 2: atypical angina
 * Value 3: non-anginal pain
 * Value 4: asymptomatic
* trestbps: resting blood pressure (in mm Hg on admission to the hospital)
* chol: serum cholesterol in mg/dl
* fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
* restecg: resting electrocardiographic results
 * Value 0: normal
 * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
 * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
* thalach: maximum heart rate achieved
* exang: exercise induced angina (1 = yes; 0 = no)
* oldpeak: ST depression induced by exercise relative to rest
* slope: the slope of the peak exercise ST segment
 * Value 1: upsloping
 * Value 2: flat
 * Value 3: downsloping
* ca: number of major vessels (0-3) colored by fluoroscopy
* thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
* target: diagnosis of heart disease (angiographic disease status)
 * Value 0: < 50% diameter narrowing
 * Value 1: > 50% diameter narrowing (in any major vessel)

Note that the codes in the loaded file do not always match this list: for example, `cp` ranges from 0 to 3 in the data rather than 1 to 4.
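
The documented codes can be checked against the values actually present in the file. A minimal sketch, assuming a hypothetical `documented_codes` mapping transcribed from the list above and a `sample` frame holding three columns of the five rows printed by `head()`; the helper `undocumented_values` is illustrative, not part of the notebook:

```python
import pandas as pd

# Allowed codes as given in the column descriptions (illustrative subset)
documented_codes = {
    "sex": {0, 1},
    "cp": {1, 2, 3, 4},
    "fbs": {0, 1},
}

# Three columns from the five rows shown by loaded_data.head()
sample = pd.DataFrame({
    "sex": [1, 1, 0, 1, 0],
    "cp": [3, 2, 1, 1, 0],
    "fbs": [1, 0, 0, 0, 0],
})


def undocumented_values(df, codes):
    """Return, per column, the observed values that the documentation does not list."""
    result = {}
    for col, allowed in codes.items():
        extra = set(df[col].unique().tolist()) - allowed
        if extra:
            result[col] = sorted(extra)
    return result


print(undocumented_values(sample, documented_codes))  # {'cp': [0]}
```

A non-empty result suggests the file uses a different coding from the description, which is worth resolving before interpreting the categories.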

In [13]:
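# Count the unique values in each column.
# Caution: np.count_nonzero counts only the nonzero unique values, so a column
# that contains 0 is reported one lower than its true number of distinct values.
for col in loaded_data.columns:
    print('No of unique values in column ', col, ' is ', np.count_nonzero(loaded_data[col].unique()))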

No of unique values in column  age  is  41
No of unique values in column  sex  is  1
No of unique values in column  cp  is  3
No of unique values in column  trestbps  is  49
No of unique values in column  chol  is  152
No of unique values in column  fbs  is  1
No of unique values in column  restecg  is  2
No of unique values in column  thalach  is  91
No of unique values in column  exang  is  1
No of unique values in column  oldpeak  is  39
No of unique values in column  slope  is  2
No of unique values in column  ca  is  4
No of unique values in column  thal  is  3
No of unique values in column  target  is  1
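
The loop above passes each column's unique values through `np.count_nonzero`, which counts only nonzero entries. Any column that contains 0 is therefore reported one short, which is why the binary columns `sex`, `fbs`, and `exang` show 1. A minimal sketch of the usual alternative, `DataFrame.nunique()`, on four columns of the five rows printed by `head()`; the names `sample`, `undercount`, and `distinct` are illustrative:

```python
import numpy as np
import pandas as pd

# Four columns from the five rows shown by loaded_data.head()
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "sex": [1, 1, 0, 1, 0],
    "cp": [3, 2, 1, 1, 0],
    "fbs": [1, 0, 0, 0, 0],
})

# count_nonzero on the unique values skips the value 0
undercount = {c: int(np.count_nonzero(sample[c].unique())) for c in sample.columns}

# nunique counts every distinct value, including 0
distinct = sample.nunique().to_dict()

print(undercount)  # {'age': 5, 'sex': 1, 'cp': 3, 'fbs': 1}
print(distinct)    # {'age': 5, 'sex': 2, 'cp': 4, 'fbs': 2}
```

On the full frame, `loaded_data.nunique()` would give the distinct-value count for every column in one call.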


#### Check the range of values in each column

In [14]:
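# Pass the aggregation names as strings; recent pandas versions warn when given the built-in min/max functions
loaded_data.agg(['min', 'max']).T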

|   | min | max |
|---|---|---|
| age | 29.0 | 77.0 |
| sex | 0.0 | 1.0 |
| cp | 0.0 | 3.0 |
| trestbps | 94.0 | 200.0 |
| chol | 126.0 | 564.0 |
| fbs | 0.0 | 1.0 |
| restecg | 0.0 | 2.0 |
| thalach | 71.0 | 202.0 |
| exang | 0.0 | 1.0 |
| oldpeak | 0.0 | 6.2 |


In [15]:
loaded_data.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 303.0 | 54.366337 | 9.082101 | 29.0 | 47.5 | 55.0 | 61.0 | 77.0 |
| sex | 303.0 | 0.683168 | 0.466011 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| cp | 303.0 | 0.966997 | 1.032052 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 |
| trestbps | 303.0 | 131.623762 | 17.538143 | 94.0 | 120.0 | 130.0 | 140.0 | 200.0 |
| chol | 303.0 | 246.264026 | 51.830751 | 126.0 | 211.0 | 240.0 | 274.5 | 564.0 |
| fbs | 303.0 | 0.148515 | 0.356198 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| restecg | 303.0 | 0.528053 | 0.52586 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
| thalach | 303.0 | 149.646865 | 22.905161 | 71.0 | 133.5 | 153.0 | 166.0 | 202.0 |
| exang | 303.0 | 0.326733 | 0.469794 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| oldpeak | 303.0 | 1.039604 | 1.161075 | 0.0 | 0.0 | 0.8 | 1.6 | 6.2 |


### Data Cleaning

The `info()` output above shows 303 non-null entries in every column, so the data set has no null values and no imputation is needed.
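
Missing values can also be counted directly per column rather than read off the `info()` output. A minimal sketch on a small illustrative frame; `sample` is made up for the example, and one `oldpeak` value is deliberately set to NaN so that both outcomes are visible:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "age": [63, 37, 41],
    "chol": [233, 250, 204],
    "oldpeak": [2.3, np.nan, 1.4],  # NaN injected on purpose for the demo
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Single flag: True if any cell in the frame is missing
has_missing = bool(sample.isna().any().any())
print(has_missing)
```

On the notebook's frame, `loaded_data.isna().sum()` would be expected to show 0 for every column, consistent with the non-null counts reported by `info()`.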

We will, however, check whether there is any duplicated data.

In [16]:
duplicateRowsDF = loaded_data[loaded_data.duplicated()]
duplicateRowsDF

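|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | 38 | 1 | 2 | 138 | 175 | 0 | 1 | 173 | 0 | 0.0 | 2 | 4 | 2 | 1 |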


As shown above, there is one duplicated row. However, we cannot be sure it is a data-entry duplicate, because two different people could share the same age and heart parameters. We will therefore keep this row.
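
To judge whether a flagged row is a genuine repeat, it can help to view it next to the row it matches. By default `duplicated()` marks only the later occurrence; `keep=False` marks every member of a duplicate group. A minimal sketch on made-up rows (the frame `sample` is illustrative and not taken from the dataset):

```python
import pandas as pd

# Rows 0 and 1 are identical; row 2 is distinct
sample = pd.DataFrame({
    "age": [38, 38, 41],
    "sex": [1, 1, 0],
    "chol": [175, 175, 204],
})

# Default keep='first': only the second copy is flagged
later_copies = sample[sample.duplicated()]

# keep=False: both copies are flagged, so they can be compared side by side
all_copies = sample[sample.duplicated(keep=False)]

print(later_copies.index.tolist())  # [1]
print(all_copies.index.tolist())    # [0, 1]
```

If the decision were instead to remove repeats, `drop_duplicates()` would keep the first copy of each group; the notebook keeps the row, so that call is not used here.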

Now we save the data into the `data` folder to keep the project organised.

In [17]:
loaded_data.to_csv('data/step2_output.csv',index=False)
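
A minimal sketch of the same save step followed by a read-back check. It writes to a temporary directory rather than `data/`, and `sample` holds three integer columns from the `head()` output; both are assumptions made so the example runs on its own. Passing `index=False` keeps the row index out of the file, so no extra column appears on reload:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Three integer columns from the first rows shown by loaded_data.head()
sample = pd.DataFrame({
    "age": [63, 37, 41],
    "sex": [1, 1, 0],
    "chol": [233, 250, 204],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "step2_output.csv"
    sample.to_csv(out_path, index=False)  # index=False: do not write the row index
    reloaded = pd.read_csv(out_path)

# The reloaded frame should have the same columns and values as the original
round_trip_ok = reloaded.equals(sample)
print(list(reloaded.columns), round_trip_ok)
```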