# Assignment: Heart failure data processing


Unfortunately, real-life data is often not nicely structured. We need to manipulate unstructured or messy data into a structured, clean form. We may need to drop rows and columns because they are not needed for the analysis or because they contain too many missing values. We may need to relabel columns, convert text values into numerical values, or combine data from several sources. Cleaning and manipulating data into a structured form is called **data processing**. Data processing starts with data in its raw form and converts it into a more readable format (tables, graphs, etc.), giving it the form and context necessary to be interpreted by computers and used by people.


In this notebook we use Python and the libraries `NumPy` and `pandas`. Both are high-performance libraries that are well suited to data manipulation and numerical computation.

# Data processing example: heart failure case

Cardiovascular diseases kill approximately 17 million people globally every year, and they mainly manifest as myocardial infarctions and heart failures. Heart failure occurs when the heart cannot pump enough blood to meet the needs of the body. Available electronic medical records of patients quantify symptoms, body features, and clinical laboratory test values, which can be used to perform biostatistical analyses aimed at highlighting patterns and correlations that medical doctors would otherwise not detect. Machine learning can predict patients’ survival from their data and can identify the most important features among those included in their medical records [1]. As a data scientist you are required to check whether the data can be used for modelling and to select the most important features for predicting the patient's survival. The data for the analysis is available in `heart_failure_clinical_records_dataset.csv`. The data description can be found in the table `data_description.csv`.

[1] https://doi.org/10.1186/s12911-020-1023-5

In [9]:
import pandas as pd
import numpy as np

# Assignment

In this notebook we are going to preprocess the data. Not all steps are coded: where you see `#your code here` or `'your code here'`, you need to write the code that carries out the instructions.


## Step 1: Inspect the data

The first step is inspecting the data and getting an idea about the meaning of the variables, format and units. 

In [10]:
# load and display the meta data, the data that describes the data
md = pd.read_csv('data/data_description.csv', sep=';')
md

Unnamed: 0,Feature,Explanation,Measurement
0,Age,Age of patient,years
1,Anaemia,Decrease of red blood cells or hemoglobin,Boolean
2,High blood pressure,If a patient has hypertension,Boolean
3,Creatinine phosphokinase,Level of the CPK enzyme in the blood,mcg/L
4,Diabetes,If the patient has diabetes,Boolean
5,Ejection fraction,Percentage of blood leaving the heart at each ...,Percentage
6,Sex,Woman or Man,Binary
7,Platelets,Platelets in the blood,kiloplatelets/mL
8,Serum creatinine,Level of creatinine in the blood,mg/dL
9,Serum sodium,Level of sodium in the blood,mEq/L


The death event is the class variable that we want to predict. The column `DEATH_EVENT` is a boolean: if `DEATH_EVENT` is 1 (True) the patient died; if `DEATH_EVENT` is 0 (False) the patient survived.

In [11]:
# load and display data 
df = pd.read_csv('data/heart_failure_clinical_records_dataset.csv')
print(f'this dataset contains {len(df)} rows')
df.head(5)

this dataset contains 299 rows


Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1.0,0.0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1.0,,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1.0,1.0,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1.0,0.0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0.0,0.0,8,1


In [12]:
list(df.columns)

['age',
 'anaemia',
 'creatinine_phosphokinase',
 'diabetes',
 'ejection_fraction',
 'high_blood_pressure',
 'platelets',
 'serum_creatinine',
 'serum_sodium',
 'sex',
 'smoking',
 'time',
 'DEATH_EVENT']

Note that the column names in the meta data differ slightly from those in the clinical records, and the order is different as well. We must take that into account if we want to use the meta data to select a subset of the clinical records.
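
One possible way to reconcile the two naming schemes is to convert the `Feature` labels to the lower-case, underscore-separated style of the clinical records. The sketch below uses a few hard-coded labels instead of the real files, and the helper name `to_column_name` is an assumption made for illustration only.

```python
import pandas as pd

# Small stand-in for the meta data table (only a few features, for illustration)
md_demo = pd.DataFrame({"Feature": ["Age", "High blood pressure", "Serum sodium"]})

# Stand-in for the column names of the clinical records
record_columns = ["age", "anaemia", "high_blood_pressure", "serum_sodium", "DEATH_EVENT"]


def to_column_name(label: str) -> str:
    """Hypothetical helper: 'High blood pressure' -> 'high_blood_pressure'."""
    return label.strip().lower().replace(" ", "_")


md_demo["column"] = md_demo["Feature"].map(to_column_name)

# Keep only the converted names that really occur in the clinical records
selected = [name for name in md_demo["column"] if name in record_columns]
print(selected)
```

A list such as `selected` could then be used to index the clinical records, for example `df[selected]`, without depending on the order of the meta data rows.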

### Missing data

Looking at the dataframe values, we also see a NaN in the column `smoking`. This means that the dataset contains missing values. Let us inspect them.

In [13]:
# first inspect missing data
# YOUR CODE HERE
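
As a hint, missing values are commonly counted with `isna()`. The sketch below is one possible approach, shown on a small made-up frame (`df_demo` is not the real dataset) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the clinical records (values are made up)
df_demo = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0],
    "sex": [1.0, 1.0, np.nan, 0.0],
    "smoking": [0.0, np.nan, 1.0, 0.0],
})

# Number of missing values per column
missing_per_column = df_demo.isna().sum()
print(missing_per_column)

# The rows that contain at least one missing value
rows_with_missing = df_demo[df_demo.isna().any(axis=1)]
print(rows_with_missing)
```

Counting per column shows which features are affected, while selecting the incomplete rows shows how many patients would be lost if those rows were dropped.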

The columns `sex` and `smoking` have missing values. When a column has a lot of missing data, we can consider dropping the whole column from the dataframe. In this case only a few values are missing, so we can either fill the gaps with a guessed value or drop the affected rows.
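
The two options can be compared on made-up values. In this sketch the "guessed value" is assumed to be the most frequent value of each column, which is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (values are made up)
df_demo = pd.DataFrame({
    "sex": [1.0, 1.0, np.nan, 0.0],
    "smoking": [0.0, np.nan, 0.0, 1.0],
})

# Option 1: fill each gap with the most frequent value of its column
most_frequent = df_demo.mode().iloc[0]
df_filled = df_demo.fillna(most_frequent)

# Option 2: drop every row that contains a gap
df_dropped = df_demo.dropna(axis=0)

print(len(df_filled), len(df_dropped))
```

Filling keeps all rows but introduces guessed values; dropping keeps only observed values at the cost of fewer rows.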

In [14]:
df = df.dropna(axis = 0) # drop NaN rows
print(f'this dataset contains {len(df)} rows')

this dataset contains 297 rows


In [15]:
df.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1.0,0.0,4,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1.0,1.0,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1.0,0.0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0.0,0.0,8,1
5,90.0,1,47,0,40,1,204000.0,2.1,132,1.0,1.0,8,1


Furthermore, we can see that all the binary and boolean (yes/no) data is encoded as either a zero or a one. It might be unclear what these codes mean when plotting the data.

In [16]:
df['sex'].value_counts() 

1.0    193
0.0    104
Name: sex, dtype: int64

In the meta data the feature is described as "Woman or Man", so we might want to replace the numeric codes with these labels.

In [17]:
# replace the numeric codes with the labels 'Man' and 'Woman'
# YOUR CODE HERE
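
A possible approach is to map the codes through a dictionary and store the result as a category. The coding assumed here (1 = Man, 0 = Woman) is inferred from the tables displayed in this notebook and should be checked against the dataset documentation; `df_demo` is made-up data.

```python
import pandas as pd

# Toy column with the numeric codes (values are made up)
df_demo = pd.DataFrame({"sex": [1.0, 1.0, 0.0, 1.0]})

# Assumed coding: 1 = Man, 0 = Woman
sex_labels = {1.0: "Man", 0.0: "Woman"}

# Replace the codes and store the result as a categorical column
df_demo["sex"] = df_demo["sex"].map(sex_labels).astype("category")

print(df_demo["sex"].value_counts())
```

Codes that are not in the dictionary become NaN with `map`, which makes unexpected values easy to spot.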

### Inspect the datatypes

We changed the sex column to category, but what datatypes are the other columns? 

In [1]:
#INSPECT THE DATA
#YOUR CODE HERE

We know that some of the integer columns are really booleans (logical values). Let's convert them.

In [11]:
df["anaemia"] = df["anaemia"].astype('bool')
df["high_blood_pressure"] = df["high_blood_pressure"].astype('bool')
df["diabetes"] = df["diabetes"].astype('bool')
df["smoking"] = df["smoking"].astype('bool')
df["DEATH_EVENT"] = df["DEATH_EVENT"].astype('bool')
df.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,False,582,False,20,True,265000.0,1.9,130,Man,False,4,True
1,55.0,False,7861,False,38,False,263358.03,1.1,136,Man,True,6,True
2,65.0,False,146,False,20,False,162000.0,1.3,129,Man,True,7,True
3,50.0,True,111,False,20,False,210000.0,1.9,137,Man,False,7,True
4,65.0,True,160,True,20,False,327000.0,2.7,116,Woman,False,8,True
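
One thing to watch for: casting a float column that still contains NaN to `bool` turns the missing value into `True`. This would explain why row 1 in the table above shows `smoking` as `True` although the raw value was missing. A minimal illustration on made-up values:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value
smoking_demo = pd.Series([0.0, np.nan, 1.0])

# NaN counts as "truthy", so a plain cast silently turns the gap into True
as_bool = smoking_demo.astype("bool")
print(as_bool.tolist())

# Removing the gaps first keeps only values that were actually observed
as_bool_clean = smoking_demo.dropna().astype("bool")
print(as_bool_clean.tolist())
```

It is therefore safer to handle the missing values before converting the columns.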


In [12]:
df['anaemia'].value_counts() 

False    170
True     129
Name: anaemia, dtype: int64

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   age                       299 non-null    float64 
 1   anaemia                   299 non-null    bool    
 2   creatinine_phosphokinase  299 non-null    int64   
 3   diabetes                  299 non-null    bool    
 4   ejection_fraction         299 non-null    int64   
 5   high_blood_pressure       299 non-null    bool    
 6   platelets                 299 non-null    float64 
 7   serum_creatinine          299 non-null    float64 
 8   serum_sodium              299 non-null    int64   
 9   sex                       298 non-null    category
 10  smoking                   299 non-null    bool    
 11  time                      299 non-null    int64   
 12  DEATH_EVENT               299 non-null    bool    
dtypes: bool(5), category(1), float64(3), int64(4)
memo

## Step 2: Explore data

It is useful to understand the range of the data. The method `describe` displays descriptive statistics of the numerical columns.

In [3]:
#Your code here
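
As a hint, `describe` can be called directly on a dataframe. The sketch below uses a small stand-in frame (`df_demo`, four rows only) so that it runs without the data file.

```python
import pandas as pd

# Stand-in frame with three numeric features (four rows only)
df_demo = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0],
    "platelets": [265000.0, 263358.03, 162000.0, 210000.0],
    "serum_creatinine": [1.9, 1.1, 1.3, 1.9],
})

# count, mean, std, min, quartiles and max for every numeric column
summary = df_demo.describe()
print(summary)

# Comparing the mean with the median (50%) gives a first impression of skewness
print(summary.loc[["mean", "50%"]])
```

A mean that lies far from the median suggests a skewed distribution, which is worth confirming with a plot.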

What we can see is that the value ranges differ per feature. If we want to use the data for prediction, we need to normalize the data later on. We can do that with NumPy. From the `describe` table we can also see that most of the features are not symmetrically distributed. Let us inspect the distributions by plotting.

## Step 3: Plotting the data

Plotting the data helps to answer questions such as: are the attributes independent of each other? We will use a profile report and a heatmap to investigate.

In [19]:
# Generate a quick report from our dataset 
# (recent releases of this package are distributed as `ydata_profiling`)
from pandas_profiling import ProfileReport  
# your code here

### Plot distributions of numeric values

Another way is to use Bokeh for the plotting.

In [20]:
#plot numeric values distributions
df_num = df.select_dtypes(include=['float64', 'int64'])

In [None]:
from bokeh.layouts import gridplot
from bokeh.plotting import figure, output_file, show

#use a function to generalize the plotting creation
def make_plot(title, hist, edges):
    p = figure(title=title, tools='', background_fill_color="#fafafa")
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
           fill_color="navy", line_color="white", alpha=0.5)
    p.y_range.start = 0
    p.xaxis.axis_label = 'value'
    p.yaxis.axis_label = 'count'
    p.grid.grid_line_color="white"
    return p

# Distribution: one histogram per numeric column
g = []
for column in df_num.columns:
    hist, edges = np.histogram(df_num[column], bins=40)
    p = make_plot(f" {column}", hist, edges)
    g.append(p)


#output_file('histogram.html', title="distribution plots")
# Bokeh 3 uses width/height here; older releases used plot_width/plot_height
show(gridplot(g, ncols=4, width=250, height=250, toolbar_location=None))

### Plot number of deaths related to time

First we reorganize the data by creating a new table with the number of deaths per time unit. Use the `groupby` method to do so.

In [None]:
grouped = 'YOUR CODE HERE'
print(grouped.head(10))
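
As a hint, the sketch below shows one possible way to count death events per time value with `groupby`. It uses made-up rows (`df_demo`) and keeps `time` as a regular column, which is an assumption about what the bar plot expects.

```python
import pandas as pd

# Toy records: follow-up time and whether the patient died (values are made up)
df_demo = pd.DataFrame({
    "time": [4, 6, 7, 7, 8, 8, 10],
    "DEATH_EVENT": [1, 1, 1, 1, 1, 0, 0],
})

# Sum the death events per time value; as_index=False keeps 'time' as a column
grouped_demo = df_demo.groupby("time", as_index=False)["DEATH_EVENT"].sum()
print(grouped_demo.head(10))
```

Summing also works when `DEATH_EVENT` is stored as a boolean, because `True` counts as 1.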

Then we use the new table as the source of the bar plot.

In [20]:
# Bokeh 3 uses width/height; older releases used plot_width/plot_height
p = figure(title="death events in time", width=950, height=300, toolbar_location=None)
p.vbar(x='time', top='DEATH_EVENT', width=1, source=grouped, color='black')
p.xaxis.axis_label = 'number of days'
p.yaxis.axis_label = 'number of deaths'
show(p)


### Heatmap

To investigate whether the attributes are independent of each other, we first remove the class variable. Then we create a correlation matrix. We reshape this into a long table (one row per pair of attributes) that can serve as a ColumnDataSource for the heatmap plot.

In [25]:
df = df.drop(['DEATH_EVENT'],axis = 1)
c = df.corr().abs()
y_range = (list(reversed(c.columns)))
x_range = (list(c.index))
c

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time
age,1.0,0.091263,0.07694,0.098717,0.065669,0.089771,0.059381,0.160963,0.06052,0.06211,0.015088,0.234584
anaemia,0.091263,1.0,0.188232,0.019868,0.026216,0.039075,0.037178,0.049662,0.056498,0.088063,0.107192,0.142004
creatinine_phosphokinase,0.07694,0.188232,1.0,0.012969,0.049068,0.057838,0.027495,0.010307,0.071998,0.068424,0.022052,0.034963
diabetes,0.098717,0.019868,0.012969,1.0,0.010599,0.011973,0.100229,0.04992,0.077877,0.151418,0.147158,0.035472
ejection_fraction,0.065669,0.026216,0.049068,0.010599,1.0,0.028122,0.081583,0.013714,0.19784,0.142912,0.06434,0.049433
high_blood_pressure,0.089771,0.039075,0.057838,0.011973,0.028122,1.0,0.045824,0.004436,0.028635,0.106781,0.059321,0.206137
platelets,0.059381,0.037178,0.027495,0.100229,0.081583,0.045824,1.0,0.038465,0.041739,0.134619,0.024224,0.001485
serum_creatinine,0.160963,0.049662,0.010307,0.04992,0.013714,0.004436,0.038465,1.0,0.187516,0.009944,0.026996,0.149687
serum_sodium,0.06052,0.056498,0.071998,0.077877,0.19784,0.028635,0.041739,0.187516,1.0,0.044494,0.003816,0.071188
sex,0.06211,0.088063,0.068424,0.151418,0.142912,0.106781,0.134619,0.009944,0.044494,1.0,0.446947,0.018672


In [22]:
#reshape
dfc = pd.DataFrame(c.stack(), columns=['r']).reset_index()
dfc.head()

Unnamed: 0,level_0,level_1,r
0,age,age,1.0
1,age,anaemia,0.088006
2,age,creatinine_phosphokinase,0.081584
3,age,diabetes,0.101012
4,age,ejection_fraction,0.060098
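
The long format also makes it easy to rank attribute pairs by correlation strength. The sketch below repeats the reshape on a made-up frame with hypothetical columns `a`, `b` and `c`, and then looks up the strongest pair.

```python
import pandas as pd

# Toy data: 'b' follows 'a' closely, 'c' does not (values are made up)
df_demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.5],
    "c": [4.0, 1.0, 3.0, 2.0],
})

# Absolute correlation matrix, reshaped to one row per pair of attributes
c_demo = df_demo.corr().abs()
dfc_demo = pd.DataFrame(c_demo.stack(), columns=["r"]).reset_index()

# Drop the self-correlations (always 1.0) and sort the remaining pairs
pairs = dfc_demo[dfc_demo["level_0"] != dfc_demo["level_1"]]
strongest = pairs.sort_values("r", ascending=False).iloc[0]
print(strongest)
```

Each pair appears twice (once in each order) because the correlation matrix is symmetric.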


In [28]:
# Create a heatmap using seaborn heatmap
# your code here


## Step 4: Clean data

Based on the inspection and exploration of the data, it is decided to drop the column `time`; this feature will not be used for prediction. All the other variables will be used for further analysis. For computational convenience the int64 data is used instead of booleans and categories, so we reload the raw data. Furthermore, the data needs to be transformed and normalized.

In [25]:
import numpy as np

df = pd.read_csv('data/heart_failure_clinical_records_dataset.csv')
# your code here to drop NaN rows
# your code here to drop the time column
# remove outliers: keep rows where every value is within 3 standard deviations of its column mean
df = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]

print(f'this dataset contains {len(df)} rows and {len(df.columns)} columns')
df.head()

this dataset contains 280 rows and 12 columns


Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1.0,0.0,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1.0,1.0,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1.0,0.0,1
5,90.0,1,47,0,40,1,204000.0,2.1,132,1.0,1.0,1
6,75.0,1,246,0,15,0,127000.0,1.2,137,1.0,0.0,1
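
The outlier filter in the cell above keeps a row only if every value lies within three standard deviations of its column mean. The same rule can be followed step by step on made-up numbers (`df_demo` is not the real dataset):

```python
import pandas as pd

# 19 ordinary values plus one extreme value (all made up)
df_demo = pd.DataFrame({"serum_creatinine": [1.0] * 10 + [1.2] * 9 + [9.0]})

# z-score: distance from the column mean, measured in standard deviations
z_scores = (df_demo - df_demo.mean()) / df_demo.std()

# Keep a row only if all of its z-scores are smaller than 3 in absolute value
keep = (z_scores.abs() < 3).all(axis=1)
df_no_outliers = df_demo[keep]

print(len(df_demo), len(df_no_outliers))
```

Because the extreme value itself inflates the mean and the standard deviation, this rule works best when outliers are rare.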


## Step 5: Split into features matrix and class vector. Normalize features

In [6]:
# put the label in the y vector and the other columns and rows in the X matrix. 
y = 'your code here'
X = 'your code here'
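
As a hint, the split can be made by selecting the label column for `y` and dropping it for `X`. The sketch below is one possible approach on a made-up frame; converting to NumPy arrays is an assumption about what the later steps expect.

```python
import pandas as pd

# Toy frame with two features and the class variable (values are made up)
df_demo = pd.DataFrame({
    "age": [75.0, 65.0, 50.0],
    "ejection_fraction": [20, 38, 40],
    "DEATH_EVENT": [1, 1, 0],
})

# Class vector: only the label column
y_demo = df_demo["DEATH_EVENT"].to_numpy()

# Feature matrix: every column except the label
X_demo = df_demo.drop(columns=["DEATH_EVENT"]).to_numpy(dtype=float)

print(X_demo.shape, y_demo.shape)
```

The number of rows in `X_demo` and the length of `y_demo` must match, since each row of features belongs to exactly one label.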


In [27]:
# normalize the data
# your code here
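
"Normalizing" can mean different things. The sketch below shows two common variants with NumPy on a small made-up matrix: min-max scaling to the range 0 to 1, and standardization to mean 0 and standard deviation 1. Which variant suits the assignment is left open here.

```python
import numpy as np

# Toy feature matrix: rows are patients, columns are features (values are made up)
X_demo = np.array([
    [75.0, 265000.0, 1.9],
    [65.0, 162000.0, 1.3],
    [50.0, 210000.0, 1.9],
    [90.0, 204000.0, 2.1],
])

# Min-max scaling: every column is mapped to the range [0, 1]
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_minmax = (X_demo - col_min) / (col_max - col_min)

# Standardization: every column gets mean 0 and standard deviation 1
X_standard = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_minmax.round(2))
print(X_standard.round(2))
```

Both variants divide by a per-column spread, so a column with a single constant value would need separate handling to avoid division by zero.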

We now have a cleaned, normalized feature matrix and a class variable vector. We have successfully prepared the dataset for machine learning algorithms that predict the heart failure death event.