<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# **Heart Failure Prediction**

# Lab 3. Data Analysis with Python

# Abstract
In this lab, you will learn what exploratory analysis of medical data means. You will calculate basic descriptive statistics about patients with heart failure, such as the mean, median, mode, and quartile values, and use them to understand the data distribution better. You will also learn how to group medical data for better visualization, use the Pearson correlation method to compare two continuous numerical variables, and use the chi-square test to find relationships between two categorical variables.

Estimated time needed: **30** minutes

## Objectives

After completing this lab you will be able to:

*   Explore features or characteristics to predict mortality of patients


<h2>Table of Contents</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#import_data">Import Data from Module 2</a></li>
    <li><a href="#pattern_visualization">Analyzing Individual Feature Patterns using Visualization</a></li>
    <li><a href="#discriptive_statistics">Descriptive Statistical Analysis</a></li>
    <li><a href="#basic_grouping">Basics of Grouping</a></li>
    <li><a href="#correlation_causation">Correlation and Causation</a></li>
    <li><a href="#anova">ANOVA</a></li>
</ol>

</div>

<hr>


<h3>What are the main characteristics that have the most impact on the mortality rate?</h3>


<a id="import_data"></a><h2>1. Import Data from Module 2</h2>


Import libraries:


In [ ]:
# install specific versions of the libraries used in the lab
# !mamba install pandas==1.3.3
# !mamba install numpy=1.21.2
# !mamba install scipy=1.7.1 -y
# !mamba install seaborn=0.9.0 -y
!pip install dython

In [ ]:
import pandas as pd
import numpy as np
from dython.nominal import associations
from scipy import stats
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
If an error appears, please restart the kernel or run this block again.

</div>

We will use the dataset from the previous lab. Load the data and store it in the dataframe `df`:


In [ ]:
path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX08ZWEN/clean_df.csv'

In [ ]:
df = pd.read_csv(path)
df.head()

Let's set the number of digits displayed for float values.

In [ ]:
pd.options.display.float_format = '{:.2f}'.format

<a id="pattern_visualization"></a><h2>2. Analyzing Individual Feature Patterns Using Visualization</h2>


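We will use Seaborn, imported above as <code>sns</code>, for the plots in this section. If it is missing, it can be installed with pip, the Python package manager.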


<h4>How to choose the right visualization method?</h4>
<p>When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.</p>


In [ ]:
# list the data types for each column
print(df.dtypes)
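
Once the data types are known, the columns can be split into numeric and non-numeric groups, which is what decides the visualization method. The following is only a sketch on a hypothetical toy dataframe (`toy` and its values are assumptions, not the lab data):

```python
import pandas as pd

# Hypothetical toy frame; the column names only mimic the style of the lab data.
toy = pd.DataFrame({
    "Hemoglobin": [13.5, 11.2, 14.8],
    "Mortality": [0, 1, 0],
    "Age Group": ["41-50", "51-60", "61-70"],
})

# Numeric columns (int and float) versus everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

The same two calls could be applied to `df` to list its numeric and categorical columns.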

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h3>Question  #1:</h3>

<b>What is the data type of the column "BP"? </b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df['BP'].dtypes
```

</details>


For example, we can calculate the correlation between variables of type "int64" or "float64" using the method "corr":


In [ ]:
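# select the numeric columns explicitly: recent pandas versions raise an error if text columns are passed to corr
corr = df.select_dtypes(include=['int64', 'float64']).corr()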
corr

Let's take a look at the correlation heatmap of our data:

In [ ]:
plt.figure(figsize = (12,10))
sns.heatmap(corr, linewidths=.5)

Next, for every pair of non-target columns with a strong correlation (corr > 0.8), we delete one column of the pair. In general, it is recommended to avoid highly correlated features in a dataset: a group of highly correlated features adds little or no additional information but increases the complexity of the algorithm, and with it the risk of errors.

In [ ]:
df = df.drop(columns=['Age', 'CK-MB', 'PCV', 'MCV', 'Neutrophil', 'Gender-female', 'Reaction', 'Locality-rural'])
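
The columns above were chosen by inspecting the correlation matrix. As a rough sketch, strongly correlated pairs could also be listed programmatically. The toy data below is hypothetical, and taking the absolute value (so that strong negative correlations are caught too) is an assumption of this sketch:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'b' is almost a copy of 'a', 'c' is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr_abs = toy.corr().abs()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))

# Long format: one row per column pair, then keep the strong ones.
strong_pairs = upper.stack()
strong_pairs = strong_pairs[strong_pairs > 0.8]
print(strong_pairs)
```

For each pair that is listed, one of the two columns is a candidate for removal.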

To see the relationship between mortality and variables of string or category type we can use <code>dython.nominal.associations</code>. This method computes the correlation or strength of association between features in a dataset, taking into account both categorical and continuous features. It utilizes different metrics depending on the types of features being compared:

- For continuous-continuous cases, it calculates Pearson's R, which measures the linear correlation.
- For categorical-continuous cases, it uses the Correlation Ratio metric.
- For categorical-categorical cases, it employs either Cramer's V or Theil's U as the correlation metric.

In [ ]:
associations(df[['Mortality', 'Age Group', 'Marital Status', 'Category', 'Others', 'CO', 'Diagnosis', 'SK React', 'Max heart rate-binned']])
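
To make the categorical-categorical case less of a black box, here is a minimal sketch of Cramér's V computed from a chi-square test on a contingency table. The helper name `cramers_v` and the toy data are assumptions, and this is the plain, uncorrected formula; dython's own implementation may differ in details such as bias correction.

```python
import numpy as np
import pandas as pd
from scipy import stats


def cramers_v(x, y):
    """Uncorrected Cramér's V between two categorical sequences."""
    table = pd.crosstab(pd.Series(x), pd.Series(y))
    # Chi-square statistic of the contingency table (no continuity correction).
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))


# Hypothetical toy data: 'same' repeats x exactly, 'indep' is balanced against x.
x = ["a", "a", "b", "b"] * 25
same = ["u", "u", "v", "v"] * 25
indep = ["u", "v", "u", "v"] * 25

v_same = cramers_v(x, same)
v_indep = cramers_v(x, indep)
print(v_same, v_indep)
```

A value near 1 means one variable almost determines the other, while a value near 0 means the categories are distributed independently.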

Here we also need to delete one of the columns with a high association and save our dataset for future labs.

In [ ]:
df = df.drop(columns=['CO'])
df.to_csv('clean_df_new.csv', index=False)

The diagonal elements are always one. We will study correlation, more precisely Pearson correlation, in depth at the end of the notebook.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h3> Question  #2: </h3>

<p>Find the correlation between the following columns: BP, Resting BP, Eosinophil, and Monocyte.</p>
<p>Hint: if you would like to select those columns, use the following syntax: df[['BP', 'Resting BP', 'Eosinophil', 'Monocyte']]</p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df[['BP', 'Resting BP', 'Eosinophil', 'Monocyte']].corr()
```

</details>


<h2>Continuous Numerical Variables:</h2> 

<p>Continuous numerical variables are variables that may contain any value within some range. They can be of type "int64" or "float64". A great way to visualize these variables is by using scatterplots with fitted lines.</p>

<p>In order to start understanding the (linear) relationship between variables, we can use "regplot", which plots the scatterplot plus the fitted regression line for the data.</p>


We have chosen the "Mortality" column as the target to predict, but it only has two values: 1 and 0. Because of that, we use pairs of continuous numerical variables to show examples of plots. Let's see several examples of different linear relationships:


<h3>Positive Linear Relationship</h3>


Let's draw the scatterplot of "RBC" and "Hemoglobin".


In [ ]:
sns.regplot(x="RBC", y="Hemoglobin", data=df)
plt.ylim(0,)

<p>As the Hemoglobin goes up, the RBC (number of erythrocytes) goes up: this indicates a positive direct correlation between these two variables.</p>


We can examine the correlation between "RBC" and "Hemoglobin" and see that it's approximately 0.75.


In [ ]:
df[['RBC', 'Hemoglobin']].corr()

<h3>Negative Linear Relationship</h3>


Let's draw the scatterplot of "Thrombolysis" and "Streptokinase".


In [ ]:
sns.regplot(x="Thrombolysis", y="Streptokinase", data=df)

<p>As Thrombolysis goes up, Streptokinase goes down: this indicates an inverse/negative relationship between these two variables.</p>


We can examine the correlation between "Thrombolysis" and "Streptokinase" and see it's approximately -0.7.


In [ ]:
df[['Thrombolysis', 'Streptokinase']].corr()

<h3>Weak Linear Relationship</h3>


Let's see if "Serum cholesterol" is a predictor variable of "Lymphocyte".


In [ ]:
sns.regplot(x="Serum cholesterol", y="Lymphocyte", data=df)

<p>Cholesterol does not seem like a good predictor of the lymphocytes at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore, it's not a reliable variable.</p>


We can examine the correlation between 'Serum cholesterol' and 'Lymphocyte' and see it's approximately -0.1.


In [ ]:
df[['Serum cholesterol', 'Lymphocyte']].corr()
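
The three cases above share one pattern: the slope of the fitted line has the same sign as the Pearson coefficient, and a coefficient near zero goes with a nearly flat line. The sketch below illustrates this on synthetic data. The arrays and the helper `slope_and_r` are hypothetical, and `np.polyfit` is used here only as a stand-in for the least-squares line that "regplot" draws.

```python
import numpy as np

# Synthetic data (not the lab data): a positive, a negative and a weak relationship.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y_pos = 2.0 * x + rng.normal(size=300)
y_neg = -1.5 * x + rng.normal(size=300)
y_weak = rng.normal(size=300)


def slope_and_r(x, y):
    """Return the least-squares slope and the Pearson coefficient of y against x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    r = np.corrcoef(x, y)[0, 1]
    return slope, r


for name, y in [("positive", y_pos), ("negative", y_neg), ("weak", y_weak)]:
    slope, r = slope_and_r(x, y)
    print(f"{name}: slope = {slope:.2f}, r = {r:.2f}")
```

The slope equals the coefficient multiplied by the ratio of the standard deviations of y and x, which is why the two always agree in sign.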

 <div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  3 a): </h1>

<p>Find the correlation  between x="BP" and y="Resting BP".</p>
<p>Hint: if you would like to select those columns, use the following syntax: df[["BP","Resting BP"]].  </p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python

#The correlation is -0.13, the non-diagonal elements of the table.

df[["BP", "Resting BP"]].corr()

```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1>Question  3 b):</h1>

<p>Given the correlation results between "BP" and "Resting BP", do you expect a linear relationship?</p>
<p>Verify your results using the function "regplot()".</p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python

#There is a weak correlation between the variables 'BP' and 'Resting BP', so a linear regression will not fit well. We can demonstrate this using "regplot".

#Code: 
sns.regplot(x="BP", y="Resting BP", data=df)

```

</details>


<h3>Categorical Variables</h3>

<p>These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type "object" or "int64". A good way to visualize categorical variables is by using boxplots. Because 'Mortality' doesn't have a strong correlation with other columns, we will also look at other categorical variables to show different cases.</p>


Let's look at the relationship between "Mortality" and "Hemoglobin".


In [ ]:
sns.boxplot(x="Mortality", y="Hemoglobin", data=df)

<p>We see that the distributions of hemoglobin between the different categories have a significant overlap, so hemoglobin would not be a good predictor of mortality. Let's examine "SK React" and "RBC":</p>


In [ ]:
sns.boxplot(x="RBC", y="SK React", data=df, orient='h')

<p>Here we see that the distributions of RBC (erythrocyte count) between people who have different reactions to streptokinase are distinct enough to take RBC as a potentially good predictor of SK reaction.</p>


Let's examine "Smoking" and "Hemoglobin".


In [ ]:
sns.boxplot(x="Smoking", y="Hemoglobin", data=df)

<p>Here we see that the distribution of hemoglobin differs slightly between people who smoke and people who don't.</p>


<a id="discriptive_statistics"></a><h2>3. Descriptive Statistical Analysis</h2>


<p>Let's first take a look at the variables by utilizing a description method.</p>

<p>The <b>describe</b> function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.</p>

This will show:

<ul>
    <li>the count of that variable</li>
    <li>the mean</li>
    <li>the standard deviation (std)</li> 
    <li>the minimum value</li>
    <li>the quartiles (25%, 50% and 75%); the interquartile range (IQR) is the difference between the 75% and 25% values</li>
    <li>the maximum value</li>
</ul>


We can apply the method "describe" as follows:


In [ ]:
df.describe()
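
As a small illustration of how these numbers relate to each other, the sketch below runs "describe" on a hypothetical series with one missing value (the series `s` is an assumption, not lab data) and derives the interquartile range from the reported quartiles:

```python
import numpy as np
import pandas as pd

# Hypothetical values; the NaN at the end is skipped by describe().
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan], name="toy")

summary = s.describe()
iqr = summary["75%"] - summary["25%"]

print(summary)
print("median:", s.median())
print("IQR:", iqr)
```

The 50% row is the median, and the count reflects only the non-missing values.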

The default setting of "describe" skips variables of type object. We can apply the method "describe" on the variables of type 'object' as follows:


In [ ]:
df.describe(include=['object'])

<h3>Value Counts</h3>


<p>Value counts are a good way of understanding how many units of each characteristic/variable we have. We can apply the "value_counts" method on the column "Age Group". Here we call "value_counts" on a pandas series, so we use one bracket <code>df['Age Group']</code>, not two brackets <code>df[['Age Group']]</code>, which would select a dataframe.</p>


In [ ]:
df['Age Group'].value_counts()

We can convert the series to a dataframe as follows:


In [ ]:
df['Age Group'].value_counts().to_frame()

Let's repeat the above steps but save the results to the dataframe "age_group_counts" and rename its single column to 'value_counts'.


In [ ]:
age_group_counts = df['Age Group'].value_counts().to_frame()
# the default column name depends on the pandas version, so we set it directly
age_group_counts.columns = ['value_counts']
age_group_counts

Now let's rename the index to 'Age Group':


In [ ]:
age_group_counts.index.name = 'Age Group'
age_group_counts

We can repeat the above process for the variable 'SK React'.


In [ ]:
sk_react_counts = df['SK React'].value_counts().to_frame()
# the default column name depends on the pandas version, so we set it directly
sk_react_counts.columns = ['value_counts']
sk_react_counts.index.name = 'SK React'
sk_react_counts.head(10)

<p>Applying the same steps to the marital status shows that it would not be a good predictor variable for the mortality. This is because we only have three single patients and 365 married, so the data is heavily imbalanced. Thus, we are not able to draw any conclusions about the marital status.</p>
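
Such an imbalance is easier to judge as a share than as a raw count. The sketch below rebuilds a series with the counts quoted above (the category labels "MARRIED" and "SINGLE" are hypothetical spellings) and uses the `normalize=True` option of "value_counts" to obtain relative frequencies:

```python
import pandas as pd

# Counts taken from the text above; the label spellings are assumptions.
marital = pd.Series(["MARRIED"] * 365 + ["SINGLE"] * 3, name="Marital Status")

counts = marital.value_counts()
shares = marital.value_counts(normalize=True)

print(counts)
print(shares.round(3))
```

With more than 99% of patients in one category, the remaining category is too small to compare against.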


<a id="basic_grouping"></a><h2>4. Basics of Grouping</h2>


<p>The "groupby" method groups data by different categories. The data is grouped based on one or several variables, and analysis is performed on the individual groups.</p>

<p>For example, let's group by the variable "Age Group". We see that there are 5 different categories of age.</p>


In [ ]:
df['Age Group'].unique()

<p>If we want to know, on average, which age group has the highest mortality rate, we can group by "Age Group" and then average the mortality within each group.</p>

<p>We can select the columns 'Age Group', 'SK React' and 'Mortality', then assign it to the variable "df_group_one".</p>


In [ ]:
df_group_one = df[['Age Group','SK React','Mortality']]

We can then calculate the average mortality rate for each of the different categories of data.


In [ ]:
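# select 'Mortality' before averaging so that the text column 'SK React' is not passed to mean()
df_group_one = df_group_one.groupby(['Age Group'], as_index=False)['Mortality'].mean()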
df_group_one

<p>From our data, it seems patients of age between 41-50 have, on average, the highest mortality rate.</p>
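
A group average can rest on very few patients, so it helps to show the group size next to the mean. The following is a minimal sketch on hypothetical toy data (the values are invented for illustration only). Because "Mortality" is coded as 0 and 1, its mean within a group is the mortality rate of that group.

```python
import pandas as pd

# Hypothetical toy data, not the lab data.
toy = pd.DataFrame({
    "Age Group": ["41-50", "41-50", "51-60", "51-60", "51-60", "51-60"],
    "Mortality": [1, 1, 0, 1, 0, 0],
})

# Mean (mortality rate) and number of patients per age group.
rates = toy.groupby("Age Group")["Mortality"].agg(["mean", "count"])
print(rates)
```

In this toy example the highest rate belongs to the group with only two patients, which is a reason to read such averages with care.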

<p>You can also group by multiple variables. For example, let's group by both 'Age Group' and 'SK React'. This groups the dataframe by the unique combination of 'Age Group' and 'SK React'. We can store the results in the variable 'grouped_test1'.</p>


In [ ]:
# grouping results
df_gptest = df[['Age Group','SK React','Mortality']]
grouped_test1 = df_gptest.groupby(['Age Group','SK React'],as_index=False).mean()
grouped_test1

<p>This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the column and another along the row. We can convert the dataframe of groups to a pivot table using the method "pivot".</p>

<p>In this case, we will leave the age group variable as the rows of the table, and pivot SK reaction to become the columns of the table:</p>


In [ ]:
grouped_pivot = grouped_test1.pivot(index='Age Group',columns='SK React')
grouped_pivot

<p>Often, we won't have data for some of the pivot cells. We can fill these missing cells with the value 0, but any other value could potentially be used as well. It should be mentioned that missing data is quite a complex subject and is an entire course on its own.</p>


In [ ]:
grouped_pivot = grouped_pivot.fillna(0) #fill missing values with 0
grouped_pivot

We have a lot of 0 values because of the lack of data, but we will use this table as an example.
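
The grouping, pivoting and filling steps can also be combined in a single call to `pd.pivot_table`. This is only a sketch on hypothetical toy data (the "SK React" labels "NO" and "YES" are assumptions), shown as an alternative and not as the method used in this lab:

```python
import pandas as pd

# Hypothetical toy data; there is no ("21-30", "YES") combination on purpose.
toy = pd.DataFrame({
    "Age Group": ["21-30", "21-30", "31-40", "31-40", "31-40"],
    "SK React": ["NO", "NO", "NO", "YES", "YES"],
    "Mortality": [0, 1, 0, 1, 1],
})

# Average mortality per (Age Group, SK React); missing combinations become 0.
pivot = pd.pivot_table(
    toy,
    values="Mortality",
    index="Age Group",
    columns="SK React",
    aggfunc="mean",
    fill_value=0,
)
print(pivot)
```

Keep in mind that a filled 0 means "no data", which is different from an observed mortality rate of 0.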

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1>Question 4:</h1>

<p>Use the "groupby" function to find the average "Mortality" of each "SK React".</p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python
# grouping results
df_gptest2 = df[['SK React','Mortality']]
grouped_test_sk_react = df_gptest2.groupby(['SK React'],as_index= False).mean()
grouped_test_sk_react

```

</details>


<h4>Variables: Age group and SK reaction vs. Mortality</h4>


Let's use a heat map to visualize the relationship between Age group and SK reaction vs. Mortality.


In [ ]:
#use the grouped results
grouped_pivot = grouped_test1.pivot(index='Age Group',columns='SK React').fillna(0)
plt.pcolor(grouped_pivot, cmap='RdBu')
plt.colorbar()
plt.show()

<p>The heatmap plots the target variable (Mortality) proportional to colour with respect to the variables 'Age Group' and 'SK React' on the vertical and horizontal axis, respectively. This allows us to visualize how the mortality is related to 'Age Group' and 'SK React'.</p>

<p>The default labels convey no useful information to us. Let's change that:</p>


In [ ]:
fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')

#label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index

#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

#rotate label if too long
plt.xticks(rotation=90)

fig.colorbar(im)
plt.show()

<p>Visualization is very important in data science, and Python visualization packages provide great freedom. We will go more in-depth in a separate Python visualizations course.</p>

<p>The main question we want to answer in this module is, "What are the main characteristics which have the most impact on the mortality rate?".</p>

<p>To get a better measure of the important characteristics, we look at the association of these variables with the mortality rate. In other words: how is the mortality rate dependent on this variable?</p>


<a id="correlation_causation"></a><h2>5. Correlation and Causation</h2>


<p><b>Correlation</b>: a measure of the extent of interdependence between variables.</p>

<p><b>Causation</b>: the relationship between cause and effect between two variables.</p>

<p>It is important to know the difference between these two. Correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.</p>


<p><b>Pearson Correlation</b></p>
<p>The Pearson Correlation measures the linear dependence between two variables X and Y.</p>
<p>The resulting coefficient is a value between -1 and 1 inclusive, where:</p>
<ul>
    <li><b>1</b>: Perfect positive linear correlation.</li>
    <li><b>0</b>: No linear correlation; the two variables may still be related in a non-linear way.</li>
    <li><b>-1</b>: Perfect negative linear correlation.</li>
</ul>
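
To see where the coefficient comes from, the sketch below computes it directly from its definition (the sum of products of deviations from the means, divided by the square root of the product of the summed squared deviations) and compares the result with `scipy.stats.pearsonr`. The helper `pearson_r` and the small lists are hypothetical examples.

```python
import numpy as np
from scipy import stats


def pearson_r(x, y):
    """Pearson coefficient computed from deviations around the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])      # y grows exactly with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])    # y falls exactly with x
r_mixed = pearson_r(x, [2, 1, 4, 3, 5])    # noisy increasing pattern

print(r_up, r_down, r_mixed)
print(stats.pearsonr(x, [2, 1, 4, 3, 5])[0])
```

The two exact linear cases give the extreme values, and the hand-written formula agrees with the library result for the noisy case.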


<p>Pearson Correlation is the default method of the function "corr". Like before, we can calculate the Pearson Correlation of the 'int64' or 'float64' variables.</p>


In [ ]:
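# select the numeric columns explicitly: recent pandas versions raise an error if text columns are passed to corr
df.select_dtypes(include=['int64', 'float64']).corr()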

Sometimes we would like to know the significance of the correlation estimate.


<b>P-value</b>

<p>What is this P-value? The P-value is the probability of observing a correlation at least as strong as the one in our sample if the two variables were in fact uncorrelated. A small P-value is therefore evidence against the assumption of no correlation. Normally, we choose a significance level of 0.05: if the P-value is below 0.05, we call the correlation statistically significant.</p>

By convention, when:

<ul>
    <li>the p-value is $<$ 0.001: there is strong evidence that the correlation is significant.</li>
    <li>the p-value is $<$ 0.05: there is moderate evidence that the correlation is significant.</li>
    <li>the p-value is $<$ 0.1: there is weak evidence that the correlation is significant.</li>
    <li>the p-value is $>$ 0.1: there is no evidence that the correlation is significant.</li>
</ul>
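
These thresholds can be written down as a small helper. The function name `evidence_label` is hypothetical, and treating a p-value of exactly 0.1 as "none" is an assumption, since the convention above does not cover that boundary.

```python
def evidence_label(p_value):
    """Map a p-value to the evidence levels of the convention above."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"  # assumption: p-values of exactly 0.1 also land here


for p in [0.0005, 0.01, 0.07, 0.5]:
    print(p, "->", evidence_label(p))
```

The checks run from the strictest threshold to the loosest, so each p-value receives the strongest label it qualifies for.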


We can obtain this information using the "stats" module in the "scipy" library.


<h3>Each column vs. Mortality</h3>

Let's calculate the Pearson Correlation Coefficient and P-value between 'Mortality' and each column of float or int type.


In [ ]:
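# names of the numeric columns (this includes 'Mortality' itself)
columns = df.select_dtypes(include=['int64', 'float']).columns
for c in columns:
    # pearsonr returns the coefficient and the two-sided p-value
    pearson_coef, p_value = stats.pearsonr(df[c], df['Mortality'])
    print(c, ":\n  Pearson Correlation Coefficient is", "{:.2e}".format(pearson_coef), " with a P-value of P =", "{:.2e}".format(p_value))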

<h4>Conclusion:</h4>
<p>For the columns whose p-value is $<$ 0.001, the correlation with mortality is statistically significant, although the linear relationship isn't extremely strong. These columns are 'Diabetes', 'HTN', 'Serum cholesterol', 'Follow Up', and 'Gender-male'.</p>
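
Reading many printed lines is error-prone, so one option is to collect the coefficients and p-values in a dataframe and filter it. The sketch below does this on a small, fully hypothetical dataset in which the column "related" follows the target and "unrelated" is constructed to have zero correlation with it:

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data built from repeating patterns (not the lab data).
toy = pd.DataFrame({
    "target": [0, 1, 0, 1] * 50,
    "related": [0.0, 1.1, 0.2, 1.3] * 50,
    "unrelated": [0, 0, 1, 1] * 50,
})

rows = []
for col in ["related", "unrelated"]:
    r, p = stats.pearsonr(toy[col], toy["target"])
    rows.append({"column": col, "pearson_r": r, "p_value": p})

# One row per column, most significant first.
results = pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)
significant = results.loc[results["p_value"] < 0.001, "column"].tolist()

print(results)
print("significant at 0.001:", significant)
```

Only the column that actually follows the target passes the 0.001 threshold.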


<a id="anova"></a><h2>6. ANOVA</h2>


<h3>ANOVA: Analysis of Variance</h3>
<p>The Analysis of Variance (ANOVA) is a statistical technique employed to assess whether there are significant disparities among the means of two or more groups. ANOVA returns two parameters:</p>

<p><b>F-test score</b>: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.</p>

<p><b>P-value</b>:  P-value tells how statistically significant our calculated score value is.</p>

<p>If mortality differs strongly between the groups of the variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.</p>
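
To show what the F-test score measures, the sketch below computes it by hand as the ratio of the variation between group means to the variation within groups, and compares it with `scipy.stats.f_oneway`. The three small groups are hypothetical 0/1 outcomes, not lab data.

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 outcomes for three groups.
groups = [
    np.array([0, 0, 1, 0, 0]),
    np.array([0, 1, 1, 0, 1]),
    np.array([1, 1, 1, 0, 1]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Variation of the group means around the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_scipy = stats.f_oneway(*groups)

print("manual F:", f_manual)
print("scipy F:", f_scipy, "p-value:", p_scipy)
```

The further apart the group means are relative to the spread inside each group, the larger the score becomes.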


<h3>Age Group</h3>


<p>Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.</p>

<p>To see if different values of 'Age Group' impact 'Mortality', we group the data.</p>


In [ ]:
# group by a single column name so that get_group accepts a plain label such as '21-30'
grouped_test2 = df_gptest[['Age Group', 'Mortality']].groupby('Age Group')
grouped_test2.head(2)

In [ ]:
df_gptest

We can obtain the values of a single group using the method "get_group".


In [ ]:
grouped_test2.get_group('21-30')['Mortality']

We can use the function 'f_oneway' in the module 'stats' to obtain the <b>F-test score</b> and <b>P-value</b>.


In [ ]:
# pass the 'Mortality' values of each age group as a separate sample
f_val, p_val = stats.f_oneway(
    grouped_test2.get_group('21-30')['Mortality'],
    grouped_test2.get_group('31-40')['Mortality'],
    grouped_test2.get_group('41-50')['Mortality'],
    grouped_test2.get_group('51-60')['Mortality'],
    grouped_test2.get_group('61-70')['Mortality'],
)

print("ANOVA results: F=", "{:.2e}".format(f_val), ", P =", "{:.2e}".format(p_val))

A large F-test score together with a P-value close to 0 indicates that the mean mortality differs significantly between at least some of the age groups. But does this mean that every pair of groups differs this strongly?

Let's examine them separately.


To compare pairs of groups we can use <code>itertools</code>:

In [ ]:
values = ['21-30', '31-40', '41-50', '51-60', '61-70']
for a, b in itertools.combinations(values, 2):
    f_val, p_val = stats.f_oneway(grouped_test2.get_group(a)['Mortality'], grouped_test2.get_group(b)['Mortality'])
    print(a, "and", b, "ANOVA results: F=", "{:.2e}".format(f_val), ", P =", "{:.2e}".format(p_val))

The results show that the F and P values differ from pair to pair, so only some pairs of age groups differ significantly in mean mortality.
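
When only two groups are compared, one-way ANOVA is equivalent to the classic two-sample t-test with equal variances: the F score is the square of the t statistic and the p-values coincide. The sketch below checks this on two hypothetical 0/1 samples (invented values, not lab data).

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 outcomes for two groups.
a = np.array([0, 0, 1, 0, 0, 1, 0, 0])
b = np.array([1, 1, 0, 1, 1, 1, 0, 1])

f_val, p_f = stats.f_oneway(a, b)
# ttest_ind assumes equal variances by default (equal_var=True).
t_val, p_t = stats.ttest_ind(a, b)

print("F =", f_val, " t^2 =", t_val ** 2)
print("ANOVA p =", p_f, " t-test p =", p_t)
```

This means each pairwise comparison above can be read as a t-test between two age groups.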

<h3>Conclusion: Important Variables</h3>


<p>We now have a better idea of what our data looks like and which variables are important to take into account when predicting mortality from heart failure. We have narrowed it down to the following variables:</p>

Numerical variables (including binary indicators):

<ul>
    <li>Diabetes</li>
    <li>HTN</li>
    <li>Serum cholesterol</li>
    <li>Follow Up</li>
    <li>Gender-male</li>
</ul>

Categorical variables:

<ul>
    <li>Age Group</li>
    <li>SK React</li>
    <li>Diagnosis</li>
    <li>Others</li>
</ul>

<p>As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.</p>


### Thank you for completing this lab!

## Author

<a href="https://author.skills.network/instructors/bohdan_kuno">Bohdan Kuno</a>

### Other Contributors

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/nataliya_boyko">Ass. Prof. Nataliya Boyko, PhD</a>

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                                         |
| ----------------- | ------- | ---------- | ---------------------------------------------------------- |
|2023-03-11|01|Bohdan Kuno|Lab created|


<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
