<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# **Heart Failure Prediction**

# Lab 4. Model Development

# Abstract
In this lab, you will focus on preliminary data processing techniques. You will learn how to modify data types, normalize and process categorical data, and apply various feature selection methods. Additionally, you will explore working with different classifiers and gain insights into visualizing decision trees. By the end of the lab, you will have gained the knowledge and skills required to predict the likelihood of a patient experiencing heart failure based on different models.

Estimated time needed: **30** minutes

## Objectives

1. Preprocess the data (normalize numerical features and transform categorical ones) and create a DataSet
2. Select features
3. Classify patients
4. Visualize the decision tree of a classification model

<h2>Table of Contents</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#prep">Data preparation</a></li>
    <li><a href="#select">Features selection</a></li>
    <li><a href="#classif">Classification models</a>
        <ul>
        <li><a href="#decision">Decision tree</a>
        <li><a href="#extra">Extra Trees Classifier</a>
        <li><a href="#logistic">Logistic regression</a>
        </ul>
    </li>
    <li><a href="#visualization">Visualization of the decision tree</a></li>
</ol>

</div>

<hr>


## Materials and Methods

The data we are going to use in this lab is a subset of the open-source Heart Failure Prediction Dataset (https://www.kaggle.com/datasets/asgharalikhan/mortality-rate-heart-patient-pakistan-hospital).

> This dataset is publicly available for research. 
The dataset consists of comprehensive records of heart patients, making it accessible for data scientists from various regions worldwide to work with. The data is collected from the Institute of Cardiology, a hospital located in Faisalabad, Pakistan.

In this lesson, we will try to give answers to a set of questions that may be relevant when analyzing heart failure data:

1. What are the most useful Python libraries for classification analysis?
2. How do you transform categorical data?
3. How do you create a DataSet?
4. How do you select features?
5. How do you build, fit, and visualize a classification model?

In addition, we will draw conclusions from the results of our classification analysis to predict the mortality of patients with heart disease.

[Scikit-learn](https://scikit-learn.org/stable/), previously known as scikits.learn and commonly referred to as sklearn, is an open-source machine learning library for Python. It provides a wide range of algorithms for tasks such as classification, regression, and clustering. These include popular methods like support vector machines, random forests, gradient boosting, k-means, and DBSCAN. Scikit-learn is designed to seamlessly integrate with other Python libraries for numerical and scientific computing, such as NumPy and SciPy.

In [ ]:
!conda install --yes scikit-learn==0.24.2
!conda install --yes python-graphviz

## Import Libraries

Import the libraries needed for this lab. We add some aliases to make the libraries easier to use in our code and set a default figure size for the plots. Warnings are suppressed to keep the output clean.


In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
plt.rcParams["figure.figsize"] = (8, 6)
# Data transformation
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler
# Features Selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, mutual_info_classif
# Classifiers
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
import graphviz

Next, set the pandas display option `float_format` so that floats are shown with two decimal places (instead of the default six).

In [ ]:
pd.options.display.float_format = '{:.2f}'.format

## Load the Dataset

We will use the same DataSet that we have saved in previous labs.

In [ ]:
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0MI1EN/clean_df_new.csv')
df.head(5)

In [ ]:
df.shape

As you can see, the DataSet consists of 368 rows and 47 columns. The target column is 'Mortality'. We investigated these columns in the previous labs.

<details>
<summary><b>Click to see attribute information</b></summary>

Input features (column names):

1. `Age Group` - patient age divided by groups (categorical)
2. `Marital Status` - married or single (categorical)
3. `Lifestyle` - does the patient have a healthy lifestyle (boolean)
4. `Sleep` - does the patient sleep enough? (boolean)
5. `Category` - paid or free treatment (categorical)
6. `Depression` - does a patient feel depressed? (boolean)
7. `Hyperlipidemia` - an excess of lipids or fats in your blood (boolean)
8. `Smoking` - does the patient smoke? (boolean)
9. `Diabetes` - does the patient have diabetes? (binary) 
10. `HTN` - hypertension, also known as high blood pressure (boolean) 
11. `Allergies` - does the patient have allergies? (boolean)
12. `BP` - blood pressure (float, normalized) 
13. `Thrombolysis` - uses medications or a minimally invasive procedure to break up blood clots and prevent new clots from forming (binary) 
14. `BGR` - blood glucose level (int) 
15. `CPK` - creatine phosphokinase level (int)
16. `ESR` - erythrocyte sedimentation rate (int) 
17. `WBC` - white blood cells, also known as leukocytes (int) 
18. `RBC` - red blood cells, also known as erythrocytes (float) 
19. `Hemoglobin` - hemoglobin level (float) 
20. `MCH` - mean corpuscular hemoglobin or the average amount in each of red blood cells of hemoglobin (float)
21. `MCHC` - mean corpuscular hemoglobin concentration (float)
22. `PlateletCount` - count of platelets or thrombocytes (int)
23. `Lymphocyte` - share of lymphocytes in blood (float)
24. `Monocyte` -  share of monocytes in blood (float)
25. `Eosinophil` - count of eosinophils (int)
26. `Others` - other diseases, that weren't mentioned (categorical)
27. `Diagnosis` - what is the patient's diagnosis? (float)
28. `Hypersensitivity` - does the patient have hypersensitivity? (boolean)
29. `Chest pain type` - patient's chest pain stage (int)
30. `Resting BP` - resting blood pressure (float)
31. `Serum cholesterol` - amount of total cholesterol in their blood (float)
32. `FBS` - fasting blood sugar > 120 mg/dl (binary)
33. `Resting electrocardiographic` - resting electrocardiographic results (0 = normal; 1 = having ST-T; 2 = hypertrophy) (int)
34. `Max heart rate` - patient's maximum heart rate achieved (int)
35. `Angina` - does the patient have exercise induced angina (binary)
36. `ST depression` - ST depression induced by exercise relative to rest (float)
37. `Slope` - the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping) (int)
38. `Vessels num` - number of major vessels (0-3) colored by fluoroscopy (int)
39. `Thal` - 3 = normal; 6 = fixed defect; 7 = reversible defect (int)
40. `Num` -  diagnosis of heart disease (angiographic disease status) (int)
41. `Streptokinase` - used to dissolve blood clots that have formed in the blood vessels. Does the patient take it? (binary)
42. `SK React` - what is the reaction from streptokinase (categorical)
43. `Follow up` - number of patient's visiting time (int)
44. `Max heart rate-binned` - patient's maximum heart rate achieved - binned (from Lab2) (categorical)
45. `Gender-male` - is the patient male (from Lab2)? (binary)
46. `Locality-urban` - is the patient's locality urban (from Lab2)? (binary)

Output feature (desired target):

47. `Mortality` - did the patient die of heart failure? (binary)
</details>

Our goal is to create a model for predicting mortality caused by heart failure. To do this, we must analyze and prepare the data for this type of model.

## 1. Data preparation <a id="prep"></a>

### Data transformation

First, let's check which data types pandas inferred for the features.

In [ ]:
df.info()

As you can see, all categorical features were read as `object`. We must change their type to `category`.

In [ ]:
col_cat = list(df.select_dtypes(include=['object']).columns)
col_cat

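Let's convert these columns to the `category` type and check the result.

In [ ]:
# Assign by column name so that the new dtype replaces the old one
df[col_cat] = df[col_cat].astype('category')
df.info()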

To see the unique values of a particular feature (column), we can use:

In [ ]:
df['Age Group'].unique()

As noted earlier, the dataset contains 368 objects (rows), each described by 47 features (columns), including 1 target feature (y). 7 features are categorical. Values of this type cannot be used directly for classification, so we must transform them to int or float.
To do this we can use **[LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)** and **[OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)**. These classes encode categorical values as integers: `OrdinalEncoder` is meant for input features, `LabelEncoder` for the target column.

First, we split the DataSet into input and output (target) DataSets:

In [ ]:
X = df.drop(['Mortality'], axis=1)  #input columns
y = df['Mortality']   #target column

### Encoding and Normalization

Let's see how many unique values our target column has.

In [ ]:
y.value_counts()

The target has 288 zeros and 80 ones. The classes are imbalanced (roughly 78% to 22%), but both are represented well enough to train a classifier.

Then create a list of categorical fields and transform their values into int arrays:

In [ ]:
col_cat = list(X.select_dtypes(include=['category']).columns)
oe = OrdinalEncoder()
oe.fit(X[col_cat])
X_cat_enc = oe.transform(X[col_cat])

In [ ]:
X_cat_enc

Then we must transform the array back into a DataFrame:

In [ ]:
X_cat_enc = pd.DataFrame(X_cat_enc)
X_cat_enc.columns = col_cat
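
To check which integer was assigned to which category, we can inspect the `categories_` attribute of the fitted encoder. Below is a minimal sketch on a small made-up DataFrame; the values and the names `toy`, `toy_oe`, and `mapping` are illustrative assumptions, not part of the lab data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (illustrative values, not the lab DataSet)
toy = pd.DataFrame({"Age Group": ["41-50", "21-30", "31-40", "21-30"]})

toy_oe = OrdinalEncoder()
codes = toy_oe.fit_transform(toy[["Age Group"]])

# categories_ holds one array per encoded column;
# the position of a value in that array is its integer code
mapping = {cat: i for i, cat in enumerate(toy_oe.categories_[0])}
print(mapping)
print(codes.ravel())
```

String categories are sorted before codes are assigned, so the age groups receive codes in ascending order here.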

Numerical fields can have different scales and can contain negative values. This can lead to rounding errors and exceptions in some machine learning methods (for example, `chi2` requires non-negative inputs). To avoid this, these features must be normalized.

Let's create a list of numerical fields and normalize them using **[MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)**:

In [ ]:
col_num = list(X.select_dtypes(include=['float', 'int', 'bool']).columns)
scaler = MinMaxScaler(feature_range=(0, 1))
X_num_enc = scaler.fit_transform(X[col_num])

In [ ]:
X_num_enc

As in the previous case, transform the resulting array back into a DataFrame:

In [ ]:
X_num_enc = pd.DataFrame(X_num_enc)
X_num_enc.columns = col_num
X_num_enc
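
For the range (0, 1), min-max scaling maps each value to `(x - min) / (max - min)`, computed per column. The sketch below checks this by hand on made-up readings and shows that `inverse_transform` restores the original units; the array `values` and the other names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up readings for a single numerical column
values = np.array([[90.0], [120.0], [150.0]])

toy_scaler = MinMaxScaler(feature_range=(0, 1))
scaled = toy_scaler.fit_transform(values)

# The same transformation written out by hand
manual = (values - values.min(axis=0)) / (values.max(axis=0) - values.min(axis=0))

# inverse_transform maps scaled values back to the original units
restored = toy_scaler.inverse_transform(scaled)
print(scaled.ravel(), restored.ravel())
```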

Then we concatenate these DataFrames into one input DataFrame:

In [ ]:
x_enc = pd.concat([X_cat_enc, X_num_enc], axis=1)
x_enc
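
As an alternative to encoding the two groups separately and concatenating them, scikit-learn's `ColumnTransformer` can apply both transformations in one step. This is only a sketch on a made-up DataFrame, not the approach used in the rest of the lab; the names `toy`, `cat_cols`, `num_cols`, and `toy_enc` are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Toy data (illustrative values)
toy = pd.DataFrame({
    "Age Group": ["41-50", "21-30", "31-40"],
    "BP": [100.0, 140.0, 120.0],
    "Diabetes": [0, 1, 1],
})
cat_cols = ["Age Group"]
num_cols = ["BP", "Diabetes"]

ct = ColumnTransformer([
    ("cat", OrdinalEncoder(), cat_cols),  # categorical columns -> integer codes
    ("num", MinMaxScaler(), num_cols),    # numerical columns -> range [0, 1]
])

# Output columns follow the order of the transformers: categorical first
toy_enc = pd.DataFrame(ct.fit_transform(toy),
                       columns=cat_cols + num_cols,
                       index=toy.index)
print(toy_enc)
```

Note that the column order of the result (categorical first, then numerical) differs from the column order of the input, exactly as with `pd.concat` above.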

Our target column is already binary (0 and 1), so we don't need to encode it.

## 2. Features selection <a id="select"></a>

As noted before, the input consists of 46 features. Some of them are more significant for classification than others.

There are two popular feature selection techniques that can be used for categorical input data and a categorical (class) target variable.

They are:

* Chi-Squared Statistic.
* Mutual Information Statistic.

Both can be applied through **[SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)**.

Let's take a closer look at each in turn.

### Chi-Squared Statistic

Pearson's chi-squared statistical hypothesis test serves as an instance of a test used to determine independence between categorical variables.

You can learn more about this statistical test in the tutorial:

[A Gentle Introduction to the Chi-Squared Test for Machine Learning](https://machinelearningmastery.com/chi-squared-test-for-machine-learning/)

The results of this test can be used for feature selection, where those features that are independent of the target variable can be removed from the dataset.

The scikit-learn machine learning library provides an implementation of the chi-squared test in the **[chi2()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2)** function. This function can be used in a feature selection strategy, such as selecting the top k most relevant features (largest values) via the SelectKBest class.

For example, we can configure SelectKBest to use the chi2() function and keep either all features (`k='all'`) or only the most significant ones.

Apply the SelectKBest class to extract the 10 best features:

In [ ]:
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(x_enc, y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x_enc.columns)  # same column order as the data passed to fit()

Concatenate the two DataFrames for better visualization:

In [ ]:
featureScores = pd.concat([dfcolumns, dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(featureScores.nlargest(10,'Score'))  #print 10 best features
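
The scores must be paired with the column names of the data that was passed to `fit()`. One way to make this hard to get wrong is to index the scores by the columns directly and to read the selected names from `get_support()`. A minimal sketch on made-up data (the names `toy_X`, `toy_y`, and `selector` are assumptions):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy binary target and three non-negative features
toy_y = pd.Series([0] * 10 + [1] * 10)
toy_X = pd.DataFrame({
    "a": toy_y.values,   # identical to the target
    "b": [1] * 20,       # constant
    "c": [0, 1] * 10,    # alternates independently of the target
})

selector = SelectKBest(score_func=chi2, k=1).fit(toy_X, toy_y)

# Label each score with the column it was computed for
scores = pd.Series(selector.scores_, index=toy_X.columns)

# get_support() returns a boolean mask over the input columns
selected = list(toy_X.columns[selector.get_support()])
print(scores.sort_values(ascending=False))
print(selected)
```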

### Mutual Information Statistic

Mutual information, a concept from information theory, is the application of information gain (commonly used in decision tree construction) to feature selection.

Mutual information quantifies the reduction in uncertainty of one variable when a known value of another variable is present. It calculates the dependency between two variables based on their shared information content.

[You can learn more about mutual information in the following tutorial.](https://machinelearningmastery.com/information-gain-and-mutual-information)

The scikit-learn machine learning library provides an implementation of mutual information for feature selection via the **[mutual_info_classif()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif)** function.

Like chi2(), it can be used in the SelectKBest feature selection strategy (and other strategies).

In [ ]:
bestfeatures = SelectKBest(score_func=mutual_info_classif, k=10)
fit = bestfeatures.fit(x_enc, y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x_enc.columns)  # same column order as the data passed to fit()
featureScores = pd.concat([dfcolumns, dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(featureScores.nlargest(10,'Score'))  #print 10 best features
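
To build some intuition for the scores, here is a small sketch with two made-up discrete features: one identical to a balanced binary target and one independent of it. For the first, the mutual information equals the entropy of the target, ln 2 (in nats); for the second it is 0. The names `toy_X`, `toy_y`, and `mi` are assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

toy_y = np.array([0] * 10 + [1] * 10)
toy_X = np.column_stack([
    toy_y,                # identical to the target
    np.tile([0, 1], 10),  # independent of the target
])

# discrete_features=True: treat every column as discrete.
# random_state only matters for continuous features, where a small
# random noise is added before the nearest-neighbor estimate.
mi = mutual_info_classif(toy_X, toy_y, discrete_features=True, random_state=0)
print(mi, np.log(2))
```

For continuous features the scores can change slightly between runs unless `random_state` is fixed.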

As you can see, these two scoring functions rank different features as the most significant.

### Feature Importance

You can get the importance of each feature of your DataFrame from the feature importance attribute of a fitted classification model. It gives a score for each feature of your data: the higher the score, the more important or relevant the feature is to your output variable. For example, feature importance is a built-in attribute of **[Tree Based Classifiers](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)**. We will use the Extra Trees Classifier to extract the top 10 features of the dataset.

Let's create and fit the model:

In [ ]:
model = ExtraTreesClassifier()
model.fit(x_enc, y)

Use the built-in attribute `feature_importances_` of tree-based classifiers:

In [ ]:
print(model.feature_importances_)

Let's transform it into a `Series` and plot a graph of features' importance for better visualization

In [ ]:
feat_importances = pd.Series(model.feature_importances_, index=x_enc.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
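
Extra Trees builds its trees with random splits, so the importances (and the resulting top 10) can change from run to run. Fixing `random_state` makes the result reproducible. The sketch below illustrates this on synthetic data from `make_classification`; the data and the names `X_toy`, `imp_a`, and `imp_b` are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data (not the lab DataSet)
X_arr, y_arr = make_classification(n_samples=200, n_features=6,
                                   n_informative=3, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

def fitted_importances(seed):
    """Fit an Extra Trees model with the given seed and return its importances."""
    clf = ExtraTreesClassifier(random_state=seed).fit(X_toy, y_arr)
    return pd.Series(clf.feature_importances_, index=X_toy.columns)

imp_a = fitted_importances(0)
imp_b = fitted_importances(0)  # same seed -> same importances
print(imp_a.sort_values(ascending=False))
```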

You can see that the feature importances of the Extra Trees Classifier differ from the previous rankings. This means that there are no exact rules for feature selection: the importance of a feature depends strongly on the model.

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    <h1>Question 1:</h1>
    <p>Plot a graph of the 5 least important features.</p>
</div>

In [ ]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python
feat_importances.nsmallest(5).plot(kind='barh')
plt.show()

```

</details>

### Correlation Matrix with Heatmap

Correlation describes the relationship between features in a dataset. It can be positive, indicating that an increase in one feature corresponds to an increase in another, or negative, suggesting that an increase in one feature leads to a decrease in another. By utilizing the seaborn library, we can create a heatmap that visually highlights the most closely related features to another variable. This heatmap enables us to easily identify the strength and direction of correlations within the dataset.

In [ ]:
corrmat = x_enc.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(35,35))
g=sns.heatmap(x_enc[top_corr_features].corr(),annot=True,cmap="RdYlGn")
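
With 46 features the heatmap is hard to read, so it can help to list the most strongly correlated pairs as numbers as well. A minimal sketch on made-up data, keeping only the upper triangle of the matrix so that each pair appears once (the names `toy`, `corr`, and `pairs` are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # almost a copy of "a"
    "c": rng.normal(size=200),                     # unrelated noise
})

corr = toy.corr().abs()

# Keep only entries above the diagonal: no self-correlations, no duplicates
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(ascending=False)
print(pairs)
```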

We already removed strongly correlated columns in the previous lab, so we can use all of these features.

## 3. Classification models<a id="classif"></a>

### Decision tree <a id="decision"></a>

### Build model

Ensemble models such as the Extra Trees Classifier used above can be accurate. However, their biggest drawback is that the decision cannot easily be visualized or justified.

Decision trees are widely employed in supervised learning for several reasons. They can be utilized for both regression and classification tasks, eliminating the need for feature scaling. Additionally, decision trees offer relatively straightforward interpretability, as they can be visualized. This visualization not only aids in understanding the model but also facilitates communication regarding its functioning. Therefore, it is valuable to learn how to create visualizations based on your model.

A **[Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)** is a supervised machine learning algorithm that employs a binary tree graph, where each node has two children, to assign a target value to each data sample. The target values are located in the tree's leaves. The sample traverses through nodes, starting from the root node, to reach the leaf node. At each node, a decision is made to determine which descendant node the sample should proceed to. This decision is based on the features of the selected sample. Decision Tree learning involves identifying the optimal rules in each internal tree node based on a chosen metric.

This method also allows us to calculate the features' importance. Let's calculate it, choose the most important features, refit the model, and visualize the decision tree.

In [ ]:
model_dec_tree = DecisionTreeClassifier()
model_dec_tree.fit(x_enc, y)
yhat = model_dec_tree.predict(x_enc)
accuracy = accuracy_score(y, yhat)
print('Accuracy: %.2f' % (accuracy*100))

Create a helper function that fits the given classification model and calculates its accuracy. Note that the accuracy is measured on the same data the model was trained on:

In [ ]:
def model_ac(x, y, model):
    model.fit(x, y)
    yhat = model.predict(x)
    accuracy = accuracy_score(y, yhat)
    return accuracy

Now let's create a helper function that returns the feature importances of an already fitted classification model, sorted in descending order, and store the result in a variable.

In [ ]:
def model_imp(x, y, model):
    feat_importances = pd.Series(model.feature_importances_, index=x.columns)
    return feat_importances.sort_values(ascending=False)
imp = model_imp(x_enc, y, model_dec_tree)
print(imp)

Plot graph of feature importances for better visualization

In [ ]:
imp.nlargest(10).plot(kind='barh')
plt.show()

Build a plot that shows how the accuracy of the model depends on the number of input features.

In [ ]:
col = []
ac = []
for c in imp.index:
    col.append(c)
    ac.append(model_ac(x_enc[col], y, model_dec_tree))
    print('Input fields: ', len(col), 'Accuracy: %.2f' % (ac[-1]*100))
ac = pd.Series(ac)
ac.plot()

We can see that 5 features are enough to reach 100% accuracy on the training data. So let's create a list of these 5 features in order to use them for our next classification models.

In [ ]:
col = imp.nlargest(5).index
col

Let's refit the model on the most important features:

In [ ]:
X_most_imp = x_enc[col]
model_dec_tree.fit(X_most_imp, y)
yhat = model_dec_tree.predict(X_most_imp)
accuracy = accuracy_score(y, yhat)
print('Accuracy: %.2f' % (accuracy*100))
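
The accuracy above is computed on the same rows the tree was fitted on, and an unpruned decision tree can usually memorize its training data. To estimate how a model behaves on unseen patients, part of the data can be held out. The sketch below is not part of the lab's workflow: it uses synthetic data from `make_classification` with a similar size and class balance, and all names in it are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data: 368 rows, about 78% / 22% classes
X_syn, y_syn = make_classification(n_samples=368, n_features=5,
                                   n_informative=3, n_redundant=0,
                                   weights=[0.78, 0.22], random_state=0)

# Hold out 25% of the rows; stratify keeps the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print('Train accuracy: %.2f' % (train_acc * 100))
print('Test accuracy: %.2f' % (test_acc * 100))
```

The same split can be applied to `X_most_imp` and `y` to compare training and held-out accuracy for the models in this lab.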

### Extra Trees Classifier <a id="extra"></a>

**[Extra Trees Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)** is an ensemble machine learning algorithm that belongs to the family of decision tree-based classifiers. It is an extension of the Random Forest algorithm, but with some differences in the way it builds and combines decision trees. Extra Trees creates a large number of random decision trees and combines their predictions through voting to make the final classification. It introduces additional randomness in the tree-building process by selecting random features and thresholds at each node, which can lead to improved generalization and robustness against overfitting. Overall, Extra Trees Classifier is known for its simplicity, efficiency, and effectiveness in handling high-dimensional datasets.

Let's create and fit ExtraTreesClassifier on the most important features and calculate the accuracy of classification:

In [ ]:
model = ExtraTreesClassifier()
model.fit(X_most_imp, y)

Use the model to obtain predictions:

In [ ]:
yhat = model.predict(X_most_imp)
print(yhat)

Evaluate the accuracy: 

In [ ]:
accuracy = accuracy_score(y, yhat)
print('Accuracy: %.2f' % (accuracy*100))

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    <h1>Question 2:</h1>
    <p>Create a variable that contains the features sorted by importance in descending order for the Extra Trees model (using the 5 most important features).</p>
</div>

In [ ]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python
imp = model_imp(X_most_imp, y, model)
print(imp)

```

</details>

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    <h1>Question 3:</h1>
    <p>Build a plot that shows how the accuracy of the Extra Trees model depends on the number of input features (using the 5 most important features).</p>
</div>

In [ ]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python
col = []
ac = []
for c in imp.index:
    col.append(c)
    ac.append(model_ac(X_most_imp[col], y, model))
    print('Input fields: ', len(col), 'Accuracy: %.2f' % (ac[-1]*100))
ac = pd.Series(ac)
ac.plot()

```

</details>

### Logistic regression <a id="logistic"></a>

There are many different techniques for scoring features and selecting features based on scores; how do you know which one to use?

A robust approach is to evaluate models using different feature selection methods (and numbers of features) and select the method that results in a model with the best performance.

**[Logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)** serves as a suitable model for evaluating feature selection methods because it can potentially yield improved performance when irrelevant features are eliminated from the model. We will utilize this model in a manner identical to the previous one, following the same approach and methodology.

In [ ]:
model = LogisticRegression(solver='lbfgs')
model.fit(X_most_imp, y)
yhat = model.predict(X_most_imp)
accuracy = accuracy_score(y, yhat)
print('Accuracy: %.2f' % (accuracy*100))

As we can see, the accuracy of the Logistic Regression model is lower (about 80%).
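
To put this number in context, it helps to compare it with a baseline that always predicts the most frequent class. With 288 zeros and 80 ones, such a baseline is already right for 288 of 368 patients. A small sketch using scikit-learn's `DummyClassifier` (the arrays below only reproduce the class counts, not the real features):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Same class counts as the target column: 288 zeros and 80 ones
y_counts = np.array([0] * 288 + [1] * 80)
X_dummy = np.zeros((len(y_counts), 1))  # features are ignored by the baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_counts)
baseline_acc = accuracy_score(y_counts, baseline.predict(X_dummy))
print('Baseline accuracy: %.2f' % (baseline_acc * 100))  # prints 78.26
```

An accuracy of about 80% is therefore only slightly better than always predicting class 0, which is worth keeping in mind when comparing models on imbalanced data.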

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
    <h1>Question 4:</h1>
    <p>Calculate the accuracy of the Logistic Regression model using all features.</p>
</div>

In [ ]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python
model = LogisticRegression(solver='lbfgs')
model.fit(x_enc, y)
yhat = model.predict(x_enc)
accuracy = accuracy_score(y, yhat)
print('Accuracy: %.2f' % (accuracy*100))

```

</details>

## 4. Visualization of the decision tree<a id="visualization"></a>

Let's visualize the decision tree.
There are several ways to do it.

### _Text visualization_

In [ ]:
text_representation = tree.export_text(model_dec_tree)
print(text_representation)
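
By default the text output labels the features as `feature_0`, `feature_1`, and so on. Passing `feature_names` makes the rules readable. A minimal sketch on a tiny made-up tree (the data and the names `toy_tree` and `toy_text` are assumptions):

```python
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Tiny toy problem: one encoded feature, classes separable between 1 and 2
X_toy = [[0], [1], [2], [3]]
y_toy = [0, 0, 1, 1]
toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# feature_names must list the columns in the order used for fitting
toy_text = tree.export_text(toy_tree, feature_names=["Age Group"])
print(toy_text)
```

For the lab's model, the matching call would pass `feature_names=list(X_most_imp.columns)`.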

You can save it into the file:

In [ ]:
with open("decision_tree.log", "w") as fout:
    fout.write(text_representation)

### _Plot tree_

You can plot a tree using two different ways:

**[plot_tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html)** (slow render - this can take some time): 


In [ ]:
fig = plt.figure(figsize=(25,20))
# Take the names from the training columns so they match the fitted tree
_ = tree.plot_tree(model_dec_tree,
                   feature_names=list(X_most_imp.columns),
                   filled=True)

This tree is meant to help a doctor assess whether a patient is at risk of dying of heart failure. However, it is hard to use for that purpose because the data here is normalized, so the thresholds are not in real units. Let's create a model where the data isn't normalized.

In [ ]:
model_real = DecisionTreeClassifier()
X_most_imp_real = df[X_most_imp.columns].copy()  # copy, so the assignment below does not act on a view of df

oe = OrdinalEncoder()
oe.fit(X_most_imp_real[['Age Group']])
X_most_imp_real['Age Group'] = oe.transform(X_most_imp_real[['Age Group']]) 

model_real.fit(X_most_imp_real, y)
yhat = model_real.predict(X_most_imp_real)
accuracy = accuracy_score(y, yhat)
print('Accuracy: %.2f' % (accuracy*100))

As you can see, we also use `OrdinalEncoder()` to encode the categorical column ('Age Group' in this case), because scikit-learn's DecisionTreeClassifier cannot be fitted on categorical values. After that, we can fit the model and predict our target column.

Let's plot what we've got

In [ ]:
fig = plt.figure(figsize=(25,20))
# Take the names from the training columns so they match the fitted tree
_ = tree.plot_tree(model_real,
                   feature_names=list(X_most_imp_real.columns),
                   filled=True)

Save the picture

In [ ]:
fig.savefig('decision_tree.png')

Now our decision tree is much easier to read. But how do we read the 'Age Group' column? We need to map the numerical codes back to the categories. We have 5 categories and 5 numerical values: <br>
'21-30' - 0, <br>
'31-40' - 1, <br>
'41-50' - 2, <br>
'51-60' - 3, <br>
'61-70' - 4 <br>
So 'Age Group <= 2.5' means that the patient belongs to one of the first three categories ('21-30', '31-40', '41-50').<br>
<h3>How to make a decision</h3>
<p>If the expression in a node is true, we move to the left child; if it is false, we move to the right child. The thresholds below come from one particular run, so the values in your tree may differ. For example, suppose we have a patient with the following data: <br>
    'Age Group' - '51-60', <br>
    'Serum cholesterol' - 8.0, <br>
    'Gender-male' - 0, <br>
    'Hemoglobin' - 8.0, <br>
    'Diabetes' - 1 <br>
'Age Group <= 2.5' - false; 'Gender-male <= 0.5' - true; 'Serum cholesterol <= 7.913' - false; 'Hemoglobin <= 8.006' - true. We reach a leaf of class 1: the model predicts that this patient will die of heart failure.</p>
<p>As another example, we can take real data from our DataSet. Let's take the patient with index 0 and check whether the prediction matches the recorded outcome: <br>
    'Age Group' - '41-50', <br>
    'Serum cholesterol' - 10-58, <br>
    'Gender-male' - 0, <br>
    'Hemoglobin' - 7.2, <br>
    'Diabetes' - 1 <br>
'Age Group <= 2.5' - true; 'Hemoglobin <= 9.992' - true; 'Diabetes <= 0.5' - false; 'Hemoglobin <= 8.192' - true; 'Hemoglobin <= 6.951' - false. We reach a leaf of class 0: the model predicts that this patient won't die of heart failure. In the DataSet, 'Mortality' of the patient with index 0 is also 0, so the prediction is correct, and we have learned how to predict the mortality of patients using a decision tree. </p>
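
The manual walk described above can also be done in code. A fitted scikit-learn tree stores its structure in the `tree_` attribute: `feature` and `threshold` describe the test in each node, and `children_left` / `children_right` give the next node. The sketch below follows one sample from the root to a leaf on a small made-up tree; the data and the helper name `walk` are assumptions, not part of the lab.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: encoded 'Age Group' and 'Hemoglobin' (illustrative values)
names = ["Age Group", "Hemoglobin"]
X_toy = np.array([[0, 7.0], [1, 9.5], [3, 8.0], [4, 12.0], [2, 6.5], [4, 7.5]])
y_toy = np.array([0, 0, 1, 0, 0, 1])
toy_clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

def walk(clf, sample, names):
    """Follow one sample from the root to a leaf and return the leaf's class."""
    t = clf.tree_
    node = 0
    while t.children_left[node] != -1:  # -1 marks a leaf
        f, thr = t.feature[node], t.threshold[node]
        go_left = sample[f] <= thr      # true -> left child, false -> right child
        print(f"{names[f]} <= {thr:.3f} - {'true' if go_left else 'false'}")
        node = t.children_left[node] if go_left else t.children_right[node]
    # The leaf's class is the one with the largest value in that leaf
    return clf.classes_[np.argmax(t.value[node][0])]

results = [walk(toy_clf, row, names) for row in X_toy]
print(results)
```

The class returned by the walk is the same one that `predict` returns for that sample.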

Instead of `plot_tree`, you can also export the tree with **[export_graphviz](https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html)** and draw it with the python-graphviz library. This way is faster.

In [ ]:
# Take the names from the training columns so they match the fitted tree
dot_data = tree.export_graphviz(model_real,
                                feature_names=list(X_most_imp_real.columns),
                                filled=True)

After creation you can draw the graph

In [ ]:
graph = graphviz.Source(dot_data, format="png") 
graph

And render it into the file:

In [ ]:
graph.render("decision_tree_graphviz")

## Conclusions

In this lab, we learned to do preliminary data processing, in particular to change data types, normalize numerical data, and process categorical data. We saw how to select features by different methods, how to work with different classifiers, and how to visualize a decision tree. As a result, we showed how to predict, on the basis of a statistical database, whether a patient will die of heart failure.

The accuracy of the Decision Tree and Extra Trees classifiers was 100%, and the accuracy of Logistic Regression was about 80%. Both figures were measured on the data used for training.

### Thank you for completing this lab!

## Author

<a href="https://author.skills.network/instructors/bohdan_kuno">Bohdan Kuno</a>

### Other Contributors

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/nataliya_boyko">Ass. Prof. Nataliya Boyko, PhD</a>

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                                         |
| ----------------- | ------- | ---------- | ---------------------------------------------------------- |
|2023-03-18|01|Bohdan Kuno|Lab created|


<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
