<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode_vertical.png" width="300" alt="cognitiveclass.ai logo"  />
</center>

# **Breast Cancer Investigation**

# Lab 5. Model Evaluation and Refinement

Estimated time needed: **45** minutes

## Abstract

In this lab, explore healthcare data analysis using Python and Pandas, with a focus on breast cancer. Import libraries, load the dataset, and dive into data pre-preparation. Learn about pipeline classification, Logistic Regression, cross-validation, accuracy assessment, addressing over-sampling issues, and ensemble techniques. Join this project to enhance your data analysis and machine learning skills while unlocking the potential of medical data models for improved breast cancer prediction.

## Objectives

After completing this lab you will be able to:

* Download and prepare the dataset for analysis.
* Conduct basic data analysis and exploratory data visualization.
* Perform feature engineering and selection.
* Build and evaluate machine learning classification models.
* Create ensemble models by combining multiple classifiers.
* Calculate accuracy and analyze errors of the models.
* Implement a data analysis pipeline to streamline the process.
* Demonstrate practical applications of classifiers and ensembles in a specific domain.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#materials_and_methods">Materials and methods</a></li>
        <li><a href="#import_libraries">Import Libraries</a></li>
        <li><a href="#load_the_dataset">Load the DataSet</a></li>
        <li><a href="#data_pre_preparation">Data pre-preparation</a></li>
        <li><a href="#pipeline_classification">Pipeline Classification</a>
             <ul>
                <li><a href="#LogisticRegression">LogisticRegression</a></li>
                <li><a href="#cross_validation">Cross-validation</a></li>
                <li><a href="#accuracy">Accuracy</a></li>
            </ul>
        </li>
        <li><a href="#over_sampling_problem">Over-sampling problem</a></li>
        <li><a href="#ensemble_of_classifiers">Ensemble of classifiers</a></li>
        <li><a href="#conclusions">Conclusions</a></li>
    </ol>
</div>

## 1. Materials and methods <p id="materials_and_methods"></p>

In this lab, we will learn how to download and pre-prepare data, classify and combine classifiers into an ensemble.
This lab consists of the following steps:
* Download data - download and display data from a file
* Preliminary data preparation - preliminary analysis of data structure, change of data structure and tables
* Pipeline classification - classification and analysis by grouping stages
    * Logistic regression - classification and analysis of accuracy and errors using logistic regression
    * Over-sampling problem - solve the problem of uneven distribution of data
    * Ensemble of classifiers - study various classifiers and methods of combining them into an ensemble

The statistical data were obtained from <a href="https://www.kaggle.com/datasets/gunesevitan/breast-cancer-metabric">https://www.kaggle.com/datasets/gunesevitan/breast-cancer-metabric</a> under the <a href="https://opendatacommons.org/licenses/odbl/1-0/" target="_blank">Database: Open Database, Contents: © Original Authors</a> license.

## Prerequisites
* [Python](https://www.python.org) - middle level
* [Pandas](https://pandas.pydata.org) - middle level 
* [Matplotlib](https://matplotlib.org) - basic level
* [SeaBorn](https://seaborn.pydata.org) - basic level
* [Scikit-Learn](https://scikit-learn.org/stable/) - middle level 

## Objectives

After completing this lab, you will be able to:

* Load a DataSet from *.csv files
* Conduct basic data analysis
* Calculate new columns and change column types
* Divide the DataSet into training and test
* Use different machine learning classification methods
* Combine classifiers into ensemble
* Calculate accuracy and analyze errors
* Combine all stages of data analysis with Pipeline

## 2. Import Libraries/Define Auxiliary Functions <p id="import_libraries"></p>

Libraries such as Scikit-Learn and imbalanced-learn should be installed.

In [None]:
conda install -c intel scikit-learn

In [None]:
conda install -c conda-forge imbalanced-learn

Some libraries should be imported before you can begin.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler                         
from sklearn.compose import make_column_transformer
from sklearn import set_config
from sklearn.model_selection import train_test_split
from imblearn.pipeline import make_pipeline
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix
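try:
    from sklearn.metrics import plot_confusion_matrix
except ImportError:
    # plot_confusion_matrix was removed in scikit-learn 1.2; fall back to its replacement
    plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator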
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
#Classifiers
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn import tree
from sklearn.metrics import recall_score
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

Let's disable warnings with **[warnings.filterwarnings()](https://docs.python.org/3/library/warnings.html)**.

In [2]:
import warnings
warnings.filterwarnings('ignore')

Next, set the float display format to show two decimal places (instead of the default six) with **[pd.options.display](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html)**.

In [None]:
pd.options.display.float_format = '{:.2f}'.format

## 3. Download data from a .csv file <p id="load_the_dataset"></p>

The next step is to load the data file from the repository with **[read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)**.

We will use the same DataSet as in the previous lab, so the next few steps will be the same.

In [None]:
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX036XEN/breast_cancer.csv')

Now let's look at our DataSet.

In [None]:
df

## 4. Data pre-preparation <p id="data_pre_preparation"></p>

Let's study the DataSet. As you can see, the DataSet consists of 2509 rows × 29 columns and contains information of different types. We should make sure that Python recognized the data types correctly. To do this we should use **[pandas.DataFrame.info()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html?highlight=info#pandas.DataFrame.info)**.

In [None]:
df.info()

<details>
<summary><b>Click to see attribute information</b></summary>
    
Input features (column names):

1. `Age at Diagnosis` - Age of the patient at diagnosis time (numeric)
2. `Type of Breast Surgery` - Breast cancer surgery type (categorical: `Breast Conserving`, `Mastectomy`)
3. `Cancer Type Detailed` - Detailed Breast cancer types (categorical: `Breast`, `Breast Angiosarcoma`, `Breast Invasive Ductal Carcinoma`, `Breast Invasive Lobular Carcinoma`, `Breast Invasive Mixed Mucinous Carcinoma`, `Breast Mixed Ductal and Lobular Carcinoma`, `Invasive Breast Carcinoma`, `Metaplastic Breast Cancer`)
4. `Cellularity` - Cancer cellularity post chemotherapy, which refers to the amount of tumor cells in the specimen and their arrangement into clusters (categorical: `High`, `Low`, `Moderate`)
5. `Chemotherapy` - Whether or not the patient had chemotherapy as a treatment (yes/no) (boolean)
6. `Pam50 + Claudin-low subtype` - Pam 50: is a tumor profiling test that helps show whether some estrogen receptor-positive (ER-positive), HER2-negative breast cancers are likely to metastasize (when breast cancer spreads to other organs). (categorical: `Basal`, `Her2`, `LumA`, `LumB`, `NC`, `Normal`, `claudin-low`)
7. `Cohort` - Cohort is a group of subjects who share a defining characteristic (numeric)
8. `ER status measured by IHC` - To assess if estrogen receptors are expressed on cancer cells by using immune-histochemistry (a dye used in pathology that targets specific antigen, if it is there, it will give a color, it is not there, the tissue on the slide will be colored)(categorical: `Positve`, `Negative`)
9. `ER Status` - Cancer cells are positive or negative for estrogen receptors (categorical: `Positve`, `Negative`)
10. `Neoplasm Histologic Grade` - Determined by pathology by looking at the nature of the cells and whether they look aggressive or not (takes a value from 1 to 3) (numeric).
11. `HER2 status measured by SNP6` - To assess if the cancer positive for HER2 or not by using advance molecular techniques (Type of next generation sequencing) (categorical: `Gain`, `Loss`, `Neutral`, `Undef`)
12. `Tumor Other Histologic Subtype` - Type of the cancer based on microscopic examination of the cancer tissue (categorical: `Ductal/NST`, `Lobular`, `Medullary`, `Metaplastic`, `Mixed`, `Mucinous`, `Other`, `Tubular/ cribriform`)
13. `Hormone Therapy` - Whether or not the patient had hormonal therapy as a treatment (yes/no) (boolean)
14. `Integrative Cluster` - Molecular subtype of the cancer based on some gene expression (categorical: `1`, `2`, `3`, `4ER+`, `4ER-`, `5`, `6`, `7`,  `8`, `9`, `10`)
15. `Primary Tumor Laterality` - Whether it is involving the right breast or the left breast (categorical: `Left`, `Right`)
16. `Lymph nodes examined positive` - Lymph node samples are taken during the surgery to see whether they are involved by the cancer (numeric)
17. `Mutation Count` - Number of genes that have relevant mutations (numeric)
18. `Nottingham prognostic index` - It is used to determine prognosis following surgery for breast cancer. Its value is calculated using three pathological criteria: the size of the tumour; the number of involved lymph nodes; and the grade of the tumour. (numeric)
19. `Oncotree Code` - The OncoTree is an open-source ontology that was developed at Memorial Sloan Kettering Cancer Center (MSK) for standardizing cancer type diagnosis from a clinical perspective by assigning each diagnosis a unique OncoTree code (categorical: `BRCA`, `BREAST`, `IDC`, `ILC`, `IMMC`, `MBC`, `MDLC`, `PBS`)
20. `PR Status` - Cancer cells are positive or negative for progesterone receptors (categorical: `Positve`, `Negative`)
21. `Radio Therapy` - Whether or not the patient had radiotherapy as a treatment (yes/no) (boolean)
22. `3-Gene classifier subtype` - Three Gene classifier subtype (categorical: `ER+/HER2- High Prolif`, `ER+/HER2- Low Prolif`, `ER-/HER2-`, `HER2+`)
23. `Tumor Size` - Tumor size measured by imaging techniques (numeric)
24. `Tumor Stage` - Stage of the cancer based on the involvement of surrounding structures, lymph nodes and distant spread (numeric)
25. `Overall Survival (Years)` - Duration from the time of the intervention to death (numeric)
26. `Relapse Free Status (Years)` - Absence of any signs or symptoms of cancer recurrence or metastasis after a patient has completed treatment for breast cancer. (numeric)
27. `Nottingham prognostic index-binned` - (categorical)
28. `Inferred Menopausal State-Post` - Whether the patient is post menopausal or not (numeric)
29. `Relapse Free Status-Not Recurred` - Absence of any signs or symptoms of cancer recurrence or metastasis after a patient has completed treatment for breast cancer (numeric)


Output feature (desired target):

30. `Patient's Vital Status` - Patient's Vital Status (categorical: `Died of Disease`,`Died of Other Causes`, `Living`)
    
    </details>

Let's study information of DataSet columns. 

## 5. Pipeline Classification <p id="pipeline_classification"></p>

### LogisticRegression <p id="LogisticRegression"></p>

Before classification, the dataset must be divided into input features and the target.

In [None]:
x = df.drop(columns = ["Patient's Vital Status"])

In [None]:
y = df["Patient's Vital Status"]

In [None]:
x.info()

You can see that the input data set consists of 28 columns.

As you can see, 13 columns are categorical and the other 15 are numerical. For classification, all numerical fields must be normalized and all categorical fields must be encoded as numbers. This can be automated using **[sklearn.preprocessing.OrdinalEncoder()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)** and **[sklearn.preprocessing.StandardScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)**.

Since the machine learning process consists of several steps, each of which has methods such as `fit` and `predict`, we can combine all these stages into one block using `Pipeline` (**[sklearn.pipeline.make_pipeline()](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html)**) and **[sklearn.compose.make_column_transformer()](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html)**, and visualize it with **[sklearn.set_config()](https://scikit-learn.org/stable/modules/generated/sklearn.set_config.html)**.

In [None]:
col_cat = list(x.select_dtypes(include=['object']).columns)
col_num = list(x.select_dtypes(include=['float', 'int', 'bool']).columns)

In [None]:
trans = make_column_transformer((OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),col_cat),
                                (StandardScaler(),col_num),
                                remainder = 'passthrough')
set_config(display = 'diagram')
trans
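To see what the two transformers do to individual values, here is a minimal sketch on a made-up two-column frame. The frame `toy`, the transformer `toy_trans` and all values are illustrative assumptions, not part of the lab's dataset.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy frame with one categorical and one numeric column (values are made up)
toy = pd.DataFrame({
    "Cellularity": ["High", "Low", "Moderate", "High"],
    "Tumor Size": [10.0, 20.0, 30.0, 40.0],
})

toy_trans = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["Cellularity"]),
    (StandardScaler(), ["Tumor Size"]),
)

# Categories become integer codes; the numeric column gets zero mean and unit variance
encoded = toy_trans.fit_transform(toy)
print(encoded)

# A category never seen during fit is mapped to -1 instead of raising an error
unseen = pd.DataFrame({"Cellularity": ["Unknown"], "Tumor Size": [25.0]})
unseen_encoded = toy_trans.transform(unseen)
print(unseen_encoded)
```

The `unknown_value=-1` setting matters once the data is split: a category that appears only in the test part would otherwise make `transform` fail.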

Next, we must split the DataSet into training and test sets in order to estimate the accuracy of the models. To do this we can use **[sklearn.model_selection.train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)**. Let's use a 70/30 train/test split (`test_size = 0.3`).

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3, shuffle=False)

In [None]:
x_train.shape

In [None]:
x_test.shape
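The effect of `shuffle=False` is easy to miss, so here is a small sketch on ten made-up samples. It also shows a shuffled, stratified split as a possible alternative; the `stratify` and `random_state` arguments are not used in this lab and appear only for comparison.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with an imbalanced label, for illustration only
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# shuffle=False keeps the original row order: the last 30% of rows become the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, shuffle=False)
print(X_te.ravel(), y_te)

# A shuffled, stratified split keeps the class proportions similar in both parts
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)
print(y_te_s)
```

If the rows of a file are ordered in some meaningful way, an unshuffled split can leave the training and test parts with different class distributions, which is worth checking before comparing scores.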

Now let's create a logistic regression model (**[sklearn.linear_model.LogisticRegression()](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)**) and add it to our `Pipeline`.

In [None]:
lr = LogisticRegression()
pipe_lr = make_pipeline(trans,lr)

Let's fit our model and calculate its accuracy.

In [None]:
pipe_lr.fit(x_train,y_train)

### Cross-validation <p id="cross_validation"></p>

Cross-validation is a technique in machine learning where the available dataset is split into multiple subsets or folds, and the model is trained and tested on different subsets in a rotation. The primary purpose of cross-validation is to estimate how well the model is expected to perform when it is deployed to make predictions on new, unseen data.

One common way to implement cross-validation is by using the `cross_val_score` helper function, which takes an estimator (the model to be trained and tested) and the dataset, and returns the scores from each fold. This allows for easy evaluation and comparison of different models based on their performance metrics.

In [None]:
Rcross = cross_val_score(pipe_lr, x, y, cv=4)
print([round(val, 2) for val in Rcross])
print("The mean of the folds is", round(Rcross.mean(), 2), "and the standard deviation is", round(Rcross.std(), 2))
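The fold rotation described above can be written out by hand. The sketch below uses synthetic data from `make_classification` as a stand-in for the lab's dataset and a plain `LogisticRegression` instead of the full pipeline; it is meant only to show that the helper performs the same fit-and-score loop.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; not the breast cancer dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression()
folds = StratifiedKFold(n_splits=4)

manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    # Fit on three folds, then score (accuracy) on the held-out fold
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# The helper runs the same loop on clones of the estimator
helper_scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=folds)
print(np.round(manual_scores, 2), np.round(helper_scores, 2))
```

For a classifier, passing an integer such as `cv=4` makes `cross_val_score` use stratified folds without shuffling, which is what the explicit `StratifiedKFold(n_splits=4)` reproduces here.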

Let's use `cross_val_predict` to generate cross-validated estimates for each input data point.

In [None]:
yhat = cross_val_predict(pipe_lr, x, y,cv=4)
yhat[0:5]
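Cross-validated predictions are useful because they can be passed to any metric. A minimal sketch, assuming synthetic three-class data in place of the lab's dataset, builds a confusion matrix and a macro-averaged recall from them:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import cross_val_predict

# Synthetic three-class stand-in data; not the lab's dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Every sample is predicted by a model that did not see it during training
y_cv = cross_val_predict(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=4)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_demo, y_cv)
print(cm)
print(round(recall_score(y_demo, y_cv, average="macro"), 2))
```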

#### Accuracy <p id="accuracy"></p>

Let's calculate accuracy of this pipeline.

In [None]:
scores_train = pipe_lr.score(x_train, y_train)
scores_test = pipe_lr.score(x_test, y_test)
print('Training DataSet accuracy: {: .1%}'.format(scores_train), 'Test DataSet accuracy: {: .1%}'.format(scores_test))

Let's check the correctness of the classification with a confusion matrix, drawn by **[sklearn.metrics.ConfusionMatrixDisplay.from_estimator()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html)** (the older `plot_confusion_matrix()` helper was removed in scikit-learn 1.2), to verify these conclusions.

In [None]:
ConfusionMatrixDisplay.from_estimator(pipe_lr, x_test, y_test)
plt.show()

As the confusion matrix shows, the model predicts the patient's vital status well overall. At the same time, the errors for patients predicted to live are very large. Only 321 patients are forecast correctly. In 184 cases where the patient actually dies of the disease, the model predicts that the patient will live. However, there are 0 cases where the model predicts death from the disease for a patient who actually lives.

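The `Recall` metric measures, for each class, the share of its actual members that the model identifies correctly. With `average='macro'`, the per-class values are averaged with equal weight, so a poorly predicted class is not hidden by a larger one: **[sklearn.metrics.recall_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)**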

In [None]:
scores_train = recall_score(y_train, pipe_lr.predict(x_train), average='macro')
scores_test = recall_score(y_test, pipe_lr.predict(x_test), average='macro')
print('Training DataSet recall: {: .1%}'.format(scores_train), 'Test DataSet recall: {: .1%}'.format(scores_test))
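To make the difference between accuracy and macro-averaged recall concrete, the following sketch computes both by hand from a confusion matrix. The ten labels and predictions are invented for illustration and are not results from the lab's model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: the majority class is predicted well, the minority classes are not
y_true = ["Living"] * 6 + ["Died of Disease"] * 3 + ["Died of Other Causes"] * 1
y_pred = ["Living"] * 6 + ["Living", "Living", "Died of Disease"] + ["Living"]

labels = ["Died of Disease", "Died of Other Causes", "Living"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Recall per class = correctly predicted members / all true members of that class
per_class = cm.diagonal() / cm.sum(axis=1)
macro = per_class.mean()

# Accuracy = all correct predictions / all samples
accuracy = cm.diagonal().sum() / cm.sum()
print(dict(zip(labels, np.round(per_class, 2))), round(macro, 2), accuracy)
```

In this toy case 7 of 10 predictions are correct, yet the macro recall is only about 0.44, because two of the three classes are mostly missed.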

As this metric shows, the model performs poorly on some classes. One way to improve the metric is to enlarge the training sample for the under-represented classes. Let's analyze the class distribution.

## 6. Over-sampling problem <p id="over_sampling_problem"></p>

Let's analyze the Patient's Vital Status (**[seaborn.countplot()](https://seaborn.pydata.org/generated/seaborn.countplot.html)**):

In [None]:
sns.countplot(x = y)

As you can see, the number of `Living` patients is much greater than the number of `Died of Other Causes` patients. To balance the data set, we can use a special class: **[imblearn.over_sampling.RandomOverSampler()](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html)**:

In [None]:
ROS = RandomOverSampler()
o_x, o_y = ROS.fit_resample(x,y)
sns.countplot(x = o_y)
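Conceptually, random over-sampling duplicates rows of the smaller classes, picked at random with replacement, until every class is as large as the majority class. The sketch below reproduces that idea with NumPy and pandas on a made-up target; it is an illustration of the principle under those assumptions, not the library's implementation.

```python
import numpy as np
import pandas as pd

# Made-up imbalanced target and a dummy feature, for illustration only
y_demo = pd.Series(["Living"] * 6 + ["Died of Disease"] * 3 + ["Died of Other Causes"] * 1)
X_demo = pd.DataFrame({"feature": np.arange(len(y_demo))})

rng = np.random.default_rng(0)
target_size = y_demo.value_counts().max()

parts = []
for label in y_demo.unique():
    idx = y_demo.index[y_demo == label].to_numpy()
    # Draw extra rows with replacement until the class reaches the majority size
    extra = rng.choice(idx, size=target_size - len(idx), replace=True)
    parts.append(np.concatenate([idx, extra]))

rows = np.concatenate(parts)
X_bal, y_bal = X_demo.loc[rows], y_demo.loc[rows]
print(y_bal.value_counts())
```

No new information is created: the balanced set contains only copies of existing rows. When a sampler is placed inside an imbalanced-learn pipeline, it is applied only while fitting, so the test data is never resampled.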

Let's add this function to our `Pipeline`, fit the model and recalculate the accuracy.

In [None]:
pipe_s_lr = make_pipeline(trans, ROS, lr)
pipe_s_lr

In [None]:
pipe_s_lr.fit(x_train,y_train)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #1: </h1>

<b>Calculate the `Recall` metric for `pipe_s_lr`.</b>

</div>

In [None]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python
scores_train = recall_score(y_train, pipe_s_lr.predict(x_train), average='macro')
scores_test = recall_score(y_test, pipe_s_lr.predict(x_test), average='macro')
print('Training DataSet recall: {: .1%}'.format(scores_train), 'Test DataSet recall: {: .1%}'.format(scores_test))
```

</details>

As you can see, the balance for our dataset is barely changed.

Let's analyze the model errors.

In [None]:
ConfusionMatrixDisplay.from_estimator(pipe_s_lr, x_test, y_test)
plt.show()

As we can see, the number of false predictions for patients who will die has hardly changed. However, the error remains high when the model predicts the patient's vital status. The `Precision` metric is used to assess this kind of error.

To increase the `Recall` metric further, the model itself must be changed: the accuracy of logistic regression on unseen data is about the same as on the training data, so this model cannot provide a better fit.

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #2: </h1>

<b>Calculate the cross-validation score for the new pipeline.</b>

</div>

In [None]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python
Rcross = cross_val_score(pipe_s_lr, x, y, cv=4)
print([round(val, 2) for val in Rcross])
print("The mean of the folds is", round(Rcross.mean(), 2), "and the standard deviation is", round(Rcross.std(), 2))
```

</details>

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #3: </h1>

<b>Use `cross_val_predict` to generate cross-validated estimates for each input data point using the new pipeline.</b>

</div>

In [None]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python
yhat = cross_val_predict(pipe_s_lr, x, y,cv=4)
yhat[0:5]
```

</details>

## 7. Ensemble of classifiers <p id="ensemble_of_classifiers"></p>

Let's test other classifiers and compare the results.
We will test:
* [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regression#sklearn.linear_model.LogisticRegression)
* [Linear SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
* [Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decisiontreeclassifier#sklearn.tree.DecisionTreeClassifier)
* [Extra Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.ExtraTreeClassifier.html)
* [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforestclassifier#sklearn.ensemble.RandomForestClassifier)
* [Multi-layer Perceptron classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html?highlight=mlpclassifier#sklearn.neural_network.MLPClassifier)
* [Ada Boost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html?highlight=adaboostclassifier#sklearn.ensemble.AdaBoostClassifier)
* [Gradient Boosting for classification](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
* [Bagging classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)

Additionally, different classifiers may misclassify data in different circumstances. Combining models into an ensemble with a Voting Classifier lets them correct each other's errors.

A **[Voting Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html)** is a machine learning model that trains a collection of several models and predicts the output class by combining their predictions. Depending on the voting mode, it either takes the class chosen by most of the classifiers or averages the class probabilities returned by each classifier. The idea is to build a single model that aggregates these models' votes for each output class, rather than building separate dedicated models and determining the accuracy of each of them.

Two different voting methods are supported by Voting Classifier.

**Hard voting**: In hard voting, the predicted output class is the one that receives the greatest number of votes, i.e., the class predicted by most of the classifiers. For example, if three classifiers predict the classes A, A, and B, the majority voted for A, so the final prediction is A.

**Soft Voting**: In soft voting, the predicted output class is the one with the highest probability averaged over the classifiers. Assume that, given some input, three classifiers predict the probabilities (0.30, 0.47, 0.53) for class A and (0.20, 0.32, 0.40) for class B. Class A's average is then 0.4333, while class B's average is 0.3067. Class A is therefore the winner because it has the highest probability averaged over all classifiers.
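The arithmetic of both voting modes can be checked in a few lines of plain Python. In this sketch the probability triples are assigned to the classes so that the averages come out as the 0.4333 for class A and 0.3067 for class B quoted above; the variable names are illustrative.

```python
from collections import Counter

# Hard voting: three classifiers vote A, A and B; the most frequent label wins
votes = ["A", "A", "B"]
hard_winner = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probability each classifier assigns to each class
proba = {"A": [0.30, 0.47, 0.53], "B": [0.20, 0.32, 0.40]}
averages = {label: sum(p) / len(p) for label, p in proba.items()}
soft_winner = max(averages, key=averages.get)

print(hard_winner, soft_winner, {k: round(v, 4) for k, v in averages.items()})
```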

In [None]:
names = ["Logistic Regression", "Linear SVM",
         "Decision Tree", "Extra Tree", "Random Forest", "Neural Net", 
         "AdaBoost", "GradientBoostingClassifier", "BaggingClassifier", "VotingClassifier"]

classifiers = [
    LogisticRegression(),
    SVC(kernel="linear", C=0.025),
    DecisionTreeClassifier(max_depth=5),
    ExtraTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0),
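    # The base model is passed positionally: the keyword was renamed from
    # `base_estimator` to `estimator` in newer scikit-learn releases
    BaggingClassifier(SVC(), n_estimators=10, random_state=0)]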

est = [(str(est), est) for est in classifiers]

eclf = [VotingClassifier(
     estimators=est,
     voting='hard')]
classifiers += eclf
scores_train = []
scores_test = []
scores_train_s = []
scores_test_s = []

for name, classif in zip(names, classifiers):
    print(name,'fitting.....')
    clf = make_pipeline(trans, classif)
    clf.fit(x_train,y_train)
    score_train = recall_score(y_train, clf.predict(x_train), average='macro')
    score_test = recall_score(y_test, clf.predict(x_test), average='macro')
    scores_train.append(score_train)
    scores_test.append(score_test)
    
    clf_s = make_pipeline(trans, ROS, classif)
    clf_s.fit(x_train,y_train)
    score_train_s = recall_score(y_train, clf_s.predict(x_train), average='macro')
    score_test_s = recall_score(y_test, clf_s.predict(x_test), average='macro')
    scores_train_s.append(score_train_s)
    scores_test_s.append(score_test_s)
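The loop above uses hard voting for the ensemble. As a smaller, self-contained sketch of how the two voting modes differ in use, the example below fits both on synthetic data with three member models. The data, the member names (`lr`, `tree`, `forest`) and their settings are assumptions chosen for brevity, not the lab's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the breast cancer dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

members = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
]

# The ensemble clones its members, so the same list can be reused for both modes
hard = VotingClassifier(estimators=members, voting="hard").fit(X_tr, y_tr)
soft = VotingClassifier(estimators=members, voting="soft").fit(X_tr, y_tr)

hard_score = hard.score(X_te, y_te)
soft_score = soft.score(X_te, y_te)
soft_proba = soft.predict_proba(X_te)
print(hard_score, soft_score)
print(soft_proba[:1])
```

Soft voting needs every member to provide `predict_proba`. An `SVC` created with default settings does not, which is one reason an ensemble that contains it may be restricted to hard voting.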

Let's compare the recall of the classifiers on the unbalanced and balanced data sets.

In [None]:
res = pd.DataFrame(index = names)
res['Train'] = np.array(scores_train)
res['Test'] = np.array(scores_test)
res['Train Over Sampler'] = np.array(scores_train_s)
res['Test Over Sampler'] = np.array(scores_test_s)

res.index.name = "Classifier accuracy"
pd.options.display.float_format = '{:,.2f}'.format
res

In [None]:
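fig, ax = plt.subplots(figsize=(10, 6))
pos = np.arange(len(names))
width = 0.4
# Draw the two series side by side so that neither bar hides the other
ax.bar(pos - width / 2, scores_test, width, label='Test')
ax.bar(pos + width / 2, scores_test_s, width, label='Test Over Sampler')
ax.legend()

ax.set_title('Classifier Test Recall (macro average)')
ax.set_xlabel('Classifier')
ax.set_ylabel('Recall')

ax.set_xticks(pos)
ax.set_xticklabels(names, rotation=90)

plt.show()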

As you can see, the balanced data set leads to a sharp increase in accuracy in all classifiers. It can also be seen that the most accurate model was GradientBoostingClassifier. The ensemble of models showed better accuracy on the training data set and slightly worse on the test.

Let's display the last classifier:

In [None]:
clf_s

## 8. Conclusions <p id="conclusions"></p>

In this lab we studied how to normalize numerical data and encode categorical data. We showed how to build training and test data sets, fit different classifiers, evaluate their accuracy and analyze their errors.
We also studied how to combine classifiers into an ensemble and build a model based on Pipeline.
We compared the accuracy of different classifiers and their ensemble and showed how they can be used in medicine.

The accuracy of the decision was about 70%.

### Thank you for completing this lab!

## Author

<a href="https://author.skills.network/instructors/dmytro_shliakhovskyi">Dmytro Shliakhovskyi</a>

### Other Contributors

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/nataliya_boyko">Ass. Prof. Nataliya Boyko, PhD</a>


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                                         |
| ----------------- | ------- | ---------- | ---------------------------------------------------------- |
|    2023-03-25     | 01 | Dmytro Shliakhovskyi | Lab created |



<hr>

## <h3 align="center"> © IBM Corporation 2020. All rights reserved. </h3>