<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# **Breast Cancer Investigation**

# Lab 6. Predicting the patient's status

## Abstract

In this lab, you analyze healthcare data with Python and pandas, focusing on breast cancer prediction. You import the required libraries, load the dataset, and work through data pre-preparation and preparation. You then gain hands-on experience with Logistic Regression, ensembles of classifiers, and Decision Trees, and apply these models to predict patients' diagnoses. The project builds your data analysis and machine learning skills and shows how models built on medical data can support breast cancer detection and treatment.

Estimated time needed: **30** minutes

## Objectives

After completing this lab you will be able to:

*   Prepare clinical data with pandas and predict a patient's status using Logistic Regression, ensembles of classifiers, and Decision Trees

The statistical data were obtained from <a href="https://www.kaggle.com/datasets/raghadalharbi/breast-cancer-gene-expression-profiles-metabric">https://www.kaggle.com/datasets/raghadalharbi/breast-cancer-gene-expression-profiles-metabric</a> under the <a href="https://opendatacommons.org/licenses/dbcl/1-0/" target="_blank">Database: Open Database, Contents: Database Contents</a> license.

The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) database is a Canada-UK Project which contains targeted sequencing data of 1,980 primary breast cancer samples. Clinical and genomic data was downloaded from cBioPortal.

The dataset was collected by Professor Carlos Caldas from Cambridge Research Institute and Professor Sam Aparicio from the British Columbia Cancer Centre in Canada and published in Nature Communications (Pereira et al., 2016). It has also been featured in multiple other papers, including in Nature.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
  <ol>
    <li><a href="#import-libraries">Import libraries</a></li>
    <li><a href="#importing-the-data">Importing the Data</a></li>
    <li><a href="#data-pre-preparation">Data pre-preparation</a></li>
    <li><a href="#data-preparation">Data preparation</a></li>
    <li><a href="#logistic-regression">Logistic Regression</a></li>
    <li><a href="#ensemble-of-classifiers">Ensemble of classifiers</a></li>
    <li><a href="#decision-trees">Decision Trees</a></li>
    <li><a href="#diagnoses-prediction">Predict patients' diagnoses</a></li>
  </ol>
</div>


## 1. Import libraries <p id = "import-libraries"></p>

In [None]:
!pip install scikit-learn

In [None]:
!pip install imblearn

In [None]:
!pip install dython

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, StandardScaler, MinMaxScaler
from sklearn.compose import make_column_transformer
from sklearn import set_config
from sklearn.model_selection import train_test_split
from imblearn.pipeline import make_pipeline
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn import tree
from sklearn.metrics import recall_score
from dython.nominal import associations
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
If an error appears, restart the kernel or run this cell again.
</div>


Let's disable warnings using **[warnings.filterwarnings()](https://docs.python.org/3/library/warnings.html)**

In [2]:
import warnings
warnings.filterwarnings('ignore')

## 2. Importing the Data <p id = "importing-the-data"></p>

Load the csv:


In [None]:
filename = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX08UPEN/METABRIC_RNA_Mutation.csv"
df = pd.read_csv(filename)

We use the method <code>head()</code> to display the first 5 rows of the dataframe:

In [None]:
df.head()

<details>
<summary><b>Click to see attribute information</b></summary>

Input features (column names):

    1. `patient_id` - Patient ID
    2. `age_at_diagnosis` - Age of the patient at diagnosis time
    3. `type_of_breast_surgery` - Breast cancer surgery type
    4. `cancer_type` - Breast cancer types
    5. `cancer_type_detailed` - Detailed Breast cancer types
    6. `cellularity` - Cancer cellularity post-chemotherapy, which refers to the number of tumor cells in the specimen and their arrangement into clusters
    7. `chemotherapy` - Whether or not the patient had chemotherapy as a treatment (yes/no)
    8. `pam50_+_claudin-low_subtype` - PAM50 is a tumor profiling test that helps show whether some estrogen receptor-positive (ER-positive), HER2-negative breast cancers are likely to metastasize (spread to other organs).
    9. `cohort` - A cohort is a group of subjects who share a defining characteristic
    10. `er_status_measured_by_ihc` - To assess if estrogen receptors are expressed on cancer cells by using immune-histochemistry
    11. `er_status` - Cancer cells are positive or negative for estrogen receptors
    12. `neoplasm_histologic_grade` - Grade determined by pathology from the nature of the cells, that is, whether they look aggressive or not
    13. `her2_status_measured_by_snp6` - Assesses whether the cancer is positive for HER2 using advanced molecular techniques
    14. `her2_status` - Whether the cancer is positive or negative for HER2
    15. `tumor_other_histologic_subtype` - Type of cancer based on microscopic examination of the cancer tissue
    16. `hormone_therapy` - Whether or not the patient had hormone therapy as a treatment (yes/no)
    17. `inferred_menopausal_state` - Whether the patient is post-menopausal or not (post/pre)
    18. `integrative_cluster` - Molecular subtype of cancer based on some gene expression
    19. `primary_tumor_laterality` - Whether it is involving the right breast or the left breast
    20. `lymph_nodes_examined_positive` - Lymph node samples taken during surgery and examined to see whether they were involved in the cancer
    21. `mutation_count` - Number of genes that have relevant mutations
    22. `nottingham_prognostic_index` - It is used to determine the prognosis following surgery for breast cancer. Its value is calculated using three pathological criteria: the size of the tumor; the number of involved lymph nodes; and the grade of the tumor.
    23. `oncotree_code` - The OncoTree is an open-source ontology that was developed at Memorial Sloan Kettering Cancer Center (MSK) for standardizing cancer-type diagnosis from a clinical perspective by assigning each diagnosis a unique OncoTree code.
    24. `overall_survival_months` - Duration from the time of the intervention to death
    25. `overall_survival` - Target variable indicating whether the patient is alive or dead
    26. `pr_status` - Cancer cells are positive or negative for progesterone receptors
    27. `radio_therapy` - Whether or not the patient had radiotherapy as a treatment (yes/no)
    28. `3-gene_classifier_subtype` - Three Gene classifier subtype
    29. `tumor_size` - Tumor size measured by imaging techniques
    30. `tumor_stage` - Stage of cancer based on the involvement of surrounding structures, lymph nodes, and distant spread

Output feature (desired target):

    31. `death_from_cancer` - Whether the patient's death was due to cancer
    
</details>

## 3. Data pre-preparation <p id="data-pre-preparation"></p>

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 1 </h1>
<b>Delete unnecessary columns from 31 to the last one:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute
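
As a hint, positional slicing with `iloc` is one way to keep only the leading columns. The sketch below uses a toy frame with hypothetical column names, not the lab dataset; in the lab the cut-off would be 31, assuming the 31 columns described in the attribute list are the ones to keep.

```python
import pandas as pd

# Toy frame standing in for the lab's df (column names are hypothetical).
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

# Keep the first n_keep columns by position and drop everything after them.
n_keep = 2
toy_trimmed = toy.iloc[:, :n_keep]
print(toy_trimmed.columns.tolist())
```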

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 2 </h1>
<b>Check for NaN and remove them using `dropna`:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute
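
A minimal sketch of counting and dropping missing values, shown on a toy frame rather than the lab dataset (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column.
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", "b", None]})

# Count missing values per column.
nan_counts = toy.isna().sum()
print(nan_counts)

# Drop every row that contains at least one missing value.
toy_clean = toy.dropna().reset_index(drop=True)
print(toy_clean)
```

Note that `dropna()` returns a new frame; the original is unchanged unless the result is assigned back.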

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 3 </h1>
<b>Build a correlation matrix for numeric columns and association heatmap for object columns:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute
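
For the numeric part, one possible approach is to select the numeric columns and call `corr()`. The sketch below uses a toy frame with hypothetical columns; the resulting matrix could then be passed to `sns.heatmap`. The association heatmap for object columns (the lab imports `associations` from dython for this) is not shown here.

```python
import pandas as pd

# Toy frame: two numeric columns and one object column (names are hypothetical).
toy = pd.DataFrame({
    "size": [1.0, 2.0, 3.0, 4.0],
    "double_size": [2.0, 4.0, 6.0, 8.0],
    "label": ["a", "b", "a", "b"],
})

# Pearson correlation matrix over the numeric columns only.
corr = toy.select_dtypes(include="number").corr()
print(corr)
```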

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 4 </h1>
<b>Remove columns that are strongly correlated with each other:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute
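
One common way to do this, sketched on toy data, is to scan the upper triangle of the absolute correlation matrix and drop one column from each strongly correlated pair. The 0.9 threshold and the column names are assumptions for illustration, not values given by the lab.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly correlated with "a"
})

threshold = 0.9  # assumed cut-off for "strongly correlated"
corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation exceeds the threshold.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
toy_reduced = toy.drop(columns=to_drop)
print(to_drop, toy_reduced.columns.tolist())
```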

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 5 </h1>
<b>Check the data types of the columns and change them where needed:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute
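
A short sketch of inspecting and converting dtypes on a toy frame. Which lab columns should be converted, and to what, is left to the exercise; the columns and target types here are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "grade": [1.0, 2.0, 3.0],   # stored as float, but holds whole numbers
    "kind": ["x", "y", "x"],    # free text that is really a category
})
print(toy.dtypes)

# astype accepts a mapping from column name to the desired dtype.
toy = toy.astype({"grade": "int64", "kind": "category"})
print(toy.dtypes)
```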

## 4. Data preparation <p id="data-preparation"></p>

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 6 </h1>
<b>Create two dataframes, one for the feature columns and one for the target column:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 7 </h1>
<b>Create transformer using `make_column_transformer`:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute
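
As an illustration of how `make_column_transformer` pairs each transformer with a list of columns, here is a sketch on toy data. The choice of `StandardScaler` for numeric columns and `OrdinalEncoder` for categorical ones is an assumption; the lab may intend a different combination.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy frame with one numeric and one categorical column (names are hypothetical).
toy = pd.DataFrame({"age": [40.0, 50.0, 60.0], "surgery": ["A", "B", "A"]})

# Each tuple is (transformer, columns it applies to).
transformer = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OrdinalEncoder(), ["surgery"]),
)

# Output columns follow the order of the tuples above.
transformed = transformer.fit_transform(toy)
print(transformed)
```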

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 8 </h1>
<b>Incorporate a train/test split with a ratio of 0.3 for our DataSet:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute
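
A sketch of separating features from the target and splitting them, on toy data. It assumes the ratio of 0.3 refers to the test share; `random_state` and `stratify` are optional extras added here for reproducibility and class balance.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with two features and a binary target (names are hypothetical).
toy = pd.DataFrame({"f1": range(10), "f2": range(10, 20), "target": [0, 1] * 5})

X_toy = toy.drop(columns="target")  # feature columns
y_toy = toy["target"]               # target column

# 30% of the rows go to the test set, the rest to the training set.
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)
print(len(X_train_toy), len(X_test_toy))
```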

## 5. Logistic Regression <p id="logistic-regression"></p>

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 9 </h1>
<b>Create a logistic regression pipeline and fit it:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 10 </h1>
<b>Calculate the accuracy of the pipeline for test and train DataSets:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute
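
A minimal sketch of a logistic regression pipeline and its train and test accuracy, using synthetic data from `make_classification` instead of the lab dataset. It uses scikit-learn's own `make_pipeline` to stay self-contained (the lab imports the imbalanced-learn variant), and `max_iter=1000` is an assumed setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the prepared features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Scale the features, then fit the classifier, as one estimator.
pipe_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe_demo.fit(X_tr, y_tr)

# score() returns mean accuracy for classifiers.
train_acc = pipe_demo.score(X_tr, y_tr)
test_acc = pipe_demo.score(X_te, y_te)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

A large gap between the two numbers would suggest overfitting.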

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 11 </h1>
<b>Add cross-validation and predict the output:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 12 </h1>
<b>Plot the confusion matrix:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute
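
Cross-validation and the confusion matrix can be sketched as follows on synthetic data. The 5-fold setting is an assumption. For plotting, scikit-learn 1.0 and later provide `ConfusionMatrixDisplay.from_predictions`; the sketch only computes the matrix so that it runs without a display.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf_demo = LogisticRegression(max_iter=1000)

# One accuracy value per fold.
scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5)

# Out-of-fold prediction for every sample.
y_pred_cv = cross_val_predict(clf_demo, X_demo, y_demo, cv=5)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_demo, y_pred_cv)
print(scores.mean())
print(cm)
```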

## 6. Ensemble of classifiers <p id="ensemble-of-classifiers"></p>

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 13 </h1>
<b>Check whether the classes in the target column are balanced:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute
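
One way to check class balance is `value_counts`, shown here on a toy target. The labels and their frequencies are made up for illustration.

```python
import pandas as pd

# Toy target with an uneven class distribution.
y_toy = pd.Series(["Living"] * 6 + ["Died of Disease"] * 3 + ["Died of Other Causes"])

counts = y_toy.value_counts()                # absolute counts per class
shares = y_toy.value_counts(normalize=True)  # relative frequencies per class
print(counts)
print(shares)
```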

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 14 </h1>
<b>Use `RandomOverSampler` to balance the number of values in the target column:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 15 </h1>
<b>Add this function to our `Pipeline` and fit the model:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 16 </h1>
<b>Evaluate `pipe_s_lr` using the `Recall` metric:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute
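
Recall is the share of actual positives that the model finds, TP / (TP + FN). A small worked example with hand-made labels (not lab output):

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0, 1]
y_hat = [1, 0, 1, 0, 1, 1]

# 3 true positives and 1 false negative give 3 / (3 + 1).
# For a multiclass target, an `average` argument such as "macro" is required.
rec = recall_score(y_true, y_hat)
print(rec)  # 0.75
```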

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 17 </h1>
<b>Plot the confusion matrix for `pipe_s_lr`:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 18 </h1>
<b>Add cross-validation for `pipe_s_lr` and predict the output:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 19 </h1>
<b>Create an ensemble of classifiers including `VotingClassifier` and calculate their accuracy:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute
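
A sketch of a hard-voting ensemble on synthetic data. The three base classifiers and their settings are assumptions chosen for brevity; the lab imports several more that could be added in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Each entry is a (name, estimator) pair.
estimators_demo = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("dt", DecisionTreeClassifier(random_state=0)),
]

# Hard voting predicts the class chosen by most base classifiers.
voting_demo = VotingClassifier(estimators=estimators_demo, voting="hard")
voting_demo.fit(X_tr, y_tr)

# Accuracy of each fitted base classifier, then of the ensemble.
# (Labels here are already 0/1, so the base estimators score directly.)
accuracies = {
    name: fitted.score(X_te, y_te)
    for name, fitted in voting_demo.named_estimators_.items()
}
accuracies["voting"] = voting_demo.score(X_te, y_te)
print(accuracies)
```

The resulting dictionary can be printed per classifier or passed to a bar chart.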

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 20 </h1>
<b>Display the accuracy of each classifier:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 21 </h1>
<b>Build a diagram of classifiers' accuracy:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute

## 7. Decision Trees <p id="decision-trees"></p>

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 22 </h1>
<b>Create a Pipeline based on Decision Tree and calculate the accuracy:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 23 </h1>
<b>Build a text visualization of decision tree:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute
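
A text rendering of a fitted tree can be produced with `sklearn.tree.export_text`. The sketch below fits a shallow tree on synthetic data; the feature names and `max_depth=2` are assumptions used to keep the output short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
feature_names_demo = ["f0", "f1", "f2", "f3"]  # hypothetical names

# A shallow tree keeps the text output readable.
tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# One line per split or leaf, indented by depth.
tree_text = export_text(tree_demo, feature_names=feature_names_demo)
print(tree_text)
```

The same fitted tree can be passed to `tree.plot_tree` for a graphical view.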

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 24 </h1>
<b>Plot decision tree using `plot_tree`:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute

## 8. Predict patients' diagnoses <p id="diagnoses-prediction"></p>

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 25 </h1>
<b>Write a function called `create_ensemble()` that builds an ensemble from a fixed number of previously defined classifiers:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 26 </h1>
<b>Write a function called `predict_patient_status()` that predicts a patient's vital status, taking a classifier and a DataFrame as input parameters:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute
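
One possible shape for the two helper functions is sketched below on toy data. The signatures, the use of a hard-voting `VotingClassifier`, and the returned `Series` are all assumptions; the lab does not specify them.

```python
import pandas as pd
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def create_ensemble(classifiers):
    """Wrap (name, estimator) pairs in a hard-voting ensemble (assumed signature)."""
    return VotingClassifier(estimators=list(classifiers), voting="hard")


def predict_patient_status(classifier, data):
    """Return one prediction per row of `data` as a Series (assumed signature)."""
    return pd.Series(classifier.predict(data), index=data.index, name="prediction")


# Toy training data with hypothetical feature names and labels.
train_toy = pd.DataFrame({"f1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
                          "f2": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]})
labels_toy = [0, 0, 0, 1, 1, 1]

ensemble_toy = create_ensemble([
    ("lr", LogisticRegression()),
    ("dt", DecisionTreeClassifier(random_state=0)),
])
ensemble_toy.fit(train_toy, labels_toy)

# New rows must have the same columns as the training data.
new_patients = pd.DataFrame({"f1": [0.5, 4.5], "f2": [1.0, 0.0]}, index=["p1", "p2"])
result = predict_patient_status(ensemble_toy, new_patients)
print(result)
```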

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 27 </h1>
<b>Create a list of classifiers:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 28 </h1>
<b>Create a new ensemble using `create_ensemble`:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question 29 </h1>
<b>Make a prediction using your new data and `predict_patient_status`:</b>
</div>

In [None]:
# Write your code below and press Shift+Enter to execute

## Authors <p id="authors"></p>

### Thank you for completing this lab!

## Author

<a href="https://author.skills.network/instructors/dmytro_shliakhovskyi">Dmytro Shliakhovskyi</a>

### Other Contributors

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/nataliya_boyko">Ass. Prof. Nataliya Boyko, PhD</a>


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                                         |
| ----------------- | ------- | ---------- | ---------------------------------------------------------- |
|    2023-04-01     | 01 | Dmytro Shliakhovskyi | Lab created |



<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>