<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# Predicting the patient's status

Estimated time needed: **30** minutes

## Objectives

After completing this lab you will be able to:

*   Apply your data analysis skills with confidence to a real clinical dataset

The statistical data was obtained from <a href="https://www.kaggle.com/datasets/raghadalharbi/breast-cancer-gene-expression-profiles-metabric">https://www.kaggle.com/datasets/raghadalharbi/breast-cancer-gene-expression-profiles-metabric</a> under the <a href="https://opendatacommons.org/licenses/dbcl/1-0/" target="_blank">Database: Open Database, Contents: Database Contents</a> license.

The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) database is a Canada-UK Project which contains targeted sequencing data of 1,980 primary breast cancer samples. Clinical and genomic data was downloaded from cBioPortal.

The dataset was collected by Professor Carlos Caldas from Cambridge Research Institute and Professor Sam Aparicio from the British Columbia Cancer Centre in Canada and published in Nature Communications (Pereira et al., 2016). It has also been featured in multiple other papers, including in Nature.

<h4>You will need the following libraries</h4>

In [1]:
# !pip install scikit-learn

In [2]:
# !pip install imblearn

In [3]:
# !pip install dython

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, StandardScaler, MinMaxScaler
from sklearn.compose import make_column_transformer
from sklearn import set_config
from sklearn.model_selection import train_test_split
from imblearn.pipeline import make_pipeline
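from sklearn.metrics import ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; keep the old name as an alias
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator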
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn import tree
from sklearn.metrics import recall_score
from dython.nominal import associations
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
If an error appears, uncomment and run the <code>pip install</code> cells above, then restart the kernel or run this block again.
</div>


Let's disable warnings using **[warnings.filterwarnings()](https://docs.python.org/3/library/warnings.html)**

In [None]:
import warnings
warnings.filterwarnings('ignore')

<b>Importing the Data</b>


Load the csv:


In [None]:
filename = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX08UPEN/METABRIC_RNA_Mutation.csv"
df = pd.read_csv(filename)

We use the method  <code>head()</code>  to display the first 5 rows of the dataframe:

In [None]:
df.head()

<details>
<summary><b>Click to see attribute information</b></summary>

Input features (column names):

    1. `patient_id` - Patient ID
    2. `age_at_diagnosis` - Age of the patient at diagnosis time
    3. `type_of_breast_surgery` - Breast cancer surgery type
    4. `cancer_type` - Breast cancer types
    5. `cancer_type_detailed` - Detailed Breast cancer types
    6. `cellularity` - Cancer cellularity post-chemotherapy, which refers to the number of tumor cells in the specimen and their arrangement into clusters
    7. `chemotherapy` - Whether or not the patient had chemotherapy as a treatment (yes/no)
    8. `pam50_+_claudin-low_subtype` - PAM50 is a tumor profiling test that helps show whether some estrogen receptor-positive (ER-positive), HER2-negative breast cancers are likely to metastasize (spread to other organs).
    9. `cohort` - A cohort is a group of subjects who share a defining characteristic
    10. `er_status_measured_by_ihc` - To assess if estrogen receptors are expressed on cancer cells by using immune-histochemistry
    11. `er_status` - Cancer cells are positive or negative for estrogen receptors
    12. `neoplasm_histologic_grade` - Determined by pathology by looking at the nature of the cells, do they look aggressive or not
    13. `her2_status_measured_by_snp6` - To assess if cancer positive for HER2 or not by using advanced molecular techniques
    14. `her2_status` - Whether the cancer is positive or negative for HER2
    15. `tumor_other_histologic_subtype` - Type of cancer based on microscopic examination of the cancer tissue
    16. `hormone_therapy` - Whether or not the patient had hormone therapy as a treatment (yes/no)
    17. `inferred_menopausal_state` - Whether the patient is post-menopausal or not (post/pre)
    18. `integrative_cluster` - Molecular subtype of cancer based on some gene expression
    19. `primary_tumor_laterality` - Whether it is involving the right breast or the left breast
    20. `lymph_nodes_examined_positive` - Lymph node samples taken during surgery and examined to see whether they were involved in the cancer
    21. `mutation_count` - Number of genes that have relevant mutations
    22. `nottingham_prognostic_index` - It is used to determine the prognosis following surgery for breast cancer. Its value is calculated using three pathological criteria: the size of the tumor; the number of involved lymph nodes; and the grade of the tumor.
    23. `oncotree_code` - The OncoTree is an open-source ontology that was developed at Memorial Sloan Kettering Cancer Center (MSK) for standardizing cancer-type diagnosis from a clinical perspective by assigning each diagnosis a unique OncoTree code.
    24. `overall_survival_months` - Duration from the time of the intervention to death
    25. `overall_survival` - Target variable indicating whether the patient is alive or dead
    26. `pr_status` - Cancer cells are positive or negative for progesterone receptors
    27. `radio_therapy` - Whether or not the patient had radiotherapy as a treatment (yes/no)
    28. `3-gene_classifier_subtype` - Three Gene classifier subtype
    29. `tumor_size` - Tumor size measured by imaging techniques
    30. `tumor_stage` - Stage of cancer based on the involvement of surrounding structures, lymph nodes, and distant spread

Output feature (desired target):

    31. `death_from_cancer` - Whether the patient's death was due to cancer
    
</details>
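
The Nottingham prognostic index mentioned in the attribute list is conventionally computed as 0.2 × tumor size in cm + lymph node stage (1 to 3) + histologic grade (1 to 3). The following is a minimal sketch of that formula. The helpers `lymph_node_stage` and `nottingham_index` are hypothetical names, and the code assumes the tumor size is given in millimetres; it is not guaranteed to reproduce the values stored in the dataset column.

```python
def lymph_node_stage(positive_nodes):
    """Map the number of positive lymph nodes to the NPI node stage (1-3)."""
    if positive_nodes == 0:
        return 1
    if positive_nodes <= 3:
        return 2
    return 3


def nottingham_index(tumor_size_mm, positive_nodes, grade):
    """NPI = 0.2 * size in cm + node stage + grade (hypothetical helper)."""
    size_cm = tumor_size_mm / 10.0  # assumption: the size is given in mm
    return 0.2 * size_cm + lymph_node_stage(positive_nodes) + grade


example_npi = nottingham_index(tumor_size_mm=20, positive_nodes=0, grade=2)
print(round(example_npi, 2))
```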

<b>Question 1:</b> Keep the first 31 columns and delete all the remaining, unnecessary ones:


In [None]:
df = df.iloc[:,:31]

In [None]:
df.head()

<b>Question 2:</b> Check for NaN and remove them using `dropna`:


In [None]:
df.isnull().values.any()

In [None]:
df = df.dropna()
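
Before dropping rows it can be useful to see where the missing values are and how many rows are lost. The sketch below uses a small made-up frame named `toy` (not the lab's `df`) to show the idea:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the lab's df (values are made up)
toy = pd.DataFrame({
    "age": [50.0, np.nan, 61.0, 47.0],
    "stage": [2.0, 1.0, np.nan, 3.0],
    "grade": ["High", "Low", "High", "Low"],
})

missing_per_column = toy.isnull().sum()   # NaN count for each column
rows_before = len(toy)
toy_clean = toy.dropna()                  # drop every row with at least one NaN
rows_after = len(toy_clean)

print(missing_per_column)
print(f"Dropped {rows_before - rows_after} of {rows_before} rows")
```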

<b>Question 3:</b> Build a correlation matrix for numeric columns and association heatmap for object columns:


In [None]:
corr = df.select_dtypes(include='number').corr()  # correlate numeric columns only; object columns are handled below
sns.heatmap(corr, linewidths=.5)

In [None]:
col_obj = list(df.select_dtypes(include=['object']).columns)

In [None]:
associations(df[col_obj], annot=False)

<b>Question 4:</b> Remove columns that are strongly correlated with each other:


In [None]:
df = df.drop(["patient_id", "cancer_type", "oncotree_code", "overall_survival"], axis=1)

<b>Question 5:</b> Check the data type of the columns and change their data type:

In [None]:
df.dtypes

In [None]:
df[['chemotherapy',"hormone_therapy", "radio_therapy"]] = df[['chemotherapy',"hormone_therapy", "radio_therapy"]].astype(bool)

<b>Question 6:</b> Create two dataframes for the feature column and the target column:

In [None]:
x = df.drop(columns = ["death_from_cancer"])
y = df["death_from_cancer"]

<b>Question 7:</b> Create a transformer using `make_column_transformer`:

In [None]:
col_cat = list(x.select_dtypes(include=['object']).columns)
col_num = list(x.select_dtypes(include=['float', 'int', 'bool']).columns)

trans = make_column_transformer((OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),col_cat),
                                (StandardScaler(),col_num),
                                remainder='passthrough')
set_config(display = 'diagram')
trans
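
To see what such a transformer produces, here is a small sketch on made-up data. The names `toy_x`, `toy_cat`, `toy_num` and `toy_trans` are illustrative only. It shows two things: the output columns follow the order of the transformers (encoded categorical columns first, then scaled numeric columns), and a category that was not seen during fitting is encoded as `-1`.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

toy_x = pd.DataFrame({
    "size": [10.0, 20.0, 30.0],
    "grade": ["Low", "High", "Low"],
    "age": [40.0, 50.0, 60.0],
})
toy_cat = ["grade"]
toy_num = ["size", "age"]

toy_trans = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), toy_cat),
    (StandardScaler(), toy_num),
)
toy_out = toy_trans.fit_transform(toy_x)

# Output columns follow the transformer order: categorical first, then numeric
print(toy_out.shape)
print(toy_out[:, 0])

# A category unseen during fit is encoded as -1 instead of raising an error
unseen = pd.DataFrame({"size": [20.0], "grade": ["Medium"], "age": [50.0]})
unseen_out = toy_trans.transform(unseen)
print(unseen_out[0])
```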

<b>Question 8:</b> Split the dataset into training and test sets, holding out 30% of the rows for testing:

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3, shuffle=False)
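
With `shuffle=False` the test set is simply the last 30% of the rows. The sketch below, on made-up arrays, contrasts this with a shuffled, stratified split that keeps class proportions similar in both parts. The stratified variant is shown only as an alternative; it is not what the lab uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with imbalanced labels (made up for illustration)
toy_features = np.arange(20).reshape(10, 2)
toy_labels = np.array(["a"] * 6 + ["b"] * 4)

# shuffle=False keeps the row order: the test set is the last 30% of the rows
_, test_ordered, _, y_test_ordered = train_test_split(
    toy_features, toy_labels, test_size=0.3, shuffle=False)

# stratify keeps the class proportions similar in both parts
_, test_strat, _, y_test_strat = train_test_split(
    toy_features, toy_labels, test_size=0.3, stratify=toy_labels, random_state=0)

print(y_test_ordered)
print(sorted(y_test_strat))
```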

<b>Question 9:</b> Create a logistic regression pipeline and fit it:

In [None]:
lr = LogisticRegression()
pipe_lr = make_pipeline(trans,lr)
pipe_lr.fit(x_train,y_train)

<b>Question 10:</b> Calculate the accuracy of the pipeline for test and train DataSets:

In [None]:
scores_train = pipe_lr.score(x_train, y_train)
scores_test = pipe_lr.score(x_test, y_test)
print('Training DataSet accuracy: {: .1%}'.format(scores_train), 'Test DataSet accuracy: {: .1%}'.format(scores_test))

<b>Question 11:</b> Add cross-validation and predict the output:

In [None]:
Rcross = cross_val_score(pipe_lr, x, y, cv=4)
print(np.around(Rcross, decimals=2))
print("The mean of the folds is", round(Rcross.mean(), 2), "and the standard deviation is" , round(Rcross.std(), 2))

yhat = cross_val_predict(pipe_lr, x, y,cv=4)
yhat[0:5]
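
By default `cross_val_score` reports the estimator's own score, which is accuracy for classifiers. A different metric can be requested through the `scoring` parameter. The sketch below uses synthetic data from `make_classification` as a stand-in for the lab's `x` and `y`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced two-class data standing in for the lab's x and y
toy_X, toy_y = make_classification(
    n_samples=200, n_features=5, weights=[0.8, 0.2], random_state=0)

model = LogisticRegression(max_iter=1000)
# Default scoring for a classifier is accuracy
acc_folds = cross_val_score(model, toy_X, toy_y, cv=4)
# Macro recall averages the per-class recall, so each class counts equally
rec_folds = cross_val_score(model, toy_X, toy_y, cv=4, scoring="recall_macro")

print("accuracy per fold:    ", acc_folds.round(2))
print("macro recall per fold:", rec_folds.round(2))
```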

<b>Question 12:</b> Plot the confusion matrix:

In [None]:
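from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix (removed in scikit-learn 1.2)
ConfusionMatrixDisplay.from_estimator(pipe_lr, x_test, y_test)
plt.show()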

<b>Question 13:</b> Check whether the classes in the target column are balanced:

In [None]:
sns.countplot(x = y)

<b>Question 14:</b> Use `RandomOverSampler` to balance the number of values in the target column:

In [None]:
ROS = RandomOverSampler()
o_x, o_y = ROS.fit_resample(x,y)
sns.countplot(x = o_y)
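
Conceptually, random over-sampling duplicates randomly chosen rows of the smaller classes until every class is as large as the biggest one. The following hand-rolled sketch on made-up data illustrates the idea with pandas only; it is not the implementation used by `imblearn`.

```python
import pandas as pd

# Toy imbalanced target (made-up values)
toy = pd.DataFrame({
    "feature": range(8),
    "target": ["Living"] * 6 + ["Died"] * 2,
})

majority_size = toy["target"].value_counts().max()

parts = []
for label, group in toy.groupby("target"):
    n_extra = majority_size - len(group)
    if n_extra > 0:
        # Duplicate randomly chosen rows of the smaller class (with replacement)
        extra = group.sample(n=n_extra, replace=True, random_state=0)
        group = pd.concat([group, extra])
    parts.append(group)

balanced = pd.concat(parts, ignore_index=True)
print(balanced["target"].value_counts())
```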

<b>Question 15:</b> Add the over-sampler to our `Pipeline` and fit the model:

In [None]:
pipe_s_lr = make_pipeline(trans, ROS, lr)
pipe_s_lr.fit(x_train,y_train)
pipe_s_lr

<b>Question 16:</b> Evaluate `pipe_s_lr` using the `Recall` metric (macro-averaged):

In [None]:
scores_train = recall_score(y_train, pipe_s_lr.predict(x_train), average='macro')
scores_test = recall_score(y_test, pipe_s_lr.predict(x_test), average='macro')
print('Training DataSet recall: {: .1%}'.format(scores_train), 'Test DataSet recall: {: .1%}'.format(scores_test))
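
Macro-averaged recall is the unweighted mean of the per-class recalls, so a rarely occurring class weighs as much as a frequent one. The small worked example below, on made-up labels, computes it by hand with NumPy to show how it can differ from accuracy on imbalanced data. It corresponds to what `recall_score(..., average='macro')` reports.

```python
import numpy as np

# Made-up labels: 8 samples of class "A", 2 of class "B"
y_true = np.array(["A"] * 8 + ["B"] * 2)
# A model that predicts "A" for everything except one "B"
y_pred = np.array(["A"] * 8 + ["A", "B"])

accuracy = float(np.mean(y_true == y_pred))

recalls = {}
for label in np.unique(y_true):
    mask = y_true == label
    # Recall of a class: share of its true samples that were predicted correctly
    recalls[str(label)] = float(np.mean(y_pred[mask] == label))

macro_recall = float(np.mean(list(recalls.values())))
print(accuracy, recalls, macro_recall)
```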

<b>Question 17:</b> Plot the confusion matrix for `pipe_s_lr`:

In [None]:
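from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix (removed in scikit-learn 1.2)
ConfusionMatrixDisplay.from_estimator(pipe_s_lr, x_test, y_test)
plt.show()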

<b>Question 18:</b> Add cross-validation for `pipe_s_lr` and predict the output:

In [None]:
Rcross = cross_val_score(pipe_s_lr, x, y, cv=4)
print(np.around(Rcross, decimals=2))
print("The mean of the folds is", round(Rcross.mean(), 2), "and the standard deviation is" , round(Rcross.std(), 2))

yhat = cross_val_predict(pipe_s_lr, x, y,cv=4)
yhat[0:5]

<b>Question 19:</b> Fit a set of classifiers, including a `VotingClassifier` ensemble, with and without over-sampling, and calculate their macro recall:

In [None]:
names = ["Logistic Regression", "Linear SVM",
         "Decision Tree", "Extra Tree", "Random Forest", "Neural Net", 
         "AdaBoost", "GradientBoostingClassifier", "BaggingClassifier", "VotingClassifier"]

classifiers = [
    LogisticRegression(),
    SVC(kernel="linear", C=0.025),
    DecisionTreeClassifier(max_depth=5),
    ExtraTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0),
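    # the base estimator is passed positionally: the keyword was renamed from base_estimator to estimator in scikit-learn 1.2
    BaggingClassifier(SVC(), n_estimators=10, random_state=0)]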

est = [(str(est), est) for est in classifiers]

eclf = [VotingClassifier(
     estimators=est,
     voting='hard')]
classifiers += eclf
scores_train = []
scores_test = []
scores_train_s = []
scores_test_s = []

for name, classif in zip(names, classifiers):
    print(name,'fitting.....')
    clf = make_pipeline(trans, classif)
    clf.fit(x_train,y_train)
    score_train = recall_score(y_train, clf.predict(x_train), average='macro')
    score_test = recall_score(y_test, clf.predict(x_test), average='macro')
    scores_train.append(score_train)
    scores_test.append(score_test)
    
    clf_s = make_pipeline(trans, ROS, classif)
    clf_s.fit(x_train,y_train)
    score_train_s = recall_score(y_train, clf_s.predict(x_train), average='macro')
    score_test_s = recall_score(y_test, clf_s.predict(x_test), average='macro')
    scores_train_s.append(score_train_s)
    scores_test_s.append(score_test_s)
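
With `voting='hard'`, the ensemble predicts the label that most of its member classifiers predict. The sketch below reproduces this majority vote with NumPy on made-up predictions. `majority_vote` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import numpy as np

# Made-up predictions of three classifiers for four samples
predictions = np.array([
    ["Living", "Died",   "Living", "Died"],    # classifier 1
    ["Living", "Living", "Died",   "Died"],    # classifier 2
    ["Died",   "Living", "Living", "Died"],    # classifier 3
])


def majority_vote(column):
    """Return the most frequent label; ties go to the label that sorts first."""
    labels, counts = np.unique(column, return_counts=True)
    return labels[np.argmax(counts)]


ensemble_pred = [str(majority_vote(predictions[:, i]))
                 for i in range(predictions.shape[1])]
print(ensemble_pred)
```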

<b>Question 20:</b> Display the score (macro recall) of each classifier:

In [None]:
res = pd.DataFrame(index = names)
res['Train'] = np.array(scores_train)
res['Test'] = np.array(scores_test)
res['Train Over Sampler'] = np.array(scores_train_s)
res['Test Over Sampler'] = np.array(scores_test_s)

res.index.name = "Classifier macro recall"
pd.options.display.float_format = '{:,.2f}'.format
res

<b>Question 21:</b> Build a bar chart of the classifiers' test scores:

In [None]:
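fig, ax = plt.subplots(figsize=(10, 6))
pos = np.arange(len(names))
width = 0.4
# Draw the two series side by side so that neither hides the other
ax.bar(pos - width / 2, scores_test, width, label='Test')
ax.bar(pos + width / 2, scores_test_s, width, label='Test Over Sampler')

ax.set_title('Classifier Test Scores (macro recall)')
ax.set_xlabel('Classifier')
ax.set_ylabel('Macro recall')
ax.set_xticks(pos)
ax.set_xticklabels(names, rotation=90)
ax.legend()

plt.show()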

<b>Question 22:</b> Create a Pipeline based on Decision Tree and calculate the accuracy:

In [None]:
dtr = DecisionTreeClassifier(max_depth=3)
pipe_s_dtr = make_pipeline(trans, ROS, dtr)
pipe_s_dtr.fit(x_train,y_train)
scores_train = pipe_s_dtr.score(x_train, y_train)
scores_test = pipe_s_dtr.score(x_test, y_test)
print('Training DataSet accuracy: {: .1%}'.format(scores_train), 'Test DataSet accuracy: {: .1%}'.format(scores_test))

<b>Question 23:</b> Build a text visualization of the decision tree:

In [None]:
text_representation = tree.export_text(pipe_s_dtr['decisiontreeclassifier'])
print(text_representation)

<b>Question 24</b>: Plot the decision tree using `plot_tree`:

In [None]:
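fig = plt.figure(figsize=(25,20))
dt_model = pipe_s_dtr['decisiontreeclassifier']
_ = tree.plot_tree(dt_model,
               # the transformer outputs the categorical columns first, then the numeric ones
               feature_names = col_cat + col_num,
               # classes_ holds the labels in the order the tree uses them
               class_names = [str(c) for c in dt_model.classes_],
               filled = True)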

<b>Question 25:</b> Write a function called `create_ensemble()` that builds and fits a voting ensemble from a given list of classifiers:

In [None]:
def create_ensemble(classifiers, x_train, y_train, x_test, y_test, trans):
    est = [(str(est), est) for est in classifiers]
    eclf = VotingClassifier(estimators=est, voting='hard')
    clf = make_pipeline(trans, eclf)
    clf.fit(x_train, y_train)
    score_train = recall_score(y_train, clf.predict(x_train), average='macro')
    score_test = recall_score(y_test, clf.predict(x_test), average='macro')
    print("Macro recall of ensemble Train: ", round(score_train, 2))
    print("Macro recall of ensemble Test: ", round(score_test, 2))
    return clf

<b>Question 26:</b> Write a function called `predict_patient_status()` that predicts the patient's vital status, taking a fitted classifier and a DataFrame as input parameters:

In [None]:
def predict_patient_status(classifier, data):
    res = pd.DataFrame(classifier.predict(data), columns=["death_from_cancer"])
    return res

<b>Question 27:</b> Create a list of classifiers:

In [None]:
classifiers_list = [
    DecisionTreeClassifier(max_depth=5),
    ExtraTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0),
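    # the base estimator is passed positionally: the keyword was renamed from base_estimator to estimator in scikit-learn 1.2
    BaggingClassifier(SVC(), n_estimators=10, random_state=0)]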

<b>Question 28:</b> Create a new ensemble using `create_ensemble`:

In [None]:
ensemble = create_ensemble(classifiers_list, x_train, y_train, x_test, y_test, trans)

<b>Question 29:</b> Make a prediction for new data using `predict_patient_status`:

In [None]:
new_data = [[70, "MASTECTOMY", "Breast Invasive Ductal Carcinoma", "High", False, "LumA", 1, "Positve", "Positive", 2, "NEUTRAL", "Negative", 
            "Ductal/NST", True, "Pre", "3", "Right", 8, 2, 5, 40, "Negative", True, "ER-/HER2-", 20, 2],
            [48, "MASTECTOMY", "Breast Invasive Ductal Carcinoma", "High", True, "LumB", 1, "Positve", "Positive", 2, "NEUTRAL", "Negative", 
            "Ductal/NST", True, "Post", "3", "Right", 1, 2, 4, 83, "Positive", False, "", 15, 2]]

df1 = pd.DataFrame(data=new_data, columns=x.columns)
predict_patient_status(ensemble, df1)

### Thank you for completing this lab!

## Author

<a href="https://author.skills.network/instructors/dmytro_shliakhovskyi">Dmytro Shliakhovskyi</a>

### Other Contributors

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/nataliya_boyko">Ass. Prof. Nataliya Boyko, PhD</a>


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                                         |
| ----------------- | ------- | ---------- | ---------------------------------------------------------- |
|    2023-04-01     | 01 | Dmytro Shliakhovskyi | Lab created |



<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>