# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Breast Cancer Risk Prediction</p>


<a id="contents_tabel"></a> 
<h3 align="left"><font color=purple>Table of Contents:</font></h3>

* [Step 1 | Import Libraries](#import)
* [Step 2 | Read Dataset](#read)
* [Step 3 | Sanity check of data](#check)
* [Step 4 | Dataset Overview](#overview)
    - [Step 4.1 | Dataset Basic Information](#basic)
    - [Step 4.2 | Summary Statistics for Numerical Variables](#num_statistics)
    - [Step 4.3 | Summary Statistics for Categorical Variables](#cat_statistics)
* [Step 5 | Missing Value Treatment](#missing)    
* [Step 6 | Categorical Features Encoding](#encoding)
* [Step 7 | EDA](#eda)
    - [Step 7.1 | Univariate Analysis](#univariate)
    - [Step 7.2 | Bivariate Analysis](#bivariate)
        - [Step 7.2.1 | Numerical Features vs Overall Survival Status](#num_target)
        - [Step 7.2.2 | Categorical Features vs Overall Survival Status](#cat_target)
* [Step 8 | Data Preprocessing](#preprocessing)
    - [Step 8.1 | Outlier Treatment](#outlier)
    - [Step 8.2 | Transforming Skewed Features](#transform)   
* [Step 9 | Survival Analysis](#survival)
    - [Step 9.1 | Kaplan-Meier Survival Curve](#kp)
* [Step 10 | Decision Tree Model Building](#dt)    
* [Step 11 | Random Forest Model Building](#rf)
    - [Step 11.1 | RF Base Model Definition](#rf_base)
    - [Step 11.2 | RF Hyperparameter Tuning](#rf_hp)
    - [Step 11.3 | RF Model Evaluation](#rf_eval)
* [Step 12 | Logistic Regression Model Building](#logistic)
    - [Step 12.1 | Logistic Base Model Definition](#logistic_base)
    - [Step 12.2 | Logistic Hyperparameter Tuning](#logistic_hp)
    - [Step 12.3 | Logistic Model Evaluation](#logistic_eval)
* [Step 13 | SVM Model Building](#svm)
    - [Step 13.1 | SVM Base Model Definition](#svm_base)
    - [Step 13.2 | SVM Hyperparameter Tuning](#svm_hp)
    - [Step 13.3 | SVM Model Evaluation](#svm_eval)
* [Step 14 | Conclusion](#conclusion)
* [Step 15 | Prediction](#prediction)

<a id="import"></a>
# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 1 | Import Libraries</p>
 [Table of Contents](#contents_tabel)

In [1]:
import numpy as np
import pandas as pd
from matplotlib.colors import ListedColormap
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from scipy.stats import skew
from scipy.stats import boxcox
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from lifelines import KaplanMeierFitter
from lifelines import CoxPHFitter
from lifelines.statistics import proportional_hazard_test
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

import ipywidgets as widgets
from IPython.display import display, HTML
from lifelines.statistics import multivariate_logrank_test   
from lifelines.statistics import logrank_test

from lifelines.utils import concordance_index as cindex

In [2]:
# Set the resolution of the plotted figures
plt.rcParams['figure.dpi'] = 100

# Configure Seaborn plot styles: set the background color and use the dark grid
sns.set(rc={'axes.facecolor': '#faded9'}, style='darkgrid')

<a id="read"></a> 
# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 2 | Read Dataset</p>
 [Table of Contents](#contents_tabel)

In [3]:
# Read the dataset (adjust the path so that it points to your local copy of the CSV file)
df = pd.read_csv("C:Breast Cancer METABRIC.csv")
df

FileNotFoundError: [Errno 2] No such file or directory: 'C:Breast Cancer METABRIC.csv'
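
The error above means that pandas could not find the file at the given path. On Windows, a path written as `C:name.csv` with no separator after the colon is resolved relative to the current directory of drive C, so it only works when the notebook runs from the folder that holds the file. A minimal sketch of a check that can be run before `pd.read_csv`; the helper name `resolve_csv` is hypothetical and not part of the original notebook:

```python
from pathlib import Path

def resolve_csv(path_str):
    """Return the absolute path of an existing file, or raise a clear error."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        # Report the working directory, which is where relative paths are resolved
        raise FileNotFoundError(
            f"{path} not found; current working directory is {Path.cwd()}"
        )
    return path.resolve()
```

Passing the returned path to `pd.read_csv` then fails early with a message that shows where the file was searched for.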

In [None]:
#head 
df.head()

In [None]:
#tail 
df.tail()


<a id="check"></a> 
# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 3 | Sanity check of data</p>
 [Table of Contents](#contents_tabel)

In [None]:
#shape
df.shape

In [None]:
#info
df.info()

In [None]:
#finding missing values 
df.isnull().sum()

In [None]:
#finding duplicates 
df.duplicated().sum()

In [None]:
#identifying garbage values
for i in df.select_dtypes(include= "object").columns:
    print(df[i].value_counts())
    print("***"*10) 

<a id="overview"></a>
# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 4 | Dataset Overview</p>
 [Table of Contents](#contents_tabel)

<a id="basic"></a>
# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 4.1 | Dataset Basic Information</p>



In [None]:
df.describe()

<div style="border-radius:10px; padding: 15px; ; font-size:100%; text-align:left">

<h2 align="left"><font color=white>Dataset Description:</font></h2>

| Feature | Description |
|---|---|
| __Patient ID:__ | Unique identifier for each patient. |
| __Age at Diagnosis:__ | Age of the patient when diagnosed with cancer. |
| __Type of Breast Surgery:__ | The type of surgery performed on the breast, such as mastectomy or lumpectomy. |
| __Cancer Type:__ | General classification of the cancer type (e.g., invasive ductal carcinoma). |
| __Cancer Type Detailed:__ | More specific classification of the cancer type. |
| __Cellularity:__ | The degree of cellularity of the tumor, often used in pathology to describe the proportion of cells versus other components in a tissue sample. |
| __Chemotherapy:__ | Indicates whether the patient received chemotherapy (Yes/No). |
| __Pam50 + Claudin-low subtype:__ | Subtypes based on gene expression profiling, including Pam50 and Claudin-low classifications. |
| __Cohort:__ | The group or study cohort to which the patient belongs. |
| __ER status measured by IHC:__ | Estrogen receptor status as measured by immunohistochemistry (IHC) (e.g., positive or negative). |
| __ER Status:__ | Estrogen receptor status (e.g., positive, negative). |
| __Neoplasm Histologic Grade:__ | Histologic grade of the neoplasm, indicating how much the tumor cells differ from normal cells. |
| __HER2 status measured by SNP6:__ | HER2 (human epidermal growth factor receptor 2) status measured by SNP (single nucleotide polymorphism) analysis. |
| __HER2 Status:__ | HER2 receptor status (e.g., positive, negative). |
| __Tumor Other Histologic Subtype:__ | Other histologic subtypes of the tumor not covered by the main classifications. |
| __Hormone Therapy:__ | Indicates whether the patient received hormone therapy (Yes/No). |
| __Inferred Menopausal State:__ | Menopausal state inferred from age and clinical criteria (e.g., premenopausal, postmenopausal). |
| __Integrative Cluster:__ | Classification based on integrative clustering of genomic data. |
| __Primary Tumor Laterality:__ | The side of the body where the primary tumor is located (left or right). |
| __Lymph nodes examined positive:__ | Number of lymph nodes that tested positive for cancer. |
| __Mutation Count:__ | Total number of genetic mutations identified in the tumor. |
| __Nottingham prognostic index:__ | Prognostic score based on tumor size, lymph node status, and histologic grade. |
| __Oncotree Code:__ | A code that represents the type of cancer based on the OncoTree classification. |
| __Overall Survival (Months):__ | The overall survival time of the patient in months. |
| __Overall Survival Status:__ | Indicates whether the patient is alive or deceased. |
| __PR Status:__ | Progesterone receptor status (e.g., positive, negative). |
| __Radio Therapy:__ | Indicates whether the patient received radiotherapy (Yes/No). |
| __Relapse Free Status (Months):__ | Time in months the patient remained free from cancer relapse. |
| __Relapse Free Status:__ | Indicates whether the patient has had a relapse of cancer (Yes/No). |
| __Sex:__ | The sex of the patient (male or female). |
| __3-Gene classifier subtype:__ | Subtypes based on the expression of three specific genes. |
| __Tumor Size:__ | Size of the primary tumor. |
| __Tumor Stage:__ | Stage of the tumor, indicating the extent of cancer spread. |
| __Patient's Vital Status:__ | Indicates whether the patient is alive or deceased at the last follow-up. |

</div>


In [None]:
# Display a concise summary of the dataframe
df.info()

# Inferences:
### Number of Entries: 
The dataset consists of 2509 entries, ranging from index 0 to 2508.

### Columns: 
There are 34 columns in the dataset corresponding to various attributes of the patients and results of tests.

### Data Types:
There are 24 columns of the object data type and 10 columns of the float data type.


<a id="num_statistics"></a>
# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 4.2 | Summary Statistics for Numerical Variables</p>
 [Table of Contents](#contents_tabel)

In [None]:
# Get the summary statistics for numerical variables
df.describe()

##  Numerical Features:
####    __`Age at Diagnosis`__: Age of the patient when diagnosed with cancer.
####    __`Lymph nodes examined positive`__: Number of lymph nodes that tested positive for cancer.
####   __`Mutation Count`__: Total number of genetic mutations identified in the tumor.
####    __`Tumor Size`__: Size of the primary tumor.

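Note: Judging by the data types and the feature descriptions, six columns ('Cohort', 'Neoplasm Histologic Grade', 'Nottingham prognostic index', 'Overall Survival (Months)', 'Relapse Free Status (Months)', 'Tumor Stage') are stored as floats but are handled as categorical features in this notebook, so they are converted to the string (object) data type. Strictly speaking, only 'Cohort', 'Neoplasm Histologic Grade', and 'Tumor Stage' are categorical in meaning; 'Nottingham prognostic index', 'Overall Survival (Months)', and 'Relapse Free Status (Months)' are measured on a continuous scale, so treating them as categories discards their magnitude.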

<a id="cat_statistics"></a>
# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 4.3 | Summary Statistics for Categorical Variables</p>
 [Table of Contents](#contents_tabel)

In [None]:
# Get the summary statistics for categorical variables
df.describe(include='object')

In [None]:
df.dtypes

In [None]:
object_columns = df.select_dtypes(include=['object']).columns 
print(object_columns)

In [None]:
float_columns = df.select_dtypes(include="float64").columns
print(float_columns)

In [None]:
# Define the continuous features
continuous_features = ['Age at Diagnosis', 'Lymph nodes examined positive', 'Mutation Count','Tumor Size']

# Identify the features to be converted to object data type
features_to_convert = [feature for feature in df.columns if feature not in continuous_features]

# Convert the identified features to object data type
df[features_to_convert] = df[features_to_convert].astype('object')


df.dtypes

After this conversion, 30 of the 34 columns have the object data type, in line with their semantics.


In [None]:
object_columns = df.select_dtypes(include=['object']).columns 
print(object_columns)
object_columns

In [None]:
float_columns = df.select_dtypes(include="float64").columns
print(float_columns)

<a id="missing"></a>
# <b><span style='color:#ff6ea3'>Step 5 |</span><span style='color:purple'>  Missing Value Treatment</span></b>
[Table of Contents](#contents_tabel)

In [None]:
# Check for missing values in the dataset
df.isnull().sum()

In [None]:
# drop_duplicates() returns a new dataframe, so assign the result back
df = df.drop_duplicates()

### Filling the null values of the continuous features with the column mean (truncated to an integer)




In [None]:
# Fill the missing values of each continuous feature with its column mean, truncated to an integer
for col in ['Age at Diagnosis', 'Lymph nodes examined positive', 'Mutation Count', 'Tumor Size']:
    df[col] = df[col].fillna(int(df[col].mean()))
df.isnull().sum()

### The dropna() method removes the rows that contain NULL values

In [None]:
df= df.dropna()
df.isnull().sum()

In [None]:
# Convert the continuous features to integers now that no missing values remain
for col in ['Age at Diagnosis', 'Lymph nodes examined positive', 'Mutation Count', 'Tumor Size']:
    df[col] = df[col].astype(int)

In [None]:
df.dtypes

<a id="encoding"></a>
# <b><span style='color:#ff6ea3'>Step 6 |</span><span style='color:purple'> Categorical Features Encoding</span></b>
[Table of Contents](#contents_tabel)







## Label encoding

In [None]:
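# Split the columns by data type; copy so that the encoding step works on independent dataframes
df_cat = df.select_dtypes(include=object).copy()
df_num = df.select_dtypes(include=int).copy()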

In [None]:
df_num

In [None]:
df.head(20)

In [None]:
for col in df_cat:
    le = LabelEncoder()
    df_cat[col]= le.fit_transform(df_cat[col])
df_cat.head(20)
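
`LabelEncoder` sorts the distinct values of a column and assigns the codes 0, 1, 2, ... in that order, so the numeric codes carry no ranking beyond alphabetical order. A small sketch on a toy column (the values are illustrative and not taken from the METABRIC data) shows how to recover the mapping from the fitted encoder's `classes_` attribute:

```python
from sklearn.preprocessing import LabelEncoder

# Toy column for illustration only
toy_values = ["Yes", "No", "No", "Yes"]

le = LabelEncoder()
codes = le.fit_transform(toy_values)

# classes_ holds the sorted original labels; the position of a label is its code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)         # {'No': 0, 'Yes': 1}
print(codes.tolist())  # [1, 0, 0, 1]
```

Keeping one fitted encoder per column, rather than reusing the loop variable, would make it possible to translate the codes back with `inverse_transform` when reading the plots.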

In [None]:
# Combine the encoded categorical features with the numerical features and cast all columns to int
df_new = pd.concat([df_cat, df_num], axis=1, join='inner').astype(int)
df_new

In [None]:
df_new.dtypes



<a id="eda"></a>
# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 7 | EDA</p>

 [Table of Contents](#contents_tabel)

For our Exploratory Data Analysis (EDA), we'll take it in two main steps:

1. Univariate Analysis: Here, we'll focus on one feature at a time to understand its distribution and range.

2. Bivariate Analysis: In this step, we'll explore the relationship between each feature and the target variable. This helps us figure out the importance and influence of each feature on the target outcome.

With these two steps, we aim to gain insights into the individual characteristics of the data and also how each feature relates to our main goal: predicting the target variable.

<a id="univariate"></a>
# <b><span style='color:#ff6ea3'>Step 7.1 |</span><span style='color:purple'> Univariate Analysis</span></b>

We undertake univariate analysis on the dataset's features, based on their datatype:

1. For continuous data: We employ histograms to gain insight into the distribution of each feature. This allows us to understand the central tendency, spread, and shape of the dataset's distribution.

2. For categorical data: Bar plots are utilized to visualize the frequency of each category. This provides a clear representation of the prominence of each category within the respective feature.

By employing these visualization techniques, we're better positioned to understand the individual characteristics of each feature in the dataset.

In [None]:
# Filter out continuous features for the univariate analysis
df_continuous = df[continuous_features]

# Set up the subplot
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(15, 10))

# Loop to plot histograms for each continuous feature
for i, col in enumerate(df_continuous.columns):
    x = i // 2
    y = i % 2
    values, bin_edges = np.histogram(df_continuous[col], 
                                     range=(np.floor(df_continuous[col].min()), np.ceil(df_continuous[col].max())))
    
    graph = sns.histplot(data=df_continuous, x=col, bins=bin_edges, kde=True, ax=ax[x, y],
                         edgecolor='black', color='purple', alpha=0.6)
    ax[x, y].set_xlabel(col, fontsize=15)
    ax[x, y].set_ylabel('Count', fontsize=12)
    ax[x, y].set_xticks(np.round(bin_edges, 1))
    ax[x, y].set_xticklabels(ax[x, y].get_xticks(), rotation=45)
    ax[x, y].grid(color='lightgrey')
    
plt.suptitle('Distribution of Continuous Variables', fontsize=20)
plt.tight_layout()
plt.subplots_adjust(top=0.92)
plt.show()
    
   

   

In [None]:

# Set up the subplots: 15 rows x 2 columns
fig, ax = plt.subplots(nrows=15, ncols=2, figsize=(10,30))

# Loop to plot a horizontal bar chart for each categorical feature
for i, col in enumerate(df_cat):
    row = i // 2
    col_idx = i % 2
    
    # Calculate frequency percentages
    value_counts = df_new[col].value_counts(normalize=True).mul(100).sort_values()
    
    # Plot bar chart
    value_counts.plot(kind='barh', ax=ax[row, col_idx], width=0.8, color='purple')
    
    # Add frequency percentages to the bars
    for index, value in enumerate(value_counts):
        ax[row, col_idx].text(value, index, str(round(value, 1)) + '%', fontsize=10, weight='bold', va='center')
    
    ax[row, col_idx].set_xlim([0, 95])
    ax[row, col_idx].set_xlabel('Frequency Percentage', fontsize=10)
    ax[row, col_idx].set_title(f'{col}', fontsize=10)
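# Hide only the subplots that were left unused, so the last plotted feature stays visible
for unused_ax in ax.flatten()[i + 1:]:
    unused_ax.axis('off')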
plt.suptitle('Distribution of Categorical Variables', fontsize=22)
plt.tight_layout()
plt.subplots_adjust(top=0.95)
plt.show()


<a id="bivariate"></a>
# <b><span style='color:#ff6ea3'>Step 7.2 |</span><span style='color:purple'> Bivariate Analysis</span></b>

<a id="num_target"></a>
### <b><span style='color:#ff6ea3'>Step 7.2.1 |</span><span style='color:purple'> Numerical Features vs Overall Survival Status</span></b>  

In [None]:
# Set color palette
sns.set_palette(['#ff6ea3', 'purple'])

# Create the subplots
fig, ax = plt.subplots(len(continuous_features), 2, figsize=(8,10), gridspec_kw={'width_ratios': [1, 2]})

# Loop through each continuous feature to create barplots and kde plots
for i, col in enumerate(continuous_features):
    # Barplot showing the mean value of the feature for each target category
    graph = sns.barplot(data=df_new, x="Overall Survival Status", y=col, ax=ax[i,0])
    
    # KDE plot showing the distribution of the feature for each target category
    sns.kdeplot(data=df[df_new["Overall Survival Status"]==0], x=col, fill=True, linewidth=2, ax=ax[i,1], label='0')
    sns.kdeplot(data=df[df_new["Overall Survival Status"]==1], x=col, fill=True, linewidth=2, ax=ax[i,1], label='1')
    ax[i,1].set_yticks([])
    ax[i,1].legend(title='Overall Survival Status', loc='upper right')
    
    # Add mean values to the barplot
    for cont in graph.containers:
        graph.bar_label(cont, fmt='         %.3g')
        
# Set the title for the entire figure
plt.suptitle('Continuous Features vs Overall Survival Status', fontsize=20)
plt.tight_layout()                     
plt.show()


In [None]:
plt.figure(figsize = (20,15))
plt.title('Correlation of Attributes', y=1.05, size=25)
sns.heatmap(df_new.corr(), cmap='plasma',annot=True,  cbar=False)

____
<a id="cat_target"></a>
### <b><span style='color:#ff6ea3'>Step 7.2.2 |</span><span style='color:purple'> Categorical Features vs Overall Survival Status</span></b>  

In [None]:
# Remove 'Overall Survival Status' from the categorical_features
df_cat1 = [feature for feature in df_cat if feature != 'Overall Survival Status']

In [None]:
fig, ax = plt.subplots(nrows=15, ncols=2, figsize=(10,30)) # Width=10 inches, Height=30 inches

for i,col in enumerate(df_cat1):
    
    # Create a cross tabulation with the count of each survival status for each category of the feature
    cross_tab = pd.crosstab(index=df_new[col], columns=df_new['Overall Survival Status'])
    
    # normalize='index' gives the row-wise proportions, so the bars of each category sum to 1
    cross_tab_prop = pd.crosstab(index=df_new[col], columns=df_new['Overall Survival Status'], normalize='index')

    # Define colormap
    cmp = ListedColormap(['#ff6ea3', 'purple'])
    
    # Plot stacked bar charts
    x, y = i//2, i%2
    cross_tab_prop.plot(kind='bar', ax=ax[x,y], stacked=True, width=0.8, colormap=cmp,
                        legend=False, ylabel='Proportion', sharey=True)
    
    
    # Add legend
    ax[x,y].legend(title='Overall Survival Status', loc='best', fontsize=8, ncol=2)
    # Set y limit
    ax[x,y].set_ylim([0,1.12])
    # Rotate xticks
    ax[x,y].set_xticklabels(ax[x,y].get_xticklabels(), rotation=0)

plt.tight_layout()                     
plt.show()

<a id="preprocessing"></a>
# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 8 | Data Preprocessing</p>
 [Table of Contents](#contents_tabel)

<a id="outlier"></a>
# <b><span style='color:#ff6ea3'>Step 8.1 |</span><span style='color:purple'> Outlier Treatment</span></b>

### IQR method

In [None]:
# Count the values that fall outside the 1.5 * IQR fences for each continuous feature
Q1 = df[continuous_features].quantile(0.25)
Q3 = df[continuous_features].quantile(0.75)
IQR = Q3 - Q1
outliers_count_specified = ((df[continuous_features] < (Q1 - 1.5 * IQR)) | (df[continuous_features] > (Q3 + 1.5 * IQR))).sum()
outliers_count_specified


#### Upon identifying outliers for the specified continuous features, we found the following:
    Age at Diagnosis                   1
    Lymph nodes examined positive    120
    Mutation Count                    31
    Tumor Size                        81
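
The IQR rule flags a value as an outlier when it lies below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. The notebook counts these values but does not change them. One common treatment, shown here only as a sketch on toy numbers, is to cap values at the fences; `iqr_bounds` is a hypothetical helper, and whether capping is appropriate for these clinical features is a judgment call.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data for illustration only
sizes = np.array([10, 12, 14, 15, 16, 18, 20, 95])
low, high = iqr_bounds(sizes)

# Count the values outside the fences, then cap them at the fences
n_outliers = int(((sizes < low) | (sizes > high)).sum())
capped = np.clip(sizes, low, high)
print(low, high, n_outliers)  # 6.0 26.0 1
```

To avoid leakage, the fences would be computed on the training split and then applied unchanged to the test split.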

In [None]:
# Cast all columns to int (a single call covers the whole dataframe)
df_new = df_new.astype(int)



<a id="transform"></a>
# <b><span style='color:#ff6ea3'>Step 8.2 |</span><span style='color:purple'> Transforming Skewed Features</span></b>


In [None]:
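# Print the skewness of each numerical feature and plot its distribution
for col in df_num:
    print(col)
    print(skew(df_num[col]))
    
    plt.figure()
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(df_num[col], kde=True)
    plt.show()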
    

In [None]:
# Define the features (X) and the output labels (y)
X = df_new.drop('Overall Survival Status', axis=1)
y = df_new['Overall Survival Status']

In [None]:
# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

In [None]:
continuous_features

In [None]:
# Checking the distribution of the continuous features
fig, ax = plt.subplots(2, 4, figsize=(15,10))

# Original Distributions
for i, col in enumerate(continuous_features):
    sns.histplot(X_train[col], kde=True, ax=ax[0,i], color='#ff826e').set_title(f'Original {col}')
    

# Applying Box-Cox Transformation
# Dictionary to store lambda values for each feature
lambdas = {}

for i, col in enumerate(continuous_features):
    # Only apply box-cox for positive values
    if X_train[col].min() > 0:
        X_train[col], lambdas[col] = boxcox(X_train[col])
        # Applying the same lambda to test data
        X_test[col] = boxcox(X_test[col], lmbda=lambdas[col]) 
        sns.histplot(X_train[col], kde=True, ax=ax[1,i], color='purple').set_title(f'Transformed {col}')
    else:
        sns.histplot(X_train[col], kde=True, ax=ax[1,i], color='green').set_title(f'{col} (Not Transformed)')

fig.tight_layout()
plt.show()
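
The cell above estimates the Box-Cox lambda on the training data and reuses it for the test data, which keeps information from the test set out of the transformation. The same pattern is sketched below on synthetic, strictly positive values (the data and variable names are assumptions for illustration), together with `scipy.special.inv_boxcox`, which maps transformed values back to the original scale:

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Synthetic right-skewed data; Box-Cox requires strictly positive values
rng = np.random.default_rng(0)
train = rng.lognormal(mean=3.0, sigma=0.5, size=200)
test = rng.lognormal(mean=3.0, sigma=0.5, size=50)

train_t, lam = boxcox(train)       # lambda is estimated on the training data only
test_t = boxcox(test, lmbda=lam)   # the same lambda is applied to the test data

# The transform is invertible for a known lambda
restored = inv_boxcox(test_t, lam)
```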

<a id="survival"></a>
# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 9 | Survival Analysis</p>

 [Table of Contents](#contents_tabel)


<a id="kp"></a>
# <b><span style='color:#ff6ea3'>Step 9.1 |</span><span style='color:purple'> Kaplan-Meier Survival Curve</span></b>


In [None]:
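# Durations: take the original month values from df, because the copy of this column
# in df_new was label-encoded in Step 6 and no longer holds months
T = df.loc[df_new.index, "Overall Survival (Months)"].astype(float)
# Event indicator: check which encoded value marks a death before interpreting the curves
E = df_new["Overall Survival Status"]
plt.hist(T, bins = 100)
plt.show()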
 

In [None]:
kmf = KaplanMeierFitter()
kmf.fit(durations = T, event_observed = E)
kmf.plot_survival_function()

In [None]:
kmf.survival_function_.plot()
plt.title('Survival function')
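
For intuition about what the fitted curve represents: the Kaplan-Meier estimate multiplies, at every observed event time, the fraction of at-risk patients who survive that time, and censored patients simply leave the at-risk set without causing a drop. The pure-Python sketch below is an illustration on toy numbers, not the `lifelines` implementation; the function name `kaplan_meier` is hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct event time.

    events[i] is 1 if the event was observed at durations[i], 0 if censored.
    """
    survival = 1.0
    curve = []
    event_times = sorted({d for d, e in zip(durations, events) if e == 1})
    for t in event_times:
        # Patients still under observation just before time t
        at_risk = sum(1 for d in durations if d >= t)
        # Events observed exactly at time t
        died = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        survival *= 1 - died / at_risk
        curve.append((t, survival))
    return curve

# Toy follow-up times in months; 0 marks a censored patient
toy_T = [5, 8, 8, 12, 20]
toy_E = [1, 1, 0, 0, 1]
curve = kaplan_meier(toy_T, toy_E)
```

Here the estimate steps down to 4/5 at month 5, to 4/5 · 3/4 at month 8, and to 0 at month 20, when the last patient at risk has the event.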


In [None]:
# Rebuild df_new from the encoded categorical and the numerical features, cast to int
df_new = pd.concat([df_cat, df_num], axis=1, join='inner').astype(int)
df_new
    

In [None]:
ax = plt.subplot(111)
# LabelEncoder assigns codes in sorted order, so for Yes/No values 0 is "No" and 1 is "Yes"
m = (df_new["Chemotherapy"] == 1)
kmf.fit(durations = T[m], event_observed = E[m], label = "yes")
kmf.plot_survival_function(ax = ax)
kmf.fit(T[~m], event_observed = E[~m], label = "no")
kmf.plot_survival_function(ax = ax, at_risk_counts = True)
plt.title("Survival on the basis of Chemotherapy")


<a id="dt"></a>
# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 10 | Decision Tree Model Building</p>
 [Table of Contents](#contents_tabel)

In [None]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import tree
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score

from IPython.display import Image


In [None]:
# Define the base DT model
dt_base = DecisionTreeClassifier(random_state=0)


In [None]:
def tune_clf_hyperparameters(clf, param_grid, X_train, y_train, scoring='recall', n_splits=3):
    '''
    Tune the hyperparameters of a classifier by searching over a specified hyperparameter grid.
    GridSearchCV with stratified cross-validation (StratifiedKFold) evaluates every combination of hyperparameters.
    The combination with the best cross-validated score is selected; the default scoring metric is recall for class 1.
    Returns the best estimator, refitted on the full training data, together with its hyperparameters.
    '''
    
    # Create the cross-validation object using StratifiedKFold to ensure the class distribution is the same across all the folds
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)

    # Create the GridSearchCV object
    clf_grid = GridSearchCV(clf, param_grid, cv=cv, scoring=scoring, n_jobs=-1)

    # Fit the GridSearchCV object to the training data
    clf_grid.fit(X_train, y_train)

    # Get the best hyperparameters
    best_hyperparameters = clf_grid.best_params_
    
    # Return best_estimator_ attribute which gives us the best model that has been fitted to the training data
    return clf_grid.best_estimator_, best_hyperparameters
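
To make the search concrete, the following self-contained sketch runs the same combination of `StratifiedKFold` and `GridSearchCV` on synthetic data. The data set, the small grid, and the `_demo` names are assumptions for illustration; the scores it prints say nothing about the METABRIC results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data standing in for the real features
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.7, 0.3], random_state=0)

# Stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"max_depth": [2, 3], "min_samples_leaf": [1, 2]},
                    cv=cv, scoring="recall")
grid.fit(X_demo, y_demo)

print(grid.best_params_)  # combination with the highest mean recall for class 1
print(grid.best_score_)   # that mean recall across the three folds
```

Optimizing recall alone can favor models that predict the positive class too often, so it is worth reading the precision in the classification reports alongside it.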

In [None]:
# Hyperparameter grid for DT
param_grid_dt = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2,3],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2]
}

In [None]:
# Call the function for hyperparameter tuning
best_dt, best_dt_hyperparams = tune_clf_hyperparameters(dt_base, param_grid_dt, X_train, y_train)

In [None]:
# A wider hyperparameter grid, kept for reference; the tuning call above uses param_grid_dt
grid_param = {
    'criterion': ['gini', 'entropy'],
    'max_depth' : range(2,32,1),
    'min_samples_leaf' : range(1,10,1),
    'min_samples_split': range(2,10,1),
    'splitter' : ['best', 'random']
}

In [None]:
print('DT Optimal Hyperparameters: \n', best_dt_hyperparams)

In [None]:
# Evaluate the optimized model on the train data
print(classification_report(y_train, best_dt.predict(X_train)))

In [None]:
# Evaluate the optimized model on the test data
print(classification_report(y_test, best_dt.predict(X_test)))

In [None]:
# Plotting the Confusion Matrix for the Decision Tree Algorithm
cm_dt = confusion_matrix(y_test,best_dt.predict(X_test))
plt.figure(figsize=(1.8, 1.8))
sns.set_context('notebook',font_scale = 0.5)
sns.heatmap(cm_dt,annot=True,fmt='d', cmap="Oranges", cbar=False)
plt.title('Decision Tree Confusion Matrix');
plt.xlabel("Predicted_Value")
plt.ylabel("True_Value")
plt.tight_layout()

In [None]:
def evaluate_model(model, X_test, y_test, model_name):
    """
    Evaluates the performance of a trained model on test data using various metrics.
    """
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Get classification report
    report = classification_report(y_test, y_pred, output_dict=True)
    
    # Extracting metrics
    metrics = {
        "precision_0": report["0"]["precision"],
        "precision_1": report["1"]["precision"],
        "recall_0": report["0"]["recall"],
        "recall_1": report["1"]["recall"],
        "f1_0": report["0"]["f1-score"],
        "f1_1": report["1"]["f1-score"],
        "macro_avg_precision": report["macro avg"]["precision"],
        "macro_avg_recall": report["macro avg"]["recall"],
        "macro_avg_f1": report["macro avg"]["f1-score"],
        "accuracy": accuracy_score(y_test, y_pred)
    }
    
    # Convert dictionary to dataframe
    df = pd.DataFrame(metrics, index=[model_name]).round(2)
    
    return df

In [None]:
dt_evaluation = evaluate_model(best_dt, X_test, y_test, 'DT')
dt_evaluation
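
The precision, recall, and F1 values collected by `evaluate_model` all follow from the four cells of the confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]` for the labels 0 and 1. A minimal sketch with hypothetical counts (not results from this notebook):

```python
def class_metrics(tn, fp, fn, tp):
    """Precision, recall and F1 for the positive class (label 1)."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for illustration only
precision, recall, f1 = class_metrics(tn=50, fp=10, fn=5, tp=35)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.778 0.875 0.824
```

The metrics for class 0 follow from the same formulas with the roles of the two classes swapped.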

## ROC-AUC 

In [None]:
# Fit the base (untuned) decision tree and get the predicted probabilities for class 1
dt_base.fit(X_train, y_train)
y_pred_prob_dt = dt_base.predict_proba(X_test)[:, 1]

In [None]:
from sklearn.metrics import roc_auc_score
roc_auc_dt = roc_auc_score(y_test, y_pred_prob_dt)
print(f'Decision Tree ROC-AUC Score: {roc_auc_dt:.4f}')
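
The ROC-AUC score can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes it directly from that definition on toy values; `pairwise_auc` is a hypothetical helper, quadratic in the number of samples, and meant only to illustrate what `roc_auc_score` measures.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores for illustration only
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_y, toy_scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, which gives 0.75; a score of 0.5 corresponds to random ranking.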

<a id="rf"></a>
# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 11 | Random Forest Model Building</p>

⬆️ [Table of Contents](#contents_tabel)

<a id="rf_base"></a>
# <b><span style='color:#ff826e'>Step 11.1 |</span><span style='color:purple'> RF Base Model Definition</span></b>

In [None]:
rf_base = RandomForestClassifier(random_state=0)

____
<a id="rf_hp"></a>
# <b><span style='color:#ff826e'>Step 11.2 |</span><span style='color:purple'> RF Hyperparameter Tuning</span></b>

In [None]:
param_grid_rf = {
    'n_estimators': [10, 30, 50, 70, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 3, 4],
    'min_samples_split': [2, 3, 4, 5],
    'min_samples_leaf': [1, 2, 3],
    'bootstrap': [True, False]
}

____
<a id="rf_eval"></a>
# <b><span style='color:#ff826e'>Step 11.3 |</span><span style='color:purple'> RF Model Evaluation</span></b>

In [None]:
# Using the tune_clf_hyperparameters function to get the best estimator

best_rf, best_rf_hyperparams = tune_clf_hyperparameters(rf_base, param_grid_rf, X_train, y_train)
print('RF Optimal Hyperparameters: \n', best_rf_hyperparams)

In [None]:
# Evaluate the optimized model on the train data
print(classification_report(y_train, best_rf.predict(X_train)))

In [None]:
# Evaluate the optimized model on the test data
print(classification_report(y_test, best_rf.predict(X_test)))

In [None]:
rf_evaluation = evaluate_model(best_rf, X_test, y_test, 'RF')
rf_evaluation

In [None]:
# Plotting the Confusion Matrix for Random Forest Algorithm
cm_rf = confusion_matrix(y_test,best_rf.predict(X_test))
plt.figure(figsize=(1.8, 1.8))
sns.set_context('notebook',font_scale = 0.5)
sns.heatmap(cm_rf,annot=True,fmt='d', cmap="Oranges", cbar=False)
plt.title('Random Forest Confusion Matrix');
plt.xlabel("Predicted_Value")
plt.ylabel("True_Value")
plt.tight_layout()

## ROC-AUC 

In [None]:
rf_base.fit(X_train, y_train)
y_pred_prob_rf = rf_base.predict_proba(X_test)[:, 1]

In [None]:
from sklearn.metrics import roc_auc_score
# Score the random forest probabilities (not the decision tree ones)
roc_auc_rf = roc_auc_score(y_test, y_pred_prob_rf)
print(f'Random Forest ROC-AUC Score: {roc_auc_rf:.4f}')

<a id="logistic"></a>
# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 12 | Logistic Regression Model Building</p>

⬆️ [Table of Contents](#contents_tabel)

____
<a id="logistic_base"></a>
# <b><span style='color:#ff826e'>Step 12.1 |</span><span style='color:purple'> Logistic Base Model Definition</span></b>

In [None]:
# Define the base logistic model and set up the pipeline with scaling
logistic_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('LR', LogisticRegression())
])

____
<a id="logistic_hp"></a>
# <b><span style='color:#ff826e'>Step 12.2 |</span><span style='color:purple'> Logistic Hyperparameter Tuning</span></b>

In [None]:
# Hyperparameter grid for Logistic Regression
# 'penalty' - the type of regularization, which helps prevent overfitting by adding a penalty term to the optimization objective
# 'l1' refers to Lasso regularization, and 'l2' refers to Ridge regularization
# 'C' - inverse of the regularization strength; smaller values mean stronger regularization
# 'solver' - optimization algorithm; 'liblinear' suits small datasets, 'saga' suits larger ones
param_grid_logistic = {
    'penalty': ['l1', 'l2'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear']
}
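
The effect of `penalty` and `C` is easiest to see on the fitted coefficients: with the L1 penalty, a small `C` (strong regularization) drives coefficients to exactly zero, while a large `C` leaves most of them in the model. A short sketch on synthetic data; the data set and the helper `nonzero_coefs` are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 are informative
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=3, random_state=0)

def nonzero_coefs(C):
    """Fit an L1-penalized logistic regression and count its non-zero coefficients."""
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    model.fit(X_demo, y_demo)
    return int(np.count_nonzero(model.coef_))

strong = nonzero_coefs(0.001)  # strong regularization
weak = nonzero_coefs(100)      # weak regularization
print(strong, weak)
```

The L2 penalty, by contrast, shrinks coefficients toward zero without usually making them exactly zero, so it does not act as a feature selector.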

In [None]:
logistic_base=LogisticRegression()

In [None]:
# Call the function for hyperparameter tuning with logistic regression
best_logistic, best_logistic_hyperparams = tune_clf_hyperparameters(logistic_base, param_grid_logistic, X_train, y_train)

# Print the optimal hyperparameters for logistic regression
print('Logistic Regression Optimal Hyperparameters: \n', best_logistic_hyperparams)
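
To make the role of `penalty` and `C` in the grid above more concrete, here is a minimal sketch on synthetic data (generated with `make_classification`, not the breast cancer dataset). It fits an L1-penalized model at three values of `C` and counts how many coefficients remain non-zero; the names `X_demo`, `y_demo`, and `nonzero_counts` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 10 features, of which 4 are informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Smaller C means stronger regularization; with 'l1' this drives coefficients to exactly zero
nonzero_counts = {}
for C in [0.001, 0.1, 100]:
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    model.fit(X_demo, y_demo)
    nonzero_counts[C] = int(np.count_nonzero(model.coef_))

for C, count in nonzero_counts.items():
    print(f"C={C}: {count} non-zero coefficients out of {X_demo.shape[1]}")
```

With `penalty='l2'` the coefficients shrink toward zero as `C` decreases but generally do not become exactly zero, which is the practical difference between the two options in the grid.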

____
<a id="logistic_eval"></a>
# <b><span style='color:#ff826e'>Step 12.3 |</span><span style='color:purple'> Logistic Model Evaluation</span></b>

In [None]:
# Evaluate the optimized model on the train data
print(classification_report(y_train, best_logistic.predict(X_train)))

In [None]:
# Evaluate the optimized model on the test data
print(classification_report(y_test, best_logistic.predict(X_test)))

In [None]:
logistic_evaluation = evaluate_model(best_logistic, X_test, y_test, 'LR')
logistic_evaluation

In [None]:
# Plotting the Confusion Matrix for Logistic Regression Algorithm
cm_lr = confusion_matrix(y_test,best_logistic.predict(X_test))
plt.figure(figsize=(1.8, 1.8))
sns.set_context('notebook',font_scale = 0.5)
sns.heatmap(cm_lr,annot=True,fmt='d', cmap="Oranges", cbar=False)
plt.title('Logistic Regression Confusion Matrix');
plt.xlabel("Predicted_Value")
plt.ylabel("True_Value")
plt.tight_layout()

<a id="svm"></a>
# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 13 | SVM Model Building</p>

⬆️ [Table of Contents](#contents_tabel)

____
<a id="svm_base"></a>
# <b><span style='color:#ff826e'>Step 13.1 |</span><span style='color:purple'> SVM Base Model Definition</span></b>

In [None]:
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC()) 
])

____
<a id="svm_hp"></a>
# <b><span style='color:#ff826e'>Step 13.2 |</span><span style='color:purple'> SVM Hyperparameter Tuning</span></b>

In [None]:
param_grid_svm = {
    'svm__C': [5],
    'svm__kernel': ['linear', 'rbf', 'poly'],
#     'svm__gamma': [2],
#     'svm__degree': [2,3,4]
}

In [None]:
# Call the function for hyperparameter tuning
best_svm, best_svm_hyperparams = tune_clf_hyperparameters(svm_pipeline, param_grid_svm, X_train, y_train)
print('SVM Optimal Hyperparameters: \n', best_svm_hyperparams)
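
The `svm__` prefix in the grid follows scikit-learn's `<step name>__<parameter>` convention for addressing a parameter of a named pipeline step. The sketch below illustrates this on synthetic data. `tune_clf_hyperparameters` is defined earlier in the notebook and is not reproduced here, so a plain `GridSearchCV` with recall scoring stands in for it; that choice, and the `demo_*` names, are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC()),
])

# 'svm__C' is routed to the C parameter of the step named 'svm'
demo_grid = {
    'svm__C': [0.5, 5],
    'svm__kernel': ['linear', 'rbf'],
}

# The scaler is refitted inside each CV fold, so validation folds never leak into the scaling
demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=3, scoring='recall')
demo_search.fit(X_demo, y_demo)

print('Best parameters:', demo_search.best_params_)
print('Best CV recall:', round(demo_search.best_score_, 3))
```

Keeping the scaler inside the pipeline, rather than scaling `X_train` once up front, is what lets the search evaluate each candidate without information from the held-out fold.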

____
<a id="svm_eval"></a>
# <b><span style='color:#ff826e'>Step 13.3 |</span><span style='color:purple'> SVM Model Evaluation</span></b>

In [None]:
# Evaluate the optimized model on the train data
print(classification_report(y_train, best_svm.predict(X_train)))

In [None]:
# Evaluate the optimized model on the test data
print(classification_report(y_test, best_svm.predict(X_test)))

In [None]:
svm_evaluation = evaluate_model(best_svm, X_test, y_test, 'SVM')
svm_evaluation

In [None]:
# Plotting the Confusion Matrix for Support Vector Classifier Algorithm
cm_svc = confusion_matrix(y_test, best_svm.predict(X_test))
plt.figure(figsize=(1.8,1.8))
sns.set_context('notebook',font_scale = 0.5)
sns.heatmap(cm_svc,annot=True,fmt='d', cmap="Oranges", cbar=False)
plt.title('Support Vector Confusion Matrix');
plt.xlabel("Predicted_Value")
plt.ylabel("True_Value")
plt.tight_layout()

<a id="conclusion"></a>
# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 14 | Conclusion</p>

⬆️ [Table of Contents](#contents_tabel)

In [None]:
# Concatenate the dataframes
all_evaluations = [dt_evaluation, rf_evaluation, logistic_evaluation, svm_evaluation]
results = pd.concat(all_evaluations)

# Sort by 'recall_1'
results = results.sort_values(by='recall_1', ascending=False).round(2)
results
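
The comparison is ranked by `recall_1`, the recall for the positive class: of all truly positive cases, the share the model identifies. As a minimal sketch of how that number relates to the confusion matrices plotted above, the helper below derives recall and precision from a matrix laid out as in `confusion_matrix` (rows are true classes, columns are predicted classes). The counts are invented for illustration and `recall_and_precision` is a hypothetical helper, not part of the notebook.

```python
def recall_and_precision(cm, positive=1):
    """Return (recall, precision) for one class.

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    tp = cm[positive][positive]
    fn = sum(cm[positive]) - tp                  # positives the model missed
    fp = sum(row[positive] for row in cm) - tp   # negatives flagged as positive
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Invented counts: 60 true negatives-class samples, 40 true positive-class samples
toy_cm = [[50, 10],
          [5, 35]]
recall_1, precision_1 = recall_and_precision(toy_cm)
print(f"recall_1 = {recall_1:.3f}, precision_1 = {precision_1:.3f}")
```

Ranking by recall favors models that miss fewer positive cases, at the possible cost of more false alarms, so it is worth reading `recall_1` alongside the precision column of the same table.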

In [None]:
# Sort values based on 'recall_1'
results.sort_values(by='recall_1', ascending=True, inplace=True)
recall_1_scores = results['recall_1']

# Plot the horizontal bar chart
fig, ax = plt.subplots(figsize=(12, 7), dpi=70)
ax.barh(results.index, recall_1_scores, color='purple')

# Annotate the values and indexes
for i, (value, name) in enumerate(zip(recall_1_scores, results.index)):
    ax.text(value + 0.01, i, f"{value:.2f}", ha='left', va='center', fontweight='bold', color='Purple', fontsize=15)
    ax.text(0.1, i, name, ha='left', va='center', fontweight='bold', color='white', fontsize=25)

# Remove yticks
ax.set_yticks([])

# Set x-axis limit
ax.set_xlim([0, 1.2])

# Add title and xlabel
plt.title("Recall for Positive Class across Models", fontweight='bold', fontsize=22)
plt.xlabel('Recall Value', fontsize=16)
plt.show()

<a id="prediction"></a>
# <p style="background-color:purple; font-family:calibri; color:white; font-size:150%; text-align:center; border-radius:15px 50px;">Step 15 | Prediction</p>

⬆️ [Table of Contents](#contents_tabel)

In [None]:
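# Imports needed by this cell (harmless if already imported in Step 1)
import numpy as np
import ipywidgets as widgets
from IPython.display import display, HTML

# Function to make a prediction based on user input
def predict(features):
    # Reshape the flat feature list into a single-row 2D array, as scikit-learn expects
    return best_rf.predict(np.array(features).reshape(1, -1))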

# Define feature names. These must match the columns the model was trained on, in the same order.
# The target 'Overall Survival Status' is excluded because it is what the model predicts.
feature_names = ['Type of Breast Surgery', 'Cancer Type',
       'Cancer Type Detailed', 'Cellularity', 'Chemotherapy',
       'Pam50 + Claudin-low subtype', 'Cohort', 'ER status measured by IHC',
       'ER Status', 'Neoplasm Histologic Grade',
       'HER2 status measured by SNP6', 'HER2 Status',
       'Tumor Other Histologic Subtype', 'Hormone Therapy',
       'Inferred Menopausal State', 'Integrative Cluster',
       'Primary Tumor Laterality', 'Nottingham prognostic index',
       'Oncotree Code', 'Overall Survival (Months)',
       'PR Status', 'Radio Therapy', 'Relapse Free Status (Months)',
       'Relapse Free Status', 'Sex', '3-Gene classifier subtype',
       'Tumor Stage', "Patient's Vital Status", 'Age at Diagnosis',
       'Lymph nodes examined positive', 'Mutation Count', 'Tumor Size']

# Create input widgets
feature_widgets = [widgets.FloatText(value=0.0, description=f'{feature}:') for feature in feature_names]

# Create a button for making predictions
predict_button = widgets.Button(description="Predict")

# Output widget to display prediction
output_widget = widgets.Output()

# Function to handle button click event
def on_button_click(b):
    user_input = [float(widget.value) for widget in feature_widgets]
    prediction = predict(user_input)

    # Display the prediction, clearing any earlier result so outputs do not pile up
    with output_widget:
        output_widget.clear_output()
        display(HTML(f"<b>Prediction:</b> {prediction[0]}"))

# Attach the button click event
predict_button.on_click(on_button_click)

# Display widgets and output area
display(*feature_widgets, predict_button, output_widget)


# Thank You