# DSAI 2201 Winter 2024 Assignment

In [51]:
NAME = "Ahmed Hanif - 60301085"

COLLABORATORS = ""

<h3>Dataset Attributes</h3>

**1. Age**: This represents the age of the patient at diagnosis. Age is a critical factor in breast cancer prognosis and can influence the choice of treatment.

**2. Race**: The race of the patient. This demographic information is important for studying disparities in healthcare outcomes and access.

**3. Marital Status**: Indicates whether the patient is married, single, divorced, widowed, or separated. Marital status can impact social support structures, which may influence outcomes and patient well-being.

**4. T Stage**: The Tumor (T) stage from the TNM classification system. It describes the size of the original tumor and whether it has invaded nearby tissue. The stages range from T1 to T4, with higher numbers indicating larger tumors or greater extent of disease.

**5. N Stage**: The Node (N) stage from the TNM classification, indicating whether the cancer has spread to nearby lymph nodes and to what extent. Stages range from N0 (no lymph node involvement) to N3 (high involvement of lymph nodes).

**6. 6th Stage**: A composite stage based on the TNM classification, integrating Tumor size, Node involvement, and Metastasis (spread to distant organs) into a single stage. Common stages include I, II, III, and IV, with higher numbers indicating more advanced disease.

**7. differentiate**: Refers to the differentiation level of the tumor cells, indicating how much the cancer cells resemble normal cells. Categories include well-differentiated, moderately differentiated, poorly differentiated, and undifferentiated. Higher differentiation generally indicates a slower-growing, less aggressive cancer.

**8. Grade**: The histological grade of the tumor, which assesses the aggressiveness of the cancer based on the appearance of cells under a microscope. It usually ranges from 1 (low grade, well-differentiated) to 3 (high grade, poorly differentiated).

**9. A Stage**: The anatomical stage. This dataset includes "Regional," referring to cancer that has spread beyond the original site to nearby lymph nodes or tissues but not to distant organs.

**10. Tumor Size**: Measured in millimeters, this indicates the largest dimension of the primary tumor. Tumor size is a critical factor in staging and treatment planning.

**11. Estrogen Status**: Indicates whether the cancer cells have receptors for estrogen, a hormone that can promote the growth of breast cancer. "Positive" means the cancer is likely to respond to hormonal therapies that lower estrogen levels or block estrogen receptors.

**12. Progesterone Status**: Similar to estrogen status, this indicates whether the cancer cells have receptors for progesterone. Progesterone receptor-positive cancers may also respond to hormonal therapy.

**13. Regional Node Examined**: The number of lymph nodes examined during diagnosis or treatment. This helps assess the extent of cancer spread and is crucial for accurate staging.

**14. Reginol Node Positive**: Indicates the number of examined lymph nodes that were found to contain cancer. A higher number can signify more advanced disease.

**15. Survival Months**: The number of months the patient has survived after diagnosis. This can be used to analyze outcomes and the effectiveness of treatment strategies.

**16. Status**: Indicates whether the patient is alive or deceased at the time of the last follow-up. This outcome measure is essential for survival analysis and understanding the long-term prognosis of breast cancer patients.

(OpenAI, 2023)


## Assignment 1 - Data Analysis
**(20 points in total)**

In Assignments part 1 & part 2 we will go through the entire journey of a small data science project.

More details about the dataset can be found on the Kaggle website at the following link: https://www.kaggle.com/datasets/reihanenamdari/breast-cancer

However, a modified version of the dataset was attached in the dropbox. Please use it to answer this assignment.
     



**Question 1.**  _(2 points)_
* A) Analyze the distribution of death events among the patients and by race. Calculate their respective numbers and percentages.  _(0.5 points)_
* B) Compute descriptive statistics of the data. Comment on the results.  _(0.75 points)_
* C) Analyze the skew and the kurtosis of the medical variables' distributions (numbers and graphs). Comment on the results.  _(0.75 points)_

**Question 2.**  _(2 points)_
* Use univariate plots to analyze patterns in each of the medical variables with respect to the outcome (death status). Which categories of patients (race and marital status) were most likely to die from breast cancer? _(2 points)_


**Question 3.**  _(2 points)_
* Use multivariate plots to:
   * A) Analyze the relationship between medical variables. Comment on the results.  _(1 point)_
   * B) Identify potential factors that can predict a death event in married patients. Comment on the results.  _(1 point)_

**Question 4.**  _(3 points)_
* Identify and remove any outliers in the medical variables. Explain the rationale for identifying and removing outliers.  _(3 points)_

**Question 5.**  _(3 points)_
* How did you treat missing values for the attributes that you included in the analysis? Provide a detailed explanation in the comments. _(3 points)_

**Question 6.**  _(3 points)_
* The dataset contains a lot of zeros. Identify which attributes cannot be 0, medically speaking, and impute them, explaining the rationale behind it. _(3 points)_


**Question 7.**  _(2 points)_
* Identify the attributes that will need rescaling (with explanation), apply one of the rescaling techniques we have seen in our course, and explain your rationale. _(2 points)_

**Question 8.**  _(3 points)_
* Through extensive research, propose additional attributes that you can create to enhance your dataset, explain the logic behind them, and add them to your data. _(1.5 points)_
* Identify the most irrelevant attribute(s) and exclude them from your project for the next steps.  _(1.5 points)_

In [2]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import scipy

<h2><b> Data Cleaning </b></h2>

In [16]:
df = pd.read_csv('./Breast_Cancer-Winter2024.csv')
# Renaming Data Columns in order to remove spaces
df.columns = ['age','race','marital_status', 't_stage','n_stage','sixth_stage','differentiate','grade','a_stage','tumor_size','estrogen_status','progesterone_status','regional_mode_exam','regional_node_positive','survival_months','status']

<b>Survival Months</b>: 4017 non-missing values and 7 missing values.
This is not an accurate representation on its own, as df shows many NaN values.
The same is the case with <b>Differentiation</b>.

In [3]:
df.isna().sum()

age                       0
race                      0
marital_status            0
t_stage                   5
n_stage                   0
sixth_stage               0
differentiate             6
grade                     0
a_stage                   0
tumor_size                6
estrogen_status           0
progesterone_status       0
regional_mode_exam        0
regional_node_positive    0
survival_months           7
status                    0
dtype: int64
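The raw counts above are easier to judge next to the share of rows they represent. The following is a minimal sketch on a small made-up frame; the toy values and the helper name `missing_report` are assumptions for illustration and are not part of the assignment data.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    percent = (counts / len(frame) * 100).round(2)
    report = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only the columns that actually contain missing values.
    return report[report["missing"] > 0]


# Toy data for illustration only.
toy = pd.DataFrame({
    "tumor_size": [4.0, np.nan, 63.0, 18.0],
    "survival_months": [60.0, 62.0, np.nan, np.nan],
    "age": [68, 50, 58, 58],
})

report = missing_report(toy)
print(report)
```

In the notebook itself, the same helper could be called as `missing_report(df)`.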

Checking for and Removing Duplicate Rows

In [4]:
df.duplicated().value_counts()

False    4023
True        1
Name: count, dtype: int64

In [5]:
df.drop_duplicates(inplace = True)

In [6]:
df.duplicated().value_counts()

False    4023
Name: count, dtype: int64

In [7]:
df

Unnamed: 0,age,race,marital_status,t_stage,n_stage,sixth_stage,differentiate,grade,a_stage,tumor_size,estrogen_status,progesterone_status,regional_mode_exam,regional_node_positive,survival_months,status
0,68,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,4.0,Positive,Positive,24,1,60.0,Alive
1,50,White,Married,T2,N2,IIIA,Moderately differentiated,2,Regional,35.0,Positive,Positive,14,5,62.0,Alive
2,58,White,Divorced,T3,N3,IIIC,Moderately differentiated,2,Regional,63.0,Positive,Positive,14,7,75.0,Alive
3,58,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,18.0,Positive,Positive,2,1,,Alive
4,47,White,Married,T2,N1,IIB,,3,Regional,41.0,Positive,Positive,3,1,,Alive
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4019,62,Other,Married,T1,N1,IIA,Moderately differentiated,2,Regional,9.0,Positive,Positive,1,1,49.0,Alive
4020,56,White,Divorced,T2,N2,IIIA,Moderately differentiated,2,Regional,46.0,Positive,Positive,14,8,69.0,Alive
4021,68,White,Married,T2,N1,IIB,Moderately differentiated,2,Regional,22.0,Positive,Negative,11,3,69.0,Alive
4022,58,Black,Divorced,T2,N1,IIB,Moderately differentiated,2,Regional,44.0,Positive,Positive,11,1,72.0,Alive


## Question 1:

Solution:



<b>Q1 - Part A </b> Analyze the distribution of death events among the patients and by race. Calculate their respective numbers and percentages.

<p>OHE = One Hot Encoding. In this step, I convert the status (dead or alive) into binary indicator columns so that the number of patients in each category can be counted by summing them.
</p>
<b>Why did I have to do this?</b>

In [8]:
df_OHE = df.copy()
df_OHE = pd.get_dummies(df_OHE,columns = ['status'], prefix = ['status'])
df_OHE

Unnamed: 0,age,race,marital_status,t_stage,n_stage,sixth_stage,differentiate,grade,a_stage,tumor_size,estrogen_status,progesterone_status,regional_mode_exam,regional_node_positive,survival_months,status_Alive,status_Dead
0,68,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,4.0,Positive,Positive,24,1,60.0,True,False
1,50,White,Married,T2,N2,IIIA,Moderately differentiated,2,Regional,35.0,Positive,Positive,14,5,62.0,True,False
2,58,White,Divorced,T3,N3,IIIC,Moderately differentiated,2,Regional,63.0,Positive,Positive,14,7,75.0,True,False
3,58,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,18.0,Positive,Positive,2,1,,True,False
4,47,White,Married,T2,N1,IIB,,3,Regional,41.0,Positive,Positive,3,1,,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4019,62,Other,Married,T1,N1,IIA,Moderately differentiated,2,Regional,9.0,Positive,Positive,1,1,49.0,True,False
4020,56,White,Divorced,T2,N2,IIIA,Moderately differentiated,2,Regional,46.0,Positive,Positive,14,8,69.0,True,False
4021,68,White,Married,T2,N1,IIB,Moderately differentiated,2,Regional,22.0,Positive,Negative,11,3,69.0,True,False
4022,58,Black,Divorced,T2,N1,IIB,Moderately differentiated,2,Regional,44.0,Positive,Positive,11,1,72.0,True,False


In [9]:
#death_counts_by_race = df.groupby('race')['death_event'].sum()
df_OHE.head(2)

Unnamed: 0,age,race,marital_status,t_stage,n_stage,sixth_stage,differentiate,grade,a_stage,tumor_size,estrogen_status,progesterone_status,regional_mode_exam,regional_node_positive,survival_months,status_Alive,status_Dead
0,68,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,4.0,Positive,Positive,24,1,60.0,True,False
1,50,White,Married,T2,N2,IIIA,Moderately differentiated,2,Regional,35.0,Positive,Positive,14,5,62.0,True,False


In [10]:
#data_wrt_race=df_OHE.groupby('race')['status_Alive'].value_counts() #wrt = with regards to
death_count_by_race = df_OHE.groupby('race')['status_Alive'].sum()
pd.DataFrame(death_count_by_race).rename(columns = {'status_Alive':'number_of_living_patients'})


race,number_of_living_patients
Black,218
Other,287
White,2902


In [11]:
total_count_by_race = df['race'].value_counts()
pd.DataFrame(total_count_by_race).rename(columns = {'count': 'total_number_of_patients'})

race,total_number_of_patients
White,3412
Other,320
Black,291


<h2> Percentage of Death Events </h2>

In [12]:
# Note: death_count_by_race holds the number of LIVING patients per race
# (it sums status_Alive), so the ratio below is the survival percentage.
survival_percentage = (death_count_by_race / total_count_by_race) * 100
# The death percentage is the complement of the survival percentage.
percentage_of_death_occuring = pd.DataFrame({'death_percentage': (100 - survival_percentage).round(2)})
percentage_of_death_occuring = percentage_of_death_occuring.loc[['White', 'Black', 'Other']]
percentage_of_death_occuring

race,death_percentage
White,14.95
Black,25.09
Other,10.31


<h3>Analysis</h3>
<p>
    
</p>
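The counts and percentages per race can also be obtained in one step with a cross-tabulation, which avoids the one-hot encoding detour. This is a minimal sketch on made-up rows; the toy frame is an assumption, and in the notebook the `race` and `status` columns of `df` would be passed instead.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "race": ["White", "White", "White", "Black", "Black", "Other"],
    "status": ["Alive", "Dead", "Alive", "Dead", "Alive", "Alive"],
})

# Number of patients per race and status.
counts = pd.crosstab(toy["race"], toy["status"])

# Row-wise percentages: the values for each race sum to 100.
percentages = (
    pd.crosstab(toy["race"], toy["status"], normalize="index") * 100
).round(2)

print(counts)
print(percentages)
```

Using `normalize="index"` keeps the denominator tied to each race's own total, so the `Dead` column reads directly as the death percentage within that race.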

<b> Question 1 - Part B </b>
<p>Compute descriptive statistics of the data. Comment on the results. (0.75 points)</p>

<b>Question 1 Part C</b>

<p>C) Analyze the skew and the kurtosis of medical variables distributions (Numbers & Graphs). Comment on the results. (0.75 points) </p>

In [46]:
df_OHE_numerical = pd.DataFrame(df_OHE.select_dtypes(include=['number']))
df_OHE_numerical.index.name = "Index"
df_OHE_numerical

Index,age,tumor_size,regional_mode_exam,regional_node_positive,survival_months
0,68,4.0,24,1,60.0
1,50,35.0,14,5,62.0
2,58,63.0,14,7,75.0
3,58,18.0,2,1,
4,47,41.0,3,1,
...,...,...,...,...,...
4019,62,9.0,1,1,49.0
4020,56,46.0,14,8,69.0
4021,68,22.0,11,3,69.0
4022,58,44.0,11,1,72.0


In [48]:
df_OHE_numerical.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,4023.0,53.969923,8.963118,30.0,47.0,54.0,61.0,69.0
tumor_size,4017.0,30.425442,21.155926,0.0,16.0,25.0,38.0,140.0
regional_mode_exam,4023.0,14.358439,8.100241,1.0,9.0,14.0,19.0,61.0
regional_node_positive,4023.0,4.158837,5.109724,1.0,1.0,2.0,5.0,46.0
survival_months,4016.0,71.17754,23.091288,0.0,56.0,73.0,90.0,107.0


<p><b>Age:</b> The average age of the patients is 54, with a maximum age of 69 and a minimum age of 30. 75% of the patients are aged 61 or younger, while 25% are aged 61 or older.</p>

<b>Tumor Size</b> 

<b>Regional Mode Exam:</b>

<b>Regional Node Positive:</b>

<b>Survival Months</b>

In [43]:
df_OHE_numerical.agg(['skew','kurtosis'])

Unnamed: 0,age,tumor_size,regional_mode_exam,regional_node_positive,survival_months
skew,-0.219932,1.7357,0.828885,2.702183,-0.611541
kurtosis,-0.754989,3.621438,1.648586,8.978869,0.070821


<b>Age:</b>

<b>Tumor Size:</b>

<b>Regional Mode Exam:</b>

<b>Regional Node Positive:</b>

<b>Survival Months</b>

<h3>Graphs</h3>
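One possible way to produce the graphs for Part C is a histogram per numeric column, titled with its skew and kurtosis. The sketch below uses randomly generated toy columns; the helper name `plot_distributions` and the toy data are assumptions. In the notebook, `df_OHE_numerical` could be passed instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_distributions(frame: pd.DataFrame):
    """Draw one histogram per numeric column, titled with its skew and kurtosis."""
    numeric = frame.select_dtypes(include="number")
    n_cols = numeric.shape[1]
    fig, axes = plt.subplots(1, n_cols, figsize=(4 * n_cols, 3), squeeze=False)
    for ax, column in zip(axes[0], numeric.columns):
        values = numeric[column].dropna()
        ax.hist(values, bins=20)
        ax.set_title(
            f"{column}\nskew={values.skew():.2f}, kurt={values.kurtosis():.2f}"
        )
    fig.tight_layout()
    return fig


# Toy data for illustration only: one symmetric and one right-skewed column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(50, 10, 500),
    "right_skewed": rng.exponential(20, 500),
})

fig = plot_distributions(toy)
```

A long right tail in the histogram corresponds to a positive skew value, which is the pattern the skew table reports for `tumor_size` and `regional_node_positive`.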

<h3>Examination</h3>

<H2>Question 2. (2 points)</H2>

Use univariate plots to analyze patterns in each of the medical variables with respect to the outcome (death status). Which categories of patients (race and marital status) were most likely to die from breast cancer? (2 points)

###  Q3:

###  Q4:
Solution:
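One common rule for flagging outliers is the interquartile range (IQR) rule: values further than 1.5 × IQR below the first quartile or above the third quartile are treated as outliers. The sketch below is only one possible approach; the helper names, the factor `k = 1.5`, and the toy values are assumptions.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) limits of the IQR rule for one column."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(frame: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose value in any listed column falls outside the IQR limits."""
    mask = pd.Series(True, index=frame.index)
    for column in columns:
        low, high = iqr_bounds(frame[column], k)
        # Missing values are kept here so they can be handled separately.
        mask &= frame[column].between(low, high) | frame[column].isna()
    return frame[mask]


# Toy data for illustration only.
toy = pd.DataFrame(
    {"tumor_size": [10.0, 12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 140.0]}
)
cleaned = remove_outliers(toy, ["tumor_size"])
print(cleaned)
```

Whether a flagged value should really be removed is a medical judgment as much as a statistical one, since a very large tumor can be a genuine measurement rather than an error.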



###  Q5:
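A simple way to treat missing values is to fill numeric columns with the median and categorical columns with the most frequent value. The sketch below illustrates that idea on made-up rows; the helper name `impute_missing`, the choice of median and mode, and the toy data are assumptions, not the method required by the assignment.

```python
import numpy as np
import pandas as pd


def impute_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    result = frame.copy()
    for column in result.columns:
        if result[column].isna().sum() == 0:
            continue
        if pd.api.types.is_numeric_dtype(result[column]):
            fill_value = result[column].median()
        else:
            fill_value = result[column].mode().iloc[0]
        result[column] = result[column].fillna(fill_value)
    return result


# Toy data for illustration only.
toy = pd.DataFrame({
    "tumor_size": [4.0, 35.0, np.nan, 18.0, 41.0],
    "t_stage": ["T1", "T2", "T2", None, "T2"],
})

filled = impute_missing(toy)
print(filled)
```

The median is less sensitive to extreme values than the mean, which matters for right-skewed columns such as `tumor_size`.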

###  Q6:
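The descriptive statistics show a minimum of 0.0 for `tumor_size`. Assuming that a diagnosed tumor cannot measure 0 mm, such zeros can be treated as unrecorded values and imputed. The sketch below replaces zeros with the median of the non-zero values; the helper name `impute_zeros`, the choice of median, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd


def impute_zeros(frame: pd.DataFrame, columns) -> pd.DataFrame:
    """Replace zeros in the listed columns with the median of the non-zero values."""
    result = frame.copy()
    for column in columns:
        # Treat 0 as "not recorded" rather than as a real measurement.
        as_missing = result[column].replace(0, np.nan)
        result[column] = as_missing.fillna(as_missing.median())
    return result


# Toy data for illustration only.
toy = pd.DataFrame({"tumor_size": [0.0, 16.0, 25.0, 38.0, 0.0]})

fixed = impute_zeros(toy, ["tumor_size"])
print(fixed)
```

Columns where 0 can be a legitimate value should be left out of the `columns` list.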

###  Q7:
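Two widely used rescaling techniques are min-max scaling, which maps values to the range [0, 1], and standardization, which gives each column a mean of 0 and a standard deviation of 1. Which technique the course expects is not stated here, so the sketch below shows both on made-up values; the helper names and toy data are assumptions.

```python
import pandas as pd


def min_max_scale(series: pd.Series) -> pd.Series:
    """Map values linearly onto the range [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())


def standardize(series: pd.Series) -> pd.Series:
    """Center on the mean and divide by the population standard deviation."""
    return (series - series.mean()) / series.std(ddof=0)


# Toy data for illustration only.
toy = pd.DataFrame({
    "age": [30, 47, 54, 61, 69],
    "tumor_size": [1.0, 16.0, 25.0, 38.0, 140.0],
})

scaled = toy.apply(min_max_scale)
standardized = toy.apply(standardize)
print(scaled)
print(standardized)
```

Min-max scaling is sensitive to extreme values, because a single large value compresses the rest of the column toward 0, so it is usually applied after outliers have been handled.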

###  Q8 :
Solution


# Assignment 2 - Machine Learning Models for Prediction
**(15 points total)**

 
* In **Assignment 1**, we have explored the data, cleaned up the data, modified features, and created new ones. 
* In **Assignment 2**, we will apply supervised machine learning models for classification and regression, evaluate their performance, and identify the best models to solve the following problems: 

    * The **classification problem** is: given a training dataset of patients who survived or did not survive, build a model that can determine, for a test dataset that does not contain the death event information, whether these patients survived or not. 

    * The **regression problem** is: predict the number of remaining survival months for the patient.



**Question 1. (Data preparation)**  _(2 points)_
* List the relevant features which you will use for classification and explain your answer (*a relevant feature is a feature that can have an impact on the chance of survival of the patient*).
* List the relevant features which you will use for regression and explain your answer (*a relevant feature is a feature that can have an impact on the prediction of the number of remaining survival months of the patient*).
* Divide both your datasets into a training set (70%) and a testing set (30%). All models will be trained and tested on the same splits.
    





###  Data Preparation & Splitting for the Classification Model
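A 70/30 split can be sketched with scikit-learn's `train_test_split`. The toy frame, the chosen feature columns, and `random_state=42` are assumptions for illustration. Stratifying on the target is a suggestion rather than a requirement of the assignment; it keeps the proportion of death events similar in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 16 "Alive" and 4 "Dead" patients.
toy = pd.DataFrame({
    "age": range(30, 50),
    "tumor_size": range(10, 30),
    "status": ["Alive"] * 16 + ["Dead"] * 4,
})

X = toy[["age", "tumor_size"]]
y = (toy["status"] == "Dead").astype(int)  # 1 = death event, 0 = alive

# 70% training, 30% testing; a fixed random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
```

Reusing the same `random_state` for every model ensures that all models are trained and tested on the same splits.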

###  Data Preparation & Splitting for the Regression Model

**Question 2. (Classification models)**  _(3 points)_
* Train three different classification models of your choice using the training set. Explain the rationale behind selecting each of these three algorithms. You may refer to the following guidelines for model selection: 
    * Diagram from scikit-learn: https://scikit-learn.org/stable/tutorial/machine_learning_map/
    * Models comparison table: https://docs.google.com/spreadsheets/d/16i47Wmjpj8k-mFRk-NnXXU5tmSQz8h37YxluDV8Zy9U/edit#gid=0



**Question 3. (Evaluation of classification models)**  _(3 points)_
* Evaluate the performance of your three classification models on the testing set using the following metrics: accuracy, area under the curve (AUC), precision, and recall.
* Based on the models evaluation results, what is the best model and why?
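The four metrics listed above can be computed with `sklearn.metrics`. The sketch below uses hand-written labels and scores rather than output from a real model; the values and the 0.5 decision threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy values for illustration only: 1 = death event, 0 = alive.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# Predicted probability of the positive class, e.g. from predict_proba(...)[:, 1].
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.7, 0.9])
# Hard predictions obtained with a 0.5 threshold.
y_pred = (y_score >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)  # AUC uses the scores, not the hard labels

print(accuracy, precision, recall, auc)
```

When death events are the minority class, accuracy alone can be misleading, so recall on the death class and AUC are worth weighing when choosing the best model.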




**Question 4. (Regression models)**  _(3 points)_
* Train two different regression models of your choice using the training set. Explain the rationale behind selecting each of these two algorithms. 



**Question 5. (Evaluation of regression models)**  _(3 points)_
* Evaluate the performance of your two regression models on the testing set using the following metrics: mean absolute error, mean squared error, and R-squared.
* Based on the models evaluation results, what is the best model and why?
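The three regression metrics follow directly from their definitions. The sketch below computes them with NumPy on made-up values; the numbers and the helper name `regression_metrics` are assumptions for illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return mean absolute error, mean squared error, and R-squared."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    # R-squared compares the model's squared error with that of always predicting the mean.
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return mae, mse, r2


# Toy survival months for illustration only.
y_true = [60, 62, 75, 49]
y_pred = [58, 65, 70, 50]

mae, mse, r2 = regression_metrics(y_true, y_pred)
print(mae, mse, r2)
```

Mean squared error penalizes large misses more heavily than mean absolute error, so the two can rank models differently when a few predictions are far off.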



**Question 6. (Possible improvements)** _(1 point)_
* How can you improve the accuracy of your classification model?
* How can you improve the accuracy of your regression model?

<h3>Archive</h3>