## Classifying Patient Mortality from Breast Cancer Data

Breast cancer is the most common malignancy among Canadian women (excluding non-melanoma skin cancers) and the second leading cause of cancer death among them. Breast cancer in men is rare but possible. Invasive ductal carcinoma (IDC), which makes up around 80% of cases, is the most common type of breast cancer. It is particularly dangerous because it is invasive and can spread to other tissues. A biopsy is commonly carried out to obtain small tissue samples, which a pathologist then evaluates to determine whether the patient has IDC, another type of breast cancer, or no cancer.

Projections show that in 2022:

* 28,600 Canadian women will be diagnosed with breast cancer. This will represent 25% of all new cases of cancer in women in 2022.

* 5,500 Canadian women will die from breast cancer. This will be the cause of 14% of all female cancer deaths in 2022.

* On average, 78 Canadian women will be diagnosed with breast cancer every day.

* In Canada, 1 in 8 women will develop breast cancer during their lifetime, and 1 in 34 will die from it.

This dataset of breast cancer patients comes from the November 2017 update of the SEER Program of the National Cancer Institute (NCI), which provides population-based cancer statistics. It includes female patients diagnosed with infiltrating duct and lobular carcinoma of the breast (SEER primary sites recode NOS histology codes 8522/3) between 2006 and 2010. After excluding patients with unknown tumour size, unknown numbers of examined or positive regional lymph nodes (LNs), or survival of less than one month, 4024 patients remained.



## Description of Variables


Including our target variable, patient status, the dataset contains sixteen variables, which leaves fifteen features for our analysis and prediction. The variables are as follows:

* Race: One of White, Black or Other (American Indian/AK Native, Asian/Pacific Islander).

* Marital Status: One of Married, Divorced, Single, Widowed or Separated.

* T Stage: One of T1, T2, T3 or T4. The T stage describes the size and extent of the primary (main) tumour; the number after the T indicates the tumour's size or how far it has invaded neighbouring tissues. T0: there is no sign of a primary tumour. T1 (includes T1a, T1b and T1c): the tumour is 2 cm (3/4 inch) or less across. T2: the tumour is more than 2 cm but no more than 5 cm (2 inches) across. T3: the tumour is more than 5 cm across. T4: the tumour, whatever its size, has grown into the chest wall or skin.

* N Stage: One of N1, N2 or N3. The N stage describes the number and location of nearby lymph nodes that contain cancer: the higher the number following the N, the more lymph nodes are affected. For comparison, the M category indicates whether the cancer has metastasized, that is, spread beyond the original tumour to other parts of the body.

* 6th Stage: One of IIA, IIB, IIIA, IIIB or IIIC. Doctors determine the cancer stage by combining the T, N and M categories, the tumour grade, and the results of ER/PR and HER2 tests. When surgery is the first step in treatment, the stage is usually determined once the final test results are in, typically 5 to 7 days after surgery. When systemic treatment, usually with drugs, is given before surgery (neoadjuvant therapy), the stage is largely established clinically. Doctors may refer to stage I through stage IIA cancer as "early stage" and stage IIB through stage III as "locally advanced".


* Status: One of Alive or Dead.
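
The size cut-offs given for the T stage can be written as a small lookup. The sketch below is only an illustration: the helper `size_to_t_stage` is a hypothetical name, and it assumes that tumour sizes are recorded in millimetres, which the notebook does not state explicitly. T4 is left out because it depends on invasion of the chest wall or skin rather than on size alone.

```r
# Hypothetical helper: map tumour size to a T stage using the size cut-offs
# described in the T Stage bullet (2 cm and 5 cm).
# Assumption: sizes are in millimetres. T4 is never assigned here, because
# it is defined by invasion of the chest wall or skin, not by size.
size_to_t_stage <- function(size_mm) {
  cut(
    size_mm,
    breaks = c(0, 20, 50, Inf),      # (0, 20] is T1, (20, 50] is T2, above 50 is T3
    labels = c("T1", "T2", "T3"),
    right = TRUE                     # intervals are closed on the right
  )
}

# Tumour sizes of the six rows previewed later in this notebook
preview_sizes <- c(4, 35, 63, 18, 41, 20)
preview_stages <- size_to_t_stage(preview_sizes)
print(as.character(preview_stages))
```

For these six sizes the helper returns T1, T2, T3, T1, T2, T1, which agrees with the `T.Stage` values recorded for the same rows in the data preview.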

>_*TEAM NOTE*_ We still need to finish describing each of the remaining variables. Feel free to trim my text if it pushes us over the word count.

## Exploratory Data Analysis

### Loading Necessary Libraries

In [4]:
library(tidyverse)

-- Attaching packages --------------------------------------- tidyverse 1.3.2 --
v ggplot2 3.3.6      v purrr   0.3.4
v tibble  3.1.8      v dplyr   1.0.10
v tidyr   1.2.1      v stringr 1.4.1
v readr   2.1.2      v forcats 0.5.2
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


In [9]:
# Read the SEER breast cancer data and preview the first six rows
cancer_dataset <- read.csv("./Breast_Cancer.csv")
head(cancer_dataset)


| | Age | Race | Marital.Status | T.Stage | N.Stage | X6th.Stage | differentiate | Grade | A.Stage | Tumor.Size | Estrogen.Status | Progesterone.Status | Regional.Node.Examined | Reginol.Node.Positive | Survival.Months | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<int>` | `<chr>` |
| 1 | 68 | White | Married | T1 | N1 | IIA | Poorly differentiated | 3 | Regional | 4 | Positive | Positive | 24 | 1 | 60 | Alive |
| 2 | 50 | White | Married | T2 | N2 | IIIA | Moderately differentiated | 2 | Regional | 35 | Positive | Positive | 14 | 5 | 62 | Alive |
| 3 | 58 | White | Divorced | T3 | N3 | IIIC | Moderately differentiated | 2 | Regional | 63 | Positive | Positive | 14 | 7 | 75 | Alive |
| 4 | 58 | White | Married | T1 | N1 | IIA | Poorly differentiated | 3 | Regional | 18 | Positive | Positive | 2 | 1 | 84 | Alive |
| 5 | 47 | White | Married | T2 | N1 | IIB | Poorly differentiated | 3 | Regional | 41 | Positive | Positive | 3 | 1 | 50 | Alive |
| 6 | 51 | White | Single | T1 | N1 | IIA | Moderately differentiated | 2 | Regional | 20 | Positive | Positive | 18 | 2 | 89 | Alive |
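
The preview shows that most columns, including `Grade` and `Status`, are read as character strings. One possible next step, sketched here on a three-row excerpt of the preview rather than on the full file, is to convert character columns to factors and count missing values per column. The object name `preview` and the choice of four columns are illustrative only.

```r
# Three-row, four-column excerpt of the preview rows shown above
preview <- data.frame(
  Age     = c(68L, 50L, 58L),
  T.Stage = c("T1", "T2", "T3"),
  Grade   = c("3", "2", "2"),
  Status  = c("Alive", "Alive", "Alive"),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each of them to a factor
chr_cols <- vapply(preview, is.character, logical(1))
preview[chr_cols] <- lapply(preview[chr_cols], factor)

# Count missing values in every column
missing_per_column <- colSums(is.na(preview))

str(preview)
print(missing_per_column)
```

The same three steps should carry over to `cancer_dataset` unchanged, since they do not depend on specific column names.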


In [14]:
# Every column except the last one (Status, the target) is a candidate feature
column_names <- colnames(cancer_dataset)
features <- column_names[-length(column_names)]
features
cat("The dimensions of our dataset are: ", dim(cancer_dataset))

The dimensions of our dataset are:  4024 16

We have fifteen candidate features, listed above, that may help us predict a patient's status from the characteristics of their cancer.
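
Dropping the last column works only while `Status` stays in the final position. As a hedged alternative, the target can be excluded by name. The sketch below types out the column names from the data preview so that it runs without the CSV file; `target` and `features_by_name` are illustrative names, not part of the original analysis.

```r
# Column names as they appear in the data preview
column_names <- c(
  "Age", "Race", "Marital.Status", "T.Stage", "N.Stage", "X6th.Stage",
  "differentiate", "Grade", "A.Stage", "Tumor.Size", "Estrogen.Status",
  "Progesterone.Status", "Regional.Node.Examined", "Reginol.Node.Positive",
  "Survival.Months", "Status"
)

# Exclude the target by name instead of by position
target <- "Status"
features_by_name <- setdiff(column_names, target)

# Number of candidate features
print(length(features_by_name))
```

With the real data, `column_names` would come from `colnames(cancer_dataset)` as in the cell above.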

## Data Preprocessing

## Methods for Experimentation 