<a href="https://colab.research.google.com/github/edenhunsader/breast-cancer-subset/blob/main/Colab_Notebook_Breast_cancer_subset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Procedural Overview

> The steps below filter the November 2017 data from the SEER Program of the National Cancer Institute (NCI) so that only cancer-related data points remain.



## Working with raw data

Import pandas in order to load and edit the data file.

In [None]:
import numpy as np
import pandas as pd

Have pandas read the CSV file containing the data set.


In [None]:
df=pd.read_csv("Breast_Cancer.csv")

Check that the data set loaded properly by inspecting its shape and its first five rows.

In [None]:
df.shape

(4024, 16)

In [None]:
df[0:5]

Unnamed: 0,Age,Race,Marital Status,T Stage,N Stage,6th Stage,differentiate,Grade,A Stage,Tumor Size,Estrogen Status,Progesterone Status,Regional Node Examined,Reginol Node Positive,Survival Months,Status
0,68,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,4,Positive,Positive,24,1,60,Alive
1,50,White,Married,T2,N2,IIIA,Moderately differentiated,2,Regional,35,Positive,Positive,14,5,62,Alive
2,58,White,Divorced,T3,N3,IIIC,Moderately differentiated,2,Regional,63,Positive,Positive,14,7,75,Alive
3,58,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,18,Positive,Positive,2,1,84,Alive
4,47,White,Married,T2,N1,IIB,Poorly differentiated,3,Regional,41,Positive,Positive,3,1,50,Alive
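The load check above can also be expressed as a small helper. This is a minimal sketch, not part of the original notebook: the function name `summarize_load` and the toy frame `sample` (built from a few values in the rows shown above) are assumptions made for illustration.

```python
import pandas as pd

# Toy frame built from values in the first rows shown above (illustrative only).
sample = pd.DataFrame({
    "Age": [68, 50, 58],
    "Race": ["White", "White", "White"],
    "Tumor Size": [4, 35, 63],
    "Status": ["Alive", "Alive", "Alive"],
})

def summarize_load(frame, required_columns):
    """Return basic facts that help confirm a frame loaded as expected."""
    missing = [c for c in required_columns if c not in frame.columns]
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "missing_columns": missing,
        "null_cells": int(frame.isna().sum().sum()),
    }

# "Marital Status" is deliberately absent from the toy frame to show the report.
report = summarize_load(sample, ["Age", "Race", "Marital Status"])
print(report)
```

On the full data set, the same helper could be called on `df` with the column names the later steps rely on.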


## Narrowing down the data set

Standardize all demographic information in the categories of age, race, and marital status.

First, find the average value for each of these categories.

In [None]:
df.describe()

Unnamed: 0,Age,Tumor Size,Regional Node Examined,Reginol Node Positive,Survival Months
count,4024.0,4024.0,4024.0,4024.0,4024.0
mean,53.972167,30.473658,14.357107,4.158052,71.297962
std,8.963134,21.119696,8.099675,5.109331,22.92143
min,30.0,1.0,1.0,1.0,1.0
25%,47.0,16.0,9.0,1.0,56.0
50%,54.0,25.0,14.0,2.0,73.0
75%,61.0,38.0,19.0,5.0,90.0
max,69.0,140.0,61.0,46.0,107.0
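By default, `describe()` summarizes only the numeric columns, which is why Race and Marital Status are missing from the table above. As a hedged sketch, passing `include="all"` adds count, unique, top, and freq rows for the text columns. The frame `first_rows` below is a toy built from the five rows displayed earlier, not the full data set.

```python
import pandas as pd

# Toy frame built from the five rows displayed earlier (illustrative only).
first_rows = pd.DataFrame({
    "Age": [68, 50, 58, 58, 47],
    "Race": ["White", "White", "White", "White", "White"],
    "Marital Status": ["Married", "Married", "Divorced", "Married", "Married"],
})

# include="all" summarizes numeric and non-numeric columns together;
# cells that do not apply to a column (e.g. the mean of Race) are NaN.
summary = first_rows.describe(include="all")

most_common_race = summary.loc["top", "Race"]
most_common_marital = summary.loc["top", "Marital Status"]
mean_age = summary.loc["mean", "Age"]
print(summary)
```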


Convert the categorical values to numerical codes in order to find the average.

In [None]:
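# Replace each race label with a numeric code: Black -> 0, White -> 1, Other -> 2.
# Assigning the result back avoids calling replace(inplace=True) on a column
# selection, which newer pandas versions do not apply to the parent frame.
df['Race'] = df['Race'].replace(['Black', 'White', 'Other'], [0, 1, 2])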

In [None]:
df['Race'].describe()

count    4024.000000
mean        1.007207
std         0.389647
min         0.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         2.000000
Name: Race, dtype: float64
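Because the codes 0, 1, and 2 are arbitrary labels, the mean depends on which code each label receives. A complementary check that does not depend on the coding is the most frequent label. The sketch below uses a toy series, and the dictionary `race_codes` simply mirrors the coding chosen above; neither is part of the original notebook.

```python
import pandas as pd

# Toy data for illustration only; the real column is df['Race'].
races = pd.Series(["White", "White", "Black", "Other", "White"], name="Race")

# Same coding as in the notebook: Black -> 0, White -> 1, Other -> 2.
race_codes = {"Black": 0, "White": 1, "Other": 2}

# map() applies the dictionary label by label and returns a new Series.
encoded = races.map(race_codes)

# The most frequent label, and its code, do not depend on the order of the codes.
counts = races.value_counts()
most_common_label = races.mode().iloc[0]
most_common_code = race_codes[most_common_label]
print(counts)
```

If the most frequent code agrees with the rounded mean, as it does for code 1 in the output above, both approaches point at the same group.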

Now filter out everything except the average age (54) and the average race code (1) in order to standardize the data across the demographic categories of age and race.

In [None]:
# Combine both conditions into one boolean mask; chaining two [] filters
# makes pandas warn that the second mask is reindexed.
df[(df['Age'] == 54) & (df['Race'] == 1)]


Unnamed: 0,Age,Race,Marital Status,T Stage,N Stage,6th Stage,differentiate,Grade,A Stage,Tumor Size,Estrogen Status,Progesterone Status,Regional Node Examined,Reginol Node Positive,Survival Months,Status
36,54,1,Married,T2,N2,IIIA,Moderately differentiated,2,Regional,27,Positive,Negative,21,6,37,Alive
106,54,1,Married,T2,N1,IIB,Poorly differentiated,3,Regional,40,Positive,Negative,11,1,24,Dead
114,54,1,Married,T3,N2,IIIA,Moderately differentiated,2,Regional,51,Positive,Positive,6,5,103,Alive
140,54,1,Married,T1,N1,IIA,Well differentiated,1,Regional,7,Positive,Positive,4,1,86,Alive
144,54,1,Married,T2,N1,IIB,Moderately differentiated,2,Regional,34,Positive,Positive,20,1,52,Dead
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3935,54,1,Married,T1,N1,IIA,Moderately differentiated,2,Regional,15,Positive,Positive,16,1,97,Alive
3943,54,1,Single,T2,N1,IIB,Poorly differentiated,3,Regional,30,Positive,Positive,10,1,59,Alive
3964,54,1,Widowed,T1,N1,IIA,Poorly differentiated,3,Regional,12,Negative,Negative,1,1,62,Alive
3987,54,1,Married,T1,N1,IIA,Moderately differentiated,2,Regional,7,Negative,Positive,12,1,89,Alive
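A filter on two columns can be written as a single boolean mask. The following is a minimal sketch on a toy frame (the values are illustrative, not taken from the full data set) showing the mask form alongside the equivalent `DataFrame.query` form.

```python
import pandas as pd

# Toy frame for illustration; Race uses the numeric codes from the notebook.
toy = pd.DataFrame({
    "Age": [68, 54, 54, 47, 54],
    "Race": [1, 1, 0, 1, 1],
    "Tumor Size": [4, 27, 40, 41, 51],
})

# Each comparison yields a boolean Series; & keeps rows where both are True.
# The parentheses are required because & binds more tightly than ==.
mask = (toy["Age"] == 54) & (toy["Race"] == 1)
subset = toy[mask]

# query() expresses the same condition as a string.
same_subset = toy.query("Age == 54 and Race == 1")
print(subset)
```

Both forms keep the original row labels, which is why the filtered output above shows index values such as 36 and 106 rather than a fresh 0, 1, 2 sequence.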


Because cancer does not differ along the lines of marital status, we do not need to filter on that demographic factor and can remove the column altogether.

In [None]:
# Drop the 'Marital Status' column from the frame in place.
df.drop(columns='Marital Status', inplace=True)
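In a notebook, re-running a cell that drops a column raises a `KeyError` the second time, because the column is already gone. One possible way around this, sketched here on a toy frame that is not part of the original notebook, is to return a new frame and pass `errors="ignore"`.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "Age": [54, 54],
    "Marital Status": ["Married", "Single"],
    "Status": ["Alive", "Alive"],
})

# drop() returns a new frame; errors="ignore" skips labels that are missing,
# so the same line can be run repeatedly without raising KeyError.
trimmed = toy.drop(columns="Marital Status", errors="ignore")
trimmed_again = trimmed.drop(columns="Marital Status", errors="ignore")
print(list(trimmed.columns))
```

The original `toy` frame is left untouched, which makes it easier to go back a step if the wrong column was removed.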

In [None]:
# Same filter as before, now without the 'Marital Status' column.
df[(df['Age'] == 54) & (df['Race'] == 1)]


Unnamed: 0,Age,Race,T Stage,N Stage,6th Stage,differentiate,Grade,A Stage,Tumor Size,Estrogen Status,Progesterone Status,Regional Node Examined,Reginol Node Positive,Survival Months,Status
36,54,1,T2,N2,IIIA,Moderately differentiated,2,Regional,27,Positive,Negative,21,6,37,Alive
106,54,1,T2,N1,IIB,Poorly differentiated,3,Regional,40,Positive,Negative,11,1,24,Dead
114,54,1,T3,N2,IIIA,Moderately differentiated,2,Regional,51,Positive,Positive,6,5,103,Alive
140,54,1,T1,N1,IIA,Well differentiated,1,Regional,7,Positive,Positive,4,1,86,Alive
144,54,1,T2,N1,IIB,Moderately differentiated,2,Regional,34,Positive,Positive,20,1,52,Dead
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3935,54,1,T1,N1,IIA,Moderately differentiated,2,Regional,15,Positive,Positive,16,1,97,Alive
3943,54,1,T2,N1,IIB,Poorly differentiated,3,Regional,30,Positive,Positive,10,1,59,Alive
3964,54,1,T1,N1,IIA,Poorly differentiated,3,Regional,12,Negative,Negative,1,1,62,Alive
3987,54,1,T1,N1,IIA,Moderately differentiated,2,Regional,7,Negative,Positive,12,1,89,Alive


Label this new data set as "Health_Metrics_subset"

In [None]:
# .copy() makes the subset an independent frame rather than a slice of df.
Health_Metrics_subset = df[(df['Age'] == 54) & (df['Race'] == 1)].copy()


Display the subset to confirm that it was created correctly.

In [None]:
Health_Metrics_subset

Unnamed: 0,Age,Race,T Stage,N Stage,6th Stage,differentiate,Grade,A Stage,Tumor Size,Estrogen Status,Progesterone Status,Regional Node Examined,Reginol Node Positive,Survival Months,Status
36,54,1,T2,N2,IIIA,Moderately differentiated,2,Regional,27,Positive,Negative,21,6,37,Alive
106,54,1,T2,N1,IIB,Poorly differentiated,3,Regional,40,Positive,Negative,11,1,24,Dead
114,54,1,T3,N2,IIIA,Moderately differentiated,2,Regional,51,Positive,Positive,6,5,103,Alive
140,54,1,T1,N1,IIA,Well differentiated,1,Regional,7,Positive,Positive,4,1,86,Alive
144,54,1,T2,N1,IIB,Moderately differentiated,2,Regional,34,Positive,Positive,20,1,52,Dead
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3935,54,1,T1,N1,IIA,Moderately differentiated,2,Regional,15,Positive,Positive,16,1,97,Alive
3943,54,1,T2,N1,IIB,Poorly differentiated,3,Regional,30,Positive,Positive,10,1,59,Alive
3964,54,1,T1,N1,IIA,Poorly differentiated,3,Regional,12,Negative,Negative,1,1,62,Alive
3987,54,1,T1,N1,IIA,Moderately differentiated,2,Regional,7,Negative,Positive,12,1,89,Alive


Save the new subset to a CSV file.

In [None]:
Health_Metrics_subset.to_csv("Health_Metrics_subset.csv", index=False)
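To confirm what ends up in the saved file, the export can be read back and compared with the frame in memory. The sketch below is an assumption-laden illustration: it writes a toy subset to an in-memory buffer (`io.StringIO`) instead of a file on disk, and the values are only a small sample of the rows shown above.

```python
import io
import pandas as pd

# Toy subset with the original row labels 36 and 106 (illustrative only).
subset = pd.DataFrame(
    {"Age": [54, 54], "Race": [1, 1], "Status": ["Alive", "Dead"]},
    index=[36, 106],
)

# Write to an in-memory text buffer; a file path would work the same way.
buffer = io.StringIO()
subset.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a new frame.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded)
```

Note that `index=False` leaves the original row labels out of the file, so the reloaded frame is numbered from 0. If the labels matter for tracing rows back to the source data, omitting `index=False` keeps them as an extra first column.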