## Breast Cancer Study - Preprocessing and Training Data Development

The data from the NKI breast cancer dataset will be prepared below for fitting models.

In [1]:
# Load necessary packages
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
os.getcwd()

'/Users/shannonballard/Springboard/Springboard_Capstone_2'

In [3]:
path="/Users/shannonballard/Springboard/Springboard_Capstone_2"
os.chdir(path) 

In [4]:
# Load the data into a pandas DataFrame and show the first five rows
nki_bc_cleaned = pd.read_csv('nki_bc_cleaned.csv', index_col = 0)
nki_bc_cleaned.head()

Unnamed: 0,age,eventdeath,survival,timerecurrence,chemo,hormonal,amputation,histtype,diam,posnodes,...,Contig36312_RC,Contig38980_RC,NM_000853,NM_000854,NM_000860,Contig29014_RC,Contig46616_RC,NM_000888,NM_000898,AF067420
0,43,0,14.817248,14.817248,0,0,1,1,25,0,...,0.591103,-0.355018,0.373644,-0.76069,-0.164025,-0.038726,0.237856,-0.087631,-0.369153,0.153795
1,48,0,14.261465,14.261465,0,0,0,1,20,0,...,-0.199829,-0.001635,-0.062922,-0.682204,-0.220934,-0.100088,-0.466537,-0.231547,-0.643019,-0.014098
2,38,0,6.644764,6.644764,0,0,0,1,15,0,...,0.328736,-0.047571,0.084228,-0.69595,-0.40284,-0.099965,0.110155,-0.114298,0.258495,-0.198911
3,50,0,7.748118,7.748118,0,1,0,1,15,1,...,0.648861,-0.039088,0.182182,-0.52464,0.03732,-0.167688,-0.01679,-0.285344,-0.251188,0.86271
4,38,0,6.436687,6.31896,0,0,1,1,15,0,...,-0.287538,-0.286893,0.057082,-0.565021,-0.105632,-0.108148,-0.405853,-0.053601,-0.677072,0.13416


### Categorical Features

Any categorical features in the dataset will be converted to dummy (indicator) variables for later model development.

Several categorical columns are already encoded as 0 or 1 (eventdeath, chemo, hormonal, and amputation). Dummy variables will not be generated for these columns, since a single 0/1 column already carries all of their information.
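
As a quick way to confirm which columns are already binary, the check below flags every column whose values are all 0 or 1. This is a minimal sketch on a toy frame: the column names mirror the dataset, but the values are made up for illustration.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the dataset.
toy = pd.DataFrame({
    "chemo": [0, 1, 0, 1],
    "amputation": [1, 1, 0, 0],
    "grade": [1, 2, 3, 2],
    "age": [43, 48, 38, 50],
})

# A column counts as binary when every value in it is either 0 or 1.
binary_cols = [col for col in toy.columns if toy[col].isin([0, 1]).all()]
print(binary_cols)  # ['chemo', 'amputation']
```

Applied to `nki_bc_cleaned`, the same list comprehension would separate the 0/1 columns from the multi-level columns that still need encoding.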

In [5]:
# Find columns that could be categorical
# Using 'int64' because values in columns are integers and not strings
nki_bc_cleaned.select_dtypes(include=['int64'])

Unnamed: 0,age,eventdeath,chemo,hormonal,amputation,histtype,diam,posnodes,grade,angioinv,lymphinfil
0,43,0,0,0,1,1,25,0,2,3,1
1,48,0,0,0,0,1,20,0,3,3,1
2,38,0,0,0,0,1,15,0,2,1,1
3,50,0,0,1,0,1,15,1,2,3,1
4,38,0,0,0,1,1,15,0,2,2,1
...,...,...,...,...,...,...,...,...,...,...,...
267,48,1,1,0,1,1,30,0,3,1,3
268,39,1,0,0,1,1,30,0,2,1,1
269,50,1,0,0,1,1,27,0,3,1,1
270,52,1,0,1,1,1,28,0,3,1,1


## Description of Columns

| Variable | Details | Type |
| --- | --- | --- |
| age | Age at which patient was diagnosed with breast cancer | Continuous |
| eventdeath | 0 = alive, 1 = death | Categorical |
| survival | Time (in years) until death or last follow-up | Continuous |
| timerecurrence | Time (in years) until cancer recurrence or last follow-up | Continuous |
| chemo | Chemotherapy used (1 = yes, 0 = no) | Categorical |
| hormonal | Hormonal therapy used (1 = yes, 0 = no) | Categorical |
| amputation | Mastectomy (1 = yes, 0 = no) | Categorical |
| histtype | Histological grade based on 3 morphological features | Categorical |
| diam | Diameter of primary tumor | Continuous |
| posnodes | Number of lymph nodes that contained cancerous cells | Continuous |
| grade | Pathological grade based on cell differentiation and growth rate (1 = low, 2 = intermediate, 3 = high) | Categorical |
| angioinv | Vascular invasion (1 = absent, 2 = minor, 3 = major) | Categorical |
| lymphinfil | Level of lymphocytic infiltration | Categorical |
| 1,554 gene expression levels | Each gene is provided as an individual variable, given as an intensity ratio relative to a reference pool | Continuous |

In [6]:
# Identify the unique values for particular categorical columns to see if they should be considered for dummy variables
columns = ['histtype', 'grade', 'angioinv', 'lymphinfil']

for column in columns:
    unique_values = nki_bc_cleaned[column].unique()
    print('The unique values for ', column, 'are: ', unique_values)

The unique values for  histtype are:  [1 2 5 7 4]
The unique values for  grade are:  [2 3 1]
The unique values for  angioinv are:  [3 1 2]
The unique values for  lymphinfil are:  [1 2 3]


In [7]:
columns = ['histtype', 'grade', 'angioinv', 'lymphinfil']

for column in columns:
    value_count = nki_bc_cleaned[column].value_counts()
    print('The value counts for ', column, 'are: \n', value_count)

The value counts for  histtype are: 
 1    254
2     14
4      2
7      1
5      1
Name: histtype, dtype: int64
The value counts for  grade are: 
 3    106
2     95
1     71
Name: grade, dtype: int64
The value counts for  angioinv are: 
 1    169
3     73
2     30
Name: angioinv, dtype: int64
The value counts for  lymphinfil are: 
 1    223
2     27
3     22
Name: lymphinfil, dtype: int64


### Dummy Variables

Each of the four columns above takes more than two distinct values, so dummy variables will be created for all of them.
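
To illustrate what this encoding produces, here is a small sketch with `pd.get_dummies` on a toy frame (the values are hypothetical). It passes `dtype=int` so the indicators appear as 0/1; newer pandas versions otherwise return boolean columns.

```python
import pandas as pd

# Toy frame with made-up values; 'grade' stands in for any multi-level column.
toy = pd.DataFrame({"age": [43, 48, 38], "grade": [2, 3, 1]})

# One indicator column is created per observed category,
# and the original 'grade' column is dropped.
toy_dummies = pd.get_dummies(toy, columns=["grade"], dtype=int)
print(list(toy_dummies.columns))  # ['age', 'grade_1', 'grade_2', 'grade_3']

# Each row has exactly one active indicator among the grade_* columns.
row_sums = toy_dummies[["grade_1", "grade_2", "grade_3"]].sum(axis=1)
```

For models that are sensitive to collinear features, `drop_first=True` is an available option that drops one level per encoded column; the notebook keeps all levels.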

In [8]:
# Make dummy variables for the categorical columns histtype, grade, angioinv, and lymphinfil
categorical_columns = ['histtype', 'grade', 'angioinv', 'lymphinfil']
nki_bc_dummies = pd.get_dummies(nki_bc_cleaned, prefix=categorical_columns, columns=categorical_columns)
nki_bc_dummies.head()

Unnamed: 0,age,eventdeath,survival,timerecurrence,chemo,hormonal,amputation,diam,posnodes,esr1,...,histtype_7,grade_1,grade_2,grade_3,angioinv_1,angioinv_2,angioinv_3,lymphinfil_1,lymphinfil_2,lymphinfil_3
0,43,0,14.817248,14.817248,0,0,1,25,0,-0.413955,...,0,0,1,0,0,0,1,1,0,0
1,48,0,14.261465,14.261465,0,0,0,20,0,0.195251,...,0,0,0,1,0,0,1,1,0,0
2,38,0,6.644764,6.644764,0,0,0,15,0,0.596177,...,0,0,1,0,1,0,0,1,0,0
3,50,0,7.748118,7.748118,0,1,0,15,1,0.501286,...,0,0,1,0,0,0,1,1,0,0
4,38,0,6.436687,6.31896,0,0,1,15,0,-0.066771,...,0,0,1,0,0,1,0,1,0,0


### Standardize the magnitude of the numeric features using a scaler

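The 'eventdeath' column will be the response variable, so it will be dropped from the features and assigned to y.

Because the columns differ widely in magnitude (for example, age in years versus gene expression ratios close to zero), the features will be standardized to zero mean and unit variance. The scaler is fit on the training split only and then applied to the test split.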
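
The sketch below shows what standardization does on a tiny array with hypothetical values: each column is shifted by its training mean and divided by its training standard deviation, and the test data reuses those same training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: one large-magnitude column and one small-magnitude column.
train = np.array([[40.0, 0.1], [50.0, -0.2], [60.0, 0.4]])
test = np.array([[55.0, 0.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

# Equivalent manual computation: z = (x - mean) / std, with population std (ddof=0).
manual = (train - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(train_scaled, manual))  # True
```

Fitting on the training data alone keeps information about the test set out of the preprocessing step.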

In [22]:
from sklearn.preprocessing import StandardScaler

# Declare an explanatory variable, called X, and assign it the result of dropping 'eventdeath' from the df
X = nki_bc_dummies.drop(['eventdeath'], axis=1)

# Declare a response variable, called y, and assign it the eventdeath column of the df 
y = nki_bc_dummies['eventdeath']

In [27]:
# Import the train_test_split function from sklearn.model_selection
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
# Using 75/25 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
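
As a quick check of what a 25% test fraction means for a dataset of this size (the value counts above sum to 272 rows), the sketch below runs the same call on placeholder arrays. `X_demo` and `y_demo` are hypothetical stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (272).
X_demo = np.arange(272 * 2).reshape(272, 2)
y_demo = np.arange(272) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)
print(X_tr.shape, X_te.shape)  # (204, 2) (68, 2)
```

Passing `stratify=y` is an option if the class proportions of `eventdeath` should be preserved in both subsets; the notebook does not do this.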

In [40]:
# Make a Scaler object
scaler = StandardScaler()

# Fit the scaler on the training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the mean and standard deviation learned from the training data
X_test_scaled = scaler.transform(X_test)

In [41]:
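# Wrap the scaled training array in a DataFrame to restore the column names
# Note: the row index is reset to 0..n-1 and no longer matches the index of y_train
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns)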

In [42]:
# Check df
X_train_scaled.head()

Unnamed: 0,age,survival,timerecurrence,chemo,hormonal,amputation,diam,posnodes,esr1,G3PDH_570,...,histtype_7,grade_1,grade_2,grade_3,angioinv_1,angioinv_2,angioinv_3,lymphinfil_1,lymphinfil_2,lymphinfil_3
0,0.161953,-0.418151,-0.193216,-0.803219,-0.382188,1.125463,-0.845116,-0.178132,0.841692,-0.036943,...,-0.070186,1.60591,-0.71492,-0.786796,0.754474,-0.338754,-0.592447,0.46291,-0.338754,-0.281718
1,0.884104,-0.753676,-0.507188,-0.803219,2.616516,1.125463,1.453599,-0.650068,0.426528,-0.910159,...,-0.070186,-0.6227,1.398757,-0.786796,0.754474,-0.338754,-0.592447,0.46291,-0.338754,-0.281718
2,0.703567,-0.142793,0.045244,-0.803219,-0.382188,-0.888523,-0.845116,-0.650068,-2.311769,0.105184,...,-0.070186,-0.6227,1.398757,-0.786796,0.754474,-0.338754,-0.592447,0.46291,-0.338754,-0.281718
3,0.342491,0.163709,0.351267,1.24499,-0.382188,1.125463,2.028277,0.293803,-1.463247,0.282117,...,-0.070186,-0.6227,-0.71492,1.270978,-1.325427,-0.338754,1.687915,-2.160247,2.951997,-0.281718
4,0.161953,-0.09112,0.112807,-0.803219,-0.382188,-0.888523,0.87892,-0.178132,1.096974,-1.351171,...,-0.070186,1.60591,-0.71492,-0.786796,0.754474,-0.338754,-0.592447,0.46291,-0.338754,-0.281718


In [43]:
type(y)

pandas.core.series.Series

In [44]:
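# Convert the response to a 1-D NumPy array (Series.ravel() is deprecated in recent pandas versions)
y2 = y.to_numpy()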

In [45]:
type(y2)

numpy.ndarray