# SEER Breast Cancer Analysis

In this notebook, we will conduct several analyses using the data in the SEER breast cancer data set.  You will practice fitting distributions and conducting hypothesis tests.

As before, we will stratify our data to mimic the analysis in "Breast Cancer Stage Variation and Survival in Association With Insurance Status and Sociodemographic Factors in US Women 18 to 64 Years Old" by Hsu et al. (2017). [Link to Full Paper.](http://rdcu.be/Gdvp/)

The code here is provided for your convenience, but you are not required to use it.  

In [1]:
import pandas as pd
from os import path
import numpy as np
import matplotlib.pyplot as plt
from read_seer_data import SEER_reader
from scipy import stats
%matplotlib inline 

# Load data into dataframe
We've moved all of the data loading into a Python script, `read_seer_data.py`, to make this notebook a little less congested.

In [2]:
# This is the file we want to use
txt_file = 'YOUR SEER DATA DIRECTORY/SEER_1973_2015_TEXTDATA/incidence/yr1973_2015.seer9/BREAST.txt'

# Read the file, filter data and add columns
seer_data = SEER_reader()
seer_data.load_seer(txt_file)
seer_data.filter_table_mod4()
seer_data.encode_data_mod4()
seer_data.dropna()

table = seer_data.get_table()
records = seer_data.get_records()

# look at the top 5 rows of the dataframe
table.head(5)


## Visualize the Data

Here we will see how breast cancer survival is distributed across different stages at cancer diagnosis.

In [None]:
factor = 'Cancer Stage Num'
response = 'Survival months'

# identify the covariate strata within the data
vals = np.unique(table[factor].values)
    
factors = dict()
for v in vals:
    factors[v] = table[table[factor] == v][response].values
    
# plot approximate distributions using violinplot
plt.figure(figsize=(15, 10))
ax = plt.subplot(111)
plt.violinplot([f for k,f in factors.items()], 
               positions=range(len(vals)),showmeans=True,showextrema=False)

xlabels=[records[factor]['codes'][v] for v in vals]
ax.set_xticks(range(len(vals)))
ax.set_xticklabels(xlabels)
ax.set_ylabel(response)
ax.set_title(factor)
plt.show()

# Distribution Fitting
Use [SciPy](https://docs.scipy.org/doc/scipy/reference/stats.html) to fit continuous distributions to the data.

## Distribution A
Fit a [normal](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) distribution to Stage I breast cancer survival.
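As a starting point, here is a minimal sketch of fitting a normal distribution with `scipy.stats.norm.fit`. It uses synthetic numbers rather than SEER data, and the array name `survival` is a placeholder; in the notebook you would pass the Stage I array from `factors` instead (the key depends on how the stages are encoded).

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for Stage I survival months (NOT real SEER data)
rng = np.random.default_rng(0)
survival = rng.normal(loc=120.0, scale=40.0, size=5000)

# Maximum-likelihood estimates of the mean (loc) and standard deviation (scale)
mu, sigma = stats.norm.fit(survival)

# Evaluate the fitted density on a grid, e.g. to overlay on a histogram
grid = np.linspace(survival.min(), survival.max(), 200)
pdf = stats.norm.pdf(grid, loc=mu, scale=sigma)

print(f"mu = {mu:.1f}, sigma = {sigma:.1f}")
```

For the normal distribution, the fitted `loc` is simply the sample mean, so comparing `mu` with `survival.mean()` is a quick sanity check.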

In [None]:
# STUDENT CODE GOES HERE 


## Distribution B
Fit a [triangular](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.triang.html) distribution to Stage II breast cancer survival.
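The triangular distribution in SciPy has one shape parameter `c` (the position of the mode as a fraction of the support) plus `loc` and `scale`. The sketch below is one possible approach, not the only one: it fixes `loc` and `scale` to a slightly padded data range and only estimates `c`. The synthetic data and the one-month padding are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for Stage II survival months (NOT real SEER data)
rng = np.random.default_rng(1)
survival = rng.triangular(left=0.0, mode=60.0, right=200.0, size=5000)

# Pad the observed range so every data point lies strictly inside the support
lo = survival.min() - 1.0
hi = survival.max() + 1.0

# Estimate only the shape parameter c, starting the search at 0.5;
# loc and scale are held fixed via floc / fscale
c, loc, scale = stats.triang.fit(survival, 0.5, floc=lo, fscale=hi - lo)

# The mode of the fitted distribution in the original units (months)
mode_est = loc + c * scale
print(f"c = {c:.3f}, estimated mode = {mode_est:.1f} months")
```

Leaving `loc` and `scale` free is also possible with `stats.triang.fit(survival)`, but the optimizer can be sensitive to its starting values, so it is worth checking the result against a histogram.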

In [None]:
# STUDENT CODE GOES HERE 


## Distribution C
Fit a [beta](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.beta.html) distribution to Stage IV breast cancer survival.
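The standard beta distribution lives on the interval from 0 to 1, so survival months need to be mapped onto that interval through `loc` and `scale`. The sketch below assumes a hypothetical maximum follow-up of 300 months and uses synthetic data; both are placeholders. Note that with `floc` and `fscale` fixed, every observation must lie strictly between `loc` and `loc + scale`, so real data containing 0-month survival would need to be handled first.

```python
import numpy as np
from scipy import stats

# Hypothetical upper bound on follow-up time, in months (an assumption)
max_months = 300.0

# Synthetic stand-in for Stage IV survival months (NOT real SEER data)
rng = np.random.default_rng(2)
survival = rng.beta(1.2, 4.0, size=5000) * max_months

# Fix the support to (0, max_months) and estimate the two shape parameters
a, b, loc, scale = stats.beta.fit(survival, floc=0.0, fscale=max_months)

# Fitted density on a grid strictly inside the support
grid = np.linspace(1.0, max_months - 1.0, 200)
pdf = stats.beta.pdf(grid, a, b, loc=loc, scale=scale)

print(f"a = {a:.2f}, b = {b:.2f}")
```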

In [None]:
# STUDENT CODE GOES HERE 


# -------------------------------------------------------------

# Hypothesis Testing
Use [SciPy](https://docs.scipy.org/doc/scipy/reference/stats.html#statistical-functions) to test for different hypotheses.  

Note that the Student's t-test and the F-test both assume that the data are normally distributed, which our survival data are not.  These tests would therefore not be appropriate for a real analysis of this data, but we will use them here for the purpose of practice.

## Hypothesis A

* $H_1=$ Survival times for patients diagnosed with Stage I and Stage II have different expected values (means)
* $H_0=$ The means are the same for Stage I and Stage II 
* (Use a [Student's t-test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html))
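A minimal sketch of a two-sample t-test with `scipy.stats.ttest_ind` follows. The two arrays are synthetic placeholders for the Stage I and Stage II survival arrays, and the 0.05 significance level is an assumed convention rather than something fixed by the assignment.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two survival samples (NOT real SEER data)
rng = np.random.default_rng(3)
stage_a = rng.normal(loc=120.0, scale=40.0, size=2000)
stage_b = rng.normal(loc=110.0, scale=40.0, size=2000)

# Student's t-test: assumes equal variances in both groups (the default)
student = stats.ttest_ind(stage_a, stage_b)

# Welch's t-test: same call, but without the equal-variance assumption
welch = stats.ttest_ind(stage_a, stage_b, equal_var=False)

alpha = 0.05  # assumed significance level
print(f"Student: t = {student.statistic:.2f}, p = {student.pvalue:.3g}")
print(f"Welch:   t = {welch.statistic:.2f}, p = {welch.pvalue:.3g}")
print("Reject H0" if student.pvalue < alpha else "Fail to reject H0")
```

A small p-value is evidence against $H_0$; it does not by itself say how large the difference in means is, so it helps to report the two sample means alongside the test.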


In [None]:
# STUDENT CODE GOES HERE 


## Hypothesis B

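*  $H_1=$ Survival times for patients diagnosed with Stage I and Stage II have different variances
*  $H_0=$ The variances are the same for Stage I and Stage II
*  (Use an F-test on the ratio of the sample variances, for example with the [F distribution](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f.html). Note that [`scipy.stats.f_oneway`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html) is a one-way ANOVA, which compares means rather than variances.)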
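A variance-ratio F-test is not a single SciPy call, so one way to sketch it is to compute the ratio of the sample variances by hand and look up the p-value in `scipy.stats.f`. The data below are synthetic placeholders, and the two-sided p-value convention (doubling the smaller tail) is one common choice, stated here as an assumption.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two survival samples (NOT real SEER data)
rng = np.random.default_rng(4)
stage_a = rng.normal(loc=120.0, scale=40.0, size=2000)
stage_b = rng.normal(loc=120.0, scale=50.0, size=2000)

# Test statistic: ratio of unbiased sample variances
f_stat = np.var(stage_a, ddof=1) / np.var(stage_b, ddof=1)

# Degrees of freedom for numerator and denominator
dfn = len(stage_a) - 1
dfd = len(stage_b) - 1

# Two-sided p-value: double the smaller tail probability, capped at 1
lower_tail = stats.f.cdf(f_stat, dfn, dfd)
upper_tail = stats.f.sf(f_stat, dfn, dfd)
p_value = min(1.0, 2.0 * min(lower_tail, upper_tail))

print(f"F = {f_stat:.3f}, p = {p_value:.3g}")
```

This test is quite sensitive to departures from normality; `scipy.stats.levene` is a more robust alternative for comparing variances.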




In [None]:
# STUDENT CODE GOES HERE 


## Hypothesis C

*  $H_1=$ The distributions of survival for patients diagnosed with Stage I and Stage II are different
*  $H_0=$ The distributions are the same
*  (Use a two-sided [Kolmogorov-Smirnov Test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html))
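The two-sample Kolmogorov-Smirnov test compares the empirical cumulative distribution functions of two samples and makes no normality assumption. A minimal sketch with `scipy.stats.ks_2samp`, again on synthetic placeholder data:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two survival samples (NOT real SEER data)
rng = np.random.default_rng(5)
stage_a = rng.exponential(scale=120.0, size=2000)
stage_b = rng.exponential(scale=90.0, size=2000)

# Two-sided test: H0 is that both samples come from the same distribution
result = stats.ks_2samp(stage_a, stage_b, alternative="two-sided")

# The statistic is the largest vertical gap between the two empirical CDFs
print(f"D = {result.statistic:.3f}, p = {result.pvalue:.3g}")
```

The same call works for any pair of stages; only the two input arrays change.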


In [None]:
# STUDENT CODE GOES HERE 


## Hypothesis D

*  $H_1=$ The distributions of survival for patients diagnosed with Stage II and Stage III are different
*  $H_0=$ The distributions are the same
*  (Use a two-sided [Kolmogorov-Smirnov Test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html))


In [None]:
# STUDENT CODE GOES HERE 
