
# Data Science Unit 1 Sprint Challenge 1

## Loading, cleaning, visualizing, and analyzing data

In this sprint challenge you will look at a dataset of the survival of patients who underwent surgery for breast cancer.

http://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival

Data Set Information:
The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Attribute Information:
1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
-- 1 = the patient survived 5 years or longer
-- 2 = the patient died within 5 years

Sprint challenges are evaluated based on satisfactory completion of each part. It is suggested you work through it in order, getting each aspect reasonably working, before trying to deeply explore, iterate, or refine any given step. Once you get to the end, if you want to go back and improve things, go for it!

## Part 1 - Load and validate the data

- Load the data as a `pandas` data frame.
- Validate that it has the appropriate number of observations (you can check the raw file, and also read the dataset description from UCI).
- Validate that you have no missing values.
- Add informative names to the features.
- The survival variable is encoded as 1 for surviving >5 years and 2 for not - change this to be 0 for not surviving and 1 for surviving >5 years (0/1 is a more traditional encoding of binary variables)

At the end, print the first five rows of the dataset to demonstrate the above.

In [0]:
# Imports used throughout the notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [0]:
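# First attempt: load with default settings (the first row is treated as a header)
bc = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data')
# Print the raw file to inspect its layout
! curl http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data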

In [7]:
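# Download the raw file quietly, then count its lines (one line per observation)
!wget -q http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data
!wc -l haberman.data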
!ls

haberman.data  haberman.data.1	sample_data


In [0]:
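# This first load omitted header=None, so one observation was consumed as column names
print(bc.shape)
bc.info()  # info() prints its report itself and returns None
print(bc.describe())
bc.head()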

In [0]:
col_names = ['age', 'operation year 1900+', 'pos nodes', 'survival status']

In [0]:
bcs = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data', header=None, names=col_names)

In [0]:
print(bcs.shape)
bcs.info()  # info() prints its report itself and returns None
print(bcs.describe())
bcs.head()

In [0]:
bcs.isna().sum().sum()

In [0]:
bcs['if_survived'] = bcs['survival status'].apply(lambda val: 0 if val == 2 else 1)


In [0]:
bcs.head(15)
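
The Part 1 steps above can also be condensed into one routine that fails loudly if any check does not hold. This is only a minimal sketch: the function name `load_and_validate`, the underscore-style column names, and the five sample rows are assumptions made for illustration (the rows are made up and are not taken from `haberman.data`).

```python
import io
import pandas as pd

# Hypothetical rows in the same four-column, header-less layout as haberman.data.
# They are made up for illustration and are NOT taken from the real file.
SAMPLE_CSV = """34,60,0,1
41,64,3,2
52,59,12,2
38,66,1,1
63,61,0,1
"""

COL_NAMES = ["age", "operation_year", "pos_nodes", "survival_status"]


def load_and_validate(source, expected_rows):
    """Load a header-less CSV, check its size and completeness, recode survival."""
    df = pd.read_csv(source, header=None, names=COL_NAMES)

    # Without header=None the first observation would be consumed as column names.
    assert df.shape == (expected_rows, len(COL_NAMES)), df.shape

    # No missing values anywhere in the frame.
    assert df.isna().sum().sum() == 0

    # Only the two documented class codes should appear.
    assert set(df["survival_status"].unique()) <= {1, 2}

    # Recode: 1 (survived 5+ years) -> 1, 2 (died within 5 years) -> 0.
    df["survived"] = df["survival_status"].map({1: 1, 2: 0})
    return df


sample = load_and_validate(io.StringIO(SAMPLE_CSV), expected_rows=5)
print(sample.head())
```

For the real file, the same function could presumably be called with the UCI URL and `expected_rows=306`, the number of instances given in the UCI description.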

## Part 2 - Examine the distribution and relationships of the features

Explore the data - create at least *2* tables (can be summary statistics or crosstabulations) and *2* plots illustrating the nature of the data.

This is open-ended, so as a reminder: first *complete* this task as a baseline, then go on to the remaining sections, and *then*, as time allows, revisit and explore further.

Hint - you may need to bin some variables depending on your chosen tables/plots.
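
As a rough illustration of the binning hint, the sketch below bins a numeric column with `pd.cut` and feeds the result to `pd.crosstab`. The eight rows, the bin edges, and the labels are made up for the example; only the column names follow the ones used in this notebook.

```python
import pandas as pd

# Made-up values for illustration only; column names mirror the notebook's.
df = pd.DataFrame({
    "age": [31, 38, 44, 47, 52, 58, 63, 70],
    "if_survived": [1, 1, 1, 0, 0, 1, 0, 1],
})

# Right-closed intervals: (29, 40], (40, 50], (50, 60], (60, 85].
age_group = pd.cut(
    df["age"],
    bins=[29, 40, 50, 60, 85],
    labels=["30-40", "41-50", "51-60", "61-85"],
)

# normalize="columns" makes each age-group column sum to 1, so the row
# for if_survived == 1 reads as the survival rate within that group.
table = pd.crosstab(df["if_survived"], age_group, normalize="columns")
print(table)
```

Fewer, wider bins keep more patients in each column, which makes the column-wise rates less sensitive to a handful of cases.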

In [0]:
# Summary statistics for all columns
bcs.describe()

In [0]:
plt.hist(bcs['age'])

In [0]:
plt.hist(bcs['operation year 1900+'])

In [0]:
plt.hist(bcs['if_survived'])
bcs['if_survived'].value_counts()/len(bcs['if_survived'])

In [0]:
plt.scatter(bcs['if_survived'], bcs['pos nodes'])

In [0]:
plt.scatter(bcs['age'], bcs['if_survived'])

In [0]:
# Bin the numeric features so they can be used in crosstabs
oper_year_bin = pd.cut(bcs['operation year 1900+'], bins=[0, 58, 61, 63, 65, 69])
age_bin = pd.cut(bcs['age'], bins=[0, 30, 35, 41, 42, 43, 44, 45, 46, 51, 54, 57, 62, 67, 73, 78, 83])
pos_nodes_bin = pd.cut(bcs['pos nodes'], 5)  # five equal-width bins


In [0]:
pd.crosstab(bcs['if_survived'], oper_year_bin, normalize='columns')

In [0]:
pd.crosstab(bcs['if_survived'], bcs['operation year 1900+'], normalize='columns')

In [0]:
pd.crosstab(bcs['if_survived'], age_bin, normalize='columns')

In [0]:
# Scatter rather than a line plot: the rows are individual patients, not a series
plt.scatter(bcs['age'], bcs['pos nodes'])

In [0]:
pointsize = 20
plt.xlabel('operation year')
plt.ylabel('age')
plt.title('Survival vs operation year vs age')
# Color each point by survival status (0 = died within 5 years, 1 = survived)
plt.scatter(bcs['operation year 1900+'], bcs['age'], s=pointsize, c=bcs['if_survived']);

In [0]:
fig, ax = plt.subplots(figsize=(16,10))
# Plot pairwise correlations; annotating the raw 306-row frame would be unreadable
sns.heatmap(bcs.corr(), annot=True, linewidths=.5, cmap="YlGnBu", ax=ax);

In [0]:
age_bin2 = pd.cut(bcs['age'], bins=[0, 35, 41,46, 51, 57, 62, 73, 83])

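# Patient counts by (survival, operation-year bin) x age bin.
# Passing `values` without `aggfunc` raises a ValueError, so it is omitted here.
ds = pd.crosstab([bcs['if_survived'], oper_year_bin], age_bin2)
ds.plot(kind='bar', stacked=True)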

In [0]:
ds2 = pd.crosstab([pos_nodes_bin, age_bin2], bcs['if_survived'], normalize='columns')
ds2

In [0]:
ds2.plot(kind='bar', stacked=False)


In [0]:
# The tables above show lower survival rates for operations in 1958-59 and in 1965;
# the cells below try to figure out the reason.

is_58 =  bcs['operation year 1900+'] <= 59
is_65 = bcs['operation year 1900+'] == 65

In [0]:
bcs_58 = bcs[is_58]
bcs_65 = bcs[is_65]

In [0]:
print(bcs['if_survived'].value_counts()/len(bcs['if_survived']))
print(bcs_58['if_survived'].value_counts()/len(bcs_58['if_survived']))
print(bcs_65['if_survived'].value_counts()/len(bcs_65['if_survived']))


In [0]:
print(bcs_58.describe())  # print explicitly; only a cell's last expression is displayed
plt.hist(bcs_58['age'])

In [0]:
print(bcs_65.describe())  # print explicitly; only a cell's last expression is displayed
plt.hist(bcs_65['age'])

In [0]:
bcs['age'].describe()

In [0]:
bcs_58['age'].describe()

In [0]:
bcs_65['age'].describe()

In [0]:
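# Restrict the bins and the outcome to the 1965 subset (the full frame would repeat ds2)
ds_65 = pd.crosstab([pos_nodes_bin[is_65], age_bin2[is_65]], bcs_65['if_survived'], normalize='columns')
ds_65.plot(kind='bar', stacked=False)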

In [0]:
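# Restrict the bins and the outcome to the 1958-59 subset (the full frame would repeat ds2)
ds_58 = pd.crosstab([pos_nodes_bin[is_58], age_bin2[is_58]], bcs_58['if_survived'], normalize='columns')
ds_58.plot(kind='bar', stacked=False)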

## Part 3 - Analysis and Interpretation

Now that you've looked at the data, answer the following questions:

- What is at least one feature that looks to have a positive correlation with survival?
- What is at least one feature that looks to have a negative correlation with survival?
- How are those two features related with each other, and what might that mean?

Answer with text, but feel free to intersperse example code/results or refer to it from earlier.

Patients who were operated on before age 41 or after age 78 tend to survive more often. The data shows that after age 41 the probability of surviving a breast cancer operation drops sharply, from 86% for patients younger than 41 to 77% at age 41 and 42% at age 46.
The number of positive axillary nodes is highest in the age ranges where the probability of survival is low. The features associated with the survival rate are positively correlated with each other, so their effects are confounded. More domain-specific data is required to determine whether one of these features causes the other; intuitively, age might be the factor that causes the number of positive axillary nodes to increase.
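
One way to back up statements like these is to compute the Pearson correlation of each feature with the 0/1 survival flag, and of the features with each other. The sketch below only illustrates the mechanics: the eight rows are made up, so the printed numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only; column names mirror the notebook's.
df = pd.DataFrame({
    "age": [31, 38, 44, 47, 52, 58, 63, 70],
    "pos nodes": [0, 1, 0, 9, 14, 2, 21, 1],
    "if_survived": [1, 1, 1, 0, 0, 1, 0, 1],
})

# Pearson correlation of each feature with the 0/1 survival flag.
corr_with_survival = df.corr()["if_survived"].drop("if_survived")
print(corr_with_survival)

# Correlation between the two features themselves.
age_nodes_corr = df["age"].corr(df["pos nodes"])
print(age_nodes_corr)
```

A negative value in `corr_with_survival` means that higher values of the feature go together with lower survival. On the real data, running the same calls on `bcs` would show whether the signs match the reading of the crosstabs.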

Survival rates drop for operations performed in 1958-59 and in 1965, to 66% and 53% respectively. The age distribution for those years shows almost no difference from the rest of the data, so additional factors should be investigated for those years, such as the medication used or new surgical practices.
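
A per-year comparison like the one above can be sketched with a single `groupby`, reporting the number of patients next to each rate so that years with few operations are easy to spot. The rows below are made up for illustration; the output column names `patients` and `survival_rate` are assumptions of this sketch.

```python
import pandas as pd

# Made-up rows for illustration only; column names mirror the notebook's.
df = pd.DataFrame({
    "operation year 1900+": [58, 58, 58, 60, 60, 65, 65, 65, 65],
    "if_survived": [1, 0, 1, 1, 1, 0, 1, 0, 1],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
by_year = (
    df.groupby("operation year 1900+")["if_survived"]
      .agg(patients="count", survival_rate="mean")
)
print(by_year)
```

Reading the rate together with the group size helps judge whether a drop in a single year rests on enough patients to be worth investigating further.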