
# Data Science Unit 1 Sprint Challenge 1

## Loading, cleaning, visualizing, and analyzing data

In this sprint challenge you will look at a dataset of the survival of patients who underwent surgery for breast cancer.

http://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival

Data Set Information:
The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Attribute Information:
1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
-- 1 = the patient survived 5 years or longer
-- 2 = the patient died within 5 years

Sprint challenges are evaluated based on satisfactory completion of each part. It is suggested you work through it in order, getting each aspect reasonably working, before trying to deeply explore, iterate, or refine any given step. Once you get to the end, if you want to go back and improve things, go for it!

## Part 1 - Load and validate the data

- Load the data as a `pandas` data frame.
- Validate that it has the appropriate number of observations (you can check the raw file, and also read the dataset description from UCI).
- Validate that you have no missing values.
- Add informative names to the features.
- The survival variable is encoded as 1 for surviving >5 years and 2 for not - change this to be 0 for not surviving and 1 for surviving >5 years (0/1 is a more traditional encoding of binary variables)

At the end, print the first five rows of the dataset to demonstrate the above.

In [0]:
# look at data
!curl http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data

In [0]:
! curl http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.names

1. Title: Haberman's Survival Data

2. Sources:
   (a) Donor:   Tjen-Sien Lim (limt@stat.wisc.edu)
   (b) Date:    March 4, 1999

3. Past Usage:
   1. Haberman, S. J. (1976). Generalized Residuals for Log-Linear
      Models, Proceedings of the 9th International Biometrics
      Conference, Boston, pp. 104-122.
   2. Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984),
      Graphical Models for Assessing Logistic Regression Models (with
      discussion), Journal of the American Statistical Association 79:
      61-83.
   3. Lo, W.-D. (1993). Logistic Regression Trees, PhD thesis,
      Department of Statistics, University of Wisconsin, Madison, WI.

4. Relevant Information:
   The dataset contains cases from a study that was conducted between
   1958 and 1970 at the University of Chicago's Billings Hospital on
   the survival of patients who had undergone surgery for breast
   cancer.

5. Number of Instances: 306

6. Number of Attributes: 4 (including the class attribute)

7. 

In [0]:
patients_data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data'

In [0]:
# Load the data with pandas
import pandas as pd

# Widen the notebook display so tables are not truncated
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# The raw file has no header row, so supply descriptive column names
column_header = ['Age_of_Patient', 'Year_of_Operation', 'Number_Pos_Aux_Nodes', 'Survival_Status']

df = pd.read_csv(patients_data_url, names=column_header)




In [0]:
print(df.shape)
df.head(30)

In [0]:
# Non-null count per column; 306 in every column means no values are missing
df.count()

Age_of_Patient          306
Year_of_Operation       306
Number_Pos_Aux_Nodes    306
Survival_Status         306
dtype: int64
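The two validation steps (row count and missing values) can also be checked explicitly rather than read off the `count()` output. The following is a minimal sketch, not the notebook's own code: `validate` is a hypothetical helper, and `SAMPLE` holds only the first five rows shown in the notebook output so that the block runs without a network connection.

```python
import io
import pandas as pd

# First five rows of haberman.data (raw 1/2 coding in the last column).
# The full file has 306 rows; this sample is only for illustration.
SAMPLE = """30,64,1,1
30,62,3,1
30,65,0,1
31,59,2,1
31,65,4,1
"""

COLUMNS = ['Age_of_Patient', 'Year_of_Operation',
           'Number_Pos_Aux_Nodes', 'Survival_Status']


def validate(df, expected_rows, expected_cols=4):
    """Hypothetical helper: check the shape and count missing values."""
    missing = df.isnull().sum()
    return {
        'shape_ok': df.shape == (expected_rows, expected_cols),
        'missing_per_column': missing.to_dict(),
        'no_missing': int(missing.sum()) == 0,
    }


sample_df = pd.read_csv(io.StringIO(SAMPLE), names=COLUMNS)
report = validate(sample_df, expected_rows=5)
print(report)
```

On the full data frame the same call would be `validate(df, expected_rows=306)`, with 306 taken from the `haberman.names` description.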


In [0]:
# Recode: 1 (survived 5 years or longer) stays 1, 2 (died within 5 years) becomes 0
df['Survival_Status'] = df['Survival_Status'].map({1: 1, 2: 0})


In [0]:
df.head(20)

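|   | Age_of_Patient | Year_of_Operation | Number_Pos_Aux_Nodes | Survival_Status |
|---|---|---|---|---|
| 0 | 30 | 64 | 1 | 1 |
| 1 | 30 | 62 | 3 | 1 |
| 2 | 30 | 65 | 0 | 1 |
| 3 | 31 | 59 | 2 | 1 |
| 4 | 31 | 65 | 4 | 1 |
| 5 | 33 | 58 | 10 | 1 |
| 6 | 33 | 60 | 0 | 1 |
| 7 | 34 | 59 | 0 | 0 |
| 8 | 34 | 66 | 9 | 0 |
| 9 | 34 | 58 | 30 | 1 |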
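One thing to watch with a `.map` recoding in a notebook: if the cell is run a second time, the values are already 0/1, the key `2` no longer matches anything, and every 0 becomes `NaN`. A possible alternative, sketched here as an assumption rather than as the notebook's method, is to compare against 1, which gives the same result however many times it runs. `recode_survival` is a hypothetical helper and the input values are illustrative.

```python
import pandas as pd


def recode_survival(status):
    """Return 1 where the status is 1 (survived 5+ years), else 0.

    Caveat: any unexpected value is silently treated as 0, so this
    assumes the column has already been checked to contain only 1 and 2.
    """
    return (status == 1).astype(int)


raw = pd.Series([1, 1, 2, 2, 1], name='Survival_Status')  # illustrative values

once = recode_survival(raw)
twice = recode_survival(once)  # re-running leaves the values unchanged

# For comparison: applying the dictionary mapping twice loses the zeros.
mapped_twice = raw.map({1: 1, 2: 0}).map({1: 1, 2: 0})

print(once.tolist())
print(twice.tolist())
print(mapped_twice.isnull().sum())
```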


## Part 2 - Examine the distribution and relationships of the features

Explore the data - create at least *2* tables (can be summary statistics or crosstabulations) and *2* plots illustrating the nature of the data.

This is open-ended, so as a reminder: first *complete* this task as a baseline, then go on to the remaining sections, and *then*, as time allows, revisit and explore further.

Hint - you may need to bin some variables depending on your chosen tables/plots.

In [0]:
pd.crosstab(df['Number_Pos_Aux_Nodes'], df['Survival_Status'])

| Number_Pos_Aux_Nodes | Survival_Status = 0 | Survival_Status = 1 |
|---|---|---|
| 0 | 19 | 117 |
| 1 | 8 | 33 |
| 2 | 5 | 15 |
| 3 | 7 | 13 |
| 4 | 3 | 10 |
| 5 | 4 | 2 |
| 6 | 3 | 4 |
| 7 | 2 | 5 |
| 8 | 2 | 5 |
| 9 | 4 | 2 |


In [0]:
pd.crosstab(df['Age_of_Patient'], df['Survival_Status'])

| Age_of_Patient | Survival_Status = 0 | Survival_Status = 1 |
|---|---|---|
| 30 | 0 | 3 |
| 31 | 0 | 2 |
| 33 | 0 | 2 |
| 34 | 2 | 5 |
| 35 | 0 | 2 |
| 36 | 0 | 2 |
| 37 | 0 | 6 |
| 38 | 1 | 9 |
| 39 | 1 | 5 |
| 40 | 0 | 3 |


In [0]:
pd.crosstab(df['Year_of_Operation'], df['Survival_Status'], margins=True)

| Year_of_Operation | Survival_Status = 0 | Survival_Status = 1 | All |
|---|---|---|---|
| 58 | 12 | 24 | 36 |
| 59 | 9 | 18 | 27 |
| 60 | 4 | 24 | 28 |
| 61 | 3 | 23 | 26 |
| 62 | 7 | 16 | 23 |
| 63 | 8 | 22 | 30 |
| 64 | 8 | 23 | 31 |
| 65 | 13 | 15 | 28 |
| 66 | 6 | 22 | 28 |
| 67 | 4 | 21 | 25 |


In [0]:
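# Bin edges for patient age, in years
bins = [20, 30, 40, 50, 60, 70, 80]

In [0]:
# Cut ages into right-closed intervals such as (20, 30]. With no `labels`
# argument, pd.cut names each bin by its interval.
df['Age_bins'] = pd.cut(df['Age_of_Patient'], bins=bins)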
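A note on custom bin names: the third positional parameter of `pd.cut` is `right`, so labels have to be passed by keyword, and there must be exactly one label per interval, which is one fewer than the number of bin edges. The sketch below shows that on a few illustrative ages; the label strings in `names` are hypothetical, not taken from the notebook.

```python
import pandas as pd

edges = [20, 30, 40, 50, 60, 70, 80]

# Seven edges define six intervals, so six labels are needed (hypothetical names).
names = ['21-30', '31-40', '41-50', '51-60', '61-70', '71-80']

ages = pd.Series([30, 34, 45, 52, 61, 78])  # illustrative ages

# Intervals are right-closed by default, so age 30 falls in (20, 30].
binned = pd.cut(ages, bins=edges, labels=names)
print(binned.tolist())
```

Values outside the outer edges (for example an age above 80) are not assigned to any bin and come back as `NaN`.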

In [0]:
pd.crosstab(df['Age_bins'], df['Survival_Status'])

| Age_bins | Survival_Status = 0 | Survival_Status = 1 |
|---|---|---|
| (20, 30] | 0 | 3 |
| (30, 40] | 4 | 36 |
| (40, 50] | 29 | 64 |
| (50, 60] | 26 | 67 |
| (60, 70] | 18 | 45 |
| (70, 80] | 3 | 10 |




## Part 3 - Analysis and Interpretation

Now that you've looked at the data, answer the following questions:

- What is at least one feature that looks to have a positive relationship with survival?

I don't see any feature with a clearly positive relationship with survival status.

After binning age and comparing the bins with survival status, the counts of both survivors and non-survivors rise with age up to the (50, 60] bin. That pattern only shows that more patients in this dataset fall in those age ranges, not that they are more or less likely to survive. Not that I can see, anyway.
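Raw counts mix two things: how many patients are in a bin and how they fared. Dividing each row by its total separates them. The sketch below rebuilds the age-bin crosstab from the counts shown in Part 2 and converts it to per-bin survival rates; it is an illustration under that assumption, not output from the notebook. Those counts add up to 305 rather than 306, which suggests one age lies outside the 20 to 80 bin edges. That is an inference from the displayed table, not something the notebook checks.

```python
import pandas as pd

# Counts copied from the Age_bins crosstab in Part 2.
counts = pd.DataFrame(
    {0: [0, 4, 29, 26, 18, 3], 1: [3, 36, 64, 67, 45, 10]},
    index=pd.Index(
        ['(20, 30]', '(30, 40]', '(40, 50]', '(50, 60]', '(60, 70]', '(70, 80]'],
        name='Age_bins',
    ),
)
counts.columns.name = 'Survival_Status'

# Divide every row by its own total to get the share of each outcome per bin.
rates = counts.div(counts.sum(axis=1), axis=0)
survival_rate = rates[1].round(3)

print(survival_rate)
print(int(counts.to_numpy().sum()))
```

On the notebook's own data frame, `pd.crosstab(df['Age_bins'], df['Survival_Status'], normalize='index')` should give the same proportions directly.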


- What is at least one feature that looks to have a negative relationship with survival?


The number of positive axillary nodes seems to have a negative relationship with survival rate.

- How are those two features related with each other, and what might that mean?

When the number of positive axillary nodes is low, the chance of survival is much higher than when the node count starts to increase. After about 8 nodes, though, the survival rate no longer seems strongly affected by the node count.
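The node relationship is easier to see as rates over a few groups than as raw counts per node value, since the rows for higher node counts hold only a handful of patients each. The sketch below uses only the ten crosstab rows displayed in Part 2 (0 to 9 nodes), so patients with more nodes are not covered. The group edges (0, 1-3, 4-9) are a hypothetical choice made for illustration.

```python
import pandas as pd

# Counts copied from the displayed rows of the node crosstab (0 to 9 nodes).
counts = pd.DataFrame({
    'nodes': list(range(10)),
    'died': [19, 8, 5, 7, 3, 4, 3, 2, 2, 4],
    'survived': [117, 33, 15, 13, 10, 2, 4, 5, 5, 2],
})

# Hypothetical grouping: no positive nodes, 1 to 3, and 4 to 9.
counts['node_group'] = pd.cut(
    counts['nodes'], bins=[-1, 0, 3, 9], labels=['0', '1-3', '4-9']
)

# Sum the counts within each group, then compute the share that survived.
grouped = counts.groupby('node_group', observed=True)[['died', 'survived']].sum()
grouped['survival_rate'] = (
    grouped['survived'] / (grouped['died'] + grouped['survived'])
).round(3)

print(grouped)
```

With these groups the survival rate falls from group to group, which is consistent with the negative relationship described above, at least within the displayed rows.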

