# Exploratory Data Analysis on Haberman Dataset


### 1. About Haberman Dataset
#### The dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

In [48]:
# Getting all the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings


In [82]:
# Read the CSV file into a DataFrame (supplying the column names via `names`) and assign it to the haberman variable
haberman = pd.read_csv('haberman.csv', names = ['age', 'operation_year','axil_nodes','survival_status'])
print(haberman.head(20))
print(haberman.shape)

    age  operation_year  axil_nodes  survival_status
0    30              64           1                1
1    30              62           3                1
2    30              65           0                1
3    31              59           2                1
4    31              65           4                1
5    33              58          10                1
6    33              60           0                1
7    34              59           0                2
8    34              66           9                2
9    34              58          30                1
10   34              60           1                1
11   34              61          10                1
12   34              67           7                1
13   34              60           0                1
14   35              64          13                1
15   35              63           0                1
16   36              60           1                1
17   36              69           0           

### 1.1 Observations on Haberman Dataset
#### Number of patient cases: *306*
#### Mean age in the dataset is *52.46*, with min age = *30* and max age = *83*
#### Mean number of axil nodes in the dataset is about *4*
#### The dataset has *4* columns: ***Age, Year of operation, Axil nodes*** (features) and ***Survival status of patient*** (the class label)
#### Details about features/attributes:
 - Age of patient at time of operation (numerical)
 - Patient's year of operation (year - 1900, numerical)
 - Number of positive axillary nodes detected (numerical). A positive axillary node is a lymph node in the armpit (axilla) to which cancer has spread; this is determined by surgically removing some of the lymph nodes and examining them under a microscope for cancer cells.
 - Survival status (class attribute)
   - 1 = the patient survived 5 years or longer
   - 2 = the patient died within 5 years

In [18]:
# To print the number of datapoints in each class
print(haberman[haberman['survival_status']==1].shape)
print(haberman[haberman['survival_status']==2].shape)

(225, 4)
(81, 4)


- So 225 patients survived 5 years or longer after breast cancer surgery.
- And 81 patients died within 5 years of breast cancer surgery.
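
The two counts also show how imbalanced the classes are. As a minimal sketch, the snippet below rebuilds the label column from the counts printed above (so it runs without `haberman.csv`) and uses `value_counts` to get both the counts and the percentage shares. The names `survival_status`, `class_counts`, and `class_share` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Label column rebuilt from the counts reported above (225 of class 1, 81 of class 2).
# It stands in for haberman['survival_status'] so the snippet runs without the CSV.
survival_status = pd.Series([1] * 225 + [2] * 81, name='survival_status')

# Absolute number of patients per class
class_counts = survival_status.value_counts().sort_index()

# Share of each class in percent, rounded to one decimal place
class_share = (survival_status.value_counts(normalize=True).sort_index() * 100).round(1)

print(class_counts)
print(class_share)
```

On the real DataFrame the same two calls would be applied to `haberman['survival_status']`. Roughly three quarters of the patients belong to class 1, which is worth keeping in mind when comparing the two groups in the plots.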

### 1.2 Objective of the problem
#### We perform exploratory data analysis (univariate, bivariate, and multivariate) to gain more insight into the Haberman dataset. The goal is to observe, visually and through summary statistics, how age, year of operation, and the number of positive axillary nodes (an indicator of how far the cancer has spread) *affect the patient's chances of survival*.

### 1.3 Univariate analysis - Plot PDF, CDF, Boxplot, Violin plots

### PDF & Histogram plot

In [74]:
%matplotlib notebook
# Ignoring the warnings
warnings.filterwarnings('ignore')

# Plotting the PDF (histogram with KDE) for year of operation
g = sns.FacetGrid(haberman, height = 5).map(sns.distplot, "operation_year").add_legend()
# Used to put name of y axis
plt.ylabel('Probability of RV X = x for PDF')
# Used to give the title name to the pdf plot
plt.title('PDF for feature Operation_year')
# Used to display the plot
plt.show()
# To show the statistics of operation year
print(' \n Statistics of Year of operation')
print('\n', haberman['operation_year'].describe())


 
 Statistics of Year of operation

 count    306.000000
mean      62.852941
std        3.249405
min       58.000000
25%       60.000000
50%       63.000000
75%       65.750000
max       69.000000
Name: operation_year, dtype: float64


In [73]:
%matplotlib notebook
# Ignoring the warnings
warnings.filterwarnings('ignore')

# Plotting the PDF (histogram with KDE) for year of operation, split by survival status
g = sns.FacetGrid(haberman, hue = 'survival_status', height = 5).map(sns.distplot, "operation_year").add_legend()
# Used to put name of y axis
plt.ylabel('Probability of RV X = x for PDF')
# Used to give the title name to the pdf plot
plt.title('PDF for feature Operation_year')
# Used to display the plot
plt.show()
# To show the statistics of operation year
print(' \n Statistics of Year of operation')
print('\n', haberman['operation_year'].describe())


 
 Statistics of Year of operation

 count    306.000000
mean      62.852941
std        3.249405
min       58.000000
25%       60.000000
50%       63.000000
75%       65.750000
max       69.000000
Name: operation_year, dtype: float64


### Observations:
 - Approximately 20% (the highest share) of all operations were performed between 1958 and 1960.
 - Approximately 7.5% (the lowest share) of all operations were performed between 1961 and 1963.
 - Patients who died within 5 years of the operation were most often operated on between 1963 and 1966.
 - Patients who died within 5 years of the operation were least often operated on between 1966 and 1969.
 - Patients who survived 5 years or longer were most often operated on between 1959 and 1962.
 - Patients who survived 5 years or longer were least often operated on between 1967 and 1969.
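
Percentages read off a density plot are approximate, so it can help to cross-check them by counting. The sketch below shows one way to do that with `pd.cut` and `pd.crosstab`. The DataFrame `sample` is a hypothetical mini-sample with made-up values (not the real data), and the 3-year bin edges are an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the haberman DataFrame.
# The values are made up for illustration and are NOT taken from the real dataset.
sample = pd.DataFrame({
    'operation_year':  [58, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 69],
    'survival_status': [1,  2,  1,  1,  1,  2,  2,  1,  2,  1,  1,  1],
})

# Group the years into 3-year bins: [58, 61), [61, 64), [64, 67), [67, 70)
bins = [58, 61, 64, 67, 70]
year_bin = pd.cut(sample['operation_year'], bins=bins, right=False)

# Percentage of all operations that fall into each bin
share_per_bin = (year_bin.value_counts(normalize=True).sort_index() * 100).round(1)

# Number of operations per bin, split by survival status
counts_by_status = pd.crosstab(year_bin, sample['survival_status'])

print(share_per_bin)
print(counts_by_status)
```

Replacing `sample` with `haberman` would give the actual share of operations per period and the per-class counts, which can then be compared with the values estimated from the plots.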
 
### CDF plot

In [76]:
%matplotlib notebook
# PDF plot of age for both kinds of patients, i.e. survival status 1 and 2 (the CDF is plotted in the next cell)

# Plots the kernel density estimate (KDE) for the haberman dataset, i.e. the PDF plot
sns.kdeplot(data = haberman, x = 'age', hue = 'survival_status', cumulative = False)
# Used to put name of y axis
plt.ylabel('''Probability of RV X = x for PDF''')
# Used to give the title name to the pdf plot
plt.title('PDF plot for Age feature')
# Used to display the plot
plt.show()
# To show the statistics of age
print(' \n Statistics of Age')
print('\n', haberman['age'].describe())



 
 Statistics of Age

 count    306.000000
mean      52.457516
std       10.803452
min       30.000000
25%       44.000000
50%       52.000000
75%       60.750000
max       83.000000
Name: age, dtype: float64


In [77]:
%matplotlib notebook
# Plots the kernel density estimate for the haberman dataset with the cumulative option ON, i.e. the CDF
sns.kdeplot(data = haberman, x = 'age', hue = 'survival_status', cumulative = True)
# Used to put name of y axis
plt.ylabel('''Probability of RV X <= x for CDF''')
# Used to give the title name to the CDF plot
plt.title('CDF plot for Age feature')
# Used to display the plot
plt.show()
# To calculate statistics of age feature
print(' \n Statistics of Age')
print('\n', haberman['age'].describe())


 
 Statistics of Age

 count    306.000000
mean      52.457516
std       10.803452
min       30.000000
25%       44.000000
50%       52.000000
75%       60.750000
max       83.000000
Name: age, dtype: float64


### Observations:
- The distribution in the first plot is roughly bell-shaped, close to a normal distribution.
- In the first plot, most patients (both type 1 and type 2) fall in the 45 to 55 age range.
- From the second plot, approximately 73.5% of patients fall in type 1 and the remaining 26.5% in type 2, consistent with the class counts (225 and 81 of 306).
- Also in the second plot, the slope after age 70 is very gradual, almost negligible, which shows that the number of patients barely grows beyond 75 years of age.
- The likelihood of finding a survivor (5 years or longer after the operation) is highest at about age 53.
- The mean age in the study is 52.46 years, with a standard deviation of 10.8 years.
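
The smoothed CDF from `kdeplot` can be complemented by an empirical CDF, which is just the fraction of observations at or below each value. The sketch below computes it with NumPy. The array `ages` holds hypothetical values for illustration (not the real data), and `fraction_at_or_below` is a helper name introduced here.

```python
import numpy as np

# Hypothetical ages for illustration only (NOT taken from the real dataset)
ages = np.array([30, 34, 41, 45, 50, 52, 53, 57, 61, 70])

# Empirical CDF: sort the values, then F(x_i) = (rank of x_i) / n
sorted_ages = np.sort(ages)
ecdf = np.arange(1, len(sorted_ages) + 1) / len(sorted_ages)

def fraction_at_or_below(x):
    """Return the fraction of observations with age <= x."""
    return np.searchsorted(sorted_ages, x, side='right') / len(sorted_ages)

# Fraction of patients in the sample aged 52 or younger
print(fraction_at_or_below(52))
```

Plotting `sorted_ages` against `ecdf` with `plt.step` would give a staircase curve. Unlike the KDE-based curve it involves no smoothing, so statements such as "x% of patients are younger than 60" can be read off exactly.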


### Box plot

In [78]:
%matplotlib notebook
# Boxplot analysis on number of axil nodes

# closes previous opened plots to free up memory
plt.close()
# to apply grid in the plot region
sns.set(style = 'whitegrid')
# to change the size of plot
plt.rcParams['figure.figsize']= [5,6]
# plot a boxplot using seaborn for feature axil nodes for each type of survival status i.e. type 1 & type 2
g = sns.boxplot(x='survival_status', y = 'axil_nodes', hue ='survival_status', data = haberman)
# function to return handles & labels for legend
handles, _ = g.get_legend_handles_labels()
# To put legend in the plot
g.legend(handles,['survival status 1','survival status 2'])
# Adds title to the plot
plt.title('Axil node w.r.t. survival Status boxplot', size = 15)
# display the plot
plt.show()
# To show the statistics of axil nodes
print(' \n Statistics of number of axil nodes')
print('\n', haberman['axil_nodes'].describe())



 
 Statistics of number of axil nodes

 count    306.000000
mean       4.026144
std        7.189654
min        0.000000
25%        0.000000
50%        1.000000
75%        4.000000
max       52.000000
Name: axil_nodes, dtype: float64


### Observations:
- For type 1 patients, the axil node count is 0 at the 25th percentile, 0 at the 50th percentile (median), and about 3.4 at the 75th percentile.
- For type 2 patients, the axil node count is about 1.3 at the 25th percentile, 4.1 at the 50th percentile (median), and 11.2 at the 75th percentile.
- Type 1 patients have more outliers than type 2 patients.
- For type 1 patients the lower whisker is absent (the minimum and the 25th percentile are both 0) and the upper whisker is at 7 axil nodes.
- For type 2 patients the lower whisker is at 0 and the upper whisker is at 24 axil nodes.
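
The quartiles, whiskers, and outliers above are read from the plot, but they can also be computed directly. The sketch below assumes the common 1.5 × IQR whisker rule, which as far as we know is the default for `sns.boxplot` unless `whis` is changed. The array `nodes` contains hypothetical counts for illustration, not the real data.

```python
import numpy as np

# Hypothetical axil node counts for one group, for illustration only
nodes = np.array([0, 0, 0, 1, 1, 2, 3, 4, 8, 30])

# Box edges and median (linear interpolation between neighbouring values)
q1, median, q3 = np.percentile(nodes, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box (assumed whisker rule)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the most extreme observations that are still inside the fences
lower_whisker = nodes[nodes >= lower_fence].min()
upper_whisker = nodes[nodes <= upper_fence].max()

# Observations outside the fences are drawn as individual outlier points
outliers = nodes[(nodes < lower_fence) | (nodes > upper_fence)]

print(q1, median, q3)
print(lower_whisker, upper_whisker, outliers)
```

Because percentiles are interpolated, they can be fractional even though node counts are integers. When the minimum sits at or very near the 25th percentile, the lower whisker has little or no length, which is why it can appear to be missing in the plot.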

### Violin plot

In [80]:
%matplotlib notebook
# Violin plot analysis

# closes previous opened plots to free up memory
plt.close()
# Used to change the size of plot
plt.rcParams['figure.figsize'] = [5,6]
# Used to draw violinplot on haberman dataset
sns.violinplot(x='survival_status', y='operation_year', data= haberman)
# Adds title to the plot
plt.title('Operation year w.r.t. survival Status', size = 15)
# Used to display the plot
plt.show()


### Observations:
- The y axis represents the year of operation and the x axis shows the type of patient (1 or 2).
- The distribution of operation year is not normal for either type of patient.
- The median (50th percentile) year of operation is almost the same for both types, approximately 1963.
- The lower and upper whisker boundaries for both types are approximately 1958 and 1969.
- The 25th percentile year of operation is approximately 1960 for type 1 patients and 1959 for type 2 patients.
- The 75th percentile year of operation is approximately 1966 for type 1 patients and 1965 for type 2 patients.

### 1.4 Perform Bivariate analysis - Plot 2D Scatter plots and Pair plots

### 2D Scatter plot

In [81]:
%matplotlib notebook
# Plotting 2D scatter plot using matplotlib

# sets the background of plot in darkgrid
sns.set_style('darkgrid')
# plots scatter plot
sns.FacetGrid(haberman, hue='survival_status',height = 6).map(plt.scatter, 'age','axil_nodes').add_legend(frameon = True, loc='upper right')
# Adds title to the plot
plt.title('axil nodes w.r.t. age scatter plot', size = 20)
# Used to display the plot
plt.show()


### Observations from the plot:
- Most patients have fewer than 15 axil nodes.
- There is no clear separation between the datapoints of type 1 and type 2 patients, so the two classes cannot be separated using age and axil nodes alone.
- The highest axil node count, 52, belongs to a type 2 patient aged about 43.
- The oldest patient (age 83, type 2) had 2 axil nodes.

### Pairplot

In [60]:
%matplotlib notebook
# closes previous opened plots to free up memory
plt.close()
# puts grid pattern on the background of plot
sns.set_style('whitegrid')
# Draw the plots between 2 features at a time, i.e. a pairplot; height sets the size (in inches) of each facet
g = sns.pairplot(haberman, hue = 'survival_status', height = 4).add_legend(frameon = True, loc ='upper right')
# Used to add title to the matrix of plots
g.fig.suptitle('2D pairplot for age, operation year, axil nodes', y = 1, size =20)
# Used to display the plots
plt.show()


### Observations from the above plot:
- The off-diagonal plots do not clearly separate the datapoints of type 1 and type 2 patients.
- The diagonal plots are PDF plots of age, operation year, and axil nodes; none of them is exactly a normal distribution.
- In each diagonal plot, type 1 patients outnumber type 2 patients.

### 1.5 Overall Conclusion:
- In the univariate analysis, the operation year distributions of the two classes overlap heavily, so patients cannot be classified into type 1 and type 2 based on operation year alone.
- The CDF plots for the 'age' feature show that type 1 patients outnumber type 2 patients at almost all ages.
- Type 1 patients have more outliers than type 2 patients in the number of axil nodes.
#### PDF , CDF & Histogram plot
- Approximately 20% (the highest share) of all operations were performed between 1958 and 1960.
- Approximately 7.5% (the lowest share) of all operations were performed between 1961 and 1963.
- The likelihood of finding a survivor (5 years or longer after the operation) is highest at about age 53.
- The mean age in the study is 52.46 years, with a standard deviation of 10.8 years.
#### 2D Scatter plots and Pair plots
- Bivariate analysis between 'age' and 'axil nodes', or any other pair of variables, is also unable to separate patients by survival status.
- The scatter plot of 'age' against 'axil nodes' shows that most datapoints lie below 15 axil nodes and within the 33 to 70 age range.