# Haberman’s Survival Data Set

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Attribute Information:

- Age of patient at time of operation (numerical)
- Patient's year of operation (year - 1900, numerical)
- Number of positive axillary nodes detected (numerical)
- Survival status (class attribute)
  - 1 = the patient survived 5 years or longer
  - 2 = the patient died within 5 years

# Setup

In [28]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from sklearn.metrics import roc_curve, roc_auc_score, accuracy_score, recall_score, precision_score, auc
from sklearn.metrics import classification_report, f1_score, fbeta_score


In [29]:
df_cancer = pd.read_csv('haberman.csv')
df_cancer.columns = ['age' , 'operation_year' , 'axil_node' , 'survived_status']

In [30]:
df_cancer.tail()

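|     | age | operation_year | axil_node | survived_status |
|-----|-----|----------------|-----------|-----------------|
| 300 | 75  | 62             | 1         | 1               |
| 301 | 76  | 67             | 0         | 1               |
| 302 | 77  | 65             | 3         | 1               |
| 303 | 78  | 65             | 1         | 2               |
| 304 | 83  | 58             | 2         | 2               |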


In [31]:
df_cancer.describe()

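|       | age       | operation_year | axil_node | survived_status |
|-------|-----------|----------------|-----------|-----------------|
| count | 305.0     | 305.0          | 305.0     | 305.0           |
| mean  | 52.531148 | 62.84918       | 4.036066  | 1.265574        |
| std   | 10.744024 | 3.254078       | 7.19937   | 0.442364        |
| min   | 30.0      | 58.0           | 0.0       | 1.0             |
| 25%   | 44.0      | 60.0           | 0.0       | 1.0             |
| 50%   | 52.0      | 63.0           | 1.0       | 1.0             |
| 75%   | 61.0      | 66.0           | 4.0       | 2.0             |
| max   | 83.0      | 69.0           | 52.0      | 2.0             |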


In [32]:
df_cancer['survived_status'].value_counts(normalize=True)

1    0.734426
2    0.265574
Name: survived_status, dtype: float64

The classes are imbalanced: about 73% of patients survived 5 years or longer and about 27% died within 5 years.
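
To make the effect of the imbalance concrete, here is a minimal sketch of the majority-class baseline. The counts 224 and 81 are not printed in the notebook; they are inferred from the proportions above applied to the 305 rows, so treat them as an assumption.

```python
# Counts inferred from the proportions above (assumption: 305 rows in total).
n_survived = 224  # original class 1: survived 5 years or longer
n_died = 81       # original class 2: died within 5 years
n_total = n_survived + n_died

# A "model" that always predicts the majority class (survived).
baseline_accuracy = n_survived / n_total

# It never flags a death, so recall for the "died" class is zero.
baseline_recall_died = 0 / n_died

print(f"baseline accuracy: {baseline_accuracy:.6f}")
print(f"recall for 'died': {baseline_recall_died:.1f}")
```

Any classifier fitted to this data should be judged against this baseline, which is why accuracy alone can be misleading here.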

In [33]:
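# Recode the target so that 1 marks the modeled event: 0 = survived 5 years or longer, 1 = died within 5 years
df_cancer['survived_status'] = df_cancer['survived_status'].map({1: 0, 2: 1})
df_cancer.tail()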

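|     | age | operation_year | axil_node | survived_status |
|-----|-----|----------------|-----------|-----------------|
| 300 | 75  | 62             | 1         | 0               |
| 301 | 76  | 67             | 0         | 0               |
| 302 | 77  | 65             | 3         | 0               |
| 303 | 78  | 65             | 1         | 1               |
| 304 | 83  | 58             | 2         | 1               |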


Survived Status (after recoding):
- 0 : The patient survived 5 years or longer
- 1 : The patient died within 5 years

# Fitting the Logistic Regression

In [34]:
function_formula = 'survived_status ~ age + operation_year + axil_node'
log_reg = smf.logit(formula = function_formula , data = df_cancer)
log_reg = log_reg.fit()

Optimization terminated successfully.
         Current function value: 0.537621
         Iterations 5


In [36]:
print(log_reg.summary())

                           Logit Regression Results                           
Dep. Variable:        survived_status   No. Observations:                  305
Model:                          Logit   Df Residuals:                      301
Method:                           MLE   Df Model:                            3
Date:                Sun, 22 Jan 2023   Pseudo R-squ.:                 0.07116
Time:                        21:05:01   Log-Likelihood:                -163.97
converged:                       True   LL-Null:                       -176.54
Covariance Type:            nonrobust   LLR p-value:                 1.455e-05
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         -1.8561      2.673     -0.694      0.487      -7.095       3.383
age                0.0193      0.013      1.511      0.131      -0.006       0.044
operation_year    -0.0093      0.042

## Interpretation of the Betas

In [38]:
print(np.exp(log_reg.params))

Intercept         0.156285
age               1.019522
operation_year    0.990716
axil_node         1.092142
dtype: float64


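Each value is an odds ratio: the factor by which the odds of dying within 5 years are multiplied for a one-unit increase in the variable, holding the other variables constant.

- age : Each additional year of age at the time of operation increases the odds of dying within 5 years by about 1.95%. The p-value for this coefficient is 0.131, so the effect is not statistically significant at the 5% level.
- operation_year : Each later year of operation decreases the odds of dying within 5 years by about 0.93%.
- axil_node : Each additional positive axillary node increases the odds of dying within 5 years by about 9.21%.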
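
The percentages above follow from the odds ratios via (OR - 1) × 100. A minimal sketch, with the odds ratios copied by hand from the `np.exp(log_reg.params)` output rather than read from the fitted model; `percent_change_in_odds` is a hypothetical helper name introduced for illustration.

```python
# Odds ratios copied from the np.exp(log_reg.params) output above.
odds_ratios = {
    "age": 1.019522,
    "operation_year": 0.990716,
    "axil_node": 1.092142,
}


def percent_change_in_odds(odds_ratio: float) -> float:
    """Percent change in the odds for a one-unit increase in the predictor."""
    return (odds_ratio - 1.0) * 100.0


for name, ratio in odds_ratios.items():
    print(f"{name}: {percent_change_in_odds(ratio):+.2f}%")

# Effects compound multiplicatively: five extra positive nodes multiply
# the odds by the odds ratio raised to the fifth power.
five_node_ratio = odds_ratios["axil_node"] ** 5
print(f"5 extra nodes: odds x {five_node_ratio:.3f}")
```

Note that these are changes in the odds, not in the probability; the change in probability depends on the patient's starting risk.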

# Metrics

## F-Beta Score
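
As a sketch of what this section would compute: the F-beta score combines precision and recall, weighting recall beta times as much as precision. The block below implements the standard formula in plain Python from confusion-matrix counts. The counts are made up for illustration and are not results from this notebook, and `f_beta` is a hypothetical helper; in the notebook itself, the already imported `fbeta_score` from `sklearn.metrics` would take true and predicted labels instead.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float) -> float:
    """F-beta from confusion-matrix counts; returns 0.0 when undefined."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Illustrative counts only (not produced by the model above).
tp, fp, fn = 30, 20, 51

precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1 = f_beta(tp, fp, fn, beta=1.0)    # precision and recall weighted equally
f2 = f_beta(tp, fp, fn, beta=2.0)    # favors recall
f05 = f_beta(tp, fp, fn, beta=0.5)   # favors precision

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"F0.5={f05:.3f} F1={f1:.3f} F2={f2:.3f}")
```

If missing a patient who will die within 5 years is considered costlier than a false alarm, a beta above 1 is the natural choice, since it rewards recall on the positive class.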

## ROC Curve

## PR Curve