# 1. Introduction

This notebook explores the Breast Cancer Survival dataset to understand key factors influencing survival outcomes of patients.

# 2. Problem Statement

The goal is to identify which features (age, year of operation, and number of positive axillary lymph nodes) are associated with five-year survival in breast cancer patients.

# 3. Installing & Importing Libraries

In [None]:
!pip install pandas numpy matplotlib seaborn pandas-profiling -q

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling
sns.set()

# 4. Data Acquisition & Description

In [None]:
data = pd.read_csv('Breast_cancer_survival.csv')
data.head()

# 5. Data Pre-Profiling

In [None]:
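# A notebook cell only displays its last expression, so print each summary explicitly
print(data.shape)
print(data.columns)
data.info()
print(data.describe())
print(data.isnull().sum())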

# 6. Data Pre-Processing

In [None]:
data.columns = data.columns.str.lower()
data.isnull().sum()

# 7. Data Post-Profiling

In [None]:
print("Survival status count: \n", data.status.value_counts())
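# Compute the percentages from the data instead of hardcoding the counts
status_pct = data['status'].value_counts(normalize=True) * 100
print("Patient survived 5 years or longer (%):", status_pct.loc[1])
print("Patient died within 5 years (%):", status_pct.loc[2])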
profile = pandas_profiling.ProfileReport(data)
profile.to_file(output_file="Breast_cancer_survival_before_preprocessing.html")

# 8. Exploratory Data Analysis

Visualizations answering important exploratory questions.

### Status Count

In [None]:
sns.countplot(x='status', data=data).set_title('Survival after 5 years: 1 = Yes, 2 = No')

### Age Class Grouping

In [None]:
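# pd.cut bins are right-closed, so include_lowest=True keeps patients aged exactly 30
data['age_class'] = pd.cut(data['age'], [30,40,50,60,90], labels=['<=40','41-50','51-60','>60'], include_lowest=True)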
data['age_class'].value_counts().plot(kind='bar', title='Patients Age Group Distribution')
plt.show()

### Survival by Age Group

In [None]:
# Map both status codes to readable labels (1 = survived 5 years or longer, 2 = died within 5 years)
data['survival_1'] = data['status'].map({1: 'Survived 5+ years', 2: 'Died within 5 years'})
pd.crosstab(data['age_class'], data['survival_1']).plot(kind='bar', stacked=True, title='Survival by Age Group')
plt.show()

### Scatter Plot

In [None]:
data.plot.scatter(x='age', y='pos_axillary_nodes')
plt.title('Axillary Nodes vs Age')
plt.show()

### Correlation Heatmap

In [None]:
plt.figure(figsize=(6,6))
# numeric_only=True skips the categorical helper columns (age_class, survival_1)
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='YlGnBu')
plt.title('Correlation Heatmap')
plt.show()

### Pair Plot

In [None]:
sns.pairplot(data, hue='status')
plt.show()

### Operation Year vs Survival

In [None]:
sns.countplot(x='years_of_operation', data=data, hue='status').set_title('Survival by Year of Operation')
plt.show()

### Violin Plot: Year of Operation

In [None]:
sns.violinplot(x='status', y='years_of_operation', data=data, palette='cool')
plt.title('Survival based on Year of Operation')
plt.show()

### Violin Plot: Axillary Nodes

In [None]:
sns.violinplot(x='status', y='pos_axillary_nodes', data=data, palette='cool')
plt.title('Survival based on Axillary Nodes')
plt.show()

### Age Distribution by Status

In [None]:
sns.histplot(data=data, x='age', hue='status', bins=15, kde=True)
plt.title('Age Distribution by Survival Status')
plt.show()

# 9. Summarization

## 9.1 Conclusion

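- Patients who survived five years or longer tended to have fewer positive axillary nodes than those who did not.
- Age and year of operation may also be associated with survival, but the plots alone do not establish how strongly.
- Survival status varies visibly across age groups, operation years, and node counts.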
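The first conclusion can be checked numerically rather than read off the violin plots. The following is a minimal sketch, assuming the lowercased column names `status` and `pos_axillary_nodes` used in the cells above; the helper name `nodes_summary_by_status` and the toy rows are hypothetical and only illustrate the call.

```python
import pandas as pd


def nodes_summary_by_status(df: pd.DataFrame) -> pd.DataFrame:
    """Count, mean, and median of positive axillary nodes per survival status."""
    return df.groupby("status")["pos_axillary_nodes"].agg(["count", "mean", "median"])


# Hypothetical toy rows for illustration only; these are NOT values from the dataset.
toy = pd.DataFrame({
    "status": [1, 1, 1, 2, 2],
    "pos_axillary_nodes": [0, 1, 2, 4, 10],
})

summary = nodes_summary_by_status(toy)
print(summary)
```

On the real data, `nodes_summary_by_status(data)` would give one row per status code. The median is usually the safer comparison here because node counts tend to be heavily right-skewed.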

## 9.2 Actionable Insights

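- Fewer positive lymph nodes, which may reflect earlier detection, are associated with longer survival.
- Patients aged 40–60 appeared to have relatively better outcomes, though this should be confirmed with per-group survival rates rather than raw counts.
- Lymph node involvement deserves particular clinical attention when assessing prognosis.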
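Because the age groups differ in size, the age-related insight is easier to judge from survival rates than from counts. The sketch below assumes `status == 1` means the patient survived five years or longer (as in the plot titles above) and reuses the same age bins; the function name `survival_rate_by_age_group` and the toy rows are hypothetical.

```python
import pandas as pd


def survival_rate_by_age_group(df: pd.DataFrame) -> pd.DataFrame:
    """Group size and share of survivors (status == 1) within each age bin."""
    age_group = pd.cut(
        df["age"],
        bins=[30, 40, 50, 60, 90],
        labels=["<=40", "41-50", "51-60", ">60"],
        include_lowest=True,  # keep patients aged exactly 30 in the first bin
    )
    survived = df["status"].eq(1)  # assumption: 1 = survived 5 years or longer
    # "size" is the number of patients per bin, "mean" is the survival rate
    return survived.groupby(age_group, observed=False).agg(["size", "mean"])


# Hypothetical toy rows for illustration only; these are NOT values from the dataset.
toy = pd.DataFrame({
    "age": [30, 35, 45, 48, 55, 70],
    "status": [1, 2, 1, 1, 2, 1],
})

rates = survival_rate_by_age_group(toy)
print(rates)
```

Reporting the group size next to the rate makes it clear when a seemingly high or low rate rests on only a handful of patients.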

# 📊 Extended EDA: 10 Key Questions with Visualizations

### 🔹 Q: What is the overall distribution of patients based on their survival status?

In [None]:
sns.countplot(x='status', data=data); plt.title('Survival Status Distribution'); plt.show()

### 🔹 Q: How is the age of patients distributed across the dataset?

In [None]:
sns.histplot(data['age'], bins=10, kde=True); plt.title('Age Distribution'); plt.xlabel('Age'); plt.show()

### 🔹 Q: Which age groups have higher survival rates?

In [None]:
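# pd.cut bins are right-closed, so include_lowest=True keeps patients aged exactly 30
data['age_class'] = pd.cut(data['age'], [30, 40, 50, 60, 90], labels=['<=40','41-50','51-60','>60'], include_lowest=True)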
sns.countplot(x='age_class', hue='status', data=data); plt.title('Survival by Age Group'); plt.show()

### 🔹 Q: What is the relationship between age and the number of positive axillary nodes?

In [None]:
sns.scatterplot(x='age', y='pos_axillary_nodes', hue='status', data=data); plt.title('Age vs Positive Axillary Nodes'); plt.show()

### 🔹 Q: How are years of operation distributed, and does it influence survival?

In [None]:
sns.countplot(x='years_of_operation', hue='status', data=data); plt.title('Year of Operation and Survival'); plt.show()

### 🔹 Q: Do patients with fewer positive axillary nodes have better survival outcomes?

In [None]:
sns.violinplot(x='status', y='pos_axillary_nodes', data=data); plt.title('Survival vs Axillary Nodes'); plt.show()

### 🔹 Q: What is the correlation between the numerical variables in the dataset?

In [None]:
# numeric_only=True skips the categorical helper columns (age_class, survival_1)
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='YlGnBu'); plt.title('Correlation Heatmap'); plt.show()
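With `status` coded as 1 and 2, the sign of its correlations in a heatmap is easy to misread. As a rough sketch, the status can be recoded to a 0/1 survival indicator and correlated with each numeric feature. This assumes `status == 1` means survival; the helper name `correlation_with_survival` and the toy rows are hypothetical. The Pearson correlation with a binary variable is the point-biserial correlation, and it only captures linear association.

```python
import pandas as pd


def correlation_with_survival(df: pd.DataFrame) -> pd.Series:
    """Pearson correlation of each numeric feature with a 0/1 survival indicator."""
    features = df.select_dtypes(include="number").drop(columns="status")
    survived = df["status"].eq(1).astype(int)  # assumption: 1 = survived 5 years or longer
    return features.corrwith(survived).sort_values()


# Hypothetical toy rows for illustration only; these are NOT values from the dataset.
toy = pd.DataFrame({
    "age": [30, 40, 50, 60],
    "pos_axillary_nodes": [0, 1, 5, 9],
    "status": [1, 1, 2, 2],
})

corr = correlation_with_survival(toy)
print(corr)
```

A negative value then reads directly as "higher values of this feature go with lower survival".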

### 🔹 Q: Can we identify patterns among features that distinguish survival groups?

In [None]:
sns.pairplot(data, hue='status'); plt.suptitle('Feature Patterns by Survival', y=1.02); plt.show()

### 🔹 Q: Is there a specific year where survival rates were noticeably better or worse?

In [None]:
operation_survival = data.groupby('years_of_operation')['status'].value_counts(normalize=True).unstack()
operation_survival.plot(kind='bar', stacked=True)
plt.title('Survival Rate by Operation Year'); plt.ylabel('Proportion'); plt.show()

### 🔹 Q: How does the number of positive axillary nodes affect survival status distribution?

In [None]:
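# Start below 0 and leave the top bin open so patients with 0 nodes or more than 30 nodes are not dropped
bins = pd.cut(data['pos_axillary_nodes'], bins=[-1, 3, 6, 10, float('inf')], labels=['0-3','4-6','7-10','11+'])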
sns.countplot(x=bins, hue='status', data=data)
plt.title('Axillary Nodes Grouped vs Survival'); plt.xlabel('Positive Axillary Nodes Group'); plt.show()