# <center>Cancer Diagnosis Analytics Dashboard</center>

# Exploratory Data Analysis

This notebook performs comprehensive exploratory data analysis (EDA) on the METABRIC breast cancer dataset. We will examine patterns, distributions, and relationships within the data to establish a foundation for future predictive modeling and clinical insights.

## Objectives

The analysis covers four areas:

### Distribution Analysis

- Analyze survival rate distribution across molecular subtypes
- Explore clinical parameter distributions by age groups
- Examine gene expression patterns across different cancer types

### Temporal Analysis

- Evaluate survival time distribution patterns
- Analyze diagnosis age trends
- Explore treatment response over time

### Patient-Specific Analysis

- Calculate key metrics per patient
- Identify patterns in mutation loads per patient
- Analyze treatment combinations and outcomes

### Feature Relationship Analysis

- Examine correlations between clinical features
- Explore relationships between genomic markers and survival
- Identify potential predictive features for survival outcomes

## Import Libraries

In [10]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import os
from datetime import datetime
from scipy import stats

# Create directory for saving visualizations
os.makedirs('../static/images', exist_ok=True)
print("Created directory: ../static/images")

# Configure visualization settings
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_theme(style="whitegrid", font_scale=1.2)
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['axes.labelsize'] = 12

# Display settings for pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.2f}'.format)

# Record execution time for performance tracking
start_time = datetime.now()
print(f"Notebook execution started at: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")


Created directory: ../static/images
Notebook execution started at: 2025-03-16 23:59:21


## 1. Data Loading

In [11]:
# Load the processed dataset
try:
    df_processed = pd.read_csv('../data/processed/metabric_processed_full.csv')
    print(f"Processed dataset loaded successfully. Shape: {df_processed.shape}")
except FileNotFoundError:
    print("ERROR: Processed file not found!")
    df_processed = None
    
# Display the first few rows of the dataset
if df_processed is not None:
    print("\n--- First 5 rows of the processed dataset ---")
    display(df_processed.head())


Processed dataset loaded successfully. Shape: (5, 2)

--- First 5 rows of the processed dataset ---


| | sample_id | gene1 |
|---|---|---|
| 0 | -1.41 | -1.26 |
| 1 | -0.71 | -0.85 |
| 2 | 0.0 | 0.01 |
| 3 | 0.71 | 0.57 |
| 4 | 1.41 | 1.54 |
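The later cells depend on `df_processed` having loaded and on specific columns being present. One way to make both checks reusable is sketched below. The helper names `load_csv_or_none` and `has_columns` are hypothetical and belong neither to pandas nor to this notebook, and the file name in the usage lines is a placeholder.

```python
from pathlib import Path

import pandas as pd


def load_csv_or_none(path):
    """Return the CSV as a DataFrame, or None if the file does not exist."""
    path = Path(path)
    if not path.is_file():
        print(f"ERROR: file not found: {path}")
        return None
    return pd.read_csv(path)


def has_columns(df, columns):
    """True only if df was loaded and contains every named column."""
    return df is not None and set(columns).issubset(df.columns)


# Usage with a placeholder path and a small made-up frame
missing = load_csv_or_none('no_such_file_for_this_sketch.csv')
demo = pd.DataFrame({'sample_id': [1, 2], 'gene1': [0.1, 0.2]})
print(has_columns(missing, ['gene1']))             # df is None
print(has_columns(demo, ['gene1', 'sample_id']))   # both columns present
print(has_columns(demo, ['overall_survival']))     # column absent
```

A guard of this kind lets an analysis cell skip cleanly instead of raising `AttributeError` when the load step returned `None`.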


## 2. Distribution Analysis

### 2.1 Survival Analysis by Molecular Subtype

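This section compares survival rates across the molecular subtypes of breast cancer. The survival rate of a subtype is computed as the mean of `overall_survival` within that subtype, expressed as a percentage.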

In [12]:
# Check that the dataset loaded and contains the columns this analysis needs
if (df_processed is not None
        and 'pam50_+_claudin-low_subtype' in df_processed.columns
        and 'overall_survival' in df_processed.columns):
    # Group by molecular subtype and calculate survival rate
    survival_by_subtype = df_processed.groupby('pam50_+_claudin-low_subtype')['overall_survival'].mean().reset_index()
    survival_by_subtype = survival_by_subtype.rename(columns={'overall_survival': 'survival_rate'})
    survival_by_subtype['survival_rate'] = survival_by_subtype['survival_rate'] * 100
    
    # Plot survival rates by subtype
    plt.figure(figsize=(12, 8))
    bar_plot = sns.barplot(x='pam50_+_claudin-low_subtype', y='survival_rate', data=survival_by_subtype)
    
    # Add percentage labels on top of bars
    for p in bar_plot.patches:
        bar_plot.annotate(f"{p.get_height():.1f}%",
                          (p.get_x() + p.get_width() / 2, p.get_height()),
                          ha='center', va='bottom', fontsize=11)
    
    plt.title('Survival Rate by Molecular Subtype', fontsize=16)
    plt.xlabel('Molecular Subtype', fontsize=14)
    plt.ylabel('Survival Rate (%)', fontsize=14)
    plt.xticks(rotation=45)
    plt.tight_layout()
    
    # Save the figure with error handling
    try:
        plt.savefig('../static/images/survival_by_subtype.png')
        print("Saved survival_by_subtype.png")
    except Exception as e:
        print(f"Error saving figure: {e}")
    
    # Display the plot in the notebook
    plt.show()
    
    # Save the data to a CSV file
    try:
        survival_by_subtype.to_csv('../data/processed/survival_by_subtype.csv', index=False)
        print("Saved survival_by_subtype.csv")
    except Exception as e:
        print(f"Error saving CSV: {e}")
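A mean per subtype says nothing about how many patients stand behind each bar, so a rate from a small group can look as solid as one from a large group. The sketch below reports group size alongside the rate. It assumes the outcome column is coded 1 = survived and 0 = did not survive. The function name `survival_summary` and the toy data are illustrative only and are not METABRIC values.

```python
import pandas as pd


def survival_summary(df, group_col, event_col):
    """Per-group patient count, survivor count, and survival rate in percent.

    Assumes event_col is coded 1 = survived, 0 = did not survive.
    """
    summary = (
        df.groupby(group_col)[event_col]
          .agg(n_patients='count', n_survived='sum', survival_rate='mean')
          .reset_index()
    )
    summary['survival_rate'] = summary['survival_rate'] * 100
    return summary


# Made-up data for illustration only
toy = pd.DataFrame({
    'subtype': ['LumA', 'LumA', 'LumA', 'LumA', 'Basal', 'Basal'],
    'overall_survival': [1, 1, 1, 0, 1, 0],
})
toy_summary = survival_summary(toy, 'subtype', 'overall_survival')
print(toy_summary)
```

Showing `n_patients` next to each rate makes it easier to judge which differences between subtypes deserve a closer look.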


### 2.2 Distribution of Clinical Features by Age Group

In [13]:
# Find the age column, if the dataset loaded and has one
age_col = None
if df_processed is not None:
    age_col = next(
        (c for c in ('age_at_diagnosis', 'age') if c in df_processed.columns),
        None,
    )

if age_col is not None:
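    # Create age groups. right=False makes each bin include its lower edge,
    # so age 40 falls in '40-49' rather than '<40'; np.inf keeps ages of 100+
    bins = [0, 40, 50, 60, 70, 80, np.inf]
    labels = ['<40', '40-49', '50-59', '60-69', '70-79', '80+']
    df_processed['age_group'] = pd.cut(df_processed[age_col], bins=bins, labels=labels, right=False)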
    
    # Clinical features to analyze
    clinical_features = ['tumor_size', 'lymph_nodes_examined_positive', 'mutation_count']
    clinical_features = [f for f in clinical_features if f in df_processed.columns]
    
    if clinical_features:
        # Plot boxplots for each clinical feature by age group
        for feature in clinical_features:
            plt.figure(figsize=(14, 8))
            sns.boxplot(x='age_group', y=feature, data=df_processed)
            plt.title(f'Distribution of {feature} by Age Group', fontsize=16)
            plt.xlabel('Age Group', fontsize=14)
            plt.ylabel(feature.replace('_', ' ').title(), fontsize=14)
            plt.tight_layout()
            
            # Save the figure with error handling
            try:
                plt.savefig(f'../static/images/{feature}_by_age_group.png')
                print(f"Saved {feature}_by_age_group.png")
            except Exception as e:
                print(f"Error saving figure: {e}")
            
            # Display the plot in the notebook
            plt.show()
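How `pd.cut` treats bin edges decides which age group a patient on a boundary lands in. The sketch below uses made-up ages (not METABRIC values) to compare the default right-closed bins with left-closed bins. The label names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on or near bin edges
ages = pd.Series([39, 40, 49.5, 50, 80, 101])

# Default (right=True): bins are (0, 40], (40, 50], ..., (80, 100]
closed_right = pd.cut(
    ages,
    bins=[0, 40, 50, 60, 70, 80, 100],
    labels=['<40', '40-50', '50-60', '60-70', '70-80', '>80'],
)

# right=False: bins are [0, 40), [40, 50), ..., [80, inf)
closed_left = pd.cut(
    ages,
    bins=[0, 40, 50, 60, 70, 80, np.inf],
    labels=['<40', '40-49', '50-59', '60-69', '70-79', '80+'],
    right=False,
)

comparison = pd.DataFrame({
    'age': ages,
    'right=True': closed_right,
    'right=False': closed_left,
})
print(comparison)
```

With right-closed bins, an age of exactly 40 is labelled `<40` and an age above 100 becomes `NaN`. With left-closed bins and an open upper edge, 40 falls in `40-49` and 101 falls in `80+`.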


### 2.3 Distribution of Tumour Characteristics