# Exploratory Data Analysis (EDA) for Histopathology Cancer Detection
This notebook explores the histopathology cancer detection dataset. It checks the label table for missing values, examines the class distribution, previews sample images from each class, and compares pixel intensity between cancerous and non-cancerous images. It also generates a Pandas Profiling report for a quick overview of the label table.

In [None]:
# Import necessary libraries and modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
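# pandas-profiling was renamed to ydata-profiling; fall back to the old name if needed
try:
    from ydata_profiling import ProfileReport
except ImportError:
    from pandas_profiling import ProfileReport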
from configs import TRAIN_DIR, LABELS_FILE
from data_utils import load_labels, display_sample_images, calculate_mean_intensity

## 1. Initial Data Overview

In [None]:
# Load and preview labels data
labels_df = load_labels()
labels_df.head()

In [None]:
# Basic statistics and null check
print("Dataset Overview:")
print(labels_df.describe())
print("\nMissing values:", labels_df.isnull().sum().sum())

## 2. Distribution of Classes

In [None]:
# Visualize class distribution
sns.countplot(data=labels_df, x="label")
plt.title("Class Distribution: Cancerous vs Non-Cancerous Samples")
plt.xlabel("Label (0 = Non-Cancerous, 1 = Cancerous)")
plt.ylabel("Count")
plt.show()
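
The count plot shows the imbalance visually; the same information can be read off as numbers. The following is a minimal sketch, assuming only that the label column is named `label` as in the plot above. `toy_labels` is a hypothetical stand-in so the cell runs on its own; in this notebook it would be `labels_df`.

```python
import pandas as pd

# Hypothetical stand-in for labels_df (the real frame comes from load_labels()).
toy_labels = pd.DataFrame({"label": [0, 0, 0, 1, 1]})

# Absolute and relative class frequencies, ordered by label value.
counts = toy_labels["label"].value_counts().sort_index()
proportions = toy_labels["label"].value_counts(normalize=True).sort_index()

# Ratio of the largest class to the smallest class (1.0 means balanced).
imbalance_ratio = counts.max() / counts.min()

summary = pd.DataFrame({"count": counts, "proportion": proportions})
print(summary)
print(f"Majority-to-minority ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 indicates a roughly balanced dataset; larger values may call for stratified splits or class weighting later on.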

## 3. Image Analysis

### 3.1 Display Sample Images
Display a few random images from each class.

In [None]:
# Sample cancerous and non-cancerous images
display_sample_images(labels_df, TRAIN_DIR, label=1, sample_size=5)
display_sample_images(labels_df, TRAIN_DIR, label=0, sample_size=5)

### 3.2 Image Intensity Analysis

In [None]:
# Calculate and plot pixel intensity distributions for both classes
plt.figure(figsize=(12, 6))
cancerous_intensity = calculate_mean_intensity(labels_df, TRAIN_DIR, label=1)
non_cancerous_intensity = calculate_mean_intensity(labels_df, TRAIN_DIR, label=0)

sns.histplot(cancerous_intensity, color='red', kde=True, label='Cancerous')
sns.histplot(non_cancerous_intensity, color='blue', kde=True, label='Non-Cancerous')
plt.xlabel("Mean Pixel Intensity")
plt.ylabel("Frequency")
plt.title("Mean Pixel Intensity Distribution for Cancerous vs Non-Cancerous Images")
plt.legend()
plt.show()
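
The helper `calculate_mean_intensity` lives in `data_utils` and its implementation is not shown in this notebook. As a rough illustration of the kind of quantity being plotted, the sketch below computes one mean value per image with NumPy. The function name `mean_intensity` and the toy 8x8 RGB arrays are assumptions made for this example, not the project's actual code or data.

```python
import numpy as np


def mean_intensity(image: np.ndarray) -> float:
    """Return the mean over all pixels and channels of a single image.

    Hypothetical helper for illustration; the real calculate_mean_intensity
    in data_utils may load, convert, or aggregate images differently.
    """
    # Cast to float64 so the result is a plain float for any input dtype.
    return float(np.asarray(image, dtype=np.float64).mean())


# Toy stand-ins for image patches: uniform dark and bright 8x8 RGB arrays.
dark_patch = np.full((8, 8, 3), 50, dtype=np.uint8)
bright_patch = np.full((8, 8, 3), 200, dtype=np.uint8)

# One mean intensity value per image, which is what a histogram would bin.
intensities = [mean_intensity(img) for img in (dark_patch, bright_patch)]
print(intensities)
```

Collapsing each image to a single number discards spatial and color information, so a histogram of these values only shows overall brightness differences between classes.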

## 4. Generate Pandas Profiling Report

In [None]:
# Generate and save the profiling report
profile = ProfileReport(labels_df, title="Pandas Profiling Report for Histopathology Labels", explorative=True)
profile.to_file("EDA_report.html")
print("Pandas Profiling report saved as 'EDA_report.html'.")

## Summary and Insights
- The class distribution is slightly imbalanced, with more non-cancerous (label 0) than cancerous (label 1) samples.
- The intensity distributions differ between the two classes, which suggests that intensity-based features may be useful for feature engineering.
- An EDA report has been saved as `EDA_report.html` for further exploration.