### Exploratory Data Analysis (EDA) for Histopathology Cancer Detection
This notebook explores the histopathology cancer detection dataset. We inspect the label data, check the class distribution, and view sample images. We also use Sweetviz to generate a quick automated overview of the dataset's characteristics.

#### Import Libraries

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sweetviz as sv

from scripts.config import LABELS_FILE, TRAIN_DIR, TARGET_SIZE, BATCH_SIZE
from scripts.data_utils import HistopathologyDataModule, display_sample_images

#### Load Data

In [None]:
# Load labels from the training labels CSV
labels_df = pd.read_csv(LABELS_FILE)
print("Data Loaded Successfully")
print(labels_df.head())

# Initialize the data module
data_module = HistopathologyDataModule(batch_size=BATCH_SIZE)
data_module.setup(stage="fit")

# Access the train dataset
train_dataset = data_module.train_dataset

#### Basic Information and Summary Statistics

In [None]:
# Display basic info
print("Dataset Info:")
labels_df.info()

print("\nSummary Statistics:")
labels_df.describe()
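Beyond `info()` and `describe()`, it can help to check explicitly for missing values and duplicated identifiers. The sketch below runs on a small stand-in frame so it is self-contained. The `id` column name is an assumption about the labels CSV (only `label` is referenced in this notebook), and `summarize_quality` is a hypothetical helper, not part of `scripts.data_utils`.

```python
import pandas as pd


def summarize_quality(df, id_col="id"):
    """Return missing-value counts per column and the number of duplicated ids."""
    missing = df.isna().sum().to_dict()
    duplicate_ids = int(df[id_col].duplicated().sum())
    return {"missing": missing, "duplicate_ids": duplicate_ids}


# Stand-in data for illustration; in the notebook, pass labels_df instead.
demo_df = pd.DataFrame({"id": ["a", "b", "b", "c"], "label": [0, 1, 1, None]})
quality = summarize_quality(demo_df)
print(quality)
```

If the real labels file uses a different identifier column, pass its name through `id_col`.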

#### Class Distribution

In [None]:
# Visualize class distribution
plt.figure(figsize=(6, 4))
sns.countplot(x='label', data=labels_df, palette='pastel')
plt.title('Class Distribution of Cancer Labels')
plt.xlabel('Label')
plt.ylabel('Count')
plt.show()
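The count plot shows the class balance visually; the exact counts and proportions can be read off with `value_counts`. This is a minimal sketch on a stand-in frame, and the values are illustrative rather than the dataset's real counts.

```python
import pandas as pd

# Stand-in labels for illustration; in the notebook, use labels_df instead.
demo_df = pd.DataFrame({"label": [0, 0, 0, 1, 1]})

# Absolute counts and relative frequencies per class, ordered by label
counts = demo_df["label"].value_counts().sort_index()
proportions = demo_df["label"].value_counts(normalize=True).sort_index()

class_summary = pd.DataFrame({"count": counts, "proportion": proportions.round(3)})
print(class_summary)
```

Reporting the proportions alongside the plot makes any statement about class imbalance easy to verify.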

#### Sample Images

In [None]:
# Display samples for label 0 (non-cancerous) before and after preprocessing
print("Non-cancerous samples (Label 0):")
display_sample_images(train_dataset, label=0, sample_size=5)

In [None]:
# Display samples for label 1 (cancerous) before and after preprocessing
print("Cancerous samples (Label 1):")
display_sample_images(train_dataset, label=1, sample_size=5)
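To check whether the raw images share a single size, one option is to tally the dimensions of the files on disk with Pillow. The sketch below is an assumption about how this could be done, not the method used by `display_sample_images`. `count_image_sizes` is a hypothetical helper, the file extensions are guesses, and the temporary directory with generated images stands in for `TRAIN_DIR`.

```python
import os
import tempfile
from collections import Counter

from PIL import Image


def count_image_sizes(image_dir, extensions=(".tif", ".png", ".jpg"), limit=None):
    """Count (width, height) pairs across image files in a directory."""
    sizes = Counter()
    names = sorted(n for n in os.listdir(image_dir) if n.lower().endswith(extensions))
    for name in names[:limit]:
        # Image.open defers reading pixel data, so inspecting .size stays cheap
        with Image.open(os.path.join(image_dir, name)) as img:
            sizes[img.size] += 1
    return sizes


# Stand-in images for illustration; in the notebook, pass TRAIN_DIR instead.
with tempfile.TemporaryDirectory() as demo_dir:
    for i in range(3):
        Image.new("RGB", (96, 96)).save(os.path.join(demo_dir, f"img_{i}.png"))
    size_counts = count_image_sizes(demo_dir)

print(size_counts)
```

A result with a single key indicates uniform image dimensions; `limit` can cap the number of files inspected on a large directory.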

#### Sweetviz Report

In [None]:
# Generate an automated EDA report using Sweetviz
eda_report = sv.analyze(labels_df)
eda_report.show_html(filepath='eda_report.html', open_browser=False)
print("Sweetviz Report Generated: eda_report.html")

#### Conclusion
Key takeaways from this exploratory data analysis:
- The dataset's class distribution is imbalanced, favoring non-cancerous images.
- The Sweetviz report (`eda_report.html`) summarizes the label data for further inspection.
- Images are uniform in size but will be resized to the target dimensions during preprocessing.
