### Exploratory Data Analysis (EDA) for Histopathology Cancer Detection
This notebook explores the histopathology cancer detection dataset. We'll examine the label distribution, look at sample images from each class, and check image dimensions. Additionally, we'll use Sweetviz to generate an automated overview of the dataset's characteristics.

#### Import Libraries

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sweetviz as sv
from PIL import Image  # needed for Image.open in the image size and resizing cells

from scripts.config import LABELS_FILE, TRAIN_DIR, TARGET_SIZE
from scripts.data_utils import display_sample_images


#### Load Data

In [None]:
# Load labels from the training labels CSV
labels_df = pd.read_csv(LABELS_FILE)
print("Data Loaded Successfully")
labels_df.head()

#### Basic Information and Summary Statistics

In [None]:
# Display basic info
print("Dataset Info:")
labels_df.info()

print("\nSummary Statistics:")
labels_df.describe()

#### Class Distribution

In [None]:
# Visualize class distribution
plt.figure(figsize=(6, 4))
sns.countplot(x='label', data=labels_df, palette='pastel')
plt.title('Class Distribution of Cancer Labels')
plt.xlabel('Label')
plt.ylabel('Count')
plt.show()
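
The count plot shows the class balance visually; the same information can also be checked numerically. The sketch below is one possible way to do that using only the standard library. The helper name `class_proportions` and the toy label list are illustrative assumptions, not part of the project code.

```python
from collections import Counter


def class_proportions(labels):
    """Return {label: fraction of samples} for any iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels is empty")
    return {label: count / total for label, count in counts.items()}


# Toy labels for illustration only (not taken from the dataset)
toy_labels = [0, 0, 0, 1, 1]
proportions = class_proportions(toy_labels)
print(proportions)
```

In this notebook the helper could be called as `class_proportions(labels_df['label'])`. pandas offers an equivalent built-in: `labels_df['label'].value_counts(normalize=True)`.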

#### Sample Images

In [None]:
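# Display samples for label 0 (non-cancerous) before and after preprocessing
# NOTE: `train_dataset` is not defined in this notebook; it must be created before this cell is run
print("Non-cancerous samples (Label 0):")
display_sample_images(train_dataset, label=0, sample_size=5)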

In [None]:
# Display samples for label 1 (cancerous) before and after preprocessing
print("Cancerous samples (Label 1):")
display_sample_images(train_dataset, label=1, sample_size=5)

#### SweetViz Report

In [None]:
# Generate an automated EDA report using Sweetviz
eda_report = sv.analyze(labels_df)
eda_report.show_html(filepath='eda_report.html', open_browser=False)
print("Sweetviz Report Generated: eda_report.html")

#### Image Size Distribution
Check whether image sizes vary across the dataset. They are expected to be uniform, but this is worth verifying on a random sample.

In [None]:
# Verify image sizes for a random sample of 50 images (fixed seed for reproducibility)
image_sizes = []
for img_id in labels_df['id'].sample(50, random_state=42):
    # Use a context manager so each file handle is closed after its size is read
    with Image.open(os.path.join(TRAIN_DIR, img_id + '.tif')) as img:
        image_sizes.append(img.size)
image_sizes_df = pd.DataFrame(image_sizes, columns=['Width', 'Height'])

# Display the unique (Width, Height) combinations and how often each occurs
print("Sample Image Sizes:")
print(image_sizes_df.value_counts())
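
To turn the visual inspection of the printed sizes into an explicit check, the list of sizes can be reduced to its distinct values. The following is a minimal sketch; `distinct_sizes` and `sizes_are_uniform` are hypothetical helpers, and the sizes below are made-up values used only so the block runs on its own.

```python
def distinct_sizes(sizes):
    """Return the sorted list of distinct (width, height) pairs."""
    return sorted(set(tuple(size) for size in sizes))


def sizes_are_uniform(sizes):
    """True if a non-empty list contains exactly one distinct (width, height) pair."""
    return len(distinct_sizes(sizes)) == 1


# Made-up sizes for illustration only
uniform_sample = [(64, 64), (64, 64), (64, 64)]
mixed_sample = [(64, 64), (64, 32)]

print(sizes_are_uniform(uniform_sample))
print(sizes_are_uniform(mixed_sample), distinct_sizes(mixed_sample))
```

In the notebook, `sizes_are_uniform(image_sizes)` would apply the same check to the sampled images.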

#### Image Resizing Preview
Verify resizing functionality to the target size defined in `config.py`.

In [None]:
# Resize a sample image to the target size
sample_img_path = os.path.join(TRAIN_DIR, labels_df['id'].iloc[0] + '.tif')
sample_img = Image.open(sample_img_path)
resized_img = sample_img.resize(TARGET_SIZE)

plt.figure(figsize=(8, 4))
plt.subplot(1, 2, 1)
plt.imshow(sample_img)
plt.title("Original Image")
plt.axis("off")

plt.subplot(1, 2, 2)
plt.imshow(resized_img)
plt.title(f"Resized Image {TARGET_SIZE}")
plt.axis("off")
plt.show()

#### Conclusion
Key takeaways from the exploratory data analysis:
- The dataset's class distribution is imbalanced, favoring non-cancerous images.
- The Sweetviz report (`eda_report.html`) provides a broader overview of the labels file for further inspection.
- The sampled images are uniform in size; all images will be resized to the target dimensions during preprocessing.
