# Exploratory Data Analysis (EDA) for Histopathology Cancer Detection
This notebook explores the histopathology cancer detection dataset. We'll examine the label distribution, look at sample images, check image intensity and size, and produce preliminary visualizations. We'll also use Sweetviz for a quick automated overview of the dataset's characteristics.

## Import Libraries

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sweetviz as sv
from scripts.config import LABELS_FILE, TRAIN_DIR
from scripts.data_utils import display_sample_images

In [None]:
# Set global configurations
plt.style.use('ggplot')
sns.set_theme(style="darkgrid")

## Load Data

In [None]:
# Load labels from the training labels CSV
labels_df = pd.read_csv(LABELS_FILE)
print("Data Loaded Successfully")
labels_df.head()

## Basic Information and Summary Statistics

In [None]:
# Display basic info
print("Dataset Info:")
labels_df.info()

print("\nSummary Statistics:")
labels_df.describe()

## Class Distribution

In [None]:
# Visualize class distribution
plt.figure(figsize=(6, 4))
sns.countplot(x='label', data=labels_df, palette='pastel')
plt.title('Class Distribution of Cancer Labels')
plt.xlabel('Label')
plt.ylabel('Count')
plt.show()
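
The count plot shows the class balance visually; exact counts and proportions are often useful alongside it. The following is a minimal sketch, assuming the `label` column used above. `class_balance` is a hypothetical helper (not part of `scripts.data_utils`), and the toy dataframe is for illustration only.

```python
import pandas as pd


def class_balance(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Return per-class counts and fractions, sorted by label value."""
    counts = df[label_col].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "fraction": counts / counts.sum()})


# Toy data for illustration only; in the notebook, pass labels_df instead.
toy_labels_df = pd.DataFrame({"label": [0, 0, 0, 1, 1]})
balance = class_balance(toy_labels_df)
print(balance)
```

In the notebook this would be called as `class_balance(labels_df)`.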

## Sample Images

In [None]:
# Display a few sample images from each class
display_sample_images(labels_df, TRAIN_DIR, label=1, sample_size=5)  # Cancerous images
display_sample_images(labels_df, TRAIN_DIR, label=0, sample_size=5)  # Non-cancerous images

## SweetViz Report

In [None]:
# Generate an automated EDA report using Sweetviz
eda_report = sv.analyze(labels_df)
eda_report.show_html(filepath='eda_report.html', open_browser=False)
print("Sweetviz Report Generated: eda_report.html")

## Mean Intensity of Images
This section calculates the mean pixel intensity of each image and stores it in a `mean_intensity` column, so that the distribution of brightness levels can be examined.

In [None]:
from scripts.data_utils import calculate_mean_intensity

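# Compute a mean intensity value for each row and store it in the dataframe.
# NOTE: the helper receives the full dataframe and the row's label, not the
# row's image id, so verify that it returns the intensity of that row's image
# rather than of an image sampled from the same class.
labels_df['mean_intensity'] = labels_df.apply(
    lambda row: calculate_mean_intensity(
        df=labels_df,
        img_dir=TRAIN_DIR,
        label=row['label'],
        sample_size=1,  # one image per call
    ),
    axis=1
)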
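
The call above passes the whole dataframe and a label to `calculate_mean_intensity`, whose internals are not shown here. If a strictly per-image value is wanted, one option is to read each file by its `id` and average the pixel values directly. The sketch below assumes the `<id>.tif` naming used elsewhere in this notebook. `mean_intensity` and `image_mean_intensity` are hypothetical helpers, not the project's own implementation, and the demo array is synthetic.

```python
import os

import numpy as np


def mean_intensity(pixels) -> float:
    """Mean pixel value over all pixels and channels of one image array."""
    return float(np.asarray(pixels, dtype=np.float64).mean())


def image_mean_intensity(img_dir: str, img_id: str, ext: str = ".tif") -> float:
    """Load one image by id and return its mean intensity."""
    from PIL import Image  # imported lazily; only needed when reading files

    with Image.open(os.path.join(img_dir, img_id + ext)) as img:
        return mean_intensity(img)


# Synthetic 2x2 RGB "image" for illustration only.
demo_pixels = np.array(
    [[[0, 0, 0], [255, 255, 255]],
     [[100, 100, 100], [155, 155, 155]]],
    dtype=np.uint8,
)
demo_mean = mean_intensity(demo_pixels)
print(demo_mean)

# Possible use in the notebook (assumes labels_df and TRAIN_DIR from above):
# labels_df['mean_intensity'] = labels_df['id'].map(
#     lambda img_id: image_mean_intensity(TRAIN_DIR, img_id)
# )
# sns.histplot(data=labels_df, x='mean_intensity', hue='label', bins=50)
# plt.show()
```

Reading every image can be slow on a large training set, so it may be worth trying this on a sample of rows first.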

## Image Size Distribution
Check whether image sizes vary. They are expected to be uniform, but this is worth verifying on a random sample.

In [None]:
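# Verify image size for a random sample of images
from PIL import Image

def get_image_size(img_id):
    # Use a context manager so each file handle is closed after reading
    with Image.open(os.path.join(TRAIN_DIR, img_id + '.tif')) as img:
        return img.size

sample_ids = labels_df['id'].sample(50, random_state=42)  # fixed seed for reproducibility
image_sizes = [get_image_size(img_id) for img_id in sample_ids]
image_sizes_df = pd.DataFrame(image_sizes, columns=['Width', 'Height'])
print("Sample Image Sizes:")
print(image_sizes_df.value_counts())  # counts per (Width, Height) pair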
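
To turn the visual check into an explicit pass/fail result, the sampled sizes can be tallied and tested for uniformity. The sketch below uses only the standard library. `summarize_sizes` is a hypothetical helper, and the sizes shown are made-up values, not measurements from this dataset.

```python
from collections import Counter


def summarize_sizes(sizes):
    """Count distinct (width, height) pairs and report whether all match."""
    counts = Counter(tuple(size) for size in sizes)
    is_uniform = len(counts) <= 1
    return counts, is_uniform


# Made-up sizes for illustration only; in the notebook, pass image_sizes.
demo_counts, demo_uniform = summarize_sizes([(64, 64), (64, 64), (64, 48)])
print(demo_counts.most_common(), demo_uniform)
```

In the notebook, `summarize_sizes(image_sizes)` would return the tally for the sampled images along with a flag that is `True` only when every sampled image has the same size.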

## Correlation Analysis
Analyze the relationship between the numerical variables. Here that means the correlation between `label` and `mean_intensity`, provided the latter has been computed.

In [None]:
# Compute correlation matrix and visualize
if 'mean_intensity' in labels_df.columns:
    plt.figure(figsize=(6, 4))
    sns.heatmap(labels_df[['label', 'mean_intensity']].corr(), annot=True, cmap='coolwarm')
    plt.title('Correlation Matrix')
    plt.show()
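
Because `label` is binary, its Pearson correlation with a continuous column reflects how far apart the two class means are relative to the overall spread. A per-class summary next to the coefficient can make the heatmap easier to interpret. The sketch below assumes the `label` and `mean_intensity` column names from above. `intensity_by_class` is a hypothetical helper, and the numbers are synthetic.

```python
import pandas as pd


def intensity_by_class(df, label_col="label", value_col="mean_intensity"):
    """Return the Pearson correlation and per-class mean/std/count."""
    r = df[label_col].corr(df[value_col])
    summary = df.groupby(label_col)[value_col].agg(["mean", "std", "count"])
    return r, summary


# Synthetic values for illustration only; in the notebook, pass labels_df.
toy_intensity_df = pd.DataFrame(
    {"label": [0, 0, 1, 1], "mean_intensity": [100.0, 110.0, 150.0, 160.0]}
)
r, summary = intensity_by_class(toy_intensity_df)
print(f"Pearson r = {r:.3f}")
print(summary)
```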

## Conclusion
This concludes the exploratory data analysis. The Sweetviz report (`eda_report.html`) gives an automated overview of the labels dataframe, and the sections above cover class balance, sample images, image intensity, and image sizes. Further feature engineering and preprocessing can build on these insights.