# Data Exploration - Scientific Image Forgery Detection

This notebook provides initial data exploration for the Recod.ai/LUC Kaggle competition.

**Competition:** [Scientific Image Forgery Detection](https://www.kaggle.com/competitions/recodai-luc-scientific-image-forgery-detection)

## Goal

Detect forgeries in scientific images using computer vision and deep learning techniques.

In [None]:
import os
import sys
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

## 1. Load Data

First, confirm that the raw data directory exists and list its top-level contents.

In [None]:
# Define data paths
DATA_DIR = Path('../data/raw')

# Check if data exists
if DATA_DIR.exists():
    print(f"Data directory found: {DATA_DIR}")
    print(f"Contents: {list(DATA_DIR.iterdir())}")
else:
    print("Data directory not found. Please download the data from Kaggle.")
    print(f"Expected location: {DATA_DIR.absolute()}")
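
The listing above shows only the top level of `DATA_DIR`. As a rough sketch for seeing how files are spread across subfolders, the hypothetical helper below (`summarize_directory` is not part of the competition code) counts files by top-level entry and file extension. It makes no assumption about the actual folder layout of the Kaggle download.

```python
from collections import Counter
from pathlib import Path


def summarize_directory(root):
    """Count files under `root`, keyed by (top-level entry, lowercased suffix).

    Files that sit directly in `root` are grouped under '.'.
    """
    root = Path(root)
    counts = Counter()
    for path in root.rglob('*'):
        if path.is_file():
            rel = path.relative_to(root)
            top = rel.parts[0] if len(rel.parts) > 1 else '.'
            counts[(top, path.suffix.lower())] += 1
    return counts


# Example usage:
# for (top, suffix), n in sorted(summarize_directory(DATA_DIR).items()):
#     print(f"{top:30s} {suffix or '(none)':8s} {n}")
```

Grouping by extension makes it easier to spot which file formats the dataset actually uses before choosing glob patterns in later cells.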

## 2. Dataset Statistics

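Load the competition metadata, if any is provided, and review basic statistics such as row counts, column types, and missing values.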

In [None]:
# Load metadata if available
# This will depend on the actual data structure from Kaggle

# Example:
# train_df = pd.read_csv(DATA_DIR / 'train.csv')
# print(train_df.head())
# print(train_df.info())
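
If the download turns out not to include a metadata CSV, one possible fallback is to build an index straight from the files on disk. The sketch below is only an illustration: `build_image_index`, the `IMAGE_SUFFIXES` set, and the column names are assumptions, not something defined by the competition.

```python
from pathlib import Path

import pandas as pd

# Assumed set of image extensions; adjust after inspecting the real data
IMAGE_SUFFIXES = {'.jpg', '.jpeg', '.png', '.tif', '.tiff', '.bmp'}


def build_image_index(root, suffixes=IMAGE_SUFFIXES):
    """Return one row per image file found anywhere under `root`."""
    root = Path(root)
    rows = [
        {
            'path': str(p),
            'folder': p.parent.name,      # immediate parent directory name
            'suffix': p.suffix.lower(),
            'size_bytes': p.stat().st_size,
        }
        for p in sorted(root.rglob('*'))
        if p.is_file() and p.suffix.lower() in suffixes
    ]
    # Passing columns keeps the schema stable even when no files match
    return pd.DataFrame(rows, columns=['path', 'folder', 'suffix', 'size_bytes'])


# Example usage:
# index_df = build_image_index(DATA_DIR)
# print(index_df.groupby(['folder', 'suffix']).size())
```

Whether the parent folder name carries any meaning (for example a class label) depends on the real layout and should be checked against the data description.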

## 3. Visualize Sample Images

Display some sample images from the dataset.

In [None]:
# Helper for plotting a grid of images
# Call it once the actual data is available (see the example usage below)

def plot_images(image_paths, labels=None, n_cols=4):
    """
    Plot a grid of images.
    
    Args:
        image_paths: List of image paths
        labels: Optional list of labels
        n_cols: Number of columns in the grid
    """
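    n_images = len(image_paths)
    if n_images == 0:
        print("No images to plot.")
        return
    n_rows = (n_images + n_cols - 1) // n_cols
    
    # squeeze=False always returns a 2D array of axes, so flatten() works for any grid shape
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 4, n_rows * 4), squeeze=False)
    axes = axes.flatten()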
    
    for idx, (img_path, ax) in enumerate(zip(image_paths, axes)):
        with Image.open(img_path) as img:
            # cmap only applies to single-channel images; RGB(A) data ignores it
            ax.imshow(img, cmap='gray')
        ax.axis('off')
        if labels is not None:
            ax.set_title(f'Label: {labels[idx]}')
    
    # Hide empty subplots
    for idx in range(n_images, len(axes)):
        axes[idx].axis('off')
    
    plt.tight_layout()
    plt.show()

# Example usage:
# image_paths = list(DATA_DIR.glob('*.jpg'))[:8]
# plot_images(image_paths)

## 4. Image Properties Analysis

Analyze image properties such as width, height, and aspect ratio.

In [None]:
# Helper for analyzing image dimensions
# Call it on an image directory once the actual data is available

def analyze_image_properties(image_dir):
    """
    Analyze properties of images in a directory.
    
    Args:
        image_dir: Path to directory containing images
    """
    widths = []
    heights = []
    aspects = []
    
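    for img_path in Path(image_dir).glob('*.jpg'):
        # Context manager closes the file handle after reading the size
        with Image.open(img_path) as img:
            w, h = img.size
        widths.append(w)
        heights.append(h)
        aspects.append(w / h)
    
    # np.min/np.max below raise ValueError on empty input, so stop early
    if not widths:
        print(f"No .jpg images found in {image_dir}")
        return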
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    axes[0].hist(widths, bins=30)
    axes[0].set_title('Image Widths')
    axes[0].set_xlabel('Width (pixels)')
    
    axes[1].hist(heights, bins=30)
    axes[1].set_title('Image Heights')
    axes[1].set_xlabel('Height (pixels)')
    
    axes[2].hist(aspects, bins=30)
    axes[2].set_title('Aspect Ratios')
    axes[2].set_xlabel('Width / Height')
    
    plt.tight_layout()
    plt.show()
    
    print(f"Average dimensions: {np.mean(widths):.1f} x {np.mean(heights):.1f}")
    print(f"Min dimensions: {np.min(widths)} x {np.min(heights)}")
    print(f"Max dimensions: {np.max(widths)} x {np.max(heights)}")
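
The function above plots and prints but does not return anything. As a complementary sketch, the hypothetical helper below (`collect_image_properties`, its default glob patterns, and its column names are all assumptions) gathers the same measurements into a DataFrame, together with the PIL color mode, so they can be filtered or summarized later.

```python
from pathlib import Path

import pandas as pd
from PIL import Image


def collect_image_properties(image_dir, patterns=('*.jpg', '*.jpeg', '*.png')):
    """Return a DataFrame with one row of basic properties per image."""
    records = []
    for pattern in patterns:
        for img_path in sorted(Path(image_dir).glob(pattern)):
            # Reading size and mode only needs the header, not the pixel data
            with Image.open(img_path) as img:
                w, h = img.size
                records.append({
                    'file': img_path.name,
                    'width': w,
                    'height': h,
                    'aspect': w / h,
                    'mode': img.mode,  # e.g. 'RGB', 'L', 'RGBA'
                })
    return pd.DataFrame(records, columns=['file', 'width', 'height', 'aspect', 'mode'])


# Example usage:
# props = collect_image_properties(DATA_DIR)
# print(props.describe())
# print(props['mode'].value_counts())
```

Keeping the color mode alongside the dimensions helps reveal whether the dataset mixes grayscale and color images, which affects preprocessing choices.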

## 5. Class Distribution

Analyze the distribution of forged vs. authentic images.

In [None]:
# Analyze class distribution
# This will be implemented with actual data

# Example:
# class_counts = train_df['label'].value_counts()
# plt.bar(class_counts.index, class_counts.values)
# plt.xlabel('Class')
# plt.ylabel('Count')
# plt.title('Class Distribution')
# plt.show()
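
Raw counts alone can hide how skewed the classes are. A minimal sketch, assuming labels are available as a list or Series (the `summarize_classes` helper and the label values in the usage example are hypothetical), reports each class's share alongside its count.

```python
import pandas as pd


def summarize_classes(labels):
    """Return per-class counts and fractions, most frequent class first."""
    counts = pd.Series(labels).value_counts()
    return pd.DataFrame({
        'count': counts,
        'fraction': counts / counts.sum(),
    })


# Example usage (the 'label' column name is an assumption):
# summary = summarize_classes(train_df['label'])
# print(summary)
# print(f"Imbalance ratio: {summary['count'].max() / summary['count'].min():.2f}")
```

A large imbalance ratio would suggest using stratified splits or class weighting when building models later.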

## Next Steps

1. Download the competition data from Kaggle
2. Place it in the `data/raw/` directory (`../data/raw` relative to this notebook)
3. Run this notebook to explore the data
4. Start building and training models

## Useful Resources

- [Competition Page](https://www.kaggle.com/competitions/recodai-luc-scientific-image-forgery-detection)
- [Data Description](https://www.kaggle.com/competitions/recodai-luc-scientific-image-forgery-detection/data)