
# JuaShade – Data Exploration & Setup
CISC 580 – Computer Vision Graduate Project

**Author:** Chastity Lewis

This notebook covers:
- Environment and library setup
- Reproducibility (random seeds)
- Dataset loading and basic inspection
- Sample visualizations


In [None]:
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
from pathlib import Path

# Computer vision libraries (cv2 is required by the visualization cell below)
import cv2
from torchvision import transforms, datasets

print('Python environment ready.')

Python environment ready.


## Reproducibility: Set Random Seeds

In [None]:
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

print('Random seeds set to', SEED)

Random seeds set to 42
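
As a quick sanity check on the seeding above, the following minimal sketch reseeds Python's `random` and NumPy and confirms that the same seed reproduces the same draws. The helper name `draw_samples` is hypothetical and not part of the project; PyTorch is left out only to keep the example light.

```python
import random

import numpy as np


def draw_samples(seed: int, n: int = 5):
    """Reseed both generators, then draw n values from each (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    py_draws = [random.random() for _ in range(n)]
    np_draws = np.random.rand(n)
    return py_draws, np_draws


# Same seed twice, then a different seed for contrast
first_py, first_np = draw_samples(42)
second_py, second_np = draw_samples(42)
other_py, other_np = draw_samples(7)

print('Python draws match:', first_py == second_py)
print('NumPy draws match:', np.array_equal(first_np, second_np))
print('Different seed differs:', first_py != other_py)
```

Note that a seed only fixes the sequence of draws: cells that consume random numbers must also run in the same order for results to repeat.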


## Paths and Dataset Location

In [None]:
# Update these paths based on your project structure
BASE_DIR = Path('.')  # project root
DATA_RAW_DIR = BASE_DIR / 'data' / 'raw'
DATA_PROCESSED_DIR = BASE_DIR / 'data' / 'processed'

print('Base dir:', BASE_DIR.resolve())
print('Raw data dir:', DATA_RAW_DIR.resolve())
print('Processed data dir:', DATA_PROCESSED_DIR.resolve())

DATA_RAW_DIR.mkdir(parents=True, exist_ok=True)
DATA_PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

Base dir: /content
Raw data dir: /content/data/raw
Processed data dir: /content/data/processed


## List Sample Images

In [None]:
image_extensions = {'.jpg', '.jpeg', '.png', '.bmp'}
image_files = [p for p in DATA_RAW_DIR.rglob('*') if p.suffix.lower() in image_extensions]

print(f'Found {len(image_files)} images in raw directory.')
image_files[:10]

Found 0 images in raw directory.


[]
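
Once images are in place, a per-extension count is a cheap way to spot unexpected file types. The sketch below reuses the same extension filter as the cell above; the helper `count_by_extension` and the temporary demo files are assumptions made for illustration, not part of the dataset.

```python
import tempfile
from collections import Counter
from pathlib import Path

IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.bmp'}


def count_by_extension(root: Path) -> Counter:
    """Count image files under root, grouped by lowercased extension."""
    return Counter(
        p.suffix.lower()
        for p in root.rglob('*')
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


# Demo on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'sub').mkdir()
    for name in ['a.jpg', 'b.JPG', 'sub/c.png', 'notes.txt']:
        (root / name).touch()
    demo_counts = count_by_extension(root)

print(dict(demo_counts))
```

In the notebook itself, the call would be `count_by_extension(DATA_RAW_DIR)`.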

## Visualize Sample Images

In [None]:
if len(image_files) == 0:
    print('No images found in data/raw. Please add images to continue.')
else:
    n_samples = min(9, len(image_files))
    sample_paths = random.sample(image_files, n_samples)

    cols = 3
    rows = int(np.ceil(n_samples / cols))
    plt.figure(figsize=(12, 4 * rows))

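    for idx, img_path in enumerate(sample_paths, 1):
        img = cv2.imread(str(img_path))
        if img is None:
            # cv2.imread returns None for unreadable or corrupt files
            print(f'Could not read {img_path.name}; skipping.')
            continue
        # OpenCV loads images as BGR; convert to RGB for matplotlib
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        plt.subplot(rows, cols, idx)
        plt.imshow(img)
        plt.axis('off')
        plt.title(img_path.name)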

    plt.tight_layout()
    plt.show()

No images found in data/raw. Please add images to continue.


## Label Distribution (Optional)

If your labels are stored in a CSV file or encoded as one folder per class, you can explore the class distribution here.

In [None]:
# Example for CSV-based labels (customize for your dataset)
# LABELS_CSV = BASE_DIR / 'data' / 'labels.csv'
# if LABELS_CSV.exists():
#     df = pd.read_csv(LABELS_CSV)
#     print(df.head())
#     plt.figure(figsize=(8, 4))
#     df['label'].value_counts().plot(kind='bar')
#     plt.title('Label Distribution')
#     plt.xlabel('Class')
#     plt.ylabel('Count')
#     plt.show()
# else:
#     print('labels.csv not found – skip this step or update this cell for your dataset.')
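
For the folder-based case, a minimal sketch follows. It assumes a layout of one subfolder per class (for example `data/raw/<class_name>/<image>`), which this notebook does not confirm. The helper `folder_label_counts` and the class names `class_a` and `class_b` are hypothetical.

```python
import tempfile
from collections import Counter
from pathlib import Path

IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.bmp'}


def folder_label_counts(root: Path) -> Counter:
    """Count images per class, treating each image's parent folder name as its label."""
    return Counter(
        p.parent.name
        for p in root.rglob('*')
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


# Demo on a throwaway directory; class names are placeholders
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for label, n_images in [('class_a', 3), ('class_b', 1)]:
        (root / label).mkdir()
        for i in range(n_images):
            (root / label / f'img_{i}.png').touch()
    demo_label_counts = folder_label_counts(root)

for label, count in sorted(demo_label_counts.items()):
    print(f'{label}: {count}')
```

With real data, `folder_label_counts(DATA_RAW_DIR)` returns a `Counter` that can be passed to `pd.Series(...).plot(kind='bar')` to get the same kind of chart as the CSV example.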

## Summary
- Environment and seeds initialized ✅
- Data folders created / confirmed ✅
- Sample images listed and visualized (if available) ✅

Next notebook: **02_baseline_model.ipynb** – implement classical baseline (e.g., HOG + SVM).