# A Multi-Representation Approach to Automated Galaxy Classification using CLIP, ViT, and FiftyOne: A Study on the Galaxy10 DECals Dataset


1. Run `00_setup_verification.ipynb` to verify the setup.
2. Run `01_data_exploration.ipynb` to begin the pipeline.
3. Run `02_model_embedding.ipynb` to extract features.
4. Run `03_ensumble_trianing.ipynb` for classification and evaluation.
5. Run `04_visulization.ipynb` to integrate everything into Voxel51's FiftyOne for visualization and analysis.

## Galaxy10 Pipeline - Setup Verification

This notebook verifies that all dependencies are installed correctly and the environment is ready for the Galaxy10 pipeline.

## Environment Setup Instructions

### Method 1: Miniconda + Pip (Recommended - Fast)
```bash
# Create conda environment
conda create -n galaxy10 python=3.11 -y
conda activate galaxy10

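# Install PyTorch with CUDA 12.1 wheels (the run recorded below reports a cu128 build;
# adjust the index URL to match your CUDA version)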
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install all dependencies
pip install -r requirements.txt

# Launch Jupyter Lab
jupyter lab
```

### Method 2: Pure Conda (Alternative - Slower)
```bash
# Create conda environment from yml
conda env create -f environment.yml
conda activate galaxy10
jupyter lab
```
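
Before opening the notebooks, it can help to confirm from a plain Python prompt that the packages imported later in this notebook are installed. The following is a minimal sketch, not part of the repository: `check_packages` is a hypothetical helper, and the mapping from import names to distribution names is an assumption that should be adjusted to match `requirements.txt`. CLIP is left out because its distribution name depends on how it was installed.

```python
from importlib import metadata
from importlib.util import find_spec

# Import name -> distribution name (assumed; adjust to match requirements.txt).
PACKAGES = {
    "torch": "torch",
    "torchvision": "torchvision",
    "numpy": "numpy",
    "pandas": "pandas",
    "sklearn": "scikit-learn",
    "timm": "timm",
    "umap": "umap-learn",
    "h5py": "h5py",
    "fiftyone": "fiftyone",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
}


def check_packages(packages):
    """Return {import_name: version string, "unknown", or None if not importable}."""
    report = {}
    for import_name, dist_name in packages.items():
        if find_spec(import_name) is None:
            report[import_name] = None  # not importable in this environment
            continue
        try:
            report[import_name] = metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            # Importable, but no metadata under the assumed distribution name.
            report[import_name] = "unknown"
    return report


if __name__ == "__main__":
    for name, version in check_packages(PACKAGES).items():
        print(f"{name:12s} {'missing' if version is None else version}")
```

Because `find_spec` only locates a module without importing it, this check stays fast even for heavy packages such as PyTorch.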


## 1. Import Core Libraries

In [5]:
import sys
import os
from multiprocessing import cpu_count

print(f"Python version: {sys.version}")
print(f"Available CPU cores: {cpu_count()}")
print(f"Using {max(1, cpu_count() - 1)} cores for parallel processing")

Python version: 3.11.14 | packaged by conda-forge | (main, Oct 22 2025, 22:46:25) [GCC 14.3.0]
Available CPU cores: 22
Using 21 cores for parallel processing


## 2. Verify PyTorch and CUDA

In [None]:
import torch
import torchvision

print(f"PyTorch version: {torch.__version__}")
print(f"Torchvision version: {torchvision.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU device: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
else:
    print("CUDA not available - will use CPU only")

PyTorch version: 2.9.0+cu128
Torchvision version: 0.24.0+cu128
CUDA available: True
CUDA version: 12.8
GPU device: NVIDIA RTX 500 Ada Generation Laptop GPU
GPU memory: 3.65 GB
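
The check above only reports whether CUDA is available. A common follow-up, sketched here on the assumption that the pipeline should fall back to the CPU when no GPU is visible, is to resolve the device once and reuse it. `pick_device` is a hypothetical helper name, not something defined in `src`.

```python
def pick_device():
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch itself is missing, so nothing can run on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can be passed to `torch.device(...)` or to `.to(...)` on a model or tensor.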


## 3. Verify ML Libraries

In [7]:
import numpy as np
import pandas as pd
import sklearn
import timm
import umap
import h5py

print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"Scikit-learn version: {sklearn.__version__}")
print(f"TIMM version: {timm.__version__}")
print(f"UMAP version: {umap.__version__}")
print(f"H5PY version: {h5py.__version__}")

NumPy version: 2.2.6
Pandas version: 2.3.3
Scikit-learn version: 1.7.2
TIMM version: 1.0.21
UMAP version: 0.5.9.post2
H5PY version: 3.15.1


## 4. Verify FiftyOne

In [9]:
import fiftyone as fo
import fiftyone.brain as fob

print(f"FiftyOne version: {fo.__version__}")
print(f"FiftyOne Brain available: {fob is not None}")
#print(f"FiftyOne config directory: {fo.config.config_dir}")
print(f"FiftyOne database: {fo.config.database_uri}")


FiftyOne version: 1.9.0
FiftyOne Brain available: True
FiftyOne database: None


## 5. Verify CLIP

In [None]:
try:
    import clip
    print("CLIP available: True")
    print(f"Available CLIP models: {clip.available_models()}")
except ImportError as e:
    print(f"CLIP not available: {e}")

CLIP available: True
Available CLIP models: ['RN50', 'RN101', 'RN50x4', 'RN50x16', 'RN50x64', 'ViT-B/32', 'ViT-B/16', 'ViT-L/14', 'ViT-L/14@336px']


## 6. Verify Visualization Libraries

In [11]:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

print(f"Matplotlib version: {matplotlib.__version__}")
print(f"Seaborn version: {sns.__version__}")

# Configure matplotlib for notebook
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

Matplotlib version: 3.10.7
Seaborn version: 0.13.2


## 7. Verify Project Structure

In [12]:
# Add project root to path
project_root = os.path.abspath('..')
if project_root not in sys.path:
    sys.path.insert(0, project_root)

# Verify directory structure
required_dirs = [
    '../src',
    '../src/models',
    '../artifacts',
    '../artifacts/embeddings',
    '../artifacts/models',
    '../artifacts/visualizations'
]

print("Checking project structure:")
for dir_path in required_dirs:
    exists = os.path.exists(dir_path)
    status = "✓" if exists else "✗"
    print(f"{status} {dir_path}")

# Import project config
from src.config import N_JOBS, CLASS_NAMES, EMBEDDING_MODELS

print(f"\nConfiguration loaded successfully!")
print(f"N_JOBS: {N_JOBS}")
print(f"Number of classes: {len(CLASS_NAMES)}")
print(f"Embedding models configured: {list(EMBEDDING_MODELS.keys())}")

Checking project structure:
✓ ../src
✓ ../src/models
✓ ../artifacts
✓ ../artifacts/embeddings
✓ ../artifacts/models
✓ ../artifacts/visualizations
Configuration loaded: Using 21 CPU cores for parallel processing

Configuration loaded successfully!
N_JOBS: 21
Number of classes: 10
Embedding models configured: ['vit', 'efficientnet', 'clip']
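
If any of the directories above shows ✗, the artifact folders can be created up front rather than by hand. This is a minimal sketch using only the standard library. `ensure_dirs` is a hypothetical helper, and it assumes the notebook runs one level below the project root, as the `'..'` paths above imply. Only the `artifacts` tree is listed, since creating empty `src` folders would not supply the project code.

```python
from pathlib import Path

# Artifact folders checked by the cell above, relative to the project root.
ARTIFACT_DIRS = [
    "artifacts",
    "artifacts/embeddings",
    "artifacts/models",
    "artifacts/visualizations",
]


def ensure_dirs(project_root, rel_dirs):
    """Create each directory under project_root if missing; return the ones created."""
    created = []
    for rel in rel_dirs:
        path = Path(project_root) / rel
        if not path.is_dir():
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


# Example call from a notebook one level below the project root (assumed layout):
# ensure_dirs("..", ARTIFACT_DIRS)
```

Calling it a second time returns an empty list, so it is safe to leave in a setup cell.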


## 8. Test CPU Parallelization

In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import time

# Generate dummy data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=10, 
                          n_informative=15, random_state=42)

# Test with parallelization
print(f"Testing Random Forest with n_jobs={N_JOBS}...")
start = time.time()
rf = RandomForestClassifier(n_estimators=100, n_jobs=N_JOBS, random_state=42)
rf.fit(X, y)
elapsed = time.time() - start

print(f"✓ Training completed in {elapsed:.2f} seconds")
print(f"✓ CPU parallelization working correctly!")

Testing Random Forest with n_jobs=21...
✓ Training completed in 0.19 seconds
✓ CPU parallelization working correctly!


## 9. Summary

If all cells above executed successfully, your environment is ready for the Galaxy10 pipeline!

### Next Steps:
1. Download the Galaxy10 DECals dataset (Galaxy10_DECals.h5)
2. Place it in the project root directory
3. Run `01_data_exploration.ipynb` to begin the pipeline

### Dataset Download:
The Galaxy10 DECals dataset can be downloaded from:
- Kaggle: https://www.kaggle.com/datasets/jaimetrickz/galaxy10-decals
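
Once the file is downloaded and placed in the project root, its presence can be checked before moving on to `01_data_exploration.ipynb`. The sketch below looks for the file and, if `h5py` is installed, lists the arrays inside it with their shapes. Both helper names are hypothetical, and the comment about the expected keys is an assumption based on how the file is commonly distributed, so verify it against your copy.

```python
from pathlib import Path

DATASET_NAME = "Galaxy10_DECals.h5"


def find_dataset(project_root, filename=DATASET_NAME):
    """Return the dataset path if it exists under project_root, else None."""
    path = Path(project_root) / filename
    return path if path.is_file() else None


def describe_dataset(path):
    """Map each top-level HDF5 entry to its shape (None for groups). Requires h5py."""
    import h5py  # imported lazily so find_dataset works without it

    with h5py.File(path, "r") as f:
        # Assumption: the file holds an image array and a label array
        # (commonly named "images" and "ans"); nothing here depends on those names.
        return {key: getattr(f[key], "shape", None) for key in f.keys()}


if __name__ == "__main__":
    dataset_path = find_dataset("..")
    if dataset_path is None:
        print(f"{DATASET_NAME} not found in the project root")
    else:
        print(f"Found {dataset_path}")
        print(describe_dataset(dataset_path))
```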

### **Dataset Reference: Galaxy10 DECals**  
Walmsley, M., et al. (2022). *Galaxy Zoo DECaLS: Detailed visual morphology measurements from volunteers and deep learning for 314,000 galaxies.* MNRAS, 509(3), 3966–3988.  
DOI: [10.1093/mnras/stab2093](https://doi.org/10.1093/mnras/stab2093)

In [None]:
print("="*60)
print("🎉 SETUP VERIFICATION COMPLETE!")
print("="*60)
print("\nYour Galaxy10 pipeline environment is ready.")
print(f"\nSystem Configuration:")
print(f"  - Python: {sys.version.split()[0]}")
print(f"  - PyTorch: {torch.__version__}")
print(f"  - CUDA: {torch.cuda.is_available()}")
print(f"  - CPU Cores: {cpu_count()} (using {N_JOBS} for parallel processing)")
print(f"  - FiftyOne: {fo.__version__}")
print("\nReady to process 17,736 galaxy images! 🌌")

🎉 SETUP VERIFICATION COMPLETE!

Your Galaxy10 pipeline environment is ready.

System Configuration:
  - Python: 3.11.14
  - PyTorch: 2.9.0+cu128
  - CUDA: True
  - CPU Cores: 22 (using 21 for parallel processing)
  - FiftyOne: 1.9.0

Ready to process 17,736 galaxy images! 🌌
