# scLightGAT Usage Guide

This notebook demonstrates how to use the scLightGAT pipeline for cell type annotation.

## Table of Contents
1. [Quick Start (Shell)](#quick-start)
2. [Python API](#python-api)
3. [Available Modes](#modes)
4. [Inference-Only Mode](#inference-only)
5. [Data Structure Requirements](#data-requirements)

## 1. Quick Start (Shell) <a name="quick-start"></a>

```bash
# Basic usage - run on specific dataset
./run_sclight.gat.sh GSE115978

# Run on all datasets
./run_sclight.gat.sh

# With custom epochs
./run_sclight.gat.sh --dvae-epochs 10 --gat-epochs 500 GSE115978

# With batch correction
./run_sclight.gat.sh --batch-key samples GSE115978

# CAF mode (Cancer-Associated Fibroblasts)
./run_sclight.gat.sh --caf CAF

# Hierarchical classification (subtypes)
./run_sclight.gat.sh --hierarchical GSE115978

# Inference-only mode (skip training)
./run_sclight.gat.sh --inference-only GSE123139
```
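
When the launcher is driven from Python (for example, to loop over datasets), the invocations above can be assembled programmatically. The following is a minimal sketch, not part of scLightGAT: `build_command` is a hypothetical helper, it uses only the flags shown above, and it assumes those flags can be combined freely in a single call.

```python
import shlex


def build_command(dataset=None, dvae_epochs=None, gat_epochs=None,
                  batch_key=None, caf=False, hierarchical=False,
                  inference_only=False, script="./run_sclight.gat.sh"):
    """Assemble the argument list for the launcher script (hypothetical helper)."""
    cmd = [script]
    if dvae_epochs is not None:
        cmd += ["--dvae-epochs", str(dvae_epochs)]
    if gat_epochs is not None:
        cmd += ["--gat-epochs", str(gat_epochs)]
    if batch_key is not None:
        cmd += ["--batch-key", batch_key]
    if caf:
        cmd.append("--caf")
    if hierarchical:
        cmd.append("--hierarchical")
    if inference_only:
        cmd.append("--inference-only")
    # The dataset name goes last; omitting it runs all datasets.
    if dataset is not None:
        cmd.append(dataset)
    return cmd


cmd = build_command("GSE115978", dvae_epochs=10, gat_epochs=500)
print(shlex.join(cmd))
# ./run_sclight.gat.sh --dvae-epochs 10 --gat-epochs 500 GSE115978
```

To execute the command rather than print it, pass the list to `subprocess.run(cmd, check=True)` from the repository root.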

## 2. Python API <a name="python-api"></a>

```python
# Setup imports
import os
import sys
import scanpy as sc

# Add scLightGAT to path (adjust to your local checkout)
sys.path.insert(0, '/Group16T/common/lcy/dslab_lcy/GitRepo/scLightGAT')

from scLightGAT.pipeline import train_pipeline
from scLightGAT.training.model_manager import CellTypeAnnotator
from scLightGAT.label_utils import calculate_accuracy, get_aligned_color_palette
```

```python
# Define paths relative to the repository root (adjust to your local checkout)
repo_root = '/Group16T/common/lcy/dslab_lcy/GitRepo/scLightGAT'
data_root = os.path.join(repo_root, 'data', 'scLightGAT_data')

train_path = os.path.join(data_root, 'Integrated_training', 'train.h5ad')
test_path = os.path.join(data_root, 'Independent_testing', 'GSE115978.h5ad')
output_path = os.path.join(repo_root, 'sclightgat_exp_results', 'GSE115978')
model_dir = os.path.join(repo_root, 'saved_models')

# Create the output and model directories if they do not exist yet
os.makedirs(output_path, exist_ok=True)
os.makedirs(model_dir, exist_ok=True)
```

```python
# Run training pipeline
adata_result = train_pipeline(
    train_path=train_path,
    test_path=test_path,
    output_path=output_path,
    model_dir=model_dir,
    train_dvae=True,      # Use DVAE for feature extraction
    use_gat=True,         # Use GAT for refinement
    dvae_epochs=5,        # DVAE training epochs
    gat_epochs=300,       # GAT training epochs
    hierarchical=False,   # Set to True to enable subtype prediction
    batch_key=None        # Batch correction key (optional), e.g. 'samples' for GSE115978
)
```
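
When the test data carries ground-truth labels, the predictions can be scored per cell type. The sketch below is a plain-Python illustration of that comparison, not the package's own `calculate_accuracy` (whose signature is not shown in this guide). The function name `per_class_accuracy` and the toy labels are made up for the example.

```python
from collections import Counter


def per_class_accuracy(true_labels, pred_labels):
    """Return (overall accuracy, {label: accuracy}) for two equal-length label sequences."""
    true_labels = list(true_labels)
    pred_labels = list(pred_labels)
    if not true_labels:
        raise ValueError("no labels to compare")
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences differ in length")

    total = Counter(true_labels)
    # Count the cells whose prediction matches the ground truth, keyed by true label.
    correct = Counter(t for t, p in zip(true_labels, pred_labels) if t == p)

    overall = sum(correct.values()) / len(true_labels)
    per_class = {label: correct[label] / n for label, n in total.items()}
    return overall, per_class


# Toy labels for illustration only
truth = ["T", "T", "B", "NK"]
pred = ["T", "B", "B", "NK"]
overall, per_class = per_class_accuracy(truth, pred)
print(overall)    # 0.75
print(per_class)  # {'T': 0.5, 'B': 1.0, 'NK': 1.0}
```

With a pipeline result, the two arguments would be the ground-truth column of `adata_result.obs` and the prediction column. The name of the prediction column is not stated in this guide, so check `adata_result.obs.columns` first.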

## 3. Available Modes <a name="modes"></a>

| Mode | Flag | Description |
|------|------|-------------|
| Standard | (default) | Broad cell type prediction |
| Hierarchical | `--hierarchical` | Subtypes for CD4+T, CD8+T, B cells, Plasma, DC |
| CAF | `--caf` | CAF-specific training/test data |
| Inference | `--inference-only` | Skip training, use cached models |

### Available Datasets

| Dataset | Batch Key | Cells |
|---------|-----------|-------|
| GSE115978 | samples | ~4k |
| GSE123139 | Processing | ~6k |
| GSE153935 | sample_id | ~8k |
| GSE166555 | case_id | ~19k |
| Zhengsorted | - | ~18k |
| CAF | - | CAF-specific |
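
The batch keys in the table can be kept in a small lookup so that the right value is passed as `batch_key` for each dataset. This is a sketch under the assumption that `-` in the table means no batch correction (`None`). `BATCH_KEYS` and `batch_key_for` are hypothetical names, not part of scLightGAT.

```python
# Batch keys copied from the table above; None means no batch key is listed.
BATCH_KEYS = {
    "GSE115978": "samples",
    "GSE123139": "Processing",
    "GSE153935": "sample_id",
    "GSE166555": "case_id",
    "Zhengsorted": None,
    "CAF": None,
}


def batch_key_for(dataset):
    """Return the batch key for a known dataset, or raise ValueError for an unknown one."""
    try:
        return BATCH_KEYS[dataset]
    except KeyError:
        known = ", ".join(sorted(BATCH_KEYS))
        raise ValueError(f"Unknown dataset {dataset!r}; expected one of: {known}") from None


print(batch_key_for("GSE166555"))    # case_id
print(batch_key_for("Zhengsorted"))  # None
```

The returned value can be passed directly as the `batch_key` argument of `train_pipeline`.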

## 4. Inference-Only Mode <a name="inference-only"></a>

Once the models have been trained and saved to `model_dir`, later runs can load them instead of retraining:

```python
# Using the model manager for inference-only
annotator = CellTypeAnnotator(use_dvae=True, use_hvg=True)

# Check if trained models exist
if annotator.has_trained_models(model_dir):
    print('Trained models found! Loading...')
    encoder, metadata = annotator.load_models(model_dir)
    print(f'Loaded models with config: {metadata}')
else:
    print('No trained models found. Need to train first.')
```

## 5. Data Structure Requirements <a name="data-requirements"></a>

### Training Data
Columns read from `adata.obs`:
- `Celltype_training` or `Ground Truth` (required): Cell type labels (13 broad types)
- `Celltype_subtraining` (required for hierarchical mode only): Subtype labels (31 subtypes)
- `batch` (optional): Batch information for Harmony correction

### Test Data
Fields read from the test `adata`:
- `X_umap` in `adata.obsm` (required): UMAP coordinates
- `Celltype_training` or `Ground Truth` in `adata.obs` (optional): Used only for accuracy evaluation
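
The requirements for training and test data can be checked before launching a long run. The sketch below works on plain collections of column names and `obsm` keys so it stays independent of the AnnData object itself. `check_training_obs` and `check_test_data` are hypothetical helpers, not part of scLightGAT.

```python
LABEL_COLUMNS = ("Celltype_training", "Ground Truth")


def check_training_obs(obs_columns, hierarchical=False):
    """Return a list of problems found in the training obs columns (empty if none)."""
    cols = set(obs_columns)
    problems = []
    if not cols.intersection(LABEL_COLUMNS):
        problems.append("missing label column: 'Celltype_training' or 'Ground Truth'")
    if hierarchical and "Celltype_subtraining" not in cols:
        problems.append("hierarchical mode needs 'Celltype_subtraining'")
    return problems


def check_test_data(obs_columns, obsm_keys):
    """Return (problems, can_evaluate) for the test data."""
    problems = []
    if "X_umap" not in set(obsm_keys):
        problems.append("missing 'X_umap' in obsm")
    # Ground-truth labels are optional; without them accuracy cannot be computed.
    can_evaluate = bool(set(obs_columns).intersection(LABEL_COLUMNS))
    return problems, can_evaluate


print(check_training_obs(["Celltype_training", "batch"], hierarchical=True))
# ["hierarchical mode needs 'Celltype_subtraining'"]
print(check_test_data(["Ground Truth"], ["X_umap", "X_pca"]))
# ([], True)
```

For AnnData objects loaded with `sc.read_h5ad`, the inputs would be `train.obs.columns` for the training check and `test.obs.columns` with `test.obsm.keys()` for the test check.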

### Output Files
```
output_path/
├── adata_with_predictions.h5ad    # Result with predictions
├── umap_comparison.png            # GT vs scLightGAT UMAP
├── umap_scLightGAT.png            # scLightGAT only UMAP
├── accuracy_report.txt            # Custom accuracy metrics
├── lightgbm_confusion_matrix.png  # Classification matrix
└── gat_training_loss.png          # GAT training curve
```
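
After a run, the output directory can be compared against the file list above. This is a minimal sketch with a hypothetical helper, `missing_outputs`. It assumes every listed file is written on every run, which may not hold in all modes (for instance, the GAT loss curve when the GAT step is disabled).

```python
import tempfile
from pathlib import Path

# File names taken from the output listing above
EXPECTED_OUTPUTS = [
    "adata_with_predictions.h5ad",
    "umap_comparison.png",
    "umap_scLightGAT.png",
    "accuracy_report.txt",
    "lightgbm_confusion_matrix.png",
    "gat_training_loss.png",
]


def missing_outputs(output_path):
    """Return the expected file names that are absent from output_path."""
    out = Path(output_path)
    return [name for name in EXPECTED_OUTPUTS if not (out / name).is_file()]


# Demonstration on a temporary directory holding only one of the files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "accuracy_report.txt").touch()
    print(missing_outputs(tmp))
```

An empty list means all expected files are present. In the demonstration, every name except `accuracy_report.txt` is reported as missing.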

```python
# Check training data structure
train = sc.read_h5ad(train_path)
print('Training data columns:')
print(train.obs.columns.tolist())
print(f'\nCells: {train.shape[0]}')
print(f'Genes: {train.shape[1]}')
print()

# Report the number of distinct labels for each label column that is present
# (either 'Celltype_training' or 'Ground Truth' may hold the broad labels)
for col in ('Celltype_training', 'Ground Truth', 'Celltype_subtraining'):
    if col in train.obs.columns:
        print(f'{col} labels: {train.obs[col].nunique()}')
```