# Complete Loss Functions Summary

## Total Loss Function:

$$L_{total} = \lambda_1 L_{GAEX} + \lambda_2 L_{GAEA} + \lambda_3 L_{clu} + \lambda_4 L_{gae} + \lambda_5 L_{ZINB}$$

**Hyperparameter Values:**
- $\lambda_1 = 0.5$ (Graph expression weight)
- $\lambda_2 = 0.01$ (Graph adjacency weight)  
- $\lambda_3 = 0.1$ (ZINB clustering weight)
- $\lambda_4 = 0.01$ (Graph clustering weight)
- $\lambda_5 = 0.5$ (ZINB reconstruction weight)
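
As a minimal sketch, the weighted sum above can be written as follows. The dictionary keys and the function name `total_loss` are illustrative choices, not names from any reference implementation; only the weight values come from the list above.

```python
# Loss weights as listed above; the keys are hypothetical labels for the five terms.
LOSS_WEIGHTS = {
    "gaex": 0.5,   # lambda_1, graph expression preservation
    "gaea": 0.01,  # lambda_2, graph adjacency reconstruction
    "clu": 0.1,    # lambda_3, ZINB-module clustering
    "gae": 0.01,   # lambda_4, GCN-module clustering
    "zinb": 0.5,   # lambda_5, ZINB reconstruction
}


def total_loss(losses, weights=LOSS_WEIGHTS):
    """Return the weighted sum of the individual loss values."""
    missing = set(weights) - set(losses)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * losses[name] for name in weights)


# Made-up loss values, only to show the call.
example = {"gaex": 2.0, "gaea": 10.0, "clu": 1.0, "gae": 1.0, "zinb": 4.0}
combined = total_loss(example)
```

With these weights, the two 0.5-weighted reconstruction terms dominate the sum unless the other terms are much larger in raw magnitude.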

---

## Loss 1: ZINB Reconstruction Loss

$$L_{ZINB} = -\log(\text{ZINB}(\bar{X}|\pi, \mu, \theta))$$

**ZINB Distribution:**
$$\text{ZINB}(\bar{X}|\pi, \mu, \theta) = \pi \cdot \delta_0(\bar{X}) + (1-\pi) \cdot \text{NB}(\bar{X}|\mu, \theta)$$

**Negative Binomial:**
$$\text{NB}(\bar{X}|\mu, \theta) = \frac{\Gamma(\bar{X}+\theta)}{\bar{X}!\Gamma(\theta)}\left(\frac{\theta}{\theta+\mu}\right)^\theta\left(\frac{\mu}{\theta+\mu}\right)^{\bar{X}}$$

**Purpose:** Model zero-inflation and overdispersion in scRNA-seq data
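
The following sketch evaluates the ZINB negative log-likelihood for a single count, term by term from the formulas above. It assumes `mu > 0`, `theta > 0`, and `0 <= pi < 1`, and it is not numerically hardened for very large counts. How the per-entry values are reduced over the whole matrix (sum or mean) is not stated in this summary, so that step is left out.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log of NB(x | mu, theta) in the mean/dispersion parameterization."""
    return (
        math.lgamma(x + theta) - math.lgamma(x + 1) - math.lgamma(theta)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_nll(x, pi, mu, theta):
    """Negative log-likelihood of one count x under ZINB(pi, mu, theta)."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    # The point mass delta_0 contributes only when x == 0.
    point_mass = 1.0 if x == 0 else 0.0
    likelihood = pi * point_mass + (1.0 - pi) * nb
    return -math.log(likelihood)
```

For a zero count, raising `pi` lowers the loss, which is how the dropout parameter absorbs excess zeros that the negative binomial alone would find unlikely.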

---

## Loss 2: Graph Adjacency Reconstruction Loss

$$L_{GAEA} = \|A - \hat{A}\|_F^2$$

Where: $\hat{A} = \text{sigmoid}(Z_L Z_L^T)$, with one row of $Z_L$ per cell so that $\hat{A}$ has the same $N \times N$ shape as $A$

**Purpose:** Preserve structural relationships between cells
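
A small NumPy sketch of this loss is given below. It assumes cells are stored as rows of `Z`, so the cell-by-cell inner products are `Z @ Z.T`; the function name is illustrative.

```python
import numpy as np


def adjacency_reconstruction_loss(A, Z):
    """Squared Frobenius norm between A and the sigmoid of cell-cell inner products."""
    logits = Z @ Z.T                        # shape (n_cells, n_cells)
    A_hat = 1.0 / (1.0 + np.exp(-logits))   # element-wise sigmoid
    return float(np.sum((A - A_hat) ** 2))
```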

---

## Loss 3: Graph Expression Preservation Loss

$$L_{GAEX} = \|\bar{X} - Z_L\|_F^2$$

**Purpose:** Ensure GCN output retains gene expression information

---

## Loss 4: ZINB Module Clustering Loss

$$L_{clu} = KL(P \| Q) = \sum_{i}\sum_{j} p_{ij} \log\frac{p_{ij}}{q_{ij}}$$

**Applied to:** $H_{L/2}$ (middle layer of ZINB autoencoder)

**Purpose:** Guide clustering from content perspective

---

## Loss 5: GCN Module Clustering Loss  

$$L_{gae} = KL(P \| Z_{pre}) = \sum_{i}\sum_{j} p_{ij} \log\frac{p_{ij}}{z_{ij}}$$

Where: $Z_{pre} = \text{softmax}\left(\hat{D}^{-\frac{1}{2}}(A+I)\hat{D}^{-\frac{1}{2}}R_{L/2-1}U_{L/2-1}\right)$

**Applied to:** Graph autoencoder predictions

**Purpose:** Guide clustering from structural perspective

---

## Soft Assignment & Target Distribution:

### Soft Assignment (Q):
$$q_{ij} = \frac{\left[1 + \frac{\|h_i - \mu_j\|^2}{\lambda}\right]^{-\frac{(\lambda+1)}{2}}}{\sum_{j'} \left[1 + \frac{\|h_i - \mu_{j'}\|^2}{\lambda}\right]^{-\frac{(\lambda+1)}{2}}}$$

### Target Distribution (P):
$$p_{ij} = \frac{q_{ij}^2 / g_j}{\sum_{j'} q_{ij'}^2 / g_{j'}}$$

Where: $g_j = \sum_i q_{ij}$ (soft cluster frequency)

---

## Training Strategy:

### Phase 1: Pre-training (100 epochs)
- **Loss:** $L_{ZINB}$ only
- **Purpose:** Initialize ZINB autoencoder

### Phase 2: Joint Training (200 epochs)  
- **Loss:** $L_{total}$ (all 5 components)
- **Purpose:** Optimize all modules simultaneously
- **Convergence:** Stop when fewer than 0.1% of cells change cluster assignment between consecutive updates
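
The stopping rule can be sketched as below, under the assumption that "changes" means the fraction of cells whose hard cluster label differs between two consecutive checks. The function names and the default `tol` are illustrative.

```python
def fraction_changed(prev_labels, new_labels):
    """Fraction of cells whose cluster label differs between two checks."""
    if len(prev_labels) != len(new_labels):
        raise ValueError("label sequences must have the same length")
    if not prev_labels:
        raise ValueError("label sequences must be non-empty")
    changed = sum(a != b for a, b in zip(prev_labels, new_labels))
    return changed / len(prev_labels)


def has_converged(prev_labels, new_labels, tol=0.001):
    """True when fewer than tol (0.1% by default) of the labels changed."""
    return fraction_changed(prev_labels, new_labels) < tol
```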

---

## Key Insights:

1. **Five complementary objectives** working together
2. **Dual clustering losses** from both ZINB and GCN modules  
3. **Loss weights** emphasize reconstruction (0.5) over the adjacency and clustering terms (0.01-0.1)
4. **End-to-end training** with self-supervision
5. **ZINB modeling** handles scRNA-seq characteristics
6. **Graph regularization** preserves cell relationships

# Module 4: Self-Supervised Learning

## Objective:
Enable end-to-end clustering through soft assignments and target distribution learning

---

## Soft Assignment (Student's t-distribution):

$$q_{ij} = \frac{\left[1 + \frac{\|h_i - \mu_j\|^2}{\lambda}\right]^{-\frac{(\lambda+1)}{2}}}{\sum_{j'} \left[1 + \frac{\|h_i - \mu_{j'}\|^2}{\lambda}\right]^{-\frac{(\lambda+1)}{2}}}$$

Where:
- $q_{ij}$ = soft assignment probability of sample $i$ to cluster $j$
- $h_i$ = embedding representation of sample $i$
- $\mu_j$ = cluster center $j$
- $\lambda$ = degrees of freedom of the Student's t-distribution (typically $\lambda = 1$); unrelated to the loss weights $\lambda_1, \dots, \lambda_5$
- $\|h_i - \mu_j\|^2$ = squared Euclidean distance

**Purpose**: Compute soft cluster assignments using Student's t-distribution
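
A vectorized NumPy sketch of the soft assignment follows. It assumes embeddings and cluster centers are stored as rows; the argument name `dof` stands for the degrees of freedom $\lambda$ and is an illustrative choice.

```python
import numpy as np


def soft_assign(H, centers, dof=1.0):
    """Student's t soft assignments q_ij.

    H has shape (n_cells, d); centers has shape (n_clusters, d).
    """
    # Squared Euclidean distance between every embedding and every center.
    sq_dist = ((H[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    kernel = (1.0 + sq_dist / dof) ** (-(dof + 1.0) / 2.0)
    # Normalize over clusters so each row is a probability distribution.
    return kernel / kernel.sum(axis=1, keepdims=True)
```

A cell sitting exactly on a center still gives some probability to the other clusters, because the heavy-tailed kernel never reaches zero.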

---

## Target Distribution (High-Confidence):

$$p_{ij} = \frac{q_{ij}^2 / g_j}{\sum_{j'} q_{ij'}^2 / g_{j'}}$$

Where:
- $p_{ij}$ = target distribution (sharpened assignments)
- $q_{ij}$ = soft assignment from above
- $g_j = \sum_i q_{ij}$ = soft cluster frequency

**Purpose**: Create high-confidence target distribution to guide learning
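
The target distribution can be computed from a matrix of soft assignments as in this sketch, which assumes `Q` has one row per cell and one column per cluster.

```python
import numpy as np


def target_distribution(Q):
    """Sharpened target p_ij built from soft assignments Q of shape (n_cells, n_clusters)."""
    g = Q.sum(axis=0)        # soft cluster frequencies g_j
    weight = Q ** 2 / g      # square, then divide by cluster frequency
    return weight / weight.sum(axis=1, keepdims=True)
```

When cluster frequencies are equal, squaring pushes each row toward its largest entry. When one cluster is much larger, the division by $g_j$ can reduce its share instead, which is the frequency normalization at work.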

---

## How Self-Supervised Learning Works:

### Step 1: Initialize Cluster Centers
- Apply k-means clustering on $H_{L/2}$ (middle layer of ZINB autoencoder)
- Obtain initial cluster centers: $\{\mu_1, \mu_2, ..., \mu_k\}$

### Step 2: Compute Soft Assignments
- Calculate $q_{ij}$ using Student's t-distribution
- Each cell gets probability distribution over clusters
- Soft assignments allow gradual refinement

### Step 3: Generate Target Distribution  
- Compute cluster frequencies: $g_j = \sum_i q_{ij}$
- Create sharpened distribution: $p_{ij} = \frac{q_{ij}^2/g_j}{\sum_{j'} q_{ij'}^2/g_{j'}}$
- Target distribution emphasizes high-confidence assignments

### Step 4: Update Parameters
- Use KL divergence loss to align Q with P
- Update both network parameters and cluster centers
- Iterative refinement improves clustering
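
The loss used in Step 4 can be sketched as a plain KL divergence between two assignment matrices. The small `eps` guard against `log(0)` is an added assumption, not something stated in this summary.

```python
import numpy as np


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) summed over all cells and clusters."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))
```

In training, `P` is treated as a fixed target while gradients flow through `Q`, so minimizing this value pulls the soft assignments toward the sharpened distribution.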

---

## Key Properties:

### Student's t-Distribution Benefits:
1. **Heavy tails**: Robust to outliers
2. **Smooth gradients**: Better optimization than hard assignments
3. **Adaptive**: Adjusts based on distance to cluster centers
4. **Normalized**: Probabilities sum to 1 for each sample

### Target Distribution Benefits:
1. **Sharpening**: Emphasizes confident predictions
2. **Frequency normalization**: Prevents trivial solutions
3. **High confidence**: Guides model toward decisive clustering
4. **Balanced**: Accounts for cluster size differences

---

## Dual Self-Supervision:
- Applied to **both** ZINB module ($H_{L/2}$) and GCN module ($Z_{pre}$)
- Same target distribution $P$ guides both modules
- Ensures unified clustering objective across architectures

# Module 3: Attention Fusion Mechanism

## Objective:
Adaptively integrate gene expression (content) and structural information

---

## Multi-Head Attention (8 heads):

### Step 1: Weighted Combination

$$Y_{l-1} = \alpha \times H_{l-1} + (1-\alpha) \times Z_{l-1}$$

Where:
- $Y_{l-1}$ = combined representation
- $\alpha = 0.5$ (balance parameter)
- $H_{l-1}$ = output from ZINB autoencoder
- $Z_{l-1}$ = output from graph autoencoder
- $1-\alpha = 0.5$ (complementary weight)

**Purpose**: Balance content and structural information equally

---

### Step 2: Multi-Head Attention

$$\text{head}_i = \text{softmax}\left(\frac{Q_i \times K_i^T}{\sqrt{d_k}}\right) \times V_i$$

$$R_l = W \times \text{Concat}(\text{head}_1, ..., \text{head}_8)$$

Where:
- $\text{head}_i$ = output of attention head $i$
- $Q_i$ = query matrix of head $i$ ($Q_i = W_i^Q \times Y_{l-1}$)
- $K_i$ = key matrix of head $i$ ($K_i = W_i^K \times Y_{l-1}$)
- $V_i$ = value matrix of head $i$ ($V_i = W_i^V \times Y_{l-1}$)
- $d_k$ = dimension of key vectors
- $\sqrt{d_k}$ = scaling factor (prevents large dot products)
- $W$ = weight matrix for final projection
- $\text{Concat}$ = concatenation operation
- Number of heads = 8

---

## How It Works:

### For each attention head $i$ (i = 1, 2, ..., 8):

1. **Transform input** into Q, K, V using learned weight matrices
   - $Q_i = W_i^Q \times Y_{l-1}$
   - $K_i = W_i^K \times Y_{l-1}$
   - $V_i = W_i^V \times Y_{l-1}$

2. **Compute attention scores**
   - Calculate similarity: $Q_i \times K_i^T$
   - Scale by $\sqrt{d_k}$ to prevent gradient issues
   - Apply softmax to get attention weights

3. **Apply attention** to values
   - Weighted sum: $\text{attention\_weights} \times V_i$

4. **Concatenate all heads** and project
   - Combine: $[\text{head}_1 | \text{head}_2 | ... | \text{head}_8]$
   - Final transformation: $R_l = W \times \text{concatenated\_heads}$
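
The two steps can be sketched in NumPy as below. This is an illustration under stated assumptions, not the reference implementation: cells are stored as rows, so projections are written `Y @ W` (the transpose of the $W \times Y$ notation above), attention is computed across cells, and all shapes and the random weights are made up for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse(H, Z, Wq, Wk, Wv, Wo, alpha=0.5):
    """Fuse ZINB features H and GCN features Z, both of shape (n_cells, d)."""
    Y = alpha * H + (1.0 - alpha) * Z               # Step 1: weighted combination
    heads = []
    for wq, wk, wv in zip(Wq, Wk, Wv):              # one projection triple per head
        Q, K, V = Y @ wq, Y @ wk, Y @ wv
        d_k = K.shape[1]
        weights = softmax(Q @ K.T / np.sqrt(d_k))   # each row sums to 1
        heads.append(weights @ V)
    return np.concatenate(heads, axis=1) @ Wo       # Step 2: concatenate and project


rng = np.random.default_rng(0)
n_cells, d, n_heads, d_k = 5, 16, 8, 2
H = rng.normal(size=(n_cells, d))
Z = rng.normal(size=(n_cells, d))
Wq, Wk, Wv = (rng.normal(size=(n_heads, d, d_k)) for _ in range(3))
Wo = rng.normal(size=(n_heads * d_k, d))
R = fuse(H, Z, Wq, Wk, Wv, Wo)   # shape (5, 16), same as the inputs
```

Projecting back to dimension `d` keeps the fused output the same shape as its inputs, so it can be passed to the next layer unchanged.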

---

## Key Benefits:

### 1. Adaptive Feature Weighting
- Different features get different importance automatically
- Model learns which genes/structures matter most

### 2. Multi-Perspective Learning
- 8 heads capture diverse patterns simultaneously
- Each head focuses on different aspects

### 3. Layer-by-Layer Fusion
- Fusion happens at each layer, not just once
- Prevents information loss through depth

### 4. Balanced Integration
- $\alpha = 0.5$ gives both modules equal weight in the combination step
- Neither content nor structure dominates the fused input

### 5. Mitigates Oversmoothing
- Maintains discriminative features from ZINB module
- Counteracts the tendency of stacked GCN layers to blur node features

---

## Attention Mechanism Advantages:

- **Selective**: Focuses on relevant information
- **Dynamic**: Adapts based on input data
- **Interpretable**: Attention weights show importance
- **Powerful**: Captures complex dependencies
- **Scalable**: Parallel computation across heads

# Module 2: Graph Autoencoder (GCN)

## Objective:
Capture high-order structural relationships between cells

---

## Graph Convolutional Network Layer:

$$Z_l = \text{ReLU}\left(\hat{D}^{-\frac{1}{2}} \times (A+I) \times \hat{D}^{-\frac{1}{2}} \times R_{l-1} \times U_{l-1}\right)$$

Where:
- $Z_l$ = output of GCN layer $l$
- $\hat{D}$ = diagonal degree matrix of $A + I$
- $A$ = adjacency matrix (KNN graph)
- $I$ = identity matrix
- $R_{l-1}$ = fused representation from previous layer (attention output)
- $U_{l-1}$ = weight parameters of layer $l-1$
- $\text{ReLU}$ = activation function
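
A minimal NumPy sketch of one such layer is shown below, assuming a dense, non-negative adjacency matrix and cells stored as rows. The function names are illustrative.

```python
import numpy as np


def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])
    deg = A_tilde.sum(axis=1)          # at least 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(A_norm, R_prev, U_prev):
    """One GCN layer: ReLU(A_norm @ R_prev @ U_prev)."""
    return np.maximum(A_norm @ R_prev @ U_prev, 0.0)
```

The normalized adjacency depends only on the graph, so it can be computed once and reused by every layer.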

---

## Key Features:

### A = KNN Adjacency Matrix (k=10)
- Represents cell-cell relationships
- K-nearest neighbor graph construction
- Each cell connected to its 10 most similar neighbors
- Captures local structure in expression space
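
One way to build such a graph is sketched below. The summary does not say which similarity measure or symmetrization rule is used, so Euclidean distance and a union rule (an edge exists if either cell lists the other as a neighbor) are assumptions here. The brute-force distance matrix is only practical for small datasets.

```python
import numpy as np


def knn_adjacency(X, k=10):
    """Symmetric binary KNN adjacency; rows of X are cells, and k must be < n_cells."""
    n = X.shape[0]
    sq = (X ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared Euclidean distances
    np.fill_diagonal(dist, np.inf)                     # a cell is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]        # indices of the k closest cells
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), neighbors.ravel()] = 1.0
    return np.maximum(A, A.T)                          # union symmetrization
```

Because of the union rule, some cells can end up with more than `k` neighbors.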

### I = Identity Matrix
- Self-connections for each node
- Ensures node's own features are included
- Prevents information loss in aggregation

### D̂ = Degree Matrix
- Diagonal matrix with node degrees
- $\hat{D}_{ii} = \sum_j (A_{ij} + I_{ij})$
- Used for symmetric normalization
- Balances influence of neighbors

### $R_{l-1}$ = Fused Representation (Attention Output)
- Combined content + structure information
- Output from attention fusion module
- Helps mitigate oversmoothing
- Maintains discriminative features

### Mitigates Oversmoothing Problem in Deep GCNs
- **Problem**: Deep GCNs blur node features
- **Solution**: Inject content information via $R_{l-1}$
- Preserves cell-specific characteristics
- Enables deeper network architectures

---

## Benefits:

1. **Structural Learning**: Captures cell-cell relationships
2. **High-order Information**: Multi-layer aggregation
3. **Normalized Aggregation**: Symmetric normalization prevents scaling issues
4. **Feature Preservation**: Fused input maintains content information
5. **Robust to Noise**: Aggregating over graph neighbors dampens cell-level technical noise

# Module 1: ZINB-based Autoencoder

## Objective:
Learn low-dimensional representations capturing gene expression patterns

---

## Mathematical Formulation:

### Encoder:
$$H_l = \text{ReLU}(W_l \times H_{l-1} + b_l)$$

Where:
- $H_l$ = output of layer $l$
- $W_l$ = weight matrix of layer $l$
- $b_l$ = bias vector of layer $l$
- $\text{ReLU}$ = Rectified Linear Unit activation function
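
The layer rule can be applied repeatedly as in this sketch. It assumes cells are stored as rows, so each layer computes `H @ W + b`, which is the transposed form of $W_l \times H_{l-1} + b_l$. The function name is illustrative.

```python
import numpy as np


def encode(X, weights, biases):
    """Apply H_l = ReLU(H_{l-1} W_l + b_l) layer by layer; rows of X are cells."""
    H = X
    for W, b in zip(weights, biases):
        H = np.maximum(H @ W + b, 0.0)
    return H
```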

---

## ZINB Parameters:

### Dropout Parameter:
$$\Pi = \text{sigmoid}(W_\pi \times H_L)$$

### Mean Parameter:
$$M_i = \text{diag}(S_i) \times \exp(W_\mu \times H_L)$$

### Dispersion Parameter:
$$\Theta = \exp(W_\theta \times H_L)$$

Where:
- $\Pi$ (pi) = dropout probability (zero-inflation)
- $M_i$ = mean parameter for cell $i$
- $S_i$ = size factor for cell $i$
- $\Theta$ (theta) = dispersion parameter (overdispersion)
- $H_L$ = output of the final decoder layer
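
The three output heads can be sketched in NumPy as follows, again assuming cells are stored as rows, so `diag(S)` becomes a per-row scaling by each cell's size factor. Bias terms are omitted, as in the formulas above; the function and argument names are illustrative.

```python
import numpy as np


def zinb_heads(H_L, W_pi, W_mu, W_theta, size_factors):
    """Map features H_L (n_cells, d) to ZINB parameters, each of shape (n_cells, n_genes)."""
    pi = 1.0 / (1.0 + np.exp(-(H_L @ W_pi)))           # dropout probability in (0, 1)
    mu = size_factors[:, None] * np.exp(H_L @ W_mu)    # mean, scaled per cell
    theta = np.exp(H_L @ W_theta)                      # dispersion, strictly positive
    return pi, mu, theta
```

The sigmoid and exponential activations keep each parameter in its valid range: probabilities for $\Pi$, positive values for $M$ and $\Theta$.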

---

## Network Architecture:

### Encoder Layers:
```
Input: 2000 genes
  ↓
Dense Layer 1: 1000 nodes + ReLU
  ↓
Dense Layer 2: 1000 nodes + ReLU
  ↓
Dense Layer 3: 4000 nodes + ReLU
  ↓
Latent Layer: 10 nodes + ReLU
```

**Architecture Flow:**
$$2000 \rightarrow 1000 \rightarrow 1000 \rightarrow 4000 \rightarrow 10 \text{ nodes}$$

### Decoder Layers (Symmetric):
```
Latent: 10 nodes
  ↓
Dense Layer 1: 4000 nodes + ReLU
  ↓
Dense Layer 2: 1000 nodes + ReLU
  ↓
Dense Layer 3: 1000 nodes + ReLU
  ↓
Output: 2000 genes (reconstructed)
```

---

## Key Features:

1. **Dimensionality Reduction**: 2000 → 10 dimensions
2. **Zero-Inflation Modeling**: Handles dropout events via $\Pi$ parameter
3. **Overdispersion Handling**: Captures variance via $\Theta$ parameter
4. **Symmetric Architecture**: Encoder-decoder structure
5. **ReLU Activation**: Non-linear transformations throughout

---

## Purpose:

- Capture **content information** from gene expression data
- Model the **statistical distribution** of scRNA-seq data
- Learn **meaningful latent representations** for clustering
- Handle **sparsity and noise** inherent in single-cell data