# DialogueGCN: Mathematical Foundations & Implementation Deep Dive


## Table of Contents
1. [Core Architecture](#core-architecture)
2. [Loss Functions](#loss-functions)
3. [Attention Mechanisms](#attention-mechanisms)
4. [Recurrent Components](#recurrent-components)
5. [Graph Neural Network Components](#graph-neural-network-components)
6. [Feature Extraction](#feature-extraction)
7. [Key Papers & References](#key-papers--references)

## Core Architecture

The DialogueGCN model combines three fundamental neural paradigms:

1. **Sequential Processing** (RNN/LSTM/GRU)
2. **Graph Neural Networks** (RGCN)
3. **Attention Mechanisms**

Mathematically, the full model can be represented as:

$$
\text{Emotion} = \text{GNN}(\text{RGCN}(\text{SeqEncoder}(U, qmask), \mathcal{G}(V,E)))
$$

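Where:
- $U$: Utterance feature sequence
- $qmask$: Speaker masks (one-hot speaker indicator per utterance)
- $\mathcal{G}(V,E)$: Dialogue graph with utterances as nodes $V$ and typed, directed edges $E$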

[Original Paper](https://arxiv.org/abs/1908.11540) | [Official Code](https://github.com/declare-lab/conv-emotion)

## Loss Functions

### MaskedNLLLoss
```python
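class MaskedNLLLoss(nn.Module):
    def forward(self, pred, target, mask):
        # self.loss is assumed to be nn.NLLLoss(reduction='sum'), set in __init__ (not shown)
        # pred: (batch*seq_len, n_classes) log-probabilities; mask: (batch, seq_len)
        mask_ = mask.view(-1, 1)  # (batch*seq_len, 1)
        loss = self.loss(pred * mask_, target) / torch.sum(mask)
        return loss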
```
Mathematics:
$$
L = - \frac{\sum_{t=1}^{T} m_t \cdot y_t \log(p_t)}{\sum_{t=1}^{T} m_t}
$$
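Where $m_t \in \{0, 1\}$ is the mask value at position $t$, $y_t$ is the one-hot label, and $p_t$ is the predicted class distribution, so padded positions contribute nothing to the sum.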

**Use Case:** Handles variable-length sequences in dialogue systems.
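
The masked average can be checked numerically. The sketch below is a minimal, unweighted version that assumes `self.loss` is `nn.NLLLoss(reduction="sum")`; the tensor sizes and the `reference` comparison are illustrative choices, not taken from the repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedNLLLoss(nn.Module):
    """NLL loss averaged over unmasked positions only (unweighted sketch)."""

    def __init__(self):
        super().__init__()
        # Assumption: sum the per-position losses, then normalize by the mask in forward().
        self.loss = nn.NLLLoss(reduction="sum")

    def forward(self, pred, target, mask):
        # pred:   (batch * seq_len, n_classes) log-probabilities
        # target: (batch * seq_len,) class indices
        # mask:   (batch, seq_len), 1 for real utterances and 0 for padding
        mask_ = mask.view(-1, 1)
        return self.loss(pred * mask_, target) / torch.sum(mask)


torch.manual_seed(0)
batch, seq_len, n_classes = 2, 3, 4
log_probs = F.log_softmax(torch.randn(batch * seq_len, n_classes), dim=1)
target = torch.randint(0, n_classes, (batch * seq_len,))
mask = torch.tensor([[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]])  # second dialogue has one utterance

masked_loss = MaskedNLLLoss()(log_probs, target, mask)

# Reference value: plain NLL averaged over the four unmasked positions.
keep = mask.view(-1).bool()
reference = F.nll_loss(log_probs[keep], target[keep], reduction="mean")
```

Both values should agree, since zeroing the log-probabilities at padded positions removes their terms from the summed loss.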

**References:**
- [PyTorch NLLLoss](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html)
- Sequence Masking Explained

### MaskedMSELoss
```python
class MaskedMSELoss(nn.Module):
    def forward(self, pred, target, mask):
        # self.loss is assumed to be nn.MSELoss(reduction='sum'), set in __init__ (not shown)
        loss = self.loss(pred * mask, target) / torch.sum(mask)
        return loss
```
Mathematics:
$$
L = \frac{\sum_{t=1}^{T} m_t \cdot (y_t - \hat{y}_t)^2}{\sum_{t=1}^{T} m_t}
$$

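**Use Case:** Regression tasks over padded, variable-length sequences. Because the code masks only `pred`, it matches the formula when `target` is zero at masked positions.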
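
A small worked example makes the normalization concrete. This sketch assumes `self.loss` is `nn.MSELoss(reduction="sum")` and that `target` is zero wherever the mask is zero; the numbers are made up for illustration.

```python
import torch
import torch.nn as nn


class MaskedMSELoss(nn.Module):
    def __init__(self):
        super().__init__()
        # Assumption: summed squared error, normalized by the mask in forward().
        self.loss = nn.MSELoss(reduction="sum")

    def forward(self, pred, target, mask):
        # pred, target, mask: (batch, seq_len); target is assumed zero where mask == 0
        return self.loss(pred * mask, target) / torch.sum(mask)


pred = torch.tensor([[1.0, 2.0, 9.0]])    # last position is padding
target = torch.tensor([[0.0, 4.0, 0.0]])  # zero at the padded position
mask = torch.tensor([[1.0, 1.0, 0.0]])

loss = MaskedMSELoss()(pred, target, mask)
# (1 - 0)^2 + (2 - 4)^2 = 5, averaged over the 2 unmasked positions
print(loss.item())  # 2.5
```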

**References:**
- [MSELoss Docs](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html)

## Attention Mechanisms

### SimpleAttention
```python
scale = self.scalar(M) # seq_len, batch, 1
alpha = F.softmax(scale, dim=0)
```
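Mathematics (the softmax runs over sequence positions, and $W$ maps each $h_i$ to a scalar score):
$$
\alpha_i = \frac{\exp(W h_i)}{\sum_{j} \exp(W h_j)}
$$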
$$
c = \sum_i \alpha_i h_i
$$

**Visualization:**
```
[Utter1] --α1--> [Context]
[Utter2] --α2--> [Context]
[Utter3] --α3--> [Context]
```
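
The pooling step can be sketched end to end as follows. This is a minimal version built around the two lines shown above; the weighted sum via broadcasting is one possible way to compute $c$, and the tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleAttention(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        # W: maps each utterance vector to one unnormalized score
        self.scalar = nn.Linear(input_dim, 1, bias=False)

    def forward(self, M):
        # M: (seq_len, batch, input_dim)
        scale = self.scalar(M)             # (seq_len, batch, 1)
        alpha = F.softmax(scale, dim=0)    # normalize over the sequence axis
        context = (alpha * M).sum(dim=0)   # (batch, input_dim), c = sum_i alpha_i h_i
        return context, alpha


torch.manual_seed(0)
M = torch.randn(5, 2, 8)                   # 5 utterances, batch of 2, 8 features
context, alpha = SimpleAttention(8)(M)
```

For every dialogue in the batch, the weights in `alpha` sum to 1 across the five utterances.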

**References:**
- [Attention Is All You Need (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762)

### MatchingAttention
Implements three attention variants, each scoring a key $k_j$ against a single query $q$:
- **Dot Product:** $e_j = q^T k_j$
- **General:** $e_j = q^T W k_j$
- **Concat:** $e_j = v^T \text{tanh}(W[q; k_j])$

**Code Reference:** Lines 89-143

**Paper Reference:** Luong et al. (2015)
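
The three score functions listed above can be sketched in one module. `ScoreAttention` is a hypothetical name used only for this illustration; the layer shapes and the `att_type` strings are assumptions and may not match the repository's `MatchingAttention`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScoreAttention(nn.Module):  # hypothetical name, not the repository's class
    def __init__(self, dim, att_type="dot"):
        super().__init__()
        self.att_type = att_type
        if att_type == "general":
            self.W = nn.Linear(dim, dim, bias=False)
        elif att_type == "concat":
            self.W = nn.Linear(2 * dim, dim, bias=False)
            self.v = nn.Linear(dim, 1, bias=False)

    def forward(self, q, K):
        # q: (batch, dim) query; K: (seq_len, batch, dim) keys
        K_b = K.transpose(0, 1)  # (batch, seq_len, dim)
        if self.att_type == "dot":
            e = torch.bmm(K_b, q.unsqueeze(2)).squeeze(2)          # q^T k_j
        elif self.att_type == "general":
            e = torch.bmm(self.W(K_b), q.unsqueeze(2)).squeeze(2)  # q^T W k_j
        else:  # concat
            q_exp = q.unsqueeze(1).expand(-1, K_b.size(1), -1)     # repeat q for every key
            e = self.v(torch.tanh(self.W(torch.cat([q_exp, K_b], dim=2)))).squeeze(2)
        alpha = F.softmax(e, dim=1)                                # (batch, seq_len)
        context = torch.bmm(alpha.unsqueeze(1), K_b).squeeze(1)    # (batch, dim)
        return context, alpha


torch.manual_seed(0)
q, K = torch.randn(2, 8), torch.randn(5, 2, 8)
outputs = {t: ScoreAttention(8, t)(q, K) for t in ("dot", "general", "concat")}
```

All three variants return the same shapes, so they are interchangeable downstream; they differ only in how many parameters the score uses.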

## Recurrent Components

### DialogueRNNCell
```python
self.g_cell = nn.GRUCell(D_m+D_p, D_g)  # Global state
self.p_cell = nn.GRUCell(D_m+D_g, D_p)  # Party state
self.e_cell = nn.GRUCell(D_p, D_e)      # Emotion state
```
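Mathematical Formulation (the second argument is the previous hidden state; $c_t$ is the attention context over earlier global states):
- **Global GRU:** $g_t = \text{GRU}_g([u_t, p_{t-1}^{speaker}],\ g_{t-1})$
- **Party GRU:** $p_t^{speaker} = \text{GRU}_p([u_t, c_t],\ p_{t-1}^{speaker})$
- **Emotion GRU:** $e_t = \text{GRU}_e(p_t^{speaker},\ e_{t-1})$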
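
One time step of these three updates might look like the sketch below. `TinyDialogueCell` is a hypothetical, simplified cell: it updates only the current speaker's state, and it pools the context with a plain learned-score attention (`self.att`), which is an assumption and may differ from the attention used in the DialogueRNN code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyDialogueCell(nn.Module):  # hypothetical, simplified cell
    def __init__(self, D_m, D_g, D_p, D_e):
        super().__init__()
        self.g_cell = nn.GRUCell(D_m + D_p, D_g)  # global state
        self.p_cell = nn.GRUCell(D_m + D_g, D_p)  # party state
        self.e_cell = nn.GRUCell(D_p, D_e)        # emotion state
        self.att = nn.Linear(D_g, 1, bias=False)  # assumed scoring layer for the context

    def forward(self, u, speaker, g_hist, p_prev, e_prev):
        # u: (batch, D_m); speaker: (batch,) index of the current speaker
        # g_hist: (t, batch, D_g); p_prev: (batch, n_parties, D_p); e_prev: (batch, D_e)
        idx = torch.arange(u.size(0))
        p_spk = p_prev[idx, speaker]                               # speaker's previous state
        g = self.g_cell(torch.cat([u, p_spk], dim=1), g_hist[-1])  # global update
        alpha = F.softmax(self.att(g_hist), dim=0)                 # weights over past global states
        c = (alpha * g_hist).sum(dim=0)                            # context c_t, (batch, D_g)
        p_new = p_prev.clone()
        p_new[idx, speaker] = self.p_cell(torch.cat([u, c], dim=1), p_spk)
        e = self.e_cell(p_new[idx, speaker], e_prev)               # emotion update
        return g, p_new, e


torch.manual_seed(0)
B, n_parties, D_m, D_g, D_p, D_e = 2, 2, 6, 5, 4, 3
cell = TinyDialogueCell(D_m, D_g, D_p, D_e)
g_hist = torch.zeros(1, B, D_g)                 # initial global state
p0, e0 = torch.zeros(B, n_parties, D_p), torch.zeros(B, D_e)
speaker = torch.tensor([0, 1])                  # who speaks in each dialogue
g, p, e = cell(torch.randn(B, D_m), speaker, g_hist, p0, e0)
```

The input sizes of the three `nn.GRUCell` layers follow directly from the concatenations in the formulation: `D_m + D_p`, `D_m + D_g`, and `D_p`.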

**Reference:** [DialogueRNN Paper](https://arxiv.org/abs/1811.00405)

## Graph Neural Network Components

### RGCN Layer
```python
self.conv1 = RGCNConv(num_features, hidden_size, num_relations)
```
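Message Passing (with the self-connection term $W_0$ from the R-GCN paper; $N_r(i)$ is the set of neighbors of node $i$ under relation $r$, and $c_{i,r}$ is a normalization constant such as $|N_r(i)|$):
$$
h_i^{(l+1)} = \sigma \left( \sum_{r \in R} \sum_{j \in N_r(i)} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)} \right)
$$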

**Edge Type Handling:**
- $2 \times n_{speakers}^2$ edge types (forward/backward × speaker pairs)
- Basis decomposition for parameter efficiency
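
To make the message-passing rule concrete, here is a naive relational layer written in plain PyTorch. `NaiveRGCNLayer` is a hypothetical class for illustration, not PyTorch Geometric's `RGCNConv`: it uses one full weight matrix per relation (no basis decomposition), assumes $c_{i,r} = |N_r(i)|$, uses ReLU for $\sigma$, and adds the self-connection term from the R-GCN paper.

```python
import torch
import torch.nn as nn


class NaiveRGCNLayer(nn.Module):  # hypothetical; illustration only
    def __init__(self, in_dim, out_dim, num_relations):
        super().__init__()
        # One weight matrix W_r per relation, plus a self-connection weight.
        self.rel_weight = nn.Parameter(torch.randn(num_relations, in_dim, out_dim) * 0.1)
        self.self_weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, edge_index, edge_type):
        # h: (num_nodes, in_dim); edge_index: (2, num_edges) as [source j, target i]
        # edge_type: (num_edges,) relation id of each edge
        src, dst = edge_index
        out = self.self_weight(h)
        for r in range(self.rel_weight.size(0)):
            sel = edge_type == r
            if not sel.any():
                continue
            msg = h[src[sel]] @ self.rel_weight[r]  # W_r h_j for every edge of type r
            summed = torch.zeros_like(out).index_add_(0, dst[sel], msg)
            count = torch.bincount(dst[sel], minlength=h.size(0)).float()
            out = out + summed / count.clamp(min=1).unsqueeze(1)  # divide by |N_r(i)|
        return torch.relu(out)


h = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
edge_index = torch.tensor([[0, 1, 2], [2, 2, 0]])  # edges 0->2, 1->2, 2->0
edge_type = torch.tensor([0, 0, 1])
layer = NaiveRGCNLayer(2, 2, num_relations=2)
out = layer(h, edge_index, edge_type)
```

The per-relation loop is easy to read but slow; it also shows why basis decomposition helps, since the parameter count here grows linearly with the number of relations.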

**Paper:** [R-GCN (Schlichtkrull et al., 2018)](https://arxiv.org/abs/1703.06103)

### Graph Construction
```python
def batch_graphify(features, qmask, lengths, ...):
    # Creates edges based on:
    # 1. Temporal window
    # 2. Speaker information
```
**Edge Formation Rules:**
- Temporal edges within $[t-w_{past}, t+w_{future}]$
- Speaker-aware edges (different types for same/different speakers)
- Direction-aware edges (forward/backward in dialogue)
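
The rules above can be sketched for a single dialogue in plain Python. `build_edges` is a hypothetical helper, not the repository's `batch_graphify`; in particular the integer encoding of edge types is an assumption chosen so that there are exactly $2 \times n_{speakers}^2$ possible types.

```python
def build_edges(speakers, window_past, window_future, n_speakers):
    """Return (edges, edge_types) for one dialogue.

    speakers: list of speaker ids, one per utterance.
    Edge type encoding (an assumption for this sketch):
        type = 2 * (speaker_i * n_speakers + speaker_j) + direction
    where direction is 0 if utterance j is at or before i, and 1 if j comes later.
    """
    edges, edge_types = [], []
    n = len(speakers)
    for i in range(n):
        lo = max(0, i - window_past)          # clip the window at the dialogue start
        hi = min(n - 1, i + window_future)    # clip the window at the dialogue end
        for j in range(lo, hi + 1):
            direction = 0 if j <= i else 1
            pair = speakers[i] * n_speakers + speakers[j]
            edges.append((i, j))
            edge_types.append(2 * pair + direction)
    return edges, edge_types


edges, edge_types = build_edges([0, 1, 0, 1], window_past=1, window_future=1, n_speakers=2)
print(len(edges))  # 10
```

With four utterances and a window of one on each side, the two end utterances each get 2 edges and the two middle ones 3 each, hence 10 edges in total.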

## Feature Extraction

### CNN Feature Extractor
```python
self.convs = nn.ModuleList([
    nn.Conv1d(embedding_dim, filters, K)
    for K in kernel_sizes
])
```
**Operation Pipeline:**
1. Embedding lookup
2. Multi-width 1D convolutions (3,4,5 grams)
3. Max-pooling over time
4. Feature projection
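
The four steps above can be sketched as a small module. The `embedding` and `fc` layers, the ReLU activations, and all sizes below are assumptions made for illustration; only the `convs` list comes from the snippet above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CNNFeatureExtractor(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_size, filters, kernel_sizes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.convs = nn.ModuleList([
            nn.Conv1d(embedding_dim, filters, K)
            for K in kernel_sizes
        ])
        self.fc = nn.Linear(len(kernel_sizes) * filters, output_size)

    def forward(self, tokens):
        # tokens: (batch, num_words) integer word ids
        emb = self.embedding(tokens).transpose(1, 2)         # 1. (batch, embedding_dim, num_words)
        conved = [F.relu(conv(emb)) for conv in self.convs]  # 2. (batch, filters, num_words - K + 1)
        pooled = [c.max(dim=2).values for c in conved]       # 3. max over time -> (batch, filters)
        return F.relu(self.fc(torch.cat(pooled, dim=1)))     # 4. (batch, output_size)


model = CNNFeatureExtractor(vocab_size=100, embedding_dim=16, output_size=10,
                            filters=8, kernel_sizes=(3, 4, 5))
features = model(torch.randint(0, 100, (2, 12)))  # 2 utterances, 12 tokens each
```

Max-pooling over time is what makes the output size independent of utterance length, as long as each utterance has at least as many tokens as the largest kernel.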

**Reference:** Kim (2014) - CNN for Text

## Key Papers & References

### DialogueGCN:
- [Paper](https://arxiv.org/abs/1908.11540)
- [Code](https://github.com/declare-lab/conv-emotion)

### Graph Networks:
- Kipf & Welling (2017) - GCN
- Schlichtkrull et al. (2018) - RGCN

### Attention:
- Vaswani et al. (2017) - Transformers
- Luong et al. (2015) - Global Attention

### Dialogue Systems:
- [DialogueRNN](https://arxiv.org/abs/1811.00405)
- ICON

