# Transformer Interview Notes (with Working & Mathematical Expressions)

## 1. Transformer Architecture Overview
A Transformer consists of an **Encoder-Decoder structure** where:
- Encoder: Processes the input sequence into contextual embeddings.
- Decoder: Uses encoder output and previously generated tokens to predict the next token.

Both stacks use **self-attention**, not recurrence, to relate every token in a sequence to every other token.

---

## 2. Self-Attention Formula
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

Where:
- Q = Query matrix
- K = Key matrix
- V = Value matrix
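- d_k = dimension of each key (and query) vector; dividing by \(\sqrt{d_k}\) keeps large dot products from saturating the softmax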
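
The attention formula can be sketched in a few lines of NumPy. This is a minimal, unbatched illustration; the function names, shapes, and random inputs are assumptions chosen for the example, not part of the notes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row max avoids overflow and does not change the result.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Illustrative sizes: 4 tokens, d_k = 8, value dimension 16.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
out, weights = attention(Q, K, V)
print(out.shape)  # (4, 16)
```

Each output row is a weighted average of the rows of V, with weights given by how strongly that query matches each key.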

---

## 3. Query, Key, and Value Computation
\[ Q = XW_Q, \quad K = XW_K, \quad V = XW_V \]

Each token embedding (a row of \(X\)) is projected into a query, a key, and a value vector by the trainable weight matrices \(W_Q\), \(W_K\), and \(W_V\).
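
As a small sketch of these projections, assuming a row-per-token input `X` and randomly initialized weights standing in for trained ones (all sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8            # illustrative sizes, not from the notes

X = rng.normal(size=(n, d_model))     # one row per token embedding
W_Q = rng.normal(size=(d_model, d_k)) # random stand-ins for trained weights
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# One matrix product projects every token at once.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)  # (5, 8) (5, 8) (5, 8)
```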

---

## 4. Multi-Head Attention
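\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W_O \]
\[ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) \]

Each head attends in its own learned subspace, so different heads can capture different relationships; \(W_O\) mixes the concatenated heads back into one representation.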
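
A minimal NumPy sketch of multi-head self-attention follows. It assumes the common layout where the per-head matrices \(W_i^Q, W_i^K, W_i^V\) are column slices of one full-width matrix each, and that `d_model` is divisible by `h`; names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h):
    n, d_model = X.shape
    d_head = d_model // h  # assumes d_model is divisible by h

    def split_heads(M):
        # (n, d_model) -> (h, n, d_head): head i gets columns i*d_head:(i+1)*d_head
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (h, n, n)
    heads = softmax(scores) @ V                             # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # Concat(head_1..head_h)
    return concat @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (6, 16)
```

Projecting once and reshaping gives the same result as looping over heads with separate small matrices, but uses a single matrix product per projection.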

---

## 5. Feed Forward Network (FFN)
\[ \text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2 \]

The same two-layer network is applied to every position independently, adding non-linearity to each token representation.
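
The FFN formula translates directly to NumPy. In this sketch the hidden width `d_ff` and the random weights are assumptions made for illustration.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                      # illustrative sizes
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(3, d_model))          # 3 token representations
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (3, 8)
```

Because the weights are shared across rows, running the function on one token alone gives the same result as that token's row in the batched call.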

---

## 6. Positional Encoding
Since Transformers lack recurrence, positional encoding injects order information:

\[
\begin{aligned}
PE_{(pos, 2i)} &= \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \\
PE_{(pos, 2i+1)} &= \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
\end{aligned}
\]

The encoding is added element-wise to the token embeddings before the first layer.
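
The sinusoidal encoding can be computed for all positions at once. This sketch assumes an even `d_model`; the function name is hypothetical.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    # Assumes d_model is even so sine and cosine columns pair up.
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions 2i + 1
    return pe

pe = sinusoidal_positional_encoding(50, 16)
print(pe.shape)  # (50, 16)
```

Position 0 encodes to alternating 0 and 1 (sin 0 and cos 0), and low-index dimensions oscillate fastest.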

---

## 7. Encoder Layer Computation
\[
\begin{aligned}
Z' &= \text{LayerNorm}(X + \text{MultiHeadAttention}(X)) \\
Z &= \text{LayerNorm}(Z' + \text{FFN}(Z'))
\end{aligned}
\]

Each sub-layer is wrapped in a residual connection followed by layer normalization (post-norm), which stabilizes training of deep stacks.
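
The two equations can be sketched as a function that takes the sub-layers as arguments. To stay short, this example assumes toy stand-ins: a single-head attention with identity projections in place of multi-head attention, a bias-free FFN, and a layer norm without the learned scale and shift.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Learned gamma and beta omitted for brevity (gamma = 1, beta = 0).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(X, self_attention, ffn):
    Z_prime = layer_norm(X + self_attention(X))    # residual + norm around attention
    return layer_norm(Z_prime + ffn(Z_prime))      # residual + norm around FFN

rng = np.random.default_rng(0)
n, d_model, d_ff = 4, 8, 16
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))

def toy_attention(X):
    # Stand-in for MultiHeadAttention: one head, identity projections.
    return softmax(X @ X.T / np.sqrt(X.shape[-1])) @ X

def toy_ffn(x):
    return np.maximum(0.0, x @ W1) @ W2

X = rng.normal(size=(n, d_model))
Z = encoder_layer(X, toy_attention, toy_ffn)
print(Z.shape)  # (4, 8)
```

The output has the same shape as the input, which is what lets encoder layers be stacked.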

---

## 8. Masking
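The decoder uses **causal masking**, adding the following mask to the attention scores before the softmax:
\[
\text{Mask}(i, j) =
\begin{cases}
0, & j \le i \\
-\infty, & j > i
\end{cases}
\]

A score of \(-\infty\) becomes a weight of 0 after the softmax, so position \(i\) never attends to a future position \(j > i\).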
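
A short NumPy sketch of the mask in use, assuming it is added to the raw attention scores before the softmax (random scores stand in for \(QK^T / \sqrt{d_k}\)):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_mask(n):
    # 0 on and below the diagonal (j <= i), -inf strictly above it (j > i).
    return np.triu(np.full((n, n), -np.inf), k=1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))             # stand-in for scaled QK^T
weights = softmax(scores + causal_mask(4))
print(np.round(weights, 2))                  # upper triangle is all zeros
```

The first row puts all its weight on position 0, since the first token has nothing earlier to attend to.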

---

## 9. Cross-Attention (Decoder)
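\[
\text{Attention}(Q_{dec}, K_{enc}, V_{enc}) = \text{softmax}\left(\frac{Q_{dec}K_{enc}^T}{\sqrt{d_k}}\right)V_{enc}
\]

Queries come from the decoder states; keys and values come from the encoder output.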
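
A minimal sketch of cross-attention, assuming the same kind of learned projections as in section 3 applied to decoder states (for queries) and encoder output (for keys and values). Names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def cross_attention(dec_states, enc_output, W_Q, W_K, W_V):
    Q_dec = dec_states @ W_Q                  # (m, d_k) from the decoder
    K_enc = enc_output @ W_K                  # (n, d_k) from the encoder
    V_enc = enc_output @ W_V                  # (n, d_k) from the encoder
    scores = Q_dec @ K_enc.T / np.sqrt(K_enc.shape[-1])   # (m, n)
    return softmax(scores) @ V_enc

rng = np.random.default_rng(0)
m, n, d_model, d_k = 3, 7, 16, 8              # 3 target tokens, 7 source tokens
dec_states = rng.normal(size=(m, d_model))
enc_output = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = cross_attention(dec_states, enc_output, W_Q, W_K, W_V)
print(out.shape)  # (3, 8)
```

The score matrix is m × n rather than square, because target and source lengths can differ.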

---

## 10. Output Probability
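\[ P(y_t \mid y_{<t}, X) = \text{softmax}(W_o h_t) \]

Here \(h_t\) is the final decoder hidden state at step \(t\), and \(W_o\) projects it to one logit per vocabulary entry (it is not the attention output projection \(W_O\) from section 4).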
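
As an illustration, the snippet below turns one hidden state into a distribution over a toy vocabulary. The sizes, the random weights, and the greedy choice of next token are assumptions for the example; sampling strategies vary in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 10                   # toy sizes

h_t = rng.normal(size=d_model)                # hidden state at step t
W_o = rng.normal(size=(vocab_size, d_model))  # random stand-in for trained weights

logits = W_o @ h_t                            # one score per vocabulary entry
shifted = logits - logits.max()               # for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum()

next_token = int(np.argmax(probs))            # greedy decoding, one possible choice
print(probs.shape)  # (10,)
```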

---

## 11. Loss Function (Cross-Entropy)
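\[
\mathcal{L} = -\sum_t \sum_{v} y_{t,v} \log \hat{y}_{t,v}
\]

Here \(y_t\) is the one-hot target at step \(t\) and \(\hat{y}_t\) is the predicted distribution over the vocabulary, so only the log-probability of the correct token contributes.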
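
A small worked example with one-hot targets, where the loss reduces to the negative log-probability assigned to each correct token. The probabilities are made-up values, and the `eps` guard against `log(0)` is an addition for the sketch.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: (T, V) one-hot targets; y_pred: (T, V) predicted probabilities.
    return -np.sum(y_true * np.log(y_pred + eps))

# Two time steps over a 3-word vocabulary (made-up numbers).
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])

loss = cross_entropy(y_true, y_pred)
print(loss)  # equals -(log 0.7 + log 0.8)
```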

---

## 12. Layer Normalization
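\[
\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta
\]

\(\mu\) and \(\sigma^2\) are the mean and variance over the feature dimension of each token; \(\gamma\) and \(\beta\) are learned scale and shift parameters.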
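
A direct NumPy version of the formula, assuming statistics are taken over the last (feature) axis of each row. The inputs are made up so that the second row is ten times the first.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)    # per-token mean over features
    var = x.var(axis=-1, keepdims=True)    # per-token variance over features
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
gamma, beta = np.ones(4), np.zeros(4)      # identity scale and shift

y = layer_norm(x, gamma, beta)
print(np.round(y, 3))
```

Both rows normalize to nearly the same values, which shows that the operation removes per-token differences in scale.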

---

## 13. Computational Complexity
\[ O(n^2 \cdot d) \]

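Here \(n\) is the sequence length and \(d\) the representation dimension. Self-attention scores every pair of tokens, so the cost per layer grows quadratically with \(n\).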
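
To make the quadratic growth concrete, this sketch counts multiply-accumulate operations for one attention head. It is a rough count that assumes only the two large matrix products matter and ignores the softmax and the Q/K/V projections.

```python
def attention_macs(n, d):
    scores = n * n * d      # Q @ K^T: an (n, d) by (d, n) product
    weighted = n * n * d    # weights @ V: an (n, n) by (n, d) product
    return scores + weighted

for n in (512, 1024, 2048):
    print(n, attention_macs(n, d=64))
```

Doubling the sequence length quadruples the count, while doubling `d` only doubles it.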

---

## 14. Training Objectives
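- **BERT (Masked LM):** predict the tokens at the masked positions \(M\) from the remaining context.
  \[ \mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log P(x_i \mid X_{\setminus M}) \]
- **GPT (Causal LM):** predict each token from the tokens before it.
  \[ \mathcal{L}_{\text{CLM}} = -\sum_t \log P(x_t \mid x_{<t}) \]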
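
The two objectives differ mainly in which positions contribute to the sum. The sketch below assumes the model has already produced a `(T, V)` array of logits; the function names are hypothetical, and the causal version assumes no start-of-sequence token, so the first token is not predicted.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mlm_loss(logits, tokens, masked_positions):
    # logits[i] scores the original token at position i given the corrupted input.
    logp = log_softmax(logits)
    idx = np.asarray(masked_positions)
    return -logp[idx, tokens[idx]].sum()     # sum only over masked positions

def clm_loss(logits, tokens):
    # logits[t] is computed from tokens[:t+1] and scores tokens[t+1].
    logp = log_softmax(logits[:-1])
    targets = tokens[1:]
    return -logp[np.arange(len(targets)), targets].sum()

T, V = 6, 5                                  # toy sequence length and vocabulary
tokens = np.array([0, 3, 1, 4, 2, 2])
uniform_logits = np.zeros((T, V))            # a model that knows nothing

print(mlm_loss(uniform_logits, tokens, [1, 4]))  # 2 * log(V)
print(clm_loss(uniform_logits, tokens))          # (T - 1) * log(V)
```

With uniform predictions, every scored position costs log V, so the two losses differ only in how many positions they count.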

---

## 15. Vision Transformer (ViT)
- Divide image into patches.
- Flatten and linearly project each patch: \( x_i = W_e \cdot Flatten(patch_i) \)
- Add position embeddings (learned, in the original ViT) and feed the sequence to a Transformer Encoder.
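
The three steps above can be sketched in NumPy. This example assumes a channels-last image whose height and width are divisible by the patch size, and it uses random arrays as stand-ins for the trained projection and position embeddings.

```python
import numpy as np

def patchify(image, patch_size):
    H, W, C = image.shape                     # assumes H and W divisible by patch_size
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)       # (H/p, W/p, p, p, C)
    return patches.reshape(-1, p * p * C)            # one flattened patch per row

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))
patch_size, d_model = 8, 64

patches = patchify(image, patch_size)                # (16, 192)
# Row-vector convention: patches @ W_e plays the role of W_e . Flatten(patch_i).
W_e = rng.normal(size=(patch_size * patch_size * 3, d_model))
pos = rng.normal(size=(patches.shape[0], d_model))   # stand-in position embeddings
tokens = patches @ W_e + pos
print(patches.shape, tokens.shape)  # (16, 192) (16, 64)
```

After this step the image is an ordinary sequence of 16 token vectors, so the encoder needs no image-specific changes.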

---

## Summary
| Concept | Key Formula |
|----------|--------------|
| Attention | softmax(QKᵀ / √dₖ) V |
| Multi-Head | Concat(heads)Wₒ |
| FFN | ReLU(xW₁ + b₁)W₂ + b₂ |
| Positional Encoding | sin/cos functions |
| LayerNorm | (x - μ)/√(σ²+ε) * γ + β |

---

**End of Notes – Transformer Working & Formulas**