# 🧪 T-Poti From Scratch: Ultra-Low Precision (2026)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adiel2012/model-size-reduction/blob/main/chronology/tpoti_demo.ipynb)

## 📖 Theory: Binary and Quaternary Frontiers

> **Note**: T-Poti is an illustrative projection of the 2026 research frontier,
> synthesising real techniques from the binary/ternary quantization literature.

T-Poti pushes to **1-bit (binary)** and **2-bit (quaternary)** precision without
retraining -- targeting extreme edge devices (microcontrollers, wearables).

### 1-bit: Optimal Binary Quantization

With only two levels $\{-\alpha, +\alpha\}$, find the scalar $\alpha$ minimising:

$$\min_{\alpha,\,B \in \{-1,+1\}^n} \|W - \alpha B\|_2^2$$

Holding $B = \text{sign}(W)$ fixed, the optimal scale has a closed form:

$$\alpha^* = \frac{\|W\|_1}{n} = \text{mean}(|W|)$$

This result follows by differentiating and setting the derivative to zero.
It is the same closed-form used by XNOR-Net and BitNet.
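
To spell out that step, here is a short derivation using only the definitions above and the fact that $b_i^2 = 1$ for every entry of $B$. Expanding the objective:

$$\|W - \alpha B\|_2^2 = \|W\|_2^2 - 2\alpha \sum_{i=1}^{n} w_i b_i + \alpha^2 n$$

With $B = \text{sign}(W)$ the cross term becomes $\sum_i w_i b_i = \|W\|_1$, and setting the derivative with respect to $\alpha$ to zero gives:

$$-2\|W\|_1 + 2\alpha n = 0 \quad\Rightarrow\quad \alpha^* = \frac{\|W\|_1}{n}$$

Substituting back, the residual error is $\|W\|_2^2 - \|W\|_1^2 / n$. Because $\text{sign}(W)$ is also the choice of $B$ that maximises $\sum_i w_i b_i$ for any $\alpha > 0$, the pair $(\alpha^*, \text{sign}(W))$ is jointly optimal, not just optimal for fixed $B$.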

### 2-bit: Quaternary Levels

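With four evenly spaced levels $\{-1,\,-1/3,\,+1/3,\,+1\}$ (scaled by $\max|W|$),
each weight is encoded in 2 bits. The grid includes the endpoints $\pm 1$, so the
largest-magnitude weight is represented exactly. Note that this is not the
MSE-optimal level set for a uniform distribution over $[-1,1]$; that would be the
bin midpoints $\{\pm 1/4,\,\pm 3/4\}$.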
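
To make the 2-bit encoding concrete, the sketch below assigns each weight the index of its nearest level and packs four indices into one byte. It is plain Python for readability. The helper names (`encode_2bit`, `pack_codes`, `unpack_codes`) and the bit order within a byte are assumptions made for this illustration, not part of any T-Poti specification.

```python
LEVELS = (-1.0, -1 / 3, 1 / 3, 1.0)  # the four quaternary levels

def encode_2bit(weights):
    """Return (codes, scale): codes[i] in {0, 1, 2, 3} indexes LEVELS."""
    scale = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [
        min(range(4), key=lambda k: abs(w / scale - LEVELS[k]))
        for w in weights
    ]
    return codes, scale

def pack_codes(codes):
    """Pack four 2-bit codes per byte, lowest bits first (assumed order)."""
    padded = list(codes) + [0] * (-len(codes) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        c0, c1, c2, c3 = padded[i:i + 4]
        out.append(c0 | (c1 << 2) | (c2 << 4) | (c3 << 6))
    return bytes(out)

def unpack_codes(packed, n):
    """Inverse of pack_codes; n is the original number of weights."""
    codes = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            codes.append((byte >> shift) & 0b11)
    return codes[:n]

weights = [0.7, -0.4, 0.9, -0.1, 0.3, -0.8]
codes, scale = encode_2bit(weights)
packed = pack_codes(codes)
recon = [LEVELS[c] * scale for c in unpack_codes(packed, len(weights))]

print("codes :", codes)
print("bytes :", len(packed), "for", len(weights), "weights")
print("recon :", [round(r, 3) for r in recon])
```

Six weights fit in two bytes here, plus one stored scale per tensor.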

### Error Compensation via Multi-Codebook

At extreme precision the rounding error is large. T-Poti-style methods use
**iterative residual quantization**:
1. Quantize $W$ with binary mask $B_1$: $\hat{W}_1 = \alpha_1 B_1$.
2. Compute residual $E = W - \hat{W}_1$.
3. Quantize $E$ with a second binary mask $B_2$: $\hat{W}_2 = \alpha_2 B_2$.
4. Final representation: $W \approx \alpha_1 B_1 + \alpha_2 B_2$ (2-bit effective).

This multi-codebook strategy is also used in **Residual Vector Quantization (RVQ)**.
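
The four steps above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions: the function names `residual_binary_quantize` and `reconstruct` are hypothetical, each stage reuses the closed-form scale $\alpha = \text{mean}(|\cdot|)$ on the current residual, and zeros are mapped to $+1$.

```python
import torch

def residual_binary_quantize(W, num_stages=2):
    """Greedy residual binarization: W ~ sum_k alpha_k * B_k."""
    residual = W.clone()
    alphas, masks = [], []
    for _ in range(num_stages):
        # Closed-form scale for the current residual
        alpha = residual.abs().mean()
        # Binary mask in {-1, +1}; zeros are mapped to +1
        B = torch.where(residual >= 0,
                        torch.ones_like(residual),
                        -torch.ones_like(residual))
        alphas.append(alpha)
        masks.append(B)
        # What is left over becomes the target of the next stage
        residual = residual - alpha * B
    return alphas, masks

def reconstruct(alphas, masks):
    """Sum the per-stage contributions alpha_k * B_k."""
    return sum(a * B for a, B in zip(alphas, masks))

torch.manual_seed(0)
W = torch.randn(256, 256)
alphas, masks = residual_binary_quantize(W, num_stages=2)

mse_1 = torch.mean((W - reconstruct(alphas[:1], masks[:1])) ** 2).item()
mse_2 = torch.mean((W - reconstruct(alphas, masks)) ** 2).item()
print(f"MSE after stage 1: {mse_1:.4f}")
print(f"MSE after stage 2: {mse_2:.4f}")
```

Each stage lowers the MSE by exactly $\alpha_k^2$, so the error never increases as stages are added. The cost is one extra bit per weight and one extra scale per stage.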

### Information-Theoretic Lower Bound

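For $W \sim \mathcal{N}(0,1)$, the Shannon rate-distortion bound gives the minimum MSE achievable by any code at $b$ bits per weight:

$$\text{MSE}_{\min} = 2^{-2b}$$

Scalar (per-weight) quantizers cannot reach this bound. The optimal scalar quantizer (Lloyd-Max) does somewhat worse; at high resolution its MSE approaches the Panter-Dite approximation $\frac{\pi\sqrt{3}}{2} \cdot 2^{-2b}$, which is not accurate at 1 or 2 bits.

| Bits | Shannon bound ($2^{-2b}$) | Optimal scalar quantizer (Lloyd-Max) |
|---|---|---|
| 1-bit | 0.250 | ~0.363 |
| 2-bit | ~0.063 | ~0.118 |
| 4-bit (NF4 budget) | ~0.004 | ~0.0095 |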
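
The 1-bit figure of about 0.363 in the table can be reproduced directly. For $W \sim \mathcal{N}(0,1)$ the binary quantizer from the first subsection has $\alpha = \mathbb{E}|W| = \sqrt{2/\pi}$, so its MSE is $\mathbb{E}[W^2] - \alpha^2 = 1 - 2/\pi$. The sketch below (plain Python; `binary_mse` is a name chosen here for illustration) compares that closed form with a Monte Carlo estimate.

```python
import math
import random

def binary_mse(weights):
    """MSE of the 1-bit quantizer alpha * sign(w) with alpha = mean(|w|)."""
    n = len(weights)
    alpha = sum(abs(w) for w in weights) / n
    # |w - alpha * sign(w)| equals | |w| - alpha | for every w
    return sum((abs(w) - alpha) ** 2 for w in weights) / n

# Closed form for W ~ N(0, 1): E[W^2] - (E|W|)^2 = 1 - 2/pi
analytic = 1.0 - 2.0 / math.pi

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
empirical = binary_mse(samples)

print(f"analytic  1-bit MSE: {analytic:.4f}")
print(f"empirical 1-bit MSE: {empirical:.4f}")
```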

### Limitations
* Quantization error is very high at 1-2 bits, so these formats are mostly viable for models trained with quantization-aware training (QAT).
* 2-bit PTQ on large pre-trained models degrades accuracy significantly
  without careful calibration and error-compensation techniques.
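
One simple form of calibration is to search for the scale instead of taking $\max|W|$. On roughly Gaussian weights, a few outliers stretch the abs-max grid so far that almost every weight collapses onto the inner levels $\pm\frac{1}{3}\max|W|$. The sketch below tries a grid of smaller, clipping scales and keeps the one with the lowest MSE. It is a generic clipping search, not a documented T-Poti procedure; `quantize_to_levels`, `search_scale`, and the 20 candidate fractions are assumptions for illustration.

```python
import torch

LEVELS = torch.tensor([-1.0, -1 / 3, 1 / 3, 1.0])

def quantize_to_levels(W, scale):
    """Clip W/scale to [-1, 1], snap to the nearest level, rescale."""
    W_norm = (W / scale).clamp(-1.0, 1.0)
    dist = (W_norm.reshape(-1, 1) - LEVELS.view(1, -1)).abs()
    idx = dist.argmin(dim=1)
    return LEVELS[idx].view(W.shape) * scale

def search_scale(W, num_candidates=20):
    """Grid-search the scale as a fraction of max|W| (assumed grid)."""
    abs_max = W.abs().max()
    best_scale, best_mse = abs_max, float("inf")
    for frac in torch.linspace(0.05, 1.0, num_candidates):
        scale = frac * abs_max
        mse = torch.mean((W - quantize_to_levels(W, scale)) ** 2).item()
        if mse < best_mse:
            best_scale, best_mse = scale, mse
    return best_scale, best_mse

torch.manual_seed(0)
W = torch.randn(256, 256)

mse_absmax = torch.mean((W - quantize_to_levels(W, W.abs().max())) ** 2).item()
best_scale, best_mse = search_scale(W)

print(f"abs-max scale : {W.abs().max().item():.3f}  MSE {mse_absmax:.4f}")
print(f"searched scale: {float(best_scale):.3f}  MSE {best_mse:.4f}")
```

The search trades a small clipping error on outliers for a much finer grid on the bulk of the weights. Per-row or per-group scales are a common alternative with the same goal.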

---

In [None]:
import torch

def tpoti_1bit_quantize(W):
    """
    Manual implementation of Optimal Binary Quantization (1-bit).
    Maps W -> alpha * sign(W)
    """
    # 1. Optimal Alpha calculation for a binary mask
    # Formally: alpha = mean(abs(W))
    alpha = torch.mean(torch.abs(W))
    
    # 2. Map to Binary {-1, 1}
    B = torch.sign(W)
    B[B == 0] = 1 # Handle zeros
    
    return B, alpha

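def tpoti_2bit_quantize(W):
    """
    Manual 2-bit (Quaternary) quantization.
    Levels: {-1, -1/3, +1/3, +1}; multiply by the returned abs_max to rescale.
    """
    # Normalize to [-1, 1]; the clamp guards against division by zero
    # when W is all zeros
    abs_max = torch.max(torch.abs(W)).clamp(min=1e-12)
    W_norm = W / abs_max

    # 2-bit levels, created on the same device and dtype as W
    levels = torch.tensor([-1.0, -1 / 3, 1 / 3, 1.0], device=W.device, dtype=W.dtype)

    # Map each value to closest level (reshape also handles non-contiguous W)
    W_flat = W_norm.reshape(-1, 1)
    diff = torch.abs(W_flat - levels.view(1, -1))
    indices = torch.argmin(diff, dim=1)

    W_q = levels[indices].view(W.shape)
    return W_q, abs_max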

# Demonstration
W = torch.randn(1024, 1024)
B, alpha = tpoti_1bit_quantize(W)
W_target_2b, scale_2b = tpoti_2bit_quantize(W)

print(f"--- 1-bit Result ---")
print(f"Binary Mask Sample: {B[0, :5].tolist()}")
print(f"Optimal Alpha: {alpha:.4f}")
print(f"1-bit MSE: {torch.mean((W - B*alpha)**2):.6f}")

print(f"\n--- 2-bit Result ---")
print(f"Quaternary Sample: {W_target_2b[0, :5].tolist()}")
print(f"2-bit MSE: {torch.mean((W - W_target_2b*scale_2b)**2):.6f}")

## 🔢 Worked Example with Numbers

To make the formulas concrete, let's trace through the math with a tiny, hand-traceable example.

In [None]:
# Tiny example: 1-bit and 2-bit quantization on a small vector
import torch

w = torch.tensor([0.7, -0.4, 0.9, -0.1, 0.3, -0.8])
print(f"Original weights: {[round(v,2) for v in w.tolist()]}")

# ── 1-bit ──────────────────────────────────────────────────
print("\n=== 1-bit  (Binary: {-1, +1}) ===")
alpha = w.abs().mean()
B     = w.sign()
B[B == 0] = 1                         # no exact zeros allowed
recon_1b = B * alpha

print(f"Optimal alpha = mean|w| = {alpha:.4f}")
print(f"  {'orig':>7}  {'sign':>6}  {'recon':>8}  {'err':>8}")
for orig, b, r in zip(w.tolist(), B.tolist(), recon_1b.tolist()):
    print(f"  {orig:+7.4f}  {b:+6.0f}  {r:+8.4f}  {abs(orig-r):8.4f}")
print(f"1-bit MSE: {((w - recon_1b)**2).mean():.6f}")

# ── 2-bit ──────────────────────────────────────────────────
print("\n=== 2-bit  (Quaternary: {-1, -1/3, +1/3, +1}) ===")
levels  = torch.tensor([-1.0, -1 / 3, 1 / 3, 1.0])
abs_max = w.abs().max()
w_norm  = w / abs_max
diff    = (w_norm.view(-1, 1) - levels.view(1, -1)).abs()
indices = diff.argmin(dim=1)
w_q2    = levels[indices]
recon_2b = w_q2 * abs_max

print(f"Scale (abs_max) = {abs_max:.4f}")
print(f"  {'orig':>7}  {'norm':>7}  {'level':>7}  {'recon':>8}  {'err':>8}")
for orig, nrm, lv, r in zip(w.tolist(), w_norm.tolist(), w_q2.tolist(), recon_2b.tolist()):
    print(f"  {orig:+7.4f}  {nrm:+7.4f}  {lv:+7.2f}  {r:+8.4f}  {abs(orig-r):8.4f}")
print(f"2-bit MSE: {((w - recon_2b)**2).mean():.6f}")

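# Note: this ordering holds for this vector; with abs-max scaling it is not guaranteed in general
print(f"\nSummary: 2-bit MSE ({((w-recon_2b)**2).mean():.4f}) < 1-bit MSE ({((w-recon_1b)**2).mean():.4f})  for this vector")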


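## 🧪 GPT-2 Evaluation

Apply the method to all 2D weight matrices of GPT-2 and compare perplexity before and after quantization. Here "all 2D weight matrices" includes the token and position embeddings, and perplexity is measured on a single short sentence, so treat the numbers as a smoke test rather than a benchmark.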

In [None]:
import copy

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_id = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id).eval()

text = "The quick brown fox jumps over the lazy dog. Transformers are powerful sequence models."
inputs = tokenizer(text, return_tensors="pt")

def perplexity(mdl, inputs):
    with torch.no_grad():
        loss = mdl(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

baseline_ppl = perplexity(model, inputs)
print(f"Baseline GPT-2 Perplexity:       {baseline_ppl:.2f}")

# 1-bit
model_1b = copy.deepcopy(model)
for name, param in model_1b.named_parameters():
    if param.dim() == 2:
        B, alpha = tpoti_1bit_quantize(param.data)
        param.data = B.float() * alpha
ppl_1b = perplexity(model_1b, inputs)
print(f"T-Poti 1-bit GPT-2 Perplexity:   {ppl_1b:.2f}")
print(f"Delta:                           {ppl_1b - baseline_ppl:+.2f}")

# 2-bit
model_2b = copy.deepcopy(model)
for name, param in model_2b.named_parameters():
    if param.dim() == 2:
        W_q, scale = tpoti_2bit_quantize(param.data)
        param.data = W_q * scale
ppl_2b = perplexity(model_2b, inputs)
print(f"T-Poti 2-bit GPT-2 Perplexity:   {ppl_2b:.2f}")
print(f"Delta:                           {ppl_2b - baseline_ppl:+.2f}")
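
The loops above quantize every 2D parameter, embeddings included. As a hedged variation, the sketch below applies a quantizer only to matrices whose names do not match a skip list, and tests it on a toy module so no model download is needed. The helper `quantize_matrices_`, the `ToyLM` class, and the decision to skip `wte`/`wpe` (the names of GPT-2's token and position embeddings in `transformers`) are assumptions for illustration, not part of the original experiment.

```python
import torch
import torch.nn as nn

def binarize(W):
    """1-bit quantize-dequantize: alpha * sign(W), zeros mapped to +1."""
    alpha = W.abs().mean()
    B = torch.where(W >= 0, torch.ones_like(W), -torch.ones_like(W))
    return alpha * B

@torch.no_grad()
def quantize_matrices_(model, quantize_fn, skip=("wte", "wpe")):
    """In-place: replace each 2D parameter not matching `skip`."""
    touched = []
    for name, param in model.named_parameters():
        if param.dim() != 2 or any(s in name for s in skip):
            continue
        param.copy_(quantize_fn(param))
        touched.append(name)
    return touched

class ToyLM(nn.Module):
    """Tiny stand-in with an embedding named like GPT-2's `wte`."""
    def __init__(self, vocab=32, dim=8):
        super().__init__()
        self.wte = nn.Embedding(vocab, dim)
        self.fc = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, vocab, bias=False)

    def forward(self, ids):
        return self.head(torch.relu(self.fc(self.wte(ids))))

torch.manual_seed(0)
toy = ToyLM()
wte_before = toy.wte.weight.clone()
touched = quantize_matrices_(toy, binarize)

print("quantized:", touched)
print("embedding unchanged:", torch.equal(toy.wte.weight, wte_before))
```

On GPT-2 the same call would be `quantize_matrices_(model_1b, binarize)`. Since the output head shares its weight with `wte` there, skipping `wte` also leaves the head in full precision.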


## 📚 References

1. **Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., & Bengio, Y.** (2016).  
   *Binarized Neural Networks.* NeurIPS 2016.  
   [arXiv:1602.02830](https://arxiv.org/abs/1602.02830)

2. **Rastegari, M., Ordonez, V., Redmon, J., & Farhadi, A.** (2016).  
   *XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.* ECCV 2016.  
   [arXiv:1603.05279](https://arxiv.org/abs/1603.05279)

3. **Ma, S., Wang, H., Ma, L., et al.** (2024).  
   *The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits.*  
   [arXiv:2402.17764](https://arxiv.org/abs/2402.17764)

4. **Zeghidour, N., Luebs, A., Omran, A., Skerry-Ryan, R., & Tagliasacchi, M.** (2021).  
   *SoundStream: An End-to-End Neural Audio Codec.* (Residual Vector Quantization)  
   [arXiv:2107.03312](https://arxiv.org/abs/2107.03312)
