# 🧠 Q-LoRA: Precision Used in Training & Inference

This note explains:

- Whether Q-LoRA uses **4-bit** weights during inference  
- What precision LoRA adapters use  
- Why adapters are kept in higher precision  
- How inference is actually computed  

---

## ✅ 1. Does Q-LoRA use the 4-bit model during inference?

### ✔️ Yes: the base model stays in **4-bit (NF4)** even during inference.

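Q-LoRA quantizes the *pretrained* model weights to 4-bit and **keeps them frozen**:

`W_base_4bit (frozen)`

These 4-bit weights are used both:

- during training  
- during inference  

unless you explicitly merge the LoRA weights. Note that 4-bit is the *storage* format: each block of weights is dequantized to the compute dtype (FP16/BF16) just before it is used in a matrix multiply.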
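To make "stored in 4-bit, dequantized on use" concrete, here is a minimal pure-Python sketch of block-wise absmax quantization. It is an illustration under stated assumptions, not the bitsandbytes implementation: the 16-level `CODEBOOK` is an evenly spaced stand-in (real NF4 uses 16 non-uniform levels), and the function names and the example block are hypothetical.

```python
# Illustrative stand-in for NF4: 16 evenly spaced levels in [-1, 1].
# Real NF4 uses 16 non-uniform levels; this grid is an assumption for clarity.
CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]


def quantize_block(weights):
    """Return (4-bit indices, scale) for one block of float weights."""
    scale = max(abs(w) for w in weights) or 1.0  # absmax scaling constant
    indices = []
    for w in weights:
        normalized = w / scale  # now in [-1, 1]
        # Pick the nearest codebook level; the index fits in 4 bits (0..15).
        idx = min(range(16), key=lambda i: abs(CODEBOOK[i] - normalized))
        indices.append(idx)
    return indices, scale


def dequantize_block(indices, scale):
    """Recover approximate float weights from indices and the block scale."""
    return [CODEBOOK[i] * scale for i in indices]


# Hypothetical block of 8 frozen base weights.
block = [0.12, -0.40, 0.05, 0.33, -0.08, 0.27, -0.19, 0.40]
indices, scale = quantize_block(block)
restored = dequantize_block(indices, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))
```

Only `indices` (4 bits each) and one `scale` per block need to be stored; `restored` is what the matrix multiply actually sees, and `max_error` is bounded by half the spacing between adjacent levels.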

---

## ✅ 2. Are LoRA adapters also 4-bit?

### ❌ No: LoRA adapters stay in **higher precision** (typically FP16 or BF16).

LoRA consists of two small matrices:

- **A** (r × d)  
- **B** (d × r)  

These matrices are the only trained parameters, so they are kept in **FP16/BF16**: gradient updates are small, and a 4-bit grid is too coarse to represent them.

### Summary:

| Component | Precision | Why? |
|----------|-----------|------|
| Base model weights | **4-bit NF4** | Memory savings |
| LoRA adapters (A, B) | **FP16/BF16** | Accuracy needed |
| Gradients (adapters only) | FP16/BF16 | Only A and B are trained; the frozen base weights get no updates |
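The "accuracy needed" entry can be illustrated with a toy experiment: small update steps accumulate in a higher-precision value but are rounded away on a 4-bit grid. This is a hedged sketch, not a training loop. Python floats stand in for FP16/BF16, the evenly spaced `LEVELS` grid stands in for a real 4-bit codebook, and the step size and starting value are hypothetical.

```python
# Stand-in 4-bit grid: 16 evenly spaced levels in [-1, 1] (assumption).
LEVELS = [-1.0 + 2.0 * i / 15 for i in range(16)]


def round_to_4bit(value):
    """Snap a value in [-1, 1] to the nearest of the 16 levels."""
    return min(LEVELS, key=lambda level: abs(level - value))


update_step = 1e-3     # hypothetical size of one gradient update
adapter_value = 0.2    # hypothetical adapter entry

# Higher-precision storage accumulates the small steps.
high_precision = adapter_value
for _ in range(10):
    high_precision += update_step

# A 4-bit store re-rounds after every step, so each step is discarded.
four_bit = round_to_4bit(adapter_value)
for _ in range(10):
    four_bit = round_to_4bit(four_bit + update_step)
```

After ten steps `high_precision` has moved by about 0.01, while `four_bit` is still on the level it started from, because each step is far smaller than the spacing between levels.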

---

## 🧮 3. How LoRA is applied to the 4-bit base model

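The LoRA update is:

\[
\Delta W = \frac{\alpha}{r} B A
\]

Inference then behaves as if it used:

\[
W_{\text{effective}} = \operatorname{dequant}(W_{4bit}) + \Delta W
\]

This is applied **on the fly** at runtime. In practice \(\Delta W\) is usually not materialized; the layer computes the base path and the low-rank path separately and adds the outputs:

\[
y = \operatorname{dequant}(W_{4bit})\, x + \frac{\alpha}{r} B (A x)
\]

The stored model stays quantized; only the adapter update is high precision.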
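The on-the-fly application can be sketched in a few lines of pure Python. This is a toy example under stated assumptions: `w_dequant` stands for the base weights after dequantization, the shapes and values are hypothetical, and `lora_forward` is an illustrative name rather than a library function. It checks that adding the low-rank path to the base output matches multiplying by an explicit effective weight.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def lora_forward(w_dequant, a, b, x, alpha, r):
    """Base path plus scaled low-rank path, without forming delta W."""
    base_out = matvec(w_dequant, x)
    low_rank = matvec(b, matvec(a, x))  # B (A x): two thin matvecs
    scaling = alpha / r
    return [y + scaling * d for y, d in zip(base_out, low_rank)]


# Hypothetical tiny layer: d_out = 2, d_in = 3, rank r = 1.
w_dequant = [[0.5, -0.25, 0.0], [0.1, 0.2, -0.3]]  # dequantized base weights
a = [[1.0, 0.0, 2.0]]   # A: r x d_in
b = [[0.5], [-1.0]]     # B: d_out x r
x = [1.0, 2.0, 3.0]
alpha, r = 2.0, 1

y = lora_forward(w_dequant, a, b, x, alpha, r)

# Reference: build the effective weight explicitly and multiply once.
delta_w = [[(alpha / r) * v for v in row] for row in matmul(b, a)]
w_eff = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(w_dequant, delta_w)]
y_reference = matvec(w_eff, x)
```

Both paths give the same output up to floating-point rounding. The unmerged form is what keeps the base weights untouched in their quantized storage.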

---

## 🔧 4. Two Inference Modes in Q-LoRA

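### **A) Default Q-LoRA Inference (recommended)**  
- Base model: **4-bit**  
- LoRA adapters: **FP16/BF16**  
- Applied dynamically during forward pass

Memory-efficient and the most common setup. Dequantizing weights on the fly adds some compute overhead, so it is not necessarily faster than running an FP16 model.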

---

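### **B) Merged Inference (optional)**  
You can merge:

\[
W_{\text{merged}} = \operatorname{dequant}(W_{4bit}) + \Delta W
\]

The base weights are dequantized first, so this produces a merged **FP16/BF16** model.

- No adapters needed afterward  
- Larger memory footprint (the base is no longer stored in 4-bit)  
- Convenient for deployment when memory allows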
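A small sketch shows why the merged result is stored in higher precision rather than in 4-bit. It is illustrative only: the evenly spaced `LEVELS` grid stands in for a real 4-bit codebook, and the stored codes, block scale, and update row are hypothetical values.

```python
# Stand-in 4-bit grid: 16 evenly spaced levels in [-1, 1] (assumption).
LEVELS = [-1.0 + 2.0 * i / 15 for i in range(16)]


def dequantize(indices, scale):
    """Map stored 4-bit codes back to float weights."""
    return [LEVELS[i] * scale for i in indices]


def merge(indices, scale, delta_row):
    """Dequantize the base row, then add the LoRA update in float."""
    base = dequantize(indices, scale)
    return [w + d for w, d in zip(base, delta_row)]


indices = [0, 7, 8, 15]                     # hypothetical stored 4-bit codes
scale = 0.4                                 # hypothetical block scale
delta_row = [0.003, -0.002, 0.001, 0.004]   # hypothetical (alpha / r) * B A row

merged = merge(indices, scale, delta_row)

# Merged values no longer sit on the 4-bit grid for this block.
grid = [level * scale for level in LEVELS]
off_grid = [all(abs(m - g) > 1e-9 for g in grid) for m in merged]
```

Every entry of `merged` falls between grid levels, so writing it back to 4-bit would round the adapter's contribution away. Keeping the merged weights in FP16/BF16 preserves it, at the cost of the larger footprint.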

---

## 📊 5. Memory Summary

| Stage | Base Model | LoRA | Relative Memory |
|-------|------------|------|-----------------|
| Training | 4-bit | FP16/BF16 | **Low for training** (gradients and optimizer state for adapters only) |
| Inference (default) | 4-bit | FP16/BF16 | **Lowest** |
| Inference (merged) | FP16/BF16 merged | none | **Highest** (about 4× the weight memory of the 4-bit base) |
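A back-of-the-envelope calculation puts rough numbers on the table. This sketch counts weight storage only and ignores quantization constants, activations, the KV cache, and optimizer state. The 7-billion-parameter count and the adapter setup (hidden size, rank, layer count, number of targeted projections) are hypothetical values chosen for illustration.

```python
def weight_gib(num_params, bits_per_param):
    """Weight storage in GiB for a given parameter count and bit width."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 7_000_000_000  # illustrative "7B" model

base_4bit = weight_gib(num_params, 4)    # quantized base
base_fp16 = weight_gib(num_params, 16)   # merged FP16 model

# Hypothetical adapter setup: rank 16 on 4 square projections per layer.
hidden, rank, layers, targets = 4096, 16, 32, 4
adapter_params = layers * targets * rank * (hidden + hidden)  # A and B
adapter_fp16 = weight_gib(adapter_params, 16)

print(f"4-bit base:    {base_4bit:.2f} GiB")
print(f"FP16 merged:   {base_fp16:.2f} GiB")
print(f"FP16 adapters: {adapter_fp16 * 1024:.0f} MiB")
```

Under these assumptions the adapters are a rounding error next to the base weights, which is why default inference costs about as much memory as the 4-bit base alone, while the merged model costs four times as much for weights.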

---

## 🧠 Final Takeaways

- ✔️ Base model stays **4-bit** during inference  
- ❌ LoRA adapters are **not quantized**; they stay in FP16/BF16  
- ✔️ Inference combines 4-bit base + FP16 LoRA adapters  
- ✔️ Optionally, weights can be merged afterward  

