### **mAP**

#### **Sort by confidence**
For a given class, collect **all predictions across the dataset** and **sort them by confidence (desc)**. Then you sweep down that list. This sweep is equivalent to trying **all possible confidence thresholds**: at each step you’ve “included” everything at or above that confidence.

> So AP already **integrates over all confidence thresholds**—you don’t pick one.
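To make the equivalence concrete, here is a minimal sketch on made-up data. The scores, the TP/FP labels, and the helper name `pr_at_threshold` are illustrative assumptions. The labels are simply taken as given, and the scores are assumed to have no ties. It checks that row *k* of the sweep matches the precision/recall you would get by thresholding at the *k*-th prediction's confidence.

```python
# Hypothetical toy data for one class: (confidence, is_true_positive).
preds = [(0.95, True), (0.90, True), (0.70, False), (0.60, True), (0.30, False)]
n_gt = 3  # number of ground-truth boxes for this class (assumed)

def pr_at_threshold(preds, n_gt, t):
    """Precision/recall when keeping only predictions with confidence >= t."""
    kept = [is_tp for conf, is_tp in preds if conf >= t]
    tp = sum(kept)
    return tp / len(kept), tp / n_gt

# Sweep: sort by confidence (descending) and accumulate the TP count.
ranked = sorted(preds, key=lambda p: p[0], reverse=True)
sweep = []
tp = 0
for k, (conf, is_tp) in enumerate(ranked, start=1):
    tp += is_tp
    sweep.append((tp / k, tp / n_gt))  # (precision, recall) after k predictions

# Each sweep row equals thresholding at that prediction's confidence.
for (conf, _), row in zip(ranked, sweep):
    assert row == pr_at_threshold(preds, n_gt, conf)
```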

---

#### **“Adjust the threshold and calculate again”?**

You *can* pick a single **confidence threshold** for deployment (e.g., to get a desired precision/recall), but **AP does not require** choosing one. AP is the **area under the PR curve** formed by lowering the confidence threshold from high → low (i.e., walking the sorted list).

---

#### **What role does the computed IoU play?**

IoU is used for **matching**, not for sorting.

For a chosen **IoU threshold τ** (e.g., 0.5):
- Take the next prediction in the confidence-sorted list.
- Among ground truths in the **same image & class**, find the **unmatched** GT with **max IoU** to this prediction.
- If that max IoU **≥ τ**, mark the prediction **TP** and **lock** that GT (can’t be matched again).
- Otherwise, mark **FP** (either IoU < τ or it’s a duplicate on an already-matched GT).

Then compute cumulative **precision/recall** and integrate to get **AP@τ**.
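The matching steps above can be sketched roughly as follows. This is a simplified illustration for a single class, assuming boxes are `(x1, y1, x2, y2)` tuples. The names `iou` and `label_predictions` and the toy boxes are made up for this example. Real evaluators differ in details (e.g., handling of crowd or "difficult" annotations).

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_predictions(preds, gts, tau=0.5):
    """preds: list of (image_id, confidence, box); gts: {image_id: [box, ...]}.
    Returns TP (True) / FP (False) flags in confidence-descending order."""
    matched = {img: [False] * len(boxes) for img, boxes in gts.items()}
    flags = []
    for img, conf, box in sorted(preds, key=lambda p: p[1], reverse=True):
        best_iou, best_j = 0.0, None
        for j, gt_box in enumerate(gts.get(img, [])):
            if matched[img][j]:
                continue  # GT already locked by a higher-confidence prediction
            v = iou(box, gt_box)
            if v > best_iou:
                best_iou, best_j = v, j
        if best_j is not None and best_iou >= tau:
            matched[img][best_j] = True  # lock this GT
            flags.append(True)
        else:
            flags.append(False)  # IoU too low, or only a locked GT overlapped
    return flags

# Toy data (hypothetical): two GTs in img0, none in img1.
gts = {"img0": [(0, 0, 10, 10), (20, 20, 30, 30)]}
preds = [
    ("img0", 0.9, (0, 0, 10, 10)),    # exact hit on the first GT
    ("img0", 0.8, (1, 1, 10, 10)),    # duplicate of the first GT
    ("img0", 0.7, (20, 20, 30, 31)),  # close to the second GT (IoU ~0.91)
    ("img1", 0.6, (0, 0, 5, 5)),      # image without any GT
]
flags = label_predictions(preds, gts, tau=0.5)
```

With `tau=0.5` the flags come out as TP, FP, TP, FP. Raising `tau` to 0.95 also turns the third prediction into an FP, which is the "stricter matching" effect.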

* **Higher τ** (e.g., 0.75, 0.9) = stricter matching → usually **lower AP**.
* **COCO mAP** averages AP over **10 τ values** (0.50 to 0.95 in steps of 0.05), so it’s stricter than mAP\@0.5.
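Written out, the COCO-style average looks like this. The symbols are notation introduced here, not taken from the notes: $C$ is the set of classes and $\mathrm{AP}_c(\tau)$ is the AP of class $c$ at IoU threshold $\tau$.

$$
\mathrm{mAP}_{\text{COCO}} \;=\; \frac{1}{10} \sum_{\tau \in \{0.50,\, 0.55,\, \dots,\, 0.95\}} \; \frac{1}{|C|} \sum_{c \in C} \mathrm{AP}_c(\tau)
$$

By comparison, mAP\@0.5 keeps only the $\tau = 0.50$ term of the outer sum.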



The two thresholds play different roles:

* For **evaluation**: the IoU threshold (e.g., 0.5) decides TP/FP during metric computation.
* For **deployment thresholding**: you still sort/sweep by **confidence**; IoU doesn’t order predictions, it only determines whether each kept prediction is a TP or FP when you compute metrics at each confidence threshold `t`.




---

#### **AP = 1 is the “perfect” case.**

Why you still got **AP ≈ 0.917** with **2 FPs** (3 ground truths; ranked predictions P1–P5 = TP, TP, FP, TP, FP):

* **AP is the area under the Precision–Recall curve**, so it rewards **how well you rank your detections** by confidence.
* In your list, the **first two predictions are correct (TP, TP)**, so precision is **1.0** while recall climbs to **2/3**. That already fills a big chunk of the area.
* You do hit an FP (P3), but then you get the last TP (P4), reaching **full recall** with precision **0.75**.
* The final FP (P5) happens **after recall is already 1.0**, so it **doesn’t reduce the integrated area** (there’s no additional recall gained to integrate over).

Think of AP as:

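> the **average of precision values at the moments recall increases**, after applying a monotonic “envelope” that makes precision non-increasing as recall grows (each precision is replaced by the best precision achieved at that recall or higher).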

In your example those moments are P1, P2, P4 → precisions **1.0, 1.0, 0.75** → average (weighted by the recall jumps of 1/3 each) = **0.9167**.

*Same FP count, worse ranking → much lower AP*

If the two FPs were **first**:

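* Seq: FP, FP, TP, TP, TP → precisions at TP moments ≈ **0.333, 0.5, 0.6** → the envelope lifts each one to the later maximum, **0.6** → **AP = 0.6** (the raw average without the envelope would be only ≈ 0.478).
  **Same 2 FPs**, but **AP drops from \~0.92 to \~0.60** because the correct detections weren’t ranked early.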
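Both rankings above can be checked with a few lines of code. This is a minimal sketch of the "all-point" AP (area under the enveloped PR curve). The function name `average_precision` is made up here, and some benchmarks sample the curve at fixed recall points instead, so their numbers can differ slightly.

```python
def average_precision(flags, n_gt):
    """AP from TP/FP flags listed in confidence-descending order."""
    precisions, recalls = [], []
    tp = 0
    for k, is_tp in enumerate(flags, start=1):
        tp += is_tp
        precisions.append(tp / k)
        recalls.append(tp / n_gt)
    # Envelope: each precision becomes the max precision at this or any later rank.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum precision * recall jump; ranks where recall does not increase add zero.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

good = average_precision([True, True, False, True, False], n_gt=3)  # TPs ranked early
bad = average_precision([False, False, True, True, True], n_gt=3)   # FPs ranked first
print(round(good, 4), round(bad, 4))  # 0.9167 0.6
```

The trailing FP in the first ranking contributes nothing because recall is already 1.0 there, while the two leading FPs in the second ranking cap every precision value at 0.6.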

**Bottom line:** AP is high when your **correct detections appear early** in the ranked list and you **recover all GTs**. It’s less about the raw FP count and more about **where** the FPs occur relative to the TPs.


---


#### **mAP is for model selection, not for choosing a single deployment threshold.**
You *use mAP* to compare checkpoints/models. Then, **pick a confidence threshold on a validation set** to meet your product goals (precision/recall trade-off), and finally **lock it** before evaluating on the test set.

Here’s a practical workflow:

- **Train & pick a model using mAP**
  * Track mAP (e.g., COCO mAP 0.50:0.95) during training.
  * Choose the checkpoint with the best mAP (or your preferred metric).

- **Choose an operating point (confidence threshold) on the *validation* set**
  * Sweep confidence `t` from 0→1 (and usually NMS IoU too).
  * For each `t`, compute precision/recall (or other metrics: F1, Fβ, cost-weighted utility).
  * Pick `t*` that **maximizes your objective** or **meets constraints**, e.g.:
    * maximize F1
    * precision ≥ 0.95 with highest recall
    * ≤ 0.1 false positives per image with highest recall
  * Optionally choose **per-class thresholds** (common in detection).

- **Freeze and report on the test set**
  * Apply `t*` (and NMS setting) to the test set **once**.
  * Report your final metrics (precision, recall, AP\@0.5, AP\@0.75, mAP, etc.).

- **(Optional) Calibrate scores**
  * If you need reliable “confidence = probability,” calibrate on validation (Platt/Logistic, isotonic, temperature scaling) and then threshold.
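For the optional calibration step, here is a rough Platt-style sketch. It fits `p = sigmoid(a * score + b)` to validation TP/FP labels with plain gradient descent. The function name `fit_platt`, the learning rate, the step count, and the toy data are all assumptions made for illustration. In practice you would more likely use a tested library implementation.

```python
import math

def fit_platt(scores, labels, lr=0.5, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s  # d(log loss)/da for this sample
            grad_b += (p - y)      # d(log loss)/db for this sample
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(score, a, b):
    """Map a raw confidence to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Toy validation data (hypothetical): raw confidences and whether each was a TP.
val_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
val_labels = [1, 1, 0, 1, 1, 0, 0, 0]

a, b = fit_platt(val_scores, val_labels)
calibrated = [calibrate(s, a, b) for s in val_scores]
```

The fitted map is monotonic for `a > 0`, so it keeps the ranking of predictions and therefore leaves AP unchanged. What changes is where a probability-style threshold such as 0.5 falls.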

--- 

#### **Tiny pseudocode for picking the threshold**

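```python
import numpy as np

# NOTE: keep_predictions_with_confidence_ge, non_max_suppression and
# precision_and_recall are placeholders for your own pipeline (not library
# functions); `gt` stands for the validation-set ground truth.
best_t, best_score = None, -1.0
for t in np.linspace(0, 1, 201):          # candidate thresholds in steps of 0.005
    preds_t = keep_predictions_with_confidence_ge(t)
    preds_t = non_max_suppression(preds_t, iou_nms=0.5)
    P, R = precision_and_recall(preds_t, gt, iou_eval=0.5)
    score = 2 * P * R / (P + R + 1e-12)   # F1 (swap for your objective)
    if score > best_score:                # strict ">" keeps the lowest t on ties
        best_score, best_t = score, t

# Deploy with best_t; evaluate once on the test set.
```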
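If each validation prediction has already been labelled TP or FP once, at a fixed evaluation IoU and after NMS, the same search can be run directly on the ranked list. The sketch below works under that assumption. The name `pick_threshold`, its `min_precision` parameter, and the toy data are hypothetical. It covers two of the objectives listed earlier: maximize F1, or maximize recall subject to a precision floor.

```python
def pick_threshold(scored_flags, n_gt, min_precision=None):
    """scored_flags: list of (confidence, is_tp). Returns (threshold, precision, recall),
    or None if no threshold satisfies the precision floor.
    min_precision=None -> maximize F1; otherwise maximize recall with precision >= floor."""
    ranked = sorted(scored_flags, key=lambda x: x[0], reverse=True)
    best = None  # (objective, threshold, precision, recall)
    tp = 0
    for k, (conf, is_tp) in enumerate(ranked, start=1):
        tp += is_tp
        # With tied confidences, only the last item of the tie is a valid cut point.
        if k < len(ranked) and ranked[k][0] == conf:
            continue
        p, r = tp / k, tp / n_gt
        if min_precision is None:
            objective = 2 * p * r / (p + r) if p + r > 0 else 0.0  # F1
        elif p >= min_precision:
            objective = r
        else:
            continue
        if best is None or objective > best[0]:
            best = (objective, conf, p, r)
    return None if best is None else best[1:]

# Toy validation data (hypothetical): (confidence, is_tp), with 3 ground truths.
toy = [(0.95, True), (0.90, True), (0.70, False), (0.60, True), (0.30, False)]

by_f1 = pick_threshold(toy, n_gt=3)                             # keep confidence >= 0.60
by_precision = pick_threshold(toy, n_gt=3, min_precision=0.95)  # keep confidence >= 0.90
```

On this toy data the F1 objective keeps four predictions (precision 0.75, recall 1.0), while the precision floor keeps only the top two (precision 1.0, recall 2/3). The chosen threshold depends on the objective, not on mAP.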



- **mAP**: compare models (ranking quality across *all* thresholds).
- **Confidence threshold**: pick on validation to meet your operating goals.
- **Never tune on the test set.**
