## Maximum likelihood estimation (MLE)

Maximum likelihood estimation is the foundation of how ML models are "taught" to pick their parameters. It is a method for finding the parameter values that make the observed data most likely to have happened.

Imagine you have a bag of 100 marbles. You don't know how many are red and how many are blue. You pull out 10 marbles, and 8 of them are red.
- Case A: If the bag were 10% red, the probability of pulling 8 reds in 10 draws would be almost zero.
- Case B: If the bag were 80% red, that probability would be far higher (roughly 30%), more than for any other share of red marbles.

MLE chooses Case B. It says: "Since we saw 8 reds, let's assume the true parameter ($\theta$) of the bag is 80% red, because that makes our observation most probable."
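The marble comparison can be checked numerically. The following is a minimal sketch, assuming each marble is put back after it is drawn so that the binomial model applies; the function name `binomial_likelihood` and the 1% grid are illustrative choices, not part of the original example.

```python
from math import comb

def binomial_likelihood(theta, k=8, n=10):
    """P(k reds in n draws | share of reds = theta), assuming draws with replacement."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

case_a = binomial_likelihood(0.10)  # bag is 10% red
case_b = binomial_likelihood(0.80)  # bag is 80% red

# Scan candidate values of theta in 1% steps and keep the one
# under which the observed data (8 reds in 10 draws) is most probable.
grid = [i / 100 for i in range(101)]
theta_mle = max(grid, key=binomial_likelihood)

print(f"Case A (10% red): {case_a:.2e}")
print(f"Case B (80% red): {case_b:.3f}")
print(f"MLE on the grid:  {theta_mle}")
```

Case A comes out around 3.6e-07 and Case B around 0.30. The second number is modest in absolute terms, but no other value of $\theta$ on the grid does better, which is all MLE asks for.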

### MLE vs. MAP
- **MLE**: Only cares about the data itself. It picks the $\theta$ that maximizes $P(\text{Data} \mid \theta)$.
- **MAP** (Maximum A Posteriori): Incorporates **Prior Knowledge**. It says, "The data suggests 80% red marbles, but I know this factory usually makes 50/50 bags, so I'll guess 65%."

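**Regularization** can be read as the mathematical way of injecting "Prior Knowledge" into MLE to turn it into MAP. For example, an L2 penalty on the weights corresponds to a Gaussian prior, and an L1 penalty corresponds to a Laplace prior.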
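To make the 65% guess concrete, here is a small sketch that assumes the factory's "usually 50/50" belief is encoded as a Beta(6, 6) prior. That prior is a hypothetical choice, picked only because it happens to reproduce the number in the example; the helper `map_estimate` is likewise illustrative.

```python
def map_estimate(k, n, a, b):
    """Posterior mode for k successes in n binomial trials with a Beta(a, b) prior.

    The posterior is Beta(k + a, n - k + b); this mode formula is valid
    when both of those parameters are greater than 1.
    """
    return (k + a - 1) / (n + a + b - 2)

k, n = 8, 10  # 8 red marbles out of 10 draws

theta_mle = k / n                          # data only
theta_map = map_estimate(k, n, a=6, b=6)   # data + "factory makes 50/50 bags" prior
theta_flat = map_estimate(k, n, a=1, b=1)  # flat prior: MAP falls back to MLE

print(theta_mle, theta_map, theta_flat)
```

The prior behaves like 5 extra red and 5 extra blue marbles seen before the experiment, pulling the estimate from 0.8 toward 0.5. In the log domain the prior becomes an additive term next to the log-likelihood, which is the same role a regularization penalty plays in a loss function.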

### How models use MLE
Many common algorithms use a version of MLE to find their "weights":
- **Linear Regression**: Under the assumption that errors follow a Normal (Gaussian) distribution, the **Ordinary Least Squares method** gives exactly the same weights as MLE.
- **Logistic Regression**: MLE is used to find the weights that maximize the probability of assigning the correct labels (0 or 1) to the training data. Minimizing the negative log of that probability gives the Binary Cross-Entropy Loss function.
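Both claims can be checked on toy numbers. The sketch below uses made-up data and hypothetical helper names (`ols_fit`, `gaussian_nll`, `bernoulli_nll`), and it fixes the noise level at `sigma=1.0` purely for illustration. It shows that the Gaussian negative log-likelihood is a constant plus a scaled sum of squared errors, and that the average Bernoulli negative log-likelihood is the binary cross-entropy.

```python
import math

# Hypothetical toy data, roughly y = 2x + 1 plus a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

def ols_fit(xs, ys):
    """Closed-form least-squares slope and intercept for a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    w = sxy / sxx
    return w, my - w * mx

def sse(w, b):
    """Sum of squared errors of the line y = w*x + b on the toy data."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))

def gaussian_nll(w, b, sigma=1.0):
    """Negative log-likelihood if y = w*x + b + Normal(0, sigma^2) noise."""
    n = len(xs)
    return n * 0.5 * math.log(2 * math.pi * sigma**2) + sse(w, b) / (2 * sigma**2)

w_ols, b_ols = ols_fit(xs, ys)
# The first term of gaussian_nll does not depend on (w, b), so whatever
# minimizes the squared error also maximizes the Gaussian likelihood.

def bernoulli_nll(labels, probs):
    """Average negative log-likelihood of 0/1 labels, i.e. binary cross-entropy."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs))
    return -total / len(labels)

labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.7, 0.6]  # hypothetical predicted P(label = 1)

# Likelihood of the labels: p for each 1, (1 - p) for each 0.
likelihood = math.prod(p if y == 1 else 1 - p for y, p in zip(labels, probs))
bce = bernoulli_nll(labels, probs)

print(w_ols, b_ols)
print(bce, -math.log(likelihood) / len(labels))  # the two values agree
```

Nudging `w_ols` or `b_ols` in any direction raises both `sse` and `gaussian_nll`, which is a quick way to see that the two objectives share the same minimum.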

### The Likelihood function
The likelihood function, denoted as $L(\theta | X)$, measures how well a specific parameter $\theta$ explains the observed data $X$.
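- **Probability**: $P(X | \theta)$. The rules are fixed and the data varies: given these rules, how likely is this data?
- **Likelihood**: $L(\theta | X)$. The data is fixed and the rules vary: given this data, how well do these rules explain it? It is the same formula as $P(X | \theta)$ read as a function of $\theta$, so it is not a probability distribution over $\theta$.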
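The difference between the two readings shows up when you add things up. The sketch below reuses the marble setup (10 draws, binomial model assumed) with a hypothetical helper `binom_pmf`: summing over all possible data for a fixed $\theta$ gives 1, while integrating over all $\theta$ for fixed data does not.

```python
from math import comb

def binom_pmf(k, n, theta):
    """P(k successes in n trials | theta) under a binomial model."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

n = 10

# Probability view: theta is fixed at 0.8, the data k varies over 0..10.
prob_total = sum(binom_pmf(k, n, 0.8) for k in range(n + 1))

# Likelihood view: the data is fixed at k = 8, theta varies over [0, 1].
# Midpoint-rule approximation of the area under L(theta | k = 8).
steps = 10_000
area = sum(binom_pmf(8, n, (i + 0.5) / steps) for i in range(steps)) / steps

print(prob_total)  # sums to 1 (up to rounding)
print(area)        # close to 1/11, not 1
```

For the binomial model the area under the likelihood curve is exactly $1/(n+1)$, here $1/11$. Only the location of the peak matters for MLE, not the area.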

#### The Log-Likelihood "Trick"
Usually, $L(\theta)$ is the product of many probabilities (e.g., $0.5 \times 0.1 \times 0.9 \dots$). Multiplying thousands of small decimals causes numerical underflow: the result becomes too small for a floating-point number and rounds to zero. To solve this, the **natural log** ($\ln$) of the likelihood is used instead.
- It turns multiplication into addition, which is much easier to calculate and to differentiate.
- The "peak" (maximum) of the log-likelihood stays in the exact same spot as in the original function, because $\ln$ is strictly increasing.
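Both points are easy to see in a few lines. This sketch uses 2000 made-up probabilities of 0.1 to trigger the underflow, then reuses the marble likelihood ($\theta^8 (1-\theta)^2$, constant factor dropped) to compare the location of the two peaks; the variable names are illustrative.

```python
import math

# 2000 hypothetical per-sample probabilities of 0.1 each.
probs = [0.1] * 2000

product = math.prod(probs)                 # true value is 1e-2000, too small for a float
log_sum = sum(math.log(p) for p in probs)  # same information, perfectly representable

print(product)  # 0.0
print(log_sum)  # about -4605.17

# The maximum does not move when we take the log.
def likelihood(theta):
    return theta**8 * (1 - theta) ** 2

def log_likelihood(theta):
    return 8 * math.log(theta) + 2 * math.log(1 - theta)

grid = [i / 100 for i in range(1, 100)]  # skip 0 and 1, where log(0) is undefined
best_by_likelihood = max(grid, key=likelihood)
best_by_log = max(grid, key=log_likelihood)

print(best_by_likelihood, best_by_log)  # 0.8 0.8
```

In practice, training code minimizes the negative log-likelihood rather than maximizing the likelihood, which is the same search with the sign flipped.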


### MLE in Object Detection (The Bounding Box)
In detectors like YOLO or Faster R-CNN, the model predicts four coordinates for a bounding box: $(x, y, w, h)$.
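The regression losses used to train the box coordinates can often be read as MLE: Mean Squared Error, for example, is what MLE yields if the coordinate errors are assumed to be Gaussian. IoU Loss is a looser fit, since it optimizes box overlap directly rather than coming from an explicit noise model.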
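For the Mean Squared Error case, the link to MLE can be written out in one step. This is a sketch under assumptions that are not in the original: each true coordinate $c \in \{x, y, w, h\}$ is taken to equal the prediction $\hat{c}$ plus independent Gaussian noise with a shared, fixed variance $\sigma^2$.

$$
-\ln P(x, y, w, h \mid \theta)
= -\sum_{c \in \{x, y, w, h\}} \ln \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(c - \hat{c})^2}{2\sigma^2} \right) \right]
= \frac{1}{2\sigma^2} \sum_{c \in \{x, y, w, h\}} (c - \hat{c})^2 + 2 \ln(2\pi\sigma^2)
$$

The last term does not depend on the predictions, so maximizing the likelihood over $\theta$ is the same as minimizing the sum of squared coordinate errors, which is Mean Squared Error up to a constant factor.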

