# Important Takeaways So Far  

Hey, pretty cool — we’ve now tackled our very first ML examples:  

- **Linear Regression**:  
  We found the best-fit line (or plane, once there is more than one feature).  
  In our example the data was described by two features, so strictly speaking the fit is a plane, and we learned its parameters (weights):  

  $$
  w_1 \cdot x_1 + w_2 \cdot x_2 + w_0 = y
  $$

  - Here $w_1, w_2, w_0$ are the weights (parameters) the model learns.  
  - Once we know these, we can plug in the feature values $(x_1, x_2)$ to predict $y$.  

- **Logistic Regression**:  
  Very similar structure, but this time our goal was **classification**.  
  Instead of predicting a number directly, we learned weights that indirectly define a **decision boundary**: a line (in 2D) or a (hyper)plane (in higher dimensions) that separates one class from another.  
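
To make the two model forms concrete, here is a minimal NumPy sketch. The weight values and the two example records are made up for illustration, and the logistic part assumes the standard sigmoid formulation (the weighted sum is squashed into a probability, and the decision boundary sits where that sum equals zero).

```python
import numpy as np

def linear_predict(X, w, w0):
    """Linear regression: weighted sum of the features plus the bias w0."""
    return X @ w + w0

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict_proba(X, w, w0):
    """Logistic regression: the same weighted sum, passed through a sigmoid."""
    return sigmoid(X @ w + w0)

# Hypothetical weights and two example records with features (x1, x2).
w = np.array([2.0, -1.0])    # w1, w2
w0 = 0.5
X = np.array([[1.0, 2.0],
              [0.0, 3.0]])

y_hat = linear_predict(X, w, w0)        # numeric predictions
p = logistic_predict_proba(X, w, w0)    # probabilities of class 1
labels = (p >= 0.5).astype(int)         # p = 0.5 exactly where w1*x1 + w2*x2 + w0 = 0
```

Both models share the expression `X @ w + w0`; the classifier only adds the sigmoid and a threshold on top of it.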

---

## The Training Loop

You probably noticed something interesting: we basically **copy–pasted the training loop logic** from linear regression into logistic regression!  
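The only things that changed were:  
- The **prediction step** (logistic regression passes the weighted sum through a sigmoid to get a probability),  
- The **loss function** (MSE → BCE),  
- The **gradient calculation** (because it depends on the loss).  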

Everything else was the same.  
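
For reference, here are the two losses and their gradients written out, assuming $n$ training records, the standard mean-over-records definitions, and the convention $x_{i0} = 1$ so that the bias $w_0$ is treated like any other weight. The notation $x_{ij}$ (feature $j$ of record $i$) is introduced here for illustration.

Linear regression with MSE, where $\hat{y}_i = w_1 x_{i1} + w_2 x_{i2} + w_0$:

$$
L_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2,
\qquad
\frac{\partial L_{\text{MSE}}}{\partial w_j} = \frac{2}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right) x_{ij}
$$

Logistic regression with BCE, where $\hat{y}_i = \sigma(w_1 x_{i1} + w_2 x_{i2} + w_0)$ and $\sigma(z) = 1 / (1 + e^{-z})$:

$$
L_{\text{BCE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right],
\qquad
\frac{\partial L_{\text{BCE}}}{\partial w_j} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right) x_{ij}
$$

Up to the constant factor of 2, both gradients have the form "prediction error times feature"; what differs is how $\hat{y}_i$ is computed.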

Here’s the general recipe you’ve seen:  

1. **Initialize parameters (weights)** randomly.  
2. For a certain number of **epochs** (an epoch = one full pass over the training set):  
   - For each training record (or a batch — more on that below):  
     - Plug the features into the model → get a prediction $\hat{y}$.  
     - Compare $\hat{y}$ with the true label → compute the **loss**.  
     - Express the loss in terms of the weights.  
     - Compute the **gradient**: this tells us how the loss changes with respect to each weight, and in which direction the loss is increasing.  
     - Update the weights in the **opposite direction** of the gradient, since we want to move toward *lower* loss, not higher.
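
The recipe above can be sketched in a few lines of Python. This is a minimal illustration for linear regression with a squared-error loss, updating after every single record; the synthetic data, the learning rate `lr`, and the epoch count are all assumed values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up for illustration): y = 3*x1 - 2*x2 + 1 + small noise
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + 0.1 * rng.normal(size=200)

# 1. Initialize parameters (weights) randomly.
w = rng.normal(size=2)
w0 = rng.normal()
lr = 0.01  # learning rate (step size), an assumed value

# 2. Loop for a fixed number of epochs.
for epoch in range(50):
    for i in range(len(y)):              # one training record at a time
        y_hat = X[i] @ w + w0            # plug features into the model
        error = y_hat - y[i]             # loss for this record is error**2
        grad_w = 2.0 * error * X[i]      # d(loss)/dw
        grad_w0 = 2.0 * error            # d(loss)/dw0
        w -= lr * grad_w                 # step opposite to the gradient
        w0 -= lr * grad_w0

print("learned weights:", w, "bias:", w0)
```

The learned values should land close to the ones used to generate the data, since the noise is small.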

---

## Vectorization vs. Batching

- In practice, we don’t literally write a `for` loop over each record.  
  Instead, we **vectorize**: a couple of lines of NumPy will compute the average loss and gradient for the *entire dataset at once*.  

- But for really large datasets, we don’t process everything in one go either.  
  Instead, we use **mini-batches**:  
  - Take small subsets (e.g. 16, 32, 64, 128 records).  
  - Do the same forward → loss → gradient → update loop, just on that batch.  
  - Repeat until we’ve seen all batches → that’s one epoch.  

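This batching is what makes training on large datasets feasible: memory use stays bounded by the batch size no matter how big the dataset is, and the weights get many updates per epoch instead of just one.  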
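
Here is a rough sketch of both ideas, using linear regression with MSE on made-up data. It first checks that a vectorized gradient matches an explicit loop over records, then runs mini-batch updates with that same vectorized function. The batch size, learning rate, and per-epoch shuffling are assumptions of this example, as is the trick of appending a column of ones so the bias lives inside `w`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                 # made-up features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0        # made-up targets
Xb = np.hstack([X, np.ones((len(X), 1))])      # column of 1s so w0 is the last entry of w

def grad_loop(Xb, y, w):
    """Average MSE gradient with an explicit Python loop over records."""
    g = np.zeros_like(w)
    for x_i, y_i in zip(Xb, y):
        g += 2.0 * (x_i @ w - y_i) * x_i
    return g / len(y)

def grad_vectorized(Xb, y, w):
    """The same gradient as a single matrix expression."""
    return 2.0 * Xb.T @ (Xb @ w - y) / len(y)

w = rng.normal(size=3)
assert np.allclose(grad_loop(Xb, y, w), grad_vectorized(Xb, y, w))

# Mini-batch training: the same vectorized step, applied to slices of the data.
batch_size, lr = 32, 0.05
for epoch in range(20):
    order = rng.permutation(len(y))            # shuffle once per epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        w -= lr * grad_vectorized(Xb[idx], y[idx], w)
```

Note that each mini-batch step is itself vectorized, so the two techniques work together rather than competing.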

---

## Main Takeaway

At its core, the **training loop is roughly the same across many ML algorithms**.  
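The only things that change are:  
- The model's **prediction function** (e.g. a plain weighted sum vs. a weighted sum passed through a sigmoid),  
- The **loss function** (depends on the problem: regression, classification, etc.),  
- The resulting **gradient** (because it comes from the loss).  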

Everything else — initialization, looping over data, updating weights — is the same rhythm.  
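
One way to see this in code is a single training function where the gradient is the only pluggable piece. The sketch below uses full-batch gradient descent on made-up data; the names `train`, `mse_grad`, and `bce_grad` are hypothetical helpers for this example, and the bias is again handled with a column of ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_grad(X, y, w):
    """Gradient of mean squared error for a linear model."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def bce_grad(X, y, w):
    """Gradient of binary cross-entropy for a sigmoid (logistic) model."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def train(X, y, grad_fn, epochs=500, lr=0.1, seed=0):
    """Same rhythm for every model: initialize, loop, step against the gradient."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])      # random initialization
    for _ in range(epochs):              # loop over the data
        w -= lr * grad_fn(X, y, w)       # update opposite to the gradient
    return w

# Made-up data: two features plus a column of 1s for the bias.
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(300, 2)), np.ones((300, 1))])
y_reg = X @ np.array([3.0, -2.0, 1.0])                       # regression targets
y_cls = (X @ np.array([1.0, 1.0, 0.0]) > 0).astype(float)    # binary labels

w_lin = train(X, y_reg, mse_grad)    # linear regression
w_log = train(X, y_cls, bce_grad)    # logistic regression, same loop
```

Swapping `mse_grad` for `bce_grad` is the entire difference between the two runs; `train` itself never changes.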
