Gradient descent gives collaborative filtering a more systematic, and often faster, way to train a recommendation model: it adjusts the user and item factors together to reduce prediction error.



### Setup: users, items, and factors

- We represent each user by a small list of numbers called **user factors** (their hidden tastes or preferences).  
- We represent each item (movies, songs, etc.) by **item factors** (hidden properties of the item).  
- If we choose $Z$ factors, we put all user factors into a matrix $P$ (size $M \times Z$, $M$ = number of users) and all item factors into a matrix $Q$ (size $Z \times N$, $N$ = number of items).  
- The predicted rating of user $i$ for item $j$ is the **dot product** of row $i$ of $P$ and column $j$ of $Q$: “how much the user likes each factor” multiplied by “how strongly the item has each factor,” summed up.

Intuitively: users and items both live in the same hidden “preference space”; prediction is how aligned they are.
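As a minimal sketch of the prediction step, the snippet below builds a tiny $P$ and $Q$ with made-up values (2 users, 3 items, $Z = 2$) and computes one predicted rating. The helper name `predict` and all numbers are assumptions chosen for illustration.

```python
# Toy example: M = 2 users, N = 3 items, Z = 2 factors (all values made up).
P = [            # M x Z: one row of factors per user
    [1.0, 0.5],
    [0.2, 2.0],
]
Q = [            # Z x N: one column of factors per item
    [3.0, 1.0, 0.0],
    [1.0, 2.0, 0.5],
]


def predict(P, Q, i, j):
    """Predicted rating of user i for item j: row i of P dotted with column j of Q."""
    return sum(P[i][b] * Q[b][j] for b in range(len(Q)))


# User 0, item 0: 1.0 * 3.0 + 0.5 * 1.0
rating_00 = predict(P, Q, 0, 0)
print(rating_00)
```

Each term in the sum pairs one user factor with the matching item factor, which is the "how much the user likes it" times "how strongly the item has it" reading given above.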



### Error and loss (what we want to minimize)

- For each known rating $r_{i,j}$ (what the user actually gave), the model produces a prediction $\hat{r}_{i,j}$.  
- The **error** for that pair is $\hat{r}_{i,j} - r_{i,j}$.  
- The **squared error** is $(\hat{r}_{i,j} - r_{i,j})^2$. Squaring makes all errors positive and punishes big mistakes more.

To get the overall quality of the model:

- We average the squared errors over all **existing** ratings only: sum them, then divide by the number of known ratings.
- We explicitly **skip missing ratings** (cases where the user never rated that item), because treating them as 0 would incorrectly teach the model that “no interaction means the user hates the item.”

So the mean squared error is “average squared difference between predictions and true ratings, but only where we know the rating.”
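One way to sketch this loss is to store only the known ratings in a dictionary, so that missing ratings are never visited at all. The names `mse_known` and `ratings`, and the toy values, are assumptions made for this example.

```python
def predict(P, Q, i, j):
    """Dot product of user i's factors (row of P) and item j's factors (column of Q)."""
    return sum(P[i][b] * Q[b][j] for b in range(len(Q)))


def mse_known(P, Q, ratings):
    """Mean squared error over known ratings only.

    `ratings` maps (user, item) -> rating. Pairs that are absent from the
    dictionary are skipped entirely, so they contribute nothing to the loss.
    """
    total = 0.0
    for (i, j), r in ratings.items():
        err = predict(P, Q, i, j) - r
        total += err ** 2
    return total / len(ratings)


P = [[1.0, 0.5], [0.2, 2.0]]
Q = [[3.0, 1.0, 0.0], [1.0, 2.0, 0.5]]
# User 1 never rated items 0 and 1, so those pairs simply do not appear.
ratings = {(0, 0): 4.0, (0, 1): 2.0, (1, 2): 1.0}

loss = mse_known(P, Q, ratings)
print(loss)
```

Dividing by `len(ratings)` rather than by the full matrix size $M \times N$ is what keeps the unrated cells out of the average.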



### Goal: find the best user and item factors

- Our aim is to choose the numbers in $P$ and $Q$ so that this mean squared error is as small as possible.  
- This means all the user and item factors (every entry in $P$ and $Q$) are **parameters** of one big model.  
- If there are $M$ users, $N$ items, and $Z$ factors, there are $M \times Z$ user parameters plus $Z \times N$ item parameters in total.
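As a worked example with made-up sizes (assumed purely for illustration), take $M = 1000$ users, $N = 500$ items, and $Z = 10$ factors:

$$
M \times Z + Z \times N = 1000 \times 10 + 10 \times 500 = 15{,}000
$$

That is far fewer numbers than the $M \times N = 500{,}000$ cells of the full rating matrix, which is why a small set of factors can act as a compact summary of it.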

Instead of separately solving many small problems (like “fit a regression for each user” or “for each item”), we treat it as **one big optimization problem**.



### Gradient descent idea

Gradient descent is a general method to minimize a function by taking small steps in the direction that reduces it fastest.

Applied here:

- The function we want to minimize = mean squared error over all known ratings.  
- The inputs we can change = every element of $P$ and $Q$.  
- The algorithm repeatedly:
  - Computes how changing each parameter (each $P_{a,b}$, each $Q_{b,j}$) would change the error.  
  - Moves that parameter slightly in the direction that **reduces** the error.

This uses a **learning rate** $\alpha$: a small number that controls how big each adjustment step is. If $\alpha$ is too large, the updates overshoot and training becomes unstable; if it is too small, training is very slow.
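The effect of the learning rate is easiest to see on a one-dimensional toy problem rather than on the full model. The sketch below minimizes $f(x) = (x - 3)^2$, a function chosen here only for illustration, with three different values of $\alpha$.

```python
def gradient_descent(alpha, steps=20, x=0.0):
    """Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)."""
    for _ in range(steps):
        grad = 2 * (x - 3)
        x -= alpha * grad  # step against the gradient
    return x


good = gradient_descent(alpha=0.1)    # moves close to the minimum at x = 3
slow = gradient_descent(alpha=0.001)  # barely leaves the starting point in 20 steps
bad = gradient_descent(alpha=1.1)     # overshoots further on every step and diverges

print(good, slow, bad)
```

The same trade-off applies when the parameters are the entries of $P$ and $Q$; only the gradient computation changes.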



### Update rule (in plain language)

For a user factor $P_{a,b}$ (user $a$, factor $b$):

- Look at all items that user $a$ actually rated.  
- For each such item $j$:  
  - Compute the prediction error $e_{a,j} = \hat{r}_{a,j} - r_{a,j}$.  
  - Weight that error by the corresponding item factor $Q_{b,j}$.  
- Add up these “error × item-factor” terms.  
- Adjust $P_{a,b}$ by subtracting $\alpha$ × (this sum). Any constant factor in the exact gradient of the mean squared error is absorbed into $\alpha$.
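Written as formulas, the steps above look roughly as follows. The symbol $K$ (the set of known user-item pairs) is notation introduced here for convenience, and the constant $2/|K|$ that appears when differentiating the mean squared error is assumed to be folded into $\alpha$:

$$
\begin{aligned}
L &= \frac{1}{|K|} \sum_{(i,j) \in K} \left(\hat{r}_{i,j} - r_{i,j}\right)^2, \qquad \hat{r}_{i,j} = \sum_{b=1}^{Z} P_{i,b}\, Q_{b,j} \\
\frac{\partial L}{\partial P_{a,b}} &= \frac{2}{|K|} \sum_{j \,:\, (a,j) \in K} e_{a,j}\, Q_{b,j} \\
P_{a,b} &\leftarrow P_{a,b} - \alpha \sum_{j \,:\, (a,j) \in K} e_{a,j}\, Q_{b,j}
\end{aligned}
$$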

Intuition:

- If the predictions are **too low** for items that score high on a factor (negative error, positive $Q_{b,j}$), the update nudges that user factor **up**.
- If the predictions are **too high** for those items (positive error), the update nudges it **down**.

A similar rule updates each item factor $Q_{b,j}$, using errors over all users who rated that item and the corresponding user factors.
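The update rule described above, applied to every entry of $P$ and $Q$ at once, might be sketched like this. The function name `gradient_step` and the toy data are assumptions; constant factors of the exact gradient are folded into `alpha`.

```python
def predict(P, Q, i, j):
    return sum(P[i][b] * Q[b][j] for b in range(len(Q)))


def mse_known(P, Q, ratings):
    return sum((predict(P, Q, i, j) - r) ** 2 for (i, j), r in ratings.items()) / len(ratings)


def gradient_step(P, Q, ratings, alpha):
    """One simultaneous update of all user and item factors.

    Each parameter moves by -alpha * sum(error * matching factor), summed over
    the known ratings that involve it. Returns new matrices; inputs are unchanged.
    """
    M, Z, N = len(P), len(Q), len(Q[0])
    grad_P = [[0.0] * Z for _ in range(M)]
    grad_Q = [[0.0] * N for _ in range(Z)]
    for (i, j), r in ratings.items():
        err = predict(P, Q, i, j) - r
        for b in range(Z):
            grad_P[i][b] += err * Q[b][j]  # error weighted by the item factor
            grad_Q[b][j] += err * P[i][b]  # error weighted by the user factor
    new_P = [[P[i][b] - alpha * grad_P[i][b] for b in range(Z)] for i in range(M)]
    new_Q = [[Q[b][j] - alpha * grad_Q[b][j] for j in range(N)] for b in range(Z)]
    return new_P, new_Q


P = [[1.0, 0.5], [0.2, 2.0]]
Q = [[3.0, 1.0, 0.0], [1.0, 2.0, 0.5]]
ratings = {(0, 0): 4.0, (0, 1): 2.0, (1, 2): 1.0}

new_P, new_Q = gradient_step(P, Q, ratings, alpha=0.1)
print(mse_known(P, Q, ratings), mse_known(new_P, new_Q, ratings))
```

Both gradients are computed from the old values before anything is changed, so the user and item updates do not interfere with each other within a step.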

### Why this is more efficient than the earlier CF method

Earlier collaborative filtering description:

- Alternated between:
  - Holding item factors fixed, fitting separate models for each user.  
  - Holding user factors fixed, fitting separate models for each item.  
- That means many separate optimization problems and can be inefficient.

Gradient descent view:

- Treats **all** user and item factors as one parameter set.  
- Runs **one** optimization loop (one global gradient descent) that updates everything together.  
- This is conceptually similar to how we train neural networks: one loss function, many parameters, one gradient descent process.



### Handling model complexity and interpretation

- The number of factors $Z$ is flexible: could be 2 (toy case) or hundreds/thousands (real systems).  
- Larger $Z$ allows the model to capture more subtle patterns but risks overfitting; this is often controlled with **regularization** (extra penalty on large factor values).  
- The learned factors usually **do not have clear human labels**. They are just abstract directions in a learned space that happen to be useful for predicting ratings.
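As a hedged sketch of how regularization can enter the update, the version below adds an L2 penalty on every factor. The parameter name `lam`, the choice of an L2 penalty, and the toy values are assumptions for illustration; the text above only says that large factor values are penalized.

```python
def predict(P, Q, i, j):
    return sum(P[i][b] * Q[b][j] for b in range(len(Q)))


def regularized_step(P, Q, ratings, alpha, lam):
    """One gradient step on squared error plus an L2 penalty on all factors.

    `lam` sets the penalty strength; lam = 0 gives the unregularized update.
    Constant factors of the exact gradient are folded into alpha and lam.
    """
    M, Z, N = len(P), len(Q), len(Q[0])
    # The penalty contributes a term proportional to the factor itself.
    grad_P = [[lam * P[i][b] for b in range(Z)] for i in range(M)]
    grad_Q = [[lam * Q[b][j] for j in range(N)] for b in range(Z)]
    for (i, j), r in ratings.items():
        err = predict(P, Q, i, j) - r
        for b in range(Z):
            grad_P[i][b] += err * Q[b][j]
            grad_Q[b][j] += err * P[i][b]
    new_P = [[P[i][b] - alpha * grad_P[i][b] for b in range(Z)] for i in range(M)]
    new_Q = [[Q[b][j] - alpha * grad_Q[b][j] for j in range(N)] for b in range(Z)]
    return new_P, new_Q


# The prediction 1.0 * 2.0 + 0.5 * 2.0 = 3.0 already matches the rating, so the
# error term is zero and only the penalty acts: every factor shrinks slightly.
P = [[1.0, 0.5]]
Q = [[2.0], [2.0]]
ratings = {(0, 0): 3.0}
new_P, new_Q = regularized_step(P, Q, ratings, alpha=0.1, lam=0.5)
print(new_P, new_Q)
```

The penalty pulls each factor toward zero on every step, so a factor only stays large if the rating errors keep pushing it there.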



### Final picture (Funk SVD)

- Start with random user and item factors.  
- Repeatedly use gradient descent to adjust them to reduce mean squared error on known ratings.  
- End up with:
  - Matrix $P$: inferred user factors (each row = a user).  
  - Matrix $Q$: inferred item factors (each column = an item).  
- The matrix product $PQ$ (size $M \times N$) approximates the original rating matrix: its entries predict the missing ratings, and the recommendations are built on those predictions.

This whole process, factorizing the rating matrix into user and item factors via gradient descent, is commonly called **Funk SVD** in the recommender-system literature.
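Putting the pieces together, a minimal end-to-end sketch of this procedure could look like the following. It uses full-batch updates to match the description above; the function name `train`, the hyperparameter values, and the six toy ratings are all assumptions, not a reference implementation.

```python
import random


def predict(P, Q, i, j):
    return sum(P[i][b] * Q[b][j] for b in range(len(Q)))


def mse_known(P, Q, ratings):
    return sum((predict(P, Q, i, j) - r) ** 2 for (i, j), r in ratings.items()) / len(ratings)


def train(ratings, M, N, Z=2, alpha=0.01, epochs=5000, seed=0):
    """Factorize known ratings into P (M x Z) and Q (Z x N) by gradient descent."""
    rng = random.Random(seed)
    # Start from random factors.
    P = [[rng.uniform(0.0, 1.0) for _ in range(Z)] for _ in range(M)]
    Q = [[rng.uniform(0.0, 1.0) for _ in range(N)] for _ in range(Z)]
    for _ in range(epochs):
        grad_P = [[0.0] * Z for _ in range(M)]
        grad_Q = [[0.0] * N for _ in range(Z)]
        for (i, j), r in ratings.items():  # known ratings only
            err = predict(P, Q, i, j) - r
            for b in range(Z):
                grad_P[i][b] += err * Q[b][j]
                grad_Q[b][j] += err * P[i][b]
        for i in range(M):
            for b in range(Z):
                P[i][b] -= alpha * grad_P[i][b]
        for b in range(Z):
            for j in range(N):
                Q[b][j] -= alpha * grad_Q[b][j]
    return P, Q


# 3 users x 3 items with 6 known ratings; the other 3 cells are missing.
ratings = {
    (0, 0): 5.0, (0, 1): 4.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 1.0, (2, 2): 5.0,
}

P0, Q0 = train(ratings, M=3, N=3, epochs=0)  # untrained baseline (same seed)
P, Q = train(ratings, M=3, N=3)

initial_mse = mse_known(P0, Q0, ratings)
final_mse = mse_known(P, Q, ratings)
missing_prediction = predict(P, Q, 0, 2)  # a rating user 0 never gave
print(initial_mse, final_mse, missing_prediction)
```

Practical implementations commonly update the factors after each individual rating (stochastic gradient descent) and add regularization, rather than summing over all ratings before every step; the batch form is kept here because it mirrors the update rule as described.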