# Recommender Systems
## 1. Notation
In the rest of these notes, the following notation will be used:
* $n_u$: number of users (currently saved in the dataset)
* $n_i$: number of items that users can rate; typical items are movies or products
* $r(i, j)$: an indicator equal to $1$ if the $j$-th user rated the $i$-th item, and $0$ otherwise
* $y^{(i, j)}$: the rating given by user $j$ to item $i$, only defined when $r(i, j) = 1$
* $x^{(i)}$: feature vector for item $i$
* $\theta^{(j)}$: parameter vector for user $j$
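
As a concrete illustration of this notation, the sketch below builds a small rating table in NumPy. The data, the use of `np.nan` to mark missing ratings, and the array names `Y` and `R` are assumptions made for this example only.

```python
import numpy as np

# Hypothetical toy data: 4 items (rows) rated by 3 users (columns).
# np.nan marks a rating that has not been given.
Y = np.array([
    [5.0, np.nan, 1.0],
    [4.0, 2.0, np.nan],
    [np.nan, 5.0, 4.0],
    [1.0, np.nan, np.nan],
])

# r(i, j) = 1 where user j rated item i, 0 otherwise.
R = (~np.isnan(Y)).astype(int)

# Rows index items, columns index users.
n_i, n_u = Y.shape

print(n_i, n_u)
print(R)
```

Storing items as rows and users as columns matches the order of the indices in $y^{(i, j)}$.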


## 2. First approach: Content-Based Recommendations
Assume each item has a set of $n$ features. For movies, for instance, the features could be the share of each genre or category; for a magazine product, they might be affordability, quality, delivery, and price. Given a data set of users and their ratings, the algorithm finds vectors $\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}$ such that the predicted rating of user $j$ for item $i$ is $(\theta^{(j)}) ^ T \cdot x^{(i)}$. A linear model is one possible approach, where the cost function for a single user (to learn the parameter $\theta^{(j)}$) can be written as:
$\begin{aligned} 
\frac{1}{2} \sum_{i:r(i,j)=1}^{} ((\theta^{(j)}) ^ T \cdot x^{(i)} - y^{(i, j)})^2 + \frac{\lambda}{2} \sum_{k=1}^{n}(\theta^{(j)}_k)^2
\end{aligned}$
where $i:r(i,j)=1$ denotes the set of indices $i$ that satisfy $r(i, j)=1$, informally the indices of the items rated by user $j$.

The general cost function for all users can be written as:
$\begin{aligned} 
\frac{1}{2} \sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}^{} ((\theta^{(j)}) ^ T \cdot x^{(i)} - y^{(i, j)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta^{(j)}_k)^2
\end{aligned}$
This cost function has the same form as the cost of regularized linear regression. It can be minimized with an optimization algorithm such as gradient descent, using the following update rules (here $k = 0$ denotes an intercept term with $x^{(i)}_0 = 1$, which is not regularized):
$\begin{aligned}
\theta^{(j)}_k := \theta^{(j)}_k - \alpha \cdot \sum_{i:r(i,j)=1}^{} ((\theta^{(j)}) ^ T \cdot x^{(i)} - y^{(i, j)}) x^{(i)}_k ~ \text{for} ~ k = 0 \\
\theta^{(j)}_k := \theta^{(j)}_k - \alpha \cdot (\sum_{i:r(i,j)=1}^{}((\theta^{(j)}) ^ T \cdot x^{(i)} - y^{(i, j)}) x^{(i)}_k + \lambda \cdot \theta^{(j)}_k) ~ \text{for} ~ k \neq 0
\end{aligned}$
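
The update rules can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names `user_cost` and `user_step`, the synthetic data, and the hyperparameter values are all assumptions. Unobserved ratings are stored as `0` in `Y` and masked out with `R`, and the first feature column is an intercept fixed to `1`.

```python
import numpy as np

def user_cost(Theta, X, Y, R, lam):
    """Regularized squared-error cost summed over all users (k = 0 not regularized)."""
    err = (X @ Theta.T - Y) * R              # zero out pairs with r(i, j) = 0
    return 0.5 * np.sum(err ** 2) + 0.5 * lam * np.sum(Theta[:, 1:] ** 2)

def user_step(Theta, X, Y, R, alpha, lam):
    """One gradient-descent update of every theta^(j), item features X held fixed."""
    err = (X @ Theta.T - Y) * R              # shape (n_i, n_u)
    grad = err.T @ X                         # sum over rated items of error * x_k
    reg = lam * Theta
    reg[:, 0] = 0.0                          # no regularization for k = 0
    return Theta - alpha * (grad + reg)

# Synthetic example: 6 items, 4 users, 2 features plus an intercept.
rng = np.random.default_rng(0)
n_i, n_u, n = 6, 4, 2
X = np.hstack([np.ones((n_i, 1)), rng.standard_normal((n_i, n))])  # x_0 = 1
Y = X @ rng.standard_normal((n_u, n + 1)).T  # synthetic ratings
R = np.ones((n_i, n_u))
R[0, 1] = R[3, 2] = R[5, 0] = 0              # pretend three ratings are missing
Y = Y * R                                    # unobserved entries stored as 0

Theta = np.zeros((n_u, n + 1))
cost_before = user_cost(Theta, X, Y, R, lam=0.1)
for _ in range(300):
    Theta = user_step(Theta, X, Y, R, alpha=0.01, lam=0.1)
cost_after = user_cost(Theta, X, Y, R, lam=0.1)
print(cost_before, cost_after)
```

Multiplying the error matrix by `R` is what restricts the sums to the rated pairs, so no explicit loop over $i:r(i,j)=1$ is needed.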

## 3. Collaborative filtering: a more efficient approach
### 3.1 The iterative approach
In the previous section, we assumed that a feature vector is available for every item. This assumption is not always realistic, as obtaining such features might be extremely expensive or even impossible. Suppose instead that the users' preferences are provided, that is, how important each feature is to each user, along with their ratings. Using a linear model, it is then possible to estimate the feature values of the items. More formally, given $\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}$, the features $x^{(i)}$ of a single item can be learned by minimizing the cost function:
$\begin{aligned}
\frac{1}{2} \sum_{j:r(i,j)=1}^{} ((\theta^{(j)}) ^ T \cdot x^{(i)} - y^{(i, j)})^2 + \frac{\lambda}{2} \sum_{k=1}^{n}(x^{(i)}_k)^2
\end{aligned}$
where ${j:r(i,j)=1}$ represents the set of indices $j$ satisfying $r(i,j)=1$. Informally, it represents the set of users' indices who rated the item $i$. 

More generally, given $\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}$, the features $x^{(1)}, x^{(2)}, ..., x^{(n_i)}$ are learned by minimizing the overall cost function:
$\begin{aligned}
\frac{1}{2} \sum_{i=1}^{n_i}\sum_{j:r(i,j)=1}^{} ((\theta^{(j)}) ^ T \cdot x^{(i)} - y^{(i, j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_i}\sum_{k=1}^{n}(x^{(i)}_k)^2
\end{aligned}$
Therefore, we conclude that:
1. given $x^{(1)}, x^{(2)}, ..., x^{(n_i)}$, we can estimate $\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}$
2. given $\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}$, we can estimate $x^{(1)}, x^{(2)}, ..., x^{(n_i)}$

Thus, one possible solution is to randomly initialize $\Theta$, use it to estimate $X$, use that $X$ to obtain a better estimate of $\Theta$, and so on.
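
A minimal sketch of this back-and-forth scheme is given below. The notes do not say how each estimate is computed; here each sub-problem is solved exactly with the closed-form ridge regression solution (an alternating least squares variant), which is a choice made for this example. The helper names `fit_columns` and `cost`, the toy ratings, and the values of `lam` and `n` are likewise assumptions.

```python
import numpy as np

def fit_columns(F, Y, R, lam):
    """For each column j of Y, return the vector w_j minimizing
    0.5 * sum_{i: R[i, j] = 1} (F[i] @ w - Y[i, j])**2 + 0.5 * lam * ||w||**2."""
    n = F.shape[1]
    W = np.zeros((Y.shape[1], n))
    for j in range(Y.shape[1]):
        seen = R[:, j] == 1                          # rows with an observed rating
        A = F[seen].T @ F[seen] + lam * np.eye(n)    # ridge normal equations
        W[j] = np.linalg.solve(A, F[seen].T @ Y[seen, j])
    return W

def cost(Theta, X, Y, R, lam):
    """Squared error on observed ratings plus both regularization terms."""
    err = (X @ Theta.T - Y) * R
    return 0.5 * np.sum(err ** 2) + 0.5 * lam * (np.sum(Theta ** 2) + np.sum(X ** 2))

# Toy ratings: rows are items, columns are users, 0 where no rating exists.
Y = np.array([[5., 0., 1.], [4., 2., 0.], [0., 5., 4.], [1., 0., 0.]])
R = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]])
lam, n = 0.1, 2

rng = np.random.default_rng(0)
Theta = rng.standard_normal((Y.shape[1], n))   # random initial Theta
costs = []
for _ in range(10):
    X = fit_columns(Theta, Y.T, R.T, lam)      # estimate X given Theta
    Theta = fit_columns(X, Y, R, lam)          # re-estimate Theta given X
    costs.append(cost(Theta, X, Y, R, lam))
print(costs)
```

Because each half-step exactly minimizes the cost over one block of variables while the other is fixed, the recorded cost never increases from one round to the next.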

### 3.2 Collaborative Filtering
Although the iterative approach works, alternating back and forth between the two estimation problems costs additional time and computation. A different approach is to treat $X$ and $\Theta$ simultaneously as the parameters of a single optimization problem. We recall the cost function with respect to $\Theta$:
$\begin{aligned}
\frac{1}{2} \sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}^{} ((\theta^{(j)}) ^ T \cdot x^{(i)} - y^{(i, j)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta^{(j)}_k)^2
\end{aligned}$ 
as well as the cost function with respect to $X$:
$\begin{aligned}
\frac{1}{2} \sum_{i=1}^{n_i}\sum_{j:r(i,j)=1}^{} ((\theta^{(j)}) ^ T \cdot x^{(i)} - y^{(i, j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_i}\sum_{k=1}^{n}(x^{(i)}_k)^2
\end{aligned}$
Optimizing $X$ and $\Theta$ simultaneously amounts to minimizing:
$\begin{aligned}
J(\Theta, X) = \frac{1}{2} \sum_{(i,j):r(i,j)=1}^{}((\theta^{(j)}) ^ T \cdot x^{(i)} - y^{(i, j)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta^{(j)}_k)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_i}\sum_{k=1}^{n}(x^{(i)}_k)^2
\end{aligned}$
Note that the three ways of summing over the observed ratings are equivalent:
$\begin{aligned} 
\sum_{i=1}^{n_i}\sum_{j:r(i,j)=1}^{} = \sum_{(i,j):r(i,j)=1}^{} = \sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}^{}
\end{aligned}$
Informally, going through the users who rated item $i$, for every item $i$, visits exactly the pairs (item, user) $(i, j)$ where user $j$ rated item $i$, and so does going through the items rated by user $j$, for every user $j$.
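
The joint cost and the equivalence of the three summation orders can be checked numerically. The sketch below is only an illustration: the function name `joint_cost`, the random parameters, and the toy ratings are assumptions, and unobserved ratings are stored as `0` and masked with `R`.

```python
import numpy as np

def joint_cost(Theta, X, Y, R, lam):
    """J(Theta, X): squared error over observed pairs plus both regularizers."""
    err = (X @ Theta.T - Y) * R              # keep only pairs with r(i, j) = 1
    return (0.5 * np.sum(err ** 2)
            + 0.5 * lam * np.sum(Theta ** 2)
            + 0.5 * lam * np.sum(X ** 2))

rng = np.random.default_rng(1)
n_i, n_u, n = 4, 3, 2
X = rng.standard_normal((n_i, n))
Theta = rng.standard_normal((n_u, n))
Y = np.array([[5., 0., 1.], [4., 2., 0.], [0., 5., 4.], [1., 0., 0.]])
R = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]])

sq = (X @ Theta.T - Y) ** 2                  # squared error for every pair

# Items first, then the users who rated each item.
by_item = sum(sq[i, j] for i in range(n_i) for j in range(n_u) if R[i, j] == 1)
# Users first, then the items each user rated.
by_user = sum(sq[i, j] for j in range(n_u) for i in range(n_i) if R[i, j] == 1)
# Directly over all observed (item, user) pairs.
by_pair = np.sum(sq[R == 1])

print(by_item, by_user, by_pair)
print(joint_cost(Theta, X, Y, R, lam=0.1))
```

All three sums visit the same set of rated pairs, which is why the data term appears only once in $J(\Theta, X)$.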

### 3.3 Final algorithm
1. Initialize $x^{(1)}, x^{(2)}, ..., x^{(n_i)}$ and $\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}$ to small random values to break the symmetry
2. Minimize $J(\Theta, X)$ using an optimization algorithm such as gradient descent:
* $\theta^{(j)}_k := \theta^{(j)}_k - \alpha \cdot (\sum_{i:r(i,j)=1}^{}((\theta^{(j)}) ^ T \cdot x^{(i)} - y^{(i, j)}) x^{(i)}_k + \lambda \cdot \theta^{(j)}_k)$
* $x^{(i)}_k := x^{(i)}_k - \alpha \cdot (\sum_{j:r(i,j)=1}^{}((\theta^{(j)}) ^ T \cdot x^{(i)} - y^{(i, j)}) \theta^{(j)}_k + \lambda \cdot x^{(i)}_k)$
3. For a user with parameters $\theta$ and an item with features $x$, predict a rating of $\theta^{T}x$
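
The three steps can be sketched with plain gradient descent as below. This is an illustrative sketch under stated assumptions: the function name `train`, the toy ratings, and the hyperparameters (`alpha`, `lam`, `steps`, the `0.1` initialization scale) are not taken from the notes, and unobserved ratings are stored as `0` and masked with `R`.

```python
import numpy as np

def train(Y, R, n, alpha=0.01, lam=0.1, steps=500, seed=0):
    """Learn item features X and user parameters Theta jointly by gradient descent."""
    rng = np.random.default_rng(seed)
    n_i, n_u = Y.shape
    # Step 1: small random values so that the features do not stay identical.
    X = 0.1 * rng.standard_normal((n_i, n))
    Theta = 0.1 * rng.standard_normal((n_u, n))
    # Step 2: simultaneous gradient-descent updates of X and Theta.
    for _ in range(steps):
        err = (X @ Theta.T - Y) * R           # errors on observed pairs only
        grad_X = err @ Theta + lam * X        # sum over users who rated each item
        grad_Theta = err.T @ X + lam * Theta  # sum over items rated by each user
        X -= alpha * grad_X
        Theta -= alpha * grad_Theta
    return X, Theta

# Toy ratings: rows are items, columns are users, 0 where no rating exists.
Y = np.array([[5., 0., 1.], [4., 2., 0.], [0., 5., 4.], [1., 0., 0.]])
R = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]])

X, Theta = train(Y, R, n=2)

# Step 3: predicted rating of user j for item i is theta^(j) . x^(i).
predictions = X @ Theta.T
baseline_error = np.sum((Y * R) ** 2)                  # error of predicting 0 everywhere
final_error = np.sum(((predictions - Y) * R) ** 2)     # error on observed ratings
print(baseline_error, final_error)
```

Both gradients are computed from the same `err` before either matrix is modified, so the update is truly simultaneous rather than alternating.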

### 3.4 Implementation notes
#### 3.4.1 Matrix representations and vectorization
We denote 
$X = \begin{bmatrix}(x^{(1)})^T  \\
(x^{(2)})^ T \\
\vdots \\
(x^{(n_i)})^ T\end{bmatrix}$ and 
$\Theta = \begin{bmatrix}(\theta^{(1)})^T  \\
(\theta^{(2)})^ T \\
\vdots \\
(\theta^{(n_u)})^ T\end{bmatrix}$
The matrix of predicted ratings:
$\begin{bmatrix}
(\theta^{(1)}) ^ T x^{(1)} & (\theta^{(2)}) ^ T x^{(1)} & \cdots & (\theta^{(n_u)}) ^ T x^{(1)}\\
(\theta^{(1)}) ^ T x^{(2)} & (\theta^{(2)}) ^ T x^{(2)} & \cdots & (\theta^{(n_u)}) ^ T x^{(2)} \\
\vdots & \vdots & \ddots & \vdots \\
(\theta^{(1)}) ^ T x^{(n_i)} & (\theta^{(2)}) ^ T x^{(n_i)} & \cdots & (\theta^{(n_u)}) ^ T x^{(n_i)}
\end{bmatrix}$
can be computed in vectorized form as $X \cdot \Theta ^ T$, whose entry $(i, j)$ is the predicted rating of user $j$ for item $i$.
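
A quick numerical check of this vectorization, using random matrices whose sizes are chosen arbitrarily for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_i, n_u, n = 5, 3, 4
X = rng.standard_normal((n_i, n))        # one row per item
Theta = rng.standard_normal((n_u, n))    # one row per user

# Entry-by-entry version: theta^(j) . x^(i) for every (item, user) pair.
looped = np.zeros((n_i, n_u))
for i in range(n_i):
    for j in range(n_u):
        looped[i, j] = Theta[j] @ X[i]

# Vectorized version: a single matrix product.
vectorized = X @ Theta.T

print(np.allclose(looped, vectorized))
```

The single matrix product replaces the double loop and produces an $n_i \times n_u$ matrix.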

#### 3.4.2 Mean normalization
Consider a user $l$ who has not yet rated any item. Since no rating involves $\theta^{(l)}$, minimizing the cost function $J(\Theta, X)$ with respect to $\theta^{(l)}$ reduces to minimizing 
$\begin{aligned}
\frac{\lambda}{2} \sum_{k=1}^{n}(\theta^{(l)}_k)^2
\end{aligned}$
whose solution is $\theta^{(l)}_k = 0, ~ k = 1,2,...,n$. Thus every new user is assigned $\theta_{new} = 0$ and all their predicted ratings are estimated as $0$.

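Mean normalization overcomes this issue. In the initial rating matrix, replace every rating $y^{(i,j)}$ by $y^{(i,j)} - \mu_i$, where $\mu_i$ is the average rating of item $i$ computed over the users who rated it. The predicted rating is then no longer $\theta^{T}x$, but $\theta^{T}x + \mu_i$. A new user with $\theta_{new} = 0$ is therefore predicted to give each item its average rating $\mu_i$ rather than $0$.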
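
A minimal sketch of mean normalization is shown below. The function name `mean_normalize`, the toy ratings, and the random item features are assumptions for illustration; items with no rating at all are given $\mu_i = 0$ here, which is also an assumption since the notes do not cover that case.

```python
import numpy as np

def mean_normalize(Y, R):
    """Subtract each item's average observed rating from its observed ratings."""
    counts = R.sum(axis=1)
    # Assumption: an item nobody rated gets mu_i = 0 (avoids division by zero).
    mu = (Y * R).sum(axis=1) / np.maximum(counts, 1)
    Y_norm = (Y - mu[:, None]) * R           # unobserved entries stay at 0
    return Y_norm, mu

# Toy ratings: rows are items, columns are users.
# The last column is a new user who has not rated anything.
Y = np.array([[5., 0., 1., 0.],
              [4., 2., 0., 0.],
              [0., 5., 4., 0.],
              [1., 0., 0., 0.]])
R = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 0]])

Y_norm, mu = mean_normalize(Y, R)

# For the new user, theta = 0, so the prediction reduces to mu_i.
X = np.random.default_rng(0).standard_normal((4, 2))   # any item features
theta_new = np.zeros(2)
pred_new = X @ theta_new + mu

print(mu)
print(pred_new)
```

With these ratings the item averages are 3, 3, 4.5 and 1, and the new user's predicted ratings are exactly those averages.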