## Recommender systems
The idea behind recommender systems is to model the preferences of users with respect to a set of products that are described by certain features. We model each product $j$ by a vector
$$x^{(j)}=(x_1^{(j)},\dots,x_n^{(j)}),$$
where $x_i^{(j)}$ is the value of the $i$th feature of product $j$.<br>
As data we have a set of ratings that the clients have given to different products, and the goal is to estimate the ratings that a user would give to products they have not yet rated.<br><br>

To each client $k$ we associate a linear preference model that predicts the rating of product $j$ as $(\theta^{(k)})^Tx^{(j)}$, with the regularized cost function
$$J(\theta^{(k)})=\frac{1}{2m^{(k)}}\sum_{j:\ r(j,k)=1}\left[(\theta^{(k)})^Tx^{(j)}-y^{(j,k)}\right]^2+\frac{\lambda}{2m^{(k)}}\sum_{i=1}^n(\theta_i^{(k)})^2,$$
where $m^{(k)}$ is the number of products rated by user $k$, $r(j,k)\in\{0,1\}$ is the indicator function for whether client $k$ has already rated product $j$ (1) or not (0), and $y^{(j,k)}$ is the rating that user $k$ has given to product $j$, if defined. By means of linear regression we then find the parameters $\theta^{(k)}$ that minimize $J$. As the value of the constant $m^{(k)}$ does not affect the minimization problem, it is usually removed from $J$.<br><br>

The optimization problem for the recommender system with $N$ users/clients is then
$$\min_{\theta^{(1)},\dots,\theta^{(N)}}J(\theta^{(1)},\dots,\theta^{(N)})$$
where,
$$J(\theta^{(1)},\dots,\theta^{(N)})=\frac{1}{2}\sum_{k=1}^N\sum_{j:\ r(j,k)=1}\left[(\theta^{(k)})^Tx^{(j)}-y^{(j,k)}\right]^2+\frac{\lambda}{2}\sum_{k=1}^N\sum_{i=1}^n(\theta_i^{(k)})^2.$$
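The gradient is given by
$$\frac{\partial J}{\partial \theta_l^{(p)}}=\sum_{j:\ r(j,p)=1}\left[(\theta^{(p)})^Tx^{(j)}-y^{(j,p)}\right]x_l^{(j)}+\lambda\theta_l^{(p)}(1-\delta_{l,0}),$$
where $\delta_{l,0}$ is the Kronecker delta, so that the bias parameter $\theta_0^{(p)}$ (paired with the constant feature $x_0^{(j)}=1$) is not regularized.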
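
The cost and gradient above can be evaluated for all users at once. The following is a minimal NumPy sketch, not a reference implementation. It assumes that the ratings are stored in an $M\times N$ array `Y` (with 0 in the unrated positions), that `R` holds the indicator $r(j,k)$, and that column 0 of `X` and `Theta` holds the bias feature and bias parameter. The function name and array layout are illustrative choices.

```python
import numpy as np

def content_based_cost_grad(Theta, X, Y, R, lam):
    """Cost J and its gradient with respect to Theta, for known features X.

    Theta: (N, n+1) user parameters, column 0 is the bias theta_0.
    X:     (M, n+1) product features, column 0 is the constant x_0 = 1.
    Y:     (M, N) ratings, with any value (e.g. 0) where unrated.
    R:     (M, N) indicator, R[j, k] = 1 if user k rated product j.
    """
    E = (X @ Theta.T - Y) * R      # residuals, zeroed where r(j,k) = 0
    reg = Theta.copy()
    reg[:, 0] = 0.0                # the bias parameter is not regularized
    J = 0.5 * np.sum(E ** 2) + 0.5 * lam * np.sum(reg ** 2)
    grad = E.T @ X + lam * reg     # entry (p, l) is dJ/dtheta_l^(p)
    return J, grad
```

The returned `grad` has the same shape as `Theta`, so it can be passed directly to a gradient descent update such as `Theta -= alpha * grad`, where the step size `alpha` is a hypothetical tuning parameter.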

### Collaborative filtering
Previously we assumed that a set of features describing each product was available. In general, however, it is not easy to assign values to such features manually, for example how romantic a movie is. Suppose that we have a set of movies and choose as features how romantic and how action-oriented each movie is. The goal is now to find the feature vector $x^{(j)}$ of each movie, but to do that we need an estimate of the viewers' preferences $\theta^{(k)}$. We therefore ask the viewers to state, for example on a scale from 0 to 5, how much they like romantic movies (parameters $\theta_1^{(k)}$) and how much they like action movies (parameters $\theta_2^{(k)}$). With this estimate of the parameters we can estimate the $M$ feature vectors as the solution to the minimization problem
$$\min_{x^{(1)},\dots,x^{(M)}}\frac{1}{2}\sum_{j=1}^{M}\sum_{k:\ r(j,k)=1}\left[(\theta^{(k)})^Tx^{(j)}-y^{(j,k)}\right]^2+\frac{\lambda}{2}\sum_{j=1}^{M}\sum_{i=1}^n(x_i^{(j)})^2,$$
for all the features.<br><br>

To obtain both the parameters $\theta^{(k)}$ for our users and the feature values $x^{(j)}$ for our products, we can start with an educated guess for one of them, say the parameters, and use the optimization procedure to obtain the features, which are then used to refine the estimate of the parameters, and so on until some convergence criterion is met.<br><br>
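
The alternating procedure just described can be sketched as follows. This is an illustrative sketch under stated assumptions: there are no bias terms, $\lambda>0$, and each sub-problem is solved exactly through its ridge-regression normal equations (a choice made here for brevity; the text only says "the optimization procedure", so gradient descent would serve equally well). The names `ridge_rows` and `alternate` are hypothetical.

```python
import numpy as np

def ridge_rows(A, Y, R, lam):
    """For each row i of Y, find b_i minimizing
    0.5 * sum_{c: R[i,c]=1} (A[c] @ b_i - Y[i,c])**2 + 0.5 * lam * |b_i|^2.
    Assumes lam > 0, so every linear system below is nonsingular."""
    n = A.shape[1]
    B = np.zeros((Y.shape[0], n))
    for i in range(Y.shape[0]):
        rated = R[i] == 1
        A_i = A[rated]                       # only the rated entries count
        B[i] = np.linalg.solve(A_i.T @ A_i + lam * np.eye(n),
                               A_i.T @ Y[i, rated])
    return B

def alternate(Y, R, Theta0, lam=1.0, tol=1e-6, max_iter=100):
    """Alternate between solving for features X and parameters Theta."""
    Theta = Theta0.copy()
    prev = np.inf
    for _ in range(max_iter):
        X = ridge_rows(Theta, Y, R, lam)       # features given parameters
        Theta = ridge_rows(X, Y.T, R.T, lam)   # parameters given features
        E = (X @ Theta.T - Y) * R
        J = 0.5 * np.sum(E ** 2) + 0.5 * lam * (np.sum(X ** 2) + np.sum(Theta ** 2))
        if prev - J < tol:                     # stop when the cost stalls
            break
        prev = J
    return X, Theta, J
```

Here `Y` is the $M\times N$ array of ratings, `R` the matching indicator array, and `Theta0` the initial guess for the parameters, for example the answers collected from the viewers. Because each half-step minimizes the regularized cost exactly over one block of variables, the cost cannot increase from one sweep to the next.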

Alternatively, one can state the problem simultaneously for both the parameters and the features
$$J(\theta^{(1)},\dots,\theta^{(N)},x^{(1)},\dots,x^{(M)})=\frac{1}{2}\sum_{(j,k):\ r(j,k)=1}\left[(\theta^{(k)})^Tx^{(j)}-y^{(j,k)}\right]^2+\frac{\lambda}{2}\sum_{k=1}^N\sum_{i=1}^n(\theta_i^{(k)})^2+\frac{\lambda}{2}\sum_{j=1}^{M}\sum_{i=1}^n(x_i^{(j)})^2,$$
and solve
$$\min_{\theta^{(1)},\dots,\theta^{(N)},x^{(1)},\dots,x^{(M)}}J(\theta^{(1)},\dots,\theta^{(N)},x^{(1)},\dots,x^{(M)}).$$
In this case we no longer use the bias feature $x_0^{(j)}=1$ or the bias parameter $\theta_0^{(k)}$, so every component is regularized and the gradient is
$$\frac{\partial J}{\partial \theta_l^{(p)}}=\sum_{j:\ r(j,p)=1}\left[(\theta^{(p)})^Tx^{(j)}-y^{(j,p)}\right]x_l^{(j)}+\lambda\theta_l^{(p)},$$
$$\frac{\partial J}{\partial x_l^{(q)}}=\sum_{k:\ r(q,k)=1}\left[(\theta^{(k)})^Tx^{(q)}-y^{(q,k)}\right]\theta_l^{(k)}+\lambda x_l^{(q)}.$$
It is important to note that the features that the algorithm learns are not always easy to interpret: they rarely end up being something like $x_1=$ romance, $x_2=$ action and so on, and more often they are some combination of these aspects.
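
To make the sums in the joint cost and its gradient concrete, here is a small loop-based sketch of gradient descent on both the parameters and the features. It is only an illustration under assumptions not stated in the text: the ratings are kept in a dictionary keyed by the rated pairs $(j,k)$, both sets of vectors start from small random values (so that the features do not all stay identical), and the step size `alpha` and number of steps are arbitrary.

```python
import numpy as np

def joint_cost(thetas, xs, ratings, lam):
    """ratings: dict mapping (j, k) -> y^(j,k), containing rated pairs only."""
    sq = sum((thetas[k] @ xs[j] - y) ** 2 for (j, k), y in ratings.items())
    return 0.5 * sq + 0.5 * lam * (np.sum(thetas ** 2) + np.sum(xs ** 2))

def joint_gradients(thetas, xs, ratings, lam):
    g_theta = lam * thetas            # regularization part of dJ/dtheta
    g_x = lam * xs                    # regularization part of dJ/dx
    for (j, k), y in ratings.items():
        err = thetas[k] @ xs[j] - y   # prediction error on a rated pair
        g_theta[k] += err * xs[j]
        g_x[j] += err * thetas[k]
    return g_theta, g_x

def gradient_descent(ratings, N, M, n, lam=0.1, alpha=0.01, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    thetas = 0.1 * rng.normal(size=(N, n))   # small random initialization
    xs = 0.1 * rng.normal(size=(M, n))
    for _ in range(steps):
        g_theta, g_x = joint_gradients(thetas, xs, ratings, lam)
        thetas -= alpha * g_theta            # simultaneous update of both
        xs -= alpha * g_x
    return thetas, xs
```

A predicted rating for user `k` and product `j` is then `thetas[k] @ xs[j]`.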

### Low-rank matrix factorization
In order to vectorize the operations needed to compute $J$ and its gradient, we notice that the data obtained from the initial client ratings of the products ($y^{(j,k)}$, for product $j$ and client $k$) can be organized as a matrix
$$Y=\left(\begin{array}{cccc}
y^{(1,1)} & y^{(1,2)} & \cdots & y^{(1,N)}\\
y^{(2,1)} & y^{(2,2)} & \cdots & y^{(2,N)}\\
\vdots & \vdots & \ddots & \vdots\\
y^{(M,1)} & y^{(M,2)} & \cdots & y^{(M,N)}\\
\end{array}\right).$$
Likewise, we can construct matrices for the parameters and the features where each row contains one of the vectors, that is
$$\Theta=\left(\begin{array}{c}
\theta^{(1)}\\
\vdots\\
\theta^{(N)}\\
\end{array}\right),$$
and 
$$X=\left(\begin{array}{c}
x^{(1)}\\
\vdots\\
x^{(M)}\\
\end{array}\right),$$
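and then compute the residual matrix
$$X\Theta^T-Y,$$
whose $(j,k)$ entry is $(\theta^{(k)})^Tx^{(j)}-y^{(j,k)}$; only the entries with $r(j,k)=1$ enter $J$. The name of the method comes from the fact that the $M\times N$ prediction matrix $X\Theta^T$ has rank at most $n$.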
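
With these matrices the joint cost and both gradients reduce to a few matrix products. The sketch below assumes NumPy arrays `X` of shape $M\times n$ and `Theta` of shape $N\times n$ (no bias terms), ratings `Y` with 0 in the unrated positions, and an indicator array `R` of the same shape as `Y`. The function name is an illustrative choice.

```python
import numpy as np

def cofi_cost_grad(X, Theta, Y, R, lam):
    """Joint cost J and its gradients with respect to X and Theta."""
    E = (X @ Theta.T - Y) * R               # masked residual matrix, (M, N)
    J = 0.5 * np.sum(E ** 2) + 0.5 * lam * (np.sum(Theta ** 2) + np.sum(X ** 2))
    X_grad = E @ Theta + lam * X            # entry (q, l) is dJ/dx_l^(q)
    Theta_grad = E.T @ X + lam * Theta      # entry (p, l) is dJ/dtheta_l^(p)
    return J, X_grad, Theta_grad
```

Multiplying the residuals elementwise by `R` plays the role of restricting the sums to the pairs with $r(j,k)=1$.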

### Mean normalization
What to do with a user $l$ who has not rated any product? In that case only the regularization term of $J$ involves $\theta^{(l)}$, so the minimization yields $\theta^{(l)}=0$ and every predicted rating for that user is 0. To avoid this, we can center the data $Y$ by subtracting from each row its mean, computed over the rated entries only. The centered ratings then have row-wise mean equal to 0, and the predictions for user $l$ start at the mean rating of each product. When predicting an actual rating, we add back the mean of the corresponding row.
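
A minimal sketch of this normalization is given below. It assumes, as an illustration, that `Y` stores 0 in the unrated positions and that `R` is the indicator array; products with no ratings at all are given mean 0, which is a choice made here and not something the text prescribes. The function names are hypothetical.

```python
import numpy as np

def normalize_ratings(Y, R):
    """Subtract from each row (product) its mean over the rated entries."""
    counts = R.sum(axis=1)
    sums = (Y * R).sum(axis=1)
    # Rows without any rating keep mean 0 instead of dividing by zero.
    mu = np.divide(sums, counts, out=np.zeros(len(counts)), where=counts > 0)
    Y_centered = (Y - mu[:, None]) * R     # unrated entries stay at 0
    return Y_centered, mu

def predict(X, Theta, mu):
    """Predicted ratings, with the row means added back."""
    return X @ Theta.T + mu[:, None]
```

The model is fit on `Y_centered` in place of `Y`. For a user whose parameter vector is 0, `predict` then returns the mean rating of each product.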