# Collaborative filtering

This is a family of models that requires only the rating matrix $R$.

In [1]:
import numpy as np
import pandas as pd

## Memory based

### User based

Aka **user-user filtering**.

This method is based on the idea that if two users rated the items they have in common similarly, it makes sense to recommend them the same items in the future. The ratings are arranged in a matrix $R$ with users as rows and items as columns:

|   | $i_1$ | $i_2$ | ... | $i_n$|
|:--|:------|:------|:----|:-----|
|$u_1$|$r_{11}$|$r_{12}$|...|$r_{1n}$|
|$u_2$|$r_{21}$|$r_{22}$|...|$r_{2n}$|
|...|...|...|...|...|
|$u_m$|$r_{m1}$|$r_{m2}$|...|$r_{mn}$|

<!--
# GENERATE arr
np.random.seed(10)
sample_size = 10

arr = np.random.randint(low=1, high=11, size=[1, sample_size])
arr = np.concatenate(
    [
        arr,
        arr + np.random.randint(-1, 2, size=sample_size),
        np.abs(arr - 10) + np.random.randint(-1, 2, size=sample_size)
    ],
    axis = 0
)
arr = np.where((arr>10), 10 , arr)
arr = np.where((arr<1), 1, arr)

# Display it as markdown

print(pd.DataFrame(
    arr,
    index=[f"$u_{{{i+1}}}$" for i in range(arr.shape[0])], 
    columns=[f"$i_{{{j+1}}}$" for j in range(arr.shape[1])]
).to_markdown())
-->

Consider a simple example:

|         |   $i_{1}$ |   $i_{2}$ |   $i_{3}$ |   $i_{4}$ |   $i_{5}$ |   $i_{6}$ |   $i_{7}$ |   $i_{8}$ |   $i_{9}$ |   $i_{10}$ |
|:--------|----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|-----------:|
| $u_{1}$ |        10 |         5 |         1 |         2 |        10 |         1 |         2 |         9 |        10 |          1 |
| $u_{2}$ |        10 |         4 |         2 |         1 |         9 |         1 |         3 |         8 |        10 |          2 |
| $u_{3}$ |         1 |         4 |         8 |         9 |         1 |        10 |         8 |         1 |         1 |          9 |

Without any special techniques, you can see that user 1 and user 2 have very similar preferences. But user 3 has completely different preferences. 

In the following cell, the $R$ matrix under consideration is defined as a numpy array.

In [2]:
arr = np.array([
    [10,  5,  1,  2, 10,  1,  2,  9, 10,  1],
    [10,  4,  2,  1,  9,  1,  3,  8, 10,  2],
    [ 1,  4,  8,  9,  1, 10,  8,  1,  1,  9]
])

So let's try to formalise their similarity somehow. In this case we'll use Pearson's correlation coefficient:

In [27]:
corrmatrix = np.corrcoef(arr)
print("Similarity between first and second user:", corrmatrix[0,1])
print("Similarity between second and third user:", corrmatrix[1,2])

Similarity between first and second user: 0.9802681400094546
Similarity between second and third user: -0.9650036744440637


So we have confirmed that the first user is close to the second, while the third is very different. According to our approach, we are therefore more likely to recommend to the second user the games that the first one liked, and vice versa.
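The same correlation matrix can be used to look up each user's closest neighbour automatically. The snippet below is a minimal sketch of that step on the example matrix; the variable names `sim` and `nearest` are introduced here for illustration only.

```python
import numpy as np

# Ratings from the example above: rows are users, columns are items.
arr = np.array([
    [10,  5,  1,  2, 10,  1,  2,  9, 10,  1],
    [10,  4,  2,  1,  9,  1,  3,  8, 10,  2],
    [ 1,  4,  8,  9,  1, 10,  8,  1,  1,  9]
])

# User-user Pearson correlation matrix.
sim = np.corrcoef(arr)

# Every user correlates perfectly with themselves, so the diagonal
# is excluded before searching for the nearest neighbour.
np.fill_diagonal(sim, -np.inf)
nearest = sim.argmax(axis=1)

for user, neighbour in enumerate(nearest):
    print(f"user {user + 1} -> most similar user: {neighbour + 1}")
```

With only three users the third one still gets a "nearest" neighbour even though both of its correlations are negative, so in practice a minimum similarity threshold would probably be needed.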

### Item based

Aka **item-item filtering**.

The item-based approach is technically the same as the user-based one. Here we look for items whose ratings $r_{ij}, i=\overline{1,n}$, given by the same users, are similar.

So the $R$ matrix is just transposed compared to the user based approach:

|   | $u_1$ | $u_2$ | ... | $u_m$|
|:--|:------|:------|:----|:-----|
|$i_1$|$r_{11}$|$r_{12}$|...|$r_{1m}$|
|$i_2$|$r_{21}$|$r_{22}$|...|$r_{2m}$|
|...|...|...|...|...|
|$i_n$|$r_{n1}$|$r_{n2}$|...|$r_{nm}$|

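**Note** that, in contrast to the table in the user-based section, the order of the indices of $R$ is swapped here: the first index now refers to the item and the second to the user. That is, the element $r_{ij}$ of this table corresponds to $r_{ji}$ in the user-based notation.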
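In code, switching from user-based to item-based similarity amounts to transposing the matrix before computing correlations. The following sketch reuses the example ratings from the user-based section; `item_sim` is a name chosen here for illustration.

```python
import numpy as np

# Same example ratings: rows are users, columns are items.
arr = np.array([
    [10,  5,  1,  2, 10,  1,  2,  9, 10,  1],
    [10,  4,  2,  1,  9,  1,  3,  8, 10,  2],
    [ 1,  4,  8,  9,  1, 10,  8,  1,  1,  9]
])

# Transposing makes items the rows, so corrcoef returns item-item similarities.
item_sim = np.corrcoef(arr.T)

# Equivalent call without an explicit transpose.
item_sim_alt = np.corrcoef(arr, rowvar=False)

print(item_sim.shape)
print("similarity of item 1 and item 9:", item_sim[0, 8])
print("similarity of item 1 and item 6:", item_sim[0, 5])
```

Each item is described by only three ratings here, so these correlations are far noisier than they would be on a realistic dataset.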

### Final algorithm

Here we take a general look at how collaborative filtering works. The operations are identical in the user-based and item-based cases, so we only consider an abstract matrix $R$ with a target value $r_{ij}$ for the combination of row $i$ and column $j$. For the sake of simplicity, I'll sometimes refer to rows as objects.

**Estimate similarities** For each pair of rows, let's say rows $k,t,t\neq k$, we compute some metric that estimates their similarities $c_{kt}$, for example the Pearson correlation coefficient.

An important nuance is that for some $i,j:\nexists r_{ij}$, which means that the combination $i,j$ did not occur. So when estimating the similarity $c_{kt}$ we can only use those $j:\exists r_{kj} \& \exists r_{tj}$. For convenience, this set of $j$ is denoted by $J$.

Here $c_{kt},t\neq k$ is the similarity measure of the $k$-th and $t$-th rows. In this example we use the Pearson correlation coefficient, with $\overline{r_{k}}$ denoting the mean rating of row $k$:

$$c_{kt} = 
\frac{\sum_{j\in J}(r_{kj}-\overline{r_{k}})(r_{tj}-\overline{r_{t}})}
{\sqrt{\sum_{j\in J}(r_{kj}-\overline{r_{k}})^2}\sqrt{\sum_{j \in J}(r_{tj}-\overline{r_{t}})^2}}
$$
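The formula above can be sketched in code as follows. This is an illustration under two assumptions that the text does not fix: missing ratings are encoded as `NaN`, and the means $\overline{r_{k}}$, $\overline{r_{t}}$ are taken over the common set $J$ only. The function name `pearson_on_common` is hypothetical.

```python
import numpy as np

def pearson_on_common(r_k, r_t):
    """Pearson correlation of two rows over the columns present in both (set J)."""
    r_k = np.asarray(r_k, dtype=float)
    r_t = np.asarray(r_t, dtype=float)

    # J: columns where both rows have a rating (NaN marks a missing one).
    common = ~np.isnan(r_k) & ~np.isnan(r_t)
    if common.sum() < 2:
        return np.nan  # too few shared ratings to define a correlation

    # Deviations from the row means, computed over J (assumption).
    d_k = r_k[common] - r_k[common].mean()
    d_t = r_t[common] - r_t[common].mean()

    denom = np.sqrt((d_k ** 2).sum()) * np.sqrt((d_t ** 2).sum())
    if denom == 0:
        return np.nan  # one of the rows is constant on J
    return (d_k * d_t).sum() / denom

# Users 1 and 2 from the earlier example, each with one rating removed.
u1 = [10, 5, np.nan, 2, 10, 1, 2, 9, 10, 1]
u2 = [10, 4, 2, 1, 9, np.nan, 3, 8, 10, 2]
print(pearson_on_common(u1, u2))
```

Returning `NaN` when fewer than two columns overlap is one possible convention; a caller could equally treat such pairs as having zero similarity.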

**Estimate** Say we are considering row $k'$ and need to suggest the columns $j$ that best match it. For every $j=\overline{1,m}$ we compute an estimate $a_{k'j}(\overline{c_{k'}})$ of how well the $j$-th column fits the $k'$-th row, where $\overline{c_{k'}}=(c_{k't})_{t\neq k'}$ is the vector of similarities of the other objects to object $k'$.

**Finally** for row $k'$ we select the $j$ with the best $a_{k'j}(\overline{c_{k'}})$, so the final answer is $j' = \arg\max_{j}\left[a_{k'j}(\overline{c_{k'}})\right]$.
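The three steps can be put together in a short sketch. The text leaves the form of $a_{k'j}$ open, so the scoring rule used here is an assumption made purely for illustration: a similarity-weighted average of the ratings given by positively correlated rows. Missing ratings are assumed to be encoded as `NaN`, and the function name `recommend` is hypothetical.

```python
import numpy as np

def recommend(R, k):
    """Score the unrated columns of row k and return (best_column, scores).

    NaN marks a missing rating. The scoring rule is an illustrative
    assumption: a similarity-weighted average over positively correlated rows.
    """
    R = np.asarray(R, dtype=float)
    n_rows, n_cols = R.shape

    # Step 1: similarity c_kt of every other row t to row k, over common columns J.
    sims = np.zeros(n_rows)
    for t in range(n_rows):
        if t == k:
            continue
        common = ~np.isnan(R[k]) & ~np.isnan(R[t])
        if common.sum() >= 2:
            c = np.corrcoef(R[k, common], R[t, common])[0, 1]
            sims[t] = 0.0 if np.isnan(c) else c

    # Step 2: estimate a_kj for each column that row k has not rated yet.
    weights = np.clip(sims, 0.0, None)  # ignore dissimilar rows (assumption)
    scores = np.full(n_cols, -np.inf)
    for j in range(n_cols):
        if not np.isnan(R[k, j]):
            continue  # already rated, nothing to recommend
        w = weights * ~np.isnan(R[:, j])
        if w.sum() > 0:
            scores[j] = (w * np.nan_to_num(R[:, j])).sum() / w.sum()

    # Step 3: pick the column with the best estimate, if any could be scored.
    best = int(scores.argmax()) if np.isfinite(scores).any() else None
    return best, scores

# Example matrix with two of user 2's ratings hidden (items 1 and 3).
R = np.array([
    [10,  5,  1,  2, 10,  1,  2,  9, 10,  1],
    [10,  4,  2,  1,  9,  1,  3,  8, 10,  2],
    [ 1,  4,  8,  9,  1, 10,  8,  1,  1,  9]
], dtype=float)
R[1, [0, 2]] = np.nan

best, scores = recommend(R, k=1)
print("recommended column index:", best)
print("scores for the hidden columns:", scores[[0, 2]])
```

Because user 3 is negatively correlated with user 2, only user 1 contributes to the estimates in this example. Other choices of $a_{k'j}$, such as averaging mean-centred ratings, fit the same three-step structure.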

## Model based

Model-based methods work by building models that attempt to predict a rating for a user-item pair by using ratings as features. 
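One common family of model-based methods is low-rank matrix factorization. The sketch below only illustrates the underlying idea, using a truncated SVD of the fully observed toy matrix from the memory-based section. A practical recommender would instead have to fit the factors on the observed entries only; the helper `low_rank` is a name introduced here.

```python
import numpy as np

# Fully observed example ratings: rows are users, columns are items.
arr = np.array([
    [10,  5,  1,  2, 10,  1,  2,  9, 10,  1],
    [10,  4,  2,  1,  9,  1,  3,  8, 10,  2],
    [ 1,  4,  8,  9,  1, 10,  8,  1,  1,  9]
], dtype=float)

# Factor the matrix into user factors (U * s) and item factors (Vt).
U, s, Vt = np.linalg.svd(arr, full_matrices=False)

def low_rank(rank):
    """Reconstruct the ratings from the first `rank` latent factors."""
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

for rank in (1, 2, 3):
    err = np.linalg.norm(arr - low_rank(rank))
    print(f"rank {rank}: Frobenius reconstruction error = {err:.3f}")
```

The reconstruction error shrinks as more latent factors are kept, which suggests that a few factors already capture most of the structure in such a small matrix.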