# Prediction functions

In the collaborative approach, once you have identified similar objects, you need to use them to predict preferences for items. There is a formula that allows you to do this. This formula was difficult for me to understand, so this page focuses on understanding this formula.

In [31]:
import numpy as np

from sklearn.datasets import make_blobs
from IPython.display import HTML, Markdown

# this is dimentions of R
# matrix that will be used
# as example
r_width = 10
r_height = 20

# this is a header for tables for pretty display
common_header = (
    "<tr>"
        "<th rowspan=\"2\">Object</th>"
        f"<th colspan=\"{str(r_width)}\" style='text-align:center'>Ranks of the items</th>"
        "<th rowspan=\"2\">Group</th>"
    "</tr>"
    "<tr>"+
        "".join([f"<th>{str(i)}</th>" for i in range(r_width)])+
    "</tr>"
)

## Task generation

The following cell generates an example that I'll use to show the sense of some transformations.

In [2]:
R, groups = make_blobs(
    n_samples=r_height,
    n_features=r_width,
    centers=3,
    random_state=10
)
R = np.round((R-R.min())*10/(R.max()-R.min())).astype(int)
# add bias for each object
bias = np.random.randint(-2,3, [R.shape[0], 1])
R = R + bias

# sometimes bias can lead to ratings
R = np.where(R<0, 0, R)
R = np.where(R>10, 10, R)


# some code to display result as HTML table
content = "".join([
    (
        "<tr>" + 
            f"<td>{ind}</td>" + 
            "".join([f"<td>{val}</td>" for val in obj]) + 
            f"<td>{groups[ind]}</td>"
        "</tr>"
    )
    for ind, obj in enumerate(R)
])
HTML("<table>" + common_header + content + "</table>")

Object,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Group
Object,0,1,2,3,4,5,6,7,8,9,Group
0,5,8,0,2,6,3,5,1,7,5,1
1,9,1,7,9,6,4,3,9,3,4,0
2,6,9,0,3,8,5,7,3,7,6,1
3,4,0,2,5,2,3,5,4,4,4,2
4,4,7,0,4,6,5,6,2,7,5,1
5,8,4,6,8,7,7,9,7,8,8,2
6,8,10,1,6,8,6,7,3,9,8,1
7,10,10,3,8,10,8,9,5,10,9,1
8,7,4,6,9,6,7,9,7,9,8,2
9,7,10,1,5,8,6,8,4,10,7,1


So let $k'$ - is object to which we need to recoomend something.

So for the example under consideration we'll use $k' = 5$.

In [3]:
consideration_object = 5

## Collaboration

The collaboration for $k'$ object is a set of objects that we think are similar to it. We measure similarity by the Pearson correlation coefficient.

So we can define similarity as the set of objects that have $Sim(i,k') > Sim'$ or more formally $U_{k'}=\left\{i\in U | Sim_{k'i} > Sim' \right\}$. So now we have a hyper-parameter of the algorithm $c'$ that controls how many objects are used to approximate preferences for the object.

The next cell shows a table with the correlation coefficients of the objects in the example with $k'=5$ on the left, and it's collaboration in the case $Sim'=0.8$ on the right.

In [4]:
# it's indices of objects excluding
# the object for which we are generating 
# predictions
other_indices = np.concatenate([
    np.arange(0,consideration_object), 
    np.arange(consideration_object+1, R.shape[0])
])
other_R = R[other_indices, :]
correlations = np.corrcoef(
    other_R, R[consideration_object, :]
)[0,1:]

# HTML code for input
# table that will be displayed
# on the left sides
header = (
    "<tr>"
        "<th rowspan=\"2\">object</th>"
        f"<th colspan=\"{R.shape[1]}\" style='text-align:center'>Ranks of the items</th>"
        "<th rowspan=\"2\">corr. coef</th>"
    "</tr>"
    "<tr>"+
        "".join([f"<th>{str(i)}</th>" for i in range(R.shape[1])])+
    "</tr>"
)
content = "".join([
    (
        "<tr>" + 
        f"<td>{obj}</td>" + 
        "".join([f"<td>{val}</td>" for val in R[obj,:]]) +
        f"<td>{str(correlations[i])}</td>" + 
        "</tr>"
    )
    for i, obj in enumerate(other_indices)
])
input_table = "<table>" + header + content + "</table>"
del header, content

# finding collaboration
collatoratoin_indices = other_indices[correlations > 0.8]
collaboration = R[collatoratoin_indices,:]


# HTML code for table that represents
# collaboration that is on the right side
header = (
    "<tr>"
        "<th rowspan=\"2\">object</th>"
        f"<th colspan=\"{R.shape[1]}\" style='text-align:center'>Ranks of the items</th>"
    "</tr>"
    "<tr>"+
        "".join([f"<th>{str(i)}</th>" for i in range(R.shape[1])])+
    "</tr>"
)
content = "".join([
    "<tr>" +
        f"<td>{object_ind}</td>"+
        ''.join(['<td>'+str(v)+'</td>' for v in obj])+
    "</tr>"
     for obj, object_ind in zip(collaboration, collatoratoin_indices)
])
collaboration_table = "<table>"+header+content+"</table>"
del header, content

HTML(
    "<div style='display: flex;justify-content: space-around;'>"+
    "<div>" + 
        "<p style='font-size:17px;text-align:center'>Input correlations</p>" + 
        input_table + 
    "</div>" +
    "<div style='font-size:100px'>→</div>"
    "<div>" + 
        "<p style='font-size:17px;text-align:center'>Collaboration</p>" + 
        collaboration_table + 
    "</div>"
    "</div>"
)

object,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,corr. coef
object,0,1,2,3,4,5,6,7,8,9,corr. coef
0,5,8,0,2,6,3,5,1,7,5,-0.6846628826516814
1,9,1,7,9,6,4,3,9,3,4,0.9569230732464372
2,6,9,0,3,8,5,7,3,7,6,-0.2872892478789914
3,4,0,2,5,2,3,5,4,4,4,0.9139997048114736
4,4,7,0,4,6,5,6,2,7,5,0.9520726033074148
6,8,10,1,6,8,6,7,3,9,8,0.8975017621528957
7,10,10,3,8,10,8,9,5,10,9,-0.1416274390752352
8,7,4,6,9,6,7,9,7,9,8,0.9674783088625508
9,7,10,1,5,8,6,8,4,10,7,-0.5417284266184769
10,8,0,5,6,4,1,2,6,1,0,-0.5870995880756341

object,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items
object,0,1,2,3,4,5,6,7,8,9
1,9,1,7,9,6,4,3,9,3,4
3,4,0,2,5,2,3,5,4,4,4
4,4,7,0,4,6,5,6,2,7,5
6,8,10,1,6,8,6,7,3,9,8
8,7,4,6,9,6,7,9,7,9,8
11,9,2,7,9,6,3,3,9,4,2


So the collaboration in this case will be $U_5=\{1,3,4,6,8,11\}$. Set of indices of objects belonging to the collaboration of the 5th element.

## Functions

Now when we have collaboration we can predict expected preferences for the items for 5-th element.

Estimation of the preference of $j$-th item for user $k'$ can be computed using:

$$a_{k', j}=\overline{r}_{k'} + \frac{\sum_{l\in U_{k'}}(r_{lj}-\overline{r}_l)Sim(k',l)}{\sum_{l \in U_{k'}}|Sim(k',l)|}$$

Suppose we want to make predict for some $j$-s item. $j$ is defined in the cell before:

In [5]:
interest_j = 4

So lets research this formula step by step.

### Remove object bias

Some objects may have some specificity, which is expressed in the fact that the average valuation of this or that object differs from others. For example, let's say we are talking about the clients of a certain cinema service - some clients are simply more rigorous themselves, so they have any score lower.

To understand what I'm talking about, consider objects of the same group but with completely different average $r$.

In [6]:
considered_group = 0
group_indeces = np.where(groups==considered_group)[0]
ind_sum = np.sum(R[group_indeces], axis=1)

max_sum_index = group_indeces[np.argmax(ind_sum)]
min_sum_index = group_indeces[np.argmin(ind_sum)]

content = (
    "<tr>" + 
        f"<td>Maximum (original index {str(max_sum_index)})</td>" + 
        "".join([f"<td>{val}</td>" for val in R[max_sum_index]]) + 
        f"<td>{str(considered_group)}</td>" +
    "</tr><tr>" +
        f"<td>Minimum (original index {str(min_sum_index)})</td>" +
        "".join([f"<td>{val}</td>" for val in R[min_sum_index]]) + 
        f"<td>{str(considered_group)}</td>" +
    "</tr>"
)
display(HTML(f"<table>{common_header + content}</table>"))
corelation_coefficient = \
    np.corrcoef(R[max_sum_index], R[min_sum_index])[0,1]
message_text = f"Correlation coefficient - {round(corelation_coefficient, 3)}"
display(HTML(f"<p style='font-size:15px'>{message_text}</p>"))

Object,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Ranks of the items,Group
Object,0,1,2,3,4,5,6,7,8,9,Group
Maximum (original index 17),10,3,9,9,8,5,4,9,4,4,0
Minimum (original index 13),5,0,4,5,4,1,0,6,0,0,0


They have completely different numbers, but the correlation is very strong - they both prefer the same items, but one is generally less "generous" with the grade.

So by operation $(r_{lj}-\overline{r}_l)$ in the numerator we are trying to remove bias $l$-th object from the ratings for $j$-th item. Or more spcificly for example under consideration:

In [35]:
values = collaboration[:, [interest_j]]
biases = np.mean(collaboration, axis=1)[:, np.newaxis]
corrected_items = values-biases

header = f"""
|$l$|$r_{{l,{interest_j}}}$|$\overline{{r_l}}$|$r_{{l,{interest_j}}} - \overline{{r_l}}$|
|---|-------|--------|-----|
"""
content = "\n".join([
    (
        "|" + "|".join([
            str(collatoratoin_indices[i]),
            str(values.ravel()[i]),
            str(biases.ravel()[i]),
            str(round(corrected_items.ravel()[i],3))
        ]) + "|"
    )
    for i in range(len(collaboration))
])
ans = (header + content)

|$l$|$r_{l,4}$|$\overline{r_l}$|$r_{l,4} - \overline{r_l}$|
|---|-------|--------|-----|
|1|6|5.5|0.5|
|3|2|3.3|-1.3|
|4|6|4.6|1.4|
|6|8|6.6|1.4|
|8|6|7.2|-1.2|
|11|6|5.4|0.6|

Thus, using natural language $(r_{lj}-\overline{r}_l)$, the preferences of the objects from the collaboration are corrected for the item $j$.  Obviosly some aggregation of these should characterise the preferences of the object we are considering to element $j$.

### Weighing of objects

Obviously, objects within a collaboration that are more similar to the object in question than others should affect the result more than less similar objects from the collaboration. Therefore, it is rational to weight the contributions of objects $(r_{lj}-\overline{r}_l)$ from the colaboration by their similarity measure to the object in question. In our case, the similarity measure is the Pearson correlation coefficient $(Sim(k',l))$. 

We can interpret the components of the formula:

$$\frac{\sum_{l\in U_{k'}}(r_{lj}-\overline{r}_l)Sim(k',l)}{\sum_{l \in U_{k'}}|Sim(k',l)|}$$

As weighted on correlations contributions of the objects from collaboration. 

**Note** there is sum of the absolute values of the correlation coefficients - $|Sim(k',l)|$. If the definition of collaboration allows for objects with a negative relationship - i.e. like anti-collaboration, e.g. people with diametrically opposed views. Accordingly, we have to take into account their high scores as something bad for the client in question. But we have to weight them by absolute value, so in the denominator is the correlation module. 

### Add bias

The last compoment of the formula is $\overline{r}_{k'}$. In remove bias step we removed biases of the objects. But object under consideration have it's own bias - by adding it to the result we bring the prediction to the mean values of the object in question.