# UserKNN Implementation Testing
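Using a sample dataset from the UserKNN paper (<a href=''>GroupLens</a>), implement the 
following collaborative filtering algorithm:
$$P = \vec \mu + \frac{(A - \bar J) R}{M \, |R|}$$
Here $A$ holds the ratings (rows are items, columns are users), $M$ is its 0/1 rating mask, $\vec \mu$ is the per-user mean rating, $\bar J$ holds the leave-one-out user means, $R$ is the user-user correlation matrix with a zeroed diagonal, and the division is elementwise.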

In [2]:
import numpy as np

In [1]:
rating_json = [
    {'id': 1, 'reviews':[{'reviewer_id': 'Ken', 'rating': 1}, {'reviewer_id': 'Lee', 'rating': 4}, {'reviewer_id': 'Meg', 'rating': 2}, {'reviewer_id': 'Nan', 'rating': 2}] },
    {'id': 2, 'reviews':[{'reviewer_id': 'Ken', 'rating': 5}, {'reviewer_id': 'Lee', 'rating': 2}, {'reviewer_id': 'Meg', 'rating': 4}, {'reviewer_id': 'Nan', 'rating': 4}] },
    {'id': 3, 'reviews':[ {'reviewer_id': 'Meg', 'rating': 2}]},
    {'id': 4, 'reviews':[{'reviewer_id': 'Ken', 'rating': 2}, {'reviewer_id': 'Lee', 'rating': 5}, {'reviewer_id': 'Nan', 'rating': 5}] },
    {'id': 5, 'reviews':[{'reviewer_id': 'Ken', 'rating': 4}, {'reviewer_id': 'Lee', 'rating': 1}, {'reviewer_id': 'Nan', 'rating': 1}] },
    {'id': 6, 'reviews':[{'reviewer_id': 'Lee', 'rating': 2}, {'reviewer_id': 'Meg', 'rating': 5}] },
]

In [None]:
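reviewers = []  # unique reviewer ids, in order of first appearance
for item in rating_json:
    for review in item['reviews']:
        if review['reviewer_id'] not in reviewers:
            reviewers.append(review['reviewer_id'])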
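One way to carry the loop above further is to pivot the nested reviews into an items × reviewers array, with NaN marking unrated entries. The following is only a sketch: `ratings_to_matrix` is a hypothetical helper, and `sample` is a small made-up list in the same shape as `rating_json`, not the dataset itself.

```python
import numpy as np

# Hypothetical mini dataset with the same structure as rating_json.
sample = [
    {'id': 1, 'reviews': [{'reviewer_id': 'Ken', 'rating': 1},
                          {'reviewer_id': 'Lee', 'rating': 4}]},
    {'id': 2, 'reviews': [{'reviewer_id': 'Ken', 'rating': 5},
                          {'reviewer_id': 'Meg', 'rating': 4}]},
    {'id': 3, 'reviews': [{'reviewer_id': 'Meg', 'rating': 2}]},
]

def ratings_to_matrix(items):
    """Hypothetical helper: pivot nested reviews into an items x reviewers array."""
    # Sorted reviewer ids give a stable column order.
    reviewers = sorted({r['reviewer_id'] for item in items for r in item['reviews']})
    col = {name: j for j, name in enumerate(reviewers)}
    # Start from all-NaN so that missing ratings stay marked as missing.
    ratings = np.full((len(items), len(reviewers)), np.nan)
    for i, item in enumerate(items):
        for r in item['reviews']:
            ratings[i, col[r['reviewer_id']]] = r['rating']
    return ratings, reviewers

ratings, reviewers = ratings_to_matrix(sample)
```

Rows follow the order of the input list and columns follow the sorted reviewer ids, so both orderings are reproducible.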



Create the test datasets from the paper

In [3]:
# Ratings array, unrated items are 0
# TODO - unrated items come in as nan, mask generated programmatically
A = np.array([
    [1, 4, 2, 2],
    [5, 2, 4, 4],
    [0, 0, 3, 0],
    [2, 5, 0, 5],
    [4, 1, 0, 1],
    [0, 2, 5, 0]
])


# Mask for ratings
M = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0]
])

n, p = A.shape
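The TODO in the cell above could be approached roughly as follows. This is a sketch that assumes the ratings arrive with NaN for unrated items; the names `A_nan`, `M_auto` and `A_zero` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np

# Assumed input: the same ratings as A, but with NaN marking unrated items.
A_nan = np.array([
    [1, 4, 2, 2],
    [5, 2, 4, 4],
    [np.nan, np.nan, 3, np.nan],
    [2, 5, np.nan, 5],
    [4, 1, np.nan, 1],
    [np.nan, 2, 5, np.nan],
])

# 1 where a rating exists, 0 where it is missing.
M_auto = (~np.isnan(A_nan)).astype(int)

# Zero-filled copy, suitable for the matrix products used later.
A_zero = np.nan_to_num(A_nan, nan=0.0)
```

Deriving the mask from the data keeps `A` and `M` from drifting apart when the ratings are edited by hand.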

Calculate the mean rating for each user, $\vec \mu$

In [4]:
mu = np.sum(A, axis=0) / np.sum(M, axis=0) # mean of ALL ratings for each user, no exclusions
mu

array([3. , 2.8, 3.5, 3. ])

Calculate the user-user correlation matrix $r_{KJ}$

In [5]:
A = A.astype('float')
A[A == 0] = np.nan # so non-values are excluded from correlation coefficients
R = np.ma.corrcoef(np.ma.masked_invalid(A), rowvar=False).data
np.fill_diagonal(R, 0) # so self-value isn't summed in numerator Sum(J_i - J_bar) @ R
A = np.nan_to_num(A, nan=0)
R

array([[ 0.        , -0.8       ,  1.        ,  0.        ],
       [-0.8       ,  0.        , -0.96380941,  0.6       ],
       [ 1.        , -0.96380941,  0.        ,  1.        ],
       [ 0.        ,  0.6       ,  1.        ,  0.        ]])
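As a cross-check on individual entries of `R`, the correlation between two users can be computed by hand over the items both have rated. The sketch below uses a hypothetical helper, `pairwise_pearson`, which centres each user on their mean over the co-rated items only; the four arrays are the columns of `A` with NaN for unrated items.

```python
import numpy as np

def pairwise_pearson(x, y):
    """Pearson correlation over entries where both x and y are rated (non-NaN)."""
    both = ~np.isnan(x) & ~np.isnan(y)
    if both.sum() < 2:
        return np.nan  # not enough overlap to define a correlation
    xd = x[both] - x[both].mean()
    yd = y[both] - y[both].mean()
    denom = np.sqrt((xd ** 2).sum() * (yd ** 2).sum())
    return (xd * yd).sum() / denom if denom > 0 else np.nan

# Columns of A (one array per user), NaN where the item is unrated.
ken = np.array([1, 5, np.nan, 2, 4, np.nan])
lee = np.array([4, 2, np.nan, 5, 1, 2])
meg = np.array([2, 4, 3, np.nan, np.nan, 5])
nan_user = np.array([2, 4, np.nan, 5, 1, np.nan])  # the reviewer named 'Nan'

r_ken_lee = pairwise_pearson(ken, lee)       # about -0.8
r_lee_nan = pairwise_pearson(lee, nan_user)  # about  0.6
r_lee_meg = pairwise_pearson(lee, meg)       # about -0.945
```

The Ken/Lee and Lee/Nan values agree with the matrix above. The Lee/Meg value does not: strict pairwise centring gives about -0.945, while the -0.9638 shown above is consistent with centring each user on their overall mean before restricting to co-rated items. Which convention the paper intends is worth confirming before relying on that entry.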

In [6]:
A

array([[1., 4., 2., 2.],
       [5., 2., 4., 4.],
       [0., 0., 3., 0.],
       [2., 5., 0., 5.],
       [4., 1., 0., 1.],
       [0., 2., 5., 0.]])

Create the denominator $\sum_{J} |r_{KJ}| \ \forall K$.

In [7]:
D = M @ abs(R)
D

array([[1.8       , 2.36380941, 2.96380941, 1.6       ],
       [1.8       , 2.36380941, 2.96380941, 1.6       ],
       [1.        , 0.96380941, 0.        , 1.        ],
       [0.8       , 1.4       , 2.96380941, 0.6       ],
       [0.8       , 1.4       , 2.96380941, 0.6       ],
       [1.8       , 0.96380941, 0.96380941, 1.6       ]])

Create $\bar J$: for the $i^{th}$ row, take each user's average over their entire column **excluding** the $i^{th}$ value


In [8]:
D0 = np.ones([n,n])
np.fill_diagonal(D0, 0)

J_bar = (D0 @ A) / (D0 @ M)
J_bar

array([[3.66666667, 2.5       , 4.        , 3.33333333],
       [2.33333333, 3.        , 3.33333333, 2.66666667],
       [3.        , 2.8       , 3.66666667, 3.        ],
       [3.33333333, 2.25      , 3.5       , 2.33333333],
       [2.66666667, 3.25      , 3.5       , 3.66666667],
       [3.        , 3.        , 3.        , 3.        ]])
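The zero-diagonal trick can be checked against an explicit loop. The sketch below rebuilds `A` and `M` so that it runs on its own (deriving `M` from `A != 0`, which assumes 0 is never a valid rating); `J_loop` is an illustrative name.

```python
import numpy as np

A = np.array([
    [1, 4, 2, 2],
    [5, 2, 4, 4],
    [0, 0, 3, 0],
    [2, 5, 0, 5],
    [4, 1, 0, 1],
    [0, 2, 5, 0],
], dtype=float)
M = (A != 0).astype(int)  # assumes 0 always means "unrated"
n, p = A.shape

# Matrix form: ones with a zero diagonal sum every row except the current one.
D0 = np.ones((n, n)) - np.eye(n)
J_bar = (D0 @ A) / (D0 @ M)

# Loop form: drop row i, then average each user's remaining rated items.
J_loop = np.empty((n, p))
for i in range(n):
    keep = np.arange(n) != i
    J_loop[i] = A[keep].sum(axis=0) / M[keep].sum(axis=0)
```

For example, the first row's entry for Ken is (5 + 2 + 4) / 3, since item 1 is excluded and Ken has three other rated items.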

An alternative that I think is wrong: 
$\bar J$ is the $j^{th}$ person's average rating with respect to the $i^{th}$ person (based on overlapping reviews)

In [9]:
#J_bar = (M.T @ A) / (M.T @ M)


In [10]:
P = mu + ((A - J_bar) @ R) / D
P


array([[ 1.22222222,  4.1795326 ,  1.66259639,  2.3125    ],
       [ 3.81481481,  1.96411495,  5.17480722,  3.04166667],
       [ 4.57333333,  4.08919639,        -inf,  0.65333333],
       [-4.125     ,  7.11428542,  3.05559135, -0.08333333],
       [ 0.875     ,  3.30476161,  3.78181226, -5.08333333],
       [ 4.55555556,  1.42252972, -1.7252972 ,  3.875     ]])
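The `-inf` for item 3 / Meg comes from a zero in `D`: Meg is the only user who rated item 3, and the diagonal of `R` was zeroed, so there are no neighbours to weight. One possible guard, sketched below with a hypothetical helper `safe_ratio` and small made-up arrays, is to divide only where the denominator is non-zero and leave NaN elsewhere.

```python
import numpy as np

def safe_ratio(numer, denom):
    """Divide elementwise, returning NaN wherever denom is zero."""
    out = np.full(numer.shape, np.nan)
    # `where` skips the zero-denominator entries, so they keep their NaN.
    np.divide(numer, denom, out=out, where=denom != 0)
    return out

# Made-up values; the zero in denom plays the role of D[2, 2].
numer = np.array([[1.0, -2.0], [3.0, 0.5]])
denom = np.array([[2.0, 0.0], [1.5, 1.0]])
ratio = safe_ratio(numer, denom)
```

In the notebook this would take the form `mu + safe_ratio((A - J_bar) @ R, D)`, which marks entries with no usable neighbours as NaN rather than infinite and avoids the divide-by-zero warning.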

Naïve testing, comparing residuals

In [12]:
abs(A - P) * M

array([[0.22222222, 0.1795326 , 0.33740361, 0.3125    ],
       [1.18518519, 0.03588505, 1.17480722, 0.95833333],
       [0.        , 0.        ,        inf, 0.        ],
       [6.125     , 2.11428542, 0.        , 5.08333333],
       [3.125     , 2.30476161, 0.        , 6.08333333],
       [0.        , 0.57747028, 6.7252972 , 0.        ]])
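A residual table like this can be condensed into a single number. The sketch below computes a mean absolute error over entries that are both rated and have a finite prediction; `masked_mae` is a hypothetical helper and the arrays are made-up stand-ins for `A`, `P` and `M`.

```python
import numpy as np

def masked_mae(actual, predicted, mask):
    """Mean absolute error over rated entries with a finite prediction."""
    valid = (mask == 1) & np.isfinite(predicted)
    return np.abs(actual - predicted)[valid].mean()

# Made-up stand-ins: one unrated entry and one infinite prediction.
actual = np.array([[1.0, 4.0], [0.0, 2.0]])
predicted = np.array([[1.5, 3.0], [2.0, -np.inf]])
mask = np.array([[1, 1], [0, 1]])

mae = masked_mae(actual, predicted, mask)  # averages |1 - 1.5| and |4 - 3|
```

Excluding non-finite predictions keeps one degenerate entry from dominating the summary, though the count of excluded entries is worth reporting alongside it.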

If we want to keep the given (true) ratings and use the predictions only for the unrated entries

In [100]:
M_switch = (M - 1) * -1

np.nan_to_num(P * M_switch, nan=0) + A


array([[1.        , 4.        , 2.        , 2.        ],
       [5.        , 2.        , 4.        , 4.        ],
       [4.57333333, 4.08919639, 3.        , 0.65333333],
       [2.        , 5.        , 3.05559135, 5.        ],
       [4.        , 1.        , 3.78181226, 1.        ],
       [4.55555556, 2.        , 5.        , 3.875     ]])
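The same merge can be written as a selection rather than arithmetic, which avoids multiplying an infinite prediction by zero. This is a sketch with a hypothetical helper `fill_unrated` and made-up arrays standing in for `A`, `P` and `M`.

```python
import numpy as np

def fill_unrated(actual, predicted, mask):
    """Keep given ratings where mask is 1, take predictions elsewhere."""
    return np.where(mask == 1, actual, predicted)

# Made-up stand-ins: one unrated entry, one rated entry with an infinite prediction.
actual = np.array([[1.0, 4.0], [0.0, 3.0]])
predicted = np.array([[1.2, 3.8], [4.5, -np.inf]])
mask = np.array([[1, 1], [0, 1]])

filled = fill_unrated(actual, predicted, mask)
```

Because `np.where` picks values instead of combining them, the infinite prediction at a rated position never enters the result and no warning is raised.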

## SCRATCH
Calculate sigma for each user

In [63]:
S = np.sqrt(np.diag(((A - mu) ** 2).T @ M)).reshape([p, 1])

In [64]:
S @ S.T

array([[10.        , 10.39230485,  7.07106781, 10.        ],
       [10.39230485, 10.8       ,  7.34846923, 10.39230485],
       [ 7.07106781,  7.34846923,  5.        ,  7.07106781],
       [10.        , 10.39230485,  7.07106781, 10.        ]])

In [222]:
1.4/1.6

0.8749999999999999

In [225]:
np.corrcoef([4,2,5,1], [2,4,5,1])

array([[1. , 0.6],
       [0.6, 1. ]])

In [226]:
np.corrcoef([1, 5], [2,4])

array([[1., 1.],
       [1., 1.]])

In [233]:


M.T @ M

array([[4, 4, 2, 4],
       [4, 5, 3, 4],
       [2, 3, 4, 2],
       [4, 4, 2, 4]])

In [238]:
(M.T @ np.nan_to_num(A, nan=0)) / (M.T @ M)

array([[3.        , 3.        , 3.        , 3.        ],
       [3.        , 2.8       , 3.66666667, 3.        ],
       [3.        , 2.66666667, 3.5       , 3.        ],
       [3.        , 3.        , 3.        , 3.        ]])

In [228]:
M.T

array([[1, 1, 0, 1, 1, 0],
       [1, 1, 0, 1, 1, 1],
       [1, 1, 1, 0, 0, 1],
       [1, 1, 0, 1, 1, 0]])

In [38]:
2.8/1.8

1.5555555555555554

In [61]:
mu + np.nan_to_num((A[5].reshape([4,1]) @ np.ones([1,4]) - J_bar).T, nan=0) @ R /  D[5]

array([[4.55555556, 0.8       , 4.5       , 3.875     ],
       [4.65185185, 0.46666667, 4.3       , 4.15833333],
       [4.57407407, 1.3       , 5.16666667, 3.3125    ],
       [4.55555556, 0.8       , 4.5       , 3.875     ]])

In [66]:
np.ones(n).reshape([n,1 ]) @ mu.reshape([1, p])

array([[3. , 2.8, 3.5, 3. ],
       [3. , 2.8, 3.5, 3. ],
       [3. , 2.8, 3.5, 3. ],
       [3. , 2.8, 3.5, 3. ],
       [3. , 2.8, 3.5, 3. ],
       [3. , 2.8, 3.5, 3. ]])

In [71]:
I0 = np.ones([n,n])
np.fill_diagonal(I0, 0)

(I0 @ np.nan_to_num(A, nan=0)) / (I0 @ M)


array([[3.66666667, 2.5       , 4.        , 3.33333333],
       [2.33333333, 3.        , 3.33333333, 2.66666667],
       [3.        , 2.8       , 3.66666667, 3.        ],
       [3.33333333, 2.25      , 3.5       , 2.33333333],
       [2.66666667, 3.25      , 3.5       , 3.66666667],
       [3.        , 3.        , 3.        , 3.        ]])

In [70]:
I0 @ M

array([[3., 4., 3., 3.],
       [3., 4., 3., 3.],
       [4., 5., 3., 4.],
       [3., 4., 4., 3.],
       [3., 4., 4., 3.],
       [4., 4., 3., 4.]])