# UserKNN Implementation Testing
Using a sample dataset from the UserKNN paper (<a href=''>GroupLens</a>), implement the 
following collaborative filtering algorithm:
$$P = \vec \mu + \frac{(A - \bar J) R}{M \, |R|}$$
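
Written element-wise, the matrix expression above can be read as follows. This is a sketch under stated assumptions: $i$ indexes items (rows of $A$), $K$ and $J$ index users (columns), $M_{iJ}$ is 1 when user $J$ rated item $i$ and 0 otherwise, and the fraction in the matrix form is an element-wise division. Because $R$ is symmetric, $r_{JK} = r_{KJ}$.

$$P_{iK} = \mu_K + \frac{\sum_{J} (A_{iJ} - \bar J_{iJ}) \, r_{JK}}{\sum_{J} M_{iJ} \, |r_{JK}|}$$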

In [3]:
import numpy as np

In [4]:
rating_json = [
    {'id': 1, 'reviews':[{'reviewer_id': 'Ken', 'rating': 1}, {'reviewer_id': 'Lee', 'rating': 4}, {'reviewer_id': 'Meg', 'rating': 2}, {'reviewer_id': 'Nan', 'rating': 2}] },
    {'id': 2, 'reviews':[{'reviewer_id': 'Ken', 'rating': 5}, {'reviewer_id': 'Lee', 'rating': 2}, {'reviewer_id': 'Meg', 'rating': 4}, {'reviewer_id': 'Nan', 'rating': 4}] },
    {'id': 3, 'reviews':[ {'reviewer_id': 'Meg', 'rating': 3}]},
    {'id': 4, 'reviews':[{'reviewer_id': 'Ken', 'rating': 2}, {'reviewer_id': 'Lee', 'rating': 5}, {'reviewer_id': 'Nan', 'rating': 5}] },
    {'id': 5, 'reviews':[{'reviewer_id': 'Ken', 'rating': 4}, {'reviewer_id': 'Lee', 'rating': 1}, {'reviewer_id': 'Nan', 'rating': 1}] },
    {'id': 6, 'reviews':[{'reviewer_id': 'Lee', 'rating': 2}, {'reviewer_id': 'Meg', 'rating': 5}] },
]

In [5]:
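# Collect the distinct reviewer ids; sorting keeps the column order stable
reviewers = sorted({r['reviewer_id'] for item in rating_json for r in item['reviews']})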
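
One possible way to turn nested review records like these into an items-by-reviewers array is sketched below. The helper name `to_matrix` and the shortened `sample_json` are assumptions made for illustration and are not part of the notebook; unrated entries are left as `nan` so that a mask can be derived from the array itself.

```python
import numpy as np

# Shortened stand-in with the same structure as rating_json (illustrative only)
sample_json = [
    {'id': 1, 'reviews': [{'reviewer_id': 'Ken', 'rating': 1},
                          {'reviewer_id': 'Lee', 'rating': 4}]},
    {'id': 2, 'reviews': [{'reviewer_id': 'Lee', 'rating': 2}]},
]

def to_matrix(items):
    """Hypothetical helper: return (reviewer names, items x reviewers array).

    Entries with no review are np.nan.
    """
    names = sorted({r['reviewer_id'] for it in items for r in it['reviews']})
    col = {name: j for j, name in enumerate(names)}
    out = np.full((len(items), len(names)), np.nan)
    for i, it in enumerate(items):
        for r in it['reviews']:
            out[i, col[r['reviewer_id']]] = r['rating']
    return names, out

names, A_demo = to_matrix(sample_json)
M_demo = ~np.isnan(A_demo)  # True where a rating exists
```

With the array built this way, the mask no longer has to be typed by hand.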



Create test datasets from paper

In [6]:
# Ratings array; unrated items are 0
# TODO - unrated items come in as nan, mask generated programmatically
A = np.array([
    [1, 4, 2, 2],
    [5, 2, 4, 4],
    [0, 0, 3, 0],
    [2, 5, 0, 5],
    [4, 1, 0, 1],
    [0, 2, 5, 0]
])


# Mask for ratings

#M = np.array([
#    [1, 1, 1, 1],
#    [1, 1, 1, 1],
#    [0, 0, 1, 0],
#    [1, 1, 0, 1],
#    [1, 1, 0, 1],
#    [0, 1, 1, 0]
#])

n, p = A.shape

Calculate the mean rating for each user, $\vec \mu$

Calculate the correlation matrix $r_{KJ}$

In [7]:
A = A.astype('float')
A[A == 0] = np.nan # so non-values are excluded from correlation coefficients

In [8]:
M = ~np.ma.masked_invalid(A).mask
M

array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [False, False,  True, False],
       [ True,  True, False,  True],
       [ True,  True, False,  True],
       [False,  True,  True, False]])

In [9]:

R = np.ma.corrcoef(np.ma.masked_invalid(A), rowvar=False).data
np.fill_diagonal(R, 0) # so self-value isn't summed in numerator Sum(J_i - J_bar) @ R
A = np.nan_to_num(A, nan=0)
R

array([[ 0.        , -0.8       ,  1.        ,  0.        ],
       [-0.8       ,  0.        , -0.96380941,  0.6       ],
       [ 1.        , -0.96380941,  0.        ,  1.        ],
       [ 0.        ,  0.6       ,  1.        ,  0.        ]])
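
For comparison, a strictly pairwise Pearson correlation can be written out by hand. The sketch below is an assumption about what "correlation on overlapping reviews" could mean, not the notebook's method: `pairwise_pearson` is a hypothetical helper that centers each pair of columns on the mean of their co-rated items only. Because of that choice, it is not guaranteed to reproduce the `np.ma.corrcoef` values for pairs whose overlap is partial.

```python
import numpy as np

def pairwise_pearson(X):
    """Hypothetical helper: Pearson correlation between columns of X,
    computed only over rows where both columns are non-NaN."""
    p = X.shape[1]
    out = np.full((p, p), np.nan)
    for j in range(p):
        for k in range(p):
            both = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
            if both.sum() < 2:
                continue  # fewer than two co-rated items: leave as nan
            a = X[both, j] - X[both, j].mean()
            b = X[both, k] - X[both, k].mean()
            denom = np.sqrt((a @ a) * (b @ b))
            if denom > 0:
                out[j, k] = (a @ b) / denom
    return out

# Same ratings as A, with nan for unrated entries
A_nan = np.array([
    [1, 4, 2, 2],
    [5, 2, 4, 4],
    [np.nan, np.nan, 3, np.nan],
    [2, 5, np.nan, 5],
    [4, 1, np.nan, 1],
    [np.nan, 2, 5, np.nan],
])
R_pair = pairwise_pearson(A_nan)
```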

In [10]:
np.ma.masked_invalid(A)

masked_array(
  data=[[1.0, 4.0, 2.0, 2.0],
        [5.0, 2.0, 4.0, 4.0],
        [0.0, 0.0, 3.0, 0.0],
        [2.0, 5.0, 0.0, 5.0],
        [4.0, 1.0, 0.0, 1.0],
        [0.0, 2.0, 5.0, 0.0]],
  mask=[[False, False, False, False],
        [False, False, False, False],
        [False, False, False, False],
        [False, False, False, False],
        [False, False, False, False],
        [False, False, False, False]],
  fill_value=1e+20)

In [11]:
np.ma.corrcoef(A ,rowvar=False)

masked_array(
  data=[[1.0, 0.0, -0.27695585470349865, 0.5454545454545455],
        [0.0, 1.0, -0.34668762264076824, 0.7169281790988649],
        [-0.27695585470349865, -0.34668762264076824, 1.0,
         -0.3692744729379982],
        [0.5454545454545455, 0.7169281790988649, -0.3692744729379982,
         1.0]],
  mask=[[False, False, False, False],
        [False, False, False, False],
        [False, False, False, False],
        [False, False, False, False]],
  fill_value=1e+20)

In [12]:
mu = np.sum(A, axis=0) / np.sum(M, axis=0) # mean of ALL ratings for each user, no exclusions
mu

array([3. , 2.8, 3.5, 3. ])
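
If the unrated entries are still `nan` rather than 0, the same per-user mean can be obtained with `np.nanmean`, which skips `nan` values. This is a small sketch; `A_nan` and `mu_alt` are illustrative names, not variables from the notebook.

```python
import numpy as np

# Same ratings as A, with nan for unrated entries
A_nan = np.array([
    [1, 4, 2, 2],
    [5, 2, 4, 4],
    [np.nan, np.nan, 3, np.nan],
    [2, 5, np.nan, 5],
    [4, 1, np.nan, 1],
    [np.nan, 2, 5, np.nan],
])

# Column-wise mean over the rated entries only
mu_alt = np.nanmean(A_nan, axis=0)
```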

In [None]:
np.sum(M, axis=0)

Create the denominator $\sum_{J} |r_{KJ}|$ for every $K$.

In [None]:
D = M @ abs(R)
D

Create $\bar J$: for the $i^{th}$ row, take each user's average rating over their entire column, **excluding** the $i^{th}$ value.


In [None]:
D0 = np.ones([n,n])
np.fill_diagonal(D0, 0)

J_bar = (D0 @ A) / (D0 @ M)
J_bar
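
To see what the zero-diagonal matrix does, here is the same leave-one-out average on a tiny made-up example (three items, two users). The arrays `A_small` and `M_small` are invented for illustration. Multiplying by a ones matrix with a zero diagonal sums every row except the current one, and dividing by the matching count of rated entries gives the mean.

```python
import numpy as np

# Made-up ratings: 0 marks the single unrated entry
A_small = np.array([[1., 4.],
                    [5., 0.],
                    [3., 2.]])
M_small = np.array([[True, True],
                    [True, False],
                    [True, True]])

n_small = A_small.shape[0]
loo = np.ones((n_small, n_small)) - np.eye(n_small)  # ones with a zero diagonal

# Row i holds each user's mean rating over all items except item i
J_bar_small = (loo @ A_small) / (loo @ M_small)
# J_bar_small is [[4, 2], [2, 3], [3, 4]]
```

If a user has rated only one item, the count for that item's row is 0 and the division produces `nan` or `inf`, so that case may need separate handling.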

Alternative that I think is wrong:
$\bar J$ is the $j^{th}$ person's average rating with respect to the $i^{th}$ person (based on overlapping reviews).

In [2]:
#J_bar = (M.T @ A) / (M.T @ M)
R

NameError: name 'R' is not defined

In [None]:
P = mu + ((A - J_bar) @ R) / D
P
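
One thing worth checking in the prediction step: when unrated entries of `A` are stored as 0, the difference `A - J_bar` is nonzero at those positions and flows into the numerator. The sketch below shows a variant that masks the numerator so unrated entries contribute nothing. This is an assumption about the intended behavior, not a confirmed reading of the paper; `predict` and the toy inputs are hypothetical.

```python
import numpy as np

def predict(A, M, R):
    """Hypothetical helper.

    A: items x users ratings with 0 where unrated
    M: boolean mask, True where rated
    R: user-user correlations with a zero diagonal
    """
    n = A.shape[0]
    mu = A.sum(axis=0) / M.sum(axis=0)           # per-user mean
    loo = np.ones((n, n)) - np.eye(n)            # leave-one-out selector
    with np.errstate(invalid='ignore', divide='ignore'):
        J_bar = (loo @ A) / (loo @ M)
        num = ((A - J_bar) * M) @ R              # assumption: mask unrated entries
        den = M @ np.abs(R)
        return mu + num / den                    # nan where no neighbor rated the item

# Toy inputs, invented for illustration
A_toy = np.array([[1., 4.], [5., 0.], [3., 2.]])
M_toy = np.array([[True, True], [True, False], [True, True]])
R_toy = np.array([[0., -0.5], [-0.5, 0.]])

P_toy = predict(A_toy, M_toy, R_toy)
```

In the toy data the second item was rated only by the first user, so the first user's prediction for it has no neighbors to draw on and comes out as `nan`.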

Naïve testing, comparing residuals

In [None]:
abs(A - P) * M
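
The masked residuals can be collapsed into a single number, for instance a mean absolute error over the rated entries. The values below are made up for illustration, and `mae` is not a quantity the notebook computes.

```python
import numpy as np

# Made-up ratings, mask, and predictions
A_ex = np.array([[1., 4.], [5., 0.]])
M_ex = np.array([[True, True], [True, False]])
P_ex = np.array([[1.5, 3.0], [4.0, 2.0]])

resid = np.abs(A_ex - P_ex) * M_ex   # zero wherever no rating was given
mae = resid.sum() / M_ex.sum()       # average over the rated entries only
```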

To fill in the unrated entries with predicted values while keeping the given ratings:

In [None]:
M_switch = ~M  # inverse mask: True where an item is unrated

np.nan_to_num(P * M_switch, nan=0) + A
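
The same fill can also be expressed with `np.where`, which picks from one array where the mask is true and from another elsewhere. A minimal sketch with made-up values:

```python
import numpy as np

# Made-up ratings (0 = unrated), mask, and predictions
A_ex = np.array([[1., 0.], [0., 4.]])
M_ex = np.array([[True, False], [False, True]])
P_ex = np.array([[1.2, 3.3], [2.2, 3.9]])

# Keep the given rating where one exists, otherwise use the prediction
filled = np.where(M_ex, A_ex, P_ex)
```

Unlike the `nan_to_num` form, this leaves any `nan` predictions visible in the result, which may or may not be what is wanted.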

## SCRATCH
Calculate sigma for each user

In [None]:
S = np.sqrt(np.diag(((A - mu) ** 2).T @ M)).reshape([p, 1])

In [None]:
S @ S.T

In [None]:
1.4/1.6

In [None]:
np.corrcoef([4,2,5,1], [2,4,5,1])

In [None]:
np.corrcoef([1, 5], [2,4])

In [None]:


M.T @ M

In [None]:
(M.T @ np.nan_to_num(A, nan=0)) / (M.T @ M)

In [None]:
M.T

In [None]:
2.8/1.8

In [None]:
mu + np.nan_to_num((A[5].reshape([4,1]) @ np.ones([1,4]) - J_bar).T, nan=0) @ R /  D[5]

In [None]:
np.ones(n).reshape([n,1 ]) @ mu.reshape([1, p])

In [None]:
I0 = np.ones([n,n])
np.fill_diagonal(I0, 0)

(I0 @ np.nan_to_num(A, nan=0)) / (I0 @ M)
