# Notebook 21: Estimating Missing Data
***

In this notebook, we will dive deeper into the problem of estimating missing ratings/data. Specifically, we will conduct some **preprocessing** on our data matrix, which can account for differences in how different users rate different items. Some users tend to be kind raters, and some will be stingier with those 5-star reviews. Normalizing our utility matrix before performing the UV decomposition on it will account for these differences.

We'll need numpy for this notebook, so let's load it.

In [7]:
import numpy as np

<br>

### Exercise 1: Preprocessing the data matrix

In class, we performed a few iterations to find the UV decomposition of the following matrix, where the rows correspond to different users and the columns correspond to different items. The elements of the matrix are the users' ratings for each item. There are two unknown values:
* User 3's rating for Item 2, and
* User 6's rating for Item 5.

In [8]:
M = np.array([[5,2,4,4,3],
              [3,1,2,4,1],
              [2,np.nan,3,1,4],
              [2,5,4,3,5],
              [2,5,4,3,5],
              [4,4,5,4,np.nan]])
M

array([[ 5.,  2.,  4.,  4.,  3.],
       [ 3.,  1.,  2.,  4.,  1.],
       [ 2., nan,  3.,  1.,  4.],
       [ 2.,  5.,  4.,  3.,  5.],
       [ 2.,  5.,  4.,  3.,  5.],
       [ 4.,  4.,  5.,  4., nan]])

We suggested in class that **preprocessing** the data matrix would lead to better results, by accounting for differences in the ways different users tend to rate items, and differences in item ratings. Some of the possible methods for preprocessing are:
* Subtract from each non-blank element $m_{ij}$ the average rating of user $i$
* Subtract from each non-blank element in column $j$ the average rating of item $j$
* Do both of these, in either order
* From element $m_{ij}$ subtract $\frac{1}{2} \times$ (the average of user $i$ + the average of item $j$)
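
As a rough illustration of the last option in the list, the sketch below subtracts half the sum of the user and item averages from every non-blank entry. The small matrix and the names `offset` and `M_half` are made up for this example and are not part of the class exercise.

```python
import numpy as np

# Small illustrative matrix (not the one from class); np.nan marks a blank.
M_demo = np.array([[5., 2., 4.],
                   [3., np.nan, 2.]])

# Averages over the non-blank entries only.
user_means = np.nanmean(M_demo, axis=1)   # one value per row (user)
item_means = np.nanmean(M_demo, axis=0)   # one value per column (item)

# Broadcasting builds the full matrix of (user i mean + item j mean) / 2.
offset = 0.5 * (user_means[:, np.newaxis] + item_means[np.newaxis, :])

# Blank entries stay NaN because NaN minus anything is NaN.
M_half = M_demo - offset
```

Here both averages come from the raw matrix, which is what distinguishes this option from doing the two subtractions one after the other.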

Let's shoot for the stars and subtract from each non-missing element $m_{ij}$ the average rating of user $i$, then subtract from that intermediate matrix the average rating of item $j$.

In [12]:
# per-user mean ratings, ignoring missing entries
user_means = np.nanmean(M, axis=1)
print(user_means)

# normalize M first by subtracting from each non-blank element that user's mean rating
M_norm = M.copy()
print(M_norm)

for u in range(len(user_means)):
    M_norm[u,:] -= user_means[u]

# per-item means of the user-normalized matrix (computed only after the first step)
item_means = np.nanmean(M_norm, axis=0)
print(item_means)

# normalize once more by subtracting from each non-blank element that item's mean rating
for i in range(len(item_means)):
    M_norm[:,i] -= item_means[i]

print(M_norm)

[3.6  2.2  2.5  3.8  3.8  4.25]
[[ 5.  2.  4.  4.  3.]
 [ 3.  1.  2.  4.  1.]
 [ 2. nan  3.  1.  4.]
 [ 2.  5.  4.  3.  5.]
 [ 2.  5.  4.  3.  5.]
 [ 4.  4.  5.  4. nan]]
[-0.35833333 -0.13        0.30833333 -0.19166667  0.42      ]
[[ 1.75833333 -1.47        0.09166667  0.59166667 -1.02      ]
 [ 1.15833333 -1.07       -0.50833333  1.99166667 -1.62      ]
 [-0.14166667         nan  0.19166667 -1.30833333  1.08      ]
 [-1.44166667  1.33       -0.10833333 -0.60833333  0.78      ]
 [-1.44166667  1.33       -0.10833333 -0.60833333  0.78      ]
 [ 0.10833333 -0.12        0.44166667 -0.05833333         nan]]


Note that whatever we subtract off from each element of the matrix during preprocessing, we need to *add that back in* when estimating the missing values after our UV decomposition. So **go back** to the code cell above and **add in** a matrix that is of the same size as $M$, whose elements are the total amount that we subtracted off from each element of $M$ during the preprocessing normalization.

So we currently have $M_{norm}=M-M_{sub}$ (equivalently, $M=M_{norm}+M_{sub}$), and we'll use $M_{norm}$ for our UV fitting.
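
One possible way to build both matrices at once is sketched below. It assumes the user-then-item order described above; the helper name `normalize_user_then_item` is hypothetical, and broadcasting stands in for the explicit loops.

```python
import numpy as np

# Same utility matrix as above; np.nan marks the two unknown ratings.
M = np.array([[5, 2, 4, 4, 3],
              [3, 1, 2, 4, 1],
              [2, np.nan, 3, 1, 4],
              [2, 5, 4, 3, 5],
              [2, 5, 4, 3, 5],
              [4, 4, 5, 4, np.nan]])

def normalize_user_then_item(M):
    """Hypothetical helper: returns (M_norm, M_sub) with M_norm = M - M_sub."""
    # Step 1: each user's mean over the ratings they actually gave.
    user_means = np.nanmean(M, axis=1)
    M_user = M - user_means[:, np.newaxis]
    # Step 2: each item's mean over the user-centered ratings.
    item_means = np.nanmean(M_user, axis=0)
    # Total removed from entry (i, j): user i's mean plus item j's mean.
    M_sub = user_means[:, np.newaxis] + item_means[np.newaxis, :]
    M_norm = M - M_sub
    return M_norm, M_sub

M_norm, M_sub = normalize_user_then_item(M)
```

Because `M_sub` is built from the means alone, it has a value in every position, including the two blanks, which is what lets us add it back to a full prediction matrix later.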

**Reflect:** Why did we not compute the user and item rating means at the same time? Why did we have to normalize by the user ratings first, *then* compute the mean item ratings?

<br>

### Exercise 2: U and V!

Let's find the UV decomposition of the normalized matrix (`M_norm`) using 2-dimensional vectors. Per the slides, we need to alternately compute:


$$x=u_{rs}=\frac{\sum_j v_{sj} (m_{rj} - \sum_{k \ne s} u_{rk}v_{kj} )}{\sum_j v_{sj}^2}$$

$$y=v_{rs}=\frac{\sum_i u_{ir} (m_{is} - \sum_{k \ne r} u_{ik}v_{ks} )}{\sum_i u_{ir}^2}$$

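for $x$ in the $U$ matrix and $y$ in the $V$ matrix. In both formulas, the sums over $j$ (for $x$) and over $i$ (for $y$) include only the non-missing entries of $M$ in row $r$ and column $s$, respectively.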
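
The two formulas translate fairly directly into NumPy. The sketch below is one possible reading, not the only way to write it; the helper names `update_u` and `update_v` are hypothetical, and missing ratings are assumed to be stored as `np.nan`.

```python
import numpy as np

def update_u(M, U, V, r, s):
    """Hypothetical helper: best value for U[r, s] with all other entries fixed."""
    known = ~np.isnan(M[r, :])                    # columns j where m_rj is observed
    # Row r of the prediction with component s removed: sum over k != s of u_rk * v_kj.
    partial = U[r, :] @ V - U[r, s] * V[s, :]
    num = np.sum(V[s, known] * (M[r, known] - partial[known]))
    den = np.sum(V[s, known] ** 2)
    return num / den

def update_v(M, U, V, r, s):
    """Hypothetical helper: best value for V[r, s] with all other entries fixed."""
    known = ~np.isnan(M[:, s])                    # rows i where m_is is observed
    # Column s of the prediction with component r removed: sum over k != r of u_ik * v_ks.
    partial = U @ V[:, s] - U[:, r] * V[r, s]
    num = np.sum(U[known, r] * (M[known, s] - partial[known]))
    den = np.sum(U[known, r] ** 2)
    return num / den

# Tiny demonstration on a made-up 2 x 5 matrix with all-ones U and V.
M_demo = np.array([[5., 2., 4., 4., 3.],
                   [2., np.nan, 3., 1., 4.]])
U_demo = np.ones((2, 2))
V_demo = np.ones((2, 5))
U_demo[0, 0] = update_u(M_demo, U_demo, V_demo, 0, 0)   # (4 + 1 + 3 + 3 + 2) / 5
```

Each call changes only one entry, so the result of an update depends on which entries have already been updated before it.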

Let's start with a couple of "easy" updates: initialize U and V as all-ones, then update `U[0,0]` and `V[0,0]`.


In [None]:
#Initialize U and V
d=2
U = np.ones((M.shape[0],d))
V= np.ones((d,M.shape[1]))

#Update U[0,0]
#U[0,0]= #TODO
#Update V[0,0]
#V[0,0]= #TODO


In [None]:
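##Now set it up as a loop, running down U and V in order (by whichever dimension first)
#TODO: fill in the loop bounds and the updates.
#Note: U is (users x d) but V is (d x items), so V[b,a] is only valid while a < M.shape[1].
#for a in ...:
#    for b in ...:
#        U[a,b]= #TODO
#        V[b,a]= #TODO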
        
    

We've done a step!  Let's see how we're doing.

<br>

### Exercise 3: Back to M

To go back to doing inference in M, we have to do 2 things: compute $P=UV$, then undo our normalization step.  The final result can be compared to M!

Recall that our current scoring metric is RMSE, where the sum runs over the non-missing entries of $M$ and $n$ is the number of those entries:

$$\sqrt{\frac{1}{n} \sum_{i,j} (M_{i,j} - P_{i,j})^2} $$
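
A minimal sketch of both steps follows, assuming the second argument (the prediction) has no missing values. The helper name `rmse_observed` and the small demo matrices are made up for illustration; in the notebook, `P` would be `U @ V` and `M_sub` the matrix built during preprocessing.

```python
import numpy as np

def rmse_observed(M1, M2):
    """Hypothetical helper: RMSE over the entries where M1 is not NaN."""
    sq_err = (M1 - M2) ** 2                  # NaN wherever M1 is missing
    n = np.count_nonzero(~np.isnan(M1))      # number of observed entries
    return np.sqrt(np.nansum(sq_err) / n)    # nansum skips the NaN terms

# Made-up 2 x 2 example with one missing rating.
M_demo = np.array([[5., 2.],
                   [np.nan, 4.]])
M_sub_demo = np.array([[3.5, 3.5],
                       [4.0, 4.0]])          # illustrative offsets (the row means)
P_demo = np.zeros_like(M_demo)               # stand-in for U @ V

# Undo the normalization by adding back what was subtracted.
P_unnorm_demo = P_demo + M_sub_demo
score = rmse_observed(M_demo, P_unnorm_demo)
```

Dividing by the count of observed entries, rather than by the full matrix size, keeps the blanks from diluting the error.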


In [None]:
P=np.matmul(U,V)

#Put M_sub into P
P_unnorm= # TODO

def RMSE(M1, M2):
    # todo... sum over all non-NAN entries in M1.  consider np.nansum.
    return rmse

RMSE(M,P_unnorm)

It's hard to say that we're doing great after one iteration, but we could at least check that we've done better than the RMSE from the all-ones initializations.

In [None]:
U = np.ones((M.shape[0],d))
V= np.ones((d,M.shape[1]))

P=np.matmul(U,V)
P_unnorm= # TODO

RMSE(M,P_unnorm)

Note: we could (and probably should!) also save a little time by just computing the RMSE of $P$ compared directly to $M_{norm}$.  


**Contemplate**: how does the RMSE calculation change depending on which of these two comparisons we use?

<br>

### Exercise 4: Bring Order to the Galaxy

Are we convinced that order really matters?  Repeat Exercise 2, but instead of looping over the rows and columns in a fixed, structured order, create a random ordering of the $u_{rs}$ indices and another random ordering of the $v_{rs}$ indices, then use those orderings inside your loop.
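
One way to produce the two random orderings is sketched below. The seed value and the variable names are arbitrary choices for this example; `np.random.default_rng` and `Generator.permutation` are standard NumPy calls.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (arbitrary) so the order is repeatable

n_users, n_items, d = 6, 5, 2    # shapes of the matrix used in this notebook

# Every (row, column) position in U (n_users x d) and in V (d x n_items).
U_indices = [(r, s) for r in range(n_users) for s in range(d)]
V_indices = [(r, s) for r in range(d) for s in range(n_items)]

# permutation(n) returns the integers 0..n-1 in random order.
U_index_order = [U_indices[k] for k in rng.permutation(len(U_indices))]
V_index_order = [V_indices[k] for k in rng.permutation(len(V_indices))]
```

Note that U has 12 entries and V has 10, so a single loop that zips the two lists together would skip two of the U updates; two separate loops, or a loop that handles the leftover entries, avoids that.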

In [None]:
#initialize U and V again
U = np.ones((M.shape[0],d))
V= np.ones((d,M.shape[1]))
#all (row, column) positions in U and in V
U_indices_list=[(x, y) for x in range(M.shape[0]) for y in range(d)]
V_indices_list=[(x, y) for x in range(d) for y in range(M.shape[1])]

#randomize the u's update order
U_index_order = # TODO
#randomize the v's update order
V_index_order = # TODO

for  # TODO: step through U_index_order and V_index_order
    U[u_index]= # TODO
    V[v_index]= # TODO


Any better?  Any different?

In [None]:
P=np.matmul(U,V)
#Put M_sub into P
P_unnorm= # TODO
RMSE(M,P_unnorm)