# 1) User User Collaborative Filtering

[user user collaborative filtering blog](https://medium.com/@tomar.ankur287/user-user-collaborative-filtering-recommender-system-51f568489727)

[blog 2](https://medium.com/sfu-big-data/recommendation-systems-user-based-collaborative-filtering-using-n-nearest-neighbors-bf7361dc24e0)

| | Batman | X-Men | Star Wars | The Notebook | Bridget Jones's Diary |
|--|--|--|--|--|--|
| Alice | 5 | 4.5 | 5 | 2 | 1 |
| Bob | 4.5 | 4 | | 2 | 2 |
| Carol | 2 | 3 | 1 | 5 | 5 |

- Alice and Bob seem to agree with each other, and both disagree with Carol
- Is Star Wars a good recommendation for Bob?
- Intuitively, we see that Bob's ratings are similar to Alice's, thus he is likely to also like Star Wars
- Math-speak: Alice's and Bob's ratings are highly correlated


## Average Rating
- Why is it limited?
- It treats everyone's rating of the movie equally
- Bob's s(i,j) equally depends on Alice's rating and Carol's rating, even though he doesn't agree with Carol

\begin{equation*}
s(i,j) = \frac{\sum_{i' \in \Omega_{j}}^{}r_{i'j}}{|\Omega_{j}|}
\end{equation*}

## Weighting Ratings
- We'll see how to calculate these weights later
- Intuitively, I want it to be small for users I don't agree with, large for users I do agree with

\begin{equation*}
s(i,j) = \frac{\sum_{i' \in \Omega_{j}}^{}w_{ii'} r_{i'j}}{\sum_{i' \in \Omega_{j}}^{}w_{ii'}}
\end{equation*}
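
As a rough illustration of the two formulas above, here is a minimal sketch using the Star Wars column of the table. The weights are made-up values chosen only for this example (how to actually compute them comes later), and the names `ratings_star_wars` and `weights_for_bob` are assumptions, not part of any dataset.

```python
# Known ratings of Star Wars (Bob has not rated it)
ratings_star_wars = {"Alice": 5.0, "Carol": 1.0}

# Plain average: every rater counts equally
plain_average = sum(ratings_star_wars.values()) / len(ratings_star_wars)

# Hypothetical weights for Bob: large for Alice (he agrees with her),
# small for Carol (he disagrees with her)
weights_for_bob = {"Alice": 0.9, "Carol": 0.1}

weighted_average = (
    sum(weights_for_bob[u] * r for u, r in ratings_star_wars.items())
    / sum(weights_for_bob[u] for u in ratings_star_wars)
)

print(plain_average)     # 3.0
print(weighted_average)  # about 4.6
```

With equal treatment the score lands in the middle of the scale; once Alice's rating dominates, the score moves toward what Bob would plausibly give.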


## Another issue with average rating
- Your interpretation of a rating is different from mine
- Users can be biased to be optimistic or pessimistic
- E.g. optimistic: most movies are a 5, a bad movie is a 3
- E.g. pessimistic: most movies are a 1 or 2, a good movie is a 4


## Deviation
- We don't care about your absolute rating, we care how much it deviates from your own average
    - if your average is 2.5, but you rate something a 5, it must be really great
    - if you rate everything a 5, it's difficult to know how those items compare
    
$\overline{r}_{i}$ is sample mean of r      
$dev(i,j) = r(i,j) - \overline{r}_{i}$, for a known rating

- My predicted deviation is the average deviation for a movie
- Then my predicted rating is my own average + predicted deviation

$dev(i,j) = r(i,j) - \overline{r}_{i}$ for a known rating  
$\widehat{dev}(i,j) = \frac{1}{|\Omega_{j}|}\sum_{i'\in \Omega_{j}}^{}\left(r(i',j)-\overline{r}_{i'}\right)$ for a prediction from known ratings  
$s(i,j) = \overline{r}_{i} + \frac{\sum_{i'\in \Omega_{j}}^{}\left(r(i',j)-\overline{r}_{i'}\right)}{|\Omega_{j}|} = \overline{r}_{i}+\widehat{dev}(i,j)$
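
To make the deviation idea concrete, here is a minimal sketch that predicts Bob's rating of Star Wars from the table at the top. The helper names `mean_rating` and `predict_unweighted` are assumptions introduced for this example.

```python
# Ratings from the table at the top (Bob has not rated Star Wars)
ratings = {
    "Alice": {"Batman": 5, "X-Men": 4.5, "Star Wars": 5,
              "The Notebook": 2, "Bridget Jones's Diary": 1},
    "Bob": {"Batman": 4.5, "X-Men": 4,
            "The Notebook": 2, "Bridget Jones's Diary": 2},
    "Carol": {"Batman": 2, "X-Men": 3, "Star Wars": 1,
              "The Notebook": 5, "Bridget Jones's Diary": 5},
}


def mean_rating(user):
    """Sample mean of all ratings given by `user`."""
    values = ratings[user].values()
    return sum(values) / len(values)


def predict_unweighted(user, movie):
    """User's own average plus the average deviation of everyone who rated `movie`."""
    raters = [u for u in ratings if u != user and movie in ratings[u]]
    deviations = [ratings[u][movie] - mean_rating(u) for u in raters]
    return mean_rating(user) + sum(deviations) / len(deviations)


score = predict_unweighted("Bob", "Star Wars")
print(round(score, 3))  # about 2.775
```

The prediction ends up below Bob's own average of 3.125, because Carol's negative deviation counts exactly as much as Alice's positive one. That is the limitation the weights are meant to address.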


## Combine
- Combine the idea of deviations with the idea of weightings to get our final formula
- Note: absolute value since weights can be negative

\begin{equation*}
s(i,j) = \overline{r}_{i} + \frac{\sum_{i'\in \Omega_{j}}^{}w_{ii'}(r_{i'j}-\overline{r}_{i'})}{\sum_{i'\in \Omega_{j}}^{}|w_{ii'}|}
\end{equation*}
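
A minimal sketch of the combined formula, assuming the weights are already known. The weight values below are illustrative placeholders, and the fallback to the user's own average when no neighbor is usable is an assumption of this sketch rather than something stated above.

```python
def predict(avg_i, neighbors):
    """
    avg_i: the target user's average rating.
    neighbors: list of (weight, deviation) pairs, one per user who rated the item,
               where deviation = that user's rating minus that user's own average.
    """
    numerator = sum(w * dev for w, dev in neighbors)
    denominator = sum(abs(w) for w, dev in neighbors)
    if denominator == 0:
        return avg_i  # assumption: no usable neighbors, fall back to the average
    return avg_i + numerator / denominator


bob_avg = 3.125
neighbors = [
    (0.98, 5 - 3.5),   # Alice: illustrative weight 0.98, deviation +1.5
    (-0.99, 1 - 3.2),  # Carol: illustrative weight -0.99, deviation -2.2
]
score = predict(bob_avg, neighbors)
print(round(score, 2))  # about 4.98
```

Carol's negative weight flips the sign of her negative deviation, so her dislike of Star Wars becomes evidence that Bob will like it. The absolute value in the denominator keeps the normalization positive.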



## How to calculate weights

- How do we calculate correlations between 2 sequences of data?
- Pearson correlation coefficient

\begin{equation*}
\varrho_{xy} = \frac{\sum_{i=1}^{N}(x_{i}-\overline{x})(y_{i}-\overline{y})}{\sqrt{\sum_{i=1}^{N}(x_{i}-\overline{x})^2} \sqrt{\sum_{i=1}^{N}(y_{i}-\overline{y})^2}}
\end{equation*}  

- Problem: our matrix is sparse

- Idea: just use the data we have

\begin{equation*}
w_{ii'} = \frac{\sum_{j\in \Psi_{ii'}}^{}(r_{ij}-\overline{r}_{i})(r_{i'j}-\overline{r}_{i'})}{\sqrt{\sum_{j\in \Psi_{i}}^{}(r_{ij}-\overline{r}_{i})^2} \sqrt{\sum_{j \in \Psi_{i'}}^{}(r_{i'j}-\overline{r}_{i'})^2}}
\end{equation*}  

$\Psi_{i}$ = set of movies that user i has rated  
$\Psi_{ii'}$ = set of movies both user i and user i' have rated  
$\Psi_{ii'} = \Psi_{i} \cap \Psi_{i'}$ 
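
The sparse weight can be sketched directly on dictionaries of known ratings. This follows the formula as written above (numerator over the common movies, each denominator over that user's own rated movies). The function name `pearson_weight` and the choice to return `None` when the weight is undefined are assumptions of this sketch.

```python
from math import sqrt


def pearson_weight(ratings_a, ratings_b):
    """
    ratings_a, ratings_b: dicts mapping movie -> rating for two users.
    Returns the weight, or None when it cannot be computed.
    """
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return None

    mean_a = sum(ratings_a.values()) / len(ratings_a)
    mean_b = sum(ratings_b.values()) / len(ratings_b)

    # numerator: only the movies both users rated
    numerator = sum(
        (ratings_a[m] - mean_a) * (ratings_b[m] - mean_b) for m in common
    )

    # denominator: each user's own rated movies
    sigma_a = sqrt(sum((r - mean_a) ** 2 for r in ratings_a.values()))
    sigma_b = sqrt(sum((r - mean_b) ** 2 for r in ratings_b.values()))
    if sigma_a == 0 or sigma_b == 0:
        return None  # a user who rates everything the same has no deviations
    return numerator / (sigma_a * sigma_b)


alice = {"Batman": 5, "X-Men": 4.5, "Star Wars": 5,
         "The Notebook": 2, "Bridget Jones's Diary": 1}
bob = {"Batman": 4.5, "X-Men": 4, "The Notebook": 2, "Bridget Jones's Diary": 2}
carol = {"Batman": 2, "X-Men": 3, "Star Wars": 1,
         "The Notebook": 5, "Bridget Jones's Diary": 5}

w_bob_alice = pearson_weight(bob, alice)
w_bob_carol = pearson_weight(bob, carol)
print(round(w_bob_alice, 2), round(w_bob_carol, 2))  # positive, negative
```

On the example table, Bob's weight with Alice comes out strongly positive and his weight with Carol strongly negative, which matches the intuition from the start of the section.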





## What about cosine similarity?

- Pearson correlation is old school statistics
- But wait

\begin{equation*}
\cos\theta = \frac{x^{T}y}{\|x\|\|y\|} = \frac{\sum_{i=1}^{N}x_{i}y_{i}}{\sqrt{\sum_{i=1}^{N}x_{i}^2}\sqrt{\sum_{i=1}^{N}y_{i}^2}}
\end{equation*}

## Cosine similarity vs Pearson correlation
- They are the same, except that Pearson is centered (the mean is subtracted first)
- We want to center them anyway because we're working with deviations, not absolute ratings
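
A small sketch of why the centering matters, using Alice's and Carol's full rows from the table. The helper names `cosine` and `center` are assumptions for this example; applying cosine similarity to the centered vectors is the Pearson formula.

```python
from math import sqrt


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def center(x):
    """Subtract the mean, turning absolute ratings into deviations."""
    mean = sum(x) / len(x)
    return [a - mean for a in x]


alice = [5, 4.5, 5, 2, 1]
carol = [2, 3, 1, 5, 5]

raw = cosine(alice, carol)
centered = cosine(center(alice), center(carol))
print(round(raw, 2), round(centered, 2))  # positive, then strongly negative
```

Because every rating is positive, the raw cosine between Alice and Carol is positive even though they disagree on every movie. Only after centering does the similarity turn strongly negative.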


## Problem

- What if 2 users have zero movies in common, or just a few?
- If zero, don't consider user i' in user i's calculation: the weight can't be calculated
- If few (e.g. < 5), don't use the weight, since there is not enough data for it to be accurate



## Neighborhood

- In practice, don't sum over all users who rated movie j (it takes too long)
- It can help to precompute the weights beforehand
- Non-trivial: instead of summing over all users, take the ones with the highest weights
- E.g. use K nearest neighbors, with K = 25 up to 50

\begin{equation*}
s(i,j) = \overline{r}_{i} + \frac{\sum_{i'\in \Omega_{j}}^{}w_{ii'}(r_{i'j}-\overline{r}_{i'})}{\sum_{i'\in \Omega_{j}}^{}|w_{ii'}|}
\end{equation*}

- Is it useful to keep negative weights? Yes. A strong negative correlation is still informative; it is a correlation near zero that tells us nothing
- Thus, you might want to keep neighbors sorted on absolute correlation
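
Keeping neighbors by absolute correlation could look like the sketch below. The function name `top_k_by_abs_weight` and the user IDs and weights are made up for illustration.

```python
def top_k_by_abs_weight(weights, k):
    """
    weights: dict neighbor_id -> weight.
    Returns the k (neighbor_id, weight) pairs with the largest |weight|.
    """
    ranked = sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


weights = {"u1": 0.9, "u2": -0.95, "u3": 0.05, "u4": -0.4, "u5": 0.6}
nearest = top_k_by_abs_weight(weights, k=3)
print(nearest)  # [('u2', -0.95), ('u1', 0.9), ('u5', 0.6)]
```

The strongly negative neighbor `u2` is kept ahead of the weakly positive ones, while the near-zero neighbor `u3` is dropped.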

## Data

[kaggle movielens 20 million](https://www.kaggle.com/grouplens/movielens-20m-dataset)

## Complexity
- If you're a company building a recommendation engine, you need to make recommendations for all N users
- We need w(i,i') for every pair of users i, i' = 1,...,N -> $O(N^2)$
- Including the time to calculate each correlation -> $O(N^2M)$

## Big Data

- Suppose we have 100k = $10^5$ users
- Then we need $(10^5)^2 = 10^{10}$ weights
- That's 10 billion
- We are reaching the limits of capacity
- Not just size: the $O(N^2)$ time required to calculate them is slow
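
A quick back-of-the-envelope check of the storage cost. The 8 bytes per weight is an assumption (64-bit floats); other encodings would change the total.

```python
num_users = 10**5
num_weights = num_users**2   # one weight per ordered pair of users
bytes_per_weight = 8         # assumption: 64-bit floats
total_gb = num_weights * bytes_per_weight / 1e9

print(num_weights)  # 10000000000
print(total_gb)     # 80.0
```

Since $w_{ii'} = w_{i'i}$, storing each pair once would roughly halve this, but the growth is still quadratic in the number of users.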


## Suggestion: Work on a subset of data
- What I did
- Take the top n users and top m movies (n < N, m < M)
- Top user = user who has rated the most movies
- Top movie = movie that has been rated the most times
- Yields a more dense matrix
- Experiment to determine a workable value for n and m


## Recommendation for a single user
- Naive calculation
- Calculate weights between this user and all users: O(MN)
- Calculate scores s(i,j) for all items: also O(MN)
    - M items
    - Sum over N terms for each item
- Sort the scores: O(M log M)
- Total: O(MN) + O(M log M)
- This is theoretical; in practice, you can make things faster
    - Precompute user-user weights
    - For the score, only sum over users similar to me; with the top K neighbors, this is O(MK)
    - For sorting, if you keep only the top L items, this is O(M log L)
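
Keeping only the top L items without a full sort can be sketched with the standard library's `heapq.nlargest`. The function name `top_items` and the scores below are made up for illustration.

```python
import heapq


def top_items(scores, num_items):
    """
    scores: dict item_id -> predicted score.
    Returns the num_items best (item_id, score) pairs, highest score first.
    """
    return heapq.nlargest(num_items, scores.items(), key=lambda item: item[1])


scores = {"m1": 3.2, "m2": 4.8, "m3": 2.1, "m4": 4.9, "m5": 4.0}
best = top_items(scores, 2)
print(best)  # [('m4', 4.9), ('m2', 4.8)]
```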

## Precomputing
- Precompute recommendations using big data technologies as offline jobs
- You don't want to do an O(M log M) sort in real time; even O(M) would be bad
- Use a cron job to run your task every day at some time, e.g. 3pm
- Your API can simply retrieve items already stored in a database

# 2) Data Processing


## User IDs
- rating.csv  
- User IDs go from 1 to ~100k
- IDs index a NumPy array, so we want them to start from 0
- We also want to use all the space in the array
- E.g. we don't want the max user ID to be 100k when there are only 500 users
- Here the IDs are already sequential, so the only step is to subtract 1 from each ID

## Movie IDs
- Movie IDs go from 1 to ~100k
- But they are not sequential
- There are only ~20k movies
- For these, we should make a new mapping that goes from 0 to ~20k
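
Such a mapping could be built as in the sketch below. The IDs are made up, and the names `build_index_map` and `movie2idx` are assumptions for this example.

```python
def build_index_map(ids):
    """Map each distinct ID to a contiguous index 0, 1, 2, ..."""
    return {old: new for new, old in enumerate(sorted(set(ids)))}


movie_ids = [1, 5, 5, 100, 1, 42]  # made-up, non-sequential IDs
movie2idx = build_index_map(movie_ids)
movie_idx = [movie2idx[m] for m in movie_ids]

print(movie2idx)  # {1: 0, 5: 1, 42: 2, 100: 3}
print(movie_idx)  # [0, 1, 1, 3, 0, 2]
```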

## Shrinking the data

- Full dataset is too large to perform $O(N^2)$ algorithm
- User-User CF won't scale on a single machine
- If you're an expert at big data (e.g. Spark), you can write a distributed job
- preprocessing_shrink.py
    - Select a subset of users and movies
    - Users who have rated the most movies
    - Movies that have been rated by the most users

```python
import pandas as pd
from collections import Counter

df = pd.read_csv('../large_files/movielens-20m-dataset/edited_rating.csv')
print("original dataframe size:", len(df))

N = df.userId.max() + 1 # number of users
M = df.movie_idx.max() + 1 # number of movies

user_ids_count = Counter(df.userId)
movie_ids_count = Counter(df.movie_idx)

# number of users and movies we would like to keep
n = 10000
m = 2000

user_ids = [u for u, c in user_ids_count.most_common(n)]
# use a loop variable that doesn't shadow m (the number of movies to keep)
movie_ids = [movie for movie, c in movie_ids_count.most_common(m)]

# make a copy, otherwise ids won't be overwritten
df_small = df[df.userId.isin(user_ids) & df.movie_idx.isin(movie_ids)].copy()
# need to remake user ids and movie ids since they are no longer sequential
new_user_id_map = {}
i = 0
for old in user_ids:
  new_user_id_map[old] = i
  i += 1
print("i:", i)

new_movie_id_map = {}
j = 0
for old in movie_ids:
  new_movie_id_map[old] = j
  j += 1
print("j:", j)

print("Setting new ids")
# Series.map with a dict avoids a slow row-wise apply
df_small['userId'] = df_small.userId.map(new_user_id_map)
df_small['movie_idx'] = df_small.movie_idx.map(new_movie_id_map)

```

## Table to Dictionary
- In code, I want to ask questions like
    - given user i, which movies j did they rate?
    - given movie j, which users i have rated it?
    - given user i and movie j, what is the rating?
- A Pandas DataFrame is like a SQL table, so we could write "queries" to grab this info
- But Python dictionaries are already a key-value lookup
    - user2movie: user ID -> list of movie IDs
    - movie2user: movie ID -> list of user IDs
    - usermovie2rating: (user ID, movie ID) -> rating
    
- Looping through the full array would be O(NM)
- Looping through a dictionary is O($|\Omega|$), the number of ratings that actually exist
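
A minimal sketch of building the three lookups in a single pass. The `rows` list is made-up data standing in for the (user, movie, rating) columns of the ratings table.

```python
from collections import defaultdict

# made-up (user, movie, rating) rows
rows = [(0, 0, 5.0), (0, 2, 3.5), (1, 0, 4.0), (2, 1, 2.0), (2, 2, 4.5)]

user2movie = defaultdict(list)   # user ID -> list of movie IDs
movie2user = defaultdict(list)   # movie ID -> list of user IDs
usermovie2rating = {}            # (user ID, movie ID) -> rating

for user, movie, rating in rows:
    user2movie[user].append(movie)
    movie2user[movie].append(user)
    usermovie2rating[(user, movie)] = rating

print(user2movie[0])             # [0, 2]
print(movie2user[2])             # [0, 2]
print(usermovie2rating[(2, 1)])  # 2.0
```

The loop touches each known rating exactly once, which is the $O(|\Omega|)$ cost mentioned above.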

# 3) User User Collaborative Filtering in Code

```python
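import pickle
import numpy as np
from sortedcontainers import SortedList

# note: despite the .json extension, these files are read as pickles
with open('user2movie.json', 'rb') as f: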
  user2movie = pickle.load(f)

with open('movie2user.json', 'rb') as f:
  movie2user = pickle.load(f)

with open('usermovie2rating.json', 'rb') as f:
  usermovie2rating = pickle.load(f)

with open('usermovie2rating_test.json', 'rb') as f:
  usermovie2rating_test = pickle.load(f)

N = np.max(list(user2movie.keys())) + 1
# the test set may contain movies the train set doesn't have data on
m1 = np.max(list(movie2user.keys()))
m2 = np.max([m for (u, m), r in usermovie2rating_test.items()])
M = max(m1, m2) + 1

```


\begin{equation*}
s(i,j) = \overline{r}_{i} + \frac{\sum_{i'\in \Omega_{j}}^{}w_{ii'}(r_{i'j}-\overline{r}_{i'})}{\sum_{i'\in \Omega_{j}}^{}|w_{ii'}|}
\end{equation*}


```python
K = 25 # number of neighbors we'd like to consider
limit = 5 # minimum number of movies two users must have in common to be considered
neighbors = [] # store neighbors in this list
averages = [] # each user's average rating for later use
deviations = [] # each user's deviation for later use
for i in range(N):
  # find the 25 closest users to user i
  movies_i = user2movie[i]
  movies_i_set = set(movies_i)
  
  # calculate avg and deviation
  ratings_i = { movie:usermovie2rating[(i, movie)] for movie in movies_i }
  avg_i = np.mean(list(ratings_i.values()))
  dev_i = { movie:(rating - avg_i) for movie, rating in ratings_i.items() }
  dev_i_values = np.array(list(dev_i.values()))
  sigma_i = np.sqrt(dev_i_values.dot(dev_i_values))
    
  ...
  # save these for later use
  averages.append(avg_i)
  deviations.append(dev_i)

  sl = SortedList()
  for j in range(N):
    # don't include yourself
    if j != i:
      movies_j = user2movie[j]
      movies_j_set = set(movies_j)
      common_movies = (movies_i_set & movies_j_set) # intersection
      if len(common_movies) > limit:
        # calculate avg and deviation
        ratings_j = { movie:usermovie2rating[(j, movie)] for movie in movies_j }
        avg_j = np.mean(list(ratings_j.values()))
        dev_j = { movie:(rating - avg_j) for movie, rating in ratings_j.items() }
        dev_j_values = np.array(list(dev_j.values()))
        sigma_j = np.sqrt(dev_j_values.dot(dev_j_values))

        # calculate correlation coefficient
        numerator = sum(dev_i[m]*dev_j[m] for m in common_movies)
        w_ij = numerator / (sigma_i * sigma_j)

        # insert into sorted list and truncate
        # negate weight, because list is sorted ascending
        # maximum value (1) is "closest"
        sl.add((-w_ij, j))
        if len(sl) > K:
          del sl[-1]

  # store the neighbors
  neighbors.append(sl)

```
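
Once `neighbors`, `averages` and `deviations` are filled, a prediction for a single (user, movie) pair could be sketched as below. This is not taken from the code above: the function name `predict`, the fallback to the user's average, and the clipping to a 0.5 to 5 rating scale are all assumptions. Plain lists of `(-weight, user)` tuples stand in for the `SortedList` objects so the example runs on its own.

```python
def predict(i, m, neighbors, averages, deviations):
    """Predicted rating of movie m for user i, from precomputed neighbors."""
    numerator = 0.0
    denominator = 0.0
    for neg_w, j in neighbors[i]:
        w = -neg_w  # weights were stored negated, so flip the sign back
        if m in deviations[j]:  # neighbor j may not have rated movie m
            numerator += w * deviations[j][m]
            denominator += abs(w)

    if denominator == 0:
        prediction = averages[i]  # assumption: fall back to the user's average
    else:
        prediction = averages[i] + numerator / denominator

    # assumption: ratings live on a 0.5 to 5 scale, so clip to that range
    return min(5.0, max(0.5, prediction))


# toy data: user 0 has two neighbors (users 1 and 2)
neighbors = [[(-0.9, 1), (-0.5, 2)], [], []]
averages = [3.0, 3.5, 2.5]
deviations = [{}, {7: 1.0}, {7: -0.5, 8: 1.5}]

p_both = predict(0, 7, neighbors, averages, deviations)  # both neighbors rated 7
p_one = predict(0, 8, neighbors, averages, deviations)   # only user 2 rated 8
p_none = predict(0, 9, neighbors, averages, deviations)  # nobody rated 9
print(round(p_both, 3), p_one, p_none)
```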

# 4) Item - Item Collaborative Filtering

## Recap about User User Collaborative Filtering
- For user user CF, I want to find users like me
- The movies that those users have seen, and that I haven't seen, become my recommendations
- It's intuitive that if they are like me, I would like movies they've rated highly
- 2 users are similar if their row vectors have a small distance between them



## Item Item 
- What if we looked column-wise instead?
- Let's find 2 products that are similar
- They are similar if their column vectors' distance is small

![](https://takuti.github.io/Recommendation.jl/latest/assets/images/cf.png)

## Example
- The correlation between the column vectors is high
- If you like Power Rangers, you'll also like Transformers

| | Power Rangers | Transformers | Ninja Turtles |
|--|--|--|--|
| Alice | 4.5 | 5 | 4 |
| Bob | 5 | 5 | 4.5 |
| Carol | 2 | 2 | 0.5 |

## Item Correlation

\begin{equation*}
w_{jj'} = \frac{\sum_{i\in \Omega_{jj'}}^{}(r_{ij}-\overline{r}_{j})(r_{ij'}-\overline{r}_{j'})}{\sqrt{\sum_{i\in \Omega_{jj'}}^{}(r_{ij}-\overline{r}_{j})^2} \sqrt{\sum_{i \in \Omega_{jj'}}^{}(r_{ij'}-\overline{r}_{j'})^2}}  
\end{equation*}  

$\Omega_{j} = $ users who rated item j   
$\Omega_{jj'} = $ users who rated item j and item j'   
$\overline{r}_{j} = $ average rating for item j  



## Item Score


\begin{equation*}
s(i,j) = \overline{r}_{j} + \frac{\sum_{j'\in \Psi_{i}}^{}w_{jj'}(r_{ij'}-\overline{r}_{j'})}{\sum_{j'\in \Psi_{i}}^{}|w_{jj'}|}
\end{equation*}

$\Psi_{i}$ = items user i has rated

- Deviation: how much user i likes item j', compared to how much everyone else likes j' (IMO, not as intuitive as user-user CF)
- If user i really likes j' and j is similar to j' ($w_{jj'}$ is high), then user i probably likes j too

## Comparison

- User User CF: choose items for a user, because items have been liked by similar users
- Item Item CF: choose items for a user, because this user has liked similar items in the past


## Another Perspective
- pretend items are people, so they have feelings
- Flip the user-item matrix sideways
- Each entry tells me how much item j likes user i
- To choose a user to recommend to item j, I can look at other items j' who liked the same users as item j
- If item j and item j' are similar, then they like the same users
- User-based and Item-based CF are mathematically identical
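
The "flip the matrix" idea can be sketched in a few lines: once the ratings are transposed, the same correlation function that compares two users compares two items. The helper names `flip` and `correlation` are assumptions for this example, and the data is the Power Rangers table from above.

```python
from math import sqrt

user_ratings = {
    "Alice": {"Power Rangers": 4.5, "Transformers": 5, "Ninja Turtles": 4},
    "Bob": {"Power Rangers": 5, "Transformers": 5, "Ninja Turtles": 4.5},
    "Carol": {"Power Rangers": 2, "Transformers": 2, "Ninja Turtles": 0.5},
}


def flip(ratings):
    """Turn outer -> {inner: rating} into inner -> {outer: rating}."""
    flipped = {}
    for outer, inner_ratings in ratings.items():
        for inner, rating in inner_ratings.items():
            flipped.setdefault(inner, {})[outer] = rating
    return flipped


def correlation(a, b):
    """Pearson correlation of two {key: rating} dicts over their shared keys."""
    shared = sorted(set(a) & set(b))
    xs = [a[k] for k in shared]
    ys = [b[k] for k in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(p * q for p, q in zip(dx, dy))
    return numerator / (sqrt(sum(p * p for p in dx)) * sqrt(sum(q * q for q in dy)))


item_ratings = flip(user_ratings)

# same function, rows or columns
w_users = correlation(user_ratings["Alice"], user_ratings["Bob"])
w_items = correlation(item_ratings["Power Rangers"], item_ratings["Transformers"])
print(round(w_items, 3))  # close to 1
```

Flipping twice gives back the original dictionary, which is one way to see that the two views hold the same information.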


## Practical differences

- When comparing 2 items, you have a lot more data than when comparing 2 users
    - Each user: up to ~20k items to look at
    - Each item: up to ~100k users to look at
    - Thus for item-item CF, weights are calculated based on more data
- Item-item CF is faster
    - Computing all item-item weights: $O(M^2N)$
        - There are $M^2$ item-item weights, and each vector has length N
    - For user-based CF, we saw $O(N^2M)$
    - N >> M, so $N^2M$ is much worse than $M^2N$
    
- Item based CF is more accurate    

# 5) Item - Item Collaborative Filtering in Code

```python
import pickle
import numpy as np
from sortedcontainers import SortedList

with open('user2movie.json', 'rb') as f:
  user2movie = pickle.load(f)

with open('movie2user.json', 'rb') as f:
  movie2user = pickle.load(f)

with open('usermovie2rating.json', 'rb') as f:
  usermovie2rating = pickle.load(f)

with open('usermovie2rating_test.json', 'rb') as f:
  usermovie2rating_test = pickle.load(f)

K = 20 # number of neighbors we'd like to consider
limit = 5 # minimum number of users two items must have in common to be considered
neighbors = [] # store neighbors in this list
averages = [] # each item's average rating for later use
deviations = [] # each item's deviation for later use

for i in range(M):  # M = number of movies, computed as in section 3
  # find the K closest items to item i
  users_i = movie2user[i]
  users_i_set = set(users_i)

  # calculate avg and deviation
  ratings_i = { user:usermovie2rating[(user, i)] for user in users_i }
  avg_i = np.mean(list(ratings_i.values()))
  dev_i = { user:(rating - avg_i) for user, rating in ratings_i.items() }
  dev_i_values = np.array(list(dev_i.values()))
  sigma_i = np.sqrt(dev_i_values.dot(dev_i_values))

  # save these for later use
  averages.append(avg_i)
  deviations.append(dev_i)

  sl = SortedList()
  for j in range(M):
    # don't include yourself
    if j != i:
      users_j = movie2user[j]
      users_j_set = set(users_j)
      common_users = (users_i_set & users_j_set) # intersection
      if len(common_users) > limit:
        # calculate avg and deviation
        ratings_j = { user:usermovie2rating[(user, j)] for user in users_j }
        avg_j = np.mean(list(ratings_j.values()))
        dev_j = { user:(rating - avg_j) for user, rating in ratings_j.items() }
        dev_j_values = np.array(list(dev_j.values()))
        sigma_j = np.sqrt(dev_j_values.dot(dev_j_values))

        # calculate correlation coefficient
        numerator = sum(dev_i[u]*dev_j[u] for u in common_users)
        w_ij = numerator / (sigma_i * sigma_j)

        # insert into sorted list and truncate
        # negate weight, because list is sorted ascending
        # maximum value (1) is "closest"
        sl.add((-w_ij, j))
        if len(sl) > K:
          del sl[-1]

  # store the neighbors
  neighbors.append(sl)

```

# 6) Collaborative Filtering Conclusion


## Section summary
- Previously, we considered s(j) only: a single score for each item, regardless of which user is looking
- This section: the score s(i,j) depends on both user i and item j
- Problem with average rating
    - Not all ratings should be treated equally
    - Users I agree with should be weighted higher
    - Users I disagree with should be weighted lower
- Pearson correlation as weights


## Deviations
- Scores as deviations (how much better a user likes an item compared to how they normally feel)

- Looking at the equation below, we can see that it has the same form as linear regression
- This gives us yet another alternative to Pearson correlation for choosing the weights

\begin{equation*}
\hat{d}(i,j) = \sum_{i'\in \Omega_{j}}^{} w_{ii'}d(i',j)
\end{equation*}

$\hat{d}(i,j)$ is the output prediction  
$d(i',j)$ are the input features

## User User and Item Item

- Item based CF is more accurate
