<a href="https://colab.research.google.com/github/cyrus2281/notes/blob/main/MachineLearning/Recommenders.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommenders

>[Recommenders](#scrollTo=COYGNbvomdLn)

>>[Mathematical Recommenders](#scrollTo=X7LG4JFYmfH9)

>>>[Lift](#scrollTo=-NPdgLKNkwA2)

>>>[Hacker News Formula](#scrollTo=bNo3EqUbmjRp)

>>>[Reddit Formula](#scrollTo=OX9Eah-ppQoq)

>>>[Google Page Rank](#scrollTo=cCDmUR3Fp0YC)

>>>>[Markov Models](#scrollTo=bIU2zR8LtEfH)

>>>>[Transition Probability Matrix](#scrollTo=UENDikH0tHHt)

>>>>[Calculating Probabilities](#scrollTo=Cbn1ogLEtKl_)

>>>>[Beta Posterior Mean](#scrollTo=-PvC0vn1xck2)

>>>>[State Distribution](#scrollTo=pS0RySi_1Ecy)

>>>>[PageRank](#scrollTo=Q640uQxg3_sJ)

>>[Statistics](#scrollTo=Ivwe3u2lzHp3)

>>>[Smoothing (Dampening)](#scrollTo=S34-RUx8zI_f)

>>>[Explore-Exploit Dilemma](#scrollTo=gahD3CwG1XEv)

>>>[Bayesian Method](#scrollTo=N4WrwUfO6MUp)

>>[Collaborative Filtering](#scrollTo=biqlBm7295Kg)

>>>[Sparsity](#scrollTo=VHSIzt7MLEya)

>>>[Regression](#scrollTo=pmcJqzBwNcuy)

>>>[User-User Collaborative Filtering](#scrollTo=XlrdEcfgxdmH)

>>>>[Pearson Correlation Coefficient](#scrollTo=LJzwxoPl2ayu)

>>>>[Cosine Similarity](#scrollTo=B2ZTwhaW4o-v)

>>>>[Python Implementation](#scrollTo=7fXxflLq7YCc)

>>>[Item-Item Collaborative Filtering](#scrollTo=RVJUmoIsVf11)

>>>>[Python Implementation](#scrollTo=QSNW575oXYDB)

>>>[Comparison](#scrollTo=-Hewxxx2C7sS)



## Mathematical Recommenders

These examples are non-personalized recommendations


### Lift

$$
\text{Lift} = \frac{p(A,B)}{p(A)p(B)} = \frac{p(A|B)}{p(A)} = \frac{p(B|A)}{p(B)}
$$

- Symmetric
- If A and B are independent, then $p(A|B) = p(A)$
  - $p(A|B) / p(A) = 1$
- if increasing the probability of B increases the probability of A, then Lift > 1
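
A minimal sketch of the lift formula computed from co-occurrence counts. The function name `lift` and the basket counts are hypothetical, chosen only to illustrate the independent case (lift of 1) versus a positive association (lift above 1).

```python
def lift(n_a, n_b, n_ab, n_total):
    """Lift of A and B estimated from counts: p(A,B) / (p(A) p(B))."""
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_ab = n_ab / n_total
    return p_ab / (p_a * p_b)

# Hypothetical data: 1000 baskets, 200 contain A, 100 contain B
lift_ab = lift(n_a=200, n_b=100, n_ab=40, n_total=1000)     # A and B co-occur often
lift_indep = lift(n_a=200, n_b=100, n_ab=20, n_total=1000)  # p(A,B) = p(A) p(B)
print(lift_ab, lift_indep)  # roughly 2 and 1, up to floating-point rounding
```

Swapping the roles of A and B gives the same value, which is the symmetry noted above.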

### Hacker News Formula

Balancing popularity with age

$$
\frac{f(\text{popularity})}{g(\text{age})}
$$

$$
\text{score} = \frac{(\text{ups} - \text{downs} -1 )^{0.8}}{(\text{age}+2)^{\text{gravity}}} \times \text{penalty}
$$

- gravity = 1.8
- penalty = multiplier to implement "business rules" (e.g. penalize self-posts, "controversial" posts, + many more rules)

\

The +2 added to age prevents division by zero when age is 0.

The exponent of the denominator (gravity = 1.8) is bigger than the exponent of the numerator (0.8), meaning the denominator grows faster.

Age always overtakes popularity

\

The exponent 0.8 causes sublinear growth,
meaning 0 → 100 is worth more than 1000 → 1100
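
A rough sketch of the score above. Assumptions not stated in the notes: age is measured in hours, `penalty` defaults to 1 (no business rules applied), and net votes are clamped at 0 so that a negative number is never raised to a fractional power. The name `hn_score` is made up for illustration.

```python
def hn_score(ups, downs, age_hours, gravity=1.8, penalty=1.0):
    # Assumption: clamp at 0 to avoid a negative base with a fractional exponent
    votes = max(ups - downs - 1, 0)
    return votes ** 0.8 / (age_hours + 2) ** gravity * penalty

# Same votes, different age: the older post scores lower
fresh = hn_score(ups=101, downs=0, age_hours=1)
stale = hn_score(ups=101, downs=0, age_hours=24)

# Sublinear numerator: the first 100 net votes add more than votes 1000 -> 1100
gain_early = hn_score(101, 0, 0) - hn_score(1, 0, 0)
gain_late = hn_score(1101, 0, 0) - hn_score(1001, 0, 0)
print(fresh > stale, gain_early > gain_late)  # True True
```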



### Reddit Formula

$$
\text{score} = \text{sign}(\text{ups}-\text{downs}) \times \log \{ \max(1, |\text{ups} - \text{downs}|) \} + \frac{\text{age}}{45000}
$$

- log of the absolute value of net votes: a sublinear curve, so initial votes matter more. The max is there since log 0 is undefined

\

- can be positive or negative
  - The more downvotes you get, the further your score goes down

\

- age is in seconds from inception of reddit
- age is always positive
- newer links → more score
- reddit scores will forever increase linearly
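
A small sketch of the formula above. The notes write "log" without a base; base 10 is assumed here, and `reddit_score` and `age_seconds` are illustrative names only.

```python
import math

def reddit_score(ups, downs, age_seconds):
    net = ups - downs
    sign = (net > 0) - (net < 0)  # +1, 0 or -1
    # Assumption: base-10 log, so each 10x in net votes adds 1 to the score
    return sign * math.log10(max(1, abs(net))) + age_seconds / 45000

liked = reddit_score(ups=10, downs=0, age_seconds=0)
disliked = reddit_score(ups=0, downs=10, age_seconds=0)
newer = reddit_score(ups=1, downs=0, age_seconds=45000)
print(liked, disliked, newer)
```

Under the base-10 assumption, a post submitted 45000 seconds (12.5 hours) later starts level with an older post that has 10x the net votes.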


### Google Page Rank

Logic: The page rank of a page is the probability I would end up on that page if I surfed the internet randomly for an infinite amount of time



#### Markov Models

- The simplest way to think about Markov Models is bigrams from NLP
- Build a probabilistic language model
- Can ask "what is the probability that the next word in the sentence is 'love', given the previous word was 'I'?" i.e., p(love | I)

**Bigrams**

- It's a bigram because we only consider 2 words at a time

We don't have to think of each item as a word, just a generic state: $x(t)$

"Markov" means $x(t)$ doesn't depend on any values 2 or more steps behind, only the immediate last value.
$$
p(x_t | x_{t-1}, x_{t-2}, \cdots, x_1) = p(x_t | x_{t-1})
$$



#### Transition Probability Matrix

- $A(i,j)$ tells us the probability of going to state j from state i
$$
A(i,j) = p(x_t = j | x_{t-1} = i )
$$

- Key: rows must sum to 1
    - Since it's a probability this must be true
    - If true, A is called a "stochastic matrix" or "Markov Matrix"
$$
\sum^M_{j=1}A(i,j) = \sum^M_{j=1}p(x_t=j|x_{t-1}=i) =1
$$


\

Example:

Weather is
- state 1 = Sunny
- state 2 = Rainy

Suppose:
- p( sunny | sunny ) = 0.9
- p( sunny | rainy ) = 0.1
- p( rainy | sunny ) = 0.1
- p( rainy | rainy ) = 0.9



#### Calculating Probabilities

$$
p(\text{rainy} | \text{sunny}) = \frac{\text{count}(\text{sunny} \rightarrow \text{rainy})}{\text{count}(\text{sunny})}
$$

Generalized

$$
p(B|A) = \frac{\text{count}(A \rightarrow B)}{\text{count}(A)}
$$

Now probability of a sentence would be

$$
p(x_1, \cdots, x_T) = p(x_1) \prod^T_{t=2} p(x_t | x_{t-1})
$$

Problem: If a bigram didn't appear in the train set, its probability would be 0, and anything × 0 is 0, so the whole sentence gets probability 0.

**Add-1 Smoothing**

$$
p(x_t=j|x_{t-1}=i) = \frac{\text{count}(i \rightarrow j)+\epsilon}{\text{count}(i)+\epsilon V} \\
$$

- Add a "fake count" to every possible bigram
  - ϵ can be any value, for example 1.
- V = Vocabulary size = number of unique words in dataset
- In this case, V=M (number of states) since each state is a word


e.g. p(and | and) never occurs but would get positive probability
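
A minimal sketch of the smoothed bigram estimate above, on a made-up toy sentence. `bigram_probs` is a hypothetical helper; count(i) is taken as the number of times word i appears as a previous word, so that each row of the transition matrix sums to 1.

```python
from collections import Counter

def bigram_probs(tokens, epsilon=1.0):
    vocab = sorted(set(tokens))
    V = len(vocab)
    pair_counts = Counter(zip(tokens[:-1], tokens[1:]))  # count(i -> j)
    prev_counts = Counter(tokens[:-1])                   # count(i) as previous word

    def p(next_word, prev_word):
        return (pair_counts[(prev_word, next_word)] + epsilon) / (
            prev_counts[prev_word] + epsilon * V
        )

    return p, vocab

tokens = "i love cats and i love dogs".split()
p, vocab = bigram_probs(tokens)

print(p("love", "i"))   # seen bigram: (2 + 1) / (2 + 5)
print(p("and", "and"))  # unseen bigram, still positive: (0 + 1) / (1 + 5)
```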


#### Beta Posterior Mean

In our case, the smoothed estimate has the same form as the beta posterior mean, except that instead of only 2 possible outcomes there are V possible outcomes

$$
E(\pi) = \frac{\alpha'}{\alpha'+\beta'} = \frac{\alpha+(\sum_{i=1}^N X_i)}{\alpha + \beta + N} \\
$$

#### State Distribution

$\pi_t$ = state probability distribution at time t

$\pi(t)$ is a row vector by convention

For the weather example,

$$
\pi_t = [p(x_t = \text{sunny}), p(x_t = \text{rainy})]
$$

**Future State Distribution**

To calculate $\pi_{t+1}$, marginalize over the current state and then apply the product rule

$$
p(x_{t+1} = j) = \sum^M_{i=1} p(x_{t+1} = j, x_t =i) \\
= \sum^M_{i=1} p(x_{t+1}=j | x_t = i) p(x_t = i) \\
= \sum^M_{i=1} A(i,j)\pi_t(i) \\
= \pi_{t+1}(j)
$$

Since A is a matrix and $\pi(t)$ is a vector, we can express it in terms of matrix math
$$
\pi_{t+1}(j) = \sum^M_{i=1} A(i,j)\pi_t(i) \\
\pi_{t+1} = \pi_t A
$$

Further future

$$
\pi_{t+2} = \pi_t A^2 \\
\pi_{t+k} = \pi_t A^k \\
$$

For infinity

$$
\pi_\infty = \lim_{t\rightarrow \infty} \pi_0 A^t \\
\pi_\infty = \pi_\infty A
$$

This is just the eigenvalue problem
  - Given a matrix A, find a vector and a scalar s.t. multiplying the vector by A is equivalent to stretching it by the scalar. Here the scalar (eigenvalue) is 1 and $\pi_\infty$ is a left (row) eigenvector of A.
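
A short sketch using the weather example: repeatedly applying $\pi_{t+1} = \pi_t A$ and, as a cross-check, solving the eigenvalue problem with NumPy. The starting distribution (certainly sunny) and the 200 iterations are arbitrary choices.

```python
import numpy as np

# Rows: current state (sunny, rainy); columns: next state (sunny, rainy)
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])

# Power iteration: pi_{t+1} = pi_t A
pi = np.array([1.0, 0.0])  # start from "sunny"
for _ in range(200):
    pi = pi @ A

# Eigenvalue route: pi A = pi means pi is a left eigenvector of A,
# i.e. a (right) eigenvector of A transposed, with eigenvalue 1
vals, vecs = np.linalg.eig(A.T)
v = vecs[:, np.argmin(np.abs(vals - 1.0))].real
pi_eig = v / v.sum()  # normalize so the entries sum to 1

print(pi, pi_eig)
```

Both routes should agree on the same distribution for this matrix.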

#### PageRank

Every page on the internet is a state in a Markov Model

The transition probability is distributed equally amongst all links on a page
- p(dlc.com | lp.me ) = 0.5
- p(yt.com | lp.me ) = 0.5

In general, we can write the transition probability as:

$$
p(x_t = j | x_{t-1} = i) = \frac{1}{n(i)}
$$
if $i$ links to $j$, where $n(i)$ = number of links on page $i$; otherwise the probability is $0$.

**Smoothing**

$$
G = 0.85A + 0.15U \\
U(i,j) = \frac{1}{M} \\
\forall i,j = 1 \dots M
$$

Find the limiting distribution of G. It yields a vector of length M; these probabilities are the respective PageRanks for each page on the internet

$$
\pi_\infty = \pi_\infty G
$$


**Perron-Frobenius Theorem**:
> If G is a valid Markov matrix and all its elements are positive then the stationary distribution and limiting distribution are the same
- Limiting Distribution: state distribution you'd arrive at after transitioning by G an infinite number of times
- Stationary Distribution: a state distribution that does not change after transitioning by G
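
A toy sketch of the smoothed PageRank iteration. Only the two links out of `lp.me` come from the example above; the links back to `lp.me`, the function name `pagerank`, and the uniform row used for pages with no outgoing links are assumptions made for illustration.

```python
import numpy as np

def pagerank(links, damping=0.85, n_iter=200):
    pages = sorted(links)
    index = {page: k for k, page in enumerate(pages)}
    M = len(pages)

    # A(i, j) = 1 / n(i) if page i links to page j, else 0
    A = np.zeros((M, M))
    for page, outlinks in links.items():
        if outlinks:
            for target in outlinks:
                A[index[page], index[target]] = 1 / len(outlinks)
        else:
            A[index[page], :] = 1 / M  # assumption: dangling page jumps anywhere

    U = np.full((M, M), 1 / M)
    G = damping * A + (1 - damping) * U  # G = 0.85 A + 0.15 U

    pi = np.full(M, 1 / M)
    for _ in range(n_iter):
        pi = pi @ G  # approaches the limiting distribution of G
    return dict(zip(pages, pi))

links = {
    "lp.me": ["dlc.com", "yt.com"],
    "dlc.com": ["lp.me"],  # hypothetical back-link
    "yt.com": ["lp.me"],   # hypothetical back-link
}
ranks = pagerank(links)
print(ranks)
```

Every entry of G is positive thanks to the U term, which is the condition the theorem above relies on.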

## Statistics

### Smoothing (Dampening)

To resolve the issue of too few (or zero) samples when computing a mean

$$
r = \frac{\sum^N_{i=1} X_i + \lambda \mu_0}{N+\lambda}
$$


- $\lambda$: a small positive constant (acts like $\lambda$ fake ratings of value $\mu_0$)
- $\mu_0$: the global average or just some middle value

\

For example:
- 1000 reviews of 4 star - μ = 3 - λ = 1 → 3.999
- 5 reviews of 4 star - μ = 3 - λ = 1 → 3.83
- 1 review of 4 star - μ = 3 - λ = 1 → 3.5
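
The three examples above can be reproduced with a few lines. `smoothed_mean` is an illustrative name; the defaults mirror the μ = 3, λ = 1 used in the list.

```python
def smoothed_mean(ratings, mu0=3.0, lam=1.0):
    """(sum of ratings + lam * mu0) / (N + lam)"""
    return (sum(ratings) + lam * mu0) / (len(ratings) + lam)

print(round(smoothed_mean([4] * 1000), 3))  # 3.999
print(round(smoothed_mean([4] * 5), 2))     # 3.83
print(smoothed_mean([4]))                   # 3.5
print(smoothed_mean([]))                    # 3.0, no data falls back to mu0
```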

### Explore-Exploit Dilemma

Example 1

Imagine we want to find the slot machine with the highest win rate among 10 slot machines.

A traditional statistical test can tell us whether or not there's a significant difference in win rates between machines.

If we play each machine 100 times (1000 turns total), then 900 turns (9/10) were spent on suboptimal machines.

Hence the dilemma: play more to get better estimates (explore), or play less and commit to the apparent best (exploit)?


---

Example 2

Watching a bunch of YouTube videos on how to make eggs.

Now your recommendations are filled with videos about making eggs.

Probably suboptimal: once I've figured out how to make eggs, I don't want to watch more egg videos.

YouTube is exploiting the fact that I watched egg videos, and not exploring other topics.

Should there be a stronger exploration component?

Maybe I'd like to see movie trailers or machine learning videos.

---

How do we strike a balance between these 2 opposing forces?

Smoothed average gives us one part of the solution

Making good things look worse and bad things look better


### Bayesian Method

Bayesian method automatically balances need to explore and exploit

- 2 fat distributions: explore both (totally random ranking)
- 2 skinny distributions: exploit both (nearly deterministic ranking)
- Mixed: explore and exploit co-exist


Completely automatic - does not require A/B testing
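
One way to picture the fat vs. skinny behaviour is to rank items by a random sample from each item's posterior. The sketch below assumes click/view data with a Beta(1, 1) prior; this sampling scheme is commonly called Thompson sampling, a name the notes above do not use, so treat it and all function names here as assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ranking(clicks, views, rng, alpha=1.0, beta=1.0):
    """Draw one sample per item from its Beta posterior and rank by the samples."""
    clicks = np.asarray(clicks)
    views = np.asarray(views)
    samples = rng.beta(alpha + clicks, beta + views - clicks)
    return np.argsort(-samples)  # best sampled item first

def top_item_frequency(clicks, views, n_draws=2000):
    """How often each item ends up ranked first over many sampled rankings."""
    wins = np.zeros(len(clicks))
    for _ in range(n_draws):
        wins[sample_ranking(clicks, views, rng)[0]] += 1
    return wins / n_draws

# Two fat posteriors (little data): the ranking is close to a coin flip
fat = top_item_frequency(clicks=[1, 1], views=[2, 2])
# Two skinny posteriors (lots of data): the ranking is nearly deterministic
skinny = top_item_frequency(clicks=[600, 400], views=[1000, 1000])
print(fat, skinny)
```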



## Collaborative Filtering

Start with a score that is not specific to any particular user: give each item j = 1 … M a score s(j).

Basic algorithm is to make s(j) the average rating for j

$$
s(j) = \frac{\sum_{i \in \Omega_j} r_{ij}}{|\Omega_j|} \\
$$

- $\Omega_j$ = set of all users who rated item j
- $r_{ij}$ = rating user i gave item j

In words: the average rating for an item is the sum of its ratings divided by the number of ratings.

**Personalize the score**

s(i,j) can depend both on user i and item j

$$
s(i,j) = \frac{\sum_{i' \in \Omega_j} r_{i'j}}{|\Omega_j|} \\
$$

i' is just an index

i = 1 … N, N = number of users

j = 1 … M, M = number of items

$R_{N\times M}$ = user-item ratings matrix of size N × M


\

- The user-item matrix is reminiscent of the term-document matrix.
- X(t,d) = # of times term t appears in document d
- In terms of recommender systems, can think of X(t,d) as "how much does user t like item d"




### Sparsity

One characteristic of the user-item matrix that makes it unique to recommender systems is that it's sparse.

- Term-document matrix is sparse because most entries are 0
- User-item matrix is sparse because most entries are **empty**

The average user does not interact with all items.


**Goal of Collaborative Filtering**

- Most entries r(i,j) don't exist, and this is good.

If every user has seen every item, then there's nothing to recommend

Goal:
> We want to guess what you might rate an item you haven't seen yet

$$
s(i,j) = \hat r (i,j) = \text{ guess what user i might rate item j}
$$

E.g. if we think you might rate some movie a 5, we definitely want you to watch that movie.


### Regression

Since this is a regression problem, the evaluation metric is going to be the mean squared error.

Outline:
- user-user collaborative filtering
- item-item collaborative filtering

$$
\text{MSE} = \frac{1}{|\Omega|} \sum_{i,j \in \Omega}(r_{ij} - \hat r_{ij})^2
$$

Ω = set of pairs (i,j) where user i has rated item j

We're going to take our model's predicted ratings, compare them to the actual ratings, square the differences, and then take the average of those squared differences
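
As a toy illustration of the metric, here is the MSE of the non-personalized baseline (predict every rating as the item's average) on four made-up ratings. The dictionary keyed by `(user, item)` is only one convenient layout, and the values are invented.

```python
from collections import defaultdict

# Hypothetical known ratings: (user i, item j) -> r_ij
ratings = {(0, 0): 5.0, (1, 0): 3.0, (0, 1): 2.0, (2, 1): 4.0}

# Baseline s(j): average rating of item j
by_item = defaultdict(list)
for (i, j), r in ratings.items():
    by_item[j].append(r)
item_avg = {j: sum(rs) / len(rs) for j, rs in by_item.items()}

def mse(ratings, predict):
    """Mean of squared errors over the set of known (i, j) pairs."""
    return sum((r - predict(i, j)) ** 2 for (i, j), r in ratings.items()) / len(ratings)

baseline_mse = mse(ratings, lambda i, j: item_avg[j])
print(item_avg, baseline_mse)  # {0: 4.0, 1: 3.0} 1.0
```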

### User-User Collaborative Filtering

| | item 1 | item 2 | ... | item M |
|--|--|--|--|--|
| user 1 | score | score | ... | score |
| user 2 | score | score | ... | score |
| ... | ... | ... | ... | ... |
| user N | score | score | ... | score |

Comparing rows together, if 2 rows are very similar, it can be concluded that they have similar taste


Average rating reminder:

$$
s(i,j) = \frac{\sum_{i'\in \Omega_j} r_{i'j}}{|\Omega_j|}
$$

It treats everyone's rating of the items equally

User 1's s(i,j) depends equally on user 2's rating and user 3's rating, even though user 1 doesn't agree with user 2.


**Weighting Ratings**

The weight $w_{ii'}$ should be small for users who don't agree with user i and large for users who do agree

$$
s(i,j) = \frac{\sum_{i'\in \Omega_j} w_{ii'} r_{i'j}}{\sum_{i'\in\Omega_j} w_{ii'}} \\
$$

Users can be biased: optimistic or pessimistic.

**Deviation**

Don't care about your absolute rating, but how much it deviates from your own average.
- if your average is 2.5, but you rate something 5, it must be really great
- if you rate everything a 5, it's difficult to know how those items compare

$$
\text{dev}(i,j) = r(i,j) - \bar r_i, \text{ for a known rating}
$$

My predicted rating is my own average + predicted deviation

$$
\hat{\text{dev}}(i,j) = \frac{1}{|\Omega_j|}\sum_{i'\in \Omega_j} \{ r(i',j) - \bar r_{i'} \} \\
$$

For a prediction from known ratings

$$
s(i,j) = \bar r_i + \frac{1}{|\Omega_j|}\sum_{i'\in \Omega_j} \{ r(i',j) - \bar r_{i'} \} \\
= \bar r_i + \hat{\text{dev}}(i,j) \\
$$

Note: In order to make recommendation, I don't need to add back the average, because it's the same over all items

**Combine**

Combine the idea of deviations with the idea of weightings to get our final formula

$$
s(i,j) = \bar r_i + \frac{\sum_{i'\in \Omega_j} w_{ii'} \{ r_{i'j} - \bar r_{i'} \}}{\sum_{i'\in \Omega_j} |w_{ii'}|}  \\
$$
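
A small numeric sketch of the combined formula, with the weights taken as given. `predict_rating` and the neighbor tuples are hypothetical; falling back to the user's own average when no neighbor contributes is an assumption for the empty case.

```python
def predict_rating(user_mean, neighbors):
    """neighbors: list of (w, r, r_bar) for users i' who rated item j, where
    w = w_ii', r = r_i'j and r_bar = average rating of user i'."""
    numerator = sum(w * (r - r_bar) for w, r, r_bar in neighbors)
    denominator = sum(abs(w) for w, _, _ in neighbors)
    if denominator == 0:
        return user_mean  # assumption: no usable neighbors, use own average
    return user_mean + numerator / denominator

# A similar user rated the item above their own average, a weaker one below
s = predict_rating(3.0, [(0.9, 5.0, 4.0), (0.5, 2.0, 3.0)])
print(s)  # 3 + (0.9 * 1 - 0.5 * 1) / 1.4

# A negatively correlated user disliking the item pushes the prediction up
s_neg = predict_rating(3.0, [(-1.0, 1.0, 3.0)])
print(s_neg)  # 5.0
```

The absolute value in the denominator is what keeps negative weights from flipping the sign of the normalization.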

How to calculate the weights

#### Pearson Correlation Coefficient

$$
\varrho_{xy} = \frac
{\sum^N_{i=1} (x_i -\bar x)(y_i - \bar y)}
{
  \sqrt{\sum^N_{i=1} (x_i - \bar x)^2 }
  \sqrt{\sum^N_{i=1} (y_i - \bar y)^2 }
}
$$

Our data is sparse, meaning we have a lot of missing data.

Update formula:

$$
w_{ii'} = \frac
{\sum_{j\in \Psi_{ii'}} (r_{ij} -\bar r_i)(r_{i'j} - \bar r_{i'})}
{
  \sqrt{\sum_{j\in \Psi_{ii'}} (r_{ij} -\bar r_i)^2 }
  \sqrt{\sum_{j\in \Psi_{ii'}} (r_{i'j} -\bar r_{i'})^2 }
}
$$

- $\Psi_i$ = set of items that user i has rated
- $\Psi_{ii'}$ = set of items that user i and i' have rated
- $\Psi_{ii'} = \Psi_i \cap \Psi_{i'}$
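
A minimal sketch of this weight for two users stored as `{item: rating}` dictionaries. The users and ratings are invented, `pearson_weight` is an illustrative name, and returning 0 when there is too little overlap (or zero variance on the common items) is an assumption. Each user's mean is taken over all items that user rated, and the sums run over the common items only.

```python
import math

def pearson_weight(ratings_a, ratings_b, min_common=2):
    common = set(ratings_a) & set(ratings_b)  # Psi_ii'
    if len(common) < min_common:
        return 0.0  # assumption: too little overlap to trust
    mean_a = sum(ratings_a.values()) / len(ratings_a)
    mean_b = sum(ratings_b.values()) / len(ratings_b)
    num = sum((ratings_a[j] - mean_a) * (ratings_b[j] - mean_b) for j in common)
    den_a = math.sqrt(sum((ratings_a[j] - mean_a) ** 2 for j in common))
    den_b = math.sqrt(sum((ratings_b[j] - mean_b) ** 2 for j in common))
    if den_a == 0 or den_b == 0:
        return 0.0  # assumption: no variation on the common items
    return num / (den_a * den_b)

alice = {"A": 5, "B": 1, "C": 3}
bob = {"A": 4, "B": 2, "C": 3, "D": 3}   # same taste, milder ratings
carol = {"A": 1, "B": 5, "C": 3}         # opposite taste

w_ab = pearson_weight(alice, bob)
w_ac = pearson_weight(alice, carol)
print(w_ab, w_ac)  # close to +1 and -1
```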

#### Cosine Similarity

$$
\cos \theta = \frac{x^Ty}{|x| \ |y|} = \frac
{\sum_{i=1}^N x_iy_i}
{
  \sqrt{\sum_{i=1}^N x_i^2 }
  \sqrt{\sum_{i=1}^N y_i^2 }
} \\
$$

They are the same, except Pearson is centered.

We want to center them anyway because we're working with deviations, not absolute ratings


\

If 2 users have zero or very few items in common, we don't want to consider them in the calculation

**Neighborhood**

In practice, don't sum over all users who rated item j (takes too long)

- It can help to precompute weights beforehand
- Instead of summing over all users, take the ones with highest weight
  - E.g. use K nearest neighbors, K = 25 up to 50

\

In summary:
discard users with no items, or only a few items, in common. Keep only users whose weights are high.
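
The selection rule in the summary can be sketched with `heapq.nlargest`. The weights and overlap counts below are invented, and `top_k_neighbors` is a hypothetical helper, not the implementation used later in these notes (which keeps a sorted list instead).

```python
import heapq

def top_k_neighbors(weights, common_counts, k=25, min_common=5):
    """Keep the k highest-weight users among those with enough items in common."""
    candidates = [
        (w, user)
        for user, w in weights.items()
        if common_counts.get(user, 0) >= min_common
    ]
    return heapq.nlargest(k, candidates)

# Hypothetical precomputed weights w_ii' for one user i, and overlap sizes
weights = {"u1": 0.9, "u2": 0.8, "u3": 0.95, "u4": -0.2}
common_counts = {"u1": 10, "u2": 6, "u3": 2, "u4": 12}

neighbors_of_i = top_k_neighbors(weights, common_counts, k=2)
print(neighbors_of_i)  # [(0.9, 'u1'), (0.8, 'u2')]; u3 is dropped for low overlap
```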

#### Python Implementation

Outline
- Split data into train and test sets
- Calculate weights using train set
- Make a predict function, e.g. score ← predict(i,j)
- Output MSE for train and test sets


Data fetched from [MovieLens 20M Dataset](https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset) - File `rating.csv`

In [None]:
LARGE_FILE_DIR  = "MachineLearning/notes/large_files"

Parsing IDs

In [None]:
import pandas as pd

df = pd.read_csv(LARGE_FILE_DIR + "/rating.csv")

# note:
# user ids are ordered sequentially from 1..138493
# with no missing numbers
# movie ids are integers from 1..131262
# NOT all movie ids appear
# there are only 26744 movie ids

# make the user ids go from 0...N-1
df.userId = df.userId - 1

# create a mapping from original movie ids to contiguous indices 0...M-1
unique_movie_ids = set(df.movieId.values)
movie2idx = {movie_id: idx for idx, movie_id in enumerate(unique_movie_ids)}

# add them to the data frame
# Series.map does the dictionary lookup per value, which is much faster
# than a row-wise df.apply(..., axis=1)
df['movie_idx'] = df.movieId.map(movie2idx)

df = df.drop(columns=['timestamp'])

df.to_csv(LARGE_FILE_DIR + '/edited_rating.csv', index=False)

Shrinking data size

In [None]:
import numpy as np
from collections import Counter

# load in the data
df = pd.read_csv(LARGE_FILE_DIR + '/edited_rating.csv')
print("original dataframe size:", len(df))

N = df.userId.max() + 1 # number of users
M = df.movie_idx.max() + 1 # number of movies

user_ids_count = Counter(df.userId)
movie_ids_count = Counter(df.movie_idx)

# number of users and movies we would like to keep
n = 10000
m = 2000

# keep the n most active users and the m most rated movies
user_ids = [user_id for user_id, _ in user_ids_count.most_common(n)]
movie_ids = [movie_id for movie_id, _ in movie_ids_count.most_common(m)]

# make a copy, otherwise ids won't be overwritten
df_small = df[df.userId.isin(user_ids) & df.movie_idx.isin(movie_ids)].copy()

# need to remake user ids and movie ids since they are no longer sequential
new_user_id_map = {}
i = 0
for old in user_ids:
  new_user_id_map[old] = i
  i += 1
print("i:", i)

new_movie_id_map = {}
j = 0
for old in movie_ids:
  new_movie_id_map[old] = j
  j += 1
print("j:", j)

print("Setting new ids")
df_small.loc[:, 'userId'] = df_small.apply(lambda row: new_user_id_map[row.userId], axis=1)
df_small.loc[:, 'movie_idx'] = df_small.apply(lambda row: new_movie_id_map[row.movie_idx], axis=1)
# df_small.drop(columns=['userId', 'movie_idx'])
# df_small.rename(index=str, columns={'new_userId': 'userId', 'new_movie_idx': 'movie_idx'})
print("max user id:", df_small.userId.max())
print("max movie id:", df_small.movie_idx.max())

print("small dataframe size:", len(df_small))
df_small.to_csv(LARGE_FILE_DIR + '/small_rating.csv', index=False)

original dataframe size: 20000263
i: 10000
j: 2000
Setting new ids
max user id: 9999
max movie id: 1999
small dataframe size: 5392025


Creating user-movie-rating data structures

In [None]:
import pickle
import pandas as pd
from sklearn.utils import shuffle


df = pd.read_csv(LARGE_FILE_DIR + '/small_rating.csv')

N = df.userId.max() + 1 # number of users
M = df.movie_idx.max() + 1 # number of movies

# split into train and test
df = shuffle(df)
cutoff = int(0.8*len(df))
df_train = df.iloc[:cutoff]
df_test = df.iloc[cutoff:]

# a dictionary to tell us which users have rated which movies
user2movie = {}
# a dictionary to tell us which movies have been rated by which users
movie2user = {}
# a dictionary to look up ratings
usermovie2rating = {}
print("Calling: update_user2movie_and_movie2user")
count = 0
def update_user2movie_and_movie2user(row):
  global count
  count += 1
  if count % 100000 == 0:
    print("processed: %.3f" % (float(count)/cutoff))

  i = int(row.userId)
  j = int(row.movie_idx)
  if i not in user2movie:
    user2movie[i] = [j]
  else:
    user2movie[i].append(j)

  if j not in movie2user:
    movie2user[j] = [i]
  else:
    movie2user[j].append(i)

  usermovie2rating[(i,j)] = row.rating
df_train.apply(update_user2movie_and_movie2user, axis=1)

# test ratings dictionary
usermovie2rating_test = {}
print("Calling: update_usermovie2rating_test")
count = 0
def update_usermovie2rating_test(row):
  global count
  count += 1
  if count % 100000 == 0:
    print("processed: %.3f" % (float(count)/len(df_test)))

  i = int(row.userId)
  j = int(row.movie_idx)
  usermovie2rating_test[(i,j)] = row.rating
df_test.apply(update_usermovie2rating_test, axis=1)

# note: these are not really JSONs
with open(LARGE_FILE_DIR + '/user2movie.json', 'wb') as f:
  pickle.dump(user2movie, f)

with open(LARGE_FILE_DIR + '/movie2user.json', 'wb') as f:
  pickle.dump(movie2user, f)

with open(LARGE_FILE_DIR + '/usermovie2rating.json', 'wb') as f:
  pickle.dump(usermovie2rating, f)

with open(LARGE_FILE_DIR + '/usermovie2rating_test.json', 'wb') as f:
  pickle.dump(usermovie2rating_test, f)

Calling: update_user2movie_and_movie2user
processed: 0.023
processed: 0.046
processed: 0.070
processed: 0.093
processed: 0.116
processed: 0.139
processed: 0.162
processed: 0.185
processed: 0.209
processed: 0.232
processed: 0.255
processed: 0.278
processed: 0.301
processed: 0.325
processed: 0.348
processed: 0.371
processed: 0.394
processed: 0.417
processed: 0.440
processed: 0.464
processed: 0.487
processed: 0.510
processed: 0.533
processed: 0.556
processed: 0.580
processed: 0.603
processed: 0.626
processed: 0.649
processed: 0.672
processed: 0.695
processed: 0.719
processed: 0.742
processed: 0.765
processed: 0.788
processed: 0.811
processed: 0.835
processed: 0.858
processed: 0.881
processed: 0.904
processed: 0.927
processed: 0.950
processed: 0.974
processed: 0.997
Calling: update_usermovie2rating_test
processed: 0.093
processed: 0.185
processed: 0.278
processed: 0.371
processed: 0.464
processed: 0.556
processed: 0.649
processed: 0.742
processed: 0.835
processed: 0.927


Training and prediction

\

Weights:
$$
w_{ii'} = \frac
{\sum_{j\in \Psi_{ii'}} (r_{ij} -\bar r_i)(r_{i'j} - \bar r_{i'})}
{
  \sqrt{\sum_{j\in \Psi_{ii'}} (r_{ij} -\bar r_i)^2 }
  \sqrt{\sum_{j\in \Psi_{ii'}} (r_{i'j} -\bar r_{i'})^2 }
}
$$

Predict:
$$
s(i,j) = \bar r_i + \frac{\sum_{i'\in \Omega_j} w_{ii'} \{ r_{i'j} - \bar r_{i'} \}}{\sum_{i'\in \Omega_j} |w_{ii'}|}  \\
$$

In [None]:
import pickle
import numpy as np
from sklearn.utils import shuffle
from datetime import datetime
from sortedcontainers import SortedList

with open(LARGE_FILE_DIR + '/user2movie.json', 'rb') as f:
  user2movie = pickle.load(f)

with open(LARGE_FILE_DIR + '/movie2user.json', 'rb') as f:
  movie2user = pickle.load(f)

with open(LARGE_FILE_DIR + '/usermovie2rating.json', 'rb') as f:
  usermovie2rating = pickle.load(f)

with open(LARGE_FILE_DIR + '/usermovie2rating_test.json', 'rb') as f:
  usermovie2rating_test = pickle.load(f)


N = np.max(list(user2movie.keys())) + 1
# the test set may contain movies the train set doesn't have data on
m1 = np.max(list(movie2user.keys()))
m2 = np.max([m for (u, m), r in usermovie2rating_test.items()])
M = max(m1, m2) + 1
print("N:", N, "M:", M)

if N > 10000:
  print("N =", N, "are you sure you want to continue?")
  print("Comment out these lines if so...")
  exit()


# to find the user similarities, you have to do O(N^2 * M) calculations!
# in the "real-world" you'd want to parallelize this
# note: we really only have to do half the calculations, since w_ij is symmetric
K = 25 # number of neighbors we'd like to consider
limit = 5 # number of movies two users must have in common in order to be considered
neighbors = [] # store neighbors in this list
averages = [] # each user's average rating for later use
deviations = [] # each user's deviation for later use
for i in range(N):
  # find the 25 closest users to user i
  movies_i = user2movie[i]
  movies_i_set = set(movies_i)

  # calculate avg and deviation
  ratings_i = { movie:usermovie2rating[(i, movie)] for movie in movies_i }
  avg_i = np.mean(list(ratings_i.values()))
  dev_i = { movie:(rating - avg_i) for movie, rating in ratings_i.items() }
  dev_i_values = np.array(list(dev_i.values()))
  sigma_i = np.sqrt(dev_i_values.dot(dev_i_values))

  # save these for later use
  averages.append(avg_i)
  deviations.append(dev_i)

  sl = SortedList()
  for j in range(N):
    # don't include yourself
    if j != i:
      movies_j = user2movie[j]
      movies_j_set = set(movies_j)
      common_movies = (movies_i_set & movies_j_set) # intersection
      if len(common_movies) > limit:
        # calculate avg and deviation
        ratings_j = { movie:usermovie2rating[(j, movie)] for movie in movies_j }
        avg_j = np.mean(list(ratings_j.values()))
        dev_j = { movie:(rating - avg_j) for movie, rating in ratings_j.items() }
        dev_j_values = np.array(list(dev_j.values()))
        sigma_j = np.sqrt(dev_j_values.dot(dev_j_values))

        # calculate correlation coefficient
        numerator = sum(dev_i[m]*dev_j[m] for m in common_movies)
        w_ij = numerator / (sigma_i * sigma_j)

        # insert into sorted list and truncate
        # negate weight, because list is sorted ascending
        # maximum value (1) is "closest"
        sl.add((-w_ij, j))
        if len(sl) > K:
          del sl[-1]

  # store the neighbors
  neighbors.append(sl)

  # print out useful things
  if i % 1 == 0:
    print(i)


# using neighbors, calculate train and test MSE

def predict(i, m):
  # calculate the weighted sum of deviations
  numerator = 0
  denominator = 0
  for neg_w, j in neighbors[i]:
    # remember, the weight is stored as its negative
    # so the negative of the negative weight is the positive weight
    try:
      numerator += -neg_w * deviations[j][m]
      denominator += abs(neg_w)
    except KeyError:
      # neighbor may not have rated the same movie
      # don't want to do dictionary lookup twice
      # so just catch the exception and skip this neighbor
      pass

  if denominator == 0:
    prediction = averages[i]
  else:
    prediction = numerator / denominator + averages[i]
  prediction = min(5, prediction)
  prediction = max(0.5, prediction) # min rating is 0.5
  return prediction


train_predictions = []
train_targets = []
for (i, m), target in usermovie2rating.items():
  # calculate the prediction for this movie
  prediction = predict(i, m)

  # save the prediction and target
  train_predictions.append(prediction)
  train_targets.append(target)

test_predictions = []
test_targets = []
# same thing for test set
for (i, m), target in usermovie2rating_test.items():
  # calculate the prediction for this movie
  prediction = predict(i, m)

  # save the prediction and target
  test_predictions.append(prediction)
  test_targets.append(target)


# calculate mean squared error (lower is better)
def mse(p, t):
  p = np.array(p)
  t = np.array(t)
  return np.mean((p - t)**2)

print('train mse:', mse(train_predictions, train_targets))
print('test mse:', mse(test_predictions, test_targets))

### Item-Item Collaborative Filtering



To find out if 2 items are similar



$$
w_{jj'} = \frac
{\sum_{i\in \Omega_{jj'}} (r_{ij} -\bar r_j)(r_{ij'} - \bar r_{j'})}
{
  \sqrt{\sum_{i\in \Omega_{jj'}} (r_{ij} -\bar r_j)^2 }
  \sqrt{\sum_{i\in \Omega_{jj'}} (r_{ij'} -\bar r_{j'})^2 }
}
$$

- $\Omega_j$ users who rated item j
- $\Omega_{jj'}$ users who rated item j and item j'
- $\bar r_j$ average rating for item j


Formula:

$$
s(i,j) = \bar r_j + \frac{\sum_{j'\in \Psi_i} w_{jj'} \{ r_{ij'} - \bar r_{j'} \}}{\sum_{j'\in \Psi_i} |w_{jj'}|}  \\
$$

- $\Psi_i$ items user i has rated

Deviation: how much user i likes item j', compared to how much everyone else likes j'

If user i really likes j' (more than other users do) and j is similar to j' (weight is high), then user i probably likes j too.

#### Python Implementation



In [1]:
# LARGE_FILE_DIR  = "MachineLearning/notes/large_files"
LARGE_FILE_DIR  = "."

In [2]:
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from datetime import datetime
from sortedcontainers import SortedList

with open(LARGE_FILE_DIR + '/user2movie.json', 'rb') as f:
  user2movie = pickle.load(f)
with open(LARGE_FILE_DIR + '/movie2user.json', 'rb') as f:
  movie2user = pickle.load(f)
with open(LARGE_FILE_DIR + '/usermovie2rating.json', 'rb') as f:
  usermovie2rating = pickle.load(f)
with open(LARGE_FILE_DIR + '/usermovie2rating_test.json', 'rb') as f:
  usermovie2rating_test = pickle.load(f)


N = np.max(list(user2movie.keys())) + 1
# the test set may contain movies the train set doesn't have data on
m1 = np.max(list(movie2user.keys()))
m2 = np.max([m for (u, m), r in usermovie2rating_test.items()])
M = max(m1, m2) + 1
print("N:", N, "M:", M)

if M > 2000:
  print("M =", M, "are you sure you want to continue?")
  print("Comment out these lines if so...")
  exit()


# to find the item similarities, you have to do O(M^2 * N) calculations!
# in the "real-world" you'd want to parallelize this
# note: we really only have to do half the calculations, since w_ij is symmetric
K = 20 # number of neighbors we'd like to consider
limit = 5 # number of users two movies must have in common in order to be considered
neighbors = [] # store neighbors in this list
averages = [] # each item's average rating for later use
deviations = [] # each item's deviation for later use

for i in range(M):
  # find the K closest items to item i
  users_i = movie2user[i]
  users_i_set = set(users_i)

  # calculate avg and deviation
  ratings_i = { user:usermovie2rating[(user, i)] for user in users_i }
  avg_i = np.mean(list(ratings_i.values()))
  dev_i = { user:(rating - avg_i) for user, rating in ratings_i.items() }
  dev_i_values = np.array(list(dev_i.values()))
  sigma_i = np.sqrt(dev_i_values.dot(dev_i_values))

  # save these for later use
  averages.append(avg_i)
  deviations.append(dev_i)

  sl = SortedList()
  for j in range(M):
    # don't include yourself
    if j != i:
      users_j = movie2user[j]
      users_j_set = set(users_j)
      common_users = (users_i_set & users_j_set) # intersection
      if len(common_users) > limit:
        # calculate avg and deviation
        ratings_j = { user:usermovie2rating[(user, j)] for user in users_j }
        avg_j = np.mean(list(ratings_j.values()))
        dev_j = { user:(rating - avg_j) for user, rating in ratings_j.items() }
        dev_j_values = np.array(list(dev_j.values()))
        sigma_j = np.sqrt(dev_j_values.dot(dev_j_values))

        # calculate correlation coefficient
        numerator = sum(dev_i[u]*dev_j[u] for u in common_users)
        w_ij = numerator / (sigma_i * sigma_j)

        # insert into sorted list and truncate
        # negate weight, because list is sorted ascending
        # maximum value (1) is "closest"
        sl.add((-w_ij, j))
        if len(sl) > K:
          del sl[-1]

  # store the neighbors
  neighbors.append(sl)

  # print out useful things
  if i % 1 == 0:
    print(i)


# using neighbors, calculate train and test MSE

def predict(i, u):
  # calculate the weighted sum of deviations
  numerator = 0
  denominator = 0
  for neg_w, j in neighbors[i]:
    # remember, the weight is stored as its negative
    # so the negative of the negative weight is the positive weight
    try:
      numerator += -neg_w * deviations[j][u]
      denominator += abs(neg_w)
    except KeyError:
      # neighbor may not have been rated by the same user
      # don't want to do dictionary lookup twice
      # so just throw exception
      pass

  if denominator == 0:
    prediction = averages[i]
  else:
    prediction = numerator / denominator + averages[i]
  prediction = min(5, prediction)
  prediction = max(0.5, prediction) # min rating is 0.5
  return prediction



train_predictions = []
train_targets = []
for (u, m), target in usermovie2rating.items():
  # calculate the prediction for this movie
  prediction = predict(m, u)

  # save the prediction and target
  train_predictions.append(prediction)
  train_targets.append(target)

test_predictions = []
test_targets = []
# same thing for test set
for (u, m), target in usermovie2rating_test.items():
  # calculate the prediction for this movie
  prediction = predict(m, u)

  # save the prediction and target
  test_predictions.append(prediction)
  test_targets.append(target)


# calculate mean squared error (lower is better)
def mse(p, t):
  p = np.array(p)
  t = np.array(t)
  return np.mean((p - t)**2)

print('train mse:', mse(train_predictions, train_targets))
print('test mse:', mse(test_predictions, test_targets))

### Comparison

- User-User CF: choose items for a user, because those items have been liked by similar users

- Item-item CF: choose items for a user, because this user liked similar items in the past

- When comparing 2 items, you have a lot more data than when comparing 2 users

- Item-based CF is faster
  - given a user, score for each item: $O(M^2N)$
    - There are $M^2$ item-item weights, and each vector is length N
  - For user-based CF, we had $O(N^2M)$
    - $N>M$, so $N^2$ compared to $M^2$ is even worse

- Item-based CF is more accurate as there's more data to train the weights on.
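
Back-of-the-envelope arithmetic for the two costs, using the reduced dataset sizes from above (N = 10000 users, M = 2000 movies). These are operation counts from the big-O expressions only; constant factors and the sparsity of the data are ignored.

```python
N, M = 10_000, 2_000  # users and movies kept in the shrunken dataset

user_based = N**2 * M  # O(N^2 M)
item_based = M**2 * N  # O(M^2 N)
ratio = user_based / item_based

print(user_based, item_based, ratio)  # 200000000000 40000000000 5.0
```

The ratio is simply N / M, so the gap widens as the user base grows faster than the catalog.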