# Project 4 - Movie Recommendations

#### Alex Pissinou Makki

In [2]:
library(dplyr)
library(ggplot2)
library(recommenderlab)
library(DT)
library(data.table)
library(reshape2)

set.seed(1528)

In [3]:
myurl = "https://liangfgithub.github.io/MovieData/"

## System I: Recommendation based on genres

Suppose you know the user's favorite genre. How would you recommend movies to him/her?

Propose **two** recommendation schemes along with all necessary technical details.

For example, you can recommend the top-five most popular movies in that genre, then you have to define what you mean by "most popular". Or recommend the top-five highly-rated movies in that genre; again need to define what you mean by highly-rated. (Will the movie that receives only one 5-point review be considered highly rated?) Or recommend the most trendy movies in that genre; define how you measure trendiness.

For this part, you do not really need `recommenderlab`. Some data wrangling/summary tools would be enough.

#### Read in Data

##### ratings data

In [4]:
# use colClasses = 'NULL' to skip columns
ratings = read.csv(paste0(myurl, 'ratings.dat?raw=true'), 
                   sep = ':',
                   colClasses = c('integer', 'NULL'), 
                   header = FALSE)
colnames(ratings) = c('UserID', 'MovieID', 'Rating', 'Timestamp')

##### movies data

In `movies.dat`, some movie names contain a single colon (`:`), so the method above does not work.

In [5]:
movies = readLines(paste0(myurl, 'movies.dat?raw=true'))
movies = strsplit(movies, split = "::", fixed = TRUE, useBytes = TRUE)
movies = matrix(unlist(movies), ncol = 3, byrow = TRUE)
movies = data.frame(movies, stringsAsFactors = FALSE)
colnames(movies) = c('MovieID', 'Title', 'Genres')
movies$MovieID = as.integer(movies$MovieID)

# convert accented characters
movies$Title = iconv(movies$Title, "latin1", "UTF-8")

# extract year
movies$Year = as.numeric(unlist(
  lapply(movies$Title, function(x) substr(x, nchar(x)-4, nchar(x)-1))))
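
The effect of splitting on `::` rather than `:` can be checked on a single line. This is a minimal sketch: the `line` string below is assembled by hand from the MovieID, title, and genres of a movie that appears later in this notebook, and is assumed to match the layout of `movies.dat`.

```r
# One record in the assumed "MovieID::Title::Genres" layout
line <- "260::Star Wars: Episode IV - A New Hope (1977)::Action|Adventure|Fantasy|Sci-Fi"

# Splitting on the double colon keeps the title intact
fields <- strsplit(line, split = "::", fixed = TRUE)[[1]]

# Splitting on a single colon also breaks the title apart
naive <- strsplit(line, split = ":", fixed = TRUE)[[1]]

# The year is the four characters before the closing parenthesis
title <- fields[2]
year <- as.numeric(substr(title, nchar(title) - 4, nchar(title) - 1))

length(fields)  # 3
length(naive)   # more than 3, because the title contains ':'
year            # 1977
```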

##### user data

In [6]:
users = read.csv(paste0(myurl, 'users.dat?raw=true'),
                 sep = ':', header = FALSE)
users = users[, -c(2,4,6,8)] # skip columns
colnames(users) = c('UserID', 'Gender', 'Age', 'Occupation', 'Zip-code')

### Recommendation Scheme I: Top-5 Highest-Rated Movies

Here, "highest-rated" means the movies with the highest average rating among those with at least 100 user ratings. The minimum-count requirement keeps a movie from ranking at the top only because a handful of users rated it highly.

To compute the highest-rated movies in a genre, we first aggregate the ratings for each movie and drop movies with fewer than 100 ratings. We then keep the movies that belong to the given genre, sort them by average rating, and return the top 5.

In [7]:
recommend_by_genre_avg <- function(genre, num_rec = 5) {
    # aggregate reviews for each movie
    tmp = ratings %>% 
      group_by(MovieID) %>% 
      summarize(num_ratings = n(), avg_ratings = mean(Rating)) %>%
      inner_join(movies, by = 'MovieID')
    # filter movies by minimum number of ratings required
    min_num_ratings = 100
    popular_avg_ratings = as.data.frame(tmp %>% 
      filter(num_ratings >= min_num_ratings), stringsAsFactors=FALSE)
    # create a binary movies x genres + MovieID matrix
    genres = as.data.frame(movies$Genres, stringsAsFactors=FALSE)
    tmp = as.data.frame(tstrsplit(genres[,1], '[|]',
                                  type.convert=TRUE),
                        stringsAsFactors=FALSE)
    genre_list = c("Action", "Adventure", "Animation", 
                   "Children's", "Comedy", "Crime",
                   "Documentary", "Drama", "Fantasy",
                   "Film-Noir", "Horror", "Musical", 
                   "Mystery", "Romance", "Sci-Fi", 
                   "Thriller", "War", "Western")
    m = length(genre_list)
    genre_matrix = matrix(0, nrow(movies), length(genre_list))
    for(i in 1:nrow(tmp)){
      genre_matrix[i,genre_list %in% tmp[i,]]=1
    }
    colnames(genre_matrix) = genre_list
    genre_matrix <- cbind(genre_matrix, MovieID = movies$MovieID)
    remove("tmp", "genres")
    # create a joined matrix of movies in the `genre` and associated avg. rating
    # sorted descending
    genre_movies = as.data.frame(genre_matrix[genre_matrix[, genre] == 1,,drop=FALSE],
                             stringsAsFactors=FALSE)
    genre_avg_ratings = popular_avg_ratings %>% 
        inner_join(genre_movies, by = 'MovieID')
    genre_avg_ratings = genre_avg_ratings[order(-genre_avg_ratings$avg_ratings),]
    return(head(genre_avg_ratings, num_rec))
}
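
The effect of the minimum-count filter can be seen on a toy example. The sketch below uses only base R; the data frames `movies_toy` and `ratings_toy` are made up for illustration, and the threshold is lowered to 2 because the toy data is tiny (the function above uses 100).

```r
# Hypothetical toy data, not taken from MovieLens
movies_toy <- data.frame(
  MovieID = 1:4,
  Title   = c("A (1990)", "B (1991)", "C (1992)", "D (1993)"),
  Genres  = c("Drama|War", "Comedy", "War", "Action|War"),
  stringsAsFactors = FALSE
)
ratings_toy <- data.frame(
  MovieID = c(1, 1, 1, 2, 2, 3, 4, 4),
  Rating  = c(5, 4, 5, 3, 4, 5, 2, 3)
)

# Number of ratings and average rating per movie
num_ratings <- tapply(ratings_toy$Rating, ratings_toy$MovieID, length)
avg_ratings <- tapply(ratings_toy$Rating, ratings_toy$MovieID, mean)
stats <- data.frame(MovieID     = as.integer(names(num_ratings)),
                    num_ratings = as.vector(num_ratings),
                    avg_ratings = as.vector(avg_ratings))
stats <- merge(stats, movies_toy, by = "MovieID")

# Genre membership: split the Genres string on the literal '|'
in_genre <- vapply(strsplit(stats$Genres, "|", fixed = TRUE),
                   function(g) "War" %in% g, logical(1))

# Keep movies in the genre with enough ratings, best average first
min_num_ratings <- 2  # toy threshold
top <- stats[in_genre & stats$num_ratings >= min_num_ratings, ]
top <- top[order(-top$avg_ratings), ]
top[, c("MovieID", "num_ratings", "avg_ratings")]
```

Movie 3 has a perfect average of 5 but only one rating, so the filter drops it and movie 1 ranks first.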

As an example, for the `"War"` genre, the function returns the 5 movies with the highest `avg_ratings` among those with at least 100 ratings (`num_ratings`).

In [8]:
recommend_by_genre_avg("War")

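| | MovieID | num_ratings | avg_ratings | Title | Genres | Year | Action | Adventure | Animation | Children's | ⋯ | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 12 | 527 | 2304 | 4.510417 | Schindler's List (1993) | Drama\|War | 1993 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 25 | 1178 | 230 | 4.473913 | Paths of Glory (1957) | Drama\|War | 1957 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 15 | 750 | 1367 | 4.44989 | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) | Sci-Fi\|War | 1963 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 17 | 912 | 1669 | 4.412822 | Casablanca (1942) | Drama\|Romance\|War | 1942 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 29 | 1204 | 831 | 4.401925 | Lawrence of Arabia (1962) | Adventure\|War | 1962 | 0 | 1 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |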


### Recommendation Scheme II: Top-5 Trendiest Movies

Here, "trendy" means the movies with the most ratings, i.e., the movies that the largest number of people watched and rated, subject to a minimum average rating of 3. As in the previous scheme, the threshold on `avg_ratings` filters out "bad" recommendations: movies that have many ratings, most of them negative.

As before, to compute the trendiest movies in a genre, we first aggregate the ratings for each movie and drop movies with an average rating below 3.0. We then keep the movies in the given genre, sort them by number of ratings, and return the top 5.

In [9]:
recommend_by_genre_num <- function(genre, num_rec = 5) {
    # aggregate reviews for each movie
    tmp = ratings %>% 
      group_by(MovieID) %>% 
      summarize(num_ratings = n(), avg_ratings = mean(Rating)) %>%
      inner_join(movies, by = 'MovieID')
    # filter movies by minimum average rating required
    min_avg_ratings = 3.0
    trendy_num_ratings = as.data.frame(tmp %>% 
      filter(avg_ratings >= min_avg_ratings), stringsAsFactors=FALSE)
    # create a binary movies x genres + MovieID matrix
    genres = as.data.frame(movies$Genres, stringsAsFactors=FALSE)
    tmp = as.data.frame(tstrsplit(genres[,1], '[|]',
                                  type.convert=TRUE),
                        stringsAsFactors=FALSE)
    genre_list = c("Action", "Adventure", "Animation", 
                   "Children's", "Comedy", "Crime",
                   "Documentary", "Drama", "Fantasy",
                   "Film-Noir", "Horror", "Musical", 
                   "Mystery", "Romance", "Sci-Fi", 
                   "Thriller", "War", "Western")
    m = length(genre_list)
    genre_matrix = matrix(0, nrow(movies), length(genre_list))
    for(i in 1:nrow(tmp)){
      genre_matrix[i,genre_list %in% tmp[i,]]=1
    }
    colnames(genre_matrix) = genre_list
    genre_matrix <- cbind(genre_matrix, MovieID = movies$MovieID)
    remove("tmp", "genres")
    # create a joined matrix of movies in the `genre` and associated num rating
    # sorted descending
    genre_movies = as.data.frame(genre_matrix[genre_matrix[, genre] == 1,,drop=FALSE],
                             stringsAsFactors=FALSE)
    genre_num_ratings = trendy_num_ratings %>% 
        inner_join(genre_movies, by = 'MovieID')
    genre_num_ratings = genre_num_ratings[order(-genre_num_ratings$num_ratings),]
    return(head(genre_num_ratings, num_rec))
}

In [10]:
recommend_by_genre_num("Action")

| | MovieID | num_ratings | avg_ratings | Title | Genres | Year | Action | Adventure | Animation | Children's | ⋯ | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 15 | 260 | 2991 | 4.453694 | Star Wars: Episode IV - A New Hope (1977) | Action\|Adventure\|Fantasy\|Sci-Fi | 1977 | 1 | 1 | 0 | 0 | ⋯ | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 68 | 1196 | 2990 | 4.292977 | Star Wars: Episode V - The Empire Strikes Back (1980) | Action\|Adventure\|Drama\|Sci-Fi\|War | 1980 | 1 | 1 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 74 | 1210 | 2883 | 4.022893 | Star Wars: Episode VI - Return of the Jedi (1983) | Action\|Adventure\|Romance\|Sci-Fi\|War | 1983 | 1 | 1 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 35 | 480 | 2672 | 3.763847 | Jurassic Park (1993) | Action\|Adventure\|Sci-Fi | 1993 | 1 | 1 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 149 | 2028 | 2653 | 4.337354 | Saving Private Ryan (1998) | Action\|Drama\|War | 1998 | 1 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |


For System I of the ShinyApps application, the latter recommendation scheme, i.e., Top-$n$ Trendiest Movies, was used. The results for each genre were precomputed and cached so that the app can load them without recomputing.
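
One way to cache per-genre results is to compute them once and store them in a single `.rds` file. This is only a sketch of the idea, not the code used in the app: `recommend_stub` is a hypothetical stand-in for `recommend_by_genre_num`, and the file is written to a temporary location.

```r
# Hypothetical stand-in for recommend_by_genre_num()
recommend_stub <- function(genre, num_rec = 5) {
  data.frame(Genre = genre, Rank = seq_len(num_rec),
             stringsAsFactors = FALSE)
}

genre_list <- c("Action", "War")

# Compute the recommendations once per genre, keyed by genre name
cache <- lapply(setNames(genre_list, genre_list), recommend_stub)

# Store the whole list in one file; the app would read it at startup
cache_file <- tempfile(fileext = ".rds")
saveRDS(cache, cache_file)

# Later: load the cache and look up a genre without recomputing
cached <- readRDS(cache_file)
cached[["War"]]
```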

## System II: Collaborative recommendation system

Review **two** collaborative recommendation algorithms: UBCF and IBCF. (Suggest reading Sec 2.1-2.2 of the [recommenderlab tutorial](https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf))

Please follow the following steps to provide your review.

### User-based collaborative filtering (UBCF)

For UBCF, use the following options:

- `normalize = 'center'`: Let $R$ denote the rating matrix with rows as users and columns as movies; this option means that we need to subtract each non-NA entry by its row mean. Here, row means are computed based on non-NA entries; for example, the mean of vector (2, 4, NA, NA) should be 3.
- `nn = 20`: nearest neighborhood size is 20. That is, the prediction for a new user is based on ratings from 20 users who are most similar to this new user.
- `weighted = TRUE`: (this is the default option) Ratings from users that are more similar to the new user receive higher weights. That is, we use equation (4) (on page 6) instead of equation (3) (on page 5) in the recommenderlab tutorial.
- `method = 'Cosine'`: this similarity measure is defined in the 2nd paragraph of Sec 2.1 in recommenderlab tutorial. Remember to transform this measure to be between 0 and 1. There is a typo in the transformation formula in the tutorial; see below.

![Correction](correction.png "Correction")
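
The centering option and the rescaling of the similarity can be illustrated on a toy matrix. This is a minimal sketch: `R_toy` and `to_unit_interval` are made-up names, and the rescaling `(1 + s) / 2` is the one applied in the `ubcf` function below.

```r
# Toy rating matrix: rows are users, columns are movies
R_toy <- matrix(c(2, 4, NA, NA,
                  5, NA, 3, 1), nrow = 2, byrow = TRUE)

# Row means over non-NA entries only
row_means <- rowMeans(R_toy, na.rm = TRUE)

# Subtract each user's mean from that user's non-NA ratings;
# NA entries stay NA
R_centered <- R_toy - row_means

# Map a cosine similarity from [-1, 1] to [0, 1]
to_unit_interval <- function(s) (1 + s) / 2

row_means                      # 3 3
R_centered[1, ]                # -1  1 NA NA
to_unit_interval(c(-1, 0, 1))  # 0.0 0.5 1.0
```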

In my implementation, as suggested, I first center the training and test data and then use `proxy::simil` to compute a vector of cosine similarities, $s_1, \dots, s_n$, between the test user and each of the $n$ training users, rescaled to lie between 0 and 1. I then take the indices of the $nn = 20$ training users with the highest similarity scores. Finally, I predict the rating for each movie from the centered ratings that these neighbors gave it, using the following formula for user $a$ and movie $l$:

$$
\hat{r}_{al}=\cfrac{1}{\sum_{i\in S}s_{i}}\sum_{i\in S}s_{i}r_{il}
$$
where $S = \{i: s_i\text{ and }r_{il}\text{ not NA}\}$ contains only those of the 20 nearest neighbors who rated movie $l$, i.e., whose rating is non-`NA`.
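
As a worked example of the formula, consider one movie and three neighbors, one of whom has not rated it. The numbers below are made up for illustration.

```r
# Similarities of three hypothetical neighbors to the test user
s <- c(0.9, 0.8, 0.6)
# Their centered ratings for one movie; the second neighbor has not rated it
r <- c(1, NA, -0.5)

# Only neighbors with a non-NA rating enter the sums
rated <- !is.na(r)
pred_centered <- sum(s[rated] * r[rated]) / sum(s[rated])

pred_centered  # (0.9 * 1 + 0.6 * (-0.5)) / (0.9 + 0.6) = 0.4
```

The second neighbor contributes to neither the numerator nor the denominator, so a high similarity has no effect when the rating is missing.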

Finally, the following post-processing steps are done before returning the final predicted ratings:
- Add back the mean of the test user to the predicted ratings
- Set infinite and `NaN` values to `NA`
- Set movies watched by the test user to `NA`

With the above method, my predictions met the accuracy criteria checked below.

In [14]:
ubcf <- function(R_train, R_test, nn = 20) {
    # normalize data
    data = as(R_train, "matrix")
    user.means = rowMeans(data, na.rm = TRUE)
    data = data - user.means
    newdata = as(R_test, "matrix")
    newuser.means = rowMeans(newdata, na.rm = TRUE)
    newdata = newdata - newuser.means
    # similarity vector
    user_sim = as.vector(proxy::simil(data, newdata, method = "cosine"))
    user_sim = (1 + user_sim) / 2
    # sort similarities
    top_n_idx <- sort(user_sim, index.return = TRUE, decreasing = TRUE)$ix[1:nn]
    neighbor_ratings = data[top_n_idx,]
    neighbor_sim = user_sim[top_n_idx]
    predicted_ratings <- c()
    for (j in 1:length(newdata)) {
        if (sum(is.na(neighbor_ratings[,j])) == nn) {
            predicted_ratings[j] = NA
        } else {
            predicted_ratings[j] = sum(neighbor_ratings[,j] * neighbor_sim, na.rm = TRUE)
            predicted_ratings[j] = predicted_ratings[j] / sum((!is.na(neighbor_ratings[,j])) * neighbor_sim)
        }
    }

    # Add back mean of test_user
    predicted_ratings = predicted_ratings + newuser.means
    # Set infinite and NaN values to NA
    predicted_ratings[is.infinite(predicted_ratings) | is.nan(predicted_ratings)] <- NA
    # Set movies watched by the test_user to NA
    predicted_ratings[!is.na(newdata)] <- NA
    return(predicted_ratings)
}

Here, I demonstrate how UBCF predicts the ratings of a new user based on training data. Use the first 500 users from MovieLens as training and predict the ratings of the 501st user.

In [15]:
library(recommenderlab)
myurl = "https://liangfgithub.github.io/MovieData/"
ratings = read.csv(paste0(myurl, 'ratings.dat?raw=true'), 
                   sep = ':',
                   colClasses = c('integer', 'NULL'), 
                   header = FALSE)
colnames(ratings) = c('UserID', 'MovieID', 'Rating', 'Timestamp')
i = paste0('u', ratings$UserID)
j = paste0('m', ratings$MovieID)
x = ratings$Rating
tmp = data.frame(i, j, x, stringsAsFactors = T)
Rmat = sparseMatrix(as.integer(tmp$i), as.integer(tmp$j), x = tmp$x)
rownames(Rmat) = levels(tmp$i)
colnames(Rmat) = levels(tmp$j)
Rmat = new('realRatingMatrix', data = Rmat)

train = Rmat[1:500, ]
test = Rmat[501, ]

**Store the predicted ratings for the 501st user in a vector named `mypred`.** Remember to provide all necessary code so we can reproduce your calculation for `mypred`.

In [16]:
mypred <- ubcf(train, test, nn = 20)

Next, compare your prediction with the one from `recommenderlab`

In [17]:
recommender.UBCF <- Recommender(train, method = "UBCF",
                                parameter = list(normalize = 'center', 
                                                 method = 'Cosine', 
                                                 nn = 20))

p.UBCF <- predict(recommender.UBCF, test, type="ratings")
p.UBCF <- as.numeric(as(p.UBCF, "matrix"))

sum(is.na(p.UBCF) != is.na(mypred)) ### should be zero
max(abs(p.UBCF - mypred), na.rm = TRUE)  ### should be less than 1e-06 

The last two commands above show that (1) `p.UBCF` and `mypred` assign NA to the same set of movies and (2) their non-NA predictions are very close (**should be less than 1e-06**).

**NAs in the prediction.** In `mypred` and `p.UBCF`, a movie may receive NA prediction due to two reasons: 1) none of the 20 similar users has provided a rating for this movie yet; 2) the 501st user has watched this movie before (i.e., s/he has already assigned a rating for this movie).

### Item-based collaborative filtering (IBCF)

Do the same for IBCF. For IBCF, use the following options:

- `normalize = 'center'`
- `k = 30`: the nearest neighborhood size for items is 30.
- `weighted = TRUE`: (this is the default option) That is, we use equation (5) (on page 7) in [recommenderlab tutorial](https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf)
- `method = 'Cosine'`

In my implementation, as suggested, I first center the training data and then use `proxy::simil` to create an $m\times m$ similarity matrix $S$ for movies, where $m$ is the number of movies. I pre-process $S$ so that each row keeps only its top $k = 30$ similarities (the `nn` argument of `ibcf` below) and all other entries are set to `NA`. Then, I use the modified $S$ matrix to predict the rating for each movie from the ratings the test user has already given, using the following formula for user $a$ and movie $l$:

$$
\hat{r}_{al}=\cfrac{1}{\sum_{i\in S(l)}s_{li}}\sum_{i\in S(l)}s_{li}r_{ai}
$$
where $S(l)$ contains only those of the 30 nearest neighbors of movie $l$ that user $a$ has rated, i.e., whose rating is non-`NA`.
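
The pruning step and the formula can be traced on a toy example with four movies and $k = 2$. This is a sketch with made-up numbers: `S_toy`, `S_k`, and `r_a` are illustrative names, and the diagonal of the similarity matrix is assumed to be `NA`.

```r
# Toy item-item similarity matrix (symmetric, NA on the diagonal)
S_toy <- matrix(c(NA, 0.9, 0.2, 0.6,
                  0.9, NA, 0.5, 0.4,
                  0.2, 0.5, NA, 0.8,
                  0.6, 0.4, 0.8, NA), nrow = 4, byrow = TRUE)
k <- 2

# Keep the k largest similarities in each row, set the rest to NA
S_k <- S_toy
for (l in seq_len(nrow(S_k))) {
  keep <- order(S_k[l, ], decreasing = TRUE)[seq_len(k)]  # NAs sort last
  S_k[l, -keep] <- NA
}

# The test user has rated movies 1 and 3 only
r_a <- c(5, NA, 3, NA)
rated <- which(!is.na(r_a))

# Weighted average over the neighbors the user has rated
W <- S_k[, rated, drop = FALSE]
num <- rowSums(sweep(W, 2, r_a[rated], `*`), na.rm = TRUE)
den <- rowSums(W, na.rm = TRUE)
pred <- num / den
pred[is.nan(pred) | is.infinite(pred)] <- NA  # no rated neighbor
pred[rated] <- NA                             # already rated

pred  # NA, 6 / 1.4, NA, 5.4 / 1.4
```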

Note that the test data is not centered: its values enter the prediction only through a weighted average whose weights sum to 1, so subtracting the user mean beforehand and adding it back afterwards would not change the result.
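
This invariance is easy to verify numerically. The similarities and ratings below are made up for illustration.

```r
# Hypothetical similarities and raw ratings of three rated neighbor movies
s <- c(0.9, 0.7, 0.6)
r <- c(4, 2, 5)
mu <- mean(r)  # stands in for the test user's mean rating

# Weighted average of the raw ratings
raw <- sum(s * r) / sum(s)

# Center first, take the weighted average, then add the mean back
centered <- sum(s * (r - mu)) / sum(s) + mu

all.equal(raw, centered)  # TRUE
```

The two agree because $\sum_i s_i (r_i - \mu) / \sum_i s_i + \mu = \sum_i s_i r_i / \sum_i s_i$ for any constant $\mu$.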

Finally, the rest of the post-processing steps are done similar to before:
- Set infinite and `NaN` values to `NA`
- Set movies watched by the test user to `NA`

With the above method, my predictions met the accuracy criteria checked below.

In [18]:
ibcf <- function(R_train, R_test, nn = 30) {
    # normalize data
    data = as(R_train, "matrix")
    user.means = rowMeans(data, na.rm = T)
    data = data - user.means
    newdata = as(R_test, "matrix")

    # similarity matrix
    item.sim = as.matrix(proxy::simil(t(data), method = "cosine"))
    item.sim = (1 + item.sim) / 2
    # process similarity matrix
    for (i in 1:nrow(item.sim)) {
        neighbor_idx <- tail(order(item.sim[i,], decreasing = F, na.last = F), nn)
        item.sim[i, -neighbor_idx] <- NA
    }

    non_na = which(!is.na(newdata))
    predicted_ratings = colSums(t(item.sim[,non_na]) * newdata[non_na], na.rm = T)
    predicted_ratings = predicted_ratings / rowSums(item.sim[,non_na], na.rm = T)

    # Set infinite and NaN values to NA
    predicted_ratings[is.infinite(predicted_ratings) | is.nan(predicted_ratings)] <- NA
    # Set movies watched by the test_user to NA
    predicted_ratings[!is.na(newdata)] <- NA
    return(predicted_ratings)
}

**Store your prediction for the 501st user in a vector named `mypred`.** Then compare your prediction with the one from `recommenderlab`

In [19]:
mypred <- ibcf(train, test, nn = 30)

Again, demonstrate how IBCF predicts the ratings of the 501st user based on ratings from the first 500 users.

In [20]:
recommender.IBCF <- Recommender(train, method = "IBCF",
                                parameter = list(normalize = 'center', 
                                                 method = 'Cosine', 
                                                 k = 30))

p.IBCF <- predict(recommender.IBCF, test, type="ratings")
p.IBCF <- as.numeric(as(p.IBCF, "matrix"))

## first output: should be less than 10
sum(is.na(p.IBCF) != is.na(mypred))  

## second output: should be less than 10%
mydiff = abs(p.IBCF - mypred)
sum(mydiff[!is.na(mydiff)] > 1e-6) / sum(!is.na(mydiff)) 

The first output counts the mismatches in NA assignments between `p.IBCF` and `mypred`. You should aim for fewer than 10 mismatches.

The second output measures the percentage of disagreement (difference bigger than 1e-06) among non-NA predictions. You should aim to keep this number below 10%.

**Question:** why do we encounter such a big discrepancy for IBCF, but not for UBCF? I have a partial answer but would like students to think about it.

UBCF is a simpler algorithm than IBCF and needs no training step. Some choices made in our implementation of IBCF could have led to this discrepancy. For example, we disregarded `NA` values throughout our calculation, whereas `recommenderlab` may have chosen to impute some values. Because the prediction multiplies the user's ratings by an $m\times m$ similarity matrix, any decision that alters the values in that matrix can lead to bigger changes than in UBCF. In particular, even the `(1 + mat) / 2` scaling of the similarity matrix could be implemented differently in `recommenderlab` for IBCF, which could explain the large discrepancy.

## Resources

You can use others' code, as long as you cite the source.

- GitHub for the nice Book Recommender System mentioned above [https://github.com/pspachtholz/BookRecommender] where you can also find his Kaggle report.
- R code for `recommenderlab` can be found here [Link]