<a href="https://www.bigdatauniversity.com"><img src = "https://ibm.box.com/shared/static/cw2c7r3o20w9zn8gkecaeyjhgw3xdgbj.png" width = 400 align = "center"></a>

<h1 align=center><font size = 5>Exercise: Collaborative Filtering</font></h1>

In this notebook we will do a collaborative filtering exercise using the **Recommender** function from the 'recommenderlab' package. For this exercise we will use datasets acquired from [GroupLens](http://grouplens.org/datasets/movielens/) that contain information about a list of movies, such as user ratings, user IDs, movie IDs, and movie titles.

<hr>

## Table of Contents
<div class="alert alert-block alert-info" style="margin-top: 20px">

<br>
- <p><a href="#ref0">Acquiring the data</a></p>
- <p><a href="#ref1">Preprocessing</a></p>
- <p><a href="#ref2">Collaborative Filtering</a></p>
- <p><a href="#ref3">Advantages and Disadvantages of Collaborative Filtering</a></p>
<p></p>
</div>
<hr>

<a id="ref0"></a>
## Download the Data

To download and store the data, simply run the following R cell:  
Dataset acquired from [GroupLens](http://grouplens.org/datasets/movielens/)

In [1]:
# rating dataset
download.file("https://ibm.box.com/shared/static/226cs3qlqylgkdiiqttjghwij0mfd6sp.csv", "/resources/data/ratings_cleaned.csv")
# movie dataset
download.file("https://ibm.box.com/shared/static/jj7hu6jsvdwtyw1q4n9gpfdnt4mtd9rc.csv", "/resources/data/movies_cleaned.csv")

## Load the Data

Let's begin by loading the data and storing them into dataframes named **'movies_df'** and **'ratings_df'**:

In [2]:
#Loading the movie information into a dataframe
movies_df <- read.csv('/resources/data/movies_cleaned.csv',  sep=",")

In [3]:
#Loading the rating information into a dataframe
ratings_df <- read.csv('/resources/data/ratings_cleaned.csv',  sep=",")

Let's have a look at the structure of these dataframes:

In [4]:
# write your code here

str(movies_df)
str(ratings_df)


'data.frame':	9125 obs. of  3 variables:
 $ movieId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ title  : Factor w/ 9123 levels "¡Three Amigos! (1986)",..: 8301 4319 3420 8648 2761 3591 6860 8253 7673 3287 ...
 $ genres : Factor w/ 902 levels "(no genres listed)",..: 329 394 687 646 596 242 687 377 2 124 ...
'data.frame':	100004 obs. of  4 variables:
 $ userId   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ movieId  : int  31 1029 1061 1129 1172 1263 1287 1293 1339 1343 ...
 $ rating   : num  2.5 3 3 2 4 2 2 2 3.5 2 ...
 $ timestamp: int  1260759144 1260759179 1260759182 1260759185 1260759205 1260759151 1260759187 1260759148 1260759125 1260759131 ...


<div align="right">
<a href="#p1" class="btn btn-default" data-toggle="collapse">Click here for the solution</a>
</div>
<div id="p1" class="collapse">
```
str(movies_df)
str(ratings_df)
```
</div>

## Format the data

Now we will clean the data. First, let's look at the first few rows of each dataframe to see what our data looks like:

In [5]:
head(movies_df)
head(ratings_df)

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller


userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
1,1061,3.0,1260759182
1,1129,2.0,1260759185
1,1172,4.0,1260759205
1,1263,2.0,1260759151


Let's remove the timestamp column from the ratings dataframe, since we won't need it for this exercise. Then, look at the first few rows of our final dataframes:

In [6]:
# write your code here

ratings_df$timestamp = NULL
head(movies_df)
head(ratings_df)


movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller


userId,movieId,rating
1,31,2.5
1,1029,3.0
1,1061,3.0
1,1129,2.0
1,1172,4.0
1,1263,2.0


<div align="right">
<a href="#p2" class="btn btn-default" data-toggle="collapse">Click here for the solution</a>
</div>
<div id="p2" class="collapse">
```
ratings_df$timestamp = NULL
head(movies_df)
head(ratings_df)
```
</div>

<a id="ref2"></a>
## User-Based Collaborative Filtering

User-based collaborative filtering (**UBCF**) is a collaborative filtering technique that makes recommendations using the similarity between users. The assumption is that users with similar preferences will rate items similarly. Thus, missing ratings for a user can be predicted by first finding a neighborhood of similar users and then aggregating the ratings of these users to form a prediction.


UBCF algorithms first identify the k most similar users (nearest neighbors) to the active user, using a similarity measure such as the Pearson correlation or cosine similarity. Each user is treated as a vector in the m-dimensional item space, and the similarity between the active user and each other user is computed between these vectors. After the most similar users have been found, their corresponding rows in the user-item matrix are aggregated to identify a set of items, L, rated by the group, together with each item's frequency. With the set L, UBCF techniques then recommend the top-n most frequent items in L that the active user has not rated. UBCF algorithms have limitations related to scalability and real-time performance.
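
The steps above can be sketched in a few lines of base R. This is a simplified illustration on a made-up 4 x 5 rating matrix, not the recommenderlab implementation: the helper names `pearson_sim` and `predict_ubcf` are hypothetical, and the sketch scores unrated items by the neighbors' plain average rating rather than by item frequency.

```r
# Toy user-item matrix: rows are users, columns are items, NA = not rated.
# All values are made up for illustration.
ratings <- matrix(
  c(5, 4, NA, 1, NA,
    4, 5,  3, 1,  2,
    5, 4,  4, 2,  1,
    1, 2,  5, 5,  4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("m", 1:5))
)

# Pearson correlation between two users over the items both have rated.
pearson_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  if (sum(both) < 2) return(NA_real_)
  cor(a[both], b[both])
}

# Predict the active user's missing ratings from the k most similar users.
predict_ubcf <- function(ratings, active, k = 2) {
  others <- setdiff(rownames(ratings), active)
  sims <- sapply(others, function(u) pearson_sim(ratings[active, ], ratings[u, ]))
  # sort() drops NA similarities, so only comparable users can be neighbors.
  ranked <- names(sort(sims, decreasing = TRUE))
  neighbours <- ranked[seq_len(min(k, length(ranked)))]
  unrated <- colnames(ratings)[is.na(ratings[active, ])]
  # Average the neighbors' ratings for each item the active user has not rated.
  preds <- colMeans(ratings[neighbours, unrated, drop = FALSE], na.rm = TRUE)
  sort(preds, decreasing = TRUE)
}

ubcf_preds <- predict_ubcf(ratings, "u1", k = 2)
print(ubcf_preds)
```

Here user `u4` rates in the opposite direction to `u1`, so it gets a negative similarity and is left out of the neighborhood when `k = 2`.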

To implement our recommender system, we will install and load the recommenderlab and Matrix packages:

In [7]:
# NOTE: installing these three packages takes a very long time, but it will complete.

install.packages("proxy")
install.packages("recommenderlab")
install.packages("Matrix")
library(recommenderlab)
library(Matrix)

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Loading required package: Matrix
Loading required package: arules

Attaching package: ‘arules’

The following objects are masked from ‘package:base’:

    abbreviate, write

Loading required package: proxy

Attaching package: ‘proxy’

The following object is masked from ‘package:Matrix’:

    as.matrix

The following objects are masked from ‘package:stats’:

    as.dist, dist

The following object is masked from ‘package:base’:

    as.matrix

Loading required package: registry


## Prepare the data

To use the recommenderlab package, our data will need to be converted to sparse format:

In [8]:
sparse_ratings <- sparseMatrix(i = ratings_df$userId, j = ratings_df$movieId, x = ratings_df$rating,
                              dimnames = list(paste("u", 1:max(ratings_df$userId), sep = ""), 
                                               paste("m", 1:max(ratings_df$movieId), sep = ""))) 
                               
sparse_ratings[1:10, 1:10]

dim(sparse_ratings)

   [[ suppressing 10 column names ‘m1’, ‘m2’, ‘m3’ ... ]]


10 x 10 sparse Matrix of class "dgCMatrix"
                       
u1  . . . . . . . . . .
u2  . . . . . . . . . 4
u3  . . . . . . . . . .
u4  . . . . . . . . . 4
u5  . . 4 . . . . . . .
u6  . . . . . . . . . .
u7  3 . . . . . . . . 3
u8  . . . . . . . . . .
u9  4 . . . . . . . . .
u10 . . . . . . . . . .

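We can see that our sparse matrix has 163949 columns (movies), far more than the 9066 movies that we have ratings for in our ratings dataframe. This happens because each movieId is used directly as a column index, so every unused ID up to the largest one becomes an empty column. However, this will not affect our results.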

NOTE: In our ratings dataframe, we have 9066 different movieIds with the largest value for a movieId being 163949. While in our movies dataframe, we have 9125 different movies with the largest value for a movieId being 164979.
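
If the empty columns were a concern, one option would be to map each movieId to a compact column index before building the matrix. The sketch below shows the idea on made-up data with base R only; it is an alternative to, not a description of, what this notebook does. The names `toy_ratings`, `movie_ids`, `col_index`, and `compact` are hypothetical, and a dense matrix is used purely for display.

```r
# Toy ratings with large, non-contiguous movie IDs (values are illustrative).
toy_ratings <- data.frame(
  userId  = c(1, 1, 2, 3, 3),
  movieId = c(31, 163949, 31, 1029, 163949),
  rating  = c(2.5, 4.0, 3.0, 5.0, 3.5)
)

# Map each distinct movieId to a compact column index 1..n_movies.
movie_ids <- sort(unique(toy_ratings$movieId))
col_index <- match(toy_ratings$movieId, movie_ids)

# Dense matrix for display only; the same row/column/value triplets could be
# passed to sparseMatrix() from the Matrix package instead.
n_users <- max(toy_ratings$userId)
compact <- matrix(NA_real_,
                  nrow = n_users,
                  ncol = length(movie_ids),
                  dimnames = list(paste0("u", seq_len(n_users)),
                                  paste0("m", movie_ids)))
compact[cbind(toy_ratings$userId, col_index)] <- toy_ratings$rating

print(compact)
print(dim(compact))
```

Because the column names still carry the original IDs (`m31`, `m1029`, `m163949`), recommendations could be mapped back to movie titles in the same way as with the full-width matrix.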

## Create recommender models

The function **Recommender** from the recommenderlab package works with a realRatingMatrix object, which we will create from our sparse matrix:

In [9]:
real_ratings <- new("realRatingMatrix", data = sparse_ratings)
real_ratings

671 x 163949 rating matrix of class ‘realRatingMatrix’ with 100004 ratings.

Now let's construct a recommender model using the **Recommender** function with the UBCF method, trained on the first 500 users of the real rating matrix. 

In [10]:
rmodel <- Recommender(real_ratings[1:500], method = "UBCF", param=list(normalize = "center", method = "Pearson"))
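
To give a feel for what the `normalize = "center"` setting refers to, here is a small base R sketch of mean-centering: each user's average rating is subtracted from that user's ratings, so that generous and strict raters become comparable. The matrix `toy` and its values are made up, and this is only an illustration of the idea, not recommenderlab's internal code.

```r
# Toy ratings: rows are users, NA = not rated (values are made up).
toy <- matrix(c(5,  3, NA,
                2, NA,  4),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("u1", "u2"), c("m1", "m2", "m3")))

# Centering subtracts each user's mean rating from that user's ratings.
user_means <- rowMeans(toy, na.rm = TRUE)
centered <- sweep(toy, 1, user_means, FUN = "-")

print(user_means)
print(centered)
```

After centering, a positive value means the user liked the movie more than their own average, and missing ratings stay `NA`.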

Let's make movie recommendations for users 501 and 502 from our data.

We can make recommendations for new users using the **predict** function and the recommender models we created. However, the recommendations will not be in list form so we will have to use the **as** function to display it as a list.

In [11]:
recom <- predict(rmodel, real_ratings[501:502],n=5)
lrecom <- as(recom, "list")
lrecom

As you can see our lists contain the recommendations as the corresponding movieIds prefixed with the letter 'm'.
To obtain a list of integer-valued movieIds we will have to perform some data transformation.

We can do this using a combination of the **as.numeric** and **sub** functions in R:

In [12]:
lr <- lapply(lrecom, function(x) as.numeric(sub("m","", x)))
lr
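
As a quick check of this transformation, the same idea can be tried on a hand-made list. The list `toy_recom` and its IDs are hypothetical stand-ins for `lrecom`. The sketch anchors the pattern with `^`, which is a slightly stricter variant that only strips a leading "m".

```r
# Hypothetical recommendation list in the same shape as `lrecom`.
toy_recom <- list(u501 = c("m260", "m1198"), u502 = c("m593", "m2028"))

# "^m" matches only an "m" at the start of each string.
toy_ids <- lapply(toy_recom, function(x) as.numeric(sub("^m", "", x)))
print(toy_ids)
```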

Let's have a look at the movies user 501 has rated.

In [13]:
user501=ratings_df$movieId[ratings_df$userId==501]
print("User501")
cat("\n")
for (i in user501){
    movie <- movies_df$title[movies_df$movieId==i]
    print (movie, max.levels=0)
}

[1] "User501"

[1] Toy Story (1995)
[1] Dead Man Walking (1995)
[1] Seven (a.k.a. Se7en) (1995)
[1] Usual Suspects, The (1995)
[1] Taxi Driver (1976)
[1] Pulp Fiction (1994)
[1] Shawshank Redemption, The (1994)
[1] Forrest Gump (1994)
[1] Lion King, The (1994)
[1] Mask, The (1994)
[1] Silence of the Lambs, The (1991)
[1] Pinocchio (1940)
[1] Fargo (1996)
[1] Wallace & Gromit: A Close Shave (1995)
[1] Trainspotting (1996)
[1] Godfather, The (1972)
[1] Reservoir Dogs (1992)
[1] E.T. the Extra-Terrestrial (1982)
[1] Monty Python and the Holy Grail (1975)
[1] Wallace & Gromit: The Wrong Trousers (1993)
[1] 12 Angry Men (1957)
[1] Clockwork Orange, A (1971)
[1] Goodfellas (1990)
[1] Godfather: Part II, The (1974)
[1] Full Metal Jacket (1987)
[1] Groundhog Day (1993)
[1] Men in Black (a.k.a. MIB) (1997)
[1] Truman Show, The (1998)
[1] Good Will Hunting (1997)
[1] Jackie Brown (1997)
[1] Big Lebowski, The (1998)
[1] Twilight (1998)
[1] Saving Private Ryan (1998)
[1] American History X (1998)


Now, let's see which movies are recommended for the user. For each movie we will show the title, the genres, the total number of ratings and the average rating.

In [14]:
print("User-based collaborative filtering")
cat("\n")
cat("\n")
cat("\n")

print("User 501 Recommendations:")
cat("\n")
cat("\n")

u501_recom <- lapply(lr[1], function(x) for (i in x){
                                movie <- movies_df$title[movies_df$movieId==i]
                                print (movie, max.levels=0)
                                genres <- movies_df$genres[movies_df$movieId==i]
                                cat("Genres: ", as.character(genres), "\n")
                                indices <- which(ratings_df$movieId==i, arr.ind=T)
                                cat("Total ratings: ", length(indices), "\n")
                                cat("Average rating: ",mean(ratings_df$rating[indices]), "\n")
                                cat("\n")
}
    )
u501_recom

[1] "User-based collaborative filtering"



[1] "User 501 Recommendations:"


[1] Willy Wonka & the Chocolate Factory (1971)
Genres:  Children|Comedy|Fantasy|Musical 
Total ratings:  148 
Average rating:  3.753378 

[1] Star Wars: Episode IV - A New Hope (1977)
Genres:  Action|Adventure|Sci-Fi 
Total ratings:  291 
Average rating:  4.221649 

[1] Who Framed Roger Rabbit? (1988)
Genres:  Adventure|Animation|Children|Comedy|Crime|Fantasy|Mystery 
Total ratings:  108 
Average rating:  3.666667 

[1] American Beauty (1999)
Genres:  Drama|Romance 
Total ratings:  220 
Average rating:  4.236364 

[1] Heat (1995)
Genres:  Action|Crime|Thriller 
Total ratings:  104 
Average rating:  3.884615 
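
The same per-movie summary (title, genres, total ratings, average rating) could also be collected into a single dataframe instead of being printed inside a loop. The sketch below uses base R on made-up stand-ins for `movies_df` and `ratings_df`; the names `toy_movies`, `toy_ratings`, `stats`, and `summary_df` are hypothetical.

```r
# Toy stand-ins for `movies_df` and `ratings_df` (values are made up).
toy_movies <- data.frame(
  movieId = c(1, 6, 260),
  title   = c("Toy Story (1995)", "Heat (1995)",
              "Star Wars: Episode IV - A New Hope (1977)"),
  genres  = c("Adventure|Animation|Children|Comedy|Fantasy",
              "Action|Crime|Thriller", "Action|Adventure|Sci-Fi"),
  stringsAsFactors = FALSE
)
toy_ratings <- data.frame(
  userId  = c(1, 2, 3, 1, 2),
  movieId = c(6, 6, 260, 260, 1),
  rating  = c(4.0, 3.0, 5.0, 4.0, 3.5)
)

# Count and average the ratings for every movie.
n_ratings  <- tapply(toy_ratings$rating, toy_ratings$movieId, length)
avg_rating <- tapply(toy_ratings$rating, toy_ratings$movieId, mean)
stats <- data.frame(movieId        = as.numeric(names(n_ratings)),
                    total_ratings  = as.vector(n_ratings),
                    average_rating = as.vector(avg_rating))

# Join with the movie details and keep the recommended IDs in their order.
recommended_ids <- c(260, 6)
summary_df <- merge(toy_movies, stats, by = "movieId")
summary_df <- summary_df[match(recommended_ids, summary_df$movieId), ]
print(summary_df, row.names = FALSE)
```

Keeping the result as a dataframe makes it easier to sort or filter the recommendations afterwards, for example by average rating.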



Now, let's do the same for user 502. 

Display the list of movies user 502 has rated:

In [15]:
# your code here

user502 <- ratings_df$movieId[ratings_df$userId==502]
print("User502")
cat("\n")
for (i in user502){
    movie <- movies_df$title[movies_df$movieId==i]
    print (movie, max.levels=0)
}


[1] "User502"

[1] Toy Story (1995)
[1] Heat (1995)
[1] Four Rooms (1995)
[1] Leaving Las Vegas (1995)
[1] Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
[1] Dead Man Walking (1995)
[1] Usual Suspects, The (1995)
[1] Mighty Aphrodite (1995)
[1] Mr. Holland's Opus (1995)
[1] French Twist (Gazon maudit) (1995)
[1] Crossing Guard, The (1995)
[1] Happy Gilmore (1996)
[1] Rumble in the Bronx (Hont faan kui) (1995)
[1] Flirting With Disaster (1996)
[1] Apollo 13 (1995)
[1] Smoke (1995)
[1] Strange Days (1995)
[1] Clerks (1994)
[1] Dumb & Dumber (Dumb and Dumber) (1994)
[1] Ed Wood (1994)
[1] Hoop Dreams (1994)
[1] Star Wars: Episode IV - A New Hope (1977)
[1] Pulp Fiction (1994)
[1] Shawshank Redemption, The (1994)
[1] What's Eating Gilbert Grape (1993)
[1] River Wild, The (1994)
[1] Bronx Tale, A (1993)
[1] Hudsucker Proxy, The (1994)
[1] Blade Runner (1982)
[1] True Romance (1993)
[1] Terminator 2: Judgment Day (1991)
[1] One Fine Day (1996)
[1] Fargo (1996)
[1] Mission: Impossible (1996)
[1] M

<div align="right">
<a href="#p3" class="btn btn-default" data-toggle="collapse">Click here for the solution</a>
</div>
<div id="p3" class="collapse">
```
user502 <- ratings_df$movieId[ratings_df$userId==502]
print("User502")
cat("\n")
for (i in user502){
    movie <- movies_df$title[movies_df$movieId==i]
    print (movie, max.levels=0)
}
```
</div>

Display the recommendations for user 502, similarly to what we did for user 501:

In [17]:
# your code here:

print("User-based collaborative filtering")
cat("\n")
cat("\n")
cat("\n")

print("User 502 Recommendations:")
cat("\n")
cat("\n")

u502_recom <- lapply(lr[2], function(x) for (i in x){
                                movie <- movies_df$title[movies_df$movieId==i]
                                print (movie, max.levels=0)
                                genres <- movies_df$genres[movies_df$movieId==i]
                                cat("Genres: ", as.character(genres), "\n")
                                indices <- which(ratings_df$movieId==i, arr.ind=T)
                                cat("Total ratings: ", length(indices), "\n")
                                cat("Average rating: ",mean(ratings_df$rating[indices]), "\n")
                                cat("\n")
}
    )
u502_recom


[1] "User-based collaborative filtering"



[1] "User 502 Recommendations:"


[1] Silence of the Lambs, The (1991)
Genres:  Crime|Horror|Thriller 
Total ratings:  304 
Average rating:  4.138158 

[1] Saving Private Ryan (1998)
Genres:  Action|Drama|War 
Total ratings:  191 
Average rating:  3.945026 

[1] Adaptation (2002)
Genres:  Comedy|Drama|Romance 
Total ratings:  42 
Average rating:  3.77381 

[1] Back to the Future (1985)
Genres:  Adventure|Comedy|Sci-Fi 
Total ratings:  226 
Average rating:  4.015487 

[1] Fugitive, The (1993)
Genres:  Thriller 
Total ratings:  213 
Average rating:  3.953052 



<div align="right">
<a href="#p4" class="btn btn-default" data-toggle="collapse">Click here for the solution</a>
</div>
<div id="p4" class="collapse">
```
print("User-based collaborative filtering")
cat("\n")
cat("\n")
cat("\n")

print("User 502 Recommendations:")
cat("\n")
cat("\n")

u502_recom <- lapply(lr[2], function(x) for (i in x){
                                movie <- movies_df$title[movies_df$movieId==i]
                                print (movie, max.levels=0)
                                genres <- movies_df$genres[movies_df$movieId==i]
                                cat("Genres: ", as.character(genres), "\n")
                                indices <- which(ratings_df$movieId==i, arr.ind=T)
                                cat("Total ratings: ", length(indices), "\n")
                                cat("Average rating: ",mean(ratings_df$rating[indices]), "\n")
                                cat("\n")
}
    )
u502_recom
```
</div>

## Want to learn more?

IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems, and by your enterprise as a whole. A free trial is available through this course: [SPSS Modeler for Mac users](https://cocl.us/ML0151EN_SPSSMod_mac) and [SPSS Modeler for Windows users](https://cocl.us/ML0151EN_SPSSMod_win)

Also, you can use Data Science Experience to run these notebooks faster with bigger datasets. Data Science Experience is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, DSX enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of DSX users today with a free account at [Data Science Experience](https://cocl.us/ML0151EN_DSX)

### Thank you for completing this exercise!

Notebook created by: Dominique Warren, <a href = "https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a>

## References
[Recommenderlab](https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf)

[Package 'recommenderlab'](https://cran.r-project.org/web/packages/recommenderlab/recommenderlab.pdf)

[Package 'Matrix'](https://cran.r-project.org/web/packages/Matrix/Matrix.pdf)

[Collaborative Filtering Recommender Systems](http://files.grouplens.org/papers/FnT%20CF%20Recsys%20Survey.pdf)

[R Documentation](https://cran.r-project.org/manuals.html)


<hr>
Copyright &copy; 2017 [IBM Cognitive Class](https://cocl.us/ML0151EN_cclab_cc). This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/).