<a href="https://www.bigdatauniversity.com"><img src = "https://ibm.box.com/shared/static/cw2c7r3o20w9zn8gkecaeyjhgw3xdgbj.png" width = 400, align = "center"></a>

<h1 align=center><font size = 5>COLLABORATIVE FILTERING</font></h1>

Recommendation systems (sometimes called *recommender systems*) are a collection of algorithms used to recommend items to users based on information taken from the user. These systems have become ubiquitous and can commonly be seen in online stores, movie databases and job finders. In this notebook, we will explore recommendation systems based on Collaborative Filtering and implement a simple version of one using R.

<hr>

## Table of Contents
<div class="alert alert-block alert-info" style="margin-top: 20px">

<br>
- <p><a href="#ref0">Acquiring the data</a></p>
- <p><a href="#ref1">Preprocessing</a></p>
- <p><a href="#ref2">Collaborative Filtering</a></p>
- <p><a href="#ref4">Advantages and Disadvantages of Collaborative Filtering</a></p>
<p></p>
</div>
<hr>

<a id="ref0"></a>
## Acquiring the Data

To acquire the data, simply run the following R cells to download and store it. It might take a few minutes.  
Dataset acquired from [GroupLens](http://grouplens.org/datasets/movielens/)

In [None]:
# rating dataset
download.file("https://ibm.box.com/shared/static/q61myoukbyz969b97ddlcq0dny0l07bf.dat", "/resources/data/ratings.dat")

In [None]:
# movie dataset
download.file("https://ibm.box.com/shared/static/dn84btkn9gmxmdau32c5xb0vamie6jy4.dat", "/resources/data/movies.dat")

Now you're ready to start working with the data!
<a id="ref1"></a>
## Preprocessing

Let's begin by loading the data into their dataframes:

In [1]:
#Loading the movie information into a dataframe
movies_df <- read.csv('/resources/data/movies.dat', header = FALSE, sep=":")
# Head is a function that gets the first 6 rows of a dataframe
head(movies_df)

V1,V2,V3,V4,V5
1,,Toy Story (1995),,Animation|Children's|Comedy
2,,Jumanji (1995),,Adventure|Children's|Fantasy
3,,Grumpier Old Men (1995),,Comedy|Romance
4,,Waiting to Exhale (1995),,Comedy|Drama
5,,Father of the Bride Part II (1995),,Comedy
6,,Heat (1995),,Action|Crime|Thriller


In [1]:
#Loading the user information into a dataframe
ratings_df <- read.csv('/resources/data/ratings.dat', header = FALSE, sep=":")
# Alternatively, let's look at the first 20 rows of the dataframe
head(ratings_df, 20)

V1,V2,V3,V4,V5,V6,V7
1,,1193,,5,,978300760
1,,661,,3,,978302109
1,,914,,3,,978301968
1,,3408,,4,,978300275
1,,2355,,5,,978824291
1,,1197,,3,,978302268
1,,1287,,5,,978302039
1,,2804,,5,,978300719
1,,594,,4,,978302268
1,,919,,4,,978301368


You can see here that some issues arise when reading the data. Empty columns (such as V2 and V4) appear between the real fields, and movies that have a colon in the title generate additional columns, because the part of the title after the colon is read as a separate field. We will now run some code to deal with these issues.

Let's have a look at the raw data to see what may be causing the problem.

We will do this by using the function **readLines** to store the raw data and using the head function to preview it.

In [2]:
# Here we read the movies data again in the raw format and display the first few rows
lines <- readLines("/resources/data/movies.dat")
head(lines, 20)

It would appear that for each line of the data, the information that would go into each column is separated by a double colon (**::**) as opposed to the single colon (**:**) we used for our sep value in our *read.csv* function call. However, the read.csv function only allows us to use single characters for our field separator character (**sep**) value.

We can use the function gsub to replace the double colons (::) in our data with the symbol tilde (~).

In [3]:
# Here we replace the sep character used in the data ("::") with one that does not appear in the data ("~")
lines <- gsub("::", "~", lines)
head(lines, 20)
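
To see why the single-colon separator produced empty columns, here is a minimal sketch on two sample lines in the same `id::title::genres` layout. The names `sample_lines`, `single_colon`, `tilde_lines` and `clean` are illustrative and not part of the notebook.

```r
# Two sample lines in the same layout as movies.dat (illustrative only)
sample_lines <- c("1::Toy Story (1995)::Animation|Children's|Comedy",
                  "2::Jumanji (1995)::Adventure|Children's|Fantasy")

# Splitting on ":" treats "::" as two separators with an empty field in between
single_colon <- read.csv(text = sample_lines, sep = ":", header = FALSE)
ncol(single_colon)   # 5 columns, two of them empty

# Replacing "::" with "~" first gives the three intended columns
tilde_lines <- gsub("::", "~", sample_lines)
clean <- read.csv(text = tilde_lines, sep = "~", header = FALSE)
ncol(clean)          # 3 columns
```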

Let's redo the movies dataframe with our modified raw data.

In [4]:
# Now we recreate the movies dataframe using the updated data
movies_df <- read.csv(text=lines, sep="~", header = FALSE)
head(movies_df, 20)

V1,V2,V3
1,Toy Story (1995),Animation|Children's|Comedy
2,Jumanji (1995),Adventure|Children's|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children's
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller


So each movie has a unique ID, a title with its release year along with it and several different genres in the same field. Let's name the columns and then use R's handy "sub" function to clean any trailing whitespace from the title column. We keep the release year in the title, since we will look movies up by their full title later on.

In [5]:
names(movies_df)[names(movies_df)=="V1"] = "movieId"
names(movies_df)[names(movies_df)=="V2"] = "title"
names(movies_df)[names(movies_df)=="V3"] = "genres"
#Using sub to get rid of any trailing whitespace characters that may have appeared
movies_df$title = sub("\\s+$", "", movies_df$title)

Let's look at the result!

In [6]:
head(movies_df, 20)

movieId,title,genres
1,Toy Story (1995),Animation|Children's|Comedy
2,Jumanji (1995),Adventure|Children's|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children's
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller


With that, let's also drop the genres column since we won't need it for this particular recommendation system.

In [7]:
#Dropping the genres column
movies_df$genres = NULL


Here's the final movies dataframe:

In [8]:
# Display the first 20 rows
head(movies_df, 20)

movieId,title
1,Toy Story (1995)
2,Jumanji (1995)
3,Grumpier Old Men (1995)
4,Waiting to Exhale (1995)
5,Father of the Bride Part II (1995)
6,Heat (1995)
7,Sabrina (1995)
8,Tom and Huck (1995)
9,Sudden Death (1995)
10,GoldenEye (1995)


<br>

Next, let's look at the ratings dataframe.

In [9]:
head(ratings_df)

V1,V2,V3,V4,V5,V6,V7
1,,1193,,5,,978300760
1,,661,,3,,978302109
1,,914,,3,,978301968
1,,3408,,4,,978300275
1,,2355,,5,,978824291
1,,1197,,3,,978302268


Every row in the ratings dataframe has a user ID, the ID of one movie, a rating and a timestamp showing when the user reviewed it. Let's remove the empty columns, name the remaining columns accordingly and drop the timestamp column since we won't be using it for this type of recommendation.

In [None]:
# Removing the empty columns (V2, V4, V6) using the subset function.
# These columns were generated because the data is separated by "::" while the read.csv function only accepts single characters
# for the sep value, such as ":" or "~", so each "::" was read as two separators with an empty field between them.
ratings_df <- subset( ratings_df, select = -c(V2, V4, V6 ) )
head(ratings_df)

Let's name the columns in ratings_df as follows:
- V1 as userId
- V3 as movieId
- V5 as rating
- V7 as timestamp

Then remove the timestamp column.

In [None]:
names(ratings_df)[names(ratings_df)=="V1"] = "userId"
names(ratings_df)[names(ratings_df)=="V3"] = "movieId"
names(ratings_df)[names(ratings_df)=="V5"] = "rating"
names(ratings_df)[names(ratings_df)=="V7"] = "timestamp"
ratings_df$timestamp = NULL
# Here's what the final ratings dataframe looks like:
head(ratings_df)
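
The empty columns could also be avoided up front by applying the same `::`-to-`~` replacement used for the movies file. The following is a minimal sketch on three sample lines in the `userId::movieId::rating::timestamp` layout. The names `sample_ratings` and `ratings_alt` are illustrative and not part of the notebook, and the third line is made up.

```r
# Three sample lines in the same layout as ratings.dat (illustrative only)
sample_ratings <- c("1::1193::5::978300760",
                    "1::661::3::978302109",
                    "2::1193::4::978298413")

# Swap the two-character separator for a single character, then parse
ratings_alt <- read.csv(text = gsub("::", "~", sample_ratings), sep = "~",
                        header = FALSE,
                        col.names = c("userId", "movieId", "rating", "timestamp"))

# Drop the timestamp, as in the cell above
ratings_alt$timestamp <- NULL
ratings_alt
```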

<a id="ref2"></a>
## Collaborative Filtering

Now, time to start our work on recommendation systems. 

The first technique we're going to take a look at is called __Collaborative Filtering__, which is also known as __User-User Filtering__. As hinted by its alternate name, this technique uses other users' data to recommend items to the input user. It attempts to find users whose preferences and opinions are similar to the input user's and then recommends items that they have liked. There are several methods of finding similar users (some even making use of Machine Learning), and the one we will be using here is based on the __Pearson Correlation Coefficient__.

<img src="https://ibm.box.com/shared/static/qbubncyen4qqdn5idup8hpp3834igzwt.png" alt="Drawing" style="width: 700px;"/>


The process for creating a User Based recommendation system is as follows:
- Select a user along with the movies the user has watched
- Based on the user's movie ratings, find the top X neighbours
- Get the watched movie record of each neighbour
- Calculate a similarity score using some formula
- Recommend the items with the highest score


Let's begin by creating an input user to recommend movies to:

Notice: To add more movies, simply add more elements to inputUser. Feel free to add more! Just be sure to capitalize each title as it appears in the dataset, include the release year, and if a movie starts with "The", like "The Matrix", write it like this: 'Matrix, The (1999)'.

In [None]:
inputUser = data.frame("title"=c("Breakfast Club, The (1985)", "Toy Story (1995)", "Jumanji (1995)", "Pulp Fiction (1994)", "Akira (1988)"), 
                       "rating"=c(5, 3.5, 2, 5, 4.5))
head(inputUser)

#### Adding movieIds to the input user
With the input complete, let's extract the input movies' IDs from the movies dataframe and add them to it.

We can achieve this by first filtering out the rows that contain the input movies' titles and getting their IDs.

In [None]:
#Look up each input title in the movies dataframe; a title with no exact match gets NA
inputUser$movieId = as.character(movies_df$movieId[match(inputUser$title, movies_df$title)])
head(inputUser)

#### The users who have seen the same movies
With the movie IDs in our input, we can now get the subset of users who have watched and reviewed the movies that our input user has seen.

In [None]:
#Selecting the ratings of movies that the input user has watched and storing them
userSubset = ratings_df[ratings_df$movieId %in% inputUser$movieId,]
head(userSubset)

With every user extracted, let's sort them by the number of movies they have in common with the input and keep the first 100.

In [None]:
top100 <- head(sort(table(factor(userSubset$userId)), decreasing = TRUE), 100)

View the first 6 of these 100 users and the number of movies each has in common with our input user:

In [None]:
head(top100)
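
As a small illustration of how `table`, `sort` and `head` combine in the cell above, here is a sketch on a few made-up user IDs. The names `ids`, `counts` and `top2` are illustrative only.

```r
# Made-up user IDs, one entry per rating of an input movie (illustrative only)
ids <- c(7, 3, 7, 9, 3, 7, 5)

# table() counts occurrences per user; sort() orders the counts, largest first
counts <- sort(table(factor(ids)), decreasing = TRUE)

# Keep the two users with the most movies in common
top2 <- head(counts, 2)
names(top2)   # user IDs as character strings
```

Here user 7 appears three times and user 3 twice, so those two are kept.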

Now let's extract the userIDs from the table and transform it into a data frame to make it easier to subset the data later on.

In [None]:
userList <- as.data.frame.table(top100)
colnames(userList) <-  c("userId","commonMovies")
head(userList)

Now let's get the movies watched by these 100 users from the ratings dataframe and then create the userSubset data frame, using the merge function to add a Freq column that counts how many of these users rated each movie.

In [None]:
userSubset = ratings_df[ratings_df$userId %in% userList$userId,]
temp = as.data.frame(table(userSubset$movieId))
names(temp)[names(temp)=="Var1"] = "movieId"
userSubset = merge(temp, userSubset)

This is what our final userSubset dataframe looks like:

In [None]:
head(userSubset)
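
To make the counting-and-merging step easier to follow, here is a minimal sketch on made-up ratings from three neighbours. The names `toy_ratings`, `freq` and `toy_subset` are illustrative only.

```r
# Made-up ratings from three neighbours (illustrative only)
toy_ratings <- data.frame(userId  = c(1, 1, 2, 2, 3),
                          movieId = c(10, 20, 10, 30, 10),
                          rating  = c(4, 3, 5, 2, 4))

# Count how many of the selected users rated each movie
freq <- as.data.frame(table(toy_ratings$movieId))
names(freq)[names(freq) == "Var1"] <- "movieId"

# merge() joins on the shared column name (movieId) and adds Freq to every row
toy_subset <- merge(freq, toy_ratings)
toy_subset
```

Every row for movie 10 ends up with a `Freq` of 3, because all three neighbours rated it.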

Let's look at one of the users, e.g. the one with userID 533.

In [None]:
head(userSubset[userSubset$userId == 533,])

Now let's filter out the movies with 10 or fewer occurrences in our dataframe:

In [None]:
userSubset = userSubset[userSubset$Freq > 10,]
head(userSubset)

#### Similarity of users to input user
Next, we are going to compare the top users to our specified user and find the ones that are most similar.  
We're going to find out how similar each user is to the input user through the __Pearson Correlation Coefficient__. It is used to measure the strength of the linear association between two variables. The formula for finding this coefficient between sets X and Y with N values is given below.

Why Pearson Correlation?

Pearson correlation is invariant to scaling and shifting, i.e. multiplying all elements by a positive constant or adding any constant to all elements. For example, if you have two vectors X and Y, then pearson(X, Y) == pearson(X, 2 * Y + 3). This is an important property in recommendation systems because two users might rate the same series of items very differently in absolute terms, yet still be similar users (i.e. with similar tastes) whose ratings follow the same pattern on different scales.
<center>

$S_{xx} = \sum{x^2} - (\sum{x})^2/n$

$S_{yy} = \sum{y^2} - (\sum{y})^2/n$

$S_{xy} = \sum{xy} - (\sum{x})(\sum{y})/n$

$r =\frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$

</center>
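
The invariance property follows directly from these definitions. As a short worked derivation, replace $y$ with $y' = ay + b$, where $a$ and $b$ are arbitrary constants introduced here for illustration:

$$
\begin{aligned}
S_{xy'} &= \sum{x(ay+b)} - \frac{(\sum{x})(a\sum{y} + nb)}{n} = a\,S_{xy} \\
S_{y'y'} &= \sum{(ay+b)^2} - \frac{(a\sum{y} + nb)^2}{n} = a^2\,S_{yy} \\
r' &= \frac{a\,S_{xy}}{\sqrt{S_{xx}\,a^2\,S_{yy}}} = \frac{a}{|a|}\,r
\end{aligned}
$$

So $r' = r$ whenever $a > 0$, while a negative $a$ flips the sign of the coefficient.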

The values given by the formula range from r = -1 to r = 1, where 1 indicates a perfect positive correlation between the two entities and -1 indicates a perfect negative correlation.

In our case, a 1 means that the two users have similar tastes while a -1 means the opposite.
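
As a quick sanity check, the formulas can be evaluated by hand and compared with base R's `cor` function, which computes the Pearson coefficient by default. This is a minimal sketch on made-up ratings; the vectors `x` and `y` are illustrative only.

```r
# Made-up ratings for two users on the same five movies (illustrative only)
x <- c(5, 3.5, 2, 5, 4.5)
y <- c(4, 3, 1, 5, 4)
n <- length(x)

# The sums from the formulas above
Sxx <- sum(x^2) - sum(x)^2 / n
Syy <- sum(y^2) - sum(y)^2 / n
Sxy <- sum(x * y) - sum(x) * sum(y) / n
r <- Sxy / sqrt(Sxx * Syy)

# Base R's cor() computes the same Pearson coefficient
all.equal(r, cor(x, y))

# Rescaling one user's ratings with a positive factor leaves r unchanged
all.equal(cor(x, y), cor(x, 2 * y + 3))
```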

Now, we calculate the Pearson Correlation between the input user and each user in the subset group, and store it in a data frame that pairs each user ID with its coefficient.


In [None]:
pearson_df = data.frame("userId"=integer(), "similarityIndex"=double())
for (user in userList$userId)
{
    userRating = userSubset[userSubset$userId == user,]
    
    #Pair this user's ratings (x) with the input user's ratings (y) of the same movies.
    #Merging on movieId keeps the two rating vectors aligned movie by movie.
    moviesInCommon = merge(userRating[, c("movieId", "rating")], inputUser[, c("movieId", "rating")],
                           by = "movieId", suffixes = c("X", "Y"))
    moviesInCommon = moviesInCommon[complete.cases(moviesInCommon),]
    x = moviesInCommon$ratingX
    y = moviesInCommon$ratingY
    n = nrow(moviesInCommon)
    
    #Now let's calculate the pearson correlation between two users, so called, x and y
    Sxx = sum(x^2) - (sum(x)^2)/n
    Syy = sum(y^2) - (sum(y)^2)/n
    Sxy = sum(x*y) - (sum(x)*sum(y))/n
    
    #With fewer than two movies in common, or no variation in either set of ratings, the coefficient is undefined, so we use 0
    if(n < 2 || Sxx == 0 || Syy == 0)
    {
        pearsonCorrelation = 0
    }
    else
    {
        pearsonCorrelation = Sxy/sqrt(Sxx*Syy)
    }
    
    pearson_df = rbind(pearson_df, data.frame("userId"=user, "similarityIndex"=pearsonCorrelation))   
}

Here's a look at the similarity scores:

In [None]:
head(pearson_df)

#### The top x similar users to input user
Now let's get the top 50 users that are most similar to the input.
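
The notebook has no cell for this step, so here is one possible sketch. It assumes a data frame shaped like `pearson_df` (columns `userId` and `similarityIndex`). The toy values and the names `toy_pearson` and `topUsers` are illustrative, and the cells that follow continue to use the full `pearson_df`.

```r
# Toy similarity scores standing in for pearson_df (illustrative values)
toy_pearson <- data.frame(userId = c(11, 22, 33, 44),
                          similarityIndex = c(0.2, 0.9, -0.5, 0.7))

# Order by similarity, highest first, and keep the top rows
# (use 50 instead of 2 for the real data)
topUsers <- head(toy_pearson[order(-toy_pearson$similarityIndex), ], 2)
topUsers
```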

Now, let's start recommending movies to the input user.

#### Rating of selected users to all movies
We're going to do this by taking the weighted average of the ratings of the movies using the Pearson Correlation as the weight. But to do this, we first need to get the movies watched by the users in our __pearson_df__ from the ratings dataframe and then store their correlation in a new column called _similarityIndex_. This is achieved below by merging these two tables.

In [None]:
topUsersRating = merge(userSubset, pearson_df)
head(topUsersRating, 15)

Now all we need to do is simply multiply the movie rating by its weight (The similarity index), then sum up the new ratings and divide it by the sum of the weights.

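We can do this by multiplying two columns, then summing both the weighted ratings and the weights for each movieId and dividing one sum by the other:

In [None]:
#Multiplies the similarity by the user's ratings
topUsersRating$weightedRating = topUsersRating$similarityIndex*topUsersRating$rating
#Sums the weighted ratings and the weights per movie (same grouping, so the rows line up), then divides
weightedAverage_df = aggregate(topUsersRating$weightedRating, list(topUsersRating$movieId), sum)
sumSimilarity = aggregate(topUsersRating$similarityIndex, list(topUsersRating$movieId), sum)
weightedAverage_df$x = weightedAverage_df$x / sumSimilarity$x
head(weightedAverage_df)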
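
To make the arithmetic of a similarity-weighted average concrete, here is a small worked sketch on made-up numbers. The data frame `toy` and the name `sums` are illustrative only, and the formula interface of `aggregate` is used here simply for brevity.

```r
# Made-up neighbour ratings with their similarity to the input user (illustrative only)
toy <- data.frame(movieId         = c(10, 10, 20, 20),
                  rating          = c(4, 2, 5, 3),
                  similarityIndex = c(0.9, 0.1, 0.5, 0.5))

# Weight each rating by the neighbour's similarity
toy$weightedRating <- toy$similarityIndex * toy$rating

# Per movie: sum of weighted ratings divided by sum of weights
sums <- aggregate(cbind(weightedRating, similarityIndex) ~ movieId, data = toy, FUN = sum)
sums$weightedAverage <- sums$weightedRating / sums$similarityIndex
sums
```

Movie 10 scores 3.8 rather than the plain mean of 3, because the more similar neighbour's rating of 4 counts for more.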

Since we lose the column's names after doing so, we simply set them again in the next cell:

In [None]:
names(weightedAverage_df)[names(weightedAverage_df)=="Group.1"] = "movieId"
names(weightedAverage_df)[names(weightedAverage_df)=="x"] = "weightedAverage"
head(weightedAverage_df)

Now we merge the averages with the movies dataframe so we can get their titles.

In [None]:
recommendation_df = merge(weightedAverage_df, movies_df)

And then we finally sort it to see the top 20 movies that the algorithm recommended!

In [None]:
head(recommendation_df[order(-recommendation_df$weightedAverage),], 20)

<a id="ref4"></a>
## Advantages and Disadvantages of Collaborative Filtering

##### Advantages
* Takes other users' ratings into consideration
* Doesn't need to study or extract information from the recommended item
* Adapts to the user's interests which might change over time

##### Disadvantages
* Approximation function can be slow
* There might be too few users to approximate from
* Privacy issues when trying to learn the user's preferences

## Want to learn more?

IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course: [SPSS Modeler for Mac users](https://cocl.us/ML0151EN_SPSSMod_mac) and [SPSS Modeler for Windows users](https://cocl.us/ML0151EN_SPSSMod_win)

Also, you can use Data Science Experience to run these notebooks faster with bigger datasets. Data Science Experience is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, DSX enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of DSX users today with a free account at [Data Science Experience](https://cocl.us/ML0151EN_DSX)

### Thanks for completing this lesson!

Authors: Gabriel Garcez Barros Sousa, <a href = "https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a>

## References
[GroupLens Dataset](http://grouplens.org/datasets/movielens/)

[Collaborative Filtering Recommender Systems](http://files.grouplens.org/papers/FnT%20CF%20Recsys%20Survey.pdf)

[R Documentation](https://cran.r-project.org/manuals.html)


<hr>
Copyright &copy; 2017 [IBM Cognitive Class](https://cocl.us/ML0151_cclab). This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/).