
# Movie Recommendations

## Table of Contents

1. [Recommender Systems](#recsys)
2. [Content-based Recommendation](#content)
3. [Collaborative Filtering](#collab)


<a class="anchor" id="recsys"></a>

# 1. Recommender Systems 


Generally, a Recommender System (RS) aims to connect users with items (e.g., movies, videos, songs, products) from a large catalogue that are likely to interest them. To do so, an RS predicts a relevance score for each item in the catalogue by exploiting information about the user (e.g., the user's profile and preferences) and the item (e.g., content, specification, and characteristic features).

Online shopping sites like Amazon and streaming platforms such as Netflix or Spotify rely heavily on recommender systems to improve their business. You have probably come across product or movie recommendations; they are typically grouped in application or website sections such as "Because you watched..." or "Because you purchased...".

There are two basic types of recommender systems:

* **Content-based filtering**: Searches for a small set of items whose
    content best matches the user profile. In other words, a
    content-based RS focuses on the properties of items. It assumes
    that users might be interested in items whose content is similar to
    that of items they have consumed before.
* **Collaborative filtering**: Searches for a small set of users or items sharing similar preferences. It assumes that users who share similar preferences or have similar tastes might be interested in similar items (or that items which have similar appeal to many users can be suggested to users who like some, but not yet all of them). The former is also known as "user-to-user" collaborative filtering, while the latter is known as "item-to-item" collaborative filtering.

We will look into both types using movie recommendation as an example. 

<a class="anchor" id="content"></a>

# 2. Content-based Recommendation

The following picture shows the basic principle behind a content-based recommender system as explained above.

<div style="align: left; text-align:center;">
    <img src="img/contentbased.png" alt="Content-Based Recommender Systems" />
    <span style="display:block;">Schema of a Content-based Recommender System.</span>
    <br/>
</div>

The following sequence of steps to create recommendations is a typical "workflow" used in a data science application. Basically, we follow the steps shown in the above picture.

### Loading the data

In [1]:
# We use the Pandas library to read the dataset from a .csv file:

import pandas as pd

movie_dataset = pd.read_csv('Data//contentrecsys.csv')

In [2]:
movie_dataset

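|   | Movie | Adventure | Action | SciFi | Drama | Crime | Thriller | User1 | User2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Star Wars IV | 1 | 1 | 1 | 0 | 0 | 0 | 1 | -1 |
| 1 | Saving Private Ryan | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 2 | American Beauty | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | City of Gold | 0 | 0 | 0 | 1 | 1 | 0 | -1 | 1 |
| 4 | Interstellar | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 5 | The Matrix | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |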


We see that the dataset describes only six movies, each tagged with six genres, and contains the ratings of two users (the two rightmost columns).

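If the CSV file is not at hand, the table above can be rebuilt in memory. The following is a minimal sketch that copies the values from the printed output, under the assumption that the file contains exactly these columns. The lists `genre_columns` and `user_columns` are hypothetical names, shown only to illustrate selecting columns by name rather than by position.

```python
import pandas as pd

# Values copied from the table printed above (assumed to mirror the CSV file).
movie_dataset = pd.DataFrame({
    'Movie': ['Star Wars IV', 'Saving Private Ryan', 'American Beauty',
              'City of Gold', 'Interstellar', 'The Matrix'],
    'Adventure': [1, 0, 0, 0, 0, 1],
    'Action':    [1, 0, 0, 0, 0, 1],
    'SciFi':     [1, 0, 0, 0, 1, 1],
    'Drama':     [0, 1, 1, 1, 1, 0],
    'Crime':     [0, 1, 0, 1, 0, 0],
    'Thriller':  [0, 0, 0, 0, 0, 1],
    'User1':     [1, 0, 0, -1, 1, 0],
    'User2':     [-1, 0, 0, 1, 0, 1],
})

# Hypothetical helper lists: columns selected by name instead of by position.
genre_columns = ['Adventure', 'Action', 'SciFi', 'Drama', 'Crime', 'Thriller']
user_columns = [c for c in movie_dataset.columns if c.startswith('User')]

print(movie_dataset.shape)
print(user_columns)
```
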
### Splitting movie data from user data

Since the dataset contains both data about movies and users, we have to separate these two groups in order to make recommendations.

We start by selecting the seven columns that correspond to movie data, i.e., the movie title and the six genre columns:

In [3]:
columns_movie_profile = movie_dataset.columns[:7] # Selecting the first seven columns (title + six genres)

In [4]:
print(columns_movie_profile)

Index(['Movie', 'Adventure', 'Action', 'SciFi', 'Drama', 'Crime', 'Thriller'], dtype='object')


Next we select the columns corresponding to user data:

In [5]:
columns_user_history = movie_dataset.columns[7:] # Selecting the last two columns of the data frame

In [6]:
columns_user_history

Index(['User1', 'User2'], dtype='object')

### Creating a view

The next step is to materialize a view from the movie dataset containing just the user ratings.

In [7]:
user_history = movie_dataset[columns_user_history]

In [8]:
user_history

|   | User1 | User2 |
|---|---|---|
| 0 | 1 | -1 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | -1 | 1 |
| 4 | 1 | 0 |
| 5 | 0 | 1 |


### Querying a DataFrame

Pandas offers a variety of methods for finding and selecting specific data contained in a DataFrame.
Here, we use the **.query()** method to select movies that User1 has liked. For more on the use of the query function of Pandas, see https://towardsdatascience.com/10-examples-that-will-make-you-use-pandas-query-function-more-often-a8fb3e9361cb.

Note that we use **columns_movie_profile** to create a view where only information about movies is displayed.

In [9]:
user1_movies = movie_dataset.query('User1 == 1')[columns_movie_profile]

In [10]:
user1_movies

|   | Movie | Adventure | Action | SciFi | Drama | Crime | Thriller |
|---|---|---|---|---|---|---|---|
| 0 | Star Wars IV | 1 | 1 | 1 | 0 | 0 | 0 |
| 4 | Interstellar | 0 | 0 | 1 | 1 | 0 | 0 |

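For comparison, the same kind of selection can also be written with a boolean mask. The sketch below uses a small made-up frame (the name `toy` and its values are for illustration only) and checks that both forms return the same rows.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
toy = pd.DataFrame({
    'Movie': ['A', 'B', 'C'],
    'SciFi': [1, 0, 1],
    'User1': [1, 0, -1],
})

liked_query = toy.query('User1 == 1')   # condition passed as a string
liked_mask = toy[toy['User1'] == 1]     # equivalent boolean indexing

print(liked_query.equals(liked_mask))
```

Which form to use is largely a matter of taste; the string form tends to be easier to read when several conditions are combined.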

### Defining user profiles 

For making content-based recommendations, we need to define user profiles. For simplicity, we define a user profile here as the mean of the profiles of all movies that this user has liked. This can be done with the Pandas method .mean(); passing numeric_only=True skips the non-numeric Movie column:

In [11]:
user1_profile = user1_movies.mean(numeric_only=True) # User profile is the mean of the movie profiles


In [12]:
user1_profile

Adventure    0.5
Action       0.5
SciFi        1.0
Drama        0.5
Crime        0.0
Thriller     0.0
dtype: float64

In [13]:
# Next we do the same with User2:

user2_movies = movie_dataset.query('User2 == 1')[columns_movie_profile]

In [14]:
user2_profile = user2_movies.mean(numeric_only=True) # Skip the non-numeric Movie column


In [15]:
user2_profile

Adventure    0.5
Action       0.5
SciFi        0.5
Drama        0.5
Crime        0.5
Thriller     0.5
dtype: float64

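To make the averaging step concrete, here is a minimal sketch that recomputes User1's profile with NumPy from the genre vectors of the two movies User1 liked. The variable names are illustrative only.

```python
import numpy as np

# Genre vectors (Adventure, Action, SciFi, Drama, Crime, Thriller)
# of the two movies User1 liked, copied from the dataset shown earlier.
star_wars = np.array([1, 1, 1, 0, 0, 0])
interstellar = np.array([0, 0, 1, 1, 0, 0])

# The profile is the element-wise mean of the liked movies' vectors.
profile = np.mean([star_wars, interstellar], axis=0)
print(profile)
```

SciFi gets the value 1.0 because both liked movies carry that genre, while genres present in only one of them get 0.5.
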
### Selecting movies for recommendation

Next, we select the movies that are candidates for recommendation: we only consider movies that the target user has not watched yet. This can be done with the .query() method by selecting the movies for which the user's viewing history contains a zero.

In [16]:
movies_to_user1 = movie_dataset.query('User1 == 0')[columns_movie_profile].copy() # Copy, since we add a column to it later

In [17]:
movies_to_user1

|   | Movie | Adventure | Action | SciFi | Drama | Crime | Thriller |
|---|---|---|---|---|---|---|---|
| 1 | Saving Private Ryan | 0 | 0 | 0 | 1 | 1 | 0 |
| 2 | American Beauty | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | The Matrix | 1 | 1 | 1 | 0 | 0 | 1 |


### Defining the recommendation model

A content-based recommender system tries to find the best matches between user profiles and movie profiles. A simple way to achieve this is using a *similarity metric*. Here, we define our recommendation model using *cosine similarity*, which considers both user profiles and movie profiles as vectors and finds the cosine of the angle between these vectors.

Recall that objects represented by these vectors are "similar" if their corresponding vectors are "close" to each other, i.e., the angle between them is small. In other words, the smaller the angle between two vectors, the higher the similarity between the objects they represent. Let the two vectors be A and B and the angle between them $\theta$. Then, as discussed in the previous notebook, their similarity is the cosine of $\theta$, which is the scalar (dot) product of A and B divided by the product of their lengths: $\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}$.

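As a small illustration, the normalized dot product can be written directly in NumPy. The helper name `cosine` is hypothetical; the two vectors are User1's profile and the genre vector of The Matrix, both copied from the outputs in this notebook.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b (hypothetical helper)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

user1_vec = np.array([0.5, 0.5, 1.0, 0.5, 0.0, 0.0])  # User1's profile
matrix_vec = np.array([1, 1, 1, 0, 0, 1])              # The Matrix

print(round(cosine(user1_vec, matrix_vec), 4))  # approximately 0.7559
```

Note that this sketch does not guard against all-zero vectors, for which the denominator would be zero.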

We use scikit-learn's **cosine_similarity()** function to compute it:

In [18]:
from sklearn.metrics.pairwise import cosine_similarity 

Since the **cosine_similarity()** function expects two collections of vectors (2D arrays) as parameters, we need to reshape our user profile vector into an array with a single row:

In [22]:
user1_profile

Adventure    0.5
Action       0.5
SciFi        1.0
Drama        0.5
Crime        0.0
Thriller     0.0
dtype: float64

In [23]:
transformed_user1_profile = user1_profile.values.reshape(1,-1)

In [24]:
transformed_user1_profile

array([[0.5, 0.5, 1. , 0.5, 0. , 0. ]])

Now we can compute similarities:

In [25]:
similarities = cosine_similarity(movies_to_user1.select_dtypes(include='number'), transformed_user1_profile) # Compare genre columns only

In [26]:
print(similarities)

[[0.26726124]
 [0.37796447]
 [0.75592895]]


We add the similarity information to the movies_to_user1 DataFrame ...

In [27]:
movies_to_user1['Similarity'] = similarities

In [29]:
movies_to_user1

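|   | Movie | Adventure | Action | SciFi | Drama | Crime | Thriller | Similarity |
|---|---|---|---|---|---|---|---|---|
| 1 | Saving Private Ryan | 0 | 0 | 0 | 1 | 1 | 0 | 0.267261 |
| 2 | American Beauty | 0 | 0 | 0 | 1 | 0 | 0 | 0.377964 |
| 5 | The Matrix | 1 | 1 | 1 | 0 | 0 | 1 | 0.755929 |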


... and sort movies according to their similarities:

In [30]:
movies_to_user1.sort_values(by=['Similarity'], ascending=False)

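|   | Movie | Adventure | Action | SciFi | Drama | Crime | Thriller | Similarity |
|---|---|---|---|---|---|---|---|---|
| 5 | The Matrix | 1 | 1 | 1 | 0 | 0 | 1 | 0.755929 |
| 2 | American Beauty | 0 | 0 | 0 | 1 | 0 | 0 | 0.377964 |
| 1 | Saving Private Ryan | 0 | 0 | 0 | 1 | 1 | 0 | 0.267261 |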


As you can see, The Matrix is the movie to be recommended first to User 1.

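The individual steps above can be collected into one function. The following is a sketch only: the function name `recommend_by_content` and its parameters are hypothetical, and it assumes that a 1 marks a liked movie and a 0 an unseen one, as in this notebook. It is checked here against User1's result.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def recommend_by_content(df, user, feature_cols):
    """Rank unseen movies for `user` by similarity to the user's profile.

    Hypothetical helper: assumes 1 = liked and 0 = not watched yet.
    """
    liked = df.loc[df[user] == 1, feature_cols]
    profile = liked.mean().to_numpy().reshape(1, -1)   # one row, as sklearn expects
    candidates = df.loc[df[user] == 0, ['Movie'] + feature_cols].copy()
    candidates['Similarity'] = cosine_similarity(
        candidates[feature_cols].to_numpy(), profile).ravel()
    return candidates.sort_values('Similarity', ascending=False)


# Rows copied from the dataset shown earlier (User1 column only).
genres = ['Adventure', 'Action', 'SciFi', 'Drama', 'Crime', 'Thriller']
rows = [
    ['Star Wars IV', 1, 1, 1, 0, 0, 0, 1],
    ['Saving Private Ryan', 0, 0, 0, 1, 1, 0, 0],
    ['American Beauty', 0, 0, 0, 1, 0, 0, 0],
    ['City of Gold', 0, 0, 0, 1, 1, 0, -1],
    ['Interstellar', 0, 0, 1, 1, 0, 0, 1],
    ['The Matrix', 1, 1, 1, 0, 0, 1, 0],
]
data = pd.DataFrame(rows, columns=['Movie'] + genres + ['User1'])

ranked = recommend_by_content(data, 'User1', genres)
print(ranked[['Movie', 'Similarity']])
```
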
### Exercise

Follow the same process to find out which movies to recommend to User 2.

In [None]:
# Add your code here

<a class="anchor" id="collab"></a>

# 3. Collaborative Filtering

We now turn to the second category of recommender systems. The following picture shows the basic principle behind a recommender system that is based on collaborative filtering.


<div style="align: left; text-align:center;">
    <img src="img/collaborativefiltering.png" alt="Collaborative Filtering" />
    <span style="display:block;">Schema of a Recommender System based on Collaborative Filtering.</span>
    <br/>
</div>

As mentioned above, there are two "versions" of collaborative filtering: user-to-user and item-to-item. In the following, we use user-to-user collaborative filtering. The workflow is similar to what we have seen above; furthermore, user-to-user and item-to-item filtering essentially follow the same steps.

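To hint at how the item-to-item variant differs, the sketch below computes similarities between columns (items) instead of rows (users). The ratings matrix is made up for illustration, and all names (`toy_ratings`, `item_similarity`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up ratings (rows: users, columns: items); 0 means "not rated".
toy_ratings = pd.DataFrame(
    {'Item A': [5.0, 4.0, 0.0],
     'Item B': [5.0, 4.0, 1.0],
     'Item C': [0.0, 1.0, 5.0]},
    index=['u1', 'u2', 'u3'],
)

# Scale each item column to unit length, then take all pairwise dot products.
X = toy_ratings.to_numpy()
X_unit = X / np.linalg.norm(X, axis=0)
item_similarity = pd.DataFrame(X_unit.T @ X_unit,
                               index=toy_ratings.columns,
                               columns=toy_ratings.columns)

print(item_similarity.round(2))
```

In this toy example, Item A and Item B are rated almost identically by all users, so they end up far more similar to each other than either is to Item C.
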
### Loading the data

In [1]:
import pandas as pd

We use a different dataset than before which contains a bit more data.

In [2]:
ratings = pd.read_csv('Data//movie_ratings.csv')

In [3]:
ratings

|   | User | Lady in the Water | Snakes on a Plane | Just My Luck | Superman Returns | You, Me and Dupree | The Night Listener |
|---|---|---|---|---|---|---|---|
| 0 | Lisa Rose | 2.5 | 3.5 | 3.0 | 3.5 | 2.5 | 3.0 |
| 1 | Gene Seymour | 3.0 | 3.5 | 1.5 | 5.0 | 3.5 | 3.0 |
| 2 | Michael Phillips | 2.5 | 3.0 | 0.0 | 3.5 | 0.0 | 4.0 |
| 3 | Claudia Puig | 0.0 | 3.5 | 3.0 | 4.0 | 2.5 | 4.5 |
| 4 | Mick LaSalle | 3.0 | 4.0 | 2.0 | 3.0 | 2.0 | 3.0 |
| 5 | Jack Matthews | 3.0 | 4.0 | 0.0 | 5.0 | 3.5 | 3.0 |
| 6 | Toby | 0.0 | 4.5 | 0.0 | 4.0 | 1.0 | 0.0 |


### Selecting the 'current' user

We start by selecting the 'current' user as the one who will receive recommendations. Here, we select user **Toby** by querying the **ratings** DataFrame:

In [4]:
current_user_profile = ratings.query("User == 'Toby'").select_dtypes(include='number') # Keep only the rating columns

In [5]:
current_user_profile

|   | Lady in the Water | Snakes on a Plane | Just My Luck | Superman Returns | You, Me and Dupree | The Night Listener |
|---|---|---|---|---|---|---|
| 6 | 0.0 | 4.5 | 0.0 | 4.0 | 1.0 | 0.0 |


### Finding similar users

Since the central idea of Collaborative Filtering is that users who have shown similar taste in the past are likely to show similar taste in the future, we need to find users that are similar to current user Toby.

We again employ cosine similarity. First, we reshape the current user profile, as required by the pairwise version of the **cosine_similarity** function, and compute the similarity between every user and the current user:

In [6]:
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(ratings.select_dtypes(include='number'), current_user_profile.values.reshape(1,-1))

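Second, we add the similarity scores to our DataFrame. Before doing so, we check the shapes of the arrays involved: one row for the current user, seven rows of ratings, and therefore seven similarity scores.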

In [13]:
current_user_profile.values.reshape(1,-1).shape

(1, 6)

In [12]:
ratings.select_dtypes(include='number').shape

(7, 6)

In [14]:
similarities.shape

(7, 1)

In [41]:
ratings['Similarity'] = similarities

And we can sort the dataset using these similarity scores:

In [42]:
ratings.sort_values(by=['Similarity'], ascending=False)

|   | User | Lady in the Water | Snakes on a Plane | Just My Luck | Superman Returns | You, Me and Dupree | The Night Listener | Similarity |
|---|---|---|---|---|---|---|---|---|
| 6 | Toby | 0.0 | 4.5 | 0.0 | 4.0 | 1.0 | 0.0 | 1.0 |
| 5 | Jack Matthews | 3.0 | 4.0 | 0.0 | 5.0 | 3.5 | 3.0 | 0.806716 |
| 1 | Gene Seymour | 3.0 | 3.5 | 1.5 | 5.0 | 3.5 | 3.0 | 0.771527 |
| 4 | Mick LaSalle | 3.0 | 4.0 | 2.0 | 3.0 | 2.0 | 3.0 | 0.737255 |
| 0 | Lisa Rose | 2.5 | 3.5 | 3.0 | 3.5 | 2.5 | 3.0 | 0.715365 |
| 3 | Claudia Puig | 0.0 | 3.5 | 3.0 | 4.0 | 2.5 | 4.5 | 0.7051 |
| 2 | Michael Phillips | 2.5 | 3.0 | 0.0 | 3.5 | 0.0 | 4.0 | 0.687246 |


### Predicting ratings for unseen movies

To predict user ratings for unseen movies, we first need to find out which movies the current user has not watched yet (i.e., the movies whose rating is zero):

In [43]:
movies = ratings.columns[1:-1] # Select columns that refer to movie titles

In [44]:
movies

Index(['Lady in the Water', 'Snakes on a Plane', 'Just My Luck',
       'Superman Returns', 'You, Me and Dupree', 'The Night Listener'],
      dtype='object')

To facilitate our approach, we create a view of **current_user_profile** that contains only movie columns:

In [45]:
user_movie_view = current_user_profile[movies]

In [46]:
user_movie_view

|   | Lady in the Water | Snakes on a Plane | Just My Luck | Superman Returns | You, Me and Dupree | The Night Listener |
|---|---|---|---|---|---|---|
| 6 | 0.0 | 4.5 | 0.0 | 4.0 | 1.0 | 0.0 |


We use the view to filter out movies that already have a rating from the current user. The comparison with zero followed by the **all()** method yields a boolean mask that is true for every column containing only zeros, i.e., for every movie the current user has not rated. We apply this mask to the collection of movies:

In [47]:
unseen_movies = movies[(user_movie_view == 0).all()]

In [48]:
print(unseen_movies)

Index(['Lady in the Water', 'Just My Luck', 'The Night Listener'], dtype='object')

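An alternative sketch: when the user's ratings are held in a Series rather than a one-row DataFrame, the mask can be applied without **all()**. The values are copied from Toby's row above; the names `toby` and `unseen` are illustrative only.

```python
import pandas as pd

# Toby's ratings, copied from the table above (0.0 means "not rated").
toby = pd.Series({
    'Lady in the Water': 0.0, 'Snakes on a Plane': 4.5, 'Just My Luck': 0.0,
    'Superman Returns': 4.0, 'You, Me and Dupree': 1.0,
    'The Night Listener': 0.0,
})

# Keep the entries whose rating is zero and read off their labels.
unseen = toby[toby == 0].index
print(list(unseen))
```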

A common procedure to predict user ratings in Collaborative Filtering is
to use a weighted average score. This score takes into consideration not
only the rating given by other users, but also considers how similar
these users are with regard to the current user. Intuitively, this
approach gives more importance to the opinion of users that are more
similar to the current user.

Let's create a new DataFrame that will store the ratings of other users weighted by their similarity with the current user:

In [49]:
rec_scores = pd.DataFrame()

Since it does not make sense to use the current user's own ratings in this process, we exclude them from the data before filling **rec_scores**:

In [50]:
ratings_without_current_user = ratings.drop(index=6) # The index 6 corresponds to Toby in the original Data Frame

In [51]:
for mov in unseen_movies:
    rec_scores[mov] = ratings_without_current_user[mov] * ratings_without_current_user['Similarity']
    
rec_scores

|   | Lady in the Water | Just My Luck | The Night Listener |
|---|---|---|---|
| 0 | 1.788414 | 2.146096 | 2.146096 |
| 1 | 2.314582 | 1.157291 | 2.314582 |
| 2 | 1.718114 | 0.0 | 2.748983 |
| 3 | 0.0 | 2.115299 | 3.172949 |
| 4 | 2.211765 | 1.47451 | 2.211765 |
| 5 | 2.420147 | 0.0 | 2.420147 |


We also add the original similarity scores to **rec_scores**:

In [52]:
rec_scores['Similarity'] = ratings_without_current_user['Similarity']

The **rec_scores** DataFrame shows the rating of other users weighted by their similarity. Using these scores, we can provide recommendations in accordance with the following idea: A user who is more similar to Toby will contribute more to the overall score than a user who is different from him.

In [53]:
rec_scores

|   | Lady in the Water | Just My Luck | The Night Listener | Similarity |
|---|---|---|---|---|
| 0 | 1.788414 | 2.146096 | 2.146096 | 0.715365 |
| 1 | 2.314582 | 1.157291 | 2.314582 | 0.771527 |
| 2 | 1.718114 | 0.0 | 2.748983 | 0.687246 |
| 3 | 0.0 | 2.115299 | 3.172949 | 0.7051 |
| 4 | 2.211765 | 1.47451 | 2.211765 | 0.737255 |
| 5 | 2.420147 | 0.0 | 2.420147 | 0.806716 |


Now we can predict the rating for a movie not yet seen using the following formula:

$R = \frac{\sum_{u=1}^{n} R_u \cdot S_u}{\sum_{u=1}^{n} S_u}$,

where $R_u$ is the rating given by user $u$, $S_u$ is the similarity between $u$ and the current user, and both sums run over the $n$ users who have rated the movie.

Note that we divide the weighted ratings by the sum of the similarities of the users who rated the movie. This normalization prevents a movie from receiving a higher predicted score merely because more users have rated it.

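As a worked example of the weighted average, the sketch below computes the prediction for 'Lady in the Water' by hand. The ratings and similarity values are copied from the tables above; the variable names are illustrative only.

```python
import numpy as np

# Ratings of 'Lady in the Water' by the six other users and their
# similarity to Toby, copied from the tables above.
movie_ratings = np.array([2.5, 3.0, 2.5, 0.0, 3.0, 3.0])
sims = np.array([0.715365, 0.771527, 0.687246, 0.7051, 0.737255, 0.806716])

rated = movie_ratings > 0   # the user with rating 0.0 has not rated the movie
prediction = (movie_ratings[rated] * sims[rated]).sum() / sims[rated].sum()
print(round(prediction, 3))  # approximately 2.811
```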
We compute the predictions as follows. First, we create a variable that will store the predicted scores:

In [54]:
predictions = []

Second, we iterate through the unseen movies and compute their average weighted ratings. Note that by using the **non_zero_scores** we consider only users that have actually rated the movie:

In [55]:
for mov in unseen_movies:
    current_movie_scores = rec_scores[mov]
    non_zero_scores = current_movie_scores[current_movie_scores > 0] # Consider only users that rated this movie
    
    sum_similarities = sum(rec_scores['Similarity'].loc[non_zero_scores.index]) # Select by index label: only users that rated this movie
    
    pred_score = sum(non_zero_scores)/sum_similarities
    
    predictions.append((pred_score, mov))

In [56]:
print(predictions)

[(2.8113811174320773, 'Lady in the Water'), (2.3532311480364974, 'Just My Luck'), (3.394486289405419, 'The Night Listener')]

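The loop above can also be expressed without explicit iteration. The following is a sketch under the assumption that a rating of 0.0 means "not rated"; the names `others`, `sim`, and `predicted` are hypothetical, and the values are copied from the tables in this section.

```python
import pandas as pd

# Ratings of the other users for Toby's unseen movies and their similarity
# to Toby, copied from the tables above (0.0 means "not rated").
others = pd.DataFrame({
    'Lady in the Water':  [2.5, 3.0, 2.5, 0.0, 3.0, 3.0],
    'Just My Luck':       [3.0, 1.5, 0.0, 3.0, 2.0, 0.0],
    'The Night Listener': [3.0, 3.0, 4.0, 4.5, 3.0, 3.0],
})
sim = pd.Series([0.715365, 0.771527, 0.687246, 0.7051, 0.737255, 0.806716])

# Numerator: similarity-weighted ratings, summed per movie.
weighted_sum = others.mul(sim, axis=0).sum()
# Denominator: similarities of only those users who rated the movie.
sim_sum = (others > 0).astype(float).mul(sim, axis=0).sum()

predicted = weighted_sum / sim_sum
print(predicted.round(3))
```

Unrated entries contribute zero to the numerator and are masked out of the denominator, which mirrors the filtering done with **non_zero_scores** in the loop.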

### Making recommendations

To turn the predictions into recommendations, we simply rank the movies by their predicted scores. This shows which movies will be recommended to Toby, and in which order:

In [57]:
predictions.sort(reverse=True) # Highest predicted score first

In [58]:
predictions

[(3.394486289405419, 'The Night Listener'),
 (2.8113811174320773, 'Lady in the Water'),
 (2.3532311480364974, 'Just My Luck')]