
Data Science, Pandas Library, Python / By Lukas
What is a Recommendation System?
If you use Netflix or Amazon, you have already seen the results of recommendation systems: movie or item recommendations that fit your taste or needs. At its core, a recommendation system is a statistical algorithm that computes similarities based on previous choices or features and recommends to users which movie to watch or what else they might want to buy.

How Does a Recommendation System Work?
Assume that persons A and B both like movie M1, and person A also likes movie M2. We can then conclude that person B will probably like movie M2 as well. That is very little data, so the prediction is rather imprecise, and a real-world application would need much more data to make good recommendations. Still, it illustrates the idea: recommendation algorithms based on this concept are called collaborative filtering.
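The A/B example can be sketched in a few lines of plain Python. This is only a toy illustration of the idea, not the approach used later in the article; the dictionary likes and the helper recommend_for are names made up for this sketch.

```python
# Toy "likes" data from the example: persons A and B, movies M1 and M2.
likes = {
    "A": {"M1", "M2"},
    "B": {"M1"},
}

def recommend_for(user, likes):
    """Suggest movies liked by users who share at least one liked movie."""
    seen = likes[user]
    suggestions = set()
    for other, other_likes in likes.items():
        # Only consider other users whose taste overlaps with this user's
        if other != user and seen & other_likes:
            # Add the movies the other user liked that this user has not
            suggestions |= other_likes - seen
    return sorted(suggestions)

print(recommend_for("B", likes))  # ['M2']
```

Person A has nothing new to discover here, so recommend_for("A", likes) returns an empty list.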

Another popular way to recommend items is so-called content-based filtering. Content-based filtering computes recommendations based on the similarity of the items themselves. In the case of movies, we could compare features such as genre, actors, … to compute similarity.
If a user liked a given movie, the probability is high that the user will also like similar movies. Thus, it makes sense to recommend movies with a high similarity to those the user liked.
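As a rough sketch of content-based filtering, the snippet below compares movies by their genre lists only. The Jaccard similarity (shared genres divided by all genres) is an assumption chosen for simplicity; the article does not prescribe a specific similarity measure. The helpers jaccard and most_similar are hypothetical names, and the titles and genres are taken from the MovieLens sample shown later.

```python
# Genre strings in the MovieLens format (genres separated by "|")
genres = {
    "Ed Wood (1994)": "Comedy|Drama",
    "Barton Fink (1991)": "Drama|Thriller",
    "Child's Play 3 (1991)": "Comedy|Horror|Thriller",
    "Jackass: The Movie (2002)": "Action|Comedy|Documentary",
}

def jaccard(a, b):
    """Share of genres two movies have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b)

def most_similar(title, genres):
    """Rank all other movies by genre overlap with the given title."""
    target = set(genres[title].split("|"))
    scores = {
        other: jaccard(target, set(g.split("|")))
        for other, g in genres.items()
        if other != title
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = most_similar("Barton Fink (1991)", genres)
for other, score in ranking:
    print(f"{other}: {score:.2f}")
```

With these four movies, Ed Wood ranks first for Barton Fink because the two share the Drama genre and have few other genres between them.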

Implementing a Recommendation System
If you want to understand the code below better, make sure to sign up for our free email course “Introduction to Pandas and Data Science” on our Email Academy. Throughout the course, we develop a recommendation system for movies. At its core, there is the method corrwith() from the Pandas library.

This is the final implementation of our recommendation system:


Download the MovieLens data set from: https://grouplens.org/datasets/movielens/latest/
How to Use Pandas corrwith() Method?
The Pandas object DataFrame offers the method corrwith(), which computes pairwise correlations between two DataFrames or between a DataFrame and a Series. With the parameter axis, you can compute correlations along either the rows or the columns. Here is the signature; every parameter except other is optional and has a default value:

DataFrame.corrwith(other, axis=0, drop=False, method='pearson')


The arguments in detail:

1.) other: A Series or DataFrame with which to compute the correlation.

2.) axis: Pass 0 or ‘index’ (the default) to correlate each column of the DataFrame with other, or 1 or ‘columns’ to correlate each row.

3.) drop: Drop missing indices from result.

4.) method: The algorithm used to compute the correlation. Pass one of the strings ‘pearson’, ‘kendall’ or ‘spearman’, or pass a callable that implements your own algorithm.
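The following sketch exercises each of the four arguments on the small ratings table used in this article. The Series reference_user and the variable names are made up for illustration.

```python
import pandas as pd

all_ratings = pd.DataFrame({
    'Spider Man': [3.5, 1.0, 4.5, 5.0],
    'James Bond': [1.0, 2.5, 5.0, 4.0],
    'Titanic': [5.0, 4.5, 1.0, 2.0],
})
new_movie_ratings = pd.Series([2.0, 2.5, 5.0, 3.5])

# other + defaults (axis=0, method='pearson'): one value per column
pearson = all_ratings.corrwith(new_movie_ratings)

# method: rank-based Spearman correlation instead of Pearson
spearman = all_ratings.corrwith(new_movie_ratings, method='spearman')

# axis=1: one value per row; the Series index must match the column names
reference_user = pd.Series({'Spider Man': 5.0, 'James Bond': 4.0, 'Titanic': 1.0})
per_user = all_ratings.corrwith(reference_user, axis=1)

# drop: when other is a DataFrame that lacks a column, that column is
# NaN in the result unless drop=True removes it
other = all_ratings[['Spider Man', 'Titanic']]
kept = all_ratings.corrwith(other)
dropped = all_ratings.corrwith(other, drop=True)

print(pearson, spearman, per_user, kept, dropped, sep='\n\n')
```

Note that ‘kendall’ relies on SciPy being installed, which is why it is left out of this sketch.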



In [1]:
#Here is a practical example:

import pandas as pd

ratings = pd.read_csv(r'C:\Users\Administrateur\Downloads\ml-latest-small\ml-latest-small\ratings.csv')
movies = pd.read_csv(r'C:\Users\Administrateur\Downloads\ml-latest-small\ml-latest-small\movies.csv')
dftags = pd.read_csv(r'C:\Users\Administrateur\Downloads\ml-latest-small\ml-latest-small\tags.csv')
dflinks = pd.read_csv(r'C:\Users\Administrateur\Downloads\ml-latest-small\ml-latest-small\links.csv')

ratings = {
       'Spider Man':[3.5, 1.0, 4.5, 5.0],
       'James Bond':[1.0, 2.5, 5.0, 4.0],
       'Titanic':[5.0, 4.5, 1.0, 2.0] 
}

new_movie_ratings = pd.Series([2.0, 2.5, 5.0, 3.5])
all_ratings = pd.DataFrame(ratings)

print(all_ratings.corrwith(new_movie_ratings))



Spider Man    0.566394
James Bond    0.953910
Titanic      -0.962312
dtype: float64


In [2]:
import pandas as pd

ratings = pd.read_csv(r'C:\Users\Administrateur\Downloads\ml-latest-small\ml-latest-small\ratings.csv')
movies = pd.read_csv(r'C:\Users\Administrateur\Downloads\ml-latest-small\ml-latest-small\movies.csv')



In [3]:
ratings.head()

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


In [4]:
data = pd.merge(movies, ratings, on='movieId')
data.sample(7)

       movieId  title                                              genres                     userId  rating   timestamp
37636     1993  Child's Play 3 (1991)                              Comedy|Horror|Thriller        599     0.5  1498502795
24527     1193  One Flew Over the Cuckoo's Nest (1975)             Drama                         128     3.0   899033127
 6398      235  Ed Wood (1994)                                     Comedy|Drama                  273     5.0   835860827
 6776      253  Interview with the Vampire: The Vampire Chroni...  Drama|Horror                  602     3.0   840875621
68457     5785  Jackass: The Movie (2002)                          Action|Comedy|Documentary     414     4.0  1072058073
71087     6440  Barton Fink (1991)                                 Drama|Thriller                 23     5.0  1107341765
63798     4866  Last Castle, The (2001)                            Action                        608     4.0  1117520106


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   movieId    100836 non-null  int64  
 1   title      100836 non-null  object 
 2   genres     100836 non-null  object 
 3   userId     100836 non-null  int64  
 4   rating     100836 non-null  float64
 5   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 4.6+ MB


In [6]:
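def recommend_movies(similar_to):
    # Join titles onto the ratings, then build a user x title rating matrix
    data = pd.merge(movies, ratings, on='movieId')
    user_rating_pt = data.pivot_table(index='userId', columns='title', values='rating')

    # Correlate every movie's rating column with the ratings of the given movie
    # (uses the function argument instead of the global variable `movie`)
    target_ratings = user_rating_pt[similar_to]
    similar_movies = user_rating_pt.corrwith(target_ratings)
    corr = pd.DataFrame(similar_movies, columns=['Correlation'])

    # Attach the number of ratings each title received
    corr = corr.join(data.groupby('title')['rating'].count())
    corr.rename(columns={'rating': 'rating_count'}, inplace=True)

    # Keep only movies with more than 100 ratings, best matches first
    mask = corr['rating_count'] > 100
    recommendations = corr[mask].sort_values('Correlation', ascending=False)

    # Skip the first entry (assumed to be the queried movie itself,
    # with correlation 1.0) and return the next four titles
    return list(recommendations.iloc[1:5].index)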

movie = 'Toy Story (1995)'

rec_movies = ','.join(recommend_movies(movie))
print(f'You liked {movie}. You might also like: {rec_movies}')



You liked Toy Story (1995). You might also like: Incredibles, The (2004),Finding Nemo (2003),Aladdin (1992),Monsters, Inc. (2001)
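To try the merge, pivot and corrwith() steps without downloading MovieLens, the sketch below runs them on a tiny in-memory stand-in. The titles 'Movie A' to 'Movie C' and all ratings are invented for illustration, and the rating-count filter is left out because the table is so small.

```python
import pandas as pd

# Hypothetical miniature stand-in for movies.csv and ratings.csv
movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie C'],
})
ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'movieId': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'rating':  [5.0, 4.5, 1.0, 4.0, 4.0, 2.0, 1.0, 1.5, 5.0, 2.0, 2.5, 4.0],
})

# Same steps as in recommend_movies(): merge, pivot, correlate
data = pd.merge(movies, ratings, on='movieId')
pt = data.pivot_table(index='userId', columns='title', values='rating')

# Correlate every movie with 'Movie A', then drop 'Movie A' itself
corr = pt.corrwith(pt['Movie A']).drop('Movie A').sort_values(ascending=False)
print(pt)
print(corr)
```

Users who rated 'Movie A' highly also rated 'Movie B' highly, so 'Movie B' comes out on top, while 'Movie C' is rated in exactly the opposite way.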


In [7]:
df_rate = ratings.copy()

In [8]:
df_mov = movies.copy()

In [9]:
df_link = dflinks.copy()

In [10]:
df_tag = dftags.copy()

In the first example (the three-movie ratings dictionary), we create a DataFrame from a dictionary of lists (ratings). This DataFrame has three columns and four rows. Each column contains the ratings that four users gave to one movie.
The Series new_movie_ratings contains the same four users' ratings for a new movie.
Calling the method corrwith() on the DataFrame gives the correlation between the new movie's ratings and the ratings of each existing movie.
The output of that example is:

Spider Man    0.566394
James Bond    0.953910
Titanic      -0.962312
As you can see, the new movie has the highest correlation with the James Bond movie. This means that a recommendation system working purely on ratings should recommend the James Bond movie to users who liked the new movie.
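Turning these correlations into a recommendation is then a matter of ranking them. A minimal sketch, with the three values copied from the output above into a Series named correlations (a name chosen for this sketch):

```python
import pandas as pd

# Correlation values copied from the corrwith() output
correlations = pd.Series({
    'Spider Man': 0.566394,
    'James Bond': 0.953910,
    'Titanic': -0.962312,
})

# Highest correlation first
ranked = correlations.sort_values(ascending=False)

# Label of the single best match
best_match = correlations.idxmax()
print(best_match)  # James Bond
```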


What is Correlation?
Correlation describes the statistical relationship between two variables, that is, how they move in relation to one another. It is given as a value between -1 and +1. However, correlation is not causation!
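To make the number less abstract, here is a sketch that computes the Pearson correlation (the default method of corrwith()) by hand for the James Bond ratings and the new movie's ratings from the example, and compares it with the Pandas result. The helper function pearson is written for this sketch and is not part of Pandas.

```python
import math
import pandas as pd

james_bond = [1.0, 2.5, 5.0, 4.0]
new_movie = [2.0, 2.5, 5.0, 3.5]

def pearson(x, y):
    """Pearson correlation: co-variation divided by the product of spreads."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # How much x and y deviate from their means together
    co_variation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # How much each variable deviates from its own mean
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return co_variation / math.sqrt(spread_x * spread_y)

manual = pearson(james_bond, new_movie)
with_pandas = pd.Series(james_bond).corr(pd.Series(new_movie))
print(manual, with_pandas)  # both approximately 0.9539
```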

There are three types of correlation:

Positive correlation:
A positive correlation is a value in the range 0.0 < c <= 1.0. A correlation of 1.0 means a perfect positive linear relationship: whenever the first variable moves up, the second one moves up proportionally. The relationship is weaker the closer the correlation is to 0.

Negative correlation:
A negative correlation is a value in the range 0.0 > c >= -1.0. Negative correlation means that the two variables show opposite behaviour: if the first one moves up, the second one tends to move down.

Zero or no correlation:
A correlation of zero means there is no linear relationship between the two variables. If the first variable moves up, the second one may move in either direction.
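The three cases can be reproduced with a few made-up series. The data below is invented purely for illustration; the last case also shows that a Pearson correlation of zero only rules out a linear relationship, since y is fully determined by x there.

```python
import pandas as pd

x = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])

# Positive: y rises in lockstep with x
positive = x.corr(2 * x + 1)

# Negative: y falls as x rises
negative = x.corr(-x)

# Zero: y = x squared is symmetric around 0, so ups and downs cancel out
zero = x.corr(x ** 2)

print(positive, negative, zero)
```

The printed values are approximately 1.0, -1.0 and 0.0 (up to floating-point rounding).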