In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

This notebook is from the DataCamp course "Building Recommendation Engines in Python." 
https://learn.datacamp.com

In [2]:
# load the User Ratings dataset
# this is a subset of the MovieLens dataset 
user_ratings_df = pd.read_csv('user_ratings.csv')
print(user_ratings_df.head())

   userId  movieId  rating   timestamp             title  \
0       1        1     4.0   964982703  Toy Story (1995)   
1       5        1     4.0   847434962  Toy Story (1995)   
2       7        1     4.5  1106635946  Toy Story (1995)   
3      15        1     2.5  1510577970  Toy Story (1995)   
4      17        1     4.5  1305696483  Toy Story (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1  Adventure|Animation|Children|Comedy|Fantasy  
2  Adventure|Animation|Children|Comedy|Fantasy  
3  Adventure|Animation|Children|Comedy|Fantasy  
4  Adventure|Animation|Children|Comedy|Fantasy  


## Non-personalized Recommendations 

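One approach is to recommend the most popular movie, which in this case is the movie that has been viewed by the most users. View counts like this are ***implicit data***: they record that a user interacted with a movie, not how the user felt about it.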

In [3]:
# Get the counts of occurrences of each movie title
movie_popularity = user_ratings_df["title"].value_counts()

# Inspect the most common values
print(movie_popularity.head().index)

Index(['Forrest Gump (1994)', 'Shawshank Redemption, The (1994)',
       'Pulp Fiction (1994)', 'Silence of the Lambs, The (1991)',
       'Matrix, The (1999)'],
      dtype='object')


Forrest Gump has been watched by 329 viewers and is the most viewed movie. 
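
Counting rows per title equals counting viewers only if each user rates a title at most once. The sketch below compares the two counts on a small made-up DataFrame; `toy_ratings` and its values are illustrative assumptions, not part of the MovieLens subset.

```python
import pandas as pd

# Hypothetical toy ratings standing in for user_ratings_df
toy_ratings = pd.DataFrame({
    "userId": [1, 2, 3, 1, 2, 1],
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie B", "Movie C"],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 5.0],
})

# Number of rows per title (what value_counts() reports)
rows_per_title = toy_ratings["title"].value_counts()

# Number of distinct users per title; matches rows_per_title
# whenever each user rates a given title at most once
users_per_title = (
    toy_ratings.groupby("title")["userId"]
    .nunique()
    .sort_values(ascending=False)
)

print(rows_per_title.head())
print(users_per_title.head())
```

If the two counts differ on the real data, some users have rated the same title more than once.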

## User Ratings  
A lot of users might watch a movie and really dislike it. The **ratings** in this dataset give us a source of **explicit data**. 

In [4]:
# Find the mean of the ratings given to each title
average_rating_df = user_ratings_df[["title", "rating"]].groupby('title').mean()

# Order the entries by highest average rating to lowest
sorted_average_ratings = average_rating_df.sort_values(by='rating', ascending=False)

# Inspect the top movies
print(sorted_average_ratings.head())

                                     rating
title                                      
Gena the Crocodile (1969)               5.0
True Stories (1986)                     5.0
Cosmic Scrat-tastrophe (2015)           5.0
Love and Pigeons (1985)                 5.0
Red Sorghum (Hong gao liang) (1987)     5.0


Even though this is a real-world dataset, the highest-rated movies are pretty obscure. A perfect 5.0 average often means a title has only a handful of ratings, so it is worth checking how many users have actually rated these movies.
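
One way to run that check is to compute the mean rating and the number of ratings per title side by side. This is a minimal sketch on made-up data; `toy_ratings` and the film names are assumptions for illustration, and the same `groupby` call should work on `user_ratings_df`.

```python
import pandas as pd

# Hypothetical toy ratings standing in for user_ratings_df
toy_ratings = pd.DataFrame({
    "userId": [1, 2, 3, 4, 1, 2],
    "title": ["Popular Film", "Popular Film", "Popular Film",
              "Popular Film", "Niche Film", "Other Film"],
    "rating": [4.0, 3.0, 5.0, 4.0, 5.0, 2.0],
})

# Mean rating and number of ratings for each title in one pass
rating_summary = (
    toy_ratings.groupby("title")["rating"]
    .agg(["mean", "count"])
    .sort_values(by="mean", ascending=False)
)

# The top-rated title here has a single rating behind its average
print(rating_summary.head())
```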

## Combining Popularity and Reviews 
Two of the most common methods of non-personalized recommendations are:

+ Most frequently watched
+ Most highly rated

We've seen the downsides of both. Recommending the most frequently watched movies ignores how people felt about them. Recommending the most highly rated movies can surface long-tail, niche items with very few ratings. Here we recommend highly rated movies, but only those that have been rated by more than 50 users.

In [6]:
# Create a list of only movies appearing > 50 times in the dataset
movie_popularity = user_ratings_df["title"].value_counts()
popular_movies = movie_popularity[movie_popularity > 50].index
print(popular_movies)

Index(['Forrest Gump (1994)', 'Shawshank Redemption, The (1994)',
       'Pulp Fiction (1994)', 'Silence of the Lambs, The (1991)',
       'Matrix, The (1999)', 'Star Wars: Episode IV - A New Hope (1977)',
       'Jurassic Park (1993)', 'Braveheart (1995)',
       'Terminator 2: Judgment Day (1991)', 'Schindler's List (1993)',
       ...
       'Grand Budapest Hotel, The (2014)', 'Caddyshack (1980)',
       'Grumpier Old Men (1995)', 'Training Day (2001)', 'Bad Boys (1995)',
       'Army of Darkness (1993)', 'The Devil's Advocate (1997)',
       'Mulholland Drive (2001)', 'Splash (1984)', 'Blow (2001)'],
      dtype='object', length=437)


In [7]:
# Use this popular_movies list to filter the original DataFrame
popular_movies_rankings = user_ratings_df[user_ratings_df["title"].isin(popular_movies)]

In [8]:
# Find the average rating given to these frequently watched films
popular_movies_average_rankings = popular_movies_rankings[["title", "rating"]].groupby('title').mean()
print(popular_movies_average_rankings.sort_values(by="rating", ascending=False).head())

                                                      rating
title                                                       
Shawshank Redemption, The (1994)                    4.429022
Godfather, The (1972)                               4.289062
Fight Club (1999)                                   4.272936
Cool Hand Luke (1967)                               4.271930
Dr. Strangelove or: How I Learned to Stop Worry...  4.268041


Now we have an easy way to make non-personalized recommendations based on both an item's average rating and how many users have rated it.
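
The steps above can be wrapped in a single helper so the popularity threshold is easy to vary. This is a sketch, not part of the course material: the function name `top_rated_popular` and its parameters `min_ratings` and `n` are hypothetical, and the toy data exists only to make the example runnable.

```python
import pandas as pd

def top_rated_popular(ratings_df, min_ratings=50, n=5):
    """Return the n highest mean ratings among titles with more than min_ratings ratings."""
    # Count how many ratings each title has received
    counts = ratings_df["title"].value_counts()
    # Keep only titles above the popularity threshold
    popular_titles = counts[counts > min_ratings].index
    popular = ratings_df[ratings_df["title"].isin(popular_titles)]
    # Average the ratings of the remaining titles and rank them
    return (
        popular.groupby("title")["rating"]
        .mean()
        .sort_values(ascending=False)
        .head(n)
    )

# Hypothetical toy data; Film C has a perfect score but only one rating
toy_ratings = pd.DataFrame({
    "title": ["Film A"] * 3 + ["Film B"] * 3 + ["Film C"],
    "rating": [4.0, 5.0, 4.5, 3.0, 3.5, 4.0, 5.0],
})

print(top_rated_popular(toy_ratings, min_ratings=2, n=2))
```

On the real data, calling `top_rated_popular(user_ratings_df)` with the default threshold should reproduce the ranking printed above, assuming the DataFrame has `title` and `rating` columns.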