This is an example of a collaborative filtering approach to movie recommendations.  A dataset containing movies and their ratings is loaded, and then another dataset
containing a single person's reviews is passed in.  The code then finds the nearest neighbors to that user (i.e., people who like the same movies).
It then predicts movies that the user will like based on their neighbors' ratings.

# Create a virtual environment.  Optional, only needs to be done once.  MUST BE RUN IN THE TERMINAL
python -m venv .venv

# Activate the environment on macOS, WSL, or Linux
source .venv/bin/activate
# Activate the environment on Windows
.\.venv\Scripts\activate

In [126]:
# The next line only needs to be run once
%pip install lenskit
import lenskit.datasets as ds
import pandas as pd

# NOTE: if you change the dataset, be sure to also change the "minimum_to_include" value.
# For example, with the large dataset the minimum should probably be in the thousands or higher.
#base_data_path = "sample-data-small/"
base_data_path = "sample-data-large/"  # folder containing the larger set of user reviews

#data = ds.MovieLens('sample-data-small/')
# Use the next line for the large dataset (25 million ratings).  You can find this set at https://grouplens.org/datasets/movielens/25m/
data = ds.MovieLens(base_data_path)

ratings = data.ratings





MovieLens stores each rating as a row containing the user's ID number (the first few rows all appear to be ratings from user 1), the item's ID (in this case each ID is a different movie), the rating itself, and a timestamp.

Step 1.2

In [127]:
rows_to_show = 10   # <-- Change this number to show more or fewer rows
data.ratings.head(rows_to_show)  # <-- Try changing "ratings" to "movies", "tags", or "links" to see the kinds of data stored in the other MovieLens files

Unnamed: 0,user,item,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510
5,1,1088,4.0,1147868495
6,1,1175,3.5,1147868826
7,1,1217,3.5,1147878326
8,1,1237,5.0,1147868839
9,1,1250,4.0,1147868414


Pull in genre and title from the "movies" file

In [128]:
joined_data = data.ratings.join(data.movies['genres'], on='item')
joined_data = joined_data.join(data.movies['title'], on='item')
joined_data.head(rows_to_show)

Unnamed: 0,user,item,rating,timestamp,genres,title
0,1,296,5.0,1147880044,Comedy|Crime|Drama|Thriller,Pulp Fiction (1994)
1,1,306,3.5,1147868817,Drama,Three Colors: Red (Trois couleurs: Rouge) (1994)
2,1,307,5.0,1147868828,Drama,Three Colors: Blue (Trois couleurs: Bleu) (1993)
3,1,665,5.0,1147878820,Comedy|Drama|War,Underground (1995)
4,1,899,3.5,1147868510,Comedy|Musical|Romance,Singin' in the Rain (1952)
5,1,1088,4.0,1147868495,Drama|Musical|Romance,Dirty Dancing (1987)
6,1,1175,3.5,1147868826,Comedy|Drama|Romance,Delicatessen (1991)
7,1,1217,3.5,1147878326,Drama|War,Ran (1985)
8,1,1237,5.0,1147868839,Drama,"Seventh Seal, The (Sjunde inseglet, Det) (1957)"
9,1,1250,4.0,1147868414,Adventure|Drama|War,"Bridge on the River Kwai, The (1957)"
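
The two joins above can also be written as a single call, since `DataFrame.join` accepts a multi-column selection from the lookup frame. This is a minimal sketch on toy data, assuming (as the `on='item'` joins above imply) that the movies table is indexed by item ID. The frames and variable names here are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for data.ratings and data.movies (illustrative values only)
toy_ratings = pd.DataFrame({
    "user": [1, 1, 2],
    "item": [296, 306, 296],
    "rating": [5.0, 3.5, 4.0],
})
toy_movies = pd.DataFrame(
    {
        "title": ["Pulp Fiction (1994)", "Three Colors: Red (Trois couleurs: Rouge) (1994)"],
        "genres": ["Comedy|Crime|Drama|Thriller", "Drama"],
    },
    index=pd.Index([296, 306], name="item"),
)

# Join both lookup columns at once; on="item" matches the ratings column
# against the index of toy_movies
toy_joined = toy_ratings.join(toy_movies[["genres", "title"]], on="item")
print(toy_joined)
```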


The movies and ratings have now been loaded. 

***STEP 2***

Show a list of the top rated movies

**Step 2.1**

In [129]:
average_ratings = (data.ratings).groupby(['item']).mean()
sorted_avg_ratings = average_ratings.sort_values(by="rating", ascending=False)
joined_data = sorted_avg_ratings.join(data.movies['genres'], on='item')
joined_data = joined_data.join(data.movies['title'], on='item')
joined_data = joined_data[joined_data.columns[1:]]

print("RECOMMENDED FOR ANYBODY:")
joined_data.head(rows_to_show)

RECOMMENDED FOR ANYBODY:


item,rating,timestamp,genres,title
136782,5.0,1558738000.0,Thriller,The Girl is in Trouble (2015)
186119,5.0,1558738000.0,Children,A Gift Horse (2015)
137032,5.0,1436144000.0,Drama|Thriller,The Perfect Neighbor (2005)
184643,5.0,1558738000.0,Thriller,Relentless (2018)
137038,5.0,1436144000.0,Drama|Thriller,The Perfect Wife (2001)
197231,5.0,1558738000.0,Crime|Drama|Mystery|Romance|Thriller,The Harrow (2016)
184669,5.0,1558738000.0,Horror,Devil's Whisper (2017)
137048,5.0,1436145000.0,Drama,Perfect Child (2007)
137050,5.0,1436145000.0,Thriller,The Rival (2006)
137052,5.0,1436145000.0,Drama|Thriller,A Job to Kill For (2006)


Several of these movies are obscure.  Their perfect scores are likely the result of having very few reviews.
Add a "count" column to see how many reviews there are for each movie.

In [130]:
average_ratings = (data.ratings).groupby('item') \
       .agg(count=('user', 'size'), rating=('rating', 'mean')) \
       .reset_index()

sorted_avg_ratings = average_ratings.sort_values(by="rating", ascending=False)
joined_data = sorted_avg_ratings.join(data.movies['genres'], on='item')
joined_data = joined_data.join(data.movies['title'], on='item')
joined_data = joined_data[joined_data.columns[1:]]


print("RECOMMENDED FOR ANYBODY:")
joined_data.head(rows_to_show)

RECOMMENDED FOR ANYBODY:


Unnamed: 0,count,rating,genres,title
29523,1,5.0,Thriller,The Girl is in Trouble (2015)
49654,1,5.0,Children,A Gift Horse (2015)
29643,1,5.0,Drama|Thriller,The Perfect Neighbor (2005)
49041,1,5.0,Thriller,Relentless (2018)
29646,1,5.0,Drama|Thriller,The Perfect Wife (2001)
54556,1,5.0,Crime|Drama|Mystery|Romance|Thriller,The Harrow (2016)
49052,1,5.0,Horror,Devil's Whisper (2017)
29651,1,5.0,Drama,Perfect Child (2007)
29652,1,5.0,Thriller,The Rival (2006)
29653,1,5.0,Drama|Thriller,A Job to Kill For (2006)


We probably don't want to recommend movies that have only a handful of reviews, so the next step requires a minimum number of reviews.
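
To see why a minimum review count matters, here is a small sketch on made-up data. It computes the count and the mean in one `groupby(...).agg(...)` call and then applies a threshold. The frame, the threshold value, and the variable names are illustrative assumptions, not part of the MovieLens data.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "user":   [1, 2, 3, 4, 5, 1],
    "item":   [10, 10, 10, 10, 20, 30],
    "rating": [4.0, 5.0, 4.5, 4.5, 5.0, 2.0],
})

# One pass over the data: number of ratings and mean rating per item
stats = toy_ratings.groupby("item").agg(
    count=("rating", "size"),
    rating=("rating", "mean"),
)

# Sorting by mean alone puts the single-review item 20 on top
by_mean = stats.sort_values("rating", ascending=False)

# Requiring more than `toy_minimum` reviews keeps only item 10
toy_minimum = 3
popular = by_mean[by_mean["count"] > toy_minimum]

print(by_mean)
print(popular)
```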

**Step 2.2**

In [198]:
minimum_to_include = 50000 #<-- You can try changing this minimum to include movies rated by fewer or more people

average_ratings = (data.ratings).groupby(['item']).mean()
rating_counts = (data.ratings).groupby(['item']).count()
average_ratings = average_ratings.loc[rating_counts['rating'] > minimum_to_include]
sorted_avg_ratings = average_ratings.sort_values(by="rating", ascending=False)
joined_data = sorted_avg_ratings.join(data.movies['genres'], on='item')
joined_data = joined_data.join(data.movies['title'], on='item')
joined_data = joined_data[joined_data.columns[3:]]

print("RECOMMENDED FOR ANYBODY (minimum " + str(minimum_to_include) + " reviews ):")
joined_data.head(rows_to_show)

RECOMMENDED FOR ANYBODY (minimum 50000 reviews ):


item,genres,title
318,Crime|Drama,"Shawshank Redemption, The (1994)"
858,Crime|Drama,"Godfather, The (1972)"
50,Crime|Mystery|Thriller,"Usual Suspects, The (1995)"
527,Drama|War,Schindler's List (1993)
2959,Action|Crime|Drama|Thriller,Fight Club (1999)
296,Comedy|Crime|Drama|Thriller,Pulp Fiction (1994)
2571,Action|Sci-Fi|Thriller,"Matrix, The (1999)"
593,Crime|Horror|Thriller,"Silence of the Lambs, The (1991)"
1196,Action|Adventure|Sci-Fi,Star Wars: Episode V - The Empire Strikes Back...
1198,Action|Adventure,Raiders of the Lost Ark (Indiana Jones and the...


Let's show the top movies for someone looking for a specific genre.

In [182]:
genre = "Drama"
average_ratings = (data.ratings).groupby(['item']).mean()
rating_counts = (data.ratings).groupby(['item']).count()
average_ratings = average_ratings.loc[rating_counts['rating'] > minimum_to_include]
average_ratings = average_ratings.join(data.movies['genres'], on='item')
average_ratings = average_ratings.loc[average_ratings['genres'].str.contains(genre)]

sorted_avg_ratings = average_ratings.sort_values(by="rating", ascending=False)
joined_data = sorted_avg_ratings.join(data.movies['title'], on='item')
joined_data = joined_data[joined_data.columns[3:]]
print("RECOMMENDED FOR A " + genre.upper() + " MOVIE FAN (minimum " + str(minimum_to_include) + " reviews ):")
joined_data.head(rows_to_show)

RECOMMENDED FOR A DRAMA MOVIE FAN (minimum 50000 reviews ):


item,genres,title
318,Crime|Drama,"Shawshank Redemption, The (1994)"
858,Crime|Drama,"Godfather, The (1972)"
527,Drama|War,Schindler's List (1993)
2959,Action|Crime|Drama|Thriller,Fight Club (1999)
296,Comedy|Crime|Drama|Thriller,Pulp Fiction (1994)
2858,Drama|Romance,American Beauty (1999)
7153,Action|Adventure|Drama|Fantasy,"Lord of the Rings: The Return of the King, The..."
356,Comedy|Drama|Romance|War,Forrest Gump (1994)
110,Action|Drama|War,Braveheart (1995)
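
A note on the genre filter above: `Series.str.contains` interprets its pattern as a regular expression by default, and for rows where `genres` is missing it returns a missing value, which cannot be used directly as a boolean mask. The sketch below, on made-up rows, passes `regex=False` for a literal match and `na=False` so that missing genres count as non-matches. Whether the MovieLens data actually contains missing genres is not checked in this notebook, so treat this as a defensive variant rather than a required fix.

```python
import pandas as pd

# Made-up rows; the second movie has no genre information
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Crime|Drama", None, "Comedy"],
})

genre = "Drama"
# Literal substring match; rows with missing genres are treated as False
mask = toy["genres"].str.contains(genre, regex=False, na=False)
drama_only = toy.loc[mask]
print(drama_only)
```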





STEP 3

Step 3 personalizes our recommender system based on a user's reviews. The sample format can be found here: https://github.com/adamcodes716/movie-recommenders/sample-data-small/sample-movie-ratings/.csv

This data needs to be in a specific format (see here:  https://lkpy.lenskit.org/en/stable/interfaces.html#lenskit.algorithms.Recommender.recommend).
This script has been updated to use the "Rating10" column, which is on a 10-point scale ("base 10").  We need to divide by 2 to get a 5-point ("base 5") rating.

Step 3.1

My ratings are out of 10 and this exercise requires ratings out of 5.  We will perform some math using "Rating10" to get a base 5 rating.
I left the base 5 code, commented out, in the next cell in case your ratings file is already base 5.
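
The conversion described above amounts to validating each raw rating and rescaling it. Here is a minimal sketch using an in-memory CSV with the same `item` and `ratings` column names that the next cell reads. The helper name `to_five_point` and the sample rows are hypothetical.

```python
import csv
import io

def to_five_point(raw, scale_max=10):
    """Convert a raw rating string in the range (0, scale_max] to a 5-point rating.

    Returns None for blank, non-numeric, or out-of-range values.
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    if not 0 < value <= scale_max:
        return None
    return round(value * 5 / scale_max, 2)

# Made-up file contents: one valid rating, one blank, one out of range, one maximum
sample_csv = io.StringIO("item,ratings\n296,7\n306,\n307,12\n665,10\n")

toy_rating_dict = {}
for row in csv.DictReader(sample_csv):
    rating = to_five_point(row["ratings"])
    if rating is not None:
        toy_rating_dict[int(row["item"])] = rating

print(toy_rating_dict)
```

Unlike the check in the notebook cell (`< 11`), this sketch rejects anything above the scale maximum, so a value such as 10.5 is skipped. Passing `scale_max=5` leaves ratings that are already out of 5 unchanged.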

In [199]:
import csv

reviewer1_rating_dict = {}
reviewer2_rating_dict = {}  #optional, if you wanted to try to find a movie for two people using their reviews

path = base_data_path + "reviewer1-movie-ratings.csv"

with open(path, newline='') as csvfile:
    ratings_reader = csv.DictReader(csvfile)
    for row in ratings_reader:
        # Use this block if ratings are base 5
        # if ((row['ratings'] != "") and (float(row['ratings']) > 0) and (float(row['ratings']) < 6)):
        #     reviewer1_rating_dict.update({int(row['item']): float(row['ratings'])})
        # Use this block if ratings are base 10 (divide by 2 to get base 5)
        if ((row['ratings'] != "") and (float(row['ratings']) > 0) and (float(row['ratings']) < 11)):
            reviewer1_rating_dict.update({int(row['item']): round(float(row['ratings']) / 2, 2)})


    
    
# This section is only needed if you are trying to find a movie for 2 reviewers. They would be combined later into a single DataFrame
# path = base_data_path + "reviewer2-movie-ratings.csv"
# with open(path, newline='') as csvfile:
#     ratings_reader = csv.DictReader(csvfile)
#     for row in ratings_reader:
#         # Use this block if ratings are base 5
#         # if ((row['ratings'] != "") and (float(row['ratings']) > 0) and (float(row['ratings']) < 6)):
#         #     reviewer2_rating_dict.update({int(row['item']): float(row['ratings'])})
#         # Use this block if ratings are base 10
#         if ((row['ratings'] != "") and (float(row['ratings']) > 0) and (float(row['ratings']) < 11)):
#             reviewer2_rating_dict.update({int(row['item']): round(float(row['ratings']) / 2, 2)})



     
print("Rating dictionaries assembled!")
print("Sanity check:")
print("\tReviewer 1's rating for the first movie is " + str(list(reviewer1_rating_dict.values())[0]))

Rating dictionaries assembled!
Sanity check:
	Reviewer 1's rating for the first movie is 3.5


Step 4
We will use UserUser from LensKit to find reviewers (neighbors) who like the same movies that I like.
As part of this search, we will set a minimum and maximum number of neighbors.


In [184]:
from lenskit.algorithms import Recommender
from lenskit.algorithms.user_knn import UserUser

num_recs = 100  #<---- This is the number of recommendations to generate. You can change this if you want to see more recommendations

user_user = UserUser(15, min_nbrs=3)  # These two numbers set the maximum (15) and minimum (3) number of neighbors to consider. These are considered "reasonable defaults," but you can experiment with others too
algo = Recommender.adapt(user_user)
algo.fit(data.ratings)  # This essentially "trains" a user-user CF model.  The ratings data are stored in a format that is usable for computations
print("Set up a User-User algorithm!")

Set up a User-User algorithm!


Now that the model has been fit, we can feed in reviewer 1's movie reviews.  The User-User algorithm finds a neighborhood of users whose ratings are similar to reviewer 1's. It looks at movies that these similar users have rated but reviewer 1 has not. Based on the neighbors' ratings, it predicts how reviewer 1 would rate each of those movies. Finally, it sorts these predictions in descending order and returns the top recommendations.
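
To make the neighborhood idea concrete, here is a deliberately simplified sketch in plain Python. It is not LensKit's implementation. It scores each neighbor by cosine similarity (treating unrated items as zero) and predicts a rating as the similarity-weighted average of the nearest neighbors' ratings. All names and ratings below are made up.

```python
from math import sqrt

# Made-up ratings: user -> {item: rating}
toy_ratings = {
    "ann": {"A": 5.0, "B": 4.0, "C": 1.0},
    "bob": {"A": 4.5, "B": 4.0, "D": 5.0},
    "cat": {"A": 1.0, "C": 5.0, "D": 2.0},
}
me = {"A": 5.0, "B": 4.5}  # the new user's ratings

def cosine(u, v):
    """Cosine similarity between two rating dicts; unrated items count as 0."""
    dot = sum(u[i] * v[i] for i in set(u) & set(v))
    norm_u = sqrt(sum(r ** 2 for r in u.values()))
    norm_v = sqrt(sum(r ** 2 for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def predict(target, others, item, k=2):
    """Similarity-weighted average rating of `item` among the k most similar raters."""
    scored = [(cosine(target, r), r[item]) for r in others.values() if item in r]
    nearest = sorted(scored, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in nearest)
    if total_sim == 0:
        return None
    return sum(sim * rating for sim, rating in nearest) / total_sim

# "bob" rates A and B much like the new user, so his high rating for D dominates
prediction = predict(me, toy_ratings, "D")
print(round(prediction, 2))
```

The `k` argument plays the same role as the maximum neighbor count passed to `UserUser` above. This sketch has no counterpart to `min_nbrs`.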

**Step 4.2**

In [185]:
reviewer1_recs = algo.recommend(-1, num_recs, ratings=pd.Series(reviewer1_rating_dict))  # Here, -1 tells it that this is not an existing user in the set and that we're supplying new ratings, while num_recs is how many recommendations it should generate

joined_data = reviewer1_recs.join(data.movies['genres'], on='item')      
joined_data = joined_data.join(data.movies['title'], on='item')

joined_data = joined_data[joined_data.columns[2:]]
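# Note: these recommendations are not filtered by minimum_to_include; the label below only echoes its current value
print("\n\nRECOMMENDED FOR REVIEWER 1 (minimum " + str(minimum_to_include) + " reviews ):")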
joined_data



RECOMMENDED FOR REVIEWER 1 (minimum 50000 reviews ):


Unnamed: 0,genres,title
0,Drama,Emperor Tomato Ketchup (1971)
1,Documentary,Dreams with Sharp Teeth (2008)
2,Drama,Wetlands (2011)
3,Drama|War,Hitler: A Film from Germany (Hitler - ein Film...
4,Documentary,King Cohen: The Wild World of Filmmaker Larry ...
...,...,...
95,Animation,This Unnameable Little Broom (1985)
96,Documentary,45365 (2010)
97,Drama,Wives and Daughters (1999)
98,(no genres listed),AC/DC- Let There Be Rock (1980)


There are a fair number of documentaries in that list.  Don't get me wrong, I love documentaries.  But I am more interested in finding a good movie for Saturday night.  Let's add some filtering to include certain genres and/or exclude others.

**Step 4.3**

In [186]:
genre_to_contain = "" # "Drama|Comedy|(no genres listed)"  # add a value if you only want movies in a specific genre
genre_to_not_contain = "Documentary|Animation" # "Documentary|Horror|Animation"  # add a value if you do not want movies from a specific genre

# Same joins as in the previous cell
filtered_joined_data = reviewer1_recs.join(data.movies['genres'], on='item')
filtered_joined_data = filtered_joined_data.join(data.movies['title'], on='item')
filtered_joined_data = filtered_joined_data[filtered_joined_data.columns[2:]]
if genre_to_contain:  # Movies MUST contain at least one of these genres
    filtered_joined_data = filtered_joined_data.loc[filtered_joined_data['genres'].str.contains(genre_to_contain)]
if genre_to_not_contain:  # Movies must NOT contain any of these genres (an empty pattern would otherwise exclude every row)
    filtered_joined_data = filtered_joined_data.loc[~filtered_joined_data['genres'].str.contains(genre_to_not_contain)]
print("\n\nRECOMMENDED GENRE-SPECIFIC MOVIES FOR REVIEWER 1:")
filtered_joined_data




RECOMMENDED GENRE-SPECIFIC MOVIES FOR REVIEWER 1:


Unnamed: 0,genres,title
0,Drama,Emperor Tomato Ketchup (1971)
2,Drama,Wetlands (2011)
3,Drama|War,Hitler: A Film from Germany (Hitler - ein Film...
5,Drama,East Wind (Vent d'Est) (1993)
7,(no genres listed),We Always Lie to Strangers (2013)
8,Drama|Mystery|Thriller,Diamond Dust (2018)
9,Comedy|Drama|Fantasy,The House That Swift Built (1982)
10,Comedy,Assume the Position with Mr. Wuhl (2007)
12,Drama,Blackbird (2014)
13,(no genres listed),Stephen Fry Live: More Fool Me (2014)
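
The include/exclude logic above can be wrapped in a small helper. One detail worth guarding against: an empty pattern matches every string, so an empty exclude pattern combined with `~` would drop every row. This is a sketch on made-up rows. `filter_by_genre` is a hypothetical helper name, not part of LensKit or pandas.

```python
import pandas as pd

def filter_by_genre(df, include="", exclude=""):
    """Keep rows whose `genres` match `include` and do not match `exclude`.

    Both arguments are regular expressions (e.g. "Drama|Comedy");
    an empty string disables that filter.
    """
    genres = df["genres"].fillna("")
    keep = pd.Series(True, index=df.index)
    if include:
        keep &= genres.str.contains(include)
    if exclude:
        keep &= ~genres.str.contains(exclude)
    return df.loc[keep]

# Made-up recommendations
toy_recs = pd.DataFrame({
    "genres": ["Drama", "Documentary", "Comedy|Drama", "Animation"],
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
})

print(filter_by_genre(toy_recs, exclude="Documentary|Animation"))
print(filter_by_genre(toy_recs, include="Comedy"))
```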
