Name: John Herrick \
Date: 5/20/2023

In [1]:
# Importing likely useful modules.

import pandas as pd
import numpy as np
import warnings
import sys
import os

In [2]:
# Initial data read of movie list.

df_movie = pd.read_csv(r"Movie_recommender_data\ml-latest-small\movies.csv")
df_movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
# Initial data read of ratings list.

df_ratings = pd.read_csv(r"Movie_recommender_data\ml-latest-small\ratings.csv")
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


After examining the contents of the tags data, I do not expect to use it. The number of distinct tags is too high for the column to be useful without extensive work.

In [4]:
# Initial data read of tags list.

df_tags = pd.read_csv(r"Movie_recommender_data\ml-latest-small\tags.csv")
df_tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [5]:
# Counting the distinct tags in the tag column; the count is almost half of the total number of rows.

print(len(set(df_tags['tag'])))

1589


I join the movie data and the ratings data together using the 'movieId' column. Now the ratings are tied to an actual movie title.

In [6]:
df_movie_ratings = df_movie.merge(df_ratings, on = 'movieId', how = 'left')
df_movie_ratings.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,964982700.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,847435000.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,1106636000.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,1510578000.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,1305696000.0
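
As a small illustration of what the left join does, here is a sketch on made-up rows rather than the MovieLens files (the names toy_movies, toy_ratings, and toy_merged are hypothetical). Every movie is kept, and a movie with no ratings gets NaN in the ratings columns. That is presumably also why 'userId' is displayed as a float (1.0, 5.0, ...) in the output above.

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (made-up rows, not real data).
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
})
toy_ratings = pd.DataFrame({
    "userId": [10, 11, 10],
    "movieId": [1, 1, 2],
    "rating": [4.0, 3.0, 5.0],
})

# A left join keeps every movie, including Movie C, which nobody rated.
toy_merged = toy_movies.merge(toy_ratings, on="movieId", how="left")
print(toy_merged)

# The unrated movie forces NaN into userId, so the column becomes float.
print(toy_merged["userId"].dtype)
```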


Creating a dataframe that collects the mean of the ratings associated with a given ID in the 'movieId' column. This will be useful for later attempts to correlate movies based on their average ratings.

In [7]:
# Mean rating per movieId, rounded to three decimals and stored in a column named 'avg_ratings'.
average_ratings = df_movie_ratings.groupby('movieId')['rating'].mean().round(3).to_frame('avg_ratings')
average_ratings.head()

movieId,avg_ratings
1,3.921
2,3.432
3,3.26
4,2.357
5,3.071
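
One detail worth keeping in mind: because the earlier merge was a left join, a movie without any ratings carries NaN in 'rating'. The sketch below, on made-up rows with the hypothetical names toy and toy_avg, shows that the grouped mean skips NaN and leaves such a movie with a NaN average instead of raising an error.

```python
import numpy as np
import pandas as pd

# Made-up merged rows; movieId 3 has no ratings, as after a left join.
toy = pd.DataFrame({
    "movieId": [1, 1, 2, 3],
    "rating": [4.0, 3.0, 5.0, np.nan],
})

# mean() skips NaN, so a movie with no ratings ends up with a NaN average.
toy_avg = toy.groupby("movieId")["rating"].mean().round(3).to_frame("avg_ratings")
print(toy_avg)
```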


Combining my average ratings with the df_movie_ratings dataframe. Now I have movie titles, ratings, and average ratings all in one dataframe.

In [8]:
df_movies_ratings_avg = df_movie_ratings.merge(average_ratings, on = 'movieId', how = 'left')
df_movies_ratings_avg.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp,avg_ratings
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,964982700.0,3.921
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,847435000.0,3.921
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,1106636000.0,3.921
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,1510578000.0,3.921
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,1305696000.0,3.921


Creating a dataframe that collects the total number of reviews associated with any given movie title. This will be useful in a later step where I filter out movies that have 100 or fewer total reviews. This ensures that the average ratings are based on a substantial amount of data and are therefore less subject to outliers biasing them upwards or downwards.

In [9]:
# Number of ratings per title, stored in a column named 'total_ratings'.
rating_count = df_movie_ratings.groupby('title')['rating'].count().to_frame('total_ratings')
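
To make the motivation for a minimum review count concrete, here is a rough sketch on made-up ratings. The names toy, stats, reliable, and MIN_RATINGS are hypothetical, and the threshold of 3 only stands in for the 100 used in this notebook. A title with a single perfect rating tops a ranking by average until the count filter is applied.

```python
import pandas as pd

# Made-up ratings: "Obscure Film" has a single perfect rating.
toy = pd.DataFrame({
    "title": ["Popular Film"] * 4 + ["Obscure Film"],
    "rating": [4.0, 4.5, 3.5, 4.0, 5.0],
})

# Named aggregation gives the average and the count in one pass.
stats = toy.groupby("title")["rating"].agg(avg_ratings="mean", total_ratings="count")

# Ranked by average alone, the single-rating film comes out on top.
ranked = stats.sort_values("avg_ratings", ascending=False)
print(ranked)

# A minimum-count filter (3 here, 100 in the notebook) removes it.
MIN_RATINGS = 3
reliable = stats[stats["total_ratings"] > MIN_RATINGS]
print(reliable)
```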

Combining my total ratings dataframe with my movies dataframe. This allows me to later examine correlations between movie titles while also filtering for a given number of total reviews.

In [10]:
df_movie = df_movie.merge(rating_count, on = 'title', how = 'left')
df_movie.head()

Unnamed: 0,movieId,title,genres,total_ratings
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215
1,2,Jumanji (1995),Adventure|Children|Fantasy,110
2,3,Grumpier Old Men (1995),Comedy|Romance,52
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,7
4,5,Father of the Bride Part II (1995),Comedy,49


I now arrange my data so that the movie titles are the columns and the users are the rows (indexed by 'userId'). Each value is the rating that user gave the given title, and is empty (NaN) where the user did not rate it. This pivoted table will be used in a future step to find the correlation of any user-entered movie (that is already in the data set) with all of the other movies in this dataframe.

In [11]:
movie_corrs = df_movie_ratings.pivot_table(index = 'userId', columns = 'title', values = 'rating')
movie_corrs.head()

userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1.0,,,,,,,,,,,...,,,,,,,,,4.0,
2.0,,,,,,,,,,,...,,,,,,,,,,
3.0,,,,,,,,,,,...,,,,,,,,,,
4.0,,,,,,,,,,,...,,,,,,,,,,
5.0,,,,,,,,,,,...,,,,,,,,,,
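
A minimal sketch of how such a pivoted table feeds a correlation, using made-up ratings (toy, toy_pivot, and toy_corr are hypothetical names). Each title's column of user ratings is compared with the chosen title's column using the Pearson correlation that corrwith computes by default.

```python
import pandas as pd

# Made-up long-format ratings: three users, three titles.
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "title": ["A", "B", "C"] * 3,
    "rating": [5.0, 4.0, 1.0, 3.0, 3.0, 3.0, 1.0, 2.0, 5.0],
})

# Rows are users, columns are titles, cells are that user's rating.
toy_pivot = toy.pivot_table(index="userId", columns="title", values="rating")

# Correlation of every title's column with title A's column.
toy_corr = toy_pivot.corrwith(toy_pivot["A"])
print(toy_corr.sort_values(ascending=False))
```

With only three users, title B correlates perfectly with title A, which hints at why correlations built on very few ratings should be treated with caution.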


I create a list of all the titles in the data set to allow for catching any incorrect entries in my final recommender program.

In [12]:
# unique() drops duplicate titles while keeping their original order.
movie_list = df_movie['title'].unique().tolist()
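
Since an entry must match a title exactly, including the year, one possible extension (not part of the program in this notebook) is to suggest near matches when an entry is rejected. The sketch below uses the standard library's difflib on a hypothetical three-title subset. The helper suggest_titles and its limit parameter are illustrative names, and the 0.6 cutoff is simply difflib's default.

```python
import difflib

# Hypothetical subset of titles standing in for movie_list.
toy_titles = ["Toy Story (1995)", "Toy Story 2 (1999)", "Jumanji (1995)"]

def suggest_titles(entry, titles, limit=3):
    """Return up to `limit` known titles that resemble the entry, best first."""
    return difflib.get_close_matches(entry, titles, n=limit, cutoff=0.6)

suggestions = suggest_titles("Toy Story", toy_titles)
print(suggestions)
```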

Finally, I create the requested recommender program. I have embedded it within two 'while' loops, which allow for a greeting when the program is initially run as well as checking for incorrect inputs at the prompt. This creates a simple program that accepts a single movie title, checks its correlation with the other titles in the data set based on how the same users rated them (keeping only titles with more than 100 reviews), and then returns the ten most correlated movies by title in a pleasing format. I have also included an option to quit the program, should that be desired.

In [13]:
# Creating a counter and a flag for later use.
time = 0
stop = 0

# Suppression of extraneous warnings.
if not sys.warnoptions:
    warnings.simplefilter("ignore")
    os.environ["PYTHONWARNINGS"] = "ignore"

# Outer while loop allows the user to quit prematurely, inner while loop allows for checking for incorrect inputs by the user.
while True:
    while True:

        # 'time' counter allows for a greeting on initial startup.
        if time == 0:
            mov = input("Hello and welcome to my movie recommender. Please enter a movie you like and this program will "
                        "return 10 recommendations to you based on your entry ('q' to quit): ")
            time += 1
        else:

            # Input prompt for the user.
            mov = input("Please enter a movie you like and this program will return 10 "
                        "recommendations to you based on your entry: ")

        # Option to quit if desired.
        if mov.lower() == 'q':
            print("This program will now stop.")
            stop = 1
            break

        # Check for incorrect input.
        elif mov not in movie_list:
            print("I'm sorry, your submission is not a recognized entry. Please try again.")
        else:
            break

    # Checks if user indicated desire to quit and stops program if so.
    if stop:
        break
    else:

        # Creating new, temporary dataframe for the correlation work.
        temp_df = df_movie.copy()

        # Checking correlation of entry with all other movies in data set.
        corr_values = movie_corrs.corrwith(movie_corrs[mov])

        # Creating a dataframe of the correlation values and then attaching it to the temporary dataframe.
        corr_values = pd.DataFrame(corr_values, columns = ['correlation'])
        corr_values.dropna(inplace = True)
        temp_df = temp_df.merge(corr_values, on = 'title', how = 'left')

        # Finding the top 10 correlations among titles with more than 100 ratings,
        # excluding the user's own entry (which correlates perfectly with itself).
        candidates = temp_df[(temp_df['total_ratings'] > 100) & (temp_df['title'] != mov)]
        top_10 = candidates.sort_values('correlation', ascending = False).head(10).reset_index(drop = True)

        # Presentation of results in a pleasing formatted string.
        print(f"Your top 10 recommended titles based on your selection are "
              f"{', '.join(top_10['title'][:-1])}, and {top_10['title'].iloc[-1]}.")
        break

Hello and welcome to my movie recommender. Please enter a movie you like and this program will return 10 recommendations to you based on your entry ('q' to quit): Toy Story
I'm sorry, your submission is not a recognized entry. Please try again.
Please enter a movie you like and this program will return 10 recommendations to you based on your entry: Toy Story (1995)
Your top 10 recommended titles based on your selection are Incredibles, The (2004), Finding Nemo (2003), Aladdin (1992), Monsters, Inc. (2001), Mrs. Doubtfire (1993), Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001), American Pie (1999), Die Hard: With a Vengeance (1995), E.T. the Extra-Terrestrial (1982), and Home Alone (1990).
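
The interactive program mixes prompting with the correlation step, which makes it awkward to test. As a sketch only, and not the program used above, the core step could be pulled into a function and exercised on made-up data. The function recommend and its parameters ratings_pivot, rating_counts, min_ratings, and top_n are hypothetical names. ratings_pivot plays the role of movie_corrs and rating_counts the role of the 'total_ratings' column.

```python
import pandas as pd

def recommend(title, ratings_pivot, rating_counts, min_ratings=100, top_n=10):
    """Return up to top_n titles whose user ratings correlate best with `title`.

    ratings_pivot: users x titles frame of ratings.
    rating_counts: Series giving the number of ratings per title.
    """
    if title not in ratings_pivot.columns:
        raise KeyError(f"Unknown title: {title!r}")
    corr = ratings_pivot.corrwith(ratings_pivot[title]).dropna()
    # Keep sufficiently rated titles and drop the query title itself.
    popular = rating_counts[rating_counts > min_ratings].index
    corr = corr[corr.index.isin(popular) & (corr.index != title)]
    return corr.sort_values(ascending=False).head(top_n).index.tolist()

# Made-up data: four users, three titles.
toy_pivot = pd.DataFrame(
    {"A": [5.0, 4.0, 2.0, 1.0], "B": [4.0, 5.0, 1.0, 2.0], "C": [1.0, 2.0, 4.0, 5.0]},
    index=pd.Index([1, 2, 3, 4], name="userId"),
)
toy_counts = toy_pivot.count()
toy_result = recommend("A", toy_pivot, toy_counts, min_ratings=3, top_n=2)
print(toy_result)
```

Excluding the query title explicitly, instead of skipping the first row of the sorted results, also covers the case where the entered movie itself has 100 or fewer ratings.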


Sources:

I used an online guide to creating recommenders. This particular guide was written by Amal Nair for the article "How To Build Your First Recommender System Using Python & MovieLens Dataset" for the website Analytics India Magazine (AIM). The article can be found at https://analyticsindiamag.com/how-to-build-your-first-recommender-system-using-python-movielens-dataset/, and was referenced by me on 5/20/2023.