In this project, I followed the steps in the Analytics India Magazine tutorial (https://analyticsindiamag.com/how-to-build-your-first-recommender-system-using-python-movielens-dataset/).

For this project, however, I made the recommender system depend on user input: a simple UserInput prompt asks the user for a movie, and the recommendations are built from that choice.

# 1. Loading the Data

In [2]:
# importing the necessary libraries
import numpy as np
import pandas as pd

### Loading movies.csv

Loading the movies.csv file, which contains the movie ID, title, and genres for each movie.

In [3]:
# loading the data into a dataframe
movies1 = pd.read_csv(r"C:\Users\eadam\Desktop\DSC 630\Homework\Week 9\movies.csv")

In [4]:
# checking the initial dataframe
movies1.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Renaming the columns in the movies dataframe to more professional-looking names.

In [5]:
# creating a dictionary to replace the column names to a better option
dictionary_movies = {'movieId':'MovieID', 'title':'Title', 'genres':'Genres'}

In [6]:
# replacing the column names with a more professional option
movies2 = movies1.rename(columns=dictionary_movies, inplace=False)
movies2.head()

Unnamed: 0,MovieID,Title,Genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
# counting how often each title appears in the movies dataframe
movies2['Title'].value_counts()

Confessions of a Dangerous Mind (2002)    2
Emma (1996)                               2
War of the Worlds (2005)                  2
Eros (2004)                               2
Saturn 3 (1980)                           2
                                         ..
Foxfire (1996)                            1
Bow, The (Hwal) (2005)                    1
Blockers (2018)                           1
Eddie Murphy Raw (1987)                   1
The Blue Planet (2001)                    1
Name: Title, Length: 9737, dtype: int64
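
The counts above show that a few titles appear more than once. A minimal sketch of how those rows could be listed is below. The `toy_movies` dataframe and its rows are made up for illustration and only stand in for movies2. This may matter in later steps, because any grouping, pivot, or merge keyed on Title treats rows that share a title as the same movie.

```python
import pandas as pd

# Toy stand-in for movies2; these rows are invented for illustration.
toy_movies = pd.DataFrame({
    "MovieID": [1, 2, 3, 4],
    "Title": ["Emma (1996)", "Emma (1996)", "Eros (2004)", "Up (2009)"],
    "Genres": ["Comedy|Drama|Romance", "Romance", "Drama", "Adventure|Animation"],
})

# keep=False flags every row whose Title occurs more than once,
# not just the second and later occurrences.
duplicate_titles = toy_movies[toy_movies.duplicated(subset="Title", keep=False)]
print(duplicate_titles)
```

On the real data, the same filter applied to movies2 would show whether the repeated titles carry different MovieID values.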

### Loading ratings.csv

Loading the ratings.csv file, which contains the user ID, movie ID, rating, and timestamp for each rating.

In [6]:
# loading the ratings csv
ratings1 = pd.read_csv(r"C:\Users\eadam\Desktop\DSC 630\Homework\Week 9\ratings.csv")

In [7]:
# checking the initial output
ratings1.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


Renaming the columns to more professional-looking names.

In [8]:
# replacing the column names
dictionary_ratings = {'userId':'UserID', 'movieId':'MovieID', 'rating':'Rating', 'timestamp':'Timestamp'}
ratings2 = ratings1.rename(columns=dictionary_ratings, inplace=False)
ratings2.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


# 2. Prepping the Data

Merging the two renamed dataframes, movies2 and ratings2, on the MovieID column.

In [10]:
# merging movies2 and ratings2
data1 = movies2.merge(ratings2, on='MovieID', how='left')
data1.head(2)

Unnamed: 0,MovieID,Title,Genres,UserID,Rating,Timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,964982703.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,847434962.0
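
The UserID and Timestamp columns above display as floats (1.0, 964982703.0). One likely reason, which is my assumption rather than something checked in this notebook, is that the left merge keeps movies that have no ratings and fills their rating columns with NaN, which forces integer columns to float. The sketch below shows that behavior on made-up toy dataframes and uses the `indicator` option of `merge` to find the unrated movies.

```python
import pandas as pd

# Toy stand-ins for movies2 and ratings2; all values are invented.
toy_movies = pd.DataFrame({"MovieID": [1, 2], "Title": ["A (2000)", "B (2001)"]})
toy_ratings = pd.DataFrame(
    {"UserID": [10, 11], "MovieID": [1, 1], "Rating": [4.0, 3.0]}
)

# how='left' keeps every movie; indicator=True adds a _merge column that
# says whether each row found a match in the ratings dataframe.
merged = toy_movies.merge(toy_ratings, on="MovieID", how="left", indicator=True)

# Movies with no ratings are marked 'left_only' and have NaN rating columns.
unrated = merged[merged["_merge"] == "left_only"]
print(merged)
print(merged["UserID"].dtype)
```

If unrated movies are not needed, `how='inner'` would leave them out of the merged dataframe.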


In [11]:
# creating a dataframe of the movie titles and genres
MovieDF = pd.DataFrame(data1, columns=['Title', 'Genres'])

# dropping duplicate values from the MovieDF dataframe
MovieDF1 = MovieDF.drop_duplicates()
MovieDF1

Unnamed: 0,Title,Genres
0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
215,Jumanji (1995),Adventure|Children|Fantasy
325,Grumpier Old Men (1995),Comedy|Romance
377,Waiting to Exhale (1995),Comedy|Drama|Romance
384,Father of the Bride Part II (1995),Comedy
...,...,...
100849,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
100850,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
100851,Flint (2017),Drama
100852,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


Because many users can rate the same movie, there are more ratings than movies. I group the ratings by title and compute the average rating and the total number of ratings for each movie.

In [13]:
# creating a dataframe of the average rating for each movie, plus the number of ratings
AverageRating = pd.DataFrame(data1.groupby('Title')['Rating'].mean())
AverageRating['Total Ratings'] = data1.groupby('Title')['Rating'].count()
AverageRating.head(10)

Title,Rating,Total Ratings
'71 (2014),4.0,1
'Hellboy': The Seeds of Creation (2004),4.0,1
'Round Midnight (1986),3.5,2
'Salem's Lot (2004),5.0,1
'Til There Was You (1997),4.0,2
'Tis the Season for Love (2015),1.5,1
"'burbs, The (1989)",3.176471,17
'night Mother (1986),3.0,1
(500) Days of Summer (2009),3.666667,42
*batteries not included (1987),3.285714,7
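
As an alternative sketch, the mean and the count can be computed in a single groupby pass. The `toy` dataframe below is made up for illustration, and `summary` is a hypothetical name. This is not the code used in this notebook.

```python
import pandas as pd

# Toy stand-in for the merged dataframe; values are invented.
toy = pd.DataFrame({
    "Title": ["A (2000)", "A (2000)", "B (2001)"],
    "Rating": [4.0, 3.0, 5.0],
})

# agg with a list computes both statistics at once, one column per statistic.
summary = toy.groupby("Title")["Rating"].agg(["mean", "count"])

# Renaming to match the column names used in this notebook.
summary = summary.rename(columns={"mean": "Rating", "count": "Total Ratings"})
print(summary)
```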


# 3. Building the Recommender

In the next step, I create a pivot table with one row per user and one column per movie title, where each cell holds that user's rating for that movie. This lets me compare how the same users rated different movies.

In [16]:
# creating a pivot table with users as rows, titles as columns, and ratings as values
PT_MovieWatcher = data1.pivot_table(index='UserID', columns='Title', values='Rating')
PT_MovieWatcher

UserID,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1.0,,,,,,,,,,,...,,,,,,,,,4.0,
2.0,,,,,,,,,,,...,,,,,,,,,,
3.0,,,,,,,,,,,...,,,,,,,,,,
4.0,,,,,,,,,,,...,,,,,,,,,,
5.0,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606.0,,,,,,,,,,,...,,,,,,,,,,
607.0,,,,,,,,,,,...,,,,,,,,,,
608.0,,,,,,,,,,,...,,,,,,4.5,3.5,,,
609.0,,,,,,,,,,,...,,,,,,,,,,
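
To make the reshaping concrete, here is a small sketch on made-up data. `toy_long` and `toy_wide` are hypothetical names. It also shows that `pivot_table` averages the values by default when the same user and title pair appears more than once, and leaves NaN where a user has not rated a movie.

```python
import pandas as pd

# Toy long-format ratings; user 2 has two rows for the same title.
toy_long = pd.DataFrame({
    "UserID": [1, 1, 2, 2],
    "Title": ["A (2000)", "B (2001)", "A (2000)", "A (2000)"],
    "Rating": [4.0, 3.0, 2.0, 4.0],
})

# One row per user, one column per title; the default aggfunc is the mean.
toy_wide = toy_long.pivot_table(index="UserID", columns="Title", values="Rating")
print(toy_wide)
```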


Below, I create a user input variable that lets the user enter a movie title. The correlation step then takes that title and measures how strongly the ratings of every other movie correlate with the ratings of the user's choice.

In [17]:
# creating a user input variable 
UserInput = input('Please enter a movie: ')

Please enter a movie: Star Trek (2009)
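
The title has to match a column name exactly (including the year), so a typo would lead to an error in the next step. A minimal sketch of a more forgiving lookup is below. `find_title`, `suggest_titles`, and `known_titles` are hypothetical names that are not part of this notebook, and the title list is made up.

```python
def find_title(query, titles):
    """Return the title equal to query, ignoring case and outer spaces, or None."""
    wanted = query.strip().casefold()
    for title in titles:
        if title.casefold() == wanted:
            return title
    return None


def suggest_titles(query, titles, limit=5):
    """Return up to `limit` titles that contain query, ignoring case."""
    wanted = query.strip().casefold()
    return [title for title in titles if wanted in title.casefold()][:limit]


# Made-up list standing in for the pivot table's column names.
known_titles = ["Star Trek (2009)", "Star Trek: Generations (1994)", "Up (2009)"]

match = find_title("  star trek (2009) ", known_titles)
suggestions = suggest_titles("star trek", known_titles)
print(match)
print(suggestions)
```

In the notebook, the list of titles could come from the columns of the pivot table, and the prompt could be repeated until `find_title` returns a match.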


In [18]:
# finding the correlation of every movie with the user-selected movie
# the output is ordered by title, not by correlation
correlations = PT_MovieWatcher.corrwith(PT_MovieWatcher[UserInput])
correlations



Title
'71 (2014)                                        NaN
'Hellboy': The Seeds of Creation (2004)           NaN
'Round Midnight (1986)                            NaN
'Salem's Lot (2004)                               NaN
'Til There Was You (1997)                         NaN
                                               ...   
eXistenZ (1999)                             -0.083624
xXx (2002)                                   0.258775
xXx: State of the Union (2005)               0.866025
¡Three Amigos! (1986)                        0.460977
À nous la liberté (Freedom for Us) (1931)         NaN
Length: 9719, dtype: float64
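
To illustrate what `corrwith` does here, the sketch below uses a made-up four-user pivot table (`toy_pivot` is a hypothetical name). Each column is correlated with the chosen column using only the users who rated both movies, and the result is NaN when no user rated both.

```python
import numpy as np
import pandas as pd

# Toy pivot table: rows are users, columns are movies, NaN means "not rated".
toy_pivot = pd.DataFrame({
    "A": [5.0, 4.0, 1.0, np.nan],
    "B": [5.0, 4.0, 1.0, 3.0],           # same ratings as A where both exist
    "C": [np.nan, np.nan, np.nan, 5.0],  # no user rated both A and C
    "D": [1.0, 2.0, 5.0, np.nan],        # opposite taste to A
})

# Pearson correlation of every column with column A, over shared raters only.
toy_corr = toy_pivot.corrwith(toy_pivot["A"])
print(toy_corr)
```

A correlation based on only a handful of shared raters can be extreme (exactly 1.0 or -1.0) without meaning much, which is one reason to also look at how many ratings each movie has.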

Next, I put the correlations into a dataframe. I drop any movie whose correlation is NaN, which means a correlation with the user-selected movie could not be computed (for example, because too few users rated both movies).

In [20]:
# putting the correlations into a dataframe called recommendation
# dropping movies whose correlation with the user-selected movie is NaN
recommendation = pd.DataFrame(correlations, columns=['Correlation'])
recommendation.dropna(inplace=True)
recommendation.head(10)

Title,Correlation
"'burbs, The (1989)",0.559259
(500) Days of Summer (2009),0.745959
*batteries not included (1987),0.944911
10 Cloverfield Lane (2016),0.493007
10 Things I Hate About You (1999),-0.190476
"10,000 BC (2008)",0.02739
101 Dalmatians (1996),0.278685
101 Dalmatians (One Hundred and One Dalmatians) (1961),0.504827
11:14 (2003),-1.0
12 Angry Men (1957),-0.125899


Next, I join the recommendation dataframe with the Total Ratings column of the AverageRating dataframe.

In [23]:
# merging recommendation dataframe with total ratings column from Average Rating dataframe
recommendation1 = recommendation.join(AverageRating['Total Ratings'])
recommendation1.head(10)

Title,Correlation,Total Ratings
"'burbs, The (1989)",0.559259,17
(500) Days of Summer (2009),0.745959,42
*batteries not included (1987),0.944911,7
10 Cloverfield Lane (2016),0.493007,14
10 Things I Hate About You (1999),-0.190476,54
"10,000 BC (2008)",0.02739,17
101 Dalmatians (1996),0.278685,47
101 Dalmatians (One Hundred and One Dalmatians) (1961),0.504827,44
11:14 (2003),-1.0,4
12 Angry Men (1957),-0.125899,57


Next, I keep only the movies that have more than 100 Total Ratings and sort them by correlation, highest first. I initially tried a threshold of 50 Total Ratings. However, this produced movies that, while highly correlated, did not seem like good follow-on movies to the user input (in this case, Star Trek (2009)).

In [35]:
# keeping movies with more than 100 ratings and sorting by correlation, highest first
recc = recommendation1[recommendation1['Total Ratings']>100].sort_values('Correlation', ascending=False).reset_index()

In [36]:
# merging the recc dataframe with the movies2 dataframe to include Movie ID and Genres in the output for the user
Rec_verification = recc.merge(movies2, on='Title', how='left')
Rec_verification.head(10)

Unnamed: 0,Title,Correlation,Total Ratings,MovieID,Genres
0,Star Wars: Episode VI - Return of the Jedi (1983),0.698083,196,1210,Action|Adventure|Sci-Fi
1,Harry Potter and the Chamber of Secrets (2002),0.662398,102,5816,Adventure|Fantasy
2,Star Wars: Episode I - The Phantom Menace (1999),0.656503,140,2628,Action|Adventure|Sci-Fi
3,"Lord of the Rings: The Return of the King, The...",0.641521,185,7153,Action|Adventure|Drama|Fantasy
4,"Fugitive, The (1993)",0.611124,190,457,Thriller
5,Up (2009),0.564032,105,68954,Adventure|Animation|Children|Drama
6,Twister (1996),0.551673,123,736,Action|Adventure|Romance|Thriller
7,"Bourne Identity, The (2002)",0.549123,112,5418,Action|Mystery|Thriller
8,"Usual Suspects, The (1995)",0.533401,204,50,Crime|Mystery|Thriller
9,Batman Begins (2005),0.523335,116,33794,Action|Crime|IMAX
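
The steps in this section could be wrapped in one function so that different thresholds are easy to compare. The sketch below is an illustration on made-up data, not the code used above. `recommend`, `min_ratings`, and `top_n` are hypothetical names, and it assumes that counting the non-missing values in each pivot column gives the same totals as the Total Ratings column.

```python
import numpy as np
import pandas as pd


def recommend(pivot, title, min_ratings=100, top_n=10):
    """Rank movies by correlation with `title`, keeping well-rated ones only."""
    # Correlation of every movie with the chosen movie; NaN results dropped.
    correlations = pivot.corrwith(pivot[title]).dropna()
    # Number of non-missing ratings in each column (assumed to match Total Ratings).
    counts = pivot.count()
    result = pd.DataFrame({"Correlation": correlations, "Total Ratings": counts})
    result = result.dropna(subset=["Correlation"])
    # The chosen movie always correlates perfectly with itself, so leave it out.
    result = result.drop(index=title, errors="ignore")
    result = result[result["Total Ratings"] > min_ratings]
    return result.sort_values("Correlation", ascending=False).head(top_n)


# Toy pivot table: rows are users, columns are movies, NaN means "not rated".
toy_pivot = pd.DataFrame({
    "A": [5.0, 4.0, 1.0, 2.0],
    "B": [5.0, 4.0, 1.0, np.nan],
    "C": [1.0, 2.0, 5.0, 4.0],
    "D": [np.nan, 5.0, np.nan, 3.0],
})

recs = recommend(toy_pivot, "A", min_ratings=2, top_n=10)
print(recs)
```

Calling the function with `min_ratings=50` and `min_ratings=100` on the real pivot table would make it easy to compare the two thresholds side by side.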


# Conclusion

In this project, I did not use a package designed specifically for building recommender systems. I found it very valuable to be able to build this type of system with general-purpose packages, such as pandas, that most Python programmers already know. Using this system, I was able to find 10 movies to recommend based on a user-selected movie. I found that requiring more ratings per movie, even at the cost of somewhat lower correlations, gave better recommendations. When I set the Total Ratings threshold to over 50, rather than 100, I was getting movies that I, as a person who has seen Star Trek (2009), would not recommend to another person. After raising the threshold to over 100, I received movies that I would recommend to another person who liked Star Trek (2009).