# Film Recommender

By: Alissa Trujillo

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

**Importing the Datasets**

In [2]:
movie_df = pd.read_csv("ratings.csv")
movie_titles_df = pd.read_csv("movies.csv")

**Merging the Datasets**

In [3]:
movie_df = movie_df.merge(movie_titles_df, 
                          on='movieId', how='left')
movie_df.head()

,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller


Merging the two datasets attaches each movie's title and genres to every rating. This lets us understand users' movie tastes and present recommendations by movie name, year, and genre.
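
As a small illustration of what the left merge does, the sketch below uses two made-up frames (`toy_ratings` and `toy_titles` are hypothetical stand-ins, not the real files). Every rating row is kept, and a rating whose `movieId` is missing from the titles table gets NaN for `title` and `genres`. The optional `validate` argument asks pandas to raise an error if `movieId` is not unique in the titles table.

```python
import pandas as pd

# Hypothetical toy data standing in for ratings.csv and movies.csv
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 99],  # movieId 99 has no entry in toy_titles
    "rating": [4.0, 4.0, 2.5],
})
toy_titles = pd.DataFrame({
    "movieId": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Romance"],
})

# how='left' keeps every rating; validate checks each movieId maps to one title
toy_merged = toy_ratings.merge(
    toy_titles, on="movieId", how="left", validate="many_to_one"
)

# Count ratings that found no matching title
unmatched = toy_merged["title"].isna().sum()
print(toy_merged)
print("Ratings without a title:", unmatched)
```

Running the same `isna()` check on the real `movie_df` would be one way to confirm that every rated movie has a title.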

**Taking a Closer Look at Ratings**

In [4]:
avg_ratings = pd.DataFrame(movie_df.groupby('title')['rating'].mean())
avg_ratings.sort_values(by='rating', ascending=False)

title,rating
Gena the Crocodile (1969),5.0
True Stories (1986),5.0
Cosmic Scrat-tastrophe (2015),5.0
Love and Pigeons (1985),5.0
Red Sorghum (Hong gao liang) (1987),5.0
...,...
Don't Look Now (1973),0.5
Journey 2: The Mysterious Island (2012),0.5
Joe Dirt 2: Beautiful Loser (2015),0.5
Jesus Christ Vampire Hunter (2001),0.5


This table shows one row for each of our 9719 movies along with its average rating. We can see the highest-rated and lowest-rated movies above. However, it is likely that many of the movies with a perfect score (or a perfectly bad score) have very few ratings. We can take a look at the total number of ratings as well.

In [5]:
avg_ratings['Total Ratings'] = movie_df.groupby('title')['rating'].count()
avg_ratings.sort_values(by='rating', ascending=False)

title,rating,Total Ratings
Gena the Crocodile (1969),5.0,1
True Stories (1986),5.0,1
Cosmic Scrat-tastrophe (2015),5.0,1
Love and Pigeons (1985),5.0,1
Red Sorghum (Hong gao liang) (1987),5.0,1
...,...,...
Don't Look Now (1973),0.5,1
Journey 2: The Mysterious Island (2012),0.5,1
Joe Dirt 2: Beautiful Loser (2015),0.5,1
Jesus Christ Vampire Hunter (2001),0.5,1


My hypothesis that these polarized movies had very few ratings was correct: each of the movies shown has a single rating. We can correct for that by only looking at movies that have more than 100 ratings.

In [6]:
movie100_df = avg_ratings.loc[avg_ratings['Total Ratings'] > 100]
movie100_df.sort_values(by='rating', ascending=False)

title,rating,Total Ratings
"Shawshank Redemption, The (1994)",4.429022,317
"Godfather, The (1972)",4.289062,192
Fight Club (1999),4.272936,218
"Godfather: Part II, The (1974)",4.259690,129
"Departed, The (2006)",4.252336,107
...,...,...
"Net, The (1995)",3.040179,112
Cliffhanger (1993),3.034653,101
Home Alone (1990),2.995690,116
Batman Forever (1995),2.916058,137


Looking at our data now, we can see the highest and lowest rated movies that have more than 100 ratings. The top-rated movies are all films that are widely considered to be classics, which suggests that an average over many ratings is much more meaningful than a single 5-star rating.
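
The mean, the count, and the threshold filter above can also be computed in a single pass. The sketch below is one possible way to package that, not the notebook's own code: `rating_summary` and its `min_ratings` parameter are hypothetical names, and the toy frame only stands in for `movie_df`.

```python
import pandas as pd

def rating_summary(ratings, min_ratings=100):
    """Mean rating and rating count per title, for titles with more than min_ratings ratings."""
    # Compute both statistics in one groupby
    summary = ratings.groupby("title")["rating"].agg(["mean", "count"])
    summary = summary.rename(columns={"mean": "rating", "count": "Total Ratings"})
    # Strictly greater than, matching the filter used in the notebook
    popular = summary[summary["Total Ratings"] > min_ratings]
    return popular.sort_values("rating", ascending=False)

# Hypothetical toy data: A has 3 ratings, B has 1, C has 2
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [4.0, 5.0, 3.0, 5.0, 2.0, 3.0],
})
toy_summary = rating_summary(toy, min_ratings=1)
print(toy_summary)
```

With `min_ratings=1`, movie B drops out despite its perfect score because it has only one rating, which mirrors the effect seen in the real data.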

**Finding Correlations**

In [7]:
movie_user = movie_df.pivot_table(index='userId',columns='title',values='rating')
movie_user.head()

userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


This piece of code creates a user-by-movie matrix in which each user is represented as a row and each movie is represented as a column. The matrix is sparsely populated: most cells are NaN because each user has rated only a small fraction of the movies. It allows us to see each individual user's rating of each movie that they rated. Given an input film title, our recommender system can look at how other users' ratings of that movie correlate with their ratings of every other movie. It will then use that information to create recommendations for our user.
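
To make the idea concrete, here is a minimal sketch on made-up ratings (the users, titles, and scores are all hypothetical). It builds the same kind of matrix with `pivot_table` and then uses `corrwith` to correlate every movie's column with one chosen movie. Each correlation is computed only over the users who rated both movies.

```python
import pandas as pd

# Hypothetical ratings: user 3 never rated movie C
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    "title":  ["A", "B", "C", "A", "B", "C", "A", "B", "A", "B", "C"],
    "rating": [5.0, 4.5, 1.0, 4.0, 4.0, 2.0, 2.0, 2.5, 1.0, 1.0, 5.0],
})

# Rows are users, columns are movies, missing ratings become NaN
toy_matrix = toy.pivot_table(index="userId", columns="title", values="rating")

# Correlate each movie's ratings with the ratings of movie A
toy_corr = toy_matrix.corrwith(toy_matrix["A"])
print(toy_matrix)
print(toy_corr.sort_values(ascending=False))
```

In this toy data, movie B is rated much like movie A, so its correlation is close to 1, while movie C is rated in the opposite direction and gets a correlation of -1.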

**Creating a Recommender Function**

The recommender function takes the user's favorite movie (or part of its title) and finds the matching movies among the included titles. For simplicity, this function recommends based on the first match found, but it could be extended to ask the end user which match they meant.

Once the function identifies the movie, it uses the matrix above to identify similar films. It filters out movies with 100 or fewer ratings, and then shows the user the 5 most related movies along with their correlation score (labeled 'match' for user-friendly presentation).

In [13]:
def recommend(movie):
    # Collect titles that contain the search string (case-sensitive)
    movie_matches = [col for col in movie_df.title if
                    movie in col]
    if not movie_matches:
        print('No matches found for', movie)
        return None

    # Recommend based on the first matching title
    match1 = movie_matches[0]
    print('Finding recommendations for:', match1)

    # Correlate every movie's ratings with the matched movie's ratings
    corr = movie_user.corrwith(movie_user[match1])
    rec = pd.DataFrame(corr, columns=['Correlation'])
    rec.dropna(inplace=True)
    rec = rec.join(avg_ratings['Total Ratings'])

    # Keep movies with more than 100 ratings, strongest matches first
    rec_m = rec[rec['Total Ratings'] > 100].sort_values(
        'Correlation', ascending=False).reset_index()
    # Attach genres for display
    rec_m = rec_m.merge(movie_titles_df, on='title',
                       how='left')

    rec_m.drop(['Total Ratings', 'movieId'], axis=1,
               inplace=True)
    rec_m = rec_m.rename(columns={'Correlation': 'match'})

    return rec_m.head()

As an example, I will take a look at recommendations for fans of Shrek.

In [14]:
recommend('Shrek')

Finding recommendations for: Shrek (2001)


,title,match,genres
0,Shrek (2001),1.0,Adventure|Animation|Children|Comedy|Fantasy|Ro...
1,Finding Nemo (2003),0.64498,Adventure|Animation|Children|Comedy
2,"Monsters, Inc. (2001)",0.581781,Adventure|Animation|Children|Comedy|Fantasy
3,Aladdin (1992),0.498432,Adventure|Animation|Children|Comedy|Musical
4,Batman Forever (1995),0.488341,Action|Adventure|Comedy|Crime


Go ahead and try! It will prompt you for your favorite movie.

In [None]:
mv = input("What is your favorite movie? ")

In [None]:
recommend(mv)
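
The first-match rule described earlier could be extended so the user sees every candidate title before choosing one. The sketch below is one hedged way to do that: `find_titles` is a hypothetical helper, not part of the notebook, and it assumes a case-insensitive substring match is what the user wants.

```python
def find_titles(query, titles):
    """Return every distinct title containing query, ignoring case (hypothetical helper)."""
    needle = query.lower()
    # De-duplicate and sort so the candidates come back in a stable order
    return [title for title in sorted(set(titles)) if needle in title.lower()]

# Hypothetical sample of titles, including a duplicate
toy_titles = ["Shrek (2001)", "Shrek 2 (2004)", "Toy Story (1995)", "Shrek (2001)"]
candidates = find_titles("shrek", toy_titles)
print(candidates)
```

In the notebook, the list of candidates could come from `movie_user.columns`, and the user could then pick one of the returned titles to pass to `recommend`.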

**Conclusion**

By taking the ratings of many users and finding movies that were rated highly by similar movie-goers, we are able to create recommendations for the end user. The collaborative sourcing allows us to personalize the recommending process.

Using the larger dataset would allow us to home in on movies that did not have enough ratings here to be included in the recommendations. It would also be useful to have a strategy for vetting the recommendations, as a couple of movies (specifically 'Finding Nemo' and 'Star Trek') appear to be correlated with many other movies, even ones in very different genres. This leads me to believe they were rated highly by many users, so their correlations were strong all around.
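
One possible vetting strategy, offered only as an untested sketch, is to require a minimum number of users who rated both the input movie and the candidate. A correlation built on two or three shared raters can look perfect by chance. The function name `correlations_with_overlap` and the `min_overlap` parameter are hypothetical, and the toy matrix stands in for `movie_user`.

```python
import pandas as pd

def correlations_with_overlap(matrix, target, min_overlap=2):
    """Correlate each movie with target and keep those with enough shared raters."""
    target_ratings = matrix[target]
    corr = matrix.corrwith(target_ratings)
    # For each movie, count users who rated both it and the target
    overlap = matrix.loc[target_ratings.notna()].count()
    result = pd.DataFrame({"Correlation": corr, "Shared Raters": overlap})
    result = result.drop(index=target).dropna(subset=["Correlation"])
    result = result[result["Shared Raters"] >= min_overlap]
    return result.sort_values("Correlation", ascending=False)

nan = float("nan")
# Hypothetical user-by-movie matrix: only users 4 and 5 rated both A and C
toy_matrix = pd.DataFrame(
    {
        "A": [5.0, 4.0, 2.0, 1.0, 3.0],
        "B": [5.0, 4.0, 2.0, 1.0, nan],
        "C": [nan, nan, nan, 1.0, 3.0],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

vetted = correlations_with_overlap(toy_matrix, "A", min_overlap=3)
print(vetted)
```

Here movie C correlates strongly with movie A on the strength of just two shared raters, so requiring three shared raters removes it and leaves only movie B. Whether this would address the 'Finding Nemo' and 'Star Trek' pattern in the real data is an open question.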