# Movie Recommender System

Dataset found at https://grouplens.org/datasets/movielens/ (small dataset for educational purposes)

Just as Amazon recommends products based on what you have purchased, the goal of this notebook is to recommend 10 popular, similar movies based on a specific movie.

### Organizing dataset

In [1]:
import pandas as pd
ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')

#taking a look at ratings data
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [2]:
#taking a look at movies data
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
#merging data
data = ratings.merge(movies, on='movieId', how='left')
data.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller


In [4]:
#checking for null values
data.isnull().values.any()

False
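
A left merge keeps every rating even when its `movieId` has no match in `movies`, which is why the null check above matters. The sketch below uses made-up rows (the toy frames and the unmatched `movieId` 99 are assumptions for illustration, not part of the MovieLens data) to show how the `validate` and `indicator` arguments of `merge` can make that check explicit:

```python
import pandas as pd

# Toy stand-ins for ratings.csv and movies.csv (made-up rows)
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 99],   # movieId 99 has no entry in toy_movies
    "rating": [4.0, 4.0, 3.5],
})
toy_movies = pd.DataFrame({
    "movieId": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
})

# validate raises if a movieId appears more than once in toy_movies;
# indicator adds a _merge column saying where each row came from
merged = toy_ratings.merge(
    toy_movies, on="movieId", how="left",
    validate="many_to_one", indicator=True,
)

# Ratings whose movieId was not found in toy_movies
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["userId", "movieId"]])
```

On the real data an empty `unmatched` frame would agree with the `False` result above.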

### Exploring dataset

We know that both the average rating per movie and the number of ratings per movie are important attributes. For simplicity, the system will be built on only those two features.

In [5]:
#create new dataframe based on these attributes
avgrating_count = pd.DataFrame(data.groupby('title')['rating'].mean())

In [6]:
#add number of ratings for a movie
avgrating_count['rating_counts'] = pd.DataFrame(data.groupby('title')['rating'].count())

In [7]:
#take a look at our new dataframe
avgrating_count.head()

title,rating,rating_counts
'71 (2014),4.0,1
'Hellboy': The Seeds of Creation (2004),4.0,1
'Round Midnight (1986),3.5,2
'Salem's Lot (2004),5.0,1
'Til There Was You (1997),4.0,2


Along with each movie's average rating, the dataframe now also holds the number of ratings that movie has received.
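
As an aside, the same two columns can probably be built in a single step with named aggregation. This is only a sketch on made-up titles and ratings (the frame `toy` is an assumption), not a replacement for the cells above:

```python
import pandas as pd

# Made-up ratings for two movies
toy = pd.DataFrame({
    "title": ["A", "A", "B", "B", "B"],
    "rating": [4.0, 5.0, 3.0, 3.0, 4.5],
})

# Named aggregation: one output column per (source column, function) pair
summary = toy.groupby("title").agg(
    rating=("rating", "mean"),
    rating_counts=("rating", "count"),
)
print(summary)
```

Movie A ends up with a mean of 4.5 over 2 ratings and movie B with a mean of 3.5 over 3 ratings.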

In [8]:
user_rating = data.pivot_table(index='userId', columns='title', values='rating')
user_rating.head()

userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


We can now see how each user rated each movie. There are a lot of null values because no user has rated every movie.

For now, to narrow the data down, I will use only the ratings for 'Toy Story', one of my favorites.
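
To get a feel for how empty such a table is, here is a minimal sketch on made-up data (three users, three movies, all values assumed) that builds the same kind of pivot table and measures the share of missing cells:

```python
import pandas as pd

# Made-up long-format ratings
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "title": ["A", "B", "A", "C"],
    "rating": [4.0, 5.0, 3.0, 2.5],
})

# Rows are users, columns are movies, cells are ratings (NaN = not rated)
matrix = toy.pivot_table(index="userId", columns="title", values="rating")

# Fraction of user/movie pairs with no rating
sparsity = matrix.isna().to_numpy().mean()
print(matrix)
print(f"{sparsity:.0%} of cells are empty")
```

Here 5 of the 9 cells are empty. Running the same `isna()` calculation on `user_rating` should show how sparse the real table is.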

In [9]:
toystory_ratings = user_rating['Toy Story (1995)']
toystory_ratings.head()

userId
1    4.0
2    NaN
3    NaN
4    NaN
5    4.0
Name: Toy Story (1995), dtype: float64

Now we need to retrieve all movies that are similar to Toy Story.
To do this, we will compute the correlation between the user ratings for Toy Story and the user ratings for every other movie. The number of ratings per movie computed earlier is used afterwards to filter the results.
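
A small sketch may help show what this correlation measures. For each movie, only users who rated both that movie and the target contribute, and the result is Pearson's r on those shared ratings. The two columns below are made-up, and the manual `np.corrcoef` check is added here purely for illustration:

```python
import numpy as np
import pandas as pd

# Made-up ratings: rows are users, columns are movies (NaN = not rated)
toy = pd.DataFrame({
    "Target": [5.0, 4.0, 2.0, np.nan, 1.0],
    "Other":  [4.5, 4.0, np.nan, 3.0, 2.0],
})

# corrwith ignores users who are missing either rating
r_pandas = toy[["Other"]].corrwith(toy["Target"])["Other"]

# Same thing by hand: keep only users who rated both movies
both = toy.dropna()
r_manual = np.corrcoef(both["Target"], both["Other"])[0, 1]

print(round(r_pandas, 4), round(r_manual, 4))
```

Both numbers agree, because only the three users with ratings for both movies enter the calculation.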

In [10]:
toystory_similar = user_rating.corrwith(toystory_ratings)

#creating new dataframe according to correlation
corr_toystory = pd.DataFrame(toystory_similar, columns=['correlation'])

#dropping null values
corr_toystory.dropna(inplace=True)
corr_toystory.head()

title,correlation
"'burbs, The (1989)",0.240563
(500) Days of Summer (2009),0.353833
*batteries not included (1987),-0.427425
10 Cent Pistol (2015),1.0
10 Cloverfield Lane (2016),-0.285732


In [11]:
#sorting above data
corr_toystory.sort_values('correlation',ascending=False).head()

title,correlation
Land Before Time III: The Time of the Great Giving (1995),1.0
Faster Pussycat! Kill! Kill! (1965),1.0
Amen. (2002),1.0
"Machine Girl, The (Kataude mashin gâru) (2008)",1.0
Waydowntown (2000),1.0


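The output above shows movies that have a high correlation with Toy Story, but they are not popular movies. A movie with very few ratings shares only a handful of raters with Toy Story, and with so few shared ratings a correlation of 1.0 is easy to reach by chance.

The solution is to retrieve only those correlated movies that have more than 50 ratings (so only famous movies appear).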
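
The effect of very few ratings can be reproduced on made-up numbers. In the sketch below (all values assumed), the column `Obscure` shares only two raters with `Target`, and two points always lie on a straight line, so the correlation comes out as 1.0 even though there is almost no evidence behind it:

```python
import numpy as np
import pandas as pd

# Made-up ratings for six users; "Obscure" shares only two raters with "Target"
toy = pd.DataFrame({
    "Target":  [5.0, 4.0, 2.0, 1.0, 3.0, 4.5],
    "Popular": [4.5, 4.0, 2.5, 2.0, 3.0, 3.5],
    "Obscure": [np.nan, 3.0, np.nan, np.nan, 2.0, np.nan],
})

corr = toy.corrwith(toy["Target"])

# Number of users who rated both the column's movie and Target
overlap = toy.loc[toy["Target"].notna()].count()

print(pd.DataFrame({"correlation": corr, "co_raters": overlap}))
```

`Popular` has six shared raters and a high but imperfect correlation, while `Obscure` reaches 1.0 from two shared raters. Filtering on the rating count is one simple way to keep such cases out of the results.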

In [12]:
#adding how many ratings each movie has to dataframe
corr_toystory = corr_toystory.join(avgrating_count['rating_counts'])
corr_toystory.head()

title,correlation,rating_counts
"'burbs, The (1989)",0.240563,17
(500) Days of Summer (2009),0.353833,42
*batteries not included (1987),-0.427425,7
10 Cent Pistol (2015),1.0,2
10 Cloverfield Lane (2016),-0.285732,14


In [13]:
#keeping only movies with more than 50 ratings, then sorting by correlation
corr_toystory[corr_toystory['rating_counts']>50].sort_values('correlation', ascending=False).head(11)

title,correlation,rating_counts
Toy Story (1995),1.0,215
Toy Story 2 (1999),0.699211,97
Arachnophobia (1990),0.652424,53
"Incredibles, The (2004)",0.643301,125
Finding Nemo (2003),0.618701,141
Aladdin (1992),0.611892,183
Erin Brockovich (2000),0.598016,70
Wallace & Gromit: The Wrong Trousers (1993),0.589625,56
Blazing Saddles (1974),0.585892,62
"Wolf of Wall Street, The (2013)",0.578479,54


That's much better. Now our system recommends Toy Story 2 along with other Pixar movies such as The Incredibles and Finding Nemo (which makes a lot of sense).

## Building recommender function

In [14]:
#function uses features explored above but can be applied to any movie

def mostSimilar(movie):
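    #ratings every user gave to the selected movie
    #(named movie_ratings so it does not shadow the global ratings dataframe)
    movie_ratings = user_rating[movie]
    similar = user_rating.corrwith(movie_ratings)
    
    #creating new dataframe that has correlation values
    corr = pd.DataFrame(similar, columns=['correlation'])
    #drop movies for which no correlation could be computed
    corr = corr.dropna()
    
    #add number of ratings for each movie
    corr = corr.join(avgrating_count['rating_counts'])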
    
    #function will return 11 (selected + 10 recommended) movies with more than 50 ratings
    return corr[corr['rating_counts']>50].sort_values('correlation', ascending=False).head(11)
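
The function above relies on the global `user_rating` and `avgrating_count` and always returns the selected movie as its own top match. A possible variant, sketched here under assumptions (the name `most_similar`, its parameters, and the toy data are all hypothetical and not part of the notebook), takes its inputs as arguments and leaves the selected movie out:

```python
import numpy as np
import pandas as pd

def most_similar(user_rating, rating_counts, movie, min_ratings=50, top_n=10):
    """Hypothetical variant of mostSimilar: no globals, selected movie excluded."""
    # Correlation of every movie's ratings with the selected movie's ratings
    corr = user_rating.corrwith(user_rating[movie]).dropna().rename("correlation")
    # Attach the number of ratings per movie
    result = corr.to_frame().join(rating_counts.rename("rating_counts"))
    # The selected movie always correlates perfectly with itself, so drop it
    result = result.drop(index=movie, errors="ignore")
    result = result[result["rating_counts"] > min_ratings]
    return result.sort_values("correlation", ascending=False).head(top_n)

# Made-up user/movie matrix (rows are users, NaN = not rated)
toy = pd.DataFrame({
    "A": [5.0, 4.0, 1.0, 2.0],
    "B": [4.5, 4.0, 1.5, 2.5],
    "C": [1.0, 2.0, 5.0, np.nan],
})
counts = toy.count()

print(most_similar(toy, counts, "A", min_ratings=2, top_n=2))
```

With the toy data, movie B ranks first because its ratings rise and fall with those of A, while C ranks last because its ratings move in the opposite direction.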

## Testing Recommender System

Now we test the function with different movies.

In [15]:
mostSimilar('Monsters, Inc. (2001)')

title,correlation,rating_counts
"Monsters, Inc. (2001)",1.0,132
Kung Fu Panda (2008),0.685689,54
"Bug's Life, A (1998)",0.677159,92
Sense and Sensibility (1995),0.66586,67
"Client, The (1994)",0.657404,57
In the Line of Fire (1993),0.64249,70
Mars Attacks! (1996),0.627171,86
Die Hard: With a Vengeance (1995),0.626018,144
Star Trek II: The Wrath of Khan (1982),0.616612,62
"Good, the Bad and the Ugly, The (Buono, il brutto, il cattivo, Il) (1966)",0.61464,72


In [16]:
mostSimilar('2001: A Space Odyssey (1968)')

title,correlation,rating_counts
2001: A Space Odyssey (1968),1.0,109
Sabrina (1995),0.643736,54
Spirited Away (Sen to Chihiro no kamikakushi) (2001),0.607761,87
Magnolia (1999),0.547034,52
Taxi Driver (1976),0.539283,104
"Clockwork Orange, A (1971)",0.508757,120
Vertigo (1958),0.497692,60
Star Trek (2009),0.48836,59
Snow White and the Seven Dwarfs (1937),0.483803,77
Apocalypse Now (1979),0.478877,107


## Evaluating Effectiveness

In [17]:
import numpy as np


#Root mean squared error (RMSE) function
def rmse(predictions, targets):
    diff = predictions - targets
    diff_squared = diff**2
    diff_squared_mean = diff_squared.mean()
    rmse_val = np.sqrt(diff_squared_mean)
    
    return rmse_val
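
To make the formula concrete, here is a short worked example on three made-up correlation values compared against a target of 1. The compact `rmse` below is meant to compute the same quantity as the function above:

```python
import numpy as np

def rmse(predictions, targets):
    # Same formula as above: square root of the mean squared difference
    return np.sqrt(((predictions - targets) ** 2).mean())

# Made-up correlation values
correlations = np.array([1.0, 0.7, 0.5])

# Differences from 1 are 0.0, 0.3, 0.5; their squares are 0.0, 0.09, 0.25
# Mean of squares = 0.34 / 3, so RMSE = sqrt(0.1133...) which is about 0.3367
print(rmse(correlations, 1))
```

The scalar target `1` is broadcast against every element, which is why a single number can be passed as `targets`.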

In [18]:
#testing movie whose recommendations have higher correlations
rmse(mostSimilar('Monsters, Inc. (2001)'), 1)

correlation       0.342418
rating_counts    85.203500
dtype: float64

In [19]:
#testing movie whose recommendations have lower correlations
rmse(mostSimilar('2001: A Space Odyssey (1968)'), 1)

correlation       0.454123
rating_counts    86.504335
dtype: float64

Final Observations: 

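-RMSE here measures how far the returned correlations are from a perfect correlation of 1, so movies whose recommendations have higher correlations also have lower error. Only the correlation row of the output is meaningful; the rating_counts row compares rating counts against 1 and can be ignored.

-Changing the minimum rating_counts required for recommendations greatly affects the quality of the system and its RMSE. The lower the requirement, the lower the error but the worse the recommendations, because rarely rated movies reach high correlations from only a few shared ratings.

-This model only takes two simple features into account. Future systems should take genre and userId into account for a more personalized system.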
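
One way to tighten the evaluation, sketched here on a made-up result frame (titles and values are assumptions), is to score only the correlation column and to leave out the selected movie, since it always correlates perfectly with itself:

```python
import numpy as np
import pandas as pd

# Made-up result shaped like the output of mostSimilar
result = pd.DataFrame(
    {"correlation": [1.0, 0.8, 0.6], "rating_counts": [120, 90, 60]},
    index=pd.Index(["Query Movie", "Movie X", "Movie Y"], name="title"),
)

# Drop the selected movie and keep only the correlation column
recs = result.drop(index="Query Movie")["correlation"]

# RMSE of the remaining correlations against a perfect correlation of 1
score = np.sqrt(((recs - 1) ** 2).mean())
print(score)
```

The remaining differences from 1 are 0.2 and 0.4, so the mean squared difference is 0.1 and the score is its square root.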