# Movie Recommender System

The dataset is available at https://grouplens.org/datasets/movielens/ (the small version, intended for educational purposes).

Just as Amazon recommends products based on what you have purchased, the goal of this notebook is to recommend similar movies based on a specific movie.

### Organizing dataset

In [2]:
import pandas as pd
ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')

#taking a look at ratings data
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [3]:
#taking a look at movies data
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
#merging movie titles and genres onto each rating (left join on movieId)
data = ratings.merge(movies, on='movieId', how='left')
data.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller


### Building recommendation system

Both the average rating per movie and the number of ratings per movie are important attributes: the average shows how well a movie is liked, and the count shows how much evidence supports that average. We will build our system on these two attributes.

In [22]:
#create new dataframe holding the average rating per movie title
avgrating_count = pd.DataFrame(data.groupby('title')['rating'].mean())

In [23]:
#add number of ratings for each movie (the Series aligns on the title index)
avgrating_count['rating_counts'] = data.groupby('title')['rating'].count()

In [24]:
#take a look at our new dataframe
avgrating_count.head()

title,rating,rating_counts
'71 (2014),4.0,1
'Hellboy': The Seeds of Creation (2004),4.0,1
'Round Midnight (1986),3.5,2
'Salem's Lot (2004),5.0,1
'Til There Was You (1997),4.0,2


Each movie now has both its average rating and the number of ratings it received.
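The same two attributes can also be computed in a single pass with a named aggregation. The sketch below uses a hypothetical toy frame (`toy` and `summary` are illustrative names, not part of the dataset) rather than the real `data` frame:

```python
import pandas as pd

# Hypothetical toy ratings standing in for the merged `data` frame
toy = pd.DataFrame({
    "title": ["A", "A", "B", "B", "B", "C"],
    "rating": [4.0, 5.0, 3.0, 3.5, 4.0, 2.0],
})

# Named aggregation: mean rating and number of ratings per title in one groupby
summary = toy.groupby("title")["rating"].agg(rating="mean", rating_counts="count")
print(summary)
```

This should produce the same two columns as the two-step approach above, assuming a pandas version that supports named aggregation.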

In [26]:
#one row per user, one column per movie title, cells hold the rating
user_rating = data.pivot_table(index='userId', columns='title', values='rating')
user_rating.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


We can now see how each user rated each movie. Most cells are null because each user has rated only a small fraction of the movies.
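To put a number on how empty such a matrix is, one option is to measure the share of missing cells. This is a minimal sketch on hypothetical toy data (`toy`, `matrix`, and `sparsity` are illustrative names); the same expression could be applied to `user_rating`:

```python
import pandas as pd

# Hypothetical toy ratings: 3 users, 3 titles, only 4 ratings in total
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "title": ["A", "B", "A", "C"],
    "rating": [4.0, 3.0, 5.0, 2.0],
})

# Same pivot as in the notebook: users as rows, titles as columns
matrix = toy.pivot_table(index="userId", columns="title", values="rating")

# Fraction of user/movie cells that hold no rating
sparsity = matrix.isna().to_numpy().mean()
print(f"{sparsity:.0%} of cells are empty")
```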

### Testing recommender system

Now we need the ratings for a specific movie, in this case 'Toy Story', one of my favorites.

In [27]:
toystory_ratings = user_rating['Toy Story (1995)']
toystory_ratings.head()

userId
1    4.0
2    NaN
3    NaN
4    NaN
5    4.0
Name: Toy Story (1995), dtype: float64

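Now we need to retrieve all movies that are similar to Toy Story.
To do this we compute the correlation between the user ratings for Toy Story and the user ratings for every other movie. Movies that the same users tend to rate high or low together get a correlation close to 1.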
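To illustrate what this correlation step does, here is a small sketch on a hypothetical user-by-movie matrix (the column names `Target`, `Similar`, and `Opposite` are made up for the example). `corrwith` computes the Pearson correlation of each column with the given Series, using only the users who rated both movies:

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie matrix; NaN means "not rated"
matrix = pd.DataFrame(
    {
        "Target": [5.0, 4.0, 2.0, 1.0, np.nan],
        "Similar": [4.5, 4.0, 2.5, np.nan, 3.0],
        "Opposite": [1.0, 2.0, 4.0, 5.0, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

# Pearson correlation of every column with the target column,
# computed over the users who rated both movies
similarity = matrix.corrwith(matrix["Target"])
print(similarity)
```

Here `Similar` should score close to 1 and `Opposite` close to -1, since its ratings move in exactly the opposite direction.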

In [39]:
toystory_similar = user_rating.corrwith(toystory_ratings)

#creating new dataframe according to correlation
corr_toystory = pd.DataFrame(toystory_similar, columns=['correlation'])

#dropping null values
corr_toystory.dropna(inplace=True)
corr_toystory.head()

title,correlation
"'burbs, The (1989)",0.240563
(500) Days of Summer (2009),0.353833
*batteries not included (1987),-0.427425
10 Cent Pistol (2015),1.0
10 Cloverfield Lane (2016),-0.285732


In [31]:
#sorting by correlation, highest first
corr_toystory.sort_values('correlation',ascending=False).head()

title,correlation
Land Before Time III: The Time of the Great Giving (1995),1.0
Faster Pussycat! Kill! Kill! (1965),1.0
Amen. (2002),1.0
"Machine Girl, The (Kataude mashin gâru) (2008)",1.0
Waydowntown (2000),1.0


The output above shows that the movies with the highest correlation with Toy Story are not well known. A perfect correlation of 1.0 can come from a movie that only a handful of users rated, so correlation on its own is not necessarily a good metric to use.
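A quick sketch of why sparsely rated movies can reach a perfect score: when only two users rated both movies, the two rating pairs always lie on a straight line, so the Pearson correlation is exactly +1 or -1 regardless of how similar the movies really are. The values below are hypothetical:

```python
import pandas as pd

# Only two users rated both movies: two distinct points always lie on a line,
# so the Pearson correlation can only be +1 or -1
a = pd.Series([4.0, 5.0])
b = pd.Series([2.5, 3.0])
two_point_corr = a.corr(b)
print(two_point_corr)

# Counting the users who rated both movies shows how much evidence backs a score
x = pd.Series([4.0, float("nan"), 5.0, 3.0])
y = pd.Series([2.5, 4.0, 3.0, float("nan")])
overlap = int((x.notna() & y.notna()).sum())
print(overlap)
```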

One solution is to keep only those correlated movies that have more than 50 ratings, so that only widely rated movies appear.

In [32]:
corr_toystory = corr_toystory.join(avgrating_count['rating_counts'])
corr_toystory.head()

title,correlation,rating_counts
"'burbs, The (1989)",0.240563,17
(500) Days of Summer (2009),0.353833,42
*batteries not included (1987),-0.427425,7
10 Cent Pistol (2015),1.0,2
10 Cloverfield Lane (2016),-0.285732,14


In [36]:
#keeping movies with more than 50 ratings, then sorting by correlation
corr_toystory[corr_toystory['rating_counts']>50].sort_values('correlation', ascending=False).head()

title,correlation,rating_counts
Toy Story (1995),1.0,215
Toy Story 2 (1999),0.699211,97
Arachnophobia (1990),0.652424,53
"Incredibles, The (2004)",0.643301,125
Finding Nemo (2003),0.618701,141


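That's much better. The top row is Toy Story itself, which correlates perfectly with its own ratings. After that, our system recommends Toy Story 2 along with other Pixar movies such as The Incredibles and Finding Nemo, which makes a lot of sense.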
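The steps in this notebook can be wrapped into a single function. The sketch below is one possible way to do so, not a definitive implementation: `recommend_similar`, its parameters, and the toy data are all hypothetical names introduced for illustration. Unlike the output above, it drops the query movie from its own results.

```python
import pandas as pd

def recommend_similar(ratings_long, title, min_ratings=50, top_n=5):
    """Hypothetical helper: titles whose ratings correlate most with `title`."""
    # User x title matrix, as built earlier in the notebook
    matrix = ratings_long.pivot_table(index="userId", columns="title", values="rating")
    # Number of ratings per title
    counts = ratings_long.groupby("title")["rating"].count().rename("rating_counts")
    # Correlation of every title with the query title
    corr = matrix.corrwith(matrix[title]).rename("correlation").dropna()
    result = pd.concat([corr, counts], axis=1, join="inner")
    # Keep titles with more than `min_ratings` ratings; exclude the query itself
    result = result[result["rating_counts"] > min_ratings]
    result = result.drop(index=title, errors="ignore")
    return result.sort_values("correlation", ascending=False).head(top_n)

# Hypothetical toy data: 4 users who each rated movies A, B and C
toy = pd.DataFrame({
    "userId": [1, 2, 3, 4] * 3,
    "title": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "rating": [5, 4, 2, 1, 4.5, 4, 2.5, 1, 1, 2, 4, 5],
})

recs = recommend_similar(toy, "A", min_ratings=3, top_n=2)
print(recs)
```

With the notebook's merged frame, a call such as `recommend_similar(data, 'Toy Story (1995)')` would apply the same threshold of more than 50 ratings used above.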