## Movie Recommendation System Using Pearson Correlation

#### Read the dataset with the necessary features

In [1]:
import pandas as pd
import numpy as np

In [2]:
movie_rating_data=pd.read_csv('ratings.csv')
movie_rating_data.sort_values('rating', ascending=True).head(10)

Unnamed: 0,userId,movieId,rating,timestamp
3752,22,53519,0.5,1268727137
60861,393,5445,0.5,1430506636
47025,307,2017,0.5,1186173639
22446,153,1198,0.5,1525548264
60865,393,5902,0.5,1430507509
69266,448,3933,0.5,1191614594
31670,219,1831,0.5,1195349347
24858,175,43904,0.5,1234189741
22453,153,1356,0.5,1525552756
31661,219,1721,0.5,1214043346


In [3]:
movies_df=pd.read_csv('movies.csv')
movies_df.sort_values('movieId', ascending=True).head(10)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [4]:
tags_df=pd.read_csv('tags.csv')
tags_df=tags_df[['movieId','tag']]
tags_df.sort_values('movieId', ascending=True).head(10)

Unnamed: 0,movieId,tag
2886,1,fun
981,1,pixar
629,1,pixar
35,2,Robin Williams
34,2,magic board game
33,2,fantasy
982,2,game
562,3,old
561,3,moldy
984,5,remake


In [5]:
movie_rating_data=movie_rating_data.merge(movies_df,on='movieId',how='left')
movie_rating_data=movie_rating_data.merge(tags_df,on='movieId',how='left')
movie_rating_data.head(10)

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,tag
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
1,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
2,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,fun
3,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance,moldy
4,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance,old
5,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller,
6,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,mystery
7,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,twist ending
8,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,serial killer
9,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,mindfuck


### Recommendation Based on Rating Counts

In [6]:
rating = pd.DataFrame(movie_rating_data.groupby('title')['rating'].mean())
rating['Total Rating']=pd.DataFrame(movie_rating_data.groupby('title')['rating'].count())
rating.sort_values('Total Rating', ascending=False).head(10)

title,rating,Total Rating
Pulp Fiction (1994),4.197068,55567
Fight Club (1999),4.272936,11772
Star Wars: Episode IV - A New Hope (1977),4.231076,6526
Léon: The Professional (a.k.a. The Professional) (Léon) (1994),4.018797,4655
2001: A Space Odyssey (1968),3.894495,4469
Eternal Sunshine of the Spotless Mind (2004),4.160305,4454
Inception (2010),4.066434,3718
"Big Lebowski, The (1998)",3.924528,3392
Donnie Darko (2001),3.981651,3161
Forrest Gump (1994),4.164134,2961
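
Note that these counts are computed after the tag merge in In [5], which repeats each rating row once per tag attached to the movie (the Toy Story rows for user 1 above appear three times). "Total Rating" therefore counts merged rows, not distinct user ratings. A minimal sketch on toy data of how one might count each rating once; the frames and values below are hypothetical and are not taken from ratings.csv or tags.csv:

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only).
ratings = pd.DataFrame({
    "userId": [1, 2, 3],
    "movieId": [10, 10, 20],
    "rating": [4.0, 5.0, 3.0],
})
tags = pd.DataFrame({
    "movieId": [10, 10, 10],
    "tag": ["a", "b", "c"],
})

# A left merge on movieId repeats every rating of movie 10 once per tag.
merged = ratings.merge(tags, on="movieId", how="left")
inflated_counts = merged.groupby("movieId")["rating"].count()

# Keeping one row per (userId, movieId) pair restores one row per rating.
unique_ratings = merged.drop_duplicates(subset=["userId", "movieId"])
true_counts = unique_ratings.groupby("movieId")["rating"].count()

print(inflated_counts.to_dict())
print(true_counts.to_dict())
```

An alternative under the same assumption would be to compute the mean and count on the ratings table before merging the tags.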


### Recommendations based on correlations

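We use Pearson's correlation coefficient (Pearson's r) to measure the linear correlation between two variables: in our case, the ratings that the same users gave to two different movies.
First we build a user-by-movie matrix in which each row is a user, each column is a movie title, and each cell holds that user's rating (NaN where the user has not rated the movie).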

In [7]:
movie_user=movie_rating_data.pivot_table(index='userId',columns='title',values='rating')
movie_user.head(10)

userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,1.0,,,
10,,,,,,,,,,,...,,,,,,,,,,
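
To make the definition concrete, here is a small sketch that builds the same kind of user-by-title matrix from toy data and computes Pearson's r between two columns by hand, then compares it with pandas' `Series.corr`. The titles "A" and "B" and all ratings are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy long-format ratings (hypothetical values).
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3],
    "title": ["A", "B", "A", "B", "A", "B"],
    "rating": [5.0, 4.0, 3.0, 2.0, 1.0, 1.0],
})

# Rows are users, columns are titles, cells are ratings (NaN where unrated).
toy_matrix = toy.pivot_table(index="userId", columns="title", values="rating")

x = toy_matrix["A"].to_numpy()
y = toy_matrix["B"].to_numpy()

# Pearson's r: sum of products of deviations from the mean, divided by
# the square root of the product of the summed squared deviations.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity for two aligned Series.
r_pandas = toy_matrix["A"].corr(toy_matrix["B"])
print(r_manual, r_pandas)
```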


#### Now we can pick any movie to test the recommender. Here we use Toy Story (1995). The corrwith() function computes the Pearson correlation between the Toy Story (1995) ratings column and every movie's column, using only the users who rated both movies.

In [8]:
correlation=movie_user.corrwith(movie_user['Toy Story (1995)'])
correlation.head(10)


title
'71 (2014)                                      NaN
'Hellboy': The Seeds of Creation (2004)         NaN
'Round Midnight (1986)                          NaN
'Salem's Lot (2004)                             NaN
'Til There Was You (1997)                       NaN
'Tis the Season for Love (2015)                 NaN
'burbs, The (1989)                         0.240563
'night Mother (1986)                            NaN
(500) Days of Summer (2009)                0.353833
*batteries not included (1987)            -0.427425
dtype: float64
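
The NaN entries and the extreme values are easier to understand on a toy matrix. The sketch below (hypothetical column names and ratings) shows that a column sharing only one rater with the target yields NaN, while a column sharing exactly two raters always yields a correlation of +1 or -1, regardless of how similar the movies really are. NumPy may emit a RuntimeWarning for the NaN case, so the sketch silences it:

```python
import warnings

import numpy as np
import pandas as pd

# Toy user-by-title matrix (hypothetical values); NaN means "not rated".
toy_matrix = pd.DataFrame(
    {
        "Target": [5.0, 3.0, 1.0, 4.0],
        "Similar": [4.0, 4.0, 2.0, np.nan],
        "OneOverlap": [np.nan, np.nan, np.nan, 2.0],
        "TwoOverlap": [1.0, 2.0, np.nan, np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

with warnings.catch_warnings():
    # A single shared rating has no variance, which can trigger a warning.
    warnings.simplefilter("ignore", RuntimeWarning)
    toy_corr = toy_matrix.corrwith(toy_matrix["Target"])

print(toy_corr)
```

This is why a minimum number of ratings is worth enforcing before trusting a high correlation.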

### The correlation is NaN for many movies. We first drop those rows and then join the total rating count to the table.

In [9]:
recommandation=pd.DataFrame(correlation,columns=['Pearson Correlation'])
recommandation.dropna(inplace=True)
recommandation=recommandation.join(rating['Total Rating'])
recommandation.head()

title,Pearson Correlation,Total Rating
"'burbs, The (1989)",0.240563,17
(500) Days of Summer (2009),0.353833,336
*batteries not included (1987),-0.427425,7
10 Cent Pistol (2015),1.0,2
10 Cloverfield Lane (2016),-0.285732,28


## To make the correlations more reliable, we keep only movies with more than 150 ratings. We also merge in the genres so that we can sanity-check the recommendations.

In [10]:
recc=recommandation[recommandation['Total Rating']>150].sort_values('Pearson Correlation',ascending=False)
recc=recc.merge(movies_df,on='title',how='left')
recc.head(10)

Unnamed: 0,title,Pearson Correlation,Total Rating,movieId,genres
0,Toy Story (1995),1.0,645,1,Adventure|Animation|Children|Comedy|Fantasy
1,Avengers: Infinity War - Part I (2018),0.942264,195,122912,Action|Adventure|Sci-Fi
2,Friends with Benefits (2011),0.768301,180,88405,Comedy|Romance
3,Toy Story 2 (1999),0.699211,776,3114,Adventure|Animation|Children|Comedy|Fantasy
4,Whiplash (2014),0.659713,342,112552,Drama
5,"Incredibles, The (2004)",0.643301,500,8961,Action|Adventure|Animation|Children|Comedy
6,Finding Nemo (2003),0.618701,423,6377,Adventure|Animation|Children|Comedy
7,Aladdin (1992),0.611892,183,588,Adventure|Animation|Children|Comedy|Musical
8,Big Hero 6 (2014),0.589433,246,115617,Action|Animation|Comedy
9,Blazing Saddles (1974),0.585892,186,3671,Comedy|Western
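
The steps above can be gathered into one reusable function. This is only a sketch under stated assumptions: the function name `similar_titles`, its parameters, and the toy matrix are hypothetical and not part of the notebook. It also counts ratings directly from the user-by-title matrix and drops the target title from its own recommendations, which the cells above do not do:

```python
import numpy as np
import pandas as pd


def similar_titles(user_title_matrix, target, min_ratings):
    """Rank titles by Pearson correlation with `target` (hypothetical helper)."""
    corr = user_title_matrix.corrwith(user_title_matrix[target])
    counts = user_title_matrix.count()  # non-NaN ratings per title
    out = pd.DataFrame({"Pearson Correlation": corr, "Total Rating": counts})
    out = out.dropna(subset=["Pearson Correlation"])
    # Keep only titles with more than `min_ratings` ratings.
    out = out[out["Total Rating"] > min_ratings]
    # A title always correlates perfectly with itself, so leave it out.
    out = out.drop(index=target, errors="ignore")
    return out.sort_values("Pearson Correlation", ascending=False)


# Toy user-by-title matrix (hypothetical values); NaN means "not rated".
toy_matrix = pd.DataFrame({
    "Target": [5.0, 4.0, 1.0, 2.0],
    "A": [5.0, 5.0, 1.0, 1.0],
    "B": [1.0, 2.0, 5.0, 4.0],
    "C": [np.nan, np.nan, 3.0, 4.0],
})

result = similar_titles(toy_matrix, "Target", min_ratings=2)
print(result)
```

On the real data, the call would look like `similar_titles(movie_user, 'Toy Story (1995)', 150)`.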
