<h1 style="text-align:center"> Recommender Systems with Pearson's Correlation Coefficient</h1>
<img src="https://cdn-images-1.medium.com/max/800/1*Vjmmcg_rDdtOK28xJcI_0A.jpeg" />

<br>
Recommender systems suggest items to a user based on a variety of factors, such as the ratings that users have already given.

Pearson's correlation coefficient is a simple yet effective way to measure how strongly one variable changes linearly with respect to another.
We can use this to our advantage and build a recommender system around it.

<img src="https://www.wallstreetmojo.com/wp-content/uploads/2019/03/Correlation-Coefficient-Formula-2.jpg"/>


NOTE: 
If the correlation coefficient of two variables is close to 1, they have a strong positive linear relationship: as one increases, the other tends to increase.<br>
If it is close to -1, they have a strong negative linear relationship: as one increases, the other tends to decrease.<br>
If its magnitude is close to 0, the variables probably do not have a strong linear dependency on each other.
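
To make the coefficient concrete, here is a minimal sketch that computes it by hand with NumPy, using the standard definition (covariance divided by the product of the two spreads). The helper `pearson_r` and the three rating lists are made up for illustration; they are not part of the MovieLens data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Deviation of each observation from its own mean
    dx = x - x.mean()
    dy = y - y.mean()
    # Sum of co-deviations divided by the product of the two spreads
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical ratings given by the same five users to three movies
movie_a = [5, 4, 3, 2, 1]
movie_b = [4, 5, 3, 1, 2]   # mostly agrees with movie_a
movie_c = [1, 2, 3, 4, 5]   # exactly opposite to movie_a

r_ab = pearson_r(movie_a, movie_b)
r_ac = pearson_r(movie_a, movie_c)
print(round(r_ab, 2), round(r_ac, 2))  # prints: 0.8 -1.0
```

A value of 0.8 says the two movies are rated similarly by the same users, while -1.0 says the ratings move in exactly opposite directions.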

In [1]:
#importing the libraries
import numpy as np
import pandas as pd

<h2> MovieLens Dataset </h2>

For the purpose of implementing the recommender system, I have used the MovieLens 100k dataset, which contains 100,000 ratings of 1,682 movies.

In [2]:
#data import
df1 = pd.read_csv('./ml-100k/u.data',sep='\t',names=['user_id','item_id','rating','timestamp'])
# u.item has 24 pipe-separated fields: 5 metadata columns followed by 19 genre flags
df2 = pd.read_csv("./ml-100k/u.item", sep="|", encoding="iso-8859-1",
                  names=["item_id", "item_name", "date", "video_date", "website"]
                        + [f"genre{i}" for i in range(1, 20)])
df1.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [3]:
print("shape of df1: ",df1.shape)

shape of df1:  (100000, 4)


The first dataframe, df1, contains the user ID, the movie ID, the corresponding rating, and a timestamp.

In [4]:
df2 = df2.iloc[:,0:2]
df2.head()

Unnamed: 0,item_id,item_name
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [5]:
print("shape of df2: ",df2.shape)

shape of df2:  (1682, 2)


The second dataframe, df2, contains each movie's name and its corresponding item_id.

In [6]:
data = df1.merge(df2,on="item_id")
data.drop(['timestamp'],inplace=True,axis=1)
data.head()

Unnamed: 0,user_id,item_id,rating,item_name
0,196,242,3,Kolya (1996)
1,63,242,3,Kolya (1996)
2,226,242,5,Kolya (1996)
3,154,242,3,Kolya (1996)
4,306,242,5,Kolya (1996)


In [7]:
print("shape of data: ",data.shape)

shape of data:  (100000, 4)


Merging df1 with df2 on item_id gives the complete dataset, with one row per rating and the movie name attached.
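
The merged table keeps exactly 100,000 rows because every rating matches a single movie title. As a minimal sketch of how that assumption could be checked, the toy example below uses the `validate` argument of `merge`. The `ratings` and `titles` frames are hypothetical miniatures, not the real files.

```python
import pandas as pd

# Hypothetical miniature versions of the ratings and titles tables
ratings = pd.DataFrame({
    "user_id": [1, 2, 3],
    "item_id": [10, 10, 20],
    "rating":  [4, 5, 3],
})
titles = pd.DataFrame({
    "item_id":   [10, 20],
    "item_name": ["Movie A", "Movie B"],
})

# validate="many_to_one" raises MergeError if an item_id is duplicated in `titles`,
# which would otherwise silently multiply the rating rows
merged = ratings.merge(titles, on="item_id", how="inner", validate="many_to_one")
print(merged)
print("rows before:", len(ratings), "rows after:", len(merged))
```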

<h2> Pivot Table </h2>

We use the pivot table from pandas to create a table in which each column represents a movie and each row represents a user.

In [8]:
data_table = pd.pivot_table(data,values='rating',columns='item_name',index='user_id')
data_table.head()

user_id,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,2.0,,,,,4.0,,,...,,,,4.0,,,,,4.0,
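
Most cells in the table above are empty because each user rates only a small share of the movies. The toy example below is a minimal sketch of the same pivot on made-up data, showing how unrated movies become NaN. The `toy` frame and its values are hypothetical. Note that `pd.pivot_table` averages by default, so if the same user had two rows for the same title, the cell would hold their mean.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair
toy = pd.DataFrame({
    "user_id":   [1, 1, 2, 3, 3],
    "item_name": ["Movie A", "Movie B", "Movie A", "Movie B", "Movie C"],
    "rating":    [5, 3, 4, 2, 1],
})

# Same call shape as in the notebook: users as rows, movies as columns
toy_table = pd.pivot_table(toy, values="rating", index="user_id", columns="item_name")
print(toy_table)

# Count the user/movie cells that have no rating
missing = int(toy_table.isna().to_numpy().sum())
print(f"{missing} of {toy_table.size} cells are missing")  # prints: 4 of 9 cells are missing
```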


<h2> Making Recommendations </h2>

That's all the preparation this basic recommender system needs. In order to make recommendations, we take a movie name from the user and return a list of movies that the user might like. This is where the correlation coefficient comes into play.

<h5> Let's assume that the user liked the movie 101 Dalmatians (1996). We have to give a list of movies that we think the user might like.</h5>

In [9]:
print("here is a list of 20 movies to recommend to a user who has liked '101 Dalmatians (1996)'")
print(data_table.corr()['101 Dalmatians (1996)'].sort_values(ascending=False).iloc[:20])

here is a list of 20 movies to recommend to a user who has liked '101 Dalmatians (1996)'
item_name
Browning Version, The (1994)              1.0
Roseanna's Grave (For Roseanna) (1997)    1.0
Stranger, The (1994)                      1.0
House Party 3 (1994)                      1.0
Scarlet Letter, The (1926)                1.0
Tie That Binds, The (1995)                1.0
Boys (1996)                               1.0
Sixth Man, The (1997)                     1.0
Ready to Wear (Pret-A-Porter) (1994)      1.0
Ed (1996)                                 1.0
Apostle, The (1997)                       1.0
April Fool's Day (1986)                   1.0
Grateful Dead (1995)                      1.0
Madame Butterfly (1995)                   1.0
True Crime (1995)                         1.0
Trial by Jury (1994)                      1.0
Loch Ness (1995)                          1.0
Gay Divorcee, The (1934)                  1.0
Swan Princess, The (1994)                 1.0
Big Squeeze, The (1996)   
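
Every score in this list is exactly 1.0. A likely reason is that a pair of movies with only two users in common always gets a correlation of 1 or -1 (unless one of the movies received identical ratings), regardless of how similar the movies really are. One possible refinement, sketched below on made-up data, is the `min_periods` argument of `DataFrame.corr`, which returns NaN for pairs with too few common raters. The `table` frame and the threshold `MIN_COMMON_RATERS` are hypothetical and would need tuning on the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie table; NaN means "not rated"
table = pd.DataFrame({
    "Target":  [5, 4, 2, 1, 3, np.nan],
    "Popular": [4, 5, 1, 2, 3, 4],                 # shares 5 raters with Target
    "Obscure": [np.nan, np.nan, np.nan, 1, 3, 5],  # shares only 2 raters with Target
})

# Without a threshold, the two-rater pair gets a perfect score
naive = table.corr()["Target"]
print(naive)

# Assumed threshold: require at least this many common raters per pair
MIN_COMMON_RATERS = 5
filtered = table.corr(min_periods=MIN_COMMON_RATERS)["Target"]

# Drop the target itself and the pairs that did not meet the threshold
recommendations = filtered.drop("Target").dropna().sort_values(ascending=False)
print(recommendations)
```

With the threshold in place, "Obscure" is dropped and only "Popular" (correlation 0.8) remains as a recommendation.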

So, this is how we can use Pearson's correlation coefficient to recommend movies to users based on the movies they liked. <br>