# **Movies Recommender System**

<a  align=center><img src = "https://bingepost.com/wp-content/uploads/2020/04/What-to-watch-on-Netflix-1.jpg" width=1000> </a>

<h1 align=center><font size = 5>Collaborative Filtering</font></h1>

**Steps:**
- Get movies dataframe (given data)
- Get users ratings dataframe (given data)
- Get input user ratings (user input/assumption)
- Learning the similarity weights (Pearson Correlation)
- Find the recommendations (similarity-weighted average of the similar users' ratings)

## 1. Get movies dataframe

In [None]:
import pandas as pd
from scipy.stats import pearsonr

In [None]:
#Movies information
movies_df = pd.read_csv('ml-latest\\movies.csv')
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
#Extract the release year, including its parentheses, from the title
movies_df['year'] = movies_df['title'].str.extract(r'(\(\d\d\d\d\))', expand=False)
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,(1995)
1,2,Jumanji (1995),Adventure|Children|Fantasy,(1995)
2,3,Grumpier Old Men (1995),Comedy|Romance,(1995)
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,(1995)
4,5,Father of the Bride Part II (1995),Comedy,(1995)


In [None]:
#Keep only the four digits of the year
movies_df['year'] = movies_df['year'].str.extract(r'(\d\d\d\d)', expand=False)
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II (1995),Comedy,1995


In [None]:
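#Remove the year from the title; regex=True is needed because recent pandas versions treat the pattern as a literal string by default
movies_df['title'] = movies_df['title'].str.replace(r'(\(\d\d\d\d\))', '', regex=True)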
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995


In [None]:
movies_df['title'] = movies_df['title'].str.strip()
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
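
The cleaning cells above can also be collapsed into a single pass. The following is a minimal sketch on a toy dataframe (the three rows and the name `movies` are stand-ins, not the real `movies.csv`), assuming every title ends with a year in parentheses:

```python
import pandas as pd

# Toy stand-in for movies.csv; only the column names match the real file.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Breakfast Club, The (1985)"],
})

# Capture only the four digits; the parentheses are matched but not captured.
movies["year"] = movies["title"].str.extract(r"\((\d{4})\)", expand=False)

# Remove the "(YYYY)" part and any leftover surrounding whitespace.
movies["title"] = (
    movies["title"].str.replace(r"\s*\(\d{4}\)", "", regex=True).str.strip()
)

print(movies)
```

Putting the capture group inside the escaped parentheses avoids the second extraction step, since the year arrives without its brackets.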


In [None]:
#Drop genres, since the Collaborative Filtering technique does not use them
movies_df = movies_df.drop('genres', axis=1)
movies_df.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


## 2. Get users ratings dataframe

In [None]:
#Users ratings information
ratings_df = pd.read_csv('ml-latest\\ratings.csv')
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


In [None]:
#Drop timestamp column
ratings_df = ratings_df.drop('timestamp', axis=1)
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,169,2.5
1,1,2471,3.0
2,1,48516,5.0
3,2,2571,3.5
4,2,109487,4.0


## 3. Get input user ratings

We will use a __Collaborative Filtering recommendation system__, specifically the user-based variant, which is also known as __User-User Filtering__.

This technique uses other users to recommend items to the input user. It attempts to find users whose preferences and opinions are similar to the input user's, and then recommends items that those users liked. There are several methods of finding similar users, and the one we will use here is based on the __Pearson Correlation Function__.

In [None]:
userInput = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,"Breakfast Club, The",5.0
1,Toy Story,3.5
2,Jumanji,2.0
3,Pulp Fiction,5.0
4,Akira,4.5


#### Add movieId to input user
We will extract the input movies' IDs from the movies dataframe and add them to the input dataframe.

In [None]:
#Filtering out the movies by title
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
inputId

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
293,296,Pulp Fiction,1994
1246,1274,Akira,1988
1885,1968,"Breakfast Club, The",1985


In [None]:
#Then merge it with the input so we can get the movieId. The merge key is the shared title column.
inputMovies = pd.merge(inputId, inputMovies, on='title')
#Dropping information we won't use from the input dataframe
inputMovies = inputMovies.drop('year', axis=1)
inputMovies = inputMovies.sort_values(by='movieId')
inputMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0


## 4. Learning the similarity weights (Pearson Correlation)

#### The users who have seen the same movies
With the movie IDs from our input, we can get the subset of users who have watched and rated the same movies.


In [None]:
#Keep only the ratings of movies that the input user has also rated
userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]
userSubset

Unnamed: 0,userId,movieId,rating
19,4,296,4.0
441,12,1968,3.0
479,13,2,2.0
531,13,1274,5.0
681,14,296,2.0
...,...,...,...
22883679,247738,296,4.0
22884132,247751,1,4.0
22884142,247751,296,4.0
22884164,247751,1274,5.0


We now group up the rows by user ID.

In [None]:
#Group by a single column name (not a list) so each group key is a plain userId
userSubsetGroup = userSubset.groupby('userId')

Let's look at one of the users, e.g. the one with userId=1130.

In [None]:
userSubsetGroup.get_group(1130)

Unnamed: 0,userId,movieId,rating
104167,1130,1,0.5
104168,1130,2,4.0
104214,1130,296,4.0
104363,1130,1274,4.5
104443,1130,1968,4.5


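Let's also sort these groups so that the users who share the most movies with the input come first. Since we will only compare the input against a subset of users rather than every single one, this ensures the subset contains the users with the largest overlap, whose correlations are based on the most data.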

In [None]:
#Sorting it so users with movies most in common with the input will have priority
userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)
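
The same ordering can be obtained by counting rows per user directly. The sketch below uses a toy `user_subset` with made-up ratings, and the `min_shared` threshold is an assumption added for illustration (the notebook itself does not filter by overlap); two is the smallest overlap for which a correlation can be computed at all.

```python
import pandas as pd

# Toy ratings restricted to the input user's movies (hypothetical values).
user_subset = pd.DataFrame({
    "userId":  [10, 10, 10, 20, 30, 30],
    "movieId": [1, 2, 296, 1, 2, 296],
    "rating":  [4.0, 3.0, 5.0, 2.5, 3.5, 4.5],
})

# Number of shared movies per user, largest first.
overlap = user_subset.groupby("userId").size().sort_values(ascending=False)

# Keep only users who share at least two movies with the input user.
min_shared = 2  # assumed threshold, not taken from the notebook
candidates = overlap[overlap >= min_shared].index.tolist()

print(overlap)
print(candidates)
```

Here user 10 shares three movies, user 30 shares two, and user 20 shares only one, so user 20 is dropped from the candidates.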

In [None]:
userSubsetGroup

[(75,       userId  movieId  rating
  7507      75        1     5.0
  7508      75        2     3.5
  7540      75      296     5.0
  7633      75     1274     4.5
  7673      75     1968     5.0), (106,       userId  movieId  rating
  9083     106        1     2.5
  9084     106        2     3.0
  9115     106      296     3.5
  9198     106     1274     3.0
  9238     106     1968     3.5), (686,        userId  movieId  rating
  61336     686        1     4.0
  61337     686        2     3.0
  61377     686      296     4.0
  61478     686     1274     4.0
  61569     686     1968     5.0), (815,        userId  movieId  rating
  73747     815        1     4.5
  73748     815        2     3.0
  73922     815      296     5.0
  74362     815     1274     3.0
  74678     815     1968     4.5), (1040,        userId  movieId  rating
  96689    1040        1     3.0
  96690    1040        2     1.5
  96733    1040      296     3.5
  96859    1040     1274     3.0
  96922    1040     1968  

In [None]:
#Have a look at first user
userSubsetGroup[0]

(75,       userId  movieId  rating
 7507      75        1     5.0
 7508      75        2     3.5
 7540      75      296     5.0
 7633      75     1274     4.5
 7673      75     1968     5.0)

#### Similarity of users to input user
Next, we are going to compare all users to our specified user and find the group of users that is most similar.  
We will measure how similar each user is to the input with the __Pearson Correlation__, which measures the strength of a linear association between two variables.

Why Pearson Correlation?

Pearson correlation is invariant to shifting and positive scaling, i.e. adding any constant to all elements or multiplying all elements by a positive constant (a negative constant flips the sign of the coefficient). For example, if you have two vectors X and Y, then pearson(X, Y) == pearson(X, 2 * Y + 3). This is an important property in recommendation systems: two users might rate the same items very differently in absolute terms, one consistently higher or on a narrower range than the other, yet still have similar tastes, and the Pearson correlation will still identify them as similar.

![alt text](https://wikimedia.org/api/rest_v1/media/math/render/svg/bd1ccc2979b0fd1c1aec96e386f686ae874f9ec0 "Pearson Correlation")

The __Pearson Correlation Coefficient__ varies from r = -1 to r = 1, where 1 indicates a perfect positive correlation between the two entities, -1 a perfect negative correlation, and 0 no linear correlation.

In our case, a 1 means that the two users have similar tastes while a -1 means the opposite.
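
To make these properties concrete, here is a minimal sketch that writes out the standard sample Pearson formula in plain Python and applies it to made-up ratings. The function `pearson` and the three rating lists are hypothetical and only illustrate the idea; the notebook itself uses `scipy.stats.pearsonr`.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: how the two sequences vary together around their means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator terms: how much each sequence varies on its own.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings of the same five movies by three users.
generous = [5.0, 3.5, 2.0, 5.0, 4.5]
strict = [(r - 1.0) / 2 for r in generous]  # same taste, lower and narrower scale
contrary = [6.0 - r for r in generous]      # reversed preferences

print(round(pearson(generous, strict), 3))    # shifted and rescaled ratings
print(round(pearson(generous, contrary), 3))  # opposite ratings
```

The first value is 1.0 even though the two users never give the same score, and the second is -1.0 because every preference is reversed. Note that this sketch divides by zero if either list is constant, a case in which the coefficient is undefined.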

In [None]:
#We will select a subset of users to iterate through, to not waste too much time going through every single user.
userSubsetGroup = userSubsetGroup[0:100]

In [None]:
for name, group in userSubsetGroup:
    print(name)
    print(group)
    break

75
      userId  movieId  rating
7507      75        1     5.0
7508      75        2     3.5
7540      75      296     5.0
7633      75     1274     4.5
7673      75     1968     5.0


In [None]:
#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}

#For every user group in our subset
for name, group in userSubsetGroup:
    #Sort the current user group by movieId so its ratings line up with the input's
    group = group.sort_values(by='movieId')
    #Get the input ratings for the movies that they both have in common, in the same order
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())].sort_values(by='movieId')
    #Store them in a temporary variable in a list format
    tempRatingList = temp_df['rating'].tolist()
    #Also put the current user group ratings in a list format
    tempGroupList = group['rating'].tolist()

    #pearsonr returns the coefficient and a p-value; we keep only the coefficient
    pearsonCorrelationDict[name] = round(pearsonr(tempRatingList, tempGroupList)[0], 3)
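
The loop pairs the two rating lists by position, so it relies on both being in the same movie order. An alternative sketch, shown here on toy data with hypothetical values, pairs the ratings explicitly with a merge on `movieId` and then uses pandas' own `Series.corr`, which computes the Pearson coefficient by default:

```python
import pandas as pd

# Hypothetical input ratings and one other user's ratings.
input_movies = pd.DataFrame({
    "movieId": [1, 2, 296, 1274],
    "rating":  [3.5, 2.0, 5.0, 4.5],
})
group = pd.DataFrame({
    "userId":  [75, 75, 75],
    "movieId": [296, 1, 2],  # deliberately unsorted
    "rating":  [5.0, 3.0, 4.0],
})

# An inner merge pairs the two ratings of each shared movie, whatever the row order.
paired = group.merge(input_movies, on="movieId", suffixes=("_user", "_input"))

# Pearson correlation between the paired rating columns.
similarity = paired["rating_user"].corr(paired["rating_input"])

print(paired[["movieId", "rating_user", "rating_input"]])
print(round(similarity, 3))
```

Only the three movies both users rated survive the merge, and the correlation for this toy pair is 0.5. This is a variation on the cell above, not a replacement for it; on real data the two approaches should agree whenever the lists are correctly aligned.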

In [None]:
pearsonCorrelationDict.items()

dict_items([(75, 0.827), (106, 0.586), (686, 0.832), (815, 0.577), (1040, 0.943), (1130, 0.289), (1502, 0.877), (1599, 0.439), (1625, 0.716), (1950, 0.179), (2065, 0.439), (2128, 0.586), (2432, 0.139), (2791, 0.877), (2839, 0.82), (2948, -0.117), (3025, 0.451), (3040, 0.895), (3186, 0.678), (3271, 0.27), (3429, 0.0), (3734, -0.15), (4099, 0.059), (4208, 0.294), (4282, -0.439), (4292, 0.656), (4415, -0.112), (4586, -0.902), (4725, -0.08), (4818, 0.489), (5104, 0.767), (5165, -0.439), (5547, 0.172), (6082, -0.047), (6207, 0.962), (6366, 0.658), (6482, 0.0), (6530, -0.352), (7235, 0.698), (7403, 0.117), (7641, 0.716), (7996, 0.627), (8008, -0.226), (8086, 0.693), (8245, 0.0), (8572, 0.86), (8675, 0.537), (9101, -0.086), (9358, 0.692), (9663, 0.194), (9994, 0.503), (10248, -0.248), (10315, 0.537), (10368, 0.469), (10607, 0.416), (10707, 0.962), (10863, 0.602), (11314, 0.82), (11399, 0.517), (11769, 0.938), (11827, 0.49), (12069, 0.0), (12120, 0.929), (12211, 0.86), (12325, 0.962), (12916, 

In [None]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index').reset_index(drop=False) #orient: key is index or column
pearsonDF.columns = ['userId', 'similarityIndex']
pearsonDF = pearsonDF.sort_values(by='similarityIndex', ascending=False)
pearsonDF

Unnamed: 0,userId,similarityIndex
34,6207,0.962
55,10707,0.962
64,12325,0.962
67,13053,0.961
4,1040,0.943
...,...,...
51,10248,-0.248
37,6530,-0.352
24,4282,-0.439
31,5165,-0.439


#### The top x similar users to input user
Now let's get the top 20 users who are most similar to the input.

In [None]:
topUsers = pearsonDF.iloc[0:20]
topUsers.head()

Unnamed: 0,userId,similarityIndex
34,6207,0.962
55,10707,0.962
64,12325,0.962
67,13053,0.961
4,1040,0.943


## 5. Find the recommendations

#### Rating of selected users to all movies
We will score each movie by taking the weighted average of its ratings, using the Pearson Correlation as the weight. To do this, we first need to get the movies rated by the users in __topUsers__ from the ratings dataframe, keeping each user's correlation alongside every rating in the _similarityIndex_ column.

In [None]:
topUsersRating = pd.merge(topUsers, ratings_df, left_on='userId', right_on='userId', how='inner')
topUsersRating

Unnamed: 0,userId,similarityIndex,movieId,rating
0,6207,0.962,1,3.5
1,6207,0.962,2,2.0
2,6207,0.962,5,1.5
3,6207,0.962,10,2.5
4,6207,0.962,16,3.5
...,...,...,...,...
16136,2839,0.820,6537,3.5
16137,2839,0.820,6755,4.0
16138,2839,0.820,6857,3.5
16139,2839,0.820,7263,3.5


Now all we need to do is multiply each movie rating by its weight (the similarity index), then sum up the weighted ratings per movie and divide by the sum of the weights.
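
As a small worked example of that arithmetic, the sketch below applies it to a toy table of two users and two movies. The user IDs are borrowed from the tables above, but the weights and ratings are made up for illustration.

```python
import pandas as pd

# Hypothetical ratings from two similar users, with their similarity weights.
top_users_rating = pd.DataFrame({
    "userId":          [6207, 6207, 1040, 1040],
    "similarityIndex": [0.9, 0.9, 0.6, 0.6],
    "movieId":         [10, 16, 10, 16],
    "rating":          [4.0, 2.0, 3.0, 5.0],
})

# Weight each rating by the similarity of the user who gave it.
top_users_rating["weightedRating"] = (
    top_users_rating["similarityIndex"] * top_users_rating["rating"]
)

# Sum the weights and the weighted ratings per movie, then divide.
sums = top_users_rating.groupby("movieId")[["similarityIndex", "weightedRating"]].sum()
score = sums["weightedRating"] / sums["similarityIndex"]

print(score.round(2))
```

Movie 10 scores (0.9 * 4.0 + 0.6 * 3.0) / (0.9 + 0.6) = 3.6 and movie 16 scores (0.9 * 2.0 + 0.6 * 5.0) / 1.5 = 3.2, so the more similar user pulls each score toward their own rating. One consequence worth keeping in mind: a movie rated by a single similar user gets exactly that user's rating as its score, which may explain why scores of exactly 5.0 can dominate the top of the list.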

We can do this by multiplying two columns, grouping the dataframe by movieId, summing within each group, and then dividing the two summed columns.

The result reflects what all the similar users collectively think of each candidate movie for the input user:

In [None]:
#Multiplies the similarity by the user's ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,userId,similarityIndex,movieId,rating,weightedRating
0,6207,0.962,1,3.5,3.367
1,6207,0.962,2,2.0,1.924
2,6207,0.962,5,1.5,1.443
3,6207,0.962,10,2.5,2.405
4,6207,0.962,16,3.5,3.367


In [None]:
#Sums the similarity weights and the weighted ratings for each movieId
tempTopUsersRating = topUsersRating.groupby('movieId')[['similarityIndex','weightedRating']].sum()
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

Unnamed: 0_level_0,sum_similarityIndex,sum_weightedRating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,17.885,67.1295
2,17.885,45.0975
3,5.435,14.616
4,0.929,2.787
5,5.499,10.3475


In [None]:
#Creates an empty dataframe
recommendation_df = pd.DataFrame()
#Now we take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df

Unnamed: 0_level_0,weighted average recommendation score,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3.753397,1
2,2.521526,2
3,2.689236,3
4,3.000000,4
5,1.881706,5
...,...,...
148452,3.000000,148452
148454,4.000000,148454
149354,3.000000,149354
150776,3.000000,150776


Now let's sort it and see the top 20 movies that the algorithm recommended!

In [None]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(20)

Unnamed: 0_level_0,weighted average recommendation score,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
4381,5.0,4381
4766,5.0,4766
26444,5.0,26444
73854,5.0,73854
73587,5.0,73587
1283,5.0,1283
3729,5.0,3729
6214,5.0,6214
71264,5.0,71264
6279,5.0,6279


In [None]:
movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(20)['movieId'].tolist())]

Unnamed: 0,movieId,title,year
119,121,"Boys of St. Vincent, The",1992
1255,1283,High Noon,1952
1276,1305,"Paris, Texas",1984
2820,2905,Sanjuro (Tsubaki Sanjûrô),1962
3639,3729,Shaft,1971
3640,3730,"Conversation, The",1974
3669,3759,Fun and Fancy Free,1947
3679,3769,Thunderbolt and Lightfoot,1974
3685,3775,Make Mine Music,1946
3686,3776,Melody Time,1948


### Advantages and Disadvantages of Collaborative Filtering

##### Advantages
* Takes other users' ratings into consideration
* Doesn't need to study or extract information from the recommended item
* Adapts to the user's interests which might change over time

##### Disadvantages
* Computing similarities against a large number of users can be slow
* There might be too few users with overlapping ratings to approximate from
* Privacy issues when trying to learn the user's preferences