In [1]:
import os
os.chdir("../")
%pwd

'c:\\Users\\abhis\\Desktop\\MLProjects\\Movie Recommender'

One of the most popular methods for making recommendations is **collaborative filtering**. In collaborative filtering, you use the interactions between users and items (here, the ratings users gave to movies) to assist in making new recommendations.

There are two main methods of performing collaborative filtering:

 1. **Neighborhood-Based Collaborative Filtering**, which is based on the idea that we can either correlate items that are similar to provide recommendations or we can correlate users to one another to provide recommendations.

 2. **Model-Based Collaborative Filtering**, which is based on the idea that we can use machine learning and other mathematical models to understand the relationships that exist amongst items and users in order to predict ratings and provide recommendations.

In this notebook, we will be working on performing **neighborhood-based collaborative filtering**. There are two main methods for performing neighborhood-based collaborative filtering:

 1. **User-based collaborative filtering**: In this type of recommendation, users related to the user you would like to make recommendations for are used to create a recommendation.

 2. **Item-based collaborative filtering**: In this type of recommendation, first you need to find the items that are most related to each other item (based on similar ratings). Then you can use the ratings of an individual on those similar items to understand if a user will like the new item.

In this notebook we will be implementing **user-based collaborative filtering**. However, it is easy to extend this approach to make recommendations using **item-based collaborative filtering**. First, let's read in our data and necessary libraries.

In [2]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype


%matplotlib inline

In [3]:
# Read the data

ratings_df = pd.read_csv('artifacts/data_preparation/final_data/ratings.csv')
movies_df = pd.read_csv('artifacts/data_preparation/final_data/movies.csv')

#### Measures of Similarity
When using **neighborhood** based collaborative filtering, it is important to understand how to measure the similarity of users or items to one another.

There are a number of ways in which we might measure the similarity between two vectors (which might be two users or two items). In this notebook, we will look specifically at two measures used to compare vectors:

 - **Pearson's correlation coefficient**
Pearson's correlation coefficient is a measure of the strength and direction of a linear relationship. Its value lies between -1 and 1, where -1 indicates a perfect negative linear relationship and 1 indicates a perfect positive linear relationship.

If we have two vectors x and y, we can define the correlation between the vectors as:

$$
\text{CORR}(x,y) = \frac{\text{{COV}}(x,y)}{\text{STDEV}(x)\text{STDEV}(y)}
$$

where,

$$
\text{{STDEV}}(x) = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

and,
$$
\text{{COV}}(x,y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$

where n is the length of the vector, which must be the same for both x and y, and $\bar{x}$ is the mean of the observations in the vector.

We can use the correlation coefficient to indicate how alike two vectors are to one another, where the closer to 1 the coefficient, the more alike the vectors are to one another. There are some potential downsides to using this metric as a measure of similarity. You will see some of these throughout this workbook.
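
To make the formulas above concrete, here is a minimal sketch that computes the correlation directly from the COV and STDEV definitions and compares it with NumPy's built-in `np.corrcoef`. The function name `pearson_corr` and the two rating vectors are made up for illustration; they are not part of the dataset used in this notebook.

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson correlation computed directly from the COV / STDEV definitions."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # sample covariance and standard deviations, all with the 1/(n-1) factor
    cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    stdev_x = np.sqrt(np.sum((x - x.mean()) ** 2) / (n - 1))
    stdev_y = np.sqrt(np.sum((y - y.mean()) ** 2) / (n - 1))
    return cov / (stdev_x * stdev_y)

# hypothetical ratings of two users on the same five movies
user_a = [5.0, 4.0, 3.0, 2.0, 1.0]
user_b = [4.5, 4.0, 3.5, 2.0, 1.5]

corr_manual = pearson_corr(user_a, user_b)
corr_numpy = np.corrcoef(user_a, user_b)[0, 1]
print(corr_manual, corr_numpy)
```

Both values should agree up to floating point error, since the 1/(n-1) factors cancel between the numerator and the denominator.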

 - **Euclidean distance**
Euclidean distance is a measure of the straight-line distance from one vector to another. Because this is a measure of distance, larger values are an indication that two vectors are different from one another (the opposite of Pearson's correlation coefficient, where larger values indicate more similar vectors).

Specifically, the Euclidean distance between two vectors x and y is measured as:

$$
\text{{EUCL}}(x,y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2}
$$

Unlike the correlation coefficient, the Euclidean distance is not divided by any measure of spread, so no scaling is performed. Therefore, you need to make sure all of your data are on the same scale when using this metric.

**Note**: Because measuring similarity is often based on looking at the distance between vectors, it is important in these cases to scale your data or to have all data be in the same scale. In this case, we will not need to scale data because they are all on a 5 point scale, but it is always something to keep in mind!
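
As a small illustration of how the two measures can disagree, the sketch below uses two made-up users who rank movies identically but where one rates everything 2 points higher. The helper `euclidean_dist` and the vectors are hypothetical and only serve the example.

```python
import numpy as np

def euclidean_dist(x, y):
    """Straight-line distance between two equally long vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

user_a = np.array([1.0, 2.0, 3.0])
user_b = user_a + 2.0  # same ordering of movies, but every rating is 2 points higher

dist = euclidean_dist(user_a, user_b)
corr = np.corrcoef(user_a, user_b)[0, 1]
print(dist, corr)
```

The correlation treats these two users as perfectly alike, while the Euclidean distance is clearly non-zero, because the distance is sensitive to the constant offset between the two rating vectors.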

--------

#### User-Item Matrix

In order to calculate the similarities, it is common to put values in a matrix. In this matrix, users are identified by each row, and items are represented by columns.

In the above matrix, you can see that **User 1** and **User 2** both used **Item 1**, and User 2, User 3, and User 4 all used Item 2. However, there are also a large number of missing values in the matrix for users who haven't used a particular item. A matrix with many missing values (like the one above) is considered sparse.
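
For readers who want to see such a matrix in code, here is a minimal sketch on a tiny, made-up ratings table (the values and the name `toy_ratings` are assumptions for illustration only). A dense pivot like this is fine at toy scale, though it is not necessarily practical for millions of ratings.

```python
import pandas as pd

# hypothetical long-format ratings: one row per (user, item) pair
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 4],
    "title": ["Item 1", "Item 3", "Item 1", "Item 2", "Item 2", "Item 2"],
    "rating": [4.0, 5.0, 3.5, 2.0, 4.5, 1.0],
})

# users become rows, items become columns, unrated cells become NaN
toy_matrix = toy_ratings.pivot(index="userId", columns="title", values="rating")

# fraction of cells that are missing
sparsity = toy_matrix.isna().to_numpy().mean()
print(toy_matrix)
print(sparsity)
```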

Our first goal for this notebook is to create the above matrix with the ratings dataset. However, instead of 1 values in each cell, you should have the actual rating.

The users will indicate the rows, and the movies will exist across the columns. To create the user-item matrix, we only need the first three columns of the ratings dataframe, which you can see by running the cell below.


To make the computation faster, we will make the following assumptions:
 1. If a user has rated 20 or fewer movies, they are considered too new.
 2. If a movie has been rated fewer than 50 times, it should not be recommended.

In [4]:
more_than_20 = ratings_df['userId'].value_counts() > 20
# getting the index of these users
ind = more_than_20[more_than_20].index
ratings_df = ratings_df[ratings_df['userId'].isin(ind)]

In [5]:
ratings_df.shape

(26442998, 3)

In [6]:
# merge movies with ratings

rating_with_movies = ratings_df.merge(movies_df, on = "movieId")

In [7]:
# count how many ratings each movie received

num_rating = rating_with_movies.groupby('title')['rating'].count().reset_index()
num_rating.rename(columns={"rating":"num_of_rating"},inplace=True)

In [8]:
final_rating = rating_with_movies.merge(num_rating, on = 'title')

In [9]:
# keep only movies with at least 50 ratings

final_rating = final_rating[final_rating['num_of_rating'] >= 50]

In [10]:
final_rating.head()

Unnamed: 0,userId,movieId,rating,title,imdbId,tmdbId,genres,overview,popularity,poster_path,vote_average,vote_count,director,keywords,num_of_rating
0,4,1,4.0,Toy Story (1995),114709,862,"['Animation', 'Adventure', 'Family', 'Comedy']","Led by Woody, Andy's toys live happily in his ...",101.402,/uXDfjJbdP4ijW5hWSBrPrlKpxab.jpg,8.0,16771,John Lasseter,"['martial arts', 'jealousy', 'friendship', 'bu...",61743
1,10,1,5.0,Toy Story (1995),114709,862,"['Animation', 'Adventure', 'Family', 'Comedy']","Led by Woody, Andy's toys live happily in his ...",101.402,/uXDfjJbdP4ijW5hWSBrPrlKpxab.jpg,8.0,16771,John Lasseter,"['martial arts', 'jealousy', 'friendship', 'bu...",61743
2,14,1,4.5,Toy Story (1995),114709,862,"['Animation', 'Adventure', 'Family', 'Comedy']","Led by Woody, Andy's toys live happily in his ...",101.402,/uXDfjJbdP4ijW5hWSBrPrlKpxab.jpg,8.0,16771,John Lasseter,"['martial arts', 'jealousy', 'friendship', 'bu...",61743
3,15,1,4.0,Toy Story (1995),114709,862,"['Animation', 'Adventure', 'Family', 'Comedy']","Led by Woody, Andy's toys live happily in his ...",101.402,/uXDfjJbdP4ijW5hWSBrPrlKpxab.jpg,8.0,16771,John Lasseter,"['martial arts', 'jealousy', 'friendship', 'bu...",61743
4,22,1,4.0,Toy Story (1995),114709,862,"['Animation', 'Adventure', 'Family', 'Comedy']","Led by Woody, Andy's toys live happily in his ...",101.402,/uXDfjJbdP4ijW5hWSBrPrlKpxab.jpg,8.0,16771,John Lasseter,"['martial arts', 'jealousy', 'friendship', 'bu...",61743


In [11]:
final_rating.drop_duplicates(['userId','title'], inplace=True)

In [12]:
final_rating[['userId','title']].nunique()

userId    169552
title      13088
dtype: int64

#### Creating the User-Item Matrix
In order to create the user-item matrix (like the one above), I personally started by using a pivot table. However, I quickly ran into a memory error (a common theme throughout this notebook). I will help you navigate around many of the errors I had, and achieve useful collaborative filtering results!

---
1. Create a matrix where the users are the rows, the movies are the columns, and the ratings exist in each cell, or a NaN exists in cells where a user hasn't rated a particular movie. If you get a memory error (like I did), [this link](https://stackoverflow.com/questions/75783694/trying-to-pivot-a-large-dataframe-but-get-indexerror-index-875914235-is-out-o) and [this link](https://stackoverflow.com/questions/31661604/efficiently-create-sparse-pivot-tables-in-pandas) might help you!

In [13]:
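# Build the user-item matrix as a sparse matrix instead of a dense pivot table
rcLabel, vLabel = ('userId', 'title'), 'rating'
# Ordered categoricals map every user and every title to a row/column position
rcCat = [CategoricalDtype(sorted(final_rating[col].unique()), ordered=True) for col in rcLabel]
rc = [final_rating[column].astype(aType).cat.codes for column, aType in zip(rcLabel, rcCat)]
# Cells without a rating are implicit zeros in the sparse matrix, not NaN
mat = csr_matrix((final_rating[vLabel], rc), shape=tuple(cat.categories.size for cat in rcCat))
movie_pivot = ( pd.DataFrame.sparse.from_spmatrix(
    mat, index=rcCat[0].categories, columns=rcCat[1].categories) )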

In [14]:
movie_pivot.index

Index([     4,      5,      6,      8,     10,     14,     15,     16,     18,
           19,
       ...
       283210, 283213, 283214, 283215, 283218, 283219, 283221, 283222, 283224,
       283228],
      dtype='int64', length=169552)
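
To see what the categorical-codes trick does, here is a minimal sketch on a tiny, made-up table (the frame `toy` and its values are assumptions for illustration). Each distinct user and title is mapped to an integer position, and those positions become the row and column coordinates of the sparse matrix.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

# hypothetical long-format ratings
toy = pd.DataFrame({
    "userId": [10, 10, 42, 77],
    "title": ["B", "A", "A", "C"],
    "rating": [4.0, 3.0, 5.0, 2.5],
})

# sorted, ordered categories give a stable mapping: label -> integer code
user_cat = CategoricalDtype(sorted(toy["userId"].unique()), ordered=True)
title_cat = CategoricalDtype(sorted(toy["title"].unique()), ordered=True)
row_codes = toy["userId"].astype(user_cat).cat.codes.to_numpy()
col_codes = toy["title"].astype(title_cat).cat.codes.to_numpy()

# (data, (row, col)) constructor: only the observed ratings are stored
toy_mat = csr_matrix(
    (toy["rating"].to_numpy(), (row_codes, col_codes)),
    shape=(user_cat.categories.size, title_cat.categories.size),
)
print(toy_mat.toarray())
```

Note that the cells without a rating come out as zeros when the matrix is densified, rather than as NaN.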

2. Now that you have a matrix of users by movies, use this matrix to create a dictionary where the key is each user and the value is an array of the movies each user has rated.

In [16]:
# Create a dictionary with users and corresponding movies seen

def movies_watched(user_id):
    '''
    INPUT:
    user_id - the user_id of an individual as int
    OUTPUT:
    movies - an array of movies the user has watched
    '''
    # movie_pivot is sparse with fill value 0, so unrated movies show up as 0, not NaN
    user_ratings = movie_pivot.loc[user_id].to_numpy()
    movies = movie_pivot.columns[user_ratings != 0].values

    return movies


def create_user_movie_dict():
    '''
    INPUT: None
    OUTPUT: movies_seen - a dictionary where each key is a user_id and the value is an array of movie titles
    
    Creates the movies_seen dictionary
    '''
    users = movie_pivot.index
    movies_seen = dict()

    for user1 in users:
        
        # assign list of movies to each user key
        movies_seen[user1] = movies_watched(user1)
    
    return movies_seen

# movies_seen = create_user_movie_dict()
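
Looping over every user of a wide sparse frame can take a long time. One possible alternative, sketched here under the assumption that the long-format ratings table is still available, is to group that table by user. The frame `toy` and the name `movies_seen_toy` are hypothetical stand-ins, not objects defined elsewhere in this notebook.

```python
import pandas as pd

# hypothetical long-format ratings, one row per (user, movie) pair
toy = pd.DataFrame({
    "userId": [10, 10, 42, 77],
    "title": ["B", "A", "A", "C"],
    "rating": [4.0, 3.0, 5.0, 2.5],
})

# one pass over the groups: user_id -> array of titles that user has rated
movies_seen_toy = {
    user_id: titles.to_numpy()
    for user_id, titles in toy.groupby("userId")["title"]
}
print(movies_seen_toy)
```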

In [127]:
from sklearn.neighbors import NearestNeighbors

model = NearestNeighbors(algorithm = 'brute')

In [128]:
model.fit(mat.T)

In [112]:
# movie_pivot = ( pd.DataFrame.sparse.from_spmatrix(
#     mat.T, index=rcCat[1].categories, columns=rcCat[0].categories) )

In [129]:
movie_pivot = ( pd.DataFrame.sparse.from_spmatrix(
    mat, index=rcCat[0].categories, columns=rcCat[1].categories) )

In [137]:
distance, suggestion = model.kneighbors(movie_pivot.iloc[:,8].to_numpy().reshape(1,-1),n_neighbors=11)

In [138]:
distance,suggestion

(array([[ 0.        , 70.17300051, 70.23709846, 70.25667228, 70.27268317,
         70.37044834, 70.42904231, 70.50886469, 70.54076835, 70.54076835,
         70.55317144]]),
 array([[    8,  1067,  4316, 12151,    13, 12955,  3162, 12268, 10405,
          7160,  2045]], dtype=int64))

In [139]:
# look up the movie titles for the suggested indices
for i in range(len(suggestion)):
    print(movie_pivot.columns[suggestion[i]])

Index([''night Mother (1986)',
       'Barbie & Her Sisters in the Great Puppy Adventure (2015)',
       'Fresh Horses (1988)', 'Up the Academy (1980)',
       '...All the Marbles (1981)',
       'Zombie Lake (Lac des morts vivants, Le) (Zombies Lake) (Lake of the Living Dead, The) (1981)',
       'Devil Doll (1964)', 'Violets Are Blue... (1986)', 'So Fine (1981)',
       'Making Love (1982)', 'Canyons, The (2013)'],
      dtype='object')
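
The model above is fitted with the default settings of `NearestNeighbors`, which means Minkowski distance with `p=2`, i.e. Euclidean distance. With raw rating vectors, movies that have few ratings can end up close to one another simply because most of their entries are zero, which may be part of what the output above reflects. As a sketch only, and not the approach used in this notebook, the toy example below swaps in cosine distance, a common alternative for sparse rating vectors. The matrix `toy_ratings` and the titles are made up.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# hypothetical user-by-movie ratings (rows: users, columns: movies), 0 = unrated
toy_ratings = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
]))
toy_titles = np.array(["Movie A", "Movie B", "Movie C", "Movie D"])

# transpose so that each sample is one movie's vector of user ratings
toy_items = toy_ratings.T.tocsr()

toy_model = NearestNeighbors(algorithm="brute", metric="cosine")
toy_model.fit(toy_items)

# query with the first movie; its nearest neighbour is the movie itself
toy_dist, toy_idx = toy_model.kneighbors(toy_items[0], n_neighbors=2)
print(toy_titles[toy_idx[0]], toy_dist[0])
```

Cosine distance compares the direction of two rating vectors rather than their length, so a movie with many ratings and a movie with few ratings can still be judged similar if the same users liked both.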


In [75]:
movie_pivot.shape

(169552, 12978)

In [80]:
movie_pivot.iloc[:,4].to_numpy().shape

(169552,)

In [78]:
movie_pivot.iloc[237,:]

"Great Performances" Cats (1998)             0.0
$9.99 (2008)                                 0.0
'71 (2014)                                   0.0
'Hellboy': The Seeds of Creation (2004)      0.0
'Round Midnight (1986)                       0.0
                                            ... 
xXx (2002)                                   0.0
xXx: Return of Xander Cage (2017)            0.0
xXx: State of the Union (2005)               0.0
¡Three Amigos! (1986)                        0.0
À nous la liberté (Freedom for Us) (1931)    0.0
Name: 411, Length: 12978, dtype: Sparse[float64, 0]

