### User-Based Collaborative Filtering (Recommender Systems)

#### Recommend movies to users

This dataset contains 5-star ratings from MovieLens, a movie recommendation service: 100,836 ratings across 9,742 movies, created by 610 users between March 29, 1996 and September 24, 2018. The dataset was generated on September 26, 2018. Users were selected at random for inclusion, and every selected user had rated at least 20 movies. No demographic information is included; each user is represented only by an id. The data are contained in the files 'movies.csv' and 'ratings.csv'.

In collaborative filtering, the only information we use is the ratings.

In [1]:
# Import useful libraries used for data management

import pandas as pd
import numpy as np

In [2]:
# load the dataset 'movies.csv'

movies = pd.read_csv('movies.csv')

In [3]:
# display the info on attributes

movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [4]:
# display the first 5 records

movies.head()

   movieId  title                               genres
0        1  Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
1        2  Jumanji (1995)                      Adventure|Children|Fantasy
2        3  Grumpier Old Men (1995)             Comedy|Romance
3        4  Waiting to Exhale (1995)            Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)  Comedy


In [5]:
# load the dataset 'ratings.csv'
ratings = pd.read_csv('ratings.csv')

ratings.info()

# display the first 5 records
ratings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


In [6]:
# map each movieId to its movie title by joining the two tables on 'movieId'
user_movie_rating=pd.merge(ratings,movies,on='movieId')

In [7]:
user_movie_rating.head()

   userId  movieId  rating   timestamp  title             genres
0       1        1     4.0   964982703  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1       5        1     4.0   847434962  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
2       7        1     4.5  1106635946  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
3      15        1     2.5  1510577970  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
4      17        1     4.5  1305696483  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy


In [8]:
# create a full table, rows: userIds, and columns: movie titles

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
M = user_movie_rating.pivot_table(index=['userId'],columns = ['title'],values = 'rating')

In [9]:
# get to know how many users and how many movies
M.shape

(610, 9719)

**Note this 9,719 is smaller than 9,742. Why?**
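
One way to investigate the gap is to check for two things: movies listed in 'movies.csv' that nobody rated, and distinct `movieId` values that share the same title (which `pivot_table` folds into a single column, averaging the ratings by default). The sketch below uses a tiny made-up dataset rather than the MovieLens files, so the counts are illustrative only; the names `movies_toy`, `ratings_toy`, and `M_toy` are ours.

```python
import pandas as pd

# Toy data (hypothetical, not from MovieLens): 4 movies, two of which share a title
movies_toy = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['Alpha (2000)', 'Beta (2001)', 'Beta (2001)', 'Gamma (2002)'],
})
# Movie 4 has no ratings at all
ratings_toy = pd.DataFrame({
    'userId': [1, 1, 2, 2],
    'movieId': [1, 2, 2, 3],
    'rating': [4.0, 3.0, 5.0, 1.0],
})

merged_toy = pd.merge(ratings_toy, movies_toy, on='movieId')
M_toy = merged_toy.pivot_table(index='userId', columns='title', values='rating')

n_movies = movies_toy['movieId'].nunique()   # movies listed
n_rated = ratings_toy['movieId'].nunique()   # movies with at least one rating
n_titles = merged_toy['title'].nunique()     # distinct titles among rated movies

print(n_movies, n_rated, n_titles, M_toy.shape[1])
```

Running the same three counts on `movies`, `ratings`, and `user_movie_rating` should show how much of the difference each cause accounts for.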

In [None]:
M

**Python's standard library has no module for collaborative filtering, so we do the calculation ourselves. We first define a function to calculate the Pearson correlation.**

In [10]:
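# define a function to calculate the Pearson correlation (similarity) between u1 and u2
# note: each mean and squared-deviation sum runs over all movies that user rated (pandas skips NaN),
# while the cross term only covers movies rated by both users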

def pearson(u1, u2):
    u1_dif = u1 - u1.mean()
    u2_dif = u2 - u2.mean()
    return np.sum(u1_dif*u2_dif)/np.sqrt(np.sum(u1_dif*u1_dif)*np.sum(u2_dif*u2_dif))

In [11]:
# show the pearson correlation between the first user and the second user in the data
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
# M.iloc[0,:] gives the rating info by first user, and M.iloc[1,:] gives the rating info by second user

pearson(M.iloc[0,:], M.iloc[1,:])

0.0012645157377626514
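
The `pearson` function above centers each user on the mean of all movies that user rated, and only the cross term is limited to movies both users rated. A common alternative restricts every term to the co-rated movies. The following is a minimal sketch of that variant on made-up ratings; `pearson_corated` is a hypothetical helper name, and pandas' `Series.corr` (which also drops incomplete pairs) is used as a cross-check.

```python
import numpy as np
import pandas as pd

def pearson_corated(u1, u2):
    """Pearson correlation computed only over items rated by both users."""
    both = u1.notna() & u2.notna()
    if both.sum() < 2:
        return np.nan  # not enough overlap to define a correlation
    a = u1[both] - u1[both].mean()
    b = u2[both] - u2[both].mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    if denom == 0:
        return np.nan  # one user gave identical ratings on the overlap
    return (a * b).sum() / denom

# Toy ratings (made up): NaN means "not rated"
u1 = pd.Series([5.0, 3.0, np.nan, 1.0, 4.0])
u2 = pd.Series([4.0, np.nan, 2.0, 1.0, 5.0])

sim = pearson_corated(u1, u2)
print(sim, u1.corr(u2))
```

The two printed values should agree. On the real matrix this variant will generally give different scores from `pearson`, so the neighbors and recommendations may change as well.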

**Next, we define another function to find the k-nearest neighbors of a user, based on the similarity given by pearson correlation.**

In [12]:
# choose the k-nearest neighbors
def k_nearest_neighbors(user, k, M):
    all_others = []
    neighbors = []
    
    # M.shape[0] gives the total number of users/rows
    for i in range(M.shape[0]):
        # a user cannot be the neighbor of himself/herself, skip if so
        if i == user:
            continue
        sim = pearson(M.iloc[user,:], M.iloc[i,:])
        # skip if the similarity value is NaN
        if np.isnan(sim):
            continue
        else:
            # append the id and similarity score
            all_others.append([i,sim])
    
    # reverse sort all the records based on the similarity score: highest value being the first
    all_others.sort(key=lambda tup: tup[1], reverse = True) 
    
    # select the top k neighbors 
    for i in range(k):
        if i >= len(all_others):
            break
        neighbors.append(all_others[i])
    return neighbors
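
The loop above calls `pearson` once per candidate user. As an alternative sketch, all pairwise similarities can be computed in one call with `DataFrame.corr` on the transposed matrix. Note that `corr` uses only co-rated movies for every term, so its scores are not the same as those of `pearson` above; the helper name `top_k_neighbors` and the `min_periods=2` default are our assumptions.

```python
import numpy as np
import pandas as pd

def top_k_neighbors(M, user_pos, k, min_periods=2):
    """Return [(row position, similarity), ...] for the k most similar users."""
    # Rows of M are users, so correlate the columns of the transpose
    sims = M.T.corr(method='pearson', min_periods=min_periods)
    row = sims.iloc[user_pos].to_numpy().copy()
    row[user_pos] = np.nan  # a user is not their own neighbor
    candidates = [(i, s) for i, s in enumerate(row) if not np.isnan(s)]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]

# Toy user-by-movie matrix (made-up ratings)
M_toy = pd.DataFrame(
    [[5.0, 4.0, 1.0, np.nan],
     [4.0, 5.0, 2.0, 1.0],
     [1.0, 2.0, 5.0, 4.0],
     [np.nan, 1.0, 4.0, 5.0]],
    index=[1, 2, 3, 4], columns=['A', 'B', 'C', 'D'])

neighbors = top_k_neighbors(M_toy, user_pos=0, k=2)
print(neighbors)
```

The returned list has the same `[position, similarity]` layout as `k_nearest_neighbors`, so it could be passed to the prediction step in its place.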

**Now, we need to define a function to perform the prediction based on the ratings given by the k-nearest neighbors.**

In [13]:
# perform prediction/recommendation

def predict(user, neighbors, M):
    predictions = []

    # for all the movies in the data, do the prediction
    for i in range(M.shape[1]):
        # a non-missing rating means the user has already rated the movie: skip it (no need to predict)
        if not np.isnan(M.iloc[user,i]):
            continue
        numerator = 0.0
        denominator = 0.0
        
        # do the weighted average of the ratings given by the k-nearest neighbors, adjusting their rating bias.
        for neighbor in neighbors:
            neighbor_id = neighbor[0]
            neighbor_sim = neighbor[1]
            if np.isnan(M.iloc[neighbor_id,i]):
                continue
            numerator += neighbor_sim * (M.iloc[neighbor_id,i]-M.iloc[neighbor_id,:].mean())
            denominator += np.abs(neighbor_sim)
        if denominator == 0.0:
            continue
        pred_rating = numerator/denominator + M.iloc[user,:].mean() 
        predictions.append([i,pred_rating]) 
    return predictions
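
To make the arithmetic in `predict` concrete: the predicted rating is the target user's mean rating plus a similarity-weighted average of how far each neighbor's rating of the movie sits from that neighbor's own mean. Here is a small worked example with made-up numbers for a single movie and three neighbors.

```python
user_mean = 3.5  # target user's average rating (made-up)

# (similarity to target user, neighbor's rating of the movie, neighbor's mean rating)
neighbors = [
    (0.9, 5.0, 4.0),   # rated the movie 1.0 above their own average
    (0.6, 3.0, 3.5),   # rated it 0.5 below their own average
    (0.3, 4.0, 2.0),   # rated it 2.0 above their own average
]

# Same numerator/denominator accumulation as in predict()
numerator = sum(sim * (rating - mean) for sim, rating, mean in neighbors)
denominator = sum(abs(sim) for sim, _, _ in neighbors)
pred_rating = user_mean + numerator / denominator

print(round(pred_rating, 3))
```

Because nothing clips the result, a prediction computed this way can land above or below the rating scale.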


**Next, we define a function to print the top-n recommendations to a user (those with the highest predicted ratings).**

In [14]:
# print the top n recommendations

def top_n_recs(user, predictions, M, top_n):
    
    # sort the movies by predicted ratings, from the highest one to the lowest one
    predictions.sort(key=lambda tup: tup[1], reverse = True) 
    
    recommendations = []
    for i in range(top_n):
        if i >= len(predictions):
            break
        recommendations.append(predictions[i])
    print("----------------------------------------------------------------")
    print("The top %d movies recommended to user %d are as follows:" % (top_n, user+1))
    j = 0
    for rec in recommendations:
        if j >= top_n:
            break
        print("Movie: %s, Predicted Rating:%.3f" % (M.columns[rec[0]], rec[1]))
        j = j+1

**Finally, we define a function to combine the steps stated above. This gives the user-based collaborative filtering algorithm.**

In [15]:
# define the user-based collaborative filtering algorithm
def user_based_cf(user, M, k,top_n):
    # first k-nearest neighbors
    k_neighbors = k_nearest_neighbors(user, k, M)
    # perform predictions for each movie not watched by the user
    predictions = predict(user,k_neighbors, M)
    # recommend the top n with highest predicted ratings
    top_n_recs(user, predictions, M, top_n)

**For demonstration purposes, we also define a function to print the top n movies rated by the user. Here, n can be any number.**

In [16]:
def print_rating_records(user, M, highest_n):
    records = []
    for i in range(M.shape[1]):
        if np.isnan(M.iloc[user,i]):
            continue
        records.append([M.columns[i],M.iloc[user,i]]) 
    records.sort(key=lambda tup: tup[1], reverse = True) 
    
    print("----------------------------------------------------------------")
    print("The top %d movies rated by user %d are as follows:" % (highest_n, user+1))
    j = 0
    for record in records:
        if j >= highest_n:
            break
        print("Movie: %s, Rating:%.3f" % (record[0], record[1]))
        j = j+1

In [17]:
# ignore any warnings in calculation
import warnings
warnings.filterwarnings("ignore")

**Now, we are ready to make recommendations to any user. Let's do it for user 11.**

In [18]:
# print the top 20 movies rated by user 11 (note: user 1 has the index 0)
print_rating_records(10,M,20)
# give 10 recommendations to user 11, with ratings predicted by the 10-nearest neighbors of user 11.
user_based_cf(10, M, 10, 10)

----------------------------------------------------------------
The top 20 movies rated by user 11 are as follows:
Movie: Amistad (1997), Rating:5.000
Movie: Apollo 13 (1995), Rating:5.000
Movie: As Good as It Gets (1997), Rating:5.000
Movie: Braveheart (1995), Rating:5.000
Movie: Clear and Present Danger (1994), Rating:5.000
Movie: Contact (1997), Rating:5.000
Movie: Forrest Gump (1994), Rating:5.000
Movie: Fugitive, The (1993), Rating:5.000
Movie: Heat (1995), Rating:5.000
Movie: Last of the Mohicans, The (1992), Rating:5.000
Movie: Saving Private Ryan (1998), Rating:5.000
Movie: Searching for Bobby Fischer (1993), Rating:5.000
Movie: Silence of the Lambs, The (1991), Rating:5.000
Movie: Titanic (1997), Rating:5.000
Movie: Top Gun (1986), Rating:5.000
Movie: Air Force One (1997), Rating:4.000
Movie: Armageddon (1998), Rating:4.000
Movie: Breakdown (1997), Rating:4.000
Movie: Con Air (1997), Rating:4.000
Movie: Conspiracy Theory (1997), Rating:4.000
----------------------------------

**TODO: Please feel free to change the user id and k (the number of nearest neighbors used) above to see what results you get.**

### Use the Surprise package to do recommendation
#### Surprise is an easy-to-use Python scikit for recommender systems.
https://surprise.readthedocs.io/en/stable/index.html
You will have to pip install 'scikit-surprise' in the Anaconda base terminal; the Visual Studio C++ build tools are also needed:
https://code.visualstudio.com/docs/cpp/config-msvc

In [20]:
# import libs from surprise
from surprise import KNNBasic
from surprise import Dataset
from surprise.model_selection import cross_validate

In [21]:
# load Surprise's built-in MovieLens 100K dataset ('ml-100k'); note this is not the ratings.csv used above
data = Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n]Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to C:\Users\geral/.surprise_data/ml-100k


In [22]:
# define the similarity options: Pearson baseline similarity with no shrinkage
sim_options = {'name': 'pearson_baseline',
               'shrinkage': 0  # no shrinkage
               }

In [23]:
# create a CF using KNNBasic
CFalgo = KNNBasic(k=20,sim_options=sim_options)

In [24]:
# cross validate the KNNBasic CF
cross_validate(CFalgo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0272  1.0132  1.0218  1.0079  1.0121  1.0164  0.0070  
MAE (testset)     0.8118  0.8022  0.8065  0.7969  0.8015  0.8038  0.0050  
Fit time          3.05    2.93    2.77    2.87    2.72    2.87    0.12    
Test time         3.98    3.96    3.78    4.07    3.65  

{'test_rmse': array([1.02715128, 1.01318862, 1.02175359, 1.00787155, 1.01207923]),
 'test_mae': array([0.81181125, 0.80224934, 0.80652996, 0.79692349, 0.80145297]),
 'fit_time': (3.0542147159576416,
  2.93269944190979,
  2.7723522186279297,
  2.8694632053375244,
  2.719768762588501),
 'test_time': (3.9833762645721436,
  3.955451726913452,
  3.7760236263275146,
  4.073114395141602,
  3.6541953086853027)}
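
The dictionary returned by `cross_validate` holds one score per fold, so the Mean and Std columns of the printed table can be recomputed from it. The sketch below copies the per-fold scores shown above into a plain dict (the names `results` and `summary` are ours) instead of re-running Surprise.

```python
import numpy as np

# Per-fold scores copied from the cross_validate output above
results = {
    'test_rmse': np.array([1.02715128, 1.01318862, 1.02175359, 1.00787155, 1.01207923]),
    'test_mae': np.array([0.81181125, 0.80224934, 0.80652996, 0.79692349, 0.80145297]),
}

# Mean and (population) standard deviation across the 5 folds
summary = {name: (scores.mean(), scores.std()) for name, scores in results.items()}
for name, (mean, std) in summary.items():
    print("%s: mean=%.4f, std=%.4f" % (name, mean, std))
```

In a live session the same two lines can be applied directly to the return value of `cross_validate`, which is handy when comparing several values of `k` or different similarity options.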

In [25]:
# create a training set
trainset = data.build_full_trainset()

In [26]:
# fit the CFalgo with training set
CFalgo.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x21ff8b6e908>

In [27]:
userid = str(195)  # raw user id (as in the ratings file). They are **strings**!
movieid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = CFalgo.predict(userid, movieid, verbose=True)

user: 195        item: 302        r_ui = None   est = 4.17   {'actual_k': 20, 'was_impossible': False}
