% DES431 Project 2: Recommendation System

# Background

**MovieLens** is a movie recommendation system operated by GroupLens, a research group at the University of Minnesota. 

# Task

1. Propose and implement your own recommendation system based on the MovieLens dataset. Use `ratings_train.csv` as the training set and `ratings_valid.csv` as the validation set. Your system may also use information from `movies.csv` and `tags.csv` to make recommendations. An undisclosed test set will be used to evaluate your system.
   - The data file structure is available at https://files.grouplens.org/datasets/movielens/ml-latest-small-README.html. 
   - The main goal of the recommendation system is to minimize the root-mean-square error (RMSE) of the predicted ratings.
   - The implementation should include a function named `predict_rating`. This function accepts a DataFrame with two columns, `userId` and `movieId`, and adds a column named `rating` that stores the predicted rating of each `movieId` by the corresponding `userId`.
   - Your program must still return a root-mean-square error value when the validation set is replaced with another file. Otherwise, your score will be reduced by 50%.
   - You must modify the given program to make better recommendations. Submitting the original program without modification is considered plagiarism.
2. Prepare slides for a 7-minute presentation that explains your proposed recommendation technique and algorithm and shows your RMSE results on the validation set.
3. Submit all required documents by April 30, 2023, 23:59. Late submissions will not be accepted and will be marked 0. Do not wait until the last minute. Submissions will be checked for plagiarism and code duplication.
4. Present your work on May 1, 2023 within 7 minutes. Presentations that exceed 7 minutes will be subject to point deduction.
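
As a point of reference for the `predict_rating` interface described in the task, here is a minimal sketch of a baseline: it predicts each movie's mean training rating and falls back to the global mean for movies that never appear in training. The toy `train` and `query` frames contain made-up values for illustration, not rows from the MovieLens files.

```python
import pandas as pd

# Toy data (hypothetical values, only to exercise the interface)
train = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.0],
})
query = pd.DataFrame({
    "userId":  [3, 3, 1],
    "movieId": [10, 30, 99],   # movie 99 never appears in train
})

# "Model": per-movie mean rating plus a global fallback
global_mean = train["rating"].mean()
movie_mean = train.groupby("movieId")["rating"].mean()

def predict_rating(df):
    """Return a copy of df[userId, movieId] with a predicted `rating` column."""
    out = df[["userId", "movieId"]].copy()
    # unseen movies map to NaN, which is replaced by the global mean
    out["rating"] = out["movieId"].map(movie_mean).fillna(global_mean)
    return out

pred = predict_rating(query)
print(pred)
```

Because the function only reads the `userId` and `movieId` columns it is given, it keeps working when the validation file is swapped for another one.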

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
import imdb
import requests
import tmdb
from surprise import Dataset, Reader, SVD, accuracy

# Loading data

In [4]:
ratings_train = pd.read_csv('ratings_train.csv')
ratings_valid = pd.read_csv('ratings_valid.csv')
movies = pd.read_csv('movies.csv')
tags = pd.read_csv('tags.csv',usecols=["userId","movieId","tag"])
links = pd.read_csv('links.csv')

In [5]:
movies

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [6]:
tags

,userId,movieId,tag
0,2,60756,funny
1,2,60756,Highly quotable
2,2,60756,will ferrell
3,2,89774,Boxing story
4,2,89774,MMA
...,...,...,...
3678,606,7382,for katie
3679,606,7936,austere
3680,610,3265,gun fu
3681,610,3265,heroic bloodshed


In [7]:
links

,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
9737,193581,5476944,432131.0
9738,193583,5914996,445030.0
9739,193585,6397426,479308.0
9740,193587,8391976,483455.0


In [44]:
ratings_train

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
96459,610,166534,4.0,1493848402
96460,610,168248,5.0,1493850091
96461,610,168250,5.0,1494273047
96462,610,168252,5.0,1493846352


In [37]:
print("Number of users = "+str(ratings_train["userId"].nunique()))
print("Number of movies = "+str(ratings_train["movieId"].nunique()))

Number of users = 610
Number of movies = 9690


## Content-based

### Creating user profile

### Creating feature space

In [9]:
m_df=ratings_train[["userId","movieId","rating"]].merge(movies,on="movieId")
m_df

,userId,movieId,rating,title,genres
0,1,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
...,...,...,...,...,...
96459,610,160341,2.5,Bloodmoon (1997),Action|Thriller
96460,610,160527,4.5,Sympathy for the Underdog (1971),Action|Crime|Drama
96461,610,160836,3.0,Hazard (2005),Action|Drama|Thriller
96462,610,163937,3.5,Blair Witch (2016),Horror|Thriller


In [19]:
# keep only the columns needed for modelling (drops timestamp)
ratings = ratings_train[['userId', 'movieId', 'rating']].copy()

# convert userId and movieId to categorical type
ratings['userId'] = ratings['userId'].astype('category')
ratings['movieId'] = ratings['movieId'].astype('category')
ratings

,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0
...,...,...,...
96459,610,166534,4.0
96460,610,168248,5.0
96461,610,168250,5.0
96462,610,168252,5.0


In [10]:
# split the pipe-separated genre string into a list
m_genre = m_df.copy()
m_genre["genres"] = m_genre["genres"].str.split("|")
# convert each list to its string form, e.g. "['Action', 'Thriller']", as input for the vectorizer
m_genre["genres"] = m_genre["genres"].fillna("").astype("str")
m_genre

,userId,movieId,rating,title,genres
0,1,1,4.0,Toy Story (1995),"['Adventure', 'Animation', 'Children', 'Comedy..."
1,5,1,4.0,Toy Story (1995),"['Adventure', 'Animation', 'Children', 'Comedy..."
2,7,1,4.5,Toy Story (1995),"['Adventure', 'Animation', 'Children', 'Comedy..."
3,15,1,2.5,Toy Story (1995),"['Adventure', 'Animation', 'Children', 'Comedy..."
4,17,1,4.5,Toy Story (1995),"['Adventure', 'Animation', 'Children', 'Comedy..."
...,...,...,...,...,...
96459,610,160341,2.5,Bloodmoon (1997),"['Action', 'Thriller']"
96460,610,160527,4.5,Sympathy for the Underdog (1971),"['Action', 'Crime', 'Drama']"
96461,610,160836,3.0,Hazard (2005),"['Action', 'Drama', 'Thriller']"
96462,610,163937,3.5,Blair Witch (2016),"['Horror', 'Thriller']"


In [11]:
# map each movie title to its row position in `movies`
title_S = pd.Series(movies.index, index=movies['title'])
title_S

title
Toy Story (1995)                                0
Jumanji (1995)                                  1
Grumpier Old Men (1995)                         2
Waiting to Exhale (1995)                        3
Father of the Bride Part II (1995)              4
                                             ... 
Black Butler: Book of the Atlantic (2017)    9737
No Game No Life: Zero (2017)                 9738
Flint (2017)                                 9739
Bungo Stray Dogs: Dead Apple (2018)          9740
Andrew Dice Clay: Dice Rules (1991)          9741
Length: 9742, dtype: int64

In [12]:
# collect the set of distinct genre names
genre_name = set()
for genre in movies['genres'].str.split('|').values:
    genre_name = genre_name.union(set(genre))
genre_name

{'(no genres listed)',
 'Action',
 'Adventure',
 'Animation',
 'Children',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'IMAX',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western'}

In [13]:
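# compute TF-IDF scores of the genre tokens (one row per rating in m_genre)
from sklearn.feature_extraction.text import TfidfVectorizer
# min_df=1 keeps every token; an integer min_df of 0 is rejected by recent scikit-learn releases
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(m_genre['genres'])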
print(tfidf_matrix[:10, :10].todense())

[[0.         0.36356033 0.549949   0.50819072 0.29282185 0.
  0.         0.         0.47050776 0.        ]
 [0.         0.36356033 0.549949   0.50819072 0.29282185 0.
  0.         0.         0.47050776 0.        ]
 [0.         0.36356033 0.549949   0.50819072 0.29282185 0.
  0.         0.         0.47050776 0.        ]
 [0.         0.36356033 0.549949   0.50819072 0.29282185 0.
  0.         0.         0.47050776 0.        ]
 [0.         0.36356033 0.549949   0.50819072 0.29282185 0.
  0.         0.         0.47050776 0.        ]
 [0.         0.36356033 0.549949   0.50819072 0.29282185 0.
  0.         0.         0.47050776 0.        ]
 [0.         0.36356033 0.549949   0.50819072 0.29282185 0.
  0.         0.         0.47050776 0.        ]
 [0.         0.36356033 0.549949   0.50819072 0.29282185 0.
  0.         0.         0.47050776 0.        ]
 [0.         0.36356033 0.549949   0.50819072 0.29282185 0.
  0.         0.         0.47050776 0.        ]
 [0.         0.36356033 0.549949   0.
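
One detail worth checking here: with its default token pattern, `TfidfVectorizer` splits on punctuation, so hyphenated genres such as `Sci-Fi` and `Film-Noir` become separate tokens (`sci`, `fi`, `film`, `noir`). The sketch below shows one possible alternative, assuming the raw pipe-separated `genres` strings are passed in and split on `|` only. The three sample strings are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

genres = [
    "Action|Sci-Fi|Thriller",
    "Crime|Film-Noir",
    "(no genres listed)",
]

# Default analyzer: hyphens and parentheses break tokens apart
default_tf = TfidfVectorizer()
default_tf.fit(genres)
default_vocab = sorted(default_tf.vocabulary_)

# Custom tokenizer: split on '|' only, so each genre stays a single feature
pipe_tf = TfidfVectorizer(tokenizer=lambda s: s.split("|"),
                          token_pattern=None, lowercase=False)
pipe_matrix = pipe_tf.fit_transform(genres)
pipe_vocab = sorted(pipe_tf.vocabulary_)

print(default_vocab)
print(pipe_vocab)
```

Whether whole-genre features actually lower RMSE is an empirical question; the point is only that the feature space then has one column per genre.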

In [79]:
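# refit on `movies` so there is exactly one TF-IDF vector per movie
# (the matrix from In [13] has one row per rating and would not align with `movies`)
tfidf_array = tf.fit_transform(movies['genres']).toarray()
mov_tf = movies.copy()
# store each movie's TF-IDF vector as a list in a new column
mov_tf['Genre_tfidf'] = pd.Series(tfidf_array.tolist(), index=mov_tf.index)
mov_tf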

,movieId,title,genres,tfidf
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[0.0, 0.41684567364693936, 0.5162254711770092,..."
1,2,Jumanji (1995),Adventure|Children|Fantasy,"[0.0, 0.5123612074824268, 0.0, 0.6205251727456..."
2,3,Grumpier Old Men (1995),Comedy|Romance,"[0.0, 0.0, 0.0, 0.0, 0.5709154064399099, 0.0, ..."
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,"[0.0, 0.0, 0.0, 0.0, 0.5050154397005037, 0.0, ..."
4,5,Father of the Bride Part II (1995),Comedy,"[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ..."
...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,"[0.4360100100175108, 0.0, 0.614603265870322, 0..."
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,"[0.0, 0.0, 0.682936669160916, 0.0, 0.354001550..."
9739,193585,Flint (2017),Drama,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ..."
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,"[0.5786057446426058, 0.0, 0.8156073762948541, ..."
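
The user-profile step of the content-based approach has no code in this notebook. One common approach, sketched here under assumptions, is to build each user's profile as the rating-weighted average of the feature vectors of the movies they rated, then score unseen movies by cosine similarity to that profile. For brevity the sketch uses binary genre indicators from `Series.str.get_dummies` rather than TF-IDF vectors. The toy movies and ratings and the helper names `user_profile` and `cosine` are illustrative.

```python
import numpy as np
import pandas as pd

movies_toy = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "genres": ["Comedy|Romance", "Action|Thriller", "Comedy", "Action"],
}).set_index("movieId")
ratings_toy = pd.DataFrame({
    "userId": [7, 7],
    "movieId": [1, 2],
    "rating": [5.0, 1.0],
})

# One row per movie, one binary column per genre
features = movies_toy["genres"].str.get_dummies(sep="|").astype(float)

def user_profile(user_id):
    """Rating-weighted mean of the feature vectors of the movies a user rated."""
    r = ratings_toy[ratings_toy["userId"] == user_id]
    vecs = features.loc[r["movieId"]].to_numpy()
    weights = r["rating"].to_numpy()
    return weights @ vecs / weights.sum()

def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

profile = user_profile(7)
# Score the two movies user 7 has not rated yet
scores = {m: cosine(profile, features.loc[m].to_numpy()) for m in [3, 4]}
print(scores)
```

User 7 rated the comedy highly and the action film poorly, so the unrated comedy (movie 3) scores higher than the unrated action film (movie 4). Turning such a score into a rating on the 0.5 to 5.0 scale needs an extra mapping step that is not shown here.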


In [18]:
# get movie title and rating from IMDb
import imdb

# create the IMDb client used below (assumes the IMDbPY/Cinemagoer `imdb` package)
ia = imdb.IMDb()
linksIMDBID = links.copy()
linksIMDBID['imdbId'] = links['imdbId'].astype(str)

def get_movie_title_and_rating(links):
    def retrieve_info(imdbId):
        try:
            movie = ia.get_movie(imdbId)
            return movie['title'], movie['rating']
        except Exception:
            # network failure or missing field: leave this movie blank
            return None, None

    # Apply retrieve_info to the imdbId column (one network request per movie)
    movie_info = links['imdbId'].apply(retrieve_info)

    # Unpack the tuples into separate sequences for title and rating
    movie_titles, movie_ratings = zip(*movie_info)

    # Return a DataFrame with the movie titles and ratings
    return pd.DataFrame({'movieId': links['movieId'], 'title': movie_titles, 'rating': movie_ratings})

# Apply the function to the copy of links whose IMDb IDs are strings
movie_info_df = get_movie_title_and_rating(linksIMDBID)

## Collaborative filtering

In [14]:
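# create the user-item rating matrix used by both collaborative-filtering variants
# note: unrated entries are filled with 0, so cosine similarity treats "not rated" as a rating of 0
user_item_matrix = pd.pivot_table(ratings_train, values='rating', index='userId', columns='movieId').fillna(0)
user_item_matrix
# 610 unique users x 9690 unique movies (largest movieId is 193609)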

userId\movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,2.5,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
609,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [45]:
from sklearn.metrics.pairwise import cosine_similarity

### User-based

In [41]:
# cosine similarity between users' rating vectors, used to find similar users
user_csim = pd.DataFrame(cosine_similarity(user_item_matrix,user_item_matrix), user_item_matrix.index, user_item_matrix.index)
user_csim

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,1.000000,0.027283,0.059720,0.210282,0.129080,0.128152,0.158744,0.136968,0.064263,0.018329,...,0.081481,0.164455,0.221486,0.070669,0.153625,0.164191,0.269389,0.291097,0.093572,0.145321
2,0.027283,1.000000,0.000000,0.004167,0.016614,0.025333,0.027585,0.027257,0.000000,0.073255,...,0.205005,0.016866,0.011997,0.000000,0.000000,0.028429,0.012948,0.046211,0.027565,0.102427
3,0.059720,0.000000,1.000000,0.002518,0.005020,0.003936,0.000000,0.004941,0.000000,0.000000,...,0.005106,0.004892,0.024992,0.000000,0.010694,0.012993,0.019247,0.021128,0.000000,0.032119
4,0.210282,0.004167,0.002518,1.000000,0.107718,0.085415,0.117554,0.070424,0.012706,0.031502,...,0.097220,0.129750,0.272360,0.047971,0.094598,0.184199,0.136068,0.163608,0.024007,0.103752
5,0.129080,0.016614,0.005020,0.107718,1.000000,0.300349,0.108342,0.429075,0.000000,0.033248,...,0.068831,0.418747,0.110148,0.258773,0.148758,0.106435,0.152866,0.135535,0.261232,0.060792
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.164191,0.028429,0.012993,0.184199,0.106435,0.102123,0.200035,0.099388,0.075898,0.084767,...,0.180134,0.116534,0.300669,0.066032,0.148141,1.000000,0.153063,0.262558,0.069622,0.201104
607,0.269389,0.012948,0.019247,0.136068,0.152866,0.162182,0.186114,0.185142,0.011844,0.011351,...,0.093590,0.199910,0.203540,0.137834,0.118780,0.153063,1.000000,0.283081,0.149190,0.139114
608,0.291097,0.046211,0.021128,0.163608,0.135535,0.178809,0.323541,0.187233,0.100435,0.084093,...,0.160178,0.197514,0.232771,0.155306,0.178142,0.262558,0.283081,1.000000,0.121993,0.322055
609,0.093572,0.027565,0.000000,0.024007,0.261232,0.214234,0.090840,0.423993,0.000000,0.023641,...,0.036063,0.335231,0.061941,0.236601,0.097610,0.069622,0.149190,0.121993,1.000000,0.053225
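
A similarity matrix like `user_csim` can be turned into a rating prediction by averaging the ratings of the most similar users who rated the target movie, weighted by similarity. The sketch below is one way to do this, not the method this notebook ends up using. It works on a small dense array where `np.nan` marks missing ratings; the function names and the parameter `k` are assumptions.

```python
import numpy as np

# rows = users, columns = movies; np.nan marks "not rated" (toy values)
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, 2.0,    1.0],
    [1.0, 2.0, 5.0,    np.nan],
])

def cosine_sim_matrix(R):
    """Cosine similarity between rows, treating missing ratings as 0."""
    X = np.nan_to_num(R, nan=0.0)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0               # avoid division by zero for empty rows
    Xn = X / norms
    return Xn @ Xn.T

def predict_user_based(R, sim, u, i, k=2):
    """Similarity-weighted mean of the k nearest neighbours' ratings for item i."""
    rated = ~np.isnan(R[:, i])
    rated[u] = False                      # never use the user's own rating
    candidates = np.flatnonzero(rated)
    if candidates.size == 0:
        return float(np.nanmean(R))       # fallback: global mean
    # keep the k candidates most similar to user u
    top = candidates[np.argsort(sim[u, candidates])[::-1][:k]]
    w = sim[u, top]
    if w.sum() <= 0:
        return float(np.nanmean(R[:, i])) # fallback: item mean
    return float(w @ R[top, i] / w.sum())

sim = cosine_sim_matrix(R)
pred = predict_user_based(R, sim, u=0, i=2)
print(round(pred, 3))
```

User 0 is far more similar to user 1 than to user 2, so the prediction is pulled toward user 1's rating of 2.0 rather than user 2's rating of 5.0.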


### Item-based

In [42]:
# cosine similarity between movies' rating vectors, used to find similar movies
movie_csim = pd.DataFrame(cosine_similarity(user_item_matrix.transpose(),user_item_matrix.transpose()), user_item_matrix.columns, user_item_matrix.columns)
movie_csim

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,1.000000,0.410562,0.296917,0.035573,0.295509,0.376316,0.277491,0.115186,0.232586,0.395573,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.410562,1.000000,0.282438,0.106415,0.252313,0.297009,0.228576,0.149095,0.044835,0.417693,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.296917,0.282438,1.000000,0.092406,0.405341,0.284257,0.402831,0.334122,0.304840,0.242954,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.035573,0.106415,0.092406,1.000000,0.197276,0.089685,0.275035,0.168453,0.000000,0.095598,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.295509,0.252313,0.405341,0.197276,1.000000,0.292412,0.456264,0.316516,0.350888,0.198970,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
193583,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
193585,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
193587,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
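
The block of 1.0 values among the highest movie IDs is what cosine similarity produces when two movies were each rated by the same single user: their rating columns are non-zero in one shared position only, so the angle between them is zero whatever the ratings are. That is a likely explanation here, though it is not verified against the data. A minimal sketch of one common remedy follows: shrink each similarity by the number of users who rated both movies. The constant `lam` and the toy matrix are assumptions.

```python
import numpy as np

# rows = users, columns = movies; 0 means "not rated" (toy values)
R = np.array([
    [4.0, 5.0, 0.0, 0.0],
    [5.0, 4.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 3.5, 2.0],   # only this user rated movies 2 and 3
])

def item_cosine(R):
    """Cosine similarity between the columns (movies) of R."""
    norms = np.linalg.norm(R, axis=0, keepdims=True)
    norms[norms == 0] = 1.0
    Rn = R / norms
    return Rn.T @ Rn

def shrink(sim, R, lam=2.0):
    """Down-weight similarities that are supported by few co-rating users."""
    rated = (R > 0).astype(float)
    n_common = rated.T @ rated          # number of users who rated both movies
    return sim * n_common / (n_common + lam)

sim = item_cosine(R)
sim_shrunk = shrink(sim, R)
print(np.round(sim, 3))
print(np.round(sim_shrunk, 3))
```

Before shrinkage, movies 2 and 3 look more alike than movies 0 and 1 even though only one user supports that similarity; after shrinkage the pair backed by three users ranks higher.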


## Singular Value Decomposition

In [5]:
# rating reader; MovieLens ratings range from 0.5 to 5.0 in half-star steps
reader = Reader(rating_scale=(0.5, 5))
# Load the training dataset
ds_user_item = Dataset.load_from_df(ratings_train[["userId", "movieId", "rating"]], reader)
# Load the validation dataset (the source file must be swappable)
ds_user_item_valid = Dataset.load_from_df(ratings_valid[["userId", "movieId", "rating"]], reader)

trainset = ds_user_item.build_full_trainset()
# build_testset() yields (userId, movieId, rating) tuples with the raw IDs
testset = ds_user_item_valid.build_full_trainset().build_testset()
svd = SVD(n_factors=100, n_epochs=100, lr_all=0.005, reg_all=0.02)
svd.fit(trainset)
pred = svd.test(testset)
# print the estimated rating for every validation pair
for pre in pred:
    print(pre.est)

5
2.9443887164903
3.895629357844909
4.139178850985669
4.121778615877596
2.714181479187915
2.9497182862901496
2.8907897530543414
3.0781480307932405
3.635162803924487
4.085889584358417
3.2017366122402375
3.754780399849032
3.5474275250953267
3.789414894407361
4.008984912302239
4.1555242718803855
3.7595768026323335
3.8729873387622575
4.331559623898828
3.970967759277489
3.797845270945855
2.8362483715310853
3.187649666985946
3.5519783725226306
3.2333808605511716
3.330216886902143
3.8027428476485046
2.9966643185584636
3.6768424052843582
3.7818240669805165
3.6377912505927306
3.3677532667152965
3.7059425197254607
3.9816594832711685
3.5126420526241007
3.834292669295662
3.7536475994808796
4.428445988418387
3.4264202411348124
3.120872994228131
2.8417496585899378
3.416865502328317
3.240853127977478
3.646777801562149
4.0540867455664
4.13575708595135
4.350805447817039
3.8979588428074226
3.3852033131227497
3.653795538639686
2.752586710416039
3.529125931374691
2.853649187305984
2.8951082644530666
2.915

In [6]:
uid = [p.uid for p in pred]
iid = [p.iid for p in pred]
est = [p.est for p in pred]
pred_df = pd.DataFrame({'userId': uid, 'movieId': iid, 'rating': est})
pred_df

,userId,movieId,rating
0,4,45,5.000000
1,4,52,2.944389
2,4,58,3.895629
3,4,222,4.139179
4,4,247,4.121779
...,...,...,...
2349,561,139385,4.159082
2350,561,146656,3.594220
2351,561,149406,3.401180
2352,561,160438,3.051828
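
To make the model less of a black box: Surprise documents its `SVD` prediction as the global mean plus a user bias, an item bias, and the dot product of user and item factor vectors, trained by stochastic gradient descent on the regularized squared error. The NumPy code below is a small sketch of that idea, not Surprise's actual implementation. The function names, the toy triples, and the hyperparameter values are assumptions.

```python
import numpy as np

def fit_biased_mf(triples, n_users, n_items, n_factors=2,
                  n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """SGD on squared error for r_hat = mu + b_u + b_i + p_u . q_i."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in triples])
    bu, bi = np.zeros(n_users), np.zeros(n_items)
    P = rng.normal(0, 0.1, (n_users, n_factors))
    Q = rng.normal(0, 0.1, (n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu_old = P[u].copy()          # update both factors from the old values
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu_old - reg * Q[i])
    return mu, bu, bi, P, Q

def predict(model, u, i, low=0.5, high=5.0):
    """Predicted rating, clipped to the rating scale."""
    mu, bu, bi, P, Q = model
    return float(np.clip(mu + bu[u] + bi[i] + P[u] @ Q[i], low, high))

# (user index, item index, rating): toy values with 0-based indices
triples = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0)]
model = fit_biased_mf(triples, n_users=3, n_items=3)

def train_rmse(model):
    errs = [r - predict(model, u, i) for u, i, r in triples]
    return float(np.sqrt(np.mean(np.square(errs))))

baseline = float(np.std([r for _, _, r in triples]))  # RMSE of always predicting mu
print(train_rmse(model), baseline)
```

The comparison is on training data only, so it shows that the updates reduce the fitting error, not that the model generalizes; that is what the validation set is for.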


# Constructing model and predicting ratings

In [10]:
# Model construction: the SVD model `svd` fitted above is reused here

# Prediction
def predict_rating(df):
    # Input:
    #   df = a DataFrame with two columns: userId, movieId
    # Output:
    #   a DataFrame with three columns: userId, movieId, rating
    # Query the fitted model directly so that any (userId, movieId) pair gets a
    # prediction, not only the pairs precomputed from ratings_valid
    new_df = df[["userId", "movieId"]].copy()
    new_df["rating"] = [svd.predict(u, m).est
                        for u, m in zip(new_df["userId"], new_df["movieId"])]
    return new_df


In [11]:
# Prepare df for prediction
r = ratings_valid[['userId', 'movieId']]

# Predict ratings
ratings_pred = predict_rating(r)
ratings_pred

,userId,movieId,rating
0,4,45,5.000000
1,4,52,2.944389
2,4,58,3.895629
3,4,222,4.139179
4,4,247,4.121779
...,...,...,...
2349,561,139385,4.159082
2350,561,146656,3.594220
2351,561,149406,3.401180
2352,561,160438,3.051828


In [13]:
from sklearn.metrics import mean_squared_error

r_true = ratings_valid['rating'].to_numpy()
r_pred = ratings_pred['rating'].to_numpy()

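# take the square root explicitly; the `squared=False` argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(r_true, r_pred))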
print(f"RMSE = {rmse:.4f}")

RMSE = 0.8496
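
Since the task requires an RMSE even when the validation file is replaced, it can help to wrap the evaluation in a function that takes the file as an argument. The following is a minimal sketch under that assumption. The names `evaluate` and `constant_predictor` are hypothetical, and the in-memory CSV stands in for a file such as `ratings_valid.csv`.

```python
import io
import numpy as np
import pandas as pd

def evaluate(valid_source, predict_fn):
    """Read a validation CSV (path or file-like) and return the RMSE of predict_fn."""
    valid = pd.read_csv(valid_source)
    pred = predict_fn(valid[["userId", "movieId"]])
    # a missing prediction would silently turn the RMSE into NaN, so fail loudly
    if len(pred) != len(valid) or pred["rating"].isna().any():
        raise ValueError("predict_fn must return one non-missing rating per input row")
    err = valid["rating"].to_numpy() - pred["rating"].to_numpy()
    return float(np.sqrt(np.mean(err ** 2)))

# Stand-in predictor for the demo: always predicts 3.5 (hypothetical)
def constant_predictor(df):
    return df.assign(rating=3.5)

# In-memory stand-in for a validation file
csv_text = "userId,movieId,rating\n1,10,4.0\n2,20,3.0\n"
score = evaluate(io.StringIO(csv_text), constant_predictor)
print(f"RMSE = {score:.4f}")
```

Passing `predict_rating` as `predict_fn` together with a file path would evaluate the notebook's model on any file that has `userId`, `movieId`, and `rating` columns.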
