# Cold Start For Movie Recommender System

## Team
Christian Mourad `20682730`

## Problem Statement
This is my custom project for MSCI-641, Spring 22. The goal of the project is to build a movie recommender system to help cold start users who do not have many ratings in the system.  

## Data
The recommender will use the IMDB ratings dataset. This dataset has :
* 45,000 movies
* 270,000 users
* 26 million ratings

The recommender system will take the following input:
* the features of the movie, like:
    * director
    * top 5 actors
    * top 3 genres
    * release date
    * top 2 production companies
* the plot summary (a.k.a. overview) of the movie
    * baseline: bag-of-words
    * advanced model: embedding (potentially using `BERT`)

Due to the very large number of users and movies, the data is very sparse. To keep computation time and resources at a resonable size, the movies (and associated ratings that will be used for testing) will be trucated to only movies that meet a certain popularity criterion. Moreover, this will keep the testing metrics at bay (this will become more evident in the [Testing](#testing) section)

## Testing

The primary metric that we will use is [MAP@k](https://machinelearninginterview.com/topics/machine-learning/mapatk_evaluation_metric_for_ranking/). 
Due to the very large number of users and movies, the data is very sparse. This can hurt testing significantly, since the recommended movies can be movies that the user never stumbled upon. Thus, we will test our algorithms on the 100 users who watched the biggest number of movies.

`x_in_100` for 10k movies: mean=0.461700, std=0.920444


## Future work
This system primairly uses content-based fitlerting for recommendation. If time permits, or as a next step, it would be intersting to augment the recommendation logic with collaborative-based filtering.


## 1. Loading the data and keeping the desired features

In [128]:
import pandas as pd, numpy as np
df_movies = pd.read_csv('data/IMDB_Ratings/movies_metadata.csv')
df_movies.head()

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [132]:
df_credits = pd.read_csv('data/IMDB_Ratings/credits.csv')
df_credits.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [133]:
corrupt_ids = []
for i in df_movies['id']:
    try:
        int(i)
    except:
        corrupt_ids.append(i)

# remove the rows with the corrupt ids from the data frame
df_movies.drop(df_movies[df_movies.id.isin(corrupt_ids)].index, inplace=True)

In [134]:
# cast movies.id to int for the join to succeed
df_movies.id = df_movies.id.astype(int)

# merge/join the 2 dataframes together on the movie id
df_movies_full_data = df_credits.merge(df_movies,on='id')

df_movies_full_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45538 entries, 0 to 45537
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   cast                   45538 non-null  object 
 1   crew                   45538 non-null  object 
 2   id                     45538 non-null  int64  
 3   adult                  45538 non-null  object 
 4   belongs_to_collection  4500 non-null   object 
 5   budget                 45538 non-null  object 
 6   genres                 45538 non-null  object 
 7   homepage               7792 non-null   object 
 8   imdb_id                45521 non-null  object 
 9   original_language      45527 non-null  object 
 10  original_title         45538 non-null  object 
 11  overview               44584 non-null  object 
 12  popularity             45535 non-null  object 
 13  poster_path            45152 non-null  object 
 14  production_companies   45535 non-null  object 
 15  pr

In [135]:
# remove movies that have been rated under a certain percentile
MIN_MOVIE_RATED_PERCENTILE = 50

popular_movies = df_movies_full_data.copy().loc[df_movies_full_data.vote_count >= MIN_MOVIE_RATED_PERCENTILE]
popular_movies = popular_movies[['id', 'cast', 'title', 'crew', 'genres', 'overview', 'production_companies']]
popular_movies.reset_index(drop=True, inplace=True)
popular_movies.head()


Unnamed: 0,id,cast,title,crew,genres,overview,production_companies
0,862,"[{'cast_id': 14, 'character': 'Woody (voice)',...",Toy Story,"[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","Led by Woody, Andy's toys live happily in his ...","[{'name': 'Pixar Animation Studios', 'id': 3}]"
1,8844,"[{'cast_id': 1, 'character': 'Alan Parrish', '...",Jumanji,"[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",When siblings Judy and Peter discover an encha...,"[{'name': 'TriStar Pictures', 'id': 559}, {'na..."
2,15602,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...",Grumpier Old Men,"[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",A family wedding reignites the ancient feud be...,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'..."
3,11862,"[{'cast_id': 1, 'character': 'George Banks', '...",Father of the Bride Part II,"[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 35, 'name': 'Comedy'}]",Just when George Banks has recovered from his ...,"[{'name': 'Sandollar Productions', 'id': 5842}..."
4,949,"[{'cast_id': 25, 'character': 'Lt. Vincent Han...",Heat,"[{'credit_id': '52fe4292c3a36847f802916d', 'de...","[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...","Obsessive master thief, Neil McCauley leads a ...","[{'name': 'Regency Enterprises', 'id': 508}, {..."


In [114]:
popular_movies[popular_movies.title == 'The Avengers']

Unnamed: 0,id,cast,title,crew,genres,overview
1078,9320,"[{'cast_id': 1, 'character': 'John Steed', 'cr...",The Avengers,"[{'credit_id': '52fe44e7c3a36847f80b0e7f', 'de...","[{'id': 53, 'name': 'Thriller'}]","British Ministry agent John Steed, under direc..."
5945,24428,"[{'cast_id': 46, 'character': 'Tony Stark / Ir...",The Avengers,"[{'credit_id': '52fe4495c3a368484e02b1cf', 'de...","[{'id': 878, 'name': 'Science Fiction'}, {'id'...",When an unexpected enemy emerges and threatens...


In [136]:
popular_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9171 entries, 0 to 9170
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   id                    9171 non-null   int64 
 1   cast                  9171 non-null   object
 2   title                 9171 non-null   object
 3   crew                  9171 non-null   object
 4   genres                9171 non-null   object
 5   overview              9135 non-null   object
 6   production_companies  9171 non-null   object
dtypes: int64(1), object(6)
memory usage: 501.7+ KB


In [138]:
# taken from https://gist.github.com/4OH4/f727af7dfc0e6bb0f26d2ea41d89ee55
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.corpus import stopwords

# Download stopwords list
nltk.download('punkt')
stop_words = set(stopwords.words('english')) 

# Interface lemma tokenizer from nltk with sklearn
class LemmaTokenizer:
    ignore_tokens = [',', '.', ';', ':', '"', '``', "''", '`']
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc) if t not in self.ignore_tokens]

# Lemmatize the stop words
tokenizer=LemmaTokenizer()
token_stop = tokenizer(' '.join(stop_words))

# Create TF-idf model
vectorizer = TfidfVectorizer(
    stop_words=token_stop,
    tokenizer=tokenizer,
    lowercase=True,
    max_df=0.95,
    min_df=10)

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix_lemma = vectorizer.fit_transform(popular_movies['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix_lemma.shape

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


(9171, 4140)

In [137]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(
    stop_words='english', 
    lowercase=True,
    max_df=0.95,
    min_df=10)

#Replace NaN with an empty string
popular_movies['overview'] = popular_movies['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(popular_movies['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(9171, 4330)

In [117]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix_lemma, tfidf_matrix_lemma)

In [118]:
cosine_sim.shape

(9171, 9171)

In [120]:
#Construct a reverse map of indices and movie titles
indices = pd.Series(popular_movies.index, index=popular_movies.id).drop_duplicates()

In [244]:
# Function that takes in movie id as input and outputs most similar movies
def get_recommendations(movie_id, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[movie_id]

    # Get the pairwsie similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return popular_movies[['id']].iloc[movie_indices].id.to_list()

In [245]:
get_recommendations(24428)

[99861, 76122, 271110, 100402, 1771, 10138, 29845, 413279, 11324, 13505]

In [139]:
# Parse the stringified features into their corresponding python objects
from ast import literal_eval

for feature in ['cast', 'crew', 'genres', 'production_companies']:
    popular_movies[feature] = popular_movies[feature].apply(literal_eval)

In [125]:
# Get the director's name from the crew feature. If director is not listed, return NaN
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [140]:
# Returns the list top n elements or entire list; whichever is more.
def get_list(x, top_n):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > top_n:
            names = names[:top_n]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [142]:
# Define new director, cast, genres and keywords features that are in a suitable form.
popular_movies['director'] = popular_movies['crew'].apply(get_director)

for feature, top_n in [('cast', 5), ('genres', 3), ('production_companies', 2)]:
    popular_movies[feature] = popular_movies[feature].apply(get_list, args=(top_n,))

In [143]:
popular_movies.head()

Unnamed: 0,id,cast,title,crew,genres,overview,production_companies,director
0,862,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...",Toy Story,"[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[Animation, Comedy, Family]","Led by Woody, Andy's toys live happily in his ...",[Pixar Animation Studios],John Lasseter
1,8844,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...",Jumanji,"[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[Adventure, Fantasy, Family]",When siblings Judy and Peter discover an encha...,"[TriStar Pictures, Teitler Film]",Joe Johnston
2,15602,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...",Grumpier Old Men,"[{'credit_id': '52fe466a9251416c75077a89', 'de...","[Romance, Comedy]",A family wedding reignites the ancient feud be...,"[Warner Bros., Lancaster Gate]",Howard Deutch
3,11862,"[Steve Martin, Diane Keaton, Martin Short, Kim...",Father of the Bride Part II,"[{'credit_id': '52fe44959251416c75039ed7', 'de...",[Comedy],Just when George Banks has recovered from his ...,"[Sandollar Productions, Touchstone Pictures]",Charles Shyer
4,949,"[Al Pacino, Robert De Niro, Val Kilmer, Jon Vo...",Heat,"[{'credit_id': '52fe4292c3a36847f802916d', 'de...","[Action, Crime, Drama]","Obsessive master thief, Neil McCauley leads a ...","[Regency Enterprises, Forward Pass]",Michael Mann


In [144]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [145]:
# turn all feature names to lower case and strip them from spaces (to not get confused with the bag-of-words)
for feature in ['cast', 'director', 'director', 'production_companies']:
    popular_movies[feature] = popular_movies[feature].apply(clean_data)

In [148]:
popular_movies.head()

Unnamed: 0,id,cast,title,crew,genres,overview,production_companies,director,features
0,862,"[tomhanks, timallen, donrickles, jimvarney, wa...",Toy Story,"[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[Animation, Comedy, Family]","Led by Woody, Andy's toys live happily in his ...",[pixaranimationstudios],johnlasseter,tomhanks timallen donrickles jimvarney wallace...
1,8844,"[robinwilliams, jonathanhyde, kirstendunst, br...",Jumanji,"[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[Adventure, Fantasy, Family]",When siblings Judy and Peter discover an encha...,"[tristarpictures, teitlerfilm]",joejohnston,robinwilliams jonathanhyde kirstendunst bradle...
2,15602,"[waltermatthau, jacklemmon, ann-margret, sophi...",Grumpier Old Men,"[{'credit_id': '52fe466a9251416c75077a89', 'de...","[Romance, Comedy]",A family wedding reignites the ancient feud be...,"[warnerbros., lancastergate]",howarddeutch,waltermatthau jacklemmon ann-margret sophialor...
3,11862,"[stevemartin, dianekeaton, martinshort, kimber...",Father of the Bride Part II,"[{'credit_id': '52fe44959251416c75039ed7', 'de...",[Comedy],Just when George Banks has recovered from his ...,"[sandollarproductions, touchstonepictures]",charlesshyer,stevemartin dianekeaton martinshort kimberlywi...
4,949,"[alpacino, robertdeniro, valkilmer, jonvoight,...",Heat,"[{'credit_id': '52fe4292c3a36847f802916d', 'de...","[Action, Crime, Drama]","Obsessive master thief, Neil McCauley leads a ...","[regencyenterprises, forwardpass]",michaelmann,alpacino robertdeniro valkilmer jonvoight toms...


In [147]:
# join all the features together into one string thacan be processed by a count-vectorizer
def join_features(x, features = ['cast', 'director', 'director', 'production_companies']):
    joined_features = ''
    for feature in features:
        joined_features += ' '.join(x[feature]) + ' '
    return joined_features

popular_movies['features'] = popular_movies.apply(join_features, axis=1)

In [149]:
popular_movies.head()

Unnamed: 0,id,cast,title,crew,genres,overview,production_companies,director,features
0,862,"[tomhanks, timallen, donrickles, jimvarney, wa...",Toy Story,"[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[Animation, Comedy, Family]","Led by Woody, Andy's toys live happily in his ...",[pixaranimationstudios],johnlasseter,tomhanks timallen donrickles jimvarney wallace...
1,8844,"[robinwilliams, jonathanhyde, kirstendunst, br...",Jumanji,"[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[Adventure, Fantasy, Family]",When siblings Judy and Peter discover an encha...,"[tristarpictures, teitlerfilm]",joejohnston,robinwilliams jonathanhyde kirstendunst bradle...
2,15602,"[waltermatthau, jacklemmon, ann-margret, sophi...",Grumpier Old Men,"[{'credit_id': '52fe466a9251416c75077a89', 'de...","[Romance, Comedy]",A family wedding reignites the ancient feud be...,"[warnerbros., lancastergate]",howarddeutch,waltermatthau jacklemmon ann-margret sophialor...
3,11862,"[stevemartin, dianekeaton, martinshort, kimber...",Father of the Bride Part II,"[{'credit_id': '52fe44959251416c75039ed7', 'de...",[Comedy],Just when George Banks has recovered from his ...,"[sandollarproductions, touchstonepictures]",charlesshyer,stevemartin dianekeaton martinshort kimberlywi...
4,949,"[alpacino, robertdeniro, valkilmer, jonvoight,...",Heat,"[{'credit_id': '52fe4292c3a36847f802916d', 'de...","[Action, Crime, Drama]","Obsessive master thief, Neil McCauley leads a ...","[regencyenterprises, forwardpass]",michaelmann,alpacino robertdeniro valkilmer jonvoight toms...


In [164]:
# use CountVectorizer on the features
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
feature_count_matrix = cv.fit_transform(popular_movies['features'])

feature_count_matrix.shape

(9171, 22469)

In [185]:
# computer the cosine similarity on the bag of words tfidf and the features count vectors
from scipy.sparse import hstack
from sklearn.metrics.pairwise import cosine_similarity

FEATURE_WEIGHT = 0.5
BOW_WEIGHT = 2

full_input_matrix = hstack((BOW_WEIGHT*tfidf_matrix_lemma, FEATURE_WEIGHT*feature_count_matrix))

cosine_sim = cosine_similarity(full_input_matrix, full_input_matrix)

In [181]:
cosine_sim.shape

(9171, 9171)

In [186]:
get_recommendations(24428, cosine_sim)

5945
sim_scores.shape: 9171
sim_scores: [(7599, 0.346987950606055), (8861, 0.2073901474897821), (7606, 0.1825455940070865), (7049, 0.17287505293530955), (5873, 0.17038918718806417), (5488, 0.15893016440259836), (3153, 0.14411485572283111), (8852, 0.13987572123604708), (5409, 0.13341119154061992), (3219, 0.1333662673423161)]


Unnamed: 0,id,title
7599,99861,Avengers: Age of Ultron
8861,76122,Marvel One-Shot: The Consultant
7606,271110,Captain America: Civil War
7049,100402,Captain America: The Winter Soldier
5873,1771,Captain America: The First Avenger
5488,10138,Iron Man 2
3153,29845,A Woman Under the Influence
8852,413279,Team Thor
5409,11324,Shutter Island
3219,13505,The Perfect Score


## Testing

For every user who has 20 or more positive ratings (defined as a rating that is higher than 2.5):
* Extract the first 10 movies they rated, and use them to generate 100 recommendations
* Then, use two metrics to evaluate the model:
    1. x in 100
        * Check how many of these recommendations were actually watched
    2. MAP@k
        * SOME EXPLANATION

In [192]:
# load the movie ratings into a data frame
df_ratings = pd.read_csv('data/IMDB_Ratings/ratings.csv')

df_ratings.shape

(26024289, 4)

In [193]:
# drop all ratings that pertain to movies that were not considered in the similarity matrix
df_ratings = df_ratings[df_ratings.movieId.isin(popular_movies.id.to_list())]

df_ratings.shape

(7654481, 4)

In [191]:
##### SKIP this since we do not really care about the value of the rating, we will discard that column eventually anyway 

# turn all ratings to:
# - -1:   0 < rating < 2.5
# -  0: rating = 2.5
# -  1: 2.5 < rating <= 5
def simplify_rating(rating):
    if rating < 2.5:
        return -1
    if rating == 2.5:
        return 0
    return 1
    
df_ratings.rating = df_ratings.rating.apply(simplify_rating)

df_ratings.shape

(7654481, 4)

In [194]:
# discard all neutral or negative ratings
positive_ratings = df_ratings[df_ratings.rating > 2.5]

positive_ratings.shape

(6384265, 4)

In [231]:
# find the user who have 10 or more positive ratings
ratings_per_user = positive_ratings.groupby(['userId'])['userId'].count().reset_index(name='ratingsGiven')

ratings_per_user.shape

(256678, 2)

In [233]:
MIN_POSITIVE_RATINGS_PER_USER = 20

avid_users = ratings_per_user[ratings_per_user.ratingsGiven >= MIN_POSITIVE_RATINGS_PER_USER]

avid_users.shape

(87592, 2)

In [234]:
avid_users.head()

Unnamed: 0,userId,ratingsGiven
7,8,23
10,11,28
11,12,91
14,15,37
15,16,29


In [235]:
test_users = positive_ratings[
    positive_ratings.userId.isin(avid_users.userId.to_list())
    ]
    
test_users.shape

(5218303, 4)

In [236]:
test_users = test_users.groupby('userId')['movieId'].apply(list).reset_index(name="moviesWatched")
test_users.shape

(87592, 2)

In [237]:
test_users.head()

Unnamed: 0,userId,moviesWatched
0,8,"[170, 318, 553, 647, 653, 912, 968, 1265, 1266..."
1,11,"[110, 165, 260, 296, 318, 344, 364, 457, 480, ..."
2,12,"[16, 17, 82, 97, 123, 150, 162, 175, 176, 194,..."
3,15,"[6, 107, 293, 296, 441, 541, 599, 608, 745, 77..."
4,16,"[111, 198, 260, 296, 318, 380, 480, 541, 593, ..."


In [264]:
def test_recommender(movies_watched):
    train_movies = movies_watched[:10]
    test_movies = movies_watched[10:]

    recommendations = set()
    for movieId in train_movies:
        for recId in get_recommendations(movieId):
            recommendations.add(recId)
    
    # calculate x in 100
    overlaps = recommendations.intersection(set(test_movies))
    x_in_100 = len(overlaps)

    # TODO should I shuffle the movies to benifit MAP@k?
    # TODO implement MAP@k

    return x_in_100


In [265]:
test_users['x_in_100'] = test_users.moviesWatched.apply(test_recommender)

KeyboardInterrupt: 

In [270]:
dummy = test_users.head(10000)
dummy['x_in_100'] = dummy.moviesWatched.apply(test_recommender)
dummy.describe()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dummy['x_in_100'] = dummy.moviesWatched.apply(test_recommender)


Unnamed: 0,userId,x_in_100
count,10000.0,10000.0
mean,15225.3917,0.4617
std,8918.762341,0.920444
min,8.0,0.0
25%,7407.5,0.0
50%,15207.0,0.0
75%,22925.25,1.0
max,30774.0,15.0
