In this notebook, we build a content-based recommendation system based on various movie categories and link metadata that we scraped off of Wikipedia. Wikipedia gave us access to details like directors, writers, actors, awards, themes, etc.

We know that the three general steps of building a content-based model are as follows:
1. preprocessing and feature extraction (content analyzer)
    - build a vector-space representation of each item
2. building user profiles
    - learn a model of user preferences over the vector-space representation
3. filtering and recommendation
    - combine the item content and user profiles to deliver recommendations

In [49]:
import pandas as pd
import numpy as np
import pickle
!pip install annoy
import annoy

You should consider upgrading via the '/opt/anaconda3/bin/python -m pip install --upgrade pip' command.[0m


In [59]:
train_ratings = pd.read_csv('inputs/train_sample_ratings.csv', index_col=0)
val_ratings = pd.read_csv('inputs/val_sample_ratings.csv', index_col=0)
test_ratings = pd.read_csv('inputs/test_sample_ratings.csv', index_col=0)
train_val_ratings = pd.read_csv('inputs/train_val_sample_ratings.csv', index_col=0)
movies = pd.read_csv("inputs/movies.csv")

### Preprocessing and feature extraction about movies
This involved running Wikipedia scraping algorithm for each movie in our training set. We then did data exploration and cleaning on this extracted information. The Wikipedia.ipynb notebook contains the work done to scrape Wikipedia and clean its output, so this step will be skipped here, and we simply read in the data table we generated from that Wikipedia notebook. The row indices are the movieIds and the rows themselves are the movie vectoes. Please see that ipynb file if you are interested in the movie vector generation process.

In [52]:
movie_vecs = pd.read_csv('inputs/movie_tag_wiki_vecs.csv', index_col=0)
movie_vecs.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2547,2548,2549,2550,2551,2552,2553,2554,2555,2556
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
36,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
45,0,2,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [53]:
def generate_movie_vector(movieID):
    return list(movie_vecs.loc[movieID])

### Building user profiles using the MovieLens rating data


In [54]:
# given all movieIDs that user has watched, generate the user's weighted vector of movies they like (weighted by rating?)
def user_movie_vec(movieID_rating_dict):
    all_vecs = []
    for movieID, rating in movieID_rating_dict.items():
        movie_vec = generate_movie_vector(movieID)
        movie_vec = [v*rating for v in movie_vec]
        all_vecs.append(movie_vec)
    all_vecs = np.array(all_vecs)
    return np.mean(all_vecs, axis=0)

In [55]:
def create_user_profiles(ratings):
    grouped_ratings = ratings.groupby(['userId'], as_index=False)['movieId','rating'].agg(lambda x: list(x))
    users = {}
    for user_row in grouped_ratings.iterrows():
        movies = user_row[1]['movieId']
        ratings = user_row[1]['rating']
        movie_rating_dict = {k:v for k,v in zip(movies, ratings)}
        users[user_row[1]['userId']] = movie_rating_dict
    final_user_profiles = {}
    for u, dic in users.items():
        user_vec = user_movie_vec(dic)
        final_user_profiles[u] = user_vec
    return final_user_profiles

In [43]:
print(len(train_val_ratings['userId'].unique()))
user_profiles = create_user_profiles(train_val_ratings) # ONLY BUILD USER PROFILES ON TRAINING DATA
print(len(user_profiles)) # ensure we have 20k user profiles

20000
20000


In [44]:
stacked_vecs = list(user_profiles.values())
df = pd.DataFrame(stacked_vecs, index = list(user_profiles.keys())) # index = userId, row = their profile vector
df.to_csv('user_profiles.csv', header=False)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2547,2548,2549,2550,2551,2552,2553,2554,2555,2556
4,0.175258,0.082474,0.0,0.252577,0.097938,0.123711,0.391753,0.0,0.139175,0.231959,...,0.273196,0.0,0.108247,0.28866,0.0,0.257732,0.0,0.412371,0.618557,0.103093
46,0.205882,0.382353,0.0,1.088235,0.294118,0.911765,1.147059,1.088235,0.941176,0.647059,...,0.470588,0.0,0.235294,1.058824,0.0,1.0,0.382353,0.0,0.0,0.558824
47,0.157895,0.947368,0.0,0.578947,0.894737,0.842105,1.263158,0.263158,0.421053,0.736842,...,0.0,0.0,0.0,0.210526,0.0,0.263158,0.210526,0.0,0.0,0.0
51,0.553571,0.178571,0.125,0.553571,0.142857,0.642857,1.214286,0.160714,0.625,0.607143,...,0.321429,0.0,0.196429,0.482143,0.0,0.660714,0.142857,0.0,0.0,0.125
72,0.0,0.0,0.0,0.470588,0.823529,1.058824,0.294118,0.588235,0.764706,0.0,...,1.882353,0.0,0.0,1.382353,0.0,1.558824,0.294118,0.794118,19.852941,0.294118


### Run Approximate Nearest Neighbors on each user vector to get the top 10 recommended movies for that user

In [45]:
def unwatched_movies(userId, movies_per_user_dict, recommended_movieIds):
    watched_movies = movies_per_user_dict[userId]
    final_recs = []
    for movId in recommended_movieIds:
        if movId not in watched_movies:
            final_recs.append(movId)
            if len(final_recs) == 10:
                return final_recs
    return final_recs

First, let's do some hyperparameter searching for the best value of k for number of trees to search. Because this process is very lengthy for all users, we'll simply explore k for a single user (userId 4) right now.

In [75]:
all_movie_vecs = movie_vecs.to_numpy(dtype=float)
user_vec = list(df.loc[4])
curr_user_movie_vecs = np.insert(all_movie_vecs, 0, np.array(user_vec), 0)  
names = [0] + list(movie_vecs.index) # user here is named 0, every other movie is recognized by its id
index = annoy.AnnoyIndex(2557)
for name, vec in zip(names, curr_user_movie_vecs):
    index.add_item(name, vec)

index.build(50)

for k in [20, 40, 50, 80, 100, 120, 200]:
    nearest_movie_ids, nearest_distances = index.get_nns_by_vector(user_vec, 200, search_k=k, include_distances=True)
    print(unwatched_movies(user, movies_per_user_trainval, nearest_movie_ids[1:]))

  """


[6333, 3793, 1210, 112852, 1918, 4993, 65982, 1587, 8961, 316]
[6333, 3793, 1210, 112852, 1918, 4993, 65982, 1587, 8961, 316]
[6333, 3793, 1210, 112852, 1918, 4993, 65982, 1587, 8961, 316]
[6333, 3793, 1210, 112852, 1918, 4993, 65982, 1587, 8961, 316]
[6333, 3793, 1210, 112852, 1918, 4993, 65982, 1587, 8961, 316]
[6333, 3793, 1210, 112852, 1918, 4993, 65982, 1587, 8961, 316]
[6333, 3793, 1210, 112852, 1918, 4993, 65982, 1587, 8961, 316]


Above, we see that for all these various hyperparamter values, we actually end up getting the exact same recommended movies. Thus, we will simply choose k=50 in the middle of the range for simple, stable results.

In [47]:
all_movie_vecs = movie_vecs.to_numpy(dtype=float)
user_recs = {}
for user in df.index:
    user_vec = list(df.loc[user])
    curr_user_movie_vecs = np.insert(all_movie_vecs, 0, np.array(user_vec), 0)  
    names = [0] + list(movie_vecs.index) # user here is named 0, every other movie is recognized by its id
    index = annoy.AnnoyIndex(2557)
    for name, vec in zip(names, curr_user_movie_vecs):
        index.add_item(name, vec)
    index.build(50)
    nearest_movie_ids, nearest_distances = index.get_nns_by_vector(user_vec, 200, search_k=50, include_distances=True)
    user_recs[user] = unwatched_movies(user, movies_per_user_trainval, nearest_movie_ids[1:])

    print(len(user_recs))
print(user_recs[4])

  import sys


20000
[1210, 112852, 1918, 65982, 1587, 95510, 6534, 43928, 122918, 589]


In [48]:
file_to_write = open("outputs/all_content_recommendations.pickle", "wb")
pickle.dump(user_recs, file_to_write)

To avoid running the whole content based model repeatedly, we save the recommendations dictionary into a pickle file and load that file in to do our evaluation metrics on. Now, let's evaluate our predictions visually by looking at user 4's past rated movies as well as user 4's recommended movies.

In [60]:
train_ratings[train_ratings['userId']==4].join(movies.set_index('movieId'), on='movieId').sort_values(by='rating', ascending=False).head(20)

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
230,4,1242,5.0,1115782326,Glory (1989),Drama|War
548,4,4262,5.0,1127947327,Scarface (1983),Action|Crime|Drama
592,4,5010,5.0,1113795780,Black Hawk Down (2001),Action|Drama|War
119,4,349,5.0,1113765966,Clear and Present Danger (1994),Action|Crime|Drama|Thriller
351,4,2278,5.0,1113795873,Ronin (1998),Action|Crime|Thriller
724,4,8961,5.0,1113796207,"Incredibles, The (2004)",Action|Adventure|Animation|Children|Comedy
550,4,4306,5.0,1113765676,Shrek (2001),Adventure|Animation|Children|Comedy|Fantasy|Ro...
109,4,316,5.0,1113767120,Stargate (1994),Action|Adventure|Sci-Fi
235,4,1275,5.0,1113795785,Highlander (1986),Action|Adventure|Fantasy
737,4,31878,5.0,1115781996,Kung Fu Hustle (Gong fu) (2004),Action|Comedy


Here, we see that user 4 is interested in a lot of action, crime, drama, and thriller movies. Let's now take a look at user 4's recommended movies from our content-based model.

In [70]:
movie_ids = user_recs[4]
for i in movie_ids:
    print(movies[movies['movieId']==i].title.iloc[0][:-6])

Star Wars: Episode VI - Return of the Jedi 
Guardians of the Galaxy 
Lethal Weapon 4 
Outlander 
Conan the Barbarian 
Amazing Spider-Man, The 
Hulk 
Ultraviolet 
Guardians of the Galaxy 2 
Terminator 2: Judgment Day 


This appears to be a decently recommended list of movies. We see a lot of action, thrillers, Marvel, etc. We see some pretty popular movies like Star Wars and Guardians of the Galaxy, but we also see some more "novel" recommendations like Conan the Barbarian.