Assuming we have "results.txt", which contains the results of a brute-force search for the "best" hyperparameters (produced by running create_model in parallel), we can pick out the best configuration and use it to train the main model.

In [3]:
import vowpalwabbit
import numpy as np
import pandas as pd 
import model_selection
from sklearn.metrics import mean_squared_error

In [4]:
# import ast 
# best_rmse = np.inf
# best_hyperparams = None
# with open("results.txt", 'r') as file:
#     lines = file.readlines()[:107550]
#     for line in lines:
#         parsed_data = ast.literal_eval(line.strip())
#         rmse = parsed_data[0]
#         if rmse < best_rmse:
#             best_rmse = rmse
#             best_hyperparams = parsed_data[1]
# print(best_rmse, best_hyperparams)
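
The commented-out cell above scans one slice of the file at a time. As a minimal sketch of doing the same thing in a single pass, the function below tracks the lowest-RMSE entry for each block of lines. It assumes, as the cell above does, that every line is a Python literal whose first element is the RMSE and whose second is the hyperparameter dict. The name `best_per_chunk` and the sample lines are made up for illustration; they are not part of the project code.

```python
import ast
import math

def best_per_chunk(lines, chunk_size=20_000):
    """Return {chunk_index: (rmse, hyperparams)} holding the lowest RMSE per chunk."""
    best = {}
    for i, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rmse, hyperparams = ast.literal_eval(line)[:2]
        chunk = i // chunk_size
        if rmse < best.get(chunk, (math.inf, None))[0]:
            best[chunk] = (rmse, hyperparams)
    return best

# Tiny in-memory stand-in for results.txt (made-up values, not real results).
sample = [
    "(1.20, {'rank': 15, 'passes': 1})",
    "(0.98, {'rank': 39, 'passes': 1})",
    "(1.05, {'rank': 39, 'passes': 6})",
    "(1.50, {'rank': 20, 'passes': 18})",
]
per_chunk = best_per_chunk(sample, chunk_size=2)
overall = min(per_chunk.values(), key=lambda entry: entry[0])
print(per_chunk)
print(overall)
```

With the real file, passing the open file object (`with open("results.txt") as f: best_per_chunk(f)`) would avoid loading every line into memory at once.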

For lines 0-20k, the best result is 0.9793287079032921 {'l2': 0.0001, 'lrate': 0.01, 'passes': 1, 'rank': 39}.  
For lines 20k-40k, it is 0.9804566075675284 {'l2': 0.0001, 'lrate': 0.01, 'passes': 6, 'rank': 39}.  
For lines 40k-60k, it is 0.9918920165609537 {'l2': 0.03727593720314938, 'lrate': 0.01, 'passes': 6, 'rank': 15}.  
For lines 60k-80k, it is 1.468098975181335 {'l2': 1.0, 'lrate': 0.005005, 'passes': 18, 'rank': 39}.  
For lines 80k-100k, it is 1.0849137448164488 {'l2': 1.0, 'lrate': 0.1, 'passes': 1, 'rank': 39}.  
For lines 100k+, it is 1.973363177731362 {'l2': 5.17947467923121, 'lrate': 0.1, 'passes': 18, 'rank': 39}.  
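
For context on why the search behind these numbers is expensive: a brute-force search fits one model per combination of l2, lrate, passes, and rank, so the number of fits is the product of the number of values tried for each. A minimal sketch of enumerating such a grid follows. The value lists are purely illustrative and are not the grid that was actually searched, and `iter_hyperparams` is a hypothetical helper.

```python
from itertools import product

# Illustrative value lists only; the real search grid is not reproduced here.
grid = {
    "rank": [15, 27, 39],
    "l2": [0.0001, 0.01, 1.0],
    "lrate": [0.01, 0.1],
    "passes": [1, 6, 18],
}

def iter_hyperparams(grid):
    """Yield one dict per combination of the values in `grid`."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

combos = list(iter_hyperparams(grid))
print(len(combos))   # 3 * 3 * 2 * 3 = 54 model fits
print(combos[0])
```

Each yielded dict has the same keys as the `hyperparams` argument passed to create_model, so combinations like these could be handed to parallel workers one at a time.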

These RMSE values are not the best obtainable: to keep computation manageable, the search used a much smaller dataset (25k training and 25k validation examples). Even so, brute-forcing through the combinations took considerable compute time. The lowest RMSE overall is obtained with l2 = 0.0001, learning rate = 0.01, passes = 1, and rank = 39.

It's noteworthy that even though rank was varied between 15 and 39, rank 39 gave the best result in five of the six chunks above, which supports the idea that allowing more room for latent features improves performance. Since 39 is the top of the range we searched, more testing would be needed to determine at which point performance begins to fall off for this particular dataset. However, model size and training time also grow as the rank increases, so further testing isn't feasible within our current time and compute constraints.

The variable "passes" is the number of times each training example is used during training. Now, we can train our main model:

In [7]:
# Credits to Cody for this part.
# Fraction of all movies to keep (0.01 = the top 1% by number of ratings).
percentageDatasetUsed = 0.01

# Load the datasets
folder = "ml-32m/"
ratings = pd.read_csv(folder + 'ratings.csv')
movies = pd.read_csv(folder + 'movies.csv')

# Get a percentage of movie ids. Favor movies with more reviews.
movie_rating_counts = ratings['movieId'].value_counts()
sorted_movies = movie_rating_counts.sort_values(ascending=False)
top_movies = sorted_movies.head(int(len(sorted_movies) * percentageDatasetUsed)).index
top_movies_data = movies[movies["movieId"].isin(top_movies)]
print("Number of movies:", len(top_movies))

# Ratings of top movies, and removing timestamp feature
filtered_ratings = ratings[ratings['movieId'].isin(top_movies)].drop("timestamp", axis=1)

Number of movies: 844


In [8]:
from preprocessing import convert, split
from model_selection import create_model, pred

training_df, testing_df = split(filtered_ratings, training_size=0.8, randomstate=1) 
testing = convert(testing_df)
training = convert(training_df)
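
The `convert` helper is not shown here. As a rough sketch of what such a conversion might look like, the function below turns (userId, movieId, rating) rows into Vowpal Wabbit text examples of the form `label |namespace feature`, using the same `|user ... |movie ...` layout that the ranking cell later builds for prediction. `to_vw_lines` is a hypothetical stand-in, and the real `convert` may differ in both input type and output format.

```python
def to_vw_lines(rows, with_label=True):
    """Format (user_id, movie_id, rating) rows as Vowpal Wabbit text examples."""
    lines = []
    for user_id, movie_id, rating in rows:
        features = f"|user {user_id} |movie {movie_id}"
        # Training examples carry the rating as the label; prediction examples do not.
        lines.append(f"{rating} {features}" if with_label else features)
    return lines

# Made-up rows for illustration.
rows = [(1, 296, 4.0), (2, 318, 5.0)]
train_lines = to_vw_lines(rows)
predict_lines = to_vw_lines(rows, with_label=False)
print(train_lines)
print(predict_lines)
```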

Note that running the code below creates a cache file on disk, "model.cache", and that the "create_model" function takes approximately 5 minutes to train.

In [10]:
hyperparams = {"rank": 39, "l2": 0.0001, "lrate": 0.01, "passes": 1} 
model, test_rmse, _ = create_model(hyperparams=hyperparams, train=training, validation_df=testing_df, validation=testing, r_model=True)
model.save("model.vw")
print(test_rmse)

0.8811452430303447


With the model saved as "model.vw", we can begin ranking. In practice, we'd want to score every user against every movie and save those predictions to a file; for our purposes, that isn't necessary. So, for simplicity, we'll define the ranking function to just return the top movies for a given userId.

In [12]:
model = vowpalwabbit.Workspace("-i model.vw")
top_n_ratings = 10
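def ranker(userId, top_n_ratings=top_n_ratings):
    # Build one VW example per candidate movie for this user.
    to_predict = [f"|user {userId} |movie {movieId}" for movieId in top_movies]
    predictions = pred(model, to_predict)
    # Sort movie ids by predicted rating, highest first, and keep the top n.
    ranked = sorted(zip(top_movies, predictions), key=lambda x: x[1], reverse=True)
    top_n_movieIds = [movieId for movieId, _ in ranked[:top_n_ratings]]
    # Look up the titles of the selected movies.
    top_n_movies = top_movies_data[top_movies_data["movieId"].isin(top_n_movieIds)]
    return top_n_movies["title"]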

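We simply predict what the user would rate each candidate movie and return the titles of the top n. Because the final lookup filters with isin, the titles come back in the order they appear in movies.csv rather than in order of predicted rating. An example: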

In [14]:
userId = 1
# Use a new name here: reassigning top_movies would overwrite the global that ranker() reads.
recommended = ranker(userId, top_n_ratings=top_n_ratings)
print(recommended)

257             Star Wars: Episode IV - A New Hope (1977)
292                                   Pulp Fiction (1994)
314                      Shawshank Redemption, The (1994)
351                                   Forrest Gump (1994)
475                                  Jurassic Park (1993)
522                               Schindler's List (1993)
585                      Silence of the Lambs, The (1991)
2480                                   Matrix, The (1999)
2867                                    Fight Club (1999)
4888    Lord of the Rings: The Fellowship of the Ring,...
Name: title, dtype: object
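
The titles above are listed by DataFrame index, not by predicted rating. If the ranked order matters, one option is to select the top n (movie, prediction) pairs directly and look up titles afterwards. The sketch below uses plain Python with made-up ids, predictions, and titles; `top_n_titles` is a hypothetical helper and not part of the project code.

```python
import heapq

def top_n_titles(movie_ids, predictions, titles, n=10):
    """Return (title, predicted_rating) pairs, highest predicted rating first."""
    ranked = heapq.nlargest(n, zip(movie_ids, predictions), key=lambda pair: pair[1])
    return [(titles[movie_id], rating) for movie_id, rating in ranked]

# Made-up data for illustration only.
movie_ids = [10, 20, 30, 40]
predictions = [3.1, 4.7, 2.5, 4.2]
titles = {10: "Movie A", 20: "Movie B", 30: "Movie C", 40: "Movie D"}

top_two = top_n_titles(movie_ids, predictions, titles, n=2)
print(top_two)
```

In the notebook's setting, `movie_ids` would correspond to the candidate ids from `top_movies` and `predictions` to the output of `pred`. Returning the predicted rating alongside each title also makes it easier to sanity-check the ranking.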
