Assume we have "results.txt", which contains the results of a brute-force search for the "best" hyperparameters (produced by running create_model in parallel). The commented-out cell below scans a range of its lines for the lowest RMSE.

In [10]:
import vowpalwabbit
import numpy as np
import pandas as pd 
import model_selection
from sklearn.metrics import mean_squared_error

In [3]:
# import ast 
# best_rmse = np.inf
# best_hyperparams = None
# with open("results.txt", 'r') as file:
#     lines = file.readlines()[:107550]
#     for line in lines:
#         parsed_data = ast.literal_eval(line.strip())
#         rmse = parsed_data[0]
#         if rmse < best_rmse:
#             best_rmse = rmse
#             best_hyperparams = parsed_data[1]
# print(best_rmse, best_hyperparams)

For lines 0-20k, we have 0.9793287079032921 {'l2': 0.0001, 'lrate': 0.01, 'passes': 1, 'rank': 39} as our best values.  
For lines 20k-40k, we have 0.9804566075675284 {'l2': 0.0001, 'lrate': 0.01, 'passes': 6, 'rank': 39}.  
For lines 40k-60k, we have 0.9918920165609537 {'l2': 0.03727593720314938, 'lrate': 0.01, 'passes': 6, 'rank': 15}.  
For lines 60k-80k, we have 1.468098975181335 {'l2': 1.0, 'lrate': 0.005005, 'passes': 18, 'rank': 39}.  
For lines 80k-100k, we have 1.0849137448164488 {'l2': 1.0, 'lrate': 0.1, 'passes': 1, 'rank': 39}.  
For lines 100k+, 1.973363177731362 {'l2': 5.17947467923121, 'lrate': 0.1, 'passes': 18, 'rank': 39}.  
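
The per-range results above can also be collected in a single pass rather than by re-slicing the file by hand. The following is a minimal sketch, assuming each line of "results.txt" is a Python literal of the form `(rmse, hyperparams_dict)`, which is what the commented-out cell indexes into. The function name `best_per_chunk` and its `chunk_size` argument are hypothetical, and the demo lines are made up rather than taken from the real results.

```python
import ast


def best_per_chunk(lines, chunk_size=20_000):
    """Return {chunk_index: (best_rmse, best_hyperparams)} per block of lines.

    Assumes every non-empty line is a literal such as
    "(0.98, {'rank': 39, 'l2': 0.0001})".
    """
    best = {}
    for i, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines, but keep counting line positions
        rmse, hyperparams = ast.literal_eval(line)
        chunk = i // chunk_size
        # Keep the lowest RMSE seen so far within this chunk.
        if chunk not in best or rmse < best[chunk][0]:
            best[chunk] = (rmse, hyperparams)
    return best


# Made-up example lines with a chunk size of 2 (not real results).
demo = [
    "(1.2, {'rank': 15})",
    "(0.9, {'rank': 39})",
    "(1.5, {'rank': 27})",
]
demo_best = best_per_chunk(demo, chunk_size=2)
overall_best = min(demo_best.values(), key=lambda pair: pair[0])
print(demo_best)
print(overall_best)
```

On the real file, the same function could be called as `best_per_chunk(open("results.txt"))`, since a file object iterates over its lines.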

These RMSEs aren't the best obtainable: they were computed on a much smaller dataset (25k training and 25k validation examples) so that the search wouldn't take too long. Even so, brute-forcing through the grid took considerable computational time. The lowest RMSE is obtained with l2 = 0.0001, learning rate = 0.01, passes = 1, and rank = 39. It's noteworthy that, although rank was varied between 15 and 39, rank 39 usually gave the best performance, which is consistent with the idea that allowing more latent features improves performance. More testing would be needed to determine the point at which performance begins to fall off for this particular dataset.

As the rank increases, the model takes up more space and takes longer to train, so further testing isn't feasible under the current time and computational constraints. The variable "passes" is the number of passes made over the training data, i.e. the number of times each training example is used during training. Now we can train our main model:

In [24]:
from preprocessing import convert, split
from model_selection import create_model

folder = "ml-32m/"
# Drop the timestamp column, as we aren't making use of it
ratings = pd.read_csv(folder+"ratings.csv").drop("timestamp", axis=1)
df1, testing_df = split(ratings, training_size=0.75, randomstate=1) 
testing = convert(testing_df)

Note that running the code below creates a cache file, "model.cache", on disk, and that the "create_model" function takes approx. 30 minutes to train, as there are 16m examples in our training set.

In [26]:
hyperparams = {"rank": 39, "l2": 0.0001, "lrate": 0.01, "passes": 1}
training_df, validation_df = split(df1, training_size=0.75, randomstate=1)
training = convert(training_df)
validation = convert(validation_df) 
model, rmse, _ = create_model(hyperparams=hyperparams, train=training, validation_df=validation_df, validation=validation, r_model=True)
model.save("model.vw")
print(rmse)

0.9159200690969022
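
The two calls to `split` above produce nested sets: a held-out test set first, then training and validation sets from the remainder. The `split` helper itself lives in the `preprocessing` module and isn't shown here, so the following is only a hypothetical stand-in, assuming a plain random row-wise split. The real function may behave differently (for example, it could split per user). The name `split_sketch` and the toy data are illustrative only.

```python
import pandas as pd


def split_sketch(df, training_size=0.75, randomstate=1):
    """Hypothetical stand-in for preprocessing.split: random row-wise split."""
    first = df.sample(frac=training_size, random_state=randomstate)
    second = df.drop(first.index)  # everything not sampled into `first`
    return first, second


# Toy ratings table with 16 rows (made-up values).
toy = pd.DataFrame({
    "userId": [i // 4 for i in range(16)],
    "movieId": [i % 4 for i in range(16)],
    "rating": [float(1 + i % 5) for i in range(16)],
})

# First split: hold out the test set. Second split: training vs. validation.
toy_rest, toy_test = split_sketch(toy)
toy_train, toy_val = split_sketch(toy_rest)
print(len(toy_train), len(toy_val), len(toy_test))
```

Under this assumption, the three sets are disjoint, and the test rows are never seen during training or hyperparameter selection.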


Assuming that we have the model saved as "model.vw", we can load it and, finally, check our predictions on the test set.

In [28]:
model = vowpalwabbit.Workspace("-i model.vw")
predictions = model_selection.pred(model, testing)
test_rmse = np.sqrt(mean_squared_error(testing_df["rating"], predictions))
print(test_rmse)

0.9159833576032641
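
For reference, the test RMSE above is the square root of the mean squared difference between the true ratings and the predictions. A minimal NumPy-only sketch of the same quantity follows. The function name `rmse` and the toy numbers are illustrative and are not part of the notebook's modules.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up ratings: two exact predictions and one that is off by 2 stars.
toy_rmse = rmse([4.0, 3.0, 5.0], [4.0, 3.0, 3.0])
print(toy_rmse)
```

Because the errors are squared before averaging, a single prediction that is far off raises the RMSE more than several small misses of the same total size.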
