It can be difficult to quantify the impact of changing a hyperparameter on an embedding model. In this notebook we use matrix factorization to train a couple of movie embedding models based on the MovieLens dataset, and then use embedding comparison to see how changing the value of hyperparameters affects the learned embedding spaces.

We also show how certain hyperparameter settings for weighted alternating least squares (WALS) create embedding spaces very similar to those that we would learn with singular value decomposition (SVD).
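
To make "embedding comparison" concrete, here is a minimal sketch of one possible neighbor-based similarity between two embedding spaces that share the same rows. The names `knn_indices` and `neighbor_overlap` are hypothetical helpers written for this illustration. They are not the `embeddingcomp` API, whose implementation is not shown in this notebook, and the choice of cosine similarity and Jaccard overlap is an assumption.

```python
import numpy as np


def knn_indices(embeddings, k):
    """Return the indices of the k nearest neighbors (cosine similarity) of each row."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude each point from its own neighbor list
    return np.argsort(-sims, axis=1)[:, :k]


def neighbor_overlap(emb_a, emb_b, k=10):
    """Mean Jaccard overlap between the k-NN sets of matching rows in two spaces."""
    nn_a, nn_b = knn_indices(emb_a, k), knn_indices(emb_b, k)
    scores = []
    for row_a, row_b in zip(nn_a, nn_b):
        set_a, set_b = set(row_a), set(row_b)
        scores.append(len(set_a & set_b) / len(set_a | set_b))
    return float(np.mean(scores))


rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 10))
rotation, _ = np.linalg.qr(rng.normal(size=(10, 10)))  # random orthogonal matrix

# A rotation keeps every cosine similarity, so the neighbor sets should match.
same_geometry = neighbor_overlap(emb, emb @ rotation)
# An independently sampled space should share very few neighbors.
unrelated = neighbor_overlap(emb, rng.normal(size=(200, 10)))
print(same_geometry, unrelated)
```

A score near 1 means the two spaces agree on which items are close to each other, even when the raw coordinates differ.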

In [3]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import logging
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from tqdm import tqdm
import subprocess
from urllib.request import urlretrieve

from experiment_helpers import (
    get_embeddings_from_ratings_df,
    compare_embedding_maps
)
from embeddingcomp.comparison import CCAComparison, UnitMatchComparison, NeighborsComparison

tf.logging.set_verbosity(tf.logging.ERROR)
logging.getLogger().setLevel("ERROR")



# Load the data (MovieLens)

In [31]:
data_path = "../../../data"

# dataset = "ml-1m"
dataset = "ml-20m"

import shutil

# Remove any previous copy of the dataset before downloading it again
shutil.rmtree("{}/{}".format(data_path, dataset), ignore_errors=True)

urlretrieve("http://files.grouplens.org/datasets/movielens/{}.zip".format(dataset),
            "{}/{}.zip".format(data_path, dataset))

unzip_command = "unzip {}/{}.zip  -d {}".format(data_path, dataset, data_path)
subprocess.check_output(unzip_command, shell=True)

headers = ['user_id', 'item_id', 'rating', 'timestamp']
if dataset == "ml-1m":
    # The multi-character "::" separator is only supported by the Python parsing engine
    ratings_df = pd.read_csv("{}/{}/ratings.dat".format(data_path, dataset),
                             delimiter="::", header=None, names=headers, engine="python")

    # Load the movie titles
    movie_df = pd.read_csv("{}/{}/movies.dat".format(data_path, dataset),
                           delimiter="::", header=None, engine="python")
    id_to_title = dict(zip(movie_df[0].values, zip(movie_df[1].values, movie_df[2].values)))
elif dataset == "ml-20m":
    ratings_df = pd.read_csv("{}/{}/ratings.csv".format(data_path, dataset),
                             delimiter=",", header=0, names=headers)

    # Load the movie titles
    movie_df = pd.read_csv("{}/{}/movies.csv".format(data_path, dataset))
    id_to_title = dict(zip(movie_df['movieId'].values, zip(movie_df['title'].values, movie_df['genres'].values)))
else:
    raise ValueError("Unsupported dataset: {}".format(dataset))

# Replace the numeric item ids with (title, genres) tuples so that neighbors are readable
ratings_df['item_id'] = [id_to_title[item_id] for item_id in ratings_df['item_id'].values]

# Train an example WALS model


In [32]:
from embeddingcomp.neighbors import get_neighbors_table

# Train the matrix factorization model
movie_to_embedding = get_embeddings_from_ratings_df(
    ratings_df, algorithm="wals", latent_factors=10, unobs_weight=0.01)[1]

# Look at the nearest neighbors by movie
movie_names = list(movie_to_embedding.keys())
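# dict.values() is a view in Python 3, so convert it to a list before stacking
movie_embeddings = np.vstack(list(movie_to_embedding.values()))
table = get_neighbors_table(movie_embeddings, "brute")
# Print the 3 nearest neighbors of 3 randomly chosen movies
for index in np.random.permutation(len(movie_to_embedding))[:3]: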
    neighbor_indices = table.get_neighbors(index, 3)
    print("Query: {} \nResults: \n{}\n ----- \n".format(
        movie_names[index],
        "\n".join([str(movie_names[neighbor_index]) for neighbor_index in neighbor_indices])))


Query: ('Accattone (1961)', 'Drama') 
Results: 
('Young and the Damned, The (Olvidados, Los) (1950)', 'Crime|Drama')
('Investigation of a Citizen Above Suspicion (Indagine su un cittadino al di sopra di ogni sospetto) (1970)', 'Crime|Drama|Thriller')
('Cul-de-sac (1966)', 'Comedy|Crime|Drama|Thriller')
 ----- 

Query: ('They Live (1988)', 'Action|Sci-Fi|Thriller') 
Results: 
('Scanners (1981)', 'Horror|Sci-Fi|Thriller')
('Shogun Assassin (1980)', 'Action|Adventure')
('Big Boss, The (Fists of Fury) (Tang shan da xiong) (1971)', 'Action|Thriller')
 ----- 

Query: ('Dragon Ball Z: Broly Second Coming (Doragon bôru Z 10: Kiken na futari! Sûpâ senshi wa nemurenai) (1994)', 'Action|Adventure|Animation') 
Results: 
('Ironclad (2011)', 'Action|Adventure')
('Secuestrados (Kidnapped) (2010)', 'Horror|Thriller')
('Sharknado (2013)', 'Sci-Fi')
 ----- 



# Look at the effect of a hyperparameter on the learned WALS embedding space
One of the important hyperparameters of a WALS model is the weight assigned to the unobserved elements of the ratings matrix. In this experiment we look at how increasing that weight changes the learned movie embedding space.
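
For reference, a common way to write a weighted matrix factorization objective of this kind is given below. The notation is an assumption made for this note: $R$ is the ratings matrix, $\Omega$ the set of observed (user, item) pairs, $u_i$ and $v_j$ the user and item embeddings, $w_0$ the unobserved weight, and $\lambda$ a regularization strength. The exact weighting and regularization used by `get_embeddings_from_ratings_df` may differ.

$$
L(U, V) = \sum_{(i,j) \in \Omega} \left( R_{ij} - u_i^\top v_j \right)^2
        + w_0 \sum_{(i,j) \notin \Omega} \left( u_i^\top v_j \right)^2
        + \lambda \left( \lVert U \rVert_F^2 + \lVert V \rVert_F^2 \right)
$$

With $w_0 = 0$ only the observed ratings shape the embeddings. As $w_0$ grows, the model is pushed harder to predict zero for every pair that was never rated.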

In [33]:
alg = "wals"
factors = 10
uweights = [0.0, 0.01, 0.1, 0.5, 1.0, 2.0]

baseline_movie_embedding = get_embeddings_from_ratings_df(ratings_df, alg, factors, 0.0)[1]
retrained_movie_embeddings = [get_embeddings_from_ratings_df(ratings_df, alg, factors, uweight)[1]
                              for uweight in uweights]
similarities = [compare_embedding_maps(baseline_movie_embedding, ret) for ret in retrained_movie_embeddings]

plt.figure(figsize=(12, 4))
plt.subplot(1,2,1)
plt.title("Neighbor Method")
plt.xlabel("Unobserved Element Weight")
plt.ylabel("Similarity to Unobserved Weight = 0")
plt.plot(uweights, [s["neighbor"] for s in similarities])

plt.subplot(1,2,2)
plt.title("CCA Method")
plt.xlabel("Unobserved Element Weight")
plt.ylabel("Similarity to Unobserved Weight = 0")
plt.plot(uweights, [s["cca"] for s in similarities])
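
The right-hand panel is labeled "CCA Method", but `compare_embedding_maps` lives in `experiment_helpers` and its code is not shown here. As a rough illustration of what a CCA-based similarity can look like, the sketch below computes the mean canonical correlation between two embedding matrices with matching rows. The function name `mean_cca_similarity` is hypothetical, and the sketch assumes both matrices have full column rank.

```python
import numpy as np


def mean_cca_similarity(emb_a, emb_b):
    """Mean canonical correlation between two embedding matrices with matching rows."""
    # Center each space so the correlations do not depend on the mean.
    a = emb_a - emb_a.mean(axis=0)
    b = emb_b - emb_b.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered matrices.
    q_a, _ = np.linalg.qr(a)
    q_b, _ = np.linalg.qr(b)
    # The singular values of q_a^T q_b are the canonical correlations.
    correlations = np.linalg.svd(q_a.T @ q_b, compute_uv=False)
    return float(np.clip(correlations, 0.0, 1.0).mean())


rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 10))
linear_map = rng.normal(size=(10, 10))  # arbitrary linear map, invertible almost surely

# A linear map of the same space keeps all canonical correlations close to 1.
print(mean_cca_similarity(emb, emb @ linear_map))
# An independently sampled space gives a much lower score.
print(mean_cca_similarity(emb, rng.normal(size=(500, 10))))
```

Unlike a neighbor-based score, this measure is invariant to any invertible linear transformation of either space, so the two methods can disagree when a hyperparameter change stretches the space without changing its span.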


Next, let's train an SVD model and see how it compares to these learned WALS models. The loss that SVD minimizes weights every element of the matrix equally rather than favoring the observed ones, so the SVD embeddings should be most similar to the WALS embeddings trained with a larger unobserved weight.
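
The relationship between the two losses can be checked on a toy matrix. The sketch below is an illustration under stated assumptions: unrated cells are filled with zeros, `weighted_loss` and `truncated_svd` are hypothetical helpers written for this note, and nothing here reflects how `get_embeddings_from_ratings_df` is implemented.

```python
import numpy as np


def weighted_loss(ratings, observed, approx, unobs_weight):
    """Squared error with weight 1 on observed cells and unobs_weight on the rest."""
    weights = np.where(observed, 1.0, unobs_weight)
    return float(np.sum(weights * (ratings - approx) ** 2))


def truncated_svd(matrix, rank):
    """Best rank-`rank` approximation in Frobenius norm (Eckart-Young theorem)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


rng = np.random.default_rng(0)
observed = rng.random((30, 20)) < 0.2  # roughly 20% of the cells are rated
ratings = np.where(observed, rng.integers(1, 6, size=(30, 20)), 0).astype(float)

approx = truncated_svd(ratings, rank=5)
frobenius = float(np.sum((ratings - approx) ** 2))

# With unobs_weight = 1 the weighted loss is exactly the loss that SVD minimizes.
print(np.isclose(weighted_loss(ratings, observed, approx, 1.0), frobenius))
# With unobs_weight = 0 only the observed cells contribute to the loss.
print(weighted_loss(ratings, observed, approx, 0.0) <= frobenius)
```

Regularization is left out of this sketch, so it only shows that the data-fitting terms coincide when the unobserved weight equals 1.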

In [None]:
svd_movie_embedding = get_embeddings_from_ratings_df(ratings_df, "svd", factors, None)[1]
svd_similarities = [compare_embedding_maps(svd_movie_embedding, ret) for ret in retrained_movie_embeddings]

plt.figure(figsize=(12, 4))
plt.subplot(1,2,1)
plt.title("Neighbor Method")
plt.xlabel("Unobserved Element Weight")
plt.ylabel("Similarity to SVD Embeddings")
plt.plot(uweights, [s["neighbor"] for s in svd_similarities])

plt.subplot(1,2,2)
plt.title("CCA Method")
plt.xlabel("Unobserved Element Weight")
plt.ylabel("Similarity to SVD Embeddings")
plt.plot(uweights, [s["cca"] for s in svd_similarities])