## Introduction

In this notebook we will use the TensorFlow Recommenders (TFRS) library to build a recommendation system. TFRS is a library for building deep learning recommendation systems. It supports the full workflow: data preparation, model formulation, training, evaluation, and deployment. TFRS is built on top of TensorFlow 2 and Keras, and is designed to be scalable and easy to use.

We first focus on a retrieval system, a model that selects a set of movies from the catalogue that the user is likely to watch. We treat the dataset as implicit feedback. This means that we are not trying to predict the rating a user will give to a movie. Instead, we rank movies by their relevance to the user. This is a common scenario in recommendation systems, where the goal is to predict the items a user is most likely to interact with.

Treating MovieLens as implicit feedback means that we interpret users' actions (watching movies) as indicators of their preferences. Specifically:

1. Every movie a user has watched is considered a positive example, indicating that they like or are interested in that movie.
2. Every movie a user hasn't watched is treated as an implicit negative example, implying that they haven't shown interest in it or haven't been exposed to it yet.

This approach helps us make predictions about which movies users might enjoy based on their past behavior without requiring explicit feedback or ratings for each movie.
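The two rules above can be sketched in a few lines of plain Python. This is only an illustration on hypothetical toy data: the `ratings` triples, the `catalogue` set, and both helper functions are assumptions made for this example and are not part of the notebook's pipeline.

```python
# Hypothetical toy data: (user_id, movie_title, rating) triples.
ratings = [
    ("u1", "Toy Story (1995)", 5.0),
    ("u1", "Fargo (1996)", 2.0),
    ("u2", "Fargo (1996)", 4.0),
]
catalogue = {"Toy Story (1995)", "Fargo (1996)", "Heat (1995)"}


def implicit_positives(rows):
    """Ignore the rating value and keep every (user, movie) interaction as a positive."""
    positives = {}
    for user_id, movie_title, _rating in rows:
        positives.setdefault(user_id, set()).add(movie_title)
    return positives


def implicit_negatives(positives, all_movies):
    """Treat every catalogue movie a user has not interacted with as an implicit negative."""
    return {user: all_movies - watched for user, watched in positives.items()}


positives = implicit_positives(ratings)
negatives = implicit_negatives(positives, catalogue)
```

Note that the low rating of 2.0 still counts as a positive here: under the implicit interpretation only the interaction matters, not the score attached to it.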

In a second step, we will build a ranking model that predicts the top 10 movies a user is likely to watch. This model is trained on the same dataset as the retrieval model, but it is optimized for the order of the recommended movies rather than for selecting relevant candidates.

## Imports

In [1]:
!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets
!pip install -q scann

ERROR: Could not find a version that satisfies the requirement scann (from versions: none)
ERROR: No matching distribution found for scann

In [2]:
import os
import pprint
import tempfile
import pandas as pd

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

## Preprocessing

For TFRS we will use the MovieLens dataset from [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/movielens), which loads directly as a `tf.data.Dataset` and therefore works well with TFRS. It is the same dataset that was used for the preprocessing and EDA. However, we use a smaller subset here so that the notebook can run:

- 100k-ratings: This dataset contains 100,000 ratings from 943 users on 1,682 movies. This dataset is the oldest version of the MovieLens dataset.
- 100k-movies: This dataset contains data of 1,682 movies rated in the 100k dataset.

In [3]:
# Ratings data.
ratings = tfds.load("movielens/100k-ratings", split="train")

# Movie data.
movies = tfds.load("movielens/100k-movies", split="train")

2024-04-29 17:52:49.471188: W external/local_tsl/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".


Downloading and preparing dataset 4.70 MiB (download: 4.70 MiB, generated: 32.41 MiB, total: 37.10 MiB) to /Users/maltehaupt/tensorflow_datasets/movielens/100k-ratings/0.1.1...
Dataset movielens downloaded and prepared to /Users/maltehaupt/tensorflow_datasets/movielens/100k-ratings/0.1.1. Subsequent calls will reuse this data.
Downloading and preparing dataset 4.70 MiB (download: 4.70 MiB, generated: 150.35 KiB, total: 4.84 MiB) to /Users/maltehaupt/tensorflow_datasets/movielens/100k-movies/0.1.1...
Dataset movielens downloaded and prepared to /Users/maltehaupt/tensorflow_datasets/movielens/100k-movies/0.1.1. Subsequent calls will reuse this data.


In [4]:
for x in ratings.take(1).as_numpy_iterator():
  pprint.pprint(x)

{'bucketized_user_age': 45.0,
 'movie_genres': array([7]),
 'movie_id': b'357',
 'movie_title': b"One Flew Over the Cuckoo's Nest (1975)",
 'raw_user_age': 46.0,
 'timestamp': 879024327,
 'user_gender': True,
 'user_id': b'138',
 'user_occupation_label': 4,
 'user_occupation_text': b'doctor',
 'user_rating': 4.0,
 'user_zip_code': b'53211'}


2024-04-29 17:53:06.694864: W tensorflow/core/kernels/data/cache_dataset_ops.cc:858] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2024-04-29 17:53:06.696368: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


In [5]:
for x in movies.take(1).as_numpy_iterator():
  pprint.pprint(x)

{'movie_genres': array([4]),
 'movie_id': b'1681',
 'movie_title': b'You So Crazy (1994)'}


2024-04-29 17:53:06.715090: W tensorflow/core/kernels/data/cache_dataset_ops.cc:858] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2024-04-29 17:53:06.715368: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


In this notebook we focus only on the ratings dataset and keep just its `user_id` and `movie_title` fields. From the movies dataset we keep only the titles.

In [6]:
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
})
movies = movies.map(lambda x: x["movie_title"])

In [7]:
# Set the random seed for reproducibility
tf.random.set_seed(42)

# Shuffle the dataset with a specified buffer size and seed
shuffled = ratings.shuffle(buffer_size=100_000, seed=42, reshuffle_each_iteration=False)

# Take the first 80,000 examples for the training set (80% of the data)
train_dataset = shuffled.take(80_000)

# Skip the first 80,000 examples and take the next 20,000 for the test set (20% of the data)
test_dataset = shuffled.skip(80_000).take(20_000)

Let's investigate the unique user IDs and movie titles in the dataset.

In [8]:
# Unique user ids and movie titles
user_ids_vocabulary = ratings.batch(1_000_000).map(lambda x: x["user_id"])
movie_titles_vocabulary = ratings.batch(1_000_000).map(lambda x: x["movie_title"])

unique_movie_titles = np.unique(np.concatenate(list(movie_titles_vocabulary)))
unique_user_ids = np.unique(np.concatenate(list(user_ids_vocabulary)))

unique_movie_titles[:10]

2024-04-29 18:04:10.996154: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2024-04-29 18:04:12.294915: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


array([b"'Til There Was You (1997)", b'1-900 (1994)',
       b'101 Dalmatians (1996)', b'12 Angry Men (1957)', b'187 (1997)',
       b'2 Days in the Valley (1996)',
       b'20,000 Leagues Under the Sea (1954)',
       b'2001: A Space Odyssey (1968)',
       b'3 Ninjas: High Noon At Mega Mountain (1998)',
       b'39 Steps, The (1935)'], dtype=object)


Extracting the unique user IDs and movie titles gives us the vocabularies for both features. A model cannot consume raw strings directly, so each user ID and movie title is typically mapped to an integer index, which is then used to look up an embedding vector. Building the vocabulary once up front ensures that every user and every movie is represented by exactly one index, and the vocabulary size determines how large the corresponding embedding table has to be.
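To make the idea of a vocabulary concrete, here is a minimal plain-Python sketch of mapping raw IDs to integer indices. The helpers `build_vocabulary` and `lookup`, the sample IDs, and the choice of reserving index 0 for unknown values are all assumptions made for this illustration. In a TensorFlow model this job would normally be done by a lookup layer rather than a dictionary.

```python
def build_vocabulary(values):
    """Map each distinct value to an integer index, starting at 1.

    Index 0 is reserved for values not seen when the vocabulary was built
    (an assumption of this sketch).
    """
    return {value: index for index, value in enumerate(sorted(set(values)), start=1)}


def lookup(vocabulary, value):
    """Return the index of a value, or 0 if the value is unknown."""
    return vocabulary.get(value, 0)


# Hypothetical sample of raw user IDs, stored as bytes like in the dataset.
user_ids = [b"138", b"92", b"138", b"301"]
vocab = build_vocabulary(user_ids)

known_index = lookup(vocab, b"138")
unknown_index = lookup(vocab, b"999")
```

Duplicates collapse into a single entry, so the vocabulary has three entries for the four sample IDs, and an ID that was never seen falls back to the reserved index.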

## Model

We will build a two-tower retrieval model using the TFRS library. This architecture is commonly used in recommendation systems for retrieval tasks such as recommending items to users. The "two towers" are two separate components of the model:

- Query tower: This tower represents the user's query or input. It typically takes user-specific information or features, such as user demographics, past interactions, or explicit preferences, and encodes them into a fixed-length representation.
- Candidate tower: This tower represents the items in the catalog or database that are being considered for recommendation. It takes item-specific information or features, such as item attributes, metadata, or embeddings, and encodes them into a fixed-length representation.

After encoding both the query and the candidate items, the model computes a similarity score between them.
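The scoring step can be illustrated with a small plain-Python sketch. It assumes that the similarity score is the dot product of the two fixed-length representations, which is a common choice but is not stated above. The embedding values, the movie names, and the helpers `dot` and `top_k` are hypothetical and stand in for the outputs of trained towers.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_embedding, candidate_embeddings, k=2):
    """Score every candidate against the query and return the k best titles."""
    scored = [
        (dot(query_embedding, embedding), title)
        for title, embedding in candidate_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _score, title in scored[:k]]


# Hypothetical 2-dimensional embeddings, as if produced by the two towers.
query = [0.9, 0.1]
candidates = {
    "Movie A": [1.0, 0.0],
    "Movie B": [0.0, 1.0],
    "Movie C": [0.7, 0.7],
}

recommended = top_k(query, candidates, k=2)
```

Because both towers map into the same vector space, retrieval reduces to scoring the query against the candidate embeddings and keeping the highest-scoring items.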