<a href="https://colab.research.google.com/github/ashmibanerjee/tf-recommenders/blob/main/notebook/TF%20Recommenders.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 0: Defining the `imports` and `utilities`

In [1]:
!pip install tensorflow_recommenders

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tensorflow_recommenders
  Downloading tensorflow_recommenders-0.7.3-py3-none-any.whl (96 kB)
Installing collected packages: tensorflow_recommenders
Successfully installed tensorflow_recommenders-0.7.3


In [2]:
import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

import tensorflow_recommenders as tfrs

In [9]:
# utility function to pretty print the data
def pprint_data(data: tf.data.Dataset, n: int = 1):
    for x in data.take(n).as_numpy_iterator():
        pprint.pprint(x)

# Step 1: Preparing the Data


## Step 1.1: Download the data from `tensorflow_datasets`

In [4]:
# Ratings data.
ratings = tfds.load("movielens/100k-ratings", split="train")
# Features of all the available movies.
movies = tfds.load("movielens/100k-movies", split="train")

Downloading and preparing dataset 4.70 MiB (download: 4.70 MiB, generated: 32.41 MiB, total: 37.10 MiB) to /root/tensorflow_datasets/movielens/100k-ratings/0.1.1...


Dataset movielens downloaded and prepared to /root/tensorflow_datasets/movielens/100k-ratings/0.1.1. Subsequent calls will reuse this data.
Downloading and preparing dataset 4.70 MiB (download: 4.70 MiB, generated: 150.35 KiB, total: 4.84 MiB) to /root/tensorflow_datasets/movielens/100k-movies/0.1.1...


Dataset movielens downloaded and prepared to /root/tensorflow_datasets/movielens/100k-movies/0.1.1. Subsequent calls will reuse this data.


In [11]:
print("Ratings:")
pprint_data(ratings)
print("Movies")
pprint_data(movies)

Ratings:
{'bucketized_user_age': 45.0,
 'movie_genres': array([7]),
 'movie_id': b'357',
 'movie_title': b"One Flew Over the Cuckoo's Nest (1975)",
 'raw_user_age': 46.0,
 'timestamp': 879024327,
 'user_gender': True,
 'user_id': b'138',
 'user_occupation_label': 4,
 'user_occupation_text': b'doctor',
 'user_rating': 4.0,
 'user_zip_code': b'53211'}
Movies
{'movie_genres': array([4]),
 'movie_id': b'1681',
 'movie_title': b'You So Crazy (1994)'}


## Step 1.2: Clean up the data

In this example, we're going to focus on the ratings data.

Hence, we keep only the `user_id` and `movie_title` fields in the ratings dataset, and reduce the movies dataset to its titles.

In [12]:
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
})
movies = movies.map(lambda x: x["movie_title"])

pprint_data(ratings)
pprint_data(movies)

## Step 1.3: Train-Test Split

To fit and evaluate the model, we need to split the data into a training set and an evaluation set. In an industrial recommender system, this would most likely be done by time: the data up to time  𝑇  would be used to predict interactions after  𝑇 .

In this simple example, however, let's use a random split, putting 80% of the ratings in the train set, and 20% in the test set.

In [13]:
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

pprint_data(train)

{'movie_title': b'Postman, The (1997)', 'user_id': b'681'}
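
For comparison with the random split used in this notebook, the time-based split mentioned above could look roughly like the following plain-Python sketch. The toy records and the `temporal_split` helper are hypothetical and only illustrate the idea; they are not part of TFRS or `tf.data`.

```python
from typing import Dict, List, Tuple

Rating = Dict[str, object]


def temporal_split(ratings: List[Rating], train_fraction: float = 0.8
                   ) -> Tuple[List[Rating], List[Rating]]:
    """Split ratings by time: the oldest interactions train, the newest test."""
    ordered = sorted(ratings, key=lambda r: r["timestamp"])
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical toy interactions (timestamps are made up for illustration).
toy_ratings = [
    {"user_id": "1", "movie_title": "A", "timestamp": 50},
    {"user_id": "2", "movie_title": "B", "timestamp": 10},
    {"user_id": "1", "movie_title": "C", "timestamp": 40},
    {"user_id": "3", "movie_title": "D", "timestamp": 20},
    {"user_id": "2", "movie_title": "E", "timestamp": 30},
]

train_part, test_part = temporal_split(toy_ratings)
print([r["timestamp"] for r in train_part])
print([r["timestamp"] for r in test_part])
```

Every test interaction happens after every training interaction, which is what prevents the model from "seeing the future" during training.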


## Step 1.4: Unique `user_ids` and `movie_titles`

* We figure out unique user ids and movie titles present in the data.

* We batch the movies dataset in batches of 1,000 titles and the ratings dataset in batches of 1,000,000 ratings. Since there are only 100,000 ratings, all of them fit in a single batch.

* Batching lets us read many values at a time, so each dataset can be collected as a few large arrays instead of one element at a time.

* The unique values matter because we need to be able to map the raw values of our categorical features to embedding vectors in our models.
* To do that, we need a vocabulary that maps a raw feature value to an integer in a contiguous range: this allows us to look up the corresponding embeddings in our embedding tables.

In [None]:
movie_titles = movies.batch(1_000)
user_ids = ratings.batch(1_000_000).map(lambda x: x["user_id"])

unique_movie_titles = np.unique(np.concatenate(list(movie_titles)))
unique_user_ids = np.unique(np.concatenate(list(user_ids)))

unique_movie_titles[:10]
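
The vocabulary idea described above can be sketched without TensorFlow. The helpers `build_vocabulary` and `build_index` and the sample ids below are hypothetical; they only show how raw values end up as integers in a contiguous range.

```python
from typing import Dict, Iterable, List


def build_vocabulary(values: Iterable[bytes]) -> List[bytes]:
    """Return the sorted unique values, similar in spirit to np.unique."""
    return sorted(set(values))


def build_index(vocabulary: List[bytes]) -> Dict[bytes, int]:
    """Map every vocabulary entry to an integer in the range 0..len-1."""
    return {token: i for i, token in enumerate(vocabulary)}


# Hypothetical raw user ids, with repeats, as they might appear in ratings.
raw_user_ids = [b"138", b"92", b"138", b"301", b"92"]

vocabulary = build_vocabulary(raw_user_ids)
index = build_index(vocabulary)
print(vocabulary)
print(index)
```

Note that the ids are byte strings, so they sort lexicographically rather than numerically.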

# Step 2: Model Implementation

Since we are building a two-tower retrieval model, we can build each tower (the `Query Tower`/`User Embeddings` and the `Candidate Tower`/`Item Embeddings`) separately and then combine them in the final model.

## Step 2.1: The Query Tower

* Define the `dimensionality`
 
 The first step is to decide on the dimensionality of the query and candidate representations. 

 *Higher values will correspond to models that may be more accurate, but will also be slower to fit and more prone to overfitting.*


In [None]:
embedding_dimension = 32
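
To make the cost side of this trade-off concrete, the following sketch counts the weights in the two embedding tables for a few dimensions. The vocabulary sizes are hypothetical round numbers, and the extra row per table is an assumption that matches the out-of-vocabulary row used by the models in this notebook.

```python
def embedding_parameters(vocabulary_size: int, embedding_dimension: int) -> int:
    """Number of weights in an embedding table with one extra row for unknowns."""
    return (vocabulary_size + 1) * embedding_dimension


# Hypothetical vocabulary sizes, chosen only to make the arithmetic concrete.
num_users, num_movies = 1_000, 2_000

for dim in (16, 32, 64):
    total = (embedding_parameters(num_users, dim)
             + embedding_parameters(num_movies, dim))
    print(dim, total)
```

The parameter count grows linearly with the embedding dimension, so doubling the dimension doubles the number of weights to fit.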

* Defining the `user_model`

 Here, we're going to use Keras preprocessing layers to first convert user ids to integers, and then convert those to user embeddings via an Embedding layer. 
 
 * Note that we use the list of unique user ids we computed earlier as a vocabulary.

This model corresponds to the **classic matrix factorization** approach.

In [None]:
user_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=unique_user_ids, mask_token=None),
  # We add an additional embedding to account for unknown tokens.
  tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension)
])
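
As a rough mental model of what the two layers do together, here is a plain-Python stand-in. The `ToyEmbeddingModel` class is hypothetical; it assumes that unknown tokens map to index 0 and known tokens start at 1, which is how `StringLookup` behaves with `mask_token=None` and a single out-of-vocabulary index. The random initial values are arbitrary.

```python
import random
from typing import Dict, List

EMBEDDING_DIMENSION = 4  # kept tiny for readability; the notebook uses 32


class ToyEmbeddingModel:
    """Plain-Python stand-in for StringLookup followed by Embedding."""

    def __init__(self, vocabulary: List[bytes], dimension: int, seed: int = 0):
        rng = random.Random(seed)
        # Index 0 is reserved for unknown tokens; known tokens start at 1.
        self.token_to_index: Dict[bytes, int] = {
            token: i + 1 for i, token in enumerate(vocabulary)
        }
        # One row per vocabulary entry plus one extra row for unknown tokens.
        self.table = [
            [rng.uniform(-0.05, 0.05) for _ in range(dimension)]
            for _ in range(len(vocabulary) + 1)
        ]

    def lookup(self, token: bytes) -> int:
        """Return the integer index for a token, or 0 if it is unknown."""
        return self.token_to_index.get(token, 0)

    def __call__(self, token: bytes) -> List[float]:
        """Return the embedding row for a token."""
        return self.table[self.lookup(token)]


toy_user_model = ToyEmbeddingModel([b"138", b"301", b"92"], EMBEDDING_DIMENSION)
print(toy_user_model.lookup(b"301"))     # known id
print(toy_user_model.lookup(b"unseen"))  # unknown id falls back to row 0
print(len(toy_user_model(b"301")))
```

This is why the embedding table has `len(unique_user_ids) + 1` rows: without the extra row, an unseen id would have nowhere to point.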

## Step 2.2: The Candidate Tower

Similarly, we build the `candidate tower`/`item embeddings`.

In [None]:
movie_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=unique_movie_titles, mask_token=None),
  tf.keras.layers.Embedding(len(unique_movie_titles) + 1, embedding_dimension)
])

## Step 2.3: Defining the `Metrics`

In our training data, we have positive pairs of (user, movie) instances. To evaluate the performance of our model, we need to compare the affinity score generated by the model for each positive pair with the scores of all other potential candidates. If the score for the positive pair is higher than the scores of all other candidates, it indicates that our model is highly accurate.

To do this, we can use the `tfrs.metrics.FactorizedTopK` metric. It takes one required argument: the dataset of candidates that are used as implicit negatives for evaluation.

In our specific scenario, the dataset of candidates corresponds to the movies dataset. To facilitate the evaluation process, we convert the movie dataset into embeddings using our movie model. These embeddings capture the latent representations of the movies and enable us to compare them with the positive pairs during evaluation.

In [None]:
metrics = tfrs.metrics.FactorizedTopK(
  candidates=movies.batch(128).map(movie_model)
)
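
The idea behind a top-K metric can be illustrated in plain Python. This is a simplified sketch with hypothetical 2-d embeddings and a hypothetical `top_k_hit` helper, not the TFRS implementation: it checks whether the observed positive movie ranks among the K highest-scoring candidates for a user.

```python
from typing import Dict, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product, used here as the affinity score."""
    return sum(x * y for x, y in zip(a, b))


def top_k_hit(query: Sequence[float], positive: str,
              candidates: Dict[str, Sequence[float]], k: int) -> bool:
    """True if the positive candidate is among the k highest-scoring ones."""
    ranked = sorted(candidates, key=lambda name: dot(query, candidates[name]),
                    reverse=True)
    return positive in ranked[:k]


# Hypothetical 2-d embeddings; real ones would come from the two towers.
candidate_embeddings = {
    "Movie A": [1.0, 0.0],
    "Movie B": [0.0, 1.0],
    "Movie C": [0.7, 0.7],
}
user_embedding = [0.9, 0.2]

# (user, "Movie C") is the observed positive pair in this toy example.
print(top_k_hit(user_embedding, "Movie C", candidate_embeddings, k=1))
print(top_k_hit(user_embedding, "Movie C", candidate_embeddings, k=2))
```

Averaging such hits over all positive pairs in the evaluation set gives a top-K accuracy for each K.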

## Step 2.4: Defining the `Loss` function

TFRS has several loss layers and tasks to make this easy.

In this instance, we'll make use of the Retrieval task object: a convenience wrapper that bundles together the loss function and metric computation.

In [None]:
task = tfrs.tasks.Retrieval(
  metrics=metrics
)
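
To give a feel for the kind of loss a retrieval task computes, here is a plain-Python sketch of an in-batch softmax cross-entropy. It assumes that each query's positive is the candidate at the same position in the batch and that the other candidates in the batch act as negatives; it leaves out refinements such as temperature scaling or sampling corrections, and the `in_batch_softmax_loss` helper is hypothetical.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product, used here as the affinity score."""
    return sum(x * y for x, y in zip(a, b))


def in_batch_softmax_loss(queries: List[Sequence[float]],
                          candidates: List[Sequence[float]]) -> float:
    """Sum of cross-entropy losses where row i's positive is candidate i."""
    total = 0.0
    for i, query in enumerate(queries):
        scores = [dot(query, candidate) for candidate in candidates]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[i]
    return total


# Hypothetical embeddings: aligned pairs versus deliberately swapped pairs.
toy_queries = [[2.0, 0.0], [0.0, 2.0]]
aligned = [[2.0, 0.0], [0.0, 2.0]]
swapped = [[0.0, 2.0], [2.0, 0.0]]

good = in_batch_softmax_loss(toy_queries, aligned)
bad = in_batch_softmax_loss(toy_queries, swapped)
print(good, bad)
```

The loss is small when each user embedding scores its own positive movie higher than the other movies in the batch, and large when it does not.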

## Step 2.5: Combining the Models: The Full Model

* Now we can combine all the components into a complete model. 
* TensorFlow Recommenders (TFRS) provides a base model class, `tfrs.models.Model`, that simplifies the process of building models. 
* All we need to do is set up the necessary components within the `__init__` method and implement the `compute_loss` method, which takes the raw features as input and returns the corresponding loss value.

Because we build on the base model class, the framework creates an appropriate training loop for us. This lets us focus on defining the components and the loss computation specific to our recommendation task.

In [None]:
class MovielensModel(tfrs.Model):

    def __init__(self, user_model, movie_model):
        super().__init__()
        self.movie_model: tf.keras.Model = movie_model
        self.user_model: tf.keras.Model = user_model
        # Note: `task` is the module-level Retrieval task defined in Step 2.4.
        self.task: tf.keras.layers.Layer = task

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        # We pick out the user features and pass them into the user model.
        user_embeddings = self.user_model(features["user_id"])
        # And pick out the movie features and pass them into the movie model,
        # getting embeddings back.
        positive_movie_embeddings = self.movie_model(features["movie_title"])

        # The task computes the loss and the metrics.
        return self.task(user_embeddings, positive_movie_embeddings)

The `tfrs.Model` base class is simply a convenience class: it allows us to compute both training and test losses using the same method.

Under the hood, it's still a plain `Keras` model. 

# Step 3: Model Training, Fitting & Evaluation


In [None]:
# model instantiation
model = MovielensModel(user_model, movie_model)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

In [None]:
# shuffle, batch, and cache the training and evaluation data
cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()
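
With these batch sizes, the number of steps per epoch follows from simple arithmetic. The sketch below assumes the 80,000/20,000 split from Step 1.3 and that the final partial batch is kept, which is the default behaviour of `batch`.

```python
import math

train_examples, train_batch_size = 80_000, 8_192
test_examples, test_batch_size = 20_000, 4_096

# Number of batches needed to cover each dataset once.
train_steps = math.ceil(train_examples / train_batch_size)
test_steps = math.ceil(test_examples / test_batch_size)

# Size of the final, partially filled training batch.
last_train_batch = train_examples - (train_steps - 1) * train_batch_size

print(train_steps, test_steps, last_train_batch)
```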

In [None]:
# train the model
model.fit(cached_train, epochs=3)

In [None]:
# evaluate on the test set
model.evaluate(cached_test, return_dict=True)

# Step 4: Making Predictions

In [None]:
# Create a model that takes in raw query features, and
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)

# recommends movies out of the entire movies dataset.
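# With tensorflow_recommenders 0.7.x (installed above), datasets are indexed via
# `index_from_dataset`, which expects (identifier, embedding) pairs.
index.index_from_dataset(
    tf.data.Dataset.zip((movies.batch(100), movies.batch(100).map(model.movie_model)))
)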

# Get recommendations.
_, titles = index(tf.constant(["42"]))
print(f"Recommendations for user 42: {titles[0, :5]}")
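
Conceptually, brute-force retrieval scores every candidate against the query and keeps the best K. The following plain-Python sketch shows that idea with hypothetical 2-d embeddings and a hypothetical `brute_force_top_k` helper; it is not how the `BruteForce` layer is implemented internally.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def brute_force_top_k(query: Sequence[float],
                      candidates: Dict[str, Sequence[float]],
                      k: int) -> List[Tuple[float, str]]:
    """Score every candidate against the query and return the k best."""
    scored = (
        (sum(q * c for q, c in zip(query, embedding)), title)
        for title, embedding in candidates.items()
    )
    return heapq.nlargest(k, scored)


# Hypothetical movie embeddings and a hypothetical user embedding.
toy_movies = {
    "Movie A": [1.0, 0.0],
    "Movie B": [0.0, 1.0],
    "Movie C": [0.7, 0.7],
}
toy_user = [0.1, 0.9]

top_two = brute_force_top_k(toy_user, toy_movies, k=2)
print([title for _, title in top_two])
```

Like the `index(...)` call above, the helper returns scores together with identifiers, so the caller can keep only the titles. Scoring every candidate is fine for a catalogue of this size; much larger catalogues usually call for approximate nearest-neighbour search instead.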