# ML Model Test: Retrieval

This is a follow-up to the TensorFlow Recommenders tutorials. In this notebook, we focus on the "Retrieval" stage of a recommender system. All the details are in the [original tutorial](https://www.tensorflow.org/recommenders/examples/basic_retrieval).<br>
We strongly recommend creating a **virtual environment** before running the following code. Let's start by installing our dependencies.

In [1]:
!pip install tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets





## Imports

Next, let's import the necessary packages.

In [2]:
import os
import pprint

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

import tensorflow_recommenders as tfrs

## Dataset

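The data is loaded with TensorFlow Datasets (`tfds`). We follow the MovieLens ratings and movies setup of the original tutorial, loading the ratings from `ccp2_capstone_ratings` and the item features from `ccp2_capstone_media_items`. We load all the data from the `train` split and split it ourselves later.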

In [9]:
# Ratings data.
ratings = tfds.load("ccp2_capstone_ratings", split="train")
# Features of all the available movies.
movies = tfds.load("ccp2_capstone_media_items", split="train")

Let's take a look at the data structure:

In [10]:
for x in ratings.take(1).as_numpy_iterator():
    print("Rating: ")
    pprint.pprint(x)

for x in movies.take(1).as_numpy_iterator():
    print("Movie: ")
    pprint.pprint(x)

Rating: 
{'media_id': b'357',
 'media_title': b'Sifan',
 'user_id': b'138',
 'user_rating': 4.0}
Movie: 
{'media_id': b'1198', 'media_title': b'Classroom\xe2\x98\x86Crisis'}


You can change the argument of `take` in the previous for-loops if you would like to see more examples. The next step is to process the data. To train our model we only need `user_id` and the title, which is stored as `media_title` and renamed to `movie_title` below.

In [11]:
ratings = ratings.map(lambda x: {
    "movie_title": x["media_title"],
    "user_id": x["user_id"],
})
movies = movies.map(lambda x: x["media_title"])

Let's now split the data into `train` and `test` sets, so that we can evaluate the model on data it has not seen during training.

In [12]:
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)
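
The shuffle-then-`take`/`skip` pattern can be illustrated without TensorFlow. The sketch below is only an analogy in plain Python: `split_train_test` is a hypothetical helper, and `random.Random(seed).shuffle` stands in for the seeded `shuffle` call above. It shows why the shuffle order must stay fixed: slicing a single fixed ordering produces train and test sets that do not overlap.

```python
import random

def split_train_test(examples, train_size, test_size, seed=42):
    """Shuffle once with a fixed seed, then slice into train and test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)                # one fixed shuffle order
    train = shuffled[:train_size]                        # analogous to .take(train_size)
    test = shuffled[train_size:train_size + test_size]   # analogous to .skip(...).take(...)
    return train, test

# Toy stand-in for the ratings dataset: 10 example ids.
train_set, test_set = split_train_test(range(10), train_size=8, test_size=2)
print(len(train_set), len(test_set))
print(set(train_set).isdisjoint(test_set))
```

In the notebook, `reshuffle_each_iteration=False` plays the role of the single fixed ordering. Without it, each pass over `shuffled` could use a different order, and the two splits could share examples.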

Next, we identify the unique `user_id`s and `movie_title`s. These form the vocabularies that map raw feature values to embedding vectors.

In [13]:
movie_titles = movies.batch(1_000)
user_ids = ratings.batch(1_000_000).map(lambda x: x["user_id"])

unique_movie_titles = np.unique(np.concatenate(list(movie_titles)))
unique_user_ids = np.unique(np.concatenate(list(user_ids)))

unique_movie_titles[:10]

array([b'', b"'Til There Was You", b'.hack//Tasogare no Udewa Densetsu',
       b'100-nichikan Ikita Wani X Bourbon', b'101 Dalmatians',
       b'12-sai.: Chicchana Mune no Tokimeki', b'187', b'1year',
       b'2 Days in the Valley', b'2020 Nyeon Ujuui Wonder Kiddy'],
      dtype=object)
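
As a small illustration of what the vocabulary step produces, the sketch below builds a sorted list of unique byte strings in plain Python. `build_vocabulary` and the toy titles are hypothetical; the notebook itself relies on `np.unique` over the batched dataset.

```python
def build_vocabulary(values):
    """Return the sorted unique values, similar in spirit to np.unique."""
    return sorted(set(values))

# Toy titles with a duplicate and an empty entry.
toy_titles = [b"Jumanji", b"101 Dalmatians", b"Jumanji", b"", b"187"]
vocabulary = build_vocabulary(toy_titles)
print(vocabulary)  # [b'', b'101 Dalmatians', b'187', b'Jumanji']
```

Note that an empty title sorts first, just like the `b''` entry in the output above, which suggests that at least one item in the dataset has no title.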

## Implementing

This is a two-tower retrieval model, so we will build the two towers separately and combine them at the end.

### Query Tower

The first thing to do is define the dimensionality of the query and candidate representations, that is, the size of each embedding vector. Higher values may yield a more accurate model, but the model will also be slower to fit and more prone to overfitting.

In [14]:
embedding_dimension = 32
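
To get a feel for how this value affects model size, the sketch below counts the weights of a single embedding table. `embedding_parameter_count` is a hypothetical helper and the vocabulary size of 1,000 is made up for illustration; the extra row accounts for unknown tokens, as in the towers of this notebook.

```python
def embedding_parameter_count(vocabulary_size, embedding_dimension):
    """Number of weights in an embedding table with one extra row for unknown tokens."""
    return (vocabulary_size + 1) * embedding_dimension

# Hypothetical vocabulary of 1,000 entries at a few candidate dimensions.
for dim in (16, 32, 64):
    print(dim, embedding_parameter_count(1_000, dim))
```

The count grows linearly with the dimension, so doubling `embedding_dimension` doubles the number of weights each tower has to learn.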

Let's now define our model using `Keras`. We first use a `StringLookup` layer to convert the raw user IDs into integer indices, and then an `Embedding` layer to map each index to a vector of size `embedding_dimension`.

In [15]:
user_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=unique_user_ids, mask_token=None),
  # We add an additional embedding to account for unknown tokens.
  tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension)
])
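
The two layers can be pictured as a dictionary lookup followed by a table lookup. The plain-Python sketch below is an approximation under stated assumptions: `make_lookup` and `make_embedding_table` are hypothetical helpers, the random values are arbitrary, and index 0 is reserved for unknown tokens to mirror `StringLookup` with `mask_token=None`.

```python
import random

def make_lookup(vocabulary):
    """Map each known token to 1..N; index 0 is reserved for unknown tokens."""
    table = {token: i + 1 for i, token in enumerate(vocabulary)}
    return lambda token: table.get(token, 0)

def make_embedding_table(num_rows, dimension, seed=0):
    """Randomly initialised table with one vector per index."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dimension)]
            for _ in range(num_rows)]

toy_user_ids = [b"138", b"42", b"7"]
lookup = make_lookup(toy_user_ids)
# One extra row so that unknown users (index 0) also get a vector.
table = make_embedding_table(len(toy_user_ids) + 1, dimension=4)

print(lookup(b"42"))              # 2
print(lookup(b"999"))             # 0, an unknown user
print(len(table[lookup(b"42")]))  # 4
```

This is why the `Embedding` layer is created with `len(unique_user_ids) + 1` rows: the vocabulary entries occupy indices 1 to N and the unknown-token index needs a row of its own.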

### Candidate Tower

The candidate tower is built the same way as the query tower, using the movie titles as the vocabulary:

In [16]:
movie_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=unique_movie_titles, mask_token=None),
  tf.keras.layers.Embedding(len(unique_movie_titles) + 1, embedding_dimension)
])

### Metrics

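We use the `tfrs.metrics.FactorizedTopK` metric to measure the top-K "accuracy" of our model. For each (user, movie) pair in the evaluation data, it compares the score of the true movie with the scores of all the other candidate movies, which act as implicit negatives.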

In [17]:
metrics = tfrs.metrics.FactorizedTopK(
  candidates=movies.batch(128).map(movie_model)
)
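
The idea behind a top-K accuracy metric can be sketched in a few lines of plain Python. This is not the `FactorizedTopK` implementation; `top_k_accuracy` is a hypothetical helper and the vectors are toy values. For each query we score every candidate with a dot product and check whether the true candidate is among the K best.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k_accuracy(query_vectors, true_indices, candidate_vectors, k):
    """Fraction of queries whose true candidate ranks in the top k."""
    hits = 0
    for query, true_index in zip(query_vectors, true_indices):
        scores = [dot(query, candidate) for candidate in candidate_vectors]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if true_index in ranked[:k]:
            hits += 1
    return hits / len(true_indices)

candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[1.0, 0.1], [0.1, 1.0]]
true_items = [0, 2]  # index of the true candidate for each query

print(top_k_accuracy(queries, true_items, candidates, k=1))  # 0.5
print(top_k_accuracy(queries, true_items, candidates, k=2))  # 1.0
```

The second query ranks its true candidate second, so it counts as a miss at K=1 and as a hit at K=2.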

### Loss

We will use the `Retrieval` task object, which wraps together the loss function and the metric computation.

In [18]:
task = tfrs.tasks.Retrieval(
  metrics=metrics
)
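
As far as we understand the library defaults, when no loss is passed the `Retrieval` task applies a categorical cross-entropy to the matrix of in-batch scores, treating the other movies in the batch as negatives and summing over the batch. The plain-Python sketch below works under that assumption; `in_batch_softmax_loss` is a hypothetical helper, not the library code.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def in_batch_softmax_loss(query_embeddings, candidate_embeddings):
    """Sum over the batch of -log softmax(score of the matching pair).

    Row i of both inputs is assumed to be a positive (user, movie) pair;
    every other candidate in the batch serves as a negative for query i.
    """
    total = 0.0
    for i, query in enumerate(query_embeddings):
        scores = [dot(query, candidate) for candidate in candidate_embeddings]
        max_score = max(scores)  # subtract the max for numerical stability
        log_sum_exp = max_score + math.log(
            sum(math.exp(s - max_score) for s in scores))
        total += log_sum_exp - scores[i]
    return total

users = [[1.0, 0.0], [0.0, 1.0]]
aligned_movies = [[2.0, 0.0], [0.0, 2.0]]  # each user matches its own movie
swapped_movies = [[0.0, 2.0], [2.0, 0.0]]  # each user matches the other movie

aligned_loss = in_batch_softmax_loss(users, aligned_movies)
swapped_loss = in_batch_softmax_loss(users, swapped_movies)
print(aligned_loss < swapped_loss)  # True
```

Because the loss in this sketch is summed rather than averaged, its absolute value grows with the batch size, so large loss values are not alarming on their own.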

### Full model

Here we put all the pieces together to create our model. `tfrs.Model` is a base class that provides the training loop for us: we only set up the components in `__init__` and implement `compute_loss`, and the base class takes care of the rest.

In [19]:
class MovielensModel(tfrs.Model):

  def __init__(self, user_model, movie_model):
    super().__init__()
    self.movie_model: tf.keras.Model = movie_model
    self.user_model: tf.keras.Model = user_model
    self.task: tf.keras.layers.Layer = task

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    # We pick out the user features and pass them into the user model.
    user_embeddings = self.user_model(features["user_id"])
    # And pick out the movie features and pass them into the movie model,
    # getting embeddings back.
    positive_movie_embeddings = self.movie_model(features["movie_title"])

    # The task computes the loss and the metrics.
    return self.task(user_embeddings, positive_movie_embeddings)

## Fitting and Evaluating

Fitting and evaluation use the standard `Keras` routines. Let's start by instantiating and compiling the model.

In [20]:
model = MovielensModel(user_model, movie_model)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

Shuffle, batch, and cache the training and evaluation data.

In [21]:
cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()
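
A quick way to sanity-check these batch sizes is to compute how many batches each pass will contain. The sketch below assumes the splits really hold 80,000 and 20,000 examples (the `take` calls only set upper bounds) and that the final, smaller batch is kept; `steps_per_epoch` is a hypothetical helper.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(num_examples / batch_size)

print(steps_per_epoch(80_000, 8192))  # 10 training batches per epoch
print(steps_per_epoch(20_000, 4096))  # 5 evaluation batches
```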

Finally, train the model:

In [22]:
model.fit(cached_train, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x19d6b22a0d0>

>As the model trains, the loss is falling and a set of top-k retrieval metrics is updated. These tell us whether the true positive is in the top-k retrieved items from the entire candidate set. For example, a top-5 categorical accuracy metric of 0.2 would tell us that, on average, the true positive is in the top 5 retrieved items 20% of the time.

> -- <cite>TensorFlow Recommenders - Retrieval Tutorial</cite>

Having that in mind, we can now evaluate our model.

In [23]:
model.evaluate(cached_test, return_dict=True)



{'factorized_top_k/top_1_categorical_accuracy': 0.0011500000255182385,
 'factorized_top_k/top_5_categorical_accuracy': 0.009999999776482582,
 'factorized_top_k/top_10_categorical_accuracy': 0.023749999701976776,
 'factorized_top_k/top_50_categorical_accuracy': 0.1256999969482422,
 'factorized_top_k/top_100_categorical_accuracy': 0.23849999904632568,
 'loss': 28304.404296875,
 'regularization_loss': 0,
 'total_loss': 28304.404296875}

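Performance on unseen data is expected to be lower than on the training set. One reason is that the model fits the data it has seen better than data it has not (overfitting); lower test performance is a symptom of this, not evidence against it. Another reason is that the model re-recommends movies the user has already watched, and these known positives can push test movies out of the top-K recommendations.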
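
One simple way to deal with already-watched movies is to filter them out of the ranked list after retrieval. The sketch below is a minimal illustration in plain Python; `exclude_seen` is a hypothetical helper and the titles are toy values, not output of the model.

```python
def exclude_seen(ranked_titles, seen_titles, k):
    """Return the first k ranked titles that the user has not seen yet."""
    seen = set(seen_titles)
    unseen = [title for title in ranked_titles if title not in seen]
    return unseen[:k]

ranked = [b"Jumanji", b"Sleepless in Seattle", b"101 Dalmatians", b"187"]
already_watched = [b"Jumanji", b"187"]

print(exclude_seen(ranked, already_watched, k=2))
# [b'Sleepless in Seattle', b'101 Dalmatians']
```

With this approach we need to retrieve more than `k` candidates in the first place, so that enough unseen titles remain after filtering.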

## Predictions

To make predictions, we use the `tfrs.layers.factorized_top_k.BruteForce` layer. Because it scores the query against every candidate, it becomes slow as the number of candidates grows.

In [24]:
# Create a model that takes in raw query features, and
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
# recommends movies out of the entire movies dataset.
index.index_from_dataset(
  tf.data.Dataset.zip((movies.batch(100), movies.batch(100).map(model.movie_model)))
)

# Get recommendations.
_, titles = index(tf.constant(["42"]))
print(f"Recommendations for user 42: {titles[0, :3]}")

Recommendations for user 42: [b'Gundam vs Hello Kitty' b'Jumanji' b'Sleepless in Seattle']
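
Conceptually, brute-force retrieval scores the query embedding against every candidate embedding and keeps the best K. The sketch below shows this in plain Python with toy vectors; `brute_force_top_k` is a hypothetical helper and is not how the `BruteForce` layer is implemented internally.

```python
import heapq

def brute_force_top_k(query_vector, identifiers, candidate_vectors, k):
    """Score every candidate against the query and keep the k best."""
    scored = (
        (sum(q * c for q, c in zip(query_vector, vector)), identifier)
        for identifier, vector in zip(identifiers, candidate_vectors)
    )
    best = heapq.nlargest(k, scored)  # highest scores first
    scores = [score for score, _ in best]
    top_identifiers = [identifier for _, identifier in best]
    return scores, top_identifiers

toy_titles = [b"Jumanji", b"187", b"101 Dalmatians"]
toy_vectors = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

top_scores, top_titles = brute_force_top_k([1.0, 0.0], toy_titles, toy_vectors, k=2)
print(top_titles)  # [b'Jumanji', b'101 Dalmatians']
```

Like the layer call above, the helper returns both the scores and the identifiers. Every query touches all candidates, so the cost per query grows linearly with the size of the catalog.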


## Model Serving

We will export our two-tower retrieval model as a single `SavedModel` so we can deploy it with `TensorFlow Serving`. We just need the `BruteForce` layer from before.

In [25]:
# Export the query model.
path = os.path.join(os.getcwd(), "../models/retrieval/00000123/")

# Save the index.
tf.saved_model.save(index, path)





INFO:tensorflow:Assets written to: D:\Documentos\Code Chrysalis\ccp2\ccp2-capstone-recommender-retrieval\src\../models/retrieval/00000123/assets
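
The last component of the export path matters: TensorFlow Serving expects each version of a model in its own numerically named subdirectory under the model's base path, and by default it serves the highest version it finds. The sketch below builds such a path with a zero-padded version folder, matching the `00000123` folder used above; `versioned_export_path` is a hypothetical helper and the eight-digit padding is just the convention of this notebook.

```python
from pathlib import PurePosixPath

def versioned_export_path(base_dir, model_name, version):
    """Build <base_dir>/<model_name>/<version> with a numeric version folder."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return PurePosixPath(base_dir) / model_name / f"{version:08d}"

export_path = versioned_export_path("../models", "retrieval", 123)
print(export_path)  # ../models/retrieval/00000123
```

Exporting a retrained model under a higher version number, for example 124, would let the server pick it up without overwriting the previous export.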


The next step is to deploy our model in Docker.