# Retrieval

This notebook shows how to train a deep model for recommendation retrieval in TensorFlow. More specifically, it shows how to train a two-tower model and serve it with both exact and approximate retrieval, for user-to-item and item-to-item recommendation.

The dataset used is MovieLens 100K; however, any custom dataset can be used.




The retrieval task consists of two steps:

  1- Build a two-tower model. A two-tower model for retrieval is a neural network architecture that consists of two sub-models: a query model and a candidate model. The query model computes a vector representation (or embedding) for a user or a query using features such as user history, preferences, or context. The candidate model computes a vector representation for an item or a candidate using features such as item attributes, ratings, or popularity.

  2- Use the similarity between the query and candidate embeddings to score and rank the candidates for retrieval.
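
As a rough illustration of these two steps, the sketch below replaces each tower with a random embedding table and ranks candidates by dot-product similarity. This is an assumption made purely for illustration: the real towers in this notebook are learned models, and the names `query_table`, `candidate_table`, and `retrieve_top_k` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two towers: one embedding row per user and per movie.
# In a trained model these vectors come from the query and candidate towers.
num_users, num_movies, dim = 5, 8, 4
query_table = rng.normal(size=(num_users, dim))
candidate_table = rng.normal(size=(num_movies, dim))


def retrieve_top_k(user_index, k=3):
    """Score every candidate against one query and return the k best."""
    query_embedding = query_table[user_index]    # step 1: embed the query
    scores = candidate_table @ query_embedding   # step 2: dot-product similarity
    top_indices = np.argsort(-scores)[:k]        # highest score first
    return top_indices, scores[top_indices]


top_indices, top_scores = retrieve_top_k(user_index=2, k=3)
print(top_indices, top_scores)
```

The candidate embeddings do not depend on the query, so they can be computed once and reused for every user. That is what makes this architecture practical for retrieval.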

In [None]:
!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets
!pip install -q tensorflow-ranking
!pip install -q scann

In [None]:
import pprint

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

from utils.feature_extraction import FeatureExtractionTower
from utils.models import RetrievalModel
from utils.preprocessing import *

import logging
tf.get_logger().setLevel(logging.ERROR)

## Data Loading

In [None]:
ratings = tfds.load("movielens/100k-ratings", split="train")
movies = tfds.load("movielens/100k-movies", split="train")

ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "movie_id": x["movie_id"],
    "user_occupation_text": x["user_occupation_text"]
})
movies = movies.map(lambda x: {"movie_title": x["movie_title"], "movie_id": x["movie_id"]})

In [None]:
for x in ratings.take(1):
    pprint.pprint(x)

## Data preprocessing

In [None]:
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

## Model definition

We build a user tower and an item tower and feed them to a retrieval task.

### The query/user tower


In [None]:
user_tower = FeatureExtractionTower(
    ratings,
    cats_to_hash_embedding=["user_id"],
    text_to_embedding=["user_occupation_text"],
)

### The candidate/movie tower


In [None]:
movie_tower = FeatureExtractionTower(ratings, text_to_embedding=["movie_title", "movie_title"])


### Metrics

We want to know how well our model can predict the user’s preference for a movie. We have some data that tells us which movies the user watched and rated. We can use these as positive examples and compare them with all the other movies, which the user did not rate. The more often the model scores a positive example higher than the negatives, the better its retrieval quality.

To measure this, we can use the `tfrs.metrics.FactorizedTopK` metric. It needs one input: a dataset of all the movies, which serve as implicit negatives during evaluation.
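
To make the metric concrete, here is a minimal NumPy sketch of top-K accuracy, assuming we already have a score for every (query, candidate) pair. It is not the TFRS implementation; the `top_k_accuracy` helper and the toy scores are made up for illustration.

```python
import numpy as np


def top_k_accuracy(scores, positive_indices, k):
    """Fraction of queries whose positive candidate is among the k highest scores.

    scores: array of shape (num_queries, num_candidates).
    positive_indices: array of shape (num_queries,), the true candidate per query.
    """
    positive_scores = scores[np.arange(len(scores)), positive_indices]
    # Rank of the positive = number of candidates scored strictly higher.
    ranks = (scores > positive_scores[:, None]).sum(axis=1)
    return float((ranks < k).mean())


# Toy example: 3 queries, 4 candidates.
scores = np.array([
    [0.9, 0.1, 0.3, 0.2],   # positive is candidate 0, ranked 1st
    [0.2, 0.8, 0.5, 0.6],   # positive is candidate 2, ranked 3rd
    [0.4, 0.3, 0.2, 0.1],   # positive is candidate 3, ranked 4th
])
positives = np.array([0, 2, 3])

print(top_k_accuracy(scores, positives, k=1))
print(top_k_accuracy(scores, positives, k=3))
```

In this toy data only the first query has its positive in the top 1, while the first two have it in the top 3, so the accuracy grows with K.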


In [None]:
metrics = tfrs.metrics.FactorizedTopK(
  candidates=movies.batch(128).map(movie_tower.call)
)

## Fitting and evaluating


In [None]:
model = RetrievalModel(user_tower, movie_tower, metrics)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

In [None]:
cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

In [None]:
model.fit(cached_train, epochs=3)

In [None]:
model.evaluate(cached_test, return_dict=True)

## Making predictions

Now that we have a model, we would like to be able to make predictions. We can use the `tfrs.layers.factorized_top_k.BruteForce` layer to do this.

In [None]:
# Create a model that takes in raw query features and
# recommends movies out of the entire movies dataset.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_tower)
index.index_from_dataset(
  tf.data.Dataset.zip((movies.map(lambda x: x["movie_title"]).batch(100), movies.batch(100).map(model.movie_tower)))
)

In [None]:
# The query only needs the features consumed by the user tower.
_, titles = index({"user_id": np.array(["42"]), "user_occupation_text": tf.constant(["doctor"])})
print(f"Recommendations for user 42: {titles[0, :3]}")

Of course, the `BruteForce` layer is going to be too slow to serve a model with many possible candidates. We can also export an approximate retrieval index to speed up predictions. This will make it possible to efficiently surface recommendations from sets of tens of millions of candidates.

To do so, we can use the `scann` package through the TFRS `ScaNN` layer.

In [None]:
scann_index = tfrs.layers.factorized_top_k.ScaNN(model.user_tower)
scann_index.index_from_dataset(
  tf.data.Dataset.zip((movies.map(lambda x: x["movie_title"]).batch(100), movies.batch(100).map(model.movie_tower)))
)

This layer will perform _approximate_ lookups: this makes retrieval slightly less accurate, but orders of magnitude faster on large candidate sets.
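
The trade-off can be seen in a toy partition-based search. This is not ScaNN's actual algorithm, only a simplified sketch of the general idea of scoring a few partitions instead of every candidate. All names and sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.normal(size=(1000, 16))
query = rng.normal(size=16)


def brute_force_top_k(query, candidates, k):
    """Exact search: score every candidate."""
    scores = candidates @ query
    return np.argsort(-scores)[:k]


# Build a crude index: assign each candidate to the partition whose
# (randomly chosen) centroid gives the highest dot product.
num_partitions = 10
centroids = candidates[rng.choice(len(candidates), num_partitions, replace=False)]
assignments = np.argmax(candidates @ centroids.T, axis=1)


def approximate_top_k(query, k, partitions_to_search):
    """Approximate search: score only candidates in the most promising partitions."""
    best_partitions = np.argsort(-(centroids @ query))[:partitions_to_search]
    candidate_ids = np.flatnonzero(np.isin(assignments, best_partitions))
    scores = candidates[candidate_ids] @ query
    return candidate_ids[np.argsort(-scores)[:k]]


exact = brute_force_top_k(query, candidates, k=10)
approx = approximate_top_k(query, k=10, partitions_to_search=3)
recall = len(set(exact) & set(approx)) / len(exact)
print(f"Searched 3 of {num_partitions} partitions, recall@10 = {recall:.2f}")
```

Searching more partitions raises recall at the cost of more scoring work; searching all of them reduces to the brute-force result.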

In [None]:
# Get recommendations.
_, titles = scann_index({"user_id": np.array(["148"]), "user_occupation_text": tf.constant(["doctor"])})
print(f"Recommendations for user 148: {titles[0, :3]}")

## Item-to-item recommendation

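We can also use the learned towers to perform item-to-item or user-to-user recommendation. Here we index the movie embeddings and query them with the movie tower itself, so the nearest neighbours of a movie are its recommendations.

Another approach would be to build two item (or two user) towers, one for the query and one for the candidate, and train the model on (query item, candidate item) pairs. These pairs could be constructed from movies that were seen by the same user.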
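
A rough sketch of how such pairs could be built, assuming the interactions are available as plain `(user_id, movie_title)` tuples. The data and the `build_item_pairs` helper are hypothetical and not part of this notebook's `utils` package.

```python
from collections import defaultdict
from itertools import permutations


def build_item_pairs(interactions):
    """Return (query item, candidate item) pairs for movies seen by the same user."""
    movies_by_user = defaultdict(list)
    for user_id, movie_title in interactions:
        # Keep each title once per user, in first-seen order.
        if movie_title not in movies_by_user[user_id]:
            movies_by_user[user_id].append(movie_title)

    pairs = []
    for titles in movies_by_user.values():
        # Ordered pairs, so each movie appears both as query and as candidate.
        pairs.extend(permutations(titles, 2))
    return pairs


interactions = [
    ("42", "Speed (1994)"),
    ("42", "Beautiful Thing (1996)"),
    ("148", "Speed (1994)"),
]
print(build_item_pairs(interactions))
```

Users with a single movie contribute no pairs. On real data the number of pairs grows quadratically with the number of movies per user, so some sampling would likely be needed.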

In [None]:
scann_index = tfrs.layers.factorized_top_k.ScaNN(model.movie_tower)
scann_index.index_from_dataset(
  tf.data.Dataset.zip((movies.map(lambda x: x["movie_title"]).batch(100), movies.batch(100).map(model.movie_tower)))
)

In [None]:
# Get recommendations.
_, titles = scann_index({"movie_title": tf.constant(["Beautiful Thing (1996)"])})
print(f"Recommendations for movie Beautiful Thing (1996): {titles[0, :3]}")