# Ranking Model

A ranking model assists retrieval by predicting, for each item, how likely a user is to like it, and then ordering the items from the highest to the lowest score. Ranking is useful for filtering out items that are not relevant to the user, which makes the overall recommendation task more accurate and efficient.

In this example, we look at a very simple ranking model. Afterwards, we shall see how to add more features and combine the ranking and retrieval models into a multitask model.
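To make the idea of ranking concrete before the real model is built, here is a minimal sketch in plain Python. The helper `rank_items` and the scores are hypothetical and made up for illustration; in the notebook the scores come from the trained model.

```python
def rank_items(scores):
    """Return item IDs sorted from the highest to the lowest predicted score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical predicted scores for one user; the values are made up.
predicted = {
    "housewares SKU 0": 1.2,
    "perfumery SKU 0": 0.4,
    "auto SKU 0": 2.7,
}

ranking = rank_items(predicted)
print(ranking)  # ['auto SKU 0', 'housewares SKU 0', 'perfumery SKU 0']
```

Items near the end of the list are the ones a ranking step would drop first.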

In [12]:
### Import necessary libraries

from typing import Dict, Text

import numpy as np
import tensorflow as tf

import tensorflow_recommenders as tfrs

import os
import pprint
import tempfile

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [13]:
masterdf = pd.read_csv('../data_cleaned/brazildata_mod.csv')
masterdf.head(3)

| Unnamed: 0.1 | Unnamed: 0 | order_id | order_purchase_timestamp | user_id | customer_city | customer_state | product_category | quantity | price | review_score | timestamp | product_code | product_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | e481f51cbdc54678b7cc49136f2d6af7 | 2017-10-02 10:56:33 | 9ef432eb6251297304e76186b10a928d | sao paulo | SP | housewares | 1.0 | 29.99 | 4 | 1506942000.0 | 87285b34884572647811a353c7ac498a | housewares SKU 0 |
| 1 | 1 | 53cdb2fc8bc7dce0b6741e2150273451 | 2018-07-24 20:41:37 | b0830fb4747a6c6d20dea0b8c802d7ef | barreiras | BA | perfumery | 1.0 | 118.7 | 4 | 1532465000.0 | 595fac2a385ac33a80bd5114aec74eb8 | perfumery SKU 0 |
| 2 | 2 | 47770eb9100c2d0c44946d9cf07ec65d | 2018-08-08 08:38:49 | 41ce2a54c0b03bf3443c3d931a367089 | vianopolis | GO | auto | 1.0 | 159.9 | 5 | 1533718000.0 | aa4383b373c6aca5d8797843e5594415 | auto SKU 0 |


# Identify the features that we need, and prepare the dataset

In [14]:
len(masterdf)

82597

In [15]:
### standardize item data types, especially string, float, and integer

masterdf[['user_id',      
          'product_id',  
         ]] = masterdf[['user_id','product_id']].astype(str)

# we will experiment with the data type of the quantity;
# as you shall see later, it affects the accuracy of the prediction.

masterdf['quantity'] = masterdf['quantity'].astype(float)

In [16]:
### define interactions data and user data

### interactions
### here we create a reference table of the user, item, and quantity purchased
interactions_dict = masterdf.groupby(['user_id', 'product_id'])['quantity'].sum().reset_index()

## we transform the table into a dictionary, which we then feed into tensor slices
# this step is crucial because this is the type of data fed into the embedding layers
interactions_dict = {name: np.array(value) for name, value in interactions_dict.items()}
interactions = tf.data.Dataset.from_tensor_slices(interactions_dict)

## we do a similar step for items; this is the reference table of items to be recommended
items_dict = masterdf[['product_id']].drop_duplicates()
items_dict = {name: np.array(value) for name, value in items_dict.items()}
items = tf.data.Dataset.from_tensor_slices(items_dict)

## map the features in interactions and items to the identifiers that we will use throughout the embedding layers
## do it for all the records in the interactions and items tables
## dtype errors are common here, so we cast the quantity to float to ensure consistency
interactions = interactions.map(lambda x: {
    'user_id': x['user_id'],
    'product_id': x['product_id'],
    'quantity': tf.cast(x['quantity'], tf.float32),
})

items = items.map(lambda x: x['product_id'])
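The cell above sums the quantity per user and product pair, turns the result into a dictionary of columns, and slices that dictionary into one record per row. The sketch below mirrors those steps in plain Python to show the shape of the data at each stage. The rows and the `slices` helper are hypothetical stand-ins; they are not the pandas or TensorFlow implementations.

```python
from collections import defaultdict

# A few made-up purchase rows: (user_id, product_id, quantity).
rows = [
    ("u1", "housewares SKU 0", 1.0),
    ("u1", "housewares SKU 0", 2.0),
    ("u2", "auto SKU 0", 1.0),
]

# Sum the quantity per (user, product) pair, similar to groupby(...).sum().
totals = defaultdict(float)
for user_id, product_id, quantity in rows:
    totals[(user_id, product_id)] += quantity

# Column-oriented dictionary: one equally long list per feature.
columns = {
    "user_id": [user for user, _ in totals],
    "product_id": [product for _, product in totals],
    "quantity": list(totals.values()),
}


def slices(columns):
    """Yield one {feature: value} dictionary per row of the columns."""
    names = list(columns)
    for values in zip(*(columns[name] for name in names)):
        yield dict(zip(names, values))


examples = list(slices(columns))
print(examples[0])
# {'user_id': 'u1', 'product_id': 'housewares SKU 0', 'quantity': 3.0}
```

Each element of `interactions` has the same form: a dictionary with one value per feature.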

In [17]:
unique_item_titles = np.unique(np.concatenate(list(items.batch(1000))))
unique_user_ids = np.unique(np.concatenate(list(interactions.batch(1_000).map(lambda x: x["user_id"]))))
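These unique IDs become the vocabularies of the string lookup layers in the model. With `mask_token=None` and the default single out-of-vocabulary bucket, a lookup layer maps unknown strings to index 0 and the vocabulary entries to indices 1 to N, which is why the embedding tables are sized `len(vocabulary) + 1`. The following plain-Python sketch illustrates that mapping; `build_lookup` and `lookup` are hypothetical helpers, not the Keras layer itself.

```python
def build_lookup(vocabulary):
    """Map each vocabulary string to 1..N; index 0 is kept for unknown strings."""
    return {token: index for index, token in enumerate(vocabulary, start=1)}


def lookup(table, token):
    """Return the integer index of a token, or 0 if it is out of vocabulary."""
    return table.get(token, 0)


vocabulary = sorted({"auto SKU 0", "housewares SKU 0", "perfumery SKU 0"})
table = build_lookup(vocabulary)

known = lookup(table, "perfumery SKU 0")
unknown = lookup(table, "garden SKU 9")   # hypothetical product not in the vocabulary
embedding_rows = len(vocabulary) + 1      # one extra row for index 0

print(known, unknown, embedding_rows)  # 3 0 4
```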

In [24]:
# Randomly shuffle data and split between train and test.
tf.random.set_seed(42)
shuffled = interactions.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(60_000)
test = shuffled.skip(60_000).take(20_000)
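The split relies on the shuffled order staying the same every time the dataset is iterated, which is what `reshuffle_each_iteration=False` ensures; otherwise `take` and `skip` could see different orders and the train and test sets could overlap. Note also that `take` returns fewer records than requested if the dataset is shorter. A small plain-Python sketch of the same idea, using a made-up list of ten records and a hypothetical `shuffle_once` helper:

```python
import random


def shuffle_once(items, seed):
    """Return a shuffled copy whose order is fixed by the seed."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled


records = list(range(10))            # stand-in for the interaction records
shuffled = shuffle_once(records, seed=42)

train = shuffled[:6]                 # like shuffled.take(6)
test = shuffled[6:][:2]              # like shuffled.skip(6).take(2)

# Because the order is fixed, the two subsets never share a record.
print(set(train).isdisjoint(test))   # True
```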

Here, the embedding layers work much as they do in the retrieval model. The addition is a stack of dense hidden layers, built inside a Sequential model, on top of the embeddings.

We define the user (query) and product (candidate) embeddings separately, and combine them when the model is called.

In [19]:
class RankingModel(tf.keras.Model):

    def __init__(self):
        super().__init__()
        embedding_dimension = 32

        # Compute embeddings for users.
        self.user_embeddings = tf.keras.Sequential([
          tf.keras.layers.experimental.preprocessing.StringLookup(
            vocabulary=unique_user_ids, mask_token=None),
          tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension)
        ])

        # Compute embeddings for products.
        self.product_embeddings = tf.keras.Sequential([
          tf.keras.layers.experimental.preprocessing.StringLookup(
            vocabulary=unique_item_titles, mask_token=None),
          tf.keras.layers.Embedding(len(unique_item_titles) + 1, embedding_dimension)
        ])

        # Compute predictions.
        self.ratings = tf.keras.Sequential([
          # Learn multiple dense layers.
          tf.keras.layers.Dense(256, activation="relu"),
          tf.keras.layers.Dense(64, activation="relu"),
          # Make the quantity prediction in the final layer.
          tf.keras.layers.Dense(1)
        ])

    def call(self, inputs):

        user_id, product_id = inputs

        user_embedding = self.user_embeddings(user_id)
        product_embedding = self.product_embeddings(product_id)

        # Concatenate both embeddings and pass them through the dense layers.
        return self.ratings(tf.concat([user_embedding, product_embedding], axis=1))

This model takes user IDs and product IDs, and outputs a predicted quantity, for example:

In [20]:
RankingModel()((["9ef432eb6251297304e76186b10a928d"], ["perfumery SKU 0"]))

Consider rewriting this model with the Functional API.


<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[0.01101889]], dtype=float32)>
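The output has shape (1, 1) because the two 32-dimensional embeddings are concatenated into 64 values and then passed through dense layers of 256, 64, and 1 units. The sketch below traces that shape flow in plain Python for a single example. The weights are random and untrained, and `dense` and `init_layer` are hypothetical helpers, so the printed number is not meaningful; only the sizes are.

```python
import random


def dense(x, weights, bias, relu=False):
    """Apply one fully connected layer to a single input vector."""
    out = [sum(xi * w for xi, w in zip(x, unit_weights)) + b
           for unit_weights, b in zip(weights, bias)]
    return [max(0.0, value) for value in out] if relu else out


def init_layer(rng, n_in, n_out):
    """Random weights (one list of n_in values per output unit) and zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
embedding_dimension = 32

# Stand-ins for the looked-up user and product embedding vectors.
user_vec = [rng.uniform(-0.05, 0.05) for _ in range(embedding_dimension)]
product_vec = [rng.uniform(-0.05, 0.05) for _ in range(embedding_dimension)]

x = user_vec + product_vec            # concatenation: 32 + 32 = 64 values
concat_size = len(x)

# (inputs, units, use ReLU) for the three dense layers of the model.
for n_in, n_out, relu in [(64, 256, True), (256, 64, True), (64, 1, False)]:
    weights, bias = init_layer(rng, n_in, n_out)
    x = dense(x, weights, bias, relu)

print(concat_size, len(x))  # 64 1
```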

Here we plug the ranking model built above into the full model. The main difference from the retrieval task is that we use a ranking task with a mean squared error loss, and we measure accuracy with the root mean squared error (RMSE).

In [None]:
class RetailModel(tfrs.models.Model):

    def __init__(self):
        super().__init__()
        self.ranking_model: tf.keras.Model = RankingModel()
        self.task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
          loss = tf.keras.losses.MeanSquaredError(),
          metrics=[tf.keras.metrics.RootMeanSquaredError()]
        )

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        rating_predictions = self.ranking_model(
            (features["user_id"], features["product_id"]))

        # The task computes the loss and the metrics.
        return self.task(labels=features["quantity"], predictions=rating_predictions)
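The task above minimizes the mean squared error between the predicted and the actual quantity, and reports its square root, the RMSE, as the metric. As a rough illustration of the two quantities, here is a plain-Python sketch with made-up labels and predictions; it is not how the Keras loss and metric are implemented internally.

```python
import math


def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)


def root_mean_squared_error(labels, predictions):
    """Square root of the mean squared error, in the same unit as the labels."""
    return math.sqrt(mean_squared_error(labels, predictions))


# Made-up quantities and predictions, for illustration only.
labels = [1.0, 2.0, 1.0, 4.0]
predictions = [1.5, 1.0, 1.0, 2.0]

mse = mean_squared_error(labels, predictions)
rmse = root_mean_squared_error(labels, predictions)
print(mse)  # 1.3125
```

Because the RMSE is expressed in the same unit as the label, it can be read directly as a typical error in purchased quantity.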

As usual, we fit the model on the training data and then evaluate it on the test data.

In [2]:
model = RetailModel()

model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.5))

cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

model.fit(cached_train, epochs=100)

model.evaluate(cached_test, return_dict=True)



{'root_mean_squared_error': 2.0219151973724365,
 'loss': 4.294068336486816,
 'regularization_loss': 0,
 'total_loss': 4.294068336486816}

The RMSE of about 2.02 is not very good. Later we shall see how to improve it by adding more features and by combining the ranking and retrieval models.
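One common way to put an RMSE in context, which this notebook does not do, is to compare it with a naive baseline that always predicts the mean quantity of the training set. The sketch below shows the idea with made-up quantities; the numbers are not results from this dataset, and `mean_baseline_rmse` is a hypothetical helper.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def mean_baseline_rmse(train_labels, test_labels):
    """RMSE on the test labels when always predicting the training mean."""
    mean = sum(train_labels) / len(train_labels)
    return rmse(test_labels, [mean] * len(test_labels))


# Made-up quantities, for illustration only.
train_quantities = [1.0, 1.0, 2.0, 1.0, 5.0]
test_quantities = [1.0, 3.0, 1.0]

baseline = mean_baseline_rmse(train_quantities, test_quantities)
print(baseline)  # 1.0
```

A model whose test RMSE is not clearly below such a baseline has learned little beyond the average quantity.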