# Book Recommendation Inference
---

1. [Introduction](#Introduction)
    * [Background](#Introduction-Background)
    * [Definitions](#Introduction-Definitions)
    * [Prerequisites](#Introduction-Prerequisites)
2. [Notebook Setup](#Notebook-Setup)
3. [Code](#Code)
    * [Imports](#Code-Imports)
    * [File Paths](#Code-File-Paths)
    * [Load Input Files](#Code-Load-Input-Files)
    * [Deploy the Endpoint](#Code-Deploy-the-Endpoint)
    * [Create Predictor from Endpoint Name](#Code-Create-Predictor-from-Endpoint-Name)
    * [Inference Functions](#Code-Inference-Functions)
    * [Display Functions](#Code-Display-Functions)
    * [Example Recommendations](#Code-Example-Recommendations)
    * [Cleanup](#Code-Cleanup)

<a id="Introduction"></a>
## Introduction
---
<a id="Introduction-Background"></a>
### Background
In our application, anonymous users provide a few books that they liked, and we need to recommend books based on those. Our factorization machine model scores a book given the user context (the books that the user said they liked). However, the dataset contains almost 200k books; scoring all of them would be computationally expensive and may introduce noise from wrong predictions. We therefore reduce the search space to the proximal books obtained from the "BooksRecommenderProximity" notebook. This notebook provides the inference method that scores the proximal books and returns book recommendations to anonymous users.

<a id="Introduction-Definitions"></a>
### Definitions
| Term | Definition |
|:--- |:--- | 
| Context Book | A book that is provided by the user, with the assumption that they liked it (explicit feedback), that will be used to rank recommendations using the factorization machine model. The context books are only a subset of the book dataset since some books did not have enough data during training. **Users may only choose liked books from this subset.** |
| Target Book | A book that can be scored by the factorization machine model. The target books are only a subset of the book dataset since some books did not have enough data during training. **Users will receive recommendations only from this subset.** |
| Encoder | The encoders we refer to in this notebook are either **sklearn.preprocessing.OneHotEncoder** for the target books (since we only have one per prediction), or **sklearn.preprocessing.MultiLabelBinarizer** for the context books (since we have multiple ones for each prediction). **We load these encoders to retrieve the context and target book subsets.** |
| Proximal Book | A context book has multiple proximal books that are "close" to it. Here, "close" means that users that liked the context book also liked the proximal books. Books from the same authors are also considered proximal. |
| ISBN | The International Standard Book Number (ISBN) is a unique identifier for each book. More precisely, it identifies a specific version/revision of a given book, which is why we need to perform ISBN deduplication to map old versions to the latest one. We do this to treat all versions of the same book as one single book. |
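
To make the encoder definitions concrete, here is a minimal sketch of how one input row could be laid out: a one-hot block for the target book followed by a multi-hot block for the context books, normalized so that it sums to 1 (the normalization used later in this notebook). The vocabularies and the `encode_sample` helper are hypothetical stand-ins written in plain Python; the real notebook uses the pickled sklearn encoders instead.

```python
from typing import Dict, List, Tuple

# Hypothetical toy vocabularies; the real ones come from the pickled encoders.
target_vocab: List[str] = ["isbn_a", "isbn_b", "isbn_c"]
context_vocab: List[str] = ["isbn_a", "isbn_b", "isbn_c", "isbn_d"]

target_index: Dict[str, int] = {isbn: i for i, isbn in enumerate(target_vocab)}
context_index: Dict[str, int] = {isbn: i for i, isbn in enumerate(context_vocab)}


def encode_sample(target: str, context: List[str]) -> Tuple[List[int], List[float]]:
    """Return the (column indices, values) of one sparse input row."""
    # One-hot part: a single 1.0 at the target book's column.
    keys = [target_index[target]]
    values = [1.0]
    # Multi-hot part: shifted past the target block, weights sum to 1.
    offset = len(target_vocab)
    weight = 1.0 / len(context)
    for isbn in sorted(context, key=lambda x: context_index[x]):
        keys.append(offset + context_index[isbn])
        values.append(weight)
    return keys, values


keys, values = encode_sample("isbn_b", ["isbn_a", "isbn_d"])
print(keys)    # [1, 3, 6]
print(values)  # [1.0, 0.5, 0.5]
```

The row has `len(target_vocab) + len(context_vocab)` columns in total, so only three of the seven entries are non-zero here; with the real vocabularies the rows are far sparser.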

<a id="Introduction-Prerequisites"></a>
### Prerequisites
The following files are required. Most of them are produced by the training notebook; the source of each file is listed below:
- target_encoder.pkl
    * Obtained from "BooksRecommenderTraining" notebook.
    * Contains a python pickled object: The target ISBN one-hot encoder.
- context_encoder.pkl
    * Obtained from "BooksRecommenderTraining" notebook.
    * Contains a python pickled object: The context ISBN multi-hot encoder.
- same_book_isbn_map.json
    * Obtained from "BooksRecommenderTraining" notebook.
    * Contains a map from duplicate ISBNs to the latest ISBN that removes some duplicate books (different editions of the same book).
- model.tar.gz
    * Obtained from "BooksRecommenderTraining" notebook.
    * An object stored in an S3 bucket (you need to know the path to the object) that contains the factorization machine model parameters used to score proximal books so that they can be ranked.
- isbn_to_proximal_isbns.json
    * Obtained from "BooksRecommenderProximity" notebook.
    * Contains a map from context ISBNs to lists of proximal books (books that are similar).
- books.csv
    * Download [here](https://cosminc98-public-datasets.s3.eu-central-1.amazonaws.com/books-recommender/books.csv). Originally from [this](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset/) Kaggle dataset. If no longer available, download the original file [here](https://cosminc98-public-datasets.s3.eu-central-1.amazonaws.com/books-recommender/original/Books.csv), although this is not the one we use.
    * Contains all metadata of the books in the dataset.
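
Before running the rest of the notebook, it can help to check that the local prerequisite files are in place. The sketch below is one possible way to do that; `find_missing_files` is a hypothetical helper, and the demo runs against a temporary directory rather than the real `model_dir`. `model.tar.gz` lives on S3 and `books.csv` in a separate data directory, so neither is covered here.

```python
import os
import tempfile
from typing import List

# Local files from the prerequisites list that are expected in one directory.
REQUIRED_LOCAL_FILES = [
    "target_encoder.pkl",
    "context_encoder.pkl",
    "same_book_isbn_map.json",
    "isbn_to_proximal_isbns.json",
]


def find_missing_files(directory: str, filenames: List[str]) -> List[str]:
    """Return the names from `filenames` that do not exist in `directory`."""
    return [
        name for name in filenames
        if not os.path.isfile(os.path.join(directory, name))
    ]


# Demo: a temporary directory that contains only one of the required files.
with tempfile.TemporaryDirectory() as tmp_dir:
    open(os.path.join(tmp_dir, "target_encoder.pkl"), "wb").close()
    missing = find_missing_files(tmp_dir, REQUIRED_LOCAL_FILES)

print(missing)
# ['context_encoder.pkl', 'same_book_isbn_map.json', 'isbn_to_proximal_isbns.json']
```

In the notebook itself you would call `find_missing_files("./", REQUIRED_LOCAL_FILES)` and stop early if the returned list is not empty.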

<a id='Notebook-Setup'></a>
## Notebook Setup
---
This notebook was tested in Amazon SageMaker Studio on an ml.t3.medium instance with the Python 3 (Data Science) kernel.

<a id='Code'></a>
## Code
---

<a id='Code-Imports'></a>
### Imports

In [2]:
import pandas as pd
import json
import numpy as np
import itertools
import pickle
import os
from typing import Dict, List, Tuple, Set, Optional, Iterable
from scipy.sparse import diags, hstack, csr_matrix
import sagemaker
from sagemaker import get_execution_role
from sagemaker.deserializers import JSONDeserializer

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


<a id="Code-File-Paths"></a>
### File Paths

In [3]:
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = "books-recommender/anonymous-regressor"
training_output_prefix = 's3://{}/{}/output'.format(bucket, prefix)

# WARNING: You need to update this; look into your sagemaker s3 bucket to find 
# the model you trained
model_path = f"{training_output_prefix}/factorization-machines-2023-12-09-10-08-08-204/output/model.tar.gz" 

model_dir = "./"
same_book_isbn_fpath = os.path.join(model_dir, "same_book_isbn_map.json")
target_encoder_fpath = os.path.join(model_dir, "target_encoder.pkl")
context_encoder_fpath = os.path.join(model_dir, "context_encoder.pkl")
proximity_map_fpath = os.path.join(model_dir, "isbn_to_proximal_isbns.json")

data_dir = "../../../data"
books_fpath = os.path.join(data_dir, "books", "books.csv")



<a id="Code-Load-Input-Files"></a>
### Load Input Files

In [4]:
with open(target_encoder_fpath, "rb") as f:
    target_encoder = pickle.load(f)
target_isbns = set(target_encoder.categories_[0])
print(f"There are {len(target_isbns)} target books")

There are 149623 target books


In [5]:
with open(context_encoder_fpath, "rb") as f:
    context_encoder = pickle.load(f)
context_isbns = set(context_encoder.classes_)
print(f"There are {len(context_isbns)} context books")

There are 170978 context books


In [6]:
with open(same_book_isbn_fpath, "r") as f:
    same_book_isbn_map = json.load(f)
print(f"There are {len(same_book_isbn_map)} duplicate ISBNs (different versions of the same book) that need to be mapped to the latest version of that book (ISBN)")

There are 24636 duplicate ISBNs (different versions of the same book) that need to be mapped to the latest version of that book (ISBN)


In [7]:
with open(proximity_map_fpath, "r") as f:
    isbn_to_proximal_isbns = json.load(f)
print(f"There are {len(isbn_to_proximal_isbns)} books for which we have 100 similar books each (should be the same as the number of context books)")

There are 170978 books for which we have 100 similar books each (should be the same as the number of context books)


In [8]:
proximal_set: Set[str] = set()
for proximal_isbns in isbn_to_proximal_isbns.values():
    proximal_set.update(proximal_isbns)
print(f"There are {len(proximal_set)} unique proximal books (ideally should be close or equal to the number of target books, but may be lower because some books are just too isolated, i.e. rated by only one person)")

There are 131925 unique proximal books (ideally should be close or equal to the number of target books, but may be lower because some books are just too isolated, i.e. rated by only one person)
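
The gap between the number of proximal books and the number of target books can be inspected with plain set arithmetic. The following is a minimal sketch on hypothetical toy data (the `toy_*` names are not part of the notebook); the same expressions would apply to `target_isbns` and `proximal_set`.

```python
from typing import Dict, List, Set

# Hypothetical toy data standing in for the loaded files.
toy_target_isbns: Set[str] = {"a", "b", "c", "d"}
toy_proximity_map: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "e": ["a"],
}

toy_proximal_set: Set[str] = set()
for proximal in toy_proximity_map.values():
    toy_proximal_set.update(proximal)

# Target books that no context book leads to: they can never be recommended.
unreachable = toy_target_isbns - toy_proximal_set
# Proximal books outside the target subset: they have no target encoding.
unscorable = toy_proximal_set - toy_target_isbns
# Fraction of target books reachable through at least one proximity list.
coverage = len(toy_target_isbns & toy_proximal_set) / len(toy_target_isbns)

print(sorted(unreachable), sorted(unscorable), coverage)
# ['d'] [] 0.75
```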


In [9]:
books_df = pd.read_csv(books_fpath, dtype={
    "ISBN": str, 
    "BookTitle": str, 
    "BookAuthor": str, 
    "YearOfPublication": int, 
    "Publisher": str, 
    "ImageURLSmall": str, 
    "ImageURLMedium": str, 
    "ImageURLLarge": str
})
isbns = books_df.ISBN.tolist()
authors = [auth.lower() for auth in books_df.BookAuthor]
book_to_author: Dict[str, str] = dict(zip(isbns, authors))
book_to_title: Dict[str, str] = dict(zip(isbns, books_df.BookTitle))

<a id="Code-Deploy-the-Endpoint"></a>
### Deploy the Endpoint

Do not forget to delete the endpoint once you are done with it; a running endpoint keeps accruing charges and can cost you a lot of money:
```python
fm_predictor.delete_endpoint()
```

Optionally, you may go to your AWS Console (in the SageMaker service) and delete the endpoint from there.
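
One way to make the cleanup harder to forget is to wrap the predictor in a context manager that deletes the endpoint even if inference raises. This is only a sketch: `endpoint_cleanup` is a hypothetical helper, and `FakePredictor` is a stand-in so the example runs without AWS access. It relies only on the `delete_endpoint()` method shown above.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def endpoint_cleanup(predictor: Any) -> Iterator[Any]:
    """Yield the predictor and always delete its endpoint afterwards."""
    try:
        yield predictor
    finally:
        predictor.delete_endpoint()


class FakePredictor:
    """Stand-in for a SageMaker predictor so the sketch runs offline."""

    def __init__(self) -> None:
        self.deleted = False

    def delete_endpoint(self) -> None:
        self.deleted = True


fake = FakePredictor()
try:
    with endpoint_cleanup(fake) as predictor:
        raise RuntimeError("simulated failure during inference")
except RuntimeError:
    pass

print(fake.deleted)  # True
```

This pattern suits short experiments; for a long-lived endpoint that serves an application you would delete it explicitly instead.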

In [10]:
%%time
fm = sagemaker.FactorizationMachinesModel(
    model_data=model_path,
    role=get_execution_role(),
    sagemaker_session=sess,
)
fm_predictor = fm.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium", # "ml.c7g.xlarge", # "ml.c4.xlarge",
    deserializer=JSONDeserializer()
)
endpoint_name = fm_predictor.endpoint_name
display(f"Endpoint's name: {endpoint_name}")

# you can't use the fm_predictor from deployment in a lambda function because
# you only deploy once; you have to use the endpoint name to initialize a
# Predictor object; we therefore make this object "None" to show that you
# cannot use it in practice
fm_predictor = None

--------------------!

"Endpoint's name: factorization-machines-2023-12-09-12-10-55-722"

CPU times: user 236 ms, sys: 20.6 ms, total: 256 ms
Wall time: 10min 33s


<a id="Code-Create-Predictor-from-Endpoint-Name"></a>
### Create Predictor from Endpoint Name

In [11]:
# this is how a lambda function would initialize the predictor (by using its
# name, not by deploying it again)
fm_predictor = sagemaker.predictor.Predictor(
    endpoint_name, 
    sagemaker_session=sagemaker.Session(),
    deserializer=JSONDeserializer()
)



<a id="Code-Inference-Functions"></a>
### Inference Functions
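
The functions in this section send sparse rows to the endpoint as JSON. As a rough illustration of that request format, the sketch below builds the payload for a single row from plain lists; `build_payload` is a hypothetical helper, and the indices and values are made up. The `keys`/`shape`/`values` layout mirrors what `serialize` in the cell below produces.

```python
import json
from typing import Dict, List


def build_payload(rows: List[Dict[str, list]], num_features: int) -> str:
    """Serialize sparse rows given as {"keys": [...], "values": [...]}."""
    instances = [
        {
            "data": {
                "features": {
                    "keys": row["keys"],        # column indices of non-zeros
                    "shape": [num_features],    # total number of columns
                    "values": row["values"],    # the non-zero values
                }
            }
        }
        for row in rows
    ]
    return json.dumps({"instances": instances})


payload = build_payload(
    [{"keys": [1, 3, 6], "values": [1.0, 0.5, 0.5]}],
    num_features=7,
)
print(payload)
# {"instances": [{"data": {"features": {"keys": [1, 3, 6], "shape": [7], "values": [1.0, 0.5, 0.5]}}}]}
```

The response is read in `predict_scores` as a JSON object whose `predictions` list holds one `score` per instance, in the same order as the request.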

In [12]:
def map_if_duplicate(isbn: str, same_book_isbn_map: Dict[str, str]) -> str:
    """
    Map current ISBN to the latest version if it is a duplicate.
    
    Args:
        isbn: The ISBN to be mapped if it is a duplicate.
        same_book_isbn_map: A map from duplicate ISBN's (old version of a book)
            to the latest ISBN (the latest version of that book).
            
    Returns:
        The ISBN after being mapped or not.
    """
    if isbn in same_book_isbn_map:
        return same_book_isbn_map[isbn]
    return isbn


def serialize(inputs_encoded: csr_matrix, float_decimals: Optional[int] = 5) -> str:
    """
    Serialize the input data in json format according to
    https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html#ir-serialization
    before sending it to the factorization machine endpoint for inference.
    
    Args:
        inputs_encoded (scipy.sparse.csr_matrix): A sparse matrix containing the 
            inference samples. Each sample (row) contains the one-hot encoded
            target book followed by the normalized multi-hot encoded context
            books, so that the model can score the target book for that
            specific context.
        float_decimals: The floating point precision (number of decimals)
            in the serialized json string. Used to reduce the size of the 
            output string. If None, rounding is not performed.
            
    Returns:
        The serialized input in the json format.
    """
    instances = []
    shape = inputs_encoded.shape[1]
    
    for row in inputs_encoded:
        values = [float(x) for x in row.data]
        if float_decimals is not None:
            values = [round(x, float_decimals) for x in values]
            
        instances.append({
            "data": {
                "features": {
                    "keys": row.indices.tolist(), 
                    "shape": [shape], 
                    "values": values
                }
            }
        })
            
    return json.dumps({
        "instances": instances
    })


def predict_scores(
    target_ids: List[str],
    context_ids: List[List[str]],
    fm_predictor: sagemaker.predictor.Predictor,
) -> List[float]:
    """
    Use the factorization machine endpoint to predict scores for the pairs
    of target book and context books.
    
    Args:
        target_ids: A list of target books for which we are predicting a
            suitability score.
        context_ids: A list of lists of context books, which are used
            to predict the score of the target.
        fm_predictor: A predictor object that connects to the sagemaker
            factorization machine endpoint.
            
    Returns:
        A list of suitability scores that will be used for ranking books.
    """
    # one-hot encode the target id
    target_ids_encoded = target_encoder.transform(np.array(target_ids).reshape(-1, 1)).astype("float32")
    
    # multi-hot encode the context ids
    context_ids_encoded = context_encoder.transform(context_ids)
    # normalize the multi-hot encoded vectors so that each sums up to 1; can 
    # handle context_ids of different lengths per target_id
    context_ids_encoded = (diags([1 / len(x) for x in context_ids]) * context_ids_encoded).astype("float32")
    
    # concatenate the target ids and the context ids along the column dimension
    inputs_encoded = hstack([target_ids_encoded, context_ids_encoded], format="csr")
    
    # serialize the sparse input matrix to json
    inputs_json = serialize(inputs_encoded)
    
    # send the serialized inputs to the factorization machine endpoint
    result = fm_predictor.predict(inputs_json, initial_args={"ContentType": "application/json"})
    scores = [x["score"] for x in result["predictions"]]
    
    return scores


def recommend(
    context_books: List[str], 
    same_book_isbn_map: Dict[str, str],
    isbn_to_proximal_isbns: Dict[str, List[str]],
    use_positional_bias: bool = True,
    bias_strength: float = 0.5,
) -> List[Tuple[str, float]]:
    """
    Recommends books based on a few (1-10 recommended) context books (which the
    anonymous user has already read and liked). It does this by scoring the 
    proximal books of each context book and ranking them based on the score
    and the positional bias (how close to the beginning of the proximal book
    list a given book was; the closer it is to the front, the more likely it
    is to be relevant).
    
    Args:
        context_books: List of books that the user liked.
        same_book_isbn_map: A map from duplicate ISBN's (old version of a book)
            to the latest ISBN (the latest version of that book).
        isbn_to_proximal_isbns: A map from context ISBNs to lists of proximal 
            books.
        use_positional_bias: If True, will bias the scores towards proximal 
            books that are close to the beginning of their respective list.
            The effect of this is reducing noise coming from irrelevant books
            at the end of the list for which the factorization machine model
            accidentally provides a large score.
        bias_strength: A coefficient describing the positional bias influence.
        
    Returns:
        A list of book recommendations (tuple of ISBNs and scores) sorted by
        the suitability score in decreasing order.
    """
    context_books = [
        map_if_duplicate(isbn, same_book_isbn_map) for isbn in context_books
    ]
    
    for isbn in context_books:
        if isbn not in context_isbns:
            raise ValueError(f'Book with ISBN "{isbn}" not in the context subset.')
            
    position_bias: Optional[List[float]] = [] if use_positional_bias else None
        
    target_ids: List[str] = []
    for context_id in context_books:
        # get proximal books of context books; we score only those books (at 
        # most 100 per context book) as a way of reducing the search space
        # since there are hundreds of thousands of books in the dataset
        proximal_isbns = isbn_to_proximal_isbns[context_id]
        
        # filter out proximal books that are given as context or that were
        # already added by another context book; the exclusion set is built
        # once per context book instead of once per candidate
        excluded = set(context_books) | set(target_ids)
        proximal_isbns = [
            isbn for isbn in proximal_isbns if isbn not in excluded
        ]
        
        if len(proximal_isbns) > 0:
            target_ids.extend(proximal_isbns)
            if position_bias is not None:
                position_bias.extend([
                    1 - idx/len(proximal_isbns) 
                    for idx in range(len(proximal_isbns))
                ])
    
    # for each target isbn (book that we want to score) we will provide the
    # same context (what books the user liked, which will be used to score
    # the targets)
    context_ids: List[List[str]] = list(
        itertools.repeat(context_books, times=len(target_ids))
    )
    
    scores = predict_scores(target_ids, context_ids, fm_predictor)
    if position_bias is not None:
        scores = [score + bias_strength * bias for score, bias in zip(scores, position_bias)]
        
    scored_books: List[Tuple[str, float]] = list(
        sorted(
            zip(target_ids, scores), 
            key=lambda x: x[1],
            reverse=True, 
        )
    )
    
    return scored_books
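
To see how the positional bias in `recommend` changes a ranking, here is a small worked sketch. The `positional_bias` and `rerank` helpers and the raw scores are hypothetical; the formula `1 - idx / n` and the default `bias_strength` of 0.5 follow the code above. Note that in `recommend` the position is taken within each context book's proximal list after filtering.

```python
from typing import List, Tuple


def positional_bias(list_length: int) -> List[float]:
    """Linearly decaying bias: 1.0 for the first item, 1/n for the last."""
    return [1 - idx / list_length for idx in range(list_length)]


def rerank(
    isbns: List[str],
    raw_scores: List[float],
    bias_strength: float = 0.5,
) -> List[Tuple[str, float]]:
    """Add the scaled positional bias to each score and sort descending."""
    bias = positional_bias(len(isbns))
    adjusted = [s + bias_strength * b for s, b in zip(raw_scores, bias)]
    return sorted(zip(isbns, adjusted), key=lambda x: x[1], reverse=True)


# Four proximal books in list order; the third has the highest raw score.
ranked = rerank(["p0", "p1", "p2", "p3"], [0.5, 0.5, 1.0, 0.5])
print(ranked)
# [('p2', 1.25), ('p0', 1.0), ('p1', 0.875), ('p3', 0.625)]
```

Books with equal raw scores end up ordered by their position in the proximal list, while a sufficiently higher raw score can still overtake an earlier position, as `p2` does here.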

<a id="Code-Display-Functions"></a>
### Display Functions

In [13]:
def print_book(title: str, author: str, score: Optional[float] = None) -> None:
    if len(title) > 40:
        title = title[:37] + "..."
    if len(author) > 40:
        author = author[:37] + "..."
    output = f'"{title}" by "{author}"'
    if score is not None:
        output = f"[{score:.3f}] " + output
    print(output)


def display_recommendations(
    scored_books: List[Tuple[str, float]],
    context_books: List[str],
    book_to_author: Dict[str, str], 
    book_to_title: Dict[str, str],
    best_n_only: Optional[int] = 20,
) -> None:
    n = len(scored_books)
    if best_n_only is not None:
        n = min(n, best_n_only)
        
    print("The books that the recommendations were based on:")
    for isbn in context_books:
        title = book_to_title[isbn]
        author = book_to_author[isbn]
        print_book(title, author)
          
    print("\nThe recommended books:")
    for isbn, score in scored_books[:n]:
        title = book_to_title[isbn]
        author = book_to_author[isbn]
        print_book(title, author, score)

<a id="Code-Example-Recommendations"></a>
### Example Recommendations

In [14]:
context_books = ["0192177737", "0393316823", "0553381695"]
scored_books = recommend(context_books, same_book_isbn_map, isbn_to_proximal_isbns, use_positional_bias=True)
display_recommendations(scored_books, context_books, book_to_author, book_to_title, best_n_only=20)

The books that the recommendations were based on:
"The Selfish Gene" by "richard dawkins"
"Climbing Mount Improbable" by "richard dawkins"
"A Clash of Kings (A Song of Ice and F..." by "george r.r. martin"

The recommended books:
[1.189] "A Clash of Kings (A Song of Fire and ..." by "george r. r. martin"
[1.162] "Windhaven" by "george r. r. martin"
[1.150] "Warchild" by "karin lowachee"
[1.143] "A Storm of Swords (A Song of Ice and ..." by "george r.r. martin"
[1.118] "The Blind Watchmaker: Why the Evidenc..." by "richard dawkins"
[1.110] "The Biotech Century: Harnessing the G..." by "jeremy rifkin"
[1.106] "Shock" by "robin cook"
[1.097] "Battlefield Earth: A Saga of the Year..." by "l. ron hubbard"
[1.095] "Sarajevo Daily: A City and Its Newspa..." by "tom gjelten"
[1.094] "My Century: A Novel" by "gunter grass"
[1.082] "Fevre Dream" by "george r.r. martin"
[1.079] "Tuf Voyaging" by "george r. r. martin"
[1.075] "The Extended Phenotype: The Long Reac..." by "richard dawkins"
[1.074] 

In [15]:
context_books = ["2070360318", "1566194334", "0451517938", "0679643087"]
scored_books = recommend(context_books, same_book_isbn_map, isbn_to_proximal_isbns, use_positional_bias=True)
display_recommendations(scored_books, context_books, book_to_author, book_to_title, best_n_only=20)

The books that the recommendations were based on:
"Eugenie Grandet" by "honore de balzac"
"Crime &amp; Punishment" by "fyodor m. dostoevsky"
"Red and the Black" by "stendhal"
"The Sorrows of Young Werther (Modern ..." by "goethe"

The recommended books:
[1.155] "The Return of the King (The Lord of t..." by "j. r. r. tolkien"
[1.082] "Animal Liberation" by "peter singer"
[1.076] "The Pleasure of My Company: A Novel" by "steve martin"
[1.071] "About the Author : A Novel" by "john colapinto"
[1.065] "Cascades - \Fahrenheit 451\" (Collins..." by "ray bradbury"
[1.055] "Lonely Planet Spain (Serial)" by "john noble"
[1.054] "Routledge Philosophy GuideBook to Pla..." by "nickolas pappas"
[1.052] "Pere Goriot (Oxford World's Classics ..." by "honore de balzac"
[1.051] "The Symposium (Penguin Classics)" by "plato"
[1.046] "Great Dialogues of Plato (Signet Clas..." by "plato"
[1.045] "The Red and the Black" by "stendhal"
[1.044] "Scenes from a Courtesan's Life" by "honore de balzac"
[1.042] "Le 

<a id='Code-Cleanup'></a>
### Cleanup


When we're done with the endpoint, we delete it. This terminates the instances we deployed, so that we stop being charged for them. Alternatively, you may go to your AWS Console (in the SageMaker service) and delete the endpoint from there.

In [16]:
fm_predictor.delete_endpoint()