In [2]:
%%html
<style>
table {align:left; display:block}
</style>

# Proximity Book Recommendations
---

1. [Introduction](#Introduction)
    * [Background](#Introduction-Background)
    * [Definitions](#Introduction-Definitions)
    * [Prerequisites](#Introduction-Prerequisites)
2. [Notebook Setup](#Notebook-Setup)
3. [Code](#Code)
    * [Imports](#Code-Imports)
    * [File Paths](#Code-File-Paths)
    * [Load Input Files](#Code-Load-Input-Files)
    * [Proximity Search Features](#Code-Proximity-Search-Features)
    * [Proximity Search Algorithm](#Code-Proximity-Search-Algorithm)
    * [Parallel Proximity Search](#Code-Parallel-Proximity-Search)

<a id="Introduction"></a>
## Introduction
---
<a id="Introduction-Background"></a>
### Background
In our application, anonymous users name a few books they liked and we recommend books based on those. Our factorization machine model can score a book given the user context (the books the user said they liked), but there are almost 200k books to search through. Scoring all of them would be computationally expensive and could introduce noise from wrong predictions. This notebook vastly reduces the search space for book recommendation: for each book in the context subset (see [definitions](#Introduction-Definitions)) we find a roughly fixed number of proximal books. The resulting mapping is saved as a dictionary in the following JSON format:
```json
{
    "ISBN_0": ["ISBN_4", "ISBN_5", ...],
    "ISBN_1": ["ISBN_6", "ISBN_7", ...],
    "ISBN_2": ["ISBN_8", "ISBN_9", ...],
    ... 
}
```
The file above will be named "isbn_to_proximal_isbns.json". In the example, the keys (ISBN 0 through 2) must be from the context subset, and the values (ISBN 4 through 9) must be from the target subset (see [definitions](#Introduction-Definitions) to understand why).
Using this dictionary, we can quickly find a few books that are closely related to those the user liked and rank them with the factorization machine model. To generate even more recommendations, we iterate: score the first batch of proximal recommendations, then fetch new proximal recommendations from the books with the highest scores.
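
The lookup-then-rank step described above can be sketched as follows. This is a minimal illustration, not the application's code: `candidate_isbns`, `recommend`, and the `score` callable are hypothetical names, and the toy scorer merely stands in for the factorization machine model.

```python
from typing import Callable, Dict, List


def candidate_isbns(
    liked: List[str],
    isbn_to_proximal_isbns: Dict[str, List[str]],
) -> List[str]:
    """Collect the proximal books of every liked book, without duplicates."""
    seen = set(liked)  # never recommend a book the user already named
    candidates: List[str] = []
    for isbn in liked:
        for proximal in isbn_to_proximal_isbns.get(isbn, []):
            if proximal not in seen:
                seen.add(proximal)
                candidates.append(proximal)
    return candidates


def recommend(
    liked: List[str],
    isbn_to_proximal_isbns: Dict[str, List[str]],
    score: Callable[[List[str], str], float],
    top_k: int,
) -> List[str]:
    """Rank the candidates with the (assumed) scoring function, best first."""
    candidates = candidate_isbns(liked, isbn_to_proximal_isbns)
    ranked = sorted(candidates, key=lambda isbn: score(liked, isbn), reverse=True)
    return ranked[:top_k]


# toy data: the mapping has the same shape as isbn_to_proximal_isbns.json
toy_mapping = {"ISBN_0": ["ISBN_4", "ISBN_5"], "ISBN_1": ["ISBN_5", "ISBN_6"]}
toy_scores = {"ISBN_4": 0.2, "ISBN_5": 0.9, "ISBN_6": 0.5}

top_books = recommend(
    liked=["ISBN_0", "ISBN_1"],
    isbn_to_proximal_isbns=toy_mapping,
    score=lambda liked, isbn: toy_scores[isbn],  # stand-in for the real model
    top_k=2,
)
print(top_books)
```

Only the handful of candidates found through the dictionary are scored, rather than the full target subset.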

<a id="Introduction-Definitions"></a>
### Definitions
| Term | Definition |
|:--- |:--- | 
| Context Book | A book that is provided by the user, with the assumption that they liked it (explicit feedback), that will be used to rank recommendations using the factorization machine model. The context books are only a subset of the book dataset since some books did not have enough data during training. **Users may only choose liked books from this subset.** |
| Target Book | A book that can be scored by the factorization machine model. The target books are only a subset of the book dataset since some books did not have enough data during training. **Users will receive recommendations only from this subset.** |
| Encoder | The encoders we refer to in this notebook are either **sklearn.preprocessing.OneHotEncoder** for the target books (since we only have one per prediction), or **sklearn.preprocessing.MultiLabelBinarizer** for the context books (since we have multiple ones for each prediction). **We load these encoders to retrieve the context and target book subsets.** |
| Proximal Book | A context book has multiple proximal books that are "close" to it. Here, "close" means that users that liked the context book also liked the proximal books. Books from the same authors are also considered proximal. |
| ISBN | The International Standard Book Number (ISBN) is a unique identifier for each book. More precisely, it identifies a specific version/revision of a given book, which is why we need to perform ISBN deduplication to map old versions to the latest one. We do this to treat all versions of the same book as one single book. |

<a id="Introduction-Prerequisites"></a>
### Prerequisites
The following files are required, all of which are the result of training the model using the training notebook:
- target_encoder.pkl
- context_encoder.pkl
- same_book_isbn_map.json
- valid_isbn.json

<a id="Notebook-Setup"></a>
## Notebook Setup
---
This notebook was tested in Amazon SageMaker Studio on an ml.c5.2xlarge (8 vCPUs, 16GB RAM) instance with the Python 3 (Data Science) kernel. 
It may also work with smaller instances, but at least 8GB of RAM is recommended. With ml.c5.2xlarge, the processing time is around 35 minutes.
You may use instances with even more CPUs, but shut them down as soon as the processing is done and the dictionary is saved as a .json file, to avoid being charged a large sum.

<a id="Code"></a>
## Code
---

<a id="Code-Imports"></a>
### Imports

In [2]:
import pickle
import os
import re
import json
from typing import Dict, List, Tuple, Set, Callable
from functools import partial
import random

import pandas as pd
from tqdm.notebook import tqdm
from multiprocessing import Pool

<a id="Code-File-Paths"></a>
### File Paths

In [3]:
# input paths
model_dir = "./"
valid_isbns_fpath = os.path.join(model_dir, "valid_isbn.json")
same_book_isbn_fpath = os.path.join(model_dir, "same_book_isbn_map.json")
target_encoder_fpath = os.path.join(model_dir, "target_encoder.pkl")
context_encoder_fpath = os.path.join(model_dir, "context_encoder.pkl")

data_dir = "../../../data"
ratings_fpath = os.path.join(data_dir, "books", "ratings-train.csv")
books_fpath = os.path.join(data_dir, "books", "books.csv")

# output paths
output_dir = "./"
os.makedirs(output_dir, exist_ok=True)
output_fpath = os.path.join(output_dir, "isbn_to_proximal_isbns.json")

<a id="Code-Load-Input-Files"></a>
### Load Input Files

In [4]:
with open(target_encoder_fpath, "rb") as f:
    target_encoder = pickle.load(f)
target_isbns = set(target_encoder.categories_[0])
print(f"There are {len(target_isbns)} target books")

There are 149623 target books


In [5]:
with open(context_encoder_fpath, "rb") as f:
    context_encoder = pickle.load(f)
context_isbns = set(context_encoder.classes_)
print(f"There are {len(context_isbns)} context books")

There are 170978 context books


In [6]:
with open(valid_isbns_fpath, "r") as f:
    valid_isbns = set(json.load(f))
print(f"There are {len(valid_isbns)} total valid ISBNs (books)")

There are 263516 total valid ISBNs (books)


In [7]:
with open(same_book_isbn_fpath, "r") as f:
    same_book_isbn_map = json.load(f)
print(f"There are {len(same_book_isbn_map)} duplicate ISBNs (different versions of the same book) that need to be mapped to the latest version of that book (ISBN)")

There are 24636 duplicate ISBNs (different versions of the same book) that need to be mapped to the latest version of that book (ISBN)


In [8]:
def map_if_duplicate(isbn: str, same_book_isbn_map: Dict[str, str]) -> str:
    """
    Map current ISBN to the latest version if it is a duplicate.
    
    Args:
        isbn: The ISBN to be mapped if it is a duplicate.
        same_book_isbn_map: A map from duplicate ISBN's (old version of a book)
            to the latest ISBN (the latest version of that book).
            
    Returns:
        The ISBN after being mapped or not.
    """
    if isbn in same_book_isbn_map:
        return same_book_isbn_map[isbn]
    return isbn


def read_user_ratings(
    file_path: str, 
    same_book_isbn_map: Dict[str, str], 
    valid_isbns: Set[str],
) -> pd.DataFrame:
    """
    Read the user ratings, remove ratings with invalid ISBNs, and map 
    duplicates to the latest version.
    
    Args:
        file_path: File path of the user ratings file.
        same_book_isbn_map: A map from duplicate ISBN's (old version of a book)
            to the latest ISBN (the latest version of that book).
        valid_isbns: Set of valid ISBNs.
        
    Returns:
        The user ratings dataframe.
    """
    df = pd.read_csv(file_path, dtype={"UserID": str, "ISBN": str, "BookRating": int})
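    # copy the filtered rows so the column assignment below modifies an
    # independent dataframe rather than a view of the original one
    df = df[df.ISBN.isin(valid_isbns)].copy()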
    df["ISBN"] = df["ISBN"].apply(partial(map_if_duplicate, same_book_isbn_map=same_book_isbn_map))
    df.drop_duplicates(subset=["UserID", "ISBN"], keep="last", inplace=True)
    return df


ratings_df = read_user_ratings(ratings_fpath, same_book_isbn_map, valid_isbns)
ratings_df

Unnamed: 0,UserID,ISBN,BookRating
0,25409,081296666X,10
1,25533,0440910846,0
2,26182,1570625190,0
3,26624,0698119517,0
4,26731,0515087947,0
...,...,...,...
1034894,274308,0671543032,0
1034895,275154,0375703063,8
1034897,275970,0553348973,10
1034898,275970,0850253101,0


In [9]:
books_df = pd.read_csv(books_fpath, dtype={
    "ISBN": str, 
    "BookTitle": str, 
    "BookAuthor": str, 
    "YearOfPublication": int, 
    "Publisher": str, 
    "ImageURLSmall": str, 
    "ImageURLMedium": str, 
    "ImageURLLarge": str
})
books_df = books_df.drop(["Publisher", "ImageURLSmall", "ImageURLMedium", "ImageURLLarge", "YearOfPublication"], axis=1)
books_df

Unnamed: 0,ISBN,BookTitle,BookAuthor
0,1565920317,!%@ (A Nutshell handbook),Donnalyn Frey
1,1565920465,!%@ (A Nutshell handbook),Donnalyn Frey
2,0133989429,!Arriba! Comunicacion y cultura,Eduardo Zayas-Bazan
3,013327974X,"!Trato hecho!: Spanish for Real Life, Combined...",John T. McMinn
4,0452279186,!Yo!,Julia Alvarez
...,...,...,...
263969,3499232499,Ã?Â?lpiraten.,Janwillem van de Wetering
263970,325721538X,Ã?Â?rger mit Produkt X. Roman.,Joan Aiken
263971,3451274973,Ã?Â?sterlich leben.,Anselm GrÃ?Â¼n
263972,3442725739,Ã?Â?stlich der Berge.,David Guterson


<a id="Code-Proximity-Search-Features"></a>
### Proximity Search Features
Now we create the features that will be used to discover relationships between books. For example, we will have a map from ISBN (book ID) to the users that rated that book and another map from users to the books they rated. This way we don't have to explicitly build a graph.
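
As a minimal sketch of this idea, the two maps can be built with plain dictionaries and walked in two hops (a book, then its raters, then their other books) without any graph library. The toy ratings and the `build_maps` and `co_rated_books` helpers are illustrative assumptions; the notebook itself stores pandas DataFrames in these maps.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Rating = Tuple[str, str, int]  # (user_id, isbn, rating)


def build_maps(
    ratings: List[Rating],
) -> Tuple[Dict[str, List[Tuple[str, int]]], Dict[str, List[Tuple[str, int]]]]:
    """Build user -> [(isbn, rating)] and isbn -> [(user, rating)] maps."""
    user_to_books = defaultdict(list)
    book_to_users = defaultdict(list)
    for user_id, isbn, rating in ratings:
        user_to_books[user_id].append((isbn, rating))
        book_to_users[isbn].append((user_id, rating))
    return dict(user_to_books), dict(book_to_users)


def co_rated_books(
    isbn: str,
    user_to_books: Dict[str, List[Tuple[str, int]]],
    book_to_users: Dict[str, List[Tuple[str, int]]],
) -> List[str]:
    """Books rated by any user who also rated `isbn` (two hops in the maps)."""
    found: List[str] = []
    for user_id, _ in book_to_users.get(isbn, []):
        for other_isbn, _ in user_to_books[user_id]:
            if other_isbn != isbn and other_isbn not in found:
                found.append(other_isbn)
    return found


toy_ratings = [("u1", "A", 9), ("u1", "B", 3), ("u2", "A", 8), ("u2", "C", 10)]
toy_user_to_books, toy_book_to_users = build_maps(toy_ratings)
neighbours = co_rated_books("A", toy_user_to_books, toy_book_to_users)
print(neighbours)
```

The pair of dictionaries acts as an adjacency list for a bipartite user-book graph, so each hop is a dictionary lookup.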

In [10]:
user_to_books: Dict[str, pd.DataFrame] = {}
for user_id, user_df in ratings_df.groupby("UserID"):
    user_to_books[user_id] = user_df
print(f"We created a map of {len(user_to_books)} users to the books they rated. The number of ratings per user:")

pd.Series([len(x) for x in user_to_books.values()]).describe(percentiles=[0.6, 0.75, 0.9, 0.95, 0.99, 0.995])

We created a map of 80043 users to the books they rated. The number of ratings per user:


count    80043.000000
mean        11.300913
std         90.128380
min          1.000000
50%          1.000000
60%          2.000000
75%          4.000000
90%         13.000000
95%         31.000000
99%        180.000000
99.5%      320.000000
max      10267.000000
dtype: float64

As the percentiles show, the top 1% of users have rated 180 books or more (the top 0.5% at least 320), and the most active user has 10,267 ratings. If we kept all of these ratings, the network of books would connect to those users too often and be biased towards the books they liked. We therefore sample 200 books from each user who has rated more than 200.
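
The capping step can be sketched in plain Python, with lists standing in for the per-user DataFrames. `cap_ratings` is a hypothetical helper, and the seeded `random.Random` instance is an assumption made here to keep the sample reproducible.

```python
import random
from typing import Dict, List


def cap_ratings(
    user_to_books: Dict[str, List[str]],
    max_per_user: int,
    seed: int = 42,
) -> Dict[str, List[str]]:
    """Keep at most `max_per_user` randomly sampled books for each user."""
    rng = random.Random(seed)  # local generator, independent of global state
    return {
        user_id: (
            books
            if len(books) <= max_per_user
            else rng.sample(books, max_per_user)  # sample without replacement
        )
        for user_id, books in user_to_books.items()
    }


toy_users = {
    "light_reader": ["A", "B"],
    "heavy_reader": ["A", "B", "C", "D", "E", "F"],
}
capped = cap_ratings(toy_users, max_per_user=3)
print({user_id: len(books) for user_id, books in capped.items()})
```

Users at or below the cap are left untouched, so only the heaviest raters lose information.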

In [11]:
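random.seed(42)
# DataFrame.sample draws from NumPy's random state, which random.seed() does
# not control, so pass random_state explicitly to make the sample reproducible
user_to_books: Dict[str, pd.DataFrame] = {
    uid: (isbns if len(isbns) <= 200 else isbns.sample(n=200, random_state=42)) 
    for uid, isbns in user_to_books.items()
}
print("The number of ratings per user after sampling:")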
pd.Series([len(x) for x in user_to_books.values()]).describe(percentiles=[0.6, 0.75, 0.9, 0.95, 0.99])

The number of ratings per user after sampling:


count    80043.000000
mean         7.816599
std         24.756617
min          1.000000
50%          1.000000
60%          2.000000
75%          4.000000
90%         13.000000
95%         31.000000
99%        180.000000
max        200.000000
dtype: float64

In [12]:
book_to_users: Dict[str, pd.DataFrame] = {}
for isbn, isbn_df in ratings_df.groupby("ISBN"):
    book_to_users[isbn] = isbn_df
print(f"We created a map of {len(book_to_users)} books to the users that have rated them. The number of users per book:")

pd.Series([len(x) for x in book_to_users.values()]).describe(percentiles=[0.6, 0.75, 0.9, 0.95, 0.99, 0.995])

We created a map of 225388 books to the users that have rated them. The number of users per book:


count    225388.000000
mean          4.013341
std          15.259392
min           1.000000
50%           1.000000
60%           2.000000
75%           3.000000
90%           7.000000
95%          12.000000
99%          46.000000
99.5%        75.000000
max        2212.000000
dtype: float64

In [13]:
def remove_punctuation_and_spaces(text: str) -> str:
    # raw string avoids the invalid "\W" escape-sequence warning
    return re.sub(r"\W+", "", text).lower().strip()
books_df["FixedAuthor"] = books_df["BookAuthor"].apply(remove_punctuation_and_spaces)

isbns = books_df.ISBN.tolist()
authors = [auth.lower() for auth in books_df.FixedAuthor]
book_to_author: Dict[str, str] = dict(zip(isbns, authors))
book_to_title: Dict[str, str] = dict(zip(isbns, books_df.BookTitle))
author_to_books: Dict[str, List[str]] = {}

for author, author_df in books_df.groupby("FixedAuthor"):
    author = author.lower()
    author_to_books[author] = author_df.ISBN.tolist()
    
print(f"We have {len(book_to_author)} books and {len(author_to_books)} authors. We created two maps: one from books to authors, and one from authors to books.")

We have 263974 books and 95053 authors. We created two maps: one from books to authors, and one from authors to books.


<a id="Code-Proximity-Search-Algorithm"></a>
### Proximity Search Algorithm

Proximity search relies on two main functions, which are mutually recursive:
1. **_get_proximal_isbns_user**
    * Returns books that are proximal to a given user. 
    * Will recursively call "_get_proximal_isbns_book" to get proximal books from books that this user liked.
2. **_get_proximal_isbns_book**
    * Returns books that are proximal to a given book. 
    * Will recursively call "_get_proximal_isbns_user" to get proximal books from users that have liked this book.
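
Both functions in the next cell split their remaining budget of results among neighbours (the users of a book, or the books of a user) in proportion to `1 + rating * rating`, visiting the highest ratings first. The standalone sketch below mirrors only that arithmetic. `allocate_budget` is a hypothetical helper, and it assumes every neighbour returns exactly as many books as it is asked for, which the real search does not guarantee.

```python
from typing import List


def allocate_budget(ratings: List[int], n_results: int) -> List[int]:
    """Return how many results each neighbour contributes, in input order.

    `ratings` is expected in descending order, as in the notebook.
    """
    weights = [1 + rating * rating for rating in ratings]
    granted_per_neighbour: List[int] = []
    n_remaining = n_results
    for index in range(len(weights)):
        if n_remaining <= 0:
            # budget exhausted: later neighbours are never visited
            granted_per_neighbour.append(0)
            continue
        # proportional share of what is left, plus one so that every visited
        # neighbour is asked for at least one result
        requested = int(1 + weights[index] / sum(weights[index:]) * n_remaining)
        granted = min(requested, n_remaining)
        granted_per_neighbour.append(granted)
        n_remaining -= granted
    return granted_per_neighbour


shares = allocate_budget([10, 8, 0], n_results=20)
print(shares)
```

With squared weights, a rating of 10 (weight 101) pulls clearly more of the budget than a rating of 8 (weight 65), and the lowest-rated neighbours are reached only if earlier ones leave budget unused.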

In [14]:
def _select_books(books: List[str], max_books: int, already_selected_isbns: Set[str]) -> List[str]:
    """
    Select a maximum number of books from a list. Chooses the first "max_books"
    books that haven't previously been selected and are in the target subset
    so they can be scored using the factorization machine model.
    
    Args:
        books: A list of books IDs (ISBNs) from which to select.
        max_books: The maximum number of books to select (may not be the first
            ones if they don't pass the criteria).
        already_selected_isbns: A set of books that have already been selected 
            and which should not be selected again to prevent duplication.        
        
    Returns:
        A list of the selected ISBNs (book IDs).
    """
    selected_books_count = 0
    selected_books: List[str] = []
    
    for isbn in books:
        if isbn in already_selected_isbns:
            continue
            
        if selected_books_count >= max_books:
            break
            
        already_selected_isbns.add(isbn)
        
        # only add ISBNs for which we can predict the rating using the 
        # factorization machine model (those that are in the target subset)
        if isbn in target_isbns:
            selected_books.append(isbn)
            selected_books_count += 1
            
    return selected_books


def _get_proximal_isbns_user(
    user_id: str, 
    n_results: int, 
    book_to_users: Dict[str, pd.DataFrame], 
    user_to_books: Dict[str, pd.DataFrame],
    book_to_author: Dict[str, str],
    author_to_books: Dict[str, List[str]],
    visited_isbns: Set[str],
    visited_users: Set[str],
    already_selected_isbns: Set[str],
    min_like_rating: int = 7,
    max_books_same_user: int = 8,
) -> List[str]:
    """
    Get proximal books for a given user, which will contain books that they
    liked directly, or "liked" books of users that liked the same books as the
    current user.
    
    Args:
        user_id: The unique identifier of the user for which we are searching
            proximal books.
        n_results: The number of books to be returned.
        book_to_users: A map from books to the users that rated them.
        user_to_books: A map from users to the books they rated.
        book_to_author: A map from books to their author.
        author_to_books: A map from authors to the books they wrote.
        visited_isbns: A set of books (ISBNs) visited during the search.
        visited_users: A set of users (user IDs) visited during the search.
        already_selected_isbns: A set of books that were already selected to
            be returned (to prevent them from being added twice).
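        min_like_rating: The rating threshold (out of 10) for a book to count
            as "liked": only ratings strictly greater than this value qualify.
            Books selected directly from this user come from the liked books;
            the recursive search still visits all of the user's unvisited
            rated books, weighted by rating.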
        max_books_same_user: The maximum number of books that will come 
            directly from this user's liked books. The rest will come from
            what similar users liked.
    
    Returns:
        A list of ISBNs of books that are considered proximal to the user.
    """
    visited_users.add(user_id)
    
    if n_results <= 0:
        return []
    
    proximal_isbns: List[str] = []

    # dataframe of books that this user has rated, sorted by the rating descending
    books_df = user_to_books[user_id].sort_values(by="BookRating", ascending=False)
    # eliminate books that have already been visited
    books_df = books_df[~books_df.ISBN.isin(visited_isbns)]
    
    # books that the person has "liked" given the rating threshold
    liked_books = books_df[books_df.BookRating > min_like_rating]
    # sort by rating descending to iterate through the most liked books first
    liked_books = liked_books.sort_values(by="BookRating", ascending=False)
    
    # n_remaining represents the number of books that still need to be found
    # out of the maximum n_results that this function needs to return
    n_remaining = n_results
    
    # select at most min(n_remaining, max_books_same_user) of the liked books 
    # of this user, starting with the highest rated ones
    liked_books = liked_books.ISBN.tolist()
    selected_books = _select_books(
        books=liked_books,
        max_books=min(n_remaining, max_books_same_user),
        already_selected_isbns=already_selected_isbns,
    )
    proximal_isbns.extend(selected_books)
    n_remaining -= len(selected_books)
    
    # weights assigned to each book; based on this number we will extract a
    # proportional number of proximal books from each of them; high ratings
    # get a much larger proportion since the weights are the ratings squared
    weights = [1 + rating * rating for rating in books_df.BookRating]
    for book_index, book_id in enumerate(books_df.ISBN):
        if n_remaining <= 0:
            break
        
        # extract a proportional number of proximal books from this book
        book_n_results = int(1 + weights[book_index] / sum(weights[book_index:]) * n_remaining)
        book_results = _get_proximal_isbns_book(
            isbn=book_id,
            n_results=book_n_results,
            book_to_users=book_to_users,
            user_to_books=user_to_books,
            book_to_author=book_to_author,
            author_to_books=author_to_books,
            visited_isbns=visited_isbns,
            visited_users=visited_users,
            already_selected_isbns=already_selected_isbns,
        )
            
        proximal_isbns.extend(book_results[:n_remaining])
        n_remaining = n_results - len(proximal_isbns)
        
    return proximal_isbns    
    
    
def _get_proximal_isbns_book(
    isbn: str, 
    n_results: int, 
    book_to_users: Dict[str, pd.DataFrame], 
    user_to_books: Dict[str, pd.DataFrame],
    book_to_author: Dict[str, str],
    author_to_books: Dict[str, List[str]],
    visited_isbns: Set[str],
    visited_users: Set[str],
    already_selected_isbns: Set[str],
    max_books_same_author: int = 10,
) -> List[str]:
    """
    Get proximal books for a given book, which will contain books from the same
    author, or books from users that liked this book.
    
    Args:
        isbn: The unique identifier of the book for which we are searching
            proximal books.
        n_results: The number of books to be returned.
        book_to_users: A map from books to the users that rated them.
        user_to_books: A map from users to the books they rated.
        book_to_author: A map from books to their author.
        author_to_books: A map from authors to the books they wrote.
        visited_isbns: A set of books (ISBNs) visited during the search.
        visited_users: A set of users (user IDs) visited during the search.
        already_selected_isbns: A set of books that were already selected to
            be returned (to prevent them from being added twice).
        max_books_same_author: The maximum number of books from the same author
            that will be included directly in the first iteration. More books 
            from this author can be included, but they would have to be 
            discovered from similar users in the recursive call of 
            "_get_proximal_isbns_user".
    
    Returns:
        A list of ISBNs of books that are considered proximal to the given book.
    """
    visited_isbns.add(isbn)
    
    if n_results <= 0:
        return []
    
    proximal_isbns: List[str] = []
    
    # n_remaining represents the number of books that still need to be found
    # out of the maximum n_results that this function needs to return
    n_remaining = n_results
    
    # select at random at most min(n_remaining, max_books_same_author) other
    # books of this author
    same_author_books = author_to_books[book_to_author[isbn]]
    same_author_books = random.sample(same_author_books, len(same_author_books))
    selected_books = _select_books(
        books=same_author_books,
        max_books=min(n_remaining, max_books_same_author),
        already_selected_isbns=already_selected_isbns,
    )
    proximal_isbns.extend(selected_books)
    n_remaining -= len(selected_books)
    
    # dataframe of users that have rated this book, sorted by the rating descending
    users_df = book_to_users[isbn].sort_values(by="BookRating", ascending=False, ignore_index=True)
    # removing the users that have already been visited
    users_df = users_df[~users_df.UserID.isin(visited_users)]
    
    
    # weights assigned to each user; based on this number we will extract a
    # proportional number of proximal books from each of them; high ratings
    # get a much larger proportion since the weights are the ratings squared
    weights = [1 + rating * rating for rating in users_df.BookRating]
    
    for user_index, user_id in enumerate(users_df.UserID):
        if n_remaining <= 0:
            break
        
        # extract a proportional number of proximal books from this user
        user_n_results = int(1 + weights[user_index] / sum(weights[user_index:]) * n_remaining)
        user_results = _get_proximal_isbns_user(
            user_id=user_id,
            n_results=user_n_results,
            book_to_users=book_to_users, 
            user_to_books=user_to_books,
            book_to_author=book_to_author,
            author_to_books=author_to_books,
            visited_isbns=visited_isbns,
            visited_users=visited_users,
            already_selected_isbns=already_selected_isbns,
        )
        
        proximal_isbns.extend(user_results[:n_remaining])
        n_remaining = n_results - len(proximal_isbns)
        
    return proximal_isbns    
    

def get_proximal_isbns(
    isbn: str, 
    n_results: int, 
    book_to_users: Dict[str, pd.DataFrame], 
    user_to_books: Dict[str, pd.DataFrame],
    book_to_author: Dict[str, str],
    author_to_books: Dict[str, List[str]],
) -> List[str]:
    """
    Get proximal books for a given book, which will contain books from the same
    author, or books from users that liked this book.
    
    Args:
        isbn: The unique identifier of the book for which we are searching
            proximal books.
        n_results: The number of books to be returned.
        book_to_users: A map from books to the users that rated them.
        user_to_books: A map from users to the books they rated.
        book_to_author: A map from books to their author.
        author_to_books: A map from authors to the books they wrote.
    
    Returns:
        A list of ISBNs of books that are considered proximal to the given book.
    """
        
    random.seed(42)
    
    already_selected_isbns: Set[str] = set([isbn])
    visited_isbns: Set[str] = set()
    visited_users: Set[str] = set()
    
    return _get_proximal_isbns_book(
        isbn=isbn, 
        n_results=n_results, 
        book_to_users=book_to_users, 
        user_to_books=user_to_books,
        book_to_author=book_to_author,
        author_to_books=author_to_books,
        visited_isbns=visited_isbns,
        visited_users=visited_users,
        already_selected_isbns=already_selected_isbns,
    )

# book_isbn = "0553381695"
book_isbn = "0192177737"
book_isbn = map_if_duplicate(book_isbn, same_book_isbn_map)

proximal_isbns = get_proximal_isbns(book_isbn, 100, book_to_users, user_to_books, book_to_author, author_to_books)
print(f'Getting proximal books for "{book_to_title[book_isbn]}" by "{book_to_author[book_isbn]}"')

Getting proximal books for "The Selfish Gene" by "richarddawkins"


In [15]:
book_isbn in proximal_isbns

False

In [16]:
[book_to_author[x] for x in proximal_isbns][:20]

['richarddawkins',
 'richarddawkins',
 'richarddawkins',
 'richarddawkins',
 'richarddawkins',
 'richarddawkins',
 'richarddawkins',
 'richarddawkins',
 'richarddawkins',
 'richarddawkins',
 'byronpreiss',
 'byrdbaggett',
 'billphillips',
 'howardmshapiro',
 'panjajurgens',
 'gregpahl',
 'roberthendrickson',
 'byronpreiss',
 'byronpreiss',
 'byrdbaggett']

In [17]:
[book_to_title[x] for x in proximal_isbns][:20]

['Der entzauberte Regenbogen. Wissenschaft, Aberglaube und die Kraft der Phantasie.',
 "A Devil's Chaplain: Reflections on Hope, Lies, Science and Love",
 'The Extended Phenotype: The Long Reach of the Gene (Popular Science)',
 "God's Utility Function (Phoenix 60p Paperbacks)",
 'Unweaving the Rainbow: Science, Delusion and the Appetite for Wonder',
 'Climbing Mount Improbable',
 'The Blind Watchmaker: Why the Evidence of Evolution Reveals a Universe Without Design',
 'The Extended Phenotype: The Long Reach of the Gene',
 'River Out of Eden: A Darwinian View of Life (Science Masters)',
 'River Out of Eden: A Darwinian View of Life (Science Masters Series)',
 'The Microverse',
 'The Pocket Power Book Of Leadership',
 'Body for Life: 12 Weeks to Mental and Physical Strength',
 "Dr. Shapiro's Picture Perfect Weight Loss : The Visual Program for Permanent Weight Loss",
 'They Call Themselves Queens: The Transformation Series',
 "Complete Idiot's Guide to Saving the Environment",
 'The Fact

<a id="Code-Parallel-Proximity-Search"></a>
### Parallel Proximity Search
All that is left is to find the proximal ISBNs for all 170k books in the context subset. The workers need several very large objects, for example the dictionary that maps each ISBN to all users that rated that book and the dictionary that maps each user to all books they rated. If we bound these parameters with functools.partial, the large objects would be sent to the workers with every task. Instead, we use the initializer of multiprocessing.Pool to attach the objects to the wrapper function object itself, so the large dictionaries are set up only once in each worker.
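
The initializer pattern can be sketched in isolation as follows. This is a simplified variant under stated assumptions: it keeps the shared data in module-level globals rather than in function attributes as the cell below does, and `init_worker`, `first_neighbours`, and `parallel_neighbours` are hypothetical names.

```python
from multiprocessing import Pool
from typing import Dict, List

# per-process state, filled in once by init_worker when a worker starts
_shared_lookup: Dict[str, List[str]] = {}
_n_results: int = 0


def init_worker(lookup: Dict[str, List[str]], n_results: int) -> None:
    """Run once in every worker process to install the shared objects."""
    global _shared_lookup, _n_results
    _shared_lookup = lookup
    _n_results = n_results


def first_neighbours(key: str) -> List[str]:
    """The task itself: only the small `key` travels with each task."""
    return _shared_lookup.get(key, [])[:_n_results]


def parallel_neighbours(
    keys: List[str],
    lookup: Dict[str, List[str]],
    n_results: int,
    processes: int = 2,
) -> Dict[str, List[str]]:
    """Map `first_neighbours` over `keys` with a pool of worker processes."""
    with Pool(
        processes=processes,
        initializer=init_worker,
        initargs=(lookup, n_results),  # delivered once per worker, not per task
    ) as pool:
        # imap preserves input order, so results line up with keys
        results = list(pool.imap(first_neighbours, keys, chunksize=64))
    return dict(zip(keys, results))
```

In a script, `parallel_neighbours(...)` should be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely on platforms that spawn new interpreters.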

In [18]:
def wrapper_get_proximal_isbns(isbn: str) -> List[str]:
    return get_proximal_isbns(
        isbn=isbn,
        n_results=wrapper_get_proximal_isbns.n_results,
        book_to_users=wrapper_get_proximal_isbns.book_to_users, 
        user_to_books=wrapper_get_proximal_isbns.user_to_books, 
        book_to_author=wrapper_get_proximal_isbns.book_to_author, 
        author_to_books=wrapper_get_proximal_isbns.author_to_books,
    )

def init_worker(function: Callable) -> None:
    """
    Adds the global variables that are needed for the "get_proximal_isbns"
    function to the wrapper function object that will be pickled and sent to
    the worker.
    """
    function.n_results = 100
    function.book_to_users = book_to_users
    function.user_to_books = user_to_books
    function.book_to_author = book_to_author
    function.author_to_books = author_to_books

isbns: List[str] = list(context_isbns)
with Pool(initializer=init_worker, initargs=(wrapper_get_proximal_isbns,)) as p:
    proximal_isbns: List[List[str]] = list(
        tqdm(
            p.imap(wrapper_get_proximal_isbns, isbns), 
            total=len(isbns)
        )
    )

isbn_to_proximal_isbns: Dict[str, List[str]] = dict(zip(isbns, proximal_isbns))

  0%|          | 0/170978 [00:00<?, ?it/s]

In [19]:
with open(output_fpath, "w") as f:
    json.dump(isbn_to_proximal_isbns, f)