[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1PIMtkozBnfeEvCuAQcHyTCQr7mYF3dLg#scrollTo=7kdhvf0aN1x-)

This notebook shows how embeddings were obtained for the descriptions of the 2,384,197 books in the "Books" category of the [Amazon Review dataset](http://deepyeti.ucsd.edu/jianmo/amazon/index.html) (Jianmo Ni, Jiacheng Li, Julian McAuley, Empirical Methods in Natural Language Processing (EMNLP), 2019).

In [3]:
# fastai is (almost) all you need

!pip install -Uqq fastai

In [4]:
import os
import json
import gzip
import torch
import pandas as pd
import numpy as np

from urllib.request import urlopen
from tqdm.autonotebook import trange
from fastai.text.all import *

# Get and parse data

In [5]:
# Fill in the request form before downloading the data: https://forms.gle/A8hBfPxKkKGFCP238

!wget http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_Books.json.gz

--2021-07-27 23:48:19--  http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_Books.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1219104464 (1.1G) [application/octet-stream]
Saving to: ‘meta_Books.json.gz’


2021-07-27 23:48:49 (38.8 MB/s) - ‘meta_Books.json.gz’ saved [1219104464/1219104464]



In [6]:
# Get metadata for one book.

with gzip.open('/content/meta_Books.json.gz') as g:
  for l in g:
    print (json.loads(l))
    break

{'category': [], 'tech1': '', 'description': ['It is a biology book with God&apos;s perspective.'], 'fit': '', 'title': 'Biology Gods Living Creation Third Edition 10 (A Beka Book Science Series)', 'also_buy': ['0669009075', 'B000K2P5SA', 'B00MD4G2N0', 'B000ASIPTK', '0130508470', '1892427524', '0321567919', 'B000BJBH20', '0547484631', 'B000HAJTQO', 'B000AUCX7I', '0130365645', 'B000BI1Y2O', '0395976715', '052817729X', '1579246443', 'B001CK63XK', '1591669847', '0395879884', '836585161X', 'B01J2F9BH6', 'B00KYEHR4E', '158008141X', '1857928393', '0927545829', 'B015AR0RA0', 'B000TVHHRE', '0865167990', '1579246052', 'B003NXXVD4', 'B000OH6AX0', '061802087X', 'B000NU2X02', '0743252012'], 'tech2': '', 'brand': 'Keith Graham', 'feature': [], 'rank': '1,349,781 in Books (', 'also_view': ['0019777701', 'B000AUCX7I', 'B000K2P5SA', 'B001CK63XK', 'B01J2F9BH6', 'B000BI1Y2O', '1932012540', 'B0095ZCRCK'], 'main_cat': 'Books', 'similar_item': '', 'date': '', 'price': '$39.94', 'asin': '0000092878', 'image

In [7]:
def parse(path):
  """
  Read the gzipped JSON-lines file and yield one parsed record per line.
  """
  with gzip.open(path) as g:
    for l in g:
      yield json.loads(l)

def getDF(path, max_items=None):
  """
  Collect the title and first description of up to `max_items` records
  (all records if `max_items` is None) into a pandas DataFrame.
  """
  df = {}
  for i, d in enumerate(parse(path)):
    if max_items is not None and i == max_items:
      break
    df[i] = {}
    # A missing title, or a missing or empty description list, keeps the row
    # but leaves the field as NaN.
    try:
      df[i]['title'] = d['title']
    except KeyError:
      pass
    try:
      df[i]['description'] = d['description'][0]
    except (KeyError, IndexError):
      pass
  return pd.DataFrame.from_dict(df, orient='index', columns=['title', 'description'])
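
To make the parsing step concrete, here is a minimal, library-free sketch of the same idea on a tiny hand-made file. The two records and the helper name `first_descriptions` are made up for illustration; only the one-JSON-object-per-line layout and the `title`/`description` fields come from the dataset.

```python
import gzip
import json
import os
import tempfile

# Two hypothetical records in the same JSON-lines layout as meta_Books.json.gz
records = [
    {"title": "Book A", "description": ["A short description."]},
    {"title": "Book B", "description": []},  # empty list -> no description
]

def first_descriptions(path):
    """Yield (title, first description or None) for every line of the file."""
    with gzip.open(path, "rt", encoding="utf-8") as g:
        for line in g:
            d = json.loads(line)
            descriptions = d.get("description") or []
            yield d.get("title"), (descriptions[0] if descriptions else None)

# Write the sample file, parse it back, then let the temp directory clean up
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.json.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as g:
        for r in records:
            g.write(json.dumps(r) + "\n")
    parsed = list(first_descriptions(sample_path))

print(parsed)
```

Records that come back with `None` here play the role of the rows with a missing description in the DataFrame.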

In [8]:
df = getDF('/content/meta_Books.json.gz', max_items=None)

In [9]:
df

Unnamed: 0,title,description
0,Biology Gods Living Creation Third Edition 10 (A Beka Book Science Series),It is a biology book with God&apos;s perspective.
1,Mksap 16 Audio Companion: Medical Knowledge Self-Assessment Program,
2,"Flex! Discography of North American Punk, Hardcore, and Powerpop 1975-1985 A-M","Discography of American Punk, Hardcore, and Power Pop"
3,Heavenly Highway Hymns: Shaped-Note Hymnal,This is a collection of classic gospel hymns that many churches still enjoy singing today.
4,Georgina Goodman Nelson Womens Size 8.5 Purple Regular Suede Platforms Shoes,
...,...,...
2934944,Made Men: A Thriller (Law of Retaliation Book 2) - Kindle edition,
2934945,Raptor&#39;s Desire (A Planet Desire novelette) - Kindle edition,
2934946,"LG K4 Case,LG Optimus Zone 3 Case,LG Spree Case,LG Rebel LTE Case,Veggzy[Kickstand]Slim Heavy Duty Shock Absorption Rugged Armor Dual Layer Hybird High Impact Resistant Defender Protective Hard Cover",
2934947,Magickal Incantations,


In [10]:
print(df.isna().sum())
df[df['description'].isna()]

title               0
description    550752
dtype: int64


Unnamed: 0,title,description
1,Mksap 16 Audio Companion: Medical Knowledge Self-Assessment Program,
4,Georgina Goodman Nelson Womens Size 8.5 Purple Regular Suede Platforms Shoes,
9,LJ Classique Interchangeable Ladies Gift Set Watches,
10,Classic Soul Winner's New Testament Bible,
13,"The New England Historical and Genealogical Register, Vol 155, April 2001 (618)",
...,...,...
2934944,Made Men: A Thriller (Law of Retaliation Book 2) - Kindle edition,
2934945,Raptor&#39;s Desire (A Planet Desire novelette) - Kindle edition,
2934946,"LG K4 Case,LG Optimus Zone 3 Case,LG Spree Case,LG Rebel LTE Case,Veggzy[Kickstand]Slim Heavy Duty Shock Absorption Rugged Armor Dual Layer Hybird High Impact Resistant Defender Protective Hard Cover",
2934947,Magickal Incantations,


In [11]:
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,title,description
0,Biology Gods Living Creation Third Edition 10 (A Beka Book Science Series),It is a biology book with God&apos;s perspective.
1,"Flex! Discography of North American Punk, Hardcore, and Powerpop 1975-1985 A-M","Discography of American Punk, Hardcore, and Power Pop"
2,Heavenly Highway Hymns: Shaped-Note Hymnal,This is a collection of classic gospel hymns that many churches still enjoy singing today.
3,"Principles of Analgesic Use in the Treatment of Acute Pain and Cancer Pain (APS, Principles of Analgesic Use in the Treatment of Acute Pain and Cancer Pain)",Brand new; never used.
4,MKSAP 15 Audio Companion,Flash cards used with accompany MKSAP 15 audio companion. \nExtremely useful for board exam.
...,...,...
2384192,The Walking Dead The Official Magazine Issue # 17 Newsstand Cover,The OFFICIAL THE WALKING DEAD MAGAZINE #17. 100 PAGE ISSUE!!! Feature Interviews with Cast and Crew plus TONS MORE Loaded with Full Color Photos front to back.Comes with a special insert called The Walking Dead Collector's Models.
2384193,Busoni : Konzertstuck fur Klavier mit Orchester,"Busoni, Ferruccio : Konzertstck fr Klavier mit Orchester. Op. 31a. Fr zwei Klaviere zu vier Hnden. (2. Klavier an Stelle des Orchesters). Piano Music - 2-Piano Scores This is an Eastman Scores Publishing professional reprint of the work originally published by: Breitkopf & Hartel, Leipzig, 1892, 2 scores, 38 pp. Sheet Music Eastman Scores Publishing Library Commerce ISMN : 979-0-087-00129-8"
2384194,Entertainment Weekly Magazine July 1 2016 | Star Wars Rogue One,New Entertainment Weekly Magazine July 1 2016 | Star Wars Rogue One
2384195,PLAYBOY MAGAZINE JULY/AUGUST 2016,"this product was never view, only taken out of plastic cover."


# Fine-tune Language Model

## Create DataLoaders and Learner

The random number generator is seeded so that the same samples are drawn on every run. Sampling was necessary because working with all of the data caused memory issues that could not be avoided.

In [12]:
rand_num_gen = np.random.default_rng(42)
random_indices = rand_num_gen.integers(0, len(df), size=400_000)
sampled_df = df.iloc[list(set(random_indices))]
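
One detail worth noting: `integers` draws with replacement, so deduplicating through `set` leaves fewer than 400,000 rows. The sketch below estimates how many unique rows to expect, using the standard formula for the expected number of distinct values. The function name is ours, and the without-replacement alternative at the end is a suggestion, not what the notebook ran.

```python
import math
import random

def expected_unique(population, draws):
    """Expected number of distinct items after `draws` uniform draws with replacement."""
    # population * (1 - (1 - 1/population) ** draws), written with log1p/expm1
    # to keep precision when 1/population is tiny
    return population * -math.expm1(draws * math.log1p(-1.0 / population))

n_books, n_draws = 2_384_197, 400_000
estimate = expected_unique(n_books, n_draws)
print(round(estimate))  # roughly 368,000 unique rows

# Hypothetical alternative: sampling without replacement returns exactly n_draws indices
rng = random.Random(42)
unique_indices = rng.sample(range(n_books), n_draws)
print(len(unique_indices), len(set(unique_indices)))
```

With NumPy, the equivalent of the alternative would be `rand_num_gen.choice(len(df), size=400_000, replace=False)`.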

In [13]:
dls = TextDataLoaders.from_df(sampled_df, text_col=1, is_lm=True)
dls.show_batch(max_n=3)



Unnamed: 0,text,text_
0,"xxbos xxmaj quick xxmaj overview \n\n xxmaj originally released as a devotional , this volume has been reprinted in the new 4-color , xxup xxunk format . xxmaj xxunk : el xxmaj seor viene brings xxmaj ellen xxmaj white 's statements on last - day events together to form an inspired preview of the closing events of earth 's history xxbos xxup aj is a believer that magic is just science we","xxmaj quick xxmaj overview \n\n xxmaj originally released as a devotional , this volume has been reprinted in the new 4-color , xxup xxunk format . xxmaj xxunk : el xxmaj seor viene brings xxmaj ellen xxmaj white 's statements on last - day events together to form an inspired preview of the closing events of earth 's history xxbos xxup aj is a believer that magic is just science we have"
1,"… this is the best by far ! "" -- xxmaj pamela xxup a. xxmaj mitchell , xxmaj pilot , xxmaj international xxmaj society of xxmaj women xxmaj airline xxmaj pilots xxbos xxbos xxmaj author xxmaj mike xxmaj butler has included some 200 of these photographs from the xxmaj library of xxmaj congress collection . xxmaj butler 's other books for xxmaj arcadia 's xxmaj images of xxmaj america series include xxmaj","this is the best by far ! "" -- xxmaj pamela xxup a. xxmaj mitchell , xxmaj pilot , xxmaj international xxmaj society of xxmaj women xxmaj airline xxmaj pilots xxbos xxbos xxmaj author xxmaj mike xxmaj butler has included some 200 of these photographs from the xxmaj library of xxmaj congress collection . xxmaj butler 's other books for xxmaj arcadia 's xxmaj images of xxmaj america series include xxmaj around"
2,"loves puppies and babysitting and is very creative . xxmaj she may have a disability , but she has a special ability to bring joy to everyone around her . xxmaj tara was at a birthday party at xxmaj allison 's house and looked at the artwork on her wall and could n't believe the drawings she saw . xxmaj they were exactly what she needed for her book . xxmaj tara","puppies and babysitting and is very creative . xxmaj she may have a disability , but she has a special ability to bring joy to everyone around her . xxmaj tara was at a birthday party at xxmaj allison 's house and looked at the artwork on her wall and could n't believe the drawings she saw . xxmaj they were exactly what she needed for her book . xxmaj tara had"


In [14]:
learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()]).to_fp16()

In [15]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.680586,3.547758,0.378939,34.735355,32:55


In [16]:
learn.unfreeze()
learn.fit_one_cycle(1, 2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.263917,3.257999,0.415339,25.997469,35:11
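
The perplexity column is the exponential of the validation loss, which gives a quick sanity check on the two result tables above. A minimal sketch using the reported numbers (the run labels are ours):

```python
import math

# (valid_loss, reported perplexity) from the two training runs above
runs = {
    "frozen": (3.547758, 34.735355),
    "unfrozen": (3.257999, 25.997469),
}

for name, (valid_loss, reported) in runs.items():
    perplexity = math.exp(valid_loss)
    print(f"{name}: exp({valid_loss}) = {perplexity:.4f} (reported {reported})")
```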


## Test tokenizer and numericalization

We verify that the tokenizer and the numericalizer work well by testing them with one sample from the DataFrame.

In [19]:
sample_description = df.iloc[0]['description']

In [26]:
tok_sample = dls.tokenizer(sample_description)
num_sample = dls.numericalize(tok_sample)
print(f"{tok_sample}\n{num_sample}")

['xxbos', 'xxmaj', 'it', 'is', 'a', 'biology', 'book', 'with', 'xxmaj', 'god', "'s", 'perspective', '.']
TensorText([   2,    8,   39,   22,   14, 1844,   31,   25,    8,  185,   26,  766,
          11])
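
Numericalization is essentially a vocabulary lookup. The toy sketch below rebuilds the mapping from the token/index pairs printed above and falls back to index 0 for unseen tokens, assuming the default fastai special-token order in which `xxunk` comes first. The real vocabulary is far larger, and `toy_numericalize` is a made-up helper.

```python
# Token -> index pairs taken from the output above
tokens = ['xxbos', 'xxmaj', 'it', 'is', 'a', 'biology', 'book', 'with',
          'xxmaj', 'god', "'s", 'perspective', '.']
ids = [2, 8, 39, 22, 14, 1844, 31, 25, 8, 185, 26, 766, 11]
vocab = dict(zip(tokens, ids))

UNK_ID = 0  # assumption: xxunk sits at index 0, as in fastai's default vocab

def toy_numericalize(toks):
    """Map each token to its index, using UNK_ID for out-of-vocabulary tokens."""
    return [vocab.get(t, UNK_ID) for t in toks]

print(toy_numericalize(tokens))
print(toy_numericalize(['xxbos', 'a', 'chemistry', 'book']))
```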


# Get embeddings for all the book descriptions 

After fine-tuning the language model, we can pass every book description through its encoder to obtain an embedding for each one.

The first attempt to obtain the embeddings was a naive for-loop that processed about 250 samples per minute. At that rate, embedding all 2,384,197 descriptions would take ~9,500 minutes (about 159 hours).

In [None]:
# Super slow
# learn.model.eval()
# embds = np.empty((len(df), 400))
# i = 0

# for description in df['description']:
#   learn.model.reset()
#   num_description = get_tokenized_description(dls, description)
#   embeddings = learn.model[0](num_description[None])[0]
#   mean_embeddings = embeddings.mean(0)
#   embds[i,:] = mean_embeddings.cpu().data.numpy()
#   i += 1
#   print(i)
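
As a back-of-the-envelope check of that estimate, the arithmetic can be written out. The throughput figure is the one quoted above; the variable names are only for illustration.

```python
n_descriptions = 2_384_197   # descriptions left after dropping missing values
samples_per_minute = 250     # observed speed of the naive loop

minutes = n_descriptions / samples_per_minute
hours = minutes / 60
print(f"{minutes:,.0f} minutes, or about {hours:.0f} hours")
```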

We now define the helper functions used by the main `get_numpy_embeddings` function, which batches the samples and runs them on the GPU to speed up the process. Getting the 2,384,197 embeddings took ~3 hours on a JarvisCloud instance (roughly 50x faster than the naive approach).

In [27]:
def get_numericalized_description(dls, sentences):
  """
  Tokenize and numericalize every sentence in sentences, and obtain the length 
  of the longest numericalized sentence. 
  """
  num_sentences = []
  lengths = []
  for sentence in sentences:
    tok_description = dls.tokenizer(sentence)
    num_description = dls.numericalize(tok_description)
    num_sentences.append(num_description)
    lengths.append(len(num_description))
  return num_sentences, max(lengths)

In [28]:
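def batch_to_device(batch, target_device):
    """
    Send every tensor in a PyTorch batch to a device (CPU/GPU).
    """
    # Tensor.to() returns a new tensor instead of moving it in place,
    # so the moved tensors have to be collected and returned.
    return [element.to(target_device) if isinstance(element, Tensor) else element
            for element in batch]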

In [29]:
def pad_to_max_length(batch, max_length):
  """
  Right-pad every tensor in batch so that all of them have a length equal to
  max_length.
  """
  tensors = []
  for tensor in batch:
    # value=1 is the index of the padding token (xxpad) in the default fastai vocab
    tensor = torch.nn.functional.pad(tensor, (0, max_length-len(tensor)), value=1)
    tensors.append(tensor)
  return tensors
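
The same right-padding idea can be shown without tensors. This is a plain-Python sketch, not the function used in the notebook; `pad_lists` is a hypothetical helper, and the pad index 1 mirrors the `value=1` passed above.

```python
PAD_ID = 1  # same fill value as in pad_to_max_length

def pad_lists(batch, pad_id=PAD_ID):
    """Right-pad every list of token ids to the length of the longest one."""
    max_length = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_length - len(seq)) for seq in batch]

batch = [[2, 8, 39, 22], [2, 14], [2, 8, 185]]
padded = pad_lists(batch)
for row in padded:
    print(row)
```

Once every row has the same length, the batch can be stacked into a single rectangular tensor, which is what `torch.stack` requires.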

In [30]:
def get_numpy_embeddings(df, model, dls, batch_size=32, device='cuda'):
    """
    Get embeddings for each description in df by passing them through the 
    encoder of model and averaging the outputs over the sequence dimension.
    """
    # Use the model passed as an argument rather than the global learner
    model.eval()
    # 400 is the size of the encoder output for each token
    embds = np.empty((len(df), 400))
    for start_index in trange(0, len(df), batch_size):
        # Slices are clamped to the end of the data, so the last (shorter)
        # batch needs no special case
        end_index = start_index + batch_size
        sentences_batch = df['description'][start_index:end_index]
        features, max_length = get_numericalized_description(dls, sentences_batch)
        features = batch_to_device(features, device)
        features = pad_to_max_length(features, max_length)
        features = torch.stack(features)
        model.reset()
        # Gradients are not needed for inference
        with torch.no_grad():
            embeddings = model[0](features)
        mean_embeddings = embeddings.mean(1)
        embds[start_index:end_index,:] = mean_embeddings.cpu().data.numpy()
    return embds
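
The batching logic boils down to stepping through the data in strides of `batch_size`. The short sketch below (the helper name is ours) shows how the final, shorter batch is handled and that the number of batches is the ceiling of the dataset size divided by the batch size.

```python
import math

def batch_bounds(n_items, batch_size):
    """Yield (start, end) index pairs covering n_items in order."""
    for start in range(0, n_items, batch_size):
        # min() makes the clamped end explicit; a plain slice would clamp the same way
        yield start, min(start + batch_size, n_items)

bounds = list(batch_bounds(n_items=7, batch_size=3))
print(bounds)  # [(0, 3), (3, 6), (6, 7)]

# Number of batches needed for 2,384,197 descriptions with batch_size=50
n_batches = math.ceil(2_384_197 / 50)
print(n_batches)  # 47684
```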

In [None]:
embds = get_numpy_embeddings(df[:2_600_000], learn.model, dls, batch_size=50, device='cuda')


# Save components for the application

The components needed to convert an input string into an embedding are saved so that the web service behind the Streamlit app can load them.

In [31]:
torch.save(learn, 'fastai_learner.pt')

In [32]:
torch.save(learn.model[0], 'fastai_model.pt')

In [33]:
torch.save(dls.tokenizer, 'fastai_tokenizer.pt')

In [34]:
torch.save(dls.numericalize, 'fastai_numericalize.pt')

In [None]:
np.save('fastai_embeddings', embds)

# Test API endpoint

We can test the API endpoint that is used in the Streamlit app by sending a request just as a user would with the user interface. 

In [37]:
import requests
import json
import pandas as pd

input_json = json.dumps({"search_string": "book about magic", "ID": "111"})

# Call the endpoint
endpoint = 'http://50a9efd4-bbc1-4e53-b376-f44aac367846.southcentralus.azurecontainer.io/score'
headers = {'Content-Type': 'application/json'}

resp = requests.post(endpoint, data=input_json, headers=headers)
print(resp)
results = resp.text
print(results)

# The "response" field is itself a JSON string holding the result table.
# A separate name avoids overwriting the descriptions DataFrame `df`.
data = json.loads(results)
results_df = pd.read_json(data['response'])
print(results_df)

<Response [200]>
{"search_string": "book about magic", "ID": "111", "response": "{\"title\":{\"1061880\":\"The Chop Cup Book\",\"639523\":\"The Brothers' War - Artifacts Cycle - Book I\",\"2655344\":\"Lou Tannen's No. 11 Catalog of Magic\",\"2718869\":\"The Magic Menu: The First Five Years\",\"441178\":\"El libro mgico \\/ The Magic Book (El Chavo: 8x8) (Spanish Edition)\",\"2626821\":\"In a class by himself: The legacy of Don Alan\",\"901327\":\"The Structure of Magic II: A Book About Communication and Change\",\"2667398\":\"Steve Beam's Semi-Automatic Card Tricks Volume One\",\"683926\":\"Circle of Magic - books one and Two: Water &amp; Fire\",\"100622\":\"Magic Spell\"},\"description\":{\"1061880\":\"magic book\",\"639523\":\"Book based on Magic the Gathering.\",\"2655344\":\"Catalog of Magic\",\"2718869\":\"how to magic\",\"441178\":\"El libro mgico \\/ The Magic Book\",\"2626821\":\"Don Alan magic book\",\"901327\":\"This book is about magic it show's you the way of doing things.\
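
The `response` field comes back as a JSON string nested inside the outer JSON, laid out column by column and keyed by row index. Below is a small sketch of unpacking it without pandas, using a shortened payload built from the first two results shown above. The trimmed payload and the variable names are for illustration only.

```python
import json

# Shortened stand-in for resp.text, keeping two of the returned rows
inner = {
    "title": {"1061880": "The Chop Cup Book",
              "639523": "The Brothers' War - Artifacts Cycle - Book I"},
    "description": {"1061880": "magic book",
                    "639523": "Book based on Magic the Gathering."},
}
payload = json.dumps({"search_string": "book about magic", "ID": "111",
                      "response": json.dumps(inner)})

outer = json.loads(payload)              # first decode: the envelope
table = json.loads(outer["response"])    # second decode: the result table

# Turn the column-oriented dict into one record per book
rows = [
    {"index": idx, "title": title, "description": table["description"][idx]}
    for idx, title in table["title"].items()
]
for row in rows:
    print(row)
```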