# Local generation of embeddings

## Setup

In [1]:
!pwd
!pip install -U pip
!pip install -r ../../requirements.txt 

/home/ec2-user/SageMaker/MUSE-sagemaker-development/notebooks/local-notebook
Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages (20.1.1)


In [2]:
from absl import logging
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text
import numpy as np
import pandas as pd
import dask.dataframe as dd
import os
import shutil
import seaborn as sns
import matplotlib.pyplot as plt
from typing import Optional, List
import itertools
import csv

In [3]:
print(tf.__version__)

2.2.0


In [4]:
has_gpu = bool(tf.config.list_physical_devices('GPU'))  # the list is already filtered to GPUs
print('GPU Available: ', has_gpu)

GPU Available:  True


In [6]:
MUSE_VERSION = 2
MUSE_BASE_URL = f"https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/{MUSE_VERSION}"
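# The literal backslash stops the shell from globbing on '?' when the URL is passed to curl below.
muse_url = f"{MUSE_BASE_URL}\\?tf-hub-format=compressed"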
cleaned_data_dir = '../../data/book-depository/cleaned'
inferred_data_dir  = '../../data/book-depository/local-inferred'
LOCAL_MUSE_PATH = f"../../models/MUSE/large/{MUSE_VERSION:0>6d}"
print(f'MUSE will be loaded from {LOCAL_MUSE_PATH}')
print(f'Cleaned data is at {cleaned_data_dir}')
print(f'Inference results will be saved at {inferred_data_dir}')

MUSE will be loaded from ../../models/MUSE/large/000002
Cleaned data is at ../../data/book-depository/cleaned
Inference results will be saved at ../../data/book-depository/local-inferred


In [None]:
!rm -rf {LOCAL_MUSE_PATH}
!mkdir -p {LOCAL_MUSE_PATH}
!curl -L {muse_url} | tar -zxvC {LOCAL_MUSE_PATH}
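
The shell cell above can also be expressed in plain Python, which avoids shell escaping of the URL. The following is a minimal sketch, not part of the original notebook: `download_and_extract` is a hypothetical helper name, and it assumes the URL serves a gzipped tarball, as the `tf-hub-format=compressed` endpoint is expected to.

```python
import os
import shutil
import tarfile
import urllib.request
from typing import List


def download_and_extract(url: str, target_dir: str) -> List[str]:
    """Replace target_dir with the contents of the gzipped tarball at url."""
    shutil.rmtree(target_dir, ignore_errors=True)  # same effect as `rm -rf`
    os.makedirs(target_dir)                        # same effect as `mkdir -p`
    # urlopen follows HTTP redirects by default, like `curl -L`.
    with urllib.request.urlopen(url) as response:
        # 'r|gz' reads the archive as a stream, so no temporary file is needed.
        # Only extract archives that come from a source you trust.
        with tarfile.open(fileobj=response, mode="r|gz") as archive:
            archive.extractall(target_dir)
    return sorted(os.listdir(target_dir))
```

Under these assumptions, a call such as `download_and_extract(f"{MUSE_BASE_URL}?tf-hub-format=compressed", LOCAL_MUSE_PATH)` would stand in for the three shell commands. The URL needs no backslash because it never passes through a shell.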

## Local transformation

Here we'll build a simple notebook that can be run to generate the embeddings for the raw input data.

Because Dask executes lazily, the read operation below returns immediately without loading any data. We have also limited the description length to avoid

In [45]:
dataset = dd.read_csv(
   f'{cleaned_data_dir}/dataset-*.csv', header=0
)

MUSE large consumes a lot of GPU memory. If the model is already loaded in another notebook, the next cell will probably fail. Ideally, shut down the kernel of the other notebook first. If the cell below has already failed, restart this notebook's kernel to release any partially loaded model that may still be holding memory, then rerun the cells above.

In [8]:
logging.set_verbosity(logging.ERROR)
model = hub.load(LOCAL_MUSE_PATH)

def embed(texts):
    """Run MUSE on a batch of strings."""
    return model(texts)

The next cell defines an operation, mapped over each partition (a Pandas DataFrame), that generates the embeddings one chunk at a time. The memory taken by the model limits how much it can process in parallel, so the chunk size has to be relatively small. The combined length of the texts in a chunk is also a factor.

In [46]:
def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk

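def embed_description(df: pd.DataFrame, chunk_size=10):
    # Embed the descriptions chunk_size at a time to keep GPU memory use bounded.
    embeddings = [
        embedding.numpy().tolist()
        for chunk in grouper(chunk_size, df.description)
        for embedding in embed(chunk)['outputs']
    ]
    # Reuse the partition's index so the result aligns with df on assignment.
    return pd.Series(embeddings, index=df.index, name='embedding')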
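
Because the combined text length of a chunk also matters, a fixed item count can still run out of memory when several long descriptions land in the same chunk. The following is a minimal sketch of one alternative, which is not part of the original notebook: it caps each chunk by both item count and total characters. The name `budgeted_chunks`, the `max_chars` parameter, and its default value are illustrative assumptions, and character count is only a rough proxy for the memory the model actually needs.

```python
from typing import Iterable, Iterator, List


def budgeted_chunks(texts: Iterable[str], max_items: int = 10,
                    max_chars: int = 5000) -> Iterator[List[str]]:
    """Yield chunks of at most max_items texts and, where possible, max_chars characters.

    A single text longer than max_chars is still yielded, alone in its own chunk.
    The default max_chars is an arbitrary placeholder, not a tested value.
    """
    chunk: List[str] = []
    chunk_chars = 0
    for text in texts:
        too_many = len(chunk) >= max_items
        too_long = bool(chunk) and chunk_chars + len(text) > max_chars
        if too_many or too_long:
            # Close the current chunk before it exceeds either limit.
            yield chunk
            chunk, chunk_chars = [], 0
        chunk.append(text)
        chunk_chars += len(text)
    if chunk:
        yield chunk
```

Each yielded list could then be passed to `embed` in place of the tuples produced by `grouper`, with `max_chars` tuned empirically against the available GPU memory.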

In [47]:
# Each element is a Python list of floats, so the column dtype is object rather than float32.
dataset['embeddings'] = dataset.map_partitions(embed_description, meta=pd.Series(name='embedding', dtype='object'))

In [48]:
dataset.head()

| Unnamed: 0 | authors | categories | description | lang | title | n_authors | n_categories | descr_len_words | detected_lang | embeddings |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [3] | [360, 2632] | A fake marriage is the last thing he wants . .... | en | The Mercenary : Order of the Broken Blade | 1 | 2 | 17 | en | [0.02523178420960903, 0.061265941709280014, -0... |
| 1 | [4] | [1703, 2771, 2818, 3097] | Once the dust has settled, you'll need to know... | en | 100 Skills You'll Need for the End of the Worl... | 1 | 4 | 97 | en | [-0.02368723228573799, -0.09728001058101654, 0... |
| 2 | [5] | [819, 3364, 1853, 2977] | The Daily Mail and the Spectator Book of the Y... | en | How to Land a Plane | 1 | 4 | 11 | en | [-0.05312519147992134, 0.024528754875063896, 0... |
| 3 | [6] | [1694, 1703, 2818] | Easy, do-able, down to earth ideas and suggest... | en | The Sustainable(ish) Living Guide : Everything... | 1 | 3 | 15 | en | [0.016959479078650475, -0.047762680798769, 0.0... |
| 4 | [7] | [1843, 2967, 2969] | Mini celebrates 60 amazing years of this iconi... | en | Mini : 60 Years | 1 | 3 | 20 | en | [-0.08251623064279556, -0.05317999795079231, -... |


The next cell takes a naive approach: it generates all embeddings almost sequentially on the local instance, then saves the results as two sets of CSV files, one holding the embeddings and one holding the remaining columns. The chunk size of 10 set above allows micro-batching; in testing it could not be increased unless every description is short (one or two sentences).

In [49]:
%%time

shutil.rmtree(inferred_data_dir, ignore_errors=True)
os.makedirs(inferred_data_dir)

# 1% sample for quick trial runs; it is not used below, where the full dataset is written.
small_data = dataset.sample(frac=0.01)
dataset.embeddings.to_csv(f'{inferred_data_dir}/embeddings-*.csv', compute=True, index=True, header=True, quoting=csv.QUOTE_NONNUMERIC)
dataset.drop('embeddings', axis=1).to_csv(f'{inferred_data_dir}/data-*.csv', compute=True, index=True, quoting=csv.QUOTE_NONNUMERIC)

CPU times: user 4h 4min 13s, sys: 48min 52s, total: 4h 53min 6s
Wall time: 1h 43min 46s
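
Each embedding is written to CSV as the string form of a Python list, so it has to be parsed when the files are read back. The following is a minimal sketch of that step, not part of the original notebook. It assumes each `embeddings-*.csv` file has a header row followed by rows of an unquoted integer index and a quoted list such as `"[0.1, -0.2]"`; `parse_embeddings` is a hypothetical helper name, and the rows in the demo are toy values rather than real model output.

```python
import csv
import io
import json
from typing import Dict, List, TextIO


def parse_embeddings(csv_file: TextIO) -> Dict[int, List[float]]:
    """Map each row index to its embedding, parsed from the stringified list."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    # A list of floats printed by Python is also valid JSON, so json.loads can parse it.
    return {int(index): json.loads(vector) for index, vector in reader}


# Toy input in the assumed layout (values are made up for illustration).
toy_file = io.StringIO('"","embeddings"\n0,"[0.1, -0.2]"\n1,"[0.3, 0.4]"\n')
toy_embeddings = parse_embeddings(toy_file)
```

Row indices may repeat across files, because partitions created by `dd.read_csv` typically each start their index at 0, so it is safer to parse each file separately than to merge them by index.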
