# NLU based item search
Using a pretrained BERT model and Elasticsearch KNN to search for textually similar items



# Contents
1. Background
2. Setup
3. Data Preparation 
4. SageMaker Model Hosting
5. Build a KNN Index in Elasticsearch
6. Evaluate Index Search Results
7. Extensions

## Background
In this notebook, we'll build the core components of a textually similar items search. Sometimes people don't know exactly what they are looking for. In that case they can simply type a description of an item, and the search retrieves similar items.

One of the core components of a textually similar items search is a fixed-length sentence or word embedding, i.e. a “feature vector” that corresponds to the text. The reference embeddings are typically generated offline and must be stored so that they can be searched efficiently. Such embeddings can be generated with a pretrained language model such as BERT (Bidirectional Encoder Representations from Transformers). In our use case, we use a pretrained BERT model from sentence-transformers (https://github.com/UKPLab/sentence-transformers).

To enable efficient searches for textually similar items, we'll use Amazon SageMaker to generate fixed-length sentence embeddings, i.e. “feature vectors”, and use the KNN algorithm in Amazon Elasticsearch Service. KNN for Amazon Elasticsearch Service 7.7 lets you search for points in a vector space and find the "nearest neighbors" for those points by cosine similarity (the default is Euclidean distance). Use cases include recommendations (for example, an "other songs you might like" feature in a music application), image recognition, and fraud detection.
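
To make the two distance measures concrete, here is a minimal sketch in plain Python. It is not how the Elasticsearch KNN index works internally. The function names and the toy three-dimensional vectors are made up for illustration, and the search is a simple exhaustive scan.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, vectors, k=2):
    """Return the indices of the k vectors most similar to query by cosine similarity."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_similarity(query, vectors[i]),
                    reverse=True)
    return ranked[:k]

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only)
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
toy_query = [2.0, 0.0, 0.0]
top_two = nearest_neighbors(toy_query, toy_vectors, k=2)
```

Cosine similarity ignores vector length, so `toy_query` matches the first vector perfectly even though their Euclidean distance is 1.0. A KNN index exists precisely to avoid comparing the query against every stored vector as this scan does.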

Here are the steps we'll follow to build a textually similar items search: after some initial setup, we'll host the pretrained BERT language model in a SageMaker PyTorch model server. Then we'll generate feature vectors for the descriptions in the books dataset introduced in the Data Preparation section. Those feature vectors will be imported into an Amazon Elasticsearch KNN index. Next, we'll explore some sample text queries and visualize the results.

## Setup

In [None]:
# Install tqdm for progress bars
!pip install tqdm

# Install the packages needed to connect to the Elasticsearch domain
!pip install elasticsearch
!pip install requests
!pip install requests-aws4auth
!pip install "sagemaker<2.0" --force-reinstall

In [None]:
import boto3
import re
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()

s3_resource = boto3.resource("s3")
s3 = boto3.client('s3')

role

In [None]:
cfn = boto3.client('cloudformation')

def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

## Setup variables to use for the rest of the demo
cloudformation_stack_name = "ml-search-stack"

outputs = get_cfn_outputs(cloudformation_stack_name)

bucket = outputs['s3BucketTraining']
es_host = outputs['esHostName']

outputs

## Data Preparation
The dataset consists of books, each with a long description and a thumbnail image.

Downloading the books data. The data originally comes from: https://github.com/dris1995/deep-learning-semantic-search-engine

In [None]:
## Data Preparation

import os 
import shutil
import json
import tqdm
import urllib.request
from tqdm import notebook
from multiprocessing import cpu_count
from tqdm.contrib.concurrent import process_map

images_path = 'data/books/'
filename = 'metadata.json'
my_bucket = s3_resource.Bucket(bucket)
clean_data = []

if not os.path.isdir(images_path):
    os.makedirs(images_path)

def download_metadata(url):
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)
        
#download metadata.json to local notebook
download_metadata('https://raw.githubusercontent.com/dris1995/deep-learning-semantic-search-engine/main/books.json')


def generate_image_list(filename):
    # Read the metadata and keep only records that have both a long description and a thumbnail
    with open(filename, 'r') as metadata:
        data = json.load(metadata)
    url_lst = []
    non_desc_count = 0
    total_count = len(data)
    
    for record in data:
        if 'longDescription' not in record or 'thumbnailUrl' not in record:
            non_desc_count += 1
            continue
        url_lst.append(record['thumbnailUrl'])
        clean_data.append(record)
    print(f'Total count data-set: {total_count}')
    print(f'Total count without description or thumbnail: {non_desc_count}')
    print(f'Total count to go in data-set: {len(url_lst)}')
    return url_lst


def download_image(url):
    # Save the image under images_path, named after the last segment of its URL
    urllib.request.urlretrieve(url, os.path.join(images_path, url.split("/")[-1]))
                    
#generate image list            
url_lst = generate_image_list(filename)     

workers = 2 * cpu_count()

#downloading images to local disk
process_map(download_image, url_lst, max_workers=workers)

In [None]:

# Uploading dataset to S3
files_to_upload = []
dirName = 'data'
for path, subdirs, files in os.walk('./' + dirName):
    path = path.replace("\\","/")
    directory_name = path.replace('./',"")
    for file in files:
        files_to_upload.append({
            "filename": os.path.join(path, file),
            "key": directory_name+'/'+file
        })
        

def upload_to_s3(file):
    my_bucket.upload_file(file['filename'], file['key'])
        
#uploading images to s3
process_map(upload_to_s3, files_to_upload, max_workers=workers)

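## Language Translation Results
The book descriptions in this dataset are already in English, so no translation is needed. We collect each description with its title and S3 image path, and save the results to a JSON file.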

In [None]:
# Build one record per book: S3 image path, long description, and title
def get_descriptions(data):
    results = []
    for record in data:
        trim_name = f's3://{bucket}/data/books/' + record['thumbnailUrl'].split("/")[-1]
        data_to_add = dict(filename=trim_name, descriptions=record['longDescription'], title=record['title'])
        results.append(data_to_add)
    return results

# def get_descriptions(data):
#     results = {}
#     results['filename'] = f's3://{bucket}/data/books/' + data['thumbnailUrl'].split("/")[-1]
#     results['descriptions'] = []
#     for i in data['longDescription']:
#         results['descriptions'].append(data[i]['longDescription'])
#     print(results)   
#     return results

In [None]:
results = get_descriptions(clean_data)
results

In [None]:
# Save the descriptions in JSON format in case you need them later
with open('books-clean-data.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, ensure_ascii=False, indent=4)
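
If the saved file is reused in a later session, it may help to reload it and confirm that every record still carries the three fields written above. The following is a minimal sketch of such a check. `load_records` and `REQUIRED_KEYS` are hypothetical names, and the demonstration uses a made-up record in a temporary file rather than the real data.

```python
import json
import os
import tempfile

# Keys written by get_descriptions for every record
REQUIRED_KEYS = {"filename", "descriptions", "title"}

def load_records(path):
    """Load the saved records and fail early if any record lacks an expected key."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    for position, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise KeyError(f"record {position} is missing {sorted(missing)}")
    return records

# Demonstration with one hypothetical record written to a temporary file
sample = [{"filename": "s3://example-bucket/data/books/cover.jpg",
           "descriptions": "An example description.",
           "title": "Example Title"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "books-sample.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=4)
    loaded = load_records(sample_path)
```

In the notebook, the same function could be called as `load_records('books-clean-data.json')`.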

## SageMaker Model Hosting
In this section, we will host the pretrained BERT model from sentence-transformers (https://github.com/UKPLab/sentence-transformers) in a SageMaker PyTorch model server to generate fixed-length sentence embeddings with 768 dimensions.

Citation:
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "http://arxiv.org/abs/1908.10084",
}

In [None]:
!pip install sentence-transformers
!pip install sagemaker_containers

In [None]:
import os 
# Save the model to disk; this is the copy we will host on SageMaker
from sentence_transformers import models, SentenceTransformer
saved_model_dir = 'transformer'
if not os.path.isdir(saved_model_dir):
    os.makedirs(saved_model_dir)

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
model.save(saved_model_dir)
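
The model name ends in `mean-tokens`, which indicates that the sentence vector is obtained by averaging the per-token vectors. The following is a rough sketch of that pooling step in plain Python, not the library's actual code. `mean_pool` is a hypothetical helper and the four-dimensional token vectors are made up.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding tokens."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Hypothetical 4-dimensional token vectors; the real model uses 768 dimensions
tokens = [[1.0, 2.0, 3.0, 4.0],
          [3.0, 2.0, 1.0, 0.0],
          [9.0, 9.0, 9.0, 9.0]]   # padding token, masked out below
mask = [1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
```

Because the average has the same length as a single token vector, sentences of any length map to a vector of the same fixed size.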


In [None]:
# Create the SageMaker session (it provides the default bucket for model hosting) and get the execution role
import sagemaker

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
role

In [None]:
# Package the model directory as a .tar.gz archive
import tarfile
export_dir = 'transformer'
with tarfile.open('model.tar.gz', mode='w:gz') as archive:
    archive.add(export_dir, recursive=True)
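
Before uploading, it can be useful to confirm what the archive contains, since the folder layout determines where the model files end up after extraction. A minimal sketch follows. `list_archive_members` is a hypothetical helper, and the demonstration builds a throwaway archive with a made-up file instead of touching the real `model.tar.gz`.

```python
import os
import tarfile
import tempfile

def list_archive_members(archive_path):
    """Return the sorted member names stored in a .tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return sorted(archive.getnames())

# Demonstration on a throwaway directory (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_dir = os.path.join(tmp_dir, "transformer")
    os.makedirs(demo_dir)
    with open(os.path.join(demo_dir, "config.json"), "w") as f:
        f.write("{}")
    demo_archive = os.path.join(tmp_dir, "model.tar.gz")
    with tarfile.open(demo_archive, mode="w:gz") as archive:
        # arcname keeps the top-level folder name, as adding a relative path does
        archive.add(demo_dir, arcname="transformer", recursive=True)
    members = list_archive_members(demo_archive)
```

Here the files sit under a top-level `transformer/` folder, which is also what adding the relative `export_dir` path in the cell above produces.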

In [None]:
#Upload the model to S3
inputs = sagemaker_session.upload_data(path='model.tar.gz', key_prefix='model')
inputs

First we need to create a PyTorchModel object. The deploy() method on the model object creates an endpoint that serves prediction requests in real time. If instance_type is set to a SageMaker instance type (e.g. ml.m5.large), the model is deployed on SageMaker. If instance_type is set to local, the model is deployed locally as a Docker container, ready for local testing.

Before creating the model, we define a RealTimePredictor subclass that accepts text as input and outputs JSON. The default behaviour is to accept a NumPy array.

In [None]:
from sagemaker.pytorch import PyTorch, PyTorchModel
from sagemaker.predictor import RealTimePredictor
from sagemaker import get_execution_role

class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')

In [None]:

pytorch_model = PyTorchModel(model_data = inputs, 
                             role=role, 
                             entry_point ='inference.py',
                             source_dir = './code',
                             py_version = 'py3', 
                             framework_version = '1.5.1',
                             predictor_cls=StringPredictor)

predictor = pytorch_model.deploy(instance_type='ml.m5.large', initial_instance_count=3)
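
The `inference.py` entry point in `./code` is not shown in this notebook. The following is a minimal sketch of what such a script might contain, assuming the standard SageMaker PyTorch serving hooks (`model_fn`, `input_fn`, `predict_fn`, `output_fn`) and assuming the archive keeps the top-level `transformer` folder. The actual script used here may differ.

```python
import json
import os

def model_fn(model_dir):
    """Load the SentenceTransformer saved under model_dir."""
    # Imported lazily so the other handlers can be exercised without the library
    from sentence_transformers import SentenceTransformer
    # Assumption: model.tar.gz extracts to model_dir/transformer
    return SentenceTransformer(os.path.join(model_dir, "transformer"))

def input_fn(request_body, request_content_type):
    """Decode a text/plain request into a Python string."""
    if request_content_type != "text/plain":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    if isinstance(request_body, (bytes, bytearray)):
        return request_body.decode("utf-8")
    return request_body

def predict_fn(input_data, model):
    """Encode one sentence and return its embedding as a list of floats."""
    vector = model.encode([input_data])[0]
    # encode() typically returns NumPy data; convert it so json can serialize it
    return vector.tolist() if hasattr(vector, "tolist") else list(vector)

def output_fn(prediction, accept):
    """Serialize the embedding as a JSON string."""
    return json.dumps(prediction)
```

With handlers along these lines, the `text/plain` payload sent by `StringPredictor` arrives in `input_fn`, and the response body is a JSON array that the client can parse with `json.loads`.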

sentence-transformers uses a pretrained BERT model, so it generates a 768-dimensional embedding for the given text. We will quickly validate this in the next cell.

In [None]:
# Doing a quick test to make sure model is generating the embeddings
import json
import sagemaker_containers
payload = 'When it comes to mobile apps, Android can do almost anything'
features = predictor.predict(payload)
embedding = json.loads(features)

embedding
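
Rather than inspecting the printed values by eye, the dimension can be checked directly. The following is a small sketch. `embedding_dimension` is a hypothetical helper, and it assumes the parsed response is either a flat list of floats or a single embedding wrapped in an outer list, depending on how the inference script serializes its output.

```python
def embedding_dimension(embedding):
    """Return the length of a flat embedding, unwrapping one level of batch nesting."""
    if embedding and isinstance(embedding[0], list):
        if len(embedding) != 1:
            raise ValueError("expected a single embedding, got a batch")
        embedding = embedding[0]
    return len(embedding)

EXPECTED_DIM = 768  # dimension stated for this model in the text above

# Stand-in for the parsed endpoint response (hypothetical values)
fake_embedding = [0.0] * EXPECTED_DIM
dimension = embedding_dimension(fake_embedding)
```

In the notebook, the check would be `assert embedding_dimension(embedding) == EXPECTED_DIM`.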