# Retrieval Augmented Generation (RAG) inference

***This notebook works best with the `conda_python3` kernel on an `ml.t3.large` instance***.

---

At this point our slide deck data has been ingested into an Amazon OpenSearch Service Serverless collection. We are now ready to talk to our slide deck using a large multimodal model. We use [LLaVA 1.5-7b](https://huggingface.co/anymodality/llava-v1.5-7b) for this purpose. LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.

## Step 1. Setup

Install the required Python packages and import the relevant modules.

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

In [None]:
import os
import io
import sys
import json
import glob
import boto3
import codecs
import logging
import requests
import botocore
import jsonlines
import numpy as np
import pandas as pd
import globals as g
from pathlib import Path
from typing import List, Dict
from IPython.display import Image
from urllib.parse import urlparse
from botocore.auth import SigV4Auth
from pandas.core.series import Series
from sagemaker import get_execution_role
from botocore.awsrequest import AWSRequest
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
from sagemaker.huggingface.model import HuggingFaceModel, HuggingFacePredictor
from utils import get_img_desc, download_image_from_url, encode_image_to_base64
from utils import get_cfn_outputs, get_text_embedding, get_llm_response, find_similar_data

In [None]:
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

## Step 2. Create an OpenSearch client and SageMaker Predictor object

We create an OpenSearch client so that we can query the vector database for the embeddings (slides) most similar to the questions we want to ask of our slide deck. We then create a SageMaker [`Predictor`](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html) to run inference with the LLaVA model on the slide retrieved from OpenSearch.

Get the name of the OpenSearch Service Serverless collection endpoint and index name from the CloudFormation stack outputs.
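
As a rough illustration of what a helper like `get_cfn_outputs` might do internally, the sketch below flattens a CloudFormation `describe_stacks` response into a plain dictionary and extracts the host name from an endpoint URL. The function names `parse_stack_outputs` and `endpoint_to_host` and the sample endpoint value are hypothetical; the actual helper in `utils` may be implemented differently. In a live session the response would come from `boto3.client("cloudformation").describe_stacks(StackName=...)`.

```python
from typing import Dict
from urllib.parse import urlparse


def parse_stack_outputs(describe_stacks_response: dict) -> Dict[str, str]:
    """Flatten the Outputs list of the first stack into {OutputKey: OutputValue}."""
    stacks = describe_stacks_response.get("Stacks", [])
    if not stacks:
        return {}
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in stacks[0].get("Outputs", [])
    }


def endpoint_to_host(endpoint_url: str) -> str:
    """Drop the scheme from an endpoint URL and keep only the host name."""
    return urlparse(endpoint_url).netloc


# Hand-written stand-in for a describe_stacks response (illustrative values only).
sample_response = {
    "Stacks": [{
        "Outputs": [
            {"OutputKey": "MultimodalCollectionEndpoint",
             "OutputValue": "https://abc123.us-east-1.aoss.amazonaws.com"},
        ]
    }]
}
sample_outputs = parse_stack_outputs(sample_response)
sample_host = endpoint_to_host(sample_outputs["MultimodalCollectionEndpoint"])
```

Using `urlparse(...).netloc` instead of splitting on `//` also tolerates endpoints that carry a trailing path.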

In [None]:
outputs = get_cfn_outputs(g.CFN_STACK_NAME)
host = outputs['MultimodalCollectionEndpoint'].split('//')[1]
# index_name = outputs['OpenSearchIndexName']
index_name = "blog3slides-app1"
logger.info(f"opensearchhost={host}, index={index_name}")

We create the OpenSearch client, signing its requests with the AWS credentials of the current session.

In [None]:
session = boto3.Session()
credentials = session.get_credentials()
auth = AWSV4SignerAuth(credentials, g.AWS_REGION, g.OS_SERVICE)

os_client = OpenSearch(
    hosts = [{'host': host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection,
    pool_maxsize = 20
)

## Step 3. Ready for RAG

We now have all the pieces for RAG. Here is how we _talk to our slide deck_.

1. Convert the user question into embeddings using the Titan Multimodal Embeddings model.

1. Find the most similar slide (image) corresponding to the embeddings (for the user question) from the vector database (OpenSearch Serverless).

1. Now ask LLaVA (via the SageMaker Endpoint) to answer the user question using the retrieved image for the most similar slide.
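
The first step above can be sketched as follows. This is a minimal illustration, assuming the Titan Multimodal Embeddings request takes an `inputText` field and returns an `embedding` list; the names `build_titan_text_request`, `embed_text`, and `FakeBedrockClient`, and the model ID string, are hypothetical and are not the notebook's `get_text_embedding` helper. The fake client only exists so the sketch runs without AWS credentials.

```python
import io
import json
from typing import List


def build_titan_text_request(question: str) -> str:
    """Serialize a text-only embedding request body."""
    return json.dumps({"inputText": question})


def embed_text(client, question: str, model_id: str) -> List[float]:
    """Call invoke_model and pull the embedding vector out of the response."""
    response = client.invoke_model(
        body=build_titan_text_request(question),
        modelId=model_id,
        accept="application/json",
        contentType="application/json",
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]


class FakeBedrockClient:
    """Offline stand-in that mimics the shape of an invoke_model response."""

    def invoke_model(self, body, modelId, accept, contentType):
        text = json.loads(body)["inputText"]
        fake = {"embedding": [float(len(text)), 0.0, 1.0]}
        return {"body": io.BytesIO(json.dumps(fake).encode("utf-8"))}


# Assumption: the model ID below is a placeholder, not the value of g.FMC_MODEL_ID.
demo_embedding = embed_text(FakeBedrockClient(), "What is RAG?", "hypothetical-model-id")
```

With a real `bedrock-runtime` client in place of `FakeBedrockClient`, the same `embed_text` call would return the model's embedding vector for the question.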

In [None]:
bedrock = boto3.client(service_name="bedrock-runtime", endpoint_url=g.TITAN_URL)

Use the following prompt template to make sure that the model only answers from the slides (images).
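
The prompt travels to the model together with the retrieved slide image. The sketch below shows one way the two could be packaged, assuming the model is reached through the Anthropic Messages format on Amazon Bedrock; that is an assumption, since the request built by `get_img_desc` is not shown in this notebook. The function names, the default `media_type`, and `max_tokens` value are hypothetical.

```python
import base64
import json


def encode_bytes_to_base64(image_bytes: bytes) -> str:
    """Base64-encode raw image bytes into an ASCII-safe string."""
    return base64.b64encode(image_bytes).decode("utf-8")


def build_image_question_body(b64_image: str, prompt: str,
                              media_type: str = "image/jpeg",
                              max_tokens: int = 512) -> str:
    """Build a request body holding one image and one text prompt.

    Assumption: the target model accepts the Anthropic Messages format.
    """
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": media_type,
                            "data": b64_image}},
                {"type": "text", "text": prompt},
            ],
        }],
    })


# Placeholder bytes stand in for a real slide image.
demo_b64 = encode_bytes_to_base64(b"not-a-real-image")
demo_body = json.loads(build_image_question_body(demo_b64, "What does the slide show?"))
```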

### Ask questions
Loop through the questions in the jsonl file and, for each one:
1. Embed the question
2. Do a similarity search to retrieve the closest image URL and its deck metadata
3. Get the final response from Claude by passing the question and the retrieved image
4. Append the response to a list
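
The similarity search in step 2 can be sketched as an OpenSearch k-NN query plus a defensive read of the first hit. This is an illustration only: the vector field name `vector_embedding`, the function names, and the sample response are assumptions, and the notebook's `find_similar_data` helper may build its query differently.

```python
from typing import Any, Dict, List, Optional


def build_knn_query(embedding: List[float], k: int = 1,
                    vector_field: str = "vector_embedding") -> Dict[str, Any]:
    """Build a k-NN search body that returns the k nearest documents."""
    return {
        "size": k,
        "query": {"knn": {vector_field: {"vector": embedding, "k": k}}},
    }


def first_hit_source(search_response: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the _source of the top hit, or None when nothing matched."""
    hits = search_response.get("hits", {}).get("hits", [])
    return hits[0].get("_source") if hits else None


demo_query = build_knn_query([0.1, 0.2, 0.3], k=1)

# Hand-written stand-in for a search response (illustrative values only).
demo_response = {
    "hits": {"hits": [{
        "_source": {
            "image_url": "s3://hypothetical-bucket/slide_1.jpg",
            "metadata": {"deck_name": "demo"},
        }
    }]}
}
demo_source = first_hit_source(demo_response)
```

With a live client, the query body would be sent with `os_client.search(index=index_name, body=demo_query)`.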

Save the responses

In [None]:
llm_prompt: str = """

Human: Use the image to provide a concise answer to the question to the best of your abilities. If you cannot answer the question from the context then say I do not know, do not make up an answer.
<question>
{question}
</question>

Assistant:"""

In [None]:
responses_list = []
with jsonlines.open('qa.jsonl') as f:
    for line in f.iter():
        question: str = line['question']
        text_embedding = get_text_embedding(bedrock, question, g.FMC_MODEL_ID)
        vector_db_response: Dict = find_similar_data(os_client, text_embedding, 1, index_name)
        # guard against an empty result set before indexing into the hits
        hits = vector_db_response.get('hits', {}).get('hits', [])
        if not hits:
            logger.warning(f"no similar slide found for question=\"{question}\", skipping")
            continue
        source = hits[0].get('_source', {})
        metadata = source.get('metadata', {})
        deck_name = metadata.get('deck_name')
        deck_url = metadata.get('deck_url')
        img_url = source.get('image_url')
        
        logger.info(f"going to answer the question=\"{question}\" using the image \"{img_url}\"")
        prompt = llm_prompt.format(question=question)
        img_path = download_image_from_url(img_url, g.IMAGE_DIR)
        if img_path != "":
            b64_img_path = encode_image_to_base64(img_path)
            resp_text = get_img_desc(bedrock, b64_img_path, prompt)

            response = {
                "question": question,
                "response": {
                    "resp_txt": resp_text,
                    "resp_img_url": img_url,
                    "resp_deck_name": deck_name,
                    "resp_deck_url": deck_url
                }
            }
            responses_list.append(response)
            logger.info(f"appended response corresponding to {question}")

In [None]:
fpath: str = "responses-appr1.json"
# use a context manager so the file handle is flushed and closed
with open(fpath, 'w', encoding='utf-8') as f:
    json.dump(responses_list, f,
              separators=(',', ':'),
              sort_keys=True,
              indent=4)
logger.info(f"saved responses for all questions in {fpath}")
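
To inspect the saved file later, it can be loaded back and reduced to one line per question. The sketch below is a hypothetical helper pair, not part of the notebook; it assumes the record layout written above (`question` plus a nested `response` dictionary) and uses a temporary file with made-up values so it runs on its own.

```python
import json
import os
import tempfile
from typing import Dict, List


def load_responses(path: str) -> List[Dict]:
    """Read the list of response records from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def summarize(responses: List[Dict]) -> List[str]:
    """Pair each question with the image URL that was used to answer it."""
    return [f"{r['question']} -> {r['response']['resp_img_url']}" for r in responses]


# Made-up record in the same shape as the notebook's responses.
demo_records = [{
    "question": "q1",
    "response": {"resp_txt": "a1", "resp_img_url": "u1",
                 "resp_deck_name": "d", "resp_deck_url": "du"},
}]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "responses-demo.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(demo_records, f, indent=4, sort_keys=True)
    demo_summary = summarize(load_responses(demo_path))
```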