## 02 Deploy the Llama-2-7b Model and Build Knowledge-Management Question Answering with the Retrieval Augmented Generation (RAG) Design Pattern
Use the Python 3 (Data Science 3.0) kernel image and an `ml.t3.medium` instance for this notebook.

In this notebook we deploy the [**Llama-2-7b**](https://ai.meta.com/llama/) model, which serves as the generation model that produces the final response.

SageMaker endpoint instance: `ml.g5.4xlarge`

To perform inference on the [Llama models](https://ai.meta.com/llama/), you need to pass `custom_attributes='accept_eula=true'` as part of the request header. Doing so confirms that you have read and accepted the model's end-user license agreement (EULA), which can be found in the model card description or on this [webpage](https://ai.meta.com/resources/models-and-libraries/llama-downloads/). Requests sent with `accept_eula=false` fail. The inference cell at the end of this notebook passes `accept_eula=true`, so run it only after you have read and accepted the EULA.
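
As a minimal sketch of where the EULA flag travels when the endpoint is called through the low-level runtime API instead of a predictor object, the helper below assembles the keyword arguments for `invoke_endpoint`. The helper name `build_invoke_kwargs` and the endpoint name `my-llama-endpoint` are hypothetical and used only for illustration; the actual call is left commented out because it needs AWS credentials and a deployed endpoint.

```python
import json


def build_invoke_kwargs(endpoint_name: str, payload: dict, accept_eula: bool = False) -> dict:
    """Assemble keyword arguments for the sagemaker-runtime `invoke_endpoint` call.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(payload).encode("utf-8"),
        # The EULA acceptance is sent as a custom attribute on the request.
        "CustomAttributes": f"accept_eula={str(accept_eula).lower()}",
    }


# "my-llama-endpoint" is a placeholder name, not the endpoint created in this notebook.
kwargs = build_invoke_kwargs("my-llama-endpoint", {"inputs": "Hello"}, accept_eula=True)
print(kwargs["CustomAttributes"])

# With boto3 installed and an endpoint deployed, the request would be sent with:
# boto3.client("sagemaker-runtime").invoke_endpoint(**kwargs)
```

Keeping the flag as an explicit boolean argument makes the acceptance a deliberate choice by the caller.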

The Retrieval Augmented Generation (RAG) pattern used here involves generating embeddings for all existing documents and indexing them in a vector store. For every user query, we then generate an embedding of the query and search the store by embedding distance. The search results serve as context for the LLM to generate an output.
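
The flow described above can be illustrated with a toy example that runs without any endpoint. This is only a sketch: the word-count `toy_embed` function stands in for the real embedding model, the three sample sentences are invented, and a sorted list replaces the vector store.

```python
import math
from collections import Counter


def toy_embed(text, vocab):
    """Toy stand-in for an embedding model: one word count per vocabulary entry."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Invented sample documents, for illustration only.
documents = [
    "reports can be exported as pdf",
    "incidents are raised through the support portal",
    "dashboards refresh every hour",
]

# Step 1: embed every document once and keep (text, vector) pairs as the "index".
vocab = sorted({word for doc in documents for word in doc.lower().split()})
index = [(doc, toy_embed(doc, vocab)) for doc in documents]


def retrieve(query, k=1):
    """Step 2: embed the query and return the k documents closest to it."""
    query_vector = toy_embed(query, vocab)
    ranked = sorted(index, key=lambda item: l2_distance(query_vector, item[1]))
    return [doc for doc, _ in ranked[:k]]


# Step 3: the retrieved text becomes the context of the prompt sent to the LLM.
query = "how are incidents raised"
context = " ".join(retrieve(query, k=1))
prompt = f"CONTEXT: {context}\nQUESTION: {query}\nANSWER:"
print(prompt)
```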

## Key components

LLM (Large Language Model): The Llama-2-7b model is used to read the retrieved document chunks and provide an answer in a human-friendly manner.

Embeddings Model: GPT-J 6B, available through Amazon SageMaker. This model is used to generate numerical representations of the textual documents.

Vector Store: FAISS, available through LangChain. In this notebook we use this in-memory vector store to hold both the embeddings and the documents. In an enterprise context it could be replaced with a persistent store such as Amazon OpenSearch Service, Amazon RDS for PostgreSQL with pgvector, ChromaDB, Pinecone or Weaviate.

Index: VectorIndex. The index compares the input embedding with the document embeddings to find the relevant documents.

##### Prerequisites

In [None]:
%pip install faiss-cpu==1.7.4 --quiet

In [None]:
%pip install langchain==0.0.222 --quiet

In [None]:
%%capture 

!pip install PyYAML

#### Imports

In [None]:
import requests
import logging 
import boto3
import yaml
import json

##### Setup logging

In [None]:
logger = logging.getLogger('sagemaker')
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

##### Log versions of dependencies 

In [None]:
logger.info(f'Using requests=={requests.__version__}')
logger.info(f'Using pyyaml=={yaml.__version__}')

#### Setup essentials

In [None]:
TEXT_EMBEDDING_MODEL_ENDPOINT_NAME = 'huggingface-textembedding-gpt-j-6b-fp16-1705613925'

REGION_NAME = boto3.session.Session().region_name

#### Encode passages (chunks) using JumpStart's GPT-J text embedding model

To follow the RAG approach, this notebook uses the LangChain framework, which integrates with a range of services and tools and makes it straightforward to build patterns such as RAG.

In [None]:
import numpy as np
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("./src_doc/", glob="**/Reporting-FAQ*.txt", loader_cls=TextLoader)

documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    # Chunks of at most 1000 characters, with 100 characters shared
    # between consecutive chunks so context is not cut off at a boundary.
    chunk_size=1000,
    chunk_overlap=100,
)
docs = text_splitter.split_documents(documents)
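
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is an illustration only and not how `RecursiveCharacterTextSplitter` works internally: the LangChain splitter prefers to break on natural separators such as paragraph and line breaks, while this sketch cuts at fixed character offsets. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the same idea as the 100-character overlap configured above.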

In [None]:
print(docs[0])

In [None]:
def avg_doc_length(documents):
    """Return the average page_content length in characters (integer division)."""
    return sum(len(doc.page_content) for doc in documents) // len(documents)

avg_char_count_pre = avg_doc_length(documents)
avg_char_count_post = avg_doc_length(docs)
print(f'Average length among {len(documents)} documents loaded is {avg_char_count_pre} characters.')
print(f'After the split we have {len(docs)} chunks, compared with the original {len(documents)} documents.')
print(f'Average length among {len(docs)} chunks (after split) is {avg_char_count_post} characters.')

In [None]:
from typing import Dict, List

from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler


class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
    def embed_documents(self, texts: List[str], chunk_size: int = 5) -> List[List[float]]:
        """Compute doc embeddings using a SageMaker Inference Endpoint.

        Args:
            texts: The list of texts to embed.
            chunk_size: How many input texts are grouped together into a
                single request to the endpoint. Defaults to 5.

        Returns:
            List of embeddings, one for each text.
        """
        results = []
        if not texts:
            # Nothing to embed; this also avoids a zero step in range() below.
            return results
        _chunk_size = min(chunk_size, len(texts))

        for i in range(0, len(texts), _chunk_size):
            # One endpoint call per batch of up to _chunk_size texts.
            response = self._embedding_func(texts[i : i + _chunk_size])
            results.extend(response)
        return results


class ContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"


    def transform_input(self, inputs: List[str], model_kwargs: Dict) -> bytes:
        # Serialize a batch of texts (plus any model kwargs) as the JSON request body.
        input_str = json.dumps({"text_inputs": inputs, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> List[List[float]]:
        # `output` is the streaming response body, so it is read before decoding.
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["embedding"]

content_handler = ContentHandler()

sagemakerEndpointEmbeddingsJumpStart = SagemakerEndpointEmbeddingsJumpStart(
    endpoint_name=TEXT_EMBEDDING_MODEL_ENDPOINT_NAME,
    region_name=REGION_NAME,
    content_handler=content_handler,
)
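
The two ideas in the cell above, batching the texts and converting each batch to and from JSON, can be tried offline. The sketch below assumes the same request key (`text_inputs`) and response key (`embedding`) used by the content handler; the function names and the fake response are hypothetical, and an in-memory `io.BytesIO` stream plays the role of the endpoint's response body.

```python
import io
import json


def batches(texts, chunk_size=5):
    """Yield consecutive groups of at most chunk_size texts."""
    for i in range(0, len(texts), chunk_size):
        yield texts[i:i + chunk_size]


def encode_request(inputs, **model_kwargs):
    """Serialize one batch the same way transform_input does."""
    return json.dumps({"text_inputs": inputs, **model_kwargs}).encode("utf-8")


def decode_response(stream):
    """Read a file-like response body and return the list of embeddings."""
    return json.loads(stream.read().decode("utf-8"))["embedding"]


texts = [f"chunk {n}" for n in range(12)]
batch_sizes = [len(batch) for batch in batches(texts)]
print(batch_sizes)

body = encode_request(texts[:2])
print(body)

# Fake response standing in for the endpoint: one 2-dimensional vector per input text.
fake_stream = io.BytesIO(json.dumps({"embedding": [[0.1, 0.2], [0.3, 0.4]]}).encode("utf-8"))
embeddings = decode_response(fake_stream)
print(embeddings)
```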

In [None]:
print(docs[0].page_content)

In [None]:
sample_embedding = np.array(sagemakerEndpointEmbeddingsJumpStart.embed_query(docs[0].page_content))
print("Sample embedding of a document chunk: ", sample_embedding)
print("Size of the embedding: ", sample_embedding.shape)

Now create embeddings for the entire document set. This can take a while for a large corpus; for a single medical textbook, for example, it takes about 6 minutes.

In [None]:
from tqdm.contrib.concurrent import process_map
from multiprocessing import cpu_count

def generate_embeddings(x):
    return (x, sagemakerEndpointEmbeddingsJumpStart.embed_query(x))
    
workers = 1 * cpu_count()

texts = [i.page_content for i in docs]

In [None]:
workers

In [None]:
data = process_map(generate_embeddings, texts, max_workers=workers, chunksize=100)
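
Because each embedding request mostly waits on the network, a thread pool is a possible alternative to a process pool. The sketch below shows that variant with the standard library only; it is an assumption-laden illustration, not the notebook's method. `fake_embed` is a hypothetical stand-in for the endpoint call, and `embed_all` is a hypothetical helper that returns `(text, vector)` pairs in input order.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_embed(text):
    """Hypothetical stand-in for the endpoint call: returns a tiny 2-dimensional vector."""
    return [float(len(text)), float(text.count(" "))]


def embed_all(texts, embed_fn, max_workers=4):
    """Embed texts concurrently and return (text, vector) pairs in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which call finishes first.
        vectors = list(pool.map(embed_fn, texts))
    return list(zip(texts, vectors))


sample_texts = ["first chunk", "second chunk of text", "third"]
pairs = embed_all(sample_texts, fake_embed)
for text, vector in pairs:
    print(text, vector)
```

Preserving input order matters when the pairs are later matched back to the original chunks.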

Next, we insert the embeddings into the FAISS vector store.

In [None]:
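from langchain.vectorstores import FAISS
# `data` already holds a (text, embedding) pair for every chunk; seeding the store
# with docs[0:2] and then adding all of `data` would index those two chunks twice.
faiss = FAISS.from_embeddings(
    data, sagemakerEndpointEmbeddingsJumpStart, metadatas=[doc.metadata for doc in docs]
)
faiss.save_local("faiss_index")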

Next, we create a user query and retrieve a response from the vector search and the LLM combined.

In [None]:
# Alternative query to try: "Tell me the types of reports I can access."
query = "what's the process to report an incident or raise a new feature?"

In [None]:
query_embedding = faiss.embedding_function(query)
np.array(query_embedding)

In [None]:
relevant_documents = faiss.similarity_search_by_vector(query_embedding)
context = ""
print(f'{len(relevant_documents)} documents are fetched which are relevant to the query.')
print('----')
for i, rel_doc in enumerate(relevant_documents):
    print(f'## Document {i+1}: {rel_doc.page_content}.......')
    print('---')
    context += rel_doc.page_content + "\n"  # keep adjacent chunks from running together
context = context.replace("\n", " ")
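
Llama 2 has a context window of 4096 tokens, so the concatenated context cannot grow without bound. One possible guard, sketched below, stops adding chunks once a character budget is reached. The helper name `build_context` and the budget value are hypothetical, and a character count is only a rough proxy for a token count.

```python
def build_context(chunks, max_chars=2000, separator=" "):
    """Join chunks in ranked order until the character budget would be exceeded."""
    parts = []
    used = 0
    for chunk in chunks:
        # Collapse newlines and repeated whitespace into single spaces.
        cleaned = " ".join(chunk.split())
        cost = len(cleaned) + (len(separator) if parts else 0)
        if used + cost > max_chars:
            break
        parts.append(cleaned)
        used += cost
    return separator.join(parts)


sample_chunks = ["first\nchunk", "second   chunk", "third chunk"]
print(build_context(sample_chunks, max_chars=100))
print(build_context(sample_chunks, max_chars=20))
```

Since chunks arrive ordered by relevance, truncating from the end drops the least relevant material first.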

Now create a prompt template that passes the above context from the vector search to the model. We explicitly instruct the model to answer using only the context provided.

In [None]:
template = """
        You are a helpful, polite, fact-based agent.
        If you don't know the answer, just say that you don't know.
        Please answer the following question using the context provided. 

        CONTEXT: 
        {context}
        =========
        QUESTION: {question} 
        ANSWER: """


In [None]:
prompt = template.format(context=context, question=query)
print(prompt)

Finally, deploy the LLM endpoint and invoke it to generate a response.

## Deploy Llama-2-7b

In [None]:
import sagemaker
from sagemaker.jumpstart.model import JumpStartModel

role = sagemaker.get_execution_role()

# The "-f" suffix selects the fine-tuned chat variant of Llama-2-7b,
# which accepts the role-based dialog payload sent to the endpoint.
my_model = JumpStartModel(model_id="meta-textgeneration-llama-2-7b-f", role=role)

In [None]:
predictor = my_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge",
    endpoint_name="llama-2-generator-2-01-18"
)

In [None]:
payload = {
    "inputs": [
        [
            {"role": "system", "content": prompt},
            {"role": "user", "content": query},
        ]
    ],
    "parameters": {"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6, "return_full_text": False},
}
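
When several queries are sent, it can help to build the dialog payload with a small function instead of editing the dictionary by hand. The sketch below mirrors the structure of the payload above (a list holding one dialog, each turn carrying a `role` and `content`); the helper name `build_chat_payload` and the sample strings are hypothetical.

```python
def build_chat_payload(system_prompt, user_message, max_new_tokens=64, top_p=0.9, temperature=0.6):
    """Build a single-dialog chat payload with one system turn and one user turn."""
    if not user_message.strip():
        raise ValueError("user_message must not be empty")
    dialog = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
    return {
        # The endpoint expects a list of dialogs; here there is exactly one.
        "inputs": [dialog],
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "top_p": top_p,
            "temperature": temperature,
        },
    }


example_payload = build_chat_payload(
    "Answer using only the context provided.",
    "How do I report an incident?",
)
print(example_payload["inputs"][0][1]["content"])
```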

Generate the query response using the Llama-2-7b model and print it.

In [None]:
out = predictor.predict(payload, custom_attributes='accept_eula=true')
print(out[0]['generation']['content'])
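
Indexing straight into the response raises an unhelpful `KeyError` or `IndexError` if the endpoint returns something unexpected. A small defensive accessor, sketched below, assumes the same response shape used above (`out[0]['generation']['content']`). The helper name `extract_answer` and the fake response are hypothetical.

```python
def extract_answer(response):
    """Return the generated text from a chat response, or raise a descriptive ValueError."""
    if not isinstance(response, list) or not response:
        raise ValueError("expected a non-empty list of generations")
    try:
        return response[0]["generation"]["content"].strip()
    except (KeyError, TypeError, AttributeError) as exc:
        raise ValueError(f"unexpected response shape: {response!r}") from exc


# Fake response for illustration; a real one comes from predictor.predict(...).
fake_response = [
    {"generation": {"role": "assistant", "content": " Open a ticket in the support portal. "}}
]
print(extract_answer(fake_response))
```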