# Retrieval Augmented Question Answering with Llama 2, LangChain and Pinecone using SageMaker Studio Notebooks for fast experimentation

In this notebook, we combine Llama 2 text generation with a Hugging Face embedding model to build a Retrieval Augmented Generation (RAG) question-answering system in SageMaker Studio notebooks. The notebook runs on the PyTorch 2.0.0 image with an ml.g5.2xlarge instance, which lets us download open-source Hugging Face models and load them as local LLMs. We then use these local models to build, experiment with, and tune the RAG application before deploying it. We also show how the Pinecone vector store can be used to store and retrieve embeddings as part of the RAG workflow.

<div style="background-color: #FFDDDD; border-left: 5px solid red; padding: 10px; color: black;">
    <strong>Kernel:</strong> PyTorch 2.0.0 Python 3.10 GPU Optimized <strong>Instance Type:</strong> ml.g5.2xlarge
</div>

## 01. Set-up

Install the required libraries.

In [None]:
%%writefile requirements.txt
sagemaker>=2.175.0
transformers==4.33.0
accelerate==0.21.0
datasets==2.13.0
langchain==0.0.297
pypdf>=3.16.3
pinecone-client<3  # pinecone.init(), used below, exists only in the 2.x client
sentence_transformers
safetensors>=0.3.3

In [None]:
!pip install -U -r requirements.txt

## 02. Load Llama-2 7B chat in the notebook for experimentation

First, let's download the Llama-2-7b-chat-hf model from the Hugging Face Hub. Llama 2 models are gated; to get access, follow the instructions [here](https://huggingface.co/meta-llama/Llama-2-7b-hf).

In [None]:
import getpass
hf_access_token = getpass.getpass("Huggingface API Token:")

In [None]:
import torch
import os
from transformers import (
    AutoTokenizer, 
    LlamaTokenizer, 
    LlamaForCausalLM, 
    GenerationConfig,
    AutoModelForCausalLM,
)
import transformers

The following cell takes a few minutes to complete.

In [None]:
tg_model_id = "meta-llama/Llama-2-7b-chat-hf" #the model id in Hugging Face
tg_model_path = f"./tg_model/{tg_model_id}" #the local directory where the model will be saved

# download from the Hub only if there is no non-empty local copy yet
if not os.path.exists(tg_model_path) or not os.listdir(tg_model_path):
    print("Loading model from HuggingFace")

    tg_model = AutoModelForCausalLM.from_pretrained(
        tg_model_id, 
        token=hf_access_token,
        do_sample=True, 
        use_safetensors=True,
        device_map="auto",
        torch_dtype=torch.float16
    )
    tg_tokenizer = AutoTokenizer.from_pretrained(
        tg_model_id, 
        token=hf_access_token
    )

    tg_model.save_pretrained(
        save_directory=tg_model_path, 
        from_pt=True
    )
    tg_tokenizer.save_pretrained(
        save_directory=tg_model_path, 
        from_pt=True
    )
else:
    print("Loading model from local directory")
    tg_model = LlamaForCausalLM.from_pretrained(
        tg_model_path,
        device_map="auto",
        torch_dtype=torch.float16  # same dtype as the initial download, so the weights fit on the GPU
    )
    tg_tokenizer = AutoTokenizer.from_pretrained(tg_model_path)

Check memory consumption

In [None]:
print("Memory allocated: %fGB"%(torch.cuda.memory_allocated(0)/1024/1024/1024))
print("Memory reserved: %fGB"%(torch.cuda.memory_reserved(0)/1024/1024/1024))
print("Max memory reserved: %fGB"%(torch.cuda.max_memory_reserved(0)/1024/1024/1024))
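
As a rough sanity check on the numbers printed above, the weight footprint can be estimated from the parameter count and the bytes per parameter. The sketch below assumes roughly 6.7 billion parameters for the 7B model (an approximate figure, not read from the loaded model) and ignores activations, the KV cache and CUDA overhead. `estimate_weight_memory_gib` is a hypothetical helper, not part of any library.

```python
# Rough estimate of model weight memory; ignores activations, KV cache and CUDA overhead.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def estimate_weight_memory_gib(num_params: float, dtype: str = "float16") -> float:
    """Return the approximate size of the weights in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Assumption: ~6.7e9 parameters for the 7B model (approximate).
approx_params = 6.7e9
for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{estimate_weight_memory_gib(approx_params, dtype):.1f} GiB")
```

Under these assumptions the float16 weights need about half the memory of float32, which is why the model is loaded with `torch_dtype=torch.float16`.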

## 03. Simple question-answering using Llama 2 7B chat and LangChain

Now that the model is available in memory, we can start using it to answer questions. The Llama-2 chat models expect the prompt to follow the below format:

    
\<s>[INST] <\<SYS\>>

{{ system_prompt }}

\<</SYS\>>

{{ user_message }} [/INST]

   
where
- \<s> - is the beginning of the sequence.
- <\<SYS>> - is the beginning of the system message.
- \<</SYS\>> - is the end of the system message.
- [INST] - is the beginning of the instructions.
- [/INST] - is the end of the instructions.
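
The format described above can be sketched as a small plain-Python function. `build_llama2_prompt` is a hypothetical helper written for illustration, and the example system prompt and question are made up. Whether the literal `<s>` is needed may depend on whether your tokenizer already adds the beginning-of-sequence token.

```python
def build_llama2_prompt(system_prompt: str, user_message: str) -> str:
    """Assemble a single-turn Llama 2 chat prompt from the pieces listed above."""
    return (
        "<s>[INST] <<SYS>>\n"      # beginning of sequence, instructions and system message
        f"{system_prompt}\n"
        "<</SYS>>\n\n"             # end of the system message
        f"{user_message} [/INST]"  # end of the instructions
    )


prompt = build_llama2_prompt(
    system_prompt="You are a helpful assistant.",
    user_message="What is Retrieval Augmented Generation?",
)
print(prompt)
```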

Let's create a recipe based on the above that will help us define our prompts going forward. For that we will use [PromptTemplate](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/) from LangChain.

In [None]:
from langchain import PromptTemplate

template = """<s>[INST] <<SYS>>\nYou are an assistant for question-answering tasks. You are helpful and friendly. Use the following pieces of retrieved context to answer the query. If you don't know the answer, you just say I don't know. Use three sentences maximum and keep the answer concise.
<</SYS>>\n
{context}\n
{question} [/INST]
"""
prompt_template = PromptTemplate(
    template=template, 
    input_variables=['context','question']
)

Next, we test the model on some questions without providing any context. For our tests, we will use questions about AWS news from 2023.

In [None]:
question= "When can I visit the AWS M&E Customer Experience Center in New York City?"
question2 = "How many awards have AWS Media Services won in 2023?"

In [None]:
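# Llama 2 ships without a padding token; reuse EOS so no new embedding row is needed
# (this matches pad_token_id=eos_token_id in the pipeline below)
tg_tokenizer.pad_token = tg_tokenizer.eos_token
tg_tokenizer.padding_side = "left"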

tg_pipe = transformers.pipeline(
    task='text-generation',
    model=tg_model, 
    tokenizer=tg_tokenizer,
    num_return_sequences=1,
    eos_token_id=tg_tokenizer.eos_token_id,
    pad_token_id=tg_tokenizer.eos_token_id,
    max_new_tokens=300,
    temperature=0.7
)

In [None]:
from langchain.chains import LLMChain
from langchain.llms import HuggingFacePipeline

llm=HuggingFacePipeline(pipeline=tg_pipe, model_kwargs={'temperature':0.7})


llm_chain = LLMChain(llm=llm, prompt=prompt_template)
no_context_response = llm_chain.predict(context="", question=question)
print(no_context_response)

Let's see if we can improve this answer by adding information from the AWS blog post [AWS announces new M&E Customer Experience Center in New York City](https://aws.amazon.com/blogs/media/aws-announces-new-me-customer-experience-center-in-new-york-city/) to our prompt as context.

In [None]:
context = """Media and entertainment (M&E) customers continue to face challenges in creating more content, 
more quickly, and distributing it to more endpoints than ever before in their quest to delight viewers globally. 
Amazon Web Services (AWS), along with AWS Partners, have showcased the rapid evolution of M&E solutions for years at industry events 
like the National Association of Broadcasters (NAB) Show and the International Broadcast Convention (IBC). Until now, AWS for M&E technology demonstrations
were accessible in this way just a few weeks out of the year. Customers are more engaged than ever before; they want to have higher quality conversations 
regarding user experience and media tooling. These conversations are best supported by having an interconnected solution architecture for reference.
Scheduling a visit of the M&E Customer Experience Center will be available starting November 13th, please send an email to AWS-MediaEnt-CXC@amazon.com.."""

In [None]:
context_response = llm_chain.predict(context=context, question=question)
print(context_response)

## 04. RAG question answering with Llama 2 7B chat, LangChain and Pinecone


In the above response, the model provides an answer with data from 2023 based on the context we provided. Next we want to scale this approach using __Retrieval Augmented Generation (RAG)__.
With RAG, we will ingest external data into our knowledge base and augment the prompt by adding only the data that is relevant to the context.

For our example, we will use 2 AWS blog posts as external files. These are already available as PDF files in the data folder of this project.
1. [AWS Media Services awarded industry accolades](https://aws.amazon.com/blogs/media/aws-media-services-awarded-industry-accolades/)
2. [AWS announces new M&E Customer Experience Center in New York City](https://aws.amazon.com/blogs/media/aws-announces-new-me-customer-experience-center-in-new-york-city/)

We load the PDF files and split them into smaller documents (chunks).

In [None]:
import numpy as np
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("./data/")

documents = loader.load()

text_splitter=RecursiveCharacterTextSplitter(
    chunk_size=700,
    chunk_overlap=5,
)
docs = text_splitter.split_documents(documents)
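
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a deliberately simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); `split_with_overlap` is a hypothetical helper for illustration only.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

A small overlap keeps a sentence that straddles a chunk boundary partially visible in both chunks, at the cost of storing slightly more text.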

Next, we generate the embeddings for the documents. For that we will use the [bge-small-en](https://huggingface.co/BAAI/bge-small-en) model. We use Hugging Face transformers to download it to the local directory and load it in memory.

In [None]:
from langchain.embeddings import HuggingFaceBgeEmbeddings

em_model_name = "BAAI/bge-small-en"
em_model_path = f"./em-model"

In [None]:
from transformers import AutoModel

# Load model from HuggingFace Hub
em_model = AutoModel.from_pretrained(
    em_model_name,
    torch_dtype=torch.float32
)

# device_map applies to models, not tokenizers, so it is omitted here
em_tokenizer = AutoTokenizer.from_pretrained(em_model_name)

# save model to disk
em_tokenizer.save_pretrained(
    save_directory=f"{em_model_path}/model", 
    from_pt=True
)

em_model.save_pretrained(
    save_directory=f"{em_model_path}/model", 
    from_pt=True
)
em_model.eval()

In [None]:
# Tokenize sentences
def tokenize_text(_input, device):
    return em_tokenizer(
        [_input], 
        padding=True, 
        truncation=True, 
        return_tensors='pt'
    ).to(device)

# Embedding function: takes a text as input and returns its embedding as a list of floats
def embedding_generator(_input, normalize=True):
    # Compute token embeddings
    with torch.no_grad():
        embedded_output = em_model(
            **tokenize_text(
                _input, 
                em_model.device
            )
        )
        sentence_embeddings = embedded_output[0][:, 0]
        # normalize embeddings
        if normalize:
            sentence_embeddings = torch.nn.functional.normalize(
                sentence_embeddings, 
                p=2, 
                dim=1
            )
    
    return sentence_embeddings[0, :].tolist()

In [None]:
sample_sentence_embedding = embedding_generator(docs[0].page_content)
print(f"Embedding size of the document --->", len(sample_sentence_embedding))
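
The `normalize` step in `embedding_generator` matters for similarity search: once vectors have unit length, their dot product equals their cosine similarity. The following plain-Python sketch illustrates this with made-up two-dimensional vectors; the helper names are hypothetical and no model is involved.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for two sentence embeddings
u, v = [3.0, 4.0], [4.0, 3.0]
u_hat, v_hat = l2_normalize(u), l2_normalize(v)
print(dot(u_hat, v_hat), cosine_similarity(u, v))
```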

We are now ready to ingest the embeddings into our vector store. In this notebook we will use [Pinecone](https://www.pinecone.io/), however you can replace the below code with that for the vector store of your choice.
If you don't have a Pinecone account you can sign up for free to complete this notebook. 

In [None]:
#enter your Pinecone keys
os.environ["PINECONE_API_KEY"] = getpass.getpass("Pinecone API Key:")
os.environ["PINECONE_ENV"] = getpass.getpass("Pinecone Environment:")

In [None]:
#initialize Pinecone
import pinecone
pinecone.init(
    api_key = os.environ["PINECONE_API_KEY"],
    environment = os.environ["PINECONE_ENV"]
)

In Pinecone, we create a new vector search index and ingest the embeddings we created in the previous step. The dimension of the index must match the output dimension of our embedding model.

In [None]:
#check if index already exists, if not we create it
index_name = "rag-index"
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=len(sample_sentence_embedding),
        metric='cosine'
    )

In [None]:
#insert the embeddings
from langchain.vectorstores import Pinecone
vector_store = Pinecone.from_documents(
    docs, 
    embedding_generator, 
    index_name=index_name
)

Let's do a quick test to see if the similarity search is working well.

In [None]:
# use a separate name so the chunk list in `docs` is not overwritten
similar_docs = vector_store.similarity_search(question)
print(similar_docs[0].page_content)
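
Conceptually, a similarity search scores the query embedding against every stored embedding and returns the best matches. The sketch below is a brute-force, in-memory stand-in for that idea, not a description of how Pinecone is implemented; the document IDs and vectors are invented for illustration.

```python
import heapq


def top_k(query_vec, indexed, k=2):
    """Return the k (score, doc_id) pairs with the highest dot product.

    `indexed` is a list of (doc_id, unit-length vector) pairs, so the
    dot product is the cosine similarity.
    """
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), doc_id)
        for doc_id, vec in indexed
    ]
    return heapq.nlargest(k, scored)


# Hypothetical toy index with 2-dimensional unit vectors
toy_index = [
    ("cxc-blog", [1.0, 0.0]),
    ("awards-blog", [0.0, 1.0]),
    ("other", [0.6, 0.8]),
]
matches = top_k([0.8, 0.6], toy_index, k=2)
for score, doc_id in matches:
    print(f"{doc_id}: {score:.2f}")
```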

We now have the Llama 2 chat model in memory and the embeddings inserted in our Pinecone index. To improve the responses of the Llama 2 chat model, we bring it all together and implement the RAG architecture with the LangChain [RetrievalQA](https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa) chain. RetrievalQA augments our initial prompt with the most similar documents from the vector store.

In [None]:
from langchain.chains import RetrievalQA

from langchain.prompts import PromptTemplate

llm_qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=vector_store.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt_template}
)
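
With `chain_type='stuff'`, the retrieved chunks are concatenated and placed into the `{context}` slot of the prompt. A minimal sketch of that idea in plain Python follows; the helper names, the toy template and the chunks are hypothetical, and the real chain handles more details (document formatting, callbacks, and so on).

```python
def stuff_documents(chunks, separator="\n\n"):
    """Join the retrieved chunks into a single context string."""
    return separator.join(chunks)


def build_rag_prompt(template, chunks, question):
    """Fill the template's {context} and {question} slots."""
    return template.format(context=stuff_documents(chunks), question=question)


# Toy template and chunks for illustration only
toy_template = "Context:\n{context}\n\nQuestion: {question}"
retrieved = [
    "Chunk about the Customer Experience Center.",
    "Chunk about industry awards.",
]
rag_prompt = build_rag_prompt(toy_template, retrieved, "When can I visit?")
print(rag_prompt)
```

Because every retrieved chunk ends up in one prompt, the number of chunks (`k`) and the chunk size together determine how much of the model's context window is consumed.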

And that's it! Let's ask the model again to see if we get 2023 data this time.

In [None]:
import textwrap
#helper method to improve the readability of the response
def print_response(llm_response):
    temp = [textwrap.fill(line, width=100) for line in llm_response['result'].split('\n')]
    response = '\n'.join(temp)
    print(f"{llm_response['query']}\n \n{response}\n \n Source Documents:")
    for source in llm_response["source_documents"]:
        print(source.metadata)

In [None]:
print_response(llm_qa_chain(question))

The model returns a more informed response with details from 2023 and the pages in the documents from where it acquired the information. 

Let's try another question. The answer to this one is in a different document.

In [None]:
print_response(llm_qa_chain(question2))

We can continue our experimentation with more files, different model parameters and different questions. Once we have sufficient confidence in our approach, 
we can deploy our models to Amazon SageMaker.

## 04. Supercharge your applications with GenAI by deploying your models to Amazon SageMaker

First we import the required libraries, and retrieve the IAM role and session we will use for deployment.  To deploy a model to a SageMaker endpoint, we first need to compress the model artifacts and upload the tar.gz file to Amazon S3.

### 04a. Deploy Text Generation Model

In [None]:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
from sagemaker import image_uris
from sagemaker.model import Model
from sagemaker import serializers, deserializers

role = sagemaker.get_execution_role()
sess = sagemaker.Session()
region = sagemaker.Session().boto_region_name
bucket = sess.default_bucket() # Set a default S3 bucket
sm_client = boto3.client('sagemaker', region_name=region)
smr_client = boto3.client("sagemaker-runtime")
prefix = 'qa-rag-models-test/rag-blog'

In [None]:
pretrained_model_location = f"s3://{bucket}/{prefix}/llama-2-7B-chat"

In [None]:
llm_path = sagemaker.s3.S3Uploader.upload(tg_model_path, pretrained_model_location)

In [None]:
djl_properties_filename = "serving.properties"

In [None]:
%%writefile {djl_properties_filename}
engine = MPI
option.tensor_parallel_degree = 1
option.rolling_batch = lmi-dist
option.max_rolling_batch_size = 64
option.max_rolling_batch_prefill_tokens = 1560
option.model_loading_timeout = 3600
option.paged_attention = true
option.trust_remote_code = true
option.dtype = fp16

In [None]:
!echo -n "option.s3url = $pretrained_model_location" >> {djl_properties_filename}

In [None]:
modelfile_base_name = f"local-{tg_model_id.replace('/', '-')}"

In [None]:
!mkdir {modelfile_base_name}
!mv serving.properties {modelfile_base_name}/
!tar czvf {modelfile_base_name}.tar.gz {modelfile_base_name}/
!rm -rf {modelfile_base_name}

In [None]:
# list out the contents of the tar gz file for validation
!tar -ztvf {modelfile_base_name}.tar.gz
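
The packaging step above can also be done from Python with the standard `tarfile` module, which avoids the shell commands. The sketch below works in a temporary directory with a stand-in `serving.properties`; `package_directory` and the `local-model` folder name are hypothetical.

```python
import os
import tarfile
import tempfile


def package_directory(source_dir: str, archive_path: str) -> list[str]:
    """Create a .tar.gz of source_dir (kept as the top-level folder) and return its member names."""
    base = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=base)
    # Re-open the archive to validate what was written
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as workdir:
    model_dir = os.path.join(workdir, "local-model")  # hypothetical folder name
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "serving.properties"), "w") as f:
        f.write("engine = MPI\n")
    names = package_directory(model_dir, os.path.join(workdir, "local-model.tar.gz"))
    print(names)
```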

In [None]:
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.23.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

In [None]:
# Upload file and instantiate a new SageMaker Model
s3_code_prefix = "large-model-lmi/artifacts"

code_artifact = sess.upload_data(f"{modelfile_base_name}.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

In [None]:
llama2_model_name = sagemaker.utils.name_from_base(
    f"{tg_model_id.replace('/', '-')}"
)

tg_sm_model = Model(
    sagemaker_session=sess,
    image_uri=inference_image_uri,
    model_data=code_artifact,
    role=role,
    name=llama2_model_name,
)

In [None]:
instance_type = "ml.g5.2xlarge"
endpoint_name = f"ep-{llama2_model_name}"

tg_sm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900,
    wait=False, # <-- Set to True, if you would prefer to wait for the endpoint to spin up
)
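
Because the deployment above uses `wait=False`, the endpoint may still be in the `Creating` state when the following cells run. One way to handle this is to poll the endpoint status until it is `InService`. The sketch below takes the status lookup as a callable so that it runs without AWS access; `wait_for_endpoint` is a hypothetical helper, and the commented line shows how it could be wired to `sm_client.describe_endpoint`.

```python
import time


def wait_for_endpoint(describe, poll_seconds=30.0, timeout_seconds=1800.0, sleep=time.sleep):
    """Poll describe() until its 'EndpointStatus' is 'InService'.

    Raises RuntimeError if the status becomes 'Failed' and TimeoutError
    if the endpoint is not ready within timeout_seconds.
    """
    waited = 0.0
    while True:
        status = describe()["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("Endpoint creation failed")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint still '{status}' after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# With a real client (assumption: sm_client and endpoint_name as defined in this notebook):
# wait_for_endpoint(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))

# Offline demonstration with a fake status sequence
_fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_endpoint(
    lambda: {"EndpointStatus": next(_fake_statuses)},
    poll_seconds=0.0,
    sleep=lambda seconds: None,
)
print(final_status)
```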

In [None]:
print(f"Endpoint name to use ---> {tg_sm_model.endpoint_name}")

In [None]:
predictor = sagemaker.Predictor(
    endpoint_name=tg_sm_model.endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

In [None]:
predictor.predict(
    {
        "inputs": "Who is the president of Brazil?",
        "parameters": {"temperature": 0.1, "max_new_tokens": 50}
    }
)

### 04b. Deploy Embedding Model

In [None]:
%%writefile {em_model_path}/model.py
from djl_python import Input, Output
import os
import torch
from transformers import (
    AutoModel, 
    AutoTokenizer
)
from typing import Any, Dict, Tuple
import deepspeed
import warnings
import tarfile

model, tokenizer = None, None
model_dir = "./model/"


def get_model(properties):
    
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    
    print(f"Loading model from {model_dir}")
    model = AutoModel.from_pretrained(
        model_dir
    )
    
    model = deepspeed.init_inference(
        model,
        mp_size=properties["tensor_parallel_degree"]
    )
    
    print(f"Loading tokenizer from {model_dir}")
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    
    return model, tokenizer


def handle(inputs: Input) -> None:
    global model, tokenizer
    
    if not model:
        model, tokenizer = get_model(inputs.get_properties())

    if inputs.is_empty():
        return None
    
    data = inputs.get_as_json()
    text = data["text"]
    
    input_tokenized = tokenizer(
        [text], 
        padding=True, 
        truncation=True, 
        return_tensors='pt'
    ).to("cuda")
    
    outputs = model(**input_tokenized)
    
    sentence_embeddings = outputs[0][:, 0]
    
    # normalize embeddings
    sentence_embeddings = torch.nn.functional.normalize(
        sentence_embeddings, 
        p=2, 
        dim=1
    )
    sentence_embeddings = sentence_embeddings[0, :].tolist()
    
    result = {"outputs": sentence_embeddings}
    
    return Output().add(result)

In [None]:
%%writefile {em_model_path}/requirements.txt
einops
git+https://github.com/lanking520/DeepSpeed.git@falcon

In [None]:
%%writefile {em_model_path}/serving.properties
engine=DeepSpeed
option.tensor_parallel_degree=1

In [None]:
!rm -f embeddings-model.tar.gz
!rm -rf {em_model_path}/.ipynb_checkpoints
!cd {em_model_path} && tar -czvf ../embeddings-model.tar.gz ./

In [None]:
!tar -tzvf embeddings-model.tar.gz

In [None]:
embedding_inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.23.0"
)
print(f"Image going to be used is ---- > {embedding_inference_image_uri}")

In [None]:
embedded_code_artifact = sess.upload_data("embeddings-model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {embedded_code_artifact}")

In [None]:
embedding_model_name = sagemaker.utils.name_from_base(
    f"{em_model_name.replace('/', '-')}"
)

em_sm_model = Model(
    sagemaker_session=sess,
    image_uri=embedding_inference_image_uri,
    model_data=embedded_code_artifact,
    role=role,
    name=embedding_model_name,
)
print(f"Creating a new model ---> {em_sm_model.name}")

In [None]:
embedding_instance_type = "ml.g5.2xlarge"

em_sm_model.deploy(
    initial_instance_count=1,
    instance_type=embedding_instance_type,
    endpoint_name=f"ep-{embedding_model_name}",
    container_startup_health_check_timeout=900,
    wait=False,
)

## 05. Run LangChain Inference using SageMaker Endpoint

In [None]:
from typing import Dict
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.llms import SagemakerEndpoint

In [None]:
class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        body = {
            "inputs": prompt, 
            "parameters": model_kwargs
        }
        input_str = json.dumps(body)
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["generated_text"].strip()
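
To see what the content handler does without calling an endpoint, the same transformations can be exercised on in-memory data. The sketch below mirrors the two methods as plain functions and uses `io.BytesIO` as a stand-in for the streaming response body; the function names and the sample response are made up, and the `generated_text` key is taken from the handler above.

```python
import io
import json


def encode_request(prompt: str, model_kwargs: dict) -> bytes:
    """Build the JSON request body, as transform_input does."""
    return json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode("utf-8")


def decode_response(body) -> str:
    """Read the response stream and extract the generated text, as transform_output does."""
    response_json = json.loads(body.read().decode("utf-8"))
    return response_json["generated_text"].strip()


payload = encode_request("Hello", {"temperature": 0.05, "max_new_tokens": 512})

# Fake endpoint response for illustration
fake_body = io.BytesIO(json.dumps({"generated_text": "  Hi there.  "}).encode("utf-8"))
decoded = decode_response(fake_body)

print(json.loads(payload))
print(decoded)
```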

In [None]:
content_handler = ContentHandler()

In [None]:
# convert your local LLM into SageMaker endpoint LLM
llm_sm_ep = SagemakerEndpoint(
    endpoint_name=tg_sm_model.endpoint_name, # <--- Your endpoint name
    region_name=region,
    model_kwargs={
        "temperature": 0.05, 
        "max_new_tokens": 512
    },
    content_handler=content_handler,
)

In [None]:
llm_qa_smep_chain = RetrievalQA.from_chain_type(
    llm=llm_sm_ep,
    chain_type='stuff',
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt_template}
)

In [None]:
print_response(llm_qa_smep_chain(question))

In [None]:
print_response(llm_qa_smep_chain(question2))

### 05a. Invoke the Embedding Endpoint for Inference

This section shows you how to invoke your custom embedding endpoint for inference.  

In [None]:
response_model = smr_client.invoke_endpoint(
    EndpointName=em_sm_model.endpoint_name,
    Body=json.dumps({
        "text": "This is a sample text"
    }),
    ContentType="application/json",
)

outputs = json.loads(response_model["Body"].read().decode("utf8"))['outputs']

In [None]:
print(f"Sample embeddings ---> {outputs[:10]}")

## 06. Clean Up Resources

In [None]:
# delete your text generation endpoint
sm_client.delete_endpoint(
    EndpointName=tg_sm_model.endpoint_name
)

In [None]:
# delete your text embedding endpoint
sm_client.delete_endpoint(
    EndpointName=em_sm_model.endpoint_name
)
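
Deleting an endpoint does not remove its endpoint configuration or the SageMaker model behind it. As an alternative to the two cells above, the sketch below looks up both and deletes all three resources in one call per endpoint. `delete_endpoint_resources` is a hypothetical helper; the recording client is only a stand-in so the example runs without AWS access, and with a real client you would pass `sm_client` instead.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint together with its endpoint config and model(s)."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
    return config_name, model_names


class _RecordingClient:
    """Stand-in for boto3.client('sagemaker') that records delete calls."""

    def __init__(self):
        self.calls = []

    def describe_endpoint(self, EndpointName):
        return {"EndpointConfigName": f"{EndpointName}-config"}

    def describe_endpoint_config(self, EndpointConfigName):
        return {"ProductionVariants": [{"ModelName": "demo-model"}]}

    def delete_endpoint(self, EndpointName):
        self.calls.append(("delete_endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(("delete_endpoint_config", EndpointConfigName))

    def delete_model(self, ModelName):
        self.calls.append(("delete_model", ModelName))


fake_client = _RecordingClient()
delete_endpoint_resources(fake_client, "ep-demo")
print(fake_client.calls)
```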