# Retrieval Augmented Question Answering with Llama 2, LangChain and Pinecone using SageMaker Studio Notebooks for fast experimentation

In this notebook, we combine Llama 2 text generation with a Hugging Face embedding model to build a Retrieval Augmented Generation (RAG) question-answering system in SageMaker Studio notebooks. The notebook runs on the PyTorch 2.0.0 image and an ml.g5.2xlarge instance, which lets us download open-source Hugging Face models and load them locally. We then use these local models to build, experiment with, and tune a RAG application before deploying them. We also show how the Pinecone vector store can be used to store and retrieve embeddings as part of the RAG workflow.

<div style="background-color: #FFDDDD; border-left: 5px solid red; padding: 10px; color: black;">
    <strong>Kernel:</strong> PyTorch 2.0.0 Python 3.10 GPU Optimized <strong>Instance Type:</strong> ml.g5.2xlarge
</div>

## 01. Set-up

Install the required libraries.

In [3]:
%%writefile requirements.txt
sagemaker>=2.175.0
transformers==4.33.0
accelerate==0.21.0
datasets==2.13.0
langchain==0.0.297
pypdf>=3.16.3
pinecone-client
sentence_transformers
safetensors>=0.3.3

Overwriting requirements.txt


In [4]:
!pip install -U -r requirements.txt

Collecting sagemaker>=2.175.0 (from -r requirements.txt (line 1))
  Using cached sagemaker-2.192.1-py2.py3-none-any.whl
Collecting transformers==4.33.0 (from -r requirements.txt (line 2))
  Using cached transformers-4.33.0-py3-none-any.whl (7.6 MB)
Collecting accelerate==0.21.0 (from -r requirements.txt (line 3))
  Using cached accelerate-0.21.0-py3-none-any.whl (244 kB)
Collecting datasets==2.13.0 (from -r requirements.txt (line 4))
  Using cached datasets-2.13.0-py3-none-any.whl (485 kB)
Collecting langchain==0.0.297 (from -r requirements.txt (line 5))
  Using cached langchain-0.0.297-py3-none-any.whl (1.7 MB)
Collecting pypdf>=3.16.3 (from -r requirements.txt (line 6))
  Using cached pypdf-3.16.4-py3-none-any.whl (276 kB)
Collecting pinecone-client (from -r requirements.txt (line 7))
  Using cached pinecone_client-2.2.4-py3-none-any.whl (179 kB)
Collecting sentence_transformers (from -r requirements.txt (line 8))
  Using cached sentence_transformers-2.2.2-py3-none-any.whl
Collecting

## 02. Load Llama-2 7B chat in the notebook for experimentation

First, let's download the Llama-2-7b-chat-hf model from the Hugging Face Hub. Llama 2 models are gated; to get access, follow the instructions [here](https://huggingface.co/meta-llama/Llama-2-7b-hf).

In [None]:
# import nvidia
# if torch.cuda.is_available():
#     cuda_install_dir = '/'.join(nvidia.__file__.split('/')[:-1]) + '/cuda_runtime/lib/'
#     os.environ['LD_LIBRARY_PATH'] =  cuda_install_dir

In [5]:
import getpass
hf_access_token = getpass.getpass("Huggingface API Token:")

Huggingface API Token: ········


In [6]:
import torch
import os
from transformers import (
    AutoTokenizer, 
    LlamaTokenizer, 
    LlamaForCausalLM, 
    GenerationConfig,
    AutoModelForCausalLM,
)
import transformers

  from .autonotebook import tqdm as notebook_tqdm


The following cell takes a few minutes to complete.

In [12]:
tg_model_id = "meta-llama/Llama-2-7b-chat-hf" #the model id in Hugging Face
tg_model_path = f"./tg_model/{tg_model_id}" #the local directory where the model will be saved

if not os.path.exists(tg_model_path) or not os.listdir(tg_model_path):
    print("Loading model from HuggingFace")

    tg_model = AutoModelForCausalLM.from_pretrained(
        tg_model_id, 
        token=hf_access_token,
        do_sample=True, 
        use_safetensors=True,
        device_map="auto",
        torch_dtype=torch.float16
    )
    tg_tokenizer = AutoTokenizer.from_pretrained(
        tg_model_id, 
        token=hf_access_token
    )

    tg_model.save_pretrained(
        save_directory=tg_model_path, 
        from_pt=True
    )
    tg_tokenizer.save_pretrained(
        save_directory=tg_model_path, 
        from_pt=True
    )
else:
    print("Loading model from local directory")
    tg_model = LlamaForCausalLM.from_pretrained(
       tg_model_path,
       device_map="auto"
    )
    tg_tokenizer = AutoTokenizer.from_pretrained(tg_model_path)

Loading model from HuggingFace


Loading checkpoint shards: 100%|██████████| 2/2 [01:53<00:00, 56.70s/it]


Check memory consumption

In [13]:
print("Memory allocated: %fGB"%(torch.cuda.memory_allocated(0)/1024/1024/1024))
print("Memory reserved  %fGB"%(torch.cuda.memory_reserved(0)/1024/1024/1024))
print("Max memory reserved: %fGB"%(torch.cuda.max_memory_reserved(0)/1024/1024/1024))

Memory allocated: 12.613792GB
Memory reserved  12.615234GB
Max memory reserved: 12.615234GB
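
The allocated figure is close to what the weights alone would occupy in half precision. As a rough back-of-the-envelope check, the sketch below assumes about 6.74 billion parameters for Llama-2-7b (an approximate count that is not measured in this notebook) and 2 bytes per float16 parameter. `weight_memory_gib` is a hypothetical helper introduced only for this illustration.

```python
# Rough estimate of the GPU memory needed just for the model weights.
# ASSUMPTION: ~6.74e9 parameters for Llama-2-7b; the exact count is not
# measured in this notebook.
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Return the size of the weights in GiB (1 GiB = 1024**3 bytes)."""
    return num_params * bytes_per_param / 1024**3


approx_params = 6.74e9
fp16_gib = weight_memory_gib(approx_params, 2)  # torch.float16
fp32_gib = weight_memory_gib(approx_params, 4)  # torch.float32

print(f"float16 weights: ~{fp16_gib:.1f} GiB")
print(f"float32 weights: ~{fp32_gib:.1f} GiB")
```

Under these assumptions the float16 estimate lands near the value reported above, and loading the same weights in float32 would need roughly twice as much memory, which is why the model is loaded with `torch_dtype=torch.float16`.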


## 03. Simple question-answering using Llama 2 7B chat and LangChain

Now that the model is available in memory, we can start using it to answer questions. The Llama-2 chat models expect the prompt to follow the format below:


\<s>[INST] <\<SYS\>>

{{ system_prompt }}

\<</SYS\>>

{{ user_message }} [/INST]


where
- \<s> - is the beginning of the sequence.
- <\<SYS>> - is the beginning of the system message.
- \<</SYS\>> - is the end of the system message.
- [INST] - is the beginning of the instructions.
- [/INST] - is the end of the instructions.
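
The format described above can be sketched as a plain Python function. This is only an illustration: `build_llama2_prompt` is a hypothetical helper, not part of transformers or LangChain, and the example strings are made up.

```python
# Minimal sketch of the Llama 2 chat prompt layout described above.
# `build_llama2_prompt` is a hypothetical helper used for illustration only.
def build_llama2_prompt(system_prompt: str, user_message: str) -> str:
    return (
        "<s>[INST] <<SYS>>\n"      # start of sequence, instruction and system block
        f"{system_prompt}\n"
        "<</SYS>>\n\n"             # end of the system block
        f"{user_message} [/INST]"  # user message, then end of instruction
    )


example = build_llama2_prompt(
    "You are a concise assistant.",
    "What is Retrieval Augmented Generation?",
)
print(example)
```

Depending on the tokenizer settings, the beginning-of-sequence token may also be added automatically during tokenization, so it is worth checking the tokenized prompt for a duplicated `<s>`.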

Let's create a recipe based on the above that will help us define our prompts going forward. For that we will use [PromptTemplate](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/) from LangChain.

In [14]:
from langchain import PromptTemplate

template = """<s>[INST] <<SYS>>\nYou are an assistant for question-answering tasks. You are helpful and friendly. Use the following pieces of retrieved context to answer the query. If you don't know the answer, you just say I don't know. Use three sentences maximum and keep the answer concise.
<</SYS>>\n
{context}\n
{question} [/INST]
"""
prompt_template = PromptTemplate(
    template=template, 
    input_variables=['context','question']
)

Next, we test the model on some questions without providing any context. For our tests, we will use questions about the UK Home Office activities in 2023.

In [114]:
question = "How is UK home office driving down crime in 2023?"
question2= "How are the British nationality fees changing in 2023"

In [19]:
tg_tokenizer.add_special_tokens(
    {"pad_token": "[PAD]"}
)
tg_tokenizer.padding_side = "left"

tg_pipe = transformers.pipeline(
    task='text-generation',
    model=tg_model, 
    tokenizer=tg_tokenizer,
    num_return_sequences=1,
    eos_token_id=tg_tokenizer.eos_token_id,
    pad_token_id=tg_tokenizer.eos_token_id,
    max_new_tokens=300,
    temperature=0.7
)

In [20]:
from langchain.chains import LLMChain
from langchain.llms import HuggingFacePipeline

llm=HuggingFacePipeline(pipeline=tg_pipe, model_kwargs={'temperature':0.7})


llm_chain = LLMChain(llm=llm, prompt=prompt_template)
no_context_response = llm_chain.predict(context="", question=question)
print(no_context_response)

According to the latest crime statistics released by the UK Home Office, crime rates in England and Wales have been steadily decreasing over the past few years. In 2023, the overall crime rate decreased by 3.9% compared to the previous year, with some categories of crime experiencing even larger reductions. For example, violent crime fell by 5.3% and burglary dropped by 7.7%. These trends are consistent with the ongoing efforts of the UK Home Office to drive down crime through a combination of law enforcement and crime prevention strategies.


This answer gives us some relevant information about crime in 2023, but it doesn't describe the concrete activities used to drive down crime in 2023. To improve it, we pass some context. Below is an extract from the [Home Office Annual Report and Accounts 2022-2023](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1185849/Home_Office_Annual_Report_and_Accounts_22-23.pdf).

In [24]:
context = """Overall crime, excluding fraud and computer crime which has only been counted since
2016, is now down by 54% since 2010, with burglary down 55%, robbery down 77%,
violence down 46%, theft down 47%, neighbourhood crime down 51%, and criminal
damage down 72%. We are rolling out our ambitious programme, Operation Soteria, to
transform rape investigations and prosecutions and have brought in a big package of
measures for domestic abuse victims. Our new Fraud Strategy will mean this wicked crime
is treated like the epidemic it is.
We received the Third Volume of the Manchester Arena Inquiry report and published the
Terrorism (Protection of Premises) draft Bill. This is also known as Martyn’s Law. It will
place on public places a greater duty to protect their visitors.
Our efforts to drive down crime have been boosted by recruiting 20,951 additional police
officers by March 2023, exceeding our manifesto commitment to recruit an additional
20,000 by this date. This brings the total number of police officers in England and Wales to
149,566 in March 2023 – the highest on record against a previous peak of 146,030 in
2010. Now the police need to ensure they focus on getting the basics right: the highest
professional standards and a relentless focus on crime, not politically correct distractions.
This means continuing to smash county lines gangs, using proven methods like stop and
search, and deploying officers to high-crime areas. The Anti-Social Behaviour Action Plan
reflects the fact that there is no such thing as petty crime and that it is easy for areas to
slip into degeneracy and misery"""

In [30]:
context_response = llm_chain.predict(context=context, question=question)
print(context_response)

The UK Home Office is driving down crime in 2023 through various initiatives and measures. They have seen a 54% reduction in overall crime, excluding fraud and computer crime, since 2010, with specific categories such as burglary, robbery, violence, theft, and criminal damage showing significant decreases. Additionally, they have launched Operation Soteria to transform rape investigations and prosecutions, and have introduced a new Fraud Strategy to tackle this issue. The Home Office has also recruited 20,951 additional police officers by March 2023, exceeding their manifesto commitment, and is focusing on ensuring the police prioritize professional standards and a relentless focus on crime, rather than politically correct distractions. These efforts include continuing to disrupt county lines gangs, using proven methods like stop and search, and deploying officers to high-crime areas.


## 04. RAG question answering with Llama 2 7B chat, LangChain and Pinecone


In the above response, the model provides an answer with data from 2023 based on the context we provided. Next we want to scale this approach using __Retrieval Augmented Generation (RAG)__.
With RAG, we ingest external data into a knowledge base and augment the prompt with only the passages that are relevant to the question.

First, we download the external files we want to store in our knowledge base to local storage so that we can quickly iterate if needed. We use reports published by the UK Home Office in 2023.

In [31]:
from urllib.request import urlretrieve
files = [
    "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1185849/Home_Office_Annual_Report_and_Accounts_22-23.pdf",
    "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1185949/2023-9-18_Equality_Impact_Assessment_for_Autumn_2023_fee_increases_FINAL.pdf",
]

os.makedirs("data", exist_ok=True)

for url in files:
    file_path = os.path.join("data", url.rpartition("/")[2])
    urlretrieve(url, file_path)

After that, we load the files and split them into smaller documents (chunks).

In [145]:
import numpy as np
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("./data/")

documents = loader.load()

text_splitter=RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=5,
)
docs = text_splitter.split_documents(documents)
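
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch that uses a fixed-width sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and spaces); `sliding_window_chunks` is a hypothetical helper and the sample text is made up.

```python
# Simplified illustration of chunk_size / chunk_overlap with a fixed-width
# sliding window. NOT the algorithm used by RecursiveCharacterTextSplitter.
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample_text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample_text, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last two characters of the previous one. With the settings used in this notebook (1200 characters, overlap of 5), neighbouring chunks share very little text, so a sentence that straddles a boundary may be split across two chunks.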

Next, we generate the embeddings for the documents. For that we will use the [bge-small-en](https://huggingface.co/BAAI/bge-small-en) model. We use Hugging Face transformers to download it, save it to a local directory, and load it in memory.

In [33]:
from langchain.embeddings import HuggingFaceBgeEmbeddings

em_model_name = "BAAI/bge-small-en"
em_model_path = f"./em-model"

In [88]:
from transformers import AutoModel

# Load model from HuggingFace Hub
em_model = AutoModel.from_pretrained(
    em_model_name,
    torch_dtype=torch.float32
)

em_tokenizer = AutoTokenizer.from_pretrained(em_model_name, device_map="cuda")

# save model to disk
em_tokenizer.save_pretrained(
    save_directory=f"{em_model_path}/model", 
    from_pt=True
)

em_model.save_pretrained(
    save_directory=f"{em_model_path}/model", 
    from_pt=True
)
em_model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 384, padding_idx=0)
    (position_embeddings): Embedding(512, 384)
    (token_type_embeddings): Embedding(2, 384)
    (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=384, out_features=384, bias=True)
            (key): Linear(in_features=384, out_features=384, bias=True)
            (value): Linear(in_features=384, out_features=384, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=384, out_features=384, bias=True)
            (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [127]:
# Tokenize sentences
def tokenize_text(_input, device):
    return em_tokenizer(
        [_input], 
        padding=True, 
        truncation=True, 
        return_tensors='pt'
    ).to(device)

# Generate a sentence embedding for the input text with the embedding model
def embedding_generator(_input, normalize=True):
    # Compute token embeddings
    with torch.no_grad():
        embedded_output = em_model(
            **tokenize_text(
                _input, 
                em_model.device
            )
        )
        sentence_embeddings = embedded_output[0][:, 0]
        # normalize embeddings
        if normalize:
            sentence_embeddings = torch.nn.functional.normalize(
                sentence_embeddings, 
                p=2, 
                dim=1
            )
    
    return sentence_embeddings[0, :].tolist()

In [128]:
sample_sentence_embedding = embedding_generator(docs[0].page_content)
print(f"Embedding size of the document --->", len(sample_sentence_embedding))

Embedding size of the document ---> 384
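
Because `embedding_generator` L2-normalizes its output by default, the cosine similarity between two embeddings equals their dot product. The sketch below checks this on toy 3-dimensional vectors (made-up values, not real embeddings) using only the standard library.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for a query embedding and a document embedding.
query_vec = [3.0, 4.0, 0.0]
doc_vec = [4.0, 3.0, 0.0]

cos = cosine_similarity(query_vec, doc_vec)
dot_of_normalized = dot(l2_normalize(query_vec), l2_normalize(doc_vec))
print(cos, dot_of_normalized)
```

Both values agree, which is why normalized embeddings pair naturally with a cosine-metric index.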


We are now ready to ingest the embeddings into our vector store. In this notebook we use [Pinecone](https://www.pinecone.io/); however, you can replace the code below with the equivalent for the vector store of your choice.
If you don't have a Pinecone account you can sign up for free to complete this notebook. 

In [95]:
#enter your Pinecone keys
os.environ["PINECONE_API_KEY"] = getpass.getpass("Pinecone API Key:")
os.environ["PINECONE_ENV"] = getpass.getpass("Pinecone Environment:")

Pinecone API Key: ········
Pinecone Environment: ········


In [96]:
#initialize Pinecone
import pinecone
pinecone.init(
    api_key = os.environ["PINECONE_API_KEY"],
    environment = os.environ["PINECONE_ENV"]
)

In Pinecone, we create a new vector search index and ingest the embeddings we created in the previous step. The index dimension must match the output dimension of our embedding model (384 for bge-small-en).

In [120]:
#check if index already exists, if not we create it
index_name = "rag-index"
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=len(sample_sentence_embedding),
        metric='cosine'
    )

In [146]:
#insert the embeddings
from langchain.vectorstores import Pinecone
vector_store = Pinecone.from_documents(
    docs, 
    embedding_generator, 
    index_name=index_name
)

Let's do a quick test to see if the similarity search is working well.

In [147]:
docs = vector_store.similarity_search(question)
print(docs[0].page_content)

society’s response to these crimes.   
In 2022 -23, the Home Office , alongside othe r government departments,  has implemented 
35% of the commitments in both documents. This  activity  includes:   
• Deliver ing two phases of the long -term national behaviour change campaign ‘Enough’, 
which has reached millions across England and Wales   
• Increas ing the size of the Children Affected by Domestic Abuse Fund for specialist


We now have the Llama-2 chat model in memory and the embeddings inserted in our Pinecone index. To improve the responses of the Llama 2 chat model, we bring it all together and implement the RAG architecture with the LangChain [RetrievalQA](https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa) chain. RetrievalQA augments our initial prompt with the most similar documents from the vector store.

In [156]:
from langchain.chains import RetrievalQA

from langchain.prompts import PromptTemplate

llm_qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt_template}
)
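
Conceptually, the `stuff` chain type places all retrieved chunks into a single prompt. The following is a simplified sketch of that idea, not LangChain's implementation: `stuff_prompt`, the toy template, and the sample chunks are hypothetical, and the real chain also handles document formatting and the call to the LLM.

```python
# Simplified sketch of the "stuff" strategy: join the retrieved chunks and
# substitute them into the prompt. Names and sample data are hypothetical.
def stuff_prompt(template: str, retrieved_chunks: list[str], question: str) -> str:
    context = "\n\n".join(retrieved_chunks)  # all chunks go into one context block
    return template.format(context=context, question=question)


toy_template = "Context:\n{context}\n\nQuestion: {question}"
toy_chunks = ["Chunk one text.", "Chunk two text.", "Chunk three text."]

stuffed = stuff_prompt(toy_template, toy_chunks, "What do the chunks say?")
print(stuffed)
```

With `k=3`, three chunks of up to 1200 characters each are added to every prompt, so the combined prompt needs to stay within the model's context window.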

And that's it! Let's ask the model again to see whether we now get 2023 data.

In [152]:
import textwrap
#helper method to improve the readability of the response
def print_response(llm_response):
    temp = [textwrap.fill(line, width=100) for line in llm_response['result'].split('\n')]
    response = '\n'.join(temp)
    print(f"{llm_response['query']}\n \n{response}'\n \n Source Documents:")
    for source in llm_response["source_documents"]:
        print(source.metadata)

In [157]:
print_response(llm_qa_chain(question))

How is UK home office driving down crime in 2023?
 
In 2023, the UK Home Office is driving down crime through various initiatives and activities,
including:
1. Recruiting additional 20,000 police officers across England and Wales by March 2023 to enhance
the police force's capacity to respond to crime.
2. Providing the police with the necessary resources and tools to tackle the evolving profile of
crime.
3. Reducing serious violence, homicide, and neighborhood crime by 20% from December 2019 levels by
December 2023.
4. Reducing drug-related crime over the next three years.
The Home Office is also monitoring and analyzing the risks associated with higher levels of
inflation and increasing cost of living, which may impact crime rates in the next twelve months,
particularly neighborhood crime and domestic abuse.'
 
 Source Documents:
{'page': 36.0, 'source': 'data/Home_Office_Annual_Report_and_Accounts_22-23.pdf'}
{'page': 81.0, 'source': 'data/Home_Office_Annual_Report_and_Accounts_22-23

The model returns a more informed response with details from 2023 and the pages in the documents from where it acquired the information. 

Let's try another question. The answer to this one is in a different document.

In [158]:
print_response(llm_qa_chain(question2))

How are the British nationality fees changing in 2023
 
Based on the retrieved context, here is the answer to the query:
The British nationality fees are proposed to increase in Autumn 2023, with a 20% increase to the
Leave to Remain fee. The exact amount of the increase will be laid out in further Regulations in due
course.'
 
 Source Documents:
{'page': 1.0, 'source': 'data/2023-9-18_Equality_Impact_Assessment_for_Autumn_2023_fee_increases_FINAL.pdf'}
{'page': 153.0, 'source': 'data/Home_Office_Annual_Report_and_Accounts_22-23.pdf'}
{'page': 1.0, 'source': 'data/2023-9-18_Equality_Impact_Assessment_for_Autumn_2023_fee_increases_FINAL.pdf'}


We can continue our experimentation with more files, different model parameters, and different questions. Once we have sufficient confidence in our approach, we can deploy our models to Amazon SageMaker.

## 04. Supercharge your applications with GenAI by deploying your models to Amazon SageMaker

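First we import the required libraries and retrieve the IAM role and session we will use for deployment. To deploy a model to a SageMaker endpoint, we need to package the model artifacts into a tar.gz file and upload it to Amazon S3. For the text generation model below, the weights are uploaded to S3 uncompressed, and the archive only carries a `serving.properties` file that points to them through `option.s3url`.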

### 04a. Deploy Text Generation Model

In [159]:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
from sagemaker import image_uris
from sagemaker.model import Model
from sagemaker import serializers, deserializers

role = sagemaker.get_execution_role()
sess = sagemaker.Session()
region = sagemaker.Session().boto_region_name
bucket = sess.default_bucket() # Set a default S3 bucket
sm_client = boto3.client('sagemaker', region_name=region)
smr_client = boto3.client("sagemaker-runtime")
prefix = 'qa-rag-models-test/rag-blog'

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [160]:
pretrained_model_location = f"s3://{bucket}/{prefix}/llama-2-7B-chat"

In [161]:
llm_path = sagemaker.s3.S3Uploader.upload(tg_model_path, pretrained_model_location)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [162]:
djl_properties_filename = "serving.properties"

In [163]:
%%writefile {djl_properties_filename}
engine = MPI
option.tensor_parallel_degree = 1
option.rolling_batch = auto
option.max_rolling_batch_size = 64
option.model_loading_timeout = 3600
option.paged_attention = true
option.trust_remote_code = true
option.dtype = fp16
option.rolling_batch=lmi-dist
option.max_rolling_batch_prefill_tokens=1560

Writing serving.properties
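
Note that the file above sets `option.rolling_batch` twice (`auto`, then `lmi-dist`). Which value takes effect depends on how the file is parsed, so it may be worth checking a properties file for repeated keys before packaging it. The sketch below is a small standalone check; `find_duplicate_keys` is a hypothetical helper, not part of DJL Serving, and the sample text is an abbreviated copy of the file.

```python
from collections import Counter


def find_duplicate_keys(properties_text: str) -> list[str]:
    """Return the keys that appear more than once in a .properties-style text."""
    keys = []
    for line in properties_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without a key/value separator.
        if not line or line.startswith("#") or "=" not in line:
            continue
        keys.append(line.split("=", 1)[0].strip())
    return [key for key, count in Counter(keys).items() if count > 1]


sample_properties = """\
engine = MPI
option.rolling_batch = auto
option.dtype = fp16
option.rolling_batch=lmi-dist
"""

duplicates = find_duplicate_keys(sample_properties)
print(duplicates)
```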


In [164]:
!echo -n "option.s3url = $pretrained_model_location" >> {djl_properties_filename}

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [165]:
modelfile_base_name = f"local-{tg_model_id.replace('/', '-')}"

In [166]:
!mkdir {modelfile_base_name}
!mv serving.properties {modelfile_base_name}/
!tar czvf {modelfile_base_name}.tar.gz {modelfile_base_name}/
!rm -rf {modelfile_base_name}

local-meta-llama-Llama-2-7b-chat-hf/
local-meta-llama-Llama-2-7b-chat-hf/serving.properties

In [167]:
# list out the contents of the tar gz file for validation
!tar -ztvf {modelfile_base_name}.tar.gz

drwxr-xr-x root/root         0 2023-10-14 11:23 local-meta-llama-Llama-2-7b-chat-hf/
-rw-r--r-- root/root       399 2023-10-14 11:23 local-meta-llama-Llama-2-7b-chat-hf/serving.properties
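
The packaging steps above use shell commands. As an alternative sketch, the same kind of archive can be built and inspected with Python's standard `tarfile` module. `package_directory` and the directory names are hypothetical, and the example runs on a throwaway temporary directory rather than the notebook's files.

```python
import os
import tarfile
import tempfile


def package_directory(source_dir: str, archive_path: str) -> list[str]:
    """Create a gzip-compressed tar archive of source_dir and return its member names."""
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps a single top-level folder inside the archive
        tar.add(source_dir, arcname=top_level)
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Demonstrate on a temporary directory with a placeholder serving.properties.
with tempfile.TemporaryDirectory() as workdir:
    model_dir = os.path.join(workdir, "my-model")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "serving.properties"), "w") as f:
        f.write("engine = MPI\n")
    members = package_directory(model_dir, os.path.join(workdir, "my-model.tar.gz"))

print(members)
```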


In [168]:
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.23.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118


In [169]:
# Upload file and instantiate a new SageMaker Model
s3_code_prefix = "large-model-lmi/artifacts"

code_artifact = sess.upload_data(f"{modelfile_base_name}.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-477886989750/large-model-lmi/artifacts/local-meta-llama-Llama-2-7b-chat-hf.tar.gz


In [170]:
llama2_model_name = sagemaker.utils.name_from_base(
    f"{tg_model_id.replace('/', '-')}"
)

tg_sm_model = Model(
    sagemaker_session=sess,
    image_uri=inference_image_uri,
    model_data=code_artifact,
    role=role,
    name=llama2_model_name,
)

In [None]:
instance_type = "ml.g5.2xlarge"
endpoint_name = f"ep-{llama2_model_name}"

tg_sm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900,
    wait=False, # <-- Set to True, if you would prefer to wait for the endpoint to spin up
)

In [175]:
print(f"Endpoint name to use ---> {tg_sm_model.endpoint_name}")

Endpoint name to use ---> ep-meta-llama-Llama-2-7b-chat-hf-2023-10-14-11-23-22-908


In [176]:
predictor = sagemaker.Predictor(
    endpoint_name=tg_sm_model.endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

In [177]:
predictor.predict(
    {
        "inputs": "Who is the president of Brazil?",
        "parameters": {"temperature": 0.1, "max_new_tokens": 50}
    }
)

{'generated_text': '\n\nThe current president of Brazil is Jair Bolsonaro. He was inaugurated on January 1, 2019, and is serving a four-year term as the 38th president of Brazil. Prior to his'}

### 04b. Deploy Embedding Model

In [178]:
%%writefile {em_model_path}/model.py
from djl_python import Input, Output
import os
import torch
from transformers import (
    AutoModel, 
    AutoTokenizer
)
from typing import Any, Dict, Tuple
import deepspeed
import warnings
import tarfile

model, tokenizer = None, None
model_dir = "./model/"


def get_model(properties):
    
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    
    print(f"Loading model from {model_dir}")
    model = AutoModel.from_pretrained(
        model_dir
    )
    
    model = deepspeed.init_inference(
        model,
        mp_size=properties["tensor_parallel_degree"]
    )
    
    print(f"Loading tokenizer from {model_dir}")
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    
    return model, tokenizer


def handle(inputs: Input) -> None:
    global model, tokenizer
    
    if not model:
        model, tokenizer = get_model(inputs.get_properties())

    if inputs.is_empty():
        return None
    
    data = inputs.get_as_json()
    text = data["text"]
    
    input_tokenized = tokenizer(
        [text], 
        padding=True, 
        truncation=True, 
        return_tensors='pt'
    ).to("cuda")
    
    outputs = model(**input_tokenized)
    
    sentence_embeddings = outputs[0][:, 0]
    
    # normalize embeddings
    sentence_embeddings = torch.nn.functional.normalize(
        sentence_embeddings, 
        p=2, 
        dim=1
    )
    sentence_embeddings = sentence_embeddings[0, :].tolist()
    
    result = {"outputs": sentence_embeddings}
    
    return Output().add(result)

Writing ./em-model/model.py


In [179]:
%%writefile {em_model_path}/requirements.txt
einops
git+https://github.com/lanking520/DeepSpeed.git@falcon

Writing ./em-model/requirements.txt


In [180]:
%%writefile {em_model_path}/serving.properties
engine=DeepSpeed
option.tensor_parallel_degree=1

Writing ./em-model/serving.properties


In [181]:
!rm embeddings-model.tar.gz
!rm -rf {em_model_path}/.ipynb_checkpoints
!cd {em_model_path} && tar -czvf ../embeddings-model.tar.gz ./

rm: cannot remove 'embeddings-model.tar.gz': No such file or directory
./
./model.py
./requirements.txt
./model/
./model/pytorch_model.bin
./model/toke

In [182]:
!tar -tzvf embeddings-model.tar.gz

drwxr-xr-x root/root         0 2023-10-14 11:49 ./
-rw-r--r-- root/root      1521 2023-10-14 11:49 ./model.py
-rw-r--r-- root/root        62 2023-10-14 11:49 ./requirements.txt
drwxr-xr-x root/root         0 2023-10-14 09:35 ./model/
-rw-r--r-- root/root 133503977 2023-10-14 10:11 ./model/pytorch_model.bin
-rw-r--r-- root/root       390 2023-10-14 10:11 ./model/tokenizer_config.json
-rw-r--r-- root/root       701 2023-10-14 10:11 ./model/config.json
-rw-r--r-- root/root       125 2023-10-14 10:11 ./model/special_tokens_map.json
-rw-r--r-- root/root    231508 2023-10-14 10:11 ./model/vocab.txt
-rw-r--r-- root/root    711396 2023-10-14 10:11 ./model/tokenizer.json
-rw-r--r-- root/root        49 2023-10-14 11:4

In [183]:
embedding_inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.23.0"
)
print(f"Image going to be used is ---- > {embedding_inference_image_uri}")

Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118


In [184]:
embedded_code_artifact = sess.upload_data("embeddings-model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {embedded_code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-477886989750/large-model-lmi/artifacts/embeddings-model.tar.gz


In [185]:
embedding_model_name = sagemaker.utils.name_from_base(em_model_name.replace("/", "-"))

em_sm_model = Model(
    sagemaker_session=sess,
    image_uri=embedding_inference_image_uri,
    model_data=embedded_code_artifact,
    role=role,
    name=embedding_model_name,
)
print(f"Creating a new model ---> {em_sm_model.name}")

Creating a new model ---> BAAI-bge-small-en-2023-10-14-11-50-19-212


In [186]:
embedding_instance_type = "ml.g5.2xlarge"

em_sm_model.deploy(
    initial_instance_count=1,
    instance_type=embedding_instance_type,
    endpoint_name=f"ep-{embedding_model_name}",
    container_startup_health_check_timeout=900,
    wait=False,  # return immediately; the endpoint keeps creating in the background
)
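
Because `deploy` is called with `wait=False`, the call returns before the endpoint can serve requests. The following is a minimal sketch of a polling helper, assuming a boto3 SageMaker client such as the `sm_client` used in the clean-up section. The helper name `wait_for_endpoint` and its polling defaults are illustrative and not part of the original notebook.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll DescribeEndpoint until the endpoint is InService (hypothetical helper)."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        # describe_endpoint reports the current status, e.g. Creating, InService, Failed
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Endpoint {endpoint_name} is still {status} after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)
```

In the notebook this could be called as `wait_for_endpoint(sm_client, em_sm_model.endpoint_name)` before the endpoint is first invoked. boto3 also ships a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which may be preferable when custom error handling is not needed.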

## 05. Run LangChain Inference using SageMaker Endpoint

In [187]:
import json
from typing import Dict
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.llms import SagemakerEndpoint

In [188]:
class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        body = {
            "inputs": prompt, 
            "parameters": model_kwargs
        }
        input_str = json.dumps(body)
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # LangChain passes the endpoint's streaming response body, so read it before decoding.
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["generated_text"].strip()
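
To make the request and response shapes concrete, here is a small standalone sketch that mirrors the two handler methods as plain functions and exercises them without calling an endpoint. The function names `build_request` and `parse_response`, the sample prompt, and the fake response are illustrative assumptions; the `generated_text` key is taken from the handler above.

```python
import io
import json


def build_request(prompt, model_kwargs):
    """Mirror of transform_input: serialize the prompt and parameters to JSON bytes."""
    body = {"inputs": prompt, "parameters": model_kwargs}
    return json.dumps(body).encode("utf-8")


def parse_response(stream):
    """Mirror of transform_output: read the body stream and extract the generated text."""
    response_json = json.loads(stream.read().decode("utf-8"))
    return response_json["generated_text"].strip()


# Build the payload that would be sent to the endpoint.
request = build_request("What is RAG?", {"temperature": 0.05, "max_new_tokens": 512})
print(request.decode("utf-8"))

# Simulate the endpoint's streaming body with an in-memory byte stream.
fake_body = io.BytesIO(json.dumps({"generated_text": "  An example answer.  "}).encode("utf-8"))
print(parse_response(fake_body))
```

Checking the handler logic this way helps separate payload-format problems from endpoint or permission problems.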

In [189]:
content_handler = ContentHandler()

In [190]:
# Wrap the deployed SageMaker text generation endpoint as a LangChain LLM
llm_sm_ep = SagemakerEndpoint(
    endpoint_name=tg_sm_model.endpoint_name, # <--- Your endpoint name
    region_name=region,
    model_kwargs={
        "temperature": 0.05, 
        "max_new_tokens": 512
    },
    content_handler=content_handler,
)

In [191]:
llm_qa_smep_chain = RetrievalQA.from_chain_type(
    llm=llm_sm_ep,
    chain_type='stuff',
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt_template}
)

In [192]:
print_response(llm_qa_smep_chain(question))

How is UK home office driving down crime in 2023?
 
The UK Home Office is driving down crime in 2023 through various initiatives and strategies outlined
in its Performance Overview and Outcome Delivery Plan. These include:
1. Recruiting additional police officers across England and Wales to improve police response and
resource allocation.
2. Providing the police with the necessary tools and resources to tackle evolving crime profiles.
3. Reducing serious violence, homicide, and neighborhood crime by 20% from December 2019 levels by
December 2023.
4. Reducing drug-related crime over the next three years.
The Home Office has also identified risks associated with higher levels of inflation and increasing
cost of living, which may lead to an increase in crime over the next twelve months, particularly
neighborhood crime and domestic abuse.'
 
 Source Documents:
{'page': 36.0, 'source': 'data/Home_Office_Annual_Report_and_Accounts_22-23.pdf'}
{'page': 81.0, 'source': 'data/Home_Office_Annual

In [193]:
print_response(llm_qa_smep_chain(question2))

How are the British nationality fees changing in 2023
 
Based on the provided context, here is the answer to the query:
The British nationality fees are proposed to change in 2023, with a 20% increase to the Leave to
Remain fee. The exact amount of the increase will be laid out in further Regulations in due course.'
 
 Source Documents:
{'page': 1.0, 'source': 'data/2023-9-18_Equality_Impact_Assessment_for_Autumn_2023_fee_increases_FINAL.pdf'}
{'page': 153.0, 'source': 'data/Home_Office_Annual_Report_and_Accounts_22-23.pdf'}
{'page': 1.0, 'source': 'data/2023-9-18_Equality_Impact_Assessment_for_Autumn_2023_fee_increases_FINAL.pdf'}


### 05a. Invoke the Embedding Endpoint for Inference

This section shows you how to invoke your custom embedding endpoint for inference.  

In [194]:
response_model = smr_client.invoke_endpoint(
    EndpointName=em_sm_model.endpoint_name,
    Body=json.dumps({
        "text": "This is a sample text"
    }),
    ContentType="application/json",
)

outputs = json.loads(response_model["Body"].read().decode("utf8"))['outputs']

In [195]:
print(f"Sample embeddings ---> {outputs[:10]}")

Sample embeddings ---> [-0.0482177734375, -0.0160675048828125, 0.00830841064453125, -0.04095458984375, 0.0013589859008789062, 0.014678955078125, 0.0142822265625, 0.051422119140625, 0.0207366943359375, -0.002231597900390625]
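
As a rough sketch of how the endpoint could be used beyond a single call, the helper below wraps the invocation shown above and compares two embeddings with cosine similarity. The names `embed_text` and `cosine_similarity` are hypothetical; the request body (`text`) and response key (`outputs`) are the ones used in this section.

```python
import json
import math


def embed_text(smr_client, endpoint_name, text):
    """Invoke the embedding endpoint and return the embedding as a list of floats."""
    response = smr_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=json.dumps({"text": text}),
        ContentType="application/json",
    )
    return json.loads(response["Body"].read().decode("utf8"))["outputs"]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

With the notebook's client, `cosine_similarity(embed_text(smr_client, em_sm_model.endpoint_name, "crime rates"), embed_text(smr_client, em_sm_model.endpoint_name, "policing"))` would give a quick sanity check that related texts score higher than unrelated ones.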


## 06. Clean Up Resources

In [None]:
# delete your text generation endpoint
sm_client.delete_endpoint(
    EndpointName=tg_sm_model.endpoint_name
)

In [None]:
# delete your text embedding endpoint
sm_client.delete_endpoint(
    EndpointName=em_sm_model.endpoint_name
)
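
Deleting an endpoint stops the instance charges, but the endpoint configuration and the model objects stay registered in SageMaker. A minimal sketch of a fuller clean-up follows, assuming the same boto3 `sm_client`; the helper name `delete_endpoint_resources` is hypothetical, and it should be run instead of the two cells above, since it looks the endpoint up before deleting it.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint together with its endpoint config and models (hypothetical helper)."""
    # Look up the dependent resources before deleting the endpoint itself.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
    return config_name, model_names
```

For example, `delete_endpoint_resources(sm_client, tg_sm_model.endpoint_name)` would remove the text generation endpoint and its associated resources in one call. The model artifacts uploaded to S3 are not touched and need to be removed separately if they are no longer required.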