This notebook was tested in SageMaker Studio with a Data Science 3.0 image on a ml.g5.xlarge instance.

## 01. Set-up

Install the required libriaries

In [2]:
%%writefile requirements.txt
sagemaker
torch==2.0.1
git+https://github.com/huggingface/transformers.git
accelerate==0.21.0
datasets==2.13.0
langchain==0.0.297
pypdf>=3.8,<4
pinecone-client
sentence_transformers
safetensors>=0.3.3
bitsandbytes==0.40.2
jinja2

Overwriting requirements.txt


In [3]:
!pip install -U -r requirements.txt

Collecting git+https://github.com/huggingface/transformers.git (from -r requirements.txt (line 3))
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-mm3kyjqj
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-mm3kyjqj
  Resolved https://github.com/huggingface/transformers.git to commit 9ed538f2e67ee10323d96c97284cf83d44f0c507
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
Collecting sagemaker (from -r requirements.txt (line 1))
  Downloading sagemaker-2.188.0.tar.gz (892 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m892.2/892.2 kB[0m [31m8.6 MB/s[0m eta [36m0:00:00[0m00:01[0m
[?25h  Preparing metadata (setup.py) ... [?25ldone
[?25hCollecting torch==2.0.1 (from -r requirements.txt (line 2))
  Using cached torch-2.0.1-cp310-cp310-manylinux1_x8

The below is only needed if using bits and bytes

In [5]:
import torch
import os
import nvidia

if torch.cuda.is_available():
    cuda_install_dir = '/'.join(nvidia.__file__.split('/')[:-1]) + '/cuda_runtime/lib/'
    os.environ['LD_LIBRARY_PATH'] =  cuda_install_dir

## 02. Load Llama-2 7B chat in the notebook for experimentation

First, let's download the Llama-2-7b-chat-hf model from the Hugging Face Hub. Llama 2 models are gated, to get access follow the instructions [here](https://huggingface.co/meta-llama/Llama-2-7b-hf) 

In [6]:
from transformers import AutoTokenizer, LlamaTokenizer, LlamaForCausalLM, GenerationConfig,AutoModelForCausalLM
import transformers

model_id = "meta-llama/Llama-2-7b-chat-hf" #the model id in Hugging Face
model_path = f"./model/{model_id}" #the local directory where the model will be saved

access_token = "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"

In [7]:
from transformers import BitsAndBytesConfig

if  not (os.path.exists(model_path)) or os.listdir(model_path and model_path) == []:
    quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=10.0)
    
    model = AutoModelForCausalLM.from_pretrained(
        model_id, 
        token=access_token, 
        device_map="auto", 
        do_sample=True, 
        use_safetensors=True, 
    #    quantization_config=quantization_config ,
        torch_dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)

    model.save_pretrained(save_directory=model_path, from_pt=True)
    tokenizer.save_pretrained(save_directory=model_path, from_pt=True)
else:
    model = LlamaForCausalLM.from_pretrained(model_path,device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_path)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



Check memory consumption

In [8]:
print("Memory allocated: %fGB"%(torch.cuda.memory_allocated(0)/1024/1024/1024))
print("Memory reserved  %fGB"%(torch.cuda.memory_reserved(0)/1024/1024/1024))
print("Max memory reserved: %fGB"%(torch.cuda.max_memory_reserved(0)/1024/1024/1024))

Memory allocated: 12.613792GB
Memory reserved  12.615234GB
Max memory reserved: 12.615234GB


## 03. Simple question-answering using Llama 2 7B chat and LangChain

Now that the model is available in memory, we can start using it to answer questions. The Llama-2 chat models expect the prompt to follow the below format:

    
\<s>[INST] <\<SYS\>>

{{ system_prompt }}

\<<SYS\>>

{{ user_message }} [/INST]

   
where
- \<s> - is the beginning of the sequence.
- <\<SYS>> - is the beginning of the system message.
- \<</SYS\>> - is the end of the system message.
- [INST] - is the beginning of the instructions
- [/INST] - is the end of the instructions

Let's create a recipe based on the above that will helps us define our prompts going forward. For that we will use [PromptTemplate](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/) from langchain.

In [9]:
from langchain import PromptTemplate

template = """<s>[INST] <<SYS>>\nYou are an assistant for question-answering tasks. You are helpful and friendly. Use the following pieces of retrieved context to answer the query. If you don't know the answer, you just say I don't know. Use three sentences maximum and keep the answer concise.
<<SYS>>\n
{context}\n
{question} [/INST]
"""
prompt_template = PromptTemplate(template=template, input_variables=['context','question'])


Next, we test the model on some questions without providing any context. For our tests, we will use questions about the UK Home Office activities in 2023.

In [10]:
question = "How is UK home office driving down crime in 2023?"
question2= "How are the nationality fees changing in 2023"

In [11]:
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
tokenizer.padding_side = "left"

pipe = transformers.pipeline(
    task='text-generation',
    model=model, 
    tokenizer=tokenizer,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=400,
    temperature=0.7
)

In [12]:
from langchain.chains import LLMChain
from langchain.llms import HuggingFacePipeline

llm=HuggingFacePipeline(pipeline=pipe, model_kwargs={'temperature':0.7})


llm_chain = LLMChain(llm=llm, prompt=prompt_template)
llm_chain.predict(context="", question=question)

'According to the latest crime statistics released by the UK Home Office, crime in England and Wales has decreased by 8% compared to 2022. This is largely attributed to the ongoing efforts of the police to tackle crime through proactive policing strategies, as well as the introduction of new technologies and initiatives to improve crime detection and prevention. Additionally, the Home Office has implemented various measures to support victims of crime and improve the criminal justice system as a whole.'

Although this answer is inline with what was asked it doesn't provide a lot of detail on the specific activities that took place in 2023. To improve it we pass some context. The below is an extract from  the [Home Office Annual Report and Accounts 2022-2023](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1185849/Home_Office_Annual_Report_and_Accounts_22-23.pdf)

In [13]:
context = """Overall crime, excluding fraud and computer crime which has only been counted since
2016, is now down by 54% since 2010, with burglary down 55%, robbery down 77%,
violence down 46%, theft down 47%, neighbourhood crime down 51%, and criminal
damage down 72%. We are rolling out our ambitious programme, Operation Soteria, to
transform rape investigations and prosecutions and have brought in a big package of
measures for domestic abuse victims. Our new Fraud Strategy will mean this wicked crime
is treated like the epidemic it is.
We received the Third Volume of the Manchester Arena Inquiry report and published the
Terrorism (Protection of Premises) draft Bill. This is also known as Martyn’s Law. It will
place on public places a greater duty to protect their visitors.
Our efforts to drive down crime have been boosted by recruiting 20,951 additional police
officers by March 2023, exceeding our manifesto commitment to recruit an additional
20,000 by this date. This brings the total number of police officers in England and Wales to
149,566 in March 2023 – the highest on record against a previous peak of 146,030 in
2010. Now the police need to ensure they focus on getting the basics right: the highest
professional standards and a relentless focus on crime, not politically correct distractions.
This means continuing to smash county lines gangs, using proven methods like stop and
search, and deploying officers to high-crime areas. The Anti-Social Behaviour Action Plan
reflects the fact that there is no such thing as petty crime and that it is easy for areas to
slip into degeneracy and misery"""

In [14]:
from langchain.chains import LLMChain
from langchain.llms import HuggingFacePipeline

llm=HuggingFacePipeline(pipeline=pipe, model_kwargs={'temperature':0.7})

llm_chain = LLMChain(llm=llm, prompt=prompt_template)
llm_chain.predict(context=context, question=question)


'The UK Home Office is driving down crime in 2023 through various initiatives, including the rollout of Operation Soteria to transform rape investigations and prosecutions, a new Fraud Strategy to tackle this "epidemic," and the recruitment of 20,951 additional police officers by March 2023, exceeding their manifesto commitment. They are also focusing on getting the basics right by ensuring police officers maintain high professional standards and focus on crime, rather than politically correct distractions. This includes continuing to disrupt county lines gangs, using proven methods like stop and search, and deploying officers to high-crime areas.'

## 04. RAG question answering with Llama 2 7B chat, LangChain and Pinecone


In the above response, the model provides an answer with data from 2023 based on the context we provided. Next we want to scale this approach using __Retrieval Augmented Generation (RAG)__.
With RAG, we will ingest external data into our knowledge base and augment the prompt by adding only the data that is relevant to the context.

First we download the external files we want to store in our knowledge base locally so than we can quickly iterate if needed. We will use reports published by the UK Home office in 2023.

In [15]:
from urllib.request import urlretrieve
files = [
    "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1185849/Home_Office_Annual_Report_and_Accounts_22-23.pdf",
    "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1185949/2023-9-18_Equality_Impact_Assessment_for_Autumn_2023_fee_increases_FINAL.pdf",
]

os.makedirs("data", exist_ok=True)

for url in files:
    file_path = os.path.join("data", url.rpartition("/")[2])
    urlretrieve(url, file_path)

After that, we split files into documents

In [16]:
import numpy as np
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("./data/")

documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap  = 50,
)
docs = text_splitter.split_documents(documents)

Next, we generate the embeddings for the documents. For that we will use the [bge-small-en](https://huggingface.co/BAAI/bge-small-en) model. We can do that easily in our notebook through the HuggingFaceBgeEmbeddings class in langchain

In [17]:
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'normalize_embeddings': False}

embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

The dimension for our embeddings model is 384

In [20]:
#optionally take a look at one of the embeddings
sample_embedding = np.array(embeddings.embed_query(docs[0].page_content))
print("Size of the embedding: ", sample_embedding.shape[0])

Size of the embedding:  384


We are now ready to ingest the embeddings into our vector store. In this notebook we will use [Pinecone](https://www.pinecone.io/), however you can replace the below code with that for the vector store of your choice.
If you don't have a Pinecone account you can sign up for free to complete this notebook. 

In [21]:
#enter your Pinecone keys
import getpass
os.environ["PINECONE_API_KEY"] = getpass.getpass("Pinecone API Key:")
os.environ["PINECONE_ENV"] = getpass.getpass("Pinecone Environment:")

Pinecone API Key: ········
Pinecone Environment: ········


In [22]:
#initialize Pinecone
import pinecone
pinecone.init(
    api_key = os.environ["PINECONE_API_KEY"],
    environment = os.environ["PINECONE_ENV"]
)

In Pinecone, we create a new vector search index and ingest the embeddings we created in the previous step. The size of the index is the dimension of our embeddings model.

In [23]:
#check if index already exists, if not we create it
index_name = "qa-index"
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=384,
        metric='cosine'
    )

In [24]:
#insert the embeddings
from langchain.vectorstores import Pinecone
vector_store = Pinecone.from_documents(docs, embeddings, index_name=index_name)

Let's do a quick test to see if the similarity search is working well.

In [25]:
docs = vector_store.similarity_search(question)
print(docs[1].page_content)

£140  million to support victims. These plans and strategies aim to transform the whole of 
society’s response to these crimes.   
In 2022 -23, the Home Office , alongside othe r government departments,  has implemented 
35% of the commitments in both documents. This  activity  includes:   
• Deliver ing two phases of the long -term national behaviour change campaign ‘Enough’, 
which has reached millions across England and Wales


We have Llama-2 chat model in memory and the embeddings inserted in our Pinecone index. To improve the responses of the Llama 2 chat model we bring it alltogether and implement the RAG architecture easily with the Langchain [RetrievalQA](https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa). RetrievalQA augments our initial prompt with the most similar documents from the vector store.

In [30]:
from langchain.chains import RetrievalQA

from langchain.prompts import PromptTemplate

llm_qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                     chain_type='stuff',
                                     retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
                                     return_source_documents=True,
                                     chain_type_kwargs={"prompt": prompt_template})

And that's it! Let's ask the model again to see if we will get 2023 data

In [31]:
import textwrap

def print_response(llm_response):
    temp = [textwrap.fill(line, width=100) for line in llm_response['result'].split('\n')]
    response = '\n'.join(temp)
    print(f"{llm_response['query']}\n \n{response}'\n \n Source Documents:")
    for source in llm_response["source_documents"]:
        print(source.metadata)

In [28]:
print_response(llm_qa_chain(question))

How is UK home office driving down crime in 2023?
 
The UK Home Office has implemented 35% of the commitments in its plans and strategies to drive down
crime in 2023, including delivering two phases of the long-term national behaviour change campaign
'Enough' which has reached millions across England and Wales. The campaign aims to transform
society's response to crime and support victims. While I don't have access to real-time crime
statistics, these efforts suggest a concerted effort to address crime and improve the justice system
in the UK.'
 
 Source Documents:
{'page': 36.0, 'source': 'data/Home_Office_Annual_Report_and_Accounts_22-23.pdf'}
{'page': 36.0, 'source': 'data/Home_Office_Annual_Report_and_Accounts_22-23.pdf'}
{'page': 36.0, 'source': 'data/Home_Office_Annual_Report_and_Accounts_22-23.pdf'}


The model returns a more informed response with details from 2023 and the pages in the documents from where it acquired the information. 

Let's try another question. The answer to this one is in a different document.

In [29]:
print_response(llm_qa_chain(question2))

How are the nationality fees changing in 2023
 
According to the document, the department is proposing changes to immigration and nationality fees
in Autumn 2023, with the objective of increasing the level of income generated from those fees to
mitigate wider costs. The document does not provide specific information on how nationality fees are
changing in 2023. Therefore, I cannot provide an answer to your question.'
 
 Source Documents:
{'page': 1.0, 'source': 'data/2023-9-18_Equality_Impact_Assessment_for_Autumn_2023_fee_increases_FINAL.pdf'}
{'page': 1.0, 'source': 'data/2023-9-18_Equality_Impact_Assessment_for_Autumn_2023_fee_increases_FINAL.pdf'}
{'page': 1.0, 'source': 'data/2023-9-18_Equality_Impact_Assessment_for_Autumn_2023_fee_increases_FINAL.pdf'}


We can continue our experimentation with more files, different model parameters and different questions. Once we have sufficiet confidence in our approach, 
we can deploy our models to Amazon SageMaker

## 04. Supercharge your applications with GenAI by deploying your models to Amazon SageMaker

First we import the required libraries, and retrieve the IAM role and session we will use for deployment.  To deploy a model to a SageMaker endpoint, we first need to compress the model artifacts and upload the tar.gz file to Amazon S3.

In [88]:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
from sagemaker import image_uris
from sagemaker.model import Model
from sagemaker import serializers, deserializers

role = sagemaker.get_execution_role()
sess = sagemaker.Session()
region = sagemaker.Session().boto_region_name
bucket = sess.default_bucket() # Set a default S3 bucket
prefix = 'qa-rag-models'

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [89]:
pretrained_model_location = f"s3://{bucket}/{prefix}/llama-2-7B-chat"

In [90]:
llm_path = sagemaker.s3.S3Uploader.upload(model_path, pretrained_model_location)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [103]:
%%writefile src/serving.properties
engine = MPI
option.tensor_parallel_degree = 1
option.rolling_batch = auto
option.max_rolling_batch_size = 8
option.model_loading_timeout = 3600
option.model_id = {{pretrained_model_location}}
option.paged_attention = true
option.trust_remote_code = true
option.dtype = fp16

Overwriting src/serving.properties


In [104]:
import jinja2
jinja_env = jinja2.Environment()
from pathlib import Path

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

template = jinja_env.from_string(Path("src/serving.properties").open().read())
Path("src/serving.properties").open("w").write(
    template.render(s3url=pretrained_model_location)
)
!pygmentize src/serving.properties | cat -n

     1	[36mengine[39;49;00m[37m [39;49;00m=[37m [39;49;00m[33mMPI[39;49;00m[37m[39;49;00m
     2	[36moption.tensor_parallel_degree[39;49;00m[37m [39;49;00m=[37m [39;49;00m[33m1[39;49;00m[37m[39;49;00m
     3	[36moption.rolling_batch[39;49;00m[37m [39;49;00m=[37m [39;49;00m[33mauto[39;49;00m[37m[39;49;00m
     4	[36moption.max_rolling_batch_size[39;49;00m[37m [39;49;00m=[37m [39;49;00m[33m8[39;49;00m[37m[39;49;00m
     5	[36moption.model_loading_timeout[39;49;00m[37m [39;49;00m=[37m [39;49;00m[33m3600[39;49;00m[37m[39;49;00m
     6	[36moption.model_id[39;49;00m[37m [39;49;00m=[37m [39;49;00m[37m[39;49;00m
     7	[36moption.paged_attention[39;49;00m[37m [39;49;00m=[37m [39;49;00m[33mtrue[39;49;00m[37m[39;49;00m
     8	[36moption.trust_remote_code[39;49;00m[37m [39;49;00m=[37m [39;49;00m[33mtrue[39;49;00m[37m[39;49;00m
     9	[36moption.dtype[39;49;00m[37m [39;49;00m=[37m [39;49;00m[33mfp16[39;49;00m[37

In [116]:
!rm model.tar.gz
!tar czvf model.tar.gz src

src/
src/.ipynb_checkpoints/
src/.ipynb_checkpoints/serving-checkpoint.properties
src/serving.properties


In [117]:
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, f"{bucket}/{prefix}/code")

In [106]:
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.23.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118


In [118]:
model = Model(image_uri=inference_image_uri, model_data=s3_code_artifact, role=role)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [119]:
endpoint_name = sagemaker.utils.name_from_base("llama-2-7b-chat")
model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge", endpoint_name=endpoint_name, wait=False)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [101]:
predictor = sagemaker.Predictor(
        endpoint_name="llama-2-7b-chat-2023-09-23-13-43-51-480",
        sagemaker_session=sess,
        serializer=serializers.JSONSerializer(),
        deserializer=deserializers.JSONDeserializer(),
    )

In [102]:
predictor.predict({
	"inputs": question,
})

{'generated_text': ''}