Pending items:
1. Ensure Llama2 responses are full sentences
2. Improve the QA results with RAG in the notebook; at the moment it returns a lot of duplicates and misses some important passages

3. Check for any other optimization

## 01. Set-up

Install the required libraries

In [13]:
%%writefile requirements.txt
torch==2.0.1
#transformers==4.31
accelerate==0.21.0
bitsandbytes==0.40.2
accelerate==0.21.0
datasets==2.13.0
langchain==0.0.297
ipywidgets>=7,<8
pypdf>=3.8,<4
git+https://github.com/huggingface/transformers.git
gpt4all 
chromadb
langchainhub
pinecone-client
tiktoken
sentence_transformers

Overwriting requirements.txt


In [14]:
!pip install -U -r requirements.txt

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting git+https://github.com/huggingface/transformers.git (from -r requirements.txt (line 10))
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-bnp1t14i
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-bnp1t14i
  Resolved https://github.com/huggingface/transformers.git to commit e3a4bd2bee212a2d0fd9f03b27fe7bfc1debe42d
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pydantic<3,>=1 (from langchain==0.0.297->-r requirements.txt (line 7))
  Obtaining dependency information for pydantic<3,>=1 from https://files.pythonhosted.org/packages/bc/e0/0371e9b6c910afe502e5fe18cc94562bfd9399617c7b4f5b6e13c29115b3/pydantic-1.10.12-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Using cached pydantic-1.10.12-cp310-cp310-manylinux_2_17_x86_64.manyli

In [15]:
import torch
import os
import nvidia

if not torch.cuda.is_available():
    print("Please use a compute instance with at least 1 GPU of XGB memory to run this notebook")
else:
    cuda_install_dir = '/'.join(nvidia.__file__.split('/')[:-1]) + '/cuda_runtime/lib/'
    os.environ['LD_LIBRARY_PATH'] =  cuda_install_dir

## 02. Prompt Llama-2 in the notebook

In [111]:
from transformers import AutoTokenizer, LlamaForCausalLM, GenerationConfig,AutoModelForCausalLM
import transformers

model_id = "NousResearch/Llama-2-7b-chat-hf" #the model id in Hugging Face
model_path = f"./model/{model_id}" #the local directory where the model will be saved

question = "How is UK home office driving down crime in 2023?"


Download the model from the Hugging Face hub or load it from the local directory if it's already downloaded

In [91]:
if not os.path.exists(model_path) or not os.listdir(model_path):
    model = LlamaForCausalLM.from_pretrained(model_id, device_map="auto", do_sample=True, use_safetensors=True, load_in_8bit=True,torch_dtype=torch.float16)
    model.generation_config = GenerationConfig(
        do_sample=True,
        temperature=0.1,
        top_p=0.75,
        top_k=20,
        max_new_tokens=100,
        return_full_text=True
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    model.save_pretrained(save_directory=model_path)
    tokenizer.save_pretrained(save_directory=model_path)
else:
    model = LlamaForCausalLM.from_pretrained(model_path,device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_path)


Check memory consumption

In [105]:
print("Memory allocated: %fGB"%(torch.cuda.memory_allocated(0)/1024/1024/1024))
print("Memory reserved  %fGB"%(torch.cuda.memory_reserved(0)/1024/1024/1024))
print("Max memory reserved: %fGB"%(torch.cuda.max_memory_reserved(0)/1024/1024/1024))

Memory allocated: 13.467896GB
Memory reserved  14.214844GB
Max memory reserved: 14.214844GB
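
As a rough cross-check of the numbers above, the memory needed for the weights can be estimated as parameter count times bytes per parameter. The sketch below assumes about 6.74 billion parameters for the 7B model (an approximate figure that is not taken from this notebook) and ignores activations, the KV cache and CUDA overhead. `estimate_weight_memory_gib` is a hypothetical helper, not part of any library.

```python
def estimate_weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Return the approximate memory needed to hold the weights, in GiB."""
    return n_params * bytes_per_param / 1024**3


# Assumption: roughly 6.74e9 parameters for Llama-2-7b
N_PARAMS = 6.74e9

for label, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{label}: {estimate_weight_memory_gib(N_PARAMS, nbytes):.2f} GiB")
```

Under these assumptions, float16 weights need about 12.6 GiB, which is in the same range as the allocation reported above, while an 8-bit load would be expected to need roughly half of that.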


Test some prompts

In [102]:
%time

# Tokenize the question defined earlier and move the tensors to the model's device
inputs = tokenizer(question, return_tensors="pt", padding=True).to(model.device)
generated_output = model.generate(
        inputs.input_ids,
        generation_config=model.generation_config
)
# Decode the generated token ids back into text
print(tokenizer.batch_decode(generated_output, skip_special_tokens=True)[0])

CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 11 µs
How is UK home office driving down crime in 2023?

The UK Home Office is driving down crime in 2023 through a variety of initiatives and strategies. Here are some of


In [104]:
%time
pipe = transformers.pipeline(
    task='text-generation',
    model=model, 
    tokenizer=tokenizer,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=100
    
)
response = pipe(question)
print(response[0]["generated_text"])

CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 11 µs
How is UK home office driving down crime in 2023?

The UK Home Office is responsible for immigration, security, and law and order. In 2023, the Home Office is driving down crime through various initiatives and strategies. Here are some of the ways the Home Office is working to reduce crime in the UK:

1. Investing in Police Technology: The Home Office is investing in cutting-edge technology to help the police fight crime more effectively. This includes the use of facial recognition software,
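
Both responses above stop mid-sentence once `max_new_tokens` is reached. One possible way to make sure responses are full sentences is to trim the generated text back to the last sentence terminator. The sketch below is a heuristic, not a feature of `transformers`; `trim_to_last_sentence` is a hypothetical helper, and it will misjudge abbreviations such as "e.g." that end in a period.

```python
import re


def trim_to_last_sentence(text: str) -> str:
    """Drop a trailing partial sentence; return the text unchanged if no sentence end is found."""
    last_end = None
    # A sentence end is ., ! or ? followed by whitespace or the end of the text
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        line_start = text.rfind("\n", 0, match.start()) + 1
        # Skip numbered-list markers such as "1." at the start of a line
        if text[line_start:match.start()].strip().isdigit():
            continue
        last_end = match.end()
    return text[:last_end] if last_end is not None else text


sample = (
    "Crime is falling.\n\n"
    "1. Investing in technology. This includes facial recognition software,"
)
print(trim_to_last_sentence(sample))
```

The helper could be applied to `response[0]["generated_text"]` before printing. Raising `max_new_tokens` reduces how often the cut happens but does not remove the need for trimming.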


## 03. Use LangChain for simple question answering

#### Llama 2 chat models' prompt structure 

- \<s> - the beginning of the entire sequence
- \<\<SYS>> - the beginning of the system message
- \<\</SYS>> - the end of the system message
- [INST] - the beginning of the instructions
- [/INST] - the end of the instructions

\<s>[INST] \<\<SYS>>

{{ system_prompt }}

\<\</SYS>>

{{ user_message }} [/INST]
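
The structure above can be assembled with a small helper. This is a minimal sketch for a single-turn prompt; `build_llama2_prompt` is a hypothetical name, not part of LangChain or `transformers`.

```python
def build_llama2_prompt(system_prompt: str, user_message: str) -> str:
    """Assemble a single-turn Llama 2 chat prompt from a system prompt and a user message."""
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n"
        f"{user_message.strip()} [/INST]"
    )


print(
    build_llama2_prompt(
        "You are a helpful assistant.",
        "How is UK home office driving down crime in 2023?",
    )
)
```

When the prompt goes through a tokenizer that already adds the beginning-of-sequence token, the literal `<s>` may need to be dropped to avoid a duplicate.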


In [112]:
from langchain import PromptTemplate

template = """<s>[INST] <<SYS>>\nYou are an assistant for question-answering tasks. You are helpful and friendly. Use the following pieces of retrieved context to answer the question. If you don't know the answer, you just say I don't know. Use three sentences maximum and keep the answer concise.
<</SYS>>\n
{context}\n
{question} [/INST]
"""
prompt_template = PromptTemplate(template=template, input_variables=['context','question'])


In [107]:
context = """The Home Office is at the heart of many of the Government’s top priorities.
We must stop the boats. Illegal immigration is unfair on taxpayers and would-be
immigrants who play by the rules, has an unbearable impact on public services and local
communities, enriches the gangs, and is lethally dangerous.
The Illegal Migration Act is a vital step in securing our borders. It places a cap on the
number of people seeking protection that the UK will resettle via safe and legal routes,
radically narrows the challenges and appeals that can suspend removal and tightens
modern slavery laws. We have signed a new and improved deal with France, along with
returns agreements with several countries.
The UK-Rwanda Migration and Economic Development Partnership is crucial to these
endeavours. The most similar scheme to it internationally is the Operation Sovereign
Borders in Australia, which started in late 2013. This entails offshoring asylum claims,
including to Nauru, combined with turnarounds at sea. The effect of this programme is
reduced Illegal Maritime Arrivals to Australia from approximately 18,000 in 2013, to
virtually zero.
Overall crime, excluding fraud and computer crime which has only been counted since
2016, is now down by 54% since 2010, with burglary down 55%, robbery down 77%,
violence down 46%, theft down 47%, neighbourhood crime down 51%, and criminal
damage down 72%. We are rolling out our ambitious programme, Operation Soteria, to
transform rape investigations and prosecutions and have brought in a big package of
measures for domestic abuse victims. Our new Fraud Strategy will mean this wicked crime
is treated like the epidemic it is.
We received the Third Volume of the Manchester Arena Inquiry report and published the
Terrorism (Protection of Premises) draft Bill. This is also known as Martyn’s Law. It will
place on public places a greater duty to protect their visitors.
Our efforts to drive down crime have been boosted by recruiting 20,951 additional police
officers by March 2023, exceeding our manifesto commitment to recruit an additional
20,000 by this date. This brings the total number of police officers in England and Wales to
149,566 in March 2023 – the highest on record against a previous peak of 146,030 in
2010. Now the police need to ensure they focus on getting the basics right: the highest
professional standards and a relentless focus on crime, not politically correct distractions.
This means continuing to smash county lines gangs, using proven methods like stop and
search, and deploying officers to high-crime areas. The Anti-Social Behaviour Action Plan
reflects the fact that there is no such thing as petty crime and that it is easy for areas to
slip into degeneracy and misery"""

In [113]:
from langchain.chains import LLMChain
from langchain.llms import HuggingFacePipeline

llm=HuggingFacePipeline(pipeline=pipe, model_kwargs={'temperature':0.7})

llm_chain = LLMChain(llm=llm, prompt=prompt_template)
llm_chain.predict(context=context, question=question)


'The UK Home Office is driving down crime in 2023 through various initiatives, including:\n\n1. Offshoring asylum claims, including to Rwanda, as part of the UK-Rwanda Migration and Economic Development Partnership, which has reduced illegal maritime arrivals to Australia by virtually zero.\n2. Implementing a big package of measures for domestic abuse victims, including transforming rape investigations and pro'

## 04. Use LangChain and Pinecone for RAG question answering

First, let's download the data to be used as the knowledge base for RAG. We will use reports published by the UK Home Office in 2023.

In [114]:
from urllib.request import urlretrieve
files = [
    "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1185849/Home_Office_Annual_Report_and_Accounts_22-23.pdf",
    "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1185949/2023-9-18_Equality_Impact_Assessment_for_Autumn_2023_fee_increases_FINAL.pdf",
]

os.makedirs("data", exist_ok=True)

for url in files:
    file_path = os.path.join("data", url.rpartition("/")[2])
    urlretrieve(url, file_path)

Next, let's split the loaded documents into smaller chunks

In [115]:
import numpy as np
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("./data/")

documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap  = 100,
)
docs = text_splitter.split_documents(documents)
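
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-width splitter. It is only an illustration of the two parameters: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this sketch does not do. `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width chunks where consecutive chunks share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still fully present in at least one chunk when it is shorter than the overlap.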

In [116]:
# avg_doc_length = lambda documents: sum([len(doc.page_content) for doc in documents])//len(documents)
# avg_char_count_pre = avg_doc_length(documents)
# avg_char_count_post = avg_doc_length(docs)
# print(f'Average length among {len(documents)} documents loaded is {avg_char_count_pre} characters.')
# print(f'After the split we have {len(docs)} documents more than the original {len(documents)}.')
# print(f'Average length among {len(docs)} documents (after split) is {avg_char_count_post} characters.')

Generate the embeddings. For that we will use the bge-small-en model

In [124]:
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'normalize_embeddings': False}
embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
embeddings

HuggingFaceBgeEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
), model_name='BAAI/bge-small-en', cache_folder=None, model_kwargs={'device': 'cuda'}, encode_kwargs={'normalize_embeddings': False}, query_instruction='Represent this question for searching relevant passages: ')

Let's have a look at an example of the generated embeddings. The dimension for our embeddings model is 384

In [127]:
sample_embedding = np.array(embeddings.embed_query(docs[0].page_content))
#print("Sample embedding of a document chunk: ", sample_embedding)
print("Size of the embedding: ", sample_embedding.shape[0])

Size of the embedding:  384


Ingest the embeddings into the vector store. We will be using Pinecone.

In [128]:
import getpass
os.environ["PINECONE_API_KEY"] = getpass.getpass("Pinecone API Key:")
os.environ["PINECONE_ENV"] = getpass.getpass("Pinecone Environment:")

Pinecone API Key: ········
Pinecone Environment: ········


In [129]:
import pinecone
pinecone.init(
    api_key = os.environ["PINECONE_API_KEY"],
    environment = os.environ["PINECONE_ENV"]
)

In [130]:
#check if index already exists, if not we create it
index_name = "qa-index"
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=384,
        metric='cosine'
    )
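
The index uses the cosine metric, which compares the direction of two vectors and ignores their length. The plain-Python sketch below illustrates that property; it says nothing about how Pinecone computes the score internally, and the vectors are made-up values.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
w = [2.0, 1.0, 0.5]
scaled_v = [10 * x for x in v]  # same direction as v, ten times longer

print(cosine_similarity(v, w))
print(cosine_similarity(scaled_v, w))
```

Both calls give the same score up to floating-point rounding, which is why embeddings created with `normalize_embeddings` set to `False` can still be ranked with a cosine index.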

In [131]:
from langchain.vectorstores import Pinecone
vector_store = Pinecone.from_documents(docs, embeddings, index_name=index_name)

Let's validate the similarity search is working well

In [132]:
# Use a separate name so the list of chunks in `docs` is not overwritten
similar_docs = vector_store.similarity_search(question)
print(similar_docs[3].page_content)

Performance Report  | 27 This section expands on the Performance Overview section and includes details of our 
activities and further analysis of progress against performance indicators.  
Outcome Delivery Plan 1: Reduce crime  
The Home Office ’s aim, as set out in the Beating Crime Plan, is to cut crime, reduce the 
number of victims and make our country safe. Our approach focuses on cutting homicide, 
neighbourhood crime and serious violence; exposing and ending hidden harms such as 
child sexual abuse and domestic abuse, prosecuting perpetrators; and building capability 
and capacity to tackle fraud, cyber crime  and online crime.  
The Department set out to achieve the following through its 2022 -23 performance:  
• Recruit an additional 20,000 police officers across England and Wales by March 
2023 .  
• Ensure the police have the resources and tools they need to respond to the 
evolving profile of crime and deliver against our priorities.


Next, we will use the LangChain RetrievalQA chain to augment our initial prompt with the most similar documents from the vector store

In [136]:
from langchain.chains import RetrievalQA

from langchain.prompts import PromptTemplate

llm_qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                     chain_type='stuff',
                                     retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
                                     return_source_documents=True,
                                     chain_type_kwargs={"prompt": prompt_template})
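
Conceptually, the `stuff` chain type puts all retrieved chunks into a single context string and fills the prompt with it. The sketch below shows that idea in plain Python; it is not LangChain's implementation, and the separator and the shortened template are assumptions made for illustration.

```python
# Shortened, illustrative template (assumption), following the Llama 2 chat structure
TEMPLATE = (
    "<s>[INST] <<SYS>>\nAnswer using the context.\n<</SYS>>\n\n"
    "{context}\n\n{question} [/INST]"
)


def stuff_prompt(chunks: list[str], question: str, template: str = TEMPLATE) -> str:
    """Join the retrieved chunks into one context string and fill the prompt template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


print(stuff_prompt(["Chunk one.", "Chunk two."], "How is crime being reduced?"))
```

Because every retrieved chunk ends up in one prompt, `k` and `chunk_size` together need to stay within the model's context window.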

In [139]:
llm_response = llm_qa_chain(question)
llm_response

{'query': 'How is UK home office driving down crime in 2023?',
 'result': "\nThe UK Home Office is driving down crime in 2023 through various initiatives outlined in its Outcome Delivery Plan 1: Reduce crime. These include recruiting an additional 20,000 police officers across England and Wales by March 2023, ensuring the police have the resources and tools they need to respond to the evolving profile of crime, and delivering against the department's priorities. Additionally, the Home Office",
 'source_documents': [Document(page_content='Performance Report  | 27 This section expands on the Performance Overview section and includes details of our \nactivities and further analysis of progress against performance indicators.  \nOutcome Delivery Plan 1: Reduce crime  \nThe Home Office ’s aim, as set out in the Beating Crime Plan, is to cut crime, reduce the \nnumber of victims and make our country safe. Our approach focuses on cutting homicide, \nneighbourhood crime and serious violence; e
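
The pending items mention duplicates in the RAG results. One possible cause (an assumption, not verified here) is that re-running the ingestion cell adds the same chunks to the index again. The sketch below filters exact duplicates after retrieval; `Chunk` is a hypothetical stand-in for a retrieved document and `deduplicate` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Minimal stand-in for a retrieved document (for illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def deduplicate(chunks: list) -> list:
    """Keep the first occurrence of each distinct text, preserving retrieval order."""
    seen = set()
    unique = []
    for chunk in chunks:
        # Normalise whitespace so that trivially different copies compare equal
        key = " ".join(chunk.page_content.split())
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


retrieved = [Chunk("Reduce crime"), Chunk("Reduce  crime"), Chunk("Recruit officers")]
print([c.page_content for c in deduplicate(retrieved)])  # ['Reduce crime', 'Recruit officers']
```

The helper only relies on a `page_content` attribute, so it should also work on the entries of `llm_response['source_documents']`.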

## 05. Build GenAI applications at scale by deploying our RAG question answering to SageMaker

In [None]:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
	role = sagemaker.get_execution_role()
except ValueError:
	iam = boto3.client('iam')
	role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

In [None]:
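import tarfile

# Write the archive next to the model directory so that it does not end up inside itself
zipped_model_path = os.path.join(os.path.dirname(model_path), "model.tar.gz")

with tarfile.open(zipped_model_path, "w:gz") as tar:
    # Store the model files at the root of the archive
    tar.add(model_path, arcname=".")
    # code_path is not defined anywhere in this notebook; set it before enabling this line
    # tar.add(code_path)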

In [None]:
import time

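# Endpoint names cannot contain "/", so only the part after the organisation name is used
endpoint_name = model_id.split("/")[-1] + "-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())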

llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.9.3"
)

print(f"llm image uri: {llm_image}")

model = HuggingFaceModel(
   # entry_point="inference_code.py",
    model_data=zipped_model_path,
    role=role,
    framework_version="1.5",
    py_version="py3",
)

predictor = model.deploy(
    initial_instance_count=1, instance_type="ml.m5.xlarge", endpoint_name=endpoint_name
)

In [None]:
# Define Model and Endpoint configuration parameters

instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300


config = {
  'HF_MODEL_ID': "meta-llama/Llama-2-7b-chat-hf",
  'SM_NUM_GPUS': json.dumps(number_of_gpu),
  'MAX_INPUT_LENGTH': json.dumps(2048),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(4096),  # Max length of the generation (including input text)
  'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),  # Limits the number of tokens that can be processed in parallel during the generation
  'HUGGING_FACE_HUB_TOKEN': "<REPLACE WITH YOUR TOKEN>"
  # ,'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
}

# check if token is set
assert config['HUGGING_FACE_HUB_TOKEN'] != "<REPLACE WITH YOUR TOKEN>", "Please set your Hugging Face Hub token"

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.9.3"),
    env=config,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)
  
# send request
predictor.predict({
	"inputs": "My name is Julien and I like to",
})
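
The request above sends only `inputs`. Assuming the Hugging Face LLM container also accepts a `parameters` object next to `inputs`, generation settings similar to the ones used locally can be sent with each request. `build_request` is a hypothetical helper and the default values simply mirror the local `GenerationConfig`.

```python
import json


def build_request(
    prompt: str,
    max_new_tokens: int = 100,
    temperature: float = 0.1,
    top_p: float = 0.75,
) -> dict:
    """Build a request body with the prompt under "inputs" and generation settings under "parameters"."""
    return {
        "inputs": prompt,
        "parameters": {
            "do_sample": True,
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
    }


payload = build_request("<s>[INST] How is UK home office driving down crime in 2023? [/INST]")
print(json.dumps(payload, indent=2))
```

The resulting dictionary could then be passed to `predictor.predict(payload)`.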

In [None]:
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
predictor = huggingface_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, # seconds allowed for the container to load the model
)
