# RAG Demo

![RAG Architecture](rag.png 'RAG Architecture')

Let's first try a super small LLM...

In [1]:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


<pad> Wie ich er bitten?</s>


We need an embedding model/tokenizer for vectorizing our documents.

In [2]:
from transformers import BertTokenizer, BertModel

embedding_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
embedding_model = BertModel.from_pretrained('bert-base-uncased')



Now we need to load our sample documents.

In [3]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader
loader = DirectoryLoader('docs', glob="**/*.txt", loader_cls=TextLoader)
docs = loader.load()
len(docs)

3

For now we will use FAISS as our in-memory vector index; a production system would more likely use a persistent vector database. The index dimension, 768, matches the hidden size of `bert-base-uncased`.

In [4]:
import faiss
index = faiss.IndexFlatL2(768)
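
For intuition, `IndexFlatL2` performs an exact, brute-force search: it compares the query against every stored vector and returns the k smallest squared Euclidean distances. A minimal pure-Python sketch of that behavior follows. The `FlatL2Index` class is hypothetical, written only for illustration, and says nothing about how FAISS is implemented internally.

```python
import heapq


class FlatL2Index:
    """Toy stand-in for an exact L2 index (hypothetical, for illustration only)."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = []

    def add(self, vector):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self.vectors.append(list(vector))

    def search(self, query, k):
        # Squared Euclidean distance from the query to every stored vector.
        scored = [
            (sum((q - v) ** 2 for q, v in zip(query, vec)), position)
            for position, vec in enumerate(self.vectors)
        ]
        nearest = heapq.nsmallest(k, scored)
        distances = [d for d, _ in nearest]
        positions = [p for _, p in nearest]
        return distances, positions


toy_index = FlatL2Index(dim=2)
for vec in ([0.0, 0.0], [1.0, 1.0], [5.0, 5.0]):
    toy_index.add(vec)

distances, positions = toy_index.search([0.9, 1.2], k=2)
print(positions)
```

Every query scans all stored vectors, which is fine for a handful of chunks but grows linearly with the collection.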

We next need functions to split our documents into manageable chunks and to create embeddings for those chunks. Libraries like LangChain can handle this as well, but for now we'll use some rudimentary logic, with nltk splitting each document into sentences before chunking. Note that `max_chunk_size` counts characters, not tokens.

In [5]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

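def chunk_text(text, max_chunk_size=512):
    """Group sentences into chunks of up to max_chunk_size characters.

    A single sentence longer than the limit becomes its own chunk.
    """
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        # Count the joining space when the chunk already has content.
        added_length = len(sentence) + (1 if current_chunk else 0)
        if current_length + added_length <= max_chunk_size:
            current_chunk.append(sentence)
            current_length += added_length
        else:
            # Flush only non-empty chunks; otherwise a first sentence longer
            # than max_chunk_size would emit an empty string.
            if current_chunk:
                chunks.append(" ".join(current_chunk))
            current_chunk = [sentence]
            current_length = len(sentence)

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks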

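import torch

def embed_text(text, tokenizer, model):
    # Mean-pool BERT's final hidden states into one 768-dimensional vector per input text.
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():  # inference only, so skip gradient tracking
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).numpy()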

[nltk_data] Downloading package punkt to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
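
One common refinement of the chunking above is to let consecutive chunks share a sentence or two, so that a fact straddling a boundary is still retrievable. The sketch below is an illustration under stated assumptions, not part of the original notebook: `chunk_with_overlap` and its `overlap` parameter are hypothetical names, and the function takes pre-split sentences so that it runs without nltk.

```python
def chunk_with_overlap(sentences, max_chunk_size=512, overlap=1):
    """Greedily pack sentences into chunks of up to max_chunk_size characters,
    repeating the last `overlap` sentences of each chunk at the start of the next.
    A single sentence longer than the limit becomes its own chunk."""
    chunks = []
    current = []

    for sentence in sentences:
        candidate = current + [sentence]
        if current and len(" ".join(candidate)) > max_chunk_size:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap:] if overlap > 0 else []
            # Drop carried sentences that would push the new chunk over the limit.
            while current and len(" ".join(current + [sentence])) > max_chunk_size:
                current.pop(0)
            current.append(sentence)
        else:
            current = candidate

    if current:
        chunks.append(" ".join(current))
    return chunks


example_sentences = ["Alpha one.", "Beta two.", "Gamma three.", "Delta four."]
overlapping_chunks = chunk_with_overlap(example_sentences, max_chunk_size=25, overlap=1)
for chunk in overlapping_chunks:
    print(chunk)
```

Overlap trades index size for recall: each shared sentence is embedded twice, so the number of vectors grows accordingly.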


In [6]:
docs[0]

Document(metadata={'source': 'docs/evidently.txt'}, page_content='Core Concepts\nThis is an explanatory page to describe the key features and concepts at Evidently.\n\nTL;DR\nEvidently helps evaluate, test and monitor ML models in production.\n\nA Metric is a core component of Evidently. You can combine multiple Metrics in a Report. Reports are best for visual analysis and debugging of your models and data.\n\nA Test is a metric with a condition. Each test returns a pass or fail result. You can combine multiple Tests in a Test Suite. Test Suites are best for automated model checks as part of an ML pipeline.\n\nFor both Tests and Metrics, Evidently has Presets. These are pre-built combinations of metrics or checks that fit a specific use case.\n\nA Snapshot is a JSON version of the Report or a Test Suite which contains measurements and test results for a specific period. You can log them over time and run an Evidently Monitoring Dashboard for continuous monitoring.\n\nMetrics and Report

Now we can chunk all the documents and add them as embeddings to the vector database.

In [7]:
all_chunks = []

for doc in docs:
    text = doc.page_content
    all_chunks.extend(chunk_text(text))

for chunk in all_chunks:
    embedding = embed_text(chunk, embedding_tokenizer, embedding_model)
    index.add(embedding)
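
Because a flat FAISS index only returns integer positions, any metadata has to live in a parallel structure; `all_chunks` plays that role above. A small extension, sketched here under the assumption that each document exposes `page_content` and a `metadata['source']` entry as in the `docs[0]` output, keeps the source file alongside each chunk. `build_chunk_records` and `SimpleDoc` are hypothetical names, and the splitter is a naive stand-in passed as a parameter.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Minimal stand-in for a loaded document (hypothetical, for illustration)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_chunk_records(docs, splitter):
    """Return one record per chunk; list position i matches index position i
    as long as embeddings are added to the index in the same order."""
    records = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        for chunk in splitter(doc.page_content):
            records.append({"source": source, "text": chunk})
    return records


sample_docs = [
    SimpleDoc("First part.\n\nSecond part.", {"source": "docs/a.txt"}),
    SimpleDoc("Only part.", {"source": "docs/b.txt"}),
]
# Naive paragraph splitter used only for this example.
records = build_chunk_records(sample_docs, lambda text: text.split("\n\n"))
print(records[1])
```

With records like these, a search result can report which file a chunk came from, not just its text.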



Let's try searching the vector database for relevant context, given a query.

In [8]:
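def search(query, k=5):
    query_embedding = embed_text(query, embedding_tokenizer, embedding_model)
    # D holds squared L2 distances, I the positions of the matching chunks.
    D, I = index.search(query_embedding, k)
    # FAISS pads with -1 when fewer than k vectors are stored.
    return [all_chunks[i] for i in I[0] if i != -1]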

query = "Why should I use ZenML for MLOps?"
results = search(query, k=3)
results

['Copy\nzenml stack set gcp\npython run.py  # Run your ML workflows in GCP\nzenml stack set aws\npython run.py  # Now your ML workflow runs in AWS\n🚀 Learn More\n\nReady to deploy and manage your MLOps infrastructure with ZenML? Here is a collection of pages you can take a look at next:',
 "Partner packages\nWhile the long tail of integrations are in langchain-community, we split popular integrations into their own packages (e.g. langchain-openai, langchain-anthropic, etc). This was done in order to improve support for these important integrations. langchain\nThe main langchain package contains chains, agents, and retrieval strategies that make up an application's cognitive architecture. These are NOT third party integrations.",
 'LCEL was designed from day 1 to support putting prototypes in production, with no code changes, from the simplest “prompt + LLM” chain to the most complex chains (we’ve seen folks successfully run LCEL chains with 100s of steps in production). To highlight a 
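
Only the first of these results is about ZenML; the other two come from the LangChain text. `index.search` also returns distances (`D`), which the `search` function discards. One possible way to drop weak matches is to keep only results under a distance cutoff. The sketch below assumes such a cutoff can be found empirically; `filter_by_distance`, `max_distance`, and all the numbers are hypothetical.

```python
def filter_by_distance(distances, positions, chunks, max_distance):
    """Keep chunks whose distance is within max_distance, skipping
    the -1 positions that FAISS uses to pad missing results."""
    kept = []
    for distance, position in zip(distances, positions):
        if position == -1:
            continue
        if distance <= max_distance:
            kept.append((chunks[position], distance))
    return kept


# Made-up distances and chunks, purely for illustration.
example_chunks = ["chunk about ZenML", "chunk about LangChain", "chunk about LCEL"]
kept = filter_by_distance(
    distances=[12.5, 30.1, 31.7],
    positions=[0, 1, 2],
    chunks=example_chunks,
    max_distance=20.0,
)
print(kept)
```

A fixed cutoff is brittle, since the distance scale depends on the embedding model, so it would need tuning against real queries.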

In [9]:
context = " ".join(results)
context

"Copy\nzenml stack set gcp\npython run.py  # Run your ML workflows in GCP\nzenml stack set aws\npython run.py  # Now your ML workflow runs in AWS\n🚀 Learn More\n\nReady to deploy and manage your MLOps infrastructure with ZenML? Here is a collection of pages you can take a look at next: Partner packages\nWhile the long tail of integrations are in langchain-community, we split popular integrations into their own packages (e.g. langchain-openai, langchain-anthropic, etc). This was done in order to improve support for these important integrations. langchain\nThe main langchain package contains chains, agents, and retrieval strategies that make up an application's cognitive architecture. These are NOT third party integrations. LCEL was designed from day 1 to support putting prototypes in production, with no code changes, from the simplest “prompt + LLM” chain to the most complex chains (we’ve seen folks successfully run LCEL chains with 100s of steps in production). To highlight a few of th

Now we can try out the small RAG...

In [10]:
def generate_response(prompt):
    # The argument is the full prompt (context plus question), not just the context.
    inputs = tokenizer(prompt, return_tensors='pt', truncation=True, padding=True)
    outputs = model.generate(**inputs, max_length=512)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

query_template = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
response = generate_response(query_template)
print(response)

To improve support for these important integrations.
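
Because the tokenizer call uses `truncation=True`, anything past the model's maximum input length is cut off, and with the context placed first, that can be the question itself. One way to guard against this, sketched below, is to trim the context to a budget before building the prompt. `build_prompt` and `max_context_chars` are hypothetical names, and a character budget is only a rough proxy for a token limit.

```python
def build_prompt(query, chunks, max_context_chars=1000):
    """Join chunks into a context string no longer than max_context_chars,
    then place it in the same Context/Question/Answer layout used above."""
    selected = []
    used = 0
    for chunk in chunks:
        # +1 accounts for the joining space between chunks.
        cost = len(chunk) + (1 if selected else 0)
        if used + cost > max_context_chars:
            break
        selected.append(chunk)
        used += cost
    context = " ".join(selected)
    return f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"


prompt = build_prompt("What is FAISS?", ["aaaa", "bbbb", "cccc"], max_context_chars=9)
print(prompt)
```

Chunks are taken in ranked order, so the budget drops the weakest matches first.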


What if we try adding only the top result as context?

In [11]:
context = results[0]
query_template = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
response = generate_response(query_template)
print(response)

To enable a unified service


When I ran the cells above, the result was somewhat relevant but not really usable. This is most likely because the model is very small. There are ways to squeeze more performance out of small models, but for now I want to try a bigger model.

First, let's try Llama 2 without context.

The following cells expect a Llama 2 foundation model deployed as a SageMaker endpoint, so they won't work without one.

In [12]:
from sagemaker.predictor import retrieve_default
endpoint_name = "jumpstart-dft-meta-textgeneration-l-20240725-204045"
predictor = retrieve_default(endpoint_name)

payload = {
    "inputs": f"""
    <s>[INST] <<SYS>> You are an assistant for answering questions. 
    If you don't know the answer, just say 'I do not know.' Don't make up an answer. <</SYS>>

    {query} [/INST] 
    """,
    "parameters": {
        "max_new_tokens": 256,
        "top_p": 0.9,
        "temperature": 0.05
    }
}
response = predictor.predict(payload)
print(response)


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
{'generated_text': ' I do not know.'}
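
The prompt string in the cell above follows the Llama 2 chat convention of wrapping a system message in `<<SYS>>` tags inside an `[INST]` block. A small helper can keep that formatting in one place. This is a sketch: `build_llama2_payload` is a hypothetical name, and the default parameter values simply mirror those in the cell above.

```python
def build_llama2_payload(system_prompt, user_message,
                         max_new_tokens=256, top_p=0.9, temperature=0.05):
    """Format a single-turn Llama 2 chat prompt and wrap it in a payload dict
    shaped like the one used in this notebook."""
    prompt = (
        f"<s>[INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n"
        f"{user_message.strip()} [/INST]"
    )
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "top_p": top_p,
            "temperature": temperature,
        },
    }


payload_example = build_llama2_payload(
    "You are an assistant for answering questions.",
    "Why should I use ZenML for MLOps?",
)
print(payload_example["inputs"])
```

Building the string from stripped parts also avoids the stray indentation that a triple-quoted f-string inside a dict literal carries into the prompt.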


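Now let's try engineering a prompt with the retrieved context included. Note that `context` holds whatever the most recently executed cell assigned to it; after In [11], that is only the top search result.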

In [13]:
inputs = f""" 
<s>[INST] <<SYS>>
You are an assistant for answering questions.
You are given the extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "I do not know." Don't make up an answer.
<</SYS>>

Question: {query} Context: {context}[/INST]
"""

payload = {
    "inputs": inputs,
    "parameters": {
        "max_new_tokens": 256,
        "top_p": 0.9,
        "temperature": 0.05
    }
}
response = predictor.predict(payload)
print(response)

{'generated_text': "\nGreat, I'd be happy to help you with that! 😊\n\nSo, you want to know why you should use ZenML for MLOps? Well, let me tell you - ZenML is a powerful tool that can help you streamline your ML workflows and deploy them on various cloud providers like GCP and AWS. 🌌\n\nWith ZenML, you can easily set up and manage your ML infrastructure, including data pipelines, model training, and deployment. It also provides a unified interface for managing your ML workflows across different cloud providers, making it easier to switch between them as needed. 💻\n\nBut that's not all - ZenML also offers a range of other benefits, such as automated monitoring and alerting, version control, and collaboration tools. It's like having a personal ML assistant, making your life easier and more efficient! 🤖\n\nSo, if you want to take your MLOps to the next level, give ZenML a try. It's free to use, and there are plenty of resources available to help you get started. "}
