# Backblaze B2 Retrieval-Augmented Generation (RAG) Demo

Retrieval-Augmented Generation (RAG) allows you to build on an off-the-shelf large language model (LLM), adding custom context that the model can use in interacting with a user. You can use RAG to implement chatbots that use your own proprietary data to answer questions, without that data leaking to the internet. 

This notebook walks you through loading PDF files from [Backblaze B2 Cloud Object Storage](https://www.backblaze.com/cloud-storage) into a [LangChain](https://python.langchain.com/v0.2/docs/introduction/) RAG app, then building a chatbot that can answer questions relating to the content of those PDF files. You'll use an open-source language model that you run locally, rather than an online API, ensuring that your data stays confidential.

The code is based on the LangChain tutorial, [Build a Local RAG Application](https://python.langchain.com/v0.2/docs/tutorials/local_rag/).

## Install Dependencies

First, install the required Python packages:

In [1]:
%pip install --upgrade --quiet -r requirements.txt

# Restart the kernel so that it uses the new modules
get_ipython().kernel.do_shutdown(restart=True)

Note: you may need to restart the kernel to use updated packages.


{'status': 'ok', 'restart': True}

## Prerequisites

You need a Backblaze B2 Account, Bucket and Application Key, and some PDF files. Follow these instructions, as necessary:

* [Create a Backblaze B2 Account](https://www.backblaze.com/sign-up/cloud-storage).
* [Create a Backblaze B2 Bucket](https://www.backblaze.com/docs/cloud-storage-create-and-manage-buckets).
* [Create an Application Key](https://www.backblaze.com/docs/cloud-storage-create-and-manage-app-keys#create-an-app-key) with access to the bucket you wish to use.

Be sure to copy the application key as soon as you create it, as you will not be able to retrieve it later!

## Upload PDF Files to Your Bucket

You can use the Backblaze web UI, or any B2 or S3-compatible file management tool, to [upload PDF files to your bucket](https://www.backblaze.com/docs/cloud-storage-upload-and-manage-files). It's useful to organize files by prefix (analogous to a folder or directory in a traditional filesystem); this example assumes the PDFs have the prefix `pdfs/` within the bucket.

If you don't have any suitable PDF files to hand, you can [download the Backblaze B2 documentation PDF used in creating this tutorial](https://metadaddy-langchain-demo.s3.us-west-004.backblazeb2.com/pdfs/documentation.pdf) and upload it to your own bucket. 

## Configuration

Since Backblaze B2 has an S3-compatible API, this notebook uses LangChain's [`S3FileLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.s3_file.S3FileLoader.html) and the [`s3fs`](https://s3fs.readthedocs.io/en/latest/) module to interact with files in Backblaze B2, as well as the AWS SDK for Python, also known as Boto3. `S3FileLoader` uses Boto3 under the covers, and `s3fs` builds on Botocore, the same low-level library that underlies Boto3, so a single set of configuration lets all of the tools access your Backblaze B2 Bucket. The most straightforward way to supply that configuration in this context is via environment variables.

Note: you should never, *ever* put credentials in your code, including Jupyter notebooks! This example uses `python-dotenv` to load configuration from a `.env` file into environment variables for use by `S3FileLoader`. This repo includes a template file, `.env.template`. Copy it to `.env`, then edit it as follows:

```dotenv
AWS_ACCESS_KEY_ID='<Your Backblaze application key ID>'
AWS_SECRET_ACCESS_KEY='<Your Backblaze application key>'
AWS_ENDPOINT_URL='<Your bucket endpoint, prefixed with https://, e.g., https://s3.us-west-004.backblazeb2.com >'
```

When you're done, `.env` should look like this:

```dotenv
AWS_ACCESS_KEY_ID='004qlekmvpwemrt000000009e'
AWS_SECRET_ACCESS_KEY='K004JEKEUTGLKEJFKLRJHTKLVCNWURM'
AWS_ENDPOINT_URL='https://s3.us-west-004.backblazeb2.com'
```

Now you can load the configuration into the environment:

In [42]:
from dotenv import load_dotenv

if load_dotenv():
    print('Loaded environment variables from .env')
else:
    print('No environment variables in .env!')

Loaded environment variables from .env
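
Before going further, it can help to confirm that the three settings from `.env` actually made it into the environment. The following is a minimal sketch rather than part of the original notebook; `REQUIRED_SETTINGS` and `missing_settings` are hypothetical names introduced here for illustration.

```python
import os

# The three settings this notebook expects to find in .env
REQUIRED_SETTINGS = ('AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'AWS_ENDPOINT_URL')


def missing_settings(env):
    """Return the names of required settings that are absent or empty in env."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


missing = missing_settings(os.environ)
if missing:
    print(f'Missing settings: {", ".join(missing)}')
else:
    print('All required settings are present')
```

The check only reports names, never values, so it is safe to leave its output in a shared notebook.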


Set the bucket name to match the bucket you are using:

In [44]:
bucket_name = 'metadaddy-langchain-demo'

Set the PDF location to the prefix (folder/directory) within the bucket that you are using for your PDFs. You can set it to `''` if you put the PDFs in the root of the bucket.

In [43]:
pdf_location = 'pdfs'

## List the PDF Files for Processing

Use Boto3 to list the files in `pdf_location`.

In [38]:
import boto3

b2_client = boto3.client('s3')

try:
    # Note - list_objects_v2 returns a maximum of 1000 objects per call, 
    # so you should use a paginator in a real-world implementation. 
    # See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/paginators.html
    object_list = b2_client.list_objects_v2(Bucket=bucket_name, Prefix=pdf_location)
    print(f'Successfully accessed {bucket_name}, found {object_list["KeyCount"]} file(s) under {pdf_location}/')
except Exception as e:
    print(f'Error accessing B2: {e}')

Successfully accessed metadaddy-langchain-demo, found 1 file(s) under pdfs/
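
The comment in the cell above recommends a paginator for prefixes that hold more than 1000 objects. A minimal sketch of that approach follows; `list_pdf_keys` is a hypothetical helper, and it takes the client as a parameter so that you can pass in the `b2_client` created above.

```python
def list_pdf_keys(client, bucket, prefix):
    """Return the keys of all PDF objects under prefix, following pagination."""
    paginator = client.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent from a page when no objects match
        for obj in page.get('Contents', []):
            if obj['Key'].lower().endswith('.pdf'):
                keys.append(obj['Key'])
    return keys
```

For example, `list_pdf_keys(b2_client, bucket_name, pdf_location)` should return the key of every PDF under the prefix, however many pages the listing spans.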


## Load PDF Data from Backblaze B2

Now you can iterate through the list of files, loading each with [`S3FileLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.s3_file.S3FileLoader.html).

This can take a few minutes, depending on how much data you are loading. Most of the time is consumed by parsing the PDF, rather than downloading the data. Note that you need only download and parse the PDF data once. In later steps you'll see how the PDF data is [loaded into a vector store](#Create-a-Vector-Store) that can be persisted to local disk, then [archived to Backblaze B2](#Archive-the-Vector-Store-to-Backblaze-B2) and [downloaded from Backblaze B2](#Download-the-Vector-Store-from-Backblaze-B2) for use with the model.

In [9]:
from langchain_community.document_loaders import S3FileLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from fnmatch import fnmatch

print(f'Loading PDF data from B2 bucket {bucket_name}/{pdf_location}')
docs = []
# 'Contents' is absent from the response when no objects match the prefix
for obj in object_list.get('Contents', []):
    # Only process PDF files
    if fnmatch(obj['Key'], '*.pdf'):
        print(f'Loading {obj["Key"]}')
        loader = S3FileLoader(bucket_name, obj['Key'])
        docs += loader.load()

print(f'Loaded {len(docs)} document(s)')

Loading PDF data from B2 bucket metadaddy-langchain-demo/pdfs
Loading pdfs/documentation.pdf
Loaded 1 document(s)


You must split the text into chunks for loading into a [vector store](https://python.langchain.com/v0.2/docs/concepts/#vector-stores). A chunk size of 1000 characters with a 200-character overlap seems to work well for technical articles. You can experiment by changing these parameters.

In [10]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)
print(f'Split {len(docs)} document(s) into {len(all_splits)} chunks')

Split 1 document(s) into 1982 chunks
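
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sketch that slices text at fixed character offsets. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph and line breaks); `sliding_chunks` is a hypothetical function for illustration only.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size chunks, each sharing chunk_overlap characters with the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be non-negative and smaller than chunk_size')
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_chunks('abcdefghij', chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary, at the cost of more chunks to embed and store.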


## Create a Vector Store

Now create a vector store from the splits.

Note that the code specifies `persist_directory` so that the vector store is saved to local disk.

In [22]:
from langchain_chroma import Chroma
from langchain_community.embeddings import GPT4AllEmbeddings

# The vector store will be persisted into this local directory
vector_db_directory = 'vectordb'

vectorstore = Chroma.from_documents(
    documents=all_splits,
    embedding=GPT4AllEmbeddings(model_name='all-MiniLM-L6-v2.gguf2.f16.gguf', gpt4all_kwargs={'allow_download': True}),
    persist_directory=vector_db_directory
)

Running a similarity search on the vector store with a relevant query should return one or more results.

In [23]:
search_results = vectorstore.similarity_search('When would you use a master application key?')
print(f'Found {len(search_results)} docs')
print(f'First doc ({len(search_results[0].page_content)} characters): {search_results[0]}')

Found 4 docs
First doc (237 characters): page_content='a standardapplication key that is limited to the level of access that a user needs. The following use casesdemonstrate when to use a standard app key versus a master application key:Use a standard application key for the following cases:' metadata={'source': 's3://metadaddy-langchain-demo/pdfs/documentation.pdf'}
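
Conceptually, a similarity search embeds the query as a vector and returns the stored chunks whose vectors are closest to it. The sketch below illustrates the idea with cosine similarity over made-up three-dimensional vectors. It is an assumption-laden toy: a real vector store such as Chroma uses an approximate nearest-neighbor index rather than the exhaustive scan shown here, and `cosine_similarity` and `top_k` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, named_vectors, k=4):
    """Return the names of the k vectors most similar to query_vector."""
    ranked = sorted(
        named_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [name for name, _vector in ranked[:k]]


# Made-up embeddings, for illustration only
toy_vectors = {
    'keys': [1.0, 0.0, 0.0],
    'buckets': [0.5, 0.5, 0.0],
    'billing': [0.0, 0.0, 1.0],
}
print(top_k([0.9, 0.1, 0.0], toy_vectors, k=2))
# ['keys', 'buckets']
```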


## Archive the Vector Store to Backblaze B2

Since it can take considerable time to download, parse, and split the PDF data, then load it into the vector store, it is worth archiving the vector store files for later use.

Notice that this code block zips the vector store files directly into a file in Backblaze B2, rather than creating, uploading, then deleting a local ZIP file. This is a useful technique to save time and local storage space!

In [29]:
from shutil import copyfileobj
from datetime import datetime
import os

from s3fs import S3FileSystem
from zipfile import ZipFile, ZipInfo

b2fs = S3FileSystem(version_aware=True)

# Vector store archive location within your Backblaze B2 bucket
vector_db_archive = 'vectordb/vectordb.zip'

# S3FileSystem uses the bucket name in the path
vector_db_path = f'{bucket_name}/{vector_db_archive}'

with b2fs.open(vector_db_path, mode='wb') as f, ZipFile(f, mode='w') as zipfile:
    for root, _dirnames, filenames in os.walk(vector_db_directory):
        for filename in filenames:
            fullpath = os.path.join(root, filename)
            mtime = os.path.getmtime(fullpath)
            last_modified = datetime.fromtimestamp(mtime)
            date_time = (last_modified.year, last_modified.month, last_modified.day,
                         last_modified.hour, last_modified.minute, last_modified.second)
            # Want path relative to vector_db_directory; relpath handles any OS path separator
            zipinfo = ZipInfo(filename=os.path.relpath(fullpath, vector_db_directory), date_time=date_time)
            with open(fullpath, mode='rb') as src, zipfile.open(zipinfo, mode='w') as dst:
                copyfileobj(src, dst)
                print(f'Added {zipinfo.filename}')

# Check that the ZIP is there, and see how big it is
response = b2fs.ls(vector_db_path, detail=True)
print(f'\nArchived vector store at {vector_db_directory} to {vector_db_archive}; archive size is {response[0]["size"]}')

Added chroma.sqlite3
Added e4616987-a3bb-4405-9250-2fdac7275b64/data_level0.bin
Added e4616987-a3bb-4405-9250-2fdac7275b64/length.bin
Added e4616987-a3bb-4405-9250-2fdac7275b64/link_lists.bin
Added e4616987-a3bb-4405-9250-2fdac7275b64/header.bin
Added e4616987-a3bb-4405-9250-2fdac7275b64/index_metadata.pickle

Archived vector store at vectordb to vectordb/vectordb.zip; archive size is 37986676
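
`ZipFile` only needs a writable binary file object, which is why the cell above can hand it the object returned by `b2fs.open()`. The sketch below shows the same idea with an in-memory `io.BytesIO` buffer standing in for the B2 file, so it runs without network access. `zip_directory` is a hypothetical helper; it uses `ZipFile.write()` with `os.path.relpath()` as a shorter alternative to building each `ZipInfo` by hand, and it enables compression, which is a choice made here rather than something the original cell does.

```python
import io
import os
import tempfile
from zipfile import ZipFile, ZIP_DEFLATED


def zip_directory(source_dir, dest_fileobj):
    """Zip every file under source_dir into dest_fileobj; return the member names."""
    with ZipFile(dest_fileobj, mode='w', compression=ZIP_DEFLATED) as archive:
        for root, _dirnames, filenames in os.walk(source_dir):
            for filename in filenames:
                fullpath = os.path.join(root, filename)
                # Store each file under its path relative to source_dir
                archive.write(fullpath, arcname=os.path.relpath(fullpath, source_dir))
        return archive.namelist()


# Demo: zip a throwaway directory into memory instead of a B2 object
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, 'example.txt'), 'w') as demo_file:
        demo_file.write('hello')
    buffer = io.BytesIO()
    print(zip_directory(demo_dir, buffer))
```

With the objects defined earlier, the equivalent call would be `zip_directory(vector_db_directory, f)` inside a `with b2fs.open(vector_db_path, mode='wb') as f:` block.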


## Download the Vector Store from Backblaze B2

Once you've archived the vector store to Backblaze B2, you can download and unzip it for future use rather than recreating it from the PDF data.

You don't need to do this if you already have a local vector store, but, as a test, you can delete the local vector store before downloading and unzipping the archive by uncommenting the `rmtree()` call.

In [36]:
from shutil import rmtree

# Uncomment the following line if you want to delete the local vector store before you download and unzip the archive
# rmtree(vector_db_directory)

if not os.path.isdir(vector_db_directory):
    os.mkdir(vector_db_directory)

print(f'Downloading and unzipping {vector_db_path} to {vector_db_directory}')
with b2fs.open(vector_db_path, mode='rb') as f, ZipFile(f, mode='r') as myzip:
    myzip.extractall(vector_db_directory)

print('Downloaded and extracted vector store from B2:')
for root, _dirnames, filenames in os.walk(vector_db_directory):
    for filename in filenames:
        fullpath = os.path.join(root, filename)
        info = os.stat(fullpath)
        print(f'{info.st_size:>12} {fullpath}')

Downloading and unzipping metadaddy-langchain-demo/vectordb/vectordb.zip to vectordb
Downloaded and extracted vector store from B2:
    32747520 vectordb/chroma.sqlite3
     5028000 vectordb/e4616987-a3bb-4405-9250-2fdac7275b64/data_level0.bin
       12000 vectordb/e4616987-a3bb-4405-9250-2fdac7275b64/length.bin
       25940 vectordb/e4616987-a3bb-4405-9250-2fdac7275b64/link_lists.bin
         100 vectordb/e4616987-a3bb-4405-9250-2fdac7275b64/header.bin
      172004 vectordb/e4616987-a3bb-4405-9250-2fdac7275b64/index_metadata.pickle
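
If you are unsure whether your local vector store matches the archive, one option is to compare the two before extracting. The following sketch is an assumption rather than part of the original notebook: it lists archive members that are missing locally or whose size differs. `stale_members` is a hypothetical helper, and a size comparison will not detect changes that leave a file's size unchanged.

```python
import io
import os
import tempfile
from zipfile import ZipFile


def stale_members(archive_fileobj, directory):
    """Return archive members that are missing from directory or differ in size."""
    stale = []
    with ZipFile(archive_fileobj, mode='r') as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # ZIP member names always use '/' as the separator
            local_path = os.path.join(directory, *info.filename.split('/'))
            if not os.path.isfile(local_path) or os.path.getsize(local_path) != info.file_size:
                stale.append(info.filename)
    return stale


# Demo: an in-memory archive compared against an empty temporary directory
buffer = io.BytesIO()
with ZipFile(buffer, mode='w') as archive:
    archive.writestr('chroma.sqlite3', b'example bytes')

with tempfile.TemporaryDirectory() as empty_dir:
    print(stale_members(buffer, empty_dir))
```

With the objects defined earlier, you could call `stale_members(f, vector_db_directory)` inside a `with b2fs.open(vector_db_path, mode='rb') as f:` block; an empty list suggests the download can be skipped.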



## Load the Large Language Model (LLM)

[GPT4All](https://docs.gpt4all.io/) allows you to run LLMs locally on consumer-grade hardware; it's a great tool for getting started building LLM-based applications.
You can [download the GPT4All app](https://www.nomic.ai/gpt4all) and use it to download one or more models, or download model files from [Hugging Face](https://huggingface.co/) directly. GPT4All offers a [wide choice of models](https://docs.gpt4all.io/gpt4all_desktop/models.html); this tutorial uses [Nous Hermes 2 Mistral DPO](https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO), a fast chat-based model.
 
If you use the app, you will need to locate the directory to which it downloads models. The location on my Mac is shown below as an example.

In [14]:
from langchain_community.llms import GPT4All

# Change this to point to the model file on your machine
model_path = '/Users/ppatterson/Library/Application Support/nomic.ai/GPT4All/Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf'

# The device on which to run the model: 'cpu', 'gpu', 'nvidia', 'intel', 'amd' or a DeviceName
device = 'gpu'

# Maximum size of context window, in tokens. A higher number can produce better responses, but will consume more memory.
max_context_window = 4096

print(f'Loading LLM, requesting device {device}')
model = GPT4All(
    model=model_path,
    max_tokens=max_context_window,
    device=device
)
print(f'Loaded LLM, running on {model.device}.')

Loading LLM, requesting device gpu
Loaded LLM, running on gpu.
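
When choosing a context window size, it helps to have a rough idea of how large the prompt will be, since it carries the retrieved chunks as well as the question and instructions. The sketch below is a back-of-the-envelope estimate only: the figure of about four characters per token is a common rule of thumb for English text, not a property of this model's tokenizer, and `estimate_prompt_tokens` is a hypothetical helper.

```python
import math


def estimate_prompt_tokens(num_chunks, chunk_size, question_chars,
                           template_chars, chars_per_token=4):
    """Roughly estimate prompt size in tokens from character counts."""
    total_chars = num_chunks * chunk_size + question_chars + template_chars
    return math.ceil(total_chars / chars_per_token)


# Four retrieved chunks of up to 1000 characters, plus a short question and
# the instruction text in the prompt (the last two sizes are guesses)
estimate = estimate_prompt_tokens(num_chunks=4, chunk_size=1000,
                                  question_chars=100, template_chars=200)
print(f'Estimated prompt size: about {estimate} tokens')
# Estimated prompt size: about 1075 tokens
```

If you increase the chunk size or the number of retrieved chunks, rerun the estimate to check that the prompt still leaves room for the answer.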


As its name implies, LangChain allows you to combine components such as vector stores and LLMs into chains to implement a wide variety of use cases. Each component in the chain accepts input, performs some processing, and emits some output.
 
To use a vector store in a chain, you obtain its `retriever` interface; the retriever accepts string queries and returns the most 'relevant' documents from its source.

In [39]:
retriever = vectorstore.as_retriever()

You need to define a prompt template to frame the interaction with the LLM. In this RAG chain, it will combine instructions, the context retrieved from the vector store, and the user's question.

This prompt template is based on the example Q&A RAG prompt at
https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/chains/retrieval_qa/prompt.py. Note how it explicitly instructs the model to use the provided context in answering the question, and not to try to make up an answer. `{context}` and `{question}` are placeholders; the relevant text will be substituted as the chain executes.

In [40]:
from langchain_core.prompts import PromptTemplate

prompt_template = """Use the following pieces of context to answer the question at the end. 
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    
    {context}
    
    Question: {question}
    Helpful Answer:"""

prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
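
To see what the placeholder substitution amounts to, the sketch below fills the same template text with plain `str.format()`, which uses the same `{name}` syntax as `PromptTemplate`'s default f-string format. The chunk text is a made-up stand-in for what the retriever would supply, and joining chunks with blank lines is an illustrative assumption rather than what the chain in this notebook does.

```python
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:"""

# Made-up stand-ins for text that the retriever would supply
chunks = [
    'Use a standard application key for day-to-day access.',
    'The master application key has full access to the account.',
]

filled = template.format(
    context='\n\n'.join(chunks),
    question='When would you use a master application key?',
)
print(filled)
```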

Now you have all the ingredients to build a chain! You can see how the context and question are fed into the prompt, the resulting prompt is fed into the model, and the model's output is fed into a `StrOutputParser()` to produce a string.

In [17]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | model
        | StrOutputParser()
)
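
The `|` operator is what links each component's output to the next component's input. The toy sketch below imitates that behavior in plain Python to make the data flow visible. It is not LangChain's implementation: `Step` and the `toy_` objects are hypothetical stand-ins for the retriever, prompt, model, and output parser.

```python
class Step:
    """A toy stand-in for a chain component; not LangChain's implementation."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b produces a new step that runs a, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stand-ins for the real components
toy_inputs = Step(lambda q: {'context': f'<chunks relevant to: {q}>', 'question': q})
toy_prompt = Step(lambda d: f"{d['context']}\nQuestion: {d['question']}")
toy_model = Step(lambda text: {'text': f'ANSWER BASED ON [{text}]'})
toy_parser = Step(lambda output: output['text'])

toy_chain = toy_inputs | toy_prompt | toy_model | toy_parser
print(toy_chain.invoke('What is an application key?'))
```

The first step plays the role of the dictionary in the real chain: it passes the question through unchanged while also using it to look up context.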

Let's feed a few questions through the chain. Note that this chain does not implement chat history, so each question must be self-contained. Feel free to edit the questions and see if you can stump the chatbot!

In [20]:
questions = [
    'What is the difference between the master application key and a standard application key?',
    'What are best practices for working with application keys?',
    'Tell me about event notifications in Backblaze B2'
]

for question in questions:
    answer = chain.invoke(question)
    print(f'\n{question}\n{answer}\n')


What is the difference between the master application key and a standard application key?
 The master application key provides complete access to your Backblaze B2 Cloud Storage account, while a standard application key has specific file prefixes, limited capabilities, can expire, and can be deleted without disrupting other users.


What are best practices for working with application keys?
 Use a standard application key when access is limited to what a user needs; use a master application key only in situations where that level of access is necessary; ensure the key string and ID are securely saved after creation.


Tell me about event notifications in Backblaze B2
 Event Notifications are a feature of Backblaze B2 Cloud Storage that allow users to receive real-time updates on events happening within their storage. These notifications can be set up using rules, which specify the type of event and the action to take when it occurs. The rules also include options for enabling or disab

## Next Steps

Congratulations! You have a chatbot that answers questions based on context you provided. Try experimenting with chunk size, overlap, and the maximum context window, and observe how the model behaves. You can even swap out the model: GPT4All supports a [range of alternative models](https://docs.gpt4all.io/gpt4all_desktop/models.html), or you can use a different model framework entirely.