# **SIERRA ⛰️: Semantic Information Encoding, Retrieval, and Reasoning Agent**

*- [Adam Muhtar](mailto:adam.muhtar@bankofengland.co.uk)*

---

The ability of Large Language Models (LLMs) to generate text rests on training over enormous corpora, often mined from the public internet. While these training corpora are huge, they are often 'general' in nature, so LLMs may be less effective at generating text for domain-specific prompts. In other words, the parametric memory of an LLM is not adapted for domain-specific tasks.

A pipeline called Retrieval Augmented Generation (RAG) can be used to address this limitation. RAG retrieves information from material provided by the user (also called the source memory), which often lies outside the foundation model's parametric memory, and augments the LLM's input with the retrieved information. This ensures the LLM uses contextually relevant information when generating responses to the user's query. Just as humans consult source materials to give the best-quality answers, this system replicates that process for LLMs.

This notebook details SIERRA, a RAG system that performs semantic information retrieval from user-provided documents and feeds the results into an LLM. The first stage of SIERRA extracts the content of the user-provided documents and maps the text onto a semantic representation (vector embeddings) of its latent information. This encoded knowledge is then indexed for efficient retrieval, enabling the system to rapidly locate pertinent information in response to user queries. After retrieval, SIERRA uses an LLM to generate coherent and relevant responses based on the retrieved information. The system can also trace and report the source of the information used in these responses, supporting transparency and credibility.

This combination of technologies is a step forward towards building a sophisticated tool for interpreting and synthesising information, ideally one that is capable of providing users with accurate, sourced answers to a wide range of domain-specific questions.
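
As a rough illustration of the retrieve-then-generate flow described above, the sketch below uses a toy word-overlap score in place of a real embedding model and stops short of calling an LLM. The names `toy_corpus`, `tokenise`, `retrieve`, and `build_prompt` are hypothetical and are not part of SIERRA itself.

```python
import re

# Toy corpus (made-up text): each entry keeps its text with source metadata
toy_corpus = [
    {"text": "The PRA supervises banks and insurers.", "source": "doc-a", "page": 3},
    {"text": "Fees levied on firms fund the regulator.", "source": "doc-b", "page": 7},
    {"text": "Climate risk is a supervisory priority.", "source": "doc-a", "page": 12},
]


def tokenise(text: str) -> set:
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list, top_n: int = 1) -> list:
    """Rank entries by the number of words shared with the query (toy scoring)."""
    query_words = tokenise(query)
    ranked = sorted(
        corpus,
        key=lambda entry: len(query_words & tokenise(entry["text"])),
        reverse=True,
    )
    return ranked[:top_n]


def build_prompt(query: str, results: list) -> str:
    """Place the retrieved text and its metadata ahead of the user's question."""
    context = "\n".join(
        f"Text: {r['text']} (Source: {r['source']}, Page: {r['page']})"
        for r in results
    )
    return f"Answer using only these search results:\n{context}\n\nQuestion: {query}"


query = "What are fees levied on firms used for?"
results = retrieve(query, toy_corpus)
prompt = build_prompt(query, results)
print(prompt)
```

Because the source and page travel with each retrieved text, the model can be asked to cite them, which is the basis of the source tracing described above.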

## **Table of Contents**

* [1. Notebook setup](#section-1)
* [2. Download corpus and create vector embedding database](#section-2)
* [3. Connect to Azure OpenAI resource](#section-3)
* [4. Testing retrieval-augmented generation (RAG)](#section-4)

## 1. Notebook Setup <a name="section-1"></a>

This notebook is run using [Google Colaboratory](https://colab.research.google.com/) (Colab) - Google's implementation of [Jupyter Notebooks](https://jupyter.org/). This notebook will require the following package(s) to be installed:
* `pymupdf==1.22.5`
* `openai==0.27.8`
* `sentence-transformers==2.2.2`
* `torch==2.0.1`

Running this Colab notebook will require hardware accelerators to access higher RAM runtimes; this instance runs on the Tesla T4 GPU (16 GB GDDR6 @ 320 GB/s) provided for free by Google.
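
One way to confirm that the runtime matches the pinned versions listed above is to query the installed package metadata. This is a minimal sketch using only the standard library; the `REQUIRED` mapping and the `check_versions` helper are hypothetical names introduced for illustration.

```python
from importlib import metadata

# Versions pinned in the package list above
REQUIRED = {
    "pymupdf": "1.22.5",
    "openai": "0.27.8",
    "sentence-transformers": "2.2.2",
    "torch": "2.0.1",
}


def check_versions(required: dict) -> dict:
    """Map each package name to (required_version, installed_version or None)."""
    report = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed)
    return report


for name, (wanted, installed) in check_versions(REQUIRED).items():
    status = "OK" if installed == wanted else "MISMATCH"
    print(f"{name}: required {wanted}, installed {installed} [{status}]")
```

Some builds report a local version suffix (for example a CUDA tag on `torch`), so a reported mismatch is worth reading before acting on it.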

In [1]:
# Standard library imports
import json
import locale
from pathlib import Path
import re
import textwrap

# Third-party imports
import fitz
import openai
from sentence_transformers import SentenceTransformer, util
import torch
from tqdm import tqdm



## 2. Download corpus and create vector embedding database <a name="section-2"></a>

This notebook makes use of several publicly available reports from the Prudential Regulation Authority (PRA) and a Bank of England staff working paper:
* [2022 PRA Annual Report](https://www.bankofengland.co.uk/prudential-regulation/publication/2022/june/pra-annual-report-2021-22)
* [2023 PRA Annual Report](https://www.bankofengland.co.uk/prudential-regulation/publication/2023/july/pra-annual-report-2022-23)
* [Sending firm messages: text mining letters from PRA supervisors to banks and building societies they regulate](https://www.bankofengland.co.uk/working-paper/2017/sending-firm-messages-text-mining-letters-from-pra-supervisors-to-banks-building-societies)

We then define several text pre-processing functions to help extract the texts from the PDFs.
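
To give a feel for the kind of cleaning involved, the sketch below applies a reduced version of these steps to a made-up string. The `normalise_block` helper is hypothetical and covers only quote normalisation and whitespace collapsing, not everything the notebook's own functions do.

```python
import re


def normalise_block(text: str) -> str:
    """Straighten curly quotes and collapse newlines, tabs and repeated spaces."""
    text = re.sub("[‘’]", "'", text)   # curly single quotes -> '
    text = re.sub("[“”]", '"', text)   # curly double quotes -> "
    text = re.sub(r"[\n\r\t\xa0]+", " ", text)  # line breaks, tabs, NBSP -> space
    return re.sub(r"\s+", " ", text).strip()


raw = "The PRA’s  “risk\nelements”\tare\xa0scored."
cleaned = normalise_block(raw)
print(cleaned)
```

PDF extraction tends to break sentences across lines, so collapsing whitespace before embedding keeps each text block as one continuous passage.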

In [2]:
def preprocess_text(
    text: str,
    encoding: bool = True,
    lowercase: bool = False,
    remove_newlines: bool = True
) -> str:
    """
    Takes in a string and removes newline characters, tab characters, excess
    whitespaces, as well as regularizing common unicode characters.

    Args:
        * text (`str`): Text to pre-process
        * encoding (`bool`): Convert non UTF-8 characters to UTF-8. Default is
        `True`.
        * lowercase (`bool`): Returns the processed string in lowercase if set
        to `True`. Default is `False`.
        * remove_newlines (`bool`): Removes all newline characters in string.
        Default is `True`.

    Returns:
        * `str`: Pre-processed text
    """
    # Fix apostrophes/quotation marks
    _text = re.sub("[‘’]", "'", text)
    _text = re.sub("[“”]", '"', _text)

    if encoding:
        # locale.getencoding() (Python 3.11+) may return e.g. "UTF-8", so
        # compare case-insensitively
        if locale.getencoding().lower() != "utf-8":
            # Fix encoding mismatch
            _text = _text.encode(
                encoding=locale.getencoding(), errors="ignore"
            ).decode(
                encoding="utf-8", errors="ignore"
            )
        # Replace HTML-escaped apostrophes
        _text = re.sub("(&\\\\#x27;|&#x27;)", "'", _text)

    # Remove newlines, tabs, non-breaking spaces, excess backslashes/whitespaces
    if remove_newlines:
        _text = re.sub("[\n\r]+", " ", _text)
    _text = re.sub("[\t\xa0]+", " ", _text)
    _text = re.sub(r"\\+", "", _text)
    _text = re.sub(r"\s+", " ", _text).strip()

    if lowercase:
        _text = _text.lower()

    return _text

def get_pdf_text_blocks(
    doc: fitz.Document, file_name: str, preprocess: bool = True
) -> list:
    """
    Extracts text from a PyMuPDF document, returning a list of dictionaries
    containing the text and associated metadata (the name of the PDF and the
    page number).

    Args:
        * doc (`fitz.Document`): The PyMuPDF Document object from which to
        extract text.
        * file_name (`str`): The name of the PDF file being processed.
        * preprocess (`bool`): Whether to preprocess the text. Default is True.

    Returns:
        * `list`: A list of dictionaries. Each dictionary contains:
            - "text": A string containing the preprocessed text block.
            - "source": A string with the name of the PDF file.
            - "page": An integer representing the page number in the PDF file
            from which the text block was extracted.
    """
    text_blocks = []
    for i, page in enumerate(doc):
        for x in page.get_text("blocks"):
            # Create a dictionary to hold text block and related metadata
            block_dict = {}
            block_dict["text"] = x[4]
            block_dict["source"] = file_name
            block_dict["page"] = i + 1 # page numbers start from 1

            # Only add blocks that are not empty
            if block_dict["text"].strip() != "":
                if preprocess:
                    # Preprocess text
                    block_dict["text"] = preprocess_text(block_dict["text"])
                text_blocks.append(block_dict)
    return text_blocks

We extract all text blocks within each downloaded PDF file and save them in a dictionary.
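
The sketch below illustrates the shape of the extracted records without needing a PDF. It assumes the tuple layout that PyMuPDF returns from `page.get_text("blocks")`, namely `(x0, y0, x1, y1, text, block_no, block_type)`. The tuples themselves are made up, and `blocks_to_records` is a hypothetical helper.

```python
# Made-up block tuples, one inner list per page, in the assumed layout
# (x0, y0, x1, y1, text, block_no, block_type)
toy_pages = [
    [
        (72.0, 72.0, 300.0, 90.0, "Annual Report\n", 0, 0),
        (72.0, 100.0, 300.0, 120.0, "   \n", 1, 0),  # whitespace-only block
    ],
    [
        (72.0, 72.0, 300.0, 90.0, "The PRA regulates banks.\n", 0, 0),
    ],
]


def blocks_to_records(pages: list, file_name: str) -> list:
    """Turn per-page block tuples into dictionaries of text and metadata."""
    records = []
    for page_number, blocks in enumerate(pages, start=1):
        for block in blocks:
            text = block[4]  # index 4 holds the block's text
            if text.strip():  # skip blocks that are empty or whitespace-only
                records.append(
                    {"text": text.strip(), "source": file_name, "page": page_number}
                )
    return records


records = blocks_to_records(toy_pages, file_name="toy-report")
for record in records:
    print(record)
```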

In [3]:
pdf_path = Path.cwd() / "sample_docs"

docs = {}
for i, path in tqdm(enumerate(pdf_path.rglob("*.pdf")), desc="Processing PDFs"):
    docs[i] = get_pdf_text_blocks(
        doc=fitz.open(path, filetype="pdf"),
        file_name=path.stem,
        preprocess=True
    )

Processing PDFs: 3it [00:00,  3.58it/s]


We then create a vector database of our corpus by creating sentence-level embeddings from extracted texts. This allows us to:
* encode extracted texts from documents as vector embeddings.
* store these embeddings and their associated metadata.
* perform semantic similarity searches on these embeddings.

For this step, we use the BAAI General Embedding (BGE) model, based on this model checkpoint: https://huggingface.co/BAAI/bge-base-en. At the time of writing, the BGE models are the highest-performing models on the Hugging Face [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) leaderboard.
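
Semantic similarity search over embeddings usually comes down to comparing vectors by cosine similarity, which is also the default scoring function of `util.semantic_search` in `sentence-transformers`. The sketch below shows the idea with made-up three-dimensional vectors standing in for real 768-dimensional embeddings. The labels and numbers are illustrative only.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three passages and one query
toy_embeddings = {
    "capital requirements": [0.9, 0.1, 0.0],
    "liquidity rules": [0.7, 0.3, 0.1],
    "staff canteen menu": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.0]

# Rank passages from most to least similar to the query
ranked = sorted(
    toy_embeddings,
    key=lambda label: cosine_similarity(query_embedding, toy_embeddings[label]),
    reverse=True,
)
print(ranked)
```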

In [4]:
# Define the local path of the pre-trained model to use
MODEL_PATH = Path.cwd().parent / "src" / "models" / "bge-base-en"

# Initialise a SentenceTransformer model with the specified pre-trained model
encoder = SentenceTransformer(MODEL_PATH)

# Initialise dictionary to store document embeddings and metadata
vectordb = {}
for i in range(0, len(docs)):
    # Extract text, source, and page information from each document in the set
    texts = [doc["text"] for doc in docs[i]]
    sources = [doc["source"] for doc in docs[i]]
    pages = [doc["page"] for doc in docs[i]]

    # Compute embeddings for extracted texts
    embeddings = encoder.encode(
        sentences=texts,
        convert_to_tensor=True,
        show_progress_bar=True
    )

    # Store the texts, sources, pages, and corresponding embeddings in vectordb
    vectordb[i] = {
        "texts": texts,
        "sources": sources,
        "pages": pages,
        "embeddings": embeddings
    }

Batches: 100%|██████████| 39/39 [01:34<00:00,  2.43s/it]
Batches: 100%|██████████| 36/36 [01:36<00:00,  2.69s/it]
Batches: 100%|██████████| 36/36 [01:41<00:00,  2.81s/it]


The full vector embeddings database has an ID for each document, with the following information associated with each ID:
* `texts`: A list of the texts extracted from the PDF, in order of appearance in the source document, with each element stored as a string.
* `sources`: A list of the source file name associated with each extracted text. These are identical for all texts with the same document ID.
* `pages`: A list of integers giving the page of the source document on which each text appears. Pages are counted from the literal first page of the document.
* `embeddings`: A tensor with one row per text, each row being the 768-dimensional vector embedding of the corresponding text.

We can export these embeddings into a machine-readable format (e.g. a JSON file) using the code in the cell below.

In [5]:
# Convert tensors to lists as tensors are not JSON serialisable
for i in range(0, len(vectordb)):
    vectordb[i]["embeddings"] = vectordb[i]["embeddings"].tolist()

# Save outputs of encoding process as JSON file
vectordb_path = Path.cwd() / "data" / "vector-database.json"
with open(vectordb_path, "w") as json_file: 
    json.dump(vectordb, json_file)

To load a specific vector database checkpoint, we use the code in the cell below.
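
One detail of the JSON round trip is worth showing: JSON object keys are always strings, so integer document IDs come back as `"0"`, `"1"`, and so on, and need converting after loading. The sketch below demonstrates this with a made-up two-document database, where `toy_vectordb` is illustrative only.

```python
import json

# Made-up database with integer document IDs and tiny "embeddings"
toy_vectordb = {
    0: {"texts": ["first passage"], "embeddings": [[0.1, 0.2]]},
    1: {"texts": ["second passage"], "embeddings": [[0.3, 0.4]]},
}

# Serialise and reload, as a JSON file on disk would
reloaded = json.loads(json.dumps(toy_vectordb))
print(list(reloaded.keys()))  # keys are now strings

# Convert the document IDs back into integers
restored = {int(key): value for key, value in reloaded.items()}
print(list(restored.keys()))
```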

In [6]:
# Define the local path of the pre-trained model to use
MODEL_PATH = Path.cwd().parent / "src" / "models" / "bge-base-en"

# Initialise a SentenceTransformer model with the specified pre-trained model
encoder = SentenceTransformer(MODEL_PATH)

# Load outputs of previous encoding jobs from JSON file checkpoint
vectordb_path = Path.cwd() / "data" / "vector-database.json"
with open(vectordb_path) as json_file:
    vectordb = json.load(json_file)

# Convert document ids into integer forms
vectordb = {int(key): value for key, value in vectordb.items()}

# Convert embeddings back to a single 2D tensor per document
for i in range(0, len(vectordb)):
    vectordb[i]["embeddings"] = torch.tensor(vectordb[i]["embeddings"])

## 3. Connect to Azure OpenAI resource <a name="section-3"></a>

All API requests to the Azure OpenAI resource must include the API Key generated at the point of resource creation, along with the resource endpoint.
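
To avoid typing the key and endpoint directly into a notebook cell, they can be read from environment variables instead. The sketch below is one possible approach. The variable names `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT` and the `load_azure_openai_settings` helper are assumptions made for illustration, not names required by the `openai` package.

```python
import os


def load_azure_openai_settings(env=None) -> dict:
    """Read the API key and endpoint from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    # Hypothetical variable names; use whatever your deployment defines
    names = {"api_key": "AZURE_OPENAI_API_KEY", "api_base": "AZURE_OPENAI_ENDPOINT"}
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {field: env[var] for field, var in names.items()}


# Demonstration with a plain dictionary standing in for the real environment
settings = load_azure_openai_settings(
    {
        "AZURE_OPENAI_API_KEY": "dummy-key",
        "AZURE_OPENAI_ENDPOINT": "https://example.invalid/",
    }
)
print(sorted(settings))
```

The returned values could then be assigned to `openai.api_key` and `openai.api_base`, keeping credentials out of the saved notebook.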

In [7]:
# Populate with relevant details from Azure OpenAI service
openai.api_key = ""
openai.api_base = ""
openai.api_type = "azure"
openai.api_version = "2023-05-15"

We then define several functions to:
* Run a semantic search on the corpus to retrieve contextually relevant information.
* Create a system prompt that utilises the retrieved information.
* Feed the engineered prompt as input to the LLM.
* Print the RAG-LLM response with the corresponding source metadata.
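
Because each document has its own set of embeddings, search hits are first computed per document and then merged into one global ranking. The sketch below illustrates that merge step on made-up hits in the `{"corpus_id": ..., "score": ...}` form returned by `util.semantic_search`. The `merge_hits` helper and the toy data are hypothetical.

```python
def merge_hits(
    hits_per_doc: dict, texts_per_doc: dict, min_words: int = 20, top_n: int = 3
) -> list:
    """Drop short texts, then return the top_n (doc_id, corpus_id, score) overall."""
    candidates = []
    for doc_id, hits in hits_per_doc.items():
        for hit in hits:
            text = texts_per_doc[doc_id][hit["corpus_id"]]
            # Keep only texts longer than the word threshold (e.g. not headings)
            if len(text.split()) > min_words:
                candidates.append((doc_id, hit["corpus_id"], hit["score"]))
    # Sort all surviving hits across documents by score, highest first
    return sorted(candidates, key=lambda c: c[2], reverse=True)[:top_n]


# Made-up texts and similarity scores for two documents
toy_texts = {
    0: ["short heading", "a longer passage about capital buffers held by firms"],
    1: ["another long passage about liquidity risk management practices"],
}
toy_hits = {
    0: [{"corpus_id": 0, "score": 0.95}, {"corpus_id": 1, "score": 0.80}],
    1: [{"corpus_id": 0, "score": 0.85}],
}

top_results = merge_hits(toy_hits, toy_texts, min_words=3, top_n=2)
print(top_results)
```

The highest-scoring hit here is a two-word heading, which the length filter removes. This is the reason for applying a minimum word count before ranking.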

In [8]:
def semantic_search(
    query: str,
    encoder: SentenceTransformer,
    vectordb: dict,
    min_results_length: int = 20,
    top_n: int = 3,
    metadata: bool = True
):
    """
    Perform semantic search using a query against a vector database.

    Args:
        * query (`str`): The query string for semantic search.
        * encoder (`SentenceTransformer`): A SentenceTransformer model.
        * vectordb (`dict`): A dictionary containing embeddings, texts, sources,
        and pages.
        * min_results_length (`int`, optional): Minimum number of words in a
        valid search result text. Default is 20.
        * top_n (`int`, optional): Number of top results to return. Default is 3.
        * metadata (`bool`, optional): Whether to return search metadata as list.
        Default is True.

    Returns:
        * `str`: Text containing top search results with text, source, and page
        information.
        * `list`: A list of dictionaries containing the source and page of each
        search result. Only returned if `metadata` is True.
    """
    # Encode the query into a vector using SentenceTransformer
    question_embedding = encoder.encode(query, convert_to_tensor=True)

    # Perform semantic search for each entry in the vector database
    hits = {}           # Store intermediate semantic search results
    valid_hits = {}     # Store valid results based on min_results_length
    for i in range(0, len(vectordb)):
        hits[i] = util.semantic_search(
            query_embeddings=question_embedding,
            corpus_embeddings=vectordb[i]["embeddings"],
            top_k=32
        )
        hits[i] = hits[i][0]
        hits[i] = sorted(hits[i], key=lambda x: x["score"], reverse=True)

    # Filter valid search results based on min_results_length
    for i in range(0, len(vectordb)):
        temp = []
        for hit in hits[i]:
            if len(vectordb[i]["texts"][hit["corpus_id"]].split(" ")) > min_results_length:
                temp.append((hit["corpus_id"], hit["score"]))
        valid_hits[i] = temp

    # Flatten and sort valid search results
    flattened_valid_hits = [[key, value] for key, values in valid_hits.items() for value in values]
    sorted_hits = sorted(flattened_valid_hits, key=lambda x: x[1][1], reverse=True)
    top_n_results = sorted_hits[:top_n]

    # Generate and format search result strings
    retrieved_info = ""
    for i, result in enumerate(top_n_results):
        retrieved_info += (
            f"SEARCH RESULT {i+1}:\n"
            + f"Text: {vectordb[result[0]]['texts'][result[1][0]]}\n"
            + f"Source: {vectordb[result[0]]['sources'][result[1][0]]}\n"
            + f"Page: {vectordb[result[0]]['pages'][result[1][0]]}\n"
        )

    if metadata:
        retrieved_info_metadata = []
        for result in top_n_results:
            temp = {
                "source": vectordb[result[0]]["sources"][result[1][0]],
                "page": vectordb[result[0]]["pages"][result[1][0]]
            }
            retrieved_info_metadata.append(temp)
        return retrieved_info, retrieved_info_metadata
    else:
        return retrieved_info

def sierra_speak(
    temperature: float = 0.0,
    max_tokens: int = 2000,
    deployment_name: str = "hjoai-gpt4"
):
    """
    This function prompts the user for a question, runs a semantic search to
    retrieve the most contextually relevant information and feeds both the
    question and semantic search results to the large language model. The
    response aims to provide helpful and accurate information based on the
    search results.

    Args:
        * temperature (`float`, optional): The randomness of the output. Higher
        values result in more randomness. Default is 0.0.
        * max_tokens (`int`, optional): The maximum number of tokens in the
        response. Default is 2000.
        * deployment_name (`str`, optional): The name or ID of Azure OpenAI
        model deployment.

    Returns:
        The function displays the generated response and relevant metadata to
        the console.

    Note:
        * The function uses the `semantic_search` function to retrieve relevant
        information based on the user's question.
        * The generated response is formatted and printed to the console.
        * Metadata about the retrieved information, including source and page
        numbers, is displayed in the console.

    Example:
        >>> sierra_speak()
        Question: What does PRA stand for?
        ...
        [Generated AI response]
        ...
        * Source: [Source name] | Page: [Page number]
        ...
    """
    question = input("Question: ")
    print(f"Question: {question}")
    print("-"*100)
    retrieved_info, metadata = semantic_search(
        query=question, encoder=encoder, vectordb=vectordb
    )
    system_prompt = f"""You are a helpful, respectful and honest assistant to
        the Bank of England, a central bank and financial regulator.
        Always answer as helpfully as possible using the search results provided,
        which includes the search results text, source, and page number,
        delimited by triple backticks:
        ```{retrieved_info}```

        Include the source and page number in your answer. If the search results
        do not adequately answer the query provided, do not use them. If no search
        results are relevant, say so.

        If a question does not make any sense, or is not factually coherent,
        explain why instead of answering something not correct. If you don't know
        the answer to a question, please do not share false information and say
        you don't know."""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]

    completion = openai.ChatCompletion.create(
        engine=deployment_name,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens
    )
    
    # Split the text into paragraphs based on newlines, wrap each element
    llm_response = completion["choices"][0]["message"]["content"]
    paragraphs = llm_response.split("\n")
    wrapped_paragraphs = []
    for paragraph in paragraphs:
        # Wrap each paragraph individually, maintaining existing newlines
        wrapped_paragraph = "\n".join(
            textwrap.fill(line, width=100) for line in paragraph.splitlines()
        )
        wrapped_paragraphs.append(wrapped_paragraph)
    formatted_response = "\n".join(wrapped_paragraphs)
    
    print(formatted_response)
    print("-"*100)
    for data in metadata:
        print(f"* Source: {data['source']} | Page: {data['page']}")
    print("-"*100)

## 4. Testing retrieval-augmented generation (RAG) <a name="section-4"></a>

This section tests the Q&A pipeline by asking the following questions:
* How is the PIF score determined?
* What is the PRA's stance on artificial intelligence?
* What is the PRA's risk model?
* What are the top concerns of the PRA in 2023?
* What is the fee levied on firms used for?
* Give some examples of international engagements
* What has the PRA done to improve growth and competitiveness?
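
Since the question is read interactively with `input()`, each question has to be typed in by hand. For repeatable runs, one option is to patch `input` so that a list of questions is fed in automatically. The sketch below shows the mechanism with a hypothetical stand-in function `ask`, which only echoes the question rather than calling the retrieval pipeline or the LLM.

```python
from unittest.mock import patch


def ask() -> str:
    """Hypothetical stand-in for a function that reads a question via input()."""
    question = input("Question: ")
    return f"Received: {question}"


questions = [
    "How is the PIF score determined?",
    "What is the PRA's risk model?",
]

# Each call to input() inside the block returns the next question in the list
with patch("builtins.input", side_effect=questions):
    answers = [ask() for _ in questions]

for answer in answers:
    print(answer)
```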

In [9]:
sierra_speak()

Question: How is the PIF score determined?
----------------------------------------------------------------------------------------------------
The Proactive Intervention Framework (PIF) score is determined by considering a firm's risk
elements, which include External Context, Business Risk, Management and Governance, Risk Management
and Controls, Capital, and Liquidity. Supervisors use a ten-point scale to score a firm along each
of these risk elements, with 1 indicating the lowest risk to safety and soundness, and 10 the
highest. The PIF stages run from 1 to 5, with 1 signifying low risks to the viability of the firm,
and 5 a firm that is in resolution or being actively wound down.

While PIF staging takes into account the risk element scores, supervisors use judgement when
deciding the weight applied to each. In other words, the PIF stage is not simply a summation and
average of the risk element scores, but is the product of a more complex deliberation, reflecting
the PRA's emphasis

In [10]:
sierra_speak()

Question: What is the PRA's stance on artificial intelligence?
----------------------------------------------------------------------------------------------------
The PRA (Prudential Regulation Authority), together with the Bank of England and the FCA (Financial
Conduct Authority), has been actively working to understand the potential benefits and risks of
artificial intelligence (AI) and machine learning (ML) in financial services. In October 2022, they
published a discussion paper (DP) on AI and ML, which outlines their views on these technologies,
how the current regulatory framework applies to them, where additional clarification of existing
regulation may be helpful, and how policy can best support further AI and ML adoption [83]. This DP
is part of the PRA's wider program of work on AI, which also includes the AI Public-Private Forum
(the final report of which was published in February 2022) [84] and a survey of machine learning in
UK financial services (published in October 202

In [11]:
sierra_speak()

Question: What is the PRA's risk model?
----------------------------------------------------------------------------------------------------
The PRA's Risk Model is a framework used by supervisors to assess the risks posed by firms to the
PRA's objectives. The Risk Model has two high-level aspects:

1. Gross Risk: This comprises the Potential Impact a firm's failure would have on the financial
system; macroeconomic and other risks to which the firm is exposed (External Context); and risks
inherent in the firm's business model and corporate structure (Business Risk).

2. Mitigating Factors: These offset the risks mentioned above and include Management and Governance,
Risk Management and Controls, Capital, Liquidity, and Resolvability.

This information can be found in the source "sending-firm-messages-text-mining-letters-from-PRA-
supervisors-to-banks-and-building-societies" on page 5.
----------------------------------------------------------------------------------------------------
*

In [12]:
sierra_speak()

Question: What are the top concerns of the PRA in 2023?
----------------------------------------------------------------------------------------------------
I cannot provide a specific list of the top concerns of the PRA in 2023, as the search results do
not provide that information. However, one of the issues mentioned in the search results is dealing
with insurers in financial difficulties, as outlined in the PRA's publication CP3/23 in February
2023 (source: pra-2023, page: 50). For a comprehensive list of concerns, I would need more
information or additional search results.
----------------------------------------------------------------------------------------------------
* Source: pra-2023 | Page: 61
* Source: pra-2023 | Page: 50
* Source: pra-2022 | Page: 59
----------------------------------------------------------------------------------------------------


In [13]:
sierra_speak()

Question: What is the fee levied on firms used for?
----------------------------------------------------------------------------------------------------
The fee income generated from regulated firms is used for the functions covered by the statutory
framework that the Prudential Regulation Authority (PRA) operates within. The PRA's budget covers
its support costs, as well as support costs charged by the Bank of England, including those for
central functions such as technology, finance, and human resources (Source: pra-2023, Page: 19).
----------------------------------------------------------------------------------------------------
* Source: pra-2022 | Page: 16
* Source: pra-2023 | Page: 17
* Source: pra-2023 | Page: 19
----------------------------------------------------------------------------------------------------


In [14]:
sierra_speak()

Question: Give some examples of international engagements
----------------------------------------------------------------------------------------------------
I found an example of international engagement involving the Prudential Regulation Authority (PRA).
The PRA has participated in international fora, including the Financial Stability Board (FSB), the
Basel Committee on Banking Supervision (BCBS), and the International Association of Insurance
Supervisors (IAIS). Their engagement focused on identifying and implementing internationally agreed
standards in banking and insurance (Source: pra-2023, Page: 51).
----------------------------------------------------------------------------------------------------
* Source: pra-2023 | Page: 51
----------------------------------------------------------------------------------------------------


In [15]:
sierra_speak()

Question: What has the PRA done to improve growth and competitiveness?
----------------------------------------------------------------------------------------------------
The PRA (Prudential Regulation Authority) works towards facilitating effective competition in the
financial sector. They aim to enable a dynamic and competitive market where entrants can join and
leave with minimal disruption, including through a solvent exit which does not require the use of an
insolvency or resolution process where appropriate (Source: pra-2023, Page: 50). More information on
how the PRA is meeting its objective to facilitate effective competition can be found in the Annual
Competition Report (Source: pra-2022, Page: 43). Additionally, under the FSM Bill, currently before
Parliament, the PRA has a proposed new secondary objective to facilitate international
competitiveness of the UK economy and its growth (Source: pra-2023, Page: 14).
----------------------------------------------------------------