# Prepare PDF Documents for RAG 

*Code adapted from: https://github.com/Azure-Samples/azure-search-openai-demo*

## Install Dependencies

In [1]:
%pip install -r requirements.txt

DEPRECATION: omegaconf 2.0.6 has a non-standard dependency specifier PyYAML>=5.1.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of omegaconf or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
Note: you may need to restart the kernel to use updated packages.


## Import

In [3]:
import base64
import os
import re

import openai

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import *
from pypdf import PdfReader, PdfWriter

import dotenv 
#load the environment variables of .env file
%load_ext dotenv
%dotenv


## Setup Search Index

- A search index organizes and structures data so that a search engine can return relevant results. Indexing can turn many data and file types into searchable data.

    - In an abstract sense, we are creating a dictionary that maps a key to some values. In our case the embedding is the key and the other fields are the values.

- If we use text search instead of vector search, the search index is stored in a structure called an **inverted index**. Read more at: https://www.geeksforgeeks.org/inverted-index/
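
To make the inverted-index idea concrete, here is a minimal sketch in plain Python. It only illustrates the concept and is not how Azure Cognitive Search implements its index; the function name `build_inverted_index` and the sample documents are made up for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical sample documents (not taken from the PDF used in this notebook)
docs = {
    "doc1": "robotic fish project",
    "doc2": "smart walker project",
}
inverted = build_inverted_index(docs)

# Looking up a keyword returns the documents that mention it
print(sorted(inverted["project"]))  # ['doc1', 'doc2']
print(sorted(inverted["fish"]))     # ['doc1']
```

A keyword lookup is then a single dictionary access, which is why text search engines favour this layout.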

In [4]:
# Setup the required credentials for using Azure cognitive search
search_endpoint = f"https://{os.getenv('AZURE_SEARCH_SERVICE')}.search.windows.net/"
search_creds = AzureKeyCredential(os.getenv("AZURE_SEARCH_KEY"))
index_client = SearchIndexClient(endpoint= search_endpoint, credential=search_creds)

index_name = os.getenv("AZURE_SEARCH_INDEX")

if index_name not in index_client.list_index_names():
    # Define a search index
    index = SearchIndex(
                name=index_name,
                fields=[
                    SimpleField(name="id", type="Edm.String", key=True),                                              #Unique id field
                    SearchableField(name="content", type="Edm.String", analyzer_name="en.microsoft"),                 #Content field
                    SearchField(name="embedding", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),  
                                hidden=False, searchable=True, filterable=False, sortable=False, facetable=False,
                                vector_search_dimensions=1536, vector_search_configuration="default"),                #Searchable field: Embedding
                    SimpleField(name="sourcepage", type="Edm.String", filterable=True, facetable=True),               #Blob name field
                    SimpleField(name="sourcefile", type="Edm.String", filterable=True, facetable=True)                #File name field
                ],
                
                # Additional semantic settings, in case we want to use the semantic search feature in the future
                semantic_settings=SemanticSettings(
                    configurations=[SemanticConfiguration(
                        name='default',
                        prioritized_fields=PrioritizedFields(title_field=None, prioritized_content_fields=[SemanticField(field_name='content')]))]),
                # Vector search configuration: HNSW algorithm with the cosine metric
                vector_search=VectorSearch(
                    algorithm_configurations=[
                        VectorSearchAlgorithmConfiguration(
                            name="default",
                            kind="hnsw",
                            hnsw_parameters=HnswParameters(metric="cosine")
                        )
                    ]
                )
    )
    # Create the search index
    index_client.create_index(index)
else:
    print(f"Index: '{index_name}' already exists")

HttpResponseError: (ResourceNameAlreadyInUse) Cannot create index 'test-test' because it already exists.
Code: ResourceNameAlreadyInUse
Message: Cannot create index 'test-test' because it already exists.
Exception Details:	(CannotCreateExistingIndex) Cannot create index 'test-test' because it already exists.
	Code: CannotCreateExistingIndex
	Message: Cannot create index 'test-test' because it already exists.

## Setup Embedding model

- Embedding is the process of turning text into a high-dimensional vector, where each dimension represents a feature of the text. The semantics of the text can therefore be embedded in the vector representation.

- Here we wrap the OpenAI Embedding API call in a function.
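
As a rough illustration of why vectors can carry meaning, the sketch below compares made-up 3-dimensional vectors with cosine similarity, the same metric the index above is configured with. The vectors and the `cosine_similarity` helper are assumptions for this example; real embeddings in this notebook have 1536 dimensions and come from the embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up "embeddings": the first two point in a similar direction
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

# Texts with similar meaning should get a higher similarity score
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
```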

In [5]:
# Setup the required credential for using Azure OpenAI
openai.api_type = "azure"
openai.api_key = os.getenv("AZURE_OPENAI_KEY")      
openai.api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
openai.api_version = "2023-05-15"

def compute_embedding(text):
    # "embedding" is the name of the Azure OpenAI deployment of the embedding model
    return openai.Embedding.create(engine="embedding", input=text)["data"][0]["embedding"]

## Extract data from documents

- We use the Python library **pypdf** (its `PdfReader` class) to extract the text of our PDF document and store it as a list of `(page_num, offset, page_text)` tuples
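
The extracted text is kept as a list of `(page_num, offset, page_text)` tuples, where `offset` is the number of characters that precede the page in the whole document. The sketch below shows, on made-up page texts, how such a list is built and how an offset can be mapped back to a page. The helper names `build_page_map` and `page_of` are hypothetical, and the `bisect` lookup is just one possible way to do it.

```python
from bisect import bisect_right

def build_page_map(page_texts):
    """Return a list of (page_num, offset, page_text) tuples."""
    page_map, offset = [], 0
    for page_num, text in enumerate(page_texts):
        page_map.append((page_num, offset, text))
        offset += len(text)
    return page_map

def page_of(page_map, char_offset):
    """Find the page whose text contains the given character offset."""
    starts = [entry[1] for entry in page_map]
    return page_map[bisect_right(starts, char_offset) - 1][0]

# Hypothetical page texts standing in for the PdfReader output
demo_map = build_page_map(["first page ", "second page ", "third page"])
print([(num, off) for num, off, _ in demo_map])  # [(0, 0), (1, 11), (2, 23)]
print(page_of(demo_map, 12))                     # 1
```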

In [7]:
filename="./data/" + "The_Innovation_Wings.pdf" #Change to the name of your file (make sure the file name does not include any spaces)

offset = 0       #The character count from the start of the document
page_map = []    #List of tuples: (page_num, offset, page_text)

print(f"Extracting text from '{filename}' using PdfReader")

reader = PdfReader(filename)
pages = reader.pages
for page_num, p in enumerate(pages):
    page_text = p.extract_text()
    page_map.append((page_num, offset, page_text))
    offset += len(page_text)
    
page_map

Extracting text from './data/The_Innovation_Wings.pdf' using PdfReader


[(0,
  0,
  'The Innovation Wings The Tam Wing Fan Innovation Wing One, also known as innovation wing, or innowing for short provides an open environment to foster interdisciplinary innovations among undergraduate students and teachers in Engineering and Technology. The provision of state-of-the-art facilities in a collaborative space will enable curriculum innovations that emphasize hands-on and experiential learning activities. The Innovation Wing serves as a platform to engage the young generation to explore the world, create opportunities for them to learn about the needs of the underprivileged, and acquire practical hands-on experience in developing solutions with real-world impact.  The Tam Wing Fan Innovation Wing Two aims to serve as an enabling platform for Engineering researchers to interact and collaborate synergistically with researchers and professionals across various disciplines to tackle grand challenges and deliver research outputs with significant impact to Hong Kong 

## Section the extracted text

- Sectioning means splitting the text into smaller segments before indexing it. This brings a few benefits:

    - Improved efficiency: vector search is faster and more efficient.

    - Enhanced precision: by indexing smaller text segments, the search engine can capture the fine-grained semantic information present in the data.
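
Stripped of the sentence and word boundary handling used in the cell below, sectioning amounts to a sliding window with overlap. The following is a simplified sketch under that assumption; `chunk_with_overlap` is a hypothetical helper and is not the `split_text` function of this notebook.

```python
def chunk_with_overlap(text, max_length, overlap):
    """Split text into windows of at most max_length characters, where each
    window starts with the last `overlap` characters of the previous one."""
    if overlap >= max_length:
        raise ValueError("overlap must be smaller than max_length")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_length])
        if start + max_length >= len(text):
            break  # the last window reached the end of the text
        start += max_length - overlap
    return chunks

chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", max_length=10, overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence cut at a window edge still appears whole in at least one neighbouring section, as long as it is shorter than the overlap.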
    

In [8]:
MAX_SECTION_LENGTH = 1000
SENTENCE_SEARCH_LIMIT = 100
SECTION_OVERLAP = 100


def filename_to_id(filename): 
    filename_ascii = re.sub("[^0-9a-zA-Z_-]", "_", filename)
    filename_hash = base64.b16encode(filename.encode('utf-8')).decode('ascii')
    return f"file-{filename_ascii}-{filename_hash}"

def split_text(page_map):
    SENTENCE_ENDINGS = [".", "!", "?"]
    WORDS_BREAKS = [",", ";", ":", " ", "(", ")", "[", "]", "{", "}", "\t", "\n"]

    def find_page(offset):
        l = len(page_map)
        for i in range(l - 1):
            if offset >= page_map[i][1] and offset < page_map[i + 1][1]:
                return i
        return l - 1

    all_text = "".join(p[2] for p in page_map)
    length = len(all_text)
    start = 0
    end = length
    while start + SECTION_OVERLAP < length:
        last_word = -1
        end = start + MAX_SECTION_LENGTH

        if end > length:
            end = length
        else:
            # Try to find the end of the sentence
            while end < length and (end - start - MAX_SECTION_LENGTH) < SENTENCE_SEARCH_LIMIT and all_text[end] not in SENTENCE_ENDINGS:
                if all_text[end] in WORDS_BREAKS:
                    last_word = end
                end += 1
            if end < length and all_text[end] not in SENTENCE_ENDINGS and last_word > 0:
                end = last_word # Fall back to at least keeping a whole word
        if end < length:
            end += 1

        # Try to find the start of the sentence or at least a whole word boundary
        last_word = -1
        while start > 0 and start > end - MAX_SECTION_LENGTH - 2 * SENTENCE_SEARCH_LIMIT and all_text[start] not in SENTENCE_ENDINGS:
            if all_text[start] in WORDS_BREAKS:
                last_word = start
            start -= 1
        if all_text[start] not in SENTENCE_ENDINGS and last_word > 0:
            start = last_word
        if start > 0:
            start += 1

        section_text = all_text[start:end]
        yield (section_text, find_page(start))

        last_table_start = section_text.rfind("<table")
        if (last_table_start > 2 * SENTENCE_SEARCH_LIMIT and last_table_start > section_text.rfind("</table")):
            # If the section ends with an unclosed table, we need to start the next section with the table.
            # If table starts inside SENTENCE_SEARCH_LIMIT, we ignore it, as that will cause an infinite loop for tables longer than MAX_SECTION_LENGTH
            # If last table starts inside SECTION_OVERLAP, keep overlapping
            start = min(end - SECTION_OVERLAP, start + last_table_start)
        else:
            start = end - SECTION_OVERLAP
        
    if start + SECTION_OVERLAP < end:
        yield (all_text[start:end], find_page(start))

In [9]:
sections = []
file_id = filename_to_id(filename)
for i, (content, pagenum) in enumerate(split_text(page_map)):
    section = {
        "id": f"{file_id}-page-{i}",
        "content": content,
        "embedding": compute_embedding(content),
        "sourcepage": os.path.splitext(os.path.basename(filename))[0] + f"-{pagenum}" + ".pdf",
        "sourcefile": filename
    }
    sections.append(section)

## Index sections

- Uploading the sections to the search index we created earlier
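
Uploading in fixed-size batches keeps each request to the search service bounded. As a sketch of the batching pattern on its own, the hypothetical generator `batched` below slices a list into groups; the dictionaries are stand-ins for real sections, and no call to the search service is made here.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical stand-ins for section dictionaries
sections_demo = [{"id": str(n)} for n in range(7)]
sizes = [len(batch) for batch in batched(sections_demo, 3)]
print(sizes)  # [3, 3, 1]
```

Each yielded batch could then be passed to `search_client.upload_documents(documents=batch)`, as the cell below does with its own loop.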

In [10]:
search_client = SearchClient(endpoint=search_endpoint,
                                    index_name=os.getenv("AZURE_SEARCH_INDEX"),
                                    credential=search_creds)
i = 0
batch = []
#index 1000 sections at a time
for s in sections:
    batch.append(s)
    i += 1
    if i % 1000 == 0:
        results = search_client.upload_documents(documents=batch)
        succeeded = sum([1 for r in results if r.succeeded])
        print(f"\tIndexed {len(results)} sections, {succeeded} succeeded")
        batch = []
        
#index the remaining sections
if len(batch) > 0:
    results = search_client.upload_documents(documents=batch)
    succeeded = sum([1 for r in results if r.succeeded])
    print(f"\tIndexed {len(results)} sections, {succeeded} succeeded")


	Indexed 28 sections, 28 succeeded


## Split PDF into single page Blobs

- We create a PDF file for each page of the document so that we can cite individual pages in the future
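
The `sourcepage` value stored with each section and the per-page file names are built from the same expression, which is what allows a search hit to be traced back to a single-page PDF. One way to keep the two in sync is to factor the expression into a helper; `page_blob_name` below is a hypothetical name used only for this sketch.

```python
import os

def page_blob_name(filename, page_num):
    """Build the per-page file name, e.g. 'Report-3.pdf' for page 3 of 'Report.pdf'."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    return f"{stem}-{page_num}.pdf"

print(page_blob_name("./data/The_Innovation_Wings.pdf", 2))  # The_Innovation_Wings-2.pdf
```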

In [11]:
reader = PdfReader(filename)
pages = reader.pages
for i in range(len(pages)):
    blob_name = os.path.splitext(os.path.basename(filename))[0] + f"-{i}" + ".pdf"
    print(f"\tCreating blob for page {i} -> {blob_name}")
    writer = PdfWriter()
    writer.add_page(pages[i])
    writer.write("../database/"+blob_name)
    writer.write("../app/backend/static/database/"+blob_name)
    writer.write("../app/frontend/public/database/"+blob_name)
    writer.close()

	Creating blob for page 0 -> The_Innovation_Wings-0.pdf
	Creating blob for page 1 -> The_Innovation_Wings-1.pdf
	Creating blob for page 2 -> The_Innovation_Wings-2.pdf
	Creating blob for page 3 -> The_Innovation_Wings-3.pdf
	Creating blob for page 4 -> The_Innovation_Wings-4.pdf
	Creating blob for page 5 -> The_Innovation_Wings-5.pdf
	Creating blob for page 6 -> The_Innovation_Wings-6.pdf
	Creating blob for page 7 -> The_Innovation_Wings-7.pdf
	Creating blob for page 8 -> The_Innovation_Wings-8.pdf
	Creating blob for page 9 -> The_Innovation_Wings-9.pdf
	Creating blob for page 10 -> The_Innovation_Wings-10.pdf


## Searching the Index

- Searching the vector index using kNN
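
Conceptually, a kNN query ranks every stored vector by its similarity to the query vector and keeps the top k. The index above is configured with HNSW, which finds approximate nearest neighbours; the sketch below swaps in an exact brute-force search on made-up 2-dimensional vectors to show the idea. The helpers `cosine` and `knn` are assumptions for this example.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def knn(query_vector, documents, k):
    """Return the ids of the k documents most similar to the query, best first."""
    scored = ((cosine(query_vector, vec), doc_id) for doc_id, vec in documents.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

# Made-up document vectors keyed by a document id
documents = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
print(knn([0.9, 0.1], documents, k=2))  # ['a', 'b']
```

Brute force compares the query with every vector, so its cost grows linearly with the number of sections; approximate methods such as HNSW trade a little accuracy for much faster lookups on large indexes.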

In [12]:
query = " " #your query keywords
query_vector = compute_embedding(query)

def nonewlines(s: str) -> str:
    return s.replace('\n', ' ').replace('\r', ' ')

r = search_client.search(query, 
                        top=3, 
                        vector=query_vector, 
                        top_k=50, 
                        vector_fields="embedding")

results = [doc["sourcepage"] + ": " + nonewlines(doc["content"]) for doc in r]

for result in results:
    print(result)

The_Innovation_Wings-2.pdf: life, with three fundamental functionalities that do not exist or not well supported by (smart) walkers in the market: smart walking assistance; falling prevention and support; autonomous mobility. A set of mechanical, control, sensory, and AI technologies is being developed including: (1) novel walker mechanical structure with omnidirectional mobility and outrigger mechanisms; (2) dual-mode actuation and control for walking/standing support and fall prevention/ recovery; (3) multimodal sensory data collection through soft sensory skin, and data processing on device and in the cloud, for event detection and control such as user front following and fall detection; (4) sound-source localization for elderly localisation and auto-navigation of walker. SIG – BREED Robotics  BREED is a student group committed to developing and promoting bio-inspired technology. Our flagship VAYU project – the world’s fastest robotic fish – and our upcoming initiatives such as our 