# Load Files from Storage Account into CogSearch
### The files from the Storage Account are loaded into CogSearch using the following steps
- Establish a connection with Storage Account using the Python SDK.
- Retrieve the required files from Storage Account container using the file stream download method.
- Use Azure AI Document Intelligence to parse the PDF files from Storage Account.
- Divide the PDF into smaller, manageable page chunks for easier processing.
- Index the parsed chunks into Azure Cognitive Search.
- Repeat the process for all the required files.

#### Use the Azure Storage Python SDK to fetch the file stream, then use PyPDF and Document Intelligence to chunk the pages in memory and create a vector index
- Azure Python SDK https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/storage/azure-storage-blob
- Refer to https://github.com/MSUSAzureAccelerators/Azure-Cognitive-Search-Azure-OpenAI-Accelerator/blob/main/04-Complex-Docs.ipynb for loading large documents using PYPDF and Document Intelligence

Install the required packages and set up authentication

In [1]:
#pip install -r requirements.txt

In [1]:
from azure.identity import DefaultAzureCredential
# Synchronous clients only: the code below calls download_blob() and list_blobs() without await
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
import shutil
from PyPDF2 import PdfReader
import os
import json      # used to serialize the Azure Search REST payloads
import requests  # used to call the Azure Search REST API
from dotenv import load_dotenv
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
import html
import langchain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma, FAISS
from langchain import OpenAI, VectorDBQA
from langchain.chat_models import AzureChatOpenAI
from langchain.chat_models import ChatOpenAI
from tqdm import tqdm
import base64
import io

# Configure environment variables  
load_dotenv()

True

In [2]:
# Setup the Payloads header for cog search
headers = {'Content-Type': 'application/json','api-key': os.getenv('AZURE_SEARCH_KEY')}

# Set the ENV variables that Langchain needs to connect to Azure OpenAI
os.environ["OPENAI_API_BASE"] = os.getenv("AZURE_OPENAI_ENDPOINT")
os.environ["OPENAI_API_KEY"] = os.getenv("AZURE_OPENAI_API_KEY")
os.environ["OPENAI_API_VERSION"] = os.getenv("AZURE_OPENAI_API_VERSION")
os.environ["OPENAI_API_TYPE"] = "azure"
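The assignments above raise a `TypeError` when a variable is missing from `.env`, because `os.getenv` returns `None` and `os.environ` only accepts strings. A minimal sketch of a check that could run first is shown below. The helper name `missing_env_vars` is hypothetical; the variable names are the ones read elsewhere in this notebook.

```python
import os

# Variable names collected from the cells in this notebook.
REQUIRED_ENV_VARS = [
    "AZURE_SEARCH_KEY",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_STORAGE_CONNECTION_STRING",
    "AZURE_STORAGE_CONTAINER_NAME",
    "AZURE_STORAGE_BASE_URL",
    "FORM_RECOGNIZER_ENDPOINT",
    "FORM_RECOGNIZER_KEY",
]

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

# Typical use, right after load_dotenv():
# missing = missing_env_vars(REQUIRED_ENV_VARS)
# if missing:
#     raise RuntimeError(f"Missing environment variables: {missing}")
```

Failing early with the full list of missing names is usually easier to debug than a `TypeError` on the first unset variable.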


In [3]:
embedder = OpenAIEmbeddings(deployment="text-embedding-ada-002", chunk_size=1) 

#### Define a function to parse the PDF using PyPDF or Form Recognizer; the file content stream from the Storage Account is the source

In [5]:
def parse_pdf(filename, form_recognizer=False, formrecognizer_endpoint=None, formrecognizerkey=None, model="prebuilt-document", from_url=False, verbose=False):
    """Parses PDFs using PyPDF or Azure Document Intelligence SDK (former Azure Form Recognizer)"""
    
    page_map=[]
    #setup a blob client with authentication
    blob_client = BlobClient.from_connection_string(conn_str=os.getenv("AZURE_STORAGE_CONNECTION_STRING"), 
                                                    container_name=os.getenv("AZURE_STORAGE_CONTAINER_NAME"), blob_name=filename)
    #get stream 
    stream = blob_client.download_blob()
    file_content_stream = stream.readall()
    offset = 0
    
    if not form_recognizer:
        if verbose: print("Extracting text using PyPDF")
        # PdfReader expects a path or a file-like object, so wrap the downloaded bytes
        reader = PdfReader(io.BytesIO(file_content_stream))
        pages = reader.pages
        for page_num, p in enumerate(pages):
            page_text = p.extract_text()
            page_map.append((page_num, offset, page_text))
            offset += len(page_text)
    else:
        if verbose: print(f"Extracting text using Azure Document Intelligence")
        credential = AzureKeyCredential(os.getenv("FORM_RECOGNIZER_KEY"))
        form_recognizer_client = DocumentAnalysisClient(endpoint=os.getenv("FORM_RECOGNIZER_ENDPOINT"), credential=credential)
        
        if from_url:
            # Analysis from a URL is not wired up in this notebook; fail clearly
            # instead of reaching poller.result() with an undefined poller
            raise NotImplementedError("from_url=True is not supported; the blob stream is used as the source")

        # Analyze the bytes downloaded from the Storage Account
        poller = form_recognizer_client.begin_analyze_document(model, document=file_content_stream)

        form_recognizer_results = poller.result()

        for page_num, page in enumerate(form_recognizer_results.pages):
            tables_on_page = [table for table in form_recognizer_results.tables if table.bounding_regions[0].page_number == page_num + 1]

            # mark all positions of the table spans in the page
            page_offset = page.spans[0].offset
            page_length = page.spans[0].length
            table_chars = [-1]*page_length
            for table_id, table in enumerate(tables_on_page):
                for span in table.spans:
                    # replace all table spans with "table_id" in table_chars array
                    for i in range(span.length):
                        idx = span.offset - page_offset + i
                        if idx >=0 and idx < page_length:
                            table_chars[idx] = table_id

            # build page text by replacing characters in table spans with table html
            page_text = ""
            added_tables = set()
            for idx, table_id in enumerate(table_chars):
                if table_id == -1:
                    page_text += form_recognizer_results.content[page_offset + idx]
                elif table_id not in added_tables:
                    page_text += table_to_html(tables_on_page[table_id])
                    added_tables.add(table_id)

            page_text += " "
            page_map.append((page_num, offset, page_text))
            offset += len(page_text)

    return page_map    
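The table handling inside `parse_pdf` can be hard to follow in place, so here is the same idea on plain strings: mark every character that belongs to a table, then emit the rendered table once where its span starts. This is a simplified sketch, not the SDK's data model. `merge_tables_into_text` and `render_table` are hypothetical names, and each table is assumed to have a single `(offset, length)` span.

```python
def merge_tables_into_text(content, table_spans, render_table):
    """Replace each table span in content with render_table(table_id), once per table.

    table_spans is a list of (offset, length) pairs, one per table. This is a
    simplification: a real analysis result can report several spans per table.
    """
    # -1 means "ordinary text"; any other value is the id of the covering table
    table_chars = [-1] * len(content)
    for table_id, (start, length) in enumerate(table_spans):
        for idx in range(max(start, 0), min(start + length, len(content))):
            table_chars[idx] = table_id

    parts = []
    added_tables = set()
    for idx, table_id in enumerate(table_chars):
        if table_id == -1:
            parts.append(content[idx])           # keep ordinary characters
        elif table_id not in added_tables:
            parts.append(render_table(table_id))  # emit the table once
            added_tables.add(table_id)
    return "".join(parts)


text = "Intro A 1 B 2 Outro"
# Characters 6..12 ("A 1 B 2") are pretended to be a table
merged = merge_tables_into_text(text, [(6, 7)], lambda i: f"<table id={i}/>")
print(merged)
```

Collecting the pieces in a list and joining once avoids repeated string concatenation on long pages.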

In [6]:
# Convert a table extracted from the PDF into an HTML table string
def table_to_html(table):
    table_html = "<table>"
    rows = [sorted([cell for cell in table.cells if cell.row_index == i], key=lambda cell: cell.column_index) for i in range(table.row_count)]
    for row_cells in rows:
        table_html += "<tr>"
        for cell in row_cells:
            tag = "th" if (cell.kind == "columnHeader" or cell.kind == "rowHeader") else "td"
            cell_spans = ""
            if cell.column_span > 1: cell_spans += f" colSpan={cell.column_span}"
            if cell.row_span > 1: cell_spans += f" rowSpan={cell.row_span}"
            table_html += f"<{tag}{cell_spans}>{html.escape(cell.content)}</{tag}>"
        table_html +="</tr>"
    table_html += "</table>"
    return table_html

In [7]:
def text_to_base64(text):
    # Convert text to bytes using UTF-8 encoding
    bytes_data = text.encode('utf-8')

    # Perform Base64 encoding
    base64_encoded = base64.b64encode(bytes_data)

    # Convert the result back to a UTF-8 string representation
    base64_text = base64_encoded.decode('utf-8')

    return base64_text
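Standard Base64 output can contain `+` and `/`. To the best of my knowledge, Azure Cognitive Search document keys accept only letters, digits, underscore, dash, and the equals sign, so an id containing those two characters could be rejected. The sketch below shows a URL-safe variant; `text_to_urlsafe_key` and `key_to_text` are hypothetical names, and switching to them would change the ids of documents that are already indexed.

```python
import base64
import re

# Characters assumed to be valid in an Azure Cognitive Search document key
VALID_KEY = re.compile(r"^[A-Za-z0-9_\-=]+$")

def text_to_urlsafe_key(text):
    """Encode text as URL-safe Base64 ('-' and '_' replace '+' and '/')."""
    return base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")

def key_to_text(key):
    """Decode a key produced by text_to_urlsafe_key back to the original text."""
    return base64.urlsafe_b64decode(key.encode("ascii")).decode("utf-8")

key = text_to_urlsafe_key("APCA-Bylaws.pdf" + str(1))
print(key, bool(VALID_KEY.match(key)))
```

Because the encoding is reversible, the file name and page number can still be recovered from a key with `key_to_text`.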

In [8]:
#define a list for holding the file details 
files_to_index = []

### List the blobs in your container using the ContainerClient

In [9]:
container = ContainerClient.from_connection_string(conn_str=os.getenv("AZURE_STORAGE_CONNECTION_STRING"), container_name=os.getenv("AZURE_STORAGE_CONTAINER_NAME"))

blob_list = container.list_blobs()
for blob in blob_list:
    files_to_index.append({"file_name": blob.name,"file_url": (os.getenv("AZURE_STORAGE_BASE_URL")+ os.getenv("AZURE_STORAGE_CONTAINER_NAME") + "/" + blob.name)})  
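Several blob names in this container contain spaces and parentheses, and the plain string concatenation above stores them unencoded in `file_url`. If the URL needs to be directly usable as a link, one option is to percent-encode the blob name. This is a sketch with a hypothetical helper `blob_url`; it assumes the base URL is the account endpoint with or without a trailing slash.

```python
from urllib.parse import quote

def blob_url(base_url, container_name, blob_name):
    """Join the parts into a blob URL, percent-encoding the container and blob name.

    quote() leaves '/' untouched by default, so virtual directories in the
    blob name are preserved.
    """
    return f"{base_url.rstrip('/')}/{quote(container_name)}/{quote(blob_name)}"

print(blob_url("https://stgwrkshp.blob.core.windows.net/",
               "amhlatest",
               "APCA-2023 Parkside Budget.pdf"))
```

Note that this changes the value stored in the `location` field of the index, so it is only worth doing if consumers expect an encoded URL.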

### Get the file content stream for each blob and use Document Intelligence to chunk the PDF pages

In [13]:
for item in files_to_index:
    print(item)
    #item["page_map"] = parse_pdf(filename=item["file_name"], form_recognizer=True, model="prebuilt-document",from_url=False, verbose=True)
    item["page_map"] = parse_pdf(filename=item["file_name"], form_recognizer=True, model="prebuilt-layout",from_url=False, verbose=True)
    
    

{'file_name': 'APCA-2023 Parkside Budget.pdf', 'file_url': 'https://stgwrkshp.blob.core.windows.net/amhlatest/APCA-2023 Parkside Budget.pdf', 'page_map': [(0, 0, 'Anthem Parkside Budget by Category Annualized 2023 Built Out - Accrual Basis\n<table><tr><th>Date: 1/1/2023 - 12/31/2023 1 Operating Budget</th><th>Annual Total</th></tr><tr><td>INCOME</td><td></td></tr><tr><td>Assessment Revenue</td><td></td></tr><tr><td>40005 Assessments</td><td>1,444,354</td></tr><tr><td>Total Assessment Revenue</td><td>1,444,354</td></tr><tr><td>Other Operating Income</td><td></td></tr><tr><td>41047 ACC Fee Sharing Revenue</td><td>45,600</td></tr><tr><td>42003 Legal Fee Reimbursement</td><td>87,156</td></tr><tr><td>45001 Interest Income</td><td>6,000</td></tr><tr><td>49001 Transfers to Reserve Fund</td><td>(65,268)</td></tr><tr><td>Total Other Operating Income</td><td>73,488</td></tr><tr><td>TOTAL INCOME</td><td>1,517,842</td></tr><tr><td>EXPENSE</td><td></td></tr><tr><td>Contracted Services</td><td></td>

In [14]:
print(files_to_index[3]["page_map"])

[(0, 0, '1\nBY-LAWS .\nOF\nANTHEM PARKSIDE\n- COMMUNITY ASSOCIATION, INC.\n-\nHyatt & Stubblefield, PC. 225 Peachtree Street, N.E., Suite 1200 Atlanta, Georgia 30303 (404) 659-6600 :unselected: '), (1, 179, 'TABLE OF CONTENTS\n<table><tr><td colSpan=2>Article I Name, Principal Office, and Definitions 1</td></tr><tr><td>1.1. Name.</td><td>1</td></tr><tr><td>1.2. Principal Office.</td><td>1</td></tr><tr><td>1.3. Definitions</td><td>1</td></tr><tr><td>Article II Association: Membership, Meetings, Quorum, Voting, Proxies</td><td>1</td></tr><tr><td>2.1. Membership ...</td><td>1</td></tr><tr><td>2.2. Place of Meetings</td><td>1</td></tr><tr><td>2.3. Annual Meetings.</td><td>1</td></tr><tr><td>2.4. Special Meetings.</td><td>1</td></tr><tr><td>2.5. Notice of Meetings.</td><td>1</td></tr><tr><td>2.6. Waiver of Notice.</td><td>2 ...</td></tr><tr><td>2.7. Adjournment of Meetings. 20000</td><td>2</td></tr><tr><td>2.8. Voting.</td><td>2</td></tr><tr><td>2.9. Proxies</td><td>3</td></tr><tr><td>2.10.
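Each `page_map` entry is a `(page_num, offset, page_text)` tuple, where `offset` is the cumulative character count of all earlier pages (0 for the first page, 179 for the second in the output above). A small sketch of how that offset could be used to map a character position in the concatenated text back to its page; `page_for_offset` is a hypothetical helper, not part of the notebook.

```python
from bisect import bisect_right

def page_for_offset(page_map, char_offset):
    """Return the zero-based page number whose text contains char_offset.

    Assumes page_map is non-empty, ordered by page, and char_offset >= 0.
    """
    if not page_map or char_offset < 0:
        raise ValueError("page_map must be non-empty and char_offset non-negative")
    starts = [offset for _, offset, _ in page_map]
    # bisect_right finds the first page starting after char_offset; step back one
    return page_map[bisect_right(starts, char_offset) - 1][0]

# Toy page_map with the same shape as the output above
example_map = [(0, 0, "a" * 179), (1, 179, "b" * 40)]
print(page_for_offset(example_map, 178), page_for_offset(example_map, 179))
```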

In [15]:
str_index_name = "cogsrch-index-filesamhadls-vector"

In [16]:
### Create Azure Search Vector-based Index
# Setup the Payloads header
headers = {'Content-Type': 'application/json','api-key': os.getenv('AZURE_SEARCH_KEY')}
params = {'api-version': os.getenv('AZURE_SEARCH_API_VERSION')}

In [None]:
#from azure.storage.blob import BlobClient

#blob = BlobClient.from_connection_string(conn_str=os.getenv("AZURE_STORAGE_CONNECTION_STRING"), container_name=os.getenv("AZURE_STORAGE_CONTAINER_NAME"), blob_name="004003-Contract-Client-2023Oct18-232541.pdf")
#stream = blob.download_blob()
#data = stream.readall()
#print(data)

Now that we have the content of the file chunks (each page of each file) in the page_map of each dictionary, let's create the vector-based index in our Azure Search engine where this content is going to land.

In [17]:
index_payload = {
    "name": str_index_name,
    "fields": [
        {"name": "id", "type": "Edm.String", "key": "true", "filterable": "true" },
        {"name": "title","type": "Edm.String","searchable": "true","retrievable": "true"},
        {"name": "chunk","type": "Edm.String","searchable": "true","retrievable": "true"},
        {"name": "chunkVector","type": "Collection(Edm.Single)","searchable": "true","retrievable": "true","dimensions": 1536,"vectorSearchConfiguration": "vectorConfig"},
        {"name": "name", "type": "Edm.String", "searchable": "true", "retrievable": "true", "sortable": "false", "filterable": "false", "facetable": "false"},
        {"name": "location", "type": "Edm.String", "searchable": "false", "retrievable": "true", "sortable": "false", "filterable": "false", "facetable": "false"},
        {"name": "page_num","type": "Edm.Int32","searchable": "false","retrievable": "true"},
        
    ],
    "vectorSearch": {
        "algorithmConfigurations": [
            {
                "name": "vectorConfig",
                "kind": "hnsw"
            }
        ]
    },
    "semantic": {
        "configurations": [
            {
                "name": "my-semantic-config",
                "prioritizedFields": {
                    "titleField": {
                        "fieldName": "title"
                    },
                    "prioritizedContentFields": [
                        {
                            "fieldName": "chunk"
                        }
                    ],
                    "prioritizedKeywordsFields": []
                }
            }
        ]
    }
}

r = requests.put(os.getenv('AZURE_SEARCH_ENDPOINT') + "/indexes/" + str_index_name,
                 data=json.dumps(index_payload), headers=headers, params=params)
print(r.status_code)
print(r.ok)

204
True
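The `204` above deserves a note. As far as I know, the Create or Update Index REST call answers `201` when it creates a new index and `204` when it updates an existing one, so a `204` here suggests the index already existed from an earlier run. A tiny sketch that makes this explicit; `describe_index_response` is a hypothetical helper.

```python
def describe_index_response(status_code):
    """Translate the status code of the index PUT request into a short message."""
    if status_code == 201:
        return "index created"
    if status_code == 204:
        return "existing index updated"
    return f"request failed with status {status_code}"

print(describe_index_response(204))
```

In the notebook this could be called as `describe_index_response(r.status_code)`, falling back to `r.text` for details on failure.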


In [None]:
#uncomment in case of errors
# r.text

Upload the document chunks and their vectors to the vector-based index
The following code iterates over each page chunk of each file and uses the Azure Search REST API upload method to insert each document, with its corresponding vector (from the OpenAI embedding model), into the index.

In [18]:
%%time
for item in files_to_index:
    print("Uploading chunks from",item["file_name"])
    for page in tqdm(item['page_map']):
        try:
            page_num = page[0] + 1
            content = page[2]
            book_url = item["file_url"]
            idx_id = text_to_base64(item["file_name"] + str(page_num))
            title = f"{item['file_name']}_page_{str(page_num)}"  
            upload_payload = {
                "value": [
                    {
                        "id": idx_id,
                        "title": title,
                        "chunk": content,
                        "chunkVector": embedder.embed_query(content if content!="" else "-------"),
                        "name": item["file_name"],
                        "location": item["file_url"],
                        "page_num": page_num,
                        "@search.action": "upload"
                    },
                ]
            }

            r = requests.post(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + str_index_name + "/docs/index",
                                 data=json.dumps(upload_payload), headers=headers, params=params)
            if r.status_code != 200:
                print(r.status_code)
                print(r.text)
        except Exception as e:
            print("Exception:",e)
            print(content)

Uploading chunks from APCA-2023 Parkside Budget.pdf


  0%|          | 0/3 [00:00<?, ?it/s]

100%|██████████| 3/3 [00:02<00:00,  1.00it/s]


Uploading chunks from APCA-Appeal Process.pdf


100%|██████████| 2/2 [00:01<00:00,  1.43it/s]


Uploading chunks from APCA-Articles of Incorporation.pdf


100%|██████████| 3/3 [00:02<00:00,  1.39it/s]


Uploading chunks from APCA-Bylaws.pdf


100%|██████████| 30/30 [00:21<00:00,  1.38it/s]


Uploading chunks from APCA-CCRs - Covenants, Conditions and Restrictions w. Updated Use Restrictions.pdf


100%|██████████| 68/68 [00:56<00:00,  1.21it/s]


Uploading chunks from APCA-Fine Policy.pdf


100%|██████████| 2/2 [00:01<00:00,  1.28it/s]


Uploading chunks from APCA-Initial Use Restrictions.pdf


100%|██████████| 8/8 [00:05<00:00,  1.40it/s]


Uploading chunks from APCA-Paseo Towing Policy.pdf


100%|██████████| 5/5 [00:03<00:00,  1.47it/s]


Uploading chunks from ARC Guidelines Brookstone Ranch HOA.pdf


100%|██████████| 30/30 [00:21<00:00,  1.41it/s]


Uploading chunks from Articles Brookstone Ranch HOA 9816360.pdf


100%|██████████| 2/2 [00:01<00:00,  1.40it/s]


Uploading chunks from Assessment Collection Policy Brookstone Ranch HOA.pdf


100%|██████████| 3/3 [00:02<00:00,  1.39it/s]


Uploading chunks from Brookstone Ranch 2019-08-16 Remittance Address.pdf


100%|██████████| 3/3 [00:02<00:00,  1.43it/s]


Uploading chunks from Brookstone Ranch HOA statement 2017.pdf


100%|██████████| 1/1 [00:00<00:00,  1.44it/s]


Uploading chunks from CCR Brookstone Ranch HOA 12582.pdf


100%|██████████| 47/47 [00:33<00:00,  1.40it/s]


Uploading chunks from CCR Foxstone HOA 0616310.pdf


100%|██████████| 91/91 [01:06<00:00,  1.38it/s]


Uploading chunks from CCR Supp Highland Grove HOA Bk 553 Pg 717.pdf


100%|██████████| 2/2 [00:01<00:00,  1.27it/s]


Uploading chunks from CCR Supp Highland Grove HOA Bk 556 Pg 782.pdf


100%|██████████| 2/2 [00:01<00:00,  1.44it/s]


Uploading chunks from CCRs Highland Grove Bk 504 Pg 270.pdf


100%|██████████| 10/10 [00:07<00:00,  1.40it/s]


Uploading chunks from Plat Map Highland Grove HOA BK 93 Pg 6.pdf


100%|██████████| 3/3 [00:02<00:00,  1.16it/s]


Uploading chunks from Rules  Foxstone HOA 2005-08.pdf


100%|██████████| 7/7 [00:05<00:00,  1.38it/s]


Uploading chunks from Rules Architectural Guidelines Foxstone HOA 2005-08.pdf


100%|██████████| 17/17 [00:12<00:00,  1.40it/s]


Uploading chunks from Rules Collection Policy  Foxstone HOA.pdf


100%|██████████| 12/12 [00:08<00:00,  1.41it/s]


Uploading chunks from Rules Foxstone HOA 2011-01-01.pdf


100%|██████████| 1/1 [00:00<00:00,  1.47it/s]


Uploading chunks from Rules Parking  Foxstone HOA 2013.pdf


100%|██████████| 4/4 [00:02<00:00,  1.43it/s]


Uploading chunks from Rules Violation and Fines  Foxstone HOA 2005-08.pdf


100%|██████████| 1/1 [00:00<00:00,  1.45it/s]


Uploading chunks from W-9 Brookstone Ranch HOA.pdf


100%|██████████| 1/1 [00:00<00:00,  1.33it/s]


Uploading chunks from W-9 Foxstone HOA.pdf


100%|██████████| 1/1 [00:00<00:00,  1.38it/s]


Uploading chunks from W-9 Highland Grove.pdf


100%|██████████| 4/4 [00:02<00:00,  1.35it/s]


Uploading chunks from W9-Anthem Parkside ( Paid through Master) Anthem Community Council.pdf


100%|██████████| 1/1 [00:00<00:00,  1.36it/s]

CPU times: total: 6.23 s
Wall time: 4min 30s
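The loop above sends one indexing request per page. The `value` array of the upload payload can carry several documents, so the number of requests to the search service could be reduced by grouping pages into batches. The sketch below only builds the payloads. `build_upload_payloads` is a hypothetical helper, and the default batch size of 100 is an assumption chosen to stay well under the service's per-request limits (to my knowledge 1000 documents or about 16 MB), given that each document carries a 1536-float vector. The embedding calls still happen once per page, so this alone will not remove most of the wall time.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def build_upload_payloads(documents, batch_size=100):
    """Group documents into Azure Search upload payloads of batch_size each."""
    payloads = []
    for batch in batched(documents, batch_size):
        payloads.append(
            {"value": [{**doc, "@search.action": "upload"} for doc in batch]}
        )
    return payloads

# Toy documents; real ones would carry title, chunk, chunkVector, and so on
docs = [{"id": str(i)} for i in range(5)]
payloads = build_upload_payloads(docs, batch_size=2)
print([len(p["value"]) for p in payloads])
```

Each payload can then be posted to the same `/docs/index` endpoint used above with `json.dumps(payload)`.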



