<a href="https://colab.research.google.com/github/genaiconference/Agentic_RAG_Workshop/blob/main/04_Azure_AI_Search_Indexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Create Azure AI Search Index

This notebook demonstrates how to use Azure AI Search. It walks through creating an index, setting up a semantic configuration, and loading documents.

In [None]:
!git clone https://github.com/genaiconference/Agentic_RAG_Workshop.git

Cloning into 'Agentic_RAG_Workshop'...
remote: Enumerating objects: 54, done.
remote: Counting objects: 100% (54/54), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 54 (delta 19), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (54/54), 2.21 MiB | 4.53 MiB/s, done.
Resolving deltas: 100% (19/19), done.


## Setup and Installations
Install the libraries needed for document processing, data handling, and interacting with Azure services (including Azure AI Search) and OpenAI.

In [None]:
!pip install -r /content/Agentic_RAG_Workshop/requirements.txt

Collecting azure-search-documents (from -r /content/Agentic_RAG_Workshop/requirements.txt (line 11))
  Downloading azure_search_documents-11.5.3-py3-none-any.whl.metadata (23 kB)
Collecting azure-identity (from -r /content/Agentic_RAG_Workshop/requirements.txt (line 12))
  Downloading azure_identity-1.24.0-py3-none-any.whl.metadata (86 kB)
Collecting azure-common>=1.1 (from azure-search-documents->-r /content/Agentic_RAG_Workshop/requirements.txt (line 11))
  Downloading azure_common-1.1.28-py2.py3-none-any.whl.metadata (5.0 kB)
Collecting msal>=1.30.0 (from azure-identity->-r /content/Agentic_RAG_Workshop/requirements.txt (line 12))
  Downloading msal-1.33.0-py3-none-any.whl.metadata (11 kB)
Collecting msal-extensions>=1.2.0 (from azure-identity->-r /content/Agentic_RAG_Workshop/requirements.txt (line 12))
  Downloading msal_extensions-1.3.1-py3-none-any.whl.metadata

## Load Environment Variables and Initialize Clients
Load environment variables containing API keys and endpoint information, and initialize the OpenAI chat model and embedding model.

In [None]:
import os

os.chdir("/content/Agentic_RAG_Workshop/")

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

load_dotenv()

llm = ChatOpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4.1",
    temperature=0,
)

embeddings = OpenAIEmbeddings(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="text-embedding-3-small"
)

## Load Document Intelligence (DI) output

In [None]:
import os
import pickle

import chunking_utils

DI_OUTPUT_DIR = os.path.join(os.getcwd(), "DI_output")


def load_pickle(path):
    """Load a pickled object and close the file handle afterwards."""
    with open(path, "rb") as f:
        return pickle.load(f)


leave_policy_results = load_pickle(os.path.join(DI_OUTPUT_DIR, "leave_policy_results.pkl"))
insurance_policy_results = load_pickle(os.path.join(DI_OUTPUT_DIR, "insurance_policy_results.pkl"))

# Create parent chunks for the leave policy document
leave_policy_docs = chunking_utils.generate_parents(
    leave_policy_results["md_result"],
    leave_policy_results["result_with_image_descp"],
    "Leave Policy",
    llm,
)

# Create parent chunks for the insurance policy document
insurance_policy_docs = chunking_utils.generate_parents(
    insurance_policy_results["md_result"],
    insurance_policy_results["result_with_image_descp"],
    "Insurance Policy",
    llm,
)

parent_docs = leave_policy_docs + insurance_policy_docs

# Give each parent document a string ID; derived documents refer to it under id_key
doc_ids = [str(i) for i in range(len(parent_docs))]
id_key = "doc_id"

[INFO] Created 14 parent documents.

[INFO] Created 45 parent documents.


### Save doc_ids & parent docs to pickle

In [None]:
import pickle
import os

# Save parent_docs and doc_ids to a pickle file
parent_data = {
    "parent_docs": parent_docs,
    "doc_ids": doc_ids
}
with open(os.path.join(os.getcwd(), "parent_docs_and_ids.pkl"), "wb") as f:
    pickle.dump(parent_data, f)
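
To reuse these objects in a later notebook, the file can be read back with `pickle.load`. The following is a minimal sketch of that round trip. It uses stand-in data and a temporary directory so it runs on its own; the sample values are placeholders, not the workshop's actual documents.

```python
import os
import pickle
import tempfile

# Stand-in data with the same shape as the dictionary saved above (placeholder values).
sample_data = {
    "parent_docs": ["first parent text", "second parent text"],
    "doc_ids": ["0", "1"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "parent_docs_and_ids.pkl")

    # Write the dictionary; the context manager closes the file handle.
    with open(path, "wb") as f:
        pickle.dump(sample_data, f)

    # Read it back, as a later notebook would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

restored_docs = restored["parent_docs"]
restored_ids = restored["doc_ids"]
```

Pickle files can execute arbitrary code when loaded, so only load files that you created yourself or that come from a trusted source.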

## Create Child Docs, Summaries, Hypothetical Questions

In [None]:
from multivector_utils import create_child_documents, generate_summaries, generate_hypothetical_questions

child_docs = create_child_documents(parent_docs, doc_ids, id_key)
summaries = generate_summaries(parent_docs, llm, id_key, doc_ids)
questions = generate_hypothetical_questions(parent_docs, id_key, doc_ids)
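
Each derived document (child chunk, summary, or hypothetical question) carries the ID of its parent in its metadata under `id_key`, so a search hit can be traced back to the full parent document. The sketch below illustrates that linkage only. The `Doc` dataclass is a stand-in for a LangChain `Document`, and `split_into_children` is a hypothetical helper using fixed-size chunks; neither is taken from `multivector_utils`, whose actual implementation may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_into_children(parents, doc_ids, id_key, chunk_size=40):
    """Hypothetical helper: cut each parent into fixed-size chunks tagged with the parent's ID."""
    children = []
    for parent, parent_id in zip(parents, doc_ids):
        text = parent.page_content
        for start in range(0, len(text), chunk_size):
            metadata = dict(parent.metadata)  # copy so the parent's metadata is not mutated
            metadata[id_key] = parent_id
            children.append(Doc(text[start:start + chunk_size], metadata))
    return children


parents = [
    Doc("A" * 100, {"source": "Leave Policy"}),
    Doc("B" * 30, {"source": "Insurance Policy"}),
]
doc_ids = [str(i) for i in range(len(parents))]
children = split_into_children(parents, doc_ids, id_key="doc_id")

# Any child can be mapped back to its parent through the shared ID.
parent_by_id = dict(zip(doc_ids, parents))
first_parent = parent_by_id[children[0].metadata["doc_id"]]
```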

In [None]:
def replace_header_keys(documents):
    """Rename "Header N" metadata keys to "Header_N" so they match the index field names."""
    key_map = {f"Header {i}": f"Header_{i}" for i in range(1, 5)}
    for doc in documents:
        # Keys not listed in key_map are kept unchanged
        doc.metadata = {key_map.get(key, key): value for key, value in doc.metadata.items()}
    return documents

# Apply the function to your document lists
parent_docs = replace_header_keys(parent_docs)
child_docs = replace_header_keys(child_docs)
summaries = replace_header_keys(summaries)
questions = replace_header_keys(questions)
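
The renaming is needed because Azure AI Search field names may contain only letters, digits, and underscores. If the metadata could contain other keys with spaces or punctuation, a more general variant might look like the sketch below. The `sanitize_metadata_keys` function is a hypothetical helper, not part of the workshop code, and it works on a plain dictionary rather than on documents.

```python
import re


def sanitize_metadata_keys(metadata):
    """Return a copy of metadata whose keys contain only letters, digits, and underscores."""
    cleaned = {}
    for key, value in metadata.items():
        new_key = re.sub(r"[^0-9A-Za-z_]", "_", key)
        if new_key in cleaned:
            # Two different keys collapsed to the same name; refuse to silently overwrite.
            raise ValueError(f"Key collision after sanitizing: {key!r} -> {new_key!r}")
        cleaned[new_key] = value
    return cleaned


example = {
    "Header 1": "Employee Benefits Manual",
    "page_number": 22,
    "source": "Insurance Policy",
}
sanitized = sanitize_metadata_keys(example)
```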

In [None]:
child_docs[45]

Document(metadata={'doc_id': '35', 'Header_1': 'Employee Benefits Manual 2023-24 Novartis Group', 'Header_2': "Voluntary Parent's Policy - Plan Details (2/2)", 'page_number': 22, 'custom_metadata': "Voluntary Parent's Policy - Plan Details (2/2)", 'source': 'Insurance Policy', 'parent_id': '35', 'source_type': 'Children'}, page_content='##Voluntary Parent\'s Policy - Plan Details (2/2)  \n<table>\n<tr>\n<td>Special Conditions</td>\n<td>Pandemic and Dental and Vision in cases of accidents covered, only in case of accidents. Refractive error or sight correction $\\left( + / - \\right) 7$ and macular degeneration of retina- on IPD and OPD basis, Final decision will be taken on receipt of complete set of documents. 50% Co-Pay for cyber-knife treatment, Gamma Knife treatment and Stem Cell Transplantation, Robotic Surgery, Femto laser treatment for eye. It will be applicable for each eye each event. Cochlear Implant treatment shall be restricted to 50% of the SI. Coverage under this benefit 

## Create Azure AI Search Index

### Configure vector store settings
You need an Azure subscription and Azure AI Search service to use this vector store integration. No-cost versions are available for small and limited workloads.

Set variables for your Azure AI Search URL and admin API key. You can get these variables from the Azure portal.

In [None]:
vector_store_address: str = os.getenv("AZURE_SEARCH_ENDPOINT")
vector_store_password: str = os.getenv("AZURE_SEARCH_KEY")
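
`os.getenv` returns `None` when a variable is missing, and the resulting error only surfaces later when the client is created. A small guard can fail earlier with a clearer message. The sketch below assumes a hypothetical `require_env` helper; `EXAMPLE_UNSET_VARIABLE` is a made-up name used only to show the failure path.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstrate the failure path with a variable that is assumed not to exist.
try:
    require_env("EXAMPLE_UNSET_VARIABLE")
except RuntimeError as err:
    message = str(err)
```

In this notebook it could be used as `vector_store_address = require_env("AZURE_SEARCH_ENDPOINT")`.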

### Define custom filterable fields
This schema lists the index field definitions: the default fields (id, content, content_vector, metadata) plus searchable Header_1 to Header_4 fields and a filterable source field. Because it uses the default vector configuration, there are no vector configuration or vector profile overrides here. The default vector profile is named "myHnswProfile" and uses Hierarchical Navigable Small World (HNSW) for indexing and queries against the content_vector field. The cell also defines a semantic configuration with source as the title field, the header fields as keywords, and content as the content field.

In [None]:
from azure.search.documents.indexes.models import (
    SearchableField,
    SearchField,
    SearchFieldDataType,
    SemanticConfiguration,
    SemanticField,
    SemanticPrioritizedFields,
    SimpleField,
)

embedding_function = embeddings.embed_query

fields = [
    SimpleField(
        name="id",
        type=SearchFieldDataType.String,
        key=True,
        filterable=True,
    ),
    SearchableField(
        name="content",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=len(embedding_function("Text")),
        vector_search_profile_name="myHnswProfile",
    ),
    SearchableField(
        name="metadata",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    # Additional field to store the Header
    SearchableField(
        name="Header_1",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    # Additional field to store the Header
    SearchableField(
        name="Header_2",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    # Additional field to store the Header
    SearchableField(
        name="Header_3",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    # Additional field to store the Header
    SearchableField(
        name="Header_4",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    # Additional field for filtering on document source
    SimpleField(
        name="source",
        type=SearchFieldDataType.String,
        filterable=True,
    ),
]

# Semantic configuration: source as title, headers as keywords, content as body
semantic_config = SemanticConfiguration(
    name="agentic-rag-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="source"),
        keywords_fields=[
            SemanticField(field_name="Header_1"),
            SemanticField(field_name="Header_2"),
            SemanticField(field_name="Header_3"),
            SemanticField(field_name="Header_4"),
        ],
        content_fields=[SemanticField(field_name="content")],
    ),
)
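
Because the source field is declared filterable, a query can be restricted to a single document with an OData filter expression. The sketch below builds such an expression. The `source_filter` function is a hypothetical helper, not part of the workshop code; it doubles any single quote in the value, which is how OData string literals escape quotes.

```python
def source_filter(source_name):
    """Hypothetical helper: build an OData filter matching one value of the source field."""
    escaped = source_name.replace("'", "''")  # a single quote is escaped by doubling it
    return f"source eq '{escaped}'"


insurance_only = source_filter("Insurance Policy")
print(insurance_only)  # prints: source eq 'Insurance Policy'
```

With the LangChain `AzureSearch` store, an expression like this is typically passed through the `filters` keyword argument of `similarity_search`; check the version you have installed to confirm the parameter is supported for your search type.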

### Create vector store instance
Create an instance of the AzureSearch class using the embedding function, fields, and semantic configuration defined above.

In [None]:
import os
from langchain_community.vectorstores.azuresearch import AzureSearch

index_name: str = "agentic-rag-workshop"
search_type: str = "semantic_hybrid"

vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embedding_function,
    fields=fields,
    search_type=search_type,
    semantic_configurations=semantic_config,
)

### Add data to the vectorstore

In [None]:
# Add child_docs, summaries, and questions to the vector store
vector_store.add_documents(documents=child_docs)
vector_store.add_documents(documents=summaries)
vector_store.add_documents(documents=questions)

['MzAzZTAyOGEtMjQ3MS00MzdmLTk2OTktOGJiODMxYWRmMzQy',
 'YTA3ZmJkZjItZWM3OS00NzEzLTk3NWEtMTdkMjAxMjM2MzQ3',
 'MzA1OTliMGEtZWE1NS00ZWIwLWIzZDYtYTljMzYzZWY4Yzlh',
 'NzY0ODZlNDYtMTBiMi00ODBkLWE0NjEtNmM0ZWE4YmFkMTI2',
 'MWZkZmExNTYtOGQ1OC00ZWMyLWI4MDUtMDQ1MmYyYTRiZDVi',
 'YjQ2YTkxNzctOTk5Yy00MzQ4LWJhZDMtNWUzZDJiNmU5MjE4',
 'YjJhOGU1OTAtMDE2Yy00YjIxLTljZjEtNTI4YTI2YTg2ZTVk',
 'OTJhOTEzMGEtOWJiNy00YWFiLWE2YjQtNjczOTNlNmY4NGQx',
 'NDgzOTAxNGUtZWM1MC00ZGYzLWJmZmItMWIwZjgyNDZjY2Ex',
 'ZDQyYTk4YTAtMjRmZC00ZmE5LWFhODYtYTY0MzJlMjRlMzVk',
 'YmY1YjhiNWItOWJmZi00ZWVkLTg2NTktZDgwOWZjMTNhMTY3',
 'OTQ1ZjdlOTEtMzk5ZS00MDhiLWFkMzQtOTU2NWQ5NWM4ZmQw',
 'YmNkMDdhM2UtNzVjNi00NGYyLTliNmEtYmI2Njg1MmEzMzUz',
 'ODYzOWRiODctNjQ4My00ZTQ0LWIyOTUtNDFkN2YxNmFlNTVi',
 'YWZjZjdiN2ItYWM2NC00MGE5LThkZWYtZDE1ZGQxODU0N2Q2',
 'ODY3MTRiMGMtYjk5Mi00YTcwLWEwYzItMjNkMzNhZWRlMGYx',
 'YTg2ZjFkNTQtYTgyMy00ODJjLThlZWEtMWYzMTQxN2Y4Zjcx',
 'OTNiZmIyNjQtMmM2NS00YTg0LWFmMDMtMWYyMzcyMTJjMzk5',
 'ZGM4OGQzNGUtZWRjMS00YTVmLWE3ZmQtYmNiZTJjZDJj
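
`add_documents` returns the keys assigned to the indexed documents. The keys in the output above look like URL-safe Base64 strings, and decoding the first one yields a UUID. The sketch below shows that decoding; treating every key as an encoded UUID is an assumption based on this output, not a documented guarantee.

```python
import base64
import uuid

# First key from the output above.
key = "MzAzZTAyOGEtMjQ3MS00MzdmLTk2OTktOGJiODMxYWRmMzQy"

decoded = base64.urlsafe_b64decode(key).decode("utf-8")
parsed = uuid.UUID(decoded)  # raises ValueError if the text is not a valid UUID

print(decoded)  # prints: 303e028a-2471-437f-9699-8bb831adf342
```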

## Perform a hybrid search
Run a hybrid search by passing search_type="hybrid" to similarity_search(), or by calling the hybrid_search() method directly. Vector and non-vector text fields are queried in parallel, the results are merged, and the top matches of the unified result set are returned.

In [None]:
# Perform a hybrid search using the search_type parameter
docs = vector_store.similarity_search(
    query="What are the benefits of maternity leave",
    k=3,
    search_type="hybrid",
)
print(docs[0].page_content)

##Your Plan Details Maternity Benefits  
· Maternity benefits are admissible only if the expenses are incurred in Hospital / Nursing Home as in-patients in India.
· Those Insured Persons who already have two or more living children will not be eligible for this benefit.
· Expenses incurred in connection with voluntary medical termination of pregnancy during the first 12 weeks from the date
of conception are not covered. Infertility Treatment and sterilization are excluded from the policy.
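
Azure AI Search merges the keyword and vector result lists of a hybrid query with Reciprocal Rank Fusion (RRF): each document receives a score of 1 / (c + rank) from every ranked list it appears in, and these scores are summed. The sketch below illustrates the idea on two made-up rankings. The constant `rank_constant=60` and the document IDs are assumptions for illustration, and the service's internal implementation may differ in detail.

```python
def reciprocal_rank_fusion(rankings, rank_constant=60):
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A higher position (smaller rank) contributes a larger score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Made-up rankings: one from keyword search, one from vector search.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)  # prints: ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists (here `doc_b` and `doc_a`) rise to the top, which is why hybrid search can surface results that neither method alone would rank first.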
