<img src="rag.jpg" style="display:block; margin:auto" width=350>

<div align="center"> <h1 align="center">Building a RAG-Based AI Chatbot for Document Search and Retrieval</h1> </div>

This notebook outlines a comprehensive end-to-end workflow for building an AI-powered document search and retrieval system using Azure Document Intelligence, Cognitive Search, and OpenAI. The process involves several key steps:

1. **Extracting Data**: Retrieve document data from Azure Blob Storage, including the labels and content that Azure Document Intelligence produced when it automatically pre-labeled the documents.
2. **Index Creation**: Define and configure a custom search index in Azure Cognitive Search to store and organize the document data.
3. **Generating Embeddings**: Use OpenAI's models to create vector embeddings for document content, enabling semantic search.
4. **Uploading Data**: Upload the documents, including their embeddings, to Azure Cognitive Search for efficient retrieval.
5. **Integrating AI-based Search**: Implement a search function that combines keyword and vector-based search, with responses generated by an AI-powered chatbot from the retrieved documents using a Retrieval-Augmented Generation (RAG) approach.

This workflow combines cloud-based data storage, advanced search capabilities, and natural language processing (NLP) to facilitate interactive, context-driven conversations with the AI model. It allows users to ask questions and receive insightful answers based on document content.

##### 0. Prerequisites & Install Relevant Packages

- An Azure subscription with access to Azure OpenAI, Azure Cognitive Search, and Azure Document Intelligence.

- A deployment of the text-embedding-3-large/text-embedding-ada-002 embedding model.

- I used Python 3.12.5 and Visual Studio Code with the Azure extension, and tested this example inside the Jupyter extension.

In [None]:
%pip install -r requirements.txt

##### Importing Required Libraries for Azure Integration and OpenAI

We import essential libraries needed for interacting with Azure services, managing secrets, and working with OpenAI. These libraries are key to enabling the document storage, search, and retrieval capabilities, as well as integrating OpenAI for NLP tasks.

In [None]:
import uuid  
import os  
from dotenv import load_dotenv
import json    
from azure.storage.blob import BlobServiceClient  
from azure.core.credentials import AzureKeyCredential  
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from openai import AzureOpenAI
from azure.search.documents.models import VectorizableTextQuery
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from azure.search.documents.indexes.models import (HnswParameters, 
                                                HnswAlgorithmConfiguration, 
                                                VectorSearchProfile, 
                                                SearchableField, 
                                                SearchField, 
                                                SearchFieldDataType, 
                                                SearchIndex, 
                                                SimpleField,
                                                VectorSearch, 
                                                AzureOpenAIVectorizer,
                                                AzureOpenAIParameters  
                                                )

##### 1. Loading Environment Variables for Secure Configuration

First, we load environment variables from a `.env` file using the `load_dotenv` function, allowing secure storage of sensitive information such as API keys, endpoint names, and deployment configurations. 

In [None]:
load_dotenv(override=True) # take environment variables from .env

# Import variables from your .env file
key_vault_name = os.getenv('key_vault_name')
openai_secret_key_name = os.getenv('openai_secret_key_name')
openai_endpoint_name = os.getenv('openai_endpoint_name')
deployment_name = os.getenv('deployment_name') # gpt4o, gpt4, gpt3.5-turbo etc.
search_endpoint_name = os.getenv('search_endpoint_name')
search_endpoint_key = os.getenv('search_endpoint_key')
use_embedding_deployment = os.getenv('use_embedding_deployment') # text-embedding-3-large, text-embedding-ada-002, text-embedding-3-small
embedding_dimensions = int(os.getenv('embedding_dimensions'))
azure_openai_api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2024-06-01")

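`os.getenv` returns `None` for a variable that is not set, so a call such as `int(os.getenv('embedding_dimensions'))` fails with a `TypeError` that does not name the missing variable. The following is a minimal sketch of a fail-fast check, assuming the variable names used in the cell above; `require_env` is a hypothetical helper written for this illustration and is not part of `python-dotenv`.

```python
import os

def require_env(names, environ=None):
    """Return {name: value} for all names, or raise an error listing every missing one."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}

# Demo with an in-memory mapping instead of the real environment (values are made up)
sample_env = {
    'search_endpoint_name': 'https://example.search.windows.net',
    'embedding_dimensions': '1536',
}
settings = require_env(['search_endpoint_name', 'embedding_dimensions'], environ=sample_env)
dimensions = int(settings['embedding_dimensions'])
print(dimensions)
```

Called without the `environ` argument, the helper reads from `os.environ`, which is where `load_dotenv` places the values from `.env`.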
##### 1.1 Setting Up Azure OpenAI Client with Authentication

This code snippet initializes the Azure OpenAI client by setting up authentication using Azure credentials, API keys, and token providers, enabling access to OpenAI models and deployment configurations.

In [None]:
openai_credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(openai_credential, "https://cognitiveservices.azure.com/.default")

client = AzureOpenAI(
    azure_deployment=use_embedding_deployment,
    api_version=azure_openai_api_version,
    azure_endpoint=openai_endpoint_name,
    api_key=openai_secret_key_name,
    azure_ad_token_provider=token_provider if not openai_secret_key_name else None
)

##### 1.2 Creating a Custom Azure Search Index with Vector Search Configuration

This code defines a function to create a custom Azure Cognitive Search index, with optional custom fields and vector search capabilities. It configures the index to include embedding vectors for advanced search features and ensures the correct setup of fields before indexing documents.

In [None]:
def create_custom_index(index_name, embedding_dimensions=1536, custom_fields=None, embedding_deployment=None):
    
    # Check if embedding deployment variable is provided
    if embedding_deployment is None:
        raise ValueError('Embedding deployment must be provided')

    # Create the Search Index Client
    index_client = SearchIndexClient(endpoint=search_endpoint_name, credential=AzureKeyCredential(search_endpoint_key))

    # If no custom fields are provided, use default fields
    if custom_fields is None:
        # Default fields
        default_fields = [
            SimpleField(name='id', type=SearchFieldDataType.String, key=True, searchable=False, retrievable=False),
            SearchableField(name='combined_content', type=SearchFieldDataType.String, retrievable=True, searchable=True, filterable=True),
            SearchField(
                name='combined_content_vector', 
                type=SearchFieldDataType.Collection(SearchFieldDataType.Single), 
                searchable=True, 
                retrievable=True,
                vector_search_dimensions=embedding_dimensions, 
                vector_search_profile_name='vector_profile'
            )
        ]
        fields_to_use = default_fields  # Use default fields if no custom fields provided
    
    else:
        # Process the custom fields
        searchable_fields = [
            SearchField(
                name=field['name'],
                type=field['type'],
                searchable=field.get('searchable', True),
                filterable=field.get('filterable', True),
                sortable=field.get('sortable', True),
                retrievable=field.get('retrievable', True)
            )
            for field in custom_fields if field['name'] != 'id' and not field['name'].endswith('_vector')
        ]

        # Create the SimpleField for primary key (id)
        simple_field = [
            SimpleField(
                name=field['name'],
                type=field['type'],
                key=True,
                searchable=field.get('searchable', False),
                retrievable=field.get('retrievable', False)
            )
            for field in custom_fields if field['name'] == 'id'
        ]

        # Create the Vector Field
        vector_field = [
            SearchField(
                name=field['name'],
                type=field['type'],
                searchable=field.get('searchable', False),
                retrievable=field.get('retrievable', True),
                vector_search_dimensions=embedding_dimensions,
                vector_search_profile_name='vector_profile'
            )
            for field in custom_fields if field['name'].endswith('_vector')
        ]

        # Combine fields into one list for custom fields
        fields_to_use = simple_field + searchable_fields + vector_field

    # Ensure there are fields to pass to the SearchIndex
    if not fields_to_use:
        raise ValueError('No fields defined for the index.')

    # Define Vector Search configuration
    vector_search = VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name='algorithm', parameters=HnswParameters(metric='cosine'))],
        profiles=[VectorSearchProfile(name='vector_profile', algorithm_configuration_name='algorithm', vectorizer='Vectorizer')],  
        vectorizers=[AzureOpenAIVectorizer(  
            name='Vectorizer',  
            azure_open_ai_parameters=AzureOpenAIParameters(  
                resource_uri=openai_endpoint_name,   
                deployment_id=embedding_deployment,
                model_name=embedding_deployment, 
                api_key=openai_secret_key_name)
            )
        ]
    )

    # Create the SearchIndex object, using fields_to_use
    index = SearchIndex(
        name=index_name,
        fields=fields_to_use,  # Pass either default or custom fields
        vector_search=vector_search
    )

    # Try to create or update the index
    try:
        result = index_client.create_or_update_index(index)
        print(f"Index '{result.name}' created or updated successfully.")
        return result
    
    except Exception as e:
        print(f'Error creating or updating index: {e}')
        return None

##### 1.3 Defining Custom Fields for Azure Cognitive Search 

For use with the `create_custom_index` function, we now define custom fields for an Azure Cognitive Search index, specifying each field's type, searchability, retrievability, and vector search settings. The list includes essential fields like `id`, `title`, and `author`, along with a `combined_content_vector` field for vector-based search.

In [None]:
# Example custom_field usage
set_custom_fields = [
    {'name': 'id', 'type': SearchFieldDataType.String, 'key': True, 'retrievable': False, 'analyzer_name': 'keyword'},
    {'name': 'title', 'type': SearchFieldDataType.String, 'searchable': True, 'filterable': True},
    {'name': 'combined_content', 'type': SearchFieldDataType.String, 'searchable': True, 'filterable': True, 'retrievable' : True},
    {'name': 'author', 'type': SearchFieldDataType.String, 'searchable': True, 'filterable': True},
    {'name': 'published_date', 'type': SearchFieldDataType.String, 'searchable': True, 'sortable': True},
    {'name': 'combined_content_vector', 'type': SearchFieldDataType.Collection(SearchFieldDataType.Single),
     'searchable': True, 'retrievable': True,
     'vector_search_dimensions': embedding_dimensions, 'vector_search_profile_name': 'vector_profile'}
]

##### 1.4 Creating the Custom Index with Defined Fields and Embedding Deployment

This function call creates a custom Azure Cognitive Search index named `'test_index'` with specified custom fields and embedding dimensions. It also integrates the selected embedding deployment to enable vector-based search functionality.

In [None]:
create_custom_index('test_index', embedding_dimensions=embedding_dimensions, custom_fields=set_custom_fields, embedding_deployment=use_embedding_deployment)

##### 2. Downloading and Extracting Labels from Azure Blob Storage

These functions download data from Azure Blob Storage and extract labels from `.json` files. The `download_blob_to_string` function retrieves the content of a specific blob, while `extract_labels_from_blob` lists all blobs in a container, filters based on specified criteria, and extracts labels from valid `.json` files.

In [None]:
def download_blob_to_string(container_client, blob_name):  
    try:  
        blob_client = container_client.get_blob_client(blob_name)  
        blob_data = blob_client.download_blob().readall()  
        return blob_data.decode('utf-8')  
    except Exception as e:
        print(f"Error downloading blob {blob_name}: {e}")  
        return None  
  
def extract_labels_from_blob(blob_url, sas_token, container_name, filter_value):  
    try:  
        blob_service_client = BlobServiceClient(account_url=blob_url, credential=sas_token)  
        container_client = blob_service_client.get_container_client(container_name)  
          
        labels = []  
          
        # List all blobs in the container  
        blob_list = container_client.list_blobs()  
        for blob in blob_list:  
            if blob.name.endswith('.json') and all(value not in blob.name for value in filter_value):
                blob_content = download_blob_to_string(container_client, blob.name)  
                if blob_content:  
                    labels.append(json.loads(blob_content))

        return labels  
  
    except Exception as e:
        print(e)  
        return []

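The name filter in `extract_labels_from_blob` keeps a blob only if its name ends in `.json` and contains none of the substrings in `filter_value`. Below is a small sketch of that condition in isolation; `is_label_blob` is a hypothetical helper and the blob names are made up for illustration, not taken from a real container.

```python
def is_label_blob(blob_name, filter_value):
    """Same condition as in extract_labels_from_blob, isolated so it can be tried on sample names."""
    return blob_name.endswith('.json') and all(value not in blob_name for value in filter_value)

# Hypothetical blob names, for illustration only
sample_names = [
    'invoice1.pdf',
    'invoice1.pdf.labels.json',
    'invoice1.pdf.ocr.json',
    'fields.json',
]
kept = [name for name in sample_names if is_label_blob(name, ['config', 'fields', 'ocr'])]
print(kept)
```

Because the check is a plain substring match, it also drops any file whose name merely contains one of the filter strings (for example a document called `socrates.json` contains `ocr`), so the filter values may need adjusting for your own naming scheme.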
##### 2.1 Extracting Entities and Compiling Documents with Labels

Next, we define two functions: `extract_entities` and `compile_documents`. The `extract_entities` function joins the text fragments of each labeled entity into a single string and returns one `{label: text}` dictionary per entity. The `compile_documents` function groups these entities into one dictionary per document, starting a new document whenever it encounters one of the provided title keys.

In [None]:
# Arrange all text with labels in one line, removing the fragmented status
def extract_entities(data):  
    entities = []  
    for document in data:  
        for item in document.get('labels', []):  
            label = item['label']  
            entity_texts = [v['text'] for v in item['value']]  
            entity = ' '.join(entity_texts)  
            entities.append({label: entity})  
    return entities  

# Compile documents with their labels in separate dictionaries
def compile_documents(entities, title_keys):  
    if not title_keys:
        raise ValueError("title_keys must be provided and cannot be None or empty.")

    compiled_docs = []  
    current_doc = {}  

    for item in entities:  
        # Check if the item contains any of the title keys
        if any(key in item for key in title_keys):  
            if current_doc:  # Save the previous document  
                compiled_docs.append(current_doc)  
            current_doc = dict(item)  # Start a new document (copy so the entity itself is not mutated)
        else:  
            current_doc.update(item)  # Add other labels to the current document
    
    if current_doc:  # Add the last document  
        compiled_docs.append(current_doc)  
    
    return compiled_docs

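To make the data flow concrete, here is a small worked example on made-up input. It assumes the label files have the shape that `extract_entities` reads (a `labels` list whose items carry a `label` and a list of `value` fragments with `text`); the sample values are invented, and the two functions are condensed copies of the ones above so that the block runs on its own.

```python
def extract_entities(data):
    """Join the text fragments of every label into one string per entity."""
    entities = []
    for document in data:
        for item in document.get('labels', []):
            text = ' '.join(v['text'] for v in item['value'])
            entities.append({item['label']: text})
    return entities

def compile_documents(entities, title_keys):
    """Start a new document at each title key and merge the following entities into it."""
    compiled_docs, current_doc = [], {}
    for item in entities:
        if any(key in item for key in title_keys):
            if current_doc:
                compiled_docs.append(current_doc)
            current_doc = dict(item)
        else:
            current_doc.update(item)
    if current_doc:
        compiled_docs.append(current_doc)
    return compiled_docs

# Made-up label data: two documents, the first title split into two fragments
sample_data = [
    {'labels': [
        {'label': 'title', 'value': [{'text': 'Annual'}, {'text': 'Report'}]},
        {'label': 'author', 'value': [{'text': 'Author 1'}]},
    ]},
    {'labels': [
        {'label': 'title', 'value': [{'text': 'Summary'}]},
        {'label': 'author', 'value': [{'text': 'Author 2'}]},
    ]},
]
sample_entities = extract_entities(sample_data)
sample_compiled = compile_documents(sample_entities, title_keys=['title'])
print(sample_compiled)
```

The grouping depends on entity order: every entity that follows a title key is attached to that document until the next title key appears, so the title label has to come first in each label file.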

##### 2.2 Extracting and Compiling Documents from Blob Storage with Error Handling

In this code block we use the functions defined above to extract the data from Azure Blob Storage using a SAS token, process the entities from the data, and organize them into structured documents. We handle potential errors and print relevant messages when no data or entities are found. The `compile_documents` function groups entities into documents based on the provided title keys, and the results are printed for inspection.

In [None]:
try:  
    # Configuration parameters  
    blob_url =  os.getenv('url-of-blob') 
    sas_token = os.getenv('your-sas-token-of-blob-container')
    container_name = os.getenv('your-container-name') 
          
    filter_value = ['config', 'fields', 'ocr'] # Adjust filter_value to just include files with specific names 
          
    # Extract labels from blob storage  
    data = extract_labels_from_blob(blob_url, sas_token, container_name, filter_value)  
    if not data:  
        print('No data extracted from blob storage.')  
        exit()  
          
    # Extract entities  
    entities = extract_entities(data)  
    if not entities:  
        print('No entities extracted.')  
        exit()  
    
# Example entity data! - use first key of true data and pass to title_keys parameter

#     entities = [
#     {'document_title': 'Document 1', 'author': 'Author 1'},
#     {'section': 'Introduction', 'content': 'This is the introduction.'},
#     {'document_title': 'Document 2', 'author': 'Author 2'},
#     {'section': 'Summary', 'content': 'This is the summary.'}
# ]

    # Call the function with mandatory title keys
    compiled_documents = compile_documents(entities, title_keys=['your_first_document_key'])
    if not compiled_documents:  
        print('No compiled documents generated.')  
        exit()  
          
    # Print compiled documents  
    for i, doc in enumerate(compiled_documents):  
        print(f'Document {i}: {doc}')
            
except Exception as e:
    print(e)

##### 3. Ensuring All Documents Have Expected Keys

Here we iterate through the compiled documents and ensure that each document contains all keys listed in `expected_key_list`. If a key is missing, we add it with a default value of `'Null'`. This ensures that all documents are consistently structured with the expected fields.

In [None]:
def expected_keys(compiled_documents, expected_key_list):
    for compiled_doc in compiled_documents:  
        # Ensure each document has all expected keys, filling missing ones with 'Null'
        for expected_key in expected_key_list:  
            if expected_key not in compiled_doc:  
                compiled_doc[expected_key] = 'Null'

    return compiled_documents

expected_key_list = [
    'id',
    'title', 
    'author',  
    'published_date',
]

compiled_documents = expected_keys(compiled_documents, expected_key_list)

##### 4. Generating Document Embeddings and Preparing Data for Upload

In this code block we generate embeddings for each compiled document by combining its content and creating a vector representation using Azure OpenAI. It processes the documents, appends embeddings, and prepares them for upload, ensuring each document has a unique ID and includes both the original data and the computed vector. The documents are stored in the `final_documents` list, ready for further use.

In [None]:
client = AzureOpenAI(  
    azure_endpoint=openai_endpoint_name,  
    api_key=openai_secret_key_name,  
    api_version='2024-05-01-preview'
) 

# Prepare final documents for upload  
final_documents = []

for compiled_doc in compiled_documents:  # Make sure to use the correct variable here (compiled_documents)
              
    # Create a combined_content field for the document dynamically based on available keys
    combined_content = ' '.join([compiled_doc[key] for key in expected_key_list if key in compiled_doc])  
              
    # Generate embeddings for the combined_content field  
    response = client.embeddings.create(input=combined_content, model=use_embedding_deployment)  
    combined_content_vector = response.data[0].embedding  
              
    # Create a new dictionary to hold the entire dataset, including combined content and vector  
    final_data = {
        'id': str(uuid.uuid4()),  # Generate a unique ID for the entire document
        'combined_content': combined_content,  
        'combined_content_vector': combined_content_vector
    }
    
    # Dynamically add all keys from compiled_doc
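    for key, value in compiled_doc.items():
        if key == 'id' and value == 'Null':
            continue  # Keep the generated UUID instead of the 'Null' placeholder
        final_data[key] = value  # Add each key-value pair from compiled_doc to final_data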
              
    final_documents.append(final_data)  # Add the processed document to final_documents
          
# Print the final documents structure  
for i, doc in enumerate(final_documents):  
    print(f'Final Document {i}: {doc}')

##### 4.1. Saving and Loading Documents with Embeddings in JSON Format

This code block saves the processed documents, including their embeddings, to a JSON file (here: `vector_data.json`) in a specified output directory. It ensures the directory exists before saving. After saving, it reads the same JSON file back into memory to access the stored documents for further processing or querying.

In [None]:
# Output combined data to json file  
output_path = os.path.join('..', 'output', 'vector_data.json')  
output_directory = os.path.dirname(output_path)  

os.makedirs(output_directory, exist_ok=True)  # Create the output directory if it does not exist
      
with open(output_path, 'w') as f:  
    json.dump(final_documents, f)  
          
# Read the combined data back from the vector_data.json file  
with open(output_path, 'r') as file:  
    documents = json.load(file) 

##### 4.2. Uploading Documents to Azure Cognitive Search

After reading the JSON, we upload documents to an Azure Cognitive Search index. It ensures that documents are properly formatted as a list (in case only a single document is passed as a dictionary) and then uploads them using the `SearchClient`. After the upload, it prints the number of documents successfully uploaded to the index.

In [None]:
search_client = SearchClient(endpoint=search_endpoint_name, index_name='test_index', credential=AzureKeyCredential(search_endpoint_key))  
          
# Ensure the documents are wrapped in a list  
if isinstance(documents, dict):  
    documents = [documents]  # Wrap the single dictionary in a list  
          
# Upload documents to Azure Cognitive Search  
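result = search_client.upload_documents(documents=documents)  
succeeded = sum(1 for r in result if r.succeeded)  # Count the documents the service accepted
print(f'Uploaded {succeeded} of {len(documents)} documents')  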

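The upload above sends all documents in a single request. Azure AI Search limits an indexing request to 1,000 documents, so a larger collection would need to be split into batches. A minimal sketch of such a split, where `chunked` is a hypothetical helper and the documents are placeholders:

```python
def chunked(items, batch_size=1000):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder documents, only to show how the batches come out
sample_docs = [{'id': str(i)} for i in range(2500)]
batch_sizes = [len(batch) for batch in chunked(sample_docs)]
print(batch_sizes)

# With the notebook's client, each batch would be sent separately (not executed here):
# for batch in chunked(documents):
#     search_client.upload_documents(documents=batch)
```

Documents that carry large embedding vectors can also hit the request size limit before the document count limit, so a smaller `batch_size` may be the safer choice.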
##### 5. Performing Vector-Based Search on Azure Cognitive Search

Now we can connect to the index and run the first search. Because the request passes both `search_text` and a `VectorizableTextQuery`, it is a hybrid query: the service matches keywords in the searchable fields and also vectorizes the query text to find the nearest neighbors in the `combined_content_vector` field. The function processes the results and returns `title`, `combined_content`, `author`, and `published_date` for each hit, along with its search score.

In [None]:
# Initialize the SearchClient with the endpoint, index name, and credentials  
search_client = SearchClient(endpoint=search_endpoint_name, index_name='test_index', credential=AzureKeyCredential(search_endpoint_key))  

def perform_search(query_text):  
    # Create a vector query with the entered text and the top k nearest neighbors  
    vector_query = VectorizableTextQuery(  
        text=query_text,  
        fields='combined_content_vector',  
        k_nearest_neighbors=3,  
        exhaustive=True  
    )  
    
    try:
        # Perform the search  
        results = search_client.search(  
            search_text=query_text,  # The entered search text  
            vector_queries=[vector_query],  # The vector-based query  
            select=['title', 'combined_content', 'author', 'published_date'],  # Fields to be returned in the result  
            include_total_count=True  # Include the total count of the results  
        )  
    
        result_list = []  
        # Iterate over the search results, include all fields  
        for result in results:  
            result_dict = {  
                'score': result['@search.score'],  # Include search score  
                'title': result.get('title', '').strip(),  # Include 'title' field
                'combined_content': result.get('combined_content', '').strip(),  
                'author': result.get('author', '').strip(),  # Include 'author' field
                'published_date': result.get('published_date', '')  # Include 'published_date' field
            } 
            result_list.append(result_dict)  
        
        return result_list
    
    except Exception as e:
        print(f'Failed response: {e}')
        return []  # Return an empty list so callers can still iterate over the result

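The index in this notebook is configured with the `cosine` metric, and `exhaustive=True` asks the service to compare the query vector against every stored vector instead of using the approximate HNSW graph. The following sketch illustrates that idea with plain Python on made-up three-dimensional vectors; real embeddings have hundreds or thousands of dimensions, and the `@search.score` returned by the service may be scaled or combined with the keyword score, so the numbers here are not comparable to it.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query_vector, doc_vectors, k=3):
    """Brute-force search: score every document, then keep the k most similar."""
    scored = [(doc_id, cosine_similarity(query_vector, vec)) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors, invented for illustration
toy_vectors = {
    'doc_a': [1.0, 0.0, 0.0],
    'doc_b': [0.8, 0.6, 0.0],
    'doc_c': [0.0, 0.0, 1.0],
}
top = nearest_neighbors([1.0, 0.1, 0.0], toy_vectors, k=2)
for doc_id, score in top:
    print(doc_id, round(score, 3))
```

Here `doc_a` ranks first because it points in almost the same direction as the query vector, while `doc_c` is orthogonal to it and is cut off by `k=2`.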
##### 6. Creating a Prompt for OpenAI Chat Completion

This function generates a prompt for OpenAI Chat Completion by formatting the search results and including the user's query. It organizes the search results into a readable structure and appends the user’s query, asking how to assist further based on the retrieved information. This prompt is then passed to OpenAI to generate a response.

In [None]:
# Prepare the prompt for OpenAI Chat Completion  
def create_prompt(results, user_input):  
    prompt = 'Here are some search results based on your query:\n\n'  
    for i, result in enumerate(results):  
        prompt += f'Result {i+1}:\n'  
        for key, value in result.items():  
            prompt += f'{key}: {value}\n'  
        prompt += '\n'  
    prompt += f'User query: {user_input}\nHow can I assist you further based on these results?'  
    return prompt

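As a quick illustration of the prompt layout, the sketch below calls the function on a single made-up search result. The function body is repeated so the block runs on its own, and the result values are invented.

```python
def create_prompt(results, user_input):
    """Same logic as above: list every result field, then append the user's query."""
    prompt = 'Here are some search results based on your query:\n\n'
    for i, result in enumerate(results):
        prompt += f'Result {i+1}:\n'
        for key, value in result.items():
            prompt += f'{key}: {value}\n'
        prompt += '\n'
    prompt += f'User query: {user_input}\nHow can I assist you further based on these results?'
    return prompt

# Made-up search result, for illustration only
sample_results = [{'score': 0.03, 'title': 'Document 1', 'author': 'Author 1'}]
sample_prompt = create_prompt(sample_results, 'Who wrote Document 1?')
print(sample_prompt)
```

With an empty result list the prompt still contains the header and the user query, so the model receives the question without any retrieved context.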
##### 7. Getting a Response from OpenAI Chat API

The `get_openai_response` function sends the formatted prompt to the OpenAI model (e.g., gpt4o) to generate a response. It configures the request with a system message and the user's query, specifies parameters like `max_tokens` for output length, and adjusts creativity via the `temperature` setting. The function returns the generated response from OpenAI.

In [None]:
# Function to get response from OpenAI  
def get_openai_response(prompt):  
    
    try:
        response = client.chat.completions.create(  
            model=deployment_name,  # Chat deployment name from .env (e.g. gpt4o); must match the deployment in Azure  
            messages=[  
                {'role': 'system', 'content': 'You are an assistant for any problem'}, # Adjust this value for personal preferences and based on documents 
                {'role': 'user', 'content': prompt}  
            ],  
            max_tokens=400,  # Adjust output length
            n=1,  # Generate one output message
            stop=None,  # Up to 4 sequences where the API will stop generating further tokens.
            temperature=0.3,  # Set creativity of AI
        )  
        return response.choices[0].message.content

    except Exception as e:
        print(f'Failed response: {e}')  

##### 8. Managing a Conversation with OpenAI Chat API

Finally, we can start a conversation with the OpenAI model by sending an initial query and generating a response. The code then runs a loop in which user input is processed continuously: for each input, a search is performed, a new prompt is created, and a response is generated. Each user message and model reply is appended to `conversation_history`, and the loop continues until the user types `exit` or `quit`. Note that in this version the history is only recorded and is not sent back to the model, so every answer is based on the current query and its search results alone.

In [None]:
# Initial interaction  
initial_query = 'I will ask you questions'  # Replace with other initial query if needed  
results = perform_search(initial_query)  
prompt = create_prompt(results, initial_query)  
chat_response = get_openai_response(prompt)  
print('Chat Completion Response:', chat_response, flush=True) # Ensure immediate output  

conversation_history = []
# Continue the conversation  
while True:  
    user_input = input('You: ')  
    if user_input.lower() in ['exit', 'quit']:  
        print('Ending the conversation.', flush=True)  
        break  

    # Store conversation context   
    conversation_history.append({'role': 'user', 'content': user_input}) 
    # Perform a new search based on user input  
    results = perform_search(user_input)  
      
    # Create a new prompt based on user input and search results  
    new_prompt = create_prompt(results, user_input)  
    chat_response = get_openai_response(new_prompt)
    conversation_history.append({'role': 'assistant', 'content': chat_response}) 

    print('AI: ', chat_response, flush=True)  # Ensure immediate output
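
The loop records `conversation_history` but sends only the current prompt to the model. One possible way to give the model earlier turns is to place them between the system message and the new prompt. This is a hedged sketch of that idea and not part of the original workflow: `build_messages` and its `max_turns` parameter are hypothetical, and the sample history is made up.

```python
def build_messages(system_prompt, history, new_prompt, max_turns=5):
    """Assemble chat messages: system prompt, the last max_turns exchanges, then the new prompt."""
    # Assumes history alternates user/assistant entries, i.e. two entries per turn
    recent = history[-2 * max_turns:] if max_turns > 0 else []
    return [
        {'role': 'system', 'content': system_prompt},
        *recent,
        {'role': 'user', 'content': new_prompt},
    ]

# Made-up history from one earlier exchange
history = [
    {'role': 'user', 'content': 'Who wrote Document 1?'},
    {'role': 'assistant', 'content': 'Author 1.'},
]
messages = build_messages('You are an assistant for any problem', history, 'When was it published?')
for message in messages:
    print(message['role'], '-', message['content'])
```

The resulting list has the same shape as the `messages` argument used in `get_openai_response`. The history passed in should be the state before the current user input is appended, otherwise the current question appears twice. Limiting the number of turns keeps the request within the model's context window.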