# **Azure RAG Workshop – Session 1**
### Resource Creation: Document Intelligence, Azure AI Search, and Azure OpenAI

Welcome to the first session of our **RAG (Retrieval Augmented Generation) Assistant** Workshop! 

In this notebook, we will:
1. **Create and configure Azure resources** needed for our project:
   - **Document Intelligence (Form Recognizer)** for document ingestion and text extraction.
   - **Azure AI Search** (Cognitive Search) to store and query document embeddings.
   - **Azure OpenAI** to leverage foundational language models for text generation and embeddings.

2. **Show two approaches** for resource creation:
   - **Using the Azure Portal** (manual UI steps).
   - **Using the Azure CLI** (programmatic approach).

3. **Discuss environment variables** for authentication and best practices in secure credential management.

---

## **1. Prerequisites**

Before we dive into code, make sure you have the following in place:

- **Azure Subscription** – You must have Contributor or Owner rights on at least one subscription.
- **Azure CLI installed** (optional but helpful).
- **Python 3.8+** with the following packages installed:
  ```bash
  pip install azure-identity azure-mgmt-resource azure-mgmt-cognitiveservices azure-search-documents
  ```
  The code cells in this notebook also rely on `python-dotenv`, LangChain, and the Document Intelligence and OpenAI client libraries:

  ```bash
  pip install python-dotenv langchain-community langchain-openai azure-ai-documentintelligence openai
  ```
- **Jupyter Notebook environment** (e.g., JupyterLab, VSCode, or another environment supporting notebooks).

---

## **2. Environment Setup**
We'll use environment variables to store sensitive information like Subscription ID or personal tokens. This ensures we don't hard-code credentials in our notebooks.

### Example of environment variables to set locally

- `AZURE_SUBSCRIPTION_ID`: Your Azure subscription ID.
- `AZURE_CLIENT_ID`: If using a Service Principal or managed identity.
- `AZURE_TENANT_ID`: Azure tenant ID.
- `AZURE_CLIENT_SECRET`: If using a Service Principal.

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv(".secrets/.env")

# Example of retrieving environment variables
subscription_id = os.environ.get("AZURE_SUBSCRIPTION_ID")
print(f"Subscription ID: {subscription_id}")
tenant_id = os.environ.get("AZURE_TENANT_ID")
print(f"Tenant ID: {tenant_id}")

# Quick check to ensure they are set
if not subscription_id:
    raise ValueError("AZURE_SUBSCRIPTION_ID environment variable is not set.")
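
As more services are added, several variables need the same check. The following is a minimal sketch of a helper that validates them in one pass; `require_env` is a hypothetical name introduced here for illustration, not part of any library.

```python
import os


def require_env(*names):
    """Return the requested environment variables as a dict.

    Raises ValueError listing every variable that is missing or empty,
    so all configuration problems surface in a single error message.
    """
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}
```

A call such as `require_env("AZURE_SUBSCRIPTION_ID", "AZURE_TENANT_ID")` would then replace the individual `if not ...` checks.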

**Centralized secrets management:** If you use a secret manager (such as Azure Key Vault, AWS Secrets Manager, Google Cloud Secret Manager, or HashiCorp Vault), you won't need `.env` files or the `dotenv` library.

---

## **3. Creating Azure Resources**
We’ll create three main resources: **Document Intelligence (Form Recognizer)**, **Azure AI Search (Cognitive Search)**, and **Azure OpenAI**. We’ll demonstrate two methods: the **Portal** and the **Azure CLI**.

---

###  **3.1 Resource Names and Configuration**
Let's define some variables (like resource group name, region, and unique resource names) to keep our code organized.

In [None]:
import uuid

# Adjust these variables as needed
# Resource group names cannot contain spaces, so a hyphen is used here.
RESOURCE_GROUP_NAME = "PMR-Consulting"
LOCATION = "westeurope"  # or your preferred region
YOUR_NAME = "ricardo"

# Many Azure resource names must be globally unique.
# For demonstration, we append a short GUID segment:
random_id = str(uuid.uuid4())[:8]

DOCUMENT_INTELLIGENCE_NAME = f"docintel-{random_id}-{YOUR_NAME}"
SEARCH_SERVICE_NAME = f"search-{random_id}-{YOUR_NAME}"
OPENAI_SERVICE_NAME = f"openai-{random_id}-{YOUR_NAME}"

print("Resource Group:", RESOURCE_GROUP_NAME)
print("Location:", LOCATION)
print("Document Intelligence (Form Recognizer):", DOCUMENT_INTELLIGENCE_NAME)
print("Search Service:", SEARCH_SERVICE_NAME)
print("OpenAI Service:", OPENAI_SERVICE_NAME)

# Example output:
#
# Resource Group: PMR-Consulting
# Location: westeurope
# Document Intelligence (Form Recognizer): docintel-a89890c4-ricardo
# Search Service: search-a89890c4-ricardo
# OpenAI Service: openai-a89890c4-ricardo

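Azure enforces naming rules per resource type, and a quick local check can catch an invalid name before a deployment fails. The sketch below encodes two rule sets as regular expressions. The patterns reflect my reading of the documented constraints and should be verified against the current Azure naming rules; both function names are hypothetical.

```python
import re

# Assumed rule for Azure AI Search service names: 2-60 characters, lowercase
# letters, digits, and dashes; no dash in the first two or the last position;
# no consecutive dashes.
_SEARCH_NAME = re.compile(r"^(?!.*--)[a-z0-9]{2}(?:[a-z0-9-]{0,57}[a-z0-9])?$")

# Assumed rule for resource group names: 1-90 characters made of letters,
# digits, underscores, hyphens, periods, and parentheses; no trailing period.
_RESOURCE_GROUP_NAME = re.compile(r"^[\w\-.()]{1,90}(?<!\.)$")


def is_valid_search_service_name(name):
    """Return True if name satisfies the assumed search service rule."""
    return _SEARCH_NAME.match(name) is not None


def is_valid_resource_group_name(name):
    """Return True if name satisfies the assumed resource group rule."""
    return _RESOURCE_GROUP_NAME.match(name) is not None
```

Under these assumptions, a resource group name containing a space is rejected, while the generated `search-<id>-<name>` pattern passes as long as it stays lowercase.
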
### **3.2 Creating Resources via the Azure Portal (Manual Steps)**
(You can skip to 3.3 if you only want the Azure CLI approach.)

1. **Create a Resource Group**
   1. Log into Azure Portal.
   2. Search for “Resource groups” in the top search bar.
   3. Click “Create” and enter:
      - **Subscription**: Select your subscription.
      - **Resource group**: `PMR-Consulting` (or your chosen name; spaces are not allowed).
      - **Region**: `westeurope`.
   4. Click “Review + Create” and then “Create”.

2. **Create Document Intelligence (Form Recognizer)**
   1. In the Portal, search for "Document Intelligence" (previously “Form Recognizer” or “Cognitive Services”). 
   2. Click “Create” on "Document Intelligence".
   3. Fill out:
      - **Subscription**: Same subscription.
      - **Resource group**: `PMR-Consulting`.
      - **Region**: `westeurope`.
      - **Resource name**: `docintel-XXXXXX` (unique name).
   4. Choose Pricing tier (e.g., “F0”).
   5. Review, then click “Create”.


3. **Create Azure AI Search (Cognitive Search)**
   1. In the Portal, search for “Azure AI Search”.
   2. Click “Create” and select your subscription and resource group.
   3. Set “Search service name” to `search-XXXXXX` (unique name).
   4. Select region `westeurope` and choose the **Basic** or **Standard** tier (depending on your needs).
   5. Click “Review + create” then “Create”.

4. **Create Azure OpenAI**
   1. In the Portal, search for “Azure OpenAI” (ensure you have access to Azure OpenAI).
   2. Click “Create” under the Azure OpenAI pane.
   3. Fill out the form (subscription, resource group, name, region).
   4. **Pricing tier**: Choose one that matches your usage (e.g., “Standard”).
   5. Click “Review + create” then “Create”.


Once all resources are created, you can verify them in your resource group in the Portal.

---
### **3.3 Creating Resources Programmatically (Azure CLI)**
For a more automated, scriptable approach, we can use the Azure CLI to create the same resources.

Open Cloud Shell in the Azure Portal (or authenticate a local environment with `az login`) and run the commands from there.

- Here is an example with the `Document Intelligence` resource.

```bash
# Create the Document Intelligence resource
az cognitiveservices account create --name <your-resource-name> --resource-group <your-resource-group-name> --kind FormRecognizer --sku <Tier> --location <location> --yes

# Get the endpoint for the Document Intelligence resource
az cognitiveservices account show --name <your-resource-name> --resource-group "<your-resource-group-name>" --query "properties.endpoint"

# Get the API Key for the Document Intelligence resource
az cognitiveservices account keys list --name <your-resource-name> --resource-group "<your-resource-group-name>"
```

- The same commands with example values filled in:

```bash
# Create the Document Intelligence resource
az cognitiveservices account create --name sdk-docintel --resource-group DataAcademy --kind FormRecognizer --sku S0 --location westeurope --yes

# Get the endpoint for the Document Intelligence resource
az cognitiveservices account show --name "sdk-docintel" --resource-group "DataAcademy" --query "properties.endpoint"

# Get the API Key for the Document Intelligence resource
az cognitiveservices account keys list --name "sdk-docintel" --resource-group "DataAcademy"
```
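
To drive the same command from a notebook cell, the argument list can be assembled in Python and passed to `subprocess`. This is a minimal sketch, not an official SDK call: `build_cognitive_account_create` is a hypothetical helper, and running the command assumes the Azure CLI is installed and you are already logged in.

```python
import shlex


def build_cognitive_account_create(name, resource_group, kind, sku, location):
    """Return the argument list for `az cognitiveservices account create`."""
    return [
        "az", "cognitiveservices", "account", "create",
        "--name", name,
        "--resource-group", resource_group,
        "--kind", kind,
        "--sku", sku,
        "--location", location,
        "--yes",
    ]


command = build_cognitive_account_create(
    name="sdk-docintel",
    resource_group="DataAcademy",
    kind="FormRecognizer",
    sku="S0",
    location="westeurope",
)

# Show the command exactly as it would be typed in a shell
print(shlex.join(command))

# To execute it (requires the Azure CLI and a prior `az login`):
# import subprocess
# subprocess.run(command, check=True)
```

Passing a list instead of a single string avoids shell-quoting problems when a value contains spaces or special characters.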

---
## **4. Create an ingestion function with LangChain using Azure Document Intelligence**

### **4.1 - Create a text extraction function** 

In [3]:
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from pathlib import Path

# Define a cross-platform file path
file_path = Path("data") / "General FAQ.pdf"
# file_path = "data/General FAQ.pdf"

endpoint = os.environ.get("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT")
key = os.environ.get("AZURE_DOCUMENT_INTELLIGENCE_KEY")
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint, api_key=key, file_path=str(file_path), api_model="prebuilt-layout"
)

documents = loader.load()

In [None]:
print(documents)

In [None]:
for document in documents:
    print(f"Page Content: {document.page_content}")
    print(f"Metadata: {document.metadata}")

### **4.2 - Create a text extraction function for a full folder** 

In [6]:
import os

def extract_text_from_pdfs_in_directory(directory_path):
    """
    Extract text from all PDF documents located in a specified directory 
    using Azure Document Intelligence (Form Recognizer). 
    
    :param directory_path: Path to the directory containing PDF files.
    :return: A list of Document objects (LangChain Document type) from all PDF files.
    """
    endpoint = os.environ.get("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT")
    key = os.environ.get("AZURE_DOCUMENT_INTELLIGENCE_KEY")

    if not endpoint:
        raise ValueError("The environment variable 'AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT' is not set.")
    if not key:
        raise ValueError("The environment variable 'AZURE_DOCUMENT_INTELLIGENCE_KEY' is not set.")

    all_documents = []

    # Iterate over every file in the directory
    # (sorted, because os.listdir returns entries in arbitrary order)
    for filename in sorted(os.listdir(directory_path)):
        # Check if the file is a PDF
        if filename.lower().endswith(".pdf"):
            file_path = os.path.join(directory_path, filename)
            # Instantiate the loader for this PDF file
            loader = AzureAIDocumentIntelligenceLoader(
                api_endpoint=endpoint, 
                api_key=key, 
                file_path=file_path, 
                api_model="prebuilt-layout"
            )
            # Load the documents from the current PDF
            docs = loader.load()
            # Accumulate the documents
            all_documents.extend(docs)

    return all_documents

In [None]:
all_documents = extract_text_from_pdfs_in_directory("./data")
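
The directory scan can be checked on its own, without calling Azure, which helps when a folder unexpectedly yields no documents. The following is a minimal sketch using `pathlib`; `list_pdf_files` is a hypothetical helper and not part of the function above.

```python
from pathlib import Path


def list_pdf_files(directory_path):
    """Return the PDF files directly inside directory_path, sorted by path.

    The extension check is case-insensitive; subdirectories are not searched.
    """
    directory = Path(directory_path)
    if not directory.is_dir():
        raise NotADirectoryError(f"Not a directory: {directory}")
    return sorted(
        path
        for path in directory.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Calling `list_pdf_files("./data")` before the extraction step shows exactly which files would be sent to Document Intelligence.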

In [3]:
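# With the loader's default mode, each PDF typically yields a single Document,
# so the counter numbers documents rather than pages.
for i, document in enumerate(all_documents, start=1):
    print(f"Document {i}:")
    print(document.page_content)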

---
## **5. Create a Search Index in Azure AI Search**

In [9]:
import os

from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_openai import AzureOpenAIEmbeddings

In [10]:
# Use an Azure OpenAI account with a deployment of an embedding model
azure_endpoint = os.environ.get("AZURE_OPENAI_API_EMBEDDING_ENDPOINT")
azure_openai_api_key = os.environ.get("AZURE_OPENAI_API_EMBEDDING_KEY")
azure_openai_api_version = "2023-05-15"
azure_deployment = "text-embedding-3-large"

In [11]:
# Set up details for Azure Search
vector_store_address = os.environ.get("AZURE_SEARCH_ENDPOINT")
vector_store_password = os.environ.get("AZURE_SEARCH_ADMIN_KEY")

### **5.1 - Create index**

- Load Embeddings Model

In [12]:
# Use AzureOpenAIEmbeddings with an Azure account
embeddings = AzureOpenAIEmbeddings(
    azure_deployment=azure_deployment,
    openai_api_version=azure_openai_api_version,
    azure_endpoint=azure_endpoint,
    api_key=azure_openai_api_key,
)

- Create Vector Store Index

In [13]:
index_name = "langchain-vector-demo"

vector_store = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
)

- Add Documents to the index

In [None]:
vector_store.add_documents(documents=all_documents)

- Perform Semantic Similarity Search

In [17]:
# Perform a similarity search
ss_result = vector_store.similarity_search(
    query="What types of tires does AutoGrip manufacture?",
    k=3,
    search_type="similarity",
)

In [None]:
print(ss_result[0].page_content)
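
To build intuition for what a similarity search ranks, here is a toy sketch in plain Python. It assumes cosine similarity as the metric, a common choice for text embeddings; the metric a real index uses depends on its configuration. The three-dimensional vectors are made up for illustration, since real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, document_vectors, k):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(document_vectors)),
        key=lambda i: cosine_similarity(query_vector, document_vectors[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings" for three documents and one query
document_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query_vector = [0.8, 0.6, 0.0]

print(top_k(query_vector, document_vectors, k=2))  # prints [1, 0]
```

The vector store performs the same kind of ranking, but over the stored chunk embeddings and with an approximate index rather than an exhaustive comparison.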

- Delete Index (if needed)

In [19]:
# from azure.core.credentials import AzureKeyCredential
# from azure.search.documents.indexes import SearchIndexClient

# service_endpoint = os.environ["AZURE_SEARCH_ENDPOINT"]
# index_name = "langchain-vector-demo"
# key = os.environ["AZURE_SEARCH_ADMIN_KEY"]

# # Delete Index
# search_index_client = SearchIndexClient(service_endpoint, AzureKeyCredential(key))
# search_index_client.delete_index(index_name)

### **5.2 - Create function for semantic similarity search**

In [20]:
def get_ss_results_text(query, n_results):
    """
    Run a similarity search against the index and return the matching text.

    :param query: Natural-language question to search for.
    :param n_results: Number of documents to retrieve.
    :return: Tuple of (concatenated page contents, list of Document objects).
    """
    # Instantiate the embeddings model
    # Use AzureOpenAIEmbeddings with an Azure account
    embeddings = AzureOpenAIEmbeddings(
        azure_deployment="text-embedding-3-large",
        openai_api_version="2023-05-15",
        azure_endpoint=os.environ.get("AZURE_OPENAI_API_EMBEDDING_ENDPOINT"),
        api_key=os.environ.get("AZURE_OPENAI_API_EMBEDDING_KEY"),
    )

    # Instantiate the vector store
    vector_store = AzureSearch(
        azure_search_endpoint=os.environ.get("AZURE_SEARCH_ENDPOINT"),
        azure_search_key=os.environ.get("AZURE_SEARCH_ADMIN_KEY"),
        index_name="langchain-vector-demo",
        embedding_function=embeddings.embed_query,
    )

    ss_result = vector_store.similarity_search(
        query=query,
        k=n_results,
        search_type="similarity",
    )

    # Separate chunks with a blank line so they do not run together
    final_result_string = "\n\n".join(
        document.page_content for document in ss_result
    )

    return final_result_string, ss_result

In [21]:
ss_result_string, full_results = get_ss_results_text("What types of tires does AutoGrip manufacture?", 3)
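
When the retrieved text is passed to a model, numbering the chunks and capping the total length can make the context easier to control. The sketch below illustrates one way to do that. `Chunk` is a hypothetical stand-in for a LangChain `Document`, and `format_context` and its `max_chars` default are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document (only page_content is used)."""
    page_content: str


def format_context(chunks, max_chars=4000):
    """Join chunks into numbered blocks, stopping before max_chars is exceeded.

    The budget counts block text only, not the blank-line separators.
    """
    blocks = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        block = f"[{number}] {chunk.page_content.strip()}"
        if used + len(block) > max_chars:
            break
        blocks.append(block)
        used += len(block)
    return "\n\n".join(blocks)


example = format_context([Chunk("alpha"), Chunk(" beta ")])
print(example)
```

Because the function only reads `page_content`, it should also accept the `Document` objects returned by the similarity search.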

---
## **6. Make Requests to an LLM**

### **6.1 - Instantiate the model**

In [22]:
import getpass
import os

from langchain_openai import AzureChatOpenAI

# Prompt for the key only if it is not already set in the environment
if not os.environ.get("AZURE_OPENAI_API_KEY"):
    os.environ["AZURE_OPENAI_API_KEY"] = getpass.getpass("Enter API key for Azure: ")

model = AzureChatOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_API_ENDPOINT"],
    azure_deployment="gpt-4o",
    openai_api_version="2024-12-01-preview",
)

In [None]:
model.invoke("Hello, world!").content

### **6.2 - Create a method for prompt interaction**

In [24]:
# Prompt call
def llm_invoke(prompt):
    """Send a prompt to the Azure OpenAI chat deployment and return the reply text."""
    model = AzureChatOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_API_ENDPOINT"],
        azure_deployment="gpt-4o",
        openai_api_version="2024-12-01-preview",
    )

    return model.invoke(prompt).content

In [None]:
print(llm_invoke("Who was Marie Curie?"))

---
## **7. Final Orchestration**

In [26]:
import os

def rag_chatbot():
    r"""
    Runs a simple RAG-based chatbot in a Jupyter notebook cell.
    Continues the conversation until the user types '\quit'.
    
    Requirements:
      - get_ss_results_text(query, n_results) -> returns (context_string, list_of_docs)
      - llm_invoke(prompt) -> returns LLM-generated answer
      
    Workflow:
      1. Prompt user for input query.
      2. Pass the user query to get_ss_results_text() to retrieve context.
      3. Build a prompt that includes conversation history + context + user query.
      4. Call llm_invoke() to get the model's answer.
      5. Append user query & model answer to conversation history.
      6. Repeat until the user types '\quit'.
    """
    
    conversation_memory = ""  # Will store conversation in text form
    print("Welcome to the RAG Chatbot! Type '\\quit' to exit.\n")
    
    while True:
        # 1) Prompt user for input
        user_query = input("User: ")
        
        # Check if user wants to exit
        if user_query.strip().lower() == "\\quit":
            print("Exiting the chatbot. Goodbye!")
            break
        
        # Print User query
        print(f"User: {user_query}\n")

        # 2) Retrieve semantic search context
        context_string, docs = get_ss_results_text(user_query, n_results=3)
        
        # 3) Build a prompt that incorporates the conversation so far + new user query + context
        prompt = f"""
        You are a helpful AI assistant. Use the conversation history, the user's new question, and any provided context to craft your answer.
        
        Conversation so far:
        {conversation_memory}

        Relevant context from knowledge base:
        {context_string}

        Now the user asks: {user_query}

        Answer in a helpful, concise manner:
        """
        
        # 4) Call the LLM with the combined prompt
        answer = llm_invoke(prompt)
        
        # 5) Update conversation memory
        conversation_memory += f"User: {user_query}\nAssistant: {answer}\n"
        
        # 6) Print the answer for the user
        print(f"Assistant: {answer}\n")


In [None]:
rag_chatbot()
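
The conversation memory in `rag_chatbot` grows with every turn, so a long session eventually produces a very large prompt. A minimal sketch of one alternative follows: keep only the most recent turns in a `deque` and assemble the prompt in a separate function. `build_prompt`, `PROMPT_TEMPLATE`, and the limit of five turns are hypothetical choices, not part of the chatbot above.

```python
from collections import deque
from textwrap import dedent

# The template is dedented once, before any values are inserted, so that
# multi-line history or context cannot disturb the indentation handling.
PROMPT_TEMPLATE = dedent("""\
    You are a helpful AI assistant. Use the conversation history, the user's new question, and any provided context to craft your answer.

    Conversation so far:
    {history}

    Relevant context from knowledge base:
    {context}

    Now the user asks: {query}

    Answer in a helpful, concise manner:
    """)


def build_prompt(history, context, query):
    """Assemble the prompt from prior (question, answer) turns, context, and the new query."""
    history_text = "\n".join(
        f"User: {question}\nAssistant: {answer}" for question, answer in history
    )
    return PROMPT_TEMPLATE.format(history=history_text, context=context, query=query)


# Keep only the five most recent turns; older ones are dropped automatically
conversation = deque(maxlen=5)
conversation.append(("Hello", "Hi! How can I help?"))

prompt = build_prompt(conversation, "example context", "What can you do?")
```

Inside the chat loop, each turn would be stored with `conversation.append((user_query, answer))` instead of concatenating onto a string.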