# Challenge 04-B - Retrieval Augmented Generation (RAG) for Unstructured Data


## Introduction

Businesses have a lot of proprietary information that needs to be taken into account when answering user's questions - these cannot always be answered through the data that the GPT models have been trained on. 

In the last notebook, we worked with structured data primarily. A lot of the time, your enterprise data is not limited to just structured formats like CSV files or SQL tables. It may also include unstructured data like PDF documents or images. In fact, your individual documents could have both unstructured and structured data built into them. Extracting information from these diverse formats in a comprehensible manner presents a challenge. Tools like Azure Form Recognizer enable the extraction of data from unstructured sources such as forms or documents. Once the data is extracted into a structured JSON format, then Cognitive Search can be utilized to consolidate the entire information from different data types into indexes, facilitating the retrieval of relevant documents.

In this notebook, we will walk you through a use case of Retrieval Augmented Generation (RAG) that involves working with unstructured data. The RAG approach combines various technologies to enhance the quality and relevance of generated outputs. We will leverage Azure Form Recognizer to process complex documents, utilizing the layout API to extract text and tables effectively. We will utilize Azure Cognitive Search to create an index by configuring semantic search capabilities, enabling the retrieval of relevant document pages. Additionally, embeddings will be incorporated to retrieve content that is more closely aligned with the user's question. Finally, Azure OpenAI's ChatGPT model will utilize the extracted content to generate a more meaningful answer. It is important to emphasize that this grounding process follows the RAG pattern mentioned in the previous notebook and helps eliminate inaccuracies in the generated responses.

Your goals for this challenge are to read through this notebook, run each code block, observe the results, and then be able to answer the questions posed in the student guide.


In [1]:
! pip install "tiktoken==0.9.0" 

Defaulting to user installation because normal site-packages is not writeable
Collecting tiktoken==0.9.0
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m21.4 MB/s[0m eta [36m0:00:00[0m00:01[0m
[?25hInstalling collected packages: tiktoken
  Attempting uninstall: tiktoken
    Found existing installation: tiktoken 0.12.0
    Uninstalling tiktoken-0.12.0:
      Successfully uninstalled tiktoken-0.12.0
Successfully installed tiktoken-0.9.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m26.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [2]:
# Import Azure Forms Recognizer, Azure Cognitive Search, OpenAI, and other python modules

import os, json, requests, sys, re
import requests
from pprint import pprint
import pandas as pd
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient 
from azure.search.documents import SearchClient
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    SemanticConfiguration,
    PrioritizedFields,
    SemanticField,
    SemanticSettings
)

from azure.ai.formrecognizer import DocumentAnalysisClient
import openai
import numpy as np

from dotenv import load_dotenv
load_dotenv()

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default"
)

In [3]:
# This is secure and recommended way to load OpenAI resource credentials and deployment names

# Initialize the Azure OpenAI client for the latest version
from openai import AzureOpenAI

# Initialize the Azure OpenAI client
client = AzureOpenAI(
    azure_endpoint=os.getenv("OPENAI_API_BASE"),
    azure_ad_token_provider=token_provider,
    api_version=os.getenv("OPENAI_API_VERSION")
)

chat_model = os.environ['CHAT_MODEL_NAME']
embedding_model = os.environ['EMBEDDING_MODEL_NAME']

**NOTE:** The path in the code cell below is referring to the `/data/unstructured/raw` folder. You may need to update this path if you are running this notebook from a different location then from where you extracted it.

In [4]:
# -- raw data
RAW_DATA_FOLDER= '../data/unstructured/raw'
# -- extracted json file 
EXTRACTED_DATA_FOLDER = '../data/unstructured/extracted'

In [5]:
from azure.ai.formrecognizer import DocumentAnalysisClient

endpoint = os.environ["DOCUMENT_INTELLIGENCE_ENDPOINT"]

# Use Entra ID authentication instead of API key
credential = DefaultAzureCredential()

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=credential
)

We want to extract the data from our unstructured data into a more readable format for the model to understand. The Form Recognizer tool helps us do so by leveraging the prebuilt layout models. Here, we primarily are working with PDFs but we could also have JPG and PNG formats that the form recognizer tool also supports.

For each document, we want to specify the way information is being extracted. For example in this use case, each document has many pages. To keep track of the pages, we store them in page_number. We also want to extract the content for each page and drop it in a page_context field.

In [6]:
def extract_local_single_file(file_name: str):
    not_completed = True
    while not_completed:
        with open(file_name, "rb") as f:
            poller = document_analysis_client.begin_analyze_document(
                "prebuilt-layout", document=f
            )
            not_completed=False
    result = poller.result()
    return get_page_content(file_name, result)

def extract_files( folder_name: str, destination_folder_name: str):
    os.makedirs(destination_folder_name, exist_ok=True)
    for file in os.listdir(folder_name):
        if file[-3:].upper() in ['PDF','JPG','PNG']:
            print('Processing file:', file, end='')
        
            page_content = extract_local_single_file(os.path.join(folder_name, file))
            output_file = os.path.join(destination_folder_name, file[:-3] +'json')
            print(f'  write output to {output_file}')
            with open(output_file, "w") as f:
                f.write(json.dumps(page_content))


def get_page_content(file_name:str, result):
    page_content = []
    for page in result.pages:
        all_lines_content = []
        for line_idx, line in enumerate(page.lines):
            all_lines_content.append(' '.join([word.content for word in line.get_words()]))
        page_content.append({'page_number':page.page_number, 
                                'page_content':' '.join(all_lines_content)})
    return {'filename':file_name, 'content':page_content}





In [7]:
extract_files(RAW_DATA_FOLDER, EXTRACTED_DATA_FOLDER)

Processing file: Chain-of-Thought_Prompting_Elicits_Reasoning_in_LLMs.pdf  write output to ../data/unstructured/extracted/Chain-of-Thought_Prompting_Elicits_Reasoning_in_LLMs.json
Processing file: AutoPrompt_Eliciting_Knowledge_From_LanguageModels.pdf  write output to ../data/unstructured/extracted/AutoPrompt_Eliciting_Knowledge_From_LanguageModels.json
Processing file: Precise_Zero-Shot_Dense_Retrieval_without_Relevance_Labels.pdf  write output to ../data/unstructured/extracted/Precise_Zero-Shot_Dense_Retrieval_without_Relevance_Labels.json
Processing file: Prefix-Tuning_Optimizing_Continuous_Prompts_for_Generation.pdf  write output to ../data/unstructured/extracted/Prefix-Tuning_Optimizing_Continuous_Prompts_for_Generation.json
Processing file: Generated_Knowledge_Prompting_for_Commonsense_Reasoning.pdf  write output to ../data/unstructured/extracted/Generated_Knowledge_Prompting_for_Commonsense_Reasoning.json
Processing file: Power_of_Scale_for_Parameter-Efficient_Prompt_Tuning.pdf 

## More About our data

For this walkthrough, we will take a look at various Research Papers on LLM topics in PDF documents. This includes topics like autoprompting, chain of thought prompting, precise zero shot dense retrival, and more. This dataset contains various unstructured formats such as text, tables, graphs, and formulas.

## Data Description

The relevant schema for our work today consists of 

- document_id
- document_name
- file_path
- page_number
- page_text


In [8]:
documents=[]
for file in os.listdir(EXTRACTED_DATA_FOLDER):
    with open(os.path.join(EXTRACTED_DATA_FOLDER, file)) as f:
        page_content= json.loads(f.read())
    documents.extend(
        [
            {
                'document_id':page_content['filename'].split('\\')[-1].split('.')[0] + '-' + str(page['page_number']),
                'document_name':page_content['filename'].split('\\')[-1],
                'file_path':page_content['filename'],              
                'page_number':page['page_number'],
                'page_text':page['page_content']
            }
            for page in page_content['content']
        ]
    )

In [9]:
#Example of a single page of research paper file that will be indexed in Azure Cognitive Search
documents[1]

{'document_id': '-2',
 'document_name': '../data/unstructured/raw/LLMs_are_Human-Level_Prompt_Engineers.pdf',
 'file_path': '../data/unstructured/raw/LLMs_are_Human-Level_Prompt_Engineers.pdf',
 'page_number': 2,
 'page_text': "Instruction Approximate Inference using LLMs Model Input I instructed my friend to <M>. The friend read the instruction and wrote an output for every one of the inputs. Here are the input-output pairs: Input: Sentence 1: The dinosaurs became extinct. Sentence 2: A large object hit the Earth. Output: A large object hit the Earth. ... 0.8 0.76 1.75 0 71 -- Human Prompt Engineer 0.63 0.65 0.61 0.59 0.6- 0.57 Interquartile Mean Zero-Shot Performance 0.4 0.40 Input: Sentence 1: The company's posted strong earnings. Sentence 2: The company's stock went up. Output: The company's posted strong earnings. Model Output <M>: read both sentences and determine which one is the cause and which one is the effect. Choose the sentence that is the cause and write it as the output.

This section will focus on Cognitive Search and the following topics:
1. Creating an index client
2. Defining the index fields with necessary attributes
3. Creating a semantic configuration
4. Loading our index with the document pages

In [10]:
# Create an SDK client
service_endpoint = os.getenv("AZURE_AI_SEARCH_ENDPOINT")   
credential = DefaultAzureCredential()

index_name = "research-paper-index"

index_client = SearchIndexClient(
    endpoint=service_endpoint, credential=credential)
index_client

<azure.search.documents.indexes._search_index_client.SearchIndexClient at 0x7e7dfcab46d0>

In [11]:
fields = [
    SimpleField(name="document_id", type=SearchFieldDataType.String, key=True),
    SimpleField(name="page_number", type=SearchFieldDataType.Int64),
    SimpleField(name="file_path", type=SearchFieldDataType.String),
    SearchableField(name="document_name", type=SearchFieldDataType.String,
                searchable=True, retrievable=True),
    SearchableField(name="page_text", type=SearchFieldDataType.String,
                filterable=True, searchable=True, retrievable=True),
]

semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=PrioritizedFields(
        title_field=SemanticField(field_name="document_id"),
        prioritized_keywords_fields=[SemanticField(field_name="document_name")],
        prioritized_content_fields=[SemanticField(field_name="page_text")]
    )
)


# Create the semantic settings with the configuration
semantic_settings = SemanticSettings(configurations=[semantic_config])

# Create the search index with the semantic settings
index = SearchIndex(name=index_name, fields=fields, semantic_settings=semantic_settings)
result = index_client.create_or_update_index(index)
print(f' {result.name} created')

 research-paper-index created


In [12]:
search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)
result = search_client.upload_documents(documents)  
print(f"Uploaded {len(documents)} documents") 

Uploaded 179 documents


In [13]:
len(result)

179

Here we see Azure Cognitive Search in action! We can retrive the most relevant documents out of all the ones that we are working with. 

In [14]:
query = "What is automated prompt engineering?"
count = 10
results = search_client.search(search_text=query, top=count, include_total_count=True)
page_chunks = []
citations = []
for result in results:
    page_chunks.append(result['page_text'])
    citations.append(result['document_name'])
    
    

In [15]:
embed_df = pd.DataFrame(page_chunks, columns = ["page_chunks"]) #datframe with document chunks
embed_df

Unnamed: 0,page_chunks
0,Table 24: Few-shot exemplars for full chain of...
1,Table 25: Few-shot exemplars for full chain of...
2,Self-Consistency Improves Chain of Thought Rea...
3,likely that models could arrive at the correct...
4,T5 Size Prompt Length Trainable Parameters Tot...
5,"matches, the best individual prompt. 7 Interpr..."
6,F Appendix: Input/Output Examples Table 13: Ex...
7,Table 23: Few-shot exemplars for full chain of...
8,at every transformer layer. This is akin to le...
9,Self-Consistency Improves Chain of Thought Rea...


Once we have the most relevant documents, let us create embeddings for all the page chunks. This will help us find the most similar documents to our given user query.

In [16]:
# Handling Rate Limits

from time import sleep

def get_embedding(text: str, model: str = "text-embedding-ada-002"):
    count=0
    while True:
        try:
            response = client.embeddings.create(
                input=[text],
                model=model
            )
            embedding = response.data[0].embedding
            break;
        except Exception as e:
            # Handle rate limiting and other errors
            if "rate limit" in str(e).lower() or "429" in str(e):
                count+=1
                #print(f'Rate Limit Error Count: {count}')
                sleep(2)
            else:
                raise e            
    return np.array(embedding).astype(np.float32)

def get_completion(prompt, model="gpt-35-turbo"): 
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0, # this is the degree of randomness of the model's output
    )
    return response.choices[0].message.content

def cosine_similarity(vec1, vec2):
    """Calculate the cosine similarity between two vectors."""
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

In [17]:
#Create an embedding vector for each chunk that will capture the semantic meaning and overall topic of that chunk
embed_df['embedding'] = embed_df["page_chunks"].apply(lambda page_text : get_embedding(page_text, model = embedding_model))

In [18]:
embed_df

Unnamed: 0,page_chunks,embedding
0,Table 24: Few-shot exemplars for full chain of...,"[0.013752952, 0.012547761, 0.0302744, -0.00447..."
1,Table 25: Few-shot exemplars for full chain of...,"[-0.0046408083, 0.011499113, -0.005062398, -0...."
2,Self-Consistency Improves Chain of Thought Rea...,"[0.012865214, 0.023359789, 0.007188161, -0.032..."
3,likely that models could arrive at the correct...,"[0.028534416, 0.0068504177, 0.021387327, -0.03..."
4,T5 Size Prompt Length Trainable Parameters Tot...,"[-0.033432808, 0.007821185, 0.026300108, -0.01..."
5,"matches, the best individual prompt. 7 Interpr...","[-0.011787865, -0.011878273, 0.018540679, -0.0..."
6,F Appendix: Input/Output Examples Table 13: Ex...,"[-0.0065887556, 0.029776651, 0.009727606, -0.0..."
7,Table 23: Few-shot exemplars for full chain of...,"[-0.007411479, 0.0077118557, 0.013974066, -0.0..."
8,at every transformer layer. This is akin to le...,"[-0.021121152, -0.0059559154, 0.018654246, -0...."
9,Self-Consistency Improves Chain of Thought Rea...,"[0.0022435007, 0.018677488, 0.0099718785, -0.0..."


In [19]:
embed_df

query_embedding = get_embedding(query, model=embedding_model)
embed_df["similarities"] = embed_df['embedding'].apply(lambda page_embedding: cosine_similarity(page_embedding, query_embedding))

top_results = (
    embed_df.sort_values("similarities", ascending=False)
    .reset_index(drop=True)
    .head(3)
)
top_results

Unnamed: 0,page_chunks,embedding,similarities
0,"matches, the best individual prompt. 7 Interpr...","[-0.011787865, -0.011878273, 0.018540679, -0.0...",0.809985
1,at every transformer layer. This is akin to le...,"[-0.021121152, -0.0059559154, 0.018654246, -0....",0.80492
2,Self-Consistency Improves Chain of Thought Rea...,"[0.0022435007, 0.018677488, 0.0099718785, -0.0...",0.766031


In [20]:
prompt = f"""
Provided below are user query and list of extracted pages from research papers separated by triple backticks.
Your task is to extract key pieces of information from that list based on the user query and phrase that as a comprehensive answer. 

User Query: ```{query}```
List of Extracted Pages: ```{top_results['page_chunks'].to_list()}```

Answer:
"""

print(prompt)


Provided below are user query and list of extracted pages from research papers separated by triple backticks.
Your task is to extract key pieces of information from that list based on the user query and phrase that as a comprehensive answer. 

User Query: ```What is automated prompt engineering?```
List of Extracted Pages: ```['matches, the best individual prompt. 7 Interpretability An ideally interpretable prompt would consist of natural language that clearly describes the task at hand, explicitly asks the model for some result or action, and makes it easy to understand why the prompt elicited such behavior from the model. As prompt tuning works in the continuous em- bedding space rather than the discrete token space, interpreting prompts becomes more difficult. To test the interpretability of our learned soft prompts, we compute the nearest neighbors to each prompt token from the frozen model’s vocabulary. We use cosine distance between the vocabulary embedding vector and the prompt

In [21]:
response = get_completion(prompt, chat_model)
print(response)

Automated prompt engineering refers to the process of designing and optimizing prompts for language models in a systematic and automated manner. This involves creating prompts that effectively guide the model to perform specific tasks or generate desired outputs without manually crafting each prompt. The concept is closely related to prompt tuning, which involves adjusting prompts in the continuous embedding space rather than the discrete token space. This makes interpreting the prompts more challenging, as they do not consist of straightforward natural language instructions.

The process of automated prompt engineering can include techniques such as initializing prompts with specific strategies (e.g., "class-label" strategy) and refining them through training to ensure they align closely with the desired task outcomes. The goal is to create prompts that can prime the model to interpret inputs within a specific domain or context, such as scientific or technological fields, as observed 

In [22]:

def query_search(query, count=10):
    results = search_client.search(search_text=query, top=count, include_total_count=True)
    page_chunks = []
    for result in results:
        page_chunks.append(result['page_text'])
        
    #Create an embedding vector for each chunk that will capture the semantic meaning and overall topic of that chunk
    embed_df = pd.DataFrame(page_chunks, columns = ["page_chunks"])
    embed_df['embedding'] = embed_df["page_chunks"].apply(lambda page_text : get_embedding(page_text, model = embedding_model))

    query_embedding = get_embedding(query, model=embedding_model)
    embed_df["similarities"] = embed_df['embedding'].apply(lambda page_embedding: cosine_similarity(page_embedding, query_embedding))

    top_results = (
        embed_df.sort_values("similarities", ascending=False)
        .reset_index(drop=True)
        .head(3)
    )
    
    prompt = f"""
    Provided below are user query and list of extracted pages from research papers separated by triple backticks.
    Your task is to extract key pieces of information from that list based on the user query and phrase that as a comprehensive answer. 

    User Query: ```{query}```
    List of Extracted Pages: ```{top_results['page_chunks'].to_list()}```

    Answer:
    """
    
    response = get_completion(prompt, chat_model)
    return response

In [23]:
answer = query_search("How does automated prompt engineering work?", 5)
print(answer)

Automated prompt engineering involves techniques that modify or enhance the input to a language model to improve its performance on specific tasks without altering the model's core parameters. Several methods have been developed to achieve this, each with its unique approach and benefits:

1. **Prompt Tuning**: This method involves prepending a learnable prompt to the input data. Unlike traditional model tuning, which updates the entire model's parameters, prompt tuning keeps the language model's parameters frozen and only adjusts the prompt embeddings. This approach is highly parameter-efficient, requiring less than 0.01% task-specific parameters for large models, and is particularly effective in scenarios with significant domain shifts, such as moving from one dataset to another.

2. **Prefix Tuning**: Proposed by Li and Liang (2021), this technique involves learning a sequence of prefixes that are added to both the encoder and decoder networks in models like BART. It requires more p

In [24]:
answer = query_search("what is prompt tuning?", 10)
print(answer)

Prompt tuning is a technique used to adapt pre-trained language models to specific downstream tasks by learning "soft prompts." Unlike traditional model tuning, which involves adjusting all the parameters of a model, prompt tuning keeps the core model parameters frozen and only learns a small set of additional parameters that are prepended to the input as a prompt. This approach is parameter-efficient, requiring significantly fewer parameters than full model tuning, and allows for the reuse of a single pre-trained model across multiple tasks.

The key idea behind prompt tuning is to condition the model on task-specific information by adding a learned prompt to the input data. This prompt is trained end-to-end using backpropagation, allowing it to incorporate signals from labeled examples. The method is particularly effective as the size of the model increases, becoming competitive with full model tuning for large models with billions of parameters.

Prompt tuning offers several advanta