## ⚡️ Quick Overview
This notebook follows the [LangChain documentation](https://python.langchain.com/v0.1/docs/integrations/vectorstores/azuresearch/) for Azure AI Search, with several workarounds added after testing and refinement.

Overall, this is a simple RAG flow built on Azure AI Search and Azure Document Intelligence (used here to read a Word document). It uses an OpenAI embedding model to create the embeddings (unfortunately, this was the only option that worked with the Azure AI Search integration in this setup), creates an index in Azure AI Search (which acts like a retriever), loads a chunked document into that index, and runs similarity and hybrid searches against it. The downside is that this script did not set up a persistent vector database within Azure, as that would require Blob Storage, which needs another subscription.
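
The embed, index, and search steps described above can be illustrated offline with a toy in-memory version. This is only a sketch: `toy_embed` is a hypothetical bag-of-words stand-in for the OpenAI embedding model, `ToyIndex` is a hypothetical stand-in for the Azure AI Search index, and the sample chunks are made up.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Hypothetical in-memory replacement for the vector index."""

    def __init__(self):
        self.rows = []  # list of (chunk text, vector) pairs

    def add(self, chunks):
        for chunk in chunks:
            self.rows.append((chunk, toy_embed(chunk)))

    def search(self, query, k=3):
        query_vector = toy_embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(query_vector, row[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


index = ToyIndex()
index.add([
    "benchmarking cost models for a water company",
    "health and safety advisor in the oil and gas sector",
    "dashboard report built for visualisation",
])
top_hits = index.search("oil and gas experience", k=1)
print(top_hits)
```

The real flow differs mainly in where each step runs: the vectors come from the OpenAI API and the ranking happens inside Azure AI Search.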

Code snippets from [Azure AI Document Intelligence](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/office_file/) and [LangChain's TextSplitter and tiktoken](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/split_by_token/) were used to load and chunk the Word document.

Note: only the OpenAI embedding model is used; no LLM-generated response is involved.

In [45]:
!pip install --quiet azure-search-documents
!pip install --quiet azure-identity




In [48]:
import os
from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_openai import AzureOpenAIEmbeddings, OpenAIEmbeddings    # kept for reference only: this LangChain embeddings integration did not work here
from openai import OpenAI
from dotenv import dotenv_values

In [65]:
config = dotenv_values(".env")
openai_api_key: str = config['OPENAI_API_KEY']
os.environ["OPENAI_API_KEY"] = openai_api_key

openai_api_version: str = "2023-05-15"
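model: str = "text-embedding-ada-002"    # only referenced by the commented-out LangChain embeddings; get_embedding() defaults to text-embedding-3-small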

vector_store_address: str = config["AZURE_AI_SEARCH_ENDPOINT"]
vector_store_password: str = config['AZURE_AI_SEARCH_API_KEY']
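
Indexing `config[...]` directly raises a bare `KeyError` when a setting is missing from `.env`, and an empty value only fails later at the API call. A minimal sketch of an up-front check, assuming `config` is the plain dict-like object returned by `dotenv_values`; `require_keys` is a hypothetical helper, not part of python-dotenv.

```python
def require_keys(config: dict, keys: list[str]) -> dict:
    """Return only the requested settings, raising if any is missing or empty."""
    missing = [key for key in keys if not config.get(key)]
    if missing:
        raise KeyError(f"Missing settings in .env: {', '.join(missing)}")
    return {key: config[key] for key in keys}


REQUIRED = [
    "OPENAI_API_KEY",
    "AZURE_AI_SEARCH_ENDPOINT",
    "AZURE_AI_SEARCH_API_KEY",
]

# Example with a hypothetical, incomplete config (placeholder values only)
example = {"OPENAI_API_KEY": "sk-placeholder", "AZURE_AI_SEARCH_ENDPOINT": ""}
try:
    require_keys(example, REQUIRED)
except KeyError as err:
    print(err)
```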

In [59]:
# Outdated approach: creating the OpenAI embeddings through LangChain did not work here. Call the OpenAI API directly instead.
# embeddings: OpenAIEmbeddings = OpenAIEmbeddings(
#     openai_api_key=openai_api_key, openai_api_version=openai_api_version, model=model
# )

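client = OpenAI()  # reads OPENAI_API_KEY from the environment variable set above

def get_embedding(text, model="text-embedding-3-small"):
    # Replace newlines with spaces before requesting the embedding
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding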

index_name: str = "langchain-vector-demo"
vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=get_embedding)     # an embedding function must be defined beforehand to instantiate the AzureSearch object
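
For trying out the wiring without spending API calls, a stub with the same shape as `get_embedding` (one string in, one list of floats out) can be handy. This is a sketch under assumptions: `fake_embedding` and its `dim` parameter are hypothetical, the vectors carry no semantic meaning, and an index built with them should not be mixed with real embeddings because the vector length has to match the index's vector field.

```python
import hashlib


def fake_embedding(text: str, dim: int = 8) -> list[float]:
    """Hypothetical offline stand-in for get_embedding.

    Produces a deterministic pseudo-vector from a SHA-256 hash, so it only
    mimics the str -> list[float] shape. dim must be at most 32 (the digest size).
    """
    text = text.replace("\n", " ")  # same newline normalisation as get_embedding
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


vector = fake_embedding("What did Daniel do at AECOM?")
print(len(vector))
```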

In [66]:
# %pip install --upgrade --quiet  langchain langchain-community azure-ai-documentintelligence

from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

file_path = "C:/Users/WongD7/Documents/AECOM internal/Wong Daniel_Graduate Consultant_WIP.docx"  # change this to your own file path
endpoint = config["AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT"]
key = config["AZURE_DOCUMENT_INTELLIGENCE_API_KEY"]
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint, api_key=key, file_path=file_path, api_model="prebuilt-layout", mode="markdown"
)

documents = loader.load()
print(documents)

[Document(page_content='Daniel Wong BSc (Hons), MSc\n\nGraduate Cost & Carbon Intelligence Consultant\n\n| Proposed position on project Bespoke to opportunity we are bidding for | Qualifications BSc (Hons) Environmental & Occupational Safety and Health Year completed: 2019 MSc Computing and Information System (Distinction) Year completed: 2022 | Professional memberships Publications Financial Risk Manager (FRM) Part 1 Certified (GARP) Certified ISO9001 Internal Auditor (HKQAA) Certificate in First Aid & adult CPR & automated external defibrillation (St. John) Completed training for Oracle Primavera P6 | |\n| - | - | - | - |\n| Project name Name of project we are bidding for ||||\n\nDaniel is a graduate consultant with a passion for innovation across industries. He consistently navigates new challenges with ease, demonstrating an exceptional capacity to adapt swiftly and devise innovative solutions to solve problems. Moreover, Daniel possesses a deep understanding and proficiency in lev

In [90]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
# from_tiktoken_encoder measures chunk size in tokens but still splits on separators; this works better than a plain tiktoken splitter, which can cut chunks in the middle of sentences
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=250,
    chunk_overlap=0,
)

# Alternative: rebuild Document objects from raw strings (the loader's metadata is not carried over unless passed explicitly)
# page_contents = [doc.page_content for doc in documents]     # convert the list of Document objects to a list of strings
# docs = text_splitter.create_documents(page_contents)        # returns a list of Document objects with a page_content attribute
docs = text_splitter.split_documents(documents)       # returns a list of Document objects, which is what the AzureSearch object's add_documents expects
print(len(docs))
print(docs)

10
[Document(page_content='Daniel Wong BSc (Hons), MSc\n\nGraduate Cost & Carbon Intelligence Consultant\n\n| Proposed position on project Bespoke to opportunity we are bidding for | Qualifications BSc (Hons) Environmental & Occupational Safety and Health Year completed: 2019 MSc Computing and Information System (Distinction) Year completed: 2022 | Professional memberships Publications Financial Risk Manager (FRM) Part 1 Certified (GARP) Certified ISO9001 Internal Auditor (HKQAA) Certificate in First Aid & adult CPR & automated external defibrillation (St. John) Completed training for Oracle Primavera P6 | |\n| - | - | - | - |\n| Project name Name of project we are bidding for ||||\n\nDaniel is a graduate consultant with a passion for innovation across industries. He consistently navigates new challenges with ease, demonstrating an exceptional capacity to adapt swiftly and devise innovative solutions to solve problems. Moreover, Daniel possesses a deep understanding and proficiency in 
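
To show why a separator-aware splitter tends to avoid mid-sentence cuts, here is a toy recursive splitter. It is a sketch of the general idea, not LangChain's implementation: `recursive_split` is hypothetical, it counts whitespace-separated words instead of tiktoken tokens, and its separator list (`"\n\n"`, `". "`, `" "`) is an assumption chosen for illustration. LangChain's default list is `"\n\n"`, `"\n"`, `" "`, `""`.

```python
def recursive_split(text: str, max_words: int, separators=("\n\n", ". ", " ")) -> list[str]:
    """Toy separator-aware splitter.

    Tries the coarsest separator first and falls back to finer ones only for
    pieces that are still too long. Separators at chunk boundaries are dropped.
    """
    if len(text.split()) <= max_words or not separators:
        return [text.strip()] if text.strip() else []
    head, *rest = separators
    chunks, current = [], ""
    for piece in text.split(head):
        candidate = f"{current}{head}{piece}" if current else piece
        if len(candidate.split()) <= max_words:
            current = candidate  # piece still fits into the running chunk
            continue
        if current:
            chunks.append(current.strip())
        if len(piece.split()) > max_words:
            # Piece alone is too long: split it again with the finer separators
            chunks.extend(recursive_split(piece, max_words, tuple(rest)))
            current = ""
        else:
            current = piece
    if current.strip():
        chunks.append(current.strip())
    return chunks


sample = "Alpha beta gamma. Delta epsilon.\n\nZeta eta theta iota kappa lambda. Mu nu."
pieces = recursive_split(sample, max_words=4)
print(pieces)
```

Only the one sentence that exceeds the limit on its own is cut mid-sentence; the others are kept whole.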

In [91]:
vector_store.add_documents(documents=docs)

['NjJiMDVjMGUtOTRlNy00ZmE0LTlhYzctMjNjYTkzMGVmMjI1',
 'OWY2ZjJhODAtMmFiMS00ODdiLThiMDQtMWYyMGM5MjY2NGE0',
 'MTRmOTBiOTAtMGNkNC00OWMxLWJiMDctNzJiZTE4ODVlOWMz',
 'YTM0Mzc0OWItZTQwZi00YmM5LTg2ZjYtNGUwNTg5MGIxMzk1',
 'NTJlZmI0YmItM2ViZi00MjEwLWIyYTQtZjVlMGY4ZmEwN2Q3',
 'ZWRkNmM0NWUtZmI1OC00ZjU1LTg1ZGItOWNjNmJjYjFmOWI5',
 'ZjhiZjUwZDUtZThmMC00OWQzLTg5NjItMTk4NzZjMDVmMTJh',
 'NjNmZWE5OTEtZDRiOS00ZTMwLTgzMzctMWQ0ZThlOGI2ZjIy',
 'ODliNmQzYTYtNTM2ZC00MzQzLTgyNGEtYWNiYWFjOTgwMzlj',
 'ZTc3M2JjOWUtYzdiMy00ZjIyLTk5OTQtOGJmYzI2ODc3ZmEz']
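
The returned IDs appear to be base64-encoded UUID strings. That is an observation from this output rather than a documented guarantee, so the sketch below validates the decoded value; `decode_document_key` is a hypothetical helper.

```python
import base64
import uuid


def decode_document_key(key: str) -> str:
    """Decode a document key, assuming it is the URL-safe base64 encoding of a UUID string."""
    decoded = base64.urlsafe_b64decode(key).decode("utf-8")
    uuid.UUID(decoded)  # raises ValueError if the decoded text is not a UUID
    return decoded


first_id = decode_document_key("NjJiMDVjMGUtOTRlNy00ZmE0LTlhYzctMjNjYTkzMGVmMjI1")
print(first_id)
```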

In [103]:
# Perform a similarity search
docs = vector_store.similarity_search(
    query="What did Daniel do at AECOM?",
    k=3,
    search_type="similarity",
)
docs

[Document(page_content="Daniel was responsible for designing and implementing a mapping approach to compare Scottish Water's provided models with AECOM internal models. His"),
 Document(page_content="Daniel was responsible for designing and implementing a mapping approach to compare Scottish Water's provided models with AECOM internal models. His tasks included data cleaning, transformation, and the development of a VBA script to systematically benchmark the options. He conducted statistical validation to ensure the accuracy and reliability of the benchmarking process. Additionally, Daniel created a range of output values under different scenarios, adjusted these values to the appropriate date, and synthesised them into a PowerBI report for clear visualisation and interpretation of the benchmarking outputs."),
 Document(page_content="Daniel's career began in the oil and gas sector, where he quickly made a name for himself as a corporate HSSE advisor. Throughout his career, he has consi
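
In the results above, the first hit is a shortened copy of the second (it stops at "His"). One plausible cause is that the index still holds chunks from an earlier run with different chunk settings, though the output alone does not confirm this. A minimal sketch for filtering such near-duplicates on the client side; `drop_prefix_duplicates` is a hypothetical helper that works on plain strings such as `[doc.page_content for doc in docs]`.

```python
def drop_prefix_duplicates(texts: list[str]) -> list[str]:
    """Keep result order, dropping exact repeats and any text that is a strict prefix of another."""
    kept = []
    for i, text in enumerate(texts):
        is_strict_prefix = any(
            j != i and len(other) > len(text) and other.startswith(text)
            for j, other in enumerate(texts)
        )
        if not is_strict_prefix and text not in kept:
            kept.append(text)
    return kept


hits = [
    "Daniel was responsible for designing a mapping approach. His",
    "Daniel was responsible for designing a mapping approach. His tasks included data cleaning.",
    "Daniel's career began in the oil and gas sector.",
]
unique_hits = drop_prefix_duplicates(hits)
print(len(unique_hits))
```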

In [102]:
# Perform a hybrid search 
docs = vector_store.similarity_search(
    query="What did Daniel do at AECOM?",
    k=3,
    search_type="hybrid",
)
docs

[Document(page_content="Daniel was responsible for designing and implementing a mapping approach to compare Scottish Water's provided models with AECOM internal models. His"),
 Document(page_content="Daniel was responsible for designing and implementing a mapping approach to compare Scottish Water's provided models with AECOM internal models. His tasks included data cleaning, transformation, and the development of a VBA script to systematically benchmark the options. He conducted statistical validation to ensure the accuracy and reliability of the benchmarking process. Additionally, Daniel created a range of output values under different scenarios, adjusted these values to the appropriate date, and synthesised them into a PowerBI report for clear visualisation and interpretation of the benchmarking outputs."),
 Document(page_content='analysis aimed at identifying best practices to be adopted by UK water companies. Daniel conducted a statistical review of cost curves using specific')]
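
Hybrid search combines a keyword ranking with a vector ranking, and Azure AI Search documents Reciprocal Rank Fusion (RRF) as the way it merges the two. The sketch below shows the RRF idea on made-up rankings. The chunk names are hypothetical, `k=60` is the commonly used constant and an assumption here, and the service's internal scoring may differ in detail.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each item scores the sum of 1 / (k + rank) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["chunk_a", "chunk_b", "chunk_c"]   # hypothetical vector-search order
keyword_hits = ["chunk_b", "chunk_d", "chunk_a"]  # hypothetical keyword-search order
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Items that rank well in both lists rise to the top, which is consistent with the third hybrid result differing from the pure similarity search.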
