### Load and Index Data - Vector Store

We will use [Azure Cognitive Search](https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search) to load and index the data.  Azure Cognitive Search is a cloud search service with built-in AI capabilities that enrich all types of information to easily identify and explore relevant content at scale. It uses the same integrated Microsoft natural language stack that Bing and Office have used for more than a decade, and AI services across vision, language, and speech, to deliver knowledge from structured and unstructured data.

Cognitive Search now supports vector search. When done correctly, vector search is a proven technique for significantly increasing the semantic relevance of search results. It uses machine learning to embed text into a vector space, where the distance between vectors is a measure of semantic similarity, so a vector similarity search can find relevant results. [Sign up](https://aka.ms/VectorSearchSignUp) for the Private Preview of Vector Search.
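
To make "distance between vectors is a measure of semantic similarity" concrete, here is a minimal sketch using cosine similarity, one common choice of metric. The three-dimensional vectors are made-up values for illustration only; real embeddings come from an embedding model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (hypothetical values, not produced by any model)
fabric_doc = [0.9, 0.1, 0.0]
fabric_query = [0.8, 0.2, 0.1]
unrelated_doc = [0.0, 0.1, 0.9]

# The query should score higher against the related document
print(cosine_similarity(fabric_query, fabric_doc))
print(cosine_similarity(fabric_query, unrelated_doc))
```

A score close to 1 means the two vectors point in nearly the same direction; a score near 0 means they are nearly unrelated.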

Cognitive Search can index and store vectors, but it doesn't generate them out of the box. The documents that you push to your search service must contain vectors within the payload. Alternatively, you can use the Indexer to pull vectors from your data sources such as Blob Storage JSON files or CSVs. You can also use a Custom Skill to generate embeddings as part of the AI Enrichment process.
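
As a rough illustration of "documents must contain vectors within the payload", the sketch below builds one such document as a plain dictionary. The `id`, `content`, and `sourcefile` names mirror the fields printed later in this notebook; the `contentVector` field name and the four-value vector are assumptions for illustration, and the real field name depends on how the index is defined.

```python
import json


def build_document(doc_id, content, vector, source_file):
    """Assemble one search document that carries its own embedding."""
    return {
        "id": doc_id,
        "content": content,
        "contentVector": vector,  # hypothetical field name for the embedding
        "sourcefile": source_file,
    }


# Placeholder vector; in practice this comes from an embedding model
doc = build_document(
    "Fabric_Get_Started_pdf-1",
    "Microsoft Fabric is a unified platform ...",
    [0.01, -0.02, 0.03, 0.04],
    "Fabric Get Started.pdf",
)

# The document must be JSON-serializable before it is pushed to the service
print(json.dumps(doc, indent=2))
```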


See the [sample repo](https://github.com/Azure/cognitive-search-vector-pr) to get started with vector search.

#### Prerequisites:
- To run the code, install the following package from the local wheel file. Alternatively, install azure-search-documents==11.4.0a20230509004 from the Dev Feed. For instructions on connecting to the Dev Feed, see the Azure-Python-SDK Azure Search Documents [Dev Feed](https://dev.azure.com/azure-sdk/public/_artifacts/feed/azure-sdk-for-python/connect/pip).
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/).
- An Azure Cognitive Search service (any tier, any region). [Create a service](https://learn.microsoft.com/en-us/azure/search/search-create-service-portal) or find an [existing service](https://portal.azure.com/#blade/HubsExtension/BrowseResourceBlade/resourceType/Microsoft.Search%2FsearchServices) under your current subscription.

In [1]:
#%pip install ./azure_search_documents-11.4.0b4-py3-none-any.whl

In [2]:
# Install langchain
#%pip install langchain

#### Set the Environment Variables

In [3]:
import os  
import json  
import openai
from Utilities.envVars import *
from openai import OpenAI, AzureOpenAI, AsyncAzureOpenAI
from Utilities.cogSearch import createSearchIndex, indexSections

# Set the index name and the Azure OpenAI endpoint, key, and API version from Utilities.envVars
indexName = SearchIndex

azure_endpoint  = f"{OpenAiEndPoint}"
api_key = OpenAiKey
api_version = OpenAiVersion

client = AzureOpenAI(
    api_key=api_key,  
    api_version=api_version,
    #base_url=f"{os.getenv('OpenAiWestUsEp')}openai/deployments/{os.getenv('OpenAiGpt4v')}/extensions",
    azure_endpoint=azure_endpoint,
)

In [4]:
from langchain_openai import OpenAIEmbeddings
from langchain_openai import AzureOpenAIEmbeddings

#### Import Required Libraries

In [5]:
# Import required libraries
from langchain.llms.openai import AzureOpenAI, OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import (
    PDFMinerLoader,
    UnstructuredFileLoader,
)
#from Utilities.cogSearch import createSearchIndex, indexSections

#### Load the PDF, create chunks, and push them to Azure Cognitive Search

In [6]:
# Flexibility to change the call to OpenAI or Azure OpenAI
embeddingModelType = "azureopenai"

In [7]:
# Set the file name and path of the PDF to index
fileName = "Fabric Get Started.pdf"
fabricGetStartedPath = "Data/PDF/" + fileName
# Load the PDF with Document Loader available from Langchain
loader = PDFMinerLoader(fabricGetStartedPath)
rawDocs = loader.load()
# Set the source 
for doc in rawDocs:
    doc.metadata['source'] = fabricGetStartedPath

textSplitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
docs = textSplitter.split_documents(rawDocs)
# Call Helper function to create Index and Index the sections
createSearchIndex(SearchService, SearchKey, indexName)
indexSections(SearchService, SearchKey, embeddingModelType, fileName, indexName, docs)

Creating oaiworkshop search index
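
To show what `chunk_size` and `chunk_overlap` control, here is a simplified sketch of fixed-size chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break at separators such as paragraph boundaries); `split_text` is a hypothetical helper that only illustrates the two parameters.

```python
def split_text(text, chunk_size=1500, chunk_overlap=0):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# 3200 characters with the settings used above: two full chunks and a remainder
chunks = split_text("a" * 3200, chunk_size=1500, chunk_overlap=0)
print([len(c) for c in chunks])

# With overlap, the tail of one chunk is repeated at the start of the next
print(split_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With `chunk_overlap=0`, as in the cell above, no text is repeated between chunks, so a sentence that straddles a boundary is split across two chunks.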


#### Perform Vector Search

In [8]:
from Utilities.cogSearch import performCogSearch

# Pure Vector Search
query = "What is Microsoft Fabric"  

results = performCogSearch(OpenAiEndPoint, OpenAiKey, OpenAiVersion, OpenAiApiKey, SearchService, SearchKey, embeddingModelType, OpenAiEmbedding, query, indexName, 3)

for result in results:  
    print(f"Id: {result['id']}")  
    print(f"Content: {result['content']}")  
    print(f"Source File: {result['sourcefile']}\n") 

Id: Fabric_Get_Started_pdf-1
Content: Tell us about your PDF experience.

Microsoft Fabric get started
documentation

Microsoft Fabric is a unified platform that can meet your organization's data and
analytics needs. Discover the Fabric shared and platform documentation from this page.

About Microsoft Fabric

ｅ OVERVIEW

What is Fabric?

Fabric terminology

ｂ GET STARTED

Start a Fabric trial

Fabric home navigation

End-to-end tutorials

Context sensitive Help pane

Get started with Fabric items

ｐ CONCEPT

Find items in OneLake data hub

Promote and certify items

ｃ HOW-TO GUIDE

Apply sensitivity labels

Workspaces

ｐ CONCEPT

Fabric workspace

Workspace roles

ｂ GET STARTED

Create a workspace

ｃ HOW-TO GUIDE

Workspace access control

What is Microsoft Fabric?

Article • 05/23/2023

Microsoft Fabric is an all-in-one analytics solution for enterprises that covers everything

from data movement to data science, Real-Time Analytics, and business intelligence. It

offers a comprehe
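
The search above returns the closest matches for the query (assuming the final argument, `3`, is the number of results requested; the helper's signature is not shown here). The sketch below illustrates the idea of top-k retrieval over a small in-memory list. The document IDs and vectors are hypothetical toy values, and a real search service uses an index rather than scoring every document.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, indexed_docs, k=3):
    """Return the k documents whose vectors are most similar to the query."""
    return heapq.nlargest(
        k,
        indexed_docs,
        key=lambda doc: cosine_similarity(query_vector, doc["vector"]),
    )


# Hypothetical in-memory "index" with made-up vectors
toy_index = [
    {"id": "fabric-1", "vector": [0.9, 0.1, 0.0]},
    {"id": "fabric-2", "vector": [0.7, 0.3, 0.1]},
    {"id": "redis-1", "vector": [0.0, 0.2, 0.9]},
    {"id": "misc-1", "vector": [0.1, 0.9, 0.1]},
]

for hit in top_k([0.8, 0.2, 0.1], toy_index, k=3):
    print(hit["id"])
```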

In [9]:
# Vector Search with Multi-language support
query = "¿Qué es Microsoft Fabric?"

results = performCogSearch(OpenAiEndPoint, OpenAiKey, OpenAiVersion, OpenAiApiKey, SearchService, SearchKey, embeddingModelType, OpenAiEmbedding, query, indexName, 3)
  
for result in results:  
    print(f"Id: {result['id']}")  
    print(f"Content: {result['content']}")  
    print(f"Source File: {result['sourcefile']}\n") 

Id: Fabric_Get_Started_pdf-1
Content: Tell us about your PDF experience.

Microsoft Fabric get started
documentation

Microsoft Fabric is a unified platform that can meet your organization's data and
analytics needs. Discover the Fabric shared and platform documentation from this page.

About Microsoft Fabric

ｅ OVERVIEW

What is Fabric?

Fabric terminology

ｂ GET STARTED

Start a Fabric trial

Fabric home navigation

End-to-end tutorials

Context sensitive Help pane

Get started with Fabric items

ｐ CONCEPT

Find items in OneLake data hub

Promote and certify items

ｃ HOW-TO GUIDE

Apply sensitivity labels

Workspaces

ｐ CONCEPT

Fabric workspace

Workspace roles

ｂ GET STARTED

Create a workspace

ｃ HOW-TO GUIDE

Workspace access control

What is Microsoft Fabric?

Article • 05/23/2023

Microsoft Fabric is an all-in-one analytics solution for enterprises that covers everything

from data movement to data science, Real-Time Analytics, and business intelligence. It

offers a comprehe

In [10]:
import requests
from io import BytesIO
from unstructured.chunking.title import chunk_by_title

In [11]:
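def PartitionFile(fileExtension: str, fileName: str):
    """Partition a local file into elements with the unstructured.io library.

    Args:
        fileExtension: File extension including the leading dot, e.g. '.pptx'.
        fileName: Path of the file to read.

    Returns:
        elements: The partitioned document elements, or None if the extension
            is not supported or parsing failed.
        metadata: A list of header strings (subject, sender, recipients) for
            email files; empty for other file types.
    """
    # Read the local file into memory (the blob download below is disabled)
    #readBytes  = getBlob(OpenAiDocConnStr, OpenAiDocContainer, fileName)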
    with open(fileName, "rb") as file:
        readByte = file.read()
        readBytes = BytesIO(readByte)

    metadata = [] 
    elements = None
    try:        
        if fileExtension == '.csv':
            from unstructured.partition.csv import partition_csv
            elements = partition_csv(file=readBytes)               
                     
        elif fileExtension == '.doc':
            from unstructured.partition.doc import partition_doc
            elements = partition_doc(file=readBytes) 
            
        elif fileExtension == '.docx':
            from unstructured.partition.docx import partition_docx
            elements = partition_docx(file=readBytes)
            
        elif fileExtension == '.eml' or fileExtension == '.msg':
            if fileExtension == '.msg':
                from unstructured.partition.msg import partition_msg
                elements = partition_msg(file=readBytes) 
            else:        
                from unstructured.partition.email import partition_email
                elements = partition_email(file=readBytes)
            metadata.append(f'Subject: {elements[0].metadata.subject}')
            metadata.append(f'From: {elements[0].metadata.sent_from[0]}')
            sent_to_str = 'To: '
            for sent_to in elements[0].metadata.sent_to:
                sent_to_str = sent_to_str + " " + sent_to
            metadata.append(sent_to_str)
            
        elif fileExtension == '.html' or fileExtension == '.htm':  
            from unstructured.partition.html import partition_html
            elements = partition_html(file=readBytes) 
            
        elif fileExtension == '.md':
            from unstructured.partition.md import partition_md
            elements = partition_md(file=readBytes)
                       
        elif fileExtension == '.ppt':
            from unstructured.partition.ppt import partition_ppt
            elements = partition_ppt(file=readBytes)
            
        elif fileExtension == '.pptx':    
            from unstructured.partition.pptx import partition_pptx
            elements = partition_pptx(file=readBytes)
            
        elif fileExtension in ('.txt', '.json'):
            from unstructured.partition.text import partition_text
            elements = partition_text(file=readBytes)
            
        elif fileExtension == '.xlsx':
            from unstructured.partition.xlsx import partition_xlsx
            elements = partition_xlsx(file=readBytes)
            
        elif fileExtension == '.xml':
            from unstructured.partition.xml import partition_xml
            elements = partition_xml(file=readBytes)
            
    except Exception as e:
        print(f"An error occurred trying to parse the file: {str(e)}")
         
    return elements, metadata
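
The long `if`/`elif` chain above can also be expressed as a lookup table. The sketch below is one possible alternative, not part of the original notebook: it maps each extension to the module and function names already used in `PartitionFile` and imports the partitioner only when it is needed. `PARTITIONERS`, `resolve_partitioner`, and `load_partitioner` are hypothetical names.

```python
import importlib
import os

# Extension -> (module, function), taken from the imports in PartitionFile
PARTITIONERS = {
    ".csv": ("unstructured.partition.csv", "partition_csv"),
    ".doc": ("unstructured.partition.doc", "partition_doc"),
    ".docx": ("unstructured.partition.docx", "partition_docx"),
    ".msg": ("unstructured.partition.msg", "partition_msg"),
    ".eml": ("unstructured.partition.email", "partition_email"),
    ".html": ("unstructured.partition.html", "partition_html"),
    ".htm": ("unstructured.partition.html", "partition_html"),
    ".md": ("unstructured.partition.md", "partition_md"),
    ".ppt": ("unstructured.partition.ppt", "partition_ppt"),
    ".pptx": ("unstructured.partition.pptx", "partition_pptx"),
    ".txt": ("unstructured.partition.text", "partition_text"),
    ".json": ("unstructured.partition.text", "partition_text"),
    ".xlsx": ("unstructured.partition.xlsx", "partition_xlsx"),
    ".xml": ("unstructured.partition.xml", "partition_xml"),
}


def resolve_partitioner(file_name):
    """Return the (module, function) names for a file, or None if unsupported."""
    extension = os.path.splitext(file_name)[1].lower()
    return PARTITIONERS.get(extension)


def load_partitioner(file_name):
    """Import and return the partition function lazily, or None if unsupported."""
    target = resolve_partitioner(file_name)
    if target is None:
        return None
    module_name, function_name = target
    return getattr(importlib.import_module(module_name), function_name)


print(resolve_partitioner("./Data/Gru/Multi-modal RAG.pptx"))
print(resolve_partitioner("report.pdf"))
```

Matching on the exact, lower-cased extension also avoids accidental matches on partial strings. The email-specific metadata handling in `PartitionFile` would still need its own branch.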

##### Using mode "elements" to let unstructured.io chunk the file

In [74]:
try:
    from langchain.document_loaders import UnstructuredFileLoader
    from unstructured.cleaners.core import clean_extra_whitespace, group_broken_paragraphs

    fileName = "./Data/Gru/Multi-modal RAG.pptx"

    loader = UnstructuredFileLoader(fileName, mode="elements", strategy="fast", post_processors=[clean_extra_whitespace, group_broken_paragraphs])
                                    # unstructured_kwargs={"multipage_sections":True, 
                                    #                      "new_after_n_chars":1500, 
                                    #                      "combine_text_under_n_chars":500, 
                                    #                      "max_characters":2500}) 
    rawDocs = loader.load()
except Exception as e:
    print(f"An error occurred trying to parse the file: {str(e)}")

In [78]:
rawDocs

[Document(page_content='Research CoPilot', metadata={'source': './Data/Gru/Multi-modal RAG.pptx', 'category_depth': 0, 'file_directory': './Data/Gru', 'filename': 'Multi-modal RAG.pptx', 'last_modified': '2024-03-06T10:46:51', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation', 'category': 'Title'}),
 Document(page_content='Multimodal RAG with Code Execution (RAG-CE)', metadata={'source': './Data/Gru/Multi-modal RAG.pptx', 'category_depth': 1, 'file_directory': './Data/Gru', 'filename': 'Multi-modal RAG.pptx', 'last_modified': '2024-03-06T10:46:51', 'page_number': 1, 'languages': ['eng'], 'parent_id': 'd31b823e761a10ea51bd38072fe0fa6d', 'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation', 'category': 'Title'}),
 Document(page_content='Agenda', metadata={'source': './Data/Gru/Multi-modal RAG.pptx', 'category_depth': 0, 'file_directory': './Data/Gru', 'filename': 'Multi-modal 

##### Using mode "single", we can use LangChain to chunk

In [72]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import CharacterTextSplitter

fileName = "./Data/Gru/Multi-modal RAG.pptx"
loader = UnstructuredFileLoader(fileName, post_processors=[clean_extra_whitespace, group_broken_paragraphs])
text_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n\n", "\n\n"],
        chunk_size=1000,
        chunk_overlap=300,
        length_function=len,
        is_separator_regex=False,
    )
docs = loader.load_and_split(text_splitter=text_splitter)

In [73]:
docs

[Document(page_content='Research CoPilot\n\nMultimodal RAG with Code Execution (RAG-CE)\n\nAgenda\n\nProcess\n\nWhy do we need this?\n\nExamples and Findings\n\nUser Interface\n\nThe Process\n\nIngestion\n\nSearch\n\nContent Generation\n\nLimitations\n\nImprovements and Detection Process.\n\nError control\n\nLive Demo\n\nWhy do we need this?\n\nExisting Challenges\n\nTo be able to search through a knowledge base with RAG, text from documents need to be extracted, chunked and stored in a vector database\n\nThis process now is purely concerned with text:\n\nIf the documents have any images, graphs or tables, these elements are usually either ignored or extracted as messy unstructured text\n\nRetrieving unstructured table data through RAG will lead to very low accuracy answers\n\nLLMs are usually very bad with numbers. If the query requires any sort of calculations, LLMs usually hallucinate or make basic math mistakes\n\nWhy do we need this?\n\nIngest and interact with multi-modal analyti

##### Alternatively, we can define the chunks ourselves

In [76]:
fileName = "./Data/Gru/Multi-modal RAG.pptx"
elements, uioMetadata = PartitionFile(os.path.splitext(fileName)[1], fileName)

In [77]:
metaDataText = ''
for metadata_value in uioMetadata:
    metaDataText += metadata_value + '\n'    

title = ''
# Capture the file title
try:
    for element in elements:
        if element.category == 'Title':
            # capture the first title
            title = element.text
            break
except Exception:
    # if this type of element does not include a title, proceed with an empty value
    pass
chunks = chunk_by_title(elements, multipage_sections=True, new_after_n_chars=1500, combine_text_under_n_chars=500, max_characters=2500)
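
# The sketch below is a rough approximation of what title-based chunking aims
# to do. It is NOT the unstructured implementation: chunk_by_title_sketch is a
# hypothetical helper that works on plain (category, text) tuples, ignores
# new_after_n_chars and multipage_sections, and does not count separators
# toward the length limits.

```python
def chunk_by_title_sketch(elements, max_characters=2500, combine_text_under_n_chars=500):
    """Group (category, text) tuples into chunks that start at titles."""
    chunks = []
    current = []
    current_len = 0
    for category, text in elements:
        # A title opens a new chunk unless the current one is still small
        starts_section = category == "Title" and current_len >= combine_text_under_n_chars
        # Never let a chunk grow past the hard limit
        too_long = current_len + len(text) > max_characters
        if current and (starts_section or too_long):
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(text)
        current_len += len(text)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Made-up elements: the first section is long enough to stand alone
sample = [
    ("Title", "Intro"),
    ("NarrativeText", "a" * 600),
    ("Title", "Next"),
    ("NarrativeText", "b" * 100),
]
for piece in chunk_by_title_sketch(sample):
    print(len(piece), piece[:20])
```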

In [79]:
subTitleName = ''
sectionName = ''
rawDocs1 = []
from langchain.docstore.document import Document
# Complete and write chunks
for i, chunk in enumerate(chunks):      
    if chunk.metadata.page_number is None:
        page_list = [1]
    else:
        page_list = [chunk.metadata.page_number] 
    # substitute html if text is a table            
    if chunk.category == 'Table':
        chunk_text = chunk.metadata.text_as_html
    else:
        chunk_text = chunk.text
    # add filetype specific metadata as chunk text header
    chunk_text = metaDataText + chunk_text
    #print(f"Chunk {i} - Page: {page_list} - Category: {chunk.category} - Text: {chunk_text}")
    rawDocs1.append(Document(page_content=chunk_text, 
                             metadata={"id": i, "source": fileName, 
                                       "title": title, 
                                       "subtitle": subTitleName, 
                                       "section": sectionName, 
                                       "page": page_list}))
    
rawDocs1

[Document(page_content='Research CoPilot\n\nMultimodal RAG with Code Execution (RAG-CE)\n\nAgenda\n\nProcess\n\nWhy do we need this?\n\nExamples and Findings\n\nUser Interface\n\nThe Process\n\nIngestion\n\nSearch\n\nContent Generation\n\nLimitations \n\nImprovements and Detection Process. \n\nError control\n\nLive Demo\n\nWhy do we need this?\n\nExisting Challenges\n\nTo be able to search through a knowledge base with RAG, text from documents need to be extracted, chunked and stored in a vector database\n\nThis process now is purely concerned with text: \n\nIf the documents have any images, graphs or tables, these elements are usually either ignored or extracted as messy unstructured text\n\nRetrieving unstructured table data through RAG will lead to very low accuracy answers\n\nLLMs are usually very bad with numbers. If the query requires any sort of calculations, LLMs usually hallucinate or make basic math mistakes', metadata={'id': 0, 'source': './Data/Gru/Multi-modal RAG.pptx', 't

In [13]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.redis import Redis
from langchain.document_loaders import TextLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from dotenv import load_dotenv

load_dotenv()

embeddings = AzureOpenAIEmbeddings(azure_endpoint=OpenAiEndPoint, azure_deployment=OpenAiEmbedding, api_key=OpenAiKey, openai_api_type="azure")

loader = TextLoader("./Data/Compare and Contrast.txt", encoding="utf-8")
documents = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
redisUrl = "redis://default:" + os.getenv("RedisPassword") + "@" + os.getenv("RedisAddress") + ":" + os.getenv("RedisPort")
rds = Redis.from_documents(
    docs,
    embeddings,
    redis_url=redisUrl,
    index_name="deleteme",
)


results = rds.similarity_search_with_score("What is document about.")
print(results)

Created a chunk of size 1004, which is longer than the specified 1000


[(Document(page_content='What is the sales and trading revenue for equities for all year across all companies.  Display the output as Table with columns as company, year and revenue(figures in million)\nHow differently those banks are handling CCAR.  Give me the answer in bulleted format with breakdown by company and by each year\nCompare and contrast the revenue between 2021 and 2022.  Display the output as JSON object with keys as company, year and revenue\nWhat strategies each company is using to optimize cash management?. Give me the answer in bulleted format with breakdown by company\nWhat is the status of LIBOR Transitions over the years for all companies. If there\'s no information for a specific year or company, just say "No Information" for that specific year and company.  Breakdown the answer in bulleted list by company and year with minimum of 3 paragraphs to maximum of 7 paragraphs for each company and year\nWhat is the status of LIBOR Transitions?  Provide the information 
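
The Redis URL above is built by plain string concatenation, which breaks if the password contains characters such as `@` or `:`, and raises a `TypeError` if one of the environment variables is unset. The following sketch shows one way to percent-encode the credentials; `build_redis_url` is a hypothetical helper, and the host and password shown are placeholders.

```python
from urllib.parse import quote


def build_redis_url(password, host, port, username="default"):
    """Build a redis:// URL with percent-encoded credentials."""
    return f"redis://{quote(username, safe='')}:{quote(password, safe='')}@{host}:{port}"


# Placeholder values; in the notebook these come from RedisPassword,
# RedisAddress and RedisPort in the environment
print(build_redis_url("p@ss:word", "example.redis.local", "6380"))
```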

In [15]:
rds.write_schema("redis_schema.yaml")

In [18]:
indexSchema = {
    "text": [{"name": "source"}, {"name": "content"}],
    "vector": [{"name": "content_vector", "dims": 768, "algorithm": "FLAT", "distance_metric": "COSINE"}],
}

redisUrl = "redis://default:" + os.getenv("RedisPassword") + "@" + os.getenv("RedisAddress") + ":" + os.getenv("RedisPort")

existingRds = Redis.from_existing_index(
    embeddings,
    index_name="deleteme",
    redis_url=redisUrl,
    schema=indexSchema,
)
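
The `dims` value in the schema has to agree with the length of the vectors the embedding model produces; whether 768 matches the deployed model is not shown in this notebook. A small check like the one below can catch a mismatch early. `vector_field_dims` and `embedding_matches_schema` are hypothetical helpers, and the test vector is a placeholder rather than a real embedding.

```python
def vector_field_dims(schema, field_name):
    """Look up the declared dimension of a vector field in the schema."""
    for field in schema.get("vector", []):
        if field["name"] == field_name:
            return field["dims"]
    raise KeyError(f"No vector field named {field_name!r} in schema")


def embedding_matches_schema(schema, field_name, embedding):
    """Return True when the embedding length equals the declared dims."""
    return len(embedding) == vector_field_dims(schema, field_name)


# Same shape as the indexSchema defined above
example_schema = {
    "text": [{"name": "source"}, {"name": "content"}],
    "vector": [{"name": "content_vector", "dims": 768, "algorithm": "FLAT", "distance_metric": "COSINE"}],
}

# Placeholder vector of the declared length (not a real embedding)
print(embedding_matches_schema(example_schema, "content_vector", [0.0] * 768))
```

In the notebook, the vector to check could be the result of `embeddings.embed_query(...)` on a short string.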

In [19]:
results = existingRds.similarity_search("Sales revenue", k=3)


In [20]:
results

[Document(page_content='What is the sales and trading revenue for equities for all year across all companies.  Display the output as Table with columns as company, year and revenue(figures in million)\nHow differently those banks are handling CCAR.  Give me the answer in bulleted format with breakdown by company and by each year\nCompare and contrast the revenue between 2021 and 2022.  Display the output as JSON object with keys as company, year and revenue\nWhat strategies each company is using to optimize cash management?. Give me the answer in bulleted format with breakdown by company\nWhat is the status of LIBOR Transitions over the years for all companies. If there\'s no information for a specific year or company, just say "No Information" for that specific year and company.  Breakdown the answer in bulleted list by company and year with minimum of 3 paragraphs to maximum of 7 paragraphs for each company and year\nWhat is the status of LIBOR Transitions?  Provide the information w

In [2]:
import pandas as pd

df = pd.read_parquet("azureml://subscriptions/e2171f6d-2650-45e6-af7e-6d6e44ca92b1/resourcegroups/dataai/workspaces/dataaiamlwks/datastores/workspaceblobstore/paths/LocalUpload/49e795399bb8232f9e9a479a1c443f29/cleaned-credit-card.parquet")
df.head()

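| Unnamed: 0 | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 1 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 2 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 3 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 4 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |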
