### Load and Index Data - Vector Store

We will use [Azure Cognitive Search](https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search) to load and index the data.  Azure Cognitive Search is a cloud search service with built-in AI capabilities that enrich all types of information to easily identify and explore relevant content at scale. It uses the same integrated Microsoft natural language stack that Bing and Office have used for more than a decade, and AI services across vision, language, and speech, to deliver knowledge from structured and unstructured data.

Cognitive Search now supports vector search. When done correctly, vector search is a proven technique for significantly increasing the semantic relevance of search results. It uses machine learning to embed text into a vector space, where the distance between vectors measures semantic similarity, so a vector similarity search can find relevant results. [Sign up](https://aka.ms/VectorSearchSignUp) for the Private Preview of Vector Search.
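To make the idea of "distance between vectors" concrete, here is a minimal sketch that ranks documents by cosine similarity. The three-dimensional vectors and document names are made up for illustration; real embedding models return far more dimensions, and the search service may use a different similarity metric or an approximate algorithm.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" (hypothetical values, not produced by any model)
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "doc-about-fabric": [0.8, 0.2, 0.1],
    "doc-about-cooking": [0.0, 0.1, 0.9],
}

# Rank documents from most to least similar to the query
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)
```

The document whose vector points in nearly the same direction as the query vector ranks first, which is the behavior a vector search relies on.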

Cognitive Search can index and store vectors, but it doesn't generate them out of the box. The documents that you push to your search service must contain vectors within the payload. Alternatively, you can use an indexer to pull vectors from data sources such as Blob Storage JSON or CSV files. You can also use a Custom Skill to generate embeddings as part of the AI Enrichment process.
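As a rough sketch of what "vectors within the payload" means, the snippet below builds documents that carry their own embedding. The `fake_embedding` helper is a hypothetical stand-in for a real embedding call, and the `contentVector` field name is an assumption; the `id`, `content`, and `sourcefile` fields mirror the ones read from the search results later in this notebook.

```python
import hashlib
import json


def fake_embedding(text, dimensions=4):
    """Hypothetical stand-in for an embedding model: deterministic but meaningless."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [round(byte / 255, 4) for byte in digest[:dimensions]]


def build_documents(chunks, source_file):
    """Attach an id, the source file name, and a vector to each text chunk."""
    prefix = source_file.replace(" ", "_").replace(".", "_")
    return [
        {
            "id": f"{prefix}-{i}",
            "content": chunk,
            "contentVector": fake_embedding(chunk),  # assumed field name
            "sourcefile": source_file,
        }
        for i, chunk in enumerate(chunks)
    ]


documents = build_documents(
    ["What is Microsoft Fabric?", "Fabric workspace"],
    "Fabric Get Started.pdf",
)
print(json.dumps(documents[0], indent=2))
```

In a real pipeline the vector would come from an embedding model, and a list like `documents` would be pushed to the index, for example with `SearchClient.upload_documents` from `azure-search-documents`.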


See the [sample repo](https://github.com/Azure/cognitive-search-vector-pr) to get started with vector search.

#### Prerequisites:
- To run the code, install the `azure-search-documents` package from the local wheel file. Alternatively, install `azure-search-documents==11.4.0a20230509004` from the Dev Feed. For instructions on connecting to the Dev Feed, see the Azure SDK for Python Azure Search Documents [Dev Feed](https://dev.azure.com/azure-sdk/public/_artifacts/feed/azure-sdk-for-python/connect/pip).
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/).
- An Azure Cognitive Search service (any tier, any region). [Create a service](https://learn.microsoft.com/en-us/azure/search/search-create-service-portal) or find an [existing service](https://portal.azure.com/#blade/HubsExtension/BrowseResourceBlade/resourceType/Microsoft.Search%2FsearchServices) under your current subscription.

In [1]:
#%pip install ./azure_search_documents-11.4.0b4-py3-none-any.whl

In [2]:
# Install langchain
#%pip install langchain

#### Set the Environment Variables

In [3]:
import os  
import json  
import openai
from Utilities.envVars import *

# Set the index name; the Search Service endpoint and API key are imported from Utilities.envVars
indexName = "fabricbp"

# Set OpenAI API key and endpoint
openai.api_type = "azure"
openai.api_version = OpenAiVersion
openai_api_key = OpenAiKey
assert openai_api_key, "ERROR: Azure OpenAI Key is missing"
openai.api_key = openai_api_key
openAiEndPoint = f"{OpenAiEndPoint}"
openai.api_base = openAiEndPoint

#### Import Required Libraries

In [4]:
# Import required libraries
from langchain.llms.openai import AzureOpenAI, OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import (
    PDFMinerLoader,
    UnstructuredFileLoader,
)
from Utilities.cogSearch import createSearchIndex, indexSections

#### Load the PDF, create the chunks, and push to Azure Cognitive Search

In [5]:
# Flexibility to change the call to OpenAI or Azure OpenAI
embeddingModelType = "azureopenai"

In [6]:
# Set the file name and the namespace for the index
fileName = "Fabric Get Started.pdf"
fabricGetStartedPath = "Data/PDF/" + fileName
# Load the PDF with Document Loader available from Langchain
loader = PDFMinerLoader(fabricGetStartedPath)
rawDocs = loader.load()
# Set the source 
for doc in rawDocs:
    doc.metadata['source'] = fabricGetStartedPath

textSplitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=125)
docs = textSplitter.split_documents(rawDocs)
# Call Helper function to create Index and Index the sections
#createSearchIndex(SearchService, SearchKey, indexName)
#indexSections(OpenAiEndPoint, OpenAiKey, OpenAiVersion, OpenAiApiKey, SearchService, SearchKey, embeddingModelType, OpenAiEmbedding, fileName, indexName, docs)
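
To illustrate what `chunk_size=500` and `chunk_overlap=125` mean, here is a simplified fixed-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and sentences), and `sliding_window_chunks` is a hypothetical helper written only for this example.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=125):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))  # 1,200 characters of filler text
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])  # → [500, 500, 450]
```

Each chunk repeats the last 125 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.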

#### Perform Vector Search

In [None]:
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.chat_models import AzureChatOpenAI, ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.prompts import PromptTemplate

embeddingModelType = "azureopenai"
temperature = 0
tokenLength = 1000

if embeddingModelType == "azureopenai":
    openai.api_type = "azure"
    openai.api_key = OpenAiKey
    openai.api_version = OpenAiVersion
    openai.api_base = f"{OpenAiEndPoint}"

    llm = AzureChatOpenAI(
        openai_api_base=openai.api_base,
        openai_api_version=OpenAiVersion,
        deployment_name=OpenAiChat,
        temperature=temperature,
        openai_api_key=OpenAiKey,
        openai_api_type="azure",
        max_tokens=tokenLength,
    )
    embeddings = OpenAIEmbeddings(engine=OpenAiEmbedding, chunk_size=1, openai_api_key=OpenAiKey)
elif embeddingModelType == "openai":
    openai.api_type = "open_ai"
    openai.api_base = "https://api.openai.com/v1"
    openai.api_version = '2020-11-07'
    openai.api_key = OpenAiApiKey
    llm = ChatOpenAI(
        temperature=temperature,
        openai_api_key=OpenAiApiKey,
        model_name="gpt-3.5-turbo",
        max_tokens=tokenLength,
    )
    embeddings = OpenAIEmbeddings(openai_api_key=OpenAiApiKey)


In [50]:
from Utilities.cogSearch import performCogSemanticHybridSearch
from langchain.docstore.document import Document

# Semantic hybrid search; swap in one of the sample queries below to try it
#query = "What is Microsoft Fabric"
#query = "Is there a feature to run spark jobs on fabric?"
#query = "Can you build ETL Pipelines and if so what tool do you use?"
#query = "What exactly is Compute Unit in Fabric"
#query = "What is the advantage of OneLake and OneSecurity in Fabric?"
#query = "What kind of Data Governance features are available?"
query = "¿Qué es Microsoft Fabric?"

results = performCogSemanticHybridSearch(OpenAiEndPoint, OpenAiKey, OpenAiVersion, OpenAiApiKey, SearchService, SearchKey, embeddingModelType, OpenAiEmbedding, query, indexName, 3)
# for result in results:
#     print(f"Id: {result['id']}")  
#     print(f"Content: {result['content']}")  
#     print(f"Source File: {result['sourcefile']}\n")

# Wrap each search hit in a LangChain Document; fall back to a placeholder if nothing came back
if results is None:
    docs = [Document(page_content="No results found")]
else:
    docs = [
        Document(page_content=doc['content'], metadata={"id": doc['id'], "source": doc['sourcefile']})
        for doc in results
    ]

In [51]:
docs

[Document(page_content='Fabric home navigation\n\nEnd-to-end tutorials\n\nContext sensitive Help pane\n\nGet started with Fabric items\n\nｐ CONCEPT\n\nFind items in OneLake data hub\n\nPromote and certify items\n\nｃ HOW-TO GUIDE\n\nApply sensitivity labels\n\nWorkspaces\n\nｐ CONCEPT\n\nFabric workspace\n\n\x0cWorkspace roles\n\nｂ GET STARTED\n\nCreate a workspace\n\nｃ HOW-TO GUIDE\n\nWorkspace access control\n\n\x0cWhat is Microsoft Fabric?\n\nArticle • 05/23/2023\n\nMicrosoft Fabric is an all-in-one analytics solution for enterprises that covers everything', metadata={'id': 'Fabric_Get_Started_pdf-2', 'source': 'Fabric Get Started.pdf'}),
 Document(page_content="Article • 05/23/2023\n\nMicrosoft Fabric is an all-in-one analytics solution for enterprises that covers everything\n\nfrom data movement to data science, Real-Time Analytics, and business intelligence. It\n\noffers a comprehensive suite of services, including data lake, data engineering, and data\n\nintegration, all in one pl

In [52]:
semantic_answers = results.get_answers()
for answer in semantic_answers:
    if answer.highlights:
        print(f"Semantic Answer: {answer.highlights}")
    else:
        print(f"Semantic Answer: {answer.text}")
    print(f"Semantic Answer Score: {answer.score}\n")

Semantic Answer: Article • 05/23/2023

Microsoft Fabric is<em> an all-in-one analytics solution for enterprises that covers everything

from data movement to data science, Real-Time Analytics, and business intelligence.</em> It

offers a comprehensive suite of services, including data lake, data engineering, and data

integration, all in one place.

With Fabric, you don't need to piece together different services from multiple vendors..
Semantic Answer Score: 0.98876953125



In [53]:
for result in results:
    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")

In [54]:
chainType = "stuff"
template = """
            Given the following extracted parts of a long document and a question, create a final answer. 
            If you don't know the answer, just say that you don't know. Don't try to make up an answer. 
            If the answer is not contained within the text below, say \"I don't know\".

            QUESTION: {question}
            =========
            {summaries}
            =========
            """
#qaPrompt = PromptTemplate(template=template, input_variables=["summaries", "question"])
#qaChain = load_qa_with_sources_chain(llm, chain_type=chainType, prompt=qaPrompt)
qaChain = load_qa_with_sources_chain(llm, chain_type=chainType)
answer = qaChain({"input_documents": docs, "question": query}, return_only_outputs=True)
outputAnswer = answer['output_text']
print(outputAnswer)

Microsoft Fabric is an all-in-one analytics solution for enterprises that covers everything from data movement to data science, Real-Time Analytics, and business intelligence. It offers a comprehensive suite of services, including data lake, data engineering, and data integration, all in one place. With Fabric, you don't need to piece together different services from multiple vendors.
SOURCES: Fabric Get Started.pdf
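
The `"stuff"` chain type places all retrieved documents into a single prompt. The sketch below shows that idea in plain Python; it is an illustration under simplifying assumptions, not LangChain's implementation, and `build_stuff_prompt` and the shortened template are hypothetical.

```python
def build_stuff_prompt(template, docs, question):
    """Concatenate every document into one block and fill the prompt template."""
    summaries = "\n\n".join(
        f"Content: {doc['content']}\nSource: {doc['source']}" for doc in docs
    )
    return template.format(summaries=summaries, question=question)


# Shortened version of the prompt template, for illustration only
template = (
    "Given the following extracted parts of a long document and a question, "
    "create a final answer.\n"
    "QUESTION: {question}\n"
    "=========\n"
    "{summaries}\n"
    "=========\n"
)

# Two made-up retrieved chunks standing in for real search results
retrieved = [
    {"content": "Microsoft Fabric is an all-in-one analytics solution.", "source": "Fabric Get Started.pdf"},
    {"content": "It offers a comprehensive suite of services.", "source": "Fabric Get Started.pdf"},
]

prompt = build_stuff_prompt(template, retrieved, "What is Microsoft Fabric?")
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits within the model's context window, which is one reason the search above requests only the top 3 results.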


In [12]:
# # Vector Search with Multi-language support
# query = "¿Qué es Microsoft Fabric?"

# results = performCogSearch(OpenAiEndPoint, OpenAiKey, OpenAiVersion, OpenAiApiKey, SearchService, SearchKey, embeddingModelType, OpenAiEmbedding, query, indexName, 3)
  
# for result in results:  
#     print(f"Id: {result['id']}")  
#     print(f"Content: {result['content']}")  
#     print(f"Source File: {result['sourcefile']}\n") 