### Load and Index Data using Semantic Kernel - Vector Store

We will use [Azure Cognitive Search](https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search) to load and index the data.  Azure Cognitive Search is a cloud search service with built-in AI capabilities that enrich all types of information to easily identify and explore relevant content at scale. It uses the same integrated Microsoft natural language stack that Bing and Office have used for more than a decade, and AI services across vision, language, and speech, to deliver knowledge from structured and unstructured data.

Azure Cognitive Search now supports vector search. When done correctly, vector search can significantly improve the semantic relevance of search results. It uses a machine learning model to embed text into a vector space in which the distance between vectors measures semantic similarity, so a vector similarity search can retrieve relevant results. [Sign up](https://aka.ms/VectorSearchSignUp) for the Private Preview of Vector Search.
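
To make the idea of "distance between vectors" concrete, here is a minimal sketch of cosine similarity, one common similarity measure for embeddings. The three-dimensional vectors are made up for illustration; real embeddings in this notebook have 1536 dimensions, and the search service computes similarity for you.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by a real model)
query = [1.0, 0.0, 1.0]
doc_close = [0.9, 0.1, 0.8]
doc_far = [0.0, 1.0, 0.0]

# A vector pointing in a similar direction scores higher than an orthogonal one
print(cosine_similarity(query, doc_close) > cosine_similarity(query, doc_far))
```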

Cognitive Search can index and store vectors, but it doesn't generate them out of the box. The documents that you push to your search service must contain vectors within the payload. Alternatively, you can use the Indexer to pull vectors from your data sources such as Blob Storage JSON files or CSVs. You can also use a Custom Skill to generate embeddings as part of the AI Enrichment process.
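
As a rough illustration of "vectors within the payload", the sketch below builds a document dictionary that carries its own embedding. The field names `id`, `content`, and `contentVector` are hypothetical and must match whatever your index schema defines; the vector here is a placeholder, not a real embedding.

```python
import json

VECTOR_SIZE = 1536  # same dimension as vectorSize used later in this notebook

def build_document(doc_id, text, vector):
    """Build a search document that includes its embedding (hypothetical field names)."""
    if len(vector) != VECTOR_SIZE:
        raise ValueError(f"expected {VECTOR_SIZE} dimensions, got {len(vector)}")
    return {
        "id": doc_id,
        "content": text,
        "contentVector": [float(v) for v in vector],
    }

# Placeholder vector; in practice this comes from an embedding model
placeholder_vector = [0.0] * VECTOR_SIZE
document = build_document("doc-1", "What is Microsoft Fabric?", placeholder_vector)

# The payload is plain JSON, so it can be pushed to the service or stored in Blob Storage
payload = json.dumps({"value": [document]})
print(len(document["contentVector"]))
```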


[Sample repo](https://github.com/Azure/cognitive-search-vector-pr) to get started with vector search. 

#### Prerequisites
- To run the code, install the `azure-search-documents` package from the local wheel file. Alternatively, install azure-search-documents==11.4.0a20230509004 from the Dev Feed. For instructions on connecting to the dev feed, see the Azure SDK for Python Azure Search Documents [Dev Feed](https://dev.azure.com/azure-sdk/public/_artifacts/feed/azure-sdk-for-python/connect/pip).
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/).
- An Azure Cognitive Search service (any tier, any region). [Create a service](https://learn.microsoft.com/en-us/azure/search/search-create-service-portal) or find an [existing service](https://portal.azure.com/#blade/HubsExtension/BrowseResourceBlade/resourceType/Microsoft.Search%2FsearchServices) under your current subscription.

In [1]:
#%pip install ./azure_search_documents-11.4.0b4-py3-none-any.whl

In [2]:
# Install Semantic Kernel
#%pip install semantic-kernel

#### Set the Environment Variable

In [4]:
import os  
import json  
import openai
from Utilities.envVars import *

# Set the index name (the Search Service name and API key are imported from Utilities.envVars)
indexName = "skindex"

# Set OpenAI API key and endpoint
openai.api_type = "azure"
openai.api_version = OpenAiVersion
openai_api_key = OpenAiKey
assert openai_api_key, "ERROR: Azure OpenAI Key is missing"
openai.api_key = openai_api_key
openAiEndPoint = f"{OpenAiEndPoint}"
assert openAiEndPoint, "ERROR: Azure OpenAI Endpoint is missing"
openai.api_base = openAiEndPoint

#### Import Required Libraries

In [10]:
# Import required libraries
import semantic_kernel as sk
from semantic_kernel.connectors.ai.open_ai import (
    AzureChatCompletion,
    AzureTextCompletion,
    AzureTextEmbedding,
    OpenAIChatCompletion,
    OpenAITextEmbedding,
)
from semantic_kernel.connectors.memory.azure_cognitive_search import (
    AzureCognitiveSearchMemoryStore,
)

#### Load the PDF, create chunks, and push them to Azure Cognitive Search

In [11]:
# Flexibility to change the call to OpenAI or Azure OpenAI
embeddingModelType = "azureopenai"
#AZURE_COGNITIVE_SEARCH_ENDPOINT = SearchService
#AZURE_COGNITIVE_SEARCH_ADMIN_KEY = SearchKey
#AZURE_OPENAI_API_KEY = OpenAiKey
#AZURE_OPENAI_ENDPOINT = OpenAiEndPoint
#AZURE_OPENAI_DEPLOYMENT_NAME = OpenAiChat
vectorSize = 1536

In [23]:
kernel = sk.Kernel()

# Configure AI service used by the kernel
if embeddingModelType == "azureopenai":
    #deployment, api_key, endpoint = sk.azure_openai_settings_from_dot_env()
    kernel.add_chat_service("chat_completion", AzureChatCompletion(OpenAiChat, OpenAiEndPoint, OpenAiKey))
    # OpenAiEmbedding must hold the name of your embeddings deployment (for example, one serving "text-embedding-ada-002"); adjust it if appropriate
    kernel.add_text_embedding_generation_service("ada", AzureTextEmbedding(deployment_name=OpenAiEmbedding,
            endpoint=OpenAiEndPoint,
            api_key=OpenAiKey))
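    # NOTE: this text completion service is registered with the embeddings deployment name (OpenAiEmbedding);
    # point deployment_name at a text completion deployment if you plan to use this service
    kernel.add_text_completion_service(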
        "dv",
        AzureTextCompletion(
            deployment_name=OpenAiEmbedding,
            endpoint=OpenAiEndPoint,
            api_key=OpenAiKey,
        ),
    )
else:
    #api_key, org_id = sk.openai_settings_from_dot_env()
    kernel.add_chat_service("chat-gpt", OpenAIChatCompletion("gpt-3.5-turbo", OpenAiApiKey, ""))
    kernel.add_text_embedding_generation_service("ada", OpenAITextEmbedding("text-embedding-ada-002", OpenAiApiKey, ""))

In [24]:
# kernel.register_memory_store(memory_store=sk.memory.VolatileMemoryStore())
# kernel.import_skill(sk.core_skills.TextMemorySkill())
connector = AzureCognitiveSearchMemoryStore(
    vector_size=vectorSize,
    search_endpoint=f"https://{SearchService}.search.windows.net",
    admin_key=SearchKey,
)
# Register the memory store with the kernel
kernel.register_memory_store(memory_store=connector)

In [37]:
# Import the PDF loader and the text splitter (PDFMinerLoader requires the pdfminer.six package)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PDFMinerLoader

# TODO: change the code below to load and split the PDF the "Semantic Kernel way" instead of with LangChain
# Set the file name and path of the PDF to index
fileName = "Fabric Get Started.pdf"
fabricGetStartedPath = "Data/PDF/" + fileName
loader = PDFMinerLoader(fabricGetStartedPath)
rawDocs = loader.load()
textSplitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
docs = textSplitter.split_documents(rawDocs)
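
To show what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-width splitter. It is only a sketch of the two parameters: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line boundaries, which this hypothetical `split_fixed` helper does not do.

```python
def split_fixed(text, chunk_size=1500, chunk_overlap=0):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghij"
print(split_fixed(sample, chunk_size=4, chunk_overlap=0))
print(split_fixed(sample, chunk_size=4, chunk_overlap=2))
```

With `chunk_overlap=0`, as used above, the chunks partition the text without repetition; a positive overlap repeats the tail of one chunk at the head of the next, which can help when a relevant passage straddles a chunk boundary.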

In [38]:
# Iterate over the document chunks, embed each one, and save it into memory (Azure Cognitive Search)
async def populateCogSearch(kernel: sk.Kernel, docs, fileName) -> None:
    # Add each chunk to the semantic memory using save_information_async
    counter = 1
    for doc in docs:
        # Build the record ID from the file name and a counter, replacing unwanted characters with "_"
        docId = f"{fileName}-{counter}"
        for ch in (".", " ", ":", "/", ",", "&"):
            docId = docId.replace(ch, "_")
        await kernel.memory.save_information_async(
            collection=indexName,
            id=docId,
            text=doc.page_content,
        )
        counter += 1
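
The chain of `replace` calls above covers a fixed set of characters. As an alternative sketch, a single regular expression can replace every character outside an allowed set. The helper name `make_doc_key` is hypothetical, and the allowed set (letters, digits, underscore, dash, equals sign) reflects my understanding of Azure Cognitive Search document key rules; verify it against the service documentation before relying on it.

```python
import re

# Any character that is NOT a letter, digit, underscore, dash, or equals sign
_DISALLOWED = re.compile(r"[^A-Za-z0-9_\-=]")

def make_doc_key(file_name, counter):
    """Build a record ID from a file name and counter, replacing disallowed characters with '_'."""
    return _DISALLOWED.sub("_", f"{file_name}-{counter}")

print(make_doc_key("Fabric Get Started.pdf", 1))
```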

In [39]:
await populateCogSearch(kernel, docs, fileName)

hnsw_parameters is not a known attribute of class <class 'azure.search.documents.indexes._generated.models._models_py3.HnswVectorSearchAlgorithmConfiguration'> and will be ignored


#### Perform Vector Search

In [41]:
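async def searchCogDb(kernel: sk.Kernel, question, k, relevanceScore):
    # Retrieve up to k matches whose relevance is at least relevanceScore
    result = await kernel.memory.search_async(collection=indexName, query=question, limit=k, min_relevance_score=relevanceScore)
    # Only the text of the top match is returned (wrapped in a set); this raises IndexError if nothing matches
    return {result[0].text}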
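
The function above returns only the first match even though up to `k` are retrieved. The sketch below shows one way to summarize every match instead. `Hit` is a hypothetical stand-in for the objects returned by `search_async`; it assumes they expose `id`, `text`, and `relevance` attributes, which you should confirm against your installed Semantic Kernel version.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """Hypothetical stand-in for a memory search result."""
    id: str
    text: str
    relevance: float

def summarize_hits(hits, preview_chars=80):
    """Return one line per hit, highest relevance first, with a short text preview."""
    ordered = sorted(hits, key=lambda h: h.relevance, reverse=True)
    return [f"{h.relevance:.2f}  {h.id}  {h.text[:preview_chars]}" for h in ordered]

# Made-up hits for illustration only
sample_hits = [
    Hit(id="doc-2", text="Components of Microsoft Fabric", relevance=0.71),
    Hit(id="doc-1", text="What is Microsoft Fabric?", relevance=0.84),
]
for line in summarize_hits(sample_hits):
    print(line)
```

An empty result list simply yields no lines, which avoids the `IndexError` that indexing `result[0]` would raise when nothing clears the relevance threshold.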

In [42]:
# Pure Vector Search
query = "What is Microsoft Fabric"  
results = await searchCogDb(kernel, query, 3, 0.3)
print(f"Id: {results}")

Id: {"Fabric allows creators to concentrate on producing their best work, freeing them from\n\nthe need to integrate, manage, or understand the underlying infrastructure that\n\nsupports the experience.\n\nComponents of Microsoft Fabric\n\nMicrosoft Fabric offers the comprehensive set of analytics experiences designed to work\n\ntogether seamlessly. Each experience is tailored to a specific persona and a specific task.\n\nFabric includes industry-leading experiences in the following categories for an end-to-\n\nend analytical need.\n\nData Engineering - Data Engineering experience provides a world class Spark\n\nplatform with great authoring experiences, enabling data engineers to perform\n\nlarge scale data transformation and democratize data through the lakehouse.\n\nMicrosoft Fabric Spark's integration with Data Factory enables notebooks and\n\nspark jobs to be scheduled and orchestrated. For more information, see What is\n\nData engineering in Microsoft Fabric?\n\n\x0cData Factory 

In [44]:
# Vector Search with Multi-language support
query = "¿Qué es Microsoft Fabric?"
results = await searchCogDb(kernel, query, 3, 0.6)
print(f"Id: {results}")

Id: {"Tell us about your PDF experience.\n\nMicrosoft Fabric get started\ndocumentation\n\nMicrosoft Fabric is a unified platform that can meet your organization's data and\nanalytics needs. Discover the Fabric shared and platform documentation from this page.\n\nAbout Microsoft Fabric\n\nｅ OVERVIEW\n\nWhat is Fabric?\n\nFabric terminology\n\nｂ GET STARTED\n\nStart a Fabric trial\n\nFabric home navigation\n\nEnd-to-end tutorials\n\nContext sensitive Help pane\n\nGet started with Fabric items\n\nｐ CONCEPT\n\nFind items in OneLake data hub\n\nPromote and certify items\n\nｃ HOW-TO GUIDE\n\nApply sensitivity labels\n\nWorkspaces\n\nｐ CONCEPT\n\nFabric workspace\n\n\x0cWorkspace roles\n\nｂ GET STARTED\n\nCreate a workspace\n\nｃ HOW-TO GUIDE\n\nWorkspace access control\n\n\x0cWhat is Microsoft Fabric?\n\nArticle • 05/23/2023\n\nMicrosoft Fabric is an all-in-one analytics solution for enterprises that covers everything\n\nfrom data movement to data science, Real-Time Analytics, and business i