# **Challenge 2: Embed chunks and create an Azure AI Search index**

### Embeddings Overview
An embedding is a data representation that machine learning models and algorithms can use easily. It is an information-dense representation of the semantic meaning of a piece of text. Each embedding is a vector of floating-point numbers, such that the distance between two embeddings in the vector space correlates with the semantic similarity between the two original inputs. For example, if two texts are similar, their vector representations should also be similar.

Different Azure OpenAI embedding models are specifically created to be good at particular tasks:

- Similarity embeddings are good at capturing semantic similarity between two or more pieces of text.
- Text search embeddings help find which long document is relevant to a short query.
- Code search embeddings are useful for embedding code snippets and natural language search queries.

Embeddings make it easier to do machine learning on large inputs representing words by capturing semantic similarities in a vector space. Therefore, we can use embeddings to determine whether two text chunks are semantically related or similar, and to obtain a score that quantifies that similarity.

### Cosine Similarity
An earlier approach to matching similar documents counted the words they have in common. This is flawed because, as documents grow longer, the number of shared words increases even if the topics differ. Cosine similarity is therefore a better approach.
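To make the weakness of word counting concrete, here is a minimal sketch. The helper `common_word_count` and the sample sentences are made up for illustration; they are not part of the challenge code.

```python
def common_word_count(doc_a: str, doc_b: str) -> int:
    """Count the distinct words that appear in both documents (case-insensitive)."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))


# Two short texts about different topics.
short_a = "the cat sat on the mat"
short_b = "the market fell on monday"

# The same texts padded with identical filler that carries no topical meaning.
filler = " and it was a very long day for all of us in the end"
long_a = short_a + filler
long_b = short_b + filler

print(common_word_count(short_a, short_b))  # 2
print(common_word_count(long_a, long_b))    # 15
```

The topics are unchanged, yet the overlap score grows simply because the texts are longer and share more common words.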

Mathematically, cosine similarity measures the cosine of the angle between two vectors projected in a multi-dimensional space. This is beneficial because if two documents are far apart by Euclidean distance because of size, they could still have a smaller angle between them and therefore higher cosine similarity.
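The following sketch illustrates that point with plain Python lists standing in for embeddings. The three toy vectors are assumptions chosen for readability; real embeddings from the model have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same direction as short_doc, 10x the magnitude
other_doc = [2.0, 0.0, 1.0]    # similar magnitude, different direction

print(round(euclidean_distance(short_doc, long_doc), 3))   # 20.125
print(round(euclidean_distance(short_doc, other_doc), 3))  # 2.449
print(round(cosine_similarity(short_doc, long_doc), 3))    # 1.0
print(round(cosine_similarity(short_doc, other_doc), 3))   # 0.4
```

By Euclidean distance, `other_doc` looks closer to `short_doc`; by cosine similarity, `long_doc` is the better match because it points in the same direction.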

The Azure OpenAI embeddings rely on cosine similarity to compute similarity between documents and a query.

## Let's start the challenge

In [1]:
# Import required libraries  
import os  
import json  
from openai import AzureOpenAI, DefaultHttpxClient
from dotenv import load_dotenv, find_dotenv
from azure.core.credentials import AzureKeyCredential  
from azure.search.documents import SearchClient  
from azure.search.documents.indexes import SearchIndexClient  
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    SemanticConfiguration,
    PrioritizedFields,
    SemanticField,
    SemanticSettings,
    VectorSearch,
    HnswVectorSearchAlgorithmConfiguration,
)
from pathlib import Path

In [2]:
# Configure environment variables  
load_dotenv(find_dotenv('credential.env'), override=True)

# Azure AI Search
service_endpoint = os.environ['AZURE_AI_SEARCH_ENDPOINT']
key = os.environ['AZURE_AI_SEARCH_KEY']
index_name = os.environ['AZURE_AI_SEARCH_INDEX_NAME']
credential = AzureKeyCredential(key)

# Azure OpenAI
client = AzureOpenAI(
  api_key = os.environ['AZURE_OPENAI_API_KEY'],  # AZURE_OPENAI_API_KEY is read by default, so this argument can be omitted
  azure_endpoint = os.environ['AZURE_OPENAI_API_ENDPOINT'],
  api_version = os.environ['AZURE_OPENAI_API_VERSION'],
  # verify=False disables TLS certificate verification; use it only in a lab environment
  http_client = DefaultHttpxClient(verify=False)
)
embedding_model = os.environ['EMBEDDING_MODEL_NAME']

In [3]:
# Declare helper functions
def check_and_create_folder(folder_name):
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)
        print(f"The folder '{folder_name}' has been created.")
    else:
        print(f"The folder '{folder_name}' already exists.")

def print_error_message(message, prefix_message='Error: '):
    print(f"\033[1;31m{prefix_message}\033[0m{message}")

def print_warning_message(message, prefix_message='Warning: '):
    print(f"\033[1;33m{prefix_message}\033[0m{message}")
    
def print_success_message(message, prefix_message='Success: '):
    print(f"\033[1;32m{prefix_message}\033[0m{message}")

### [Step4] Generate embeddings of the chunked documents
This step reads the chunked documents, generates Azure OpenAI embeddings for them, and exports the results in a format that can be inserted into your Azure AI Search index.

In [4]:
def generate_embeddings(text):
    response = client.embeddings.create(input=text, model=embedding_model)
    return response.data[0].embedding
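The embeddings endpoint also accepts a list of strings, which can reduce the number of requests when there are many chunks. The sketch below is an optional variant, not part of the challenge: `generate_embeddings_batch` is a hypothetical helper, it takes the client and model name as parameters, and the number of inputs allowed per request depends on the model and API version, so check the limits of your deployment before relying on it.

```python
def generate_embeddings_batch(client, texts, model):
    """Embed several texts in one request and return the vectors in input order.

    `client` is assumed to be an AzureOpenAI client like the one configured above,
    and `model` the name of the embedding deployment.
    """
    response = client.embeddings.create(input=texts, model=model)
    # Each returned item carries the index of the input it belongs to;
    # sort on it so the output order matches the input order.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

A call such as `generate_embeddings_batch(client, [chunk_a, chunk_b], embedding_model)` would return one vector per chunk, where `chunk_a` and `chunk_b` are placeholder strings.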

In [None]:
print_warning_message("Generate embeddings of chunked document", ">>>[STEP4] ")

# Make sure the output folder for the embedded chunks exists
check_and_create_folder("chunked_document_vector")

# Create embeddings for the "content" field using the Azure OpenAI embedding model
for file in Path().glob("chunked_document/*.json"):
    input_data = json.loads(file.read_text())
    content = input_data['content']
    content_embeddings = generate_embeddings(content)
    input_data['contentVector'] = content_embeddings
    with open(f"chunked_document_vector/{file.name}", "w") as f:
        json.dump(input_data, f)
    print_success_message(f'Embedding chunked document {file.name}')

### [Check Point] Review the embedded chunked documents in the "chunked_document_vector" folder
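Besides opening a few files by hand, a short script can confirm that every file has a vector of the expected length. This is a minimal sketch: `check_chunk` and `find_invalid_chunks` are hypothetical helpers, and the default of 1536 dimensions is taken from the index definition in Step 5, so adjust it if your embedding model differs.

```python
import json
from pathlib import Path


def check_chunk(doc, expected_dim=1536):
    """Return a description of the problem with one chunk, or None if it looks fine."""
    vector = doc.get("contentVector")
    if not isinstance(vector, list):
        return "missing contentVector"
    if len(vector) != expected_dim:
        return f"expected {expected_dim} dimensions, found {len(vector)}"
    return None


def find_invalid_chunks(folder="chunked_document_vector", expected_dim=1536):
    """Check every JSON file in the folder and return (file name, problem) pairs."""
    problems = []
    for path in sorted(Path(folder).glob("*.json")):
        doc = json.loads(path.read_text(encoding="utf-8"))
        problem = check_chunk(doc, expected_dim)
        if problem is not None:
            problems.append((path.name, problem))
    return problems
```

Calling `find_invalid_chunks()` after Step 4 should return an empty list when every chunk was embedded correctly.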

### [Step5] Create Azure AI Search index
Create the search index schema and vector search configuration:

In [None]:
print_warning_message("Create Azure AI Search index", ">>>[STEP5] ")

# Create a search index
index_client = SearchIndexClient(endpoint=service_endpoint, credential=credential)
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SearchableField(name="category", type=SearchFieldDataType.String, filterable=True),
    SearchableField(name="sourcepage", type=SearchFieldDataType.String),
    SearchableField(name="sourcefile", type=SearchFieldDataType.String),
    SearchField(name="contentVector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True, vector_search_dimensions=1536, vector_search_configuration="my-vector-config"),
]

vector_search = VectorSearch(
    algorithm_configurations=[
        HnswVectorSearchAlgorithmConfiguration(
            name="my-vector-config",
            kind="hnsw",
        )
    ]
)

semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=PrioritizedFields(
        prioritized_content_fields=[SemanticField(field_name="content")]
    )
)

# Create the semantic settings with the configuration
semantic_settings = SemanticSettings(configurations=[semantic_config])

# Create the search index with the semantic settings
index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search,
                    semantic_settings=semantic_settings)
result = index_client.create_or_update_index(index)
print_success_message(f'Index name: "{result.name}" is created')

### [Step6] Upload the embedded chunk documents to the Azure AI Search index

Upload the text, metadata, and embeddings from each JSON file into the index.

In [None]:
print_warning_message("Upload embedded chunk documents to Azure AI Search index", ">>>[STEP6] ")

search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)
        
for file in Path().glob("chunked_document_vector/*.json"):
    input_data = json.loads(file.read_text())
    # upload_documents expects a list of documents, so wrap the single chunk in a list
    result = search_client.upload_documents(documents=[input_data])
    print_success_message(f"Uploaded embedded chunk: {file.name} to {index_name} index.") 
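With many chunks, sending one request per file becomes slow. The sketch below groups documents into batches instead. `upload_in_batches` is a hypothetical helper, and the default of 1000 documents per batch is an assumption based on the documented indexing limits of Azure AI Search; large vectors may also hit the request size limit, so a smaller batch size can be safer.

```python
def upload_in_batches(search_client, documents, batch_size=1000):
    """Upload documents in batches and return how many were indexed successfully.

    `search_client` is assumed to be a SearchClient like the one created above.
    """
    succeeded = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        results = search_client.upload_documents(documents=batch)
        # Each result reports whether its document was indexed.
        succeeded += sum(1 for r in results if r.succeeded)
    return succeeded
```

To use it, first collect the parsed JSON files into a list, then compare the returned count with the length of that list to spot failed uploads.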

### [Check Point] Navigate to Azure AI Search in the Azure portal, review your index, and try a search using "Search explorer" inside your index.