**Blog index sample**

This notebook does the following:
- creates an index with a vector field
- populates the index with chunked blog articles
- searches the index
- runs RAG with an Azure OpenAI model

This code requires a .env with:

- AZURE_OPENAI_EMBEDDING_ENDPOINT: endpoint for the embedding model, in the form of https://OPENAI_INSTANCE.openai.azure.com/openai/deployments/embedding/embeddings?api-version=2023-03-15-preview
- AZURE_OPENAI_KEY: Azure OpenAI key (check portal)
- AZURE_SEARCH_SERVICE: search service in the form of SHORT_NAME_OF_SERVICE (not full FQDN)
- AZURE_SEARCH_KEY: Azure AI Search key (check portal)
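
Since every later cell depends on these four settings, it can help to fail early when one is missing. The following is a minimal sketch, not part of the original notebook: `REQUIRED_VARS`, `missing_settings` and `search_endpoint` are hypothetical helper names, and the endpoint format simply mirrors how the notebook builds the URL from the short service name.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = (
    "AZURE_OPENAI_EMBEDDING_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_SEARCH_SERVICE",
    "AZURE_SEARCH_KEY",
)


def missing_settings(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def search_endpoint(service_name):
    """Build the full endpoint URL from the short service name."""
    return f"https://{service_name}.search.windows.net"
```

Calling `missing_settings()` right after `dotenv.load_dotenv(...)` and raising an error when the returned list is non-empty would surface a misconfigured .env before any network call is made.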



In [3]:
# this defines the index; it does not create it
def blog_index(name: str):
    from azure.search.documents.indexes.models import (
        SearchIndex,
        SearchField,
        SearchFieldDataType,
        SimpleField,
        SearchableField,
        VectorSearch,
        VectorSearchProfile,
        HnswAlgorithmConfiguration,
    )

    fields = [
        SimpleField(name="Id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="Title", type=SearchFieldDataType.String),
        SearchableField(name="Url", type=SearchFieldDataType.String),
        SearchableField(name="Content", type=SearchFieldDataType.String),
        SearchField(
            name="contentVector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,
            vector_search_profile_name="vector_config"
        )
    ]

    vector_search = VectorSearch(
        profiles=[VectorSearchProfile(name="vector_config", algorithm_configuration_name="algo_config")],
        algorithms=[HnswAlgorithmConfiguration(name="algo_config")],
    )
    return SearchIndex(name=name, fields=fields, vector_search=vector_search)


In [1]:
# use RSS feed of blog.baeke.info to get posts
def get_posts():
    import feedparser
    import re

    feed = feedparser.parse("https://blog.baeke.info/feed/")
    posts = []
    for post in feed.entries:
        title = post.title
        url = post.link
        content = re.sub("<[^<]+?>", "", post.content[0].value)
        posts.append({"title": title, "url": url, "content": content})
    return posts

# tiktoken len function
import tiktoken
tokenizer = tiktoken.get_encoding('cl100k_base')


def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

# Azure OpenAI embedding function
import openai
import dotenv
import os
dotenv.load_dotenv('../.env')

def get_embeddings(text: str):
    # read the endpoint and key that were loaded from .env

    open_ai_endpoint = os.getenv("AZURE_OPENAI_EMBEDDING_ENDPOINT")
    open_ai_key = os.getenv("AZURE_OPENAI_KEY")

    client = openai.AzureOpenAI(
        azure_endpoint=open_ai_endpoint,
        api_key=open_ai_key,
        api_version="2023-09-01-preview",
    )
    embedding = client.embeddings.create(input=[text], model="text-embedding-ada-002")
    return embedding.data[0].embedding


# chunk content
from langchain.text_splitter import RecursiveCharacterTextSplitter
from bs4 import BeautifulSoup
import requests


text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100,  # number of overlapping tokens between chunks
        length_function=tiktoken_len,
        separators=['\n\n', '\n', ' ', '']
    )

posts = get_posts()
print(f"Found {len(posts)} posts")

search_docs = []
for i, post in enumerate(posts):
    post_url = post['url']
    r = requests.get(post_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    article = soup.find("div", {"class": "entry-content"}).text
    chunks = text_splitter.split_text(article)
    print(f"Post {i} has {len(chunks)} chunks")
    
   
    # create Azure AI Search documents
    import hashlib

    for chunk in chunks:
        # derive a deterministic document key from the chunk text (MD5 hex digest)
        hash_digest = hashlib.md5(chunk.encode("utf-8")).hexdigest()

        search_docs.append({
            "Id": hash_digest,
            "Title": post['title'],
            "Url": post['url'],
            "Content": chunk,
            "contentVector": get_embeddings(chunk)
        })

        

    # print the running total of search docs after each post
    print(f"Total search docs: {len(search_docs)}")

Found 50 posts
Post 0 has 6 chunks
Total search docs: 6
Post 1 has 7 chunks
Total search docs: 13
Post 2 has 6 chunks
Total search docs: 19
Post 3 has 6 chunks
Total search docs: 25
Post 4 has 9 chunks
Total search docs: 34
Post 5 has 5 chunks
Total search docs: 39
Post 6 has 3 chunks
Total search docs: 42
Post 7 has 2 chunks
Total search docs: 44
Post 8 has 7 chunks
Total search docs: 51
Post 9 has 6 chunks
Total search docs: 57
Post 10 has 8 chunks
Total search docs: 65
Post 11 has 9 chunks
Total search docs: 74
Post 12 has 8 chunks
Total search docs: 82
Post 13 has 6 chunks
Total search docs: 88
Post 14 has 8 chunks
Total search docs: 96
Post 15 has 8 chunks
Total search docs: 104
Post 16 has 6 chunks
Total search docs: 110
Post 17 has 8 chunks
Total search docs: 118
Post 18 has 8 chunks
Total search docs: 126
Post 19 has 4 chunks
Total search docs: 130
Post 20 has 5 chunks
Total search docs: 135
Post 21 has 5 chunks
Total search docs: 140
Post 22 has 8 chunks
Total search docs: 148

In [4]:
# create the index (if needed) and upsert the documents
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.models import VectorizedQuery
import dotenv
import os
dotenv.load_dotenv('../.env')

service_endpoint = "https://" + os.getenv("AZURE_SEARCH_SERVICE") + ".search.windows.net"
index_name = 'blog'
key = os.getenv("AZURE_SEARCH_KEY")

index_client = SearchIndexClient(service_endpoint, AzureKeyCredential(key))
index = blog_index(index_name)

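# create the index; a service error here most likely means it already exists
from azure.core.exceptions import HttpResponseError

try:
    index_client.create_index(index)
except HttpResponseError:
    print("Index probably already exists")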

# create a search client to upload documents
client = SearchClient(service_endpoint, index_name, AzureKeyCredential(key))
client.upload_documents(search_docs)



Index probably already exists


[<azure.search.documents._generated.models._models_py3.IndexingResult at 0x17a7b0cd0>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x17a7b1690>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x17a7b14d0>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x17a7b29d0>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x17a7b1590>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x17a7b23d0>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x17a7b1d50>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x17a7b1910>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x17a7b1a10>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x17a7b1850>,
 <azure.search.documents._generated.models._models_py3.IndexingResult at 0x17a7b24d0>,
 <azure.search.documents._generated.models.
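
The raw list of `IndexingResult` objects above does not say whether each upload succeeded. A minimal sketch of summarizing it follows, assuming each result exposes the `key`, `succeeded` and `error_message` attributes. `summarize_indexing` is a hypothetical helper, and the demo uses stand-in objects so it runs without a search service.

```python
from types import SimpleNamespace


def summarize_indexing(results):
    """Count successes and collect (key, error_message) pairs for failures."""
    failed = [(r.key, r.error_message) for r in results if not r.succeeded]
    return {
        "total": len(results),
        "succeeded": len(results) - len(failed),
        "failed": failed,
    }


# Stand-in objects that mimic the assumed attributes of IndexingResult.
demo = [
    SimpleNamespace(key="a1", succeeded=True, error_message=None),
    SimpleNamespace(key="b2", succeeded=False, error_message="example failure"),
]
print(summarize_indexing(demo))
```

In the notebook this could be called as `summarize_indexing(client.upload_documents(search_docs))` in place of displaying the returned list.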

In [5]:
# vector search
def single_vector_search(query: str):
    search_client = SearchClient(service_endpoint, index_name, AzureKeyCredential(key))
    vector_query = VectorizedQuery(vector=get_embeddings(query), k_nearest_neighbors=5, fields="contentVector")

    results = search_client.search(
        vector_queries=[vector_query],
        select=["Content", "Title", "Url"],
    )

    for result in results:
        print(result['Title'], result["Url"], sep=": ")
    


single_vector_search("Assistant API")


Trying the OpenAI Assistants API: https://blog.baeke.info/2023/11/14/trying-the-openai-assistants-api/
Step-by-Step Guide: How to Build Your Own Chatbot with the ChatGPT API: https://blog.baeke.info/2023/03/12/step-by-step-guide-how-to-build-your-own-chatbot-with-the-chatgpt-api/
Trying the OpenAI Assistants API: https://blog.baeke.info/2023/11/14/trying-the-openai-assistants-api/
Building a chatbot in Azure that works with your data: https://blog.baeke.info/2023/07/29/building-a-chatbot-based-on-your-documents-in-azure/
Trying the OpenAI Assistants API: https://blog.baeke.info/2023/11/14/trying-the-openai-assistants-api/
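
The same post appears several times in the output above because each chunk is stored as its own search document. If only distinct posts are wanted, the hits can be collapsed by URL while keeping the ranking order. This is a minimal sketch with a hypothetical helper `unique_by_url`; it assumes each hit can be indexed by `"Url"`, as the results are in the loop above.

```python
def unique_by_url(results):
    """Keep only the first (highest-ranked) hit for each Url."""
    seen = set()
    unique = []
    for result in results:
        url = result["Url"]
        if url not in seen:
            seen.add(url)
            unique.append(result)
    return unique


# Plain dicts standing in for search hits, using URLs from the output above.
hits = [
    {"Title": "Trying the OpenAI Assistants API",
     "Url": "https://blog.baeke.info/2023/11/14/trying-the-openai-assistants-api/"},
    {"Title": "Step-by-Step Guide: How to Build Your Own Chatbot with the ChatGPT API",
     "Url": "https://blog.baeke.info/2023/03/12/step-by-step-guide-how-to-build-your-own-chatbot-with-the-chatgpt-api/"},
    {"Title": "Trying the OpenAI Assistants API",
     "Url": "https://blog.baeke.info/2023/11/14/trying-the-openai-assistants-api/"},
]
for hit in unique_by_url(hits):
    print(hit["Title"], hit["Url"], sep=": ")
```

Deduplicating after the query means fewer than `k_nearest_neighbors` posts may remain, so a larger `k_nearest_neighbors` value may be needed to end up with the desired number of distinct posts.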


In [6]:
# hybrid search: combines keyword search (search_text) with a vector query
def simple_hybrid_search(query: str):
    search_client = SearchClient(service_endpoint, index_name, AzureKeyCredential(key))
    vector_query = VectorizedQuery(vector=get_embeddings(query), k_nearest_neighbors=3, fields="contentVector")

    results = search_client.search(
        search_text=query,
        vector_queries=[vector_query],
        select=["Content", "Title", "Url"],
        top=5
    )
    
    for result in results:
        print(result['Title'], result["Url"], sep=": ")

simple_hybrid_search("How to use managed identities with AKS?")


Azure AD pod-managed identities in AKS revisited: https://blog.baeke.info/2020/12/09/azure-ad-pod-managed-identities-in-aks-revisited/
Kubernetes Workload Identity with AKS: https://blog.baeke.info/2022/01/31/kubernetes-workload-identity-with-aks/
Kubernetes Workload Identity with AKS: https://blog.baeke.info/2022/01/31/kubernetes-workload-identity-with-aks/
Authenticate to Azure Resources with Azure Managed Identities: https://blog.baeke.info/2023/01/07/authenticate-to-azure-resources-with-azure-managed-identities/
Authenticate to Azure Resources with Azure Managed Identities: https://blog.baeke.info/2023/01/07/authenticate-to-azure-resources-with-azure-managed-identities/
