# Semantic chunking with a GPT-4T/GPT-4o

This code demonstrate how to use GPT-4o to chunk long content, generating chunks with text semantically similar.

The output is the chunks of the content.

## Prerequisites

+ An Azure subscription, with [access to Azure OpenAI](https://aka.ms/oai/access).
+ An Azure OpenAI service with the service name and an API key.
+ A deployment of the text-embedding-ada-002 embedding model on the Azure OpenAI Service.

We used Python 3.12.3, [Visual Studio Code with the Python extension](https://code.visualstudio.com/docs/python/python-tutorial), and the [Jupyter extension](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter) to test this example.

### Set up a Python virtual environment in Visual Studio Code

1. Open the Command Palette (Ctrl+Shift+P).
1. Search for **Python: Create Environment**.
1. Select **Venv**.
1. Select a Python interpreter. Choose 3.10 or later.

It can take a minute to set up. If you run into problems, see [Python environments in VS Code](https://code.visualstudio.com/docs/python/environments).

### Install packages

In [None]:
! pip install openai

## Install packages and create AOAI client

In [1]:
import os
import re
from dotenv import load_dotenv
from openai import AzureOpenAI
import sys
sys.path.append('..')
from pa_utils import call_aoai, token_len, load_files

# Load environment variables from .env
load_dotenv(override=True)

# AOAI FOR ANSWER GENERATION
aoai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
aoai_apikey = os.environ["AZURE_OPENAI_API_KEY"]
aoai_model_name = os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"]
# Create AOAI client for semantic chunking
aoai_api_version = '2024-02-15-preview'
aoai_client = AzureOpenAI(
    azure_deployment=aoai_model_name,
    api_version=aoai_api_version,
    azure_endpoint=aoai_endpoint,
    api_key=aoai_apikey
)

# AOAI FOR EMBEDDING GENERATION
aoai_embedding_endpoint = os.environ["AZURE_OPENAI_EMBEDDING_ENDPOINT"]
aoai_embedding_apikey = os.environ["AZURE_OPENAI_EMBEDDING_API_KEY"]
embedding_model_name = os.environ["AZURE_OPENAI_EMBEDDING_NAME_ADA"]
# Create AOAI client for embedding generation
client_embed = AzureOpenAI(
    azure_deployment=embedding_model_name,
    api_version=aoai_api_version,
    azure_endpoint=aoai_embedding_endpoint,
    api_key=aoai_embedding_apikey
)

# CONSTANTS
MAX_CHUNK_TOKEN_SIZE = 512

## Chunk text with GPT-4o

In [25]:

def generate_chunks_with_aoai(text):

    system_prompt = f"""Analyze the document provided and divide it into distinct sections where each section contains information that can answer typical customer questions for a Telco scenario. Group related topics together to form semantically coherent chunks. Ensure that each chunk is concise enough to stay within the token limits of the model, with a maximum of {MAX_CHUNK_TOKEN_SIZE} tokens, but comprehensive enough to provide a thorough answer to potential customer inquiries. If there are chunks with a size less than 100 tokens put them together in the same chunk. 
    Additionally, label each chunk with a descriptive title based on its content to facilitate easy navigation and reference. 
    The response has to be in the same language than the document.
    Respond with a format as follows with a line per title and chunk pair generated. For instance:
    title: "Informacion sobre Datos (Móvil)", chunk: "Cliente que necesita estar constantemente conectado a Internet (p ej un agente de bolsa que trabaja en movilidad, un comercial que hace los pedidos contra el stock del almacén?) En este caso le interesa el contrato Plus Datos / Plus Datos UMTS, opcionalmente para este último podrá contratar el Módulo C  o la Tarifa plana datos."
    title: "Descripción de Internet", "chunk": "Internet es una red compuesta de páginas Web a la que se accede desde un PC (y desde determinados modelos de terminales o PDA´s ) utilizando un móvil como módem mediante un cable de conexión, puerto de infrarrojos o bluetooth o con una tarjeta PCMCIA."
    """
    user_prompt = f'Document: "{text}"'

    response = call_aoai(aoai_client, aoai_model_name, system_prompt, user_prompt, 0.5, 4096)
    print(f'RESPONSE: [{response}]')

    if response != None:
        # GPT-4-0409: Parse answer with ", " as the separator between title and chunk
        pattern = r'title: "(.*?)", chunk: "(.*?)"'
        matches = re.findall(pattern, response)
        # Extract values of title and chunk from the response
        titles = [match[0] for match in matches]
        chunks = [match[1] for match in matches]

        data = [{"title": match[0], "content": match[1]} for match in matches]

        for chunk in data:
            print(f'chunk: {chunk}')

        '''
        chunks = []
        for x in range(len(titles)): # For every title and chunk pair
            title = titles[x]
            chunk = chunks[x]
            print(f'[{x}]: title: [{title}]')
            print(f'chunk: [{chunk}]')
            print(f'\ttokens: {token_len(chunk)}\n')
            chunks.append(chunk)
        print(f'total number of chunks: {x}')
        '''
        
        return chunks
    else:
        return None
    

In [26]:
text = """
Supporting Business Continuity Planning - QuickConnect
===

# Supporting Business Continuity Planning
Welcome to QuickConnect's comprehensive guide on how you can support customers with their business continuity planning (BCP). This guide aims to arm you with essential knowledge and strategies to ensure our customers remain connected and operational during any crisis.

## Understanding Business Continuity Planning
BCP involves creating systems and procedures that enable a business to withstand and recover from disruptions. This might include natural disasters, significant technical failures, or cyber incidents. The primary goal is for the business to maintain operations or quickly rebound with minimal impact.

## The Role of QuickConnect in Business Continuity
As a key telecommunications provider, QuickConnect plays a vital role in ensuring that businesses around the clock can maintain communication. Our products and services support continuous operation, remote work capabilities, and backup solutions during emergencies.

## How to Assist Customers with BCP
Below are actionable steps to guide customers through their BCP:

### Learn About the Customerâ€™s Business
Start by gaining insight into their business operations, sector, and specific needs. Discuss their critical operations and assess what interruptions could mean for their functions.

### Identify Essential Communication Tools
Identify the most crucial communication tools for themâ€”this may include phone services, internet, email systems, and data transfer systems.

### Analyze Risks and Weak Points
Help the customer recognize weak points and potential risks such as hardware malfunctions, network outages, cyber risks, and natural calamities.

### Formulate Recovery Plans
Collaborate with the customer to create backup strategies for their communication tools. This might consist of data backups, failover solutions, remote access, and alternative communication methods.

### Execute the Plan
Assist them in setting up the BCP, ensuring they have the necessary hardware, software, and services. Provide additional training if needed.

### Regular Testing and Updates
Promote regular testing and revising of the BCP, with simulation exercises and updates reflecting technological advancements and changing business environments.

## Core QuickConnect Services for BCP
Quick
"""

print(f'total tokens: {token_len(text)}')

chunks = generate_chunks_with_aoai(text)

for i, chunk in enumerate(chunks):
    print(f"[{i + 1}]: {chunk}")


total tokens: 418
RESPONSE: [title: "Introduction to Business Continuity Planning Guide", chunk: "Welcome to QuickConnect's comprehensive guide on how you can support customers with their business continuity planning (BCP). This guide aims to arm you with essential knowledge and strategies to ensure our customers remain connected and operational during any crisis."
title: "Understanding Business Continuity Planning", chunk: "BCP involves creating systems and procedures that enable a business to withstand and recover from disruptions. This might include natural disasters, significant technical failures, or cyber incidents. The primary goal is for the business to maintain operations or quickly rebound with minimal impact."
title: "Role of QuickConnect in Business Continuity", chunk: "As a key telecommunications provider, QuickConnect plays a vital role in ensuring that businesses around the clock can maintain communication. Our products and services support continuous operation, remote w

## Chunk every txt file in the input directory and write them in the output directory

In [None]:
# Chunk markdown files and write the chunks as files in the output directory
input_dir = '../data_out/markdown_files'
output_dir = '../data_out/chunk_gpt_files'
os.makedirs(output_dir,exist_ok=True)
markdown_contents = load_files(input_dir, '.txt')

for i, markdown_content in enumerate(markdown_contents):
    print(f"[{i + 1}]: title: {markdown_content['title']}")
    print(f"\t content: [{markdown_content['content']}]")

    chunks = generate_chunks_with_aoai(markdown_content['content'])
    # Write every chunk in a file in the output directory
    for j, chunk in enumerate(chunks):
        print(f'* Chunk {j + 1}, num. tokens: {token_len(chunk)},\nchunk: [{chunk}]')
        chunk_filename = markdown_content['title'].replace(".txt", f"_{j}.txt")
        file_path = os.path.join(output_dir, chunk_filename)
        print(f"\tWritting file [{file_path}]")
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(chunk)
    print(f'\t total number of chunks: {j}')
