# Generate synthetic documents for a RAG knowledge base

## Prerequisites
- An Azure subscription, with [access to Azure OpenAI](https://aka.ms/oai/access).

We used Python 3.12.3, [Visual Studio Code with the Python extension](https://code.visualstudio.com/docs/python/python-tutorial), and the [Jupyter extension](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter) to test this example.

### Set up a Python virtual environment in Visual Studio Code

1. Open the Command Palette (Ctrl+Shift+P).
1. Search for **Python: Create Environment**.
1. Select **Venv**.
1. Select a Python interpreter. Choose 3.10 or later.

It can take a minute to set up. If you run into problems, see [Python environments in VS Code](https://code.visualstudio.com/docs/python/environments).

### Install packages

In [None]:
%pip install openai

## Import packages and create Azure OpenAI client

In [28]:
import os
import io
from openai import AzureOpenAI
import ast

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_ENDPOINT"),
    api_key=os.getenv("AZURE_API_KEY"),
    api_version="2024-02-01",
)

# Name of your Azure OpenAI model deployment (this can differ from the underlying model name)
model_deployment_name = "gpt4o"
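
`os.getenv` returns `None` when a variable is unset, which can surface later as a less obvious error. The following is a minimal sketch of an early check, assuming the same two variable names used above; `missing_settings` is a hypothetical helper, not part of the `openai` package.

```python
import os

def missing_settings(names, env=os.environ):
    """Return the names in `names` that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]

# Check the two variables the client above reads
missing = missing_settings(["AZURE_ENDPOINT", "AZURE_API_KEY"])
if missing:
    print(f"Set these environment variables before continuing: {', '.join(missing)}")
```

Running this cell before creating the client makes a missing setting visible right away.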

## Generate a list of topics for the synthetic documents

In [None]:
prompt = '''
You are an AI that generates documents for customer service agents working at QuickConnect (a telecommunications company). The documents help agents understand how to support customers with specific questions. Generate a list of 200 topics for documents that could be created to help customer service agents. Each topic should be a short description of the document's content. The topics should cover a wide range of customer queries and issues that agents might encounter. The topics should be relevant to the telecommunications industry and provide useful information for agents to assist customers effectively. The topics should be clear, concise, and informative, helping agents quickly understand the content of each document.
The output should be a Python list in the following format:
[
    "How to reset a customer's modem",
    "Troubleshooting weak Wi-Fi signals",
    "Upgrading a customer's internet plan",
    "Explaining data overage charges",
    ...
]
DO NOT INCLUDE ANY MARKDOWN FORMATTING.
'''


response = client.chat.completions.create(
    model=model_deployment_name,
    messages=[
        {"role": "system", "content": prompt}
    ],
)

# Parse the model's reply as a Python list literal
topics = ast.literal_eval(response.choices[0].message.content)
print(topics)
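
`ast.literal_eval` raises an error if the reply is wrapped in a Markdown code fence, and it turns a literal `...` into an `Ellipsis` entry. The sketch below shows one way to make the parsing step more tolerant. `parse_topic_list` is a hypothetical helper introduced for illustration, and it assumes the reply is otherwise a valid Python list literal.

```python
import ast

FENCE = "`" * 3  # Markdown code fence marker

def parse_topic_list(text):
    """Parse a model reply into a list of non-empty topic strings."""
    cleaned = text.strip()
    # Drop a surrounding Markdown code fence if the model added one anyway
    if cleaned.startswith(FENCE):
        lines = cleaned.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        cleaned = "\n".join(lines)
    parsed = ast.literal_eval(cleaned)
    if not isinstance(parsed, list):
        raise ValueError("Expected a list of topics")
    # Keep only non-empty strings; this also drops a literal ... (Ellipsis)
    return [item.strip() for item in parsed if isinstance(item, str) and item.strip()]
```

With this helper, the parsing line could become `topics = parse_topic_list(response.choices[0].message.content)`.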

## Generate documents based on list of topics

In [None]:
prompt = '''
You are an AI that generates documents for customer service agents working at QuickConnect (a telecommunications company). The documents help agents understand how to support customers with specific questions. The output should be an HTML file that contains documentation on a specific customer query. Use HTML tables, lists, etc. as appropriate and make it look pretty. The document should be easy to read and understand. DO NOT INCLUDE ANY MARKDOWN FORMATTING IN THE OUTPUT.
Make sure the document includes at least 1000 words of content.
The topic for this document is:

'''

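import re

# Create the output folder if it doesn't already exist
os.makedirs('data', exist_ok=True)

for i, topic in enumerate(topics):
    print(topic)
    prompt_topic = f"{prompt} {topic}"
    response = client.chat.completions.create(
        model=model_deployment_name,  # use the deployment defined earlier
        messages=[
            {"role": "system", "content": prompt_topic}
        ],
        max_tokens=2000
    )
    document = response.choices[0].message.content

    # Replace spaces and characters such as "/" or "?" with underscores
    # so every topic produces a valid file name
    safe_name = re.sub(r"[^\w'-]+", "_", topic).strip("_")
    filename = f"data/{safe_name}.html"
    with io.open(filename, 'w', encoding='utf-8') as file:
        file.write(document)

    print(f"Document {i+1} saved as {filename}")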
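
The prompt asks for at least 1000 words while `max_tokens` caps the reply at 2000 tokens, so a long HTML document may be cut off. A chat completion choice reports `finish_reason == "length"` when it stops at the token limit. The sketch below checks for that; `was_truncated` is a hypothetical helper, and the `SimpleNamespace` object is only a stand-in with the same shape as a real response.

```python
from types import SimpleNamespace

def was_truncated(response):
    """Return True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"

# Stand-in for a chat completion response, for illustration only
example = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
print(was_truncated(example))  # prints True
```

Inside the loop, calling `was_truncated(response)` before saving would let you flag incomplete documents or raise `max_tokens` and retry.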


## Create variants of documents to simulate duplicates

In [None]:
prompt = '''
You are an AI that generates variants of documents for customer service agents working at QuickConnect (a telecommunications company). The documents help agents understand how to support customers with specific questions. The output should be a variant of the original document's content. The variant should provide alternative ways of explaining the same information or offer additional tips and suggestions. The variant should be relevant to the telecommunications industry and provide useful information for agents to assist customers effectively. The variant should be clear, concise, and informative, helping agents quickly understand the content of each document.

DO NOT INCLUDE ANY MARKDOWN FORMATTING.
'''

variants = []
# Select the first five original documents, skipping variants saved by earlier runs
html_files = sorted(
    f for f in os.listdir('data')
    if f.endswith('.html') and not f.endswith('_2.html')
)
for filename in html_files[:5]:
    with io.open(f'data/{filename}', 'r', encoding='utf-8') as file:
        document = file.read()

    prompt_topic = f"{prompt} {document}"
    response = client.chat.completions.create(
        model=model_deployment_name,  # use the deployment defined earlier
        messages=[
            {"role": "system", "content": prompt_topic}
        ],
        max_tokens=1000
    )
    variant = response.choices[0].message.content
    variants.append(variant)
    print(variant)
    variant_filename = f"data/{os.path.splitext(filename)[0]}_2.html"
    with io.open(variant_filename, 'w', encoding='utf-8') as file:
        file.write(variant)
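
To get a rough sense of how close a variant is to its source document, you could compare the two texts. The sketch below uses `difflib.SequenceMatcher` from the standard library. The `similarity` helper and the two sample sentences are illustrative assumptions, not part of the original notebook, and a character-level ratio is only a coarse signal.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a ratio between 0 and 1 describing how much two strings overlap."""
    return SequenceMatcher(None, a, b).ratio()

# Made-up sample texts standing in for a document and its variant
original = "Restart the modem by unplugging it for 30 seconds."
variant_text = "Restart the modem by unplugging it for about 30 seconds."
print(round(similarity(original, variant_text), 2))
```

A ratio near 1 suggests a near-verbatim copy, while a much lower ratio suggests the variant was rewritten more heavily.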