# Generating a Synthetic Knowledge Dataset Using Generative AI

In this notebook, we will generate a synthetic dataset that we can use for retrieval augmented generation (RAG) purposes. Specifically, we will generate a corpus of documents representing **knowledge items** that one might find in a typical Fortune 500 company's IT helpdesk context. I will personally be leveraging my Perplexity subscription to generate the data, but you are welcome to use any LLM of your choosing.

## Notebook Setup

In [1]:
# Installing the Python modules in Kaggle
# Note: This cell only needs to be run once so long as you are persisting the files in your Kaggle notebook. Comment out when no longer needed.
# from pip_install import perform_pip_install
# perform_pip_install()

In [2]:
# Importing the necessary Python libraries
import os
import pandas as pd
from langchain.output_parsers import CommaSeparatedListOutputParser
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_openai import ChatOpenAI

In [3]:
# Loading the API key per the appropriate environment
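if 'KAGGLE_URL_BASE' in os.environ:

    # Loading the API keys from Kaggle secrets
    from load_api_keys import load_api_keys
    API_KEYS = load_api_keys()

else:

    # Assumption: outside Kaggle, the key is exposed as an environment variable of the same name
    API_KEYS = {'PERPLEXITY_API_KEY': os.environ.get('PERPLEXITY_API_KEY')}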

In [4]:
# Instantiating the chat model
chat_model = ChatOpenAI(api_key = API_KEYS['PERPLEXITY_API_KEY'],
                        base_url = 'https://api.perplexity.ai',
                        model = 'llama-3-70b-instruct')

## Prompt Engineering

In this section, we'll set up the prompts used to generate our synthetic dataset. We need two separate prompts: one to generate the list of knowledge item topics and another to generate the body of text for a knowledge item given its topic.

In [5]:
# Creating the topic generation prompt
TOPIC_GENERATION_PROMPT = '''You are the head of an IT helpdesk as part of a Fortune 500 company, and your company maintains a knowledge management system with articles used by helpdesk specialists to answer questions related to IT support. It does not matter what kind of Fortune 500 company this is. Each individual knowledge item in the knowledge management system has a title and supporting body of text. An example of a knowledge item title might be "Resetting a User's Password." Please generate a list of 100 example knowledge item titles that might be found in your knowledge management system. Only include the list of the 100 examples. Do not provide any additional commentary. Please do not say something like "Here are the 100 examples."'''

# Creating the prompt to generate the knowledge item body of text
KNOWLEDGE_ITEM_GENERATION_PROMPT = '''You are the head of an IT helpdesk as part of a Fortune 500 company, and your company maintains a knowledge management system with articles used by helpdesk specialists to answer questions related to IT support. It does not matter what kind of Fortune 500 company this is. Each individual knowledge item in the knowledge management system has a title and supporting body of text. Within triple backticks below is an example of a title of one of these knowledge items. Please write a body of text of steps that might be associated to the knowledge article. The body of text should be no longer than 1000 words. Only return the text that would be populated into the body of the knowledge article; do not return any other text.

Knowledge item title:
```
{ki_title}
```
'''
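Before spending any API calls, it can be worth confirming which placeholders a prompt expects. The sketch below assumes the template uses Python `str.format`-style braces (which, as far as I know, is what LangChain's default template format follows). `template_fields` and `DEMO_TEMPLATE` are hypothetical names used only for illustration; the demo template is a shortened stand-in for the real prompt.

```python
from string import Formatter

def template_fields(template):
    '''Returns the placeholder names found in a str.format-style template.'''
    return sorted({name for _, name, _, _ in Formatter().parse(template) if name})

# Shortened stand-in for KNOWLEDGE_ITEM_GENERATION_PROMPT (hypothetical, for illustration only)
DEMO_TEMPLATE = 'Write the body of the knowledge article titled:\n{ki_title}\n'

# Listing the variables the chain must supply
print(template_fields(DEMO_TEMPLATE))

# Previewing the filled-in prompt for one example title
print(DEMO_TEMPLATE.format(ki_title="Resetting a User's Password"))
```

One practical consequence of brace-style templates: any literal braces in a prompt (for example, a JSON snippet) need to be doubled as `{{` and `}}` so they are not mistaken for placeholders.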

## Generating the Knowledge Item Topics
Now that our prompts are defined, we're ready to use them to generate the list of knowledge item topics. We will be leaning on LangChain to help with this. Specifically, we are going to chain together the prompt template, the chat model itself, and a special parser that takes the output from the chat model and turns it into a Python list that we can iterate over.

In [6]:
# Instantiating the output parser
output_parser = CommaSeparatedListOutputParser()

# Setting up the prompt template
ki_topic_generation_prompt_template = PromptTemplate(
    template = '{ki_topic_generation_prompt}\n{format_instructions}',
    input_variables = ['ki_topic_generation_prompt'],
    partial_variables = {'format_instructions': output_parser.get_format_instructions()}
)

In [7]:
# Chaining the KI topic generation prompt template, chat model, and output parser
ki_topic_generation_chain = ki_topic_generation_prompt_template | chat_model | output_parser
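To get a feel for the last step of this chain, here is a rough stand-in for what a comma-separated list parser does with the model's raw text. This is not LangChain's implementation; `parse_comma_separated` is a hypothetical helper built on the standard `csv` module, and the real parser may treat quoting or newlines differently.

```python
import csv

def parse_comma_separated(text):
    '''Splits model output on commas into a flat list of trimmed, non-empty items.'''
    # Each line is parsed as one CSV row, so multi-line output is also handled
    reader = csv.reader(text.strip().splitlines(), skipinitialspace=True)
    return [item.strip() for row in reader for item in row if item.strip()]

# A plain comma-separated reply becomes a list of titles
raw_output = "Resetting a User's Password, Setting Up VPN Access, Mapping a Network Drive"
print(parse_comma_separated(raw_output))

# A title that itself contains a comma only survives intact if the model quotes it
print(parse_comma_separated('"Installing, Updating Software", Mapping a Network Drive'))
```

The second call hints at a limitation worth keeping in mind: an unquoted title containing a comma would be split into two list items.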

(Note: The following cell has been intentionally commented out for officially saving a new version of the notebook to Kaggle. To use, simply uncomment!)

In [8]:
# # Generating the KI topics list
# ki_topics_list = ki_topic_generation_chain.invoke({
#     'ki_topic_generation_prompt': TOPIC_GENERATION_PROMPT
# })

# # Trimming the list to at most 100 topics (this keeps the last 100 entries)
# ki_topics_list = ki_topics_list[-100:]

# # Creating a Pandas DataFrame around the topics list
# df_knowledge_items = pd.DataFrame(data = {'ki_topic': ki_topics_list})

# # Prepping to add the KI text
# df_knowledge_items['ki_text'] = ''

# # Saving the knowledge item topics
# df_knowledge_items.to_csv('synthetic_ki_topics.csv', index = False)

## Generating the Knowledge Item Text
Now that we have generated our list of topics, we are ready to produce text for each knowledge item topic. I'm honestly a bit worried about something erroring out as we get into this, so to avoid redoing work we've already done, the generated text is written back to `df_knowledge_items` and the function below only calls the model for rows whose `ki_text` is still empty. If we have to re-run anything, the rows already completed are simply skipped.

(Note: The following cell reloads the topics saved above, so it can run even while the generation cells are commented out.)

In [16]:
# Loading the knowledge item topics back in from the checkpoint file
df_knowledge_items = pd.read_csv('synthetic_ki_topics.csv')
df_knowledge_items['ki_text'] = ''

In [18]:
# Creating the prompt engineering template to generate the knowledge item text
ki_text_generation_prompt = ChatPromptTemplate.from_messages(messages = [
    HumanMessagePromptTemplate.from_template(template = KNOWLEDGE_ITEM_GENERATION_PROMPT)
])

# Creating the inference chain to generate the knowledge item text
ki_text_chain = ki_text_generation_prompt | chat_model

In [19]:
def generate_ki_text(row):
    '''
    Generates simulated knowledge item text per a given knowledge item topic
    
    Inputs:
        - row (Pandas DataFrame record): A single record from the Pandas DataFrame
        
    Returns:
        - ki_text (str): The knowledge item text generated by the AI model per the record
    '''
    
    # Checking to see if the knowledge item text has already been generated
    if row['ki_text'] == '':
        
        # Generating the knowledge item text
        ki_text = ki_text_chain.invoke({'ki_title': row['ki_topic']}).content
        
        return ki_text
    
    else:
        
        # Returning what is already in place if the string is not empty
        return row['ki_text']
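Since transient API errors are the main worry here, one option is to wrap the model call in a small retry helper. The sketch below is an assumption on my part rather than something the notebook already does: `invoke_with_retries` is a hypothetical name, and the attempt count and delays are arbitrary defaults.

```python
import time

def invoke_with_retries(call, max_attempts=3, base_delay=1.0):
    '''Runs `call()` and retries with exponential backoff if it raises.'''
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as error:
            last_error = error
            # Waiting longer after each failure, but not after the final attempt
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))
    raise last_error

# Demo with a stand-in that fails twice before succeeding
attempts = {'count': 0}

def flaky_call():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise RuntimeError('simulated transient failure')
    return 'generated text'

print(invoke_with_retries(flaky_call, base_delay=0))
```

Inside `generate_ki_text`, the same helper could wrap the chain call, for example `invoke_with_retries(lambda: ki_text_chain.invoke({'ki_title': row['ki_topic']}).content)`.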

(Note: The following cell has been intentionally commented out for officially saving a new version of the notebook to Kaggle. To use, simply uncomment!)

In [20]:
# # Generating the knowledge item text for any topic that hasn't already been accounted for
# df_knowledge_items['ki_text'] = df_knowledge_items.apply(generate_ki_text, axis = 1)

# # Saving out the final dataset
# df_knowledge_items.to_csv('synthetic_knowledge_items.csv', index = False)