This software project accompanies the research paper LUCID: LLM-Generated Utterances for Complex and Interesting Dialogues.
LUCID is a highly automated, LLM-driven data generation system for task-oriented dialogues. LUCID aims to produce realistic, diverse and challenging conversations, with highly accurate labels. LUCID takes a modularised approach to data generation, compartmentalising the data generation task into manageable steps that an LLM can consistently perform accurately. For more details, please see our paper.
This repo contains the code for the data generation system (which can be used to generate more data), the data we have already generated for our paper (LUCIDv1.0), and the code for our baseline models.
To create new intents from a description:
- Open lucid_generate_data/run_scripts/create_intents_from_description.py
- In this file, update INTENTS, a dictionary containing the domains and the desired intent descriptions within each domain
- Once finished, run the script from the root directory (**python lucid_generate_data/run_scripts/create_intents_from_description.py**)
- The new intents will be generated in lucid_generate_data/intents_for_data_generation
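
As a rough illustration of the INTENTS step above, here is a minimal sketch of what such a dictionary might look like. The domain names, the descriptions, and the assumption that each domain maps to a list of description strings are all hypothetical; check the existing INTENTS value in create_intents_from_description.py for the exact layout the script expects. The count_intents helper is also only illustrative and is not part of the repo.

```python
# Hypothetical example: the domains and descriptions below are illustrative
# placeholders, not the intents used in the paper. The list-of-strings layout
# is an assumption; follow the layout already present in the script.
INTENTS = {
    "calendar": [
        "Create a calendar event with a title, date, time and attendees",
        "Cancel an existing calendar event",
    ],
    "messaging": [
        "Send a text message to a contact",
    ],
}


def count_intents(intents):
    """Return the total number of intent descriptions across all domains."""
    return sum(len(descriptions) for descriptions in intents.values())


print(count_intents(INTENTS))
```

Counting the descriptions before running the script is a quick way to see how many intents you are asking the LLM to generate.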
To generate conversations:
- Open lucid_generate_data/run_scripts/run_conversations.py
- Inside this file, decide how many conversations to generate per intent (CONVS_PER_INTENT) and the maximum number of intents per conversation (MAX_INTENTS_IN_CONVERSATION)
- You also need to specify the conversational phenomena that you would like in the conversations (UNHAPPY_PATHS). Note that for the data generated for the paper, these were randomly sampled for each conversation (with either 0, 1 or 2 unhappy paths per conversation).
- Your saved conversations will be stored in lucid_generate_data/saved_conversations
- To assemble your generated conversations into your final dataset, run lucid_generate_data/compile_data.py
- Your final dataset will be called LUCID_data.json
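
The random sampling of unhappy paths mentioned above could be reproduced along the following lines. This is only a sketch: the phenomenon names are placeholders, the sample_unhappy_paths helper is hypothetical, and choosing the count uniformly from 0, 1 or 2 is an assumption, since the exact sampling distribution is not stated here. Use the labels that run_conversations.py actually accepts for UNHAPPY_PATHS.

```python
import random

# Placeholder names only; replace with the unhappy-path labels that
# run_conversations.py accepts.
AVAILABLE_UNHAPPY_PATHS = [
    "placeholder_path_a",
    "placeholder_path_b",
    "placeholder_path_c",
]


def sample_unhappy_paths(available, rng, max_paths=2):
    """Pick between 0 and max_paths distinct unhappy paths at random.

    The number of paths is drawn uniformly (an assumption), then that many
    distinct entries are sampled without replacement.
    """
    count = rng.randint(0, min(max_paths, len(available)))
    return rng.sample(available, count)


# A seeded generator keeps the sampled configurations reproducible.
rng = random.Random(0)
for _ in range(3):
    print(sample_unhappy_paths(AVAILABLE_UNHAPPY_PATHS, rng))
```

Each printed list is one candidate value for UNHAPPY_PATHS; an empty list corresponds to a conversation with no unhappy paths.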
To run the LUCID baselines, please use: python running_baseline/run_llm.py