# <font color=red>LangChain: Example of Generating Causal Reasoning Pre-training Data</font>
- https://docs.langchain.com/docs

<span style="font-family:'Comic Sans MS', cursive, sans-serif;"><font color=orange>
## Demo 1 - Tiny example to generate data for a {subject}, e.g. biology
</font></span>
This program demonstrates one way to generate causal reasoning examples that might be used to pre-train a model such as AuroraGPT.<br>
It uses GPT-4 "as-is": it does not consult "trusted documents" to verify that the generated content is correct.<br>
It also does not try to reduce hallucination with prompt directives such as "don't make things up".<br>
<font color=lightgreen>Important note:</font><br>
Writing the examples to a JSONL file is disabled by default. The final block is guarded by `if False:`, which you need to change to `if True:` if you want to create the file.
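
To make the JSONL output mentioned above concrete, here is a minimal sketch of how one generated "Prompt: / Response:" text could be turned into a single chat-style record. The helper name `to_chat_record` and the sample text are hypothetical and only for illustration; the record layout mirrors the one built in the code cell below.

```python
import json

def to_chat_record(example, system_message):
    """Split 'Prompt: ... Response: ...' text into a chat-style record.

    Returns None when the text contains no 'Response:' marker.
    """
    prompt_part, marker, response_part = example.partition("Response:")
    if not marker:
        return None
    user_msg = prompt_part.strip()
    # Drop only the leading label, not later occurrences in the text.
    if user_msg.startswith("Prompt:"):
        user_msg = user_msg[len("Prompt:"):].strip()
    return {
        "messages": [
            {"role": "system", "content": system_message.strip()},
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": response_part.strip()},
        ]
    }

# Made-up sample text for illustration (not actual model output).
sample = (
    "Prompt: Why do plants wilt without water?\n"
    "Response: Less water lowers turgor pressure, so cells lose rigidity."
)
record = to_chat_record(sample, "Explain why the causal relationship exists.")
line = json.dumps(record)  # one line of the JSONL file
print(line)
```

Returning `None` for text without a "Response:" marker lets the caller skip malformed model output instead of writing a mis-split record.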

In [None]:
import json
import random

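# Written against the legacy LangChain API (LLMChain and langchain.chat_models.ChatOpenAI).
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI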

NUM_EXAMPLES_TO_GEN = 11

# MODEL = "gpt-3.5-turbo"
MODEL = "gpt-4"

TEMPLATE = """
You are generating data which will be used to fine-tune a Large Language Model.
The model will be used to work with advanced high-school students studying {subject}.
You will be generating single prompt/response pairs which present examples of
causal reasoning that advanced high-school students in {subject} can understand.
There may also be some previously generated examples provided in the prompt to
help you to ensure uniqueness and diversity.
Please keep each response to a "reasonable" length, i.e. no more than 100 words.
The prompt/response pair that you create should be in this format:
Prompt: <prompt goes here>
Response: <response goes here>
\n
Question:
{question}
"""

QUESTION = """
Please generate a single prompt/response pair which presents an example of
causal reasoning that advanced high-school students in {subject} can understand.
Be sure to avoid duplication with prior questions listed.
Be sure to develop the response using a step-by-step process.
"""

subject = "biology"   # could be a cmd-line arg

llm = ChatOpenAI(model_name=MODEL,temperature=0.5) # ,max_tokens=100)
prompt_template = PromptTemplate.from_template(TEMPLATE)
answer_chain = LLMChain(llm=llm, prompt=prompt_template)

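def generate_one_example(prev_examples):
    """Ask the model for one new prompt/response pair, showing it up to 8 prior examples."""
    # Cap the number of prior examples to keep the prompt short.
    if len(prev_examples) > 8:
        prev_examples = random.sample(prev_examples, 8)
    prevs = ""
    for example in prev_examples:
        prevs += "Previous example:\n" + example + "\n"
    # QUESTION has its own {subject} placeholder; fill it here, because the chain
    # only substitutes placeholders in TEMPLATE, not inside the values passed to it.
    question = prevs + QUESTION.format(subject=subject)
    answer = answer_chain.run(subject=subject, question=question)
    return answer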

examples = []
for i in range(NUM_EXAMPLES_TO_GEN):
    print(f'Generating example {i}')
    example = generate_one_example( examples )
    examples.append(example)

for (i,example) in enumerate(examples):
    print(f"\nCausal Reasoning Example {i}\n----------------------------")
    print(f"{example}\n")

# Use the configured subject rather than hard-coding "biology".
system_message = f"""
Given a prompt which claims that a particular causal relationship exists in {subject}, generate a response which explains why that causal relationship does indeed exist.
"""

#### save the examples into a JSONL file
## print("EXITING") ; exit(0)  # un-comment this if you want to skip creating the JSONL file
if False:   # change to True if you want to create the JSONL file

    filename = f"{subject}_training_data.jsonl"
    with open(filename, 'w') as jsonl_file:
        for example in examples:
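            idx = example.find("Response:")
            if idx == -1:
                # No "Response:" marker in the model output; skip it rather than mis-split.
                continue
            sys_msg  = system_message.strip()
            user_msg = example[:idx].strip()
            asst_msg = example[idx:].strip()
            # Remove only the leading labels (count=1), not later occurrences in the text.
            user_msg = user_msg.replace("Prompt:", "", 1).strip()
            asst_msg = asst_msg.replace("Response:", "", 1).strip()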
            training_example = {
                "messages": [
                    {"role": "system",    "content": sys_msg},
                    {"role": "user",      "content": user_msg},
                    {"role": "assistant", "content": asst_msg}
                ]
            }
            print(json.dumps(training_example), file=jsonl_file)
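
Because the model output is not verified, it can be worth sanity-checking the JSONL file after it is written. The following is a minimal sketch, assuming the system/user/assistant record layout used above; the function name `check_jsonl` and the specific checks are hypothetical choices for illustration, not part of the original program.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]

def check_jsonl(path):
    """Return a list of (line_number, problem) tuples; empty if every line passes."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append((lineno, f"invalid JSON: {err.msg}"))
                continue
            messages = record.get("messages") if isinstance(record, dict) else None
            if not isinstance(messages, list):
                problems.append((lineno, "missing 'messages' list"))
                continue
            # Non-dict entries yield None, so they fail the role comparison below.
            roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
            if roles != EXPECTED_ROLES:
                problems.append((lineno, f"unexpected roles: {roles}"))
            elif any(not str(m.get("content", "")).strip() for m in messages):
                problems.append((lineno, "empty content"))
    return problems
```

For example, `check_jsonl("biology_training_data.jsonl")` would return an empty list if every line parses and carries non-empty system, user, and assistant messages.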