<a href="https://colab.research.google.com/github/blueai2022/coding_challenge/blob/main/Stepped_CoT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Describe your model -> generate data
By Ben Lu (http://steama.ai)

The goal of this notebook is to complete the first step of the feedback-loop approach: generating a small seed dataset from the rule descriptions, which is later used to build a task-specific model.

To create your model, go to the first code cell of the data generation step and describe the model you want to build in the prompt. Be descriptive and clear.

Once we have analysed the generated dataset, we will "multiply" it into a much larger dataset using Python code. This saves GPT-4 usage and credits.
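
The "multiply" step itself is not shown in this notebook. As a minimal sketch of one way it could work, the code below assumes the BMI bands from the rule defined later in this notebook and a hypothetical response template modelled on the generated examples. The names `BANDS`, `rate_bmi`, and `multiply_examples` are illustrative, not part of the original code.

```python
import random

# Bands copied from the BMI ratings rule defined later in this notebook:
# (rating, inclusive lower bound, inclusive upper bound).
BANDS = [
    ("Refer to Doctor", 0.0, 17.49),
    ("Preferred Plus", 17.5, 29.99),
    ("Preferred", 30.0, 31.49),
    ("Standard", 31.5, 36.49),
    ("Table 1", 36.5, 40.99),
    ("Table 2", 41.0, 41.99),
    ("Table 3", 42.0, 42.99),
    ("Table 4", 43.0, 43.99),
    ("Table 5", 44.0, 44.99),
    ("Table 6", 45.0, 45.99),
    ("Table 7", 46.0, 46.99),
    ("Table 8", 47.0, 47.99),
    ("Decline", 48.0, float("inf")),
]


def rate_bmi(bmi):
    """Return (rating, range text) for the band containing bmi."""
    value = round(bmi, 2)  # the rule is written to two decimal places
    for rating, low, high in BANDS:
        if low <= value <= high:
            upper = "up" if high == float("inf") else f"{high:g}"
            return rating, f"{low:g}-{upper}"
    raise ValueError(f"No rating covers BMI {bmi}")


def multiply_examples(n, seed=0):
    """Build n synthetic prompt/response pairs (hypothetical template)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        bmi = round(rng.uniform(10, 55), 1)
        rating, span = rate_bmi(bmi)
        rows.append({
            "prompt": f"A person has a BMI of {bmi}. What is their risk rating?",
            "response": (
                f"The person's BMI is {bmi}. According to the BMI ratings rule, "
                f"a BMI of {span} falls under the '{rating}' category. "
                f"Therefore, the person's risk rating is '{rating}'."
            ),
        })
    return rows
```

Because the ratings are computed rather than generated, every synthetic row agrees with the rule by construction; the GPT-4 examples are then only needed to settle the wording of the template.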







# Mount Google Drive as file storage for the code below (only needs to run once)



This will take you to a Google authorization page, where you'll be asked to select your Google account and grant Colab permission to access your Google Drive.

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).



# Data generation step

Write your prompt here. Make it as descriptive as possible!

Then, choose the temperature (between 0 and 1) to use when generating data. Lower values are great for precise tasks, like writing code, whereas larger values are better for creative tasks, like writing stories.

Finally, choose how many examples you want to generate. The more you generate, a) the longer it takes and b) the more expensive data generation will be. But generally, more examples will lead to a higher-quality model. 100 is usually the minimum to start.

Run this to generate the dataset.

In [None]:
rule_name = "BMI"
measure_name = "BMI"
rule = "BMI ratings rule:\nRefer to Doctor: 0-17.49\nPreferred Plus: 17.5-29.99\nPreferred: 30-31.49\nStandard: 31.5-36.49\nTable 1: 36.5-40.99\nTable 2: 41-41.99\nTable 3: 42-42.99\nTable 4: 43-43.99\nTable 5: 44-44.99\nTable 6: 45-45.99\nTable 7: 46-46.99\nTable 8: 47-47.99\nDecline: 48-up"

# Define what you want the trained model to do, so that GPT-4 can generate data for you
prompt = "A model that identifies which rating's range definition a given " + measure_name + " value falls within to decide its risk rating, and responds with a well-reasoned, step-by-step, thought-out decision process. In other words: follow the order of facts, reasoning steps, and then decision.\nPlease quote the complete rule verbatim as part of your reasoning process.\nBelow are rules for " + rule_name + " ratings:\n\n" + rule
temperature = .1
number_of_examples = 25
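
The rule string above follows a regular `Label: low-high` layout, so it can also be read back into numeric ranges, for example to check generated answers later. The following is a minimal sketch under that assumption; `parse_rule` is a hypothetical helper, and the rule text is repeated here only to keep the example self-contained.

```python
import math

RULE = (
    "BMI ratings rule:\nRefer to Doctor: 0-17.49\nPreferred Plus: 17.5-29.99\n"
    "Preferred: 30-31.49\nStandard: 31.5-36.49\nTable 1: 36.5-40.99\n"
    "Table 2: 41-41.99\nTable 3: 42-42.99\nTable 4: 43-43.99\n"
    "Table 5: 44-44.99\nTable 6: 45-45.99\nTable 7: 46-46.99\n"
    "Table 8: 47-47.99\nDecline: 48-up"
)


def parse_rule(rule_text):
    """Parse 'Label: low-high' lines into (label, low, high) tuples."""
    bands = []
    for line in rule_text.splitlines()[1:]:  # skip the title line
        if not line.strip():
            continue
        label, _, span = line.rpartition(": ")
        low, _, high = span.partition("-")
        # An upper bound of "up" means the band is open-ended.
        upper = math.inf if high == "up" else float(high)
        bands.append((label, float(low), upper))
    return bands


bands = parse_rule(RULE)
print(bands[0], bands[-1])
```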

In [None]:
!pip install openai==0.28.1

Collecting openai==0.28.1
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
Installing collected packages: openai
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires tiktoken, which is not installed.
Successfully installed openai-0.28.1


In [None]:
import os
import openai
import random
import time

# Read the API key from the environment instead of hardcoding it in the notebook.
openai.api_key = os.environ["OPENAI_API_KEY"]

def generate_example(prompt, prev_examples, temperature=.5):
    messages=[
        {
            "role": "system",
            "content": f"You are generating data which will be used to train a machine learning model.\n\nYou will be given a high-level description of the model we want to train, and from that, you will generate data samples, each with a prompt/response pair.\n\nYou will do so in this format:\n```\nprompt\n-----------\n$prompt_goes_here\n-----------\n\nresponse\n-----------\n$response_goes_here\n-----------\n```\n\nOnly one prompt/response pair should be generated per turn.\n\nFor each turn, make the example slightly more complex than the last, while ensuring diversity.\n\nMake sure your samples are unique and diverse, yet high-quality and complex enough to train a well-performing model.\n\nHere is the type of model we want to train:\n`{prompt}`"
        }
    ]

    if len(prev_examples) > 0:
        if len(prev_examples) > 10:
            prev_examples = random.sample(prev_examples, 10)
        for example in prev_examples:
            messages.append({
                "role": "assistant",
                "content": example
            })

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages,
        temperature=temperature,
        max_tokens=1354,
    )

    return response.choices[0].message['content']

# Generate examples
prev_examples = []
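for i in range(number_of_examples):
    print(f'Generating example {i}')
    while True:
        try:
            example = generate_example(prompt, prev_examples, temperature)
            break
        except Exception as e:
            # Report the failure and wait before retrying instead of failing silently.
            print(f'Request failed ({e}); retrying in 3 seconds')
            time.sleep(3)

    prev_examples.append(example)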

print(prev_examples)

Generating example 0
Generating example 1
Generating example 2
Generating example 3
Generating example 4
Generating example 5
Generating example 6
Generating example 7
Generating example 8
Generating example 9
Generating example 10
Generating example 11
Generating example 12
Generating example 13
Generating example 14
Generating example 15
Generating example 16
Generating example 17
Generating example 18
Generating example 19
Generating example 20
Generating example 21
Generating example 22
Generating example 23
Generating example 24
["prompt\n-----------\nA person has a BMI of 32. What is their risk rating?\n-----------\n\nresponse\n-----------\nThe person's BMI is 32. According to the BMI ratings rule, a BMI of 31.5-36.49 falls under the 'Standard' category. Therefore, the person's risk rating is 'Standard'.", "prompt\n-----------\nA person has a BMI of 46.5. What is their risk rating?\n-----------\n\nresponse\n-----------\nThe person's BMI is 46.5. According to the BMI ratings rule,
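
The generation loop above retries forever with a fixed 3-second pause. A bounded retry with exponential backoff is a common alternative; the sketch below is one possible shape, not the notebook's own method. `call_with_retries` and its parameters are hypothetical names, and the `sleep` argument exists only so the wait can be swapped out.

```python
import time


def call_with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as err:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({err}); retrying in {delay:.0f}s")
            sleep(delay)
```

In the loop above it could be used as `example = call_with_retries(lambda: generate_example(prompt, prev_examples, temperature))`, so a persistent failure such as an invalid API key stops the run instead of looping indefinitely.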

We also need to generate a system message.

In [None]:
def generate_system_message(prompt):

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
          {
            "role": "system",
            "content": "You will be given a high-level description of the model we are training, and from that, you will generate a simple system prompt for that model to use. Remember, you are not generating the system message for data generation -- you are generating the system message to use for inference. A good format to follow is `Given $INPUT_DATA, you will $WHAT_THE_MODEL_SHOULD_DO.`.\n\nMake it as concise as possible. Include nothing but the system prompt in your response.\n\nFor example, never write: `\"$SYSTEM_PROMPT_HERE\"`.\n\nIt should be like: `$SYSTEM_PROMPT_HERE`."
          },
          {
              "role": "user",
              "content": prompt.strip(),
          }
        ],
        temperature=temperature,
        max_tokens=500,
    )

    return response.choices[0].message['content']

system_message = generate_system_message(prompt)

print(f'The system message is: `{system_message}`. Feel free to re-run this cell if you want a better result.')

The system message is: `Given a BMI value, you will identify the corresponding risk rating according to the provided BMI ratings rule, and explain your decision process step-by-step, quoting the relevant rule verbatim.`. Feel free to re-run this cell if you want a better result.


Now let's put our examples into a dataframe and turn them into a final pair of datasets.

In [None]:
import pandas as pd

# Initialize lists to store prompts and responses
prompts = []
responses = []

# Parse out prompts and responses from examples
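for example in prev_examples:
    split_example = example.split('-----------')
    # Skip examples that lack the expected delimiters, so that the two lists
    # always stay the same length.
    if len(split_example) < 4:
        continue
    prompts.append(split_example[1].strip())
    responses.append(split_example[3].strip())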

# Create a DataFrame
df = pd.DataFrame({
    'prompt': prompts,
    'response': responses
})

# Remove duplicates
df = df.drop_duplicates()

print('There are ' + str(len(df)) + ' successfully-generated examples. Here are the first few:')

# Display the resulting DataFrame
df

There are 14 successfully-generated examples. Here are the first few:


| | prompt | response |
|---|---|---|
| 0 | A person has a BMI of 32. What is their risk r... | The person's BMI is 32. According to the BMI r... |
| 1 | A person has a BMI of 46.5. What is their risk... | The person's BMI is 46.5. According to the BMI... |
| 2 | A person has a BMI of 17. What is their risk r... | The person's BMI is 17. According to the BMI r... |
| 3 | A person has a BMI of 44.5. What is their risk... | The person's BMI is 44.5. According to the BMI... |
| 4 | A person has a BMI of 30.5. What is their risk... | The person's BMI is 30.5. According to the BMI... |
| 5 | A person has a BMI of 48.5. What is their risk... | The person's BMI is 48.5. According to the BMI... |
| 6 | A person has a BMI of 42.5. What is their risk... | The person's BMI is 42.5. According to the BMI... |
| 7 | A person has a BMI of 36. What is their risk r... | The person's BMI is 36. According to the BMI r... |
| 8 | A person has a BMI of 41.5. What is their risk... | The person's BMI is 41.5. According to the BMI... |
| 9 | A person has a BMI of 29.5. What is their risk... | The person's BMI is 29.5. According to the BMI... |
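
Only 14 examples remain from 25 generations, and the cell above does not say whether the rest were malformed or duplicates. The sketch below shows one way to count both cases; `parse_example` and `summarize` are hypothetical helpers that assume the same `-----------` delimiter format requested in the system prompt.

```python
DELIMITER = "-----------"


def parse_example(text):
    """Return (prompt, response), or None if the expected parts are missing."""
    parts = text.split(DELIMITER)
    if len(parts) < 4:
        return None
    prompt, response = parts[1].strip(), parts[3].strip()
    if not prompt or not response:
        return None
    return prompt, response


def summarize(examples):
    """Count how many raw examples are malformed, duplicated, or kept."""
    parsed = [p for p in map(parse_example, examples) if p is not None]
    unique = list(dict.fromkeys(parsed))  # drops exact duplicates, keeps order
    return {
        "generated": len(examples),
        "malformed": len(examples) - len(parsed),
        "duplicates": len(parsed) - len(unique),
        "kept": len(unique),
    }
```

Calling `summarize(prev_examples)` would show where the missing examples went. A high duplicate count would suggest that the low temperature is producing repeated examples.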


Save the starter dataset

In [None]:
# Save all examples to a single JSON Lines file (no train/test split is performed here)
df.to_json('train_starter.jsonl', orient='records', lines=True)

In [None]:
!cp 'train_starter.jsonl' '/content/drive/My Drive/train_starter1.jsonl'
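
If a held-out test set is wanted, a 90/10 split could be sketched as follows. This is not part of the original notebook: `split_records` and `write_jsonl` are hypothetical helpers, the seed is arbitrary, and only the standard library is used so the sketch runs on its own.

```python
import json
import random


def split_records(records, train_fraction=0.9, seed=42):
    """Shuffle a copy of records and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

With the DataFrame above, `train, test = split_records(df.to_dict('records'))` followed by `write_jsonl(train, 'train.jsonl')` and `write_jsonl(test, 'test.jsonl')` would produce the two files; those file names are placeholders.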