### Environment Setup

Install essential requirements.

In [None]:
%%capture
!pip install unsloth==2024.12.4
!pip install datasets==3.2.0
!pip install openai==1.58.1

<a name="Finetuning"></a>
## PART1. Domain-specific LLM Finetuning

### Model initialization

In this tutorial, we use `unsloth/Llama-3.2-3B-Instruct` as the base model and add
LoRA adapters, so only a small fraction of the parameters is updated (about 24.3 million here, under 1% of the 3B model, as the training log below reports).
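To make the size of the LoRA update concrete, the sketch below counts the adapter weights for the configuration used in this tutorial (`r=16` on all seven projection modules). The layer shapes are assumptions taken from the publicly available Llama-3.2-3B configuration (hidden size 3072, MLP size 8192, 28 layers, 24 query heads and 8 key/value heads of dimension 128); they are not stated in this notebook. Each adapted linear layer mapping `d_in` to `d_out` gains `r * (d_in + d_out)` trainable weights.

```python
# Assumed Llama-3.2-3B shapes (from the public model config, not from this notebook).
HIDDEN = 3072
MLP = 8192
LAYERS = 28
HEAD_DIM = 128
Q_HEADS, KV_HEADS = 24, 8
RANK = 16  # matches r=16 passed to get_peft_model

# (d_in, d_out) of every linear layer that receives a LoRA adapter.
target_shapes = {
    "q_proj": (HIDDEN, Q_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "o_proj": (Q_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_in -> d_out layer."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in target_shapes.values())
total_lora_params = per_layer * LAYERS
print(f"{total_lora_params:,}")
```

Under these assumptions the count comes out to 24,313,856, the same figure Unsloth prints as "Number of trainable parameters" in the training log.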

In [None]:
# from google.colab import userdata
# HF_TOKEN = userdata.get('HF_TOKEN')

from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

dtype = (
    None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
)
load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False.


def model_generator(model_name, max_seq_length=2048, is_load=False):
    if is_load:
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_name,
            max_seq_length=max_seq_length,
            dtype=dtype,
            load_in_4bit=load_in_4bit,
        )
    else:
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_name,
            max_seq_length=max_seq_length,
            dtype=dtype,
            load_in_4bit=load_in_4bit,
            # token=HF_TOKEN,  # use one if using gated models like meta-llama/Llama-2-7b-hf
        )

        # Case-insensitive match so that names like "unsloth/Llama-3.2-3B-Instruct" qualify
        if "llama" in model_name.lower():
            tokenizer = get_chat_template(
                tokenizer,
                chat_template = "llama-3.1",
            )

        model = FastLanguageModel.get_peft_model(
            model,
            r=16,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
            target_modules=[
                "q_proj",
                "k_proj",
                "v_proj",
                "o_proj",
                "gate_proj",
                "up_proj",
                "down_proj",
            ],
            lora_alpha=16,
            lora_dropout=0,  # Supports any, but = 0 is optimized
            bias="none",  # Supports any, but = "none" is optimized
            # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
            use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
            random_state=3407,
            use_rslora=False,  # We support rank stabilized LoRA
            loftq_config=None,  # And LoftQ
        )

    return model, tokenizer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


### Data preparation

Since this tutorial focuses on generating math study plans, we use `GSM8K` as the dataset. For each training sample, the question serves as the user input and the answer serves as the assistant's response.

In [None]:
def question_prompt(s):
    return f"Question: {s}"

def answer_prompt(s):
    return f"Answer: {s}"

def formatting_prompts_func(example, tokenizer):
    convo = [
        {"role": "user", "content": question_prompt(example["question"])},
        {"role": "assistant", "content": answer_prompt(example["answer"])},
    ]
    texts = tokenizer.apply_chat_template(
        convo, tokenize=False, add_generation_prompt=False
    )

    return {
        "text": texts,
    }



<a name="Train"></a>
### Train the model

Now let's use Hugging Face TRL's `SFTTrainer`!
We run only 60 steps to speed things up. For a full run, set `num_train_epochs=1` and remove `max_steps` (or leave it at its default of `-1`).
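As a quick sanity check on what 60 steps means, the arithmetic below uses the values from the training arguments and from the training log (7,473 examples, 1 GPU). It is only a back-of-the-envelope sketch: the step count for a full epoch is an approximation, since the trainer's own rounding of the last partial batch may differ by one step.

```python
import math

# Values taken from the TrainingArguments and the Unsloth training log.
per_device_batch = 2
grad_accum = 4
num_gpus = 1
max_steps = 60
num_examples = 7473

# One optimizer step consumes this many training examples.
effective_batch = per_device_batch * grad_accum * num_gpus

# Examples seen during the shortened run.
samples_seen = effective_batch * max_steps

# Approximate number of optimizer steps needed for one full epoch.
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"effective batch size: {effective_batch}")
print(f"samples seen in {max_steps} steps: {samples_seen}")
print(f"approx. steps for one epoch: {steps_per_epoch}")
print(f"fraction of the dataset seen: {samples_seen / num_examples:.1%}")
```

The short run therefore touches 480 examples, a small slice of the training split, which is worth keeping in mind when reading the accuracy comparison later on.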

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from datasets import load_dataset

from unsloth import is_bfloat16_supported
from unsloth.chat_templates import get_chat_template
from unsloth.chat_templates import train_on_responses_only

# from models.generator import model_generator
# from utils.formatter import formatting_prompts_func

model_name = "unsloth/Llama-3.2-3B-Instruct"
max_seq_length = 2048  # Choose any! We auto support RoPE Scaling internally!

# get model
model, tokenizer = model_generator(model_name, max_seq_length)

# get dataset
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda example: formatting_prompts_func(example, tokenizer))

# get trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=2,
    packing=False,  # Can make training 5x faster for short sequences.
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps=60,
        learning_rate=1e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",  # Use this for WandB etc
    ),
)

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)

trainer_stats = trainer.train()  # 480 samples

# save
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")

==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.47.0.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 24,313,856


| Step | Training Loss |
|---|---|
| 1 | 3.9285 |
| 2 | 4.1432 |
| 3 | 4.0106 |
| 4 | 4.5581 |
| 5 | 4.21 |
| 6 | 3.4167 |
| 7 | 3.1682 |
| 8 | 2.8079 |
| 9 | 2.706 |
| 10 | 2.6148 |


('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

### Test the model

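* For prompts, we use n-shot examples drawn from the training split, followed by the test question with a chain-of-thought instruction.
* For answer parsing, the prompt instructs the model to write its final answer after `####`. We then cut the response at the EOS token (if one is given) and keep the text after the last `####`.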
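The parsing rule above can be illustrated in isolation. The helper below is a hypothetical, slightly stricter variant of the tutorial's `extract_ans_from_response`: instead of deleting a fixed list of characters, it uses a regular expression to pick the first number after the last `####`. The name `parse_final_answer` and the sample strings are made up for this sketch.

```python
import re
from typing import Optional, Union

# First number in a string, allowing a sign, thousands separators and decimals.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def parse_final_answer(text: str, eos: Optional[str] = None) -> Union[int, float, str]:
    """Return the number written after the last '####', or the raw tail if none is found."""
    if eos:
        text = text.split(eos)[0]          # drop everything after the EOS marker
    tail = text.split("####")[-1]          # keep what follows the last '####'
    match = NUMBER.search(tail)
    if match is None:
        return tail.strip()                # nothing numeric: return the text unchanged
    number = match.group(0).replace(",", "")
    return float(number) if "." in number else int(number)


# Illustrative strings in the GSM8K style (not taken from the dataset).
print(parse_final_answer("48 / 2 = 24, and 48 + 24 = 72.\n#### 72"))
print(parse_final_answer("The total is $1,250.\n#### $1,250<|eot_id|>", eos="<|eot_id|>"))
```

Returning the raw tail when no number is found mirrors the original function's fallback, so a malformed response simply fails the equality check instead of raising an error.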

In [None]:
def nshot_chats(nshot_data, n: int, question: str):
    chats = []

    shuffled_dataset = nshot_data.shuffle()

    for i in range(n):
        qna = shuffled_dataset[i]
        chats.append({"role": "user", "content": question_prompt(qna["question"])})
        chats.append({"role": "assistant", "content": answer_prompt(qna["answer"])})

    chats.append(
        {
            "role": "user",
            "content": question_prompt(question)
            + " Let's think step by step. At the end, you MUST write the answer as an integer after '####'.",
        }
    )

    return chats


def extract_ans_from_response(answer: str, eos=None):
    if eos:
        answer = answer.split(eos)[0].strip()

    answer = answer.split("####")[-1].strip()

    for remove_char in [",", "$", "%", "g"]:
        answer = answer.replace(remove_char, "")

    try:
        return int(answer)
    except ValueError:
        return answer


def get_response(model, tokenizer, message, max_tokens=256, generation=True, batched=False):
    inputs = tokenizer.apply_chat_template(
        message,
        tokenize=True,
        padding=True,
        add_generation_prompt=generation,  # Must add for generation
        return_tensors="pt",
    ).to("cuda")

    outputs = model.generate(
        input_ids=inputs, max_new_tokens=max_tokens, use_cache=True, temperature=0.01, min_p=0.1
    )

    outputs = outputs[:, inputs.shape[1]:]
    if not batched:
        response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    else:
        response = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return response

We can now compare the accuracy of the finetuned model with that of the base models, using chain-of-thought reasoning and 2-shot prompting!
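Because `nshot_chats` calls `shuffle()` without a seed, the two example shots can differ from run to run. If reproducible prompts are wanted, one option is to sample the shot indices with a seeded random generator, as in the sketch below. `build_nshot_messages` and its `seed` parameter are hypothetical names introduced here; the function only relies on `len()` and integer indexing, so it should work with a plain list as well as with a dataset object that supports both.

```python
import random

COT_SUFFIX = (
    " Let's think step by step. At the end, you MUST write the answer "
    "as an integer after '####'."
)


def build_nshot_messages(examples, n: int, question: str, seed: int = 0):
    """Build an n-shot chat: n solved examples, then the new question with a CoT instruction."""
    rng = random.Random(seed)
    # Sample indices rather than items, so any indexable container can be used.
    shot_ids = rng.sample(range(len(examples)), n)

    messages = []
    for idx in shot_ids:
        shot = examples[idx]
        messages.append({"role": "user", "content": f"Question: {shot['question']}"})
        messages.append({"role": "assistant", "content": f"Answer: {shot['answer']}"})

    messages.append({"role": "user", "content": f"Question: {question}" + COT_SUFFIX})
    return messages


# Toy data standing in for the GSM8K training split.
toy_examples = [{"question": f"q{i}", "answer": f"work... #### {i}"} for i in range(10)]
demo = build_nshot_messages(toy_examples, n=2, question="What is 1 + 1?", seed=3407)
print([m["role"] for m in demo])
```

Sampling indices also avoids reshuffling the whole training split for every test question.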

In [None]:
import time
import torch
from tqdm import tqdm
from datasets import load_dataset
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

baselines = [
    "./lora_model",
    # "unsloth/Meta-Llama-3.1-8B",
    "unsloth/Llama-3.2-3B-Instruct",
    "unsloth/Llama-3.2-1B-Instruct",
]

# load data
ds = load_dataset("openai/gsm8k", "main")
samples = ds["train"]
test_dataset = ds["test"]
test_data = test_dataset.take(200)

# test accuracy for different models
acc_data = []

for model_name in baselines:
    start_time = time.perf_counter()
    if "lora_model" in model_name:
        model, tokenizer = model_generator(model_name, 2048, True)
    else:
        model, tokenizer = model_generator(model_name)

    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

    total = correct = 0
    for item in tqdm(test_data):
        messages = nshot_chats(
            nshot_data=samples, n=2, question=item["question"]
        )
        response = get_response(model, tokenizer, messages)
        # print("============== QUESTION ==============")
        # print(item["question"])
        # print("============== ANSWER ==============")
        # print(item["answer"])
        # print("============== RESPONSE ==============")
        # print(response)

        pred_ans = extract_ans_from_response(response)
        true_ans = extract_ans_from_response(item["answer"])

        # print("============== EXTRACT_ANS ==============")
        # print(f"pred_ans: {pred_ans}")
        # print(f"true_ans: {true_ans}")

        total += 1
        if pred_ans == true_ans:
            correct += 1

    accuracy = f"{correct/total:.3f}"
    acc_data.append([model_name, accuracy])
    print(f"============== Testing {model_name} ============== ")
    print(f"Total Accuracy: {accuracy}")
    print(f"Duration: {time.perf_counter() - start_time}")

==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.47.0.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


100%|██████████| 200/200 [23:11<00:00,  6.96s/it]


Total Accuracy: 0.605
Duration: 1404.3316775580001

100%|██████████| 200/200 [29:44<00:00,  8.92s/it]


Total Accuracy: 0.790
Duration: 1801.7821610969995
Unsloth 2024.12.4 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
100%|██████████| 200/200 [17:20<00:00,  5.20s/it]

Total Accuracy: 0.315
Duration: 1075.3311735999996





In [None]:
from pandas import DataFrame

df = DataFrame(acc_data, columns=["Model", "Accuracy"])
df

| | Model | Accuracy |
|---|---|---|
| 0 | ./lora_model | 0.605 |
| 1 | unsloth/Llama-3.2-3B-Instruct | 0.790 |
| 2 | unsloth/Llama-3.2-1B-Instruct | 0.315 |


## PART2. Customize Evaluation Metrics


### Phase1. Initialize evaluation dimensions

In this phase, we generate outputs for two prompts: one that contains only the task, and one that combines the task with the initial evaluation dimensions.

**Note**: the system prompt must remain unchanged and should stay consistent across different tasks.

#### step1. Define initial evaluation dimensions

Starting from the predefined prompts below, adapt them to your task and list 2 to 3 initial evaluation dimensions, each written as a keyword adjective followed by a detailed explanation (for example, `Be Detailed: Provide a detailed plan ...`).
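When iterating on the dimensions, it can be convenient to keep them as data and render the prompt from them. The helper below is a hypothetical sketch (`build_evaluation_prompt` is not part of the tutorial's code); it is written so that, for the two initial dimensions, it yields the same string as the hand-written `student_evaluation_system_prompt`.

```python
def build_evaluation_prompt(dimensions, artifact: str = "study plan") -> str:
    """Render (adjective, explanation) pairs as a numbered 'Be <Adjective>: ...' list."""
    header = f"Consider the following dimensions when generating the {artifact}: "
    items = [
        f"{i}. Be {adjective}: {explanation}"
        for i, (adjective, explanation) in enumerate(dimensions, start=1)
    ]
    return "\n\n" + "\n".join([header, *items])


initial_dimensions = [
    (
        "Detailed",
        "Provide a detailed plan that includes the topics and concepts they "
        "should focus on, in a clear and organized manner.",
    ),
    (
        "Hierarchical",
        "Create a hierarchical study plan, grouping similar topics and concepts together.",
    ),
]

example_prompt = build_evaluation_prompt(initial_dimensions)
print(example_prompt)
```

Adding, removing, or rewording a dimension then only means editing the list, which keeps the generation prompt and any later scoring prompt in sync.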

In [None]:
system_prompt = (
    "\n\n"
    + "If there are specific requirements or constraints, you should satisfy the requests in your response"
)

student_task_system_prompt = (
    "You are an expert in math education. "
    + "You should provide clear, descriptive, and helpful answers and explanations to the student's questions. "
    + "The student may also ask or provide additional specific information that you should take into account when assisting them."
    + "\n\n"
    + "You are helping a student with questions about the direction of their studies. "
    + "You will help the student generate a study plan for their course. "
)

student_evaluation_system_prompt = (
    "\n\n"
    + "Consider the following dimensions when generating the study plan: \n"
    + "1. Be Detailed: Provide a detailed plan that includes the topics and concepts they should focus on, in a clear and organized manner.\n"
    + "2. Be Hierarchical: Create a hierarchical study plan, grouping similar topics and concepts together."
)

initial_student_prompts = {
    "task_prompt": student_task_system_prompt,
    "evaluation_prompt": student_evaluation_system_prompt,
}

#### step2. Generate initial LLM responses using two versions of the prompt


In [None]:
MODEL_NAME = "unsloth/Llama-3.2-3B-Instruct"
model, tokenizer = model_generator(MODEL_NAME)
FastLanguageModel.for_inference(model)


task_eval_prompt = initial_student_prompts
user_prompt = "Can you generate a study plan for me? I am taking Calculus 1 this semester and struggling with integration, particularly u-substitution."


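prompts = {
    # Task prompt + generic instruction + initial evaluation dimensions
    "RAW_SYSTEM_PROMPT": (
        task_eval_prompt["task_prompt"]
        + system_prompt
        + task_eval_prompt["evaluation_prompt"]
    ),
    # Task prompt + generic instruction only (no evaluation dimensions)
    "PARTIAL_PROMPT": (task_eval_prompt["task_prompt"] + system_prompt),
}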

responses = {}

for prompt_name, prompt in prompts.items():
    input_chat = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": user_prompt},
    ]
    resp = get_response(model, tokenizer, input_chat, max_tokens=1024, generation=True)
    responses[prompt_name] = resp

    print(f"============== RESPONSE FOR {prompt_name} ==============")
    print(resp)
    print("============== /RESPONSE ==============\n\n")


==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.47.0.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
I'd be happy to help you generate a study plan for Calculus 1, focusing on integration and u-substitution.

**Study Plan Overview**

To tackle Calculus 1, we'll break down the course into manageable chunks, focusing on key topics and concepts. We'll prioritize integration and u-substitution, as you mentioned.

**Course Outline**

Here's a hierarchical outline of the course, grouping similar topics and concepts together:

I. **Limits and Basic Calculus**

* Review of basic calculus concepts (e.g., derivatives, integrals, and applications)
* 

### Phase2. Refine evaluation dimensions

Treat ChatGPT as an expert on your domain-specific task!


#### step1. Get EXTRA EVALUATION DIMENSIONS

In [None]:
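# Temp cell: hard-coded copies of the Phase 1 responses and prompts,
# so that this phase can be run without regenerating them.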

response = {
    'RAW_SYSTEM_PROMPT': """**Study Plan Overview**

To tackle Calculus 1, we'll break down the course into manageable chunks, focusing on key topics and concepts. We'll prioritize integration and u-substitution, as you mentioned.

**Course Outline**

Here's a hierarchical outline of the course, grouping similar topics and concepts together:

I. **Limits and Basic Calculus**

* Review of basic calculus concepts (e.g., derivatives, integrals, and applications)
* Introduction to limits and their role in calculus
* Practice problems and exercises to solidify understanding

II. **Differentiation**

* Review of differentiation rules (e.g., power rule, product rule, quotient rule)
* Introduction to implicit differentiation
* Practice problems and exercises to solidify understanding

III. **Integration**

* **u-Substitution**: Focus on u-substitution techniques for integration
	+ Practice problems and exercises to develop u-substitution skills
	+ Review of common u-substitution formulas and techniques
* **Integration by Parts**: Focus on integration by parts techniques for integration
	+ Practice problems and exercises to develop integration by parts skills
	+ Review of common integration by parts formulas and techniques
* **Integration by Partial Fractions**: Focus on integration by partial fractions techniques for integration
	+ Practice problems and exercises to develop integration by partial fractions skills
	+ Review of common integration by partial fractions formulas and techniques
* **Integration of Trigonometric and Exponential Functions**: Focus on integrating trigonometric and exponential functions
	+ Practice problems and exercises to develop skills in integrating these functions

IV. **Applications of Integration**

* **Area Between Curves**: Focus on finding areas between curves
	+ Practice problems and exercises to develop skills in finding areas between curves
* **Volume of Solids**: Focus on finding volumes of solids
	+ Practice problems and exercises to develop skills in finding volumes of solids
* **Work and Energy**: Focus on applying integration to real-world problems
	+ Practice problems and exercises to develop skills in applying integration to real-world problems

**Action Plan**

To tackle the course, follow this action plan:

1. **Weeks 1-2**: Review basic calculus concepts, including derivatives and integrals. Practice problems and exercises to solidify understanding.
2. **Weeks 3-4**: Focus on u-substitution techniques for integration. Practice problems and exercises to develop u-substitution skills.
3. **Weeks 5-6**: Focus on integration by parts techniques for integration. Practice problems and exercises to develop integration by parts skills.
4. **Weeks 7-8**: Focus on integration by partial fractions techniques for integration. Practice problems and exercises to develop integration by partial fractions skills.
5. **Weeks 9-10**: Focus on integrating trigonometric and exponential functions. Practice problems and exercises to develop skills in integrating these functions.
6. **Weeks 11-12**: Focus on applying integration to real-world problems, including area between curves, volume of solids, and work and energy.

**Additional Tips**

* **Practice consistently**: Set aside dedicated time each week to practice problems and exercises.
* **Review regularly**: Review notes and textbook regularly to reinforce understanding.
* **Seek help when needed**: Don't hesitate to ask for help if you're struggling with a particular concept or problem.
* **Use online resources**: Utilize online resources, such as video lectures and practice problems, to supplement your learning.

By following this study plan, you'll be well-prepared to tackle the challenges of Calculus 1, particularly integration and u-substitution. Stay focused, and you'll be on your way to success!""",

    'PARTIAL_PROMPT': """I'd be happy to help you create a study plan for Calculus 1, focusing on integration and u-substitution.

**Understanding Your Goals and Current Situation**

Before we begin, let's establish your goals and current situation:

1. **Goals:**
	* Master u-substitution and integration techniques in Calculus 1.
	* Improve overall understanding of integration and its applications.
2. **Current Situation:**
	* You're struggling with u-substitution in Calculus 1.
	* You're not sure where to start or how to approach problems involving u-substitution.

**Study Plan**

To help you achieve your goals, I've created a 4-week study plan, with a focus on u-substitution and integration. This plan is tailored to your specific needs and can be adjusted as needed.

**Week 1: Review and Foundation**

1. Review the basics of integration, including:
	* Definition of definite and indefinite integrals
	* Basic integration rules (e.g., power rule, constant multiple rule)
	* Integration by substitution (without u-substitution)
2. Practice problems:
	* Complete 10-15 practice problems from your textbook or online resources, focusing on basic integration rules and substitution.
	* Review and understand the concept of u-substitution.

**Week 2: u-Substitution**

1. **u-Substitution Fundamentals:**
	* Learn the u-substitution method, including:
		+ Choosing the right u-function
		+ Finding the derivative of u
		+ Substituting u and du into the integral
	* Practice problems:
		+ Complete 10-15 practice problems involving u-substitution, starting with simple examples and gradually increasing to more complex ones.
2. **Integration by u-Substitution:**
	* Learn to integrate functions using u-substitution, including:
		+ Integrating trigonometric functions
		+ Integrating exponential functions
		+ Integrating rational functions

**Week 3: Integration Applications**

1. **Integration Applications:**
	* Learn to apply u-substitution to real-world problems, including:
		+ Physics and engineering applications
		+ Economics and finance applications
	* Practice problems:
		+ Complete 10-15 practice problems involving u-substitution in real-world contexts.
2. **Integration Techniques:**
	* Review and practice other integration techniques, including:
		+ Integration by parts
		+ Integration by partial fractions
		+ Integration of trigonometric and exponential functions

**Week 4: Review and Practice**

1. **Review and Practice:**
	* Review all the concepts covered in the previous weeks.
	* Practice problems:
		+ Complete 20-30 practice problems, including a mix of u-substitution, integration by parts, and other techniques.
2. **Final Project:**
	* Work on a final project that applies u-substitution to a real-world problem or a challenging integration problem.

**Additional Tips and Recommendations**

* **Practice consistently:** Set aside dedicated time each day or week to practice problems and review concepts.
* **Use online resources:** Utilize online resources, such as Khan Academy, MIT OpenCourseWare, or Wolfram Alpha, to supplement your learning and practice problems.
* **Join a study group:** Consider joining a study group or finding a study partner to collaborate and discuss challenging problems.
* **Seek help when needed:** Don't hesitate to ask your instructor or teaching assistant for help if you're struggling with a particular concept or problem.

By following this study plan, you'll be well on your way to mastering u-substitution and integration in Calculus 1. Stay committed, and you'll see improvement over time. Good luck!
    """,
}

task_eval_prompt = initial_student_prompts

system_prompt = (
    "\n\n"
    + "If there are specific requirements or constraints, you should satisfy the requests in your response"
)

student_task_system_prompt = (
    "You are an expert in math education. "
    + "You should provide clear, descriptive, and helpful answers and explanations to the student's questions. "
    + "The student may also ask or provide additional specific information that you should take into account when assisting them."
    + "\n\n"
    + "You are helping a student with questions about the direction of their studies. "
    + "You will help the student generate a study plan for their course. "
)

student_evaluation_system_prompt = (
    "\n\n"
    + "Consider the following dimensions when generating the study plan: \n"
    + "1. Be Detailed: Provide a detailed plan that includes the topics and concepts they should focus on, in a clear and organized manner.\n"
    + "2. Be Hierarchical: Create a hierarchical study plan, grouping similar topics and concepts together."
)

initial_student_prompts = {
    "task_prompt": student_task_system_prompt,
    "evaluation_prompt": student_evaluation_system_prompt,
}

user_prompt = "Can you generate a study plan for me? I am taking Calculus 1 this semester and struggling with integration, particularly u-substitution."

In [None]:
import openai
from google.colab import userdata

OPENAI_TOKEN = userdata.get('OPENAI_TOKEN')

compare_prompt = (
    "The following texts are two different sample study plans to be used by students to further their studies. "
    + "Compare the following study plans and provide feedback on which one is more effective. "
    + "Provide an explanation of your decision and define the 2-3 most important metrics you used to arrive at your conclusion."
)

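# RAW_SYSTEM_PROMPT was generated with the evaluation dimensions, PARTIAL_PROMPT without them.
output_with_evaluations = response['RAW_SYSTEM_PROMPT']
task_only_output = response['PARTIAL_PROMPT']
response_pair = f"Response 1:\n{output_with_evaluations}\n\n\nResponse 2:\n{task_only_output}"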

openai_client = openai.OpenAI(api_key=OPENAI_TOKEN)

resp = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": compare_prompt},
        {"role": "user", "content": response_pair},
    ],
)

resp = resp.choices[0].message.content

print("============== EXTRA_EVALUATION  ==============")
print(resp)
print("============== /EXTRA_EVALUATION ==============")

After evaluating both study plans, Response 1 appears to be more effective in supporting a comprehensive understanding and strong grasp of Calculus 1, specifically focusing on integration and u-substitution. Here are the 2-3 most important metrics used to reach this conclusion:

1. **Clarity and Structure:**
   - Response 1 provides a detailed course outline, presenting a hierarchical and structured approach to the subject matter. It categorizes topics into major sections like Limits, Differentiation, Integration, and Applications of Integration, breaking them down further into subtopics for focused study. This organization makes it easier for a student to navigate through the course material methodically.
   - In contrast, Response 2 presents a more vague outline with less emphasis on the relationships between different calculus concepts, which might make it more challenging for students to see the bigger picture.

2. **Comprehensive Coverage:**
   - Response 1 emphasizes not only the

#### step2. Get ADVICE on the initially listed evaluation dimensions

In [None]:
advice_prompt = (
    "For this task we considered the following dimensions during the generation process. "
    + "Please evaluate the dimensions we use as generation criteria, and suggest whether any dimensions should be added, removed, or modified."
)

resp = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": advice_prompt},
            {"role": "user", "content": task_eval_prompt["evaluation_prompt"]},
        ],
    )

resp = resp.choices[0].message.content

print("============== ADVICE_ON_INITIAL_PROMPT  ==============")
print(resp)
print("============== /ADVICE_ON_INITIAL_PROMPT ==============")

The dimensions you’ve chosen are quite effective for creating a structured and comprehensive study plan. Here are some suggestions and considerations:

1. **Be Detailed:**
   - This is crucial as it provides clear guidance and expectations for students. Consider specifying not just topics, but also subtopics, learning objectives, key concepts, and resources or materials needed. Additionally, timelines or suggested hours for each section could enhance detail.

2. **Be Hierarchical:**
   - Organizing content in a hierarchical fashion aids in learner understanding and retention. Ensure that there is a logical flow from foundational to more complex topics. Additionally, consider including examples or applications of concepts to enforce understanding.

**Additional Dimensions to Consider:**

3. **Be Flexible:**
   - Students have different paces and learning styles. Recommend alternative resources and approaches based on different learning preferences (e.g., visual, auditory, hands-on learn

### Phase3. Finalize evaluation dimensions and task prompts


#### step1. Finalize evaluation dimensions and task prompts
Selectively merge the feedback provided by ChatGPT in Phase 2 and finalize the evaluation dimensions for your task.

In [None]:
finalized_student_evaluation_system_prompt = (
    "\n\n"
    + "Consider the following dimensions when generating the study plan: \n"
    + "1. Be Detailed and Actionable: Provide a detailed plan with specific topics and concepts they should focus on, and include actionable strategies.\n"
    + "2. Be Hierarchical: Create a hierarchical study plan, grouping similar topics and concepts together.\n"
    + "3. Be Adaptable: Allow flexibility for students to personalize the plan based on their progress, time constraints, or focus areas.\n"
)

finalized_student_prompts = {
    "task_prompt": student_task_system_prompt,
    "evaluation_prompt": finalized_student_evaluation_system_prompt,
}

# Build the system prompt from the finalized evaluation dimensions
whole_task_prompt = (
    finalized_student_prompts["task_prompt"]
    + system_prompt
    + finalized_student_prompts["evaluation_prompt"]
)

input_chat = [
    {"role": "system", "content": whole_task_prompt},
    {"role": "user", "content": user_prompt},
]
final_response = get_response(model, tokenizer, input_chat, max_tokens=1024, generation=True)
print(f"============== FINAL RESPONSE ==============")
print(final_response)
print("============== /FINAL RESPONSE ==============")


I'd be happy to help you generate a study plan for Calculus 1, focusing on integration and u-substitution.

**Study Plan Overview**

To tackle Calculus 1, we'll break down the course into manageable chunks, focusing on key topics and concepts. We'll prioritize integration and u-substitution, as you mentioned.

**Course Outline**

Here's a hierarchical outline of the course, grouping similar topics and concepts together:

I. **Limits and Basic Calculus**

* Review of basic calculus concepts (e.g., derivatives, integrals, and applications)
* Introduction to limits and their role in calculus
* Practice problems and exercises to solidify understanding

II. **Differentiation**

* Review of differentiation rules (e.g., power rule, product rule, quotient rule)
* Introduction to implicit differentiation
* Practice problems and exercises to solidify understanding

III. **Integration**

* **u-Substitution**: Focus on u-substitution techniques for integration
	+ Practice problems and exercises to

#### step2. Customize scoring prompts for your task

Define the scoring task using finalized evaluation dimensions. Here, we use a zero-shot prompt, simplifying the scoring task to three levels: -1, 0, and 1, corresponding to poor, neutral, and good.
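Since the judge replies in free text, the per-criterion scores usually need to be extracted before they can be aggregated. The sketch below assumes the judge writes one line per criterion in the form `**1. Be Detailed and Actionable: 1**`, as in the sample output shown in this section; that format is not guaranteed, so the parser simply skips lines that do not match. `parse_scores` and the sample text are hypothetical.

```python
import re

# One criterion per line, e.g. "**2. Be Hierarchical: 0**" (the asterisks are optional).
SCORE_LINE = re.compile(
    r"^\*{0,2}[ \t]*(\d+)\.[ \t]*(.+?):[ \t]*(-1|0|1)[ \t]*\*{0,2}[ \t]*$",
    re.MULTILINE,
)


def parse_scores(judge_text: str) -> dict:
    """Map each criterion name to its score in {-1, 0, 1}; unmatched lines are ignored."""
    return {
        name.strip(): int(score)
        for _, name, score in SCORE_LINE.findall(judge_text)
    }


# Made-up judge output in the assumed format.
sample = (
    "**1. Be Detailed and Actionable: 1**\n\n"
    "The plan lists weekly topics and practice targets.\n\n"
    "**2. Be Hierarchical: 0**\n\n"
    "Topics are grouped, but the ordering is loose.\n\n"
    "**3. Be Adaptable: -1**\n\n"
    "No guidance on adjusting the plan.\n"
)

scores = parse_scores(sample)
print(scores)
print("total:", sum(scores.values()))
```

If the judge's formatting proves unstable, asking it for a fixed structure in the scoring prompt would make this step more dependable than pattern matching.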

In [None]:
score_prompt = (
    "You are a labeller and response evaluator. Please evaluate the following question and response pair. "
    + "Provide a set of scores based on the quality of the response and how well it fulfills each of the criteria, "
    + "with -1 indicating that the response does not meet the requirement, 0 being neutral, and 1 meaning the response effectively meets the required criterion. "
    + "Here are the criteria to consider when evaluating the response. Please provide a score for each criterion based on how well the response meets the requirement, "
    + "and a brief explanation as to why each score was chosen: \n"
    + "1. Be Detailed and Actionable: Provide a detailed plan with specific topics and concepts they should focus on, and include actionable strategies.\n"
    + "2. Be Hierarchical: Create a hierarchical study plan, grouping similar topics and concepts together.\n"
    + "3. Be Adaptable: Allow flexibility for students to personalize the plan based on their progress, time constraints, or focus areas.\n"
)

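# Score the response generated with the finalized prompt (`response` is a dict, not a single answer)
inference_qa_pair = f"Question:\n{user_prompt}\n\nAnswer:\n{final_response}"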

messages = [
    {"role": "system", "content": score_prompt},
    {"role": "user", "content": inference_qa_pair},
]

resp = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
        )

resp = resp.choices[0].message.content

print(f"============== SCORE ==============")
print(resp)
print("============== /SCORE ==============")

**1. Be Detailed and Actionable: 1**

The response is detailed and actionable, offering a structured four-week study plan focusing on critical areas like u-substitution. It provides specific topics to study each week and suggests completing a number of practice problems. The response also includes practical advice on using online resources and joining study groups.

**2. Be Hierarchical: 1**

The study plan is well-organized and hierarchical, moving from foundational concepts to more advanced topics and applications. It groups similar techniques and applications together under weekly headings, effectively creating a clear progression from simpler to more complex ideas.

**3. Be Adaptable: 1**

The response allows for adaptability by suggesting the study plan can be adjusted as needed. It acknowledges potential personalization based on progress or challenges faced by the student, such as collaborating with a study partner or utilizing additional resources. This flexibility makes the res

## References

* [How to Reproduce Llama-3's Performance on GSM-8k](https://medium.com/@sewoong.lee/how-to-reproduce-llama-3s-performance-on-gsm-8k-e0dce7fe9926)

* [Unsloth Tutorial: Llama 1B-3B](https://colab.research.google.com/drive/1T5-zKWM_5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing#scrollTo=gGFzmplrEy9I)