# Math Question Answer Verification Competition

## Starter Code

Borrowed from the [official Unsloth implementation](https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing#scrollTo=MKX_XKs_BNZR).

In [1]:
# %%capture
# This cell will take time
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Found existing installation: unsloth 2024.11.7
Uninstalling unsloth-2024.11.7:
  Successfully uninstalled unsloth-2024.11.7
Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-zu3w9__4/unsloth_46c2f316f65d43dea54830f269e11f1b
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-zu3w9__4/unsloth_46c2f316f65d43dea54830f269e11f1b
  Resolved https://github.com/unslothai/unsloth.git to commit f26d4e739ed507de7a9088da53d10fd02f58d160
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: unsloth
  Building wheel for unsloth (pyproject.toml) ... done
  Created wheel for unsloth: filename=unsloth-2024.11.7-py3-none-a

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [3]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2024.11.7: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu124. CUDA = 8.0. CUDA Toolkit = 12.4.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## Load model and wrap with LoRA adapters

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.11.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
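
The adapter size implied by `r = 32` and the seven `target_modules` can be checked by hand. The sketch below assumes the usual Llama-3.1-8B shapes (hidden size 4096, MLP width 14336, key/value width 1024, 32 layers); these dimensions are not stated in this notebook, so treat them as assumptions. Under them, the count matches the `Number of trainable parameters = 83,886,080` line in the training log. The `lora_scale` helper illustrates what `use_rslora = True` changes: the update is scaled by `alpha / sqrt(r)` instead of `alpha / r`.

```python
import math

# Assumed Llama-3.1-8B shapes (not stated in the notebook).
HIDDEN = 4096          # model width
INTERMEDIATE = 14336   # MLP width
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
NUM_LAYERS = 32

# (in_features, out_features) of each projection listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(r: int) -> int:
    """LoRA adds A (r x in) and B (out x r) for every adapted matrix."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


def lora_scale(alpha: float, r: int, rslora: bool) -> float:
    """Scaling of the LoRA update: alpha/sqrt(r) with rsLoRA, else alpha/r."""
    return alpha / math.sqrt(r) if rslora else alpha / r


print(lora_trainable_params(32))           # 83886080
print(lora_scale(32, 32, rslora=False))    # 1.0
print(round(lora_scale(32, 32, rslora=True), 3))
```

With `lora_alpha = 32` and `r = 32`, plain LoRA would scale the update by 1.0, while the rank-stabilized variant scales it by roughly 5.66, so the two settings are not interchangeable at the same learning rate.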


## Competition dataset

In [5]:
# download and load competition dataset

from datasets import load_dataset
dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp")
# print and see dataset
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'is_correct', 'answer', 'solution'],
        num_rows: 1000000
    })
    test: Dataset({
        features: ['question', 'is_correct', 'answer', 'solution'],
        num_rows: 10000
    })
})

In [6]:
prompt = """You are a highly skilled mathematician tasked with verifying the correctness of mathematical solutions. Your task is to determine if the given answer to a mathematics question is correct or incorrect. Yout Instructions: 1. Carefully analyze the question and the provided answer 2. Verify the mathematical steps and logic 3. Check for any calculation errors 4. Respond with ONLY 'True' if the answer is completely correct, or 'False' if there are any errors. Below is Question and Answer.

### Question:
{}

### Answer:
{}

### Explainaition

### Output:
{}"""





EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    questions = examples["question"]
    answers = examples["answer"]
    labels = examples["is_correct"]
    texts = []
    for question, answer, label in zip(questions, answers, labels):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = prompt.format(question, answer, label) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }




In [7]:
# Process the training dataset and generate prompt for each datapoint

train_dataset = dataset['train'].map(formatting_prompts_func, batched = True,)
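
With `batched = True`, the mapped function receives a dict of columns (lists) rather than a single row, and returns a dict of new columns of the same length. A minimal sketch of that contract on a toy batch, with no tokenizer or dataset needed; `TEMPLATE`, `EOS`, `format_batch`, and the toy rows are illustrative stand-ins, and the EOS string is an assumption in place of `tokenizer.eos_token`:

```python
# Shortened stand-in for the notebook's prompt; the section layout is the same.
TEMPLATE = "### Question:\n{}\n\n### Answer:\n{}\n\n### Output:\n{}"
EOS = "<|end_of_text|>"  # assumed EOS string, standing in for tokenizer.eos_token


def format_batch(examples: dict) -> dict:
    """Turn a batch (dict of lists) into one prompt string per row."""
    texts = [
        TEMPLATE.format(question, answer, label) + EOS
        for question, answer, label in zip(
            examples["question"], examples["answer"], examples["is_correct"]
        )
    ]
    return {"text": texts}


# Toy batch with the same column names as the competition dataset.
toy_batch = {
    "question": ["What is 2 + 2?", "What is 3 * 3?"],
    "answer": ["4", "6"],
    "is_correct": [True, False],
}

formatted = format_batch(toy_batch)
print(formatted["text"][1])
```

The boolean label is rendered by `str.format` as the literal text `True` or `False`, which is the token sequence the model learns to emit after `### Output:`.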

In [8]:
# Print a sample training example
train_dataset['text'][0]

"You are a highly skilled mathematician tasked with verifying the correctness of mathematical solutions. Your task is to determine if the given answer to a mathematics question is correct or incorrect. Yout Instructions: 1. Carefully analyze the question and the provided answer 2. Verify the mathematical steps and logic 3. Check for any calculation errors 4. Respond with ONLY 'True' if the answer is completely correct, or 'False' if there are any errors. Below is Question and Answer.\n\n### Question:\nWhat is the radius of the circle inscribed in triangle $ABC$ if $AB = 22, AC=12,$ and $BC=14$? Express your answer in simplest radical form.\n\n### Answer:\n3.16227766016838\n\n### Explainaition\n\n### Output:\nTrue<|end_of_text|>"

## SFT

In [9]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

training_args = TrainingArguments(
        per_device_train_batch_size = 8,
        per_device_eval_batch_size = 8,
        gradient_accumulation_steps = 4,
        warmup_steps = 30,
        #num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 1000,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        log_level = "info",
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01, # 0.05
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
        # Step-wise evaluation requires an eval_dataset to be passed to the trainer
        do_eval = True,
        evaluation_strategy = "steps",
        eval_steps = 200,
        gradient_checkpointing = True,
        hub_strategy = "every_save",
        save_strategy = "steps",
        save_steps = 1000000,
        save_total_limit = 1
)



# import random
# from tqdm import tqdm  # Progress bar

# # Define parameters
# total_samples = len(dataset['train'])  # Use the actual dataset size
# chunk_size = 10000     # Samples per chunk
# total_iterations = 3   # Number of training iterations

# # Use a different learning rate for each stage
# learning_rates = [5e-4, 1e-4, 8e-5]  # Gradually lower the learning rate

# for iteration in range(total_iterations):
#     print(f"\nStarting Iteration {iteration + 1}/{total_iterations}")

#     # Update the learning rate
#     training_args.learning_rate = learning_rates[iteration]

#     # Random sampling
#     sampled_indices = random.sample(range(total_samples), chunk_size)

#     # Select the data subset
#     sampled_train_dataset = dataset['train'].select(sampled_indices)
#     train_dataset = sampled_train_dataset.map(
#         formatting_prompts_func,
#         batched=True,
#         desc=f"Formatting data for iteration {iteration + 1}"
#     )

#     # Initialize the trainer
#     trainer = SFTTrainer(
#         model=model,
#         tokenizer=tokenizer,
#         train_dataset=train_dataset,
#         dataset_text_field="text",
#         max_seq_length=max_seq_length,
#         dataset_num_proc=4,
#         packing=False,
#         args=training_args
#     )

#     # Train
#     print(f"Training on {chunk_size} samples with learning rate {learning_rates[iteration]}")
#     trainer.train()





trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 4,
    packing = False, # Can make training 5x faster for short sequences.
    args = training_args
)



max_steps is given, it will override any value given in num_train_epochs


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,000,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 8
\        /    Total batch size = 32 | Total steps = 1,000
 "-____-"     Number of trainable parameters = 83,886,080


Step,Training Loss
1,2.1085
2,2.1465
3,2.1517
4,2.0723
5,1.9907
6,1.8153
7,1.7355
8,1.3975
9,1.241
10,0.9715
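
The log line `Total batch size = 32 | Total steps = 1,000` fixes how much of the training set the run actually sees. A small sketch of that arithmetic; the helper name `training_budget` is illustrative, and the single-GPU default follows the `Num GPUs = 1` line above:

```python
def training_budget(per_device_batch, grad_accum, max_steps, num_examples, num_gpus=1):
    """Return (effective batch size, examples seen, fraction of the dataset)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    examples_seen = effective_batch * max_steps
    return effective_batch, examples_seen, examples_seen / num_examples


# Values from the TrainingArguments cell: batch 8, accumulation 4, 1000 steps.
effective, seen, fraction = training_budget(8, 4, 1000, 1_000_000)
print(effective, seen, f"{fraction:.1%}")  # 32 32000 3.2%
```

The log reports 4 x 8 while the cell sets 8 x 4; both give an effective batch of 32, so `max_steps = 1000` covers 32,000 of the 1,000,000 training examples, well short of one epoch.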


## Inference

In [None]:
# Sample inference data point

test_dataset = dataset['test']


sample_ques = test_dataset['question'][0]
sample_ans = test_dataset['answer'][0]


In [None]:
# Running inference on single test
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
input_prompt = prompt.format(
        sample_ques, # ques
        sample_ans, # given answer
        "", # output - leave this blank for generation! The LLM will generate whether it is True or False
    )

print("Input Prompt:\n", input_prompt)
inputs = tokenizer(
    [input_prompt],
    return_tensors = "pt"
).to("cuda")



input_shape = inputs['input_ids'].shape
input_token_len = input_shape[1] # 1 because of batch
# outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
# You can get the whole generated text by uncommenting the line below
# text_generated = tokenizer.batch_decode(outputs, skip_special_tokens=True)




outputs = model.generate(
    **inputs,
    max_new_tokens = 64,
    use_cache = True,
    temperature = 0.1,     # Low temperature makes the output more deterministic
    top_p = 0.9,
    do_sample = True,      # Use sampling rather than beam search
    num_beams = 1,        # Set to 1 to disable beam search
    repetition_penalty = 1.2  # Add a repetition penalty
)

response = tokenizer.batch_decode([outputs[0][input_token_len:]], skip_special_tokens=True)
response


In [None]:
from transformers import pipeline
import pandas as pd
from tqdm import tqdm

# Omit the device argument when creating the pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=64,
    temperature=0.1,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Prepare the prompt generator
def generate_prompts(test_dataset):
    for i in range(len(test_dataset)):
        yield prompt.format(
            test_dataset['question'][i],
            test_dataset['answer'][i],
            ""
        )

# Collect results
results = []
test_dataset = dataset['test']

test_dataset_test = test_dataset.select(range(1000))

# Show progress with tqdm
for i, out in enumerate(tqdm(pipe(
    generate_prompts(test_dataset_test),
    batch_size=8  # Adjust to fit GPU memory
), total=len(test_dataset_test))):

    # Extract the generated text
    generated_text = out[0]['generated_text']

    # Extract the prediction (the text after "Output:")
    prediction = generated_text.split('Output:')[1].strip().lower()

    # Map the prediction to a label
    if 'true' in prediction:
        is_correct = 'True'
    elif 'false' in prediction:
        is_correct = 'False'
    else:
        is_correct = 'Mid'  # Placeholder label when the output cannot be parsed

    # Append to the results list
    results.append({
        'ID': i,
        'is_correct': is_correct
    })

# Build a DataFrame and save it
df = pd.DataFrame(results)
df.to_csv('submission.csv', index=False)

# Print summary statistics
print(f"\nTotal predictions: {len(results)}")
print(f"True predictions: {sum(1 for r in results if r['is_correct'] == 'True')}")
print(f"False predictions: {sum(1 for r in results if r['is_correct'] == 'False')}")
print(f"Mid predictions: {sum(1 for r in results if r['is_correct'] == 'Mid')}")
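
The substring checks in the loop above test for `'true'` before `'false'`, so a generation such as `False, not true` would be labeled `True`. A possible alternative, sketched here as a hypothetical helper named `parse_verdict` (not part of the original notebook), takes the first whole-word verdict after the `Output:` marker and returns `None` when nothing can be parsed:

```python
import re
from typing import Optional

# Whole-word match, so "untrue" does not count as "true".
_VERDICT = re.compile(r"\b(true|false)\b", re.IGNORECASE)


def parse_verdict(generated_text: str, marker: str = "Output:") -> Optional[bool]:
    """Return the first True/False verdict after the marker, or None."""
    _, found, tail = generated_text.partition(marker)
    if not found:
        return None  # marker missing: nothing reliable to parse
    match = _VERDICT.search(tail)
    if match is None:
        return None
    return match.group(1).lower() == "true"


print(parse_verdict("### Output:\nFalse, not true"))  # False
print(parse_verdict("### Output:\nTrue"))             # True
print(parse_verdict("### Output:\n42"))               # None
```

How to map `None` onto a submission label is left open here; the loop above uses the placeholder `'Mid'` for that case.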

In [None]:
# import pandas as pd
# import re

# # Create an empty list to store the results
# results = []

# # Loop over the test dataset
# for i in range(100): # Iterate over the first 100 test examples
#     # Get the question and answer
#     test_dataset = dataset['test']
#     sample_ques = test_dataset['question'][i]
#     sample_ans = test_dataset['answer'][i]

#     # Build the input prompt
#     input_prompt = prompt.format(
#       sample_ques,
#       sample_ans,
#       "",
#     )

#     # Run inference with the model
#     inputs = tokenizer(
#         [input_prompt],
#         return_tensors="pt"
#     ).to("cuda")

#     input_shape = inputs['input_ids'].shape
#     input_token_len = input_shape[1]

#     outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)

#     response = tokenizer.batch_decode(
#         [outputs[0][input_token_len:]],
#         skip_special_tokens=True
#     )

#     # Extract "True" or "False" using regular expression
#     match = re.search(r"^(True|False)", response[0])
#     if match:
#         is_correct = match.group(1)
#     else:
#         is_correct = "Unknown"  # Handle cases where the pattern isn't found

#     # Append the result to the list
#     results.append({
#         'ID': i,
#         'is_correct': is_correct  # Store the extracted True/False
#     })

# # Create a pandas DataFrame
# df = pd.DataFrame(results)

# # Save the DataFrame as a CSV file
# df.to_csv('submission.csv', index=False)

# print("CSV file created successfully!")

## Saving model

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

In [None]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference
