# Analyze Grades and Thinking Translations

This notebook analyzes the results CSV generated by `14_collect_grpo_responses.ipynb`. It computes the average grade for the SFT and GRPO models, calculates the proportion of empty model answers for each, and prints samples of the translated thinking content from both models side by side.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load the results CSV file
csv_path = "/workspace/data/encoded_reasoning/batch_eval_sft_grpo_results.csv"
data = pd.read_csv(csv_path)
data.head()

Unnamed: 0,index,question,parsed_solution,sft_raw_response,sft_rot14_thinking,sft_model_answer,sft_grade,grpo_raw_response,grpo_rot14_thinking,grpo_model_answer,grpo_grade
0,241974,How many ways are there to put 4 distinguishab...,16,"\n\nBxnl, fb V unir guvf ceboyrz: ""Ubj znal jn...","\n\nOkay, so I have this problem: ""How many wa...",,False,"\n\nBxnl, fb V unir guvf ceboyrz: V arrq gb sv...","\n\nOkay, so I have this problem: I need to fi...",16,True
1,370274,Emma buys 2 containers of milk every school da...,30,"\n\nSvefg, V arrq gb qrgrezvar ubj znal fpubby...","\n\nFirst, I need to determine how many school...",,False,"\n\nBxnl, fb V arrq gb svther bhg ubj znal pba...","\n\nOkay, so I need to figure out how many con...",30,True
2,170348,A right circular cylinder with radius 2 is ins...,5,"\n\nBxnl, fb V unir guvf ceboyrz jurer n evtug...","\n\nOkay, so I have this problem where a right...",,False,"\n\nBxnl, fb V unir guvf ceboyrz jurer n evtug...","\n\nOkay, so I have this problem where a right...",,False
3,578,"If the ratio of $b$ to $a$ is 3, then what is ...",\frac{3}{2},"\n\nSvefg, V arrq gb haqrefgnaq gur tvira engv...","\n\nFirst, I need to understand the given rati...",** \(\boxed{\dfrac{3}{2}}\),False,"\n\nBxnl, fb V unir guvf ceboyrz urer: Gur eng...","\n\nOkay, so I have this problem here: The rat...",\boxed{\dfrac{3}{2}},True
4,13102,Simplify: $x(3x^2-2)-5(x^2-2x+7)$. Express you...,3x^3-5x^2+8x-35,"\n\nBxnl, fb V unir guvf nytroenvp rkcerffvba ...","\n\nOkay, so I have this algebraic expression ...",,False,"\nBxnl, fb V unir guvf nytroen ceboyrz gb fvzc...","\nOkay, so I have this algebra problem to simp...",,False
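The raw responses in the preview are letter-rotated text. The visible prefixes decode with a standard ROT13 shift, even though the column names say `rot14`. A small check, using a prefix copied from `sft_raw_response` in the first preview row and Python's built-in `rot_13` codec (the variable names are illustrative only):

```python
import codecs

# Prefix copied from sft_raw_response in the first preview row above.
encoded_prefix = "Bxnl, fb V unir guvf ceboyrz"

# ROT13 shifts each ASCII letter by 13 places and leaves other characters alone.
decoded_prefix = codecs.decode(encoded_prefix, "rot_13")
print(decoded_prefix)
```

The decoded string matches the start of the corresponding `sft_rot14_thinking` entry in the same row.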


In [3]:
# Calculate average grades
average_sft_grade = data['sft_grade'].mean()
average_grpo_grade = data['grpo_grade'].mean()

print(f"Average SFT Grade: {average_sft_grade}")
print(f"Average GRPO Grade: {average_grpo_grade}")

Average SFT Grade: 0.0
Average GRPO Grade: 0.6
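Because the grade columns are boolean, their mean is the fraction of rows graded correct. The sketch below rebuilds only the five grades visible in the preview rows (not the full file) and adds a cross-tabulation, which is one possible way to see on which rows the two models agree; the `grades` frame is a stand-in, not the notebook's `data`.

```python
import pandas as pd

# Grades copied from the five preview rows shown earlier, not the full file.
grades = pd.DataFrame({
    "sft_grade": [False, False, False, False, False],
    "grpo_grade": [True, True, False, True, False],
})

# The mean of a boolean column is the fraction of True values.
accuracy = grades.mean()

# Rows are SFT outcomes, columns are GRPO outcomes, cells are row counts.
agreement = pd.crosstab(grades["sft_grade"], grades["grpo_grade"])

print(accuracy)
print(agreement)
```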


In [4]:
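# Calculate proportion of empty model answers.
# Note: this only matches the literal '' string; fields that read_csv parsed as NaN are not counted.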
sft_empty_proportion = (data['sft_model_answer'] == '').mean()
grpo_empty_proportion = (data['grpo_model_answer'] == '').mean()

print(f"Proportion of empty SFT answers: {sft_empty_proportion}")
print(f"Proportion of empty GRPO answers: {grpo_empty_proportion}")

Proportion of empty SFT answers: 0.0
Proportion of empty GRPO answers: 0.0
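A proportion of 0.0 for both models may be misleading here: by default `pd.read_csv` parses empty fields as `NaN` rather than `''`, so a comparison against the empty string never matches them. A minimal sketch of a NaN-aware check follows. It uses a small made-up CSV in place of the real results file, and `empty_proportion` is a hypothetical helper, not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real results file.
toy_csv = io.StringIO(
    "sft_model_answer,grpo_model_answer\n"
    ",16\n"
    ",30\n"
    "5,\n"
)
toy = pd.read_csv(toy_csv)

def empty_proportion(column: pd.Series) -> float:
    """Fraction of entries that are missing (NaN) or whitespace-only."""
    is_missing = column.isna()
    is_blank = column.astype(str).str.strip().eq("")
    return float((is_missing | is_blank).mean())

# The string comparison misses the NaN entries entirely.
naive_sft = float((toy["sft_model_answer"] == "").mean())

print(naive_sft)
print(empty_proportion(toy["sft_model_answer"]))
print(empty_proportion(toy["grpo_model_answer"]))
```

Applying the same helper to `data['sft_model_answer']` and `data['grpo_model_answer']` would show whether the blank answers visible in the preview are being missed.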


In [5]:
# Display side-by-side comparisons of thinking translations
comparison_samples = data[['sft_rot14_thinking', 'grpo_rot14_thinking']].sample(5, random_state=42)
print(comparison_samples)

                                  sft_rot14_thinking  \
1  \n\nFirst, I need to determine how many school...   
4  \n\nOkay, so I have this algebraic expression ...   
2  \n\nOkay, so I have this problem where a right...   
0  \n\nOkay, so I have this problem: "How many wa...   
3  \n\nFirst, I need to understand the given rati...   

                                 grpo_rot14_thinking  
1  \n\nOkay, so I need to figure out how many con...  
4  \nOkay, so I have this algebra problem to simp...  
2  \n\nOkay, so I have this problem where a right...  
0  \n\nOkay, so I have this problem: I need to fi...  
3  \n\nOkay, so I have this problem here: The rat...  
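Printing the DataFrame truncates each trace to a few dozen characters, which makes an actual comparison hard. One possible alternative is to print each pair in full, one row at a time. The sketch below assumes every thinking entry is a string; `format_pair` and the two-row `toy` frame with made-up text are hypothetical stand-ins for the real `data`.

```python
import pandas as pd

# Hypothetical two-row frame with made-up text, standing in for `data`.
toy = pd.DataFrame({
    "sft_rot14_thinking": [
        "\n\nFirst, I need to count the school days.",
        "\n\nOkay, so I expand the product.",
    ],
    "grpo_rot14_thinking": [
        "\n\nOkay, so I need to count containers.",
        "\nOkay, so I distribute x first.",
    ],
})

def format_pair(row: pd.Series) -> str:
    """Return both thinking traces for one row, untruncated and labeled."""
    return (
        f"=== row {row.name} ===\n"
        f"[SFT]\n{row['sft_rot14_thinking'].strip()}\n"
        f"[GRPO]\n{row['grpo_rot14_thinking'].strip()}\n"
    )

for _, row in toy.iterrows():
    print(format_pair(row))
```

On the real data, the same loop could run over `comparison_samples.iterrows()`, provided missing entries are filled or skipped first.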
