Benchmark Results #31

codelion (Maintainer) · 2024-09-24T11:59:59Z

Entropy Decoding and CoT Decoding on GSM8k with Qwen2.5-0.5B-Instruct Model

codelion (Maintainer, Author) · 2024-09-24T12:04:43Z

CoT with Reflection on GSM8k

codelion (Maintainer, Author) · 2024-09-24T12:08:32Z

Reread (re2) on LiveBench

model	average	coding	data_analysis	instruction_following	language	math	reasoning
gpt-4o-mini-2024-07-18	44.1	42.5	42.7	65.4	33.8	44.1	36.0
re2-gpt-4o-mini-2024-07-18	43.4	40.9	46.4	68.6	35.8	42.3	26.7
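
To make the two rows easier to compare, the per-category differences can be computed directly. This is a minimal sketch that uses only the numbers from the table above; the variable names are illustrative and are not part of LiveBench or optillm.

```python
# Scores copied from the table above, in the same column order.
categories = ["average", "coding", "data_analysis", "instruction_following",
              "language", "math", "reasoning"]
baseline = [44.1, 42.5, 42.7, 65.4, 33.8, 44.1, 36.0]  # gpt-4o-mini-2024-07-18
re2 = [43.4, 40.9, 46.4, 68.6, 35.8, 42.3, 26.7]       # re2-gpt-4o-mini-2024-07-18

# Difference in points (re2 minus baseline), rounded to one decimal place.
deltas = {c: round(r - b, 1) for c, b, r in zip(categories, baseline, re2)}

for name, delta in deltas.items():
    print(f"{name:<24}{delta:+.1f}")
```

On these numbers, re2 improves data_analysis, instruction_following, and language, and regresses on coding, math, and reasoning, with the largest drop (9.3 points) in reasoning.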

codelion (Maintainer, Author) · 2024-09-24T12:14:17Z

Scatter plot showing how test-time compute scales across different approaches with gpt-4o-mini on AIME 2024

You can see the original illustration here

2 replies

lorepieri8 Oct 4, 2024

On the y axis, is that the accuracy increase vs the baseline model, or is it the absolute accuracy? If the latter, how does the baseline model score? Any other metadata from these tests that you can share would be interesting.

codelion Oct 4, 2024
Maintainer Author

The baseline number for gpt-4o-mini was ~5%. I used the code from this repo to run the tests. The accuracy reported by OpenAI for the latest o1-mini model starts at around 20% and goes up to around 80%.

codelion (Maintainer, Author) · 2024-10-09T21:54:35Z

Results on the FRAMES benchmark with the memory plugin.

Model	Accuracy
readurls&memory-gpt-4o-mini	61.29
gpt-4o-mini	50.61
readurls&memory-Gemma2-9b	30.1
Gemma2-9b	5.1
Gemma2-27b	30.8
Gemini Flash 1.5	66.5
Gemini Pro 1.5	72.9
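
The effect of the plugins can be read off by pairing each plugin run with its base model. The following is a rough sketch using only the accuracy values from the table above; `PREFIX` and `gains` are hypothetical names chosen for illustration, not optillm identifiers.

```python
# Accuracy values copied from the table above.
accuracy = {
    "readurls&memory-gpt-4o-mini": 61.29,
    "gpt-4o-mini": 50.61,
    "readurls&memory-Gemma2-9b": 30.1,
    "Gemma2-9b": 5.1,
}

PREFIX = "readurls&memory-"  # label used in the table for plugin runs

# Gain in points: plugin run minus the matching base model.
gains = {
    name[len(PREFIX):]: round(score - accuracy[name[len(PREFIX):]], 2)
    for name, score in accuracy.items()
    if name.startswith(PREFIX)
}

for model, gain in gains.items():
    print(f"{model}: +{gain:.2f} points")
```

On these figures the gain is 10.68 points for gpt-4o-mini and 25.00 points for Gemma2-9b, which brings Gemma2-9b with the plugins (30.1) close to the larger Gemma2-27b (30.8).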

codelion (Maintainer, Author) · 2024-11-19T01:57:49Z

Results on the AIME 2024 benchmark with optillm (eval script)

AIME (2024) pass@1

Model	Score
o1-mini	56.67
o1-preview	40.00
gemini-exp-1114	36.67
claude-3-5-sonnet-20241022	20.00
gemini-1.5-pro-002	20.00
gemini-1.5-flash-002	16.67
gpt-4o	10.00
qwen2.5:14b-instruct-fp16 (with ollama)	10.00
gpt-4o-mini	6.67
claude-3-5-haiku-20241022	6.67
llama3.1:8b-instruct-fp16 (with ollama)	6.67
Qwen/Qwen2.5-0.5B-Instruct (with optillm)	0.00
meta-llama/Llama-3.2-1B-Instruct (with optillm)	0.00
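
For a sense of scale, the pass@1 percentages can be converted back to a number of solved problems. This sketch assumes the evaluation covers the 30 problems of AIME 2024 (AIME I and II, 15 each) with one attempt per problem; that assumption and the helper name `solved_count` are not taken from the eval script itself.

```python
N_PROBLEMS = 30  # assumption: AIME 2024 I and II, 15 problems each

# A subset of the scores from the table above (pass@1, in percent).
scores = {
    "o1-mini": 56.67,
    "o1-preview": 40.00,
    "gemini-exp-1114": 36.67,
    "claude-3-5-sonnet-20241022": 20.00,
    "gemini-1.5-flash-002": 16.67,
    "gpt-4o": 10.00,
    "gpt-4o-mini": 6.67,
}

def solved_count(score_pct, n=N_PROBLEMS):
    """Nearest whole number of problems matching a percentage score."""
    return round(score_pct / 100 * n)

solved = {model: solved_count(s) for model, s in scores.items()}

for model, k in solved.items():
    # Sanity check: the count reproduces the reported two-decimal score.
    assert round(100 * k / N_PROBLEMS, 2) == scores[model]
    print(f"{model}: {k}/{N_PROBLEMS}")
```

Under that assumption, each additional solved problem is worth about 3.33 points, so gpt-4o-mini at 6.67 corresponds to 2 of 30 problems and o1-mini at 56.67 to 17 of 30. Small gaps between models in this table amount to one or two problems.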
