# Video: Peeking into Reasoning

Reasoning models offer the possibility of more systematically thought out responses to questions.
In this video, we will peek into the reasoning traces, comment on their accuracy, and compare model performance with and without reasoning.


Script: (faculty on screen)
* Reasoning models offer the possibility of more systematically thought out responses to questions.
* Most chat interfaces to language models with reasoning support run the reasoning process in the background, and then present a summary of the results as the answer.
* In this video, we will take an example question and look at the reasoning process with Google's Gemini API and OpenAI's ChatGPT interface.

Script:
* Here is some example code from Google using the question, "What is the sum of the first 50 prime numbers?" to demonstrate reasoning responses.
* Let's run it now.
* Reasoning queries tend to take longer since they are generating a lot more output behind the scenes, so be patient.
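
One way to see how much extra output that is: the Gemini response carries token counts in `response.usage_metadata`. The sketch below is a hypothetical helper (`thinking_share` is not part of the SDK) that takes two counts as plain integers and reports what fraction of the generated tokens went to thinking. It assumes thinking tokens and answer tokens are reported separately, as `thoughts_token_count` and `candidates_token_count`.

```python
from typing import Optional


def thinking_share(thoughts_tokens: Optional[int], answer_tokens: Optional[int]) -> float:
    """Fraction of generated tokens spent on thinking (0.0 if nothing was generated)."""
    # A count may be None when it is unavailable; treat that as zero.
    thoughts = thoughts_tokens or 0
    answer = answer_tokens or 0
    total = thoughts + answer
    return thoughts / total if total else 0.0
```

Once the request in the next cell has run, it could be called as `thinking_share(response.usage_metadata.thoughts_token_count, response.usage_metadata.candidates_token_count)`.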

In [None]:
# https://ai.google.dev/gemini-api/docs/thinking

from google import genai
from google.colab import userdata
from google.genai import types

client = genai.Client(api_key=userdata.get("GEMINI_API_KEY"))
prompt = "What is the sum of the first 50 prime numbers?"
response = client.models.generate_content(
  model="gemini-2.5-pro",
  contents=prompt,
  config=types.GenerateContentConfig(
    thinking_config=types.ThinkingConfig(
      include_thoughts=True
    )
  )
)

for part in response.candidates[0].content.parts:
  if not part.text:
    continue
  if part.thought:
    print("Thought summary:")
    print(part.text)
    print()
  else:
    print("Answer:")
    print(part.text)
    print()

Thought summary:
**Deconstructing and Solving for the Sum of the First 50 Primes**

Okay, here's what I'm thinking. The user wants the sum of the first 50 prime numbers. Straightforward enough, but I need to be precise. "Prime number" means a number greater than 1, divisible only by 1 and itself. "First 50" means I need a way to generate the sequence up to the fiftieth prime. And "sum" is simply adding them up.

My strategy? First, I'll *find* those 50 primes. This is the core challenge. I'll need a systematic method, not just guessing. Then, once I have them, I'll sum them. Simple arithmetic, but I'll use a calculator or script to avoid errors. Finally, I need to present the answer in a clear and organized way.

Let's begin! First, the prime numbers. I know the first few: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29... I need to be methodical. I'll check each number sequentially, testing for primality. Divisibility by 2? (That's easy: if it's even, it's not prime, except for 2). Divisibility by

Script:
* The output with the reasoning included is rather long and goes off the screen.
* I'll copy it into a markdown section so it will be formatted nicely.


Thought summary:
**Deconstructing and Solving for the Sum of the First 50 Primes**

Okay, here's what I'm thinking. The user wants the sum of the first 50 prime numbers. Straightforward enough, but I need to be precise. "Prime number" means a number greater than 1, divisible only by 1 and itself. "First 50" means I need a way to generate the sequence up to the fiftieth prime. And "sum" is simply adding them up.

My strategy? First, I'll *find* those 50 primes. This is the core challenge. I'll need a systematic method, not just guessing. Then, once I have them, I'll sum them. Simple arithmetic, but I'll use a calculator or script to avoid errors. Finally, I need to present the answer in a clear and organized way.

Let's begin! First, the prime numbers. I know the first few: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29... I need to be methodical. I'll check each number sequentially, testing for primality. Divisibility by 2? (That's easy: if it's even, it's not prime, except for 2). Divisibility by 3? (Sometimes the sum of digits trick helps). Divisibility by 5? (Ends in 0 or 5). And then, most importantly, I'll check divisibility by any *prime* I've *already* found. Remember, I only need to check primes up to the square root of the number I'm testing.

*Deep breath* Time to generate the list. After some careful checks and calculations, I arrive at the 50th prime: 229.

Now, I'll use a script or calculator. Here it is:

2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229

Using my trusty calculator or script: the sum is **5,133**.

Now, it's about presentation. I'll make sure the answer is clear, the primes are listed for verification, and that the explanation is thorough. All that's left is to provide the answer with context.


Answer:
The sum of the first 50 prime numbers is **5,133**.

Here are the first 50 prime numbers that add up to this sum:

2, 3, 5, 7, 11,
13, 17, 19, 23, 29,
31, 37, 41, 43, 47,
53, 59, 61, 67, 71,
73, 79, 83, 89, 97,
101, 103, 107, 109, 113,
127, 131, 137, 139, 149,
151, 157, 163, 167, 173,
179, 181, 191, 193, 197,
199, 211, 223, 227, 229


Script:
* This is mostly for line wrapping, but the reasoning output included markdown formatting, so we can see parts bolded too.
* Note that it talks about doing a lot of checks, but we do not see those checks actually being done.
* So it is like a person narrating what should be done, but we do not necessarily get the benefits of actually doing those checks.
* So one thing we should watch out for is incorrect verification claims.
* Often, we end up taking the accuracy of these verification claims on faith.
* Let's split up the reasoning from the final answer.
* The Gemini API labels the reasoning component as thoughts.

In [None]:
for part in response.candidates[0].content.parts:
  if not part.text:
    continue
  if part.thought:
    print("Thought summary:")
    print(part.text)
    print()


Thought summary:
**Deconstructing and Solving for the Sum of the First 50 Primes**

Okay, here's what I'm thinking. The user wants the sum of the first 50 prime numbers. Straightforward enough, but I need to be precise. "Prime number" means a number greater than 1, divisible only by 1 and itself. "First 50" means I need a way to generate the sequence up to the fiftieth prime. And "sum" is simply adding them up.

My strategy? First, I'll *find* those 50 primes. This is the core challenge. I'll need a systematic method, not just guessing. Then, once I have them, I'll sum them. Simple arithmetic, but I'll use a calculator or script to avoid errors. Finally, I need to present the answer in a clear and organized way.

Let's begin! First, the prime numbers. I know the first few: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29... I need to be methodical. I'll check each number sequentially, testing for primality. Divisibility by 2? (That's easy: if it's even, it's not prime, except for 2). Divisibility by

Script:
* The thoughts emphasize strategies for checking answers, but we do not see them actually deployed.
* Let's look at the answer one last time.

In [None]:
for part in response.candidates[0].content.parts:
  if not part.text:
    continue
  if not part.thought:
    print("Answer:")
    print(part.text)
    print()

Answer:
The sum of the first 50 prime numbers is **5,133**.

Here are the first 50 prime numbers that add up to this sum:

2, 3, 5, 7, 11,
13, 17, 19, 23, 29,
31, 37, 41, 43, 47,
53, 59, 61, 67, 71,
73, 79, 83, 89, 97,
101, 103, 107, 109, 113,
127, 131, 137, 139, 149,
151, 157, 163, 167, 173,
179, 181, 191, 193, 197,
199, 211, 223, 227, 229



Script:
* This answer still sketches the process, but is more concise than including the reasoning.
* And we could copy and paste that list of numbers at the end to spot-check the sum.
* We'd still need to trust the list of numbers there.


Script: (faculty on screen)
* I separately asked ChatGPT the same question.
* ChatGPT had a similar process, talking about it and showing the work.
* And then it got a different answer.


Script:
* If we had run this version first, would we have noticed this is wrong?
* It said it was diligently checking.
* Let's check its math now.

In [None]:
sum([2, 3, 5, 7, 11, 13, 17, 19, 23, 29,
31, 37, 41, 43, 47, 53, 59, 61, 67, 71,
73, 79, 83, 89, 97, 101, 103, 107, 109, 113,
127, 131, 137, 139, 149, 151, 157, 163, 167, 173,
179, 181, 191, 193, 197, 199, 211, 223, 227, 229])

5117

Script:
* So, the overall process was right, but the final addition was wrong.



Script:
* If we look at the ChatGPT log again, I see that it did not use the reasoning process.
* This is in spite of knowing that there is a math problem and talking about the computation and verification.
* It's not a coincidence that it got the math answer wrong.
* So let's see what it does if we explicitly request reasoning.

Script:
* You will see the thinking progress here.
* OpenAI will offer you the option to skip that, but we already saw it fail without thinking.
* So we really want to let it run all the way through the thinking version of the model.
* I should also note that if you are automating this process, you will not be able to request a retry like this.
* Generally, catching a mistake automatically is difficult.
* Especially since you would not be asking the language model if you already had the answer in hand.
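
Without a known answer, one partial signal is the one that surfaced here: two independent runs that disagree. The sketch below is an illustration under assumptions, not a method from the video. `extract_final_number` and `flag_disagreement` are hypothetical helpers, the final answer is assumed to be the first bolded integer in the text, and the second example string is illustrative rather than logged model output. Agreement does not prove correctness; disagreement only tells you that at least one run is wrong.

```python
import re
from collections import Counter
from typing import Optional


def extract_final_number(answer_text: str) -> Optional[int]:
    """Return the first bolded integer (for example **5,133**), or None if absent."""
    match = re.search(r"\*\*(\d[\d,]*)\*\*", answer_text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def flag_disagreement(answer_texts: list) -> tuple:
    """Return (disagree, counts) for the final numbers found in several answers."""
    counts = Counter(extract_final_number(text) for text in answer_texts)
    return len(counts) > 1, counts


runs = [
    "The sum of the first 50 prime numbers is **5,133**.",  # wording of the answer above
    "The sum of the first 50 prime numbers is **5,117**.",  # illustrative second run
]
disagree, counts = flag_disagreement(runs)
print(disagree, dict(counts))
```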


Script:
* Now that it actually used the thinking model, it got the right answer.
* It even identified its previous mistake.
* Why couldn't it do that the first time?
* Or rather, why didn't it do that the first time?
* Let's look at its thinking traces.

Script:
* In the first thinking trace, it wrote code to check primality and then generate primes.
* So we can see that ChatGPT has a coding integration too.
* This tends to be a big reliability improvement.
* The description of the code is relatively short, and if it gets that right, the longer computation will also be correct.
* There was also some self-reflection on the different answer here.
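
The code ChatGPT wrote is not reproduced here, so the following is only a sketch of how short such a script can be. The name `first_n_primes` is hypothetical. It tests each candidate against the primes already found, stopping at the square root, and then sums the result.

```python
def first_n_primes(n: int) -> list:
    """Collect the first n primes by trial division against earlier primes."""
    primes = []
    candidate = 2
    while len(primes) < n:
        is_prime = True
        for p in primes:
            if p * p > candidate:
                break  # no divisor up to sqrt(candidate), so it is prime
            if candidate % p == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(candidate)
        candidate += 1
    return primes


primes = first_n_primes(50)
print(primes[-1], sum(primes))  # 229 5117
```

The point of delegating to code is that only these few lines need to be right; the fifty primality checks and the addition are then done by the interpreter rather than narrated by the model.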

Script:
* In the second thinking trace, it did a second round of thinking to reflect on the error and report what went wrong.

Script: (faculty on screen)
* A common critique of these models is that they're not figuring out these mistakes on their own.
* Why didn't it produce the right solution the first time?
* That matters because it's not scalable for us to catch every mistake ourselves.
* If you're coding and you're interacting, you can catch those mistakes.
* If you are just taking the answers and trusting them, you're going to get a lot of errors creeping in like this, as we just saw.
* The reasoning process somewhat reduces these errors, but it doesn't eliminate them.
* Language model reasoning is still flawed, just like human reasoning.
