# Can LLMs Do Math?

## Your First Hands-On Exploration of Large Language Models

Welcome! In this notebook, you'll discover what happens when we ask an LLM to solve math problems. 

**Guiding Question:** Are LLMs *computing* answers, or *predicting* them?

## Setup: Import Libraries and Load the Model

First, we'll import the FAIR-LLM framework and load a small, fast language model.

In [1]:
import os
os.environ['HF_HUB_DISABLE_PROGRESS_BARS'] = '1'
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = '1'
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

import asyncio
from dotenv import load_dotenv
from fairlib.modules.mal.huggingface_adapter import HuggingFaceAdapter
from fairlib.core.message import Message

# Load your Hugging Face token from .env file
load_dotenv()
token = os.getenv("HUGGING_FACE_HUB_TOKEN")

if not token:
    print("Warning: HUGGING_FACE_HUB_TOKEN not found in .env file!")
    print("Please follow the README instructions to set up your token.")
else:
    print("Token loaded successfully!")

Token loaded successfully!


### Load the Language Model

We're using a small Dolphin model (only 3B parameters) that can run on your laptop.

**Note:** The first time you run this, it will download the model and cache it on your device. This will take a while, so be patient.

In [2]:
# Load the model
print("Loading language model...")
llm = HuggingFaceAdapter(model_name="dolphin3-qwen25-3b", auth_token=token)
print("Model loaded! Ready to test.")

Loading language model...
🔧 Loading HuggingFace model: cognitivecomputations/Dolphin3.0-Qwen2.5-3b (quantized=False, stream=False)


Some parameters are on the meta device because they were offloaded to the disk.
Device set to use mps


Model loaded! Ready to test.


## Experiment 1: Simple Math (Common Problems)

Let's start with arithmetic that appears MILLIONS of times in training data.

**Hypothesis:** The LLM will succeed because it has "seen" these exact problems before.

In [3]:
def test_math(question, correct_answer):
    """Helper function to test the LLM on a math problem"""
    messages = [Message(role="user", content=question)]
    response = llm.invoke(messages)
    
    print(f"\nQuestion: {question}")
    print(f"LLM Answer: {response.content}")
    print(f"Correct Answer: {correct_answer}")
    
    # Simple check: does the answer contain the right number?
    # Strip thousands separators first so "199,791" still matches 199791
    if str(correct_answer) in response.content.replace(",", ""):
        print("✓ SUCCESS!")
    else:
        print("✗ FAILED!")
    
    return response.content
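
A substring test like the one in `test_math` can mislabel answers: `4` is "found" inside `14`, and a correct `199,791` only matches `199791` once commas are removed. A stricter comparison, sketched here under the assumption that the last number in the reply is the model's final answer, extracts that number and compares it exactly. The helper names `extract_last_number` and `is_correct` are hypothetical and not part of FAIR-LLM; negative numbers and decimals are not handled.

```python
import re

def extract_last_number(text: str):
    """Return the last non-negative integer in the text, or None if there is none."""
    # Digit runs, optionally with comma thousands separators (e.g. "52,510,977")
    matches = re.findall(r"\d+(?:,\d{3})*", text)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))

def is_correct(response_text: str, correct_answer: int) -> bool:
    """Compare the last number in the reply with the expected answer."""
    return extract_last_number(response_text) == correct_answer

print(is_correct("2 + 2 equals 4.", 4))
print(is_correct("The answer is 14.", 4))
```

This is still a heuristic: a reply that restates the question after giving the answer would be judged on the wrong number.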

In [4]:
# Test 1: Very simple addition
test_math("2 + 2?", 4)


Question: 2 + 2?
LLM Answer: 2 + 2 equals 4.
Correct Answer: 4
✓ SUCCESS!


'2 + 2 equals 4.'

In [5]:
# Test 2: Simple multiplication
test_math("10 × 10?", 100)


Question: 10 × 10?
LLM Answer: 10 × 10 equals 100.
Correct Answer: 100
✓ SUCCESS!


'10 × 10 equals 100.'

In [6]:
# Test 3: Basic exponent
test_math("What is 5 squared?", 25)


Question: What is 5 squared?
LLM Answer: 5 squared is 25.
Correct Answer: 25
✓ SUCCESS!


'5 squared is 25.'

### Reflection Question 1

**Did the LLM succeed on these simple problems?**

Think about:
- How many times has "2 + 2 = 4" appeared on the internet?
- Is the LLM *computing* this answer, or *remembering* it from training data?

Write your thoughts here:
```
Your answer:


```

## Experiment 2: Complex Math (Uncommon Problems)

Now let's try calculations that are UNLIKELY to appear in training data.

**Hypothesis:** The LLM will struggle because these specific calculations probably weren't in its training data.

In [7]:
# Test 4: Uncommon multiplication
test_math("237 × 843?", 199791)


Question: 237 × 843?
LLM Answer: 237 multiplied by 843 equals 200,191.
Correct Answer: 199791
✗ FAILED!


'237 multiplied by 843 equals 200,191.'

In [8]:
# Test 5: Larger uncommon multiplication
test_math("8347 × 6291?", 52510977)


Question: 8347 × 6291?
LLM Answer: To calculate 8347 × 6291, you would multiply each digit of the second number by each digit of the first number, starting from the rightmost digit, and then add the results together. Here's how you would do it:

```
     8347
   × 6291
   _______
     74734   (7 × 1, 7 × 9, 7 × 2, 7 × 6)
   156940    (4 × 1, 4 × 9, 4 × 2, 4 × 6)
  500820     (3 × 1, 3 × 9, 3 × 2, 3 × 6)
+490840      (2 × 1, 2 × 9, 2 × 2, 2 × 6)
_______
  52856977
```

So, 8347 × 6291 = 52856977.
Correct Answer: 52510977
✗ FAILED!


"To calculate 8347 × 6291, you would multiply each digit of the second number by each digit of the first number, starting from the rightmost digit, and then add the results together. Here's how you would do it:\n\n```\n     8347\n   × 6291\n   _______\n     74734   (7 × 1, 7 × 9, 7 × 2, 7 × 6)\n   156940    (4 × 1, 4 × 9, 4 × 2, 4 × 6)\n  500820     (3 × 1, 3 × 9, 3 × 2, 3 × 6)\n+490840      (2 × 1, 2 × 9, 2 × 2, 2 × 6)\n_______\n  52856977\n```\n\nSo, 8347 × 6291 = 52856977."

In [9]:
# Test 6: Multi-step calculation
test_math("(456 + 789) × 234?", 291330)


Question: (456 + 789) × 234?
LLM Answer: (456 + 789) × 234 = 1245 × 234 = 291,530
Correct Answer: 291330
✗ FAILED!


'(456 + 789) × 234 = 1245 × 234 = 291,530'

### Reflection Question 2

**What happened with the complex problems?**

Think about:
- Did the LLM get the exact right answer?
- If it was wrong, was it *close*? Or completely off?
- What does this tell you about how the LLM is "solving" these problems?

Write your thoughts here:
```
Your answer:


```

## The Big Reveal: What's Really Happening?

Let's verify these answers with actual computation (Python's built-in integer arithmetic):

In [10]:
print("Let's verify with REAL computation:\n")
print(f"2 + 2 = {2 + 2}")
print(f"10 × 10 = {10 * 10}")
print(f"5² = {5**2}")
print(f"\n237 × 843 = {237 * 843}")
print(f"8347 × 6291 = {8347 * 6291}")
print(f"(456 + 789) × 234 = {(456 + 789) * 234}")

Let's verify with REAL computation:

2 + 2 = 4
10 × 10 = 100
5² = 25

237 × 843 = 199791
8347 × 6291 = 52510977
(456 + 789) × 234 = 291330
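
To put a number on how close the wrong answers in Experiment 2 were, the sketch below compares each LLM answer recorded above with the true value. The helper `relative_error` is introduced here for illustration only.

```python
# (correct answer, LLM answer) pairs; LLM answers copied from the Experiment 2 outputs
results = {
    "237 × 843": (237 * 843, 200191),
    "8347 × 6291": (8347 * 6291, 52856977),
    "(456 + 789) × 234": ((456 + 789) * 234, 291530),
}

def relative_error(correct: int, guess: int) -> float:
    """Size of the error as a fraction of the correct answer."""
    return abs(guess - correct) / correct

for problem, (correct, guess) in results.items():
    err = relative_error(correct, guess)
    print(f"{problem}: off by {abs(guess - correct):,} ({err:.3%})")
```

Every guess lands within one percent of the true value, yet none is exact. The answers have the right "shape" without being the result of a calculation.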


## Understanding What's Happening

### The LLM's Process:

When you ask "What is 237 × 843?", the LLM:

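1. **Tokenizes** the input, roughly like `["What", "is", "237", "×", "843", "?"]` (the exact split depends on the model's tokenizer, and numbers are often broken into smaller pieces)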
2. **Uses self-attention** to look at all tokens simultaneously
3. **Predicts** the most likely next tokens based on patterns it saw during training
4. **Generates** a sequence of digits that "looks like" an answer
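
To make the tokenization step concrete, here is a toy tokenizer that splits numbers into single digits, as some LLM tokenizers do. It is a simplified stand-in written for this notebook, not the tokenizer used by the Dolphin model; real tokenizers use learned subword vocabularies. The function name `toy_tokenize` is hypothetical.

```python
import re

def toy_tokenize(text: str) -> list:
    """Split text into words, single digits, and individual symbols."""
    # Alternatives, in order: a run of letters | one digit | one other non-space character
    return re.findall(r"[A-Za-z]+|\d|[^\sA-Za-z\d]", text)

tokens = toy_tokenize("What is 237 × 843?")
print(tokens)
```

With a split like this, the model never receives `237` as a quantity. It receives a sequence of symbols and has to predict which symbols come next.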

### It's NOT:
- ❌ Applying multiplication algorithms
- ❌ Using a calculator
- ❌ Computing the result

### It IS:
- ✅ Pattern matching from training data
- ✅ Predicting "what numbers usually follow this pattern"
- ✅ Making educated guesses based on statistical patterns

## Experiment 3: Let's Break It Further

Try asking the LLM to explain its reasoning:

In [11]:
messages = [
    Message(
        role="user",
        content="Solve 8347 × 6291 step by step. Show your work."
    )
]

response = llm.invoke(messages)
print(f"LLM's 'Work':\n{response.content}")
print(f"\nActual answer: {8347 * 6291}")

LLM's 'Work':
Step 1: Multiply the ones place digits
7 × 1 = 7

Step 2: Multiply the tens place digits
4 × 9 = 36
Write down 6 and carry over the 3 to the next step

Step 3: Multiply the hundreds place digits
3 × 2 = 6
Add the carried over 3 to get 9
Write down 9

Step 4: Multiply the thousands place digits
7 × 6 = 42
Write down 42

Step 5: Multiply the ten-thousands place digits
8 × 1 = 8
Write down 8

Step 6: Multiply the hundred-thousands place digits
3 × 2 = 6
Write down 6

Step 7: Multiply the millions place digits
8 × 6 = 48
Write down 48

Step 8: Add all the products together
84827
6966
------
8347
--------
75206837

The final answer is 8347 × 6291 = 52646837

Actual answer: 52510977
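
For comparison, this is what the schoolbook method looks like when every step is actually carried out. The sketch assumes non-negative integers; the function name `long_multiply` is chosen here for illustration.

```python
def long_multiply(a: int, b: int):
    """Multiply a by b the schoolbook way; return (partial_products, total)."""
    partials = []
    # Walk through the digits of b from right to left (ones, tens, hundreds, ...)
    for place, digit_char in enumerate(reversed(str(b))):
        # Multiply a by this one digit, then shift by the digit's place value
        partials.append(a * int(digit_char) * 10 ** place)
    return partials, sum(partials)

partials, total = long_multiply(8347, 6291)
for p in partials:
    print(f"{p:>10}")
print("-" * 10)
print(f"{total:>10}")
```

There is one partial product per digit of the second number, and their sum is the answer. The LLM's "work" above has the same visual layout but none of these values.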


### Reflection Question 3

**Did the LLM's "step by step" work actually compute the right answer?**

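This reveals something profound: The LLM can generate text that *looks like* mathematical reasoning, but it's not actually performing the computation! Look closely: the steps skip most of the digit pairs that long multiplication requires, and the sum written in Step 8 (75206837) doesn't match the final answer it reports (52646837). Neither equals the true product, 52510977.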

Write your thoughts:
```
Your answer:


```

## The Key Insight

### LLMs Don't Compute - They Predict!

**Simple problems:** The exact pattern (`2 + 2 = 4`) appeared millions of times in training data
→ LLM *remembers* and succeeds 

**Complex problems:** The exact pattern (`8347 × 6291 = 52510977`) probably never appeared in training data  
→ LLM *guesses* based on "big number × big number = bigger number" pattern  
→ LLM fails 

### This is Pattern Matching, Not Calculation

The LLM is:
- Predicting the next token (number)
- Using statistical patterns from training
- NOT applying mathematical algorithms
- NOT using a calculator
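
The contrast between "remembering" and "guessing" can be caricatured with a lookup table. The sketch below is deliberately crude: the training lines and counts are made up, and a real LLM generalizes far beyond a lookup table (it got `15 × 23` right, for example). It only illustrates the idea that frequently seen patterns come back correctly while unseen ones get a plausible-looking guess.

```python
from collections import Counter

# A tiny made-up "training corpus" of arithmetic facts
training_text = ["2 + 2 = 4"] * 1000 + ["10 × 10 = 100"] * 500 + ["5 × 5 = 25"] * 300

# Count which answer followed each prompt
seen = {}
for line in training_text:
    prompt, answer = line.split(" = ")
    seen.setdefault(prompt, Counter())[answer] += 1

def toy_predict(prompt: str) -> str:
    """Return the most frequent answer seen for this prompt, else a guess by shape."""
    if prompt in seen:
        return seen[prompt].most_common(1)[0][0]
    # Unseen prompt: emit digits with a plausible length, with no calculation at all
    left, _, right = prompt.split(" ")
    return "5" * (len(left) + len(right))

print(toy_predict("2 + 2"))
print(toy_predict("8347 × 6291"))
```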


## Your Turn: Experiment!

Try your own math problems. See if you can find the boundary between "succeeds" and "fails":


In [12]:
# Try your own math problem here!
your_question = "What is 15 × 23?"  # Change this to whatever you want
your_answer = 15 * 23  # Python computes the real answer

test_math(your_question, your_answer)


Question: What is 15 × 23?
LLM Answer: 15 × 23 = 345
Correct Answer: 345
✓ SUCCESS!


'15 × 23 = 345'
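
To search for the boundary more systematically, you could generate problems of increasing size. The sketch below builds one multiplication question per digit count, with the correct answer computed by Python. The helper `make_problem` is hypothetical and not part of FAIR-LLM.

```python
import random

def make_problem(digits: int, rng: random.Random):
    """Return (question, correct_answer) for a product of two `digits`-digit numbers."""
    low, high = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(low, high), rng.randint(low, high)
    return f"What is {a} × {b}?", a * b

rng = random.Random(0)  # fixed seed so the same problems come back on every run
problems = [make_problem(d, rng) for d in range(1, 6)]
for question, answer in problems:
    print(question, "->", answer)
```

Each pair can then be passed to `test_math(question, answer)` to see at which size the model starts to fail.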

## So What's Next? Why Do We Need Agents?

You've just discovered a fundamental limitation of LLMs:
- **They're amazing at language understanding and generation**
- **But they can't reliably carry out exact computation on their own**

### The Solution: LLM Agents

What if we could:
1. Keep the LLM's language understanding ("the user wants to multiply two numbers")
2. Give it access to actual TOOLS (a real calculator, Python code, databases, APIs)
3. Let it decide WHEN to use tools vs. when to use its own knowledge

**That's what we're going to build!**

### Agent Architecture:
```
User: "What's 8347 × 6291?"
  ↓
LLM: "I need to multiply. I should use the calculator tool."
  ↓
Tool: calculator(8347, 6291) → 52510977
  ↓
LLM: "The answer is 52,510,977"
  ↓
User: Correct!
```

**LLM = Understanding**  
**Tools = Execution**  
**Agent = LLM + Tools = Actually useful!**
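
The flow in the diagram can be sketched in plain Python before bringing in any framework. Everything here is a stand-in: `stub_llm` fakes the model's decision with a regular expression, and the `CALL calculator ...` text format is invented for this example. It is not how FAIR-LLM formats tool calls.

```python
import re

def stub_llm(question: str) -> str:
    """Stand-in for the LLM: decide whether to call the calculator.
    A real model would produce this decision as generated text."""
    numbers = re.findall(r"\d+", question)
    if len(numbers) != 2:
        return "I can answer that without a tool."
    return f"CALL calculator {numbers[0]} * {numbers[1]}"

def calculator(a: int, b: int) -> int:
    """The tool: real computation, not prediction."""
    return a * b

def run_agent(question: str) -> str:
    decision = stub_llm(question)
    match = re.fullmatch(r"CALL calculator (\d+) \* (\d+)", decision)
    if match is None:
        return decision  # no tool call requested
    result = calculator(int(match.group(1)), int(match.group(2)))
    return f"The answer is {result:,}"

print(run_agent("What's 8347 × 6291?"))
```

The language side decides *what* to do; the tool does the arithmetic. The agent built below follows the same division of labor, with a real LLM making the decision.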

## Building Your First Agent: Calculator Edition

Now let's put everything together! We'll build an agent that:
1. **Understands** your math question (LLM brain)
2. **Decides** when to use a calculator (reasoning)
3. **Executes** the calculation (tool use)
4. **Responds** with the correct answer

This is the bridge from "predicting text" to "actually computing"!

### Step 1: Import the FAIR-LLM Components

We need to import the building blocks for our agent:
- **HuggingFaceAdapter**: The LLM "brain"
- **ToolRegistry**: Holds all available tools
- **SafeCalculatorTool**: A calculator the agent can use
- **ToolExecutor**: Actually runs the tools
- **WorkingMemory**: Keeps track of the conversation
- **SimpleAgent**: Brings everything together
- **SimpleReActPlanner**: The reasoning engine (ReAct = Reason + Act)
- **RoleDefinition**: Defines the agent's persona

In [13]:
from fairlib import (
    HuggingFaceAdapter,
    ToolRegistry,
    SafeCalculatorTool,
    ToolExecutor,
    WorkingMemory,
    SimpleAgent, 
    SimpleReActPlanner,
    RoleDefinition
)

print("All components imported!")

All components imported!


### Step 2: Create the Agent's "Brain" (LLM)

We'll use the same model as earlier, loaded into a second adapter with generation settings suited to an agent. Instead of asking it to do math directly, we'll teach it to USE TOOLS.

In [14]:
# The same LLM, but now it will be part of an agent system
agent_llm = HuggingFaceAdapter(
    model_name="dolphin3-qwen25-3b",
    auth_token=token,
    max_new_tokens=150,  # Longer responses for reasoning
    temperature=0.3
)

print("Agent brain created!")

🔧 Loading HuggingFace model: cognitivecomputations/Dolphin3.0-Qwen2.5-3b (quantized=False, stream=False)


Some parameters are on the meta device because they were offloaded to the disk.
Device set to use mps


Agent brain created!


### Step 3: Give the Agent a Tool (Calculator)

This is the KEY difference! Instead of making the LLM do math by prediction, we give it access to a REAL calculator tool.

In [15]:
# Create a registry to hold all tools
tool_registry = ToolRegistry()

# Create the calculator tool
calculator_tool = SafeCalculatorTool()

# Register it so the agent knows it exists
tool_registry.register_tool(calculator_tool)

# See what tools are available
available_tools = [tool.name for tool in tool_registry.get_all_tools().values()]
print(f"Agent's tools: {available_tools}")

Registering tool: safe_calculator
Agent's tools: ['safe_calculator']
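
The internals of `SafeCalculatorTool` are not shown in this notebook. One common way to evaluate arithmetic safely, sketched below as an assumption and not as the library's implementation, is to parse the expression with Python's `ast` module and allow only numeric literals and basic operators. The name `safe_eval` is hypothetical.

```python
import ast
import operator

# Only these binary operations are allowed; anything else is rejected
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""
    def _eval(node):
        # Numeric literals only (bool is excluded because type(True) is bool)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("Unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)

print(safe_eval("(456 + 789) * 234"))
```

Function calls, names, and attribute access all fall through to the `ValueError`, so a string such as `__import__('os')` is rejected instead of executed.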


### Step 4: Create the Tool Executor

The executor is like the agent's "hands" - it actually runs the tools when the agent decides to use them.

In [16]:
# The executor can run any tool in the registry
executor = ToolExecutor(tool_registry)

print("Tool executor created!")

Tool executor created!


### Step 5: Give the Agent Memory

The agent needs memory to keep track of the conversation and what it has already tried.

In [17]:
# Simple working memory for the conversation
memory = WorkingMemory()

print("Memory system created!")

Memory system created!
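
Conceptually, a working memory can be as simple as a bounded list of messages. The class below is a hypothetical illustration of that idea; it does not reflect the actual interface of FAIR-LLM's `WorkingMemory`.

```python
from collections import deque

class ToyMemory:
    """Keeps the most recent messages; older ones fall off the front."""

    def __init__(self, max_messages: int = 20):
        self._messages = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self._messages.append({"role": role, "content": content})

    def history(self) -> list:
        """Return the stored messages, oldest first."""
        return list(self._messages)

toy_memory = ToyMemory(max_messages=2)
toy_memory.add("user", "What is 2 + 2?")
toy_memory.add("assistant", "I should use the calculator.")
toy_memory.add("tool", "4")
print(toy_memory.history())
```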


### Step 6: Create the Planner (The Reasoning Engine)

This is where the magic happens! The planner uses the **ReAct** pattern:
- **Reason**: Think about what to do next
- **Act**: Take an action (use a tool or respond)

The planner looks at the conversation, decides if it needs to use a tool, and calls it if needed.
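
In ReAct-style systems the model's output is usually plain text that the planner has to parse into a thought and a tool call. The sketch below assumes a `Thought:` / `Action:` / `Action Input:` layout, which is a common convention but not necessarily the exact format `SimpleReActPlanner` expects. The sample text and the `parse_react_step` helper are made up for illustration.

```python
import re

# Hypothetical model output in a Thought / Action / Action Input layout
sample_output = """Thought: I need to multiply two numbers, so I should use the calculator.
Action: safe_calculator
Action Input: 8347 * 6291"""

def parse_react_step(text: str) -> dict:
    """Extract the thought, tool name, and tool input from one ReAct step."""
    fields = {}
    for key in ("Thought", "Action", "Action Input"):
        # Each field sits on its own line as "Key: value"
        match = re.search(rf"^{key}:\s*(.+)$", text, flags=re.MULTILINE)
        fields[key] = match.group(1).strip() if match else None
    return fields

step = parse_react_step(sample_output)
print(step["Action"], "<-", step["Action Input"])
```

Once the tool name and input are extracted, the executor can look up the tool in the registry and run it.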

In [18]:
# Create the planner
planner = SimpleReActPlanner(agent_llm, tool_registry)

# CRITICAL: Properly orient the agent with detailed instructions
planner.prompt_builder.role_definition = RoleDefinition(
    "You are an expert mathematical calculator. Your job is to perform mathematical calculations. "
    "You reason step-by-step to determine the best course of action. "
    "When you receive a calculation request, you MUST use the calculator tool - do not attempt to calculate in your head. "
    "If a user's request requires multiple steps or tools, you must break it down and execute them sequentially. "
    "You must follow the strict formatting rules for tool calling. "
    "Always use tools for calculations to ensure accuracy."
)

print("Planner created with detailed role definition!")

Planner created with detailed role definition!


### Step 7: Assemble the Complete Agent

Now we bring all the pieces together into a functioning agent!

In [19]:
# Create the agent with all components
agent = SimpleAgent(
    llm=agent_llm,
    planner=planner,
    tool_executor=executor,
    memory=memory,
    max_steps=10  # Limit to prevent infinite loops
)

print("Agent successfully created!")
print("The agent can now:")
print("  Understand your math questions")
print("  Decide when to use the calculator")
print("  Execute calculations accurately")
print("  Respond with correct answers")

Agent successfully created!
The agent can now:
  Understand your math questions
  Decide when to use the calculator
  Execute calculations accurately
  Respond with correct answers


## Testing the Agent

Let's test our agent on the SAME problems that the pure LLM failed on!

In [20]:
# Test function for the agent
async def test_agent_math(question):
    """Test the agent on a math problem"""
    print(f"\nQuestion: {question}")
    print("Agent thinking...")
    
    response = await agent.arun(question)
    print(f"Agent: {response}\n")
    return response
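
The cells that follow use top-level `await`, which Jupyter supports. In a plain Python script there is no running event loop, so a coroutine has to be started with `asyncio.run`. A minimal sketch, with a stub coroutine `fake_agent_run` standing in for `agent.arun`:

```python
import asyncio

async def fake_agent_run(question: str) -> str:
    """Stub standing in for agent.arun(); a real agent would call the LLM here."""
    await asyncio.sleep(0)  # yield control once, as a real async call would
    return f"(stub) received: {question}"

def main() -> str:
    # asyncio.run creates an event loop, runs the coroutine, and closes the loop
    return asyncio.run(fake_agent_run("What is 2 + 2?"))

if __name__ == "__main__":
    print(main())
```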

### Test 1: Simple Math (This Should Still Work)

In [None]:
# Test on simple problem
await test_agent_math("What is 2 + 2?")


Question: What is 2 + 2?
Agent thinking...
--- Step 1/10 ---


### Test 2: Complex Math (The LLM Failed This - Will the Agent Succeed?)

In [None]:
# The problem the LLM got wrong!
await test_agent_math("What is 8347 × 6291?")

### Test 3: Multi-Step Calculation

In [None]:
await test_agent_math("What is (456 + 789) × 234?")

## 🎯 The Key Insight: What Changed?

### Pure LLM (Earlier):
```
User: "What is 8347 × 6291?"
  ↓
LLM: *predicts next tokens based on patterns*
  ↓
LLM: "52,856,977" (wrong - it guessed!)
```

### LLM Agent (Now):
```
User: "What is 8347 × 6291?"
  ↓
Agent: *reasons* "I need to calculate. I should use the calculator tool."
  ↓
Agent: *calls* calculator(8347, 6291)
  ↓
Calculator: 52,510,977
  ↓
Agent: "The answer is 52,510,977"
```

**The LLM didn't get smarter at math - we gave it the RIGHT TOOL for the job!**

## Final Reflection

Answer these questions based on what you just saw:

### 1. What's the difference between the pure LLM and the agent?
```
Your answer:
```

### 2. Why can the agent do math reliably while the pure LLM can't?
```
Your answer:
```

### 3. What other tools could you give an agent to make it more capable?
```
Your answer:
```

### 4. Can you think of a task where tool use is essential?
```
Your answer:
```

## Congratulations!

You've just built your first LLM agent! You've learned:

**LLMs predict text** - they don't compute  
**Agents = LLM + Tools** - extend capabilities beyond prediction  
**ReAct pattern** - Reason about what to do, then Act  
**Tool calling** - Agents can use external functions/APIs  
**Reliable execution** - Tools provide accuracy that prediction can't