<div style="text-align:center;">
    <span style="color:orange; font-size:30px; font-weight:bold;">  
    Learning and Adaptation Module
    </span>
</div>

<div style="text-align:center;">
    <span style="color:orange; font-size:25px; font-weight:bold;">  
    Assignment 7
    </span>
</div>

**Authors:** Vy Vu, Amir K. Saeed, Dr. Benjamin Rodriguez, Dr. Erhan Guven <br>
**Created Date:** 08/09/2025 <br>
**Modified Date:** 10/03/2025

__Q1.__ The Value Iteration algorithm is a fundamental method for solving Markov Decision Processes (MDPs). It calculates the optimal utility values for each state and derives the optimal policy.

![Value_Iteration](photos/Value_Iteration_Algorithm.JPG)

i. Explain in your own words what happens in each step of the Value Iteration algorithm above.

ii. Define the following terms and explain their role:
    
- Discount factor ($\gamma$)
- Maximum allowed error ($\epsilon$)
- Q-value
- $\delta$

iii. Prove that the Value Iteration algorithm converges to the optimal utility function $U^{*}$ as $\epsilon \rightarrow 0$.

iv. Discuss the influence of the discount factor $\gamma$ on the convergence rate.
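As a point of reference for the questions above, here is a minimal sketch of value iteration in Python. It is not the figure's exact pseudocode: the `actions`, `P`, and `R` interfaces and the two-state toy MDP are assumptions made up for illustration, and the termination test $\delta \leq \epsilon(1 - \gamma)/\gamma$ is the common textbook form.

```python
def value_iteration(states, actions, P, R, gamma, epsilon):
    """Sketch of value iteration for 0 < gamma < 1.

    Hypothetical interfaces: actions(s) lists the actions in s,
    P[s][a] is a list of (probability, next_state) pairs,
    and R(s, a, s2) is the transition reward.
    """
    U = {s: 0.0 for s in states}
    while True:
        U_new, delta = {}, 0.0
        for s in states:
            # Q-value of (s, a): expected reward plus discounted next utility.
            q_values = [
                sum(p * (R(s, a, s2) + gamma * U[s2]) for p, s2 in P[s][a])
                for a in actions(s)
            ]
            U_new[s] = max(q_values)
            # delta tracks the largest utility change in this sweep.
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta <= epsilon * (1 - gamma) / gamma:
            return U


# Toy MDP (made up for illustration): landing in B pays 1, anything else pays 0.
states = ["A", "B"]
P = {
    "A": {"stay": [(1.0, "A")], "go": [(1.0, "B")]},
    "B": {"stay": [(1.0, "B")], "go": [(1.0, "A")]},
}
U = value_iteration(
    states,
    actions=lambda s: ["stay", "go"],
    P=P,
    R=lambda s, a, s2: 1.0 if s2 == "B" else 0.0,
    gamma=0.9,
    epsilon=1e-3,
)
print(U)
```

In this toy problem the optimal utilities are $U^{*}(A) = U^{*}(B) = 1/(1 - 0.9) = 10$, so the printed values should lie within $\epsilon$ of 10. Rerunning with a larger $\gamma$ is one way to observe its effect on the number of sweeps.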

---

__Q2.__ The Policy Iteration algorithm is another fundamental method for solving Markov Decision Processes (MDPs). Instead of directly updating state utilities until convergence (as in Value Iteration), Policy Iteration alternates between policy evaluation (computing the utility of a given policy) and policy improvement (updating the policy based on those utilities).

![Policy_Iteration](photos/Policy_Iteration_Algorithm.JPG)

i. Explain in your own words what happens in each step of the Policy Iteration algorithm above. Distinguish clearly between policy evaluation and policy improvement.

ii. Define the following terms and explain their role in Policy Iteration:

- Discount factor ($\gamma$)
- Policy evaluation
- Policy improvement
- Convergence criterion

iii. Prove (or provide a reasoning argument) that the Policy Iteration algorithm converges to the optimal policy $\pi^{*}$ in a finite number of steps.

iv. Compare the influence of the discount factor $\gamma$ on Policy Iteration versus Value Iteration. Which algorithm tends to converge faster, and why?
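The alternation between evaluation and improvement can be sketched as follows. This is an illustrative variant, not the figure's pseudocode: it approximates policy evaluation with a fixed number of in-place sweeps (`eval_sweeps`, a hypothetical parameter) rather than solving the linear equations exactly, and it reuses a made-up two-state MDP.

```python
def policy_iteration(states, actions, P, R, gamma, eval_sweeps=50):
    """Sketch of policy iteration with approximate policy evaluation."""
    U = {s: 0.0 for s in states}
    pi = {s: actions(s)[0] for s in states}  # arbitrary initial policy

    def q(s, a):
        # Expected reward plus discounted utility of taking a in s.
        return sum(p * (R(s, a, s2) + gamma * U[s2]) for p, s2 in P[s][a])

    while True:
        # Policy evaluation: estimate U for the current fixed policy.
        for _ in range(eval_sweeps):
            for s in states:
                U[s] = q(s, pi[s])
        # Policy improvement: switch to a strictly better action if one exists.
        unchanged = True
        for s in states:
            best = max(actions(s), key=lambda a: q(s, a))
            if q(s, best) > q(s, pi[s]) + 1e-12:
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi, U


# Toy MDP (made up for illustration): landing in B pays 1, anything else pays 0.
states = ["A", "B"]
P = {
    "A": {"stay": [(1.0, "A")], "go": [(1.0, "B")]},
    "B": {"stay": [(1.0, "B")], "go": [(1.0, "A")]},
}
pi, U = policy_iteration(
    states,
    actions=lambda s: ["stay", "go"],
    P=P,
    R=lambda s, a, s2: 1.0 if s2 == "B" else 0.0,
    gamma=0.9,
)
print(pi)
```

The loop stops when an improvement pass changes no action, which is the convergence criterion the question refers to.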

---

__Q3.__

![POMDP_Value_Iteration](photos/POMDP_Value_Iteration_Algorithm.JPG)

Given the POMDP-VALUE-ITERATION function shown in Figure 16.16, consider a simple autonomous robot navigation scenario where the robot has uncertain sensor readings about obstacles and must navigate to a goal while avoiding collisions.

i. Explain the significance of starting with one-step plans [a]. Why does the algorithm initialize $U'$ with all possible single actions rather than starting with empty plans or random plans? How does this relate to the principle of dynamic programming in POMDPs?

ii. Analyze the utility vector computation $\alpha_{[a]}(s) = \sum_{s'} P(s'|s,a) R(s,a,s')$. This represents the expected utility of taking action $a$ in each state $s$. In our robot navigation context, if the robot has actions {move-forward, turn-left, turn-right} and states representing different obstacle configurations, what does this computation tell us about each action's value across different true world states?

iii. The algorithm constructs plans consisting of "an action and, for each possible next percept, a plan in $U$."

- Explain why plans must be conditioned on percepts rather than states in POMDPs.
- If our robot can observe {obstacle-detected, clear-path, goal-visible}, construct a sample 2-step plan and explain how it would be executed given different sensor readings.
- Why does this tree-like plan structure grow exponentially with the planning horizon?

iv. Critical Analysis of the REMOVE-DOMINATED-PLANS step:

- Explain what it means for one plan to dominate another in the context of utility vectors over belief states
- Why is this pruning step essential for computational tractability? What would happen if we skipped this step?
- The convergence criterion MAX-DIFFERENCE $(U,U') \leq \epsilon(1 - \gamma)/\gamma$ is based on the maximum difference in utility vectors. Explain why this specific threshold guarantees that the policy derived from $U$ is $\epsilon$-optimal, and how the discount factor $\gamma$ influences when the algorithm terminates.
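To make the utility vectors concrete, the sketch below computes $\alpha_{[a]}(s)$ for one-step plans and evaluates them at a belief state. The two-state robot model, its probabilities, and its rewards are invented for illustration. The dominance check shown is only pointwise dominance, a simpler sufficient test than full dominance over the belief space.

```python
def one_step_alpha(states, a, P, R):
    """alpha_[a](s) = sum over s2 of P(s2 | s, a) * R(s, a, s2)."""
    return {s: sum(p * R(s, a, s2) for s2, p in P[s][a].items()) for s in states}


def plan_value(alpha, belief):
    """Expected utility of a plan at belief b: sum over s of b(s) * alpha(s)."""
    return sum(belief[s] * alpha[s] for s in belief)


def pointwise_dominates(alpha1, alpha2):
    """True if alpha1 is at least as good in every state and better in one."""
    return all(alpha1[s] >= alpha2[s] for s in alpha1) and any(
        alpha1[s] > alpha2[s] for s in alpha1
    )


# Hypothetical robot model: the true state is "clear" or "blocked".
states = ["clear", "blocked"]
P = {
    "clear": {"move-forward": {"clear": 1.0}, "turn-left": {"clear": 1.0}},
    "blocked": {
        "move-forward": {"blocked": 1.0},
        "turn-left": {"clear": 0.5, "blocked": 0.5},
    },
}


def R(s, a, s2):
    # Moving forward pays +1 on a clear path and -10 on a collision.
    if a == "move-forward":
        return 1.0 if s == "clear" else -10.0
    return 0.0


alpha_forward = one_step_alpha(states, "move-forward", P, R)
alpha_turn = one_step_alpha(states, "turn-left", P, R)

cautious = {"clear": 0.8, "blocked": 0.2}
confident = {"clear": 0.95, "blocked": 0.05}
print(plan_value(alpha_forward, cautious), plan_value(alpha_turn, cautious))
print(plan_value(alpha_forward, confident), plan_value(alpha_turn, confident))
```

In this made-up model neither plan dominates the other: turning is better at the cautious belief, while moving forward is better at the confident one, so both vectors would survive pruning.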

---

__Q4.__

![Passive_ADP_Learner](photos/Passive_ADP_Learner.JPG)

i. Walk through the algorithm step-by-step. Why does it only update when encountering a "new" state ($s'$ is new), and what happens when it revisits a previously seen state? What's the significance of this design choice?

ii. Explain the role of the utility table $U$ and the outcome count table $N$ in the learning process. How does incrementing $N_{s'|s,a}[s,a][s']$ and the normalization step help the agent improve its policy over time?


iii. The algorithm calls POLICY-EVALUATION $(\pi, U, mdp)$ but doesn't explicitly show policy improvement. How do you think this passive learner eventually converges to an optimal policy? What are the limitations of this "passive" approach compared to active learning?


iv. If you were to implement this algorithm in a real-world scenario (like a robot learning to navigate or a game AI), what challenges might you face? Consider issues like convergence speed, memory requirements, and the assumption of a "fixed policy" during learning.
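The counting and normalization step can be illustrated with a small sketch. The table layout (`counts[(s, a)][s2]`) and the helper names are assumptions chosen for readability, not the figure's data structures.

```python
from collections import defaultdict

# counts[(s, a)][s2] plays the role of the outcome count table N.
counts = defaultdict(lambda: defaultdict(int))


def observe(s, a, s2):
    """Record one observed transition s --a--> s2."""
    counts[(s, a)][s2] += 1


def normalize(outcome_counts):
    """Turn outcome counts into an estimated distribution P(. | s, a)."""
    total = sum(outcome_counts.values())
    return {s2: n / total for s2, n in outcome_counts.items()}


# Made-up experience: "right" from s1 reached s2 three times and slipped once.
for s2 in ["s2", "s2", "s1", "s2"]:
    observe("s1", "right", s2)

P_hat = normalize(counts[("s1", "right")])
print(P_hat)
```

As more transitions are observed, the estimate moves toward the true transition model, and each call to policy evaluation then works from a better model.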

---

__Q5.__

![Passive_TD_Learner](photos/Passive_TD_Learner.JPG)

i. Examine the utility update formula: $U[s] \leftarrow U[s] + \alpha (N_s[s]) \times (r + \gamma U[s'] - U[s])$. Break down each component. What does $(r + \gamma U[s'] - U[s])$ represent conceptually, and why is this called a "temporal difference"?

ii. The algorithm uses $\alpha(N_s[s])$ as a step-size function that depends on how often state $s$ has been visited. Why is this important for convergence? What would happen if we used a constant learning rate instead, and how might that affect the agent's learning?

iii. How does this TD approach differ from the Passive-ADP-Learner? Which algorithm would you expect to learn faster, and which would be more memory-efficient? Consider the trade-offs between model-based and model-free learning.

iv. Notice that the algorithm uses $U[s']$ (the current utility estimate of the next state) to update $U[s]$. This is called "bootstrapping" which means learning from your own estimates. What are the advantages and potential risks of this approach? How does it enable learning without knowing the full transition model?
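A single application of the update rule from part (i) might look like the sketch below. The dictionary tables, the decaying step size $60/(59 + n)$, and the sample numbers are assumptions for illustration only.

```python
def td_update(U, N, s, r, s2, gamma, alpha):
    """Apply one TD update to U[s] and return the temporal-difference error."""
    N[s] = N.get(s, 0) + 1
    # Gap between the bootstrapped target r + gamma * U[s2] and the estimate U[s].
    td_error = r + gamma * U.get(s2, 0.0) - U.get(s, 0.0)
    U[s] = U.get(s, 0.0) + alpha(N[s]) * td_error
    return td_error


def step_size(n):
    """Hypothetical step size that decays with the visit count n."""
    return 60 / (59 + n)


U = {"s1": 0.0, "s2": 1.0}
N = {}
error = td_update(U, N, s="s1", r=-0.04, s2="s2", gamma=1.0, alpha=step_size)
print(error, U["s1"])
```

On the first visit the step size is 1, so $U[s_1]$ jumps all the way to the target $r + \gamma U[s_2]$. Later visits move it by smaller fractions of the error.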

---

__Q6.__

![Q-Learning_Agent](photos/Q-Learning_Agent.JPG)

i. Compare this algorithm to the passive learners (passive ADP learners and passive TD learners). How does the action selection using $\arg\max_{a'} f(Q[s',a'], N_{sa}[s',a'])$ make this an "active" learner? What is the agent now optimizing for that it wasn't before?

ii. Examine the Q-value update: $Q[s,a] \leftarrow Q[s,a] + \alpha (N_{sa}[s,a])(r + \gamma \max_{a'} Q[s',a'] - Q[s,a])$. Why does this use $\max_{a'} Q[s',a']$ instead of following a fixed policy like the previous algorithms? What does this mathematical difference represent conceptually?

iii. The algorithm uses an exploration function $f(Q[s',a'], N_{sa}[s',a'])$ to choose actions. Why is exploration crucial in Q-learning? What might happen if the agent always chose the action with the highest current Q-value? Design a simple exploration function and justify your choice.

iv. Unlike ADP-learner, this algorithm doesn't need to learn transition probabilities or call POLICY-EVALUATION. What are the practical advantages of this model-free approach? In what scenarios would you prefer Q-learning over model-based methods, and why might this be particularly valuable in real-world applications?
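The update from part (ii) can be sketched in a few lines. The table layout, the $1/n$ step size, and the sample values are assumptions for illustration. The exploration function $f$ is deliberately left out, since designing it is part of the question.

```python
def q_update(Q, N, s, a, r, s2, actions, gamma, alpha):
    """Apply one Q-learning update for the observed transition (s, a, r, s2)."""
    N[(s, a)] = N.get((s, a), 0) + 1
    # Bootstrapped target uses the best next action, not a fixed policy's action.
    best_next = max(Q.get((s2, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha(N[(s, a)]) * (r + gamma * best_next - old)
    return Q[(s, a)]


Q = {("s2", "left"): 0.5, ("s2", "right"): 2.0}
N = {}
new_q = q_update(
    Q, N, s="s1", a="right", r=1.0, s2="s2",
    actions=["left", "right"], gamma=0.9, alpha=lambda n: 1 / n,
)
print(new_q)
```

No transition probabilities appear anywhere in the update. The observed next state $s'$ stands in for the model.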

---

**Q7. Mini Simplified Auction Game**

We will create an AI agent using the smolagents framework that competes in a 2-round auction game against another agent. Our goal is to maximize the agent's total utility score by bidding strategically on items while managing uncertainty about the opponent's strategy and valuations. In this version of the game, the requirements are as follows:

- Number of AI players: 2
- The number of auction rounds per game: 2
- Objective: The highest total utility score at the end of the game (after 2 rounds)
- Starting conditions:

    - Budget: each agent starts with 50 coins
    - 2 items will be auctioned in sequence.
        - Item A: Magic Book.
        - Item B: Flying Carpet.

    - Private valuations: Each agent has different private values for each item (assigned by the environment). An agent knows only its own private valuations, not its opponent's (this is the uncertainty the agent has to handle).

- The game rules:

    - Minimum starting bid = 3 coins
    - Minimum raise = 5 coins (each new bid must be at least 5 coins higher than the current bid)
    - Players bid in rotating order (taking turns)
    - Players can pass to exit the current item's auction permanently
    - Once you pass, you cannot re-enter that specific item's auction
    - You can still participate in future item auctions
    - The auction continues until only one active bidder remains
    - The last remaining bidder wins and pays their final bid amount

- The total utility score formula = Sum of private values of all items won + Remaining coins $\times$ 0.1
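The bidding rules above can be expressed as a small legality check. This is a sketch under stated assumptions: the function names are hypothetical, and the rule that a bid may not exceed the bidder's remaining budget is inferred from the 50-coin starting budget rather than stated explicitly.

```python
MIN_BID = 3    # minimum starting bid, from the rules
MIN_RAISE = 5  # minimum raise, from the rules


def min_legal_bid(current_bid):
    """Smallest allowed bid; current_bid is None when nobody has bid yet."""
    return MIN_BID if current_bid is None else current_bid + MIN_RAISE


def is_legal_bid(amount, current_bid, budget):
    """Assumes a bid must meet the minimum and fit within the bidder's budget."""
    return min_legal_bid(current_bid) <= amount <= budget


print(is_legal_bid(8, None, 50))   # opening bid of 8 with a full budget
print(is_legal_bid(12, 8, 50))     # raises the current bid of 8 by only 4
print(is_legal_bid(55, 25, 50))    # exceeds the 50-coin budget
```

Checking legality in plain code, separately from whatever strategy proposes the bid, keeps an invalid suggestion from reaching the environment.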


*Example:*

We have 2 agents named Mickey and Minnie.

At the start of the game, each agent is given its private valuations, which stay fixed for the whole game. Note that each agent knows only its own private valuations, not its opponent's.

- Mickey's private valuations:
    - Item A (Magic Book): $\textcolor{teal} {30}$ coins
    - Item B (Flying Carpet): $25$ coins

- Minnie's private valuations:
    - Item A (Magic Book): $20$ coins
    - Item B (Flying Carpet): $20$ coins

Let's go through one auction, item A - Magic Book.

Item A (Magic Book) auction:
- Mickey bids $8$ coins.
- Minnie bids $13$ coins.
- Mickey bids $25$ coins.
- Minnie passes.

$\rightarrow$ Result: Mickey wins the Magic Book for $25$ coins, leaving him with $50 - 25 = 25$ coins.

Now, item B - Flying Carpet.

Item B (Flying Carpet) auction:
- Minnie bids $10$ coins (values it at $20$).
- Mickey bids $15$ coins.
- Minnie bids $20$ coins (maximum she should pay).
- Mickey bids $25$ coins.
- Minnie gets frustrated and bids $35$ coins (OVERBIDDING - mistake!).
- Mickey passes (smart move - won't pay more than it's worth to him).

$\rightarrow$ Minnie wins for $35$ coins - she paid $15$ more than it's worth to her!

Here are the items and the costs each agent obtained:

| Item | Owner | Cost |
|:-----|:------|:-----|
| A | Mickey | $25$ coins |
| B | Minnie | $35$ coins |

And the remaining amount of money at the end:

| Agent | Remaining Amount |
|:------|:-----------------|
| Mickey | $50 - 25 = \textcolor{orange} {25}$ |
| Minnie | $50 - 35 = 15$ |

The total utility score:

| Agent | Utility |
|:------|:--------|
| Mickey | $\textcolor{teal} {30} + \textcolor{orange} {25} \times 0.1 = 32.5$ |
| Minnie | $20 + 15 \times 0.1 = 21.5$ |

$\rightarrow$ Mickey won!
<br>
<br>
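The scores in the example can be reproduced with a short helper. The function name `total_utility` is hypothetical; the formula is the one given in the game rules.

```python
def total_utility(values_won, remaining_coins):
    """Sum of private values of items won plus 0.1 per remaining coin."""
    return sum(values_won) + remaining_coins * 0.1


# Mickey won item A (worth 30 to him) for 25 coins.
mickey = total_utility([30], 50 - 25)
# Minnie won item B (worth 20 to her) for 35 coins.
minnie = total_utility([20], 50 - 35)
print(mickey, minnie)
```

Because each remaining coin is worth only 0.1, winning an item for less than ten times its private value still raises the score. Minnie's overbid hurt her mainly by leaving less budget than a lower winning bid would have.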

On every turn, the agent has to make decisions, deal with uncertainty, and proactively learn patterns.

1. Should I bid on items I value less highly? (this is decision making)
2. How much do opponents value this item? (this is uncertainty)
3. What bidding patterns lead to victory? (learning and adaptation)
<br>
<br>

Ineffective strategies will get punished:
- Overbidding: paying more than an item is worth to you (your private valuation) costs extra coins, yet the item still adds only its private valuation to your total utility (as with Minnie and item B).
- Underbidding: missing items you value highly means missed opportunities.
- Poor budget management: spending everything early leaves you unable to compete later, including on items you value highly.
- Ignoring your private valuations.
- Spending all your budget on one item.
- Bidding randomly high amounts or playing without strategy.
<br>
<br>

Effective strategies are:
- Strategic bidding based on "your" valuations.
- Opponent modeling and uncertainty handling.
- Smart budget allocation across both rounds (for example, allocating more coins to the items you value highly).
- Learning and adapting from game experience.

You will implement the decision-making logic for an AI agent that competes in the mini auction game. Review the game rules, example, and strategy guidelines provided in the previous section before beginning. Your agent must make strategic bidding decisions to maximize utility while managing uncertainty about opponent valuations.

Complete the decide_bid() method in the MiniStrategicAgent class. This method is called whenever your agent must decide whether to bid or pass. It must return a tuple: (action, amount) where action is BID or PASS, and amount is the bid value or None.

Your agent has access to a get_game_state tool that returns structured data: your budget, current item name, your valuations, current bid, and minimum raise requirement. You must use this tool to gather information before making decisions.

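# Hint:
# MiniAuctionEnvironment, GetGameStateTool, and BidAction are assumed to be
# defined in earlier cells of this notebook.
from typing import Optional

from smolagents import CodeAgent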

class MiniStrategicAgent:
    def __init__(self, name: str, env: MiniAuctionEnvironment, model):
        self.name = name
        self.env = env
        
        tools = [GetGameStateTool(env)]
        
        self.agent = CodeAgent(
            tools=tools,
            model=model,
            max_steps=5,
            additional_authorized_imports=["json"]
        )
    
    def decide_bid(self) -> tuple[BidAction, Optional[int]]:
        """
        TODO: Implement your decision-making logic here.
        
        Your agent should:
        1. Check if still active in auction
        2. Use self.agent.run() to call the LLM with a strategic prompt
        3. Instruct the LLM to use get_game_state tool
        4. Parse the result (handle dict, string, or AgentText types)
        5. Validate the bid is legal
        6. Return (BidAction.BID, amount) or (BidAction.PASS, None)
        
        Available information:
        - self.name: your agent's name
        - self.env.game_state: current game state
        - self.env.game_state.current_auction: current auction details
        - self.env.MIN_BID: minimum starting bid (3 coins)
        - self.env.MIN_RAISE: minimum raise (5 coins)
        """
        # YOUR CODE HERE
        pass
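Steps 4 and 5 of the hint (parsing and validating the result) can be prototyped separately from the LLM call. The sketch below is one possible approach, not the required solution. It assumes the prompt asks the model to answer with a JSON object such as `{"action": "BID", "amount": 13}`, and it defines a stand-in `BidAction` enum because the notebook's own definition is not shown here. The helper `parse_decision` and its parameters are hypothetical.

```python
import json
from enum import Enum
from typing import Optional


class BidAction(Enum):  # stand-in for the notebook's BidAction
    BID = "BID"
    PASS = "PASS"


def parse_decision(result, min_legal: int, budget: int) -> tuple[BidAction, Optional[int]]:
    """Turn an LLM result into (action, amount), falling back to PASS."""
    if not isinstance(result, dict):
        # Strings and text-like objects are converted to str, then read as JSON.
        try:
            result = json.loads(str(result))
        except json.JSONDecodeError:
            return BidAction.PASS, None
    if not isinstance(result, dict):
        return BidAction.PASS, None
    if str(result.get("action", "")).upper() != "BID":
        return BidAction.PASS, None
    try:
        amount = int(result.get("amount"))
    except (TypeError, ValueError):
        return BidAction.PASS, None
    # Only accept bids that meet the minimum and fit within the budget.
    if min_legal <= amount <= budget:
        return BidAction.BID, amount
    return BidAction.PASS, None


print(parse_decision({"action": "bid", "amount": 13}, min_legal=13, budget=50))
print(parse_decision('{"action": "BID", "amount": 60}', min_legal=13, budget=50))
print(parse_decision("I pass", min_legal=13, budget=50))
```

Falling back to PASS on anything unparseable or illegal is a conservative design choice: the agent can never submit an invalid bid, at the cost of occasionally giving up an item because of a malformed reply.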