# 🎲 Toy Problem: "Two Actions"

Imagine the agent is in an environment with two possible actions:

- Action 0 → Gives a reward of +1 with 30% probability, otherwise 0.
- Action 1 → Gives a reward of +1 with 70% probability, otherwise 0.

At the beginning, the agent doesn't know this. It must interact with the environment and learn, by trial and error, which action yields the higher reward.
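
For reference, the quantity the agent is trying to estimate is the expected reward of each action. Because each reward is 1 with the stated probability and 0 otherwise, the expectation equals that probability. Writing $q(a)$ for the true value of action $a$ (a symbol introduced here, not used in the notebook):

$$
q(0) = \mathbb{E}[R \mid A = 0] = 1 \cdot 0.3 + 0 \cdot 0.7 = 0.3, \qquad
q(1) = \mathbb{E}[R \mid A = 1] = 1 \cdot 0.7 + 0 \cdot 0.3 = 0.7
$$

On average, then, action 1 is worth 0.4 more reward per step than action 0.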

In [1]:
import numpy as np

## Simple environment with two actions

In [2]:
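def environment(action):
    """Return a random reward (0 or 1) for the chosen action."""
    if action == 0:
        return 1 if np.random.rand() < 0.3 else 0
    elif action == 1:
        return 1 if np.random.rand() < 0.7 else 0
    raise ValueError(f"unknown action: {action}")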
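
A quick sanity check of the reward probabilities: the sketch below samples each action many times and compares the empirical mean with the stated 30% and 70%. To stay self-contained it re-creates the rewards with a hypothetical helper `sample_rewards` and a seeded `np.random.default_rng` generator (both are assumptions, not part of the original notebook) rather than calling `environment` directly.

```python
import numpy as np

# Success probabilities taken from the problem statement.
REWARD_PROBS = [0.3, 0.7]

def sample_rewards(action, n, rng):
    """Hypothetical helper: draw n rewards (0 or 1) for one action."""
    return (rng.random(n) < REWARD_PROBS[action]).astype(int)

rng = np.random.default_rng(0)  # seeded only to make the check repeatable
empirical_means = [sample_rewards(a, 10_000, rng).mean() for a in (0, 1)]
print(empirical_means)  # should land close to [0.3, 0.7]
```

With only a handful of samples the empirical means can be far from the true probabilities, which is why the agent needs many interactions before its estimates become reliable.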

## Parameters

In [3]:
n_episodes = 100
estimated_values = [0.0, 0.0] # estimated average value for each action
counts = [0, 0]               # how many times each action was chosen
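
Storing a count next to each estimate is what allows a running average to be maintained without keeping the full reward history. A minimal sketch of that idea, using a hypothetical helper `update_mean` (not in the original notebook) and an illustrative reward sequence, suggests that the incremental rule reproduces the ordinary sample mean:

```python
import numpy as np

def update_mean(old_mean, count, reward):
    """Hypothetical helper: fold one new reward into a running average.

    `count` is the number of rewards seen so far, including the new one.
    """
    return old_mean + (reward - old_mean) / count

rewards = [0, 1, 0, 0, 1]  # illustrative reward sequence
mean = 0.0
for n, r in enumerate(rewards, start=1):
    mean = update_mean(mean, n, r)

print(mean, np.mean(rewards))  # the two values agree up to rounding
```

The update needs only the previous mean and the count, so memory use stays constant no matter how many rewards are observed.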

## Strategy: ε-greedy (sometimes explore, mostly exploit what seems best)

In [4]:
epsilon = 0.1

for t in range(1, n_episodes + 1):
    # choose action (exploration vs. exploitation)
    if np.random.rand() < epsilon:
        action = np.random.choice([0, 1])    # explore
    else:
        action = np.argmax(estimated_values) # exploit (ties go to the lowest index)

    # Execute action in the environment
    reward = environment(action)

    # update counts and estimated value of the action
    counts[action] += 1
    estimated_values[action] += (reward - estimated_values[action]) / counts[action]

    print(f"Episode {t:3d} | Action: {action} | Reward: {reward} | Estimated values: {estimated_values}")

Episode   1 | Action: 0 | Reward: 0 | Estimated values: [0.0, 0.0]
Episode   2 | Action: 0 | Reward: 1 | Estimated values: [0.5, 0.0]
Episode   3 | Action: 0 | Reward: 0 | Estimated values: [0.33333333333333337, 0.0]
Episode   4 | Action: 0 | Reward: 0 | Estimated values: [0.25, 0.0]
Episode   5 | Action: 0 | Reward: 1 | Estimated values: [0.4, 0.0]
Episode   6 | Action: 0 | Reward: 0 | Estimated values: [0.33333333333333337, 0.0]
Episode   7 | Action: 0 | Reward: 0 | Estimated values: [0.28571428571428575, 0.0]
Episode   8 | Action: 0 | Reward: 0 | Estimated values: [0.25000000000000006, 0.0]
Episode   9 | Action: 0 | Reward: 0 | Estimated values: [0.22222222222222227, 0.0]
Episode  10 | Action: 0 | Reward: 1 | Estimated values: [0.30000000000000004, 0.0]
Episode  11 | Action: 0 | Reward: 0 | Estimated values: [0.27272727272727276, 0.0]
Episode  12 | Action: 0 | Reward: 0 | Estimated values: [0.25000000000000006, 0.0]
Episode  13 | Action: 0 | Reward: 1 | Estimated values: [0.30769230