# 1. Implement the Monte Carlo Off-Policy Control with Importance Sampling.


In [None]:
import numpy as np
import gym
from collections import defaultdict

def generate_behavior_policy(env):
    # Uniform random behavior policy: every action is equally likely in every
    # state. The defaultdict builds the distribution lazily for unseen states,
    # so the same code covers Discrete and Tuple observation spaces.
    num_actions = env.action_space.n
    return defaultdict(lambda: np.ones(num_actions) / num_actions)

def off_policy_mc_control(env, num_episodes, discount_factor=1.0):
    # Q holds the action-value estimates; C holds the cumulative importance
    # weights used as denominators in weighted importance sampling.
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    C = defaultdict(lambda: np.zeros(env.action_space.n))
    target_policy = defaultdict(lambda: np.zeros(env.action_space.n))
    behavior_policy = generate_behavior_policy(env)

    for episode in range(num_episodes):
        episode_states = []
        episode_actions = []
        episode_rewards = []

        # Roll out one episode with the behavior policy.
        # This uses the classic Gym API: reset() returns the observation and
        # step() returns four values; newer Gym/Gymnasium releases differ.
        state = env.reset()
        done = False
        while not done:
            action_probs = behavior_policy[state]
            action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
            next_state, reward, done, _ = env.step(action)
            episode_states.append(state)
            episode_actions.append(action)
            episode_rewards.append(reward)
            state = next_state

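        # Walk the episode backward, accumulating the return G and the
        # importance-sampling weight W (weighted importance sampling).
        G = 0.0
        W = 1.0
        for t in range(len(episode_states) - 1, -1, -1):
            state = episode_states[t]
            action = episode_actions[t]
            reward = episode_rewards[t]
            G = discount_factor * G + reward
            C[state][action] += W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])

            # Keep the target policy greedy with respect to the updated Q.
            best_action = np.argmax(Q[state])
            target_policy[state] = np.eye(env.action_space.n)[best_action]

            # A non-greedy action has probability 0 under the target policy,
            # so the weight for all earlier steps would be 0: stop here.
            if action != best_action:
                break
            # pi(a|s) = 1 for the greedy action, so W grows by 1 / b(a|s).
            W /= behavior_policy[state][action]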

    for state in Q:
        target_policy[state] = np.zeros(env.action_space.n)
        target_policy[state][np.argmax(Q[state])] = 1.0

    return Q, target_policy

# Example usage
env = gym.make('Blackjack-v1')
num_episodes = 100
Q, policy = off_policy_mc_control(env, num_episodes)
# Print Q-values in a structured format
print("Q-values:")
for state, values in Q.items():
    print("State:", state)
    print("Actions:", values)

# Print policy in a structured format
print("\n\nPolicy:\n")
for state, actions in policy.items():
    print("State:", state)
    print("Policy:", actions)

Q-values:
State: (21, 9, False)
Actions: [ 0. -1.]
State: (18, 10, False)
Actions: [-0.5  0. ]
State: (11, 10, False)
Actions: [0. 1.]
State: (13, 6, False)
Actions: [ 1. -1.]
State: (10, 10, False)
Actions: [-1.  1.]
State: (16, 8, False)
Actions: [-1. -1.]
State: (19, 7, False)
Actions: [1. 0.]
State: (18, 7, False)
Actions: [0.         0.33333333]
State: (19, 1, False)
Actions: [-0.33333333 -1.        ]
State: (16, 3, False)
Actions: [-1.  1.]
State: (17, 1, False)
Actions: [-1.  0.]
State: (12, 10, False)
Actions: [-0.33333333  1.        ]
State: (19, 1, True)
Actions: [ 0. -1.]
State: (17, 8, False)
Actions: [1. 0.]
State: (12, 6, False)
Actions: [0. 0.]
State: (20, 4, False)
Actions: [ 1. -1.]
State: (14, 6, False)
Actions: [-1. -1.]
State: (17, 6, False)
Actions: [0. 0.]
State: (10, 6, False)
Actions: [0. 0.]
State: (19, 10, False)
Actions: [ 0. -1.]
State: (13, 1, False)
Actions: [-1. -1.]
State: (20, 3, False)
Actions: [1. 0.]
State: (19, 5, True)
Actions: [-1.  0.]
State: (19

https://chat.openai.com/share/a22504cf-f8a2-4c64-874e-56d3a8ea7b35

1. **Initialize**: Set up Q-table, C-table, and target policy dictionaries.

2. **Generate Policy**: Create a uniform random behavior policy for exploration.

3. **Loop Episodes**:
   - Generate an episode by following the behavior policy.
   - Walk the episode backward, accumulating the return (G) for each state-action pair.
   - Update the Q-table and the C-table (cumulative importance weights) using weighted importance sampling.

4. **Update Policy**: Adjust the target policy based on the updated Q-values.

5. **Return**: Provide the learned Q-values and target policy.
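
The importance-sampling update in step 3 can be written out explicitly. The following is a sketch of the usual weighted importance sampling form, applied backward over one episode for \( t = T-1, \dots, 0 \) with \( G = 0 \) and \( W = 1 \) initially. The symbols \( \pi \) (target policy), \( b \) (behavior policy), \( \gamma \) (discount factor) and \( r_t \) (the reward received after taking \( a_t \) in \( s_t \)) are notation assumed here, not names from the code:

\[
\begin{aligned}
G &\leftarrow \gamma \, G + r_t \\
C(s_t, a_t) &\leftarrow C(s_t, a_t) + W \\
Q(s_t, a_t) &\leftarrow Q(s_t, a_t) + \frac{W}{C(s_t, a_t)} \bigl( G - Q(s_t, a_t) \bigr) \\
W &\leftarrow W \cdot \frac{\pi(a_t \mid s_t)}{b(a_t \mid s_t)}
\end{aligned}
\]

For a greedy (deterministic) target policy, \( \pi(a_t \mid s_t) \) is 1 when \( a_t \) is the greedy action and 0 otherwise. In the first case the weight grows by \( 1 / b(a_t \mid s_t) \); in the second it becomes 0, so the backward pass can stop early.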

This code implements Monte Carlo Off-Policy Control with Importance Sampling for solving reinforcement learning problems. Here's a breakdown:

1. **Environment Setup**: It imports necessary libraries and initializes the Blackjack environment from OpenAI Gym.

2. **Policy Initialization**: The `generate_behavior_policy` function creates the behavior policy, which dictates how the agent selects actions during exploration. It assigns equal probability to every action in every state and stays fixed throughout training.

3. **Off-Policy Monte Carlo Control**: The `off_policy_mc_control` function performs off-policy Monte Carlo control. It iterates through a fixed number of episodes and collects experiences while following the behavior policy.

4. **Episode Generation and Importance Sampling**: In each episode, the agent selects actions with the behavior policy. Once the episode ends, the code walks through it backward and updates the state-action value estimates (Q-table) using weighted importance sampling, with the C-table holding the cumulative weights. This lets the agent learn about the target policy from experience generated by a different policy.

5. **Policy Update**: The target policy is made greedy with respect to the learned Q-values: in each visited state it puts probability 1 on the action with the highest estimated return.

6. **Output**: Finally, the learned Q-values and target policy are returned for evaluation or further use.
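
To make the backward pass in steps 4 and 5 concrete, here is a minimal sketch that applies it to two hand-written episodes, without Gym. The function name `wis_update`, the states `"s0"` and `"s1"`, and the rewards are all hypothetical, and the greedy action is read directly from `Q` rather than from a stored target policy.

```python
from collections import defaultdict
import numpy as np

def wis_update(Q, C, episode, behavior_prob, gamma=1.0):
    """One backward pass of weighted importance sampling over an episode.

    episode is a list of (state, action, reward) tuples; behavior_prob is the
    probability a uniform behavior policy assigns to each action.
    """
    G, W = 0.0, 1.0
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        C[state][action] += W
        Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
        if action != np.argmax(Q[state]):
            break  # the greedy target policy would not have taken this action
        W /= behavior_prob  # pi(a|s) = 1 for the greedy action
    return Q, C

num_actions = 2
Q = defaultdict(lambda: np.zeros(num_actions))
C = defaultdict(lambda: np.zeros(num_actions))

# Episode 1: both actions turn out to be greedy, so both steps are updated.
wis_update(Q, C, [("s0", 1, 0.0), ("s1", 1, 1.0)], behavior_prob=0.5)
q_after_first = {s: Q[s].copy() for s in ("s0", "s1")}

# Episode 2: the last action stops being greedy after its update, so the
# pass breaks and the earlier step ("s0", action 0) is never touched.
wis_update(Q, C, [("s0", 0, 0.0), ("s1", 1, -1.0)], behavior_prob=0.5)

print("after episode 1:", q_after_first)
print("after episode 2:", dict(Q), dict(C))
```

The second episode shows why off-policy Monte Carlo control with a greedy target learns mostly from the tails of episodes: as soon as a non-greedy action appears, everything before it is discarded.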

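Given enough episodes, this algorithm lets an agent learn a good policy for the environment even though the exploration policy (behavior policy) differs from the target policy. The 100 episodes in the example usage are only a smoke test: most states are visited just a few times, so the printed Q-values are noisy.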

# 2. Implement SARSA (On-Policy TD Learning)

1. **Initialize Q-Table**: Start with a Q-table of zeros for all state-action pairs.

2. **For Each Episode**:
   - Reset the environment.
   - Choose an action using epsilon-greedy policy based on Q-values.
   - Take the action, observe the next state and reward.
   - Choose the next action using epsilon-greedy policy.
   - Update Q-values using SARSA update rule.
   - Repeat until episode ends.

3. **Policy Evaluation**: After enough episodes, the Q-table approximates the state-action values of the epsilon-greedy policy that generated the data.

4. **Return**: Learned Q-values, representing expected returns for actions in each state.
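
The steps above can be sketched end to end on a toy problem that needs no Gym install. The `Corridor` class, the helper names, and the hyperparameters (`gamma=0.9`, `alpha=0.1`, 500 episodes) are assumptions chosen for illustration. A discount below 1 is used so that moving toward the goal is visibly better than moving away from it, and ties in the greedy choice are broken at random, which differs from a plain `np.argmax`.

```python
import numpy as np

class Corridor:
    """Hypothetical 1-D corridor: states 0..4, start at 0, goal at 4.
    Action 0 moves left, action 1 moves right; reaching the goal pays +1."""
    n_states, n_actions = 5, 2

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        return self.state, float(done), done

def choose_action(Q, state, epsilon, rng):
    """Epsilon-greedy selection with random tie-breaking among best actions."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    best = np.flatnonzero(Q[state] == Q[state].max())
    return int(rng.choice(best))

def run_sarsa(env, num_episodes=500, gamma=0.9, alpha=0.1, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))           # step 1
    for _ in range(num_episodes):                         # step 2
        state = env.reset()
        action = choose_action(Q, state, epsilon, rng)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = choose_action(Q, next_state, epsilon, rng)
            # SARSA target uses the action that will actually be taken next.
            target = reward + gamma * Q[next_state, next_action]
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q                                              # step 4

Q = run_sarsa(Corridor())
print(np.round(Q, 3))
print("greedy actions in non-terminal states:", np.argmax(Q[:4], axis=1))
```

The row for the goal state stays at zero because no action is ever taken from it, which is why the update can bootstrap from `Q[next_state, next_action]` even on the final step.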

In [None]:
import numpy as np
import gym

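def epsilon_greedy_policy(Q, state, epsilon):
    # With probability epsilon pick a uniformly random action (exploration);
    # otherwise pick the action with the highest Q-value (exploitation).
    num_actions = Q.shape[1]
    if np.random.rand() < epsilon:
        return np.random.choice(num_actions)
    else:
        return np.argmax(Q[state])

def sarsa(env, num_episodes, discount_factor=1.0, alpha=0.5, epsilon=0.1):
    # One row per state, one column per action.
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for episode in range(num_episodes):
        # Classic Gym API: reset() returns the observation and step() returns
        # four values; newer Gym/Gymnasium releases return extra values.
        state = env.reset()
        action = epsilon_greedy_policy(Q, state, epsilon)
        done = False

        while not done:
            next_state, reward, done, _ = env.step(action)
            # On-policy: the target uses the action that is actually taken next.
            next_action = epsilon_greedy_policy(Q, next_state, epsilon)
            td_target = reward + discount_factor * Q[next_state][next_action]
            td_error = td_target - Q[state][action]
            Q[state][action] += alpha * td_error

            state = next_state
            action = next_action

    return Q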

# Example usage
env = gym.make('FrozenLake-v1')
num_episodes = 10000
Q = sarsa(env, num_episodes)
print("Q-values:", Q)



Q-values: [[0.      0.      0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.03125 0.      0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.03125 0.      0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.1875  0.      0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.      0.      0.75    0.     ]
 [0.      0.      0.      0.     ]]


This code implements the SARSA (State-Action-Reward-State-Action) algorithm, a form of on-policy Temporal Difference (TD) learning. Here's an explanation of the key components:

1. **epsilon_greedy_policy**: This function defines an epsilon-greedy policy, which selects a random action with probability epsilon, and the action with the highest Q-value with probability (1 - epsilon).

2. **sarsa**: This function initializes a Q-table with zeros and iterates through a specified number of episodes. Within each episode:
   - It resets the environment and selects the initial action based on the epsilon-greedy policy.
   - It iterates through steps within the episode, updating the Q-values using the SARSA update rule: \( Q(s, a) \leftarrow Q(s, a) + \alpha \cdot (r + \gamma \cdot Q(s', a') - Q(s, a)) \), where \( \alpha \) is the learning rate, \( r \) is the reward, \( \gamma \) is the discount factor, \( s \) is the current state, \( a \) is the current action, \( s' \) is the next state, and \( a' \) is the next action.
   - The current state and action are updated for the next iteration based on the environment's response.

3. **Example usage**: It creates an environment (in this case, FrozenLake), runs SARSA for a specified number of episodes, and prints the learned Q-values.
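
As a small worked example of the two pieces described above, the sketch below computes the action probabilities implied by epsilon-greedy selection and applies a single SARSA update with hand-picked numbers. The helper names `epsilon_greedy_probs` and `sarsa_update` and all numeric values are hypothetical; they are not part of the notebook's code.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities implied by epsilon-greedy selection on one Q row."""
    num_actions = len(q_row)
    # Every action gets epsilon / num_actions from the random branch...
    probs = np.full(num_actions, epsilon / num_actions)
    # ...and the greedy action additionally gets the remaining 1 - epsilon.
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

def sarsa_update(q_sa, reward, q_next, alpha, gamma):
    """Q(s, a) + alpha * (r + gamma * Q(s', a') - Q(s, a))."""
    return q_sa + alpha * (reward + gamma * q_next - q_sa)

probs = epsilon_greedy_probs(np.array([0.2, 0.8, 0.1, 0.0]), epsilon=0.1)
new_q = sarsa_update(q_sa=0.0, reward=1.0, q_next=0.5, alpha=0.5, gamma=1.0)

print(probs)   # roughly [0.025, 0.925, 0.025, 0.025]
print(new_q)   # 0.75
```

Because the random branch can also land on the greedy action, that action is chosen with probability 1 - epsilon + epsilon / num_actions, slightly more than 1 - epsilon.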

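Overall, SARSA iteratively updates Q-values from observed transitions and rewards while following the same epsilon-greedy policy it is evaluating. Because it is on-policy, the values it learns are those of the epsilon-greedy policy rather than of a purely greedy one; it approaches an optimal policy only if exploration is gradually reduced and the learning rate is decayed appropriately.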