<a href="https://colab.research.google.com/github/boernd/rl-workshop/blob/main/Kopie_von_Part3_1QIteration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Q iteration in practice
The central data structures in this example are as follows:

• Reward table: A dictionary with the composite key "source state" + "action" +
"target state". The value is the immediate reward observed for that transition.

• Transitions table: A dictionary keeping counters of the experienced transitions.
The key is the composite "state" + "action", and the value is another dictionary that maps each target state to the number of times we have seen it. For example, if we execute action 1 in state 0 ten times, and it leads us to state 4 three times and to state 5 seven times, then the entry with the key (0, 1) in this table will be the dict {4: 3, 5: 7}. We use this table to estimate the probabilities of our transitions.

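• Value table: A dictionary that maps the composite "state" + "action" to the calculated value of that action in that state (its Q-value). In the previous value iteration example this table was keyed by the state alone.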
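
As a minimal sketch of how these three tables look in memory, the snippet below rebuilds the (0, 1) example from the text by hand and turns the counters into estimated transition probabilities. The helper `transition_probs` is a hypothetical name introduced only for this illustration; it is not part of the Agent class below.

```python
import collections

# Toy tables with the same shapes as the ones the Agent keeps.
rewards = collections.defaultdict(float)                  # (state, action, target) -> reward
transits = collections.defaultdict(collections.Counter)   # (state, action) -> Counter of targets
values = collections.defaultdict(float)                   # (state, action) -> Q-value

# Example from the text: action 1 in state 0, executed ten times.
for target in [4] * 3 + [5] * 7:
    rewards[(0, 1, target)] = 0.0      # assumed reward, for illustration only
    transits[(0, 1)][target] += 1


def transition_probs(transits, state, action):
    """Hypothetical helper: estimate p(target | state, action) from the counters."""
    counts = transits[(state, action)]
    total = sum(counts.values())
    if total == 0:
        return {}                      # this pair has never been tried
    return {tgt: n / total for tgt, n in counts.items()}


probs = transition_probs(transits, 0, 1)
print(dict(transits[(0, 1)]))  # {4: 3, 5: 7}
print(probs)                   # {4: 0.3, 5: 0.7}
```

Using `defaultdict` means that unseen keys read as zero reward, zero value, or an empty counter, so the code never has to check whether an entry exists first.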

**The overall logic of our code is simple:**
 In the loop, we play 100 random steps from the
environment, populating the reward and transition tables. After those 100 steps, we
perform a value iteration loop over all states and actions, updating our value table. Then we play
20 full test episodes to check our improvements using the updated value table. If the
average reward for those test episodes is above the **0.8** boundary, then we stop training.
During test episodes, we also update our reward and transition tables to use all data from
the environment.
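
The update applied to each state and action pair can be sketched on hand-built tables, without an environment. Each Q-value becomes the count-weighted average, over observed target states, of the immediate reward plus the discounted best Q-value of the target. The numbers, `N_ACTIONS`, and the function name `q_update` are assumptions made up for this illustration.

```python
import collections

GAMMA = 0.9
N_ACTIONS = 2  # hypothetical size of the toy action space

rewards = collections.defaultdict(float)
transits = collections.defaultdict(collections.Counter)
q_values = collections.defaultdict(float)

# Hand-made experience: from state 0, action 1 reached state 4 three times
# (reward 0) and state 5 seven times (reward 1).
transits[(0, 1)].update({4: 3, 5: 7})
rewards[(0, 1, 5)] = 1.0

# Pretend that earlier sweeps already produced these Q-values for the targets.
q_values[(4, 0)] = 0.5
q_values[(5, 1)] = 0.2


def q_update(state, action):
    """One backup: sum over targets of p * (r + GAMMA * max_a Q(target, a))."""
    counts = transits[(state, action)]
    total = sum(counts.values())
    new_q = 0.0
    for tgt, n in counts.items():
        best_next = max(q_values[(tgt, a)] for a in range(N_ACTIONS))
        new_q += (n / total) * (rewards[(state, action, tgt)] + GAMMA * best_next)
    return new_q


q_values[(0, 1)] = q_update(0, 1)
print(round(q_values[(0, 1)], 3))
```

With these numbers the backup is 0.3 * (0 + 0.9 * 0.5) + 0.7 * (1 + 0.9 * 0.2), which is approximately 0.961.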

In [None]:
pip install tensorboardX



In [None]:
#!/usr/bin/env python3
import gym
import collections
from tensorboardX import SummaryWriter
# We start by importing the packages we use (above) and defining constants (below):

ENV_NAME = "FrozenLake-v0"
GAMMA = 0.9
TEST_EPISODES = 20

# Then we define the Agent class, which will keep our tables and contain functions we'll be
# using in the training loop:

class Agent:
    def __init__(self):
        self.env = gym.make(ENV_NAME)
        self.state = self.env.reset()
        self.rewards = collections.defaultdict(float)
        self.transits = collections.defaultdict(collections.Counter)
        self.values = collections.defaultdict(float)

# In the class constructor, we create the environment we'll be using for data samples, obtain
# our first observation, and define tables for rewards, transitions, and values.

# The play_n_random_steps function gathers random experience from the environment and updates
# the reward and transition tables. Note that we don't need to wait for the end of the episode to
# start learning; we just perform N steps and remember their outcomes. This is one of the
# differences between value iteration and the cross-entropy method, which can learn only on full
# episodes.

    def play_n_random_steps(self, count):
        for _ in range(count):
            action = self.env.action_space.sample()
            new_state, reward, is_done, _ = self.env.step(action)
            self.rewards[(self.state, action, new_state)] = reward
            self.transits[(self.state, action)][new_state] += 1
            self.state = self.env.reset() if is_done else new_state

# The next function, select_action, makes a decision about the best action to take from the
# given state. It iterates over all possible actions in the environment and looks up the
# Q-value of every (state, action) pair in the value table. The action with the largest value
# wins and is returned as the action to take. This action selection process is deterministic, as
# the play_n_random_steps() function introduces enough exploration. So, our agent will
# behave greedily with regard to our value approximation.

    def select_action(self, state):
        best_action, best_value = None, None
        for action in range(self.env.action_space.n):
            action_value = self.values[(state, action)]
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        return best_action

# The play_episode function uses select_action to find the best action to take and plays one
# full episode using the provided environment. This function is used to play test episodes,
# during which we don't want to interfere with the current state of the main environment
# used to gather random data. So, we use a second environment, passed as an
# argument. The logic is very simple and should already be familiar to you: we just loop over
# states, accumulating the reward for one episode:

    def play_episode(self, env):
        total_reward = 0.0
        state = env.reset()
        while True:
            action = self.select_action(state)
            new_state, reward, is_done, _ = env.step(action)
            self.rewards[(state, action, new_state)] = reward
            self.transits[(state, action)][new_state] += 1
            total_reward += reward
            if is_done:
                break
            state = new_state
        return total_reward

# The value_iteration method is very similar to calc_action_value in the previous example and in
# fact does almost the same thing. For every state and action, it calculates the value of this
# action using statistics about the target states that we've reached with the action. To calculate
# this value, we use the Bellman equation and our counters, which allow us to approximate
# the probability of each target state. However, the Bellman equation needs the value of the
# target state, and now we have to obtain it differently. Before, we had it stored in the value table
# (as we approximated the value of states), so we just took it from this table. We can't do this
# anymore, so we call the select_action method, which chooses the action
# with the largest Q-value, and then we take this Q-value as the value of the target state. Of
# course, we could implement another function to calculate this state value for us,
# but select_action does almost everything we need, so we reuse it here.

    def value_iteration(self):
        for state in range(self.env.observation_space.n):
            for action in range(self.env.action_space.n):
                action_value = 0.0
                target_counts = self.transits[(state, action)]
                total = sum(target_counts.values())
                for tgt_state, count in target_counts.items():
                    reward = self.rewards[(state, action, tgt_state)]
                    best_action = self.select_action(tgt_state)
                    action_value += (count / total) * (reward + GAMMA * self.values[(tgt_state, best_action)])
                self.values[(state, action)] = action_value


if __name__ == "__main__":
    test_env = gym.make(ENV_NAME)
    agent = Agent()
    writer = SummaryWriter(comment="-q-iteration")

    iter_no = 0
    best_reward = 0.0
    while True:
        iter_no += 1
        agent.play_n_random_steps(100)
        agent.value_iteration()

        reward = 0.0
        for _ in range(TEST_EPISODES):
            reward += agent.play_episode(test_env)
        reward /= TEST_EPISODES
        writer.add_scalar("reward", reward, iter_no)
        if reward > best_reward:
            print("Best reward updated %.3f -> %.3f" % (best_reward, reward))
            best_reward = reward
        if reward > 0.80:
            print("Solved in %d iterations!" % iter_no)
            break
    writer.close()


Best reward updated 0.000 -> 0.250
Best reward updated 0.250 -> 0.550
Best reward updated 0.550 -> 0.600
Best reward updated 0.600 -> 0.700
Best reward updated 0.700 -> 0.850
Solved in 19 iterations!
