
In [None]:
pip install tensorboardX


In [None]:
#!/usr/bin/env python3
import gym
import collections
from tensorboardX import SummaryWriter
# With the required packages imported above, we define the constants:
ENV_NAME = "FrozenLake-v0"
GAMMA = 0.9
TEST_EPISODES = 20

# Then we define the Agent class, which will keep our tables and contain functions we'll be
# using in the training loop:
class Agent:
    def __init__(self):
        self.env = gym.make(ENV_NAME)
        self.state = self.env.reset()
        self.rewards = collections.defaultdict(float)
        self.transits = collections.defaultdict(collections.Counter)
        self.values = collections.defaultdict(float)
# In the class constructor, we create the environment we'll use to sample data, obtain
# our first observation, and define the tables for rewards, transitions, and values.

# The play_n_random_steps function gathers random experience from the environment and
# updates the reward and transition tables. Note that we don't need to wait for the end of
# an episode to start learning; we just perform N steps and remember their outcomes. This
# is one of the differences between value iteration and the cross-entropy method, which
# can learn only from full episodes.

    def play_n_random_steps(self, count):
        for _ in range(count):
            action = self.env.action_space.sample()
            new_state, reward, is_done, _ = self.env.step(action)
            self.rewards[(self.state, action, new_state)] = reward
            self.transits[(self.state, action)][new_state] += 1
            self.state = self.env.reset() if is_done else new_state

# The next function calculates the value of an action taken from a state, using our
# transition, reward, and value tables. We will use it for two purposes: to select the best
# action to perform from the state and to calculate the new value of the state during value
# iteration. It works as follows:
# 1. We extract the transition counters for the given state and action from the transition
#    table. Each entry is a Counter (a dict subclass) with target states as keys and the
#    number of observed transitions as values. We sum all counters to obtain the total
#    number of times we've executed the action from the state. We use this total
#    later to turn an individual counter into a probability.
# 2. Then we iterate over every target state that the action has landed in and calculate its
#    contribution to the total action value using the Bellman equation. This
#    contribution equals the immediate reward plus the discounted value of the target
#    state. We multiply this sum by the estimated probability of the transition and add the
#    result to the final action value:
#    Q(s, a) = sum over s' of (count(s, a, s') / total) * (r(s, a, s') + GAMMA * V(s'))

    def calc_action_value(self, state, action):
        target_counts = self.transits[(state, action)]
        total = sum(target_counts.values())
        action_value = 0.0
        for tgt_state, count in target_counts.items():
            reward = self.rewards[(state, action, tgt_state)]
            action_value += (count / total) * (reward + GAMMA * self.values[tgt_state])
        return action_value

# The select_action function uses calc_action_value to decide on the best action to take
# from the given state. It iterates over all possible actions in the environment and
# calculates the value of each one. The action with the largest value wins and is returned
# as the action to take. This selection is deterministic (ties go to the lowest-numbered
# action), which is acceptable because play_n_random_steps() already provides enough
# exploration. So, our agent behaves greedily with respect to our value approximation.

    def select_action(self, state):
        best_action, best_value = None, None
        for action in range(self.env.action_space.n):
            action_value = self.calc_action_value(state, action)
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        return best_action

# The play_episode function uses select_action to find the best action to take and plays one
# full episode in the provided environment. It is used to play test episodes, during which
# we don't want to disturb the current state of the main environment used to gather random
# data, so we use a second environment passed as an argument. The logic is simple and
# should already be familiar: we loop over states, accumulating the reward for one episode.
# Along the way, the observed transitions also update the reward and transition tables:

    def play_episode(self, env):
        total_reward = 0.0
        state = env.reset()
        while True:
            action = self.select_action(state)
            new_state, reward, is_done, _ = env.step(action)
            self.rewards[(state, action, new_state)] = reward
            self.transits[(state, action)][new_state] += 1
            total_reward += reward
            if is_done:
                break
            state = new_state
        return total_reward

# The final method of the Agent class is our value iteration implementation, and it is
# surprisingly simple thanks to the preceding functions. We loop over all states in the
# environment; for every state, we calculate the value of each available action, which
# gives the candidates for the value of the state. Then we update the value of the state
# with the maximum of these action values:

    def value_iteration(self):
        for state in range(self.env.observation_space.n):
            state_values = [self.calc_action_value(state, action)
                            for action in range(self.env.action_space.n)]
            self.values[state] = max(state_values)


if __name__ == "__main__":
    test_env = gym.make(ENV_NAME)
    agent = Agent()
    writer = SummaryWriter(comment="-v-iteration")

    iter_no = 0
    best_reward = 0.0
    while True:
        iter_no += 1
        agent.play_n_random_steps(100)
        agent.value_iteration()

        reward = 0.0
        for _ in range(TEST_EPISODES):
            reward += agent.play_episode(test_env)
        reward /= TEST_EPISODES
        writer.add_scalar("reward", reward, iter_no)
        if reward > best_reward:
            print("Best reward updated %.3f -> %.3f" % (best_reward, reward))
            best_reward = reward
        if reward > 0.80:
            print("Solved in %d iterations!" % iter_no)
            break
    writer.close()

# The solution is stochastic: my experiments usually required from 12 to 100 iterations to
# reach a solution, but in all cases it took less than a second to find a good policy that
# solves the environment in 80% of runs.

Best reward updated 0.000 -> 0.050
Best reward updated 0.050 -> 0.150
Best reward updated 0.150 -> 0.300
Best reward updated 0.300 -> 0.400
Best reward updated 0.400 -> 0.650
Best reward updated 0.650 -> 0.700
Best reward updated 0.700 -> 0.750
Best reward updated 0.750 -> 0.800
Best reward updated 0.800 -> 0.850
Solved in 44 iterations!
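
The action-value estimate at the heart of the agent can be tried in isolation, without Gym. The following is a minimal sketch, not the notebook's code: the standalone function `action_value`, the helper `greedy_action`, and all counts, rewards, and values in the toy tables are made-up assumptions chosen only to make the arithmetic easy to follow. It mirrors the two steps described for `calc_action_value` (turn counters into probabilities, then weight reward plus discounted value) and the greedy choice made by `select_action`.

```python
import collections

GAMMA = 0.9  # same discount factor as in the notebook


def action_value(transits, rewards, values, state, action, gamma=GAMMA):
    """Estimate Q(state, action) from observed transition counts.

    Hypothetical standalone version of the agent's calc_action_value logic.
    """
    target_counts = transits[(state, action)]
    total = sum(target_counts.values())
    value = 0.0
    # With no observed transitions the loop body never runs and 0.0 is returned.
    for tgt_state, count in target_counts.items():
        prob = count / total  # empirical transition probability
        reward = rewards[(state, action, tgt_state)]
        value += prob * (reward + gamma * values[tgt_state])
    return value


def greedy_action(transits, rewards, values, state, n_actions):
    """Pick the action with the largest estimated value (first one wins ties)."""
    return max(range(n_actions),
               key=lambda a: action_value(transits, rewards, values, state, a))


# Toy tables with made-up numbers (assumptions for illustration only).
transits = collections.defaultdict(collections.Counter)
rewards = collections.defaultdict(float)
values = collections.defaultdict(float)

# From state 0, action 1 led to state 1 three times and to state 2 once.
transits[(0, 1)][1] += 3
transits[(0, 1)][2] += 1
rewards[(0, 1, 2)] = 1.0  # reaching state 2 paid a reward of 1
values[1] = 0.5           # current value estimate of state 1

q = action_value(transits, rewards, values, state=0, action=1)
best = greedy_action(transits, rewards, values, state=0, n_actions=2)
print(round(q, 4))  # 0.75 * (0 + 0.9 * 0.5) + 0.25 * (1 + 0.9 * 0) = 0.5875
print(best)         # action 1, since action 0 has no data and is valued at 0
```

Because the tables are `defaultdict` instances, a state-action pair that has never been tried simply evaluates to 0.0 rather than raising an error, which is why the random exploration steps matter before the greedy policy is tested.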
