# Problems

The previous methods require a discrete environment (virtually non-existent in the real world) and known state-transition probabilities. Two workarounds:

1. Split the CartPole environment's continuous observations into bins (the upper and lower bound of a scalar value define one bin).
2. Estimate the transition probabilities from experience gathered in the environment instead of assuming they are known.
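
A minimal sketch of the binning idea, using only the standard library. The bounds in `BOUNDS`, the choice of 6 bins per dimension, and the helper names `make_bins` and `discretize` are assumptions made for illustration; they are not taken from CartPole itself or from the code later in this notebook.

```python
import bisect

def make_bins(low, high, n_bins):
    """Return the n_bins - 1 interior edges that split [low, high] into equal-width bins."""
    width = (high - low) / n_bins
    return [low + width * i for i in range(1, n_bins)]

def discretize(observation, edges_per_dim):
    """Map a tuple of continuous values to a tuple of bin indices."""
    return tuple(bisect.bisect_right(edges, x)
                 for x, edges in zip(observation, edges_per_dim))

# Hypothetical bounds for the four CartPole observation components
# (cart position, cart velocity, pole angle, pole angular velocity).
BOUNDS = [(-2.4, 2.4), (-3.0, 3.0), (-0.21, 0.21), (-3.0, 3.0)]
EDGES = [make_bins(lo, hi, 6) for lo, hi in BOUNDS]

# Values outside the bounds fall into the first or last bin.
state = discretize((0.1, -3.5, 0.05, 10.0), EDGES)
print(state)
```

The resulting tuple of bin indices is hashable, so it can be used directly as a dictionary key in a tabular method.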

# Solving Frozen Lake with Value Iteration

We need three dictionaries:

Rewards: key is (s, a, s'), value is the immediate reward
<br>
Transitions: key is (s, a), value is a dict mapping each s' to the number of times it occurred
<br>
Values: key is s, value is the calculated value of being in that state
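
As a rough illustration of how these tables fill up and how counts turn into probabilities, here is a small sketch. The `experience` list is made-up data and `transition_probs` is a hypothetical helper, not part of the agent defined below.

```python
import collections

rewards = collections.defaultdict(float)                  # (s, a, s') -> immediate reward
transits = collections.defaultdict(collections.Counter)   # (s, a) -> {s': count}
values = collections.defaultdict(float)                   # s -> estimated value

# Made-up experience tuples: (state, action, reward, next_state).
experience = [(0, 1, 0.0, 4), (0, 1, 0.0, 4), (0, 1, 0.0, 1), (14, 2, 1.0, 15)]
for s, a, r, s_next in experience:
    rewards[(s, a, s_next)] = r
    transits[(s, a)][s_next] += 1

def transition_probs(state, action):
    """Turn the observed counts for (state, action) into empirical probabilities."""
    counts = transits[(state, action)]
    total = sum(counts.values())
    return {s_next: n / total for s_next, n in counts.items()}

probs = transition_probs(0, 1)
print(probs)
```

Here action 1 in state 0 led to state 4 twice and to state 1 once, so the estimated probabilities are 2/3 and 1/3.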


Steps:
1. Perform 100 random steps to fill the reward and transition tables.
2. Carry out value iteration over all states, updating the values table.
3. Test on several full episodes. During the test episodes we also update the reward and transition tables, so that all data from the environment is used.
4. Repeat until the average test reward exceeds 0.8.
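
Written in terms of the tables above, step 2 can be sketched as the update below. The symbols $C(s,a,s')$ (the count stored in the transitions table), $R(s,a,s')$ (the rewards table), and $V(s)$ (the values table) are notation introduced here for illustration, and $\gamma$ is the discount factor:

$$
Q(s,a) = \sum_{s'} \frac{C(s,a,s')}{\sum_{s''} C(s,a,s'')}\,\bigl(R(s,a,s') + \gamma V(s')\bigr), \qquad V(s) \leftarrow \max_a Q(s,a)
$$

The fraction is the empirical estimate of the transition probability, which stands in for the true probability that we do not know.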

In [8]:
import gym
import collections
# from tensorboardX import SummaryWriter
ENV_NAME = "FrozenLake-v0"
GAMMA = 0.9          # discount factor
TEST_EPISODES = 20   # episodes played per evaluation round

In [9]:
class Agent:
    def __init__(self):
        self.env = gym.make(ENV_NAME)
        self.state = self.env.reset()
        self.rewards = collections.defaultdict(float)
        self.transits = collections.defaultdict(collections.Counter)
        self.values = collections.defaultdict(float)
    
    def play_n_random(self, count):
        """Take count random steps, recording rewards and transition counts."""
        for _ in range(count):
            action = self.env.action_space.sample()
            new_state, reward, is_done, _ = self.env.step(action)
            self.rewards[(self.state, action, new_state)] = reward
            self.transits[(self.state, action)][new_state] += 1
            # Start a new episode when the current one ends.
            self.state = self.env.reset() if is_done else new_state
            
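    def calc_action_value(self, state, action):
        """Estimate the value of taking action in state from the observed counts."""
        target_counts = self.transits[(state, action)]
        total = sum(target_counts.values())
        action_value = 0.0

        # Weight each observed target state by its empirical probability.
        # An unseen (state, action) pair has no targets, so its value stays 0.0.
        for tgt_state, count in target_counts.items():
            reward = self.rewards[(state, action, tgt_state)]
            action_value += (count / total) * (reward + GAMMA * self.values[tgt_state])

        return action_value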

    def select_action(self, state):
        """Return the action with the highest estimated value in state."""
        best_action, best_value = None, None
        for action in range(self.env.action_space.n):
            action_value = self.calc_action_value(state, action)
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        
        return best_action
    
    def play_episode(self, env):
        """Play one greedy episode on env, updating the tables along the way."""
        total_reward = 0.0
        state = env.reset()
        
        while True:
            action = self.select_action(state)
            new_state, reward, is_done, _ = env.step(action)
            self.rewards[(state, action, new_state)] = reward
            self.transits[(state, action)][new_state] += 1
            total_reward += reward
            if is_done:
                break
            state = new_state
            
        return total_reward
    
    def value_iteration(self):
        """Run one sweep of value iteration over every state."""
        for state in range(self.env.observation_space.n):
            state_values = [self.calc_action_value(state, action)
                            for action in range(self.env.action_space.n)]
            self.values[state] = max(state_values)
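
Note that `value_iteration` makes a single pass over the states, so values propagate back from the goal gradually over repeated calls. The sketch below illustrates this on a made-up three-state chain; the chain, its reward, and the helper names `action_value` and `sweep` are assumptions for illustration and mirror, rather than reuse, the agent's methods.

```python
import collections

GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 1

# Hypothetical deterministic chain 0 -> 1 -> 2 with reward 1.0 on entering state 2.
rewards = collections.defaultdict(float, {(0, 0, 1): 0.0, (1, 0, 2): 1.0})
transits = collections.defaultdict(collections.Counter)
transits[(0, 0)][1] += 1
transits[(1, 0)][2] += 1
values = collections.defaultdict(float)

def action_value(state, action):
    """Expected reward plus discounted next-state value under the observed counts."""
    counts = transits[(state, action)]
    total = sum(counts.values())
    return sum((n / total) * (rewards[(state, action, s2)] + GAMMA * values[s2])
               for s2, n in counts.items())

def sweep():
    """One in-place pass over all states, in increasing state order."""
    for state in range(N_STATES):
        values[state] = max(action_value(state, a) for a in range(N_ACTIONS))

sweep()
after_one = dict(values)   # state 0 has not yet seen the reward two steps away
sweep()
after_two = dict(values)   # state 0 now picks up the discounted value of state 1
print(after_one, after_two)
```

After the first sweep only state 1 has a non-zero value; state 0 needs a second sweep before the reward reaches it. In the training loop below, the repeated calls across outer iterations play this role.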



In [10]:
if __name__ == "__main__":
    test_env = gym.make(ENV_NAME)
    agent = Agent()

    iter_no = 0
    best_reward = 0.0
    while True:
        iter_no += 1
        agent.play_n_random(100)
        agent.value_iteration()

        reward = 0.0
        for _ in range(TEST_EPISODES):
            reward += agent.play_episode(test_env)
        reward /= TEST_EPISODES
        
        if reward > best_reward:
            print("Best reward updated %.3f -> %.3f" % (best_reward, reward))
            best_reward = reward
        if reward > 0.80:
            print("Solved in %d iterations!" % iter_no)
            break

[2018-07-31 22:15:23,544] Making new env: FrozenLake-v0
[2018-07-31 22:15:23,559] Making new env: FrozenLake-v0


Best reward updated 0.000 -> 0.050
Best reward updated 0.050 -> 0.200
Best reward updated 0.200 -> 0.350
Best reward updated 0.350 -> 0.400
Best reward updated 0.400 -> 0.600
Best reward updated 0.600 -> 0.650
Best reward updated 0.650 -> 0.700
Best reward updated 0.700 -> 0.750
Best reward updated 0.750 -> 0.850
Solved in 55 iterations!
