# FrozenLake
A brief intro to FrozenLake: it's a grid world. Your agent lives in a 4x4 grid and can move in four directions: Up, Down, Left, and Right.

The agent always starts in the top-left cell, and its goal is to reach the bottom-right cell of the grid.

There are holes in fixed cells of the grid; if the agent falls into one, the episode ends with a reward of 0. If the agent reaches the destination cell, it obtains a reward of 1.0 and the episode ends.

The world is also slippery (hence Frozen Lake), so the agent's actions do not always turn out as expected: there is a **33%** chance each that it will slip to the right or to the left of the intended direction, which leaves only a 33% chance of moving as intended.
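
To get a feel for the slipping, here is a minimal sketch of the rule in plain Python (no Gym needed). It assumes Gym's action numbering for FrozenLake (0=Left, 1=Down, 2=Right, 3=Up); the helper names `slippery_outcomes` and `sample_actual_action` are made up for this illustration and are not part of Gym.

```python
import random

# Assumed action numbering (Gym's FrozenLake convention).
ACTION_NAMES = ["Left", "Down", "Right", "Up"]


def slippery_outcomes(action):
    """Return the three equally likely directions for an intended action.

    In the cyclic order Left -> Down -> Right -> Up, the two neighbours of an
    action are exactly the two directions perpendicular to it.
    """
    return [(action - 1) % 4, action, (action + 1) % 4]


def sample_actual_action(action, rng):
    """Pick the direction the agent really moves in (hypothetical helper)."""
    return rng.choice(slippery_outcomes(action))


rng = random.Random(0)
counts = {name: 0 for name in ACTION_NAMES}
for _ in range(3000):
    # We always *intend* to go Right (action 2).
    counts[ACTION_NAMES[sample_actual_action(2, rng)]] += 1

# Right, Down and Up each show up roughly a third of the time; Left never does.
print(counts)
```

So asking for Right can send you Down or Up, but never backwards to the Left.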

*Below is the code with details. It mirrors the Python script and is explained step by step, so you know what is going on and how to construct it!*

In [1]:
import gym
import numpy as np  # used by the one-hot wrapper and filter_batch() below

In [2]:
e = gym.make('FrozenLake-v0')

print(f'How many spaces are there? \n{e.observation_space}\n')
print(f'How many actions can we take? \n{e.action_space}\n')

How many spaces are there? 
Discrete(16)

How many actions can we take? 
Discrete(4)



**Discrete** means the space is a finite set of integer values: `Discrete(16)` covers the 16 grid positions (0 to 15) and `Discrete(4)` covers the four moves. It is not continuous, unlike, say, climbing a hill or steering a wheel.

In [8]:
# initializing
e.reset()
e.render()


SFFF
FHFH
FFFH
HFFG


That renders our 4x4 grid (the 16 states of the observation_space), where:
* **S**: our start (highlighted in red in the rendered output, because the agent is standing on it).
* **G**: at the bottom right is our goal.
* **H**: is a hole (you will fall!)
* **F**: frozen step (this is slippery, like super slippery)
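
The 16 observations are just cell numbers counted row by row, starting from 0 at the top-left. A small sketch of that mapping, using the grid printed above (the helper names `state_to_cell`, `cell_to_state` and `tile_at` are made up for this illustration):

```python
# The 4x4 map as rendered above, one string per row.
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = 4


def state_to_cell(state):
    """Convert a state index (0..15) to a (row, col) pair."""
    return divmod(state, N_COLS)


def cell_to_state(row, col):
    """Convert a (row, col) pair back to a state index."""
    return row * N_COLS + col


def tile_at(state):
    """Return the letter (S, F, H or G) of the cell for a state index."""
    row, col = state_to_cell(state)
    return GRID[row][col]


holes = [s for s in range(16) if tile_at(s) == "H"]
print(holes)        # [5, 7, 11, 12]
print(tile_at(0))   # S
print(tile_at(15))  # G
```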

We will now apply **one-hot encoding** to the observations, because our NN takes tensors of floating-point numbers as input, not a single integer state index.

This will be wrapped using: ```ObservationWrapper``` class from Gym.

In [None]:
# One-Hot Function
class DiscreteOneHotWrapper(gym.ObservationWrapper):
    """
    This should make our game playable with our other Agent (CartPole). But this isn't enough for the agent to learn.
    """
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        
        self.observation_space = gym.spaces.Box(0.0, 1.0, (env.observation_space.n, ), dtype=np.float32)
        
    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res
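
To see what the wrapper's `observation()` hands to the network, here is the same idea sketched with plain Python lists instead of NumPy arrays. The function name `one_hot` is made up for this illustration; the real wrapper returns a `float32` NumPy array.

```python
def one_hot(state, n_states=16):
    """Return a list of n_states floats with a single 1.0 at index `state`."""
    vec = [0.0] * n_states
    vec[state] = 1.0
    return vec


encoded = one_hot(5)
# State 5 (the hole in the second row) becomes a vector with 1.0 at position 5.
print(encoded)
```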

## Tweaking Cross-Entropy Method
Because rewards here are so sparse (the agent only gets 1.0 when it actually reaches the goal), we will need to tweak our Cross-Entropy Agent a bit for it to work on this environment.

The changes are:
* **Larger batches of played episodes**: FrozenLake requires at least 100 episodes per batch to get some successful ones.
* **Discount factor applied to reward**: To make the total reward for the episode depend on episode length, and to add variety between episodes, we can use a discounted total reward with a discount factor of 0.9 or 0.95. Therefore, the reward for shorter episodes will be higher than the reward for longer ones.
* **Keeping 'elite' episodes for a longer time**: Instead of throwing them away after training on the best ones. In FrozenLake, a successful episode is a much rarer animal, so we need to keep them for several iterations to train on them.
* **Decrease learning rate**: This will give our network time to average more training samples
* **Much longer training time**: To reach 50% successful episodes, about 5k training iterations are required. 
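
A quick worked example of the discount idea: every successful episode has a raw reward of 1.0, so multiplying by `GAMMA ** number_of_steps` is what separates short wins from long ones. This is only a sketch; `GAMMA = 0.9` is one of the two values suggested above, and `discounted_reward` is a made-up helper name.

```python
GAMMA = 0.9  # assumed value; 0.95 works the same way


def discounted_reward(reward, n_steps, gamma=GAMMA):
    """Scale an episode's final reward by gamma raised to the episode length."""
    return reward * gamma ** n_steps


# Three successful episodes of different lengths.
for n_steps in (6, 20, 60):
    print(n_steps, round(discounted_reward(1.0, n_steps), 4))

# A failed episode stays at 0 no matter how long it was.
print(discounted_reward(0.0, 6))
```

The shortest possible win (6 steps on this map) keeps about half of its reward, while a 60-step win keeps almost nothing.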

In [None]:
# Changes to filter_batch() function
def filter_batch(batch, percentile):
    """
    We will now calculate discounted reward & return elite episodes (keep track of them)
    """
    disc_rewards = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)), batch))
    
    reward_bound = np.percentile(disc_rewards, percentile)
    
    train_obs = []
    train_act = []
    elite_batch = []  # keeping track of winners
    
    for example, discounted_reward in zip(batch, disc_rewards):
        if discounted_reward > reward_bound:
            
            train_obs.extend(map(lambda step: step.observation, example.steps))
            train_act.extend(map(lambda step: step.action, example.steps))
            
            elite_batch.append(example)
    
    return elite_batch, train_obs, train_act, reward_bound 
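
Why the strict `>` in the comparison? A toy run makes it clearer. The sketch below is not the real `filter_batch()`: it uses a made-up `Episode` tuple that only stores the reward and the number of steps, and a small pure-Python `percentile` helper standing in for `np.percentile` (linear interpolation), so it runs without NumPy.

```python
from collections import namedtuple

# Hypothetical stand-in for the real episode objects.
Episode = namedtuple("Episode", ["reward", "n_steps"])

GAMMA = 0.9       # assumed value
PERCENTILE = 30   # assumed value


def percentile(values, q):
    """Linear-interpolation percentile, a stand-in for np.percentile."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# A typical early batch: 98 failures and only 2 successes.
batch = [Episode(0.0, 10)] * 98 + [Episode(1.0, 6), Episode(1.0, 14)]

disc_rewards = [ep.reward * GAMMA ** ep.n_steps for ep in batch]
reward_bound = percentile(disc_rewards, PERCENTILE)

# Strictly greater: with a bound of 0.0, only the successful episodes pass.
elite = [ep for ep, d in zip(batch, disc_rewards) if d > reward_bound]

print(reward_bound, len(elite))  # 0.0 2
```

With `>=` instead, all 100 episodes would count as 'elite' here and the network would happily train on the failures too.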

In [None]:
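# Adding this to the training loop.
# We store the previous 'elite' episodes in full_batch and pass them back to
# filter_batch() on the next training iteration.
# env, net, iterate_batches, BATCH_SIZE and PERCENTILE are assumed to be defined
# as in the original cross-entropy script.
full_batch = []
for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
    # Mean undiscounted reward = fraction of successful episodes in this batch
    reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))

    # Filter the new episodes together with the elite ones kept so far
    full_batch, obs, acts, reward_bound = filter_batch(full_batch + batch, PERCENTILE)

    # Nothing beat the reward bound (e.g. no successful episode yet): play another batch
    if not full_batch:
        continue

    obs_v = torch.Tensor(obs)
    acts_v = torch.LongTensor(acts)
    full_batch = full_batch[-500:]  # keep only the 500 most recent elite episodes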