<a href="https://colab.research.google.com/github/aniruddh-pramod/aniruddh-pramod/blob/main/RL_solved.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [17]:
# Copyright (c) 2020 Brain and Cognitive Society, IIT Kanpur [ BCS @IITK ]
# Copyright under MIT License, must reference https://github.com/bcs-iitk/BCS_Workshop_Apr_20 if used anywhere else.
# Author: Shashi Kant (http://shashikg.github.io/)

## Reinforcement Learning
In this you have to implement and train an RL agent to find a path for a frozen lake problem. 

### Frozen Lake Problem Description:

> Imagine there is a frozen lake stretching from your home to your office; you have to walk on the frozen lake to reach your office. But oops! There are holes in the frozen lake so you have to be careful while walking on the frozen lake to avoid getting trapped in the holes. [[src](https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781788836524/3/ch03lvl1sec32/solving-the-frozen-lake-problem)]

![frozen-lake](https://static.packt-cdn.com/products/9781788836524/graphics/49f3e058-2f32-40e8-9992-b53d1f57d138.png)


The task you have to do here:

*  Use the Gym library from OpenAI to setup a frozen lake environment and work till around 2000 time steps. Then finally output the Q-Table .


In [18]:
import gym
import numpy as np
env = gym.make('FrozenLake-v0')
Q = np.zeros([env.observation_space.n,env.action_space.n])

eta = 0.628
gma = 0.9
num_epi = 0
tot_time = 0
while (tot_time<2100):
    obv = env.reset()
    d = False
    t=0
    while not d:
        t+=1
        #env.render()
        #print(observation)
        action = np.argmax(Q[obv,:] + np.random.randn(1,env.action_space.n)*(1./(num_epi+1)))
        #Get new state & reward from environment
        obv_new,reward,done,_ = env.step(action)
        #Update Q-Table with new knowledge
        Q[obv,action] = Q[obv,action] + eta*(reward + gma*np.max(Q[obv_new,:]) - Q[obv,action])
        obv = obv_new
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            tot_time += t+1
            break
    num_epi += 1
env.close()
print("Final Q Table")
print(Q)

Final Q Table
[[6.62989974e-03 5.54308863e-03 3.22210644e-03 6.49932521e-03]
 [0.00000000e+00 1.74172143e-03 2.44909549e-03 3.18073238e-03]
 [1.14635593e-03 1.21238238e-03 5.21884474e-03 1.04616275e-03]
 [1.82672153e-03 1.63860904e-03 0.00000000e+00 3.10414938e-03]
 [7.36690377e-03 3.02270208e-03 9.81465651e-04 3.09524603e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 4.93289700e-03 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [2.66616768e-03 3.25703258e-04 8.40162619e-05 8.10970507e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 4.47204784e-03]
 [1.50691797e-03 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00647365e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 8.90761773e-02]
 [0.00000000e+00 0.00000000e+00 0.0000000

In [19]:
# Testing Agent Performance
total_time = 0
total_penalty = 0
episodes = 100
for _ in range(episodes):
  state = env.reset()
  time = 0
  penalty = 0
  reward = 0
  done = False
  while not done:
    action = np.argmax(Q[state])
    state, reward, done, info = env.step(action)
    if reward<0:
      penalty += 1
    time += 1
  total_time += time
  total_penalty += penalty
print(f"Results aftetr {episodes} episodes")
print(f"Average penalty per run: {total_penalty/episodes}")
print(f"Average time per run: {total_time/episodes}")

Results aftetr 100 episodes
Average penalty per run: 0.0
Average time per run: 26.3
