# "Reinforcement Learning - Part 2"

> "RL concepts in 1 place"

- toc: true
- branch: master
- badges: false
- comments: true
- categories: [Machine Learning]
- hide: false
- search_exclude: false
- image: images/post-thumbnails/rl.png
- metadata_key1: notes
- metadata_key2: 

# Q Learning

This post implements Q-learning in code, building on the theory from the previous post.
- You can find the previous post here: [Reinforcement Learning - Basics](https://ablearn.io/machine%20learning/2021/08/02/RL.html)

We will use OpenAI Gym to apply Q-learning to a FrozenLakeNoSlip environment. I will explain the important parts of the code; the full code can be downloaded [here](https://colab.research.google.com/drive/1PbJnJonr8VWaOMnbosgxHbDYX4kwntMp?usp=sharing).

---

*Initialize the OpenAI Gym environment. Print the action space and the state (observation) space to see how many actions and states we are dealing with.*

```python

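import gym
from gym.envs.registration import register

# Register a non-slippery variant of FrozenLake; skip if it is already registered
try:
    register(
        id='FrozenLakeNoSlip-v0',
        entry_point='gym.envs.toy_text:FrozenLakeEnv',
        kwargs={'map_name' : '4x4', 'is_slippery':False},
        max_episode_steps=100,
        reward_threshold=0.78, # optimum = .8196
    )
except gym.error.Error:
    pass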

env_name = "FrozenLakeNoSlip-v0"
env = gym.make(env_name)
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)

""" Output
#Observation space: Discrete(16)
#Action space: Discrete(4)
"""

```
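
To make the `Discrete(16)` and `Discrete(4)` outputs concrete, here is a small sketch. It assumes the standard 4x4 FrozenLake layout, where states are numbered row by row and actions 0 to 3 mean left, down, right, up. The names `state_to_cell` and `ACTION_NAMES` are made up for illustration and are not part of gym.

```python
# Hypothetical helpers (not part of gym) for reading FrozenLake states and actions.
GRID_SIZE = 4  # the '4x4' map has 4 rows and 4 columns, so 16 states

# FrozenLake's action encoding: 0=left, 1=down, 2=right, 3=up
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}

def state_to_cell(state, grid_size=GRID_SIZE):
    """Convert a state index into a (row, column) pair; states are numbered row by row."""
    return divmod(state, grid_size)

print(state_to_cell(0))    # start cell, top-left
print(state_to_cell(15))   # goal cell, bottom-right
print(ACTION_NAMES[2])
```

So the 16 states are simply the 16 cells of the grid, and the Q table built in the next section has one row per cell and one column per direction.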



---

## Initialize

Initialize parameters such as "discount factor", "learning rate" and "Q table"

```python

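import numpy as np

# Minimal random-action base class. The post does not show it; this version is
# an assumption so that the super() calls in QAgent resolve.
class Agent():
    def __init__(self, env):
        self.env = env
        self.action_size = env.action_space.n  # number of discrete actions

    def get_action(self, state):
        return self.env.action_space.sample()  # uniformly random action

class QAgent(Agent):
    def __init__(self, env, discount_rate=0.97, learning_rate=0.01):
        super().__init__(env)
        self.state_size = env.observation_space.n  # number of discrete states
        self.eps = 1.0                             # exploration rate (epsilon)
        self.discount_rate = discount_rate         # discount factor
        self.learning_rate = learning_rate         # learning rate
        self.build_model()                         # initialize the Q-table

    def build_model(self):
        # Q-table with rows (state size) x columns (action size), filled with small random values
        self.q_table = 1e-4 * np.random.random([self.state_size, self.action_size])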
  



```



---

## Choosing Action

```python  
# Method of QAgent; needs `import random` at the top of the file
def get_action(self, state):
    # Look up the Q-values for this state in the Q table.
    # There are 4 actions, so you get 4 values (q1, q2, q3, q4).
    q_state = self.q_table[state]

    # Greedy approach: pick the action with the max Q-value, the current "best action".
    # The problem with always choosing the max is that the agent never tries
    # actions whose values it has not learned yet.
    action_greedy = np.argmax(q_state)

    # To avoid taking the max all the time, explore with probability epsilon (self.eps):
    # if random.random() < epsilon, choose a random action, else choose the greedy one.
    action_random = super().get_action(state)
    return action_random if random.random() < self.eps else action_greedy

```
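
The random-versus-greedy choice can be tried out without gym. The following is a minimal sketch using only the standard library; `epsilon_greedy` and the Q-values are made up for illustration and are not the code from the notebook.

```python
import random

def epsilon_greedy(q_values, eps, rng=random):
    """Pick a random action with probability eps, else the highest-valued action."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

q_values = [0.1, 0.7, 0.2, 0.0]  # made-up Q-values for one state; action 1 is best

random.seed(0)
picks = [epsilon_greedy(q_values, eps=0.2) for _ in range(1000)]
greedy_share = picks.count(1) / len(picks)
print("share of greedy picks:", greedy_share)
```

With `eps=0.2`, roughly 85% of the picks should be the greedy action: 80% from exploitation, plus the quarter of random picks that happen to land on it.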

---

## TR Update

This is the main function, which updates the Q table based on the Bellman equation.

From the last blog post

> $Q(s,a) = Q(s,a) + \alpha \left[ R(s,a,s') + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$

One can read this as:
- **Q value = current Q value + learning rate × (target - current Q value)**, where the target is the reward plus the discounted best Q value of the next state
- **Q value = current Q value + learning rate × (BELLMAN ERROR)**
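
To see the arithmetic of a single update, here is a small worked sketch. The function `q_update` and the input numbers are made up for illustration; the default learning rate and discount factor are the ones used in this post.

```python
def q_update(q_sa, reward, max_q_next, learning_rate=0.01, discount_rate=0.97):
    """Return the updated Q(s,a) after one Q-learning step."""
    sample = reward + discount_rate * max_q_next   # target: reward + discounted best next value
    bellman_error = sample - q_sa                  # how far off the current estimate is
    return q_sa + learning_rate * bellman_error

# Made-up numbers: current estimate 0.5, reward 1.0, terminal next state (max Q = 0)
new_q = q_update(q_sa=0.5, reward=1.0, max_q_next=0.0)
print(new_q)
```

Here the target is 1.0 and the Bellman error is 0.5, so the estimate moves by 0.01 × 0.5 = 0.005, from 0.5 to about 0.505. A small learning rate means each experience only nudges the table.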


```python

# Method of QAgent
def train(self, experience):
    state, action, next_state, reward, done = experience

    # Q-values of the next state; a terminal state has no future value
    q_next = self.q_table[next_state]
    q_next = np.zeros([self.action_size]) if done else q_next

    # sample = R + discount * max(q values of the next state)
    sample = reward + self.discount_rate * np.max(q_next)

    # Update: q(s,a) = q(s,a) + learning rate * (Bellman error),
    # where Bellman error = sample - q(s,a)
    self.q_table[state, action] = self.q_table[state, action] + self.learning_rate * (sample - self.q_table[state, action])

    # Decay epsilon at the end of each episode so the agent explores less over time
    if done:
        self.eps = self.eps * 0.99


```
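
The `0.99` decay factor controls how quickly the agent shifts from random to greedy actions. The sketch below assumes epsilon starts at 1.0 and is multiplied by 0.99 once per finished episode, as in `train`; the helper `eps_after` is made up for illustration.

```python
def eps_after(episodes, start=1.0, decay=0.99):
    """Epsilon after a number of finished episodes, decayed once per episode."""
    eps = start
    for _ in range(episodes):
        eps = eps * decay
    return eps

for n in (10, 100, 500):
    print(n, round(eps_after(n), 3))
```

After 100 episodes epsilon is still around 0.37, so more than a third of the actions are random at that point. Running more episodes, or using a smaller decay factor, would make the agent act greedily sooner.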

---


## Bringing it all together


```python

import time
from IPython.display import clear_output

# Written for the older gym API: reset() returns the state, step() returns 4 values
agent = QAgent(env)   # Q-learning agent for the environment created above

total_reward = 0
for ep in range(100):  # number of episodes
  state = env.reset()  # reset and start from the beginning
  done = False
  while not done:
    action = agent.get_action(state)   # pass the current state and get a random or greedy action
    # Step through the env. Gym does the rest: it returns the next state and the reward,
    # tells you whether the episode is complete, and gives other meta info.
    next_state, reward, done, info = env.step(action)

    agent.train((state, action, next_state, reward, done))
    state = next_state        # set the current state to the next state
    total_reward += reward    # add up the rewards
    print("s:", state, "a:", action)
    print("Episode: {}, Total reward: {}, eps: {}".format(ep, total_reward, agent.eps))
    env.render()
    print(agent.q_table)
    time.sleep(0.05)
    clear_output(wait=True)


```
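
Once training is done, the Q table can be read as a policy: in each state, take the action with the highest value. The following is a minimal sketch with a made-up 2x2 Q table; `greedy_policy`, `render_policy`, and `ARROWS` are hypothetical helpers, not part of the notebook or of gym.

```python
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^"}  # FrozenLake actions: left, down, right, up

def greedy_policy(q_table):
    """Return the index of the best action for every state (row) of the Q-table."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in q_table]

def render_policy(policy, grid_size):
    """Lay the per-state actions out as rows of arrows matching the grid."""
    symbols = [ARROWS[a] for a in policy]
    return ["".join(symbols[r * grid_size:(r + 1) * grid_size]) for r in range(grid_size)]

# Made-up Q-table for a 2x2 grid (4 states, 4 actions), for illustration only
q_table = [
    [0.0, 0.1, 0.9, 0.0],   # state 0: right is best
    [0.0, 0.8, 0.1, 0.0],   # state 1: down is best
    [0.0, 0.0, 0.7, 0.2],   # state 2: right is best
    [0.5, 0.0, 0.0, 0.0],   # state 3: left is best
]

policy = greedy_policy(q_table)
for line in render_policy(policy, grid_size=2):
    print(line)
```

With the trained agent, passing `agent.q_table` and `grid_size=4` should show the route the agent has learned, although cells it rarely visited will still reflect the random initial values.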


For reference, the value iteration update:

> $V_{i+1}(s) = \max_{a \in A} \sum_{s'} P(s' \mid s,a) \left[ R(s,a,s') + \gamma V_{i}(s') \right]$

