# Reinforcement Learning
### Model-Free Learning
#### Teaching a mountain car to reach the flag as fast as possible using Q-Learning
# Theory
- *What is Reinforcement Learning?*
  Teaching a system to perform a task through reward and punishment.
- Example: teaching a dog to obey orders.
1. Agent: the dog
2. State: the command it hears
3. Action: lying down
4. Policy: the rules the agent uses to choose an action in a given state
- *Why Reinforcement Learning?*
- *Disadvantages of Supervised Learning*
1. It needs a large labelled data set.
2. It imitates the actions of a human player (the labelled data), so the agent can never become better than that player.
- *Reinforcement Learning*
  It does not need a data set; the agent learns from the rewards it receives while interacting with the environment.
<img src="pic.svg">
- *Types of Learning: Model-Based and Model-Free*
1. Model-free learning: the agent derives an optimal policy directly from its interactions with the environment, without first building a model of that environment.
2. Q-Learning: a model-free technique that finds the optimal action-selection policy by learning a Q-function.
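
To make "model-free" concrete, here is a minimal sketch on a made-up five-cell corridor (not the mountain car). The environment function `step`, the constants, and the values of `alpha` and `gamma` are all assumptions chosen for illustration. The agent never inspects `step`; it only learns from the transitions it observes.

```python
import numpy as np

# Hypothetical toy problem: a corridor of 5 cells, goal in the last cell.
N_STATES, GOAL = 5, 4
LEFT, RIGHT = 0, 1

def step(state, action):
    """Environment dynamics. The agent never reads this; it only sees results."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    return next_state, -1.0, next_state == GOAL  # reward of -1 per move

rng = np.random.default_rng(0)
q_table = np.zeros((N_STATES, 2))
alpha, gamma = 0.5, 1.0  # learning rate and discount factor (assumed values)

for _ in range(200):  # episodes
    state, done = 0, False
    while not done:
        action = int(rng.integers(2))  # behave randomly: pure exploration
        next_state, reward, done = step(state, action)
        # Learn from the observed transition only; no model of `step` is needed.
        target = reward + gamma * np.max(q_table[next_state])
        q_table[state, action] += alpha * (target - q_table[state, action])
        state = next_state

greedy_policy = np.argmax(q_table[:GOAL], axis=1)  # best action per non-goal cell
```

Even though the agent acted at random, the greedy policy read from the learned table moves right in every cell, which is the shortest path to the goal.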

Q-Table: Gives the agent a memory. It stores a value for each state-action combination (the Q-values).
<img src="qtable.png">
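
As a minimal sketch of the idea, a Q-table can be held in a 2-D NumPy array with one row per state and one column per action. The sizes and the stored values below are made up for illustration and are not taken from this notebook.

```python
import numpy as np

n_states, n_actions = 4, 3  # hypothetical sizes, for illustration only
q_table = np.zeros((n_states, n_actions))  # every Q-value starts at zero

# Pretend the agent has already learned something about state 2.
q_table[2] = [-5.0, -3.0, -4.0]

# Greedy action for one state: the column with the highest Q-value.
best_action = int(np.argmax(q_table[2]))

# Greedy action for every state at once.
greedy_policy = np.argmax(q_table, axis=1)
```

Here `best_action` is 1, because -3.0 is the largest value in row 2. Rows that are still all zeros fall back to action 0, since `np.argmax` returns the first maximum.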

Q-values: A Q-value represents the “quality” of an action taken from a given state.
<img src="formula.png">
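
Written out, the standard Q-learning update, which is also what `act` in the Implementation section computes, is given below. The symbols are the usual textbook ones and are assumed to match the figure: $\alpha$ is the learning rate, $\gamma$ the discount factor (called `lookahead` in the code), $r$ the reward received, and $s'$ the state reached after taking action $a$ in state $s$.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

As a worked example with made-up numbers: if $Q(s, a) = 0$, $\alpha = 0.5$, $\gamma = 1$, $r = -1$ and $\max_{a'} Q(s', a') = -2$, the new value is $0 + 0.5 \cdot (-1 + (-2) - 0) = -1.5$.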
Task:
1. Agent: the car
2. Actions (3): drive left, do nothing, drive right
3. Observation/state: position and velocity (continuous variables)

## References
Theory:
- AIDA Lecture 8
- https://www.youtube.com/watch?v=aCEvtRtNO-M

Implementation:
- AIDA Exercise 8
- https://www.novatec-gmbh.de/en/blog/introduction-to-q-learning/

Exploration vs. exploitation:
- https://www.youtube.com/watch?v=mo96Nqlo1L8

# Idea


# Algorithm

# Familiarization

In [10]:
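import gym
import numpy as np

# .env strips the TimeLimit wrapper, so an episode only ends when the flag is reached
env = gym.make("MountainCar-v0").env
env.seed(0)  # fix the random seed for reproducibility
print(env.observation_space)       # 2 dimensions: position and velocity
print(env.observation_space.high)  # upper bounds of position and velocity
print(env.observation_space.low)   # lower bounds of position and velocity
print(env.action_space)            # 3 discrete actions
print(env.action_space.n)
print(env.reset())                 # initial observation
buckets_per_dimension = 40
q_table = np.zeros((buckets_per_dimension ** 2, 3))  # initialize the Q-table with zeros
print(q_table)
last_state = None
last_action = None
episode_length = 1000
actions_per_episode = 10000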

Box(2,)
[0.6  0.07]
[-1.2  -0.07]
Discrete(3)
3
[-0.58912799  0.        ]
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 ...
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
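
Because the Q-table needs a finite number of rows, each continuous observation has to be mapped to one of `buckets_per_dimension ** 2` discrete states. The following is a rough sketch of one way to do that with the bounds printed above. The function name `to_bucket_index` and the clamping of the top edge are illustrative choices, not part of the notebook's own code.

```python
import numpy as np

# Bounds as printed by env.observation_space.low / .high above.
low = np.array([-1.2, -0.07])
high = np.array([0.6, 0.07])
buckets_per_dimension = 40

def to_bucket_index(observation):
    """Map a (position, velocity) pair to a single row index of the Q-table."""
    step = (high - low) / buckets_per_dimension
    buckets = ((np.asarray(observation) - low) / step).astype(int)
    # Clamp so a value exactly on the upper bound stays in the last bucket.
    buckets = np.clip(buckets, 0, buckets_per_dimension - 1)
    return int(buckets[0] * buckets_per_dimension + buckets[1])

lowest = to_bucket_index(low)    # row for the smallest position and velocity
highest = to_bucket_index(high)  # row for the largest position and velocity
start = to_bucket_index([-0.58912799, 0.0])  # the initial observation shown above
```

With 40 buckets per dimension, each position bucket is 0.045 wide and each velocity bucket is 0.0035 wide, and the row indices run from 0 to 1599.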


# Implementation

In [None]:
import gym  # provides the MountainCar environment
from gym import logger  # controls how much diagnostic output gym prints
import numpy as np

class QLearningAgent():
    def __init__(self, action_space, observation_space):
        self.action_space = action_space.n  # number of actions n; act() returns an integer x with 0 <= x < n
        self.observation_space = observation_space  # bounds of (position, velocity)
        self.buckets_per_dimension = 40  # number of discrete bins used for each dimension of the observation
        # The observation is continuous, but a Q-table needs a finite set of states, so each dimension is discretized.
        self.q_table = np.zeros((self.buckets_per_dimension ** 2, self.action_space))
        self.last_state = None
        self.last_action = None

    def act(self, observation, last_reward, episode):
        # Return 0 = push left, 1 = no push, 2 = push right
        # The learning rate starts at 1.0, is multiplied by 0.85 every 100 episodes,
        # and never drops below 0.001.
        learning_rate = max(0.001, 1.0 * (0.85 ** int(episode/100)))
        lookahead = 1  # discount factor (gamma): future rewards count as much as immediate ones
        state = self.to_state(observation)
        if self.last_state is not None:
            # Q-learning update for the previous (state, action) pair
            self.q_table[self.last_state][self.last_action] = self.q_table[self.last_state][self.last_action] \
                + learning_rate * (last_reward + lookahead * np.max(self.q_table[state]) - self.q_table[self.last_state][self.last_action])
        # Softmax (Boltzmann) action selection: while the Q-values are similar the choice is close
        # to random (exploration); as they diverge the best action dominates (exploitation).
        possible_actions = self.q_table[state]
        # Subtracting the maximum leaves the probabilities unchanged but keeps exp() from underflowing to 0/0.
        exp_values = np.exp(possible_actions - np.max(possible_actions))
        probabilities = exp_values / np.sum(exp_values)
        choice = np.random.choice(self.action_space, p=probabilities)
        self.last_state = state
        self.last_action = choice
        return choice


    def to_state(self, observation):
        # Map the continuous (position, velocity) observation to one row index of the Q-table.
        upper_bound_position = self.observation_space.high[0]
        lower_bound_position = self.observation_space.low[0]
        upper_bound_velocity = self.observation_space.high[1]
        lower_bound_velocity = self.observation_space.low[1]
        step_size_position = (upper_bound_position - lower_bound_position) / self.buckets_per_dimension
        step_size_velocity = (upper_bound_velocity - lower_bound_velocity) / self.buckets_per_dimension
        # Clamp to the last bucket so an observation exactly on the upper bound stays inside the table.
        last_bucket = self.buckets_per_dimension - 1
        bucket_position = min(int((observation[0] - lower_bound_position) / step_size_position), last_bucket)
        bucket_velocity = min(int((observation[1] - lower_bound_velocity) / step_size_velocity), last_bucket)
        return bucket_position * self.buckets_per_dimension + bucket_velocity

    
    def get_best_actions(self):
        return np.argmax(self.q_table, axis=1)

if __name__ == '__main__':
    # You can set the level to logger.DEBUG or logger.WARN if you
    # want to change the amount of output.
    logger.set_level(logger.INFO)

    env = gym.make("MountainCar-v0").env

    env.seed(0)
    agent = QLearningAgent(env.action_space, env.observation_space)

    episode_count = 1000
    attempts_in_episode = 10000

    for ep in range(episode_count):
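        ob = env.reset()
        # Start each episode without a remembered transition, so the last step of one
        # episode is not linked to the first state of the next.
        agent.last_state = None
        agent.last_action = None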
        reward = 0
        action = None
        done = False
        for j in range(attempts_in_episode):
            action = agent.act(ob, reward, ep)
            ob, reward, done, _ = env.step(action)
            env.render()
            if done:
                print("Flag reached!")
                break
    
    input("Press Enter to continue and show best solution...")
    ob = env.reset()    
    best = agent.get_best_actions()
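    done = False
    # Cap the number of steps: without the TimeLimit wrapper a poor policy would loop forever.
    for _ in range(attempts_in_episode):
        action = best[agent.to_state(ob)]
        ob, _, done, _ = env.step(action)
        env.render()
        if done:
            break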
    print("Done") 

    env.close()
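
Two details of `act` are easy to miss, so here is a small stand-alone sketch of both: the step-wise decay of the learning rate and the softmax action selection. The helper names `learning_rate_for` and `softmax` are hypothetical and only mirror the expressions used in `act`. The max-subtraction inside `softmax` is a standard numerical-stability trick that leaves the probabilities unchanged.

```python
import numpy as np

def learning_rate_for(episode):
    """Step decay: multiply by 0.85 every 100 episodes, never below 0.001."""
    return max(0.001, 1.0 * (0.85 ** int(episode / 100)))

def softmax(q_values):
    """Turn Q-values into action probabilities (higher Q means more likely)."""
    shifted = np.asarray(q_values, dtype=float) - np.max(q_values)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

rates = [learning_rate_for(ep) for ep in (0, 99, 100, 250)]
# Equal Q-values give a uniform distribution, i.e. pure exploration.
uniform = softmax([0.0, 0.0, 0.0])
# Very negative Q-values would underflow to 0/0 without the shift.
stable = softmax([-1000.0, -1001.0, -1002.0])
```

The learning rate stays at 1.0 for the first 100 episodes and then drops in steps. Since every step in this task is penalised, the Q-values grow more and more negative during training, which is exactly the situation where the shifted form of the softmax matters.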
