<a href="https://colab.research.google.com/github/format37/cartpole22/blob/with-colab/DeepLearningLab_Hackathon_Cartpole_Team.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hackathon: Basic RL



## Third-party libraries and imports

### YouTube

In [None]:
from IPython.display import YouTubeVideo

### Visualisation Libraries

In [None]:
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

!apt-get update > /dev/null 2>&1
!apt-get install -y cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1
!pip install box2d-py

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!apt-get install -y xvfb x11-utils
!pip install gym[box2d]==0.17.* pyvirtualdisplay==0.2.* PyOpenGL==3.1.* PyOpenGL-accelerate==3.1.*

Reading package lists... Done
Building dependency tree       
Reading state information... Done
x11-utils is already the newest version (7.7+3build1).
xvfb is already the newest version (2:1.19.6-1ubuntu4.11).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 71 not upgraded.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

<Display cmd_param=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':1005'] cmd=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '1400x900x24', ':1005'] oserror=None return_code=None stdout="None" stderr="None" timeout_happened=False>

### Gym/Numpy

In [None]:
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only
import tensorflow as tf
import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
from IPython.display import HTML

from IPython import display as ipythondisplay

### Video Renderer Helpers

In [None]:
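def show_video():
  """Embed the first recorded MP4 found in ./video in the notebook output."""
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    # Read-only binary mode is enough; the file is never written here.
    with io.open(mp4, 'rb') as f:
      video = f.read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else:
    print("Could not find video")


def wrap_env(env):
  """Wrap env in gym's Monitor so recordings are written to ./video."""
  # force=True clears monitor files left over from an earlier run.
  env = Monitor(env, './video', force=True)
  return env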


### Load Models

In [25]:
!git clone https://github.com/format37/cartpole22.git


In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Introduction

### Deep Q-Network (DQN) Algorithm
The Deep Q-Network has played an important role in deep reinforcement learning. The paper *Playing Atari with
Deep Reinforcement Learning* proposed a breakthrough algorithm that applies a deep neural network to
Q-learning, with a convolutional neural network acting as a feature extractor over the input pixels. This allows
a single network to learn to play Atari 2600 games without being taught the rules and without changing the
network architecture between games.

**Deep Q Network** (DQN): 

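*   Model-free: the agent learns directly from sampled transitions and does not build or use a model of the environment's dynamics.
*   Off-policy: each update can use data collected at any point during training, regardless of how the agent was choosing to explore the environment when the data was obtained.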

Q-learning aims to learn an approximation $Q_{\theta}(s,a)$ of the optimal action-value function $Q^*(s,a)$.
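
The update behind this approximation can be sketched in tabular form. This is a minimal illustration rather than the notebook's agent: the table `q`, the helper `q_learning_update`, the step size `alpha` and the state names are all hypothetical. The target `reward + gamma * max Q(next_state, a)` is the standard Q-learning target, and `gamma=0.95` mirrors the `GAMMA` value used later in the notebook.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, done,
                      n_actions, alpha=0.1, gamma=0.95):
    """Apply one tabular Q-learning step in place and return the TD target."""
    # Terminal transitions have no future value to bootstrap from.
    if done:
        target = reward
    else:
        best_next = max(q[(next_state, a)] for a in range(n_actions))
        target = reward + gamma * best_next
    # Move the current estimate a fraction alpha towards the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return target


# Hypothetical two-state example: every unseen entry starts at 0.0.
q = defaultdict(float)
q[("s1", 0)] = 2.0
target = q_learning_update(q, "s0", 1, reward=1.0, next_state="s1",
                           done=False, n_actions=2)
```

DQN replaces the table with a neural network and fits the network output for the taken action to the same kind of target.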


### Advantage Actor-Critic (A2C) Algorithm

The Advantage Actor-Critic algorithm, often known as A2C, combines the policy network from the REINFORCE
algorithm, used as the actor network, with a V-network, used as the critic network. Because the policy
gradient in Equation 4.5 often has too much variance for the actor network to learn a good policy on its own,
the V-network is introduced as a critic: it provides the advantage function that is used to optimise the
policy network.
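
As a rough sketch of what the critic contributes, the advantage can be estimated from one transition as `r + gamma * V(s') - V(s)`. The text above does not say which estimator is intended, so the one-step form is an assumption, and `one_step_advantage` and its arguments are hypothetical names chosen for illustration.

```python
def one_step_advantage(reward, value, next_value, done, gamma=0.95):
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    # At a terminal state there is no next-state value to bootstrap from.
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


# Same reward and value estimates, once mid-episode and once at termination.
adv_mid = one_step_advantage(reward=1.0, value=10.0, next_value=10.0, done=False)
adv_end = one_step_advantage(reward=1.0, value=10.0, next_value=10.0, done=True)
```

A positive advantage means the action did better than the critic expected, so the actor is nudged to make it more likely; a negative one makes it less likely.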

## Environment 1: Cartpole

### Algorithm 1: DQN

In [None]:
GAMMA = 0.95
LEARNING_RATE = 0.001

MEMORY_SIZE = 1000000
BATCH_SIZE = 20

EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995
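
To get a feel for the exploration schedule, the sketch below counts how many multiplicative decay steps it takes for the exploration rate to fall from `EXPLORATION_MAX` to `EXPLORATION_MIN`. The helper `updates_until_min` is a hypothetical name; the constants are copied from the cell above. It assumes one decay per update, which is how the agent in the next cell applies it.

```python
EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995


def updates_until_min(start, floor, decay):
    """Count multiplicative decay steps until the rate reaches the floor."""
    rate, steps = start, 0
    while rate > floor:
        rate *= decay
        steps += 1
    return steps


steps_to_floor = updates_until_min(EXPLORATION_MAX, EXPLORATION_MIN,
                                   EXPLORATION_DECAY)
print(steps_to_floor)
```

With these values the agent stops exploring heavily after several hundred replay updates, which is worth keeping in mind when choosing the number of training episodes.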

In [None]:
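from collections import deque

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam


class DQNAgent:
    def __init__(self, observation_space, action_space):
        self.exploration_rate = EXPLORATION_MAX

        self.action_space = action_space
        # Replay buffer: the oldest transitions are dropped once it is full.
        self.memory = deque(maxlen=MEMORY_SIZE)

        # Small MLP that maps an observation to one Q-value per action.
        self.model = Sequential()
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))
        self.model.add(Dense(24, activation="relu"))
        self.model.add(Dense(self.action_space, activation="linear"))
        self.model.compile(loss="mse", optimizer=Adam(learning_rate=LEARNING_RATE))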

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

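    def act(self, state):
        # Epsilon-greedy: explore with probability exploration_rate.
        if np.random.rand() < self.exploration_rate:
            return random.randrange(self.action_space)
        # state is expected to have shape (1, observation_space).
        q_values = self.model.predict(state)
        return np.argmax(q_values[0])

    def experience_replay(self):
        # Wait until the buffer holds at least one full batch.
        if len(self.memory) < BATCH_SIZE:
            return
        batch = random.sample(self.memory, BATCH_SIZE)
        for state, action, reward, state_next, terminal in batch:
            # Bellman target: the reward alone at terminal states, otherwise
            # the reward plus the discounted best next-state Q-value.
            q_update = reward
            if not terminal:
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))
            # Only the taken action's Q-value is moved towards the target.
            q_values = self.model.predict(state)
            q_values[0][action] = q_update
            self.model.fit(state, q_values, verbose=0)
        # Decay epsilon once per replay call, never below EXPLORATION_MIN.
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)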
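
The replay memory used by the agent relies on two standard-library behaviours: a `deque` with `maxlen` silently drops its oldest entries, and `random.sample` draws a batch uniformly without replacement. The sketch below shows both with a deliberately tiny capacity; the transition values are placeholders, not real CartPole data.

```python
import random
from collections import deque

memory = deque(maxlen=3)  # tiny capacity, for illustration only
for step in range(5):
    # (state, action, reward, next_state, done) with placeholder values
    memory.append((step, 0, 1.0, step + 1, False))

# Only the three most recent transitions survive; the first two were evicted.
stored_states = [transition[0] for transition in memory]

random.seed(0)
batch = random.sample(list(memory), 2)  # uniform, without replacement
```

Sampling old and new transitions together is what makes the method off-policy: the batch may contain actions chosen under a much higher exploration rate than the current one.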

### Algorithm 2:

### Run Cartpole

In [None]:
env = wrap_env(gym.make("CartPole-v1"))

In [None]:
print(env.action_space)
print(env.reset())

Discrete(2)
[ 0.04858072  0.0159292  -0.00900628  0.02020027]


In [None]:
observation = env.reset()
while True:
    env.render()

    # your agent goes here
    action = env.action_space.sample()

    observation, reward, done, info = env.step(action)

    if done:
        break
env.close()
show_video()
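
The loop above picks random actions. As a hedged example of what could replace `env.action_space.sample()`, the sketch below is a hand-written baseline, not a learned agent and not part of the original notebook. `lean_policy` is a hypothetical name. It relies on the CartPole observation layout `[cart position, cart velocity, pole angle, pole angular velocity]` and on action `0` pushing left and `1` pushing right.

```python
def lean_policy(observation):
    """Push the cart towards the side the pole is leaning to."""
    # Index 2 of the CartPole observation is the pole angle;
    # a positive angle means the pole leans to the right.
    pole_angle = observation[2]
    return 1 if pole_angle > 0 else 0


# The observation printed by env.reset() earlier in this section.
sample_observation = [0.04858072, 0.0159292, -0.00900628, 0.02020027]
action = lean_policy(sample_observation)
```

Inside the loop this would read `action = lean_policy(observation)`. A trained `DQNAgent` would be plugged in at the same place through its `act` method.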

---

## Environment 2: Lunar Lander

### Run Lunar Lander

In [None]:
env = wrap_env(gym.make("LunarLander-v2"))
observation = env.reset()
while True:
    env.render()

    # your agent goes here
    action = env.action_space.sample()

    observation, reward, done, info = env.step(action)

    if done:
        break
env.close()
show_video()


## Environment 3: Bipedal Walker

In [None]:
YouTubeVideo('Aq3s5mhz1kw')

### Run Bipedal Walker

In [None]:
env = wrap_env(gym.make("BipedalWalker-v3"))
observation = env.reset()
while True:
    env.render()

    # your agent goes here
    action = env.action_space.sample()

    observation, reward, done, info = env.step(action)

    if done:
        break
env.close()
show_video()