# Build-A-Bot

#### Quick and easy explanation of building an RL agent to play LunarLander-v2

***

### Install all needed Python libraries

In [None]:
!pip3 install box2d-py
!pip3 install stable-baselines3
!pip3 install "gym[box2d]==0.21.0"
!pip3 install pyglet

### Import all needed libraries

In [1]:
import os
import torch
import gym
from gym.wrappers import Monitor

from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3 import DQN
import stable_baselines3

### Instantiate the environment
 - [OpenAI Gym Documentation](https://gym.openai.com/docs/)
 - [Gym-Retro Documentation](https://retro.readthedocs.io/en/latest/getting_started.html)

> Gym was released by OpenAI in 2016 with the goal of providing open-source environments for RL agents, much as ImageNet gave computer-vision researchers a shared dataset for training Convolutional Neural Networks. Environments are, by nature, complex and uncertain; the purpose is to let agents learn in simulation from situations that could represent what they would see in the real world.

In [2]:
env = gym.make("LunarLander-v2")

### Let's make an agent that takes random actions

In [3]:
env.reset()
while True:
    env.render()
    # Sample a random action and advance the environment by one step
    obs, rew, done, info = env.step(env.action_space.sample())
    if done:  # stop once the episode has ended
        break


### Now... Let's discuss how a machine makes an action

In [4]:
print(f"Action Space - {env.action_space}")
print(f"Example of a Possible Random Action - {env.action_space.sample()}")

Action Space - Discrete(4)
Example of a Possible Random Action - 3


#### Why length 4?

> 4 possible actions, one per integer in `Discrete(4)`
> - 0: Do Nothing
> - 1: Fire Left Engine
> - 2: Fire Main Engine
> - 3: Fire Right Engine
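
To make sampled actions easier to read, the integers can be given names. This is a minimal sketch: the `LunarLanderAction` enum and `describe_action` helper are hypothetical, and the index order (0 nothing, 1 left, 2 main, 3 right) is an assumption taken from the Gym LunarLander documentation rather than from this notebook.

```python
from enum import IntEnum


class LunarLanderAction(IntEnum):
    """Hypothetical names for the four Discrete(4) action indices (assumed order)."""
    DO_NOTHING = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3


def describe_action(action: int) -> str:
    """Turn an integer sampled from the action space into a readable label."""
    return LunarLanderAction(int(action)).name.replace("_", " ").title()


# The random action printed in the cell above was 3
print(describe_action(3))  # Fire Right Engine
```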

### Now... Let's discuss how an agent sees the world

In [5]:
print(f"Observation Space - {env.observation_space.shape}")
print(f"Example of a Possible Observation - {env.observation_space.sample()}")

Observation Space - (8,)
Example of a Possible Observation - [-0.54296476 -2.2571726  -1.4538776   2.8159966  -1.3927038  -1.6634027
 -1.1655447  -1.7009467 ]


#### Why length 8?
> Agent receives 8 inputs as an observation
> - Horizontal Position
> - Vertical Position
> - Horizontal Velocity
> - Vertical Velocity
> - Angle
> - Angular Velocity
> - Left Leg Contact
> - Right Leg Contact
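
The eight inputs listed above can be unpacked into named fields, which helps when inspecting what the agent sees. This is only a sketch: `LanderObservation` and `parse_observation` are hypothetical helpers, the field order is assumed to match the list above, and the two leg-contact entries are assumed to be 0.0 or 1.0 in real observations.

```python
from typing import NamedTuple, Sequence


class LanderObservation(NamedTuple):
    """Hypothetical named view of the 8-element observation vector."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def parse_observation(obs: Sequence[float]) -> LanderObservation:
    """Split a raw observation into named fields (assumed order, see above)."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    *motion, left, right = (float(v) for v in obs)
    # Leg contacts are assumed to be encoded as 0.0 (no contact) or 1.0 (contact)
    return LanderObservation(*motion, left > 0.5, right > 0.5)


# Made-up observation for illustration only
example = parse_observation([0.1, 1.4, 0.0, -0.5, 0.02, 0.0, 1.0, 0.0])
print(example.vy, example.left_leg_contact, example.right_leg_contact)
```

With the notebook's environment, something like `parse_observation(env.reset())` should work in the same way.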

### Create the model
#### [Deep Q-Network(DQN)](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html)
> DQN builds on Q-learning, which at its simplest stores a value for every state-action pair in a Q-table (a fancy lookup table). The "Deep" part replaces that table with a neural network that estimates the values, which lets it handle environments and observations too complex to enumerate, such as Atari games. It is an "Off-Policy" algorithm, meaning it learns the value of the best (greedy) policy even while the agent collects experience with a different, more exploratory behaviour. It is also a "Model-Free" algorithm: it never builds a model of how the environment works and instead learns by trial and error.

<img src="./images/Q-table.png"/>
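
The lookup-table idea can be sketched in a few lines of plain Python. This is tabular Q-learning, not DQN itself (DQN swaps the table for a neural network), and the names `q_learning_update` and `epsilon_greedy`, the learning rate `alpha`, and the state labels `"s0"` and `"s1"` are all illustrative assumptions.

```python
import random
from collections import defaultdict

N_ACTIONS = 4  # matches Discrete(4)


def q_learning_update(q_table, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """One off-policy update: move Q(s, a) toward reward + gamma * max Q(s', .)."""
    best_next = 0.0 if done else max(q_table[next_state])
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]


def epsilon_greedy(q_table, state, epsilon, rng):
    """Behaviour policy: random action with probability epsilon, else the best known."""
    values = q_table[state]
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=values.__getitem__)


# Every unseen state starts with a row of zeros, one entry per action
q_table = defaultdict(lambda: [0.0] * N_ACTIONS)

# A made-up transition: firing the main engine (2) in "s0" earned reward 1.0
new_value = q_learning_update(q_table, "s0", 2, 1.0, "s1", done=False)
print(new_value)  # 0.1
print(epsilon_greedy(q_table, "s0", epsilon=0.0, rng=random.Random(1)))  # 2
```

The update only needs observed transitions, which is the "model-free" part, and it always bootstraps from the greedy `max` regardless of which action the exploratory policy takes next, which is the "off-policy" part.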

#### Variables for monitoring model performance through evaluation

In [6]:
log_dir = "./logs/"
os.makedirs(log_dir, exist_ok=True)

env = stable_baselines3.common.monitor.Monitor(env, log_dir)

# Evaluates the performance of the agent periodically and logs the results
callback = EvalCallback(env, log_path=log_dir, deterministic=True)
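
The `Monitor` wrapper records one row per finished episode. The sketch below parses such a log with the standard library. It assumes the file is `monitor.csv` inside `log_dir`, begins with a `#`-prefixed metadata line, and has columns `r` (episode reward), `l` (episode length), and `t` (elapsed time); check the file on disk before relying on this. `read_monitor_rows` is a hypothetical helper and the sample values are made up.

```python
import csv


def read_monitor_rows(text):
    """Parse Monitor-style CSV text, skipping the '#' metadata line."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    return [
        {"reward": float(row["r"]), "length": int(row["l"]), "time": float(row["t"])}
        for row in csv.DictReader(lines)
    ]


# Made-up sample in the assumed format, for illustration only
SAMPLE_LOG = """#{"t_start": 0.0, "env_id": "LunarLander-v2"}
r,l,t
-172.5,105,1.0
-121.5,91,2.0
"""

rows = read_monitor_rows(SAMPLE_LOG)
mean_reward = sum(r["reward"] for r in rows) / len(rows)
print(len(rows), mean_reward)  # 2 -147.0
```

To use it on a real run, read the file first, for example `read_monitor_rows(open(os.path.join(log_dir, "monitor.csv")).read())`.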

#### Specific neural network and model training parameters

In [7]:
nn_layers = [64,64]

learning_rate = 0.001  # typical range: 0.00001 to 0.001

policy_kwargs = dict(activation_fn=torch.nn.ReLU, net_arch=nn_layers)
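
To get a feel for how small this network is, the number of trainable weights can be counted by hand. The sketch assumes a plain fully connected Q-network that maps the 8 observation values through the two 64-unit hidden layers to 4 action values, with a bias on every layer; `mlp_parameter_count` is a hypothetical helper.

```python
def mlp_parameter_count(input_dim, hidden_layers, output_dim):
    """Count weights and biases of a fully connected network."""
    sizes = [input_dim, *hidden_layers, output_dim]
    # Each layer has n_in * n_out weights plus n_out biases
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


nn_layers = [64, 64]
# 8 observation inputs, 4 action outputs
print(mlp_parameter_count(8, nn_layers, 4))  # 4996
```

Doubling the hidden width to `[128, 128]` would roughly quadruple the middle layer, so it is a cheap experiment to try when tuning.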

#### Put it all together in the DQN Model
> Don't be scared of all the parameters. DQN comes with defaults for every one of them, and you can optionally change them between runs to see whether performance improves.

In [8]:
#model = DQN("MlpPolicy", env)  # defaults only
model = DQN("MlpPolicy", env, policy_kwargs=policy_kwargs,
            learning_rate=learning_rate,
            batch_size=1,  # transitions sampled from the replay buffer per gradient update
            buffer_size=1,  # replay buffer capacity; 1 keeps only the most recent transition
            learning_starts=1,  # environment steps collected before learning starts
            gamma=0.99,  # discount factor for future rewards, between 0 and 1
            tau=1,  # soft-update coefficient for the target network; 1 is a full (hard) copy
            target_update_interval=1,  # update the target network every environment step
            train_freq=(1, "step"),  # train the network after every step
            exploration_fraction=0.5,  # fraction of training over which the random-action rate is annealed
            gradient_steps=1,  # gradient steps per training update
            seed=1,  # seed for the pseudo-random generators
            verbose=1)  # print training logs

Using cpu device
Wrapping the env in a DummyVecEnv.
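
The effect of `exploration_fraction=0.5` can be sketched as a linear schedule for the random-action probability. This assumes the rate starts at 1.0 and ends at 0.05 (believed to be the library defaults, since the cell above does not set them) and that training runs for 3000 timesteps, as in the training cell of this notebook; `exploration_rate` is a hypothetical helper, not a library function.

```python
def exploration_rate(timestep, total_timesteps, exploration_fraction=0.5,
                     initial_eps=1.0, final_eps=0.05):
    """Linearly anneal epsilon over the first `exploration_fraction` of training."""
    progress = timestep / total_timesteps
    if progress >= exploration_fraction:
        return final_eps  # annealing finished, keep the final rate
    return initial_eps + (progress / exploration_fraction) * (final_eps - initial_eps)


for step in (0, 421, 784, 1500, 3000):
    print(step, round(exploration_rate(step, 3000), 3))
```

Under these assumptions the values at timesteps 421 and 784 come out as 0.733 and 0.503, which agrees with the `exploration_rate` entries in the training log of this notebook.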


### Run the model with no training

In [9]:
observation = env.reset()
while True:
    env.render()
    action, _states = model.predict(observation, deterministic=True)
    observation, reward, done, info = env.step(action)
    if done:
        break
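
The same watch-one-episode loop is repeated for every checkpoint in this notebook, so it can be wrapped in a helper that also reports the total reward. This is a sketch: `run_episode` and the stand-in `ToyEnv` are hypothetical, and `ToyEnv` exists only so the example runs without Box2D. It mimics the 4-tuple `step` return of gym 0.21.

```python
class ToyEnv:
    """Hypothetical stand-in environment with gym 0.21's reset/step interface."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else -1.0  # action 1 is the "good" action
        done = self.t >= self.episode_length
        return [float(self.t)], reward, done, {}


def run_episode(env, choose_action, max_steps=1000):
    """Play one episode and return (total_reward, steps_taken)."""
    observation = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = choose_action(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


print(run_episode(ToyEnv(), lambda obs: 1))  # (5.0, 5)
```

With the notebook's objects, a call along the lines of `run_episode(env, lambda obs: model.predict(obs, deterministic=True)[0])` should give a number to compare across checkpoints; add `env.render()` inside the loop to keep the visualisation.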

### Train and Save the model

In [10]:
model.learn(total_timesteps=3000)
model.save("./dqn_models/dqn_lunar_3k")

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 105      |
|    ep_rew_mean      | -172     |
|    exploration_rate | 0.733    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 414      |
|    time_elapsed     | 1        |
|    total_timesteps  | 421      |
| train/              |          |
|    learning_rate    | 0.001    |
|    loss             | 6.49     |
|    n_updates        | 419      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 98       |
|    ep_rew_mean      | -147     |
|    exploration_rate | 0.503    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 407      |
|    time_elapsed     | 1        |
|    total_timesteps  | 784      |
| train/              |          |
|    learning_rate    | 0.001    |
|    loss             | 10.3     |
|    n_updates      

### What is it even learning?

<img src="./images/Q-table-learning.png"/>

### Agent with very little training - 3000 time steps

In [11]:
observation = env.reset()
while True:
    env.render()
    action, _states = model.predict(observation, deterministic=True)
    observation, reward, done, info = env.step(action)
    if done:
        break

### Agent with little training - 100k time steps

In [12]:
del model
model = DQN.load("./dqn_models/dqn_lunar_100k", env=env)
observation = env.reset()
while True:
    env.render()
    action, _states = model.predict(observation, deterministic=True)
    observation, reward, done, info = env.step(action)
    if done:
        break

Wrapping the env in a DummyVecEnv.


### Agent with medium training - 500k time steps

In [13]:
del model
model = DQN.load("./dqn_models/dqn_lunar_500k", env=env)
observation = env.reset()
while True:
    env.render()
    action, _states = model.predict(observation, deterministic=True)
    observation, reward, done, info = env.step(action)
    if done:
        break

Wrapping the env in a DummyVecEnv.


### Agent with larger amounts of training - 1 million time steps

In [14]:
del model
model = DQN.load("./dqn_models/dqn_lunar_1mill", env=env)
observation = env.reset()
while True:
    env.render()
    action, _states = model.predict(observation, deterministic=True)
    observation, reward, done, info = env.step(action)
    if done:
        break

Wrapping the env in a DummyVecEnv.


### References

 - https://deeplearning.neuromatch.io/projects/ReinforcementLearning/lunar_lander.html
 - https://www.duckietown.org/
 - https://www.kaggle.com/competitions/kore-2022-beta
 - https://www.youtube.com/watch?v=WXuK6gekU1Y