# **Soft Actor-Critic**
Implemented in TensorFlow 2.6 with TF-Agents.

The **Soft Actor-Critic** (SAC) algorithm maximizes not only the expected reward but also the entropy of its policy. In other words, it tries to act as unpredictably as possible while still collecting as much reward as possible. This encourages the agent to explore the environment, which can speed up training, and makes it less likely to repeat the same action over and over when the critic's Q-value estimates are imperfect. The paper linked below reports good sample efficiency for this approach.

[Soft Actor-Critic Algorithms and Applications](https://arxiv.org/abs/1812.05905)
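
To make the "reward plus entropy" idea concrete, here is a minimal sketch in plain Python. It assumes a 1-D Gaussian policy, and the helper names `gaussian_entropy` and `soft_reward` as well as the temperature value `alpha=0.2` are made up for illustration; none of this is taken from the notebook or from TF-Agents.

```python
import math

def gaussian_entropy(std: float) -> float:
    """Differential entropy (in nats) of a 1-D Gaussian with standard deviation `std`."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)

def soft_reward(reward: float, std: float, alpha: float) -> float:
    """Reward plus an entropy bonus weighted by the temperature `alpha` (hypothetical helper)."""
    return reward + alpha * gaussian_entropy(std)

# Same environment reward, two policies: one nearly deterministic, one more random.
narrow = soft_reward(reward=1.0, std=0.1, alpha=0.2)
wide = soft_reward(reward=1.0, std=1.0, alpha=0.2)
print(f"narrow policy: {narrow:.3f}, wide policy: {wide:.3f}")
```

With equal reward, the wider (more random) policy scores higher, which is the pressure towards exploration described above.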

### How to get started

**In simulation environment**

- collect 50k steps with a random policy to get a baseline to compare against
- pick the network that looks most promising and train it for 100k iterations on the collected random buffer with a learning rate of 0.001
- use a discount factor (gamma) of 0.98 for future rewards. This pushes the agent to reach the reward in fewer steps, which in turn increases the density of rewards in the replay buffer and makes learning more efficient.
- after 100k-300k iterations, decrease the learning rate to 0.0001 to get more stable results
- check the model summaries with TensorBoard (tensorboard --logdir=models) to tune the ratio of newly collected steps to training iterations. If the losses increase, more iterations are needed.
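
The effect of the discount factor mentioned in the list above can be checked with a few lines of arithmetic. This is only a sketch: the helper names are hypothetical, and `1 / (1 - gamma)` is the usual rule of thumb for the effective horizon, not something computed by the notebook.

```python
def discounted_reward(reward: float, steps: int, gamma: float) -> float:
    """Present value of a reward that arrives `steps` steps in the future."""
    return reward * gamma ** steps

def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb number of steps over which future rewards still matter."""
    return 1.0 / (1.0 - gamma)

# Compare the gamma used here (0.98) with a more common default (0.99).
for gamma in (0.98, 0.99):
    horizon = effective_horizon(gamma)
    value = discounted_reward(1.0, steps=50, gamma=gamma)
    print(f"gamma={gamma}: horizon ~{horizon:.0f} steps, reward after 50 steps is worth {value:.3f}")
```

A lower gamma makes a late reward worth less, so short paths to the goal are preferred.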

**In real environment**
- Do transfer learning starting from the network trained in sim
- Start by collecting steps with the sim-trained agent to measure sim-to-real accuracy
- After 25 episodes, start training the agent in the real environment
- The learning rate should be higher, because there are fewer steps in the real env

### Comparison "Micro-net" spec

An agent with these parameters learns faster, but is less stable. Maybe the learning rate was too high.

- conv_layer_params = (4,7,4), (1,4,1)
- fc_layer_params = (16, 16, 8)
- dropout_layer_params = (0.25, 0.25, 0.125)
- action_fc_layer_params = (256, 256, 256, 256)
- action_dropout_layer_params = (0.25, 0.25, 0.125, 0.25)
- joint_fc_layer_params = (32, 16, 16, 4)
- joint_dropout_layer_params = (0.25, 0.25, 0.125, 0.25)
- activation_fn = tf.keras.activations.swish

### Other remarks

- the ReLU activation function doesn't seem to work well here. For negative inputs its gradient is exactly zero, so a unit that only receives negative inputs gets no gradient during backpropagation and stops learning (the "dying ReLU" problem). More info: [Comparison of Reinforcement Learning Activation Functions to Improve the Performance of the Racing Game Learning Agent](https://s3.ap-northeast-2.amazonaws.com/journal-home/journal/jips/fullText/477/jips_v16n5_7.pdf)

- when changing network layers, change the experiment name or delete the model checkpoints in models/'changed_model'/checkpoints. Otherwise a tensor shape mismatch error is raised.

- when the learning error 'loss is inf or nan' occurs, changing the network learning rate might help
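
The ReLU remark above can be illustrated by comparing the derivatives of ReLU and swish for a few inputs. This is a plain-Python sketch with hand-written derivatives (swish is taken as `x * sigmoid(x)`); the function names are made up for this example and it does not use TensorFlow.

```python
import math

def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0.0 else 0.0

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def swish_grad(x: float) -> float:
    """Derivative of swish(x) = x * sigmoid(x)."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

# ReLU passes no gradient for negative inputs, swish still does.
for x in (-2.0, -1.0, -0.5, 0.5):
    print(f"x={x:+.1f}  relu'={relu_grad(x):.3f}  swish'={swish_grad(x):+.3f}")
```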

# Imports

In [1]:
import gym
from offworld_gym.envs.common.channels import Channels
from offworld_gym.envs.common.enums import AlgorithmMode, LearningType

import silence_tensorflow.auto
import os
import sys
import logging
import tempfile
import numpy as np
import datetime
import certifi
import urllib3
import shutil

import tensorflow as tf
from tf_agents.agents.ddpg import critic_network
from tf_agents.agents.sac import sac_agent 
from tf_agents.agents.sac import tanh_normal_projection_network
from tf_agents.drivers import dynamic_episode_driver
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.environments import wrappers
from tf_agents.metrics import tf_metrics
from tf_agents.networks import actor_distribution_network
from tf_agents.policies import py_tf_eager_policy
from tf_agents.policies import random_tf_policy
from tf_agents.policies import greedy_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.specs import tensor_spec
from tf_agents.train import actor
from tf_agents.train import learner
from tf_agents.train.utils import spec_utils
from tf_agents.train.utils import strategy_utils
from tf_agents.train.utils import train_utils
from tf_agents.utils import common

# Hyperparameters

### Offworld-Gym env parameters

In [2]:
real = True                       # True = Real environment | False = Simulated docker environment
experiment_name = 'Macro_net'     # Experiment name
resume_experiment = True          # Resume training. Also used for transfer learning when going from SIM to REAL

learning_type = LearningType.END_TO_END  # Description of training method
algorithm_mode = AlgorithmMode.TRAIN     # .TEST or .TRAIN  (TEST needed for Offworld leaderboard)
channel_type = Channels.DEPTH_ONLY       # Which sensors to use: .RGB_ONLY or .DEPTH_ONLY or .RGBD

# Access token for Real environment from https://gym.offworld.ai/account
import my_Gym_token # I have this in separate file, but it can be added as string
os.environ['OFFWORLD_GYM_ACCESS_TOKEN'] = my_Gym_token.is_secret # 'insert_as_a_string'

# Project root folder. If path unknown, run in terminal: pwd
os.environ['OFFWORLD_GYM_ROOT'] = '/home/karlaru/PycharmProjects/offworld-gym'

# Python environment used for this project (eg miniconda env). If path unknown, run in terminal: which python
os.environ['PYTHONPATH'] = '/home/karlaru/miniconda3/envs/offworld-karl/bin/python'

# Load right environment
if real:
    env_name = 'OffWorldMonolithContinuousReal-v0'  
else:
    env_name = 'OffWorldDockerMonolithContinuousSim-v0'

### Logging

In [3]:
# Show only INFO messages
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)

# Disable connection security warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)    

### Model checkpoints

In [4]:
# Tempdir for model checkpoints
tempdir = 'models/' + experiment_name + '/'
logging.info(f'Model tempdir: {tempdir}')

INFO:Model tempdir: models/Macro_net/


### Replay buffer

In [5]:
# Max number of steps to keep in replay buffer
replay_max_length = 50000 # 14.3GB file, but uses at least triple the RAM when starting training or saving to file

# Fill replay buffer with random steps (recommended in sim environment)
random_replay_fill = False

# Batch size to sample from buffer for one network training iteration
sample_batch_size = 128

# Load replay buffer seed from file for quicker training?
buffer_from_file = True           # start training with "warm" (with previously made steps) buffer
save_buffer_to_file = True        # save last buffer state to a file so training can be continued where it left off
buffer_save_interval = 9999999999 # save buffer after x interval, when huge then save only after training end

# Replay buffer directory
if real:
    rb_tempdir = 'data/real_buffer/'
else:
    rb_tempdir = 'data/sim_buffer/'
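
The 14.3 GB figure in the comment above can be reproduced with a back-of-the-envelope estimate. The sketch assumes that the float32 depth observation of shape `(1, 240, 320, 1)` (the observation spec printed later in this notebook) dominates the size, and it ignores actions, rewards and other small fields.

```python
import math

BYTES_PER_FLOAT32 = 4
obs_shape = (1, 240, 320, 1)   # observation spec shape from this notebook
replay_max_length = 50000      # same value as in the cell above

# Size of one stored observation and of a full buffer.
bytes_per_obs = math.prod(obs_shape) * BYTES_PER_FLOAT32
total_gib = replay_max_length * bytes_per_obs / 2 ** 30

print(f"{bytes_per_obs} bytes per observation, about {total_gib:.1f} GiB for a full buffer")
```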

### Training

In [6]:
# Total number of patches to run 'range(1, patches_to_run+1)'
patches_to_run = 1000

# Start collecting new steps after patch number 'current_patch_nr > start_collecting' (before only training)
start_collecting = 0

# Episodes to collect before retraining network
episodes_in_patch = 5      # basis for calculating Tensorboard stats 'tensorboard --logdir=logs'
times_to_collect = 4       # collect 5 x 4 = 20 episodes before retraining

# Training iterations after collecting patch
training_iterations = 250   # basis for calculating Tensorboard stats 'tensorboard --logdir=models'
times_to_iterate = 20       # train 250 x 20 = 5k iterations
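
As a quick sanity check of the settings above, the totals per patch and for the whole run follow directly from the values in the cell. The derived variable names are hypothetical.

```python
# Values copied from the training cell above.
patches_to_run = 1000
episodes_in_patch = 5
times_to_collect = 4
training_iterations = 250
times_to_iterate = 20

# One patch = one round of collecting followed by one round of training.
episodes_per_patch = episodes_in_patch * times_to_collect
iterations_per_patch = training_iterations * times_to_iterate

total_episodes = patches_to_run * episodes_per_patch
total_iterations = patches_to_run * iterations_per_patch

print(f"per patch: {episodes_per_patch} episodes, {iterations_per_patch} training iterations")
print(f"full run:  {total_episodes} episodes, {total_iterations} training iterations")
```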

# Computation distribution strategy

Enables running computations on one or more devices in a way that model definition code can remain unchanged when running on different hardware.

In [7]:
# Number of GPU-s available in current machine
num_GPUs = len(tf.config.list_physical_devices('GPU'))

# If no GPU-s available, use CPU
if num_GPUs == 0:
    strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")
    logging.info('No GPUs available. Using only CPU.')

# If one GPU available
elif num_GPUs == 1:
    strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
    logging.info('Using 1 GPU')

# If more than one GPU available, mirror data for compute in parallel
else:
    strategy = tf.distribute.MirroredStrategy()
    logging.info(f'Using {num_GPUs} GPUs')

INFO:Using 1 GPU


# Environment

In Reinforcement Learning (RL), an environment represents the task or problem to be solved. TF-Agents has suites for loading environments such as those from OpenAI Gym. OpenAI Gym environments are written in pure Python and their API uses NumPy arrays. The TFPyEnvironment wrapper converts these arrays to Tensors, which makes the environment compatible with TensorFlow agents and policies.

In [8]:
if real == False:
    env = suite_gym.wrap_env(gym_env=gym.make(env_name, channel_type=channel_type))

else:
    env = suite_gym.wrap_env(gym_env=gym.make(env_name, 
                                              experiment_name=experiment_name,
                                              resume_experiment=resume_experiment,
                                              channel_type=channel_type, 
                                              learning_type=learning_type,
                                              algorithm_mode=algorithm_mode
                                              )) 
# Wrap Gym env into TF env
tf_env = tf_py_environment.TFPyEnvironment(env)

2021-12-06 01:21:09,122 - offworld_gym - INFO - Environment has been initiated.
INFO:Environment has been initiated.
2021-12-06 01:21:09,124 - offworld_gym - INFO - Environment has been started.
INFO:Environment has been started.
  "Box bound precision lowered by casting to {}".format(self.dtype)
2021-12-06 01:21:09,125 - offworld_gym - INFO - Waiting to connect to the environment server.
INFO:Waiting to connect to the environment server.
2021-12-06 01:21:13,448 - offworld_gym - INFO - Experiment has been resumed.
INFO:Experiment has been resumed.
2021-12-06 01:21:13,452 - offworld_gym - INFO - The environment server is running.
INFO:The environment server is running.


In [9]:
# Get tensor specs
observation_spec, action_spec, time_step_spec = spec_utils.get_tensor_specs(tf_env)

### Observation space

One observation is one frame from the depth camera sensor. The sensor resolution is 240x320 pixels. The value of each pixel ranges from 0 to 255.

In [10]:
observation_spec

BoundedTensorSpec(shape=(1, 240, 320, 1), dtype=tf.float32, name='observation', minimum=array(0., dtype=float32), maximum=array(255., dtype=float32))

### Action space

The robot movement command is a 2-element vector of continuous values.

In [11]:
action_spec

BoundedTensorSpec(shape=(2,), dtype=tf.float32, name='action', minimum=array([-0.7, -2.5], dtype=float32), maximum=array([0.7, 2.5], dtype=float32))
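
Because the action spec is bounded, an unbounded network output has to be squashed into these limits. The following is a minimal sketch of the general tanh-squashing idea using the bounds printed above. It is an illustration only, not the TF-Agents implementation, and `squash_to_bounds` is a hypothetical helper.

```python
import math

# Bounds copied from the action spec above.
ACTION_MIN = (-0.7, -2.5)
ACTION_MAX = (0.7, 2.5)

def squash_to_bounds(raw, low=ACTION_MIN, high=ACTION_MAX):
    """Map unbounded values to [low, high] per element using tanh."""
    squashed = []
    for u, lo, hi in zip(raw, low, high):
        center = (hi + lo) / 2.0
        half_range = (hi - lo) / 2.0
        squashed.append(center + half_range * math.tanh(u))
    return tuple(squashed)

print(squash_to_bounds((0.0, 0.0)))     # zero maps to the centre of each range
print(squash_to_bounds((3.0, -3.0)))    # large values approach the bounds
```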

### Time step

A TimeStep contains the data emitted by an environment at each step of interaction. A TimeStep holds a step_type, an observation (typically a NumPy array or a dict or list of arrays), and an associated reward and discount.

In [12]:
print(time_step_spec.discount)
print(time_step_spec.observation)
print(time_step_spec.reward)
print(time_step_spec.step_type)

BoundedTensorSpec(shape=(), dtype=tf.float32, name='discount', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32))
BoundedTensorSpec(shape=(1, 240, 320, 1), dtype=tf.float32, name='observation', minimum=array(0., dtype=float32), maximum=array(255., dtype=float32))
TensorSpec(shape=(), dtype=tf.float32, name='reward')
TensorSpec(shape=(), dtype=tf.int32, name='step_type')


# Agent

### Critic
Gives value estimates for Q(s,a)

In [13]:
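# Observation layer planner for critic and actor (to see output shapes). Dropout layers are not displayed.
# Note: the first stride (3) and the last kernel size (4) used here differ from the
# (16,7,4) and (1,3,1) conv params passed to the critic and actor networks below.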
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, Input

model = Sequential(name='Observation layer')
model.add(Input((240, 320, 1)))
model.add(Conv2D(16, 7, strides=3, activation='swish'))
model.add(Conv2D(8, 5, strides=2, activation='swish')) 
model.add(Conv2D(1, 4, strides=1, activation='swish'))
model.add(Flatten())
model.add(Dense(128, activation='swish'))
model.add(Dense(64, activation='swish'))
model.add(Dense(64, activation='swish'))
model.add(Dense(64, activation='swish'))
model.add(Dense(64, activation='swish'))
model.add(Dense(64, activation='swish'))
model.add(Dense(64, activation='swish'))
model.add(Dense(32, activation='swish'))
model.summary()

Model: "Observation layer"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 78, 105, 16)       800       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 37, 51, 8)         3208      
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 34, 48, 1)         129       
_________________________________________________________________
flatten (Flatten)            (None, 1632)              0         
_________________________________________________________________
dense (Dense)                (None, 128)               209024    
_________________________________________________________________
dense_1 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_2 (Dense)              (None, 64)          

In [14]:
with strategy.scope():
  critic_net = critic_network.CriticNetwork(
        (observation_spec, action_spec),
        
        # INPUT = observation (depth sensor image 240x320 pix)
        # Conv2D(filters, kernel size, stride)
        observation_conv_layer_params=((16,7,4), (8,5,2), (1,3,1)),
        # Dense(number_of_units)
        observation_fc_layer_params=(128, 64, 64, 64, 64, 64, 64, 32),
        # Dropout(rate) dropout layer is after each fully connected layer
        observation_dropout_layer_params=(0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25),

        
        # INPUT = actions
        # Dense(number_of_units)
        action_fc_layer_params=(64, 64, 64, 64, 64, 32),
        # Dropout(rate) dropout layer is after each fully connected layer
        action_dropout_layer_params=(0.25, 0.25, 0.25, 0.25, 0.25, 0.25),
        
      
        # INPUT = [observation, action]
        # Dense(number_of_units) 
        joint_fc_layer_params=(32, 32, 32, 32, 32),
        # Dropout(rate) dropout layer is after each fully connected layer
        joint_dropout_layer_params=(0.25, 0.25, 0.25, 0.25, 0.25),
        # activation function for all layers (conv2D and Dense)
        activation_fn=tf.keras.activations.swish, 
        output_activation_fn=None,
        name='CriticNetwork')
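
The output shapes of the convolution stack can also be worked out by hand. This sketch assumes unpadded ('valid') convolutions and applies the conv params of the critic above to a 240x320 single-channel frame; `conv_out` is a hypothetical helper. The resulting parameter count of the first dense layer can be compared with the critic summary printed further down.

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded ('valid') convolution along one axis."""
    return (size - kernel) // stride + 1

# (filters, kernel size, stride), same values as observation_conv_layer_params
conv_layer_params = ((16, 7, 4), (8, 5, 2), (1, 3, 1))
height, width, channels = 240, 320, 1

for filters, kernel, stride in conv_layer_params:
    height = conv_out(height, kernel, stride)
    width = conv_out(width, kernel, stride)
    channels = filters
    print(f"Conv2D({filters}, {kernel}, stride={stride}) -> ({height}, {width}, {channels})")

flat = height * width * channels
first_dense_params = flat * 128 + 128   # weights plus biases of Dense(128)
print(f"flattened size: {flat}, first dense layer params: {first_dense_params}")
```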

### Actor
Generates actions for given observation

In [15]:
with strategy.scope():
  actor_net = actor_distribution_network.ActorDistributionNetwork(
        observation_spec,
        action_spec,
        preprocessing_layers = None,
        preprocessing_combiner=None, 
        conv_layer_params=((16,7,4), (8,5,2), (1,3,1)),
        fc_layer_params=(128, 64, 64, 64, 64, 64, 64, 64), 
        dropout_layer_params=(0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25),
        kernel_initializer=None,
        activation_fn=tf.keras.activations.swish,
        continuous_projection_net=tanh_normal_projection_network.TanhNormalProjectionNetwork,
        name='ActorDistributionNetwork')

### Agent


In [35]:
with strategy.scope():
  train_step = train_utils.create_train_step()
  tf_agent = sac_agent.SacAgent(
        time_step_spec,
        action_spec,
        critic_network=critic_net,
        critic_optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=0.00004), #0.00003
        actor_network=actor_net,
        actor_optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=0.00005),
        alpha_optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=0.00002),
        actor_loss_weight = 1.0,
        critic_loss_weight = 0.5,
        alpha_loss_weight = 0.8,
        target_update_tau=0.02,
        target_update_period=1,
        td_errors_loss_fn=tf.math.squared_difference,
        gamma=0.98,
        reward_scale_factor=1.0,
        initial_log_alpha = 0.1,
        use_log_alpha_in_alpha_loss = False,
        target_entropy = -0.1,
        gradient_clipping = None,
        debug_summaries = True,
        summarize_grads_and_vars = False,
        train_step_counter=train_step,
        name='Agent')

tf_agent.initialize()
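
The `target_update_tau=0.02` and `target_update_period=1` settings above describe a soft (Polyak) update of the target critic after every training step. A minimal sketch of that update rule on plain Python lists, with made-up weights and a hypothetical `soft_update` helper:

```python
def soft_update(target, source, tau):
    """Move each target weight a fraction `tau` of the way towards the source weight."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]

tau = 0.02
source_weights = [1.0, -2.0]   # made-up critic weights
target_weights = [0.0, 0.0]    # made-up target critic weights

# With an update period of 1 the rule is applied after every training step.
for step in range(1, 101):
    target_weights = soft_update(target_weights, source_weights, tau)
    if step in (1, 10, 100):
        print(step, [round(w, 3) for w in target_weights])
```

A small tau makes the target network lag behind the critic, which usually keeps the TD targets more stable.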

### Critic net

In [17]:
critic_net.summary()

Model: "CriticNetwork"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
observation_encoding/conv2d  multiple                  800       
_________________________________________________________________
observation_encoding/conv2d  multiple                  3208      
_________________________________________________________________
observation_encoding/conv2d  multiple                  73        
_________________________________________________________________
flatten_1 (Flatten)          multiple                  0         
_________________________________________________________________
observation_encoding/dense ( multiple                  119936    
_________________________________________________________________
permanent_variable_rate_drop multiple                  0         
_________________________________________________________________
observation_encoding/dense ( multiple                

### Actor net

In [18]:
actor_net.summary()

Model: "ActorDistributionNetwork"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
EncodingNetwork (EncodingNet multiple                  157233    
_________________________________________________________________
TanhNormalProjectionNetwork  multiple                  260       
Total params: 157,493
Trainable params: 157,493
Non-trainable params: 0
_________________________________________________________________


# Observer

In [19]:
# Tensorflow Agents metrics as Observer assistant
train_metrics = [
    tf_metrics.NumberOfEpisodes(),
    tf_metrics.EnvironmentSteps(),
    tf_metrics.AverageReturnMetric(buffer_size=episodes_in_patch),
    tf_metrics.AverageEpisodeLengthMetric(buffer_size=episodes_in_patch)]

# Custom observer for Tensorboard logging
class Observer:
    def __init__(self):
        
        # Save episode count between kernel restarts to keep Tensorboard graphs from resetting step count
        self.episode_file = log_dir+'TB_episodes.txt'
        
        # Initialize writer with class to avoid empty log files
        self.summary_writer = tf.summary.create_file_writer(log_dir)
        
        try:
            # Read episode count (from previous training) from temp file
            with open(self.episode_file, 'r') as f:
                self.episodes = int(f.read())
        
        except (FileNotFoundError, ValueError):
            # If file not found or its content is not a number
            self.episodes = 0
            
            
    def __call__(self, trajectory):
        
        # Values from Tensorflow train_metrics
        current_episode = train_metrics[0].result().numpy()
        current_steps = train_metrics[1].result().numpy()
        avg_return = train_metrics[2].result().numpy()
        avg_steps = train_metrics[3].result().numpy()
        
        with self.summary_writer.as_default(step = (self.episodes + current_episode)):             
            
            # Store summaries after each patch
            if current_episode % episodes_in_patch == 0:
                
                # Last episode count written to Tensorboard is stored in file
                with open(self.episode_file, 'w') as f:
                    f.write(str(self.episodes + current_episode))
                
                # Write to Tensorboard log folder
                tf.summary.scalar('Avg Reward per episode', avg_return)
                tf.summary.scalar('Avg Steps per episode', avg_steps)
            
            # Show live step and episode count after each step
            print(f"\rTotal steps: {current_steps} in {current_episode} episodes", end="")

# Replay buffer

Stores data about previous training experiences.

In [20]:
# Create replay buffer (step database)
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec = tf_agent.collect_data_spec,
    batch_size = 1,                 # Replays are stored one at a time
    max_length = replay_max_length, # Total buffer size
    device='cpu:0',                 # Use CPU for data storage and compute (GPU needed for network training)
    dataset_drop_remainder=True,    
    dataset_window_shift=None, 
    stateful_dataset=False)
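
The behaviour of a fixed-size uniform replay buffer can be sketched in a few lines of plain Python. `TinyReplayBuffer` is a hypothetical stand-in meant only to show the two key properties (oldest steps are dropped when the buffer is full, sampling is uniform); it is not how `TFUniformReplayBuffer` is implemented.

```python
import random
from collections import deque

class TinyReplayBuffer:
    """Toy fixed-size buffer: drops the oldest step when full, samples uniformly."""

    def __init__(self, max_length: int):
        self._steps = deque(maxlen=max_length)

    def add(self, step) -> None:
        self._steps.append(step)

    def sample(self, batch_size: int, rng: random.Random) -> list:
        return [rng.choice(self._steps) for _ in range(batch_size)]

    def __len__(self) -> int:
        return len(self._steps)

buffer = TinyReplayBuffer(max_length=3)
for step in range(5):          # steps 0 and 1 are pushed out by later steps
    buffer.add(step)

print(len(buffer), buffer.sample(batch_size=4, rng=random.Random(0)))
```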

In [21]:
if buffer_from_file:
    
    # Try loading replay buffer seed from a file
    try:
        tf.train.Checkpoint(replay_buffer=replay_buffer).restore(rb_tempdir+ '-1')
        logging.info(f"Loaded {replay_buffer.gather_all().action[0].shape[0]} steps into buffer from {rb_tempdir}")
    
    except:
        logging.info(f"No previous replay buffer steps found in {rb_tempdir}")

INFO:Loaded 13859 steps into buffer from data/real_buffer/


#### Reading data from buffer

In [22]:
# Create dataset from replay buffer
dataset = replay_buffer.as_dataset(
    sample_batch_size=sample_batch_size,
    num_parallel_calls = 1,
    num_steps=2).prefetch(training_iterations)

experience_dataset_fn = lambda: dataset
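
With `num_steps=2`, every sampled item holds two consecutive steps, which together form the transition needed for a temporal-difference target. A toy version of that sampling on a plain list is sketched below. `sample_pairs` is a hypothetical helper, and the sketch ignores episode boundaries.

```python
import random

def sample_pairs(steps, batch_size, rng):
    """Sample `batch_size` windows of two adjacent steps (current, next)."""
    batch = []
    for _ in range(batch_size):
        i = rng.randrange(len(steps) - 1)   # last step has no successor
        batch.append((steps[i], steps[i + 1]))
    return batch

stored_steps = list(range(10))              # stand-in for stored trajectory steps
pairs = sample_pairs(stored_steps, batch_size=4, rng=random.Random(0))
print(pairs)
```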

# Tensorboard

For model metrics run in terminal: **tensorboard --logdir=models**

In [23]:
%load_ext tensorboard
%tensorboard --logdir logs
logging.info(f"For model metrics run in terminal: tensorboard --logdir=models")

INFO:For model metrics run in terminal: tensorboard --logdir=models


# Random training

To get baseline results

In [24]:
if random_replay_fill:
    # Folder for random policy logs
    log_dir = 'logs/RANDOM'

    # Real and Sim in different subfolders
    if real:
        log_dir += '/REAL/'
    else:
        log_dir += '/SIM/'
    
    
    # Use random policy
    initial_collect_policy = random_tf_policy.RandomTFPolicy(action_spec = tf_env.action_spec(),
                                                          time_step_spec = tf_env.time_step_spec())
    # Use step driver (runs a fixed number of steps per call)
    inital_driver = dynamic_step_driver.DynamicStepDriver(
        tf_env, 
        initial_collect_policy, 
        observers = [replay_buffer.add_batch, Observer()] + train_metrics, 
        num_steps = 1000)
    
    for _ in range(round(replay_max_length/1000)):
        
        try:
            # Do 1000 steps
            inital_driver.run()
        except:
            logging.info(f"Exception occurred: {sys.exc_info()[0]}")
            logging.info("Saving replay buffer!")
            logging.info(f"Saved replay buffer at {replay_buffer.gather_all().action[0].shape[0]} steps")
    
    # Save replay buffer
    tf.train.Checkpoint(replay_buffer=replay_buffer).save(rb_tempdir)
    logging.info(f"Saved replay buffer at {replay_buffer.gather_all().action[0].shape[0]} steps")
else:
    logging.info(f"Using replay from file. No random steps")

INFO:Using replay from file. No random steps


# Collect driver

Driver for running a policy in an environment. It keeps stepping until num_episodes episodes are done.

### Tensorboard logs

In [25]:
# Folder for training logs
log_dir = 'logs/' + experiment_name

# Place logs in env type subfolder
if real:
    log_dir += '/REAL/'
else:
    log_dir += '/SIM/'

In [26]:
# Use episode driver (nr of steps per episode is limited by Gym env)
collect_actor = dynamic_episode_driver.DynamicEpisodeDriver(
    tf_env, 
    py_tf_eager_policy.PyTFEagerPolicy(tf_agent.collect_policy, use_tf_function=True),
    observers = [replay_buffer.add_batch, Observer()] + train_metrics,
    num_episodes = episodes_in_patch)

# Learner

The Learner loads a checkpoint from tempdir if one is available from a previous learning session. Learning is then resumed from the saved point.

In [36]:
agent_learner = learner.Learner(
    tempdir,                               
    train_step, 
    tf_agent,                               
    experience_dataset_fn,                  
    checkpoint_interval=training_iterations,
    summary_interval=training_iterations, 
    max_checkpoints_to_keep=2,
    strategy=strategy,
    run_optimizer_variable_init=True)

# DEBUG: ckpt-xxxx <- shows how many training iterations have been done previously for the current experiment

INFO:Checkpoint available: models/Macro_net/train/checkpoints/ckpt-1135750


# Training

### Training loop

In [None]:
for i in range(1, patches_to_run+1):

    try:
        # Collecting delay is used for enabling pretraining on previously saved replay buffer
        if i > start_collecting:
            for _ in range(times_to_collect):
                collect_actor.run()
                   
            # Save replay buffer into data folder file
            if i % buffer_save_interval == 0 and save_buffer_to_file:
                tf.train.Checkpoint(replay_buffer=replay_buffer).save(rb_tempdir)
                logging.info(f"Saved replay buffer at {replay_buffer.gather_all().action[0].shape[0]} steps")
     
        
        # Only training when algorithm mode = TRAIN. Learning is disabled when .TEST -ing
        if  algorithm_mode == AlgorithmMode.TRAIN:         
            for _ in range(times_to_iterate):
                agent_learner.run(iterations=training_iterations)

    except:
        logging.info(f"Exception occurred: {sys.exc_info()[0]}")
        
        # Close env to save replay buffer faster for sim env
        if real == False:
            tf_env.close()
        
        # When session ends, on exception or kernel interrupt: save latest buffer state to a file
        if save_buffer_to_file:
            logging.info("Exception occurred! Saving last buffer state to a file.")
            tf.train.Checkpoint(replay_buffer=replay_buffer).save(rb_tempdir)
            logging.info(f"Saved replay buffer at {replay_buffer.gather_all().action[0].shape[0]} steps")
        
        # Exit loop
        break
        
logging.info("Training stopped")