<a href="https://colab.research.google.com/github/akirakudo901/link_unity_to_self_made_agents/blob/master/MINT_RL_Colab_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MINT Code Base on Google Colab!**
This notebook runs my own code base on Google Colab so that RL training can run in the background without me having to wait for it to finish.

There are **two ways** to **integrate the code base** with **Google Colab**:

1. **Clone the git repo every time**. This may be time-consuming, but it is useful if I want to change the code frequently on the other side.

2. **Upload the entire code** to **Google Drive** and **mount it**. I may try this later.

## **Set Up**

### **Mounting Google Drive**
We mount the Google Drive folder we will use.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/MyDrive/MINT/Self_implemented_algorithm_trials
%ls

/content/drive/MyDrive/MINT/Self_implemented_algorithm_trials
[0m[01;34mlink_unity_to_self_made_agents[0m/


### **Cloning the repo from GitHub**
We clone the repo from GitHub so that all the code is accessible.
Only run this cell again if the whole repo needs to be copied afresh.

In [None]:
import os
from datetime import datetime

if input("Do you want to clone the new repo? y/n ") == "y":
  # if a previous version exists, rename it to a timestamped backup
  if os.path.exists("link_unity_to_self_made_agents"):
    create_time = datetime.now().strftime("%m_%d_%Y_%Hh_%Mm_%Ss")
    os.rename("link_unity_to_self_made_agents", f"backup_{create_time}")
  # then clone the whole thing
  !git clone https://github.com/akirakudo901/link_unity_to_self_made_agents.git
  %cd link_unity_to_self_made_agents
  %ls

fatal: destination path 'link_unity_to_self_made_agents' already exists and is not an empty directory.
/content/drive/MyDrive/MINT/Self_implemented_algorithm_trials/link_unity_to_self_made_agents
[0m[01;34mgridworld_example_breakdown[0m/  README.md                                 todo.txt
main.py                       tentative_requirements_on_08_09_2023.txt  [01;34mtrained_algorithms[0m/
[01;34mmodels[0m/                       [01;34mtests[0m/                                    [01;34mwandb[0m/


### **Installing other packages**

It seems we first have to install SWIG on this machine. We can then use it to build the Box2D wheel for Gymnasium.
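
Before installing anything, it can help to check whether a SWIG executable is already on the `PATH`. The snippet below is a minimal sketch using only the standard library; the helper name `find_swig` and the list of candidate binary names are assumptions made for illustration, not part of this code base.

```python
import shutil


def find_swig(candidates=("swig", "swig4.0", "swig3.0")):
    """Return the path of the first SWIG executable found on PATH, or None.

    The candidate names are assumptions; adjust them for your system.
    """
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


swig_path = find_swig()
if swig_path is None:
    print("SWIG not found; install it before building the Box2D wheel.")
else:
    print(f"SWIG found at {swig_path}")
```

If a path is reported, the install cell can probably be skipped for the current session.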

In [None]:
!sudo apt-get install swig3.0
!ln -s /usr/bin/swig3.0 /usr/bin/swig
!swig -version

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
swig3.0 is already the newest version (3.0.12-2.2ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 19 not upgraded.
ln: failed to create symbolic link '/usr/bin/swig': File exists

SWIG Version 4.1.1

Compiled with /opt/rh/devtoolset-2/root/usr/bin/c++ [Linux]

Configured options: +pcre

Please see https://www.swig.org for reporting bugs and further information


In [None]:
!pip install gymnasium
!pip install wandb
!pip install gymnasium[Box2D]



## **SAC Training With BipedalWalker-v3**

In [None]:
import gymnasium

from models.policy_learning_algorithms.policy_learning_algorithm import generate_name_from_parameter_dict, generate_parameters
from models.policy_learning_algorithms.soft_actor_critic import SoftActorCritic, uniform_random_sampling_wrapper, no_exploration_wrapper, train_SAC
from models.trainers.gym_base_trainer import GymOffPolicyBaseTrainer

# create the environment and determine specs about it
env = gymnasium.make("BipedalWalker-v3")#, render_mode="human")
trainer = GymOffPolicyBaseTrainer(env)
MAX_EPISODE_STEPS = 1600

parameters = {
    "play_around" : {
        "q_net_learning_rate"  : 1e-3,
        "policy_learning_rate" : 1e-3,
        "discount" : 0.99,
        "temperature" : 0.10,
        "qnet_update_smoothing_coefficient" : 0.005,
        "pol_eval_batch_size" : 1024,
        "pol_imp_batch_size" : 64,
        "update_qnet_every_N_gradient_steps" : 1,
        "qnet_layer_sizes" : (64, 64),
        "policy_layer_sizes" : (64, 64),
        "num_training_steps" : MAX_EPISODE_STEPS * 10,
        "num_init_exp" : 0,
        "evaluate_N_samples" : 1,
        "evaluate_every_N_epochs" : MAX_EPISODE_STEPS,
        "buffer_size" : int(1e6),
        "save_after_training" : True,
        "num_new_exp" : 1,
        "render_evaluation" : False
    },
    "try_256by256" : {
        "q_net_learning_rate"  : 1e-3,
        "policy_learning_rate" : 1e-3,
        "discount" : 0.99,
        "temperature" : 0.75,
        "qnet_update_smoothing_coefficient" : 0.005,
        "pol_eval_batch_size" : 1024,
        "pol_imp_batch_size" : 1024,
        "update_qnet_every_N_gradient_steps" : 1,
        "qnet_layer_sizes" : (256, 256),
        "policy_layer_sizes" : (256, 256),
        "num_training_steps" : MAX_EPISODE_STEPS * 10,
        "num_init_exp" : 5000,
        "num_new_exp" : 1,
        "evaluate_every_N_epochs" : MAX_EPISODE_STEPS,
        "buffer_size" : int(1e6),
        "save_after_training" : True,
        "evaluate_N_samples" : 1,
        "render_evaluation" : False
    },
    "policy_learning_rate_0.005" : {
        "q_net_learning_rate"  : 1e-3,
        "policy_learning_rate" : 5e-3,
        "discount" : 0.99,
        "temperature" : 0.5,
        "qnet_update_smoothing_coefficient" : 0.005,
        "pol_eval_batch_size" : 1024,
        "pol_imp_batch_size" : 1024,
        "update_qnet_every_N_gradient_steps" : 1,
        "qnet_layer_sizes" : (256, 256),
        "policy_layer_sizes" : (256, 256),
        "num_training_steps" : MAX_EPISODE_STEPS * 50,
        "num_init_exp" : 10000,
        "num_new_exp" : 1,
        "evaluate_every_N_epochs" : MAX_EPISODE_STEPS,
        "buffer_size" : int(1e6),
        "save_after_training" : True,
        "evaluate_N_samples" : 1,
        "render_evaluation" : False
    },
}

The environment has observation size: 24 & action size: 4.
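
Among the hyperparameters above, `qnet_update_smoothing_coefficient` is set to 0.005 in every parameter set. In standard SAC, a coefficient like this usually controls the soft (Polyak) update of the target Q-network, where each target weight moves a small fraction toward the online weight. The sketch below assumes that this is what the parameter does here, which the notebook itself does not confirm; `soft_update` is a hypothetical helper working on plain Python lists rather than real network weights.

```python
def soft_update(target, online, tau=0.005):
    """Move each target weight a fraction tau toward the matching online weight.

    Assumed rule: target <- (1 - tau) * target + tau * online
    """
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


# Toy weights standing in for network parameters.
target_weights = [0.0, 0.0]
online_weights = [1.0, -2.0]

# After a few updates the target has only drifted slightly toward the online weights.
for _ in range(3):
    target_weights = soft_update(target_weights, online_weights)
print(target_weights)
```

With a small `tau`, the target network lags far behind the online network, which is the usual reason for choosing values around 0.005.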


In [None]:
# TRAIN MANY COMBINATIONS
params_to_try = generate_parameters(default_parameters=parameters["try_256by256"],
                                    default_name = "try_256by256",
                                    policy_learning_rate = [1e-2, 5e-3, 5e-4, 1e-4, 5e-5],
                                    num_training_steps = MAX_EPISODE_STEPS * 50,
                                    pol_eval_batch_size = 1024,
                                    num_init_exp = 10000,
                                    temperature = 0.5,
                                    qnet_layer_sizes = (256, 256),
                                    policy_layer_sizes = (256, 256))

for name, p in params_to_try.items():
    train_SAC(parameters=p, parameter_name=name, env=env, trainer=trainer, training_id=None)

Training: policy_learning_rate_0.01.
Using device: cuda
Newly generated training id : 6p544ey3 will be used for training.
Generating 10000 initial experiences...
Generation successful!



[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Training loop 1600/80000 successfully ended: reward=-94.29294866753878.

Training loop 3200/80000 successfully ended: reward=-101.2627134033818.

Training loop 4800/80000 successfully ended: reward=-97.60282997829606.

Training loop 6400/80000 successfully ended: reward=-4.838644486881923.

Training loop 8000/80000 successfully ended: reward=-11.910516916933577.

Training loop 9600/80000 successfully ended: reward=-14.494238653683551.

Training loop 11200/80000 successfully ended: reward=-19.393051205680116.

Training loop 12800/80000 successfully ended: reward=-26.918485219192828.

Training loop 14400/80000 successfully ended: reward=-112.83495878205915.

Training loop 16000/80000 successfully ended: reward=-119.78008419228718.

Training loop 17600/80000 successfully ended: reward=-119.6517900393835.

Training loop 19200/80000 successfully ended: reward=-19.911692433339272.

Training loop 20800/80000 successfully ended: reward=-23.220615723785333.

Training loop 22400/80000 successful


Run history:
Cumulative Reward,▃▂▂█▇▇▇▆▁▁▁▇▇▁▇███████▇
Policy Loss,██▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁
QNet1 Loss,█▂▁▁▁▁▁▁▁▁▁▁▄▁▂▁▂▂▂▁▂▁▂▂▁▂▂▂▁▂▂▂▂▂▁▂▂▁▂▂
QNet2 Loss,█▁▁▁▁▁▁▁▁▁▁▁▄▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▁▂▂▁▂▂
Run Cumulative Reward,▆█▄▂▆▁▁▆█▇▆▃█▆▇█▇▇▆▆█▆▇█▇▇▆▇▇
Time Elapsed,▁▁▂▂▂▃▃▃▃▄▄▄▅▅▅▆▆▆▇▇▇██
Total Loss,█▇▇▆▆▆▆▅▅▅▅▄▅▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁

Run summary:
Cumulative Reward,-10.05345
Policy Loss,-105.30489
QNet1 Loss,0.7312
QNet2 Loss,0.44329
Run Cumulative Reward,-78.82509
Time Elapsed,1381.38342
Total Loss,-103.86022


Training: policy_learning_rate_0.005.
Using device: cuda
Newly generated training id : gzxeggzz will be used for training.
Generating 10000 initial experiences...


[34m[1mwandb[0m: Currently logged in as: [33makirakudo901[0m. Use [1m`wandb login --relogin`[0m to force relogin


Generation successful!



Training interrupted...
Closing envs...
Successfully closed envs!
Execution time for this session: 30.46925505299987 sec.
Execution time for the entire training so far: 46.821646682999926 sec.



Run history:
Policy Loss,█▇▇▇▆▆▆▆▆▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▁▁
QNet1 Loss,▄▄▄▆▄▆▄▆▁▃██▅▁▇▇█▃▄▅▃▁▅▁▁▂▂▃▂▂▃▃▂▂▂▃▁▂▁▁
QNet2 Loss,▄▄▄▆▄▆▄▆▁▃██▅▁▇▇█▃▄▅▃▁▅▁▁▂▂▃▂▂▃▃▂▁▂▂▁▂▁▁
Run Cumulative Reward,▁
Total Loss,▄▄▄▆▄▆▄▆▁▄██▆▂▇▇█▃▄▅▃▁▅▁▁▂▂▃▂▂▃▃▂▁▂▂▁▂▁▁

Run summary:
Policy Loss,-6.74092
QNet1 Loss,1.40834
QNet2 Loss,0.88466
Run Cumulative Reward,-87.32196
Total Loss,-4.44792


Training: policy_learning_rate_0.0005.
Using device: cuda
Newly generated training id : y5e1nuqc will be used for training.
Generating 10000 initial experiences...
Generation successful!


Training loop 1600/80000 successfully ended: reward=-113.67351207727039.

Training loop 3200/80000 successfully ended: reward=-105.31917169471504.

Training loop 4800/80000 successfully ended: reward=-97.78065244604522.

Training loop 6400/80000 successfully ended: reward=-18.426063180761165.

Training loop 8000/80000 successfully ended: reward=-16.948377960581904.

Training loop 9600/80000 successfully ended: reward=-15.140000428304498.

Training loop 11200/80000 successfully ended: reward=-4.693802250991939.

Training loop 12800/80000 successfully ended: reward=-7.394061463818652.

Training loop 14400/80000 successfully ended: reward=-16.25294873538148.

Training loop 16000/80000 successfully ended: reward=-14.206822768705718.

Training loop 17600/80000 successfully ended: reward=-3.113285754842973.

Training loop 19200/80000 successfully ended: reward=-15.649708994264564.

Training loop 20800/80000 successfully ended: reward=-8.648525762321443.

Training loop 22400/80000 successfull


Run history:
Cumulative Reward,▁▂▂▇▇██▇█▇█▇▁▇▁▇▇▆▁▇▇█▁▂▇▇█▇▇▇▇▇▇▇▇▇▇▇▇▇
Policy Loss,██▇▆▆▆▅▅▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁
QNet1 Loss,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
QNet2 Loss,▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Run Cumulative Reward,▆▁▂▁▂▆▄▆▂▂▇▇▇█▆▇▅▇▅▆▆█▇▇█▇█▇▇▆▇▆▇█▇▇▇█▇▇
Time Elapsed,▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇██
Total Loss,▅▄▄▄▃▃▃▃▃▂▂▂▂▂▂█▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
Cumulative Reward,-13.29237
Policy Loss,-120.6161
QNet1 Loss,0.75881
QNet2 Loss,0.64654
Run Cumulative Reward,-79.99832
Time Elapsed,3164.14555
Total Loss,-119.21074


Training: policy_learning_rate_0.0001.
Using device: cuda
Newly generated training id : 7vg7qpt3 will be used for training.
Generating 10000 initial experiences...
Generation successful!


Training loop 1600/80000 successfully ended: reward=-103.03895021804593.

Training loop 3200/80000 successfully ended: reward=-101.18550150048763.

Training loop 4800/80000 successfully ended: reward=-96.44910785791045.

Training loop 6400/80000 successfully ended: reward=-100.25941929922719.

Training loop 8000/80000 successfully ended: reward=-20.23391067403272.

Training loop 9600/80000 successfully ended: reward=-14.60113416562211.

Training loop 11200/80000 successfully ended: reward=-8.709635468926805.

Training loop 12800/80000 successfully ended: reward=-9.258679485215064.

Training loop 14400/80000 successfully ended: reward=-32.148703107574036.

Training loop 16000/80000 successfully ended: reward=-6.418781635861748.

Training loop 17600/80000 successfully ended: reward=-2.4805946321129366.

Training loop 19200/80000 successfully ended: reward=-3.6499079628518083.

Training loop 20800/80000 successfully ended: reward=-10.428321095943753.

Training loop 22400/80000 successfull


Run history:
Cumulative Reward,▁▁▁▁▇██▆██▇▇▇█▇██████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▇
Policy Loss,██▇▇▆▆▅▅▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁
QNet1 Loss,█▁▂▂▁▂▃▃▂▃▃▃▅▄▄▃▃▃▂▃▂▃▃▂▃▅▃▃▅▃▅▄▂▄▃▄▃▃▃▅
QNet2 Loss,█▁▂▂▁▂▂▃▂▂▃▃▄▃▃▃▂▂▃▃▁▃▂▂▃▃▃▃▄▂▃▂▃▃▂▃▂▂▃▂
Run Cumulative Reward,▄▅▂▁▆▃▁▇█▂▅▆▇█▇█▆▇▇▅█▇▇▆▆▆▆█▆▇▇██▇▅▇▆▄▆▇
Time Elapsed,▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇███
Total Loss,█▇▇▆▆▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
Cumulative Reward,-13.56469
Policy Loss,-121.60065
QNet1 Loss,0.50076
QNet2 Loss,0.61473
Run Cumulative Reward,-84.71753
Time Elapsed,3205.95469
Total Loss,-120.48516


Training: policy_learning_rate_5e-05.
Using device: cuda
Newly generated training id : k6od38e2 will be used for training.
Generating 10000 initial experiences...
Generation successful!


Training loop 1600/80000 successfully ended: reward=-30.789514506455767.

Training loop 3200/80000 successfully ended: reward=-13.069799970985862.

Training loop 4800/80000 successfully ended: reward=-5.803897000032436.

Training loop 6400/80000 successfully ended: reward=-15.631738802026183.

Training loop 8000/80000 successfully ended: reward=-94.10031409251255.

Training loop 9600/80000 successfully ended: reward=-10.693681102317173.

Training loop 11200/80000 successfully ended: reward=-8.956568202979188.

Training loop 12800/80000 successfully ended: reward=-15.811781232918312.

Training loop 14400/80000 successfully ended: reward=-14.96064662591856.

Training loop 16000/80000 successfully ended: reward=-8.104657326309004.

Training loop 17600/80000 successfully ended: reward=-16.914649182167473.

Training loop 19200/80000 successfully ended: reward=-119.07613038534609.

Training loop 20800/80000 successfully ended: reward=-23.64655187529611.

Training loop 22400/80000 successfull


Run history:
Cumulative Reward,▆▇█▇██▇▇▇▁▇▂▂▂▇▂▇█████▇▇████████████████
Policy Loss,██▇▆▆▆▅▅▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁
QNet1 Loss,█▁▁▁▂▁▁▂▂▂▃▃▃▂▅▂▃▂▂▃▂▂▂▁▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▂
QNet2 Loss,█▁▁▁▂▁▁▂▂▃▂▃▄▂▆▂▃▂▂▃▂▂▂▂▂▃▂▂▂▂▂▂▂▂▁▁▁▂▁▂
Run Cumulative Reward,▆██▁▅▃▇▇█▇▇▃▆▅▆▄▅▆▇▇▇▇▇▇▆▆█▇███▇▇█▇▇▇▇▇█
Time Elapsed,▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇███
Total Loss,█▇▇▆▆▅▅▅▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
Cumulative Reward,-9.08223
Policy Loss,-123.67213
QNet1 Loss,0.29437
QNet2 Loss,0.21516
Run Cumulative Reward,-77.1173
Time Elapsed,3226.59558
Total Loss,-123.16261
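
The sweep above varies `policy_learning_rate` while holding the other overrides fixed, and each run is named after the swept value (for example `policy_learning_rate_0.01`). The sketch below shows one way such an expansion could work. It is not the repo's `generate_parameters`: the function `expand_sweep`, the rule that only list-valued overrides are swept, and the naming scheme are all assumptions inferred from the call and the run names in the output.

```python
import itertools


def expand_sweep(defaults, default_name, **overrides):
    """Expand list-valued overrides into one named parameter dict per combination.

    Assumption: a list means "try each value"; anything else (including tuples
    such as layer sizes) is a fixed override applied to every configuration.
    """
    base = dict(defaults)
    swept = {}
    for key, value in overrides.items():
        if isinstance(value, list):
            swept[key] = value
        else:
            base[key] = value

    if not swept:
        return {default_name: base}

    configs = {}
    for combo in itertools.product(*swept.values()):
        params = dict(base)
        params.update(zip(swept.keys(), combo))
        # Hypothetical naming scheme: "<key>_<value>" joined with underscores.
        name = "_".join(f"{k}_{v}" for k, v in zip(swept.keys(), combo))
        configs[name] = params
    return configs


configs = expand_sweep(
    {"policy_learning_rate": 1e-3, "temperature": 0.75},
    "try_256by256",
    policy_learning_rate=[1e-2, 5e-3, 5e-4, 1e-4, 5e-5],
    temperature=0.5,
)
for name in configs:
    print(name)
```

Under these assumptions, the generated names line up with the `Training: ...` headers printed during the sweep.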


In [None]:
# JUST TRAIN ONE PARAMETER SET
# (p and name are left over from the last iteration of the sweep loop above)
train_SAC(parameters=p, parameter_name=name, env=env, trainer=trainer, training_id="7kkv8stn")
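
To compare runs, the evaluation rewards printed during training can be extracted from the captured output text. The sketch below parses lines of the form `Training loop <step>/<total> successfully ended: reward=<value>.` as they appear in the logs above; `parse_rewards` is a hypothetical helper, and the sample text is copied from the first run of the sweep.

```python
import re

# Matches e.g. "Training loop 1600/80000 successfully ended: reward=-94.29."
LOG_PATTERN = re.compile(
    r"Training loop (\d+)/(\d+) successfully ended: reward=(-?\d+(?:\.\d+)?)"
)


def parse_rewards(log_text):
    """Return a list of (step, reward) pairs found in the log text.

    Lines that do not match the pattern, such as truncated ones, are skipped.
    """
    return [
        (int(m.group(1)), float(m.group(3)))
        for m in LOG_PATTERN.finditer(log_text)
    ]


sample_log = """
Training loop 1600/80000 successfully ended: reward=-94.29294866753878.

Training loop 3200/80000 successfully ended: reward=-101.2627134033818.

Training loop 22400/80000 successful
"""

rewards = parse_rewards(sample_log)
best_step, best_reward = max(rewards, key=lambda pair: pair[1])
print(f"Best evaluation reward {best_reward} at step {best_step}")
```

The truncated last line of the sample is ignored, so cut-off output does not break the parse.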