# Unit 6: Advantage Actor Critic (A2C) using Robotics Simulations with Panda-Gym 🤖

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/thumbnail.png"  alt="Thumbnail"/>

In this notebook, you'll learn to use A2C with [Panda-Gym](https://github.com/qgallouedec/panda-gym). You're going **to train a robotic arm** (Franka Emika Panda robot) to perform a task:

- `Reach`: the robot must place its end-effector at a target position.

After that, you'll be able **to train on other robotics tasks**.


### 🎮 Environments:

- [Panda-Gym](https://github.com/qgallouedec/panda-gym)

### 📚 RL-Library:

- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/)

## Objectives of this notebook 🏆

At the end of the notebook, you will:

- Be able to use **Panda-Gym**, the environment library.
- Be able to **train robots using A2C**.
- Understand why **we need to normalize the input**.
- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.




# Let's train our first robots 🤖

## Create a virtual display 🔽

During the notebook, we'll need to generate a replay video. To do so on Colab, **we need a virtual screen to render the environment** (and thus record the frames).

The following cell installs the libraries, then creates and runs a virtual screen 🖥

In [None]:
%%capture
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay

In [None]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x7a3ed44b15b0>

### Install dependencies 🔽

The first step is to install the dependencies. We'll install several:
- `gymnasium`
- `panda-gym`: Contains the robotics arm environments.
- `stable-baselines3`: The SB3 deep reinforcement learning library.
- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face 🤗 Hub.
- `huggingface_hub`: Library allowing anyone to work with the Hub repositories.

⏲ The installation can **take 10 minutes**.

In [3]:
!pip install stable-baselines3[extra]
!pip install gymnasium

Collecting stable-baselines3[extra]
  Downloading stable_baselines3-2.7.1-py3-none-any.whl.metadata (4.8 kB)
Downloading stable_baselines3-2.7.1-py3-none-any.whl (188 kB)
Installing collected packages: stable-baselines3
Successfully installed stable-baselines3-2.7.1


In [4]:
!pip install huggingface_sb3
!pip install huggingface_hub
!pip install panda_gym
!pip install moviepy

Collecting huggingface_sb3
  Downloading huggingface_sb3-3.0-py3-none-any.whl.metadata (6.3 kB)
Collecting huggingface-hub~=0.8 (from huggingface_sb3)
  Downloading huggingface_hub-0.36.2-py3-none-any.whl.metadata (15 kB)
Downloading huggingface_sb3-3.0-py3-none-any.whl (9.7 kB)
Downloading huggingface_hub-0.36.2-py3-none-any.whl (566 kB)
Installing collected packages: huggingface-hub, huggingface_sb3
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface_hub 1.3.7
    Uninstalling huggingface_hub-1.3.7:
      Successfully uninstalled huggingface_hub-1.3.7
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
transformers 5.0.0 requires huggingface-hub<2.0,>=1.3.0, but you have huggingface-hub 0.36.2 which i

## Import the packages 📦

In [5]:
import os

import gymnasium as gym
import panda_gym

from huggingface_sb3 import load_from_hub, package_to_hub

from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines3.common.env_util import make_vec_env

from huggingface_hub import notebook_login

Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.


## PandaReachDense-v3 🦾

The agent we're going to train is a robotic arm that must be controlled: it moves the arm and uses the end-effector.

In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.

In `PandaReach`, the robot must place its end-effector at a target position (green ball).

We're going to use the dense version of this environment. This means we get a *dense reward function* that **provides a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). This contrasts with a *sparse reward function*, where the environment **returns a reward if and only if the task is completed**.
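To make the dense/sparse distinction concrete, here is a minimal sketch in plain Python. It assumes the dense reward is the negative Euclidean distance to the target and the sparse reward is 0 within a small threshold and -1 otherwise. The function names, the `threshold=0.05` value, and the positions are illustrative assumptions, not values read from panda-gym.

```python
import math

def dense_reward(achieved, desired):
    """Negative Euclidean distance: informative at every timestep."""
    return -math.dist(achieved, desired)

def sparse_reward(achieved, desired, threshold=0.05):
    """0.0 once within `threshold` of the target, -1.0 otherwise (assumed form)."""
    return 0.0 if math.dist(achieved, desired) <= threshold else -1.0

# Hypothetical positions (x, y, z), chosen only for illustration
target = (0.10, 0.00, 0.20)
far = (0.40, 0.00, 0.60)    # 0.5 away from the target
near = (0.10, 0.00, 0.23)   # 0.03 away from the target

for name, pos in [("far", far), ("near", near)]:
    print(name, round(dense_reward(pos, target), 3), sparse_reward(pos, target))
```

With the dense version, moving from `far` to `near` already improves the reward, whereas the sparse version only changes once the threshold is crossed.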

We're also going to use *end-effector displacement control*, which means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/robotics.jpg"  alt="Robotics"/>


This way **the training will be easier**.



### Create the environment

#### The environment 🎮

In `PandaReachDense-v3` the robotic arm must place its end-effector at a target position (green ball).

In [None]:
env_id = "PandaReachDense-v3"

# Create the env
env = gym.make(env_id)

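# Get the state space and action space
# Note: the observation space is a Dict, so its .shape is None;
# the per-key shapes are printed in a later cell
s_size = env.observation_space.shape
a_size = env.action_space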

In [None]:
print(env.observation_space)

Dict('achieved_goal': Box(-10.0, 10.0, (3,), float32), 'desired_goal': Box(-10.0, 10.0, (3,), float32), 'observation': Box(-10.0, 10.0, (6,), float32))


In [None]:
print("_____OBSERVATION SPACE_____ \n")
print("The State Space is: ", s_size)
print("Sample observation", env.observation_space.sample()) # Get a random observation

_____OBSERVATION SPACE_____ 

The State Space is:  None
Sample observation {'achieved_goal': array([ 5.9338927 , -4.3634834 ,  0.31537512], dtype=float32), 'desired_goal': array([-6.9096546, -7.1386933, -2.759951 ], dtype=float32), 'observation': array([-9.576857 ,  9.591352 ,  9.407738 , -9.660893 ,  8.552757 ,
       -0.9398179], dtype=float32)}


In [None]:
for k, sp in env.observation_space.spaces.items():
    print(k, sp, "shape:", getattr(sp, "shape", None))

achieved_goal Box(-10.0, 10.0, (3,), float32) shape: (3,)
desired_goal Box(-10.0, 10.0, (3,), float32) shape: (3,)
observation Box(-10.0, 10.0, (6,), float32) shape: (6,)


The observation space **is a dictionary with 3 different elements**:
- `achieved_goal`: (x,y,z) the current position of the end-effector.
- `desired_goal`: (x,y,z) the target position for the end-effector.
- `observation`: position (x,y,z) and velocity of the end-effector (vx, vy, vz).

Because the observation is a dictionary, **we need to use `MultiInputPolicy` instead of `MlpPolicy`**.
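As a rough intuition for what a multi-input policy has to do with such an observation, the sketch below flattens the three entries into one vector of length 3 + 3 + 6 = 12. This only illustrates the idea: the helper `flatten_dict_observation` and the sample values are hypothetical, and SB3's own feature extractor works on tensors and may order or process the keys differently.

```python
def flatten_dict_observation(obs):
    """Concatenate the values of a dict observation into one flat list.

    Keys are visited in sorted order so the layout is deterministic.
    """
    flat = []
    for key in sorted(obs):
        flat.extend(float(x) for x in obs[key])
    return flat

# Made-up observation with the same keys and sizes as PandaReachDense-v3
sample_obs = {
    "achieved_goal": [0.04, 0.00, 0.20],
    "desired_goal": [0.10, -0.05, 0.15],
    "observation": [0.04, 0.00, 0.20, 0.0, 0.0, 0.0],
}

flat = flatten_dict_observation(sample_obs)
print(len(flat), flat)
```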

In [None]:
print("\n _____ACTION SPACE_____ \n")
print("The Action Space is: ", a_size)
print("Action Space Sample", env.action_space.sample()) # Take a random action


 _____ACTION SPACE_____ 

The Action Space is:  Box(-1.0, 1.0, (3,), float32)
Action Space Sample [ 0.9585859   0.08961037 -0.35455078]


The action space is a vector with 3 values:
- Control x, y, z movement

### Normalize observation and rewards

A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html).

For that purpose, there is a wrapper that will compute a running average and standard deviation of input features.

We also normalize rewards with this same wrapper by adding `norm_reward = True`

[You should check the documentation to fill this cell](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
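Before filling the cell, it may help to see what "running average and standard deviation" means in practice. The sketch below keeps running statistics for a single scalar feature and normalizes new values with clipping. The class name `RunningNormalizer` and its parameters are hypothetical; `VecNormalize` works on batched vector observations and its internals may differ.

```python
import math

class RunningNormalizer:
    """Tracks a running mean/variance of a scalar feature and normalizes it."""

    def __init__(self, clip=10.0, epsilon=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.clip = clip
        self.epsilon = epsilon

    def update(self, x):
        # Welford's online update of mean and squared deviations
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        return self.m2 / self.count if self.count > 0 else 1.0

    def normalize(self, x):
        z = (x - self.mean) / math.sqrt(self.var + self.epsilon)
        return max(-self.clip, min(self.clip, z))  # clip extreme values

norm = RunningNormalizer(clip=10.0)
for value in [1.0, 2.0, 3.0, 4.0, 5.0]:
    norm.update(value)

print(norm.mean, norm.var)
print(norm.normalize(3.0), norm.normalize(1000.0))
```

The clipping step plays the same role as the `clip_obs` argument: an outlier cannot produce an arbitrarily large input to the network.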

#### Solution

In [None]:
# Build 4 identical environments so the (s, a, r, s') data used for updates is collected in parallel
# Unlike the Policy Gradient approach, which computes the return only after a full trajectory, bootstrapping allows updates before the episode ends
env = make_vec_env(env_id, n_envs=4)

env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)
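A2C can update from short rollout segments because the critic's value estimate stands in for the rest of the episode (bootstrapping). The sketch below computes bootstrapped returns and advantages for one segment in plain Python. The function `n_step_advantages` and the numbers are illustrative assumptions; SB3 computes these quantities inside its rollout buffer and may differ in details.

```python
def n_step_advantages(rewards, values, bootstrap_value, dones, gamma=0.99):
    """Return (returns, advantages) for one rollout segment.

    rewards[t], values[t], dones[t] describe step t; bootstrap_value is the
    critic's estimate for the state reached after the last step.
    """
    returns = [0.0] * len(rewards)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        if dones[t]:
            running = 0.0  # no bootstrapping across an episode boundary
        running = rewards[t] + gamma * running
        returns[t] = running
    advantages = [g - v for g, v in zip(returns, values)]
    return returns, advantages

# Toy segment: three steps, episode still running at the end
returns, advantages = n_step_advantages(
    rewards=[1.0, 1.0, 1.0],
    values=[2.0, 2.0, 2.0],
    bootstrap_value=4.0,
    dones=[False, False, False],
    gamma=0.5,
)
print(returns, advantages)
```

The advantage is positive where the bootstrapped return exceeds the critic's prediction, which is the signal the actor uses to make those actions more likely.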

### Create the A2C Model 🤖

For more information about A2C implementation with StableBaselines3 check: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#notes

To find the best parameters I checked the [official trained agents by Stable-Baselines3 team](https://huggingface.co/sb3).

#### Solution

In [None]:
# MultiInputPolicy flattens and concatenates the multiple vectors in the Dict observation so they can be passed through an MLP
model = A2C(policy = "MultiInputPolicy",
            env = env,
            verbose=1)

Using cuda device




### Train the A2C agent 🏃
- Let's train our agent for 1,000,000 timesteps. Don't forget to use a GPU on Colab. Training takes approximately 25-40 min.
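To get a feel for how many gradient updates this budget corresponds to, here is a small back-of-the-envelope calculation. It assumes 5 steps per environment per update (SB3's default `n_steps` for A2C, since the model is created without overriding it) and the 4 environments created earlier. This is consistent with the training log, where `total_timesteps / iterations` is 476000 / 23800 = 20.

```python
n_envs = 4                   # from make_vec_env(env_id, n_envs=4)
n_steps = 5                  # assumed: default A2C rollout length per environment
total_timesteps = 1_000_000

transitions_per_update = n_envs * n_steps
n_updates = total_timesteps // transitions_per_update

print(transitions_per_update, n_updates)
```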

In [None]:
model.learn(1_000_000)

Streaming output truncated to the last 5000 lines.
|    std                | 0.332    |
|    value_loss         | 0.00154  |
------------------------------------
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 2.84      |
|    ep_rew_mean        | -0.221    |
|    success_rate       | 1         |
| time/                 |           |
|    fps                | 431       |
|    iterations         | 23800     |
|    time_elapsed       | 1102      |
|    total_timesteps    | 476000    |
| train/                |           |
|    entropy_loss       | -0.876    |
|    explained_variance | 0.98      |
|    learning_rate      | 0.0007    |
|    n_updates          | 23799     |
|    policy_loss        | -0.000476 |
|    std                | 0.332     |
|    value_loss         | 0.000105  |
-------------------------------------
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 2.64      |
|  

<stable_baselines3.a2c.a2c.A2C at 0x7a3d48ddccb0>

In [None]:
# Save the model and the VecNormalize statistics
model.save("a2c-PandaReachDense-v3")
env.save("vec_normalize.pkl")

### Evaluate the agent 📈
- Now that our agent is trained, we need to **check its performance**.
- Stable-Baselines3 provides a method to do that: `evaluate_policy`
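The two numbers reported by the evaluation are a mean and a spread over per-episode returns. The sketch below shows that summary step alone on made-up returns. The helper `summarize_returns` is hypothetical, the values are not results from this notebook, and the use of a population standard deviation is an assumption about how the spread is computed.

```python
import statistics

def summarize_returns(episode_returns):
    """Mean and population standard deviation of per-episode returns."""
    mean = statistics.fmean(episode_returns)
    std = statistics.pstdev(episode_returns)
    return mean, std

# Made-up returns for five evaluation episodes (illustration only)
example_returns = [-0.10, -0.20, -0.15, -0.25, -0.05]
example_mean, example_std = summarize_returns(example_returns)

print(f"Mean reward = {example_mean:.2f} +/- {example_std:.2f}")
```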

In [None]:
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Load the saved statistics
eval_env = DummyVecEnv([lambda: gym.make("PandaReachDense-v3")])
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)

# Override the render_mode so frames can be captured for the replay video
eval_env.render_mode = "rgb_array"

# Do not update the normalization statistics at test time
eval_env.training = False
# Reward normalization is not needed at test time
eval_env.norm_reward = False

# Load the agent
model = A2C.load("a2c-PandaReachDense-v3")

mean_reward, std_reward = evaluate_policy(model, eval_env)

print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")

Mean reward = -0.17 +/- 0.07




### Publish your trained model on the Hub 🔥
Now that we've seen good results after training, we can publish our trained model on the Hub with one line of code.

📚 The library's documentation 👉 https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20


In [24]:
from huggingface_hub import notebook_login
notebook_login()
!git config --global credential.helper store



For this environment, **running this cell can take approximately 10min**

In [None]:
from huggingface_sb3 import package_to_hub

package_to_hub(
    model=model,
    model_name=f"a2c-{env_id}",
    model_architecture="A2C",
    env_id=env_id,
    eval_env=eval_env,
    repo_id="WJLeeLouis/A2C_PandaReachDense_RL", # Change the username
    commit_message="Initial commit",
)

ℹ This function will save, evaluate, generate a video of your agent,
create a model card and push everything to the hub. It might take up to 1min.
This is a work in progress: if you encounter a bug, please open an issue.

Saving video to /tmp/tmpwxqmkh8j/-step-0-to-step-1000.mp4
Moviepy - Building video /tmp/tmpwxqmkh8j/-step-0-to-step-1000.mp4.
Moviepy - Writing video /tmp/tmpwxqmkh8j/-step-0-to-step-1000.mp4

Moviepy - Done !
Moviepy - video ready /tmp/tmpwxqmkh8j/-step-0-to-step-1000.mp4
✘ 'DummyVecEnv' object has no attribute 'video_recorder'
✘ We are unable to generate a replay of your agent, the package_to_hub
process continues
✘ Please open an issue at
https://github.com/huggingface/huggingface_sb3/issues
ℹ Pushing repo WJLeeLouis/A2C_PandaReachDense_RL to the Hugging Face
Hub


Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...-v3/pytorch_variables.pth: 100%|##########| 1.26kB / 1.26kB            

  ...e-v3/policy.optimizer.pth: 100%|##########| 48.9kB / 48.9kB            

  ...aReachDense-v3/policy.pth: 100%|##########| 46.8kB / 46.8kB            

  ...2c-PandaReachDense-v3.zip: 100%|##########|  114kB /  114kB            

  ...blb8mxm/vec_normalize.pkl: 100%|##########| 2.64kB / 2.64kB            

ℹ Your model is pushed to the Hub. You can view your model here:
https://huggingface.co/WJLeeLouis/A2C_PandaReachDense_RL/tree/main/


CommitInfo(commit_url='https://huggingface.co/WJLeeLouis/A2C_PandaReachDense_RL/commit/221ce5d7ffa2d1d6fcde590a7b22f69c1a081203', commit_message='Initial commit', commit_description='', oid='221ce5d7ffa2d1d6fcde590a7b22f69c1a081203', pr_url=None, repo_url=RepoUrl('https://huggingface.co/WJLeeLouis/A2C_PandaReachDense_RL', endpoint='https://huggingface.co', repo_type='model', repo_id='WJLeeLouis/A2C_PandaReachDense_RL'), pr_revision=None, pr_num=None)

## Some additional challenges 🏆
The best way to learn **is to try things on your own**! Why not try `PandaPickAndPlace-v3`?

If you want to try more advanced tasks in panda-gym, check what was done using **TQC or SAC** (more sample-efficient algorithms suited to robotics tasks). In real robotics, you'll use a more sample-efficient algorithm for a simple reason: unlike in a simulation, **if you move your robotic arm too much, you risk breaking it**.

PandaPickAndPlace-v1 (this model uses the v1 version of the environment): https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1

And don't hesitate to check panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html

Here are the steps to train another agent (optional):

1. Define the environment called "PandaPickAndPlace-v3"
2. Make a vectorized environment
3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
4. Create the A2C Model (don't forget verbose=1 to print the training logs).
5. Train it for 1M timesteps
6. Save the model and the VecNormalize statistics
7. Evaluate your agent
8. Publish your trained model on the Hub 🔥 with `package_to_hub`


### Solution (optional)

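In A2C, the mean episode length is merely an observed statistic.
In long-horizon manipulation problems where the success signal is sparse,
successful samples become extremely diluted when learning rollout by rollout,
so sample efficiency is inevitably low.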

In [10]:
env_id = "PandaPickAndPlace-v3"
env = gym.make(env_id)

# Get the state space and action space
s_size = env.observation_space
a_size = env.action_space
print(s_size)
print(a_size)
print('---example---')
print(env.observation_space.sample())
print(env.action_space.sample())

Dict('achieved_goal': Box(-10.0, 10.0, (3,), float32), 'desired_goal': Box(-10.0, 10.0, (3,), float32), 'observation': Box(-10.0, 10.0, (19,), float32))
Box(-1.0, 1.0, (4,), float32)
---example---
{'achieved_goal': array([5.492008 , 5.2551165, 1.7372192], dtype=float32), 'desired_goal': array([ 5.890043 ,  1.8043162, -2.0791264], dtype=float32), 'observation': array([-8.2537985e+00,  1.4931312e+00,  5.7731762e+00, -1.6476558e-01,
       -7.8196473e+00, -9.5044107e+00, -6.3049502e+00,  4.7883472e+00,
       -6.5505009e+00,  5.1958280e+00, -9.8264561e+00,  3.5979166e+00,
       -2.4656255e+00, -4.4787126e+00, -8.5872059e+00, -2.1689909e+00,
        2.3571086e+00,  1.4326302e-03, -5.4976220e+00], dtype=float32)}
[-0.7224062  -0.18290673  0.7736824  -0.20979021]


In [None]:
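# Steps 1-2: define the environment and make a vectorized environment
env_id = "PandaPickAndPlace-v3"
env = make_vec_env(env_id, n_envs=4)

# Step 3: normalize observations and rewards
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)

# Step 4: create the A2C model
model = A2C(policy = "MultiInputPolicy",
            env = env,
            verbose=1)
# Step 5: train for 1M timesteps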
model.learn(1_000_000)

Streaming output truncated to the last 5000 lines.
|    std                | 0.823    |
|    value_loss         | 7.27e-05 |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 47.1     |
|    ep_rew_mean        | -47      |
|    success_rate       | 0.06     |
| time/                 |          |
|    fps                | 372      |
|    iterations         | 23800    |
|    time_elapsed       | 1278     |
|    total_timesteps    | 476000   |
| train/                |          |
|    entropy_loss       | -4.86    |
|    explained_variance | 0.911    |
|    learning_rate      | 0.0007   |
|    n_updates          | 23799    |
|    policy_loss        | 0.00624  |
|    std                | 0.822    |
|    value_loss         | 5.77e-06 |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 48.1     |
|    ep_rew_mean        |

<stable_baselines3.a2c.a2c.A2C at 0x7a3d41532840>

## Video Generation Issues

### PandaReachDense

In [2]:
from stable_baselines3 import A2C
from huggingface_sb3 import load_from_hub

repo_id = "WJLeeLouis/A2C_PandaReachDense_RL"
filename = "a2c-PandaReachDense-v3.zip"
model_path = load_from_hub(repo_id=repo_id, filename=filename)
model = A2C.load(model_path)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


a2c-PandaReachDense-v3.zip:   0%|          | 0.00/114k [00:00<?, ?B/s]



In [4]:
import gymnasium as gym
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, VecVideoRecorder
import panda_gym

env_id = "PandaReachDense-v3"

def make_eval_env():
    env = gym.make(env_id, render_mode="rgb_array")
    env = Monitor(env)
    return env

eval_env = DummyVecEnv([make_eval_env])

eval_env = VecVideoRecorder(
    eval_env,
    video_folder="/tmp/hf-replay",
    record_video_trigger=lambda step: step == 0,
    video_length=1000,
    name_prefix=f"a2c-{env_id}"
)



In [5]:
from stable_baselines3.common.vec_env import VecNormalize
from huggingface_sb3 import load_from_hub

vn_path = load_from_hub(repo_id=repo_id, filename="vec_normalize.pkl")
eval_env = VecNormalize.load(vn_path, eval_env)

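# Freeze the normalization statistics so they do not change during evaluation
eval_env.training = False
eval_env.norm_reward = False  # Reward normalization is usually turned off for video/evaluation (optional)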

vec_normalize.pkl:   0%|          | 0.00/2.64k [00:00<?, ?B/s]

In [9]:
from huggingface_sb3 import package_to_hub

package_to_hub(
    model=model,
    model_name=f"a2c-{env_id}",
    model_architecture="A2C",
    env_id=env_id,
    eval_env=eval_env,
    repo_id=repo_id,
    commit_message="Fix video replay generation",
)

ℹ This function will save, evaluate, generate a video of your agent,
create a model card and push everything to the hub. It might take up to 1min.
This is a work in progress: if you encounter a bug, please open an issue.
Saving video to /tmp/hf-replay/a2c-PandaReachDense-v3-step-0-to-step-1000.mp4
MoviePy - Building video /tmp/hf-replay/a2c-PandaReachDense-v3-step-0-to-step-1000.mp4.
MoviePy - Writing video /tmp/hf-replay/a2c-PandaReachDense-v3-step-0-to-step-1000.mp4

MoviePy - Done !
MoviePy - video ready /tmp/hf-replay/a2c-PandaReachDense-v3-step-0-to-step-1000.mp4
Saving video to /tmp/tmpndffp6u0/-step-0-to-step-1000.mp4
MoviePy - Building video /tmp/tmpndffp6u0/-step-0-to-step-1000.mp4.
MoviePy - Writing video /tmp/tmpndffp6u0/-step-0-to-step-1000.mp4

MoviePy - Done !
MoviePy - video ready /tmp/tmpndffp6u0/-step-0-to-step-1000.mp4
✘ 'DummyVecEnv' object has no attribute 'video_recorder'
✘ We are unable to generate a replay of your agent, the package_to_hub
process continues
✘ Please open an issue at
https://github.com/huggingface/huggingface_sb3/issues
ℹ Pushing repo WJLeeLouis/A2C_PandaReachDense_RL to the Hugging Face
Hub


Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...aReachDense-v3/policy.pth: 100%|##########| 46.8kB / 46.8kB            

  ...qwx7kc5/vec_normalize.pkl: 100%|##########| 2.64kB / 2.64kB            

  ...-v3/pytorch_variables.pth: 100%|##########| 1.26kB / 1.26kB            

  ...e-v3/policy.optimizer.pth: 100%|##########| 48.9kB / 48.9kB            

  ...2c-PandaReachDense-v3.zip: 100%|##########|  114kB /  114kB            

ℹ Your model is pushed to the Hub. You can view your model here:
https://huggingface.co/WJLeeLouis/A2C_PandaReachDense_RL/tree/main/


CommitInfo(commit_url='https://huggingface.co/WJLeeLouis/A2C_PandaReachDense_RL/commit/d2048b07b5b545ad45218ed816e8f69ad7330381', commit_message='Fix video replay generation', commit_description='', oid='d2048b07b5b545ad45218ed816e8f69ad7330381', pr_url=None, repo_url=RepoUrl('https://huggingface.co/WJLeeLouis/A2C_PandaReachDense_RL', endpoint='https://huggingface.co', repo_type='model', repo_id='WJLeeLouis/A2C_PandaReachDense_RL'), pr_revision=None, pr_num=None)

In [10]:
from huggingface_hub import HfApi

repo_id = "WJLeeLouis/A2C_PandaReachDense_RL"
video_path = "/tmp/hf-replay/a2c-PandaReachDense-v3-step-0-to-step-1000.mp4"

api = HfApi()
api.upload_file(
    path_or_fileobj=video_path,
    path_in_repo="replay.mp4",
    repo_id=repo_id,
    repo_type="model",
    commit_message="Add replay.mp4 (manual upload; package_to_hub replay bug workaround)"
)
print("Uploaded replay.mp4")

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...3-step-0-to-step-1000.mp4:  82%|########1 |  530kB /  649kB            

Uploaded replay.mp4


### PandaPickAndPlace

In [14]:
from stable_baselines3 import A2C
from huggingface_sb3 import load_from_hub

repo_id = "WJLeeLouis/A2C_Robotics_RL"
filename = "a2c-PandaPickAndPlace-v3.zip"
model_path = load_from_hub(repo_id=repo_id, filename=filename)
model = A2C.load(model_path)

a2c-PandaPickAndPlace-v3.zip:   0%|          | 0.00/131k [00:00<?, ?B/s]

In [15]:
import gymnasium as gym
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, VecVideoRecorder
import panda_gym

env_id = "PandaPickAndPlace-v3"

def make_eval_env():
    env = gym.make(env_id, render_mode="rgb_array")
    env = Monitor(env)
    return env

eval_env = DummyVecEnv([make_eval_env])

eval_env = VecVideoRecorder(
    eval_env,
    video_folder="/tmp/hf-replay",
    record_video_trigger=lambda step: step == 0,
    video_length=1000,
    name_prefix=f"a2c-{env_id}"
)

In [16]:
from stable_baselines3.common.vec_env import VecNormalize
from huggingface_sb3 import load_from_hub

vn_path = load_from_hub(repo_id=repo_id, filename="vec_normalize.pkl")
eval_env = VecNormalize.load(vn_path, eval_env)

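# Freeze the normalization statistics so they do not change during evaluation
eval_env.training = False
eval_env.norm_reward = False  # Reward normalization is usually turned off for video/evaluation (optional)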

vec_normalize.pkl:   0%|          | 0.00/3.04k [00:00<?, ?B/s]

In [17]:
from huggingface_sb3 import package_to_hub

package_to_hub(
    model=model,
    model_name=f"a2c-{env_id}",
    model_architecture="A2C",
    env_id=env_id,
    eval_env=eval_env,
    repo_id=repo_id,
    commit_message="Fix video replay generation",
)

ℹ This function will save, evaluate, generate a video of your agent,
create a model card and push everything to the hub. It might take up to 1min.
This is a work in progress: if you encounter a bug, please open an issue.
Saving video to /tmp/hf-replay/a2c-PandaPickAndPlace-v3-step-0-to-step-1000.mp4
MoviePy - Building video /tmp/hf-replay/a2c-PandaPickAndPlace-v3-step-0-to-step-1000.mp4.
MoviePy - Writing video /tmp/hf-replay/a2c-PandaPickAndPlace-v3-step-0-to-step-1000.mp4

MoviePy - Done !
MoviePy - video ready /tmp/hf-replay/a2c-PandaPickAndPlace-v3-step-0-to-step-1000.mp4
Saving video to /tmp/tmpujp3lrrx/-step-0-to-step-1000.mp4
MoviePy - Building video /tmp/tmpujp3lrrx/-step-0-to-step-1000.mp4.
MoviePy - Writing video /tmp/tmpujp3lrrx/-step-0-to-step-1000.mp4

MoviePy - Done !
MoviePy - video ready /tmp/tmpujp3lrrx/-step-0-to-step-1000.mp4
✘ 'DummyVecEnv' object has no attribute 'video_recorder'
✘ We are unable to generate a replay of your agent, the package_to_hub
process continues
✘ Please open an issue at
https://github.com/huggingface/huggingface_sb3/issues
ℹ Pushing repo WJLeeLouis/A2C_Robotics_RL to the Hugging Face Hub


Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...-v3/pytorch_variables.pth: 100%|##########| 1.26kB / 1.26kB            

  ...t6_7oxm/vec_normalize.pkl: 100%|##########| 3.04kB / 3.04kB            

  ...ickAndPlace-v3/policy.pth: 100%|##########| 53.7kB / 53.7kB            

  ...e-v3/policy.optimizer.pth: 100%|##########| 55.8kB / 55.8kB            

  ...-PandaPickAndPlace-v3.zip: 100%|##########|  131kB /  131kB            

ℹ Your model is pushed to the Hub. You can view your model here:
https://huggingface.co/WJLeeLouis/A2C_Robotics_RL/tree/main/


CommitInfo(commit_url='https://huggingface.co/WJLeeLouis/A2C_Robotics_RL/commit/044e9d53c5e941cec6643d0a057c59be2a89b34a', commit_message='Fix video replay generation', commit_description='', oid='044e9d53c5e941cec6643d0a057c59be2a89b34a', pr_url=None, repo_url=RepoUrl('https://huggingface.co/WJLeeLouis/A2C_Robotics_RL', endpoint='https://huggingface.co', repo_type='model', repo_id='WJLeeLouis/A2C_Robotics_RL'), pr_revision=None, pr_num=None)

In [18]:
from huggingface_hub import HfApi

repo_id = "WJLeeLouis/A2C_Robotics_RL"
video_path = "/tmp/hf-replay/a2c-PandaPickAndPlace-v3-step-0-to-step-1000.mp4"

api = HfApi()
api.upload_file(
    path_or_fileobj=video_path,
    path_in_repo="replay.mp4",
    repo_id=repo_id,
    repo_type="model",
    commit_message="Add replay.mp4 (manual upload; package_to_hub replay bug workaround)"
)
print("Uploaded replay.mp4")

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...3-step-0-to-step-1000.mp4:  70%|#######   |  530kB /  755kB            

Uploaded replay.mp4


### Not Good at PickAndPlace

In [26]:
import gymnasium as gym
import panda_gym
import numpy as np

from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

env_id = "PandaPickAndPlace-v3"
vn_path = "/root/.cache/huggingface/hub/models--WJLeeLouis--A2C_Robotics_RL/snapshots/09f5f6708fbc365f7ed52cec8e153b712cf40464/vec_normalize.pkl"

def make_eval_env_novideo():
    env = gym.make(env_id)
    env = Monitor(env)
    return env

def run_one_episode_success(model, vn_path, deterministic=True, max_steps=200):
    venv = DummyVecEnv([make_eval_env_novideo])
    venv = VecNormalize.load(vn_path, venv)
    venv.training = False
    venv.norm_reward = False

    obs = venv.reset()
    success = 0.0

    for _ in range(max_steps):
        action, _ = model.predict(obs, deterministic=deterministic)
        obs, reward, dones, infos = venv.step(action)

        if dones[0]:
            success = float(infos[0].get("is_success", 0.0))
            break

    venv.close()
    return success

def success_rate_safe(model, vn_path, n_episodes=50, deterministic=True, max_steps=200):
    successes = []
    for i in range(n_episodes):
        try:
            s = run_one_episode_success(model, vn_path, deterministic=deterministic, max_steps=max_steps)
            successes.append(s)
        except Exception as e:
            print(f"[Episode {i}] Error: {type(e).__name__}: {e}")
            successes.append(0.0)
    return float(np.mean(successes))

sr_det = success_rate_safe(model, vn_path, n_episodes=50, deterministic=True, max_steps=200)
sr_sto = success_rate_safe(model, vn_path, n_episodes=50, deterministic=False, max_steps=200)

print("Success rate (det=True): ", sr_det)
print("Success rate (det=False):", sr_sto)


Success rate (det=True):  0.02
Success rate (det=False): 0.0


In [27]:
sr_det_200 = success_rate_safe(model, vn_path, n_episodes=200, deterministic=True, max_steps=200)
sr_sto_200 = success_rate_safe(model, vn_path, n_episodes=200, deterministic=False, max_steps=200)
print(sr_det_200, sr_sto_200)

0.06 0.025
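The rates above come from few episodes, so they are noisy estimates. One optional way to gauge that noise is a Wilson score interval on the success counts implied by the printed rates (1 of 50, 12 of 200, 5 of 200). The helper `wilson_interval` is a hypothetical addition for illustration and is not part of the notebook's evaluation code.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Success counts implied by the rates printed above
runs = [
    ("det, 50 episodes", 1, 50),     # rate 0.02
    ("det, 200 episodes", 12, 200),  # rate 0.06
    ("sto, 200 episodes", 5, 200),   # rate 0.025
]

for label, successes, n in runs:
    low, high = wilson_interval(successes, n)
    print(f"{label}: {successes / n:.3f} in [{low:.3f}, {high:.3f}]")
```

Whatever the exact bounds, the intervals stay far below the success rate of 1 reached on `PandaReachDense-v3`, which supports the conclusion that A2C with these settings does not solve `PandaPickAndPlace-v3`.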
