<a href="https://colab.research.google.com/github/frankzamma/AntiPiracyPlatform/blob/main/Notebook/PPO_new.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training and Testing with PPO
This notebook trains an agent to play Q*Bert using the PPO (Proximal Policy Optimization) algorithm.

## Download Repository

In [1]:
from google.colab import userdata

In [2]:
!git clone https://{userdata.get('TokenGithub')}"@github.com/amigli/Q-Bert_RL.git"

Cloning into 'Q-Bert_RL'...
remote: Enumerating objects: 606, done.
remote: Counting objects: 100% (78/78), done.
remote: Compressing objects: 100% (61/61), done.
remote: Total 606 (delta 46), reused 27 (delta 17), pack-reused 528 (from 1)
Receiving objects: 100% (606/606), 10.49 MiB | 7.81 MiB/s, done.
Resolving deltas: 100% (383/383), done.


In [3]:
%cd Q-Bert_RL/

/content/Q-Bert_RL


## Installing the requirements

In [4]:
!pip install gymnasium



In [5]:
!pip install ale-py



In [6]:
!pip install moviepy



In [7]:
!pip install stable-baselines3

Collecting stable-baselines3
  Downloading stable_baselines3-2.5.0-py3-none-any.whl.metadata (4.8 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch<3.0,>=2.3->stable-baselines3)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch<3.0,>=2.3->stable-baselines3)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch<3.0,>=2.3->stable-baselines3)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch<3.0,>=2.3->stable-baselines3)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch<3.0,>=2.3->stable-baselines3)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (

In [8]:
!pip install wandb



## Algorithm

In [9]:
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
import ale_py
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import tensorflow as tf
from stable_baselines3.common.logger import configure
import torch
from torch.utils.tensorboard import SummaryWriter
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv
from EnvironmentWrappers.RewardFunction import RewardFunction
from EnvironmentWrappers.ObsRewardWrapper import ObsRewardWrapper
import wandb
from stable_baselines3.common.callbacks import BaseCallback, EvalCallback
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.utils import get_linear_fn
import torch as th
from wandb.integration.sb3 import WandbCallback
from stable_baselines3.common.atari_wrappers import NoopResetEnv


## Wandb

In [10]:
!wandb login --relogin

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin


In [11]:
config = {
    "policy_type": "MlpPolicy",
    "total_timesteps": 7_000_000,
    "env_name": "ALE/Qbert-ram-v5",
}
run = wandb.init(
    project="QBERT-RL",
    config=config,
    sync_tensorboard=True,  # auto-upload sb3's tensorboard metrics
    monitor_gym=True,  # auto-upload the videos of agents playing the game
    save_code=True,  # optional
    entity="Q-BertRLTeam"
)


wandb: Currently logged in as: frank581-fgz (frankzamma) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


## Training

In [12]:
gym.register_envs(ale_py)

def make_env(env_id):
    def _init():
        env = gym.make(env_id)
        env = ObsRewardWrapper(env)
        env = NoopResetEnv(env, noop_max=30)
        env = Monitor(env)
        # env = OurRewardWrapper(env)
        return env
    return _init
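
The `NoopResetEnv(env, noop_max=30)` wrapper used above randomizes the start of each episode by executing a random number of no-op actions after every reset. The following is a minimal standard-library sketch of that idea, not the wrapper's actual code: the helper name `sample_noop_count`, the seed, and the inclusive range from 1 to `noop_max` are assumptions made for illustration.

```python
import random


def sample_noop_count(noop_max: int, rng: random.Random) -> int:
    """Pick how many no-op actions to run after a reset (1..noop_max inclusive)."""
    return rng.randint(1, noop_max)


rng = random.Random(0)  # hypothetical seed, only to make the sketch repeatable
noop_counts = [sample_noop_count(30, rng) for _ in range(1000)]

# Every episode starts after a different, bounded number of idle frames.
print(min(noop_counts), max(noop_counts))
```

Because each of the parallel environments draws its own count, the copies are unlikely to stay in lockstep at the beginning of training.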

In [13]:
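# Separate MLPs for the policy (pi) and the value function (vf).
# Stable-Baselines3 reads the value-network layers from the `vf` key.
policy_kwargs = dict(activation_fn=th.nn.ReLU,
                     net_arch=dict(pi=[256, 512, 256, 128, 64], vf=[256, 512, 256, 128, 64]))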
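
When `net_arch` is a dictionary, Stable-Baselines3 looks up the keys `pi` (policy network) and `vf` (value network); to the best of my knowledge a missing key falls back to an empty list, meaning no hidden layers, and any other key is silently ignored. The sketch below imitates only that lookup with plain dictionaries. The function `split_net_arch` is hypothetical and is not part of the library.

```python
def split_net_arch(net_arch: dict) -> tuple[list[int], list[int]]:
    """Imitate the key lookup: a missing key means no hidden layers."""
    pi_layers = net_arch.get("pi", [])
    vf_layers = net_arch.get("vf", [])
    return pi_layers, vf_layers


layers = [256, 512, 256, 128, 64]

pi_ok, vf_ok = split_net_arch(dict(pi=layers, vf=layers))
pi_typo, vf_typo = split_net_arch(dict(pi=layers, vi=layers))  # misspelled key

print(len(pi_ok), len(vf_ok))      # 5 5
print(len(pi_typo), len(vf_typo))  # 5 0
```

With the misspelled key the value function would be a single linear layer on top of the raw features, which is easy to miss because no error is raised.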


In [14]:
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """
    Linear learning rate schedule.

    :param initial_value: Initial learning rate.
    :return: schedule that computes
      current learning rate depending on remaining progress
    """
    def func(progress_remaining: float) -> float:
        """
        Progress will decrease from 1 (beginning) to 0.

        :param progress_remaining:
        :return: current learning rate
        """
        return progress_remaining * initial_value

    return func
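
As a quick check of what this schedule produces, the sketch below evaluates it at a few points of a run. It assumes the initial learning rate of `0.0001` and the `total_timesteps` of 7,000,000 used elsewhere in this notebook; the sampled step counts are arbitrary.

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Return a function mapping remaining progress (1 -> 0) to a learning rate."""
    def func(progress_remaining: float) -> float:
        return progress_remaining * initial_value
    return func


schedule = linear_schedule(1e-4)
total_timesteps = 7_000_000

for done in (0, 1_750_000, 3_500_000, 7_000_000):
    progress_remaining = 1.0 - done / total_timesteps
    print(f"{done:>9,d} steps -> lr = {schedule(progress_remaining):.2e}")
```

The rate starts at the initial value and reaches zero exactly at the configured end of training, so stopping a run early leaves the final updates at a larger step size than planned.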

In [15]:
lr_schedule = linear_schedule(0.0001)

num_envs = 25
envs = DummyVecEnv([make_env("ALE/Qbert-ram-v5") for _ in range(num_envs)])

model = PPO(
    "MlpPolicy",
    envs,
    verbose=1,
    n_steps = 1024,
    batch_size=64,
    ent_coef= 0.03,
    learning_rate=lr_schedule,
    policy_kwargs=policy_kwargs,
    tensorboard_log=f"runs/{run.id}"
    )

Using cuda device
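
The hyperparameters above fix how much data each PPO update sees. The arithmetic can be sketched as follows, assuming `n_epochs` is left at the Stable-Baselines3 default of 10, since the notebook does not set it.

```python
num_envs = 25
n_steps = 1024
batch_size = 64
n_epochs = 10  # assumption: library default, not set in the notebook
total_timesteps = 7_000_000

# Transitions collected before each policy update.
rollout_size = num_envs * n_steps

# Minibatches per pass over the rollout buffer, and gradient steps per update.
minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_rollout = minibatches_per_epoch * n_epochs

# Rollouts needed to reach the timestep budget (ceiling division).
rollouts_needed = -(-total_timesteps // rollout_size)

print(rollout_size, minibatches_per_epoch, gradient_steps_per_rollout, rollouts_needed)
# 25600 400 4000 274
```

Since 25,600 is a multiple of 64, no truncated minibatch is produced.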




In [16]:
env = gym.make("ALE/Qbert-ram-v5")
env = ObsRewardWrapper(env)

eval_callback = EvalCallback(env, best_model_save_path="./BestModels/",
                             log_path="./logs/", eval_freq=250,
                             deterministic=True, render=False)

wandb_callback = WandbCallback(
        gradient_save_freq=500,
        model_save_path=f"models/{run.id}",
        verbose=2,
    )

model.learn(total_timesteps = 7_000_000, callback=[wandb_callback,eval_callback])

wandb.finish()

# Evaluate the model
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f}, standard deviation: {std_reward:.2f}")

Logging to runs/vx73kc3x/PPO_1

Output streaming truncated to the last 5000 lines.
|    explained_variance   | 0.84        |
|    learning_rate        | 5.83e-05    |
|    loss                 | 0.21        |
|    n_updates            | 1140        |
|    policy_gradient_loss | -0.0149     |
|    value_loss           | 1.5         |
-----------------------------------------
Eval num_timesteps=2925000, episode_reward=8.80 +/- 8.35
Episode length: 812.80 +/- 191.76
---------------------------------
| eval/              |          |
|    mean_ep_length  | 813      |
|    mean_reward     | 8.8      |
| time/              |          |
|    total_timesteps | 2925000  |
---------------------------------
Eval num_timesteps=2931250, episode_reward=12.20 +/- 8.03
Episode length: 917.80 +/- 148.03
---------------------------------
| eval/              |          |
|    mean_ep_length  | 918      |
|    mean_reward     | 12.2     |
| time/              |          |
|    total_timesteps | 2931250  |
-----------------

KeyboardInterrupt: 
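
The numbers in this log can be reconciled with the configuration. The sketch below assumes that `eval_freq` counts calls to the vectorized `step` (each call advances all 25 environments) and that `n_updates` grows by `n_epochs` per rollout with `n_epochs` at its default of 10. Both are my reading of Stable-Baselines3 behaviour and are not stated in the notebook.

```python
num_envs = 25
n_steps = 1024
eval_freq = 250
n_epochs = 10  # assumption: library default
initial_lr = 1e-4
total_timesteps = 7_000_000

# Evaluations are spaced eval_freq vectorized steps apart.
eval_interval = eval_freq * num_envs
assert 2_931_250 - 2_925_000 == eval_interval  # matches the two Eval lines

# Timesteps collected when the log reported n_updates = 1140.
n_updates = 1140
rollouts_done = n_updates // n_epochs
timesteps_done = rollouts_done * n_steps * num_envs

# Linear schedule: lr = initial_lr * remaining fraction of the budget.
lr = initial_lr * (1 - timesteps_done / total_timesteps)
print(eval_interval, timesteps_done, f"{lr:.2e}")
# 6250 2918400 5.83e-05
```

The computed value agrees with the `learning_rate` row in the table above, which suggests the schedule was being applied as intended when the run was interrupted.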

In [17]:
run.finish()

Run history:
eval/mean_ep_length,▁▁███▅▅▅▅▅▄▅▃▅▆▆▅▃▅▅▄▅▅▇█▅▅▅▅▅██▇▇▆█▅▇▇▅
eval/mean_reward,▃▁▄▁▂▃▅▄▅▅▆▇▇█▅▄▅▄▅▅▇▅▅▄▅▅█▄▆▇▆▇▆█▆▅▆▅▇▅
global_step,▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▆▆▆▆▆▇▇▇████
rollout/ep_len_mean,▂▁▁▁▁▃▃▃▄▄▄▅▅▅▆▅▅▅▆▆▆▆▇▇▇▇▆▆▆▇▇▇▆▆▇▇▆▇██
rollout/ep_rew_mean,▂▁▂▂▂▁▁▂▂▃▄▄▅▅▆▆▆▆▅▆▆▆▆▇▇▆▆▇▇▆▇▇▇▇▇▇▇▇▇█
time/fps,█▇▇▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁
train/approx_kl,▃▂▂▁▂▆▅▅██▇█▆▇▆▆▇▇▇█▇▇██▆▆█▆▆▇▆▆▇▇▆▆▅▆▆▅
train/clip_fraction,▅▂▁▃▃▄▆▆▅▄▆▆▅▄▆▅██▅▇▆▇▅▅▆▇▇▅▅▇▇▅▅▆▅▅▄▃▄▄
train/clip_range,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train/entropy_loss,▂▁▁▁▁▂▂▂▂▂▂▂▂▃▃▄▄▄▄▄▄▄▄▅▄▄▅▅▅▅▅▅▅▆▅▇▆▇▇█

Run summary:
eval/mean_ep_length,775.0
eval/mean_reward,12.4
global_step,5131250.0
rollout/ep_len_mean,511.3
rollout/ep_rew_mean,9.26
time/fps,362.0
train/approx_kl,0.01264
train/clip_fraction,0.13461
train/clip_range,0.2
train/entropy_loss,-1.31794


In [18]:
env = gym.make("ALE/Qbert-ram-v5", render_mode="rgb_array")
env = ObsRewardWrapper(env)

In [19]:
# Wrap in a DummyVecEnv for compatibility with Stable-Baselines3
env = DummyVecEnv([lambda: env])

In [20]:
# Record a video
video_folder = "./videos/"
env = VecVideoRecorder(
    env,               # Environment to record
    video_folder,      # Folder where the videos are saved
    record_video_trigger=lambda x: x % 10000000 == 0,  # Start recording at step 0 (then every 10,000,000 steps)
    video_length=10000000 # Maximum video length in steps
)
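
The trigger `x % 10000000 == 0` fires at step 0 and then not again until step ten million, so in practice a single recording is started. The sketch below compares it with a hypothetical trigger that fires every 1,000 steps; both functions are plain Python stand-ins for the lambda and do not involve the recorder itself.

```python
def original_trigger(step: int) -> bool:
    """Trigger used in the notebook: true at step 0, then every 10,000,000 steps."""
    return step % 10_000_000 == 0


def every_1000_trigger(step: int) -> bool:
    """Hypothetical alternative: true every 1,000 steps."""
    return step % 1000 == 0


steps = range(0, 5001)
fired_original = [s for s in steps if original_trigger(s)]
fired_every_1000 = [s for s in steps if every_1000_trigger(s)]

print(fired_original)    # [0]
print(fired_every_1000)  # [0, 1000, 2000, 3000, 4000, 5000]
```

Combined with a very large `video_length`, the original trigger keeps everything in one file, which fits recording a few complete episodes back to back.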

In [21]:
# Reset the environment before recording
obs = env.reset()

# Record 3 episodes
for episode in range(3):
    obs = env.reset()
    for _ in range(10000000):  # Upper bound on the episode length
        action, _states = model.predict(obs, deterministic=True)
        obs, rewards, dones, info = env.step(action)
        if dones[0]:  # The episode has ended
            break

env.close()  # Save the video



Moviepy - Building video /content/Q-Bert_RL/videos/rl-video-step-0-to-step-10000000.mp4.
Moviepy - Writing video /content/Q-Bert_RL/videos/rl-video-step-0-to-step-10000000.mp4



                                                  

Moviepy - Done !
Moviepy - video ready /content/Q-Bert_RL/videos/rl-video-step-0-to-step-10000000.mp4






