This module contains the implementation of the PPO algorithm.
The implementation follows the PPO pseudocode published on OpenAI's Spinning Up site:
https://spinningup.openai.com/en/latest/algorithms/ppo.html#id7
It uses an actor-critic method, which splits the implementation into 8 main steps:
1. Initialize the environment, the initial policy parameters theta_0, and the initial value function parameters w_0.
2. Loop for k iterations.
3. Collect a set of trajectories D_k = {τ_i} by running the policy pi_k = pi(theta_k).
4. Compute the rewards-to-go R_t.
5. Compute the advantage estimates A_t based on the current value function V_{w_k}.
6. Update the policy by maximizing the PPO-Clip objective (gradient ascent with Adam). The formula is fairly involved and is not written out in this list.
7. Update the value function by minimizing the MSE between V_{w_k} and R_t (gradient descent with Adam).
8. End of loop.
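
For reference, the policy update in step 6 and the value regression in step 7 can be written as follows, based on the Spinning Up pseudocode linked above. Here w replaces Spinning Up's φ for the value parameters, T is the trajectory length, and ε is the clip range:

```latex
\theta_{k+1} = \arg\max_{\theta} \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T}
  \min\!\left( \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)} A_t,\;
  g(\epsilon, A_t) \right),
\qquad
g(\epsilon, A) =
\begin{cases}
(1+\epsilon) A & \text{if } A \ge 0 \\
(1-\epsilon) A & \text{if } A < 0
\end{cases}

w_{k+1} = \arg\min_{w} \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T}
  \left( V_{w}(s_t) - R_t \right)^2
```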

All the steps are implemented in the learn function.
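
As a rough illustration of step 6, the clipped term for a single sample can be sketched in plain Python. The function name `clipped_surrogate` and its arguments are hypothetical and are not part of the PPO class; `epsilon=0.2` mirrors the value set in learn. The sketch uses the equivalent "clip the ratio, then take the minimum" form of the objective.

```python
import math

def clipped_surrogate(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    """Per-sample PPO-Clip term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    # Probability ratio pi_theta(a|s) / pi_theta_k(a|s), computed from log-probabilities.
    ratio = math.exp(log_prob_new - log_prob_old)
    # Keep the ratio inside [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
gain = clipped_surrogate(math.log(0.6), math.log(0.4), 2.0)
# Ratio 0.5 with a negative advantage: the clipped term (1 - eps) * A is the smaller one.
penalty = clipped_surrogate(math.log(0.2), math.log(0.4), -1.0)
print(gain, penalty)
```

The training loss is the negative mean of this term over the mini-batch, because the optimizer minimizes.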

In [1]:
import warnings
warnings.filterwarnings('ignore')  # suppress warnings
# Check whether the notebook is running on Colab:
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False
  print("Not running on CoLab")
if IN_COLAB:
  !pip install procgen
  !pip install tensorflow_probability
  !pip install numpy
from rete import ReteNeurale
import tensorflow as tf
import tensorflow_probability as tfp
import gym
import numpy as np
from tensorflow import keras
import math


Not running on CoLab


2024-12-15 10:46:53.226229: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1734256013.240915    6249 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1734256013.245322    6249 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-15 10:46:53.259730: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [None]:
class PPO:
    def learn(self,env):
        # Step 1 --> initialize the environment, the policy parameters theta_0, and the initial value function parameters w_0.
        # We need a neural network for the policy and for the value function; here a single ReteNeurale instance returns both outputs.
        self.env=env
        self.nAzioni=env.action_space.n
        self.nStati=env.observation_space.shape
        self.listaAzioni=[i for i in range(self.nAzioni)]

        #self.stepsPerEpisode=2048 # production setting
        #self.episodesPerBatch=8 # production setting
        #self.nEpoche=200 # production setting
        self.stepsPerEpisode=512
        self.episodesPerBatch=1
        self.nEpoche=5

        self.gamma=0.99
        self.epsilon=0.2
        self.nUpdatesPerIteration=10
        self.cov_mat=tf.linalg.diag(tf.fill([self.nAzioni], 0.5))
        self.policyNN=ReteNeurale(self.nStati,self.nAzioni) #Actor
        #self.valueNN=ReteNeurale(self.nStati,1,False) #Critic
        self.policy_optimizer=keras.optimizers.Adam(learning_rate=5e-4)
        
        #self.value_optimizer=keras.optimizers.Adam(learning_rate=0.0005)
        self.policyNN.compile(optimizer=self.policy_optimizer)
        #self.valueNN.compile(optimizer=self.value_optimizer)
        # Step 2 --> loop for k iterations.
        for k in range(self.nEpoche):
            states, actions, rewards_to_go, log_probs =self.collect_trajectories()
            #print("Trajectories collected")
           
            num_samples=states.shape[0]
            batch_size=64 # Use mini-batches; processing the whole batch at once consistently runs out of memory.
            for i in range(0,num_samples, batch_size):
              batch_states=states[i:i+batch_size]
              batch_actions=actions[i:i+batch_size]
              batch_rewards_to_go=rewards_to_go[i:i+batch_size]
              batch_log_probs=log_probs[i:i+batch_size]

            
              # Step 5: value estimates used only to build the advantages (no gradient needed here).
              V,_,_=self.evaluate(batch_states,batch_actions)
              advantage=self.calcAdvantages(batch_rewards_to_go,V)
              
              with tf.GradientTape() as tape:
                  # Recompute V inside the tape; otherwise the value loss contributes no gradient.
                  V,latest_log_probs,probs=self.evaluate(batch_states,batch_actions)
                  surrogated_loss_1, surrogated_loss_2=self.calcSurrogatedLoss(batch_log_probs,latest_log_probs,advantage)
                  # Step 6: PPO-Clip objective, negated because the optimizer minimizes.
                  policy_loss = -tf.reduce_mean(tf.minimum(surrogated_loss_1, surrogated_loss_2))
                  # Step 7: MSE between the rewards-to-go and V.
                  value_loss=tf.reduce_mean(tf.square(batch_rewards_to_go-V))
                  #print("Policy Loss:", policy_loss)
                  #print("Value Loss:", value_loss)

                  # Add an entropy bonus to the loss to encourage exploration:
                  # per-sample entropy (sum over actions), averaged over the mini-batch.
                  entropy = -tf.reduce_mean(tf.reduce_sum(probs * tf.math.log(probs + 1e-10), axis=-1))
                  total_loss=policy_loss+ value_loss*0.5 - entropy*0.01
              gradientsPolicy = tape.gradient(total_loss, self.policyNN.trainable_variables)
              self.policy_optimizer.apply_gradients(zip(gradientsPolicy, self.policyNN.trainable_variables))

              #gradientsValue = tape.gradient(value_loss, self.valueNN.trainable_variables)
              #self.value_optimizer.apply_gradients(zip(gradientsValue, self.valueNN.trainable_variables))
              #del tape
              print("EPOCA:",k," POLICY LOSS:",policy_loss," VALUE LOSS:",value_loss)
            self.evaluate_policy()



    def evaluate_policy(self, episodes=10):
        total_rewards = []
        for _ in range(episodes):
            state = self.env.reset()
            done = False
            cumulative_reward = 0
            while not done:
                state_tensor = tf.convert_to_tensor(state, dtype=tf.float32)
                state_tensor = tf.expand_dims(state_tensor, axis=0)
                probs, _ = self.policyNN(state_tensor)
                action = np.argmax(probs.numpy())
                state, reward, done, info =self.env.step(action) # update state so the next action uses the new observation
                cumulative_reward += reward
            total_rewards.append(cumulative_reward)
        print(f"Average Reward: {np.mean(total_rewards):.2f}")

    def collect_trajectories(self):
        # Step 3 --> collect a set of trajectories D_k = {τ_i} with the policy pi_k = pi(theta_k).
        # For each trajectory we collect: states, actions, rewards, rewards-to-go, and the log-probabilities of the actions.
        batch={
            'states':[],
            'actions':[],
            'rewards':[],
            'rewards_to_go':[],
            'log_probs':[],
        }
        done = False
        stato = self.env.reset()

        # Each batch has a fixed number of episodes (episodesPerBatch), each capped at stepsPerEpisode steps (8 and 2048 in the production settings).
        for i in range(self.episodesPerBatch):
            if done == True:
                stato = self.env.reset()
                done=False
            rewardPerEpisode=[]
            print("Episode: ",i)
            for j in range(self.stepsPerEpisode):
                batch['states'].append(stato)
                azione,log_prob=self.getAction(stato)
                # azione is the index of the sampled action; log_prob is its log-probability under the current policy
                batch['actions'].append(azione)
                batch['log_probs'].append(log_prob)
                stato, reward, done, info = self.env.step(azione)
                print("Info:",info)
                # info is not used.
                rewardPerEpisode.append(reward)
                if done:
                    print("DONE EPISODE")
                    print("REWARD ",reward)
                    print("STEPS : ",j)                    
                    stato=self.env.reset()
                    break # The episode has ended.
            batch['rewards'].append(rewardPerEpisode)
        # Compute the rewards-to-go --> STEP 4
        batch['rewards_to_go']=self.calcRTG(batch['rewards'])
        #return batch states, actions, rewards, rewards to go, log_probs
        #print("BATCH LOG PROBS:",batch['log_probs'])
        batch_statiTensor=tf.convert_to_tensor(batch['states'],dtype=tf.uint8)
        batch_azioniTensor=tf.convert_to_tensor(batch['actions'],dtype=tf.int32)
        batch_rewards_to_goTensor=tf.convert_to_tensor(batch['rewards_to_go'],dtype=tf.float32)
        batch_log_probsTensor=tf.convert_to_tensor(batch['log_probs'],dtype=tf.float32)


        return batch_statiTensor, batch_azioniTensor,batch_rewards_to_goTensor,batch_log_probsTensor

    def getAction(self,stato):
        stato=tf.convert_to_tensor(np.expand_dims(stato, axis=0) ,dtype=tf.float32) # shape becomes (1, 64, 64, 3)
        azione_pred,_=self.policyNN(stato)
        # Categorical distribution over the action probabilities predicted by the network
        dist=tfp.distributions.Categorical(probs=tf.squeeze(azione_pred))
        azionePresa=dist.sample()
        log_prob=dist.log_prob(azionePresa)
        # Return the action as a plain int so it can be passed directly to env.step
        return int(azionePresa.numpy()), tf.stop_gradient(log_prob)

    def calcRTG(self,rewards):
        #print("CALC REWARDS TO GO")
        #print(rewards)
        # Discounted rewards-to-go, R_t = r_t + gamma * R_{t+1}, accumulated separately for each episode.
        rtg=[]
        for episode_reward in reversed(rewards):
            cumulative_reward=0
            for single_reward in reversed(episode_reward):
                cumulative_reward=single_reward+cumulative_reward*self.gamma
                rtg.append(cumulative_reward)
        # Both loops run backwards, so reverse the list to realign it with the collected states and actions.
        rtg.reverse()
        return tf.convert_to_tensor(rtg,dtype=tf.float32)

    def calcAdvantages(self, rtg,values):
        advantages=rtg-tf.stop_gradient(values)
        return (advantages - tf.reduce_mean(advantages)) / (tf.math.reduce_std(advantages) + 1e-10)

    def calcSurrogatedLoss(self,log_probs_old, log_probs_new, advantages):
        advantages = tf.stop_gradient(advantages)
        #print("CALC SURROGATED LOSS, ADVANTAGES:",advantages)
        #print("CALC SURROGATED LOSS, Log probs old:",log_probs_old)
        #print("CALC SURROGATED LOSS, Log probs new:",log_probs_new)
        policy_ratio = tf.exp(log_probs_new - log_probs_old) # pi_theta(a|s) / pi_theta_k(a|s)
        #print("CALC SURROGATED LOSS, Policy ratio :",policy_ratio)
        surrogated_loss_1 = policy_ratio * advantages
        surrogated_loss_2 = tf.clip_by_value(policy_ratio, clip_value_min=1.0-self.epsilon, clip_value_max=1.0+self.epsilon) * advantages
        return surrogated_loss_1, surrogated_loss_2

    def evaluate(self, batch_states,batch_actions):
        batch_states=tf.cast(batch_states, tf.float32)
        #retVal=self.valueNN(batch_states)
        mean,retVal=self.policyNN(batch_states)
        V= tf.squeeze(retVal)
        #print("V EVALUATE:",V)
        #print("MEAN EVALUATE:",mean)
        dist=tfp.distributions.Categorical(probs=mean)
        log_probs=dist.log_prob(batch_actions)
        #print("LOG PROBS EVALUATE:",log_probs)
        return V, log_probs, mean
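
Steps 4 and 5 can be checked by hand with a small pure-Python sketch. The helper names `rewards_to_go` and `normalized_advantages` are hypothetical and only mirror what calcRTG and calcAdvantages are meant to compute. The point to verify is that the returned list follows the same order as the collected states.

```python
import math

def rewards_to_go(episodes, gamma=0.99):
    """Discounted rewards-to-go, one value per timestep, in collection order."""
    rtg = []
    for episode in episodes:
        running = 0.0
        episode_rtg = []
        # Accumulate from the last step backwards: R_t = r_t + gamma * R_{t+1}.
        for reward in reversed(episode):
            running = reward + gamma * running
            episode_rtg.append(running)
        episode_rtg.reverse()  # restore chronological order within the episode
        rtg.extend(episode_rtg)
    return rtg

def normalized_advantages(rtg, values, eps=1e-10):
    """A_t = R_t - V(s_t), standardized to zero mean and unit (population) std."""
    adv = [r - v for r, v in zip(rtg, values)]
    mean = sum(adv) / len(adv)
    std = math.sqrt(sum((a - mean) ** 2 for a in adv) / len(adv))
    return [(a - mean) / (std + eps) for a in adv]

# Two episodes: a sparse reward at the end of the first, a single step in the second.
rtg = rewards_to_go([[0.0, 0.0, 10.0], [1.0]], gamma=0.5)
print(rtg)  # [2.5, 5.0, 10.0, 1.0]
print(normalized_advantages(rtg, [0.0, 0.0, 0.0, 0.0]))
```

Normalizing the advantages per mini-batch, as calcAdvantages does, keeps the scale of the policy gradient roughly constant regardless of the reward magnitude.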



In [3]:
# Configuration and execution
env = gym.make('procgen:procgen-coinrun-v0',distribution_mode='easy', start_level=0, num_levels=1)
ppo_model=PPO()
ppo_model.learn(env)

I0000 00:00:1734256017.281409    6249 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 2735 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1650, pci bus id: 0000:01:00.0, compute capability: 7.5


Episode:  0


I0000 00:00:1734256017.950099    6249 cuda_dnn.cc:529] Loaded cuDNN version 90300
  logger.deprecation(
  if not isinstance(done, (bool, np.bool8)):


Info: {'prev_level_seed': 0, 'prev_level_complete': 0, 'level_seed': 0}
Info: {'prev_level_seed': 0, 'prev_level_complete': 0, 'level_seed': 0}
Info: {'prev_level_seed': 0, 'prev_level_complete': 0, 'level_seed': 0}
Info: {'prev_level_seed': 0, 'prev_level_complete': 0, 'level_seed': 0}
Info: {'prev_level_seed': 0, 'prev_level_complete': 0, 'level_seed': 0}
Info: {'prev_level_seed': 0, 'prev_level_complete': 0, 'level_seed': 0}
Info: {'prev_level_seed': 0, 'prev_level_complete': 0, 'level_seed': 0}
Info: {'prev_level_seed': 0, 'prev_level_complete': 0, 'level_seed': 0}
Info: {'prev_level_seed': 0, 'prev_level_complete': 0, 'level_seed': 0}
Info: {'prev_level_seed': 0, 'prev_level_complete': 0, 'level_seed': 0}
Info: {'prev_level_seed': 0, 'prev_level_complete': 0, 'level_seed': 0}
Info: {'prev_level_seed': 0, 'prev_level_complete': 0, 'level_seed': 0}
Info: {'prev_level_seed': 0, 'prev_level_complete': 0, 'level_seed': 0}
Info: {'prev_level_seed': 0, 'prev_level_complete': 0, 'level_se

2024-12-15 10:47:18.761737: W external/local_xla/xla/tsl/framework/bfc_allocator.cc:497] Allocator (GPU_0_bfc) ran out of memory trying to allocate 420.50MiB (rounded to 440926208)requested by op Mul
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2024-12-15 10:47:18.761767: I external/local_xla/xla/tsl/framework/bfc_allocator.cc:1053] BFCAllocator dump for GPU_0_bfc
2024-12-15 10:47:18.761775: I external/local_xla/xla/tsl/framework/bfc_allocator.cc:1060] Bin (256): 	Total Chunks: 76, Chunks in use: 76. 19.0KiB allocated for chunks. 19.0KiB in use in bin. 5.3KiB client-requested in use in bin.
2024-12-15 10:47:18.761780: I external/local_xla/xla/tsl/framework/bfc_allocator.cc:1060] Bin (512): 	Total Chunks: 6, Chunks in use: 5. 3.0KiB allocated for chunks. 2.5KiB in use in bin. 2.5KiB client-requested in use in bin.
2024-12-15 10:

ResourceExhaustedError: {{function_node __wrapped__Mul_device_/job:localhost/replica:0/task:0/device:GPU:0}} failed to allocate memory [Op:Mul] name: 
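
The allocator log above suggests trying `TF_GPU_ALLOCATOR=cuda_malloc_async` in case the failure is caused by memory fragmentation. A minimal sketch of that, assuming the variable is set before TensorFlow initializes the GPU (in a notebook, the safest place is the first cell, before `import tensorflow`, after a kernel restart). This only targets fragmentation; it is unlikely to help if the computation genuinely needs more than the 2735 MB reported for the GPU.

```python
import os

# Set before the first `import tensorflow`; the value is the one named in the allocator hint.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
```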