# Laboratory 6

The goal of the sixth laboratory is to study and implement REINFORCE, a deep reinforcement learning algorithm. The implementation is tested on the OpenAI Gym *CartPole* environment.


Import the standard libraries

In [29]:
from collections import deque
import gym
import numpy as np
import random

Import the neural-network libraries

In [30]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.math import reduce_sum
from tensorflow.math import log

## Task 1 - REINFORCE


Write a function that computes the cumulative (discounted) reward:

In [31]:
def get_cumulative_rewards(rewards,  # rewards at each step
                           gamma=0.99  # discount for reward
                           ):
    """
    based on https://github.com/yandexdataschool/Practical_RL/blob/spring20/week06_policy_based/reinforce_tensorflow.ipynb
    take a list of immediate rewards r(s,a) for the whole session
    compute cumulative rewards R(s,a) (a.k.a. G(s,a) in Sutton '16)
    R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...

    The simple way to compute cumulative rewards is to iterate from last to first time tick
    and compute R_t = r_t + gamma*R_{t+1} recurrently

    You must return an array/list of cumulative rewards with as many elements as in the initial rewards.
    """
    # Iterate from the last step to the first, applying R_t = r_t + gamma * R_{t+1}
    cumulative_rewards = []
    running = 0.0
    for r in reversed(list(rewards)):
        running = r + gamma * running
        cumulative_rewards.append(running)
    cumulative_rewards.reverse()

    return cumulative_rewards


assert len(get_cumulative_rewards(range(100))) == 100
assert np.allclose(get_cumulative_rewards([0, 0, 1, 0, 0, 1, 0], gamma=0.9),
                   [1.40049, 1.5561, 1.729, 0.81, 0.9, 1.0, 0.0])
assert np.allclose(get_cumulative_rewards([0, 0, 1, -2, 3, -4, 0], gamma=0.5),
                   [0.0625, 0.125, 0.25, -1.5, 1.0, -4.0, 0.0])
assert np.allclose(get_cumulative_rewards([0, 0, 1, 2, 3, 4, 0], gamma=0), [0, 0, 1, 2, 3, 4, 0])
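
The docstring gives two equivalent formulas for the cumulative reward: the direct sum and the backward recurrence. The sketch below checks on a small example that they agree. The helper names `returns_by_definition` and `returns_by_recurrence` are hypothetical and are not part of the lab template.

```python
def returns_by_definition(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k}, computed directly (quadratic time)."""
    n = len(rewards)
    return [sum(gamma ** k * rewards[t + k] for k in range(n - t))
            for t in range(n)]


def returns_by_recurrence(rewards, gamma):
    """R_t = r_t + gamma * R_{t+1}, computed backwards (linear time)."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


rewards = [0, 0, 1, 0, 0, 1, 0]
direct = returns_by_definition(rewards, 0.9)
backward = returns_by_recurrence(rewards, 0.9)

# Both formulations give the same returns up to floating-point rounding
assert all(abs(a - b) < 1e-12 for a, b in zip(direct, backward))
```

The backward version touches every reward once, which matters for long episodes.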

<p style='text-align: justify;'>
The goal of this exercise is to implement the REINFORCE algorithm. The network weights are updated according to the rule:
\begin{equation*}
    \theta \leftarrow \theta + \alpha G_t \nabla_\theta \log \pi_{\theta}(a_t \mid s_t)
\end{equation*}
where $\alpha$ is the learning rate and $G_t$ is the cumulative discounted reward from step $t$.
</p>
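
To make the update rule concrete, here is a minimal sketch of a single REINFORCE step for a softmax policy whose parameters are the logits themselves. This is a simplifying assumption: in the lab the parameters are the network weights. The names `softmax` and `reinforce_step` are hypothetical helpers, not part of the lab template.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, G, alpha=0.1):
    """theta <- theta + alpha * G * grad log pi(action), with theta = logits."""
    probs = softmax(logits)
    # d/dz_k log softmax(z)[action] = 1[k == action] - probs[k]
    grad_log_pi = [(1.0 if k == action else 0.0) - p
                   for k, p in enumerate(probs)]
    return [z + alpha * G * g for z, g in zip(logits, grad_log_pi)]


# A positive return makes the taken action more likely ...
better = reinforce_step([0.0, 0.0], action=1, G=2.0)
# ... and a negative return makes it less likely.
worse = reinforce_step([0.0, 0.0], action=1, G=-2.0)
```

In this toy setting the sign of the return decides whether the probability of the taken action goes up or down, and its magnitude scales the step.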

In [32]:
class REINFORCEAgent:
    def __init__(self, state_size, action_size, model):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = 0.99    # discount rate
        self.learning_rate = 0.001
        self.model = model
        self.state_memory = []
        self.action_memory = []
        self.reward_memory = []
        
        
    def remember(self, state, action, reward):
        # Store the latest state, action and reward in the episode memory
        self.state_memory.append(state)
        self.action_memory.append(action)
        self.reward_memory.append(reward)

    def get_action(self, state):
        """
        Compute the action to take in the current state, based on the policy returned by the network.

        Note: the action is sampled according to the probabilities produced by the network,
        not chosen greedily.
        """
        # Forward pass: state has shape (1, state_size), prediction has shape (1, action_size)
        prediction = self.model.predict_on_batch(state)
        # Sample an action index from the policy distribution
        action = np.random.choice(np.arange(self.action_size), p=prediction[0])
        return action

  

    def learn(self):
        """
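        Train the network on the data stored in the state, action and reward memory.
        First computes G_t for every step of the episode, then takes one gradient step.
        """
        # Each stored state has shape (1, state_size); stack them into (T, state_size)
        states = np.vstack(self.state_memory).astype(np.float32)
        actions = np.array(self.action_memory, dtype=np.int32)
        G_t = np.array(get_cumulative_rewards(self.reward_memory, self.gamma), dtype=np.float32)

        # The policy-gradient loss is computed explicitly here, so this step
        # does not depend on the loss passed to model.compile.
        with tf.GradientTape() as tape:
            probs = self.model(states, training=True)
            # Probability the policy assigned to each action actually taken
            chosen = tf.reduce_sum(probs * tf.one_hot(actions, self.action_size), axis=1)
            # Minimizing -G_t * log pi(a_t | s_t) is gradient ascent on the return
            # (averaged over the episode; 1e-8 guards against log(0))
            loss = -tf.reduce_mean(tf.math.log(chosen + 1e-8) * G_t)
        grads = tape.gradient(loss, self.model.trainable_variables)
        self.model.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))

        # Clear the episode memory
        self.state_memory = []
        self.action_memory = []
        self.reward_memory = []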
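
The agent draws its action from the policy distribution rather than always taking the most probable one. The sketch below contrasts the two choices using only the standard library. The helpers `sample_action` and `greedy_action` and the probabilities `[0.3, 0.7]` are made up for illustration.

```python
import random


def sample_action(probs, rng=random):
    """Draw an action index with probability given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_action(probs):
    """Always pick the most likely action (no exploration)."""
    return max(range(len(probs)), key=lambda i: probs[i])


rng = random.Random(0)  # seeded generator so the run is repeatable
probs = [0.3, 0.7]
counts = [0, 0]
for _ in range(10_000):
    counts[sample_action(probs, rng)] += 1

# Empirical frequencies approach the policy probabilities
frequencies = [c / sum(counts) for c in counts]
```

Sampling keeps the less likely action in play, which REINFORCE relies on for exploration. A greedy choice would never try action 0 here.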


Now prepare the network model that will learn to act in the [*CartPole*](https://gym.openai.com/envs/CartPole-v0/) environment:

In [33]:
env = gym.make("CartPole-v1").env
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
learning_rate = 0.001

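def custom_loss(y_true, y_pred):
    # Negative log-likelihood of the actions encoded in y_true
    # (y_true is a one-hot action matrix, optionally pre-scaled by G_t)
    log_lik = y_true * log(y_pred)
    return reduce_sum(-log_lik)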


model = Sequential()
model.add(Dense(32, input_shape=(state_size,), activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(128, activation='relu'))
# Softmax output: one probability per action
model.add(Dense(action_size, activation='softmax'))
model.compile(loss=custom_loss, optimizer=Adam(learning_rate=learning_rate))
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_22 (Dense)            (None, 32)                160       
                                                                 
 dense_23 (Dense)            (None, 64)                2112      
                                                                 
 dense_24 (Dense)            (None, 128)               8320      
                                                                 
 dense_25 (Dense)            (None, 2)                 258       
                                                                 
Total params: 10,850
Trainable params: 10,850
Non-trainable params: 0
_________________________________________________________________
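
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output unit. The sketch below reproduces the numbers, assuming the CartPole sizes of 4 state features and 2 actions. The helper `dense_params` is hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


# Input size followed by the width of each Dense layer
layer_sizes = [4, 32, 64, 128, 2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)
```

The per-layer values match the `Param #` column and the total matches `Total params`.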


Now train the agent to play in the *CartPole* environment:

In [34]:
agent = REINFORCEAgent(state_size, action_size, model)


def generate_session(t_max=1000):
    """play env with REINFORCE agent and train at the session end"""

    reward = 0

    s = env.reset()[0]
    s = s[np.newaxis, :]
    for t in range(t_max):

        # choose an action
        a = agent.get_action(s)

        # gym >= 0.26 returns (obs, reward, terminated, truncated, info)
        new_s, r, terminated, truncated, _ = env.step(a)
        new_s = new_s[np.newaxis, :]
        # record session history to train later
        agent.remember(s, a, r)

        reward += r

        s = new_s
        if terminated or truncated:
            break

    agent.learn()

    return reward


for i in range(100):

    rewards = [generate_session() for _ in range(100)]  # generate new sessions

    print("mean reward:%.3f" % (np.mean(rewards)))

    if np.mean(rewards) > 300:
        print("You Win!")
        break


