# Laboratory 5 (4 pts)

The goal of this laboratory is to get familiar with and implement deep reinforcement learning algorithms. The implemented algorithms will be tested in the *CartPole* environment from OpenAI Gym.


Importing the standard libraries

In [15]:
from collections import deque
import gym
import numpy as np
import random
from tqdm import tqdm

Importing the neural network libraries

In [16]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam, SGD

## Task 1 - Double Deep Q-Network

<p style='text-align: justify;'>
The goal of this exercise is to implement the Double Deep Q-Network algorithm. The target value for the network is:
\begin{equation}
       Q^*(s, a) \approx r + \gamma \max_{a'} Q_{\theta'}(s', a')
\end{equation}
and the weights are copied from the network that controls the agent's actions ($Q$) to the target network ($Q_{\theta'}$) after every ten weight updates of $Q$.
</p>
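
The target above can be illustrated on plain NumPy arrays. This is only a sketch: the helper name `td_targets` and the sample numbers are made up for illustration, and the discount of 0.95 mirrors the `gamma` used by the agent in this notebook.

```python
import numpy as np

def td_targets(rewards, dones, q_next_target, gamma=0.95):
    """Return r + gamma * max_a' Q_target(s', a'), or just r for terminal transitions."""
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=bool)
    # Best next-state value according to the target network, one per transition
    best_next = np.max(np.asarray(q_next_target, dtype=float), axis=1)
    # Multiplying by ~dones zeroes the bootstrap term for terminal transitions
    return rewards + gamma * best_next * (~dones)

# Two made-up transitions; the second one ends the episode.
q_next = [[1.0, 2.0], [3.0, 0.5]]
targets = td_targets(rewards=[1.0, 1.0], dones=[False, True], q_next_target=q_next)
print(targets)  # first entry: 1 + 0.95 * 2 = 2.9, second entry: 1.0
```

Only the entry for the action actually taken is overwritten with this target; the other outputs keep the network's own predictions, so they contribute no error.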

In [17]:
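class DDQNAgent:
    def __init__(self, state_size, action_size, model):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)  # replay buffer of the most recent transitions
        self.gamma = 0.95    # discount rate
        self.epsilon = 0.5  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.95
        self.learning_rate = 0.001
        self.model = model
        # The target network must be a separate copy; assigning `model` directly
        # would make both names refer to the same network.
        self.target_model = keras.models.clone_model(model)
        self.target_model.set_weights(model.get_weights())
        self.replay_counter = 1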

    def remember(self, state, action, reward, next_state, done):
        # Store the latest transition (state, action, reward, next state, done flag) in the replay memory
        self.memory.append((state, action, reward, next_state, done))

    def get_action(self, state):
        """
        Compute the action to take in the current state, including exploration.
        With probability self.epsilon take a random action,
        otherwise take the best policy action (self.get_best_action).

        Note: To pick randomly from a list, use random.choice(list).
              To pick True or False with a given probability, draw a uniform number in [0, 1)
              and compare it with that probability.
        """
        # Epsilon-greedy choice: exploit when the uniform draw is at least epsilon
        if random.random() >= self.epsilon:
            chosen_action = self.get_best_action(state)
        else:
            # Explore: CartPole has two discrete actions, 0 and 1
            chosen_action = random.randrange(2)

        return chosen_action

  
    def get_best_action(self, state):
        """
        Compute the best action to take in a state, breaking ties randomly.
        """
        q_values = self.model.predict(state, verbose=0)[0]
        # Collect every action that attains the maximum Q value and pick one of them at random
        best_actions = np.flatnonzero(q_values == q_values.max())
        return int(random.choice(best_actions))

    def replay(self, batch_size):
        """
        Train the network on transitions sampled at random from the memory.
        First compute the Q values of the next states with the target network and take the largest one.
        The target value is calculated according to:
                Q(s, a) := r + gamma * max_a'(Q_target(s', a'))
        except when the transition ends the episode, in which case Q(s, a) := r.
        To change only the weights responsible for choosing the given action, the targets for the remaining
        actions are the values the network itself returns for state s.
        The network is trained on batch_size samples.
        After every 10 trainings of the Q network its parameters are copied to the target Q network.
        """
        if len(self.memory) < batch_size:
            return

        batch = random.sample(self.memory, batch_size)
        states = np.concatenate([transition[0] for transition in batch])
        next_states = np.concatenate([transition[3] for transition in batch])
        q_values = self.model.predict(states, verbose=0)
        q_next = self.target_model.predict(next_states, verbose=0)

        for i, (_, action, reward, _, done) in enumerate(batch):
            # Terminal transitions get the bare reward; others add the discounted best next value
            q_values[i][action] = reward if done else reward + self.gamma * np.max(q_next[i])

        self.model.train_on_batch(states, q_values)
        self.update_weights()
        self.replay_counter += 1
        

    def update_epsilon_value(self):
        # After each epoch, decay epsilon: self.epsilon *= self.epsilon_decay,
        # but never let it drop below epsilon_min
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

    def update_weights(self):
        """Copy the trained Q network parameters to the target Q network after every 10th training."""
        if self.replay_counter % 10 == 0:
            self.target_model.set_weights(self.model.get_weights())
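
The exploration schedule can be checked without running the agent. The sketch below is an illustration rather than part of the assignment: the helper `epsilon_after` is a hypothetical name, and it assumes one decay step per epoch, starting from the 0.9 that the training cell sets and using the decay factor 0.95 and floor 0.01 from the class.

```python
def epsilon_after(n_updates, start=0.9, decay=0.95, minimum=0.01):
    """Epsilon after n_updates multiplicative decay steps, clipped at the minimum."""
    eps = start
    for _ in range(n_updates):
        eps = max(minimum, eps * decay)
    return eps

schedule = [round(epsilon_after(n), 3) for n in (1, 2, 3)]
print(schedule)  # [0.855, 0.812, 0.772]
```

These values agree with the epsilon column printed for epochs 0 to 2 in the training log of this notebook.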


In [18]:
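# Sanity check: `not i % 10` is True only for multiples of 10
lst = [10, 100, 50, 6]

for i in lst:
    if not i % 10:
        print('divisible by 10')

divisible by 10
divisible by 10
divisible by 10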


Time to prepare the network model that will learn to act in the [*CartPole*](https://gym.openai.com/envs/CartPole-v0/) environment:

In [19]:
# `.env` unwraps the environment, which removes the built-in episode step limit
env = gym.make("CartPole-v1").env
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
print(action_size)
learning_rate = 0.001

# Small fully connected network: state in, one Q value per action out
model = Sequential()
model.add(Dense(16, input_dim=state_size, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(action_size))
model.compile(loss='mse', optimizer=Adam(learning_rate=learning_rate))
model.summary()

2
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_6 (Dense)             (None, 16)                80        
                                                                 
 dense_7 (Dense)             (None, 32)                544       
                                                                 
 dense_8 (Dense)             (None, 2)                 66        
                                                                 
Total params: 690
Trainable params: 690
Non-trainable params: 0
_________________________________________________________________
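
The parameter counts in the summary can be verified by hand: a dense layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases. The short check below is a sketch; `dense_params` is a hypothetical helper, and the input size of 4 is the length of the CartPole observation vector.

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters of a dense layer: weights plus biases."""
    return n_inputs * n_units + n_units

layer_sizes = [4, 16, 32, 2]  # input, two hidden layers, output
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # [80, 544, 66] 690
```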


Time to train the agent to play in the *CartPole* environment:

In [20]:
def one_state(env_state):
    """Reshape a raw observation into a batch of one state, shape (1, state_size)."""
    state = np.array([np.array(env_state).flatten()])
    return state
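
The extra leading dimension produced by `one_state` is what lets the agent stack stored states into a batch with `np.concatenate`. A small illustration, with a made-up observation standing in for a real CartPole state:

```python
import numpy as np

def one_state(env_state):
    """Reshape a raw observation into a batch of one state."""
    return np.array([np.array(env_state).flatten()])

obs = [0.01, -0.02, 0.03, 0.04]  # made-up values shaped like a CartPole observation
state = one_state(obs)
# Concatenating along the first axis turns several (1, 4) states into one (3, 4) batch
batch = np.concatenate([state, state, state])
print(state.shape, batch.shape)  # (1, 4) (3, 4)
```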

In [22]:
agent = DDQNAgent(state_size, action_size, model)

agent.epsilon = 0.9

batch_size = 64
EPISODES = 10000  # maximum number of epochs; each epoch runs 100 episodes
for e in range(EPISODES):
    summary = []
    for _ in tqdm(range(100)):
        total_reward = 0
        env_state = env.reset()

        # env.reset() returns (observation, info); reshape the observation to (1, state_size)
        state = one_state(env_state[0])
        for t in range(1000):
            action = agent.get_action(state)
            # `done` is the termination flag; `truncated` signals an external time limit
            next_state_env, reward, done, truncated, _ = env.step(action)
            total_reward += reward

            # Reshape the next observation the same way as the initial one
            next_state = one_state(next_state_env)

            # Add the transition to the experience memory
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            if done or truncated:
                break

        # Train the network once the memory holds more samples than the batch size
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)

        summary.append(total_reward)

    # Decay epsilon once per epoch, which matches the epsilon values in the log below
    agent.update_epsilon_value()

    print("epoch #{}\tmean reward = {:.3f}\tepsilon = {:.3f}".format(e, np.mean(summary), agent.epsilon))
    if np.mean(summary) > 195:
        print("You Win!")
        break


100%|██████████| 100/100 [00:14<00:00,  6.76it/s]


epoch #0	mean reward = 28.270	epsilon = 0.855


100%|██████████| 100/100 [00:18<00:00,  5.27it/s]


epoch #1	mean reward = 33.150	epsilon = 0.812


100%|██████████| 100/100 [00:23<00:00,  4.28it/s]


epoch #2	mean reward = 34.630	epsilon = 0.772


100%|██████████| 100/100 [00:31<00:00,  3.19it/s]


epoch #3	mean reward = 38.720	epsilon = 0.733


100%|██████████| 100/100 [00:36<00:00,  2.78it/s]


epoch #4	mean reward = 40.960	epsilon = 0.696


100%|██████████| 100/100 [00:49<00:00,  2.03it/s]


epoch #5	mean reward = 52.790	epsilon = 0.662


100%|██████████| 100/100 [01:00<00:00,  1.66it/s]


epoch #6	mean reward = 58.790	epsilon = 0.629


100%|██████████| 100/100 [01:12<00:00,  1.39it/s]


epoch #7	mean reward = 67.220	epsilon = 0.597


100%|██████████| 100/100 [01:42<00:00,  1.03s/it]


epoch #8	mean reward = 89.120	epsilon = 0.567


100%|██████████| 100/100 [02:10<00:00,  1.30s/it]


epoch #9	mean reward = 105.380	epsilon = 0.539


100%|██████████| 100/100 [02:30<00:00,  1.51s/it]


epoch #10	mean reward = 116.230	epsilon = 0.512


100%|██████████| 100/100 [03:24<00:00,  2.05s/it]


epoch #11	mean reward = 149.890	epsilon = 0.486


100%|██████████| 100/100 [03:44<00:00,  2.24s/it]


epoch #12	mean reward = 154.530	epsilon = 0.462


100%|██████████| 100/100 [04:43<00:00,  2.84s/it]


epoch #13	mean reward = 187.790	epsilon = 0.439


100%|██████████| 100/100 [05:21<00:00,  3.21s/it]

epoch #14	mean reward = 202.260	epsilon = 0.417
You Win!



