# Laboratory 7

The goal of the seventh laboratory is to study and implement a deep reinforcement learning algorithm, Actor-Critic. The implemented algorithm will be tested on the *CartPole* environment from OpenAI.


Import the standard libraries

In [8]:
from collections import deque
import gym
import numpy as np
import random

Import the libraries for working with neural networks

In [9]:
import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical

## Task 1 - Actor-Critic

<p style='text-align: justify;'>
The goal of this exercise is to implement the Actor-Critic algorithm. To do so, create two deep neural networks:
    1. *actor* - a network that learns the optimal policy (similar to the one from laboratory 6),
    2. *critic* - a network that learns the state-value function (trained similarly to DQN).
The weights of the *actor* network are updated according to:
\begin{equation*}
    \theta \leftarrow \theta + \alpha \delta_t \nabla_\theta \log \pi_{\theta}(a_t \mid s_t).
\end{equation*}
The weights of the *critic* network are updated according to:
\begin{equation*}
    w \leftarrow w + \beta \delta_t \nabla_w \upsilon(s_t, w),
\end{equation*}
where:
\begin{equation*}
    \delta_t = r_t + \gamma \upsilon(s_{t + 1}, w) - \upsilon(s_t, w).
\end{equation*}
</p>
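
To make the TD error concrete, the sketch below evaluates it for a single made-up transition. The function name `td_error`, the `done` flag, and all numbers are illustrative assumptions. Dropping the bootstrap term at terminal states is a common convention, not something the formulas above state explicitly.

```python
def td_error(reward, v_s, v_next, gamma=0.99, done=False):
    """One-step TD error: delta = r + gamma * v(s') - v(s).

    At a terminal transition there is no future value to bootstrap from,
    so the gamma * v(s') term is dropped (assumed convention).
    """
    bootstrap = 0.0 if done else gamma * v_next
    return reward + bootstrap - v_s


# Hypothetical critic estimates for a non-terminal transition.
delta = td_error(reward=1.0, v_s=10.0, v_next=10.5)
print(delta)  # about 1.395: the outcome was better than the critic expected

# The same transition, but ending the episode.
terminal_delta = td_error(reward=1.0, v_s=10.0, v_next=10.5, done=True)
print(terminal_delta)  # -9.0: the critic overestimated the state
```

A positive TD error raises the probability of the chosen action and the value estimate of the state. A negative one lowers both.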

In [10]:
class ActorNetwork(nn.Module):
    """Policy network: maps a state to a probability distribution over actions."""

    def __init__(self, lr, state_shape, n_actions, fc1, fc2):
        super().__init__()

        # Two hidden layers followed by one output unit per action.
        self.fc1 = nn.Linear(state_shape, fc1)
        self.fc2 = nn.Linear(fc1, fc2)
        self.output = nn.Linear(fc2, n_actions)
        self.softmax = nn.Softmax(dim=-1)
        self.optimizer = optim.Adam(self.parameters(), lr=lr)
        self.device = T.device('cuda:0') if T.cuda.is_available() else T.device('cpu')
        self.to(self.device)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        # Softmax turns the raw outputs into action probabilities.
        probs = self.softmax(self.output(x))

        return probs
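
The softmax output and the log-probability used in the actor update can be illustrated without PyTorch. This is a minimal sketch in plain Python. The helpers `softmax` and `sample_action` and the logits are hypothetical stand-ins for the network output, not part of the lab code.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw an action index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)           # seeded only to make the sketch repeatable
probs = softmax([2.0, 0.0])      # hypothetical logits for two actions
action = sample_action(probs, rng)
log_prob = math.log(probs[action])  # the log pi(a | s) term of the actor update
```

Sampling from the distribution, rather than always taking the most probable action, is what lets the agent explore.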

In [11]:
class CriticModel(nn.Module):
    """Value network: maps a state to a single scalar estimate of its value."""

    def __init__(self, lr, state_shape, fc1, fc2):
        super().__init__()

        self.fc1 = nn.Linear(state_shape, fc1)
        self.fc2 = nn.Linear(fc1, fc2)
        # One linear output unit; a state value is unbounded, so no softmax is applied.
        self.output = nn.Linear(fc2, 1)
        self.optimizer = optim.Adam(self.parameters(), lr=lr)
        self.device = T.device('cuda:0') if T.cuda.is_available() else T.device('cpu')
        self.to(self.device)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        value = self.output(x)

        return value

In [12]:
class ACActor:
    def __init__(self, state_shape, n_actions, fc1: int = 128, fc2: int = 128, gamma: float = 0.99, alpha: float = 0.001, beta: float = 0.001):
        self.n_actions = n_actions
        self.gamma = gamma  # discount factor
        self.alpha = alpha  # actor learning rate
        self.beta = beta    # critic learning rate
        self.actor = ActorNetwork(alpha, state_shape, n_actions, fc1, fc2).float()
        self.critic = CriticModel(beta, state_shape, fc1, fc2).float()

    def get_action(self, state):
        state = T.as_tensor(state, dtype=T.float32, device=self.actor.device)
        # Sampling an action needs no gradients.
        with T.no_grad():
            probs = self.actor(state)
        action = Categorical(probs).sample().item()

        return action

    def learn(self, state, action, reward, _state, done):
        device = self.actor.device
        state = T.as_tensor(state, dtype=T.float32, device=device)
        _state = T.as_tensor(_state, dtype=T.float32, device=device)
        reward = T.as_tensor(reward, dtype=T.float32, device=device)
        action = T.as_tensor(action, device=device)

        log_action = Categorical(self.actor(state)).log_prob(action)
        v = self.critic(state)
        # The bootstrap target is treated as a constant (semi-gradient TD(0)),
        # so no gradient flows through the value of the next state.
        with T.no_grad():
            _v = self.critic(_state)

        # TD error; a terminal next state contributes no future value.
        delta = reward + self.gamma * _v * (1.0 - float(done)) - v

        # Actor: gradient ascent on delta * log pi(a|s), with delta held constant.
        self.actor.optimizer.zero_grad()
        actor_loss = (-delta.detach() * log_action).sum()
        actor_loss.backward()
        self.actor.optimizer.step()

        # Critic: minimize the squared TD error (matches the update rule up to a factor of 2).
        self.critic.optimizer.zero_grad()
        critic_loss = (delta ** 2).sum()
        critic_loss.backward()
        self.critic.optimizer.step()
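
A single update can also be traced by hand with linear function approximators, where the gradients have closed forms. The following is a sketch under stated assumptions, not the lab's implementation. It assumes a linear critic (value = w · s), a softmax policy with linear logits, and the semi-gradient form of the critic update, in which only the value of the current state is differentiated. The helper `actor_critic_step` and all numbers are hypothetical.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, w, s, a, r, s_next, done,
                      alpha=0.1, beta=0.1, gamma=0.99):
    """One actor-critic update; theta[k] holds the policy weights of action k."""
    probs = softmax([dot(row, s) for row in theta])
    v = dot(w, s)
    v_next = 0.0 if done else dot(w, s_next)
    delta = r + gamma * v_next - v
    # Critic: for a linear critic the gradient of v(s, w) w.r.t. w is s.
    new_w = [wi + beta * delta * si for wi, si in zip(w, s)]
    # Actor: gradient of log pi(a|s) w.r.t. theta[k] is (1[k == a] - pi(k|s)) * s.
    new_theta = [
        [t + alpha * delta * ((1.0 if k == a else 0.0) - probs[k]) * si
         for t, si in zip(row, s)]
        for k, row in enumerate(theta)
    ]
    return new_theta, new_w, delta


theta = [[0.0, 0.0], [0.0, 0.0]]   # two actions, two state features
w = [0.0, 0.0]
s = [1.0, 0.5]
theta, w, delta = actor_critic_step(theta, w, s, a=1, r=1.0,
                                    s_next=[0.5, 1.0], done=False)
new_probs = softmax([dot(row, s) for row in theta])
```

With all weights starting at zero the TD error equals the reward, so the step raises both the value estimate of `s` and the probability of the action that was taken.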
        


Next, prepare the environment and the learning rates for the model that will learn to act in the [*CartPole*](https://gym.openai.com/envs/CartPole-v0/) environment:

In [15]:
env = gym.make("CartPole-v1")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
alpha_learning_rate = 0.0001
beta_learning_rate = 0.0005
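# Compatibility shim: some gym releases still reference np.bool8,
# which recent NumPy versions no longer provide.
np.bool8 = np.bool_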
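
Code written against gym 0.26 or later receives five values from `step` (`observation, reward, terminated, truncated, info`) and a pair from `reset`. The toy class below is a hypothetical stand-in that mimics this interface without requiring gym. `CountdownEnv`, `run_episode`, and their parameters are invented for illustration.

```python
class CountdownEnv:
    """Hypothetical toy environment mimicking the five-value step API."""

    def __init__(self, fail_at=None, max_steps=5):
        self.fail_at = fail_at      # step at which the episode fails, or None
        self.max_steps = max_steps  # time limit, like CartPole's step cap
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0], {}            # (observation, info)

    def step(self, action):
        self.t += 1
        # terminated: the task itself ended (for CartPole, the pole fell).
        terminated = self.fail_at is not None and self.t >= self.fail_at
        # truncated: the time limit cut the episode short.
        truncated = not terminated and self.t >= self.max_steps
        return [float(self.t)], 1.0, terminated, truncated, {}


def run_episode(env):
    """Roll out one episode with a fixed action and return its summary."""
    state, _ = env.reset()
    score, terminated, truncated = 0.0, False, False
    while not (terminated or truncated):
        state, reward, terminated, truncated, _ = env.step(0)
        score += reward
    return score, terminated, truncated


print(run_episode(CountdownEnv()))           # (5.0, False, True)
print(run_episode(CountdownEnv(fail_at=3)))  # (3.0, True, False)
```

An episode loop should stop on either flag. Only `terminated` means that the next state has no future value.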

Now train the agent in the *CartPole* environment:

In [16]:
agent = ACActor(state_size, action_size, alpha=alpha_learning_rate, beta=beta_learning_rate)


for epoch in range(100):
    score_history = []

    # Each epoch consists of 100 training episodes.
    for episode in range(100):
        done = False
        score = 0
        state, _ = env.reset()
        state = state.astype(np.float32)
        while not done:
            action = agent.get_action(state)
            _state, reward, terminated, truncated, _ = env.step(action)
            _state = _state.astype(np.float32)
            # Only a true termination removes the bootstrap term;
            # a time-limit truncation ends the episode but not the task.
            agent.learn(state, action, reward, _state, terminated)
            done = terminated or truncated
            state = _state
            score += reward
        score_history.append(score)

    print("mean reward:%.3f" % (np.mean(score_history)))

    if np.mean(score_history) > 300:
        print("You Win!")
        break

mean reward:34.320
mean reward:91.860
mean reward:197.750
mean reward:221.600
mean reward:302.450
You Win!
