# Resource administrator Artificial Intelligence

With the rise of Artificial Intelligence, driven by the many advances made over the last two decades, we have been finding ways to apply it to many aspects of our lives: from video recommendations to research in fields like astronomy and medicine; from funny robots that can learn to speak somewhat like a real human to advanced AIs that can build a three-dimensional space out of a two-dimensional picture. Amid all of these applications, what I personally find most interesting is the prospect of giving AIs the ability to tackle problems that, while obviously complicated in a real-life environment with hundreds of factors, are mostly mechanical at their core and could potentially be solved using algorithms made by computers.

This is the context in which I came up with the present project. From the natural resources of a country, to the economy of a small family, to the human resources of a business, managing some form of resource is almost inescapable for humans. So I asked to what extent an AI could be trained to manage resources in the confined environment of a game, designed and coded by myself. This project doesn't intend to produce an AI that could actually be used in a real-life situation, but to explore how well an AI built with currently available tools can tackle this problem.

### Overview of the game

In this game, the player is tasked with managing the resources of a spaceship in order to survive for as long as possible without running out of power, gas or crewmembers.

Each playthrough is divided into turns, and in every turn four things happen: the resource being used as fuel is lowered, the crewmembers' hunger and thirst levels are lowered, a random event happens, and, depending on the turn, the player might have to decide which resource is used as fuel or whether the crew should eat.

#### Main Resources
There are four main resources that the player needs to manage: Food, Water, Gas and Power. Food and Water are used to feed the crewmembers during the Meal events, and Gas and Power are used as fuel for the ship every turn, according to the player's decision on the Propulsion Method Choice event. The player starts with 100 units of each of these resources, and can resupply them in some of the Random events that happen every turn.

#### Crewmembers
The crewmembers are the fifth, unofficial, resource that the player has to manage. Initially the player gets 6 crewmembers, but depending on the decisions made during the game, more might be added or some might be lost.

Each crewmember has personal thirst and hunger gauges, which each start at 20 units and decrease by 1 every turn. If either gauge reaches zero, that crewmember will leave the spaceship. To prevent this from happening, the player can feed the crew during the Meal events, if there is enough food and water available.

#### Random events

Every turn, the player gets to make a decision based on an event selected at random out of 6 possibilities. These are:

    Space Station Event: In this event, the ship encounters a Space Station which offers to replenish one of the four main resources. The player has to select between two of said resources, selected at random.
    Pirate Attack Event: Here, a fleet of space pirates wanders near the ship. The player has to make a decision on whether to spend some gas to safely escape, or not use any fuel and run the risk of getting caught by the pirates, in which case a random resource level is lowered.
    Trade Event: In this one, another spaceship appears and offers to trade one resource for another. The decision lies in either accepting the trade or not.
    Supply Beacon Event: The spaceship stumbles upon a supply beacon, which contains one of the main resources, and the player gets the option to either open it or leave it. There are no possible harmful outcomes of opening the beacon.
    SOS Call Event: A sudden SOS message is received by the crewmembers from a nearby stranded ship. If the player decides to help the stranded ship's crew, they will join the spaceship.
    Planet Exploration Event: Finally, for this event the spaceship gets in range of a planet, and some crewmates offer to go look for resources. If the player sends the expedition, one of two outcomes will happen: the crewmembers might return with one resource, or they could get lost on the planet.

#### Periodic events

Besides the random events that happen every turn, there are 2 events that happen at fixed intervals and involve a different kind of choice from the player. These are the Propulsion Method Choice and the Meal Event:

    Propulsion Method Choice: Every 5 turns, the player is asked to select one of the 3 different propulsion methods: Standard, Electrical and Gas. If the standard method is selected, until the next time this event is triggered, the spaceship will use 2 units of power and 2 of gas as fuel. If the Electrical or Gas methods are selected, 4 units of power or gas, respectively, will be used.
    Meal Event: Every 6 turns, the player can decide whether the crewmembers should eat. If so, 6 units of food and 6 units of water are taken from the ship's supplies and given to each crewmember, for as long as the supplies last. This means that, if the player runs out of water or food midway through this event, the crewmembers that didn't get to eat will keep their current thirst and hunger levels.
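
To get a feel for the propulsion trade-off described above, here is a small sketch that estimates how many turns of fuel can be paid in full with each method. The `FUEL_COST` table and the `turns_of_fuel_left` function are hypothetical helpers written for this illustration; only the per-turn costs come from the description of the game.

```python
# Per-turn fuel cost of each propulsion method, as (power, gas) units.
# The values come from the game description; the table itself is illustrative.
FUEL_COST = {
    "standard": (2, 2),
    "electrical": (4, 0),
    "gas": (0, 4),
}


def turns_of_fuel_left(power, gas, method):
    """Return how many turns of fuel can be paid in full with `method`."""
    power_cost, gas_cost = FUEL_COST[method]
    limits = []
    if power_cost:
        limits.append(power // power_cost)
    if gas_cost:
        limits.append(gas // gas_cost)
    return min(limits)


print(turns_of_fuel_left(100, 100, "standard"))    # 50
print(turns_of_fuel_left(100, 100, "electrical"))  # 25
```

The standard method spreads the cost over both tanks, while the other two drain a single tank twice as fast, which is only attractive when one of the resources is much more plentiful than the other.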
    
#### Score system
The performance of the player is determined by the score they get during their playthrough. Each turn, the player gets 100 points for each crewmember still on the ship; however, every time the player loses a crewmember (either by failing to keep them fed or by losing them in the Exploration event), 500 points are subtracted from the score.
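
As a quick worked example of the scoring rule, the sketch below adds up a short playthrough by hand. The function `playthrough_score` is a hypothetical helper for this illustration and is not part of the game code.

```python
def playthrough_score(crew_per_turn, crew_lost):
    """100 points per crewmember aboard each turn, minus 500 per crewmember lost."""
    return 100 * sum(crew_per_turn) - 500 * crew_lost


# Six crewmembers for 3 turns, then one is lost and five remain for 2 more turns.
print(playthrough_score([6, 6, 6, 5, 5], crew_lost=1))  # 2300
```

Losing a crewmember costs as much as that crewmember would have earned in five turns, on top of the points they no longer bring in afterwards.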

#### Game over
Finally, the game ends when one of three conditions is met: the ship runs out of power, runs out of gas, or loses all the members of its crew. Running out of food or water won't result in a game over, but will obviously mean that the player is at risk of losing their crew to hunger and thirst.
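
The three end conditions can be summarised in a single check. This is only a sketch of the rule as described here; `is_game_over` is a hypothetical name and the environment code performs the same comparisons inline.

```python
def is_game_over(power, gas, crew):
    """The game ends when power, gas or the crew runs out; food and water do not count."""
    return power <= 0 or gas <= 0 or crew <= 0


print(is_game_over(power=40, gas=0, crew=6))   # True
print(is_game_over(power=40, gas=12, crew=1))  # False
```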

## Game Environment

The first cell contains the imports of the libraries that will be used throughout the program. The featured libraries are:
    
    The _random_ library, which comes included in Python.
    The _matplotlib_ library (https://matplotlib.org/users/installing.html), used to display graphs of the trained agent's performance.
    The _tensorflow_ library (https://www.tensorflow.org/install), which contains all of the machine learning functions used for this program.
    The _tf_agents_ library, which provides the environment, agent, policy and replay buffer classes imported below.
    The _numpy_ library (https://numpy.org/install/), which is used by tensorflow.

In [None]:
from __future__ import absolute_import, division, print_function

import random
import matplotlib
import matplotlib.pyplot as plt

import abc
import tensorflow as tf
import numpy as np

from tf_agents.environments import py_environment
from tf_agents.environments import tf_environment
from tf_agents.environments import tf_py_environment
from tf_agents.environments import utils
from tf_agents.specs import array_spec
from tf_agents.networks import actor_distribution_network
from tf_agents.environments import wrappers
from tf_agents.environments import suite_gym
from tf_agents.trajectories import time_step as ts
from tf_agents.agents.dqn import dqn_agent
from tf_agents.utils import common
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import trajectory
from tf_agents.agents.reinforce import reinforce_agent

tf.compat.v1.enable_v2_behavior()

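This next cell defines several of the constants used by the Game environment: the starting level of each resource, the initial crew size, the starting value of each crewmember's food and water gauges, and the per-turn probability of each random event.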

In [None]:
GAS_LEVEL = 100
POWER_LEVEL = 100
FOOD_LEVEL = 100
WATER_LEVEL = 100
CREW_NUMBER = 6 # This is the initial number of crewmembers
FOOD_NEED = 20
WATER_NEED = 20
PROB_STATION = 0.045
PROB_ATTACK = 0.05
PROB_BEACON = 0.035
PROB_TRADE = 0.06
PROB_EXPLORATION = 0.05
PROB_SOS = 0.035
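
Because each event probability is small, most passes of independent draws trigger nothing at all. The sketch below works out that chance from the constants above, assuming the six draws are independent (as they are in the `select_events` method defined later). The dictionary `EVENT_PROBS` is just a convenient copy of the constants for this illustration.

```python
import math

# Per-turn trigger probabilities, copied from the constants cell above.
EVENT_PROBS = {
    "station": 0.045, "attack": 0.05, "beacon": 0.035,
    "trade": 0.06, "exploration": 0.05, "sos": 0.035,
}

# Probability that one pass of six independent draws triggers no event.
p_none = math.prod(1 - p for p in EVENT_PROBS.values())

# If the draw is repeated until at least one event triggers, the number of
# passes follows a geometric distribution with this mean.
expected_passes = 1 / (1 - p_none)

print(round(p_none, 3))           # 0.754
print(round(expected_passes, 2))  # 4.07
```

In other words, roughly three passes out of four come up empty, which is why the selection code has to redraw to guarantee one event per turn.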

In this next cell, we define the Python Environment that the agent will train in. The code is based on the environments tutorial in TensorFlow's official documentation (https://www.tensorflow.org/agents/tutorials/2_environments_tutorial), adapted to suit the specific requirements of this project.

In [None]:
class GameEnv(py_environment.PyEnvironment):
    
    def __init__(self):
        #For the action array, 0-food/water decision, 1-propulsion decision, 2-event decision
        self._action_spec = array_spec.BoundedArraySpec(shape=(3,), dtype=np.int32, minimum=0, maximum=2,
                                                        name='action')
        self._observation_spec = array_spec.BoundedArraySpec(shape=(12,), dtype=np.int32, name='observation')
        self._game = Game()
        self._episode_ended = False
        self.turn = 0
        self.crewmates = Crewmates()
        
    def action_spec(self):
        return self._action_spec
    
    def observation_spec(self):
        return self._observation_spec
    
    def _reset(self):
        self._game.reset_game()
        self.crewmates.reset_crewmates()
        self._episode_ended = False
        self.turn = 0  # Start every episode at turn 0 so the periodic events stay in phase
        return ts.restart(np.array([self._game.resources[0], self._game.resources[2], self._game.resources[3],
                                    self._game.resources[1], self._game.crew, self._game.selected_event,
                                    self._game.options_station_event[0], self._game.options_station_event[1],
                                    self._game.resources_trade_event[0], self._game.resources_trade_event[1],
                                    self._game.crew_exploration_event, self._game.crew_sos_event], dtype=np.int32))
    
    def _step(self, action):
        if self._episode_ended:
            return self.reset()
        
        if not self._episode_ended:
            if self.turn % 5 == 0 and self.turn != 0:
                self._game.change_mode(action[1])

            if self.turn % 6 == 0 and self.turn != 0:
                self._game.meal(action[0], self.crewmates)   

            self._game.check_events(self.crewmates, action[2])

            self._game.ship()
        
        for i in range(len(self.crewmates.crewmates_left)):
            self.crewmates.run_crewmates(i, self._game)
            
        if self._game.resources[0] <= 0:
            self._episode_ended = True
        if self._game.resources[1] <= 0:
            self._episode_ended = True
        if self._game.crew == 0:
            self._episode_ended = True

        if self._episode_ended:
            reward = self._game.points
            return ts.termination(np.array([self._game.resources[0], self._game.resources[2], self._game.resources[3],
                                            self._game.resources[1], self._game.crew, self._game.selected_event,
                                            self._game.options_station_event[0], self._game.options_station_event[1],
                                            self._game.resources_trade_event[0], self._game.resources_trade_event[1],
                                            self._game.crew_exploration_event, self._game.crew_sos_event], 
                                           dtype=np.int32), reward)
        
        else:
            for i in range(len(self.crewmates.crewmates_left)):
                if self.crewmates.crewmates_left[i] == False:
                    self._game.points += 100
                    
            self.turn += 1
                    
            self._game.select_events()
                
            return ts.transition(np.array([self._game.resources[0], self._game.resources[2], self._game.resources[3],
                                           self._game.resources[1], self._game.crew, self._game.selected_event,
                                           self._game.options_station_event[0], self._game.options_station_event[1],
                                           self._game.resources_trade_event[0], self._game.resources_trade_event[1],
                                           self._game.crew_exploration_event, self._game.crew_sos_event],
                                          dtype=np.int32), 
                                 reward=0.0, discount=1.0)
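
The same 12-value observation is assembled in three places in the class above. One possible way to keep the order in a single place is a small helper like the one below. This is a sketch, not part of the original environment: `build_observation` is a hypothetical name, and the stand-in object only mimics the attribute names of `Game` so the example can run on its own.

```python
from types import SimpleNamespace


def build_observation(game):
    """Return the 12 observation values in the order used by GameEnv."""
    return [
        game.resources[0],  # gas
        game.resources[2],  # food
        game.resources[3],  # water
        game.resources[1],  # power
        game.crew,
        game.selected_event,
        game.options_station_event[0],
        game.options_station_event[1],
        game.resources_trade_event[0],
        game.resources_trade_event[1],
        game.crew_exploration_event,
        game.crew_sos_event,
    ]


# Stand-in with the same attribute names as Game, for demonstration only.
demo_game = SimpleNamespace(
    resources=[100, 100, 100, 100], crew=6, selected_event=3,
    options_station_event=[0, 0], resources_trade_event=[2, 4],
    crew_exploration_event=0, crew_sos_event=0,
)
observation = build_observation(demo_game)
print(len(observation))  # 12
```

Inside the environment the returned list would still need to be wrapped as `np.array(..., dtype=np.int32)` before being passed to `ts.restart`, `ts.transition` or `ts.termination`.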

This cell contains the Game environment. It defines the class _Game_, which is called by the Python environment and contains the functions and variables that control the game played by the agent.

In [None]:
class Game(object):
    def __init__(self):
        self.resources = [GAS_LEVEL, POWER_LEVEL, FOOD_LEVEL, WATER_LEVEL] # 0-gas, 1-power, 2-food, 3-water
        self.crew = CREW_NUMBER
        self.chance_station = PROB_STATION
        self.chance_attack = PROB_ATTACK
        self.chance_beacon = PROB_BEACON
        self.chance_trade = PROB_TRADE
        self.chance_exploration = PROB_EXPLORATION
        self.chance_sos = PROB_SOS
        self.control_mode = 0
        self.points = 0
        self.events_prob = [0, 0, 0, 0, 0, 0] # [station, attack, beacon, trade, exploration, sos]
        self.options_station_event = [0,0] # 1 = gas, 2 = power, 3 = food, 4 = water (resource index + 1)
        self.resources_trade_event = [0,0] # [resource requested, resource offered]
        self.crew_exploration_event = 0
        self.crew_sos_event = 0
        self.selected_event = None
        self.select_events()
        
    def select_events(self):
        station = random.random()
        attack = random.random()
        beacon = random.random()
        trade = random.random()
        exploration = random.random()
        sos = random.random()
        
        events_selected = 0
        
        if station < self.chance_station:
            self.events_prob[0] = 1
            events_selected += 1
        else:
            self.events_prob[0] = 0
            
        if attack < self.chance_attack:
            self.events_prob[1] = 1
            events_selected += 1
        else:
            self.events_prob[1] = 0
            
        if beacon < self.chance_beacon:
            self.events_prob[2] = 1
            events_selected += 1
        else:
            self.events_prob[2] = 0
            
        if trade < self.chance_trade:
            self.events_prob[3] = 1
            events_selected += 1
        else:
            self.events_prob[3] = 0
            
        if exploration < self.chance_exploration:
            self.events_prob[4] = 1
            events_selected += 1
        else:
            self.events_prob[4] = 0
            
        if sos < self.chance_sos:
            self.events_prob[5] = 1
            events_selected += 1
        else:
            self.events_prob[5] = 0
            
        if events_selected != 0 and events_selected != 1:
            events_to_randomize = []
            for i in range(len(self.events_prob)):
                if self.events_prob[i] == 1:
                    events_to_randomize.append(i)
                    
            randomize = random.randint(0, len(events_to_randomize)-1)
            
            for i in range(len(self.events_prob)):
                if i != events_to_randomize[randomize]:
                    self.events_prob[i] = 0
                    
            events_to_randomize.clear()
            
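        # Forget the previous turn's event so a stale value cannot carry over
        self.selected_event = None
        for i in range(len(self.events_prob)):
            if self.events_prob[i] == 1:
                self.selected_event = i
                
        # No event was drawn on this pass: draw again until one is selected
        if self.selected_event is None:
            self.select_events()
            return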
        
        self.options_station_event = [0,0]
        self.resources_trade_event = [0,0] 
        self.crew_exploration_event = 0
        self.crew_sos_event = 0
            
        self.select_event_random_values(self.selected_event)
        
    def select_event_random_values(self, event):
        #station event
        if event == 0:
            self.options_station_event[0] = random.randint(1, 4)
            self.options_station_event[1] = random.randint(1, 4)
            
            while self.options_station_event[1] == self.options_station_event[0]:
                self.options_station_event[1] = random.randint(1, 4)  # same 1-4 range as the first option
        
        #trade event
        elif event == 3:
            self.resources_trade_event[0] = random.randint(1, 4)
            self.resources_trade_event[1] = random.randint(1, 4)
            
            while self.resources_trade_event[1] == self.resources_trade_event[0]:
                self.resources_trade_event[1] = random.randint(1, 4)
        
        #exploration event
        elif event == 4:
            self.crew_exploration_event = random.randint(1, 4)
            while self.crew - self.crew_exploration_event < 0:
                self.crew_exploration_event = 0
                self.crew_exploration_event = random.randint(1, 4)
            
        #sos event
        elif event == 5:
            self.crew_sos_event = random.randint(1, 5)
    
    def check_events(self, crewmates, action):
        if self.events_prob[0] == 1:
            self.station_event(action)
            
        if self.events_prob[1] == 1:
            self.attack_event(action)
            
        if self.events_prob[2] == 1:
            self.beacon_event(action)
            
        if self.events_prob[3] == 1:
            self.trade_event(action)
            
        if self.events_prob[4] == 1:
            self.exploration_event(action, crewmates)
            
        if self.events_prob[5] == 1:
            self.sos_event(action, crewmates)
    
    # Controls the power and gas depletion
    def ship(self):
        if self.control_mode == 0:
            self.resources[1] -= 2
            self.resources[0] -= 2
            return
            
        elif self.control_mode == 1:
            self.resources[0] -= 4
            return
            
        elif self.control_mode == 2:
            self.resources[1] -= 4
            return
            
    def change_mode(self, action):
        if action == 0:
            self.control_mode = 0
            return

        elif action == 1:
            self.control_mode = 1
            return

        else:
            self.control_mode = 2
            return
                
    # Code for supply replenishment event
    def station_event(self, action):
        if action == 0:
            action = self.options_station_event[0]
            
        else:
            action = self.options_station_event[1]
            
        refill = random.randint(20, 50)
            
        self.update_post_event(action - 1, refill)
            
        return
        
    # Code for space pirate attack event
    def attack_event(self, action):
        if action == 0:
            depletion = random.randint(15, 35)
            self.resources[0] -= depletion
            return
                
        else:
            escape_prob = 0.4
            escape_try = random.random()
            
            if escape_prob >= escape_try:
                return
                
            else:
                supply_lost = random.randint(0, 3)
                amount_lost = -1 * random.randint(30, 50)
                
                self.update_post_event(supply_lost, amount_lost)
            
                return
        
    # Code for supply beacon event
    def beacon_event(self, action):        
        if action == 0:
            supply = random.randint(0, 3)
            amount_found = random.randint(10, 50)
            
            self.update_post_event(supply, amount_found)
            
            return
                
        else:
            return

    # Code for trade event
    def trade_event(self, action):
        amount_offered = random.randint(30, 60)
        amount_requested = random.randint(10, 40)
        
        if action == 0:
            self.update_post_event(self.resources_trade_event[1] - 1, amount_offered)
            self.update_post_event(self.resources_trade_event[0] - 1, -1 * amount_requested)
            return
                
        else:
            return
            
    # Code for exploration event
    def exploration_event(self, action, crewmates):
        success_chance = 0.65
        
        if action == 0:
            exploration_result = random.random()
            if success_chance > exploration_result:
                found = random.randint(0, 3)
                amount_found = random.randint(50, 80)
                self.update_post_event(found, amount_found)
                return
            
            else:
                crewmates.lost_in_planet(self.crew_exploration_event)
                for i in range(self.crew_exploration_event):
                    self.crew -= 1
                    self.points -= 500
                return
                
        else:
            return

    # Code for sos call event
    def sos_event(self, action, crewmates):     
        if action == 0:
            new_crew = self.crew + self.crew_sos_event
            
            for i in range(self.crew, new_crew):
                crewmates.crewmate()
                
            self.crew = new_crew
            
            return
                
        else:
            return
    
    # Code for meal event
    def meal(self, action, crewmates):
        if action == 0:
            for i in range(len(crewmates.crewmates_left)):
                if not crewmates.crewmates_left[i]:
                    self.give_food(i, crewmates)
                    self.give_water(i, crewmates)
            return
        
        else:
            return
        
    # Code for giving food to the crew
    def give_food(self, name, crewmates):
        if self.resources[2] > 5:
            self.resources[2] -= 6
            crewmates.crewmates_food[name] += 6
        return
            
    # Code for giving water to the crew
    def give_water(self, name, crewmates):
        if self.resources[3] > 5:
            self.resources[3] -= 6
            crewmates.crewmates_water[name] += 6
        return
    
    def update_post_event(self, resource, amount):
        if resource == 0:
            self.resources[0] += amount

        elif resource == 1:
            self.resources[1] += amount

        elif resource == 2:
            self.resources[2] += amount
            if self.resources[2] < 0:
                self.resources[2] = 0

        elif resource == 3:
            self.resources[3] += amount
            if self.resources[3] < 0:
                self.resources[3] = 0
                
    def reset_game(self):
        self.resources = [GAS_LEVEL, POWER_LEVEL, FOOD_LEVEL, WATER_LEVEL]
        self.crew = CREW_NUMBER
        self.chance_station = PROB_STATION
        self.chance_attack = PROB_ATTACK
        self.chance_beacon = PROB_BEACON
        self.chance_trade = PROB_TRADE
        self.chance_exploration = PROB_EXPLORATION
        self.chance_sos = PROB_SOS
        self.control_mode = 0
        self.points = 0
        self.events_prob = [0, 0, 0, 0, 0, 0]
        self.options_station_event = [0,0]
        self.resources_trade_event = [0,0]
        self.crew_exploration_event = 0
        self.crew_sos_event = 0
        self.selected_event = None
        self.select_events()
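
The station and trade events both need two different resource codes between 1 and 4, which the class above obtains by redrawing the second value until it differs from the first. As an alternative sketch (not the code used in this project), `random.sample` returns distinct values in one call. The function name `pick_two_distinct_resources` is hypothetical.

```python
import random


def pick_two_distinct_resources(rng=random):
    """Pick two different resource codes in 1..4 without a retry loop."""
    first, second = rng.sample(range(1, 5), 2)
    return [first, second]


options = pick_two_distinct_resources()
print(options[0] != options[1])  # True
```

Both approaches give every ordered pair of distinct resources the same chance; the `sample` version simply avoids the loop.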

This cell defines the class _Crewmates_, which contains the code that controls the crewmates' behaviour: it stores a record of each crewmate's water and food levels, and defines the functions used to generate a new crewmember, update their resource needs every turn and remove crewmates if needed.

In [None]:
class Crewmates(object):
    def __init__(self):
        self.crewmates_food = []
        self.crewmates_water = []
        self.crewmates_left = []
        for i in range(CREW_NUMBER):
            self.crewmate()
        
    # Generates a crewmate
    def crewmate(self):
        self.crewmates_food.append(FOOD_NEED)
        self.crewmates_water.append(WATER_NEED)
        self.crewmates_left.append(False)
    
    # Updates each crewmate's food and water levels
    def run_crewmates(self, name, game):
        if not self.crewmates_left[name]:
            if self.crewmates_food[name] == 0:
                game.crew -= 1
                self.crewmates_left[name] = True
                game.points -= 500
                return

            if self.crewmates_water[name] == 0:
                game.crew -= 1
                self.crewmates_left[name] = True
                game.points -= 500
                return

            self.crewmates_food[name] -= 1
            self.crewmates_water[name] -= 1

        if self.crewmates_left[name]:
            return

    # Code for crewmates lost in exploration event
    def lost_in_planet(self, number):
        pool = []
        
        for i in range(len(self.crewmates_left)):
            if not self.crewmates_left[i]:
                pool.append(i)
        
        len_pool = len(pool)
        for i in range(number):
            name = random.randint(0, len_pool - 1)
            while self.crewmates_left[pool[name]] == True:
                name = random.randint(0, len_pool - 1)

            self.crewmates_left[pool[name]] = True
                        
    def reset_crewmates(self):
        self.crewmates_food = []
        self.crewmates_water = []
        self.crewmates_left = []
        for i in range(CREW_NUMBER):
            self.crewmate()

Once the environment is coded, it is necessary to test it using the tools provided by TensorFlow. In the next cell, validate_py_environment is used to run through 5 episodes of the environment using a random policy. If we have made any mistake in our code (in the environment definition or in the classes), an error message will appear.

In [None]:
environment = GameEnv()
utils.validate_py_environment(environment, episodes=5)

Now we define two environments: one will be used to train the agent and the other to test the trained agent.

Here, we are converting our Python Environment into a TensorFlow Environment using TFPyEnvironment. This basically transforms the numpy arrays into _Tensors_, which makes it easier to interact with the policies and the agent.

In [None]:
train_env = tf_py_environment.TFPyEnvironment(GameEnv())
eval_env = tf_py_environment.TFPyEnvironment(GameEnv())

Once we have defined our environment and checked that it works as intended, we can move on to the agent code.

For the next cells, the code is an adaptation of the code on this TensorFlow documentation page: https://www.tensorflow.org/agents/tutorials/6_reinforce_tutorial

## Agent code

First, define some hyperparameters that will be used throughout the coding of the agent.

Tweaking the hyperparameters is a very delicate part of the process of training an AI, as finding good values for each of the parameters can make a huge difference in the extent to which the AI learns, the time it takes to do so, the resources it takes to run, and so on.

The values present in this code were found through trial and error: some parameters were changed, the agent was run, the results were compared with those of other configurations, and the process was repeated. While these are the values that were found to work best, that doesn't mean they are the optimal configuration; changing some might achieve higher results or lower runtimes (while possibly hurting other metrics), so feel free to experiment if you want.

To make the search for new configurations easier, what follows is a brief description of what each of these hyperparameters does:

    num_iterations: The AI learns to play by going through a number of iterations, each made up of a set number of episodes. Having more iterations might allow the AI to learn more by giving it more experience with the game, but it will also increase the runtime.
    
    collect_episodes_per_iteration: Each episode is one playthrough of the game: it starts by creating a new instance of the game and ends when the AI gets a Game Over. As before, increasing this number gives the AI more experience, at a runtime cost.
    
    replay_buffer_capacity: The AI needs to store its results in a buffer to learn properly. This parameter is the maximum capacity of that buffer. More capacity allows the AI to store more information about its previous runs, but might put a strain on your computer.
    
    fc_layer_params: This hyperparameter defines the "network" of the AI. It is a tuple where each value is the number of nodes in the layer at that position, e.g. (1, 50, 25) would represent one node in the first layer, fifty in the second and twenty-five in the third. Having more layers and more nodes might improve the learning capability of the agent, but will again increase the runtime. It is also worth noting that beyond certain values, adding more layers or nodes won't have a significant effect on the agent and might even be counterproductive. Good values can be found just by experimenting with this parameter, but consulting external documentation might help.
    
    learning_rate: This hyperparameter is one of the hardest to change effectively, and before changing it you might want to research at least the basics of how it works. Still, an oversimplification might be the following: the agent learns by trying different solutions to a given problem. These solutions can be pictured as points on an axis, and the AI jumps from solution to solution in search of the one that best suits the problem. In this picture, the learning rate is how far these "jumps" go. A value that is too big might get the agent near the answer faster, but at the risk of jumping past it entirely; at the other extreme, a value that is too small might ensure that the answer is reached, but the runtime would be enormous.
    
    log_interval: When training, the agent will periodically record an average of the results obtained over the last iterations. This hyperparameter defines how often that log happens.
    
    num_eval_episodes: Every so often, the agent undergoes an evaluation where its performance is gauged. This hyperparameter sets the number of episodes in each evaluation. Greater values give more information on the performance of the agent, at some runtime cost.
    
    eval_interval: This hyperparameter sets how often the evaluation happens. While more frequent evaluations help gauge the agent's performance more often, it is also important to have a significant number of training iterations between evaluations, to give the agent time to change its approach to the problem.
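
The "jumps" picture of the learning rate can be tried out on a toy problem. The sketch below minimises f(x) = x**2 with plain gradient descent, which is a deliberate simplification: the agent in this notebook uses the Adam optimizer on a neural network, not this update rule. The function `gradient_descent` and the three example rates are made up for the illustration.

```python
def gradient_descent(learning_rate, steps=50, start=10.0):
    """Minimise f(x) = x**2 with plain gradient descent and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x                 # derivative of x**2
        x -= learning_rate * gradient    # jump against the gradient
    return x


# Distance from the true minimum (x = 0) after 50 steps for three rates.
for lr in (0.001, 0.1, 1.1):
    print(lr, abs(gradient_descent(lr)))
```

With the smallest rate the point has barely moved from its start at 10, with the middle rate it ends very close to zero, and with the largest rate every jump overshoots by more than the previous one, so the distance keeps growing.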

In [None]:
num_iterations = 1000 # @param {type:"integer"}
collect_episodes_per_iteration = 20 # @param {type:"integer"}
replay_buffer_capacity = 100000 # @param {type:"integer"}

fc_layer_params = (50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,)

learning_rate = 1e-3 # @param {type:"number"}
log_interval = 25 # @param {type:"integer"}
num_eval_episodes = 10 # @param {type:"integer"}
eval_interval = 50 # @param {type:"integer"}
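
To see what these values add up to, the sketch below derives a few totals from them. It assumes that training runs for `num_iterations` iterations and that a log entry and an evaluation happen once every `log_interval` and `eval_interval` iterations respectively; the variable names on the left are introduced only for this example.

```python
num_iterations = 1000
collect_episodes_per_iteration = 20
log_interval = 25
eval_interval = 50
num_eval_episodes = 10

# Games played for training over the whole run.
training_episodes = num_iterations * collect_episodes_per_iteration
# Periodic log entries and evaluations (assumed schedule, see above).
log_entries = num_iterations // log_interval
evaluations = num_iterations // eval_interval
evaluation_episodes = evaluations * num_eval_episodes

print(training_episodes)    # 20000
print(log_entries)          # 40
print(evaluation_episodes)  # 200
```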

The first thing needed to code our agent is an Actor Network, tasked with predicting an action given an observation from the environment. We pass it the observation and action specs, as well as _fc_layer_params_, a tuple of ints where each value is the number of nodes in one hidden layer.

In [None]:
actor_net = actor_distribution_network.ActorDistributionNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=fc_layer_params)
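
To give a rough idea of how large this network is, the sketch below counts the weights and biases of the hidden layers alone, given the 12 observation values as input. It is an estimate under assumptions: it treats the hidden layers as plain fully connected layers and ignores whatever output layers the actor network adds on top. `dense_parameter_count` is a hypothetical helper.

```python
def dense_parameter_count(input_size, layer_sizes):
    """Count weights and biases of fully connected layers stacked in order."""
    total = 0
    previous = input_size
    for size in layer_sizes:
        total += previous * size + size  # weight matrix plus bias vector
        previous = size
    return total


fc_layer_params = (50,) * 15  # same shape as the tuple in the hyperparameter cell
print(dense_parameter_count(12, fc_layer_params))  # 36350
```

Each extra 50-node layer adds 2550 parameters, so the depth of the network is a direct lever on runtime.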

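Here we define an optimizer for the network. TensorFlow offers a variety of different optimizers; here the Adam optimization algorithm is used. The same cell then creates the REINFORCE agent from the network and the optimizer, and initializes it.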

In [None]:
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate)

train_step_counter = tf.compat.v2.Variable(0)

tf_agent = reinforce_agent.ReinforceAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    actor_network=actor_net,
    optimizer=optimizer,
    normalize_returns=True,
    train_step_counter=train_step_counter)
tf_agent.initialize()
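
The agent above is created with `normalize_returns=True`. As a rough sketch of what that option refers to, the code below computes the discounted return of every step in one episode and then rescales those returns to zero mean and unit standard deviation. The reward values, the discount `gamma=0.9` and the helper names are hypothetical, and the exact way TF-Agents performs the normalization internally may differ.

```python
import statistics

def discounted_returns(rewards, gamma=1.0):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def normalize(values, eps=1e-6):
    """Rescale values to (approximately) zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(value - mean) / (std + eps) for value in values]

# Hypothetical episode: three good turns followed by a costly one.
episode_rewards = [1.0, 1.0, 1.0, -5.0]
returns = discounted_returns(episode_rewards, gamma=0.9)
normalized = normalize(returns)

print(returns)
print(normalized)
```

Normalizing keeps the scale of the learning signal similar from batch to batch, which tends to make training less sensitive to the absolute size of the rewards.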

A policy is the rule the agent uses to choose an action at each timestep, in response to an observation from the environment.

The agent contains two policies: the main policy, which is used for evaluation and later by the trained agent (evaluation policy), and a second one used to collect data during training (collect policy).

In [None]:
eval_policy = tf_agent.policy
collect_policy = tf_agent.collect_policy
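
One common way to think about the difference between the two policies is sketched below, under the assumption that the collect policy samples actions from the network's probabilities (so it keeps exploring) while the evaluation policy always takes the most probable action. The probabilities and function names are hypothetical, and this is an illustration of the idea rather than a description of TF-Agents internals.

```python
import random

def collect_action(probabilities, rng):
    """Sample an action index in proportion to its probability (exploration)."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

def greedy_action(probabilities):
    """Pick the single most probable action (exploitation)."""
    return max(range(len(probabilities)), key=lambda index: probabilities[index])

# Hypothetical output of an actor network for one observation.
action_probabilities = [0.2, 0.7, 0.1]

rng = random.Random(0)  # fixed seed so the run is repeatable
sampled = [collect_action(action_probabilities, rng) for _ in range(1000)]
chosen = greedy_action(action_probabilities)

print({action: sampled.count(action) for action in range(3)})
print(chosen)
```

Sampling lets the agent occasionally try actions it currently considers worse, which is what gives it the chance to discover better strategies during training.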

It is also important to define a metric to evaluate how good the policy is. The one used in the TensorFlow documentation is the average return, one of the most commonly used. It is computed by adding up the rewards obtained in each episode and averaging those totals over a set number of episodes:

In [None]:
def compute_avg_return(environment, policy, num_episodes=10):

    total_return = 0.0
    for _ in range(num_episodes):

        time_step = environment.reset()
        episode_return = 0.0

        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return

    avg_return = total_return / num_episodes
    return avg_return.numpy()[0]
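
The same computation can be tried without TensorFlow on a toy problem. In the sketch below, `CountdownEnv` is a hypothetical environment invented for this example (it is not the spaceship game): every step gives a reward of 1 and the episode ends after a fixed number of steps, so the average return is easy to check by hand.

```python
class CountdownEnv:
    """Hypothetical toy environment: reward 1 per step, ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.remaining = length

    def reset(self):
        self.remaining = self.length
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        reward = 1.0
        done = self.remaining == 0
        return self.remaining, reward, done

def average_return(env, policy, num_episodes):
    """Sum the rewards of each episode, then average over the episodes."""
    total = 0.0
    for _ in range(num_episodes):
        observation = env.reset()
        done = False
        episode_return = 0.0
        while not done:
            observation, reward, done = env.step(policy(observation))
            episode_return += reward
        total += episode_return
    return total / num_episodes

# The policy here ignores the observation and always picks action 0.
result = average_return(CountdownEnv(length=4), policy=lambda obs: 0, num_episodes=10)
print(result)
```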

The replay buffer stores the data the agent collects from its episodes, so it can later be used for training.

In [None]:
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=tf_agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=replay_buffer_capacity)
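
As a simplified picture of what a replay buffer with a maximum length does, the sketch below keeps only the most recent items and supports the three operations used later in the training loop: adding data, gathering everything, and clearing. `SimpleReplayBuffer` is a hypothetical class written for this example; the real TFUniformReplayBuffer stores batched tensors and is considerably more elaborate.

```python
from collections import deque

class SimpleReplayBuffer:
    """Hypothetical minimal buffer: drops the oldest item once max_length is reached."""

    def __init__(self, max_length):
        self._items = deque(maxlen=max_length)

    def add(self, item):
        self._items.append(item)

    def gather_all(self):
        return list(self._items)

    def clear(self):
        self._items.clear()

    def __len__(self):
        return len(self._items)

buffer = SimpleReplayBuffer(max_length=3)
for step in range(5):
    buffer.add({"step": step})   # steps 0 and 1 are pushed out by later ones

stored = buffer.gather_all()
buffer.clear()                   # emptied after use, as in the training loop

print(stored)
print(len(buffer))
```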

Here we define a function that plays a given number of episodes and stores every step of them in the replay buffer:

In [None]:
def collect_episode(environment, policy, num_episodes):

    episode_counter = 0
    environment.reset()

    while episode_counter < num_episodes:
        time_step = environment.current_time_step()
        action_step = policy.action(time_step)
        next_time_step = environment.step(action_step.action)
        traj = trajectory.from_transition(time_step, action_step, next_time_step)

        replay_buffer.add_batch(traj)

        if traj.is_boundary():
            episode_counter += 1

With this we have all the pieces needed to train the agent, so we move on to the training loop.

In [None]:
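# To time the training run in a notebook, the `%%time` cell magic can be placed
# on the first line of this cell; it is not valid inside a plain Python script.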

# (Optional) Optimize by wrapping some of the code in a graph using TF function.
tf_agent.train = common.function(tf_agent.train)

# Reset the train step
tf_agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(eval_env, tf_agent.policy, num_eval_episodes)
returns = [avg_return]

for _ in range(num_iterations):

    # Collect a few episodes using collect_policy and save to the replay buffer.
    collect_episode(train_env, tf_agent.collect_policy, collect_episodes_per_iteration)

    # Use data from the buffer and update the agent's network.
    experience = replay_buffer.gather_all()
    train_loss = tf_agent.train(experience)
    replay_buffer.clear()

    step = tf_agent.train_step_counter.numpy()

    if step % log_interval == 0:
        print('step = {0}: loss = {1}'.format(step, train_loss.loss))

    if step % eval_interval == 0:
        avg_return = compute_avg_return(eval_env, tf_agent.policy, num_eval_episodes)
        print('step = {0}: Average Return = {1}'.format(step, avg_return))
        returns.append(avg_return)
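
To see how _log_interval_ and _eval_interval_ shape the output of this loop, the sketch below lists the steps at which a log line and an evaluation would happen. It assumes the train step counter increases by exactly one per call to `tf_agent.train`; the helper `schedule` is hypothetical and only reproduces the two modulo checks of the loop.

```python
def schedule(num_iterations, log_interval, eval_interval):
    """Return the training steps at which the loop logs and evaluates."""
    steps = range(1, num_iterations + 1)
    log_steps = [step for step in steps if step % log_interval == 0]
    eval_steps = [step for step in steps if step % eval_interval == 0]
    return log_steps, eval_steps

# Same values as the hyperparameter cell of this notebook.
log_steps, eval_steps = schedule(num_iterations=1000, log_interval=25, eval_interval=50)

# One evaluation happens before training, so `returns` gets one extra entry.
num_return_points = 1 + len(eval_steps)

print(len(log_steps), len(eval_steps), num_return_points)
```

Under that assumption, the number of entries in `returns` matches the number of points in `range(0, num_iterations + 1, eval_interval)`, which is what the plot in the visualization section relies on.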

## Visualization

These final two cells contain the code needed to visualize what we have just trained.

The first one uses matplotlib to plot the average returns recorded during training. It shows how the agent's performance evolved over the training steps, which is helpful when selecting the best hyperparameter configuration.

In [None]:
steps = range(0, num_iterations + 1, eval_interval)
plt.plot(steps, returns)
plt.ylabel('Average Return')
plt.xlabel('Step')

The second and final cell of the project contains a test run for the trained agent, allowing us to see how the agent applies the policy learned during training to a general scenario.

In [None]:
def bot_evaluation(policy, num_episodes=5):
    for episode in range(num_episodes):
        print("Episode: {}".format(episode))
        time_step = eval_env.reset()
        while not time_step.is_last():
            print(time_step)
            action_step = policy.action(time_step)
            time_step = eval_env.step(action_step.action)
        print("Final Score: {}".format(time_step.reward))
        print("------------------------------------------------------------------------------------------")

bot_evaluation(tf_agent.policy)