Theory behind the project

We present our work investigating the effects of Reward Shaping in a Hierarchical Reinforcement Learning (HRL) environment while specifying increasingly complex temporal goals as LTL$$_f$$ formulas. The flexibility of the LTL$$_f$$ language makes such goals easy to define, and existing Python libraries can convert them into deterministic finite automata (DFAs), a representation that fits RL tasks well.

We also investigate the possibility of discarding the LTL$$_f$$ goals and instead redefining the environment using STRIPS, which could later be extended to accept a multitude of temporal goals implicitly.

Theory behind the project
Our experiments
Results
References

Theory behind the project

Reinforcement Learning

Reinforcement Learning (RL) is a branch of Machine Learning (ML), one of the most prominent and established fields of study in Artificial Intelligence (AI).

Algorithms in this sub-class are designed to improve their ability to perform some task through experience, that is, by learning. In general, RL aims at learning an optimal behavior function, called policy, $$ \pi : S \rightarrow A$$ , given a dataset of trajectories made of states, actions and rewards, $$ D = \{\langle s_0, a_1, r_1, s_1, \ldots, a_n, r_n, s_n\rangle_i\}^n_{i=1}$$, in order to maximize the cumulative reward provided by a reward function.

A reinforcement learning problem may be modeled as a Markov Decision Process (MDP).

An MDP can be defined as the tuple:

$$ MDP= \langle S, A, \delta, r \rangle$$ where:

$$ S$$ is a finite set of states, which may be fully or partially observable
$$ A$$ is a finite set of actions
$$ \delta$$ is a transition function, which may be deterministic, non-deterministic or stochastic
$$ r: S \times A \times S \rightarrow \mathbb{R}$$ is a reward function

Taxi-v3

Taxi-v3 is one of the environments provided by OpenAI, a prominent AI research and deployment company, as part of Gym, a toolkit for developing and testing reinforcement learning algorithms. The library consists of a collection of challenges, called environments, which cover the main known problems and weaknesses that affected, or still affect, the RL field.

The task, inspired by the job of a taxi driver, is relatively simple and takes place in a $$ 5 \times 5$$ grid with 4 marked tiles called R, G, Y, B. At each round, called an episode, two of these tiles are randomly chosen as the passenger and destination locations, while the taxi's starting location is picked at random from the whole grid. The goal of the agent is to bring the passenger to the destination while maximizing the total reward. If the passenger and destination tiles coincide, the agent still has to pick up the passenger and drop them off in place. To accomplish this, the driver can move in any orthogonal direction inside the grid, but it cannot cross the outer edges or the fixed walls placed between some tiles.

At each time step the environment is represented by its state. Combining 25 taxi positions, 5 passenger positions (the four marked tiles plus "inside the taxi") and 4 destinations gives $$ 25 \times 5 \times 4 = 500$$ states, which are encoded as discrete values in the interval $$ [0, 499]$$ .
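The count of 500 states can be made concrete with a mixed-radix encoding of the three components. The sketch below is self-contained and does not import Gym; the helper names `encode` and `decode` and the component order (taxi row, taxi column, passenger location, destination) are assumptions chosen for illustration, to our knowledge matching the layout used by the Taxi environment.

```python
def encode(taxi_row, taxi_col, pass_loc, dest_idx):
    """Pack the state components into a single integer in [0, 499]."""
    assert 0 <= taxi_row < 5 and 0 <= taxi_col < 5
    assert 0 <= pass_loc < 5   # 0..3 = R, G, Y, B; 4 = passenger inside the taxi
    assert 0 <= dest_idx < 4   # 0..3 = R, G, Y, B
    return ((taxi_row * 5 + taxi_col) * 5 + pass_loc) * 4 + dest_idx


def decode(state):
    """Inverse of encode: recover (taxi_row, taxi_col, pass_loc, dest_idx)."""
    state, dest_idx = divmod(state, 4)
    state, pass_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, pass_loc, dest_idx


# Every combination maps to a distinct index, so there are exactly 500 states.
all_states = {
    encode(r, c, p, d)
    for r in range(5) for c in range(5) for p in range(5) for d in range(4)
}
```

Because the encoding is a bijection, a tabular method can index its value table directly with the integer state.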

The state is updated only by the agent's actions. There are 6 possible actions the agent can perform at any step of the episode, which are:

Move South,
Move North,
Move East,
Move West,
Pick up the passenger and
Drop off the passenger.

To train the agent, each step yields a reward of $$ -1$$ , except for illegal Pick up and Drop off actions, which cost $$ -10$$ . The episode ends when the agent reaches the fixed maximum number of steps or when it drops off the passenger at the destination, obtaining a $$ +20$$ reward.
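The reward scheme described above can be summarized in a few lines. This is a minimal sketch rather than the environment's actual code: the numeric action codes simply follow the order of the list above, and the flags `legal` and `delivered` are hypothetical inputs that a real environment would derive from its state.

```python
from enum import IntEnum


class Action(IntEnum):
    # Codes follow the order in which the actions are listed above (assumption).
    SOUTH = 0
    NORTH = 1
    EAST = 2
    WEST = 3
    PICKUP = 4
    DROPOFF = 5


def step_reward(action, legal=True, delivered=False):
    """Reward for one step.

    legal:     False when a pick up / drop off is attempted in the wrong place.
    delivered: True when a drop off leaves the passenger at the destination.
    """
    if action in (Action.PICKUP, Action.DROPOFF) and not legal:
        return -10
    if action == Action.DROPOFF and delivered:
        return 20
    return -1
```

With this scheme, every wasted move lowers the return, so shorter successful episodes score higher.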

Temporal Goals

We now turn to goal-directed exploration with a reinforcement learning algorithm: the agent has to learn a policy that reaches a goal state in an initially unknown (or partially known) state space. The agent only needs to find some path to a goal state, not necessarily the shortest one. The idea is to guide it toward the final goal through a set of temporal goals, which can be expressed in LTL$$ _f$$ notation.

Reward Shaping

Reward shaping is a method for engineering a reward function so that it provides more frequent feedback on appropriate behaviors. In RL, it consists of supplying additional rewards to a learning agent to guide its learning process more efficiently, helping it reach the optimal policy faster. The shaping function can be integrated in the MDP, $$ M=(S, A, T, \gamma, R)$$ , by modifying its reward function, $$ R$$ , as $$ R' = R + F $$ where $$ F: S \times A \times S \rightarrow \mathbb{R}$$ is a bounded real-valued function. If the original MDP would have given a reward $$ R(s,a,s')$$ for transitioning from $$ s$$ to $$ s'$$ on an action $$ a$$ , then the new MDP gives the reward $$ R'(s,a,s')=R(s,a,s')+F(s,a,s') $$ . This encourages the learner to take the action $$ a$$ in some set of states $$ S_0$$ , leading it faster to the optimal solution.
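As a minimal sketch of $$ R' = R + F $$ , the snippet below wraps a base reward function with a shaping term. The one-dimensional corridor, the `GOAL` position and the `progress_bonus` function are hypothetical and only serve to illustrate the composition; they are not the shaping functions used in our experiments.

```python
def shape(reward, bonus):
    """Return R'(s, a, s') = R(s, a, s') + F(s, a, s')."""
    def shaped(s, a, s_next):
        return reward(s, a, s_next) + bonus(s, a, s_next)
    return shaped


GOAL = 4  # hypothetical goal cell in a 1-D corridor with cells 0..4


def base_reward(s, a, s_next):
    # Sparse original reward R: -1 per step, +20 on reaching the goal.
    return 20.0 if s_next == GOAL else -1.0


def progress_bonus(s, a, s_next):
    # Bounded shaping term F: +0.5 when the move reduces the distance to GOAL.
    return 0.5 if abs(GOAL - s_next) < abs(GOAL - s) else 0.0


shaped_reward = shape(base_reward, progress_bonus)
```

The agent then learns from `shaped_reward` instead of `base_reward`, receiving feedback on every step that makes progress rather than only at the end of the episode.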

Restraining Bolts

A restraining bolt is a device that restricts an agent's actions when connected to its system, with the aim of limiting them to a set of desired behaviors. The restraining bolt carries its own representation of the world, in addition to the one that the agent already has, and the agent has to conform as much as possible to the restraining specifications. The link between the two representations is established by studying the problem in the Reinforcement Learning setting.

In an RL task with a restraining bolt:

The learning agent is modeled by the MDP: $$ M_{ag} = \langle S, A, Tr_{ag}, R_{ag} \rangle$$
The restraining bolt is modeled as $$ RB = \langle L, \{(\varphi_i, r_i)\}^m_{i=1} \rangle$$ , where $$ L$$ is the set of possible fluents and $$ \{(\varphi_i, r_i)\}^m_{i=1}$$ is the set of restraining specifications, each pairing an LTL$$ _f$$ formula $$ \varphi_i$$ with a reward $$ r_i$$ .

Automata as Reward Shaping

A Reward Machine (RM) specifies which reward function should be used to produce the reward signal, given the state labels the agent has encountered so far. The idea is to approximate the cumulative discounted reward obtainable from any RM state by treating the RM itself as an MDP. Value iteration on this MDP then yields, for each RM state, the value of the optimal policy.
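Treating the RM as a small deterministic MDP, the value of each RM state can be computed with plain value iteration. The sketch below assumes a hypothetical three-state machine (`u0`, `u1`, `u2`) in which every outgoing edge can be chosen freely; the function name `rm_value_iteration` and the discount `gamma=0.9` are illustrative choices, not taken from our implementation.

```python
def rm_value_iteration(transitions, gamma=0.9, tol=1e-9):
    """Value iteration over a reward machine viewed as a deterministic MDP.

    transitions maps each RM state to a list of (next_state, reward) edges;
    a state with no outgoing edges is terminal and keeps value 0.
    """
    values = {u: 0.0 for u in transitions}
    while True:
        delta = 0.0
        for u, edges in transitions.items():
            if not edges:
                continue
            best = max(r + gamma * values[v] for v, r in edges)
            delta = max(delta, abs(best - values[u]))
            values[u] = best
        if delta < tol:
            return values


# Hypothetical RM: u0 = nothing done, u1 = passenger picked up, u2 = delivered.
toy_rm = {
    "u0": [("u0", 0.0), ("u1", 0.0)],
    "u1": [("u1", 0.0), ("u2", 1.0)],
    "u2": [],
}
toy_values = rm_value_iteration(toy_rm)
```

In this toy machine the values grow as the RM state gets closer to the rewarding transition, which is the kind of signal a shaping term can exploit.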

Temporal Logic

Temporal Logic (TL) is a logic language that offers time-related rules and operators to deal with propositions whose truth varies with time. Along with the classical connectives $$ \neg, \land, \lor$$ , its syntax introduces two modal operators:

The Necessarily operator, $$ \square \varphi$$ : states that $$ \varphi$$ is true in all the possible worlds,
The Possibly operator, $$ \diamond \varphi$$ : states that $$ \varphi$$ is true in at least one possible world (in the temporal reading, $$ \varphi$$ is true now or will be true at some point).

These semantics also lend themselves to simple graph-based visual representations.

LTL$$ _f$$

Linear Temporal Logic (LTL) is an extension of TL characterized by a wider set of operators:

The Globally operator, $$ G \varphi$$ : like $$ \square \varphi$$ , states that $$ \varphi$$ is true now and at all future time points,
The Finally operator, $$ F \varphi$$ : like $$ \diamond \varphi$$ , states that $$ \varphi$$ is true now or will be true at some future time point,
The Next operator, $$ X \varphi$$ or $$ \circ \varphi$$ : states that $$ \varphi$$ will be true at the next time point,
The Until operator, $$ \varphi_1$$ $$ \cal U$$ $$ \varphi_2$$ : states that $$ \varphi_2$$ will eventually be true and that $$ \varphi_1$$ is true at every time point before that.

Finally, the complete description of the evolution over time, consisting of a sequence of propositional interpretations, one for each time point, is called a trace, and the truth of a formula is evaluated over that trace. In the standard LTL semantics traces are infinite, whereas LTL$$ _f$$ keeps the same syntax but interprets formulas over finite traces.
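The finite-trace reading of the operators above can be illustrated with a small recursive evaluator. This is a sketch for intuition only: formulas are encoded as nested tuples (a hypothetical encoding, not the syntax of any particular library), a trace is a list of sets of true propositions, and no attempt is made at efficiency.

```python
def holds(formula, trace, i=0):
    """Check whether `formula` holds at position i of a finite, non-empty trace."""
    op = formula[0]
    n = len(trace)
    if op == "atom":
        return formula[1] in trace[i]
    if op == "not":
        return not holds(formula[1], trace, i)
    if op == "and":
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == "or":
        return holds(formula[1], trace, i) or holds(formula[2], trace, i)
    if op == "X":  # Next: a next position must exist on a finite trace
        return i + 1 < n and holds(formula[1], trace, i + 1)
    if op == "F":  # Finally: true at some position from i onward
        return any(holds(formula[1], trace, j) for j in range(i, n))
    if op == "G":  # Globally: true at every position from i to the end
        return all(holds(formula[1], trace, j) for j in range(i, n))
    if op == "U":  # Until: right side eventually true, left side true before it
        return any(
            holds(formula[2], trace, j)
            and all(holds(formula[1], trace, k) for k in range(i, j))
            for j in range(i, n)
        )
    raise ValueError(f"unknown operator: {op!r}")


a, b = ("atom", "a"), ("atom", "b")
example_trace = [{"a"}, {"a"}, {"b"}]
```

For instance, `holds(("U", a, b), example_trace)` is true because `a` holds until `b` appears in the last step, while `holds(("G", a), example_trace)` is false because `a` is missing there.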

DFA

A Deterministic Finite Automaton (DFA) is an abstract representation shaped as a finite state machine that takes a string of symbols as input and, running through a sequence of states, may accept or reject it.
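A DFA can be simulated in a few lines. The example below is a hand-written automaton, not one generated from a formula: it accepts any sequence of the hypothetical symbols `move`, `pickup` and `dropoff` in which a `pickup` is eventually followed by a `dropoff`.

```python
class DFA:
    """Minimal DFA: a start state, accepting states and a total transition table."""

    def __init__(self, start, accepting, delta):
        self.start = start
        self.accepting = set(accepting)
        self.delta = delta  # maps (state, symbol) -> next state

    def accepts(self, word):
        state = self.start
        for symbol in word:
            state = self.delta[(state, symbol)]
        return state in self.accepting


# States: 0 = waiting for pickup, 1 = passenger on board, 2 = delivered (absorbing).
delta = {
    (0, "move"): 0, (0, "pickup"): 1, (0, "dropoff"): 0,
    (1, "move"): 1, (1, "pickup"): 1, (1, "dropoff"): 2,
    (2, "move"): 2, (2, "pickup"): 2, (2, "dropoff"): 2,
}
pickup_then_dropoff = DFA(start=0, accepting={2}, delta=delta)
```

In an RL setting, the current DFA state can be tracked alongside the environment state, so that the agent knows how much of the goal has already been achieved.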

STRIPS

The STanford Research Institute Problem Solver (STRIPS) is a problem solver that aims to find the optimal composition of operators that transforms a given initial world model/state into one that satisfies some stated goal condition, if reachable. The operators are the basic elements from which a solution is built. The problem space for STRIPS is defined by a quadruple $$ \Pi = \langle P, O, I, G \rangle$$ where:

$$ P$$ is the set of conditions,
$$ O$$ is the finite set of available operators and their effects on world models,
$$ I \subseteq P$$ is the initial world model,
$$ G \subseteq P$$ is the goal statement.

The optimal plan $$ \pi_{opt}$$ is the one with the lowest possible cost. The cost of a single step is defined as $$ c(x_t,u_t)$$ , where $$ x_t$$ and $$ u_t$$ are the Markovian system state and the action at time step $$ t \in [1, T]$$ , respectively, and the cost of a plan is the sum of its step costs. Optimality means that there exists no plan $$ \pi'$$ such that $$ c(\pi') < c(\pi_{opt})$$ .

It is possible to translate the STRIPS domain into an equivalent LTL$$ _f$$ formula.

Once the STRIPS domain is expressed as an LTL$$ _f$$ formula, the formula can be converted into a DFA to obtain the corresponding automaton.
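To make the quadruple $$ \langle P, O, I, G \rangle$$ concrete, the sketch below models operators with preconditions, add effects and delete effects, and searches for a plan breadth-first. With a unit cost per operator, breadth-first search returns a plan of minimal length, which is optimal under that assumption. The toy taxi domain and all condition and operator names are hypothetical and much simpler than the domain used in our experiments.

```python
from collections import deque
from typing import NamedTuple


class Operator(NamedTuple):
    name: str
    pre: frozenset     # conditions that must hold before applying
    add: frozenset     # conditions made true
    delete: frozenset  # conditions made false


def apply_op(state, op):
    return (state - op.delete) | op.add


def plan(initial, goal, operators):
    """Breadth-first search: shortest operator sequence from `initial` to `goal`."""
    start = frozenset(initial)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if goal <= state:
            return path
        for op in operators:
            if op.pre <= state:
                nxt = apply_op(state, op)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [op.name]))
    return None  # goal not reachable


fs = frozenset
operators = [
    Operator("drive_to_passenger", fs({"at_destination"}), fs({"at_passenger"}), fs({"at_destination"})),
    Operator("drive_to_destination", fs({"at_passenger"}), fs({"at_destination"}), fs({"at_passenger"})),
    Operator("pickup", fs({"at_passenger", "waiting"}), fs({"in_taxi"}), fs({"waiting"})),
    Operator("dropoff", fs({"at_destination", "in_taxi"}), fs({"delivered"}), fs({"in_taxi"})),
]
toy_plan = plan({"at_destination", "waiting"}, fs({"delivered"}), operators)
```

Here `toy_plan` is the four-step sequence that drives to the passenger, picks them up, drives back and drops them off.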

Our experiments

We ran two categories of experiments:

Translation of non-trivial LTL$$ _f$$ formulae to a DFA to use as a reward machine in an RL task
Translation of a complete STRIPS description of an environment to a DFA to use as a reward machine in an RL task

In the first category we tested the following formulae, ordered by increasing complexity:

Base environment goal: $$ a \, \mathrm{U} \, (b \, \mathrm{U} \, c) $$ . In this goal we first check that there is a taxi anywhere in the map ($$ a$$ ), then that the passenger is picked up by the taxi ($$ b$$ ) and finally that the destination is reached ($$ c$$ ). Note that this goal is equivalent to the environment's own goal, which we used as a baseline to ensure the correctness of our solution.