We present here our work investigating the effects of Reward Shaping on a Hierarchical Reinforcement Learning (HRL) environment while specifying increasingly complex temporal goals defined as LTL$$_f$$ formulas. Thanks to the flexibility of the LTL$$_f$$ language, it is trivial to define such goals and, thanks to existing Python libraries, convert them to DFA automatons that work best with RL tasks.
We also investigate the possibility to discard the LTL$$_f$$ goals and instead redefine the environment using STRIPS, which could later be extended to accept a multitude of temporal goals implicitly. "
Reinforcement Learning (RL) is a branch of Machine Learning (ML), one of the most prominent and established field of study in the Artificial Intelligence (AI) sector.
AI algorithms that belong to this sub-class are designed to improve their capabilities of performing some task with experience: learning . In general, RL aims at learning an optimal behavior function, called policy, $$ \pi : S \rightarrow A$$ , given a set of states, actions and rewards, $$ D = {(\langle s_0, a_1, r_1, s_1, ..., a_n, r_n, s_n\rangle_i)^n_{i=1}}$$, in order to maximize the final cumulative reward provided by a reward function.
A reinforcement learning problem may be modeled as a Markov Decision Process (MDP).
A MDP model can be defined as:
$$ MDP= \langle S, A, \delta, r \rangle$$ with:
- $$ S$$ is a finite set of states, they may be fully or partially observable
- $$ A$$ is a finite set of actions
- $$ \delta$$ is a transition function, it may be deterministic, non-deterministic or stochastic
- $$ r: S \times A \times S \rightarrow \mathbb{R}$$ is a reward function
Taxi-v3 is one among all the proposed tasks in the set of environments provided by OpenAI, a prominent AI research and deployment company that offers a powerful toolkit, named Gym, for developing and testing reinforcement learning algorithms. This library consists in a collection of different challenges, called environments, which covers all the main known problems and weaknesses that afflicted, or still afflict, the RL field of study.
The task, inspired by the taxi driver job, is relatively simple and takes place in a $$ 5 \times 5$$ grid with 4 marked tiles called R, G, Y, B. At each round, called episode, two of these positions are randomly chosen as passenger and destination locations, while another random one is picked along the whole grid as taxi's starting location. The goal of the agent is to bring the passenger to the destination, trying to maximize the final total reward. In case of coincidence of passenger and destination tiles, the agent still has to pick up the passenger and drop him off in place. To accomplish this objective, the driver can freely move inside the grid in any orthogonal direction with the exception of trespassing the outer edges or some fixed walls between tiles.
The environment is represented at each time step by its state: considering all the possibilities made by 25 taxi positions, 5 passenger positions, on taxi included, and 4 destinations, there are in total 500 states that are indexed with an encoding of discrete values in the interval $$ [0, 499]$$ .
The state is updated only by the agent's actions. There are 6 possible actions the agent can perform at any step of the episode, which are:
- Move South,
- Move North,
- Move East,
- Move West,
- Pick up the passenger and
- Drop off the passenger.
In order to train the agent, a $$ -1$$ per-step reward is applied, except for the Pick up and Drop off actions which, if done illegally, costs $$ -10$$ . The episode ends if the agent performs the maximum number of fixed steps or if it manages to drop off the passenger at the destination obtaining a $$ +20$$ reward.
We now explore Goal-Directed Exploration with a Reinforcement Learning Algorithm. This means that an agent has to learn a policy to reach a goal state in an initially unknown state space. Once the reinforcement learning problem reaches the goal state the agent has to find some path to a goal state in an initially unknown (or partially known) state space with no need to find the shortest path. The idea is using a set of temporal goals in order to reach the final goal. It is possible to consider a set of temporal goals expressed in LTL$$ _f$$ notation.
Reward shaping is a method for engineering a reward function in order to provide more frequent feedback on appropriate behaviors. The reward shaping in RL consist of supplying additional rewards to a learning agent to guide its learning process more efficiently, helping it in achieving the optimal policy faster. The reward shaping function can be integrated in the MDP, $$ M=(S, A, T, \gamma, R)$$ , modifying its reward function, $$ R$$ , as $$ R' = R + F $$ where $$ F: S \times A \times S \Rightarrow \mathbb{R}$$ is a bounded real-valued function. Then if the original MDP would have received a reward $$ R(s,a,s')$$ for transitioning from $$ s$$ to $$ s'$$ on an action $$ a$$ then in the new MDP it would receive the following reward: $$ R'(s,a,s')=R(s,a,s')+F(s,a,s') $$ This process will encourage the learner to take the action $$ a$$ in some set of states $$ S_0$$ leading faster to the optimal solution.
A restraining bolt is a device that restricts an agent’s actions when connected to its system. The restraining bolt has the aim of limiting actions to a set of desired behaviors. The restraining bolt gives an additional representation of the world with respect to the one that the agent already has. The agent has to conform as much as possible to the restraining specifications. The link between the two representation is given by studying this problem in the Reinforcement Learning scope.
In a RL task with restraining bolt:
- The learning agent is modeled by the MDP: $$ M_{ag} = \langle S, A, T_{r_{ag}}, R_{ag} \rangle$$
- The restraining bolt is modeled as $$ RB = \langle L,;{(\phi_i,; r_i)}^m_{i=1} \rangle$$ , where $$ L$$ is the set of possible fluents and the set is the restraining specifications represented in LTL$$ _f$$ formulas.
A Reward Machine indicates what reward function should be used to obtain a reward signal given the states label that the agent met up until now. The idea is to approximate the cumulative discounted reward in any RM state by treating the RM itself as an MDP. The value iteration would be the maximum value of the optimal policy.
Temporal Logic (TL) is a logic language that offers some time-related rules and operators to deal with propositions in which truth varies with time. Its complex syntax involves the introduction, along with the classical connectives $$ \neg, \land, \lor$$ , of two shared operators:
- The Necessarily operator, $$ \square \varphi$$ : states that $$ \varphi$$ is true in all the possible worlds,
- The Possibly operator, $$ \diamond \varphi$$ : states that $$ \varphi$$ is true or will be true in all the possible worlds.
This semantic representation allows also simple graphical representations based on graphs.
Linear Temporal Logic (LTL) is an extention of TL which is characterized by the formulation of a wider set of operators:
- The Globally operator, $$ G \varphi$$ : like $$ \square \varphi$$ states that $$ y$$ is true now and in all the future time points,
- The Finally operator, $$ F \varphi$$ : like $$ \diamond \varphi$$ states that $$ y$$ is true now or it will at some time point in the future,
- The Next operator, $$ X \varphi$$ or $$ \circ \varphi$$ : states that $$ \varphi$$ will be true in the next time point,
- The Until operator, $$ \varphi_1$$ $$ \cal U$$ $$ \varphi_2$$ : states that $$ \varphi_1$$ is true until $$ \varphi_2$$ is true.
Finally, the complete description of evolution over time, consisting in a sequence of propositional interpretations, one for each time point, is called trace and a formula's truth is evaluated over that trace. For the standard LTL semantics, traces are infinite.
A Deterministic Finite Automaton (DFA) is an abstract representation shaped as a finite state machine that takes a string of symbols as input and, running through a sequence of states, may accept or reject it.
More in depth, a DFA is defined as:
with:
- $$ \Sigma$$ : a finite and non-empty alphabet, where a word is an element of $$ \Sigma^\ast$$ (i.e. a finite sequence of symbols $$ a_0 ... a_n, a_i \in \Sigma$$ ),
- $$ S$$ : a finite and non-empty set of states,
- $$ s^0 \in S$$ : the initial state,
- $$ F \subseteq S$$ : the set of accepting states,
- $$ \rho: S \times \Sigma \rightarrow S$$ : a transition function.
The STanford Research Institute Problem Solver (STRIPS) is a problem solver that aims to find the optimal composition of operators that transforms a given initial world model/state into one that satisfies some stated goal condition, if reachable. The operators are the basic elements from which a solution is built. The problem space for STRIPS is defined by a quadruple $$ \Pi = \langle P, O, I, G \rangle$$ where:
- $$ P$$ is the set of conditions,
- $$ O$$ is the finite set of available operators and their effects on world models,
- $$ I \subseteq P$$ the initial world model,
- $$ G \subseteq P$$ is the goal statement.
The optimal path $$ \pi_{opt}$$ that can be obtained is the path with the lowest possible cost. The cost is defined as $$ c(x_t,u_t)$$ , where $$ x_t$$ and $$ u_t$$ are the Markovian system state and action at the time step $$ t \in [1, T]$$ , respectively. This means that there exists neither path nor plan $$ \pi'$$ such that $$ c(\pi') < c(\pi_{opt})$$ .
It is possible to translate the STRIPS domain into an equivalent LTL$$ _f$$ formula.
Once we have expressed the STRIPS domain in a LTL$$ _f$$ formula, it is possible to convert such formula in DFA to obtain the corresponding automaton.
We decided to test two categories of experiments:
- Translation of non-trivial LTLf formulae to a DFA Automaton to use as reward machine in a RL task
- Translation of a complete STRIPS description of an environment to a DFA Automaton to use as reward machine in a RL taks
In the first category we tested the following formulae, ordered by increasing complexity:
- Base environment goal: $$ a; \mathrm{U}; (b; \mathrm{U}; c) $$ . In this goal we check first that there is a taxi anywhere in the map ($$ a$$ ), then that the passenger is picked up by the taxi ($$ b$$ ) and finally that the destination is reached ($$ c$$ ). Note that this is the equivalent goal of the environment, which we used as baseline to ensure the correctness of our solution.
The generated DFA for this task is reported in the following figure:
Fig.1: The Base environment goal generated DFA- Pass through center: $$ a; \mathrm{U}; (b; \mathrm{U}; (c; \mathrm{U}; d)) $$ . In this goal first that there is a taxi anywhere in the map ($$ a$$ ), then that the passenger is picked up by the taxi ($$ b$$ ), that the taxi passes through the center of the map ($$ c$$ ) and finally that the destination is reached ($$ d$$ ).
The generated DFA for this task is reported in the following figure:
Fig.2: The Through center environment goal generated DFA- Pass through 1 checkpoint: $$ a; \mathrm{U}; (b; \mathrm{U}; (c; \mathrm{U}; d)) $$. This goal is similar to the previous one, but instead of checking if the taxi passes through the center, we check that it passes through a different checkpoint before reaching the destination.
The generated DFA for this task is reported in the following figure:
Fig.3: The Through 1 checkpoint goal generated DFA- Pass through 2 checkpoints: $$ a; \mathrm{U}; (b; \mathrm{U}; (c; \mathrm{U}; (d; \mathrm{U}; e))) $$. This goal is similar to the previous one, but we check that it passes through two different checkpoints before reaching the destination.
The generated DFA for this task is reported in the following figure:
Fig.4: The Through 2 checkpoints goal generated DFA- Pass through 3 checkpoints: $$ a; \mathrm{U}; (b; \mathrm{U}; (c; \mathrm{U}; (d; \mathrm{U}; (e; \mathrm{U}; f)))) $$. This goal is similar to the previous one, but we check that it passes through two different checkpoints before reaching the destination.
The generated DFA for this task is reported in the following figure:
Fig.5: The Through 3 checkpoints goal generated DFAFor all of these, the training process ran for 2500 steps on consumer level hardware.
For the second category, we defined the STRIPS-equivalent problem. This was then used to generate the LTL$$ _f$$ formula that would, in turn, generate the DFA automaton we needed as temporal goal. As a test, we implemented only the base environment in STRIPS, so the environment goal and the temporal goal coincided. Using STRIPS formalism makes it easier to define even more complex temporal goals and it is able to encapsulate a much broader variety of goals. In our case, the definition of the base environment goal actually generated a DFA that represented all possible environment runs (since, as we explained earlier, the environment is randomly initialized).
We developed the experiments using Python 3.7. We employed the following libraries during our development:
- NumPy for mathematical computations,
- Logaut defines most of the temporal logic paradigm we used,
- Lydia is used as a backend during the LTL$$ _f$$ to DFA translation and
- TempRL offers a high-level API for temporal goal (TG) based reinforcement learning.
When exploring the effects of reward shaping on different TG tasks, we used TempRL to wrap the base Gym environment in a TG-oriented environment that contained the DFA automaton of the task expressed in LTL$$ _f$$ logic. With such an environment it was possible to compute both observations for the agent and rewards that reflected not only the base environment but the DFA as well. In particular, we have that
- The observations are tuple of $$ (env_{state},; DFA_{state})$$ .
- The reward is either $$ env_{reward} + DFA_{reward}$$ when reward shaping is applied of just $$ env_{reward}$$ for all transitions that do not result in a goal state to be reached, otherwise $$ env_{reward} + DFA_{task; reward}$$ .
When testing the STRIPS formalism to define the temporal goals, we first defined the STRIPS problem. We then converted it into an LTL$$ _f$$ formula, making sure that
- we defined a
startfluent that is True in the initialization and is False when the goal is reached, - all negated fluents are computed for any possible initial condition, as we noticed that leaving them as don't care resulted in an explosion of possible DFA states, and
- all traces end in a
sinkstate after the goal is satisfied, otherwise there could be duplicated states that differ only in whether the goal has been reached or not. Since the states in the DFA generated in this way had no loops over themselves, we introduced an object in the TempRL library that would control whether or not to apply a step on the DFA.
For both experiments, the agent itself was trained via tabular Q-Learning. Exploration was ensured via an
We provide the code for our solution at https://github.com/gallorob/ra_project.
We report here the results of our experiments. Each experiment consists in training the agent on the assigned LTL$$ _f$$ formula and checking if reward shaping has a significant impact on performances. For each experiment we tracked the episodic rewards during both training and testing.
We found that the agent reliably clears the environment, obtaining the temporal goal reward consistently during training and during testing, where the episodic reward is higher than the temporal goal reward (set at 100) and varies only on the number of steps taken by the agent.
Fig.6: Train and test rewards for the base environment goalWhen reward shaping is applied, we can see that the agent converges more rapidly to a stable high reward during training, though it also increases the frequency of failed episodes. During testing, however, the learned Q values are good enough that no episode failed.
Fig.7: Train and test rewards for the base environment goal with reward shapingThis modified temporal goal is still easily solved by the agent during training and testing.
Fig.8: Train and test rewards for the passing through the center goalAll of the tests succeed
Fig.9: Train and test rewards for the passing through the center goal with reward shapingAs the taxi almost always has to move through the center tile, it does not surprises us that this goal is achieved easily.
This modified temporal goal is similar to the previous one and as such is still easily solved by the agent during training, though in testing it is still possible to see some failed episodes.
Fig.10: Train and test rewards for the passing through 1 checkpoint goalSimilarly to the previous goal, the addition of the reward shaping effect actually hinders performances: the training is still solved but the average episodic reward is much lower and some of the tests episode actually fail.
Fig.11: Train and test rewards for the passing through 1 checkpoint goal with reward shapingThe episodic rewards curve plateaus during training so we do not believe additional training steps would improve the agent's performances.
This modified temporal goal is more complex than the previous one but the agent is still able to obtain high rewards during training and testing.
Fig.12: Train and test rewards for the passing through 2 checkpoints goalWe note however that some of the tests fail or have a final rewards that consists only on the base environment's reward (ie: it does not solve the temporal goal). The higher
The addition of the reward shaping leads to faster convergence during training but some of the tests still fail.
Fig.13: Train and test rewards for the passing through 2 checkpoints goal with reward shapingHowever, with reward shaping there is no episode in the test that is completed without achieving the temporal goal as well.
This modified temporal goal is more complex than the previous one but the agent is still able to obtain high rewards during training and testing.
Fig.14: Train and test rewards for the passing through 3 checkpoints goalWhile sometimes the agent solves the temporal goal along with the base environment goal, it clearly does not solve the entire environment. This result may change with many, many more time steps provided to allow more exploration to take place, though we believe Q-Learning with a simple
The addition of the reward shaping does not improve the results.
Fig.15: Train and test rewards for the passing through 3 checkpoints goal with reward shapingIn fact, during test we see only a few episodes where the base environment goal is solved, whereas no episodes solves the temporal goal.
We report the generated graph of the STRIPS problem.
Fig.14: The complete Taxi environment described as a STRIPS problemWe ran the training for just 1500 steps as it rapidly converged to a solution.
Fig.17: Train and test rewards for the STRIPS-defined temporal goal.The rewards graph is similar to that of the base environment goal with reward shaping, reported in Fig.7, but with less bad rewards during training. We think this is due to the smaller value of
To qualitatively measure possible overhead during training with the STRIPS approach, we evaluated a simple Q-Learning approach with the same hyper-parameters but no temporal goals.
Fig.18: Train and test rewards for the base environment with no temporal goal.It is clear that the agent in both environments learns roughly at the same rate, indicating the there is no apparent overhead between the two approaches.
- OpenAI: Ai research and deployment company
- OpenAI: Gym, reinforcement learning toolkit
- Dietterich, T.G.: Hierarchical reinforcement learning with the maxq value functiondecomposition. Journal of Artificial Intelligence Research13(1) (2000)
- Hengst, B. In: Hierarchical Reinforcement Learning. Springer US, Boston, MA(2010) 495–502
- Koenig, S., Simmons, R.G.: The effect of representation and knowledge on goal-directed exploration with reinforcement-learning algorithms. Machine Learning22(1) (Mar 1996) 227–250
- E., W. In: Hierarchical Reinforcement Learning. Springer US, Boston, MA (2011)
- Andrew Y Ng, D.H., Russell, S.: Policy invariance under reward transformations:Theory and application to reward shaping (1999)
- Giacomo, G.D., Iocchi, L., Favorito, M., Patrizi, F.: Foundations for restrainingbolts: Reinforcement learning with ltlf/ldlf restraining specifications (2019)
- Camacho, A., Toro Icarte, R., Klassen, T.Q., Valenzano, R., McIlraith, S.A.: Ltland beyond: Formal languages for reward function specification in reinforcementlearning. In: Proceedings of the Twenty-Eighth International Joint Conferenceon Artificial Intelligence, IJCAI-19, International Joint Conferences on ArtificialIntelligence Organization (7 2019) 6065–6073
- Pnueli, A.: The temporal logic of programs. In: 2013 IEEE 54th Annual Symposiumon Foundations of Computer Science, Los Alamitos, CA, USA, IEEE ComputerSociety (oct 1977) 46–57
- De Giacomo, G., Vardi, M.Y.: Linear temporal logic and linear dynamic logic onfinite traces. In: Proceedings of the Twenty-Third International Joint Conferenceon Artificial Intelligence. IJCAI ’13, AAAI Press (2013) 854–860
- Fikes, R.E., Nilsson, N.J.: Strips: A new approach to the application of theoremproving to problem solving. Artificial Intelligence2(3) (1971) 189–208
- Harris, C.R., Millman, K.J., van der Walt, S.J., Gommers, R., Virtanen, P., Cour-napeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N.J., Kern, R., Picus, M.,Hoyer, S., van Kerkwijk, M.H., Brett, M., Haldane, A., del Río, J.F., Wiebe, M.,Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Ab-basi, H., Gohlke, C., Oliphant, T.E.: Array programming with NumPy. Nature585(7825) (September 2020) 357–362
- Favorito, M.: Logaut (v0.1.1) (2021)
- Favorito, M., Fuggitti, F.: Lydia (v0.1.2) (2021)
- Favorito, M.: Temprl (v0.3.0) (2021)
We embed here the PDFs of both the report:
and the presentation:
and the video presentation:
<iframe width="630px" height="500px" src="https://www.youtube.com/embed/0iTsZmkfpoY" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>







































