# URAP RL tutorial

This tutorial explores the basics of how to implement and train RL algorithms using the `rl_eco_benchmarks` package. 


## RL Basics

Reinforcement learning algorithms have two conceptual components: an Agent and an Environment. 
The Agent interacts with the Environment with some goal, and the Environment reacts to the Agent's actions.
(In Monday Sept. 18th's meeting, I called the Environment a 'system'. Here I'll switch back to the more common term, 'Environment'.)

To recap the reading materials: In each time-step,
1. the agent **observes** the environment---that is, it receives some information about the state of the environment,
2. the agent **acts** on the environment,
3. the environment changes its state accordingly, and the agent receives a **reward** that depends on the action taken, as well as on the corresponding change of environmental state.

The basic unit of an RL algorithm is an *Episode*, a sequence of time-steps with a pre-specified maximum length. The goal of the agent is to be able to “play” episodes with high rewards, and especially to avoid episodes with low reward. 
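As a rough sketch of the observe/act/reward loop and of what "playing an episode" means, here is some toy code. It is not the interface we will use later on: the names `ToyEnvironment`, `ToyAgent` and `play_episode` are made up for illustration, and the environment is just a counter that rewards the agent whenever it picks the action 1.

```python
import random


class ToyEnvironment:
    """Hypothetical toy environment: the state is a running total of the actions."""

    def __init__(self):
        self.state = 0

    def observe(self):
        # the agent receives some information about the state
        return self.state

    def react(self, action):
        # the environment changes its state and hands back a reward
        self.state += action
        return float(action)


class ToyAgent:
    """Hypothetical agent that ignores its observation and acts at random."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def act(self, observation):
        return self.rng.choice([0, 1])


def play_episode(env, agent, max_steps=10):
    """Play one episode of at most `max_steps` time-steps and return its total reward."""
    total_reward = 0.0
    for _ in range(max_steps):
        observation = env.observe()        # 1. observe
        action = agent.act(observation)    # 2. act
        total_reward += env.react(action)  # 3. environment reacts, agent gets a reward
    return total_reward


episode_reward = play_episode(ToyEnvironment(), ToyAgent())
print(episode_reward)
```

A learning algorithm would replace the random choice in `ToyAgent.act` with a rule that is adjusted over many episodes so that the total reward goes up.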

Alright, so let's go to an example.


# RL for Fisheries

**Problem context.** The classic example we will work with is the one where 1. the Agent is a fishery that wants to engage in sustainable fishing over a long period of time, and 2. the Environment is a dynamical model for the fish population size. (This model, we will see later on, can include other non-fished populations with which our species interacts.) 

**Time-steps.** Our time-steps usually represent a year: at the beginning of the fishing year, the agent decides how much fishing it will allow throughout the year, and then we simulate the consequences of that decision. Our episodes will typically have a maximum length of 100 or 200 years, although we typically include a condition that the episode “ends early” if there is a near-extinction. 

**Dynamical model.** Our dynamical model will be a *discrete time population dynamics model*, for example something of the following form:
$$
N_{t+1} = N_t + f(N_t, t, a_t),
$$
where $N_t$ is the population size at time-step $t$, $a_t$ is the action taken by the Agent in that time-step, and $f$ is some function.
Notice that we (optionally) include a time dependence in $f$---this can be useful when trying to model the effects of e.g. climate change or habitat loss due to factors external to the ecosystem itself.
We will return to this point in a bit.

That equation is still a bit abstract, so let's get more concrete.
A classic model used in fishery science is a model of *logistic growth*, of the form
$$
N_{t+1} = N_t + r \times N_t \times (1-N_t/K) - a_t,
$$
where $r$ and $K$ are parameters of the model ($r$ is sometimes called a *reproduction rate*, as it regulates the rate at which the population grows, and $K$ is the *carrying capacity*, which gives an upper limit to how much the population can grow).
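To get a feel for the logistic model, here is a minimal sketch that iterates the equation above with a constant harvest. The function names `logistic_step` and `simulate`, as well as the values $r = 0.3$ and $K = 1$, are assumptions chosen only for illustration.

```python
def logistic_step(N, a, r=0.3, K=1.0):
    """One time-step of N_{t+1} = N_t + r * N_t * (1 - N_t / K) - a_t."""
    return N + r * N * (1 - N / K) - a


def simulate(N0, harvest, n_steps, r=0.3, K=1.0):
    """Apply the same harvest every year and return the list [N_0, N_1, ..., N_n]."""
    trajectory = [N0]
    for _ in range(n_steps):
        trajectory.append(logistic_step(trajectory[-1], harvest, r, K))
    return trajectory


# without any fishing, the population grows towards the carrying capacity K
no_fishing = simulate(N0=0.5, harvest=0.0, n_steps=50)
print(no_fishing[-1])

# with a moderate constant harvest, the population settles at a lower level
some_fishing = simulate(N0=0.5, harvest=0.05, n_steps=50)
print(some_fishing[-1])
```

Note that this sketch does not stop the population from becoming negative if the harvest is larger than the population. A full environment would need to handle that case.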

**Rewards.** The Agent *fishes out* a mass $a_t$ of fish. We will use a simple way of modelling the economic benefits of having a high harvest: the agent receives a reward of $a_t$. This means that the Agent wants to fish as much as possible, but there is a balance here: because the time window of an episode is long (100-200 years), it can make more sense for the Agent to fish sustainably for a long time than to fish intensively for a short time.

We include an extra component of the reward function: if at any time-step $N_t < N_{\text{thresh.}}$ (the population falls below some threshold value), then the episode ends immediately and a penalty of $-200/t$ is added to the episode reward. This extra component encodes the dire consequences that a species extinction could have for the ecosystem as a whole (which could affect the economy in ways beyond the loss of the ability to fish $N$). This term, moreover, helps RL algorithms converge to sustainable solutions faster.
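The reward rule just described could be sketched as follows. The function name `reward_and_termination` and the threshold value `0.05` are hypothetical, and the sketch makes two assumptions that the text leaves open: the threshold is checked on the population *after* the harvest, and time-steps are counted from $t = 1$ (so that $-200/t$ is always defined).

```python
def reward_and_termination(N_next, a, t, N_thresh=0.05):
    """Return (reward, terminated) for time-step t, where t is assumed to start at 1."""
    reward = a                      # the harvest itself is the reward
    terminated = N_next < N_thresh  # near-extinction ends the episode early
    if terminated:
        reward += -200 / t          # the earlier the collapse, the larger the penalty
    return reward, terminated


# healthy population: the reward is just the harvest
print(reward_and_termination(N_next=0.5, a=0.1, t=10))

# collapse in year 4: the harvest reward is dwarfed by the penalty -200/4 = -50
print(reward_and_termination(N_next=0.01, a=0.1, t=4))
```

Because the penalty shrinks as $t$ grows, a collapse late in the episode is punished less than an early one.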



# Python classes and objects

We will run our RL algorithms in Python. There are a couple of basic but key aspects of object-oriented programming (OOP) in Python that we'll need to cover for that.

In OOP, the code is organized around 'objects'. These objects have certain qualities (called *properties* of the object) and objects can also perform actions that process data to produce a result (called *methods* of the object). 

## How to create objects in your code?

To create an object within your code, you first need to write the code for the *class* of that object. This code lets the computer know which properties and which methods the objects of this class will have.

For example, let's make a class of objects which are a very simple type of chatbot. The chatbot has two properties: its name, and its general mood (namely, will it be mean or nice to you!). It also has two methods, both equally useless: the first one is a greeting, and the second one is a template answer to any question you give it.

This is what the code for that chatbot would look like:

In [21]:
class chatbot:
    def __init__(self, name, mood):
        """
        uses input name and mood to generate the chatbot object.

        args:
            self = don't worry about it, this is just standard python syntax for classes
            name = str
            mood = 'nice' or 'mean' (str)
        """

        # make sure that the mood input has one of the accepted values
        assert mood in ['nice', 'mean'], "'mood' variable must have value 'nice' or 'mean'!"

        # define the object properties based on input provided
        self.name = name
        self.mood = mood
    
    def greet(self):
        """ you always need to pass 'self' as input to object methods -- just standard Python syntax. """
        if self.mood == "nice":
            print(f"Hi, my name is {self.name}, it's so nice to chat with you! How can I help?")
        if self.mood == "mean":
            print(f"Ugh, do you even know who I am? I'm {self.name}, I don't have time for you.")
    
    def reply(self, question):
        """ question = str """
        if self.mood=="nice":
            print(f"Thank you so much for your question ({question}). I don't know the answer, but I hope you find out!")
        if self.mood=="mean":
            print(f"Lol, '{question}', you're laaame.")
    


Notice the first method that we coded in our class above, `__init__`. This is the method that 'sets up' the object when you create it. Namely, to create it, you will provide two inputs, 'name' and 'mood'. With these two inputs the computer knows which type of chatbot to create.

## Creating objects once the class is defined

Now that we have coded our class, we may instantiate it---that is, we may generate chatbot objects. This is the way we do it:

In [22]:
chatbot1 = chatbot(name="Felipe (the nice one)", mood="nice")
chatbot2 = chatbot(name="Felipe (the mean one)", mood="mean")

We can now access the properties of the chatbots and also call their methods in the following way:

In [23]:
chatbot1.name

'Felipe (the nice one)'

In [24]:
chatbot1.greet()

Hi, my name is Felipe (the nice one), it's so nice to chat with you! How can I help?


In [26]:
chatbot2.reply(question="How are you?")

Lol, 'How are you?', you're laaame.


## The 'self' argument...?

Notice that when I called the `reply` method, I just provided the question string as an input, *not* this mystery `self` input that was required for defining the method.

That's because the `self` argument is the object itself! When you call `chatbot1.greet()`, for example, the argument `self` is `chatbot1`. This 'quirk' of Python is a design choice: a method is just a function stored on the class, and when you call it through an object, Python passes that object in as the first argument. The details won't be relevant to us; it's just a nuisance to remember to include that `self` argument in all our class methods!
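A small sketch can make the role of `self` more tangible. The class `MiniBot` below is a hypothetical, stripped-down chatbot used only for this illustration; its `greet` method returns the string instead of printing it so that the two ways of calling it can be compared.

```python
class MiniBot:
    """Hypothetical stripped-down chatbot, used only to illustrate `self`."""

    def __init__(self, name):
        self.name = name

    def greet(self):
        return f"Hi, my name is {self.name}!"


bot = MiniBot(name="Felipe")

# calling the method on the object...
via_object = bot.greet()

# ...is the same as calling the function on the class and passing the object as `self`
via_class = MiniBot.greet(bot)

print(via_object == via_class)  # True
```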

# OpenAI gym classes for RL

The frameworks we will use for RL are based on so-called OpenAI gym environments (we will use `gymnasium`, the maintained successor of the original OpenAI `gym` package). These are classes that have a specific form: they need to have some standard properties and some standard methods, which we will cover below. This standardization is done so that RL optimization algorithms can easily communicate with your custom environment.

This is the standard template for a gym environment:

In [29]:
# we need to make sure that the gymnasium package is installed first
!pip install gymnasium

# now we import the gymnasium package
import gymnasium as gym

class fishing_env(gym.Env):
    """ we always 'inherit' from the template class gym.Env in the package gymnasium. 
        this sets up some basic functionality for our environment class. 
    """
    def __init__(self, *args):
        """ I used the argument '*args' for the moment cause I don't want to commit yet
            to which arguments we should provide to the environment.

            In this method we will need to define two properties:
            self.observation_space
            and
            self.action_space.

            We'll get to that later!
        """
        ...
    
    def reset(self, *, seed=42, options=None):
        """ this method is called to reset the state of the system to an initial value. 
            the next episode will start by using this initial value.

            this method should return a tuple (initial state, info), where info is a
            python dictionary (we will use the empty dict = {} for simplicity).

            don't worry about the '*, seed=42, options=None' arguments for now, they won't 
            change for our intents! We'll discuss these a bit though :)
        """
        ...
    
    def step(self, action):
        """ here, we tell the system how it should react to an action we take. 
        
        the output of this method should be a tuple of the form:
        (
            system state (array), 
            reward (float, as a result of the action performed), 
            terminated (boolean, whether the episode ended with this timestep), 
            truncated (boolean, whether the episode was cut off early, e.g. by a time limit; irrelevant for our purposes, we will just set it to be = False),
            info (a python dictionary, which we will set to be just the empty dict = {} for simplicity),
        )
        """
        ...



# Putting in some actual content in our environment!

Now we will put in some actual code inside of the methods of the `fishing_env` above. It will be mostly up to you all to complete the code so that the environment reproduces the behavior that I introduced at the start of the document (the dynamics of the system, the actions available, etc).

We will let the episode lengths be 200 time-steps.

As a hint, the following code shows what the `__init__` method should be.

In [1]:
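import numpy as np
from gymnasium import spaces

def __init__(self, init_state):
    self.t_max = 200
    self.init_state = init_state

    # reset() is expected to set self.state to the initial state.
    # (Gymnasium's reset() returns a tuple (observation, info), so we do not
    # assign its return value to self.state directly.)
    self.reset()

    self.observation_space = spaces.Box(
            np.array([0], dtype=np.float32),
            np.array([1], dtype=np.float32),
            dtype = np.float32, # use 32-bit floats for more efficiency on GPU computations! >:-)
            )

    self.action_space = spaces.Box(
            np.array([0], dtype=np.float32),
            np.array([1], dtype=np.float32),
            dtype = np.float32,
            )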


Notice that the `observation_space` and `action_space` properties are defined using `gymnasium.spaces.Box` objects. These simply represent 'boxes' of possible numbers. In our case they are 1-D boxes (notice that the first and second arguments to `Box(...)` are arrays corresponding to opposing corners of the box: in our case, the number 0 and the number 1). In a 2-D case we'd have, for example,
```
self.observation_space = spaces.Box(
            np.array([0,0]),
            np.array([1,1]),
            dtype = np.float32,
            )
```

Notice also that the `__init__` function has no `return` statement: it is a 'void' function that always returns None. The importance of this function is not the value it returns. As we said before, what is important is what happens when it runs: it *sets up* an object with properties and methods which are influenced by the input given. (In this case, the input given is the initial state `init_state`. In the chatbot example, the input given was the chatbot's name and mood.)