# Reinforcement Learning

# Introduction

Reinforcement Learning is a branch of machine learning that focuses on how agents should act in an environment to maximize a notion of cumulative reward. Unlike supervised learning, where algorithms are trained on labeled examples, RL learns from a trial-and-error process based on the environment's responses to its actions.

Differences between RL and other learning methods:

- **Supervised learning**: Algorithms learn from labeled examples and try to predict the label of new inputs based on the training set;
- **Unsupervised learning**: Algorithms try to find patterns in the data, without any labels;
- **Reinforcement learning**: An agent learns to make decisions through the rewards (or punishments) it receives for its actions.

**Key elements:**

- Agent:

    - **Definition**: The agent is the entity that makes decisions and learns through interaction with the environment. In RL, the agent chooses actions to take based on its current policy.
    - **Role in RL**: The agent is at the center of learning in RL. Its goal is to learn the best possible policy, that is, a map from states to actions, to maximize the total reward collected over time. 

- Environment:

    - **Definition**: The environment represents the context or world in which the agent operates. It includes anything that the agent can interact with but does not have direct control over;
    - **Interaction with the agent**: The environment responds to the agent's actions and presents new states and rewards to the agent. The nature of the environment can vary from simple and static to complex and dynamic.

- State:

    - **Definition**: A state is a configuration or representation of the environment at a given time. States provide the information the agent needs to make decisions.
    - **Importance**: The quality and quantity of information available in states can significantly influence the effectiveness of agent learning.

- Action:

    - **Definition**: Actions are the various behaviors or moves that the agent can perform. The set of all available actions is known as the action space.
    - **Action-State dynamics**: Every action taken by the agent affects the state of the environment. The relationship between actions and their consequences on states is fundamental to the agent's decision making.

- Reward:

    - **Definition**: A reward is immediate feedback provided to the agent by the environment as a consequence of its actions. Rewards can be positive (to encourage certain actions) or negative (to discourage certain actions).
    - **Role in learning**: Rewards are the main guide for the agent in the learning process. The agent's goal is to maximize the sum of rewards over time, often referred to as return.

- Policy:

    - **Definition**: A policy is a strategy adopted by the agent, a kind of rule or algorithm that decides what action to take based on the current state.
    - **Types**: Policies can be deterministic or stochastic. A deterministic policy always provides the same action for a given state, while a stochastic policy selects actions according to a probability distribution.

- Evaluation functions:

    - **Goal**: These functions help the agent evaluate the effectiveness of its actions and policies.
    - **Types**:
        - **Value function**: Estimates the expected return from a state following a given policy.
        - **Q-Value function**: also known as **Action-Value function**. Estimates the expected return for a state-action pair.

- Model:

    - **Description**: A model is an internal representation of the environment that the agent uses to predict how the environment will respond to its actions.
    - **Model-based vs model-free RL**: In model-based RL, the agent uses an explicit model of the environment to plan its actions. In model-free RL, the agent learns directly from interactions with the environment without an explicit model.
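
The elements above can be tied together in a minimal sketch of the agent-environment loop. The `CoinFlipEnv` class, its parameters and both policies are hypothetical, invented only for illustration; they are not part of any library used later in this notebook.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses the outcome of a biased coin."""

    def __init__(self, p_heads=0.7, episode_length=10, seed=0):
        self.p_heads = p_heads
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is simply the step counter

    def step(self, action):
        # action: 1 = guess heads, 0 = guess tails
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else -1.0  # immediate feedback
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, reward, done

def deterministic_policy(state):
    """Always returns the same action for a given state."""
    return 1

def make_stochastic_policy(seed=0):
    """Builds a policy that samples its action from a uniform distribution."""
    rng = random.Random(seed)
    def policy(state):
        return rng.choice([0, 1])
    return policy

def run_episode(env, policy):
    """Runs one episode and returns the return (sum of rewards)."""
    state = env.reset()
    total_return, done = 0.0, False
    while not done:
        action = policy(state)                  # the agent decides
        state, reward, done = env.step(action)  # the environment responds
        total_return += reward
    return total_return

print(run_episode(CoinFlipEnv(), deterministic_policy))
print(run_episode(CoinFlipEnv(), make_stochastic_policy()))
```

Because the coin is biased towards heads, the deterministic policy should collect a higher return on average than the uniform stochastic one, although a single short episode can go either way.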

# Theoretical concepts

#### 1. Markov Decision Process

The Markov Decision Process (MDP) is a fundamental mathematical framework in the field of Reinforcement Learning. It provides a formalization for decision-making in uncertain and dynamic situations. An MDP is characterized by a set of states, a set of actions, transition probabilities and reward functions.

Key MDP elements:

- **States**: A set of states $S$ represents all the possible configurations the environment can be in.
- **Actions**: A set of actions $A$ that the agent can take. The set of available actions may depend on the current state.
- **Transition probability**: A transition function $P(s_{t+1} | s_t, a_t)$ which defines the probability of moving to the state $s_{t+1}$ given the current state $s_t$ and action $a_t$.
- **Reward**: A reward function $R(s_t, a_t, s_{t+1})$ which assigns a reward (or a punishment) to the agent for the transition from state $s_t$ to state $s_{t+1}$ after taking action $a_t$.

The fundamental property of an MDP is the "Markov property," which states that the future is independent of the past given the present. This means that the transition probability and reward depend only on the current state and the action taken, not on the history of previous actions or states.
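
As a rough illustration, a very small MDP can be written down as plain Python data. The states, actions, probabilities and rewards below are invented for the example; only the structure ($S$, $A$, $P$, $R$) follows the definitions above.

```python
import random

# Hypothetical two-state MDP: P[(s, a)] maps each next state to its probability.
P = {
    ("low", "wait"): {"low": 0.9, "high": 0.1},
    ("low", "invest"): {"low": 0.4, "high": 0.6},
    ("high", "wait"): {"low": 0.3, "high": 0.7},
    ("high", "invest"): {"low": 0.5, "high": 0.5},
}

def R(s, a, s_next):
    """Reward for the transition (s, a, s_next); the values are illustrative."""
    cost = 1.0 if a == "invest" else 0.0
    return (5.0 if s_next == "high" else 0.0) - cost

def sample_transition(s, a, rng):
    """Draws s_next from P(. | s, a).

    Only the current state and action are needed, never the history:
    this is the Markov property.
    """
    next_states = list(P[(s, a)])
    weights = [P[(s, a)][n] for n in next_states]
    s_next = rng.choices(next_states, weights=weights, k=1)[0]
    return s_next, R(s, a, s_next)

rng = random.Random(0)
state = "low"
for _ in range(5):
    action = rng.choice(["wait", "invest"])
    next_state, reward = sample_transition(state, action, rng)
    print(state, action, next_state, reward)
    state = next_state
```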

#### 2. Reward and evaluation function
 
In the context of Reinforcement Learning, the reward and evaluation function are central concepts that guide agent learning and decision-making. This chapter explores the nature of these components and their role in RL.

Reward:

- **Definition**: A reward is a scalar value that represents the immediate feedback provided to the agent by the environment as a consequence of its actions. Rewards can be positive (to encourage certain actions) or negative (to discourage certain actions).
- **Role in learning**: Rewards are the main guide for the agent in the learning process. The agent's goal is to maximize the sum of rewards over time, often referred to as return.

Evaluation function:
 
- **Value function**: The value function, denoted as $V(s)$, estimates the expected total return from a state $s$ when following a given policy. It measures how good it is for the agent to be in that state.
- **Q-value function**: The Q-value function, or action-value function, denoted as $Q(s, a)$, evaluates the action $a$ in the state $s$. It estimates the expected return following the action $a$ in the state $s$ and then adhering to a specific policy. 
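
These two functions are commonly written as follows. The discount factor $\gamma \in [0, 1]$ and the superscript $\pi$, which makes the dependence on the policy explicit, are notation assumed here and not introduced elsewhere in this document; $r_{t+1}$ stands for $R(s_t, a_t, s_{t+1})$.

$$G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}$$

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right]$$

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s, \, a_t = a \right]$$

The two are linked by averaging the action values over the actions the policy would choose:

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \, Q^{\pi}(s, a)$$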

#### 3. Common RL algorithms

- Q-learning is a model-free, off-policy temporal-difference (TD) method: it learns the optimal action-value function regardless of the policy the agent follows while exploring.
The algorithm iteratively updates the Q-value estimate of each state-action pair, based on the reward received and the maximum Q-value of the next state. In its tabular form, Q-learning is widely used for problems with discrete state and action spaces and is well suited to situations where the environment dynamics are unknown.


- SARSA is a model-free, on-policy temporal-difference (TD) algorithm.
Unlike Q-learning, SARSA updates its Q-values based on the agent's current policy (on-policy). The update considers the current transition and the next action the agent actually intends to perform. SARSA is useful in environments where risk assessment and safety considerations are important because it takes into account the actual path the agent plans to follow, including its exploratory moves.
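
The difference between the two algorithms comes down to one term in the update. The sketch below shows a tabular version of both rules; the learning rate `alpha`, the discount factor `gamma`, and the state and action names are assumptions made for the example, and terminal states are not handled.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the best action available in s_next."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action the agent will actually take next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

actions = ["hold", "buy"]
Q_ql = defaultdict(float)
Q_sarsa = defaultdict(float)

# Give both tables the same estimates for the next state "s1".
for table in (Q_ql, Q_sarsa):
    table[("s1", "hold")] = 1.0
    table[("s1", "buy")] = 3.0

# Same transition for both; SARSA is told the agent will "hold" next.
q_learning_update(Q_ql, "s0", "buy", r=1.0, s_next="s1", actions=actions)
sarsa_update(Q_sarsa, "s0", "buy", r=1.0, s_next="s1", a_next="hold")

print(Q_ql[("s0", "buy")], Q_sarsa[("s0", "buy")])
```

Q-learning uses the higher value of `buy` in `s1` even though the agent is about to `hold`, so its estimate ends up larger than SARSA's for the same transition.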

Some remarks:
- **Exploration vs. exploitation**: A key aspect in RL is the balance between exploration (trying new actions) and exploitation (using acquired knowledge). RL algorithms must effectively manage this balance;
- **Scalability and complexity**: the scalability of algorithms in environments with large state and action spaces is a significant challenge. Methods such as deep learning have been integrated into RL to address this challenge.
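
One common way to handle the exploration-exploitation balance is an epsilon-greedy rule. It is shown here as a minimal sketch: the function name, the `epsilon` parameter and the example Q-values are illustrative assumptions, not something prescribed by this document.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Picks a random action with probability epsilon, otherwise the greedy one.

    q_values: dict mapping each action to its current Q-value estimate.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))    # explore: try any action
    return max(q_values, key=q_values.get)   # exploit: use current knowledge

rng = random.Random(0)
q_values = {"sell": -0.2, "hold": 0.1, "buy": 0.4}
print([epsilon_greedy(q_values, 0.2, rng) for _ in range(10)])
```

A larger `epsilon` means more exploration; in practice it is often decreased over time as the estimates become more reliable.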

# Use of RL in finance

In trading, RL can be used to develop automated strategies that decide when to buy, sell, or hold a stock or cryptocurrency.
Financial markets are complex, noisy, and nonstationary, making trading an ideal challenge for RL, which can adapt to such dynamic conditions.

Challenges:

- Financial data are often noisy and exhibit nonstationarity, which can lead to overfitting and inconsistent model performance.
- Markets change rapidly, and what has worked in the past may no longer be valid, requiring the model to continually adapt.

Pros of using RL:

- **Automation and Scalability**: RL can automate trading decisions and operate on a large scale, analyzing huge amounts of data more efficiently than human analysis.
- **Adaptability**: RL models can dynamically adapt to changes in the market, continuously learning from new data.
- **High Performance Potential**: When configured well, RL models can potentially outperform traditional strategies and human traders, especially in highly volatile markets.

Common approaches:

- **Deep reinforcement learning**: The use of deep neural networks to handle the complexity of market data and capture nonlinear relationships.
- **Optimized exploration strategies**: Development of exploration methods that balance between learning from historical market situations and exploring new trading strategies.


# Practical Implementation

OpenAI Gym is an open-source library, originally developed by OpenAI, that provides a set of environments for developing and comparing Reinforcement Learning algorithms. Its maintained successor, Gymnasium, is the package imported in the code of this notebook. Gym is designed to make access to and implementation of RL environments simple and standardized.

Features of Gym:
- **Diversified Environments:** Gym offers a wide range of environments, from simple control problems to complex physics- and pixel-based environments.
- **Standardized Interface:** Provides a consistent API for interacting with environments, making it easier to test different algorithms.
- **Flexibility:** Gym is compatible with several machine learning frameworks and libraries, allowing integration with advanced RL algorithms.

"TradingEnv" is a Gym-specific environment that simulates the stock or cryptocurrency market for trading. This environment provides an ideal platform for testing and evaluating RL-based trading strategies. The main features of "TradingEnv" are:

- **Market Simulation:** Reproduces stock or cryptocurrency price movements, providing the agent with realistic data on which to act.
- **Actions and Trading Decisions:** The agent can execute actions such as buy, sell or hold, based on current market information.
- **Feedback and Rewards:** The environment provides feedback in terms of rewards or penalties based on the agent's trading performance.

Considerations:
- **Realism and Limitations**: although "TradingEnv" offers a realistic simulation, it is important to recognize its limitations and the difference between a simulated environment and real trading.
- **Parameter Sensitivity**: the effectiveness of the RL agent may be sensitive to the parameters of the environment, including simulated pricing patterns and market conditions.
- **Testing and Evaluation**: it is crucial to perform extensive testing and model evaluations to ensure that the learned strategies are robust and reliable before considering their application in real trading situations.


## Understanding action space
#### Positions
In the simplest setup, the agent can take two positions (the environment created later in this notebook also allows `-1`):
- `1`: convert the whole of the portfolio into BTC
- `0`: the portfolio is converted into USD

ref: [https://gym-trading-env.readthedocs.io/en/latest/index.html](https://gym-trading-env.readthedocs.io/en/latest/index.html)

---
# Import dataset
In order for this environment to work, the dataset must be in the following format:
- it must be ordered by **ascending date**
- the index must be a **DatetimeIndex**
- it must have at least **open**, **high**, **low**, **close** and volume columns (the code in this notebook uses lowercase price column names and `Volume USD`)
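
These requirements can be checked before the data reaches the environment. The helper below is a sketch, not part of `gym_trading_env`: the function name and the exact set of required columns (lowercase price columns, as used by the feature code in this notebook) are assumptions.

```python
import pandas as pd

# Assumed column names: lowercase, as in the feature code of this notebook.
REQUIRED_COLUMNS = {"open", "high", "low", "close"}

def check_dataset(df):
    """Returns a list of problems found; an empty list means the checks passed."""
    problems = []
    if not isinstance(df.index, pd.DatetimeIndex):
        problems.append("index is not a DatetimeIndex")
    elif not df.index.is_monotonic_increasing:
        problems.append("index is not sorted by ascending date")
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    return problems

# Synthetic data, only to exercise the checks.
good = pd.DataFrame(
    {"open": [1.0, 2.0, 3.0], "high": [2.0, 3.0, 4.0],
     "low": [0.5, 1.5, 2.5], "close": [1.5, 2.5, 3.5]},
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
)
print(check_dataset(good))
print(check_dataset(good.iloc[::-1].drop(columns=["close"])))
```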

In [8]:
import pandas as pd

In [9]:
dataset = './datasets/BTC_USD-Hourly.csv'
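df = pd.read_csv(dataset, parse_dates = ['date'], index_col = 'date')  # 'date' becomes a DatetimeIndex
df.sort_index(inplace = True)  # ascending date order
df.dropna(inplace = True)  # drop rows with missing values
df.drop_duplicates(inplace = True)  # drop repeated rows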

# Adding static features
The reinforcement learning agent will need inputs.
This environment treats as input every column that has **feature** in its name.

In [10]:
df['feature_close'] = df['close'].pct_change()
df['feature_open'] = df['open'] / df['close']
df['feature_high'] = df['high'] / df['close']
df['feature_low'] = df['low'] / df['close']
df['feature_volume'] = df['Volume USD'] / df['Volume USD'].rolling(7 * 24).max()

df.dropna(inplace = True)

The above features are called **static features**; this means they are computed once and they are not updated at each step.
We'd also need **dynamic features**, which are computed at each step.
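
To see which columns the agent would receive under this naming convention, the same kind of static features can be computed on a tiny synthetic frame. The prices below are made up, and the column filter is a plain substring check written for this sketch, not the environment's own code.

```python
import pandas as pd

# Synthetic prices, only for illustration.
toy = pd.DataFrame(
    {"close": [100.0, 102.0, 101.0, 105.0], "open": [99.0, 100.0, 102.0, 101.0]},
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
)
toy["feature_close"] = toy["close"].pct_change()  # relative change of the close price
toy["feature_open"] = toy["open"] / toy["close"]  # open price relative to the close
toy = toy.dropna()  # pct_change leaves a NaN in the first row

# Columns that follow the "feature in the name" convention.
feature_columns = [c for c in toy.columns if "feature" in c]
print(feature_columns)
print(toy[feature_columns])
```

The first row is lost to `dropna` because `pct_change` has no previous value to compare with, which is why the notebook calls `dropna` again after building the features.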

# Adding dynamic features
A **dynamic feature** is computed at each step, that's why we need to be careful: dynamic features can be *computationally more expensive* than static features.
The dynamic features below are the default dynamic features of the environment.

In [11]:
def dynamic_feature_last_position_taken(history):
    return history['position', -1]

def dynamic_feature_real_position(history):
    return history['real_position', -1]

# Creating the environment
Let's examine the parameters of the environment:
- **name**: the name of the environment
- **df**: the dataframe
- **positions**: list of the positions allowed by the environment
- **trading_fees**: the trading fees (buy and sell operations)
- **borrow_interest_rate**: the interest rate for borrowing money

In [12]:
import gymnasium as gym
import gym_trading_env
env = gym.make(
    'TradingEnv',
    name = 'BTCUSD',
    df = df,
    dynamic_feature_functions = [
        dynamic_feature_last_position_taken,
        dynamic_feature_real_position
    ],
    positions = [-1, 0, 1],
    trading_fees = 0.01 / 100,
    borrow_interest_rate = 0.003 / 100
)
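
To get a feel for the size of the fee parameter, the sketch below computes what a fee of this magnitude costs in isolation. It assumes, as a simplification, that the fee is charged on the full portfolio value every time the position changes and that prices do not move; the helper function is hypothetical and does not reproduce the environment's internal accounting.

```python
trading_fees = 0.01 / 100  # same value as passed to the environment: 0.01% as a fraction

def value_after_switches(initial_value, fee, n_switches):
    """Portfolio value left after paying `fee` on the full value n_switches times.

    Price moves are ignored on purpose, to isolate the cost of fees.
    """
    return initial_value * (1 - fee) ** n_switches

single_trade_cost = 10_000 * trading_fees  # fee on one 10,000 USD trade
remaining = value_after_switches(10_000, trading_fees, 1_000)

print(single_trade_cost)
print(remaining)
```

Under these assumptions a single trade costs about 1 USD, but a thousand position changes remove roughly 9.5% of the portfolio, so an agent that switches position very often pays a noticeable price even with small fees.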

# Run the environment

In [13]:
done, truncated = False, False
observation, info = env.reset()
while not done and not truncated:
    position_index = env.action_space.sample()  # random agent: pick a random index into the positions list
    observation, reward, done, truncated, info = env.step(position_index)

Market Return : 423.10%   |   Portfolio Return : -99.12%   |   
