# A guide to the Portfolio Optimization Environment

This notebook aims to provide an example of using PortfolioOptimizationEnv (or POE) to train a reinforcement learning model that learns to solve the portfolio optimization problem.

In this document, we will reproduce a famous architecture called EIIE (ensemble of identical independent evaluators), introduced in the following paper:

- Zhengyao Jiang, Dixing Xu, & Jinjun Liang. (2017). A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem. https://doi.org/10.48550/arXiv.1706.10059.

We recommend reading the paper to understand the algorithm implemented in this notebook.

### Note
If you're using this environment, consider citing the following paper (in addition to the FinRL references):

- Caio Costa, & Anna Costa (2023). POE: A General Portfolio Optimization Environment for FinRL. In *Anais do II Brazilian Workshop on Artificial Intelligence in Finance* (pp. 132–143). SBC. https://doi.org/10.5753/bwaif.2023.231144.

```
@inproceedings{bwaif,
 author = {Caio Costa and Anna Costa},
 title = {POE: A General Portfolio Optimization Environment for FinRL},
 booktitle = {Anais do II Brazilian Workshop on Artificial Intelligence in Finance},
 location = {João Pessoa/PB},
 year = {2023},
 keywords = {},
 issn = {0000-0000},
 pages = {132--143},
 publisher = {SBC},
 address = {Porto Alegre, RS, Brasil},
 doi = {10.5753/bwaif.2023.231144},
 url = {https://sol.sbc.org.br/index.php/bwaif/article/view/24959}
}

```

## Installation and imports

To run this notebook in Google Colab, uncomment the commands in the cells below.

```
# Install the FinRL library
# !sudo apt install swig
# !pip install git+https://github.com/AI4Finance-Foundation/FinRL.git
```

```
# We also need to install quantstats, because the environment uses it to plot graphs
# !pip install quantstats
```

#### Import the necessary code libraries

## Fetch data

In their paper, *Jiang et al.* create a portfolio composed of the top 11 cryptocurrencies by 30-day trading volume. Since the paper does not specify when this ranking was made, it is difficult to reproduce, so we use a similar approach in the Brazilian stock market:

- We select the top 10 stocks from the Brazilian stock market;
- For simplicity, we disregard stocks that have missing data for any day in the period from 2011-01-01 to 2019-12-31 (9 years).
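The missing-data criterion above can be sketched in plain Python. This is only an illustration on toy data: the function name `tickers_with_full_history`, the tickers and the prices are all hypothetical, and the notebook itself works with downloaded market data rather than hand-written dictionaries.

```python
def tickers_with_full_history(prices_by_ticker, trading_days):
    """Return the tickers that have a price for every trading day.

    prices_by_ticker maps ticker -> {date string: closing price or None}.
    """
    required = set(trading_days)
    kept = []
    for ticker, prices in prices_by_ticker.items():
        # Days for which this ticker actually has a price.
        available = {day for day, price in prices.items() if price is not None}
        if required <= available:
            kept.append(ticker)
    return kept


# Toy data (hypothetical tickers and prices, for illustration only).
trading_days = ["2011-01-03", "2011-01-04", "2011-01-05"]
prices_by_ticker = {
    "AAA": {"2011-01-03": 10.0, "2011-01-04": 10.2, "2011-01-05": 10.1},
    "BBB": {"2011-01-03": 5.0, "2011-01-05": 5.3},  # one day absent
    "CCC": {"2011-01-03": 7.0, "2011-01-04": None, "2011-01-05": 7.1},  # one gap
}
selected = tickers_with_full_history(prices_by_ticker, trading_days)
print(selected)
```

Only the ticker with a complete history survives the filter; the ranking of the remaining stocks would then be applied on top of this selection.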

## Portfolio Optimization Environment

Since POE has not yet been merged into FinRL, we include its code below.

## Define Architecture

#### Create the gradient policy

The gradient policy below is identical to the EIIE architecture from the *Jiang et al.* paper.

#### Portfolio Vector Memory

The portfolio vector memory (PVM) is an object that stores all the portfolio vectors generated by the policy. It is useful because it lets the algorithm retrieve the last action performed, which is needed as input for forward propagation. See the *Jiang et al.* paper for more information.
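As a rough illustration of the idea, here is a minimal sketch of such a memory in plain Python. The class name `SimplePVM` and its methods are hypothetical and are not the implementation used in this notebook; the sketch also assumes that index 0 of a portfolio vector is the cash position and that an episode starts fully in cash.

```python
class SimplePVM:
    """Keeps every portfolio vector produced during an episode (illustrative)."""

    def __init__(self, portfolio_size):
        self.portfolio_size = portfolio_size
        self.reset()

    def reset(self):
        # Assumption: the episode starts with all capital in cash (index 0).
        self.memory = [[1.0] + [0.0] * self.portfolio_size]

    def last_action(self):
        """Return the most recent portfolio vector."""
        return self.memory[-1]

    def add(self, action):
        """Store the portfolio vector chosen at the current timestep."""
        self.memory.append(list(action))


pvm = SimplePVM(portfolio_size=3)
first_input = pvm.last_action()   # initial all-cash vector
pvm.add([0.1, 0.3, 0.3, 0.3])     # action chosen by the policy
second_input = pvm.last_action()  # fed to the policy at the next timestep
```

Calling `reset()` at the start of each episode keeps actions from one episode from leaking into the next.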

#### ReplayBuffer and RLDataset

The replay buffer implemented in this work differs slightly from the one used in well-known algorithms. Usually, experiences are continually added to the replay buffer and, once it is full, older experiences are overwritten by new ones (deque behavior). A reinforcement learning algorithm can repeatedly sample experiences from the buffer to update its neural networks and, as time passes, older experiences are discarded due to the deque behavior.

In this replay buffer, the deque behavior is still present, but when the algorithm samples a batch of experiences, all the experiences in the buffer are returned and the buffer is cleared. This behavior is required by the policy gradient algorithm introduced by *Jiang et al.*
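The behavior described above can be sketched with `collections.deque` from the standard library. The class name `SequentialReplayBuffer` is hypothetical, and this is a simplified illustration rather than the notebook's actual implementation.

```python
from collections import deque


class SequentialReplayBuffer:
    """Bounded buffer whose sample() returns everything in order, then clears."""

    def __init__(self, capacity):
        # maxlen gives the deque behavior: the oldest items are dropped when full.
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self):
        """Return all stored experiences in insertion order and empty the buffer."""
        batch = list(self.buffer)
        self.buffer.clear()
        return batch


buffer = SequentialReplayBuffer(capacity=3)
for step in range(5):
    buffer.append(("experience", step))

batch = buffer.sample()
print(batch)        # the three most recent experiences, in temporal order
print(len(buffer))  # the buffer is empty after sampling
```

Because the batch preserves temporal order, the change in portfolio value across the batch can be computed from consecutive experiences.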

#### Create Polyak average function

The Polyak average function lets us keep a target policy that is updated incrementally instead of being replaced at every training step. This helps avoid oscillations in the final policy performance.
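A common form of this update is $\theta_{target} \leftarrow \tau \, \theta + (1 - \tau) \, \theta_{target}$. The sketch below applies it to plain floats standing in for network parameters; the function name, the `tau` parameter name and its default value are assumptions made for illustration, not values taken from this notebook.

```python
def polyak_average(net_params, target_params, tau=0.01):
    """Return target parameters moved a fraction tau toward net_params.

    Both arguments map parameter names to floats (stand-ins for tensors).
    """
    return {
        name: tau * net_params[name] + (1 - tau) * target_params[name]
        for name in target_params
    }


net_params = {"w": 1.0, "b": -2.0}     # parameters of the trained policy
target_params = {"w": 0.0, "b": 0.0}   # parameters of the target policy
target_params = polyak_average(net_params, target_params, tau=0.1)
print(target_params)
```

A small `tau` makes the target policy change slowly, which is what smooths out oscillations; `tau=1` would simply copy the trained policy.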

### Create PG class

This class implements the policy gradient (PG) algorithm used in the *Jiang et al.* paper. This algorithm is inspired by DDPG (deep deterministic policy gradient), but there are a few differences:
- DDPG is an actor-critic algorithm, so it has an actor and a critic neural network. The algorithm below, however, has no critic network and uses the portfolio value as the value function: the policy is updated to maximize the portfolio value.
- DDPG usually adds noise to the action during training to encourage exploration. The PG algorithm, on the other hand, is purely exploitative.
- DDPG randomly samples experiences from its replay buffer. The implemented policy gradient, however, samples a batch of experiences that are sequential in time, which makes it possible to calculate the variation of the portfolio value over the batch and use it as the value function.

The algorithm can be described as follows:
1. Initialize the policy network, the target policy network (if used) and the replay buffer;
2. For each episode, do the following:
    1. For each period of `batch_size` timesteps, do the following:
        1. For each timestep, define an action to be performed, simulate the timestep and save the experience in the replay buffer.
        2. After `batch_size` timesteps are simulated, sample the replay buffer.
        3. Update the target policy network (if used).
        4. Calculate the value function: $V = \sum\limits_{t=1}^{batch\_size} \ln(\mu_{t}(W_{t} \cdot P_{t}))$, where $W_{t}$ is the action performed at timestep $t$, $P_{t}$ is the price variation vector at timestep $t$ and $\mu_{t}$ is the transaction remainder factor at timestep $t$. See the *Jiang et al.* paper for more details.
        5. Perform gradient ascent in the policy network.
    2. If, at the end of the episode, there is a sequence of remaining experiences in the replay buffer, perform the sampling and update steps above on the remaining experiences.
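To make the value function in the list above concrete, the following sketch evaluates it for a toy batch of two timesteps. All numbers are made up, the function name is hypothetical, and the sketch assumes that index 0 of each vector is cash with a price variation of 1.0. It only computes $V$; it does not perform the gradient step.

```python
import math


def value_function(actions, price_variations, remainder_factors):
    """Sum over the batch of ln(mu_t * (W_t . P_t))."""
    total = 0.0
    for w, p, mu in zip(actions, price_variations, remainder_factors):
        growth = sum(w_i * p_i for w_i, p_i in zip(w, p))  # dot product W_t . P_t
        total += math.log(mu * growth)
    return total


# Toy batch: cash plus two assets (illustrative values only).
actions = [[0.0, 0.5, 0.5], [0.2, 0.4, 0.4]]                # W_t
price_variations = [[1.0, 1.02, 0.99], [1.0, 0.98, 1.03]]   # P_t
remainder_factors = [1.0, 0.999]                            # mu_t

V = value_function(actions, price_variations, remainder_factors)
growth_over_batch = math.exp(V)  # product of the per-step factors mu_t * (W_t . P_t)
print(V, growth_over_batch)
```

Since the sum of logarithms is the logarithm of the product, maximizing $V$ is equivalent to maximizing the product of the per-step growth factors over the batch.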

### Train Algorithm

### Save Model

## Test Model

### Define test periods
We use three annual test periods: 2020, 2021 and 2022. We download the data from Yahoo Finance in the same way as for the training data.

### Instantiate different environments

Since we have three different periods of time, we need three different environments instantiated to simulate them.

### Test EIIE architecture
Now we can test the EIIE architecture in the three test periods. It's important to note three things:
- We load the saved policy even though it is not strictly necessary, simply to show how to save and load a model;
- It's important to reset the environment before each episode and to create a new PVM (portfolio vector memory) for each test period;
- The neural network expects a batch of data, so an additional (batch) dimension must be added to the input when it is a single item.

### Test Uniform Buy and Hold
For comparison, we also test a uniform buy-and-hold strategy. In this strategy, the portfolio holds no cash and the same fraction of the capital is allocated to each asset.
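The sketch below shows how the value of such a portfolio evolves. It is a simplified illustration that ignores transaction costs; the function name, the initial value and the price ratios are hypothetical and do not come from the notebook's data.

```python
def uniform_buy_and_hold(initial_value, price_relatives):
    """Simulate an equal initial allocation that is never rebalanced.

    price_relatives holds, for each period, the ratio p_t / p_(t-1) per asset.
    Returns the portfolio value at the start and after every period.
    """
    n_assets = len(price_relatives[0])
    # Split the capital equally among the assets, leaving no cash.
    holdings = [initial_value / n_assets] * n_assets
    values = [initial_value]
    for period in price_relatives:
        holdings = [h * r for h, r in zip(holdings, period)]
        values.append(sum(holdings))
    return values


# Two assets over two periods (made-up price ratios).
values = uniform_buy_and_hold(100_000, [[1.10, 0.90], [1.00, 1.10]])
print(values)
```

Note that the weights drift after the first period because the holdings are never rebalanced, which is what distinguishes buy and hold from a constantly rebalanced uniform portfolio.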

### Plot graphs