# Stock NeurIPS2018 Part 2. Train
This series reproduces the process described in the paper *Practical Deep Reinforcement Learning Approach for Stock Trading*.

This is the second part of the NeurIPS2018 series. It shows how to use FinRL to turn the data into a Gym-style environment and train DRL agents on it.

Other demos can be found in the [FinRL-Tutorials](https://github.com/AI4Finance-Foundation/FinRL-Tutorials) repo.

# Part 1. Install Packages

In [1]:
## install finrl library
!pip install git+https://github.com/AI4Finance-Foundation/FinRL.git

Collecting git+https://github.com/AI4Finance-Foundation/FinRL.git
  Cloning https://github.com/AI4Finance-Foundation/FinRL.git to /tmp/pip-req-build-lo58n3p6
  Running command git clone --filter=blob:none --quiet https://github.com/AI4Finance-Foundation/FinRL.git /tmp/pip-req-build-lo58n3p6
  Resolved https://github.com/AI4Finance-Foundation/FinRL.git to commit 9e8c38aa5b92bbf0e20f65fc611fd43b43196859
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting elegantrl@ git+https://github.com/AI4Finance-Foundation/ElegantRL.git (from finrl==0.3.8)
  Cloning https://github.com/AI4Finance-Foundation/ElegantRL.git to /tmp/pip-install-dpxur5sx/elegantrl_b3c5387d2f6f493e8f408eede54077cc
  Running command git clone --filter=blob:none --quiet https://github.com/AI4Finance-Foundation/ElegantRL.git /tmp/pip-install-dpxur5sx/elegantrl_b3c5387d2f6f493e8f408eede54077cc
  Resolved ht

In [2]:
import numpy as np
import pandas as pd
from stable_baselines3.common.logger import configure
from finrl.agents.stablebaselines3.models import DRLAgent
from finrl.config import INDICATORS, TRAINED_MODEL_DIR, RESULTS_DIR
from finrl.main import check_and_make_directories

import sys, pathlib
sys.path.insert(0, str(pathlib.Path.cwd().parents[1]))
from finai_contest.env_stock_trading.env_stock_trading_meta import StockTradingEnv_FinRLMeta
from finai_contest.env_stock_trading.env_stock_trading_gym_anytrading import StockTradingEnv_gym_anytrading
check_and_make_directories([TRAINED_MODEL_DIR])

# Part 2. Build A Market Environment in OpenAI Gym-style

*Figure: the agent-environment interaction loop (rl_diagram_transparent_bg.png).*

The core elements of reinforcement learning are the **agent** and the **environment**. You can understand RL as the following process:

The agent acts in a world, which is the environment. It observes its current condition as a **state** and is allowed to take certain **actions**. After the agent executes an action, it arrives at a new state. At the same time, the environment gives the agent feedback called a **reward**, a numerical signal that tells how good or bad the new state is. As the figure above shows, the agent and the environment keep repeating this interaction.

The goal of the agent is to collect as much cumulative reward as possible. Reinforcement learning is the method by which the agent learns to improve its behavior and achieve that goal.
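
The interaction loop described above can be sketched in a few lines of plain Python. This is only an illustration: `ToyMarketEnv`, `run_episode`, and the `always_buy` policy are hypothetical names invented here, and the toy environment does not follow the FinRL or Gym API.

```python
class ToyMarketEnv:
    """Hypothetical one-asset environment (not FinRL's API)."""

    def __init__(self, prices):
        self.prices = prices
        self.t = 0
        self.position = 0  # number of shares currently held

    def reset(self):
        self.t = 0
        self.position = 0
        return (self.prices[0], self.position)

    def step(self, action):
        # action is -1 (sell one share), 0 (hold) or 1 (buy one share);
        # the position is never allowed to go below zero.
        self.position = max(0, self.position + action)
        old_price = self.prices[self.t]
        self.t += 1
        new_price = self.prices[self.t]
        # Reward: change in value of the shares held over this step.
        reward = self.position * (new_price - old_price)
        done = self.t == len(self.prices) - 1
        return (new_price, self.position), reward, done


def run_episode(env, policy):
    """Run one episode and return the cumulative reward."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                  # agent picks an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
    return total_reward


def always_buy(state):
    return 1


total = run_episode(ToyMarketEnv([10.0, 11.0, 10.5, 12.0]), always_buy)
print(total)  # 4.5
```

A learning algorithm replaces the fixed `always_buy` policy with one that is adjusted over many episodes so that the cumulative reward increases.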

To achieve this in Python, we follow the OpenAI Gym style to build the stock data into an environment.

The state, action, and reward are specified as follows:

* **State s**: The state space represents an agent's perception of the market environment. Just as a human trader analyzes various information, our agent passively observes price data and technical indicators computed from past data. It learns by interacting with the market environment (usually by replaying historical data).

* **Action a**: The action space includes the actions that an agent is allowed to take at each state. For example, a ∈ {−1, 0, 1}, where −1, 0, 1 represent selling, holding, and buying. When an action operates on multiple shares, a ∈ {−k, ..., −1, 0, 1, ..., k}; e.g., "Buy 10 shares of AAPL" or "Sell 10 shares of AAPL" are 10 or −10, respectively.

* **Reward function r(s, a, s′)**: The reward is an incentive for an agent to learn a better policy. For example, it can be the change in portfolio value when taking action a at state s and arriving at the new state s′, i.e., r(s, a, s′) = v′ − v, where v′ and v represent the portfolio values at states s′ and s, respectively.
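
As a worked example of the reward r(s, a, s′) = v′ − v, the sketch below values a one-stock portfolio before and after an action. The helper names `portfolio_value` and `step_reward` are hypothetical, and transaction costs are deliberately ignored to keep the arithmetic visible.

```python
def portfolio_value(cash, shares, price):
    """Cash plus the market value of the shares held."""
    return cash + shares * price


def step_reward(cash, shares, price, action, next_price):
    """Reward v' - v for trading `action` shares (signed) at `price`.

    The trade is executed at `price`; the new portfolio is then valued
    at `next_price`. Transaction costs are ignored in this sketch.
    """
    v = portfolio_value(cash, shares, price)
    cash_after = cash - action * price
    shares_after = shares + action
    v_next = portfolio_value(cash_after, shares_after, next_price)
    return v_next - v


# Buy 10 shares at 100, then the price moves to 102.
reward = step_reward(cash=1000.0, shares=0, price=100.0,
                     action=10, next_price=102.0)
print(reward)  # 20.0
```

Selling all 10 shares at 100 before the same price move would give a reward of 0, because the portfolio is then entirely in cash and the price change no longer affects its value.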


**Market environment**: the 30 constituent stocks of the Dow Jones Industrial Average (DJIA) index, accessed at the starting date of the testing period.

## Read data

We first read the .csv file of our training data into a DataFrame.

In [3]:
train = pd.read_csv('./data/train_data.csv')
# If you are not using the data generated in Part 1 of this tutorial, make sure
# it has the columns and index in a form that can be made into the environment.
# Then you can comment out and skip the following two lines.
train = train.set_index(train.columns[0])
train.index.names = ['']

train.rename(columns={train.columns[0]: "Time"}, inplace=True)
train = train[train["tic"] == "AAPL"]
train

Unnamed: 0,Time,tic,close,high,low,open,volume,day,macd,boll_ub,boll_lb,rsi_30,cci_30,dx_30,close_30_sma,close_60_sma,vix,turbulence
,,,,,,,,,,,,,,,,,,
0,2009-01-02,AAPL,2.724325,2.733032,2.556514,2.578128,7.460152e+08,4.0,0.000000,2.944416,2.619212,100.000000,66.666667,100.000000,2.724325,2.724325,39.189999,0.000000
1,2009-01-05,AAPL,2.839303,2.887335,2.783165,2.796974,1.181608e+09,0.0,0.002580,2.944416,2.619212,100.000000,66.666667,100.000000,2.781814,2.781814,39.080002,0.000000
2,2009-01-06,AAPL,2.792471,2.917055,2.773559,2.880430,1.289310e+09,1.0,0.001835,2.901000,2.669733,70.355481,45.847664,100.000000,2.785366,2.785366,38.560001,0.000000
3,2009-01-07,AAPL,2.732131,2.776861,2.709616,2.756148,7.530488e+08,2.0,-0.000728,2.880446,2.663669,50.429353,-30.767103,43.608189,2.772058,2.772058,43.389999,0.000000
4,2009-01-08,AAPL,2.782864,2.796374,2.703011,2.714719,6.735008e+08,3.0,-0.000086,2.868583,2.679855,60.226993,-8.240125,48.358063,2.774219,2.774219,42.560001,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3268,2021-12-27,AAPL,176.796051,176.884284,173.599943,173.619540,7.491960e+07,0.0,4.966004,179.988879,156.216779,65.037363,111.207403,48.878593,163.182840,153.508200,17.680000,9.832455
3269,2021-12-28,AAPL,175.776413,177.776443,175.031312,176.629374,7.914430e+07,1.0,5.015726,180.195297,157.878026,64.013402,108.220094,50.226185,164.140366,154.110296,17.540001,8.468482
3270,2021-12-29,AAPL,175.864700,177.090204,174.648995,175.815677,6.234890e+07,2.0,5.004566,180.775811,158.677922,64.064081,99.807160,48.433792,165.100508,154.771134,16.950001,6.345946
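
If the training file comes from somewhere other than Part 1, a quick header check can catch missing columns before the environment is built. The sketch below uses only the standard library. `REQUIRED_COLUMNS` is an assumption based on the price columns visible in the table above, not a list taken from FinRL, and `missing_columns` is a hypothetical helper.

```python
import csv
import io

# Assumed minimum set of columns, read off the table shown above.
REQUIRED_COLUMNS = {"tic", "close", "high", "low", "open", "volume"}


def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return sorted(set(required) - set(header))


sample = (
    "date,tic,close,high,low,open,volume\n"
    "2009-01-02,AAPL,2.72,2.73,2.56,2.58,746015200\n"
)
print(missing_columns(sample))              # []
print(missing_columns("date,tic,close\n"))  # ['high', 'low', 'open', 'volume']
```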


## Construct the environment

Calculate and specify the parameters we need for constructing the environment.

In [4]:
stock_dimension = len(train.tic.unique())
state_space = 1 + 2*stock_dimension + len(INDICATORS)*stock_dimension
print(f"Stock Dimension: {stock_dimension}, State Space: {state_space}")

Stock Dimension: 1, State Space: 11
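
The printed state-space size can be reproduced by hand. The sketch below assumes the usual FinRL reading of the formula (one cash entry, then one price and one holding per stock, then one value per indicator per stock) and assumes `INDICATORS` holds the eight indicator columns visible in the table above; both assumptions are consistent with the printed value of 11.

```python
# Assumed contents of INDICATORS, read off the DataFrame columns above.
indicators = [
    "macd", "boll_ub", "boll_lb", "rsi_30",
    "cci_30", "dx_30", "close_30_sma", "close_60_sma",
]


def state_space_size(stock_dim, n_indicators):
    """Break the state-space formula into its named parts."""
    cash = 1                                    # remaining cash balance
    prices = stock_dim                          # one price per stock
    holdings = stock_dim                        # one share count per stock
    indicator_values = n_indicators * stock_dim
    return cash + prices + holdings + indicator_values


print(state_space_size(1, len(indicators)))   # 11
print(state_space_size(30, len(indicators)))  # 301
```

The second call shows how the state grows if all 30 DJIA constituents are kept instead of filtering to AAPL.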


In [5]:
buy_cost_list = sell_cost_list = [0.001] * stock_dimension
num_stock_shares = [0] * stock_dimension

env_kwargs = {
    "hmax": np.inf,
    "initial_amount": 1000000,
    "num_stock_shares": num_stock_shares,
    "buy_cost_pct": buy_cost_list,
    "sell_cost_pct": sell_cost_list,
    "state_space": state_space,
    "stock_dim": stock_dimension,
    "tech_indicator_list": INDICATORS,
    "action_space": 2*stock_dimension,
    "reward_scaling": 1e-4,
    "window_size": 30
}


e_train_gym = StockTradingEnv_gym_anytrading(df = train, **env_kwargs)
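
To get a feel for `buy_cost_pct`, `sell_cost_pct`, and `reward_scaling`, the sketch below applies them to a single trade. It assumes costs are proportional to the traded value and that the raw reward is simply multiplied by the scaling factor. The custom environment used here may handle both differently, so treat the helper functions as hypothetical.

```python
def buy_cash_outflow(price, shares, cost_pct):
    """Cash spent to buy `shares`, including a proportional cost."""
    return price * shares * (1 + cost_pct)


def sell_cash_inflow(price, shares, cost_pct):
    """Cash received for selling `shares`, net of a proportional cost."""
    return price * shares * (1 - cost_pct)


def scaled_reward(value_change, reward_scaling):
    """Scale a raw change in portfolio value into a training reward."""
    return value_change * reward_scaling


spent = buy_cash_outflow(price=50.0, shares=100, cost_pct=0.001)
received = sell_cash_inflow(price=50.0, shares=100, cost_pct=0.001)
round_trip_cost = spent - received  # about 10 on a 5000 position

print(round(spent, 2), round(received, 2), round(round_trip_cost, 2))
print(scaled_reward(20000.0, 1e-4))  # a 20000 gain becomes a reward near 2
```

Under these assumptions, buying and immediately selling at the same price loses 0.2% of the traded value, which discourages the agent from trading needlessly.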

## Environment for training

In [6]:
env_train, _ = e_train_gym.get_sb_env()

In [7]:
# train_aapl = train[train["tic"] == "aapl"]
# train_gym_anytrade = pd.DataFrame()
# train_gym_anytrade['Time'] = pd.to_datetime(train_aapl['date'])  # Convert to datetime
# train_gym_anytrade['Open'] = train_aapl['open']
# train_gym_anytrade['High'] = train_aapl['high']
# train_gym_anytrade['Low'] = train_aapl['low']
# train_gym_anytrade['Close'] = train_aapl['close']
# train_gym_anytrade['Volume'] = train_aapl['volume']
# train_aapl

In [8]:
# import gymnasium as gym
# import gym_anytrading
# env_train = gym.make(
#     'stocks-v0',
#     df=train_gym_anytrade,
#     window_size=30,
#     frame_bound=(30, len(train_gym_anytrade))
# )
# from stable_baselines3.common.vec_env import DummyVecEnv

# def get_sb_env(self):
#     e = DummyVecEnv([lambda: self])
#     obs = e.reset()
#     return e, obs
# # Patch the method
# env_train = env_train.env
# env_train = env_train.env
# env_train.get_sb_env = get_sb_env.__get__(env_train)
# env_train

# Part 3. Train DRL Agents
* Here, the DRL algorithms come from **[Stable Baselines 3](https://stable-baselines3.readthedocs.io/en/master/)**, a library that implements popular DRL algorithms in PyTorch and succeeds the older Stable Baselines.
* Users are also encouraged to try **[ElegantRL](https://github.com/AI4Finance-Foundation/ElegantRL)** and **[Ray RLlib](https://github.com/ray-project/ray)**.

In [9]:
agent = DRLAgent(env = env_train)

# Set the corresponding values to 'True' for the algorithms that you want to use
if_using_a2c = False
if_using_ddpg = False
if_using_ppo = True
if_using_td3 = False
if_using_sac = False

## Agent Training: 5 algorithms (A2C, DDPG, PPO, TD3, SAC)


### Agent 1: A2C


In [10]:
agent = DRLAgent(env = env_train)
model_a2c = agent.get_model("a2c")

if if_using_a2c:
  # set up logger
  tmp_path = RESULTS_DIR + '/a2c'
  new_logger_a2c = configure(tmp_path, ["stdout", "csv", "tensorboard"])
  # Set new logger
  model_a2c.set_logger(new_logger_a2c)

{'n_steps': 5, 'ent_coef': 0.01, 'learning_rate': 0.0007}
Using cuda device




In [11]:
trained_a2c = agent.train_model(model=model_a2c, 
                             tb_log_name='a2c',
                             total_timesteps=50000) if if_using_a2c else None

In [12]:
trained_a2c.save(TRAINED_MODEL_DIR + "/agent_a2c") if if_using_a2c else None

### Agent 2: DDPG

In [13]:
# agent = DRLAgent(env = env_train)
# model_ddpg = agent.get_model("ddpg")

# if if_using_ddpg:
#   # set up logger
#   tmp_path = RESULTS_DIR + '/ddpg'
#   new_logger_ddpg = configure(tmp_path, ["stdout", "csv", "tensorboard"])
#   # Set new logger
#   model_ddpg.set_logger(new_logger_ddpg)

In [14]:
trained_ddpg = agent.train_model(model=model_ddpg, 
                             tb_log_name='ddpg',
                             total_timesteps=50000) if if_using_ddpg else None

In [15]:
trained_ddpg.save(TRAINED_MODEL_DIR + "/agent_ddpg") if if_using_ddpg else None
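
The training and saving cells rely on Python's conditional expression, `X if flag else None`, which evaluates the flag first and never touches `X` when the flag is false. That is why the DDPG cells run without error even though `model_ddpg` is only defined in a commented-out cell. The sketch below demonstrates this with a hypothetical stand-in for `agent.train_model`.

```python
calls = []


def train_model(model_name):
    """Hypothetical stand-in for agent.train_model; records each call."""
    calls.append(model_name)
    return f"trained_{model_name}"


if_using_ddpg = False
# `model_ddpg` is intentionally undefined, as in the commented-out cell.
# With the flag False, the name is never looked up and nothing is trained.
trained_ddpg = train_model(model_ddpg) if if_using_ddpg else None  # noqa: F821

# Flipping the flag without defining the model fails at the name lookup.
if_using_ddpg = True
try:
    trained_ddpg = train_model(model_ddpg) if if_using_ddpg else None  # noqa: F821
    error = None
except NameError as exc:
    error = type(exc).__name__

print(trained_ddpg, calls, error)  # None [] NameError
```

In practice this means that setting a flag to `True` also requires uncommenting the matching model-setup cell first.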

### Agent 3: PPO

In [16]:
agent = DRLAgent(env = env_train)
PPO_PARAMS = {
    "n_steps": 2048,
    "ent_coef": 0.01,
    "learning_rate": 0.00025,
    "batch_size": 128,
}
model_ppo = agent.get_model("ppo",model_kwargs = PPO_PARAMS)

if if_using_ppo:
  # set up logger
  tmp_path = RESULTS_DIR + '/ppo'
  new_logger_ppo = configure(tmp_path, ["stdout", "csv", "tensorboard"])
  # Set new logger
  model_ppo.set_logger(new_logger_ppo)

{'n_steps': 2048, 'ent_coef': 0.01, 'learning_rate': 0.00025, 'batch_size': 128}
Using cuda device
Logging to results/ppo
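
These PPO settings determine how much data each update sees. The sketch below works out the arithmetic for the `total_timesteps=100000` used in the training call. It assumes a single environment (`n_envs=1`, as a one-environment `DummyVecEnv` would give) and the Stable Baselines 3 default of `n_epochs=10`, since `PPO_PARAMS` does not override it. `ppo_schedule` is a hypothetical helper, not part of any library.

```python
import math


def ppo_schedule(total_timesteps, n_steps, batch_size, n_envs=1, n_epochs=10):
    """Rough PPO bookkeeping under the assumptions stated above."""
    rollout_size = n_steps * n_envs  # transitions collected per iteration
    n_rollouts = math.ceil(total_timesteps / rollout_size)
    minibatches_per_epoch = rollout_size // batch_size
    return {
        "rollout_size": rollout_size,
        "n_rollouts": n_rollouts,
        "timesteps_collected": n_rollouts * rollout_size,
        "gradient_steps_per_rollout": n_epochs * minibatches_per_epoch,
    }


schedule = ppo_schedule(total_timesteps=100_000, n_steps=2048, batch_size=128)
for key, value in schedule.items():
    print(f"{key}: {value}")
```

Under these assumptions training runs for 49 iterations of 2048 steps each (100352 timesteps in total, slightly more than requested), with 160 gradient steps per iteration.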




In [17]:
trained_ppo = agent.train_model(model=model_ppo, 
                             tb_log_name='ppo',
                             total_timesteps=100000) if if_using_ppo else None

------------------------------------
| time/              |             |
|    fps             | 519         |
|    iterations      | 1           |
|    time_elapsed    | 3           |
|    total_timesteps | 2048        |
| train/             |             |
|    reward          | 0.0         |
|    reward_max      | 7.8124175   |
|    reward_mean     | -0.01372738 |
|    reward_min      | -10.3398    |
------------------------------------


KeyboardInterrupt: 

In [None]:
trained_ppo.save(TRAINED_MODEL_DIR + "/agent_ppo_gym_anytrade") if if_using_ppo else None

### Agent 4: TD3

In [None]:
# agent = DRLAgent(env = env_train)
# TD3_PARAMS = {"batch_size": 100, 
#               "buffer_size": 1000000, 
#               "learning_rate": 0.001}

# model_td3 = agent.get_model("td3",model_kwargs = TD3_PARAMS)

# if if_using_td3:
#   # set up logger
#   tmp_path = RESULTS_DIR + '/td3'
#   new_logger_td3 = configure(tmp_path, ["stdout", "csv", "tensorboard"])
#   # Set new logger
#   model_td3.set_logger(new_logger_td3)

In [None]:
trained_td3 = agent.train_model(model=model_td3, 
                             tb_log_name='td3',
                             total_timesteps=50000) if if_using_td3 else None

In [None]:
trained_td3.save(TRAINED_MODEL_DIR + "/agent_td3") if if_using_td3 else None

### Agent 5: SAC

In [None]:
# agent = DRLAgent(env = env_train)
# SAC_PARAMS = {
#     "batch_size": 128,
#     "buffer_size": 100000,
#     "learning_rate": 0.0001,
#     "learning_starts": 100,
#     "ent_coef": "auto_0.1",
# }

# model_sac = agent.get_model("sac",model_kwargs = SAC_PARAMS)

# if if_using_sac:
#   # set up logger
#   tmp_path = RESULTS_DIR + '/sac'
#   new_logger_sac = configure(tmp_path, ["stdout", "csv", "tensorboard"])
#   # Set new logger
#   model_sac.set_logger(new_logger_sac)

In [None]:
trained_sac = agent.train_model(model=model_sac, 
                             tb_log_name='sac',
                             total_timesteps=70000) if if_using_sac else None

In [None]:
trained_sac.save(TRAINED_MODEL_DIR + "/agent_sac") if if_using_sac else None

## Save the trained agent
Trained agents should already have been saved in the "trained_models" directory after you run the code blocks above.

For Colab users, the zip files should be at "./trained_models" or "/content/trained_models".

For users running in a local environment, the zip files should be at "./trained_models".
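
A quick way to confirm that the agents were written is to list the zip files in the model directory. The sketch below uses only the standard library. `list_saved_agents` is a hypothetical helper, and the temporary directory stands in for "./trained_models" so that the example runs anywhere.

```python
import tempfile
from pathlib import Path


def list_saved_agents(model_dir):
    """Return the names of all .zip files in `model_dir`, sorted."""
    return sorted(p.name for p in Path(model_dir).glob("*.zip"))


# Demonstration with a temporary directory in place of ./trained_models.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "agent_ppo_gym_anytrade.zip").touch()
    found = list_saved_agents(tmp)

print(found)  # ['agent_ppo_gym_anytrade.zip']
```

In the notebook itself, calling `list_saved_agents(TRAINED_MODEL_DIR)` after training should list one zip file per agent that was enabled.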