<a href="https://colab.research.google.com/github/AI4Finance-Foundation/FinRL-Tutorials/blob/master/1-Introduction/Stock_NeurIPS2018_SB3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Reinforcement Learning for Stock Trading from Scratch: Multiple Stock Trading

* **PyTorch Version**



# Content

* [1. Task Description](#0)
* [2. Install Python packages](#1)
    * [2.1. Install Packages](#1.1)    
    * [2.2. A List of Python Packages](#1.2)
    * [2.3. Import Packages](#1.3)
    * [2.4. Create Folders](#1.4)
* [3. Download and Preprocess Data](#2)
* [4. Preprocess Data](#3)        
    * [4.1. Technical Indicators](#3.1)
    * [4.2. Perform Feature Engineering](#3.2)
* [5. Build Market Environment in OpenAI Gym-style](#4)  
    * [5.1. Data Split](#4.1)  
    * [5.2. Environment for Training](#4.2)    
* [6. Train DRL Agents](#5)
* [7. Backtesting Performance](#6)  
    * [7.1. BackTestStats](#6.1)
    * [7.2. BackTestPlot](#6.2)   
  

<a id='0'></a>
# Part 1. Task Description

We train a DRL agent for stock trading. This task is modeled as a Markov Decision Process (MDP), and the objective function is maximizing (expected) cumulative return.

We specify the state-action-reward as follows:

* **State s**: The state space represents an agent's perception of the market environment. Just like a human trader analyzing various information, here our agent passively observes many features and learns by interacting with the market environment (usually by replaying historical data).

* **Action a**: The action space includes the actions that an agent is allowed to take at each state. For example, a ∈ {−1, 0, 1}, where −1, 0, and 1 represent selling, holding, and buying one share. When an action operates on multiple shares, a ∈ {−k, ..., −1, 0, 1, ..., k}; e.g., "Buy 10 shares of AAPL" and "Sell 10 shares of AAPL" are 10 and −10, respectively.

* **Reward function r(s, a, s′)**: Reward is an incentive for an agent to learn a better policy. For example, it can be the change of the portfolio value when taking a at state s and arriving at new state s′, i.e., r(s, a, s′) = v′ − v, where v′ and v represent the portfolio values at state s′ and s, respectively.
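
As a minimal sketch of the reward definition above, the following toy step function applies a share-count action and returns r = v′ − v. The helper `step_reward`, its arguments, and the absence of transaction costs are illustrative assumptions, not FinRL's implementation.

```python
import numpy as np

def step_reward(cash, holdings, prices, action, next_prices):
    """Toy one-step reward r = v' - v (hypothetical helper; ignores transaction costs)."""
    holdings = np.asarray(holdings, dtype=float)
    prices = np.asarray(prices, dtype=float)
    action = np.asarray(action, dtype=float)
    next_prices = np.asarray(next_prices, dtype=float)

    v = cash + holdings @ prices                 # portfolio value at state s
    cash_after = cash - action @ prices          # buying (a > 0) spends cash, selling (a < 0) adds it
    holdings_after = holdings + action
    v_next = cash_after + holdings_after @ next_prices  # portfolio value at state s'
    return float(v_next - v)

# Buy 10 shares at 10.0; the price then rises to 11.0.
r = step_reward(cash=1000.0, holdings=[0], prices=[10.0], action=[10], next_prices=[11.0])
print(r)
```

In this toy case the trade itself leaves the portfolio value unchanged, so the reward comes entirely from the price move on the shares held.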


**Market environment**: the 30 constituent stocks of the Dow Jones Industrial Average (DJIA) index, as of the starting date of the testing period.


The data for this case study is obtained from the Yahoo Finance API. The data contains Open-High-Low-Close prices and volume.


<a id='1'></a>
# Part 2. Install Python Packages

<a id='1.1'></a>
## 2.1. Install packages



<a id='1.2'></a>
## 2.2. A list of Python packages
* Yahoo Finance API
* pandas
* numpy
* matplotlib
* stockstats
* OpenAI gym
* stable-baselines3
* PyTorch
* pyfolio

<a id='1.3'></a>
## 2.3. Import Packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
# matplotlib.use('Agg')
%matplotlib inline

import sys
sys.path.append(r"D:\FinRL-master\FinRL-master")

from finrl.meta.preprocessor.yahoodownloader import YahooDownloader
from finrl.meta.preprocessor.preprocessors import FeatureEngineer, data_split
from finrl.meta.env_stock_trading.env_stocktrading import StockTradingEnv
from finrl.agents.stablebaselines3.models import DRLAgent
from stable_baselines3.common.logger import configure
from finrl.meta.data_processor import DataProcessor
from finrl.meta.data_processors.processor_yahoofinance import YahooFinanceProcessor
from finrl.plot import backtest_stats, backtest_plot, get_daily_return, get_baseline
from pprint import pprint


import itertools



<a id='1.4'></a>
## 2.4. Create Folders

In [2]:
from finrl import config
from finrl import config_tickers
import os
from finrl.main import check_and_make_directories
from finrl.config import (
    DATA_SAVE_DIR,
    TRAINED_MODEL_DIR,
    TENSORBOARD_LOG_DIR,
    RESULTS_DIR,
    INDICATORS,
    TRAIN_START_DATE,
    TRAIN_END_DATE,
    TEST_START_DATE,
    TEST_END_DATE,
    TRADE_START_DATE,
    TRADE_END_DATE,
)
check_and_make_directories([DATA_SAVE_DIR, TRAINED_MODEL_DIR, TENSORBOARD_LOG_DIR, RESULTS_DIR])



In [3]:
# -------------------------
# Risk-Aware Reward Wrapper
# -------------------------
import numpy as np
from collections import deque

try:
    import gymnasium as gym
    GYMNASIUM = True
except ImportError:
    import gym
    GYMNASIUM = False

def _unpack_step(result):
    if len(result) == 5:
        obs, reward, terminated, truncated, info = result
        done = terminated or truncated
        return obs, reward, done, info, (terminated, truncated)
    else:
        obs, reward, done, info = result
        return obs, reward, done, info, None

def _pack_step(obs, reward, done, info, tt):
    if tt is None:
        return obs, reward, done, info
    else:
        terminated, truncated = tt
        return obs, reward, terminated, truncated, info

def _unpack_reset(result):
    if isinstance(result, tuple) and len(result) == 2:
        return result[0], result[1], True
    else:
        return result, {}, False

def _pack_reset(obs, info, is_gymnasium):
    return (obs, info) if is_gymnasium else obs


class RiskAwareRewardWrapper(gym.Wrapper):
    def __init__(self, env, mode="sharpe", window=63, annualization=252, cvar_alpha=0.05, sortino_target=0.0, scale=1.0):
        super().__init__(env)
        self.mode = mode.lower()
        self.window = window
        self.annualization = annualization
        self.cvar_alpha = cvar_alpha
        self.sortino_target = sortino_target
        self.scale = scale
        self._rets = deque(maxlen=window)
        self._prev_asset = None

    def _rolling_sharpe(self, rets):
        return 0.0 if rets.size < 2 or np.std(rets) == 0 else \
            (np.mean(rets) / (np.std(rets) + 1e-12)) * np.sqrt(self.annualization)

    def _rolling_sortino(self, rets):
        if rets.size < 2:
            return 0.0
        downside = rets[rets < self.sortino_target] - self.sortino_target
        if downside.size == 0:
            return 0.0
        downside_std = np.sqrt((downside ** 2).mean()) + 1e-12
        return (np.mean(rets) - self.sortino_target) / downside_std * np.sqrt(self.annualization)

    def _rolling_cvar(self, rets):
        if rets.size < 2:
            return 0.0
        losses = -rets
        q = np.quantile(losses, 1 - self.cvar_alpha)
        tail = losses[losses >= q]
        return -tail.mean() if tail.size > 0 else 0.0

    def reset(self, **kwargs):
        res = self.env.reset(**kwargs)
        obs, info, is_gymnasium = _unpack_reset(res)
        self._rets.clear()
        asset_memory = getattr(self.env.unwrapped, "asset_memory", [])
        self._prev_asset = asset_memory[-1] if len(asset_memory) > 0 else getattr(self.env.unwrapped, "initial_amount", 1.0)
        return _pack_reset(obs, info, is_gymnasium)

    def step(self, action):
        res = self.env.step(action)
        obs, orig_reward, done, info, tt = _unpack_step(res)

        asset_memory = getattr(self.env.unwrapped, "asset_memory", [])
        if len(asset_memory) > 0:
            current_asset = asset_memory[-1]
            ret = (current_asset - self._prev_asset) / (self._prev_asset + 1e-12)
            self._prev_asset = current_asset
            self._rets.append(ret)

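            if self.mode == "pnl":
                shaped_reward = orig_reward
            elif self.mode == "sharpe":
                shaped_reward = self._rolling_sharpe(np.array(self._rets)) * self.scale
            elif self.mode == "sortino":
                shaped_reward = self._rolling_sortino(np.array(self._rets)) * self.scale
            elif self.mode == "cvar":
                shaped_reward = self._rolling_cvar(np.array(self._rets)) * self.scale
            else:
                # Fail loudly instead of hitting an unbound shaped_reward below
                raise ValueError(f"Unknown reward mode: {self.mode!r}")
        else:
            # No asset history available: fall back to the environment's own reward
            shaped_reward = orig_reward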

        return _pack_step(obs, shaped_reward, done, info, tt)
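
To make the shaped rewards easier to check by hand, the sketch below restates the wrapper's three risk measures as standalone functions and evaluates them on a four-day toy return series. The function names and the toy data are illustrative assumptions; the formulas mirror the wrapper's private methods above.

```python
import numpy as np

def rolling_sharpe(rets, annualization=252):
    rets = np.asarray(rets, dtype=float)
    if rets.size < 2 or np.std(rets) == 0:
        return 0.0
    return float(np.mean(rets) / np.std(rets) * np.sqrt(annualization))

def rolling_sortino(rets, target=0.0, annualization=252):
    rets = np.asarray(rets, dtype=float)
    downside = rets[rets < target] - target      # only below-target returns
    if rets.size < 2 or downside.size == 0:
        return 0.0
    downside_std = np.sqrt(np.mean(downside ** 2))
    return float((np.mean(rets) - target) / downside_std * np.sqrt(annualization))

def rolling_cvar(rets, alpha=0.05):
    rets = np.asarray(rets, dtype=float)
    if rets.size < 2:
        return 0.0
    losses = -rets
    q = np.quantile(losses, 1 - alpha)           # loss level at the (1 - alpha) quantile
    tail = losses[losses >= q]                   # losses at or beyond that level
    return float(-tail.mean())                   # negated, so larger tail losses give lower reward

rets = np.array([0.01, -0.02, 0.03, -0.01])      # toy daily returns
sharpe = rolling_sharpe(rets)
sortino = rolling_sortino(rets)
cvar = rolling_cvar(rets, alpha=0.5)             # average of the worst half, sign-flipped
print(sharpe, sortino, cvar)
```

Note that the downside deviation here averages only over the below-target observations, as in the wrapper; a common alternative divides by the total number of observations, which yields a larger Sortino ratio.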


<a id='2'></a>
# Part 3. Download Data
Yahoo Finance provides stock data, financial news, financial reports, etc., and is free to use.
* FinRL uses the **YahooDownloader** class in FinRL-Meta to fetch data via the Yahoo Finance API.
* Call Limit: Using the Public API (without authentication), you are limited to 2,000 requests per hour per IP (or up to a total of 48,000 requests a day).



-----
class YahooDownloader:
    Retrieving daily stock data from
    Yahoo Finance API

    Attributes
    ----------
        start_date : str
            start date of the data (modified from config.py)
        end_date : str
            end date of the data (modified from config.py)
        ticker_list : list
            a list of stock tickers (modified from config.py)

    Methods
    -------
    fetch_data()


In [None]:
TRAIN_START_DATE = '2010-01-01'
TRAIN_END_DATE = '2023-10-01'
TRADE_START_DATE = '2023-10-01'
TRADE_END_DATE = '2025-06-01'

In [5]:
# from config.py, TRAIN_START_DATE is a string
TRAIN_START_DATE
# from config.py, TRAIN_END_DATE is a string
TRAIN_END_DATE

'2023-10-01'

In [7]:
df = YahooDownloader(start_date = TRAIN_START_DATE,
                    end_date = TRADE_END_DATE,
                    ticker_list = config_tickers.DOW_30_TICKER).fetch_data()
# yfp = YahooFinanceProcessor()
# df = yfp.scrap_data(['AXP', 'AMGN', 'AAPL'], '2010-01-01', '2010-02-01')
print(df)



[*********************100%***********************]  1 of 1 completed

Shape of DataFrame:  (112073, 8)
Price         date       close        high         low        open     volume  \
0       2010-01-04    6.431897    6.446623    6.382908    6.414465  493729600   
1       2010-01-04   39.913239   40.016962   39.111101   39.159506    5277400   
2       2010-01-04   32.637966   32.781535   32.215237   32.550232    6894300   
3       2010-01-04   43.777550   43.941189   42.702201   43.419101    6186700   
4       2010-01-04   39.403469   39.834181   38.703561   38.797781    7325600   
...            ...         ...         ...         ...         ...        ...   
112068  2025-02-28  469.605164  470.989374  459.243337  461.734915    6146300   
112069  2025-02-28  362.108612  363.396482  353.123534  354.121876   15857300   
112070  2025-02-28   41.743721   42.382952   41.269142   42.063338   25197500   
112071  2025-02-28   10.680000   11.490000   10.480000   10.630000   52407400   
112072  2025-02-28   98.102760   98.351474   96.670165   97.286979   2545130

In [8]:
print(config_tickers.DOW_30_TICKER)

['AXP', 'AMGN', 'AAPL', 'BA', 'CAT', 'CSCO', 'CVX', 'GS', 'HD', 'HON', 'IBM', 'INTC', 'JNJ', 'KO', 'JPM', 'MCD', 'MMM', 'MRK', 'MSFT', 'NKE', 'PG', 'TRV', 'UNH', 'CRM', 'VZ', 'V', 'WBA', 'WMT', 'DIS', 'DOW']


In [9]:
df.shape

(112073, 8)

In [10]:
df.sort_values(['date','tic'],ignore_index=True).head()

Price,date,close,high,low,open,volume,tic,day
0,2010-01-04,6.431897,6.446623,6.382908,6.414465,493729600,AAPL,0
1,2010-01-04,39.913239,40.016962,39.111101,39.159506,5277400,AMGN,0
2,2010-01-04,32.637966,32.781535,32.215237,32.550232,6894300,AXP,0
3,2010-01-04,43.77755,43.941189,42.702201,43.419101,6186700,BA,0
4,2010-01-04,39.403469,39.834181,38.703561,38.797781,7325600,CAT,0


# Part 4: Preprocess Data
We need to check for missing data and do feature engineering to convert each data point into a state.
* **Adding technical indicators**. In practice, a trader takes many kinds of information into account, such as historical prices, current share holdings, and technical indicators. Here, we add the technical indicators listed in `INDICATORS`, such as MACD and RSI.
* **Adding turbulence index**. Risk aversion reflects whether an investor prefers to protect capital. It also influences one's trading strategy when facing different levels of market volatility. To control risk in a worst-case scenario, such as the financial crisis of 2007–2008, FinRL employs the turbulence index, which measures extreme fluctuations in asset prices.
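
A turbulence index is commonly defined as the Mahalanobis distance of the current return vector from its historical mean. The sketch below illustrates that definition with NumPy; `turbulence` is a hypothetical helper, and FinRL's own implementation may differ in windowing and other details.

```python
import numpy as np

def turbulence(hist_returns, current_returns):
    """Mahalanobis-style distance of today's returns from the historical mean (illustrative)."""
    hist = np.asarray(hist_returns, dtype=float)       # shape (T, N): T days, N assets
    y = np.asarray(current_returns, dtype=float)       # shape (N,)
    mu = hist.mean(axis=0)
    cov = np.atleast_2d(np.cov(hist, rowvar=False))    # N x N sample covariance
    diff = y - mu
    # pinv tolerates a singular covariance matrix
    return float(diff @ np.linalg.pinv(cov) @ diff)

hist = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
calm = turbulence(hist, [0.0, 0.0])       # returns equal to the historical mean
stressed = turbulence(hist, [1.0, 1.0])   # both assets move together, unlike in the history
print(calm, stressed)
```

Days whose returns sit far from the historical pattern, in size or in co-movement, receive a large index value.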

In [11]:
fe = FeatureEngineer(
                    use_technical_indicator=True,
                    tech_indicator_list = INDICATORS,
                    use_vix=True,
                    use_turbulence=True,
                    user_defined_feature = False)

processed = fe.preprocess_data(df)

Successfully added technical indicators


[*********************100%***********************]  1 of 1 completed


Shape of DataFrame:  (3812, 8)
Successfully added vix
Successfully added turbulence index


In [12]:
list_ticker = processed["tic"].unique().tolist()
list_date = list(pd.date_range(processed['date'].min(),processed['date'].max()).astype(str))
combination = list(itertools.product(list_date,list_ticker))

processed_full = pd.DataFrame(combination,columns=["date","tic"]).merge(processed,on=["date","tic"],how="left")
processed_full = processed_full[processed_full['date'].isin(processed['date'])]
processed_full = processed_full.sort_values(['date','tic'])

processed_full = processed_full.fillna(0)

In [13]:
processed_full.sort_values(['date','tic'],ignore_index=True).head(10)

Unnamed: 0,date,tic,close,high,low,open,volume,day,macd,boll_ub,boll_lb,rsi_30,cci_30,dx_30,close_30_sma,close_60_sma,vix,turbulence
0,2010-01-04,AAPL,6.431897,6.446623,6.382908,6.414465,493729600.0,0.0,0.0,6.453182,6.421731,100.0,66.666667,100.0,6.431897,6.431897,20.040001,0.0
1,2010-01-04,AMGN,39.913239,40.016962,39.111101,39.159506,5277400.0,0.0,0.0,6.453182,6.421731,100.0,66.666667,100.0,39.913239,39.913239,20.040001,0.0
2,2010-01-04,AXP,32.637966,32.781535,32.215237,32.550232,6894300.0,0.0,0.0,6.453182,6.421731,100.0,66.666667,100.0,32.637966,32.637966,20.040001,0.0
3,2010-01-04,BA,43.77755,43.941189,42.702201,43.419101,6186700.0,0.0,0.0,6.453182,6.421731,100.0,66.666667,100.0,43.77755,43.77755,20.040001,0.0
4,2010-01-04,CAT,39.403469,39.834181,38.703561,38.797781,7325600.0,0.0,0.0,6.453182,6.421731,100.0,66.666667,100.0,39.403469,39.403469,20.040001,0.0
5,2010-01-04,CRM,18.542521,18.718478,18.386389,18.490477,7906000.0,0.0,0.0,6.453182,6.421731,100.0,66.666667,100.0,18.542521,18.542521,20.040001,0.0
6,2010-01-04,CSCO,16.158169,16.256335,15.713148,15.778593,59853700.0,0.0,0.0,6.453182,6.421731,100.0,66.666667,100.0,16.158169,16.158169,20.040001,0.0
7,2010-01-04,CVX,42.60355,42.678992,42.118565,42.140116,10173800.0,0.0,0.0,6.453182,6.421731,100.0,66.666667,100.0,42.60355,42.60355,20.040001,0.0
8,2010-01-04,DIS,27.47588,28.058468,27.304531,27.844281,13700400.0,0.0,0.0,6.453182,6.421731,100.0,66.666667,100.0,27.47588,27.47588,20.040001,0.0
9,2010-01-04,GS,131.99202,132.884268,129.269507,129.681321,9135000.0,0.0,0.0,6.453182,6.421731,100.0,66.666667,100.0,131.99202,131.99202,20.040001,0.0


In [14]:
mvo_df = processed_full.sort_values(['date','tic'],ignore_index=True)[['date','tic','close']]

<a id='4'></a>
# Part 5. Build A Market Environment in OpenAI Gym-style
The training process involves observing stock price changes, taking an action, and computing the reward. By interacting with the market environment, the agent will eventually derive a trading strategy that may maximize (expected) rewards.

Our market environment, based on OpenAI Gym, simulates stock markets with historical market data.

## Data Split
We split the data into training set and testing set as follows:

Training data period: 2010-01-01 to 2023-10-01

Trading data period: 2023-10-01 to 2025-06-01


In [15]:
train = data_split(processed_full, TRAIN_START_DATE,TRAIN_END_DATE)
trade = data_split(processed_full, TRADE_START_DATE,TRADE_END_DATE)
train_length = len(train)
trade_length = len(trade)
print(train_length)
print(trade_length)

100311
10237


In [16]:
train.tail()

Unnamed: 0,date,tic,close,high,low,open,volume,day,macd,boll_ub,boll_lb,rsi_30,cci_30,dx_30,close_30_sma,close_60_sma,vix,turbulence
3458,2023-09-29,UNH,489.152222,494.672516,488.773839,494.643412,3006200.0,4.0,4.758833,499.945947,450.959745,55.148637,128.373531,19.202287,475.052339,477.039841,17.52,57.508258
3458,2023-09-29,V,227.020599,230.001353,226.665278,229.81382,6045200.0,4.0,-3.066167,251.200594,224.032595,41.887448,-154.263239,36.909268,238.148426,237.149248,17.52,57.508258
3458,2023-09-29,VZ,28.312366,28.60938,28.277423,28.495815,19787600.0,4.0,-0.276669,30.421852,28.184129,41.243825,-137.570404,33.867227,29.353075,29.363413,17.52,57.508258
3458,2023-09-29,WBA,20.083014,20.254587,18.854915,18.963277,25663400.0,4.0,-1.050146,20.962601,18.592989,36.119163,-53.011622,26.754771,20.958936,23.656806,17.52,57.508258
3458,2023-09-29,WMT,52.213684,53.258414,51.972087,53.225766,18842400.0,4.0,0.139828,54.153628,52.232611,50.578496,-24.941142,17.468396,52.741927,52.075983,17.52,57.508258


In [17]:
trade.head()

Unnamed: 0,date,tic,close,high,low,open,volume,day,macd,boll_ub,boll_lb,rsi_30,cci_30,dx_30,close_30_sma,close_60_sma,vix,turbulence
0,2023-10-02,AAPL,172.259903,172.805189,169.46408,169.751602,52164500.0,0.0,-2.561662,183.575462,165.836042,45.24316,-87.977365,26.12904,176.566325,181.009776,17.610001,34.480751
0,2023-10-02,AMGN,252.198303,254.262779,250.37058,253.524112,1912300.0,0.0,3.867161,261.151685,237.536332,59.425474,72.228976,31.542808,247.539639,237.412815,17.610001,34.480751
0,2023-10-02,AXP,145.907852,146.317814,144.531558,144.873186,2657600.0,0.0,-3.052554,160.817397,143.761846,37.445467,-174.733412,36.526719,153.348369,158.819425,17.610001,34.480751
0,2023-10-02,BA,187.830002,192.440002,186.929993,191.470001,5244700.0,0.0,-8.257769,223.250012,185.200986,32.814602,-138.625438,66.41739,211.389666,218.712,17.610001,34.480751
0,2023-10-02,CAT,262.851013,267.774871,261.41931,263.740997,1778200.0,0.0,-0.878852,277.101112,259.928749,51.346785,-69.197357,1.038354,268.083485,264.02185,17.610001,34.480751


In [18]:
INDICATORS

['macd',
 'boll_ub',
 'boll_lb',
 'rsi_30',
 'cci_30',
 'dx_30',
 'close_30_sma',
 'close_60_sma']

In [19]:
stock_dimension = len(train.tic.unique())
state_space = 1 + 2*stock_dimension + len(INDICATORS)*stock_dimension
print(f"Stock Dimension: {stock_dimension}, State Space: {state_space}")

Stock Dimension: 29, State Space: 291
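
The state-space size printed above follows directly from the formula in the cell: one cash balance, one price and one holding per stock, and one value per indicator per stock. The sketch below spells out that arithmetic; the component names in `state_layout` are assumptions about how the observation vector is organized, used here only for illustration.

```python
def state_layout(n_stocks, n_indicators):
    """Break the state dimension into its components (names are illustrative)."""
    layout = {
        "cash": 1,                                # account balance
        "prices": n_stocks,                       # one close price per stock
        "holdings": n_stocks,                     # shares held per stock
        "indicators": n_indicators * n_stocks,    # each indicator for each stock
    }
    return layout, sum(layout.values())

layout, size = state_layout(n_stocks=29, n_indicators=8)
print(layout, size)
```

With 29 stocks and 8 indicators this gives 1 + 29 + 29 + 232 = 291, matching the output above.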


In [20]:
buy_cost_list = sell_cost_list = [0.001] * stock_dimension
num_stock_shares = [0] * stock_dimension

env_kwargs = {
    "hmax": 100,
    "initial_amount": 1000000,
    "num_stock_shares": num_stock_shares,
    "buy_cost_pct": buy_cost_list,
    "sell_cost_pct": sell_cost_list,
    "state_space": state_space,
    "stock_dim": stock_dimension,
    "tech_indicator_list": INDICATORS,
    "action_space": stock_dimension,
    "reward_scaling": 1e-4
}


e_train_gym = StockTradingEnv(df = train, **env_kwargs)

## Environment for Training



In [24]:
from stable_baselines3.common.vec_env import DummyVecEnv

def make_wrapped_env():
    return RiskAwareRewardWrapper(
        e_train_gym,   # your original StockTradingEnv instance
        mode="sharpe", # "sharpe", "sortino", "cvar", or "pnl"
        window=63,
        annualization=252,
        scale=1.0
    )

env_train = DummyVecEnv([make_wrapped_env])

<a id='5'></a>
# Part 6: Train DRL Agents
* The DRL algorithms are from **Stable Baselines 3**. Users are also encouraged to try **ElegantRL** and **Ray RLlib**.
* FinRL includes fine-tuned standard DRL algorithms, such as DQN, DDPG, Multi-Agent DDPG, PPO, SAC, A2C and TD3. We also allow users to design their own DRL algorithms by adapting these DRL algorithms.

### Agent Training: 5 algorithms (A2C, DDPG, PPO, TD3, SAC)


### Agent 3: PPO

In [26]:
agent = DRLAgent(env = env_train)
PPO_PARAMS = {
    "n_steps": 2048,
    "ent_coef": 0.01,
    "learning_rate": 0.00025,
    "batch_size": 128,
}
model_ppo = agent.get_model("ppo",model_kwargs = PPO_PARAMS)

# if if_using_ppo:
#   # set up logger
#   tmp_path = RESULTS_DIR + '/ppo'
#   new_logger_ppo = configure(tmp_path, ["stdout", "csv", "tensorboard"])
#   # Set new logger
#   model_ppo.set_logger(new_logger_ppo)

{'n_steps': 2048, 'ent_coef': 0.01, 'learning_rate': 0.00025, 'batch_size': 128}
Using cpu device


In [None]:
trained_ppo = agent.train_model(model=model_ppo,
                             tb_log_name='ppo',
                             total_timesteps=200000)

-----------------------------------
| time/              |            |
|    fps             | 75         |
|    iterations      | 1          |
|    time_elapsed    | 27         |
|    total_timesteps | 2048       |
| train/             |            |
|    reward          | 1.3621887  |
|    reward_max      | 8.5048275  |
|    reward_mean     | 1.285091   |
|    reward_min      | -10.501652 |
-----------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 74          |
|    iterations           | 2           |
|    time_elapsed         | 55          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.013983684 |
|    clip_fraction        | 0.212       |
|    clip_range           | 0.2         |
|    entropy_loss         | -41.2       |
|    explained_variance   | -0.00591    |
|    learning_rate        | 0.00025     |
|    loss             

## In-sample Performance

Assume that the initial capital is $1,000,000.

### Set turbulence threshold
Set the turbulence threshold using the in-sample risk indicator, for example at or above a high quantile of its in-sample values. If the current value of the indicator is greater than the threshold, we assume that the current market is volatile.

In [29]:
data_risk_indicator = processed_full[(processed_full.date<TRAIN_END_DATE) & (processed_full.date>=TRAIN_START_DATE)]
insample_risk_indicator = data_risk_indicator.drop_duplicates(subset=['date'])

In [30]:
insample_risk_indicator.vix.describe()

Unnamed: 0,vix
count,3459.0
mean,18.633854
std,7.150486
min,9.14
25%,13.705
50%,16.870001
75%,21.57
max,82.690002


In [31]:
insample_risk_indicator.vix.quantile(0.996)

np.float64(54.89679946899445)

In [32]:
insample_risk_indicator.turbulence.describe()

Unnamed: 0,turbulence
count,3459.0
mean,34.754934
std,42.34923
min,0.0
25%,15.118423
50%,24.215262
75%,39.726569
max,652.508567


In [33]:
insample_risk_indicator.turbulence.quantile(0.996)

np.float64(263.5029778106239)
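
The quantile calls above can be wrapped into a small helper that derives a threshold from an in-sample risk indicator and flags the days that exceed it. This is a minimal sketch: `risk_flags` and the synthetic series are illustrative assumptions, not part of FinRL.

```python
import pandas as pd

def risk_flags(indicator, q=0.996):
    """Return the q-quantile threshold and a boolean mask of days above it."""
    threshold = float(indicator.quantile(q))
    return threshold, indicator > threshold

# Synthetic indicator taking the values 1.0, 2.0, ..., 1000.0
indicator = pd.Series([float(v) for v in range(1, 1001)])
threshold, flags = risk_flags(indicator)
print(threshold, int(flags.sum()))
```

Applied to the in-sample VIX series, the same quantile gives roughly 54.9, which is consistent with the rounded threshold of 55 passed to the trading environment in this notebook.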

In [50]:
from stable_baselines3.common.vec_env import DummyVecEnv

# ---- 1) Tiny adapter so DRLAgent.DRL_prediction() is happy ----
class SB3EnvAdapter:
    """
    Wraps your (wrapped) env and exposes:
      - get_sb_env(): returns (DummyVecEnv, first_obs)
      - df: forwarded so DRLAgent can read environment.df
    """
    def __init__(self, wrapped_env):
        self.wrapped_env = wrapped_env
        # DRLAgent.DRL_prediction uses environment.df, so forward it
        self.df = wrapped_env.unwrapped.df

    def get_sb_env(self):
        e = DummyVecEnv([lambda: self.wrapped_env])
        obs = e.reset()
        return e, obs

### Trading (Out-of-sample Performance)

We update the model periodically to take full advantage of the data, e.g., by retraining quarterly, monthly, or weekly, and we tune the parameters along the way. In this notebook we use the in-sample data from 2010-01 to 2023-10 to tune the parameters only once, so some alpha decay occurs as the trading period lengthens.

Numerous hyperparameters – e.g. the learning rate, the total number of samples to train on – influence the learning process and are usually determined by testing some variations.
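
One way to organize the periodic retraining described above is an expanding-window, walk-forward schedule: each period is traded with a model trained on all data before it. The sketch below only generates the date windows and is not used elsewhere in this notebook; `walk_forward_windows` and the three-month step are illustrative assumptions.

```python
import pandas as pd

def walk_forward_windows(train_start, trade_start, trade_end, months=3):
    """Expanding training windows, each followed by a fixed-length trading window."""
    train_start = pd.Timestamp(train_start)
    t = pd.Timestamp(trade_start)
    end = pd.Timestamp(trade_end)
    windows = []
    while t < end:
        nxt = min(t + pd.DateOffset(months=months), end)   # clip the last window
        windows.append({"train": (train_start, t), "trade": (t, nxt)})
        t = nxt
    return windows

windows = walk_forward_windows("2010-01-01", "2023-10-01", "2025-06-01")
for w in windows:
    print(w["train"][1].date(), "->", w["trade"][1].date())
```

Each window's training range ends exactly where its trading range begins, so no trading-period data leaks into training.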

In [52]:
e_trade_gym = StockTradingEnv(df = trade, turbulence_threshold = 55,risk_indicator_col='vix', **env_kwargs)
# env_trade, obs_trade = e_trade_gym.get_sb_env()

In [53]:
# ---- 2) Wrap the trading env with the same reward shaping used in training ----
wrapped_e_trade_gym = RiskAwareRewardWrapper(
    e_trade_gym,
    mode="sharpe",     # same as you used for training
    window=63,
    annualization=252,
    scale=1.0
)

# ---- 3) Adapt it for DRLAgent.DRL_prediction ----
env_adapter = SB3EnvAdapter(wrapped_e_trade_gym)




In [54]:
trained_model = trained_ppo
df_account_value_ppo, df_actions_ppo = DRLAgent.DRL_prediction(
    model=trained_model,
    environment=env_adapter,
    deterministic=True
)

hit end!


<a id='7'></a>
# Part 6.5: Mean Variance Optimization

Mean-variance optimization (MVO) is a classic strategy in portfolio management. Here, we go through the whole mean-variance optimization process and add the result as a baseline for comparison.

First, reshape the dataframe into the form needed for the MVO weight calculation.

In [55]:
def process_df_for_mvo(df):
  df = df.sort_values(['date','tic'],ignore_index=True)[['date','tic','close']]
  fst = df
  fst = fst.iloc[0:stock_dimension, :]
  tic = fst['tic'].tolist()

  mvo = pd.DataFrame()

  for k in range(len(tic)):
    mvo[tic[k]] = 0

  for i in range(df.shape[0]//stock_dimension):
    n = df
    n = n.iloc[i * stock_dimension:(i+1) * stock_dimension, :]
    date = n['date'][i*stock_dimension]
    mvo.loc[date] = n['close'].tolist()

  return mvo
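
The loop above reshapes the long (date, tic, close) table into a wide matrix with one row per date and one column per ticker. Assuming every (date, tic) pair appears at most once, the same reshaping can be sketched with `DataFrame.pivot`; `close_price_matrix` and the toy prices below are illustrative and do not replace the function above.

```python
import pandas as pd

def close_price_matrix(df):
    """Wide close-price table: rows are dates, columns are tickers (sorted)."""
    wide = df.pivot(index="date", columns="tic", values="close")
    return wide.sort_index()

# Toy long-format data with two dates and two tickers
long_df = pd.DataFrame({
    "date": ["2023-10-02", "2023-10-02", "2023-10-03", "2023-10-03"],
    "tic": ["AAPL", "AMGN", "AAPL", "AMGN"],
    "close": [172.26, 252.20, 170.92, 247.18],
})
wide = close_price_matrix(long_df)
print(wide)
```

Unlike the slicing loop, `pivot` does not assume that each date has exactly `stock_dimension` rows; a missing pair shows up as NaN, and a duplicated pair raises an error.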

### Helper functions for mean returns and variance-covariance matrix

In [56]:
# The code in this section is partially adapted from Dr. G A Vijayalakshmi Pai:

# https://www.kaggle.com/code/vijipai/lesson-5-mean-variance-optimization-of-portfolios/notebook

def StockReturnsComputing(StockPrice, Rows, Columns):
  import numpy as np
  StockReturn = np.zeros([Rows-1, Columns])
  for j in range(Columns):        # j: Assets
    for i in range(Rows-1):     # i: Daily Prices
      StockReturn[i,j]=((StockPrice[i+1, j]-StockPrice[i,j])/StockPrice[i,j])* 100

  return StockReturn
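
The nested loops above compute simple daily returns in percent, (P[t+1] − P[t]) / P[t] × 100, for every asset. A vectorized sketch of the same formula follows; `percent_returns` and the toy prices are illustrative.

```python
import numpy as np

def percent_returns(prices):
    """Simple period-over-period returns in percent, one column per asset."""
    prices = np.asarray(prices, dtype=float)
    return np.diff(prices, axis=0) / prices[:-1] * 100

# Toy prices: three days, two assets
prices = np.array([[100.0, 50.0],
                   [110.0, 45.0],
                   [121.0, 45.0]])
returns = percent_returns(prices)
print(returns)
```

For a `Rows × Columns` price array this returns a `(Rows − 1) × Columns` array, the same shape as `StockReturn` above.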

### Calculate the weights for mean-variance

In [57]:
train_mvo = data_split(processed_full, TRAIN_START_DATE,TRAIN_END_DATE).reset_index()
trade_mvo = data_split(processed_full, TRADE_START_DATE,TRADE_END_DATE).reset_index()

In [58]:
StockData = process_df_for_mvo(train_mvo)
TradeData = process_df_for_mvo(trade_mvo)

TradeData.to_numpy()

array([[172.25990295, 252.19830322, 145.90785217, ...,  27.74454498,
         20.24555588,  52.26919174],
       [170.92149353, 247.17913818, 141.58372498, ...,  27.93672943,
         20.36294746,  51.9394455 ],
       [172.17070007, 251.37440491, 143.12597656, ...,  27.55235863,
         20.12816429,  52.5630188 ],
       ...,
       [246.71646118, 312.86508179, 291.63308716, ...,  42.33452606,
         11.38000011,  97.18749237],
       [240.04521179, 303.69613647, 294.37704468, ...,  41.82120895,
         11.03999996,  95.70515442],
       [236.98922729, 303.14102173, 292.48812866, ...,  41.90837479,
         11.22999954,  96.29212189]])

In [60]:
#compute asset returns
arStockPrices = np.asarray(StockData)
[Rows, Cols]=arStockPrices.shape
arReturns = StockReturnsComputing(arStockPrices, Rows, Cols)

#compute mean returns and variance covariance matrix of returns
meanReturns = np.mean(arReturns, axis = 0)
covReturns = np.cov(arReturns, rowvar=False)

#set precision for printing results
np.set_printoptions(precision=3, suppress = True)

#display mean returns and variance-covariance matrix of returns
print('Mean returns of assets in k-portfolio 1\n', meanReturns)
print('Variance-Covariance matrix of returns\n', covReturns)

Mean returns of assets in k-portfolio 1
 [0.111 0.065 0.06  0.069 0.072 0.095 0.047 0.052 0.044 0.041 0.089 0.065
 0.027 0.046 0.043 0.061 0.038 0.06  0.025 0.053 0.089 0.071 0.043 0.054
 0.099 0.084 0.027 0.012 0.048]
Variance-Covariance matrix of returns
 [[3.196 0.949 1.409 1.618 1.351 1.859 1.415 1.091 1.236 1.404 1.247 1.242
  1.022 1.667 0.667 1.294 0.693 0.835 1.068 0.693 1.739 1.329 0.696 0.837
  1.124 1.414 0.543 0.983 0.655]
 [0.949 2.305 1.006 0.93  0.992 1.097 0.964 0.865 0.875 1.039 0.921 0.973
  0.808 1.104 0.865 1.052 0.625 0.641 0.883 1.002 1.004 0.828 0.699 0.811
  1.076 0.98  0.609 0.998 0.61 ]
 [1.409 1.006 3.426 2.516 1.926 1.679 1.435 1.805 1.806 2.225 1.403 1.801
  1.314 1.58  0.805 2.337 0.971 1.053 1.399 0.877 1.435 1.546 0.723 1.464
  1.341 1.858 0.736 1.251 0.582]
 [1.618 0.93  2.516 5.157 2.126 1.819 1.487 2.018 1.951 2.194 1.527 2.079
  1.456 1.822 0.817 2.24  1.062 1.172 1.464 0.843 1.501 1.72  0.727 1.527
  1.376 1.734 0.721 1.417 0.597]
 [1.351 0.992 1.92

### Use PyPortfolioOpt

In [61]:
from pypfopt.efficient_frontier import EfficientFrontier

ef_mean = EfficientFrontier(meanReturns, covReturns, weight_bounds=(0, 0.5))
raw_weights_mean = ef_mean.max_sharpe()
cleaned_weights_mean = ef_mean.clean_weights()
mvo_weights = np.array([1000000 * cleaned_weights_mean[i] for i in range(stock_dimension)])
mvo_weights

array([248190.,  14900.,      0.,      0.,      0.,   5590.,      0.,
            0.,      0.,      0., 190580.,      0.,      0.,      0.,
            0.,      0.,      0., 111530.,      0.,  24280.,  13000.,
            0.,      0.,      0., 277910.,  42680.,      0.,      0.,
        71340.])

In [62]:
LastPrice = np.array([1/p for p in StockData.tail(1).to_numpy()[0]])
Initial_Portfolio = np.multiply(mvo_weights, LastPrice)
Initial_Portfolio

array([1462.163,   58.542,    0.   ,    0.   ,    0.   ,   27.808,
          0.   ,    0.   ,    0.   ,    0.   ,  658.71 ,    0.   ,
          0.   ,    0.   ,    0.   ,    0.   ,    0.   ,  441.225,
          0.   ,  248.607,   41.724,    0.   ,    0.   ,    0.   ,
        568.146,  188.001,    0.   ,    0.   , 1366.308])
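
The cell above converts each dollar allocation into a share count by dividing it by the last training-period close (the variable named `LastPrice` actually holds reciprocal prices). Holding those shares fixed and valuing them at each trading day's closes gives a buy-and-hold equity curve. A small worked sketch with toy numbers follows; `buy_and_hold_values` is an illustrative helper.

```python
import numpy as np

def buy_and_hold_values(dollar_allocations, last_train_prices, trade_prices):
    """Turn dollar allocations into fixed share counts and value them over time."""
    shares = np.asarray(dollar_allocations, dtype=float) / np.asarray(last_train_prices, dtype=float)
    values = np.asarray(trade_prices, dtype=float) @ shares   # one portfolio value per trading day
    return shares, values

shares, values = buy_and_hold_values(
    dollar_allocations=[600.0, 400.0],        # dollars per asset
    last_train_prices=[10.0, 20.0],           # closes on the last training day
    trade_prices=[[11.0, 20.0], [12.0, 19.0]],
)
print(shares, values)
```

This sketch allows fractional shares and ignores transaction costs, as the MVO baseline in this section does.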

In [85]:
Portfolio_Assets = TradeData @ Initial_Portfolio
MVO_result = pd.DataFrame(Portfolio_Assets, columns=["Mean Var"])
MVO_result

Unnamed: 0,Mean Var
2023-10-02,1.005879e+06
2023-10-03,9.929165e+05
2023-10-04,9.981443e+05
2023-10-05,1.000998e+06
2023-10-06,1.008246e+06
...,...
2025-02-21,1.263438e+06
2025-02-24,1.261625e+06
2025-02-25,1.280903e+06
2025-02-26,1.259473e+06


In [86]:
# Print final account value from MVO_result
final_account_value = MVO_result["Mean Var"].iloc[-1]
print("Final Account Value (Mean-Variance Portfolio):", final_account_value)


Final Account Value (Mean-Variance Portfolio): 1261514.514853801


In [83]:
# Print the final account value
final_account_value = df_account_value_ppo['account_value'].iloc[-1]
print("Final Account Value (PPO):", final_account_value)


Final Account Value (PPO): 1048368.5514036537


<a id='6'></a>
# Part 7: Backtesting Results
Backtesting plays a key role in evaluating the performance of a trading strategy. An automated backtesting tool is preferred because it reduces human error. We usually use the Quantopian pyfolio package to backtest our trading strategies. It is easy to use and provides a variety of individual plots that together give a comprehensive picture of a strategy's performance.

In [87]:
import numpy as np

# Calculate metrics for MVO
mvo_daily_returns = MVO_result["Mean Var"].pct_change().dropna()
mvo_cum_return = (MVO_result["Mean Var"].iloc[-1] / MVO_result["Mean Var"].iloc[0]) - 1
mvo_sharpe = (np.mean(mvo_daily_returns) / np.std(mvo_daily_returns)) * np.sqrt(252)
mvo_max_drawdown = ((MVO_result["Mean Var"] / MVO_result["Mean Var"].cummax()) - 1).min()

# Calculate metrics for PPO
ppo_daily_returns = df_account_value_ppo["account_value"].pct_change().dropna()
ppo_cum_return = (df_account_value_ppo["account_value"].iloc[-1] / df_account_value_ppo["account_value"].iloc[0]) - 1
ppo_sharpe = (np.mean(ppo_daily_returns) / np.std(ppo_daily_returns)) * np.sqrt(252)
ppo_max_drawdown = ((df_account_value_ppo["account_value"] / df_account_value_ppo["account_value"].cummax()) - 1).min()

# Print comparison
print("---- Mean-Variance Portfolio ----")
print(f"Cumulative Return: {mvo_cum_return:.2%}")
print(f"Sharpe Ratio: {mvo_sharpe:.2f}")
print(f"Max Drawdown: {mvo_max_drawdown:.2%}")

print("\n---- PPO Strategy ----")
print(f"Cumulative Return: {ppo_cum_return:.2%}")
print(f"Sharpe Ratio: {ppo_sharpe:.2f}")
print(f"Max Drawdown: {ppo_max_drawdown:.2%}")


---- Mean-Variance Portfolio ----
Cumulative Return: 25.41%
Sharpe Ratio: 1.39
Max Drawdown: -8.44%

---- PPO Strategy ----
Cumulative Return: 4.84%
Sharpe Ratio: 0.29
Max Drawdown: -16.78%
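
The three metrics are computed identically for both strategies, so they can be gathered into one helper. The sketch below follows the formulas in the cell above (population standard deviation, no risk-free rate); `performance_summary` and the toy value series are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def performance_summary(values, periods_per_year=252):
    """Cumulative return, annualized Sharpe ratio, and max drawdown of a value series."""
    values = pd.Series(values, dtype=float)
    daily = values.pct_change().dropna()
    cum_return = values.iloc[-1] / values.iloc[0] - 1
    std = daily.std(ddof=0)                            # population std, as np.std uses by default
    sharpe = daily.mean() / std * np.sqrt(periods_per_year) if std > 0 else 0.0
    max_drawdown = (values / values.cummax() - 1).min()   # worst drop from a running peak
    return {
        "cum_return": float(cum_return),
        "sharpe": float(sharpe),
        "max_drawdown": float(max_drawdown),
    }

summary = performance_summary([100.0, 110.0, 99.0, 120.0])
print(summary)
```

In the toy series the value ends 20% above its start, and the worst peak-to-trough drop is from 110 to 99, i.e. −10%.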


In [73]:
df_result_ppo

Unnamed: 0_level_0,ppo
date,Unnamed: 1_level_1
2023-10-02,1.000000e+06
2023-10-03,9.997958e+05
2023-10-04,1.000007e+06
2023-10-05,9.999740e+05
2023-10-06,1.000594e+06
...,...
2025-02-21,1.047275e+06
2025-02-24,1.065161e+06
2025-02-25,1.073665e+06
2025-02-26,1.046418e+06
