# Deep Reinforcement Learning for Portfolio Optimization - MLP Architecture


This experiement demonstrates the application of deep reinforcement learning (DRL) techniques for portfolio optimization.

- Policy network architecture: **MLP backbone**
- Compares `A2C`, `PPO`, `SAC`, `DDPG`, `TD3` all with simple MLPs

## Dependencies


In [1]:
# ! pip install pandas numpy matplotlib \
#                stable-baselines3 \
#                PyPortfolioOpt \
#                pandas_market_calendars quantstats gymnasium \
#                git+https://github.com/AI4Finance-Foundation/FinRL.git -q

In [2]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time

from tqdm.auto import tqdm

import torch

from stable_baselines3 import A2C, PPO, SAC, DDPG, TD3
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.callbacks import EvalCallback

from finrl import config_tickers
from finrl.meta.preprocessor.yahoodownloader import YahooDownloader
from finrl.meta.preprocessor.preprocessors import FeatureEngineer, data_split
from finrl.meta.env_portfolio_allocation.env_portfolio import StockPortfolioEnv
from finrl.agents.stablebaselines3.models import DRLAgent
from finrl.plot import backtest_stats, get_baseline, backtest_plot

from pypfopt.efficient_frontier import EfficientFrontier



In [3]:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

%matplotlib inline

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cpu


In [None]:
experiment_name = "mlp_models_val"
results_dir = f"results/models/{experiment_name}"
os.makedirs(results_dir, exist_ok=True)

## Data loading and pre-processing


Define training, validation and trading/test periods


In [6]:
train_start, train_end = "2010-01-01", "2018-12-31"
val_start,   val_end   = "2019-01-01", "2019-12-31"
test_start,  test_end  = "2020-01-01", "2021-12-31"

train_dates = (train_start, train_end)
val_dates = (val_start, val_end)
test_dates = (test_start, test_end)

print(f"Train period:      {train_start} → {train_end}")
print(f"Validation period: {val_start} → {val_end}")
print(f"Test period:       {test_start} → {test_end}")

Train period:      2010-01-01 → 2018-12-31
Validation period: 2019-01-01 → 2019-12-31
Test period:       2020-01-01 → 2021-12-31


- Fetch historical stock data for a given list of tickers within a specified date range.
- We use the DOW_30_TICKER stocks
- The data includes `date`, `close`, `high`, `low`, `open`, `volume`, and `tic` (ticker symbol).


In [7]:
def download_data(tickers, start_date, end_date):
    print(f"Downloading {start_date} → {end_date}")
    return YahooDownloader(
        start_date=start_date, end_date=end_date, ticker_list=tickers
    ).fetch_data()


df = download_data(config_tickers.DOW_30_TICKER, train_start, test_end)

Downloading 2010-01-01 → 2021-12-31


[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%********

Shape of DataFrame:  (88283, 8)


----
We apply feature engineering to the dataset of stock data:

- Add technical indicators (e.g., moving averages, RSI).
- Calculate turbulence indicators, which measure market volatility.

This Enhance the dataset with features that are critical for modeling market dynamics and making informed trading decisions.


In [8]:
def preprocess_data(df):
    fe = FeatureEngineer(use_technical_indicator=True, use_turbulence=True)
    return fe.preprocess_data(df)


df_feat = preprocess_data(df)

# TODO: Normalise the data??

Successfully added technical indicators
Successfully added turbulence index


## Covariance & Returns for State


- Calculate the rolling covariance matrices and daily returns for the given dataset of stock prices.
- This prepares the state representation (the state of the portfolio) for the RL models in the RL environments for portfolio optimization.
- The **rolling covariance matrices** (`cov_list`) capture the relationships between asset returns, while the daily returns (`return_list`) provide information about recent price movements.
- These metrics are critical for modeling the dynamics of the financial market and making informed trading decisions.


In [9]:
def compute_covariance_and_returns(df_feat, lookback=252):
    df_sorted = df_feat.sort_values(["date", "tic"], ignore_index=True)
    df_sorted.index = df_sorted.date.factorize()[0]
    cov_list, return_list = [], []

    dates = df_sorted.date.unique()
    for i in tqdm(range(lookback, len(dates)), desc="Computing covariance and returns"):
        win = df_sorted.loc[i - lookback : i]
        pm = win.pivot_table(index="date", columns="tic", values="close")
        rm = pm.pct_change().dropna()
        cov_list.append(rm.cov().values)
        return_list.append(rm)
    df_cov = pd.DataFrame(
        {"date": dates[lookback:], "cov_list": cov_list, "return_list": return_list}
    )

    return pd.merge(df_feat, df_cov, on="date", how="left").dropna(subset=["cov_list"])


df_all = compute_covariance_and_returns(df_feat)

Computing covariance and returns:   0%|          | 0/2768 [00:00<?, ?it/s]

## Train / Validation / Test split

In [10]:
def split_data(df_all, train_dates, validate_dates, test_dates):
    train = data_split(df_all, *train_dates)
    validate = data_split(df_all, *validate_dates)
    test = data_split(df_all, *test_dates)

    return train, validate, test


train_df, validate_df, test_df = split_data(df_all, train_dates, val_dates, test_dates)


print(f"Train set shape: {train_df.shape}")
print(f"Validation set shape: {validate_df.shape}")
print(f"Test set shape: {test_df.shape}")

Train set shape: (58319, 19)
Validation set shape: (7279, 19)
Test set shape: (14616, 19)


## Environment setup


- Create instances of the StockPortfolioEnv class for both training and testing datasets.
- It also wrap the training environment for use with Stable-Baselines3 (SB3).


In [11]:
def configure_environment(train, test, fe):
    stock_dim = len(train.tic.unique())
    env_kwargs = dict(
        stock_dim=stock_dim,  # Number of unique stocks
        hmax=100,  # Maximum number of shares that can be traded
        initial_amount=1e6,  # Initial portfolio value (e.g., $1,000,000)
        transaction_cost_pct=0.001,  # Transaction cost as a percentage (e.g., 0.1%)
        reward_scaling=1e-4,  # Scaling factor for rewards
        state_space=stock_dim,  # State space dimension (equal to stock_dim)
        action_space=stock_dim,  # Action space dimension (equal to stock_dim)
        tech_indicator_list=fe.tech_indicator_list,  # List of technical indicators
    )

    # create the two StockPortfolioEnv objects
    raw_train_env = StockPortfolioEnv(df=train, **env_kwargs)
    raw_test_env = StockPortfolioEnv(df=test, **env_kwargs)    
    raw_val_env = StockPortfolioEnv(df=validate_df, **env_kwargs)
        
    # Wrap the *training* env for SB3
    env_train_sb3, _ = raw_train_env.get_sb_env()

    return env_train_sb3, raw_train_env, raw_val_env, raw_test_env, env_kwargs

env_train_sb3, raw_train_env, raw_val_env, raw_test_env, env_kwargs = configure_environment(
    train_df, test_df, FeatureEngineer()
)

In [17]:
def create_validation_env_and_callback(
    raw_val_env, save_dir, eval_freq=20, n_eval_episodes=10_000
):

    val_env_sb3 = DummyVecEnv([lambda: raw_val_env])

    eval_callback = EvalCallback(
        val_env_sb3,
        best_model_save_path=os.path.join(save_dir, "best_model/"),
        log_path=os.path.join(save_dir, "logs/"),
        eval_freq=eval_freq,
        n_eval_episodes=n_eval_episodes,
        deterministic=True,
        render=False,
    )

    return val_env_sb3, eval_callback

## Training


- We define the configuration for various RL models to be trained in the portfolio optimization environment.
- The training environment (`env_train_sb3`) is wrapped for use with Stable-Baselines3 (SB3).
- The SB3 environment provides the `state` and `action space` dimensions needed for configuring the models.


In [18]:
def prepare_models():
    model_configs = [
        (A2C, "A2C", {}),
        (PPO, "PPO", {}),
        (SAC, "SAC", {}),
        (DDPG, "DDPG", {}),
        (TD3, "TD3", {}),
    ]
    return model_configs

model_configs = prepare_models()

Train multiple reinforcement learning (RL) models using the specified training environment and configuration.


In [None]:
def train_models(env, model_configs, save_dir, timesteps_override=None):
    # Build validation env + callback once
    val_env_sb3, eval_callback = create_validation_env_and_callback(
        raw_val_env,
        results_dir,
        # eval_freq=timesteps_map[name] // 4,  # e.g. every 25% of total steps
        eval_freq=20, # TODO: make this dynamic
        n_eval_episodes=5
    )
    timesteps_map = {
        "A2C": 150_000,
        "PPO": 250_000,
        "SAC": 1_000_000,
        "DDPG":1_000_000,
        "TD3": 1_000_000,
    }

    trained, times = {}, {}
    for cls, name, kwargs in tqdm(model_configs, desc="Training with validation"):
        n_steps = timesteps_override or timesteps_map[name]
        print(f"→ Training {name} ({n_steps} steps) with validation…")

        # create fresh model
        model = cls("MlpPolicy", env, verbose=0, **kwargs)
        start = time.time()
        model.learn(total_timesteps=n_steps, callback=eval_callback)
        duration = (time.time() - start) / 60
        times[name] = duration

        # Try to load the best checkpoint
        best_path = os.path.join(save_dir, "best_model", "best_model.zip")
        if os.path.exists(best_path):
            print(f"  Loading best {name} from validation checkpoint…")
            best_model = cls.load(best_path)   # load without env
            best_model.set_env(env)            # attach train env
            trained[name] = best_model
        else:
            print(f"  No validation checkpoint for {name}, saving final model.")
            model.save(os.path.join(save_dir, f"{name}_final.zip"))
            trained[name] = model

        print(f"✔ {name} done in {duration:.1f} min")

    return trained, times


In [21]:
models, training_times = train_models(env_train_sb3, model_configs, results_dir, timesteps_override=20)

# TODO: make this dynamic

Training with validation:   0%|          | 0/5 [00:00<?, ?it/s]

→ Training A2C (20 steps) with validation…
begin_total_asset:1000000.0
end_total_asset:1275426.5139942889
Sharpe:  2.081232858253942
begin_total_asset:1000000.0
end_total_asset:1275426.5139942889
Sharpe:  2.081232858253942
begin_total_asset:1000000.0
end_total_asset:1275426.5139942889
Sharpe:  2.081232858253942
begin_total_asset:1000000.0
end_total_asset:1275426.5139942889
Sharpe:  2.081232858253942
begin_total_asset:1000000.0
end_total_asset:1275426.5139942889
Sharpe:  2.081232858253942
Eval num_timesteps=20, episode_reward=289118080.62 +/- 0.00
Episode length: 251.00 +/- 0.00
New best mean reward!
  Loading best A2C from validation checkpoint…
✔ A2C done in 0.0 min
→ Training PPO (20 steps) with validation…
begin_total_asset:1000000.0
end_total_asset:1273431.1366929766
Sharpe:  2.0843060310113115
begin_total_asset:1000000.0
end_total_asset:1273431.1366929766
Sharpe:  2.0843060310113115
begin_total_asset:1000000.0
end_total_asset:1273431.1366929766
Sharpe:  2.0843060310113115
begin_to



begin_total_asset:1000000.0
end_total_asset:1291424.5341880028
Sharpe:  2.2065077296019724
begin_total_asset:1000000.0
end_total_asset:1291424.5341880028
Sharpe:  2.2065077296019724
begin_total_asset:1000000.0
end_total_asset:1291424.5341880028
Sharpe:  2.2065077296019724
begin_total_asset:1000000.0
end_total_asset:1291424.5341880028
Sharpe:  2.2065077296019724
Eval num_timesteps=12, episode_reward=288576383.69 +/- 0.00
Episode length: 251.00 +/- 0.00
  Loading best SAC from validation checkpoint…




AttributeError: 'ActorCriticPolicy' object has no attribute 'actor'

In [None]:
training_times_df = pd.DataFrame(
    list(training_times.items()), columns=["model", "training_duration (min)"]
)

training_times_df.to_csv(f"{results_dir}/training_times.csv", index=False)

print("Training summary:")
display(training_times_df)

## Model loading


Load the trained models from memory for analysis without the need for time consuming retraining


In [None]:
def load_models(model_configs, results_dir):
    models = {}
    for _, name, _ in model_configs:
        model_path = f"{results_dir}/{name}_mlp_model.zip"
        if os.path.exists(model_path):
            print(f"Loading saved model for {name}...")
            models[name] = globals()[name].load(model_path)
        else:
            print(f"No saved model found for {name}.")
    return models


# models = load_models(model_configs, results_dir)

## Backtesting


- Evaluates the performance of the RL models/algorithms in a trading environment.
- We do this by calculating the **cumulative portfolio value** and **performance metrics** for each RL model.


In [None]:
def backtest_rl_strategies(models, raw_env, env_kwargs):
    results = {}
    for name, model in models.items():
        print(f"Backtesting {name}…")
        # Simulate trading using the model in the raw_env environment
        df_ret, _ = DRLAgent.DRL_prediction(
            model=model, environment=raw_env, deterministic=True
        )
        df_ret["account_value"] = (df_ret.daily_return + 1).cumprod() * env_kwargs[
            "initial_amount"
        ]
        stats = backtest_stats(df_ret, value_col_name="account_value")
        results[name] = {"df": df_ret, "stats": stats}
    return results


results = backtest_rl_strategies(models, raw_test_env, env_kwargs)

### Plotting


In [None]:
def plot_backtest_results():
    for name, res in results.items():
        print(f"Plotting {name}…")
        backtest_plot(
            account_value=res["df"],
            baseline_start=test_start,
            baseline_end=test_end,
            baseline_ticker="SPY",
            value_col_name="account_value",
        )

plot_backtest_results()

## Benchmarks


These benchmarks will provide baseline performance metrics for comparison with the RL strategies.
We evaluate the performance of **Mean-Variance Optimization (MVO)** and simple benchmarks (**Equal-Weighted Portfolio** and **SPY**) in terms of returns, volatility, and cumulative portfolio value.


---

### Mean-Variance Optimization Benchmark

- **Objective**: Calculate the benchmark portfolio using **Mean-Variance Optimization (MVO)**.
- **Purpose**: This function benchmarks the performance of a portfolio optimized for minimum volatility using **Modern Portfolio Theory (MPT)**.
- **Comparison**: It allows us to compare the MPT strategy with other RL strategies by analyzing metrics like returns, volatility, and cumulative performance.

##### Workflow:

1. **Covariance Matrix**:

   - Extract the covariance matrix of asset returns for each trading day in the test period.
   - Use this matrix to model the relationships between asset returns.

2. **Optimization**:

   - Apply **Efficient Frontier** to minimize portfolio volatility.
   - Compute the optimal weights for each asset in the portfolio.

3. **Portfolio Value Calculation**:

   - Calculate the portfolio's account value over time using the optimized weights and asset prices.

4. **Performance Metrics**:
   - Evaluate the portfolio's performance using metrics such as annual return, cumulative return, and volatility.
   - Add the results to the `results` dictionary under the `"MPT"` key.


In [None]:
def compute_mpt_benchmark(test, env_kwargs):
    dates_test = test.date.unique()
    min_vals = [env_kwargs["initial_amount"]]
    for i in range(len(dates_test) - 1):
        curr = test[test.date == dates_test[i]]
        nxt = test[test.date == dates_test[i + 1]]
        covm = np.array(curr.cov_list.values[0])
        ef = EfficientFrontier(None, covm, weight_bounds=(0, 1))
        ef.min_volatility()
        w = ef.clean_weights()
        prices = curr.close.values
        nextp = nxt.close.values
        shares = np.array(list(w.values())) * min_vals[-1] / prices
        min_vals.append(np.dot(shares, nextp))
    min_df = pd.DataFrame({"date": dates_test, "account_value": min_vals})
    stats_mpt = backtest_stats(min_df, value_col_name="account_value")
    return {"df": min_df, "stats": stats_mpt}


mpt_benchmark = compute_mpt_benchmark(test_df, env_kwargs)

---

### Equal-Weighted Portfolio Benchmark

- Calculate the performance of an **equal-weighted portfolio** benchmark.
- This benchmark assumes that all assets in the portfolio are equally weighted, and their daily returns are averaged to compute the portfolio's overall return.

##### Workflow:

1. **Daily Returns Calculation**:

   - Group the test dataset by `date`.
   - Compute the percentage change (`pct_change`) in the `close` prices for each group.
   - Calculate the mean of the daily percentage changes to represent the portfolio's daily return.

2. **Cumulative Portfolio Value**:

   - Reset the index of the daily returns to create a DataFrame (`ew_df`).
   - Compute the cumulative product of the daily returns (`cumprod`) to calculate the portfolio's cumulative value over time.
   - Multiply the cumulative returns by the initial portfolio value (`initial_amount`) to get the portfolio's account value.

3. **Performance Metrics**:
   - Use the `backtest_stats` function to calculate performance metrics for the equal-weighted portfolio, such as annual return, cumulative return, and volatility.


In [None]:
def compute_equal_weighted_benchmark(df, initial_amount=100_000):
    # Pivot to have one column per ticker
    price_wide = df.pivot_table(
        index="date", columns="tic", values="close"
    ).sort_index()

    # Compute each ticker's daily return, then average equally
    daily_rets = price_wide.pct_change().fillna(0).mean(axis=1)

    # Build the equity curve
    ew_df = pd.DataFrame({"date": daily_rets.index, "daily_return": daily_rets.values})
    ew_df["account_value"] = (ew_df["daily_return"] + 1).cumprod() * initial_amount

    # Compute performance statistics
    stats_ew = backtest_stats(ew_df, value_col_name="account_value")

    return {"df": ew_df.reset_index(drop=True), "stats": stats_ew}


ew_benchmark = compute_equal_weighted_benchmark(test_df, env_kwargs["initial_amount"])

---

### SPY Benchmark

- **Objective**: Calculate the benchmark performance of the `SPY ETF`, which tracks the **S&P 500** index.
- **Purpose**: This function provides a baseline for comparing the performance of reinforcement learning models and other portfolio strategies.

##### Workflow:

1. **Data Retrieval**:
   - Use the `get_baseline` function to fetch the historical closing prices of the SPY ETF for the test period.
2. **Daily Returns Calculation**:
   - Compute the percentage change (`pct_change`) in the SPY closing prices to calculate daily returns.
3. **Cumulative Portfolio Value**:
   - Create a DataFrame (`spy_df`) with the daily returns and calculate the cumulative product (`cumprod`) of the daily returns to compute the portfolio's cumulative value over time.
   - Multiply the cumulative returns by the initial portfolio value (`initial_amount`) to get the portfolio's account value.
4. **Performance Metrics**:
   - Use the `backtest_stats` function to calculate performance metrics for the SPY benchmark, such as annual return, cumulative return, and volatility.


In [None]:
def compute_spy_benchmark(test, env_kwargs):
    spy_close = get_baseline("SPY", test_start, test_end)['close']
    spy_ret = spy_close.pct_change().dropna()
    spy_df = pd.DataFrame({"date": spy_ret.index, "daily_return": spy_ret.values})
    spy_df["account_value"] = (spy_df.daily_return + 1).cumprod() * env_kwargs[
        "initial_amount"
    ]
    stats_spy = backtest_stats(spy_df, value_col_name="account_value")
    return {"df": spy_df, "stats": stats_spy}


spy_benchmark = compute_spy_benchmark(test_df, env_kwargs)

In [None]:
benchmarks = {
    "MPT": mpt_benchmark,
    "EW": ew_benchmark,
    "SPY": spy_benchmark,
}

results.update(benchmarks)

## Performance Summary


In [None]:
perf_stats = pd.DataFrame({key.upper(): res["stats"] for key, res in results.items()})
display(perf_stats)

In [None]:
comparison_metrics = [
    "Cumulative returns",
    "Annual return",
    "Annual volatility",
    "Sharpe ratio",
    "Max drawdown",
]

# Filter the performance statistics for the selected metrics
comparison_table = perf_stats.loc[comparison_metrics]


# Plot the comparison metrics as a bar chart
comparison_table.T.plot(kind='bar', figsize=(16, 8))
plt.title("Comparison of Key Metrics Across Models")
plt.ylabel("Metric Value")
plt.xlabel("Models")
plt.xticks(rotation=45)
plt.legend(title="Metrics", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()


Visualize the cumulative returns of various strategies over time


In [None]:
def plot_cumulative_returns(results):
    plt.figure(figsize=(12, 8))
    for name, res in results.items():
        # Ensure the date column is converted to datetime
        res["df"]["date"] = pd.to_datetime(res["df"]["date"])
        # Filter data to start from the trade start date
        filtered_df = res["df"][res["df"]["date"] >= test_start]
        cum = (
            (filtered_df["daily_return"] + 1).cumprod() - 1
            if "daily_return" in filtered_df
            else filtered_df["account_value"] / filtered_df["account_value"].iloc[0] - 1
        )
        plt.plot(filtered_df["date"], cum, label=name)
    plt.title("Cumulative Returns")
    plt.xlabel("Date")
    plt.ylabel("Cumulative Return")
    plt.legend()
    plt.show()

plot_cumulative_returns(results)