**Hyperparameters to tune:**
- Learning Rate: Controls how much the model's weights are updated during training. A higher learning rate might lead to faster learning but can cause instability. A lower learning rate ensures more stable but slower learning.
- N_steps: Number of environment steps collected (per environment) before each policy update. Larger values give more data per update and steadier gradient estimates, at the cost of more memory and less frequent updates. In a trading environment, this could be aligned with the frequency of decision-making.
- Gamma (Discount Factor): Determines the importance of future rewards. A lower value makes the agent short-sighted by discounting future rewards heavily.
- Gae_lambda (Generalized Advantage Estimation): Balances bias and variance in the advantage estimate. Values near 1 rely on long multi-step returns (lower bias, higher variance); values near 0 rely on the one-step value estimate (higher bias, lower variance).
- Ent_coef (Entropy Coefficient): Encourages exploration by adding an entropy bonus to the objective function. Higher entropy can help explore more strategies in a complex trading market.
- Seed: Sets the random seed for reproducibility of training results.
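- Use_sde (generalized State-Dependent Exploration, gSDE): If enabled, exploration noise is made a function of the current state instead of being sampled independently at every step, which can help exploration. In Stable-Baselines3 this option applies only to continuous action spaces.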
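
To make the roles of Gamma and Gae_lambda concrete, here is a minimal NumPy sketch of the standard GAE recursion on a single rollout segment with no episode boundaries. The helper names `effective_horizon` and `gae_advantages` are illustrative and not part of any library; the `1 / (1 - gamma)` horizon is a common rule of thumb, not an exact cutoff.

```python
import numpy as np


def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb number of steps over which rewards still matter."""
    return 1.0 / (1.0 - gamma)


def gae_advantages(rewards, values, last_value, gamma, gae_lambda):
    """GAE over one rollout segment, assuming no episode ends inside it.

    rewards[t] and values[t] are the reward and value estimate at step t;
    last_value is the value estimate for the state after the final step.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    advantages = np.zeros_like(rewards)
    next_value = float(last_value)
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the next one.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    print(effective_horizon(0.65), effective_horizon(0.99))
    # gae_lambda=1.0 gives full discounted returns minus the baseline,
    # gae_lambda=0.0 gives plain one-step TD errors.
    print(gae_advantages([1, 1, 1], [0, 0, 0], 0.0, gamma=0.65, gae_lambda=1.0))
    print(gae_advantages([1, 1, 1], [0, 0, 0], 0.0, gamma=0.65, gae_lambda=0.0))
```

With `gamma=0.65` the rule-of-thumb horizon is under three steps, so the agent weighs little beyond the next few bars; values closer to 1 extend that horizon considerably.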

In [None]:
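import pandas as pd

# The CSV has no header row, so column names are supplied explicitly.
df = pd.read_csv('ETHUSD_5.csv', header=None, names=['Time', 'Open', 'High', 'Low', 'Close', 'Volume', 'Trades'])
df['Time'] = pd.to_datetime(df['Time'], unit='s')  # Unix timestamps in seconds
df.set_index('Time', inplace=True)
# Convert to a plain NumPy array for the environment; the Time index is dropped.
df = df.to_numpy()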

In [None]:
import numpy as np
from sb3_contrib import RecurrentPPO
from stable_baselines3.common.evaluation import evaluate_policy

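env = TradeEnv(6, 288, 1000, df)
model = RecurrentPPO(
    "MlpLstmPolicy",
    env,
    verbose=1,
    learning_rate=0.001,
    n_steps=32,   # rollout length per environment between policy updates
    gamma=0.65,   # strong discounting: rewards a few steps ahead carry little weight
    ent_coef=0.05,
    seed=42,
)
model.learn(total_timesteps=10000)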

# vec_env = model.get_env()  # keep a handle to the wrapped env; used by the loop below
# model.save("ppo_recurrent")
# del model # remove to demonstrate saving and loading

# model = RecurrentPPO.load("ppo_recurrent")

# obs = vec_env.reset()
# # cell and hidden state of the LSTM
# lstm_states = None
# num_envs = 1
# # Episode start signals are used to reset the lstm states
# episode_starts = np.ones((num_envs,), dtype=bool)
# while True:
#     action, lstm_states = model.predict(obs, state=lstm_states, episode_start=episode_starts, deterministic=True)
#     obs, rewards, dones, info = vec_env.step(action)
#     episode_starts = dones
#     vec_env.render("human")
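
The fixed values passed to `RecurrentPPO` above could instead be drawn from a search space. The following is a minimal random-search sketch: the helper `sample_ppo_kwargs` and every range in it are assumptions chosen for illustration, not tuned recommendations. Only the keyword names (`learning_rate`, `n_steps`, `gamma`, `gae_lambda`, `ent_coef`) correspond to actual `RecurrentPPO` arguments. `use_sde` is left out because gSDE requires a continuous action space.

```python
import random


def sample_ppo_kwargs(rng: random.Random) -> dict:
    """Draw one candidate configuration (ranges are illustrative assumptions)."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -3),   # log-uniform in [1e-5, 1e-3]
        "n_steps": rng.choice([32, 64, 128, 256]),
        "gamma": rng.choice([0.65, 0.9, 0.95, 0.99]),
        "gae_lambda": rng.uniform(0.8, 1.0),
        "ent_coef": 10 ** rng.uniform(-4, -1),        # log-uniform in [1e-4, 1e-1]
    }


# A seeded generator keeps the list of candidates reproducible.
rng = random.Random(42)
candidates = [sample_ppo_kwargs(rng) for _ in range(5)]

for i, kwargs in enumerate(candidates):
    print(i, kwargs)
    # Each candidate could then be trained and scored, for example:
    # model = RecurrentPPO("MlpLstmPolicy", env, seed=42, **kwargs)
    # model.learn(total_timesteps=10000)
    # mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=5)
```

Sampling the learning rate and entropy coefficient on a log scale spreads candidates evenly across orders of magnitude, which is usually more informative than a uniform draw for parameters of this kind.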