# Reinforcement Learning for Algorithmic Trading
## Step-by-Step Implementation of DQN (Fixed Version)
This notebook improves upon the previous DQN implementation by fixing key issues:

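- **Better Reward Function**: Rewards buying below the 20-day moving average and selling above it, instead of just holding cash.
- **Increased Exploration**: Uses `exploration_fraction=0.2` and `exploration_final_eps=0.1` so the agent explores more during training.
- **Hold Penalty**: Meant to prevent the agent from always holding cash (the environment below currently penalizes only selling with no shares held).
- **Longer Training Time**: Trains for 100,000 timesteps to allow more learning.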

In [32]:
import numpy as np
import pandas as pd
import gym
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy
from gym import spaces

# Load preprocessed data
train_df = pd.read_csv('train_data.csv')
test_df = pd.read_csv('test_data.csv')

# Convert datadate to datetime format
train_df['datadate'] = pd.to_datetime(train_df['datadate'])
test_df['datadate'] = pd.to_datetime(test_df['datadate'])
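
The environment defined below reads a `20_day_MA` column from the preprocessed CSVs. The preprocessing itself is not shown in this notebook. A minimal sketch, assuming the column is a simple 20-day rolling mean of `adjcp` (the helper name `add_moving_average` and the demo prices are hypothetical), might look like this:

```python
import pandas as pd

def add_moving_average(df, price_col="adjcp", window=20):
    """Return a copy of df with a '<window>_day_MA' column (simple rolling mean)."""
    out = df.copy()
    out[f"{window}_day_MA"] = out[price_col].rolling(window=window).mean()
    return out

# Tiny synthetic example (hypothetical prices, not the notebook's data)
demo = pd.DataFrame({"adjcp": [10.0, 11.0, 12.0, 13.0, 14.0]})
demo = add_moving_average(demo, window=3)
print(demo)
```

With a simple rolling mean the first `window - 1` rows are NaN, which would be one reason for the environment to start each episode at `window_size` rather than at row 0.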

## Step 1: Define the Fixed Trading Environment
We improve the environment by:
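- **Fixing the Reward Function** to reward trades that follow the mean-reversion rule (buy below the moving average, sell above it).
- **Adding a Hold Penalty** to discourage inactivity (not yet present in `step()` below, which only penalizes selling with no shares held).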

In [39]:
import numpy as np
import pandas as pd
import gym
from gym import spaces

class MeanReversionTradingEnv(gym.Env):
    def __init__(self, df, window_size=20, initial_cash=10000):
        super(MeanReversionTradingEnv, self).__init__()

        self.df = df
        self.window_size = window_size
        self.initial_cash = initial_cash
        self.current_step = 0
        self.cash = initial_cash
        self.shares_held = 0
        self.total_value = initial_cash  # Track total portfolio value

        # Action space: 0 = Buy, 1 = Hold, 2 = Sell
        self.action_space = spaces.Discrete(3)

        # Observation space: [current price, moving average, cash, shares held]
        self.observation_space = spaces.Box(
            low=0, high=np.inf, shape=(4,), dtype=np.float32
        )

    def reset(self):
        """Reset environment for a new episode"""
        self.current_step = self.window_size  # Start after enough data points
        self.cash = self.initial_cash
        self.shares_held = 0
        self.total_value = self.initial_cash
        return self._get_observation()

    def _get_observation(self):
        """Return the current state representation"""
        current_price = self.df.iloc[self.current_step]['adjcp']
        moving_avg = self.df.iloc[self.current_step]['20_day_MA']
        return np.array([current_price, moving_avg, self.cash, self.shares_held], dtype=np.float32)

    def step(self, action):
        """Take an action and return next state, reward, done flag"""
        current_price = self.df.iloc[self.current_step]['adjcp']
        moving_avg = self.df.iloc[self.current_step]['20_day_MA']
        
        # Fix: Ensure done flag stops at valid indices
        done = self.current_step >= len(self.df) - 1  # Stop at the last valid row
        reward = 0

        # Mean Reversion Logic: Buy only below the moving average, sell only above it (no deviation threshold is applied)
        if action == 0 and current_price < moving_avg:  # Buy if price < MA
            num_shares = self.cash // current_price  # Buy as many as possible
            if num_shares > 0:
                self.cash -= num_shares * current_price
                self.shares_held += num_shares
                reward = 1  # Positive reward for buying at a discount

        elif action == 2 and current_price > moving_avg and self.shares_held > 0:  # Sell if price > MA
            self.cash += self.shares_held * current_price
            self.shares_held = 0
            reward = 1  # Positive reward for selling at a premium

        elif action == 2 and self.shares_held == 0:
            reward = -1  # Penalize selling when no shares are held

        # Update portfolio value
        self.total_value = self.cash + (self.shares_held * current_price)

        # Move to the next step
        if not done:  # Only increment if not done
            self.current_step += 1
            
        return self._get_observation(), reward, done, {}

    def render(self):
        """Print current state for debugging"""
        print(f"Step: {self.current_step}, Cash: {self.cash}, Shares: {self.shares_held}, Total Value: {self.total_value}")

# Initialize environment
env = MeanReversionTradingEnv(train_df)
state = env.reset()
print("Sample state:", state)

Sample state: [   27.98        29.451279 10000.           0.      ]
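
The `step()` method above returns +1 or -1 depending on whether the action follows the mean-reversion rule, and holding always yields 0. One way the hold penalty described in this step could be expressed is sketched below. It assumes a reward based on the fractional change in portfolio value; the function name `shaped_reward` and the `hold_penalty` value are hypothetical and are not part of the notebook's environment.

```python
def shaped_reward(prev_value, new_value, action, hold_penalty=0.001):
    """Fractional change in portfolio value, minus a small cost for holding.

    action uses the notebook's encoding: 0 = Buy, 1 = Hold, 2 = Sell.
    hold_penalty is a hypothetical tuning knob, not a value from the notebook.
    """
    HOLD = 1
    reward = (new_value - prev_value) / prev_value
    if action == HOLD:
        reward -= hold_penalty
    return reward

# A buy followed by a 1% gain in portfolio value
print(shaped_reward(10000.0, 10100.0, action=0))
# Holding while the portfolio value is flat costs the penalty
print(shaped_reward(10000.0, 10000.0, action=1))
```

Inside the environment, `prev_value` would be `self.total_value` before the update and `new_value` the value after it. The penalty likely needs to stay small relative to typical per-step returns so that it nudges the agent without dominating the reward.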


## Step 2: Train the Fixed DQN Agent
We modify the training process by:
- Increasing exploration (`exploration_fraction=0.2`, `exploration_final_eps=0.1`)
- Extending training time (`total_timesteps=100000`)
- Adjusting reward signals

In [40]:
# Initialize improved DQN model
model = DQN(
    'MlpPolicy', env, verbose=1, learning_rate=0.001,
    buffer_size=10000, batch_size=32, exploration_fraction=0.2,  # More exploration
    exploration_final_eps=0.1  # Allow some randomness
)

# Train the model for longer
model.learn(total_timesteps=100000)

# Save the trained model
model.save('dqn_trading_model_fixed')

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 2.24e+03 |
|    ep_rew_mean      | -320     |
|    exploration_rate | 0.596    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 812      |
|    time_elapsed     | 11       |
|    total_timesteps  | 8976     |
| train/              |          |
|    learning_rate    | 0.001    |
|    loss             | 82.8     |
|    n_updates        | 2218     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 2.24e+03 |
|    ep_rew_mean      | -364     |
|    exploration_rate | 0.192    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 812      |
|    time_elapsed     | 22       |
|    total_timesteps  | 17952    |
| train/              |        
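
The `exploration_rate` values in the log follow from the settings above. The sketch below re-derives them, assuming a linear decay from an initial epsilon of 1.0 (the Stable-Baselines3 default for `exploration_initial_eps`) to `exploration_final_eps` over the first `exploration_fraction` of training. It is an illustrative reimplementation, not the library's own code, and the function name `linear_epsilon` is hypothetical.

```python
def linear_epsilon(timestep, total_timesteps=100_000, exploration_fraction=0.2,
                   initial_eps=1.0, final_eps=0.1):
    """Linearly anneal epsilon, then hold it at final_eps."""
    decay_steps = exploration_fraction * total_timesteps  # 20,000 steps here
    progress = min(timestep / decay_steps, 1.0)
    return initial_eps + progress * (final_eps - initial_eps)

# Timesteps taken from the training log above, plus two later points
for t in (8976, 17952, 20000, 50000):
    print(t, round(linear_epsilon(t), 3))
```

At 8,976 and 17,952 timesteps this gives 0.596 and 0.192, matching the log. After 20,000 timesteps epsilon stays at 0.1, so the remaining 80% of training runs with 10% random actions.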

## Step 3: Evaluate the Fixed Model
We test the trained model on **unseen test data (2019-2021)**.

In [42]:
# Load trained model
model = DQN.load('dqn_trading_model_fixed')

# Initialize test environment
test_env = MeanReversionTradingEnv(test_df)
mean_reward, std_reward = evaluate_policy(model, test_env, n_eval_episodes=10)

print(f"Mean Reward: {mean_reward}, Std Reward: {std_reward}")



Mean Reward: -390.0, Std Reward: 0.0
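
A standard deviation of 0.0 is expected here: `evaluate_policy` uses deterministic actions by default and the environment has no randomness, so all 10 episodes are identical. Because the reward is a shaped +1/-1 signal, the mean reward also says little about profit. A minimal sketch of a profit-based comparison against buy-and-hold follows. The helper names and demo prices are hypothetical, and it assumes whole-share purchases as in the environment's buy logic.

```python
import numpy as np

def total_return(initial_value, final_value):
    """Fractional gain or loss over the episode."""
    return (final_value - initial_value) / initial_value

def buy_and_hold_value(prices, initial_cash=10000.0):
    """Final value when buying as many whole shares as possible at the first price."""
    prices = np.asarray(prices, dtype=float)
    shares = initial_cash // prices[0]
    leftover = initial_cash - shares * prices[0]
    return leftover + shares * prices[-1]

# Hypothetical price path, not the notebook's data
demo_prices = [100.0, 90.0, 120.0]
bh_final = buy_and_hold_value(demo_prices)
print(total_return(10000.0, bh_final))
```

For the agent, `final_value` could be read from `test_env.total_value` after an episode, and `prices` from `test_df['adjcp']` starting at the row where the episode begins.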


## Step 4: Run the Fixed Agent on Test Data
Let's step through the test environment and print the trades and portfolio state to see how the improved agent performs.

In [None]:
# Reset environment
obs = test_env.reset()
done = False
cash_available = test_env.cash  # Cash balance tracked by the environment
stock_holdings = test_env.shares_held  # Shares currently held

while not done:    
    action, _ = model.predict(obs, deterministic=True)  # Greedy action, matching evaluate_policy's default

    # Prevent selling if no stocks are held
    if action == 2 and stock_holdings == 0:
        action = 1  # Hold instead of selling

    obs, reward, done, _ = test_env.step(action)

    # Report only trades the environment actually executed
    # (stock_holdings still holds the pre-step share count here)
    if test_env.shares_held != stock_holdings:
        print(f"Trade executed at step {test_env.current_step}: Action {action}")

    # Update cash and stock holdings
    cash_available = test_env.cash
    stock_holdings = test_env.shares_held

    # Render every 100 steps
    if test_env.current_step % 100 == 0:
        test_env.render()


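Note: the error below appears to be stale output from an earlier version of this cell. The code above never accesses `holdings`; it uses `shares_held`, which the environment defines.

AttributeError: 'MeanReversionTradingEnv' object has no attribute 'holdings'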

## Conclusion
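- The **reward function** now rewards trades that follow the mean-reversion rule, although it does not measure profit directly.
- The agent **explores more** during training (`exploration_fraction=0.2`, `exploration_final_eps=0.1`).
- We **increased training time** to 100,000 timesteps. The mean reward on test data is still -390.0, which suggests the agent keeps choosing penalized actions.
- Next steps: Compare DQN with **PPO and SAC** for even better performance! Note that SAC in Stable-Baselines3 requires a continuous (`Box`) action space, so the environment's `Discrete(3)` actions would need to be adapted first.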