# Reinforcement Learning for Algorithmic Trading
## Step-by-Step Implementation of DQN (Fixed Version)
This notebook improves upon the previous DQN implementation by fixing key issues:

- **Better Reward Function**: Encourages profitable trades instead of just holding cash.
- **Increased Exploration**: Ensures the agent explores more during training.
- **Hold Penalty**: Discourages the agent from choosing Hold at every step.
- **Longer Training Time**: Allows more learning for better performance.

In [2]:
import numpy as np
import pandas as pd
import gym
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy
from gym import spaces

# Load preprocessed data
train_df = pd.read_csv('train_data.csv')
test_df = pd.read_csv('test_data.csv')

# Convert datadate to datetime format
train_df['datadate'] = pd.to_datetime(train_df['datadate'])
test_df['datadate'] = pd.to_datetime(test_df['datadate'])

## Step 1: Define the Fixed Trading Environment
We improve the environment by:
- **Fixing the Reward Function** to reward profitable trades.
- **Adding a Hold Penalty** to discourage inactivity.
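
Read off the `step` method in the next cell, the reward can be written compactly. The symbols here are introduced only for illustration and do not appear in the code: $c_t$ and $h_t$ are cash and holdings before the action, $c_t^{+}$ and $h_t^{+}$ are the same quantities after the action, $p_t$ is the `adjcp` price, and $\lambda = 0.01$ is the hold penalty.

```latex
V_t = c_t + h_t \, p_t, \qquad
V_{t+1} = c_t^{+} + h_t^{+} \, p_{t+1}, \qquad
r_t = V_{t+1} - V_t - \lambda \, \mathbb{1}[a_t = \text{Hold}]
```

Because this environment charges no transaction costs, a Buy or Sell at $p_t$ leaves the portfolio value unchanged ($c_t^{+} + h_t^{+} p_t = V_t$), so the reward reduces to $r_t = h_t^{+}\,(p_{t+1} - p_t) - \lambda \, \mathbb{1}[a_t = \text{Hold}]$. In other words, the agent only earns or loses reward while it holds shares.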

In [3]:
class TradingEnv(gym.Env):
    """Custom Trading Environment for RL with Fixes"""
    def __init__(self, data):
        super(TradingEnv, self).__init__()
        self.data = data
        self.current_step = 0
        self.cash = 10000  # Initial cash balance
        self.holdings = 0  # Number of shares held
        self.action_space = spaces.Discrete(3)  # 0=Buy, 1=Hold, 2=Sell
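        # Not every feature is scaled to [0, 1] (the sample state includes values
        # in the thousands and millions), so the bounds are left open.
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(len(data.columns) - 2,), dtype=np.float32)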

    def reset(self):
        self.current_step = 0
        self.cash = 10000
        self.holdings = 0
        return self._next_observation()

    def _next_observation(self):
        """Return current market state as observation (drop non-numeric columns)."""
        obs = self.data.iloc[self.current_step].drop(['datadate', 'tic']).values
        return obs.astype(np.float32)

    def step(self, action):
        """Execute action and move to the next step."""
        price = self.data.iloc[self.current_step]['adjcp']

        prev_value = self.cash + (self.holdings * price)
        if action == 0:  # Buy
            self.holdings += self.cash / price
            self.cash = 0
        elif action == 2:  # Sell
            self.cash += self.holdings * price
            self.holdings = 0

        self.current_step += 1
        done = self.current_step >= len(self.data) - 1

        # Calculate new portfolio value
        new_price = self.data.iloc[self.current_step]['adjcp'] if self.current_step < len(self.data) else price
        new_value = self.cash + (self.holdings * new_price)
        reward = new_value - prev_value  # Reward based on portfolio value change

        if action == 1:  # Small penalty for holding
            reward -= 0.01

        return self._next_observation(), reward, done, {}

    def render(self):
        print(f'Step: {self.current_step}, Cash: {self.cash}, Holdings: {self.holdings}')

# Initialize environment
env = TradingEnv(train_df)
state = env.reset()
print("Sample state:", state)

Sample state: [7.5600000e+03 4.6470854e-02 4.5648310e-02 4.5391802e-02 4.6561643e-02
 1.7540960e+07 7.2138506e-01 6.2133706e-01 6.4582926e-01 3.3790949e-01
 0.0000000e+00 5.3226990e-01]
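
Most values in the sample state sit in the unit interval, but not all of them. The sketch below is a quick way to find which features fall outside [0, 1], which matters whenever the observation space or the network assumes normalized inputs. The helper name `out_of_bounds` is hypothetical, and the values are copied from the printed sample state.

```python
# Values copied from the "Sample state" output above.
sample_state = [
    7.5600000e+03, 4.6470854e-02, 4.5648310e-02, 4.5391802e-02,
    4.6561643e-02, 1.7540960e+07, 7.2138506e-01, 6.2133706e-01,
    6.4582926e-01, 3.3790949e-01, 0.0000000e+00, 5.3226990e-01,
]


def out_of_bounds(obs, low=0.0, high=1.0):
    """Return the indices of features that fall outside [low, high]."""
    return [i for i, x in enumerate(obs) if not (low <= x <= high)]


bad = out_of_bounds(sample_state)
print("Feature indices outside [0, 1]:", bad)
```

Any index reported here points at a column that is probably still on its raw scale and may be worth normalizing during preprocessing.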


## Step 2: Train the Fixed DQN Agent
We modify the training process by:
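- Increasing exploration (`exploration_fraction=0.2`, `exploration_final_eps=0.1`).
- Extending training to 100,000 timesteps.
- Training against the adjusted reward signal from Step 1.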
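
To make the exploration settings concrete, here is a minimal sketch of the epsilon schedule they imply. It assumes Stable-Baselines3's default starting value of 1.0 (`exploration_initial_eps`) and linear annealing over the first `exploration_fraction` of training. The function `epsilon_at` is a hypothetical helper written for illustration, not part of the library.

```python
def epsilon_at(step, total_timesteps=100_000, exploration_fraction=0.2,
               eps_start=1.0, eps_end=0.1):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it constant.

    eps_start=1.0 is assumed to match the library default; the other values
    mirror the training cell in this notebook.
    """
    anneal_steps = exploration_fraction * total_timesteps
    progress = min(step / anneal_steps, 1.0)
    return eps_start + progress * (eps_end - eps_start)


for step in (0, 10_000, 20_000, 100_000):
    print(step, round(epsilon_at(step), 3))
```

Under these assumptions the agent acts fully at random at the start, reaches the 10% floor after 20,000 steps, and keeps that level of randomness for the remaining 80,000 steps.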

In [5]:
# Initialize improved DQN model
model = DQN(
    'MlpPolicy', env, verbose=1, learning_rate=0.001,
    buffer_size=10000, batch_size=32, exploration_fraction=0.2,  # More exploration
    exploration_final_eps=0.1  # Allow some randomness
)

# Train the model for longer
model.learn(total_timesteps=100000)

# Save the trained model
model.save('dqn_trading_model_fixed')

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




## Step 3: Evaluate the Fixed Model
We test the trained model on **unseen test data (2019-2021)**.

In [6]:
# Load trained model
model = DQN.load('dqn_trading_model_fixed')

# Initialize test environment
test_env = TradingEnv(test_df)
mean_reward, std_reward = evaluate_policy(model, test_env, n_eval_episodes=10)

print(f"Mean Reward: {mean_reward}, Std Reward: {std_reward}")



Mean Reward: 0.0, Std Reward: 0.0
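
A mean and standard deviation of exactly 0.0 deserve a closer look. `evaluate_policy` acts deterministically by default, so every episode repeats the same actions, which explains the zero spread. One possible reading of the zero mean, which the output alone does not confirm, is that the agent keeps choosing Sell while holding no shares: that action changes nothing and also avoids the hold penalty. The sketch below reproduces the reward arithmetic of `TradingEnv.step` as a standalone function (`step_reward` is a hypothetical name, and the prices are made up) to show the three cases.

```python
HOLD_PENALTY = 0.01  # same value as in TradingEnv.step


def step_reward(cash, holdings, price, next_price, action):
    """Standalone copy of the reward arithmetic (0=Buy, 1=Hold, 2=Sell)."""
    prev_value = cash + holdings * price
    if action == 0:    # Buy: convert all cash into shares
        holdings += cash / price
        cash = 0.0
    elif action == 2:  # Sell: convert all shares into cash
        cash += holdings * price
        holdings = 0.0
    reward = (cash + holdings * next_price) - prev_value
    if action == 1:    # Hold: small penalty
        reward -= HOLD_PENALTY
    return reward


# Agent is fully in cash; the price moves from 1.00 to 1.10 (hypothetical numbers).
sell_reward = step_reward(10_000.0, 0.0, 1.00, 1.10, action=2)
hold_reward = step_reward(10_000.0, 0.0, 1.00, 1.10, action=1)
buy_reward = step_reward(10_000.0, 0.0, 1.00, 1.10, action=0)
print(sell_reward, hold_reward, buy_reward)
```

If this reading is right, the hold penalty makes "Sell with nothing to sell" strictly better than Hold for a cash-only agent, which gives the policy a penalty-free way to stay inactive.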


## Step 4: Run the Fixed Agent on Test Data
Let's step through the test environment and print the agent's cash and holdings whenever it trades.

In [10]:
# Reset environment
obs = test_env.reset()
done = False

while not done:
    action, _ = model.predict(obs)
    action = int(action)  # predict returns a NumPy array; convert to a plain int
    obs, reward, done, _ = test_env.step(action)
    # Render after every Buy or Sell, and every 500 steps as a progress check
    if action in (0, 2) or test_env.current_step % 500 == 0:
        test_env.render()

Step: 1, Cash: 10000.0, Holdings: 0
Step: 2, Cash: 10000.0, Holdings: 0
Step: 3, Cash: 10000.0, Holdings: 0
Step: 4, Cash: 10000.0, Holdings: 0
Step: 5, Cash: 10000.0, Holdings: 0
Step: 7, Cash: 10000.0, Holdings: 0
Step: 8, Cash: 10000.0, Holdings: 0
Step: 9, Cash: 10000.0, Holdings: 0
Step: 10, Cash: 10000.0, Holdings: 0
Step: 11, Cash: 10000.0, Holdings: 0
Step: 12, Cash: 10000.0, Holdings: 0
Step: 13, Cash: 10000.0, Holdings: 0
Step: 14, Cash: 10000.0, Holdings: 0
Step: 15, Cash: 10000.0, Holdings: 0
Step: 16, Cash: 10000.0, Holdings: 0
Step: 17, Cash: 10000.0, Holdings: 0
Step: 18, Cash: 10000.0, Holdings: 0
Step: 19, Cash: 10000.0, Holdings: 0
Step: 20, Cash: 10000.0, Holdings: 0
Step: 21, Cash: 10000.0, Holdings: 0
Step: 22, Cash: 10000.0, Holdings: 0
Step: 23, Cash: 10000.0, Holdings: 0
Step: 24, Cash: 10000.0, Holdings: 0
Step: 25, Cash: 10000.0, Holdings: 0
Step: 26, Cash: 10000.0, Holdings: 0
Step: 27, Cash: 0, Holdings: 97115.51937900927
Step: 28, Cash: 12591.774028354275, 

## Conclusion
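- The **reward function** now tracks the change in portfolio value, so profitable trades are rewarded directly.
- The agent **explores more** during training (`exploration_fraction=0.2`, `exploration_final_eps=0.1`).
- We **increased training time** to 100,000 timesteps to give the agent more experience.
- The Step 3 evaluation still reports a mean reward of 0.0 on the test data, so the policy needs further tuning before these changes translate into profit.
- Next steps: Compare DQN with **PPO**. **SAC** would first require a continuous action space, since the Stable-Baselines3 implementation does not support `Discrete` actions.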
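
When comparing agents, a passive baseline helps put episode rewards in context. The following is a minimal sketch, assuming the same 10,000 starting cash as `TradingEnv`, no transaction costs, and a single-ticker price series. `buy_and_hold_profit` and the toy prices are illustrative and not part of the notebook.

```python
def buy_and_hold_profit(prices, initial_cash=10_000.0):
    """Profit from spending all cash at the first price and valuing shares at the last."""
    if not prices:
        raise ValueError("prices must be non-empty")
    shares = initial_cash / prices[0]
    return shares * prices[-1] - initial_cash


# Toy price path (hypothetical); on real data this could be a list of adjcp values.
toy_prices = [10.0, 11.0, 9.0, 12.0]
profit = buy_and_hold_profit(toy_prices)
print(f"Buy-and-hold profit: {profit:.2f}")
```

An agent whose total episode reward falls below this number on the same price series is doing worse than simply buying at the start and waiting.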