# Assignment: Stock market prediction with Double DQN
---

<b><div style="text-align: right">[TOTAL POINTS: 15]</div></b>

In this assignment, we will solve a stock market prediction problem using Double DQN. By the end of the assignment, you will be able to:

- implement the Double DQN algorithm for a stock market prediction problem

Before proceeding to the exercises, let's first get to know the environment.

## Environment

`AnyTrading` is a collection of OpenAI Gym environments for trading. It provides two markets: `Forex` and `Stock`. Its main goal is to offer trading markets on which different RL algorithms can be tested in the same way as on other gym environments.

We will use the `Stock` trading environment in this assignment. Let's look at it more closely.

### Properties

> - `action_space:` There are two possible actions: 
    - `Sell: 0`
    - `Buy: 1`
- `observation_space:` The state of the environment, with size `window_size x 2`. Here, `window_size` determines how many time steps the state covers, and the `2` columns hold the following information:
  - `price:` The zeroth element of each row in the window is the price.
  - `price_diff:` The first element (index 1) of each row in the window is the price difference.
- `shape:` The shape of a single observation, generally `window_size x 2`.
- `history:` Stores the info of all steps.
- `frame_bound:` A tuple with the lower and upper indices of the environment's dataframe `df` to be used for training.
- `window_size:` An integer that determines the size of the state space. E.g., if `window_size=10`, the state has size `10x2`, where `2` stands for the `price` and `price_diff` quantities.
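
To make the `window_size x 2` layout concrete, here is a minimal sketch in plain Python. The helper `make_observation` and the price values are hypothetical, and the sketch assumes that `price_diff` is the change from the previous price, with the first difference set to 0. It illustrates the shape of a state, not the environment's actual code.

```python
def make_observation(prices, current_tick, window_size):
    """Return the last `window_size` rows of [price, price_diff] up to current_tick."""
    # Assumption: price_diff is the change from the previous price (0 for the first row).
    diffs = [0.0] + [prices[i] - prices[i - 1] for i in range(1, len(prices))]
    start = current_tick - window_size + 1
    return [[prices[i], diffs[i]] for i in range(start, current_tick + 1)]


prices = [10.0, 10.5, 10.25, 11.0, 12.0]  # made-up prices
obs = make_observation(prices, current_tick=4, window_size=3)
print(len(obs), len(obs[0]))  # rows x columns, i.e. window_size x 2
print(obs)
```

Each row is one time step inside the window: column 0 is the price and column 1 is the price difference.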

### Methods

> - `seed:` Same as the typical gym `env.seed()` method.
- `reset:` Same as the typical gym `env.reset()` method.
- `step:` Same as the typical gym `env.step()` method.
- `render:` Same as the typical gym `env.render()` method.
- `render_all:` Unlike the standard gym methods, it renders the whole environment at once.
- `close:` Same as the typical gym `env.close()` method.
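
The sketch below shows how `reset` and `step` are typically combined in an episode loop. `ToyTradingEnv` is a hypothetical stand-in written only for this illustration, and its reward rule is invented. It assumes the classic gym interface in which `step` returns `(observation, reward, done, info)`.

```python
import random


class ToyTradingEnv:
    """Hypothetical stand-in that mimics the classic gym reset/step interface."""

    def __init__(self, prices):
        self.prices = prices
        self.tick = 0

    def reset(self):
        self.tick = 0
        return self.prices[self.tick]  # initial observation

    def step(self, action):
        # Toy reward: the price change if the agent chose Buy (1), otherwise 0.
        # This is NOT the reward that AnyTrading computes.
        self.tick += 1
        change = self.prices[self.tick] - self.prices[self.tick - 1]
        reward = change if action == 1 else 0.0
        done = self.tick == len(self.prices) - 1
        return self.prices[self.tick], reward, done, {}


rng = random.Random(42)
env = ToyTradingEnv([10.0, 10.5, 10.25, 11.0])
state = env.reset()
done, steps, total_reward = False, 0, 0.0
while not done:
    action = rng.randrange(2)  # random policy: 0 = Sell, 1 = Buy
    state, reward, done, info = env.step(action)
    steps += 1
    total_reward += reward
print(steps, total_reward)
```

The same loop structure applies to the real environment, with the random policy replaced by the agent's action selection.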

## Assignment overview

This assignment is divided into two major exercises. Because the assignment is free-form, there are no explicit tasks inside the exercises. However, you will perform various tasks within these two exercises.

You can implement the Double DQN algorithm freely, subject to certain restrictions. These restrictions are explained inside each exercise.

Here is an overview of the tasks that you will complete in each exercise.

### Exercise 1: Create an agent class
- Initialize agent class
- Implement methods for Double DQN algorithms including build network, experience replay buffer, run episodes, etc. 

### Exercise 2: Train the model
- Create a stock market environment
- Train the model and return reward history

## Import libraries
Let's first import all the necessary libraries for this assignment.

In [None]:
!pip install gym_anytrading # for gym trading environment

In [None]:
from collections import deque
import random

import numpy as np

import torch
from torch import nn
from torch.nn import functional as F

# for stock market environment
import gym
import gym_anytrading
from gym_anytrading.envs import TradingEnv, ForexEnv, StocksEnv, Actions, Positions 
from gym_anytrading.datasets import FOREX_EURUSD_1H_ASK, STOCKS_GOOGL

# for visualization
import matplotlib.pyplot as plt

# you can import additional libraries here

These are the essential libraries. If you need other libraries for your implementation, you can import them in the cell above.


# Agent environment

This section covers the overall agent environment.

## Exercise 1: Create an agent class

<b><div style="text-align: right">[Marks: 5]</div></b>

In this exercise, you will implement all methods of the Double DQN algorithm inside the agent class from scratch. You can refer to the assignment or any other material to implement the agent.

There are a few things you need to consider before writing your code:

- The agent class should be `class Agent`, and it should take the following initial arguments: `state_size`, `num_actions` and `window_size`.
  - `state_size:` The size of the stock environment's observation space (i.e., `env.observation_space.shape`).
  - `num_actions:` The number of possible actions of the environment (i.e., `env.action_space.n`).
  - `window_size:` Determines how large a state space you want.

- Implement a `build_network()` method that builds the architecture for the target Q network and the current Q network. This method can take `state_size` and `num_actions` as input variables, and it should return the `model`. You can freely create your own architecture for this environment, but remember that the total number of parameters of both models must stay below `8000`, i.e., `number of network parameters < 8000`.

- Implement a `train()` method that takes an environment object and the number of epochs as arguments. It should return a list of all episodic rewards collected during training.
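
To check an architecture against the parameter budget before building it, the count for a fully connected network can be worked out by hand. The helper `count_mlp_params` and the layer sizes below are assumptions chosen for illustration, not a recommended architecture. On a real PyTorch model, the same number can be obtained with `sum(p.numel() for p in model.parameters())`.

```python
def count_mlp_params(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total


# Assumed example: window_size = 15 gives 15 * 2 = 30 inputs; hidden sizes are arbitrary.
layer_sizes = [30, 32, 16, 8, 2]
per_network = count_mlp_params(layer_sizes)
print(per_network)      # parameters in one network
print(2 * per_network)  # current Q network and target Q network together
```

With these sizes, a single network has 1674 parameters and the two networks together have 3348, so either reading of the limit is satisfied.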

Possible hyperparameters:

- `discount rate (`$\gamma$`) = 0.9`
- `learning rate (lr) = 0.0005`
- `batch size = 32`
- `buffer memory length = 50000`
- `update target step = 200`
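
The buffer memory length and batch size interact as sketched below: a `deque` with `maxlen` silently drops the oldest transitions, and a batch is drawn uniformly at random. The capacity of 5 and the placeholder transitions are assumptions that keep the output small.

```python
import random
from collections import deque

buffer = deque(maxlen=5)  # tiny capacity for illustration; the list above suggests 50000
for step in range(8):
    # Placeholder transition: (state, action, reward, next_state, done)
    buffer.append((step, step % 2, 1.0, step + 1, False))

print(len(buffer))   # capped at maxlen; the oldest transitions were dropped
print(buffer[0][0])  # oldest state still in the buffer

random.seed(0)
batch = random.sample(list(buffer), 3)  # uniform sampling without replacement
states, actions, rewards, next_states, dones = map(list, zip(*batch))
print(states)
```

Unpacking with `zip(*batch)` turns a list of transitions into one list per field, which is the form needed to build batched tensors.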

Beyond these requirements and hints, you are free to implement the Double DQN algorithm as you like, especially the version you studied in this unit. Furthermore, consider varying the model architecture, the exploration rate and other hyperparameters to get the best result.
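
As one possible way to handle the exploration rate, the sketch below applies a linear epsilon decay per frame and records when a target network would be synchronised. The values of `epsilon`, `epsilon_min` and `epsilon_decay` are assumptions chosen for illustration; only the update step of 200 comes from the list of possible hyperparameters.

```python
epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.001  # assumed values
update_target = 200

epsilons, sync_steps = [], []
for frame_count in range(1, 1201):
    if epsilon > epsilon_min:
        epsilon -= epsilon_decay  # linear decay, one step per frame
    epsilons.append(epsilon)
    if (frame_count + 1) % update_target == 0:
        sync_steps.append(frame_count)  # here the target network would be synced

print(round(epsilons[499], 3))  # exploration rate after 500 frames
print(sync_steps)
```

Because the check happens before the subtraction, this schedule can end up to one decay step below `epsilon_min`. A slower decay keeps the agent exploring for longer, which may matter when each episode is short.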

For your reference, the following is the major difference in the target equation of the Double DQN algorithm compared to the DQN algorithm. For the complete algorithm, you can refer to the DQN chapter.

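> - if the episode terminates at step $j+1$: 
>     - $y_j = r_j$
> - else 
>     - $y_j = r_j + \gamma\hat{Q}(\phi_{j+1}, \arg\max_{a'}\hat{Q}(\phi_{j+1}, a'; \theta);\theta^-)$
>
> Here, $\theta$ denotes the parameters of the current Q network, which selects the action, and $\theta^-$ denotes the parameters of the target Q network, which evaluates it.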
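
The target rule can be traced by hand with a small worked example in plain Python. The function name `double_dqn_targets` and all Q-values below are made up for illustration. In an actual agent, the two lists of Q-values would come from the current Q network and the target Q network.

```python
def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.9):
    """Compute y_j for a batch of transitions using plain Python lists."""
    targets = []
    for r, done, q_on, q_tg in zip(rewards, dones, q_online_next, q_target_next):
        if done:
            targets.append(r)  # terminal transition: y_j = r_j
            continue
        # Select the action with the current Q network (parameters theta) ...
        best_action = max(range(len(q_on)), key=lambda a: q_on[a])
        # ... and evaluate that action with the target Q network (parameters theta^-).
        targets.append(r + gamma * q_tg[best_action])
    return targets


# Made-up Q-values for two actions (0 = Sell, 1 = Buy) on three transitions.
rewards = [1.0, 0.0, 2.0]
dones = [False, False, True]
q_online_next = [[0.5, 2.0], [3.0, 1.0], [0.0, 0.0]]
q_target_next = [[4.0, 1.0], [2.0, 5.0], [9.0, 9.0]]

targets = double_dqn_targets(rewards, dones, q_online_next, q_target_next)
print(targets)
```

In the first transition, the current network prefers action 1, so the target uses the target network's value for action 1 (1.0) rather than its maximum (4.0). Plain DQN would take the maximum instead, which is the source of its overestimation.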

### Solution Code Snippet
<details>
    <summary style="color:red">Click here and copy the code to cell below</summary>
    
    def __init__(self, state_size, num_actions=2, window_size=20):
        self.num_actions = None
        self.state_size = None
        self.window_size = None
        
        self.lr = None
        self.gamma = None
        self.batch_size = None
        self.epsilon = None
        self.epsilon_min = None
        self.epsilon_decay = None
        self.rs = np.random.RandomState(seed=42)

        self.memory = deque(maxlen=50000)
        self.update_target = 200

        self.q_network = None
        self.target_network = None
        self.optimizer = None
        self.loss_fn = None
    
    def build_network(self, state_size, num_actions=2):
        size = state_size[0] * state_size[1]
        model = nn.Sequential(
          nn.Flatten(),
          nn.Linear(None, None),
          nn.ReLU(),
          nn.Linear(None, None),
          nn.ReLU(),
          nn.Linear(None, None),
          nn.ReLU(),
          nn.Linear(None, None)
        )
        return model
  
    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self):
        replay = random.sample(self.memory, self.batch_size)
        states, actions, rewards, next_states, done = map(list, zip(*replay))
        states = torch.stack(states).reshape(-1, *self.state_size)
        actions = torch.Tensor(actions).to(torch.int64)
        rewards = torch.Tensor(rewards)
        next_states = torch.stack(next_states).reshape(-1, *self.state_size)
        done = torch.Tensor(done)
        return states, actions, rewards, next_states, done

    def optimize_network(self, states, actions, rewards, next_states, dones):
        self.target_network.train(False)
        self.q_network.train(False)
        with torch.no_grad():
            max_actions = None
            update_values = None
            targets = None

            actions_one_hot = None

        self.q_network.train(True)
        q_values = self.q_network(states)
        action_values = None
        loss = None

        self.optimizer.zero_grad()  # clear gradients left over from the previous step
        loss.backward()

        self.optimizer.step()
    
    def run_episode(self, env, frame_count):
        state = env.reset()
        episode_reward = 0
        done = False
        state = torch.Tensor(state).unsqueeze(0)
        while not done:
            frame_count += 1
            if self.epsilon > self.rs.rand():
                action = None
            else:
                q_values = None
                action = None
            if self.epsilon > self.epsilon_min:
                self.epsilon -= self.epsilon_decay

            next_obs, reward, done, _ = env.step(action)
            next_obs = torch.Tensor(next_obs).unsqueeze(0)
            episode_reward += None

            self.remember(state, action, reward, next_obs, done)

            state = next_obs
            
            if frame_count > self.batch_size:
                sample_batch = self.sample()
                self.optimize_network(*sample_batch)

            if (frame_count + 1) % self.update_target == 0:
                self.target_network.load_state_dict(self.q_network.state_dict())

        return episode_reward, frame_count
    
    def train(self, env, num_epochs=50):
        frame_count = 0
        reward_history = []
        for epoch in range(None):
            episodic_reward, frame_count = None
            reward_history.append(None)
            print('Episode {} || Rewards:{} || Avg reward:{}'.format(epoch, round(episodic_reward, 2), np.average(reward_history)))
        plt.cla()
        env.render_all()
        plt.show()
        return reward_history

</details>

In [None]:
### Ex-1-Task-1
class Agent(object):
    '''An agent class'''
    ## TODO:
    # Write all methods required for agent class
    # Must have build_network() method and train() method at least as described
    # in above section
    ### BEGIN SOLUTION
    # your code here
    raise NotImplementedError
    ### END SOLUTION

In [None]:
# Intentionally left blank

## Exercise 2: Train the model
<b><div style="text-align: right">[Marks: 10]</div></b>

---

In this exercise, you will create a stock environment and train your agent.

These are the guidelines for this exercise:
- First, create a stock environment with `gym.make(env_name, frame_bound=frame_bound, window_size=window_size)`.
  - `env_name = 'stocks-v0'`
  - `frame_bound = (800, 1000)`
  - `window_size = 15` 
- Initialize an agent with the following parameters:
  - `state_size`
  - `num_actions`
  - `window_size`
- Train the agent for up to `20 epochs` and store all the rewards in the `reward_history` variable. 

> #### **You will get full marks if the average reward over 20 epochs of training is greater than 50.00**
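
A quick way to check the pass criterion yourself is to average the returned rewards. The helper `passes_threshold` and the reward values below are hypothetical. Real values come from `agent.train()`.

```python
def passes_threshold(reward_history, threshold=50.0):
    """Return the mean episodic reward and whether it exceeds the threshold."""
    average = sum(reward_history) / len(reward_history)
    return average, average > threshold


# Made-up rewards for illustration only.
example_history = [40.0, 55.0, 70.0, 45.0]
average, ok = passes_threshold(example_history)
print(average, ok)
```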

In [None]:
### Ex-2-Task-1
frame_bound = (800, 1000)
window_size = 15
num_epochs = 20
env = None  # implement gym.make()
## use seed value of 42 in env.seed()

state_size = None
num_actions = None

agent = None  # initialize Agent(state_size, num_actions, window_size)

reward_history = None  # store history using agent.train()
### BEGIN SOLUTION
# your code here
raise NotImplementedError
### END SOLUTION

In [None]:
assert env is not None
assert frame_bound is not None
assert window_size is not None
assert state_size is not None
assert num_actions is not None
assert agent is not None
assert num_epochs is not None
assert reward_history is not None

## Conclusion

Congrats!! You have successfully implemented the Double DQN algorithm for stock market prediction. By now, you may have fully understood the Double DQN algorithm. You can also try the Dueling DQN algorithm in various other OpenAI gym environments.