## Solving CartPole with DQNs
In this assignment you will build an RL agent capable of achieving an average reward of 150+ in the CartPole environment.

In [1]:
pip install gym

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Make all necessary imports here
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from model import DQN, CustomDataset
import gym
import matplotlib.pyplot as plt
import numpy as np
import imageio
from tqdm import tqdm

Regarding the CartPoleAgent class:
- The constructor (\_\___init__\_\_) should store __gamma__ and __epsilon__ as instance attributes. It initializes the online network, saves it, and loads the saved weights into the target network (so that the target and online networks are identical at initialization)
- The __choose_action()__ function takes the vector of __Q(s, a)__ values for a state s as input: if __Q_s__ is the input, __Q_s[0]__ represents __Q(s, 0)__, __Q_s[1]__ represents __Q(s, 1)__, and so on. It outputs the chosen action (an integer) according to the current exploration strategy (for example, a random action with probability ε and the action with the highest Q(s, a) value with probability 1-ε)
- The __train_agent()__ function runs for a specific number of loops. In each loop:
    - It generates training data using the __generate_training_data()__ function and passes it to the train_instance function of the online network (which trains the online network)
    - It then saves the online network and loads that saved network as the target network
    - Calls the __evaluate_performance()__ function
    - Updates the value of epsilon as required
- The __generate_training_data()__ function:
    - Simulates many episodes/games/trajectories. It uses the online network for choosing actions and the target network for determining targets, then stores all visited states in a list/array/tensor and the corresponding labels (i.e. targets) in another list/array/tensor.
    - It then makes a __CustomDataset__ variable from these states and labels and returns it
    - The CartPole-v1 environment truncates an episode after 500 steps; check this yourself and end the episode once its length reaches 500
    - The dataset returned should contain enough states and targets (around 5000-10000) that randomly chosen datapoints approximately satisfy the i.i.d. condition
- The __evaluate_performance()__ function calculates the average reward achieved by the current online network by simulating at least 5 episodes (without any exploration, since we are only measuring the average reward), and then prints it
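
The targets mentioned above follow the usual Q-learning rule: the reward plus the discounted maximum target-network Q-value of the next state. The snippet below is only a minimal NumPy sketch of that rule. The helper name `td_targets` and its arguments are hypothetical and not part of `model.py`, and dropping the bootstrap term for terminal transitions is a standard convention assumed here rather than something the assignment spells out.

```python
import numpy as np

def td_targets(rewards, next_q_values, terminated, gamma=0.99):
    """Return r + gamma * max_a' Q_target(s', a') for a batch of transitions.

    next_q_values has shape (batch, n_actions) and is assumed to come from
    the target network. For terminal transitions the bootstrap term is dropped.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    next_q_values = np.asarray(next_q_values, dtype=np.float64)
    terminated = np.asarray(terminated, dtype=bool)
    best_next = next_q_values.max(axis=1)  # max over actions of Q_target(s', a')
    return rewards + gamma * best_next * (~terminated)

# Two transitions: the first continues, the second ends the episode
targets = td_targets(
    rewards=[1.0, 1.0],
    next_q_values=[[2.0, 3.0], [5.0, 4.0]],
    terminated=[False, True],
)
```

Here the first target bootstraps from the larger next-state Q-value, while the second is just the reward because the episode ended.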

Generally you should see a rising trend in your average obtained reward

Now some recommendations:
- You need a good exploration strategy. Exponentially decaying exploration is preferred: you can start with ε=0.5 and divide it by a constant after each training loop, so that it finally reaches ε = 0.01
- Whenever you use the forward function of the DQN class in __generate_training_data()__ or __evaluate_performance()__, make sure to detach the output tensor so that no gradients are tracked for it. You can detach any tensor "__a__" like:
```
    a = a.detach()
```
- 0.99 is a good value for Gamma
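
One possible way to pick the constant for the exponential decay is to solve for the divisor that takes ε from its start value to its end value in a given number of loops. This is only a sketch: `decay_divisor` is a hypothetical helper, and the 1000 loops are an assumption taken from the default of `train_agent()`.

```python
def decay_divisor(eps_start, eps_end, num_loops):
    """Constant d such that eps_start / d**num_loops == eps_end."""
    return (eps_start / eps_end) ** (1.0 / num_loops)

divisor = decay_divisor(0.5, 0.01, 1000)  # roughly 1.0039

# Apply the decay once per training loop
eps = 0.5
for _ in range(1000):
    eps /= divisor
# eps is now approximately 0.01
```

With fewer training loops the divisor grows, so exploration shrinks faster per loop.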

Some more things you can do (Optional):
- You can load an already saved PyTorch model with name "model.pth" into any variable network as follows:
```
    network = torch.load("model.pth")
```
- In the __evaluate_performance()__ function, you can use the __imageio__ library to make gifs of your agent playing the game (Google How!), but you have to initialize your environment as:
```
    env = gym.make("CartPole-v1", render_mode="rgb_array")
```
- In the __evaluate_performance()__ function, you can calculate the Mean-Square Error of the model, store the value for each iteration, and finally plot these values to get an idea of how your training is going.
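
The Mean-Square Error tracking suggested above could look roughly like the following. It is a minimal sketch, assuming you already have the network's predictions and the corresponding targets as arrays; `mean_squared_error` and `mse_history` are hypothetical names, not part of the provided code.

```python
import numpy as np

def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    predictions = np.asarray(predictions, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    return float(np.mean((predictions - targets) ** 2))

mse_history = []  # one entry per training iteration

# Example for a single iteration with two datapoints
mse_history.append(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
```

After training, `plt.plot(mse_history)` followed by `plt.show()` would display how the error evolves over the iterations.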

In [19]:
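class CartPoleAgent:
    def __init__(self, epsilon=0.5, gamma=0.99) -> None:
        self.gamma = gamma
        self.epsilon = epsilon
        self.online_network = DQN()
        # Save the online network and load it into the target network so both start identical
        torch.save(self.online_network.state_dict(), 'online_model.pth')
        self.target_network = DQN()
        self.target_network.load_state_dict(torch.load('online_model.pth'))

    def choose_action(self, Q_s) -> int:
        # Flatten so that a (1, n_actions) array is treated as n_actions values
        Q_s = np.asarray(Q_s).flatten()
        if np.random.rand() < self.epsilon:
            # Explore: uniformly random action
            return int(np.random.choice(len(Q_s)))
        # Exploit: action with the highest Q(s, a)
        return int(np.argmax(Q_s))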

    def generate_training_data(self, min_samples=5000) -> CustomDataset:
        env = gym.make("CartPole-v1")
        states = []
        labels = []

        # Collect transitions until the dataset is large enough (the notes above suggest 5000-10000)
        while len(states) < min_samples:
            reset_out = env.reset()
            # gym >= 0.26 returns (observation, info); older versions return only the observation
            state = reset_out[0] if isinstance(reset_out, tuple) else reset_out

            for _ in range(500):  # cap the episode length at 500 steps
                state_tensor = torch.from_numpy(np.asarray(state, dtype=np.float32)).view(1, -1)
                Q_values = self.online_network.forward(state_tensor).detach().numpy().flatten()
                action = self.choose_action(Q_values)

                step_out = env.step(action)
                if len(step_out) == 5:
                    # gym >= 0.26: (obs, reward, terminated, truncated, info)
                    next_state, reward, terminated, truncated, _ = step_out
                else:
                    # Older gym: (obs, reward, done, info); done also covers the time limit
                    next_state, reward, terminated, _ = step_out
                    truncated = False

                if terminated:
                    # No future reward once the episode has ended
                    target = reward
                else:
                    # Bootstrap from the target network on the next state
                    next_tensor = torch.from_numpy(np.asarray(next_state, dtype=np.float32)).view(1, -1)
                    next_Q = self.target_network.forward(next_tensor).detach().numpy()
                    target = reward + self.gamma * np.max(next_Q)

                states.append(state)
                labels.append(target)

                state = next_state

                if terminated or truncated:
                    break
        env.close()

        training_data = CustomDataset(np.array(states), np.array(labels))
        return training_data

    def train_agent(self, num_loops=1000):
        for loop in tqdm(range(num_loops)):
            training_data = self.generate_training_data()
            self.online_network.train_instance(training_data)
            # Sync the target network with the freshly trained online network
            torch.save(self.online_network.state_dict(), 'online_model.pth')
            self.target_network.load_state_dict(torch.load('online_model.pth'))
            self.evaluate_performance(loop)
            # Decay exploration, keeping the floor of 0.01 recommended above
            self.epsilon = max(0.01, self.epsilon * 0.99)

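    def evaluate_performance(self, iteration) -> None:
        env = gym.make("CartPole-v1")
        total_reward = 0
        num_episodes = 5

        for _ in range(num_episodes):
            reset_out = env.reset()
            # gym >= 0.26 returns (observation, info); older versions return only the observation
            state = reset_out[0] if isinstance(reset_out, tuple) else reset_out

            for _ in range(500):
                state_tensor = torch.from_numpy(np.asarray(state, dtype=np.float32)).view(1, -1)
                q_values = self.online_network.forward(state_tensor).detach().numpy()
                # Greedy action: no exploration during evaluation
                action = int(np.argmax(q_values))

                step_out = env.step(action)
                next_state, reward = step_out[0], step_out[1]
                # gym >= 0.26 returns terminated and truncated; older versions return a single done flag
                done = step_out[2] or (len(step_out) == 5 and step_out[3])

                total_reward += reward

                if done:
                    break

                state = next_state

        average_reward = total_reward / num_episodes
        print(f'Iteration: {iteration}, Average Reward: {average_reward}')
        env.close()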

Run the cell below to start training.

In [21]:
# This cell should not be changed
Agent = CartPoleAgent()
Agent.train_agent()