# PyTorch Assignment: Reinforcement Learning (RL)

**[Duke Community Standard](http://integrity.duke.edu/standard.html): By typing your name below, you are certifying that you have adhered to the Duke Community Standard in completing this assignment.**

Name: Abhinav Tembulkar

### Short answer

1\. One of the fundamental challenges of reinforcement learning is balancing *exploration* versus *exploitation*. What do these two terms mean, and why do they present a challenge?

**exploration**- 
1. Here we try different actions, so we might learn something we did not expect from the actions our current estimates consider optimal.
2. Here we take actions that are NOT optimal under our current estimate of the Q function, including actions we have never taken before.
<br>

**exploitation**- 
1. Here we exploit the Q function we have learned and take the action that is optimal for the given state.
2. We exploit our learned experience, i.e. the learned Q function, always taking the action that maximizes the Q function in each state.

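These two terms present a challenge because their goals pull in opposite directions: every step spent exploring is a step not spent exploiting the best known action, and vice versa. Yet any agent needs both to ultimately achieve its goal. Without exploration it may never discover better actions, and without exploitation it never benefits from what it has learned.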
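One common way to trade the two off is an epsilon-greedy rule. The sketch below is only illustrative: the function name `epsilon_greedy`, the example Q values, and the use of Python's `random` module are assumptions made for this example, not part of the assignment code.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick an action index from a list of Q-value estimates (hypothetical helper)."""
    if random.random() < epsilon:
        # Explore: ignore the estimates and pick an action uniformly at random.
        return random.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

q = [0.1, 0.7, 0.3]                      # made-up Q estimates for three actions
greedy_action = epsilon_greedy(q, 0.0)   # epsilon = 0 means pure exploitation
random_action = epsilon_greedy(q, 1.0)   # epsilon = 1 means pure exploration
```

With `epsilon` between 0 and 1 the agent mostly exploits but still explores occasionally, and decaying `epsilon` over time shifts the balance toward exploitation as the estimates improve.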

2\. Another fundamental reinforcement learning challenge is what is known as the *credit assignment problem*, especially when rewards are sparse. What do we mean by the phrase, and why does it make learning especially difficult?

The credit assignment problem (CAP) arises in reinforcement learning when an agent interacting with an environment, trying to maximize its total reward, has to work out which of the actions it took were actually responsible for the reward it eventually received.

When rewards are sparse, the agent takes many actions between one reward and the next, so it gets little feedback about which of them were good or bad. The temporal-difference error carries almost no signal on most steps, and since the agent chooses its actions based on the rewards it receives, learning which actions to take becomes difficult.
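A small sketch of how discounting spreads a single delayed reward back over earlier steps may make this concrete. The helper name `discounted_returns`, the reward sequence, and `gamma=0.5` are assumptions chosen for illustration only.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted return G_t for every step of one episode."""
    returns = []
    g = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# Sparse reward: nothing until the final step of the episode.
sparse_rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(sparse_rewards, gamma=0.5)
```

Every earlier step receives some share of the final reward purely based on how far it is from that reward, whether or not the action taken there actually helped. Telling the helpful actions apart from the irrelevant ones is the hard part.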

### Deep SARSA Cart Pole

[SARSA (state-action-reward-state-action)](https://en.wikipedia.org/wiki/State–action–reward–state–action) is another Q value algorithm that resembles Q-learning quite closely:

Q-learning update rule:
\begin{equation}
Q_\pi (s_t, a_t) \leftarrow (1 - \alpha) \cdot Q_\pi(s_t, a_t) + \alpha \cdot \big(r_t + \gamma \max_a Q_\pi(s_{t+1}, a)\big)
\end{equation}

SARSA update rule:
\begin{equation}
Q_\pi (s_t, a_t) \leftarrow (1 - \alpha) \cdot Q_\pi(s_t, a_t) + \alpha \cdot \big(r_t + \gamma Q_\pi(s_{t+1}, a_{t+1})\big)
\end{equation}

Unlike Q-learning, which is considered an *off-policy* algorithm, SARSA is an *on-policy* algorithm. 
When Q-learning calculates the estimated future reward, it must "guess" the future, starting with the next action the agent will take. In Q-learning, we assume the agent will take the best possible action: $\max_a Q_\pi(s_{t+1}, a)$. SARSA, on the other hand, uses the action that was actually taken next in the episode we are learning from: $Q_\pi(s_{t+1}, a_{t+1})$. In other words, SARSA learns from the next action the agent actually took (on policy), as opposed to the maximum possible Q value for the next state (off policy).
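The two update rules differ only in how the bootstrapped target is formed. The sketch below works through both with plain Python numbers. The function names and the example values are assumptions for illustration and are not taken from the course's Q-learning agent.

```python
def q_learning_target(reward, next_q_values, gamma):
    # Off-policy: assume the best possible next action will be taken.
    return reward + gamma * max(next_q_values)

def sarsa_target(reward, next_q_values, next_action, gamma):
    # On-policy: use the next action the agent actually took.
    return reward + gamma * next_q_values[next_action]

def td_update(old_q, target, alpha):
    # Shared update: blend the old estimate with the new target.
    return (1 - alpha) * old_q + alpha * target

next_q = [1.0, 3.0]   # made-up Q(s_{t+1}, a) for two actions
reward, gamma, alpha = 1.0, 0.5, 0.5

q_target = q_learning_target(reward, next_q, gamma)            # uses max -> action 1
s_target = sarsa_target(reward, next_q, next_action=0, gamma=gamma)  # agent explored action 0

new_q_qlearning = td_update(0.0, q_target, alpha)
new_q_sarsa = td_update(0.0, s_target, alpha)
```

When the next action happens to be the greedy one, the two targets coincide. They differ only on steps where the agent explored.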

Build an RL agent that uses SARSA to solve the Cart Pole problem. 

*Hint: You can and should reuse the Q-Learning agent we went over earlier. In fact, if you know what you're doing, it's possible to finish this assignment in about 30 seconds.*

In [6]:
"""
Solution to the above assignment below
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import gym
import math

from collections import deque
from tqdm.notebook import tqdm
import random

class DQN(nn.Module):
  def __init__(self):
    super().__init__()
    self.fc1 = nn.Linear(4,24)
    self.fc2 = nn.Linear(24,48)
    self.fc3 = nn.Linear(48,2)

  def forward(self,X):
    x = F.relu(self.fc1(X))
    x = F.relu(self.fc2(x))
    y = self.fc3(x)
    return y

EPISODES = 1000
SUCCESS_TICKS = 195
GAMMA = 1.0
EPSILON = 1.0
EPSILON_MIN = 0.01
EPSILON_DECAY = 0.995
ALPHA = 0.01
ALPHA_DECAY = 0.01
BATCH_SIZE = 64

class DQNcartpolesolver:
  def __init__(self,max_env_steps=None,monitor=False,quiet=True):
    self.memory = deque(maxlen=100000)
    self.env = gym.make('CartPole-v0')
    if monitor : 
      self.env = gym.wrappers.Monitor(self.env,'../data/cartpole-1',force=True)
    self.quiet = quiet
    self.gamma = GAMMA
    self.epsilon = EPSILON
    self.epsilon_min = EPSILON_MIN
    self.epsilon_decay = EPSILON_DECAY
    self.alpha = ALPHA
    self.alpha_decay = ALPHA_DECAY
    self.n_episodes = EPISODES
    self.n_win_ticks = SUCCESS_TICKS
    self.batch_size = BATCH_SIZE
    if max_env_steps is not None:
      self.env._max_episode_steps = max_env_steps

    self.dqn = DQN()
    self.criterion = nn.MSELoss()
    self.opt = torch.optim.Adam(self.dqn.parameters(),lr=0.01)

  def get_epsilon(self,t):
    return max(self.epsilon_min, min(self.epsilon, 1.0 - math.log10((t + 1) * self.epsilon_decay)))

  def preprocess_state(self,state):
    return torch.tensor(np.reshape(state,[1,4]),dtype=torch.float32)

  def choose_action(self,state,epsilon):
    if np.random.random()<=epsilon:
      return self.env.action_space.sample()
    else:
      with torch.no_grad():
        # Greedy action, returned as a plain Python int so env.step accepts it directly
        return torch.argmax(self.dqn(state)).item()

  def remember(self,state,action,reward,next_state,done,next_action):
    reward = torch.tensor(reward)
    self.memory.append((state, action, reward, next_state, done,next_action))

  def replay(self,batch_size):
    y_batch, y_target_batch = [], []
    minibatch = random.sample(self.memory, min(len(self.memory), batch_size))
    for state, action, reward, next_state, done,next_action in minibatch:
      y = self.dqn(state)
      y_target = y.clone().detach()
      with torch.no_grad():
        # SARSA target: bootstrap from the next action actually chosen, not the max over actions
        y_target[0][action] = reward if done else reward + self.gamma*self.dqn(next_state)[0][next_action]
      y_batch.append(y[0])
      y_target_batch.append(y_target[0])
  
    y_batch = torch.cat(y_batch)
    y_target_batch = torch.cat(y_target_batch)
    
    self.opt.zero_grad()
    loss = self.criterion(y_batch, y_target_batch)
    loss.backward()
    self.opt.step()        
    
    if self.epsilon > self.epsilon_min:
      self.epsilon *= self.epsilon_decay

  def run(self):
    scores = deque(maxlen=100)

    for e in tqdm(range(self.n_episodes)):
      state = self.preprocess_state(self.env.reset())
      done = False
      i = 0
    
      action = self.choose_action(state, self.get_epsilon(e))
    
      while not done:
        if e % 100 == 0 and not self.quiet:
            self.env.render()
        next_state, reward, done, _ = self.env.step(action)
        next_state = self.preprocess_state(next_state)

        next_action = self.choose_action(next_state,self.get_epsilon(e))
        self.remember(state, action, reward, next_state, done,next_action)
        state = next_state
        action = next_action
        i += 1

      # Episode finished: record its length once (not once per tick)
      scores.append(i)
      mean_score = np.mean(scores)
      if mean_score >= self.n_win_ticks and e >= 100:
        if not self.quiet: print('Ran {} episodes. Solved after {} trials ✔'.format(e, e - 100))
        return e - 100
      if e % 100 == 0 and not self.quiet:
        print('[Episode {}] - Mean survival time over last 100 episodes was {} ticks.'.format(e, mean_score))

      # Train on a minibatch of remembered transitions after each episode
      self.replay(self.batch_size)
    
    if not self.quiet: print('Did not solve after {} episodes 😞'.format(e))
    return e

In [None]:
if __name__ == '__main__':
    agent = DQNcartpolesolver()
    agent.run()
    agent.env.close()
    
import time

def game(self):
    # Play one episode greedily with the trained network and render it
    state = self.env.reset()
    done = False
    reward = 0

    with torch.no_grad():
      while not done:
        state = torch.tensor(state, dtype=torch.float32).view(-1,4)
        y = self.dqn(state)
        # Greedy action as a plain Python int
        action = torch.argmax(y).item()
        print(y,action)
        next_state,rwrd,done,_ = self.env.step(action)
        state = next_state
        self.env.render()
        reward+=rwrd
        time.sleep(0.1)

    self.env.close()
    print("YOUR REWARD: ",reward)
    
game(agent)