<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_12_04_atari.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Q-Learning on the Atari game Phoenix (OpenAI Gym environment)

<img src='https://www.gymlibrary.dev/_images/phoenix.gif'>

#### Install and Import libraries

In [None]:
# HIDE OUTPUT
try:
    from google.colab import drive
    %tensorflow_version 2.x
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

if COLAB:
  !sudo apt-get install -y xvfb ffmpeg
  !pip install -q ale-py
  !pip install -q 'gym==0.17.3'
  !pip install -q 'imageio==2.4.0'
  !pip install -q PILLOW
  !pip install -q 'pyglet==1.3.2'
  !pip install -q pyvirtualdisplay
  !pip install -q --upgrade tensorflow-probability
  !pip install -q 'tf-agents==0.12.0'
  !pip install -q keras-rl2

In [None]:

!pip install tensorflow==2.10.0

In [None]:
import base64
import imageio
import IPython
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import pyvirtualdisplay

import tensorflow as tf

# Set up a virtual display for rendering OpenAI gym environments.
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()


In [None]:
# HIDE OUTPUT
! wget http://www.atarimania.com/roms/Roms.rar
! mkdir /content/ROM/
! unrar e -o+ /content/Roms.rar /content/ROM/
! python -m atari_py.import_roms /content/ROM/

#### Atari environment

In [None]:
import gym
env = gym.make("Phoenix-v4")

We can now reset the environment. The reset call returns the first observation, an RGB frame of the Phoenix game as a NumPy array.

In [None]:
env.reset()
#PIL.Image.fromarray(env.render())


array([[[  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        ...,
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0]],

       [[  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        ...,
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0]],

       [[  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0],
        ...,
        [  0,   0,   0],
        [  0,   0,   0],
        [  0,   0,   0]],

       ...,

       [[146,  70, 192],
        [146,  70, 192],
        [146,  70, 192],
        ...,
        [146,  70, 192],
        [146,  70, 192],
        [146,  70, 192]],

       [[146,  70, 192],
        [146,  70, 192],
        [146,  70, 192],
        ...,
        [146,  70, 192],
        [146,  70, 192],
        [146,  70, 192]],

       [[146,  70, 192],
        [146,  70, 192],
        [146,  70, 192],
        ...,
        [146,  70, 192],
        [146,  70, 192],
        [146,  70, 192]]

In [None]:
height, width, channels = env.observation_space.shape
actions = env.action_space.n
print(env.observation_space.shape)

(210, 160, 3)


#### Create a Deep Learning Model with Keras

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten,Convolution2D
from keras.optimizers import Adam

In [None]:
from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy, BoltzmannQPolicy
from rl.memory import SequentialMemory

In [None]:
# Training parameters.
time_limit = True
buffer_size = 200000  # observation history size
batch_size = 25  # mini batch size sampled from history at each update step
nb_actions = env.action_space.n
window_length = 3

# Construct a convolutional network (CNN) that maps a window of 3 stacked frames to one Q-value per action
model = Sequential()
model.add(Convolution2D(32, (8,8), strides=(4,4), activation='relu', input_shape=(3,height, width, channels)))
model.add(Convolution2D(64, (4,4), strides=(2,2), activation='relu'))
model.add(Convolution2D(64, (3,3), activation='relu'))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(actions, activation='linear'))
model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 3, 51, 39, 32)     6176      
                                                                 
 conv2d_1 (Conv2D)           (None, 3, 24, 18, 64)     32832     
                                                                 
 conv2d_2 (Conv2D)           (None, 3, 22, 16, 64)     36928     
                                                                 
 flatten (Flatten)           (None, 67584)             0         
                                                                 
 dense (Dense)               (None, 512)               34603520  
                                                                 
 dense_1 (Dense)             (None, 256)               131328    
                                                                 
 dense_2 (Dense)             (None, 8)                 2
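
The output shapes and parameter counts in the summary above can be checked by hand. The following is a minimal sketch of that arithmetic, assuming unpadded ("valid") convolutions, which is the Keras default, and assuming each of the 3 stacked frames is convolved separately with shared weights. The helper name `conv_out` is hypothetical and not part of Keras.

```python
def conv_out(size, kernel, stride=1):
    """Output length along one axis of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1

height, width, channels = 210, 160, 3  # env.observation_space.shape
window_length = 3                      # number of stacked frames

# (kernel size, stride, number of filters) of the three Convolution2D layers
conv_layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]

h, w, in_channels = height, width, channels
shapes, params = [], []
for kernel, stride, filters in conv_layers:
    h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)
    # weights: kernel * kernel * input channels per filter, plus one bias per filter
    params.append(kernel * kernel * in_channels * filters + filters)
    shapes.append((window_length, h, w, filters))
    in_channels = filters

flat_size = window_length * h * w * in_channels  # input size of the first Dense layer
dense_params = flat_size * 512 + 512             # weights plus biases of Dense(512)

print(shapes)
print(params, flat_size, dense_params)
```

Almost all of the parameters sit in the first `Dense` layer, because the flattened feature map of the three full-resolution frames is large.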

#### Build Agent with Keras-RL and training

In [None]:
# keras-rl2 objects
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1., value_min=.1, value_test=.1, nb_steps=10000)
memory = SequentialMemory(limit=1000, window_length=3)

dqn = DQNAgent(model=model, memory=memory, policy=policy,
                  enable_dueling_network=True, dueling_type='avg', gamma=.8,
                   nb_actions=actions, nb_steps_warmup=1000
                  )

dqn.compile(
    Adam(lr=1e-3),
    metrics=['mse']
)

  super().__init__(name, **kwargs)


In [None]:

history = dqn.fit(env, 
                  nb_steps=2500,
                  visualize=False,nb_max_episode_steps=250,
                  verbose=2) 

Training for 2500 steps ...


  updates=self.state_updates,


  250/2500: episode: 1, duration: 6.421s, episode steps: 250, steps per second:  39, episode reward: 140.000, mean reward:  0.560 [ 0.000, 80.000], mean action: 3.444 [0.000, 7.000],  loss: --, mse: --, mean_q: --, mean_eps: --
  500/2500: episode: 2, duration: 3.747s, episode steps: 250, steps per second:  67, episode reward: 120.000, mean reward:  0.480 [ 0.000, 20.000], mean action: 3.548 [0.000, 7.000],  loss: --, mse: --, mean_q: --, mean_eps: --
  750/2500: episode: 3, duration: 3.758s, episode steps: 250, steps per second:  67, episode reward: 120.000, mean reward:  0.480 [ 0.000, 20.000], mean action: 3.524 [0.000, 7.000],  loss: --, mse: --, mean_q: --, mean_eps: --
 1000/2500: episode: 4, duration: 3.726s, episode steps: 250, steps per second:  67, episode reward: 80.000, mean reward:  0.320 [ 0.000, 20.000], mean action: 3.540 [0.000, 7.000],  loss: --, mse: --, mean_q: --, mean_eps: --


  updates=self.state_updates,


 1250/2500: episode: 5, duration: 213.460s, episode steps: 250, steps per second:   1, episode reward: 100.000, mean reward:  0.400 [ 0.000, 20.000], mean action: 3.340 [0.000, 7.000],  loss: 218.717224, mse: 443.726453, mean_q: 5.425571, mean_eps: 0.898750
 1500/2500: episode: 6, duration: 213.051s, episode steps: 250, steps per second:   1, episode reward: 60.000, mean reward:  0.240 [ 0.000, 20.000], mean action: 3.508 [0.000, 7.000],  loss: 0.386837, mse: 5.785922, mean_q: 3.098195, mean_eps: 0.876295
 1750/2500: episode: 7, duration: 213.145s, episode steps: 250, steps per second:   1, episode reward: 200.000, mean reward:  0.800 [ 0.000, 80.000], mean action: 3.672 [0.000, 7.000],  loss: 0.473153, mse: 5.293258, mean_q: 3.027168, mean_eps: 0.853795
 2000/2500: episode: 8, duration: 213.026s, episode steps: 250, steps per second:   1, episode reward: 260.000, mean reward:  1.040 [ 0.000, 80.000], mean action: 3.660 [0.000, 7.000],  loss: 3.300629, mse: 8.600682, mean_q: 3.572834, 

In [None]:
scores = dqn.test(env, nb_episodes=5, visualize=True)
print(np.mean(scores.history['episode_reward']))

Testing for 5 episodes ...
Episode 1: reward: 120.000, steps: 589
Episode 2: reward: 200.000, steps: 1154
Episode 3: reward: 200.000, steps: 853
Episode 4: reward: 200.000, steps: 2589
Episode 5: reward: 140.000, steps: 866
172.0


The baseline agent reaches a mean test reward of 172.0 over five episodes.

#### Train agent with different learning rate and discount rate

The baseline uses the following settings:<br>
max_steps_per_episode = 250 <br>
learning_rate = 0.001 <br>
discount_rate = 0.8 <br>
exploration_rate = 1 <br>
max_exploration_rate = 1 <br>
min_exploration_rate = 0.1 <br>
exploration_decay_rate = 0.1<br>

Here I change the learning rate and the discount rate to observe how they affect the baseline performance.
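
To make the role of the discount rate concrete, here is a small sketch, independent of keras-rl2, of how gamma weights a delayed reward. The function name `discounted_return` and the example reward sequence are illustrative assumptions, not values taken from the training runs.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# A hypothetical episode fragment: a single reward of 20 arriving after 10 steps.
rewards = [0.0] * 10 + [20.0]

for gamma in (0.8, 0.99):
    value = discounted_return(rewards, gamma)
    horizon = 1.0 / (1.0 - gamma)  # rough number of steps the agent "looks ahead"
    print(f"gamma={gamma}: discounted return={value:.2f}, effective horizon={horizon:.0f} steps")
```

With gamma = 0.8 the delayed reward is discounted to roughly a tenth of its value, while with gamma = 0.99 it keeps about 90% of it.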

In [None]:
# decrease learning rate α from 0.001 to 0.0001 
# increase discount rate gamma to 0.99
dqn1 = DQNAgent(model=model, memory=memory, policy=policy,
                  enable_dueling_network=True, dueling_type='avg', gamma=.99,
                   nb_actions=actions, nb_steps_warmup=1000
                  )

dqn1.compile(
    Adam(lr=1e-4),
    metrics=['mse']
)

  super().__init__(name, **kwargs)


In [None]:
history = dqn1.fit(env, 
                  nb_steps=2500,
                  visualize=False,nb_max_episode_steps=250,
                  verbose=2) 

Training for 2500 steps ...


  updates=self.state_updates,


  250/2500: episode: 1, duration: 3.905s, episode steps: 250, steps per second:  64, episode reward: 100.000, mean reward:  0.400 [ 0.000, 20.000], mean action: 3.248 [0.000, 7.000],  loss: --, mse: --, mean_q: --, mean_eps: --
  500/2500: episode: 2, duration: 3.745s, episode steps: 250, steps per second:  67, episode reward: 140.000, mean reward:  0.560 [ 0.000, 20.000], mean action: 3.512 [0.000, 7.000],  loss: --, mse: --, mean_q: --, mean_eps: --
  750/2500: episode: 3, duration: 3.750s, episode steps: 250, steps per second:  67, episode reward: 100.000, mean reward:  0.400 [ 0.000, 20.000], mean action: 3.544 [0.000, 7.000],  loss: --, mse: --, mean_q: --, mean_eps: --
 1000/2500: episode: 4, duration: 3.772s, episode steps: 250, steps per second:  66, episode reward: 160.000, mean reward:  0.640 [ 0.000, 80.000], mean action: 3.676 [0.000, 7.000],  loss: --, mse: --, mean_q: --, mean_eps: --


  updates=self.state_updates,


 1250/2500: episode: 5, duration: 213.420s, episode steps: 250, steps per second:   1, episode reward: 160.000, mean reward:  0.640 [ 0.000, 80.000], mean action: 3.668 [0.000, 7.000],  loss: 5.460064, mse: 7.197431, mean_q: 3.189197, mean_eps: 0.898750
 1500/2500: episode: 6, duration: 213.254s, episode steps: 250, steps per second:   1, episode reward: 180.000, mean reward:  0.720 [ 0.000, 80.000], mean action: 3.580 [0.000, 7.000],  loss: 1.685999, mse: 12.501783, mean_q: 4.530695, mean_eps: 0.876295
 1750/2500: episode: 7, duration: 213.356s, episode steps: 250, steps per second:   1, episode reward: 160.000, mean reward:  0.640 [ 0.000, 80.000], mean action: 3.616 [0.000, 7.000],  loss: 1.481246, mse: 11.871580, mean_q: 4.641760, mean_eps: 0.853795
 2000/2500: episode: 8, duration: 213.297s, episode steps: 250, steps per second:   1, episode reward: 100.000, mean reward:  0.400 [ 0.000, 20.000], mean action: 3.520 [0.000, 7.000],  loss: 0.582249, mse: 10.588105, mean_q: 4.353719, 

In [None]:
scores = dqn1.test(env, nb_episodes=5, visualize=True)
print(np.mean(scores.history['episode_reward']))

Testing for 5 episodes ...
Episode 1: reward: 180.000, steps: 1714
Episode 2: reward: 180.000, steps: 1025
Episode 3: reward: 80.000, steps: 1422
Episode 4: reward: 180.000, steps: 911
Episode 5: reward: 120.000, steps: 3292
148.0


#### Try a different policy


In [None]:
policy1 = BoltzmannQPolicy() 

dqn2 = DQNAgent(model=model, memory=memory, policy=policy1,
                  enable_dueling_network=True, dueling_type='avg', gamma=.8,
                   nb_actions=actions, nb_steps_warmup=1000
                  )

dqn2.compile(
    Adam(lr=1e-3),
    metrics=['mse']
)

  super().__init__(name, **kwargs)


In [None]:
history = dqn2.fit(env, 
                  nb_steps=2500,
                  visualize=False,nb_max_episode_steps=250,
                  verbose=2) 

Training for 2500 steps ...


  updates=self.state_updates,


  250/2500: episode: 1, duration: 4.054s, episode steps: 250, steps per second:  62, episode reward: 80.000, mean reward:  0.320 [ 0.000, 80.000], mean action: 2.408 [1.000, 7.000],  loss: --, mse: --, mean_q: --
  500/2500: episode: 2, duration: 3.764s, episode steps: 250, steps per second:  66, episode reward: 120.000, mean reward:  0.480 [ 0.000, 80.000], mean action: 2.560 [1.000, 7.000],  loss: --, mse: --, mean_q: --
  750/2500: episode: 3, duration: 3.745s, episode steps: 250, steps per second:  67, episode reward: 80.000, mean reward:  0.320 [ 0.000, 20.000], mean action: 2.292 [1.000, 7.000],  loss: --, mse: --, mean_q: --
 1000/2500: episode: 4, duration: 3.782s, episode steps: 250, steps per second:  66, episode reward: 80.000, mean reward:  0.320 [ 0.000, 20.000], mean action: 2.576 [1.000, 7.000],  loss: --, mse: --, mean_q: --


  updates=self.state_updates,


 1250/2500: episode: 5, duration: 213.712s, episode steps: 250, steps per second:   1, episode reward: 80.000, mean reward:  0.320 [ 0.000, 20.000], mean action: 3.672 [0.000, 7.000],  loss: 3.793703, mse: 10.484817, mean_q: 4.007106
 1500/2500: episode: 6, duration: 213.280s, episode steps: 250, steps per second:   1, episode reward: 140.000, mean reward:  0.560 [ 0.000, 80.000], mean action: 3.632 [0.000, 7.000],  loss: 4.798661, mse: 14.739688, mean_q: 4.098798
 1750/2500: episode: 7, duration: 213.365s, episode steps: 250, steps per second:   1, episode reward: 160.000, mean reward:  0.640 [ 0.000, 80.000], mean action: 3.528 [0.000, 7.000],  loss: 1.130640, mse: 11.936115, mean_q: 4.057529
 2000/2500: episode: 8, duration: 213.371s, episode steps: 250, steps per second:   1, episode reward: 60.000, mean reward:  0.240 [ 0.000, 20.000], mean action: 3.768 [0.000, 7.000],  loss: 5.852178, mse: 17.140491, mean_q: 4.561198
 2250/2500: episode: 9, duration: 213.434s, episode steps: 250

In [None]:
scores = dqn2.test(env, nb_episodes=5, visualize=True)
print(np.mean(scores.history['episode_reward']))

Testing for 5 episodes ...
Episode 1: reward: 0.000, steps: 533
Episode 2: reward: 0.000, steps: 530
Episode 3: reward: 0.000, steps: 778
Episode 4: reward: 0.000, steps: 578
Episode 5: reward: 0.000, steps: 539
0.0
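
The idea behind a Boltzmann policy, as tried above, is to sample actions from a softmax over the Q-values. The following is a minimal sketch of that idea in plain Python. It is not keras-rl2's implementation, and the names `boltzmann_probs`, `sample_action`, and the temperature parameter `tau` are assumptions made for illustration.

```python
import math
import random

def boltzmann_probs(q_values, tau=1.0):
    """Softmax over Q-values; tau is a temperature (higher means more random)."""
    scaled = [q / tau for q in q_values]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]

def sample_action(q_values, tau=1.0, rng=random):
    """Draw one action index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

q_values = [1.0, 2.0, 4.0]  # hypothetical Q-values for three actions
print(boltzmann_probs(q_values))
print(sample_action(q_values))
```

Unlike epsilon-greedy, which explores uniformly at random, this scheme explores in proportion to how promising each action currently looks.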


#### Different starting epsilon

In [None]:
# lower the minimum (final) epsilon and the test epsilon from 0.1 to 0.05; epsilon still starts at 1.0
policy2 = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1., value_min=.05, value_test=.05, nb_steps=10000)

dqn3 = DQNAgent(model=model, memory=memory, policy=policy2,
                  enable_dueling_network=True, dueling_type='avg', gamma=.8,
                   nb_actions=actions, nb_steps_warmup=1000
                  )

dqn3.compile(
    Adam(lr=1e-3),
    metrics=['mse']
)

In [None]:
history = dqn3.fit(env, 
                  nb_steps=2500,
                  visualize=False,nb_max_episode_steps=250,
                  verbose=2) 

Training for 2500 steps ...


  updates=self.state_updates,


  250/2500: episode: 1, duration: 6.701s, episode steps: 250, steps per second:  37, episode reward: 120.000, mean reward:  0.480 [ 0.000, 20.000], mean action: 3.272 [0.000, 7.000],  loss: --, mse: --, mean_q: --, mean_eps: --
  500/2500: episode: 2, duration: 3.957s, episode steps: 250, steps per second:  63, episode reward: 240.000, mean reward:  0.960 [ 0.000, 80.000], mean action: 3.192 [0.000, 7.000],  loss: --, mse: --, mean_q: --, mean_eps: --
  750/2500: episode: 3, duration: 3.860s, episode steps: 250, steps per second:  65, episode reward: 200.000, mean reward:  0.800 [ 0.000, 80.000], mean action: 3.516 [0.000, 7.000],  loss: --, mse: --, mean_q: --, mean_eps: --
 1000/2500: episode: 4, duration: 3.852s, episode steps: 250, steps per second:  65, episode reward: 100.000, mean reward:  0.400 [ 0.000, 20.000], mean action: 3.476 [0.000, 7.000],  loss: --, mse: --, mean_q: --, mean_eps: --


  updates=self.state_updates,


 1250/2500: episode: 5, duration: 213.947s, episode steps: 250, steps per second:   1, episode reward: 80.000, mean reward:  0.320 [ 0.000, 20.000], mean action: 3.636 [0.000, 7.000],  loss: 583.913172, mse: 1486.837378, mean_q: 22.096355, mean_eps: 0.893125
 1500/2500: episode: 6, duration: 212.958s, episode steps: 250, steps per second:   1, episode reward: 140.000, mean reward:  0.560 [ 0.000, 80.000], mean action: 3.584 [0.000, 7.000],  loss: 2.191753, mse: 232.099916, mean_q: 18.585354, mean_eps: 0.869422
 1750/2500: episode: 7, duration: 213.040s, episode steps: 250, steps per second:   1, episode reward: 360.000, mean reward:  1.440 [ 0.000, 80.000], mean action: 3.736 [0.000, 7.000],  loss: 1.964644, mse: 229.178007, mean_q: 18.293737, mean_eps: 0.845673
 2000/2500: episode: 8, duration: 213.092s, episode steps: 250, steps per second:   1, episode reward: 80.000, mean reward:  0.320 [ 0.000, 20.000], mean action: 3.476 [0.000, 7.000],  loss: 2.063146, mse: 228.746888, mean_q: 1

In [None]:
scores = dqn3.test(env, nb_episodes=5, visualize=True)
print(np.mean(scores.history['episode_reward']))

Testing for 5 episodes ...
Episode 1: reward: 340.000, steps: 1102
Episode 2: reward: 1340.000, steps: 1630
Episode 3: reward: 1550.000, steps: 2314
Episode 4: reward: 540.000, steps: 2042
Episode 5: reward: 1030.000, steps: 1440
960.0
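
The `mean_eps` values in the training logs are consistent with a linear schedule that goes from `value_max` to `value_min` over `nb_steps` steps. The sketch below assumes that form. The function name `annealed_eps` is hypothetical, and the code is an illustration rather than the keras-rl2 source.

```python
def annealed_eps(step, value_max=1.0, value_min=0.1, nb_steps=10000):
    """Linearly decrease epsilon from value_max to value_min over nb_steps steps."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, slope * step + value_max)

# Around the middle of episode 5 (steps 1000 to 1250) in the runs above.
print(annealed_eps(1125))                   # baseline policy, value_min=0.1
print(annealed_eps(1125, value_min=0.05))   # policy2, value_min=0.05

# Epsilon after one capped episode (250 steps) and at the end of training (2500 steps).
print(annealed_eps(250), annealed_eps(2500))
```

Under this assumption, epsilon is about 0.9775 after the first 250 steps and still about 0.775 after all 2500 training steps, so neither run comes close to its minimum epsilon during training; the lower value mainly takes effect through `value_test` at test time.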


#### Answers to Questions
1. **Baseline performance and how well DQN does on this Atari game** <br>
  **A:** The baseline reaches a mean test reward of 172.0 over five test episodes, with the following settings:
  max_steps_per_episode = 250,
  learning_rate = 0.001,
    discount_rate = 0.8,
    exploration_rate = 1,
    max_exploration_rate = 1,
    min_exploration_rate = 0.1,
    exploration_decay_rate = 0.1

2. **What are the states, the actions, and the size of the Q-table?**<br>
  **A:** The state is a stack of 3 consecutive frames from the environment, each of shape (210, 160, 3). The environment has 8 discrete actions: NOOP, FIRE, RIGHT, LEFT, DOWN, RIGHTFIRE, LEFTFIRE, DOWNFIRE. Because DQN uses a neural network that takes a state and approximates the Q-value of each action, instead of a Q-table, there is no Q-table whose size could be reported.

3. **What are the rewards? Why did you choose them?**<br>
  **A:** We use the game score as the reward because the environment already provides it, and a score expressed as a single number is convenient to handle.

4. **How did you choose alpha and gamma in the Bellman equation? Try at least one additional value for alpha and gamma. How did it change the baseline performance?**<br>
  **A:** Lower gamma values put more weight on short-term gains, whereas higher gamma values put more weight on long-term gains. For continuing tasks, the discount factor should therefore be close to 1 (e.g., γ=0.99) to avoid neglecting future rewards.<br>
  The learning rate controls how quickly the model learns. Generally, a large learning rate lets the model learn faster, at the risk of settling on a sub-optimal final set of weights. A smaller learning rate may let the model reach a better, or even globally optimal, set of weights, but may take significantly longer to train.<br>
  In this environment, the baseline uses alpha = 0.001 and gamma = 0.8, and its mean test reward is 172.0. After changing alpha to 0.0001 and gamma to 0.99, the mean test reward decreases to 148.0, which is worse than the baseline. Note that `dqn1` is built from the same `model` and `memory` objects as the baseline agent, so the two runs are not fully independent, and each score is an average over only five test episodes.<br>



5. **Try a policy other than e-greedy. How did it change the baseline performance?**<br>
  **A:** Here I try `BoltzmannQPolicy`. The Boltzmann exploration policy is intended for discrete action spaces. It assumes that each possible action has a value assigned to it (such as the Q-value) and uses a softmax function to convert these values into a probability distribution over the actions. It then samples the action to play from that distribution.<br> It underperforms the baseline: the mean test reward drops from 172.0 to 0.0.

6. **How did you choose your decay rate and starting epsilon? Try at least one additional value for epsilon and the decay rate. How did it change the baseline performance? What is the value of epsilon when if you reach the max steps per episode?**<br>
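  **A:** In the baseline, epsilon starts at 1.0 and is annealed linearly down to a minimum of 0.1 over 10,000 steps, with a test epsilon of 0.1. After lowering the minimum and test epsilon to 0.05, the model performs better than the baseline, with a mean test reward of 960.0.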

7. **What is the average number of steps taken per episode?** <br>
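  **A:** Every training episode runs for 250 steps, the cap set by `nb_max_episode_steps=250`, so the average number of steps per training episode is 250. Test episodes are not capped and ran for between 530 and 3292 steps.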

8. **Does Q-learning use value-based or policy-based iteration?**<br>
  **A:** Q-learning is a value-based learning algorithm. Value-based algorithms update the value function using an equation, in particular the Bellman equation.

9. **Could you use SARSA for this problem?**<br>
  **A:** State–action–reward–state–action (SARSA) is an algorithm for learning a Markov decision process policy, used in the reinforcement learning area of machine learning.
It is a slight variation of the popular Q-learning algorithm. The difference between the two is that SARSA updates its Q-values using the action actually chosen by the current policy in the next state, whereas Q-learning updates using the greedy action, that is, the action with the maximum Q-value in the next state, regardless of which action is taken next.
Q-learning is the more aggressive agent, while SARSA is more conservative. In this low-cost, fast-iterating Atari Gym environment, mistakes are not costly (unlike, for example, an unexpected failure of a physical robot), so I prefer Deep Q-learning over SARSA for this problem [1].
10. **What is meant by the expected lifetime value in the Bellman equation?**<br>
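  **A:** The expected lifetime value is the expected sum of discounted future rewards collected from a given state (or state-action pair) onward. The discount factor gamma determines how much each future reward counts in that sum.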

11. **When would SARSA likely do better than Q-learning?**<br>
  **A:** As mentioned above, SARSA is more conservative: it converges while accounting for possible penalties from exploratory moves. If mistakes are costly in our environment (for example, an unexpected failure of a robot) and we care about the rewards gained while learning, then SARSA is likely to do better than Q-learning.

12. **How does SARSA differ from Q-learning?** <br>
  **A:** Q-learning is an off-policy technique: it uses the greedy action in the next state to learn the Q-value. SARSA, on the other hand, is an on-policy technique: it uses the action actually taken by the current policy to learn the Q-value.

13. **Explain the Q-learning algorithm.**<br>
  **A:** Algorithm: <img src='https://wikimedia.org/api/rest_v1/media/math/render/svg/7c8c6f219d5ceabd052cb058a5135bfdac86dc0c'><br>
  Before learning begins, Q is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time $t$ the agent selects an action $a_t$, observes a reward $r_t$, enters a new state $s_{t+1}$ (that may depend on both the previous state $s_t$ and the selected action), and Q is updated.
  <img src='https://tcnguyen.github.io/reinforcement_learning/images/Q_learning_algo.png'>

14. **Explain the SARSA algorithm.**<br>
  **A:** In SARSA, the update target is formed by choosing the next action a′ in the next state s′ with the same current policy and using $r + \gamma Q(s', a')$ as the target.
SARSA is called on-policy learning because the new action a′ is chosen using the same epsilon-greedy policy as the action a, the one that generated s′.<br>
   Algorithm: <img src='https://wikimedia.org/api/rest_v1/media/math/render/svg/04c4392b9a682a765571d992e8df82edc808a305'><br>
   The Q value for a state-action is updated by an error, adjusted by the learning rate alpha. Q values represent the possible reward received in the next time step for taking action a in state s, plus the discounted future reward received from the next state-action observation.
  <img src='https://tcnguyen.github.io/reinforcement_learning/images/SARSA_algo.png'>

15. **What code is yours and what have you adapted?**<br>
  **A:** The code that imports the Atari Gym environment, creates the model with TensorFlow, and trains the agent with keras-rl2 is adapted from [2]. I modified some parameters to answer the questions.
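
The difference between the Q-learning and SARSA updates described in the answers above can be sketched in tabular form. This is an illustrative example on a plain dictionary, not the deep Q-network used in this notebook; the function names and the toy states and actions are assumptions.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy: the target uses the best action in s_next, whatever is taken next."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: the target uses the action a_next actually chosen by the policy."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

actions = ["left", "right"]

def fresh_table():
    # Hypothetical Q-values for the next state "s1"; everything else starts at 0.
    Q = defaultdict(float)
    Q[("s1", "left")] = 1.0
    Q[("s1", "right")] = 5.0
    return Q

Q_ql, Q_sarsa = fresh_table(), fresh_table()
q_learning_update(Q_ql, "s0", "left", 1.0, "s1", actions, alpha=0.5, gamma=0.9)
# Suppose the exploring policy picked the worse action "left" in s1.
sarsa_update(Q_sarsa, "s0", "left", 1.0, "s1", "left", alpha=0.5, gamma=0.9)

print(Q_ql[("s0", "left")], Q_sarsa[("s0", "left")])
```

Given the same transition, Q-learning backs up the value of the best next action, while SARSA backs up the value of the exploratory action that was actually taken, which is why SARSA's estimates reflect the cost of exploration.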

#### Reference
[1] https://medium.com/swlh/introduction-to-reinforcement-learning-coding-sarsa-part-4-2d64d6e37617<br>
[2] https://github.com/nicknochnack/KerasRL-OpenAI-Atari-SpaceInvadersv0/blob/main/Space%20Invaders%20Walkthrough.ipynb

#### Licence
Copyright (c) 2022, Yanping Fu All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


