## Assignment 2_Wei Chen 001562214

First, import all the necessary libraries. The required package versions are:

tensorflow==2.10.0 opencv-python==4.6.0.66 gym==0.17.0 atari_py==0.2.9 keras==2.10.0

In [2]:
import numpy as np
from numpy import clip
import random
import gym
import cv2
from collections import deque,namedtuple
from keras.models import Sequential
import warnings
warnings.filterwarnings('ignore')
from keras.layers import Conv2D,Flatten,Dense
from tensorflow.keras.optimizers import Adam
from keras.models import load_model

Create and initialize the Pong gym environment. Every environment specifies the format of valid actions through its env.action_space attribute, and the format of valid observations through env.observation_space. The test run below picks random actions with random.choice, and the agent later explores with env.action_space.sample(). Note that the action space needs to be seeded separately from the environment to make those samples reproducible.

In [4]:
env = gym.make('Pong-v4')

#get the shape of game
height, width, channels = env.observation_space.shape
actions = env.action_space.n
#show all the movement players can make in the game
print(env.unwrapped.get_action_meanings())
print(env.observation_space.shape)


['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']
(210, 160, 3)


In [5]:
#test run on the environment with random actions

EPISODES = 5
scores = []
scores_clipped = []

for episode in range(1, EPISODES + 1):
    state = env.reset()
    done = False
    score = 0 
    score_clipped = 0
    
    while not done:
        # env.render()
        action = random.choice(range(env.action_space.n))
        n_state, reward, done, info = env.step(action)
        score += reward
        score_clipped += clip(reward, -1.0, 1.0)
    
    scores.append(score)
    scores_clipped.append(score_clipped)
    print(f"Episode {episode}: Reward == {score}; Clipped Reward == {score_clipped}")

avg = np.mean(scores)
avg_clipped = np.mean(scores_clipped)
print(f"Average reward: {avg}; clipped: {avg_clipped}")
env.close()

Episode 1: Reward == -21.0; Clipped Reward == -21.0
Episode 2: Reward == -20.0; Clipped Reward == -20.0
Episode 3: Reward == -20.0; Clipped Reward == -20.0
Episode 4: Reward == -21.0; Clipped Reward == -21.0
Episode 5: Reward == -21.0; Clipped Reward == -21.0
Average reward: -20.6; clipped: -20.6


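Next we create the replay buffer. It stores (state, action, reward, next state, terminal) transitions up to a fixed size and returns random mini-batches for training.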

In [6]:
conv = namedtuple('Conv', 'filter kernel stride')

class Buffer:
	def __init__(self,size):
		self.size = size
		self.buffer = deque()
	def add(self,s,a,r,s2,t):
		s = np.stack((s[0],s[1],s[2],s[3]),axis=2)
		s2 = np.stack((s2[0],s2[1],s2[2],s2[3]),axis=2)
		if len(self.buffer) < self.size:
			self.buffer.appendleft((s,a,r,s2,t))
		else:
			self.buffer.pop()
			self.buffer.appendleft((s,a,r,s2,t))
	def sample(self,batch_size):
		return random.sample(self.buffer,batch_size)

In [7]:
class DQN:
	def __init__(self,buff,batch_size=32,min_buff=10000,gamma=0.99,learning_rate=2.5e-4):
		self.buffer = buff
		self.min_buffer = min_buff
		self.batch_size = batch_size
		self.gamma = gamma
		
		self.model = create_network(learning_rate)
		self.target_model = create_network(learning_rate)
		self.copy_network()

	def train(self):
		if len(self.buffer.buffer) < self.min_buffer:
			return
		states,actions,rewards,next_states,terminal = map(np.array,zip(*self.buffer.sample(self.batch_size)))
		next_state_action_values = np.max(self.target_model.predict(next_states),axis=1)
		targets = self.model.predict(states)
		targets[range(self.batch_size), actions] = rewards + self.gamma*next_state_action_values*np.invert(terminal)
		self.model.train_on_batch(states, targets)

	def copy_network(self):
		frm = self.model
		to = self.target_model
		for l_tg,l_sr in zip(to.layers,frm.layers):
			wk = l_sr.get_weights()
			l_tg.set_weights(wk)

	def predict(self,x):
		s = np.stack((x[0],x[1],x[2],x[3]),axis=2)
		return self.model.predict(np.array([s]))

Now we define a function called create_network for building our Q network. We input the game state to the Q network and get the Q values for all the actions in that state.


In [8]:
def create_network(learning_rate,conv_info=[conv(32,8,4),conv(64,4,2),conv(64,3,1)],dense_info=[512],input_size=(80,80,4)):
	model = Sequential()
	for i,cl in enumerate(conv_info):
		if i==0:
			model.add(Conv2D(cl.filter,cl.kernel,padding="same",strides=cl.stride,activation="relu", input_shape=input_size))
		else:
			model.add(Conv2D(cl.filter,cl.kernel,padding="same",strides=cl.stride,activation="relu"))
	model.add(Flatten())
	for dl in dense_info:
		model.add(Dense(dl,activation="relu"))
	model.add(Dense(6))
	adam = Adam(learning_rate=learning_rate)
	model.compile(loss='mse',optimizer=adam)
	return model

In [9]:
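# Preprocess frames (downsample) and maintain the 4-frame state
def downsample(observation):
	# Drop the top 30 rows, then convert to grayscale.
	# Note: gym returns RGB frames, so COLOR_RGB2GRAY would be the matching flag.
	s = cv2.cvtColor(observation[30:,:,:], cv2.COLOR_BGR2GRAY)
	# Resize to 80x80 and scale pixel values to [0, 1]
	s = cv2.resize(s, (80,80), interpolation = cv2.INTER_AREA) 
	s = s/255.0
	return s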
	
def update_state(state,observation):
	ds_observation = downsample(observation)
	state.append(ds_observation)
	if len(state) > 4:
		state.pop(0)

def sample_action(model,s):
	return np.argmax(model.predict(np.array([np.stack((s[0],s[1],s[2],s[3]),axis=2)]))[0])

Next we create a specific Pong agent class

In [10]:
class Pong:
	def __init__(self):
		self.env = gym.make('Pong-v4')
		self.epsilon = 1
		self.buffer = Buffer(50000)
		self.dqn = DQN(self.buffer)
		self.copy_period = 40000
		self.itr = 0
		self.eps_step = 0.0000009

	def sample_action(self,s):
		if random.random() < self.epsilon:
			return self.env.action_space.sample()
		return np.argmax(self.dqn.predict(s)[0])
	
	def play_one_episode(self):
		observation = self.env.reset()
		done = False
		state = []
		update_state(state,observation)
		prv_state = []
		total_reward = 0
		while not done:
			
			if len(state) < 4:
				action = self.env.action_space.sample()
			else:
				action = self.sample_action(state)
        
			prv_state.append(state[-1])
			if len(prv_state) > 4:
				prv_state.pop(0)
			observation, reward, done, _ = self.env.step(action)

			update_state(state,observation)
			if len(state) == 4 and len(prv_state) == 4:
				self.buffer.add(prv_state,action,reward,state,done)
			total_reward += reward
			
			self.itr += 1
			if self.itr % 4 == 0:
				self.dqn.train()
			self.epsilon = max(0.1,self.epsilon-self.eps_step)
			if self.itr % self.copy_period == 0:
				self.dqn.copy_network()
		return total_reward

To avoid losing progress during the long training run, we save the model every 1000 episodes.

In [None]:
p = Pong()
for i in range(100000):
	total_reward = p.play_one_episode()
	print("episode total reward:",total_reward)
	if i%1000 == 0:
		print("Save")
		p.dqn.model.save("result.h5")


We can check the agent's performance by loading the trained result.h5 file and playing one episode with greedy actions.

In [15]:
env = gym.make('Pong-v4')
model = load_model('result.h5')
done = False
state = []
observation = env.reset()
update_state(state,observation)
score = 0
while not done:
	
	if len(state) < 4:
		action = env.action_space.sample()
	else:
		action = sample_action(model,state)
	observation, reward, done, _ = env.step(action)
	score+=reward
	update_state(state,observation)

In [16]:
score

2.0

A positive score (2.0) means the trained model won the game: it reached 21 points while the opponent scored 19.
Before training, the random-action test runs scored -21 or -20 in every episode.

## 1. Establish a baseline performance. How well did your Deep Q-learning do on your problem? (5 Points)

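Baseline: with random actions the agent averaged -20.6 over 5 test episodes. After training, the saved model scored +2.0 in one evaluation episode.

total_episodes = 100000

max_steps (steps for epsilon to decay from 1.0 to 0.1) = (1.0 - 0.1)/9e-7 = 1,000,000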

learning_rate = 2.5e-4

gamma = 0.99

epsilon = 1.0

max_epsilon = 1.0

min_epsilon = 0.1

## 2. What are the states, the actions, and the size of the Q-table? (5 Points)

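states: each state is a stack of the 4 most recent preprocessed frames, an array of shape (80, 80, 4) with pixel values in [0, 1]. Together the frames encode both paddle positions, the ball position, and the ball direction.

actions: 6 discrete actions: 'NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE'

size of Q-table: DQN does not store a Q-table. The state space is far too large to enumerate, so the Q network approximates the table: it takes one (80, 80, 4) state as input and outputs 6 Q-values, one per action.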
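
To make the size of the Q network concrete, the sketch below works out the layer output sizes and the parameter count from the defaults in create_network. It assumes the standard Keras rule that a Conv2D layer with padding="same" produces ceil(input / stride) outputs per spatial dimension; the helper names are illustrative and not part of the notebook.

```python
import math

def same_conv_out(size, stride):
    # With padding="same", the spatial output size is ceil(input / stride).
    return math.ceil(size / stride)

def conv_params(kernel, in_channels, filters):
    # Weights (kernel x kernel x in_channels per filter) plus one bias per filter.
    return kernel * kernel * in_channels * filters + filters

size, channels, total = 80, 4, 0
# (filters, kernel, stride) as in the conv_info default of create_network
for filters, kernel, stride in [(32, 8, 4), (64, 4, 2), (64, 3, 1)]:
    total += conv_params(kernel, channels, filters)
    size, channels = same_conv_out(size, stride), filters

flat = size * size * channels   # length of the Flatten output
total += flat * 512 + 512       # Dense(512)
total += 512 * 6 + 6            # Dense(6): one Q-value per action
print(size, flat, total)
```

Under these assumptions the final feature map is 10 x 10 x 64, and almost all of the parameters sit in the first dense layer.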


## 3. What are the rewards? Why did you choose them? (5 Points)

You score a point when the ball passes the opponent's paddle, that is, when the opponent hits the ball out of bounds or misses a hit. The first player to reach 21 points wins the game.

The reward at each step is +1 when the agent scores a point, -1 when the opponent scores, and 0 otherwise, so the total reward of an episode lies between -21 and +21.

We use the game's own score as the reward because the goal is to train the DQN agent to maximize its score.

## 4. How did you choose alpha and gamma in the Bellman equation? Try at least one additional value for alpha and gamma. How did it change the baseline performance?  (5 Points)

We chose alpha = 2.5e-4 and gamma = 0.99.


alpha (the learning rate) should decrease as the agent gains a larger and larger knowledge base. In tabular Q-learning it lies in the range 0 to 1; here it is the step size of the Adam optimizer. The higher the learning rate, the more quickly new estimates replace the old Q value.


If gamma < 1, the return Gt has a finite value. If gamma = 0, the agent is only interested in the immediate reward and discards the long-term return. Conversely, if gamma = 1, the agent weighs all future rewards equally with the immediate reward.
Trying other values of alpha and gamma requires retraining the model, which costs additional training time.
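
The effect of gamma on the return can be checked with a few lines of arithmetic. This is a minimal sketch on a made-up reward sequence (a single point scored two steps in the future), not data from the training run; discounted_return is an illustrative helper.

```python
def discounted_return(rewards, gamma):
    """Return G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one reward sequence."""
    g = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 1.0]  # hypothetical: reward arrives two steps later
for gamma in (0.0, 0.5, 0.99, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With gamma = 0 the delayed point contributes nothing, while with gamma = 0.99 it keeps almost all of its value, which is why a value close to 1 suits Pong, where points arrive many frames after the decisive action.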


## 5. Try a policy other than e-greedy. How did it change the baseline performance? (5 Points)

The reason for using 𝜖-greedy during testing is that it can reduce the negative effects of overfitting or underfitting. Unlike in supervised machine learning (for example image classification), reinforcement learning has no unseen, held-out data set for the test phase, so the algorithm is tested on the very same setup that it was trained on.
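
For comparison, one common alternative to 𝜖-greedy is Boltzmann (softmax) exploration, which samples actions in proportion to the exponential of their Q-values. The sketch below was not run as part of this notebook and reports no results; the function names and the temperature parameter are assumptions for illustration.

```python
import numpy as np

def softmax_probs(q_values, temperature=1.0):
    """Turn Q-values into action probabilities (Boltzmann distribution)."""
    q = np.asarray(q_values, dtype=np.float64)
    # Subtract the max before exponentiating for numerical stability.
    z = (q - q.max()) / temperature
    p = np.exp(z)
    return p / p.sum()

def softmax_action(q_values, temperature=1.0, rng=None):
    """Sample one action index; a lower temperature means greedier choices."""
    if rng is None:
        rng = np.random.default_rng()
    probs = softmax_probs(q_values, temperature)
    return int(rng.choice(len(probs), p=probs))

q = [1.0, 2.0, 3.0]  # hypothetical Q-values for three actions
print(softmax_probs(q, temperature=1.0))
print(softmax_action(q, temperature=0.01, rng=np.random.default_rng(0)))
```

Unlike 𝜖-greedy, which explores uniformly, this policy explores actions with higher Q-values more often; the temperature would play the role that epsilon plays in the Pong class.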

## 6. How did you choose your decay rate and starting epsilon? Try at least one additional value for epsilon and the decay rate. How did it change the baseline performance? What is the value of epsilon when if you reach the max steps per episode? (5 Points)

decay rate: epsilon decays linearly. After every environment step, eps_step = 9e-7 is subtracted from epsilon until it reaches the minimum of 0.1. Starting from 1.0, this takes (1.0 - 0.1)/9e-7 = 1,000,000 steps, after which epsilon stays at 0.1.

learning rate: the learning rate scales the magnitude of the weight updates that minimize the network's loss function. It determines how big a leap you take toward the optimal policy. In terms of simple Q-learning, it is how much you update the Q value at each step.

starting epsilon: epsilon is used when selecting actions based on the Q values we already have. With the pure greedy method (epsilon = 0), we always select the action with the highest Q value in a given state. This hurts exploration, because the agent can easily get stuck in a local optimum. We therefore start at epsilon = 1.0 (fully random actions) and decay it while the agent learns, which stabilizes the model output so that it can converge to an optimal policy.


In conclusion, the learning rate is associated with how big a leap you take, and epsilon with how randomly you act. As learning goes on, both should be decayed to stabilize and exploit the learned policy, which converges to an optimal one.
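
The epsilon schedule used by the Pong class (start 1.0, minimum 0.1, eps_step = 0.0000009 per step) can be written in closed form. This is a small sketch; epsilon_after is an illustrative helper, and the notebook itself subtracts eps_step once per environment step instead.

```python
def epsilon_after(steps, start=1.0, minimum=0.1, eps_step=9e-7):
    """Epsilon after `steps` environment steps of linear decay."""
    return max(minimum, start - eps_step * steps)

# Number of steps needed to go from the start value down to the minimum.
decay_steps = round((1.0 - 0.1) / 9e-7)
for n in (0, 500_000, decay_steps, 2_000_000):
    print(n, round(epsilon_after(n), 4))
```

Trying a different decay rate only means changing eps_step: a larger value reaches the minimum sooner and leaves less time for random exploration.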

## 7. What is the average number of steps taken per episode? (5 Points)

Steps per episode were not counted in this notebook. An episode runs until one player reaches 21 points, so the number of steps varies from episode to episode.


Average time per step = 40.24 ms/step
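
The number of steps in an episode can be measured by counting calls to env.step until done is True. The sketch below shows the idea on a hypothetical stand-in environment (ToyEnv) that follows the same reset/step interface as the gym version used here; it does not report measurements from Pong.

```python
class ToyEnv:
    """Hypothetical stand-in for a gym environment (4-tuple step API)."""
    def __init__(self, episode_length):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # dummy observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        return 0, 0.0, done, {}

def count_steps(env, policy):
    """Play one episode and return how many steps it took."""
    env.reset()
    done, steps = False, 0
    while not done:
        _, _, done, _ = env.step(policy())
        steps += 1
    return steps

lengths = [count_steps(ToyEnv(n), lambda: 0) for n in (3, 5, 10)]
print(sum(lengths) / len(lengths))  # 6.0
```

The same counter could be added inside play_one_episode to log the episode length next to the total reward.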

## 8. Does Q-learning use value-based or policy-based iteration? (5 Points)


Value-based iteration.

Q-learning is a value-based learning algorithm. Value-based algorithms update the value function using an equation (in particular, the Bellman equation). Policy-based methods, by contrast, improve the policy directly; in policy iteration, the value function is estimated for the greedy policy obtained from the last policy improvement step.


Q-learning is an off-policy learner, meaning it learns the value of the optimal policy independently of the agent's actions. An on-policy learner, on the other hand, learns the value of the policy being carried out by the agent, including the exploration steps, and finds a policy that is optimal given the exploration inherent in that policy.

## 9. Could you use SARSA for this problem? (5 Points)

SARSA and Q-learning are both reinforcement learning algorithms that work in a similar way. The most striking difference is that SARSA is on-policy while Q-learning is off-policy.


We can use SARSA for this problem as well, but it may not perform as well as Q-learning. In both algorithms the Q value of a state-action pair is updated by an error term scaled by the learning rate alpha. SARSA's error uses the action actually taken by the ε-greedy policy, so its estimates include the cost of exploratory moves.

## 10. What is meant by the expected lifetime value in the Bellman equation?(5 Points)


The expected lifetime value of a state is the expected discounted sum of all future rewards collected from that state onward while following the policy.


With expected values you have a fair bit of freedom to expand or resolve them, or not.
For instance, E[X + Y] = E[X] + E[Y] always holds, and E[XY] = E[X]E[Y] holds when the distributions 𝑋 and 𝑌 are independently resolved.


Each time step of an MDP is independent in this way, so you can use this when handling sums and products within expectations in the Bellman equations (provided you separate terms by time step).


For the Bellman equation, the goal is to relate 𝑣𝜋(𝑠𝑡) to 𝑣𝜋(𝑠𝑡+1), and the definition of value is given as an expectation, so it makes sense to preserve the second expectation rather than expand it.


The interpretation of H(x, a, v) is the lifetime value associated with choosing action a
at current state x and then continuing with a function v that attributes value to
states.
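
Written out in standard textbook notation (the symbols $R_{t+1}$ and $S_t$ are assumptions, not taken from the text above), the return and the Bellman expectation equation that relates the value of a state to the value of its successor are:

$$
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
$$

$$
v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]
         = \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]
$$

The second equality keeps the expectation over the next state unexpanded, which is what lets the value of one time step be expressed through the value of the next.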

## 11. When would SARSA likely do better than Q-learning? (5 Points)

If your agent learns online, and you care about rewards gained whilst learning, then SARSA may be a better choice.


SARSA will approach convergence allowing for possible penalties from exploratory moves, whilst Q-learning will ignore them. That makes SARSA more conservative. If there is risk of a large negative reward close to the optimal path, Q-learning will tend to trigger that reward whilst exploring, whilst SARSA will tend to avoid a dangerous optimal path and only slowly learn to use it when the exploration parameters are reduced. 


The classic toy problem that demonstrates this effect is called cliff walking.

## 12. How does SARSA differ from Q-learning? (5 Points)  

    Q-Learning:  Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) − Q(s,a)]

    SARSA:       Q(s,a) ← Q(s,a) + α[r + γ Q(s′,a′) − Q(s,a)]

SARSA is on-policy and Q-Learning is off-policy.

The most important difference between the two is how Q is updated after each action. SARSA uses the Q value of the next action A' exactly as drawn from the ε-greedy policy. In contrast, Q-learning uses the maximum Q value over all possible actions in the next state. This makes its update look like following a greedy policy with ε=0, even though the agent itself still explores.
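
The two update rules can be compared side by side on a tiny table. This is a minimal tabular sketch with made-up Q-values; the function names are illustrative and the notebook's DQN uses a network instead of a table.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s2, alpha, gamma):
    # Off-policy target: value of the best action in the next state.
    td_target = r + gamma * np.max(Q[s2])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    # On-policy target: value of the action a2 actually taken in the next state.
    td_target = r + gamma * Q[s2, a2]
    Q[s, a] += alpha * (td_target - Q[s, a])

# Hypothetical table: 2 states x 2 actions.
Q_ql = np.array([[0.0, 0.0], [1.0, 5.0]])
Q_sarsa = Q_ql.copy()

# Same transition (s=0, a=0, r=1, s2=1); SARSA's next action is the worse one.
q_learning_update(Q_ql, 0, 0, 1.0, 1, alpha=0.5, gamma=0.9)
sarsa_update(Q_sarsa, 0, 0, 1.0, 1, 0, alpha=0.5, gamma=0.9)
print(Q_ql[0, 0], Q_sarsa[0, 0])
```

Because the exploratory next action has a lower value than the best one, SARSA's estimate for the same transition ends up lower than Q-learning's.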

## 13. Explain the Q-learning algorithm. (5 Points)  


Q-learning is a reinforcement learning algorithm that finds the next best action given the current state. While learning, it sometimes chooses actions at random to explore, and it aims to maximize the total reward.

Q-learning is a model-free, off-policy reinforcement learning algorithm that finds the best course of action given the current state of the agent. Depending on where the agent is in the environment, it decides the next action to take.

The objective is to find the best course of action given the current state. Off-policy means that the policy being learned (the greedy policy with respect to the Q values) can differ from the policy used to act (for example ε-greedy), so the agent can learn about the optimal policy while following an exploratory one.

Model-free means that the agent does not use a model that predicts the environment's response. Instead it learns directly from the rewards it observes, by trial and error.
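
As a compact illustration of the whole loop, here is a tabular Q-learning sketch on a hypothetical five-state chain where only reaching the last state gives a reward. The environment, the hyperparameters, and the names are assumptions for illustration; the Pong agent replaces the table with a neural network and a replay buffer.

```python
import numpy as np

N_STATES, GOAL = 5, 4   # hypothetical chain: states 0..4, goal at state 4
LEFT, RIGHT = 0, 1

def step(s, a):
    """Deterministic toy environment: reward 1 only on reaching the goal."""
    s2 = min(s + 1, GOAL) if a == RIGHT else max(s - 1, 0)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.5, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, 2))
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = int(rng.integers(2))            # explore
            else:
                best = np.flatnonzero(Q[s] == Q[s].max())
                a = int(rng.choice(best))           # exploit, ties broken randomly
            s2, r, done = step(s, a)
            # Bellman target; no bootstrap term once the episode has ended.
            target = r + gamma * np.max(Q[s2]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q

Q = train()
print(np.argmax(Q[:GOAL], axis=1))  # greedy action in each non-terminal state
```

After training, the greedy action in every non-terminal state should be RIGHT, even though half of the actions taken during learning were random, which is the off-policy property in practice.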

## 14. Explain the SARSA algorithm. (5 Points)  

SARSA is an on-policy algorithm: in the current state S, an action A is taken, the agent gets a reward R and ends up in the next state S1, where it takes action A1. The tuple (S, A, R, S1, A1) gives the algorithm its name, SARSA.

In SARSA, the Q-value is updated using the action A1 actually performed in state S1. In Q-learning, by contrast, the action with the highest Q-value in the next state S1 is used to update the Q-table.

## 15. What code is yours and what have you adapted? (5 Points)

Separated the results and stored the best one.

Rewrote the code that implements the Pong agent and creates the network.

Constructed classes and organized the code to be more readable.

## 16. Did I explain my code clearly? (10 Points)

I applied Deep Q-Learning to an Atari game (Pong) in the OpenAI Gym Atari environments. After creating the Pong agent and the network and training the agent, the score was positive and the trained agent won the game.


## 17. Did I explain my licensing clearly? (5 Points)

## Reference

[1] RL Agent for Atari Game Pong https://github.com/amirhossein-hkh/pong-dqn

[2] The Bellman Equation https://towardsdatascience.com/the-bellman-equation-59258a0d3fa7

[3] Level up — Understanding Q learning https://medium.com/@nancyjemi/level-up-understanding-q-learning-cf739867eb1d

[4] A Beginners Guide to Q-Learning https://towardsdatascience.com/a-beginners-guide-to-q-learning-c3e2a30a653c

[5] What is the difference between Q-learning and SARSA? https://stackoverflow.com/questions/6848828/what-is-the-difference-between-q-learning-and-sarsa

## 18. Professionalism (10 Points)

During implementation, I took the naming of each variable seriously and tied each name closely to what the variable represents, so that the code is easier to follow. I chose each parameter, such as the learning rate and epsilon, carefully so that the program runs more efficiently. I divided the code into modules that mirror the process of the program: import and test the environment, create the buffer and network, construct the agent, train the agent, and evaluate its performance.

##  Licenses


All licenses in this repository are copyrighted by their respective authors.

------------------------------------------------------------------------------

@author: Wei Chen (chen.wei6@northeastern.edu)

The person who associated a work with this deed has dedicated the work to the
public domain by waiving all of his or her rights to the work worldwide under
copyright law, including all related and neighboring rights,
to the extent allowed by law.

You can copy, modify, distribute and perform the work, even for commercial
purposes, all without asking permission. 