@Time ：2023/7/7 

@Auth ：**He Enhao, Bi Xiaoyang, Wang Qipeng, Li Haoyang**

@File ：Play Atari Pong with a reinforcement learning algorithm (DQN) in PyTorch on the **gym (PongNoFrameskip-v4)** platform.

@IDE ：Jupyter

@Environment: gym==0.21, python==3.8, pytorch==1.10.0

# Contents:

    1. Introduction
    2. Problem formulation 
    3. Data preparation
    4. Processing pipeline 
    5. Experimental results and discussion
    6. Conclusion

In this project, we built an AI agent for the gym Atari Pong environment (PongNoFrameskip-v4). The agent uses preprocessed pixels as features and feeds them into a deep Q-network (DQN) built on a convolutional neural network (CNN) to choose the best action. We ran several experiments, analyzed the results, and compared the Pong environment on the gym and gym-retro platforms to test our model. We discuss them in detail in this paper.

# Import the library

In [1]:
import gym
import cv2
import time
import torch
import joblib
import numpy as np
import torch.nn as nn
from tqdm import tqdm
from collections import deque
import torch.nn.functional as F
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
from life.utils.replay.replay_buffer import ReplayBuffer
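
The `ReplayBuffer` imported above comes from the external `life` package, whose source is not shown here. Judging only from how this notebook calls it (`ReplayBuffer(capacity=...)`, `add`, `sample`, `size`), a minimal stand-in could look like the sketch below. The class name `SimpleReplayBuffer` and its internals are assumptions for illustration, not the `life` implementation, and it returns plain tuples rather than arrays.

```python
import random
from collections import deque


class SimpleReplayBuffer:
    """Hypothetical stand-in for the imported ReplayBuffer (not the `life` code)."""

    def __init__(self, capacity):
        # a bounded deque drops the oldest transition once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling without replacement, regrouped field by field
        transitions = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*transitions)
        return states, actions, rewards, next_states, dones

    def size(self):
        return len(self.buffer)
```

With `capacity=10000`, as used later in this notebook, such a buffer would hold only the 10,000 most recent transitions.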

# Environment Wrappers

In [2]:
cv2.ocl.setUseOpenCL(False)

def make_env(env, stack_frames=True, episodic_life=True, clip_rewards=False, scale=False):
    if episodic_life:
        env = EpisodicLifeEnv(env)

    env = NoopResetEnv(env, noop_max=30)
    env = MaxAndSkipEnv(env, skip=4)
    if 'FIRE' in env.unwrapped.get_action_meanings():
        env = FireResetEnv(env)

    env = WarpFrame(env)
    if stack_frames:
        env = FrameStack(env, 4)
    if clip_rewards:
        env = ClipRewardEnv(env)
    return env


class RewardScaler(gym.RewardWrapper):

    def reward(self, reward):
        return reward * 0.1


class ClipRewardEnv(gym.RewardWrapper):
    def __init__(self, env):
        gym.RewardWrapper.__init__(self, env)

    def reward(self, reward):
        """Bin reward to {+1, 0, -1} by its sign."""
        return np.sign(reward)


class LazyFrames(object):
    def __init__(self, frames):
        """This object ensures that common frames between the observations are only stored once.
        It exists purely to optimize memory usage which can be huge for DQN's 1M frames replay
        buffers.
        This object should only be converted to numpy array before being passed to the model.
        You'd not believe how complex the previous solution was."""
        self._frames = frames
        self._out = None

    def _force(self):
        if self._out is None:
            self._out = np.concatenate(self._frames, axis=2)
            self._frames = None
        return self._out

    def __array__(self, dtype=None):
        out = self._force()
        if dtype is not None:
            out = out.astype(dtype)
        return out

    def __len__(self):
        return len(self._force())

    def __getitem__(self, i):
        return self._force()[i]


class FrameStack(gym.Wrapper):
    def __init__(self, env, k):
        """Stack k last frames.
        Returns lazy array, which is much more memory efficient.
        See Also
        --------
        baselines.common.atari_wrappers.LazyFrames
        """
        gym.Wrapper.__init__(self, env)
        self.k = k
        self.frames = deque([], maxlen=k)
        shp = env.observation_space.shape
        self.observation_space = gym.spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k),
                                                dtype=env.observation_space.dtype)

    def reset(self):
        ob = self.env.reset()
        for _ in range(self.k):
            self.frames.append(ob)
        return self._get_ob()

    def step(self, action):
        ob, reward, done, info = self.env.step(action)
        self.frames.append(ob)
        return self._get_ob(), reward, done, info

    def _get_ob(self):
        assert len(self.frames) == self.k
        return LazyFrames(list(self.frames))


class WarpFrame(gym.ObservationWrapper):
    def __init__(self, env):
        """Warp frames to 84x84 as done in the Nature paper and later work."""
        gym.ObservationWrapper.__init__(self, env)
        self.width = 84
        self.height = 84
        self.observation_space = gym.spaces.Box(low=0, high=255,
                                                shape=(self.height, self.width, 1), dtype=np.uint8)

    def observation(self, frame):
        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
        return frame[:, :, None]


class FireResetEnv(gym.Wrapper):
    def __init__(self, env=None):
        """For environments where the user needs to press FIRE for the game to start."""
        super(FireResetEnv, self).__init__(env)
        assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
        assert len(env.unwrapped.get_action_meanings()) >= 3

    def step(self, action):
        return self.env.step(action)

    def reset(self):
        self.env.reset()
        obs, _, done, _ = self.env.step(1)
        if done:
            self.env.reset()
        obs, _, done, _ = self.env.step(2)
        if done:
            self.env.reset()
        return obs


class EpisodicLifeEnv(gym.Wrapper):
    def __init__(self, env=None):
        """Make end-of-life == end-of-episode, but only reset on true game over.
        Done by DeepMind for the DQN and co. since it helps value estimation.
        """
        super(EpisodicLifeEnv, self).__init__(env)
        self.lives = 0
        self.was_real_done = True
        self.was_real_reset = False

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.was_real_done = done
        # check current lives, make loss of life terminal,
        # then update lives to handle bonus lives
        lives = self.env.unwrapped.ale.lives()
        if lives < self.lives and lives > 0:
            # for Qbert sometimes we stay in lives == 0 condition for a few frames
            # so it's important to keep lives > 0, so that we only reset once
            # the environment advertises done.
            done = True
        self.lives = lives
        return obs, reward, done, info

    def reset(self):
        """Reset only when lives are exhausted.
        This way all states are still reachable even though lives are episodic,
        and the learner need not know about any of this behind-the-scenes.
        """
        if self.was_real_done:
            obs = self.env.reset()
            self.was_real_reset = True
        else:
            # no-op step to advance from terminal/lost life state
            obs, _, _, _ = self.env.step(0)
            self.was_real_reset = False
        self.lives = self.env.unwrapped.ale.lives()
        return obs


class MaxAndSkipEnv(gym.Wrapper):
    def __init__(self, env=None, skip=4):
        """Return only every `skip`-th frame"""
        super(MaxAndSkipEnv, self).__init__(env)
        # most recent raw observations (for max pooling across time steps)
        self._obs_buffer = deque(maxlen=2)
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        done = None
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            self._obs_buffer.append(obs)
            total_reward += reward
            if done:
                break

        max_frame = np.max(np.stack(self._obs_buffer), axis=0)

        return max_frame, total_reward, done, info

    def reset(self):
        """Clear past frame buffer and init. to first obs. from inner env."""
        self._obs_buffer.clear()
        obs = self.env.reset()
        self._obs_buffer.append(obs)
        return obs


class NoopResetEnv(gym.Wrapper):
    def __init__(self, env=None, noop_max=30):
        """Sample initial states by taking random number of no-ops on reset.
        No-op is assumed to be action 0.
        """
        super(NoopResetEnv, self).__init__(env)
        self.noop_max = noop_max
        self.override_num_noops = None
        assert env.unwrapped.get_action_meanings()[0] == 'NOOP'

    def step(self, action):
        return self.env.step(action)

    def reset(self):
        """ Do no-op action for a number of steps in [1, noop_max]."""
        self.env.reset()
        if self.override_num_noops is not None:
            noops = self.override_num_noops
        else:
            noops = np.random.randint(1, self.noop_max + 1)
        assert noops > 0
        obs = None
        for _ in range(noops):
            obs, _, done, _ = self.env.step(0)
            if done:
                obs = self.env.reset()
        return obs

def get_state(obs):
    state = np.array(obs)
    state = state.transpose((2, 0, 1))
    return state
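
The frame-stacking idea used by `FrameStack` above can be illustrated without gym or NumPy. The sketch below is a simplified assumption: frames are plain strings, and the class name `FrameStackSketch` is hypothetical. It only shows how a bounded deque keeps the `k` most recent frames.

```python
from collections import deque


class FrameStackSketch:
    """Keeps the k most recent frames; illustrative only, frames can be any object."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # at episode start the stack is filled with k copies of the first frame
        for _ in range(self.k):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # appending to a full deque discards the oldest frame
        self.frames.append(new_frame)
        return list(self.frames)


stack = FrameStackSketch(k=4)
after_reset = stack.reset("f0")
stack.step("f1")
after_two_steps = stack.step("f2")
```

In the real wrapper each frame has shape (84, 84, 1), so four stacked frames give (84, 84, 4), which `get_state` transposes to (4, 84, 84) for the network.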

# Utility functions

In [3]:
# Render avi or gif
def renderFrames(frame_array, savePath, fileName, fps, otype='AVI'):
    print('Creating replay ...', end=' ')
    if otype == 'AVI':
        fileName += '.avi'
        height, width, layers = frame_array[0].shape
        if layers == 1:
            layers = 0
        size = (width, height)
        out = cv2.VideoWriter(savePath + fileName, cv2.VideoWriter_fourcc(*'DIVX'), fps, size, layers)
        for i in range(len(frame_array)):
            out.write(frame_array[i])
        out.release()
        print('Done. Saved to {}'.format(savePath + fileName))
    else:
        print('Error: Invalid type, must be AVI.')

# Deep Q-Learning

## Build DQN Architecture

In [6]:
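import copy


class DQN:
    ''' DQN Algorithm '''

    def __init__(self, state_dim, hidden_dim, action_dim, learning_rate, gamma,
                 epsilon, target_update, device, q_net):
        self.action_dim = action_dim
        self.q_net = q_net.to(device)
        # Target network: an independent copy, so that it only changes when
        # update() syncs it from q_net (assigning q_net itself would alias both)
        self.target_q_net = copy.deepcopy(q_net).to(device)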
        # using ADAM optimizer
        self.optimizer = torch.optim.Adam(self.q_net.parameters(),
                                          lr=learning_rate)
        self.gamma = gamma  # discount factor
        self.epsilon = epsilon  # epsilon-greedy strategy
        self.target_update = target_update  # update frequency
        self.count = 0  #counter
        self.device = device

    def take_action(self, state):  # epsilon-greedy strategy for action choosing
        if np.random.random() < self.epsilon:
            action = np.random.randint(self.action_dim)
        else:
            # build the batch as one ndarray first; torch.tensor on a list of arrays is slow
            state = torch.tensor(np.array([state]), dtype=torch.float).to(self.device)
            action = self.q_net(state).argmax().item()
        return action

    def max_q_value(self, state):
        state = torch.tensor(np.array([state]), dtype=torch.float).to(self.device)
        return self.q_net(state).max().item()

    def save_model(self,save_path):
        print('Saving models ...', end=' ')
        torch.save(self.q_net.state_dict(),save_path + 'DQN_eval.pth')
        torch.save(self.target_q_net.state_dict(),save_path + 'DQN_next.pth')
        print('Done.')

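    def load_model(self,load_path):
        # NOTE: expects Pong_gym_eval.pth / Pong_gym_next.pth, which differ from
        # the DQN_eval.pth / DQN_next.pth names written by save_model
        print('Loading models ...', end=' ')
        self.q_net.load_state_dict(
            torch.load(load_path+'Pong_gym_eval.pth', map_location=self.device))
        self.target_q_net.load_state_dict(
            torch.load(load_path+'Pong_gym_next.pth', map_location=self.device))
        print('Done.')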


    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'],
                              dtype=torch.float).to(self.device)
        actions = torch.tensor(transition_dict['actions']).view(-1, 1).to(
            self.device)
        rewards = torch.tensor(transition_dict['rewards'],
                               dtype=torch.float).view(-1, 1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'],
                                   dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'],
                             dtype=torch.float).view(-1, 1).to(self.device)

        q_values = self.q_net(states).gather(1, actions)  # Q(s, a) for the actions taken
        # maximum Q value in the next state; no gradient flows through the target
        with torch.no_grad():
            max_next_q_values = self.target_q_net(next_states).max(1)[0].view(
                -1, 1)
        # TD target; (1 - dones) drops the bootstrap term for terminal transitions
        q_targets = rewards + self.gamma * max_next_q_values * (1 - dones)
        dqn_loss = F.mse_loss(q_values, q_targets)  # mean squared TD error
        self.optimizer.zero_grad()  # PyTorch accumulates gradients by default, so reset them to zero
        dqn_loss.backward()  # backpropagate the loss
        self.optimizer.step()

        if self.count % self.target_update == 0:
            self.target_q_net.load_state_dict(
                self.q_net.state_dict())  # update strategy
        self.count += 1
        

    def play_game(self, env):
        print('Playing game ...', end=' ')
        score = 0
        observation = env.reset()
        observation = get_state(observation)
        done = False
        steps = 0
        frames = []
        obsFrames = []
        while not done:
            action = self.take_action(observation)
            observation, reward, done, info = env.step(action)
            observation = get_state(observation)
            score += reward
            steps += 1
            frames.append(env.render(mode='rgb_array'))
            obsFrames.append(observation)
        print('Done. Score: {}'.format(score))
        return frames, obsFrames

class CNN(nn.Module):
    def __init__(self, in_channels=4, n_actions=14):
        """
        Initialize Deep Q Network

        Args:
            in_channels (int): number of input channels
            n_actions (int): number of outputs
        """
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc4 = nn.Linear(7 * 7 * 64, 512)
        self.head = nn.Linear(512, n_actions)

    def forward(self, x):
        x = x.float() / 255
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.fc4(x.view(x.size(0), -1)))
        return self.head(x)
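
The `7 * 7 * 64` input size of `fc4` follows from the 84x84 input and the three convolution layers above. The short check below applies the standard output-size formula for a convolution without padding; the helper name `conv_out` is ours, not part of PyTorch.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of one spatial dimension of a convolution (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 84  # WarpFrame produces 84x84 frames
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:  # conv1, conv2, conv3
    size = conv_out(size, kernel, stride)

flat_features = size * size * 64  # conv3 has 64 output channels
```

The spatial size shrinks from 84 to 20, then 9, then 7, so the flattened feature vector has 7 * 7 * 64 = 3136 entries.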

# Training Process

In [5]:
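def train_dqn(agent, env, replay_buffer, minimal_size, batch_size, num_episodes=500,
              conti_action=False, return_agent=False,load=False,load_path=None,save=True,save_path=None):
    """
    Train a DQN agent and return the list of per-episode returns.

    :param agent: DQN agent providing take_action, update, load_model and save_model
    :param env: wrapped gym environment
    :param num_episodes: total number of training episodes (run in 10 progress-bar chunks)
    :param replay_buffer: buffer providing add, sample and size
    :param minimal_size: start training only when replay_buffer is larger than minimal_size
    :param batch_size: number of transitions sampled per update
    :param conti_action: whether the action space is continuous (unused in this function)
    :param return_agent: if True, also return the trained agent
    :param load: if True, load model weights from load_path before training
    :param load_path: directory containing the saved weights
    :param save: if True, save model weights to save_path after training
    :param save_path: directory to write the weights to
    :return: return_list, or (return_list, agent) if return_agent is True
    """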

    if load:
        agent.load_model(load_path)

    return_list = []
    for i in range(10):
        with tqdm(total=int(num_episodes / 10), desc='Iteration %d' % i) as pbar:
            for i_episode in range(int(num_episodes / 10)):
                episode_return = 0
                state = env.reset()
                state = get_state(state)  # dimension trans
                done = False
                while not done:
                    action = agent.take_action(state)
                    next_state, reward, done, _ = env.step(action)
                    next_state = get_state(next_state)  # dimension trans
                    replay_buffer.add(state, action, reward, next_state, done)
                    state = next_state
                    episode_return += reward
                    # start training only when replay_buffer is larger than minimal_size
                    if replay_buffer.size() > minimal_size:
                        b_s, b_a, b_r, b_ns, b_d = replay_buffer.sample(batch_size)
                        transition_dict = {
                            'states': b_s,
                            'actions': b_a,
                            'next_states': b_ns,
                            'rewards': b_r,
                            'dones': b_d
                        }
                        agent.update(transition_dict)
                return_list.append(episode_return)
                if (i_episode + 1) % 10 == 0:
                    pbar.set_postfix({
                        'episode':
                            '%d' % (num_episodes / 10 * i + i_episode + 1),
                        'return':
                            '%.3f' % np.mean(return_list[-10:])
                    })
                pbar.update(1)

    if save:
        agent.save_model(save_path)
        
    if return_agent:
        return return_list, agent
    return return_list
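
Each call to `agent.update` in the loop above regresses Q(s, a) toward a one-step TD target. The scalar sketch below reproduces that arithmetic for a single transition. The function name `td_target` and the input numbers are hypothetical, chosen only to show the calculation; `gamma` defaults to 0.92, the value passed to `DQN` in this notebook.

```python
def td_target(reward, max_next_q, done, gamma=0.92):
    """One-step Q-learning target: r + gamma * max_a' Q_target(s', a') * (1 - done)."""
    return reward + gamma * max_next_q * (1.0 - float(done))


# hypothetical numbers, chosen only to show the arithmetic
non_terminal = td_target(reward=0.0, max_next_q=1.5, done=False)
terminal = td_target(reward=-1.0, max_next_q=1.5, done=True)
```

For a terminal transition the bootstrap term vanishes, so the target is just the reward.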

# Main

In [7]:
env = gym.make("PongNoFrameskip-v4")
env = make_env(env)
state_dim = env.observation_space.shape
action_dim = env.action_space.shape or env.action_space.n
device = torch.device("cuda") if torch.cuda.is_available() else torch.device('cpu')
load_path = './models/'
save_path = 'C:/Users/86139/Desktop/Project/rl-pong-main/models/'

base_net = CNN(
    in_channels=4,
    n_actions=6
)
replay_buffer = ReplayBuffer(capacity=10000)

agent_dqn = DQN(
    state_dim=state_dim,
    hidden_dim=128,
    action_dim=action_dim,
    learning_rate=0.0001,  # reducing the learning rate
    gamma=0.92,
    epsilon=0.01,
    target_update=50,  # increase to 50 from 10
    device=device,
    q_net=base_net,
)
start_t = time.time()
result, agent = train_dqn(
    agent=agent_dqn,
    env=env,
    replay_buffer=replay_buffer,
    minimal_size=500,
    num_episodes=20, # 20 episodes for showing the process. We trained the agent for 1000 episodes. 
    batch_size=64,  # batch_size reduced from 128 to 64, but the speed of training on GPU increases
    return_agent=True,
    load=True,
    save=False,
    load_path=load_path # load the network parameters of 1000 episodes 
)


print(result)
print("Training finished, using {} s.".format(time.time() - start_t))  # time.time() differences are in seconds

# joblib.dump(result, "./result/dqn1000iter_result_list.dat")
# joblib.dump(agent, "./result/dqn1000iter_agent.dat")

Loading models ... Done.

Iteration 0: 100%|██████████| 2/2 [02:15<00:00, 67.74s/it]
Iteration 1: 100%|██████████| 2/2 [02:08<00:00, 64.05s/it]
Iteration 2: 100%|██████████| 2/2 [01:37<00:00, 48.75s/it]
Iteration 3: 100%|██████████| 2/2 [01:43<00:00, 51.94s/it]
Iteration 4: 100%|██████████| 2/2 [01:21<00:00, 40.62s/it]
Iteration 5: 100%|██████████| 2/2 [02:15<00:00, 67.54s/it]
Iteration 6: 100%|██████████| 2/2 [01:43<00:00, 51.97s/it]
Iteration 7: 100%|██████████| 2/2 [02:11<00:00, 65.85s/it]
Iteration 8: 100%|██████████| 2/2 [02:48<00:00, 84.18s/it]
Iteration 9: 100%|██████████| 2/2 [02:03<00:00, 61.87s/it]

[20.0, 16.0, 18.0, 17.0, 18.0, 18.0, -4.0, -15.0, 18.0, 20.0, -5.0, 6.0, 20.0, 3.0, 7.0, -7.0, 4.0, -3.0, -5.0, 10.0]
Training finished, using 1211.2879800796509 s.
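
The 20 returns printed above can be summarized with the standard library. This is only a convenience sketch: the list is copied from the output above, and the variable names are ours.

```python
from statistics import mean

# per-episode returns copied from the printed result list
returns = [20.0, 16.0, 18.0, 17.0, 18.0, 18.0, -4.0, -15.0, 18.0, 20.0,
           -5.0, 6.0, 20.0, 3.0, 7.0, -7.0, 4.0, -3.0, -5.0, 10.0]

overall_mean = mean(returns)
first_half_mean = mean(returns[:10])   # episodes 1-10
second_half_mean = mean(returns[10:])  # episodes 11-20
best, worst = max(returns), min(returns)
```

Twenty episodes is a small sample, so these averages are only a rough indication of the agent's performance.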





# Play&Test

In [10]:
frames, obsFrames = agent_dqn.play_game(env)
fileName = 'Pong_gym_playgame'
renderFrames(frames, './', fileName, 60, otype='AVI')

Playing game ... Done. Score: 21.0
Creating replay ... Done. Saved to ./Pong_gym_playgame.avi
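
`play_game` selects actions through `take_action`, so with `epsilon=0.01` roughly 1% of the moves in the replay are still random. The standalone sketch below shows the epsilon-greedy rule on a plain list of Q-values. The function name `epsilon_greedy` and the example values are assumptions for illustration; setting `epsilon=0.0` gives purely greedy play.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the argmax of q_values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


q = [0.1, 0.7, 0.2]  # hypothetical Q-values for three actions
greedy_action = epsilon_greedy(q, epsilon=0.0)
exploring_action = epsilon_greedy(q, epsilon=1.0)  # always a uniformly random action
```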
