# Mario Bros RL implementation
Sumin Park /
2023.08.01
## Implementation details
* MARL
* Implement Sarsa/Q-learning/Expected Sarsa (maybe all three, if it's easy to implement)
* Deep Q-Learning?
* Make a graph comparing the performances of each method

## Need to learn
* How having 2 agents affects Q-learning

# Plan here
## What I know
* Mixed-sum game of planning and control
* Goal of the game: kick the pest off the step
  * This requires 2 steps:
> 1. Hit the floor beneath the pest, which knocks the pest onto its back.
> 2. Move up to the floor and kick it off. (+800 reward)

## Game Plan
* Since there are only 2 agents in this game, there is no need to implement mean-field Q-learning
* Use of joint action space (18 X 18 for 2 mario bros)
* Epsilon-greedy policy; which alpha value to use? (probably pretty large)
* Q-network (add target network after testing Q-network)
* Replay memory
* Preprocessing of the frames (how many frames to stack together as the state?)
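
For the frame-preprocessing item, here is a minimal sketch of how a fixed number of recent frames could be combined into one state. The `FrameStack` class, the 84 x 84 frame size and the stack depth of 4 are illustrative assumptions; a wrapper library can do the same job.

```python
from collections import deque

import numpy as np


class FrameStack:
    """Hypothetical helper: keeps the k most recent frames as one (H, W, k) state."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Fill the stack with copies of the first frame so the state shape is fixed
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        # The oldest frame is dropped automatically because of maxlen
        self.frames.append(frame)
        return self.state()

    def state(self):
        # Stack along a new last axis: oldest frame first, newest frame last
        return np.stack(list(self.frames), axis=-1)


stack = FrameStack(k=4)
state = stack.reset(np.zeros((84, 84), dtype=np.uint8))
state = stack.push(np.ones((84, 84), dtype=np.uint8))
print(state.shape)
```

Stacking several frames lets the network infer motion (direction and speed), which a single frame cannot show.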


## Code in parts
### 1. Initialize replay memory
> Initially, the game will run for up to N frames while storing exp variables in D. Once D holds N entries, drop the oldest entry before adding a new one (D = D[1:N], or use deque?)

* Variables: exp (e:s, a, r, s'); replay memory pool (D: list of N # of e); N (capacity); batch (batch size to sample from D at each training - updating Q-network)
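
A minimal sketch of the replay memory described above, assuming a `deque` with `maxlen=N` is used instead of slicing a list. The `ReplayMemory` class and its method names are hypothetical, chosen only for illustration.

```python
import random
from collections import deque


class ReplayMemory:
    """Hypothetical fixed-capacity pool D of experiences e = (s, a, r, s', done)."""

    def __init__(self, capacity):
        # Once the deque holds `capacity` items, appending drops the oldest one
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Never ask for more experiences than are stored
        batch_size = min(batch_size, len(self.memory))
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.store(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
batch = memory.sample(batch_size=2)
print(len(memory), len(batch))
```

With `maxlen` set, no manual `D = D[1:N]` step is needed: the two oldest experiences in the example are evicted automatically.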

### 2. Build model
> Build a Keras model including a few convolutional layers.

* Variables: state_size (state space); action_size (joint action space)

### 3. Train a model
> Train the model with target and current q values. *Important*: For actions that were chosen, the target comes from the Bellman equation. For actions that were not chosen, the target is just the value predicted by our q-network (as in the lunar example; in the cartpole example, however, the target for unchosen actions seems to be 0 ...). Each agent will have a separate q-network to train, since the q-values of the agents should differ for the same state.

* Variables: current and future state; reward; termination values (termination & truncation); lr (learning rate); optimizer;


* Functions: get_q_target (given the current reward, argmax_q_value (from next obs/reward), and alpha, return a target used to calculate the loss).
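
The target construction described above can be sketched in NumPy as follows. This is an assumption-laden illustration: `get_q_targets` is a hypothetical batched variant of `get_q_target`, the discount factor is called `gamma`, and the q-value arrays stand in for network predictions.

```python
import numpy as np


def get_q_targets(q_current, q_next, actions, rewards, dones, gamma):
    """Hypothetical helper that builds a (batch, n_actions) target array.

    Chosen actions get the Bellman target r + gamma * max_a' Q(s', a');
    unchosen actions keep the network's own prediction, so they add no loss.
    """
    targets = q_current.copy()
    best_next = q_next.max(axis=1)
    # No future value is added for terminal transitions (dones == 1)
    bellman = rewards + gamma * best_next * (1.0 - dones)
    targets[np.arange(len(actions)), actions] = bellman
    return targets


q_current = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
q_next = np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 0.0]])
targets = get_q_targets(q_current, q_next,
                        actions=np.array([0, 2]),
                        rewards=np.array([1.0, 0.0]),
                        dones=np.array([0.0, 1.0]),
                        gamma=0.9)
print(targets)
```

Only one entry per row changes, so a mean-squared-error loss against these targets updates the network for the chosen action alone.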

### 4. Play game

> Loop through each frame of the game.

* Variables: epsilon (starting e, min e, decay_rate), episode (# of episode to train the model); T (# of frames per episode); total (total reward for each agent)

* Functions: get_action (given current state and e values, return action according to e-greedy policy); store_memory (given current obs, add it to replay memory)
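
A minimal sketch of the epsilon schedule and the e-greedy choice, assuming an exponential decay per episode. Both function names and the decay formula are assumptions; the default values mirror the epsilon settings used later in this notebook.

```python
import numpy as np


def decayed_epsilon(episode, max_epsilon=1.0, min_epsilon=0.05, decay_rate=0.005):
    """Assumed schedule: exponential decay from max_epsilon towards min_epsilon."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)


def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
random_action = epsilon_greedy(q_values, epsilon=1.0, rng=rng)
print(decayed_epsilon(0), greedy_action)
```

Early episodes explore almost uniformly, while later episodes mostly follow the q-network but keep a small floor of exploration.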


### 5. Testing the q-network
> Not sure if this is necessary? By the end of training a model the


### 6. Handling 2 agents
> The Q-network determines the highest joint action value. Need functions that convert this value to individual actions. joint_action_space has size 18*18 + 2*18: the flattened (18, 18) grid of action pairs, plus 2*18 entries that determine the action taken by one agent when the other is dead.

* Functions: joint_action_to_actions (given two action values, convert it to a joint action value in join_action_space (18*18 +
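
One possible layout for the joint action space is sketched below. The index ordering (pairs first, then agent 1 alone, then agent 2 alone) and the use of `None` for a dead agent are assumptions made for illustration, as is the helper `actions_to_joint_action`.

```python
N_ACTIONS = 18
BOTH_ALIVE = N_ACTIONS * N_ACTIONS         # 324 indices: both agents act
JOINT_SIZE = BOTH_ALIVE + 2 * N_ACTIONS    # 360 indices in total


def actions_to_joint_action(action_1, action_2):
    """Encode two individual actions (None = agent is dead) as one index."""
    if action_1 is not None and action_2 is not None:
        return action_1 * N_ACTIONS + action_2
    if action_1 is not None:
        return BOTH_ALIVE + action_1               # only agent 1 acts
    if action_2 is not None:
        return BOTH_ALIVE + N_ACTIONS + action_2   # only agent 2 acts
    raise ValueError("at least one agent must act")


def joint_action_to_actions(joint_action):
    """Decode a joint index back into (action_1, action_2)."""
    if not 0 <= joint_action < JOINT_SIZE:
        raise ValueError("joint action out of range")
    if joint_action < BOTH_ALIVE:
        return divmod(joint_action, N_ACTIONS)
    solo = joint_action - BOTH_ALIVE
    if solo < N_ACTIONS:
        return solo, None
    return None, solo - N_ACTIONS


print(actions_to_joint_action(2, 5), joint_action_to_actions(41))
```

Because encoding and decoding are exact inverses, the argmax of the Q-network output can be turned straight back into one action per agent.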

### 7. Google drive
> It might be necessary to save the q-network parameters in a file on Google Drive, so as not to have to run everything all over again.
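
A minimal sketch of saving and restoring parameters as a list of NumPy arrays, which is the form a Keras model's `get_weights()` returns. The helper names, the file name and the stand-in weights are assumptions; Keras also provides `save_weights` and `load_weights`, which may be the simpler route.

```python
import os
import tempfile

import numpy as np


def save_params(path, weights):
    """Save a list of NumPy arrays to one .npz file."""
    np.savez(path, *weights)


def load_params(path):
    """Load the arrays back in their original order."""
    with np.load(path) as data:
        # np.savez names positional arrays arr_0, arr_1, ...
        return [data[f"arr_{i}"] for i in range(len(data.files))]


# Stand-in weights; in the notebook these would come from the q-network
weights = [np.ones((3, 2)), np.zeros(2)]

# Temporary folder for the demo; on Colab this would be a folder on the mounted drive
path = os.path.join(tempfile.mkdtemp(), "qnet_params.npz")
save_params(path, weights)
restored = load_params(path)
print(len(restored))
```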


## Loops

## Remaining questions
* When sampling from the memory pool, do I sample a stack of frames to use as samples together? -> most likely yes
* ~~How do I take in a stack of sequential frames through input node?~~ **DONE**
* How do I use multiagent wrappers? (from supersuit)
* ~~What happens to the action space when one agent is dead?~~

## Useful bits of code
```Python
from collections import deque
self.memory = deque(maxlen=N)
self.memory.appendleft(exp) # exp = (s, a, r, s'); once the deque is full, the entry at the other end is dropped
```

# Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os

%cd /content/drive/MyDrive/Github/MarioBros
!pwd

/content/drive/MyDrive/Github/MarioBros
/content/drive/MyDrive/Github/MarioBros


In [None]:
!pip install pettingzoo



In [None]:
!pip install pettingzoo[atari]



In [None]:
!pip install tensorflow



In [None]:
!pip install gymnasium[accept-rom-license]



In [None]:
%pip install -U "gym>=0.26.2"
%pip install -U gym[atari,accept-rom-license]



In [None]:
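# NOTE: plain `!AutoROM` stops at the interactive license prompt in Colab (see the output);
# running `!AutoROM --accept-license` installs the ROMs without the prompt.
!AutoROM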

AutoROM will download the Atari 2600 ROMs.
They will be installed to:
	/usr/local/lib/python3.10/dist-packages/AutoROM/roms
	/usr/local/lib/python3.10/dist-packages/multi_agent_ale_py/roms

Existing ROMs will be overwritten.

I own a license to these Atari 2600 ROMs.
I agree to not distribute these ROMs and wish to proceed: [Y/n]: 
Aborted!


In [None]:
!pip install supersuit



In [None]:
# Make Colab behave as if it has a display by using SDL's dummy video driver
!pip install pygame

import os
os.environ['SDL_VIDEODRIVER']='dummy'
import pygame
pygame.display.set_mode((640,480))



<Surface(640x480x32 SW)>

# Import packages

In [None]:
from pettingzoo.atari import mario_bros_v3

import numpy as np
import numpy.random as npr

import supersuit

import tensorflow as tf
from tensorflow.keras import optimizers, losses
from tensorflow.keras import Model

from collections import deque
from tqdm import tqdm

# from IPython.display import clear_output

# Create environment and preprocess

In [None]:
'''
Create a custom environment that includes a joint action space
'''
from pettingzoo.atari.base_atari_env import ParallelAtariEnv

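class jointAtariEnv(ParallelAtariEnv):
  def __init__(self, *args, **kwargs):
    # super().__init__() has to be called inside a method; calling it directly
    # in the class body raises a RuntimeError
    super().__init__(*args, **kwargs)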

In [None]:
env = mario_bros_v3.env(render_mode="human", full_action_space=True)

stacked_frames = 4

# maxes over the last 2 frames to deal with frame flickering
env = supersuit.max_observation_v0(env, 2)

# repeat_action_probability is set to 0.25 to introduce non-determinism to the system
env = supersuit.sticky_actions_v0(env, repeat_action_probability=0.25)

# skip frames for faster processing and less control
# to be compatible with gym, use frame_skip(env, (2,5))
env = supersuit.frame_skip_v0(env, 4)

# downscale observation for faster processing
env = supersuit.resize_v1(env, 84, 84) # not sure if the x_ and y_size are good

# stack frames together to give more info on what is happening at one timestep
env = supersuit.frame_stack_v1(env, stacked_frames)

# preprocessing for MADRL
env = supersuit.agent_indicator_v0(env)

env = supersuit.pad_observations_v0(env) # Is this necessary? The obs will be the same for both agents.



In [None]:
raw_env = env.unwrapped
print(type(raw_env))
print(raw_env.full_action_space)
print(raw_env.action_mapping)
# So now what I have to do is modify my env... and map the action to 18 * 18 ...
# but how does that actually change the next state of the game!!! where in the class
# does the action_mapping work
env.reset()
print(env.last())

# Define Q-network model

In [None]:
state_size = env.observation_space('first_0').shape # convert to np array?
action_size = env.action_space('first_0').n

batch_size = 100

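# hyperparameters
# NOTE: placeholder values. lr = 1 is very large for gradient-based training and
# gamma = 0 ignores all future rewards; DQN setups usually use a small lr and gamma close to 1.
lr = 1
gamma = 0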

print(state_size)

(84, 84, 14)


In [None]:
class MarioBrosQNet(tf.keras.Model):
  def __init__(self, state_size, action_size):

    super(MarioBrosQNet, self).__init__()

    self.conv1 = tf.keras.layers.Conv2D(32,
                                        kernel_size=(8, 8), strides=(4, 4),
                                        activation='relu',
                                        data_format='channels_last',
                                        input_shape=state_size)

    self.maxpool1 = tf.keras.layers.MaxPooling2D(pool_size=(4, 4),
                                                 strides=(4, 4),
                                                 data_format='channels_last')
    self.conv2 = tf.keras.layers.Conv2D(64,
                                        kernel_size=(3, 3), strides=(2, 2),
                                        activation='relu',
                                        data_format='channels_last')
    self.maxpool2 = tf.keras.layers.MaxPooling2D(pool_size=(2, 2),
                                                 strides=(2, 2),
                                                 data_format='channels_last')

    self.flatten = tf.keras.layers.Flatten()
    self.dense1 = tf.keras.layers.Dense(512,
                                        activation='relu')
    self.dense2 = tf.keras.layers.Dense(256,
                                        activation='relu')

    self.value = tf.keras.layers.Dense(action_size,
                                        activation='linear')


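  def call(self, state):
    # cast the uint8 frames to float32 before the convolutional layers
    state = tf.cast(state, tf.float32)
    conv1 = self.conv1(state)
    maxpool1 = self.maxpool1(conv1)
    conv2 = self.conv2(maxpool1)
    maxpool2 = self.maxpool2(conv2)
    flatten = self.flatten(maxpool2)
    dense1 = self.dense1(flatten)
    dense2 = self.dense2(dense1)
    value = self.value(dense2)
    # return one q-value per action
    return value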


In [None]:
# Initialize Q-Networks for 2 agents

qNet_1 = MarioBrosQNet(state_size, action_size)
qNet_2 = MarioBrosQNet(state_size, action_size)

In [None]:
# Test q-network model

env.reset()
print(env.last()[0].dtype)
sample_data = np.array([env.last()[0], env.last()[0]])
#tf.cast(sample_data, dtype=tf.float32) # state data may need to be recast into float32 (from uint8) to be compatible with the Conv2D layer
# or why is it not necessary???
print(sample_data.dtype)

qNet_1.build(sample_data.shape) # build the model with a sample data
qNet_1.summary()

uint8
uint8
Model: "mario_bros_q_net_35"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_69 (Conv2D)          multiple                  28704     
                                                                 
 max_pooling2d_68 (MaxPoolin  multiple                 0         
 g2D)                                                            
                                                                 
 conv2d_70 (Conv2D)          multiple                  18496     
                                                                 
 max_pooling2d_69 (MaxPoolin  multiple                 0         
 g2D)                                                            
                                                                 
 flatten_34 (Flatten)        multiple                  0         
                                                                 
 dense_102 (Dense)           multip
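
The parameter counts in the summary can be checked by hand. The sketch below assumes the Keras default `'valid'` padding for every convolution and pooling layer and the (84, 84, 14) input shape printed earlier; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output length along one spatial axis with 'valid' padding."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


size = 84
size = conv_out(size, kernel=8, stride=4)  # conv1
size = conv_out(size, kernel=4, stride=4)  # maxpool1
size = conv_out(size, kernel=3, stride=2)  # conv2
size = conv_out(size, kernel=2, stride=2)  # maxpool2
flat_features = size * size * 64           # input width of the first dense layer

conv1_params = conv_params(kernel=8, in_channels=14, filters=32)
conv2_params = conv_params(kernel=3, in_channels=32, filters=64)
print(conv1_params, conv2_params, size, flat_features)
```

The two convolution counts match the summary (28704 and 18496). Under these assumptions the two pooling layers shrink the feature map to 1 x 1 before flattening, which may be more aggressive than intended.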

# Train q-nets

In [None]:
def train_model(q_net, replay_memory, batch_size, lr):
  pass
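
One step `train_model` will likely need is turning sampled experiences into batched arrays. The sketch below assumes each experience is stored as `[state, action, reward, next_state, done]`, as in the game loop of this notebook; `unpack_batch` is a hypothetical helper and the tiny 4 x 4 x 2 frames are dummy data.

```python
import numpy as np


def unpack_batch(batch):
    """Turn a list of [s, a, r, s', done] experiences into batched arrays."""
    states, actions, rewards, next_states, dones = zip(*batch)
    return (np.stack(states).astype(np.float32),
            np.array(actions, dtype=np.int64),
            np.array(rewards, dtype=np.float32),
            np.stack(next_states).astype(np.float32),
            np.array(dones, dtype=np.float32))


# Dummy experiences with small frames to keep the demo light
batch = [[np.full((4, 4, 2), t, dtype=np.uint8), t % 3, float(t),
          np.full((4, 4, 2), t + 1, dtype=np.uint8), t == 2]
         for t in range(3)]

states, actions, rewards, next_states, dones = unpack_batch(batch)
print(states.shape, actions, dones)
```

The resulting arrays have the batch dimension first, which is the layout the q-network expects for prediction and training.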

# Define policy

In [None]:
# initial epsilon, decay rate, final epsilon values
epsilon = 1
max_epsilon = 1
decay_rate = 0.005
min_epsilon = 0.05

def get_next_action(state, epsilon, agent):
  if (npr.rand() < epsilon):
    # explore: take a random action
    action = env.action_space(agent).sample()
  else:
    # exploit: take the action with the highest predicted q-value
    q_net = qNet_1 if agent == 'first_0' else qNet_2
    # predict expects a batch, so add a leading batch dimension
    q_values = q_net.predict(state[np.newaxis, ...],
                             verbose = 0)
    action = int(np.argmax(q_values[0]))

  return action

# Run the game

In [None]:
# episode parameters

n_games = 1

max_frames = 11  # cap on the number of agent steps per game

q_nets = {'first_0': qNet_1,
          'second_0': qNet_2}

# keep a separate replay memory for each agent, shared across games
replay_memory = {'first_0': deque(maxlen=100000),
                 'second_0': deque(maxlen=100000)}

for i in range(n_games):

  env.reset()

  # dictionary to store total rewards for each game
  episode_reward = {'first_0': 0,
                    'second_0': 0}

  # (state, action) chosen by each agent on its previous turn
  pending = {}

  j = 0

  # env.last() returns the observation and reward of the agent that is about
  # to act, so a transition is only complete on that agent's next turn
  for agent in env.agent_iter():

    if j >= max_frames:
      break

    state, reward, done, trunc, _ = env.last()

    episode_reward[agent] += reward

    if agent in pending:
      prev_state, prev_action = pending.pop(agent)
      replay_memory[agent].append([prev_state, prev_action, reward, state, done])

    if done or trunc:
      action = None  # a finished agent must step with None
    else:
      action = get_next_action(state, epsilon, agent)
      pending[agent] = (state, action)

    env.step(action)

    #env.render()

    if ((j % 10) == 0):
      for name, q_net in q_nets.items():
        train_model(q_net, replay_memory[name], batch_size, lr)

    j += 1


# Code testing

In [None]:
from collections import deque

memories = {'one': deque(maxlen=8),
            'two': deque(maxlen=8)}
agents = ['one', 'two']

for agent in agents:
  for i in range(8):
    memories[agent].append(i)

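# deques are mutable, so a change made inside the function is visible to the caller
def test(memory_dict):
  print("original: ", memory_dict)
  print("modifying")
  memory_dict['one'][0] = 1
  print("after: ", memory_dict)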

test(memories)

original:  {'one': deque([0, 1, 2, 3, 4, 5, 6, 7], maxlen=8), 'two': deque([0, 1, 2, 3, 4, 5, 6, 7], maxlen=8)}
modifying
after:  {'one': deque([1, 1, 2, 3, 4, 5, 6, 7], maxlen=8), 'two': deque([0, 1, 2, 3, 4, 5, 6, 7], maxlen=8)}
