**[Intro to Game AI and Reinforcement Learning Home Page](https://www.kaggle.com/learn/intro-to-game-ai-and-reinforcement-learning)**

---


# Setting up training buddies

In [None]:
%load_ext Cython

In [None]:
from agents import matrix_agent, agentc1, agentc2, agentc3, agentc5, agentc7, agentc9, agentc11

# Creating the gym environment 

In [None]:
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import tensorflow as tf


from stable_baselines.bench import Monitor 
from stable_baselines.common.vec_env import DummyVecEnv

from stable_baselines import PPO2 
from stable_baselines.common.policies import CnnPolicy
from stable_baselines.common.callbacks import EvalCallback

from common import board_flip
from connect4gym import ConnectFourGym


    
    
# Create ConnectFour environment
env = ConnectFourGym([matrix_agent, 'random', agentc1, agentc2, agentc3, agentc5])

# Create directory for logging training information
log_dir = "logtf1/"
os.makedirs(log_dir, exist_ok=True)

# Logging progress
monitor_env = Monitor(env, log_dir, allow_early_resets=True)

# Create a vectorized environment
vec_env = DummyVecEnv([lambda: monitor_env])



# Model definition

In [None]:
from tensorflow.layers import Dropout, BatchNormalization, Dense, Conv2D
import tensorflow as tf


"""
args = dotdict({
    'lr': 0.001,
    'dropout': 0.3,
    'epochs': 5,
    'batch_size': 64,
    'num_channels': 64,
})
"""


NUM_CHANNELS = 64

BN1 = BatchNormalization()
BN2 = BatchNormalization()
BN3 = BatchNormalization()
BN4 = BatchNormalization()
BN5 = BatchNormalization()
BN6 = BatchNormalization()


CONV1 = Conv2D(NUM_CHANNELS, kernel_size=3, strides=1, padding='same')
CONV2 = Conv2D(NUM_CHANNELS, kernel_size=3, strides=1, padding='same')
CONV3 = Conv2D(NUM_CHANNELS, kernel_size=3, strides=1)
CONV4 = Conv2D(NUM_CHANNELS, kernel_size=3, strides=1)

FC1 = Dense(128)
FC2 = Dense(64)
FC3 = Dense(7)

DROP1 = Dropout(0.3)
DROP2 = Dropout(0.3)


# 6x7 input
# https://github.com/PaddlePaddle/PARL/blob/0915559a1dd1b9de74ddd2b261e2a4accd0cd96a/benchmark/torch/AlphaZero/submission_template.py#L496
def modified_cnn(inputs, **kwargs):
    relu = tf.nn.relu
    log_softmax = tf.nn.log_softmax
    
    
    layer_1_out = relu(BN1(CONV1(inputs)))
    layer_2_out = relu(BN2(CONV2(layer_1_out)))
    layer_3_out = relu(BN3(CONV3(layer_2_out)))
    layer_4_out = relu(BN4(CONV4(layer_3_out)))
    
    # CONV3 and CONV4 are unpadded 3x3 convolutions, so each side shrinks by 4 in total:
    # width 7 -> 3 and height 6 -> 2
    flattened = tf.reshape(layer_4_out, [-1, NUM_CHANNELS * 3 * 2]) 
    
    layer_5_out = DROP1(relu(BN5(FC1(flattened))))
    layer_6_out = DROP2(relu(BN6(FC2(layer_5_out))))
    
    return log_softmax(FC3(layer_6_out))  

# https://www.kaggle.com/c/connectx/discussion/128591
class CustomCnnPolicy(CnnPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomCnnPolicy, self).__init__(*args, **kwargs, cnn_extractor=modified_cnn)
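
The flattened size used in `modified_cnn` can be checked by hand. The sketch below is a standalone calculation, not part of the training code: `conv_out` is a hypothetical helper, and it assumes that `CONV3` and `CONV4` fall back to the layer's default `'valid'` padding because no `padding` argument is given.

```python
def conv_out(size, kernel=3, stride=1, padding="valid"):
    """Output length of one spatial dimension after a convolution."""
    if padding == "same":
        # 'same' pads the input so the output length is ceil(size / stride)
        return -(-size // stride)
    # 'valid' applies no padding
    return (size - kernel) // stride + 1


ROWS, COLUMNS, NUM_CHANNELS = 6, 7, 64

# CONV1 and CONV2 use padding='same'; CONV3 and CONV4 are assumed to be 'valid'
paddings = ["same", "same", "valid", "valid"]

height, width = ROWS, COLUMNS
for padding in paddings:
    height = conv_out(height, padding=padding)
    width = conv_out(width, padding=padding)

flat_size = NUM_CHANNELS * height * width
print(height, width, flat_size)  # 2 3 384
```

If you change the kernel size, the padding, or the number of convolution layers, the `NUM_CHANNELS * 3 * 2` in the reshape has to change with it.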

In [None]:
from connect4gym import SaveBestModelCallback

Next, run the code cells below to train an agent with PPO and see how the rewards evolve during training.  The code follows the tutorial, but it uses `PPO2` with the custom CNN defined above and several training opponents.

In [None]:
# Initialize agent
# Try CnnPolicy and MlpPolicy
# https://www.kaggle.com/toshikazuwatanabe/connect4-make-submission-with-stable-baselines3/comments


eval_callback = SaveBestModelCallback('RDaneelConnect4_', 1000, ['random', agentc1, agentc3, agentc5, matrix_agent])

model = PPO2(CustomCnnPolicy, vec_env, verbose=1)


# Train agent
model.learn(total_timesteps=3000, callback=eval_callback)

#vec_env.close()

In [None]:
# Plot cumulative reward

plt.figure(figsize = (20,20))


with open(os.path.join(log_dir, "monitor.csv"), 'rt') as fh:    
    firstline = fh.readline()
    assert firstline[0] == '#'
    df = pd.read_csv(fh, index_col=None)['r']
df.rolling(window=100).mean().plot()
plt.show()

In [None]:
def dqn_agent(obs, config):
    # Despite the name, this agent wraps the PPO2 model trained above
    # Preprocess the board with board_flip and add a channel axis
    grid = board_flip(obs.mark, np.array(obs.board).reshape(6, 7, 1))
    col, _ = model.predict(grid, deterministic=True)
    col = int(col)
    # A column is playable if its top cell is empty
    if obs.board[col] == 0:
        return col
    # Otherwise fall back to a random playable column
    return random.choice([c for c in range(config.columns) if obs.board[c] == 0])
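
The validity check in the agent relies on the board being a flat, row-major list whose first `config.columns` entries are the top row. Here is a minimal standalone sketch of that idea; `valid_columns` and `safe_move` are hypothetical names used only for illustration, and the board is made up.

```python
import random


def valid_columns(board, columns):
    """Columns whose top cell is empty, for a flat row-major board."""
    # The first `columns` entries are the top row; 0 means empty
    return [c for c in range(columns) if board[c] == 0]


def safe_move(preferred, board, columns, rng=random):
    """Return `preferred` if it is legal, otherwise a random legal column."""
    legal = valid_columns(board, columns)
    if preferred in legal:
        return preferred
    return rng.choice(legal)


# 6x7 board stored as a flat list; column 3 is filled all the way to the top
board = [0] * 42
for row in range(6):
    board[row * 7 + 3] = 1 if row % 2 == 0 else 2

print(valid_columns(board, 7))  # [0, 1, 2, 4, 5, 6]
print(safe_move(2, board, 7))   # 2
```

Calling `safe_move(3, board, 7)` returns one of the six legal columns at random, which mirrors the fallback branch of the agent.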

In [None]:
from common import get_win_percentages_and_score

print('=' * 80)
print('VS Random')
get_win_percentages_and_score('random', dqn_agent)
print('=' * 80)
print('VS Heuristic')
get_win_percentages_and_score(agentc1, dqn_agent)
print('=' * 80)
print('VS Minmax2')
get_win_percentages_and_score(agentc2, dqn_agent)
print('=' * 80)
print('VS Minmax3')
get_win_percentages_and_score(agentc3, dqn_agent)
print('=' * 80)
print('VS Minmax5')
get_win_percentages_and_score(agentc5, dqn_agent)
print('=' * 80)
print('VS Minmax7')
get_win_percentages_and_score(agentc7, dqn_agent)
print('=' * 80)
print('VS Matrix Agent')
get_win_percentages_and_score(matrix_agent, dqn_agent)

In [None]:
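# NOTE: serializeAndCompress is not defined or imported anywhere in this notebook;
# import it from the module that defines it before running this cell.
compressed = serializeAndCompress(model.get_parameters())
print(compressed)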

If your agent trained well, the reward plot above (a 100-episode rolling mean of the per-episode reward) should trend upward over time.
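
To make the smoothing concrete, here is a rough pure-Python sketch of a trailing rolling mean, similar in spirit to the `rolling(window=100).mean()` call in the plotting cell. The `rolling_mean` helper and the reward values are made up for illustration, and the window is shortened to 4 so the output is easy to read.

```python
def rolling_mean(values, window):
    """Mean of each full trailing window; None until the window is filled."""
    out = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window
            running -= values[i - window]
        out.append(running / window if i >= window - 1 else None)
    return out


# Illustrative per-episode rewards: losses (-1) early, wins (+1) later
rewards = [-1, -1, 1, -1, 1, 1, 1, 1]
smoothed = rolling_mean(rewards, window=4)
print(smoothed)  # [None, None, None, -0.5, 0.0, 0.5, 0.5, 1.0]
```

The first `window - 1` entries are empty, which is why the plotted curve only starts after the first 100 episodes.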

Once you have verified that the code runs, try making amendments to see if you can get increased performance.  You might like to:
- change `PPO2` to `A2C` (or `ACER`, `ACKTR`, or `TRPO`) when defining the model in this line of code: `model = PPO2(CustomCnnPolicy, vec_env, verbose=1)`.  This will let you see how performance is affected by changing the algorithm from Proximal Policy Optimization (PPO) to one of:
  - Advantage Actor-Critic (A2C),
  - Actor-Critic with Experience Replay (ACER),
  - Actor Critic using Kronecker-factored Trust Region (ACKTR), or
  - Trust Region Policy Optimization (TRPO).
- modify the `change_reward()` method in the `ConnectFourGym` class to change the rewards that the agent receives in different conditions.  You may also need to modify `self.reward_range` in the `__init__` method (this tuple should always correspond to the minimum and maximum reward that the agent can receive).
- change the opponents in the list passed when creating the ConnectFour environment with `env = ConnectFourGym([matrix_agent, 'random', agentc1, agentc2, agentc3, agentc5])`.  For instance, you might like to add the `"negamax"` agent, or a different, custom agent.  Note that the smarter you make the opponents, the harder it will be for your agent to train!
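
As a starting point for the reward experiment, the sketch below shows one way a reward-shaping function and its matching `reward_range` could fit together. It is a standalone illustration, not the code in `connect4gym`: the function name `shaped_reward`, its arguments, and all reward values are assumptions.

```python
ROWS, COLUMNS = 6, 7

# Hypothetical reward values, chosen only for illustration
WIN_REWARD = 1.0
LOSS_REWARD = -1.0
INVALID_MOVE_REWARD = -10.0
STEP_REWARD = 1.0 / (ROWS * COLUMNS)  # small bonus for each move survived


def shaped_reward(raw_reward, done, valid_move=True):
    """Map the environment's raw outcome to the reward used for training."""
    if not valid_move:
        return INVALID_MOVE_REWARD
    if raw_reward == 1:
        return WIN_REWARD      # the agent won
    if done:
        return LOSS_REWARD     # the game ended without a win for the agent
    return STEP_REWARD         # the game continues


# Keep the range in sync with the smallest and largest values above
all_rewards = (WIN_REWARD, LOSS_REWARD, INVALID_MOVE_REWARD, STEP_REWARD)
reward_range = (min(all_rewards), max(all_rewards))
print(reward_range)  # (-10.0, 1.0)
```

Deriving `reward_range` from the constants avoids the tuple drifting out of date when you try different reward values.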

# Congratulations!

You have completed the course, and it's time to put your new skills to work!  

The next step is to apply what you've learned to a **[more complex game: Halite](https://www.kaggle.com/c/halite)**.  For a step-by-step tutorial on how to make your first submission to this competition, **[check out the bonus lesson](https://www.kaggle.com/alexisbcook/getting-started-with-halite)**!

You can find more games as they're released on the **[Kaggle Simulations page](https://www.kaggle.com/simulations)**.

As we did in the course, we recommend that you start simple, with an agent that follows your precise instructions.  This will allow you to learn more about the mechanics of the game and to build intuition for what makes a good agent.  Then, gradually increase the complexity of your agents to climb the leaderboard!

---

*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*