Train a CNN to play my mini board game "Pop it!" with reinforcement learning and Monte Carlo Tree Search.

rl-popit

Welcome to my website, where you can play the game. Give it a try!

https://crema.evalieben.cn/game/

(Screenshot of the game, 2023-01-03)

The game is currently only available in Chinese.

Neural Network Architecture

The model architecture is based on DeepMind's work on AlphaGo Zero:

Silver, D., Schrittwieser, J., Simonyan, K. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017). https://doi.org/10.1038/nature24270

The best model, resnet3-64, consists of:

  • 1 input convolutional layer
  • 3 residual blocks (2 convolutional layers + 1 skip connection for each block)
  • 1 policy head (1 convolutional layer + 1 fully connected layer)
  • 1 value head (1 convolutional layer + 2 fully connected layers)

Each convolutional layer has 64 features (the input layer has 2 features). The network structure resembles AlphaGo Zero's, but with far fewer features and residual blocks (my game is much simpler, after all).
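
For reference, a minimal PyTorch sketch of such an architecture is given below. Only the channel counts, the number of residual blocks, and the head layouts come from the description above; the board size parameter, kernel sizes, head convolution widths (2 and 1 channels, as in AlphaGo Zero), and activations are my own assumptions rather than the repo's actual code.

```python
# Minimal sketch of the resnet3-64 architecture described above.
# Assumptions (not from the repo): 3x3 kernels in the trunk, 1x1 convolutions
# in the heads, ReLU activations, and a square board of size `board_size`.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        return F.relu(out + x)                      # skip connection


class PopItNet(nn.Module):
    def __init__(self, board_size, channels=64, in_channels=2, blocks=3):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, channels, 3, padding=1)   # input conv layer
        self.blocks = nn.Sequential(*[ResBlock(channels) for _ in range(blocks)])
        # Policy head: 1 convolutional layer + 1 fully connected layer
        self.policy_conv = nn.Conv2d(channels, 2, 1)
        self.policy_fc = nn.Linear(2 * board_size * board_size, board_size * board_size)
        # Value head: 1 convolutional layer + 2 fully connected layers
        self.value_conv = nn.Conv2d(channels, 1, 1)
        self.value_fc1 = nn.Linear(board_size * board_size, 64)
        self.value_fc2 = nn.Linear(64, 1)

    def forward(self, x):
        x = F.relu(self.stem(x))
        x = self.blocks(x)
        p = F.relu(self.policy_conv(x)).flatten(1)
        policy_logits = self.policy_fc(p)            # one logit per board cell
        v = F.relu(self.value_conv(x)).flatten(1)
        v = F.relu(self.value_fc1(v))
        value = torch.tanh(self.value_fc2(v))        # scalar value in [-1, 1]
        return policy_logits, value
```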

Method

The neural network is trained with Proximal Policy Optimization (PPO). I also tweaked the original PPO implementation following this article:

The 37 Implementation Details of Proximal Policy Optimization https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/

Some of these implementation details work very well for my model: entropy maximization, advantage normalization, global gradient clipping, etc. Each training cycle begins with 128 vectorized environments sampling game states, followed by backpropagation with a minibatch size of 2048. I simply set the reward of every action to 1 if the agent wins the game, and to -1 for all actions otherwise.
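
As a rough illustration, one PPO minibatch update with those tweaks (advantage normalization, entropy bonus, global gradient clipping) might look like the sketch below. The coefficients (clip range, entropy and value weights, gradient-norm limit) are common defaults rather than values taken from this repo, and the model is assumed to return policy logits together with a value estimate.

```python
# Sketch of one PPO minibatch update with the tweaks mentioned above.
# Coefficient values here are illustrative only.
import torch
import torch.nn.functional as F
from torch.distributions import Categorical


def ppo_update(model, optimizer, obs, actions, old_log_probs,
               advantages, returns,
               clip_eps=0.2, vf_coef=0.5, ent_coef=0.01, max_grad_norm=0.5):
    # Normalize advantages within the minibatch
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    logits, values = model(obs)                  # policy logits and value estimate
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)

    # Clipped surrogate objective
    ratio = torch.exp(log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    # Value loss and entropy bonus (entropy maximization)
    value_loss = F.mse_loss(values.squeeze(-1), returns)
    entropy = dist.entropy().mean()

    loss = policy_loss + vf_coef * value_loss - ent_coef * entropy

    optimizer.zero_grad()
    loss.backward()
    # Global gradient clipping across all parameters
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```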

Training Curve

The horizontal axis shows the number of epochs and the vertical axis shows the win rate (%). Each curve represents play against an opponent using an older model. Once the win rate reaches 90%, the old opponent is dropped and the latest model is saved as the new opponent, as sketched below. Training becomes much tougher after 1,000 epochs, and the model finally converges at around 15,000 epochs.
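
That opponent schedule can be written out roughly as follows. The callables train_one_epoch and evaluate are hypothetical placeholders for the repo's actual training and evaluation code; only the 90% win-rate threshold and the swap-and-save rule come from the description above.

```python
# Sketch of the opponent-update schedule: train against a frozen copy of an
# older model and replace it once the win rate reaches 90%.
import copy


def self_play_schedule(model, train_one_epoch, evaluate,
                       num_epochs, win_rate_threshold=0.90):
    opponent = copy.deepcopy(model)              # initial frozen opponent
    for epoch in range(num_epochs):
        train_one_epoch(model, opponent)         # PPO updates vs. the opponent
        win_rate = evaluate(model, opponent)     # win rate in [0, 1]
        if win_rate >= win_rate_threshold:
            # Drop the old opponent and save the latest model as the new one.
            opponent = copy.deepcopy(model)
    return model
```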

Win Rate Change during Epochs 0–1.4k


Win Rate Change during Epochs 1k–15k

