
GPU accelerated batch MCTS #10

Closed
fqjin opened this issue May 5, 2019 · 7 comments
fqjin commented May 5, 2019

Using PyTorch.

fqjin self-assigned this May 5, 2019
fqjin commented May 9, 2019

Batch moving successfully implemented in cfc92db. Speed comparisons coming...
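For context, a minimal sketch of what a batched board move can look like (a hypothetical helper, not the code in cfc92db): nonzero tiles are packed to the left across a whole `(n, 4, 4)` batch at once using a stable sort, with no Python loop over boards. Merging of equal tiles is omitted here.

```python
import torch

def shift_left_batch(boards):
    """Pack nonzero tiles to the left for a whole batch of boards.

    boards: (n, 4, 4) integer tensor; 0 marks an empty cell.
    Only the shift step of a 2048-style move; merging is omitted.
    """
    keys = (boards == 0).to(torch.int8)             # 0 = tile, 1 = empty
    _, idx = torch.sort(keys, dim=-1, stable=True)  # empties sort last, order kept
    return torch.gather(boards, -1, idx)
```

The stable sort preserves the left-to-right order of the tiles within each row, which a plain sort of the values would not.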

fqjin commented May 9, 2019

Speed comparisons (in seconds) for various numbers of games. The result for 1 game is the average over 10 runs; standard deviations are in parentheses.

  • Batching is slower for a single game, about equal at 10 games, and significantly faster at 100 games in parallel.
  • Without batching, the GPU is much slower than the CPU in every case. In other testing, I found that tensors need on the order of 10^6 elements before GPU acceleration starts beating the CPU. It may be worth examining the relative contributions of play_fixed_batch, move_batch, and merge_row_batch.
  • CPU: Intel Xeon E5-1620 @ 3.60 GHz
  • GPU: GeForce RTX 2070 @ 645 MHz, CUDA 10.1

CPU

| n | no batch | batch |
| --- | --- | --- |
| 1 | 0.189 (0.073) | 0.468 (0.139) |
| 10 | 1.70 | 1.25 |
| 100 | 16.5 | 3.27 |
| 1000 | x | 17.6 |

CUDA

| n | no batch | batch |
| --- | --- | --- |
| 1 | 0.626 (0.244) | 1.14 (0.308) |
| 10 | 6.19 | 2.95 |
| 100 | 66.2 | 9.47 |
| 1000 | x | 63.9 |
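The CPU/GPU crossover behavior behind these tables can be probed with a small harness like the following (a sketch; `time_elementwise` is a hypothetical name, not part of this repo). CUDA kernels launch asynchronously, so `torch.cuda.synchronize()` is needed before reading the clock for honest GPU timings.

```python
import time
import torch

def time_elementwise(n_elems, device, repeats=10):
    """Rough per-call timing of a simple elementwise op at a given size.

    Small tensors favor the CPU because kernel-launch and dispatch
    overhead dominate until the data is large enough.
    """
    x = torch.rand(n_elems, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # finish allocation/warm-up work first
    start = time.perf_counter()
    for _ in range(repeats):
        y = x * 2 + 1
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the async GPU kernels
    return (time.perf_counter() - start) / repeats
```

Sweeping `n_elems` from 10^3 to 10^7 on both devices reproduces the kind of crossover described above.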

fqjin commented May 10, 2019

At 10000 games with 3 additional randomly generated tiles, merge_row_batch takes about the same time on GPU and CPU, while move_batch is 3 times slower on GPU. The code in move_batch involves iterating over and appending to Python lists, and flipping torch tensors.
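A sketch of the two patterns (hypothetical helpers, not the actual move_batch code): flipping each board in a Python loop versus one flip over the whole batch. On GPU, the loop launches one tiny kernel per board, which is where this kind of slowdown tends to come from.

```python
import torch

def flip_boards_loop(boards):
    # Slow pattern: Python loop, list appends, and one tiny
    # flip kernel per (4, 4) board.
    out = []
    for b in boards:
        out.append(torch.flip(b, dims=[1]))
    return torch.stack(out)

def flip_boards_batch(boards):
    # Fast pattern: a single flip over the whole (n, 4, 4) batch.
    return torch.flip(boards, dims=[2])
```

Both return identical results; only the number of kernel launches differs.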

fqjin commented May 11, 2019

Functions are slower on GPU when the data size is small. For nonzero(), the CPU is faster when the data has 10^4 elements, and the GPU only becomes faster beyond about 10^5 elements. I will implement a batch version of generate_tiles(), but even at 1000 boards per batch it is still faster on CPU.
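One way such a batch version could look (a hypothetical sketch, not the repo's generate_tiles): draw one empty cell per board with `torch.multinomial` over the empty-cell mask, then fill it with a 2 (90%) or a 4 (10%), all without a Python loop over boards.

```python
import torch

def generate_tiles_batch(boards):
    """Hypothetical batched tile spawn for 2048-style boards.

    boards: (n, 4, 4) integer tensor; 0 marks an empty cell. Each
    board gets one new tile in a uniformly chosen empty cell: 2 with
    probability 0.9, else 4. Assumes every board has at least one
    empty cell. Modifies boards in place and returns it.
    """
    n = boards.shape[0]
    flat = boards.view(n, 16)            # view shares memory with boards
    empty = (flat == 0).float()
    # one weighted draw per board over its empty cells
    cells = torch.multinomial(empty, 1).squeeze(1)
    # 10% of draws become a 4, the rest a 2
    vals = (torch.rand(n, device=boards.device) < 0.1).long() * 2 + 2
    flat[torch.arange(n), cells] = vals.to(flat.dtype)
    return boards
```

Even fully vectorized, a draw this small may still run faster on CPU, consistent with the nonzero() crossover above.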

fqjin commented May 13, 2019

Timings for mcts_nn with number=100. Running MCTS is about 3x slower using CUDA tensors, even though the GPU evaluates the CNN about 3x faster; the overhead of running the MCTS itself dominates. The ConvNet game is faster than the TestNet game because its MCTS lines die earlier.

| Network | CPU | CUDA |
| --- | --- | --- |
| TestNet | 11.7 | 47.3 |
| ConvNet | 9.77 | 26.4 |

Timings for play_nn, which does no MCTS (it only plays 1 game). Even this is slower on GPU because the move and generate_tiles functions are slower there. I should plan to optimize these functions in the future.

| Network | CPU | CUDA |
| --- | --- | --- |
| ConvNet | 1.13 | 1.27 |
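These numbers point toward a hybrid split: keep the board tensors and game logic on CPU, and move only the batched network forward pass to the device. A sketch of that idea (hypothetical helper, not this repo's API):

```python
import torch
import torch.nn as nn

def evaluate_batch(model, boards, device):
    """Evaluate a CPU batch of boards on the given device.

    boards: (n, 4, 4) CPU tensor. Only the forward pass runs on
    `device`; results return to CPU for the MCTS bookkeeping.
    """
    x = boards.float().unsqueeze(1).to(device)   # (n, 1, 4, 4) for a CNN
    with torch.no_grad():
        out = model.to(device)(x)
    return out.cpu()
```

With a large enough batch, the transfer cost is amortized by the faster forward pass, while move and generate_tiles stay on their faster CPU path.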

fqjin closed this as completed May 13, 2019
fqjin commented May 20, 2019

Selfplay game generation is still too slow. However, timing tests suggest that the GPU only beats the CPU at batch sizes around 200,000. It is very hard to reach those numbers when searching only 50 lines per move and 200 games per mcts_nn; I would need to run 1000 mcts_nn calls in parallel to get that benefit. For now, I am focusing on improving speed on the CPU.

fqjin commented May 28, 2019

I was able to use GPU-accelerated model prediction to speed up mcts_nn. See #14 and 64ebf88 for details. GPU usage hovers around 5% for one process.
