GPU accelerated batch MCTS #10

Closed
fqjin opened this issue May 5, 2019 · 7 comments

fqjin commented May 5, 2019

Using PyTorch.

@fqjin fqjin self-assigned this May 5, 2019
@fqjin fqjin added this to In progress in Pytorch port May 5, 2019

fqjin commented May 9, 2019

Batch moving successfully implemented in cfc92db. Speed comparisons coming...

fqjin commented May 9, 2019

Speed comparisons (in seconds) for various numbers of games. The result for 1 game is the average over 10 games, with the standard deviation in parentheses.

  • Batching is slower for a single game, roughly equal at about 10 games, and significantly faster at 100 games in parallel.
  • Without batching, the GPU is much slower than the CPU in all cases. In other testing, I found that tensors need on the order of 10^6 elements before GPU acceleration starts beating the CPU. It may be worth examining the relative contributions of play_fixed_batch vs move_batch vs merge_row_batch (see the timing sketch after the tables below).
  • CPU: Intel Xeon E5-1620 @ 3.60 GHz
  • GPU: GeForce RTX 2070 @ 645 MHz, CUDA 10.1

CPU

| n    | no batch      | batch         |
|------|---------------|---------------|
| 1    | 0.189 (0.073) | 0.468 (0.139) |
| 10   | 1.70          | 1.25          |
| 100  | 16.5          | 3.27          |
| 1000 | x             | 17.6          |

CUDA

| n    | no batch      | batch         |
|------|---------------|---------------|
| 1    | 0.626 (0.244) | 1.14 (0.308)  |
| 10   | 6.19          | 2.95          |
| 100  | 66.2          | 9.47          |
| 1000 | x             | 63.9          |
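
For reference, a fair way to collect these numbers is to synchronize the GPU before reading the clock, since CUDA ops are asynchronous. Below is a minimal timing-harness sketch; the `fn(n_games, device)` call stands in for play_fixed / play_fixed_batch, whose actual signatures may differ.

```python
import time
import torch

def time_play(fn, n_games, device, repeats=10):
    """Return per-run wall-clock times (seconds) for fn(n_games, device).

    fn stands in for play_fixed / play_fixed_batch; the signature used
    here is an assumption, not the repo's actual API.
    """
    times = []
    for _ in range(repeats):
        if device.type == 'cuda':
            torch.cuda.synchronize()  # finish any pending GPU work first
        start = time.perf_counter()
        fn(n_games, device)
        if device.type == 'cuda':
            torch.cuda.synchronize()  # CUDA ops are async; wait before stopping the clock
        times.append(time.perf_counter() - start)
    return times

# e.g. time_play(play_fixed_batch, 100, torch.device('cuda'), repeats=1)
```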

fqjin commented May 10, 2019

At 10,000 games with 3 additional randomly generated tiles, merge_row_batch takes about the same time on GPU and CPU, while move_batch is about 3 times slower on GPU. The code in move_batch involves iterating over and appending to lists, and flipping torch tensors.
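
As an illustration of why that pattern favors the CPU (this is a toy example, not the repo's actual move_batch): looping over boards issues one tiny op (and one kernel launch on GPU) per board, while a single vectorized flip over the whole batch amortizes the launch overhead.

```python
import torch

boards = torch.randint(0, 5, (10000, 4, 4))  # 10,000 boards of 4x4 tile exponents

# Loop-heavy pattern: one small op (and one kernel launch on GPU) per board
flipped_rows = []
for board in boards:
    flipped_rows.append(torch.flip(board, dims=[1]))
flipped_loop = torch.stack(flipped_rows)

# Vectorized pattern: a single flip over the whole batch
flipped_vec = torch.flip(boards, dims=[2])

assert torch.equal(flipped_loop, flipped_vec)
```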

fqjin commented May 11, 2019

Functions are slower on the GPU when the data size is small. For nonzero(), the CPU is faster when the data has 10^4 elements, and the GPU only becomes faster above 10^5 elements. I will implement a batch version of generate_tiles(), but even at 1000 boards per batch it is still faster on the CPU.
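
A rough micro-benchmark sketch for reproducing that crossover (the exact break-even point depends on hardware and PyTorch version):

```python
import time
import torch

def time_nonzero(n, device, repeats=20):
    """Rough per-call timing of torch.nonzero on an n-element tensor."""
    x = torch.rand(n, device=device) > 0.9  # ~10% of entries are nonzero (True)
    torch.nonzero(x)  # warm-up call
    if device == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        torch.nonzero(x)
    if device == 'cuda':
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

for n in (10**4, 10**5, 10**6):
    print(f"n={n:>7}  cpu={time_nonzero(n, 'cpu')*1e6:8.1f} us"
          f"  cuda={time_nonzero(n, 'cuda')*1e6:8.1f} us")
```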

fqjin commented May 13, 2019

Timings (in seconds) for mcts_nn with number=100. Running the MCTS is about 3x slower using CUDA tensors. The GPU gives about 3x faster evaluation with the CNN, but given the overhead of running the MCTS, the GPU is still slower overall. The ConvNet game is faster than the TestNet one because the MCTS lines die earlier.

| Network | CPU  | CUDA |
|---------|------|------|
| TestNet | 11.7 | 47.3 |
| ConvNet | 9.77 | 26.4 |

Timings for play_nn, which does not do MCTS (it only plays 1 game). Even this is slower on the GPU because the move and generate_tiles functions are slower there. I should plan to optimize these functions in the future.

| Network | CPU  | CUDA |
|---------|------|------|
| ConvNet | 1.13 | 1.27 |
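
For completeness, the network-evaluation speedup can be measured in isolation with a sketch like the one below; the `nn.Sequential` model is a stand-in for illustration only, not the repo's ConvNet.

```python
import time
import torch
import torch.nn as nn

# Stand-in network for illustration only; the repo's ConvNet differs.
net = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=2), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 3 * 3, 4),
)

def time_forward(device, batch=100, repeats=50):
    model = net.to(device)
    boards = torch.rand(batch, 1, 4, 4, device=device)  # fake 4x4 boards
    with torch.no_grad():
        model(boards)  # warm-up
        if device == 'cuda':
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(repeats):
            model(boards)
        if device == 'cuda':
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

print('cpu :', time_forward('cpu'))
print('cuda:', time_forward('cuda'))
```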

@fqjin fqjin moved this from In progress to To do in Pytorch port May 13, 2019
@fqjin fqjin moved this from To do to In progress in Pytorch port May 13, 2019
@fqjin fqjin closed this as completed May 13, 2019
@fqjin fqjin moved this from In progress to Done in Pytorch port May 13, 2019

fqjin commented May 20, 2019

Selfplay game generation is still too slow. However, timing tests suggest that the GPU only beats the CPU at a batch of roughly 200,000 games in parallel. It is very hard to reach those numbers when searching only 50 lines per move or 200 games per mcts_nn; I would need to run 1000 mcts_nn in parallel to get that benefit. For now, I am focusing on improving speed on the CPU.

fqjin commented May 28, 2019

I was able to use GPU-accelerated model prediction to speed up mcts_nn. See #14 and 64ebf88 for details. GPU usage hovers around 5% for one process.
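
A sketch of what that batched-prediction pattern might look like (the actual interface from #14 and 64ebf88 may differ): collect the boards to evaluate, run one forward pass on the GPU, and hand the results back to the CPU-side search.

```python
import torch

def batch_predict(model, boards, device='cuda'):
    """Evaluate many boards with a single GPU forward pass.

    Hypothetical helper: assumes the model takes an (N, 1, 4, 4) float
    tensor and returns one output row per board; the real interface
    from #14 may differ.
    """
    x = torch.stack(boards).unsqueeze(1).float().to(device)  # one transfer, one batch
    with torch.no_grad():
        out = model(x)
    return out.cpu()  # hand the results back to the CPU-side search
```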
