Errors running/retraining Splendor using commands from tutorial #3

aethy · 2024-03-24T18:13:36Z

I tried to play Splendor using the command from the tutorial (I first changed the package imports):

python ./pit.py splendor/pretrained_2players.pt human -n 1

But I got the following error:

File "D:\programs\Python\Python311\Lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Node (MatMulBnFusion_Gemm) Op (Gemm) [ShapeInferenceError] First input does not have rank 2

So I figured it might be due to the mentioned issue "Ongoing code/features rework, some pretrained networks won't work anymore", and I reverted to the version of 30/1/2024, to no avail. Then I decided to run the training myself first, using the example from the tutorial (I had to add -V 85, though; otherwise it complained that version 1 does not exist):

python main.py -m 800 -e 1000 -i 5 -F -c 2.5 -f 0.1 -T 10 -b 32 -l 0.0003 -p 1 -D 0.3 -C ../results/mytest -V 85

But now I got the following error:

  File "D:\programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'SplendorNNet' object has no attribute 'first_layer'

cestpasphoto · 2024-03-25T06:52:51Z

Thank you for testing my code; you're the first person I have had feedback from.

Pretrained networks should work again with the current code: I just tested on the head of master and it runs on my side. Some questions:

Have you switched the imports in main.py and pit.py to Splendor?
Can you check that NUMBER_PLAYERS is set to 2 in SplendorGame.py?
I have onnxruntime 1.15.0 installed on my computer. Although I don't think that should be a problem, please test by removing your current version and installing the same one as mine: pip3 install onnxruntime==1.15.0
Can you retry with numba disabled and without human moves? Also, on my system I have to call python3 instead of python: NUMBA_DISABLE_JIT=1 python3 ./pit.py splendor/pretrained_2players.pt splendor/pretrained_2players.pt -n 1
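
Note that the `VAR=value command` prefix used above is POSIX shell syntax; Windows cmd needs a separate `set` line instead. A minimal shell-independent sketch is to launch pit.py from Python with the variable placed in the child process environment. `NUMBA_DISABLE_JIT` is Numba's own switch; the helper names `build_pit_command` and `run_pit` are hypothetical and not part of the repository.

```python
import os
import subprocess
import sys


def build_pit_command(player1, player2, n_games=1):
    """Return (argv, env) for running pit.py with Numba's JIT disabled."""
    env = dict(os.environ)
    # The variable has to be present before numba is imported by the child.
    env["NUMBA_DISABLE_JIT"] = "1"
    argv = [sys.executable, "./pit.py", player1, player2, "-n", str(n_games)]
    return argv, env


def run_pit(player1, player2, n_games=1):
    """Run pit.py in a child process; raises CalledProcessError on failure."""
    argv, env = build_pit_command(player1, player2, n_games)
    return subprocess.run(argv, env=env, check=True)
```

Calling `run_pit("splendor/pretrained_2players.pt", "splendor/pretrained_2players.pt")` from the repository root should then behave the same on Linux and Windows, since no shell-specific syntax is involved.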

aethy · 2024-03-25T08:28:14Z

> Thank you for testing my code; you're the first person I have had feedback from.

You're welcome, thank you for your quick reply.

> Have you switched the imports in main.py and pit.py to Splendor?

Yes, I have. I assume the following changes relative to master are sufficient?

pit.py:

#!/usr/bin/env python3

import Arena
from MCTS import MCTS
from splendor.SplendorPlayers import *
from splendor.SplendorGame import SplendorGame as Game
from splendor.NNet import NNetWrapper as NNet

main.py:

from Coach import Coach
from splendor.SplendorGame import SplendorGame as Game
from splendor.NNet import NNetWrapper as NNet

> Can you check that NUMBER_PLAYERS is set to 2 in SplendorGame.py?

I didn't change it, so it's at 2.

> I have onnxruntime 1.15.0 installed on my computer. Although I don't think that should be a problem, please test by removing your current version and installing the same one as mine: pip3 install onnxruntime==1.15.0

Done; apparently I was on 1.17.
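
For reference, the installed version can be checked from Python before and after the reinstall. This is a minimal sketch using only the standard library (Python 3.8+); the helper names are illustrative and not from the repository.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(package, pinned):
    """True only when `package` is installed at exactly the `pinned` version."""
    return installed_version(package) == pinned


print("onnxruntime:", installed_version("onnxruntime"))
print("matches 1.15.0:", matches_pin("onnxruntime", "1.15.0"))
```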

> Can you retry with numba disabled and without human moves? Also, on my system I have to call python3 instead of python: NUMBA_DISABLE_JIT=1 python3 ./pit.py splendor/pretrained_2players.pt splendor/pretrained_2players.pt -n 1

I did as you suggested and playing now works, both with the pretrained models playing against each other and with a human player.


> which python
/d/programs/Python/Python311/python
> set NUMBA_DISABLE_JIT=1

> python ./pit.py splendor/pretrained_2players.pt splendor/pretrained_2players.pt -n 1
splendor/pretrained_2players.pt vs splendor/pretrained_2players.pt
Arena.playGames:   0%|                                                                            | 0/1 [00:00<?, ?it/s]D:\werkdir\alpha-zero-general\splendor\SplendorLogicNumba.py:321: RuntimeWarning: overflow encountered in scalar multiply
  fake_random_index = (4594591 * (random_seed+seed)) % len(remaining_cards_all)
Arena (1 vs 2): 100%|█████████████████████████████████████████████| 1/1 [01:01<00:00, 61.74s/it, one_wins=1, two_wins=0]

However, training still gives the same error:

> python main.py -m 800 -e 1000 -i 5 -F -c 2.5 -f 0.1 -T 10 -b 32 -l 0.0003 -p 1 -D 0.3 -C ../results/mytest -V 85 -P 1
Namespace(checkpoint='../results/mytest', load_folder_file=None, numEps=1000, numItersHistory=5, numMCTSSims=800, tempThreshold=10, temperature=[1.25, 0.8], cpuct=2.5, dirichletAlpha=-1, fpu=0.1, forced_playouts=True, learn_rate=0.0003, epochs=1, batch_size=32, dropout=0.3, nn_version=85, q_weight=0.5, updateThreshold=0.6, ratio_fullMCTS=5, prob_fullMCTS=0.25, universes=1, forget_examples=False, numIters=50, stop_after_N_fail=5, profile=False, debug=False, useray=False, parallel_inferences=1, no_compression=False, no_mem_optim=False, arenaCompare=30, maxlenOfQueue=1000000, load_model=False)
A subdirectory or file -p already exists.
Error occurred while processing: -p.
A subdirectory or file ../results/mytest/ already exists.
Error occurred while processing: ../results/mytest/.
Self Play:   0%|                                                                               | 0/1000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\werkdir\alpha-zero-general\main.py", line 172, in <module>
    main()
  File "D:\werkdir\alpha-zero-general\main.py", line 169, in main
    run(args)
  File "D:\werkdir\alpha-zero-general\main.py", line 55, in run
    c.learn()
  File "D:\werkdir\alpha-zero-general\Coach.py", line 165, in learn
    iterationTrainExamples = self.executeEpisodes()
                             ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\werkdir\alpha-zero-general\Coach.py", line 112, in executeEpisodes
    iterationTrainExamples += self.executeEpisode()
                              ^^^^^^^^^^^^^^^^^^^^^
  File "D:\werkdir\alpha-zero-general\Coach.py", line 65, in executeEpisode
    pi, q, is_full_search = my_mcts.getActionProb(canonicalBoard, temp=1.)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\werkdir\alpha-zero-general\MCTS.py", line 65, in getActionProb
    self.search(canonicalBoard, dirichlet_noise=dir_noise, forced_playouts=forced_playouts)
  File "D:\werkdir\alpha-zero-general\MCTS.py", line 144, in search
    Ps, v = self.nnet.predict(canonicalBoard, Vs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\werkdir\alpha-zero-general\GenericNNetWrapper.py", line 100, in predict
    self.switch_target('inference')
  File "D:\werkdir\alpha-zero-general\GenericNNetWrapper.py", line 290, in switch_target
    self.export_and_load_onnx()
  File "D:\werkdir\alpha-zero-general\GenericNNetWrapper.py", line 321, in export_and_load_onnx
    torch.onnx.export(
  File "D:\programs\Python\Python311\Lib\site-packages\torch\onnx\utils.py", line 516, in export
    _export(
  File "D:\programs\Python\Python311\Lib\site-packages\torch\onnx\utils.py", line 1613, in _export
    graph, params_dict, torch_out = _model_to_graph(
                                    ^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\onnx\utils.py", line 1135, in _model_to_graph
    graph, params, torch_out, module = _create_jit_graph(model, args)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\onnx\utils.py", line 1011, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\onnx\utils.py", line 915, in _trace_and_get_graph_from_model
    trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\jit\_trace.py", line 1296, in _get_trace_graph
    outs = ONNXTracedModule(
           ^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\jit\_trace.py", line 138, in forward
    graph, out = torch._C._create_graph_by_tracing(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\jit\_trace.py", line 129, in wrapper
    outs.append(self.inner(*trace_inputs))
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1501, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\werkdir\alpha-zero-general\splendor\SplendorNNet.py", line 236, in forward
    x = self.first_layer(x)
        ^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'SplendorNNet' object has no attribute 'first_layer'
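
Side note on the two "already exists" messages near the top of this output: they look like the Windows mkdir builtin, which does not understand -p and treats it as a directory name. Assuming the training code shells out to `mkdir -p` somewhere (not verified here), a portable alternative could look like the sketch below; `ensure_checkpoint_dir` is a hypothetical helper, not a function from the repository.

```python
import os
import tempfile


def ensure_checkpoint_dir(path):
    """Create `path` and any missing parents; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return os.path.abspath(path)


# Demonstration in a throwaway directory; calling it twice is harmless.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "results", "mytest")
    ensure_checkpoint_dir(target)
    ensure_checkpoint_dir(target)
    print(os.path.isdir(target))
```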

aethy · 2024-03-25T12:53:34Z

I tried now with -V 74 and that seemed to work!

python main.py -m 800 -e 1000 -i 5 -F -c 2.5 -f 0.1 -T 10 -b 32 -l 0.0003 -p 1 -D 0.3 -C ../results/mytest -V 74 -P 1

cestpasphoto · 2024-03-26T06:35:19Z

OK so:

I should investigate why onnxruntime 1.17 isn't supported.
-V 85 failed because 85 isn't a supported architecture version in SplendorNNet.py. But 74 is. I should at least mention that in the instructions.
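
Until the instructions are updated, one way to fail early on an unsupported version would be to validate -V at argument-parsing time rather than deep inside the ONNX export. The following is only a sketch: the tuple below contains just the two versions named in this thread (74 and 80), the authoritative list is whatever SplendorNNet.py actually handles, and `parse_nn_version` is a hypothetical helper.

```python
import argparse

# Assumption: only the versions mentioned in this thread. The real list
# should be taken from the branches that SplendorNNet.py implements.
KNOWN_VERSIONS = (74, 80)


def parse_nn_version(argv):
    """Parse -V and reject values outside KNOWN_VERSIONS."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-V", dest="nn_version", type=int,
                        choices=KNOWN_VERSIONS, required=True)
    return parser.parse_args(argv).nn_version


print(parse_nn_version(["-V", "74"]))
```

With `choices`, argparse itself prints a usage error listing the accepted values and exits, so a call with `-V 85` would stop before any self-play starts.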

FYI, instead of starting training from scratch, you can load an existing checkpoint with the option -L splendor/pretrained_2players.pt. You can also check the exact settings that were used for training by running python3 GenericNNetWrapper.py -i splendor/pretrained_2players.pt (you need to update the imports at line 361 of GenericNNetWrapper.py and install the fvcore package). For instance, I used architecture V80 for the pretrained network, cpuct was set to 0.8, ...

cestpasphoto · 2024-03-30T08:34:38Z

onnxruntime 1.17 is now supported on master.

aethy changed the title ~~Errors when running commands from tutorial~~ Errors running/retraining Splendor using commands from tutorial Mar 24, 2024

aethy closed this as completed Mar 25, 2024

lumi-a mentioned this issue May 8, 2024

ONNXRuntimeErrors trying to run or train Splendor #5

Closed
