Errors running/retraining Splendor using commands from tutorial #3

Closed
aethy opened this issue Mar 24, 2024 · 5 comments

@aethy
Contributor

aethy commented Mar 24, 2024

I tried to play Splendor using the command from the tutorial (I first changed the package imports):

python ./pit.py splendor/pretrained_2players.pt human -n 1

But I got the following error:

File "D:\programs\Python\Python311\Lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Node (MatMulBnFusion_Gemm) Op (Gemm) [ShapeInferenceError] First input does not have rank 2

I figured this might be due to the mentioned issue "Ongoing code/features rework, some pretrained networks won't work anymore", so I reverted to the version of 30/1/2024, to no avail. I then decided to run the training myself first, using the example from the tutorial (I had to add -V 85, though; otherwise it complained about version 1 not existing):

python main.py -m 800 -e 1000 -i 5 -F -c 2.5 -f 0.1 -T 10 -b 32 -l 0.0003 -p 1 -D 0.3 -C ../results/mytest -V 85

But now I got the following error:

  File "D:\programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'SplendorNNet' object has no attribute 'first_layer'
aethy changed the title from "Errors when running commands from tutorial" to "Errors running/retraining Splendor using commands from tutorial" on Mar 24, 2024
@cestpasphoto
Owner

cestpasphoto commented Mar 25, 2024

Thank you for testing my code; you're the first person from whom I have received feedback.

Pretrained networks should now work with the current code. I just tested on the head of master and it runs on my side. Some questions:

  • Have you switched the imports in main.py and pit.py to Splendor?
  • Can you check that NUMBER_PLAYERS is set to 2 in SplendorGame.py?
  • On my computer, I have onnxruntime 1.15.0 installed. Although I don't think that should be a problem, please test by removing your current version and installing the same one as mine: pip3 install onnxruntime==1.15.0
  • Can you retry with numba disabled and without human moves? Also, on my system I have to call python3 instead of python: NUMBA_DISABLE_JIT=1 python3 ./pit.py splendor/pretrained_2players.pt splendor/pretrained_2players.pt -n 1
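
A side note on the last command: the inline VAR=value prefix is POSIX-shell syntax and is not understood by Windows cmd. A minimal, shell-independent sketch of the same launch follows. The helper name build_pit_launch is hypothetical and not part of the repository; it only assumes that pit.py accepts the arguments shown in the bullet above.

```python
import os
import sys


def build_pit_launch(player_one, player_two, games=1, base_env=None):
    """Return (command, env) for running pit.py with numba's JIT disabled.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    env = dict(os.environ if base_env is None else base_env)
    # NUMBA_DISABLE_JIT=1 makes numba run decorated functions as plain Python.
    env["NUMBA_DISABLE_JIT"] = "1"
    # sys.executable sidesteps the python vs python3 naming difference.
    command = [sys.executable, "./pit.py", player_one, player_two, "-n", str(games)]
    return command, env


command, env = build_pit_launch(
    "splendor/pretrained_2players.pt", "splendor/pretrained_2players.pt"
)
print(" ".join(command[1:]))
```

Passing both values to subprocess.run(command, env=env, check=True) from the repository root should then start the match regardless of the shell in use.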

@aethy
Contributor Author

aethy commented Mar 25, 2024

Thank you for testing my code; you're the first person from whom I have received feedback.

You're welcome, thank you for your quick reply.

  • Have you switched the imports in main.py and pit.py to Splendor?

Yes, I have. I assume the following changes with respect to master should be sufficient?

pit.py:

#!/usr/bin/env python3

import Arena
from MCTS import MCTS
from splendor.SplendorPlayers import *
from splendor.SplendorGame import SplendorGame as Game
from splendor.NNet import NNetWrapper as NNet

main.py:

from Coach import Coach
from splendor.SplendorGame import SplendorGame as Game
from splendor.NNet import NNetWrapper as NNet

  • Can you check that NUMBER_PLAYERS is set to 2 in SplendorGame.py?

I didn't change it, so it's at 2.

  • On my computer, I have onnxruntime 1.15.0 installed. Although I don't think that should be a problem, please test by removing your current version and installing the same one as mine: pip3 install onnxruntime==1.15.0

Done; apparently I was on 1.17.
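
For anyone comparing their own setup, the installed version can be read without importing the package itself. This is a small sketch using only the standard library; the comparison against 1.15 simply mirrors the version the maintainer mentions, and the helper names are made up for illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_minor(version):
    """Turn a version string such as '1.17.1' into a comparable (1, 17) tuple."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


found = installed_version("onnxruntime")
if found is None:
    print("onnxruntime is not installed")
elif major_minor(found) != (1, 15):
    print(f"onnxruntime {found} installed; the maintainer tested with 1.15.0")
else:
    print(f"onnxruntime {found} matches the release line the maintainer tested")
```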

  • Can you retry with numba disabled and without human moves? Also, on my system I have to call python3 instead of python: NUMBA_DISABLE_JIT=1 python3 ./pit.py splendor/pretrained_2players.pt splendor/pretrained_2players.pt -n 1

I did as you suggested and playing now works, both with the pretrained models playing against each other and against a human player.


> which python
/d/programs/Python/Python311/python
> set NUMBA_DISABLE_JIT=1

> python ./pit.py splendor/pretrained_2players.pt splendor/pretrained_2players.pt -n 1
splendor/pretrained_2players.pt vs splendor/pretrained_2players.pt
Arena.playGames:   0%|                                                                            | 0/1 [00:00<?, ?it/s]D:\werkdir\alpha-zero-general\splendor\SplendorLogicNumba.py:321: RuntimeWarning: overflow encountered in scalar multiply
  fake_random_index = (4594591 * (random_seed+seed)) % len(remaining_cards_all)
Arena (1 vs 2): 100%|█████████████████████████████████████████████| 1/1 [01:01<00:00, 61.74s/it, one_wins=1, two_wins=0]

However, training still gives the same error:

> python main.py -m 800 -e 1000 -i 5 -F -c 2.5 -f 0.1 -T 10 -b 32 -l 0.0003 -p 1 -D 0.3 -C ../results/mytest -V 85 -P 1
Namespace(checkpoint='../results/mytest', load_folder_file=None, numEps=1000, numItersHistory=5, numMCTSSims=800, tempThreshold=10, temperature=[1.25, 0.8], cpuct=2.5, dirichletAlpha=-1, fpu=0.1, forced_playouts=True, learn_rate=0.0003, epochs=1, batch_size=32, dropout=0.3, nn_version=85, q_weight=0.5, updateThreshold=0.6, ratio_fullMCTS=5, prob_fullMCTS=0.25, universes=1, forget_examples=False, numIters=50, stop_after_N_fail=5, profile=False, debug=False, useray=False, parallel_inferences=1, no_compression=False, no_mem_optim=False, arenaCompare=30, maxlenOfQueue=1000000, load_model=False)
A subdirectory or file -p already exists.
Error occurred while processing: -p.
A subdirectory or file ../results/mytest/ already exists.
Error occurred while processing: ../results/mytest/.
Self Play:   0%|                                                                               | 0/1000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\werkdir\alpha-zero-general\main.py", line 172, in <module>
    main()
  File "D:\werkdir\alpha-zero-general\main.py", line 169, in main
    run(args)
  File "D:\werkdir\alpha-zero-general\main.py", line 55, in run
    c.learn()
  File "D:\werkdir\alpha-zero-general\Coach.py", line 165, in learn
    iterationTrainExamples = self.executeEpisodes()
                             ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\werkdir\alpha-zero-general\Coach.py", line 112, in executeEpisodes
    iterationTrainExamples += self.executeEpisode()
                              ^^^^^^^^^^^^^^^^^^^^^
  File "D:\werkdir\alpha-zero-general\Coach.py", line 65, in executeEpisode
    pi, q, is_full_search = my_mcts.getActionProb(canonicalBoard, temp=1.)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\werkdir\alpha-zero-general\MCTS.py", line 65, in getActionProb
    self.search(canonicalBoard, dirichlet_noise=dir_noise, forced_playouts=forced_playouts)
  File "D:\werkdir\alpha-zero-general\MCTS.py", line 144, in search
    Ps, v = self.nnet.predict(canonicalBoard, Vs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\werkdir\alpha-zero-general\GenericNNetWrapper.py", line 100, in predict
    self.switch_target('inference')
  File "D:\werkdir\alpha-zero-general\GenericNNetWrapper.py", line 290, in switch_target
    self.export_and_load_onnx()
  File "D:\werkdir\alpha-zero-general\GenericNNetWrapper.py", line 321, in export_and_load_onnx
    torch.onnx.export(
  File "D:\programs\Python\Python311\Lib\site-packages\torch\onnx\utils.py", line 516, in export
    _export(
  File "D:\programs\Python\Python311\Lib\site-packages\torch\onnx\utils.py", line 1613, in _export
    graph, params_dict, torch_out = _model_to_graph(
                                    ^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\onnx\utils.py", line 1135, in _model_to_graph
    graph, params, torch_out, module = _create_jit_graph(model, args)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\onnx\utils.py", line 1011, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\onnx\utils.py", line 915, in _trace_and_get_graph_from_model
    trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\jit\_trace.py", line 1296, in _get_trace_graph
    outs = ONNXTracedModule(
           ^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\jit\_trace.py", line 138, in forward
    graph, out = torch._C._create_graph_by_tracing(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\jit\_trace.py", line 129, in wrapper
    outs.append(self.inner(*trace_inputs))
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1501, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\werkdir\alpha-zero-general\splendor\SplendorNNet.py", line 236, in forward
    x = self.first_layer(x)
        ^^^^^^^^^^^^^^^^
  File "D:\programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'SplendorNNet' object has no attribute 'first_layer' 

@aethy
Contributor Author

aethy commented Mar 25, 2024

I tried now with -V 74 and that seemed to work!

python main.py -m 800 -e 1000 -i 5 -F -c 2.5 -f 0.1 -T 10 -b 32 -l 0.0003 -p 1 -D 0.3 -C ../results/mytest -V 74 -P 1

@aethy aethy closed this as completed Mar 25, 2024
@cestpasphoto
Owner

OK so:

  • I should investigate why onnxruntime v1.17 isn't supported.
  • -V 85 failed because 85 isn't a supported architecture version in SplendorNNet.py, whereas 74 is. I should at least mention that in the instructions.
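
The AttributeError above is the kind of failure that appears when a constructor only builds its layers for the version numbers it knows about, so an unknown version is noticed only at the first forward pass. The sketch below is a deliberately simplified stand-in, not the actual SplendorNNet code: the class name, the SUPPORTED_VERSIONS set, and the identity function used in place of a real layer are all assumptions (this thread only confirms that 74 works, that 80 was used for the pretrained network, and that 85 is not supported). It illustrates how checking the version up front would turn the late AttributeError into an immediate, readable error.

```python
class TinyNet:
    """Simplified stand-in for a network class with versioned architectures."""

    # Assumed for illustration; the real list lives in SplendorNNet.py.
    SUPPORTED_VERSIONS = {74, 80}

    def __init__(self, version, validate=True):
        if validate and version not in self.SUPPORTED_VERSIONS:
            supported = sorted(self.SUPPORTED_VERSIONS)
            raise ValueError(
                f"unsupported architecture version {version}; choose one of {supported}"
            )
        self.version = version
        if version in self.SUPPORTED_VERSIONS:
            # A real network would build its layers here.
            self.first_layer = lambda x: x
        # Without validation, an unknown version silently defines no layers.

    def forward(self, x):
        return self.first_layer(x)


# Without the check, the problem only surfaces when forward() is called.
try:
    TinyNet(85, validate=False).forward(1)
except AttributeError as err:
    print("late failure:", err)

# With the check, construction fails immediately with a clear message.
try:
    TinyNet(85)
except ValueError as err:
    print("early failure:", err)
```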

FYI, instead of starting training from scratch, you can load an existing checkpoint with the option -L splendor/pretrained_2players.pt. You can also check the exact settings that were used for training by running python3 GenericNNetWrapper.py -i splendor/pretrained_2players.pt (you need to update the imports on line 361 of GenericNNetWrapper.py and install the fvcore package). For instance, I used architecture V80 for the pretrained network, cpuct was set to 0.8, ...

@cestpasphoto
Owner

onnxruntime v1.17 is now supported on master.
