can't load my trained model #3278

JamesCann · 2020-01-23T13:53:00Z

Describe the bug
Can't load the trained model to continue the training session.

The command is the same one I used to start training, but with '--load' added to continue. The moment I press Play in Unity, the session just ends.

See below for the command-line output.

This is on Windows 10.

C:\Users\james>pip3 show mlagents
Name: mlagents
Version: 0.13.1
Summary: Unity Machine Learning Agents
Home-page: https://github.com/Unity-Technologies/ml-agents
Author: Unity Technologies
Author-email: ML-Agents@unity3d.com
License: UNKNOWN
Location: c:\users\james\appdata\local\programs\python\python36\lib\site-packages
Requires: h5py, Pillow, tensorflow, mlagents-envs, pyyaml, jupyter, pypiwin32, grpcio, matplotlib, numpy, protobuf
Required-by:

C:\Users\james>pip3 show tensorflow
Name: tensorflow
Version: 2.0.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: c:\users\james\appdata\local\programs\python\python36\lib\site-packages
Requires: wheel, google-pasta, six, gast, termcolor, keras-preprocessing, absl-py, grpcio, keras-applications, tensorboard, numpy, astor, opt-einsum, protobuf, tensorflow-estimator, wrapt
Required-by: mlagents

C:\ml-agents>mlagents-learn config/trainer_config.yaml --run-id=thirdRun --train --load
WARNING:tensorflow:From c:\users\james\appdata\local\programs\python\python36\lib\site-packages\tensorflow_core\python\compat\v2_compat.py:65: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term

                    ▄▄▄▓▓▓▓
               ╓▓▓▓▓▓▓█▓▓▓▓▓
          ,▄▄▄m▀▀▀'  ,▓▓▓▀▓▓▄                           ▓▓▓  ▓▓▌
        ▄▓▓▓▀'      ▄▓▓▀  ▓▓▓      ▄▄     ▄▄ ,▄▄ ▄▄▄▄   ,▄▄ ▄▓▓▌▄ ▄▄▄    ,▄▄
      ▄▓▓▓▀        ▄▓▓▀   ▐▓▓▌     ▓▓▌   ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌  ╒▓▓▌
    ▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓      ▓▀      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌   ▐▓▓▄ ▓▓▌
    ▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄     ▓▓      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌    ▐▓▓▐▓▓
      ^█▓▓▓        ▀▓▓▄   ▐▓▓▌     ▓▓▓▓▄▓▓▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▓▄    ▓▓▓▓`
        '▀▓▓▓▄      ^▓▓▓  ▓▓▓       └▀▀▀▀ ▀▀ ^▀▀    `▀▀ `▀▀   '▀▀    ▐▓▓▌
           ▀▀▀▀▓▄▄▄   ▓▓▓▓▓▓,                                      ▓▓▓▓▀
               `▀█▓▓▓▓▓▓▓▓▓▌
                    ¬`▀▀▀█▓

Version information:
ml-agents: 0.13.1,
ml-agents-envs: 0.13.1,
Communicator API: API-13,
TensorFlow: 2.0.0
INFO:mlagents.trainers:CommandLineOptions(debug=False, num_runs=1, seed=-1, env_path=None, run_id='thirdRun', load_model=True, train_model=True, save_freq=50000, keep_checkpoints=5, base_port=5005, num_envs=1, curriculum_folder=None, lesson=0, no_graphics=False, multi_gpu=False, trainer_config_path='config/trainer_config.yaml', sampler_file_path=None, docker_target_name=None, env_args=None, cpu=False, width=84, height=84, quality_level=5, time_scale=20, target_frame_rate=-1)
INFO:mlagents_envs:Listening on port 5004. Start training by pressing the Play button in the Unity Editor.


JamesCann · 2020-01-23T13:57:24Z

Here is the command-line output when it attempts to rerun and then exits immediately:

INFO:mlagents_envs:Connected new brain:
thirdRun?team=0
INFO:mlagents.trainers:Hyperparameters for the PPOTrainer of brain thirdRun:
trainer: ppo
batch_size: 1024
beta: 0.005
buffer_size: 10240
epsilon: 0.2
hidden_units: 16
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 5000000
memory_size: 256
normalize: False
num_epoch: 3
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 1000
use_recurrent: False
vis_encode_type: simple
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99
summary_path: thirdRun_thirdRun
model_path: ./models/thirdRun-0/thirdRun
keep_checkpoints: 5
2020-01-23 10:26:30.135661: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
INFO:mlagents.trainers:Loading Model for brain thirdRun?team=0
INFO:mlagents.trainers:Saved Model
INFO:mlagents.trainers:List of nodes to export for brain :thirdRun?team=0
INFO:mlagents.trainers: is_continuous_control
INFO:mlagents.trainers: version_number
INFO:mlagents.trainers: memory_size
INFO:mlagents.trainers: action_output_shape
INFO:mlagents.trainers: action
INFO:mlagents.trainers: action_probs
Converting ./models/thirdRun-0/thirdRun/frozen_graph_def.pb to ./models/thirdRun-0/thirdRun.nn
IGNORED: StopGradient unknown layer
GLOBALS: 'is_continuous_control', 'version_number', 'memory_size', 'action_output_shape'
IN: 'vector_observation': [-1, 1, 1, 18] => 'main_graph_0/hidden_0/BiasAdd'
IN: 'epsilon': [-1, 1, 1, 2] => 'mul'
OUT: 'action', 'action_probs'
DONE: wrote ./models/thirdRun-0/thirdRun.nn file.
INFO:mlagents.trainers:Exported ./models/thirdRun-0/thirdRun.nn file

vincentpierre · 2020-01-23T18:20:53Z

Using the --load argument resumes training where it ended. It restores, for example, the number of training steps completed in the previous training session. My guess is that the previous run had already reached max_steps, so you need to increase the max_steps argument in the training configuration YAML file. --load continues an existing training run; it does not start a new training session with a previously trained behavior as the starting point.
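
The arithmetic behind that guess can be sketched as follows. This is only an illustration, not ml-agents code: `steps_remaining` is a hypothetical helper, and it assumes that the trainer stops once the restored step count reaches max_steps and that the earlier run had already reached the 5000000 limit shown in the hyperparameter dump.

```python
def steps_remaining(resumed_step: int, max_steps: int) -> int:
    """Hypothetical helper: steps left before a resumed run hits max_steps.

    Assumes the step counter is restored from the checkpoint and training
    stops as soon as that counter reaches max_steps.
    """
    return max(0, max_steps - resumed_step)


# Assumed scenario: the earlier run stopped at the configured limit,
# so resuming with the same max_steps leaves nothing to train.
print(steps_remaining(5_000_000, 5_000_000))

# Raising max_steps (10_000_000 is an arbitrary example value) gives the
# resumed run room to continue.
print(steps_remaining(5_000_000, 10_000_000))
```

Under these assumptions the first call yields 0, which would match a session that ends as soon as Play is pressed, and the second yields a positive number of steps still to run.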

JamesCann · 2020-01-25T14:07:27Z

I'll give that a try today, thanks.

JamesCann · 2020-01-25T14:14:51Z

OK, I tried that, and this time it didn't end. However, the model doesn't seem to have remembered anything. It looks as confused as when it was first run, and the step count seems to be back at the beginning.

JamesCann · 2020-01-25T14:17:24Z

MY BAD!! It's working now, I forgot to add the --load parameter this time. Looks GREAT, thank you

github-actions · 2021-01-28T00:46:46Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

JamesCann added the bug Issue describes a potential bug in ml-agents. label Jan 23, 2020

vincentpierre self-assigned this Jan 23, 2020

vincentpierre closed this as completed Jan 27, 2020

github-actions bot locked as resolved and limited conversation to collaborators Jan 28, 2021
