Trial container raise error when use --no-gpu #2290

Closed
caoyu1664 opened this issue Apr 28, 2021 · 2 comments

Comments

@caoyu1664

Hi,
The trial container raises "TypeError: 'NoneType' object does not support item assignment" in _socket_manager.py while the trial is running.
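
For context, Python raises this kind of TypeError whenever code assigns to an item of a value that turns out to be None. The snippet below is only a minimal sketch of that error class, not the code in _socket_manager.py; the helper names and the dictionary key are hypothetical.

```python
from typing import Optional


def record_address(table: Optional[dict], key: str, value: str) -> None:
    # Hypothetical helper: the item assignment below fails if `table` is None.
    table[key] = value


def try_record(table: Optional[dict]) -> str:
    # Return "ok" on success, or the TypeError message on failure.
    try:
        record_address(table, "rank0", "172.23.0.1:43810")
        return "ok"
    except TypeError as exc:
        return str(exc)


message = try_record(None)
print(message)
```

So the error probably means that some object the harness expected to be a dict or list was None at that point; which object that is cannot be told from the message alone.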

1. Envs

  • CPU only
  • OS: Ubuntu 16.04
  • Determined: 0.15.1
  • Python: 3.7

2. Install and start

pip install determined
det deploy local cluster-up --no-gpu
det experiment create const.yaml . (mnist_pytorch official)

3. Master logs

<info>    [2021-04-28, 08:18:41] master configuration: {"config_file":"","log":{"level":"info","color":true},"db":{"user":"postgres","password":"********","migrations":"file:///usr/share/determined/master/static/migrations","host":"determined-db","port":"5432","name":"determined","ssl_mode":"disable","ssl_root_cert":""},"tensorboard_timeout":300,"security":{"default_task":{"id":0,"user_id":0,"user":"root","uid":0,"group":"root","gid":0},"tls":{"cert":"","key":""}},"checkpoint_storage":{"host_path":"/home/shihk/.local/share/determined","save_experiment_best":0,"save_trial_best":1,"save_trial_latest":1,"storage_path":"determined-checkpoint","type":"shared_fs"},"task_container_defaults":{"shm_size_bytes":4294967296,"network_mode":"bridge","cpu_pod_spec":null,"gpu_pod_spec":null},"port":8080,"harness_path":"/opt/determined","root":"/usr/share/determined/master","telemetry":{"enabled":true,"segment_master_key":"********","segment_webui_key":"********"},"enable_cors":false,"cluster_name":"","logging":{"type":"default"},"hyperparameter_importance":{"workers_limit":2,"queue_limit":16,"cores_per_worker":1,"max_trees":100},"resource_manager":{"default_cpu_resource_pool":"default","default_gpu_resource_pool":"default","scheduler":{"fitting_policy":"best","type":"fair_share"},"type":"agent"},"resource_pools":[{"pool_name":"default","description":"","provider":null,"max_cpu_containers_per_agent":100}]}
<info>    [2021-04-28, 08:18:41] Determined master 0.15.1 (built with go1.16.3)
<info>    [2021-04-28, 08:18:41] connecting to database determined-db:5432
<warning> [2021-04-28, 08:18:45] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=determined-db user=postgres database=determined`: dial error (dial tcp 172.23.0.2:5432: connect: connection refused)"
<info>    [2021-04-28, 08:18:45] running migrations from file:///usr/share/determined/master/static/migrations
<info>    [2021-04-28, 08:18:45] unable to find golang-migrate version
<info>    [2021-04-28, 08:18:45] deleting all snapshots for terminal state experiments
<info>    [2021-04-28, 08:18:45] creating resource pool: default  id="agentRM" system="master" type="agentResourceManager"
<info>    [2021-04-28, 08:18:45] pool default using global scheduling config  id="agentRM" system="master" type="agentResourceManager"
<info>    [2021-04-28, 08:18:45] initializing endpoints for agents
<info>    [2021-04-28, 08:18:45] not enabling provisioner for resource pool: default  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:18:45] scheduling next resource allocation aggregation in 15h42m14s at 2021-04-29 00:01:00 +0000 UTC  id="allocation-aggregator" system="master" type="allocationAggregator"
<info>    [2021-04-28, 08:18:45] telemetry reporting is enabled; run with `--telemetry-enabled=false` to disable
<info>    [2021-04-28, 08:18:45] accepting incoming connections on port 8080
<info>    [2021-04-28, 08:18:45] Subchannel Connectivity change to READY  system="system"
<info>    [2021-04-28, 08:18:45] pickfirstBalancer: HandleSubConnStateChange: 0xc000255d10, {READY <nil>}  system="system"
<info>    [2021-04-28, 08:18:45] Channel Connectivity change to READY  system="system"
<info>    [2021-04-28, 08:18:46] finished unary call with code Unauthenticated  error="rpc error: code = Unauthenticated desc = invalid credentials" grpc.code="Unauthenticated" grpc.method="GetAgents" grpc.service="determined.api.v1.Determined" grpc.start_time="2021-04-28T08:18:46Z" grpc.time_ms="0.791" span.kind="server" system="grpc"
<info>    [2021-04-28, 08:18:46] finished unary call with code Unauthenticated  error="rpc error: code = Unauthenticated desc = invalid credentials" grpc.code="Unauthenticated" grpc.method="Logout" grpc.service="determined.api.v1.Determined" grpc.start_time="2021-04-28T08:18:46Z" grpc.time_ms="0.323" span.kind="server" system="grpc"
<info>    [2021-04-28, 08:18:47] resource pool is empty; using default resource pool: default  id="agents" system="master" type="agents"
<info>    [2021-04-28, 08:18:48] agent connected ip: 192.168.100.30 resource pool: default slots: 1  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:18:48] adding agent: determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:18:48] adding device: cpu0 (Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz x 1 physical cores) on determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<warning> [2021-04-28, 08:20:03] response already committed
<info>    [2021-04-28, 08:20:03] experiment state changed to ACTIVE  id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:03] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: 51871fd3-0ac4-4419-95ce-eee38508beff)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:04] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:04] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:04] starting container id: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:06] found container running: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136 (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:06] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:06] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] new connection from container 63a14abe-fc1f-4c44-8583-d2fb6f9c6136 trial 1 (experiment 1) at 172.23.0.1:43810
<info>    [2021-04-28, 08:20:09] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:09] found 4 rendezvous addresses instead of 2 for container 63a14abe-fc1f-4c44-8583-d2fb6f9c6136; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:09] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-63a14abe-fc1f-4c44-8583-d2fb6f9c6136" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:09] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:09] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] killing container id: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:09] stopped container id: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:09] found container terminated: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] killing container id: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:09] unexpected failure of trial after restart 0/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] resetting trial 1  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:09] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: d997c647-2a3c-497c-8cd0-7ac7dd468634)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:10] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:10] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:10] starting container id: d73b67ac-1500-4037-8bc0-066701af1efa slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:12] found container running: d73b67ac-1500-4037-8bc0-066701af1efa (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:12] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:12] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] new connection from container d73b67ac-1500-4037-8bc0-066701af1efa trial 1 (experiment 1) at 172.23.0.1:43858
<info>    [2021-04-28, 08:20:15] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:15] found 4 rendezvous addresses instead of 2 for container d73b67ac-1500-4037-8bc0-066701af1efa; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:15] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-d73b67ac-1500-4037-8bc0-066701af1efa" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:15] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:15] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] killing container id: d73b67ac-1500-4037-8bc0-066701af1efa  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:15] stopped container id: d73b67ac-1500-4037-8bc0-066701af1efa  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:15] found container terminated: d73b67ac-1500-4037-8bc0-066701af1efa  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] killing container id: d73b67ac-1500-4037-8bc0-066701af1efa  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:15] unexpected failure of trial after restart 1/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] resetting trial 1  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:15] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: a8cc64d8-4902-4c37-97c6-00ca352a5bde)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:16] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:16] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:16] starting container id: b8fb8292-b905-48a2-951d-817b640b4ab6 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:18] found container running: b8fb8292-b905-48a2-951d-817b640b4ab6 (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:18] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:18] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:21] new connection from container b8fb8292-b905-48a2-951d-817b640b4ab6 trial 1 (experiment 1) at 172.23.0.1:43922
<info>    [2021-04-28, 08:20:21] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:21] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:21] found 4 rendezvous addresses instead of 2 for container b8fb8292-b905-48a2-951d-817b640b4ab6; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:21] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-b8fb8292-b905-48a2-951d-817b640b4ab6" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:21] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:21] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:21] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:21] killing container id: b8fb8292-b905-48a2-951d-817b640b4ab6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:22] stopped container id: b8fb8292-b905-48a2-951d-817b640b4ab6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:22] found container terminated: b8fb8292-b905-48a2-951d-817b640b4ab6  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:22] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:22] killing container id: b8fb8292-b905-48a2-951d-817b640b4ab6  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:22] unexpected failure of trial after restart 2/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:22] resetting trial 1  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:22] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:22] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: a7f517a4-d7e2-46fd-be41-c271865069f0)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:22] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:22] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:22] starting container id: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:24] found container running: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:24] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:24] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] new connection from container 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 trial 1 (experiment 1) at 172.23.0.1:43982
<info>    [2021-04-28, 08:20:27] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:27] found 4 rendezvous addresses instead of 2 for container 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:27] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:27] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:27] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] killing container id: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:27] stopped container id: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:27] found container terminated: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] killing container id: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:27] unexpected failure of trial after restart 3/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] resetting trial 1  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:27] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: ca1b557d-35fa-4c22-ae39-17e86f9cb140)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:28] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:28] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:28] starting container id: 1636009d-9875-4cf4-91f4-3fe617fba5b7 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:30] found container running: 1636009d-9875-4cf4-91f4-3fe617fba5b7 (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:30] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:30] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] new connection from container 1636009d-9875-4cf4-91f4-3fe617fba5b7 trial 1 (experiment 1) at 172.23.0.1:44030
<info>    [2021-04-28, 08:20:33] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:33] found 4 rendezvous addresses instead of 2 for container 1636009d-9875-4cf4-91f4-3fe617fba5b7; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:33] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-1636009d-9875-4cf4-91f4-3fe617fba5b7" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:33] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:33] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] killing container id: 1636009d-9875-4cf4-91f4-3fe617fba5b7  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:33] stopped container id: 1636009d-9875-4cf4-91f4-3fe617fba5b7  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:33] found container terminated: 1636009d-9875-4cf4-91f4-3fe617fba5b7  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] killing container id: 1636009d-9875-4cf4-91f4-3fe617fba5b7  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:33] unexpected failure of trial after restart 4/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] resetting trial 1  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:33] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: fc49fcfc-d6db-4dd5-96b1-1aa52032c118)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:34] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:34] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:34] starting container id: 1bc2292d-139d-45ea-ad5c-d261d06c4847 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:36] found container running: 1bc2292d-139d-45ea-ad5c-d261d06c4847 (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:36] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:36] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] new connection from container 1bc2292d-139d-45ea-ad5c-d261d06c4847 trial 1 (experiment 1) at 172.23.0.1:44090
<info>    [2021-04-28, 08:20:39] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:39] found 4 rendezvous addresses instead of 2 for container 1bc2292d-139d-45ea-ad5c-d261d06c4847; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:39] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-1bc2292d-139d-45ea-ad5c-d261d06c4847" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:39] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:39] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] killing container id: 1bc2292d-139d-45ea-ad5c-d261d06c4847  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:39] stopped container id: 1bc2292d-139d-45ea-ad5c-d261d06c4847  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:39] found container terminated: 1bc2292d-139d-45ea-ad5c-d261d06c4847  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] killing container id: 1bc2292d-139d-45ea-ad5c-d261d06c4847  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:39] unexpected failure of trial after restart 5/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] trial completed workload: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] exiting trial early: 0xc000d32040  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:39] error shutting down actor  error="trial 1 failed and reached maximum number of restarts" experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<error>   [2021-04-28, 08:20:39] trial failed unexpectedly  error="trial 1 failed and reached maximum number of restarts" id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:39] experiment state changed to STOPPING_ERROR  id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:39] experiment state changed to ERROR  id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:39] resources are requested by /experiment-1-checkpoint-gc (Task ID: 16f11e76-f209-4ab2-8c79-045bbfd6ccf9)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] experiment shut down successfully  id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:40] allocated resources to /experiment-1-checkpoint-gc  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:40] starting checkpoint garbage collection  id="experiment-1-checkpoint-gc" system="master" type="checkpointGCTask"
<info>    [2021-04-28, 08:20:40] starting container id: 47011443-1146-4a7e-a138-e94a2115aa66 slots: 0 task handler: /experiment-1-checkpoint-gc  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:44] stopped container id: 47011443-1146-4a7e-a138-e94a2115aa66  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:44] finished checkpoint garbage collection  id="experiment-1-checkpoint-gc" system="master" type="checkpointGCTask"
<info>    [2021-04-28, 08:20:44] resources are released for /experiment-1-checkpoint-gc  id="default" resource-pool="default" system="master" type="ResourcePool"
<error>   [2021-04-28, 08:49:47] error while actor was running  error="ping 11681f0e-e60d-4706-9c13-571690c5d88e did not receive pong response by 2021-04-28 08:49:47.642705858 +0000 UTC m=+1866.505227932" id="websocket-00617727-e0b6-4874-9b0e-b933d51c1ecb" system="master" type="websocketActor"
<error>   [2021-04-28, 08:49:47] ping 11681f0e-e60d-4706-9c13-571690c5d88e did not receive pong response by 2021-04-28 08:49:47.642705858 +0000 UTC m=+1866.505227932
<error>   [2021-04-28, 08:49:47] error while actor was running  error="child failed: /agents/determined-agent-0/websocket-00617727-e0b6-4874-9b0e-b933d51c1ecb: ping 11681f0e-e60d-4706-9c13-571690c5d88e did not receive pong response by 2021-04-28 08:49:47.642705858 +0000 UTC m=+1866.505227932" id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:49:47] http: connection has been hijacked
<info>    [2021-04-28, 08:49:47] agent disconnected  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:49:47] removing device: cpu0 (Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz x 1 physical cores) (determined-agent-0)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:49:47] removing agent: determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 09:01:37] resource pool is empty; using default resource pool: default  id="agents" system="master" type="agents"
<info>    [2021-04-28, 09:01:38] agent connected ip: 192.168.100.30 resource pool: default slots: 1  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 09:01:38] adding agent: determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 09:01:38] adding device: cpu0 (Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz x 1 physical cores) on determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] trial completed workload: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] exiting trial early: 0xc000d32040  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:39] error shutting down actor  error="trial 1 failed and reached maximum number of restarts" experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<error>   [2021-04-28, 08:20:39] trial failed unexpectedly  error="trial 1 failed and reached maximum number of restarts" id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:39] experiment state changed to STOPPING_ERROR  id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:39] experiment state changed to ERROR  id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:39] resources are requested by /experiment-1-checkpoint-gc (Task ID: 16f11e76-f209-4ab2-8c79-045bbfd6ccf9)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] experiment shut down successfully  id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:40] allocated resources to /experiment-1-checkpoint-gc  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:40] starting checkpoint garbage collection  id="experiment-1-checkpoint-gc" system="master" type="checkpointGCTask"
<info>    [2021-04-28, 08:20:40] starting container id: 47011443-1146-4a7e-a138-e94a2115aa66 slots: 0 task handler: /experiment-1-checkpoint-gc  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:44] stopped container id: 47011443-1146-4a7e-a138-e94a2115aa66  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:44] finished checkpoint garbage collection  id="experiment-1-checkpoint-gc" system="master" type="checkpointGCTask"
<info>    [2021-04-28, 08:20:44] resources are released for /experiment-1-checkpoint-gc  id="default" resource-pool="default" system="master" type="ResourcePool"
<error>   [2021-04-28, 08:49:47] error while actor was running  error="ping 11681f0e-e60d-4706-9c13-571690c5d88e did not receive pong response by 2021-04-28 08:49:47.642705858 +0000 UTC m=+1866.505227932" id="websocket-00617727-e0b6-4874-9b0e-b933d51c1ecb" system="master" type="websocketActor"
<error>   [2021-04-28, 08:49:47] ping 11681f0e-e60d-4706-9c13-571690c5d88e did not receive pong response by 2021-04-28 08:49:47.642705858 +0000 UTC m=+1866.505227932
<error>   [2021-04-28, 08:49:47] error while actor was running  error="child failed: /agents/determined-agent-0/websocket-00617727-e0b6-4874-9b0e-b933d51c1ecb: ping 11681f0e-e60d-4706-9c13-571690c5d88e did not receive pong response by 2021-04-28 08:49:47.642705858 +0000 UTC m=+1866.505227932" id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:49:47] http: connection has been hijacked
<info>    [2021-04-28, 08:49:47] agent disconnected  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:49:47] removing device: cpu0 (Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz x 1 physical cores) (determined-agent-0)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:49:47] removing agent: determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 09:01:37] resource pool is empty; using default resource pool: default  id="agents" system="master" type="agents"
<info>    [2021-04-28, 09:01:38] agent connected ip: 192.168.100.30 resource pool: default slots: 1  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 09:01:38] adding agent: determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 09:01:38] adding device: cpu0 (Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz x 1 physical cores) on determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
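
Note on the `found 4 rendezvous addresses instead of 2` error above: the `Got rendezvous information` lines in the trial logs list each container port (1734 and 1750) twice, once bound to the host address `192.168.100.30` and once bound to `::`. That would explain the count of 4, and since the master says it is dropping the addresses, it would also explain why `addrs` arrives as `None` in the trial container. The sketch below only illustrates that reading of the logs. The helper names are hypothetical and are not part of Determined, and the address entries are copied from the first trial attempt.

```python
from collections import defaultdict

# Address entries copied from the first "Got rendezvous information" log line.
addresses = [
    {"container_ip": "172.17.0.2", "container_port": 1734,
     "host_ip": "192.168.100.30", "host_port": 49193},
    {"container_ip": "172.17.0.2", "container_port": 1734,
     "host_ip": "::", "host_port": 49190},
    {"container_ip": "172.17.0.2", "container_port": 1750,
     "host_ip": "192.168.100.30", "host_port": 49192},
    {"container_ip": "172.17.0.2", "container_port": 1750,
     "host_ip": "::", "host_port": 49189},
]


def group_by_container_port(addrs):
    """Hypothetical helper: map each container port to the host IPs bound to it."""
    grouped = defaultdict(list)
    for addr in addrs:
        grouped[addr["container_port"]].append(addr["host_ip"])
    return dict(grouped)


def drop_ipv6_wildcard(addrs):
    """Hypothetical filter: keep one binding per port by skipping the '::' entries."""
    return [addr for addr in addrs if addr["host_ip"] != "::"]


grouped = group_by_container_port(addresses)
ipv4_only = drop_ipv6_wildcard(addresses)

# Two container ports, each published on two host bindings, gives four entries.
print(grouped)
print(len(addresses), len(ipv4_only))
```

With the `::` bindings skipped, two addresses remain, which matches the count the master expects. Whether filtering like this is the right fix, or whether the duplicate bindings come from the Docker version on the host, is not something these logs can confirm.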

3. Trial logs

[2021-04-28T08:20:04Z] 63a14abe || INFO: image already found, skipping pull phase: docker.io/determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40
[2021-04-28T08:20:04Z] 63a14abe || INFO: copying files to container: /
[2021-04-28T08:20:04Z] 63a14abe || INFO: copying files to container: /
[2021-04-28T08:20:04Z] 63a14abe || INFO: copying files to container: /
[2021-04-28T08:20:04Z] 63a14abe || INFO: copying files to container: /
[2021-04-28T08:20:04Z] 63a14abe || INFO: copying files to container: /
[2021-04-28T08:20:05Z] 63a14abe || INFO: copying files to container: /
[2021-04-28T08:20:05Z] 63a14abe || INFO: copying files to container: /run/determined/train
[2021-04-28T08:20:05Z] 63a14abe || INFO: copying files to container: /run/determined/train/model
[2021-04-28T08:20:05Z] 63a14abe || INFO: copying files to container: /run/determined/workdir
[2021-04-28T08:20:06Z] 63a14abe || + WORKING_DIR=/run/determined/workdir
[2021-04-28T08:20:06Z] 63a14abe || + STARTUP_HOOK=startup-hook.sh
[2021-04-28T08:20:06Z] 63a14abe || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:06Z] 63a14abe || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:06Z] 63a14abe || + '[' -z '' ']'
[2021-04-28T08:20:06Z] 63a14abe || + export DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:06Z] 63a14abe || + DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:06Z] 63a14abe || + /bin/which python3
[2021-04-28T08:20:06Z] 63a14abe || + '[' /root = / ']'
[2021-04-28T08:20:06Z] 63a14abe || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.1-py3-none-any.whl
[2021-04-28T08:20:08Z] 63a14abe || ERROR: determined 0.15.1 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-04-28T08:20:08Z] 63a14abe || + cd /run/determined/workdir
[2021-04-28T08:20:08Z] 63a14abe || + test -f startup-hook.sh
[2021-04-28T08:20:08Z] 63a14abe || + exec python3 -m determined.exec.harness
[2021-04-28T08:20:09Z] 63a14abe || INFO: New trial runner in (container 63a14abe-fc1f-4c44-8583-d2fb6f9c6136) on agent determined-agent-0: {'master_addr': '192.168.100.30', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '63a14abe-fc1f-4c44-8583-d2fb6f9c6136', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/home/shihk/.local/share/determined', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'storage_path': 'determined-checkpoint', 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': 'default'}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 
'pod_spec': None}, 'reproducibility': {'experiment_seed': 1619598003}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 0, 'container_gpus': [], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '7ae8dca8-1748-4f46-9d54-860ca6dc3a74', 'trial_seed': 829893716, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-04-28T08:20:09Z] 63a14abe || INFO: Connecting to master at ws://192.168.100.30:8080/ws/trial/1/1/63a14abe-fc1f-4c44-8583-d2fb6f9c6136
[2021-04-28T08:20:09Z] 63a14abe || INFO: Connected to master
[2021-04-28T08:20:09Z] 63a14abe || INFO: Established WebSocket session with master
[2021-04-28T08:20:09Z] 63a14abe || INFO: Got rendezvous information: {'addrs': None, 'addrs2': None, 'containers': [{'addresses': [{'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '192.168.100.30', 'host_port': 49193}, {'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '::', 'host_port': 49190}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '192.168.100.30', 'host_port': 49192}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '::', 'host_port': 49189}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-04-28T08:20:09Z] 63a14abe || Traceback (most recent call last):
[2021-04-28T08:20:09Z] 63a14abe ||   File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-04-28T08:20:09Z] 63a14abe ||     "__main__", mod_spec)
[2021-04-28T08:20:09Z] 63a14abe ||   File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-04-28T08:20:09Z] 63a14abe ||     exec(code, run_globals)
[2021-04-28T08:20:09Z] 63a14abe ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-04-28T08:20:09Z] 63a14abe ||     main()
[2021-04-28T08:20:09Z] 63a14abe ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-04-28T08:20:09Z] 63a14abe ||     build_and_run_training_pipeline(env)
[2021-04-28T08:20:09Z] 63a14abe ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 89, in build_and_run_training_pipeline
[2021-04-28T08:20:09Z] 63a14abe ||     with layers.SocketManager(env) as socket_mgr:
[2021-04-28T08:20:09Z] 63a14abe ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 63, in __init__
[2021-04-28T08:20:09Z] 63a14abe ||     ri = self.check_for_rendezvous_info(ws_event)
[2021-04-28T08:20:09Z] 63a14abe ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 153, in check_for_rendezvous_info
[2021-04-28T08:20:09Z] 63a14abe ||     addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}"
[2021-04-28T08:20:09Z] 63a14abe || TypeError: 'NoneType' object does not support item assignment
[2021-04-28T08:20:09Z] 63a14abe || WARNING: disconnecting websocket
[2021-04-28T08:20:09Z] 63a14abe || INFO: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)
[2021-04-28T08:20:10Z] d73b67ac || INFO: image already found, skipping pull phase: docker.io/determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40
[2021-04-28T08:20:10Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:10Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:10Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:10Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:10Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:11Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:11Z] d73b67ac || INFO: copying files to container: /run/determined/train
[2021-04-28T08:20:11Z] d73b67ac || INFO: copying files to container: /run/determined/train/model
[2021-04-28T08:20:11Z] d73b67ac || INFO: copying files to container: /run/determined/workdir
[2021-04-28T08:20:12Z] d73b67ac || + WORKING_DIR=/run/determined/workdir
[2021-04-28T08:20:12Z] d73b67ac || + STARTUP_HOOK=startup-hook.sh
[2021-04-28T08:20:12Z] d73b67ac || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:12Z] d73b67ac || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:12Z] d73b67ac || + '[' -z '' ']'
[2021-04-28T08:20:12Z] d73b67ac || + export DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:12Z] d73b67ac || + DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:12Z] d73b67ac || + /bin/which python3
[2021-04-28T08:20:12Z] d73b67ac || + '[' /root = / ']'
[2021-04-28T08:20:12Z] d73b67ac || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.1-py3-none-any.whl
[2021-04-28T08:20:14Z] d73b67ac || ERROR: determined 0.15.1 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-04-28T08:20:14Z] d73b67ac || + cd /run/determined/workdir
[2021-04-28T08:20:14Z] d73b67ac || + test -f startup-hook.sh
[2021-04-28T08:20:14Z] d73b67ac || + exec python3 -m determined.exec.harness
[2021-04-28T08:20:15Z] d73b67ac || INFO: New trial runner in (container d73b67ac-1500-4037-8bc0-066701af1efa) on agent determined-agent-0: {'master_addr': '192.168.100.30', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'd73b67ac-1500-4037-8bc0-066701af1efa', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/home/shihk/.local/share/determined', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'storage_path': 'determined-checkpoint', 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': 'default'}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 
'pod_spec': None}, 'reproducibility': {'experiment_seed': 1619598003}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 0, 'container_gpus': [], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '7ae8dca8-1748-4f46-9d54-860ca6dc3a74', 'trial_seed': 829893716, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-04-28T08:20:15Z] d73b67ac || INFO: Connecting to master at ws://192.168.100.30:8080/ws/trial/1/1/d73b67ac-1500-4037-8bc0-066701af1efa
[2021-04-28T08:20:15Z] d73b67ac || INFO: Connected to master
[2021-04-28T08:20:15Z] d73b67ac || INFO: Established WebSocket session with master
[2021-04-28T08:20:15Z] d73b67ac || INFO: Got rendezvous information: {'addrs': None, 'addrs2': None, 'containers': [{'addresses': [{'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '192.168.100.30', 'host_port': 49195}, {'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '::', 'host_port': 49192}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '192.168.100.30', 'host_port': 49194}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '::', 'host_port': 49191}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-04-28T08:20:15Z] d73b67ac || Traceback (most recent call last):
[2021-04-28T08:20:15Z] d73b67ac ||   File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-04-28T08:20:15Z] d73b67ac ||     "__main__", mod_spec)
[2021-04-28T08:20:15Z] d73b67ac ||   File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-04-28T08:20:15Z] d73b67ac ||     exec(code, run_globals)
[2021-04-28T08:20:15Z] d73b67ac ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-04-28T08:20:15Z] d73b67ac ||     main()
[2021-04-28T08:20:15Z] d73b67ac ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-04-28T08:20:15Z] d73b67ac ||     build_and_run_training_pipeline(env)
[2021-04-28T08:20:15Z] d73b67ac ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 89, in build_and_run_training_pipeline
[2021-04-28T08:20:15Z] d73b67ac ||     with layers.SocketManager(env) as socket_mgr:
[2021-04-28T08:20:15Z] d73b67ac ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 63, in __init__
[2021-04-28T08:20:15Z] d73b67ac ||     ri = self.check_for_rendezvous_info(ws_event)
[2021-04-28T08:20:15Z] d73b67ac ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 153, in check_for_rendezvous_info
[2021-04-28T08:20:15Z] d73b67ac ||     addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}"
[2021-04-28T08:20:15Z] d73b67ac || TypeError: 'NoneType' object does not support item assignment
[2021-04-28T08:20:15Z] d73b67ac || WARNING: disconnecting websocket
[2021-04-28T08:20:15Z] d73b67ac || INFO: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)
[2021-04-28T08:20:16Z] b8fb8292 || INFO: image already found, skipping pull phase: docker.io/determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40
[2021-04-28T08:20:16Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:16Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:16Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:16Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:16Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:17Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:17Z] b8fb8292 || INFO: copying files to container: /run/determined/train
[2021-04-28T08:20:17Z] b8fb8292 || INFO: copying files to container: /run/determined/train/model
[2021-04-28T08:20:17Z] b8fb8292 || INFO: copying files to container: /run/determined/workdir
[2021-04-28T08:20:18Z] b8fb8292 || + WORKING_DIR=/run/determined/workdir
[2021-04-28T08:20:18Z] b8fb8292 || + STARTUP_HOOK=startup-hook.sh
[2021-04-28T08:20:18Z] b8fb8292 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:18Z] b8fb8292 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:18Z] b8fb8292 || + '[' -z '' ']'
[2021-04-28T08:20:18Z] b8fb8292 || + export DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:18Z] b8fb8292 || + DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:18Z] b8fb8292 || + /bin/which python3
[2021-04-28T08:20:18Z] b8fb8292 || + '[' /root = / ']'
[2021-04-28T08:20:18Z] b8fb8292 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.1-py3-none-any.whl
[2021-04-28T08:20:20Z] b8fb8292 || ERROR: determined 0.15.1 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-04-28T08:20:20Z] b8fb8292 || + cd /run/determined/workdir
[2021-04-28T08:20:20Z] b8fb8292 || + test -f startup-hook.sh
[2021-04-28T08:20:20Z] b8fb8292 || + exec python3 -m determined.exec.harness
[2021-04-28T08:20:21Z] b8fb8292 || INFO: New trial runner in (container b8fb8292-b905-48a2-951d-817b640b4ab6) on agent determined-agent-0: {'master_addr': '192.168.100.30', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'b8fb8292-b905-48a2-951d-817b640b4ab6', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/home/shihk/.local/share/determined', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'storage_path': 'determined-checkpoint', 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': 'default'}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 
'pod_spec': None}, 'reproducibility': {'experiment_seed': 1619598003}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 0, 'container_gpus': [], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '7ae8dca8-1748-4f46-9d54-860ca6dc3a74', 'trial_seed': 829893716, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-04-28T08:20:21Z] b8fb8292 || INFO: Connecting to master at ws://192.168.100.30:8080/ws/trial/1/1/b8fb8292-b905-48a2-951d-817b640b4ab6
[2021-04-28T08:20:21Z] b8fb8292 || INFO: Connected to master
[2021-04-28T08:20:21Z] b8fb8292 || INFO: Established WebSocket session with master
[2021-04-28T08:20:21Z] b8fb8292 || INFO: Got rendezvous information: {'addrs': None, 'addrs2': None, 'containers': [{'addresses': [{'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '192.168.100.30', 'host_port': 49197}, {'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '::', 'host_port': 49194}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '192.168.100.30', 'host_port': 49196}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '::', 'host_port': 49193}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-04-28T08:20:21Z] b8fb8292 || Traceback (most recent call last):
[2021-04-28T08:20:21Z] b8fb8292 ||   File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-04-28T08:20:21Z] b8fb8292 ||     "__main__", mod_spec)
[2021-04-28T08:20:21Z] b8fb8292 ||   File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-04-28T08:20:21Z] b8fb8292 ||     exec(code, run_globals)
[2021-04-28T08:20:21Z] b8fb8292 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-04-28T08:20:21Z] b8fb8292 ||     main()
[2021-04-28T08:20:21Z] b8fb8292 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-04-28T08:20:21Z] b8fb8292 ||     build_and_run_training_pipeline(env)
[2021-04-28T08:20:21Z] b8fb8292 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 89, in build_and_run_training_pipeline
[2021-04-28T08:20:21Z] b8fb8292 ||     with layers.SocketManager(env) as socket_mgr:
[2021-04-28T08:20:21Z] b8fb8292 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 63, in __init__
[2021-04-28T08:20:21Z] b8fb8292 ||     ri = self.check_for_rendezvous_info(ws_event)
[2021-04-28T08:20:21Z] b8fb8292 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 153, in check_for_rendezvous_info
[2021-04-28T08:20:21Z] b8fb8292 ||     addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}"
[2021-04-28T08:20:21Z] b8fb8292 || TypeError: 'NoneType' object does not support item assignment
[2021-04-28T08:20:21Z] b8fb8292 || WARNING: disconnecting websocket
[2021-04-28T08:20:22Z] b8fb8292 || INFO: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)
[2021-04-28T08:20:22Z] 8fd8ae29 || INFO: image already found, skipping pull phase: docker.io/determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40
[2021-04-28T08:20:22Z] 8fd8ae29 || INFO: copying files to container: /
[2021-04-28T08:20:22Z] 8fd8ae29 || INFO: copying files to container: /
[2021-04-28T08:20:22Z] 8fd8ae29 || INFO: copying files to container: /
[2021-04-28T08:20:22Z] 8fd8ae29 || INFO: copying files to container: /
[2021-04-28T08:20:22Z] 8fd8ae29 || INFO: copying files to container: /
[2021-04-28T08:20:23Z] 8fd8ae29 || INFO: copying files to container: /
[2021-04-28T08:20:23Z] 8fd8ae29 || INFO: copying files to container: /run/determined/train
[2021-04-28T08:20:23Z] 8fd8ae29 || INFO: copying files to container: /run/determined/train/model
[2021-04-28T08:20:23Z] 8fd8ae29 || INFO: copying files to container: /run/determined/workdir
[2021-04-28T08:20:24Z] 8fd8ae29 || + WORKING_DIR=/run/determined/workdir
[2021-04-28T08:20:24Z] 8fd8ae29 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:24Z] 8fd8ae29 || + STARTUP_HOOK=startup-hook.sh
[2021-04-28T08:20:24Z] 8fd8ae29 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:24Z] 8fd8ae29 || + '[' -z '' ']'
[2021-04-28T08:20:24Z] 8fd8ae29 || + export DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:24Z] 8fd8ae29 || + DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:24Z] 8fd8ae29 || + /bin/which python3
[2021-04-28T08:20:24Z] 8fd8ae29 || + '[' /root = / ']'
[2021-04-28T08:20:24Z] 8fd8ae29 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.1-py3-none-any.whl
[2021-04-28T08:20:25Z] 8fd8ae29 || ERROR: determined 0.15.1 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-04-28T08:20:26Z] 8fd8ae29 || + cd /run/determined/workdir
[2021-04-28T08:20:26Z] 8fd8ae29 || + test -f startup-hook.sh
[2021-04-28T08:20:26Z] 8fd8ae29 || + exec python3 -m determined.exec.harness
[2021-04-28T08:20:27Z] 8fd8ae29 || INFO: New trial runner in (container 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4) on agent determined-agent-0: {'master_addr': '192.168.100.30', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/home/shihk/.local/share/determined', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'storage_path': 'determined-checkpoint', 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': 'default'}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 
'pod_spec': None}, 'reproducibility': {'experiment_seed': 1619598003}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 0, 'container_gpus': [], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '7ae8dca8-1748-4f46-9d54-860ca6dc3a74', 'trial_seed': 829893716, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-04-28T08:20:27Z] 8fd8ae29 || INFO: Connecting to master at ws://192.168.100.30:8080/ws/trial/1/1/8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4
[2021-04-28T08:20:27Z] 8fd8ae29 || INFO: Connected to master
[2021-04-28T08:20:27Z] 8fd8ae29 || INFO: Established WebSocket session with master
[2021-04-28T08:20:27Z] 8fd8ae29 || INFO: Got rendezvous information: {'addrs': None, 'addrs2': None, 'containers': [{'addresses': [{'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '192.168.100.30', 'host_port': 49199}, {'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '::', 'host_port': 49196}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '192.168.100.30', 'host_port': 49198}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '::', 'host_port': 49195}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-04-28T08:20:27Z] 8fd8ae29 || Traceback (most recent call last):
[2021-04-28T08:20:27Z] 8fd8ae29 ||   File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-04-28T08:20:27Z] 8fd8ae29 ||     "__main__", mod_spec)
[2021-04-28T08:20:27Z] 8fd8ae29 ||   File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-04-28T08:20:27Z] 8fd8ae29 ||     exec(code, run_globals)
[2021-04-28T08:20:27Z] 8fd8ae29 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-04-28T08:20:27Z] 8fd8ae29 ||     main()
[2021-04-28T08:20:27Z] 8fd8ae29 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-04-28T08:20:27Z] 8fd8ae29 ||     build_and_run_training_pipeline(env)
[2021-04-28T08:20:27Z] 8fd8ae29 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 89, in build_and_run_training_pipeline
[2021-04-28T08:20:27Z] 8fd8ae29 ||     with layers.SocketManager(env) as socket_mgr:
[2021-04-28T08:20:27Z] 8fd8ae29 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 63, in __init__
[2021-04-28T08:20:27Z] 8fd8ae29 ||     ri = self.check_for_rendezvous_info(ws_event)
[2021-04-28T08:20:27Z] 8fd8ae29 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 153, in check_for_rendezvous_info
[2021-04-28T08:20:27Z] 8fd8ae29 ||     addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}"
[2021-04-28T08:20:27Z] 8fd8ae29 || TypeError: 'NoneType' object does not support item assignment
[2021-04-28T08:20:27Z] 8fd8ae29 || WARNING: disconnecting websocket
[2021-04-28T08:20:27Z] 8fd8ae29 || INFO: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)
[2021-04-28T08:20:28Z] 1636009d || INFO: image already found, skipping pull phase: docker.io/determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40
[2021-04-28T08:20:28Z] 1636009d || INFO: copying files to container: /
[2021-04-28T08:20:28Z] 1636009d || INFO: copying files to container: /
[2021-04-28T08:20:28Z] 1636009d || INFO: copying files to container: /
[2021-04-28T08:20:28Z] 1636009d || INFO: copying files to container: /
[2021-04-28T08:20:28Z] 1636009d || INFO: copying files to container: /
[2021-04-28T08:20:29Z] 1636009d || INFO: copying files to container: /
[2021-04-28T08:20:29Z] 1636009d || INFO: copying files to container: /run/determined/train
[2021-04-28T08:20:29Z] 1636009d || INFO: copying files to container: /run/determined/train/model
[2021-04-28T08:20:29Z] 1636009d || INFO: copying files to container: /run/determined/workdir
[2021-04-28T08:20:30Z] 1636009d || + WORKING_DIR=/run/determined/workdir
[2021-04-28T08:20:30Z] 1636009d || + STARTUP_HOOK=startup-hook.sh
[2021-04-28T08:20:30Z] 1636009d || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:30Z] 1636009d || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:30Z] 1636009d || + '[' -z '' ']'
[2021-04-28T08:20:30Z] 1636009d || + export DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:30Z] 1636009d || + DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:30Z] 1636009d || + /bin/which python3
[2021-04-28T08:20:30Z] 1636009d || + '[' /root = / ']'
[2021-04-28T08:20:30Z] 1636009d || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.1-py3-none-any.whl
[2021-04-28T08:20:31Z] 1636009d || ERROR: determined 0.15.1 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-04-28T08:20:32Z] 1636009d || + cd /run/determined/workdir
[2021-04-28T08:20:32Z] 1636009d || + test -f startup-hook.sh
[2021-04-28T08:20:32Z] 1636009d || + exec python3 -m determined.exec.harness
[2021-04-28T08:20:33Z] 1636009d || INFO: New trial runner in (container 1636009d-9875-4cf4-91f4-3fe617fba5b7) on agent determined-agent-0: {'master_addr': '192.168.100.30', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '1636009d-9875-4cf4-91f4-3fe617fba5b7', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/home/shihk/.local/share/determined', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'storage_path': 'determined-checkpoint', 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': 'default'}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 
'pod_spec': None}, 'reproducibility': {'experiment_seed': 1619598003}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 0, 'container_gpus': [], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '7ae8dca8-1748-4f46-9d54-860ca6dc3a74', 'trial_seed': 829893716, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-04-28T08:20:33Z] 1636009d || INFO: Connecting to master at ws://192.168.100.30:8080/ws/trial/1/1/1636009d-9875-4cf4-91f4-3fe617fba5b7
[2021-04-28T08:20:33Z] 1636009d || INFO: Connected to master
[2021-04-28T08:20:33Z] 1636009d || INFO: Established WebSocket session with master
[2021-04-28T08:20:33Z] 1636009d || INFO: Got rendezvous information: {'addrs': None, 'addrs2': None, 'containers': [{'addresses': [{'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '192.168.100.30', 'host_port': 49201}, {'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '::', 'host_port': 49198}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '192.168.100.30', 'host_port': 49200}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '::', 'host_port': 49197}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-04-28T08:20:33Z] 1636009d || Traceback (most recent call last):
[2021-04-28T08:20:33Z] 1636009d ||   File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-04-28T08:20:33Z] 1636009d ||     "__main__", mod_spec)
[2021-04-28T08:20:33Z] 1636009d ||   File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-04-28T08:20:33Z] 1636009d ||     exec(code, run_globals)
[2021-04-28T08:20:33Z] 1636009d ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-04-28T08:20:33Z] 1636009d ||     main()
[2021-04-28T08:20:33Z] 1636009d ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-04-28T08:20:33Z] 1636009d ||     build_and_run_training_pipeline(env)
[2021-04-28T08:20:33Z] 1636009d ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 89, in build_and_run_training_pipeline
[2021-04-28T08:20:33Z] 1636009d ||     with layers.SocketManager(env) as socket_mgr:
[2021-04-28T08:20:33Z] 1636009d ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 63, in __init__
[2021-04-28T08:20:33Z] 1636009d ||     ri = self.check_for_rendezvous_info(ws_event)
[2021-04-28T08:20:33Z] 1636009d ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 153, in check_for_rendezvous_info
[2021-04-28T08:20:33Z] 1636009d ||     addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}"
[2021-04-28T08:20:33Z] 1636009d || TypeError: 'NoneType' object does not support item assignment
[2021-04-28T08:20:33Z] 1636009d || WARNING: disconnecting websocket
[2021-04-28T08:20:33Z] 1636009d || INFO: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)
[2021-04-28T08:20:34Z] 1bc2292d || INFO: image already found, skipping pull phase: docker.io/determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40
[2021-04-28T08:20:34Z] 1bc2292d || INFO: copying files to container: /
[2021-04-28T08:20:34Z] 1bc2292d || INFO: copying files to container: /
[2021-04-28T08:20:34Z] 1bc2292d || INFO: copying files to container: /
[2021-04-28T08:20:34Z] 1bc2292d || INFO: copying files to container: /
[2021-04-28T08:20:34Z] 1bc2292d || INFO: copying files to container: /
[2021-04-28T08:20:35Z] 1bc2292d || INFO: copying files to container: /
[2021-04-28T08:20:35Z] 1bc2292d || INFO: copying files to container: /run/determined/train
[2021-04-28T08:20:35Z] 1bc2292d || INFO: copying files to container: /run/determined/train/model
[2021-04-28T08:20:35Z] 1bc2292d || INFO: copying files to container: /run/determined/workdir
[2021-04-28T08:20:36Z] 1bc2292d || + WORKING_DIR=/run/determined/workdir
[2021-04-28T08:20:36Z] 1bc2292d || + STARTUP_HOOK=startup-hook.sh
[2021-04-28T08:20:36Z] 1bc2292d || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:36Z] 1bc2292d || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:36Z] 1bc2292d || + '[' -z '' ']'
[2021-04-28T08:20:36Z] 1bc2292d || + export DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:36Z] 1bc2292d || + DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:36Z] 1bc2292d || + /bin/which python3
[2021-04-28T08:20:36Z] 1bc2292d || + '[' /root = / ']'
[2021-04-28T08:20:36Z] 1bc2292d || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.1-py3-none-any.whl
[2021-04-28T08:20:38Z] 1bc2292d || ERROR: determined 0.15.1 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-04-28T08:20:38Z] 1bc2292d || + cd /run/determined/workdir
[2021-04-28T08:20:38Z] 1bc2292d || + test -f startup-hook.sh
[2021-04-28T08:20:38Z] 1bc2292d || + exec python3 -m determined.exec.harness
[2021-04-28T08:20:39Z] 1bc2292d || INFO: New trial runner in (container 1bc2292d-139d-45ea-ad5c-d261d06c4847) on agent determined-agent-0: {'master_addr': '192.168.100.30', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '1bc2292d-139d-45ea-ad5c-d261d06c4847', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/home/shihk/.local/share/determined', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'storage_path': 'determined-checkpoint', 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': 'default'}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 
'pod_spec': None}, 'reproducibility': {'experiment_seed': 1619598003}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 0, 'container_gpus': [], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '7ae8dca8-1748-4f46-9d54-860ca6dc3a74', 'trial_seed': 829893716, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-04-28T08:20:39Z] 1bc2292d || INFO: Connecting to master at ws://192.168.100.30:8080/ws/trial/1/1/1bc2292d-139d-45ea-ad5c-d261d06c4847
[2021-04-28T08:20:39Z] 1bc2292d || INFO: Connected to master
[2021-04-28T08:20:39Z] 1bc2292d || INFO: Established WebSocket session with master
[2021-04-28T08:20:39Z] 1bc2292d || INFO: Got rendezvous information: {'addrs': None, 'addrs2': None, 'containers': [{'addresses': [{'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '192.168.100.30', 'host_port': 49203}, {'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '::', 'host_port': 49200}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '192.168.100.30', 'host_port': 49202}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '::', 'host_port': 49199}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-04-28T08:20:39Z] 1bc2292d || Traceback (most recent call last):
[2021-04-28T08:20:39Z] 1bc2292d ||   File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-04-28T08:20:39Z] 1bc2292d ||     "__main__", mod_spec)
[2021-04-28T08:20:39Z] 1bc2292d ||   File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-04-28T08:20:39Z] 1bc2292d ||     exec(code, run_globals)
[2021-04-28T08:20:39Z] 1bc2292d ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-04-28T08:20:39Z] 1bc2292d ||     main()
[2021-04-28T08:20:39Z] 1bc2292d ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-04-28T08:20:39Z] 1bc2292d ||     build_and_run_training_pipeline(env)
[2021-04-28T08:20:39Z] 1bc2292d ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 89, in build_and_run_training_pipeline
[2021-04-28T08:20:39Z] 1bc2292d ||     with layers.SocketManager(env) as socket_mgr:
[2021-04-28T08:20:39Z] 1bc2292d ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 63, in __init__
[2021-04-28T08:20:39Z] 1bc2292d ||     ri = self.check_for_rendezvous_info(ws_event)
[2021-04-28T08:20:39Z] 1bc2292d ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 153, in check_for_rendezvous_info
[2021-04-28T08:20:39Z] 1bc2292d ||     addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}"
[2021-04-28T08:20:39Z] 1bc2292d || TypeError: 'NoneType' object does not support item assignment
[2021-04-28T08:20:39Z] 1bc2292d || WARNING: disconnecting websocket
[2021-04-28T08:20:39Z] 1bc2292d || INFO: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)
Trial log stream ended. To reopen log stream, run: det trial logs -f 1

The variable rendezvous_ports in env does not seem to be set correctly?
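
For reference, the rendezvous message logged above has 'addrs': None, while 'det_rendezvous_ports' is '1734,1750'. Below is a minimal sketch of how that combination could produce this TypeError. It is a hypothetical reconstruction for illustration, not the actual code in _socket_manager.py; the helper names parse_ports and set_local_addr are assumptions, and only the failing assignment is taken from the traceback.

```python
# Hypothetical reconstruction; NOT the actual determined source code.
from typing import Any, Dict, List


def parse_ports(det_rendezvous_ports: str) -> List[int]:
    """Split a comma-separated port string such as '1734,1750'."""
    return [int(p) for p in det_rendezvous_ports.split(",")]


def set_local_addr(info: Dict[str, Any], ports: List[int]) -> Dict[str, Any]:
    """Mimic the failing line from the traceback."""
    addrs = info["addrs"]
    rank = info["rank"]
    # Same statement as in the traceback; raises TypeError when addrs is None.
    addrs[rank] = f"0.0.0.0:{ports[0]}"
    return info


# Values copied from the trial log above ('containers' omitted for brevity).
rendezvous_info = {"addrs": None, "addrs2": None, "rank": 0, "type": "RENDEZVOUS_INFO"}
rendezvous_ports = parse_ports("1734,1750")

try:
    set_local_addr(rendezvous_info, rendezvous_ports)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(rendezvous_ports)
print(error_message)
```

If this sketch matches what the harness does, the ports parse without problems and the None value is 'addrs' in the rendezvous information sent by the master, rather than rendezvous_ports.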

@vishnu2kmohan
Contributor

@caoyu1664 It's the same issue that you reported earlier, which applies even to det deploy cluster-up --no-gpu.

@caoyu1664
Author

@vishnu2kmohan Thanks, please close this issue :)
