Hi, the container raises TypeError: 'NoneType' object does not support item assignment in _socket_manager.py while the trial is running.
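For context, Python raises this message when code assigns into an index or key of a value that is None. The sketch below reproduces the same exception in isolation. The `info` dictionary, the `addrs` key, and the address string are hypothetical and are not taken from _socket_manager.py. Whether the "dropping rendezvous addresses" error in the master log is what leaves the value as None is an assumption, not something confirmed here.

```python
def set_address(info, rank, addr):
    # Hypothetical helper: writes one address into a list stored under "addrs".
    # If info["addrs"] is None, the subscript assignment raises TypeError.
    info["addrs"][rank] = addr
    return info


def try_set_address(info, rank, addr):
    # Returns (result, None) on success or (None, error message) on TypeError.
    try:
        return set_address(info, rank, addr), None
    except TypeError as exc:
        return None, str(exc)


# A populated list accepts the assignment.
ok, _ = try_set_address({"addrs": [None]}, 0, "10.0.0.1:1734")

# A missing (None) list reproduces the reported error class and message.
failed, message = try_set_address({"addrs": None}, 0, "10.0.0.1:1734")
print(message)
```

The full traceback from the trial container logs would show which object is None in the real code path.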
Steps to reproduce (official mnist_pytorch example):
pip install determined
det deploy local cluster-up --no-gpu
det experiment create const.yaml .
<info> [2021-04-28, 08:18:41] master configuration: {"config_file":"","log":{"level":"info","color":true},"db":{"user":"postgres","password":"********","migrations":"file:///usr/share/determined/master/static/migrations","host":"determined-db","port":"5432","name":"determined","ssl_mode":"disable","ssl_root_cert":""},"tensorboard_timeout":300,"security":{"default_task":{"id":0,"user_id":0,"user":"root","uid":0,"group":"root","gid":0},"tls":{"cert":"","key":""}},"checkpoint_storage":{"host_path":"/home/shihk/.local/share/determined","save_experiment_best":0,"save_trial_best":1,"save_trial_latest":1,"storage_path":"determined-checkpoint","type":"shared_fs"},"task_container_defaults":{"shm_size_bytes":4294967296,"network_mode":"bridge","cpu_pod_spec":null,"gpu_pod_spec":null},"port":8080,"harness_path":"/opt/determined","root":"/usr/share/determined/master","telemetry":{"enabled":true,"segment_master_key":"********","segment_webui_key":"********"},"enable_cors":false,"cluster_name":"","logging":{"type":"default"},"hyperparameter_importance":{"workers_limit":2,"queue_limit":16,"cores_per_worker":1,"max_trees":100},"resource_manager":{"default_cpu_resource_pool":"default","default_gpu_resource_pool":"default","scheduler":{"fitting_policy":"best","type":"fair_share"},"type":"agent"},"resource_pools":[{"pool_name":"default","description":"","provider":null,"max_cpu_containers_per_agent":100}]} <info> [2021-04-28, 08:18:41] Determined master 0.15.1 (built with go1.16.3) <info> [2021-04-28, 08:18:41] connecting to database determined-db:5432 <warning> [2021-04-28, 08:18:45] failed to connect to postgres, trying again in 4s error="failed to connect to `host=determined-db user=postgres database=determined`: dial error (dial tcp 172.23.0.2:5432: connect: connection refused)" <info> [2021-04-28, 08:18:45] running migrations from file:///usr/share/determined/master/static/migrations <info> [2021-04-28, 08:18:45] unable to find golang-migrate version <info> [2021-04-28, 08:18:45] 
deleting all snapshots for terminal state experiments <info> [2021-04-28, 08:18:45] creating resource pool: default id="agentRM" system="master" type="agentResourceManager" <info> [2021-04-28, 08:18:45] pool default using global scheduling config id="agentRM" system="master" type="agentResourceManager" <info> [2021-04-28, 08:18:45] initializing endpoints for agents <info> [2021-04-28, 08:18:45] not enabling provisioner for resource pool: default id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:18:45] scheduling next resource allocation aggregation in 15h42m14s at 2021-04-29 00:01:00 +0000 UTC id="allocation-aggregator" system="master" type="allocationAggregator" <info> [2021-04-28, 08:18:45] telemetry reporting is enabled; run with `--telemetry-enabled=false` to disable <info> [2021-04-28, 08:18:45] accepting incoming connections on port 8080 <info> [2021-04-28, 08:18:45] Subchannel Connectivity change to READY system="system" <info> [2021-04-28, 08:18:45] pickfirstBalancer: HandleSubConnStateChange: 0xc000255d10, {READY <nil>} system="system" <info> [2021-04-28, 08:18:45] Channel Connectivity change to READY system="system" <info> [2021-04-28, 08:18:46] finished unary call with code Unauthenticated error="rpc error: code = Unauthenticated desc = invalid credentials" grpc.code="Unauthenticated" grpc.method="GetAgents" grpc.service="determined.api.v1.Determined" grpc.start_time="2021-04-28T08:18:46Z" grpc.time_ms="0.791" span.kind="server" system="grpc" <info> [2021-04-28, 08:18:46] finished unary call with code Unauthenticated error="rpc error: code = Unauthenticated desc = invalid credentials" grpc.code="Unauthenticated" grpc.method="Logout" grpc.service="determined.api.v1.Determined" grpc.start_time="2021-04-28T08:18:46Z" grpc.time_ms="0.323" span.kind="server" system="grpc" <info> [2021-04-28, 08:18:47] resource pool is empty; using default resource pool: default id="agents" system="master" type="agents" <info> 
[2021-04-28, 08:18:48] agent connected ip: 192.168.100.30 resource pool: default slots: 1 id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 08:18:48] adding agent: determined-agent-0 id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:18:48] adding device: cpu0 (Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz x 1 physical cores) on determined-agent-0 id="default" resource-pool="default" system="master" type="ResourcePool" <warning> [2021-04-28, 08:20:03] response already committed <info> [2021-04-28, 08:20:03] experiment state changed to ACTIVE id="1" system="master" type="experiment" <info> [2021-04-28, 08:20:03] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: 51871fd3-0ac4-4419-95ce-eee38508beff) id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:04] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:04] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)> experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:04] starting container id: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 08:20:06] found container running: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136 (rank 0) experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:06] pushing rendezvous information experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:06] found not all containers are connected experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" 
trial-id="1" type="trial" <info> [2021-04-28, 08:20:09] new connection from container 63a14abe-fc1f-4c44-8583-d2fb6f9c6136 trial 1 (experiment 1) at 172.23.0.1:43810 <info> [2021-04-28, 08:20:09] pushing rendezvous information experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:09] found all containers are connected successfully experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <error> [2021-04-28, 08:20:09] found 4 rendezvous addresses instead of 2 for container 63a14abe-fc1f-4c44-8583-d2fb6f9c6136; dropping rendezvous addresses experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <error> [2021-04-28, 08:20:09] error while actor was running error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-63a14abe-fc1f-4c44-8583-d2fb6f9c6136" system="master" type="websocketActor" <error> [2021-04-28, 08:20:09] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF <info> [2021-04-28, 08:20:09] found child actor failed, terminating forcibly experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:09] forcibly terminating trial experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:09] killing container id: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136 id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 08:20:09] stopped container id: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136 id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 08:20:09] found container terminated: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136 experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:09] forcibly terminating trial experiment-id="1" 
id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:09] killing container id: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136 id="determined-agent-0" system="master" type="agent" <error> [2021-04-28, 08:20:09] unexpected failure of trial after restart 0/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137) experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:09] resetting trial 1 experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:09] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:09] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: d997c647-2a3c-497c-8cd0-7ac7dd468634) id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:10] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:10] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)> experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:10] starting container id: d73b67ac-1500-4037-8bc0-066701af1efa slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 08:20:12] found container running: d73b67ac-1500-4037-8bc0-066701af1efa (rank 0) experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:12] pushing rendezvous information experiment-id="1" 
id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:12] found not all containers are connected experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:15] new connection from container d73b67ac-1500-4037-8bc0-066701af1efa trial 1 (experiment 1) at 172.23.0.1:43858 <info> [2021-04-28, 08:20:15] pushing rendezvous information experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:15] found all containers are connected successfully experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <error> [2021-04-28, 08:20:15] found 4 rendezvous addresses instead of 2 for container d73b67ac-1500-4037-8bc0-066701af1efa; dropping rendezvous addresses experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <error> [2021-04-28, 08:20:15] error while actor was running error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-d73b67ac-1500-4037-8bc0-066701af1efa" system="master" type="websocketActor" <error> [2021-04-28, 08:20:15] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF <info> [2021-04-28, 08:20:15] found child actor failed, terminating forcibly experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:15] forcibly terminating trial experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:15] killing container id: d73b67ac-1500-4037-8bc0-066701af1efa id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 08:20:15] stopped container id: d73b67ac-1500-4037-8bc0-066701af1efa id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 08:20:15] found container 
terminated: d73b67ac-1500-4037-8bc0-066701af1efa experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:15] forcibly terminating trial experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:15] killing container id: d73b67ac-1500-4037-8bc0-066701af1efa id="determined-agent-0" system="master" type="agent" <error> [2021-04-28, 08:20:15] unexpected failure of trial after restart 1/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137) experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:15] resetting trial 1 experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:15] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:15] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: a8cc64d8-4902-4c37-97c6-00ca352a5bde) id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:16] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:16] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)> experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:16] starting container id: b8fb8292-b905-48a2-951d-817b640b4ab6 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 08:20:18] found container running: b8fb8292-b905-48a2-951d-817b640b4ab6 
(rank 0) experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:18] pushing rendezvous information experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:18] found not all containers are connected experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:21] new connection from container b8fb8292-b905-48a2-951d-817b640b4ab6 trial 1 (experiment 1) at 172.23.0.1:43922 <info> [2021-04-28, 08:20:21] pushing rendezvous information experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:21] found all containers are connected successfully experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <error> [2021-04-28, 08:20:21] found 4 rendezvous addresses instead of 2 for container b8fb8292-b905-48a2-951d-817b640b4ab6; dropping rendezvous addresses experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <error> [2021-04-28, 08:20:21] error while actor was running error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-b8fb8292-b905-48a2-951d-817b640b4ab6" system="master" type="websocketActor" <error> [2021-04-28, 08:20:21] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF <info> [2021-04-28, 08:20:21] found child actor failed, terminating forcibly experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:21] forcibly terminating trial experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:21] killing container id: b8fb8292-b905-48a2-951d-817b640b4ab6 id="determined-agent-0" system="master" type="agent" <info> 
[2021-04-28, 08:20:22] stopped container id: b8fb8292-b905-48a2-951d-817b640b4ab6 id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 08:20:22] found container terminated: b8fb8292-b905-48a2-951d-817b640b4ab6 experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:22] forcibly terminating trial experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:22] killing container id: b8fb8292-b905-48a2-951d-817b640b4ab6 id="determined-agent-0" system="master" type="agent" <error> [2021-04-28, 08:20:22] unexpected failure of trial after restart 2/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137) experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:22] resetting trial 1 experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:22] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:22] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: a7f517a4-d7e2-46fd-be41-c271865069f0) id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:22] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:22] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)> experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:22] starting container id: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 slots: 1 task handler: 
/experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 08:20:24] found container running: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 (rank 0) experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:24] pushing rendezvous information experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:24] found not all containers are connected experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:27] new connection from container 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 trial 1 (experiment 1) at 172.23.0.1:43982 <info> [2021-04-28, 08:20:27] pushing rendezvous information experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:27] found all containers are connected successfully experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <error> [2021-04-28, 08:20:27] found 4 rendezvous addresses instead of 2 for container 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4; dropping rendezvous addresses experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <error> [2021-04-28, 08:20:27] error while actor was running error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4" system="master" type="websocketActor" <error> [2021-04-28, 08:20:27] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF <info> [2021-04-28, 08:20:27] found child actor failed, terminating forcibly experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:27] forcibly terminating trial experiment-id="1" 
id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:27] killing container id: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 08:20:27] stopped container id: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 08:20:27] found container terminated: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:27] forcibly terminating trial experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:27] killing container id: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 id="determined-agent-0" system="master" type="agent" <error> [2021-04-28, 08:20:27] unexpected failure of trial after restart 3/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137) experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:27] resetting trial 1 experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:27] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:27] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: ca1b557d-35fa-4c22-ae39-17e86f9cb140) id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:28] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:28] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)> 
experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:28] starting container id: 1636009d-9875-4cf4-91f4-3fe617fba5b7 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 08:20:30] found container running: 1636009d-9875-4cf4-91f4-3fe617fba5b7 (rank 0) experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:30] pushing rendezvous information experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:30] found not all containers are connected experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:33] new connection from container 1636009d-9875-4cf4-91f4-3fe617fba5b7 trial 1 (experiment 1) at 172.23.0.1:44030 <info> [2021-04-28, 08:20:33] pushing rendezvous information experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:33] found all containers are connected successfully experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <error> [2021-04-28, 08:20:33] found 4 rendezvous addresses instead of 2 for container 1636009d-9875-4cf4-91f4-3fe617fba5b7; dropping rendezvous addresses experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <error> [2021-04-28, 08:20:33] error while actor was running error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-1636009d-9875-4cf4-91f4-3fe617fba5b7" system="master" type="websocketActor" <error> [2021-04-28, 08:20:33] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF <info> [2021-04-28, 08:20:33] found child actor failed, terminating forcibly 
experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:33] forcibly terminating trial experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:33] killing container id: 1636009d-9875-4cf4-91f4-3fe617fba5b7 id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 08:20:33] stopped container id: 1636009d-9875-4cf4-91f4-3fe617fba5b7 id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 08:20:33] found container terminated: 1636009d-9875-4cf4-91f4-3fe617fba5b7 experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:33] forcibly terminating trial experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:33] killing container id: 1636009d-9875-4cf4-91f4-3fe617fba5b7 id="determined-agent-0" system="master" type="agent" <error> [2021-04-28, 08:20:33] unexpected failure of trial after restart 4/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137) experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:33] resetting trial 1 experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:33] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:33] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: fc49fcfc-d6db-4dd5-96b1-1aa52032c118) id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:34] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 
id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:20:34] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)> experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:34] starting container id: 1bc2292d-139d-45ea-ad5c-d261d06c4847 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 08:20:36] found container running: 1bc2292d-139d-45ea-ad5c-d261d06c4847 (rank 0) experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:36] pushing rendezvous information experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:36] found not all containers are connected experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:39] new connection from container 1bc2292d-139d-45ea-ad5c-d261d06c4847 trial 1 (experiment 1) at 172.23.0.1:44090 <info> [2021-04-28, 08:20:39] pushing rendezvous information experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <info> [2021-04-28, 08:20:39] found all containers are connected successfully experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <error> [2021-04-28, 08:20:39] found 4 rendezvous addresses instead of 2 for container 1bc2292d-139d-45ea-ad5c-d261d06c4847; dropping rendezvous addresses experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial" <error> [2021-04-28, 08:20:39] error while actor was running error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-1bc2292d-139d-45ea-ad5c-d261d06c4847" system="master" type="websocketActor" <error> 
[2021-04-28, 08:20:39] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info> [2021-04-28, 08:20:39] found child actor failed, terminating forcibly experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info> [2021-04-28, 08:20:39] forcibly terminating trial experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info> [2021-04-28, 08:20:39] killing container id: 1bc2292d-139d-45ea-ad5c-d261d06c4847 id="determined-agent-0" system="master" type="agent"
<info> [2021-04-28, 08:20:39] stopped container id: 1bc2292d-139d-45ea-ad5c-d261d06c4847 id="determined-agent-0" system="master" type="agent"
<info> [2021-04-28, 08:20:39] found container terminated: 1bc2292d-139d-45ea-ad5c-d261d06c4847 experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info> [2021-04-28, 08:20:39] forcibly terminating trial experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info> [2021-04-28, 08:20:39] killing container id: 1bc2292d-139d-45ea-ad5c-d261d06c4847 id="determined-agent-0" system="master" type="agent"
<error> [2021-04-28, 08:20:39] unexpected failure of trial after restart 5/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137) experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info> [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-04-28, 08:20:39] trial completed workload: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)> experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info> [2021-04-28, 08:20:39] exiting trial early: 0xc000d32040 experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error> [2021-04-28, 08:20:39] error shutting down actor error="trial 1 failed and reached maximum number of restarts" experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info> [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 id="default" resource-pool="default" system="master" type="ResourcePool"
<error> [2021-04-28, 08:20:39] trial failed unexpectedly error="trial 1 failed and reached maximum number of restarts" id="1" system="master" type="experiment"
<info> [2021-04-28, 08:20:39] experiment state changed to STOPPING_ERROR id="1" system="master" type="experiment"
<info> [2021-04-28, 08:20:39] experiment state changed to ERROR id="1" system="master" type="experiment"
<info> [2021-04-28, 08:20:39] resources are requested by /experiment-1-checkpoint-gc (Task ID: 16f11e76-f209-4ab2-8c79-045bbfd6ccf9) id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-04-28, 08:20:39] experiment shut down successfully id="1" system="master" type="experiment"
<info> [2021-04-28, 08:20:40] allocated resources to /experiment-1-checkpoint-gc id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-04-28, 08:20:40] starting checkpoint garbage collection id="experiment-1-checkpoint-gc" system="master" type="checkpointGCTask"
<info> [2021-04-28, 08:20:40] starting container id: 47011443-1146-4a7e-a138-e94a2115aa66 slots: 0 task handler: /experiment-1-checkpoint-gc id="determined-agent-0" system="master" type="agent"
<info> [2021-04-28, 08:20:44] stopped container id: 47011443-1146-4a7e-a138-e94a2115aa66 id="determined-agent-0" system="master" type="agent"
<info> [2021-04-28, 08:20:44] finished checkpoint garbage collection id="experiment-1-checkpoint-gc" system="master" type="checkpointGCTask"
<info> [2021-04-28, 08:20:44] resources are released for /experiment-1-checkpoint-gc id="default" resource-pool="default" system="master" type="ResourcePool"
<error> [2021-04-28, 08:49:47] error while actor was running error="ping 11681f0e-e60d-4706-9c13-571690c5d88e did not receive pong response by 2021-04-28 08:49:47.642705858 +0000 UTC m=+1866.505227932" id="websocket-00617727-e0b6-4874-9b0e-b933d51c1ecb" system="master" type="websocketActor"
<error> [2021-04-28, 08:49:47] ping 11681f0e-e60d-4706-9c13-571690c5d88e did not receive pong response by 2021-04-28 08:49:47.642705858 +0000 UTC m=+1866.505227932
<error> [2021-04-28, 08:49:47] error while actor was running error="child failed: /agents/determined-agent-0/websocket-00617727-e0b6-4874-9b0e-b933d51c1ecb: ping 11681f0e-e60d-4706-9c13-571690c5d88e did not receive pong response by 2021-04-28 08:49:47.642705858 +0000 UTC m=+1866.505227932" id="determined-agent-0" system="master" type="agent"
<error> [2021-04-28, 08:49:47] http: connection has been hijacked
<info> [2021-04-28, 08:49:47] agent disconnected id="determined-agent-0" system="master" type="agent"
<info> [2021-04-28, 08:49:47] removing device: cpu0 (Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz x 1 physical cores) (determined-agent-0) id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-04-28, 08:49:47] removing agent: determined-agent-0 id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-04-28, 09:01:37] resource pool is empty; using default resource pool: default id="agents" system="master" type="agents"
<info> [2021-04-28, 09:01:38] agent connected ip: 192.168.100.30 resource pool: default slots: 1 id="determined-agent-0" system="master" type="agent"
<info> [2021-04-28, 09:01:38] adding agent: determined-agent-0 id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-04-28, 09:01:38] adding device: cpu0 (Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz x 1 physical cores) on determined-agent-0 id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-04-28, 08:49:47] removing device: cpu0 (Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz x 1 
physical cores) (determined-agent-0) id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 08:49:47] removing agent: determined-agent-0 id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 09:01:37] resource pool is empty; using default resource pool: default id="agents" system="master" type="agents" <info> [2021-04-28, 09:01:38] agent connected ip: 192.168.100.30 resource pool: default slots: 1 id="determined-agent-0" system="master" type="agent" <info> [2021-04-28, 09:01:38] adding agent: determined-agent-0 id="default" resource-pool="default" system="master" type="ResourcePool" <info> [2021-04-28, 09:01:38] adding device: cpu0 (Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz x 1 physical cores) on determined-agent-0 id="default" resource-pool="default" system="master" type="ResourcePool"
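A possible reading of the two logs together (an inference from the log output, not confirmed against the Determined source): the master reports `found 4 rendezvous addresses instead of 2` and drops the addresses, so the trial receives `'addrs': None`, and the item assignment `addrs[rank] = ...` in `_socket_manager.py` then fails with the TypeError. In the rendezvous information printed in the trial logs, each container port (1734 and 1750) is listed twice, once with `host_ip` `192.168.100.30` and once with `::`. The sketch below uses a hypothetical helper, `dedupe_port_bindings`, which is not part of Determined, to show how such a list collapses to one binding per container port.

```python
from typing import Any, Dict, List

# Host IPs treated as wildcard bindings (an assumption made for this sketch).
WILDCARD_HOSTS = {"::", "0.0.0.0", ""}


def dedupe_port_bindings(addresses: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep one binding per container port, preferring a concrete host IP.

    Hypothetical helper for illustration only; it is not Determined's code.
    """
    chosen: Dict[int, Dict[str, Any]] = {}
    for addr in addresses:
        port = int(addr["container_port"])
        current = chosen.get(port)
        if current is None:
            # First binding seen for this container port.
            chosen[port] = addr
        elif current["host_ip"] in WILDCARD_HOSTS and addr["host_ip"] not in WILDCARD_HOSTS:
            # Replace a wildcard binding with a concrete one.
            chosen[port] = addr
    return [chosen[port] for port in sorted(chosen)]


# Addresses copied from the first "Got rendezvous information" entry in the trial log.
addresses = [
    {"container_ip": "172.17.0.2", "container_port": 1734, "host_ip": "192.168.100.30", "host_port": 49193},
    {"container_ip": "172.17.0.2", "container_port": 1734, "host_ip": "::", "host_port": 49190},
    {"container_ip": "172.17.0.2", "container_port": 1750, "host_ip": "192.168.100.30", "host_port": 49192},
    {"container_ip": "172.17.0.2", "container_port": 1750, "host_ip": "::", "host_port": 49189},
]

deduped = dedupe_port_bindings(addresses)
print(len(addresses), "->", len(deduped))
```

With the four addresses from the log, this leaves two entries, one per rendezvous port, which matches the count the master says it expects. Whether the duplicate `::` entries come from the local Docker setup publishing each port on both IPv4 and IPv6 is a guess on my part.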
[2021-04-28T08:20:04Z] 63a14abe || INFO: image already found, skipping pull phase: docker.io/determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40 [2021-04-28T08:20:04Z] 63a14abe || INFO: copying files to container: / [2021-04-28T08:20:04Z] 63a14abe || INFO: copying files to container: / [2021-04-28T08:20:04Z] 63a14abe || INFO: copying files to container: / [2021-04-28T08:20:04Z] 63a14abe || INFO: copying files to container: / [2021-04-28T08:20:04Z] 63a14abe || INFO: copying files to container: / [2021-04-28T08:20:05Z] 63a14abe || INFO: copying files to container: / [2021-04-28T08:20:05Z] 63a14abe || INFO: copying files to container: /run/determined/train [2021-04-28T08:20:05Z] 63a14abe || INFO: copying files to container: /run/determined/train/model [2021-04-28T08:20:05Z] 63a14abe || INFO: copying files to container: /run/determined/workdir [2021-04-28T08:20:06Z] 63a14abe || + WORKING_DIR=/run/determined/workdir [2021-04-28T08:20:06Z] 63a14abe || + STARTUP_HOOK=startup-hook.sh [2021-04-28T08:20:06Z] 63a14abe || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin [2021-04-28T08:20:06Z] 63a14abe || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin [2021-04-28T08:20:06Z] 63a14abe || + '[' -z '' ']' [2021-04-28T08:20:06Z] 63a14abe || + export DET_PYTHON_EXECUTABLE=python3 [2021-04-28T08:20:06Z] 63a14abe || + DET_PYTHON_EXECUTABLE=python3 [2021-04-28T08:20:06Z] 63a14abe || + /bin/which python3 [2021-04-28T08:20:06Z] 63a14abe || + '[' /root = / ']' [2021-04-28T08:20:06Z] 63a14abe || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.1-py3-none-any.whl [2021-04-28T08:20:08Z] 63a14abe || ERROR: determined 0.15.1 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible. 
[2021-04-28T08:20:08Z] 63a14abe || + cd /run/determined/workdir [2021-04-28T08:20:08Z] 63a14abe || + test -f startup-hook.sh [2021-04-28T08:20:08Z] 63a14abe || + exec python3 -m determined.exec.harness [2021-04-28T08:20:09Z] 63a14abe || INFO: New trial runner in (container 63a14abe-fc1f-4c44-8583-d2fb6f9c6136) on agent determined-agent-0: {'master_addr': '192.168.100.30', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '63a14abe-fc1f-4c44-8583-d2fb6f9c6136', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/home/shihk/.local/share/determined', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'storage_path': 'determined-checkpoint', 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': 'default'}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 
'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None}, 'reproducibility': {'experiment_seed': 1619598003}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 0, 'container_gpus': [], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '7ae8dca8-1748-4f46-9d54-860ca6dc3a74', 'trial_seed': 829893716, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}. 
[2021-04-28T08:20:09Z] 63a14abe || INFO: Connecting to master at ws://192.168.100.30:8080/ws/trial/1/1/63a14abe-fc1f-4c44-8583-d2fb6f9c6136
[2021-04-28T08:20:09Z] 63a14abe || INFO: Connected to master
[2021-04-28T08:20:09Z] 63a14abe || INFO: Established WebSocket session with master
[2021-04-28T08:20:09Z] 63a14abe || INFO: Got rendezvous information: {'addrs': None, 'addrs2': None, 'containers': [{'addresses': [{'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '192.168.100.30', 'host_port': 49193}, {'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '::', 'host_port': 49190}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '192.168.100.30', 'host_port': 49192}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '::', 'host_port': 49189}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-04-28T08:20:09Z] 63a14abe || Traceback (most recent call last):
[2021-04-28T08:20:09Z] 63a14abe || File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-04-28T08:20:09Z] 63a14abe || "__main__", mod_spec)
[2021-04-28T08:20:09Z] 63a14abe || File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-04-28T08:20:09Z] 63a14abe || exec(code, run_globals)
[2021-04-28T08:20:09Z] 63a14abe || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-04-28T08:20:09Z] 63a14abe || main()
[2021-04-28T08:20:09Z] 63a14abe || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-04-28T08:20:09Z] 63a14abe || build_and_run_training_pipeline(env)
[2021-04-28T08:20:09Z] 63a14abe || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 89, in build_and_run_training_pipeline
[2021-04-28T08:20:09Z] 63a14abe || with layers.SocketManager(env) as socket_mgr:
[2021-04-28T08:20:09Z] 63a14abe || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 63, in __init__
[2021-04-28T08:20:09Z] 63a14abe || ri = self.check_for_rendezvous_info(ws_event)
[2021-04-28T08:20:09Z] 63a14abe || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 153, in check_for_rendezvous_info
[2021-04-28T08:20:09Z] 63a14abe || addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}"
[2021-04-28T08:20:09Z] 63a14abe || TypeError: 'NoneType' object does not support item assignment
[2021-04-28T08:20:09Z] 63a14abe || WARNING: disconnecting websocket
[2021-04-28T08:20:09Z] 63a14abe || INFO: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)
[2021-04-28T08:20:10Z] d73b67ac || INFO: image already found, skipping pull phase: docker.io/determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40
[2021-04-28T08:20:10Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:10Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:10Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:10Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:10Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:11Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:11Z] d73b67ac || INFO: copying files to container: /run/determined/train
[2021-04-28T08:20:11Z] d73b67ac || INFO: copying files to container: /run/determined/train/model
[2021-04-28T08:20:11Z] d73b67ac || INFO: copying files to container: /run/determined/workdir
[2021-04-28T08:20:12Z] d73b67ac || + WORKING_DIR=/run/determined/workdir
[2021-04-28T08:20:12Z] d73b67ac || + STARTUP_HOOK=startup-hook.sh
[2021-04-28T08:20:12Z] d73b67ac || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:12Z] d73b67ac || + 
PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin [2021-04-28T08:20:12Z] d73b67ac || + '[' -z '' ']' [2021-04-28T08:20:12Z] d73b67ac || + export DET_PYTHON_EXECUTABLE=python3 [2021-04-28T08:20:12Z] d73b67ac || + DET_PYTHON_EXECUTABLE=python3 [2021-04-28T08:20:12Z] d73b67ac || + /bin/which python3 [2021-04-28T08:20:12Z] d73b67ac || + '[' /root = / ']' [2021-04-28T08:20:12Z] d73b67ac || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.1-py3-none-any.whl [2021-04-28T08:20:14Z] d73b67ac || ERROR: determined 0.15.1 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible. [2021-04-28T08:20:14Z] d73b67ac || + cd /run/determined/workdir [2021-04-28T08:20:14Z] d73b67ac || + exec python3 -m determined.exec.harness [2021-04-28T08:20:14Z] d73b67ac || + test -f startup-hook.sh [2021-04-28T08:20:15Z] d73b67ac || INFO: New trial runner in (container d73b67ac-1500-4037-8bc0-066701af1efa) on agent determined-agent-0: {'master_addr': '192.168.100.30', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'd73b67ac-1500-4037-8bc0-066701af1efa', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/home/shihk/.local/share/determined', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'storage_path': 'determined-checkpoint', 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 
'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': 'default'}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None}, 'reproducibility': {'experiment_seed': 1619598003}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 0, 'container_gpus': [], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '7ae8dca8-1748-4f46-9d54-860ca6dc3a74', 'trial_seed': 829893716, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}. 
[2021-04-28T08:20:15Z] d73b67ac || INFO: Connecting to master at ws://192.168.100.30:8080/ws/trial/1/1/d73b67ac-1500-4037-8bc0-066701af1efa
[2021-04-28T08:20:15Z] d73b67ac || INFO: Connected to master
[2021-04-28T08:20:15Z] d73b67ac || INFO: Established WebSocket session with master
[2021-04-28T08:20:15Z] d73b67ac || INFO: Got rendezvous information: {'addrs': None, 'addrs2': None, 'containers': [{'addresses': [{'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '192.168.100.30', 'host_port': 49195}, {'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '::', 'host_port': 49192}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '192.168.100.30', 'host_port': 49194}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '::', 'host_port': 49191}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-04-28T08:20:15Z] d73b67ac || Traceback (most recent call last):
[2021-04-28T08:20:15Z] d73b67ac || File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-04-28T08:20:15Z] d73b67ac || "__main__", mod_spec)
[2021-04-28T08:20:15Z] d73b67ac || File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-04-28T08:20:15Z] d73b67ac || exec(code, run_globals)
[2021-04-28T08:20:15Z] d73b67ac || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-04-28T08:20:15Z] d73b67ac || main()
[2021-04-28T08:20:15Z] d73b67ac || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-04-28T08:20:15Z] d73b67ac || build_and_run_training_pipeline(env)
[2021-04-28T08:20:15Z] d73b67ac || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 89, in build_and_run_training_pipeline
[2021-04-28T08:20:15Z] d73b67ac || with layers.SocketManager(env) as socket_mgr:
[2021-04-28T08:20:15Z] d73b67ac || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 63, in __init__
[2021-04-28T08:20:15Z] d73b67ac || ri = self.check_for_rendezvous_info(ws_event)
[2021-04-28T08:20:15Z] d73b67ac || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 153, in check_for_rendezvous_info
[2021-04-28T08:20:15Z] d73b67ac || addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}"
[2021-04-28T08:20:15Z] d73b67ac || TypeError: 'NoneType' object does not support item assignment
[2021-04-28T08:20:15Z] d73b67ac || WARNING: disconnecting websocket
[2021-04-28T08:20:15Z] d73b67ac || INFO: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)
[2021-04-28T08:20:16Z] b8fb8292 || INFO: image already found, skipping pull phase: docker.io/determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40
[2021-04-28T08:20:16Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:16Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:16Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:16Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:16Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:17Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:17Z] b8fb8292 || INFO: copying files to container: /run/determined/train
[2021-04-28T08:20:17Z] b8fb8292 || INFO: copying files to container: /run/determined/train/model
[2021-04-28T08:20:17Z] b8fb8292 || INFO: copying files to container: /run/determined/workdir
[2021-04-28T08:20:18Z] b8fb8292 || + WORKING_DIR=/run/determined/workdir
[2021-04-28T08:20:18Z] b8fb8292 || + STARTUP_HOOK=startup-hook.sh
[2021-04-28T08:20:18Z] b8fb8292 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:18Z] b8fb8292 || + 
PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin [2021-04-28T08:20:18Z] b8fb8292 || + '[' -z '' ']' [2021-04-28T08:20:18Z] b8fb8292 || + DET_PYTHON_EXECUTABLE=python3 [2021-04-28T08:20:18Z] b8fb8292 || + export DET_PYTHON_EXECUTABLE=python3 [2021-04-28T08:20:18Z] b8fb8292 || + /bin/which python3 [2021-04-28T08:20:18Z] b8fb8292 || + '[' /root = / ']' [2021-04-28T08:20:18Z] b8fb8292 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.1-py3-none-any.whl [2021-04-28T08:20:20Z] b8fb8292 || ERROR: determined 0.15.1 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible. [2021-04-28T08:20:20Z] b8fb8292 || + cd /run/determined/workdir [2021-04-28T08:20:20Z] b8fb8292 || + test -f startup-hook.sh [2021-04-28T08:20:20Z] b8fb8292 || + exec python3 -m determined.exec.harness [2021-04-28T08:20:21Z] b8fb8292 || INFO: New trial runner in (container b8fb8292-b905-48a2-951d-817b640b4ab6) on agent determined-agent-0: {'master_addr': '192.168.100.30', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'b8fb8292-b905-48a2-951d-817b640b4ab6', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/home/shihk/.local/share/determined', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'storage_path': 'determined-checkpoint', 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 
'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': 'default'}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None}, 'reproducibility': {'experiment_seed': 1619598003}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 0, 'container_gpus': [], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '7ae8dca8-1748-4f46-9d54-860ca6dc3a74', 'trial_seed': 829893716, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}. 
[2021-04-28T08:20:21Z] b8fb8292 || INFO: Connecting to master at ws://192.168.100.30:8080/ws/trial/1/1/b8fb8292-b905-48a2-951d-817b640b4ab6
[2021-04-28T08:20:21Z] b8fb8292 || INFO: Connected to master
[2021-04-28T08:20:21Z] b8fb8292 || INFO: Established WebSocket session with master
[2021-04-28T08:20:21Z] b8fb8292 || INFO: Got rendezvous information: {'addrs': None, 'addrs2': None, 'containers': [{'addresses': [{'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '192.168.100.30', 'host_port': 49197}, {'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '::', 'host_port': 49194}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '192.168.100.30', 'host_port': 49196}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '::', 'host_port': 49193}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-04-28T08:20:21Z] b8fb8292 || Traceback (most recent call last):
[2021-04-28T08:20:21Z] b8fb8292 || File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-04-28T08:20:21Z] b8fb8292 || "__main__", mod_spec)
[2021-04-28T08:20:21Z] b8fb8292 || File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-04-28T08:20:21Z] b8fb8292 || exec(code, run_globals)
[2021-04-28T08:20:21Z] b8fb8292 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-04-28T08:20:21Z] b8fb8292 || main()
[2021-04-28T08:20:21Z] b8fb8292 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-04-28T08:20:21Z] b8fb8292 || build_and_run_training_pipeline(env)
[2021-04-28T08:20:21Z] b8fb8292 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 89, in build_and_run_training_pipeline
[2021-04-28T08:20:21Z] b8fb8292 || with layers.SocketManager(env) as socket_mgr:
[2021-04-28T08:20:21Z] b8fb8292 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 63, in __init__
[2021-04-28T08:20:21Z] b8fb8292 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 153, in check_for_rendezvous_info
[2021-04-28T08:20:21Z] b8fb8292 || ri = self.check_for_rendezvous_info(ws_event)
[2021-04-28T08:20:21Z] b8fb8292 || addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}"
[2021-04-28T08:20:21Z] b8fb8292 || TypeError: 'NoneType' object does not support item assignment
[2021-04-28T08:20:21Z] b8fb8292 || WARNING: disconnecting websocket
[2021-04-28T08:20:22Z] b8fb8292 || INFO: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)
[2021-04-28T08:20:22Z] 8fd8ae29 || INFO: image already found, skipping pull phase: docker.io/determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40
[2021-04-28T08:20:22Z] 8fd8ae29 || INFO: copying files to container: /
[2021-04-28T08:20:22Z] 8fd8ae29 || INFO: copying files to container: /
[2021-04-28T08:20:22Z] 8fd8ae29 || INFO: copying files to container: /
[2021-04-28T08:20:22Z] 8fd8ae29 || INFO: copying files to container: /
[2021-04-28T08:20:22Z] 8fd8ae29 || INFO: copying files to container: /
[2021-04-28T08:20:23Z] 8fd8ae29 || INFO: copying files to container: /
[2021-04-28T08:20:23Z] 8fd8ae29 || INFO: copying files to container: /run/determined/train
[2021-04-28T08:20:23Z] 8fd8ae29 || INFO: copying files to container: /run/determined/train/model
[2021-04-28T08:20:23Z] 8fd8ae29 || INFO: copying files to container: /run/determined/workdir
[2021-04-28T08:20:24Z] 8fd8ae29 || + WORKING_DIR=/run/determined/workdir
[2021-04-28T08:20:24Z] 8fd8ae29 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:24Z] 8fd8ae29 || + STARTUP_HOOK=startup-hook.sh
[2021-04-28T08:20:24Z] 8fd8ae29 || + 
PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin [2021-04-28T08:20:24Z] 8fd8ae29 || + '[' -z '' ']' [2021-04-28T08:20:24Z] 8fd8ae29 || + export DET_PYTHON_EXECUTABLE=python3 [2021-04-28T08:20:24Z] 8fd8ae29 || + DET_PYTHON_EXECUTABLE=python3 [2021-04-28T08:20:24Z] 8fd8ae29 || + /bin/which python3 [2021-04-28T08:20:24Z] 8fd8ae29 || + '[' /root = / ']' [2021-04-28T08:20:24Z] 8fd8ae29 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.1-py3-none-any.whl [2021-04-28T08:20:25Z] 8fd8ae29 || ERROR: determined 0.15.1 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible. [2021-04-28T08:20:26Z] 8fd8ae29 || + cd /run/determined/workdir [2021-04-28T08:20:26Z] 8fd8ae29 || + test -f startup-hook.sh [2021-04-28T08:20:26Z] 8fd8ae29 || + exec python3 -m determined.exec.harness [2021-04-28T08:20:27Z] 8fd8ae29 || INFO: New trial runner in (container 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4) on agent determined-agent-0: {'master_addr': '192.168.100.30', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/home/shihk/.local/share/determined', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'storage_path': 'determined-checkpoint', 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 
'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': 'default'}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None}, 'reproducibility': {'experiment_seed': 1619598003}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 0, 'container_gpus': [], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '7ae8dca8-1748-4f46-9d54-860ca6dc3a74', 'trial_seed': 829893716, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}. 
[2021-04-28T08:20:27Z] 8fd8ae29 || INFO: Connecting to master at ws://192.168.100.30:8080/ws/trial/1/1/8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 [2021-04-28T08:20:27Z] 8fd8ae29 || INFO: Connected to master [2021-04-28T08:20:27Z] 8fd8ae29 || INFO: Established WebSocket session with master [2021-04-28T08:20:27Z] 8fd8ae29 || INFO: Got rendezvous information: {'addrs': None, 'addrs2': None, 'containers': [{'addresses': [{'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '192.168.100.30', 'host_port': 49199}, {'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '::', 'host_port': 49196}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '192.168.100.30', 'host_port': 49198}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '::', 'host_port': 49195}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'} [2021-04-28T08:20:27Z] 8fd8ae29 || Traceback (most recent call last): [2021-04-28T08:20:27Z] 8fd8ae29 || File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main [2021-04-28T08:20:27Z] 8fd8ae29 || "__main__", mod_spec) [2021-04-28T08:20:27Z] 8fd8ae29 || File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code [2021-04-28T08:20:27Z] 8fd8ae29 || exec(code, run_globals) [2021-04-28T08:20:27Z] 8fd8ae29 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module> [2021-04-28T08:20:27Z] 8fd8ae29 || main() [2021-04-28T08:20:27Z] 8fd8ae29 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main [2021-04-28T08:20:27Z] 8fd8ae29 || build_and_run_training_pipeline(env) [2021-04-28T08:20:27Z] 8fd8ae29 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 89, in build_and_run_training_pipeline [2021-04-28T08:20:27Z] 8fd8ae29 || with layers.SocketManager(env) as socket_mgr: [2021-04-28T08:20:27Z] 8fd8ae29 || File 
"/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 63, in __init__ [2021-04-28T08:20:27Z] 8fd8ae29 || ri = self.check_for_rendezvous_info(ws_event) [2021-04-28T08:20:27Z] 8fd8ae29 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 153, in check_for_rendezvous_info [2021-04-28T08:20:27Z] 8fd8ae29 || addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}" [2021-04-28T08:20:27Z] 8fd8ae29 || TypeError: 'NoneType' object does not support item assignment [2021-04-28T08:20:27Z] 8fd8ae29 || WARNING: disconnecting websocket [2021-04-28T08:20:27Z] 8fd8ae29 || INFO: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137) [2021-04-28T08:20:28Z] 1636009d || INFO: image already found, skipping pull phase: docker.io/determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40 [2021-04-28T08:20:28Z] 1636009d || INFO: copying files to container: / [2021-04-28T08:20:28Z] 1636009d || INFO: copying files to container: / [2021-04-28T08:20:28Z] 1636009d || INFO: copying files to container: / [2021-04-28T08:20:28Z] 1636009d || INFO: copying files to container: / [2021-04-28T08:20:28Z] 1636009d || INFO: copying files to container: / [2021-04-28T08:20:29Z] 1636009d || INFO: copying files to container: / [2021-04-28T08:20:29Z] 1636009d || INFO: copying files to container: /run/determined/train [2021-04-28T08:20:29Z] 1636009d || INFO: copying files to container: /run/determined/train/model [2021-04-28T08:20:29Z] 1636009d || INFO: copying files to container: /run/determined/workdir [2021-04-28T08:20:30Z] 1636009d || + WORKING_DIR=/run/determined/workdir [2021-04-28T08:20:30Z] 1636009d || + STARTUP_HOOK=startup-hook.sh [2021-04-28T08:20:30Z] 1636009d || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin [2021-04-28T08:20:30Z] 1636009d || + 
PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin [2021-04-28T08:20:30Z] 1636009d || + '[' -z '' ']' [2021-04-28T08:20:30Z] 1636009d || + export DET_PYTHON_EXECUTABLE=python3 [2021-04-28T08:20:30Z] 1636009d || + DET_PYTHON_EXECUTABLE=python3 [2021-04-28T08:20:30Z] 1636009d || + /bin/which python3 [2021-04-28T08:20:30Z] 1636009d || + '[' /root = / ']' [2021-04-28T08:20:30Z] 1636009d || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.1-py3-none-any.whl [2021-04-28T08:20:31Z] 1636009d || ERROR: determined 0.15.1 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible. [2021-04-28T08:20:32Z] 1636009d || + cd /run/determined/workdir [2021-04-28T08:20:32Z] 1636009d || + test -f startup-hook.sh [2021-04-28T08:20:32Z] 1636009d || + exec python3 -m determined.exec.harness [2021-04-28T08:20:33Z] 1636009d || INFO: New trial runner in (container 1636009d-9875-4cf4-91f4-3fe617fba5b7) on agent determined-agent-0: {'master_addr': '192.168.100.30', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '1636009d-9875-4cf4-91f4-3fe617fba5b7', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/home/shihk/.local/share/determined', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'storage_path': 'determined-checkpoint', 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 
'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': 'default'}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None}, 'reproducibility': {'experiment_seed': 1619598003}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 0, 'container_gpus': [], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '7ae8dca8-1748-4f46-9d54-860ca6dc3a74', 'trial_seed': 829893716, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}. 
[2021-04-28T08:20:33Z] 1636009d || INFO: Connecting to master at ws://192.168.100.30:8080/ws/trial/1/1/1636009d-9875-4cf4-91f4-3fe617fba5b7 [2021-04-28T08:20:33Z] 1636009d || INFO: Connected to master [2021-04-28T08:20:33Z] 1636009d || INFO: Established WebSocket session with master [2021-04-28T08:20:33Z] 1636009d || INFO: Got rendezvous information: {'addrs': None, 'addrs2': None, 'containers': [{'addresses': [{'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '192.168.100.30', 'host_port': 49201}, {'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '::', 'host_port': 49198}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '192.168.100.30', 'host_port': 49200}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '::', 'host_port': 49197}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'} [2021-04-28T08:20:33Z] 1636009d || Traceback (most recent call last): [2021-04-28T08:20:33Z] 1636009d || File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main [2021-04-28T08:20:33Z] 1636009d || "__main__", mod_spec) [2021-04-28T08:20:33Z] 1636009d || File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code [2021-04-28T08:20:33Z] 1636009d || exec(code, run_globals) [2021-04-28T08:20:33Z] 1636009d || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module> [2021-04-28T08:20:33Z] 1636009d || main() [2021-04-28T08:20:33Z] 1636009d || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main [2021-04-28T08:20:33Z] 1636009d || build_and_run_training_pipeline(env) [2021-04-28T08:20:33Z] 1636009d || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 89, in build_and_run_training_pipeline [2021-04-28T08:20:33Z] 1636009d || with layers.SocketManager(env) as socket_mgr: [2021-04-28T08:20:33Z] 1636009d || File 
"/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 63, in __init__ [2021-04-28T08:20:33Z] 1636009d || ri = self.check_for_rendezvous_info(ws_event) [2021-04-28T08:20:33Z] 1636009d || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 153, in check_for_rendezvous_info [2021-04-28T08:20:33Z] 1636009d || addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}" [2021-04-28T08:20:33Z] 1636009d || TypeError: 'NoneType' object does not support item assignment [2021-04-28T08:20:33Z] 1636009d || WARNING: disconnecting websocket [2021-04-28T08:20:33Z] 1636009d || INFO: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137) [2021-04-28T08:20:34Z] 1bc2292d || INFO: image already found, skipping pull phase: docker.io/determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40 [2021-04-28T08:20:34Z] 1bc2292d || INFO: copying files to container: / [2021-04-28T08:20:34Z] 1bc2292d || INFO: copying files to container: / [2021-04-28T08:20:34Z] 1bc2292d || INFO: copying files to container: / [2021-04-28T08:20:34Z] 1bc2292d || INFO: copying files to container: / [2021-04-28T08:20:34Z] 1bc2292d || INFO: copying files to container: / [2021-04-28T08:20:35Z] 1bc2292d || INFO: copying files to container: / [2021-04-28T08:20:35Z] 1bc2292d || INFO: copying files to container: /run/determined/train [2021-04-28T08:20:35Z] 1bc2292d || INFO: copying files to container: /run/determined/train/model [2021-04-28T08:20:35Z] 1bc2292d || INFO: copying files to container: /run/determined/workdir [2021-04-28T08:20:36Z] 1bc2292d || + WORKING_DIR=/run/determined/workdir [2021-04-28T08:20:36Z] 1bc2292d || + STARTUP_HOOK=startup-hook.sh [2021-04-28T08:20:36Z] 1bc2292d || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin [2021-04-28T08:20:36Z] 1bc2292d || + 
PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin [2021-04-28T08:20:36Z] 1bc2292d || + '[' -z '' ']' [2021-04-28T08:20:36Z] 1bc2292d || + export DET_PYTHON_EXECUTABLE=python3 [2021-04-28T08:20:36Z] 1bc2292d || + DET_PYTHON_EXECUTABLE=python3 [2021-04-28T08:20:36Z] 1bc2292d || + /bin/which python3 [2021-04-28T08:20:36Z] 1bc2292d || + '[' /root = / ']' [2021-04-28T08:20:36Z] 1bc2292d || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.1-py3-none-any.whl [2021-04-28T08:20:38Z] 1bc2292d || ERROR: determined 0.15.1 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible. [2021-04-28T08:20:38Z] 1bc2292d || + cd /run/determined/workdir [2021-04-28T08:20:38Z] 1bc2292d || + test -f startup-hook.sh [2021-04-28T08:20:38Z] 1bc2292d || + exec python3 -m determined.exec.harness [2021-04-28T08:20:39Z] 1bc2292d || INFO: New trial runner in (container 1bc2292d-139d-45ea-ad5c-d261d06c4847) on agent determined-agent-0: {'master_addr': '192.168.100.30', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '1bc2292d-139d-45ea-ad5c-d261d06c4847', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/home/shihk/.local/share/determined', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'storage_path': 'determined-checkpoint', 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 
'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': 'default'}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None}, 'reproducibility': {'experiment_seed': 1619598003}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 0, 'container_gpus': [], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '7ae8dca8-1748-4f46-9d54-860ca6dc3a74', 'trial_seed': 829893716, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}. 
[2021-04-28T08:20:39Z] 1bc2292d || INFO: Connecting to master at ws://192.168.100.30:8080/ws/trial/1/1/1bc2292d-139d-45ea-ad5c-d261d06c4847 [2021-04-28T08:20:39Z] 1bc2292d || INFO: Connected to master [2021-04-28T08:20:39Z] 1bc2292d || INFO: Established WebSocket session with master [2021-04-28T08:20:39Z] 1bc2292d || INFO: Got rendezvous information: {'addrs': None, 'addrs2': None, 'containers': [{'addresses': [{'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '192.168.100.30', 'host_port': 49203}, {'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '::', 'host_port': 49200}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '192.168.100.30', 'host_port': 49202}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '::', 'host_port': 49199}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'} [2021-04-28T08:20:39Z] 1bc2292d || Traceback (most recent call last): [2021-04-28T08:20:39Z] 1bc2292d || File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main [2021-04-28T08:20:39Z] 1bc2292d || "__main__", mod_spec) [2021-04-28T08:20:39Z] 1bc2292d || File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code [2021-04-28T08:20:39Z] 1bc2292d || exec(code, run_globals) [2021-04-28T08:20:39Z] 1bc2292d || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module> [2021-04-28T08:20:39Z] 1bc2292d || main() [2021-04-28T08:20:39Z] 1bc2292d || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main [2021-04-28T08:20:39Z] 1bc2292d || build_and_run_training_pipeline(env) [2021-04-28T08:20:39Z] 1bc2292d || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 89, in build_and_run_training_pipeline [2021-04-28T08:20:39Z] 1bc2292d || with layers.SocketManager(env) as socket_mgr: [2021-04-28T08:20:39Z] 1bc2292d || File 
"/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 63, in __init__ [2021-04-28T08:20:39Z] 1bc2292d || ri = self.check_for_rendezvous_info(ws_event) [2021-04-28T08:20:39Z] 1bc2292d || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 153, in check_for_rendezvous_info [2021-04-28T08:20:39Z] 1bc2292d || addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}" [2021-04-28T08:20:39Z] 1bc2292d || TypeError: 'NoneType' object does not support item assignment [2021-04-28T08:20:39Z] 1bc2292d || WARNING: disconnecting websocket [2021-04-28T08:20:39Z] 1bc2292d || INFO: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137) �[32mTrial log stream ended. To reopen log stream, run: det trial logs -f 1�[0m
The variable rendezvous_ports in env does not seem to be set correctly?
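The failing assignment from the traceback can be reproduced outside the harness with a minimal sketch. It assumes that check_for_rendezvous_info takes addrs from the 'addrs' key of the rendezvous message, which the trial log shows as None. The helper name patch_local_addr and the trimmed dict are hypothetical; only the assignment line is copied from the traceback.

```python
# Rendezvous message trimmed from the trial log above; note that 'addrs' is None.
rendezvous_info = {"addrs": None, "addrs2": None, "rank": 0, "type": "RENDEZVOUS_INFO"}

# Ports parsed from 'det_rendezvous_ports': '1734,1750' in the trial runner env.
rendezvous_ports = [int(p) for p in "1734,1750".split(",")]


def patch_local_addr(info, ports):
    """Hypothetical stand-in for the failing step in check_for_rendezvous_info."""
    addrs = info["addrs"]  # assumption: the harness reads this key
    rank = info["rank"]
    addrs[rank] = f"0.0.0.0:{ports[0]}"  # line taken from the traceback
    return addrs


try:
    patch_local_addr(rendezvous_info, rendezvous_ports)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If that assumption holds, the ports themselves look fine (1734 and 1750 are present in the env), and the None value is the addrs field in the message received from the master.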
@caoyu1664 It's the same issue that you reported earlier, which applies even to det deploy cluster-up --no-gpu.
@vishnu2kmohan Thanks. Please close this issue :)