
Training with AWS SageMaker stuck if more than one epoch #1353

Description

@hjuhel-cdpq

Describe the bug
Training the simpletransformers.ner model on AWS SageMaker for more than one epoch results in the training idling forever before the end of the first epoch. If trained for a single epoch, the training ends normally.

While idling, GPU utilization drops to 0 while disk utilization grows and then plateaus (see attached picture).

It might also be a simpletransformers issue, but the same procedure works perfectly when the same model is trained directly on a SageMaker notebook.
(metrics_sgmk: attached screenshot of GPU and disk utilization)
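
For reference, here is a minimal sketch of the notebook-side run that completes normally; the local CSV path and epoch count are illustrative, and train_model is the NERModel training entry point:

import pandas as pd
from simpletransformers.ner import NERModel

# Same data and labels as in the SageMaker job; the local path is illustrative
data = pd.read_csv("training.csv")
labels = data.labels.unique().tolist()

model = NERModel(
    "bert",
    "bert-base-uncased",
    args={"num_train_epochs": 2, "reprocess_input_data": True},
    labels=labels,
)
model.train_model(data)  # both epochs run to completion on the notebook instance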

To Reproduce

  1. Build a Docker image, based on the pytorch-training image, with Apex and simpletransformers installed:
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.4.0-gpu-py36-cu101-ubuntu16.04

# Install simpletransformers on top of the AWS PyTorch training image
RUN pip install simpletransformers

# Build NVIDIA Apex from source with its C++ and CUDA extensions
RUN git clone https://github.com/NVIDIA/apex
RUN pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex
  2. Upload the Docker image to ECR under the name <training_image>.

  3. Prepare a minimal training script and upload it to a SageMaker notebook (the training.csv layout the script assumes is sketched after this list):

import os

from simpletransformers.ner import NERModel
import pandas as pd

if __name__ == "__main__":

    # Retrieve data and labels
    data = pd.read_csv(os.path.join(os.environ["SM_CHANNEL_TRAINING"], "training.csv"))
    labels = data.labels.unique().tolist()

    # Instantiate a NER model
    args = {
        "output_dir": os.environ["SM_MODEL_DIR"],
        "reprocess_input_data": True,
        "num_train_epochs": 2,
        "train_batch_size": 8,
    }
    model = NERModel("bert", "bert-base-uncased", args=args, labels=labels)

    model.train_model(data)
  4. Start the training from the SageMaker notebook:
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role

ROLE = get_execution_role()

pytorch_estimator = PyTorch(entry_point="train.py",
                            image_name="<training_image>",
                            train_instance_type="ml.p3.2xlarge",
                            train_instance_count=1,
                            role=ROLE,
                            )

pytorch_estimator.fit({"training": "<path_to_training.csv>"})
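
As noted in step 3, here is a sketch of the training.csv layout the script assumes: the three-column format (sentence_id, words, labels) that the simpletransformers NERModel consumes. The rows below are made-up examples reusing the label set visible in the logs:

import pandas as pd

# Hypothetical rows, only to illustrate the expected columns
data = pd.DataFrame(
    {
        "sentence_id": [0, 0, 0, 1, 1],
        "words": ["John", "works", "here", "Acme", "hires"],
        "labels": ["B-E", "O", "O", "B-C", "O"],
    }
)
data.to_csv("training.csv", index=False)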

Expected behavior
I was expecting training with a larger number of epochs to work.

Screenshots

Desktop (please complete the following information):

  • OS: Ubuntu, from the base SageMaker PyTorch image

Additional context
Please find the logs attached. I have removed the logs between the start of the first step of the first epoch and the last event received during the first epoch:

01:10:37 bash: cannot set terminal process group (-1): Inappropriate ioctl for device
01:10:37 bash: no job control in this shell
01:10:39 2020-03-12 01:10:38,648 sagemaker-containers INFO Imported framework sagemaker_pytorch_container.training
01:10:39 2020-03-12 01:10:38,727 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
01:10:39 2020-03-12 01:10:38,728 sagemaker_pytorch_container.training INFO Invoking user training script.
01:10:39 2020-03-12 01:10:39,071 sagemaker-containers INFO Module default_user_module_name does not provide a setup.py.
01:10:39 Generating setup.py
01:10:39 2020-03-12 01:10:39,071 sagemaker-containers INFO Generating setup.cfg
01:10:39 2020-03-12 01:10:39,071 sagemaker-containers INFO Generating MANIFEST.in
01:10:39 2020-03-12 01:10:39,071 sagemaker-containers INFO Installing module with the following command:
01:10:39 /opt/conda/bin/python -m pip install .
01:10:40 Processing /tmp/tmpvf5izqsk/module_dir
01:10:40 Building wheels for collected packages: default-user-module-name Building wheel for default-user-module-name (setup.py): started Building wheel for default-user-module-name (setup.py): finished with status 'done' Created wheel for default-user-module-name: filename=default_user_module_name-1.0.0-py2.py3-none-any.whl size=4365 sha256=a7c076c4c020f4b8b9f85a40721d629b752f9b1006cdb0ba2bf27132f9b
01:10:40 Successfully built default-user-module-name
01:10:41 Installing collected packages: default-user-module-name
01:10:41 Successfully installed default-user-module-name-1.0.0
01:10:41 2020-03-12 01:10:41,498 sagemaker-containers INFO Invoking user script
01:10:41 Training Env:
01:10:41 { "additional_framework_parameters": {}, "channel_input_dirs": { "training": "/opt/ml/input/data/training" }, "current_host": "algo-1", "framework_module": "sagemaker_pytorch_container.training:main", "hosts": [ "algo-1" ], "hyperparameters": { "n_gpu": 8, "batch_size": 32, "seed": 2, "lower": true, "epochs": 5
01:10:41 }
01:10:41 Environment variables:
01:10:41 SM_HOSTS=["algo-1"]
01:10:41 SM_NETWORK_INTERFACE_NAME=eth0
01:10:41 SM_HPS={"batch_size":32,"epochs":50,"lower":true,"n_gpu":8,"seed":2}
01:10:41 SM_USER_ENTRY_POINT=train.py
01:10:41 SM_FRAMEWORK_PARAMS={}
01:10:41 SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
01:10:41 SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
01:10:41 SM_OUTPUT_DATA_DIR=/opt/ml/output/data
01:10:41 SM_CHANNELS=["training"]
01:10:41 SM_CURRENT_HOST=algo-1
01:10:41 SM_MODULE_NAME=train
01:10:41 SM_LOG_LEVEL=20
01:10:41 SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
01:10:41 SM_INPUT_DIR=/opt/ml/input
01:10:41 SM_INPUT_CONFIG_DIR=/opt/ml/input/config
01:10:41 SM_OUTPUT_DIR=/opt/ml/output
01:10:41 SM_NUM_CPUS=64
01:10:41 SM_NUM_GPUS=8
01:10:41 SM_MODEL_DIR=/opt/ml/model
01:10:41 SM_MODULE_DIR=<path>
01:10:41 SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch_size":32,"epochs":50,"lower":true,"n_gpu":8,"seed":2},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"
01:10:41 SM_USER_ARGS=["--batch_size","32","--epochs","50","--lower","True","--n_gpu","8","--seed","2"]
01:10:41 SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
01:10:41 SM_CHANNEL_TRAINING=/opt/ml/input/data/training
01:10:41 SM_HP_N_GPU=8
01:10:41 SM_HP_BATCH_SIZE=32
01:10:41 SM_HP_SEED=2
01:10:41 SM_HP_LOWER=true
01:10:41 SM_HP_EPOCHS=50
01:10:41 PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
01:10:41 Invoking script with the following command:
01:10:41 /opt/conda/bin/python train.py --batch_size 32 --epochs 50 --lower True --n_gpu 8 --seed 2
01:11:03 Converting to features started.
01:11:03 2020-03-12 01:10:45,583 [train.py ] INFO Starting training...
01:11:07 [2020-03-12 01:11:07.095 algo-1:102 INFO json_config.py:90] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
01:11:07 2020-03-12 01:10:45,584 [train.py ] INFO No dataset provided for testing...training will be splitted
01:11:07 [2020-03-12 01:11:07.096 algo-1:102 INFO hook.py:152] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
01:11:07 2020-03-12 01:10:45,584 [train.py ] INFO Reading training data at /opt/ml/input/data/training
01:11:07 [2020-03-12 01:11:07.096 algo-1:102 INFO hook.py:197] Saving to /opt/ml/output/tensors
01:11:07 2020-03-12 01:10:45,584 [train.py ] INFO Reading file : 00_data_IOB.csv, located at /opt/ml/input/data/training/00_data_IOB.csv
01:11:07 [2020-03-12 01:11:07.116 algo-1:102 INFO hook.py:326] Monitoring the collections: losses
01:11:07 train.py:95: SettingWithCopyWarning:
01:11:08 [2020-03-12 01:11:08.055 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0.attention.self NoneType
01:11:08 A value is trying to be set on a copy of a slice from a DataFrame.
01:11:08 [2020-03-12 01:11:08.055 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0.attention.self NoneType
01:11:08 Try using .loc[row_indexer,col_indexer] = value instead
01:11:08 [2020-03-12 01:11:08.055 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0.attention.self NoneType
01:11:08 [2020-03-12 01:11:08.055 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0.attention NoneType
01:11:08 See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
01:11:08 [2020-03-12 01:11:08.064 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0 NoneType training["words"] = training["words"].apply(str)
01:11:08 [2020-03-12 01:11:08.064 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0 NoneType
01:11:08 train.py:96: SettingWithCopyWarning:
01:11:08 [2020-03-12 01:11:08.064 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0 NoneType
01:11:08 A value is trying to be set on a copy of a slice from a DataFrame.
01:11:08 [2020-03-12 01:11:08.069 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1.attention.self NoneType
01:11:08 Try using .loc[row_indexer,col_indexer] = value instead
01:11:08 [2020-03-12 01:11:08.069 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1.attention.self NoneType
01:11:08 [2020-03-12 01:11:08.070 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1.attention.self NoneType
01:11:08 See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
01:11:08 [2020-03-12 01:11:08.070 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1.attention NoneType testing["words"] = testing["words"].apply(str)
01:11:08 [2020-03-12 01:11:08.073 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1 NoneType
01:11:08 train.py:99: SettingWithCopyWarning:
01:11:08 [2020-03-12 01:11:08.073 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1 NoneType
01:11:08 A value is trying to be set on a copy of a slice from a DataFrame.
01:11:08 [2020-03-12 01:11:08.073 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1 NoneType
01:11:08 Try using .loc[row_indexer,col_indexer] = value instead
01:11:08 [2020-03-12 01:11:08.078 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2.attention.self NoneType
01:11:08 [2020-03-12 01:11:08.078 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2.attention.self NoneType
01:11:08 See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
01:11:08 [2020-03-12 01:11:08.078 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2.attention.self NoneType training["words"] = training["words"].apply(lambda x: x.lower().strip())
01:11:08 [2020-03-12 01:11:08.079 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2.attention NoneType
01:11:08 train.py:100: SettingWithCopyWarning:
01:11:08 [2020-03-12 01:11:08.081 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2 NoneType
01:11:08 A value is trying to be set on a copy of a slice from a DataFrame.
01:11:08 [2020-03-12 01:11:08.081 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2 NoneType
01:11:08 Try using .loc[row_indexer,col_indexer] = value instead
01:11:08 [2020-03-12 01:11:08.081 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2 NoneType
01:11:08 [2020-03-12 01:11:08.087 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3.attention.self NoneType
01:11:08 See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
01:11:08 [2020-03-12 01:11:08.087 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3.attention.self NoneType testing["words"] = testing["words"].apply(lambda x: x.lower().strip())
01:11:08 [2020-03-12 01:11:08.087 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3.attention.self NoneType
01:11:08 2020-03-12 01:10:45,676 [train.py ] INFO Training datapoitns : 71712
01:11:08 [2020-03-12 01:11:08.087 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3.attention NoneType
01:11:08 2020-03-12 01:10:45,676 [train.py ] INFO TraininTestingg datapoitns : 18001
01:11:08 [2020-03-12 01:11:08.090 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3 NoneType
01:11:08 2020-03-12 01:10:45,681 [train.py ] INFO Training the model with label : ['O', 'B-E', 'I-E', 'B-C', 'I-C']
01:11:08 [2020-03-12 01:11:08.090 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3 NoneType
01:11:08 2020-03-12 01:10:45,681 [train.py ] INFO Training the model with arguments : {'output_dir': '/opt/ml/model/model', 'reprocess_input_data': True, 'num_train_epochs': 50, 'train_batch_size': 32, 'fp16': False, 'save_eval_checkpoints': False, 'save_steps': 8223372036854775807, 'save_model_every_epoch': False, 'overwrite_output_dir': True, 'logging_steps': 500, 'silent': False, 'use_early_stopp
01:11:08 [2020-03-12 01:11:08.090 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3 NoneType
01:11:08 #015Downloading: 0%| | 0.00/361 [00:00<?, ?B/s]#015Downloading: 100%|██████████| 361/361 [00:00<00:00, 366kB/s]
01:11:08 [2020-03-12 01:11:08.095 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4.attention.self NoneType
01:11:08 #015Downloading: 0%| | 0.00/440M [00:00<?, ?B/s]#015Downloading: 1%| | 4.70M/440M [00:00<00:09, 47.0MB/s]#015Downloading: 2%|▏ | 9.55M/440M [00:00<00:09, 47.4MB/s]#015Downloading: 3%|▎ | 14.4M/440M [00:00<00:08, 47.9MB/s]#015Downloading: 4%|▍ | 19.0M/440M [00:00<00:08, 47.2MB/s]#015Downloading: 5%|▌ | 23.5M/440M [00:00<00:08, 46.5MB/s]#
01:11:08 [2020-03-12 01:11:08.095 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4.attention.self NoneType
01:11:08 #015Downloading: 0%| | 0.00/232k [00:00<?, ?B/s]#015Downloading: 100%|██████████| 232k/232k [00:00<00:00, 26.4MB/s]
01:11:08 [2020-03-12 01:11:08.095 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4.attention.self NoneType
01:11:08 #015 0%| | 0/5234 [00:00<?, ?it/s]#015 0%| | 1/5234 [00:00<46:04, 1.89it/s]#015 57%|█████▋ | 3001/5234 [00:00<13:45, 2.70it/s]#015100%|██████████| 5234/5234 [00:00<00:00, 6824.67it/s]
01:11:08 [2020-03-12 01:11:08.096 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4.attention NoneType
01:11:08 #015Epoch: 0%| | 0/50 [00:00<?, ?it/s]
01:11:08 [2020-03-12 01:11:08.098 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4 NoneType
01:11:08 #015Current iteration: 0%| | 0/164 [00:00<?, ?it/s]#033[A
01:11:08 [2020-03-12 01:11:08.099 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4 NoneType
01:11:08 #015Current iteration: 1%| | 1/164 [00:01<03:49, 1.41s/it]#033[A
01:11:08 [2020-03-12 01:11:08.099 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4 NoneType
01:11:09 #015Current iteration: 1%| | 2/164 [00:01<02:51, 1.06s/it]#033[A
01:11:09 [2020-03-12 01:11:08.104 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.5.attention.self NoneType
01:11:09 #015Current iteration: 2%|▏ | 3/164 [00:01<02:10, 1.23it/s]#033[A

...

01:13:09 [2020-03-12 01:13:09.473 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.11 NoneType
01:13:09 #015Current iteration: 82%|████████▏ | 134/164 [00:33<00:07, 4.15it/s]#033[A
01:13:09 [2020-03-12 01:13:09.473 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.11 NoneType
01:13:09 #015Current iteration: 82%|████████▏ | 135/164 [00:33<00:07, 4.14it/s]#033[A

The training is stuck after the last event received; at the same time, GPU utilization drops to 0.

* Tested on 3 different SageMaker instance types
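
To help pinpoint where the process idles, here is a diagnostic sketch that could be added at the top of train.py; faulthandler is in the Python standard library, and the 600 s interval is an arbitrary choice:

import faulthandler
import signal
import sys

# Periodically dump every thread's stack to stderr (it ends up in CloudWatch),
# so the blocking call is visible once the job goes idle
faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)

# Also allow an on-demand dump with `kill -USR1 <pid>` from inside the container
faulthandler.register(signal.SIGUSR1)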
