Describe the bug
Training a simpletransformers.ner model on AWS SageMaker for more than one epoch results in training idling forever before the end of the first epoch. If trained for a single epoch, training ends normally.
While idling, GPU utilization drops to 0 and disk utilization grows and then begins to plateau (see attached picture).
It might also be a simpletransformers issue, but the same procedure works perfectly when training the same model directly on a SageMaker notebook.
To Reproduce
- Build a Docker image based on the pytorch-training base image, with Apex and simpletransformers installed:
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.4.0-gpu-py36-cu101-ubuntu16.04
RUN pip install simpletransformers
RUN git clone https://github.com/NVIDIA/apex
RUN pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex
- Upload the Docker image to ECR under a name <training_image> (one possible build-and-push sequence is sketched below)
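For completeness, a minimal sketch of the build-and-push step, run from the notebook instance. The repository name and region are illustrative placeholders, not taken from the original setup:
import base64
import subprocess
import boto3

# Hypothetical values; replace with your own region and repository name.
region = "us-east-1"
repo_name = "training-image"

ecr = boto3.client("ecr", region_name=region)

# Create the repository if it does not exist yet.
try:
    ecr.create_repository(repositoryName=repo_name)
except ecr.exceptions.RepositoryAlreadyExistsException:
    pass

# Fetch a temporary docker login token for the registry.
auth = ecr.get_authorization_token()["authorizationData"][0]
user, password = base64.b64decode(auth["authorizationToken"]).decode().split(":")
registry = auth["proxyEndpoint"].replace("https://", "")

# Log in, build, tag, and push the image.
subprocess.run(["docker", "login", "-u", user, "-p", password, registry], check=True)
subprocess.run(["docker", "build", "-t", repo_name, "."], check=True)
image_uri = f"{registry}/{repo_name}:latest"
subprocess.run(["docker", "tag", repo_name, image_uri], check=True)
subprocess.run(["docker", "push", image_uri], check=True)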
- Prepare a minimal training script, train.py, and upload it to a SageMaker notebook:
import os

import pandas as pd
from simpletransformers.ner import NERModel

if __name__ == "__main__":
    # Retrieve data and labels
    data = pd.read_csv(os.path.join(os.environ["SM_CHANNEL_TRAINING"], "training.csv"))
    labels = data.labels.unique().tolist()
    # Instantiate a NER model
    args = {
        "output_dir": os.environ["SM_MODEL_DIR"],
        "reprocess_input_data": True,
        "num_train_epochs": 2,
        "train_batch_size": 8,
    }
    model = NERModel("bert", "bert-base-uncased", args=args, labels=labels)
    model.train_model(data)
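The training.csv read above is assumed to be in the standard simpletransformers NER layout (sentence_id, words, labels columns, one token per row); a toy file for illustration, reusing part of the label set that appears later in the logs:
import pandas as pd

# One row per token; sentence_id groups tokens into sentences.
data = pd.DataFrame(
    {
        "sentence_id": [0, 0, 0, 1, 1],
        "words": ["John", "lives", "in", "Apple", "Inc."],
        "labels": ["O", "O", "O", "B-C", "I-C"],
    }
)
data.to_csv("training.csv", index=False)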
- Start the training from the SageMaker notebook:
from sagemaker.pytorch import PyTorch
from sagemaker import get_execution_role

ROLE = get_execution_role()
pytorch_estimator = PyTorch(
    entry_point="train.py",
    image_name="<docker_image>",
    train_instance_type="ml.p3.2xlarge",
    train_instance_count=1,
    role=ROLE,
)
pytorch_estimator.fit({"training": "<path_to_training.csv>"})
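As a side note on how the fit() channel maps into the container (consistent with the environment dump in the logs below): the channel named "training" is mounted at /opt/ml/input/data/training and exposed as SM_CHANNEL_TRAINING. A minimal sanity check one could drop into train.py:
import os

# SageMaker mounts the "training" channel here and exports the path.
channel_dir = os.environ["SM_CHANNEL_TRAINING"]  # /opt/ml/input/data/training
print(os.listdir(channel_dir))  # should list training.csv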
Expected behavior
I expected training with a larger number of epochs to work.
Screenshots
Desktop (please complete the following information):
- OS: Ubuntu 16.04, from the base SageMaker PyTorch image
Additional context
Please find the logs attached. I have removed the log events between the start of the first step of the first epoch and the last event received from the first epoch:
01:10:37
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
01:10:37
bash: no job control in this shell
01:10:39
2020-03-12 01:10:38,648 sagemaker-containers INFO Imported framework sagemaker_pytorch_container.training
01:10:39
2020-03-12 01:10:38,727 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
01:10:39
2020-03-12 01:10:38,728 sagemaker_pytorch_container.training INFO Invoking user training script.
01:10:39
2020-03-12 01:10:39,071 sagemaker-containers INFO Module default_user_module_name does not provide a setup.py.
01:10:39
Generating setup.py
01:10:39
2020-03-12 01:10:39,071 sagemaker-containers INFO Generating setup.cfg
01:10:39
2020-03-12 01:10:39,071 sagemaker-containers INFO Generating MANIFEST.in
01:10:39
2020-03-12 01:10:39,071 sagemaker-containers INFO Installing module with the following command:
01:10:39
/opt/conda/bin/python -m pip install .
01:10:40
Processing /tmp/tmpvf5izqsk/module_dir
01:10:40
Building wheels for collected packages: default-user-module-name Building wheel for default-user-module-name (setup.py): started Building wheel for default-user-module-name (setup.py): finished with status 'done' Created wheel for default-user-module-name: filename=default_user_module_name-1.0.0-py2.py3-none-any.whl size=4365 sha256=a7c076c4c020f4b8b9f85a40721d629b752f9b1006cdb0ba2bf27132f9b
01:10:40
Successfully built default-user-module-name
01:10:41
Installing collected packages: default-user-module-name
01:10:41
Successfully installed default-user-module-name-1.0.0
01:10:41
2020-03-12 01:10:41,498 sagemaker-containers INFO Invoking user script
01:10:41
Training Env:
01:10:41
{ "additional_framework_parameters": {}, "channel_input_dirs": { "training": "/opt/ml/input/data/training" }, "current_host": "algo-1", "framework_module": "sagemaker_pytorch_container.training:main", "hosts": [ "algo-1" ], "hyperparameters": { "n_gpu": 8, "batch_size": 32, "seed": 2, "lower": true, "epochs": 5
01:10:41
}
01:10:41
Environment variables:
01:10:41
SM_HOSTS=["algo-1"]
01:10:41
SM_NETWORK_INTERFACE_NAME=eth0
01:10:41
SM_HPS={"batch_size":32,"epochs":50,"lower":true,"n_gpu":8,"seed":2}
01:10:41
SM_USER_ENTRY_POINT=train.py
01:10:41
SM_FRAMEWORK_PARAMS={}
01:10:41
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
01:10:41
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
01:10:41
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
01:10:41
SM_CHANNELS=["training"]
01:10:41
SM_CURRENT_HOST=algo-1
01:10:41
SM_MODULE_NAME=train
01:10:41
SM_LOG_LEVEL=20
01:10:41
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
01:10:41
SM_INPUT_DIR=/opt/ml/input
01:10:41
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
01:10:41
SM_OUTPUT_DIR=/opt/ml/output
01:10:41
SM_NUM_CPUS=64
01:10:41
SM_NUM_GPUS=8
01:10:41
SM_MODEL_DIR=/opt/ml/model
01:10:41
SM_MODULE_DIR=<path>
01:10:41
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch_size":32,"epochs":50,"lower":true,"n_gpu":8,"seed":2},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"
01:10:41
SM_USER_ARGS=["--batch_size","32","--epochs","50","--lower","True","--n_gpu","8","--seed","2"]
01:10:41
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
01:10:41
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
01:10:41
SM_HP_N_GPU=8
01:10:41
SM_HP_BATCH_SIZE=32
01:10:41
SM_HP_SEED=2
01:10:41
SM_HP_LOWER=true
01:10:41
SM_HP_EPOCHS=50
01:10:41
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
01:10:41
Invoking script with the following command:
01:10:41
/opt/conda/bin/python train.py --batch_size 32 --epochs 50 --lower True --n_gpu 8 --seed 2
01:11:03
Converting to features started.
01:11:03
2020-03-12 01:10:45,583 [train.py ] INFO Starting training...
01:11:07
[2020-03-12 01:11:07.095 algo-1:102 INFO json_config.py:90] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
01:11:07
2020-03-12 01:10:45,584 [train.py ] INFO No dataset provided for testing...training will be splitted
01:11:07
[2020-03-12 01:11:07.096 algo-1:102 INFO hook.py:152] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
01:11:07
2020-03-12 01:10:45,584 [train.py ] INFO Reading training data at /opt/ml/input/data/training
01:11:07
[2020-03-12 01:11:07.096 algo-1:102 INFO hook.py:197] Saving to /opt/ml/output/tensors
01:11:07
2020-03-12 01:10:45,584 [train.py ] INFO Reading file : 00_data_IOB.csv, located at /opt/ml/input/data/training/00_data_IOB.csv
01:11:07
[2020-03-12 01:11:07.116 algo-1:102 INFO hook.py:326] Monitoring the collections: losses
01:11:07
train.py:95: SettingWithCopyWarning:
01:11:08
[2020-03-12 01:11:08.055 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0.attention.self NoneType
01:11:08
A value is trying to be set on a copy of a slice from a DataFrame.
01:11:08
[2020-03-12 01:11:08.055 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0.attention.self NoneType
01:11:08
Try using .loc[row_indexer,col_indexer] = value instead
01:11:08
[2020-03-12 01:11:08.055 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0.attention.self NoneType
01:11:08
[2020-03-12 01:11:08.055 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0.attention NoneType
01:11:08
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
01:11:08
[2020-03-12 01:11:08.064 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0 NoneType training["words"] = training["words"].apply(str)
01:11:08
[2020-03-12 01:11:08.064 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0 NoneType
01:11:08
train.py:96: SettingWithCopyWarning:
01:11:08
[2020-03-12 01:11:08.064 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0 NoneType
01:11:08
A value is trying to be set on a copy of a slice from a DataFrame.
01:11:08
[2020-03-12 01:11:08.069 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1.attention.self NoneType
01:11:08
Try using .loc[row_indexer,col_indexer] = value instead
01:11:08
[2020-03-12 01:11:08.069 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1.attention.self NoneType
01:11:08
[2020-03-12 01:11:08.070 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1.attention.self NoneType
01:11:08
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
01:11:08
[2020-03-12 01:11:08.070 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1.attention NoneType testing["words"] = testing["words"].apply(str)
01:11:08
[2020-03-12 01:11:08.073 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1 NoneType
01:11:08
train.py:99: SettingWithCopyWarning:
01:11:08
[2020-03-12 01:11:08.073 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1 NoneType
01:11:08
A value is trying to be set on a copy of a slice from a DataFrame.
01:11:08
[2020-03-12 01:11:08.073 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1 NoneType
01:11:08
Try using .loc[row_indexer,col_indexer] = value instead
01:11:08
[2020-03-12 01:11:08.078 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2.attention.self NoneType
01:11:08
[2020-03-12 01:11:08.078 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2.attention.self NoneType
01:11:08
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
01:11:08
[2020-03-12 01:11:08.078 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2.attention.self NoneType training["words"] = training["words"].apply(lambda x: x.lower().strip())
01:11:08
[2020-03-12 01:11:08.079 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2.attention NoneType
01:11:08
train.py:100: SettingWithCopyWarning:
01:11:08
[2020-03-12 01:11:08.081 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2 NoneType
01:11:08
A value is trying to be set on a copy of a slice from a DataFrame.
01:11:08
[2020-03-12 01:11:08.081 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2 NoneType
01:11:08
Try using .loc[row_indexer,col_indexer] = value instead
01:11:08
[2020-03-12 01:11:08.081 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2 NoneType
01:11:08
[2020-03-12 01:11:08.087 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3.attention.self NoneType
01:11:08
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
01:11:08
[2020-03-12 01:11:08.087 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3.attention.self NoneType testing["words"] = testing["words"].apply(lambda x: x.lower().strip())
01:11:08
[2020-03-12 01:11:08.087 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3.attention.self NoneType
01:11:08
2020-03-12 01:10:45,676 [train.py ] INFO Training datapoitns : 71712
01:11:08
[2020-03-12 01:11:08.087 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3.attention NoneType
01:11:08
2020-03-12 01:10:45,676 [train.py ] INFO TraininTestingg datapoitns : 18001
01:11:08
[2020-03-12 01:11:08.090 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3 NoneType
01:11:08
2020-03-12 01:10:45,681 [train.py ] INFO Training the model with label : ['O', 'B-E', 'I-E', 'B-C', 'I-C']
01:11:08
[2020-03-12 01:11:08.090 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3 NoneType
01:11:08
2020-03-12 01:10:45,681 [train.py ] INFO Training the model with arguments : {'output_dir': '/opt/ml/model/model', 'reprocess_input_data': True, 'num_train_epochs': 50, 'train_batch_size': 32, 'fp16': False, 'save_eval_checkpoints': False, 'save_steps': 8223372036854775807, 'save_model_every_epoch': False, 'overwrite_output_dir': True, 'logging_steps': 500, 'silent': False, 'use_early_stopp
01:11:08
[2020-03-12 01:11:08.090 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3 NoneType
01:11:08
#015Downloading: 0%| | 0.00/361 [00:00<?, ?B/s]#015Downloading: 100%|██████████| 361/361 [00:00<00:00, 366kB/s]
01:11:08
[2020-03-12 01:11:08.095 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4.attention.self NoneType
01:11:08
#015Downloading: 0%| | 0.00/440M [00:00<?, ?B/s]#015Downloading: 1%| | 4.70M/440M [00:00<00:09, 47.0MB/s]#015Downloading: 2%|▏ | 9.55M/440M [00:00<00:09, 47.4MB/s]#015Downloading: 3%|▎ | 14.4M/440M [00:00<00:08, 47.9MB/s]#015Downloading: 4%|▍ | 19.0M/440M [00:00<00:08, 47.2MB/s]#015Downloading: 5%|▌ | 23.5M/440M [00:00<00:08, 46.5MB/s]#
01:11:08
[2020-03-12 01:11:08.095 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4.attention.self NoneType
01:11:08
#015Downloading: 0%| | 0.00/232k [00:00<?, ?B/s]#015Downloading: 100%|██████████| 232k/232k [00:00<00:00, 26.4MB/s]
01:11:08
[2020-03-12 01:11:08.095 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4.attention.self NoneType
01:11:08
#015 0%| | 0/5234 [00:00<?, ?it/s]#015 0%| | 1/5234 [00:00<46:04, 1.89it/s]#015 57%|█████▋ | 3001/5234 [00:00<13:45, 2.70it/s]#015100%|██████████| 5234/5234 [00:00<00:00, 6824.67it/s]
01:11:08
[2020-03-12 01:11:08.096 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4.attention NoneType
01:11:08
#015Epoch: 0%| | 0/50 [00:00<?, ?it/s]
01:11:08
[2020-03-12 01:11:08.098 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4 NoneType
01:11:08
#015Current iteration: 0%| | 0/164 [00:00<?, ?it/s]#033[A
01:11:08
[2020-03-12 01:11:08.099 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4 NoneType
01:11:08
#015Current iteration: 1%| | 1/164 [00:01<03:49, 1.41s/it]#033[A
01:11:08
[2020-03-12 01:11:08.099 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4 NoneType
01:11:09
#015Current iteration: 1%| | 2/164 [00:01<02:51, 1.06s/it]#033[A
01:11:09
[2020-03-12 01:11:08.104 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.5.attention.self NoneType
01:11:09
#015Current iteration: 2%|▏ | 3/164 [00:01<02:10, 1.23it/s]#033[A
...
01:13:09
[2020-03-12 01:13:09.473 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.11 NoneType
01:13:09
#015Current iteration: 82%|████████▏ | 134/164 [00:33<00:07, 4.15it/s]#033[A
01:13:09
[2020-03-12 01:13:09.473 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.11 NoneType
01:13:09
#015Current iteration: 82%|████████▏ | 135/164 [00:33<00:07, 4.14it/s]#033[A
The training is stuck after the last event received. At the same time, GPU use drops to 0.
- Tested on 3 different SageMaker instance types