Description
Dear all,
I am facing a problem when running a training job with SageMaker.
First, I created my own Docker image containing the model I want to use (VAR modeling from the statsmodels.tsa.api package).
Following the documentation, I have a folder organized as follows:
train-image/
|-- var/
|------- nginx.conf
|------- serve
|------- train
|------- wsgi.py
|-- Dockerfile
|-- requirements.txt
I pushed my custom image to ECR with the following commands:

The error occurs during the .fit() call:

I get this error:
INFO:sagemaker:Creating training-job with name: var-HUBEAU-STREAM-FLOW-I522101001-2024-05-14-12-56-19-331
2024-05-14 12:56:20 Starting - Starting the training job...
2024-05-14 12:56:38 Starting - Preparing the instances for training...
2024-05-14 12:57:17 Downloading - Downloading the training image
2024-05-14 12:57:17 Training - Training image download completed. Training in progress..[FATAL tini (7)] exec train failed: Permission denied
2024-05-14 12:57:45 Uploading - Uploading generated training model
2024-05-14 12:57:45 Failed - Training job failed
I tried adding logs/prints to my train file (in the train-image/var/ folder), but none of them show up. I think the job fails before the script even starts.
I checked the permissions of the role running the training job and gave it full access to SageMaker and S3.
I also checked the S3 bucket, and the training dataset was successfully saved there. I tried downloading it with a debug script and it worked, so I can access this file.
I have read tons of issues but never found anything related to my error message.
[FATAL tini (7)] exec train failed: Permission denied
Traceback (most recent call last):
File "/home/faustine/Documents/mantorai/train/resources/streamflow/main.py", line 16, in var_train_jobs
raise exp
File "/home/faustine/Documents/mantorai/train/resources/streamflow/main.py", line 13, in var_train_jobs
varTrainer.var_train(local_run=False)
File "/home/faustine/Documents/mantorai/train/resources/streamflow/trainer.py", line 116, in var_train
var_model = varEstimator.model_data
File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
return run_func(*args, **kwargs)
File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/estimator.py", line 1292, in fit
self.latest_training_job.wait(logs=logs)
File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/estimator.py", line 2474, in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 4849, in logs_for_job
_logs_for_job(self.boto_session, job_name, wait, poll, log_type, timeout)
File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 6760, in _logs_for_job
_check_job_status(job_name, description, "TrainingJobStatus")
File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 6813, in _check_job_status
raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job var-HUBEAU-STREAM-FLOW-I522101001-2024-05-14-12-56-19-331: Failed. Reason: AlgorithmError: , exit code: 126
python-BaseException
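One local check that may help narrow this down: exit code 126 conventionally means a command was found but could not be executed, which fits the tini message "exec train failed: Permission denied". The sketch below assumes (this is not confirmed anywhere in the report) that the `train` and `serve` files lost their executable bit before the image was built. The helper names `missing_exec_bit` and `add_exec_bit` are hypothetical, introduced only for illustration.

```python
import os
import stat
from pathlib import Path


def missing_exec_bit(image_dir, names=("train", "serve")):
    """Return the entry-point files in image_dir that lack the owner execute bit.

    Files that do not exist are skipped; only existing regular files are checked.
    """
    missing = []
    for name in names:
        path = Path(image_dir) / name
        if path.is_file() and not (path.stat().st_mode & stat.S_IXUSR):
            missing.append(path)
    return missing


def add_exec_bit(path):
    """Add execute permission for user, group, and others (like `chmod +x`)."""
    mode = Path(path).stat().st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

Calling `missing_exec_bit("train-image/var")` from the directory that contains `train-image/` would list the affected files, and `add_exec_bit` can then be applied to each one. If this assumption is right, the image has to be rebuilt and pushed again afterwards, because the file modes are fixed when the files are copied into the image.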
System information
- SageMaker Python SDK version: sagemaker==2.177.1
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): VAR (from statsmodels.tsa.api)
- Framework version: statsmodels==0.14.0
- Python version: 3.10
- CPU or GPU: CPU
- Custom Docker image (Y/N): Yes
Thank you in advance for your help!
Sorry that I cannot provide a reproducible example.



