Training job failed "Permission denied" - with custom docker image #4678

@FaustineBt

Dear all,

I am facing a problem when running a training job with SageMaker.
First, I built my own Docker image containing the model I want to use (VAR modeling from the statsmodels.tsa.api package).
Following the documentation, I organized the folder as follows:
train-image/
|-- var/
|------- nginx.conf
|------- serve
|------- train
|------- wsgi.py
|-- Dockerfile
|-- requirements.txt

My Dockerfile:
[image: Dockerfile]

I pushed my custom image to ECR with the following commands:
[image: ECR push commands]

The error occurs during the .fit() call:
[image: .fit() call]

I get this error:

INFO:sagemaker:Creating training-job with name: var-HUBEAU-STREAM-FLOW-I522101001-2024-05-14-12-56-19-331
2024-05-14 12:56:20 Starting - Starting the training job...
2024-05-14 12:56:38 Starting - Preparing the instances for training...
2024-05-14 12:57:17 Downloading - Downloading the training image
2024-05-14 12:57:17 Training - Training image download completed. Training in progress..[FATAL tini (7)] exec train failed: Permission denied
2024-05-14 12:57:45 Uploading - Uploading generated training model
2024-05-14 12:57:45 Failed - Training job failed

I tried adding logs/prints to my train file (in the train-image/var/ folder), but none of them show up, so I think the job fails before the script even starts.

I checked the permissions of the role running the training job and granted it full access to SageMaker and S3.

I also checked the S3 bucket, and the training dataset was saved there successfully. I downloaded it with a debug script and it worked, so I can access this file.

I have read many issues but never found anything related to my error message:
[FATAL tini (7)] exec train failed: Permission denied
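
One thing that may be worth checking (an assumption, since the Dockerfile and build commands are not visible in this thread): exit code 126 conventionally means a command was found but could not be executed, which would be consistent with the train file lacking its execute bit when it is copied into the image. Below is a minimal Python sketch for inspecting and setting that bit on the local files before building the image. The helper names `has_exec_bit` and `add_exec_bit` are hypothetical, and the paths are taken from the folder layout above.

```python
import os
import stat

# All three execute bits (user, group, other), as set by `chmod a+x`.
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def has_exec_bit(path: str) -> bool:
    """Return True if any execute bit is set on the file at `path`."""
    return bool(os.stat(path).st_mode & EXEC_BITS)


def add_exec_bit(path: str) -> None:
    """Add execute permission for user, group, and other (like `chmod a+x`)."""
    os.chmod(path, os.stat(path).st_mode | EXEC_BITS)


if __name__ == "__main__":
    # Paths assumed from the folder layout described in this issue.
    for name in ("train", "serve"):
        script = os.path.join("train-image", "var", name)
        if not os.path.exists(script):
            print(f"{script}: not found, skipping")
            continue
        if not has_exec_bit(script):
            print(f"{script}: not executable, adding execute bit")
            add_exec_bit(script)
        else:
            print(f"{script}: already executable")
```

If the bit was missing, the image would need to be rebuilt and pushed again for the change to take effect. This only rules one possible cause in or out; it is not a confirmed diagnosis.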

Traceback (most recent call last):
  File "/home/faustine/Documents/mantorai/train/resources/streamflow/main.py", line 16, in var_train_jobs
    raise exp
  File "/home/faustine/Documents/mantorai/train/resources/streamflow/main.py", line 13, in var_train_jobs
    varTrainer.var_train(local_run=False)
  File "/home/faustine/Documents/mantorai/train/resources/streamflow/trainer.py", line 116, in var_train
    var_model = varEstimator.model_data
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
    return run_func(*args, **kwargs)
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/estimator.py", line 1292, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/estimator.py", line 2474, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 4849, in logs_for_job
    _logs_for_job(self.boto_session, job_name, wait, poll, log_type, timeout)
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 6760, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 6813, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job var-HUBEAU-STREAM-FLOW-I522101001-2024-05-14-12-56-19-331: Failed. Reason: AlgorithmError: , exit code: 126
python-BaseException

Screenshots or logs
[3 images]

System information

  • SageMaker Python SDK version: sagemaker==2.177.1
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): VAR (from statsmodels.tsa.api)
  • Framework version: statsmodels==0.14.0
  • Python version: 3.10
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Yes

Thank you in advance for your help!
Sorry that I cannot provide a reproducible example.
