Training job failed "Permission denied" - with custom docker image

Dear all,

I am facing a problem when running a training job with SageMaker. 
First, I created my own Docker image with the model I want to use (VAR modeling from the statsmodels.tsa.api package). 
Following the documentation, I have a folder organized as follows:
train-image/
|-- var/
|------- nginx.conf
|------- serve
|------- train
|------- wsgi.py
|-- Dockerfile
|-- requirements.txt

My Dockerfile:
![image](https://github.com/aws/sagemaker-python-sdk/assets/136351683/af1598c7-eae2-47d1-9e14-09bb58c4e6b0)

I pushed my custom image to ECR with the following commands:
![image](https://github.com/aws/sagemaker-python-sdk/assets/136351683/0f13f3da-59e2-4dc8-81ac-a4a2bbf17f55)

The error occurs during the `.fit()` call:
![image](https://github.com/aws/sagemaker-python-sdk/assets/136351683/d9ad9465-78c8-462f-a070-85d0efaec861)

I get this error:

> INFO:sagemaker:Creating training-job with name: var-HUBEAU-STREAM-FLOW-I522101001-2024-05-14-12-56-19-331
2024-05-14 12:56:20 Starting - Starting the training job...
2024-05-14 12:56:38 Starting - Preparing the instances for training...
2024-05-14 12:57:17 Downloading - Downloading the training image
2024-05-14 12:57:17 Training - Training image download completed. Training in progress..[FATAL tini (7)] exec train failed: Permission denied
2024-05-14 12:57:45 Uploading - Uploading generated training model
2024-05-14 12:57:45 Failed - Training job failed

I tried adding some logs/prints to my train file (in the train-image/var/ folder), but I can't see any of them. I think the job fails before the script even starts running.
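Since the message says `exec train failed: Permission denied`, one thing that can be checked on the build machine is whether the `train` file has an execute bit at all. As far as I understand, Docker's `COPY` keeps the file mode from the build context, so a non-executable `train` on disk would stay non-executable in the image. Below is a minimal sketch of such a check; the helper name `entrypoint_problems` is hypothetical, and the path assumes the folder layout shown above.

```python
import stat
from pathlib import Path


def entrypoint_problems(path):
    """Return reasons why exec-ing `path` could fail with 'Permission denied'."""
    p = Path(path)
    if not p.is_file():
        return [f"{path}: not a regular file"]
    problems = []
    # Any of the user/group/other execute bits counts as "executable" here.
    any_exec = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    if not p.stat().st_mode & any_exec:
        problems.append(f"{path}: no execute bit set")
    return problems


if __name__ == "__main__":
    # Assumed location of the script in the build context; adjust as needed.
    for problem in entrypoint_problems("train-image/var/train"):
        print(problem)
```

If this reports a missing execute bit, setting it (for example with `chmod +x` before building the image) would be the first thing to try. It does not rule out other causes.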

I checked the permissions of the role running the training job and gave it full access to SageMaker and S3. 

I also checked the S3 bucket, and the training dataset is successfully saved there. I tried downloading it with a debug script and it worked, so I can access this file.

I read tons of issues but never found anything related to my error message:
[FATAL tini (7)] exec train failed: Permission denied

```
Traceback (most recent call last):
  File "/home/faustine/Documents/mantorai/train/resources/streamflow/main.py", line 16, in var_train_jobs
    raise exp
  File "/home/faustine/Documents/mantorai/train/resources/streamflow/main.py", line 13, in var_train_jobs
    varTrainer.var_train(local_run=False)
  File "/home/faustine/Documents/mantorai/train/resources/streamflow/trainer.py", line 116, in var_train
    var_model = varEstimator.model_data
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
    return run_func(*args, **kwargs)
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/estimator.py", line 1292, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/estimator.py", line 2474, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 4849, in logs_for_job
    _logs_for_job(self.boto_session, job_name, wait, poll, log_type, timeout)
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 6760, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/faustine/.pyenv/versions/3.10.4/lib/python3.10/site-packages/sagemaker/session.py", line 6813, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job var-HUBEAU-STREAM-FLOW-I522101001-2024-05-14-12-56-19-331: Failed. Reason: AlgorithmError: , exit code: 126
python-BaseException
```
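The `exit code: 126` in the failure reason may be a useful hint. In POSIX shells, 126 is the conventional status for a command that was found but could not be executed, and tini appears to follow the same convention when `exec` fails with "Permission denied". The sketch below reproduces that status locally, without SageMaker, by running a script that lacks the execute bit through `sh`. The function name and the temporary script are hypothetical and only for illustration.

```python
import os
import shlex
import subprocess
import tempfile


def exit_code_without_exec_bit():
    """Run a script that lacks the execute bit through sh; return (code, stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "train")
        with open(script, "w") as fh:
            fh.write("#!/bin/sh\necho training\n")
        os.chmod(script, 0o644)  # readable and writable, but not executable
        result = subprocess.run(
            ["sh", "-c", shlex.quote(script)],
            capture_output=True,
            text=True,
        )
        return result.returncode, result.stderr.strip()


if __name__ == "__main__":
    code, message = exit_code_without_exec_bit()
    print(code, message)
```

If the status printed here matches the one reported by the training job, that would be consistent with the container failing to launch `train` at all, which would also explain why none of the prints inside the script show up.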


**Screenshots or logs**
![image](https://github.com/aws/sagemaker-python-sdk/assets/136351683/8f428cc0-7608-4443-8d92-3761d3cd17b0)
![image](https://github.com/aws/sagemaker-python-sdk/assets/136351683/53707f1b-f87f-4e03-b538-a1604c6378f4)
![image](https://github.com/aws/sagemaker-python-sdk/assets/136351683/934c878c-0e56-42ad-86fd-48cbea222beb)


**System information**
- **SageMaker Python SDK version**: sagemaker==2.177.1
- **Framework name (eg. PyTorch) or algorithm (eg. KMeans)**: VAR (from statsmodels.tsa.api)
- **Framework version**: statsmodels==0.14.0
- **Python version**: 3.10
- **CPU or GPU**: CPU
- **Custom Docker image (Y/N)**: Yes

Thank you in advance for your help!
Sorry that I cannot make my issue reproducible for you.
