Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

requirements_file doesn't seem to do anything with TensorFlow estimator #154

Closed
zmjjmz opened this Issue Apr 17, 2018 · 11 comments

Comments

Projects
None yet
5 participants
@zmjjmz
Copy link

zmjjmz commented Apr 17, 2018

Hey there,

In the documentation I see a pip-style requirements.txt is supposedly supported for installing arbitrary Python packages in the containers. I haven't seen the need to try it yet, but recently wanted to test out something that required a new package and was surprised to find that well, nothing happened, and then I got a fun little ImportError.

To test this further, I put some garbage into the requirements file, and found that (when I went back to the version of my code that didn't require the extra package) it chugged along just fine. I know that the SDK ensures that the requirements file exists in the source directory, since I tested that, but otherwise it seems to me like nothing is being done with it once it's sent over.

Here's the sample requirements file that didn't seem to get parsed:

tensorflow==1.7.0      
tqdm==4.7.4            
sjdhjaskldhjaskl==1.2.0

When I use this file locally, I get the following:

(dataplayground2) zach@wa1okdba002:~/stats/omc/utils/sagemaker_source_dir$ pip install -r requirements.txt                                     
Requirement already satisfied: tensorflow==1.7.0 in /home/u1/zach/proj/dataplayground2/lib/python2.7/site-packages (from -r requirements.txt (l
ine 1))                                                                                                                                        
Requirement already satisfied: tqdm==4.7.4 in /home/u1/zach/proj/dataplayground2/lib/python2.7/site-packages (from -r requirements.txt (line 2)
)                                                                                                                                              
Collecting sjdhjaskldhjaskl==1.2.0 (from -r requirements.txt (line 3))                                                                         
  Could not find a version that satisfies the requirement sjdhjaskldhjaskl==1.2.0 (from -r requirements.txt (line 3)) (from versions: )        
No matching distribution found for sjdhjaskldhjaskl==1.2.0 (from -r requirements.txt (line 3))                                                                                                                     

Thanks,
Zach

@zmjjmz zmjjmz changed the title `requirements.txt` doesn't seem to do anything with TensorFlow estimator requirements_file doesn't seem to do anything with TensorFlow estimator Apr 17, 2018

@ChoiByungWook

This comment has been minimized.

Copy link
Contributor

ChoiByungWook commented Apr 18, 2018

Hello,

Thanks for using SageMaker!

Can you please provide the following?

  1. Python sdk version.
  2. Tensorflow training image name and tag. (This can be found through the aws console. AWS Consoel -> SageMaker -> Jobs. From there, click on a job and looking for "training image")

Thanks!

@zmjjmz

This comment has been minimized.

Copy link
Author

zmjjmz commented Apr 18, 2018

Hey there,

  1. So, uh, I'm kinda ahead of the latest release version (1.2.2) since I made a (very unrelated) bug fix as per this issue the fix for which doesn't seem to have been released to PyPI afaict just yet.
  2. 520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tensorflow:1.6-cpu-py2

Thanks!

@zmjjmz

This comment has been minimized.

Copy link
Author

zmjjmz commented Apr 23, 2018

Hey there!

I've recently come into a situation where I'm going to need a new version of Tensorflow in place to unbreak something (details here).

In order to do so I need the requirements.txt file to work -- has there been any update on this? Thanks.

@ChoiByungWook

This comment has been minimized.

Copy link
Contributor

ChoiByungWook commented Apr 24, 2018

Hello @zmjjmz,

I apologize for the lack of response.

Did this error occur while attempting to train within the container?

We only install the dependencies specified in the requirements.txt if you attempt to train within the container. The is handled through our container library. Here is the code that installs the dependencies from the provided requirements.txt.

The workflow is initiated through this entry point. This is automatically called when training on SageMaker using our estimators.

@zmjjmz

This comment has been minimized.

Copy link
Author

zmjjmz commented Apr 24, 2018

Hey there,

This occurs (or... doesn't occur?) when calling the estimator's constructor, as in here.

If I look for the logs from this line I don't see anything, so my suspicion is that it's not hitting that code path at all. I don't have a good way to inject any sort of debugging in there however, so I can't really provide more detail.

I'm best able to test this with 'local' instances, however to do so I'd like to ensure that I have the latest container version. Is there any way to force it to update the container version when I run it? Thanks.

@zmjjmz

This comment has been minimized.

Copy link
Author

zmjjmz commented Apr 24, 2018

Hey, did a bit more digging actually:

It looks like there's two repos: sagemaker-containers and sagemaker-tensorflow-containers. I'm not sure which one should be handling the requirements file, but I imagine it's sagemaker-containers. Anyways, I can't find a single reference to user_requirements_file in the latter repo.

In the ContainerEnvironment class user_requirements_file is initialized as None here. The comment above it implies to me that there's something that sub-classes ContainerEnvironment. It appears that TrainingEnvironment and HostingEnvironment set this property based on looking in some hyperparameters.json for 'sagemaker_requirements' in the former case and an environment variable 'SAGEMAKER_REQUIREMENTS' in the latter case (why?).

So, I'm guessing that requirements_file argument should make its way into those hyperparameters (or the environment variables), but that's not happening. Where is that being set up?

Note: I tried to set sagemaker_requirements accordingly in the hyperparameters that I pass on to the training job, but didn't meet with any success :(

@zmjjmz

This comment has been minimized.

Copy link
Author

zmjjmz commented Apr 24, 2018

Ok, so I'm incredibly confused. I threw a breakpoint into my estimator's model_fn and found that sagemaker_requirements was set to requirements.txt and that os.path.exists('/opt/ml/code/requirements.txt') returned true. I was even able to execute the pip install command and install stuff!

So, I really have no clue what's going on. I guess I could, as a workaround, do this in the model_fn :) But that's obviously not ideal.

My only guesses at this point are:

  1. pip_install_requirements is never being called (which doesn't make sense)
  2. Somehow /opt/ml/code isn't fully populated when it runs, so it's not there.

I can't really test these right now, but hopefully this helps.

@lukmis

This comment has been minimized.

Copy link
Contributor

lukmis commented Apr 30, 2018

Hi zmjjmz,

sorry you are having these confusions here. I've just looked at the code and can confirm that for training, the name of the file is passed in the hyperparameters and shall be initialized during Trainer.start. All of which you have confirmed now.

The reason though seems to be the order of operations. When Trainer.start() calls 'pip_install_requirements' the code is not downloaded yet and it is handled silently. The code is downloaded later in the 'train_entry_point.py' in train().
This also explains why you saw the file and was able to pip install manually.

@lukmis lukmis added the bug label Apr 30, 2018

@laurenyu

This comment has been minimized.

Copy link
Member

laurenyu commented May 2, 2018

@lukmis has merged fixes in aws/sagemaker-containers#39 and aws/sagemaker-tensorflow-container#35, but I'm going to leave this issue open until those changes are released.

@zmjjmz

This comment has been minimized.

Copy link
Author

zmjjmz commented May 9, 2018

I think this fix has been released -- new training instances install the requirements_file properly. I'm having other issues, but I'll make a separate ticket for that :)

@zmjjmz zmjjmz closed this May 9, 2018

@Kaidanov

This comment has been minimized.

Copy link

Kaidanov commented May 17, 2018

awslabs/amazon-sagemaker-examples#249

I can't pip install tqdm into sagemaker .. works fine localy

Any idea?

apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018

Merge pull request aws#154 from awslabs/mvs-scalar-data
ResNet CIFAR 10 generates scalar data faster
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.