
Fix TensorFlow log syncing bug #105

Merged (12 commits) on Mar 22, 2018
Conversation

@jbencook

This is a solution to #26. As soon as aws s3 sync adds temp files to the TensorFlow logs directory, TensorBoard stops looking at the correct files for updates. This works around the issue by creating a temporary sync directory for aws s3 sync, then synchronously copying files to the logs directory after the download finishes.

@ChoiByungWook
Contributor

Hi,

Thanks for submitting your pull request. This is awesome.

Can you please add unit tests?

@winstonaws (Contributor) left a comment

Change looks good overall!

Can you add a new unit test file to https://github.com/aws/sagemaker-python-sdk/tree/master/tests/unit with tests on the _sync_directories function? Here are some cases I can think of that would be good to cover:

  • to_directory does not exist
  • to_directory exists, child dirs within it don't exist
  • to_directory exists with the same structure and files as from_directory

And from_directory should have at least one child dir with files in it.
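The cases listed above might be sketched as tests along these lines. For self-containment this uses a stand-in `_sync_directories` built on `shutil.copytree(..., dirs_exist_ok=True)` (Python 3.8+); that stand-in is an assumption with the same observable behavior, not the SDK's implementation:

```python
import os
import shutil
import tempfile


def _sync_directories(from_directory, to_directory):
    # Stand-in: overwrite files that exist, create those that don't,
    # never delete anything already in to_directory.
    shutil.copytree(from_directory, to_directory, dirs_exist_ok=True)


def _make_source():
    # from_directory with at least one child dir containing a file.
    src = tempfile.mkdtemp()
    os.makedirs(os.path.join(src, "run1"))
    with open(os.path.join(src, "run1", "events.out"), "w") as f:
        f.write("payload")
    return src


def test_to_directory_does_not_exist():
    src = _make_source()
    dst = os.path.join(tempfile.mkdtemp(), "logs")  # does not exist yet
    _sync_directories(src, dst)
    with open(os.path.join(dst, "run1", "events.out")) as f:
        assert f.read() == "payload"


def test_child_dirs_do_not_exist():
    src = _make_source()
    dst = tempfile.mkdtemp()  # dst exists, but run1 inside it does not
    _sync_directories(src, dst)
    assert os.path.isdir(os.path.join(dst, "run1"))


def test_same_structure_and_files():
    src, dst = _make_source(), _make_source()  # identical trees
    _sync_directories(src, dst)
    with open(os.path.join(dst, "run1", "events.out")) as f:
        assert f.read() == "payload"
```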

@staticmethod
def _sync_directories(from_directory, to_directory):
    """Sync to_directory with from_directory by copying each file in
    to_directory with new contents. Why do this? Because TensorBoard picks
Contributor


Can you add a sentence to the docstring which clarifies that this function will overwrite the contents of existing files in the to_directory, but does not delete the files? This is key because it explains why we implement it in this particular way, rather than using shutil.rmtree / shutil.copytree or similar.
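To illustrate why the overwrite-in-place behavior matters (this is a demonstration of the rationale, not code from the PR): overwriting a file's contents leaves it as the same filesystem object, while deleting and recreating it produces a new one that a watcher like TensorBoard may never pick up.

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "events.out")
with open(path, "w") as f:
    f.write("step 1")
inode_before = os.stat(path).st_ino

# Overwriting contents in place keeps the same inode, so a watcher that
# already found this file keeps seeing its updates.
with open(path, "w") as f:
    f.write("step 2")
assert os.stat(path).st_ino == inode_before

# Deleting and recreating (which is what shutil.rmtree + shutil.copytree
# amounts to) replaces the file with a brand-new filesystem object.
os.remove(path)
with open(path, "w") as f:
    f.write("step 3")
```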

@codecov-io commented Mar 22, 2018

Codecov Report

Merging #105 into master will increase coverage by 0.02%.
The diff coverage is 78.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #105      +/-   ##
==========================================
+ Coverage   89.72%   89.74%   +0.02%     
==========================================
  Files          34       34              
  Lines        2015     2039      +24     
==========================================
+ Hits         1808     1830      +22     
- Misses        207      209       +2
Impacted Files                          Coverage Δ
src/sagemaker/tensorflow/estimator.py   80% <78.57%> (+2.77%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2af9d45...92b5e7d. Read the comment docs.

@jbencook (Author)

OK, I made the docstring more explicit and added unit tests for the _sync_directories method. Sorry for all the failing builds; I'm having trouble with the tox tests on my laptop.

I didn't realize the tempfile.TemporaryDirectory context manager isn't available in Python 2.7. I think the simplest approach is to roll it by hand, similar to what this answer recommends. But I didn't add unit tests for the context manager, and I'm not crazy about it living in tensorflow.estimator. Happy to change how I handled that if you all prefer something else.

@winstonaws (Contributor) left a comment

I'm happy with you adding the context manager into tensorflow/estimator.py since we are only using it there at the moment; we can consider moving it somewhere else if we ever need it elsewhere.

I'd recommend putting the temporary_directory function as a "private" staticmethod in Tensorboard, just like _sync_directories, to make it clear that it's only intended to be used there at the moment.

I also have one more question about the filecmp usage in the tests below. These comments are rather nitpicky, though; everything looks great! Thanks for adding tests!

for fname in comp.common_files:
    left_file = os.path.join(a, fname)
    right_file = os.path.join(b, fname)
    if not filecmp.cmp(left_file, right_file):
Contributor


Doesn't dircmp already do a shallow comparison of the files to build the common_files list? Should you be passing shallow=False here? (Or just remove this check?)

@jbencook (Author)

common_files is just a list of files with the same name (whether or not they're actually the same). But it looks like we can get rid of this by checking len(comp.diff_files). I'll make that change.
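The distinction can be seen in a small sketch (directory names and file contents here are made up for illustration):

```python
import filecmp
import os
import tempfile

# Two directories whose files share a name but differ in content
# (different sizes, so even a shallow comparison flags them).
a = tempfile.mkdtemp()
b = tempfile.mkdtemp()
for d, text in ((a, "v1"), (b, "version 2")):
    with open(os.path.join(d, "events.out"), "w") as f:
        f.write(text)

comp = filecmp.dircmp(a, b)
# common_files lists files present in both trees regardless of content;
# diff_files lists the common files whose contents differ.
assert comp.common_files == ["events.out"]
assert comp.diff_files == ["events.out"]
```

So checking `len(comp.diff_files)` captures "same names, different contents" without a second pass over common_files.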

@winstonaws (Contributor) left a comment

Thanks again for figuring this out and getting this in!

@winstonaws winstonaws merged commit da14a6b into aws:master Mar 22, 2018
@jbencook (Author)

No problem - happy to help!

@jbencook jbencook deleted the UpdateTFLogs branch March 23, 2018 11:34
apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this pull request Nov 15, 2018