Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding SnapshotManager #1598

Merged
merged 10 commits into from
Jan 15, 2021
Merged

Adding SnapshotManager #1598

merged 10 commits into from
Jan 15, 2021

Conversation

lematt1991
Copy link
Contributor

This allows users to create a snapshot of the current git repository
and launch the job from this snapshot. This can prevent jobs that
are slow to start or re-queued from picking up local changes

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 12, 2021
This allows users to create a snapshot of the current `git` repository
and launch the job from this snapshot.  This can prevent jobs that
are slow to start or requeued from picking up local changes
continue
dest_file = os.path.join(self.snapshot_dir, file)
os.makedirs(os.path.dirname(dest_file), exist_ok=True)
shutil.copy(os.path.join(root_dir, file), dest_file)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rsync of a bunch of files together is likely faster

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use Rsync then I think Rsync should appear in the name just to make it clear on the user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also rsync can respect .gitignore which we probably want to do to avoid copying models/checkpoints in the root dir: rsync --exclude='.git/' --filter=':- .gitignore'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @stephenroller for the suggestion! I'm getting a list of all the checked in files using git ls-files and stashing them in a temporary file that gets passed to rsync now. I think this should address the above comments (fast and avoid copying models/checkpoints etc).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use Rsync then I think Rsync should appear in the name just to make it clear on the user.

I would also tend to check that Rsync is installed in the init, and provide a nice message as error explaining that it is needed and how to install it? Would be a nice user experience

Copy link
Contributor

@gwenzek gwenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution, this has been asked by several users !

submitit/test_snapshot.py Outdated Show resolved Hide resolved
continue
dest_file = os.path.join(self.snapshot_dir, file)
os.makedirs(os.path.dirname(dest_file), exist_ok=True)
shutil.copy(os.path.join(root_dir, file), dest_file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use Rsync then I think Rsync should appear in the name just to make it clear on the user.

submitit/snapshot.py Outdated Show resolved Hide resolved
submitit/snapshot.py Outdated Show resolved Hide resolved
submitit/snapshot.py Outdated Show resolved Hide resolved
continue
dest_file = os.path.join(self.snapshot_dir, file)
os.makedirs(os.path.dirname(dest_file), exist_ok=True)
shutil.copy(os.path.join(root_dir, file), dest_file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also rsync can respect .gitignore which we probably want to do to avoid copying models/checkpoints in the root dir: rsync --exclude='.git/' --filter=':- .gitignore'

submitit/snapshot.py Outdated Show resolved Hide resolved
submitit/test_snapshot.py Outdated Show resolved Hide resolved
Copy link

@stephenroller stephenroller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment is addressed, but deferring to Guillaume

@gwenzek
Copy link
Contributor

gwenzek commented Jan 13, 2021

My comment is addressed, but deferring to Guillaume

@stephenroller do you know what would be the best way to have rsync in our CI ?

Apparently we can just add - run: sudo apt-get install rsync to the .circleci config

@stephenroller
Copy link

AFAIK, rsync is installed by default on nearly every Linux distribution out there.

@lematt1991
Copy link
Contributor Author

Hmm, I'm not sure I understand what's going on with rsync. I can see it being installed in the create_env job, but the test_coverage job can't seem to find it.

@gwenzek
Copy link
Contributor

gwenzek commented Jan 13, 2021

Sorry for the mess. I think you need to add the sudo apt-get install rsync directly inside the test_coverage. Here I believe: https://github.com/facebookincubator/submitit/pull/1598/files#diff-78a8a19706dbd2a4425dd72bdab0502ed7a2cef16365ab7030a5a0588927bf47R65

and the changes to venv caching aren't needed because rsync isn't part of the venv.

Copy link
Contributor

@gwenzek gwenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit, LGTM otherwise.

@jrapin do you think we should put this as submitit.SnapshotManager
or as submitit.helpers.SnapshotManager ?

If it's the second I think snapshot.py can by inlined inside helpers.py.

submitit/snapshot.py Outdated Show resolved Hide resolved
submitit/snapshot.py Outdated Show resolved Hide resolved
Copy link
Contributor

@jrapin jrapin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit, LGTM otherwise.

@jrapin do you think we should put this as submitit.SnapshotManager
or as submitit.helpers.SnapshotManager ?

If it's the second I think snapshot.py can by inlined inside helpers.py.

I was going to say exactly that before seeing your comment. helpers provides some tools (eg: CommandFunction, FunctionSequence etc) that can be used to launch jobs but are not strictly necessary, so I think it fits well.

submitit/snapshot.py Outdated Show resolved Hide resolved
@jrapin
Copy link
Contributor

jrapin commented Jan 14, 2021

Minor nit, LGTM otherwise.
@jrapin do you think we should put this as submitit.SnapshotManager
or as submitit.helpers.SnapshotManager ?
If it's the second I think snapshot.py can by inlined inside helpers.py.

I was going to say exactly that before seeing your comment. helpers provides some tools (eg: CommandFunction, FunctionSequence etc) that can be used to launch jobs but are not strictly necessary, so I think it fits well.

and btw, a line explaining how it works there would be great: https://github.com/facebookincubator/submitit/blob/master/docs/structure.md#helpers

@gwenzek gwenzek merged commit 25319d8 into facebookincubator:master Jan 15, 2021
@gwenzek
Copy link
Contributor

gwenzek commented Jan 15, 2021

Thanks a lot @lematt1991 for keeping up with the streams of comment !
The PR looks pretty great.
I'll play a bit with the feature before cutting a release.

@gwenzek
Copy link
Contributor

gwenzek commented Jan 29, 2021

It's part of 1.2.0 release, I'll think on if we should add a snapshot parameter to the executor for easy snapshotting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants