Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup build #77

Merged
merged 19 commits into from Jun 20, 2022
Merged

Cleanup build #77

merged 19 commits into from Jun 20, 2022

Conversation

Szubie
Copy link
Collaborator

@Szubie Szubie commented Jun 1, 2022

Currently if a build is interrupted, LXD images imported during the build process are often left on the server. These images will cause a conflict the next time a build is attempted, and in the meantime use up space needlessly. It would be nice if bravetools cleaned up after itself nicely.

These changes aim to ensure that interrupted builds are correctly cleaned up.

The following steps are taken to ensure this:

  • Images fingerprints needed for build process are recorded before starting the build. Upon cancellation, all new fingerprints are deleted to roll the system back to what it was before the build.
  • A separate goroutine intercepts SIGINT and sets abort build flag and cancels the context.Context.
  • After every stage of the build, the abort flag is checked.
  • Functions called during the build now accept a context.Context argument and check it at appropriate times to abort the build when it is safe to do so.

This change does its best to ensure that builds are only cancelled when safe to do so - for example, cancellation is not allowed during image publishing.

For now the idea to check the diff of the image fingerprints mimics existing behavior for retrieving the fingerprint used in bravetools. I have some ideas on how to improve this and make the fingerprint check more accurate later - this would allow for more granular cleanups of just the images created during the build.

  • This is now implemented

@idroz idroz requested a review from adrozdovbering June 6, 2022 18:29
fingerprint, err = ImportImage(image, unitParams.PlatformService.Name, bh.Remote)
if err != nil {
if err != nil || abortFlag {
return errors.New("failed to import image: " + err.Error())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When interrupting during Image Import, this line causes panic:

"""bash
⠧ Importing mlbase-1.0.tar.gz ^CInterrupting build and cleaning artefacts
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x4528ded]

goroutine 1 [running]:
github.com/bravetools/bravetools/platform.(*BraveHost).InitUnit(0x4bf7d20, {0x486e930, 0xc0002780f0}, 0xc0001c5540)
/Users/ignat/go/src/github.com/beringresearch/bravetools/platform/host_api.go:820 +0x2ad
github.com/bravetools/bravetools/commands.deploy(0x4bed440?, {0x4c28890, 0x0, 0x0?})
/Users/ignat/go/src/github.com/beringresearch/bravetools/commands/deploy.go:98 +0x35f
github.com/spf13/cobra.(*Command).execute(0x4bed440, {0x4c28890, 0x0, 0x0})
/Users/ignat/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0x4becf40)
/Users/ignat/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960 +0x39c
github.com/spf13/cobra.(*Command).Execute(...)
/Users/ignat/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897
main.main()
/Users/ignat/go/src/github.com/beringresearch/bravetools/main.go:15 +0x25
"""

I think this is because err is actually nil and the if statement is trying to return err.Error()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Will have to take a different approach to err checking - likely delegate it to a helper function and reuse throughout.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue should now be dealt with. The shared utility function CollectErrors returns an error if there is one, so there will no longer be a case where a nil error is unexpectedly used.

@Szubie Szubie force-pushed the cleanup-build-check-flag branch 2 times, most recently from 7ffb389 to e26a10b Compare June 7, 2022 14:27
Copy link
Contributor

@adrozdovbering adrozdovbering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When unit deployment interrupted during execution of POSTDEPLOY commands unit record left undeleted from database.

@Szubie Szubie force-pushed the cleanup-build-check-flag branch 2 times, most recently from a6f8b37 to 8de5bdc Compare June 8, 2022 15:59
@Szubie
Copy link
Collaborator Author

Szubie commented Jun 8, 2022

When unit deployment interrupted during execution of POSTDEPLOY commands unit record left undeleted from database.

Nice catch. I reordered the calls in 8de5bdc so that the unit insertion into the database is the last thing to happen. This should address the issue of units being left in the database despite the actual container having been cleaned up.

@adrozdovbering
Copy link
Contributor

Can we have at least a list of test cases which need to be checked?

Szubie added 19 commits June 17, 2022 15:15
…nup after errors encountered during the image import process. Slightly improved the capability of the program to cleanup after SIGTERM so long as unit/image already exist when SIGTERM is given
…ine intercepting os interrupt signals into `BuildImage` - it flips the abortFlag and cancels the context. Between the two, builds can be safely interrupted at non-critical junctures and cleanup can be properly conducted. The cost is that there can be some waiting between SIGINT and bravetools shutting down.
…t` (deployment). A goroutine intercepts SIGINT and sets flag/cancels context. Checks made against flag throughout.
…nching unit from it. Allows for cleaning up just the images generated during the build process.
…ents attempts to reference a nil error later by ensuring an error value is present.
…ortant step of unit initialization; the two should come as part and parcel. It may be useful to separately execute Postdeploy (like on an already running container) so it has been left as a separate public function.
…ecessary short function for build cleanup - using an anonymous function instead to make clear what happens. Fix some scoping for err short assignments to avoid overwriting more important errors
…n. This avoids the case where interrupts or errors in postdeploy would leave a record of a unit in the database even though the actual container had been cleaned up.
@Szubie
Copy link
Collaborator Author

Szubie commented Jun 17, 2022

Can we have at least a list of test cases which need to be checked?

In terms of automated tests the happy path should be testable using the tests in #92
It could also be possible to force an error during build/deploy and check if things are cleaned up.

It's a bit tricky to automatically test the SIGINT capturing and cleanup so I've been proceeding by testing interrupts manually. I then ensure no LXD images or containers are left on the system by running another build - if any exist an error will occur.

What I've been checking is whether bravetools catches interrupts and exits after cleaning up anything created during build/deploy:

  • BuildImage interrupted during image import
  • BuildImage interrupted during package install
  • BuildImage interrupted during bravefile Copy
  • BuildImage interrupted during bravefile Run
  • BuildImage interrupted during Publish/Export (should only exit after publish finishes to avoid leaving ZFS datasets around)
  • BuildImage interrupted during image file hashing/copying

For deploy, similar story:

  • InitUnit interrupted during image import
  • InitUnit interrupted during unit launch
  • InitUnit interrupted during unit config
  • InitUnit interrupted during image postdeploy

From my manual testing of the above cases I found no issues - it appears to cleanup very reliably.

@idroz idroz merged commit a201aab into bravetools:master Jun 20, 2022
@Szubie Szubie deleted the cleanup-build-check-flag branch June 20, 2022 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants