Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel package catalog processing #1355

Merged
merged 5 commits into from
Jan 11, 2023

Conversation

Mikcl
Copy link
Contributor

@Mikcl Mikcl commented Nov 20, 2022

Closes #1353

This introduces a new option SYFT_PARALLELISM=N environement variable and syft config file value. Which will use at most N workers to process the package catalogers in parallel. The default will be 1.

  • uses the sync library to create a wait group, waiting for all package catalogers to finish before proceeding.
  • effectively has no real change in behaviour to users calling syft today, as the default is to create one worker.

example performance benchmark comparision

Open to recommendations on performance benchmarking but early results on laptop seem promising:

Created an exmple directory with venv and node_modules

➜  syft git:(mikcl/concurrent-catalog) SYFT_PARALLELISM=1 time go run cmd/syft/main.go packages  --file ../dump  /path/to/project
[indexing truncated]
 ✔ Indexed /Users/mikcl/Documents/junk/benchmark-concurrent
 ✔ Cataloged packages      [758 packages]

go run cmd/syft/main.go packages --file ../dump   13.29s user 1.75s system 114% cpu 13.179 total
➜  syft git:(mikcl/concurrent-catalog) SYFT_PARALLELISM=4 time  go run cmd/syft/main.go packages   --file ../dump  /path/to/project
[indexing truncated]
 ✔ Indexed /Users/mikcl/Documents/junk/benchmark-concurrent
 ✔ Cataloged packages      [758 packages]
go run cmd/syft/main.go packages --file ../dump   14.16s user 1.94s system 174% cpu 9.235 total

1 worker: 13.29s user 1.75s system 114% cpu 13.179 total
4 workers: 14.16s user 1.94s system 174% cpu 9.235 total

uname -a
Darwin Kernel Version 21.6.0: [date redacted]; root:xnu-8020.141.5~2/RELEASE_ARM64_T8101 arm64

For larger file systems and cpu's with more cores, this may be useful :)

@Mikcl Mikcl marked this pull request as draft November 20, 2022 17:02
Copy link
Contributor

@kzantow kzantow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a great idea! I wonder if it could be simplified a bit (see some initial comments). Also, I wonder if there would be a way to extract this to some more "generic" function like runInParallel(workers, func()...)...

syft/pkg/cataloger/catalog.go Outdated Show resolved Hide resolved
syft/pkg/cataloger/catalog.go Show resolved Hide resolved
syft/pkg/cataloger/catalog.go Outdated Show resolved Hide resolved
syft/pkg/cataloger/catalog.go Show resolved Hide resolved
syft/pkg/cataloger/catalog.go Outdated Show resolved Hide resolved
@kzantow
Copy link
Contributor

kzantow commented Nov 21, 2022

A friendly reminder that you'll need to sign-off your commits (https://github.com/anchore/syft/blob/main/CONTRIBUTING.md#sign-off-your-work)

@Mikcl
Copy link
Contributor Author

Mikcl commented Nov 21, 2022

Thanks for the initial review @kzantow (i think i may take this out of draft now)

This is the concurrency pattern that i have been following. I will assume for it idiomatic?

Summarise the approach / requirements:
Given N jobs and W workers,
we would need to return two things from the job (1. discovered packages dynamically, 2. "results" to aggregate from all jobs)

I think the sync/WaitGroup approach is suitable here.

  • Set up W workers to independently read from the jobs channel.
  • publish N catalogers to the jobs channel.
  • worker w picks up the next job, and does necessary processing and result publishing to desried channels.
    • discoveredPackages to publish dynamic values from this process.
    • results for later aggregation
  • once wg.Wait is ready then we can aggregate

Are we aligned with the general steps? I'm open to exploring other concurrency patterns, however the waitgroup seems advantageous here due to Wait and our need to aggregate?

wdyt?

@Mikcl Mikcl marked this pull request as ready for review November 24, 2022 02:24
internal/config/application.go Outdated Show resolved Hide resolved
syft/pkg/cataloger/catalog.go Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@Mikcl Mikcl force-pushed the mikcl/concurrent-catalog branch 2 times, most recently from 2929836 to 38767e6 Compare November 29, 2022 23:48
@wagoodman
Copy link
Contributor

My preference would be to introduce only an application configuration option first and not a CLI option. That is, keep the SYFT_PARALLELISM env var / parallelism config var and remove the --parallelism CLI switch.

Copy link
Contributor

@wagoodman wagoodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open question / asking for folks opinion: why introduce a number of logical concurrent workers? We could lean on the default GOMAXPROCS value and simply start go routines for all catalogers without setting an artificial limit.

@Mikcl
Copy link
Contributor Author

Mikcl commented Dec 1, 2022

  • Have hidden the cli --parallelism flag. (such that it will not show up as incouraged interface/in the --help text, but still functional).

SYFT_PARALLELISM is the corresponding env variable.

Do we want to remove the flag entriely?

We could lean on the default GOMAXPROCS?

This pr currently setting 1 as the default (i.e "same" as without). Would encourage keeping it the same in initial release before increasing / settting different default.

My (non maintainer) pov:

  • we could set GOMAXPROCS in a following PR as the default.
    • however we dont necessarily know how this is used by other users / whether this would impact their workflows in anyway. Something which could be comunicated in advance?
    • Do not think there is much risk invovled but configurability would be good.

@Mikcl Mikcl force-pushed the mikcl/concurrent-catalog branch 2 times, most recently from e0d57a2 to b75e661 Compare December 1, 2022 16:56
Signed-off-by: mikcl <mikesmikes400@gmail.com>
@Mikcl
Copy link
Contributor Author

Mikcl commented Dec 5, 2022

Any updates on @wagoodman open question and interface comments above from the syft team.

Signed-off-by: mikcl <mikesmikes400@gmail.com>
@Mikcl
Copy link
Contributor Author

Mikcl commented Dec 13, 2022

just bumping in case there are any new updates on this from syft team.

cc @kzantow

@wagoodman
Copy link
Contributor

Sorry for the radio silence -- I thought a little more about this. I agree having a more explicit config option (SYFT_PARALLELISM) instead of leaning on GOMAXPROCS would be more obvious.

I do think we should make this an app-config-only update and still not introduce a CLI flag. That is, I see that you hid the flag, but ideally it wouldn't be on the CLI at all. I can help with those updates if you'd like?

Signed-off-by: mikcl <mikesmikes400@gmail.com>
@Mikcl
Copy link
Contributor Author

Mikcl commented Dec 21, 2022

thanks for the update and thoughts on the direction @wagoodman

  • updated to be an environment variable (SYFT_PARALLELISM) and in yaml config option only.
  • (removed cli flag option).
  • updated tests and PR text to reflect the change.

@Mikcl
Copy link
Contributor Author

Mikcl commented Jan 4, 2023

hny!, what will it take to get this across the line?

@Mikcl
Copy link
Contributor Author

Mikcl commented Jan 11, 2023

Is this still progressing? happy with waiting, would be good to get any feedback if possible.

understand that there may be other priorities so will leave this open for now :)

@wagoodman
Copy link
Contributor

Sorry for the radio silence -- I'm going to push a couple minor tweaks, then I think it'll be go. Nice work @Mikcl !

@wagoodman wagoodman added the enhancement New feature or request label Jan 11, 2023
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
@wagoodman wagoodman removed the enhancement New feature or request label Jan 11, 2023
Copy link
Contributor

@wagoodman wagoodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice enhancement @Mikcl !

@wagoodman wagoodman enabled auto-merge (squash) January 11, 2023 18:44
Signed-off-by: mikcl <mikesmikes400@gmail.com>
auto-merge was automatically disabled January 11, 2023 19:23

Head branch was pushed to by a user without write access

@Mikcl
Copy link
Contributor Author

Mikcl commented Jan 11, 2023

Thanks for the logging changes, looks good

(some tests failed due to the logging change, made a commit to update it).

I feel like it is useful to assert some of the logging (There were tests for it which failed due to logging update commit)

This last commit should fix the tests that failed.

Please feel free to auto-enable-merge / merge when available if approved changes.

@Mikcl
Copy link
Contributor Author

Mikcl commented Jan 11, 2023

this is green now 🚀

Thanks for your patience on this PR @wagoodman

@wagoodman
Copy link
Contributor

My bad, thanks for the fix!

@wagoodman wagoodman merged commit 4bfb849 into anchore:main Jan 11, 2023
GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024
* catalog: run cataloggers concurrently

Signed-off-by: mikcl <mikesmikes400@gmail.com>

* frontend: expose workers as a configurable option

Signed-off-by: mikcl <mikesmikes400@gmail.com>

* fixup! frontend: expose workers as a configurable option

Signed-off-by: mikcl <mikesmikes400@gmail.com>

* update logging statements

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

* test: assert for debug logging

Signed-off-by: mikcl <mikesmikes400@gmail.com>

Signed-off-by: mikcl <mikesmikes400@gmail.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Co-authored-by: Alex Goodman <alex.goodman@anchore.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Increase the speed of cataloger stage
3 participants