Very long cataloging process #1328

Closed
erik-bershel opened this issue Nov 7, 2022 · 11 comments · Fixed by #1510
Assignees
Labels
bug Something isn't working

Comments

@erik-bershel

erik-bershel commented Nov 7, 2022

Please provide a set of steps on how to reproduce the issue

  1. Install Syft using the default curl method
  2. Run sudo syft dir:/ -o spdx-json --exclude ./Users --exclude ./System/Volumes --exclude ./private/etc
  3. ???

What happened:
The process freezes at the "Cataloging packages" step. The longest-running process was started on November 5 at 2:46 PM (GMT) and is still ongoing.
What you expected to happen:
Any result, or an error message that would help us figure out what to do about it.
Anything else we need to know?:
Syft runs on GitHub Actions macOS runner images.
Environment:

  • Output of syft version: syft 0.60.3
  • OS (e.g: cat /etc/os-release or similar): macOS 10.15.7 (19H2026), macOS 11.7.1 (20G918), macOS 12.6.1 (21G217)
erik-bershel added the bug (Something isn't working) label Nov 7, 2022
@tgerla
Contributor

tgerla commented Nov 7, 2022

Hey @erik-bershel, I'm trying to reproduce this locally. Can you try running that same syft command with "-vv" for some extra verbosity? Maybe it will give us a hint as to where it is stopping.

@erik-bershel
Author

> Hey @erik-bershel, I'm trying to reproduce this locally. Can you try running that same syft command with "-vv" for some extra verbosity? Maybe it will give us a hint as to where it is stopping.

Sure. I'll send you additional info a bit later today.

@erik-bershel
Author

@tgerla
The macOS 11.7.1 (20G918) run finished in ~2d10h; the spdx-json output is about 36.8 MB.
I started new processes with the "noisy" option. Let's see what happens.

@erik-bershel
Author

erik-bershel commented Nov 14, 2022

Hello @tgerla!
I paid extra attention to the process. In the end, this looks like either an optimisation problem on macOS or a lack of resources. During the entire process of cataloging packages, all free CPU time was taken up by the syft process.

  1. The result for the same VM with macOS 11.7.1 (20G918) is about 1d 23h. SBOM example - macOS11full.sbom.json.zip
  2. The process failed without an error message on macOS 10.15 after 3 days of running.
  3. The process is still running on macOS 12:
    (Screenshot 2022-11-14 at 19 32 38)

Nothing relevant is included in the verbose output, just regular info/debug messages:
Screenshot 2022-11-14 at 19 49 08
^^ macOS 12 ^^
Screenshot 2022-11-14 at 19 48 51
^^ macOS 10.15 ^^

Which information from these VMs might be helpful?

@tgerla
Contributor

tgerla commented Nov 18, 2022

Hi @erik-bershel, thanks for the details. A couple of questions for you:

  1. are you running Syft from inside these VMs, or are you pointing Syft at them from outside?
  2. is it possible for me to get a hold of these VMs to do my own tests against? I'm not very familiar with GitHub Actions yet.

We suspect that we're hitting some device files or some other special file on the VM that is slowing things down. We just added a "trace" level of verbosity to Syft (-vvv) which may help us identify where the slowdown is happening.
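
One way to check the special-file suspicion independently of Syft is to list everything under a directory that is neither a regular file nor a directory. The following is a minimal diagnostic sketch, not a description of how Syft walks the file system; `kindOf`, `findSpecial`, `Special`, and `demo` are hypothetical names introduced for illustration only.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// Special is a hypothetical record for one non-regular, non-directory entry.
type Special struct{ Path, Kind string }

// kindOf names the type bits of a file mode.
func kindOf(mode fs.FileMode) string {
	switch {
	case mode.IsRegular():
		return "regular"
	case mode.IsDir():
		return "dir"
	case mode&fs.ModeSymlink != 0:
		return "symlink"
	case mode&fs.ModeCharDevice != 0: // Go also sets ModeDevice here, so test this first
		return "char-device"
	case mode&fs.ModeDevice != 0:
		return "block-device"
	case mode&fs.ModeNamedPipe != 0:
		return "fifo"
	case mode&fs.ModeSocket != 0:
		return "socket"
	}
	return "other"
}

// findSpecial walks root without following symlinks and returns every entry
// that is neither a regular file nor a directory.
func findSpecial(root string) ([]Special, error) {
	var out []Special
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			if d == nil {
				return err // the root itself could not be read
			}
			return nil // skip unreadable entries and keep walking
		}
		if k := kindOf(d.Type()); k != "regular" && k != "dir" {
			out = append(out, Special{Path: p, Kind: k})
		}
		return nil
	})
	return out, err
}

// demo builds a throwaway directory with one regular file and one symlink.
func demo() ([]Special, error) {
	dir, err := os.MkdirTemp("", "special-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "plain.txt"), []byte("x"), 0o644); err != nil {
		return nil, err
	}
	if err := os.Symlink("plain.txt", filepath.Join(dir, "link")); err != nil {
		return nil, err
	}
	return findSpecial(dir)
}

func main() {
	var found []Special
	var err error
	if len(os.Args) > 1 {
		found, err = findSpecial(os.Args[1]) // e.g. a directory inside the VM
	} else {
		found, err = demo()
	}
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk failed:", err)
		os.Exit(1)
	}
	for _, s := range found {
		fmt.Printf("%-12s %s\n", s.Kind, s.Path)
	}
}
```

Run against a directory such as /dev or /private/var, the output would show which device files, sockets, or pipes a full-root scan could encounter; whether any of them actually slows Syft down is still only a hypothesis at this point in the thread.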

I would be happy to do some experimentation on my side, if it is possible to get the VMs.

Thanks!

-Tim

@erik-bershel
Author

Hi @tgerla, thanks for the response.
Yes, I run the tests from inside these particular VMs, but I cannot share access to them for security reasons. I'll create a couple of new VMs with the different macOS versions that are used in GHA now (the previous images are a bit old at this point), and I'll run the new version of Syft with the new verbosity option. If the logs are collected successfully, or at least partially, I will share them one way or another depending on their size.
That's all I can offer for my part. If you have any other ideas, I will gladly try them.

@Mikcl
Contributor

Mikcl commented Nov 21, 2022

I have noticed slow performance too (albeit not on the order of days).

Have created #1353: the summary is that each cataloger runs serially and each cataloger seems to do glob pattern searches (which may be slow depending on the file indexer which I don’t have the full details for) against the file system.

Judging from the run time, the large SBOM sizes, the number of packages found, and the fact that syft is running against the root directory, the OP’s VM may have a large file system, and this is just the amount of time it takes to go through each cataloger serially?

An unknown bug aside, the issues are:

  • Serial catalogers
  • Slow cataloger (indexing/search algorithm)

I have a PR that attempts to address the serial running of catalogers: #1355
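
To make the serial-versus-parallel point concrete, here is a minimal sketch of both strategies. It is not the code from #1355 and not Syft's cataloger interface: `Cataloger`, `runSerial`, `runParallel`, and `bySuffix` are hypothetical names, and each toy "cataloger" simply scans a list of paths for a file-name suffix.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Cataloger is a hypothetical stand-in for one package cataloger: it receives
// the known file paths and returns whatever it recognises.
type Cataloger func(paths []string) []string

// runSerial runs the catalogers one after another, so the total time is the
// sum of the individual cataloger times.
func runSerial(catalogers []Cataloger, paths []string) []string {
	var found []string
	for _, c := range catalogers {
		found = append(found, c(paths)...)
	}
	sort.Strings(found)
	return found
}

// runParallel runs at most `workers` catalogers at the same time.
func runParallel(catalogers []Cataloger, paths []string, workers int) []string {
	if workers < 1 {
		workers = 1
	}
	sem := make(chan struct{}, workers) // counting semaphore
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		found []string
	)
	for _, c := range catalogers {
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` catalogers are already running
		go func(c Cataloger) {
			defer wg.Done()
			defer func() { <-sem }()
			pkgs := c(paths)
			mu.Lock()
			found = append(found, pkgs...)
			mu.Unlock()
		}(c)
	}
	wg.Wait()
	sort.Strings(found) // goroutines finish in any order, so sort for stable output
	return found
}

// bySuffix builds a toy cataloger that reports paths ending in suffix.
func bySuffix(suffix string) Cataloger {
	return func(paths []string) []string {
		var out []string
		for _, p := range paths {
			if strings.HasSuffix(p, suffix) {
				out = append(out, p)
			}
		}
		return out
	}
}

func main() {
	paths := []string{"/app/go.mod", "/app/package.json", "/lib/requirements.txt", "/app/main.go"}
	catalogers := []Cataloger{bySuffix("go.mod"), bySuffix("package.json"), bySuffix("requirements.txt")}
	fmt.Println(runSerial(catalogers, paths))
	fmt.Println(runParallel(catalogers, paths, 2))
}
```

Both functions return the same set of results; the bounded version only helps when the catalogers are independent and the machine has spare cores, and it does nothing about the cost of each individual search.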

———
P.S. Is running with sudo necessary?
Are there particular catalogers which are slow?

(not a maintainer)

@erik-bershel
Author

Hey @tgerla @kzantow!
The 0.66.2 release was so good that I was finally able to go through the full inventory and cataloging process for our images. Here is what we ended up with:

  • macOS 11 VM with 3 cores, 14 GB RAM, 255 GB used space - ~64 hours, without errors, ~109 MB SBOM file, ~2 MB zipped SBOM file
  • macOS 11 VM with 6 cores, 28 GB RAM, 255 GB used space - ~42 hours, without errors
  • macOS 12 VM with 3 cores, 14 GB RAM, 253 GB used space - ~45 hours, failed with a fault that is fixed in the new release (next run in progress)
  • macOS 12 VM with 6 cores, 28 GB RAM, 253 GB used space - ~47.5 hours, without errors, ~88 MB SBOM file, ~2 MB zipped SBOM file

Logs and sbom-files:

One image was used to create both macOS 11 VMs, and another one was used for both macOS 12 VMs.

@Mikcl
Contributor

Mikcl commented Jan 24, 2023

nice.

Out of interest, @erik-bershel, have you tried setting SYFT_PARALLELISM to something greater than 1?

I would (personally) be interested in seeing comparisons if so. (#1355)

@erik-bershel
Author

@Mikcl hmm. I'll try a couple of different options and will return with results in two or three days.

wagoodman self-assigned this Jan 25, 2023
@wagoodman
Contributor

wagoodman commented Jan 26, 2023

Coincidentally, I'm working on anchore/stereoscope#154 and #1510, which dramatically speed up searching the index (by leveraging more indexes). This won't help build the index any faster, so it will still be rather slow for something as large as the GitHub mac runners. However, in a prototype branch I've applied the same principle from #1510 to the directory resolvers by adding a stereoscope FileCatalog object and leveraging those new indexes instead of glob calls, and I'm seeing a dramatic speedup:

go run ./cmd/syft dir:/ --exclude ./Users --exclude ./System/Volumes --exclude ./private.   1632.20s user 473.36s system 151% cpu 23:14.12 total

(this had been taking several... several hours beforehand)
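
For readers unfamiliar with this timing format, the CPU percentage is the user plus system CPU time divided by the wall-clock time. A small worked sketch using the figures quoted above (`cpuPercent` is a hypothetical helper name):

```go
package main

import "fmt"

// cpuPercent returns (user + system) CPU seconds as a percentage of wall time.
func cpuPercent(userSec, sysSec, wallSec float64) float64 {
	return (userSec + sysSec) / wallSec * 100
}

func main() {
	wall := 23*60 + 14.12 // "23:14.12 total" is 23 minutes 14.12 seconds
	pct := cpuPercent(1632.20, 473.36, wall)
	// Roughly 151%, i.e. about one and a half cores busy on average.
	fmt.Printf("%.0f%% cpu over %.2f s of wall time\n", pct, wall)
}
```

In other words, the prototype run finished in a little over 23 minutes of wall time while keeping about 1.5 cores busy on average.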

I'll try and polish this up and get it in after these two PRs I have open.
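
The idea of answering searches from prebuilt indexes instead of glob calls can be sketched roughly as follows. This is an illustration under stated assumptions, not stereoscope's FileCatalog API: `Index`, `buildIndex`, `ByBasename`, `ByExtension`, and `globSearch` are hypothetical names, and the "file system" is just a slice of paths.

```go
package main

import (
	"fmt"
	"path"
)

// globSearch tests every known path against the pattern, so each query costs
// time proportional to the total number of files.
func globSearch(paths []string, pattern string) []string {
	var hits []string
	for _, p := range paths {
		// path.Match only fails on a malformed pattern; treat that as no match.
		if ok, err := path.Match(pattern, path.Base(p)); err == nil && ok {
			hits = append(hits, p)
		}
	}
	return hits
}

// Index is a hypothetical pair of lookup tables built once while walking the tree.
type Index struct {
	byBase map[string][]string // basename -> full paths
	byExt  map[string][]string // extension (with leading dot) -> full paths
}

// buildIndex makes a single pass over the paths; this cost is paid once.
func buildIndex(paths []string) *Index {
	ix := &Index{byBase: map[string][]string{}, byExt: map[string][]string{}}
	for _, p := range paths {
		base := path.Base(p)
		ix.byBase[base] = append(ix.byBase[base], p)
		if ext := path.Ext(base); ext != "" {
			ix.byExt[ext] = append(ix.byExt[ext], p)
		}
	}
	return ix
}

// ByBasename and ByExtension answer a query with a single map lookup.
func (ix *Index) ByBasename(name string) []string { return ix.byBase[name] }
func (ix *Index) ByExtension(ext string) []string { return ix.byExt[ext] }

func main() {
	paths := []string{"/usr/lib/a.jar", "/opt/app/package.json", "/opt/app/lib/b.jar", "/etc/hosts"}
	fmt.Println(globSearch(paths, "*.jar")) // scans all four paths

	ix := buildIndex(paths)
	fmt.Println(ix.ByExtension(".jar"))        // one lookup, same two paths
	fmt.Println(ix.ByBasename("package.json")) // one lookup
}
```

With many catalogers each issuing several patterns, the glob approach repeats the full scan for every query, while the index pays for one pass up front. That matches the observation above that searching gets faster while building the index does not.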


4 participants