Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't scan hidden directories (.git, etc) #328

Closed
wants to merge 1 commit into from

Conversation

earlephilhower
Copy link

When checking for updated core files, don't scan inside UNIX hidden
directories like .git. Scanning inside a GIT tree for a large project
with many branches could take many minutes of runtime and multiple GB of
RAM. This was seen in the Arduino core for the ESP8266, but is probably
also an issue for other core development teams.

Fixes #327

Signed-off-by: Earle F. Philhower, III earlephilhower@yahoo.com

@earlephilhower earlephilhower changed the title Don't scan hidden directories (.git, etc) WIP - Don't scan hidden directories (.git, etc) Jul 1, 2019
When checking for updated core files, don't scan inside UNIX hidden
directories like .git.  Scanning inside a GIT tree for a large project
with many branches could take many minutes of runtime and multiple GB of
RAM.  This was seen in the Arduino core for the ESP8266, but is probably
also an issue for other core development teams.

Fixes arduino#327

Signed-off-by: Earle F. Philhower, III <earlephilhower@yahoo.com>
@earlephilhower earlephilhower changed the title WIP - Don't scan hidden directories (.git, etc) Don't scan hidden directories (.git, etc) Jul 1, 2019
@earlephilhower
Copy link
Author

Just did some proper, multi-run testing, and the PR is now properly filtering out .git from its scanning. Runtimes of 1:45 (on a 16-core Xeon server!) and memory usage of 4GB+ are now almost instantaneous and much, much lower RAM usage.

@d-a-v
Copy link

d-a-v commented Jul 1, 2019

Thanks alot @earlephilhower

@matthijskooijman
Copy link
Collaborator

Sounds like a good change, but I have a few questions and remarks:

  • Where is this code actually used? I would expect in recursive scanning of scketches and libraries, but doesn't that only happen inside the src subdirectory? Do you have .git directories in there? I'm just trying to understand the problem here.
  • Making a low-level function like FindAllFilesInFolder skip hidden files might be a bit surprising for users of the function (it's not really returning all files anymore). Perhaps passing a list of files or patterns to ignore explicitly might be better?
  • You're now ignoring all files starting with ., but IIRC there's already a list of files to ignore somewhere in the builder code (IIRC it's used to not warn about these files when they show up in a library or something). Perhaps that list could be reused instead?

@earlephilhower
Copy link
Author

Good notes. I think I can clear up most of them in reverse order of difficulty. :)

Where is this code actually used? I would expect in recursive scanning of scketches and libraries, but doesn't that only happen inside the src subdirectory? Do you have .git directories in there? I'm just trying to understand the problem here.

@d-a-v and I are maintainers of https://github.com/esp8266/Arduino/ , so we use ~/Arduino/hardware/esp8266com/esp8266 as the Arduino code and git repository we work from (branches, checking PRs, etc.) So there is a .git dir in the local hardware tree in our cases. It has something like 3000 commits and dozens of branches, so the .git dir has many, many files.

For normal users, though, your point is valid and they shouldn't have a .git in boards-manager installed tree. This change, for them, should be a no-op.

The actual code change itself is called from the wipeout_build_path... stage where arduino-builder is only trying to see if platform.txt has been changed (or other files with .txt extensions). Without this change, the findAllFilesInFolder does a massive amount of recursion and basically stats the entire GIT tree of 100K files or more, and stores that in memory only to be thrown away on return by the simple file.Ext()==".txt" iteration.

You're now ignoring all files starting with ., but IIRC there's already a list of files to ignore somewhere in the builder code (IIRC it's used to not warn about these files when they show up in a library or something). Perhaps that list could be reused instead?

The problem is any ignore list is only processed after the file list is in memory. It's not that there are some txt files in the git tree which are causing false rebuilds, it's that the builder is taking multiple minutes and GB scanning the GIT tree to build a really massive list of objects only to throw them out. We're trying to short-circuit that.

Making a low-level function like FindAllFilesInFolder skip hidden files might be a bit surprising for users of the function (it's not really returning all files anymore). Perhaps passing a list of files or patterns to ignore explicitly might be better?

You have a point, but I'm not sure it's a valid use case. FindAllFilesInFolder in the arduino-builder context is really looking for inputs to the build process (at multiple stages). I think it could be argued that if your build source files are contained in a hidden directory that you've got some issues, or at least the next guy to manage your code will have them! :)

I think it might be better to have a Bool flag for ignoring hidden directories (as opposed to an ignore list since there are also code coverage tools and others which make hidden dirs to store their artifacts), but the impact of such a change would be significantly larger in terms of code changes (but 0 logic changes, of course) since wipe... actually has 2 levels of indirection before this is called.

Alternatively, wipe... could be rewritten to be more intelligent about the parsing. Now it's doing the equivalent of select * from core-tree; and then dropping everything but .txt files. There should be a way to express select * from core-tree where extension='.txt';. I'm not sure my Go skills are up to that, but if that's what you'd like I could give it a shot.

earlephilhower added a commit to earlephilhower/arduino-cli that referenced this pull request Aug 28, 2019
Port of arduino/arduino-builder#328

When checking for updated core files, don't scan inside UNIX hidden
directories like .git.  Scanning inside a GIT tree for a large project
with many branches could take many minutes of runtime and multiple GB of
RAM.  This was seen in the Arduino core for the ESP8266, but is probably
also an issue for other core development teams.

Fixes arduino/arduino-builder#327

Signed-off-by: Earle F. Philhower, III <earlephilhower@yahoo.com>
@earlephilhower
Copy link
Author

Looks like this patch was implemented in the replacement arduino-cli:

https://github.com/arduino/arduino-cli/blob/master/arduino/builder/sketch.go#L91

Closing since it's no longer relevant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Please prevent from scanning hidden directories
4 participants