Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing Source Files from Database #39

Closed
creative16 opened this issue Jun 26, 2020 · 33 comments
Closed

Missing Source Files from Database #39

creative16 opened this issue Jun 26, 2020 · 33 comments
Assignees
Labels

Comments

@creative16
Copy link

When trying to build a database for Chromium on Windows, I found that a significant number of source files were missing from the database. One specific example would be src/services/network/p2p/socket_manager.cc.

I'm not sure if its relevant, but during the build process, the extractor crashed many times, in what appears to be an access violation.

I'm wondering if this is something the developers would be willing to look into?

@nickrolfe
Copy link

Hi @creative16 - I work on our C/C++ support, and I am absolutely interested in finding and fixing extractor crashes. I'll build a Chromium/Windows database to try and reproduce them and track them down.

Thanks!

@nickrolfe
Copy link

A quick update: we've made a change to address this, so the next version (2.2.5) should fix the problem for you. I expect that release will happen in approximately 2 weeks' time.

Thanks again for letting us know.

@creative16
Copy link
Author

Thanks for looking into it promptly! I'll look forward to the release.

@sad-dev
Copy link

sad-dev commented Aug 17, 2020

The chromium on linux build also misses many files, including the aforementioned src/services/network/p2p/socket_manager.cc. It might be a regression, as building with 2.1.0 actually results in a larger database with more files included.

@sad-dev
Copy link

sad-dev commented Aug 23, 2020

I noticed 2.25 came out and rebuilt chromium on Linux. Compared to my builds from 2.10-2.14 with 73k files, the 2.24 and 2.25 dbs return about 70k files. Running comm between them flagged up some very important files. I'll hence be moving back to 2.14 for now.

I have attached the filelists (from File f select f.getRelativePath())) here @nickrolfe. Please let me know if there is anything else in the log files (4 GB, sorry) to look out for in helping to troubleshoot
old_files.txt
new_files.txt

@sad-dev
Copy link

sad-dev commented Aug 23, 2020

One thing that occurred to me was that my old, bigger databases were created without setting --threads. Codeql could be causing extra crashes in the extractor due to this; I'll rebuild on 2.14 and 2.25 with/without the --threads and see how it goes. Really hoping that this isn't the root cause though, especially if it leads to larger compile times overall.

@sad-dev
Copy link

sad-dev commented Aug 24, 2020

Have 4 builds for 2.14 and 2.25 with/without --threads, all of which still have around 70k files. The only two variables left that I can think of which vary as compared to my bigger 73k builds are different versions of github/codeql in my codeql-home, and varying power/clock speed configurations of my CPU (the bigger builds were done on CPUs with less cores and lower clock speeds) . Does the version of codeql affect the cli/extractor?

I did notice that identical (same codeql-cli, same github/codeql, same codebase) builds can result in slightly different database contents, but am at my wits end right now. It is looking like instability is the likely cause, but havn't quite pinned down exactly why.

@nickrolfe
Copy link

Hi @sad-dev - sorry to hear that the 2.2.5 release did not fix things for you, and thanks for letting us know. I will build a Chromium snapshot locally. Hopefully I can reproduce your problem and I will track down the cause of the missing files.

Usually it's because of either a segfault or catastrophic error in the extractor. In general, build-system parallelism shouldn't affect this, unless the crashes are related to out-of-memory errors. But the more likely cause is that the extractor is choking on some construct in a header file included by all the files that didn't get extracted. I'll let you know what I find.

Does the version of codeql affect the cli/extractor?

Yes, each new release of the CodeQL CLI includes the latest extractor changes.

@sad-dev
Copy link

sad-dev commented Aug 24, 2020

Hi @sad-dev - sorry to hear that the 2.2.5 release did not fix things for you, and thanks for letting us know. I will build a Chromium snapshot locally. Hopefully I can reproduce your problem and I will track down the cause of the missing files.

Usually it's because of either a segfault or catastrophic error in the extractor. In general, build-system parallelism shouldn't affect this, unless the crashes are related to out-of-memory errors. But the more likely cause is that the extractor is choking on some construct in a header file included by all the files that didn't get extracted. I'll let you know what I find.

Does the version of codeql affect the cli/extractor?

Yes, each new release of the CodeQL CLI includes the latest extractor changes.

Thank you for taking a look! With regards to the version of codeql, I was referring to the repos for the queries rather than the CLI (i.e. those found by codeql resolve languages). I deleted them and was still able to build with the cli, so that answers my question, I guess :)

@sad-dev
Copy link

sad-dev commented Aug 25, 2020

@nickrolfe I made three builds of chromium with codeql-cli 2.25 on three machines with --threads=0 passed to codeql database create. The specs and results are below:

6C/12T, 32gb ram - 74.6k files
8C/16T, 32gb ram - 73k files
16C/32T, 32gb ram - 70k files

From the looks of it, more threads are leading to stability issues in the extractor, possibly due to a larger 'ripple effect'.

@nickrolfe
Copy link

Hi @sad-dev. The Chromium build I'm doing here (also using CodeQL 2.2.5) on an old Linux laptop still hasn't quite finished, but so far I see only 5 files (all in Skia) that failed to extract. And since you are seeing different behaviours on different machines, I'd like to see if we can find what's going wrong with the extractor from your build-tracer.log file.

Assuming that the failed extractions are caused by the extractor segfaulting or otherwise hitting a catastrophic error, do you think you could take a look at places where CodeQL C++ extractor: Backtrace occurs in the log and see if there's a common pattern in the backtraces you find? Note that the top few functions will always be in the exception-handling code, but the following entries will show what was executing when the exception happened.

If the log file is too big to open in a text editor, you could do something like:

grep -A 12 "CodeQL C++ extractor: Backtrace:" log/build-tracer.log

@sad-dev
Copy link

sad-dev commented Aug 25, 2020

The greps (with and without the -A12) for "CodeQL C++ extractor" are attached for my 6,8, and 16 core builds, as well as the filelists.

extractor_6c.txt
extractor_6c_full.txt
extractor_8c.txt
extractor_8c_full.txt
extractor_16c.txt
extractor_16c_full.txt
f6c.txt
f8c.txt
f16c.txt

@nickrolfe
Copy link

Thanks for sending those. We (the CodeQL team) will need to fix those crashes, but they don't explain why you're missing so many files.

Can you confirm that you're building Chromium from scratch? CodeQL will only extract source files that it observes being compiled, so if the build system thinks foo.o is already up to date, it won't recompile foo.cc and then it won't get extracted. Also, are you using ninja to build? Ninja should be fine, but some other build systems compile code from a daemon process that CodeQL won't 'see'.

The next thing I would do is pick a file that's missing, say foobar.cc, and grep build-tracer.log for that name. If there are no results, then CodeQL did not observe that file being compiled. If there are some results, then they should give clues as to what happened when trying to extract that file.

The other possibility is that extraction was successful, but something went wrong during database finalization. Once my local build finishes, I can investigate whether anything goes wrong there for me, but it might be worth checking the database-create-*.log file for errors around the line Importing .*foobar.cc.*, if that exists.

@sad-dev
Copy link

sad-dev commented Aug 25, 2020

I edited the previous post to update with filelists and extractor crashes. Also can confirm that all three were build with autoninja -C out/Default chrome with a empty (rm -rf out) output folder

Picking v8/src/wasm/wasm-objects.cc at random from comm f6.txt f16.txt,

1.) The build-tracer.log files are slightly broken up (even on 6c, it appears that multiple threads write to it at the same time) but they all show :
a) the extractor being called to 'mimic' clang on v8/src/wasm/wasm-objects.cc to create obj/v8/v8_base_without_compiler/wasm-objects.o
b) Creating trap tarball /...v8/src/wasm/wasm-objects.cc....trap.tar.br
c) Emitting trap files for ../../v8/src/wasm/wasm-objects.cc
d) Archiving /home/user/codeql_chromium_db/src/home/user/chromium/src/v8/src/wasm/wasm-objects.cc

only my 6core and 8core logs contain lines of the form
e) Wrote XXX files to ...v8/src/wasm/wasm-objects.cc....trap.tar.br
f) 11 errors detected in the compilation of "../../v8/src/wasm/wasm-objects.cc"

2.) the database-create-*log does not mention v8/src/wasm/wasm-objects.cc on 6c and 8c, but the 16c log has the line:
[ERROR] Skipping import of TRAP files in /home/user/codeql_latest_db/trap/cpp/tarballs/home/user/chromium/src/v8/src/wasm/wasm-objects.cc.506d6f4e_0.trap.tar.br as the manifest could not be read.

grep "manifest could not be read" on the logs throws up 118, 321, and 769 errors on the 6c, 8c, and 16c builds respectively.Some quick sed commands show they they all correspond to missing files in the respective databases

Further observations:
3.) On the 16c build, a large number of files do not even reach step (b) - creating the trap tarball.
4.) the manifest reading error almost perfectly corresponds with the difference in number of files appearing in steps (b) and steps (e).
5.) The files reaching step (e) corresponds almost (> 99%) perfectly to the different .cc files appearing between builds
6.) .h files only appear in step d, if at all (and not in steps a to c), and I observed missing files reaching and not reaching this step.

@nickrolfe
Copy link

Thanks, that's really useful.

My guess is that, on the 16c machine, you have a file at /home/user/codeql_latest_db/trap/cpp/tarballs/home/user/chromium/src/v8/src/wasm/wasm-objects.cc.506d6f4e_0.trap.tar.br.manifest.tmp file that is partially or completely truncated, and was not renamed (to strip the .tmp extension).

That's consistent with the extractor terminating prematurely, but I don't yet know why that would happen, and without any kind of backtrace. One guess is that it's the OOM killer.

However, I now see that it's also happening on my machine, so I have a good chance of tracking this down. I'll let you know how I get on, but thanks again for your help and patience!

@nickrolfe
Copy link

.h files only appear in step d, if at all (and not in steps a to c), and I observed missing files reaching and not reaching this step.

That is normal - the log entries in steps a to c derive from the name of the file being compiled. Step d corresponds to copying any observed source or header file to the database archive.

I also note the extractor running out of memory would be consistent with your observation that more threads means more failures. I will try to confirm if that's what's happening and, if so, what steps we can take to reduce the memory usage.

@sad-dev
Copy link

sad-dev commented Aug 26, 2020

.h files only appear in step d, if at all (and not in steps a to c), and I observed missing files reaching and not reaching this step.

That is normal - the log entries in steps a to c derive from the name of the file being compiled. Step d corresponds to copying any observed source or header file to the database archive.

I also note the extractor running out of memory would be consistent with your observation that more threads means more failures. I will try to confirm if that's what's happening and, if so, what steps we can take to reduce the memory usage.

Instead of reducing the memory usage (or if it proves unfeasible to do so / fails to completely fix extractor crash problems), is it possible to rerun the extractor, either immediately upon failure or just before finalizing? One obvious thing to look out for is that build files might be temporary, thus necessitating an archival if the extractor is to be rerun at the end.

@creative16
Copy link
Author

creative16 commented Aug 26, 2020

Hello @nickrolfe ,

just an update after trying version 2.2.5 to build Chromium on Windows. Generally, it definitely looks like an improvement over 2.2.4. The table below provides some summarised numbers from my build:

2.2.4 2.2.5
Total files in database 63356 73672
Total .c, .cc, .cpp, .cxx files 26248 31670
Total .h, .hxx, .hpp, .hh files 36787 41680

Grepping for "manifest could not be read" from database-create*.log only resulted in 76 results, as compared to 5555 for version 2.2.4. These files do correspond to files missing in the database.

There is, however, no significant difference when looking for "CodeQL C++ extractor: Backtrace" in build-tracer.log for both versions of CodeQL CLI (about 10 to 11 results each). The files indicated are also not present in database-create*.log, although I'm guessing perhaps this is to be expected. So there are still at least 76 + 11 = 87 missing source files.

Finally, as compared to version 2.2.4, the extractor for version 2.2.5 only crashed 7 times due to access violations as compared to hundreds (or thousands) of times for version 2.2.4.

To get a rough idea of how many files are missing, I'm wondering if you or @sad-dev have any idea how to get a full list of files that will be built by ninja?

@sad-dev
Copy link

sad-dev commented Aug 26, 2020

I monitored my peak memory usage on a smaller codebase that uses ninja: v8

-j = 34 (default with ninja), no codeql: 10GB
-j = 12, codeql --threads=1 (default): 18GB
-j = 12, codeql --threads=0 : 18GB
-j = 18, codeql --threads=0 : 28GB
As long as memory is not exhausted, I have consistent builds on v8. Remove the limiter and:

-j = 34, codeql --threads=0 : maxes out on my 32gb system, and throws about 150 manifest errors as compared to 0 otherwise. About 5% less files in the database.

We are thus looking at a roughly 5x multiplier on peak memory usage that codeql imposes when compiling v8, and threads used for importing doesn't seem to affect that. Given chromium's higher complexity and much larger (about 30x) size, I wouldn't be surprised if larger multipliers of 10x or so are observed on the memory requirements.

@creative16
Copy link
Author

creative16 commented Aug 28, 2020

After comparing the list of files in the database, and a list of possible files I believe should have been compiled, I think the database is still missing about 1000 source files (i.e. .cc, .cpp, .cxx, .c) in the Windows build.

I did a grep for some of the missing files in build-tracer.log, and managed to find results for them that appears to show they were intercepted by the extractor. However, there seemed to be errors indicated. I include three of the most common ones I found below. Could these errors be the reason why the files were not included in the database? I checked the compiled binary using IDA Pro and verified that the files were compiled into the binary during the build.

"../../third_party/ffmpeg/libavutil/cpu.c", line 21: catastrophic error: cannot open source file "stdatomic.h"
  #include <stdatomic.h>
"../../content/browser/gpu/gpu_process_host.cc": using precompiled header file "obj/content/browser/browser_cc.pch.semmle"
"../../base/trace_event/trace_category.h", line 34: error: expression must have a constant value
          offsetof(TraceCategory, state_) == 0,
          ^
"../../base/trace_event/trace_category.h", line 34: note: conversion from "unsigned char" to "char" is invalid in constant-expression evaluation
          offsetof(TraceCategory, state_) == 0,
          ^
"../../buildtools/third_party/libc++/trunk/include/exception", line 216: error: expected a ")"
  _LIBCPP_FUNC_VIS exception_ptr __copy_exception_ptr(void *__except, const void* __ptr);
                                                            ^

@nickrolfe
Copy link

Catastrophic errors will cause that source file to be missing from the database; regular errors will not. I've created an internal issue to investigate why we don't find stdatomic.h on Windows.

Regarding the excessive memory consumption, we've identified one of the major causes. In general, memory consumption should only be affected by the complexity of the one source file being extracted, but there is one feature that causes memory usage to scale with the overall size of the entire project. For the vast majority of projects, this is never a problem, but with Chromium and its tens of thousands of source files, it is. Of course we want to reliably build databases of Chromium and other massive C++ projects, so we're looking for a solution.

@creative16
Copy link
Author

Hello,

I've tried a Chromium build on Windows using version 2.2.6. The database had about 1000 more source files than the one I created using 2.2.5, which is great. Thanks for the improvements!

The catastrophic error regarding the missing stdatomic.h still appears in the log file. Another error I noticed that popped up 173 times in the log file was the following:

ERROR: Uncaught Windows exception (EXCEPTION_STACK_OVERFLOW). Exception code: 0xc00000fd

I checked a few of the source files where this error popped up and indeed they were not included in the database. Would this be a cause for the missing files as well?

@nickrolfe
Copy link

Hi @creative16, thanks for letting us know.

Yes, the stack overflow exception would also cause the files to be missing from the database. That was the same underlying issue for the problem reported at the start of this thread, but since then we increased the extractor's stack size while we investigate a longer-term fix for the excessive stack usage. If you're seeing that in 2.2.6, then even the increased stack size is not sufficient.

I haven't observed this problem on my own Windows machine while building V8 databases, but I'll do another build of Chromium to try and reproduce the exception.

@creative16
Copy link
Author

Hello @nickrolfe,

noticed that building Chromium on Windows using version 2.4.2 produced 11134 instances of the EXCEPTION_STACK_OVERFLOW error in the log file, as opposed to 178 instances using version 2.3.2.

You previously mentioned that this had something to do with the extractor's stack size. Just wondering if there was a regression of some sort between the two versions, as the number of such errors are now even more than prior to the fix in 2.2.5.

I haven't had the chance to test this out on version 2.4.3 yet, but looking at the release notes for 2.4.3, I suspect there shouldn't be a difference.

Thanks!

@jbj
Copy link

jbj commented Feb 2, 2021

Thanks for following up, @creative16. If stack usage or allocation on Windows has indeed regressed as you observe, we're eager to fix the problem. Did you build the same version of Chromium between CodeQL 2.3.2 and 2.4.2? I'm wondering if a newer and more complex version of Chromium could have pushed us over some limit.

@nickrolfe has switched teams and no longer works on C/C++, but I'll make sure someone investigates this problem.

@creative16
Copy link
Author

Hello @jbj,

thanks for replying. Nope, I did not build the same version of Chromium. However, I could try using 2.4.2 (or 2.4.3) to build the version of Chromium I used previously for CodeQL 2.3.2. I'll get back to you after I've got the chance to try that out.

@creative16
Copy link
Author

Hello @jbj, I've finished building the version of Chromium I used previously for CodeQL 2.3.2 with 2.4.3. There were 10414 instances of the stack overflow exception in the log files. So it does appear that there might have been a regression of some sort between 2.3.2 and 2.4.3.

@jbj
Copy link

jbj commented Feb 5, 2021

Thanks for investigating! I've planned for the team to address this problem in our next sprint, meaning within three weeks.

@criemen
Copy link

criemen commented Mar 3, 2021

Hi @creative16, I'm really sorry for the delays on this issue!
I just wanted to let you know that I'm now on it, and I was able to reproduce the stack overflow exceptions on Windows.
When building the chrome target, I got approximately 3500 (chromium HEAD, codeql HEAD).
I also observed some unrelated errors which will also lead to missing source files from the database.

I'll update the issue once I know more/have a fix.
Some of this might take some time though.

@criemen
Copy link

criemen commented Mar 8, 2021

Hi @creative16, I established that our previous workaround of increasing the stack size further doesn't work anymore.

I did track the problem to the C++ compiler frontend that's supplied by our vendor.
I have contacted them, and we're working with them on a solution to this problem, but so far there's no ETA for a fix yet.

@creative16
Copy link
Author

Hello @criemen , thanks for the update, appreciate it. Hopefully it'll be fixed soon!

@criemen
Copy link

criemen commented Mar 11, 2021

While our vendor is still working on fixing the underlying issue, I discovered that we had a regression in our workaround.
After fixing the regression, we do not enounter any more stack overflow exceptions when extracting chromium!
There's still a couple of files we fail to extract, and I have logged issues internally for these, but overall it's looking very good now.

The fix is in the next release of the codeql cli.
Please reopen this issue, or file a new one if you encounter issues with chromium again.

@criemen criemen closed this as completed Mar 11, 2021
@criemen
Copy link

criemen commented Mar 15, 2021

The stars misaligned, and the upcoming CLI release 2.4.6 only contains half the fix for this issue. I hope that the release after that will contain all relevant patches for this problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants