Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Occasionally recurring bug: Invalid Int32: "280375465082892" (ArgumentError) #14496

Open
wout opened this issue Apr 16, 2024 · 19 comments
Open

Comments

@wout
Copy link

wout commented Apr 16, 2024

Bug Report

This is a bug I've been dealing with for years now, and unfortunately, it's very hard to reproduce. Until now, it mostly happened with deployments, but today, I got it again in development. It's in a Lucky project that has been fine in development, but after updating Crystal this morning, it wouldn't run. Here's the full backtrace:

backtrace-int32-error.txt

The weird thing about this bug is that it's inconsistent; it only pops up occasionally. It seems that sometimes the Int32 in question is valid; other times, it's not. When I had it in deploys, it behaved the same. One deployment would fail, and the next would go through. Sometimes, multiple deploys would fail consecutively; sometimes, it would be fine for many times in a row.

At least one person I spoke to in the Lucky Discord has the same thing. They do the same and re-deploy until deployment
is successful. I mentioned it before in the Crystal Discord, but I've never been able to create an example that would consistently fail.

It's the same as now in development. It happened once, but now I can boot the app repeatedly without issues. When I had it in deployment, switching from the crystal-alpine package to crystal-debian solved it for me, and it hasn't returned since January this year.

Crystal version

Crystal 1.12.1 [4cea101] (2024-04-11)

LLVM: 15.0.7
Default target: x86_64-unknown-linux-gnu

OS

elementary OS 7.1 Horus x86_64 (Ubuntu 22.04 LTS)

@wout wout added the kind:bug label Apr 16, 2024
@straight-shoota
Copy link
Member

straight-shoota commented Apr 16, 2024

This looks very similar to https://forum.crystal-lang.org/t/mysterious-spec-failure-in-ecosystem-test/6748
For some reason it appears to be relatively frequently failing in https://github.com/crystal-lang/test-ecosystem in CI.

The compiler stack trace here might be helpful.

@wout
Copy link
Author

wout commented Apr 16, 2024

This looks very similar to https://forum.crystal-lang.org/t/mysterious-spec-failure-in-ecosystem-test/6748

I'm pretty sure it's the same indeed.

@refi64 noticed this: https://discord.com/channels/591460182777790474/611671115835768862/1199390683220414474

The compiler stack trace here might be helpful.

How can I do that?

@straight-shoota
Copy link
Member

How can I do that?

I mean the file you already shared. backtrace-int32-error.txt

@straight-shoota
Copy link
Member

straight-shoota commented Apr 16, 2024

The stack trace indicates the invalid literal appears inside code generated from macro run.

@wout Could you look in your sources to find out which files are being run like that? Probably grep -R run src lib should do, but it might be a bit noisy.

@wout
Copy link
Author

wout commented Apr 16, 2024

Sure! Here's lib:

grep-run-lib.txt

In src there's nothing:

src/operations/request_password_reset.cr:  def run
src/operations/sign_in_user.cr:  def run
src/operations/mixins/password_validations.cr:    before_save run_password_validations
src/operations/mixins/password_validations.cr:  private def run_password_validations
src/operations/mixins/user_validations.cr:    before_save run_user_validations
src/operations/mixins/user_validations.cr:  private def run_user_validations
src/jobs/delete_snippets_job.cr:  run_every 1.day
src/jobs/soft_delete_snippets_job.cr:  run_every 1.day
src/jobs/healthcheck_job.cr:  run_every 10.minutes
src/actions/password_reset_requests/create.cr:    RequestPasswordReset.run(params) do |operation, user|
src/actions/sign_ins/create.cr:    SignInUser.run(params) do |operation, authenticated_user|
src/actions/api/sign_ins/create.cr:    SignInUser.run(params) do |operation, user|
src/app.cr:# When this file is compiled/run it will require and run 'start_server',
src/components/admin/layout_head.cr:      # Used only in development when running `lucky watch`.

@ysbaddaden
Copy link
Contributor

Do you render a template? such as ECR for instance?

@wout
Copy link
Author

wout commented Apr 16, 2024

Do you render a template? such as ECR for instance?

No, in this particular project we're using Lucky HTML exclusively. But I did notice this (grep -R "ECR\." lib src):

lib/teeplate/src/lib/file_tree/macros/directory.cr:    ::ECR.embed #{abs.inspect}, "____io"
lib/exception_page/src/exception_page.cr:  ECR.def_to_s "#{__DIR__}/exception_page/exception_page.ecr"
lib/lucky_template/spec/support/structures.cr:  ECR.def_to_s "#{__DIR__}/../fixtures/template.ecr"
lib/lucky_template/README.md:  ECR.def_to_s "README.md.ecr"
lib/lucky_template/src/lucky_template/folder.cr:    #   ECR.embed("hello.ecr", io)
lib/lucky_template/src/lucky_template/folder.cr:    #   ECR.embed("hello.ecr", io)
lib/lucky_template/src/lucky_template/folder.cr:    # proc = LuckyTemplate::FileIO.new { |io| ECR.embed("hello.ecr", io) }
lib/lucky_template/src/lucky_template/folder.cr:    #   ECR.embed("hello.ecr", io)
lib/lucky_template/src/lucky_template/folder.cr:    #   ECR.embed("hello.ecr", io)
lib/lucky_template/src/lucky_template/folder.cr:    # proc = LuckyTemplate::FileIO.new { |io| ECR.embed("hello.ecr", io) }
lib/lucky_template/src/lucky_template/file.cr:  #   ECR.def_to_s "hello.ecr"
lib/avram/src/avram/tasks/gen/migration_generator.cr:  ECR.def_to_s "#{__DIR__}/migration.ecr"
lib/carbon/README.md:For more information on what you can do with Embedded Crystal (ECR), see [the official Crystal documentation](https://crystal-lang.org/api/latest/ECR.html).
lib/carbon/src/carbon/tasks/gen/email.cr:          ECR.embed("#{__DIR__}/templates/email.cr.ecr", io)
lib/carbon/src/carbon/tasks/gen/email.cr:            ECR.embed("#{__DIR__}/templates/html.ecr.ecr", io)
lib/carbon/src/carbon/tasks/gen/email.cr:            ECR.embed("#{__DIR__}/templates/text.ecr.ecr", io)
lib/carbon/src/carbon/email.cr:          ECR.embed "#{__DIR__}/templates/\{{ @type.name.underscore.gsub(/::/, "_") }}/\{{ content_type }}.ecr", content_io
lib/carbon/src/carbon/email.cr:        ECR.embed "#{__DIR__}/templates/\{{ template_name.id }}/layout.ecr", layout_io

@ggiraldez
Copy link
Contributor

I tracked down this problem to be a memory corruption in LLVM type structures. I'm still not 100% sure how this is happening, but I have a theory (more below). This corruption ends up impacting code generation for macro runs and member offset calculations.

I've seen two separate macro run failures:

  1. the getrandom failure described in the linked forum post
  2. I got an additional one which reported an invalid PCRE2 library version, which helped me understand what was happening.

The offset calculation issue is what manifests as the Invalid Int32 error message. The compiler will evaluate the offsetof operator and produce a NumberLiteral of type Int32 from an UInt64, without checking the numeric limits. This "unchecked cast" is an additional problem that needs fixing, but not the main issue right now. The place where this happens is in the MainVisitor, here. There are a couple of other places were a similar unchecked cast happens, related to sizeof and offsetof operators (LLVM reports sizes and offsets as UInt64) and in one instance of the macro method to_i.

Anyway, what happens is that the call to LLVM's LLVMOffsetOfElement is returning a 64-bit unsigned integer bigger than Int32::MAX, this gets rendered to a String and then later attempted to evaluate as an Int32. The corruption of the LLVM type structures also explains the crashes in the macro_run programs, as the generated binaries are invalid. In the forum, @HertzDevil pointed out the incorrect memory layouts for some unions in the getrandom failure. And in the invalid PCRE2 version case I found, the value of String::HEADER_SIZE, which is computed using offsetof is 0. I figured this out by inspecting and stepping through the generated assembly code, and this is what hinted me to the root of the problem.

As to why this is happening, as I said, I'm not 100% sure. But my guess right now is that the array of members for LLVM struct types is not being properly marked by the GC and that memory is being reused and overwritten by other parts of the code. One thing that sticks out is: the type structure itself is allocated by LLVM, but the array of its elements is allocated by Crystal. So my guess is that the interior of the LLVM type structure (where the pointer to the array buffer is stored) is somehow invisible to the GC. I tested running with the GC disabled (by setting the env varible GC_DONT_GC=1) and I haven't been able to reproduce the failures anymore. But of course that doesn't mean it's fixed, and the corruption may be happening for some other reason.

@ggiraldez
Copy link
Contributor

That array of elements ultimately goes to LLVMStructSetBody, but is copied, not stored:

Interesting, that's one less reason to worry about then. Then I have no idea yet how the corruption is happening 😁

@sol-vin
Copy link
Contributor

sol-vin commented May 8, 2024

If you guys would like a consistent way to create this error check out:

https://github.com/sol-vin/minievents/actions/runs/9007590688/job/24748627867

This job consistently causes one of the two errors. It is tied to something to do with using self as a TypeDeclaration in a macro.

Also during my time working on this issue I also encountered a fairly random LLVM Error: scalable vector not expected size (it says something like that, but not exactly. IDK if that helps at all.

@ggiraldez
Copy link
Contributor

I managed to get a pretty small test case that reproduces the issue consistently (at least in my machine running Ubuntu Linux with the release version of Crystal 1.12.1). What I'm not being able to do it is reproduce the issue with Crystal compiled from source.

For reference:

$ crystal version
Crystal 1.12.1 [4cea10199] (2024-04-11)

LLVM: 15.0.7
Default target: x86_64-unknown-linux-gnu

$ ccrystal version  # this is compiled from source
Using compiled compiler at .build/crystal
Crystal 1.13.0-dev [4dd58147c] (2024-04-17)

The compiler was not built in release mode.

LLVM: 15.0.7
Default target: x86_64-pc-linux-gnu

I also tried building the compiler in release mode, and with static linking but still no luck reproducing the issue. This was with BDWGC installed from distro packages. I'm now trying with one built from sources as well.

The test case itself consists of a couple of files:

# main.cr
COUNT=6

{{ run("./gen", "#{COUNT}") }}

macro read_sample(exe)
  {{ run(exe, "main.cr").stringify }}
end

{% for i in 1..COUNT %}
  puts read_sample("./read{{i}}")
{% end %}
# gen.cr
(1..(ARGV[0].to_i)).each do |i|
  File.write("read#{i}.cr", "puts #{Random.rand}\nputs File.read(ARGV[0])")
end
puts ARGV[0].to_i

The randomness is to avoid Crystal caching the result of the macro runs, but not really important. The first macro run, ./gen, is cached in any case, unless I manually clear the cache each time.

But anyway, this forces the compiler to repeatedly compile the generated readX.cr files and then executes them to construct the contents of main.cr, all in the same compilation context. With COUNT=6, or 6 or more macro compilations it reproduces consistently.

For the compiler built from source, I pushed the count up to 50 and still didn't trigger the bug.

@ggiraldez
Copy link
Contributor

Small update: I built a compiler from source using the base Docker image and steps from the distribution scripts and I can reproduce the issue with it, consistently.

Just using the GC compiled from source did not do it. But as before, running with GC_DONT_GC=1 makes the issue go away. So, I'm now thinking it could be an interaction between the GC and Musl.

@straight-shoota
Copy link
Member

Oooh, that's interesting. The Dockerfile has actually changed quite significantly since the last release. It doesn't build libgc anymore and instead uses the libraries provided from the system packages. Previously, including the 1.12 releases, libgc was built twice, once for the compiler itself linking musl, and once for distributing with the release compiler, linked against glibc.

For reference, see the diff of latest changes.

Since this issue reproduces with the latest version of the Dockerfile, it doesn't seem to depend on our custom libgc build.

@ggiraldez
Copy link
Contributor

ggiraldez commented May 10, 2024

BIG UPDATE: found the root of the problem and I think I understand why is happening, with 98% of certainty (I reserve the remaining 2% because it's late and I'm really tired by now 😴).

We're getting a cache poisoning situation in LLVM's DataLayout structure. The DataLayout is usually created from the TargetMachine and as I understand it, its responsibility is to layout values and structs in memory, with proper alignment and that stuff. This entity lazily caches the members and layouts of struct types in order to, for example efficiently compute the offset an element. To accomplish this, is has a DenseMap (which is basically a hash) of StructLayout objects indexed by TypeRef, ie. the address in memory of the created LLVM struct type.

Now, the TypeRefs are returned by the type creating functions in the LLVMContext, LLVMStructCreateNamed being the one that's interesting to us. When creating a type, the context allocates memory for the returned types using a bump allocator. So far so good, and this is coherent with the documentation that says that once types are allocated, they are never freed. Now, that's a big lie! Turns out that when the context which was used to create the type is disposed, it destroys the bump allocator, free'ing up all the slabs!. So, the problem is that the address of the types created are now open to be reused... by another type, in another context.

I hope the problem is clear by now. After some number of iterations of macro_run which will inevitably create and dispose LLVMContext objects, the cached information in LLVM's DataLayout will contain several entries for types that no longer exist, or rather more generally, StructLayout objects for addresses of types that no longer exist, or are not what the layout refers to. And because the DenseMap is populated lazily, it will just accept whatever it finds when asked for some memory address.

When Crystal tries to compile some program again (either another macro, or the main program), it encounters the implementation for String in the prelude, which uses the offsetof operand in its implementation. The compiler reaches out to LLVM to compute the offset and gets a completely bogus value in return.

I'm 100% sure of the cache poisoning (I added some debugging code and was able to reproduce the issue) and, as I said, 98% sure about all the rest. What I don't know is what's the proper way to fix it.

  • Is this a bug in LLVM, or are we misusing it? I couldn't find good and detailed documentation from the LLVM side, so I'm not sure who's on the wrong side here.
  • Are we supposed to hold to contexts for as long as the target machines? This explains why the problem disappears when running without GC by the way.
  • Are we supposed to create new target machines and associated data layouts for each compilation?
  • Or maybe there's a way to reset the cache of the data layout? It has a reset() method but I don't think there's C API for it.

I'm not sure. On the Crystal side, the instances and lifetimes are hard to follow as well. I think we're creating new LLVM contexts, not necessarily to create a new program (search for LLVM::Context.new). And in fact, it's mentioned in some comments that you should not share types/values across contexts, which makes me think that the compiler expects to have multiple contexts lying around, and that could potentially make things worse.

@beta-ziliani
Copy link
Member

Wow man, that's amazing detective work! The best is to place this in the llvm forum and hear what they say about it.

@ysbaddaden
Copy link
Contributor

ysbaddaden commented May 11, 2024

So the crux of the issue is that we're keeping LLVMDataLayout references while we disposed of the LLVMContext: the former keeps a cache of LLVMTypeRef allocated into the bump allocator that are freed when we dispose of the LLVMContext.

Are we supposed to create new target machines and associated data layouts for each compilation?

We do that during codegen: each compilation unit (aka module) has its own LLVMTargetMachine and thus its own LLVMDataLayout.

A question is why do we keep creating and deleting LLVM contexts? What about using a single one? LLVM contexts are thread unsafe but semantic phase is single threaded anyway (at worst each thread could have its own context).

Alternatively we could ask LLVM for the data layout string and interpret it ourselves. That would be duplicating what LLVM is already doing (prone to bugs), but we wouldn't have to create LLVMTypeRef until we reach the codegen phase.

@russ
Copy link

russ commented May 15, 2024

I just encountered this during a Github CI deployment. Unfortunately I can't offer much more than the stacktrace.
https://gist.github.com/russ/11c6a6a27d5eb6577eaa836386e98e86

@straight-shoota
Copy link
Member

Thanks.
At this point we know enough of the problem to consistently reproduce it and have pinpointed the root cause in the usage of LLVM. There's no need for further analysis of Crystal behaviour.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants