Large numbers in magnitudes cause compiler crashes or long compilation times #217

Open
Tehforsch opened this issue Jan 20, 2024 · 5 comments
Labels
⬇️ affects: code (implementation) Affects implementation details of the code 📁 kind: bug Something isn't working correctly, or there's a mistake 💪 effort: large

Comments

@Tehforsch

Tehforsch commented Jan 20, 2024

This is more of a curiosity than an actual issue, so feel free to close this if it doesn't seem relevant.
I was inspired by the nice talk you (@chiphogg) gave at CppCon to implement vector space magnitudes in my Rust compile-time units library (not because the library is mature enough to really need them, but because Rust's const exprs like integers a lot better than floats).
While doing so, I stumbled over the problem that factorizing the integers in the magnitude in the standard/naive way, via trial division, runs into performance problems (especially if the user really wants to run into performance problems). In my library, the magnitudes are currently read from the exponent/mantissa of a float, and the largest primes that fit into the 52+1-bit mantissa of a double (9007199254740881, for example) can already take a really long time to factor. While implementing some more sophisticated algorithms, I checked what Au does to solve this issue and realized that it also uses the standard factorization algorithm, so I tried to confirm my suspicion and compiled the following code:

struct LargePrimes : decltype(Meters{} * mag<9007199254740881>()) {};

If compiled with clang, this gives:

note: constexpr evaluation hit maximum step limit; possible infinite loop?

and with gcc, I get:

au.hh:442:5: error: ‘constexpr’ loop iteration count exceeds limit of 262144 (use ‘-fconstexpr-loop-limit=’ to increase the limit)

It eventually compiles with
g++ -fconstexpr-loop-limit=2147483647 -fconstexpr-ops-limit=2147483647
but that takes roughly 100 seconds on my machine. Putting in a prime near the 64-bit limit should take about an hour.
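For context, the loop that hits these limits is essentially naive constexpr trial division. Here is a minimal sketch (the function name is made up for illustration; this is not Au's exact code): for n = 9007199254740881, which is prime, it needs roughly sqrt(n) ≈ 9.5 × 10^7 iterations, far beyond GCC's default constexpr loop limit of 262144.

```cpp
// Minimal sketch of naive constexpr trial division (not Au's exact code).
// For a prime n, the loop runs about sqrt(n) times, which is what trips
// the constexpr iteration limits quoted above.
#include <cstdint>

constexpr std::uint64_t first_factor_naive(std::uint64_t n) {
    for (std::uint64_t d = 2; d * d <= n; ++d) {
        if (n % d == 0) return d;  // smallest nontrivial factor
    }
    return n;  // no divisor up to sqrt(n): n is 1 or prime
}

static_assert(first_factor_naive(15) == 3);
```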

This is probably not much of a real-world concern since I maliciously wrote in large prime numbers, but if somebody, for example, were to autogenerate their unit definitions from a bunch of sensor data, they might get unlucky and accidentally stumble into this. Probably not.

There are much faster algorithms that handle 64-bit integers just fine, in case you're at all interested in fixing this. I am by no means knowledgeable in number theory, but I implemented the Miller-Rabin test to check for primality first, and then Pollard's rho algorithm to factor the composite numbers; with some tweaking, that seems to work fine.
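As a rough sketch of what I mean (illustrative only, not the exact code from my library; the function names are made up, and it relies on the __uint128_t extension in GCC and Clang for overflow-free modular multiplication):

```cpp
// Illustrative sketch: deterministic Miller-Rabin for 64-bit inputs,
// followed by Pollard's rho for composites.  Not production code.
#include <cstdint>
#include <numeric>  // std::gcd

using u64 = std::uint64_t;

constexpr u64 mul_mod(u64 a, u64 b, u64 m) {
    return static_cast<u64>(static_cast<__uint128_t>(a) * b % m);
}

constexpr u64 pow_mod(u64 base, u64 exp, u64 m) {
    u64 result = 1;
    base %= m;
    for (; exp > 0; exp >>= 1) {
        if (exp & 1) result = mul_mod(result, base, m);
        base = mul_mod(base, base, m);
    }
    return result;
}

// Deterministic Miller-Rabin: this fixed witness set is known to be
// sufficient for all n < 2^64.
constexpr bool is_prime(u64 n) {
    constexpr u64 witnesses[] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37};
    if (n < 2) return false;
    for (u64 p : witnesses) {
        if (n % p == 0) return n == p;
    }
    u64 d = n - 1;
    int r = 0;
    while (d % 2 == 0) { d /= 2; ++r; }
    for (u64 a : witnesses) {
        u64 x = pow_mod(a, d, n);
        if (x == 1 || x == n - 1) continue;
        bool composite = true;
        for (int i = 0; i < r - 1; ++i) {
            x = mul_mod(x, x, n);
            if (x == n - 1) { composite = false; break; }
        }
        if (composite) return false;
    }
    return true;
}

// Pollard's rho with Floyd cycle detection.  A single attempt can fail
// (it then returns n); the caller retries with a different increment c.
constexpr u64 pollard_rho(u64 n, u64 c) {
    u64 x = 2, y = 2, d = 1;
    while (d == 1) {
        x = (mul_mod(x, x, n) + c) % n;
        y = (mul_mod(y, y, n) + c) % n;
        y = (mul_mod(y, y, n) + c) % n;
        d = std::gcd(x > y ? x - y : y - x, n);
    }
    return d;
}

// Returns some prime factor of n (requires n > 1): primality test first,
// then Pollard's rho with retries, recursing until the factor is prime.
constexpr u64 some_prime_factor(u64 n) {
    if (n % 2 == 0) return 2;
    if (is_prime(n)) return n;
    for (u64 c = 1;; ++c) {
        const u64 d = pollard_rho(n, c);
        if (d != n) return is_prime(d) ? d : some_prime_factor(d);
    }
}

static_assert(some_prime_factor(15) == 3);
```

The fixed witness set makes Miller-Rabin deterministic for 64-bit inputs, and a failed Pollard's rho attempt simply triggers a retry with a different increment.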

@Tehforsch Tehforsch changed the title Large primes in magnitudes make compiler crash or take very long Large numbers in magnitudes cause compiler crashes or long compilation times Jan 20, 2024
@chiphogg
Contributor

Very well spotted! This is a known issue, although not previously tracked in this project.

We encountered this in the mp-units project after I contributed vector space magnitudes there (mpusz/mp-units#300). The problem is more practical than you might imagine! Once you start using more "extreme" units, such as those found in subatomic or cosmic scales, you find magnitudes whose smallest prime factor is surprisingly large, and you hit this compiler iteration limit.

In mp-units, I assumed this problem was due to naive trial division, so I learned about wheel factorization and implemented it there. Unfortunately, it seems that for wheel factorization, the space requirements grow very rapidly while the iteration count shrinks only slowly. Ultimately, while it might find some answers a little faster, it's not a solution to this problem.
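For reference, a minimal sketch of the idea (an illustrative function name, not mp-units' actual implementation): a 2-3-5 wheel strips those primes and then only tries candidates coprime to 30. That skips roughly 73% of candidates, but the iteration count is still on the order of sqrt(n) when the smallest prime factor is large, and bigger wheels need rapidly growing gap tables.

```cpp
// Minimal sketch of trial division with a 2-3-5 wheel (not mp-units' actual
// code).  Only candidates coprime to 30 are tried, but the work is still
// O(sqrt(n)) when the smallest prime factor is large.
#include <cstdint>

constexpr std::uint64_t first_factor_wheel(std::uint64_t n) {
    constexpr std::uint64_t small[] = {2, 3, 5};
    for (std::uint64_t p : small) {
        if (n % p == 0) return p;
    }
    // Gaps between consecutive integers coprime to 30, starting from 7.
    constexpr int gaps[] = {4, 2, 4, 2, 4, 6, 2, 6};
    std::uint64_t candidate = 7;
    int i = 0;
    while (candidate <= n / candidate) {  // candidate * candidate <= n, without overflow
        if (n % candidate == 0) return candidate;
        candidate += gaps[i];
        i = (i + 1) % 8;
    }
    return n;  // n is 1 or prime
}

static_assert(first_factor_wheel(7 * 11) == 7);
```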

Our current plan is to write a paper proposing a new standard library function, perhaps something like std::first_factor. Vector space magnitudes are (as far as we know now) a hard requirement for a standard units library, and reliable prime factorization is a hard requirement for vector space magnitudes. The unix program factor, written in C, can easily factor any 64-bit integer in a tiny fraction of a second --- so we know this can be done. If we had this functionality in C++, it would make all kinds of basic number theory code easier to write.

I'm planning to finish this paper before the Feb. 15 mailing deadline, and have it discussed at the Tokyo meeting in March.

Pollard's rho algorithm sounds fascinating and exciting! I'm a little hesitant to code it up in this project just yet, because of its pseudorandom character. It sounds like it can fail, and I'm not confident I could write test cases that correctly exercise these failure cases, and cover the selection of a new parameter for the retry.

Perhaps we'll see how things develop with the paper. Perhaps we'll bite the bullet and try Pollard's rho in the meantime, ideally with plenty of tests and perhaps also fuzzing to gain enough confidence. Perhaps both!

@chiphogg
Contributor

What piqued my interest the most in your post was the idea of implementing vector space magnitudes in rust. In C++, Au and mp-units use variadic templates for the implementation. But I had thought rust didn't support variadic templates! (I've never used rust.) So how can you implement vector space magnitudes?

The reason I'm familiar with this is because I checked out the iliekturtles/uom project when it briefly ascended to the top of the dimensional-analysis star chart. I was scandalized to find that they eagerly convert all quantities of any given dimension to some base unit for that dimension! That's undesirable because it complicates end users' choice: it forces them to change the binary program they produce if they want to get unit safety.

After thinking it through, I speculated that they were more-or-less forced into this design choice because rust's lack of variadic templates prevents them from being able to represent unit magnitudes in the type. (Yes, they could have used some analogue of std::ratio, but this representation doesn't meet the basic requirements for a multidimensional units library.) Thus, my assumption was that you can't actually write a good units library in rust as it stands today. But I could easily be missing something here due to my lack of experience with the language.

@Tehforsch
Author

We encountered this in the mp-units project after I contributed vector space magnitudes there (mpusz/mp-units#300). The problem is more practical than you might imagine! Once you start using more "extreme" units, such as those found in subatomic or cosmic scales, you find magnitudes whose smallest prime factor is surprisingly large, and you hit this compiler iteration limit.

Our current plan is to write a paper proposing a new standard library function, perhaps something like std::first_factor. Vector space magnitudes are (as far as we know now) a hard requirement for a standard units library, and reliable prime factorization is a hard requirement for vector space magnitudes. The unix program factor, written in C, can easily factor any 64-bit integer in a tiny fraction of a second --- so we know this can be done. If we had this functionality in C++, it would make all kinds of basic number theory code easier to write.

I'm planning to finish this paper before the Feb. 15 mailing deadline, and have it discussed at the Tokyo meeting in March.

That's very interesting to hear, thanks! I was briefly looking at parsec or mole to check if their conversion factors to SI happened to have large prime factors, but then gave up on the search. Out of curiosity, is there a specific one you could point to?

Pollard's rho algorithm sounds fascinating and exciting! I'm a little hesitant to code it up in this project just yet, because of its pseudorandom character. It sounds like it can fail, and I'm not confident I could write test cases that correctly exercise these failure cases, and cover the selection of a new parameter for the retry.
Perhaps we'll see how things develop with the paper. Perhaps we'll bite the bullet and try Pollard's rho in the meantime, ideally with plenty of tests and perhaps also fuzzing to gain enough confidence. Perhaps both!

I was initially under the impression that it would always succeed if retried with (a number of) different starting values, but after some more googling I realize now that that assumption is wrong, so you're right. I think for my purposes, I can "simply" choose to use slow trial division (or throw a compiler error) if Pollard's rho doesn't find a factor but the primality test says that the number is composite. That is not exactly a great solution, but it could be an improvement, since the small primes can still be covered by trial division, so everything that worked before will still work.

Note that the normal Miller-Rabin test is also probabilistic, but apparently it's been shown to always give the correct result for 64-bit ints if a large enough set of witnesses is tested. I won't claim to have double-checked that proof, though.

I'd be curious to know if fuzzing can help with this problem. Very handwavy again, but I wouldn't be surprised if the unpredictable nature of integer factorization prevents fuzzing from slowly encroaching on a failing test case.

What piqued my interest the most in your post was the idea of implementing vector space magnitudes in rust. In C++, Au and mp-units use variadic templates for the implementation. But I had thought rust didn't support variadic templates! (I've never used rust.) So how can you implement vector space magnitudes?

So, I should preface this by saying that I am just in the process of implementing this and haven't seen all the parts come together, but I think that I have verified that all individual components of my implementation should work.

My library works using const generics. The currently stable version of that feature only allows one to write code that is generic over ints/bools/chars. However, there are also unstable features that allow one to use generic const expressions and to have "arbitrary" structs as const generics, which is what my library uses. These still allow only a tiny subset of what C++ templates can do, but they definitely make things a little easier.

Now, real variadics don't exist, so my current plan is to express the vector space as a fixed-size array of at most 15 integer factors (plus special entries for pi/zero/...). Fifteen, because that makes all 64-bit integers representable: the primorial numbers are the worst case here, and 32589158477190044730, the product of the first 16 primes, is already too large for 64 bits.
With the unstable features, an array like that is allowed as a const generic, and it shouldn't be too hard to implement multiplication etc. on such an array.
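To make that concrete, here is a rough C++ analogue of the representation I have in mind (purely illustrative; the type and function names are made up, and the real implementation is a Rust const-generic array):

```cpp
// Rough C++ analogue of a magnitude as a fixed-size array of basis factors.
// Each slot holds a base (a prime, or a sentinel standing in for pi) raised
// to a rational exponent; unused slots have base == 0.
#include <array>
#include <cstdint>

struct BasisFactor {
    std::uint64_t base = 0;  // prime base (or sentinel for pi); 0 = unused slot
    int exp_num = 0;         // exponent numerator
    int exp_den = 1;         // exponent denominator
};

// 15 slots cover any single 64-bit integer: the product of the first 16
// primes, 32589158477190044730, already exceeds the 64-bit range.
using Magnitude = std::array<BasisFactor, 15>;

// Multiplication merges the two factor lists, adding exponents of matching
// bases.  (Sketch only: assumes equal exponent denominators and does not
// clear slots whose exponents cancel to zero.)
constexpr Magnitude multiply(Magnitude a, const Magnitude& b) {
    for (const BasisFactor& fb : b) {
        if (fb.base == 0) continue;
        bool merged = false;
        for (BasisFactor& fa : a) {
            if (fa.base == fb.base && fa.exp_den == fb.exp_den) {
                fa.exp_num += fb.exp_num;
                merged = true;
                break;
            }
        }
        if (!merged) {
            for (BasisFactor& fa : a) {
                if (fa.base == 0) { fa = fb; break; }
            }
        }
    }
    return a;
}

static_assert([] {
    Magnitude six{};
    six[0] = {2, 1, 1};
    six[1] = {3, 1, 1};
    Magnitude ten{};
    ten[0] = {2, 1, 1};
    ten[1] = {5, 1, 1};
    const Magnitude sixty = multiply(six, ten);
    return sixty[0].exp_num == 2 && sixty[2].base == 5;  // 60 = 2^2 * 3 * 5
}());
```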

The reason I'm familiar with this is because I checked out the iliekturtles/uom project when it briefly ascended to the top of the dimensional-analysis star chart. I was scandalized to find that they eagerly convert all quantities of any given dimension to some base unit for that dimension! That's undesirable because it complicates end users' choice: it forces them to change the binary program they produce if they want to get unit safety.

After thinking it through, I speculated that they were more-or-less forced into this design choice because rust's lack of variadic templates prevents them from being able to represent unit magnitudes in the type. (Yes, they could have used some analogue of std::ratio, but this representation doesn't meet the basic requirements for a multidimensional units library.) Thus, my assumption was that you can't actually write a good units library in rust as it stands today. But I could easily be missing something here due to my lack of experience with the language.

I think your assumption is mostly correct. I don't know exactly how the decision was made in uom, but since they work on stable Rust, they rely on typenum, which performs some wonderful type magic to allow them to represent integers at compile time. I suspect that representing the actual unit magnitudes at compile time without const generics would be challenging, at the very least.

However, I'm afraid my library, like all other Rust dimensional-analysis libraries that I am aware of, currently also represents quantities by converting them down to a base representation. For now, the reason I want the compile-time magnitudes is mainly to make constructing quantities easier (again, inspired by your talk - I want to be able to write v = 5.0 * meter / second or l = length.round_in(meters), and this will enable that syntax).

I don't think it is necessarily the lack of variadic templates that is keeping us from doing this properly - with const generics, we could choose to make our type generic over the exponent of every available unit (i.e. instead of { length: 1, time: -1 }, I'd have { meters: 1, seconds: -1, feet: 0, hours: 0, ... }).
This certainly gives ugly error messages, but I think the main reason most libraries don't do this right now is that we currently cannot express even remotely as many things in Rust's trait system as one can with C++ templates. I don't know enough about C++, but it seems that you are able to state things like "perform implicit conversion between two units if the ratio of their magnitudes fulfills some criterion" (see the sketch below). I wouldn't know how to do this in Rust, which means that the only option, if I want my quantities to be based on units instead of just the dimension, is to disallow all operations between quantities with different units and make every conversion explicit.
That might be an option, but for now I preferred the simplicity of the first approach. I used my library to write scientific code where the alternative seemed to be to just convert everything to a base representation anyway, while crossing your fingers and hoping that nobody forgets a conversion factor - coming from there, any kind of unit safety seems preferable, and having a nice user interface makes it easier to convince others to use it too.
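Here is a hedged C++ illustration of the kind of constraint I mean (toy types and names, not Au's actual API): a converting constructor that only participates in overload resolution when a compile-time predicate on the magnitude ratio holds.

```cpp
// Toy sketch: allow implicit conversion only when the unit-magnitude ratio
// is a whole number, so integer storage stays lossless.  Not Au's API.
#include <cstdint>
#include <type_traits>

struct Feet { static constexpr std::intmax_t magnitude = 12; };    // expressed in inches
struct Inches { static constexpr std::intmax_t magnitude = 1; };

// Converting From -> To multiplies the value by (From::magnitude / To::magnitude);
// for integer storage this is lossless only when that ratio is a whole number.
template <typename From, typename To>
constexpr bool lossless_conversion = (From::magnitude % To::magnitude == 0);

template <typename Unit, typename Rep>
class Quantity {
  public:
    constexpr explicit Quantity(Rep value) : value_{value} {}

    // Implicit conversion participates in overload resolution only when the
    // compile-time criterion holds.
    template <typename OtherUnit,
              std::enable_if_t<lossless_conversion<OtherUnit, Unit>, int> = 0>
    constexpr Quantity(Quantity<OtherUnit, Rep> other)
        : value_{other.raw() * static_cast<Rep>(OtherUnit::magnitude / Unit::magnitude)} {}

    constexpr Rep raw() const { return value_; }

  private:
    Rep value_;
};

// Feet -> Inches converts implicitly; the reverse does not compile, because
// the predicate is false and the constructor is removed from overload resolution.
constexpr Quantity<Inches, int> len = Quantity<Feet, int>{3};
static_assert(len.raw() == 36);
```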

That's undesirable because it complicates end users' choice: it forces them to change the binary program they produce if they want to get unit safety.

I'm curious what exactly you mean by this point. My impression was that the main advantage of proper unit safety was to make it possible to simultaneously represent quantities in units that are not easily convertible into one another (for example because their ratio is non-integral, or it might overflow the underlying storage type, ...), but now I am thinking I might have understood this incorrectly.

@chiphogg
Contributor

That's very interesting to hear, thanks! I was briefly looking at parsec or mole to check if their conversion factors to SI happened to have large prime factors, but then gave up on the search. Out of curiosity, is there a specific one you could point to?

Here, have two! I tracked down the motivating units.

  • The proton mass definition involves 1,672,621,923,695, one of whose factors is 334,524,384,739, which is prime.
  • The unit eV / c^2 involves 17,826,619,216,279, one of whose factors is 225,653,407,801, which is prime.

Now, real variadics don't exist, so my current plan is to express the vector space as a fixed-size array of at most 15 integer factors (plus special entries for pi/zero/...). Fifteen, because that makes all 64-bit integers representable: the primorial numbers are the worst case here, and 32589158477190044730, the product of the first 16 primes, is already too large for 64 bits. With the unstable features, an array like that is allowed as a const generic, and it shouldn't be too hard to implement multiplication etc. on such an array.

Thanks for the interesting explanation! I don't yet have any experience writing rust, so I appreciate the insights.

I wonder if you might want to (conservatively) double that size of 15? After all, you'll very commonly need to represent ratios, and the numerator and denominator won't have any prime factors in common.

Of course, even this wouldn't be enough because you could have fractional exponents, and magnitudes that could be represented in double even though they couldn't be represented as a ratio of two 64-bit integers. However, I expect that in practice, 30 basis vectors would be more than enough.

I used my library to write scientific code where the alternative seemed to be to just convert everything to a base representation anyway, while crossing your fingers and hoping that nobody forgets a conversion factor - coming from there, any kind of unit safety seems preferable, and having a nice user interface makes it easier to convince others to use it too.

Yes --- definitely! The robust unit safety is IMO clearly worth the cost of forcing users to change their computations in order to get it. Of course, it'd be even better if you could get the benefit without the cost, but that seems pretty challenging for rust at the moment.

That's undesirable because it complicates end users' choice: it forces them to change the binary program they produce if they want to get unit safety.

I'm curious what exactly you mean by this point. My impression was that the main advantage of proper unit safety was to make it possible to simultaneously represent quantities in units that are not easily convertible into one another (for example because their ratio is non-integral, or it might overflow the underlying storage type, ...), but now I am thinking I might have understood this incorrectly.

Let me elaborate. What I had in mind here was the principle that I referred to in my talk as "same program, only safer". Imagine a user writing a program without a units library. Let's suppose that their preferred choice for some particular length variable is to use an int, and represent the result in "mils" (i.e., one thousandth of an inch).

Now let's say they want to use a units library so they can prevent unit errors.

Some libraries have one unit for each dimension. I believe uom is like this. So in switching to this units library, they would have to convert their value to meters. (Obviously, this is majorly lossy in this case. It's not clear to me whether uom would raise a compiler error, or silently truncate.)

Some other libraries have one storage type for all user facing types. The nholthaus C++ library is like this. So in this case, the user can store their value natively in mils, but they will have to use double (or float, if they select that globally) rather than int. This makes the assembly code meaningfully different, which gives the user a tradeoff that must be weighed.

What "same program, only safer" means is that the quantity type should support both the same unit and storage type that the user would have had without any units library. In this case, something like Au's Quantity<Milli<Inches>, int> fulfills these criteria. Once the compiler finishes optimizing out the units types, we'll get the same assembly we would have had with just int ("same program"), but with the added knowledge that if we made a mistake, the compiler would catch it ("only safer").

@Tehforsch
Author

Here, have two! I tracked down the motivating units.

The proton mass definition involves 1,672,621,923,695, one of whose factors is 334,524,384,739, which is prime.
The unit eV / c^2 involves 17,826,619,216,279, one of whose factors is 225,653,407,801, which is prime.

Thanks for the nice examples! I especially like that the proton mass definition actually appears in the code for which I wrote my library, so it is quite likely that I would have run into this issue myself.

I wonder if you might want to (conservatively) double that size of 15? After all, you'll very commonly need to represent ratios, and the numerator and denominator won't have any prime factors in common.

Of course, even this wouldn't be enough because you could have fractional exponents, and magnitudes that could be represented in double even though they couldn't be represented as a ratio of two 64-bit integers. However, I expect that in practice, 30 basis vectors would be more than enough.

That's true; in order to handle ratios, I will probably have to go to 30 basis vectors.

What "same program, only safer" means is that the quantity type should support both the same unit and storage type that the user would have had without any units library. In this case, something like Au's Quantity<Milli, int> fulfills these criteria. Once the compiler finishes optimizing out the units types, we'll get the same assembly we would have had with just int ("same program"), but with the added knowledge that if we made a mistake, the compiler would catch it ("only safer").

Thank you, I understand your point now. I believe implementing the "same program, only safer" approach in Rust could be possible, but only at the cost of a lot of convenience (and error-message quality), which could make it a bit harder to convince people to move to a units library. This is definitely something to think about, though.

@chiphogg chiphogg added 📁 kind: bug Something isn't working correctly, or there's a mistake 💪 effort: large ⬇️ affects: code (implementation) Affects implementation details of the code labels Feb 19, 2024