Implement backwards-compatible 'random' redesign #3619

Closed

occivink
Contributor

@occivink occivink commented Dec 4, 2016

#2642

This should preserve the behavior for 0 or 1 argument.

The seeding is a bit arbitrary (8*32 bits of random data for the 624*32 bits of internal state in the engine), but the initialization step of the algorithm is there to make the most of the initial data. It's definitely better than 32-bit seeding to produce 64-bit output.
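
For illustration, the seeding scheme described above might look roughly like the sketch below. The helper names `gather_seed_words` and `engine_from_words` are hypothetical and not taken from the patch; the sketch assumes the seed words are passed through `std::seed_seq`, which spreads a short seed over the engine's whole state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: read `count` 32-bit words from the platform entropy source.
std::vector<std::uint32_t> gather_seed_words(std::size_t count) {
    std::random_device source;
    std::vector<std::uint32_t> words(count);
    for (auto& word : words) word = source();
    return words;
}

// Hypothetical helper: build a 64-bit Mersenne Twister from a few seed words.
// std::seed_seq stretches the words over the engine's whole internal state.
std::mt19937_64 engine_from_words(const std::vector<std::uint32_t>& words) {
    assert(!words.empty());
    std::seed_seq seq(words.begin(), words.end());
    return std::mt19937_64(seq);
}
```

Calling `engine_from_words(gather_seed_words(8))` would correspond to the 8*32 bits of seed data mentioned above.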

Regarding performance, I'm not sure if there is any concern to be had. The first invocation should be slower due to initialization, but insignificantly so ($CMD_DURATION reports 0ms on my end).

@ridiculousfish
Member

Heh, this is the first C++11-only feature usage AFAICT.

@ridiculousfish
Member

The code looks good to me. Very modern.

From my reading, nobody seems to really like or want the step parameter, and the order of parameters is hard to remember. As written, we are also vulnerable to divide by 0, and probably LLONG_MIN/-1, leading to crashes. Let's just eliminate the step variant, unless someone champions it and wants to tackle the overflow issues.

@occivink
Contributor Author

occivink commented Dec 4, 2016

Thank you. step is actually being checked for being strictly positive so I believe it should be okay.
Regarding overflow issues, the checks against start > end and step <= 0 should take care of them.

@ridiculousfish
Member

ridiculousfish commented Dec 4, 2016

You're right, I missed those checks. How about end-start on line 1836 and 1840? It looks like that may overflow if start is negative.

@occivink
Contributor Author

occivink commented Dec 4, 2016

Indeed, this is really a minefield. I tried to come up with a solution but couldn't find a clean one. It might be better to just remove step.
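
One common way to sidestep the `end - start` overflow is to do the subtraction in unsigned arithmetic, where wraparound is well defined. The following is only a sketch under that assumption; `step_count` is a hypothetical helper, not the code from the patch, and it relies on the `start <= end` and `step > 0` checks mentioned earlier in the thread.

```cpp
#include <cassert>

// Hypothetical helper: number of whole steps that fit between start and end.
// Preconditions mirror the checks discussed in the thread: start <= end, step > 0.
// Converting to unsigned first makes the subtraction well defined (it wraps
// modulo 2^N), even when start is negative and end is large.
unsigned long long step_count(long long start, long long end, long long step) {
    assert(start <= end);
    assert(step > 0);
    unsigned long long width =
        static_cast<unsigned long long>(end) - static_cast<unsigned long long>(start);
    return width / static_cast<unsigned long long>(step);
}
```

With these preconditions the true difference always fits in an `unsigned long long`, so no signed overflow can occur.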

long long result;
if (end - start < step) {
// nine nine nine nine nine nine
result = start;
Member

I'd hate to lose the dilbert reference, but I'm not sure returning something deterministic is the right thing to do. Error?

Contributor Author

I don't know, I'm of the opinion that if it is technically possible to produce a result we might as well do it, even if it doesn't make sense. Same reason as to why the start == end case is accepted.
That means fewer potential errors for scripts to handle (for example choose with only one argument).

Contributor

I'm inclined to argue that this case and start == end is an error since it always returns a constant. I don't like giving people enough rope to hang themselves. The "choose with only one argument" case is interesting in that the naive implementation would call random 1 (count $list) hence the reason you're allowing it. The problem with that logic is it fails if $list is empty as you're then running random 1 0 which will return one and $list[1] is obviously wrong. Shells by their nature tend to be lenient but in this case I think we're being too lenient and thus likely to mask serious usage errors.
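
To make the empty-list concern concrete, here is a minimal sketch of a selection helper that treats an empty list as an explicit error case instead of computing an out-of-range index. The function `choose` and its signature are assumptions for illustration only; they are not the fish implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper: pick one item uniformly at random.
// Returns nullptr for an empty list so the caller has to handle that case
// explicitly instead of receiving an out-of-range index.
const std::string* choose(const std::vector<std::string>& items, std::minstd_rand& engine) {
    if (items.empty()) return nullptr;
    std::uniform_int_distribution<std::size_t> dist(0, items.size() - 1);
    std::size_t index = dist(engine);
    assert(index < items.size());
    return &items[index];
}
```

A single-element list still works here and always yields that element, which matches the "choose with only one argument" case discussed above.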


int argc = builtin_count_args(argv);
static bool seeded = false;
static std::mt19937_64 engine;
Contributor

I'm still opposed to using this RNG engine. For one thing, the link makes it crystal clear that initializing the RNG with fewer seed bits than it requires can cause surprising behavior. Second, we do not need the guarantees it provides. We shouldn't even hint via code inspection that our RNG is suitable for cryptographic applications.

I do not see any legitimate argument for our random implementation to have a range larger than 0 to 4 GiB (or -2 GiB to 2 GiB). If someone can provide such an argument then it is sufficient to call a RNG that returns 32 bit values and merge them to result in a 64 bit value. Yes, doing that can produce statistical anomalies but, again, we should not even pretend to produce sequences that satisfy strong statistical guarantees. Our random numbers are meant for casual applications such as picking a value at random from a small set of values.
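
The "merge two 32-bit values" idea could be sketched as follows. This is an illustration only: `merge_halves` and `next_u64` are hypothetical names, and the sketch assumes an engine whose outputs cover all 32 bits (for example `std::mt19937`; `std::minstd_rand` would not qualify because its outputs stop short of 2^31).

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Join two 32-bit halves into one 64-bit value.
std::uint64_t merge_halves(std::uint32_t high, std::uint32_t low) {
    return (static_cast<std::uint64_t>(high) << 32) | low;
}

// Hypothetical helper: draw two engine outputs and merge them.
template <class Engine>
std::uint64_t next_u64(Engine& engine) {
    // The merge only makes sense if each draw covers all 32 bits.
    assert(Engine::min() == 0 && Engine::max() == 0xFFFFFFFFu);
    std::uint32_t high = static_cast<std::uint32_t>(engine());
    std::uint32_t low = static_cast<std::uint32_t>(engine());
    return merge_halves(high, low);
}
```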


int argc = builtin_count_args(argv);
static const struct woption long_options[] = {{L"help", no_argument, 0, 'h'},
{0,0,0,0}};
Contributor

Please use NULL, or even better nullptr, for the args that represent pointers. I know that a literal zero is equivalent and large parts of the fish code do so (I'm slowly changing those). We shouldn't introduce more such bogosities 😄

// nine nine nine nine nine nine
result = start;
} else {
std::uniform_int_distribution<long long> dist(start, start+(end-start)/step);
Contributor

You can run make style to ensure your code follows our documented style. In this case it would add whitespace around those binops.

@@ -0,0 +1,6 @@
function choose --description "Chooses a random item from a list"
Contributor

I'd prefer to see this implemented via random choose (or random choice or random select) via a random function. See what we do with the history function to augment the history builtin. Adding new commands runs the risk of causing problems for someone with an external command by the same name. So we should not do so if there is a reasonable alternative.

Member

I agree, let's not introduce a new function choose but instead make it a feature of random

@krader1961
Contributor

While I obviously have a strong opinion about a couple of aspects of this change, overall I like it and greatly appreciate your taking the time to create an implementation, @occivink. The cherry on top of the sundae (metaphorically) would be at least a handful of unit tests to verify basic behavior such as bounds checking.

@occivink
Contributor Author

occivink commented Dec 5, 2016

Thanks for the feedback, I'll take care of the open points that were raised.

Regarding the engine, I really think we shouldn't use one with a period smaller than the maximum range we want to produce. Even if we restricted start and end to 32 bits and used std::minstd_rand0, it would be insufficient to ever produce the full range. Seeding the Mersenne Twister with 256 bits is not ideal, but it should give plenty of possible initial states and solves the most important issue mentioned in the article (the first value not covering the full 64-bit range).
And I think that it's better for somebody to look at the implementation and conclude that it's sufficient for their use (even if it might not be and then that's on them), than the opposite (i.e. somebody assuming that it's enough and getting really inadequate results, but we could have done better).
If you're categorically against it, I'll yield but I won't be really happy about it.

@krader1961
Contributor

...somebody assuming that it's enough and getting really inadequate results...

This was discussed extensively. The decision was that we would not implement a random suitable for applications needing hard guarantees. People should use tools like openssl where appropriate. See @ridiculousfish's comment. As part of this change the man page should be modified to make it crystal clear no one should trust our implementation to be safe for use in cryptographic or equally demanding applications even if we use the Mersenne Twister engine. We do not have the expertise or desire to take on the responsibility for providing such guarantees. This is also why I am opposed to using a 64-bit generator. It makes it too likely someone will think they can safely use it in situations where it is not appropriate.

@ridiculousfish
Member

ridiculousfish commented Dec 6, 2016

Where I come down is:

  1. fish promises to generate random numbers that are good enough for a command line shell, which is a very low bar, therefore we
  2. use whatever PRNG is easiest, least likely to be wrong, and least likely to raise eyebrows/questions.

Based on that, I think any of the (super over-designed, OMG) C++11 engines are fine. MT is especially fine, since it's the most widely used (Wikipedia says so), so it's least likely to raise eyebrows.

Regarding whether to output 32 bit or 64 bit output: it seems to be no harder to use 64 bits. It's just changing the type, right? So we might as well just do 64 bit now and save ourselves the embarrassment in 10 years time.

Totally agree with krader to document that our PRNs are in no way suitable for cryptographic purposes.

Edit: I just noticed that MT is pretty porky, at 2.5 KB state. fish uses 1.4 MB currently according to Activity Monitor, so the MT's contribution is significant (~1.3%). Let's just use an LCG engine, which has a puny state. Shells are often invoked to recover from OOM scenarios, so we ought to be quite lean.
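
The state-size comparison can be checked directly with `sizeof`. The sketch below is an illustration only; `engine_bytes` and `percent_of_footprint` are hypothetical helpers, and the exact object sizes depend on the standard library in use.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Size of an engine object, which is dominated by its internal state.
template <class Engine>
std::size_t engine_bytes() {
    return sizeof(Engine);
}

// Hypothetical helper: an engine's share of a process footprint, in percent.
double percent_of_footprint(std::size_t state_bytes, std::size_t process_bytes) {
    assert(process_bytes > 0);  // guard the division
    return 100.0 * static_cast<double>(state_bytes) / static_cast<double>(process_bytes);
}
```

Comparing `engine_bytes<std::mt19937_64>()` with `engine_bytes<std::minstd_rand>()` shows the gap between the two families: the former has to hold at least 312 64-bit words, while the latter only needs a single state value.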

@occivink
Contributor Author

occivink commented Dec 6, 2016

it seems to be no harder to use 64 bits
Let's just use a LCG engine

The C++11 default typedefs for LCG engines only support 32 bits output. Other possibilities include:

  • using non-STL constants for an LCG, such as "Knuth's preferred 64-bit LCG" (as mentioned in the article or wikipedia). Dangerously close to rolling our own prng.
  • ranlux48 from the STL.
  • Ditching 64 bits output for 32 bits.

ranlux48 seems like a pretty conservative choice to me.
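
For reference, the first option in the list above (an LCG with non-STL constants) can be expressed with the standard `std::linear_congruential_engine` template. This is a sketch, not something proposed for merging; it assumes the multiplier and increment commonly attributed to Knuth's MMIX generator, and `knuth_lcg64`, `manual_step` and `check_first_output` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Assumption: these are the MMIX constants usually attributed to Knuth.
// A modulus of 0 tells the template to use 2^64 for a 64-bit result type.
using knuth_lcg64 = std::linear_congruential_engine<
    std::uint64_t, 6364136223846793005ULL, 1442695040888963407ULL, 0>;

// One LCG step computed by hand; unsigned overflow wraps modulo 2^64.
std::uint64_t manual_step(std::uint64_t state) {
    return state * knuth_lcg64::multiplier + knuth_lcg64::increment;
}

// Sanity check: the engine's first output equals the hand-computed step.
void check_first_output(std::uint64_t seed) {
    knuth_lcg64 engine(seed);
    assert(engine() == manual_step(seed));
}
```

The state of such an engine is a single 64-bit word, which is the property that makes LCGs attractive in the memory discussion above.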

@krader1961
Contributor

use whatever PRNG is easiest, least likely to be wrong

That's why I'm arguing for one of the simpler 32 bit engines. One reason: how do you manually seed the MT engine, given how many seed bits it requires? A simple 32 bit, or even 64 bit, int isn't sufficient. And that's all that random $seed can provide.

As for 32 versus 64 bit PRNGs I still think that given how random is used in shell scripts even 32 bits should be more than sufficient even a decade from now. If someone really needs a range larger than 4 billion then the shell's random command is not the right tool for the job. Note that the range is not the same thing as the period. From what I can glean by googling it looks like the two ranlux engines have a significantly larger period than the range of values they return.

The documentation at http://en.cppreference.com/w/cpp/numeric/random is awful regarding the characteristics of the various engines. And I suspect everyone else commenting on this change is just as confused as I am regarding which makes the most sense given our requirements.

@occivink
Contributor Author

Okay this should take care of the points that were raised.

The overflow handling is rather bulky but I'm reasonably confident in it (special mention to clang's undefined behaviour sanitizer). I'd understand if you'd rather completely drop STEP for simplicity of implementation.
I'm still somewhat concerned by the use of an engine with a period of 2^31-1 to cover 64 bits output, but at least uniform_int_distribution is making the result uniform over the entire range. So really there should only be a problem if you use random as many times as the period, which is already a weird use-case.
I've allowed the trivial case of only one entry for random choice.
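
As a rough illustration of the combination described above, a 64-bit range can be served by a small 32-bit engine through `std::uniform_int_distribution`. The helper `random_in_range` is hypothetical and leaves out the STEP handling and overflow checks of the real change.

```cpp
#include <cassert>
#include <random>

// Hypothetical helper: draw from [start, end] with a small 32-bit LCG.
// std::uniform_int_distribution calls the engine as often as it needs to
// cover ranges wider than a single engine output.
long long random_in_range(std::minstd_rand& engine, long long start, long long end) {
    assert(start <= end);  // the distribution requires start <= end
    std::uniform_int_distribution<long long> dist(start, end);
    return dist(engine);
}
```

The distribution widens the output range, but it cannot lengthen the engine's period, which is the remaining concern voiced above.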

The documentation at http://en.cppreference.com/w/cpp/numeric/random is awful regarding the characteristics of the various engines.

No objections here.

@ridiculousfish
Member

ridiculousfish commented Dec 16, 2016

I'm happy with this as is and would like to squash-merge it. Thank you again! I'd like to try to simplify some of the overflow checking but that can happen after merge. Any further comments @krader1961 ?

@krader1961
Contributor

LGTM. I appreciate the comprehensiveness of the change. There are a handful of whitespace style issues and make lint warned about one semi-serious problem:

implicit conversion loses integer precision: 'long long' to 'result_type' (aka 'unsigned int')

for the engine.seed(seed); statement. Also, two lines later drop the } else {. Just do

            if (!parse_error) {
                engine.seed(seed);
                return STATUS_BUILTIN_OK;
            }
            return STATUS_BUILTIN_ERROR;

Even better would be to invert the logic:

if (parse_error) return STATUS_BUILTIN_ERROR;
engine.seed(seed);
return STATUS_BUILTIN_OK;

@occivink
Contributor Author

Thank you again! I'd like to try to simplify some of the overflow checking but that can happen after merge.

I'd appreciate that; I probably made this more complicated than necessary.

@krader1961: how are you getting that message? cppcheck is not warning me of anything in my changes when I run make lint-all.

@krader1961
Contributor

Don't know why cppcheck isn't giving you that warning because it should. See http://www.cplusplus.com/reference/random/linear_congruential_engine/seed/ where it says

result_type is a member type, defined as an alias of the first class template parameter (UIntType).
default_seed is a member constant, defined as 1u.

@occivink
Contributor Author

Not sure why either, there are a lot of other hints but nothing on src/builtin.cpp. Can you tell me if this fixes it?

std::seed_seq seq{ seed };
engine.seed(seq);

@ridiculousfish
Member

Well it's a decision we have to make - the seed value for the standard engine is 32 bits, but the interface allows specifying a 64 bit seed. Assuming we don't really care, I think the right fix is to just cast to the smaller size:

engine.seed(static_cast<uint32_t>(seed));

Or the more precise and annoying:

engine.seed(static_cast<std::minstd_rand::result_type>(seed));

@krader1961
Contributor

I was going to recommend the same solution that @ridiculousfish just provided. Keep it 64-bits at the user level for consistency and to give us flexibility if we change the implementation such that a 64-bit seed would be useful. Suppress the warning by explicitly casting the value to indicate we know we're throwing away information.

@occivink
Contributor Author

Alright, I was hoping that seed_seq would turn my 64-bit input into a 32-bit sequence automatically, but it doesn't and I don't want to do that manually. Truncating is a good enough solution imo.

I couldn't find the whitespace issues you were talking about.
Btw, make style doesn't fix whitespace around binary operators.
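
For completeness, the manual alternative to truncation would be to split the 64-bit seed into two 32-bit words before handing it to `std::seed_seq`. The sketch below shows that idea only; it is not what was merged, and `split_seed` and `seed_from_64_bits` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Split a 64-bit seed into two 32-bit words.
void split_seed(long long seed, std::uint32_t& low, std::uint32_t& high) {
    std::uint64_t bits = static_cast<std::uint64_t>(seed);
    low = static_cast<std::uint32_t>(bits & 0xFFFFFFFFu);
    high = static_cast<std::uint32_t>(bits >> 32);
    // The two halves together must reproduce the original bits.
    assert(((static_cast<std::uint64_t>(high) << 32) | low) == bits);
}

// Hypothetical helper: seed the engine from both halves via std::seed_seq,
// so the upper 32 bits of the seed are not silently discarded.
void seed_from_64_bits(std::minstd_rand& engine, long long seed) {
    std::uint32_t low = 0, high = 0;
    split_seed(seed, low, high);
    std::seed_seq seq{low, high};
    engine.seed(seq);
}
```

Truncating with a cast is simpler; splitting like this would only matter if seeds that differ solely in their upper 32 bits should produce different sequences.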

@krader1961
Contributor

Squash merged as 7996e15 and 1ace742. Many thanks, @occivink, for your hard work on this.

@krader1961 krader1961 closed this Dec 21, 2016
@krader1961 krader1961 added this to the fish 2.5.0 milestone Dec 21, 2016
@occivink
Contributor Author

occivink commented Dec 21, 2016

Likewise, I appreciate the time that you spent with me on this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 17, 2020