Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The current RNG implementation generates a negative number from a specific seed #2206

Open
ephphatha opened this issue Jun 21, 2021 · 10 comments
Assignees
Labels
bug Something isn't working vanilla Bugs found also in the original Diablo 1.09b release

Comments

@ephphatha
Copy link
Contributor

ephphatha commented Jun 21, 2021

Important information
Bug present in all DevilutionX versions since the initial commit on any compiler with similar implementation defined behaviour as msvc

Describe
Calling GenerateRnd(int) with a seed of 1457187811 (either supplied as the seed from a call to SetRndSeed(int) immediately prior or reached naturally from calls to AdvanceRndSeed()) results in the next call to either AdvanceRndSeed() or GenerateRndSeed(x) (where x is any value in the range [2, INT_MAX]) generating a negative number.

In instances like this where the number is used in a switch statement none of the cases will be taken, but the program will continue as normal.

switch (GenerateRnd(4)) {

There are instances where a random number is used as an array index where the possible values are still within bounds. For example PlaySfx calls RndSfx to get a random sound from a certain category. The lowest value that could be returned is 8 (corresponding to the effect PS_LGHIT1) when called with PS_SWING (expected values are either 9 or 10, PS_SWING or PS_SWING2)

return static_cast<_sfx_id>(psfx + GenerateRnd(nRand));

Other instances will lead to reading a negative array index. The code which looks for a free item space is one example, this code can generate a point with an x or y position of -32 (16 + -48) which is immediately used as an index into dMonster

position = Point { 16, 16 } + Point { GenerateRnd(80), GenerateRnd(80) };

if (dMonster[position.x][position.y] != 0)

To Reproduce
Edit a save or use breakpoints to seed the RNG before one of these call sites. If there's a save with a known seed leading to a crash when generating a dungeon that would be great, I haven't tried to craft one yet.

Expected behavior
I would expect non-critical calls to the RNG to generate within the expected range, so situations like PlaySfx should always play one of the expected sounds.

Critical calls to the RNG should mimic the same behaviour if it didn't lead to a crash. Using the first example to keep compatibility with old saves/original clients bypassing the defined cases in the switch statement is desirable.

I am unsure what the best approach is for critical calls that lead to a crash/reading out of bounds. If a seed leading to this behaviour always crashes the game when trying to enter the next floor of a dungeon, fixing it would be desirable so that the seed is actually playable. If it's possible to recover and a second attempt generates a different dungeon then maybe not?

Additional context
I have not done an exhaustive search, the lines I linked were the first examples I found illustrating the different ways I could see this case causing odd behaviour. There may be additional ways for the game to crash or give unexpected results.

I also have not tested this on the original game yet, if I get the time to craft a save that reliably crashes I'll update this issue.

@julealgon
Copy link
Contributor

  1. How did you find this?
  2. Is it possible to get that seed "normally"?

@ephphatha
Copy link
Contributor Author

ephphatha commented Jun 22, 2021

  1. I wrote a short program to test the effects of the distribution and casts, pulling out the seeds immediately before and the seeds which when cast to signed values then fed into abs(int) gave "interesting" values (basically corner cases or useful test values) - https://gist.github.com/ephphatha/c7df46c2c4938f3d315c7a56434f9da2
    Every positive signed value has two seeds that lead to it (as expected, casting an unsigned value to a signed value is well defined) except for 0 (only one seed, would expect 2) and INT_MIN (would expect none).
  2. Yep, it's guaranteed after calling AdvanceRndSeed() enough times. LCG engines effectively have a set of chains they will loop through before getting back to the starting value for that chain. The Borland LCG constants give an LCG engine with a single chain containing every value from 0 to 2^32 - 1.
    The saving grace is that not every use of a random number crashes when fed a negative value, so the 1 in 2^32 instance of this happening is also less likely to be noticed.

@julealgon julealgon added bug Something isn't working vanilla Bugs found also in the original Diablo 1.09b release labels Jun 23, 2021
@ephphatha
Copy link
Contributor Author

ephphatha commented Jun 24, 2021

Had a look at the original executable this arvo and the reversed code from the initial commit of this repository almost matches the base game implementation, the only difference is the way certain types are interpreted.

It looks like Diablo was originally compiled using Borland C++ but at some point they switched to their own rand() implementation.

Borland C++ provides two random number functions based off an LCG engine. lrand() masks off the most significant bit, returning a result in the interval [0, 2^31). Because this set of LCG factors produces a fairly cyclic bit pattern in the low bits, rand() discards the low half of the result and masks off the sign bit (by performing state >> 16 & 0x7FFF), returning a value in the interval [0, 2^15). Both functions discard the MSB as it doesn't change often between consecutive values in the sequence.

The RNG used by Diablo attempts to provide both functions but seems to be incorrectly implemented. As both rand() and lrand() mask off the MSB, the Diablo implementation attempts to do the same thing by interpreting the state as a signed integer and negating it. This fails in the case noted in this bug report.

The logic to determine when to discard the low bits is also ill-formed. An upper bound of 2^16 - 2 is used (which is already an off by one error) but both Borland rand() and the Diablo version return a max value of 2^15 - 1, just over half of this limit. Using mod instead of bit masking also allows the sign bit propagation to continue which introduces a tiny bit of bias. Each value in the interval [1, 2^15 - 1] has a 2^17 in 2^32 chance of being returned, 0 is slightly less likely at 2^17 - 1 in 2^32, and -(2^15) has a 1 in 2^32 chance. If it's provable that a limit lower than 2^15 is always used then we could fix this without breaking compatibility.

Regarding the LCG implementation, I suspect whatever tool was used to generate the code in the initial commit incorrectly interpreted one of the operations in the core _rand() function. The engine state is an unsigned value which uses the three argument form of the x86 IMUL (signed multiplication) instruction to calculate a 32 bit result. The engine state is not interpreted as a signed value until inside the _abs() function when a test is performed to see if the most significant bit is set.

The code in its current form is inconsistent about whether seeds are signed or unsigned, with some callers using int, some using uint, and some using int16 values. Signed to unsigned conversion is well defined, but unsigned to signed is implementation defined, so when the seed is passed to these callers or set on the structs it's not guaranteed the values will be interpreted correctly. I hope to open a PR sometime in the next few days updating this to avoid implementation defined behaviour.

@julealgon
Copy link
Contributor

Very impressive investigative work @ephphatha . I'm always amazed at these ultra low-level details that I'd absolutely never dive into in a higher level language 🤣

It is kind of a shame to me that we have to dedicate so much effort for such a "common" thing like "generating a random number".

I don't expect DevilutionX to break compatibility here at any point, but I'm certainly looking forward to do doing so in a mod and getting rid of this insanity altogether in favor of a more standard approach using known libraries.

I also assume it would be possible to "fix" this by forcing the function to re-execute "until a non-negative value is returned". Would be slightly less efficient per call (due to the added if check) but would otherwise be harmless. Feels very hacky though, so I already despise it 😛

@ephphatha
Copy link
Contributor Author

Found a nice example of how this bug can give unexpected behaviour. The following code is slightly modified from quests.cpp:InitQuests() to show the steps involved. Using a modified save where glSeedTbl[15] is one of the values leading up to the troublesome state you can reliably trigger out of bounds reads and writes. On my machine (running a debug x64 build with msvc) this does not crash the game but it does trample some region of memory and results in all quests in QuestGroup1 remaining active.

	if (!gbIsMultiplayer && sgOptions.Gameplay.bRandomizeQuests) {
		vanilla::SetRndSeed(glSeedTbl[15]); // glSeedTbl of 988045466 results in int_min after two rounds
		if (vanilla::GenerateRnd(2) != 0)
			quests[Q_PWATER]._qactive = QUEST_NOTAVAIL;
		else
			quests[Q_SKELKING]._qactive = QUEST_NOTAVAIL;

		int32_t questGroupIndex = vanilla::GenerateRnd(sizeof(QuestGroup1) / sizeof(*QuestGroup1)); // questGroupIndex is -2, expected [0, 2]
		int questId = QuestGroup1[questGroupIndex]; // read OOB, questId is -65536 in this test
		quests[questId]._qactive = QUEST_NOTAVAIL; // write OOB. All quests in group 1 are still active, expected Q_GARBUD or Q_BUTCHER to be deactivated with this seed.
		quests[QuestGroup2[vanilla::GenerateRnd(sizeof(QuestGroup2) / sizeof(int))]]._qactive = QUEST_NOTAVAIL;
		quests[QuestGroup3[vanilla::GenerateRnd(sizeof(QuestGroup3) / sizeof(int))]]._qactive = QUEST_NOTAVAIL;
		quests[QuestGroup4[vanilla::GenerateRnd(sizeof(QuestGroup4) / sizeof(int))]]._qactive = QUEST_NOTAVAIL;
	}

A safe fix (that modifies behaviour) would be to transform int_min to 0 in the RandomNumberDistribution that performs abs((int32_t)state) (I've got a PR in progress that would provide this behaviour on a vanillaSafe::RandomEngine, I'll provide an example when that's ready for review). This would mean that one quest is always deactivated. Or, to allow for the existing behaviour the check can be rewritten to call GenerateRnd() separately and check that the result is valid before using it as an index.

		int32_t questGroupIndex = vanilla::GenerateRnd(sizeof(QuestGroup2) / sizeof(int));
		if (questGroupIndex >= 0)
			quests[QuestGroup2[questGroupIndex]]._qactive = QUEST_NOTAVAIL;
		// else leave all quests active

@ephphatha
Copy link
Contributor Author

One thing I should've realised earlier is INT_MIN is a power of 2, so calling GenerateRnd with any power of two as an argument will always return a safe value as INT_MIN % pow(2, x) is always 0.

This makes code like the barrel spawning function safe in all cases because it uses GenerateRnd(pow(2, 3))

dir = GenerateRnd(8);

The first example in this bug report is safe as well 😅, the second could fail but not the specific situation I described.

@vlolteanu
Copy link
Contributor

I suggest fixing this in a non-atomic fashion, since I suspect that it would take a lot of time and effort.

First, we could do a project-wide rename GenerateRnd to GenerateRndUnsafe, and implement GenerateRnd to essentially return abs(GenerateRndUnsafe()). This would effectively make it deprecated and place a "dunce cap" on everything that uses it, telling us where attention is needed.

Next, we could look at every instance closely and see what can be done about it:

If it produces weird side-effects that we can't get rid of without breaking saved games, we'd have to look at it on a case-by case basis. We could try to cleanly mimic the weird side-effects s.t. the behavior follows clearly from the code and/or try to blacklist the problematic seeds. I have no idea to what extent this is possible.

@AJenbo
Copy link
Member

AJenbo commented Jul 7, 2021

I like the idea of this

@ephphatha
Copy link
Contributor Author

I suggest fixing this in a non-atomic fashion, since I suspect that it would take a lot of time and effort.

Agreed, that's part of the reason I proposed instanced generators in #2261 as it's relatively easy to track calls to GenerateRnd() between calls to SetRndSeed() and then isolate those sections of code.

First, we could do a project-wide rename GenerateRnd to GenerateRndUnsafe, and implement GenerateRnd to essentially return abs(GenerateRndUnsafe()). This would effectively make it deprecated and place a "dunce cap" on everything that uses it, telling us where attention is needed.

In that PR I've marked unsafe functions as deprecated with references to safe alternatives (using doxygen comments and namespaces). While I haven't provided a safe version using the original engine I'd avoid using a further call to abs() to handle the negative case. It would be better to clamp to 0 for the negative result as it reduces bias slightly.

@qndel identified a lot of areas that don't rely on repeatable sequences in #2084, these can be freely replaced with a generator using standard library concepts from <random>.

Next, we could look at every instance closely and see what can be done about it:

If it produces weird side-effects that we can't get rid of without breaking saved games, we'd have to look at it on a case-by case basis. We could try to cleanly mimic the weird side-effects s.t. the behavior follows clearly from the code and/or try to blacklist the problematic seeds. I have no idea to what extent this is possible.

As you say handling this appropriately depends on the use. Using the quest pool example from above, I've implemented a RandomChoice() function in that PR that returns an optional. If GenerateRnd() returns a negative value (when the list has a non power-of-two length) then it returns an empty optional and the caller can use is_present() checks to replicate vanilla behaviour (in this case not marking a quest in the pool as inactive).

Cases like when GenerateRnd() is used in a switch statement can be left as is with a comment explaining the use of the bug, rewritten to use the same RandomChoice() optional, or potentially fixed to always take one of the cases if it turns out to be non-breaking.

Blacklisting seeds is something I would prefer to avoid at all costs, given the way level generation works the only difference I expect from devilutionX compared to vanilla is a level that always crashes vanilla to be playable in devilutionX.

@AJenbo
Copy link
Member

AJenbo commented Jul 8, 2021

Just don't forget that the active quests affect the level layout :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working vanilla Bugs found also in the original Diablo 1.09b release
Projects
None yet
Development

No branches or pull requests

4 participants