VM: Introduce a new method of encoding arrays. #199

Merged
merged 4 commits into from
Apr 21, 2018

@dvander dvander commented Apr 20, 2018

This introduces an experimental new feature called "direct arrays". One of the most confusing and
complicated aspects of SourcePawn is how multi-dimensional arrays are encoded. Each non-terminal
level requires an "indirection vector" - an intermediate array that represents the next set of
arrays. That's normal, and is similar to how any language implements multi-level array pointers.

The difference with SourcePawn is that the addresses are encoded relative to the base address that
was last computed. It makes the compiler code extremely gross, it adds extra instructions on the
array access path, and it makes the VM support code for GENARRAY very complicated. Lastly, it makes
it very difficult to swap two slots in a nested dimension: all the relative calculations are wrong.

Direct arrays are much simpler. Addresses are absolute. Generating the indirection vectors is
trivial, and so is accessing slots in the array.
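To make the contrast concrete, here is a minimal C++ sketch of the two encodings for a 2D array. It is a toy model, not the VM's code: memory is a single vector of cells, addresses are cell indices rather than byte addresses, and the relative encoding is assumed to store the distance from each indirection entry to its row. The names `LayOut2D`, `AtRelative`, `AtDirect`, and `SwapRowsDirect` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using cell_t = int32_t;

// Lays out a rows x cols array at `base`: `rows` indirection cells, then the data.
// relative == true:  each entry holds (row address - entry address)  [assumed encoding].
// relative == false: each entry holds the absolute row address ("direct").
void LayOut2D(std::vector<cell_t>& mem, cell_t base, cell_t rows, cell_t cols, bool relative) {
    assert(base >= 0 && rows >= 0 && cols >= 0);
    assert(static_cast<std::size_t>(base + rows + rows * cols) <= mem.size());
    for (cell_t i = 0; i < rows; i++) {
        cell_t entry = base + i;
        cell_t row = base + rows + i * cols;
        mem[entry] = relative ? row - entry : row;
    }
}

// Relative access: the entry's own address must be added back in.
cell_t& AtRelative(std::vector<cell_t>& mem, cell_t base, cell_t i, cell_t j) {
    cell_t entry = base + i;
    return mem[entry + mem[entry] + j];
}

// Direct access: the entry already is the row address.
cell_t& AtDirect(std::vector<cell_t>& mem, cell_t base, cell_t i, cell_t j) {
    return mem[mem[base + i] + j];
}

// With absolute addresses, swapping two rows is just swapping two entries.
void SwapRowsDirect(std::vector<cell_t>& mem, cell_t base, cell_t a, cell_t b) {
    std::swap(mem[base + a], mem[base + b]);
}
```

In this model, a 2x3 array at base 4 gets relative entries 2 and 4. Exchanging those two cells would leave both entries pointing into row 0, which is the swap problem described above. Under the direct encoding the same exchange is correct as-is.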

Unfortunately, the old scheme did have an advantage: it made it possible to memcpy() an initializer
out of DAT, and into the stack, to populate an array. Since all the internal addresses were
relative, nothing needed to be rebased or fixed up. Even when a multi-dimensional array has no initializer, it still has a "template" in DAT that needs to be memcpy'd to lay out the levels of indirection.

With this new scheme we have to introduce a new opcode called REBASE. REBASE takes an address in
PRI, and three constants: the DAT offset where the array initializer lives, the size of the
indirection vector table, and the size of the terminal dimension data. REBASE performs a memcpy()
over to the stack address, and then goes through and applies an offset to fix each address in the
indirection vector section. It will be slightly less performant, but it should still be quite fast,
and I'd like to inline this later on if/when the feature becomes permanent.
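The two steps of REBASE (bulk copy, then address fix-up) can be sketched roughly as follows. This is an illustration under stated assumptions, not the opcode's actual implementation: memory is one vector of cells, addresses and sizes are counted in cells rather than bytes, and the template in DAT is assumed to store each indirection entry as an offset from the start of the array, so the fix-up simply adds the destination address. The function name `Rebase` and its parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using cell_t = int32_t;

// dest:       destination (stack) address of the array, as REBASE would find in PRI.
// dat_offset: where the initializer/template lives in DAT.
// iv_cells:   size of the indirection vector table, in cells (assumed unit).
// data_cells: size of the terminal dimension data, in cells (assumed unit).
// The source and destination ranges are assumed not to overlap.
void Rebase(std::vector<cell_t>& mem, cell_t dest, cell_t dat_offset,
            cell_t iv_cells, cell_t data_cells) {
    cell_t total = iv_cells + data_cells;
    assert(dest >= 0 && dat_offset >= 0 && iv_cells >= 0 && data_cells >= 0);
    assert(static_cast<std::size_t>(dest + total) <= mem.size());
    assert(static_cast<std::size_t>(dat_offset + total) <= mem.size());

    // Step 1: copy the whole template (indirection vectors + data) in one go.
    std::memcpy(mem.data() + dest, mem.data() + dat_offset,
                static_cast<std::size_t>(total) * sizeof(cell_t));

    // Step 2: turn each template-relative entry into an absolute address.
    // Only the indirection section is touched; the terminal data is left alone.
    for (cell_t i = 0; i < iv_cells; i++)
        mem[dest + i] += dest;
}
```

The extra cost over the old scheme is the second loop, which is linear in the size of the indirection table and does not touch the terminal data.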

Compilers that use this scheme must error when using a native that takes a multi-dimensional array.
The old algorithms for accessing them no longer work. I'll propose an alternative and give
sorting.inc a treatment before merging this.

dvander commented Apr 20, 2018

Currently, no compiler supports this new array scheme. When one does, it will be illegal to call old natives like "SortCustom2D" that take a multi-dimensional array. Natives like this are extremely rare since there is no API to decode the indirection vectors. It has to be done by hand. Nonetheless, at least two exist, and they'll have to be versioned so they can use the correct API.

My tentative plan is to extend sp_native_t and the plugin's native binding table to have a feature-level field. All old natives would provide a feature set of 0. If a compiler does not use any new features, its natives will all request a feature level of 0. If, on the other hand, a compiler used kCodeFeatureDirectArrays, then all natives it uses taking a >=2D array would need to also request that feature.

The binding logic in PluginRuntime would need to be updated to make sure a native is only bound if it can satisfy the feature level a plugin requested.
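As a rough illustration of the binding check described above, the following C++ sketch models the feature level as a bitmask and refuses to bind when the plugin requests a feature the native does not provide. Everything except the name `kCodeFeatureDirectArrays` is hypothetical: the bit value, the `NativeEntry` and `NativeTable` types, and `CanBind` are stand-ins, not the actual `sp_native_t` or PluginRuntime definitions. The subset test is one reading of "satisfy the feature level"; an exact-match policy would be a stricter alternative.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Assumed bit value, for illustration only.
constexpr uint32_t kCodeFeatureDirectArrays = 1u << 0;

// Hypothetical stand-in for a registered native; the callback is omitted.
struct NativeEntry {
    uint32_t features;  // feature set the native provides; 0 for all old natives
};

using NativeTable = std::unordered_map<std::string, NativeEntry>;

// Returns true only if the native exists and provides every requested feature.
bool CanBind(const NativeTable& table, const std::string& name, uint32_t requested) {
    auto it = table.find(name);
    if (it == table.end())
        return false;
    return (requested & ~it->second.features) == 0;
}
```

Under this sketch, an old native registered with feature set 0 still binds for plugins that request 0, but is rejected for a plugin that requests `kCodeFeatureDirectArrays`.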

This scheme is a bit complicated since it requires a new version of the sp_file_natives_t structure. Mucking up the binary format is not something I want to do lightly. In the past we've encoded other things like this by decorating the native name, which might be easier if we can standardize it.

@assyrianic

So multi-dim arrays are implemented similarly to a malloc'd set of array pointers? Why doesn't the compiler just optimize multi-dim arrays as one giant array using row * column calculations? Unless that's what you're trying to say here, but that wasn't clear.

// This case is easy... we can just read the rest of the file.
rd.Read(header.Data, Size, header.ImageSize - Size);
var new_stream = new MemoryStream(header.Data, Size, header.ImageSize - Size);

This is redundant since that exact code was moved below the switch already.
eb4a2e0

Thanks, I missed that.


dvander commented Apr 21, 2018

@assyrianic That would be a fine thing to do under other circumstances... it's not clear to me it would work in Pawn.

You can coerce an array to a set of pointers, which you can't do in C (it has the data layout you mentioned):

void f(int n[][]);
public int main() {
    int n[20][30][40];
    f(n[3]);
}

This wouldn't work since the sizes of the intermediate dimensions are not known. We couldn't embed the sizes either, because Pawn supports "slicing" arrays.
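For reference, the flat "row * column" layout can be sketched in C++ as below. The helper name `FlatIndex` is hypothetical. The point it illustrates is that the index calculation needs the inner dimension sizes (`d1` and `d2` here), which a callee declared as `f(int n[][])` has no way to know.

```cpp
#include <cstddef>

// Flat (row-major) index of n[i][j][k] in an array whose two inner
// dimensions have sizes d1 and d2. The outermost size is not needed,
// but both inner sizes are.
constexpr std::size_t FlatIndex(std::size_t i, std::size_t j, std::size_t k,
                                std::size_t d1, std::size_t d2) {
    return (i * d1 + j) * d2 + k;
}
```

For the `int n[20][30][40]` example, `n[3]` would start at flat index `FlatIndex(3, 0, 0, 30, 40)`, but indexing further into it still requires the 40, which is not carried along with the pointer.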

Another thing is it would make supporting garbage-collected arrays much more difficult. You wouldn't be able to pass around interior references without holding the entire base object alive.

@dvander dvander merged commit 33c1ebf into master Apr 21, 2018
@dvander dvander deleted the rebased-arrays branch April 21, 2018 19:51

dvander commented Apr 21, 2018

Leaving sorting.inc for later. I think the simplest thing is to decorate the name.
