Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

analysis: Introduce Vulkan XML parser #745

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Friz64
Copy link
Contributor

@Friz64 Friz64 commented Apr 20, 2023

This adds a parser for Vulkan XML registries, namely vk.xml and video.xml. It doesn't have a 100% coverage of all possible attributes, and doesn't aim to do so, though it can be easily expanded when necessary. Any additional attributes that are not extracted will simply be ignored. It doesn't try to process the XML in any way at all, even for simple things such as parsing a number.

TODO

  • Fix CI
  • Add support for unions
  • Add support for external types
    • These are types either from C itself, or from another Vulkan library, which are referenced in this XML file.
    • Identified by not having a "category" attribute.
  • Improve the documentation
    • It is not immediately obvious which Rust struct corresponds to which specific parts of the XML. Maybe each struct should be accompanied by example XML data it can represent in order to make it easier to recognize.

@Friz64 Friz64 marked this pull request as draft April 20, 2023 06:59
@Friz64 Friz64 marked this pull request as ready for review April 20, 2023 21:20
@Jerrody
Copy link

Jerrody commented May 1, 2023

@Friz64
Yet another wonderful PR! Thank you!

Copy link
Collaborator

@MarijnS95 MarijnS95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking very clean and simple, I really like it thus far!

Curious to see how much processing we're going to have to do on top and how sane we can make it (consider e.g. recent issues about multiple features/extensions depending on and providing the same functions and types, etc).

.github/workflows/ci.yml Outdated Show resolved Hide resolved
analysis/src/lib.rs Outdated Show resolved Hide resolved
Some("struct") => {
registry.struct_aliases.push(Alias::from_node(type_node));
}
_ => (),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warn about new/unhandled items? Same below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I designed this parser around it only filtering out data WE want, and ignoring everything else. If new nodes with an unknown category get added, and we want to use that data, we can easily expand this. It's not often that there are new types of data points which are of use to our generator, and adding checks everywhere throughout the code to look for these feels unnecessary in my opinion.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true, it's not like we provide just the parser and need to be alerted whenever new elements / attributes are added; instead we might notice when updating to a newer vk.xml and implement it from the get-go, or need to implement it to be able to generate the right code for a new pattern.

Will consider this resolved now, though a debug!() stating all the features that we haven't implemented (for sensible reasons) might still be nice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though a debug!() stating all the features that we haven't implemented (for sensible reasons) might still be nice

Implemented this. I have also set up env_logger to go along with this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented this. I have also set up env_logger to go along with this.

Hm, CI fails because env_logger 0.11, which came out just hours ago, has an MSRV of 1.71. It's probably not worth bumping our MSRV over this? So, is it fine depending on env_logger 0.10? Or maybe use another logging library?

Copy link
Contributor Author

@Friz64 Friz64 Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to tracing now, which is more modern.

analysis/src/xml.rs Outdated Show resolved Hide resolved
Comment on lines +187 to +219
node.attribute("supported")
.map(|values| values.split(',').any(|support| support == api))
.unwrap_or(true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deduplicate with api_matches? Only the attribute name is different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only place in the code where we have to filter for the string "supported" instead of the much more common "api", so I decided not to disturb the api_matches function and just have a special case here. We could also have api_matches call another function attribute_matches with this code, and then use that here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Do we remember what the difference was between api and supported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not exactly the same as api, because it's value can also be disabled.

Side note, this also means that we automatically skip all disabled extensions as a consequence of this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also have api_matches call another function attribute_matches with this code, and then use that here.

So, to be clear, did you suggest that this should be done?

analysis/src/xml.rs Outdated Show resolved Hide resolved
analysis/src/xml.rs Outdated Show resolved Hide resolved
analysis/src/xml.rs Outdated Show resolved Hide resolved
analysis/src/xml.rs Outdated Show resolved Hide resolved
@MarijnS95 MarijnS95 requested a review from Ralith May 29, 2023 19:14
Comment on lines 22 to 35
fn child_text(node: Node, name: &str) -> Option<XmlStr> {
let child = node.children().find(|node| node.has_tag_name(name));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we need more StringStorage -> Cow conversions, and should we factor it out into a function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There probably won't be any more conversions. We could factor it out, but I don't think it's really worth it currently.

@MarijnS95
Copy link
Collaborator

@Friz64 looks like we have a few comments left after which we should be ready to merge this and build the next layer on top.

It is not immediately obvious which Rust struct corresponds to which specific parts of the XML. Maybe each struct should be accompanied by example XML data it can represent in order to make it easier to recognize.

Seems fine to me as it is, every struct and field name corresponds to the XML element / attribute name AFAIR.


I wonder if this crate should be renamed to parser or if you intend it to also contain the analysis stage (e.g. linking everything together) within it?
(Probably doesn't make much sense to only have a purpose-built parser crate for Ash)

@Friz64
Copy link
Contributor Author

Friz64 commented Jan 19, 2024

Trying to get back into working on this. Sorry for the silence. Life is hard.

I wonder if this crate should be renamed to parser or if you intend it to also contain the analysis stage (e.g. linking everything together) within it? (Probably doesn't make much sense to only have a purpose-built parser crate for Ash)

The intention was for this crate to contain the analysis stage, yes.

@Friz64 Friz64 force-pushed the rewrite-xml branch 3 times, most recently from 330a9dd to 8035d4d Compare January 20, 2024 00:04
@Friz64
Copy link
Contributor Author

Friz64 commented Jan 20, 2024

The next layer on top of this should take care of all remaining parsing.

For this, I was reading back on the conversation in #662, and the conclusion we seemingly came to for C parsing was to use a parsing library, not on the whole spec at once, but on small chunks, lifted from the XML. One idea was to leverage the lang-c library I had used in erupt for this, but after some testing I learned that it is not really fit for that purpose.

Today, I again searched for other C parsing libraries we could use, and one stuck out to me: tree-sitter-c. Tree-sitter is an parsing library written in C, with well-written and official Rust bindings. In terms of C libraries, it is pretty harmless with all its code included in the crate, so compiling it as part of the generator should not be an issue, even on Windows. tree-sitter-c is the official C grammar for it. I did some quick experiments throwing various code snippets at it, and it seems to be pretty much exactly what we are looking for. It was developed for use within code editors, so it does not require its input to be run through an pre-processor. It also doesn't require any context for parsing. This means that even something like uint8_t pipelineIdentifier[VK_UUID_SIZE], where lang-c would have previously errored out because it doesn't know what VK_UUID_SIZE is supposed to be exactly, parses successfully. The downside of this on the other hand is that is only produces an Concrete Syntax Tree, not a more higher-lever Abstract Syntax Tree, but this is enough for our purposes.

@eddyb
Copy link

eddyb commented Jan 27, 2024

This means that even something like uint8_t pipelineIdentifier[VK_UUID_SIZE], where lang-c would have previously errored out because it doesn't know what VK_UUID_SIZE is supposed to be exactly,

Sorry if this was discussed before (I couldn't easily find this specific point brought up), but:

<member><type>uint8_t</type>                <name>pipelineIdentifier</name>[<enum>VK_UUID_SIZE</enum>]</member>

The actual XML has <type>, <name> and <enum>, fully disambiguating the identifiers used in the declaration.

Even without the <enum> on VK_UUID_SIZE, the syntax is unambiguously Type · "[" Expr "]".

(if you'll excuse my shorthand - · would be like the "declaration focus" or "declaree", trivial here but more important for e.g. pointer-to-array/function, where the type syntax turns inside-out with parenthesization - e.g. X (Z ·) Y => (X Y) Z ·, where X/Y/Z are any tokens, if you could parenthesize types in C, or if you use a C++ noop type alias to emulate it etc.)


If there are counter-examples to that idea, I'd love to see them, because it's the direction I would've went in, for my own coincidental vk.xml parsing needs (before I remembered this effort might end up being a reusable way to generate ash-compatible automation - in theory even vk-layer-for-rust's Python scripts could potentially be fully replaced by something based on it, but that's out of scope for me for now).

I'm really tempted to try to at least categorize the shapes of child nodes (and text between them), which are currently flattened in this PR's StructureMember's c_declaration field - I'll post an update if I get there.
(EDIT: looks like descendant_text has 3 uses and they're all about flattening C declarations)

(disclaimer: back in the C11 days, I wrote a C declaration parser/compiler on top of someone else's yacc grammar, so the quirks of the syntax, like order independence of specifiers and the inside-out mode when the "declaree" is parenthesized, are "burned-in", making custom C parsers more appealing than they should be - my only hope is Khronos having been sensible enough, really)


EDIT: this went smoother than expected (except for the hard part of 11 function pointer typedefs which is the only thing I have left to tackle), I'm leaving this here for posterity, if anyone wants to experiment with C declaration syntax, https://godbolt.org/z/1oh8cWGaP has a good starting point:

template<typename T> using Id = T;
#define _(X) Id<X>
_(_(_(R(A, B, C))*)[3]) *f;

C++ lets us add a crude form of parentheses, which then lets us force the a much more intuitive nesting, and if removing those parentheses was viable, we'd be left with R(A, B, C)*[3]* as the type, which is infinitely closer to the Rust *mut [fn(A, B, C) -> R; 3] than R (*(*f)[3])(A, B, C) ever could be.

@eddyb
Copy link

eddyb commented Jan 28, 2024

Looks like custom parsing was 100% viable. By far most of the effort was spent making it powerful enough to handle the 11 function pointer typedefs with basically no special-casing (if vk.xml ever gets wild enough to have the C equivalent of fn() -> fn() -> fn() -> fn(), maybe with bonus extra pointers and arrays as well, my parser should be ready for it).

Outside of that, the only 2D array is in VkTransformMatrixKHR, and everything else is T*, const T*, T[N], and the occassional spicy const T[N] (thanks to function parameter decay, which I explicitly model in CType).

@eddyb
Copy link

eddyb commented Jan 28, 2024

Finally dug enough into the old generator to see how the C parsing works there:

ash/generator/src/lib.rs

Lines 86 to 272 in 92084df

fn parse_ctype(i: &str) -> IResult<&str, CType> {
(alt((value(CType::U64, tag("ULL")), value(CType::U32, tag("U")))))(i)
}
fn parse_cexpr(i: &str) -> IResult<&str, (CType, String)> {
(alt((
map(parse_cfloat, |f| (CType::Float, format!("{f:.2}"))),
parse_inverse_number,
parse_decimal_number,
parse_hexadecimal_number,
)))(i)
}
fn parse_cfloat(i: &str) -> IResult<&str, f32> {
(terminated(nom::number::complete::float, one_of("fF")))(i)
}
fn parse_inverse_number(i: &str) -> IResult<&str, (CType, String)> {
(map(
delimited(
char('('),
pair(
preceded(char('~'), parse_decimal_number),
opt(preceded(char('-'), digit1)),
),
char(')'),
),
|((ctyp, num), minus_num)| {
let expr = if let Some(minus) = minus_num {
format!("!{num}-{minus}")
} else {
format!("!{num}")
};
(ctyp, expr)
},
))(i)
}
// Like a C string, but does not support quote escaping and expects at least one character.
// If needed, use https://github.com/Geal/nom/blob/8e09f0c3029d32421b5b69fb798cef6855d0c8df/tests/json.rs#L61-L81
fn parse_c_include_string(i: &str) -> IResult<&str, String> {
(delimited(
char('"'),
map(many1(none_of("\"")), |c| {
c.iter().map(char::to_string).join("")
}),
char('"'),
))(i)
}
fn parse_c_include(i: &str) -> IResult<&str, String> {
(preceded(
tag("#include"),
preceded(multispace1, parse_c_include_string),
))(i)
}
fn parse_decimal_number(i: &str) -> IResult<&str, (CType, String)> {
(map(
pair(digit1.map(str::to_string), parse_ctype),
|(dig, ctype)| (ctype, dig),
))(i)
}
fn parse_hexadecimal_number(i: &str) -> IResult<&str, (CType, String)> {
(preceded(
alt((tag("0x"), tag("0X"))),
map(pair(hex_digit1, parse_ctype), |(num, typ)| {
(
typ,
format!("0x{}{}", num.to_ascii_lowercase(), typ.to_string()),
)
}),
))(i)
}
fn parse_c_identifier(i: &str) -> IResult<&str, &str> {
take_while1(|c: char| c == '_' || c.is_alphanumeric())(i)
}
fn parse_comment_suffix(i: &str) -> IResult<&str, Option<&str>> {
opt(delimited(tag("//"), take_until("\n"), newline))(i)
}
fn parse_parameter_names(i: &str) -> IResult<&str, Vec<&str>> {
delimited(
char('('),
separated_list1(tag(", "), parse_c_identifier),
char(')'),
)(i)
}
/// Parses a C macro define optionally prefixed by a comment and optionally
/// containing parameter names. The expression is left in the remainder
#[allow(clippy::type_complexity)]
fn parse_c_define_header(i: &str) -> IResult<&str, (Option<&str>, (&str, Option<Vec<&str>>))> {
(pair(
parse_comment_suffix,
preceded(
tag("#define "),
pair(parse_c_identifier, opt(parse_parameter_names)),
),
))(i)
}
#[derive(Debug)]
enum CReferenceType {
Value,
PointerToConst,
Pointer,
PointerToPointer,
PointerToPointerToConst,
PointerToConstPointer,
PointerToConstPointerToConst,
}
#[derive(Debug)]
struct CParameterType<'a> {
name: &'a str,
reference_type: CReferenceType,
}
fn parse_c_type(i: &str) -> IResult<&str, CParameterType<'_>> {
(map(
separated_pair(
tuple((
opt(tag("const ")),
preceded(opt(tag("struct ")), parse_c_identifier),
opt(char('*')),
)),
multispace0,
opt(pair(opt(tag("const")), char('*'))),
),
|((const_, name, firstptr), secondptr)| CParameterType {
name,
reference_type: match (firstptr, secondptr) {
(None, None) => CReferenceType::Value,
(Some(_), None) if const_.is_some() => CReferenceType::PointerToConst,
(Some(_), None) => CReferenceType::Pointer,
(Some(_), Some((Some(_), _))) if const_.is_some() => {
CReferenceType::PointerToConstPointerToConst
}
(Some(_), Some((Some(_), _))) => CReferenceType::PointerToConstPointer,
(Some(_), Some((None, _))) if const_.is_some() => {
CReferenceType::PointerToPointerToConst
}
(Some(_), Some((None, _))) => CReferenceType::PointerToPointer,
(None, Some(_)) => unreachable!(),
},
},
))(i)
}
#[derive(Debug)]
struct CParameter<'a> {
type_: CParameterType<'a>,
// Code only used to dissect the type surrounding this field name,
// not interested in the name itself.
_name: &'a str,
static_array: Option<usize>,
}
/// Parses a single C parameter instance, for example:
///
/// ```c
/// VkSparseImageMemoryRequirements2* pSparseMemoryRequirements
/// ```
fn parse_c_parameter(i: &str) -> IResult<&str, CParameter<'_>> {
(map(
separated_pair(
parse_c_type,
multispace0,
pair(
parse_c_identifier,
opt(delimited(char('['), map_res(digit1, str::parse), char(']'))),
),
),
|(type_, (name, static_array))| CParameter {
type_,
_name: name,
static_array,
},
))(i)
}

I would agree with anyone who would want to replace that, and here's my first impressions:

  • using nom is comparable to my code on the parsing side, but feels clunkier for lexing
  • it doesn't separate lexing and parsing which probably makes it more annoying than it needs to be
  • it special-cases a lot of declarator patterns (various combinations of maybe-const pointers and/or arrays) like I first theorized, but then I moved away from that very quickly due to combinatorial explosion
  • I am increasingly confused by vk-parse: it seems to also contain some ad-hoc C parsing (oh dear), but I guess it's not enough for the old ash generator?
    • should my more robust cdecl parser be upstreamed to vk-parse?
    • is this ash analysis crate work meant to be a replacement for vk-parse?

With some of these aspects I feel a bit out of my depth, especially when it comes to "what other efforts are out there and am I at risk of duplicating work?", but nothing else I've seen so far seems principled enough (at most there's a lot of reliance on the limited set of possible patterns which is somewhat reasonable I suppose).

@MarijnS95
Copy link
Collaborator

Finally dug enough into the old generator to see how the C parsing works there:
[..]
* I am increasingly confused by vk-parse: it seems to also contain some ad-hoc C parsing (oh dear), but I guess it's not enough for the old ash generator?

We use both. vk-parse creates vkxml constructs that represent parsed C in some "unmanageable" way, and a bit of nom-fu was added to parse out things (like simple C const expressions) that we could not get through vkxml. As a general code, a lot of this has been hacked together out of dire need, with a focus on generated code, not on generator internals. That's why things are so messy, and why a new generator needs to be written that takes all our ideas into account from the get-go.

With this approach from @Friz64 we should no longer need to rely on vk-parse (even though it is still maintained. The 7-year-old unmaintained vkxml is entirely out of the question, of course).


A little warning: this PR sat stale since June 2023, and right as activity is picking back up now, I'll be away from computers all of February. I hope no-one is affected by my absence and feel free to keep working/discussing on a generator rewrite, but it'll have to do without my input for a while 👍

@Friz64
Copy link
Contributor Author

Friz64 commented Feb 29, 2024

It doesn't try to process the XML in any way at all, even for simple things such as parsing a number.

This is not really true anymore with Friz64#1. Other than that, I've come around to saying simple parsing doesn't hurt.

@Friz64
Copy link
Contributor Author

Friz64 commented Mar 13, 2024

I've added support for the tracing crate. Here's a snip from the output when running via RUST_LOG=trace:

...
2024-03-13T21:00:30.028385Z TRACE xml{path=generator/Vulkan-Headers/registry/vk.xml}:type{node="<type name='VkColorSpaceKHR' category='enum'>"}: analysis::xml: encountered node
2024-03-13T21:00:30.028388Z TRACE xml{path=generator/Vulkan-Headers/registry/vk.xml}:type{node="<type name='VkCompositeAlphaFlagBitsKHR' category='enum'>"}: analysis::xml: encountered node
2024-03-13T21:00:30.028392Z TRACE xml{path=generator/Vulkan-Headers/registry/vk.xml}:type{node="<type name='VkDisplayPlaneAlphaFlagBitsKHR' category='enum'>"}: analysis::xml: encountered node
2024-03-13T21:00:30.028395Z TRACE xml{path=generator/Vulkan-Headers/registry/vk.xml}:type{node="<type name='VkPresentModeKHR' category='enum'>"}: analysis::xml: encountered node
2024-03-13T21:00:30.028399Z TRACE xml{path=generator/Vulkan-Headers/registry/vk.xml}:type{node="<type name='VkSurfaceTransformFlagBitsKHR' category='enum'>"}: analysis::xml: encountered node
2024-03-13T21:00:30.028403Z TRACE xml{path=generator/Vulkan-Headers/registry/vk.xml}:type{node="<type name='VkDebugReportFlagBitsEXT' category='enum'>"}: analysis::xml: encountered node
2024-03-13T21:00:30.028406Z TRACE xml{path=generator/Vulkan-Headers/registry/vk.xml}:type{node="<type name='VkDebugReportObjectTypeEXT' category='enum'>"}: analysis::xml: encountered node
...

@Friz64 Friz64 force-pushed the rewrite-xml branch 3 times, most recently from d610c44 to c64ea2e Compare March 16, 2024 22:32
@Friz64 Friz64 marked this pull request as draft May 25, 2024 05:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants