Specify the sym and map file formats #483

Closed
ISSOtm opened this issue Feb 13, 2020 · 17 comments · Fixed by gbdev/rgbds-www#37
Labels
docs This affects the documentation (web-specific issues go to rgbds-www) enhancement Typically new features; lesser priority than bugs

Comments

@ISSOtm
Member

ISSOtm commented Feb 13, 2020

The files follow no written specification, so everyone makes their own assumptions. We should write a specification so we can decide what changes can be reliably made. This would also allow external tools (I know some exist) to generate compatible files, and those tools are the reason I would like to make the specification broader than what RGBLINK currently generates.

To quote @mid-kid:

Honestly I just wanna know dumb shit such as: Will the format always be '2:4' or should you search for the colon and space, can there be more whitespace after the address before the label starts (can the space also be a tab?), are end-of-line comments a thing and how to handle whitespace there.

@aaaaaa123456789
Member

Symfiles are generally easy to describe in terms of a couple of regexes:

  • Any line that matches [ \t]*(;.*)? is ignored.
  • All other lines match [ \t]*([0-9a-fA-F]{2,}):([0-9a-fA-F]{4})[ \t]+(.*)[ \t]*, where the three capture groups respectively represent the bank, the address and the symbol.
  • Lines are unordered.

It might be useful to allow banks to be optional, but RGBDS doesn't have any way of generating unbanked symbols anyway.
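
As a rough illustration, the two regexes above could be turned into a small Python parser. This is only a sketch under the rules as stated in this comment, not RGBLINK's code; the name `parse_sym_line` and the returned tuple layout are made up for the example.

```python
import re

# Blank lines and comment-only lines are ignored.
IGNORED = re.compile(r"[ \t]*(;.*)?")
# bank (2+ hex digits), ":", address (exactly 4 hex digits), whitespace, symbol.
SYMBOL = re.compile(r"[ \t]*([0-9a-fA-F]{2,}):([0-9a-fA-F]{4})[ \t]+(.*)[ \t]*")


def parse_sym_line(line):
    """Return (bank, address, symbol), or None for an ignored line.

    Raises ValueError for a line that matches neither pattern.
    (Hypothetical helper, named for this example only.)
    """
    line = line.rstrip("\r\n")
    if IGNORED.fullmatch(line):
        return None
    match = SYMBOL.fullmatch(line)
    if match is None:
        raise ValueError(f"malformed sym line: {line!r}")
    bank, address, symbol = match.groups()
    # (.*) is greedy, so trailing whitespace lands in the symbol; strip it.
    return int(bank, 16), int(address, 16), symbol.rstrip(" \t")
```

Because lines are unordered, a caller can process the file one line at a time without keeping any state between lines.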

@ISSOtm
Member Author

ISSOtm commented Feb 14, 2020

What about allowing comments on non-empty lines? Should shorter bank and addresses be allowed? Also, a valid RGBASM identifier is matched by this regex, according to globlex.c: [a-zA-Z_][a-zA-Z0-9_\\@#]*(.[a-zA-Z_][a-zA-Z0-9_\\@#]*)?

@aaaaaa123456789
Member

What about allowing comments on non-empty lines?

A lot harder to parse by simple tools for little gain. This is not really a problem if you're writing a full-blown parser, but if you want to use a small shell script or something, it's better to not have them.

Should shorter bank and addresses be allowed?

For addresses, I'd say definitely no, since no advances in technology (e.g., new mappers) will ever change the size of an address, and every tool on the planet displays them as four hex digits anyway. For banks, it might be worth it to allow single-digit banks; I don't really mind either way. (RGBLINK currently outputs two or three digits, of course.)

Also, a valid RGBASM identifier is matched by this regex, ...

Probably out of scope for a format that is meant for interoperability; other tools might have their own rules for identifiers. This is an area where it pays off to be as lax as reasonably possible.

@ISSOtm
Member Author

ISSOtm commented Feb 14, 2020

The regex could be added to the spec as a recommendation ("MAY reject or output a diagnostic if the identifier does not match the following regex:")
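
A "MAY output a diagnostic" rule of this kind could look roughly like the Python sketch below. It assumes the identifier pattern is meant to repeat its second character class with `*` and to treat the `.` before a local part literally; both are readings of the regex quoted in this thread, not something checked against globlex.c. The helper name `check_identifier` is hypothetical.

```python
import re
import warnings

# Assumed reading of the identifier pattern quoted from globlex.c:
# "*" repetition on the second character class, literal "." before a local part.
IDENTIFIER = re.compile(
    r"[a-zA-Z_][a-zA-Z0-9_\\@#]*(\.[a-zA-Z_][a-zA-Z0-9_\\@#]*)?"
)


def check_identifier(symbol):
    """Return True if the symbol matches the assumed pattern; warn otherwise."""
    if IDENTIFIER.fullmatch(symbol):
        return True
    # A diagnostic only: the symbol is still accepted by the caller.
    warnings.warn(f"symbol {symbol!r} does not match the assumed identifier pattern")
    return False
```

Warning instead of rejecting keeps the reader lax, so files written by other tools with different identifier rules still load.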

Agree with the rest.

@aaaaaa123456789
Member

The problem would be agreeing on what regex to use.
For instance, consider an alternate assembler that uses #42 for immediates. That assembler might allow symbols beginning with a digit and even completely made up of digits, as they would be unambiguous. Is that acceptable? If not, why?

@ISSOtm
Member Author

ISSOtm commented Feb 14, 2020

I don't think we're trying to specify a cross-assembler format, just to stabilize what RGBDS is outputting.

@mid-kid
Contributor

mid-kid commented Feb 14, 2020

Honestly, comments on non-empty lines aren't that complex for simple tools. In general, you're going to read each line up until the ;, and check if what precedes it is empty (only whitespace) or not, which is usually a single line: x.split(";")[0] in Python, cut -d ; -f 1 in any shell script, ^([^;]*) in regex, etc.
Heck, you could expand your previous regex to something like: [ \t]*([0-9a-fA-F]{2,}):([0-9a-fA-F]{4})[ \t]+([^;]*)[ \t]*(;.*)?, and just ignore any line that doesn't match.
This has very little use in the current programs, but might allow for some non-standard extensions.
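
For illustration, the expanded regex can be used as-is in Python, skipping every line that does not match. This is a minimal sketch; the generator name `parse_sym_lines` is invented for the example.

```python
import re

# The expanded pattern from this comment: the symbol may not contain ";",
# and an optional trailing comment is allowed.
LINE = re.compile(
    r"[ \t]*([0-9a-fA-F]{2,}):([0-9a-fA-F]{4})[ \t]+([^;]*)[ \t]*(;.*)?"
)


def parse_sym_lines(lines):
    """Yield (bank, address, symbol) for matching lines; skip all others."""
    for line in lines:
        match = LINE.fullmatch(line.rstrip("\r\n"))
        if match is None:
            continue  # blank, comment-only, or malformed line
        bank, address, symbol, _comment = match.groups()
        # ([^;]*) is greedy and also captures the spaces before ";".
        yield int(bank, 16), int(address, 16), symbol.rstrip(" \t")
```

Note that silently ignoring non-matching lines also hides genuinely malformed entries, which is the trade-off of this approach.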

That, or we could remove comments as a whole. There's literally one comment, and it could be considered useless.

@aaaaaa123456789
Member

Comments are generally useful when editing the files by hand. But I've never found any need for inline comments.

@AntonioND
Member

I think that comments should be allowed everywhere... Because it's more predictable. If empty lines allow comments, why do other lines prohibit them? It's completely arbitrary, and counter-intuitive.

@mid-kid
Contributor

mid-kid commented Feb 14, 2020

I can only see one scenario where I'd edit a .sym file by hand and it's to document a game without source. Using the .sym file as both a scratchpad with notes and a symbol list for the BGB emulator is pretty useful.

That being said, with BGB being one of the primary consumers of this format, it seems useful to look into what it does.

EDIT: Comments by the BGB author

beware: file size at most 10000000 bytes. a semicolon on any line starts a comment. address, 1 or more spaces, label. bank and addr both between 0 and ffff, no length check done. bank can be "BOOT" (case tolerant)
beware: spaces after label are stripped too
beware: like xx:yyyy label ;comment
beware: the part after the space, before trimming, is max 63 chars
beware: so xx:yyyy zzzzzzzzzzz;comment
beware: zzzzzz = max 63
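
Read literally, those notes could be sketched in Python as follows. This is a guess at the described behaviour, not BGB's code: the notes do not say what happens to labels longer than 63 characters, so this sketch raises an error for them, and it assumes the 63-character limit is counted after skipping the separating spaces. All names here are hypothetical.

```python
import string

MAX_LABEL_CHARS = 63  # limit from the notes above, measured before trimming


def _hex16(text):
    """Parse hex digits of any length; None if invalid or above ffff."""
    if not text or any(c not in string.hexdigits for c in text):
        return None
    value = int(text, 16)
    return value if value <= 0xFFFF else None


def parse_bgb_line(line):
    """Return (bank, address, label), or None if the line holds no symbol."""
    # A semicolon anywhere on the line starts a comment.
    content = line.rstrip("\r\n").split(";", 1)[0]
    location, sep, rest = content.lstrip(" ").partition(" ")
    bank_text, colon, addr_text = location.partition(":")
    if not sep or not colon:
        return None
    # The bank may be the word "BOOT" in any letter case.
    bank = "BOOT" if bank_text.upper() == "BOOT" else _hex16(bank_text)
    address = _hex16(addr_text)
    raw_label = rest.lstrip(" ")  # "1 or more spaces" precede the label
    if bank is None or address is None or not raw_label.strip(" "):
        return None
    if len(raw_label) > MAX_LABEL_CHARS:
        # Assumption: the real handling of over-long labels is not documented here.
        raise ValueError(f"label longer than {MAX_LABEL_CHARS} characters")
    return bank, address, raw_label.rstrip(" ")
```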

@pinobatch
Member

In file formats where semicolon can never be part of a value, semicolon meaning "ignore to end of line" is fine. Or it can even denote that a description for the preceding label follows. But in file formats that can contain semicolon as part of a value, especially in a quoted string such as a section name, stripping comments requires implementing the entire quoted string behavior.

@ISSOtm
Member Author

ISSOtm commented Feb 14, 2020

Given that semicolons are not part of any valid value here (symbols don't take them), we thankfully don't have this problem.

I don't think we'll support semicolons in labels either, period. :P

@pinobatch
Member

pinobatch commented Feb 14, 2020

WLALINK specifies the .sym file that it generates. It allows inline comments set off with ; and a third line type for section names, set off with [...] (regex is \[\S+\]).

The .sym dialect accepted by mgbdis allows a second consecutive line with the same bank and address to specify the type of a label, so as to mark it in a disassembly as data, text, or 2bpp image.

; This marks a 512-byte chunk of non-code data from $4800 to $49FF
0d:4800 Level_Data
0d:4800 .data:200

; This marks a 16-byte chunk of text from $3D00 to $3D0F
00:3d00 Character_Name
00:3d00 .text:10

; This marks a 1280-byte (80-tile) chunk of tile data from $791A to
; $7E19, 128 tiles (16 pixels) wide, for display with BGP set to $E4
02:791a Title_Screen_Tile_Data
02:791a .image:500:w128,pe4

@aaaaaa123456789
Member

The "second line with same address" thing would be a nightmare to parse (again, thinking of something simple like a regex here, not a full-blown program). It would also require a good definition of what types are valid and how to declare them. And it would collide with symfiles with multiple symbols on the same address, which are common.
All in all, probably not a good idea.

As for sections, I don't know; they might be useful, but they would impose ordering constraints on the file (right now you can list symbols in any order), and besides, isn't that what map files are for?

I'd advocate for the simplest possible format — leave everything else to map files.

@nitro2k01
Member

nitro2k01 commented Feb 17, 2020

Keep in mind that one use case for sym files is reverse engineering: iteratively editing the file and reloading the sym file in BGB. Since humans are not as precise as machines, it may be appropriate to apply the principle of being conservative in what you output and liberal in what you accept. For example, I'd advocate for allowing the bank and address value to be down to one digit (BGB allows this in practice atm) but also recommending that automatic tools should output 2:4 (or longer prefix if needed.)
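
A sketch of that "liberal reader, conservative writer" split in Python might look like this. The function names are invented for the example, and lowercase hex output is an assumption rather than a stated requirement.

```python
import re

# Liberal reader: bank of 1+ hex digits, address of 1 to 4 hex digits,
# optional trailing comment.
LENIENT = re.compile(
    r"[ \t]*([0-9a-fA-F]+):([0-9a-fA-F]{1,4})[ \t]+([^;]*)(;.*)?"
)


def read_symbol(line):
    """Return (bank, address, name), or None if the line is not a symbol."""
    match = LENIENT.fullmatch(line.rstrip("\r\n"))
    if match is None:
        return None
    bank, address, name, _comment = match.groups()
    return int(bank, 16), int(address, 16), name.rstrip(" \t")


def write_symbol(bank, address, name):
    """Conservative writer: always 2:4 digits, longer bank prefix if needed."""
    return f"{bank:02x}:{address:04x} {name}"
```

Round-tripping a hand-written line such as `1:150 Main` through both functions normalizes it to `01:0150 Main`.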

As for the other questions my votes are:

Allowing the bank to be optional: No.

Extending the format with other metadata: No.

Allowing comments on non-empty lines: Yes. (Useful for annotating human-generated files and already supported in BGB.)

Another thing that may be useful to specify formally is the meaning of local labels. While it doesn't necessarily matter syntactically for validating the basic format, it matters semantically. Currently, BGB allows local labels to be referenced within the scope of the global label that the local label belongs to. For example, you could enter something like jr .loop in the assembler instead of needing to enter jr COPY.loop. This behavior should be encouraged if more debuggers or similar tools that can use the sym format would be developed.
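
The scoping rule described here can be sketched in a few lines of Python. This assumes local labels are stored in the sym file under their full `Global.local` name, as in the `COPY.loop` example; the function name and the symbol table layout are hypothetical.

```python
def resolve_label(name, scope, symbols):
    """Look up a label, expanding a leading "." with the current global scope.

    `symbols` maps full names to (bank, address); `scope` is the name of the
    global label the user is currently working under.
    Raises KeyError if the label is unknown.
    """
    full_name = scope + name if name.startswith(".") else name
    return symbols[full_name]


symbols = {"COPY": (0, 0x0150), "COPY.loop": (0, 0x0153)}
# Inside COPY, "jr .loop" and "jr COPY.loop" refer to the same address.
assert resolve_label(".loop", "COPY", symbols) == resolve_label("COPY.loop", "COPY", symbols)
```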

@ISSOtm ISSOtm added docs This affects the documentation (web-specific issues go to rgbds-www) enhancement Typically new features; lesser priority than bugs labels Oct 14, 2020
@ISSOtm
Member Author

ISSOtm commented Oct 14, 2020

A development as far as this issue is concerned: since we have a website, we have a place where we could publish such specifications. Now that I have some time on my hands, I will start compiling the thread above into a first draft, which I will then publish in rgbds-www on a branch (i.e. it won't be published online yet).

Should the discussion continue here, or there?

@ISSOtm
Member Author

ISSOtm commented May 3, 2021

I wrote documentation on the SYM file format, but I'm less sure about the MAP one. It's currently still a moving target, and while following a somewhat systematic format, it's mostly intended for human consumption, and not scripts. I'm thinking that we should explicitly document the lack of documentation for now, and publish the SYM spec.

ISSOtm added a commit to gbdev/rgbds-www that referenced this issue Sep 16, 2022
Closes gbdev/rgbds#483 at least for now; as noted,
the `.map` file format is intentionally not specified for the time being.

Co-authored-by: Rangi <remy.oukaour+rangi42@gmail.com>
Co-authored-by: aaaaaa123456789 <aaaaaa123456789@acidch.at>
@Rangi42 Rangi42 removed this from the v1.0.0 milestone Nov 22, 2023

7 participants