Skip to content

Make some daffodil-runtime2 improvements#525

Merged
tuxji merged 4 commits intoapache:runtime2-2202from
tuxji:ji/runtime2-2202/make-improvements
Apr 14, 2021
Merged

Make some daffodil-runtime2 improvements#525
tuxji merged 4 commits intoapache:runtime2-2202from
tuxji:ji/runtime2-2202/make-improvements

Conversation

@tuxji
Copy link
Contributor

@tuxji tuxji commented Apr 7, 2021

Refactor duplicate code in Binary*CodeGenerator.scala files.

Change PState/UState member field from a string error message to a
struct error object. Make parse/unparse functions initialize that
member field if any error happens. Move checks for errors from
parse/unparse functions to their callers, e.g., "if (ustate->error)
return;".

Collect validation errors without stopping parsing or unparsing by
initializing another PState/UState member field pointing to a
validation struct with some fields. As last step, update all C files
using include-what-you-use to document what's used from each include.

Add instructions to README.md how to setup required C tool-chain
elements on Linux and Windows.

Generate daffodil-runtime2's name and version into generated_code.c
automatically so we won't need to keep it up to date.

Changelog:

In README.md, fix or disable markdownlint warnings. Expand Build
Requirements with instructions how to install C compiler, Mini-XML
library, java, and sbt on Linux and Windows. Verify/update old
hyperlinks and add new MSYS2 and clang hyperlinks.

In build.sbt, add Compile / cCompiler and ccArchiveCommand settings as
a guide to show developers how to override these default settings if
they want sbt to call the C compiler with a different name than "cc".
The sbt-cc plugin doesn't iterate through a list (${CC}, cc, gcc,
clang, zig cc) to find a C compiler like our runtime2 code does.

In daffodil_argp.c, remove argp_program_version's definition since we
now generate it automatically within generated_code.c.

In daffodil_main.c, add fflush_or_exit and change fopen_or_exit /
fclose_or_open to call continue_or_exit with error message instead of
calling exit directly. Make main print any validation diagnostics
before printing any errors since they've been split into separate
PState/UState member fields. Also make main call continue_or_exit
instead of calling error directly and pass Error structs, not strings,
to all continue_or_exit calls.

In stack.c, make stack functions call continue_or_exit instead of
calling error directly and pass Error structs, not strings, to all
continue_or_exit calls.

In xml_reader.c, make conversion functions return Error structs, not
strings, to indicate errors converting strings to primitive datums.
Make reader functions return Error structs, not strings, to indicate
errors reading XML data.

In xml_writer.c, make writer functions return Error structs, not
strings, to indicate errors writing XML data. Remove redundant stack
checking code already performed within stack fuctions.

In errors.c (new file), use switch statement inside error_message
function to centralize all error messages in one place for easier
future internationalization. Use switch statement inside
print_mybe_stop function to allow error messages to interpolate an
extra detail (string or integer) if needed. Define a single static
array that can be used to store validation diagnostics and define a
function to print validation diagnostics. Implement continue_or_exit
and eof_or_error functions here as well.

In errors.h (new file), define enumerations for each type of error
which could occur in C code. Define an Error struct to store an
enumeration and optionally an extra detail (string or integer). Also
define a Diagnostics struct to store an array of data validation
errors which should be reported separately from parse errors. Move
PState and UState structs here from infoset.h and give them member
fields pointing to data validation and parse errors. Also declare
need_diagnostics, print_diagnostics, continue_or_exit, eof_or_error
prototypes and UNUSED macro in errors.h.

In infoset.c, make walkInfosetNode and walkInfoset functions return
Error struct, not string. Remove eof_or_error_msg function (now moved
to errors.c with shorter name).

In infoset.h, change InitChoiceRD and Visit* prototypes to return
Error structs, not strings. Remove PState & UState structs,
eof_or_error_msg prototype, and UNUSED macro (all moved to errors.h).

In parsers.c, make parse__ functions stop enclosing
their code in "if (!pstate->error_msg)" statements and use Error
structs, not strings. Callers are now expected to check for errors
after each call immediately. Make parse_fill_bytes use
read_stream_update_position helper macro too. Make
parse_validate_fixed report errors in separate pstate->validati field,
not pstate->error field.

In unparsers.c, make unparse__ functions stop enclosing
their code in "if (!ustate->error_msg)" statements and use Error
structs, not strings. Callers are now expected to check for errors
after each call immediately. Make unparse_fill_bytes use
write_stream_update_position helper macro too. Make
unparse_validate_fixed report errors in separate ustate->validati
field, not ustate->error field.

In NestedUnion.c and ex_nums.c examples, update with freshly generated
code to illustrate code generator changes (in particular, defining
argp_program_version, adding an if statement after each parse/unparse
call to check for an error and skip rest of calls immediately, and
returning errors directly from initChoice functions with choice key
that failed to match any branch key).

In BinaryAbstractCodeGenerator.scala (new file), refactor common code
that used to be duplicated in BinaryBooleanCodeGenerator.scala,
BinaryFloatCodeGenerator.scala, and
BinaryIntegerKnownLengthCodeGenerator.scala; now they pass parameters
to binaryAbstractGenerateCode instead.

In CodeGeneratorState.scala, make all necessary changes to report
errors using structs instead of strings in generated code including
initChoice functions, add if statements after each parse/unparse call,
and generate the program name and version automatically.

In build.properties, change sbt.version from 1.4.1 to 1.4.9 to ensure
successful sbt build from scratch in a new Ubuntu or Windows VM (sbt
was getting java.lang.NoClassDefFound: scala/Serializer).

Refactor duplicate code in Binary*CodeGenerator.scala files.

Change PState/UState member field from a string error message to a
struct error object.  Make parse/unparse functions initialize that
member field if any error happens.  Move checks for errors from
parse/unparse functions to their callers, e.g., "if (ustate->error)
return;".

Collect validation errors without stopping parsing or unparsing by
initializing another PState/UState member field pointing to a
validation struct with some fields.  As last step, update all C files
using include-what-you-use to document what's used from each include.

Add instructions to README.md how to setup required C tool-chain
elements on Linux and Windows.

Generate daffodil-runtime2's name and version into generated_code.c
automatically so we won't need to keep it up to date.

Changelog:

In README.md, fix or disable markdownlint warnings.  Expand Build
Requirements with instructions how to install C compiler, Mini-XML
library, java, and sbt on Linux and Windows.  Verify/update old
hyperlinks and add new MSYS2 and clang hyperlinks.

In build.sbt, add Compile / cCompiler and ccArchiveCommand settings as
a guide to show developers how to override these default settings if
they want sbt to call the C compiler with a different name than "cc".
The sbt-cc plugin doesn't iterate through a list (${CC}, cc, gcc,
clang, zig cc) to find a C compiler like our runtime2 code does.

In daffodil_argp.c, remove argp_program_version's definition since we
now generate it automatically within generated_code.c.

In daffodil_main.c, add fflush_or_exit and change fopen_or_exit /
fclose_or_open to call continue_or_exit with error message instead of
calling exit directly.  Make main print any validation diagnostics
before printing any errors since they've been split into separate
PState/UState member fields.  Also make main call continue_or_exit
instead of calling error directly and pass Error structs, not strings,
to all continue_or_exit calls.

In stack.c, make stack functions call continue_or_exit instead of
calling error directly and pass Error structs, not strings, to all
continue_or_exit calls.

In xml_reader.c, make conversion functions return Error structs, not
strings, to indicate errors converting strings to primitive datums.
Make reader functions return Error structs, not strings, to indicate
errors reading XML data.

In xml_writer.c, make writer functions return Error structs, not
strings, to indicate errors writing XML data.  Remove redundant stack
checking code already performed within stack fuctions.

In errors.c (new file), use switch statement inside error_message
function to centralize all error messages in one place for easier
future internationalization.  Use switch statement inside
print_mybe_stop function to allow error messages to interpolate an
extra detail (string or integer) if needed.  Define a single static
array that can be used to store validation diagnostics and define a
function to print validation diagnostics.  Implement continue_or_exit
and eof_or_error functions here as well.

In errors.h (new file), define enumerations for each type of error
which could occur in C code.  Define an Error struct to store an
enumeration and optionally an extra detail (string or integer).  Also
define a Diagnostics struct to store an array of data validation
errors which should be reported separately from parse errors.  Move
PState and UState structs here from infoset.h and give them member
fields pointing to data validation and parse errors.  Also declare
need_diagnostics, print_diagnostics, continue_or_exit, eof_or_error
prototypes and UNUSED macro in errors.h.

In infoset.c, make walkInfosetNode and walkInfoset functions return
Error struct, not string.  Remove eof_or_error_msg function (now moved
to errors.c with shorter name).

In infoset.h, change InitChoiceRD and Visit* prototypes to return
Error structs, not strings.  Remove PState & UState structs,
eof_or_error_msg prototype, and UNUSED macro (all moved to errors.h).

In parsers.c, make parse_<endian>_<type> functions stop enclosing
their code in "if (!pstate->error_msg)" statements and use Error
structs, not strings.  Callers are now expected to check for errors
after each call immediately.  Make parse_fill_bytes use
read_stream_update_position helper macro too.  Make
parse_validate_fixed report errors in separate pstate->validati field,
not pstate->error field.

In unparsers.c, make unparse_<endian>_<type> functions stop enclosing
their code in "if (!ustate->error_msg)" statements and use Error
structs, not strings.  Callers are now expected to check for errors
after each call immediately.  Make unparse_fill_bytes use
write_stream_update_position helper macro too.  Make
unparse_validate_fixed report errors in separate ustate->validati
field, not ustate->error field.

In NestedUnion.c and ex_nums.c examples, update with freshly generated
code to illustrate code generator changes (in particular, defining
argp_program_version, adding an if statement after each parse/unparse
call to check for an error and skip rest of calls immediately, and
returning errors directly from initChoice functions with choice key
that failed to match any branch key).

In BinaryAbstractCodeGenerator.scala (new file), refactor common code
that used to be duplicated in BinaryBooleanCodeGenerator.scala,
BinaryFloatCodeGenerator.scala, and
BinaryIntegerKnownLengthCodeGenerator.scala; now they pass parameters
to binaryAbstractGenerateCode instead.

In CodeGeneratorState.scala, make all necessary changes to report
errors using structs instead of strings in generated code including
initChoice functions, add if statements after each parse/unparse call,
and generate the program name and version automatically.

In build.properties, change sbt.version from 1.4.1 to 1.4.9 to ensure
successful sbt build from scratch in a new Ubuntu or Windows VM (sbt
was getting java.lang.NoClassDefFound: scala/Serializer).
Copy link
Contributor

@mbeckerle mbeckerle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I had minor comments/suggestions. In general this is a really nice change set, as it handles the whole issue of generalizing the error diagnostics system to one that is more appropriately abstract. Not string oriented.

Copy link
Contributor Author

@tuxji tuxji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions, I'll take care of one now and add the other to the runtime2 todos. I have a question below as well.

In errors.[ch], define new function add_diagnostic to add an error to
a Diagnostics struct.

In parsers.c and unparsers.c, make (un)parse_validate_fixed call
add_diagnostic instead of adding error directly to Diagnostics struct.
@tuxji
Copy link
Contributor Author

tuxji commented Apr 10, 2021

I can run sbt compile in a Windows VM fine, so I'm not sure why the Windows checks are hanging at that step. Let's continue reviewing the PR and merge it after a second +1 since I need to rebase the branch on the latest master anyway. Will someone else please review this PR?

Copy link
Contributor

@mbeckerle mbeckerle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 once windows CI issues are fixed, or understood to be spurious.

const Error *error; // any error which stops program
} UState;

// need_diagnostics - return pointer to validation diagnostics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a naming convention? I would have expected this to be named get_diagnostics, as it is a "getter".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, your suggestion is better. Renamed to get_diagnostics.

if (validati && error)
{
if (validati->length <
sizeof(validati->array) / sizeof(*validati->array))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the only way more than one can accumulate here is if they are warnings.

But to continue execution when this is full feels like the wrong choice.

Maybe return a boolean which indicates whether the diagnostic was able to be recorded or not.

Then the place issuing the warning can decide if there are "too many warnings" and exit, or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I've made add_diagnostic return false if the array was too full to add another diagnostic.

Copy link
Member

@stevedlawrence stevedlawrence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Handful of questions, but looks good! Some nice refactorings.

README.md Outdated
data processing frameworks so as to bypass any XML/JSON overheads.

For more information about Daffodil, see https://daffodil.apache.org/.
For more information about Daffodil, see the [Website].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but I actually prefer the full URL here. The url is something that some people might want to know so we might not want to hide it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I put the full URL back with brackets around it to avoid markdownlint's bare url warning (that was why I changed it before).

README.md Outdated
and more. [SDK] offers an easy and uniform way to install both java
and sbt on any Unix based platform:

curl -s "https://get.sdkman.io" | bash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should reccomend that people download a script from the internet and pipe it to bash. Especially when I'd assume most Linux distros already support installing openjdk. I'd also rather reference the sbt download page, which has deb, rpm, and windows installers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I actually read the script carefully before I ran it and I shouldn't suggest to anyone it's okay to download a script from the Internet and pipe it to bash. If you already have sdk installed, though, then showing people how to install the JDK and SBT become very simple and uniform one-liners that work on any UNIX platform (sigh, wish sdk was everywhere). I'll just recommend people install JDK and SBT directly from their websites, though.

README.md Outdated
and Ubuntu):

# Just mentioning all other packages you might need too
sudo apt install build-essential curl git libmxml-dev
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it would be better to have a separate mardown file that describes details to how to get the required packages from different distros if that's something we want to maintain. That way the main readme just lists the dependencies. Some users can easily figure that out, and then a more detailed page is available for those that need commands. Or at the very list this can be broken down into a "Buid Setup" section with different sections, e.g . DNF-based distros, APT-based distros, Windows, etc. so people can easily skip to the section they need.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought, we could make build setup instructions for different distos collapsable: https://gist.github.com/pierrejoubert73/902cc94d79424356a8d20be2b382e1ab

That way it doesn't make the README huge by default, but has the information people need if they need it. For example:

Build Setup

Fedora
sudo dnf install ...

More stuff here

Ubuntu/Debian
sudo apt install ...

More stuff here

Windows

Windows Instructions here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of putting the instructions in a separate markdown file better. I've moved everything I added to README.md into BUILD.md except for the four build requirements and "See BUILD.md for more details" below them. I still improved some parts of README.md anyway :).

README.md Outdated
* C compiler C99 or higher
* Mini-XML Version 3.2 or higher

Since Daffodil has a DFDL to C backend, you will need a C compiler
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the're any thoughts on making this C backend optional / is it even possible to be optional? I know of people that build Daffodil from source and build an RPM, and they probably would not want to the C backend, so a way to turn it off so that the potential attack surface is lessend might be a beneficial option. Is it even possible to turn off from the build right now, or is it too tightly coupled into the Daffodil compiler?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C backend is a code generator written in 100% Scala. The most important reason why we call the C compiler during the sbt build is to alert developers as soon as possible if any change anywhere causes a runtime2 problem. We call the C compiler in multiple places:

  1. Explicitly in build.sbt (building daffodil-runtime2 always calls the sbt-cc plugin on all C source files)
  2. Implicitly in sbt test (running daffodil-runtime2's unit/TDML tests and daffodil-test's runtime2-specific TDML tests also tests Daffodil's ability to call the C compiler)

We probably could find a way (with a custom sbt plugin or some conditional code in build.sbt) to still build daffodil-runtime2 so that the CLI's "generate c" command still works but disable running both 1) the sbt-cc plugin and 2) the daffodil-runtime2 unit/TDML tests during a "special" sbt build. I'd prefer to put the onus on people to run a "special" sbt build that disables any C compilation so that we still alert developers to problems as early as possible (the longer problems hide, the more time it takes to fix them once they surface).

You're more familiar with SBT's DSL scripting than I am; what's the easiest way to make build.sbt skip running sbt-cc and skip running runtime2-specific unit/TDML tests upon request?

need_diagnostics(void)
{
static Diagnostics validati;
return &validati;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't say I ever seen these usage of static variable before. Is this just a way to have a global variable without actually looking like a global variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning a pointer to a static variable defined inside a function is a good way to provide some storage without polluting the global namespace with a global variable's name, yes. It's also more flexible since we can change the function to use an array, a heap, etc., at a later time.

ERR_XML_LEFT,
ERR_XML_MISMATCH,
ERR_XML_WRITE
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runtime1 has different kinds of diagnostics, e.g. error, warning, validation error, recoverable error. Are these all just considered "errors" in runtime1? Are there any diagnostics that are fatal errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ERR_FIXED_VALUE is used only as a validation error. The rest are considered fatal errors which cause the program to stop immediately. When runtime2 grows larger, then we may need ERR_, WRN_, VLD_, and TRY_ to distinguish different kinds of diagnostics, but we haven't gotten that far yet.


// need_diagnostics - return pointer to validation diagnostics

extern Diagnostics *need_diagnostics(void);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why get diagnostics from the static variable, rather than just having the PState/UState have a Diagnostics instances? Then PState/UState can be passed around and this state can be mutated as diagnostics/errors are created. Avoids potentialy issues with global state/threading.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, we could make the PState/UState have a Diagnostics instance itself. Then there would be no need for a get_diagnostics function. My thinking is that the majority of schemas I've seen don't use fixed values so I don't want to even call get_diagnostics and initialize the Diagnostics instance unless I have to. That thinking probably will change over time.

Comment on lines 40 to 41
val fixed = e.xml.attribute("fixed")
val fixedValue = if (fixed.isDefined) fixed.get.text else ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer we add this fixed logic to the dsom. For example, right now there is nothing pasing the fixed value and making sure it matches the type. For example

<xs:element name="someBool" type="xs:boolean" fixed="invalidBooleanValue" />

Ideally the dsom would convert the "fixed" value to match the primitive type, which the DSOM should be doing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved the fixed value logic to the DSOM and updated BinaryAbstractCodeGenerator.scala to use the DSOM getters.

if (fixedValue.nonEmpty) {
val init2 = ""
val parse2 =
s""" parse_validate_fixed(instance->$fieldName$deref == $fixedValue, "$fieldName", pstate);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if there's a potentially security issue here? Looks like we just inject the fixed value from teh schema directly into this C code. Which means I think a schema could look something like this to inject some code?

<xs:element name="foo" ... fixed='0, "foo", pstate); malcious C code here; //' />

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we had a security issue. Now that I'm using the DSOM to convert the XML attribute to the primitive data type and back to a string, a fixed='0, "foo", pstate); malicious C code here; //' will result in an error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this only supports fixed values where the primitive is a numeric? Are string primitives supported yet, or do we need a check here that SDE's if hasFixedValue but primitive type is not a numeric?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runtime2 doesn't support strings yet, and we do need to consider how to handle fixed values that are strings (we'd have to SDE or call strcmp instead of using value == fixed). I would expect that strings in binary data could have length fields as well as be terminated by a null byte.

In BUILD.md, copy build requirements and setup instructions here from
README.md with some reordering and edits for better clarity.

In README.md, restore Daffodil website full url, remove build setup
instructions, and turn getting started instructions into bullet items
for less vertical space and better readability.

In build.sbt, allow environment variables CC and AR to override
cCompiler and ccArchiveCommand's default values.

In ElementBase.scala, ElementDeclMixin.scala, and ElementRef.scala,
add hasFixedValue, fixedValueAsString, and fixedValue getters.

In .clang-format, increase column limit to 110 (default had been 80).
Reformat C files accordingly.

In xml_reader.c, make xmlNumberElem define error_erd in block inside
switch statement and make all cases call return.

In xml_writer.c, use LIMIT_XML_NESTING constant from errors.h.

In errors.c, add assert to default case in error_message, rename
need_diagnostics to get_diagnostics, and make add_diagnostic check
LIMIT_DIAGNOSTICS and return false if array is full.

In errors.h, define limit constants inside enum Limits and use
LIMIT_DIAGNOSTICS constant in Diagnostics.  Rename member field of
PState/UState from validati to diagnostics, rename need_diagnostics to
get_diagnostics, and make add_diagnostic return bool.

In infoset.c, use LIMIT_NAME_LENGTH constant from errors.h.

In BinaryAbstractCodeGenerator.scala, use DSOM fixed getters instead
of reading e.xml.attribute("fixed") directly to ensure fixed is a
valid primitive data value before using it.

In CodeGeneratorState.scala, make addImplementation count both
parserStatements and unparserStatements, not only parserStatements.
Copy link
Contributor Author

@tuxji tuxji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have addressed reviewer requested changes. Will wait for rebase on master branch to determine whether Windows checks really still fail.

README.md Outdated
data processing frameworks so as to bypass any XML/JSON overheads.

For more information about Daffodil, see https://daffodil.apache.org/.
For more information about Daffodil, see the [Website].
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I put the full URL back with brackets around it to avoid markdownlint's bare url warning (that was why I changed it before).

README.md Outdated
* C compiler C99 or higher
* Mini-XML Version 3.2 or higher

Since Daffodil has a DFDL to C backend, you will need a C compiler
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C backend is a code generator written in 100% Scala. The most important reason why we call the C compiler during the sbt build is to alert developers as soon as possible if any change anywhere causes a runtime2 problem. We call the C compiler in multiple places:

  1. Explicitly in build.sbt (building daffodil-runtime2 always calls the sbt-cc plugin on all C source files)
  2. Implicitly in sbt test (running daffodil-runtime2's unit/TDML tests and daffodil-test's runtime2-specific TDML tests also tests Daffodil's ability to call the C compiler)

We probably could find a way (with a custom sbt plugin or some conditional code in build.sbt) to still build daffodil-runtime2 so that the CLI's "generate c" command still works but disable running both 1) the sbt-cc plugin and 2) the daffodil-runtime2 unit/TDML tests during a "special" sbt build. I'd prefer to put the onus on people to run a "special" sbt build that disables any C compilation so that we still alert developers to problems as early as possible (the longer problems hide, the more time it takes to fix them once they surface).

You're more familiar with SBT's DSL scripting than I am; what's the easiest way to make build.sbt skip running sbt-cc and skip running runtime2-specific unit/TDML tests upon request?

README.md Outdated
and Ubuntu):

# Just mentioning all other packages you might need too
sudo apt install build-essential curl git libmxml-dev
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of putting the instructions in a separate markdown file better. I've moved everything I added to README.md into BUILD.md except for the four build requirements and "See BUILD.md for more details" below them. I still improved some parts of README.md anyway :).

README.md Outdated
and more. [SDK] offers an easy and uniform way to install both java
and sbt on any Unix based platform:

curl -s "https://get.sdkman.io" | bash
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I actually read the script carefully before I ran it and I shouldn't suggest to anyone it's okay to download a script from the Internet and pipe it to bash. If you already have sdk installed, though, then showing people how to install the JDK and SBT become very simple and uniform one-liners that work on any UNIX platform (sigh, wish sdk was everywhere). I'll just recommend people install JDK and SBT directly from their websites, though.

README.md Outdated
sdk install sbt

You can edit the Compile / cCompiler setting in build.sbt if you don't
want sbt to call your C compiler with "cc" as the driver command.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, only Daffodil has logic for checking an environment variable and trying a list of known driver commands to find a C compiler. However, your set Compiler / cCompiler := "gcc" line makes me realize we could embed at least the environment variable logic into build.sbt. I've made a one-line change to build.sbt that sets cCompiler to either "cc" or the value of the environment variable "CC".

Also note that setting CC = "true" could be one way people could avoid calling the C compiler although some unit/TDML tests still would break.

ERR_XML_LEFT,
ERR_XML_MISMATCH,
ERR_XML_WRITE
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ERR_FIXED_VALUE is used only as a validation error. The rest are considered fatal errors which cause the program to stop immediately. When runtime2 grows larger, then we may need ERR_, WRN_, VLD_, and TRY_ to distinguish different kinds of diagnostics, but we haven't gotten that far yet.

const Error *error; // any error which stops program
} UState;

// need_diagnostics - return pointer to validation diagnostics
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, your suggestion is better. Renamed to get_diagnostics.


// need_diagnostics - return pointer to validation diagnostics

extern Diagnostics *need_diagnostics(void);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, we could make the PState/UState have a Diagnostics instance itself. Then there would be no need for a get_diagnostics function. My thinking is that the majority of schemas I've seen don't use fixed values so I don't want to even call get_diagnostics and initialize the Diagnostics instance unless I have to. That thinking probably will change over time.

Comment on lines 40 to 41
val fixed = e.xml.attribute("fixed")
val fixedValue = if (fixed.isDefined) fixed.get.text else ""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved the fixed value logic to the DSOM and updated BinaryAbstractCodeGenerator.scala to use the DSOM getters.

if (fixedValue.nonEmpty) {
val init2 = ""
val parse2 =
s""" parse_validate_fixed(instance->$fieldName$deref == $fixedValue, "$fieldName", pstate);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we had a security issue. Now that I'm using the DSOM to convert the XML attribute to the primitive data type and back to a string, a fixed='0, "foo", pstate); malicious C code here; //' will result in an error.

@tuxji tuxji merged commit e73310c into apache:runtime2-2202 Apr 14, 2021
@tuxji tuxji deleted the ji/runtime2-2202/make-improvements branch April 14, 2021 17:59
Copy link
Member

@stevedlawrence stevedlawrence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional minor questions/comments on the updates.

then run the following commands in a "MSYS2 MSYS" window:

pacman -S gcc git libargp-devel make pkgconf
git clone https://github.com/michaelrsweet/mxml.git
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we include a specific tag here, e.g.

git clone -t v3.2 https://github.com/michaelrsweet/mxml.git

I don't really like the idea of reccommending people use the master branch of a repo we don't control.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will put -t v3.2 in the git clone command both here and in the CI workflow when I rebase the runtime2-2204 branch.

operating system's package manager (for example, `apt` on Debian or
Ubuntu):

sudo apt install build-essential curl git libmxml-dev
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is libxml-dev MiniXML in ubuntu? On fedora libxml is something completely different. This is why I would prefer that we have step by step build instructions for common operating systems to avoid this kind of confusion. Build instructions like "you might need this because your OS my not provide it can be confusing for new users. By giving specific instructions, it makes it easier to figure out exactly what steps to follow, and if you'r on an varient, likely commands that need to be run.

It's also a bit confusing because Windows and no windows things are interleaved. If I'm not on windows, I want to just ignored anything windows related to quickly get set up. For example, I'm on Fedora, there are git clone commands to clone and install mxml. I don't know if I need to do that or not. Turns out, Fedora does have mxml via sudo dnf instal mxml-devel but that isn't anywhere in the instructions. And that version is only 3.1, so I assume it won't work for this, and I still need to install via source?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mini-XML is mxml-dev in Fedora and libmxml-dev in Ubuntu; libxml is something completely different on both Fedora and Ubuntu. If Fedora's mxml-dev is only 3.1, the API is missing a function which we expect to call so we still need to install Mini-XML from source there too. I'll separate the build instructions for Fedora, Ubuntu, and Windows and make them more clear when I rebase the runtime2-2204 branch.

some of the more commonly used commands for Daffodil development.

### Compile
* Compile source code:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally perfer the ### headings over the bullets. It makes it easier to skim over and find the command you might need, rather than having the full sentences.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fair enough. I'll put the ### headings back.

Compile / cCompiler := "cc",
Compile / ccArchiveCommand := "ar",
Compile / cCompiler := sys.env.getOrElse("CC", "cc"),
Compile / ccArchiveCommand := sys.env.getOrElse("AR", "ar"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth mentiong the AR enironment variable in the BUILD setup readme?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if someone says CC=true, they probably need to say AR=true too. I'll mention AR too. Can you think of a clean way to skip the tests too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if there's a simple way to do it. If all that's really needed is to install a compiler and mini xml, I dont't think it's worth it. I think maybe if we get more runtimes or more devs focusing on different runtimes, it might be worth figuring out how to make things more pluggable so people can really only focus on runtimes they care about (as long as CI continues to test all runtimes). But for now, I dont' think it's a big deal. And it will also force people to test it, which is probably important until we get more people working on runtime2.

}

/**
* Is either None or Some(primTypeValue).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment isn't accurent. This is always a pimitive, but might be the special NoValue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had noticed that the comment seemed inaccurate when I copied it too. Yes, I'll update both places.

enum Limits
{
LIMIT_DIAGNOSTICS = 100, // limits how many diagnostics can accumulate
LIMIT_NAME_LENGTH = 9999, // limits how long infoset names can become
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This limit seems pretty huge. I would think we could make this smaller and reduce memory usage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at infoset.c which uses LIMIT_NAME_LENGTH, it's just a couple of functions with static char[9999] buffers and their storage are used only temporarily by the callers long enough to hand off to the Mini-XML library (which copies the actual, usually less than 20 characters long, strings immediately) or compare against strings from the Mini-XML library using strcmp. Setting the limit to 9999 characters only uses static storage in a couple of places and reduces the chance that an infoset namespace or name might not fit in the static buffer.

if (fixedValue.nonEmpty) {
val init2 = ""
val parse2 =
s""" parse_validate_fixed(instance->$fieldName$deref == $fixedValue, "$fieldName", pstate);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this only supports fixed values where the primitive is a numeric? Are string primitives supported yet, or do we need a check here that SDE's if hasFixedValue but primitive type is not a numeric?

Copy link
Contributor Author

@tuxji tuxji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've updated my todos accordingly.

operating system's package manager (for example, `apt` on Debian or
Ubuntu):

sudo apt install build-essential curl git libmxml-dev
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mini-XML is mxml-dev in Fedora and libmxml-dev in Ubuntu; libxml is something completely different on both Fedora and Ubuntu. If Fedora's mxml-dev is only 3.1, the API is missing a function which we expect to call so we still need to install Mini-XML from source there too. I'll separate the build instructions for Fedora, Ubuntu, and Windows and make them more clear when I rebase the runtime2-2204 branch.

then run the following commands in a "MSYS2 MSYS" window:

pacman -S gcc git libargp-devel make pkgconf
git clone https://github.com/michaelrsweet/mxml.git
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will put -t v3.2 in the git clone command both here and in the CI workflow when I rebase the runtime2-2204 branch.

some of the more commonly used commands for Daffodil development.

### Compile
* Compile source code:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fair enough. I'll put the ### headings back.

Compile / cCompiler := "cc",
Compile / ccArchiveCommand := "ar",
Compile / cCompiler := sys.env.getOrElse("CC", "cc"),
Compile / ccArchiveCommand := sys.env.getOrElse("AR", "ar"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if someone says CC=true, they probably need to say AR=true too. I'll mention AR too. Can you think of a clean way to skip the tests too?

}

/**
* Is either None or Some(primTypeValue).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had noticed that the comment seemed inaccurate when I copied it too. Yes, I'll update both places.

enum Limits
{
LIMIT_DIAGNOSTICS = 100, // limits how many diagnostics can accumulate
LIMIT_NAME_LENGTH = 9999, // limits how long infoset names can become
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at infoset.c which uses LIMIT_NAME_LENGTH, it's just a couple of functions with static char[9999] buffers and their storage are used only temporarily by the callers long enough to hand off to the Mini-XML library (which copies the actual, usually less than 20 characters long, strings immediately) or compare against strings from the Mini-XML library using strcmp. Setting the limit to 9999 characters only uses static storage in a couple of places and reduces the chance that an infoset namespace or name might not fit in the static buffer.

if (fixedValue.nonEmpty) {
val init2 = ""
val parse2 =
s""" parse_validate_fixed(instance->$fieldName$deref == $fixedValue, "$fieldName", pstate);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runtime2 doesn't support strings yet, and we do need to consider how to handle fixed values that are strings (we'd have to SDE or call strcmp instead of using value == fixed). I would expect that strings in binary data could have length fields as well as be terminated by a null byte.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants