Update config parsing to optionally return issues instead of raising exceptions #159
Conversation
The method 'validate_config_structure' was duplicating almost all of the logic of 'parse_config_yaml'. This commit merges the "extra" work 'validate_config_structure' was doing (validating that the yaml conforms to the jsonschema spec) into 'parse_config_yaml'. This in turn means that we load the yaml documents one less time, we iterate over the yaml documents one less time, we reduce the number of duplicate errors being produced, and we reduce code complexity by keeping things more DRY.
3aa6595 to 55c3063
Any thoughts on stack trace preservation for ease of debugging? str(Exception) is just str(Exception.args).
@@ -43,7 +43,7 @@
@dataclass(frozen=True)
class ModelBuildResult:  # noqa: D
    model: Optional[UserConfiguredModel] = None
@@ -134,11 +138,11 @@ def parse_config_yaml(
    data_source_class: Type[DataSource] = DataSource,
    metric_class: Type[Metric] = Metric,
    materialization_class: Type[Materialization] = Materialization,
) -> List[Union[DataSource, Metric, Materialization]]:
) -> Tuple[List[Union[DataSource, Metric, Materialization]], List[ValidationIssueType]]:
Normally when you have a pair of lists in a return like this they line up, so I'd expect this to return a 1:1 mapping of results to issues, or possibly a result or an issue in a given index location.
But what this does is interleave them: in the 1:1 case, if there's an issue I generally won't get a result, but it can also spawn a bunch of issues. This means there's no way for me to map issues back to whatever I'm seeing in the return output.
Would we be better off making this return a List[Tuple[Union[DataSource etc.], List[ValidationIssueType]]]? Then every entry has one or both of an object and a list of validation issues.
Alternatively, maybe we're better off with a more descriptive data structure that has result and errors and is documented as having one or the other but not both.
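For illustration, the "descriptive data structure" option might look something like this. The names and fields here (ParseResult, ok) are hypothetical, not the actual MetricFlow types:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ParseResult:
    """Hypothetical outcome of parsing a single config document."""

    result: Optional[object] = None  # e.g. a DataSource / Metric / Materialization
    issues: List[str] = field(default_factory=list)  # e.g. ValidationIssueType entries

    @property
    def ok(self) -> bool:
        return self.result is not None and not self.issues


# A file parse would then return List[ParseResult], keeping each
# document's issues attached to the object (or absence of one) they describe.
good = ParseResult(result={"name": "revenue"})
bad = ParseResult(issues=["missing 'name' key"])
```

Each entry then carries an object, a list of issues, or both, so the caller never has to guess which issue belongs to which document.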
It was not intended that the two lists be related in that manner, i.e. I wasn't intending to ensure a one-to-one matching. Additionally, it's possible that an issue exists for an object that doesn't exist, because creation of the object from the config failed. I'll go the route of having a more descriptive data structure that has results and errors; it'll be somewhat similar to ModelBuildResults.
issues.append(
    ValidationError(
        context=FileContext(file_name=ctx.filename, line_number=ctx.start_line),
        message=f"Unsupported version {version} in config document.",
See, this doesn't continue or skip or anything, so we'll get results offset from errors by one for each document with VERSION_KEY set to the wrong thing.
message=f"YAML document did not conform to metric spec.\nError: {e}",
    )
)
except ScannerError as e:
    raise ParsingException(
        message=str(e),
        ctx=ctx,
        config_filepath=config_yaml.filepath,
    ) from e
except ParsingException:
    raise
    context = FileContext(file_name=ctx.filename, line_number=ctx.start_line) if ctx is not None else None
    issues.append(ValidationError(context=context, message=str(e)))
except ParsingException as e:
    context = FileContext(file_name=ctx.filename, line_number=ctx.start_line) if ctx is not None else None
    issues.append(ValidationError(context=context, message=str(e)))
except Exception as e:
    raise ParsingException(
        message=str(e),
        ctx=ctx,
        config_filepath=config_yaml.filepath,
    ) from e
    context = FileContext(file_name=ctx.filename, line_number=ctx.start_line) if ctx is not None else None
    issues.append(ValidationError(context=context, message=str(e)))
This obliterates the stack traces, which will make debugging really painful for us if we ever hit a model validation issue that resolves to an issue in the parsing/validation stack itself.
Let's either store the stack traces or else collect exceptions and re-throw in the end.
if raise_issues_as_exceptions and build_issues.has_blocking_issues:
    raise ModelValidationException(build_issues.all_issues)
So if we can store the original exceptions we can throw this from those, and we'd be able to collect all of the stack traces. Here it just has string pointers to things happening... somewhere.
I've had mixed experiences with this kind of global "collect issues and maybe throw later" implementation pattern - I think we might want to have cleaner separation between what constitutes an exception and what constitutes a user error, and throw exceptions for the former and collect issues for the latter. I'm not sure how to go about achieving this, honestly, because I don't know where our parsing error boundary ends and the user input error boundary begins, but it's something to think about as we tinker further here.
In the meantime, preserving stack trace information is probably sufficient.
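One way to preserve the stack traces while still collecting issues instead of raising, sketched with a hypothetical CollectedIssue stand-in (not the real ValidationError class): format the traceback at the catch site and store it alongside the message.

```python
import traceback
from dataclasses import dataclass
from typing import Optional


@dataclass
class CollectedIssue:
    """Hypothetical stand-in for a validation issue that keeps its traceback."""

    message: str
    extra_detail: Optional[str] = None  # full formatted traceback, for debugging


def parse_document(document: dict, issues: list) -> None:
    try:
        document["version"]  # stand-in for the real parsing work
    except Exception as e:
        issues.append(
            CollectedIssue(
                message=str(e),
                extra_detail=traceback.format_exc(),  # stack trace preserved as text
            )
        )


collected: list = []
parse_document({}, collected)  # missing 'version' key -> one issue with a traceback
```

Alternatively the caught exception objects themselves could be stored and re-raised (or chained with `raise ... from`) at the end, which keeps live tracebacks instead of strings.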
Agreed 🤔
    )
)
# catches exceptions from jsonschema validator
except exceptions.ValidationError as e:
Oh I see, this thing is from jsonschema. Got it.
Yea 🙃 I should have realized names like ValidationError would eventually have collisions like this. But I didn't, and so here we are 😬
context = FileContext(file_name=ctx.filename, line_number=ctx.start_line)
issues.append(ValidationError(context=context, message=str(e)))
# If a runtime error occurred, we still want this to break things
except RuntimeError:
Interesting. Is this establishing our boundary between user input issues and internal errors?
I don't know if our error hierarchy is this disciplined, though. Even in Python's builtins, ValueError and AssertionError are not RuntimeError subclasses. I suggest we lose the global Exception -> issue conversion and instead refine around what user input might throw.
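The claim about the builtin exception hierarchy is easy to verify:

```python
# ValueError and AssertionError sit directly under Exception,
# not under RuntimeError, so `except RuntimeError` won't catch them.
assert not issubclass(ValueError, RuntimeError)
assert not issubclass(AssertionError, RuntimeError)

# RecursionError, by contrast, really is a RuntimeError subclass.
assert issubclass(RecursionError, RuntimeError)
```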
Yes indeed. Basically internal errors aren't actually validation errors, or shouldn't be. If we swallow them, they might be harder to notice. Ideally, if there is an internal logic error, it should be raised as an exception and hopefully caught during unit tests or sanity testing of changes.
I've been debating adding more properties to ValidationIssue, specifically because some errors are rather large and much of it isn't actually needed. I.e. perhaps adding an
6a7b266 to 28e13c0
…el instead of raising exceptions
…eturned Instead of collecting and raising errors in 'parse_directory_of_yaml_files_to_model' and the methods it calls, we now collect 'ValidationIssues' and build a 'ModelValidationResults' object. The method also takes a new argument 'raise_issues_as_exceptions'. If 'raise_issues_as_exceptions' is 'True', then the default behavior of the method is the same, in that if there are any 'ERROR' level 'ValidationIssues' it raises an exception with the message being a concatenation of all the issues. However, if 'raise_issues_as_exceptions' is 'False' then even if there are 'ERROR' level issues, a ModelValidationResults object is returned and it is expected the caller will handle the results.
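The control flow described in this commit message could be sketched roughly like this. The classes and function here are simplified stand-ins for illustration, not the real MetricFlow implementations:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelValidationResults:
    """Simplified stand-in for the real results object."""

    errors: List[str] = field(default_factory=list)

    @property
    def has_blocking_issues(self) -> bool:
        return bool(self.errors)

    @property
    def all_issues(self) -> List[str]:
        return list(self.errors)


class ModelValidationException(Exception):
    pass


def parse_directory_of_yaml_files_to_model(
    raise_issues_as_exceptions: bool = True,
) -> ModelValidationResults:
    # pretend parsing collected one ERROR-level issue
    results = ModelValidationResults(errors=["metric 'foo' is missing a name"])
    if raise_issues_as_exceptions and results.has_blocking_issues:
        # default behavior: concatenate issues into a single exception
        raise ModelValidationException("\n".join(results.all_issues))
    # opt-out behavior: hand the issues back to the caller to handle
    return results
```

With the flag set to False the caller receives the results object even when there are ERROR-level issues, and decides for itself what to do with them.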
Over the past month it has become the case that ModelBuildResult is never instantiated without a UserConfiguredModel. We want to continue this pattern. To ensure this, we should make the model property of ModelBuildResult required, which this commit does. This also makes it so that we no longer have to assert that a ModelBuildResult has a model.
Previously we weren't exception handling on a document-by-document basis, which means that if there was an issue in one document in the config yaml, we'd cease validating the rest of the documents. This wasn't desirable, because if there are multiple issues, you'd only find out about the first one hit each time parsing is run. This in turn makes fixing issues in configs slow and tedious. Now, in most cases, if there is an exception with a document, we catch it early and continue validating/parsing the rest of the documents.
…ions Previously during parsing, if an exception happened, parsing broke and validation ceased. Now, however, we are able to collect all the issues for a config and return them all gracefully.
Previously we'd been using 'load_all_without_context', to load config yaml for jsonschema validation. However, now we use 'load_all_with_context' in all cases, and want to encourage this as a continued behavior. This is because 'load_all_with_context' adds a '__parsing_context__' to parsed yaml and if someone were to specify a '__parsing_context__' in their yaml, it will be clobbered by 'load_all_with_context', which is desired.
1b4956c to 694f0ff
Everything but the last commit (the custom jsonschema validator) looks good to me.
I think it's worth separating that one into its own PR, honestly, it'll be easier for people to find it and we can do a bit more discussion on that without blocking the rest of these improvements. Or maybe it's fine to put a pattern in each schema and just use the standard Draft7Validator - how many new schemas will we create?
I wish there was a way to share core properties in all schemas, haven't found one yet.
@@ -8,7 +8,6 @@
from datetime import date
from enum import Enum
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple, Union
Typo in commit message: is desire -> is desired
@@ -155,6 +154,7 @@ class ValidationIssue(ABC, BaseModel):
    message: str
    context: Optional[ValidationContext] = None
    extra_detail: Optional[str]
Do we want a List[str] here, or maybe a Dict[str, str]? I guess it doesn't matter much right now.
List[str] works for now. We're gonna want a simple way to stringify whatever extra_detail is, but what that looks like doesn't seem important for this PR. If it was a dict, perhaps just str(extra_details) would be good enough 🤷 But again, I think List[str] is good enough for now.
issues.append(
    ValidationError(
        context=MetricContext(
            file_context=FileContext.from_metadata(metadata=metric.metadata),
            metric=MetricModelReference(metric_name=metric.name),
        ),
        message=traceback.format_exc(),
        message="".join(traceback.format_exception_only(etype=type(e), value=e)),
Did we want to do this format_exception_only treatment on the other errors? I don't have strong opinions one way or the other, just wondering if the difference between these calls and the dir_to_model exceptions is intentional, and if there's something we should prefer.
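For reference, a small example of the difference between the two calls. Note that newer Python versions (3.10+) dropped the etype= keyword form used in the diff, so positional arguments are the safer spelling:

```python
import traceback

try:
    raise ValueError("bad metric config")
except ValueError as e:
    # just the final "Type: message" line, no stack frames
    only = "".join(traceback.format_exception_only(type(e), e))
    # the whole stack trace, ending in that same line
    full = traceback.format_exc()
```

So format_exception_only gives a compact user-facing message, while format_exc keeps the frames that help with debugging; which one fits depends on who reads the resulting issue.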
        "type": _validators.type,
        "uniqueItems": _validators.uniqueItems,
    },
    type_checker=_types.draft7_type_checker,
Do we want to keep the format checker and whatever applicable_validators is doing in place here?
Yes, it looks like I either copied an old version of Draft7Validator or incompletely copied Draft7Validator somehow.
"exclusiveMaximum": _validators.exclusiveMaximum,
"exclusiveMinimum": _validators.exclusiveMinimum,
We've updated a number of "_legacy_validators" references here. These two plus a few others. Was that intentional?
I'm fine with it if it works, might as well make those more current, I'm just curious.
No, it was not intentional. Things were supposed to be on par with whatever Draft7Validator is doing currently.
for error in validator.descend(instance[extra], aP, path=extra):
    yield error
tl;dr: I think we should put yield from back where it was.
This is not exactly the same as the yield from in the original. The difference is subtle: a function using the for loop for x in iterator: yield x holds control over the iterator, while a function using yield from iterator delegates control flow to its caller. The distinction is most likely to manifest if anything in the iterator chain is using send() or if anything is dealing with exception forwarding.
https://peps.python.org/pep-0380/
Note I just learned all of this within the last half hour, so I might be missing something about why this is always going to be totally fine, but it seems like there are cases where this won't exactly do what it needs to do.
From poking around in the jsonschema validator code a bit, it looks like the implementation here is essentially a set of pipes (in the Unix sense) for sending errors from one process step to the next. So breaking that chain could cause weird edge cases to pop up, or maybe just malformed stack traces.
Simplified examples:
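A minimal illustration of the send() difference, with made-up generator names. With yield from, a value sent to the outer generator reaches the inner one; with a plain for-loop, the sent value is swallowed by the loop:

```python
def inner():
    received = yield "ready"
    yield f"inner got: {received}"


def delegating():
    yield from inner()  # send()/throw() are forwarded to inner()


def looping():
    for value in inner():  # send() only resumes this for-loop
        yield value


g1 = delegating()
next(g1)  # advances inner() to its first yield, "ready"
assert g1.send("hello") == "inner got: hello"  # inner() saw the sent value

g2 = looping()
next(g2)
assert g2.send("hello") == "inner got: None"  # sent value never reached inner()
```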
An excellent primer on coroutines and yield expressions, with some examples of unix-pipe-style implementations:
Also, if taking over control flow at this point is intentional, that should probably be documented.
The only difference for this method in comparison to the original was supposed to be the call to custom_find_additonal_properties, so yes I agree, and the control flow change is unintentional.
error = "%s %s not match any of the regexes: %s" % (
    ", ".join(map(repr, sorted(extras))),
    verb,
    ", ".join(map(repr, patterns)),
nit - revert to format strings please.
Based on what I've seen in Metricflow we generally prefer format string syntax (f"...{substitution}..."). In fact, these would be the first printf-style format strings in the repo. I don't think this change is all that helpful for readability over the local variable format string substitution in the original, either.
This was copied from the original, although somehow an out-of-date version. Happy to update to the current version which I somehow didn't copy.
Because this PR includes the merge of the old
All that said, I'm happy to drop the custom validator and instead manually add it as a
Oh I see, the jsonschema validator stack got updated. We're like a year out of date. And it looks like they did yield from because it's more concise but there's no way to be sure.
I think we should unblock this at this point and then open an issue to update the jsonschema package and clean everything up later. We can punt the decision about whether to keep the custom validator or replace it with a utility function to then. I'll document what I'm talking about in the issue.
One change you might want to make before merging is to use extend() instead of create() and pass our custom validator through to clobber the one Draft7 is using. That way we get whatever Draft7 is using but updated with our own validator. Then when we update the package we only need to clean up our custom validator implementation - the Draft7 inheritance should continue to work as-is.
yield ValidationError(error % extras_msg(extras))

SchemaValidator = create(
If you use extend() here you only need to specify validators = {"additionalProperties": customAdditionalProperties}. According to the docblock, that will override the Draft7 default for that one validator while keeping everything else as it is defined in the package, and will give you what amounts to a customized Draft7Validator instance.
Nice call, that makes things more straightforward!
…de in readable str Sometimes some extra detail about a validation issue is desired. The most common example of this is validation issues that are created from exceptions wherein it can be useful to include the stacktrace of the exception.
…yaml' returns The method 'parse_config_yaml' was returning a tuple of two lists. This was confusing because there exists a convention wherein, when a tuple of two lists is returned, it is assumed that the lists are a one-to-one mapping, i.e. that element x of the first list is related to element x of the second list. This convention of association was not intended for the return of 'parse_config_yaml', thus we added a new dataclass 'FileParsingResult' to clarify what is being returned and avoid this unintentional assumed association.
694f0ff to 8f4de4a
We want to be able to ignore all properties in config documents that match the regex '^__(.*)__$', e.g. '__parsing_context__', '__custom_property_name__', etc. To do this without actually adding a patternProperty to every schema, we had to override the 'additionalProperties' validator of the Draft7Validator. As we cannot actually change out the validators used by the Draft7Validator, we had to write our own custom validator, which we call 'SchemaValidator'
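A pure-stdlib sketch of the matching behavior this commit describes. The helper name and the set-based schema handling are invented for illustration; the real change lives inside the custom 'additionalProperties' validator:

```python
import re

# Properties matching this pattern are treated as internal and ignored.
IGNORED_PROPERTY_PATTERN = re.compile(r"^__(.*)__$")


def find_extra_properties(instance: dict, schema_properties: set) -> list:
    """Return unexpected property names, skipping '__dunder__'-style keys."""
    return [
        key
        for key in instance
        if key not in schema_properties and not IGNORED_PROPERTY_PATTERN.match(key)
    ]


document = {"name": "revenue", "__parsing_context__": "...", "typo_key": 1}
extras = find_extra_properties(document, schema_properties={"name"})
# '__parsing_context__' is skipped; only 'typo_key' is flagged as extra
```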
8f4de4a to 113fe93
The following are examples, prior to this PR and with this PR, for configs which have two issues: 1) a metric without a name, and 2) a time dimension missing type params. Note that prior to this PR only one of the issues gets caught (that's because it's the first one hit), whereas both are identified with the updates from this PR. I admit the JSON schema validations still aren't the best, but they are a lot easier to look at than before.
Previously:
Now: