
Update config parsing to optionally return issues instead of raising exceptions #159

Merged

Conversation

Contributor

@QMalcolm QMalcolm commented Jul 14, 2022

The following are examples, prior to this PR and with this PR, for configs that have two issues: 1) a metric without a name, and 2) a time dimension missing type params. Note that prior to this PR only one of the issues gets caught (because it's the first one hit), whereas both are identified with the updates from this PR. I admit that the JSON schema validations still aren't the best, but they are a lot easier to look at than before.

Previously

poetry run mf validate-configs
(To see warnings and future-errors, run again with flag `--show-all`)
✔ 🎉 Successfully linted config YAML files (FATALS: 0, ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
⠙ Building model from configs
ERROR: An error occurred when attempting to build the semantic model
Log file: /Users/tqm/.metricflow/logs/metricflow.log
Traceback (most recent call last):
  File "/Users/tqm/Developer/transform-data/metricflow/metricflow/model/parsing/dir_to_model.py", line 141, in parse_config_yaml
    validate_config_structure(config_yaml)
  File "/Users/tqm/Developer/transform-data/metricflow/metricflow/model/parsing/validation.py", line 65, in validate_config_structure
    raise exceptions.ValidationError("\n".join(errors))
jsonschema.exceptions.ValidationError: {'name': 'ds', 'type': 'time', 'expr': 'CAST(created_date as DATETIME)'} is not valid under any of the given schemas

Failed validating 'anyOf' in schema['properties']['dimensions']['items']:
    {'$id': 'dimension_schema',
     'additionalProperties': False,
     'anyOf': [{'not': {'$ref': '#/definitions/is-time-dimension'}},
               {'required': ['type_params']}],
     'definitions': {'is-time-dimension': {'properties': {'type': {'enum': ['TIME',
                                                                            'time']}},
                                           'required': ['type']}},
     'properties': {'description': {'type': 'string'},
                    'display_name': {'type': 'string'},
                    'expr': {'type': ['string', 'boolean']},
                    'is_partition': {'type': 'boolean'},
                    'name': {'pattern': '(?!.*__).*^[a-z][a-z0-9_]*[a-z0-9]$',
                             'type': 'string'},
                    'owners': {'items': {'type': 'string'},
                               'type': 'array'},
                    'tier': {'type': ['string', 'integer']},
                    'type': {'enum': ['CATEGORICAL',
                                      'TIME',
                                      'categorical',
                                      'time']},
                    'type_params': {'$ref': 'dimension_type_params_schema'}},
     'required': ['name', 'type'],
     'type': 'object'}

On instance['dimensions'][0]:
    {'expr': 'CAST(created_date as DATETIME)', 'name': 'ds', 'type': 'time'}
'name' is a required property

Failed validating 'required' in schema:
    {'$id': 'metric',
     'additionalProperties': False,
     'properties': {'constraint': {'type': 'string'},
                    'description': {'type': 'string'},
                    'display_name': {'type': 'string'},
                    'locked_metadata': {'$ref': 'locked_metadata'},
                    'name': {'pattern': '(?!.*__).*^[a-z][a-z0-9_]*[a-z0-9]$',
                             'type': 'string'},
                    'owners': {'items': {'type': 'string'},
                               'type': 'array'},
                    'tier': {'type': ['string', 'integer']},
                    'type': {'enum': ['MEASURE_PROXY',
                                      'RATIO',
                                      'EXPR',
                                      'CUMULATIVE',
                                      'measure_proxy',
                                      'ratio',
                                      'expr',
                                      'cumulative']},
                    'type_params': {'$ref': 'metric_type_params'},
                    'where_constraint': {'type': 'string'}},
     'required': ['name', 'type', 'type_params'],
     'type': 'object'}

On instance:
    {'constraint': "stage_name = '6 - Closed Won' \n",
     'owners': ['support@transformdata.io'],
     'type': 'measure_proxy',
     'type_params': {'measures': ['opps']}}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/tqm/Developer/transform-data/metricflow/metricflow/engine/utils.py", line 23, in build_user_configured_model_from_config
    model = parse_directory_of_yaml_files_to_model(models_path).model
  File "/Users/tqm/Developer/transform-data/metricflow/metricflow/model/parsing/dir_to_model.py", line 74, in parse_directory_of_yaml_files_to_model
    model = parse_yaml_files_to_model(yaml_config_files)
  File "/Users/tqm/Developer/transform-data/metricflow/metricflow/model/parsing/dir_to_model.py", line 103, in parse_yaml_files_to_model
    objects = parse_config_yaml(  # parse config file
  File "/Users/tqm/Developer/transform-data/metricflow/metricflow/model/parsing/dir_to_model.py", line 223, in parse_config_yaml
    raise ParsingException(
metricflow.errors.errors.ParsingException: Failed to parse YAML file '/Users/tqm/Developer/transform-data/metricflow/config-templates/salesforce/opportunities.yaml' - YAML file did not conform to metric spec.
Error: {'name': 'ds', 'type': 'time', 'expr': 'CAST(created_date as DATETIME)'} is not valid under any of the given schemas

Failed validating 'anyOf' in schema['properties']['dimensions']['items']:
    {'$id': 'dimension_schema',
     'additionalProperties': False,
     'anyOf': [{'not': {'$ref': '#/definitions/is-time-dimension'}},
               {'required': ['type_params']}],
     'definitions': {'is-time-dimension': {'properties': {'type': {'enum': ['TIME',
                                                                            'time']}},
                                           'required': ['type']}},
     'properties': {'description': {'type': 'string'},
                    'display_name': {'type': 'string'},
                    'expr': {'type': ['string', 'boolean']},
                    'is_partition': {'type': 'boolean'},
                    'name': {'pattern': '(?!.*__).*^[a-z][a-z0-9_]*[a-z0-9]$',
                             'type': 'string'},
                    'owners': {'items': {'type': 'string'},
                               'type': 'array'},
                    'tier': {'type': ['string', 'integer']},
                    'type': {'enum': ['CATEGORICAL',
                                      'TIME',
                                      'categorical',
                                      'time']},
                    'type_params': {'$ref': 'dimension_type_params_schema'}},
     'required': ['name', 'type'],
     'type': 'object'}

On instance['dimensions'][0]:
    {'expr': 'CAST(created_date as DATETIME)', 'name': 'ds', 'type': 'time'}
'name' is a required property

Failed validating 'required' in schema:
    {'$id': 'metric',
     'additionalProperties': False,
     'properties': {'constraint': {'type': 'string'},
                    'description': {'type': 'string'},
                    'display_name': {'type': 'string'},
                    'locked_metadata': {'$ref': 'locked_metadata'},
                    'name': {'pattern': '(?!.*__).*^[a-z][a-z0-9_]*[a-z0-9]$',
                             'type': 'string'},
                    'owners': {'items': {'type': 'string'},
                               'type': 'array'},
                    'tier': {'type': ['string', 'integer']},
                    'type': {'enum': ['MEASURE_PROXY',
                                      'RATIO',
                                      'EXPR',
                                      'CUMULATIVE',
                                      'measure_proxy',
                                      'ratio',
                                      'expr',
                                      'cumulative']},
                    'type_params': {'$ref': 'metric_type_params'},
                    'where_constraint': {'type': 'string'}},
     'required': ['name', 'type', 'type_params'],
     'type': 'object'}

On instance:
    {'constraint': "stage_name = '6 - Closed Won' \n",
     'owners': ['support@transformdata.io'],
     'type': 'measure_proxy',
     'type_params': {'measures': ['opps']}}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/tqm/Developer/transform-data/metricflow/metricflow/cli/utils.py", line 198, in wrapper
    func(*args, **kwargs)
  File "/Users/tqm/Developer/transform-data/metricflow/metricflow/telemetry/reporter.py", line 149, in wrapped
    return func(*args, **kwargs)
  File "/Users/tqm/Developer/transform-data/metricflow/metricflow/cli/main.py", line 721, in validate_configs
    user_model = cfg.user_configured_model
  File "/Users/tqm/Developer/transform-data/metricflow/metricflow/cli/cli_context.py", line 111, in user_configured_model
    self._user_configured_model = build_user_configured_model_from_config(self.config)
  File "/Users/tqm/Developer/transform-data/metricflow/metricflow/engine/utils.py", line 27, in build_user_configured_model_from_config
    raise ModelCreationException from e
metricflow.errors.errors.ModelCreationException: An error occurred when attempting to build the semantic model

Now

$ poetry run mf validate-configs    
(To see warnings and future-errors, run again with flag `--show-all`)
✔ 🎉 Successfully linted config YAML files (FATALS: 0, ERRORS: 0, FUTURE_ERRORS: 0, WARNINGS: 0)
✖ Breaking issues found when building model from configs (FATALS: 0, ERRORS: 2, FUTURE_ERRORS: 0, WARNINGS: 0)
• ERROR: in file `/Users/tqm/Developer/transform-data/metricflow/config-templates/salesforce/opportunities.yaml` on line #1 - YAML document did not conform to metric spec.
Error: {'name': 'ds', 'type': 'time', 'expr': 'CAST(created_date as DATETIME)', '__parsing_context__': <metricflow.model.parsing.yaml_loader.ParsingContext object at 0x120578490>} is not valid under any of the given schemas

Failed validating 'anyOf' in schema['properties']['dimensions']['items']:
    {'$id': 'dimension_schema',
     'additionalProperties': False,
     'anyOf': [{'not': {'$ref': '#/definitions/is-time-dimension'}},
               {'required': ['type_params']}],
     'definitions': {'is-time-dimension': {'properties': {'type': {'enum': ['TIME',
                                                                            'time']}},
                                           'required': ['type']}},
     'properties': {'__parsing_context__': {},
                    'description': {'type': 'string'},
                    'display_name': {'type': 'string'},
                    'expr': {'type': ['string', 'boolean']},
                    'is_partition': {'type': 'boolean'},
                    'name': {'pattern': '(?!.*__).*^[a-z][a-z0-9_]*[a-z0-9]$',
                             'type': 'string'},
                    'owners': {'items': {'type': 'string'},
                               'type': 'array'},
                    'tier': {'type': ['string', 'integer']},
                    'type': {'enum': ['CATEGORICAL',
                                      'TIME',
                                      'categorical',
                                      'time']},
                    'type_params': {'$ref': 'dimension_type_params_schema'}},
     'required': ['name', 'type'],
     'type': 'object'}

On instance['dimensions'][0]:
    {'__parsing_context__': <metricflow.model.parsing.yaml_loader.ParsingContext object at 0x120578490>,
     'expr': 'CAST(created_date as DATETIME)',
     'name': 'ds',
     'type': 'time'}
• ERROR: in file `/Users/tqm/Developer/transform-data/metricflow/config-templates/salesforce/opportunities.yaml` on line #62 - YAML document did not conform to metric spec.
Error: 'name' is a required property

Failed validating 'required' in schema:
    {'$id': 'metric',
     'additionalProperties': False,
     'properties': {'__parsing_context__': {},
                    'constraint': {'type': 'string'},
                    'description': {'type': 'string'},
                    'display_name': {'type': 'string'},
                    'locked_metadata': {'$ref': 'locked_metadata'},
                    'name': {'pattern': '(?!.*__).*^[a-z][a-z0-9_]*[a-z0-9]$',
                             'type': 'string'},
                    'owners': {'items': {'type': 'string'},
                               'type': 'array'},
                    'tier': {'type': ['string', 'integer']},
                    'type': {'enum': ['MEASURE_PROXY',
                                      'RATIO',
                                      'EXPR',
                                      'CUMULATIVE',
                                      'measure_proxy',
                                      'ratio',
                                      'expr',
                                      'cumulative']},
                    'type_params': {'$ref': 'metric_type_params'},
                    'where_constraint': {'type': 'string'}},
     'required': ['name', 'type', 'type_params'],
     'type': 'object'}

On instance:
    {'__parsing_context__': <metricflow.model.parsing.yaml_loader.ParsingContext object at 0x12057a430>,
     'constraint': "stage_name = '6 - Closed Won' \n",
     'owners': ['support@transformdata.io'],
     'type': 'measure_proxy',
     'type_params': {'__parsing_context__': <metricflow.model.parsing.yaml_loader.ParsingContext object at 0x12057a4f0>,
                     'measures': ['opps']}}

The method 'validate_config_structure' was duplicating a lot of
(almost all of) the logic of 'parse_config_yaml'. This commit merges
the "extra" work 'validate_config_structure' was doing (validating
that the YAML conformed to the jsonschema spec) into
'parse_config_yaml'. This in turn means that we load the YAML
documents one less time, we iterate over the YAML documents one less
time, we reduce the number of duplicate errors being produced, and we
reduce code complexity by keeping things more DRY.
@QMalcolm QMalcolm requested a review from tlento July 14, 2022 21:21
@QMalcolm QMalcolm force-pushed the qmalcolm--parsing-returns-issues-during-for-config-validation branch from 3aa6595 to 55c3063 Compare July 14, 2022 23:08
@QMalcolm QMalcolm marked this pull request as ready for review July 18, 2022 18:44
Contributor

@tlento tlento left a comment


Any thoughts on stack trace preservation for ease of debugging? str(Exception) is just str(Exception.args).

@@ -43,7 +43,7 @@

@dataclass(frozen=True)
class ModelBuildResult: # noqa: D
model: Optional[UserConfiguredModel] = None

metricflow/model/parsing/schemas.py (outdated conversation, resolved)
@@ -134,11 +138,11 @@ def parse_config_yaml(
data_source_class: Type[DataSource] = DataSource,
metric_class: Type[Metric] = Metric,
materialization_class: Type[Materialization] = Materialization,
) -> List[Union[DataSource, Metric, Materialization]]:
) -> Tuple[List[Union[DataSource, Metric, Materialization]], List[ValidationIssueType]]:
Contributor


Normally when you have a pair of lists in a return like this they line up. So I'd expect this to return a 1:1 mapping of results to issues, or possibly a result or an issue in a given index location.

But what this does is interleave them: in the 1:1 case, if there's an issue I generally won't get a corresponding object, but a single document can also spawn a bunch of issues. This means there's no way for me to map issues back to whatever I'm seeing in the return output.

Would we be better off making this return a List[Tuple[Union[DataSource, etc.], List[ValidationIssueType]]]? Then every entry has one or both of an object and a list of validation issues.

Alternatively, maybe we're better off with a more descriptive data structure that has result and errors and is documented as having one or the other but not both.

Contributor Author


It was not intended that the two lists be related in that manner, i.e. I wasn't intending to ensure a one-to-one matching. Additionally, it's possible for an issue to exist for an object that was never created, because creation of the object from the config failed. I'll go the route of having a more descriptive data structure that has results and errors; it'll be somewhat similar to ModelBuildResult.

Comment on lines +180 to +218
issues.append(
ValidationError(
context=FileContext(file_name=ctx.filename, line_number=ctx.start_line),
message=f"Unsupported version {version} in config document.",
Contributor


Seems like this doesn't continue or skip or anything, so we'll get results offset from errors by one per document that has VERSION_KEY set to the wrong thing.

Comment on lines 223 to 234
message=f"YAML document did not conform to metric spec.\nError: {e}",
)
)
except ScannerError as e:
raise ParsingException(
message=str(e),
ctx=ctx,
config_filepath=config_yaml.filepath,
) from e
except ParsingException:
raise
context = FileContext(file_name=ctx.filename, line_number=ctx.start_line) if ctx is not None else None
issues.append(ValidationError(context=context, message=str(e)))
except ParsingException as e:
context = FileContext(file_name=ctx.filename, line_number=ctx.start_line) if ctx is not None else None
issues.append(ValidationError(context=context, message=str(e)))
except Exception as e:
raise ParsingException(
message=str(e),
ctx=ctx,
config_filepath=config_yaml.filepath,
) from e
context = FileContext(file_name=ctx.filename, line_number=ctx.start_line) if ctx is not None else None
issues.append(ValidationError(context=context, message=str(e)))
Contributor


This obliterates the stack traces, which will make debugging really painful for us if we ever hit a model validation issue that resolves to an issue in the parsing/validation stack itself.

Let's either store the stack traces or else collect exceptions and re-throw at the end.

Comment on lines +98 to +112
if raise_issues_as_exceptions and build_issues.has_blocking_issues:
raise ModelValidationException(build_issues.all_issues)
Contributor


So if we can store the original exceptions we can throw this from those, and we'd be able to collect all of the stack traces. Here it just has string pointers to things happening... somewhere.

I've had mixed experiences with this kind of global "collect issues and maybe throw later" implementation pattern - I think we might want to have cleaner separation between what constitutes an exception and what constitutes a user error, and throw exceptions for the former and collect issues for the latter. I'm not sure how to go about achieving this, honestly, because I don't know where our parsing error boundary ends and the user input error boundary begins, but it's something to think about as we tinker further here.

In the meantime, preserving stack trace information is probably sufficient.

Contributor Author


Agreed 🤔

)
)
# catches exceptions from jsonschema validator
except exceptions.ValidationError as e:
Contributor


Oh I see, this thing is from jsonschema. Got it.

Contributor Author


Yea 🙃 I should have realized names like ValidationError would eventually have collisions like this. But I didn't, and so here we are 😬

context = FileContext(file_name=ctx.filename, line_number=ctx.start_line)
issues.append(ValidationError(context=context, message=str(e)))
# If a runtime error occurred, we still want this to break things
except RuntimeError:
Contributor


Interesting. Is this establishing our boundary between user input issues and internal errors?

I don't know if our error hierarchy is that disciplined, though. Even in Python, ValueError and AssertionError are not RuntimeError subclasses. I suggest we lose the global Exception -> issue conversion and instead refine around what user input might throw.

Contributor Author


Yes indeed. Basically, internal errors aren't actually validation errors, or shouldn't be. If we swallow them, they might be harder to notice. Ideally, if there is an internal logic error, it should be raised as an exception and hopefully caught during unit tests or sanity testing of changes.

metricflow/cli/main.py (resolved)
metricflow/model/parsing/dir_to_model.py (outdated conversation, resolved)
@QMalcolm
Contributor Author

Any thoughts on stack trace preservation for ease of debugging? str(Exception) is just str(Exception.args).

I've been debating adding more properties to ValidationIssue, specifically because some errors are rather large and much of the content isn't actually needed. E.g., perhaps adding an extra_details property which is then available if a flag like --verbose is specified.

@QMalcolm QMalcolm force-pushed the qmalcolm--parsing-returns-issues-during-for-config-validation branch from 6a7b266 to 28e13c0 Compare July 20, 2022 00:03
…eturned

Instead of collecting and raising errors in 'parse_directory_of_yaml_files_to_model'
and the methods it calls, we now collect 'ValidationIssues' and build a
'ModelValidationResults' object. The method also takes a new argument,
'raise_issues_as_exceptions'. If 'raise_issues_as_exceptions' is 'True', the
default behavior of the method is unchanged: if there are any 'ERROR'-level
'ValidationIssues', it raises an exception whose message is a concatenation of
all the issues. However, if 'raise_issues_as_exceptions' is 'False', then even
if there are 'ERROR'-level issues, a 'ModelValidationResults' object is
returned and the caller is expected to handle the results.
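
A minimal sketch of that control flow, with simplified stand-ins for the MetricFlow types (has_blocking_issues and all_issues appear in the diff below; everything else here is an assumption):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ValidationIssue:  # simplified stand-in
    message: str
    is_blocking: bool = True

@dataclass
class ModelValidationResults:  # simplified stand-in
    all_issues: List[ValidationIssue] = field(default_factory=list)

    @property
    def has_blocking_issues(self) -> bool:
        return any(issue.is_blocking for issue in self.all_issues)

class ModelValidationException(Exception):
    pass

def parse_directory_of_yaml_files_to_model(
    directory: str, raise_issues_as_exceptions: bool = True
) -> ModelValidationResults:
    issues: List[ValidationIssue] = []  # collected instead of raised
    # ... walk the YAML files in `directory`, appending to `issues` ...
    results = ModelValidationResults(all_issues=issues)
    if raise_issues_as_exceptions and results.has_blocking_issues:
        # Default behavior is unchanged: ERROR-level issues become one
        # exception whose message concatenates every collected issue.
        raise ModelValidationException(
            "\n".join(i.message for i in results.all_issues)
        )
    # With raise_issues_as_exceptions=False the caller handles the results.
    return results
```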
Over the past month it has become the case that ModelBuildResult is
never instantiated without a UserConfiguredModel. We want to continue
this pattern. To ensure this, we should make the model property of
ModelBuildResult required, which this commit does. This also makes
it so that we no longer have to assert that a ModelBuildResult has
a model.
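
In code terms the change amounts to dropping the Optional default shown in the diff above (a sketch; the real class carries more fields):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UserConfiguredModel:  # stand-in for the real model class
    pass

@dataclass(frozen=True)
class ModelBuildResult:  # noqa: D
    # Required field: ModelBuildResult can no longer be constructed
    # without a model, so callers can drop their `assert result.model`.
    model: UserConfiguredModel
```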
Previously we weren't handling exceptions on a document-by-document
basis, which meant that if there was an issue in one document in the
config YAML, we'd cease validating the rest of the documents. This
wasn't desirable, because if there are multiple issues, you'd only
find out about the first one hit each time parsing is run. This in
turn makes fixing issues in configs slow and tedious. Now, in most
cases, if there is an exception in a document, we catch it early and
continue validating/parsing the rest of the documents.
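
A sketch of the document-by-document pattern described here, with a hypothetical parse_one callback standing in for the real per-document parsing and validation:

```python
from typing import Any, Callable, List, Tuple

def parse_documents(
    documents: List[Any],
    parse_one: Callable[[Any], Any],  # hypothetical: may raise on a bad document
) -> Tuple[List[Any], List[str]]:
    objects: List[Any] = []
    issues: List[str] = []
    for doc in documents:
        try:
            objects.append(parse_one(doc))
        except Exception as e:
            # Record the failure and keep going: one bad document no
            # longer hides the issues in the documents after it.
            issues.append(str(e))
    return objects, issues
```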
…ions

Previously during parsing, if an exception happened, it broke things and
validation ceased. Now, however, we are able to collect all the issues for
a config and return them gracefully.
Previously we'd been using 'load_all_without_context' to load config
YAML for jsonschema validation. However, we now use 'load_all_with_context'
in all cases, and want to encourage this as the continued behavior. This is
because 'load_all_with_context' adds a '__parsing_context__' key to parsed
YAML, and if someone were to specify a '__parsing_context__' in their YAML, it
will be clobbered by 'load_all_with_context', which is desired.
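
The general PyYAML technique looks something like the sketch below; the class name is hypothetical, and a plain line number stands in for MetricFlow's ParsingContext object:

```python
import yaml

PARSING_CONTEXT_KEY = "__parsing_context__"

class ContextAddingLoader(yaml.SafeLoader):
    def construct_mapping(self, node, deep=False):
        mapping = super().construct_mapping(node, deep=deep)
        # Stamp every mapping with the line it starts on; any user-supplied
        # '__parsing_context__' key is clobbered, which is desired.
        mapping[PARSING_CONTEXT_KEY] = node.start_mark.line + 1
        return mapping

docs = list(yaml.load_all("name: ds\ntype: time\n", Loader=ContextAddingLoader))
print(docs[0][PARSING_CONTEXT_KEY])  # -> 1
```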
@QMalcolm QMalcolm force-pushed the qmalcolm--parsing-returns-issues-during-for-config-validation branch from 1b4956c to 694f0ff Compare July 21, 2022 19:01
@QMalcolm QMalcolm requested a review from tlento July 21, 2022 19:02
Contributor

@tlento tlento left a comment


Everything but the last commit (the custom jsonschema validator) looks good to me.

I think it's worth separating that one into its own PR, honestly, it'll be easier for people to find it and we can do a bit more discussion on that without blocking the rest of these improvements. Or maybe it's fine to put a pattern in each schema and just use the standard Draft7Validator - how many new schemas will we create?

I wish there was a way to share core properties in all schemas, haven't found one yet.

@@ -8,7 +8,6 @@
from datetime import date
from enum import Enum
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple, Union

Contributor


Typo in commit message: is desire -> is desired

@@ -155,6 +154,7 @@ class ValidationIssue(ABC, BaseModel):

message: str
context: Optional[ValidationContext] = None
extra_detail: Optional[str]
Contributor


Do we want a List[str] here, or maybe a Dict[str, str]? I guess it doesn't matter much right now.

Contributor Author


List[str] works for now. We're gonna want a simple way to stringify whatever extra_detail is, but what that looks like doesn't seem important for this PR. If it were a dict, perhaps just str(extra_details) would be good enough 🤷 But again, I think List[str] is good enough for now.

issues.append(
ValidationError(
context=MetricContext(
file_context=FileContext.from_metadata(metadata=metric.metadata),
metric=MetricModelReference(metric_name=metric.name),
),
message=traceback.format_exc(),
message="".join(traceback.format_exception_only(etype=type(e), value=e)),
Contributor


Did we want to do this format_exception_only treatment on the other errors? I don't have strong opinions one way or the other, just wondering if the difference between these calls and the dir_to_model exceptions is intentional, and if there's something we should prefer.
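
For reference, the difference between the two calls: traceback.format_exc() renders the full stack of the current exception, while format_exception_only() renders just the final "ExceptionType: message" line(s):

```python
import traceback

try:
    {}["missing"]
except KeyError as e:
    full = traceback.format_exc()  # stack frames plus the exception line
    only = "".join(traceback.format_exception_only(type(e), e))
    print(only)  # KeyError: 'missing' -- no frames, just the summary line
```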

"type": _validators.type,
"uniqueItems": _validators.uniqueItems,
},
type_checker=_types.draft7_type_checker,
Contributor


Do we want to keep the format checker and whatever applicable_validators is doing in place here?

https://github.com/python-jsonschema/jsonschema/blob/72c3200d9b14139c9ab3094d0da4c0dd68acafe3/jsonschema/validators.py#L533-L535

Contributor Author


Yes, looks like I either copied an old version of Draft7Validator or incompletely copied Draft7Validator somehow.

Comment on lines 79 to 80
"exclusiveMaximum": _validators.exclusiveMaximum,
"exclusiveMinimum": _validators.exclusiveMinimum,
Contributor


We've updated a number of "_legacy_validators" references here. These two plus a few others. Was that intentional?

I'm fine with it if it works, might as well make those more current, I'm just curious.

https://github.com/python-jsonschema/jsonschema/blob/72c3200d9b14139c9ab3094d0da4c0dd68acafe3/jsonschema/validators.py#L505-L506

Contributor Author


No, it was not intentional. Things were supposed to be on par with whatever Draft7Validator is doing currently.

Comment on lines +47 to +48
for error in validator.descend(instance[extra], aP, path=extra):
yield error
Contributor


tl;dr: I think we should put `yield from` back where it was.

This is not exactly the same as the `yield from` in the original. The difference is subtle: a function using the for loop (`for x in iterator: yield x`) holds control over the iterator, while a function using `yield from iterator` delegates control flow to its caller. The distinction is most likely to manifest if anything in the iterator chain is using send() or if anything is dealing with exception forwarding.

https://peps.python.org/pep-0380/

Note I just learned all of this within the last half hour so I might be missing something about why this is always going to be totally fine, but it seems like there are cases where this won't exactly do what it needs to do.

From poking around in the jsonschema validator code a bit, it looks like the implementation here is essentially a set of pipes (in the Unix sense) for sending errors from one processing step to the next. So breaking that chain could cause weird edge cases to pop up, or maybe just malformed stack traces.

Simplified examples:

https://stackoverflow.com/questions/9708902/in-practice-what-are-the-main-uses-for-the-yield-from-syntax-in-python-3-3

An excellent primer on coroutines and yield expressions, with some examples of unix pipe style implementations:

http://dabeaz.com/coroutines/Coroutines.pdf
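
A minimal demonstration of the difference being discussed: with `yield from`, a value passed to send() reaches the inner generator; with the for-loop form it is absorbed by the outer frame:

```python
def inner():
    while True:
        received = yield
        print("inner got:", received)

def delegating():
    yield from inner()    # send()/throw() pass straight through to inner()

def looping():
    for item in inner():  # send() resumes *this* frame; inner() never sees it
        yield item

d = delegating()
next(d)
d.send("hello")  # -> inner got: hello

l = looping()
next(l)
l.send("hello")  # -> inner got: None
```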

Contributor


Also, if taking over control flow at this point is intentional, that should probably be documented.

Contributor Author

@QMalcolm QMalcolm Jul 24, 2022


The only difference for this method in comparison to the original was supposed to be the call to custom_find_additonal_properties, so yes I agree, and the control flow change is unintentional.

Comment on lines +56 to +59
error = "%s %s not match any of the regexes: %s" % (
", ".join(map(repr, sorted(extras))),
verb,
", ".join(map(repr, patterns)),
Contributor


nit - revert to format strings please.

Based on what I've seen in Metricflow we generally prefer format string syntax (f"...{substitution}...."). In fact, these would be the first printf-style format strings in the repo.

I don't think this change is all that helpful for readability over the local variable format string substitution in the original, either.
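
For reference, an f-string version of the quoted snippet might read (assuming extras, verb, and patterns are in scope as in the original):

```python
error = (
    f"{', '.join(map(repr, sorted(extras)))} {verb} not match any of the "
    f"regexes: {', '.join(map(repr, patterns))}"
)
```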

Contributor Author


This was copied from the original, although somehow an out-of-date version. Happy to update to the current version, which I somehow didn't copy.

@QMalcolm
Contributor Author

Everything but the last commit (the custom jsonschema validator) looks good to me.

I think it's worth separating that one into its own PR, honestly, it'll be easier for people to find it and we can do a bit more discussion on that without blocking the rest of these improvements. Or maybe it's fine to put a pattern in each schema and just use the standard Draft7Validator - how many new schemas will we create?

I wish there was a way to share core properties in all schemas, haven't found one yet.

Because this PR includes the merge of the old validate_config_structure into parse_config_yaml, we have to ignore the __parsing_context__ key in objects somehow. I originally did this by simply including it as a property in all schemas, which was a sticking point; that's why we ended up adding the custom validator instead, because it was desired to ignore the key programmatically rather than explicitly saying "hey, you can add this to your configs". I am happy to instead do it as a patternProperty, but I don't know how that gets past the originally raised concern. It seems, though, that I copied an out-of-date version of jsonschema and the included Draft7Validator.

All that said, I'm happy to drop the custom validator and instead manually add it as a patternProperty to every schema, if that is acceptable (which it sounds like it is).

Contributor

@tlento tlento left a comment


Oh I see, the jsonschema validator stack got updated. We're like a year out of date. And it looks like they did yield from because it's more concise but there's no way to be sure.

I think we should unblock this at this point and then open an issue to update the jsonschema package and clean everything up later. We can punt the decision about whether to keep the custom validator or replace it with a utility function until then. I'll document what I'm talking about in the issue.

One change you might want to make before merging is to use extend() instead of create() and pass our custom validator through to clobber the one Draft7 is using. That way we get whatever Draft7 is using but updated with our own validator. Then when we update the package we only need to clean up our custom validator implementation - the Draft7 inheritance should continue to work as-is.

yield ValidationError(error % extras_msg(extras))


SchemaValidator = create(
Contributor


If you use extend() here you only need to specify validators = {"additionalProperties": customAdditionalProperties}

According to the docblock that will override the Draft7 default for that one validator, while keeping everything else as it is defined in the package, and will give you what amounts to a customized Draft7Validator instance.
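
A sketch of that suggestion: swap in a custom additionalProperties check via extend() and inherit everything else from Draft7Validator. The filtering body here is illustrative only (the PR's actual helper is custom_find_additonal_properties):

```python
import re
from jsonschema import Draft7Validator
from jsonschema.validators import extend

_IGNORED = re.compile(r"^__.*__$")  # e.g. '__parsing_context__'

def _additional_properties(validator, aP, instance, schema):
    # Drop dunder keys, then delegate to the stock Draft7 implementation.
    if not validator.is_type(instance, "object"):
        return
    filtered = {k: v for k, v in instance.items() if not _IGNORED.match(k)}
    yield from Draft7Validator.VALIDATORS["additionalProperties"](
        validator, aP, filtered, schema
    )

# Only this one validator is overridden; the rest stays as Draft7 defines it.
SchemaValidator = extend(
    Draft7Validator, validators={"additionalProperties": _additional_properties}
)
```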

Contributor Author


Nice call, that makes things more straightforward!

…de in readable str

Sometimes some extra detail about a validation issue is desired. The most common
example of this is validation issues created from exceptions, where it can be
useful to include the stack trace of the exception.
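
A runnable sketch of the pattern, with a simplified stand-in for the PR's ValidationError class:

```python
import traceback
from dataclasses import dataclass
from typing import Optional

@dataclass
class ValidationError:  # simplified stand-in for the PR's issue class
    message: str
    extra_detail: Optional[str] = None

issues = []
try:
    raise RuntimeError("boom")  # stand-in for a parsing step that may raise
except Exception as e:
    issues.append(
        ValidationError(
            message=str(e),
            # Headline message stays short; the full stack trace is kept
            # for debugging (e.g. surfaced only behind a --verbose flag).
            extra_detail=traceback.format_exc(),
        )
    )
```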
…yaml' returns

The method 'parse_config_yaml' was returning a tuple of two lists. This was
confusing because there exists a convention wherein when a tuple of two lists
is returned, it is assumed that the lists are a one-to-one mapping, i.e. that
element x of the first list is related to element x of the second list. This
convention of association was not intended for the return of
'parse_config_yaml', thus we added a new dataclass, 'FileParsingResult', to
clarify what is being returned and avoid this unintentionally assumed
association.
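
A sketch of the resulting shape; only the class name comes from the commit, the field names are assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass(frozen=True)
class FileParsingResult:
    # One parsed file's objects, plus any issues hit while parsing it;
    # the two are deliberately not a one-to-one mapping.
    elements: List[Union["DataSource", "Metric", "Materialization"]]
    issues: List["ValidationIssueType"] = field(default_factory=list)
```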
@QMalcolm QMalcolm force-pushed the qmalcolm--parsing-returns-issues-during-for-config-validation branch from 694f0ff to 8f4de4a Compare July 25, 2022 16:57
We want to be able to ignore all properties in config documents that
match the regex '^__(.*)__$', e.g. '__parsing_context__',
'__custom_property_name__', etc. To do this without actually
adding a patternProperty to every schema, we had to override the
'additionalProperties' validator of the Draft7Validator. As we cannot
actually change out the validators used by the Draft7Validator,
we had to write our own custom validator, which we call 'SchemaValidator'.
@QMalcolm QMalcolm force-pushed the qmalcolm--parsing-returns-issues-during-for-config-validation branch from 8f4de4a to 113fe93 Compare July 25, 2022 17:04
@QMalcolm QMalcolm merged commit b1f2003 into main Jul 25, 2022
@QMalcolm QMalcolm deleted the qmalcolm--parsing-returns-issues-during-for-config-validation branch July 26, 2022 22:46