Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding ge bigquery operator #6

Merged
merged 4 commits into from
Jan 29, 2021

Conversation

brian-lavery
Copy link
Contributor

Hi @spbail. This pull request is adding the GE Bigquery operator that extends the GE base operator. The only change to the base operator is that the execute method returns the results at the end so the bigquery operator can examine the results to determine success or failure and and use them to get a data docs link. I did a functional test and it worked fine using GE 0.13.1 and Cloud Composer version composer-1.11.3-airflow-1.10.6. I'll still working on unit tests.

…ly change to the base operator is that the execute method returns the results at the end so the bigquery operator can examine the results and use them to get a data docs link
@spbail
Copy link
Contributor

spbail commented Dec 15, 2020

Thanks so much for resubmitting @brian-lavery! I'll take a look this week!

@spbail
Copy link
Contributor

spbail commented Dec 15, 2020

Just one immediate question/suggestion: Do you want to try out using the convenience methods to create a Data Context which sets some defaults for the GCS backend? Not sure if this covers all your parameters, but maybe take a look? https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_data_contexts/how_to_instantiate_a_data_context_without_a_yml_file.html

@brian-lavery
Copy link
Contributor Author

Hi @spbail . I made the change to use the convenience method. I tested it out and it worked. Happy New Year!

Just one immediate question/suggestion: Do you want to try out using the convenience methods to create a Data Context which sets some defaults for the GCS backend? Not sure if this covers all your parameters, but maybe take a look? https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_data_contexts/how_to_instantiate_a_data_context_without_a_yml_file.html

Copy link
Contributor

@spbail spbail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome, love seeing how this is coming together! I just have a few remarks which would be good to address/explain, but none of them are absolutely critical.

:type datadocs_domain: str
:param email_to: Email address to receive any alerts when expectations are not met.
:type email_to: str
:param fail_if_expectations_not_met: Fail the Airflow task if expectations are not met? Defaults to True.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parent class uses "fail_task_on_validation_failure" for this, which would be good to stick with so the behavior is similar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Sam. This change has been made.

}

if self.validation_type == GreatExpectationsValidations.SQL.value:
batch_kwargs["query"] = self.validation_type_input
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thought here: If you look at the parent class, it allows different types of mutually exclusive parameters to specify what asset to validate. I find that a little nicer than specifying the validation_type. So the possible parameters for the class could be
(...., data_asset_name, query, table, ...) and then check in the code that only one of table or query is populated. See line 67 in great_expectations.py

This means you're doing more work behind the scenes, but a user only needs to specify either the table or query and doesn't have to worry about the validation_type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Sam. This change has been made.

data_context_config = self.create_data_context_config()
data_context = BaseDataContext(project_config=data_context_config)
batch_kwargs = self.get_batch_kwargs()
super().__init__(data_context=data_context, batch_kwargs=batch_kwargs,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See note below, I'd add a comment to explain why you're not passing through the fail_task_on_validation_failure/fail_task_if_expectations_not_met argument as-is.

Maybe it's a bit ugly, but do you think it's possible to stick with "fail_task_on_validation_failure", so if anyone looks at the parent class and this one it's clear they're the same thing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Sam. This change has been made.

site_name='gcs_site')[0][
'site_url']
self.log.info("Data docs url is: %s", data_docs_url)
if results["success"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the logging is already done in the parent class, so probably not needed here.

I'm assuming the reason for not passing through the "fail on validation failure" is that you want the alert email before raising the error right?

I think this works for now, but I'd put in an explicit comment that you're overwriting the parent class behavior somewhat.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Sam. This change has been made.

@spbail
Copy link
Contributor

spbail commented Jan 7, 2021

@cla-bot check

@spbail
Copy link
Contributor

spbail commented Jan 7, 2021

@brian-lavery Do you mind signing the CLA, see link in the comment above ^

@kyleaton
Copy link
Contributor

@cla-bot check

@kyleaton
Copy link
Contributor

@spbail Discovered this preparing mugs. @brian-lavery signed :)

@spbail
Copy link
Contributor

spbail commented Jan 29, 2021

Just tested it out locally, and it works! Yay! I didn't test out the email configuration, but the GCS stores and BigQuery datasource work well. I think we're good to ship this! I'll add an example in another commit, and then I can cut a release later.

Copy link
Contributor

@spbail spbail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works \o/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants