DQX Performance #1137

klippies7 · 2026-04-29T13:03:28Z

We have an interesting demand on DQX: a single Data Contract generates over 150 rules. We expected performance to take a hit, but when we compare against equivalent native Spark checks, DQX lags badly and takes excessive time to process what is a fairly modest dataset. The rules that typically cause trouble use the function "is_in_range". Are there any recommendations or guidance for dealing with this many rules?

ghanse · 2026-05-04T21:09:20Z

Maintainer

Row checks should perform well at that scale. Are you applying any dataset level checks or using any custom SQL (e.g. for SQL expression checks or filter clauses)?

0 replies

klippies7 · 2026-05-05T10:10:07Z

Author

Hi, looking at one of the contracts, this is the breakdown:

Function name	Number of instances
is_in_list	64
sql_expression	43
is_unique	2

1 reply

ghanse May 5, 2026
Maintainer

Uniqueness checks will certainly add some delay. They introduce a shuffle in the Spark plan. SQL expression checks may also introduce some delays if the expressions are very complex.

You mentioned that adding is_in_list checks has degraded the performance. Have you tested the performance without the is_unique checks? Are there any other observations or metrics you can share (e.g. driver running out-of-memory, long time spent on execution planning, etc.)?
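One way to run the comparison suggested above is to filter the check metadata before handing it to DQX, so the same workload can be timed with and without the dataset-level checks. The sketch below is plain Python over a list of check dictionaries. The dictionary layout (`check` then `function`) is assumed to match the DQX metadata format, and `count_checks_by_function`, `drop_checks_by_function`, and the sample column names are hypothetical, used only for illustration.

```python
from collections import Counter


def count_checks_by_function(checks):
    """Count how many checks use each check function."""
    return Counter(c["check"]["function"] for c in checks)


def drop_checks_by_function(checks, excluded):
    """Return the checks whose function is not in `excluded`."""
    excluded = set(excluded)
    return [c for c in checks if c["check"]["function"] not in excluded]


# Hypothetical sample metadata; the real contract generates many more rules.
checks = [
    {"criticality": "error",
     "check": {"function": "is_in_list",
               "arguments": {"column": "status", "allowed": ["A", "B"]}}},
    {"criticality": "error",
     "check": {"function": "sql_expression",
               "arguments": {"expression": "amount >= 0"}}},
    {"criticality": "error",
     "check": {"function": "is_unique",
               "arguments": {"columns": ["policy_id"]}}},
]

# Variant of the rule set without the shuffle-inducing uniqueness checks.
row_level_only = drop_checks_by_function(checks, {"is_unique"})
```

Timing one run with `checks` and one with `row_level_only` against the same input should show how much of the delay the uniqueness checks account for.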

FinnVautier · 2026-05-05T20:35:47Z

Hi there, jumping in to add some additional context to this issue.

In the sample run:

Contract fields: 291
Generated rules: 131
Input rows: 700
Workflow total time: 507.015s
DQX Rule application: 186.157s
Native PySpark rule application: 1.87s

In the current iteration of the workflow, we split the 131 rules into 21 cross-field rules that are applied by DQX and rules (120x) that can be applied using native PySpark (nullability, is_in_list, numerical ranges, etc.). When we used DQX to apply all of the rules in the contract, we were suffering from 30min+ runtimes. An example of a cross-field check is the following expression:

- type: custom
  engine: dqx
  implementation:
    name: inception_date_in_past_and_before_expiry
    criticality: error
    rule_type: conditional
    check:
      function: sql_expression
      arguments:
        expression: >-
          TRY_TO_DATE(`Inception Date`) < CURRENT_DATE()
          AND TRY_TO_DATE(`Inception Date`) < TRY_TO_DATE(`Expiry Date`)
        msg: "Inception Date must be in the past and before Expiry Date"
        columns: ["Inception Date", "Expiry Date"]

We're using DQX's sql_expression to construct these cross-field rules and would appreciate advice on whether this is best practice, or the most performant way of applying these kinds of checks with DQX.
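Since the contract columns have spaces in their names, any generated SQL expression has to quote them. Spark SQL quotes identifiers with backticks and escapes an embedded backtick by doubling it. Below is a minimal sketch of generating the cross-field expression from column names. `quote_ident` and `date_order_expression` are hypothetical helpers, not part of DQX.

```python
def quote_ident(name: str) -> str:
    """Wrap a column name in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def date_order_expression(start_col: str, end_col: str) -> str:
    """Build 'start is in the past and before end' as a Spark SQL string."""
    start = f"TRY_TO_DATE({quote_ident(start_col)})"
    end = f"TRY_TO_DATE({quote_ident(end_col)})"
    return f"{start} < CURRENT_DATE() AND {start} < {end}"


# Expression for the example rule; usable as the sql_expression argument.
expr = date_order_expression("Inception Date", "Expiry Date")
```

Generating the expressions this way keeps the quoting consistent across all cross-field rules derived from the contract.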

We apply the DQX checks with apply_checks_by_metadata. Is there anything we should watch for in terms of factors that commonly drive latency for sql_expression-heavy workloads?

Additionally, we notice significant performance degradation when writing the results of DQX to a Delta quarantine table. Are there any best practices or suggestions for writing DQX results to a Delta table?

Is there any additional instrumentation available to break down planning/compilation vs execution time inside DQX?

Please let me know if any additional metrics or information would be helpful. Thank you very much in advance!

4 replies

FinnVautier May 11, 2026

Hi there! @ghanse Is there any movement on the above investigation? Thank you!

ghanse May 11, 2026
Maintainer

This helps, thank you. Are you using the same patterns as DQX when writing data with the native PySpark implementation (e.g. writing 2 output tables to Delta)?

The Spark UI or Query Profile will provide more details on where the execution is taking time. Can you review those for both the DQX and native queries and look for anything that stands out?

FinnVautier May 11, 2026

Yes, that's correct: we write to a quarantine Delta table and a quarantine summary Delta table.

When looking at the Query Profile, I am only able to see the following:

Time	Command	Duration
03:12:01	schema_id distinct check	4.21s
03:12:06	mapping_df collect	586ms
03:12:08	input row count	2.34s
03:15:32	summary metrics collect	2.50m
03:18:05	quarantine_df write (append)	2.61m
03:20:46	summary_df write (append)	13.68s

Total visible duration: ~5m 23s (excludes any unlogged DQX processing time). Does the DQX rule application run internally? It doesn't appear as a separate named query in the query profile.
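To see where the visible time goes, the durations in the table above can be converted to seconds and ranked. The sketch below is plain Python over the figures as displayed. Because those figures are rounded, the computed total is approximate and will not match the quoted total exactly. `to_seconds` is a hypothetical helper.

```python
def to_seconds(text: str) -> float:
    """Convert a displayed duration such as '586ms', '4.21s' or '2.50m'."""
    text = text.strip()
    if text.endswith("ms"):  # must be tested before the plain 's' suffix
        return float(text[:-2]) / 1000.0
    if text.endswith("m"):
        return float(text[:-1]) * 60.0
    if text.endswith("s"):
        return float(text[:-1])
    raise ValueError(f"unrecognised duration: {text!r}")


# Commands and durations copied from the query profile table.
profile = [
    ("schema_id distinct check", "4.21s"),
    ("mapping_df collect", "586ms"),
    ("input row count", "2.34s"),
    ("summary metrics collect", "2.50m"),
    ("quarantine_df write (append)", "2.61m"),
    ("summary_df write (append)", "13.68s"),
]

seconds = {command: to_seconds(duration) for command, duration in profile}
total = sum(seconds.values())
shares = {command: value / total for command, value in seconds.items()}

# The two commands that dominate the visible runtime.
slowest = sorted(seconds, key=seconds.get, reverse=True)[:2]
```

On these figures, the summary metrics collect and the quarantine write together account for roughly 94% of the visible time, so those two actions are the ones worth inspecting first.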

The rule application step in the stdout logs takes 197s for the 21 rules that are routed through the DQX engine. Is that expected for sql_expression type rules?

Are there any best practices for writing the DQX outputs to quarantine tables? This also seems to be a bottleneck in our workflow. We are processing wide tables with ~300 columns.

ghanse May 28, 2026
Maintainer

The rules are run lazily when a Spark action is triggered on the checked DataFrame or any downstream DataFrame that selects from the checked DataFrame.

Identifying the exact root cause would likely require a review of the query profile data. I would check for any complex expressions (e.g. distinct clauses, aggregation, joins). If not, test the workload without summary metrics. These aggregate data and may introduce moderate delay for very large datasets.
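Because the checks are evaluated lazily, each downstream action (the summary metrics collect, the quarantine write, the summary write) may recompute the checked DataFrame from scratch. One thing to try, offered as a sketch rather than a confirmed fix for this workload, is to cache the checked DataFrame and force it with a single timed action before the writes. `timed` and `materialize_once` are hypothetical helpers; `cache()` and `count()` are the standard PySpark DataFrame methods, and the argument is assumed to be the DataFrame returned by DQX.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink):
    """Record the wall-clock seconds spent inside the block in sink[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[label] = time.perf_counter() - start


def materialize_once(checked_df):
    """Cache the checked DataFrame and evaluate it with one action.

    Assumes `checked_df` is a PySpark DataFrame. Later actions on the
    returned DataFrame can then read from the cache instead of
    re-running the rule expressions.
    """
    cached = checked_df.cache()
    cached.count()  # action: triggers the lazy rule evaluation
    return cached
```

Wrapping the call as `with timed("rule application", timings): checked = materialize_once(checked_df)` attributes the rule evaluation cost to its own step, so the later writes can be measured separately.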
