Add dataset event timestamp to dataset dag run queue #26376

dstandish · 2022-09-13T22:58:56Z

This allows us to derive the dag run logical date from the event timestamp rather than the queue record created_at, which allows us to be slightly more precise when associating dag runs with dataset events.

Also, in the first commit, I change the signature a little bit, which reduces (a tiny bit) some duplication and makes it (again a tiny bit) more explicit what exactly these methods require, and therefore clearer what they actually do.

jedcunningham · 2022-09-14T02:27:54Z

Overall looks okay. I do wonder if instead we should pass down the event itself instead. I could see custom manager classes wanting the dataset itself, so passing the event makes the api pretty clean?

dstandish · 2022-09-14T02:37:01Z

Overall looks okay. I do wonder if instead we should pass down the event itself instead. I could see custom manager classes wanting the dataset itself, so passing the event makes the api pretty clean?

The thing about that, though, is that these methods are all private anyway, not part of the public API... So users should not implement them anyway right?

jedcunningham · 2022-09-14T02:57:00Z

That's fair. I overlooked at they override register_dataset_change, not the others.

jedcunningham · 2022-09-14T03:02:21Z

I wonder if the future annotations changes are going to cause heartache for #26290? Might be worth removing those here?

dstandish · 2022-09-14T05:25:32Z

I wonder if the future annotations changes are going to cause heartache for #26290? Might be worth removing those here?

Oh yeah I thought everything had been merged.... I can undo that stuff tomorrow

jedcunningham · 2022-09-14T14:51:09Z

It's merged now, so just resolve the conflicts and we should be good to go.

blag · 2022-09-14T17:48:39Z

Just to color in my thoughts on the private methods in the dataset manager - I made them private only because I was trying to keep the dataset manager public interface as small as possible, and because I'm very cautious when it comes to API promises. I wouldn't fight if others want to make them public methods, and if we don't have any negative feedback about them by 2.5, I'd be fine with publicizing them (and probably making the private methods call their public equivalents), whatever is reasonable.

blag · 2022-09-14T18:30:16Z

My personal preference would be to pass around the DatasetModel and DatasetEvent objects directly, since that just seems cleaner to me. If/when we publicize more/all of the dataset manager methods that might also make it easier on plugin authors to implement those methods.

At first glance it seems like it would be simpler to just add a foreign key to DDRQ that points to the dataset event row that caused it, but since we only really care about the timestamp on the DDRQ in the scheduler, we just store and use the dataset event timestamp (copied through to the DDRQ). Is that correct?

dstandish · 2022-09-14T20:00:40Z

My personal preference would be to pass around the DatasetModel and DatasetEvent objects directly

ultimately, they are still private, so, if someone wants to change it back, well, there's no reason they couldn't. it's sort of a cosmetic decision.

but really, the main reason this came up was.... and i forgot to mention this earlier... was because i was performanrce testing the insert logic, and, (1) doing it this way meant not having to call the sqlalchemy relationship attr repeatedly, and (2) it also meant that i didn't have to set up the upstream-downstream relationships on the dataset ... i could just provide whatever IDs i wanted. so it made it marginally easier to test when structured this way, and that was the the main reason i went ahead and changed it. ultimately it's not of great consequence. it's a difference of one argument in the signature.

At first glance it seems like it would be simpler to just add a foreign key to DDRQ that points to the dataset event row that caused it, but since we only really care about the timestamp on the DDRQ in the scheduler, we just store and use the dataset event timestamp (copied through to the DDRQ). Is that correct?

right so... yeah ... keying DDRQ to the event that caused it... it does kindof make sense. but the thing is .... it could ultimately be multiple events that end up getting mapped to the same dag run. so ultimately what we do is.... so if dag 1 depends on datasets A and B, then if A is updated 5 times then finally B is updated once.... all of those A updates and the one B update are associated with the dag run, and the way that we figure that out is ... all the A events came before the B one...
you know... it's an ephemeral table so there isn't a need really for a persistent ongoing association between the two (event and queue) and... i guess including just the value (not the key) saves a join in a query that needs to be fast.

ideally... i'd prefer it if we could avoid timestamp comparisons completely and have the event-dag-run association be determined even more precisely but when i was thinking about how to do it, it seemed a bit fraught due to race conditions but if you have ideas i'm happy to explore.

blag · 2022-09-14T21:05:48Z

Ease of testing is a perfectly valid and good reason. You've convinced me. 😄

And yeah, I was thinking the dataset event would be the last of all of the dependencies that was updated but that would be a bit of an incomplete picture.

I'm afraid I don't have a solution that implements a race-free, complete, and accurate picture of the dependencies/events/dag run relationships.

uranusjr · 2022-09-23T07:26:36Z

airflow/jobs/scheduler_job.py

                run_id = dag.timetable.generate_run_id(
                    run_type=DagRunType.DATASET_TRIGGERED,
-                    logical_date=exec_date,
+                    logical_date=last_event_timestamp,  # type: ignore


Hmm what is wrong with the type of this?

it thinks it wants DateTime
but, what we actually get when we read timestamp values from the database is datetime
i think we've encountered this recently, and it was a bit of a rabbit hole trying to fix it... but fundamentally i think that those timetable functions that say DateTime are more often / more likely / maybe always getting datetime

Hopefully SQLAlchemy will improve this in 2.0. (IIRC they have plan to provide better typing support for ORM models.)

Makes it more explicit what exactly these methods require, and therefore clearer what they actually do.

This allows us to derive the dag run logical date from the event timestamp rather than the queue record created_at, which allows us to be slightly more precise when associating dag runs with dataset events.

github-actions · 2022-11-08T00:15:07Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

dstandish requested review from kaxil, potiuk, XD-DENG and ashb as code owners September 13, 2022 22:58

boring-cyborg bot added area:Scheduler Scheduler or dag parsing Issues kind:documentation labels Sep 13, 2022

jedcunningham added the AIP-48 Data-aware Scheduling label Sep 14, 2022

jedcunningham approved these changes Sep 14, 2022

View reviewed changes

dstandish force-pushed the dataset-event-dag-run-tightening branch from ecaae6b to 78774a8 Compare September 14, 2022 16:16

dstandish force-pushed the dataset-event-dag-run-tightening branch from be1fa68 to fad526e Compare September 14, 2022 20:24

dstandish force-pushed the dataset-event-dag-run-tightening branch from 77b45d0 to 564774b Compare September 15, 2022 18:01

blag approved these changes Sep 15, 2022

View reviewed changes

dstandish force-pushed the dataset-event-dag-run-tightening branch from 564774b to 571a6bf Compare September 22, 2022 04:26

uranusjr reviewed Sep 23, 2022

View reviewed changes

dstandish added 6 commits September 23, 2022 00:38

Don't pass the dataset object around

fbaaa75

Makes it more explicit what exactly these methods require, and therefore clearer what they actually do.

Add dataset event timestamp to dataset dag run queue

b168679

This allows us to derive the dag run logical date from the event timestamp rather than the queue record created_at, which allows us to be slightly more precise when associating dag runs with dataset events.

move to new migration

fbfc8de

move to 2.4.2

6796ecc

put event_timestamp at the end

e564b71

use batch op

faf6482

dstandish force-pushed the dataset-event-dag-run-tightening branch from 74da81e to faf6482 Compare September 23, 2022 07:39

github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Nov 8, 2022

github-actions bot closed this Nov 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add dataset event timestamp to dataset dag run queue #26376

Add dataset event timestamp to dataset dag run queue #26376

dstandish commented Sep 13, 2022

jedcunningham commented Sep 14, 2022

dstandish commented Sep 14, 2022

jedcunningham commented Sep 14, 2022

jedcunningham commented Sep 14, 2022

dstandish commented Sep 14, 2022

jedcunningham commented Sep 14, 2022

blag commented Sep 14, 2022

blag commented Sep 14, 2022

dstandish commented Sep 14, 2022 •

edited

Loading

blag commented Sep 14, 2022

uranusjr Sep 23, 2022

dstandish Sep 23, 2022 •

edited

Loading

uranusjr Sep 23, 2022

github-actions bot commented Nov 8, 2022

Add dataset event timestamp to dataset dag run queue #26376

Add dataset event timestamp to dataset dag run queue #26376

Conversation

dstandish commented Sep 13, 2022

jedcunningham commented Sep 14, 2022

dstandish commented Sep 14, 2022

jedcunningham commented Sep 14, 2022

jedcunningham commented Sep 14, 2022

dstandish commented Sep 14, 2022

jedcunningham commented Sep 14, 2022

blag commented Sep 14, 2022

blag commented Sep 14, 2022

dstandish commented Sep 14, 2022 • edited Loading

blag commented Sep 14, 2022

uranusjr Sep 23, 2022

Choose a reason for hiding this comment

dstandish Sep 23, 2022 • edited Loading

Choose a reason for hiding this comment

uranusjr Sep 23, 2022

Choose a reason for hiding this comment

github-actions bot commented Nov 8, 2022

dstandish commented Sep 14, 2022 •

edited

Loading

dstandish Sep 23, 2022 •

edited

Loading