Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dataset event timestamp to dataset dag run queue #26376

Closed

Conversation

dstandish
Copy link
Contributor

This allows us to derive the dag run logical date from the event timestamp rather than the queue record created_at, which allows us to be slightly more precise when associating dag runs with dataset events.

Also, in the first commit, I change the signature a little bit, which reduces (a tiny bit) some duplication and makes it (again a tiny bit) more explicit what exactly these methods require, and therefore clearer what they actually do.

@boring-cyborg boring-cyborg bot added area:Scheduler Scheduler or dag parsing Issues kind:documentation labels Sep 13, 2022
@jedcunningham
Copy link
Member

Overall looks okay. I do wonder if instead we should pass down the event itself instead. I could see custom manager classes wanting the dataset itself, so passing the event makes the api pretty clean?

@jedcunningham jedcunningham added the AIP-48 Data-aware Scheduling label Sep 14, 2022
@dstandish
Copy link
Contributor Author

Overall looks okay. I do wonder if instead we should pass down the event itself instead. I could see custom manager classes wanting the dataset itself, so passing the event makes the api pretty clean?

The thing about that, though, is that these methods are all private anyway, not part of the public API... So users should not implement them anyway right?

@jedcunningham
Copy link
Member

That's fair. I overlooked at they override register_dataset_change, not the others.

@jedcunningham
Copy link
Member

I wonder if the future annotations changes are going to cause heartache for #26290? Might be worth removing those here?

@dstandish
Copy link
Contributor Author

I wonder if the future annotations changes are going to cause heartache for #26290? Might be worth removing those here?

Oh yeah I thought everything had been merged.... I can undo that stuff tomorrow

@jedcunningham
Copy link
Member

It's merged now, so just resolve the conflicts and we should be good to go.

@dstandish dstandish force-pushed the dataset-event-dag-run-tightening branch from ecaae6b to 78774a8 Compare September 14, 2022 16:16
@blag
Copy link
Contributor

blag commented Sep 14, 2022

Just to color in my thoughts on the private methods in the dataset manager - I made them private only because I was trying to keep the dataset manager public interface as small as possible, and because I'm very cautious when it comes to API promises. I wouldn't fight if others want to make them public methods, and if we don't have any negative feedback about them by 2.5, I'd be fine with publicizing them (and probably making the private methods call their public equivalents), whatever is reasonable.

@blag
Copy link
Contributor

blag commented Sep 14, 2022

My personal preference would be to pass around the DatasetModel and DatasetEvent objects directly, since that just seems cleaner to me. If/when we publicize more/all of the dataset manager methods that might also make it easier on plugin authors to implement those methods.

At first glance it seems like it would be simpler to just add a foreign key to DDRQ that points to the dataset event row that caused it, but since we only really care about the timestamp on the DDRQ in the scheduler, we just store and use the dataset event timestamp (copied through to the DDRQ). Is that correct?

@dstandish
Copy link
Contributor Author

dstandish commented Sep 14, 2022

My personal preference would be to pass around the DatasetModel and DatasetEvent objects directly

ultimately, they are still private, so, if someone wants to change it back, well, there's no reason they couldn't. it's sort of a cosmetic decision.

but really, the main reason this came up was.... and i forgot to mention this earlier... was because i was performanrce testing the insert logic, and, (1) doing it this way meant not having to call the sqlalchemy relationship attr repeatedly, and (2) it also meant that i didn't have to set up the upstream-downstream relationships on the dataset ... i could just provide whatever IDs i wanted. so it made it marginally easier to test when structured this way, and that was the the main reason i went ahead and changed it. ultimately it's not of great consequence. it's a difference of one argument in the signature.

At first glance it seems like it would be simpler to just add a foreign key to DDRQ that points to the dataset event row that caused it, but since we only really care about the timestamp on the DDRQ in the scheduler, we just store and use the dataset event timestamp (copied through to the DDRQ). Is that correct?

right so... yeah ... keying DDRQ to the event that caused it... it does kindof make sense. but the thing is .... it could ultimately be multiple events that end up getting mapped to the same dag run. so ultimately what we do is.... so if dag 1 depends on datasets A and B, then if A is updated 5 times then finally B is updated once.... all of those A updates and the one B update are associated with the dag run, and the way that we figure that out is ... all the A events came before the B one...
you know... it's an ephemeral table so there isn't a need really for a persistent ongoing association between the two (event and queue) and... i guess including just the value (not the key) saves a join in a query that needs to be fast.

ideally... i'd prefer it if we could avoid timestamp comparisons completely and have the event-dag-run association be determined even more precisely but when i was thinking about how to do it, it seemed a bit fraught due to race conditions but if you have ideas i'm happy to explore.

@dstandish dstandish force-pushed the dataset-event-dag-run-tightening branch from be1fa68 to fad526e Compare September 14, 2022 20:24
@blag
Copy link
Contributor

blag commented Sep 14, 2022

Ease of testing is a perfectly valid and good reason. You've convinced me. 😄

And yeah, I was thinking the dataset event would be the last of all of the dependencies that was updated but that would be a bit of an incomplete picture.

I'm afraid I don't have a solution that implements a race-free, complete, and accurate picture of the dependencies/events/dag run relationships.

@dstandish dstandish force-pushed the dataset-event-dag-run-tightening branch from 77b45d0 to 564774b Compare September 15, 2022 18:01
@dstandish dstandish force-pushed the dataset-event-dag-run-tightening branch from 564774b to 571a6bf Compare September 22, 2022 04:26
run_id = dag.timetable.generate_run_id(
run_type=DagRunType.DATASET_TRIGGERED,
logical_date=exec_date,
logical_date=last_event_timestamp, # type: ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm what is wrong with the type of this?

Copy link
Contributor Author

@dstandish dstandish Sep 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it thinks it wants DateTime
but, what we actually get when we read timestamp values from the database is datetime
i think we've encountered this recently, and it was a bit of a rabbit hole trying to fix it... but fundamentally i think that those timetable functions that say DateTime are more often / more likely / maybe always getting datetime

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully SQLAlchemy will improve this in 2.0. (IIRC they have plan to provide better typing support for ORM models.)

Makes it more explicit what exactly these methods require, and therefore clearer what they actually do.
This allows us to derive the dag run logical date from the event timestamp rather than the queue record created_at, which allows us to be slightly more precise when associating dag runs with dataset events.
@dstandish dstandish force-pushed the dataset-event-dag-run-tightening branch from 74da81e to faf6482 Compare September 23, 2022 07:39
@github-actions
Copy link

github-actions bot commented Nov 8, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Nov 8, 2022
@github-actions github-actions bot closed this Nov 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AIP-48 Data-aware Scheduling area:Scheduler Scheduler or dag parsing Issues kind:documentation stale Stale PRs per the .github/workflows/stale.yml policy file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants