
Fix dataset timestamps #190

Merged 12 commits into release/1.23.0 on Jul 19, 2024
Conversation

tillprochaska (Contributor) commented Jun 28, 2024

I noticed the same issue reported in alephdata/aleph#3787 while working on alephdata/aleph#3788. The current implementation updates the start time whenever a new task is added or checked out, regardless of whether the dataset is active.

This PR changes the implementation of the dataset status data stored in Redis as follows:

  • The implementation of last_update remains unchanged.
  • start_time is set to the current time only if it isn’t set yet, i.e. if the task being added is the first task added to a dataset that was inactive before.
  • I’ve removed end_time (see fd29298 for reasoning), but I’m happy to add it back if that is preferred.
  • Once a dataset is done (i.e. there are no more running/pending tasks), all status data is removed from Redis. Previously, some data was reset (e.g. the number of finished tasks) and some other data was retained (e.g. the start time).
  • I’ve added a bunch of tests for the dataset status data stored in Redis. I added the time-machine package as a dev dependency for this purpose. In my experience, manually mocking dates/times in tests is quite brittle, and I’ve successfully used this package in multiple projects (including in this Aleph PR); the sketch below shows the pattern.
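
For readers unfamiliar with time-machine, here is a minimal, self-contained sketch of the testing pattern. It is not the PR's actual test code, just an illustration of how the package pins and advances the clock:

```python
import datetime as dt

import time_machine


def test_time_machine_pins_the_clock():
    first = dt.datetime(2024, 7, 1, 12, 0, tzinfo=dt.timezone.utc)
    with time_machine.travel(first, tick=False):
        # Any code under test that reads the clock sees the pinned time.
        assert dt.datetime.now(dt.timezone.utc) == first

    # Travel again to simulate a later event (e.g. a second task being
    # added an hour later) without actually waiting.
    with time_machine.travel(first + dt.timedelta(hours=1), tick=False):
        assert dt.datetime.now(dt.timezone.utc) == first + dt.timedelta(hours=1)
```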

Just a note for future reference, as we discussed this but it isn’t documented anywhere: parts of the status tracking in Redis are probably prone to race conditions when multiple workers are running. This PR doesn’t change that. If this turns out to be an issue in practice, we might want to replace the logic with operations that Redis can execute atomically.
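
To make the race concrete, this standalone redis-py snippet (the key name is invented for illustration) shows the lost-update pattern and the atomic operation that avoids it:

```python
import redis

r = redis.Redis()

# Racy read-modify-write: two workers can both read 5 and both write 6,
# silently losing one of the increments.
count = int(r.get("dataset:finished") or 0)
r.set("dataset:finished", count + 1)

# Atomic alternative: Redis executes INCR as a single server-side
# operation, so concurrent workers cannot lose updates.
r.incr("dataset:finished")
```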

tillprochaska force-pushed the fix/dataset-timestamps branch 5 times, most recently from 0ecae5f to 054f5f2 on July 1, 2024 14:08
tillprochaska (Contributor, Author) commented Jul 1, 2024

I noticed that two methods aren’t called anywhere (neither in servicelayer nor in Aleph). I was trying to add test coverage to ensure my changes don’t break their expected behavior, but now I’m wondering whether these methods can simply be removed:

  • Dataset.cleanup_dataset_status
  • Dataset.remove_tasks

tillprochaska force-pushed the fix/dataset-timestamps branch 2 times, most recently from 2b45674 to 216acbf on July 1, 2024 16:38
```diff
@@ -275,9 +248,8 @@ def checkout_task(self, task_id, stage):
         pipe.srem(self.pending_key, task_id)
         pipe.sadd(self.running_key, task_id)
-        pipe.set(self.start_key, pack_now())
+        pipe.set(self.start_key, pack_now(), nx=True)
```
tillprochaska (Contributor, Author) commented on the diff:
I’m not entirely sure setting the start_time timestamp here is necessary in the first place, as a worker shouldn’t be able to check out a task without add_task having been executed before. (The same probably also applies to pipe.sadd(self.key, self.name).)

I didn’t change it because the primary goal of this PR is to fix the timestamps, not to refactor, but I’d still be interested to understand whether there’s a legitimate case where this is relevant or whether it’s merely a precaution.
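
For context on the nx=True flag under discussion: Redis SET with NX only writes when the key does not exist yet, which is what pins start_time to the first write. A standalone redis-py illustration (the key name is invented):

```python
import redis

r = redis.Redis()
r.delete("dataset:start_time")

r.set("dataset:start_time", "2024-07-01T12:00:00", nx=True)  # key absent: written
r.set("dataset:start_time", "2024-07-01T13:00:00", nx=True)  # key exists: no-op

assert r.get("dataset:start_time") == b"2024-07-01T12:00:00"
```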

stchris (Contributor) replied:

It seems like some sort of defensive programming approach, which I can't say I dislike :)

tillprochaska (Contributor, Author) commented:

@stchris @catileptic This isn’t 100% done, but I added two comments with questions and would like to know whether you agree with the approach (primarily removing end_time and resetting all dataset status data after all tasks have been processed).

tillprochaska force-pushed the fix/dataset-timestamps branch 2 times, most recently from 67a63d2 to c3372dd on July 3, 2024 15:50
tillprochaska marked this pull request as ready for review on July 4, 2024 09:32
tillprochaska (Contributor, Author) commented:

I’m still a little unsure about the two questions above, but apart from that, this is ready.

stchris (Contributor) left a review:

Thanks for this, @tillprochaska! I also appreciate the proper cleanup and the concise test cases.


```diff
@@ -364,6 +322,29 @@ def is_task_tracked(self, task: Task):
         return tracked

+    def flush_status(self, pipe):
```
stchris (Contributor) commented on the diff:

Thanks, extracting helped the readability of the code tremendously! 🥇
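
For orientation, a minimal sketch of what the extracted flush_status helper might look like, based only on the key names visible in the diff hunks above; the last_update_key and finished_key names and the exact set of keys are assumptions, not the PR's actual code:

```python
def flush_status(self, pipe):
    """Remove all status data for this dataset from Redis once it is done."""
    # Unregister the dataset from the set of active datasets.
    pipe.srem(self.key, self.name)
    # Drop timestamps and task-tracking keys (assumed names).
    pipe.delete(
        self.start_key,
        self.last_update_key,
        self.finished_key,
        self.pending_key,
        self.running_key,
    )
```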

```python
        collection_id="1",
    )

    # Adding a task updates `start_time` and `last_update`
```
stchris (Contributor) commented on the diff:

I appreciate your comments in this non-trivial test!

tillprochaska (Contributor, Author) commented:

Thanks for the review, @stchris! Any thoughts about #190 (comment)?

stchris (Contributor) commented Jul 16, 2024:

Sorry, I totally forgot about these questions again, @tillprochaska.

* `Dataset.cleanup_dataset_status`

  This one is used by the Aleph worker in the periodic task implementation: https://github.com/alephdata/aleph/blob/7f26ac7bbf52864f048d58cd48b8d5f5784d29a5/aleph/worker.py#L141

* `Dataset.remove_tasks`

  Couldn't find any use of this either, I think this can go!

We currently do not retain information about inactive datasets (i.e. datasets that do not have any more pending or running tasks). For this reason, the "end_time" timestamp is currently of no use, as it would never be displayed to users anyway.

While there are some plans around providing more detail about the results of processed tasks as well as completed jobs, it is unclear where this data will be stored and what the implementation will look like. As it is easy enough to add this information back (< 10 LOC), I’ve removed it for now.

tillprochaska (Contributor, Author) commented Jul 17, 2024:

@stchris Thanks, I’ve rebased the PR and applied changes based on your answers.

> Couldn't find any use of this either, I think this can go!

👍 Removed it.

> This one is used by the Aleph worker in the periodic task implementation: https://github.com/alephdata/aleph/blob/7f26ac7bbf52864f048d58cd48b8d5f5784d29a5/aleph/worker.py#L141

Ah, thanks, I overlooked that. This makes me wonder whether it is required at all? cleanup_dataset_status and flush_status mostly do the same thing. The difference is that flush_status is called immediately after a task is executed or a dataset is cancelled, while cleanup_dataset_status is called periodically; but at the time it is called, there should never be any dataset that needs to be cleaned up.

(I’m asking because I was trying to come up with a test case for this method, but wasn’t able to reproduce any situation where it isn’t just a no-op.)

stchris (Contributor) commented Jul 17, 2024:

> > This one is used by the Aleph worker in the periodic task implementation: https://github.com/alephdata/aleph/blob/7f26ac7bbf52864f048d58cd48b8d5f5784d29a5/aleph/worker.py#L141
>
> Ah, thanks, I overlooked that. This makes me wonder whether it is required at all? cleanup_dataset_status and flush_status mostly do the same thing. The difference is that flush_status is called immediately after a task is executed or a dataset is cancelled, while cleanup_dataset_status is called periodically; but at the time it is called, there should never be any dataset that needs to be cleaned up.
>
> (I’m asking because I was trying to come up with a test case for this method, but wasn’t able to reproduce any situation where it isn’t just a no-op.)

I didn't check the logic flow, but it seems to me like you are right. There should not be a problem with you removing this code then ™️

tillprochaska and others added 4 commits July 19, 2024 11:44
`cleanup_dataset_status` iterates over all active datasets and removes their status information once they are done (i.e. there are no more pending or running tasks). It is called periodically by the Aleph worker.

We already do this for individual datasets whenever a task from the dataset is done or the dataset is cancelled as a whole, which makes `cleanup_dataset_status` redundant.

Also see: #190 (comment)
catileptic merged commit eedfb94 into release/1.23.0 on Jul 19, 2024 (1 check passed)