
Fix dataset timestamps #190

Merged 12 commits into release/1.23.0 on Jul 19, 2024
Conversation

tillprochaska (Contributor) commented Jun 28, 2024

I noticed the same issue reported in alephdata/aleph#3787 while working on alephdata/aleph#3788. The current implementation updates the start time whenever a new task is added or checked out, regardless of whether the dataset is active.

This PR changes the implementation of the dataset status data stored in Redis as follows:

  • The implementation of last_update remains unchanged.
  • start_time is set to the current time only if it isn’t set yet, i.e. if the task being added is the first task added to a dataset that was inactive before.
  • I’ve removed end_time (see fd29298 for reasoning), but I’m happy to add it back if that is preferred.
  • Once a dataset is done (i.e. there are no more running/pending tasks), all status data is removed from Redis. Previously, some data was reset (e.g. the number of finished tasks) and some other data was retained (e.g. the start time).
  • I’ve added a bunch of tests for the dataset status data stored in Redis. I added the time-machine package as a dev dependency for this purpose. In my experience, manually mocking dates/times in tests is quite brittle, and I’ve successfully used this package in multiple projects (including in this Aleph PR); the sketch below shows the pattern.
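
For readers unfamiliar with time-machine, here is a minimal, self-contained sketch of the testing pattern. It is not the PR's actual test code, just an illustration of how the package pins and advances the clock:

```python
import datetime as dt

import time_machine


def test_time_machine_pins_the_clock():
    first = dt.datetime(2024, 7, 1, 12, 0, tzinfo=dt.timezone.utc)
    with time_machine.travel(first, tick=False):
        # Any code under test that reads the clock sees the pinned time.
        assert dt.datetime.now(dt.timezone.utc) == first

    # Travel again to simulate a later event (e.g. a second task being
    # added an hour later) without actually waiting.
    with time_machine.travel(first + dt.timedelta(hours=1), tick=False):
        assert dt.datetime.now(dt.timezone.utc) == first + dt.timedelta(hours=1)
```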

Just a note for future reference, as we discussed this but it isn’t documented anywhere: parts of the status tracking in Redis are probably prone to race conditions when multiple workers are running. This PR doesn’t change that. If this turns out to be an issue in practice, we might want to replace the logic with operations that Redis can execute atomically.
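
To make the race concrete, this standalone redis-py snippet (the key name is invented for illustration) shows the lost-update pattern and the atomic operation that avoids it:

```python
import redis

r = redis.Redis()

# Racy read-modify-write: two workers can both read 5 and both write 6,
# silently losing one of the increments.
count = int(r.get("dataset:finished") or 0)
r.set("dataset:finished", count + 1)

# Atomic alternative: Redis executes INCR as a single server-side
# operation, so concurrent workers cannot lose updates.
r.incr("dataset:finished")
```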

tillprochaska force-pushed the fix/dataset-timestamps branch 5 times, most recently from 0ecae5f to 054f5f2 on July 1, 2024 14:08
tillprochaska (Contributor, Author) commented Jul 1, 2024

I noticed that two methods aren’t called anywhere (neither in servicelayer nor in Aleph). I was trying to add test coverage to ensure my changes don’t break their expected behavior, but now I’m wondering whether these methods can simply be removed:

  • Dataset.cleanup_dataset_status
  • Dataset.remove_tasks

tillprochaska force-pushed the fix/dataset-timestamps branch 2 times, most recently from 2b45674 to 216acbf on July 1, 2024 16:38
```diff
@@ -275,9 +248,8 @@ def checkout_task(self, task_id, stage):
         pipe.srem(self.pending_key, task_id)
         pipe.sadd(self.running_key, task_id)
-        pipe.set(self.start_key, pack_now())
+        pipe.set(self.start_key, pack_now(), nx=True)
```
tillprochaska (Contributor, Author) commented on the diff:
I’m not entirely sure setting the start_time timestamp here is necessary in the first place, as a worker shouldn’t be able to check out a task without add_task having been executed before. (The same probably also applies to pipe.sadd(self.key, self.name).)

I didn’t change it because the primary goal of this PR is to fix the timestamps, not to refactor, but I’d still be interested to understand whether there’s a legitimate case where this is relevant or whether it’s merely a precaution.
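
For context on the nx=True flag under discussion: Redis SET with NX only writes when the key does not exist yet, which is what pins start_time to the first write. A standalone redis-py illustration (the key name is invented):

```python
import redis

r = redis.Redis()
r.delete("dataset:start_time")

r.set("dataset:start_time", "2024-07-01T12:00:00", nx=True)  # key absent: written
r.set("dataset:start_time", "2024-07-01T13:00:00", nx=True)  # key exists: no-op

assert r.get("dataset:start_time") == b"2024-07-01T12:00:00"
```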

stchris (Contributor) replied:

It seems like some sort of defensive programming approach, which I can't say I dislike :)

tillprochaska (Contributor, Author) commented:

@stchris @catileptic This isn’t 100% done, but I added two comments with questions and would like to know whether you agree with the approach (primarily removing end_time and resetting all dataset status data after all tasks have been processed).

tillprochaska force-pushed the fix/dataset-timestamps branch 2 times, most recently from 67a63d2 to c3372dd on July 3, 2024 15:50
tillprochaska marked this pull request as ready for review on July 4, 2024 09:32
tillprochaska (Contributor, Author) commented:

I’m still a little unsure about the two questions above, but apart from that, this is ready.

stchris (Contributor) left a review:

Thanks for this, @tillprochaska! I also appreciate the proper cleanup and the concise test cases.


```diff
@@ -364,6 +322,29 @@ def is_task_tracked(self, task: Task):
         return tracked

+    def flush_status(self, pipe):
```
stchris (Contributor) commented on the diff:

Thanks, extracting helped the readability of the code tremendously! 🥇
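
For orientation, a minimal sketch of what the extracted flush_status helper might look like, based only on the key names visible in the diff hunks above; the last_update_key and finished_key names and the exact set of keys are assumptions, not the PR's actual code:

```python
def flush_status(self, pipe):
    """Remove all status data for this dataset from Redis once it is done."""
    # Unregister the dataset from the set of active datasets.
    pipe.srem(self.key, self.name)
    # Drop timestamps and task-tracking keys (assumed names).
    pipe.delete(
        self.start_key,
        self.last_update_key,
        self.finished_key,
        self.pending_key,
        self.running_key,
    )
```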

```python
        collection_id="1",
    )

    # Adding a task updates `start_time` and `last_update`
```
stchris (Contributor) commented on the diff:

I appreciate your comments in this non-trivial test!

tillprochaska (Contributor, Author) commented:

Thanks for the review, @stchris! Any thoughts about #190 (comment)?

stchris (Contributor) commented Jul 16, 2024:

Sorry, I totally forgot about these questions again, @tillprochaska.

* `Dataset.cleanup_dataset_status`

  This one is used by the Aleph worker in the periodic task implementation: https://github.com/alephdata/aleph/blob/7f26ac7bbf52864f048d58cd48b8d5f5784d29a5/aleph/worker.py#L141

* `Dataset.remove_tasks`

  Couldn't find any use of this either, I think this can go!

We currently do not retain information about inactive datasets (i.e. datasets that do not have any more pending or running tasks). For this reason, the "end_time" timestamp is currently of no use, as it would never be displayed to users anyway.

While there are some plans around providing more detail about the results of processed tasks as well as completed jobs, it is unclear where this data will be stored and what the implementation will look like. As it is easy enough to add this information back (< 10 LOC), I’ve removed it for now.

tillprochaska (Contributor, Author) commented Jul 17, 2024:

@stchris Thanks, I’ve rebased the PR and applied changes based on your answers.

> Couldn't find any use of this either, I think this can go!

👍 Removed it.

> This one is used by the Aleph worker in the periodic task implementation: https://github.com/alephdata/aleph/blob/7f26ac7bbf52864f048d58cd48b8d5f5784d29a5/aleph/worker.py#L141

Ah, thanks, I overlooked that. This makes me wonder whether it is required at all? cleanup_dataset_status and flush_status mostly do the same thing. The difference is that flush_status is called immediately after a task is executed or a dataset is cancelled, while cleanup_dataset_status is called periodically; but at the time it is called, there should never be any dataset that needs to be cleaned up.

(I’m asking because I was trying to come up with a test case for this method, but wasn’t able to reproduce any situation where it isn’t just a no-op.)

stchris (Contributor) commented Jul 17, 2024:

> > This one is used by the Aleph worker in the periodic task implementation: https://github.com/alephdata/aleph/blob/7f26ac7bbf52864f048d58cd48b8d5f5784d29a5/aleph/worker.py#L141
>
> Ah, thanks, I overlooked that. This makes me wonder whether it is required at all? cleanup_dataset_status and flush_status mostly do the same thing. The difference is that flush_status is called immediately after a task is executed or a dataset is cancelled, while cleanup_dataset_status is called periodically; but at the time it is called, there should never be any dataset that needs to be cleaned up.
>
> (I’m asking because I was trying to come up with a test case for this method, but wasn’t able to reproduce any situation where it isn’t just a no-op.)

I didn't check the logic flow, but it seems to me like you are right. There should not be a problem with you removing this code then ™️

tillprochaska and others added 4 commits July 19, 2024 11:44
`cleanup_dataset_status` iterates over all active datasets and removes their status information once they are done (i.e. there are no more pending or running tasks). It is called periodically by the Aleph worker.

We already do this for individual datasets whenever a task from the dataset is done or the dataset is cancelled as a whole, which makes `cleanup_dataset_status` redundant.

Also see: #190 (comment)
catileptic merged commit eedfb94 into release/1.23.0 on Jul 19, 2024 (1 check passed)