
Implement HDCA update_time #9722

Merged: 7 commits merged into galaxyproject:dev on Jun 2, 2020

Conversation

jmchilton (Member)

This brings in the state update optimization from #9571. I prefer this to the trigger approach in #9515, in part because the trigger as implemented was broken: there was a syntax error, and even if it had worked it would always update the info fields, though those should not be copied over at the end of the job when they've been set to other values elsewhere. I extended that approach so it is also used when job.set_final_state is called, and replaced the call that updates WorkflowInvocationStep update times with this lower-level SQL approach.

All of that was largely just to be able to also bring in a change that updates HDCA update_time at the same time these other operations are executed, for #9591 and the new frontend history component.
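(For context, the lower-level SQL approach referenced above amounts to issuing a single bulk UPDATE of update_time at job completion instead of loading and flushing ORM objects. Below is a minimal, runnable sketch of that pattern - not Galaxy's actual code or schema; the hdca and job_output_collection tables are simplified stand-ins.)

from datetime import datetime, timezone

from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # throwaway in-memory database for the sketch

with engine.begin() as conn:
    conn.execute(text("CREATE TABLE hdca (id INTEGER PRIMARY KEY, update_time TEXT)"))
    conn.execute(text("CREATE TABLE job_output_collection (job_id INTEGER, hdca_id INTEGER)"))
    conn.execute(text("INSERT INTO hdca (id) VALUES (1), (2)"))
    conn.execute(text("INSERT INTO job_output_collection VALUES (42, 1)"))

    # The single statement issued when job 42 reaches a final state:
    conn.execute(
        text(
            "UPDATE hdca SET update_time = :update_time "
            "WHERE id IN (SELECT hdca_id FROM job_output_collection WHERE job_id = :job_id)"
        ),
        {"update_time": datetime.now(timezone.utc).isoformat(), "job_id": 42},
    )
    print(conn.execute(text("SELECT id, update_time FROM hdca")).fetchall())
    # hdca 1 gets a fresh update_time; hdca 2 is untouched

(The appeal of this shape is that the database sees one statement per job rather than one ORM flush per collection.)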

I can't really show any speed-up from doing all of this (#9691 had a much bigger effect, but it was relatively narrow and doesn't help line things up to alleviate my concerns around #7791). If anything, I think there is a small slowdown as a result of all of this happening in parallel. While testing I hit some slow query updates and large spikes in timings that make me nervous, but nothing consistent or blocking. So I'm not super comfortable with this... but I don't really feel comfortable slowing anything down based on my concerns without evidence, and I did remember that we are essentially already doing the same problematic synchronized writes around WorkflowInvocationStep that we are now doing for HDCAs. So we're not introducing a new order-of-magnitude problem here; we're just doing it twice.

Perhaps the problems around implementing that step update - including this hack - https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/jobs/__init__.py#L1744 - are part of my concern with doing this for HDCAs as well. Clearly we needed to do something hacky to get that working in the distant past and avoid difficult-to-debug/track lockups, and I didn't understand why 😅.

There are a number of things we can do from this point to win back small amounts of the lost time and optimize things toward more efficient job state changes (for #7791). We can update the update time on the job source objects (HDCAs, and datasets for HDAs), since we join against those in the queries anyway - so why not speed up the writes and eliminate the extra join while we're at it. I outlined this a bit in #9591 (comment). Likewise, we should probably normalize dataset state and info so the job's info is used directly instead of being copied around the database. Another important thing we can do is make the updates of WorkflowInvocationStep and HDCA update_time asynchronous via a message queue - getting them out of the main job update loop - and throttle the updates for a given object, but that depends on infrastructure that isn't ready yet.
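(None of that MQ/throttling infrastructure exists yet; purely to illustrate the throttling idea above, a per-object guard could be as simple as the hypothetical sketch below - the class name and the 5-second interval are made up.)

import time


class UpdateTimeThrottle:
    """Hypothetical helper: allow at most one update_time write per object per interval."""

    def __init__(self, min_interval_seconds=5.0):
        self.min_interval = min_interval_seconds
        self._last_write = {}  # (table_name, object_id) -> time of last allowed write

    def should_write(self, table_name, object_id):
        now = time.monotonic()
        last = self._last_write.get((table_name, object_id))
        if last is not None and now - last < self.min_interval:
            return False  # a recent write already bumped update_time; skip this one
        self._last_write[(table_name, object_id)] = now
        return True


throttle = UpdateTimeThrottle()
results = [throttle.should_write("history_dataset_collection_association", 1) for _ in range(3)]
print(results)  # [True, False, False] - three rapid state changes, one write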

@Nerdinacan (Contributor)

Sold!

As long as the date gets updated I'm happy. Want me to close out my trigger-based PR (#9591), since the rest of that PR has migrated into other separate PRs?

@jmchilton (Member, Author)

Sounds good to me @Nerdinacan - thanks for your patience on this issue.

jmchilton changed the title from "[WIP] Implement HDCA update_time" to "Implement HDCA update_time" on May 6, 2020
galaxybot added this to the 20.09 milestone on May 6, 2020
Comment on lines 1242 to 1252
# sa_session.flush()

Member

Suggested change: delete the commented-out line
# sa_session.flush()

The hunk under discussion in the following comments:
dataset_assoc.dataset.dataset.set_total_size()
collected_bytes += dataset_assoc.dataset.dataset.get_total_size()
# don't call get_total_size - forces a flush we don't want in here.
collected_bytes += dataset_assoc.dataset.dataset.set_total_size()
Member

Suggested change:
- collected_bytes += dataset_assoc.dataset.dataset.set_total_size()
+ collected_bytes += dataset_assoc.dataset.set_total_size()

Member

Both versions are correct: job.output_datasets are JobToOutputDatasetAssociation objects, so dataset_assoc.dataset is the HDA and dataset_assoc.dataset.dataset is a Dataset. I guess it saves one function call to use dataset_assoc.dataset.dataset.set_total_size()?
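(To make that object graph concrete, here is a stand-in sketch - not Galaxy's model code; the delegation and the size return value mirror what is being discussed in this thread rather than any particular released version.)

class Dataset:
    """Stand-in for the physical collection of files on disk."""

    def __init__(self, total_size):
        self.total_size = total_size

    def set_total_size(self):
        return self.total_size


class HistoryDatasetAssociation:
    """Stand-in HDA: a dataset as it appears in a history."""

    def __init__(self, dataset):
        self.dataset = dataset

    def set_total_size(self):
        # Pure delegation today - the abstraction question in this thread.
        return self.dataset.set_total_size()


class JobToOutputDatasetAssociation:
    """Stand-in association; the attribute is named "dataset" but holds the HDA."""

    def __init__(self, hda):
        self.dataset = hda


dataset_assoc = JobToOutputDatasetAssociation(HistoryDatasetAssociation(Dataset(1024)))
assert dataset_assoc.dataset.set_total_size() == dataset_assoc.dataset.dataset.set_total_size() == 1024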

@nsoranzo (Member), May 21, 2020

My reasoning is that the HDA is the correct level of abstraction here, unless there is an important performance gain in skipping a method call. This may become a bug if one day the HDA method starts doing more than just calling the Dataset method.

Member Author

That is fair - I mean, this is performance-critical code to my mind, but the extra Python function call is the least of our problems, so I'll use the better abstraction.

@mvdbeek (Member), Jun 1, 2020

But the size refers to the dataset, not the history-dataset relation? In my mind we'll have to separate out extra files at some point in the future and relate them back to the dataset, so that we can have HDAs of different datatypes sharing a dataset. I think that's a shortcoming of our current model, and pretending that HDAs have a size seems like a step away from that. (Admittedly that was not my initial argument :D)

Member Author

I don't think I agree with that - at least at first glance. The dataset is the physical collection of files. I think the right way to handle different composite files between HDAs would be to duplicate the dataset with different contents and find some other mechanism to de-duplicate the actual files - a concept below Dataset, maybe?

Member Author

Put another way: metadata files and extra files made very different choices and should be converged in some fashion - but converging toward datasets would make more sense to me and my view of things. I could be wrong though.

Member

But even if we had a concept below Dataset, wouldn't the file size belong on the Dataset rather than the HDA? It seems counterintuitive to me for the relation to have a size, especially since the composite file content is also part of the Dataset. Effectively we get and set the size of the Dataset here, so I'm fine with merging it as is, but I feel like the current form is clearer both about what actually happens and about how we may want to go forward.

Member

I didn't see your second comment before posting my last one. I guess we can revisit this when someone starts working on it; let's get this in. MetadataFile being part of the HDA seems to make more sense to me - I hadn't thought about that. In fact, making extra files a first-class object that relates to HDAs would probably also neatly close the gap I'm worried about in our model.


collected_bytes = 0
# Once datasets are collected, set the total dataset size (includes extra files)
for dataset_assoc in job.output_datasets:
    if not dataset_assoc.dataset.dataset.purged:
        dataset_assoc.dataset.dataset.set_total_size()
        collected_bytes += dataset_assoc.dataset.dataset.get_total_size()
        # don't call get_total_size - forces a flush we don't want in here.
Member

It's quite confusing that set_total_size doesn't save to the database but get_total_size does...
Maybe add an explicit flush=True parameter to both?

Member Author

Can I just drop the comment then? It isn't confusing that set_size returns the size, right? The confusion is that get_total_size initializes the data, but that wouldn't be relevant if the comment weren't here?

@jdavcs (Member), Jun 1, 2020

+1 on @nsoranzo's suggestion. I think it is unexpected that a get_... method may update the database (a plain lazy load that hits the db if the value isn't set would be fine, but to me this feels different). Maybe not flush=True, but set_if_none=True or calculate_if_none? That way the argument explicitly says what will be done (vs. flush: what updates will be flushed?).
(BTW, set_size does not return anything, unless I'm looking in the wrong place - I see this now.)

Member

@jmchilton I had nothing against that comment, it was useful. What I have always found confusing is the implementation of both the get_total_size and set_total_size methods. What I am proposing is:

    def get_total_size(self, flush=True):
        if self.total_size is not None:
            return self.total_size
        return self.set_total_size(flush=flush)

    def set_total_size(self, flush=True):
        if self.file_size is None:
            self.set_size()
        self.total_size = self.file_size or 0
        rel_path = self._extra_files_rel_path
        if rel_path is not None:
            if self.object_store.exists(self, extra_dir=rel_path, dir_only=True):
                for root, dirs, files in os.walk(self.extra_files_path):
                    self.total_size += sum([os.path.getsize(os.path.join(root, file)) for file in files if os.path.exists(os.path.join(root, file))])
        if flush:
            db_session = object_session(self)
            db_session.flush()
        return self.total_size

That's very probably material for another PR, but what do you think of this?

@ic4f Regarding the parameter name, flush is used in all Galaxy performance-sensitive code to mean whether a modified object should be flushed to the database or the caller is responsible for that, so I think its meaning would be clear in this context.

Member

@nsoranzo thanks for clarifying, this makes sense. +1

);
''', '''
UPDATE history_dataset_association
SET
Contributor

If performance is your primary concern, you might consider updating via joins rather than with WHERE ... IN subqueries. Traditionally that has been a more performant way to do this in many databases, though I do find it harder to understand visually, and I think the way you've written it is easier for future coders to understand.

If you choose to go that route, then because of the variety of SQL syntaxes for an UPDATE ... FROM, this is a reasonably good place to leverage the ORM's query builder functions.
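(As a rough illustration of the query-builder route - not Galaxy code; table and column names are simplified stand-ins - SQLAlchemy Core lets you write the statement once and have dialect-appropriate SQL rendered for you.)

from datetime import datetime, timezone

from sqlalchemy import Column, DateTime, Integer, MetaData, Table, create_engine, select, update

metadata = MetaData()
hdca = Table(
    "hdca", metadata,
    Column("id", Integer, primary_key=True),
    Column("update_time", DateTime),
)
job_output_collection = Table(
    "job_output_collection", metadata,
    Column("job_id", Integer),
    Column("hdca_id", Integer),
)

engine = create_engine("sqlite://")  # the same code runs unchanged against PostgreSQL
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(hdca.insert(), [{"id": 1}, {"id": 2}])
    conn.execute(job_output_collection.insert(), [{"job_id": 42, "hdca_id": 1}])

    # One statement built with the query builder; this renders as a portable
    # WHERE ... IN subquery. Referencing the other table in .where() instead can
    # render a join-style UPDATE ... FROM on backends that support it.
    stmt = (
        update(hdca)
        .where(hdca.c.id.in_(
            select(job_output_collection.c.hdca_id).where(job_output_collection.c.job_id == 42)
        ))
        .values(update_time=datetime.now(timezone.utc))
    )
    conn.execute(stmt)
    print(conn.execute(select(hdca)).fetchall())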

Member

UPDATE ... FROM is not supported in SQLite, and the code needs to support both PostgreSQL and SQLite. Checking the current db engine and executing the appropriate query is easy (here's one example: https://github.com/galaxyproject/galaxy/pull/9515/files#diff-5da75d9743ecb1b3e59a6ba6527ccb16R63). My concern is that this module is already 6000+ lines long, and it's bad enough that it contains raw SQL. If we know that using a join for PostgreSQL updates does indeed address a bottleneck, then sure, let's have two versions and execute them conditionally. Otherwise, I'd keep it simple.
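(For illustration only, the conditional-dialect idea might look roughly like the sketch below; the SQL strings and table names are hypothetical stand-ins, not the statements in this PR.)

# Hypothetical raw-SQL variants of the same update, plus a dialect check.
UPDATE_VIA_SUBQUERY = """
    UPDATE hdca
    SET update_time = :update_time
    WHERE id IN (SELECT hdca_id FROM job_output_collection WHERE job_id = :job_id)
"""  # portable: runs on both PostgreSQL and SQLite

UPDATE_VIA_JOIN = """
    UPDATE hdca
    SET update_time = :update_time
    FROM job_output_collection AS joc
    WHERE joc.hdca_id = hdca.id AND joc.job_id = :job_id
"""  # PostgreSQL's UPDATE ... FROM form (not available in SQLite at the time of this discussion)


def pick_update_statement(sa_session):
    # Session.get_bind() returns the engine; dialect.name is e.g. "postgresql" or "sqlite".
    if sa_session.get_bind().dialect.name == "postgresql":
        return UPDATE_VIA_JOIN
    return UPDATE_VIA_SUBQUERY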

jdavcs and others added 6 commits June 1, 2020 10:12
- re-arrange order of execution to respect flush in the parent method (I forgot to actually remove the flush - ugh) - still committing this because I have a chart.
- Include update HDCA update_time by embedding query from Mason on 9591.
Update HDCA update_time using SQL statement on job completion and convert SA update of workflow invocations over to this method also.
@jdavcs (Member) commented on Jun 2, 2020

I believe this is ready for merging?

mvdbeek merged commit 703bc5e into galaxyproject:dev on Jun 2, 2020
mvdbeek added a commit that referenced this pull request on Sep 28, 2020:
This reverts commit 703bc5e, reversing changes made to 294252e.
nsoranzo deleted the hdca_update_time branch on September 28, 2020 17:34
mvdbeek added a commit to mvdbeek/galaxy that referenced this pull request on Oct 2, 2020 ("…ate_time"):
This reverts commit 703bc5e, reversing changes made to 294252e.
jdavcs mentioned this pull request on Nov 22, 2021