
Implement HDCA update_time #9722

Merged: 7 commits merged into galaxyproject:dev on Jun 2, 2020

Conversation

jmchilton (Member)

This brings in the state update optimization from #9571. I prefer this to the trigger approach in #9515, in part because the trigger as implemented was broken: there was a syntax error, and even if it had worked it would always update the info fields, though those should not be copied over at the end of the job when they've been set to other values elsewhere. I extended that approach so it is also used when job.set_final_state is called, and replaced the call that updates WorkflowInvocationStep update times with this lower-level SQL approach.

All of that was largely just to be able to also bring in a change that updates HDCA update_time at the same time these other operations are executed, for #9591 and the new frontend history component.
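(For context, the lower-level SQL approach referenced above amounts to issuing a single bulk UPDATE of update_time at job completion instead of loading and flushing ORM objects. Below is a minimal, runnable sketch of that pattern - not Galaxy's actual code or schema; the hdca and job_output_collection tables are simplified stand-ins.)

from datetime import datetime, timezone

from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # throwaway in-memory database for the sketch

with engine.begin() as conn:
    conn.execute(text("CREATE TABLE hdca (id INTEGER PRIMARY KEY, update_time TEXT)"))
    conn.execute(text("CREATE TABLE job_output_collection (job_id INTEGER, hdca_id INTEGER)"))
    conn.execute(text("INSERT INTO hdca (id) VALUES (1), (2)"))
    conn.execute(text("INSERT INTO job_output_collection VALUES (42, 1)"))

    # The single statement issued when job 42 reaches a final state:
    conn.execute(
        text(
            "UPDATE hdca SET update_time = :update_time "
            "WHERE id IN (SELECT hdca_id FROM job_output_collection WHERE job_id = :job_id)"
        ),
        {"update_time": datetime.now(timezone.utc).isoformat(), "job_id": 42},
    )
    print(conn.execute(text("SELECT id, update_time FROM hdca")).fetchall())
    # hdca 1 gets a fresh update_time; hdca 2 is untouched

(The appeal of this shape is that the database sees one statement per job rather than one ORM flush per collection.)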

I can't really show any speed-up from doing all of this (#9691 had a much bigger effect, but it was relatively narrow and doesn't help line things up to alleviate my concerns around #7791). If anything, I think there is a small slowdown as a result of all of this happening in parallel. While testing I hit some slow query updates and large spikes in timings that make me nervous, but nothing consistent or blocking. So I'm not super comfortable with this... but I don't really feel comfortable slowing anything down based on my concerns without evidence, and I did remember that we are essentially already doing the same problematic synchronized writes around WorkflowInvocationStep that we are now doing for HDCAs. So we're not introducing a new order-of-magnitude problem here; we're just doing it twice.

Perhaps the problems around implementing that step update - including this hack - https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/jobs/__init__.py#L1744 - are part of my concern with doing this for HDCAs as well. Clearly we needed to do something hacky to get that working in the distant past and avoid difficult-to-debug/track lockups, and I didn't understand why 😅.

There are a number of things we can do from this point to win back small amounts of the lost time and optimize things toward more efficient job state changes (for #7791). We can update the update time on the job source objects (HDCAs, and datasets for HDAs), since we join against those in the queries anyway - so why not speed up the writes and eliminate the extra join while we're at it. I outlined this a bit in #9591 (comment). Likewise, we should probably normalize dataset state and info so the job's info is used directly instead of being copied around the database. Another important thing we can do is make the updates of WorkflowInvocationStep and HDCA update_time asynchronous via a message queue - getting them out of the main job update loop - and throttle the updates for a given object, but that depends on infrastructure that isn't ready yet.
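(None of that MQ/throttling infrastructure exists yet; purely to illustrate the throttling idea above, a per-object guard could be as simple as the hypothetical sketch below - the class name and the 5-second interval are made up.)

import time


class UpdateTimeThrottle:
    """Hypothetical helper: allow at most one update_time write per object per interval."""

    def __init__(self, min_interval_seconds=5.0):
        self.min_interval = min_interval_seconds
        self._last_write = {}  # (table_name, object_id) -> time of last allowed write

    def should_write(self, table_name, object_id):
        now = time.monotonic()
        last = self._last_write.get((table_name, object_id))
        if last is not None and now - last < self.min_interval:
            return False  # a recent write already bumped update_time; skip this one
        self._last_write[(table_name, object_id)] = now
        return True


throttle = UpdateTimeThrottle()
results = [throttle.should_write("history_dataset_collection_association", 1) for _ in range(3)]
print(results)  # [True, False, False] - three rapid state changes, one write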

@Nerdinacan (Contributor)

Sold!

As long as the date gets updated I'm happy. Want me to close out my trigger-based PR (#9591), since the rest of that PR has migrated into other separate PRs?

@jmchilton (Member, Author)

Sounds good to me @Nerdinacan - thanks for your patience on this issue.

jmchilton changed the title from "[WIP] Implement HDCA update_time" to "Implement HDCA update_time" on May 6, 2020
galaxybot added this to the 20.09 milestone on May 6, 2020
Comment on lines 1242 to 1252
# sa_session.flush()

Member

Suggested change: delete the commented-out line
# sa_session.flush()

The hunk under discussion in the following comments:
dataset_assoc.dataset.dataset.set_total_size()
collected_bytes += dataset_assoc.dataset.dataset.get_total_size()
# don't call get_total_size - forces a flush we don't want in here.
collected_bytes += dataset_assoc.dataset.dataset.set_total_size()
Member

Suggested change:
- collected_bytes += dataset_assoc.dataset.dataset.set_total_size()
+ collected_bytes += dataset_assoc.dataset.set_total_size()

Member

Both versions are correct: job.output_datasets are JobToOutputDatasetAssociation objects, so dataset_assoc.dataset is the HDA and dataset_assoc.dataset.dataset is a Dataset. I guess it saves one function call to use dataset_assoc.dataset.dataset.set_total_size()?
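(To make that object graph concrete, here is a stand-in sketch - not Galaxy's model code; the delegation and the size return value mirror what is being discussed in this thread rather than any particular released version.)

class Dataset:
    """Stand-in for the physical collection of files on disk."""

    def __init__(self, total_size):
        self.total_size = total_size

    def set_total_size(self):
        return self.total_size


class HistoryDatasetAssociation:
    """Stand-in HDA: a dataset as it appears in a history."""

    def __init__(self, dataset):
        self.dataset = dataset

    def set_total_size(self):
        # Pure delegation today - the abstraction question in this thread.
        return self.dataset.set_total_size()


class JobToOutputDatasetAssociation:
    """Stand-in association; the attribute is named "dataset" but holds the HDA."""

    def __init__(self, hda):
        self.dataset = hda


dataset_assoc = JobToOutputDatasetAssociation(HistoryDatasetAssociation(Dataset(1024)))
assert dataset_assoc.dataset.set_total_size() == dataset_assoc.dataset.dataset.set_total_size() == 1024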

@nsoranzo (Member), May 21, 2020

My reasoning is that the HDA is the correct level of abstraction here, unless there is an important performance gain in skipping a method call. This may become a bug if one day the HDA method starts doing more than just calling the Dataset method.

Member Author

That is fair - I mean, this is performance-critical code to my mind, but the extra Python function call is the least of our problems, so I'll use the better abstraction.

@mvdbeek (Member), Jun 1, 2020

But the size refers to the dataset, not the history-dataset relation? In my mind we'll have to separate out extra files at some point in the future and relate them back to the dataset, so that we can have HDAs of different datatypes sharing a dataset. I think that's a shortcoming of our current model, and pretending that HDAs have a size seems like a step away from that. (Admittedly that was not my initial argument :D)

Member Author

I don't think I agree with that - at least at first glance. The dataset is the physical collection of files. I think the right way to handle different composite files between HDAs would be to duplicate the dataset with different contents and find some other mechanism to de-duplicate the actual files - a concept below Dataset, maybe?

Member Author

Put another way: metadata files and extra files made very different choices and should be converged in some fashion - but converging toward datasets would make more sense to me and my view of things. I could be wrong though.

Member

But even if we had a concept below Dataset, wouldn't the file size belong on the Dataset rather than the HDA? It seems counterintuitive to me for the relation to have a size, especially since the composite file content is also part of the Dataset. Effectively we get and set the size of the Dataset here, so I'm fine with merging it as is, but I feel like the current form is clearer both about what actually happens and about how we may want to go forward.

Member

I didn't see your second comment before posting my last one. I guess we can revisit this when someone starts working on it; let's get this in. MetadataFile being part of the HDA seems to make more sense to me - I hadn't thought about that. In fact, making extra files a first-class object that relates to HDAs would probably also neatly close the gap I'm worried about in our model.


collected_bytes = 0
# Once datasets are collected, set the total dataset size (includes extra files)
for dataset_assoc in job.output_datasets:
    if not dataset_assoc.dataset.dataset.purged:
        dataset_assoc.dataset.dataset.set_total_size()
        collected_bytes += dataset_assoc.dataset.dataset.get_total_size()
        # don't call get_total_size - forces a flush we don't want in here.
Member

It's quite confusing that set_total_size doesn't save to the database but get_total_size does...
Maybe add an explicit flush=True parameter to both?

Member Author

Can I just drop the comment then? It isn't confusing that set_size returns the size, right? The confusion is that get_total_size initializes the data, but that wouldn't be relevant if the comment weren't here?

@jdavcs (Member), Jun 1, 2020

+1 on @nsoranzo's suggestion. I think it is unexpected that a get_... method may update the database (a plain lazy load that hits the db if the value isn't set would be fine, but to me this feels different). Maybe not flush=True, but set_if_none=True or calculate_if_none? That way the argument explicitly says what will be done (vs. flush: what updates will be flushed?).
(BTW, set_size does not return anything, unless I'm looking in the wrong place - I see this now.)

Member

@jmchilton I had nothing against that comment, it was useful. What I have always found confusing is the implementation of both the get_total_size and set_total_size methods. What I am proposing is:

    def get_total_size(self, flush=True):
        if self.total_size is not None:
            return self.total_size
        return self.set_total_size(flush=flush)

    def set_total_size(self, flush=True):
        if self.file_size is None:
            self.set_size()
        self.total_size = self.file_size or 0
        rel_path = self._extra_files_rel_path
        if rel_path is not None:
            if self.object_store.exists(self, extra_dir=rel_path, dir_only=True):
                for root, dirs, files in os.walk(self.extra_files_path):
                    self.total_size += sum([os.path.getsize(os.path.join(root, file)) for file in files if os.path.exists(os.path.join(root, file))])
        if flush:
            db_session = object_session(self)
            db_session.flush()
        return self.total_size

That's very probably material for another PR, but what do you think of this?

@ic4f Regarding the parameter name, flush is used in all Galaxy performance-sensitive code to mean whether a modified object should be flushed to the database or the caller is responsible for that, so I think its meaning would be clear in this context.

Member

@nsoranzo thanks for clarifying, this makes sense. +1

);
''', '''
UPDATE history_dataset_association
SET
Contributor

If performance is your primary concern, you might consider updating via joins rather than with WHERE ... IN subqueries. Traditionally that has been a more performant way to do this in many databases, though I do find it harder to understand visually, and I think the way you've written it is easier for future coders to understand.

If you choose to go that route, then because of the variety of SQL syntaxes for an UPDATE ... FROM, this is a reasonably good place to leverage the ORM's query builder functions.
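(As a rough illustration of the query-builder route - not Galaxy code; table and column names are simplified stand-ins - SQLAlchemy Core lets you write the statement once and have dialect-appropriate SQL rendered for you.)

from datetime import datetime, timezone

from sqlalchemy import Column, DateTime, Integer, MetaData, Table, create_engine, select, update

metadata = MetaData()
hdca = Table(
    "hdca", metadata,
    Column("id", Integer, primary_key=True),
    Column("update_time", DateTime),
)
job_output_collection = Table(
    "job_output_collection", metadata,
    Column("job_id", Integer),
    Column("hdca_id", Integer),
)

engine = create_engine("sqlite://")  # the same code runs unchanged against PostgreSQL
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(hdca.insert(), [{"id": 1}, {"id": 2}])
    conn.execute(job_output_collection.insert(), [{"job_id": 42, "hdca_id": 1}])

    # One statement built with the query builder; this renders as a portable
    # WHERE ... IN subquery. Referencing the other table in .where() instead can
    # render a join-style UPDATE ... FROM on backends that support it.
    stmt = (
        update(hdca)
        .where(hdca.c.id.in_(
            select(job_output_collection.c.hdca_id).where(job_output_collection.c.job_id == 42)
        ))
        .values(update_time=datetime.now(timezone.utc))
    )
    conn.execute(stmt)
    print(conn.execute(select(hdca)).fetchall())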

Member

UPDATE ... FROM is not supported in SQLite, and the code needs to support both PostgreSQL and SQLite. Checking the current db engine and executing the appropriate query is easy (here's one example: https://github.com/galaxyproject/galaxy/pull/9515/files#diff-5da75d9743ecb1b3e59a6ba6527ccb16R63). My concern is that this module is already 6000+ lines long, and it's bad enough that it contains raw SQL. If we know that using a join for PostgreSQL updates does indeed address a bottleneck, then sure, let's have two versions and execute them conditionally. Otherwise, I'd keep it simple.
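(For illustration only, the conditional-dialect idea might look roughly like the sketch below; the SQL strings and table names are hypothetical stand-ins, not the statements in this PR.)

# Hypothetical raw-SQL variants of the same update, plus a dialect check.
UPDATE_VIA_SUBQUERY = """
    UPDATE hdca
    SET update_time = :update_time
    WHERE id IN (SELECT hdca_id FROM job_output_collection WHERE job_id = :job_id)
"""  # portable: runs on both PostgreSQL and SQLite

UPDATE_VIA_JOIN = """
    UPDATE hdca
    SET update_time = :update_time
    FROM job_output_collection AS joc
    WHERE joc.hdca_id = hdca.id AND joc.job_id = :job_id
"""  # PostgreSQL's UPDATE ... FROM form (not available in SQLite at the time of this discussion)


def pick_update_statement(sa_session):
    # Session.get_bind() returns the engine; dialect.name is e.g. "postgresql" or "sqlite".
    if sa_session.get_bind().dialect.name == "postgresql":
        return UPDATE_VIA_JOIN
    return UPDATE_VIA_SUBQUERY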

jdavcs and others added 6 commits June 1, 2020 10:12
- re-arrange order of execution to respect flush in the parent method (I forgot to actually remove the flush - ugh) - still committing this because I have a chart.
- Include update HDCA update_time by embedding query from Mason on 9591.
Update HDCA update_time using SQL statement on job completion and convert SA update of workflow invocations over to this method also.
@jdavcs (Member) commented on Jun 2, 2020

I believe this is ready for merging?

mvdbeek merged commit 703bc5e into galaxyproject:dev on Jun 2, 2020
mvdbeek added a commit that referenced this pull request on Sep 28, 2020:
This reverts commit 703bc5e, reversing changes made to 294252e.
nsoranzo deleted the hdca_update_time branch on September 28, 2020 17:34
mvdbeek added a commit to mvdbeek/galaxy that referenced this pull request on Oct 2, 2020 ("…ate_time"):
This reverts commit 703bc5e, reversing changes made to 294252e.
jdavcs mentioned this pull request on Nov 22, 2021