Archiving Histories #14771

Closed
9 tasks
davelopez opened this issue Oct 10, 2022 · 16 comments

@davelopez
Contributor

davelopez commented Oct 10, 2022

After discussing this long-standing feature request a bit more a couple of weeks ago during the backend working group meeting (slides), here are a couple of use cases and a high-level breakdown of the tasks.

Use Case: Archive a history

The most basic use case will be just "marking" a history as archived. This action will freeze the history by restricting any further mutation.
Example of mutation operations:

  • Upload a dataset
  • Run a tool
  • Run a workflow
  • Edit history metadata (name, annotation, etc.)
  • Any other?

An archived history will not be displayed when listing histories by default, but there should be a filter option to show them on demand or a dedicated view.
Those histories will display a badge or some clear indication of their state with possibly additional information on how to restore them.
The Storage Dashboard can detect these histories and suggest exporting them to external storage (see the following use cases) in order to free some space in your Galaxy.

Use Case: Package and Publish an Archived History

After a history is archived, the user can recover its storage space. One way is by publishing the contents to a remote repository.
In this scenario, the users should be able to configure the API credentials in the settings to connect to an external DOI repository (like Zenodo, InvenioRDM, etc.). Then the history and its contents can be packaged into a structured container like RO-Crate and published to the desired external repository.
Galaxy will track the DOI returned by the repository and associate it with the archived history so it can be restored later (as a new copy). The DOI publication also enables other users to import the history for reproducibility.
The packaging and publishing steps should be easy to do by using the Storage Dashboard.

Use Case: Package and Export an Archived History

This scenario replaces the publishing step by exporting the package to any private storage or remote file source.
After exporting the packaged history, the export destination is associated with the history (similar to the DOI case above) and used to recover it in the future. However, the user is responsible for maintaining the exported package.

Implementation Tasks

  • Make history exports trackable (see "Add task-based history export tracking" #14839)
    • By creating a generic StoreExportAssociation table to track when and where a history (or other exportable object) was exported.
  • Export a full history as RO-Crate.
    • This could also be the regular history export package. But a structured format like RO-Crate will improve the import experience of the resulting package (i.e. the ability to select what you want to import from the full package).
  • Integrations with external DOI repositories
    • Export to Zenodo
    • Export to InvenioRDM
    • Allow users to set up their API credentials for those integrations
  • Integrate the history archiving system with the Storage Dashboard
  • Import histories from RO-Crate
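
To make the export-tracking task above more concrete, here is a minimal in-memory sketch of what such an association could record. It is only an illustration, not Galaxy's actual schema: the names `ExportRecord` and `ExportRegistry`, the field names, and the example URIs are all assumptions. The point it shows is that one object can have many exports, each with a time and a destination.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ExportRecord:
    """One export of an exportable object (hypothetical shape, not Galaxy's schema)."""

    object_type: str            # e.g. "history"
    object_id: int
    destination_uri: str        # where the package was written
    exported_at: datetime       # when the export was recorded
    doi: Optional[str] = None   # only set when the export was published


@dataclass
class ExportRegistry:
    """In-memory stand-in for an association table: one object, many exports."""

    records: Dict[Tuple[str, int], List[ExportRecord]] = field(default_factory=dict)

    def track(self, object_type: str, object_id: int,
              destination_uri: str, doi: Optional[str] = None) -> ExportRecord:
        # Append a new record; earlier exports of the same object are kept.
        record = ExportRecord(object_type, object_id, destination_uri,
                              datetime.now(timezone.utc), doi)
        self.records.setdefault((object_type, object_id), []).append(record)
        return record

    def exports_for(self, object_type: str, object_id: int) -> List[ExportRecord]:
        # Return a copy so callers cannot mutate the registry's list.
        return list(self.records.get((object_type, object_id), []))

    def latest(self, object_type: str, object_id: int) -> Optional[ExportRecord]:
        # The most recently tracked export, or None if nothing was exported.
        exports = self.exports_for(object_type, object_id)
        return exports[-1] if exports else None
```

With this shape, an archived history could point at one of its tracked exports for recovery, while the full list stays available for provenance.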

Bonus

  • Make histories freezable
    • By adding a frozen boolean column to the History model and restricting mutations based on it.
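
As a rough sketch of the "frozen boolean" idea, the snippet below guards a mutating operation behind a single check. The `History` class here is a simplified stand-in, not Galaxy's model, and `HistoryFrozenError` and `_ensure_mutable` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


class HistoryFrozenError(Exception):
    """Raised when a mutation is attempted on a frozen history."""


@dataclass
class History:
    """Simplified stand-in for a history with a frozen flag (not Galaxy's model)."""

    name: str
    frozen: bool = False

    def _ensure_mutable(self) -> None:
        # Single choke point: every mutating method calls this first.
        if self.frozen:
            raise HistoryFrozenError(f"history {self.name!r} is frozen")

    def rename(self, new_name: str) -> None:
        self._ensure_mutable()
        self.name = new_name
```

Routing every mutation (upload, tool run, metadata edit) through one check like this keeps the restriction in a single place, assuming all mutation paths can be made to call it.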

Please feel free to add your ideas or concerns in the comments :)

xref #1734, #3088

@bgruening
Member

Thanks @davelopez for writing this up, very appreciated!

If we have a history that we export in some way, I think we need a dataset state that is indicating "archived" in some way. We do not want to show datasets to be "deleted", even if they are - they are archived in this case. Not sure how the "deferred" concept comes in here and if those archived datasets should be turned into "deferred" ones referencing the exported archive?

One other thought that is tricky is if we have a frozen history and we export it, can we change the dataset state to "archived" or "deleted"? Can frozen datasets be deleted?

@jdavcs
Member

jdavcs commented Oct 10, 2022

A couple random considerations:

  • Can we publish a history more than once? We probably can. So if we intend to keep track of any publication metadata in addition to a "is published" flag, (e.g. DOI), it would be a one-to-many relationship.
  • A published history should not be unfrozen, so there probably needs to be a check preventing unfreezing before "unpublishing", I think.

@hexylena
Member

Any other?

any dataset modifications, datatype, name, tags, etc. No modifications period.

@davelopez
Contributor Author

If we have a history that we export in some way, I think we need a dataset state that is indicating "archived" in some way. We do not want to show datasets to be "deleted", even if they are - they are archived in this case.

Yeah, dealing with the contents of the archived history needs a bit more thinking... I was hoping to use only the "archived/frozen" state of the history to display a "virtual" state on the contents. In other words, the dataset can be deleted or in any other state, but as long as it is part of an archived history it will be displayed as "Archived" in the UI regardless of the real internal state. The idea sounded simple enough for a start, but we probably need to set the "frozen" or "archived" state also at the dataset level to prevent any further mutations.
I guess the concern is still, once you have a final history that you want to archive there may be active and deleted datasets on it. But when you archive and export it to remove the contents from Galaxy all datasets will be deleted and we no longer can distinguish between them...

Not sure how the "deferred" concept comes in here and if those archived datasets should be turned into "deferred" ones referencing the exported archive?

That would be an interesting approach, but I'm not sure if the concept is the same or if we can effectively address a dataset inside of a remote package. In any case, this will likely only "work" with published histories. Worth thinking a bit about it though.

One other thought that is tricky is if we have a frozen history and we export it, can we change the dataset state to "archived" or "deleted"? Can frozen datasets be deleted?

Yep, another tricky situation, I guess if we can combine the Archive/Export/Freeze in one "kind of atomical" process in the Storage Dashboard, we can allow controlled mutations in that particular case by unfreezing if the history is already frozen, then archive/export, and then freeze again. But yeah... lots of tricky scenarios... 😵‍💫

Can we publish a history more than once? We probably can. So if we intend to keep track of any publication metadata in addition to a "is published" flag, (e.g. DOI), it would be a one-to-many relationship.

Sure, the idea is to associate and keep track of all exports of a particular history (whether the export implies publishing or not). The "Archiving" state can just associate one of the exports for recovery purposes when the contents need to be removed from Galaxy.

A published history should not be unfrozen, so there probably needs to be a check preventing unfreezing before "unpublishing", I think.

In this context, "published" means the history has been packaged and stored in a remote public repository. So I don't think "unpublishing" is an option... But I haven't thought about it yet 🤔

@hexylena
Member

hexylena commented Oct 11, 2022

Does it need to stay accessible in Galaxy, tracked in the DB?
Can we just export it + purge it? It would simplify things a lot right?

== Edit ==

I really think that's the way to go:

  • no user confusion over a history that they can't edit
  • no new UI needed
  • no new concepts implemented over frozen histories, it's just a typical uneditable purged one.
  • the user can track DOIs themself, like they do for everything else (though I get the motivation of having that attached to their profile somehow)
  • User can export any history using current history export mechanism + existing filesources endpoints, doesn't need to be published or any other weird restrictions.
  • Doesn't need to be shown on storage dashboard

So you're down to

  • HistoryExportAssociation

as ro-crate is done #14595

@mvdbeek
Member

mvdbeek commented Oct 11, 2022

We do not want to show datasets to be "deleted", even if they are - they are archived in this case.

That's on the history level, I wouldn't make it so complicated on a first pass. If the history indicates it's archived that should be good enough IMO. A simple thing is to just transform the state on the client.
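
A minimal sketch of that state transformation, written in Python only for illustration (the real client is not Python, and `displayed_state` is a hypothetical helper): the stored dataset state is left untouched and only the label shown to the user changes.

```python
def displayed_state(dataset_state: str, history_archived: bool) -> str:
    """Return the state label to show in the UI.

    The stored state (e.g. "ok", "deleted") is never modified; datasets in an
    archived history are simply presented as "archived".
    """
    return "archived" if history_archived else dataset_state
```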

That would be an interesting approach, but I'm not sure if the concept is the same or if we can effectively address a dataset inside of a remote package. In any case, this will likely only "work" with published histories. Worth thinking a bit about it though.

👍 it's a very cool idea and I think this might be possible, but I wouldn't bother with this in a first pass. That's something we can look into later and that wouldn't affect decisions we'd have to make now.

@davelopez
Contributor Author

Does it need to stay accessible in Galaxy, tracked in the DB?
Can we just export it + purge it? It would simplify things a lot right?

So, if I understand it correctly, when the user selects "Archive history" at some point:

  • The Export History UI is shown
  • The user selects the format and where to export/download
  • If the export/download is successful, the history is automatically purged
  • There is nothing else in the UI that tells you this history was archived. It is now just a regular purged history. There is no track of when or where it was exported.

This is certainly easier 😆
Do we all agree on that?
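
The simplified flow above boils down to "export first, purge only on success". A minimal sketch of that ordering follows; `archive_history` is a hypothetical helper and the `export` and `purge` callables are placeholders for whatever the real export and purge operations would be.

```python
from typing import Callable


def archive_history(
    history_id: int,
    export: Callable[[int], bool],
    purge: Callable[[int], None],
) -> bool:
    """Export a history, then purge it only if the export succeeded.

    Returns True when the history was exported and purged, False otherwise.
    """
    try:
        exported = export(history_id)
    except Exception:
        # Treat any export error as a failure so the data is never purged.
        exported = False
    if exported:
        purge(history_id)
    return exported
```

The key property is that a failed or interrupted export leaves the history untouched.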

@hexylena
Member

There is no track of when or where it was exported.

I think this wouldn't hurt to have, and it wouldn't be a huge implementation burden right? Everything else, yeah, YAGNI :)

@hexylena
Member

hexylena commented Oct 11, 2022

As a user, that's all I want. I just want to back up important histories to S3, and publish even more important ones to e.g. Zenodo. Everything else, ok, it was never a desired feature for me.

For the use case of my old boss, which prompted #1734, this would have solved his problem perfectly. He had old analyses he wanted to keep the results of (maybe because he wanted to look at them randomly) but they could easily be on cheaper storage. His issue would be solved by this simplified version.

For the user story of my medical university coworkers, we want to archive these results because they can be relevant later, but generally they're not looked at. Packing it into an RO Crate and throwing it in the cheap storage is fine there too.

For the user story of publishing a history for e.g. a paper, I think it's probably a great test of our export/import system if it must go to the location of publication (e.g. zenodo) and then can be re-imported there or other servers to function as a demo.

@bgruening
Copy link
Member

My impression was that we cannot purge stuff from Galaxy's database. The entries always stay, unfortunately - I would love to really throw everything away from the COVID histories :)

So if the entries stay, I think there is a lot of value in rendering an exported history nicely: showing the provenance, making tools reloadable, linking to the workflow, and making it re-usable in general. An exported history is not easily inspectable for the foreseeable future and we always need to import it again. This can easily create more data than we wish to have?

I just imagine a link from a paper to the DOI, and it directs you to the data but also to the Galaxy rendered view.

@hexylena
Member

purge

yes, of course. But it solves the immediate problem of "editing" histories without a new feature, and removes the data (which is the bigger problem for most non-covid users)

The covid histories will always need another solution, right? Even if you have this, you're still going to have to write your own purging to remove the metadata, right?

I just imagine a link from a paper to the DOI, and it directs you to the data but also to the Galaxy rendered view.

For me, my imagination is like a DOI to WFHub, that there's a button "import in Galaxy". Or that you paste the DOI in Galaxy and can import and explore it.

@mvdbeek
Member

mvdbeek commented Oct 11, 2022

An exported history is not easily inspectable for the foreseeable future and we always need to import it again

I think metadata and individual files should be retrievable quite easily, I wouldn't say this needs to be far into the future. If you have the resources I'd have a frontend person look at what data needs to be in an RO-crate manifest in order for us to render a read-only history preview of an export without ingesting everything.

Or alternatively ask the RO-crate people to develop a frontend component we can reuse to list contents.

@bgruening
Member

yes, of course. But it solves the immediate problem of "editing" histories without a new feature and removes the data (which is the bigger problem for most non-covid users)

So if you export you always want to mark history as deleted? I think archiving, exporting, and freezing are all different concepts. They can go together but should not strictly. I know that a few people want to freeze a history for example without exporting.

Or alternatively ask the RO-crate people to develop a frontend component we can reuse to list contents.

And we have been discussing this for a year or so ;)

@hexylena
Member

I think archiving, exporting, and freezing are all different concepts.

I guess that's the main point. We can work on these things separately, rather than as one unified concept, since they address different needs. To solve the archive/export case, we can do that very simply today, we just need to make it easier for users (and help them remember where they sent the history, maybe via a HistoryExportAssociation). To solve the freezing case, we need pretty much everything @davelopez mentioned above.

Freezing addresses a completely different concept, good to know there is a real use case for it though. When I filed the initial issue I surely conflated these terms, but what I and my users have always needed is "export to an archive + delete".

Could you clarify the difference between archiving and exporting? I see those as essentially the same function.

So if you export you always want to mark history as deleted?

could be optional, but it's useful for most people. (Deleted after the export/archive is successful.)

@nuwang
Member

nuwang commented Nov 17, 2022

I think archiving, exporting, and freezing are all different concepts.

I've run into this same issue; the terms would benefit from some clarification. I think it would also be beneficial to compare and align these terms with other common apps like Gmail, Trello, etc.

"Archive" caused the most confusion for me personally while following this discussion. It seems better to stick with what it would generally mean elsewhere - as a means of hiding unused items, while export would be used for moving data out of Galaxy into an external medium.

The term freezing sounds like something that could potentially be replaced by read-only?

@mvdbeek
Member

mvdbeek commented Jul 21, 2023

I think #16003 closes this

@mvdbeek mvdbeek closed this as completed Jul 21, 2023
Backend Working Group automation moved this from 23.1 To do to Done Jul 21, 2023