Pluggable URI handling across upload components. #9888

Merged
merged 5 commits into from
Jul 27, 2020

Conversation

jmchilton
Member

@jmchilton jmchilton commented Jun 18, 2020

Overview

This work defines an interface for interacting with "filesystem"-like entities during "upload". In addition to being a pluggable framework adding important new capabilities to Galaxy, this is a generalization and formalization of existing file sources (e.g. the directories described by library_import_dir, user_library_import_dir, and ftp_upload_dir).

Screenshots

Here is my Dropbox where I created an access token tied to an App folder called galaxytest.

Screen Shot 2020-06-18 at 7 03 56 AM

Here is the slightly modified upload dialog box - it now says select remote files instead of select FTP files.

Screen Shot 2020-06-18 at 7 04 07 AM

I will readily admit the new UI is more utilitarian (in the worst way) than the previous FTP popover code - so I kept all that code in place - we need to figure out how to best present these new selection options, it may not be this dialog.

Screen Shot 2020-06-18 at 7 05 02 AM

My Dropbox files!

Screen Shot 2020-06-18 at 7 05 15 AM

The times aren't there because they aren't exposed in the API we are using, but if I navigate a WebDAV source or the existing Galaxy directories such as my FTP directory (pictured below), these times are available:

Screen Shot 2020-06-18 at 7 05 35 AM

Plugin Infrastructure

This introduces a new plugin type, FilesSource, to represent sources of directories and files during "upload". A FilesSource plugin should be able to index directories and download (called 'realize' to be generic) files to local posix directories. Indexing is used by the remote_files API to provide the client with hierarchies to navigate and to build URIs for the files. The 'realize' operation is used by the upload1 and __DATA_FETCH__ tools during upload to bring the files into Galaxy as datasets.
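To make the two responsibilities concrete, here is a minimal sketch of what such a plugin could look like. The class and method names (`FilesSourceSketch`, `list`, `realize_to`) and the entry dictionary keys are assumptions made for illustration, not necessarily the interface defined in this PR, and the toy implementation does no path sanitization.

```python
import abc
import os
import posixpath
import shutil


class FilesSourceSketch(abc.ABC):
    """Hypothetical minimal shape of a files source plugin (illustrative names)."""

    @abc.abstractmethod
    def list(self, path="/"):
        """Index a directory: return dicts describing the entries under ``path``."""

    @abc.abstractmethod
    def realize_to(self, source_path, native_path):
        """'Realize' a file: copy ``source_path`` to the local path ``native_path``."""


class LocalDirectoryFilesSource(FilesSourceSketch):
    """Toy plugin backed by a local directory, only meant to exercise the interface."""

    def __init__(self, root):
        self.root = os.path.abspath(root)

    def _to_native(self, path):
        # No symlink or traversal checks here; a real plugin would need them.
        return os.path.join(self.root, path.lstrip("/"))

    def list(self, path="/"):
        native = self._to_native(path)
        entries = []
        for name in sorted(os.listdir(native)):
            is_dir = os.path.isdir(os.path.join(native, name))
            entries.append({
                "class": "Directory" if is_dir else "File",
                "name": name,
                "path": posixpath.join(path, name),
            })
        return entries

    def realize_to(self, source_path, native_path):
        shutil.copyfile(self._to_native(source_path), native_path)
```

In this sketch, indexing would back the directory browser in the client, while `realize_to` is what an upload job would call once it knows which file it needs.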

An instance of the ConfiguredFileSources class is responsible for managing individual instances of FilesSource plugins. It has methods to map URIs to the appropriate plugin instance.
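A rough sketch of how URIs might be mapped to plugin instances follows. It assumes the URI layout described later in this description (`gxfiles://<id>/...` for configured plugins and scheme-only roots such as `gxftp://`). The names `split_uri` and `SourceRegistry` are hypothetical and are not the methods on ConfiguredFileSources.

```python
def split_uri(uri):
    """Split a URI into (root identifying the plugin, path inside that plugin)."""
    scheme, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError(f"not a URI: {uri!r}")
    if scheme == "gxfiles":
        # gxfiles://<plugin id>/<path>
        source_id, _, path = rest.partition("/")
        return f"gxfiles://{source_id}", "/" + path
    # Scheme-only sources such as gxftp:// or gximport://
    return f"{scheme}://", "/" + rest


class SourceRegistry:
    """Hypothetical registry keyed by URI root."""

    def __init__(self):
        self._plugins = {}

    def register(self, uri_root, plugin):
        root, _ = split_uri(uri_root)
        self._plugins[root] = plugin

    def find(self, uri):
        """Return (plugin, path) for ``uri``; raises KeyError for unknown roots."""
        root, path = split_uri(uri)
        return self._plugins[root], path
```

With this layout, `gxfiles://dropbox1/reads/a.fastq` would resolve to the plugin registered under `gxfiles://dropbox1` and the path `/reads/a.fastq`.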

The ConfiguredFileSources class tracks the loaded plugins and reuses the go-to galaxy.util.plugin_config module for loading YAML (or XML) definitions of plugins (the same module that dependency resolvers, job metrics, auth backends, etc. use). A ConfiguredFileSources object can serialize itself to a file and be re-materialized during job execution, which allows this abstraction to be used during uploads.

When operating within the Galaxy app, the ConfiguredFileSources uses an adapter pattern to parse user-level information from Galaxy's trans object. During serialization, the ConfiguredFileSources object is expected to encode all the user information it needs into the output JSON description of the file sources. This is because the web transaction won't be available remotely during the upload job. Having these objects work so differently in the Galaxy process and in the remote job is mildly jarring - so unit tests have been written to ensure this all functions properly.
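The round trip can be pictured roughly as follows. This is only a sketch under assumptions: the JSON layout (`file_sources` plus a `user` section) and the function names are invented for illustration and are not the format this PR writes.

```python
import json


def serialize_sources(plugin_configs, user_context, path):
    """Write plugin configs plus the user details they depend on to a JSON file.

    ``user_context`` stands in for whatever the adapter extracts from ``trans``;
    the keys used here are hypothetical.
    """
    description = {
        "file_sources": plugin_configs,
        "user": {
            "username": user_context["username"],
            "preferences": user_context["preferences"],
        },
    }
    with open(path, "w") as fh:
        json.dump(description, fh)


def load_sources(path):
    """Re-materialize the description inside the job, where no ``trans`` exists."""
    with open(path) as fh:
        return json.load(fh)
```

The point of the sketch is that everything the job needs travels in the file, so the job side never has to reach back into the web transaction.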

Plugin Implementations

The FilesSource interface has a helper implementation base class BaseFilesSource that provides some assistance for plugin development. Additionally, the base class PyFilesystem2FilesSource extends BaseFilesSource but assumes a PyFilesystem2 implementation exists to target the file source of interest - so the plugin author need only provide a PyFilesystem2 FS object describing the target. This commit includes three concrete implementations - posix, webdav, and dropbox. posix extends BaseFilesSource while the others are lightweight extensions of PyFilesystem2FilesSource.

posix

While one could imagine a very lightweight implementation based on PyFilesystem2FilesSource, this fully worked-through plugin is implemented directly to ensure we respect Galaxy's strong security checks on paths containing symlinks and preserve the semantics of user_library_import_symlink_allowlist.
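As an illustration of the kind of check involved, the sketch below resolves symlinks and only accepts a path if the resolved location is inside the source's root or an allowlisted directory. This is a simplified stand-in, not Galaxy's actual implementation; the function name `is_safe_path` and its exact semantics are assumptions.

```python
import os


def is_safe_path(path, root, allowlist=()):
    """Return True if ``path`` resolves inside ``root`` or an allowlisted directory."""
    # realpath follows symlinks, so a link pointing outside the root is caught.
    real = os.path.realpath(path)
    for allowed in (root, *allowlist):
        allowed_real = os.path.realpath(allowed)
        if os.path.commonpath([real, allowed_real]) == allowed_real:
            return True
    return False
```

A symlink inside the import directory that points at, say, a file elsewhere on disk would be rejected unless its target directory has been explicitly allowlisted.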

webdav

Galaxy tools for integrating OwnCloud exist - see https://github.com/shiltemann/Galaxy-Owncloud-Integration. Part of the driver for this work was extending that idea to provide a more integrated UX for uploading that data. So this work includes a WebDAV plugin (and associated test cases) that could potentially target OwnCloud.

This plugin was a good exercise in fleshing out and testing the PyFilesystem2 interface, but the PyFilesystem2 WebDAV implementation seems a bit fragile... we might want to replace it with more direct APIs, but we can take a wait-and-see approach.

The config YAML for a webdav plugin that lets users target their own OwnCloud servers configured via user preferences might look something like:

- type: webdav
  id: owncloud1
  label: OwnCloud
  doc: User-configured OwnCloud files
  url: ${user.preferences['owncloud|url']}
  login: ${user.preferences['owncloud|username']}
  password: ${user.preferences['owncloud|password']}

The configuration would provide a user's OwnCloud files at gxfiles://owncloud1/.

If instead a big centralized WebDAV server is made available with public data for all users (mirroring use cases of library_import_dir), a simpler configuration not requiring user preferences might be something like:

- type: webdav
  id: lab
  label: Lab WebDAV server
  doc: Our lab's research files managed at ourlab.org.
  url: http://ourlab.org:7083
  login: ${environ.get('WEBDAV_LOGIN')}
  password: ${environ.get('WEBDAV_PASSWORD')}

The configuration would provide these WebDAV files at gxfiles://lab/.

These two examples demonstrate that basic templating is allowed inside the YAML configuration. These are Cheetah templates exposing very specific views of the 'user', 'config', and the whole 'environ' available to the Galaxy server.
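To show how such placeholders resolve, here is a small stand-in that deliberately does not use Cheetah: it evaluates each `${...}` expression with Python's `eval` against a namespace holding only `user` and `environ`. The function names and the namespace objects are assumptions for illustration, and `eval` is not safe for untrusted configuration.

```python
import re
from types import SimpleNamespace

_PLACEHOLDER = re.compile(r"\$\{(.+?)\}")


def render_value(value, namespace):
    """Replace each ${expr} in ``value`` with the result of evaluating ``expr``."""
    def _evaluate(match):
        # Illustration only: a stand-in for the real template engine.
        return str(eval(match.group(1), {"__builtins__": {}}, dict(namespace)))
    return _PLACEHOLDER.sub(_evaluate, value)


def render_config(plugin_config, namespace):
    """Render every string value of one plugin definition."""
    return {
        key: render_value(val, namespace) if isinstance(val, str) else val
        for key, val in plugin_config.items()
    }


# Hypothetical views of the user and the server environment.
example_namespace = {
    "user": SimpleNamespace(preferences={"owncloud|url": "https://cloud.example.org/dav"}),
    "environ": {"WEBDAV_LOGIN": "lab"},
}
```

Rendering `{"url": "${user.preferences['owncloud|url']}"}` with `example_namespace` would yield the URL stored in that user's preferences, which is the per-user behaviour the first webdav example relies on.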

dropbox

The Dropbox PyFilesystem2 plugin is even easier to configure: all that is needed is a Dropbox access token (this can be configured from the settings menu and may be isolated to an app-specific folder for added security on the user's part).

An example of such a plugin might be:

- type: dropbox
  id: dropbox1
  label: Dropbox Files
  doc: Your Dropbox files - configure an access token via the user preferences
  accessToken: ${user.preferences['dropbox|access_token']}

The configuration would provide a user's Dropbox files at gxfiles://dropbox1/.

gxftp

This is an automatically populated plugin (if ftp_upload_dir is configured in Galaxy) that provides the user's FTP files at gxftp://.

gximport

This is an automatically populated plugin (if library_import_dir is configured in Galaxy) that provides Galaxy's library import files at gximport://.

gxuserimport

This is an automatically populated plugin (if user_library_import_dir is configured in Galaxy) that provides the requesting user's library import files at gxuserimport://.

Why not a tool?

One could imagine a tool - but the upload dialog has many advanced options for selecting how to ingest files (convert tabs and newlines, select format vs. detect, select dbkey, organize into collections, organize via rules, etc...). It would be next to impossible to provide all these same options via a normal tool and the user experience would be very different than using the upload components in Galaxy - which have been optimized and designed for this task.

That said - one future direction I would like to take this is to be able to mark plugins as writable and implement a new tool form input type "export_directory" or something like that. This could then be used to write data export tools, including generalizations of the cloud send tool.

ObjectStore vs FilesSource

ObjectStores provide datasets, not files; the files are organized logically in a very flat way around a dataset. FilesSources instead provide files and directories, not datasets. A FilesSource is meant to be browsed in a hierarchical fashion - and also has no concept of extra files, etc.

Future Work

  • This is hopefully going to serve as the basis of a first pass at Terra integration with Galaxy using the FISS lib. Having an implementation based on PyFilesystem2 means we could potentially integrate support for S3, Basespace, Google Drive, OneDrive, etc.
  • Tool form support for selecting files for import and directories for export.
  • Allow writing collection archives, history export, etc. to the FilesSource - I think this could really enhance the UI around getting big stuff out of Galaxy.

@nuwang
Member

nuwang commented Jun 30, 2020

Tried again today and it's looking great! Didn't need to specify the ftp options, and the posix path option worked very nicely. Very intuitive to use too. Some minor issues I ran into:

  1. When hovering over an item, the mouse cursor type didn't change to a hand.
  2. Webdav paths are not being inferred correctly. It looks like the root folder is being appended again (see examples below). Can you try with a webdav server running in a non-root path?
urllib3.connectionpool DEBUG 2020-06-30 13:34:04,072 [p:4599,w:1,m:0] [uWSGIWorker1Core1] https://gvl5playground.genomicsvl.cloud.edu.au:443 "HEAD /nuwan4/owncloud/remote.php/dav/files/admin/admin/ HTTP/1.1" 404 0
galaxy.web.framework.decorators ERROR 2020-06-30 13:34:04,076 [p:4599,w:1,m:0] [uWSGIWorker1Core1] Uncaught exception in exposed API method:
Traceback (most recent call last):
  File "/Users/Nuwan/work/galaxy/.venv/lib/python3.8/site-packages/webdavfs/webdavfs.py", line 248, in getinfo
    info = self.client.info(_path.encode('utf-8'))
  File "/Users/Nuwan/work/galaxy/.venv/lib/python3.8/site-packages/webdav2/client.py", line 58, in _wrapper
    res = fn(self, *args, **kw)
  File "/Users/Nuwan/work/galaxy/.venv/lib/python3.8/site-packages/webdav2/client.py", line 614, in info
    raise RemoteResourceNotFound(remote_path)
webdav2.exceptions.RemoteResourceNotFound: Remote resource: b'/admin' not found
  3. Same thing happened with a different server. Note the second /webdav/ appended to the path, same as above.
urllib3.connectionpool DEBUG 2020-06-30 13:33:47,806 [p:4599,w:1,m:0] [uWSGIWorker1Core0] https://cloudstor.aarnet.edu.au:443 "HEAD /plus/remote.php/webdav//webdav/ HTTP/1.1" 404 0
galaxy.web.framework.decorators ERROR 2020-06-30 13:33:47,819 [p:4599,w:1,m:0] [uWSGIWorker1Core0] Uncaught exception in exposed API method:
Traceback (most recent call last):
  File "/Users/Nuwan/work/galaxy/.venv/lib/python3.8/site-packages/webdavfs/webdavfs.py", line 248, in getinfo
    info = self.client.info(_path.encode('utf-8'))
  File "/Users/Nuwan/work/galaxy/.venv/lib/python3.8/site-packages/webdav2/client.py", line 58, in _wrapper
    res = fn(self, *args, **kw)
  File "/Users/Nuwan/work/galaxy/.venv/lib/python3.8/site-packages/webdav2/client.py", line 614, in info
    raise RemoteResourceNotFound(remote_path)
webdav2.exceptions.RemoteResourceNotFound: Remote resource: b'/webdav' not found

Some enhancements:

  1. Would be nice to have a breadcrumb trail or a tree-view, so that users don't lose track of their location when traversing deeply nested folder structures.
  2. Would be nice to also be able to select entire folders, but it's already nice that multiple files can be selected so easily. Alternatively, a select all button and check boxes for selection or something?
  3. And as already discussed, a way to also export files would really tie everything up neatly, and solve two big ticket items - a consistent way to get data in and out of Galaxy.
  4. Also a fourth option for consideration - is it worth folding data-source tools into this too?

All in all, this is a massive improvement to usability, and should be a major reason to rush 20.09 out :-) (I confess to having a vested interest in this - this will allow webdav handling to be much better in general over the tool that I've been working on)

@jmchilton jmchilton force-pushed the galaxy_files_2 branch 5 times, most recently from 191f8a7 to b4927fa Compare July 1, 2020 15:37
@jmchilton jmchilton changed the title [WIP] Pluggable URI handling across upload components. Pluggable URI handling across upload components. Jul 1, 2020
@galaxybot galaxybot added this to the 20.09 milestone Jul 1, 2020
@jmchilton
Member Author

I'm pulling this out of WIP because I think what is here represents an atomic first pass despite being sprawling at this point. But I would still like to address these issues.

@nuwang - I put a lot of the GUI issues you mentioned over into #9942. I think I agree with most of the requests, but I think they are iteration 2 sorts of things that it would be easier to address or delegate after this is merged.

Would be nice to also be able to select entire folders

So this can be done using the rule builder now - this isn't in the screenshots above yet but you can select a whole directory and load it in to be worked on using the rule builder to parse out metadata. But it would be both good to have a way to select all the contents and to start with a directory and use something simpler than the rule builder to pull out the metadata. I did mention this on the linked issue.

And as already discussed, a way to also export files would really tie everything up neatly, and solve two big ticket items - a consistent way to get data in and out of Galaxy.

Obviously totally agree and I'm definitely on board - this would be a game changer. I've created an issue here #9948.

Also a fourth option for consideration - is it worth folding data-source tools into this too?

I don't know how we would do that - but it sounds fun. Want to create an issue describing that in more detail? I can't really imagine it.

I will also keep working on the webdav case where the server isn't at a root URL. My first task is to determine if the problem is with the framework or the plugin.

@bgruening
Member

@jmchilton this is awesome! How do you envision saving the user credentials? Storing them in a safe way would help all kinds of data exporters. Ones that we should focus on, imho, are the Zenodo and SRA/ENA exporters.

@jmchilton
Member Author

jmchilton commented Jul 1, 2020

@bgruening This plugin approach has many applications that don't involve needing to capture and store secrets from users, but I'm sure that is the right question. I haven't done any research so I wouldn't want to be the one to pick a best practice right now, but I think this approach could work with many different sources of private information. The example above uses user preferences - which could be made slightly more secure with #9876 - but I think we should ideally research and invest in an external secret manager. Storing this stuff unencrypted in our database isn't something we want to do long term. That said - I do think user preferences and such are a big step forward from the tool framework, which we know people are using currently, so I don't hate it as a medium-term thing.

@afgane
Contributor

afgane commented Jul 1, 2020

but I think we should research and invest in an external secret manager ideally

This is on the roadmap for the Custos project and will be integrated into Galaxy once the service there becomes available.

@bgruening
Member

Fully agree! I was just wondering if you have a master plan already for secrets :)

@bgruening
Member

@afgane very nice, thanks for sharing! Is there more information about this and the plans you have?

@afgane
Contributor

afgane commented Jul 1, 2020

I believe this is as far as implementation on that topic got: apache/airavata-custos#68. The idea is to use Vault behind the scenes and provide an API to science gateways to consume.

There's a team meeting scheduled tomorrow at 11am ET where I'll bring up this secrets service. Anyone interested is welcome to join (https://iu.zoom.us/j/788176034).

@bgruening
Member

@jmchilton sorry, this needs a rebase.


Rebase into galaxy.files...
- Extend FileDialog to allow selection of directories.
- Add new rule source that is a remote directory, pre-loaded into the rule builder with URL column assigned from the metadata.
@bgruening
Member

Ah, awesome sauce! This is amazing! Hopefully, we will see many plugins in the future.


6 participants