Aquainfra importer #108

Merged: 17 commits, May 14, 2024

MarkusKonk
Contributor

This tool is a data source for importing datasets from the AquaINFRA Interaction platform. Users are redirected to the platform, where they can search for datasets. Some of the datasets will have an "Import to Galaxy" button, which redirects back to Galaxy, where the download starts.

@bgruening
Collaborator

Wow, cool. I need @wm75 here, he is the expert on those data_sources.

@yvanlebras
Contributor

Thank you @MarkusKonk for this PR! I'll let Björn and Wolfgang comment, but it seems to me that such a "data importer" can be hard to maintain over the years on both the Galaxy side and the data provider side, can't it? I often think it is better/easier to have "data import tools" that use the data provider's API directly, for example. Does this comment make sense? Thank you for your work!

<tool id="aquainfra_importer" name="AquaINFRA Importer" tool_type="data_source" version="1.0" profile="22.05">
<description>downloads content via the AquaINFRA interaction platform</description>
<command><![CDATA[
python '$__tool_directory__/data_source.py' '$output' $__app__.config.output_size_limit '$output'
Contributor

When installing from the toolshed, this will only work if you also provide the data_source.py script together with the .xml because this literally expects the .xml and the .py file in the same directory.
If you haven't customized anything you can just copy over https://github.com/galaxyproject/galaxy/blob/dev/tools/data_source/data_source.py here.
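
To illustrate the point about the shared directory, here is a minimal sketch (not part of this PR) that checks whether every script a tool XML references via `$__tool_directory__` actually sits next to the XML file. The function name `missing_tool_scripts` and the regular expression are assumptions made for this example, not Galaxy or planemo functionality.

```python
import re
from pathlib import Path

# Matches references such as '$__tool_directory__/data_source.py' in a tool XML.
# The pattern is an assumption for this sketch, not an official Galaxy rule.
SCRIPT_REF = re.compile(r"\$__tool_directory__/([\w.\-/]+)")


def missing_tool_scripts(xml_path):
    """Return referenced script names that are absent from the XML's directory."""
    xml_path = Path(xml_path)
    text = xml_path.read_text(encoding="utf-8")
    referenced = sorted(set(SCRIPT_REF.findall(text)))
    return [name for name in referenced if not (xml_path.parent / name).is_file()]
```

Running this against the tool XML before publishing would report `data_source.py` as missing until the script is copied into the same directory.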

Contributor Author

Thanks for the quick review. I just added the file to the PR.

Contributor

Cool, I'm now checking things on my own instance to see how it behaves :-)

Contributor

But the AquaINFRA part of this is not implemented yet? Or should it be?

Contributor Author

Yeah, I wasn't sure what to do first. The problem is that only very few datasets have metadata that includes a direct download link. Most either have no link or redirect to the data provider's website, where you need to accept conditions, log in, or similar. Ideally, the "Import to Galaxy" button would be on the data provider's website, but I don't see that happening in the near future.
Give me some time to update the platform and I will show you an example record to test the import to Galaxy.

@MarkusKonk
Contributor Author

Thanks for the hint. Do you mean that the platform passes the parameters to the data source tool which uses these to fetch the data via the API directly from the provider?

@yvanlebras
Contributor

Thank you for your rapid feedback. I mean "just" having a Galaxy tool that uses the data provider's API to build a command line that imports data into Galaxy. An example is here: https://github.com/galaxyecology/tools-ecology/tree/master/tools/spocc
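
As a rough illustration of the API-tool idea, the sketch below streams a file from a provider URL into an output path and enforces a size limit, similar in spirit to the `output_size_limit` argument passed in the tool's command line. It is neither the spocc wrapper nor Galaxy's `data_source.py`; the function name `fetch_to_file` and its parameters are assumptions for this example, and a real tool would build the URL from its own parameters against the provider's API.

```python
import urllib.request


def fetch_to_file(url, dest_path, max_bytes, chunk_size=64 * 1024):
    """Stream `url` into `dest_path`; raise ValueError if more than `max_bytes` arrive."""
    written = 0
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            written += len(chunk)
            if written > max_bytes:
                # Stop early instead of filling the disk with an oversized download.
                raise ValueError(f"download exceeds limit of {max_bytes} bytes")
            out.write(chunk)
    return written
```

A tool wrapper would call something like this from its `<command>` section, so the data repository only needs to expose a stable API and does not have to change any of its own code.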

@bgruening
Collaborator

My 5 cents: both approaches have their pros and cons. data_source tools have a better UX and feel natural to users who are used to this data repo. Tools wrapping an API have a worse UX IMHO, but are more independent, which means we do not need code changes on the data repo side, just an API.

In the end, for me it boils down to this: do we have a good contact that we trust on the data repo side? Do they inform us about upcoming internal changes, and are they willing to work with us more closely in the future? If this question can be answered with a yes, I would prefer the better UX. If this is not an assumption that we can make, the API tool might be better.

@MarkusKonk
Contributor Author

Ah, got it. I think we will follow both directions. The data source is useful for people who don't know where to find data or who are at the beginning of their search. They start with the platform, search for data, find it, and then import it into Galaxy. Your way would be more useful for people who want to create, for example, a subset of the data. I am pretty sure we will need to cover both use cases in the project.

Co-authored-by: Wolfgang Maier <maierw@posteo.de>
@wm75
Contributor

wm75 commented Apr 10, 2024

@MarkusKonk is zip the only thing the remote server can return? On the Galaxy side you could get way more sophisticated than that.

@wm75
Contributor

wm75 commented Apr 10, 2024

And the failing test is about flake8 linting of data_source.py, which is apparently configured differently here than in the Galaxy repo.

@MarkusKonk
Contributor Author

I finally managed to create a proper example for the data import from AquaINFRA to Galaxy. Here are two examples:

Both have an "Import to Galaxy" button. I changed the Galaxy tool just a bit. It now has "auto" as an output type. It worked well with zip files, json, and geojson.
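
For readers wondering what distinguishing these three formats can look like, here is a minimal sketch of content-based detection for zip, JSON, and GeoJSON. It is not how Galaxy's "auto" output type works internally (Galaxy uses its own datatype sniffers); the function name `guess_format` and the fallback label `data` are assumptions for this example. The GeoJSON type names are those defined in RFC 7946.

```python
import json
import zipfile

# Object types defined by the GeoJSON specification (RFC 7946).
GEOJSON_TYPES = (
    "FeatureCollection", "Feature", "GeometryCollection",
    "Point", "MultiPoint", "LineString", "MultiLineString",
    "Polygon", "MultiPolygon",
)


def guess_format(path):
    """Return 'zip', 'geojson', 'json', or 'data' for the file at `path`."""
    if zipfile.is_zipfile(path):
        return "zip"
    try:
        with open(path, encoding="utf-8") as handle:
            parsed = json.load(handle)
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Neither a zip archive nor valid UTF-8 JSON: leave it undetermined.
        return "data"
    if isinstance(parsed, dict) and parsed.get("type") in GEOJSON_TYPES:
        return "geojson"
    return "json"
```

Checking zip first matters because a zip archive is binary and would otherwise just fall through the JSON branch as undetermined.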

@MarkusKonk
Contributor Author

I have the same data_source.py as here (https://github.com/galaxyproject/galaxy/blob/dev/tools/data_source/data_source.py), but I am still getting a linting error.

@bgruening
Collaborator

I guess this one is not linted :)
Can you fix this here?

@bgruening
Collaborator

Ok, I can take that and fix it :)

It's not lintr, it's the Python linting stuff.

@MarkusKonk
Contributor Author

MarkusKonk commented May 14, 2024

I am either under- or over-indented :D

@MarkusKonk
Contributor Author

@bgruening
Whohooo, finally got it! The checks pass now.

Co-authored-by: Wolfgang Maier <maierw@posteo.de>
@bgruening
Collaborator

Wolfgang is busy this week and on vacation for the next 2 weeks, so let's merge it so you can test it further. He can look at it in more detail when he is back.

@bgruening bgruening merged commit 2b586af into galaxyecology:master May 14, 2024
11 checks passed
@bgruening bgruening deleted the aquainfra_importer branch May 14, 2024 20:00
4 participants