Data dump to GitHub #355

Closed
joncison opened this issue Sep 4, 2018 · 17 comments
Labels
content Concerns bio.tools content. discussion General discussion around bio.tools.

Comments

@joncison
Member

joncison commented Sep 4, 2018

Nightly dump of all content (in XML and JSON formats?) to GitHub, as a convenience (or at least, to begin with, just a one-off dump)

@joncison joncison added the content Concerns bio.tools content. label Sep 4, 2018
@joncison joncison self-assigned this Sep 4, 2018
@joncison joncison changed the title Data dump from GitHub Data dump to GitHub Sep 4, 2018
@joncison joncison added data model / integrity / quality Concerns the underlying data model (verification, validity etc.) discussion General discussion around bio.tools. and removed data model / integrity / quality Concerns the underlying data model (verification, validity etc.) labels Dec 8, 2018
@joncison
Member Author

joncison commented Dec 8, 2018

We already have a repo for this (https://github.com/bio-tools/bio.tools-content) but the names may be a bit crappy? How about:

Preferences? I'll need to spell out this is strictly for experimental purposes (like what I said here already).

And in what format:

  • XML (we already have the schema, so files can be validated directly; biotoolsSchema 3.0.0 XML will be supported by bio.tools in the next release) - my preference
  • JSON (format natively supported by bio.tools; preferred by web devs?) - requires a shim for conversion to XML/validation
  • YAML (format natively supported by bio.tools; most readable format?) - requires a shim for conversion to XML/validation

Preferences?

I'd personally prefer XML because it will make the validation direct and easier (and avoid any drift to using not very rigorous JSON schema equivalents of biotoolsSchema etc.)

And what about the structure - I propose one folder per tool, where the folder name is the bio.tools toolID - which allows for adding other tool descriptors / files / formats under a common directory. Also one XML file with everything in it.

Preferences?

cc @bgruening @hmenager @hansioan : what do you think?

@bgruening

My gut feeling is https://github.com/bio-tools/tools. It's bio.tools, so tools makes a lot of sense :)

I would prefer YAML, as this is currently the easiest format for people to edit in an editor or browser. This can change if we dump the final version and when we have a curation interface, but for now I would prefer YAML. The shim is hopefully not complicated to write and would be used on CI to 1) convert it to XML and 2) validate any changes.

Thanks @joncison for working on this.
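
A minimal sketch of the kind of conversion shim discussed above, using only the Python standard library. JSON is used as input because the standard library can parse it; YAML input would need a third-party parser. The `json_to_xml` helper and the element names are hypothetical illustrations and do not follow biotoolsSchema; validating the result against the real XSD would need an additional library.

```python
import json
import xml.etree.ElementTree as ET


def to_element(tag, value):
    """Recursively convert a JSON-like value into an XML element."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            # A list becomes repeated sibling elements with the same tag.
            children = child if isinstance(child, list) else [child]
            for item in children:
                elem.append(to_element(key, item))
    else:
        elem.text = str(value)
    return elem


def json_to_xml(json_text, root_tag="tool"):
    """Parse a JSON document and serialise it as an XML string."""
    data = json.loads(json_text)
    return ET.tostring(to_element(root_tag, data), encoding="unicode")


# Hypothetical record; the field names are illustrative only.
record = '{"name": "SignalP", "biotoolsID": "signalp", "version": ["4.1", "5.0"]}'
xml_text = json_to_xml(record)
print(xml_text)
```

On CI, a step like this would run before schema validation, so that only one validation path (XML) has to be maintained.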

@joncison
Member Author

joncison commented Dec 8, 2018

OK thanks!

Just a note that https://dev.bio.tools/api/tool/ is currently serving a mess (in all of XML, JSON and YAML :) ) - but this has been sorted locally.

I plan to play more with shims next week, so let's see how this goes ...

And @bgruening - what about the directory structure; are you happy with folders as we talked about previously ?

@scapella

scapella commented Dec 8, 2018 via email

@hansioan
Member

@joncison @hmenager @bgruening
Why not all of them? I would prefer it to be JSON of course :) , but perhaps the best is to have all three (JSON, XML, YAML). bio.tools supports that.

https://bio.tools/api/t?page=1&format=json
https://bio.tools/api/t?page=1&format=yaml
https://bio.tools/api/t?page=1&format=xml

In the case of biotoolsSchema XML, for now we only have that on a per-tool basis (the example below is on dev, but it will soon work on production too):
https://dev.bio.tools/api/signalp?format=xml
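
As a small illustration of the query-string form shown above, the list URLs for all three formats can be built as follows. This is a sketch only: the base URL and the `page` and `format` parameters are taken from the links in this comment, while the `list_url` helper is a hypothetical name, and no request is actually sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://bio.tools/api/t"  # list endpoint quoted in the comment above
FORMATS = ("json", "yaml", "xml")


def list_url(page, fmt):
    """Build the URL for one page of the tool list in the given format."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{BASE_URL}?{urlencode({'page': page, 'format': fmt})}"


urls = [list_url(1, fmt) for fmt in FORMATS]
for url in urls:
    print(url)
```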

@jlgelpi

jlgelpi commented Dec 10, 2018

I would go for a single format for the repository (one that can be easily checked against a schema). Having several formats may introduce inconsistencies. Perhaps we can accept any input format (XML, JSON, YAML) and convert the data on the pull request.

@joncison
Member Author

Please let us know what you think @hmenager then I'll write back addressing all comments above ...

@bgruening

> And @bgruening - what about the directory structure; are you happy with folders as we talked about previously?

Yes. Folders are good.

> How do you want to handle versions? different subfolders in the same tools folders?

Most likely. Would make sense. Whatever we do, we can change this easily later on. So nothing is set in stone imho.

> I would prefer it to be JSON of course

@hansioan any reason? JSON is a subset of YAML so that should be fine for both worlds and conversion is easy.

> Perhaps we can accept any input format (XML, json, yaml) and convert the data on the pull request.

I guess the idea was to accept only one format and then add all validation on CI. This validation could happen by intermediate conversion to XML if @joncison thinks that's best.
I would prefer only one format in the main repo to not confuse users, but if other formats are needed we can have a bot that converts them automatically and syncs them to a bio.tools-json repo etc. ...

@redmitry
Contributor

Hello,

I know that for mere human beings the form ?page=1&format=json is the natural way, as it allows an ordinary browser to be used for GET requests, but in terms of REST architecture it is better to use headers:

Accept: application/json
Range: tools=10-30
Response:
Content-Type: application/json
Content-Range: tools 10-30/20000

The advantage of standard HTTP pagination is that a client knows the total size from the beginning (headers come before the body) and can calculate the number of pages in the table while loading only one page.

Of course, nothing prevents implementing both forms.
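
To illustrate the page calculation described above, here is a minimal sketch of a client parsing such a `Content-Range` header. The `tools` range unit comes from the example headers in this comment; whether the bio.tools API supports it is not established here. The helper names are hypothetical, and the sketch assumes the range bounds are inclusive, as they are for standard HTTP byte ranges.

```python
import math
import re

# Matches headers of the form "<unit> <first>-<last>/<total>".
CONTENT_RANGE = re.compile(r"^(\w+) (\d+)-(\d+)/(\d+)$")


def parse_content_range(header):
    """Split a Content-Range value into (unit, first, last, total)."""
    match = CONTENT_RANGE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised Content-Range: {header!r}")
    unit, first, last, total = match.groups()
    return unit, int(first), int(last), int(total)


def page_count(header):
    """Number of pages needed to cover the total, given one page's range."""
    _, first, last, total = parse_content_range(header)
    page_size = last - first + 1  # assumption: bounds are inclusive
    return math.ceil(total / page_size)


pages = page_count("tools 10-30/20000")
print(pages)
```

The client needs only the headers of the first response to size a paginated table; the body of the remaining pages can be fetched on demand.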

@hansioan
Member

Regarding the versions... Technically, in bio.tools we don't track tool versions in a fine-grained way. The new schema 3.0, which will go in this week if not today, allows multiple version annotations per tool. The reason for this is that if there is no difference in annotation across a set of tool versions, they all go in the same tool annotation.
If there IS a (significant) difference in tool functionality, and thus in annotation, between different tool versions, then that tool version goes into a separate tool entry, given its own tool id, with separate annotation and so on...

Thus I am not really sure if any of the folder structure is needed, at least not at the core of the tool descriptions. We can have the option of providing multiple copies of the same tool, separated by version, but that should be something that results from the initial structure, and not something that IS the initial structure.
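
The rule described above (versions with identical annotation share one entry; a differing annotation gets its own entry) can be sketched roughly as follows. This is an illustration only, not how bio.tools stores data: the `group_versions` helper, the dictionary keys and the sample annotations are all hypothetical, and annotations are reduced to plain strings for simplicity.

```python
from collections import defaultdict


def group_versions(releases):
    """Group (version, annotation) pairs into one entry per distinct annotation."""
    grouped = defaultdict(list)
    for version, annotation in releases:
        grouped[annotation].append(version)
    # Each entry lists every version that shares the same annotation.
    return [
        {"annotation": annotation, "versions": versions}
        for annotation, versions in grouped.items()
    ]


# Hypothetical releases of one tool; only the last one changes functionality.
releases = [
    ("1.2.23", "sequence alignment"),
    ("1.2.24", "sequence alignment"),
    ("2.0.0", "sequence alignment; variant calling"),
]
entries = group_versions(releases)
print(entries)
```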

@scapella

scapella commented Dec 10, 2018 via email

@hansioan
Member

Yes, but having version-specific information for each tool takes us back to 2 years ago, when tools were accessed like https://bio.tools/toolid/version

This approach basically created a new tool whenever a new version appeared, and in 90% of all cases there were no (zero) differences between the annotations, except for the version property. We had a very famous example of a tool that appeared over 10 times in bio.tools with the same annotation, because people were just going in and updating the version information whenever they released a new version (e.g. a new tool entry between tool versions 1.2.23 and 1.2.24).

There is no good way to do separate versions for each tool except modeling this in the API request, and even if there was we would still have to store versioned tools in the database.

While this can certainly apply to things like conda, containers and other projects that require exact versions, I don't think it applies as much to bio.tools. We must remember that 90% of our users just want to find a tool that meets their scientific requirements (focus on find).

All this being said, I am not opposed to having a good solution that can work for everyone, it is just something which is complicated and not in our list of main tasks right now. We have opened the code and once all the remaining plumbing tasks are done and we are ready to accept pull requests, perhaps this can be one of the initial tasks for contributors.

@bgruening

@redmitry JSON will be served from the web service. But this issue is about the data storage in GitHub; no one disagrees that JSON will be served from the web service, imho.

@scapella @hansioan don't let the version discussion stop the initial drop, please. Versioning can be added at any time. Subfolders are easy to add for anyone if they care about the version or if the difference between the tools/benchmarks is too big. You simply need to adjust your bio.tools-github-parser to traverse all folders recursively. And people that don't need this can simply assume that the latest version is in the root dir. Really not a big deal. We can add this later if we need to.
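
A rough sketch of the traversal described above, under assumed conventions: each tool folder holds its latest descriptor in the root, and optional version subfolders hold older ones. The file name `signalp.xml`, the `4.1` subfolder and both helper names are hypothetical; the layout is created in a temporary directory so the example is self-contained.

```python
from pathlib import Path
import tempfile


def find_descriptors(tool_dir, pattern="*.xml"):
    """Return all descriptors under a tool folder, root-level files first."""
    tool_dir = Path(tool_dir)
    return sorted(
        tool_dir.rglob(pattern),
        key=lambda p: (len(p.relative_to(tool_dir).parts), str(p)),
    )


def latest_descriptor(tool_dir, pattern="*.xml"):
    """Return the root-level descriptor, assumed to be the latest version."""
    matches = sorted(Path(tool_dir).glob(pattern))
    return matches[0] if matches else None


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: signalp/signalp.xml and signalp/4.1/signalp.xml
    tool = Path(tmp) / "signalp"
    (tool / "4.1").mkdir(parents=True)
    (tool / "signalp.xml").write_text("<tool/>")
    (tool / "4.1" / "signalp.xml").write_text("<tool/>")

    found = [p.relative_to(tool).as_posix() for p in find_descriptors(tool)]
    latest = latest_descriptor(tool).name

print(found, latest)
```

A parser that ignores versions calls only `latest_descriptor`; one that cares about versions walks everything with `find_descriptors`.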

@hmenager
Member

hmenager commented Dec 10, 2018 via email

@scapella

scapella commented Dec 10, 2018 via email

@joncison
Member Author

joncison commented Dec 18, 2018

Quick update - will be revisiting this in new year - but for now a few points:

  • repo name: most likely https://github.com/bio-tools/content, because the bio.tools scope is broad: "tool" covers many types of software (command-line tools, Web applications, database portals, workflows etc.)
  • repo structure: most likely one folder per tool, with biotoolsIDs as folder names. Folders will allow other files and sub-folders to be added in future as needed, e.g. alternative formats, or even other tool descriptors, wrappers, test data etc.
  • metadata format: biotoolsSchema 3.0.0-compatible XML to begin with. The priority in the 1st instance is to achieve content integration with other projects (BioConda, BioContainers, Galaxy etc.), hence the need to prioritise ease of validation and (I strongly suspect) updating biotoolsSchema to enable this integration
  • YAML format: can come later, once the integration use-case is advanced and individual developers are more of a priority. It has to wait for the shims, which I'll play with soon-ish.
  • initial dump: likely will be everything (easier)
  • version information: there are well-established guidelines on how tool versions are currently handled. The current model allows specification of version information in a pragmatic / flexible way, including for the entry itself, relevant downloads and publications. There is certainly scope to improve this model, but let's take that discussion here in 1st instance - with a view to a better rendering of version-specific info. in bio.tools. In future (with new content architecture), we could go further, but one thing at a time ...
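
The proposed repository structure (one folder per tool, named by biotoolsID) could be produced by a dump script along these lines. This is a sketch under assumptions: the `write_dump` helper, the `<biotoolsID>.xml` file name and the `example_tool` ID are hypothetical, and the XML bodies are placeholders rather than biotoolsSchema documents.

```python
from pathlib import Path
import tempfile


def write_dump(entries, root):
    """Write each entry to <root>/<biotoolsID>/<biotoolsID>.xml."""
    root = Path(root)
    written = []
    for tool_id, xml_text in entries.items():
        folder = root / tool_id
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / f"{tool_id}.xml"
        target.write_text(xml_text, encoding="utf-8")
        written.append(target)
    return written


# Placeholder content keyed by (partly hypothetical) biotoolsIDs.
entries = {
    "signalp": "<tool><biotoolsID>signalp</biotoolsID></tool>",
    "example_tool": "<tool><biotoolsID>example_tool</biotoolsID></tool>",
}

with tempfile.TemporaryDirectory() as tmp:
    paths = write_dump(entries, tmp)
    layout = sorted(p.relative_to(tmp).as_posix() for p in paths)

print(layout)
```

Because each tool owns a folder, later additions such as alternative formats or wrappers can sit next to the XML file without changing the top-level layout.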

Let's keep this issue for the data dump and use this for technical discussions about a GitHub-based content architecture.

Pls. bear in mind the priority on the DK side is getting the deployment and open-dev process sorted, critical / high priority issues scheduled for the 2019 Q1 release, the website redesign, and other features with direct impact on end-users.

The new content architecture under discussion would be awesome, but depends on other components including an independent curators interface e.g. based on edamToolAnnotator and independent validation mechanism, e.g. biotoolsLint. It's a lot of work, hence a matter of priorities.

@joncison
Member Author

This issue was moved to research-software-ecosystem/content#2

7 participants