usegalaxy.* shared data updater creates duplicates #162

Open
hexylena opened this issue Jul 15, 2020 · 4 comments

Comments

@hexylena
Member

So this is quite unfortunate: roughly every time the shared data updater runs, it creates some duplicates.

$ bash no-dupes.sh
# Checking eu
	Error! Duplicates!
      4 "/Metabolomics/Msi analyte distribution/DOI: 10.5281/zenodo.484496/Uploaded Composite Dataset (imzml)"
      3 "/Proteomics/Mass spectrometry imaging 1: Loading and exploring MSI data/DOI: 10.5281/zenodo.1560645/Uploaded Composite Dataset (imzml)"
# Checking org
	Error! Duplicates!
     52 "/Metabolomics/Msi analyte distribution/DOI: 10.5281/zenodo.484496/Uploaded Composite Dataset (imzml)"
     51 "/Proteomics/Mass spectrometry imaging 1: Loading and exploring MSI data/DOI: 10.5281/zenodo.1560645/Uploaded Composite Dataset (imzml)"
# Checking au
	Error! Duplicates!
     49 "/Metabolomics/Msi analyte distribution/DOI: 10.5281/zenodo.484496/Uploaded Composite Dataset (imzml)"
     53 "/Proteomics/Mass spectrometry imaging 1: Loading and exploring MSI data/DOI: 10.5281/zenodo.1560645/Uploaded Composite Dataset (imzml)"
# Checking fr
	Error! Duplicates!
      2 "/Metabolomics/Msi analyte distribution/DOI: 10.5281/zenodo.484496/Uploaded Composite Dataset (imzml)"
      2 "/Proteomics/Mass spectrometry imaging 1: Loading and exploring MSI data/DOI: 10.5281/zenodo.1560645/Uploaded Composite Dataset (imzml)"

EU has seen this quite prominently; the other servers less so. I'd never seen it on them until I wrote the script to check them just now, and judging by the counts, it has clearly been going on for quite some time.

https://github.com/usegalaxy-eu/shared-data/blob/master/no-dupes.sh is the script that does the check. It just dumps the contents of the GTN folder.
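
For readers who don't want to open the script, the check boils down to counting repeated paths in a listing. This is a minimal Python sketch of that idea, not the contents of no-dupes.sh; the `find_duplicates` helper and the sample paths are made up for illustration, and fetching the listing from a server is left out.

```python
from collections import Counter


def find_duplicates(paths):
    """Return {path: count} for every path that occurs more than once."""
    counts = Counter(paths)
    return {path: n for path, n in counts.items() if n > 1}


# Hypothetical listing of a library folder (illustrative paths only).
listing = [
    "/Metabolomics/A/Uploaded Composite Dataset (imzml)",
    "/Metabolomics/A/Uploaded Composite Dataset (imzml)",
    "/Proteomics/B/reads.fastq",
]

# Print in a "count path" layout similar to the report above.
for path, n in sorted(find_duplicates(listing).items()):
    print(f'{n:7d} "{path}"')
```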

Also, that API probably doesn't need enforced authentication, since I can browse the same shared data anonymously on the web.

I can add another script to try to remove the duplicates, but shared-data already has one script hacking around upload permissions, and another feels like too much duct tape.
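
If such a cleanup script were written anyway, its core could look roughly like the sketch below. It only decides which entries are redundant and does not delete anything. The `id` and `name` keys are assumptions about what a folder listing would contain, and `plan_removals` is a hypothetical helper.

```python
from collections import defaultdict


def plan_removals(entries):
    """Group entries by path, keep the first one seen, return the ids of the rest."""
    by_path = defaultdict(list)
    for entry in entries:
        # Assumption: each entry carries a full path in "name" and a unique "id".
        by_path[entry["name"]].append(entry["id"])
    return [extra for ids in by_path.values() for extra in ids[1:]]


# Hypothetical listing: two copies of /GTN/x, one copy of /GTN/y.
entries = [
    {"id": "a1", "name": "/GTN/x"},
    {"id": "b2", "name": "/GTN/x"},
    {"id": "c3", "name": "/GTN/y"},
]
print(plan_removals(entries))
```

Keeping the first copy is an arbitrary choice here; a real script might prefer the oldest or the most recently updated dataset.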

@mvdbeek
Member

mvdbeek commented Jul 15, 2020

What is the initial script that populates the folders? Which API routes does it use?

@gmauro
Member

gmauro commented Jul 15, 2020

@mvdbeek mvdbeek transferred this issue from galaxyproject/galaxy Jul 15, 2020
@mvdbeek
Member

mvdbeek commented Jul 15, 2020

I've moved this to ephemeris then. I don't think there is a logical path towards de-duplication on Galaxy's end (at least not without a checksum or something else that can identify a piece of data); IMO this should be handled in the setup-data-libraries script.
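
To make the checksum idea concrete, here is a minimal sketch of how an uploader could skip data it has already placed in a folder. This is not how setup-data-libraries currently works. `content_key` and `select_new_uploads` are hypothetical names, and how the checksums of already-uploaded datasets would be obtained is left open.

```python
import hashlib


def content_key(folder, data):
    """Identify a dataset by its target folder plus the SHA-256 of its bytes."""
    return (folder, hashlib.sha256(data).hexdigest())


def select_new_uploads(existing_keys, candidates):
    """Return only the (folder, data) pairs whose key has not been seen yet."""
    seen = set(existing_keys)
    todo = []
    for folder, data in candidates:
        key = content_key(folder, data)
        if key not in seen:
            seen.add(key)  # also guards against duplicates within one run
            todo.append((folder, data))
    return todo


# Hypothetical state: /GTN/a already holds a dataset with content b"x".
existing = {content_key("/GTN/a", b"x")}
candidates = [("/GTN/a", b"x"), ("/GTN/a", b"y"), ("/GTN/a", b"y")]
print(select_new_uploads(existing, candidates))
```

With a key like this, re-running the updater would be a no-op for unchanged data, at the cost of hashing each file before upload.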

@hexylena
Member Author

Sounds good. Thanks for the move, I should've made it here in the first place.

@Slugger70 @natefoo @lecorguille this issue affects all of you.
