[Bug?] Galago doesn't like UShER JSON (yet) #204

Open
AngieHinrichs opened this issue Sep 29, 2022 · 11 comments
Assignees
Labels
[type] bug Something isn't working

Comments

@AngieHinrichs

Describe the bug
This may be a bug in the JSON produced by the UShER web interface, not Galago, but they're not working together yet so let's figure it out.

Expected behavior / How to reproduce
This URL contains an Auspice V2 tree produced by an UShER web interface query:

https://genome.ucsc.edu/trash/ct/subtreeAuspice1_genome_16aa0_445360.json

[unfortunately that is a temporary file, note the "trash" in the name -- it will go away in a couple days, so I have saved a copy here: https://hgwdev.gi.ucsc.edu/~angie/XAY_XBA_XBC_2022-09-28.json ]

So I hoped this Galago Fetch URL would work:

https://galago.czgenepi.org/#/fetch/genome.ucsc.edu/trash/ct/subtreeAuspice1_genome_16aa0_445360.json

But I get an error "Woops! Error fetching tree file
We weren't able to import your tree data. Please confirm your URL is correct and publicly accessible, or upload your JSON file directly below."

Interestingly, I do get farther with fetch if I use the backup copy on a different server:

https://galago.czgenepi.org/#/fetch/hgwdev.gi.ucsc.edu/~angie/XAY_XBA_XBC_2022-09-28.json

-- that gets me as far as the "Analyze your data in Galago" dialog, where I can choose the pathogen (SARS-CoV-2) -- but I can't choose a State/Province, probably because my JSON has only the country level. There is a drop-down for State/Province, but it has no values.

Would it be possible to use the country metadata instead if the state metadata is missing from the JSON?
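
A minimal sketch of that fallback, assuming the tree uses the usual Auspice v2 layout where each node carries `node_attrs` entries of the form `{"value": ...}`. The field order and the function name are hypothetical and are not taken from Galago's code:

```python
# Assumption: geographic fields follow the Nextstrain naming convention,
# listed here from finest to coarsest. Galago's real logic may differ.
GEO_FIELDS = ("location", "division", "country", "region")


def finest_geo_level(node_attrs):
    """Return (field, value) for the finest geographic field present, else None."""
    for field in GEO_FIELDS:
        entry = node_attrs.get(field)
        if isinstance(entry, dict) and entry.get("value"):
            return field, entry["value"]
    return None


# A tip with country-level metadata only, as described above.
tip_attrs = {"country": {"value": "USA"}, "date": {"value": "2022-09-01"}}
print(finest_geo_level(tip_attrs))
```

With something like this, the State/Province drop-down could be populated from whichever level the JSON actually provides.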

AngieHinrichs added the "[type] bug" label on Sep 29, 2022
@sidneymbell
Collaborator

Ah! I somehow didn't get a notification for this issue. Thanks so much for investigating, @AngieHinrichs !

I just pushed a PR to our staging server to make all geographical data optional, and I'm mostly able to load your file via
https://galago-labs.czgenepi.org/#/fetch/https://hgwdev.gi.ucsc.edu/~angie/XAY_XBA_XBC_2022-09-28.json but it hiccups because it expects num_date rather than date.

This is easy to fix on my end. I'll get this up and running on prod by early next week at the latest and let you know as soon as it's ready. Thanks again! So excited :)

sidneymbell self-assigned this on Sep 30, 2022
sidneymbell added this to the Integration prep milestone on Sep 30, 2022
@AngieHinrichs
Author

Great! Yeah, UShER JSON doesn't have all of the cool stuff that Augur JSON does, but I'm glad you can work with it anyway! Looking forward to adding a linkout. 😄
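
For reference, the fetch links used in this thread are just the Galago fetch route followed by the tree URL, so a linkout could be built roughly as below. The function name is hypothetical. The thread shows the tree URL both with and without its `https://` scheme, and this sketch keeps whatever is passed in:

```python
# Production fetch route, as used in the first comment of this thread.
GALAGO_FETCH_BASE = "https://galago.czgenepi.org/#/fetch/"


def galago_linkout(tree_url, base=GALAGO_FETCH_BASE):
    """Append a publicly reachable tree JSON URL to a Galago fetch route."""
    return base + tree_url.strip()


# Staging link from the comment above.
print(galago_linkout(
    "https://hgwdev.gi.ucsc.edu/~angie/XAY_XBA_XBC_2022-09-28.json",
    base="https://galago-labs.czgenepi.org/#/fetch/",
))
```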

@sidneymbell
Collaborator

@AngieHinrichs -- I haven't forgotten about this! Got unexpectedly slammed with a few other things this week. Next week is looking wide open, though, and this is top of my list. Thanks for your patience.

@AngieHinrichs
Author

No worries, same here! :) (except not sure about next week) No pressure from my side. It will be easy to add a linkout whenever.

@sidneymbell
Collaborator

Hey @AngieHinrichs! At long last (apologies -- covid finally found us after 3 yrs), I've got a fix for this.

The issue was indeed parsing dates on our end. We now accept either date or num_date fields. I also made a couple tweaks to the visualizations to just leave out tips with no date field (there were just a handful in this test JSON with missing dates). Thanks again for flagging the incompatibility and providing the test data!
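
Roughly, the "either field" behaviour could look like the sketch below. This is an illustration and not the actual Galago patch. It assumes `num_date.value` is a decimal year and `date.value` is a `YYYY-MM-DD` string, and the helper names are hypothetical:

```python
from datetime import date


def decimal_year(iso_date):
    """Convert 'YYYY-MM-DD' to a decimal year; return None if it does not parse."""
    try:
        d = date.fromisoformat(iso_date)
    except (TypeError, ValueError):
        return None
    start = date(d.year, 1, 1)
    end = date(d.year + 1, 1, 1)
    return d.year + (d - start).days / (end - start).days


def tip_date(node_attrs):
    """Prefer num_date, fall back to date, and return None if neither is usable."""
    num = node_attrs.get("num_date", {}).get("value")
    if isinstance(num, (int, float)):
        return float(num)
    return decimal_year(node_attrs.get("date", {}).get("value"))


tips = [
    {"name": "A", "node_attrs": {"num_date": {"value": 2022.5}}},
    {"name": "B", "node_attrs": {"date": {"value": "2022-01-01"}}},
    {"name": "C", "node_attrs": {}},  # no date at all: left out
]
dated = [t["name"] for t in tips if tip_date(t["node_attrs"]) is not None]
print(dated)  # ['A', 'B']
```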

My current patch of leaving these samples out might not be a great solution for datasets with more than a few missing dates, though. Do most UShER samples come in with dates, or is it common to have a significant percentage of samples without?

@AngieHinrichs
Author

Hi Sidney! So sorry to hear about the covid, but good job avoiding it for so long. Glad you're back in dev-land.

The "UShER samples" are a mix of sequences from INSDC (GenBank, ENA, DDBJ) and/or GISAID (many sequences are in both and I attempt to de-duplicate). Most of them have dates, but not all, and some (by law in some locations) are year-month-only unfortunately. If it turns out to be a big problem then there are several things we could try, such as suggesting that people choose a larger subtree size in UShER to send onward to you so there's more margin for having to discard some samples.
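
To get a feel for how often this matters in a given subtree, one could tally date precision over the tips. A small sketch follows, assuming dates arrive as strings such as `2022-09-28`, `2022-09`, or `2022`; other encodings of partial dates would land in "other", and the function names are hypothetical:

```python
import re
from collections import Counter

_PATTERNS = (
    ("day", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("month", re.compile(r"\d{4}-\d{2}")),
    ("year", re.compile(r"\d{4}")),
)


def date_precision(value):
    """Classify a date string as 'day', 'month', 'year', 'other', or 'missing'."""
    if not isinstance(value, str) or not value.strip():
        return "missing"
    text = value.strip()
    for label, pattern in _PATTERNS:
        if pattern.fullmatch(text):
            return label
    return "other"


def precision_counts(values):
    """Count how many dates fall into each precision class."""
    return Counter(date_precision(v) for v in values)


print(precision_counts(["2022-09-28", "2022-09", None, "2022-09-27"]))
```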

Is there an optimal range of sizes for Galago input trees? Does it depend on the number of the user's samples of interest? I imagine some users might upload a handful of sequences from an outbreak that probably fall into one or two subtrees, while others might have hundreds of sequences from a week's worth of runs in their lab (potentially resulting in many subtrees). The UShER web interface's default subtree size is 50 which is OK for finding the few most closely related sequences, but for other purposes like evaluating a possible new lineage for pangolin, 1000 is a better size. The max is 5000.

@sidneymbell
Collaborator

Glad to be back! Although I've got a lot of foggy brain still, so lmk if any of this doesn't make sense :)

We can accommodate any of those tree sizes, although performance is best at <3000-3500ish.
We also have some UI tools to help users sift through a given tree to find clades with their samples of interest.
One thing to note is that (at least for now) Galago only ingests one tree at a time.

In an ideal world, I'd recommend something along the lines of:

  • N input samples + 1.5N contextual samples
  • Min = 50
  • Max = 3500

But any of the subtree sizes you mention above (N=50 - 5000) should be fine as a first step.
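
As a worked example of that rule of thumb, assuming the min and max apply to the total tree size (input plus contextual samples). The function name and the rounding choice are hypothetical:

```python
import math


def recommended_subtree_size(n_input, context_ratio=1.5, minimum=50, maximum=3500):
    """Total tree size: N input samples plus 1.5N contextual, clamped to [50, 3500]."""
    if n_input < 0:
        raise ValueError("n_input must be non-negative")
    total = n_input + math.ceil(context_ratio * n_input)
    return max(minimum, min(maximum, total))


for n in (10, 100, 2000):
    print(n, recommended_subtree_size(n))
# 10 -> 50 (raised to the minimum), 100 -> 250, 2000 -> 3500 (capped)
```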

@sidneymbell
Collaborator

@AngieHinrichs -- another idea we could think about at some point -- Galago helps the user find which clade(s) to generate a report for based on their samples of interest. It could be useful to pass through the names of their input samples via query param, although this could very quickly get too long and cumbersome to be functional. Would need to noodle on this a bit more.

@AngieHinrichs
Author

Great about the sample size flexibility.

> It could be useful to pass through the names of their input samples via query param, although this could very quickly get too long and cumbersome to be functional.

Yeah. Maybe in a text file alongside the JSON file that has the tree? One name per line? Or -- actually they can be extracted from the JSON itself, filter nodes for userOrOld == "uploaded sample" if there's already a convenient way to do that.
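
A sketch of that filter, assuming the Auspice v2 layout with a single root object under `tree` and `userOrOld` stored in `node_attrs` as `{"value": ...}`. The function name is hypothetical:

```python
def uploaded_sample_names(auspice_json):
    """Collect tip names whose userOrOld attribute is 'uploaded sample'."""
    names = []
    stack = [auspice_json["tree"]]
    while stack:
        node = stack.pop()
        children = node.get("children", [])
        attr = node.get("node_attrs", {}).get("userOrOld", {})
        if not children and attr.get("value") == "uploaded sample":
            names.append(node["name"])
        # Reverse so that children are visited in their original order.
        stack.extend(reversed(children))
    return names


# Tiny made-up tree for illustration; the names are not real samples.
example = {
    "tree": {
        "name": "root",
        "children": [
            {"name": "mine_1", "node_attrs": {"userOrOld": {"value": "uploaded sample"}}},
            {"name": "node_1", "children": [
                {"name": "context_1", "node_attrs": {"country": {"value": "USA"}}},
                {"name": "mine_2", "node_attrs": {"userOrOld": {"value": "uploaded sample"}}},
            ]},
        ],
    }
}
print(uploaded_sample_names(example))  # ['mine_1', 'mine_2']
```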

@sidneymbell
Collaborator

sidneymbell commented Oct 19, 2022 via email

@AngieHinrichs
Author

Hi @sidneymbell -- sorry I let this all get buried in my inbox for, yikes! almost a year! 🤯 But I would still like to link out to Galago. This is the linkout format that I have:

https://galago-labs.czgenepi.org/#/fetch/https://hgwdev.gi.ucsc.edu/~angie/test_UShER_MicrobeTrace.json

but when I try that I get an error message:
[Screenshot of the error message]

Javascript console says

index.a7b9082c.js:277 XHR failed loading: GET "https://hgwdev.gi.ucsc.edu/~angie/test_UShER_MicrobeTrace.json".

I can view https://hgwdev.gi.ucsc.edu/~angie/test_UShER_MicrobeTrace.json in my web browser and see its response headers with curl:

curl -SsI https://hgwdev.gi.ucsc.edu/~angie/test_UShER_MicrobeTrace.json

HTTP/1.1 200 OK
Date: Tue, 12 Sep 2023 16:19:15 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips PHP/5.4.16 mod_wsgi/3.4 Python/2.7.5
Last-Modified: Thu, 07 Sep 2023 18:16:07 GMT
ETag: "943b-604c8da97400c"
Accept-Ranges: bytes
Content-Length: 37947
Vary: Accept-Encoding
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: Range
Content-Type: application/json

? If you don't have time to work on this, no problem! Just wanted to let you know I'm still interested if you do have time. 🙂
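
For what it's worth, the headers above include `Access-Control-Allow-Origin: *`, which would normally satisfy a browser's CORS check for a simple, credential-free GET. A small sketch for checking that from saved `curl -I` output is below. The helper names are hypothetical, and this does not diagnose why the XHR fails:

```python
def parse_header_block(text):
    """Parse `curl -I` output into a dict keyed by lower-cased header name."""
    headers = {}
    for line in text.strip().splitlines()[1:]:  # skip the status line
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def allows_origin(headers, origin):
    """True if Access-Control-Allow-Origin is '*' or exactly matches origin."""
    allowed = headers.get("access-control-allow-origin")
    return allowed == "*" or allowed == origin


response = """HTTP/1.1 200 OK
Date: Tue, 12 Sep 2023 16:19:15 GMT
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: Range
Content-Type: application/json
"""
headers = parse_header_block(response)
print(allows_origin(headers, "https://galago-labs.czgenepi.org"))  # True
```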

2 participants