Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get-gtdb refactor #169

Merged
merged 7 commits into from
Jan 10, 2024
Merged

get-gtdb refactor #169

merged 7 commits into from
Jan 10, 2024

Conversation

mikerobeson
Copy link
Collaborator

@mikerobeson mikerobeson commented Dec 12, 2023

Fixes #167 and #168.

This refactor:

  • File parsing is different.
    • We now pull taxonomy from the FASTA files directly, not the taxonomy files.
  • Download Options:
    • GTDB SSU Species Representatives, as we've been doing
      • Both
      • Bacteria
      • Archaea
    • Non-species representative GTDB SSUs
      • Contains all Bacteria and Archaea that have passed GTDBs QA/QC But are not clustered into species representatives.

To do:

  • update test code
  • break up code into simpler functions where possible.

Copy link
Collaborator

@nbokulich nbokulich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @mikerobeson looks good, thanks! I pulled down and tested and it works like a charm 😉 . I just have some super minor questions/comments on the code inline.

rescript/get_gtdb.py Outdated Show resolved Hide resolved
rescript/get_gtdb.py Outdated Show resolved Hide resolved
'207': {'Archaea': 'ar53', 'Bacteria': 'bac120'},
'202': {'Archaea': 'ar122', 'Bacteria': 'bac120'}}

VERSION_MAP_DICT = {'214.1': {'Archaea': 'ar53', 'Bacteria': 'bac120'},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

theoretical question: should we keep the exact same nomenclature as used by GTDB? Or introduce the decimals for all versions as you have done here? I don't have an opinion on this but just pausing for a moment's thought. Would any users be confused that we support a version 207.0 but this does not exist in GTDB (strictly speaking)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was pondering this myself. The issue is that the http paths all contain a folder with a decimal:
https://data.gtdb.ecogenomic.org/releases/releases/release202/202.0/
https://data.gtdb.ecogenomic.org/releases/releases/release207/207.0/
https://data.gtdb.ecogenomic.org/releases/releases/release214/214.0/
https://data.gtdb.ecogenomic.org/releases/releases/release214/214.1/

At the time I figured I'd just use that folder label and then parse out the base version, e.g. 214 from 214.1 rather than split off the decimal and rebuild it for the http string.

All I'd need to do is check for a decimal within the version string e.g. 214.1 and then use that for the folder. Then append a .0 to those versions that do not have a a decimal, e.g. 207, etc...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah makes sense... let's use the decimals then.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR should be good to go then. :-)

Copy link
Collaborator

@nbokulich nbokulich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙌

@nbokulich nbokulich merged commit 51df26b into bokulich-lab:master Jan 10, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Add option to download ssu_all_<release>.tar.gz from GTDB.
2 participants