How to export the viral DNA gene sequence? #116

Open
wfgui opened this issue Aug 2, 2024 · 5 comments

@wfgui

wfgui commented Aug 2, 2024

Hi,
In the example above I can see the protein FASTA file GCF_009025895.1_virus_proteins.faa. I want to calculate gene abundance using the viral gene sequences. Is it possible to output the corresponding nucleotide sequences as well?

Thanks!

@apcamargo
Owner

There is currently no option to do this, but I could implement it as a feature in the future. In the meantime, you can obtain the nucleotide sequences of the CDSs by extracting them from the genomes using the gene coordinates.
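
The coordinate-based extraction suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions, not geNomad functionality: it assumes each gene row provides `gene`, `start`, `end`, and `strand` fields, that coordinates are 1-based and inclusive, that the strand is encoded as `1` or `-1`, and that gene names are the contig name followed by `_<number>`. The function names are hypothetical, so check the assumptions against your own gene table before relying on it.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Complement table for the standard bases plus N (upper and lower case).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_cds(genome: str, start: int, end: int, strand: int) -> str:
    """Slice a CDS out of a genome.

    Assumption: start/end are 1-based and inclusive, strand is 1 or -1.
    """
    subseq = genome[start - 1:end]
    return reverse_complement(subseq) if strand == -1 else subseq


def cds_records(
    genomes: Dict[str, str], gene_rows: Iterable[Dict[str, str]]
) -> Iterator[Tuple[str, str]]:
    """Yield (gene name, nucleotide sequence) pairs.

    Assumption: the gene name is "<contig>_<number>", so the contig name is
    recovered by dropping the last underscore-separated field.
    """
    for row in gene_rows:
        contig = row["gene"].rsplit("_", 1)[0]
        yield row["gene"], extract_cds(
            genomes[contig], int(row["start"]), int(row["end"]), int(row["strand"])
        )


# Toy data standing in for a genome FASTA and a gene table.
genomes = {"contig_1": "AAATGCCCTAAGG"}
gene_rows = [
    {"gene": "contig_1_1", "start": "3", "end": "11", "strand": "1"},
    {"gene": "contig_1_2", "start": "3", "end": "11", "strand": "-1"},
]
for gene, seq in cds_records(genomes, gene_rows):
    print(f">{gene}\n{seq}")
```

With real data, the rows could come from `csv.DictReader(handle, delimiter="\t")` over the gene table, and `genomes` from any FASTA parser that maps sequence names to sequences.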

@wfgui
Author

wfgui commented Aug 14, 2024

I also have a seemingly simple question: can I reformat the taxonomy output, for example by converting it to k__; p__; c__; o__; f__; g__; s__?

Thanks!

@apcamargo
Owner

apcamargo commented Aug 19, 2024

You can use taxopy for that. geNomad's taxdump is inside the database directory, and you can find the TaxIds in the <prefix>_annotate/<prefix>_taxonomy.tsv file.

For instance:

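import taxopy

# Load geNomad's taxdump from the database directory
taxdb = taxopy.TaxDb(
    nodes_dmp="genomad_db/nodes.dmp",
    names_dmp="genomad_db/names.dmp",
    keep_files=True
)
# Example TaxId; look up the ones for your sequences in <prefix>_taxonomy.tsv
taxon = taxopy.Taxon(5797, taxdb)
# Reverse the lineage so it prints from the highest rank down
for rank, name in reversed(taxon.ranked_name_lineage):
    if name != "root":
        print(f"{rank}__{name}")

Output:

realm__Duplodnaviria
kingdom__Heunggongvirae
phylum__Uroviricota
class__Caudoviricetes
order__Crassvirales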
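
To get the single-line k__; p__; ... layout asked about earlier, the (rank, name) pairs could be collapsed as in the sketch below. It is only an illustration: `format_lineage` and `RANK_PREFIXES` are hypothetical names, not part of taxopy or geNomad, and the input is a hard-coded list mirroring the lineage printed above rather than a live taxopy lookup. Ranks without an assigned name are left empty.

```python
from typing import Iterable, Tuple

# Ranks to report and their one-letter prefixes (an assumed convention,
# matching the k__; p__; c__; o__; f__; g__; s__ layout requested above).
RANK_PREFIXES = [
    ("kingdom", "k"),
    ("phylum", "p"),
    ("class", "c"),
    ("order", "o"),
    ("family", "f"),
    ("genus", "g"),
    ("species", "s"),
]


def format_lineage(ranked_lineage: Iterable[Tuple[str, str]], sep: str = "; ") -> str:
    """Join (rank, name) pairs into one prefixed string; missing ranks stay empty."""
    names = dict(ranked_lineage)
    return sep.join(f"{prefix}__{names.get(rank, '')}" for rank, prefix in RANK_PREFIXES)


# Hard-coded (rank, name) pairs mirroring the example lineage above.
lineage = [
    ("order", "Crassvirales"),
    ("class", "Caudoviricetes"),
    ("phylum", "Uroviricota"),
    ("kingdom", "Heunggongvirae"),
    ("realm", "Duplodnaviria"),
    ("no rank", "root"),
]
print(format_lineage(lineage))
# k__Heunggongvirae; p__Uroviricota; c__Caudoviricetes; o__Crassvirales; f__; g__; s__
```

Note that the seven-prefix layout has no slot for the realm, so Duplodnaviria is dropped here; add a ("realm", "r") entry to the table if you want to keep it.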

@wfgui
Author

wfgui commented Aug 23, 2024

What's the difference between "Unclassified" and "Viruses;;;;;;" ?

@apcamargo
Owner

"Unclassified" means that the genes in the sequence had no matches to markers with taxonomy information. "Viruses" means that the classification is uncertain at a high rank.

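When post-processing the taxonomy table, the two cases described above can be told apart with a small helper such as the one sketched here. It assumes the lineage is a semicolon-separated string like the "Viruses;;;;;;" quoted in the question; `deepest_assigned` is a hypothetical name and the longer lineage string is made up for illustration.

```python
from typing import Optional


def deepest_assigned(lineage: str) -> Optional[str]:
    """Return the most specific name in a semicolon-separated lineage.

    Returns None for "Unclassified" (no matches to markers with taxonomy
    information). A return value of "Viruses" means no lower rank is assigned.
    """
    if lineage == "Unclassified":
        return None
    names = [name for name in lineage.split(";") if name]
    return names[-1] if names else None


for example in (
    "Unclassified",
    "Viruses;;;;;;",
    "Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;",
):
    print(example, "->", deepest_assigned(example))
```
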