New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specify the gene name feature type for GFF3 files #486
Comments
As an alternative consideration use prodigal... A similar issue came up there and @althonos has already added this feature. |
Are you wanting an option to specify which field the GFF parsing should use for unique IDs instead of Otherwise, I'm not quite sure what you're asking for here, since we're already using a wrapper around prodigal to uniquely name features found, and all other prodigal uses are pre-antiSMASH. I also don't recognise "featureCounts" in an antiSMASH context. |
Yes, if one could specify using The difficulty with Prodigal's ID field is that the gene identifiers will not be unique between genomes. In Prodigal, the fasta files that are created for proteins and cds use a longer gene id that includes the entire contig name which is more informative. I'm running these on individual genomes in a metagenome so have gene ids that are unique between genomes comes in handy. For example, if you had this for your GFF3 CDS record:
You would be able to select the default
It's a bit out of context but I meant just to use the same phrasing. For example, here is the doc on featureCounts:
If I used featureCounts with Prodigal default output, then I would use |
In general, antiSMASH expects either a sequence with no annotations (i.e. just a FASTA file), or a sequence with the annotations a user wants to use, with the standards for whichever format the input is provided as (GenBank, FASTA + GFF, etc). Adding alternative interpretation options for those standards to antiSMASH is a very specific edge case that isn't really relevant to what antiSMASH does. Since the user is in full control of their inputs and when they're running antiSMASH, it seems to us that the best place for that to happen is outside antiSMASH. The user can adjust their data however they like and then, once they're satisfied with their annotations, give them to antiSMASH. |
Agreed, that's what I'm providing. A fasta file + a GFF file. The only difference would be that a user could specify to use a specific field that is used for the gene label ( This will also be helpful for eukaryotic gene models from metaeuk.
Agreed, but the feature request doesn't fall in this category since it's still using gene identifiers just choosing which field in the GFF to use. |
I think what Simon is saying is that from our perspective the correct way to fix this would be to make sure that your Having said that, I don't even see how we'd do this on the antiSMASH side from a technical perspective apart from completely replacing our GFF3 parser. We're currently using the BCBio GFF parser library, and that's hard-wired to use the |
Oh ok, yea I can definitely do that if it's not of interest. My thought process was that there are a lot of tools being developed for eukaryotic gene modeling (e.g., MetaEuk, MetaEukSanity, etc) that will be used for fungal and plant genomes which may not come with an
I looked into it and the information is in the qualifiers for each new_feature.qualifiers
# {'ID': ['9927_3'],
# 'partial': ['01'],
# 'start_type': ['ATG'],
# 'rbs_motif': ['AACAA'],
# 'rbs_spacer': ['15bp'],
# 'gc_cont': ['0.533'],
# 'conf': ['99.38'],
# 'score': ['22.10', '22.1'],
# 'cscore': ['10.51'],
# 'sscore': ['11.58'],
# 'rscore': ['8.81'],
# 'uscore': ['-0.53'],
# 'tscore': ['3.30'],
# 'contig_id': ['NODE_9927_length_1904_cov_4.437534'],
# 'gene_id': ['NODE_9927_length_1904_cov_4.437534_3'],
# 'source': ['Prodigal_v2.6.3'],
# 'phase': ['0']} If you were interested, this should work: def get_features_from_file(handle: IO, gene_identifier_field:str="ID") -> Dict[str, List[SeqFeature]]:
""" Generates new SeqFeatures from a GFF file.
Arguments:
handle: a file handle/stream with the GFF contents
Returns:
a dictionary mapping record ID to a list of SeqFeatures for that record
"""
#
# try:
# gff_records = list(GFF.parse(handle))
# except Exception as err:
# raise AntismashInputError("could not parse records from GFF3 file") from err
gff_records = list(GFF.parse(handle)) # This is to test locally where I don't have antiSMASH installed
results = {}
for gff_record in gff_records:
features = []
for feature in gff_record.features:
# ----------------------------------------------------
# This section is new
assert gene_identifier_field in feature.qualifiers, "'{}' not available in feature qualifiers '{}' on sequence record '{}'".format(gene_identifier_field, feature.id, gff_record.id)
feature.id = feature.qualifiers[gene_identifier_field][0]
# ----------------------------------------------------
if feature.type == 'CDS':
new_features = [feature]
else:
new_features = check_sub(feature)
if feature.type == "gene":
features.append(feature)
if not new_features:
continue
name = feature.id
locus_tag = feature.qualifiers.get("locus_tag")
for qtype in ["gene", "name", "Name"]:
if qtype in feature.qualifiers:
name_tmp = feature.qualifiers[qtype][0]
# Assume name/Name to be sane if they don't contain a space
if " " in name_tmp:
continue
name = name_tmp
break
multiple_cds = len(list(filter(lambda x: x.type == "CDS", new_features))) > 1
for i, new_feature in enumerate(new_features):
variant = name
if new_feature.type == "CDS" and multiple_cds:
variant = "{0}_{1}".format(name, i)
new_feature.qualifiers['gene'] = [variant]
if locus_tag is not None:
new_feature.qualifiers["locus_tag"] = locus_tag
features.append(new_feature)
results[gff_record.id] = features
return results I've added a new variable Here's the default behavior that would maintain functionality:
Here's using a custom gene id field: import gzip
gff_filepath = "SRR17458646__METABAT2__P.1__bin.1.gff.gz"
with gzip.open(gff_filepath, "rt") as f:
results = get_features_from_file(f, "gene_id")
results
{'NODE_100_length_41737_cov_5.284607': [
SeqFeature(FeatureLocation(ExactPosition(2), ExactPosition(596), strand=1), type='CDS', id='NODE_100_length_41737_cov_5.284607_1'),
SeqFeature(FeatureLocation(ExactPosition(812), ExactPosition(3224), strand=1), type='CDS', id='NODE_100_length_41737_cov_5.284607_2'),
SeqFeature(FeatureLocation(ExactPosition(3239), ExactPosition(3650), strand=1), type='CDS', id='NODE_100_length_41737_cov_5.284607_3'),
SeqFeature(FeatureLocation(ExactPosition(3747), ExactPosition(4527), strand=-1), type='CDS',
... Here's the GFF file I used: Hope this helps. |
Is your feature request related to a problem? Please describe.
The default GFF3 created by Prodigal uses names like
22_1
for the gene ID instead of the actual protein/cds files created that use more informative names likeNODE_22_length_13167_cov_40.764185_1
.Describe the solution you'd like
I wrote a wrapper for the Prodigal output that adds a more informative gene_id that can be used with other tools that expect this field: https://github.com/jolespin/veba/blob/main/src/scripts/append_geneid_to_prodigal_gff.py
I'd like to be able to specify this using the
--genefinding-gff3
option. For example adding a--genefinding-gff3-feature-type
(similar phrasing to featureCounts)Describe alternatives you've considered
Post hoc scripts
Additional context
The default behavior can be used by setting the default value to
ID
The text was updated successfully, but these errors were encountered: