Specify the gene name feature type for GFF3 files #486

jolespin · 2022-10-31T20:49:06Z

Is your feature request related to a problem? Please describe.
The default GFF3 created by Prodigal uses names like 22_1 for the gene ID instead of the actual protein/cds files created that use more informative names like NODE_22_length_13167_cov_40.764185_1.

Describe the solution you'd like
I wrote a wrapper for the Prodigal output that adds a more informative gene_id that can be used with other tools that expect this field: https://github.com/jolespin/veba/blob/main/src/scripts/append_geneid_to_prodigal_gff.py

I'd like to be able to specify this using the --genefinding-gff3 option. For example adding a --genefinding-gff3-feature-type (similar phrasing to featureCounts)

Describe alternatives you've considered
Post hoc scripts

Additional context
The default behavior can be used by setting the default value to ID

The text was updated successfully, but these errors were encountered:

zdk123 · 2022-10-31T21:52:25Z

As an alternative consideration use prodigal... A similar issue came up there and @althonos has already added this feature.
althonos/pyrodigal#18

SJShaw · 2022-11-01T11:22:03Z

Are you wanting an option to specify which field the GFF parsing should use for unique IDs instead of Id (or an option that specifically uses the Name field)?

Otherwise, I'm not quite sure what you're asking for here, since we're already using a wrapper around prodigal to uniquely name features found, and all other prodigal uses are pre-antiSMASH. I also don't recognise "featureCounts" in an antiSMASH context.

jolespin · 2022-11-01T14:15:12Z

Are you wanting an option to specify which field the GFF parsing should use for unique IDs instead of Id (or an option that specifically uses the Name field)?

Yes, if one could specify using --genefinding-gff3-feature-type gene_id'

The difficulty with Prodigal's ID field is that the gene identifiers will not be unique between genomes. In Prodigal, the fasta files that are created for proteins and cds use a longer gene id that includes the entire contig name which is more informative. I'm running these on individual genomes in a metagenome so have gene ids that are unique between genomes comes in handy.

For example, if you had this for your GFF3 CDS record:

NODE_22_length_13167_cov_40.764185      Prodigal_v2.6.3 CDS     2       622     52.8    -       0       ID=22_1;partial=10;start_type=ATG;rbs_motif=4Base/6BMM;rbs_spacer=13-15bp;gc_cont=0.654;conf=100.00;score=54.01;cscore=49.47;sscore=4.54;rscore=0.21;uscore=-0.56;tscore=3.65;contig_id=NODE_22_length_13167_cov_40.764185;gene_id=NODE_22_length_13167_cov_40.764185_1;

You would be able to select the default ID (what antiSMASH uses I believe) but could alternatively use gene_id. The former would use 22_1 for the gene name which isn't unique across multiple genomes. The latter would use NODE_22_length_13167_cov_40.764185_1 as the gene name.

I also don't recognise "featureCounts" in an antiSMASH context.

It's a bit out of context but I meant just to use the same phrasing. For example, here is the doc on featureCounts:

  -g <string>         Specify attribute type in GTF annotation. 'gene_id' by
                      default. Meta-features used for read counting will be
                      extracted from annotation using the provided value.

If I used featureCounts with Prodigal default output, then I would use -g ID instead of -g gene_id. It would be useful if antiSMASH had similar functionality.

SJShaw · 2022-11-02T11:15:49Z

In general, antiSMASH expects either a sequence with no annotations (i.e. just a FASTA file), or a sequence with the annotations a user wants to use, with the standards for whichever format the input is provided as (GenBank, FASTA + GFF, etc).

Adding alternative interpretation options for those standards to antiSMASH is a very specific edge case that isn't really relevant to what antiSMASH does. Since the user is in full control of their inputs and when they're running antiSMASH, it seems to us that the best place for that to happen is outside antiSMASH. The user can adjust their data however they like and then, once they're satisfied with their annotations, give them to antiSMASH.

jolespin · 2022-11-03T18:22:48Z

In general, antiSMASH expects either a sequence with no annotations (i.e. just a FASTA file), or a sequence with the annotations a user wants to use, with the standards for whichever format the input is provided as (GenBank, FASTA + GFF, etc).

Agreed, that's what I'm providing. A fasta file + a GFF file. The only difference would be that a user could specify to use a specific field that is used for the gene label (gene_id instead of the default ID).

This will also be helpful for eukaryotic gene models from metaeuk.

Adding alternative interpretation options for those standards to antiSMASH is a very specific edge case that isn't really relevant to what antiSMASH does.

Agreed, but the feature request doesn't fall in this category since it's still using gene identifiers just choosing which field in the GFF to use.

kblin · 2022-11-04T12:13:26Z

I think what Simon is saying is that from our perspective the correct way to fix this would be to make sure that your ID field looks the way you want it to look, instead of how prodigal wants it to look.

Having said that, I don't even see how we'd do this on the antiSMASH side from a technical perspective apart from completely replacing our GFF3 parser. We're currently using the BCBio GFF parser library, and that's hard-wired to use the ID attribute for all I can see.

jolespin · 2022-11-04T16:57:50Z

I think what Simon is saying is that from our perspective the correct way to fix this would be to make sure that your ID field looks the way you want it to look, instead of how prodigal wants it to look.

Oh ok, yea I can definitely do that if it's not of interest. My thought process was that there are a lot of tools being developed for eukaryotic gene modeling (e.g., MetaEuk, MetaEukSanity, etc) that will be used for fungal and plant genomes which may not come with an ID field. If no one else has brought this up then maybe it's not as pressing as I had imagined.

Having said that, I don't even see how we'd do this on the antiSMASH side from a technical perspective apart from completely replacing our GFF3 parser. We're currently using the BCBio GFF parser library, and that's hard-wired to use the ID attribute for all I can see

I looked into it and the information is in the qualifiers for each new_feature object:

new_feature.qualifiers
# {'ID': ['9927_3'],
#  'partial': ['01'],
#  'start_type': ['ATG'],
#  'rbs_motif': ['AACAA'],
#  'rbs_spacer': ['15bp'],
#  'gc_cont': ['0.533'],
#  'conf': ['99.38'],
#  'score': ['22.10', '22.1'],
#  'cscore': ['10.51'],
#  'sscore': ['11.58'],
#  'rscore': ['8.81'],
#  'uscore': ['-0.53'],
#  'tscore': ['3.30'],
#  'contig_id': ['NODE_9927_length_1904_cov_4.437534'],
#  'gene_id': ['NODE_9927_length_1904_cov_4.437534_3'],
#  'source': ['Prodigal_v2.6.3'],
#  'phase': ['0']}

If you were interested, this should work:

def get_features_from_file(handle: IO, gene_identifier_field:str="ID") -> Dict[str, List[SeqFeature]]:
    """ Generates new SeqFeatures from a GFF file.
        Arguments:
            handle: a file handle/stream with the GFF contents
        Returns:
            a dictionary mapping record ID to a list of SeqFeatures for that record
    """
    # 
    # try:
    #     gff_records = list(GFF.parse(handle))
    # except Exception as err:
    #     raise AntismashInputError("could not parse records from GFF3 file") from err
    
    gff_records = list(GFF.parse(handle)) # This is to test locally where I don't have antiSMASH installed

    results = {}
    for gff_record in gff_records:
        features = []
        for feature in gff_record.features:
            # ----------------------------------------------------
            # This section is new
            assert gene_identifier_field in feature.qualifiers, "'{}' not available in feature qualifiers '{}' on sequence record '{}'".format(gene_identifier_field, feature.id, gff_record.id)
            feature.id = feature.qualifiers[gene_identifier_field][0]
            # ----------------------------------------------------
            if feature.type == 'CDS':
                new_features = [feature]
            else:
                new_features = check_sub(feature)
                if feature.type == "gene":
                    features.append(feature)
                if not new_features:
                    continue

            name = feature.id
            locus_tag = feature.qualifiers.get("locus_tag")

            for qtype in ["gene", "name", "Name"]:
                if qtype in feature.qualifiers:
                    name_tmp = feature.qualifiers[qtype][0]
                    # Assume name/Name to be sane if they don't contain a space
                    if " " in name_tmp:
                        continue
                    name = name_tmp
                    break

            multiple_cds = len(list(filter(lambda x: x.type == "CDS", new_features))) > 1
            for i, new_feature in enumerate(new_features):
                variant = name
                if new_feature.type == "CDS" and multiple_cds:
                    variant = "{0}_{1}".format(name, i)
                new_feature.qualifiers['gene'] = [variant]
                if locus_tag is not None:
                    new_feature.qualifiers["locus_tag"] = locus_tag
                features.append(new_feature)
        results[gff_record.id] = features
    return results

I've added a new variable gene_identifier_field with default set to ID.

Here's the default behavior that would maintain functionality:

import gzip
gff_filepath = "SRR17458646__METABAT2__P.1__bin.1.gff.gz"
with gzip.open(gff_filepath, "rt") as f:
    results = get_features_from_file(f, "ID")
results

{'NODE_100_length_41737_cov_5.284607': [
  SeqFeature(FeatureLocation(ExactPosition(2), ExactPosition(596), strand=1), type='CDS', id='100_1'),
  SeqFeature(FeatureLocation(ExactPosition(812), ExactPosition(3224), strand=1), type='CDS', id='100_2'),
  SeqFeature(FeatureLocation(ExactPosition(3239), ExactPosition(3650), strand=1), type='CDS', id='100_3'),

...

Here's using a custom gene id field:

import gzip
gff_filepath = "SRR17458646__METABAT2__P.1__bin.1.gff.gz"
with gzip.open(gff_filepath, "rt") as f:
    results = get_features_from_file(f, "gene_id")
results

{'NODE_100_length_41737_cov_5.284607': [
   SeqFeature(FeatureLocation(ExactPosition(2), ExactPosition(596), strand=1), type='CDS', id='NODE_100_length_41737_cov_5.284607_1'),
  SeqFeature(FeatureLocation(ExactPosition(812), ExactPosition(3224), strand=1), type='CDS', id='NODE_100_length_41737_cov_5.284607_2'),
  SeqFeature(FeatureLocation(ExactPosition(3239), ExactPosition(3650), strand=1), type='CDS', id='NODE_100_length_41737_cov_5.284607_3'),
  SeqFeature(FeatureLocation(ExactPosition(3747), ExactPosition(4527), strand=-1), type='CDS', 
...

Here's the GFF file I used:
SRR17458646__METABAT2__P.1__bin.1.gff.gz

Hope this helps.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Specify the gene name feature type for GFF3 files #486

Specify the gene name feature type for GFF3 files #486

jolespin commented Oct 31, 2022

zdk123 commented Oct 31, 2022

SJShaw commented Nov 1, 2022

jolespin commented Nov 1, 2022 •

edited

SJShaw commented Nov 2, 2022

jolespin commented Nov 3, 2022

kblin commented Nov 4, 2022

jolespin commented Nov 4, 2022 •

edited

Specify the gene name feature type for GFF3 files #486

Specify the gene name feature type for GFF3 files #486

Comments

jolespin commented Oct 31, 2022

zdk123 commented Oct 31, 2022

SJShaw commented Nov 1, 2022

jolespin commented Nov 1, 2022 • edited

SJShaw commented Nov 2, 2022

jolespin commented Nov 3, 2022

kblin commented Nov 4, 2022

jolespin commented Nov 4, 2022 • edited

jolespin commented Nov 1, 2022 •

edited

jolespin commented Nov 4, 2022 •

edited