The following GFF3 file will be used as example, to show how gffpandas has to be used. It contains a header and eleven annotation entries:
##gff-version 3 ##sequence-region NC_016810.1 1 20 NC_016810.1 RefSeq region 1 4000 . + . Dbxref=taxon:216597;ID=id0;gbkey=Src;genome=genomic;mol_type=genomic DNA;serovar=Typhimurium;strain=SL1344 NC_016810.1 RefSeq gene 1 20 . + . ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_tag=SL1344_0001 NC_016810.1 RefSeq CDS 13 235 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene1;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1 RefSeq gene 1 20 . + . ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_tag=SL1344_0002 NC_016810.1 RefSeq CDS 341 523 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene2;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1 RefSeq gene 1 600 . - . ID=gene3;Name=thrX;gbkey=Gene;gene=thrX;locus_tag=SL1344_0003 NC_016810.1 RefSeq CDS 21 345 . - 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene3;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1 RefSeq gene 41 255 . + . ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_tag=SL1344_0004 NC_016810.1 RefSeq CDS 61 195 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene4;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1 RefSeq gene 170 546 . + . ID=gene5;Name=thrC;gbkey=Gene;gene=thrC;locus_tag=SL1344_0005 NC_016810.1 RefSeq CDS 34 335 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene5;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
The library can be imported as the following:
import gffpandas.gffpandas as gffpd
First step is to read in the GFF3 file with the method called 'read_gff3'. Then a dataframe (.df) or rather header (.header) can be returned:
>>> annotation = gffpd.read_gff3('annotation.gff') >>> print(annotation.header) >>> print(annotation.df) Out[1]: ##gff-version 3 ##sequence-region NC_016810.1 1 20 seq_id source type start end score strand phase \ 0 NC_016810.1 RefSeq region 1 4000 . + . 1 NC_016810.1 RefSeq gene 1 20 . + . 2 NC_016810.1 RefSeq CDS 13 235 . + 0 3 NC_016810.1 RefSeq gene 1 20 . + . 4 NC_016810.1 RefSeq CDS 341 523 . + 0 5 NC_016810.1 RefSeq gene 1 600 . - . 6 NC_016810.1 RefSeq CDS 21 345 . - 0 7 NC_016810.1 RefSeq gene 41 255 . + . 8 NC_016810.1 RefSeq CDS 61 195 . + 0 9 NC_016810.1 RefSeq gene 170 546 . + . 10 NC_016810.1 RefSeq CDS 34 335 . + 0 attributes 0 Dbxref=taxon:216597;ID=id0;gbkey=Src;genome=ge... 1 ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_... 2 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... 3 ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_... 4 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... 5 ID=gene3;Name=thrX;gbkey=Gene;gene=thrX;locus_... 6 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... 7 ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_... 8 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... 9 ID=gene5;Name=thrC;gbkey=Gene;gene=thrC;locus_... 10 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
The created data frame contains all eleven annotation entries and can be changed now. Depending on which annotation entries are desired, different options of gffpandas can be used and/or combined.
In this example, the user wants to return a GFF3 file, but only its coding sequences ('CDS'), which base pair length (bp) is minimal 10 bp long and maximal 250 bp long. Therefore, the following functions will be combined:
>>> combined_df = annotation.filter_feature_of_type(['CDS']).filter_by_length(10, 250).to_gff3('temp.gff') >>> gff_content = open('temp.gff').read() >>> print(gff_content) Out[2]: ##gff-version 3 ##sequence-region NC_016810.1 1 20 NC_016810.1 RefSeq CDS 13 235 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene1;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1 RefSeq CDS 341 523 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene2;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1 RefSeq CDS 61 195 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene4;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
In this subsection, the possible functions of gffpandas will be presented.
For example:
>>> filtered_df = annotation.filter_feature_of_type(['gene']) >>> print(filtered_df.df) Out[2]: seq_id source type start end score strand phase \ 1 NC_016810.1 RefSeq gene 1 20 . + . 3 NC_016810.1 RefSeq gene 1 20 . + . 5 NC_016810.1 RefSeq gene 1 600 . - . 7 NC_016810.1 RefSeq gene 41 255 . + . 9 NC_016810.1 RefSeq gene 170 546 . + . attributes 1 ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_... 3 ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_... 5 ID=gene3;Name=thrX;gbkey=Gene;gene=thrX;locus_... 7 ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_... 9 ID=gene5;Name=thrC;gbkey=Gene;gene=thrC;locus_...
For example:
>>> filtered_by_length = annotation.filter_by_length(min_length=10, max_length=300) >>> print(filtered_by_length.df) Out[3]: seq_id source type start end score strand phase \ 1 NC_016810.1 RefSeq gene 1 20 . + . 2 NC_016810.1 RefSeq CDS 13 235 . + 0 3 NC_016810.1 RefSeq gene 1 20 . + . 4 NC_016810.1 RefSeq CDS 341 523 . + 0 7 NC_016810.1 RefSeq gene 41 255 . + . 8 NC_016810.1 RefSeq CDS 61 195 . + 0 attributes 1 ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_... 2 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... 3 ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_... 4 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... 7 ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_... 8 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
For example:
>>> feature_by_attribute = annotation.get_feature_by_attribute('gbkey', ['CDS']) >>> print(feature_by_attribute.df) Out[4]: seq_id source type start end score strand phase \ 2 NC_016810.1 RefSeq CDS 13 235 . + 0 4 NC_016810.1 RefSeq CDS 341 523 . + 0 6 NC_016810.1 RefSeq CDS 21 345 . - 0 8 NC_016810.1 RefSeq CDS 61 195 . + 0 10 NC_016810.1 RefSeq CDS 34 335 . + 0 attributes 2 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... 4 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... 6 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... 8 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... 10 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
For example:
>>> attr_to_columns = annotation.attributes_to_columns() >>> print(attr_to_columns) Out[5]: seq_id source type start end score strand phase \ 0 NC_016810.1 RefSeq region 1 4000 . + . 1 NC_016810.1 RefSeq gene 1 20 . + . 2 NC_016810.1 RefSeq CDS 13 235 . + 0 3 NC_016810.1 RefSeq gene 1 20 . + . 4 NC_016810.1 RefSeq CDS 341 523 . + 0 5 NC_016810.1 RefSeq gene 1 600 . - . 6 NC_016810.1 RefSeq CDS 21 345 . - 0 7 NC_016810.1 RefSeq gene 41 255 . + . 8 NC_016810.1 RefSeq CDS 61 195 . + 0 9 NC_016810.1 RefSeq gene 170 546 . + . 10 NC_016810.1 RefSeq CDS 34 335 . + 0 attributes \ 0 Dbxref=taxon:216597;ID=id0;gbkey=Src;genome=ge... 1 ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_... 2 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... 3 ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_... 4 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... 5 ID=gene3;Name=thrX;gbkey=Gene;gene=thrX;locus_... 6 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... 7 ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_... 8 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... 9 ID=gene5;Name=thrC;gbkey=Gene;gene=thrC;locus_... 10 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... Dbxref ... gbkey \ 0 taxon:216597 ... Src 1 None ... Gene 2 UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_00517... ... CDS 3 None ... Gene 4 UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_00517... ... CDS 5 None ... Gene 6 UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_00517... ... CDS 7 None ... Gene 8 UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_00517... ... CDS 9 None ... Gene 10 UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_00517... ... CDS gene genome locus_tag mol_type product \ 0 None genomic None genomic DNA None 1 thrL None SL1344_0001 None None 2 None None None None thr operon leader peptide 3 thrA None SL1344_0002 None None 4 None None None None thr operon leader peptide 5 thrX None SL1344_0003 None None 6 None None None None thr operon leader peptide 7 thrB None SL1344_0004 None None 8 None None None None thr operon leader peptide 9 thrC None SL1344_0005 None None 10 None None None None thr operon leader peptide protein_id serovar strain transl_table 0 None Typhimurium SL1344 None 1 None None None None 2 YP_005179941.1 None None 11 3 None None None None 4 YP_005179941.1 None None 11 5 None None None None 6 YP_005179941.1 None None 11 7 None None None None 8 YP_005179941.1 None None 11 9 None None None None 10 YP_005179941.1 None None 11
For example:
>>> overlapings = annotation.overlaps_with(seq_id='NC_016811.1', type='gene', start=40, end=300, strand='+') >>> no_overlap = annotation.overlaps_with(seq_id='NC_016811.1', start=1, end=4000, strand='+', complement=True) >>> print(overlapings.df) >>> print(no_overlap.df) Out[6]: seq_id source type start end score strand phase \ 0 NC_016810.1 RefSeq region 1 4000 . + . 2 NC_016810.1 RefSeq CDS 13 235 . + 0 7 NC_016810.1 RefSeq gene 41 255 . + . 8 NC_016810.1 RefSeq CDS 61 195 . + 0 9 NC_016810.1 RefSeq gene 170 546 . + . 10 NC_016810.1 RefSeq CDS 34 335 . + 0 attributes 0 Dbxref=taxon:216597;ID=id0;gbkey=Src;genome=ge... 2 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... 7 ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_... 8 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... 9 ID=gene5;Name=thrC;gbkey=Gene;gene=thrC;locus_... 10 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y... Out[7]: Empty DataFrame Columns: [seq_id, source, type, start, end, score, strand, phase, attributes] Index: []
For example:
>>> redundant_entries = annotation.find_duplicated_entries(seq_id='NC_016811.1', type='gene') >>> print(redundant_entries.df) Out[8]: seq_id source type start end score strand phase \ 3 NC_016810.1 RefSeq gene 1 20 . + . attributes 3 ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_...
The following methods of the library won't return a data frame:
For example:
>>> annotation.to_gff3('temp.gff') >>> gff3_file = open('temp.gff').read() >>> print(gff3_file) Out[9]: ##gff-version 3 ##sequence-region NC_016810.1 1 20 NC_016810.1 RefSeq region 1 4000 . + . Dbxref=taxon:216597;ID=id0;gbkey=Src;genome=genomic;mol_type=genomic DNA;serovar=Typhimurium;strain=SL1344 NC_016810.1 RefSeq gene 1 20 . + . ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_tag=SL1344_0001 NC_016810.1 RefSeq CDS 13 235 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene1;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1 RefSeq gene 1 20 . + . ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_tag=SL1344_0002 NC_016810.1 RefSeq CDS 341 523 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene2;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1 RefSeq gene 1 600 . - . ID=gene3;Name=thrX;gbkey=Gene;gene=thrX;locus_tag=SL1344_0003 NC_016810.1 RefSeq CDS 21 345 . - 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene3;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1 RefSeq gene 41 255 . + . ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_tag=SL1344_0004 NC_016810.1 RefSeq CDS 61 195 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene4;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1 RefSeq gene 170 546 . + . ID=gene5;Name=thrC;gbkey=Gene;gene=thrC;locus_tag=SL1344_0005 NC_016810.1 RefSeq CDS 34 335 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene5;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
For example:
>>> annotation.to_csv('temp.csv') >>> csv_file = open('temp.csv').read() >>> print(csv_file) Out[9]: seq_id,source,type,start,end,score,strand,phase,attributes NC_016810.1,RefSeq,region,1,4000,.,+,.,Dbxref=taxon:216597;ID=id0;gbkey=Src;genome=genomic;mol_type=genomic DNA;serovar=Typhimurium;strain=SL1344 NC_016810.1,RefSeq,gene,1,20,.,+,.,ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_tag=SL1344_0001 NC_016810.1,RefSeq,CDS,13,235,.,+,0,Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene1;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1,RefSeq,gene,1,20,.,+,.,ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_tag=SL1344_0002 NC_016810.1,RefSeq,CDS,341,523,.,+,0,Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene2;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1,RefSeq,gene,1,600,.,-,.,ID=gene3;Name=thrX;gbkey=Gene;gene=thrX;locus_tag=SL1344_0003 NC_016810.1,RefSeq,CDS,21,345,.,-,0,Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene3;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1,RefSeq,gene,41,255,.,+,.,ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_tag=SL1344_0004 NC_016810.1,RefSeq,CDS,61,195,.,+,0,Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene4;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1,RefSeq,gene,170,546,.,+,.,ID=gene5;Name=thrC;gbkey=Gene;gene=thrC;locus_tag=SL1344_0005 NC_016810.1,RefSeq,CDS,34,335,.,+,0,Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene5;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
For example:
>>> annotation.to_tsv('temp.tsv') >>> tsv_file = open('temp.tsv').read() >>> print(tsv_file) Out[10]: seq_id source type start end score strand phase attributes NC_016810.1 RefSeq region 1 4000 . + . Dbxref=taxon:216597;ID=id0;gbkey=Src;genome=genomic;mol_type=genomic DNA;serovar=Typhimurium;strain=SL1344 NC_016810.1 RefSeq gene 1 20 . + . ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_tag=SL1344_0001 NC_016810.1 RefSeq CDS 13 235 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene1;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1 RefSeq gene 1 20 . + . ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_tag=SL1344_0002 NC_016810.1 RefSeq CDS 341 523 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene2;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1 RefSeq gene 1 600 . - . ID=gene3;Name=thrX;gbkey=Gene;gene=thrX;locus_tag=SL1344_0003 NC_016810.1 RefSeq CDS 21 345 . - 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene3;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1 RefSeq gene 41 255 . + . ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_tag=SL1344_0004 NC_016810.1 RefSeq CDS 61 195 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene4;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11 NC_016810.1 RefSeq gene 170 546 . + . ID=gene5;Name=thrC;gbkey=Gene;gene=thrC;locus_tag=SL1344_0005 NC_016810.1 RefSeq CDS 34 335 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene5;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
For example:
>>> statistics = annotation.stats_dic() >>> print(statistics.df) Out[11]: {'Maximal_bp_length': 599, 'Minimal_bp_length': 19, 'Counted_strands': + 9 - 2 Name: strand, dtype: int64, 'Counted_feature_types': gene 5 CDS 5 region 1 Name: type, dtype: int64}