
gff toolbox plot

Felipe Marques de Almeida edited this page Sep 3, 2021 · 6 revisions

About

This command enables fast visualization of genomic regions based on one (or multiple) GFF annotations, using the DNA Features Viewer package.

Help message

# Trigger help
gff-toolbox plot -h

# Help
gff-toolbox:

            Plot

Enabling the visualization of a genomic region from a GFF using the DNA features python package

usage:
    gff-toolbox plot -h|--help
    gff-toolbox plot check-gff [ --input <gff> ]
    gff-toolbox plot [ --input <gff> | --fofn <file> ] ( --contig <contig_name> ) [ --start <start_base> --end <end_base> --feature <feature_type> --identification <id> --title <title> --label <label> --color <color> --output <png_out> --width <width> --height <height> ]

options:
    -h --help                               Show this screen.

    -v --version                            Show version information.

    check-gff                               Does a simple parsing of the GFF file so the user knows the available qualifiers that
                                            can be used as gene identifiers. GFF qualifiers are retrieved from the 9th column.
                                            Same as gff-toolbox overview command.

    -i, --input=<gff>                       Used to plot dna features from a single GFF file [Default: stdin].

    -f, --fofn=<file>                       Used to plot dna multiple features from multiple GFF files. Contents must be in csv format with 3 columns:
                                            gff,custom_label,color (HEX format). Features from each GFF will have the color set in the 3rd column.
                                            Useful to compare annotations and to plot features with different colors if you separate them into multiple
                                            gff files each containing one type of feature.

    --start=<start_base>                    Starting position for plotting. [Default: 1].

    --end=<end_base>                        Ending position for plotting. [Default: 500].

    --contig=<contig_name>                  Name of the contig which you want to plot.

    --identification=<id>                   Which GFF qualifier must be used as gene identification to write in the plot?
                                            Please check for available qualifiers with 'check-gff'. Must be the exact name
                                            of the key found in the attributes columns. [Default: ID].

    -t, --title=<title>                     Legend/plot title. [Default: Gene Plot].

    -l, --label=<label>                     Custom label for plotting features. This is the string that appears in the legend. [Default: Gene].

    --feature=<feature_type>                Type of the GFF feature (3rd column) which you want to plot. It is possible to set more than one feature to be
                                            plotted by giving it separated by commas, eg. CDS,rRNA. [Default: gene].

    --color=<color>                         HEX entry for desired plotting color. [Default: #ccccff].

    --width=<width>                         Plot width ratio. [Default: 20].

    --height=<height>                       Plot height ratio. [Default: 5].

    -o, --output=<png_out>                  Output PNG filename. [Default: ./out.png].
                                            You can output SVG with: "-o out.svg".

example:

    ## Plotting CDS and/or rRNA features found inside the region between base 1 and 10000 from
    ## contig_1_segment0 sequence. Without much customization. Giving a custom label for the
    ## genes to appear in the legend and giving a different legend title.

$ gff-toolbox plot -i Kp_ref.gff --contig NC_016845.1 --feature CDS,rRNA --start 10000 --end 20000 -l "Generic features (CDS and rRNAs)" -t "Kp annotation"

    ## Plotting CDS, rRNA and tRNA features with different colors. Setting the genomic region
    ## in NC_016845.1:10000-20000. Checkout the example fofn file to understand it better.

$ gff-toolbox plot -f kp_gffs.fofn --start 10000 --end 20000 --contig NC_016845.1 -t "Kp annotation" --feature CDS,rRNA,tRNA

    ## Same as above.
    ## This time, instead of plotting gene names we plot the gene products by setting the
    ## parameter --identification to the exact name of the key in the attributes column.

$ gff-toolbox plot -f kp_gffs.fofn --start 10000 --end 20000 --contig NC_016845.1 -t "Kp annotation" --feature CDS,rRNA,tRNA --identification product
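
The help text above requires `--identification` to be the exact name of a key in the attributes column. Besides `check-gff`, a quick way to see which keys a GFF actually carries is to count them with plain awk. This is a minimal sketch, not how gff-toolbox does it internally: the function name `list_gff_keys` is hypothetical, and it assumes tab-separated GFF3 lines with `key=value` pairs separated by `;` in the 9th column.

```shell
# Hypothetical helper (not part of gff-toolbox): count how many features
# carry each attribute key in the 9th GFF column.
list_gff_keys() {
    awk -F '\t' '
        !/^#/ && NF >= 9 {                       # skip comments and short lines
            n = split($9, kv, ";")
            for (i = 1; i <= n; i++) {
                eq = index(kv[i], "=")           # position of the first "="
                if (eq > 1) count[substr(kv[i], 1, eq - 1)]++
            }
        }
        END { for (k in count) print count[k] "\t" k }
    ' "$@" | sort -k2,2                          # sort by key name
}
```

It can be called as `list_gff_keys Kp_ref.gff`, or fed from stdin. Any key it prints (for example `product` or `locus_tag`) is a candidate value for `--identification`.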

Execution

Plotting from a single GFF file

## Plotting CDS and/or rRNA features found in the region between bases 10000 and 20000 of the
## NC_016845.1 sequence, without much customization. A custom label is given for the
## genes in the legend, along with a custom legend title.
gff-toolbox plot -i Kp_ref.gff --contig NC_016845.1 --feature CDS,rRNA --start 10000 --end 20000 -l "Generic features (CDS and rRNAs)" -t "Kp annotation"

Plotting with multiple GFF files

## Plotting CDS, rRNA and tRNA features with different colors. Setting the genomic region
## to NC_016845.1:10000-20000. Check out the example fofn file to understand it better.
## Saving the output as SVG.
gff-toolbox plot -f kp_gffs.fofn \
   --start 10000 --end 20000 --contig NC_016845.1 -t "Kp annotation" \
   --feature CDS,rRNA,tRNA -o plot.svg

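The fofn file used to plot a genomic region from multiple GFFs is a CSV file with three columns, gff,custom_label,color (HEX format), as described in the help message. The format can be checked here.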
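
As a rough illustration of the three-column layout, the snippet below writes a small fofn file and checks its shape. The GFF file names, labels and colors are hypothetical placeholders, and the absence of a header line is an assumption, so compare against the example fofn file before relying on it.

```shell
# Hypothetical per-feature GFF files, labels and colors; replace with your own.
# Assumption: one "gff,custom_label,color" entry per line, no header line.
cat > kp_gffs.fofn <<'EOF'
kp_cds.gff,CDS,#ccccff
kp_rrna.gff,rRNA,#ffcccc
kp_trna.gff,tRNA,#ccffcc
EOF

# Sanity check: each line needs exactly three comma-separated fields,
# and the third must look like a 6-digit HEX color.
awk -F ',' '
    NF != 3 || length($3) != 7 || $3 !~ /^#[0-9A-Fa-f]+$/ {
        bad = 1
        print "bad line " NR ": " $0
    }
    END { exit bad }
' kp_gffs.fofn
```

The awk check exits with a non-zero status if any line is malformed, which makes it easy to run before calling `gff-toolbox plot -f kp_gffs.fofn`.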

Removing gene labels

If you do not want a label to appear for a gene, you must remove the corresponding value from the target GFF. Suppose you are plotting the "product" key from the GFF as the gene labels: to remove the label from specific features/genes, you must remove this key from their entries in the GFF.

For example, the attributes column from the target genes goes from:

[...] ID=cds-YP_005224301.1;Parent=gene-KPHS_00010;product=flavodoxin;Dbxref=Genbank:YP_005224301.1,GeneID:11849782;Name=YP_005224301.1;gbkey=CDS;locus_tag=KPHS_00010;protein_id=YP_005224301.1;transl_table=11

to

[...] ID=cds-YP_005224301.1;Parent=gene-KPHS_00010;Dbxref=Genbank:YP_005224301.1,GeneID:11849782;Name=YP_005224301.1;gbkey=CDS;locus_tag=KPHS_00010;protein_id=YP_005224301.1;transl_table=11

Note that we removed the "product" key from the 9th column. Thus, if plotting the "product" key as labels, this gene will not have a label since this key does not exist in the GFF.
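
The edit above can be scripted instead of done by hand. The following is a minimal sketch under stated assumptions: `strip_attribute` is a hypothetical helper (not a gff-toolbox command), target features are selected by their `locus_tag` value, and the GFF is tab-separated with `;`-delimited `key=value` attributes in the 9th column.

```shell
# Hypothetical helper (not part of gff-toolbox): remove one attribute key from
# the 9th column of every feature whose attributes contain the given locus_tag.
# Reads a GFF on stdin and writes the edited GFF to stdout.
strip_attribute() {
    # $1 = locus_tag value, $2 = attribute key to remove (default: product)
    awk -v tag="$1" -v key="${2:-product}" '
        BEGIN { FS = OFS = "\t" }
        /^#/ { print; next }                     # keep header/comment lines
        index(";" $9 ";", ";locus_tag=" tag ";") {
            n = split($9, kv, ";"); out = ""
            for (i = 1; i <= n; i++) {
                # drop empty pieces and the key being removed
                if (kv[i] == "" || index(kv[i], key "=") == 1) continue
                out = (out == "" ? kv[i] : out ";" kv[i])
            }
            $9 = out
        }
        { print }
    '
}
```

For instance, `strip_attribute KPHS_00010 < Kp_ref.gff > Kp_ref.nolabel.gff` would write a copy of the annotation (the output name is a placeholder) in which the features tagged KPHS_00010 no longer carry a "product" key, so they get no label when plotting with `--identification product`.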