gff toolbox mongo ingest

rodtheo edited this page Sep 9, 2021 · 1 revision

About

This command adds annotations to an existing GFF mongo database created with the gff-toolbox convert command. Following the GFF convention, annotations are stored under Dbxref and Ontology_term inside the attributes field. Each annotation records which external database the information comes from (DBTAG field), the annotation code (ID field) and, optionally, a description (Description field). For details about the GFF file convention, please check this link.
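As a rough illustration of the entry layout described above, here is a minimal Python sketch that builds one annotation entry and picks the attribute it would go under. The function name `build_entry` and the `ONTOLOGY_DBTAGS` set are hypothetical. The rule that GO terms go to Ontology_term while other databases go to Dbxref is inferred from the example output further down this page, not taken from the tool's source code.

```python
from typing import Optional

# Assumption: GO annotations are stored under Ontology_term and everything
# else under Dbxref, as seen in the example output on this page.
ONTOLOGY_DBTAGS = {"GO"}


def build_entry(annot_id: str, db_tag: str, description: Optional[str] = None):
    """Return (attribute_name, entry) for a single annotation."""
    entry = {"DBTAG": db_tag, "ID": annot_id}
    if description:
        # The description is optional, so it is left out when empty.
        entry["Description"] = description
    target = "Ontology_term" if db_tag in ONTOLOGY_DBTAGS else "Dbxref"
    return target, entry


print(build_entry("GO:0006810", "GO", "transport"))
print(build_entry("PTHR36029", "PANTHER"))
```

An entry without a description carries only the DBTAG and ID keys, which matches how the PTHR36029 annotation appears in the example below.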

Help message

# Trigger help
gff-toolbox mongo-ingest -h

# Help
gff-toolbox:

            Mongo-ingest

This command add annotations into an already created GFF mongo database.

usage:
    gff-toolbox mongo-ingest --input <tsv> [--gff_feature gene --db_name <db_name> --genome_name <genome_name> --mongo_path <mongo_path> ]
    gff-toolbox mongo-ingest -h | --help

options:
    -h, --help                                              Show this screen
    -i, --input=<tsv>                                       Annotation file in TSV (tab-separated values) format describing, for each line, the gff feature
                                                            (default: gene, can be changed in parameter --gff_feature) id and
                                                            corresponding annotations that should be included in mongodb database.
                                                            The annotation file must contain four columns (#locusName\tId\tIdType\tdescription). [Default: stdin].
    -l, --gff_feature=<feature_type>                        Which GFF feature type must be annotated. [Default: gene].
    -d, --db_name=<db_name>                                 Name of existing mongodb database to update with anotations.
                                                            If database doesnt exist, create it using gff-toolbox convert module [Default: annotation_db].
    -n, --genome_name=<genome_name>                         When loading the mongodb this will be used as collection name. [Default: Genome].
    -p, --mongo_path=<mongo_path>                           Where to load your mongoDB? [Default: ./mongodb].
                                                            If you insert a path that already have a mongoDB in it will include (append)
                                                            the GFF as new collection (<genome_name>) in a new or existing DB (<db_name>).


example:

    ## Create a mongo database from a GFF, if it doesnt exist yet
    ## This will create the GFF mongodb collection named <genome_name> in mongodb <db_name>
    ## DBs are writen in the localhost 27027 mongo db connection of mongo shell    

    $ gff-toolbox convert --format mongodb -i Kp_ref.gff --genome_name Kp --db_name annotation_db

    ## Next, include annotations written in gene.functions.txt file to corresponding
    ## gene features in existing GFF mongo collection Kp

    $ gff-toolbox mongo-ingest -i gene.functions.txt -n Kp

Execution

# Example
## Suppose we'd like to annotate the genes in our GFF
## First, build the mongo database from a GFF, or use one that already exists
gff-toolbox convert --format mongodb -i Kp_ref.gff --genome_name Kp --db_name annotation_db

## If we display the database entries for genes gene-KPHS_00170 and gene-KPHS_02590 in the newly created collection
## using pymongo (or the mongo shell, or any other tool that interacts with mongodb), we get the following JSON-like output

{ ...
  {'recid': 'NC_016845.1',
  'source': 'RefSeq',
  'type': 'gene',
  'start': '22533',
  'end': '22802',
  'score': '.',
  'strand': '+',
  'phase': '.',
  'attributes': {'ID': 'gene-KPHS_00170',
   'Dbxref': 'GeneID:11844995',
   'Name': 'KPHS_00170',
   'gbkey': 'Gene',
   'gene_biotype': 'protein_coding',
   'locus_tag': 'KPHS_00170'}},
 {'recid': 'NC_016845.1',
  'source': 'RefSeq',
  'type': 'gene',
  'start': '298103',
  'end': '299212',
  'score': '.',
  'strand': '+',
  'phase': '.',
  'attributes': {'ID': 'gene-KPHS_02590',
   'Dbxref': 'GeneID:11845246',
   'Name': 'KPHS_02590',
   'gbkey': 'Gene',
   'gene_biotype': 'protein_coding',
   'locus_tag': 'KPHS_02590'}},
...
}
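The entries shown above could be retrieved with pymongo roughly as follows. This is only a sketch: the helper names `build_gene_filter` and `fetch_genes` are hypothetical, the document layout (`type`, `attributes.ID`) is assumed to match the output above, and port 27027 is taken from the help text, so adjust it if your mongod listens elsewhere.

```python
def build_gene_filter(gene_ids):
    """Build a MongoDB filter matching gene features by their GFF ID attribute."""
    return {"type": "gene", "attributes.ID": {"$in": list(gene_ids)}}


def fetch_genes(gene_ids, db_name="annotation_db", collection="Kp",
                host="localhost", port=27027):
    """Return the matching documents; needs pymongo and a running mongod."""
    # Imported here so the filter helper can be used without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(host, port)
    try:
        # The projection {"_id": 0} hides MongoDB's internal document id.
        cursor = client[db_name][collection].find(
            build_gene_filter(gene_ids), {"_id": 0}
        )
        return list(cursor)
    finally:
        client.close()
```

Calling `fetch_genes(["gene-KPHS_00170", "gene-KPHS_02590"])` against the database created above should return the two gene documents as Python dictionaries.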

## Notice that both genes have some attributes inherited from the GFF file
## Now, suppose we'd like to add more information to those genes
## We can do this by creating a tab-separated annotation file like the one below
cat test/gene.functions.tsv

## output
FeatureId	AnnotId	IdType	Description
gene-KPHS_00170	PTHR30520:SF0	PANTHER	TRANSPORTER-RELATED
gene-KPHS_00170	GO:0006810	GO	transport
gene-KPHS_00170	3.4.16.2	EC	Lysosomal Pro-Xaa carboxypeptidase
gene-KPHS_00170	GO:0005215	GO	transporter activity
gene-KPHS_02590	GO:0003735	GO	structural constituent of ribosome
gene-KPHS_02590	PTHR36029	PANTHER	
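Before ingesting, it can help to check that the annotation file really has four tab-separated columns on every line. The sketch below reads such a file and groups the rows by feature ID. The function name `read_annotations` is hypothetical, and treating the first line as a header is an assumption based on the example file above, not a documented behaviour of the tool.

```python
import csv
import io
from collections import defaultdict

EXPECTED_COLUMNS = 4  # feature id, annotation id, id type, description


def read_annotations(handle):
    """Group annotation rows by feature ID; the first line is skipped as a header."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # assumption: the file starts with a header line
    grouped = defaultdict(list)
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != EXPECTED_COLUMNS:
            raise ValueError(
                f"line {line_no}: expected {EXPECTED_COLUMNS} columns, got {len(row)}"
            )
        feature_id, annot_id, id_type, description = (field.strip() for field in row)
        grouped[feature_id].append((annot_id, id_type, description))
    return dict(grouped)


sample = (
    "FeatureId\tAnnotId\tIdType\tDescription\n"
    "gene-KPHS_00170\tGO:0006810\tGO\ttransport\n"
    "gene-KPHS_02590\tPTHR36029\tPANTHER\t\n"
)
annotations = read_annotations(io.StringIO(sample))
print(annotations["gene-KPHS_02590"])
```

Note that a row with an empty description still needs its trailing tab, otherwise it has only three columns and the check above rejects it.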

## Use the annotation file as input to gff-toolbox mongo-ingest to add the information to genes gene-KPHS_00170 and gene-KPHS_02590
gff-toolbox mongo-ingest -i test/gene.functions.tsv -n Kp --db_name annotation_db

## Displaying the database entries again lets us check that the annotations were included

{
...
{'recid': 'NC_016845.1',
  'source': 'RefSeq',
  'type': 'gene',
  'start': '22533',
  'end': '22802',
  'score': '.',
  'strand': '+',
  'phase': '.',
  'attributes': {'ID': 'gene-KPHS_00170',
   'Dbxref': [{'DBTAG': 'GeneID', 'ID': '11844995'},
    {'DBTAG': 'PANTHER',
     'ID': 'PTHR30520:SF0',
     'Description': 'TRANSPORTER-RELATED'},
    {'DBTAG': 'EC',
     'ID': '3.4.16.2',
     'Description': 'Lysosomal Pro-Xaa carboxypeptidase'}],
   'Name': 'KPHS_00170',
   'gbkey': 'Gene',
   'gene_biotype': 'protein_coding',
   'locus_tag': 'KPHS_00170',
   'Ontology_term': [{'DBTAG': 'GO',
     'ID': 'GO:0006810',
     'Description': 'transport'},
    {'DBTAG': 'GO',
     'ID': 'GO:0005215',
     'Description': 'transporter activity'}]}},
 {'recid': 'NC_016845.1',
  'source': 'RefSeq',
  'type': 'gene',
  'start': '298103',
  'end': '299212',
  'score': '.',
  'strand': '+',
  'phase': '.',
  'attributes': {'ID': 'gene-KPHS_02590',
   'Dbxref': [{'DBTAG': 'GeneID', 'ID': '11845246'},
    {'DBTAG': 'PANTHER', 'ID': 'PTHR36029'}],
   'Name': 'KPHS_02590',
   'gbkey': 'Gene',
   'gene_biotype': 'protein_coding',
   'locus_tag': 'KPHS_02590',
   'Ontology_term': [{'DBTAG': 'GO',
     'ID': 'GO:0003735',
     'Description': 'structural constituent of ribosome'}]}}
...
}
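The change between the two outputs can be mimicked in plain Python. The sketch below is not the tool's implementation; the names `normalize_xrefs` and `merge_annotations` are hypothetical. It turns an existing `DBTAG:ID` string into a list of dictionaries and appends the new entries. Skipping duplicate entries is a choice of this sketch, and how the tool treats a pre-existing Ontology_term string is not shown on this page, so only the Dbxref case is mirrored.

```python
import copy


def normalize_xrefs(value):
    """Turn a 'DBTAG:ID' string (comma-separated if several) into a list of dicts."""
    if value is None:
        return []
    if isinstance(value, list):
        return list(value)
    entries = []
    for item in value.split(","):
        # Split on the first colon only: 'GeneID:11845246' -> GeneID / 11845246
        db_tag, _, ident = item.partition(":")
        entries.append({"DBTAG": db_tag, "ID": ident})
    return entries


def merge_annotations(attributes, new_entries):
    """Return a copy of attributes with the (field, entry) pairs appended."""
    merged = copy.deepcopy(attributes)
    for field, entry in new_entries:
        current = normalize_xrefs(merged.get(field))
        if entry not in current:  # sketch-only choice: skip exact duplicates
            current.append(entry)
        merged[field] = current
    return merged


before = {"ID": "gene-KPHS_02590", "Dbxref": "GeneID:11845246"}
after = merge_annotations(before, [
    ("Dbxref", {"DBTAG": "PANTHER", "ID": "PTHR36029"}),
    ("Ontology_term", {"DBTAG": "GO", "ID": "GO:0003735",
                       "Description": "structural constituent of ribosome"}),
])
print(after)
```

The input dictionary is left untouched, so the result can be compared with the original attributes before anything is written back to the database.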