gff toolbox mongo ingest

This command adds annotations to an existing GFF mongo database created with the gff-toolbox convert command. Following the GFF convention, annotations are stored under Dbxref and Ontology_term inside the attributes field. Each annotation records the external database it comes from (DBTAG field), the annotation code (ID field) and, optionally, a description (Description field). For details about the GFF file convention, please check this link.
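As a rough illustration of the entry layout described above, the sketch below turns a plain GFF3 `Dbxref` string into structured entries and builds a new entry with an optional description. The helper names `split_dbxref` and `make_entry` are hypothetical and are not part of gff-toolbox; only the `DBTAG`, `ID` and `Description` keys come from the database output shown in this document.

```python
def split_dbxref(value):
    """Split a GFF3 'DBTAG:ID[,DBTAG:ID...]' string into structured entries.

    Hypothetical helper, not part of gff-toolbox.
    """
    entries = []
    for item in value.split(","):
        # Split on the first ':' only, so IDs containing ':' stay intact.
        dbtag, _, ident = item.strip().partition(":")
        entries.append({"DBTAG": dbtag, "ID": ident})
    return entries


def make_entry(dbtag, ident, description=""):
    """Build one annotation entry; the description is optional."""
    entry = {"DBTAG": dbtag, "ID": ident}
    if description:
        entry["Description"] = description
    return entry


print(split_dbxref("GeneID:11844995"))
print(make_entry("GO", "GO:0006810", "transport"))
```

How gff-toolbox performs this conversion internally is not shown in this document, so treat the code as a description of the data shape only.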

Help message

# Trigger help
gff-toolbox mongo-ingest -h

# Help


    gff-toolbox mongo-ingest --input <tsv> [--gff_feature gene --db_name <db_name> --genome_name <genome_name> --mongo_path <mongo_path> ]
    gff-toolbox mongo-ingest -h | --help

    -h, --help                                              Show this screen
    -i, --input=<tsv>                                       Annotation file in TSV (tab-separated values) format describing, for each line, the gff feature
                                                            (default: gene, can be changed in parameter --gff_feature) id and
                                                            corresponding annotations that should be included in mongodb database.
                                                            The annotation file must contain four columns (#locusName\tId\tIdType\tdescription). [Default: stdin].
    -l, --gff_feature=<feature_type>                        Which GFF feature type must be annotated. [Default: gene].
    -d, --db_name=<db_name>                                 Name of the existing mongodb database to update with annotations.
                                                            If the database doesn't exist, create it using the gff-toolbox convert module [Default: annotation_db].
    -n, --genome_name=<genome_name>                         When loading the mongodb this will be used as collection name. [Default: Genome].
    -p, --mongo_path=<mongo_path>                           Where to load your mongoDB? [Default: ./mongodb].
                                                            If you give a path that already holds a mongoDB, it will include (append)
                                                            the GFF as a new collection (<genome_name>) in a new or existing DB (<db_name>).


    ## Create a mongo database from a GFF, if it doesn't exist yet
    ## This will create the GFF mongodb collection named <genome_name> in mongodb <db_name>
    ## DBs are written to the mongo connection on localhost, port 27027

    $ gff-toolbox convert --format mongodb -i Kp_ref.gff --genome_name Kp --db_name annotation_db

    ## Next, include annotations written in gene.functions.txt file to corresponding
    ## gene features in existing GFF mongo collection Kp

    $ gff-toolbox mongo-ingest -i gene.functions.txt -n Kp


# Example
## Suppose we'd like to annotate our 
## First, build the mongo database from a GFF, or use one that was already created
gff-toolbox convert --format mongodb -i Kp_ref.gff --genome_name Kp --db_name annotation_db

## If we display the database entries for genes gene-KPHS_00170 and gene-KPHS_02590 in the created collection
## using pymongo (or the mongo shell, or any other tool that interacts with mongodb), we get the following output in JSON format

{ ...
  {'recid': 'NC_016845.1',
  'source': 'RefSeq',
  'type': 'gene',
  'start': '22533',
  'end': '22802',
  'score': '.',
  'strand': '+',
  'phase': '.',
  'attributes': {'ID': 'gene-KPHS_00170',
   'Dbxref': 'GeneID:11844995',
   'Name': 'KPHS_00170',
   'gbkey': 'Gene',
   'gene_biotype': 'protein_coding',
   'locus_tag': 'KPHS_00170'}},
 {'recid': 'NC_016845.1',
  'source': 'RefSeq',
  'type': 'gene',
  'start': '298103',
  'end': '299212',
  'score': '.',
  'strand': '+',
  'phase': '.',
  'attributes': {'ID': 'gene-KPHS_02590',
   'Dbxref': 'GeneID:11845246',
   'Name': 'KPHS_02590',
   'gbkey': 'Gene',
   'gene_biotype': 'protein_coding',
   'locus_tag': 'KPHS_02590'}},
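
The lookup that produced the entries above could be done with pymongo roughly as follows. This is a sketch under assumptions: the helper names `build_query` and `find_features` are hypothetical, the `attributes.ID` path is inferred from the displayed documents, and the connection details in the commented usage (host, port 27027 as stated in the help text, whereas MongoDB's usual default is 27017) may need adjusting for your setup.

```python
def build_query(feature_ids):
    """Build a MongoDB filter that matches features by their attributes.ID."""
    return {"attributes.ID": {"$in": list(feature_ids)}}


def find_features(collection, feature_ids):
    """Return matching documents, hiding MongoDB's internal _id field."""
    return list(collection.find(build_query(feature_ids), {"_id": 0}))


# Usage against a running server (requires the third-party pymongo package):
#
#   from pprint import pprint
#   from pymongo import MongoClient
#
#   client = MongoClient("localhost", 27027)    # assumed host and port
#   collection = client["annotation_db"]["Kp"]  # <db_name> and <genome_name>
#   pprint(find_features(collection, ["gene-KPHS_00170", "gene-KPHS_02590"]))

print(build_query(["gene-KPHS_00170", "gene-KPHS_02590"]))
```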

## Notice that both genes have some attributes inherited from the GFF file
## Now, suppose we'd like to supply those genes with more information
## We can do this by generating a tab-separated annotation file like the one below
cat test/gene.functions.tsv

## output
FeatureId	AnnotId	IdType	Description
gene-KPHS_00170	GO:0006810	GO	transport
gene-KPHS_00170		EC	Lysosomal Pro-Xaa carboxypeptidase
gene-KPHS_00170	GO:0005215	GO	transporter activity
gene-KPHS_02590	GO:0003735	GO	structural constituent of ribosome
gene-KPHS_02590	PTHR36029	PANTHER	

## Use the annotation file as input to gff-toolbox mongo-ingest to include the information for genes gene-KPHS_00170 and gene-KPHS_02590
gff-toolbox mongo-ingest -i gene.functions.txt -n Kp --db_name annotation_db

## Displaying the database entries again lets us check that the annotations were included

{'recid': 'NC_016845.1',
  'source': 'RefSeq',
  'type': 'gene',
  'start': '22533',
  'end': '22802',
  'score': '.',
  'strand': '+',
  'phase': '.',
  'attributes': {'ID': 'gene-KPHS_00170',
   'Dbxref': [{'DBTAG': 'GeneID', 'ID': '11844995'},
    {'DBTAG': 'PANTHER',
     'ID': 'PTHR30520:SF0',
     'Description': 'TRANSPORTER-RELATED'},
    {'DBTAG': 'EC',
     'ID': '',
     'Description': 'Lysosomal Pro-Xaa carboxypeptidase'}],
   'Name': 'KPHS_00170',
   'gbkey': 'Gene',
   'gene_biotype': 'protein_coding',
   'locus_tag': 'KPHS_00170',
   'Ontology_term': [{'DBTAG': 'GO',
     'ID': 'GO:0006810',
     'Description': 'transport'},
    {'DBTAG': 'GO',
     'ID': 'GO:0005215',
     'Description': 'transporter activity'}]}},
 {'recid': 'NC_016845.1',
  'source': 'RefSeq',
  'type': 'gene',
  'start': '298103',
  'end': '299212',
  'score': '.',
  'strand': '+',
  'phase': '.',
  'attributes': {'ID': 'gene-KPHS_02590',
   'Dbxref': [{'DBTAG': 'GeneID', 'ID': '11845246'},
    {'DBTAG': 'PANTHER', 'ID': 'PTHR36029'}],
   'Name': 'KPHS_02590',
   'gbkey': 'Gene',
   'gene_biotype': 'protein_coding',
   'locus_tag': 'KPHS_02590',
   'Ontology_term': [{'DBTAG': 'GO',
     'ID': 'GO:0003735',
     'Description': 'structural constituent of ribosome'}]}}
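
Rather than reading the output by eye, the check could be automated. The sketch below tests whether a document carries a given annotation in either `Dbxref` or `Ontology_term`. The helper `has_annotation` is hypothetical, and the sample document is an abridged copy of the gene-KPHS_02590 entry shown above.

```python
def has_annotation(doc, dbtag, ident):
    """Return True if doc holds an entry with this DBTAG and ID."""
    attributes = doc.get("attributes", {})
    for field in ("Dbxref", "Ontology_term"):
        entries = attributes.get(field, [])
        if not isinstance(entries, list):
            continue  # plain string form, as stored before ingestion
        for entry in entries:
            if entry.get("DBTAG") == dbtag and entry.get("ID") == ident:
                return True
    return False


# Abridged copy of the gene-KPHS_02590 entry after ingestion.
doc = {
    "type": "gene",
    "attributes": {
        "ID": "gene-KPHS_02590",
        "Dbxref": [
            {"DBTAG": "GeneID", "ID": "11845246"},
            {"DBTAG": "PANTHER", "ID": "PTHR36029"},
        ],
        "Ontology_term": [
            {
                "DBTAG": "GO",
                "ID": "GO:0003735",
                "Description": "structural constituent of ribosome",
            }
        ],
    },
}

print(has_annotation(doc, "GO", "GO:0003735"))
print(has_annotation(doc, "PANTHER", "PTHR30520:SF0"))
```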
