SIGA.py

SIGA.py is a command-line tool written in Python that generates Semantically Interoperable Genome Annotations from text files in the Generic Feature Format (GFF), following the Resource Description Framework (RDF) specification.

Fig. SIGA software architecture.

Key features

Input:
- one or more files in the GFF format (version 2 or 3)
- config.ini file with ontology mappings and feature type amendments (if applicable)
Output: genomic features stored in a SQLite database or serialized in one of the RDF formats:
- XML
- N-Triples
- Turtle
- Notation3 (N3)
Checks referential integrity of parent-child feature relationships in SQLite
Controlled vocabularies and ontologies used:
- DCMI terms (e.g. creator, hasVersion, license)
- Sequence Ontology (SO) to describe feature types (e.g. genome, chromosome, gene, transcript) and their relationships (e.g. has part/part of, genome of, transcribed to, translated to)
- Feature Annotation Location Description Ontology (FALDO)
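
To make the combination of SO and FALDO terms more concrete, here is a minimal sketch that turns a single GFF3 `gene` line into N-Triples using only the standard library. It is not SIGA.py's implementation: the `BASE` URI prefix, the `#location` naming scheme and the helper names are assumptions made for illustration, and strand handling is left out.

```python
# Sketch only: BASE is a hypothetical URI prefix, not SIGA.py's real scheme.
BASE = "http://example.org/genome/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FALDO = "http://biohackathon.org/resource/faldo#"
SO_GENE = "http://purl.obolibrary.org/obo/SO_0000704"  # Sequence Ontology: gene
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return "<{0}>".format(value)


def integer(value):
    """Format a typed integer literal for N-Triples."""
    return '"{0}"^^<{1}>'.format(int(value), XSD_INTEGER)


def gff3_gene_to_ntriples(line):
    """Describe one GFF3 'gene' line as N-Triples (strand is ignored here)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    seqid, ftype, start, end, attrs = cols[0], cols[2], cols[3], cols[4], cols[8]
    if ftype != "gene":
        raise ValueError("this sketch only handles 'gene' features")
    # Column 9 holds semicolon-separated key=value pairs; ID names the feature.
    attributes = dict(pair.split("=", 1) for pair in attrs.split(";") if pair)
    feature = BASE + attributes["ID"]
    region = feature + "#location"
    triples = [
        (feature, RDF_TYPE, iri(SO_GENE)),
        (feature, FALDO + "location", iri(region)),
        (region, RDF_TYPE, iri(FALDO + "Region")),
    ]
    # Each boundary becomes a FALDO ExactPosition on the reference sequence.
    for name, coord in (("begin", start), ("end", end)):
        position = "{0}-{1}".format(region, name)
        triples += [
            (region, FALDO + name, iri(position)),
            (position, RDF_TYPE, iri(FALDO + "ExactPosition")),
            (position, FALDO + "position", integer(coord)),
            (position, FALDO + "reference", iri(BASE + seqid)),
        ]
    return ["{0} {1} {2} .".format(iri(s), iri(p), o) for s, p, o in triples]


example = "chr1\texample\tgene\t100\t200\t.\t+\t.\tID=gene1;Name=abc"
for triple in gff3_gene_to_ntriples(example):
    print(triple)
```

The sketch copies the GFF start and end coordinates unchanged; a real converter would also need to model strand and the other feature types and relationships listed above.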

Requirements

Python (>=2.7)
docopt (0.6.2)
RDFLib (4.2.2)
gffutils (https://github.com/arnikz/gffutils)
optional: RDF store to query ingested data using SPARQL (e.g. using Virtuoso or Berkeley DB)

Installation

git clone https://github.com/candYgene/siga.git
cd siga
virtualenv .sigaenv
source .sigaenv/bin/activate
pip install -r requirements.txt

How to use

Command-line interface

Usage:
  SIGA.py -h|--help
  SIGA.py -v|--version
  SIGA.py db [-ruV] [-d DB_FILE | -e DB_FILEXT] GFF_FILE...
  SIGA.py rdf [-V] [-o FORMAT] [-c CFG_FILE] DB_FILE...

Arguments:
  GFF_FILE...      Input file(s) in GFF version 2 or 3.
  DB_FILE...       Input database file(s) in SQLite.

Options:
  -h, --help
  -v, --version
  -V, --verbose    Show verbose output in debug mode.
  -c CFG_FILE      Set the path of the config file [default: config.ini].
  -d DB_FILE       Create a database from GFF file(s).
  -e DB_FILEXT     Set the database file extension [default: .db].
  -r               Check the referential integrity of the database(s).
  -u               Generate unique IDs for duplicated features.
  -o FORMAT        Output RDF graph in one of the following formats:
                     turtle (.ttl) [default: turtle]
                     nt (.nt)
                     n3 (.n3)
                     xml (.rdf)

Input files

A small test set, examples/features.gff3, is included together with a config.ini file. Alternatively, download the tomato or potato genome annotations:

wget ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG2.4_release/ITAG2.4_gene_models.gff3
wget http://solanaceae.plantbiology.msu.edu/data/PGSC_DM_V403_genes.gff.zip

Generate RDF graph

GFF->DB

python SIGA.py db -rV ../examples/features.gff3 # output *.db
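
The `-r` option checks referential integrity of the parent-child relationships. As a rough illustration of what such a check involves (not SIGA.py's own code), the sketch below looks for relations that point at a missing feature. It assumes a simplified schema modeled on the `features` and `relations` tables that gffutils creates; the column subset and the function name `find_orphan_relations` are illustrative.

```python
import sqlite3


def find_orphan_relations(conn):
    """Return (parent, child) pairs that reference an ID missing from `features`."""
    query = """
        SELECT r.parent, r.child
        FROM relations AS r
        LEFT JOIN features AS p ON p.id = r.parent
        LEFT JOIN features AS c ON c.id = r.child
        WHERE p.id IS NULL OR c.id IS NULL
        ORDER BY r.parent, r.child
    """
    return conn.execute(query).fetchall()


# Tiny in-memory database with an assumed, simplified schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT);
    CREATE TABLE relations (parent TEXT, child TEXT, level INT);
    INSERT INTO features VALUES ('gene1', 'gene');
    INSERT INTO features VALUES ('mRNA1', 'mRNA');
    INSERT INTO relations VALUES ('gene1', 'mRNA1', 1);
    INSERT INTO relations VALUES ('gene2', 'mRNA2', 1);
""")
orphans = find_orphan_relations(conn)
print(orphans)  # the gene2 -> mRNA2 relation has no matching features
```

To try the same query against a database produced by the `db` command, replace `":memory:"` with the path to the `.db` file and skip the `executescript` call, provided the file follows the assumed schema.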

DB->RDF (default: turtle)

python SIGA.py rdf -c config.ini ../examples/features.db # output *.ttl

Summary of I/O files:

config file: config.ini
GFF file: features.gff3
SQLite DB file: features.db
RDF Turtle file: features.ttl
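
When scripting around the tool, the file names above can be derived from one another. The helper below is a small sketch of that mapping, using the format names and extensions from the `-o FORMAT` option; the function name `rdf_output_path` and the assumption that the RDF file is written next to the database file are illustrative, not taken from SIGA.py.

```python
import os

# Format names and extensions as listed under the -o FORMAT option.
RDF_EXTENSIONS = {"turtle": ".ttl", "nt": ".nt", "n3": ".n3", "xml": ".rdf"}


def rdf_output_path(db_file, fmt="turtle"):
    """Guess the RDF output path for a database file (assumed naming rule)."""
    if fmt not in RDF_EXTENSIONS:
        raise ValueError("unsupported RDF format: {0}".format(fmt))
    base, _ext = os.path.splitext(db_file)
    return base + RDF_EXTENSIONS[fmt]


print(rdf_output_path("features.db"))
print(rdf_output_path("features.db", "xml"))
```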

Import RDF graph into Virtuoso RDF Quad Store

See the Virtuoso documentation on bulk data loading.

Edit the virtuoso.ini config file by adding /mydir/ to DirsAllowed.

Connect to the database server as the dba user:

isql 1111 dba dba

Delete the existing RDF graph, if necessary:

SPARQL CLEAR GRAPH <http://solgenomics.net/genome/Solanum_lycopersicum> ;

Delete any previously registered data files:

DELETE FROM DB.DBA.load_list ;

Register data file(s):

ld_dir('/mydir/', 'features.ttl', 'http://solgenomics.net/genome/Solanum_lycopersicum') ;

List registered data file(s):

SELECT * FROM DB.DBA.load_list ;

Bulk data loading:

rdf_loader_run() ;

Re-index triples for full-text search (via Faceted Browser):

DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ() ;
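
When several graphs have to be loaded, the statements above can be generated rather than typed by hand. The following sketch only assembles the same statements as text; the function name `bulk_load_script` and its parameters are assumptions, and the inputs are inserted without any escaping, so they should come from a trusted source.

```python
def bulk_load_script(data_dir, file_pattern, graph_uri, clear_graph=False):
    """Build the isql statements for one bulk load, in the order shown above."""
    lines = []
    if clear_graph:
        # Optional: drop the existing graph before reloading it.
        lines.append("SPARQL CLEAR GRAPH <{0}> ;".format(graph_uri))
    lines.append("DELETE FROM DB.DBA.load_list ;")
    lines.append("ld_dir('{0}', '{1}', '{2}') ;".format(data_dir, file_pattern, graph_uri))
    lines.append("rdf_loader_run() ;")
    lines.append("DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ() ;")
    return "\n".join(lines) + "\n"


script = bulk_load_script(
    "/mydir/",
    "features.ttl",
    "http://solgenomics.net/genome/Solanum_lycopersicum",
    clear_graph=True,
)
print(script)
```

The printed statements can then be pasted into the isql session.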

Note: A single data file can be uploaded using the following command:

SPARQL LOAD "file:///mydir/features.ttl" INTO "http://solgenomics.net/genome/Solanum_lycopersicum" ;

Count imported RDF triples:

SPARQL
SELECT COUNT(*)
FROM <http://solgenomics.net/genome/Solanum_lycopersicum>
WHERE { ?s ?p ?o } ;
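
The same count can be requested over HTTP from a script. The sketch below only builds the request URL. It assumes Virtuoso's SPARQL endpoint is reachable at `http://localhost:8890/sparql` (the usual default, but check your virtuoso.ini) and uses Virtuoso's `format` parameter to ask for JSON results; the function name `count_query_url` is illustrative.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def count_query_url(graph_uri, endpoint="http://localhost:8890/sparql"):
    """Build a GET URL that counts the triples in one named graph."""
    query = "SELECT (COUNT(*) AS ?n) FROM <{0}> WHERE {{ ?s ?p ?o }}".format(graph_uri)
    params = [
        ("query", query),
        ("format", "application/sparql-results+json"),  # Virtuoso-specific parameter
    ]
    return endpoint + "?" + urlencode(params)


url = count_query_url("http://solgenomics.net/genome/Solanum_lycopersicum")
print(url)
```

Opening the resulting URL, for example with `urllib.request.urlopen` on Python 3, should return the count as a SPARQL JSON result if the endpoint is configured as assumed.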

Alternatively, import the RDF graph into Berkeley DB (requires the Redland RDF processor):

rdfproc features parse features.ttl turtle
rdfproc features serialize turtle

License

The software is released under the Apache License, Version 2.0.
