This is a pure Perl library. All dependencies are declared in a cpanfile and can be installed using cpanm.
cpanm --installdeps .
If you have the excellent carton library available, you can use it to manage your dependencies instead.
carton install
perl -I ./local/lib/perl5/ ./bin/load_expanded_con.pl -h
./bin/load_expanded_con.pl --store-path where/you/will/store/things --file-path EMBL.file.dat.gz --process-id anything
The above program will:
- Open the given EMBL expanded con file (these files can be downloaded from ftp://ftp.ebi.ac.uk/pub/databases/ena/sequence/release/expanded_con)
- Parse it using the Bio::Perl embl parser
- Extract the sequence, calculate checksums and metadata
- Write these to disk under a seq, json and logs path under the given --store-path
- Write logs using the given --process-id (see below for the types of logs)
To keep a handle on processing, this code produces a number of logs to disk. See below for the various types of log generated. All files are uncompressed CSVs with headers.
Populated with all pertinent metadata from a load. Columns are:
- trunc512 checksum
- md5 checksum
- sequence length
- sha512 checksum
- ga4gh checksum (base64 URL-encoded version of trunc512, prefixed with ga4gh:SQ., as specified by VR spec 1.0)
- versioned accession
- record type e.g. expanded_con
- species (full name)
- biosample accession (nullable)
- taxon identifier (nullable)
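The checksum columns above are related to one another, and a short sketch may make that clearer. The Perl below is an illustration only, not this library's own code. The function name sequence_checksums is hypothetical. It assumes the sequence has already been stripped of whitespace, and it assumes trunc512 is the hex encoding of the first 24 bytes of the SHA-512 digest, with the ga4gh checksum being the base64 URL encoding of those same 24 bytes prefixed with ga4gh:SQ.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Digest::SHA qw(sha512);
use MIME::Base64 qw(encode_base64url);

# Hypothetical helper: derive the checksum columns for one sequence string.
# Assumes the sequence is already cleaned (no whitespace or line breaks).
sub sequence_checksums {
    my ($seq) = @_;

    my $digest  = sha512($seq);             # raw 64-byte SHA-512 digest
    my $first24 = substr($digest, 0, 24);   # assumed truncation length: 24 bytes

    return {
        trunc512 => unpack('H*', $first24),                 # hex of the truncated digest
        md5      => md5_hex($seq),
        length   => length($seq),
        sha512   => unpack('H*', $digest),                  # hex of the full digest
        ga4gh    => 'ga4gh:SQ.' . encode_base64url($first24),
    };
}
```

Under these assumptions trunc512 is always the first 48 hex characters of sha512, and the ga4gh checksum carries the same 24 bytes in a different encoding, so either one can be converted to the other without access to the sequence.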
Records when each record was processed, along with a success boolean and the paths written. Columns are:
- timestamp
- loading completed (boolean, stored as 1 or 0)
- trunc512 checksum
- md5 checksum
- path to sequence
- path to json
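Since this log carries a success flag per record, it can be scanned to find loads that did not complete. The following is a minimal sketch rather than part of this library. The function name failed_loads is hypothetical, the column order is assumed to match the list above, and fields are assumed to contain no embedded commas or quoting, so a plain split is used instead of a full CSV parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return the rows whose "loading completed" flag is not 1.
# Takes an open filehandle on a load log; assumes the column order listed above.
sub failed_loads {
    my ($fh) = @_;

    my $header = <$fh>;    # first line is the CSV header, so skip it
    my @failed;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;    # ignore blank lines
        my ($timestamp, $completed, $trunc512, $md5, $seq_path, $json_path)
            = split /,/, $line, 6;
        next if defined $completed && $completed eq '1';
        push @failed, {
            timestamp => $timestamp,
            trunc512  => $trunc512,
            md5       => $md5,
            seq_path  => $seq_path,
            json_path => $json_path,
        };
    }
    return \@failed;
}

# Usage: pass the path to a load log as the first argument.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    print "$_->{timestamp}\t$_->{trunc512}\n" for @{ failed_loads($fh) };
    close $fh;
}
```

If the logs can contain quoted fields, a CSV module would be a safer choice than split.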
Logs records that require further processing because their version was greater than 1. Columns are:
- timestamp
- accession
- current version
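The version check behind this log can be sketched as below. This is an assumption-laden illustration, not the library's implementation: the function name check_version is hypothetical, and it assumes a versioned accession has the form ACCESSION.VERSION with an integer version after the final dot.

```perl
use strict;
use warnings;

# Hypothetical helper: split a versioned accession and flag versions above 1.
# Returns nothing if the input does not look like ACCESSION.VERSION.
sub check_version {
    my ($versioned_accession) = @_;

    return unless $versioned_accession =~ /\A(.+)\.(\d+)\z/;
    my ($accession, $version) = ($1, $2);

    return {
        accession                => $accession,
        version                  => $version + 0,           # numeric form of the version
        needs_further_processing => $version > 1 ? 1 : 0,   # the condition this log records
    };
}
```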