ENA Mirror Processor

Install dependencies

This is a pure Perl library. All dependencies are declared in a cpanfile and can be installed with cpanm.

cpanm --installdeps .

If you have the excellent Carton library available, you can use it to manage your dependencies instead.

carton install
perl -I ./local/lib/perl5/ ./bin/load_expanded_con.pl -h

Load expanded cons

./bin/load_expanded_con.pl --store-path where/you/will/store/things --file-path EMBL.file.dat.gz --process-id anything

The above program will:

Open the given EMBL expanded con file (these can be downloaded from ftp://ftp.ebi.ac.uk/pub/databases/ena/sequence/release/expanded_con)
Parse it using the BioPerl EMBL parser
Extract the sequence, then calculate checksums and metadata
Write these to disk under seq, json and logs paths within the given --store-path
Write logs named with the given --process-id (see below for the types of logs)

Generated Logs

To keep track of processing, this code writes a number of logs to disk. The log types are described below. All files are uncompressed CSVs with headers.

Metadata log

Populated with all pertinent metadata from a load. Columns are:

trunc512 checksum
md5 checksum
sequence length
sha512 checksum
ga4gh checksum (base64 URL-encoded version of trunc512 with the ga4gh:SQ. prefix, as specified by VR spec 1.0)
versioned accession
record type e.g. expanded_con
species (full name)
biosample accession (nullable)
taxon identifier (nullable)
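
The checksum columns above can be illustrated with a minimal Python sketch. It is not the loader's own Perl code. It assumes that trunc512 is the hex encoding of the first 24 bytes of the SHA-512 digest, and that the sequence is uppercased before hashing. The function name sequence_checksums and the dictionary keys are hypothetical.

```python
import base64
import hashlib


def sequence_checksums(sequence: str) -> dict:
    """Return the checksum-related metadata columns for one sequence.

    Assumption: the sequence is normalised to uppercase ASCII before hashing.
    """
    data = sequence.upper().encode("ascii")
    sha512_digest = hashlib.sha512(data).digest()
    truncated = sha512_digest[:24]  # first 24 bytes of the SHA-512 digest
    return {
        "trunc512": truncated.hex(),
        "md5": hashlib.md5(data).hexdigest(),
        "length": len(data),
        "sha512": sha512_digest.hex(),
        # base64url of the same 24 bytes, with the ga4gh:SQ. prefix
        "ga4gh": "ga4gh:SQ." + base64.urlsafe_b64encode(truncated).decode("ascii"),
    }


if __name__ == "__main__":
    for name, value in sequence_checksums("ACGT").items():
        print(name, value)
```

Because 24 bytes encode to exactly 32 base64 characters, the ga4gh value never carries padding, and decoding it gives back the same bytes as the trunc512 hex string.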

Loader log

Records when each record was processed, with a success boolean and the paths written. Columns are:

timestamp
loading completed (boolean, written as 1 or 0)
trunc512 checksum
md5 checksum
path to sequence
path to json
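
As a rough illustration of how a loader log might be consumed, the Python sketch below lists records whose load did not complete. The header labels in LOADER_COLUMNS are hypothetical, since the column order is documented above but the exact header text is not. The sample rows are made-up placeholders, not real output.

```python
import csv
import io

# Hypothetical header labels; only the column order comes from the list above.
LOADER_COLUMNS = ["timestamp", "loaded", "trunc512", "md5", "seq_path", "json_path"]


def failed_loads(log_file) -> list:
    """Return the rows of a loader log whose 'loaded' flag is 0."""
    reader = csv.DictReader(log_file)
    if reader.fieldnames != LOADER_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return [row for row in reader if row["loaded"] == "0"]


if __name__ == "__main__":
    # Placeholder rows for illustration only.
    sample = io.StringIO(
        "timestamp,loaded,trunc512,md5,seq_path,json_path\n"
        "2020-01-01T00:00:00,1,aaaa,bbbb,seq/aaaa,json/aaaa.json\n"
        "2020-01-01T00:00:05,0,cccc,dddd,seq/cccc,json/cccc.json\n"
    )
    for row in failed_loads(sample):
        print(row["trunc512"], row["seq_path"])
```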

Version log

Records accessions that need further processing because the record's version was greater than 1. Columns are:

timestamp
accession
current version
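
The version check can be sketched in Python as follows, assuming versioned accessions take the form accession.version. The function names and the example accession XX000001 are hypothetical, and the loader's own logic may differ.

```python
def split_versioned_accession(versioned: str) -> tuple:
    """Split 'ACCESSION.VERSION' into (accession, integer version)."""
    accession, sep, version = versioned.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not a versioned accession: {versioned!r}")
    return accession, int(version)


def needs_version_log(versioned: str) -> bool:
    """True when the record's version is greater than 1."""
    _, version = split_versioned_accession(versioned)
    return version > 1


if __name__ == "__main__":
    # Made-up accessions for illustration only.
    for acc in ("XX000001.1", "XX000001.3"):
        print(acc, needs_version_log(acc))
```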
