VCF sample metadata - proposal for a GenotypedSampleMetadata object #1039
Maybe not the best place for this, but I hadn't tried writing such a chart before:
Variant annotation (VCF ANN spec)
Currently GA4GH doesn't model Sample; instead, sampleId is used in CallSet and ReadGroup. bdg-formats uses sample name in RecordGroupMetadata and AlignmentRecord, and sampleId in Genotype.
The Sample model I'm most familiar with is from SRA, modeled in XML schema here
Here's a proposal for flattening the SRA sample XSD to Avro
Although after all that it is not clear where
ENA default sample checklist (values that should go in sample attributes, maybe some of these could move to fields):
If you prefer reading javadoc over XSD docs, we generated JAXB mappings here, though they might be out of date.
We do want our sample metadata schema in bdg-formats to be able to map to the SRA schema. IMO, though, we may not want to adopt the SRA schema as fully as the proposal above does. Many SRA fields only make sense when running a data archive/repository like NCBI/EBI, and for our more general audience they may create noise and ambiguity about which field a given piece of data belongs in.
For example -
I'd suggest for the bdg-formats SampleRecord a minimal schema with fields:
For data derived from SRA/ENA, we could provide suggested keys for the
I think such a minimal schema for the AssayedSampleID, AssayedSampleIdAlias and SampleName provides clarity to that primary cardinality relationship, but at the same time leaves plenty of room in the
From what I understand, storing data in
That would seem to lead to a design principle of preferring nullable fields to attributes.
But isn't the size of data in these sample metadata records so trivially small that we need not worry about performance inefficiency in this case?
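To make the trade-off in this exchange concrete, here is a minimal Python sketch (not Avro, and not the actual bdg-formats schema) contrasting a nullable field with a key/value attribute. The snake_case names mirror the AssayedSampleID, AssayedSampleIdAlias and SampleName fields discussed above; `collection_date`, the `lookup` helper, and the sample values are made up for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SampleRecord:
    # The three basic fields named in this thread (Python spellings are assumptions).
    assayed_sample_id: str
    assayed_sample_id_alias: Optional[str] = None
    sample_name: Optional[str] = None
    # Hypothetical nullable field, promoted out of the attributes map.
    collection_date: Optional[str] = None
    # Free-form key/value attributes for everything else.
    attributes: Dict[str, str] = field(default_factory=dict)


def lookup(record: SampleRecord, key: str) -> Optional[str]:
    """Return a value by key, preferring a typed field over the attributes map."""
    value = getattr(record, key, None) if key != "attributes" else None
    if isinstance(value, str):
        return value
    return record.attributes.get(key)


# The same (made-up) value stored both ways.
as_field = SampleRecord("S1", sample_name="example", collection_date="2016-05-01")
as_attribute = SampleRecord(
    "S1", sample_name="example", attributes={"collection_date": "2016-05-01"}
)
```

Either layout answers the same lookup. The difference is that the nullable field is visible in the schema and can be typed and projected as a column, whereas the attribute needs a map lookup and a shared convention about the key name.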
If we feel SRA-sourced metadata is a major use case, and we want nullable field names that map to SRA explicitly rather than suggested key/value attributes, then I suggest prefixing the nullable names like:
I'd still then like to have the basic three fields
Yep. I'm still trying to figure out what the design principles are for our schema, so I'm trying argumentation. :)
I don't believe the fields of SRA SAMPLE are the interesting bits, rather what might be stored as attributes according to e.g. the ENA default sample checklist linked above.
Starting from a minimal record of
would be fine with me.
Then, since the ENA checklist is the "minimum information required for the sample" for ENA, and I assume one can dig up minimal requirements for SRA, CGHub, dbGaP, etc., which should be similar, we might want to add some of those keyed values as nullable fields.
Looking at the
I'd suggest to add