VCF sample metadata - proposal for a GenotypedSampleMetadata object #1039

Closed
jpdna opened this Issue May 26, 2016 · 15 comments

@jpdna
Member

jpdna commented May 26, 2016

Proposal: a GenotypedSampleMetadata Avro schema separate from RecordGroup, so as not to confound the concepts of read/record group metadata and VCF sample metadata.

As discussed in #1015

@fnothaft
Member

fnothaft commented May 26, 2016

@heuermh voiced a -0 RE: having Metadata in any names.

@heuermh
Member

heuermh commented May 26, 2016

Maybe not the best place for this, but I hadn't tried writing such a chart before:

References

| GA4GH | bdg-formats | ADAM |
|---|---|---|
| Reference | Contig | |
| ReferenceSet | | SequenceDictionary |
| | NucleotideContigFragment | |
| Position | | ReferencePosition |
| | | ReferencePositionPair |
| | | ReferenceRegion |

Reads

| GA4GH | bdg-formats | ADAM |
|---|---|---|
| ReadGroup | RecordGroupMetadata | |
| ReadGroupSet | | RecordGroupDictionary |
| ReadStats | | |
| LinearAlignment | | |
| ReadAlignment | AlignmentRecord | |
| CigarOperation | | |
| | | CigarElement |
| Program | | ProgramRecord |
| | Fragment | |
| | | ReadBucket |
| | | SingleReadBucket |

Variants

| GA4GH | bdg-formats | ADAM |
|---|---|---|
| Variant | Variant | |
| VariantSetMetadata | | |
| VariantSet | | |
| CallSet | | |
| Call | Genotype | |
| | GenotypeAllele | |
| | GenotypeType | |
| | VariantCallingAnnotations | |
| | DatabaseVariantAnnotation | |
| | StructuralVariant | |
| | StructuralVariantType | |
| | | VariantContext |

Variant annotation (VCF ANN spec)

| GA4GH | bdg-formats | ADAM |
|---|---|---|
| VariantAnnotationSet | | |
| VariantAnnotation | VariantAnnotation | |
| TranscriptEffect | TranscriptEffect | |
| | VariantAnnotationMessage | |
| AnalysisResult | | |
| AlleleLocation | | |
| HGVSAnnotation | | |

Sequence features

| GA4GH | bdg-formats | ADAM |
|---|---|---|
| Feature | Feature | |
| FeatureSet | | |
| Strand | Strand | |
| ExternalIdentifier | Dbxref | |
| OntologyTerm | OntologyTerm | |
| Attributes | | |
| | | Gene |
| | | Transcript |
| | | Exon |
| | | CDS |
| | | UTR |

Other

| GA4GH | bdg-formats | ADAM |
|---|---|---|
| Analysis | | |
| Experiment | | |
| Dataset | | |

Currently GA4GH doesn't model Sample; rather, a sampleId is used in CallSet and ReadGroup. bdg-formats uses a sample name in RecordGroupMetadata and AlignmentRecord and a sampleId in Genotype.

The Sample model I'm most familiar with is from SRA, modeled in XML schema here:
https://www.ebi.ac.uk/ena/submit/metadata-model

There are of course 15 other competing standards. I'm not suggesting we get carried away modeling samples, just that we add enough to support useful queries and get the cardinality relationships right.

@heuermh
Member

heuermh commented May 26, 2016

Here's a proposal for flattening the SRA sample XSD to Avro:

record Sample {

  // flattened IDENTIFIERS
  union { null, string } primaryId = null;
  union { null, QualifiedName } submitterId = null;
  array<QualifiedName> externalIds = [];
  array<string> secondaryIds = [];
  array<string> uuids = [];

  // flattened SAMPLE_NAME
  union { null, string } anonymizedName = null;
  union { null, string } commonName = null;
  union { null, string } individualName = null;
  union { null, string } scientificName = null;
  union { null, int } taxonId = null;

  // SAMPLE
  union { null, string } accession = null;
  union { null, string } alias = null;
  union { null, string } centerName = null;
  union { null, string } brokerName = null;
  union { null, string } title = null;
  union { null, string } description = null;

  // SAMPLE_LINKS
  array<XrefLink> xrefLinks = [];
  array<UrlLink> urlLinks = [];
  array<EntrezLink> entrezLinks = [];

  // SAMPLE_ATTRIBUTES
  map<string> attributes = {};
}

Although after all that, it is not clear where sampleId or sampleName from the current schema should go, and often the values you'd most want for interesting queries are on the experimental design side, not necessarily attributes of the sample.

Example:
https://www.ebi.ac.uk/ena/data/view/ERS573550&display=xml

ENA default sample checklist (values that should go in sample attributes; maybe some of these could move to fields):
https://www.ebi.ac.uk/ena/data/view/ERC000011&display=xml

If you prefer reading javadoc over XSD docs, we generated JAXB mappings here, though they might be out of date.

@jpdna
Member

jpdna commented May 27, 2016

We do want our sample metadata schema in bdg-formats to be able to map to the SRA schema, but IMO we may not want to adopt the SRA schema as fully as the proposal above, because many SRA fields only make sense in the use case of running a data archive/repository like NCBI/EBI, and for our more general audience they may create noise and ambiguity about which field data belongs in.

For example, submitterId is the field for the "local" submitter name for an assayed sample, but it really makes sense only in the context of metadata being submitted to a public repository. Similarly, primaryId is the NCBI/EBI-assigned accession. We may want to capture those fields from datasets that come from public archives, but a submitterId, primaryId, centerName, or brokerName may make little sense for some of our users, and may encourage ad hoc repurposing of these fields and an ensuing mess when interpreting the records. Likewise, capturing the species common name, scientific name, and taxonId seems more like a higher-level archive metadata function that we don't need in the primary schema; if anything, they can go in the key-value section.

I'd suggest for the bdg-formats SampleRecord a minimal schema with these fields:

  • AssayedSampleId - the name for the physical sample that was assayed, which possibly also maps to the read group name in the upstream BAM file. In the case of SRA, this maps to submitterId.
  • AssayedSampleIdAlias - used when the name in the upstream source file (such as the read group name) differs from the above, as we want to capture a direct mapping to the read group if at all possible.
  • SampleName - a unique name for a person/individual, such as NA12878. Maps to individualName in the SRA schema.
  • possibly a standardized dbxref or URI array meant to link to external IDs
  • an attributes key-value map of custom fields

For data derived from SRA/ENA, we could provide suggested keys for the attributes map that directly capture all the fields from SRA that we want to explicitly map, without complicating our main schema with archive-repository-specific details.

I think such a minimal schema with AssayedSampleId, AssayedSampleIdAlias, and SampleName clarifies the primary cardinality relationship, while leaving plenty of room in the attributes key-value section (with documentation and guidance we might provide) to capture the full glory of the SRA schema and other future metadata sources.
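As a sketch in Avro IDL, the minimal schema above might look like the following (the field names here are illustrative, not final):

record Sample {
  // name for the physical sample that was assayed; maps to SRA submitterId
  // and, where possible, to the read group name in the upstream BAM file
  union { null, string } assayedSampleId = null;

  // upstream source-file name (e.g. read group name) when it differs from the above
  union { null, string } assayedSampleIdAlias = null;

  // unique name for the person/individual, e.g. NA12878; maps to SRA individualName
  union { null, string } sampleName = null;

  // possible standardized links to external ids
  array<string> dbxrefs = [];

  // custom key/value fields
  map<string> attributes = {};
}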

@heuermh
Member

heuermh commented May 27, 2016

From what I understand, storing data in attributes is inefficient given how ADAM and Spark operate on schema records (projections, filters, column compression on disk, Spark SQL queries, etc.). For example, it is not possible to project only a single key from the attributes field.

That would seem to lead to a design principle of preferring nullable fields to attributes.

@jpdna
Member

jpdna commented May 27, 2016

But isn't the size of the data in these sample metadata records so trivially small that we need not worry about performance inefficiency in this case?

If we feel SRA-sourced metadata is a major use case, and we want nullable field names that map to SRA explicitly rather than suggested key/value attributes, then I suggest prefixing those names, e.g.:

- sra_submitterId
- sra_primaryId

etc., to make it clear that these fields exist for the SRA mapping use case, and to make it easy for users who don't deal with SRA data to ignore them all.

I'd still like to have the three basic fields AssayedSampleId, AssayedSampleIdAlias, and SampleName, as proposed above, for the more general audience, and in the SRA use case we could suggest that users populate those fields as well, to make combining and comparing SRA and non-SRA records easier.
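For illustration, such an SRA-prefixed variant might combine the general fields with explicitly mapped SRA fields (all field names hypothetical):

record Sample {
  // general fields for all users
  union { null, string } assayedSampleId = null;
  union { null, string } assayedSampleIdAlias = null;
  union { null, string } sampleName = null;

  // explicit SRA mappings, prefixed so non-SRA users can safely ignore them
  union { null, string } sra_submitterId = null;
  union { null, string } sra_primaryId = null;
  union { null, string } sra_centerName = null;
  union { null, string } sra_brokerName = null;

  map<string> attributes = {};
}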

@heuermh
Member

heuermh commented May 27, 2016

> But isn't the size of the data in these sample metadata records so trivially small that we need not worry about performance inefficiency in this case?

Yep. I'm still trying to figure out what the design principles are for our schema, so I'm trying argumentation. :)

> If we feel SRA-sourced metadata is a major use case ...

I don't believe the fields of the SRA SAMPLE are the interesting bits, rather what might be stored as attributes according to e.g. the ENA default sample checklist linked above.

Starting from a minimal record of

record Sample {
  union { null, string } id = null;  // primary key
  union { null, string } alias = null;  // alternate key for joining with ReadGroups
  union { null, string } name = null;  // descriptive name, e.g. SRA individualName
  map<string> attributes = {};
}

would be fine with me.

Then, since the ENA checklist is the "minimum information required for the sample" for ENA, and I assume one can dig up similar minimal requirements for SRA, CGHub, dbGaP, etc., we might want to add some of those keyed values as nullable fields.
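For example, promoting some checklist values from attributes to nullable fields might look like this (the two promoted fields below are hypothetical picks, not confirmed against the checklists):

record Sample {
  union { null, string } id = null;  // primary key
  union { null, string } alias = null;  // alternate key for joining with ReadGroups
  union { null, string } name = null;  // descriptive name, e.g. SRA individualName

  // hypothetical promotions from a minimal-information checklist
  union { null, string } geographicLocation = null;
  union { null, string } collectionDate = null;

  map<string> attributes = {};
}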

@jpdna
Member

jpdna commented May 27, 2016

> Starting from a minimal record of

+1 to the minimal Sample record @heuermh proposes above being added to the bdg-formats Avro schema and being the type written by vcf2adam into the _samples.avro file alongside the Parquet files.

@fnothaft
Member

fnothaft commented May 28, 2016

+1 to the Sample @heuermh proposed as well!

@heuermh
Member

heuermh commented May 31, 2016

I was hoping that proposal would be the start of discussion, not the end of it.

Is it correct that nullable fields are preferable to attributes? If so, are there any attributes of the ENA checklist or other minimal requirements that we should add as fields?

@heuermh heuermh modified the milestone: 0.20.0 Jun 5, 2016

@heuermh heuermh referenced this issue in bigdatagenomics/bdg-formats Jun 6, 2016

Merged

Add schema for sample #84

@jpdna
Member

jpdna commented Jun 7, 2016

Looking at the Genotype bdg-formats schema just now, I really don't like the repeated per-sample data such as sampleDescription and processingDescription. These should be normalized out into the new Sample record so as to not cause memory bloat when we actually materialize Genotype objects.

I'd suggest adding
sampleDescription
processingDescription
to the new Sample record and removing those fields from Genotype.
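Normalized that way, the Sample record could absorb the two fields; a sketch, built on the minimal record proposed earlier:

record Sample {
  union { null, string } id = null;  // primary key
  union { null, string } alias = null;  // alternate key for joining with ReadGroups
  union { null, string } name = null;  // descriptive name, e.g. SRA individualName

  // moved here from Genotype so they are stored once per sample
  // rather than repeated on every genotype call
  union { null, string } sampleDescription = null;
  union { null, string } processingDescription = null;

  map<string> attributes = {};
}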

@fnothaft
Member

fnothaft commented Jun 7, 2016

Strong +1 @jpdna

@heuermh
Member

heuermh commented Jun 7, 2016

I'm +1 on removing them from Genotype, -0 on adding them as fields to Sample.

@heuermh
Member

heuermh commented Jun 27, 2016

The new schema for Sample was added in bigdatagenomics/bdg-formats#84.

That pull request did not remove any fields from Genotype or add code to ADAM to read or write data in those formats. @jpdna would you like to take up those tasks?

@heuermh
Member

heuermh commented Jul 19, 2016

Closed by commit 4b6e107.

@heuermh heuermh closed this Jul 19, 2016
