Skip to content
This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

first-class ReadGroups + convenient ReadSets #32

Closed
dglazer opened this issue Apr 27, 2014 · 15 comments
Closed

first-class ReadGroups + convenient ReadSets #32

dglazer opened this issue Apr 27, 2014 · 15 comments

Comments

@dglazer
Copy link
Member

dglazer commented Apr 27, 2014

After much discussion (mostly in #24 with @richarddurbin, @lh3, @cassiedoll, and @fnothaft), I have a suggestion on a way forward that hopefully addresses all the requirements people have raised. This writeup is meant to replace #24, which I'm closing, so people don't have to wade through all the history of how we got here.

The design principles are:

  1. for flexibility, allow query and analysis over any ad hoc collection of RG’s
  2. for convenience, provide a default collection of RG’s that callers can choose to use

To get there, I suggest we proceed in two logical steps:

  1. agree on a core API that has ReadGroups only (no ReadSets), to ensure it supports all the ad hoc mix-and-match-RG use cases that people have raised
  2. agree on how to add a lightweight "default collection" concept of ReadSets, to make callers’ lives easier for the other use cases that people have raised

I believe that will get us an API that helps callers by making “simple things simple and complex things possible”. See below for a sketch of the details -- if people like this direction, we can turn it into pull requests quickly, since most of the work is already at least partly in flight.

Please comment on whether you’re comfortable taking that next step and putting this in code. If folks like both step 1 and step 2, we can turn it all into pull requests at once, which will be a bit more efficient. If folks are comfortable with step 1 but not sure about step 2, we can do them one at a time. (And of course details like object and field names can be hashed out in the pull requests themselves.)

Step 1: ReadGroups only; no ReadSets

Mental model:
red-only

  • RG’s are the fundamental unit of grouping
  • you can query over any ad hoc collection of RG’s
  • there are no default or persistent collections of RG’s

GARead changes:

GAReadGroup changes:

  • add the GAHeaderSection fields (currently in GAReadSet)
  • add tags field (to allow individual repositories to support not-yet-standard metadata)
  • add /readgroups/search method, with RG-level params (e.g. id, library, sample, tags)
  • PROBABLY: add /readgroups/get method, and require RGid to be unique

Note on implementation-specific extensibility -- implementers:

  • SHOULD NOT add new fields to standard objects
  • MAY use the tags fields on GARead and GAReadGroup to add extra info
  • MAY take advantage of /search support for tag fields to filter by that extra info
  • MAY support extra methods, as long as they don’t conflict with the standard methods

Step 2: introduce ReadSets

Mental model:
red-blue

  • RGs are still the fundamental unit of grouping
  • you can still query over any ad hoc collection of RG’s
  • you can also use a new default collection, which is the ‘suggested’ unit of analysis

GAReadSet object:

  • introduce a new GAReadSet object. Every RG belongs to exactly 1 ReadSet
  • fields are id, name, list of RGs
  • add /readsets/search method
  • MAYBE: expose all homogeneous RG fields (e.g. library, sample) for easy browsing

Use of GAReadSet in other places:

  • add [readsetId, …] as a query param to /readgroups/search and /reads/search
  • add readsetId to the GARead and GAReadGroup objects

Note on implementation-specific flexibility:

  • Implementations are presumably in the best position to understand their data, and therefore can choose how to assign RG’s to ReadSets.
    • One possible heuristic is “all the RGs that were imported from a single BAM file, and share the same sample name, go into a single ReadSet”.
    • Another possible heuristic, if there is no known preferred grouping of the data, is “every RG goes into its own ReadSet”. (Which would be easy to implement on the fly with zero storage overhead.)
@fnothaft
Copy link
Contributor

I am +1 on doing step 1 and 2 simultaneously.

@delagoya
Copy link
Contributor

I've been hesitant to enter this conversation, but throwing caution to the wind now:

I think ReadSets and ReadGroups are orthogonal groupings of reads, where:

  • ReadGroup is a disjoint collection of Reads that correspond to some possibly confounding factor introduced during the experimental production of the read. A ReadGroup is defined on submission by the submitter and represents an atomic grouping of reads (could be library, could barcode, could be source sample). A Read can only belong to one ReadGroup
  • ReadSet is an arbitrary collection of Reads that represents a set of reads to be analyzed at the same time. Furthermore:
    • A Read can belong to many ReadSets
    • ReadSets can contain all or subsets of Reads that are present in other ReadSets. Whether we make them recursive or not is a matter of style, but I can live with either.
    • A ReadSet can define it's Read sort order
    • When dereferencing the Reads of a ReadSet, the API should iterate through the Reads in the sort order specific to the ReadSet
    • Reads from the same ReadGroup can belong to different ReadSets

That last point is where my understanding of a ReadSet deviates from the current discussion. In particular for RNA-Seq quantification, you want to be able to define separate ReadSets for each paired end for normalization purposes. Logically, the pairs belong to the same ReadGroup, but you want to analyze the groups separately and hence would go into disjoint ReadSets for at least a portion of the data analysis pipeline.

@vadimzalunin
Copy link

ReadSet is an arbitrary collection of Reads that represents a set of reads to be analyzed at the same time.

Can ReadSets be created for already imported reads?

@delagoya
Copy link
Contributor

Can ReadSets be created for already imported reads?

@vadimzalunin That would be my preference.

@dglazer
Copy link
Member Author

dglazer commented Apr 29, 2014

Angel, sounds like we're all in sync on what a ReadGroup is -- good. Do you agree with the proposed step 1, and think it's ready to convert into code, or do you have concerns about it?

I think I understand your definition of ReadSet, but it's a very different object than the one I proposed in step 2. (Not debating which is better; just making it clear they are different objects, would need different red/blue pictures to represent, and would address different use cases.) More descriptive naming would be PreferredCollectionOfReadGroups (for the proposal in this issue) and AdHocCollectionOfReads (for your proposal, if I understand it right).

@delagoya
Copy link
Contributor

Angel, sounds like we're all in sync on what a ReadGroup is -- good. Do you agree with the proposed step 1, and think it's ready to convert into code, or do you have concerns about it?

Yes I agree with the proposals of step 1 and have no further comment on it.

I think I understand your definition of ReadSet, but it's a very different object than the one I proposed in step 2.

Yes, which is why I brought it up :)

(Not debating which is better; just making it clear they are different objects, would need different red/blue pictures to represent, and would address different use cases.)

Yes, we do and I am working on it, but have some hard commitments for tomorrow that are delaying my progress. Suffice to say, I think of a ReadSet as the resulting array of Reads from a "select" (or query / search) operation. My notion of a ReadSet is defined by the selection criteria, as well as other metadata that help define the ReadSet. An ad-hoc ReadSet is just one you have not bothered to save for the long-term or applied other sorts of annotation on top of.

More descriptive naming would be PreferredCollectionOfReadGroups (for the proposal in this issue) and AdHocCollectionOfReads (for your proposal, if I understand it right).

I don't think we should make this distinction. I have Bad Memories of over-specified objects from my previous foray into developing standards.

@birney
Copy link

birney commented Apr 29, 2014

Ok, many thanks. This is useful. I support this, but the main pattern in the future will
be arbitrary Read Groups is my honest opinion. I guess I should add this to the comments here.
This needs to be in big capital letters at the start, as implementations that optimise for default
readsets is just going to get themselves into trouble. I think elevating “ReadSet” as a different
concept to a “List of Read Groups” is dangerous in terms of mindset, whilst one can still provide
the concept of helper functions. The basic pattern I’d like is something like:

// pseudo Java like code

rs = make_ReadStore_Factory(“http://www.ebi.ac.uk/EGA/GA4GH/Root/“);

// helper function to handle references, must go through indirection to make
// sure callers are explicit about the reference
chr = rs.fetch_reference_chromosome(”GRCh38”,”1”);
start = 1;
end = 1000000;

// notice I could have come up with different rgl's
rgl = rs.fetch_ReadGroupListHelper_by_name(“UK10K_Cohort”);

// here are some other rgl’s
rgl1000g = rs.fetch_ReadGroupListHelper_by_name(“1000G_PhaseI”);
an_rgl = rs.fetch_ReadGroupListHelper_by_accession(“EGAS00000000012”);

// you can merge etc RGLs
// This is UK10K plus 1000g
new_rgl = rgl.merge_ReadGroupList(rg1000g)

// you can iterate over rgls, delete etc

// This is the big call for the ReadStore
// Reads have one and only one ReadGroup
arc = rs.get_ReadCollection(chr,start,end,newrgl);

@dglazer
Copy link
Member Author

dglazer commented Apr 29, 2014

Thanks @birney. Making sure I understand:

  • you agree with the proposed step 1, and think it's ready to convert into code
  • you're lukewarm (or worse) about step 2, and are suggesting a different model

Is that right? (In particular, I want to make sure that you're comfortable with step 1 as is.)

@dglazer
Copy link
Member Author

dglazer commented Apr 29, 2014

I just talked to @birney -- he confirmed my understanding above (yes to step 1; cool on step 2).

@kozanitis
Copy link
Contributor

I am +1 for step1 and -0 for step2.

As far as step2 goes, if we assume arbitrary read groups according to @birney we will end up applying frequently your final bullet:

“every RG goes into its own ReadSet”

I guess this might nullify the purpose of ReadSets.

Also I am not that comfortable with your assumption:

Implementations are presumably in the best position to understand their data, and therefore can choose how to assign RG’s to ReadSets

Note that "best position" doesn't necessarily mean ground truth. Although I see your point of view, you might need a wider consensus from the GA before adopting that.

@birney
Copy link

birney commented Apr 30, 2014

Yes I agree with Step1.

Looking at the IDL, I think the concept of "ReadSet" is being confused a bit with a factory "ReadStore" somewhat - and/or the meta-data about the dictionary for reference sequences etc with groupings of people. In my view we have 4 things (I know many people agree with this)

ReadStores - these are the root factory interfaces where the main calls to fetch things come from
(EGA and ENA would both be ReadStore; GenomicsEngland might be another; dbGaP etc). ReadStores should provider helper functions so that people can use easy strings for assemblies, references, cohorts etc

ReadGroups - these are sets of reads which should be analysed together because the scientist who generated the data say that they are together. Most obviously these are from the same sequencing run. It will always be messy about modelling the precise lane/barcode/run/details of different institutes/companies/etc - inside of an institute there might be conventions or other schemes, but once this is exposed to API clients, the contract is that the client can consider these reads to be from (a) only one sample and (b) have relatively homogenous characteristics.

Samples - A ReadGroup has one and only one Sample. However we can either allow Samples to have > 1 ReadGroup and/or allow Samples to be equivalent in some testing manner as for sure there will be >1 ReadGroup per Sample in a number of scenarios.

ReadGroupLists - these are sets of ReadGroups allowing functions to call across ReadGroups. It is up to the precise implementation (and therefore the client has to know outside of the API) about whether the ReadGroupList means 1 ReadGroup per Sample or >1 ReadGroup per Sample. I like using the name ReadGroupList as it is explicit about the arbitrary-ness of this, and how this can be
manipulated (and should be manipulated) client side. I do think one needs helper functions for Cohorts or indeed quite flexibile cohort functions (get a ReadGroupList for 1 sample), but I think all of this is more about helper functions in make ReadGroupLists, rather than trying to model the differentially nested behaviour that will occur across different implementations.

@dglazer
Copy link
Member Author

dglazer commented Apr 30, 2014

Thanks Ewan. Three quick thoughts:

  1. yay on step 1 - we'll hopefully confirm common ground there in today's
    call and then make it so.

  2. I agree today's IDL is a bit confusing, and think it's at least
    partially for two reasons - it hasn't caught up with this discussion re RG
    vs. RS, and it doesn't yet define any methods. Both of those will get
    better after step 1.

  3. lots left to talk about re RG, RGL, RS, Sample, etc. -- we can do that
    right after step 1 lands
    On Apr 30, 2014 5:53 AM, "birney" notifications@github.com wrote:

Yes I agree with Step1.

Looking at the IDL, I think the concept of "ReadSet" is being confused a
bit with a factory "ReadStore" somewhat - and/or the meta-data about the
dictionary for reference sequences etc with groupings of people. In my view
we have 4 things (I know many people agree with this)

ReadStores - these are the root factory interfaces where the main calls to fetch things come from

(EGA and ENA would both be ReadStore; GenomicsEngland might be another;
dbGaP etc). ReadStores should provider helper functions so that people can
use easy strings for assemblies, references, cohorts etc

ReadGroups - these are sets of reads which should be analysed together because the scientist who generated the data say that they are together. Most obviously these are from the same sequencing run. It will always be messy about modelling the precise lane/barcode/run/details of different institutes/companies/etc - inside of an institute there might be conventions or other schemes, but once this is exposed to API clients, the contract is that the client can consider these reads to be from (a) only one sample and (b) have relatively homogenous characteristics.

Samples - A ReadGroup has one and only one Sample. However we can either allow Samples to have > 1 ReadGroup and/or allow Samples to be equivalent in some testing manner as for sure there will be >1 ReadGroup per Sample in a number of scenarios.

ReadGroupLists - these are sets of ReadGroups allowing functions to call across ReadGroups. It is up to the precise implementation (and therefore the client has to know outside of the API) about whether the ReadGroupList means 1 ReadGroup per Sample or >1 ReadGroup per Sample. I like using the name ReadGroupList as it is explicit about the arbitrary-ness of this, and how this can be

manipulated (and should be manipulated) client side. I do think one needs
helper functions for Cohorts or indeed quite flexibile cohort functions
(get a ReadGroupList for 1 sample), but I think all of this is more about
helper functions in make ReadGroupLists, rather than trying to model the
differentially nested behaviour that will occur across different
implementations.


Reply to this email directly or view it on GitHubhttps://github.com//issues/32#issuecomment-41779223
.

@lh3
Copy link
Member

lh3 commented Apr 30, 2014

About ReadGroup IDs. I think there should be an internal ID and an external accession number. Each read store instance can freely use any internal IDs in any formats, but it should only expose accession numbers to end users.

In SRA, a ReadGroup-level accession looks like SRR012345 and in ENA, looks like ERR012345. The first two letters mark the source of accessions, such that accessions will never be conflictive between read store instances. A submitter only submits data only once to either SRA or ENA. (S)he gets an accession that is synced routinely between SRA/ENA. We can get ENA data (at least accession numbers) from SRA and vice versa. This is not only the practice for short reads, but also for all biological sequences in public repositories. In future read stores, we should continue this practice to assign each read group a stable and globally unique accession across all read store instances. Google, for example, may give a read group submitted to Google Genomics an accession like GOR012345. We should not use a string hash as an accession (fine for internal IDs).

For already accessioned public data, such as 1000g, all read store instances should use the same accession numbers. When a user query: /readgroups/search?id=SRR012345, (s)he should get the same data from all read store instances.

The implementation of accession should be simple. We may just add a "string accession;" field to ReadGroup. There are a few delicate issues, such as how to recognize existing accessions, what to do if reads are mapped with two mappers and how to sync, but these should be solvable.

@dglazer
Copy link
Member Author

dglazer commented May 1, 2014

Quick update from yesterday's Reads task team call:

  • we agreed to proceed with step 1; initial pull requests are in flight
  • we agreed to temporarily defer step 2
  • we had an initial conversation with the Metadata team about what fields belong directly in Reads / RGs, vs. as pointers to some external system -- they're going to come back with some thoughts

I'll leave this open until all the step 1 changes are underway.

@dglazer
Copy link
Member Author

dglazer commented May 17, 2014

Closed -- all the step 1 changes to make ReadGroups a first-class object are now in. (And #52 has a proposal for addressing the remaining loose ends.)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants