look up all reads in fragment #212

saupchurch · 2014-12-15T16:45:04Z

For fragment reconstruction when dealing with RNA data we need to have the full CIGAR as well as position of the mate pair. A method to query all the reads for a fragment would allow this as well as support cases (i.e. middle reads) containing more than a single paired read for a fragment.

delagoya · 2014-12-17T14:40:06Z

When we had spoken on the call, I believe that the plan was to look into the needed changes within the Reads API to add in the notion of a Fragment record, and to create methods that interact with that record type. I had assumed that we already had a representation of Fragment in the API, but it looks like we do not. All we have is ReadAlignment.fragmentName which is the equivalent of QNAME in SAM.

So we need a Pull Request to add in the Fragment record type, which is distinct from ReadGroup, and to add in methods for it. Any volunteers?

pgrosu · 2014-12-17T17:33:06Z

I wasn't on the call, though this seems fairly straightforward, since a fragment is just a collection of reads under the same ReadGroup. I can give it a shot if it's okay, but it would be nice to be in sync with what was discussed before I start, so that I don't miss anything that was decided as required. Let me know either way.

Thanks,
Paul

saupchurch · 2015-01-06T16:28:33Z

Paul, thanks for taking a stab at this. Delagoya has the essence of it - the plan is to implement the Fragment portion of the API hierarchy. The driver behind this as discussed on the call is the need to have the mate pair CIGAR in addition to position for fragment reconstruction when on the RNA side of the things due to splicing and other potential modifications. It is also desirable to have the structure in place to handle multiple reads in a fragment for future proofing against things like middle reads.

pgrosu · 2015-01-06T19:17:42Z

Hi Sean,

Thank you for the reply and the helpful information. I will work through several implementations until one seems most clear/intuitive, and then post something back. I might have a small question here and there, but this now gives me a good foundation from which to work from.

If there are any additional meeting notes/diagrams/docs/presentations from the call regarding this, that would be nice to sync with.

Thanks,
Paul

saupchurch · 2015-03-09T23:05:00Z

Paul, how is this going? Do you have any further questions?

pgrosu · 2015-03-09T23:06:45Z

Sorry for the delay. I got a little vacation this week so I should have time to complete this now.

pgrosu · 2015-03-12T21:55:49Z

Hi Sean and Angel,

Sorry for the delay and I wanted to post this here before performing a pull-request, in order to open it up first for discussion.

So without further ado, below is the most succinct implementation I am thinking of:

record Fragment {

  /* The fragment ID. */
  string id;

  /* The ID of the read group this fragment belongs to. */
  string readGroupId;

  /* The collection of ReadAlignment which belong to the same fragment. */
  array<ReadAlignment> readAlignments = [];

}

record ReadAlignment {

  ...

  /* The Fragment that this ReadAlignment belongs to. */
  string fragmentId;

  ...

}

The array of ReadAlignment might not be necessary under Fragment, if ReadAlignment references a specific Fragment via fragmentId.

Please let me know what you think.

Thank you,
Paul

saupchurch · 2015-03-13T15:48:45Z

I like it - it captures the idea without being too weighty. From looking at how ReadAlignments fit into the ReadGroup it seems we are not putting arrays of alignments in the group. I'm not sure if that was a design choice or if there were more 'real' schema considerations. To follow that model, I'd remove the array of ReadAlignment from Fragment and reference via fragmentId. I'd go ahead and open a pull request - further discussion can be carried out there.

pgrosu · 2015-03-13T16:47:17Z

Thank you for helping me with this. Yes, clarity is my preferred style :) I'll take out the array of ReadAlignment, and regarding ReadGroup I was trying to integrate the following definition from the schema:

* A *fragment* is a single stretch of a DNA molecule. There are typically
millions of fragments in a ReadGroup. A fragment has a name (QNAME in BAM
spec), a length (TLEN in BAM spec), and an array of reads.

In fact we can take out ReadGroup as well, since it gets referenced in ReadAlignment.

Regarding the reasoning behind choices made, that came out of several fun discussions from last year - which I listed below, in chronological order - but we're always free to update at any time:

#38
#47
#51
#60
#63

I'll get started on the PR.

Thank you,
Paul

pgrosu · 2015-03-13T23:06:25Z

Hi Sean,

Something very funny is happening when I fork schemas using the master branch. I do not get the current schema but a very old version of the schema, which uses the GA prefix and only a limited number of Avro files. I tried forking from several places but the same thing keeps happening. If you know how I should approach forking it in a different way, I would greatly appreciate it. Below are two screenshots of what I get in my repository and the link to it is the following:

https://github.com/pgrosu/schemas

Thank you in advance,
Paul

saupchurch · 2015-03-13T23:32:21Z

I suspect that you must have forked this at some point in the past. From what we have been able to figure out here, you can't sync a fork from the web interface. See: https://help.github.com/articles/syncing-a-fork/

pgrosu · 2015-03-13T23:46:11Z

Thank you for the link, and I will try out these steps until I see no difference between ga4gh and my repository.

Thank you for helping me with this,
Paul

pgrosu · 2015-03-14T08:59:31Z

Hi Sean,

It took some work, but I finally submitted it (#259). Rebasing still is a process I am working at smoothing out :)

Many thanks for helping me,
Paul

afirth · 2015-04-27T13:51:16Z

Is this complete now that #259 has been merged?

pgrosu · 2015-04-27T14:01:47Z

@saupchurch, would you like me to add the methods too?

saupchurch · 2015-04-27T15:19:01Z

@pgrosu Let's see if there is a need from the community before adding. Closing now that #259 is merged.

pgrosu · 2015-04-27T15:21:14Z

@saupchurch Sounds good.
Thanks,
Paul

afirth · 2015-05-20T19:42:11Z

I'd like to reopen this issue as the title is not solved by the current solution. There is, as @saupchurch discussed on the DWG and RNAseq calls, a need to have methods here. I'm working to do that, but running into a few issues.

According to the Reads protocol, A ReadAlignment object is a flattened representation of the bottom layers of this hierarchy. There's exactly one such object per linear alignment or graph alignment. The object contains alignment info, plus fragment and read info for easy access.
- But
  - ReadGroupSet >--< ReadGroup --< fragment --< read --< alignment --< linear/graph alignment
  - A ReadGroupSet is a logical collection of ReadGroup's.
  - A ReadGroup is all the data that's processed the same way by the sequencer. There are typically 1-10 ReadGroup's in a ReadGroupSet.
  - A fragment is a single stretch of a DNA molecule. There are typically millions of fragments in a ReadGroup. A fragment has a name (QNAME in BAM spec), a length (TLEN in BAM spec), and an array of reads.
- Could someone fill me in on why this was flattened? It would be difficult (as I'm finding) to unflatten it. However, without doing that the primary use case I know of for fragments return the reads which are in this fragment is O(n). I understand the desire to model off of BAM, but it does an equally terrible job of capturing this. This will be compounded dramatically (as Sean alluded to on today's call) if/when multireads of n>2 become common.

From an API and implementation standpoint for RNA, it seems like a terrible idea to have fragmentId be part of readAlignment, however I understand that is desirable for regenerating BAM files to give to other tools. How has this question [which side of data contains the one-many relationship] been decided in the past? E.g a readGroupSet contains an array of readGroup[Id]s, but it also contains the datasetId (rather than the dataset containing an array of readGroupSets.

So, sure, we can slap a fragmentIds array field into searchReadsRequest, and it will work, but it is heinously inefficient and seems backwards. I'd like to hear from @pgrosu and @saupchurch in particular, and maybe someone can tag the folks responsible for the flattening of fragment into readAlignment?

Regards,
Alastair

pgrosu · 2015-05-21T00:15:32Z

Hi Alastair (@afirth),

I would be happy to provide the history. I wish I knew about the DWG and RNAseq calls, though unfortunately I don't get the emails. If it would be possible to be added regarding future ones, that would be really nice - my email is pgrosu@gmail.com just in case. Do you have any meeting notes for the two calls and what @saupchurch and everyone discussed? This is just to be in sync.

So below is a little bit of the history regarding how we got to flattening things into a ReadAlignment:

So initially the discussion first got integrated to #33 from (#3, #8, #9, #18, #22, #28 and #30) where chimeric reads could be combined into GARead. Then GAFragment was introduced in #38, where GAReads where under the umbrella of GAFragment. In #38 they were sort of separate with a hierarchy of Fragment <- Read <- Alignment <- LinearAlignment, but it was later determined that it could be slow with Fragment having multiple alignments that might be distant from each other.

So then the discussion continued with the idea that maybe we can extend on SAM records, by combining related reads which would also improve indexing. This generated #47 that got finalized in #60. The #51 pull was introduced to help with consolidating related reads (i.e. mates) into arrays, with #60 becoming more favorable design which also helped with chimeric reads. This then got summarized in #63. The scope was to revisit things - as suggested in #100 - when more complexity might be required. @saupchurch, @delagoya, @lh3, @fnothaft did I forget anything?

Now regarding of why we associated fragmentId to ReadAlignment, was because it looked to be the most straightforward way in associating the two, without incorporating too much complexity. Since these would define a general concept of the wire protocol, it would be the servers that would store these which would optimize via indexing. Regarding indexing, there are several approaches one can take where among them would be the inverted indices approach, which is similar to how search engines implement things with parallelization in mind for document-term pair frequencies. I detailed this in several previous posts, and the lookup can be optimized down to O(1), where in this case it would help with the lookup for the fragment -> read mapping.

Hope it helps,
Paul

saupchurch · 2015-05-21T17:33:28Z

Thank you @pgrosu for the sleuthing. I'm a recent add to the project and came in after the Reads API was for the most part done so the early history is not something I'm very familiar with. The organization of the one-many relationships has been tripping me up a bit as evidenced by the issues raised here. Perhaps this is the right time to take a step back and re-examine what we want a Fragment to be and what we want to accomplish with a Fragment.

From my understanding, the driver for this change is to have access to the CIGAR of the mate pair. The nextMatePosition was not thought to be enough to unambiguously search out the mate in order to extract the CIGAR from its' LinearAlignment. There is also a desire to create an API that can handle more than pairs of reads in a Fragment as a forward-looking goal.

pgrosu · 2015-05-22T16:44:04Z

I also came at the tail end of it, and only wanted read up on it beforehand - for about a month - just to get in sync with the project. You are right that we need to re-visit the balance of ease of use for analysis with the level of compression/encapsulation for transmission over the wire. I agree that Fragment would need to be updated.

The schema can be expanded - which I agree with - to support accessing the CIGAR for the mate pair more directly. I didn't want to perform a major change since that seem to trip a major response in the past. It think many people feel this data is not dynamic, but it can be. Would you be interested is associating/connecting sequences and processes on-the-fly? What I mean is let's say you have a set of reads in the system, and then you can treat them like a sets of objects that can have transformed-associated mapping. Below is an example:

read1
read2

Then via a command-line you type:

> ReadGroupSet5 <- associate(read1, read2, primary=read1)
> AlignedReadGroupSet15 <- align( ReadGroupSet5, ReadGroupSet1)
> update(read1.annotation[+3, methylated], AlignedReadGroupSet15)
> table.lessthan5mismatch.byReadGroup <- search.cigar(<5M, *GroupSet*, mate.location < 180-250, groupby=ReadGroup)
> table.methylated.byReadGroup <- search.annotation("*methylated*", AlignedReadGroupSet*, groupby=ReadGroup )
> collapse.to.matched.genename(table.lessthan5mismatch.byReadGroup, table.methylated.byReadGroup)

Maybe we can brainstorm the different types of analysis and associated data-searches we would prefer to have or others would wish for now and into the future - without worrying about the implemented data-structures yet. We can always can always find optimal implementations for them afterward this exploration. I think this would be great if many people also contributed to be sure we cover all the bases.

delagoya mentioned this issue Dec 17, 2014

Adding sequence records, and cleaning up reference records. #162

Closed

ying-w mentioned this issue Jan 22, 2015

Sort ordering and position handling for unmapped mate pairs of mapped reads #224

Open

pgrosu mentioned this issue Mar 14, 2015

Added Fragment to Reads #259

Merged

saupchurch added the RNAseq label Mar 27, 2015

saupchurch closed this as completed Apr 27, 2015

skeenan reopened this May 20, 2015

This was referenced May 20, 2015

oops #309

Closed

Move fragment data into the fragment object (WIP) #310

Closed

pgrosu mentioned this issue Nov 30, 2015

Revert "Added Fragment to Reads" #499

Merged

pgrosu mentioned this issue Jun 17, 2016

Sequence Annotation search by external identifiers #633

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

look up all reads in fragment #212

look up all reads in fragment #212

saupchurch commented Dec 15, 2014

delagoya commented Dec 17, 2014

pgrosu commented Dec 17, 2014

saupchurch commented Jan 6, 2015

pgrosu commented Jan 6, 2015

saupchurch commented Mar 9, 2015

pgrosu commented Mar 9, 2015

pgrosu commented Mar 12, 2015

saupchurch commented Mar 13, 2015

pgrosu commented Mar 13, 2015

pgrosu commented Mar 13, 2015

saupchurch commented Mar 13, 2015

pgrosu commented Mar 13, 2015

pgrosu commented Mar 14, 2015

afirth commented Apr 27, 2015

pgrosu commented Apr 27, 2015

saupchurch commented Apr 27, 2015

pgrosu commented Apr 27, 2015

afirth commented May 20, 2015

pgrosu commented May 21, 2015

saupchurch commented May 21, 2015

pgrosu commented May 22, 2015

look up all reads in fragment #212

look up all reads in fragment #212

Comments

saupchurch commented Dec 15, 2014

delagoya commented Dec 17, 2014

pgrosu commented Dec 17, 2014

saupchurch commented Jan 6, 2015

pgrosu commented Jan 6, 2015

saupchurch commented Mar 9, 2015

pgrosu commented Mar 9, 2015

pgrosu commented Mar 12, 2015

saupchurch commented Mar 13, 2015

pgrosu commented Mar 13, 2015

pgrosu commented Mar 13, 2015

saupchurch commented Mar 13, 2015

pgrosu commented Mar 13, 2015

pgrosu commented Mar 14, 2015

afirth commented Apr 27, 2015

pgrosu commented Apr 27, 2015

saupchurch commented Apr 27, 2015

pgrosu commented Apr 27, 2015

afirth commented May 20, 2015

pgrosu commented May 21, 2015

saupchurch commented May 21, 2015

pgrosu commented May 22, 2015