Skip to content
This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

look up all reads in fragment #212

Open
saupchurch opened this issue Dec 15, 2014 · 21 comments
Open

look up all reads in fragment #212

saupchurch opened this issue Dec 15, 2014 · 21 comments
Labels

Comments

@saupchurch
Copy link
Contributor

For fragment reconstruction when dealing with RNA data we need to have the full CIGAR as well as position of the mate pair. A method to query all the reads for a fragment would allow this as well as support cases (i.e. middle reads) containing more than a single paired read for a fragment.

@delagoya
Copy link
Contributor

When we had spoken on the call, I believe that the plan was to look into the needed changes within the Reads API to add in the notion of a Fragment record, and to create methods that interact with that record type. I had assumed that we already had a representation of Fragment in the API, but it looks like we do not. All we have is ReadAlignment.fragmentName which is the equivalent of QNAME in SAM.

So we need a Pull Request to add in the Fragment record type, which is distinct from ReadGroup, and to add in methods for it. Any volunteers?

@pgrosu
Copy link
Contributor

pgrosu commented Dec 17, 2014

I wasn't on the call, though this seems fairly straightforward, since a fragment is just a collection of reads under the same ReadGroup. I can give it a shot if it's okay, but it would be nice to be in sync with what was discussed before I start, so that I don't miss anything that was decided as required. Let me know either way.

Thanks,
Paul

@saupchurch
Copy link
Contributor Author

Paul, thanks for taking a stab at this. Delagoya has the essence of it - the plan is to implement the Fragment portion of the API hierarchy. The driver behind this as discussed on the call is the need to have the mate pair CIGAR in addition to position for fragment reconstruction when on the RNA side of the things due to splicing and other potential modifications. It is also desirable to have the structure in place to handle multiple reads in a fragment for future proofing against things like middle reads.

@pgrosu
Copy link
Contributor

pgrosu commented Jan 6, 2015

Hi Sean,

Thank you for the reply and the helpful information. I will work through several implementations until one seems most clear/intuitive, and then post something back. I might have a small question here and there, but this now gives me a good foundation from which to work from.

If there are any additional meeting notes/diagrams/docs/presentations from the call regarding this, that would be nice to sync with.

Thanks,
Paul

@saupchurch
Copy link
Contributor Author

Paul, how is this going? Do you have any further questions?

@pgrosu
Copy link
Contributor

pgrosu commented Mar 9, 2015

Sorry for the delay. I got a little vacation this week so I should have time to complete this now.

@pgrosu
Copy link
Contributor

pgrosu commented Mar 12, 2015

Hi Sean and Angel,

Sorry for the delay and I wanted to post this here before performing a pull-request, in order to open it up first for discussion.

So without further ado, below is the most succinct implementation I am thinking of:

record Fragment {

  /* The fragment ID. */
  string id;

  /* The ID of the read group this fragment belongs to. */
  string readGroupId;

  /* The collection of ReadAlignment which belong to the same fragment. */
  array<ReadAlignment> readAlignments = [];

}

record ReadAlignment {

  ...

  /* The Fragment that this ReadAlignment belongs to. */
  string fragmentId;

  ...

}

The array of ReadAlignment might not be necessary under Fragment, if ReadAlignment references a specific Fragment via fragmentId.

Please let me know what you think.

Thank you,
Paul

@saupchurch
Copy link
Contributor Author

I like it - it captures the idea without being too weighty. From looking at how ReadAlignments fit into the ReadGroup it seems we are not putting arrays of alignments in the group. I'm not sure if that was a design choice or if there were more 'real' schema considerations. To follow that model, I'd remove the array of ReadAlignment from Fragment and reference via fragmentId. I'd go ahead and open a pull request - further discussion can be carried out there.

@pgrosu
Copy link
Contributor

pgrosu commented Mar 13, 2015

Thank you for helping me with this. Yes, clarity is my preferred style :) I'll take out the array of ReadAlignment, and regarding ReadGroup I was trying to integrate the following definition from the schema:

* A *fragment* is a single stretch of a DNA molecule. There are typically
millions of fragments in a ReadGroup. A fragment has a name (QNAME in BAM
spec), a length (TLEN in BAM spec), and an array of reads.

In fact we can take out ReadGroup as well, since it gets referenced in ReadAlignment.

Regarding the reasoning behind choices made, that came out of several fun discussions from last year - which I listed below, in chronological order - but we're always free to update at any time:

#38
#47
#51
#60
#63

I'll get started on the PR.

Thank you,
Paul

@pgrosu
Copy link
Contributor

pgrosu commented Mar 13, 2015

Hi Sean,

Something very funny is happening when I fork schemas using the master branch. I do not get the current schema but a very old version of the schema, which uses the GA prefix and only a limited number of Avro files. I tried forking from several places but the same thing keeps happening. If you know how I should approach forking it in a different way, I would greatly appreciate it. Below are two screenshots of what I get in my repository and the link to it is the following:

https://github.com/pgrosu/schemas

fork

oldreads

Thank you in advance,
Paul

@saupchurch
Copy link
Contributor Author

I suspect that you must have forked this at some point in the past. From what we have been able to figure out here, you can't sync a fork from the web interface. See: https://help.github.com/articles/syncing-a-fork/

@pgrosu
Copy link
Contributor

pgrosu commented Mar 13, 2015

Thank you for the link, and I will try out these steps until I see no difference between ga4gh and my repository.

Thank you for helping me with this,
Paul

@pgrosu
Copy link
Contributor

pgrosu commented Mar 14, 2015

Hi Sean,

It took some work, but I finally submitted it (#259). Rebasing still is a process I am working at smoothing out :)

Many thanks for helping me,
Paul

@afirth
Copy link
Member

afirth commented Apr 27, 2015

Is this complete now that #259 has been merged?

@pgrosu
Copy link
Contributor

pgrosu commented Apr 27, 2015

@saupchurch, would you like me to add the methods too?

@saupchurch
Copy link
Contributor Author

@pgrosu Let's see if there is a need from the community before adding. Closing now that #259 is merged.

@pgrosu
Copy link
Contributor

pgrosu commented Apr 27, 2015

@saupchurch Sounds good.
Thanks,
Paul

@afirth
Copy link
Member

afirth commented May 20, 2015

I'd like to reopen this issue as the title is not solved by the current solution. There is, as @saupchurch discussed on the DWG and RNAseq calls, a need to have methods here. I'm working to do that, but running into a few issues.

  1. According to the Reads protocol, A ReadAlignment object is a flattened representation of the bottom layers of this hierarchy. There's exactly one such object per linear alignment or graph alignment. The object contains alignment info, plus fragment and read info for easy access.
    • But
      • ReadGroupSet >--< ReadGroup --< fragment --< read --< alignment --< linear/graph alignment
      • A ReadGroupSet is a logical collection of ReadGroup's.
      • A ReadGroup is all the data that's processed the same way by the sequencer. There are typically 1-10 ReadGroup's in a ReadGroupSet.
      • A fragment is a single stretch of a DNA molecule. There are typically millions of fragments in a ReadGroup. A fragment has a name (QNAME in BAM spec), a length (TLEN in BAM spec), and an array of reads.
    • Could someone fill me in on why this was flattened? It would be difficult (as I'm finding) to unflatten it. However, without doing that the primary use case I know of for fragments return the reads which are in this fragment is O(n). I understand the desire to model off of BAM, but it does an equally terrible job of capturing this. This will be compounded dramatically (as Sean alluded to on today's call) if/when multireads of n>2 become common.

From an API and implementation standpoint for RNA, it seems like a terrible idea to have fragmentId be part of readAlignment, however I understand that is desirable for regenerating BAM files to give to other tools. How has this question [which side of data contains the one-many relationship] been decided in the past? E.g a readGroupSet contains an array of readGroup[Id]s, but it also contains the datasetId (rather than the dataset containing an array of readGroupSets.

So, sure, we can slap a fragmentIds array field into searchReadsRequest, and it will work, but it is heinously inefficient and seems backwards. I'd like to hear from @pgrosu and @saupchurch in particular, and maybe someone can tag the folks responsible for the flattening of fragment into readAlignment?

Regards,
Alastair

@skeenan skeenan reopened this May 20, 2015
This was referenced May 20, 2015
@pgrosu
Copy link
Contributor

pgrosu commented May 21, 2015

Hi Alastair (@afirth),

I would be happy to provide the history. I wish I knew about the DWG and RNAseq calls, though unfortunately I don't get the emails. If it would be possible to be added regarding future ones, that would be really nice - my email is pgrosu@gmail.com just in case. Do you have any meeting notes for the two calls and what @saupchurch and everyone discussed? This is just to be in sync.

So below is a little bit of the history regarding how we got to flattening things into a ReadAlignment:

So initially the discussion first got integrated to #33 from (#3, #8, #9, #18, #22, #28 and #30) where chimeric reads could be combined into GARead. Then GAFragment was introduced in #38, where GAReads where under the umbrella of GAFragment. In #38 they were sort of separate with a hierarchy of Fragment <- Read <- Alignment <- LinearAlignment, but it was later determined that it could be slow with Fragment having multiple alignments that might be distant from each other.

So then the discussion continued with the idea that maybe we can extend on SAM records, by combining related reads which would also improve indexing. This generated #47 that got finalized in #60. The #51 pull was introduced to help with consolidating related reads (i.e. mates) into arrays, with #60 becoming more favorable design which also helped with chimeric reads. This then got summarized in #63. The scope was to revisit things - as suggested in #100 - when more complexity might be required. @saupchurch, @delagoya, @lh3, @fnothaft did I forget anything?

Now regarding of why we associated fragmentId to ReadAlignment, was because it looked to be the most straightforward way in associating the two, without incorporating too much complexity. Since these would define a general concept of the wire protocol, it would be the servers that would store these which would optimize via indexing. Regarding indexing, there are several approaches one can take where among them would be the inverted indices approach, which is similar to how search engines implement things with parallelization in mind for document-term pair frequencies. I detailed this in several previous posts, and the lookup can be optimized down to O(1), where in this case it would help with the lookup for the fragment -> read mapping.

Hope it helps,
Paul

@saupchurch
Copy link
Contributor Author

Thank you @pgrosu for the sleuthing. I'm a recent add to the project and came in after the Reads API was for the most part done so the early history is not something I'm very familiar with. The organization of the one-many relationships has been tripping me up a bit as evidenced by the issues raised here. Perhaps this is the right time to take a step back and re-examine what we want a Fragment to be and what we want to accomplish with a Fragment.

From my understanding, the driver for this change is to have access to the CIGAR of the mate pair. The nextMatePosition was not thought to be enough to unambiguously search out the mate in order to extract the CIGAR from its' LinearAlignment. There is also a desire to create an API that can handle more than pairs of reads in a Fragment as a forward-looking goal.

@pgrosu
Copy link
Contributor

pgrosu commented May 22, 2015

I also came at the tail end of it, and only wanted read up on it beforehand - for about a month - just to get in sync with the project. You are right that we need to re-visit the balance of ease of use for analysis with the level of compression/encapsulation for transmission over the wire. I agree that Fragment would need to be updated.

The schema can be expanded - which I agree with - to support accessing the CIGAR for the mate pair more directly. I didn't want to perform a major change since that seem to trip a major response in the past. It think many people feel this data is not dynamic, but it can be. Would you be interested is associating/connecting sequences and processes on-the-fly? What I mean is let's say you have a set of reads in the system, and then you can treat them like a sets of objects that can have transformed-associated mapping. Below is an example:

read1
read2

Then via a command-line you type:

> ReadGroupSet5 <- associate(read1, read2, primary=read1)
> AlignedReadGroupSet15 <- align( ReadGroupSet5, ReadGroupSet1)
> update(read1.annotation[+3, methylated], AlignedReadGroupSet15)
> table.lessthan5mismatch.byReadGroup <- search.cigar(<5M, *GroupSet*, mate.location < 180-250, groupby=ReadGroup)
> table.methylated.byReadGroup <- search.annotation("*methylated*", AlignedReadGroupSet*, groupby=ReadGroup )
> collapse.to.matched.genename(table.lessthan5mismatch.byReadGroup, table.methylated.byReadGroup)

Maybe we can brainstorm the different types of analysis and associated data-searches we would prefer to have or others would wish for now and into the future - without worrying about the implemented data-structures yet. We can always can always find optimal implementations for them afterward this exploration. I think this would be great if many people also contributed to be sure we cover all the bases.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

5 participants