look up all reads in fragment #212
Comments
When we had spoken on the call, I believe that the plan was to look into the needed changes within the Reads API to add in the notion of a So we need a Pull Request to add in the |
I wasn't on the call, though this seems fairly straightforward, since a fragment is just a collection of reads under the same Thanks, |
Paul, thanks for taking a stab at this. Delagoya has the essence of it: the plan is to implement the Fragment portion of the API hierarchy. The driver behind this, as discussed on the call, is the need to have the mate pair's CIGAR in addition to its position for fragment reconstruction on the RNA side of things, due to splicing and other potential modifications. It is also desirable to have the structure in place to handle multiple reads in a fragment, to future-proof against things like middle reads. |
Hi Sean, Thank you for the reply and the helpful information. I will work through several implementations until one seems most clear/intuitive, and then post something back. I might have a small question here and there, but this now gives me a good foundation from which to work. If there are any additional meeting notes/diagrams/docs/presentations from the call regarding this, it would be nice to sync with them. Thanks, |
Paul, how is this going? Do you have any further questions? |
Sorry for the delay. I got a little vacation this week so I should have time to complete this now. |
Hi Sean and Angel, Sorry for the delay. I wanted to post this here before opening a pull request, in order to open it up for discussion first. So without further ado, below is the most succinct implementation I am thinking of:
The array of Please let me know what you think. Thank you, |
I like it - it captures the idea without being too weighty. From looking at how ReadAlignments fit into the ReadGroup it seems we are not putting arrays of alignments in the group. I'm not sure if that was a design choice or if there were more 'real' schema considerations. To follow that model, I'd remove the array of ReadAlignment from Fragment and reference via fragmentId. I'd go ahead and open a pull request - further discussion can be carried out there. |
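A minimal sketch of the reference-via-fragmentId idea, assuming simplified Python stand-ins for the schema records. The names `ReadAlignment`, `fragment_id`, `read_number`, `index_by_fragment`, and `reads_in_fragment` are hypothetical and are not taken from the actual schema. Each alignment carries a reference to its fragment, and the reads of a fragment are recovered by lookup instead of being stored as an array on the Fragment:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ReadAlignment:
    # Hypothetical, simplified stand-in for the schema record.
    id: str
    fragment_id: str  # reference back to the owning fragment
    read_number: int  # 0-based position of the read within the fragment


def index_by_fragment(
    reads: Iterable[ReadAlignment],
) -> Dict[str, List[ReadAlignment]]:
    """Group alignments by fragment_id, each group ordered by read_number."""
    index: Dict[str, List[ReadAlignment]] = defaultdict(list)
    for read in reads:
        index[read.fragment_id].append(read)
    for members in index.values():
        members.sort(key=lambda r: r.read_number)
    return dict(index)


def reads_in_fragment(
    index: Dict[str, List[ReadAlignment]], fragment_id: str
) -> List[ReadAlignment]:
    """Return all reads of one fragment, or an empty list if it is unknown."""
    return list(index.get(fragment_id, []))
```

With this layout the Fragment record itself stays small, and a pair, a single read, or a fragment with middle reads are all handled by the same lookup.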
Thank you for helping me with this. Yes, clarity is my preferred style :) I'll take out the array of
In fact we can take out Regarding the reasoning behind choices made, that came out of several fun discussions from last year - which I listed below, in chronological order - but we're always free to update at any time: I'll get started on the PR. Thank you, |
Hi Sean, Something very funny is happening when I fork schemas using the master branch. I do not get the current schema but a very old version of the schema, which uses the GA prefix and only a limited number of Avro files. I tried forking from several places but the same thing keeps happening. If you know how I should approach forking it in a different way, I would greatly appreciate it. Below are two screenshots of what I get in my repository and the link to it is the following: https://github.com/pgrosu/schemas Thank you in advance, |
I suspect that you must have forked this at some point in the past. From what we have been able to figure out here, you can't sync a fork from the web interface. See: https://help.github.com/articles/syncing-a-fork/ |
Thank you for the link, and I will try out these steps until I see no difference between ga4gh and my repository. Thank you for helping me with this, |
Hi Sean, It took some work, but I finally submitted it (#259). Rebasing is still a process I am working on smoothing out :) Many thanks for helping me, |
Is this complete now that #259 has been merged? |
@saupchurch, would you like me to add the methods too? |
@saupchurch Sounds good. |
I'd like to reopen this issue as the title is not solved by the current solution. There is, as @saupchurch discussed on the DWG and RNAseq calls, a need to have methods here. I'm working to do that, but running into a few issues.
From an API and implementation standpoint for RNA, it seems like a terrible idea to have So, sure, we can slap a Regards, |
Hi Alastair (@afirth), I would be happy to provide the history. I wish I had known about the DWG and RNAseq calls, but unfortunately I don't get the emails. If I could be added for future ones, that would be really nice - my email is pgrosu@gmail.com just in case. Do you have any meeting notes for the two calls covering what @saupchurch and everyone discussed? This is just to be in sync. So below is a little bit of the history regarding how we got to flattening things into a So initially the discussion got integrated into #33 from (#3, #8, #9, #18, #22, #28 and #30) where chimeric reads could be combined into So then the discussion continued with the idea that maybe we can extend SAM records by combining related reads, which would also improve indexing. This generated #47, which got finalized in #60. The #51 pull was introduced to help with consolidating related reads (i.e. mates) into arrays, with #60 becoming the more favorable design, which also helped with chimeric reads. This then got summarized in #63. The plan was to revisit things - as suggested in #100 - when more complexity might be required. @saupchurch, @delagoya, @lh3, @fnothaft did I forget anything? Now regarding why we associated Hope it helps, |
Thank you @pgrosu for the sleuthing. I'm a recent addition to the project and came in after the Reads API was for the most part done, so the early history is not something I'm very familiar with. The organization of the one-to-many relationships has been tripping me up a bit, as evidenced by the issues raised here. Perhaps this is the right time to take a step back and re-examine what we want a Fragment to be and what we want to accomplish with a Fragment. From my understanding, the driver for this change is to have access to the CIGAR of the mate pair. The nextMatePosition was not thought to be enough to unambiguously search out the mate in order to extract the CIGAR from its LinearAlignment. There is also a desire to create an API that can handle more than pairs of reads in a Fragment as a forward-looking goal. |
I also came in at the tail end of it, and only wanted to read up on it beforehand - for about a month - just to get in sync with the project. You are right that we need to revisit the balance between ease of use for analysis and the level of compression/encapsulation for transmission over the wire. I agree that Fragment would need to be updated. The schema can be expanded - which I agree with - to support accessing the CIGAR for the mate pair more directly. I didn't want to make a major change, since that seems to have triggered a major response in the past. I think many people feel this data is not dynamic, but it can be. Would you be interested in associating/connecting sequences and processes on the fly? What I mean is, let's say you have a set of reads in the system; you can then treat them like a set of objects that can have transformation-associated mappings. Below is an example:
Then, via the command line, you type:
Maybe we can brainstorm the different types of analysis and associated data searches we would prefer to have, or that others would wish for now and in the future - without worrying about the implemented data structures yet. We can always find optimal implementations for them after this exploration. I think it would be great if many people contributed, to be sure we cover all the bases. |
For fragment reconstruction when dealing with RNA data, we need to have the full CIGAR as well as the position of the mate pair. A method to query all the reads for a fragment would allow this, as well as support cases (e.g. middle reads) where a fragment contains more than a single pair of reads.
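As a rough illustration of what such a query could return, here is a sketch in Python under stated assumptions: the `Read` record, its field names, and the helpers `mates_of` and `mate_alignments` are all hypothetical and do not reflect the actual Reads API. Given one read, it collects the position and full CIGAR of every other read in the same fragment, so a spliced mate (an `N` operation in the CIGAR) can be reconstructed and middle reads need no special handling:

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Read:
    # Hypothetical, simplified fields; not the real schema.
    id: str
    fragment_id: str
    read_number: int  # 0-based order of the read within its fragment
    position: int     # alignment start on the reference
    cigar: str        # e.g. "20M1000N30M" for a spliced RNA read


def mates_of(read: Read, all_reads: Sequence[Read]) -> List[Read]:
    """Return every other read in the same fragment, ordered by read_number."""
    mates = [
        r for r in all_reads
        if r.fragment_id == read.fragment_id and r.id != read.id
    ]
    return sorted(mates, key=lambda r: r.read_number)


def mate_alignments(read: Read, all_reads: Sequence[Read]) -> List[Tuple[int, str]]:
    """Return (position, cigar) for each mate of the given read."""
    return [(m.position, m.cigar) for m in mates_of(read, all_reads)]
```

The linear scan is only for clarity; a real service would presumably answer the same question from an index keyed on the fragment identifier.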