How to use mmcollapse for diploid transcriptome? #36
Comments
You can't tell mmcollapse to avoid collapsing particular pairs of transcripts. I wouldn't bother running mmcollapse, but make sure you do not have any repeated sequences in your transcriptome FASTA.
On 22 Oct 2019, at 03:50, weishwu wrote:
I've run mmseq on a diploid transcriptome that includes for example ENST00000052754.10_A (paternal transcript) and ENST00000052754.10_B (maternal transcript) in the results (.mmseq file). I want to use mmcollapse to collapse transcripts but I don't want to merge a _A transcript with a _B transcript. How should I do this?
|
Ernest, First, the support and documentation you provide here is really helpful. We've been applying mmseq on datasets where we are interested in allelic effects. Would it be reasonable to split the |
Hi Mike, In some work I've done on hybrid F1 mice, I have split up the mmseq output by strain, after having aligned to a hybrid transcriptome (where each reference transcript has two strain-specific sequences, with different suffixes). Some of the transcripts may be identical (or very similar) between the strains, but you may not wish to do any collapsing of posterior traces across the strains, so I can see why you might want to do this. In principle, I don't see any problem with splitting all the mmseq output files by haplotype (so you double the number of files), then removing the suffixes of the transcript IDs and treating the estimates for each individual and haplotype as if they had been obtained from two separate individuals. You may need to use Hope this helps. |
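The splitting step described above (one output per haplotype, with the suffix removed from each transcript ID) could be sketched roughly as follows. This is only an illustration in Python: it works on generic `(transcript_id, value)` rows rather than on the real mmseq file layout, and the function name `split_by_haplotype`, the `_A`/`_B` suffixes as defaults, and the numeric values are all assumptions made for the example.

```python
from collections import defaultdict


def split_by_haplotype(rows, suffixes=("_A", "_B")):
    """Group (transcript_id, value) rows by haplotype suffix and strip the suffix."""
    by_haplotype = defaultdict(list)
    for transcript_id, value in rows:
        for suffix in suffixes:
            if transcript_id.endswith(suffix):
                base_id = transcript_id[: -len(suffix)]
                by_haplotype[suffix].append((base_id, value))
                break
        else:
            # Fail loudly rather than silently dropping an unlabelled transcript.
            raise ValueError(f"no haplotype suffix on {transcript_id!r}")
    return dict(by_haplotype)


# Placeholder values, not real mmseq estimates.
rows = [
    ("ENST00000052754.10_A", 1.2),
    ("ENST00000052754.10_B", 0.7),
]
split = split_by_haplotype(rows)
print(split["_A"])
print(split["_B"])
```

Each haplotype's rows can then be written to its own file and treated downstream as if they came from a separate individual, as suggested in the reply.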
Thanks for the quick reply! I think conceptually I got it, now I'm just trying to find a way to split the |
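I seem to have scripts working for splitting up the following by haplotype:
sample.identical.mmseq
sample.identical.trace_gibbs.gz
sample.mmseq
sample.trace_gibbs.gz
It also wants a .M file, but I'm not sure exactly how to split that one. This file has transcripts in the header and then a number of lines with two numbers (not labeled). Do the rows of this file correspond to the transcripts listed in the header? And then, are there other files I need to split? .k? |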
Hi Mike,
The .M file encodes a sparse 2D binary matrix with the columns corresponding to transcripts, the IDs of which are given in the header. So each row corresponds to a transcript set. The two numbers on each row give the (row,col) indices for the 1s in the matrix, using 0-based indexing.
The .k file gives the read counts for each transcript set in a single column. So each row corresponds to one of the rows in the M matrix.
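To make the relationship between the two files concrete, here is a small Python sketch of the structure described above. It deliberately does not parse the files, since the exact header layout is not spelled out in this thread; it assumes the transcript IDs and the 0-based (row, col) pairs have already been read into lists. The name `transcript_sets` and the toy counts are made up for illustration.

```python
def transcript_sets(transcript_ids, pairs):
    """Map each row index (one transcript set) to the transcript IDs it contains."""
    sets = {}
    for row, col in pairs:
        # col indexes into the header's transcript IDs, 0-based.
        sets.setdefault(row, []).append(transcript_ids[col])
    return sets


transcript_ids = ["txp1_A", "txp1_B", "txp2_A", "txp2_B"]  # header of the .M file
pairs = [(0, 2), (1, 1)]  # (row, col) positions of the 1s
counts = [5, 3]  # hypothetical .k values, one per row of the matrix

sets = transcript_sets(transcript_ids, pairs)
for row, count in enumerate(counts):
    print(row, sets.get(row, []), count)
```

Read this way, row 0 is a transcript set containing only txp2_A, and its read count is the first entry of the .k column.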
On 6 Jan 2022, at 10:45, Mike Love wrote:
I seem to have scripts working for splitting up the following by haplotype:
sample.identical.mmseq
sample.identical.trace_gibbs.gz
sample.mmseq
sample.trace_gibbs.gz
It also wants a .M file, but I'm not sure exactly how to split that one. This file has transcripts in the header and then a number of lines with two numbers (not labeled). Do the rows of this file correspond to the transcripts listed in the header?
And then, are there other files I need to split? .k?
|
I have an idea how to modify this file but want to check if it is reasonable. I know this is "off-piste" as you stated in the top of the issue, so I will proceed with caution. I am splitting transcripts by haplotype for the quantification files, e.g. txp_A -> txp (file A) and txp_B -> txp (file B). So far so good. It's less trivial to deal with this bipartite graph with transcript sets -> haplotype transcripts. One idea is to just fold over the matrix representing the bipartite graph like so:
This is merging the nodes on the right side of the bipartite graph and if either had an edge in the original, it will have an edge in the merged graph. I would then duplicate this modified .M and the original .k and rename these with _A and _B. |
Hi Mike,
I'm not sure why you'd want to do the sum. Wouldn't you want to simply extract the columns? If a read maps to one haplotype but not the other you don't want to throw away that information.
m1 <- m[, c(1, 3)]; m2 <- m[, c(2, 4)]  # columns for haplotypes A and B
colnames(m1) <- colnames(m2) <- sub("_[AB]$", "", colnames(m1))  # strip the haplotype suffix
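The same column extraction can be done directly on the sparse (row, col) pairs instead of a dense matrix. The following Python sketch assumes the pairs and header IDs are already in memory; `extract_haplotype` is a hypothetical helper, not part of mmseq.

```python
def extract_haplotype(transcript_ids, pairs, suffix):
    """Keep only columns whose ID ends with `suffix`, re-indexing them from 0."""
    new_index = {}  # old column index -> new column index
    new_ids = []
    for col, transcript_id in enumerate(transcript_ids):
        if transcript_id.endswith(suffix):
            new_index[col] = len(new_ids)
            new_ids.append(transcript_id[: -len(suffix)])
    # Row indices are left unchanged so they still line up with the .k counts.
    new_pairs = [(row, new_index[col]) for row, col in pairs if col in new_index]
    return new_ids, new_pairs


transcript_ids = ["txp1_A", "txp1_B", "txp2_A", "txp2_B"]
pairs = [(0, 2), (1, 1)]

ids_a, pairs_a = extract_haplotype(transcript_ids, pairs, "_A")
ids_b, pairs_b = extract_haplotype(transcript_ids, pairs, "_B")
print(ids_a, pairs_a)
print(ids_b, pairs_b)
```

Note that after extraction some rows have no remaining entries (row 1 for haplotype A here). Whether such rows need to be pruned is not settled in this thread, so the sketch leaves them in place.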
On 6 Jan 2022, at 16:49, Mike Love wrote:
I have an idea how to modify this file but want to check if it is reasonable. I know this is "off-piste" as you stated in the top of the issue, so I will proceed with caution.
I am splitting transcripts by haplotype for the quantification files, e.g. txp_A -> txp (file A) and txp_B -> txp (file B). So far so good.
It's less trivial to deal with this bipartite graph with transcript sets -> haplotype transcripts. One idea is to just fold over the matrix representing the bipartite graph like so:
> m <- matrix(c(0,0,1,0,0,1,0,0),ncol=4,byrow=TRUE)
> colnames(m) <- c("txp1_A","txp1_B","txp2_A","txp2_B")
> m
txp1_A txp1_B txp2_A txp2_B
[1,] 0 0 1 0
[2,] 0 1 0 0
> m2 <- m[,c(1,3)] + m[,c(2,4)]
> colnames(m2) <- sub("_A","",colnames(m2))
> m2
txp1 txp2
[1,] 0 1
[2,] 1 0
This is merging the nodes on the right side of the bipartite graph and if either had an edge in the original, it will have an edge in the merged graph.
I would then duplicate this modified .M and the original .k and rename these with _A and _B.
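For comparison, the "fold over" idea can also be written against the sparse (row, col) pairs. One detail worth noting: adding the two column blocks, as in the R example, gives a 2 wherever a transcript set has an edge to both haplotypes of the same transcript, whereas the description ("if either had an edge") is a logical OR. The Python sketch below uses a set so the result stays binary; `fold_over` is a hypothetical helper and the third row is added purely to show the overlap case.

```python
def fold_over(transcript_ids, pairs, suffixes=("_A", "_B")):
    """Merge haplotype columns of the same transcript, keeping the matrix binary."""
    base_ids = []
    base_index = {}
    col_map = []  # old column index -> merged column index
    for transcript_id in transcript_ids:
        base = transcript_id
        for suffix in suffixes:
            if transcript_id.endswith(suffix):
                base = transcript_id[: -len(suffix)]
                break
        if base not in base_index:
            base_index[base] = len(base_ids)
            base_ids.append(base)
        col_map.append(base_index[base])
    # A set drops the duplicate created when both haplotypes had an edge.
    folded = sorted({(row, col_map[col]) for row, col in pairs})
    return base_ids, folded


transcript_ids = ["txp1_A", "txp1_B", "txp2_A", "txp2_B"]
pairs = [(0, 2), (1, 1), (2, 0), (2, 1)]  # row 2 maps to both txp1_A and txp1_B

base_ids, folded = fold_over(transcript_ids, pairs)
print(base_ids, folded)
```

After folding, row 2 can no longer be distinguished from a set that maps to only one haplotype of txp1, which is the haplotype information the merged matrix gives up.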
|
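My thinking was that, in my mmdiff analysis, I won't have haplotype transcripts, only transcripts. I'm planning to run mmcollapse on data that only list txp1, txp2, ... and then followed by mmdiff. But I may be missing something. Alternatively, I can extract transcripts (columns) for each haplotype, but I didn't know about: 1) how to divvy the transcript set counts in .k to haplotype-specific files, and also 2) if it was ok to leave transcript sets (rows) that have no edge. E.g. do I need to prune transcript sets for sample_A.M if they only have edges to _B transcripts? |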
I suppose I don't fully understand your experimental design. What is the unit of inference - haplotype-specific transcripts or transcripts? If the latter, why align to haplotype-specific transcripts in the first place?
If the former, then, having given this some more thought, I can see that there are complications to running mmcollapse on separate haplotype specific subsets of transcripts. As you say, if a read maps to both haplotypes, you can't categorically assign it to only one of the two haplotypes (that's precisely what mmseq is designed to model).
Ideally, you wouldn't want to split up the files before running mmcollapse, but after running it, and before running mmdiff. However, you'd want to impose a restriction whereby only transcripts from the same haplotype set (i.e. with a particular suffix) can be collapsed. I'm afraid this would require modifying the mmcollapse C++ code!
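The restriction itself is simple to state, even though enforcing it would mean changing mmcollapse. As a purely illustrative Python sketch (this is not mmcollapse code, and how mmcollapse chooses candidate pairs is not described in this thread), a same-haplotype check on transcript IDs could look like this. The function names and the candidate list are hypothetical.

```python
def haplotype_of(transcript_id, suffixes=("_A", "_B")):
    """Return the haplotype suffix of a transcript ID, or None if it has none."""
    for suffix in suffixes:
        if transcript_id.endswith(suffix):
            return suffix
    return None


def allowed_to_collapse(first_id, second_id):
    """Allow collapsing only when both transcripts carry the same haplotype suffix."""
    first = haplotype_of(first_id)
    return first is not None and first == haplotype_of(second_id)


# Hypothetical candidate pairs, for illustration only.
candidates = [("txp1_A", "txp2_A"), ("txp1_A", "txp1_B"), ("txp1_B", "txp2_B")]
allowed = [pair for pair in candidates if allowed_to_collapse(*pair)]
print(allowed)
```

A check of this kind would have to sit inside the collapsing procedure itself, which is why it cannot be achieved by pre- or post-processing the input files alone.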
On 6 Jan 2022, at 17:30, Mike Love wrote:
My thinking was that, in my mmdiff analysis, I won't have haplotype transcripts, only transcripts. I'm planning to run mmcollapse on data that only list txp1, txp2, ... and then followed by mmdiff. But I may be missing something.
Alternatively, I can extract transcripts (columns) for each haplotype, but I didn't know about: 1) how to divvy the transcript set counts in .k to haplotype-specific files, and also 2) if it was ok to leave transcript sets (rows) that have no edge. E.g. do I need to prune transcript sets for sample_A.M if they only have edges to _B transcripts?
|
Sorry I should have mentioned this earlier, and thanks for bearing with me on this. My unit of inference is transcripts, and I'm trying to find allelic imbalance by performing DE across haplotype, as in #33. So we align to haplotype-specific transcripts so that we can later make comparison B vs A for each transcript. In my mind, while there is some loss of information in my summing procedure two comments above, I think it may generate collapsed groups with more power. |
I got somewhere but not all the way to These 2x set of mmseq files have the same transcript names. Below, I'm using my annotation which is M vs P instead of A vs B for the two alleles/haplotypes.
Example of the
And I took the "fold over" approach to making the sparse matrix in the |