How to use mmcollapse for diploid transcriptome? #36

weishwu · 2019-10-22T01:50:24Z

I've run mmseq on a diploid transcriptome that includes for example ENST00000052754.10_A (paternal transcript) and ENST00000052754.10_B (maternal transcript) in the results (.mmseq file). I want to use mmcollapse to collapse transcripts but I don't want to merge a _A transcript with a _B transcript. How should I do this?

eturro · 2019-10-22T09:01:14Z

You can't tell mmcollapse to avoid collapsing particular pairs of transcripts. I wouldn't bother running mmcollapse, but make sure you do not have any repeated sequences in your transcriptome FASTA

…

On 22 Oct 2019, at 03:50, weishwu ***@***.***> wrote: I've run mmseq on a diploid transcriptome that includes for example ENST00000052754.10_A (paternal transcript) and ENST00000052754.10_B (maternal transcript) in the results (.mmseq file). I want to use mmcollapse to collapse transcripts but I don't want to merge a _A transcript with a _B transcript. How should I do this? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

mikelove · 2022-01-05T20:01:31Z

Ernest,

First, the support and documentation you provide here is really helpful. We've been applying mmseq on datasets where we are interested in allelic effects.

Would it be reasonable to split the .trace_gibbs.gz files into separate files based on _A or _B endings of column names, then to run mmcollapse on the 2x files?

eturro · 2022-01-05T20:42:51Z

Hi Mike,

In some work I've done on hybrid F1 mice, I have split up the mmseq output by strain, after having aligned to a hybrid transcriptome (where each reference transcript has two strain-specific sequences, with different suffixes). Some of the transcripts may be identical (or very similar) between the strains, but you may not wish to do any collapsing of posterior traces across the strains, so I can see why you might want to do this.

In principle, I don't see any problem with splitting all the mmseq output files by haplotype (so you double the number of files), then removing the suffixes of the transcript IDs and treating the estimates for each individual and haplotype as if they had been obtained from two separate individuals. You may need to use head to extract headers and grep to extract rows to get this done properly. At that stage you should be able to run all the usual downstream methods, including mmcollapse and mmdiff.

Hope this helps.

mikelove · 2022-01-05T20:57:01Z

Thanks for the quick reply!

I think conceptually I got it, now I'm just trying to find a way to split the .trace_gibbs.gz file by column efficiently. It's either going to be reading into R (I'm trying read_delim now) and writing out new data with new headers, or with cut, but I can't call cut from R for example (the command is too long). I'll post code if I get something that works.

mikelove · 2022-01-06T15:44:57Z

I seem to have scripts working for splitting up the following by haplotype:

sample.identical.mmseq
sample.identical.trace_gibbs.gz
sample.mmseq
sample.trace_gibbs.gz

It also wants a .M file, but I'm not sure exactly how to split that one. This file has transcripts in the header and then a number of lines with two numbers (not labeled). Do the rows of this file correspond to the transcripts listed in the header?

And then, are there other files I need to split? .k?

eturro · 2022-01-06T21:16:22Z

Hi Mike, The .M file encodes a sparse 2D binary matrix with the columns corresponding to transcripts, the IDs of which are given in the header. So each row corresponds to a transcript set. The two numbers on each row give the (row,col) indices for the 1s in the matrix, using 0-based indexing. The .k file gives the read counts for each transcript set in a single column. So each row corresponds to one of the rows in the M matrix.

…

On 6 Jan 2022, at 10:45, Mike Love ***@***.***> wrote: I seem to have scripts working for splitting up the following by haplotype: sample.identical.mmseq sample.identical.trace_gibbs.gz sample.mmseq sample.trace_gibbs.gz It also wants a .M file, but I'm not sure exactly how to split that one. This file has transcripts in the header and then a number of lines with two numbers (not labeled). Do the rows of this file correspond to the transcripts listed in the header? And then, are there other files I need to split? .k? — Reply to this email directly, view it on GitHub <#36 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABBTMECCB52HWFDMRXWKHODUUW2IJANCNFSM4JDJPXBQ>. Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you commented.

mikelove · 2022-01-06T21:49:41Z

I have an idea how to modify this file but want to check if it is reasonable. I know this is "off-piste" as you stated in the top of the issue, so I will proceed with caution.

I am splitting transcripts by haplotype for the quantification files, e.g. txp_A -> txp (file A) and txp_B -> txp (file B). So far so good.

It's less trivial to deal with this bipartite graph with transcript sets -> haplotype transcripts. One idea is to just fold over the matrix representing the bipartite graph like so:

> m <- matrix(c(0,0,1,0,0,1,0,0),ncol=4,byrow=TRUE)
> colnames(m) <- c("txp1_A","txp1_B","txp2_A","txp2_B")
> m
     txp1_A txp1_B txp2_A txp2_B
[1,]      0      0      1      0
[2,]      0      1      0      0
> m2 <- m[,c(1,3)] + m[,c(2,4)]
> colnames(m2) <- sub("_A","",colnames(m2))
> m2
     txp1 txp2
[1,]    0    1
[2,]    1    0

This is merging the nodes on the right side of the bipartite graph and if either had an edge in the original, it will have an edge in the merged graph.

I would then duplicate this modified .M and the original .k and rename these with _A and _B.

eturro · 2022-01-06T22:22:08Z

Hi Mike, I'm not sure why you'd want to do the sum. Wouldn't you want to simply extract the columns? If a read maps to one haplotype but not the other you don't want to throw away that information. m1= m[,c(1,3)]; m2=m[,c(2,4)]; colnames(m1)=colnames(m2)=sub("_[AB]","",colnames(m1))

…

On 6 Jan 2022, at 16:49, Mike Love ***@***.***> wrote: I have an idea how to modify this file but want to check if it is reasonable. I know this is "off-piste" as you stated in the top of the issue, so I will proceed with caution. I am splitting transcripts by haplotype for the quantification files, e.g. txp_A -> txp (file A) and txp_B -> txp (file B). So far so good. It's less trivial to deal with this bipartite graph with transcript sets -> haplotype transcripts. One idea is to just fold over the matrix representing the bipartite graph like so: > m <- matrix(c(0,0,1,0,0,1,0,0),ncol=4,byrow=TRUE) > colnames(m) <- c("txp1_A","txp1_B","txp2_A","txp2_B") > m txp1_A txp1_B txp2_A txp2_B [1,] 0 0 1 0 [2,] 0 1 0 0 > m2 <- m[,c(1,3)] + m[,c(2,4)] > colnames(m2) <- sub("_A","",colnames(m2)) > m2 txp1 txp2 [1,] 0 1 [2,] 1 0 This is merging the nodes on the right side of the bipartite graph and if either had an edge in the original, it will have an edge in the merged graph. I would then duplicate this modified .M and the original .k and rename these with _A and _B. — Reply to this email directly, view it on GitHub <#36 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABBTMEGBJTTGMIEQDQZSDVLUUYE77ANCNFSM4JDJPXBQ>. Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you commented.

mikelove · 2022-01-06T22:29:56Z

My thinking was that, in my mmdiff analysis, I won't have haplotype transcripts, only transcripts. I'm planning to run mmcollapse on data that only list txp1, txp2, ... and then followed by mmdiff. But I may be missing something.

Alternatively, I can extract transcripts (columns) for each haplotype, but I didn't know about: 1) how to divvy the transcript set counts in .k to haplotype-specific files, and also 2) if it was ok to leave transcript sets (rows) that have no edge. E.g. do I need to prune transcripts sets for sample_A.M if they only have edges to _B transcripts?

eturro · 2022-01-06T23:28:20Z

I suppose I don't fully understand your experimental design. What is the unit of inference - haplotype-specific transcripts or transcripts? If the latter, why align to haplotype-specific transcripts in the first place? If the former, then, having given this some more thought, I can see that there are complications to running mmcollapse on separate haplotype specific subsets of transcripts. As you say, if a read maps to both haplotypes, you can't categorically assign it to only one of the two haplotypes (that's precisely what mmseq is designed to model). Ideally, you wouldn't want to split up the files before running mmcollapse, but after running it, and before running mmdiff. However, you'd want to impose a restriction whereby only transcripts from the same haplotype set (i.e. with a particular suffix) can be collapsed. I'm afraid this would require modifying the mmcollapse C++ code!

…

On 6 Jan 2022, at 17:30, Mike Love ***@***.***> wrote: My thinking was that, in my mmdiff analysis, I won't have haplotype transcripts, only transcripts. I'm planning to run mmcollapse on data that only list txp1, txp2, ... and then followed by mmdiff. But I may be missing something. Alternatively, I can extract transcripts (columns) for each haplotype, but I didn't know about: 1) how to divvy the transcript set counts in .k to haplotype-specific files, and also 2) if it was ok to leave transcript sets (rows) that have no edge. E.g. do I need to prune transcripts sets for sample_A.M if they only have edges to _B transcripts? — Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you commented.

mikelove · 2022-01-07T00:29:20Z

Sorry I should have mentioned this earlier, and thanks for bearing with me on this. My unit of inference is transcripts, and I'm trying to find allelic imbalance by performing DE across haplotype, as in #33.

So we align to haplotype-specific transcripts so that we can later make comparison B vs A for each transcript.

In my mind, while there is some loss of information in my summing procedure two comments above, I think it may generate collapsed groups with more power.

mikelove · 2022-01-09T23:12:28Z

I got somewhere but not all the way to .collapsed.mmseq files. I removed identical transcripts at the beginning of the pipeline, and then got up until the point when mmcollapse would write out the new tables. But then it stops with Error: incompatible arguments to function uh().

These 2x set of mmseq files have the same transcript names. Below, I'm using my annotation which is M vs P instead of A vs B for the two alleles/haplotypes.

Stopping threshold: -97.5th percentile of the distribution of correlations
mmseq_nodup/sample_A_1_M SD thres=0.108423
...
9808 transcripts or sets of identical transcripts are unidentifiable in all samples.
Reading posterior traces and computing variance-covariance matrices using 12 thread(s)
note: this step requires approximately 1.5 GB plus 0.72 GB per thread; re-run with a lower value of OMP_NUM_THREADS to use less memory.
        mmseq_nodup/sample_A_1_M.trace_gibbs.gz (thread 0)
        mmseq_nodup/sample_A_1_M.identical.trace_gibbs.gz (thread 0)
...
Calculating mean and s.d. of correlations...done.
Threshold for mean anti-correlation: 1
Collapsing unidentifiable transcripts based on mean anti-correlations...done (0 iterations).
Reading posterior traces from "mmseq_nodup/sample_A_1_M.trace_gibbs.gz"...done.
Reading posterior traces from "mmseq_nodup/sample_A_1_M.identical.trace_gibbs.gz"...done.
Simulating traces for transcripts/sets of identical transcripts with no hits...
done.
        Saving new table "mmseq_nodup/sample_A_1_M.collapsed.mmseq".
Error: incompatible arguments to function uh().

Example of the .mmseq files after splitting:

head mmseq_nodup/sample_A_1_M.mmseq
# Mapped fragments: 45752140
feature_id	log_mu	sd	mcse	iact	effective_length	true_length	unique_hits	mean_proportion	mean_probit_proportion	sd_probit_proportion	log_mu_em	observed	ntranscripts	percentiles5,25,50,75,95	percentiles_proportion5,25,50,75,95
FBtr0081624|6822498_2_FBgn0000003_M	7.59895	0.0695741	0.00226516	1.08543	0.1	299	208	1	Inf	NA	10.7271	1	1	1781.86,1902.54,1994.32,2090.47,2241.98	1,1,1,1,1
FBtr0330134|16647882_2_FBgn0000017_M	2.54677	0.231971	0.0726621	100.473	5313	5712	0	0.16708	-0.976813	0.151777	2.24498	1	10	8.42228,11.2221,13.0304,14.8807,18.1494	0.107316,0.142624,0.165738,0.190008,0.234087

head mmseq_nodup/sample_A_1_P.mmseq
# Mapped fragments: 45752140
feature_id	log_mu	sd	mcse	iact	effective_length	true_length	unique_hits	mean_proportion	mean_probit_proportion	sd_probit_proportion	log_mu_em	observed	ntranscripts	percentiles5,25,50,75,95	percentiles_proportion5,25,50,75,95
FBtr0081624|6822498_2_FBgn0000003_P	7.55612	0.0699165	0.00216587	0.982666	0.1	299	200	1	Inf	NA10.6879	1	1	1709.47,1820.27,1919.68,1999.1,2139.82	1,1,1,1,1
FBtr0330134|16647882_2_FBgn0000017_P	-4.46829	9.70711	3.00903	98.3953	5313	5712	0	0.0602261	-2.88761	1.8159	1.90921	1	10	3.88674e-11,0.000550771,0.321136,10.6512,15.9886	4.77515e-13,6.62723e-06,0.00383743,0.12906,0.194303

And I took the "fold over" approach to making the sparse matrix in the .M file, saving a 1 whenever either M or P had a 1 for that transcript set.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to use mmcollapse for diploid transcriptome? #36

How to use mmcollapse for diploid transcriptome? #36

weishwu commented Oct 22, 2019

eturro commented Oct 22, 2019 via email

mikelove commented Jan 5, 2022

eturro commented Jan 5, 2022 •

edited

Loading

mikelove commented Jan 5, 2022

mikelove commented Jan 6, 2022

eturro commented Jan 6, 2022 via email

mikelove commented Jan 6, 2022

eturro commented Jan 6, 2022 via email

mikelove commented Jan 6, 2022

eturro commented Jan 6, 2022 via email

mikelove commented Jan 7, 2022

mikelove commented Jan 9, 2022

How to use mmcollapse for diploid transcriptome? #36

How to use mmcollapse for diploid transcriptome? #36

Comments

weishwu commented Oct 22, 2019

eturro commented Oct 22, 2019 via email

mikelove commented Jan 5, 2022

eturro commented Jan 5, 2022 • edited Loading

mikelove commented Jan 5, 2022

mikelove commented Jan 6, 2022

eturro commented Jan 6, 2022 via email

mikelove commented Jan 6, 2022

eturro commented Jan 6, 2022 via email

mikelove commented Jan 6, 2022

eturro commented Jan 6, 2022 via email

mikelove commented Jan 7, 2022

mikelove commented Jan 9, 2022

eturro commented Jan 5, 2022 •

edited

Loading