Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to use mmcollapse for diploid transcriptome? #36

Open
weishwu opened this issue Oct 22, 2019 · 12 comments
Open

How to use mmcollapse for diploid transcriptome? #36

weishwu opened this issue Oct 22, 2019 · 12 comments

Comments

@weishwu
Copy link

weishwu commented Oct 22, 2019

I've run mmseq on a diploid transcriptome that includes for example ENST00000052754.10_A (paternal transcript) and ENST00000052754.10_B (maternal transcript) in the results (.mmseq file). I want to use mmcollapse to collapse transcripts but I don't want to merge a _A transcript with a _B transcript. How should I do this?

@eturro
Copy link
Owner

eturro commented Oct 22, 2019 via email

@mikelove
Copy link

mikelove commented Jan 5, 2022

Ernest,

First, the support and documentation you provide here is really helpful. We've been applying mmseq on datasets where we are interested in allelic effects.

Would it be reasonable to split the .trace_gibbs.gz files into separate files based on _A or _B endings of column names, then to run mmcollapse on the 2x files?

@eturro
Copy link
Owner

eturro commented Jan 5, 2022

Hi Mike,

In some work I've done on hybrid F1 mice, I have split up the mmseq output by strain, after having aligned to a hybrid transcriptome (where each reference transcript has two strain-specific sequences, with different suffixes). Some of the transcripts may be identical (or very similar) between the strains, but you may not wish to do any collapsing of posterior traces across the strains, so I can see why you might want to do this.

In principle, I don't see any problem with splitting all the mmseq output files by haplotype (so you double the number of files), then removing the suffixes of the transcript IDs and treating the estimates for each individual and haplotype as if they had been obtained from two separate individuals. You may need to use head to extract headers and grep to extract rows to get this done properly. At that stage you should be able to run all the usual downstream methods, including mmcollapse and mmdiff.

Hope this helps.

@mikelove
Copy link

mikelove commented Jan 5, 2022

Thanks for the quick reply!

I think conceptually I got it, now I'm just trying to find a way to split the .trace_gibbs.gz file by column efficiently. It's either going to be reading into R (I'm trying read_delim now) and writing out new data with new headers, or with cut, but I can't call cut from R for example (the command is too long). I'll post code if I get something that works.

@mikelove
Copy link

mikelove commented Jan 6, 2022

I seem to have scripts working for splitting up the following by haplotype:

sample.identical.mmseq
sample.identical.trace_gibbs.gz
sample.mmseq
sample.trace_gibbs.gz

It also wants a .M file, but I'm not sure exactly how to split that one. This file has transcripts in the header and then a number of lines with two numbers (not labeled). Do the rows of this file correspond to the transcripts listed in the header?

And then, are there other files I need to split? .k?

@eturro
Copy link
Owner

eturro commented Jan 6, 2022 via email

@mikelove
Copy link

mikelove commented Jan 6, 2022

I have an idea how to modify this file but want to check if it is reasonable. I know this is "off-piste" as you stated in the top of the issue, so I will proceed with caution.

I am splitting transcripts by haplotype for the quantification files, e.g. txp_A -> txp (file A) and txp_B -> txp (file B). So far so good.

It's less trivial to deal with this bipartite graph with transcript sets -> haplotype transcripts. One idea is to just fold over the matrix representing the bipartite graph like so:

> m <- matrix(c(0,0,1,0,0,1,0,0),ncol=4,byrow=TRUE)
> colnames(m) <- c("txp1_A","txp1_B","txp2_A","txp2_B")
> m
     txp1_A txp1_B txp2_A txp2_B
[1,]      0      0      1      0
[2,]      0      1      0      0
> m2 <- m[,c(1,3)] + m[,c(2,4)]
> colnames(m2) <- sub("_A","",colnames(m2))
> m2
     txp1 txp2
[1,]    0    1
[2,]    1    0

This is merging the nodes on the right side of the bipartite graph and if either had an edge in the original, it will have an edge in the merged graph.

I would then duplicate this modified .M and the original .k and rename these with _A and _B.

@eturro
Copy link
Owner

eturro commented Jan 6, 2022 via email

@mikelove
Copy link

mikelove commented Jan 6, 2022

My thinking was that, in my mmdiff analysis, I won't have haplotype transcripts, only transcripts. I'm planning to run mmcollapse on data that only list txp1, txp2, ... and then followed by mmdiff. But I may be missing something.

Alternatively, I can extract transcripts (columns) for each haplotype, but I didn't know about: 1) how to divvy the transcript set counts in .k to haplotype-specific files, and also 2) if it was ok to leave transcript sets (rows) that have no edge. E.g. do I need to prune transcripts sets for sample_A.M if they only have edges to _B transcripts?

@eturro
Copy link
Owner

eturro commented Jan 6, 2022 via email

@mikelove
Copy link

mikelove commented Jan 7, 2022

Sorry I should have mentioned this earlier, and thanks for bearing with me on this. My unit of inference is transcripts, and I'm trying to find allelic imbalance by performing DE across haplotype, as in #33.

So we align to haplotype-specific transcripts so that we can later make comparison B vs A for each transcript.

In my mind, while there is some loss of information in my summing procedure two comments above, I think it may generate collapsed groups with more power.

@mikelove
Copy link

mikelove commented Jan 9, 2022

I got somewhere but not all the way to .collapsed.mmseq files. I removed identical transcripts at the beginning of the pipeline, and then got up until the point when mmcollapse would write out the new tables. But then it stops with Error: incompatible arguments to function uh().

These 2x set of mmseq files have the same transcript names. Below, I'm using my annotation which is M vs P instead of A vs B for the two alleles/haplotypes.

Stopping threshold: -97.5th percentile of the distribution of correlations
mmseq_nodup/sample_A_1_M SD thres=0.108423
...
9808 transcripts or sets of identical transcripts are unidentifiable in all samples.
Reading posterior traces and computing variance-covariance matrices using 12 thread(s)
note: this step requires approximately 1.5 GB plus 0.72 GB per thread; re-run with a lower value of OMP_NUM_THREADS to use less memory.
        mmseq_nodup/sample_A_1_M.trace_gibbs.gz (thread 0)
        mmseq_nodup/sample_A_1_M.identical.trace_gibbs.gz (thread 0)
...
Calculating mean and s.d. of correlations...done.
Threshold for mean anti-correlation: 1
Collapsing unidentifiable transcripts based on mean anti-correlations...done (0 iterations).
Reading posterior traces from "mmseq_nodup/sample_A_1_M.trace_gibbs.gz"...done.
Reading posterior traces from "mmseq_nodup/sample_A_1_M.identical.trace_gibbs.gz"...done.
Simulating traces for transcripts/sets of identical transcripts with no hits...
done.
        Saving new table "mmseq_nodup/sample_A_1_M.collapsed.mmseq".
Error: incompatible arguments to function uh().

Example of the .mmseq files after splitting:

head mmseq_nodup/sample_A_1_M.mmseq
# Mapped fragments: 45752140
feature_id	log_mu	sd	mcse	iact	effective_length	true_length	unique_hits	mean_proportion	mean_probit_proportion	sd_probit_proportion	log_mu_em	observed	ntranscripts	percentiles5,25,50,75,95	percentiles_proportion5,25,50,75,95
FBtr0081624|6822498_2_FBgn0000003_M	7.59895	0.0695741	0.00226516	1.08543	0.1	299	208	1	Inf	NA	10.7271	1	1	1781.86,1902.54,1994.32,2090.47,2241.98	1,1,1,1,1
FBtr0330134|16647882_2_FBgn0000017_M	2.54677	0.231971	0.0726621	100.473	5313	5712	0	0.16708	-0.976813	0.151777	2.24498	1	10	8.42228,11.2221,13.0304,14.8807,18.1494	0.107316,0.142624,0.165738,0.190008,0.234087
head mmseq_nodup/sample_A_1_P.mmseq
# Mapped fragments: 45752140
feature_id	log_mu	sd	mcse	iact	effective_length	true_length	unique_hits	mean_proportion	mean_probit_proportion	sd_probit_proportion	log_mu_em	observed	ntranscripts	percentiles5,25,50,75,95	percentiles_proportion5,25,50,75,95
FBtr0081624|6822498_2_FBgn0000003_P	7.55612	0.0699165	0.00216587	0.982666	0.1	299	200	1	Inf	NA10.6879	1	1	1709.47,1820.27,1919.68,1999.1,2139.82	1,1,1,1,1
FBtr0330134|16647882_2_FBgn0000017_P	-4.46829	9.70711	3.00903	98.3953	5313	5712	0	0.0602261	-2.88761	1.8159	1.90921	1	10	3.88674e-11,0.000550771,0.321136,10.6512,15.9886	4.77515e-13,6.62723e-06,0.00383743,0.12906,0.194303

And I took the "fold over" approach to making the sparse matrix in the .M file, saving a 1 whenever either M or P had a 1 for that transcript set.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants