Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Methylation per read #56

Closed
nchernia opened this issue Feb 28, 2018 · 24 comments
Closed

Methylation per read #56

nchernia opened this issue Feb 28, 2018 · 24 comments

Comments

@nchernia
Copy link

Thank you for this tool - it's been really nice to work with (runs very fast and the code is very readable).

We are working on a project for which we need to know the methylation per read. In Bismark, this is given via the XM tag, where there is essentially a string that represents (for every cytosine covered), the context and whether or not it is methylated.

I'm currently getting this info by print out statements in lines 424-426 in extract.c and then processing the resulting text files but obviously this is less than ideal (multiple lines printed per read name that then need to be combined, files are big, etc). Is there any way to make this an option in MethylDackel? Alternatively, do you have suggestions for a better way to gather this info on a per read basis?

Thanks!

@dpryan79
Copy link
Owner

I suppose something equivalent to how MethylDackel mbias works internally would work here as well. There, instead of the status for each C being written to the output, it's stored in a vector according to whether it's C or T. What sort of output format are you looking for with this?

Out of curiosity, what sort of project would benefit from something like this?

@nchernia
Copy link
Author

nchernia commented Feb 28, 2018 via email

@nchernia
Copy link
Author

Would it be better for me to modify the mbias code instead to extract this information?

@dpryan79
Copy link
Owner

@nchernia Actually, I just double checked and I use a pileup there as well. I can hack something together easily enough to do this, but do you really need it to be multithreaded? It's easy enough to write a single-threaded application to do this (one could even use python with pysam), but it'll take a lot more time to write it to handle threads. Could you privately send me the basic reason for wanting to do this (I've modified versions of this for in-the-pipeline commercial products before, so I'm used to not blabbing about other's projects)? There almost has to be a more efficient way than going over a gigantic text file with read names and either a bunch of ....h..H... or percentages.

@dpryan79
Copy link
Owner

dpryan79 commented Mar 9, 2018

FYI, the perRead stuff is now going into the perRead branch. One can now MethylDackel perRead ... with that and it seems not to crash. But I haven't really tested it yet, so hold off on trying it unless you're a glutton for punishment. It's written to be multithreaded, though I don't think the actual options for that are there at the moment.

@dpryan79
Copy link
Owner

The perRead branch seems to produce correct results now. Multithreading works properly and I added some other rudimentary options. It should be easy enough to add other options in. Let me know if you need anything else from that or need the output in a different format.

@nchernia
Copy link
Author

Thanks! I will give it a try very soon.

@nchernia
Copy link
Author

I've been working with this function, thanks. One question - I should expect the reads to appear more than once in the output if they align in multiple places, right?

@dpryan79
Copy link
Owner

Good question. You should see each instance, since it doesn't use filter_func(). I'll try to make a short test of that, though.

@dpryan79
Copy link
Owner

I can confirm that you should see each instance. If you run MethylDackel perRead tests/cg100.fa tests/cg_aln.bam in the source tree you'll see each instance of read1 in the output. I should add a method to filter reads, though. If you need --ignoreFlags or --requireFlags now then let me know.

@nchernia
Copy link
Author

Yes, I should have clarified that I do see multiple instances, just double checking that I understood how it worked. Yes, I need --ignoreFlags and --requireFlags though this isn't urgent.

@dpryan79
Copy link
Owner

Both those options should now be supported (they both work properly when I test them locally).

@nchernia
Copy link
Author

nchernia commented May 3, 2018

Hi - perRead seems to include MAPQ 0 reads. These are filtered out by default in "extract", right?

@dpryan79
Copy link
Owner

dpryan79 commented May 3, 2018

extract filters base score of 0 to avoid double calling overlapping ends of mates. The default -q will also skip MAPQ 0, but that can be changed. If I never added a MAPQ thresholding option to perRead I can do that, just let me know if you need it.

@nchernia
Copy link
Author

nchernia commented May 3, 2018

I'm a little confused. In the usage for extract it says:
Options:
-q INT Minimum MAPQ threshold to include an alignment (default 10)

So I thought reads with MAPQ < 10 were not included in extract. Is that correct?

I had assumed perRead would have these same defaults. For me it's not strictly necessary but may be useful for others.

@dpryan79
Copy link
Owner

dpryan79 commented May 3, 2018

You're correct regarding extract. I wrote perRead to be pretty bare-bones, so it includes most things by default. I can add the -q and -p options, though, with the defaults from extract.

@nchernia
Copy link
Author

nchernia commented May 3, 2018

Thanks; whatever you think is best.

Just one more clarification question, perRead looks just at CpG methylation, right? Not CHH or CHG (unless these are passed in)?

@dpryan79
Copy link
Owner

dpryan79 commented May 4, 2018

Correct, it's just CpG at this point. I'll try to get MAPQ and base quality filtering added today.

@dpryan79
Copy link
Owner

dpryan79 commented May 4, 2018

The -q and -p options should now be present with the same default values as extract. I've not thoroughly tested these, so let me know if they're buggy.

@nchernia
Copy link
Author

nchernia commented Nov 8, 2018

Question - sometimes the output is

readname chr position 0 0

Why does it output "no evidence"? Is it just seeing that the read is in the appropriate region but doesn't cover a CpG?

@dpryan79
Copy link
Owner

dpryan79 commented Nov 9, 2018

My guess is that this happens when all of the bases are ignored due to quality filtering, but I'm not sure off-hand.

@SamERoss
Copy link

SamERoss commented Nov 21, 2018

Hello,

Would it also be possible to add CHG and CHH options for perRead?

@amatthopkins
Copy link

Hi there!

Thanks for making this! Just wanted to ask about column 5 of the output - number of informative bases. Is the number of informative bases the sum of the number of methylated Cs and unmethylated Cs in CpGs in a given read?

@dpryan79
Copy link
Owner

dpryan79 commented Apr 3, 2019

@amatthopkins Yes, exactly.

@dpryan79 dpryan79 closed this as completed Apr 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants