Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NNNNN reads #8

Closed
JWDebler opened this issue Apr 15, 2021 · 8 comments
Closed

NNNNN reads #8

JWDebler opened this issue Apr 15, 2021 · 8 comments

Comments

@JWDebler
Copy link

Hi Adam,
love the tool, but came across something strange.

I ran the tool like this:

minimap2 -ax map-ont P9424_final.fasta ../P9424.correctedReads.fasta.gz | teloclip --ref P9424_final.fasta.fai  | samtools sort > P9424_teloclip.bam

When I look at the bam files I get a lot of these NNNNNNN reads as in the below image mapping to contig ends.
But when I go and look that read name up in the actual fasta reads file, they are proper reads without any Ns.
Any idea where that might come from?

Cheers

image

@Adamtaranto
Copy link
Owner

Hi Johannes,

That is curious. Could you please post example SAM records for one read that shows this behaviour and one that does not?

What happens if you filter secondary alignments?

@JWDebler
Copy link
Author

Hi Adam, I added the secondary alignment filtering step samtools view -h -F 0x2308 and that seems to have taken care of the problem :-) Cheers.

@Adamtaranto
Copy link
Owner

Cool, that makes sense. When teloclip writes out the overhang sequence it is drawing on the read sequence stored in the SAM alignment record, this is only present for primary alignments.

The choice to discard secondary alignments was based on the assumption that sub-telomeric sequences are generally repetitive and therefore long-reads (containing telomeric repeats in the soft-clipped overhang) may also have secondary alignments to other chromosome/contig ends. It may be worthwhile checking where the primary alignments are for some of those overhanging secondary alignments - do those reads show up at the end of another contig?

I've corrected the hex code for filtering non-primary alignments, it should be 0x100. Previous code was also to catch PE reads where the mate was not mapped - doesn't apply to long reads.

@JWDebler
Copy link
Author

Hi Adam,

I just ran it with the 0x100 hex code, and the 'NNNNN' reads are back. 0x2308 removes them.

@Adamtaranto
Copy link
Owner

Can you please post a few example SAM records so I can figure out what's going on and update the docs accordingly?

@JWDebler
Copy link
Author

Sure, if you can tell me how to do that :-) I just run the command above and get the bam file.

@Adamtaranto
Copy link
Owner

For a contig where you know that there are overhanging reads with telomeric repeats on at least one end you can extract those alignments like this:

Align reads to reference + sort:
minimap2 -ax map-ont P9424_final.fasta ../P9424.correctedReads.fasta.gz | samtools sort > P9424_sorted.bam

Index on position:
samtools index P9424_sorted.bam

Filter for reads only on "Chr1" from position 1 to some number slightly longer than your longest read:
samtools view P9424_sorted.bam "Chr1:1-500000" > Chr1_leftend.bam

If you can post that final file + a fasta file with just the target contig (i.e. Chr1 in the example) I'll take a look. Can also email me if you don't want to post data.

@JWDebler
Copy link
Author

Hi Adam, looks like I grabbed the wrong bam file :-) I ran it again with 0x100 and 0x2308 and the results are almost identical. No 'NNNN' reads in either of them. It's only when I omit that step that the 'NNNN' reads creep in. False alarm :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants