Question: How to get paired-end AlignmentRecords like RDD[(AlignmentRecord, AlignmentRecord)]? #1419

Closed
xubo245 opened this issue Mar 4, 2017 · 5 comments

Comments


xubo245 commented Mar 4, 2017

Question: How to get paired-end AlignmentRecords like RDD[(AlignmentRecord, AlignmentRecord)]?

I want to use it for read mapping.

I tried loadPairedFastq and loadFastq:

    // Two alternatives; distinct names so both lines compile in the same scope
    val pairedRdd1 = sc.loadPairedFastq(str1, str2, None, ValidationStringency.STRICT)
    val pairedRdd2 = sc.loadFastq(filePath1 = str1, filePath2Opt = Option(str2))

But these only return an AlignmentRecordRDD or an RDD[AlignmentRecord]; I can't get paired reads.

Any help would be appreciated, thanks.

Member

fnothaft commented Mar 4, 2017

Hi @xubo245! ADAM's paired end data structure is the Fragment/FragmentRDD. I have a usage example for alignment at ytchen0323/cloud-scale-bwamem#9.

Author

xubo245 commented Mar 4, 2017

Thank you, @fnothaft! I will read it later.

For now I am trying a groupBy on readName to get an RDD[(readName, Iterable[AlignmentRecord])]. Have you tried this? I don't know whether there is a problem with it...

I ran cs-bwamem with its upload function before; it spent a lot of time copying the paired-end FASTQ files from the local file system to HDFS.

Member

fnothaft commented Mar 4, 2017

The groupBy on readname usually works, but sometimes people put a /1,/2 or _1/_2 suffix on the reads for first/second of pair, so you'll want to look out for that.
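A minimal sketch of the suffix caveat above, written with plain Scala collections rather than Spark or ADAM so that it runs on its own. `Read`, `PairByName`, `baseName`, and `pair` are hypothetical names made up for this illustration (they are not part of ADAM), and the regex assumes that a trailing `/1`, `/2`, `_1`, or `_2` always marks the mate, which may not hold for every naming scheme.

```scala
// Hypothetical stand-in for ADAM's AlignmentRecord; only the fields used here.
final case class Read(name: String, sequence: String)

object PairByName {
  // Assumption: a trailing /1, /2, _1 or _2 is a mate suffix, not part of the name.
  private val MateSuffix = "[/_][12]$".r

  // Strip the mate suffix so both reads of a pair share the same key.
  def baseName(readName: String): String =
    MateSuffix.replaceFirstIn(readName, "")

  // Group reads by normalized name and keep only groups of exactly two reads.
  // Unpaired reads and names with more than two reads are dropped.
  def pair(reads: Seq[Read]): Map[String, (Read, Read)] =
    reads.groupBy(r => baseName(r.name)).collect {
      case (name, group) if group.size == 2 =>
        // Sorting by the full name puts the "1" mate first when suffixes exist;
        // the sort is stable, so identical names keep their input order.
        val sorted = group.sortBy(_.name)
        name -> ((sorted(0), sorted(1)))
    }
}
```

On an RDD, the same `baseName` function could presumably be used as the key function for `groupBy`, with the same filtering applied to each group afterwards.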

Actually, the slowness of the FASTQ upload was one of my motivations for doing the cs-bwamem refactor with Fragment!

Author

xubo245 commented Mar 5, 2017

ADAM and cs-bwamem both already deal with /1 and /2; I groupBy the data after loadFastq in ADAM.

I have not seen the _1/_2 suffix format, and I am not sure whether ADAM handles it.

Thanks. I have used Spark + ADAM + BWA for read mapping in a distributed environment, and it is faster than cs-bwamem, but I have not verified all of the results yet.

Thank you for your help and contribution.

Author

xubo245 commented Mar 5, 2017

Have you tried to improve the alignment performance of cs-bwamem? For example, SWExtend in worker1 of cs-bwamem.
