FastQC vs SolexaQA #25

tanglingfung · 2011-05-01T02:28:43Z

Brad,
It's not really an issue, but I want to know, from your experience, how much time do you save by switching from SolexaQA to FastQC?

Thanks,
Paul

tanglingfung · 2011-05-01T02:40:23Z

Also, our system appears to be overloaded after recalibration with GATK, and it slows down a lot afterwards. Do you have any suggestions on this?

Thanks a lot!
Paul

chapmanb · 2011-05-01T10:54:10Z

Paul;
FastQC is much faster than SolexaQA. For large 100bp paired-end HiSeq runs, SolexaQA was taking upwards of 8 hours. FastQC runs in a couple of hours, and has more useful details like overrepresented kmers and sequences. Sorry for the change, but hopefully this makes it easier to install and faster for the future.

For your GATK issues, how much memory does your machine have? Perhaps our current memory settings (6GB per process) are too high and are causing swapping. I could make that configurable if it would be helpful. Thanks,
Brad

tanglingfung · 2011-05-01T14:14:43Z

Brad,

I actually like the change. I am just concerned about consistency for some of our projects. But I agree with you that FastQC has more useful details.

Our machine currently has 48G RAM with 2x12 cores. I guess it would be more reasonable to add more RAM in our case. May I know the configuration of your system? And how long does it take to go through the whole pipeline (e.g. in a SNP calling analysis)?

Thanks,
Paul

By the way, we are working on a script that creates symbolic links to fastq files in different flowcell directories and puts them in a virtual one to be processed by automated_initial_analysis.py. I think that would be useful for multiple samples run across multiple flowcells.
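A minimal sketch of what such a linking script could look like. This is not Paul's actual script: the function name `link_fastq_files`, the file suffixes it accepts, and the choice to prefix each link with its flowcell directory name are all assumptions made for illustration.

```python
import os


def link_fastq_files(flowcell_dirs, virtual_dir):
    """Symlink fastq files from several flowcell directories into one directory.

    Hypothetical helper: the accepted suffixes and the naming scheme are
    assumptions, not taken from the pipeline.
    """
    os.makedirs(virtual_dir, exist_ok=True)
    linked = []
    for fc_dir in flowcell_dirs:
        # Prefix each link with the flowcell directory name so that files
        # with the same name in different flowcells do not collide.
        fc_name = os.path.basename(os.path.normpath(fc_dir))
        for fname in sorted(os.listdir(fc_dir)):
            if not fname.endswith((".fastq", ".fastq.gz", "_fastq.txt")):
                continue
            target = os.path.abspath(os.path.join(fc_dir, fname))
            link = os.path.join(virtual_dir, "%s_%s" % (fc_name, fname))
            # lexists also catches dangling links left over from earlier runs.
            if not os.path.lexists(link):
                os.symlink(target, link)
            linked.append(link)
    return linked
```

Calling `link_fastq_files(["/path/to/FC1", "/path/to/FC2"], "/path/to/virtual")` (placeholder paths) would leave the original files untouched and give the analysis script a single directory to read from. Running it twice is harmless, since existing links are skipped.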

chapmanb · 2011-05-01T16:52:49Z

Paul;
Glad you like the FastQC approach. It's really helpful for debugging and is actively developed for these large flowcells, which is really useful as well. You should grab the latest change I just checked in, which avoids some LaTeX issues with some of the FastQC output -- some percents and other characters need to be escaped for LaTeX.
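To illustrate the kind of escaping involved, here is a minimal sketch. It is not the change that was checked in to the pipeline; the function name `escape_latex` and the exact set of characters handled are assumptions.

```python
import re

# Characters with special meaning in LaTeX text mode and their escaped forms.
LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "%": r"\%",
    "&": r"\&",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}
_SPECIAL = re.compile("|".join(re.escape(c) for c in LATEX_ESCAPES))


def escape_latex(text):
    """Escape LaTeX special characters in one pass (hypothetical helper)."""
    # A single regex pass avoids re-escaping the braces that the
    # replacement for a backslash itself introduces.
    return _SPECIAL.sub(lambda m: LATEX_ESCAPES[m.group(0)], text)
```

For example, `escape_latex("50% of reads_1")` returns the string `50\% of reads\_1`, which LaTeX can typeset without treating the percent sign as a comment.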

We have 48G of RAM but only 8 cores; we need more processors to deal with the new HiSeq. If you can run up to 24 processes currently, you would expect some memory swapping on barcoded HiSeq lanes. I'll work on making that a configurable parameter; it might slow down GATK but would at least save tons of swapping slowness.
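The arithmetic behind the swapping is simple: 24 processes at 6GB each would request 144GB on a 48GB machine. A rough sketch of how a process limit could be derived from the memory setting follows; the function name `max_processes` and the `reserved_gb` headroom for the operating system are assumptions, not pipeline parameters.

```python
def max_processes(total_ram_gb, mem_per_process_gb, cores, reserved_gb=4):
    """Largest process count that fits both the cores and the RAM budget.

    Hypothetical helper; reserved_gb is an assumed allowance for the OS
    and other services, not a value taken from the pipeline.
    """
    usable_gb = total_ram_gb - reserved_gb
    by_memory = int(usable_gb // mem_per_process_gb)
    # Always allow at least one process, even on very small machines.
    return max(1, min(cores, by_memory))


# 48GB of RAM, 6GB per process, 24 cores available.
print(max_processes(48, 6, 24))
```

Under these assumptions the machine would be capped at 7 concurrent processes rather than 24, trading some parallelism for the absence of swapping.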

Full barcoded SNP calling analyses can take a couple of days to process on that machine; it can be even longer if you have lots of barcodes as you need to wait for cores.

Let me know when you have your script finished. I'd be very happy to link to it from the documentation or include it as a utility for others. Thanks again,
Brad

tanglingfung · 2011-05-01T20:36:51Z

Thanks.

I may have made a mistake about the number of cores. Maybe ours also has 8 cores. But we are mostly handling 2x100bp runs. I cannot imagine the computational challenge when Illumina triples the throughput of the HiSeq this summer. I was wondering whether it would be better to move the pipeline to a cluster or to dedicate different tasks to different (but physically attached) servers. I noticed that the demand for CPU and RAM varies a lot throughout the pipeline. We are thinking about how best to utilize the computing resources.

By the way, what's the best practice for using git to stay up to date with the pipeline? I'm interested in contributing utility scripts once they have settled down.

Best,
Paul

brainstorm · 2011-05-02T22:15:03Z

Hello Paul,

I'm also running Brad's pipeline in production. I use a rather naïve approach to launch the automatic initial analysis, but so far it has worked acceptably well (~2 days of processing on average per run). It consists of putting a wrapper in place in post_process.yaml:

(...)
analysis:
  process_program: illumina_run_batch.sh

instead of the default "automated_initial_analysis.py".

illumina_run_batch.sh will queue the job on a cluster and launch the analysis on a single machine, using all 8 cores.

All this assumes that you have a "beowulf"-type cluster in place, together with a batch queueing system (perhaps you can ask your IT staff?):

http://en.wikipedia.org/wiki/Beowulf_(computing)
http://en.wikipedia.org/wiki/Batch-queuing_system

As I said, this is just a hack; better ways to parallelize and optimize this still need to be worked out.

Regarding best practice for using git, I would recommend that you "fork" Brad's repository by following this guide:

http://help.github.com/fork-a-repo/

Once you're happy with your changes, you may issue "Pull requests" towards Brad:

http://help.github.com/pull-requests/

Hope it all helps ! ;)

chapmanb · 2011-05-03T13:04:32Z

Paul;
Yes, the computational demands are a challenge. Luckily it was written to be parallelizable, but to handle the new HiSeq it'll need more cores as you suggest. Clusters are one possibility; I'd be happy to hear what you come up with.

Roman is spot on with his GitHub suggestions. Once you make a fork you can keep a repository of your own scripts in utils or wherever, and we can merge ones back into the main trunk. While you are developing you can keep pulling in changes from the main repository and git will help with merging differences.

Thanks guys.

tanglingfung · 2011-05-03T18:47:57Z

Thanks Roman and Brad.

Yes, I think the script is doing very well for a single flowcell with 8 cores and 48G RAM. It can be done in 2-3 days. No problem with that. I am also looking into the "beowulf"-type cluster. It makes configuration easier by syncing the OS of the servers.

Thanks again for all the advice and help here!

Best,
Paul

tanglingfung · 2011-05-05T18:21:02Z

Brad,

I have just tried the current version of the pipeline. However, the text from FastQC is still weird and the subtitle is missing.

Paul

chapmanb · 2011-05-05T22:00:35Z

Paul;
I'm not sure what you mean, can you be more specific on the problems you're seeing? I didn't add in captions on the figures for FastQC, if that's what you mean, as they have more useful titles than the previous plots. What text problems are you encountering?

tanglingfung · 2011-05-05T22:06:56Z

Brad,

Sorry for being unclear. I found that the problem comes from FastQC itself. The text problem I was having also appears in the PNG exported from FastQC. Sorry about that.

Thanks,
Paul

tanglingfung · 2011-06-10T23:04:06Z

We're getting stable with the pipeline now and have plans to move the analysis part to the cluster. I have also forked the repository, and hopefully I can start to contribute back to the pipeline development.

Thanks again for all the help over the past few months.

chapmanb · 2011-06-11T15:19:55Z

Paul;
Thanks for the message. That's great to hear -- really happy things are working out. Let me know when you have changes to merge back in. Thanks again,
Brad

tanglingfung closed this as completed May 1, 2011

b97pla referenced this issue in b97pla/bcbb Dec 26, 2011

Merge pull request SciLifeLab#25 from b97pla/master

94499e3

fastq_screen fixes & basecalling parameters
