FastQC vs SolexaQA #25

Closed
tanglingfung opened this issue May 1, 2011 · 13 comments

Comments

@tanglingfung

Brad,
This isn't really an issue, but I'd like to know, from your experience, how much time you save by switching from SolexaQA to FastQC.

Thanks,
Paul

@tanglingfung
Author

Also, our system appears to be overloaded after recalibration with GATK, and it slows down a lot afterwards. Do you have any suggestions?

Thanks a lot!
Paul

@chapmanb
Owner

chapmanb commented May 1, 2011

Paul;
FastQC is much faster than SolexaQA. For large 100bp paired-end HiSeq runs, SolexaQA was taking upwards of 8 hours. FastQC runs in a couple of hours and has more useful details, like overrepresented k-mers and sequences. Sorry for the change, but hopefully this makes the pipeline easier to install and faster in the future.

For your GATK issues, how much memory does your machine have? Perhaps our current memory setting (6GB per process) is too high and is causing swapping. I could make that configurable if it would be helpful. Thanks,
Brad

@tanglingfung
Author

Brad,

I actually like the change. I am just concerned about consistency for some of our projects. But I agree with you that FastQC has more useful details.

Our machine currently has 48G RAM with 2x12 cores. I guess it would be more reasonable to add more RAM in our case. May I ask what the configuration of your system is, and how long it takes to run the whole pipeline (e.g. for a SNP calling analysis)?

Thanks,
Paul

By the way, we are working on a script that creates symbolic links to fastq files from different flowcell directories and puts them in a virtual directory to be processed by automated_initial_analysis.py. I think that would be useful for multiple samples run across multiple flowcells.
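
For readers wondering what such a script might look like, here is a minimal sketch, not the actual script mentioned above. The function name, the `*.fastq` pattern, and the choice to prefix each link with its flowcell directory name are all assumptions made for illustration; the pipeline may expect a specific file naming scheme that this sketch does not reproduce.

```python
import glob
import os


def link_fastq_files(flowcell_dirs, virtual_dir):
    """Symlink every *.fastq file from each flowcell directory into virtual_dir.

    Link names are prefixed with the flowcell directory name (an assumption)
    so that identically named files from different flowcells do not collide.
    """
    os.makedirs(virtual_dir, exist_ok=True)
    links = []
    for fc_dir in flowcell_dirs:
        fc_name = os.path.basename(os.path.normpath(fc_dir))
        for fastq in sorted(glob.glob(os.path.join(fc_dir, "*.fastq"))):
            link = os.path.join(virtual_dir, f"{fc_name}_{os.path.basename(fastq)}")
            # Skip links that already exist so the script can be re-run safely.
            if not os.path.lexists(link):
                os.symlink(os.path.abspath(fastq), link)
            links.append(link)
    return links
```

Calling `link_fastq_files(["/path/to/FC1", "/path/to/FC2"], "/path/to/virtual")` (hypothetical paths) would leave the original files untouched and only populate the virtual directory with links.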

@chapmanb
Owner

chapmanb commented May 1, 2011

Paul;
Glad you like the FastQC approach. It's really helpful for debugging and is actively developed for these large flowcells, which is useful as well. You should grab the latest change I just checked in, which avoids some LaTeX issues with some of the FastQC output: percent signs and some other characters need to be escaped for LaTeX.
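
To illustrate the kind of escaping involved, here is a minimal sketch; it is not the code that was checked into the pipeline. The function name `escape_latex` and the character table are assumptions chosen for the example.

```python
import re

# Characters LaTeX treats specially in normal text, mapped to escaped forms.
LATEX_SPECIAL = {
    "\\": r"\textbackslash{}",
    "%": r"\%",
    "&": r"\&",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}

_PATTERN = re.compile("|".join(re.escape(c) for c in LATEX_SPECIAL))


def escape_latex(text):
    """Escape LaTeX special characters in a single pass over the text."""
    return _PATTERN.sub(lambda m: LATEX_SPECIAL[m.group(0)], text)


print(escape_latex("50% of reads match primer_1"))
```

Doing the substitution in one pass matters: replacing characters one at a time would re-escape the braces and backslashes introduced by earlier replacements.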

We have 48G of RAM but only 8 cores; we need more processors to deal with the new HiSeq. If you can run up to 24 processes currently, you would expect some memory swapping on barcoded HiSeq lanes. I'll work on making that a configurable parameter; it might slow down GATK but would at least avoid a lot of the slowness caused by swapping.
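
As a rough back-of-the-envelope check of the numbers in this thread (48G of RAM, 6GB per process, up to 24 processes), here is a small sketch. It ignores memory used by the operating system and other programs, and the function name is purely illustrative.

```python
def max_processes_without_swap(total_ram_gb, per_process_gb):
    """Largest number of processes whose combined memory fits in RAM."""
    return total_ram_gb // per_process_gb


total_ram_gb = 48    # RAM on the machines discussed in this thread
per_process_gb = 6   # per-process memory setting mentioned earlier in the thread
requested = 24       # processes on a 2x12-core machine

print(max_processes_without_swap(total_ram_gb, per_process_gb))  # 8
print(requested * per_process_gb)  # 144 (GB requested in the worst case)
```

With these figures, 24 simultaneous processes could ask for three times the available RAM, which is consistent with the swapping described above.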

Full barcoded SNP calling analyses can take a couple of days to process on that machine; it can be even longer if you have lots of barcodes as you need to wait for cores.

Let me know when you have your script finished. I'd be very happy to link to it from the documentation or include it as a utility for others. Thanks again,
Brad

@tanglingfung
Author

Thanks.

I may have made a mistake about the number of cores. Maybe ours is also 8 cores. But we are mostly handling 2x100bp runs. I cannot imagine the computational challenge when Illumina triples the throughput of the HiSeq this summer. I was wondering whether it would be better to move the pipeline to a cluster or to dedicate different tasks to different (but physically attached) servers. I noticed that the demand for CPU and RAM varies a lot throughout the pipeline. We are thinking about how to best utilize our computing resources.

By the way, what's the best practice for using git to stay up to date with the pipeline? I'm interested in contributing utility scripts once they are stable.

Best,
Paul

@brainstorm

Hello Paul,

I'm also running Brad's pipeline in production. I use a rather naïve approach to launch the automatic initial analysis, but so far it has worked acceptably well (~2 days of processing on average per run). It consists of putting a wrapper in place in post_process.yaml:

(...)
analysis:
  process_program: illumina_run_batch.sh

instead of the default "automated_initial_analysis.py"

illumina_run_batch.sh queues the job on a cluster and launches the analysis on a single machine, using all 8 cores.

All this assumes that you have a "beowulf"-type cluster in place, together with a batch queueing system (perhaps you can ask your IT staff?):

http://en.wikipedia.org/wiki/Beowulf_(computing)
http://en.wikipedia.org/wiki/Batch-queuing_system

As I said, this is just a hack; better ways to parallelize and optimize this still need to be worked out.

Regarding best practice for using git, I would recommend that you "fork" Brad's repository by following this guide:

http://help.github.com/fork-a-repo/

Once you're happy with your changes, you can send "pull requests" to Brad:

http://help.github.com/pull-requests/

Hope it all helps ! ;)

@chapmanb
Owner

chapmanb commented May 3, 2011

Paul;
Yes, the computational demands are a challenge. Luckily the pipeline was written to be parallelizable, but to handle the new HiSeq it'll need more cores, as you suggest. Clusters are one possibility; I'd be happy to hear what you come up with.

Roman is spot on with his GitHub suggestions. Once you make a fork you can keep a repository of your own scripts in utils or wherever, and we can merge them back into the main trunk. While you are developing you can keep pulling in changes from the main repository, and git will help with merging differences.

Thanks guys.

@tanglingfung
Author

Thanks Roman and Brad.

Yes, I think the script is doing very well for a single flowcell with 8 cores and 48G RAM. It can be done in 2-3 days. No problem with that. I am also looking into the "beowulf"-type cluster. It makes configuration easier by syncing the OS of the servers.

Thanks again for all the advice and help here!

Best,
Paul

@tanglingfung
Author

Brad,

I have just tried the current version of the pipeline. However, the text from FastQC is still weird and the subtitle is missing.

Paul

@chapmanb
Owner

chapmanb commented May 5, 2011

Paul;
I'm not sure what you mean; can you be more specific about the problems you're seeing? I didn't add captions to the FastQC figures, if that's what you mean, as they have more useful titles than the previous plots. What text problems are you encountering?

@tanglingfung
Author

Brad,

Sorry for being unclear. I found that the problem comes from FastQC itself: the text problem I was having also appears in the png files exported from FastQC. Sorry about that.

Thanks,
Paul

@tanglingfung
Author

We're getting stable with the pipeline now and have plans to move the analysis part to the cluster. Thanks again for all the help. I have also forked the repository, and hopefully I can start to contribute back to the pipeline development.

Thanks again for all the help over the past few months.

@chapmanb
Owner

Paul;
Thanks for the message. That's great to hear -- really happy things are working out. Let me know when you have changes to merge back in. Thanks again,
Brad

b97pla referenced this issue in b97pla/bcbb Dec 26, 2011
fastq_screen fixes & basecalling parameters