picard_sam_to_bam.py #21

Closed
tanglingfung opened this issue Apr 12, 2011 · 12 comments

Comments

@tanglingfung

Hi Brad,

It seems that it keeps looking for CreateSequenceDictionary in /usr/share/java/picard even though I have specified another path in my config file. I tried running the setup again after modifying the config files, but it still didn't use the path I specified.

Also, I don't seem to have specified the path of hg19.fa for GATK anywhere?

Thanks,
Paul

@chapmanb
Owner

Paul;
For your CreateSequenceDictionary problem, what did you modify in your post_process.yaml? You'll need to change program/picard in your post_process.yaml and point it at the new location. The path isn't hardcoded anywhere, so I'm not sure how else to advise. What command line are you using?

For your hg19.fa problem, can you be more specific about the issue you are seeing?
Thanks

@tanglingfung
Author

Brad:
I changed picard in my post_process.yaml in bcbb/nextgen/config and I
was running nosetests -v -s in nextgen. I made that guess because when
I trace back the error messages, it says: Could not find jar
CreateSequenceDictionary in /usr/share/java/picard. So I wonder if I
didn't do the setup properly. Is post_process.yaml ready to go after
modification?

As for hg19.fa, it's not a problem. I just don't think I have provided
any path for the indexes. Or is it fine as long as bowtie and bwa know
where they are?
Best,
Paul

@chapmanb
Owner

Paul;
Ah, I see now. For the picard problem, there is a separate configuration for tests, under the assumption that you'll have a different test and production environment:

nextgen/tests/data/automated

Sorry for the confusion.
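
To see which picard path each configuration is actually using, a small helper like the following can print the setting from both locations. This is only a sketch: the function name show_picard_setting is made up for illustration, and it assumes the test configuration under nextgen/tests/data/automated is also a file called post_process.yaml.

```shell
# Print every line mentioning "picard" in any post_process.yaml
# found below the given directories.
# Assumption: the test configuration uses the same file name as the
# production one; adjust the -name pattern if yours differs.
show_picard_setting() {
    for dir in "$@"; do
        # grep -H prefixes each match with the file it came from
        find "$dir" -name 'post_process.yaml' -exec grep -H 'picard' {} +
    done
}
```

Running `show_picard_setting nextgen/config nextgen/tests/data/automated` from the bcbb checkout should then show whether the test configuration still points at /usr/share/java/picard.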

For hg19, the scripts in data_fabfile.py will update your Galaxy location files (tool-data/bowtie_indices.loc) with the locations for the indices. This is used by the automated pipeline to find sequence files as well.

Let me know if you run into any other problems. Thanks,
Brad

@tanglingfung
Author

Thanks Brad.

Yes, we should have a different environment for test and production
but we kind of skipped that for now.

For hg19, I didn't find tool-data/bowtie_indices.loc being updated. I
actually installed the data and programs in parallel, so I guess it
didn't know about my Galaxy environment? Should I update it manually?

Thanks,
Paul

@tanglingfung
Author

Brad:

Actually, I know where to update it; I just wasn't sure whether it's
appropriate to do that manually. After the manual update, it is working
now.

Thanks,
Paul

@chapmanb
Owner

Paul;
Great -- really glad to hear it is working. It's no problem to update manually -- I've used Galaxy defaults as much as possible. The automated scripts just help you avoid the manual step if you want, but aren't required.
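
For anyone else doing the manual step, here is a rough sketch of adding an entry to a Galaxy location file. It assumes the four-column, tab-separated layout (build id, dbkey, display name, index base path) used in Galaxy's sample bowtie_indices.loc; check the .loc.sample file in your own Galaxy install before relying on it. The function name add_loc_entry and the example paths are hypothetical.

```shell
# Append a build to a Galaxy .loc file unless it is already listed.
# Assumption: columns are <build_id> <dbkey> <display_name> <index_base>,
# separated by tabs, and the same name is reused for the first three.
add_loc_entry() {
    loc_file="$1"
    build="$2"
    index_base="$3"
    # awk exits 0 only when the first column already matches the build
    if [ -f "$loc_file" ] && awk -F'\t' -v b="$build" \
        '$1 == b { found = 1 } END { exit !found }' "$loc_file"; then
        echo "$build already listed in $loc_file"
    else
        printf '%s\t%s\t%s\t%s\n' "$build" "$build" "$build" "$index_base" >> "$loc_file"
    fi
}
```

For example, `add_loc_entry tool-data/bowtie_indices.loc hg19 /path/to/bowtie/hg19`, with the last argument replaced by wherever your index files actually live.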

@tanglingfung
Author

Brad,

Thanks. One more thing (I guess it's the last one): I thought the
automated script would also download snp130.vcf, but it didn't. And I
can't find it in the config files or data_fabfile.py.

Thanks a lot for your help.

@chapmanb
Owner

Paul;
You are into the tricky parts now that haven't yet been automated. Pulling down dbSNP in the right format for GATK takes a bit of work but here's the procedure I used. The only difference here is that I used Broad's GRCh37 genome instead of hg19; GATK can be a bit picky about the reference genomes, so if you are planning to use this for production work on humans I'd suggest trying to get the GRCh37 genome build. I will work on adding this to the S3 genomes to make it easier. Once you have that, you can get and prepare dbSNP with:

$ wget ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/v4.0/00-All.vcf.gz
$ gunzip 00-All.vcf.gz
$ mv 00-All.vcf dbSNP132-orig.vcf
$ java -jar /usr/share/java/gatk/GenomeAnalysisTK.jar -T VariantsToVCF -R ../seq/Homo_sapiens_assembly19.fasta -B:variant,vcf dbSNP132-orig.vcf -o dbSNP132.vcf
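
Since GATK can be picky about the reference, it may be worth checking that the contig names in the VCF match those in the FASTA before running the conversion. The following is a sketch of one way to do that with standard tools; the function name contigs_missing_from_reference is made up for illustration and is not part of GATK or the pipeline.

```shell
# List contig names that appear in a VCF but not in a FASTA reference.
contigs_missing_from_reference() {
    vcf="$1"
    fasta="$2"
    vcf_names=$(mktemp)
    ref_names=$(mktemp)
    # Column 1 of every non-header VCF line is the contig name
    grep -v '^#' "$vcf" | cut -f1 | sort -u > "$vcf_names"
    # FASTA headers start with '>'; the name is the first word after it
    grep '^>' "$fasta" | sed 's/^>//' | awk '{ print $1 }' | sort -u > "$ref_names"
    # comm -23 keeps lines that occur only in the first file
    comm -23 "$vcf_names" "$ref_names"
    rm -f "$vcf_names" "$ref_names"
}
```

Calling it as `contigs_missing_from_reference dbSNP132-orig.vcf ../seq/Homo_sapiens_assembly19.fasta` prints nothing when every contig in the VCF is present in the reference. Any names it does print (for example a "chr" prefix mismatch) point to a reference and dbSNP file that do not belong together.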

Hope this helps

@tanglingfung
Author

Thanks Brad. That part is OK. But then, where can I modify the
location of the SNP datafile? We actually have snp130, but in .rod
format.

By the way, would you normally remove the random contigs from a
genome index? And is there any special reason for putting the index
read at the 3' end of read 1?

Paul

@chapmanb
Owner

Paul;
You can specify the location of the SNP datafile in the post_process.yaml (under dbsnp). I haven't used dbSNP 130 in a while, but here are my notes on how I converted it. I'm not sure if this will still work with current GATK; you may want to think about moving to 132:

wget ftp://ftp.broad.mit.edu/pub/gsa/gatk_resources.tgz
tar -xzvpf gatk_resources.tgz
mv resources/dbsnp_130_b37.rod .
java -jar /source/Picard/GATK/dist/GenomeAnalysisTK.jar -T VariantsToVCF -R ../seq/Homo_sapiens_assembly19.fasta -D dbsnp_130_b37.rod -B variant,DbSNP,dbsnp_130_b37.rod -o snp130.vcf

By default I use the random contigs; they are part of the genome. But others avoid them so this is more of a choice about how it will fit with your downstream analyses.

Indexes are normally 3' to avoid repetitive bases at the 5' end of the sequence which can throw off basecalling. This problem has been alleviated somewhat with newer basecallers and better tag design, but by default Illumina barcodes are 3'.

vals referenced this issue in vals/bcbb Oct 5, 2011
@tanglingfung
Author

By the way, Brad, would you include the different haplotypes
(e.g. chr6_apd_hap1) in your reference genome for a mapper like BWA?

Thanks,
Paul

@chapmanb
Owner

Paul;
I would include those, along with the random chromosomes, during the mapping step. As long as you use a consistent genome reference throughout the process GATK will process BAM files containing them without any problems. Hope this helps.
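
To see which of these extra sequences a given reference actually contains, something along these lines lists them. It is a sketch that assumes UCSC-style hg19 names (suffixes like _random and _hap1, and the chrUn_ prefix); the GRCh37 build names its unplaced contigs differently, so the pattern would need adjusting there. The function name list_extra_contigs is hypothetical.

```shell
# Print the names of random, unplaced and haplotype contigs in a FASTA file.
# Assumption: UCSC hg19 naming, e.g. chr1_gl000191_random, chrUn_gl000211,
# chr6_apd_hap1.
list_extra_contigs() {
    grep '^>' "$1" | sed 's/^>//' | awk '{ print $1 }' \
        | grep -E '_random$|_hap[0-9]+$|^chrUn_'
}
```

Running the same check on the reference used for alignment and the one passed to GATK is a quick way to confirm both steps see the same set of contigs.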
