picard_sam_to_bam.py #21

tanglingfung · 2011-04-12T07:49:04Z

Hi Brad,

It seems that it keeps looking for CreateSequenceDictionary in /usr/share/java/picard even though I have specified another path in my config file. I tried running the setup again after modifying the config files, but it still didn't use the path I specified.

Also, I don't think I have specified the path to hg19.fa for GATK anywhere?

Thanks,
Paul

chapmanb · 2011-04-12T11:41:54Z

Paul;
For your CreateSequenceDictionary problem, what did you modify in your post_process.yaml? You'll need to change program/picard in your post_process.yaml and point it at the new location. The path isn't hardcoded anywhere so I'm not sure how else to advise. What command line are you using?

For your hg19.fa problem, can you be more specific about the issue you are seeing?
Thanks

tanglingfung · 2011-04-12T14:40:39Z

Brad:
I changed picard in my post_process.yaml in bcbb/nextgen/config and I
was running nosetests -v -s in nextgen. I guessed that because when
I traced back the error messages, they said: Could not find jar
CreateSequenceDictionary in /usr/share/java/picard. So I wonder if I
didn't do the setup properly. Is post_process.yaml ready to go after
modification?

As for hg19.fa, it's not a problem. I just don't think I have provided
any path for the indexes. Or is it fine as long as bowtie and bwa know
where they are?
Best,
Paul

chapmanb · 2011-04-12T14:56:10Z

Paul;
Ah, I see now. For the picard problem, there is a separate configuration for tests, under the assumption that you'll have a different test and production environment:

nextgen/tests/data/automated

Sorry for the confusion.
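To check which directory a given config actually resolves to, a small lookup like the one below can help. This is only a sketch, not the pipeline's own code: `find_picard_jar` is a hypothetical name, and the assumption that jars are named `Command.jar` or `Command-<version>.jar` may not match every Picard install. The error text mirrors the message quoted above.

```python
import glob
import os


def find_picard_jar(picard_dir, command):
    """Return the path to the jar for `command` inside `picard_dir`.

    Assumption: jars are named `Command.jar` or `Command-<version>.jar`.
    """
    candidates = sorted(
        glob.glob(os.path.join(picard_dir, command + ".jar"))
        + glob.glob(os.path.join(picard_dir, command + "-*.jar"))
    )
    if not candidates:
        # Same wording as the error reported in this thread.
        raise ValueError("Could not find jar %s in %s" % (command, picard_dir))
    return candidates[0]
```

Calling it with the directory from the test configuration and then with the one from the production configuration should show quickly which of the two is being picked up.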

For hg19, the scripts in data_fabfile.py will update your Galaxy location files (tool-data/bowtie_indices.loc) with the locations for the indices. This is used by the automated pipeline to find sequence files as well.
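For anyone checking a location file by hand, here is a rough sketch of reading one. It assumes the usual Galaxy layout of tab-separated columns with `#` comment lines, and it assumes the first column is the build id and the last column is the index path; `read_loc_file` and the example path are made up for illustration.

```python
def read_loc_file(lines):
    """Map each build id in a Galaxy-style .loc file to its index path."""
    indexes = {}
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Assumed layout: first column is the build id, last is the path.
        indexes[fields[0]] = fields[-1]
    return indexes


example = [
    "# build id, dbkey, display name, index path (hypothetical entry)\n",
    "hg19\thg19\tHuman (hg19)\t/data/genomes/hg19/bowtie/hg19\n",
]
print(read_loc_file(example))  # {'hg19': '/data/genomes/hg19/bowtie/hg19'}
```

If the build you expect is missing from the result, the location file was probably not updated by the automated scripts.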

Let me know if you run into any other problems. Thanks,
Brad

tanglingfung · 2011-04-12T15:08:47Z

Thanks Brad.

Yes, we should have different environments for test and production,
but we kind of skipped that for now.

For hg19, I didn't find tool-data/bowtie_indices.loc being updated. I
actually installed the data and programs in parallel, so I guess it
didn't know about my Galaxy environment? Should I update it manually?

Thanks,
Paul

tanglingfung · 2011-04-12T15:55:08Z

Brad:

Actually, I know where to update it; I just wasn't sure if it's
appropriate to do that manually. After the manual update, it is working
now.

Thanks,
Paul

chapmanb · 2011-04-12T17:58:44Z

Paul;
Great -- really glad to hear it is working. It's no problem to update manually -- I've used Galaxy defaults as much as possible. The automated scripts just help you avoid the manual step if you want, but aren't required.

tanglingfung · 2011-04-12T19:45:51Z

Brad,

Thanks. One more thing (I guess it's the last one): I thought the
automated script would also download snp130.vcf, but it didn't. And I
can't find it in the config files or data_fabfile.py.

Thanks a lot for your help.

chapmanb · 2011-04-13T01:09:30Z

Paul;
You are into the tricky parts now that haven't yet been automated. Pulling down dbSNP in the right format for GATK takes a bit of work but here's the procedure I used. The only difference here is that I used Broad's GRCh37 genome instead of hg19; GATK can be a bit picky about the reference genomes, so if you are planning to use this for production work on humans I'd suggest trying to get the GRCh37 genome build. I will work on adding this to the S3 genomes to make it easier. Once you have that, you can get and prepare dbSNP with:

$ wget ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/v4.0/00-All.vcf.gz
$ gunzip 00-All.vcf.gz
$ mv 00-All.vcf dbSNP132-orig.vcf
$ java -jar /usr/share/java/gatk/GenomeAnalysisTK.jar -T VariantsToVCF -R ../seq/Homo_sapiens_assembly19.fasta -B:variant,vcf dbSNP132-orig.vcf -o dbSNP132.vcf
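One common source of reference mismatches is contig naming: UCSC hg19 names chromosomes `chr1`, `chr2`, and so on, while the GRCh37 build uses `1`, `2`. The sketch below is one way to spot that before running GATK. It is an illustration under stated assumptions, not part of the pipeline: the function names are hypothetical, and it only compares names, not lengths or ordering.

```python
def vcf_contigs(vcf_lines):
    """Collect the distinct CHROM values from VCF data lines, in order."""
    seen = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom = line.split("\t", 1)[0]
        if chrom not in seen:
            seen.append(chrom)
    return seen


def fasta_contigs(fasta_lines):
    """Collect sequence names from FASTA header lines."""
    return [
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    ]


def missing_from_reference(vcf_lines, fasta_lines):
    """Return VCF contigs with no same-named sequence in the reference."""
    reference = set(fasta_contigs(fasta_lines))
    return [c for c in vcf_contigs(vcf_lines) if c not in reference]
```

Passing open file handles for the VCF and the reference FASTA works, since both are read line by line. An empty result means every contig named in the VCF exists in the reference.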

Hope this helps

tanglingfung · 2011-04-13T06:28:10Z

Thanks Brad. That part is OK. But where can I modify the
location of the SNP datafile? We actually have snp130, but in .rod
format.

By the way, would you normally remove the random contigs from a
genome index? And is there any special reason for putting the index
read at the 3' end of read 1?

Paul

chapmanb · 2011-04-13T11:58:30Z

Paul;
You can specify the location of the SNP datafile in the post_process.yaml (under dbsnp). I haven't used dbSNP 130 in a while but here are my notes on how I converted it. I'm not sure if this will still work with current GATK; you may want to think about moving to 132:

wget ftp://ftp.broad.mit.edu/pub/gsa/gatk_resources.tgz
tar -xzvpf gatk_resources.tgz
mv resources/dbsnp_130_b37.rod .
java -jar /source/Picard/GATK/dist/GenomeAnalysisTK.jar -T VariantsToVCF -R ../seq/Homo_sapiens_assembly19.fasta -D dbsnp_130_b37.rod -B variant,DbSNP,dbsnp_130_b37.rod -o snp130.vcf

By default I use the random contigs; they are part of the genome. But others avoid them so this is more of a choice about how it will fit with your downstream analyses.

Indexes are normally 3' to avoid repetitive bases at the 5' end of the sequence which can throw off basecalling. This problem has been alleviated somewhat with newer basecallers and better tag design, but by default Illumina barcodes are 3'.
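With the barcode at the 3' end, demultiplexing amounts to slicing the last few bases off each read. A minimal sketch, assuming a fixed barcode length that you supply; `split_3prime_barcode` is a hypothetical helper and not the pipeline's barcode code:

```python
def split_3prime_barcode(read, barcode_length):
    """Split a read into (insert sequence, barcode), barcode at the 3' end."""
    if barcode_length <= 0 or barcode_length >= len(read):
        raise ValueError("barcode_length must be between 1 and len(read) - 1")
    return read[:-barcode_length], read[-barcode_length:]


# Made-up 14 bp read with a 6 bp barcode at its 3' end.
insert, barcode = split_3prime_barcode("ACGTACGTTTAGGC", 6)
print(insert, barcode)  # ACGTACGT TTAGGC
```

The quality string for the read would need the same slice so that bases and qualities stay aligned.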

tanglingfung · 2012-04-13T04:47:31Z

By the way, Brad, would you include the alternate haplotypes
(e.g. chr6_apd_hap1) in your reference genome for a mapper like BWA?

Thanks,
Paul

chapmanb · 2012-04-13T08:19:11Z

Paul;
I would include those, along with the random chromosomes, during the mapping step. As long as you use a consistent genome reference throughout the process GATK will process BAM files containing them without any problems. Hope this helps.
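To see which extra contigs a reference carries, and so confirm that the same set is used at every step, something like the following can label contig names. It is a sketch that assumes UCSC hg19-style names (`chr6_apd_hap1`, `chr1_gl000191_random`, `chrUn_gl000211`); GRCh37-style names such as `GL000191.1` would all fall through to `primary`, so treat it as illustrative only.

```python
def classify_contig(name):
    """Label a UCSC-style contig name by category."""
    lowered = name.lower()
    if lowered.endswith("_random"):
        return "random"
    if "_hap" in lowered:
        return "haplotype"
    if lowered.startswith("chrun"):
        return "unplaced"
    return "primary"


for contig in ["chr1", "chr6_apd_hap1", "chr1_gl000191_random", "chrUn_gl000211"]:
    print(contig, classify_contig(contig))
```

Running this over the sequence names of the reference used for mapping and the one given to GATK, then comparing the two lists, is a quick consistency check.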

chapmanb closed this as completed Apr 12, 2011

vals referenced this issue in vals/bcbb Oct 5, 2011

issue #21

17d6cf2
