Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sequence dictionaries differ #126

Closed
slowkow opened this issue Dec 10, 2014 · 11 comments
Closed

Sequence dictionaries differ #126

slowkow opened this issue Dec 10, 2014 · 11 comments

Comments

@slowkow
Copy link
Contributor

slowkow commented Dec 10, 2014

I'm running:

java -Xmx4g -jar picard.jar CollectRnaSeqMetrics \
REFERENCE_SEQUENCE=hg19.genome.fa \
REF_FLAT=gencode.v19.annotation.refFlat \
RIBOSOMAL_INTERVALS=gencode.v19.rRNA.interval_list \
STRAND_SPECIFICITY=NONE \
INPUT=file.bam \
ASSUME_SORTED=true \
OUTPUT=file.rnaseq_metrics \
CHART_OUTPUT=file.rnaseq.pdf

I get this error:

Exception in thread "main" picard.PicardException: Sequence dictionaries differ in file.bam and gencode.v19.rRNA.interval_list

However, I copied the sequence dictionary from the BAM file. So, the sequence dictionaries are identical.

Here's the sequence dictionary along with the first 5 lines of the ribosomal intervals file:

@HD     VN:1.4  SO:coordinate
@SQ     SN:chrM LN:16571
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:171115067
@SQ     SN:chr7 LN:159138663
@SQ     SN:chr8 LN:146364022
@SQ     SN:chr9 LN:141213431
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
@SQ     SN:chr12        LN:133851895
@SQ     SN:chr13        LN:115169878
@SQ     SN:chr14        LN:107349540
@SQ     SN:chr15        LN:102531392
@SQ     SN:chr16        LN:90354753
@SQ     SN:chr17        LN:81195210
@SQ     SN:chr18        LN:78077248
@SQ     SN:chr19        LN:59128983
@SQ     SN:chr20        LN:63025520
@SQ     SN:chr21        LN:48129895
@SQ     SN:chr22        LN:51304566
@SQ     SN:chrX LN:155270560
@SQ     SN:chrY LN:59373566
chr1    9497728 9497837 -   ENST00000517147.1
chr1    13949679    13949779    -   ENST00000411020.1
chr1    34578550    34578664    +   ENST00000364278.1
chr1    37730278    37730387    -   ENST00000516559.1
chr1    39619836    39619968    -   ENST00000410446.1
@slowkow
Copy link
Contributor Author

slowkow commented Dec 10, 2014

By the way, you can see the full ribosomal intervals file and my script for creating it here: https://gist.github.com/slowkow/b11c28796508f03cdf4b

It would be nice to maintain a public repository of commonly used files such as this one.

@slowkow
Copy link
Contributor Author

slowkow commented Dec 10, 2014

The current implementation of Picard throws a new exception, discarding the information provided by htsjdk describing the difference between the two sequence dictionaries.

The new exception hides information from the htsjdk exception:

throw new PicardException("Sequence dictionaries differ in " + samFile.getAbsolutePath() + " and " + ribosomalIntervalsFile.getAbsolutePath(),

The htsjdk exception includes information: https://github.com/samtools/htsjdk/blob/a30bc12df2aeb3b4312a9c236c6a639025d5b596/src/java/htsjdk/samtools/util/SequenceUtil.java#L100

@yfarjoun
Copy link
Contributor

Can you please post the output of the following:

$ cat hg19.genome.dict
$ samtools view -H file.bam

Thanks.

Yossi.

On Wed, Dec 10, 2014 at 4:57 PM, Kamil Slowikowski <notifications@github.com

wrote:

I'm running:

java -Xmx4g -jar picard.jar CollectRnaSeqMetrics
REFERENCE_SEQUENCE=hg19.genome.fa
REF_FLAT=gencode.v19.annotation.refFlat
RIBOSOMAL_INTERVALS=gencode.v19.rRNA.interval_list
STRAND_SPECIFICITY=NONE
INPUT=file.bam
ASSUME_SORTED=true
OUTPUT=file.rnaseq_metrics
CHART_OUTPUT=file.rnaseq.pdf

I get this error:

Exception in thread "main" picard.PicardException: Sequence dictionaries differ in file.bam and gencode.v19.rRNA.interval_list

However, I copied the sequence dictionary from the BAM file. So, the
sequence dictionaries are identical.

Here's the sequence dictionary along with the first 5 lines of the
ribosomal intervals file:

@hd VN:1.4 SO:coordinate
@sq SN:chrM LN:16571
@sq SN:chr1 LN:249250621
@sq SN:chr2 LN:243199373
@sq SN:chr3 LN:198022430
@sq SN:chr4 LN:191154276
@sq SN:chr5 LN:180915260
@sq SN:chr6 LN:171115067
@sq SN:chr7 LN:159138663
@sq SN:chr8 LN:146364022
@sq SN:chr9 LN:141213431
@sq SN:chr10 LN:135534747
@sq SN:chr11 LN:135006516
@sq SN:chr12 LN:133851895
@sq SN:chr13 LN:115169878
@sq SN:chr14 LN:107349540
@sq SN:chr15 LN:102531392
@sq SN:chr16 LN:90354753
@sq SN:chr17 LN:81195210
@sq SN:chr18 LN:78077248
@sq SN:chr19 LN:59128983
@sq SN:chr20 LN:63025520
@sq SN:chr21 LN:48129895
@sq SN:chr22 LN:51304566
@sq SN:chrX LN:155270560
@sq SN:chrY LN:59373566
chr1 9497728 9497837 - ENST00000517147.1
chr1 13949679 13949779 - ENST00000411020.1
chr1 34578550 34578664 + ENST00000364278.1
chr1 37730278 37730387 - ENST00000516559.1
chr1 39619836 39619968 - ENST00000410446.1


Reply to this email directly or view it on GitHub
#126.

@slowkow
Copy link
Contributor Author

slowkow commented Dec 11, 2014

Hi Yossi, I believe I already posted it. See above.

You can see my ribosomal intervals file, in full, at the gist link. The header of my bam file is already posted in the first comment of this thread, except for the single line that specifies the program which created the bam file. Here it is a second time, with that line included:

$ samtools view -H file.bam
@HD     VN:1.4  SO:coordinate
@SQ     SN:chrM LN:16571
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:171115067
@SQ     SN:chr7 LN:159138663
@SQ     SN:chr8 LN:146364022
@SQ     SN:chr9 LN:141213431
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
@SQ     SN:chr12        LN:133851895
@SQ     SN:chr13        LN:115169878
@SQ     SN:chr14        LN:107349540
@SQ     SN:chr15        LN:102531392
@SQ     SN:chr16        LN:90354753
@SQ     SN:chr17        LN:81195210
@SQ     SN:chr18        LN:78077248
@SQ     SN:chr19        LN:59128983
@SQ     SN:chr20        LN:63025520
@SQ     SN:chr21        LN:48129895
@SQ     SN:chr22        LN:51304566
@SQ     SN:chrX LN:155270560
@SQ     SN:chrY LN:59373566
@PG     ID:subread      PN:subread      VN:1.4.5-p1

When you say hg19.genome.dict, are you referring to something else that is not the ribosomal intervals file? My understanding is that the CollectRnaSeqMetrics tool requires a RIBOSOMAL_INTERVALS file formatted as I have shown in my previous comment.

@nh13
Copy link
Collaborator

nh13 commented Dec 11, 2014

@slowkow I only see the sequence dictionary from the ribosmal interval list, not the sequence dictionary associated with the FASTA file.

@slowkow
Copy link
Contributor Author

slowkow commented Dec 11, 2014

@nh13 There is no mention of a sequence dictionary associated with the FASTA file in the error, and also no mention in the documentation for CollectRnaSeqMetrics.

Here are the chromosome names in my FASTA file:

grep "^>" hg19.genome.fa
>chrM
>chr1
>chr2
>chr3
>chr4
>chr5
>chr6
>chr7
>chr8
>chr9
>chr10
>chr11
>chr12
>chr13
>chr14
>chr15
>chr16
>chr17
>chr18
>chr19
>chr20
>chr21
>chr22
>chrX
>chrY

@nh13
Copy link
Collaborator

nh13 commented Dec 11, 2014

Do you have a file hg19.genome.dict or hg19.genome.fa.dict? If not, could you create one and post the sequence dictionary created (use CreateSequenceDictionary).

Regarding documentation and error messages, we would be grateful if you could submit a pull request so we can review (@vdauwera).

@slowkow
Copy link
Contributor Author

slowkow commented Dec 11, 2014

I created the file as you suggested.

I also tried replacing the dictionary in my ribosomal intervals file with the generated dictionary, and I receive the same error as in my first post..

cat hg19.genome.fa.dict
@HD     VN:1.4  SO:unsorted
@SQ     SN:chrM LN:16571        UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:d2ed829b8a1628d16cbeee88e88e39eb
@SQ     SN:chr1 LN:249250621    UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ     SN:chr2 LN:243199373    UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:a0d9851da00400dec1098a9255ac712e
@SQ     SN:chr3 LN:198022430    UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:641e4338fa8d52a5b781bd2a2c08d3c3
@SQ     SN:chr4 LN:191154276    UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:23dccd106897542ad87d2765d28a19a1
@SQ     SN:chr5 LN:180915260    UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:0740173db9ffd264d728f32784845cd7
@SQ     SN:chr6 LN:171115067    UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:1d3a93a248d92a729ee764823acbbc6b
@SQ     SN:chr7 LN:159138663    UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:618366e953d6aaad97dbe4777c29375e
@SQ     SN:chr8 LN:146364022    UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:96f514a9929e410c6651697bded59aec
@SQ     SN:chr9 LN:141213431    UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:3e273117f15e0a400f01055d9f393768
@SQ     SN:chr10        LN:135534747    UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:988c28e000e84c26d552359af1ea2e1d
@SQ     SN:chr11        LN:135006516    UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:98c59049a2df285c76ffb1c6db8f8b96
@SQ     SN:chr12        LN:133851895    UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:51851ac0e1a115847ad36449b0015864
@SQ     SN:chr13        LN:115169878    UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:283f8d7892baa81b510a015719ca7b0b
@SQ     SN:chr14        LN:107349540    UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:98f3cae32b2a2e9524bc19813927542e
@SQ     SN:chr15        LN:102531392    UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:e5645a794a8238215b2cd77acb95a078
@SQ     SN:chr16        LN:90354753     UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:fc9b1a7b42b97a864f56b348b06095e6
@SQ     SN:chr17        LN:81195210     UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:351f64d4f4f9ddd45b35336ad97aa6de
@SQ     SN:chr18        LN:78077248     UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c
@SQ     SN:chr19        LN:59128983     UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:1aacd71f30db8e561810913e0b72636d
@SQ     SN:chr20        LN:63025520     UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:0dec9660ec1efaaf33281c0d5ea2560f
@SQ     SN:chr21        LN:48129895     UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:2979a6085bfe28e3ad6f552f361ed74d
@SQ     SN:chr22        LN:51304566     UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:a718acaa6135fdca8357d5bfe94211dd
@SQ     SN:chrX LN:155270560    UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:7e0e2e580297b7764e31dbc80c2540dd
@SQ     SN:chrY LN:59373566     UR:file:/medpop/rnaseq/genome/hg19.genome.fa    M5:1e86411d73e6f00a10590f976be01623

@nh13
Copy link
Collaborator

nh13 commented Dec 11, 2014

Ah, it says the interval list and file.bam dictionaries are different. My apologies, could you post the sequence dictionary for the header in file.bam (ex. samtools view -H file.bam).

@slowkow
Copy link
Contributor Author

slowkow commented Dec 11, 2014

#126 (comment)

@slowkow
Copy link
Contributor Author

slowkow commented Dec 12, 2014

The problem: my sequence dictionary lines are space-delimited, and they must be tab-delimited, i.e. \t.

I apologize for the confusion, and for wasting your time. I recommend that you document the format of the interval list file, or refer the user to an example of a properly-formatted file. It would be useful to point them to a ready-to-use file like this one.

Strangely, the current documentation for CollectRnaSeqMetrics refers the user to the IntervalList javadoc. The comment at the top of the page does not mention that the SAM-style header must be tab-delimited. In my opinion, developers -- not users -- should be referred to javadoc documentation.

Wrong (each item is separated by a space ' '):

@SQ SN:chrM LN:16571

Right (each item is separated by a tab '\t'):

@SQ SN:chrM LN:16571

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants