Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Edge case in LiftoverVcf causes cryptic string exception #1951

Closed
1 task done
rickymagner opened this issue Mar 23, 2024 · 8 comments
Closed
1 task done

Edge case in LiftoverVcf causes cryptic string exception #1951

rickymagner opened this issue Mar 23, 2024 · 8 comments

Comments

@rickymagner
Copy link
Contributor

Bug Report

Affected tool(s)

LiftoverVcf

Affected version(s)

  • Latest public release version [3.1.1] (GATK 4.5.0.0)

Description

This was discovered in trying to debug #1933. There is an edge case in lifting over deletions when they contract a tandem repeat close to the edge of the resulting liftover contig. Consider the entry in a real chain file (details in the steps to reproduce below on where files live):

chain   4900    chr17   83257441        +       22527230        22533041        unplaced_85     5811    -       0       5811    2535

Next consider the variant:

chr17   22533036        .       GTA     G       .       .       .

The 7 bases on chr17 from 22533036 are: GTATATG, and the first 6 bases on contig unplaced_85 are: ATATACAT. The chain file says that the GTATAT on chr17 corresponds to ATATAC on unplaced_85. The deletion as written in the input VCF corresponds to ATA--C in unplaced_85. When left aligned (as happens in the algorithm), this looks like A--TAC, so should be recorded as:

unplaced_85   1    ATA   A   .    .    .

But instead, LiftoverVcf throws an error (see stacktrace below) about a string index out of bounds.

I suspect there is a bug in our left aligning step, but can't see it yet from the code. There is a start - 2 in LiftoverUtils.leftAlignVariant which looks suspicious, but I haven't pinned that down as the direct cause yet.

Here are some experiments playing with the DEL length:

REF ALT RESULT
GT G Accept: unplaced_85 at POS 4
GTA G Error
GTAT G Accept: unplaced_85 at POS 2
GTATA G Error
GTATAT G Error
GTATATG G Reject: NoTarget

The fact that the smaller "odd" deletions work but the full contractions don't highly suggests something strange is happening during the left alignment after liftover. But the code is a bit complex so it'll take some digging.

Steps to reproduce

Grab an hg38 VCF header and record this variant as the sites-only VCF special_variant.vcf.gz:

chr17   22533036        .       GTA     G       .       .       .

Then take the JG2.1.0 reference here with chain file here. Save these to ref.fa.gz and lift.chain. Get the right ref indices ready. Then run

gatk LiftoverVcf -I special_variant.vcf.gz -O accept.vcf --REJECT reject.vcf -R ref.fa.gz --CHAIN lift.chain

You will see the stacktrace:

java.lang.StringIndexOutOfBoundsException: offset -1, count 3, length 5811
	at java.base/java.lang.String.checkBoundsOffCount(String.java:4591)
	at java.base/java.lang.String.<init>(String.java:397)
	at htsjdk.samtools.util.StringUtil.bytesToString(StringUtil.java:301)
	at picard.vcf.LiftoverVcf.tryToAddVariant(LiftoverVcf.java:558)
	at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:453)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:280)
	at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:166)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:209)
	at org.broadinstitute.hellbender.Main.main(Main.java:306)

Expected behavior

If a variant gets left aligned to the start of a contig, we should handle that either with a uniform algorithm that works everywhere or as a special case.

Actual behavior

The user sees an uninformative stack trace.

@yfarjoun
Copy link
Contributor

yfarjoun commented Mar 23, 2024

nice sleuthing work @rickymagner! IIANM, I'm largely responsible for the lift-over of indels, and so if I can help, let me know.

@rickymagner
Copy link
Contributor Author

Hi @yfarjoun, if you do have some time to look at this, that would be great. I suspect the part of the code causing this error is located in LiftoverUtils.java at L389 here:

            // 5. if there exists an empty allele then
            if (alleleBasesMap.values().stream()
                    .map(a -> a.length)
                    .anyMatch(l -> l == 0)) {
                // 6. extend alleles 1 nucleotide to the left
                for (final Allele allele : alleleBasesMap.keySet()) {
                    // the first -1 for zero-base (getBases) versus 1-based (variant position)
                    // another   -1 to get the base prior to the location of the start of the allele
                    final byte extraBase = (theStart > 1) ?
                            referenceSequence.getBases()[theStart - 2] :
                            referenceSequence.getBases()[theEnd];

                    alleleBasesMap.put(allele, extendOneBase(alleleBasesMap.get(allele), extraBase));
                }
                changesInAlleles = true;
                theStart--;

                // 7. end if
            }

The theStart - 2 looks like it could suspiciously lead to a string index becoming -1 when theStart == 1, but this could also be a red herring since the conditional which triggers this block might not be satisfied by the given variant. I haven't been able to look too deeply at the code around here to understand what's actually happening.

This is not the moment the stacktrace is thrown of course, but I'm wondering if it's the place where that -1 appears and then propagates further to the moment later in tryToAddVariant where the string constructor is given index -1 to start which triggers the error. Otherwise I'm not sure how that number would ever become -1 yet.

@yfarjoun
Copy link
Contributor

I'll see what I can do. do you have a test-case that fails? I know you got a vcf from a user, but have you converted it to an additional unit test? I think that would be a good first step, so if you did it, please point me to it, if not, I can try my hand.

@rickymagner
Copy link
Contributor Author

Thanks. Yes, I would take the files from the issue above (the JP ref + chain file linked from the public URL; should be small enough to download), and then take the bad deletion variant I wrote on chr17 and add an hg38 header to it. Any header should be fine, but didn't want to post a complete example since there's 3k contigs there. I haven't turned it into a formal Java test yet, but might be able to try that this week (using the example above, or rewriting one using whatever resource files we already have in the test suite).

@rickymagner
Copy link
Contributor Author

This is resolved by #1933

@serge2016
Copy link

I suppose it is resolved by #1956

@yfarjoun
Copy link
Contributor

yfarjoun commented Apr 12, 2024 via email

@rickymagner
Copy link
Contributor Author

@serge2016 I checked it works for the specific edge case variant I found before. Let us know if there are any other variants that give an issue if you try to rerun on it. I'm not sure when the next Picard release will be, but you can test building from master if you're interested in seeing if the new version of the tool can get through your entire file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants