Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-GVS bits required for GVS [VS-971] #8362

Merged
merged 5 commits into from Jun 27, 2023

Conversation

mcovarr
Copy link
Collaborator

@mcovarr mcovarr commented Jun 13, 2023

A collection of changes in non-GVS packages required to build a working version of GVS against master:

  1. Support for an optional monitoring script for VQSR Lite JointVcfFiltering.wdl.
  2. VQS_SENS_FAILURE_PREFIX VCF header value updated for correctness.
  3. Moved all BigQuery classes under a gvs package to make clear these are currently considered to be GVS specific.
  4. Added method to BigQueryUtils.
  5. ExcessHet calculation fixes for the case of no PLs. Removed, no longer required with Annotation changes in ExtractTool.

@mcovarr mcovarr changed the title [DRAFT] Non-GVS bits required for GVS [VS-971] Non-GVS bits required for GVS [VS-971] Jun 14, 2023
@mcovarr mcovarr marked this pull request as ready for review June 14, 2023 12:29
@mcovarr mcovarr requested review from lbergelson, ldgauthier and droazen and removed request for lbergelson June 20, 2023 17:16
@@ -451,6 +451,19 @@ public static StorageAPIAvroReaderAndBigQueryStatistics executeQueryWithStorageA
}

public static boolean doRowsExistFor(String projectID, String datasetName, String tableName, String columnName, String value) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't unique to this PR, but should we be worried about injection attacks here? It seems like we should be sanitizing the inputs and probably using whatever bigquery query builder exists instead of directly munging strings from the user into our queries.

Copy link
Collaborator Author

@mcovarr mcovarr Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably not clear from the way GitHub displays these diffs, but this PR is adding a second version of doRowsExistFor that makes its last parameter a Long instead of the String. So the changes here should be even safer than the baseline. 🙂

The code in this file is often formatting not just column values into String, but also project / dataset / table. All of these are variable in GVS (even table names, thanks to "superpartitioning" to work around BQ partitioning constraints).

At least in the GVS context I'm not particularly concerned about injection here for a few reasons:

  • The project and dataset must have been validated to even create the tables in the first place.
  • Table and column names are not specified from user input.
  • Users would be running GVS via their pet service accounts and wouldn't have the ability to read or write anything that isn't theirs.
  • These doRowsExistFor are only called to validate sample ids (Longs) and digests. So even if the VCF header had some "little Bobby Tables" in it we would be querying for the digest of that.

Happy to add some docs / warnings to clarify for potential future users for whom these assumptions may not apply.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mcovarr I have a few comments. I didn't look at the wdl at all. It should really have a test to show the new functionality also.

if (!g.hasLikelihoods()) {
if (!g.isHomRef() && !roundContributionFromEachGenotype) {
throw new IllegalStateException("Genotype likelihoods cannot be determined from GQ for genotype: " + g +
".\nGenotypes with no PLs should have integer counts using roundContributionFromEachGenotype = true.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message seems confusingly worded to me. Maybe something like "Non-hom-ref genotypes with no PLs are only allowed if roundContributionFromEachGenotype = true.

@@ -62,9 +62,19 @@ public static GenotypeCounts computeDiploidGenotypeCounts(final VariantContext v
continue;
}

if (!g.hasLikelihoods() && g.isHomRef()) {
if (!g.hasLikelihoods()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the control flow kind of confusing here now. I might rearrange it into a tree with round vs not round branches instead of pulling out that error case at the top.

Something like this:

            if (!g.hasLikelihoods()) {
                if (roundContributionFromEachGenotype) {
                    if (g.isHomRef()) {
                        genotypeWithTwoRefsCount += 1;
                    } else if (g.isHet()) {
                        genotypesWithOneRefCount += 1;
                    } else {
                        genotypesWithNoRefsCount += 1;
                    }
                } else {
                    if (!g.isHomRef()) {
                        throw new IllegalStateException("Genotype likelihoods cannot be determined from GQ for genotype: " + g +
                                ".\nGenotypes with no PLs should have integer counts using roundContributionFromEachGenotype = true.");
                    }
                    if (gq == 0) {
                        genotypeWithTwoRefsCount += 1.0/3;
                        genotypesWithOneRefCount += 1.0/3;
                        genotypesWithNoRefsCount += 1.0/3;
                    } else {
                        genotypeWithTwoRefsCount += QualityUtils.qualToProb(gq);
                        genotypesWithOneRefCount += 1 - QualityUtils.qualToProb(gq);
                        //assume last likelihood is negligible
                    }
                }
                continue;
            }

@@ -208,7 +208,7 @@ their names (or descriptions) depend on some threshold. Those filters are not i
public static final String VQSR_FAILURE_PREFIX = "low_VQSLOD_";
public static final String VQSR_FAILURE_SNP = VQSR_FAILURE_PREFIX + SNP;
public static final String VQSR_FAILURE_INDEL = VQSR_FAILURE_PREFIX + INDEL;
public static final String VQS_SENS_FAILURE_PREFIX = "low_VQS_SENS_";
public static final String VQS_SENS_FAILURE_PREFIX = "high_VQS_SENS_";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this was confusing enough that it was gotten wrong initially maybe add a comment explaining what this means?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's analogous to the low_VQSLOD_ prefix - "Prefix for a site (SNP/INDEL) that failed calibration sensitivity cutoff". In this case, the site would be a failure if the sensitivity is greater than the threshold.

@mcovarr mcovarr force-pushed the vs_971_non_gvs_bits_required_for_gvs branch from df66177 to a68c196 Compare June 20, 2023 20:56
@mcovarr
Copy link
Collaborator Author

mcovarr commented Jun 20, 2023

The "WDL test" CI failures are not related to the changes in this PR, please see this sanity check PR which is also currently aflame.

@mcovarr mcovarr force-pushed the vs_971_non_gvs_bits_required_for_gvs branch 2 times, most recently from e14616c to 585d0c8 Compare June 21, 2023 22:19
@mcovarr mcovarr requested a review from lbergelson June 22, 2023 12:35
@mcovarr
Copy link
Collaborator Author

mcovarr commented Jun 22, 2023

Some updates per the description:

  • Backed out ExcessHet changes from this PR
  • Removed BigQuery classes temporarily, to return inside a gvs package in a later PR. Brought these back under a gvs package

@mcovarr mcovarr force-pushed the vs_971_non_gvs_bits_required_for_gvs branch from 18202d7 to 23b750d Compare June 23, 2023 00:38
*
* Created by jonn on 4/17/19.
*/
public final class BigQueryUtils {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of deleting and re-adding these, it might be better to simply move them to their destined package in this PR. That way, we could preserve the git history of changes to these utils.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was done so I'm going to dismiss this review.

@mcovarr mcovarr force-pushed the vs_971_non_gvs_bits_required_for_gvs branch 3 times, most recently from 91de082 to 977a2ef Compare June 23, 2023 22:54
@mcovarr mcovarr force-pushed the vs_971_non_gvs_bits_required_for_gvs branch from 977a2ef to dc44285 Compare June 26, 2023 13:51
@mcovarr mcovarr requested a review from droazen June 26, 2023 14:41
Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@lbergelson lbergelson dismissed droazen’s stale review June 27, 2023 17:55

The requested changes were made.

@mcovarr mcovarr merged commit daeb3e2 into master Jun 27, 2023
20 checks passed
@mcovarr mcovarr deleted the vs_971_non_gvs_bits_required_for_gvs branch June 27, 2023 18:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants