Code to run logistic regression on v4 exomes and genomes with ancestry PCs #616

KoalaQin · 2024-05-22T02:02:46Z

No description provided.

jkgoodrich · 2024-05-25T01:30:24Z

gnomad_qc/v4/assessment/logistic_regression.py

+        ht = get_test_intervals(ht)
+        ht = ht.checkpoint(hl.utils.new_temp_file("test_intervals", "ht"))
+        exomes_vds = hl.vds.filter_intervals(
+            exomes_vds, ht, split_reference_blocks=True


Is this needed? Did the Hail team say this is faster than just filtering the variant matrix table and doing a densify like we do in this script? https://github.com/broadinstitute/gnomad_qc/blob/main/gnomad_qc/v4/sample_qc/generate_qc_mt.py

I didn't talk to the Hail team. I tried filter_variants and found that step took a very long time to finish. Then I remembered that once the VDS has the max reference block length stored (store_ref_block_max_length), filter_intervals is much faster.

OK, sounds good. If it's faster, then go for it.
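A rough, plain-Python illustration of why a stored maximum reference block length can make interval filtering cheaper. This is a conceptual sketch only, not Hail's implementation; the function names and the half-open `(start, end)` block representation are assumptions made for the example. With a known upper bound on block length, only blocks starting within `max_len` of the interval can overlap it, so the scan does not need to begin at the start of the contig. Clipping the surviving blocks to the interval mirrors the idea behind `split_reference_blocks=True`.

```python
from bisect import bisect_left


def overlapping_blocks(blocks, start, end, max_len):
    """Return reference blocks that overlap the half-open interval [start, end).

    blocks: list of (block_start, block_end) half-open tuples, sorted by block_start.
    max_len: upper bound on block_end - block_start for every block (assumed known).
    """
    starts = [b[0] for b in blocks]
    # A block that starts before start - max_len must end at or before start,
    # so it cannot overlap the interval and can be skipped without inspection.
    lo = bisect_left(starts, start - max_len)
    # Blocks that start at or after end cannot overlap either.
    hi = bisect_left(starts, end)
    return [b for b in blocks[lo:hi] if b[1] > start]


def clip(block, start, end):
    """Trim a block so that it lies entirely inside [start, end)."""
    return (max(block[0], start), min(block[1], end))


blocks = [(0, 50), (50, 120), (120, 130), (200, 260), (260, 300)]
hits = overlapping_blocks(blocks, 125, 210, max_len=70)
clipped = [clip(b, 125, 210) for b in hits]
```

Without the `max_len` bound, the lower end of the scan would have to fall back to the first block on the contig, since an arbitrarily long block could still reach into the interval.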

jkgoodrich · 2024-05-25T01:32:37Z

gnomad_qc/v4/assessment/logistic_regression.py

+    logger.info("Densifying exomes...")
+    exomes_mt = hl.vds.to_dense_mt(exomes_vds)
+    exomes_mt = exomes_mt.annotate_cols(is_genome=False)
+    exomes_mt = exomes_mt.select_entries("GT").select_rows().select_cols("is_genome")


If GT is the only entry you need, then you should select it before the densify. For this analysis, though, you probably also want to filter to only adj genotypes, so you will probably need more entry fields.

I agree, adj seems to make more sense. I will get that.

I found that the entries are not the same as the ones we used when computing freq for exomes or genomes. Could you check the new steps for me?
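To make the "you probably need more" point concrete, here is a simplified plain-Python sketch of an adj-style genotype filter. The thresholds (GQ >= 20, DP >= 10, allele balance >= 0.2 for heterozygous calls) are the commonly cited gnomAD defaults and are an assumption here; the repo's own adj helper is the authority, and this sketch ignores haploid calls and missing values. It shows that a filter of this kind reads GQ, DP, and AD in addition to GT, so those fields have to survive until the filter is applied.

```python
def is_adj(gt, gq, dp, ad, min_gq=20, min_dp=10, min_ab=0.2):
    """Return True if a diploid genotype passes the assumed adj thresholds.

    gt: tuple of two allele indices, e.g. (0, 1).
    gq: genotype quality.
    dp: total read depth.
    ad: per-allele read depths, indexed by allele index.
    """
    if gq < min_gq or dp < min_dp:
        return False
    if gt[0] != gt[1]:
        # Heterozygous: every non-reference allele in the call must reach
        # the minimum allele balance (dp >= min_dp, so dp is never zero here).
        return all(ad[a] / dp >= min_ab for a in gt if a != 0)
    return True


examples = {
    "balanced het": is_adj((0, 1), gq=30, dp=20, ad=[10, 10]),
    "skewed het": is_adj((0, 1), gq=30, dp=20, ad=[17, 3]),
    "low GQ hom alt": is_adj((1, 1), gq=15, dp=20, ad=[0, 20]),
    "hom ref": is_adj((0, 0), gq=99, dp=12, ad=[12, 0]),
}
```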

jkgoodrich · 2024-05-25T01:35:29Z

gnomad_qc/v4/assessment/logistic_regression.py

+    return ht
+
+
+def densify_union_exomes_genomes(


I would split this up. I would run the exomes and genomes filter and densify in parallel, checkpoint each, and then union after both are done and checkpoint the result.
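One way to lay out that split is to give the script separate steps, so the exomes and genomes densify runs can be launched as independent jobs and the union runs only afterwards. The sketch below covers the argument handling only; the flag names (`--densify`, `--data-type`, `--union`) and step labels are hypothetical and are not taken from the PR.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical step-based layout of the script."""
    parser = argparse.ArgumentParser(description="Sketch of split pipeline steps.")
    parser.add_argument(
        "--densify",
        action="store_true",
        help="Filter and densify one data type, then checkpoint it.",
    )
    parser.add_argument(
        "--data-type",
        choices=["exomes", "genomes"],
        help="Data type to densify (required with --densify).",
    )
    parser.add_argument(
        "--union",
        action="store_true",
        help="Union the two checkpointed dense MTs and checkpoint the result.",
    )
    return parser


def plan(args):
    """Return the ordered list of step labels implied by the parsed arguments."""
    steps = []
    if args.densify:
        if args.data_type is None:
            raise ValueError("--densify requires --data-type")
        steps.append(f"densify_{args.data_type}")
    if args.union:
        steps.append("union_exomes_genomes")
    return steps


parser = build_parser()
exomes_job = plan(parser.parse_args(["--densify", "--data-type", "exomes"]))
genomes_job = plan(parser.parse_args(["--densify", "--data-type", "genomes"]))
union_job = plan(parser.parse_args(["--union"]))
```

With this layout the two densify jobs can run at the same time on separate clusters, and the union job reads the two checkpoints once both exist.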

jkgoodrich · 2024-05-25T01:38:26Z

gnomad_qc/v4/assessment/logistic_regression.py

+    :param joint_ht: Joint HT of v4 exomes and genomes.
+    :return: Test Table
+    """
+    # Filter to chr22


Before running the chr22 test, make an actual test that is only the first few partitions of chr22

Using only a few partitions was slower when I tested. When I take the set of intervals on chr22 instead, the data is split across more partitions and densifies faster.

jkgoodrich · 2024-05-25T01:40:12Z

gnomad_qc/v4/assessment/logistic_regression.py

+        "firth",
+        y=mt.is_genome,
+        x=mt.GT.n_alt_alleles(),
+        covariates=[1] + [mt.pc[i] for i in range(10)],


I don't remember how many PCs we used for ancestry assignment off the top of my head, but I would use that number

Mike told me it was 10, but I do remember you were exploring up to 18 or 19. I will double check the code.

Sorry if I wasn't clear. I wasn't certain it was 10, I just thought it might be. It's 20: https://app.zenhub.com/workspaces/gnomad-5f4d127ea61afc001d6be50b/issues/gh/broadinstitute/gnomad_production/496

No worries. I put 10 there temporarily because it was 10 in Julia's code. I changed it in my test.
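Since the PC count changed during review, it may be worth passing it as a parameter rather than hard-coding it. A minimal sketch, assuming a hypothetical helper name `build_covariates` and a default of 20 taken from the comment above; it works on any indexable sequence, and the same pattern would apply to `mt.pc` in the Hail call. The leading 1.0 is the intercept term, which the diff includes explicitly as `[1]`.

```python
def build_covariates(pcs, n_pcs=20):
    """Return an intercept followed by the first n_pcs principal components.

    pcs: indexable sequence of PC scores (hypothetical stand-in for mt.pc).
    n_pcs: number of PCs to include; 20 is assumed from the review thread.
    """
    if len(pcs) < n_pcs:
        raise ValueError(f"Expected at least {n_pcs} PCs, got {len(pcs)}")
    return [1.0] + [pcs[i] for i in range(n_pcs)]


# Example with made-up PC scores for a single sample.
sample_pcs = [0.01 * i for i in range(30)]
covariates = build_covariates(sample_pcs)
```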

jkgoodrich · 2024-05-25T01:40:45Z

gnomad_qc/v4/assessment/logistic_regression.py

+    return mt
+
+
+def main(args):


Add a try/catch so the log is still copied if the script fails with an error.

I added this.
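A minimal sketch of that pattern, using `try`/`finally` so the copy step runs whether or not the pipeline raises. The names `run_with_log_copy`, `pipeline`, and `copy_log` are hypothetical placeholders; in a Hail script the copy step would typically be a call such as `hl.copy_log(...)`, but the exact code added to the PR is not shown in this thread.

```python
import logging

logger = logging.getLogger("logistic_regression_sketch")


def run_with_log_copy(pipeline, copy_log):
    """Run the pipeline and always copy the log afterwards.

    pipeline: zero-argument callable doing the real work (placeholder).
    copy_log: zero-argument callable that copies the log (placeholder).
    """
    try:
        pipeline()
    finally:
        # Runs on success and on failure; any exception from the pipeline
        # still propagates to the caller after the log has been copied.
        logger.info("Copying log...")
        copy_log()


events = []


def failing_pipeline():
    raise RuntimeError("simulated failure")


try:
    run_with_log_copy(failing_pipeline, lambda: events.append("copied"))
except RuntimeError:
    events.append("raised")
```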

KoalaQin · 2024-06-04T14:05:12Z

gnomad_qc/v4/assessment/logistic_regression.py

+        hl.utils.new_temp_file(f"temp_{data_type}_vds_filtered", "vds")
+    )
+    mt = hl.vds.to_dense_mt(vds)
+    mt = annotate_adj(mt)


After looking at your code, I think I could use filter_to_adj directly.

Code to run logistic regression on v4 exomes and genomes with ancestry PCs

071d4de

jkgoodrich self-requested a review May 23, 2024 13:46

jkgoodrich assigned jkgoodrich and KoalaQin May 23, 2024

jkgoodrich reviewed May 25, 2024

KoalaQin added 2 commits May 26, 2024 17:30

Merge branch 'main' into qh/logistic_regression

4b41e38

Address review suggestions

ad96374
