# 11: Count qualifying variants per gene
This script counts the burden of qualifying variants per gene for each sample (you choose the definition of "qualifying") and writes out a table of these counts for all genes on a chromosome. It was run on a mem2_ss2_v2_x8 instance with 20 nodes, took ~7 hours, and cost £35 in total.

## Set up environment
Make sure you run this block only once. You'll get errors if you try to initialise Hail multiple times. If that happens, restart the kernel and then initialise Hail once.

In [1]:
# Initialise Hail and Spark. Running this cell will output a red-colored message; this is expected.
# The 'Welcome to Hail' message in the output will indicate that Hail is ready to use in the notebook.
import pyspark.sql

config = pyspark.SparkConf().setAll([('spark.kryoserializer.buffer.max', '128')])
sc = pyspark.SparkContext(conf=config) 

from pyspark.sql import SparkSession

import hail as hl
builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=sc)

import dxpy

pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/backend/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/backend/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 3.2.3
SparkUI available at http://ip-10-60-98-61.eu-west-2.compute.internal:8081
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.116-cd64e0876c94
LOGGING: writing to /opt/notebooks/hail-20250113-0947-0.2.116-cd64e0876c94.log


## AC<=5 counts

This script loops over each chromosome and reads in the QC'd and annotated matrix tables. It then counts the number of alternate alleles in each variant class carried by each individual, and writes the counts out per gene. The script runs the following stages:
- PREP MT: Reads in the matrix table from stage 10, keeps only variants with an alternate allele count (`variant_qc.AC[1]`) of 5 or less, and writes the filtered matrix table out to speed up downstream processing.
- COUNTS:
    - All rare variants: Counts rare variants of any annotation carried per gene per person and writes out counts as a .tsv file 
    - PTVs: Counts rare PTVs carried per person per gene and writes out as a .tsv file
    - Deleterious missense variants: Counts rare deleterious missense variants carried per person per gene and writes out as a .tsv file
    - Synonymous variants: Counts rare synonymous variants carried per person per gene and writes out as a .tsv file
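
As a rough illustration of what the counting stages compute, here is a minimal pure-Python sketch on toy data. It is not the Hail code used in this notebook: the variant records, the sample names and the `burden_per_gene` helper are all made up for the example. Variants above the AC threshold are dropped first, and the alternate alleles each sample carries are then summed per gene, optionally restricted to one variant class.

```python
# Toy data (hypothetical): one record per variant, with the gene it maps to,
# its alternate allele count (AC), its variant class, and the number of
# alternate alleles each sample carries (None = missing genotype call).
samples = ["s1", "s2", "s3"]
variants = [
    {"gene": "GENE_A", "ac": 2, "csq": "ptv", "alt_alleles": {"s1": 1, "s2": 0, "s3": 1}},
    {"gene": "GENE_A", "ac": 1, "csq": "synonymous", "alt_alleles": {"s1": 0, "s2": 1, "s3": 0}},
    {"gene": "GENE_A", "ac": 6, "csq": "ptv", "alt_alleles": {"s1": 2, "s2": 2, "s3": 2}},
    {"gene": "GENE_B", "ac": 2, "csq": "ptv", "alt_alleles": {"s1": 0, "s2": 2, "s3": None}},
]


def burden_per_gene(variants, samples, ac_filter=5, csq=None):
    """Sum alternate alleles per gene and sample for variants with AC <= ac_filter.

    If csq is given, only variants of that class contribute to the counts,
    but every gene that survives the AC filter still gets a row of zeros.
    """
    counts = {}
    for variant in variants:
        # PREP MT step: drop variants above the allele-count threshold.
        if variant["ac"] > ac_filter:
            continue
        gene_counts = counts.setdefault(variant["gene"], {s: 0 for s in samples})
        # COUNTS step: restrict to one variant class if requested.
        if csq is not None and variant["csq"] != csq:
            continue
        for sample, n_alt in variant["alt_alleles"].items():
            if n_alt:  # skips homozygous-reference (0) and missing (None) calls
                gene_counts[sample] += n_alt
    return counts


all_counts = burden_per_gene(variants, samples)
ptv_counts = burden_per_gene(variants, samples, csq="ptv")
print(all_counts)
print(ptv_counts)
```

The third toy variant has an AC of 6, so it is excluded from every count even though all three samples carry it.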


In [None]:

# Define the chromosomes you are working with
chromosomes = list(range(1, 23)) # chr 1 is stored in two halves, so it is processed separately (see the chr == 1 branch below)
AC_filter=5

for chr in chromosomes:
    if chr == 1 :
        # First half 
        print(f"Processing chromosome {chr} first half...")
    

        ####### PREP MT ######### 
        # Only needed the first time you run counts: the count-ready matrix table is written out here and can simply be read back in for later counts.
        # Read the matrix table for the current chromosome
        mt=hl.read_matrix_table("dnax://database-Gq45XQjJ637Q9X6XJJJ3Pf7k/chr_1_first_half_ready_for_counts.mt")
        print(f'AC filter set as <={AC_filter}')
        filtered_mt=mt.filter_rows(mt.variant_qc.AC[1]<= AC_filter)
        filtered_mt.checkpoint(f'AC{AC_filter}orless_chr_{chr}_first_half.mt', overwrite=True)


        ####### COUNTS ########
        # All rare variants
        # chr 1 is written out in two halves, so read the first-half table and give the output its own file name
        filtered_mt=hl.read_matrix_table(f'AC{AC_filter}orless_chr_{chr}_first_half.mt')
        all_var = (filtered_mt
           .group_rows_by(gene_id=filtered_mt.gene_id_worstCsq)
           .aggregate(
               n= hl.agg.filter(
                   (filtered_mt.GT.is_non_ref()),
                   hl.agg.sum(filtered_mt.GT.n_alt_alleles()))))
        print(f'AC<={AC_filter} counts for chr {chr} first half done...')
        all_var.n.export(f"chr_{chr}_first_half_AC{AC_filter}orless_gene_counts.tsv")
        print(f'Any variant class AC<={AC_filter} counts for chr {chr} first half written out as chr_{chr}_first_half_AC{AC_filter}orless_gene_counts.tsv')
        print('Dont forget to copy these tables up to your project before closing the session!!')

        # PTVs
        filtered_mt=hl.read_matrix_table(f'AC{AC_filter}orless_chr_{chr}_first_half.mt')
        PTV_var = (filtered_mt
           .group_rows_by(gene_id=filtered_mt.gene_id_worstCsq)
           .aggregate(
               n= hl.agg.filter(
                   (filtered_mt.LoF_worstCsq==True)& 
                   (filtered_mt.GT.is_non_ref()),
                   hl.agg.sum(filtered_mt.GT.n_alt_alleles()))))
        print(f'PTV AC<={AC_filter} counts for chr {chr} first half done...')
        PTV_var.n.export(f"chr_{chr}_first_half_PTV_AC{AC_filter}orless_gene_counts.tsv")
        print(f'PTV AC<={AC_filter} counts for chr {chr} first half written out as chr_{chr}_first_half_PTV_AC{AC_filter}orless_gene_counts.tsv')
        print('Dont forget to copy these tables up to your project before closing the session!!')
    
        # Deleterious missense variants 
        filtered_mt=hl.read_matrix_table(f'AC{AC_filter}orless_chr_{chr}_first_half.mt') # Read the checkpointed table back in, as this speeds up processing. checkpoint() returns the re-read table, but that only helps if its return value is assigned.
        Revel75_miss_var=(filtered_mt
             .group_rows_by(gene_id=filtered_mt.gene_id_worstCsq)
             .aggregate(
                 n = hl.agg.filter(
                     (filtered_mt.Miss_worstCsq == True)& 
                     (filtered_mt.REVEL_score> 0.75),
                     hl.agg.sum(filtered_mt.GT.n_alt_alleles()))))
        print(f'REVEL >0.75 missense AC<={AC_filter} counts for chr {chr} first half done...')
        Revel75_miss_var.n.export(f"chr_{chr}_first_half_REVEL75_Miss_AC{AC_filter}orless_gene_counts.tsv")
        print(f'REVEL >0.75 missense AC<={AC_filter} counts for chr {chr} first half written out as chr_{chr}_first_half_REVEL75_Miss_AC{AC_filter}orless_gene_counts.tsv')
        print('Dont forget to copy these tables up to your project before closing the session!!')

    
        # Synonymous variants 
        filtered_mt=hl.read_matrix_table(f'AC{AC_filter}orless_chr_{chr}_first_half.mt')
        syn_var=(filtered_mt
             .group_rows_by(gene_id=filtered_mt.gene_id_worstCsq)
             .aggregate(
                 n = hl.agg.filter(
                     (filtered_mt.Syn_worstCsq == True),
                     hl.agg.sum(filtered_mt.GT.n_alt_alleles()))))
        print(f'Synonymous AC<={AC_filter} counts for chr {chr} first half done...')
        syn_var.n.export(f"chr_{chr}_first_half_synonymous_AC{AC_filter}orless_gene_counts.tsv")
        print(f'Synonymous AC<={AC_filter} counts for {chr} first half written out as chr_{chr}_first_half_synonymous_AC{AC_filter}orless_gene_counts.tsv')
        print('Dont forget to copy these tables up to your project before closing the session!!')
    
        print(f"Finished processing chromosome {chr} first half!")
    
    
        # Second half
        print(f"Processing chromosome {chr} second half...")
    
        ####### PREP MT ######### 
        # Read the matrix table for the current chromosome
        mt=hl.read_matrix_table("dnax://database-Gq45XQjJ637Q9X6XJJJ3Pf7k/chr_1_second_half_ready_for_counts.mt")
        print(f'AC filter set as <={AC_filter}')
        filtered_mt=mt.filter_rows(mt.variant_qc.AC[1]<= AC_filter)
        filtered_mt.checkpoint(f'AC{AC_filter}orless_chr_{chr}_second_half.mt', overwrite=True)


        ###### COUNTS #######
        # All rare variants
        # chr 1 is written out in two halves, so read the second-half table and give the output its own file name
        filtered_mt=hl.read_matrix_table(f'AC{AC_filter}orless_chr_{chr}_second_half.mt')
        all_var = (filtered_mt
           .group_rows_by(gene_id=filtered_mt.gene_id_worstCsq)
           .aggregate(
               n= hl.agg.filter(
                   (filtered_mt.GT.is_non_ref()),
                   hl.agg.sum(filtered_mt.GT.n_alt_alleles()))))
        print(f'AC<={AC_filter} counts for chr {chr} second half done...')
        all_var.n.export(f"chr_{chr}_second_half_AC{AC_filter}orless_gene_counts.tsv")
        print(f'Any variant class AC<={AC_filter} counts for chr {chr} second half written out as chr_{chr}_second_half_AC{AC_filter}orless_gene_counts.tsv')
        print('Dont forget to copy these tables up to your project before closing the session!!')

        # PTVs 
        filtered_mt=hl.read_matrix_table(f'AC{AC_filter}orless_chr_{chr}_second_half.mt')
        PTV_var = (filtered_mt
           .group_rows_by(gene_id=filtered_mt.gene_id_worstCsq)
           .aggregate(
               n= hl.agg.filter(
                   (filtered_mt.LoF_worstCsq==True)& 
                   (filtered_mt.GT.is_non_ref()),
                   hl.agg.sum(filtered_mt.GT.n_alt_alleles()))))
        print(f'PTV AC<={AC_filter} counts for chr {chr} second half done...')
        PTV_var.n.export(f"chr_{chr}_second_half_PTV_AC{AC_filter}orless_gene_counts.tsv")
        print(f'PTV AC<={AC_filter} counts for chr {chr} second half written out as chr_{chr}_second_half_PTV_AC{AC_filter}orless_gene_counts.tsv')
        print('Dont forget to copy these tables up to your project before closing the session!!')
    
        # Deleterious missense variants
        filtered_mt=hl.read_matrix_table(f'AC{AC_filter}orless_chr_{chr}_second_half.mt') # Read the checkpointed table back in, as this speeds up processing. checkpoint() returns the re-read table, but that only helps if its return value is assigned.
        Revel75_miss_var=(filtered_mt
             .group_rows_by(gene_id=filtered_mt.gene_id_worstCsq)
             .aggregate(
                 n = hl.agg.filter(
                     (filtered_mt.Miss_worstCsq == True)& 
                     (filtered_mt.REVEL_score> 0.75),
                     hl.agg.sum(filtered_mt.GT.n_alt_alleles()))))
        print(f'REVEL >0.75 missense AC<={AC_filter} counts for chr {chr} second half done...')
        Revel75_miss_var.n.export(f"chr_{chr}_second_half_REVEL75_Miss_AC{AC_filter}orless_gene_counts.tsv")
        print(f'REVEL >0.75 missense AC<={AC_filter} counts for chr {chr} second half written out as chr_{chr}_second_half_REVEL75_Miss_AC{AC_filter}orless_gene_counts.tsv')
        print('Dont forget to copy these tables up to your project before closing the session!!')
    
        # Count synonymous variants and write out
        filtered_mt=hl.read_matrix_table(f'AC{AC_filter}orless_chr_{chr}_second_half.mt')
        syn_var=(filtered_mt
             .group_rows_by(gene_id=filtered_mt.gene_id_worstCsq)
             .aggregate(
                 n = hl.agg.filter(
                     (filtered_mt.Syn_worstCsq == True),
                     hl.agg.sum(filtered_mt.GT.n_alt_alleles()))))
        print(f'Synonymous AC<={AC_filter} counts for chr {chr} second half done...')
        syn_var.n.export(f"chr_{chr}_second_half_synonymous_AC{AC_filter}orless_gene_counts.tsv")
        print(f'Synonymous AC<={AC_filter} counts for {chr} second half written out as chr_{chr}_second_half_synonymous_AC{AC_filter}orless_gene_counts.tsv')
        print('Dont forget to copy these tables up to your project before closing the session!!')
    
        print(f"Finished processing chromosome {chr} second half!")
        print(f"Finished processing chromosome {chr}!")

    else:
        print(f"Processing chromosome {chr}...")
    
        ######## PREP MT #########
        # Read the matrix table for the current chromosome
        mt=hl.read_matrix_table(f'dnax://database-Gq45XQjJ637Q9X6XJJJ3Pf7k/chr_{chr}_ready_for_counts.mt')
        print(f'AC filter set as <={AC_filter}')
        filtered_mt=mt.filter_rows(mt.variant_qc.AC[1]<= AC_filter)
        filtered_mt.checkpoint(f'AC{AC_filter}orless_chr_{chr}.mt', overwrite=True)


        ####### COUNTS ####### 
        # Any rare variant
        filtered_mt=hl.read_matrix_table(f'AC{AC_filter}orless_chr_{chr}.mt')
        all_var = (filtered_mt
           .group_rows_by(gene_id=filtered_mt.gene_id_worstCsq)
           .aggregate(
               n= hl.agg.filter(
                   (filtered_mt.GT.is_non_ref()),
                   hl.agg.sum(filtered_mt.GT.n_alt_alleles()))))
        print(f'AC<={AC_filter} counts for chr {chr} done...')
        all_var.n.export(f"chr_{chr}_AC{AC_filter}orless_gene_counts.tsv")
        print(f'Any variant class AC<={AC_filter} counts for chr {chr} written out as chr_{chr}_AC{AC_filter}orless_gene_counts.tsv')
        print('Dont forget to copy these tables up to your project before closing the session!!')
     
        # PTVs
        filtered_mt=hl.read_matrix_table(f'AC{AC_filter}orless_chr_{chr}.mt')
        PTV_var = (filtered_mt
           .group_rows_by(gene_id=filtered_mt.gene_id_worstCsq)
           .aggregate(
               n= hl.agg.filter(
                   (filtered_mt.LoF_worstCsq==True)& 
                   (filtered_mt.GT.is_non_ref()),
                   hl.agg.sum(filtered_mt.GT.n_alt_alleles()))))
        print(f'PTV AC<={AC_filter} counts for chr {chr} done...')
        PTV_var.n.export(f"chr_{chr}_PTV_AC{AC_filter}orless_gene_counts.tsv")
        print(f'PTV AC<={AC_filter} counts for chr {chr} written out as chr_{chr}_PTV_AC{AC_filter}orless_gene_counts.tsv')
        print('Dont forget to copy these tables up to your project before closing the session!!')
    
        # Count deleterious missense variants and write out
        filtered_mt=hl.read_matrix_table(f'AC{AC_filter}orless_chr_{chr}.mt') # Read the checkpointed table back in, as this speeds up processing. checkpoint() returns the re-read table, but that only helps if its return value is assigned.
        Revel75_miss_var=(filtered_mt
             .group_rows_by(gene_id=filtered_mt.gene_id_worstCsq)
             .aggregate(
                 n = hl.agg.filter(
                     (filtered_mt.Miss_worstCsq == True)& 
                     (filtered_mt.REVEL_score> 0.75),
                     hl.agg.sum(filtered_mt.GT.n_alt_alleles()))))
        print(f'REVEL >0.75 missense AC<={AC_filter} counts for chr {chr} done...')
        Revel75_miss_var.n.export(f"chr_{chr}_REVEL75_Miss_AC{AC_filter}orless_gene_counts.tsv")
        print(f'REVEL >0.75 missense AC<={AC_filter} counts for chr {chr} written out as chr_{chr}_REVEL75_Miss_AC{AC_filter}orless_gene_counts.tsv')
        print('Dont forget to copy these tables up to your project before closing the session!!')
    
        # Count synonymous variants and write out
        filtered_mt=hl.read_matrix_table(f'AC{AC_filter}orless_chr_{chr}.mt')
        syn_var=(filtered_mt
             .group_rows_by(gene_id=filtered_mt.gene_id_worstCsq)
             .aggregate(
                 n = hl.agg.filter(
                     (filtered_mt.Syn_worstCsq == True),
                     hl.agg.sum(filtered_mt.GT.n_alt_alleles()))))
        print(f'Synonymous AC<={AC_filter} counts for chr {chr} done...')
        syn_var.n.export(f"chr_{chr}_synonymous_AC{AC_filter}orless_gene_counts.tsv")
        print(f'Synonymous AC<={AC_filter} counts for {chr} written out as chr_{chr}_synonymous_AC{AC_filter}orless_gene_counts.tsv')
        print('Dont forget to copy these tables up to your project before closing the session!!')
    
        print(f"Finished processing chromosome {chr}!")
    

Processing chromosome 21...
AC filter set as <=5


2025-01-09 13:59:08.634 Hail: INFO: wrote matrix table with 158235 rows and 399877 columns in 388 partitions to AC5orless_chr_21.mt


AC<5 counts for chr 21 done...


2025-01-09 13:59:53.739 Hail: INFO: Ordering unsorted dataset with network shuffle
2025-01-09 14:01:07.818 Hail: INFO: merging 1 files totalling 186.9M...
2025-01-09 14:01:08.677 Hail: INFO: while writing:
    chr_21_PTV_AC5orless_gene_counts.tsv
  merge time: 858.288ms


PTV AC<=5 counts for chr 21 written out as chr_21_PTV_AC5orless_gene_counts.tsv
Dont forget to copy these tables up to your project before closing the session!!
REVEL >0.75 missense AC<=5 counts for chr 21 done...


2025-01-09 14:01:51.895 Hail: INFO: Ordering unsorted dataset with network shuffle
2025-01-09 14:03:02.679 Hail: INFO: merging 1 files totalling 186.9M...
2025-01-09 14:03:03.302 Hail: INFO: while writing:
    chr_21_REVEL75_Miss_AC5orless_gene_counts.tsv
  merge time: 622.617ms


REVEL >0.75 missense AC<=5 counts for chr 21 written out as chr_21_REVEL75_Miss_AC5orless_gene_counts.tsv
Dont forget to copy these tables up to your project before closing the session!!
REVEL 0.75-0.5 missense AC<=5 counts for chr 21 done...


2025-01-09 14:03:50.680 Hail: INFO: Ordering unsorted dataset with network shuffle
2025-01-09 14:05:06.344 Hail: INFO: merging 1 files totalling 186.9M...
2025-01-09 14:05:06.898 Hail: INFO: while writing:
    chr_21_REVEL75to50_Miss_AC5orless_gene_counts.tsv
  merge time: 553.660ms


REVEL 0.75-0.5 missense AC<=5 counts for chr 21 written out as chr_21_REVEL75to50_Miss_AC5orless_gene_counts.tsv
Dont forget to copy these tables up to your project before closing the session!!
REVEL <=0.5 missense AC<=5 counts for chr 21 done...


2025-01-09 14:05:55.930 Hail: INFO: Ordering unsorted dataset with network shuffle
2025-01-09 14:07:08.963 Hail: INFO: merging 1 files totalling 186.9M...
2025-01-09 14:07:09.634 Hail: INFO: while writing:
    chr_21_REVEL50orless_Miss_AC5orless_gene_counts.tsv
  merge time: 671.283ms


REVEL <=0.5 missense AC<=5 counts for chr 21 written out as chr_21_REVEL50orless_Miss_AC5orless_gene_counts.tsv
Dont forget to copy these tables up to your project before closing the session!!
Synonymous AC<=5 counts for chr 21 done...


2025-01-09 14:07:51.626 Hail: INFO: Ordering unsorted dataset with network shuffle
2025-01-09 14:09:01.414 Hail: INFO: merging 1 files totalling 186.9M...
2025-01-09 14:09:01.964 Hail: INFO: while writing:
    chr_21_synonymous_AC5orless_gene_counts.tsv
  merge time: 549.882ms


Synonymous AC<=5 counts for 21 written out as chr_21_synonymous_AC5orless_gene_counts.tsv
Dont forget to copy these tables up to your project before closing the session!!
Finished processing chromosome 21!
Processing chromosome 2...
AC filter set as <=5


2025-01-09 14:12:21.157 Hail: INFO: wrote matrix table with 1100907 rows and 399877 columns in 2209 partitions to AC5orless_chr_2.mt


AC<5 counts for chr 2 done...


2025-01-09 14:15:26.882 Hail: INFO: Ordering unsorted dataset with network shuffle
2025-01-09 14:20:50.048 Hail: INFO: merging 1 files totalling 1.1G...
2025-01-09 14:20:53.351 Hail: INFO: while writing:
    chr_2_PTV_AC5orless_gene_counts.tsv
  merge time: 3.303s


PTV AC<=5 counts for chr 2 written out as chr_2_PTV_AC5orless_gene_counts.tsv
Dont forget to copy these tables up to your project before closing the session!!
REVEL >0.75 missense AC<=5 counts for chr 2 done...


2025-01-09 14:24:10.568 Hail: INFO: Ordering unsorted dataset with network shuffle
2025-01-09 14:33:37.256 Hail: INFO: Ordering unsorted dataset with network shuffle
2025-01-09 14:39:22.479 Hail: INFO: merging 1 files totalling 1.1G...
2025-01-09 14:39:25.847 Hail: INFO: while writing:
    chr_2_REVEL75to50_Miss_AC5orless_gene_counts.tsv
  merge time: 3.368s


REVEL 0.75-0.5 missense AC<=5 counts for chr 2 written out as chr_2_REVEL75to50_Miss_AC5orless_gene_counts.tsv
Dont forget to copy these tables up to your project before closing the session!!
REVEL <=0.5 missense AC<=5 counts for chr 2 done...


2025-01-09 14:43:01.320 Hail: INFO: Ordering unsorted dataset with network shuffle
2025-01-09 14:48:51.073 Hail: INFO: merging 1 files totalling 1.1G...
2025-01-09 14:48:54.624 Hail: INFO: while writing:
    chr_2_REVEL50orless_Miss_AC5orless_gene_counts.tsv
  merge time: 3.551s


REVEL <=0.5 missense AC<=5 counts for chr 2 written out as chr_2_REVEL50orless_Miss_AC5orless_gene_counts.tsv
Dont forget to copy these tables up to your project before closing the session!!
Synonymous AC<=5 counts for chr 2 done...


2025-01-09 14:54:25.671 Hail: INFO: Ordering unsorted dataset with network shuffle
2025-01-09 14:59:52.461 Hail: INFO: merging 1 files totalling 1.1G...


Synonymous AC<=5 counts for 2 written out as chr_2_synonymous_AC5orless_gene_counts.tsv
Dont forget to copy these tables up to your project before closing the session!!
Finished processing chromosome 2!


2025-01-09 14:59:55.702 Hail: INFO: while writing:
    chr_2_synonymous_AC5orless_gene_counts.tsv
  merge time: 3.241s


## Copy all tables up to your project! 

In the terminal, run the following commands to copy the tables out of HDFS and upload them to your project, so they are saved before the session closes.

hdfs dfs -get *.tsv 

dx upload *.tsv 
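
Before uploading, it can help to check that a counts table exists for every chromosome you expected to process. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the output files follow the `chr_<N>_..._gene_counts.tsv` naming used by the export calls above.

```python
import re

# Matches names such as chr_21_PTV_AC5orless_gene_counts.tsv and captures the chromosome number.
COUNT_TABLE_PATTERN = re.compile(r"^chr_(\d+)_.*_gene_counts\.tsv$")


def chromosomes_with_counts(filenames):
    """Return the set of chromosome numbers found in count-table file names."""
    found = set()
    for name in filenames:
        match = COUNT_TABLE_PATTERN.match(name)
        if match:
            found.add(int(match.group(1)))
    return found


def missing_chromosomes(filenames, expected=range(1, 23)):
    """Return the expected chromosomes that have no count table, in sorted order."""
    return sorted(set(expected) - chromosomes_with_counts(filenames))


example_names = [
    "chr_21_PTV_AC5orless_gene_counts.tsv",
    "chr_2_synonymous_AC5orless_gene_counts.tsv",
    "notes.txt",
]
print(missing_chromosomes(example_names, expected=[2, 21, 22]))
```

In practice you could pass in the file names from the directory you downloaded the tables to, for example with `os.listdir(".")`. Note that this only checks that at least one table exists per chromosome, not that every variant class was written.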