# 5. Applying QC cut-offs, annotating low quality variants and removing variants in LCRs

For this stage, I used a mem2_ssd1_v2_x8 instance, using either 10 nodes (for smaller chromosomes) or 25 nodes (for larger chromosomes). I ran half of the chromosomes in one instance and the other half in a second instance. The total cost of running this stage was £33.

The QC'd matrix tables are saved using DNAX. These tables total 318 GB.

## Set up environment
Make sure you run this block only once. You'll get errors if you try to initialise Hail multiple times. If that happens, restart the kernel and then initialise Hail only once.

In [None]:
# Initialise Hail and Spark. Running this cell will output a red-colored message; this is expected.
# The 'Welcome to Hail' message in the output will indicate that Hail is ready to use in the notebook.
import pyspark.sql

config = pyspark.SparkConf().setAll([('spark.kryoserializer.buffer.max', '128')])
sc = pyspark.SparkContext(conf=config) 

from pyspark.sql import SparkSession

import hail as hl
builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=sc)

import dxpy

The code below runs through several stages, with headings in the code matching the headings here:
## Read in matrix table
Here, you read in the matrix table which was saved to a DNAX database in step 1. First, set the chromosomes for which you are running this script.

## Applying variant QC cut-offs and annotations
### Filtering out very low quality variants, and annotating fairly low quality variants
In script 2 you should have plotted variant QC metrics and, from these plots, defined the filters you wish to apply. Here we are removing variants:
- With a mean GQ across samples <= 30
- With a call rate <= 0.7
- Which lie within low complexity regions.

And we are annotating variants as low quality if they have:
- Mean GQ across samples <= 40
- Call rate <= 0.9
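
The two tiers of thresholds above can be summarised in plain Python. This is only an illustrative sketch, not part of the Hail pipeline: `classify_variant` is a hypothetical helper, and it takes the mean GQ and call rate as ordinary floats rather than Hail expressions.

```python
# Hypothetical helper illustrating the two-tier thresholds described above.
# The cut-offs (30 / 0.7 for removal, 40 / 0.9 for flagging) are the ones
# listed in this section.

def classify_variant(mean_gq, call_rate):
    """Return 'removed', 'low_quality' or 'high_quality' for one variant."""
    # Hard filters: a variant failing either one is dropped entirely.
    if mean_gq <= 30 or call_rate <= 0.7:
        return "removed"
    # Soft filters: a variant failing either one is kept but flagged.
    if mean_gq <= 40 or call_rate <= 0.9:
        return "low_quality"
    return "high_quality"


# Made-up (mean GQ, call rate) pairs, one per outcome.
examples = [(25.0, 0.95), (35.0, 0.95), (45.0, 0.85), (45.0, 0.95)]
for mean_gq, call_rate in examples:
    print(mean_gq, call_rate, classify_variant(mean_gq, call_rate))
```

In the notebook itself the soft filters are stored as two separate boolean row fields, so a downstream analysis can apply either threshold independently.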

### Removing variants in LCRs
Next, remove any variants lying within LCRs (low complexity regions) of the genome. 
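
To make the idea of interval-based removal concrete, here is a minimal sketch in plain Python. The intervals and positions are made up, `in_lcr` is a hypothetical helper, and treating the interval end as exclusive is an assumption of this sketch; the boundary convention Hail applies when importing the intervals file may differ.

```python
import bisect

# Made-up LCR intervals on a single chromosome as (start, end) pairs.
# They are sorted and non-overlapping; 'end' is exclusive in this sketch.
lcr_intervals = [(100, 200), (500, 650), (1000, 1010)]
starts = [start for start, _ in lcr_intervals]


def in_lcr(position):
    """Return True if the position falls inside any LCR interval."""
    # Locate the last interval whose start is <= position.
    i = bisect.bisect_right(starts, position) - 1
    if i < 0:
        return False
    start, end = lcr_intervals[i]
    return start <= position < end


# Made-up variant positions; keep only those outside every interval.
variants = [50, 150, 200, 640, 1005, 2000]
kept = [pos for pos in variants if not in_lcr(pos)]
print(kept)
```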

## Filtering on sample QC metrics 
Remove samples with a low mean call rate (0.8 or below).
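
Conceptually, this step keeps a sample only if its ID appears in a list of IDs that passed sample QC. A rough sketch in plain Python, using made-up sample IDs (the real IDs come from a TSV file in the project):

```python
# Made-up sample IDs for illustration only.
ids_passing_qc = {"1000012", "1000035", "1000058"}
matrix_table_samples = ["1000012", "1000020", "1000035", "1000047", "1000058"]

# Keep a sample only if its ID is in the passing set, preserving column order.
kept_samples = [s for s in matrix_table_samples if s in ids_passing_qc]
removed_samples = [s for s in matrix_table_samples if s not in ids_passing_qc]

print(f"kept {len(kept_samples)}, removed {len(removed_samples)}")
```

Note that IDs are compared as strings here; if the IDs in the two sources had different types or stray quote characters, no sample would match, so it is worth checking the count of retained samples.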

## Saving out a QC'd matrix table 
Save out the post-variant and sample QC matrix table using DNAX.
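
The DNAX location of each saved table combines the database ID with a per-chromosome table name. A small sketch of how such a URL is assembled, where `dnax_url` is a hypothetical helper and `database-XXXX` is a placeholder rather than a real database ID:

```python
def dnax_url(database_id, chromosome):
    """Build the dnax:// URL for one chromosome's QC'd matrix table."""
    mt_name = f"chromosome_{chromosome}_post_geno_sample_and_var_qc.mt"
    return f"dnax://{database_id}/{mt_name}"


# 'database-XXXX' is a placeholder, not a real database ID.
print(dnax_url("database-XXXX", 21))
```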

In [None]:
# Define the chromosomes you're working with (smaller half if working with fewer nodes, or larger half if working with more nodes)
# Note: range objects cannot be joined with + in Python 3, so each range is converted to a list first
#chromosomes = [4] + list(range(8,11)) + list(range(13,16)) + [18] + list(range(20,24)) # For smaller chromosomes
chromosomes = list(range(1,4)) + list(range(5,8)) + list(range(11,13)) + [17] + [19] # For larger chromosomes

for chr in chromosomes:
    print(f'Processing chr {chr}...')



    ######### Read in matrix table  ##########
    print(f'Reading in MT for chr {chr}...')
    mt=hl.read_matrix_table(f"dnax://database-GgPbpq8J637bkp84VQyQ83X9/chromosome_{chr}_post_genoqc_final.mt")
    # Check this table is as you'd expect
    print(f'Read in post-genotype QC table from stage 1 for chr {chr}, sized {mt.count()}')
    print(f'This table is formed of {mt.n_partitions()} partitions')
    print('The structure of this MT is:')
    mt.describe()  # describe() prints the schema itself and returns None
    print(f'Finished reading in MT for chr {chr}!')



    ####### Applying variant QC cut-offs and annotations ########
    #### Filtering out very low quality variants, and annotating fairly low quality variants 
    print(f'Applying variant-level QC cut-offs for chr {chr}...')
    # Filter out the lowest quality variants
    print(f'n original variants in chromosome {chr}: {mt.count_rows()} following script 1')
    mt=mt.filter_rows(mt.variant_qc.gq_stats.mean>30)
    print(f'n variants in chromosome {chr} following removal of those with mean GQ <=30: {mt.count_rows()}')
    mt=mt.filter_rows(mt.variant_qc.call_rate>0.7)
    print(f'n variants in chromosome {chr} following removal of those with a call rate <=0.7: {mt.count_rows()}')
    # Checkpoint here to speed up later stages of this script
    # (reassign mt so that later steps read from the checkpointed table)
    mt = mt.checkpoint('post_lq_vars.mt', overwrite=True)
    # Annotate variants which are fairly low quality so these can be removed at a later point. 
    mt = mt.annotate_rows(variant_gq_over_40 = mt.variant_qc.gq_stats.mean >40)
    variant_gq_over_40_count = mt.aggregate_rows(hl.agg.count_where(mt.variant_gq_over_40))
    print(f'n variants in chromosome {chr} with variant GQ > 40: {variant_gq_over_40_count}')
    mt = mt.annotate_rows(variant_call_rate_over_90_percent = mt.variant_qc.call_rate > 0.9)
    variant_call_rate_over_90_percent_count = mt.aggregate_rows(hl.agg.count_where(mt.variant_call_rate_over_90_percent))
    print(f'n variants in chromosome {chr} with call rate > 0.9: {variant_call_rate_over_90_percent_count}')
    HQ_var_count=mt.aggregate_rows(hl.agg.count_where(mt.variant_call_rate_over_90_percent & mt.variant_gq_over_40))
    print(f'n high quality variants in chromosome {chr}: {HQ_var_count}')

    #### Removing variants in LCRs
    lcr_url = "file:///mnt/project/LCR_intervals.tsv"
    intervals=hl.import_locus_intervals(lcr_url, reference_genome='GRCh38', skip_invalid_intervals=True)
    mt=mt.annotate_rows(LCR=hl.is_defined(intervals[mt.locus]))
    mt=mt.filter_rows(~mt.LCR)
    print(f'n variants in chromosome {chr} following removal of those in LCRs: {mt.count_rows()}')
    # Checkpoint again to speed up later stages of the script
    # (reassign mt so that later steps read from the checkpointed table)
    mt = mt.checkpoint('post_lcr_rm.mt', overwrite=True)
    print(f'Finished variant-level QC filtering for chr {chr}!')


    ###### Filtering on sample QC metrics #######
    print(f'Applying sample-level QC cut-offs for chr {chr}...')
    print(f'Samples before sample QC applied: {mt.count_cols()}')
    ids_passing_QC = "file:///mnt/project/WES_QC/ids_with_call_rate_over_80_percent.tsv"
    samples_to_keep=hl.import_table(ids_passing_QC)
    # The ID column header in this file is the literal string "x" (including the quotes), so rename it to s
    samples_to_keep=samples_to_keep.rename({'"x"':"s"})
    samples_to_keep=samples_to_keep.key_by(samples_to_keep.s)
    mt = mt.filter_cols(hl.is_defined(samples_to_keep[mt.s]))
    print(f'Samples remaining following removal of those with call rate <= 0.8: {mt.count_cols()}')
    print(f'Finished sample-level QC filtering for chr {chr}!')

    ###### Saving out a QC'd matrix table 
    print(f'Saving out QCd MT for chr {chr}...')
    # Define database and MT names
    # Note: It is recommended to only use lowercase letters for the database name.
    # # If uppercase lettering is used, the database name will be lowercased when creating the database.
    db_name = f"ukbb_test_wes_matrix_tables_post_qc_filtering"
    mt_name = f"chromosome_{chr}_post_geno_sample_and_var_qc.mt"
    # Create database in DNAX
    stmt = f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"
    print(stmt)
    spark.sql(stmt).show()
    # Store MT in DNAX
    # # Find database ID of newly created database using dxpy method
    db_uri = dxpy.find_one_data_object(name=f"{db_name}", classname="database")['id']
    url = f"dnax://{db_uri}/{mt_name}" # Note: the dnax url must follow this format to properly save MT to DNAX
    
    # Before this step, the Hail MatrixTable is just an object in memory. To persist it and be able to access it later, the notebook needs to write it into a persistent filesystem (in this case DNAX).
    mt.write(url) # Note: output should describe size of MT (i.e. number of rows, columns, partitions) 
    print(f'QCd MT for chr {chr} saved! Now checking file can be read in ... ')
    
    
    ###### Check your mt can be read in for later scripts
    # Read back from the same URL that was just written, rather than a hard-coded database ID
    b=hl.read_matrix_table(url)
    # Check count is as you'd expect
    print(f'Count for read in matrix table is {b.count()} - check this matches what you expect!')
    # Check the additional fields you added are present
    print('The structure of the post-QC MT is:')
    b.describe()  # describe() prints the schema itself and returns None
    
    
    print(f'This stage of processing for chr {chr} complete!')