# 7. Sex imputation
For this stage, I used a mem1_ssd1_v2_x8 instance with 2 nodes. The run took 30 minutes in total and cost £0.20.

## Set up environment
Run this block only once. Initialising Hail more than once in the same session raises errors. If that happens, restart the kernel and then run this block a single time.

In [1]:
# Initialise Hail and Spark. Running this cell prints a red-colored message; this is expected.
# The 'Welcome to Hail' message in the output indicates that Hail is ready to use in the notebook.
import pyspark.sql

config = pyspark.SparkConf().setAll([('spark.kryoserializer.buffer.max', '128')])
sc = pyspark.SparkContext(conf=config) 

from pyspark.sql import SparkSession

import hail as hl
builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=sc)

import dxpy

pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/backend/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/backend/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 3.2.3
SparkUI available at http://ip-10-60-68-12.eu-west-2.compute.internal:8081
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.116-cd64e0876c94
LOGGING: writing to /opt/notebooks/hail-20240419-0851-0.2.116-cd64e0876c94.log
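
The "run only once" caveat above comes from two calls: creating a second SparkContext and calling hl.init a second time. A minimal sketch of a re-runnable variant is shown here. It assumes that `SparkContext.getOrCreate` and the `idempotent` argument of `hl.init` behave as documented in the installed versions; `init_hail_once` is a hypothetical helper name and is not part of this pipeline.

```python
def init_hail_once():
    """Return the Hail module, initialising it only if that has not happened yet."""
    # Imports live inside the function so that defining it has no side effects.
    import pyspark
    import hail as hl

    config = pyspark.SparkConf().setAll(
        [("spark.kryoserializer.buffer.max", "128")]
    )
    # getOrCreate reuses a live SparkContext instead of raising an error.
    sc = pyspark.SparkContext.getOrCreate(conf=config)
    # idempotent=True turns a repeated hl.init call into a no-op.
    hl.init(sc=sc, idempotent=True)
    return hl
```

Calling `hl = init_hail_once()` in place of the setup cell should then tolerate accidental re-execution, although restarting the kernel remains the safest way to recover from a broken session.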


## Reading in X chromosome
Here, you read in the matrix table for the X chromosome and check it is the size you'd expect. 

In [2]:
chr=23
# Read in the post-QC matrix table for the X chromosome (stored as chromosome 23)
mt=hl.read_matrix_table(f"dnax://database-Ggy1X3QJ637qBKGypjy9y9f4/chromosome_{chr}_post_geno_sample_and_var_qc.mt")
# Check this table is as you'd expect
print(mt.count())
print(mt.n_partitions())
mt.describe()

(425743, 469151)
2000
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'sample_qc': struct {
        dp_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        gq_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        call_rate: float64, 
        n_called: int64, 
        n_not_called: int64, 
        n_filtered: int64, 
        n_hom_ref: int64, 
        n_het: int64, 
        n_hom_var: int64, 
        n_non_ref: int64, 
        n_singleton: int64, 
        n_snp: int64, 
        n_insertion: int64, 
        n_deletion: int64, 
        n_transition: int64, 
        n_transversion: int64, 
        n_star: int64, 
        r_ti_tv: float64, 
        r_het_hom_var: float64, 
        r_insertion_deletion: float64
    }
------

## Filter to high-quality, non-rare variants
Keep variants with a call rate of at least 0.97, then remove rare variants by keeping only those with an alternate allele frequency strictly between 0.01 and 0.99.

In [3]:
mt=mt.filter_rows(mt.variant_qc.call_rate>=0.97)
print(mt.count()) # 297719 variants remain following call rate filtering
mt=mt.filter_rows((mt.variant_qc.AF[1]>.01) & (mt.variant_qc.AF[1]<.99))
print(mt.count()) # 677 variants remain following AF filtering.

(297719, 469151)
(677, 469151)
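
To make the two filters concrete, here is a small plain-Python sketch of the same logic applied to one variant at a time. It assumes diploid calls encoded as alternate allele counts (0, 1, 2) with None for a missing call; the helper names and the toy genotype lists are illustrative only and do not come from the dataset.

```python
def variant_stats(genotypes):
    """Return (call_rate, alt_allele_frequency) for one variant.

    genotypes: list of alternate allele counts (0, 1 or 2), None if uncalled.
    """
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    # Each called diploid genotype contributes two alleles.
    alt_af = sum(called) / (2 * len(called)) if called else None
    return call_rate, alt_af


def keep_variant(genotypes, min_call_rate=0.97, min_af=0.01, max_af=0.99):
    """Apply the call rate filter, then the allele frequency filter."""
    call_rate, alt_af = variant_stats(genotypes)
    if alt_af is None:
        return False
    return call_rate >= min_call_rate and min_af < alt_af < max_af


# Toy variants (synthetic data)
common = [0] * 90 + [1] * 8 + [2] * 2          # fully called, AF = 12/200
rare = [0] * 199 + [1]                         # fully called, AF = 1/400
poorly_called = [None] * 10 + [0] * 80 + [1] * 10  # call rate = 0.9

for name, gts in [("common", common), ("rare", rare), ("poorly_called", poorly_called)]:
    print(name, variant_stats(gts), keep_variant(gts))
```

Only the `common` variant passes both filters: `rare` fails the allele frequency bound and `poorly_called` fails the call rate bound.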


## Impute sex 
Now use Hail's impute_sex function to impute sex from these non-rare, high call-rate variants. Then export the results to a .tsv file and upload that file to the project folder.

In [4]:
imputed=hl.impute_sex(mt.GT)
imputed.export('imputed_sex.tsv')

2024-04-19 09:10:05.831 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
2024-04-19 09:18:53.580 Hail: INFO: Coerced sorted dataset
2024-04-19 09:18:54.696 Hail: INFO: merging 17 files totalling 19.6M...
2024-04-19 09:18:54.779 Hail: INFO: while writing:
    imputed_sex.tsv
  merge time: 82.777ms
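
For intuition about what the exported f_stat column represents, the sketch below computes an X-chromosome inbreeding coefficient for a single sample in plain Python. It is a simplified illustration, not Hail's implementation: it assumes the Hardy-Weinberg expectation 1 - 2p(1 - p) for homozygosity at each variant, and Hail's own calculation may differ in detail. The 0.2 and 0.8 cut-offs mirror the documented defaults of hl.impute_sex; the function names and toy genotypes are hypothetical.

```python
def x_f_stat(alt_counts, alt_freqs):
    """Inbreeding coefficient F for one sample across X-chromosome variants.

    alt_counts: alternate allele count per variant (0, 1, 2), None if uncalled.
    alt_freqs:  population alternate allele frequency per variant.
    """
    n_called = 0
    observed_homs = 0
    expected_homs = 0.0
    for count, p in zip(alt_counts, alt_freqs):
        if count is None:
            continue  # uncalled genotypes do not contribute
        n_called += 1
        if count != 1:
            observed_homs += 1  # 0 and 2 are homozygous calls
        expected_homs += 1 - 2 * p * (1 - p)
    return (observed_homs - expected_homs) / (n_called - expected_homs)


def classify_sex(f_stat, female_threshold=0.2, male_threshold=0.8):
    """Return 'female', 'male', or None when F falls between the thresholds."""
    if f_stat < female_threshold:
        return "female"
    if f_stat > male_threshold:
        return "male"
    return None


freqs = [0.5] * 10                   # synthetic allele frequencies
no_hets = [0, 2] * 5                 # no heterozygous calls
many_hets = [1] * 5 + [0] * 5        # half the calls are heterozygous
some_hets = [1] * 2 + [0] * 8        # a few heterozygous calls

for calls in (no_hets, many_hets, some_hets):
    f = x_f_stat(calls, freqs)
    print(round(f, 3), classify_sex(f))
```

A sample with one X chromosome cannot be truly heterozygous there, so F is close to 1, while a sample with two X chromosomes shows heterozygosity near the population expectation and F is close to 0. Samples between the thresholds are left unclassified.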


In [None]:
%%bash
hdfs dfs -get imputed_sex.tsv
dx upload imputed_sex.tsv --destination ./WES_QC/

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/cluster/hadoop/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/cluster/dnax/jars/dnanexus-api-0.1.0-SNAPSHOT-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
2024-04-19 09:21:24,633 WARN metrics.MetricsReporter: Unable to initialize metrics scraping configurations from hive-site.xml. Message:InputStream cannot be null
2024-04-19 09:21:24,730 WARN service.DNAxApiSvc: Using default configurations. Unable to find dnanexus.conf.location=null
2024-04-19 09:21:24,730 INFO service.DNAxApiSvc: apiserver connection-pool config. MaxPoolSize=10, MaxPoolPerRoute=10,MaxWaitTimeout=60000
2024-04-19 09:21:24,730 INFO service.DNAxApiSvc: initializing http connection man

ID                          file-GjV3P5QJ637vYkXPKyF6qp3Q
Class                       file
Project                     project-Gfj1VXjJ637kVQFzkY7xyQz2
Folder                      /
Name                        imputed_sex.tsv
State                       closing
Visibility                  visible
Types                       -
Properties                  -
Tags                        -
Outgoing links              -
Created                     Fri Apr 19 09:21:26 2024
Created by                  efenner
 via the job                job-GjV2gyjJ637Qk9BGXJjv8PPv
Last modified               Fri Apr 19 09:21:27 2024
Media type                  
archivalState               "live"
cloudAccount                "cloudaccount-dnanexus"


# Y chromosome QC 

Filter to high-quality variants on the Y chromosome (non-PAR variants with a mean depth of at least 3.5), then write out sample QC metrics based on these variants for use in inferring genetic sex.

In [None]:
# Read in the original Y chromosome pVCF (we want pre-genotype-QC values for this)
chr=24
file_url = "file:///mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/*_cY_*.vcf.gz"
# Import the VCF shards and write them out as a matrix table (write() returns None, so nothing is assigned)
hl.import_vcf(file_url,
              force_bgz=True,
              reference_genome='GRCh38',
              array_elements_required=False).write(f"./chr_{chr}_initial_mt.mt", overwrite=True)
mt=hl.read_matrix_table(f"./chr_{chr}_initial_mt.mt")
print(f"Num partitions: {mt.n_partitions()}")
# Check this table is as you'd expect
print(mt.count())
mt.describe()

# Annotate rows with whether they are in the Y chromosome PAR regions
mt = mt.annotate_rows(in_par=mt.locus.in_autosome_or_par()) 
# Look at this field 
mt.in_par.show(10)
mt.in_par.summarize()
# Count vars in par/non-par regions 
par_counts = mt.aggregate_rows(hl.struct(
    in_par=hl.agg.count_where(mt.in_par),
    not_in_par=hl.agg.count_where(~mt.in_par)
))
print(f"Variants in PAR region: {par_counts.in_par}")
print(f"Variants outside PAR region: {par_counts.not_in_par}")
# No variants fall in the PAR regions here; the variant caller appears to have called only the non-PAR regions of the Y chromosome.

# Filter on depth
mt = hl.variant_qc(mt)
print(mt.count())
mt=mt.filter_rows(mt.variant_qc.dp_stats.mean>=3.5)
print(mt.count()) # 8,737 variants remain

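# Compute per-sample QC metrics on the remaining Y chromosome variants
mt = hl.sample_qc(mt)
sample_qc=mt.cols()
sample_qc.describe()
# Note: export() writes the nested sample_qc struct as JSON in a single column;
# call sample_qc.flatten() first if you want one column per metric.
sample_qc.export(f"chr_{chr}sample_qc_for_sex_inference.csv", delimiter=",")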
print('Sample QC table written')
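
The PAR annotation in the cell above uses in_autosome_or_par(), which on the Y chromosome reduces to "is this locus in a pseudoautosomal region". The plain-Python sketch below shows the equivalent interval check. The coordinates are the commonly published GRCh38 chrY PAR boundaries (1-based, inclusive) and should be confirmed against the reference you are using; `in_y_par` here is a hypothetical helper, not a Hail function.

```python
# GRCh38 chrY pseudoautosomal regions, 1-based inclusive.
# Assumption: commonly published boundaries; verify against your reference.
CHRY_PAR_GRCH38 = [
    (10_001, 2_781_479),       # PAR1
    (56_887_903, 57_217_415),  # PAR2
]


def in_y_par(position, par_intervals=CHRY_PAR_GRCH38):
    """Return True if a chrY position lies inside any PAR interval."""
    return any(start <= position <= end for start, end in par_intervals)


for pos in (10_001, 2_781_480, 20_000_000, 57_000_000):
    print(pos, in_y_par(pos))
```

Variants for which this check is False are the male-specific ones that are informative for sex inference, which matches the observation that every variant in this callset is outside the PAR.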

In [None]:
%%bash
hdfs dfs -get ./*sample_qc_for_sex_inference.csv ./

In [None]:
%%bash
# Upload the sample QC files to the WES_QC directory within your project
dx upload ./*sample_qc_for_sex_inference.csv --destination ./WES_QC/ 
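
The two exported files are meant to be combined when inferring genetic sex. The sketch below shows one possible way to combine an X-chromosome F statistic with a Y-chromosome call rate for a single sample. It is an illustration under stated assumptions, not the rule used in this pipeline: the F cut-offs mirror the hl.impute_sex defaults, while `y_call_rate_min=0.5` is a hypothetical placeholder that would need to be chosen by inspecting the real distributions.

```python
def infer_sex(f_stat, y_call_rate,
              female_f_max=0.2, male_f_min=0.8, y_call_rate_min=0.5):
    """Combine X and Y evidence into 'male', 'female' or 'ambiguous'.

    y_call_rate_min is a hypothetical placeholder cut-off.
    """
    # Evidence from X-chromosome heterozygosity
    if f_stat < female_f_max:
        x_call = "female"
    elif f_stat > male_f_min:
        x_call = "male"
    else:
        x_call = None

    # Evidence from Y-chromosome coverage
    y_call = "male" if y_call_rate >= y_call_rate_min else "female"

    # Only report a sex when both sources agree
    return x_call if x_call == y_call else "ambiguous"


# Synthetic examples: (f_stat, y_call_rate)
for f, y in [(0.95, 0.90), (0.05, 0.02), (0.95, 0.02), (0.50, 0.90)]:
    print(f, y, infer_sex(f, y))
```

Requiring agreement between the two signals is a conservative design choice: discordant samples are flagged for manual review rather than being assigned a sex.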