### This notebook focus on generating a manta-SVimmer-Graphtyper matrix table release from Manta-SVimmer-GT2 VCFrelease 1.4 Validataion Manta dataset
- contains "AGGREGATE", "BERAKPOINT", ... SVMODEL entries. We only carry fwd "INFO/SVMODEL=AGGREGATE" entries
- contains "PASS", "{fail}" FILTER entries We only carry fwd "FILTER=PASS" entries
- contains INS, DEL and DUP SVTYPE entries. We only carry fwd "SVTYOE={INS, DEL}" entries
- contains samples that arew not in the "discovery" set
- contains monomorphic entries. We only carry fwd entries wit at least one hom-ref and drop monomorphic entries
- contains SV with lenght < 50bp & > 10,000,000bp (10Mb). We only carry fwd entries with INFO/SVSIZE > 50bp or INFO/SVSIZE < 10,000,000bp
- contains SV outside of our predefied whieloist region (ie not low-cpmplexity, telemore, centromere, ...) 

In [1]:
%%configure -f
{"driverMemory": "6000M"}

In [2]:
import hail as hl
hl.init(sc)

Starting Spark application


FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

SparkSession available as 'spark'.


FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jarRunning on Apache Spark version 3.1.2-amzn-0
SparkUI available at
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.80-4ccfae1ff293
LOGGING: writing to

In [3]:
## list of validatoin 15x samples  ... run once

## load release 1.3 to extract the samples
release_13_mt_uri = "SG10K-SV-Release-1.3-Validation15x_final.mt"
release_13_sample_txt_uri = "SG10K-SV-Release-1.3_15xValidation.samples.txt"

mt = hl.read_matrix_table(release_13_mt_uri)
#mt.cols().s.describe()
#mt.cols().s.show()
mt.cols().s.export(release_13_sample_txt_uri)

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

2024-05-06 05:42:40 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
2024-05-06 05:42:44 Hail: INFO: Coerced sorted dataset
2024-05-06 05:42:46 Hail: INFO: merging 16 files totalling 11.4K...
2024-05-06 05:42:47 Hail: INFO: while writing:
    SG10K-SV-Release-1.3_15xValidation.samples.txt
  merge time: 372.385ms

In [4]:
## list all resources  used in this notebook 

release14_manta_svimmer_gt2_vcf_uri = "SG10K_SV_MantaSVimmerGraphtyper.n1523.15xvalidation.mergevcf.vcf.gz"
release14_sample_txt_uri = "SG10K-SV-Release-1.3_15xValidation.samples.txt"
release14_sample_metadata_uri = "2021_06_18_supplier_metadata.n10714_replacespace.txt"
whiltelist_region_bed_uri = "resources_broad_hg38_v0_wgs_calling_regions.hg38.merged.autosome_only-minus_excl_regions.bed"



FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [5]:
## load the entire release14_manta_svimmer_gt2_vcf

mt = hl.import_vcf(release14_manta_svimmer_gt2_vcf_uri,
                   reference_genome="GRCh38",
                   force_bgz=True)

mt.describe()
print("Samples: %d; Variants: %d; Entries: %d" % (mt.count_cols(), mt.count_rows(), mt.entries().count()))

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        ABHet: float64, 
        ABHom: float64, 
        ABHetMulti: array<float64>, 
        ABHomMulti: array<float64>, 
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        CR: int32, 
        END: int32, 
        HOMSEQ: array<str>, 
        INV3: bool, 
        INV5: bool, 
        LEFT_SVINSSEQ: array<str>, 
        LOGF: float64, 
        MaxAAS: array<int32>, 
        MaxAASR: array<float64>, 
        MaxAltPP: int32, 
        MQ: int32, 
        MQsquared: int32, 
        NCLUSTERS: int32, 
        NGT: array<int32>, 
        NHet: int32, 
        NHomRef: int32, 
        NHomAlt: int32, 
        NUM_MERGED_SVS: int32, 
        OL

In [6]:
##
## filter out relevant samples
##

sample_ht = hl.import_table(release14_sample_txt_uri).key_by('s')

print(sample_ht.count()) ## 
print(mt.count()) ## 

mt = mt.filter_cols(hl.is_defined(sample_ht[mt.col_key]))

print(mt.count()) ## 

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

1523
(423803, 1523)
(423803, 1523)
2024-05-06 06:28:57 Hail: INFO: Reading table without type imputation
  Loading field 's' as type str (not specified)
2024-05-06 06:29:29 Hail: INFO: Coerced sorted dataset
2024-05-06 06:29:55 Hail: INFO: Coerced sorted dataset

In [7]:
##
## filter relevant variant
##

## 1- contains "AGGREGATE", "BERAKPOINT", ... SVMODEL entries. We only carry fwd "INFO/SVMODEL=AGGREGATE" entries 
## 2- contains "PASS", "{fail}" FILTER entries We only carry fwd "FILTER=PASS" entries 
## 3- contains INS, DEL and DUP SVTYPE entries. We only carry fwd "SVTYOE={INS, DEL}" entries 
## 5- contains monomorphic entries. We only carry fwd entries wit at least one hom-ref and drop monomorphic entries
## 6- contains SV with lenght < 50bp & > 10,000,000bp (10Mb).  We only carry fwd entries with INFO/SVSIZE > 50bp or INFO/SVSIZE < 10,000,000bp 
## 7- contains SV outside of our predefind whitlist region (ie not low-cpmplexity, telemore, centromere, ...) 
# print(mt.count()) 

##  because we want to only carry fwd entries wit at least one hom-ref and drop monomorphic entries
##   and there is a need to put all genotypes that fail FORMAT/FT to fORMAT/GT = `./.`
##   we first update GT from which vairant where least one hom-ref and not monomorphic entries
mt = mt.annotate_entries(
    GT = hl.case()
            .when((mt.FT == "PASS"), mt.GT)
            .default( hl.null(hl.tcall) ))

## load the predefine whitlist region (ie not low-cpmplexity, telemore, centromere, ...) 
whitelist_region= hl.import_bed(whiltelist_region_bed_uri, reference_genome='GRCh38')

## filter relevant variant
mt = mt.filter_rows(
    True
    & (mt.info.SVMODEL == "AGGREGATED")                         ## We only carry fwd "INFO/SVMODEL=AGGREGATE" entries 
    & (mt.filters.length() == 0)                                ## We only carry fwd "FILTER=PASS" entries 
    & ((mt.info.SVTYPE == "INS") | (mt.info.SVTYPE == "DEL"))   ## We only carry fwd "SVTYOE={INS, DEL}" entries 
    & (hl.agg.any(mt.GT.is_hom_ref()))                          ## We only carry fwd entries with at least one hom-ref
    & (hl.if_else(hl.agg.any(hl.is_missing(mt.GT)),             ## We only carry fwd polumorphic entries 
                  hl.agg.counter(mt.GT).size() > 2,             ##       that is GT contain NA + at least 2 of 0/0, 0/1, 1/1  
                  hl.agg.counter(mt.GT).size() > 1 ))           ##       that is GT contain at least 2 of 0/0, 0/1, 1/1  
    & ((~hl.is_defined(mt.info.SVSIZE))                         ## We only carry fwd entries with INFO/SVSIZE undefined 
       | (mt.info.SVSIZE >= 50)                                 ##                            or  INFO/SVSIZE > 50bp
       | (mt.info.SVSIZE <= 10000000))                          ##                            or INFO/SVSIZE < 10,000,000bp 
    & (hl.is_defined(whitelist_region[mt.locus]))               ## We only carry fwd  whitelist region contained SV 
    , keep=True)

print(mt.count()) 



FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

(22446, 1523)
2024-05-06 06:31:37 Hail: INFO: Reading table without type imputation
  Loading field 'f0' as type str (user-supplied)
  Loading field 'f1' as type int32 (user-supplied)
  Loading field 'f2' as type int32 (user-supplied)
2024-05-06 06:32:04 Hail: INFO: Coerced sorted dataset
2024-05-06 06:32:05 Hail: INFO: Coerced sorted dataset

In [8]:
## Enventual additional GT2 recommended filtering (optional)
mt = mt.annotate_rows( info = mt.info.annotate(
    PASS_GT2_filter = hl.case()
            .when((  (mt.info.SVTYPE == "DEL")
                   & ( (mt.info.ABHet > 0.30) | (mt.info.ABHet < 0) ) 
                   & ( (mt.info.AC[0] / mt.info.NUM_MERGED_SVS) < 25 ) 
                   & (mt.info.PASS_AC[0] > 0)
                   & (mt.info.PASS_ratio > 0.1)
                   & (mt.info.QD > 12) 
                  ), "PASS")
            .when((  (mt.info.SVTYPE == "INS")
                   & ( (mt.info.ABHet > 0.25) | (mt.info.ABHet < 0) ) 
                   & ( (mt.info.AC[0] / mt.info.NUM_MERGED_SVS) < 25 ) 
                   & (mt.info.PASS_AC[0] > 0)
                   & (mt.info.PASS_ratio > 0.1) 
                   & (mt.info.MaxAAS[0] > 4) 
                  ), "PASS")
            .default("FAIL")            
    )
)


svpass_stats = mt.aggregate_rows(hl.struct( sv_stat = hl.agg.counter(mt.info.PASS_GT2_filter)))
print(svpass_stats.sv_stat)


svtypepass_stats = mt.group_rows_by(mt.info.SVTYPE).aggregate(sv_stat = hl.agg.counter(mt.info.PASS_GT2_filter))
svtypepass_stats.show()


FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

frozendict({'FAIL': 11155, 'PASS': 11291})
+--------+---------------------------+---------------------------+
| SVTYPE | 'WHH2410'.sv_stat         | 'WHH2381'.sv_stat         |
+--------+---------------------------+---------------------------+
| str    | dict<str, int64>          | dict<str, int64>          |
+--------+---------------------------+---------------------------+
| "DEL"  | {"FAIL":7221,"PASS":6775} | {"FAIL":7221,"PASS":6775} |
| "INS"  | {"FAIL":3934,"PASS":4516} | {"FAIL":3934,"PASS":4516} |
+--------+---------------------------+---------------------------+
showing the first 2 of 1523 columns
2024-05-06 06:36:33 Hail: INFO: Coerced sorted dataset
2024-05-06 06:36:34 Hail: INFO: Coerced sorted dataset
2024-05-06 06:37:41 Hail: INFO: Coerced sorted dataset
2024-05-06 06:37:54 Hail: INFO: Coerced sorted dataset
2024-05-06 06:38:07 Hail: INFO: Coerced sorted dataset
2024-05-06 06:38:07 Hail: INFO: Coerced sorted dataset
2024-05-06 06:39:00 Hail: INFO: Coerced sorted dataset

In [10]:
mt.describe()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        ABHet: float64, 
        ABHom: float64, 
        ABHetMulti: array<float64>, 
        ABHomMulti: array<float64>, 
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        CR: int32, 
        END: int32, 
        HOMSEQ: array<str>, 
        INV3: bool, 
        INV5: bool, 
        LEFT_SVINSSEQ: array<str>, 
        LOGF: float64, 
        MaxAAS: array<int32>, 
        MaxAASR: array<float64>, 
        MaxAltPP: int32, 
        MQ: int32, 
        MQsquared: int32, 
        NCLUSTERS: int32, 
        NGT: array<int32>, 
        NHet: int32, 
        NHomRef: int32, 
        NHomAlt: int32, 
        NUM_MERGED_SVS: int32, 
        OL

In [11]:
# Extract Variants that PASS GT2 filters
mt2 = mt.filter_rows(mt.info.PASS_GT2_filter=="PASS", keep=True)
print("Samples: %d; Variants: %d; Entries: %d" % (mt2.count_cols(), mt2.count_rows(), mt2.entries().count()))

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

Samples: 1523; Variants: 11291; Entries: 17196193
2024-05-06 06:42:02 Hail: INFO: Coerced sorted dataset
2024-05-06 06:42:16 Hail: INFO: Coerced sorted dataset
2024-05-06 06:42:16 Hail: INFO: Coerced sorted dataset
2024-05-06 06:43:20 Hail: INFO: Coerced sorted dataset
2024-05-06 06:43:21 Hail: INFO: Coerced sorted dataset

In [12]:
release14_manta_svimmer_gt2_mt_uri =   "SG10K_SV_MantaSVimmerGraphtyper_15x.n1523.m11291.discovery.DEL-INS-only.mt"
mt2.write(release14_manta_svimmer_gt2_mt_uri, overwrite=False)

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

2024-05-06 06:51:25 Hail: INFO: Coerced sorted dataset
2024-05-06 06:51:26 Hail: INFO: Coerced sorted dataset
2024-05-06 06:53:10 Hail: INFO: wrote matrix table with 11291 rows and 1523 columns in 15 partitions to SG10K_SV_MantaSVimmerGraphtyper_15x.n1523.m11291.discovery.DEL-INS-only.mt
    Total size: 80.34 MiB
    * Rows/entries: 80.33 MiB
    * Columns: 5.53 KiB
    * Globals: 11.00 B
    * Smallest partition: 589 rows (4.19 MiB)
    * Largest partition:  913 rows (6.41 MiB)