<b>Author:</b> ...

<b>Contributors:</b> ...


<div class="alert alert-block alert-danger">
Before running this notebook, make sure you are using the Hail Genomics Analysis environment. To set it up:
<br/>
    
<ul>
    <li>Click on the <b>cloud analysis environment</b> icon on the right-hand side of the screen.</li>
    <li>Under <b>Recommended environments</b>, select <b>Hail Genomics Analysis</b>, which creates a cloud environment for your analyses.</li>
    <li>This analysis can be run with <b>high compute</b> (e.g., a main node with 96 CPUs and 624 GB of RAM, plus 300 workers and 300 preemptible workers, each with 4 CPUs and 15 GB of RAM).</li>
    <li>Click on <b>Next</b>.</li>
</ul>
    
</div>

<h1>Notebook Objectives</h1>

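This notebook subsets the short-read v7 VDS to the 1,027 long-read samples (excluding HPRC samples) and then to the long-read samples for which GATK-SV calls are also available (989 samples in an earlier run, 990 in the most recent one). It then densifies the subsets into MatrixTables and computes per-sample QC metrics.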

<b>How to Use this Notebook...</b>

<b>As a tutorial:</b>

...

<b>As a resource:</b>

...

<h2>Relevant Information:</h2>

...

In [1]:
import pandas as pd
import numpy as np
import os
import re

In [2]:
import pysam
from pysam import VariantFile

In [3]:
from google.cloud import storage

In [4]:
import hail as hl
from hail.plot import show
from pprint import pprint

## Define helper functions

In [5]:
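def mt_exists(gcs_path):
    # A Hail MatrixTable directory contains a README.txt; use it as an existence marker.
    # Split 'gs://<bucket>/<object path>' into the bucket name and the object path.
    (gcs_bucket_name, gcs_obj) = re.split("/", re.sub("^gs://", "", gcs_path), maxsplit=1)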
    
    storage_client = storage.Client()
    gcs_bucket = storage_client.bucket(gcs_bucket_name)
    stats = storage.Blob(bucket=gcs_bucket, name=f'{gcs_obj}/README.txt').exists(storage_client)
    
    return stats

In [6]:
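def vds_exists(gcs_path):
    # A VDS holds two MatrixTables (reference_data and variant_data); check the first one's README.txt.
    # Split 'gs://<bucket>/<object path>' into the bucket name and the object path.
    (gcs_bucket_name, gcs_obj) = re.split("/", re.sub("^gs://", "", gcs_path), maxsplit=1)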
    
    storage_client = storage.Client()
    gcs_bucket = storage_client.bucket(gcs_bucket_name)
    stats = storage.Blob(bucket=gcs_bucket, name=f'{gcs_obj}/reference_data/README.txt').exists(storage_client)
    
    return stats
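
The two helpers above share the same path-splitting step. As a minimal sketch (not part of the original notebook), that step can be factored out and checked without touching GCS. The names `split_gcs_path` and `marker_blob_name` are hypothetical, and the example path is made up.

```python
def split_gcs_path(gcs_path):
    """Split 'gs://bucket/obj/path' into (bucket, obj_path). Hypothetical helper."""
    prefix = "gs://"
    if not gcs_path.startswith(prefix):
        raise ValueError(f"not a gs:// path: {gcs_path!r}")
    # partition() splits on the first '/' only, like re.split(..., maxsplit=1).
    bucket_name, _, obj_path = gcs_path[len(prefix):].partition("/")
    return bucket_name, obj_path.rstrip("/")


def marker_blob_name(gcs_path, is_vds=False):
    """Object name of the README.txt that mt_exists / vds_exists look for."""
    _, obj_path = split_gcs_path(gcs_path)
    sub_dir = "reference_data/" if is_vds else ""
    return f"{obj_path}/{sub_dir}README.txt"


# Made-up bucket and object names, for illustration only.
example_bucket, example_obj = split_gcs_path("gs://example-bucket/scratch/subset.vds")
example_marker = marker_blob_name("gs://example-bucket/scratch/subset.vds", is_vds=True)
print(example_bucket, example_obj, example_marker)
```

Keeping the string handling separate from the `storage.Client()` call makes it possible to test the path logic offline.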

In [7]:
bucket = os.environ['WORKSPACE_BUCKET']
workspace = os.environ['WORKSPACE_NAME']
namespace = os.environ['WORKSPACE_NAMESPACE']

In [5]:
if not os.path.exists("AoU_srWGS_SV_PhaseI.vcf.gz"):
    !gsutil -m cp gs://prod-drc-broad/aou-wgs-sv/phase1/joint-vcf/AoU_srWGS_SV_PhaseI.vcf.gz .
    !gsutil -m cp gs://prod-drc-broad/aou-wgs-sv/phase1/joint-vcf/AoU_srWGS_SV_PhaseI.vcf.gz.tbi .

In [8]:
if not os.path.exists("cohort_AoUSVPhaseII.v7.LRsamples.vcf.gz"):
    !gsutil cp gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/yulia/cohort_AoUSVPhaseII.v7.LRsamples.vcf.gz .

In [9]:
#sr_sv_samples = !zgrep -m1 '^#CHROM' AoU_srWGS_SV_PhaseI.vcf.gz | cut -f10- | sed 's/\t/\n/g'

In [10]:
sr_sv_samples = !zgrep -m1 '^#CHROM' cohort_AoUSVPhaseII.v7.LRsamples.vcf.gz | cut -f10- | sed 's/\t/\n/g'

In [11]:
len(sr_sv_samples)

990

In [12]:
if not os.path.exists("concat_annotated.sens_09.vcf.gz"):
    !gsutil cp gs://fc-secure-fd873afb-038d-44ed-b113-623c141cb95f/releases/sv_integration/GRCh38/v1/concat_annotated.sens_09.vcf.gz .
        
if not os.path.exists("concat_annotated.sens_07.vcf.gz"):        
    !gsutil cp gs://fc-secure-fd873afb-038d-44ed-b113-623c141cb95f/releases/sv_integration/GRCh38/v1/concat_annotated.sens_07.vcf.gz .

In [13]:
sv_sens_09_vcf = 'concat_annotated.sens_09.vcf.gz'
sv_sens_07_vcf = 'concat_annotated.sens_07.vcf.gz'

In [14]:
!zcat {sv_sens_09_vcf} | head -n 2000 | grep -v '^#' | head -n 3 | cut -f1-9

chr1	10147	0	C	CCCTAACCCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAACCCTAACCCTAACAACCCTAACAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCAACCCAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAA	.	.	TRUVARI_ID=chr1-10148-INS-330;SVTYPE=INS;SVLEN=330;GTCNT=1073,0,0,1;F_MISSING=0.999069;NS=1;AN=2;AF=1;MAF=0;AC=2;AC_Het=0;AC_Hom=2;AC_Hemi=0;HWE=1;ExcHet=1	GT:GQ:DR:DV:SCORE:CALIBRATION_SENSITIVITY:SUPP_PBSV:SUPP_SNIFFLES:SUPP_PAV
chr1	10231	1	C	CCCTAACCCTAACCCCTACCCCAACCCCAACCCCAACCCCAACCCCAACCCTTAACCCTAA	.	.	TRUVARI_ID=chr1-10232-INS-60;SVTYPE=INS;SVLEN=60;GTCNT=1073,0,1,0;F_MISSING=0.999069;NS=1;AN=2;AF=0.5;MAF=0.5;AC=1;AC_Het=1;AC_Hom=0;AC_Hemi=0;HWE=1;ExcHet=1	GT:GQ:DR:DV:SCORE:CALIBRATION_SENSITIVITY:SUPP_PBSV:SUPP_SNIFFLES:SUPP_PAV
chr1	10280	2	AACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAAC	A	.	.	TRUVARI_ID=chr1-10281-DEL-50;SV

In [15]:
sv_sens_09_in = VariantFile(sv_sens_09_vcf)  # auto-detect input format

for i, rec in enumerate(sv_sens_09_in):
    print(f'{i} {rec.chrom} {rec.pos} {rec.info.values()}')
    
    if i > 10:
        break

0 chr1 10147 ['chr1-10148-INS-330', 'INS', 330, (1073, 0, 0, 1), (0.9990689754486084,), 1, 2, (1.0,), 0.0, (2,), (0,), (2,), (0,), (1.0,), (1.0,)]
1 chr1 10231 ['chr1-10232-INS-60', 'INS', 60, (1073, 0, 1, 0), (0.9990689754486084,), 1, 2, (0.5,), 0.5, (1,), (1,), (0,), (0,), (1.0,), (1.0,)]
2 chr1 10280 ['chr1-10281-DEL-50', 'DEL', 50, (1074, 0, 0, 0), (1.0,), 0, 0, (None,), None, (0,), (0,), (0,), (0,), (1.0,), (1.0,)]
3 chr1 10300 ['chr1-10301-DEL-103', 'DEL', 103, (1073, 0, 0, 1), (0.9990689754486084,), 1, 2, (1.0,), 0.0, (2,), (0,), (2,), (0,), (1.0,), (1.0,)]
4 chr1 10306 ['chr1-10307-INS-102', 'INS', 102, (1073, 0, 0, 1), (0.9990689754486084,), 1, 2, (1.0,), 0.0, (2,), (0,), (2,), (0,), (1.0,), (1.0,)]
5 chr1 10309 ['chr1-10310-INS-106', 'INS', 106, (1073, 0, 1, 0), (0.9990689754486084,), 1, 2, (0.5,), 0.5, (1,), (1,), (0,), (0,), (1.0,), (1.0,)]
6 chr1 10310 ['chr1-10311-INS-91', 'INS', 91, (1073, 0, 0, 1), (0.9990689754486084,), 1, 2, (1.0,), 0.0, (2,), (0,), (2,), (0,), (1.0,)

[E::idx_find_and_load] Could not retrieve index file for 'concat_annotated.sens_09.vcf.gz'


## List long read samples

In [16]:
lr_sv_samples = !zgrep -m1 '^#CHROM' concat_annotated.sens_09.vcf.gz | cut -f10- | sed 's/\t/\n/g'

In [17]:
len(lr_sv_samples)

1074

In [18]:
!zgrep -m1 '^#CHROM' concat_annotated.sens_09.vcf.gz | cut -f10- | sed 's/\t/\n/g' > samples_1074.txt



gzip: stdout: Broken pipe


In [19]:
with open('samples_1074.txt', 'r') as file:
    sample_names = file.readlines()

sample_names = [name.strip() for name in sample_names]

In [20]:
len(sample_names)

1074
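
The `gzip: stdout: Broken pipe` message above is most likely harmless: `zgrep -m1` stops reading after the first match while the decompressor is still writing. As an alternative sketch (not the notebook's method), the sample IDs can be read from the `#CHROM` header line in pure Python. The function name `vcf_sample_names` and the toy VCF with samples `S1` to `S3` are assumptions for illustration.

```python
import gzip
import os
import tempfile


def vcf_sample_names(vcf_path):
    """Return sample IDs from the #CHROM header line of a gzip/bgzip-compressed VCF."""
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Columns 1-9 are fixed VCF fields; sample IDs start at column 10.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached variant records without seeing a #CHROM line
    return []


# Toy VCF with made-up sample IDs, written to a temporary file to exercise the function.
toy_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n"
    "chr1\t10147\t0\tC\tCT\t.\t.\t.\tGT\t0/1\t./.\t1/1\n"
)
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.vcf.gz")
    with gzip.open(toy_path, "wt") as out:
        out.write(toy_vcf)
    toy_samples = vcf_sample_names(toy_path)

print(len(toy_samples), toy_samples)
```

Reading stops at the header line, so only the start of the file is decompressed.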

## List long read samples without HPRC samples

In [21]:
# Drop the HPRC reference samples, whose IDs start with 'HG' or 'NA'.
common_samples_1027 = [s for s in lr_sv_samples if not s.startswith(('HG', 'NA'))]
len(common_samples_1027)

1027

## List long read samples with GATK-SV calls available

In [23]:
#common_samples_989 = list(set(sr_sv_samples) & set(lr_sv_samples))
#len(common_samples_989)

In [24]:
common_samples_990 = list(set(sr_sv_samples) & set(lr_sv_samples))
len(common_samples_990)

990
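
The list-building steps in this and the previous section can be sketched on toy data. This is an illustration under assumptions, not the notebook's code: `intersect_samples` is a hypothetical helper and the numeric IDs are made up. Unlike `list(set(a) & set(b))`, it returns the shared samples in a reproducible order.

```python
def intersect_samples(primary, other):
    """Samples present in both lists, in the order they appear in `primary`."""
    other_set = set(other)
    return [s for s in primary if s in other_set]


# Made-up IDs: long-read list (with two reference-style IDs) and short-read SV list.
lr_ids = ["1000001", "HG002", "1000002", "NA12878", "1000003"]
sr_ids = ["1000003", "1000001", "9999999"]

# Step 1: drop reference samples by ID prefix.
lr_no_ref = [s for s in lr_ids if not s.startswith(("HG", "NA"))]

# Step 2: keep samples that have both long-read and short-read SV calls.
both = intersect_samples(sr_ids, lr_ids)

# Short-read samples with no long-read counterpart, useful when counts look off.
only_sr = sorted(set(sr_ids) - set(lr_ids))

print(lr_no_ref, both, only_sr)
```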

## Initialize Hail

In [25]:
spark_conf_more_ram = dict()
spark_conf_more_ram["spark.executor.memory"] = "8g"
spark_conf_more_ram["spark.driver.memory"] = "196g"

# hl.init(default_reference='GRCh38', idempotent=True, spark_conf=spark_conf_more_ram)

hl.init(idempotent=True, spark_conf=spark_conf_more_ram)


Reading spark-defaults.conf to determine GCS requester pays configuration. This is deprecated. Please use `hailctl config set gcs_requester_pays/project` and `hailctl config set gcs_requester_pays/buckets`.

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.3.0
SparkUI available at http://saturn-f75e1fa5-6fbc-4dc6-ae19-602e6c4dd082-m.us-central1-c.c.terra-7a376e4e.internal:40299
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.130.post1-c69cd67afb8b
LOGGING: writing to /home/jupyter/AoU_DRC_WGS_LongReads_PacBio/edit/hail-20250328-0502-0.2.130.post1-c69cd67afb8b.log


In [26]:
hl.default_reference('GRCh38')

## Subset v7 VDS to samples that have long reads

In [18]:
vds = hl.vds.read_vds('gs://prod-drc-broad/v7/wgs/with_aian_no_prod/vds/aou_srwgs_short_variants_v7_with_aian_no_prod.vds')

2024-09-06 21:40:23.528 Hail: WARN: You are reading a VDS written with an older version of Hail.
  Hail now supports much faster interval filters on VDS, but you'll need to run either
  `hl.vds.truncate_reference_blocks(vds, ...)` and write a copy (see docs) or patch the
  existing VDS in place with `hl.vds.store_ref_block_max_length(vds_path)`.


In [78]:
if not vds_exists(f'{bucket}/scratch/kvg/srs-subset.1027.chr22.vds'):
    #callset_sample_filtered_1027 = hl.vds.filter_samples(vds, common_samples_1027, keep=True, remove_dead_alleles=True)
    #callset_sample_filtered_1027.write(f'{bucket}/scratch/kvg/srs-subset.1027.chr22.vds', overwrite=True)
    #callset_sample_filtered_1027.write(f'{bucket}/scratch/kvg/srs-subset.1027.vds', overwrite=True)
    
    pass

In [28]:
callset_sample_filtered_1027 = hl.vds.read_vds(f'{bucket}/scratch/kvg/srs-subset.1027.chr22.vds')

2025-03-28 05:04:03.617 Hail: WARN: You are reading a VDS written with an older version of Hail.
  Hail now supports much faster interval filters on VDS, but you'll need to run either
  `hl.vds.truncate_reference_blocks(vds, ...)` and write a copy (see docs) or patch the
  existing VDS in place with `hl.vds.store_ref_block_max_length(vds_path)`.


## Subset v7 VDS to samples that have long reads and GATK-SV calls

In [79]:
if not vds_exists(f'{bucket}/scratch/kvg/srs-subset.989.vds'):
    #callset_sample_filtered_989 = hl.vds.filter_samples(callset_sample_filtered_1027, common_samples_989, keep=True, remove_dead_alleles=True)
    #callset_sample_filtered_989.write(f'{bucket}/scratch/kvg/srs-subset.989.vds', overwrite=True)
    
    pass

In [23]:
callset_sample_filtered_989 = hl.vds.read_vds(f'{bucket}/scratch/kvg/srs-subset.989.vds')

2024-09-06 21:40:59.821 Hail: WARN: You are reading a VDS written with an older version of Hail.
  Hail now supports much faster interval filters on VDS, but you'll need to run either
  `hl.vds.truncate_reference_blocks(vds, ...)` and write a copy (see docs) or patch the
  existing VDS in place with `hl.vds.store_ref_block_max_length(vds_path)`.


In [29]:
if not vds_exists(f'{bucket}/scratch/kvg/srs-subset.990.vds'):
    callset_sample_filtered_990 = hl.vds.filter_samples(callset_sample_filtered_1027, common_samples_990, keep=True, remove_dead_alleles=True)
    callset_sample_filtered_990.write(f'{bucket}/scratch/kvg/srs-subset.990.vds', overwrite=True)

2025-03-28 09:06:01.630 Hail: INFO: wrote matrix table with 2437620004 rows and 990 columns in 84648 partitions to gs://fc-secure-f7d80b48-be60-426f-aa6b-f037a1bf7f34/scratch/kvg/srs-subset.990.vds/reference_data
2025-03-28 09:08:46.425 Hail: INFO: wrote matrix table with 72130558 rows and 990 columns in 84648 partitions to gs://fc-secure-f7d80b48-be60-426f-aa6b-f037a1bf7f34/scratch/kvg/srs-subset.990.vds/variant_data


In [30]:
callset_sample_filtered_990 = hl.vds.read_vds(f'{bucket}/scratch/kvg/srs-subset.990.vds')

2025-03-28 09:08:59.493 Hail: WARN: You are reading a VDS written with an older version of Hail.
  Hail now supports much faster interval filters on VDS, but you'll need to run either
  `hl.vds.truncate_reference_blocks(vds, ...)` and write a copy (see docs) or patch the
  existing VDS in place with `hl.vds.store_ref_block_max_length(vds_path)`.


## Check that we got the number of samples correct

In [31]:
callset_sample_filtered_1027.n_samples()

1027

In [26]:
callset_sample_filtered_989.n_samples()

989

In [32]:
callset_sample_filtered_990.n_samples()

990
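
`n_samples()` confirms the count but not which samples were kept. A small sketch of an ID-level check follows. It assumes the observed IDs have already been collected into a Python list (for example from the column key `s` of the subset). `compare_sample_sets` is a hypothetical helper and the IDs are made up.

```python
def compare_sample_sets(requested, observed):
    """Report requested IDs that are absent and observed IDs that were not requested."""
    requested_set, observed_set = set(requested), set(observed)
    return {
        "missing": sorted(requested_set - observed_set),
        "unexpected": sorted(observed_set - requested_set),
    }


# Made-up IDs: one requested sample is absent and one extra sample is present.
requested_ids = ["S1", "S2", "S3"]
observed_ids = ["S1", "S3", "S4"]
report = compare_sample_sets(requested_ids, observed_ids)
print(report)
```

Two empty lists in the report mean the subset contains exactly the requested samples.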

## Densify subsetted VDS objects

In [85]:
# Densify and write only if the MatrixTable does not exist yet (remove the check to force a rebuild).
if not mt_exists(f'{bucket}/scratch/kvg/srs-subset.1027.mt'):
    mt_1027 = callset_sample_filtered_1027.variant_data.annotate_entries(
        AD = hl.vds.local_to_global(callset_sample_filtered_1027.variant_data.LAD, 
                                    callset_sample_filtered_1027.variant_data.LA, 
                                    n_alleles = hl.len(callset_sample_filtered_1027.variant_data.alleles), 
                                    fill_value = 0, 
                                    number = 'R')
    )
    
    mt_1027 = mt_1027.annotate_entries(GT = hl.vds.lgt_to_gt(mt_1027.LGT, mt_1027.LA))
    mt_1027 = hl.vds.to_dense_mt(hl.vds.VariantDataset(callset_sample_filtered_1027.reference_data, mt_1027))
    mt_1027 = mt_1027.annotate_rows(info = hl.agg.call_stats(mt_1027.GT, mt_1027.alleles))
    mt_1027.write(f'{bucket}/scratch/kvg/srs-subset.1027.mt', overwrite=True)

2024-09-07 00:28:45.412 Hail: INFO: wrote matrix table with 73161980 rows and 1027 columns in 84648 partitions to gs://fc-secure-f7d80b48-be60-426f-aa6b-f037a1bf7f34/scratch/kvg/srs-subset.1027.mt
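
The densify cell converts the VDS "local" entry fields (`LAD`, `LGT`, indexed through `LA`) into site-level `AD` and `GT`. The following pure-Python sketch shows what that re-indexing does for a single entry. It illustrates the idea behind `hl.vds.local_to_global` with `number='R'` and `hl.vds.lgt_to_gt`, and is not Hail's implementation. The function names and the example values are assumptions.

```python
def local_to_global_r(local_values, local_alleles, n_alleles, fill_value=0):
    """Expand a per-local-allele array (e.g. LAD) to one value per site allele (e.g. AD)."""
    global_values = [fill_value] * n_alleles
    for local_idx, global_idx in enumerate(local_alleles):
        global_values[global_idx] = local_values[local_idx]
    return global_values


def local_gt_to_global_gt(lgt, local_alleles):
    """Translate a genotype of local allele indices into site-level allele indices."""
    return tuple(local_alleles[allele] for allele in lgt)


# Made-up entry: the site has 4 alleles, but this sample only carries
# the reference (global index 0) and the alternate with global index 2.
LA = [0, 2]       # local index -> global allele index
LAD = [12, 9]     # allele depths for the local alleles
LGT = (0, 1)      # heterozygous in local indices

AD = local_to_global_r(LAD, LA, n_alleles=4, fill_value=0)
GT = local_gt_to_global_gt(LGT, LA)
print(AD, GT)
```

Alleles the sample does not carry receive the fill value, which is why the notebook passes `fill_value = 0` for depths.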


In [31]:
mt_1027 = hl.read_matrix_table(f'{bucket}/scratch/kvg/srs-subset.1027.mt')

In [86]:
mt_qc_1027 = hl.sample_qc(mt_1027)

In [88]:
mt_qc_1027.describe()

----------------------------------------
Global fields:
    'tranche_data': array<struct {
        model: str, 
        truth_sensitivity: float64, 
        min_vqslod: float64, 
        filter_name: str
    }>
    'truth_sensitivity_snp_threshold': float64
    'truth_sensitivity_indel_threshold': float64
    'snp_vqslod_threshold': float64
    'indel_vqslod_threshold': float64
----------------------------------------
Column fields:
    's': str
    'sample_qc': struct {
        gq_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        call_rate: float64, 
        n_called: int64, 
        n_not_called: int64, 
        n_filtered: int64, 
        n_hom_ref: int64, 
        n_het: int64, 
        n_hom_var: int64, 
        n_non_ref: int64, 
        n_singleton: int64, 
        n_snp: int64, 
        n_insertion: int64, 
        n_deletion: int64, 
        n_transition: int64, 
        n_transversi

In [87]:
mt_qc_1027.cols().show()

2024-09-07 01:13:24.447 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'

s,sample_qc.gq_stats.mean,sample_qc.gq_stats.stdev,sample_qc.gq_stats.min,sample_qc.gq_stats.max,sample_qc.call_rate,sample_qc.n_called,sample_qc.n_not_called,sample_qc.n_filtered,sample_qc.n_hom_ref,sample_qc.n_het,sample_qc.n_hom_var,sample_qc.n_non_ref,sample_qc.n_singleton,sample_qc.n_snp,sample_qc.n_insertion,sample_qc.n_deletion,sample_qc.n_transition,sample_qc.n_transversion,sample_qc.n_star,sample_qc.r_ti_tv,sample_qc.r_het_hom_var,sample_qc.r_insertion_deletion
str,float64,float64,float64,float64,float64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,float64,float64,float64
"1000151",38.5,13.8,0.0,99.0,0.994,72716066,0,445914,66762013,4097921,1856132,5954053,30091,6355492,792269,793474,4241709,2113783,0,2.01,2.21,0.998
"1000513",37.9,13.2,0.0,99.0,0.994,72731515,0,430465,66875438,4076768,1779309,5856077,29923,6211279,777158,774227,4147162,2064117,0,2.01,2.29,1.0
"1000920",38.5,13.8,0.0,99.0,0.994,72718025,0,443955,66785718,4111733,1820574,5932307,28180,6304855,791619,789904,4211349,2093506,0,2.01,2.26,1.0
"1001399",38.8,13.6,0.0,99.0,0.994,72714732,0,447248,66797836,3996124,1920772,5916896,27806,6368180,800807,799820,4251516,2116664,0,2.01,2.08,1.0
"1001980",38.3,13.4,0.0,99.0,0.994,72746269,0,415711,67217496,3725614,1803159,5528773,36753,5946433,751562,747736,3970485,1975948,0,2.01,2.07,1.01
"1002322",38.0,13.4,0.0,99.0,0.994,72719932,0,442048,66828626,4026064,1865242,5891306,28609,6305735,790262,790175,4213357,2092378,0,2.01,2.16,1.0
"1002826",38.7,13.8,0.0,99.0,0.994,72712771,0,449209,66752399,4121247,1839125,5960372,32002,6342616,796135,796047,4233637,2108979,0,2.01,2.24,1.0
"1004266",38.4,13.7,0.0,99.0,0.994,72716202,0,445778,66724227,4108162,1883813,5991975,32626,6407634,800916,802674,4277358,2130276,0,2.01,2.18,0.998
"1005038",38.7,13.6,0.0,99.0,0.994,72716615,0,445365,66807892,4123780,1784943,5908723,28700,6257487,786112,783431,4174980,2082507,0,2.0,2.31,1.0
"1005444",38.4,13.5,0.0,99.0,0.994,72729426,0,432554,66994011,3874298,1861117,5735415,49216,6164321,777891,775620,4117269,2047052,0,2.01,2.08,1.0


In [84]:
# Densify and write only if the MatrixTable does not exist yet (remove the check to force a rebuild).
if not mt_exists(f'{bucket}/scratch/kvg/srs-subset.989.mt'):
    mt_989 = callset_sample_filtered_989.variant_data.annotate_entries(
        AD = hl.vds.local_to_global(callset_sample_filtered_989.variant_data.LAD, 
                                    callset_sample_filtered_989.variant_data.LA, 
                                    n_alleles = hl.len(callset_sample_filtered_989.variant_data.alleles), 
                                    fill_value = 0, 
                                    number = 'R')
    )
    
    mt_989 = mt_989.annotate_entries(GT = hl.vds.lgt_to_gt(mt_989.LGT, mt_989.LA))
    mt_989 = hl.vds.to_dense_mt(hl.vds.VariantDataset(callset_sample_filtered_989.reference_data, mt_989))
    mt_989 = mt_989.annotate_rows(info = hl.agg.call_stats(mt_989.GT, mt_989.alleles))
    mt_989.write(f'{bucket}/scratch/kvg/srs-subset.989.mt', overwrite=True)

2024-09-06 23:31:19.638 Hail: INFO: wrote matrix table with 72103826 rows and 989 columns in 84648 partitions to gs://fc-secure-f7d80b48-be60-426f-aa6b-f037a1bf7f34/scratch/kvg/srs-subset.989.mt


In [81]:
mt_989 = hl.read_matrix_table(f'{bucket}/scratch/kvg/srs-subset.989.mt')

In [89]:
mt_qc_989 = hl.sample_qc(mt_989)
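
Several `sample_qc` fields are simple ratios of the count fields. As a worked check, the sketch below recomputes them from the counts in the first row of the 1,027-sample table shown earlier (sample `1000151`). The dictionary and variable names are illustrative only.

```python
# Counts copied from the first row of the 1,027-sample QC table.
counts = {
    "n_called": 72716066,
    "n_not_called": 0,
    "n_filtered": 445914,
    "n_het": 4097921,
    "n_hom_var": 1856132,
    "n_insertion": 792269,
    "n_deletion": 793474,
    "n_transition": 4241709,
    "n_transversion": 2113783,
}

# Fraction of entries with a called, unfiltered genotype.
n_entries = counts["n_called"] + counts["n_not_called"] + counts["n_filtered"]
call_rate = counts["n_called"] / n_entries

# Derived ratios reported alongside the counts.
r_ti_tv = counts["n_transition"] / counts["n_transversion"]
r_het_hom_var = counts["n_het"] / counts["n_hom_var"]
r_insertion_deletion = counts["n_insertion"] / counts["n_deletion"]

# Transitions plus transversions should add up to the n_snp column.
n_snp = counts["n_transition"] + counts["n_transversion"]

print(round(call_rate, 3), round(r_ti_tv, 2), round(r_het_hom_var, 2),
      round(r_insertion_deletion, 3), n_snp, n_entries)
```

The entry total equals the 73,161,980 rows reported when the 1,027-sample MatrixTable was written, and the rounded ratios match the displayed row.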

In [90]:
mt_qc_989.describe()

----------------------------------------
Global fields:
    'tranche_data': array<struct {
        model: str, 
        truth_sensitivity: float64, 
        min_vqslod: float64, 
        filter_name: str
    }>
    'truth_sensitivity_snp_threshold': float64
    'truth_sensitivity_indel_threshold': float64
    'snp_vqslod_threshold': float64
    'indel_vqslod_threshold': float64
----------------------------------------
Column fields:
    's': str
    'sample_qc': struct {
        gq_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        call_rate: float64, 
        n_called: int64, 
        n_not_called: int64, 
        n_filtered: int64, 
        n_hom_ref: int64, 
        n_het: int64, 
        n_hom_var: int64, 
        n_non_ref: int64, 
        n_singleton: int64, 
        n_snp: int64, 
        n_insertion: int64, 
        n_deletion: int64, 
        n_transition: int64, 
        n_transversi

In [93]:
mt_qc_989.cols().show()

[Stage 46:>                                                   (0 + 350) / 84648]

FatalError: RemoteException: The directory item limit of /tmp/aggregate_intermediates is exceeded: limit=1048576 items=1048576
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:1277)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addLastINode(FSDirectory.java:1361)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addINode(FSDirectory.java:1184)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.addFile(FSDirWriteFileOp.java:579)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:398)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2703)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2596)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:799)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:494)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)


Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 46.0 failed 4 times, most recent failure: Lost task 2.3 in stage 46.0 (TID 959727) (saturn-919ba7fe-2d8c-4e1d-945c-229767cf9700-w-276.us-central1-c.c.terra-7a376e4e.internal executor 3617): org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): The directory item limit of /tmp/aggregate_intermediates is exceeded: limit=1048576 items=1048576
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:1277)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addLastINode(FSDirectory.java:1361)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addINode(FSDirectory.java:1184)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.addFile(FSDirWriteFileOp.java:579)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:398)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2703)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2596)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:799)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:494)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)

	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1612)
	at org.apache.hadoop.ipc.Client.call(Client.java:1558)
	at org.apache.hadoop.ipc.Client.call(Client.java:1455)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
	at com.sun.proxy.$Proxy36.create(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:382)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy37.create(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:280)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1271)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1250)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1232)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1170)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:556)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:553)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:567)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:494)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1196)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1176)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1065)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1053)
	at is.hail.io.fs.HadoopFS.createNoCompression(HadoopFS.scala:101)
	at is.hail.io.fs.HadoopFS.createNoCompression(HadoopFS.scala:85)
	at is.hail.io.fs.FS.create(FS.scala:578)
	at is.hail.io.fs.FS.create$(FS.scala:577)
	at is.hail.io.fs.HadoopFS.create(HadoopFS.scala:85)
	at is.hail.io.fs.FS.create(FS.scala:575)
	at is.hail.io.fs.FS.create$(FS.scala:575)
	at is.hail.io.fs.HadoopFS.create(HadoopFS.scala:85)
	at __C18399collect_distributed_array_table_scan_write_prefix_sums.apply(Unknown Source)
	at __C18399collect_distributed_array_table_scan_write_prefix_sums.apply(Unknown Source)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$6(BackendUtils.scala:87)
	at is.hail.utils.package$.using(package.scala:664)
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:166)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$5(BackendUtils.scala:86)
	at is.hail.backend.spark.SparkBackendComputeRDD.compute(SparkBackend.scala:910)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2673)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2609)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2608)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2608)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2861)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2803)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2792)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2257)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2276)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2301)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1020)
	at is.hail.backend.spark.SparkBackend.parallelizeAndComputeWithIndex(SparkBackend.scala:429)
	at is.hail.backend.BackendUtils.collectDArray(BackendUtils.scala:82)
	at __C18282Compiled.__m18286begin_group_0(Emit.scala)
	at __C18282Compiled.__m18284split_Block(Emit.scala)
	at __C18282Compiled.apply(Emit.scala)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$7(CompileAndEvaluate.scala:82)
	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:82)
	at is.hail.expr.ir.CompileAndEvaluate$.evalToIR(CompileAndEvaluate.scala:28)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:30)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:59)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:64)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:83)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:32)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:32)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:30)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:29)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:78)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:21)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:19)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:19)
	at is.hail.expr.ir.lowering.EvalRelationalLets$.execute$1(EvalRelationalLets.scala:13)
	at is.hail.expr.ir.lowering.EvalRelationalLets$.lower$1(EvalRelationalLets.scala:21)
	at is.hail.expr.ir.lowering.EvalRelationalLets$.apply(EvalRelationalLets.scala:35)
	at is.hail.expr.ir.lowering.EvalRelationalLetsPass.transform(LoweringPass.scala:168)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:32)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:32)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:30)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:29)
	at is.hail.expr.ir.lowering.EvalRelationalLetsPass.apply(LoweringPass.scala:162)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:21)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:19)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:19)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:45)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:600)
	at is.hail.backend.spark.SparkBackend.$anonfun$execute$4(SparkBackend.scala:636)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
	at is.hail.backend.spark.SparkBackend.$anonfun$execute$3(SparkBackend.scala:631)
	at is.hail.backend.spark.SparkBackend.$anonfun$execute$3$adapted(SparkBackend.scala:630)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:78)
	at is.hail.utils.package$.using(package.scala:664)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:78)
	at is.hail.utils.package$.using(package.scala:664)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:13)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:65)
	at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$2(SparkBackend.scala:407)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:55)
	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:62)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:393)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:630)
	at is.hail.backend.BackendHttpHandler.handle(BackendServer.scala:88)
	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:77)
	at jdk.httpserver/sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:82)
	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:80)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:848)
	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:77)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:817)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$DefaultExecutor.execute(ServerImpl.java:201)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$Dispatcher.handle(ServerImpl.java:560)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$Dispatcher.run(ServerImpl.java:526)
	at java.base/java.lang.Thread.run(Thread.java:829)

org.apache.hadoop.ipc.RemoteException: The directory item limit of /tmp/aggregate_intermediates is exceeded: limit=1048576 items=1048576
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:1277)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addLastINode(FSDirectory.java:1361)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addINode(FSDirectory.java:1184)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.addFile(FSDirWriteFileOp.java:579)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:398)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2703)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2596)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:799)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:494)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)

	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1612)
	at org.apache.hadoop.ipc.Client.call(Client.java:1558)
	at org.apache.hadoop.ipc.Client.call(Client.java:1455)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
	at com.sun.proxy.$Proxy36.create(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:382)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy37.create(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:280)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1271)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1250)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1232)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1170)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:556)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:553)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:567)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:494)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1196)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1176)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1065)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1053)
	at is.hail.io.fs.HadoopFS.createNoCompression(HadoopFS.scala:101)
	at is.hail.io.fs.HadoopFS.createNoCompression(HadoopFS.scala:85)
	at is.hail.io.fs.FS.create(FS.scala:578)
	at is.hail.io.fs.FS.create$(FS.scala:577)
	at is.hail.io.fs.HadoopFS.create(HadoopFS.scala:85)
	at is.hail.io.fs.FS.create(FS.scala:575)
	at is.hail.io.fs.FS.create$(FS.scala:575)
	at is.hail.io.fs.HadoopFS.create(HadoopFS.scala:85)
	at __C18399collect_distributed_array_table_scan_write_prefix_sums.apply(Unknown Source)
	at __C18399collect_distributed_array_table_scan_write_prefix_sums.apply(Unknown Source)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$6(BackendUtils.scala:87)
	at is.hail.utils.package$.using(package.scala:664)
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:166)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$5(BackendUtils.scala:86)
	at is.hail.backend.spark.SparkBackendComputeRDD.compute(SparkBackend.scala:910)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)




Hail version: 0.2.130.post1-c69cd67afb8b
Error summary: RemoteException: The directory item limit of /tmp/aggregate_intermediates is exceeded: limit=1048576 items=1048576
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:1277)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addLastINode(FSDirectory.java:1361)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addINode(FSDirectory.java:1184)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.addFile(FSDirWriteFileOp.java:579)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:398)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2703)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2596)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:799)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:494)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)
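
The trace above reduces to one line: HDFS refused to create another file in `/tmp/aggregate_intermediates` because that directory already held 1,048,576 entries (2^20), which matches Hadoop's default value for `dfs.namenode.fs-limits.max-directory-items`. As a minimal sketch, the message can be parsed to confirm which directory hit the cap and by how much. The helper name `parse_dir_item_limit` is hypothetical and is not part of Hail or Hadoop; the sample string is copied from the error summary in this output.

```python
import re

# Pattern for the HDFS "directory item limit" message seen in the trace.
_LIMIT_RE = re.compile(
    r"The directory item limit of (?P<path>\S+) is exceeded: "
    r"limit=(?P<limit>\d+) items=(?P<items>\d+)"
)


def parse_dir_item_limit(message):
    """Return (path, limit, items) from the HDFS message, or None if absent.

    Hypothetical helper for illustration only.
    """
    match = _LIMIT_RE.search(message)
    if match is None:
        return None
    return (
        match.group("path"),
        int(match.group("limit")),
        int(match.group("items")),
    )


# Error summary line copied from the output above.
summary = (
    "Error summary: RemoteException: The directory item limit of "
    "/tmp/aggregate_intermediates is exceeded: limit=1048576 items=1048576"
)

parsed = parse_dir_item_limit(summary)
print(parsed)
```

Because `/tmp` is the default `tmp_dir` for `hl.init`, one possible mitigation (an assumption, not verified in this notebook) is to clear that HDFS directory between runs or to pass a different network-visible `tmp_dir` when initializing Hail.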


[Stage 47:>                                                  (0 + 1063) / 84648]

FatalError: RemoteException: The directory item limit of /tmp/aggregate_intermediates is exceeded: limit=1048576 items=1048576
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:1277)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addLastINode(FSDirectory.java:1361)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addINode(FSDirectory.java:1184)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.addFile(FSDirWriteFileOp.java:579)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:398)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2703)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2596)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:799)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:494)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)


Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 47.0 failed 4 times, most recent failure: Lost task 22.3 in stage 47.0 (TID 961784) (saturn-919ba7fe-2d8c-4e1d-945c-229767cf9700-w-276.us-central1-c.c.terra-7a376e4e.internal executor 3617): org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): The directory item limit of /tmp/aggregate_intermediates is exceeded: limit=1048576 items=1048576
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:1277)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addLastINode(FSDirectory.java:1361)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addINode(FSDirectory.java:1184)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.addFile(FSDirWriteFileOp.java:579)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:398)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2703)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2596)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:799)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:494)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)

	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1612)
	at org.apache.hadoop.ipc.Client.call(Client.java:1558)
	at org.apache.hadoop.ipc.Client.call(Client.java:1455)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
	at com.sun.proxy.$Proxy36.create(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:382)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy37.create(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:280)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1271)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1250)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1232)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1170)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:556)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:553)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:567)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:494)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1196)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1176)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1065)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1053)
	at is.hail.io.fs.HadoopFS.createNoCompression(HadoopFS.scala:101)
	at is.hail.io.fs.HadoopFS.createNoCompression(HadoopFS.scala:85)
	at is.hail.io.fs.FS.create(FS.scala:578)
	at is.hail.io.fs.FS.create$(FS.scala:577)
	at is.hail.io.fs.HadoopFS.create(HadoopFS.scala:85)
	at is.hail.io.fs.FS.create(FS.scala:575)
	at is.hail.io.fs.FS.create$(FS.scala:575)
	at is.hail.io.fs.HadoopFS.create(HadoopFS.scala:85)
	at __C20895collect_distributed_array_table_scan_write_prefix_sums.apply(Unknown Source)
	at __C20895collect_distributed_array_table_scan_write_prefix_sums.apply(Unknown Source)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$6(BackendUtils.scala:87)
	at is.hail.utils.package$.using(package.scala:664)
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:166)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$5(BackendUtils.scala:86)
	at is.hail.backend.spark.SparkBackendComputeRDD.compute(SparkBackend.scala:910)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2673)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2609)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2608)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2608)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2861)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2803)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2792)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2257)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2276)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2301)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1020)
	at is.hail.backend.spark.SparkBackend.parallelizeAndComputeWithIndex(SparkBackend.scala:429)
	at is.hail.backend.BackendUtils.collectDArray(BackendUtils.scala:82)
	at __C20778Compiled.__m20782begin_group_0(Emit.scala)
	at __C20778Compiled.__m20780split_Block(Emit.scala)
	at __C20778Compiled.apply(Emit.scala)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$7(CompileAndEvaluate.scala:82)
	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:82)
	at is.hail.expr.ir.CompileAndEvaluate$.evalToIR(CompileAndEvaluate.scala:28)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:30)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:59)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:64)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:83)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:32)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:32)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:30)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:29)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:78)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:21)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:19)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:19)
	at is.hail.expr.ir.lowering.EvalRelationalLets$.execute$1(EvalRelationalLets.scala:13)
	at is.hail.expr.ir.lowering.EvalRelationalLets$.lower$1(EvalRelationalLets.scala:21)
	at is.hail.expr.ir.lowering.EvalRelationalLets$.apply(EvalRelationalLets.scala:35)
	at is.hail.expr.ir.lowering.EvalRelationalLetsPass.transform(LoweringPass.scala:168)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:32)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:32)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:30)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:29)
	at is.hail.expr.ir.lowering.EvalRelationalLetsPass.apply(LoweringPass.scala:162)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:21)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:19)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:19)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:45)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:600)
	at is.hail.backend.spark.SparkBackend.$anonfun$execute$4(SparkBackend.scala:636)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:84)
	at is.hail.backend.spark.SparkBackend.$anonfun$execute$3(SparkBackend.scala:631)
	at is.hail.backend.spark.SparkBackend.$anonfun$execute$3$adapted(SparkBackend.scala:630)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:78)
	at is.hail.utils.package$.using(package.scala:664)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:78)
	at is.hail.utils.package$.using(package.scala:664)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:13)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:65)
	at is.hail.backend.spark.SparkBackend.$anonfun$withExecuteContext$2(SparkBackend.scala:407)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:55)
	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:62)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:393)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:630)
	at is.hail.backend.BackendHttpHandler.handle(BackendServer.scala:88)
	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:77)
	at jdk.httpserver/sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:82)
	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:80)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:848)
	at jdk.httpserver/com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:77)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:817)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$DefaultExecutor.execute(ServerImpl.java:201)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$Dispatcher.handle(ServerImpl.java:560)
	at jdk.httpserver/sun.net.httpserver.ServerImpl$Dispatcher.run(ServerImpl.java:526)
	at java.base/java.lang.Thread.run(Thread.java:829)

org.apache.hadoop.ipc.RemoteException: The directory item limit of /tmp/aggregate_intermediates is exceeded: limit=1048576 items=1048576
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:1277)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addLastINode(FSDirectory.java:1361)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addINode(FSDirectory.java:1184)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.addFile(FSDirWriteFileOp.java:579)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:398)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2703)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2596)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:799)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:494)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)

	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1612)
	at org.apache.hadoop.ipc.Client.call(Client.java:1558)
	at org.apache.hadoop.ipc.Client.call(Client.java:1455)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
	at com.sun.proxy.$Proxy36.create(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:382)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy37.create(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:280)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1271)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1250)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1232)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1170)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:556)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:553)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:567)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:494)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1196)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1176)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1065)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1053)
	at is.hail.io.fs.HadoopFS.createNoCompression(HadoopFS.scala:101)
	at is.hail.io.fs.HadoopFS.createNoCompression(HadoopFS.scala:85)
	at is.hail.io.fs.FS.create(FS.scala:578)
	at is.hail.io.fs.FS.create$(FS.scala:577)
	at is.hail.io.fs.HadoopFS.create(HadoopFS.scala:85)
	at is.hail.io.fs.FS.create(FS.scala:575)
	at is.hail.io.fs.FS.create$(FS.scala:575)
	at is.hail.io.fs.HadoopFS.create(HadoopFS.scala:85)
	at __C20895collect_distributed_array_table_scan_write_prefix_sums.apply(Unknown Source)
	at __C20895collect_distributed_array_table_scan_write_prefix_sums.apply(Unknown Source)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$6(BackendUtils.scala:87)
	at is.hail.utils.package$.using(package.scala:664)
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:166)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$5(BackendUtils.scala:86)
	at is.hail.backend.spark.SparkBackendComputeRDD.compute(SparkBackend.scala:910)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)




Hail version: 0.2.130.post1-c69cd67afb8b
Error summary: RemoteException: The directory item limit of /tmp/aggregate_intermediates is exceeded: limit=1048576 items=1048576
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:1277)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addLastINode(FSDirectory.java:1361)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addINode(FSDirectory.java:1184)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.addFile(FSDirWriteFileOp.java:579)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:398)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2703)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2596)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:799)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:494)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:604)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:572)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:556)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1043)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:971)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2976)


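The error summary above reports that the HDFS directory `/tmp/aggregate_intermediates` reached its per-directory item limit (1,048,576 entries). One possible mitigation, offered as an untested sketch, is to point Hail's temporary directory at the workspace bucket instead of cluster HDFS. `hl.init` accepts a `tmp_dir` argument; the `scratch/hail-tmp` prefix and the helper names `hail_tmp_dir` and `init_hail_with_bucket_tmp` are assumptions for illustration. This also assumes Hail has not yet been initialized in the running kernel.

```python
def hail_tmp_dir(bucket: str, prefix: str = "scratch/hail-tmp") -> str:
    """Build a temporary-directory URI under the workspace bucket.

    `prefix` is a hypothetical location; adjust it to your own layout.
    """
    return f"{bucket.rstrip('/')}/{prefix.strip('/')}"


def init_hail_with_bucket_tmp(bucket: str) -> str:
    """Initialize Hail with its temporary directory on the bucket.

    Assumption: Hail has not been initialized yet in this kernel.
    """
    import hail as hl  # imported lazily so the path helper works without Hail

    tmp_dir = hail_tmp_dir(bucket)
    hl.init(tmp_dir=tmp_dir)
    return tmp_dir


# Usage in the notebook, before any other Hail call
# (`bucket` is the WORKSPACE_BUCKET value defined earlier):
# init_hail_with_bucket_tmp(bucket)
```

Whether this avoids the failure depends on how the cluster is configured, so treat it as a starting point rather than a confirmed fix.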

In [33]:
# NOTE: `or True` forces a rebuild even when the matrix table already exists;
# remove it to reuse a previously written copy.
if not mt_exists(f'{bucket}/scratch/kvg/srs-subset.990.mt') or True:
    vd_990 = callset_sample_filtered_990.variant_data

    # Convert local allele depths (LAD, indexed via LA) into global allele depths (AD).
    mt_990 = vd_990.annotate_entries(
        AD = hl.vds.local_to_global(vd_990.LAD,
                                    vd_990.LA,
                                    n_alleles = hl.len(vd_990.alleles),
                                    fill_value = 0,
                                    number = 'R')
    )

    # Convert local genotypes (LGT) into global genotypes (GT).
    mt_990 = mt_990.annotate_entries(GT = hl.vds.lgt_to_gt(mt_990.LGT, mt_990.LA))

    # Densify: combine the reference blocks with the annotated variant data.
    mt_990 = hl.vds.to_dense_mt(hl.vds.VariantDataset(callset_sample_filtered_990.reference_data, mt_990))

    # Recompute per-variant call statistics on the 990-sample subset, then write.
    mt_990 = mt_990.annotate_rows(info = hl.agg.call_stats(mt_990.GT, mt_990.alleles))
    mt_990.write(f'{bucket}/scratch/kvg/srs-subset.990.mt', overwrite=True)

2025-03-28 09:16:05.404 Hail: INFO: wrote matrix table with 72130558 rows and 990 columns in 84648 partitions to gs://fc-secure-f7d80b48-be60-426f-aa6b-f037a1bf7f34/scratch/kvg/srs-subset.990.mt
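
The cell above converts the VDS's local-allele fields into global ones before densifying. As a plain-Python illustration of what `hl.vds.local_to_global(..., number='R')` and `hl.vds.lgt_to_gt` do conceptually for a single entry (a sketch, not Hail's implementation; the function names and the example values are hypothetical):

```python
from typing import List, Sequence, Tuple


def local_to_global_r(local_values: Sequence[int],
                      local_alleles: Sequence[int],
                      n_alleles: int,
                      fill_value: int = 0) -> List[int]:
    """Re-index a per-allele ('R') array from local to global allele order.

    local_alleles[i] is the global allele index of the i-th local allele;
    global alleles the sample does not carry receive `fill_value`.
    """
    global_values = [fill_value] * n_alleles
    for local_idx, global_idx in enumerate(local_alleles):
        global_values[global_idx] = local_values[local_idx]
    return global_values


def local_gt_to_global(lgt: Tuple[int, int],
                       local_alleles: Sequence[int]) -> Tuple[int, ...]:
    """Map each local allele index in a genotype to its global index."""
    return tuple(local_alleles[a] for a in lgt)


# A site with 4 alleles overall; this sample only carries ref (0) and allele 2.
la = [0, 2]     # local -> global allele indices
lad = [12, 9]   # local allele depths
lgt = (0, 1)    # local genotype: ref / first local alt

ad = local_to_global_r(lad, la, n_alleles=4)  # [12, 0, 9, 0]
gt = local_gt_to_global(lgt, la)              # (0, 2)
```

Storing only the alleles each sample carries keeps entries small at highly multi-allelic sites, which is why the conversion to global fields is deferred until a dense matrix table is actually needed.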


In [34]:
mt_990 = hl.read_matrix_table(f'{bucket}/scratch/kvg/srs-subset.990.mt')

In [35]:
# Add a `sample_qc` column struct with per-sample QC metrics.
mt_qc_990 = hl.sample_qc(mt_990)

In [36]:
mt_qc_990.describe()

----------------------------------------
Global fields:
    'tranche_data': array<struct {
        model: str, 
        truth_sensitivity: float64, 
        min_vqslod: float64, 
        filter_name: str
    }>
    'truth_sensitivity_snp_threshold': float64
    'truth_sensitivity_indel_threshold': float64
    'snp_vqslod_threshold': float64
    'indel_vqslod_threshold': float64
----------------------------------------
Column fields:
    's': str
    'sample_qc': struct {
        gq_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        call_rate: float64, 
        n_called: int64, 
        n_not_called: int64, 
        n_filtered: int64, 
        n_hom_ref: int64, 
        n_het: int64, 
        n_hom_var: int64, 
        n_non_ref: int64, 
        n_singleton: int64, 
        n_snp: int64, 
        n_insertion: int64, 
        n_deletion: int64, 
        n_transition: int64, 
        n_transversi

In [37]:
mt_qc_990.aggregate_cols(hl.agg.stats(mt_qc_990.sample_qc.r_ti_tv))

2025-03-28 09:16:17.293 Hail: WARN: aggregate_cols(): Aggregates over cols ordered by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'

Struct(mean=2.0096745497622273, stdev=0.002684913051356153, min=2.0012562412889787, max=2.0173091551027644, n=990, sum=1989.577804264605)

In [38]:
mt_qc_990.aggregate_cols(hl.agg.stats(mt_qc_990.sample_qc.r_het_hom_var))



Struct(mean=2.183973992295268, stdev=0.10190033852555179, min=1.4475272934819965, max=2.588171391746939, n=990, sum=2162.1342523723156)
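
The two aggregations above summarize per-sample ratios produced by `hl.sample_qc`: `r_ti_tv` is the transition/transversion ratio and `r_het_hom_var` is the ratio of heterozygous to homozygous-variant calls. The sketch below shows, in plain Python, how such ratios follow from the underlying counts and checks that the reported Ti/Tv mean equals `sum / n`. The per-sample counts are invented for illustration and are not taken from this callset; returning NaN for a zero denominator is a choice of this sketch.

```python
import math


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


# Hypothetical counts for one sample (not real data).
example = {"n_transition": 2010, "n_transversion": 1000,
           "n_het": 2180, "n_hom_var": 1000}

r_ti_tv = ratio(example["n_transition"], example["n_transversion"])
r_het_hom_var = ratio(example["n_het"], example["n_hom_var"])

# Consistency check on the reported Ti/Tv summary: mean should equal sum / n.
reported_sum, reported_n = 1989.577804264605, 990
reported_mean = 2.0096745497622273
mean_matches = math.isclose(reported_sum / reported_n, reported_mean, rel_tol=1e-12)
```

The same `sum / n` check can be applied to the `r_het_hom_var` summary as a quick sanity test that all 990 samples contributed to the aggregate.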