# Exploring TileDB-VCF: 
# Notebook 2. A deeper look at ingestion of data
2024-12-20 Daniel P. Brink

This notebook investigates various aspects of data ingestion with the TileDB-VCF python library. It compares loading of tutorial data from a S3 bucket versus local file, finds (through some trial-and-error) preferrable ways of handling file ingestion, and investigates how multi-sample files can be handled to allow their data to be ingested.

The approach of this notebook is to jump head-first into the commands with little prior knowledge other than that which was learned in notedbook 1 of this GitHub repo. It is very likely that some of the issue encountered below could have been circumented had I read the manual more closely, beforehand. However, by working in this manner, I stumbled upon aspects about how TileDB-VCF handles duplicate ingestion of the same sample multiple times. This lead to some interesting observations, and suggestions for how to work with data ingestion with this library.

The timings in the saved output in this notebook should be taken with a grain of salt, as they were run on a laptop and not a dedicated computing environment. Absolute numbers aside, the relative trends between different methods should be of interest. Another thing to keep in mind is that this notebook used the TileDB-VCF tutorial data, which consists of VCF files from chr1 of the 1000 genomes project for five individuals. This is to be considered a small dataset, as far as variant calling data goes. This means that the findings of this notebook might be completely different for a larger dataset.

**Key findings:**
- The TileDB-VCF python libary does by default not really communicate warnings or error messages. Wrapping the functions in `try-except`blocks helps a lot for debugging
- Duplicate ingestion of the same sample is possible, and can easily be done by mistake (especially when working in a Jupyter notebook and rerunning individual cells). Wrapping the ingestion method in a conditional that checks which samples are already ingested is likely a good practice.
- As already stated in the TileDB-VCF tutorials, ingestion of the example data from local files was substantially faster than ingestion from S3 bucket.
- Ingestion of multiple samples is faster when the `ds.ingest_sample()` is given a list of all samples compared to when it is given one sample at a time. This is perhaps not very surprising, but good to keep in mind when trying to optimize the work-flow.
- Since TileDB-VCF does not support ingestion of multi-sample VCFs, such files must be split prior to ingestion (e.g. with `bcftools`). This creates an processing overhead that, while unremarkable for the tutorial files, might be massive for large datasets. More testing will be required to learn more about how this scales.

# 1. Initiation

In [395]:
import os
import glob

import tiledb
import tiledbvcf
import numpy as np
import pandas as pd
import shutil
import time
import statistics
import contextlib
import io
from io import StringIO
import subprocess


#Check that Conda and the libraries are installed as expected:
print(f"Current Conda environment: {os.environ['CONDA_DEFAULT_ENV']}")

print(
    f"tiledb v{tiledb.version.version}\n"
    f"numpy v{np.__version__}\n"
    f"tiledb-vcf v{tiledbvcf.version}\n"
)

!bcftools --version

Current Conda environment: tileDB_vcf
tiledb v0.31.1
numpy v1.26.4
tiledb-vcf v0.34.2

bcftools 1.20
Using htslib 1.20
Copyright (C) 2024 Genome Research Ltd.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.


# 2. Ingest some data from S3 bucket and export it to local disk

One of the main [tutorials for TileDB-VCF](https://tiledb-inc.github.io/TileDB-VCF/examples/tutorial_tiledbvcf_basics.html) mention that ingestion of data from a cloud bucket is going to be slower than ingesting it from disk. It would be interesting to compare time it takes to do the two. In order to have a fair comparison, let's ingest the .bcf data from the cloud, and then export to local .bcf files on disk.

Initiate storage for the TileDB array:

In [2]:
vfs = tiledb.VFS(config=tiledb.Config())
array_uri = "./temp_array_storage/ingestion_testing"
if (vfs.is_dir(array_uri)):
    print(f"Deleting existing array '{array_uri}'")
    vfs.remove_dir(array_uri)
    print("Done.")
ds = tiledbvcf.Dataset(uri=array_uri, mode="w")
ds
ds.create_dataset(enable_allele_count=True, enable_variant_stats=True)

# Verify that the array exists
os.listdir(array_uri)

Deleting existing array './temp_array_storage/ingestion_testing'
Done.


['__meta',
 'variant_stats',
 'allele_count',
 'sample_stats',
 'data',
 '__tiledb_group.tdb',
 'metadata',
 '__group']

Like in the previous notebook, we will use the sample data from the 1000 genomes project supplied by TileDB from their S3 bucket:

In [24]:
%%time
vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kgp3-chr1"
batch1_samples = ["HG00096.bcf", "HG00097.bcf", "HG00099.bcf", "HG00100.bcf", "HG00101.bcf"]
batch1_uris = [f"{vcf_bucket}/{s}" for s in batch1_samples]

ds.ingest_samples(sample_uris = batch1_uris)

CPU times: user 16.5 s, sys: 4.25 s, total: 20.7 s
Wall time: 2min 28s


Let's look at the samples:

In [25]:
ds.samples()

Exception: Sample names can only be retrieved for reader

Interestingly, the dataset is locked for operations unless it is set to read or write mode. We need to change it back to read mode first. (There is going to be a lot of back-and-forth with this in this notebook...)

In [26]:
ds = tiledbvcf.Dataset(array_uri, mode = "r")
ds.samples()

['HG00096', 'HG00097', 'HG00099', 'HG00100', 'HG00101']

Let's export the dataset to disk as single-samples files so that we can later compare ingestion speeds between the TileDB S3 bucket and the local files. The files were in BCF format in the bucket, but we can just as well export to VCF as well to compare the size difference between BCF and VCF versions of the same file.

In [38]:
os.makedirs("temp_VCF_exports/untruncated_files", exist_ok=False)

In [39]:
%%time
for sample in batch1_samples:
    sample=sample.rstrip(".bcf")
    ds.export(
        samples = [sample],
        output_format = 'b',
        output_dir = 'temp_VCF_exports/untruncated_files'
    )
    ds.export(
        samples = [sample],
        output_format = 'v',
        output_dir = 'temp_VCF_exports/untruncated_files'
    )

CPU times: user 10.6 s, sys: 953 ms, total: 11.5 s
Wall time: 8.57 s


(Sanity-check: the expected behaviour is that the export method should not include variants that have empty genotype calls for a given sample. The VCF notation for that is `.|.`. Thus, the following command should return empty:

In [392]:
!bcftools view temp_VCF_exports/untruncated_files/HG00099.bcf | grep -F '.|.'

Which it did. Good.)

As expected, the BCF version is a fraction of the size of the VCF version of the same data file.

In [42]:
file_path = "temp_VCF_exports/untruncated_files"
vcf_files = glob.glob(os.path.join(file_path, "*.*cf"))

for file in vcf_files:
    file_size_bytes = os.path.getsize(file)
    file_size_mb = file_size_bytes / (1024 * 1024)
    print(f"{file}\t{file_size_mb:.2f} Mb")

temp_VCF_exports/untruncated_files/HG00101.bcf	6.19 Mb
temp_VCF_exports/untruncated_files/HG00100.bcf	6.24 Mb
temp_VCF_exports/untruncated_files/HG00099.bcf	6.14 Mb
temp_VCF_exports/untruncated_files/HG00096.vcf	27.92 Mb
temp_VCF_exports/untruncated_files/HG00097.vcf	28.53 Mb
temp_VCF_exports/untruncated_files/HG00099.vcf	27.90 Mb
temp_VCF_exports/untruncated_files/HG00100.vcf	28.34 Mb
temp_VCF_exports/untruncated_files/HG00101.vcf	28.11 Mb
temp_VCF_exports/untruncated_files/HG00097.bcf	6.28 Mb
temp_VCF_exports/untruncated_files/HG00096.bcf	6.14 Mb


All seemed to work fine. But while working on the code in section 1, I noticed that the `ds.export()`command seemed to run slower if I had iterated over the ingestion command without first having re-initated the dataset. Because of this observation, we will now digress from the intended benchmarking for little bit to investigate how duplicate samples are handled.

# 3. Single sample ingestion: how are duplicate samples handled?

While working on section 2, it seemed that rerunning the `.ingest_samples()` command with the same samples seemed to rewrite the dataset without any warning message. This made me suspicious about how TileDB-VCF handles duplicate rows. We will now investigate this in more detail.

Granted, I am currently working on this in a Jupyter notebook, rerunning code-blocks and not the whole script. This means that the TileDB-VCF dataset is not automatically re-initated everytime the code is run. For an script implementation of TileDB-VCF, it is less likely that such duplication issues occurs, as the TileDB-VCF array will probably be re-initated in each iteration. But it is still worth learning more about the behaviour.

## 3.1. Understanding the behaviour

In [32]:
if (vfs.is_dir(array_uri)):
    print(f"Deleting existing array '{array_uri}'")
    vfs.remove_dir(array_uri)
    print("Done.")
ds = tiledbvcf.Dataset(uri=array_uri, mode="w")
ds
ds.create_dataset(enable_allele_count=True, enable_variant_stats=True)

# Verify that the array exists
os.listdir(array_uri)

Deleting existing array './temp_array_storage/ingestion_testing'
Done.


['__meta',
 'variant_stats',
 'allele_count',
 'sample_stats',
 'data',
 '__tiledb_group.tdb',
 'metadata',
 '__group']

Start by ingesting a single sample from S3:

In [33]:
vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kgp3-chr1"
single_sample_uri = [f"{vcf_bucket}/HG00096.bcf"]

In [34]:
%%time
ds.ingest_samples(sample_uris = single_sample_uri)

CPU times: user 3.55 s, sys: 1.22 s, total: 4.77 s
Wall time: 31.4 s


What happens if we ingest the same sample twice? Will it overwrite the previous data (which I what I assume, given that there are scaffold coordinates for each row), or will it append the duplicate lines? We can for instance investigate this by creating pandas dataframes from the TileDB dataset before and after duplicate ingestion. Just for fun, let's also export the files to VCF.

Interestingly, the `.export()` method can only export files to a specific file name when the `merge=True` flag is set. This flag is intended for combined VCFs (multi-sample VCFs), and uses `output_path` instead of `output_dir`. But here we can make a small hack to export a single file this way by asking for a merged file for only one sample.

In [37]:
%%time
ds = tiledbvcf.Dataset(array_uri, mode = "r")
test_df1 = ds.read(
    samples=["HG00096"],
    attrs=[
        "sample_name",
        "contig",
        "pos_start",
        "pos_end",
        "alleles",
        "fmt_GT",
    ],
)
for sample in batch1_samples:
    sample=sample.rstrip(".bcf")
    ds.export(
        samples = [sample],
        merge = True,
        output_format = 'v',
        output_path = f"temp_VCF_exports/untruncated_files/test_{sample}.vcf",
    )

CPU times: user 16.3 s, sys: 238 ms, total: 16.5 s
Wall time: 16 s


In [38]:
test_df1

Unnamed: 0,sample_name,contig,pos_start,pos_end,alleles,fmt_GT
0,HG00096,1,10177,10177,"[A, AC]","[1, 0]"
1,HG00096,1,10352,10352,"[T, TA]","[1, 0]"
2,HG00096,1,10616,10637,"[CCGCCGTTGCAAAGGCGCGCCG, C]","[1, 1]"
3,HG00096,1,14464,14464,"[A, T]","[1, 1]"
4,HG00096,1,14930,14930,"[A, G]","[1, 0]"
...,...,...,...,...,...,...
320469,HG00096,1,249240051,249240051,"[T, TA]","[0, 1]"
320470,HG00096,1,249240099,249240099,"[T, TA]","[1, 0]"
320471,HG00096,1,249240219,249240219,"[A, T]","[1, 0]"
320472,HG00096,1,249240537,249240539,"[GGT, G]","[1, 0]"


Now, let's ingest the same sample once again and learn how TileDB handles duplicate data:

In [39]:
%%time
ds = tiledbvcf.Dataset(array_uri, mode = "w")
ds.ingest_samples(sample_uris = batch1_uris)

CPU times: user 3.77 s, sys: 1.14 s, total: 4.91 s
Wall time: 30 s


Interestingly, the time to ingest a new sample is about the same as the first sample (this operation is what TileDB is supposedly good at), but when we export it to df and file, it is clear that the repeated ingestion has, infact, led to longer processing times:

In [40]:
%%time
ds = tiledbvcf.Dataset(array_uri, mode = "r")
test_df2 = ds.read(
    samples=["HG00096"],
    attrs=[
        "sample_name",
        "contig",
        "pos_start",
        "pos_end",
        "alleles",
        "fmt_GT",
    ],
)

for sample in batch1_samples:
    sample=sample.rstrip(".bcf")
    ds.export(
        samples = [sample],
        merge = True,
        output_format = 'v',
        output_path = f"temp_VCF_exports/untruncated_files/test_{sample}2.vcf",
    )

CPU times: user 2min 22s, sys: 750 ms, total: 2min 23s
Wall time: 2min 20s


Comparing the number of rows and columns in the two dataframes clearly show that the data from the second ingestion (the duplicate) was appended to the file. This is not the the behaviour I expected from parser for a co-ordinate based dataset.

In [396]:
print(f"The single-ingestion (test_df1) contains:\t {(rows := test_df1.shape[0])} rows, {(columns := test_df1.shape[1])} columns.")
print(f"The duplicate-ingestion (test_df2) contains:\t {(rows := test_df2.shape[0])} rows, {(columns := test_df2.shape[1])} columns.")

The single-ingestion (test_df1) contains:	 320474 rows, 6 columns.
The duplicate-ingestion (test_df2) contains:	 640948 rows, 6 columns.


Given the assumption that each chromosomal position only has a single entry in the original BCF, we can check to see if the first row of `test_df1` is duplicated in `test_df2`.

In [54]:
test_df1[test_df1["pos_start"] == 10177]

Unnamed: 0,sample_name,contig,pos_start,pos_end,alleles,fmt_GT
0,HG00096,1,10177,10177,"[A, AC]","[1, 0]"


In [55]:
test_df2[test_df2["pos_start"] == 10177]

Unnamed: 0,sample_name,contig,pos_start,pos_end,alleles,fmt_GT
0,HG00096,1,10177,10177,"[A, AC]","[1, 0]"
320474,HG00096,1,10177,10177,"[A, AC]","[1, 0]"


Indeed it is. And the duplication is on row 320474, which is one row below the last row of `test_df1` (320474). As the pandas index implies, it seem that TileDB-VCF does not take the values of the duplicated rows into consideration. Instead it just appends the duplicate, storing it as another row object in the dataset. Perhaps this is just how the pandas dataframe instance of the TileDB array (ds) is organized? If two samples have their own unique variant in the same position, would the dataframe handle that as two different rows? The examples from the previous notebook seem to imply that.

(However, if we were to add a different additional sample, we would expect that the `ds.export()` will rewrite the data on a row basis (adding another sample column to each row).)

It is always good practice to verify the findings with an additional method. For what it is worth, we can also see that the duplication is also reflected in the file size of the exported files:

In [13]:
file_path = "temp_VCF_exports/untruncated_files"
vcf_files = glob.glob(os.path.join(file_path, "test*.*cf"))

for file in vcf_files:
    file_size_bytes = os.path.getsize(file)
    file_size_mb = file_size_bytes / (1024 * 1024)
    print(f"{file}\t{file_size_mb:.2f} Mb")

temp_VCF_exports/untruncated_files/test_HG00096.vcf	27.92 Mb
temp_VCF_exports/untruncated_files/test_HG000962.vcf	55.83 Mb


## 3.2 A suggestion as for how to avoid ingesting duplicate samples 

One way to avoid it duplicate ingestions to declare new objects `ds`, `ds2`, etc. for each ingestion. Another way, that is perhaps even better (?), would be to start up a new array_uri every time. This behaviour is probably also a reason why the read-write permissions on the dataset is strict.

But to prevent this from happening, we could implement a conditional check that ensures that samples cannot be added twice to the TileDB-VCF array

In [130]:
if (vfs.is_dir(array_uri)):
    print(f"Deleting existing array '{array_uri}'")
    vfs.remove_dir(array_uri)
    print("Done.")
ds = tiledbvcf.Dataset(uri=array_uri, mode="w")
ds
ds.create_dataset(enable_allele_count=True, enable_variant_stats=True)

# Verify that the array exists
os.listdir(array_uri)

Deleting existing array './temp_array_storage/ingestion_testing'
Done.


['__meta',
 'variant_stats',
 'allele_count',
 'sample_stats',
 'data',
 '__tiledb_group.tdb',
 'metadata',
 '__group']

Quick check to ensure that the re-initated dataset object contains no samples:

In [131]:
ds = tiledbvcf.Dataset(uri=array_uri, mode="r")
ds.samples()

[]

We can ensure that doublet ingestion does not occur by wrapping it in a logic sequence that uses `ds.samples()` to check which samples are alredy ingested in `ds`. It turns out try-except is needed to get error messages out of `ds.ingest_samples()`, so we can add that as well.

In [187]:
def ingest_samples_without_risk_of_duplication(new_samples: list):
    ds = tiledbvcf.Dataset(uri=array_uri, mode="r")
    samples_to_ingest = [sample for sample in new_samples if os.path.basename(sample).rstrip(".bcf") not in (existing_samples := list(ds.samples()))]
    if samples_to_ingest:
        ds = tiledbvcf.Dataset(uri=array_uri, mode="w")
        try:
            ds.ingest_samples(samples_to_ingest)
            print(f"Successfully ingested sample: {samples_to_ingest}")
        except RuntimeError as e:
            print(f"Failed to ingest sample '{samples_to_ingest}': {e}")
        print(f"Ingested samples: {samples_to_ingest}")
    else:
        print("No new samples to ingest: all samples are already present in the dataset.")

In [126]:
%%time

vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kgp3-chr1"
new_samples = [f"{vcf_bucket}/HG00096.bcf"]

ingest_samples_without_risk_of_duplication(new_samples)

Ingested samples: ['s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kgp3-chr1/HG00096.bcf']
CPU times: user 3.95 s, sys: 1.28 s, total: 5.22 s
Wall time: 37.7 s


To show that this function works for adding new samples that are not duplicates, we can test the following:

In [128]:
%%time

vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kgp3-chr1"
new_samples = [f"{vcf_bucket}/HG00096.bcf",f"{vcf_bucket}/HG00097.bcf"]

ingest_samples_without_risk_of_duplication(new_samples)

Ingested samples: ['s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kgp3-chr1/HG00097.bcf']
CPU times: user 3.95 s, sys: 1.17 s, total: 5.12 s
Wall time: 33.6 s


This last step should ingest the new sample and ignore the first sample. Which it did. Good!

An additional sanity-check:

In [129]:
ds = tiledbvcf.Dataset(uri=array_uri, mode="r")
ds.samples()

['HG00096', 'HG00097']

So, to iterate: the occurance of the duplication issue is likely to be most prominent in a Jupyter notebook work-flow, and might be alleviated by a script work-flow where the TileDB array is re-initiated upon each iteration. Still, the suggested solution will probably be useful also in that case to catch this issue if it was to occur.

# 4. Comparing the time required to load the S3 tutorial files vs their local versions

OK, now that we know how to ensure that ingestion is not duplicated upon iteration of the code blocks, let's go back to the original question: how much slower is it to load the files from S3 than from their local exported versions?

TileDB-VCF requires BCF files to be indexed prior to ingestion. This can be done with `bcftools`, which is the gold standard for massaging BCF files. The index files are saved as `[FILENAME].bcf.csi`, and, as one would expect, need to be in the same directory as their associated BCF file in order to be ingested by TileDB-VCF.


In [398]:
path_to_local_dir = "temp_VCF_exports/untruncated_files"
sample_list = ["HG00096.bcf", "HG00097.bcf", "HG00099.bcf", "HG00100.bcf", "HG00101.bcf"]
new_samples = [f"{path_to_local_dir}/{s}" for s in sample_list]
for file in new_samples:
    !bcftools index {file}

The `%%time` magic command does not support saving its output to variable, so let's implement another way of timing the commands using the `time` python library. To get some basic statistics on how this performs on the current machine, we can wrap the code in a loop and make, say, ten iterations and calculate the average and standard deviation of the processing time. 

Note! The line `with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):` is added to suppress the messages from `ingest_samples_without_risk_of_duplication()`. This is perhaps a little risky behaviour, since the messages are very useful. But for the iterative benchmarking, let's allow us the pleasure to silence the messages. 

First, we assess the time it takes to ingest the local samples:

In [195]:
timing_results = []
number_of_iterations = 10

for i in range(number_of_iterations):
    start_time = time.time()
    if (vfs.is_dir(array_uri)):
        vfs.remove_dir(array_uri)
    print(f"Iteration {i}: creating new array '{array_uri}'")
    ds = tiledbvcf.Dataset(uri=array_uri, mode="w")
    ds
    ds.create_dataset(enable_allele_count=True, enable_variant_stats=True)
    
    path_to_local_dir = "temp_VCF_exports/untruncated_files"
    sample_list = ["HG00096.bcf", "HG00097.bcf", "HG00099.bcf", "HG00100.bcf", "HG00101.bcf"]
    new_samples = [f"{path_to_local_dir}/{s}" for s in sample_list]
    
    with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
        ingest_samples_without_risk_of_duplication(new_samples)
    
    end_time = time.time()
    elapsed_time = end_time - start_time
    timing_results.append(elapsed_time)

average_time = statistics.mean(timing_results)
std_dev_time = statistics.stdev(timing_results)

print(f"It took {average_time:.2f} +/- {std_dev_time:.2f} s ({number_of_iterations} iterations)")

Iteration 0: creating new array './temp_array_storage/ingestion_testing'
Iteration 1: creating new array './temp_array_storage/ingestion_testing'
Iteration 2: creating new array './temp_array_storage/ingestion_testing'
Iteration 3: creating new array './temp_array_storage/ingestion_testing'
Iteration 4: creating new array './temp_array_storage/ingestion_testing'
Iteration 5: creating new array './temp_array_storage/ingestion_testing'
Iteration 6: creating new array './temp_array_storage/ingestion_testing'
Iteration 7: creating new array './temp_array_storage/ingestion_testing'
Iteration 8: creating new array './temp_array_storage/ingestion_testing'
Iteration 9: creating new array './temp_array_storage/ingestion_testing'
It took 8.64 +/- 0.26 s (10 iterations)


Then, we do the same for the same samples in the S3 bucket:

In [178]:
timing_results = []
number_of_iterations = 10

for i in range(number_of_iterations):
    start_time = time.time()
    if (vfs.is_dir(array_uri)):
        vfs.remove_dir(array_uri)
    print(f"Iteration {i}: creating new array '{array_uri}'")
    ds = tiledbvcf.Dataset(uri=array_uri, mode="w")
    ds
    ds.create_dataset(enable_allele_count=True, enable_variant_stats=True)
    
    vcf_bucket = "s3://tiledb-inc-demo-data/examples/notebooks/vcfs/1kgp3-chr1"
    sample_list = ["HG00096.bcf", "HG00097.bcf", "HG00099.bcf", "HG00100.bcf", "HG00101.bcf"]
    new_samples = [f"{vcf_bucket}/{s}" for s in sample_list]

    with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
        ingest_samples_without_risk_of_duplication(new_samples)
    
    end_time = time.time()
    elapsed_time = end_time - start_time
    timing_results.append(elapsed_time)

average_time = statistics.mean(timing_results)
std_dev_time = statistics.stdev(timing_results)

print(f"It took {average_time:.2f} +/- {std_dev_time:.2f} s ({number_of_iterations} iterations)")

Iteration 0: creating new array './temp_array_storage/ingestion_testing'
Iteration 1: creating new array './temp_array_storage/ingestion_testing'
Iteration 2: creating new array './temp_array_storage/ingestion_testing'
Iteration 3: creating new array './temp_array_storage/ingestion_testing'
Iteration 4: creating new array './temp_array_storage/ingestion_testing'
Iteration 5: creating new array './temp_array_storage/ingestion_testing'
Iteration 6: creating new array './temp_array_storage/ingestion_testing'
Iteration 7: creating new array './temp_array_storage/ingestion_testing'
Iteration 8: creating new array './temp_array_storage/ingestion_testing'
Iteration 9: creating new array './temp_array_storage/ingestion_testing'
It took 154.62 +/- 5.77 s (10 iterations)


These results are not unexpected, as they confirm what the TileDB-VCF tutorials already showed: it does take substantially longer time to ingest these files from S3. The difference was longer on this machine than the 15 extra seconds or so that the [tutorial estimated](https://tiledb-inc.github.io/TileDB-VCF/examples/tutorial_tiledbvcf_basics.html). I'd still be careful to extrapolate this to any general statement about local files versus S3 hosted files, but there seem to be a trend. Also to keep in mind is that this is the outcome of these tutorial files and this particular S3 bucket. Other cloud storage solutions might have other performance, based on user permissions and bandwidth limitations. 

Update: this [forum thread](https://forum.tiledb.com/t/s3-first-access-very-slow-with-3d-tiled-dense-array/416/) explains that it has to do with the limitations of how many objects S3 will return per request, meaning that the API typically will need to make multiple requests. So this seem to be a known bottleneck


# 5. Does the ingestion time increase linearly with increasing sample count?

Previous results in the notebook seemed to imply that the ingestion time was linearly proportional to the number of ingested samples. To further investigate if this is the case, we can modify the last code block to time the ingestion of each sample. To do that, we will need to loop over each sample in the `new_samples` list and feed them one-by-one to `ingest_samples_without_risk_of_duplication()`. We will use the local files to test this, since it is proved to be much faster than the S3 bucket ingestion.

In [285]:
timing_results = []
number_of_iterations = 10
dict_elapsed_time_single_sample = {i: [] for i in range(len(new_samples))}

for i in range(number_of_iterations):
    start_time = time.time()
    if (vfs.is_dir(array_uri)):
        vfs.remove_dir(array_uri)
    print(f"Iteration {i}: creating new array '{array_uri}'")
    ds = tiledbvcf.Dataset(uri=array_uri, mode="w")
    ds
    ds.create_dataset(enable_allele_count=True, enable_variant_stats=True)
    
    path_to_local_dir = "temp_VCF_exports/untruncated_files"
    sample_list = ["HG00096.bcf", "HG00097.bcf", "HG00099.bcf", "HG00100.bcf", "HG00101.bcf"]
    new_samples = [f"{path_to_local_dir}/{s}" for s in sample_list]

    # Ingest the samples in the new_samples list one by one. Note that the function needs a list as input.
    for j, sample in enumerate(new_samples):
        start_single_sample = time.time()
        with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
            ingest_samples_without_risk_of_duplication([sample])
        end_single_sample = time.time()
        elapsed_time_single_sample= end_single_sample - start_single_sample
        dict_elapsed_time_single_sample[j].append(elapsed_time_single_sample)
        
    end_time = time.time()
    elapsed_time = end_time - start_time
    timing_results.append(elapsed_time)

# Calculate average and standard deviation for each sample
for j in range(len(sample_list)):
    average_time = statistics.mean(dict_elapsed_time_single_sample[j])
    std_dev_time = statistics.stdev(dict_elapsed_time_single_sample[j])
    print(f"Sample {sample_list[j]}: {average_time:.2f} +/- {std_dev_time:.2f} s ({number_of_iterations} iterations)")

# Calculate overall average and standard deviation
total_average_time = statistics.mean(timing_results)
total_std_dev_time = statistics.stdev(timing_results)
print(f"Total runtime: {total_average_time:.2f} +/- {total_std_dev_time:.2f} s ({number_of_iterations} iterations)")

Iteration 0: creating new array './temp_array_storage/ingestion_testing'
Iteration 1: creating new array './temp_array_storage/ingestion_testing'
Iteration 2: creating new array './temp_array_storage/ingestion_testing'
Iteration 3: creating new array './temp_array_storage/ingestion_testing'
Iteration 4: creating new array './temp_array_storage/ingestion_testing'
Iteration 5: creating new array './temp_array_storage/ingestion_testing'
Iteration 6: creating new array './temp_array_storage/ingestion_testing'
Iteration 7: creating new array './temp_array_storage/ingestion_testing'
Iteration 8: creating new array './temp_array_storage/ingestion_testing'
Iteration 9: creating new array './temp_array_storage/ingestion_testing'
Sample HG00096.bcf: 3.29 +/- 0.16 s (10 iterations)
Sample HG00097.bcf: 3.38 +/- 0.24 s (10 iterations)
Sample HG00099.bcf: 3.29 +/- 0.23 s (10 iterations)
Sample HG00100.bcf: 3.41 +/- 0.27 s (10 iterations)
Sample HG00101.bcf: 3.31 +/- 0.10 s (10 iterations)
Total runt

Interestingly, this took about twice as long compared to the code where the all samples were passed in a list once to the `ds.ingest_sample()`: 8 s versus 16 s in this case. Also, for this data, we can see that each sample took equally long to ingest (~3 s).

There is clearly an extra overhead of looping over all samples and reinitiating the process. If this is mostly due to the python code, or how `ds.ingest_sample()` works is difficult to say from this small experiment, but it shows that if multiple samples are to be ingested, they should be sent to the method as one big list of samples instead of sample-by-sample.

# 6. The case of multi-samples VCFs: some lessions and unexpected turns 

TileDB-VCF tools does not support ingestion of multi-sample VCF files. This is perhaps one of its biggest drawbacks in terms of useability and performance. This essentially means that the user will need to first split any multi-samples VCFs with another tool, which is a process that is known to take a long time (depending on the data and the machine).

Let's simulate a multi-sample BCF by ingesting the single-samples files and then exporting a multi-sample file.

In [348]:
%%time
if (vfs.is_dir(array_uri)):
    vfs.remove_dir(array_uri)
ds = tiledbvcf.Dataset(uri=array_uri, mode="w")
ds.create_dataset(enable_allele_count=True, enable_variant_stats=True)
    
path_to_local_dir = "temp_VCF_exports/untruncated_files"
sample_list = ["HG00096.bcf", "HG00097.bcf", "HG00099.bcf", "HG00100.bcf", "HG00101.bcf"]
new_samples = [f"{path_to_local_dir}/{s}" for s in sample_list]
    
ingest_samples_without_risk_of_duplication(new_samples)

ds = tiledbvcf.Dataset(uri=array_uri, mode="r")
ds.export(
    samples = ds.samples()[0:5],
    merge = True,
    output_format = 'b',
    output_path = 'temp_VCF_exports/untruncated_files/combined_HG00096-101.bcf',
)

Successfully ingested sample: ['temp_VCF_exports/untruncated_files/HG00096.bcf', 'temp_VCF_exports/untruncated_files/HG00097.bcf', 'temp_VCF_exports/untruncated_files/HG00099.bcf', 'temp_VCF_exports/untruncated_files/HG00100.bcf', 'temp_VCF_exports/untruncated_files/HG00101.bcf']
Ingested samples: ['temp_VCF_exports/untruncated_files/HG00096.bcf', 'temp_VCF_exports/untruncated_files/HG00097.bcf', 'temp_VCF_exports/untruncated_files/HG00099.bcf', 'temp_VCF_exports/untruncated_files/HG00100.bcf', 'temp_VCF_exports/untruncated_files/HG00101.bcf']
CPU times: user 2min 33s, sys: 3.5 s, total: 2min 36s
Wall time: 2min 29s


We can view the file to ensure that it has been correctly exported. (There are ~250 lines of `##` metadata headers in these VCFs, so we can omit them for the sake of readability)

In [352]:
!bcftools view temp_VCF_exports/untruncated_files/combined_HG00096-101.bcf | grep -v "^##" | head -n 10

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG00096	HG00097	HG00099	HG00100	HG00101
1	10177	rs367896724	A	AC	100	PASS	VT=INDEL;AA=|||unknown(NO_COVERAGE);DP=412608;END=10177;AN=8;AC=4	GT	1|0	0|1	0|1	1|0	./.
1	10352	rs555500075	T	TA	100	PASS	VT=INDEL;AA=|||unknown(NO_COVERAGE);DP=444575;AC=5;AN=10;END=10352	GT	1|0	1|0	0|1	0|1	1|0
1	10616	rs376342519	CCGCCGTTGCAAAGGCGCGCCG	C	100	PASS	VT=INDEL;DP=11825;AC=10;AN=10;END=10637	GT	1|1	1|1	1|1	1|1	1|1
1	13110	rs540538026	G	A	100	PASS	VT=SNP;AA=g|||;DP=23422;AC=1;AN=2;END=13110	GT	./.	1|0	./.	./.	./.
1	13116	rs62635286	T	G	100	PASS	VT=SNP;AA=t|||;DP=44680;AC=2;AN=4;END=13116	GT	./.	1|0	./.	./.	1|0
1	13118	rs200579949	A	G	100	PASS	VT=SNP;AA=a|||;DP=42790;AC=2;AN=4;END=13118	GT	./.	1|0	./.	./.	1|0
1	14464	rs546169444	A	T	100	PASS	VT=SNP;AA=a|||;DP=53522;AC=3;AN=4;END=14464	GT	1|1	./.	1|0	./.	./.
1	14599	rs531646671	T	A	100	PASS	VT=SNP;AA=t|||;DP=64162;AC=2;AN=4;END=14599	GT	./.	0|1	1|0	./.	./.
1	14604	rs541940975	A	G	100	PASS	VT=SNP;AA=a|||;DP=

It looks good, but to make it more readable, let's import it to pandas and use the pretty notebook rendering of dataframes:

In [355]:
command = "bcftools view temp_VCF_exports/untruncated_files/combined_HG00096-101.bcf | grep -v '^##' | head -n 10"
result = subprocess.run(command, shell=True, capture_output=True, text=True)
bcfdf = pd.read_csv(StringIO(result.stdout), sep='\t')
bcfdf

Unnamed: 0,#CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT,HG00096,HG00097,HG00099,HG00100,HG00101
0,1,10177,rs367896724,A,AC,100,PASS,VT=INDEL;AA=|||unknown(NO_COVERAGE);DP=412608;...,GT,1|0,0|1,0|1,1|0,./.
1,1,10352,rs555500075,T,TA,100,PASS,VT=INDEL;AA=|||unknown(NO_COVERAGE);DP=444575;...,GT,1|0,1|0,0|1,0|1,1|0
2,1,10616,rs376342519,CCGCCGTTGCAAAGGCGCGCCG,C,100,PASS,VT=INDEL;DP=11825;AC=10;AN=10;END=10637,GT,1|1,1|1,1|1,1|1,1|1
3,1,13110,rs540538026,G,A,100,PASS,VT=SNP;AA=g|||;DP=23422;AC=1;AN=2;END=13110,GT,./.,1|0,./.,./.,./.
4,1,13116,rs62635286,T,G,100,PASS,VT=SNP;AA=t|||;DP=44680;AC=2;AN=4;END=13116,GT,./.,1|0,./.,./.,1|0
5,1,13118,rs200579949,A,G,100,PASS,VT=SNP;AA=a|||;DP=42790;AC=2;AN=4;END=13118,GT,./.,1|0,./.,./.,1|0
6,1,14464,rs546169444,A,T,100,PASS,VT=SNP;AA=a|||;DP=53522;AC=3;AN=4;END=14464,GT,1|1,./.,1|0,./.,./.
7,1,14599,rs531646671,T,A,100,PASS,VT=SNP;AA=t|||;DP=64162;AC=2;AN=4;END=14599,GT,./.,0|1,1|0,./.,./.
8,1,14604,rs541940975,A,G,100,PASS,VT=SNP;AA=a|||;DP=58462;AC=2;AN=4;END=14604,GT,./.,0|1,1|0,./.,./.


If we compare this to `HG00096.bcf` (below), we can see that in the combined VCF 
- a) there are sample headers for all samples,
- b) there are, as expected, rows for variants that not present in HG00096.bcf
- c) there are row like number 3 that only contain variants for one of the samples,
- d) there are, in e.g. line 0, positions where the different sample have different genotypes e.g. 0|1 (phased diploid, first allele has REF variant, second allele has the first variant in ALT) and 1|0.

From a quick inspection of these particular example files, I did not find any examples of an ALT column that has been updated with more variants after combining the samples. Nevertheless, when that happens, the ALT column values and the genotype call needs to be updated for all samples. Without knowing too much about how VCF merging software operate, I can only make a guess that the operation to read and potentially update each row is likely to be a computational burden.

In [356]:
command = "bcftools view temp_VCF_exports/untruncated_files/HG00096.bcf | grep -v '^##' | head -n 10"
result = subprocess.run(command, shell=True, capture_output=True, text=True)
bcfdf_HG00096 = pd.read_csv(StringIO(result.stdout), sep='\t')
bcfdf_HG00096

Unnamed: 0,#CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT,HG00096
0,1,10177,rs367896724,A,AC,100,PASS,END=10177;AC=1;AN=2;DP=103152;AA=|||unknown(NO...,GT,1|0
1,1,10352,rs555500075,T,TA,100,PASS,END=10352;AC=1;AN=2;DP=88915;AA=|||unknown(NO_...,GT,1|0
2,1,10616,rs376342519,CCGCCGTTGCAAAGGCGCGCCG,C,100,PASS,END=10637;AC=2;AN=2;DP=2365;VT=INDEL,GT,1|1
3,1,14464,rs546169444,A,T,100,PASS,END=14464;AC=2;AN=2;DP=26761;AA=a|||;VT=SNP,GT,1|1
4,1,14930,rs75454623,A,G,100,PASS,END=14930;AC=1;AN=2;DP=42231;AA=a|||;VT=SNP,GT,1|0
5,1,15211,rs78601809,T,G,100,PASS,END=15211;AC=1;AN=2;DP=32245;AA=t|||;VT=SNP,GT,0|1
6,1,15274,rs62636497,A,"G,T",100,PASS,"END=15274;AC=1,1;AN=2;DP=23255;AA=g|||;VT=SNP;...",GT,1|2
7,1,15820,rs2691315,G,T,100,PASS,END=15820;AC=1;AN=2;DP=14933;AA=t|||;VT=SNP;EX...,GT,1|0
8,1,15903,rs557514207,G,GC,100,PASS,END=15903;AC=1;AN=2;DP=7012;AA=ccc|CC|CCC|dele...,GT,0|1


Attempting to ingest the new multi-sample VCF will result in a TileDB-VCF error (as was known):

In [357]:
%%time
if (vfs.is_dir(array_uri)):
    vfs.remove_dir(array_uri)
ds = tiledbvcf.Dataset(uri=array_uri, mode="w")
ds.create_dataset(enable_allele_count=True, enable_variant_stats=True)
    
path_to_local_dir = "temp_VCF_exports/untruncated_files"
sample_list = ["combined_HG00096-101.bcf"]
new_samples = [f"{path_to_local_dir}/{s}" for s in sample_list]
    
ingest_samples_without_risk_of_duplication(new_samples)

Failed to ingest sample '['temp_VCF_exports/untruncated_files/combined_HG00096-101.bcf']': TileDB-VCF exception: Combined VCFs are current not suppported
Ingested samples: ['temp_VCF_exports/untruncated_files/combined_HG00096-101.bcf']
CPU times: user 26 ms, sys: 80.2 ms, total: 106 ms
Wall time: 151 ms


Instead, let's explore how long time it takes to split the multi-sample BCF back to single-sample BCFs, and then ingesting them. Of course, `bcftools` offers options how to do this. There is even a special plugin call `split`for it, and the `bcftools`version used in this conda environment (see Section 1) at the time of writing includes the plugin.

There are several options in `bcftools split`that can be used to preserve the different combined column values from the multi-sample VCF, but for now let's just make a default split, making one .bcf file per sample in the multi-sample BCF.

In [358]:
%%time
!bcftools +split temp_VCF_exports/untruncated_files/combined_HG00096-101.bcf -Ob -o temp_VCF_exports/untruncated_files/resplit_files

CPU times: user 20.6 ms, sys: 88.4 ms, total: 109 ms
Wall time: 3.65 s


For these files, which are small tutorial files, this was a fast operation. But this might take substantially longer for larger files with a higher sample number.

In [359]:
command = "bcftools view temp_VCF_exports/untruncated_files/resplit_files/HG00096.bcf | grep -v '^##'| head -n 10"
result = subprocess.run(command, shell=True, capture_output=True, text=True)
bcfdf_HG00096_resplit = pd.read_csv(StringIO(result.stdout), sep='\t')
bcfdf_HG00096_resplit

Unnamed: 0,#CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT,HG00096
0,1,10177,rs367896724,A,AC,100,PASS,VT=INDEL;AA=|||unknown(NO_COVERAGE);DP=412608;...,GT,1|0
1,1,10352,rs555500075,T,TA,100,PASS,VT=INDEL;AA=|||unknown(NO_COVERAGE);DP=444575;...,GT,1|0
2,1,10616,rs376342519,CCGCCGTTGCAAAGGCGCGCCG,C,100,PASS,VT=INDEL;DP=11825;AC=10;AN=10;END=10637,GT,1|1
3,1,13110,rs540538026,G,A,100,PASS,VT=SNP;AA=g|||;DP=23422;AC=1;AN=2;END=13110,GT,./.
4,1,13116,rs62635286,T,G,100,PASS,VT=SNP;AA=t|||;DP=44680;AC=2;AN=4;END=13116,GT,./.
5,1,13118,rs200579949,A,G,100,PASS,VT=SNP;AA=a|||;DP=42790;AC=2;AN=4;END=13118,GT,./.
6,1,14464,rs546169444,A,T,100,PASS,VT=SNP;AA=a|||;DP=53522;AC=3;AN=4;END=14464,GT,1|1
7,1,14599,rs531646671,T,A,100,PASS,VT=SNP;AA=t|||;DP=64162;AC=2;AN=4;END=14599,GT,./.
8,1,14604,rs541940975,A,G,100,PASS,VT=SNP;AA=a|||;DP=58462;AC=2;AN=4;END=14604,GT,./.


So are these two version of this small selection of the dataframe the same?

In [360]:
df_comparison = bcfdf_HG00096.equals(bcfdf_HG00096_resplit)
print(f"Are the dataframes from before and after the merge-and-split are equal: {df_comparison}")

Are the dataframes from before and after the merge-and-split are equal: False


No they are not, since the empty genotypes were not filtered in the split. Also, the INFO column is sorted in a different order.

In [361]:
bcfdf_HG00096

Unnamed: 0,#CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT,HG00096
0,1,10177,rs367896724,A,AC,100,PASS,END=10177;AC=1;AN=2;DP=103152;AA=|||unknown(NO...,GT,1|0
1,1,10352,rs555500075,T,TA,100,PASS,END=10352;AC=1;AN=2;DP=88915;AA=|||unknown(NO_...,GT,1|0
2,1,10616,rs376342519,CCGCCGTTGCAAAGGCGCGCCG,C,100,PASS,END=10637;AC=2;AN=2;DP=2365;VT=INDEL,GT,1|1
3,1,14464,rs546169444,A,T,100,PASS,END=14464;AC=2;AN=2;DP=26761;AA=a|||;VT=SNP,GT,1|1
4,1,14930,rs75454623,A,G,100,PASS,END=14930;AC=1;AN=2;DP=42231;AA=a|||;VT=SNP,GT,1|0
5,1,15211,rs78601809,T,G,100,PASS,END=15211;AC=1;AN=2;DP=32245;AA=t|||;VT=SNP,GT,0|1
6,1,15274,rs62636497,A,"G,T",100,PASS,"END=15274;AC=1,1;AN=2;DP=23255;AA=g|||;VT=SNP;...",GT,1|2
7,1,15820,rs2691315,G,T,100,PASS,END=15820;AC=1;AN=2;DP=14933;AA=t|||;VT=SNP;EX...,GT,1|0
8,1,15903,rs557514207,G,GC,100,PASS,END=15903;AC=1;AN=2;DP=7012;AA=ccc|CC|CCC|dele...,GT,0|1


Apparently, `split`does not have an option for filtering out lines that only contain empty genotypes, but regular `bcftools view`does. To wrap the command in a python variable, we need to escape some quotes, so let's use a `"""` block to handle this:

In [362]:
command = """
    bcftools view -e 'GT="./."' temp_VCF_exports/untruncated_files/resplit_files/HG00096.bcf | grep -v '^##' | head -n 10
    """
result = subprocess.run(command, shell=True, capture_output=True, text=True)
bcfdf_HG00096_resplit_filtered = pd.read_csv(StringIO(result.stdout), sep='\t')
bcfdf_HG00096_resplit_filtered

Unnamed: 0,#CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT,HG00096
0,1,10177,rs367896724,A,AC,100,PASS,VT=INDEL;AA=|||unknown(NO_COVERAGE);DP=412608;...,GT,1|0
1,1,10352,rs555500075,T,TA,100,PASS,VT=INDEL;AA=|||unknown(NO_COVERAGE);DP=444575;...,GT,1|0
2,1,10616,rs376342519,CCGCCGTTGCAAAGGCGCGCCG,C,100,PASS,VT=INDEL;DP=11825;AC=10;AN=10;END=10637,GT,1|1
3,1,14464,rs546169444,A,T,100,PASS,VT=SNP;AA=a|||;DP=53522;AC=3;AN=4;END=14464,GT,1|1
4,1,14930,rs75454623,A,G,100,PASS,VT=SNP;AA=a|||;DP=211155;AC=5;AN=10;END=14930,GT,1|0
5,1,15211,rs78601809,T,G,100,PASS,VT=SNP;AA=t|||;DP=161225;AC=5;AN=10;END=15211,GT,0|1
6,1,15274,rs62636497,A,"G,T",100,PASS,"MULTI_ALLELIC;VT=SNP;AA=g|||;DP=116275;AC=3,7;...",GT,1|2
7,1,15820,rs2691315,G,T,100,PASS,EX_TARGET;VT=SNP;AA=t|||;DP=44799;AC=3;AN=6;EN...,GT,1|0
8,1,15903,rs557514207,G,GC,100,PASS,EX_TARGET;VT=INDEL;AA=ccc|CC|CCC|deletion;DP=2...,GT,0|1


Because of the different sorting in INFO, the dataframe difference remains. But since it seems like the values in INFO are preserved, we can accept this for the sake of this experiment. (Down the line, and with a non-tutorial dataset, it would be important to ensure that the splitting recreates files identical to the files from before the merge).

In [363]:
df_comparison = bcfdf_HG00096.equals(bcfdf_HG00096_resplit_filtered)
print(f"Are the dataframes from before and after the merge-and-split are equal: {df_comparison}")

Are the dataframes from before and after the merge-and-split are equal: False


We can do a quick sanity-check to verify that the number of lines are different in the two files:

In [367]:
!bcftools view temp_VCF_exports/untruncated_files/HG00096.bcf | wc -l

320726


In [370]:
!bcftools view temp_VCF_exports/untruncated_files/resplit_files/HG00096.bcf | wc -l

560069


And here it is clear that the filtering step results in the same number of lines as the original. Very nice:

In [369]:
!bcftools view -e 'GT="./."' temp_VCF_exports/untruncated_files/resplit_files/HG00096.bcf | wc -l

320726


What about the file size? We need to save the filtered the files to disk for later operations anyway, so let's do that already and include them in the file size comparison.

In [385]:
path_to_local_dir = "temp_VCF_exports/untruncated_files/resplit_files"
sample_list = ["HG00096.bcf", "HG00097.bcf", "HG00099.bcf", "HG00100.bcf", "HG00101.bcf"]
new_samples = [f"{path_to_local_dir}/{s}" for s in sample_list]
for file in new_samples:
    basename = file.rstrip(".bcf")
    !bcftools view -e 'GT="./."' {file} -Ob -o {basename}_filtered.bcf

 The split unfiltered files are, as expected from the extra line count, larger. They are more in the size range of the merged multi-sample file, which make sense, since there cannot be more lines than in the multi-sample file.

The split and filtered files are much closer in file size to the original files. The differences can probably be attributed things like how the the `##` metadata preamble was not touched in the splitting and filtering.

In [389]:
file_path = "temp_VCF_exports/untruncated_files/resplit_files"
vcf_files = glob.glob(os.path.join(file_path, "*.bcf"))
vcf_files = [f for f in vcf_files if not f.endswith("filtered.bcf")]

print("- Resplit files, unfiltered for empty genotypes:")
for file in vcf_files:
    file_size_bytes = os.path.getsize(file)
    file_size_mb = file_size_bytes / (1024 * 1024)
    print(f"{file}\t{file_size_mb:.2f} Mb")

vcf_files = glob.glob(os.path.join(file_path, "*filtered.bcf"))

print("- Resplit files, filtered for empty genotypes:")
for file in vcf_files:
    file_size_bytes = os.path.getsize(file)
    file_size_mb = file_size_bytes / (1024 * 1024)
    print(f"{file}\t{file_size_mb:.2f} Mb")
    
print("- Original files, does not contain empty genotypes:")
file_path = "temp_VCF_exports/untruncated_files/"
vcf_files = glob.glob(os.path.join(file_path, "*.bcf"))

for file in vcf_files:
    file_size_bytes = os.path.getsize(file)
    file_size_mb = file_size_bytes / (1024 * 1024)
    print(f"{file}\t{file_size_mb:.2f} Mb")

- Resplit files, unfiltered for empty genotypes:
temp_VCF_exports/untruncated_files/resplit_files/HG00101.bcf	11.14 Mb
temp_VCF_exports/untruncated_files/resplit_files/HG00100.bcf	11.14 Mb
temp_VCF_exports/untruncated_files/resplit_files/HG00099.bcf	11.14 Mb
temp_VCF_exports/untruncated_files/resplit_files/HG00097.bcf	11.14 Mb
temp_VCF_exports/untruncated_files/resplit_files/HG00096.bcf	11.14 Mb
- Resplit files, filtered for empty genotypes:
temp_VCF_exports/untruncated_files/resplit_files/HG00099_filtered.bcf	6.32 Mb
temp_VCF_exports/untruncated_files/resplit_files/HG00101_filtered.bcf	6.37 Mb
temp_VCF_exports/untruncated_files/resplit_files/HG00100_filtered.bcf	6.42 Mb
temp_VCF_exports/untruncated_files/resplit_files/HG00097_filtered.bcf	6.47 Mb
temp_VCF_exports/untruncated_files/resplit_files/HG00096_filtered.bcf	6.32 Mb
- Original files, does not contain empty genotypes:
temp_VCF_exports/untruncated_files/HG00101.bcf	6.19 Mb
temp_VCF_exports/untruncated_files/HG00100.bcf	6.24 Mb
te

Let's ingest the split files and check the performance. First we need to index the new files:

In [364]:
path_to_local_dir = "temp_VCF_exports/untruncated_files/resplit_files"
sample_list = ["HG00096.bcf", "HG00097.bcf", "HG00099.bcf", "HG00100.bcf", "HG00101.bcf"]
new_samples = [f"{path_to_local_dir}/{s}" for s in sample_list]
for file in new_samples:
    !bcftools index {file}

In [366]:
timing_results = []
number_of_iterations = 10

for i in range(number_of_iterations):
    start_time = time.time()
    if (vfs.is_dir(array_uri)):
        vfs.remove_dir(array_uri)
    print(f"Iteration {i}: creating new array '{array_uri}'")
    ds = tiledbvcf.Dataset(uri=array_uri, mode="w")
    ds
    ds.create_dataset(enable_allele_count=True, enable_variant_stats=True)
    
    path_to_local_dir = "temp_VCF_exports/untruncated_files/resplit_files"
    sample_list = ["HG00096.bcf", "HG00097.bcf", "HG00099.bcf", "HG00100.bcf", "HG00101.bcf"]
    new_samples = [f"{path_to_local_dir}/{s}" for s in sample_list]
    
    with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
        ingest_samples_without_risk_of_duplication([sample])
    
    end_time = time.time()
    elapsed_time = end_time - start_time
    timing_results.append(elapsed_time)

average_time = statistics.mean(timing_results)
std_dev_time = statistics.stdev(timing_results)

print(f"It took {average_time:.2f} +/- {std_dev_time:.2f} s ({number_of_iterations} iterations)")

Iteration 0: creating new array './temp_array_storage/ingestion_testing'
Iteration 1: creating new array './temp_array_storage/ingestion_testing'
Iteration 2: creating new array './temp_array_storage/ingestion_testing'
Iteration 3: creating new array './temp_array_storage/ingestion_testing'
Iteration 4: creating new array './temp_array_storage/ingestion_testing'
Iteration 5: creating new array './temp_array_storage/ingestion_testing'
Iteration 6: creating new array './temp_array_storage/ingestion_testing'
Iteration 7: creating new array './temp_array_storage/ingestion_testing'
Iteration 8: creating new array './temp_array_storage/ingestion_testing'
Iteration 9: creating new array './temp_array_storage/ingestion_testing'
It took 3.49 +/- 0.13 s (10 iterations)


The ingestion time actually seem to be slightly faster for the files that have gone through the merge-split than for the original files (3 s versus 8 s). Question is if this is due to the difference in sorting of e.g. the INFO column, or if the data of the files acctually are different. The above cell was run on the non-filtered files (i.e. lines with empty genotype were still present). This could be a future line of investigation

For curiosity, what about the time it takes to ingest the split and filtered files?

In [None]:
path_to_local_dir = "temp_VCF_exports/untruncated_files/resplit_files"
sample_list = ["HG00096_filtered.bcf", "HG00097_filtered.bcf", "HG00099_filtered.bcf", "HG00100_filtered.bcf", "HG00101_filtered.bcf"]
new_samples = [f"{path_to_local_dir}/{s}" for s in sample_list]
for file in new_samples:
    !bcftools index {file}

In [391]:
timing_results = []
number_of_iterations = 10

for i in range(number_of_iterations):
    start_time = time.time()
    if (vfs.is_dir(array_uri)):
        vfs.remove_dir(array_uri)
    print(f"Iteration {i}: creating new array '{array_uri}'")
    ds = tiledbvcf.Dataset(uri=array_uri, mode="w")
    ds
    ds.create_dataset(enable_allele_count=True, enable_variant_stats=True)

    path_to_local_dir = "temp_VCF_exports/untruncated_files/resplit_files"
    sample_list = ["HG00096_filtered.bcf", "HG00097_filtered.bcf", "HG00099_filtered.bcf", "HG00100_filtered.bcf", "HG00101_filtered.bcf"]
    new_samples = [f"{path_to_local_dir}/{s}" for s in sample_list]

    with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
        ingest_samples_without_risk_of_duplication([sample])
    
    end_time = time.time()
    elapsed_time = end_time - start_time
    timing_results.append(elapsed_time)

average_time = statistics.mean(timing_results)
std_dev_time = statistics.stdev(timing_results)

print(f"It took {average_time:.2f} +/- {std_dev_time:.2f} s ({number_of_iterations} iterations)")

Iteration 0: creating new array './temp_array_storage/ingestion_testing'
Iteration 1: creating new array './temp_array_storage/ingestion_testing'
Iteration 2: creating new array './temp_array_storage/ingestion_testing'
Iteration 3: creating new array './temp_array_storage/ingestion_testing'
Iteration 4: creating new array './temp_array_storage/ingestion_testing'
Iteration 5: creating new array './temp_array_storage/ingestion_testing'
Iteration 6: creating new array './temp_array_storage/ingestion_testing'
Iteration 7: creating new array './temp_array_storage/ingestion_testing'
Iteration 8: creating new array './temp_array_storage/ingestion_testing'
Iteration 9: creating new array './temp_array_storage/ingestion_testing'
It took 3.32 +/- 0.07 s (10 iterations)


No, it does not seem to make any difference at all for these tutorial files. Nevertheless, these results make sense in the light of how TileDB-VCF is based on storing data in sparse arrays.

For the sake of the experiment and for reuse in future testing of multi-sample VCFs, we can wrap up the whole multi-sample process splitting and ingestion, and time it. The function:

In [344]:
path_to_local_dir = "temp_VCF_exports/untruncated_files"

def ingest_multi_sample_file(multi_sample_file, dir_split_samples = None):
    if not dir_split_samples:
        dir_split_samples = "temp_VCF_exports/untruncated_files/resplit_files"
        
    # Read the header from the BCF/VCF and save it to a python variable
    view_header_command = f"bcftools view {multi_sample_file} | grep -v '^##' | head -n 1"
    try:
        view_header_result = subprocess.run(view_header_command, shell=True, capture_output=True, text=True)
        output = view_header_result.stdout
        # Make a list of all the samples names 
        sample_list = output.split('\t')[output.split('\t').index("FORMAT") + 1:]
        sample_list[-1] = sample_list[-1].rstrip('\n')
        sample_list = [sample + ".bcf" for sample in sample_list]
    except Exception as e:
        print(f"Error: {e}")
        
    # Split the samples into seprate BCF files
    split_command = f"bcftools +split {multi_sample_file} -Ob -o {dir_split_samples}"
    try:
        subprocess.run(split_command, shell=True, check=True, capture_output=True, text=True)
        print("bcftools +split command executed successfully.")
    except Exception as e:
        print(f"Error: {e}")
    
    new_samples = [f"{dir_split_samples}/{s}" for s in sample_list]

    print("Indexing the split files.")
    for file in new_samples:
        try:
            subprocess.run(f"bcftools index {file}", shell=True, check=True, capture_output=True, text=True)
        except Exception as e:
             print(f"Error: {e}")

    ingest_samples_without_risk_of_duplication(new_samples)

And running it:

In [345]:
%%time
if (vfs.is_dir(array_uri)):
    vfs.remove_dir(array_uri)
ds = tiledbvcf.Dataset(uri=array_uri, mode="w")
ds.create_dataset(enable_allele_count=True, enable_variant_stats=True)

ingest_multi_sample_file("temp_VCF_exports/untruncated_files/combined_HG00096-101.bcf")


bcftools +split command executed successfully.
Indexing the split files.
Successfully ingested sample: ['temp_VCF_exports/untruncated_files/resplit_files/HG00096.bcf', 'temp_VCF_exports/untruncated_files/resplit_files/HG00097.bcf', 'temp_VCF_exports/untruncated_files/resplit_files/HG00099.bcf', 'temp_VCF_exports/untruncated_files/resplit_files/HG00100.bcf', 'temp_VCF_exports/untruncated_files/resplit_files/HG00101.bcf']
Ingested samples: ['temp_VCF_exports/untruncated_files/resplit_files/HG00096.bcf', 'temp_VCF_exports/untruncated_files/resplit_files/HG00097.bcf', 'temp_VCF_exports/untruncated_files/resplit_files/HG00099.bcf', 'temp_VCF_exports/untruncated_files/resplit_files/HG00100.bcf', 'temp_VCF_exports/untruncated_files/resplit_files/HG00101.bcf']
CPU times: user 12.6 s, sys: 2.99 s, total: 15.6 s
Wall time: 15 s


This process, of course, takes longer than the ingestion of the five single-samples files (15 s versus 8 s), but it would be interesting to try this with larger files with more samples to see how this scales.

In all, this notebook resulted many interesting observations about how data ingestion works for TileDB-VCF. As mentioned throughout the notes, the next logical step would be to try to scale this up with a much bigger dataset and see how it performs.