HyDrop produces a steady stream of cell/bead emulsion, and the user chooses how much of that emulsion to carry into droplet PCR and downstream processing. A larger emulsion volume means fewer steps downstream, and thus less work. However, if the number of cells is too high relative to the barcode complexity (the total number of barcodes available to index cells), the probability of barcode collisions (two different cells receiving the same barcode by chance) increases according to a Poisson distribution.

For each replicate, we created an emulsion containing the equivalent of 3000 recovered cells. We split each emulsion into two aliquots and indexed them separately to reduce the chance of barcode collisions. We do this only with the 384x384 version of HyDrop, as its barcode complexity is only ~140k, compared to ~880k for the 96x96x96 version.
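
To get a feel for the numbers, the collision rate can be estimated with a short calculation. This is a minimal sketch, not part of the original pipeline: it assumes every cell draws one barcode uniformly at random and with replacement, which is a simplification. The function names are hypothetical; the cell count and barcode complexities are the approximate figures quoted above.

```python
import math


def collision_fraction(n_cells: int, n_barcodes: int) -> float:
    """Expected fraction of cells sharing their barcode with at least one other cell.

    Assumption: each cell receives one barcode drawn uniformly at random,
    with replacement, from n_barcodes possible barcodes.
    """
    if n_cells < 1 or n_barcodes < 1:
        raise ValueError("n_cells and n_barcodes must be positive")
    # A given cell is collision-free if none of the other cells picked its barcode.
    return 1.0 - (1.0 - 1.0 / n_barcodes) ** (n_cells - 1)


def poisson_collision_fraction(n_cells: int, n_barcodes: int) -> float:
    """Poisson approximation of the same quantity, with rate (n_cells - 1) / n_barcodes."""
    return 1.0 - math.exp(-(n_cells - 1) / n_barcodes)


scenarios = [
    ("384x384, single reaction of 3000 cells", 3000, 140_000),
    ("384x384, one aliquot of 1500 cells", 1500, 140_000),
    ("96x96x96, single reaction of 3000 cells", 3000, 880_000),
]
for label, cells, barcodes in scenarios:
    print(f"{label}: {collision_fraction(cells, barcodes):.2%}")
```

Under these assumptions, about 2% of cells would share a barcode when all 3000 cells are indexed together with the 384x384 design, and splitting into two aliquots roughly halves that. With the 96x96x96 design the rate is already well below 1% without splitting.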

The aliquots appear as the following fragments files:

In [2]:
!ls *k/*k_preprocessing_out/data/fragments/VIB*hydrop*.tsv.gz

10k/10k_preprocessing_out/data/fragments/VIB_hydrop_11.10k.fragments.raw.tsv.gz
10k/10k_preprocessing_out/data/fragments/VIB_hydrop_11.10k.fragments.tsv.gz
10k/10k_preprocessing_out/data/fragments/VIB_hydrop_12.10k.fragments.raw.tsv.gz
10k/10k_preprocessing_out/data/fragments/VIB_hydrop_12.10k.fragments.tsv.gz
10k/10k_preprocessing_out/data/fragments/VIB_hydrop_21.10k.fragments.raw.tsv.gz
10k/10k_preprocessing_out/data/fragments/VIB_hydrop_21.10k.fragments.tsv.gz
10k/10k_preprocessing_out/data/fragments/VIB_hydrop_22.10k.fragments.raw.tsv.gz
10k/10k_preprocessing_out/data/fragments/VIB_hydrop_22.10k.fragments.tsv.gz
15k/15k_preprocessing_out/data/fragments/VIB_hydrop_11.15k.fragments.raw.tsv.gz
15k/15k_preprocessing_out/data/fragments/VIB_hydrop_11.15k.fragments.tsv.gz
15k/15k_preprocessing_out/data/fragments/VIB_hydrop_12.15k.fragments.raw.tsv.gz
15k/15k_preprocessing_out/data/fragments/VIB_hydrop_12.15k.fragments.tsv.gz
15k/15k_preprocessing_out/data/fragments/VIB_hydrop_21.15k.fragm

Each aliquot's fragments file bears a two-digit number denoting replicate and aliquot; e.g. `VIB_hydrop_21` corresponds to aliquot 1 from replicate 2. The two fragments files originating from the same emulsion can be merged into a single fragments file. To do this, however, we must first append the aliquot number to each aliquot's barcodes. Otherwise, we could not distinguish different cells that by chance were indexed with the same barcode in different aliquots. This is done below.
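
The replicate-aliquot naming convention can be made explicit with a small parser. This is an illustrative sketch only: the regular expression is inferred from the file names listed above, and `AliquotName`, `parse_aliquot_name` and `supersample_name` are hypothetical helpers, not part of the pipeline.

```python
import re
from typing import NamedTuple


class AliquotName(NamedTuple):
    replicate: int
    aliquot: int
    depth: str


# Assumed layout: VIB_hydrop_<replicate digit><aliquot digit>.<depth>, e.g. VIB_hydrop_21.10k
_NAME_RE = re.compile(r"^VIB_hydrop_(\d)(\d)\.(\d+k)$")


def parse_aliquot_name(sample: str) -> AliquotName:
    """Split an aliquot sample name into replicate, aliquot and depth."""
    match = _NAME_RE.match(sample)
    if match is None:
        raise ValueError(f"unexpected sample name: {sample!r}")
    return AliquotName(int(match.group(1)), int(match.group(2)), match.group(3))


def supersample_name(sample: str) -> str:
    """Name of the merged sample that an aliquot belongs to."""
    parsed = parse_aliquot_name(sample)
    return f"VIB_hydrop_{parsed.replicate}.{parsed.depth}"


print(parse_aliquot_name("VIB_hydrop_21.10k"))
print(supersample_name("VIB_hydrop_21.10k"))
```

Parsing the name once, rather than rewriting it with chained string replacements, fails loudly on an unexpected name instead of silently altering digits elsewhere in the string.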

# Merge fragments per-sample and append samplename

First, for each aliquot, append an identifier to every barcode in its fragments file to denote the aliquot of origin.
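
The effect of the identifier is easiest to see on a single record. The sketch below is a plain-Python illustration of the idea, assuming a tab-separated fragments record with the barcode in the fourth column; `add_barcode_suffix` and the example records are made up for this illustration.

```python
def add_barcode_suffix(line: str, suffix: str) -> str:
    """Append "-<suffix>" to the barcode (4th column) of one fragments record."""
    fields = line.rstrip("\n").split("\t")
    fields[3] = f"{fields[3]}-{suffix}"
    return "\t".join(fields)


# Two different cells, one per aliquot, that happen to carry the same barcode.
record_aliquot_1 = "chr1\t100\t200\tACGTACGT\t1"
record_aliquot_2 = "chr1\t150\t260\tACGTACGT\t2"

tagged_1 = add_barcode_suffix(record_aliquot_1, "1")
tagged_2 = add_barcode_suffix(record_aliquot_2, "2")

print(tagged_1)
print(tagged_2)
# After tagging, the barcodes differ, so the two cells stay separate after merging.
print(tagged_1.split("\t")[3] != tagged_2.split("\t")[3])
```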

In [4]:
import glob

%load_ext lab_black

In [16]:
fragments_paths

['10k/10k_preprocessing_out/data/fragments/VIB_hydrop_11.10k.fragments.tsv.gz',
 '10k/10k_preprocessing_out/data/fragments/VIB_hydrop_12.10k.fragments.tsv.gz',
 '10k/10k_preprocessing_out/data/fragments/VIB_hydrop_21.10k.fragments.tsv.gz',
 '10k/10k_preprocessing_out/data/fragments/VIB_hydrop_22.10k.fragments.tsv.gz',
 '15k/15k_preprocessing_out/data/fragments/VIB_hydrop_11.15k.fragments.tsv.gz',
 '15k/15k_preprocessing_out/data/fragments/VIB_hydrop_12.15k.fragments.tsv.gz',
 '15k/15k_preprocessing_out/data/fragments/VIB_hydrop_21.15k.fragments.tsv.gz',
 '15k/15k_preprocessing_out/data/fragments/VIB_hydrop_22.15k.fragments.tsv.gz',
 '20k/20k_preprocessing_out/data/fragments/VIB_hydrop_11.20k.fragments.tsv.gz',
 '20k/20k_preprocessing_out/data/fragments/VIB_hydrop_12.20k.fragments.tsv.gz',
 '20k/20k_preprocessing_out/data/fragments/VIB_hydrop_21.20k.fragments.tsv.gz',
 '20k/20k_preprocessing_out/data/fragments/VIB_hydrop_22.20k.fragments.tsv.gz',
 '25k/25k_preprocessing_out/data/fragmen

In [17]:
supersamples

['VIB_hydrop_1.10k',
 'VIB_hydrop_1.15k',
 'VIB_hydrop_1.20k',
 'VIB_hydrop_1.25k',
 'VIB_hydrop_1.30k',
 'VIB_hydrop_1.35k',
 'VIB_hydrop_1.5k',
 'VIB_hydrop_2.10k',
 'VIB_hydrop_2.15k',
 'VIB_hydrop_2.20k',
 'VIB_hydrop_2.25k',
 'VIB_hydrop_2.30k',
 'VIB_hydrop_2.35k',
 'VIB_hydrop_2.5k']

In [20]:
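# Glob pattern covering every depth directory (5k, 10k, ...).
fragments_dir = "*k/*k_preprocessing_out/data/fragments/"
fragments_paths = sorted(glob.glob(fragments_dir + "VIB*hydrop*fragments.tsv.gz"))
# Collapse the two aliquots of a replicate (e.g. 11 and 12) into one
# "supersample" name per replicate and depth (e.g. VIB_hydrop_1.10k).
supersamples = sorted(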
    list(
        set(
            [
                x.split("/")[-1]
                .replace(".fragments.tsv.gz", "")
                .replace("11", "1")
                .replace("12", "1")
                .replace("21", "2")
                .replace("22", "2")
                for x in fragments_paths
            ]
        )
    )
)

parallel_filename = "add_identifier.parallel"
with open(parallel_filename, "w") as f:
    for supersample in supersamples:
        depth = supersample.split(".")[-1]
        print(supersample)
        for subsample_number in [1, 2]:
            subsample = supersample.replace("_1", f"_1{str(subsample_number)}").replace(
                "_2", f"_2{str(subsample_number)}"
            )
            fragments = (
                fragments_dir.replace("*k", depth) + subsample + ".fragments.tsv.gz"
            )
            newfragments = (
                fragments_dir.replace("*k", depth) + subsample + ".fragments.id.tsv.gz"
            )
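            # Append "-<aliquot number>" to the barcode (column 4) of every record,
            # then recompress with bgzip.
            command = (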
                f"zcat {fragments}"
                + ' | mawk \'{ print $1 "\\t" $2 "\\t" $3 "\\t" $4 "-'
                + str(subsample_number)
                + "\\t\" $5}'"
                + f" | bgzip -@ 4 > {newfragments}"
            )
            print("\t" + command)
            f.write(command + "\n")

VIB_hydrop_1.10k
	zcat 10k/10k_preprocessing_out/data/fragments/VIB_hydrop_11.10k.fragments.tsv.gz | mawk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "-1\t" $5}' | bgzip -@ 4 > 10k/10k_preprocessing_out/data/fragments/VIB_hydrop_11.10k.fragments.id.tsv.gz
	zcat 10k/10k_preprocessing_out/data/fragments/VIB_hydrop_12.10k.fragments.tsv.gz | mawk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "-2\t" $5}' | bgzip -@ 4 > 10k/10k_preprocessing_out/data/fragments/VIB_hydrop_12.10k.fragments.id.tsv.gz
VIB_hydrop_1.15k
	zcat 15k/15k_preprocessing_out/data/fragments/VIB_hydrop_11.15k.fragments.tsv.gz | mawk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "-1\t" $5}' | bgzip -@ 4 > 15k/15k_preprocessing_out/data/fragments/VIB_hydrop_11.15k.fragments.id.tsv.gz
	zcat 15k/15k_preprocessing_out/data/fragments/VIB_hydrop_12.15k.fragments.tsv.gz | mawk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "-2\t" $5}' | bgzip -@ 4 > 15k/15k_preprocessing_out/data/fragments/VIB_hydrop_12.15k.fragments.id.tsv.gz
VIB_hydrop_1.20k
	zcat 20k/20k_preprocessi

In [21]:
module load HTSlib
cat add_identifier.parallel | parallel -j 8 --progress

/bin/bash: line 1: parallel: command not found
cat: write error: Broken pipe


Then merge the fragments files from the same run.
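
The merge amounts to concatenating the tagged records of both aliquots and re-sorting them by chromosome, start and end. A minimal in-memory sketch of that ordering follows, using made-up records; `fragment_sort_key` and `merge_fragments` are hypothetical names, and real fragments files are far too large to handle this way, which is why the notebook streams them through command-line tools instead.

```python
def fragment_sort_key(line: str):
    """Sort key: chromosome as text, then start and end as integers."""
    chrom, start, end = line.split("\t")[:3]
    return (chrom, int(start), int(end))


def merge_fragments(*aliquots):
    """Concatenate the records of several aliquots and sort them by position."""
    merged = [line for aliquot in aliquots for line in aliquot]
    # Python's sort is stable, so records with identical coordinates keep input order.
    merged.sort(key=fragment_sort_key)
    return merged


aliquot_1 = ["chr1\t300\t400\tAAAA-1\t1", "chr2\t5\t50\tCCCC-1\t2"]
aliquot_2 = ["chr1\t100\t200\tAAAA-2\t1", "chr10\t1\t20\tGGGG-2\t1"]

for record in merge_fragments(aliquot_1, aliquot_2):
    print(record)
```

Chromosome names are compared as plain text here, so `chr10` sorts before `chr2`; for ASCII names this matches a byte-wise comparison.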

In [40]:
fragments_dir = "*k/*k_preprocessing_out/data/fragments/"
fragments_paths = sorted(glob.glob(fragments_dir + "VIB*hydrop*fragments.tsv.gz"))
supersamples = sorted(
    list(
        set(
            [
                x.split("/")[-1]
                .replace(".fragments.tsv.gz", "")
                .replace("11", "1")
                .replace("12", "1")
                .replace("21", "2")
                .replace("22", "2")
                for x in fragments_paths
            ]
        )
    )
)

parallel_filename = "merge_subsamples.parallel"
with open(parallel_filename, "w") as f:
    for supersample in supersamples:
        depth = supersample.split(".")[-1]
        supersample = supersample.replace("." + depth, "")
        print(supersample)
        fragments_1 = fragments_dir.replace("*k", depth) + supersample.replace(
            "_2", f"_21.{depth}.fragments.id.tsv.gz"
        ).replace("_1", f"_11.{depth}.fragments.id.tsv.gz")

        fragments_2 = fragments_dir.replace("*k", depth) + supersample.replace(
            "_2", f"_22.{depth}.fragments.id.tsv.gz"
        ).replace("_1", f"_12.{depth}.fragments.id.tsv.gz")

        newfragments = (
            fragments_dir.replace("*k", depth)
            + supersample
            + f".{depth}.fragments.tsv.gz"
        )
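        # Concatenate both tagged aliquots, re-sort by chromosome, start and end,
        # and recompress with bgzip.
        command = (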
            f"zcat {fragments_1} {fragments_2} "
            + " | LC_ALL=C sort --parallel=8 -S 8G -k1,1 -k 2,2n -k3,3n"
            + f" | bgzip -@ 8 > {newfragments}"
        )
        print("\t" + command)
        f.write(command + "\n")

VIB_hydrop_1
	zcat 10k/10k_preprocessing_out/data/fragments/VIB_hydrop_11.10k.fragments.id.tsv.gz 10k/10k_preprocessing_out/data/fragments/VIB_hydrop_12.10k.fragments.id.tsv.gz  | LC_ALL=C sort --parallel=8 -S 8G -k1,1 -k 2,2n -k3,3n | bgzip -@ 8 > 10k/10k_preprocessing_out/data/fragments/VIB_hydrop_1.10k.fragments.tsv.gz
VIB_hydrop_1
	zcat 15k/15k_preprocessing_out/data/fragments/VIB_hydrop_11.15k.fragments.id.tsv.gz 15k/15k_preprocessing_out/data/fragments/VIB_hydrop_12.15k.fragments.id.tsv.gz  | LC_ALL=C sort --parallel=8 -S 8G -k1,1 -k 2,2n -k3,3n | bgzip -@ 8 > 15k/15k_preprocessing_out/data/fragments/VIB_hydrop_1.15k.fragments.tsv.gz
VIB_hydrop_1
	zcat 20k/20k_preprocessing_out/data/fragments/VIB_hydrop_11.20k.fragments.id.tsv.gz 20k/20k_preprocessing_out/data/fragments/VIB_hydrop_12.20k.fragments.id.tsv.gz  | LC_ALL=C sort --parallel=8 -S 8G -k1,1 -k 2,2n -k3,3n | bgzip -@ 8 > 20k/20k_preprocessing_out/data/fragments/VIB_hydrop_1.20k.fragments.tsv.gz
VIB_hydrop_1
	zcat 25k/25k_p

In [35]:
!cat merge_subsamples.parallel

zcat 10k/10k_preprocessing_out/data/fragments/VIB_hydrop_11.10k.fragments.id.tsv.gz.10k 10k/10k_preprocessing_out/data/fragments/VIB_hydrop_12.10k.fragments.id.tsv.gz.10k  | LC_ALL=C sort --parallel=8 -S 8G -k1,1 -k 2,2n -k3,3n | bgzip -@ 8 > 10k/10k_preprocessing_out/data/fragments/VIB_hydrop_1.10k.fragments.tsv.gz
zcat 15k/15k_preprocessing_out/data/fragments/VIB_hydrop_11.15k.fragments.id.tsv.gz.15k 15k/15k_preprocessing_out/data/fragments/VIB_hydrop_12.15k.fragments.id.tsv.gz.15k  | LC_ALL=C sort --parallel=8 -S 8G -k1,1 -k 2,2n -k3,3n | bgzip -@ 8 > 15k/15k_preprocessing_out/data/fragments/VIB_hydrop_1.15k.fragments.tsv.gz
zcat 20k/20k_preprocessing_out/data/fragments/VIB_hydrop_11.20k.fragments.id.tsv.gz.20k 20k/20k_preprocessing_out/data/fragments/VIB_hydrop_12.20k.fragments.id.tsv.gz.20k  | LC_ALL=C sort --parallel=8 -S 8G -k1,1 -k 2,2n -k3,3n | bgzip -@ 8 > 20k/20k_preprocessing_out/data/fragments/VIB_hydrop_1.20k.fragments.tsv.gz
zcat 25k/25k_preprocessing_out/data/fragments/

In [None]:
module load HTSlib
cat  merge_subsamples.parallel | parallel -j 8 --progress

Before merging the BAM files, we first need to append an identifier to every barcode in each BAM.
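
Conceptually, this step rewrites one optional tag on every alignment record. The sketch below shows the idea on a single SAM-formatted text line in plain Python. It is an illustration only, not the implementation of the tool used in the cells that follow: the tag name `DB` and the suffix `-11` are taken from those cells, while `append_to_tag` and the example record are made up.

```python
def append_to_tag(sam_line: str, tag: str, suffix: str) -> str:
    """Append suffix to the value of a string-typed (Z) optional tag in one SAM record."""
    if sam_line.startswith("@"):
        # Header lines carry no per-read tags; return them unchanged.
        return sam_line.rstrip("\n")
    fields = sam_line.rstrip("\n").split("\t")
    prefix = f"{tag}:Z:"
    # The first 11 columns of a SAM record are mandatory; optional tags follow.
    for i in range(11, len(fields)):
        if fields[i].startswith(prefix):
            fields[i] += suffix
    return "\t".join(fields)


record = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\tDB:Z:AAACCC\tNM:i:0"
print(append_to_tag(record, "DB", "-11"))
```

Records without the tag pass through unchanged, so reads that were never assigned a barcode are left as they are.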

In [16]:
bam=bam_postbap/HYA__24010b__20210813_384_PBMC_11_S9.bwa.out.possorted.mm.bam
outbam=${bam%.bam}.ID.bam
/staging/leuven/stg_00002/lcb/ghuls/software/single_cell_toolkit_rust/target/release/append_string_to_bc_tag_try $bam $outbam DB '-11' &

[1] 15971


In [18]:
bam=bam_postbap/HYA__2beafa__20210813_384_PBMC_12_S10.bwa.out.possorted.mm.bam
outbam=${bam%.bam}.ID.bam
/staging/leuven/stg_00002/lcb/ghuls/software/single_cell_toolkit_rust/target/release/append_string_to_bc_tag_try $bam $outbam DB '-12' &

[2] 16061


In [19]:
bam=bam_postbap/HYA__3d6da9__20210813_384_PBMC_21_S11.bwa.out.possorted.mm.bam
outbam=${bam%.bam}.ID.bam
/staging/leuven/stg_00002/lcb/ghuls/software/single_cell_toolkit_rust/target/release/append_string_to_bc_tag_try $bam $outbam DB '-21' &

[3] 16081


In [20]:
bam=bam_postbap/HYA__5028cb__20210813_384_PBMC_22_S12.bwa.out.possorted.mm.bam
outbam=${bam%.bam}.ID.bam
/staging/leuven/stg_00002/lcb/ghuls/software/single_cell_toolkit_rust/target/release/append_string_to_bc_tag_try $bam $outbam DB '-22' &

[4] 16118


Then, merge the resulting BAMs.

In [24]:
module load SAMtools
samtools merge -@ 12 -o bam_postbap/Hydrop_1.bwa.out.possorted.mm.bam bam_postbap/HYA__24010b__20210813_384_PBMC_11_S9.bwa.out.possorted.mm.ID.bam bam_postbap/HYA__2beafa__20210813_384_PBMC_12_S10.bwa.out.possorted.mm.ID.bam -f &
samtools merge -@ 12 -o bam_postbap/Hydrop_2.bwa.out.possorted.mm.bam bam_postbap/HYA__3d6da9__20210813_384_PBMC_21_S11.bwa.out.possorted.mm.ID.bam bam_postbap/HYA__5028cb__20210813_384_PBMC_22_S12.bwa.out.possorted.mm.ID.bam -f &


The following have been reloaded with a version change:
  1) XZ/5.2.4-GCCcore-6.4.0 => XZ/5.2.5-GCCcore-6.4.0

[1] 18219
[2] 18233


In [34]:
samtools index bam_postbap/Hydrop_1.bwa.out.possorted.mm.bam &
samtools index bam_postbap/Hydrop_2.bwa.out.possorted.mm.bam &

[3] 22295
[4] 22296


The final result is two BAM files and two fragments files instead of four of each, one per HyDrop replicate run.