HyDrop produces a steady stream of cell/bead emulsion. The user can choose what volume of emulsion to use for droplet PCR and downstream processing. A larger emulsion volume means fewer downstream steps, and thus less work. However, if the number of cells is too high relative to the barcode complexity (the total number of barcodes available to index cells), the probability of barcode collisions (two different cells receiving the same barcode by chance) increases according to a Poisson distribution.

For each replicate, we created an emulsion containing the equivalent of 3000 recovered cells. We split each emulsion into two aliquots and indexed them separately to avoid these barcode collisions. We do this only with the 384x384 version of HyDrop, as its barcode complexity is only ~140k, compared to ~880k for the 96x96x96 version.
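
To make the trade-off concrete, here is a rough sketch of the expected collision rate. It assumes every cell draws one barcode uniformly at random from the pool, which is an idealisation rather than a measured property of HyDrop. The function name `collision_fraction` is hypothetical, and the pool sizes are the approximate complexities quoted above.

```python
def collision_fraction(n_cells: int, n_barcodes: int) -> float:
    """Expected fraction of cells sharing their barcode with another cell.

    Assumes each cell independently receives one barcode drawn uniformly
    from n_barcodes possibilities (an idealised model, not a measurement).
    """
    if n_cells < 1 or n_barcodes < 1:
        raise ValueError("n_cells and n_barcodes must be positive")
    # A given cell avoids a collision only if none of the other
    # n_cells - 1 cells drew the same barcode.
    return 1.0 - (1.0 - 1.0 / n_barcodes) ** (n_cells - 1)


# Approximate barcode complexities taken from the text above.
for label, n_barcodes in [("384x384", 140_000), ("96x96x96", 880_000)]:
    for n_cells in (3000, 1500):
        frac = collision_fraction(n_cells, n_barcodes)
        print(f"{label}: {n_cells} cells -> {frac:.2%} of cells expected to collide")
```

Under this model, halving the number of cells per indexing reaction roughly halves the expected collision rate, which is the motivation for splitting the 384x384 emulsions.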

Each aliquot appears as its own set of fragments files:

In [1]:
!ls full_preprocessing_out/data/fragments/*hydrop*.tsv.gz

full_preprocessing_out/data/fragments/CNA_hydrop_1.FULL.fragments.raw.tsv.gz
full_preprocessing_out/data/fragments/CNA_hydrop_1.FULL.fragments.tsv.gz
full_preprocessing_out/data/fragments/CNA_hydrop_2.FULL.fragments.raw.tsv.gz
full_preprocessing_out/data/fragments/CNA_hydrop_2.FULL.fragments.tsv.gz
full_preprocessing_out/data/fragments/CNA_hydrop_3.FULL.fragments.raw.tsv.gz
full_preprocessing_out/data/fragments/CNA_hydrop_3.FULL.fragments.tsv.gz
full_preprocessing_out/data/fragments/EPF_hydrop_1.FULL.fragments.raw.tsv.gz
full_preprocessing_out/data/fragments/EPF_hydrop_1.FULL.fragments.tsv.gz
full_preprocessing_out/data/fragments/EPF_hydrop_2.FULL.fragments.raw.tsv.gz
full_preprocessing_out/data/fragments/EPF_hydrop_2.FULL.fragments.tsv.gz
full_preprocessing_out/data/fragments/EPF_hydrop_3.FULL.fragments.raw.tsv.gz
full_preprocessing_out/data/fragments/EPF_hydrop_3.FULL.fragments.tsv.gz
full_preprocessing_out/data/fragments/EPF_hydrop_4.FULL.fragments.raw.tsv.gz
full_preprocessing_out/

Each aliquot's fragments file bears a number denoting replicate and aliquot, e.g. `VIB_hydrop_21` corresponds to aliquot 1 from replicate 2. The two fragments files originating from the same emulsion can be merged into a single fragments file. To do this, however, we must first append the aliquot number to each aliquot's barcodes. Otherwise, we could not distinguish different cells that were by chance indexed with the same barcode in different aliquots. This is done below.
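
The effect of the suffix can be illustrated with a toy example. The barcode strings and coordinates are made up, and `tag_barcode` is a hypothetical helper that applies, to a single line, the same column-4 edit that this notebook performs on whole files.

```python
def tag_barcode(fragment_line: str, aliquot: int) -> str:
    """Append "-<aliquot>" to the barcode (4th column) of a fragments line."""
    fields = fragment_line.rstrip("\n").split("\t")
    fields[3] = f"{fields[3]}-{aliquot}"
    return "\t".join(fields)


# Two different cells from different aliquots that, by chance,
# received the same barcode (made-up example data).
aliquot_1 = "chr1\t100\t250\tACGTACGT\t1"
aliquot_2 = "chr1\t400\t520\tACGTACGT\t2"

untagged = {line.split("\t")[3] for line in (aliquot_1, aliquot_2)}
tagged = {
    tag_barcode(aliquot_1, 1).split("\t")[3],
    tag_barcode(aliquot_2, 2).split("\t")[3],
}
print(len(untagged))  # number of distinct barcodes without the suffix
print(len(tagged))  # number of distinct barcodes with the suffix
```

Without the suffix the two cells collapse into one barcode after merging; with it they remain separate.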

# Merge fragments per-sample and append samplename

For each aliquot, append a unique identifier to every barcode in its fragments file to denote the aliquot of origin.

In [2]:
import glob

%load_ext lab_black

In [3]:
fragments_dir = "full_preprocessing_out/data/fragments/"
fragments_paths = glob.glob(fragments_dir + "*hydrop*.tsv.gz")
# Keep each file name up to the first dot and drop its last character
# (the aliquot digit), leaving one name per emulsion.
supersamples = sorted(
    list(set([x.split("/")[-1].split(".")[0][:-1] for x in fragments_paths]))
)

# Write one shell command per aliquot; the file is run with GNU parallel later.
parallel_filename = "add_identifier.parallel"
with open(parallel_filename, "w") as f:
    for supersample in supersamples:
        print(supersample)
        for subsample_number in [1, 2]:
            subsample = supersample + str(subsample_number)
            fragments = fragments_dir + subsample + ".fragments.tsv.gz"
            newfragments = fragments_dir + subsample + ".fragments.id.tsv.gz"
            # mawk appends "-<aliquot number>" to the barcode in column 4.
            command = (
                f"zcat {fragments}"
                + ' | mawk \'{ print $1 "\\t" $2 "\\t" $3 "\\t" $4 "-'
                + str(subsample_number)
                + "\\t\" $5}'"
                + f" | bgzip -@ 4 > {newfragments}"
            )
            print("\t" + command)
            f.write(command + "\n")

CNA_hydrop_
	zcat full_preprocessing_out/data/fragments/CNA_hydrop_1.fragments.tsv.gz | mawk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "-1\t" $5}' | bgzip -@ 4 > full_preprocessing_out/data/fragments/CNA_hydrop_1.fragments.id.tsv.gz
	zcat full_preprocessing_out/data/fragments/CNA_hydrop_2.fragments.tsv.gz | mawk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "-2\t" $5}' | bgzip -@ 4 > full_preprocessing_out/data/fragments/CNA_hydrop_2.fragments.id.tsv.gz
EPF_hydrop_
	zcat full_preprocessing_out/data/fragments/EPF_hydrop_1.fragments.tsv.gz | mawk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "-1\t" $5}' | bgzip -@ 4 > full_preprocessing_out/data/fragments/EPF_hydrop_1.fragments.id.tsv.gz
	zcat full_preprocessing_out/data/fragments/EPF_hydrop_2.fragments.tsv.gz | mawk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "-2\t" $5}' | bgzip -@ 4 > full_preprocessing_out/data/fragments/EPF_hydrop_2.fragments.id.tsv.gz
VIB_hydrop_1
	zcat full_preprocessing_out/data/fragments/VIB_hydrop_11.fragments.tsv.gz | mawk '{ print $1 "\t" $2 "

In [57]:
!cat add_identifier.parallel | parallel -j 8 --progress

mawk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "-1\t" $5}' full_preprocessing_out/data/fragments/CNA_hydrop_41.fragments.tsv.gz > full_preprocessing_out/data/fragments/CNA_hydrop_41.fragments.id.tsv.gz
mawk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "-2\t" $5}' full_preprocessing_out/data/fragments/CNA_hydrop_42.fragments.tsv.gz > full_preprocessing_out/data/fragments/CNA_hydrop_42.fragments.id.tsv.gz
mawk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "-1\t" $5}' full_preprocessing_out/data/fragments/CNA_hydrop_51.fragments.tsv.gz > full_preprocessing_out/data/fragments/CNA_hydrop_51.fragments.id.tsv.gz
mawk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "-2\t" $5}' full_preprocessing_out/data/fragments/CNA_hydrop_52.fragments.tsv.gz > full_preprocessing_out/data/fragments/CNA_hydrop_52.fragments.id.tsv.gz
mawk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "-1\t" $5}' full_preprocessing_out/data/fragments/CNA_hydrop_61.fragments.tsv.gz > full_preprocessing_out/data/fragments/CNA_hydrop_61.fragments.id.tsv.gz
mawk '{ print $1 "\t

Then, merge the fragments files from the same run.

In [6]:
fragments_dir = "full_preprocessing_out/data/fragments/"
fragments_paths = glob.glob(fragments_dir + "*hydrop*.tsv.gz")
supersamples = sorted(
    list(set([x.split("/")[-1].split(".")[0][:-1] for x in fragments_paths]))
)

# Write one merge command per emulsion.
parallel_filename = "merge_subsamples.parallel"
with open(parallel_filename, "w") as f:
    for supersample in supersamples:
        print(supersample)
        fragments_1 = fragments_dir + supersample + "1.fragments.id.tsv.gz"
        fragments_2 = fragments_dir + supersample + "2.fragments.id.tsv.gz"
        newfragments = fragments_dir + supersample + ".fragments.id.tsv.gz"
        # Concatenate both aliquots, then re-sort by chromosome, start and end.
        command = (
            f"zcat {fragments_1} {fragments_2} "
            + " | LC_ALL=C sort --parallel=8 -S 8G -k1,1 -k 2,2n -k3,3n"
            + f" | bgzip -@ 8 > {newfragments}"
        )
        print("\t" + command)
        f.write(command + "\n")

CNA_hydrop_
	zcat full_preprocessing_out/data/fragments/CNA_hydrop_1.fragments.id.tsv.gz full_preprocessing_out/data/fragments/CNA_hydrop_2.fragments.id.tsv.gz  | LC_ALL=C sort --parallel=8 -S 8G -k1,1 -k 2,2n -k3,3n | bgzip -@ 8 > full_preprocessing_out/data/fragments/CNA_hydrop_.fragments.id.tsv.gz
EPF_hydrop_
	zcat full_preprocessing_out/data/fragments/EPF_hydrop_1.fragments.id.tsv.gz full_preprocessing_out/data/fragments/EPF_hydrop_2.fragments.id.tsv.gz  | LC_ALL=C sort --parallel=8 -S 8G -k1,1 -k 2,2n -k3,3n | bgzip -@ 8 > full_preprocessing_out/data/fragments/EPF_hydrop_.fragments.id.tsv.gz
VIB_hydrop_1
	zcat full_preprocessing_out/data/fragments/VIB_hydrop_11.fragments.id.tsv.gz full_preprocessing_out/data/fragments/VIB_hydrop_12.fragments.id.tsv.gz  | LC_ALL=C sort --parallel=8 -S 8G -k1,1 -k 2,2n -k3,3n | bgzip -@ 8 > full_preprocessing_out/data/fragments/VIB_hydrop_1.fragments.id.tsv.gz
VIB_hydrop_2
	zcat full_preprocessing_out/data/fragments/VIB_hydrop_21.fragments.id.tsv.gz

In [7]:
!cat merge_subsamples.parallel

zcat full_preprocessing_out/data/fragments/CNA_hydrop_1.fragments.id.tsv.gz full_preprocessing_out/data/fragments/CNA_hydrop_2.fragments.id.tsv.gz  | LC_ALL=C sort --parallel=8 -S 8G -k1,1 -k 2,2n -k3,3n | bgzip -@ 8 > full_preprocessing_out/data/fragments/CNA_hydrop_.fragments.id.tsv.gz
zcat full_preprocessing_out/data/fragments/EPF_hydrop_1.fragments.id.tsv.gz full_preprocessing_out/data/fragments/EPF_hydrop_2.fragments.id.tsv.gz  | LC_ALL=C sort --parallel=8 -S 8G -k1,1 -k 2,2n -k3,3n | bgzip -@ 8 > full_preprocessing_out/data/fragments/EPF_hydrop_.fragments.id.tsv.gz
zcat full_preprocessing_out/data/fragments/VIB_hydrop_11.fragments.id.tsv.gz full_preprocessing_out/data/fragments/VIB_hydrop_12.fragments.id.tsv.gz  | LC_ALL=C sort --parallel=8 -S 8G -k1,1 -k 2,2n -k3,3n | bgzip -@ 8 > full_preprocessing_out/data/fragments/VIB_hydrop_1.fragments.id.tsv.gz
zcat full_preprocessing_out/data/fragments/VIB_hydrop_21.fragments.id.tsv.gz full_preprocessing_out/data/fragments/VIB_hydrop_22.f
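
For reference, the ordering requested by `-k1,1 -k 2,2n -k3,3n` can be mimicked in a few lines of Python: the chromosome name is compared as a string, then start and end are compared as integers. This is only a sketch on made-up in-memory lines to show the resulting order. It is not a replacement for the `sort` pipeline, which handles files far larger than memory, and `fragment_sort_key` is a hypothetical name.

```python
def fragment_sort_key(line: str) -> tuple:
    """Sort key: chromosome (string order), then start and end (numeric)."""
    chrom, start, end = line.split("\t")[:3]
    return chrom, int(start), int(end)


# Made-up fragments lines with aliquot-tagged barcodes.
lines = [
    "chr2\t50\t90\tAAAA-1\t1",
    "chr1\t1000\t1100\tCCCC-2\t3",
    "chr1\t200\t300\tGGGG-1\t1",
    "chr10\t5\t60\tTTTT-2\t2",
]
sorted_lines = sorted(lines, key=fragment_sort_key)
for line in sorted_lines:
    print(line)
```

Note that the numeric keys place start 200 before start 1000 on `chr1`, whereas a plain string comparison would order them the other way round. Chromosome names, by contrast, are compared as strings, so `chr10` sorts before `chr2`.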

# Similarly, merge BAMs

First, we need to append an identifier to every barcode in the BAM files.

In [16]:
bam=bam_postbap/HYA__24010b__20210813_384_PBMC_11_S9.bwa.out.possorted.mm.bam
outbam=${bam%.bam}.ID.bam
/staging/leuven/stg_00002/lcb/ghuls/software/single_cell_toolkit_rust/target/release/append_string_to_bc_tag_try $bam $outbam DB '-11' &

[1] 15971


In [18]:
bam=bam_postbap/HYA__2beafa__20210813_384_PBMC_12_S10.bwa.out.possorted.mm.bam
outbam=${bam%.bam}.ID.bam
/staging/leuven/stg_00002/lcb/ghuls/software/single_cell_toolkit_rust/target/release/append_string_to_bc_tag_try $bam $outbam DB '-12' &

[2] 16061


In [19]:
bam=bam_postbap/HYA__3d6da9__20210813_384_PBMC_21_S11.bwa.out.possorted.mm.bam
outbam=${bam%.bam}.ID.bam
/staging/leuven/stg_00002/lcb/ghuls/software/single_cell_toolkit_rust/target/release/append_string_to_bc_tag_try $bam $outbam DB '-21' &

[3] 16081


In [20]:
bam=bam_postbap/HYA__5028cb__20210813_384_PBMC_22_S12.bwa.out.possorted.mm.bam
outbam=${bam%.bam}.ID.bam
/staging/leuven/stg_00002/lcb/ghuls/software/single_cell_toolkit_rust/target/release/append_string_to_bc_tag_try $bam $outbam DB '-22' &

[4] 16118
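
What this tagging step does to each record can be sketched on SAM text. The `DB` tag name and the `-11` style suffix come from the commands above. Everything else here is an assumption made for illustration: the function name `append_to_tag`, the example record, and the premise that the barcode is stored as a string-typed (`Z`) tag. This is not the implementation of the Rust tool, which operates on BAM files directly.

```python
def append_to_tag(sam_line: str, tag: str, suffix: str) -> str:
    """Append suffix to the value of a string-typed (Z) optional SAM tag."""
    fields = sam_line.rstrip("\n").split("\t")
    prefix = f"{tag}:Z:"
    # Optional fields start at column 12 (index 11) of a SAM record.
    for i in range(11, len(fields)):
        if fields[i].startswith(prefix):
            fields[i] += suffix
    return "\t".join(fields)


# Made-up SAM record carrying a barcode in the DB tag.
record = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\tDB:Z:ACGTACGT"
tagged_record = append_to_tag(record, "DB", "-11")
print(tagged_record)
```

Records that lack the tag pass through unchanged in this sketch.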


Then, merge the resulting BAMs.

In [24]:
module load SAMtools
samtools merge -@ 12 -o bam_postbap/Hydrop_1.bwa.out.possorted.mm.bam bam_postbap/HYA__24010b__20210813_384_PBMC_11_S9.bwa.out.possorted.mm.ID.bam bam_postbap/HYA__2beafa__20210813_384_PBMC_12_S10.bwa.out.possorted.mm.ID.bam -f &
samtools merge -@ 12 -o bam_postbap/Hydrop_2.bwa.out.possorted.mm.bam bam_postbap/HYA__3d6da9__20210813_384_PBMC_21_S11.bwa.out.possorted.mm.ID.bam bam_postbap/HYA__5028cb__20210813_384_PBMC_22_S12.bwa.out.possorted.mm.ID.bam -f &


The following have been reloaded with a version change:
  1) XZ/5.2.4-GCCcore-6.4.0 => XZ/5.2.5-GCCcore-6.4.0

[1] 18219
[2] 18233


In [34]:
samtools index bam_postbap/Hydrop_1.bwa.out.possorted.mm.bam &
samtools index bam_postbap/Hydrop_2.bwa.out.possorted.mm.bam &

[3] 22295
[4] 22296


The final result is two BAM files and two fragments files instead of four of each, with each file corresponding to one HyDrop replicate run.