# Writing a metadata file

In [1]:
import os
import glob
import pandas as pd

%load_ext lab_black

The metadata file is essentially a tab-delimited list of samples. For example, if you have four 10x scATAC-seq runs named `ATAC_1` to `ATAC_4`, and each run has produced three FASTQ files (mate 1, mate 2, and the barcode read), your `metadata.tsv` would contain a header and four lines, with each line containing the paths to the three FASTQ files. An additional column (`technology`) refers to a set of instructions that the pipeline uses to extract and correct cell barcodes from the barcode FASTQ. Four default technologies come with the pipeline:
- `standard`: the simplest case, where a barcode whitelist is provided and each read in the `fastq_barcode` file can be corrected directly against that whitelist. This is the default strategy.
- `ddseq`: for Bio-Rad SureCell ATAC ddSEQ samples. This workflow is more complicated for two reasons: the barcode sequence itself contains adapters (constant regions shared between all barcodes), and the barcode is followed by the gDNA insert in the same sequencing read (i.e. barcode and mate 1 are read together).
- `hydrop_3x96` and `hydrop_3x384`: for two variations of HyDrop. The HyDrop barcode read also contains constant regions, which are automatically removed by the pipeline.

Later (in notebook `2_running_nextflow_pipeline.ipynb`), you will define a path to the whitelist that should be used for each `technology`.
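
The layout described above can be sketched by hand for the four-run example. This is only an illustration: the directory `/path/to/fastq` and the `_R1`/`_R2`/`_R3` file suffixes are placeholders, not names the pipeline requires, and the standard-library `csv` module is used here simply to keep the sketch free of dependencies.

```python
import csv
import io

# Hypothetical sample names and FASTQ location, for illustration only.
samples = [f"ATAC_{i}" for i in range(1, 5)]
fastq_root = "/path/to/fastq"

columns = [
    "sample_name",
    "technology",
    "fastq_PE1_path",
    "fastq_barcode_path",
    "fastq_PE2_path",
]

# Write one header line plus one tab-delimited line per sample.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(columns)
for sample in samples:
    writer.writerow(
        [
            sample,
            "standard",
            f"{fastq_root}/{sample}_R1.fastq.gz",  # mate 1
            f"{fastq_root}/{sample}_R2.fastq.gz",  # barcode read
            f"{fastq_root}/{sample}_R3.fastq.gz",  # mate 2
        ]
    )

toy_tsv = buffer.getvalue()
print(toy_tsv)
```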

An example `metadata.tsv`, for many samples, looks like so:

In [2]:
example = pd.read_csv("metadata_example.tsv", sep="\t", index_col=0)
example

sample_name,technology,fastq_PE1_path,fastq_barcode_path,fastq_PE2_path
BIO_ddseq_1,biorad,/dodrio/scratch/projects/starting_2022_023/ben...,,/dodrio/scratch/projects/starting_2022_023/ben...
BIO_ddseq_2,biorad,/dodrio/scratch/projects/starting_2022_023/ben...,,/dodrio/scratch/projects/starting_2022_023/ben...
BIO_ddseq_3,biorad,/dodrio/scratch/projects/starting_2022_023/ben...,,/dodrio/scratch/projects/starting_2022_023/ben...
BIO_ddseq_4,biorad,/dodrio/scratch/projects/starting_2022_023/ben...,,/dodrio/scratch/projects/starting_2022_023/ben...
BRO_mtscatac_1,atac_revcomp,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...
BRO_mtscatac_2,atac_revcomp,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...
CNA_10xmultiome_1,multiome_revcomp,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...
CNA_10xmultiome_2,multiome_revcomp,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...
CNA_10xv11_1,atac,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...
CNA_10xv11_2,atac,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...


In this notebook, we use some simple Python string manipulation and list comprehensions to automatically write a `metadata.tsv` file for the pipeline to interpret.

# Case #1. A directory containing FASTQ files from various techniques

In this case, the names of the FASTQ files all share a logical syntax, `center_technique_samplenumber__read`. You can achieve this by merging or renaming your FASTQ files. First, find all the FASTQ files in your `fastq_dir/`:

In [3]:
fastq_dir = "PUMATAC_example_fastq/"

In [4]:
filepaths = sorted(glob.glob(f"{fastq_dir}/*"))
filepaths

['PUMATAC_example_fastq/BIO_ddseq_4__R1.LIBDS.fastq.gz',
 'PUMATAC_example_fastq/BIO_ddseq_4__R2.LIBDS.fastq.gz',
 'PUMATAC_example_fastq/EPF_hydrop_1__R1.LIBDS.fastq.gz',
 'PUMATAC_example_fastq/EPF_hydrop_1__R2.LIBDS.fastq.gz',
 'PUMATAC_example_fastq/EPF_hydrop_1__R3.LIBDS.fastq.gz',
 'PUMATAC_example_fastq/OHS_s3atac_1__R1.LIBDS.fastq.gz',
 'PUMATAC_example_fastq/OHS_s3atac_1__R2.LIBDS.fastq.gz',
 'PUMATAC_example_fastq/OHS_s3atac_1__R3.LIBDS.fastq.gz',
 'PUMATAC_example_fastq/VIB_10xv2_1__R1.LIBDS.fastq.gz',
 'PUMATAC_example_fastq/VIB_10xv2_1__R2.LIBDS.fastq.gz',
 'PUMATAC_example_fastq/VIB_10xv2_1__R3.LIBDS.fastq.gz']

We assume that each FASTQ file has a structured name indicating the sample and read. In this syntax, R1 is mate 1, R2 is the barcode read and R3 is mate 2. For ddSEQ samples, R2 is mate 2. The pipeline does not require any particular file names, but a consistent naming scheme helps to generate the metadata file systematically.

In [5]:
filenames = [x.split("/")[-1] for x in filepaths]
samplenames = sorted(list(set([x.split("__")[0] for x in filenames])))
samplenames

['BIO_ddseq_4', 'EPF_hydrop_1', 'OHS_s3atac_1', 'VIB_10xv2_1']
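
The naming convention can be made explicit with a small helper. This is only a sketch: `parse_fastq_name` is a hypothetical function, and it assumes the `center_technique_samplenumber__read` pattern of this example directory.

```python
def parse_fastq_name(filename):
    """Split e.g. 'BIO_ddseq_4__R1.LIBDS.fastq.gz' into its name components."""
    # Everything before the double underscore is the sample name.
    sample, rest = filename.split("__", 1)
    # The read label is the part before the first dot of the remainder.
    read = rest.split(".")[0]
    center, technique, number = sample.split("_")
    return {
        "sample": sample,
        "center": center,
        "technique": technique,
        "number": number,
        "read": read,
    }


parsed = parse_fastq_name("BIO_ddseq_4__R1.LIBDS.fastq.gz")
print(parsed)
```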

Initiate a pandas dataframe:

In [6]:
metadata = pd.DataFrame(samplenames, columns=["sample_name"])
metadata.index = metadata["sample_name"]

Then, for every sample, you need to define the technology used. For that, we use a Python dictionary, which we initiate with the default value `standard` for each sample's technology.

In [7]:
tech_dict_template = {x: "standard" for x in metadata["sample_name"]}
tech_dict_template

{'BIO_ddseq_4': 'standard',
 'EPF_hydrop_1': 'standard',
 'OHS_s3atac_1': 'standard',
 'VIB_10xv2_1': 'standard'}

Now, copy the dictionary and change `standard` to each sample's true method, for example:

In [8]:
tech_dict = {
    "BIO_ddseq_4": "biorad",
    "EPF_hydrop_1": "hydrop_2x384",
    "OHS_s3atac_1": "s3atac_1",
    "VIB_10xv2_1": "atac_revcomp",
}
tech_dict

{'BIO_ddseq_4': 'biorad',
 'EPF_hydrop_1': 'hydrop_2x384',
 'OHS_s3atac_1': 's3atac_1',
 'VIB_10xv2_1': 'atac_revcomp'}

`biorad` is a default technology; `atac_revcomp` is not. As a result, the pipeline falls back to the `standard` demultiplexing strategy for `atac_revcomp`, but uses the whitelist that is specified for `atac_revcomp` in the `.config` file written in notebook 2.
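
The fallback rule stated here can be pictured as a simple lookup. This illustrates the rule as described in this notebook, not the pipeline's actual code: `resolve_strategy` and the `dedicated` set are hypothetical, and the set should be replaced by the technology names that your pipeline version really ships with.

```python
def resolve_strategy(technology, dedicated):
    """Return the barcode-processing strategy used for a technology label."""
    # Labels without dedicated handling fall back to the standard strategy.
    return technology if technology in dedicated else "standard"


# Assumed set of labels with dedicated handling, for illustration only.
dedicated = {"standard", "biorad"}

for label in ["biorad", "atac_revcomp"]:
    print(label, "->", resolve_strategy(label, dedicated))
```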

In [9]:
fastq_df = pd.DataFrame.from_dict(tech_dict, orient="index", columns=["technology"])
fastq_df

Unnamed: 0,technology
BIO_ddseq_4,biorad
EPF_hydrop_1,hydrop_2x384
OHS_s3atac_1,s3atac_1
VIB_10xv2_1,atac_revcomp


And find out which file is which read:

In [10]:
read_alias_dict = {
    "R1": "fastq_PE1_path",
    "R2": "fastq_barcode_path",
    "R3": "fastq_PE2_path",
}

read_alias_dict_ddseq = {
    "R1": "fastq_PE1_path",
    "R2": "fastq_PE2_path",
}

In [11]:
read_dict = {}
for sample in samplenames:  # iterate over all samples
    tech = sample.split("_")[1]  # get the technology from the filename
    read_dict[sample] = {}  # initiate a new dict
    sample_filepaths = sorted(
        glob.glob(f"{fastq_dir}/*{sample}*")
    )  # get all the filepaths for the current sample
    for (
        sample_file
    ) in sample_filepaths:  # iterate over all file paths for current sample
        read = (
            sample_file.split("/")[-1].split("__")[1].split(".")[0]
        )  # identify the read from the file name
        if tech != "ddseq":
            read_dict[sample][read_alias_dict[read]] = os.path.abspath(sample_file)
        else:
            read_dict[sample][read_alias_dict_ddseq[read]] = os.path.abspath(
                sample_file
            )

# concatenate this to the fastq_df
fastq_df = pd.concat(
    [fastq_df, pd.DataFrame.from_dict(read_dict, orient="index")], axis=1
)
fastq_df = fastq_df[
    ["technology", "fastq_PE1_path", "fastq_barcode_path", "fastq_PE2_path"]
]
fastq_df.index.name = "sample_name"

In [12]:
fastq_df

sample_name,technology,fastq_PE1_path,fastq_barcode_path,fastq_PE2_path
BIO_ddseq_4,biorad,/lustre1/project/stg_00002/lcb/fderop/data/202...,,/lustre1/project/stg_00002/lcb/fderop/data/202...
EPF_hydrop_1,hydrop_2x384,/lustre1/project/stg_00002/lcb/fderop/data/202...,/lustre1/project/stg_00002/lcb/fderop/data/202...,/lustre1/project/stg_00002/lcb/fderop/data/202...
OHS_s3atac_1,s3atac_1,/lustre1/project/stg_00002/lcb/fderop/data/202...,/lustre1/project/stg_00002/lcb/fderop/data/202...,/lustre1/project/stg_00002/lcb/fderop/data/202...
VIB_10xv2_1,atac_revcomp,/lustre1/project/stg_00002/lcb/fderop/data/202...,/lustre1/project/stg_00002/lcb/fderop/data/202...,/lustre1/project/stg_00002/lcb/fderop/data/202...


Then, write this to a tab-separated file.

In [13]:
fastq_df.to_csv("metadata.tsv", sep="\t", index=True)
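
Before handing the file to the pipeline, it can be worth checking that every listed path exists. A minimal sketch, assuming the column names used above; `find_missing_fastqs` is a hypothetical helper and not part of the pipeline.

```python
import csv
import os

PATH_COLUMNS = ["fastq_PE1_path", "fastq_barcode_path", "fastq_PE2_path"]


def find_missing_fastqs(metadata_path):
    """Return (sample_name, column, path) for every listed file that does not exist."""
    missing = []
    with open(metadata_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            for column in PATH_COLUMNS:
                path = row.get(column) or ""
                # Empty cells are allowed: ddSEQ samples have no separate barcode FASTQ.
                if path and not os.path.exists(path):
                    missing.append((row["sample_name"], column, path))
    return missing
```

Calling `find_missing_fastqs("metadata.tsv")` should return an empty list when all paths resolve.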

# Case #2. 10x Genomics FASTQ files
Here, we work with 10x files. All of the runs were sequenced on a NextSeq 2000, which uses the reverse complement workflow (https://kb.10xgenomics.com/hc/en-us/articles/360056364852-Should-I-select-Workflow-A-or-Workflow-B-for-the-i5-index-sequence-). The barcode read is therefore sequenced as its reverse complement and must be matched against a reverse-complemented whitelist (10x CellRanger detects this and handles it automatically).
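
To make concrete what a reverse-complemented whitelist means, here is a minimal sketch. `reverse_complement` is a hypothetical helper and the barcodes are made up; the real whitelists are configured in notebook 2.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(barcode):
    """Complement every base, then reverse the sequence."""
    return barcode.translate(COMPLEMENT)[::-1]


# Made-up barcodes, for illustration only.
toy_whitelist = ["AACG", "GGTA"]
toy_whitelist_revcomp = [reverse_complement(bc) for bc in toy_whitelist]
print(toy_whitelist_revcomp)
```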

In [14]:
fastq_dir = "10x_fastq/"

In [15]:
filepaths = sorted(glob.glob(f"{fastq_dir}/*/*R1*"))
filepaths

['10x_fastq/NextSeq2000_20221004/ASA__0201f1__20220902_MO-016-b-ATAC_S5_L001_R1_001.fastq.gz',
 '10x_fastq/NextSeq2000_20221004/ASA__0201f1__20220902_MO-016-b-ATAC_S5_L002_R1_001.fastq.gz',
 '10x_fastq/NextSeq2000_20221004/ASA__0201f1__20220902_MO-016-b-ATAC_S6_L001_R1_001.fastq.gz',
 '10x_fastq/NextSeq2000_20221004/ASA__0201f1__20220902_MO-016-b-ATAC_S6_L002_R1_001.fastq.gz',
 '10x_fastq/NextSeq2000_20221004/ASA__0201f1__20220902_MO-016-b-ATAC_S7_L001_R1_001.fastq.gz',
 '10x_fastq/NextSeq2000_20221004/ASA__0201f1__20220902_MO-016-b-ATAC_S7_L002_R1_001.fastq.gz',
 '10x_fastq/NextSeq2000_20221004/ASA__0201f1__20220902_MO-016-b-ATAC_S8_L001_R1_001.fastq.gz',
 '10x_fastq/NextSeq2000_20221004/ASA__0201f1__20220902_MO-016-b-ATAC_S8_L002_R1_001.fastq.gz',
 '10x_fastq/NextSeq2000_20221004/ASA__089fab__20220902_MO-016-c-ATAC_S10_L001_R1_001.fastq.gz',
 '10x_fastq/NextSeq2000_20221004/ASA__089fab__20220902_MO-016-c-ATAC_S10_L002_R1_001.fastq.gz',
 '10x_fastq/NextSeq2000_20221004/ASA__089fab__20

In [16]:
len(filepaths)

200

We assume that each FASTQ file has a structured name indicating the sample and read. In this syntax, R1 is mate 1, R2 is the barcode read and R3 is mate 2. For ddSEQ samples, R2 is mate 2. The pipeline does not require any particular file names, but a consistent naming scheme helps to generate the metadata file systematically.

In this case, the sample name can be extracted from the filename by taking everything that comes before `_S` (e.g. `_S20_L002_R1_001.fastq.gz` is removed).

In [17]:
filenames = [x.split("/")[-1] for x in filepaths]
samplenames = sorted(list(set([x.split("_S")[0] for x in filenames])))
samplenames

['ASA__0201f1__20220902_MO-016-b-ATAC',
 'ASA__089fab__20220902_MO-016-c-ATAC',
 'ASA__09f884__20230315_MO-018-b-ATAC',
 'ASA__12bf4c__20230315_MO-018-h-ATAC',
 'ASA__1bde8b__20230315_MO-018-a-ATAC',
 'ASA__1e598f__20230315_MO-018-d-ATAC',
 'ASA__2b6050__20220927_MO-017-b-ATAC',
 'ASA__2cb45a__20230315_MO-018-g-ATAC',
 'ASA__4a5f45__20220927_MO-017-a-ATAC',
 'ASA__57620c__20230315_MO-018-c-ATAC',
 'ASA__5d65d4__20220927_MO-017-f-ATAC',
 'ASA__620c11__20220927_MO-017-d-ATAC',
 'ASA__848fa1__20220927_MO-017-c-ATAC',
 'ASA__9e5bca__20230315_MO-018-f-ATAC',
 'ASA__ab17e7__20220902_MO-016-a-ATAC',
 'ASA__b6fa6d__20220927_MO-017-e-ATAC',
 'ASA__e27b63__20220902_MO-016-d-ATAC',
 'ASA__ffd613__20230315_MO-018-e-ATAC']
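
Splitting on `_S` works for these names, but it would cut too early if a sample name itself contained `_S`. A slightly stricter sketch, assuming the `_S<n>_L<lane>_R<read>_001.fastq.gz` suffix seen in the listing above; `sample_from_filename` is a hypothetical helper.

```python
import re

# Suffix appended to the sample name, e.g. "_S5_L001_R1_001.fastq.gz".
SUFFIX = re.compile(r"_S\d+_L\d{3}_R\d_\d{3}\.fastq\.gz$")


def sample_from_filename(filename):
    """Strip the trailing sample-number/lane/read suffix from a FASTQ file name."""
    match = SUFFIX.search(filename)
    if match is None:
        raise ValueError(f"Unexpected FASTQ name: {filename}")
    return filename[: match.start()]


print(
    sample_from_filename(
        "ASA__0201f1__20220902_MO-016-b-ATAC_S5_L001_R1_001.fastq.gz"
    )
)
```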

We can determine the sequencer used from the FASTQ:

In [18]:
import gzip
import re

# dictionary of instrument id regex: [platform(s)]
InstrumentIDs = {
    "HWI-M[0-9]{4}$": ["MiSeq"],
    "HWUSI": ["Genome Analyzer IIx"],
    "M[0-9]{5}$": ["MiSeq"],
    "HWI-C[0-9]{5}$": ["HiSeq 1500"],
    "C[0-9]{5}$": ["HiSeq 1500"],
    "HWI-D[0-9]{5}$": ["HiSeq 2500"],
    "D[0-9]{5}$": ["HiSeq 2500"],
    "J[0-9]{5}$": ["HiSeq 3000"],
    "K[0-9]{5}$": ["HiSeq 3000", "HiSeq 4000"],
    "E[0-9]{5}$": ["HiSeq X"],
    "NB[0-9]{6}$": ["NextSeq 500/550"],
    "NS[0-9]{6}$": ["NextSeq 500/550"],
    "MN[0-9]{5}$": ["MiniSeq"],
    "N[0-9]{5}$": ["NextSeq 500/550"],  # added since original was outdated
    "A[0-9]{5}$": ["NovaSeq 6000"],  # added since original was outdated
    "V[0-9]{5}$": ["NextSeq 2000"],  # added since original was outdated
    "VH[0-9]{5}$": ["NextSeq 2000"],  # added since original was outdated
}

# dictionary of flow cell id regex: ([platform(s)], flow cell version and yield)
FCIDs = {
    "C[A-Z,0-9]{4}ANXX$": (
        ["HiSeq 1500", "HiSeq 2000", "HiSeq 2500"],
        "High Output (8-lane) v4 flow cell",
    ),
    "C[A-Z,0-9]{4}ACXX$": (
        ["HiSeq 1000", "HiSeq 1500", "HiSeq 2000", "HiSeq 2500"],
        "High Output (8-lane) v3 flow cell",
    ),
    "H[A-Z,0-9]{4}ADXX$": (
        ["HiSeq 1500", "HiSeq 2500"],
        "Rapid Run (2-lane) v1 flow cell",
    ),
    "H[A-Z,0-9]{4}BCXX$": (
        ["HiSeq 1500", "HiSeq 2500"],
        "Rapid Run (2-lane) v2 flow cell",
    ),
    "H[A-Z,0-9]{4}BCXY$": (
        ["HiSeq 1500", "HiSeq 2500"],
        "Rapid Run (2-lane) v2 flow cell",
    ),
    "H[A-Z,0-9]{4}BBXX$": (["HiSeq 4000"], "(8-lane) v1 flow cell"),
    "H[A-Z,0-9]{4}BBXY$": (["HiSeq 4000"], "(8-lane) v1 flow cell"),
    "H[A-Z,0-9]{4}CCXX$": (["HiSeq X"], "(8-lane) flow cell"),
    "H[A-Z,0-9]{4}CCXY$": (["HiSeq X"], "(8-lane) flow cell"),
    "H[A-Z,0-9]{4}ALXX$": (["HiSeq X"], "(8-lane) flow cell"),
    "H[A-Z,0-9]{4}BGXX$": (["NextSeq"], "High output flow cell"),
    "H[A-Z,0-9]{4}BGXY$": (["NextSeq"], "High output flow cell"),
    "H[A-Z,0-9]{4}BGX2$": (["NextSeq"], "High output flow cell"),
    "H[A-Z,0-9]{4}AFXX$": (["NextSeq"], "Mid output flow cell"),
    "A[A-Z,0-9]{4}$": (["MiSeq"], "MiSeq flow cell"),
    "B[A-Z,0-9]{4}$": (["MiSeq"], "MiSeq flow cell"),
    "D[A-Z,0-9]{4}$": (["MiSeq"], "MiSeq nano flow cell"),
    "G[A-Z,0-9]{4}$": (["MiSeq"], "MiSeq micro flow cell"),
    "H[A-Z,0-9]{4}DMXX$": (["NovaSeq"], "S2 flow cell"),
}


SUPERNOVA_PLATFORM_BLACKLIST = ["HiSeq 3000", "HiSeq 4000", "HiSeq 3000/4000"]

_upgrade_set1 = set(["HiSeq 2000", "HiSeq 2500"])
_upgrade_set2 = set(["HiSeq 1500", "HiSeq 2500"])
_upgrade_set3 = set(["HiSeq 3000", "HiSeq 4000"])
_upgrade_set4 = set(["HiSeq 1000", "HiSeq 1500"])
_upgrade_set5 = set(["HiSeq 1000", "HiSeq 2000"])

fail_msg = "Cannot determine sequencing platform"
success_msg_template = "(likelihood: {})"
null_template = "{}"

# do intersection of lists
def intersect(a, b):
    return list(set(a) & set(b))


def union(a, b):
    return list(set(a) | set(b))


# extract ids from reads
def parse_readhead(head):
    fields = head.strip("\n").split(":")

    # if the header is ill-formatted or non-standard, return empty ids
    # (empty strings match none of the patterns, so detection reports "fail")
    if len(fields) < 3:
        return "", ""
    iid = fields[0][1:]
    fcid = fields[2]
    return iid, fcid


# infer sequencer from ids from single fastq
def infer_sequencer(iid, fcid):
    seq_by_iid = []
    for key in InstrumentIDs:
        if re.search(key, iid):
            seq_by_iid += InstrumentIDs[key]

    seq_by_fcid = []
    for key in FCIDs:
        if re.search(key, fcid):
            seq_by_fcid += FCIDs[key][0]

    sequencers = []

    # if both empty
    if not seq_by_iid and not seq_by_fcid:
        return sequencers, "fail"

    # if one non-empty
    if not seq_by_iid:
        return seq_by_fcid, "likely"
    if not seq_by_fcid:
        return seq_by_iid, "likely"

    # if neither empty
    sequencers = intersect(seq_by_iid, seq_by_fcid)
    if sequencers:
        return sequencers, "high"
    # this should not happen, but if both ids indicate different sequencers..
    else:
        sequencers = union(seq_by_iid, seq_by_fcid)
        return sequencers, "uncertain"


# process the flag and detected sequencer(s) for single fastq
def infer_sequencer_with_message(iid, fcid):
    sequencers, flag = infer_sequencer(iid, fcid)
    if not sequencers:
        return [""], fail_msg

    if flag == "high":
        msg_template = null_template
    else:
        msg_template = success_msg_template

    if set(sequencers) <= _upgrade_set1:
        return ["HiSeq2000/2500"], msg_template.format(flag)
    if set(sequencers) <= _upgrade_set2:
        return ["HiSeq1500/2500"], msg_template.format(flag)
    if set(sequencers) <= _upgrade_set3:
        return ["HiSeq3000/4000"], msg_template.format(flag)
    return sequencers, msg_template.format(flag)


def test_sequencer_detection():
    Samples = [
        "@ST-E00314:132:HLCJTCCXX:6:2206:31213:47966 1:N:0",
        "@D00209:258:CACDKANXX:6:2216:1260:1978 1:N:0:CGCAGTT",
        "@D00209:258:CACDKANXX:6:2216:1586:1970 1:N:0:GAGCAAG",
        "@A00311:74:HMLK5DMXX:1:1101:2013:1000 3:N:0:ACTCAGAC",
    ]

    seqrs = set()
    for head in Samples:
        iid, fcid = parse_readhead(head)
        seqr, msg = infer_sequencer_with_message(iid, fcid)
        for sr in seqr:
            signal = (sr, msg)
            seqrs.add(signal)

    print(seqrs)


def sequencer_detection_message(fastq_files):
    seqrs = set()
    # accumulate (sequencer, status) set
    for fastq in fastq_files:
        # open in text mode so the header is a plain string, not a bytes repr
        with gzip.open(fastq, "rt") as f:
            head = f.readline()
            # line = str(f.readline()
            # if len(line) > 0:
            #     if line[0] == "@":
            #         head = line
            #     else:
            #         print("Incorrectly formatted first read in FASTQ file: %s" % fastq)
            #         print(line)

        iid, fcid = parse_readhead(head)
        seqr, msg = infer_sequencer_with_message(iid, fcid)
        for sr in seqr:
            signal = (sr, msg)
            seqrs.add(signal)

    # get a list of sequencing platforms
    platforms = set()
    for platform, _ in seqrs:
        platforms.add(platform)
    sequencers = list(platforms)

    # if no sequencer detected at all
    message = ""
    fails = 0
    for platform, status in seqrs:
        if status == fail_msg:
            fails += 1
    if fails == len(seqrs):
        message = "could not detect the sequencing platform(s) used to generate the input FASTQ files"
        return message, sequencers

    # if partial or no detection failures
    if fails > 0:
        message = "could not detect the sequencing platform used to generate some of the input FASTQ files, "
    message += "detected the following sequencing platforms- "
    for platform, status in seqrs:
        if status != fail_msg:
            message += platform + " " + status + ", "
    message = message.strip(", ")
    return message, sequencers
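
For orientation, the instrument and flow cell identifiers used above are the first and third colon-separated fields of the read header. A small stand-alone sketch using one of the headers from `test_sequencer_detection`; `split_header` is a hypothetical name and mirrors, rather than replaces, `parse_readhead`.

```python
def split_header(header):
    """Return (instrument_id, flowcell_id) from a FASTQ read header, or None."""
    fields = header.strip().split(":")
    # Headers with fewer than three colon-separated fields carry no flow cell id.
    if len(fields) < 3:
        return None
    instrument_id = fields[0].lstrip("@")
    flowcell_id = fields[2]
    return instrument_id, flowcell_id


ids = split_header("@A00311:74:HMLK5DMXX:1:1101:2013:1000 3:N:0:ACTCAGAC")
print(ids)
```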

In [19]:
sequencers_dict = {}
for file in filepaths:
    filename = file.split("/")[-1]
    message, sequencers = sequencer_detection_message([file])

    # print(f"{filename}: {sequencers[0]}")

    sequencers_dict[filename] = sequencers[0]
sequencers_dict

{'ASA__0201f1__20220902_MO-016-b-ATAC_S5_L001_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S5_L002_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S6_L001_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S6_L002_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S7_L001_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S7_L002_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S8_L001_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S8_L002_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__089fab__20220902_MO-016-c-ATAC_S10_L001_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__089fab__20220902_MO-016-c-ATAC_S10_L002_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__089fab__20220902_MO-016-c-ATAC_S11_L001_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__089fab__20220902_MO-016-c-ATAC_S11_L002_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__089fab__20220902_MO-016-c-ATA

Determine whether this sequencer uses forward or reverse complement:

In [20]:
workflow_dict = {
    "MiSeq": "forward",
    "HiSeq 2500": "forward",
    "HiSeq 3000": "revcomp",
    "HiSeq X": "revcomp",
    "NextSeq 500/550": "revcomp",
    "NovaSeq 6000": "revcomp",
    "NextSeq 2000": "revcomp",
}

In [21]:
sample_workflow_dict = {x: workflow_dict[y] for x, y in sequencers_dict.items()}
sample_workflow_dict

{'ASA__0201f1__20220902_MO-016-b-ATAC_S5_L001_R1_001.fastq.gz': 'revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S5_L002_R1_001.fastq.gz': 'revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S6_L001_R1_001.fastq.gz': 'revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S6_L002_R1_001.fastq.gz': 'revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S7_L001_R1_001.fastq.gz': 'revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S7_L002_R1_001.fastq.gz': 'revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S8_L001_R1_001.fastq.gz': 'revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S8_L002_R1_001.fastq.gz': 'revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S10_L001_R1_001.fastq.gz': 'revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S10_L002_R1_001.fastq.gz': 'revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S11_L001_R1_001.fastq.gz': 'revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S11_L002_R1_001.fastq.gz': 'revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S12_L001_R1_001.fastq.gz': 'revcomp',
 'ASA__089fab__20220
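
The dictionary lookup above raises a `KeyError` for any platform that is not listed, for example `MiniSeq`, or the empty string that is returned when detection fails. A defensive sketch that collects such files instead of stopping; `assign_workflows` is a hypothetical helper and the toy inputs are made up.

```python
def assign_workflows(sequencers_by_file, workflow_by_platform):
    """Split files into those with a known workflow and those without."""
    assigned, unknown = {}, {}
    for filename, platform in sequencers_by_file.items():
        if platform in workflow_by_platform:
            assigned[filename] = workflow_by_platform[platform]
        else:
            # Keep the detected platform so it can be inspected by hand.
            unknown[filename] = platform
    return assigned, unknown


# Made-up inputs, for illustration only.
toy_workflows = {"MiSeq": "forward", "NextSeq 2000": "revcomp"}
toy_sequencers = {"a_R1.fastq.gz": "NextSeq 2000", "b_R1.fastq.gz": "MiniSeq"}

assigned, unknown = assign_workflows(toy_sequencers, toy_workflows)
print(assigned)
print(unknown)
```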

Then, define the type of debarcoding strategy you want to use. We have 10x libraries, so `standard` should do. We use the labels `atac_forward` and `atac_revcomp`, which are not default technologies and therefore fall back to `standard`.

In [22]:
tech_dict = {x: f"atac_{sample_workflow_dict[x]}" for x in filenames}
tech_dict

{'ASA__0201f1__20220902_MO-016-b-ATAC_S5_L001_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S5_L002_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S6_L001_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S6_L002_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S7_L001_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S7_L002_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S8_L001_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S8_L002_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S10_L001_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S10_L002_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S11_L001_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S11_L002_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__089fab__20220902_MO-016-c-ATA

Each sample has multiple FASTQ sets, though. Every set must therefore be aligned to the reference genome, and the resulting BAM files merged, before fragments are written from the merged BAM file. We can achieve this by having one line for each set of FASTQs for each sample. For example, `sample_A` could have 8 lines, with each line denoting a set of FASTQ files that should be aligned, and of which the resulting `.bam` files should be merged.

In [23]:
df_metadata = pd.DataFrame()
for sample in samplenames:
    sample_R1_files = [
        os.path.realpath(x) for x in sorted(glob.glob(f"{fastq_dir}/*/*{sample}*R1*"))
    ]
    # print(len(sample_R1_files))
    df_sub = pd.DataFrame([sample] * len(sample_R1_files), columns=["sample_name"])
    df_sub["technology"] = [tech_dict[os.path.basename(x)] for x in sample_R1_files]
    df_sub["fastq_PE1_path"] = sample_R1_files
    df_sub["fastq_barcode_path"] = [x.replace("_R1_", "_R2_") for x in sample_R1_files]
    df_sub["fastq_PE2_path"] = [x.replace("_R1_", "_R3_") for x in sample_R1_files]

    df_metadata = pd.concat([df_metadata, df_sub])
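
The barcode and mate 2 paths are derived from the R1 path by string replacement. A plain `str.replace` acts on the whole path, so a directory that happens to contain `_R1_` would be rewritten as well. A more conservative sketch that only touches the file name; `mate_path` is a hypothetical helper and the example path is made up.

```python
import os


def mate_path(r1_path, read):
    """Derive the path of another read (e.g. 'R2' or 'R3') from an R1 path."""
    directory, name = os.path.split(r1_path)
    # Replace only the last "_R1_" in the file name, leaving directories untouched.
    head, sep, tail = name.rpartition("_R1_")
    if not sep:
        raise ValueError(f"No _R1_ tag in file name: {name}")
    return os.path.join(directory, f"{head}_{read}_{tail}")


r1 = "/runs/x_R1_y/sample_S1_L001_R1_001.fastq.gz"
print(mate_path(r1, "R2"))
print(mate_path(r1, "R3"))
```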

In [24]:
df_metadata

Unnamed: 0,sample_name,technology,fastq_PE1_path,fastq_barcode_path,fastq_PE2_path
0,ASA__0201f1__20220902_MO-016-b-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
1,ASA__0201f1__20220902_MO-016-b-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
2,ASA__0201f1__20220902_MO-016-b-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
3,ASA__0201f1__20220902_MO-016-b-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
4,ASA__0201f1__20220902_MO-016-b-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
...,...,...,...,...,...
3,ASA__ffd613__20230315_MO-018-e-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
4,ASA__ffd613__20230315_MO-018-e-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
5,ASA__ffd613__20230315_MO-018-e-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
6,ASA__ffd613__20230315_MO-018-e-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
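
Since each sample now spans several lines, a quick count per sample helps confirm that no FASTQ set was dropped. A sketch on made-up rows; with the real table, `df_metadata["sample_name"].value_counts()` gives the same information.

```python
from collections import Counter

# Made-up rows mimicking the metadata table, for illustration only.
toy_rows = [
    {"sample_name": "sample_A", "fastq_PE1_path": "sample_A_S1_L001_R1_001.fastq.gz"},
    {"sample_name": "sample_A", "fastq_PE1_path": "sample_A_S1_L002_R1_001.fastq.gz"},
    {"sample_name": "sample_B", "fastq_PE1_path": "sample_B_S2_L001_R1_001.fastq.gz"},
]

# Number of FASTQ sets (lines) listed for each sample.
sets_per_sample = Counter(row["sample_name"] for row in toy_rows)
print(sets_per_sample)
```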


Then, write this to a tab-separated file.

In [25]:
df_metadata.to_csv("metadata_10x.tsv", sep="\t", index=False, header=True)

Be sure to include the 10x whitelist as an option for `atac_forward` or `atac_revcomp` in the `.config` file defined in notebook 2.