# 1. Writing a metadata file

In [4]:
import os
import glob
import pandas as pd
import pypumatac as pum

%load_ext jupyter_black



The metadata file is essentially a tab-delimited list of samples. For example, if you have four 10x scATAC-seq runs named `ATAC_1` to `ATAC_4`, and each run has produced three FASTQ files (mate 1, mate 2 and the barcode read), your `metadata.tsv` would contain a header and four lines, with each line containing the paths to the three FASTQ files. An additional column (`technology`) refers to a set of instructions that the pipeline uses to extract and correct cell barcodes from the barcode FASTQ. Four default technologies come with the pipeline:
- `standard`: the simplest case, where a barcode whitelist is provided and each read in the `fastq_barcode` file can be corrected directly against that whitelist. This is the default strategy.
- `ddseq`: for Bio-Rad SureCell ATAC ddSEQ samples. This workflow is a bit more complicated for two reasons: the barcode sequence itself contains adapters (constant regions that are shared between all barcodes), and the barcode read is followed by a gDNA insert read (i.e. barcode and mate 1 are read in the same sequencing read).
- `hydrop_3x96` and `hydrop_3x384`: for two variations of HyDrop. The HyDrop barcode read also contains constant regions, which are automatically removed by the pipeline.

Later (in notebook `2_running_nextflow_pipeline.ipynb`), you will define the path to the whitelist that should be used for each `technology`.
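As a minimal sketch of the four-run example described above, the table could be assembled with pandas as follows. The directory `/path/to/fastq` and the FASTQ file names are placeholders for illustration, not files shipped with the pipeline; the column names are the ones used in the example table in this notebook.

```python
import pandas as pd

# Hypothetical sample names and FASTQ directory, for illustration only.
samples = [f"ATAC_{i}" for i in range(1, 5)]
fastq_dir = "/path/to/fastq"

# One row per run: sample name, technology and the three FASTQ paths.
rows = [
    {
        "sample_name": s,
        "technology": "standard",
        "fastq_PE1_path": f"{fastq_dir}/{s}_R1.fastq.gz",
        "fastq_barcode_path": f"{fastq_dir}/{s}_R2.fastq.gz",
        "fastq_PE2_path": f"{fastq_dir}/{s}_R3.fastq.gz",
    }
    for s in samples
]
minimal_metadata = pd.DataFrame(rows)

# Without a target path, to_csv returns the tab-separated text as a string:
# one header line plus one line per run.
metadata_text = minimal_metadata.to_csv(sep="\t", index=False)
print(metadata_text)
```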

An example `metadata.tsv`, for many samples, looks like so:

In [5]:
example = pd.read_csv("metadata_example.tsv", sep="\t", index_col=0)
example

sample_name,technology,fastq_PE1_path,fastq_barcode_path,fastq_PE2_path
BIO_ddseq_1,biorad,/dodrio/scratch/projects/starting_2022_023/ben...,,/dodrio/scratch/projects/starting_2022_023/ben...
BIO_ddseq_2,biorad,/dodrio/scratch/projects/starting_2022_023/ben...,,/dodrio/scratch/projects/starting_2022_023/ben...
BIO_ddseq_3,biorad,/dodrio/scratch/projects/starting_2022_023/ben...,,/dodrio/scratch/projects/starting_2022_023/ben...
BIO_ddseq_4,biorad,/dodrio/scratch/projects/starting_2022_023/ben...,,/dodrio/scratch/projects/starting_2022_023/ben...
BRO_mtscatac_1,atac_revcomp,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...
BRO_mtscatac_2,atac_revcomp,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...
CNA_10xmultiome_1,multiome_revcomp,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...
CNA_10xmultiome_2,multiome_revcomp,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...
CNA_10xv11_1,atac,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...
CNA_10xv11_2,atac,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...,/dodrio/scratch/projects/starting_2022_023/ben...


In this notebook, we use some simple Python string manipulation and list comprehensions to automatically write a `metadata.tsv` file for the pipeline to interpret.

# Case #1. Standard bcl2fastq format filenames
Here, we work with standard bcl2fastq format filenames. 10x demultiplexed FASTQ files will also work seamlessly.
All sequencing files are in a directory arbitrarily named `10x_fastq`. The code below lists and sorts these files as needed.
All of the runs were sequenced on a NextSeq 2000, which uses the reverse complement workflow (https://kb.10xgenomics.com/hc/en-us/articles/360056364852-Should-I-select-Workflow-A-or-Workflow-B-for-the-i5-index-sequence-). The barcode is therefore read as its reverse complement and must be matched against a reverse-complemented whitelist (10x CellRanger detects this and does it automatically).
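To make the reverse complement step concrete, here is a minimal sketch of how a whitelist could be reverse-complemented in plain Python. The function names and the two barcodes are made up for illustration; this is not the pipeline's or CellRanger's own code.

```python
# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(barcode: str) -> str:
    """Return the reverse complement of a DNA barcode."""
    return barcode.upper().translate(_COMPLEMENT)[::-1]


def reverse_complement_whitelist(barcodes):
    """Reverse-complement every barcode in an iterable, keeping the order."""
    return [reverse_complement(bc) for bc in barcodes]


# Made-up barcodes, not taken from any real whitelist.
example_whitelist = ["AAACGGTC", "TTGCAACG"]
revcomp_whitelist = reverse_complement_whitelist(example_whitelist)
print(revcomp_whitelist)  # ['GACCGTTT', 'CGTTGCAA']
```

Applying the function twice returns the original barcode, which is a quick way to check a converted whitelist.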

In [14]:
fastq_dir = "10x_fastq/"

In [15]:
filepaths = sorted(glob.glob(f"{fastq_dir}/*R1*"))
filepaths

['10x_fastq/ASA__0201f1__20220902_MO-016-b-ATAC_S5_L001_R1_001.fastq.gz',
 '10x_fastq/ASA__0201f1__20220902_MO-016-b-ATAC_S5_L002_R1_001.fastq.gz',
 '10x_fastq/ASA__0201f1__20220902_MO-016-b-ATAC_S6_L001_R1_001.fastq.gz',
 '10x_fastq/ASA__0201f1__20220902_MO-016-b-ATAC_S6_L002_R1_001.fastq.gz',
 '10x_fastq/ASA__0201f1__20220902_MO-016-b-ATAC_S7_L001_R1_001.fastq.gz',
 '10x_fastq/ASA__0201f1__20220902_MO-016-b-ATAC_S7_L002_R1_001.fastq.gz',
 '10x_fastq/ASA__0201f1__20220902_MO-016-b-ATAC_S8_L001_R1_001.fastq.gz',
 '10x_fastq/ASA__0201f1__20220902_MO-016-b-ATAC_S8_L002_R1_001.fastq.gz',
 '10x_fastq/ASA__089fab__20220902_MO-016-c-ATAC_S10_L001_R1_001.fastq.gz',
 '10x_fastq/ASA__089fab__20220902_MO-016-c-ATAC_S10_L002_R1_001.fastq.gz',
 '10x_fastq/ASA__089fab__20220902_MO-016-c-ATAC_S11_L001_R1_001.fastq.gz',
 '10x_fastq/ASA__089fab__20220902_MO-016-c-ATAC_S11_L002_R1_001.fastq.gz',
 '10x_fastq/ASA__089fab__20220902_MO-016-c-ATAC_S12_L001_R1_001.fastq.gz',
 '10x_fastq/ASA__089fab__20220902

In [16]:
len(filepaths)

200

We assume that each FASTQ file has a structured name indicating the sample and read. In this syntax, R1 is mate 1, R2 is the barcode read and R3 is mate 2. For ddSEQ samples, R2 is mate 2. The file names themselves are arbitrary, but a consistent structure helps to generate the metadata file systematically.

In this case, the sample name can be extracted from the filename by taking everything that comes before `_S` (e.g. the suffix `_S20_L002_R1_001.fastq.gz` is removed).

In [17]:
filenames = [os.path.basename(x) for x in filepaths]
samplenames = sorted({x.split("_S")[0] for x in filenames})
samplenames

['ASA__0201f1__20220902_MO-016-b-ATAC',
 'ASA__089fab__20220902_MO-016-c-ATAC',
 'ASA__09f884__20230315_MO-018-b-ATAC',
 'ASA__12bf4c__20230315_MO-018-h-ATAC',
 'ASA__1bde8b__20230315_MO-018-a-ATAC',
 'ASA__1e598f__20230315_MO-018-d-ATAC',
 'ASA__2b6050__20220927_MO-017-b-ATAC',
 'ASA__2cb45a__20230315_MO-018-g-ATAC',
 'ASA__4a5f45__20220927_MO-017-a-ATAC',
 'ASA__57620c__20230315_MO-018-c-ATAC',
 'ASA__5d65d4__20220927_MO-017-f-ATAC',
 'ASA__620c11__20220927_MO-017-d-ATAC',
 'ASA__848fa1__20220927_MO-017-c-ATAC',
 'ASA__9e5bca__20230315_MO-018-f-ATAC',
 'ASA__ab17e7__20220902_MO-016-a-ATAC',
 'ASA__b6fa6d__20220927_MO-017-e-ATAC',
 'ASA__e27b63__20220902_MO-016-d-ATAC',
 'ASA__ffd613__20230315_MO-018-e-ATAC']
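Splitting on the first `_S` works for the names above, but it would truncate a sample name that itself contains `_S`. A stricter alternative, sketched here under the assumption that the files follow the bcl2fastq convention `<sample>_S<n>_L<lane>_<read>_001.fastq.gz`, is to anchor a regular expression on the whole suffix. The function name and the second test name below are hypothetical.

```python
import re

# bcl2fastq convention: <sample>_S<n>_L<lane>_<read>_001.fastq.gz
BCL2FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+)_S(?P<index>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[RI][1-3])_001\.fastq\.gz$"
)


def parse_bcl2fastq_name(filename: str) -> dict:
    """Split a bcl2fastq-style file name into sample, index, lane and read."""
    match = BCL2FASTQ_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Not a bcl2fastq-style file name: {filename!r}")
    return match.groupdict()


parsed = parse_bcl2fastq_name(
    "ASA__0201f1__20220902_MO-016-b-ATAC_S5_L001_R1_001.fastq.gz"
)
print(parsed["sample"], parsed["lane"], parsed["read"])

# A hypothetical name containing "_S" inside the sample name itself:
tricky = "MY_Sample_S1_L001_R1_001.fastq.gz"
print(tricky.split("_S")[0])  # naive split keeps only "MY"
print(parse_bcl2fastq_name(tricky)["sample"])  # regex keeps "MY_Sample"
```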

We can determine the sequencer used from the FASTQ:

In [18]:
sequencers_dict = {}
for file in filepaths:
    filename = file.split("/")[-1]
    message, sequencers = pum.sequencer_detection_message([file])

    # print(f"{filename}: {sequencers[0]}")

    sequencers_dict[filename] = sequencers[0]
sequencers_dict

{'ASA__0201f1__20220902_MO-016-b-ATAC_S5_L001_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S5_L002_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S6_L001_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S6_L002_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S7_L001_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S7_L002_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S8_L001_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S8_L002_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__089fab__20220902_MO-016-c-ATAC_S10_L001_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__089fab__20220902_MO-016-c-ATAC_S10_L002_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__089fab__20220902_MO-016-c-ATAC_S11_L001_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__089fab__20220902_MO-016-c-ATAC_S11_L002_R1_001.fastq.gz': 'NextSeq 2000',
 'ASA__089fab__20220902_MO-016-c-ATA
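For readers who want to look at the raw information themselves: Illumina FASTQ read headers start with `@<instrument>:<run number>:<flowcell ID>:<lane>:...`. The sketch below simply parses those fields from the first read of a gzipped FASTQ. It is not the implementation behind `pum.sequencer_detection_message`, whose internals are not shown in this notebook, and the function name is hypothetical.

```python
import gzip


def read_illumina_header(fastq_gz_path: str) -> dict:
    """Parse instrument, run, flowcell and lane from the first read header."""
    with gzip.open(fastq_gz_path, "rt") as handle:
        header = handle.readline().strip()
    if not header.startswith("@"):
        raise ValueError(f"Not a FASTQ header: {header!r}")
    # Drop the leading "@" and the optional comment after the first space.
    fields = header[1:].split(" ")[0].split(":")
    if len(fields) < 4:
        raise ValueError(f"Not an Illumina-style header: {header!r}")
    instrument, run_number, flowcell, lane = fields[:4]
    return {
        "instrument": instrument,
        "run_number": run_number,
        "flowcell": flowcell,
        "lane": lane,
    }
```

Calling `read_illumina_header(filepaths[0])` would return the instrument and flowcell identifiers of the first file; mapping these identifiers to a sequencer model is left to `pypumatac`.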

Determine whether this sequencer uses forward or reverse complement:

In [19]:
workflow_dict = {
    "MiSeq": "forward",
    "HiSeq 2500": "forward",
    "HiSeq 3000": "revcomp",
    "HiSeq X": "revcomp",
    "NextSeq 500/550": "revcomp",
    "NovaSeq 6000": "revcomp",
    "NextSeq 2000": "revcomp",
}

In [20]:
sample_workflow_dict = {x: workflow_dict[y] for x, y in sequencers_dict.items()}
sample_workflow_dict

{'ASA__0201f1__20220902_MO-016-b-ATAC_S5_L001_R1_001.fastq.gz': 'revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S5_L002_R1_001.fastq.gz': 'revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S6_L001_R1_001.fastq.gz': 'revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S6_L002_R1_001.fastq.gz': 'revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S7_L001_R1_001.fastq.gz': 'revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S7_L002_R1_001.fastq.gz': 'revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S8_L001_R1_001.fastq.gz': 'revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S8_L002_R1_001.fastq.gz': 'revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S10_L001_R1_001.fastq.gz': 'revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S10_L002_R1_001.fastq.gz': 'revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S11_L001_R1_001.fastq.gz': 'revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S11_L002_R1_001.fastq.gz': 'revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S12_L001_R1_001.fastq.gz': 'revcomp',
 'ASA__089fab__20220
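A plain dictionary lookup raises a `KeyError` for any sequencer that is not listed as a key. If that is a concern, one option is a lookup with an explicit fallback, sketched below. The helper name `lookup_workflow`, the shortened table and the instrument name `Some Future Instrument` are illustrative assumptions.

```python
# Shortened copy of the sequencer-to-workflow table, for illustration.
known_workflows = {
    "MiSeq": "forward",
    "NovaSeq 6000": "revcomp",
    "NextSeq 2000": "revcomp",
}


def lookup_workflow(sequencer: str, table=None, default: str = "UNKNOWN!") -> str:
    """Return the workflow for a sequencer, or `default` if it is not listed."""
    table = known_workflows if table is None else table
    return table.get(sequencer, default)


print(lookup_workflow("NextSeq 2000"))  # revcomp
print(lookup_workflow("Some Future Instrument"))  # UNKNOWN!
```

Files that come back as `UNKNOWN!` can then be checked by hand instead of stopping the whole loop.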

Then, define the debarcoding strategy you want to use. We have 10x libraries, so `standard` should do. I name the technology `atac_forward` or `atac_revcomp` depending on the workflow; neither is a built-in technology name, so the pipeline falls back to `standard`.

In [21]:
tech_dict = {x: f"atac_{sample_workflow_dict[x]}" for x in filenames}
tech_dict

{'ASA__0201f1__20220902_MO-016-b-ATAC_S5_L001_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S5_L002_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S6_L001_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S6_L002_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S7_L001_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S7_L002_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S8_L001_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__0201f1__20220902_MO-016-b-ATAC_S8_L002_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S10_L001_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S10_L002_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S11_L001_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__089fab__20220902_MO-016-c-ATAC_S11_L002_R1_001.fastq.gz': 'atac_revcomp',
 'ASA__089fab__20220902_MO-016-c-ATA

However, several FASTQ sets belong to the same sample. Each set must be aligned to the reference genome, and the resulting BAM files merged, before fragments are written from the merged BAM file. We can achieve this by having one line for each set of FASTQs for each sample. For example, `sample_A` could have 8 lines, each denoting a set of FASTQ files to be aligned, whose resulting `.bam` files are then merged.

In [22]:
df_metadata = pd.DataFrame()
for sample in samplenames:
    sample_R1_files = [
        os.path.realpath(x) for x in sorted(glob.glob(f"{fastq_dir}/*{sample}*R1*"))
    ]
    # print(len(sample_R1_files))
    df_sub = pd.DataFrame([sample] * len(sample_R1_files), columns=["sample_name"])
    df_sub["technology"] = [tech_dict[os.path.basename(x)] for x in sample_R1_files]
    df_sub["fastq_PE1_path"] = sample_R1_files
    df_sub["fastq_barcode_path"] = [x.replace("_R1_", "_R2_") for x in sample_R1_files]
    df_sub["fastq_PE2_path"] = [x.replace("_R1_", "_R3_") for x in sample_R1_files]

    df_metadata = pd.concat([df_metadata, df_sub])

In [23]:
df_metadata

,sample_name,technology,fastq_PE1_path,fastq_barcode_path,fastq_PE2_path
0,ASA__0201f1__20220902_MO-016-b-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
1,ASA__0201f1__20220902_MO-016-b-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
2,ASA__0201f1__20220902_MO-016-b-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
3,ASA__0201f1__20220902_MO-016-b-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
4,ASA__0201f1__20220902_MO-016-b-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
...,...,...,...,...,...
3,ASA__ffd613__20230315_MO-018-e-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
4,ASA__ffd613__20230315_MO-018-e-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
5,ASA__ffd613__20230315_MO-018-e-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
6,ASA__ffd613__20230315_MO-018-e-ATAC,atac_revcomp,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...,/lustre1/project/stg_00002/lcb/ngs_runs/NextSe...
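Because the R2 and R3 paths are derived by string replacement, a file name that does not follow the expected pattern silently yields a path that does not exist. A possible safeguard, sketched here with a hypothetical helper `missing_fastqs`, is to list every path in the table that is not present on disk. Empty entries (such as the blank barcode column of ddSEQ samples in the example table) are skipped.

```python
import os

import pandas as pd

PATH_COLUMNS = ["fastq_PE1_path", "fastq_barcode_path", "fastq_PE2_path"]


def missing_fastqs(metadata: pd.DataFrame, columns=PATH_COLUMNS) -> pd.DataFrame:
    """Return one row per (sample, column, path) whose file does not exist."""
    # Reshape to long format: one row per sample and path column.
    long_df = metadata.melt(
        id_vars="sample_name",
        value_vars=columns,
        var_name="column",
        value_name="path",
    )
    # Skip empty entries, then test each remaining path on disk.
    long_df = long_df.dropna(subset=["path"])
    exists = long_df["path"].map(os.path.exists).astype(bool)
    return long_df.loc[~exists].reset_index(drop=True)
```

If all files are in place, `missing_fastqs(df_metadata)` would be expected to return an empty table.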


Then, write this to a tab-separated (`.tsv`) file.

In [24]:
df_metadata.to_csv("metadata_10x.tsv", sep="\t", index=False, header=True)
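A quick sanity check after writing is to read the table back and count the FASTQ sets per sample. The sketch below uses a small in-memory stand-in with placeholder sample names and paths instead of the real file.

```python
import io

import pandas as pd

# Toy stand-in for a written metadata file; names and paths are placeholders.
tsv_text = (
    "sample_name\ttechnology\tfastq_PE1_path\tfastq_barcode_path\tfastq_PE2_path\n"
    "sample_A\tatac_revcomp\tA_L001_R1.fastq.gz\tA_L001_R2.fastq.gz\tA_L001_R3.fastq.gz\n"
    "sample_A\tatac_revcomp\tA_L002_R1.fastq.gz\tA_L002_R2.fastq.gz\tA_L002_R3.fastq.gz\n"
    "sample_B\tatac_revcomp\tB_L001_R1.fastq.gz\tB_L001_R2.fastq.gz\tB_L001_R3.fastq.gz\n"
)

metadata = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Number of FASTQ sets (lines) that will be aligned and merged per sample.
sets_per_sample = metadata.groupby("sample_name").size()
print(sets_per_sample)
```

For the real file, replace the `io.StringIO(...)` argument with the path `"metadata_10x.tsv"`.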

Be sure to include the 10x whitelist as an option for `atac_forward` or `atac_revcomp` in the `.config` file defined in notebook 2.

# Case #2. A directory containing FASTQ files with arbitrary names

First, find all the FASTQ files in your `fastq_dir/`:

In [3]:
fastq_dir = "PUMATAC_example_fastq/"

In [4]:
filepaths = sorted(glob.glob(f"{fastq_dir}/*R1*"))
filepaths

['PUMATAC_example_fastq/BIO_ddseq_4__R1.LIBDS.fastq.gz',
 'PUMATAC_example_fastq/EPF_hydrop_1__R1.LIBDS.fastq.gz',
 'PUMATAC_example_fastq/OHS_s3atac_1__R1.LIBDS.fastq.gz',
 'PUMATAC_example_fastq/VIB_10xv2_1__R1.LIBDS.fastq.gz']

We assume that each FASTQ file has a structured name indicating the sample and read. In this syntax, R1 is mate 1, R2 is the barcode read and R3 is mate 2. For ddSEQ samples, R2 is mate 2. The file names themselves are arbitrary, but a consistent structure helps to generate the metadata file systematically. The scripts below assume the name structure shown above (`<sample>__R1.LIBDS.fastq.gz`). If your FASTQ filenames are structured differently, adapt these scripts or compile the `metadata.tsv` manually.

In [5]:
filenames = [x.split("/")[-1] for x in filepaths]
samplenames = sorted(list(set([x.split("__")[0] for x in filenames])))
samplenames

['BIO_ddseq_4', 'EPF_hydrop_1', 'OHS_s3atac_1', 'VIB_10xv2_1']
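In these file names the read is marked as `__R1.` rather than `_R1_`, so any code that derives the R2 or R3 path has to match that exact tag. The sketch below shows one way to split a name into sample and read, and to swap the read tag; both helper names are hypothetical.

```python
import re

# Read tag used in this directory: "<sample>__R1.LIBDS.fastq.gz"
READ_TAG = re.compile(r"__R(?P<read>[1-3])\.")


def split_sample_and_read(filename: str) -> tuple:
    """Return (sample name, read) for a name such as 'X__R1.LIBDS.fastq.gz'."""
    match = READ_TAG.search(filename)
    if match is None:
        raise ValueError(f"No read tag found in {filename!r}")
    return filename[: match.start()], f"R{match.group('read')}"


def with_read(path: str, read: str) -> str:
    """Return `path` with its first read tag replaced, e.g. R1 -> R3."""
    new_path, n_replaced = READ_TAG.subn(f"__{read}.", path, count=1)
    if n_replaced != 1:
        raise ValueError(f"No read tag found in {path!r}")
    return new_path


print(split_sample_and_read("BIO_ddseq_4__R1.LIBDS.fastq.gz"))
print(with_read("PUMATAC_example_fastq/VIB_10xv2_1__R1.LIBDS.fastq.gz", "R3"))
```

Raising an error when no tag is found avoids the silent failure mode of `str.replace`, which returns the input unchanged if the pattern is absent.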

We can determine the sequencer used from the FASTQ:

In [6]:
sequencers_dict = {}
for file in filepaths:
    filename = file.split("/")[-1]
    try:
        message, sequencers = pum.sequencer_detection_message([file])
        # Multiple sequencers should only be detected when a FASTQ file is the
        # concatenation of FASTQ files sequenced on different instruments.
        sequencer = sequencers[0]
    except Exception:
        # Detection failed: flag this file for manual review.
        sequencer = "UNKNOWN!"

    print(f"{filename}: {sequencer}")
    sequencers_dict[filename] = sequencer
sequencers_dict

BIO_ddseq_4__R1.LIBDS.fastq.gz: NextSeq 500/550
EPF_hydrop_1__R1.LIBDS.fastq.gz: NextSeq 2000
OHS_s3atac_1__R1.LIBDS.fastq.gz: UNKNOWN!
VIB_10xv2_1__R1.LIBDS.fastq.gz: NextSeq 2000


{'BIO_ddseq_4__R1.LIBDS.fastq.gz': 'NextSeq 500/550',
 'EPF_hydrop_1__R1.LIBDS.fastq.gz': 'NextSeq 2000',
 'OHS_s3atac_1__R1.LIBDS.fastq.gz': 'UNKNOWN!',
 'VIB_10xv2_1__R1.LIBDS.fastq.gz': 'NextSeq 2000'}

Determine whether this sequencer uses forward or reverse complement:

In [7]:
workflow_dict = {
    "MiSeq": "forward",
    "HiSeq 2500": "forward",
    "HiSeq 3000": "revcomp",
    "HiSeq X": "revcomp",
    "NextSeq 500/550": "revcomp",
    "NovaSeq 6000": "revcomp",
    "NextSeq 2000": "revcomp",
    "UNKNOWN!": "UNKNOWN!",
}

In [8]:
sample_workflow_dict = {x: workflow_dict[y] for x, y in sequencers_dict.items()}
sample_workflow_dict

{'BIO_ddseq_4__R1.LIBDS.fastq.gz': 'revcomp',
 'EPF_hydrop_1__R1.LIBDS.fastq.gz': 'revcomp',
 'OHS_s3atac_1__R1.LIBDS.fastq.gz': 'UNKNOWN!',
 'VIB_10xv2_1__R1.LIBDS.fastq.gz': 'revcomp'}
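Files whose workflow could not be determined need a manual decision before the metadata is written. A small sketch for listing them, using a copy of the output above as input (the variable names are illustrative):

```python
# Copy of the per-file workflow assignments shown above.
sample_workflows = {
    "BIO_ddseq_4__R1.LIBDS.fastq.gz": "revcomp",
    "EPF_hydrop_1__R1.LIBDS.fastq.gz": "revcomp",
    "OHS_s3atac_1__R1.LIBDS.fastq.gz": "UNKNOWN!",
    "VIB_10xv2_1__R1.LIBDS.fastq.gz": "revcomp",
}

# Anything that is neither "forward" nor "revcomp" needs a manual look.
resolved = {"forward", "revcomp"}
needs_review = sorted(
    name for name, workflow in sample_workflows.items() if workflow not in resolved
)
print(needs_review)  # ['OHS_s3atac_1__R1.LIBDS.fastq.gz']
```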

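Then, define the debarcoding strategy you want to use. As a starting point, every file is given the 10x-style name `atac_forward` or `atac_revcomp` depending on its workflow, which falls back to `standard`. Files with an undetected sequencer end up as `atac_UNKNOWN!` and must be corrected by hand.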

In [9]:
tech_dict = {x: f"atac_{sample_workflow_dict[x]}" for x in filenames}
tech_dict

{'BIO_ddseq_4__R1.LIBDS.fastq.gz': 'atac_revcomp',
 'EPF_hydrop_1__R1.LIBDS.fastq.gz': 'atac_revcomp',
 'OHS_s3atac_1__R1.LIBDS.fastq.gz': 'atac_UNKNOWN!',
 'VIB_10xv2_1__R1.LIBDS.fastq.gz': 'atac_revcomp'}

Now, copy the dictionary and replace each value with the sample's true method, for example:

In [10]:
tech_dict = {
    "BIO_ddseq_4__R1.LIBDS.fastq.gz": "biorad",  # this is a Bio-Rad sample, so we can use the `biorad` method
    "EPF_hydrop_1__R1.LIBDS.fastq.gz": "hydrop_2x384",  # this is a HyDrop sample, so we can use the `hydrop_2x384` method
    "OHS_s3atac_1__R1.LIBDS.fastq.gz": "s3atac_1",  # this is an s3-ATAC sample, so we must supply a custom whitelist for it in the .config file
    "VIB_10xv2_1__R1.LIBDS.fastq.gz": "atac_revcomp",  # this is a normal 10x v2 sample, so we can use the `atac_revcomp` workflow
}
tech_dict

{'BIO_ddseq_4__R1.LIBDS.fastq.gz': 'biorad',
 'EPF_hydrop_1__R1.LIBDS.fastq.gz': 'hydrop_2x384',
 'OHS_s3atac_1__R1.LIBDS.fastq.gz': 's3atac_1',
 'VIB_10xv2_1__R1.LIBDS.fastq.gz': 'atac_revcomp'}

`biorad` is a standard method, whereas `atac_revcomp` is not. As a result, `standard` will be used as the demultiplexing strategy for `atac_revcomp`, with the whitelist specified for `atac_revcomp` in the `.config` file written in notebook 2.

In [11]:
df_metadata = pd.DataFrame()
for sample in samplenames:
    sample_R1_files = [
        os.path.realpath(x) for x in sorted(glob.glob(f"{fastq_dir}/*{sample}*R1*"))
    ]
    # print(len(sample_R1_files))
    df_sub = pd.DataFrame([sample] * len(sample_R1_files), columns=["sample_name"])
    df_sub["technology"] = [tech_dict[os.path.basename(x)] for x in sample_R1_files]
    df_sub["fastq_PE1_path"] = sample_R1_files
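    # These file names mark the read as `__R1.`, not `_R1_`, so swap that tag.
    # For ddSEQ samples R2 is mate 2, so adjust those rows if needed.
    df_sub["fastq_barcode_path"] = [x.replace("__R1.", "__R2.") for x in sample_R1_files]
    df_sub["fastq_PE2_path"] = [x.replace("__R1.", "__R3.") for x in sample_R1_files]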

    df_metadata = pd.concat([df_metadata, df_sub])

In [12]:
df_metadata

,sample_name,technology,fastq_PE1_path,fastq_barcode_path,fastq_PE2_path
0,BIO_ddseq_4,biorad,/lustre1/project/stg_00002/lcb/fderop/data/202...,/lustre1/project/stg_00002/lcb/fderop/data/202...,/lustre1/project/stg_00002/lcb/fderop/data/202...
0,EPF_hydrop_1,hydrop_2x384,/lustre1/project/stg_00002/lcb/fderop/data/202...,/lustre1/project/stg_00002/lcb/fderop/data/202...,/lustre1/project/stg_00002/lcb/fderop/data/202...
0,OHS_s3atac_1,s3atac_1,/lustre1/project/stg_00002/lcb/fderop/data/202...,/lustre1/project/stg_00002/lcb/fderop/data/202...,/lustre1/project/stg_00002/lcb/fderop/data/202...
0,VIB_10xv2_1,atac_revcomp,/lustre1/project/stg_00002/lcb/fderop/data/202...,/lustre1/project/stg_00002/lcb/fderop/data/202...,/lustre1/project/stg_00002/lcb/fderop/data/202...


Then, write this to a tab-separated (`.tsv`) file.

In [13]:
df_metadata.to_csv("metadata.tsv", sep="\t", index=False, header=True)