# Upload sequencing data to SRA
This Python Jupyter notebook uploads the sequencing data to the NIH [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra), or SRA.

## Create BioProject and BioSamples
The first step, creating the BioProject and BioSamples, was done manually.
Note that for future uploads you may be able to reuse the existing BioProject.

To create these, I signed in at the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) using the box at the upper right of the webpage, and then opened the [SRA Submission Portal](https://submit.ncbi.nlm.nih.gov/subs/sra/).
I then manually completed the first five steps, which define the project and samples.

## Create submission sheet
The sixth step is to create the submission sheet in `*.tsv` format, which is done by the following code.

First, import Python modules:

In [11]:
import os

import pandas as pd

import yaml

Read the configuration for the analysis:

In [12]:
with open('../config.yaml') as f:
    config = yaml.safe_load(f)
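
As a rough sketch of how the configuration is used in the cells below: the notebook reads the keys `pacbio_runs`, `barcode_runs`, and `ccs_dir` from `config`, and prefixes each value with `../` to resolve it from the parent directory. The values in `example_config` are inferred from the paths printed in this notebook's outputs, not copied from the actual `config.yaml`, so treat them as assumptions.

```python
import os

# Assumed stand-in for the parsed config.yaml; values inferred from printed paths
example_config = {
    'pacbio_runs': 'data/PacBio_runs.csv',
    'barcode_runs': 'data/barcode_runs.csv',
    'ccs_dir': 'results/ccs',
}

# Config paths are relative to the parent directory, hence the '../' prefix
pacbio_runs_path = os.path.join('../', example_config['pacbio_runs'])
barcode_runs_path = os.path.join('../', example_config['barcode_runs'])

# CCS files are named <library>_<run>_ccs.fastq.gz inside ccs_dir
ccs_path = f"../{example_config['ccs_dir']}/lib1_200415_A_ccs.fastq.gz"

print(pacbio_runs_path)
print(barcode_runs_path)
print(ccs_path)
```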

Read the PacBio runs:

In [13]:
pacbio_runs_file = os.path.join('../', config['pacbio_runs'])

print(f"Reading PacBio runs from {pacbio_runs_file}")

pacbio_runs = (
    pd.read_csv(pacbio_runs_file)
    .assign(ccs_file=lambda x: f"../{config['ccs_dir']}/" + x['library'] + '_' + x['run'] + '_ccs.fastq.gz')
    )

pacbio_runs.head()

Reading PacBio runs from ../data/PacBio_runs.csv


Unnamed: 0,library,run,subreads,ccs_file
0,lib1,200415_A,/fh/fast/bloom_j/SR/ngs/pacbio/200415_TylerSta...,../results/ccs/lib1_200415_A_ccs.fastq.gz
1,lib1,200415_B,/fh/fast/bloom_j/SR/ngs/pacbio/200415_TylerSta...,../results/ccs/lib1_200415_B_ccs.fastq.gz
2,lib2,200415_A,/fh/fast/bloom_j/SR/ngs/pacbio/200415_TylerSta...,../results/ccs/lib2_200415_A_ccs.fastq.gz
3,lib2,200415_B,/fh/fast/bloom_j/SR/ngs/pacbio/200415_TylerSta...,../results/ccs/lib2_200415_B_ccs.fastq.gz


Next make submission entries for the PacBio CCSs:

In [14]:
pacbio_submissions = (
    pacbio_runs
    .assign(
        sample_name='PacBio_CCSs',  # BioSample created in SRA wizard
        library_id=lambda x: x['library'] + '_PacBio_CCSs',  # unique library ID
        title='PacBio CCSs linking variants to barcodes for SARS-CoV-2 RBD deep mutational scanning',
        library_strategy='Synthetic-Long-Read',
        library_source='SYNTHETIC',
        library_selection='Restriction Digest',
        library_layout='single',
        platform='PACBIO_SMRT',
        instrument_model='PacBio Sequel',
        design_description='Restriction digest of plasmids carrying barcoded RBD variants',
        filetype='fastq',
        filename_fullpath=lambda x: x['ccs_file'],
        )
    .drop(columns=pacbio_runs.columns)
    )

pacbio_submissions.head()

Unnamed: 0,sample_name,library_id,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename_fullpath
0,PacBio_CCSs,lib1_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,fastq,../results/ccs/lib1_200415_A_ccs.fastq.gz
1,PacBio_CCSs,lib1_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,fastq,../results/ccs/lib1_200415_B_ccs.fastq.gz
2,PacBio_CCSs,lib2_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,fastq,../results/ccs/lib2_200415_A_ccs.fastq.gz
3,PacBio_CCSs,lib2_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,fastq,../results/ccs/lib2_200415_B_ccs.fastq.gz


Read the Illumina runs:

In [15]:
illumina_runs_file = os.path.join('../', config['barcode_runs'])

print(f"Reading Illumina runs from {illumina_runs_file}")

illumina_runs = pd.read_csv(illumina_runs_file)

illumina_runs.head()

Reading Illumina runs from ../data/barcode_runs.csv


Unnamed: 0,library,sample,sample_type,sort_bin,concentration,date,number_cells,R1
0,lib1,SortSeq_bin1,SortSeq,1,,200416,6600000,/shared/ngs/illumina/tstarr/200427_D00300_0952...
1,lib1,SortSeq_bin2,SortSeq,2,,200416,3060000,/shared/ngs/illumina/tstarr/200427_D00300_0952...
2,lib1,SortSeq_bin3,SortSeq,3,,200416,2511000,/shared/ngs/illumina/tstarr/200427_D00300_0952...
3,lib1,SortSeq_bin4,SortSeq,4,,200416,2992000,/shared/ngs/illumina/tstarr/200427_D00300_0952...
4,lib2,SortSeq_bin1,SortSeq,1,,200416,6420000,/shared/ngs/illumina/tstarr/200427_D00300_0953...


Next make submission entries for Illumina data:

In [38]:
illumina_submissions = (
    illumina_runs
    .assign(
        sample_name=lambda x: x['sample_type'].map({'SortSeq': 'expression_barcodes',
                                                    'TiteSeq': 'hACE2_binding_barcodes'}),
        library_id=lambda x: x['library'] + '_' + x['sample'],
        title=lambda x: 'SARS-CoV-2 RBD deep mutational scanning Illumina barcode sequencing for ' + x['sample'],
        library_strategy='AMPLICON',
        library_source='SYNTHETIC',
        library_selection='PCR',
        library_layout='single',
        platform='ILLUMINA',
        instrument_model='Illumina HiSeq 2500',
        design_description='PCR of barcodes from RBD variants',
        filetype='fastq',
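        # R1 can hold several ';'-separated FASTQ paths, so split into a list
        filename_fullpath=lambda x: x['R1'].str.split(';'),
        )
    .explode('filename_fullpath')  # one row per file
    .assign(filename_fullpath=lambda x: x['filename_fullpath'].str.strip())  # trim spaces around paths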
    .drop(columns=illumina_runs.columns)
    .reset_index(drop=True)
    )

illumina_submissions.head()

Unnamed: 0,sample_name,library_id,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename_fullpath
0,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/shared/ngs/illumina/tstarr/200427_D00300_0952...
1,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/fh/fast/bloom_j/SR/ngs/illumina/tstarr/200427...
2,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/fh/fast/bloom_j/SR/ngs/illumina/tstarr/200427...
3,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/fh/fast/bloom_j/SR/ngs/illumina/tstarr/200427...
4,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/fh/fast/bloom_j/SR/ngs/illumina/tstarr/200427...
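
The split-then-explode step above is what turns one row per sample into one row per file. Here is a minimal sketch of that pattern on a toy table; the paths and the `runs` / `per_file` names are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Toy stand-in for the barcode runs table; paths are hypothetical
runs = pd.DataFrame({
    'library': ['lib1', 'lib2'],
    'sample': ['SortSeq_bin1', 'SortSeq_bin1'],
    'R1': ['/data/a_L001_R1_001.fastq.gz; /data/a_L002_R1_001.fastq.gz',
           '/data/b_L001_R1_001.fastq.gz'],
})

per_file = (
    runs
    .assign(filename_fullpath=lambda x: x['R1'].str.split(';'))  # string -> list of paths
    .explode('filename_fullpath')  # one row per list element
    .assign(filename_fullpath=lambda x: x['filename_fullpath'].str.strip())  # drop stray spaces
    .drop(columns='R1')
    .reset_index(drop=True)  # explode repeats the original index, so renumber
)

print(per_file)
```

The first sample lists two files, so it contributes two rows, and the metadata columns are repeated on each of them.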


Now concatenate the PacBio and Illumina submissions into tidy format (one row per file), check that every file exists, and add a `filename` column holding each file's name without its directory path:

In [40]:
submissions_tidy = (
    pd.concat([pacbio_submissions, illumina_submissions], ignore_index=True)
    .assign(file_exists=lambda x: x['filename_fullpath'].map(os.path.isfile),
            filename=lambda x: x['filename_fullpath'].map(os.path.basename),
            )
    )

assert submissions_tidy['file_exists'].all(), submissions_tidy.query('file_exists == False')

submissions_tidy.head()

Unnamed: 0,sample_name,library_id,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename_fullpath,file_exists,filename
0,PacBio_CCSs,lib1_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,fastq,../results/ccs/lib1_200415_A_ccs.fastq.gz,True,lib1_200415_A_ccs.fastq.gz
1,PacBio_CCSs,lib1_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,fastq,../results/ccs/lib1_200415_B_ccs.fastq.gz,True,lib1_200415_B_ccs.fastq.gz
2,PacBio_CCSs,lib2_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,fastq,../results/ccs/lib2_200415_A_ccs.fastq.gz,True,lib2_200415_A_ccs.fastq.gz
3,PacBio_CCSs,lib2_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,fastq,../results/ccs/lib2_200415_B_ccs.fastq.gz,True,lib2_200415_B_ccs.fastq.gz
4,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/shared/ngs/illumina/tstarr/200427_D00300_0952...,True,200416_lib1_FITCbin1_TGGAACAA_L001_R1_001.fast...
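
The tidy table has one row per file, whereas a `*.tsv` submission sheet may instead need one row per library with the files spread across columns. The sketch below shows one way to pivot tidy data to that wide layout and write it as tab-separated text. The column names `filename`, `filename2`, ... are an assumption to check against the template offered by the submission portal, and the toy data are hypothetical.

```python
import io

import pandas as pd

# Toy tidy table: one row per file (hypothetical names)
tidy = pd.DataFrame({
    'sample_name': ['s1', 's1', 's1', 's2'],
    'library_id': ['libA', 'libA', 'libA', 'libB'],
    'filename': ['a_1.fastq.gz', 'a_2.fastq.gz', 'a_3.fastq.gz', 'b_1.fastq.gz'],
})

key_cols = ['sample_name', 'library_id']

# Number the files within each library (1, 2, 3, ...), then pivot to columns
wide = (
    tidy
    .assign(n=lambda x: x.groupby(key_cols).cumcount() + 1)
    .pivot(index=key_cols, columns='n', values='filename')
)

# Assumed column naming: first file is 'filename', later ones 'filename2', ...
wide.columns = ['filename' if n == 1 else f"filename{n}" for n in wide.columns]
wide = wide.reset_index()

# Write tab-separated text; libraries with fewer files get empty cells
buffer = io.StringIO()
wide.to_csv(buffer, sep='\t', index=False)
tsv_text = buffer.getvalue()
print(tsv_text)
```

In a real sheet the other metadata columns (title, platform, and so on) would need to be included in `key_cols` so that they are carried through the pivot.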


Spot-check on disk one of the Illumina files:

In [26]:
! ls /fh/fast/bloom_j/SR/ngs/illumina/tstarr/200427_D00300_0952_AHFCLCBCX3/Unaligned/Project_tstarr/Sample_200416_lib1_FITCbin1/200416_lib1_FITCbin1_TGGAACAA_L001_R1_002.fastq.gz

/fh/fast/bloom_j/SR/ngs/illumina/tstarr/200427_D00300_0952_AHFCLCBCX3/Unaligned/Project_tstarr/Sample_200416_lib1_FITCbin1/200416_lib1_FITCbin1_TGGAACAA_L001_R1_002.fastq.gz


List the rows whose files were not found:

In [28]:
(submissions_tidy
 .query('file_exists == False')
 )

Unnamed: 0,sample_name,library_id,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename_fullpath,file_exists,filename
5,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/fh/fast/bloom_j/SR/ngs/illumina/tstarr/20042...,False,200416_lib1_FITCbin1_TGGAACAA_L001_R1_002.fast...
6,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/fh/fast/bloom_j/SR/ngs/illumina/tstarr/20042...,False,200416_lib1_FITCbin1_TGGAACAA_L001_R1_003.fast...
7,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/fh/fast/bloom_j/SR/ngs/illumina/tstarr/20042...,False,200416_lib1_FITCbin1_TGGAACAA_L001_R1_004.fast...
8,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/fh/fast/bloom_j/SR/ngs/illumina/tstarr/20042...,False,200416_lib1_FITCbin1_TGGAACAA_L001_R1_005.fast...
9,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/fh/fast/bloom_j/SR/ngs/illumina/tstarr/20042...,False,200416_lib1_FITCbin1_TGGAACAA_L002_R1_001.fast...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
368,hACE2_binding_barcodes,lib2_TiteSeq_16_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/shared/ngs/illumina/tstarr/200427_D00300_095...,False,200422_s16-b1_TATCAGCA_L002_R1_002.fastq.gz
369,hACE2_binding_barcodes,lib2_TiteSeq_16_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/shared/ngs/illumina/tstarr/200427_D00300_095...,False,200422_s16-b1_TATCAGCA_L002_R1_003.fastq.gz
371,hACE2_binding_barcodes,lib2_TiteSeq_16_bin2,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/shared/ngs/illumina/tstarr/200427_D00300_095...,False,200422_s16-b2_TCCGTCTA_L002_R1_001.fastq.gz
373,hACE2_binding_barcodes,lib2_TiteSeq_16_bin3,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/shared/ngs/illumina/tstarr/200427_D00300_095...,False,200422_s16-b3_TCTTCACA_L002_R1_001.fastq.gz
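
The table display truncates long paths with `...`, which makes it hard to see exactly which files are missing. A minimal sketch of printing the missing paths in full follows. `find_missing` is a hypothetical helper, not part of this notebook, and the temporary files stand in for real FASTQ paths.

```python
import os
import tempfile


def find_missing(paths):
    """Return the paths that are not existing regular files, in input order."""
    return [p for p in paths if not os.path.isfile(p)]


with tempfile.TemporaryDirectory() as tmpdir:
    present = os.path.join(tmpdir, 'present_R1_001.fastq.gz')
    absent = os.path.join(tmpdir, 'absent_R1_002.fastq.gz')
    open(present, 'w').close()  # create an empty file so that it exists

    missing = find_missing([present, absent])
    for path in missing:
        print(path)  # full, untruncated path
```

With the real table, the same helper could be applied to `submissions_tidy['filename_fullpath']`.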
