In [1]:
from utils import sra_metadata, ftp_upload
import yaml
import os
import pandas as pd

# Uploading your sequencing runs to the SRA

Before publishing your study, you must make the raw sequencing runs available. This notebook contains the instructions and code for uploading these `fastq` files to the [NCBI Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra).

The recommended way to organize your experiments on the SRA is as follows:

```
BioProject: Variant library (e.g., Omicron BA.1)
|
|___BioSamples: PacBio barcode linking, Illumina barcodes from individual studies
	|
	|___SRA Experiments: Individual runs and replicates of antibodies, sera, etc.
```

This structure follows the scheme used in Tyler's and Allie's yeast-display DMS projects, such as [this one](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA639956). You can trace an individual run from its [`BioProject`](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA639956) to its [`BioSample`](https://www.ncbi.nlm.nih.gov/biosample/19925005) and finally to the [`SRA Run`](https://www.ncbi.nlm.nih.gov/sra/SRX11291810[accn]) itself.


**Note:** *The instructions below are applicable whether you have a new variant library for which a* `BioProject` *doesn't exist, or whether you are uploading a new* `BioSample` *to an existing* `BioProject`.

---

<details>
  
  <summary>
      <h1>&#9758; Click to Create the BioProject</h1>
  </summary>
    
**Here is where the instructions will go for making a new BioProject. I'll fill this in as I do it.** 
  
</details>

<p><strong>If you haven't yet created a </strong><code>BioProject</code><strong> for your variant library, click the pointing finger ( &#9758; ) to see how to set this up.</strong></p>

In [3]:
# After you've created the BioProject,
# or if you already have a BioProject that you'd like to add to,
# put the correct Accession here.

BioProject_ID = ""

---
# Create the BioSample for your Barcodes

**Otherwise, if your** `BioProject` **already exists and you are uploading barcode runs, start with these instructions.**

First, go to the [SRA Submission Portal](https://submit.ncbi.nlm.nih.gov/subs/) and log in.

Then under the **`Start a new submission`** banner, click `BioSample` to create a new BioSample. You are now on a page that has a blue **`New Submission`** button, which you should select. This will bring you to a page with some Submitter information; check that it is correct and then hit **`Continue`**. 

You will then be on the **`General Information`** page. Select when to release the submission to the public (*generally, immediately is OK*). Then click that you are uploading a `Single BioSample` and hit **`Continue`**.

You will now be at the **`Sample Type`** page, and you have to select the package that best describes the submission. Click `Pathogen(?)`. Then click **`Continue`**. 

Now, you will enter the sample attributes. For the sample name, provide a short name that describes the sample, such as `omicron_BA1_lenti_dms`. Also provide the rest of the information by editing the first section of the [`config.yml`](config.yml) in this directory. **Run the cell below to check what you've set and will need to add to the SRA.** 

In [16]:
with open('config.yml') as f:
    config = yaml.safe_load(f)
    
print(
f"""
Organism: {config['organism']}\n
Strain: {config['strain']}\n
Isolation Source: {config['isolation_source']}\n
Collection Date: {config['collection_date']}\n
Geographic Location: {config['geographic_location']}\n
Sample Type: {config['isolation_source']}
"""
)


Organism: severe acute respiratory syndrome coronavirus 2

Strain: Omicron BA.1

Isolation Source: plasmid

Collection Date: 2022

Geographic Location: USA

Sample Type: plasmid



Then hit **`Continue`**.

You will now be on the page to specify the `BioProject`. You should either be adding to an existing BioProject, or have created a new one according to the instructions above. Enter the correct ID in the format of `PRJNAXXXXXX` as the **`Existing BioProject`** and hit **`Continue`**.

Finally, add a sample title, such as `"Illumina barcode sequencing from mutational antigenic profiling of Omicron BA.1 Spike with clinically relevant antibodies."` Then hit **`Continue`**, make sure everything looks correct, then hit **`Submit`**.

After a brief bit of processing, the `BioSample` submission should show up, along with a sample accession that will be in the format of `SAMNXXXXXXX`. Add this sample accession to [`config.yml`](config.yml) as the value for the `accession` key under the `barcode_runs` key.

In [4]:
with open('config.yml') as f:
    config = yaml.safe_load(f)
print(f"The BioSample accession for these Illumina Barcode runs is: {config['barcode_runs']['accession']}")
print(f"The BioProject accession for these Illumina Barcode runs is: {BioProject_ID}")

The BioSample accession for these Illumina Barcode runs is: SAMN12345678
The BioProject accession for these Illumina Barcode runs is: 
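
The example output above shows an empty `BioProject` accession, which is easy to miss. A minimal sketch of a format check follows. The helper `check_accession` is hypothetical (it is not part of `utils`), and the patterns are assumptions based only on the `PRJNAXXXXXX` and `SAMNXXXXXXX` placeholders used in this notebook.

```python
import re

# Assumed patterns: a fixed prefix followed by digits only, per the
# PRJNAXXXXXX and SAMNXXXXXXX placeholders used in this notebook.
ACCESSION_PATTERNS = {
    "BioProject": re.compile(r"PRJNA\d+"),
    "BioSample": re.compile(r"SAMN\d+"),
}


def check_accession(kind, value):
    """Return True if `value` looks like an accession of the given kind."""
    pattern = ACCESSION_PATTERNS[kind]
    return bool(pattern.fullmatch(value.strip()))


print(check_accession("BioSample", "SAMN12345678"))  # True
print(check_accession("BioProject", ""))             # False
```

Running something like `check_accession("BioProject", BioProject_ID)` before moving on would flag an accession that was never filled in.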


---
## Upload the sequencing data

Go back to the [SRA Submission Portal](https://submit.ncbi.nlm.nih.gov/subs/) and log in again. This time, under the **`Start a new submission`** banner, click **`Sequence Read Archive`** to upload the actual sequencing data.

You are now on a page with a **`New submission`** button, which you should click. Check that the submitter information is correct, then click **`Continue`**.

You will now be at the **`General Information`** page. We are adding to an existing `BioProject`, so enter the correct `BioProject` accession as the **`Existing BioProject`**. Then for the question of whether you already registered a `BioSample`, also select yes. Then select a Release Date depending on whether you want to release the results immediately (*usually fine*) or at some future date. Then click **`Continue`**.

You will now be at a page that asks how you want to provide the `SRA Metadata`. Click to Upload a file. The next section describes how to create this table.

---
## Create SRA metadata submission table

To create the SRA metadata submission table for the Illumina Barcode runs, you'll edit the values under the `barcode_runs` key in the [`config.yml`](config.yml) file in this directory. 

For the key `barcode_runs` >> `file_path`, add the path to a `*.csv` file with **only** the barcode runs that you're going to submit to the SRA for *this BioSample*. 

Most of the values should be self-explanatory. For the key `sample_id_columns`, however, choose the columns that should be concatenated to make a unique `library_ID` for the submission. **The order of the columns matters**. 
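
To illustrate why the column order matters, here is a small sketch with toy data. It assumes the columns are joined with underscores, as the `library_ID` values in the example table suggest; `make_library_id` is a hypothetical helper and not necessarily how `sra_metadata` builds the IDs.

```python
import pandas as pd

# Toy rows; the column names are a subset of the `sample_id_columns`
# listed in the example config.
runs = pd.DataFrame(
    {
        "antibody": ["CC95.102", "CC95.102"],
        "replicate": [1, 2],
        "sample_type": ["antibody", "antibody"],
    }
)


def make_library_id(df, columns, sep="_"):
    """Join the given columns, in the given order, into one ID per row."""
    return df[columns].astype(str).apply(sep.join, axis=1)


ids_a = make_library_id(runs, ["sample_type", "antibody", "replicate"])
ids_b = make_library_id(runs, ["antibody", "sample_type", "replicate"])
print(ids_a.tolist())  # ['antibody_CC95.102_1', 'antibody_CC95.102_2']
print(ids_b.tolist())  # ['CC95.102_antibody_1', 'CC95.102_antibody_2']
```

The same rows produce different IDs depending on the order, so pick an order that reads naturally and keep it consistent between submissions.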

After you've edited the [`config.yml`](config.yml) file, run the cell below to double-check that the values are correct. 

In [22]:
with open('config.yml') as f:
    config = yaml.safe_load(f)

print(*[f"{k}: {v}\n" for k,v in config['barcode_runs'].items() if k != "ftp_subfolder"])

file_path: ../data/barcode_runs.csv
 accession: SAMN12345678
 sample_id_columns: ['virus_batch', 'sample_type', 'antibody', 'antibody_concentration', 'replicate']
 title_prefix: Full Spike glycoprotein DMS illumina barcode sequencing for
 description: PCR of barcodes from glycoprotein variants
 strategy: AMPLICON
 source: SYNTHETIC
 selection: PCR
 layout: single
 platform: ILLUMINA
 model: Illumina HiSeq 2500
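
Reading the printed values by eye catches wrong entries but not missing ones. The sketch below reports absent keys. The key list is an assumption taken from the example output above plus `ftp_subfolder`, and `missing_keys` is a hypothetical helper; the functions in `utils` may require a different set.

```python
# Keys assumed from the example output in this notebook, plus `ftp_subfolder`.
EXPECTED_KEYS = {
    "file_path", "accession", "sample_id_columns", "title_prefix",
    "description", "strategy", "source", "selection", "layout",
    "platform", "model", "ftp_subfolder",
}


def missing_keys(section, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent from a config section."""
    return sorted(expected - set(section))


# A deliberately incomplete section, standing in for config['barcode_runs'].
example = {"file_path": "../data/barcode_runs.csv", "accession": "SAMN12345678"}
print(missing_keys(example))
```

With the real config, `missing_keys(config['barcode_runs'])` should return an empty list if every assumed key is present.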



If these values are correct, run the cell below to make the `barcodes_SRA_metadata.tsv` file for the Illumina Barcode runs that you're including in this `BioSample` submission and save a file containing the paths to the files for upload.

**Check that the table looks like you expect it to!**

In [36]:
sra_metadata.format_metadata_tsv(config, "", pacbio=False)

if not all(map(os.path.isfile, ["barcodes_SRA_metadata.tsv", "barcodes_fasta_files.csv"])):
    raise Exception("You're missing either one or both of the expected output files.")

barcodes_SRA_metadata = pd.read_csv("barcodes_SRA_metadata.tsv", sep="\t")
barcodes_SRA_metadata.head()

Unnamed: 0,biosample_accession,library_ID,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename,filename2
0,SAMN12345678,Lib-1_thaw-1_VSVG_control_none_none_1,Full Spike glycoprotein DMS illumina barcode s...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from glycoprotein variants,fastq,VSVG_A50_rep1_S3_R1_001.fastq.gz,
1,SAMN12345678,Lib-1_thaw-1_VSVG_control_none_none_2,Full Spike glycoprotein DMS illumina barcode s...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from glycoprotein variants,fastq,VSVG_A50_rep2_S4_R1_001.fastq.gz,
2,SAMN12345678,Lib-1_thaw-1_antibody_CC95.102_37.0_1,Full Spike glycoprotein DMS illumina barcode s...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from glycoprotein variants,fastq,A50_CC25102_11_S15_R1_001.fastq.gz,
3,SAMN12345678,Lib-1_thaw-1_antibody_CC95.102_111.0_1,Full Spike glycoprotein DMS illumina barcode s...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from glycoprotein variants,fastq,A50_CC25102_21_S17_R1_001.fastq.gz,
4,SAMN12345678,Lib-1_thaw-1_antibody_CC95.102_222.0_1,Full Spike glycoprotein DMS illumina barcode s...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from glycoprotein variants,fastq,A50_CC25102_31_S19_R1_001.fastq.gz,
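
Since `.head()` shows only the first five rows, a few programmatic checks can complement the visual inspection. This is a sketch under assumptions: `check_metadata` is a hypothetical helper, the column names are those in the example table above, and the three checks reflect what this notebook asks for (unique `library_ID`, a file for each row, one `BioSample` per table) rather than a complete list of SRA requirements.

```python
import pandas as pd


def check_metadata(table):
    """Return a list of human-readable problems found in the table."""
    problems = []
    duplicated = table.loc[table["library_ID"].duplicated(), "library_ID"]
    if not duplicated.empty:
        problems.append(f"duplicate library_ID values: {sorted(set(duplicated))}")
    if table["filename"].isna().any():
        problems.append("some rows have no filename")
    if table["biosample_accession"].nunique() != 1:
        problems.append("more than one BioSample accession in the table")
    return problems


# Toy table with two deliberate problems: a repeated ID and a missing filename.
toy = pd.DataFrame(
    {
        "biosample_accession": ["SAMN12345678", "SAMN12345678"],
        "library_ID": ["Lib-1_rep1", "Lib-1_rep1"],
        "filename": ["rep1_R1_001.fastq.gz", None],
    }
)
for problem in check_metadata(toy):
    print(problem)
```

Calling `check_metadata(barcodes_SRA_metadata)` on the real table should return an empty list when none of these problems are present.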


---
## Upload the submission table

Now return to the [SRA Submission Portal](https://submit.ncbi.nlm.nih.gov/subs/sra/) webpage and, if needed, navigate back to your submission. You should still be at the **`SRA Metadata`** step and see a **`Choose file`** box for uploading your SRA metadata. Use this box to upload the `barcodes_SRA_metadata.tsv` file that you just created in the step above. Then click **`Continue`**.

After a little while, you should now get a page that asks you how you want to upload the files for this submission. Click the option for **`FTP or Aspera Command Line file preload`**.

If you click on the **`+`** next to the FTP upload instructions, you will see the details. Add the `Username` and `account folder` provided in these instructions to the [config.yml](config.yml) file as the values for the keys `ftp_username` and `ftp_account_folder` at the bottom of the file. Also, under the `barcode_runs` key, add a value for the key `ftp_subfolder` that is meaningful for this particular submission. Finally, put the FTP password as plain text in a file called `ftp_password.txt`, which is not tracked in this repo.

In [27]:
with open('config.yml') as f:
    config = yaml.safe_load(f)
    
print(f"""
ftp_username: {config['ftp_username']}
ftp_account_folder: {config['ftp_account_folder']}
ftp_subfolder: {config['barcode_runs']['ftp_subfolder']}
""")

if os.path.isfile('ftp_password.txt'):
    print("ftp_password.txt Exists!")
else:
    raise Exception("Make sure that ftp_password.txt exists in this directory.")



ftp_username: subftp
ftp_account_folder: uploads/something_or_another
ftp_subfolder: omicron_BA1_barcodes

ftp_password.txt Exists!
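
A password file saved from a text editor often ends with a newline, which can cause a login failure if it is passed along unchanged. The sketch below reads the file and strips surrounding whitespace. `read_ftp_password` is a hypothetical helper; whether `ftp_upload` already handles this is not known from this notebook.

```python
from pathlib import Path


def read_ftp_password(path="ftp_password.txt"):
    """Read the password file and strip surrounding whitespace and newlines."""
    password = Path(path).read_text().strip()
    if not password:
        raise ValueError(f"{path} is empty")
    return password


# Usage (not run here, since it needs the real file):
# password = read_ftp_password("ftp_password.txt")
```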


---
## Upload the sequencing data

Now we need to upload the actual sequencing data. This is done by running the cells below. The first cell creates a very large `*.tar` file called `barcode_SRA_submission.tar` that contains all the `fastq` files specified in the Illumina Barcode metadata. 

**Run the cell below to make this file. This can take a bit.**

In [2]:
barcodes_fastq_files = pd.read_csv("barcodes_fasta_files.csv")
ftp_upload.make_tar_file(barcodes_fastq_files, "barcode_SRA_submission.tar")

Adding file 1 of 83 to barcode_SRA_submission.tar
Adding file 10 of 83 to barcode_SRA_submission.tar
Adding file 20 of 83 to barcode_SRA_submission.tar
Adding file 30 of 83 to barcode_SRA_submission.tar
Adding file 40 of 83 to barcode_SRA_submission.tar
Adding file 50 of 83 to barcode_SRA_submission.tar
Adding file 60 of 83 to barcode_SRA_submission.tar
Adding file 70 of 83 to barcode_SRA_submission.tar
Adding file 80 of 83 to barcode_SRA_submission.tar
Adding file 83 of 83 to barcode_SRA_submission.tar
Added all files to barcode_SRA_submission.tar

The size of barcode_SRA_submission.tar is 39.6 GB

barcode_SRA_submission.tar contains all 81 expected files.

Finished preparing the tar file for upload.
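
The example log above mentions 83 added files and 81 expected files, so an independent look at the archive contents may be worthwhile. The sketch below lists which expected files are absent from a tar archive, using only the standard `tarfile` module. `missing_from_tar` is a hypothetical helper, and comparing by base name is an assumption, since how `make_tar_file` names the archive members is not shown here.

```python
import os
import tarfile
import tempfile


def missing_from_tar(tar_path, expected_paths):
    """Return expected file names (base names) that are not in the archive."""
    with tarfile.open(tar_path) as tar:
        in_tar = {os.path.basename(name) for name in tar.getnames()}
    expected = {os.path.basename(p) for p in expected_paths}
    return sorted(expected - in_tar)


# Self-contained demo with a toy archive holding one of two expected files.
with tempfile.TemporaryDirectory() as tmp:
    fastq = os.path.join(tmp, "a_R1_001.fastq.gz")
    with open(fastq, "wb") as handle:
        handle.write(b"toy")
    toy_tar = os.path.join(tmp, "toy.tar")
    with tarfile.open(toy_tar, "w") as tar:
        tar.add(fastq, arcname=os.path.basename(fastq))
    result = missing_from_tar(toy_tar, [fastq, "/elsewhere/b_R1_001.fastq.gz"])

print(result)  # ['b_R1_001.fastq.gz']
```

For the real archive, the expected paths would come from whichever column of `barcodes_fasta_files.csv` holds the file paths.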


If the size of the `*.tar` file is what you expect, **run the cell below to use FTP to upload the file to the SRA.**

In [None]:
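# Assumes upload_via_ftp lives in utils/ftp_upload.py alongside make_tar_file;
# only the ftp_upload module is imported at the top of this notebook.
ftp_upload.upload_via_ftp(
    tar_path="barcode_SRA_submission.tar",
    ftp_username=config['ftp_username'],
    ftp_account_folder=config['ftp_account_folder'],
    ftp_subfolder=config['barcode_runs']['ftp_subfolder'],
    ftp_address='ftp-private.ncbi.nlm.nih.gov',
    ftp_password='ftp_password.txt',  # name of the password file, not the password
)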

Now that the transfer has finished, **manually log into the FTP site to see the file and use** `ls` **to see the size of what has been transferred to make sure that it worked correctly.**
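
The size comparison can also be scripted with the standard `ftplib` module. This is a sketch under assumptions: `remote_size_matches` is a hypothetical helper, and it relies on the server answering the FTP `SIZE` command, which has not been verified for the NCBI server.

```python
import os
from ftplib import FTP


def remote_size_matches(ftp, remote_name, local_path):
    """Compare the size reported by the FTP server with the local file size."""
    ftp.voidcmd("TYPE I")  # switch to binary mode, which many servers need for SIZE
    return ftp.size(remote_name) == os.path.getsize(local_path)


# Usage (not run here; needs network access and real credentials):
# with FTP("ftp-private.ncbi.nlm.nih.gov") as ftp:
#     ftp.login(user=config["ftp_username"], passwd="<password from ftp_password.txt>")
#     ftp.cwd(config["ftp_account_folder"] + "/" + config["barcode_runs"]["ftp_subfolder"])
#     print(remote_size_matches(ftp, "barcode_SRA_submission.tar",
#                               "barcode_SRA_submission.tar"))
```

A mismatch would suggest an interrupted transfer, in which case re-running the upload cell is the simplest fix.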

Finally, return to the SRA submission webpage for the reads and click the blue **`Select preload folder`** box. Note that you may need to wait about 10 minutes for the preload folder to become visible. Then select the folder you created (this is the `ftp_subfolder` defined under the `barcode_runs` key in [`config.yml`](config.yml)) and click **`Use selected folder`**. Finally, check the **`Autofinish submission`** box and hit **`Continue`**. You will get a warning that files are missing because you uploaded a `*.tar` archive; do not worry about this and just click **`Continue`** again. The webpage will then indicate that it is extracting files from the `*.tar`, so wait for this to finish. It should then show that your submission is complete and waiting for processing.

You will then probably want to delete the `barcode_SRA_submission.tar` file, as it is very large.

<h2 style=color:red;>&#128721; Stop! If you don't also need to add PacBio sequencing, you're done! &#128721;</h2>

---
**However, if you've made a new** `BioProject` **for this submission and haven't yet added your PacBio sequencing linking variants to barcodes, go ahead and do this here.** 

The steps below are more-or-less identical to the steps above. **Make sure that you have a** `pacbio_runs` **key in the** [`config.yml`](config.yml) **file** with the correct information filled out. Click on the heading to expand the instructions as you need them.

---
<details>

<summary><h1> &#9758; Create the BioSample for your PacBio</h1></summary>

Just like you did above, go to the [SRA Submission Portal](https://submit.ncbi.nlm.nih.gov/subs/) and log in.

Then under the **`Start a new submission`** banner, click `BioSample` to create a new BioSample. You are now on a page that has a blue **`New Submission`** button, which you should select. This will bring you to a page with some Submitter information; check that it is correct and then hit **`Continue`**. 

You will then be on the **`General Information`** page. Select when to release the submission to the public (*generally, immediately is OK*). Then click that you are uploading a `Single BioSample` and hit **`Continue`**.

You will now be at the **`Sample Type`** page, and you have to select the package that best describes the submission. Click `Pathogen(?)`. Then click **`Continue`**. 

Now, you will enter the sample attributes. For the sample name, provide a short name that describes the sample, such as `omicron_BA1_lenti_dms_pacbio`. Also provide the rest of the information by editing the first section of the [`config.yml`](config.yml) in this directory. **Run the cell below to check what you've set and will need to add to the SRA.** 
    
</details>

In [8]:
# Load in the config again and check that there is pacbio sequencing.
# End the notebook here if there isn't.
with open('config.yml') as f:
    config = yaml.safe_load(f)

if "pacbio_runs" not in config.keys(): 
    raise Exception("It looks like you don't have PacBio sequencing info in your config. You might be done.")

In [9]:
print(
f"""
Organism: {config['organism']}\n
Strain: {config['strain']}\n
Isolation Source: {config['isolation_source']}\n
Collection Date: {config['collection_date']}\n
Geographic Location: {config['geographic_location']}\n
Sample Type: {config['isolation_source']}
"""
)


Organism: severe acute respiratory syndrome coronavirus 2

Strain: Omicron BA.1

Isolation Source: plasmid

Collection Date: 2022

Geographic Location: USA

Sample Type: plasmid



<details>
    <summary><strong>Click to see details</strong></summary>

Then hit **`Continue`**.

You will now be on the page to specify the `BioProject`. You should either be adding to an existing BioProject, or have created a new one according to the instructions above. Enter the correct ID in the format of `PRJNAXXXXXX` as the **`Existing BioProject`** and hit **`Continue`**.

Finally, add a sample title, such as `"PacBio sequencing to link barcodes to variants for mutational antigenic profiling of Omicron BA.1 Spike with clinically relevant antibodies."` Then hit **`Continue`**, make sure everything looks correct, then hit **`Submit`**.

After a brief bit of processing, the `BioSample` submission should show up, along with a sample accession that will be in the format of `SAMNXXXXXXX`. Add this sample accession to [`config.yml`](config.yml) as the value for the `accession` key **under the `pacbio_runs` key.**
</details>

In [10]:
with open('config.yml') as f:
    config = yaml.safe_load(f)
print(f"The BioSample accession for these PacBio Variant Linking runs is: {config['pacbio_runs']['accession']}")
print(f"The BioProject accession for these is: {BioProject_ID}")

The BioSample accession for these PacBio Variant Linking runs is: SAMN12341234
The BioProject accession for these is: 


---
<details>
<summary><h2> &#9758; Upload the sequencing data</h2></summary>

Go back to the [SRA Submission Portal](https://submit.ncbi.nlm.nih.gov/subs/) and log in again. This time, under the **`Start a new submission`** banner, click **`Sequence Read Archive`** to upload the actual sequencing data.

You are now on a page with a **`New submission`** button, which you should click. Check that the submitter information is correct, then click **`Continue`**.

You will now be at the **`General Information`** page. We are adding to an existing `BioProject`, so enter the correct `BioProject` accession as the **`Existing BioProject`**. Then for the question of whether you already registered a `BioSample`, also select yes. Then select a Release Date depending on whether you want to release the results immediately (*usually fine*) or at some future date. Then click **`Continue`**.

You will now be at a page that asks how you want to provide the `SRA Metadata`. Click to Upload a file. The next section describes how to create this table.
</details>

---
<details>

<summary><h2>&#9758; Create SRA metadata submission table</h2></summary>
    
To create the SRA metadata submission table for the PacBio runs, you'll edit the values under the `pacbio_runs` key in the [`config.yml`](config.yml) file in this directory. 

For the key `pacbio_runs` >> `file_path`, add the path to a `*.csv` file with **only** the PacBio runs that you're going to submit to the SRA for *this BioSample*. This will be named something like `PacBio_runs.csv`.

After you've edited the [`config.yml`](config.yml) file, run the cell below to double-check that the values are correct. 
</details>

In [12]:
with open('config.yml') as f:
    config = yaml.safe_load(f)

print(*[f"{k}: {v}\n" for k,v in config['pacbio_runs'].items() if k != "ftp_subfolder"])

file_path: ../data/PacBio_runs.csv
 accession: SAMN12341234
 title_prefix: Full Spike glycoprotein PacBio CCSs linking variants to barcodes for library
 description: PacBio CCSs linking variants to barcodes
 strategy: AMPLICON
 source: SYNTHETIC
 selection: Restriction Digest
 layout: single
 platform: PACBIO_SMRT
 model: Sequel



If these values are correct, run the cell below to make the `pacbio_SRA_metadata.tsv` file for the runs that you're including in this `BioSample` submission and save a file containing the paths to the files for upload.

**Check that the table looks like you expect it to!**

In [13]:
sra_metadata.format_metadata_tsv(config, "", pacbio=True)

if not all(map(os.path.isfile, ["pacbio_SRA_metadata.tsv", "pacbio_fasta_files.csv"])):
    raise Exception("You're missing either one or both of the expected output files.")

pacbio_SRA_metadata = pd.read_csv("pacbio_SRA_metadata.tsv", sep="\t")
pacbio_SRA_metadata.head()

Unnamed: 0,biosample_accession,library_ID,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype
0,SAMN12341234,Lib-1_220531_PacBio_CCSs,Full Spike glycoprotein PacBio CCSs linking va...,AMPLICON,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,Sequel,PacBio CCSs linking variants to barcodes,fastq
1,SAMN12341234,Lib-1_220710_PacBio_CCSs,Full Spike glycoprotein PacBio CCSs linking va...,AMPLICON,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,Sequel,PacBio CCSs linking variants to barcodes,fastq
2,SAMN12341234,Lib-1_220711_PacBio_CCSs,Full Spike glycoprotein PacBio CCSs linking va...,AMPLICON,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,Sequel,PacBio CCSs linking variants to barcodes,fastq
3,SAMN12341234,Lib-2_220629_PacBio_CCSs,Full Spike glycoprotein PacBio CCSs linking va...,AMPLICON,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,Sequel,PacBio CCSs linking variants to barcodes,fastq
4,SAMN12341234,Lib-3_220629_PacBio_CCSs,Full Spike glycoprotein PacBio CCSs linking va...,AMPLICON,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,Sequel,PacBio CCSs linking variants to barcodes,fastq


---
<details>
    <summary><h2> &#9758; Upload the submission table </h2></summary>

Now return to the [SRA Submission Portal](https://submit.ncbi.nlm.nih.gov/subs/sra/) webpage and, if needed, navigate back to your submission. You should still be at the **`SRA Metadata`** step and see a **`Choose file`** box for uploading your SRA metadata. Use this box to upload the `pacbio_SRA_metadata.tsv` file that you just created in the step above. Then click **`Continue`**.

After a little while, you should now get a page that asks you how you want to upload the files for this submission. Click the option for **`FTP or Aspera Command Line file preload`**.

If you click on the **`+`** next to the FTP upload instructions, you will see the details. Add the `Username` and `account folder` provided in these instructions to the [config.yml](config.yml) file as the values for the keys `ftp_username` and `ftp_account_folder` at the bottom of the file. Also, under the `pacbio_runs` key, add a value for the key `ftp_subfolder` that is meaningful for this particular submission. Finally, put the FTP password as plain text in a file called `ftp_password.txt`, which is not tracked in this repo.
</details>

In [14]:
with open('config.yml') as f:
    config = yaml.safe_load(f)
    
print(f"""
ftp_username: {config['ftp_username']}
ftp_account_folder: {config['ftp_account_folder']}
ftp_subfolder: {config['pacbio_runs']['ftp_subfolder']}
""")

if os.path.isfile('ftp_password.txt'):
    print("ftp_password.txt Exists!")
else:
    raise Exception("Make sure that ftp_password.txt exists in this directory.")



ftp_username: subftp
ftp_account_folder: uploads/something_or_another
ftp_subfolder: omicron_BA1_pacbio

ftp_password.txt Exists!


---
<details>
    <summary><h2> &#9758; Upload the sequencing data</h2></summary>

Now we need to upload the actual sequencing data. This is done by running the cells below. The first cell creates a very large `*.tar` file called `pacbio_SRA_submission.tar` that contains all the `fastq` files specified in the PacBio metadata. 

**Run the cell below to make this file. This can take a bit.**
</details>

In [15]:
pacbio_fastq_files = pd.read_csv("pacbio_fasta_files.csv")
ftp_upload.make_tar_file(pacbio_fastq_files, "pacbio_SRA_submission.tar")

Adding file 1 of 5 to pacbio_SRA_submission.tar
Adding file 5 of 5 to pacbio_SRA_submission.tar
Added all files to pacbio_SRA_submission.tar

The size of pacbio_SRA_submission.tar is 18.2 GB

pacbio_SRA_submission.tar contains all 5 expected files.

Finished preparing the tar file for upload.


If the size of the `*.tar` file is what you expect, **run the cell below to use FTP to upload the file to the SRA.**

In [None]:
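# Assumes upload_via_ftp lives in utils/ftp_upload.py alongside make_tar_file;
# only the ftp_upload module is imported at the top of this notebook.
ftp_upload.upload_via_ftp(
    tar_path="pacbio_SRA_submission.tar",
    ftp_username=config['ftp_username'],
    ftp_account_folder=config['ftp_account_folder'],
    ftp_subfolder=config['pacbio_runs']['ftp_subfolder'],
    ftp_address='ftp-private.ncbi.nlm.nih.gov',
    ftp_password='ftp_password.txt',  # name of the password file, not the password
)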

Finally, return to the SRA submission webpage for the reads and click the blue **`Select preload folder`** box. Note that you may need to wait about 10 minutes for the preload folder to become visible. Then select the folder you created (this is the `ftp_subfolder` defined under the `pacbio_runs` key in [`config.yml`](config.yml)) and click **`Use selected folder`**. Finally, check the **`Autofinish submission`** box and hit **`Continue`**. You will get a warning that files are missing because you uploaded a `*.tar` archive; do not worry about this and just click **`Continue`** again. The webpage will then indicate that it is extracting files from the `*.tar`, so wait for this to finish. It should then show that your submission is complete and waiting for processing.

<h2 style=color:green;>&#127881; Congrats, you're done! &#127881;</h2>