# NIH SRA submission
This notebook was used to create an NIH SRA (fastq) submission for Plasmodium falciparum MIP sequencing data in collaboration with Rosenthal Lab. It will also serve as a reference for falciparum sample submissions. 

### Prerequisites
1) Prior to SRA submission, create a bioproject on https://submit.ncbi.nlm.nih.gov/subs/bioproject/  
This is a simple procedure that produces a bioproject ID which will be associated with one or more SRA submissions.  
2) Prior to SRA submission, submit all of the samples used in a project to NIH BioSample database.



SRA submissions are made through the NIH website: https://submit.ncbi.nlm.nih.gov/subs/sra

There are 5 steps in the submission portal for SRA submissions, listed below. **The main purpose of this notebook is to create the metadata file needed at step 3 and to prepare the fastq files uploaded at step 4**.

Step 1: Submitter details

Step 2: General Info
  * Did you already register a BioProject for this research, e.g. for the submission of the reads to SRA: **YES**
  * Did you already register a BioSample for this sample, e.g. for the submission of the reads to SRA: **YES**
  * Choose data release time (immediate or future): we will choose a future date during the submission. Once the submission is finished, it is recommended to email SRA requesting removal of possible human sequences. Our targeted sequencing should not generate any human data, and any off-target human reads would not carry identifying information. Still, SRA provides this service and it makes sense to use it. Once that is done, you can release the data publicly or wait until the publication date.
  
  
Step 3: SRA metadata
  * Upload a file using Excel or text format (tab-delimited): This file will be created using this notebook.
  
Step 4: Files
  * This notebook will copy the fastq files to specific directories. The files will be uploaded to the FTP server of NIH at this step.
  * The first step in the file transfer is to open a command line and navigate to wherever the fastq files were saved (e.g. ssh to the seekdeep server and cd to the fastq directory). Optionally, start a screen session if you are comfortable using one. I'll call this location the "data location".
  
  * On the SRA submission portal choose: *FTP or Aspera Command Line file preload*
  * click on FTP upload instructions
  
  * From the *data location* connect to the NIH FTP server using the credentials provided on the submission portal
    * ftp ftp-private.ncbi.nlm.nih.gov
    * follow instructions on the portal until step 6
    * for step 6, instead do the following commands
    * prompt (this toggles interactive mode. We want it off, so if the command reports that interactive mode is now on, run "prompt" again to turn it off.)
    * mput *
  * Once the upload is complete go back to submission portal, click **Select preload folder**
  * Select the folder containing the uploaded files.
  * Autofinish
  * Continue
  

Step 5: Review and submit.

### Starting the notebook
This notebook was started with the below commands:
```bash
base_resources=~/git/MIPTools/base_resources
data_dir=~/processed/analysis/2020-06/ROS_200612/
seq_data_dir=~/raw_data
analysis_dir=~/processed/analysis/2020-06/ROS_200612/sra_submission
container=~/shared_bin/miptools_20200728.sif

mkdir -p $analysis_dir

singularity run --app jupyter \
    -B $base_resources:/opt/resources \
    -B $data_dir:/opt/data \
    -B $seq_data_dir:/opt/work \
    -B $analysis_dir:/opt/analysis \
    $container
```

### Sequencing data
It is best to have a data folder for the specific project where sequencing data, sample data, etc. will be stored. In this guide the location of that folder is "/opt/data/". In the context of the MIPTools Singularity container, this means that we bind the local data directory to /opt/data/ when starting the jupyter notebook (see above). Within this directory, a subdirectory "project_data" contains project specific files such as sample metadata, sequencing run metadata, etc.

New files generated will be saved to /opt/analysis folder or the /opt/data/project_data as appropriate.

/opt/work directory was bound to the root directory of raw sequencing data. So all the raw sequencing data should be available from /opt/work from within the container.

Import necessary modules

In [1]:
import sys
sys.path.append("/opt/src")
import os
import subprocess
import pandas as pd
import scandir

Classes reloading.
functions reloading


Specify directory locations for raw data and project data

In [78]:
data_dir = "/opt/data/project_data/"
raw_data_dir = "/opt/work/"

### Multiple sequencing runs
Many projects use multiple sequencing runs. We will use data from 6 sequencing runs for this submission.  

We typically have a run ID for each sequencing run coded as the date of the run: YYMMDD. The sequence data is stored within the raw data directory in a run directory named run ID + sequencing platform, such as 190312_nextseq for a sequencing run performed on March 12, 2019 on NextSeq.  

There is a sample sheet associated with the run within the run directory, typically named run ID + samples.tsv, such as 190312_samples.tsv.

The fastq files generated are stored in the fastq subdirectory within the run directory (190312_nextseq/fastq).

We are going to generate a dictionary pointing to file locations for each run, assuming the default values are correct.

In [3]:
run_ids = [190312, 190321, 190405, 190517, 200316, 200609]
seq_platform = "nextseq"
run_info = {}
for i in run_ids:
    run_name = str(i) + "_" + seq_platform
    run_info[i] = {"Run Name": run_name,
                   "Sample Sheet": os.path.join(raw_data_dir,
                                                run_name,
                                                str(i) + "_samples.tsv"),
                  "Fastq Dir": os.path.join(raw_data_dir, run_name, "fastq")}
run_info

{190312: {'Run Name': '190312_nextseq',
  'Sample Sheet': '/opt/work/190312_nextseq/190312_samples.tsv',
  'Fastq Dir': '/opt/work/190312_nextseq/fastq'},
 190321: {'Run Name': '190321_nextseq',
  'Sample Sheet': '/opt/work/190321_nextseq/190321_samples.tsv',
  'Fastq Dir': '/opt/work/190321_nextseq/fastq'},
 190405: {'Run Name': '190405_nextseq',
  'Sample Sheet': '/opt/work/190405_nextseq/190405_samples.tsv',
  'Fastq Dir': '/opt/work/190405_nextseq/fastq'},
 190517: {'Run Name': '190517_nextseq',
  'Sample Sheet': '/opt/work/190517_nextseq/190517_samples.tsv',
  'Fastq Dir': '/opt/work/190517_nextseq/fastq'},
 200316: {'Run Name': '200316_nextseq',
  'Sample Sheet': '/opt/work/200316_nextseq/200316_samples.tsv',
  'Fastq Dir': '/opt/work/200316_nextseq/fastq'},
 200609: {'Run Name': '200609_nextseq',
  'Sample Sheet': '/opt/work/200609_nextseq/200609_samples.tsv',
  'Fastq Dir': '/opt/work/200609_nextseq/fastq'}}

We have used non-standard file names for 4 of the runs for reasons beyond the scope of this guide. Below are the actual file names and the first 4 are slightly different than the default values. We'll update those file paths.

In [4]:
sample_sheets_used = ["190312_xs_samples.tsv",
                 "190321_xs_samples.tsv",
                 "190405_xs_samples.tsv",
                 "190517_samples_xs.tsv",
                 "200316_samples.tsv",
                 "200609_samples.tsv"] 

In [5]:
run_info[190312]["Sample Sheet"] = "/opt/work/190312_nextseq/190312_xs_samples.tsv"
run_info[190321]["Sample Sheet"] = "/opt/work/190321_nextseq/190321_xs_samples.tsv"
run_info[190405]["Sample Sheet"] = "/opt/work/190405_nextseq/190405_xs_samples.tsv"
run_info[190517]["Sample Sheet"] = "/opt/work/190517_nextseq/190517_samples_xs.tsv"
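
Before reading any of these files it can be useful to confirm that every path in the dictionary actually exists. The helper below is a minimal sketch and not part of the original notebook; `find_missing_paths` and `example_info` are hypothetical names introduced for illustration.

```python
import os


def find_missing_paths(run_info):
    """Return (run_id, key, path) for each sample sheet or fastq dir that does not exist."""
    missing = []
    for run_id, info in run_info.items():
        for key in ("Sample Sheet", "Fastq Dir"):
            path = info[key]
            if not os.path.exists(path):
                missing.append((run_id, key, path))
    return missing


# Made-up entry for illustration; in the notebook, pass the real run_info dictionary.
example_info = {
    999999: {"Run Name": "999999_nextseq",
             "Sample Sheet": "/nonexistent/999999_nextseq/999999_samples.tsv",
             "Fastq Dir": "/nonexistent/999999_nextseq/fastq"}}

for run_id, key, path in find_missing_paths(example_info):
    print(run_id, key, "not found:", path)
```

An empty list means all sample sheets and fastq directories were found.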

### What is in a sample sheet
When preparing sequencing libraries, we use certain terminology specific to our runs. Sample sheets provide all the information needed for connecting sequencing data to specific libraries. The key terms are:

  * **sample:** DNA source identifier. Multiple DNA extractions from the same source would have the same sample name, for example.
  * **library:** A sequencing library whose data can be uniquely identified (i.e. has unique sample barcodes). Multiple libraries can be generated from the same DNA sample (replicates) during the same library prep.
  * **Sample ID:** Library identifier.
  * **Library Prep:** A library preparation identifier, typically a date. A failed library for a sample can be re-prepared in a later library prep, for example. The new library would have a new barcode pair that identifies it and distinguishes from the first one.


There are 4 fields in the sample sheet that define a library: **sample_name, sample_set, replicate, Library Prep**. 
  * **only alphanumeric characters and dash or underscores are allowed** in these (and most other) fields.
  * **sample_name:** The DNA sample that the library was prepared from. Each unique DNA sample in the lab must have a unique ID. 
  * **sample_set:** This is a project specific notation to identify samples belonging to the same project. Libraries sharing this value would be normally analysed together.
  * **replicate:** Replicate number. 
  
  Each library needs a unique identifier, although we sometimes use the same sample multiple times in a sequencing run. The most common use is the control DNA. To create a unique Sample ID (i.e. Library ID) we combine the sample_name, sample_set and replicate fields. 
  
  For example, we used 3d7 control DNA in 2 libraries during a library prep. sample_name: 3d7, sample_set: con, replicate: 1 and sample_name: 3d7, sample_set: con, replicate: 2. Let's say the Library Prep ID is 190201 (prepared on February 1, 2019). Corresponding Sample IDs would be 3d7-con-1 and 3d7-con-2. The sequencing data for these libraries will be permanently linked to the sample IDs (fastq file names will contain these IDs). 
  
  Let's say we have another library prep one month later that also uses 3d7 as control. sample_name: 3d7, sample_set: con, replicate: 1, and Library Prep 190301. Sample ID: 3d7-con-1
  
  If we are analysing the two data sets together, we'd have 2 libraries with the same ID 3d7-con-1. In this case, we'll need to change one of these to 3d7-con-3, but we will need to keep track of what it originally was. We do this by keeping the Library Prep identifier along with the original Sample ID (Original SID field). i.e. Sample ID 3d7-con-1 = Original SID 3d7-con-1 + Library Prep 190201 and Sample ID 3d7-con-3 = Original SID 3d7-con-1 and Library Prep 190301. All downstream analysis refer to the Sample ID.
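
The 3d7 example above can be reproduced in a few lines. This is only an illustrative sketch: it builds the Original SID the same way the notebook does (joining the three fields with dashes) and reports which IDs occur in more than one library prep. How the clashing library is renumbered is handled by the downstream analysis and is not shown here; `original_sid` is a hypothetical helper name.

```python
from collections import defaultdict


def original_sid(sample_name, sample_set, replicate):
    """Join the three identifying fields with dashes, e.g. 3d7-con-1."""
    return "-".join(map(str, (sample_name, sample_set, replicate)))


# The three control libraries from the example above.
libraries = [
    {"sample_name": "3d7", "sample_set": "con", "replicate": 1, "Library Prep": 190201},
    {"sample_name": "3d7", "sample_set": "con", "replicate": 2, "Library Prep": 190201},
    {"sample_name": "3d7", "sample_set": "con", "replicate": 1, "Library Prep": 190301},
]

# Collect the library preps in which each Original SID was used.
preps_by_sid = defaultdict(list)
for lib in libraries:
    sid = original_sid(lib["sample_name"], lib["sample_set"], lib["replicate"])
    preps_by_sid[sid].append(lib["Library Prep"])

# IDs used in more than one prep need a new Sample ID for one of the libraries.
clashes = {sid: preps for sid, preps in preps_by_sid.items() if len(preps) > 1}
print(clashes)  # {'3d7-con-1': [190201, 190301]}
```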
  
  

### Get fastq file information and connect with sample information
We have the sample information for each run in the sample files provided. We also know the fastq directory for each run. Looping through the runs, we'll connect the two. 

We expect some samples not to have data, so when the sample information and fastq information are merged, we expect the merged table to have fewer rows than the fastq table. We are going to check that this is the case for each run. An increase, on the other hand, would point to an error that needs to be taken care of.

In [10]:
fastq_df_list = []
for r in run_info:
    fastq_list = []
    run_name = run_info[r]["Run Name"]
    fastq_dir = run_info[r]["Fastq Dir"]
    for entry in scandir.scandir(fastq_dir):
        try:
            split_entry = entry.name.split("_")
            fastq_list.append([split_entry[0],
                               split_entry[2][-1],
                               entry.name,
                               entry.path,
                               run_name,
                               r])
        except IndexError:
            continue
    fdf = pd.DataFrame(fastq_list, columns = ["Original SID",
                                   "Read Order",
                                   "File Name",
                                   "File Path", 
                                   "Run Name", 
                                   "Run ID"])
    print(r, fdf.shape)
    sheet_file = run_info[r]["Sample Sheet"]
    sheet_df = pd.read_table(sheet_file)[["sample_name", "sample_set", "replicate", "Library Prep"]]
    sheet_df["Original SID"] = sheet_df[["sample_name", "sample_set", "replicate"]].apply(
        lambda a: "-".join(map(str, a)), axis=1)
    fdf = fdf.merge(sheet_df)
    print(fdf.shape)
    fastq_df_list.append(fdf)
fastq_df = pd.concat(fastq_df_list, ignore_index=True)
fastq_df.head()

190312 (3776, 6)
(3302, 10)
190321 (3856, 6)
(3282, 10)
190405 (6862, 6)
(3030, 10)
190517 (6240, 6)
(664, 10)
200316 (3146, 6)
(3144, 10)
200609 (1570, 6)
(1568, 10)


Unnamed: 0,Original SID,Read Order,File Name,File Path,Run Name,Run ID,sample_name,sample_set,replicate,Library Prep
0,22801-JHU-1,1,22801-JHU-1_S1825_R1_001.fastq.gz,/opt/work/190312_nextseq/fastq/22801-JHU-1_S18...,190312_nextseq,190312,22801,JHU,1,190312
1,22801-JHU-1,2,22801-JHU-1_S1825_R2_001.fastq.gz,/opt/work/190312_nextseq/fastq/22801-JHU-1_S18...,190312_nextseq,190312,22801,JHU,1,190312
2,RXS480-ROS-1,2,RXS480-ROS-1_S953_R2_001.fastq.gz,/opt/work/190312_nextseq/fastq/RXS480-ROS-1_S9...,190312_nextseq,190312,RXS480,ROS,1,190312
3,RXS480-ROS-1,1,RXS480-ROS-1_S953_R1_001.fastq.gz,/opt/work/190312_nextseq/fastq/RXS480-ROS-1_S9...,190312_nextseq,190312,RXS480,ROS,1,190312
4,RXS514-ROS-1,1,RXS514-ROS-1_S996_R1_001.fastq.gz,/opt/work/190312_nextseq/fastq/RXS514-ROS-1_S9...,190312_nextseq,190312,RXS514,ROS,1,190312


Now we have a table that tracks sample_name, sample_set, replicate, Library Prep fields that identify the unique library and the locations of fastq files from each sequencing run.
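
When the row counts change during such a merge, it helps to see which rows were dropped. A possible way to do this, sketched here on toy tables rather than the real data, is an outer merge with pandas' `indicator=True` option, which adds a `_merge` column marking rows found on only one side. The toy table contents are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for fdf (fastq files) and sheet_df (sample sheet rows).
toy_fdf = pd.DataFrame({
    "Original SID": ["A-ROS-1", "A-ROS-1", "Undetermined"],
    "File Name": ["A-ROS-1_S1_R1_001.fastq.gz",
                  "A-ROS-1_S1_R2_001.fastq.gz",
                  "Undetermined_S0_R1_001.fastq.gz"]})
toy_sheet_df = pd.DataFrame({
    "Original SID": ["A-ROS-1", "B-ROS-1"],
    "Library Prep": [190312, 190312]})

# Outer merge keeps unmatched rows from both sides and labels their origin.
merged = toy_fdf.merge(toy_sheet_df, how="outer", indicator=True)

# Fastq files with no matching sample sheet row.
files_without_sample = merged.loc[
    merged["_merge"] == "left_only", "File Name"].tolist()
# Sample sheet rows for which no fastq file was found.
samples_without_files = merged.loc[
    merged["_merge"] == "right_only", "Original SID"].tolist()

print(files_without_sample)
print(samples_without_files)
```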

### Select project specific libraries
Most MIP sequencing runs have libraries belonging to different projects. We will select those that belong to this project. 

First, load the sample files provided and visualize the unique sample sets.

In [13]:
sample_sheet_list = []
for r in run_info:
    s_sheet = pd.read_table(run_info[r]["Sample Sheet"])
    s_sheet["Run Name"] = run_info[r]["Run Name"]
    sample_sheet_list.append(s_sheet)
sample_sheets = pd.concat(sample_sheet_list, axis=0, ignore_index=True)
sample_sheets.groupby(["sample_set", "probe_set"]).first()

Unnamed: 0_level_0,Unnamed: 1_level_0,384 Column,FW_plate,Library Prep,REV_plate,Run Name,capture_plate,capture_plate_column,capture_plate_row,diff,fw,owner,quadrant,replicate,rev,sample_name,sample_plate
sample_set,probe_set,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
AG,HeOME_subset,Odd,FW-25,200316,REV-1,200316_nextseq,200305_HeOME_Begoro_P1,1,A,,193,jonathan,1.0,1,1,07GHR5001,2014_Begoro_Hospital_Plate3
HRP,"HRP2,HRP3,HRPF",Odd,FW-18,190312,REV-2,190312_nextseq,DR2_HRP_ROS10_JP_Pilot,1,E,-5.0,102,patrick,2.0,1,97,4624-T,2018 EPHI WHO Plate 1 (Pilot)
JHU,IBC,Odd,FW-5,180723,REV-1,190312_nextseq,IBC_Giovanna_Zambia_Plate_1,11,B,360.0,23,patrick,1.0,1,383,22755,IBC_JHU_Recapture_P1
JJJ,DR2,,,190312,,190312_nextseq,190117_DR2_MIPfest_2019_P01,1,A,-12.0,193,Patrick,,1,181,FP20443,
JJJ,IBC,,,190312,,190312_nextseq,IBC_DR2_MIPfest_2019_P05,1,A,-264.0,289,Maddi,,2,25,FP20443,
OPT,DR2,Odd,FW-12,200316,REV-3,200316_nextseq,191125_VeriFi_polymerase_Test,1,A,,12,deborah,1.0,1,193,3D7-1,CTL-1
OPT,HeOME_subset,Odd,FW-74,200316,REV-2,200316_nextseq,200123_HeOMEpanel,1,A,,349,deborah,1.0,1,97,3D7,CTL
ROS,DR2,Odd,FW-30,190312,REV-3,190312_nextseq,DR2_HRP_ROS10_JP_Pilot,1,I,-5.0,198,patrick,3.0,1,193,HC,CTL_DR2_HRP_ROS10_JP_Pilot1
uganda,DR2,Odd,FW-3,200316,REV-1,200316_nextseq,Prism_recapture_plate1,1,A,,3,deborah,1.0,1,1,TO-04-01,plate_1


Only ROS and uganda sets are relevant for this submission. So we'll limit the sample table to those.

In [14]:
sample_sheets = sample_sheets.loc[(sample_sheets["sample_set"].isin(["ROS", "uganda"]))]
print(sample_sheets.shape)
sample_sheets.head()

(3966, 18)


Unnamed: 0,384 Column,FW_plate,Library Prep,REV_plate,Run Name,capture_plate,capture_plate_column,capture_plate_row,diff,fw,owner,probe_set,quadrant,replicate,rev,sample_name,sample_plate,sample_set
366,Odd,FW-30,190312,REV-3,190312_nextseq,DR2_HRP_ROS10_JP_Pilot,1,I,-5.0,198,patrick,DR2,3.0,1,193,HC,CTL_DR2_HRP_ROS10_JP_Pilot1,ROS
368,Odd,FW-30,190312,REV-3,190312_nextseq,DR2_HRP_ROS10_JP_Pilot,3,I,-5.0,199,patrick,DR2,3.0,2,194,HC,CTL_DR2_HRP_ROS10_JP_Pilot1,ROS
370,Odd,FW-30,190312,REV-3,190312_nextseq,DR2_HRP_ROS10_JP_Pilot,5,I,-5.0,200,patrick,DR2,3.0,1,195,LC,CTL_DR2_HRP_ROS10_JP_Pilot1,ROS
372,Odd,FW-30,190312,REV-3,190312_nextseq,DR2_HRP_ROS10_JP_Pilot,7,I,-5.0,201,patrick,DR2,3.0,2,196,LC,CTL_DR2_HRP_ROS10_JP_Pilot1,ROS
374,Odd,FW-30,190312,REV-3,190312_nextseq,DR2_HRP_ROS10_JP_Pilot,9,I,-5.0,202,patrick,DR2,3.0,1,197,NTC,CTL_DR2_HRP_ROS10_JP_Pilot1,ROS


Also, drop the negative controls. These libraries should be inspected for contamination at the data analysis step, but there is no need to upload them to the SRA.

In [15]:
sample_sheets = sample_sheets.loc[~sample_sheets["sample_name"].isin(["NTP", "NTC"])]
sample_sheets.shape

(3929, 18)

In [16]:
fastq_df.shape

(14990, 10)

Make sure there are no duplicates in the sample table. If there are, it should be inspected.

In [17]:
sample_sheets.shape

(3929, 18)

In [18]:
sample_sheets.drop_duplicates().shape

(3929, 18)
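
If the two shapes had differed, the duplicated rows would need to be inspected. One way to list them, shown here on a made-up table, is `DataFrame.duplicated` with `keep=False`, which flags every copy of a duplicated row. Note that this sketch checks only the identifying columns plus the run name, which is a stricter test than the full-row comparison used above.

```python
import pandas as pd

# Made-up sample sheet with one library listed twice for the same run.
toy_sheet = pd.DataFrame({
    "sample_name": ["S1", "S1", "S2"],
    "sample_set": ["ROS", "ROS", "ROS"],
    "replicate": [1, 1, 1],
    "Library Prep": [190312, 190312, 190312],
    "Run Name": ["190312_nextseq", "190312_nextseq", "190312_nextseq"]})

key_cols = ["sample_name", "sample_set", "replicate", "Library Prep", "Run Name"]
# keep=False marks all copies, so both offending rows are shown.
dup_rows = toy_sheet.loc[toy_sheet.duplicated(subset=key_cols, keep=False)]
print(dup_rows)
```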

Merge the fastq information with the sample table using the common column names to remove the fastq entries that do not belong to this project.

In [20]:
fastq_df = fastq_df.merge(sample_sheets[[
    "sample_name", "sample_set", "replicate", "Library Prep", "Run Name"]])

### Linking the Sample ID to the fastqs
We know what Original Sample ID each fastq file belongs to but the downstream analysis links data to Sample ID. These can be different from the original as explained above. We'll link the final Sample ID to the fastq files using the run metadata generated during downstream analysis. This file should be placed in the data directory. 

In [21]:
run_meta_file = os.path.join(data_dir, "run_meta.csv")
run_meta = pd.read_csv(run_meta_file, index_col=0)
run_meta.head()

Unnamed: 0,Sample ID,Library Prep,384 Column,FW_plate,Original SID,REV_plate,capture_plate,capture_plate_column,capture_plate_row,diff,fw,owner,probe_set,quadrant,rev,sample_name,sample_plate,sample_set,replicate,Sample Name
0,3D7-ROS-1,190517,Odd,FW-24,3D7-ROS-1,REV-2,DR2_ROS_RECAP_Plate_02,1,E,,108,deborah,DR2,2.0,97,3D7,CTL,ROS,1,3D7
1,3D7-ROS-2,190517,Odd,FW-44,3D7-ROS-2,REV-4,DR2_ROS_RECAP_Plate_01,1,M,,296,deborah,DR2,4.0,289,3D7,CTL,ROS,2,3D7
2,3D7-ROS-3,190517,Odd,FW-24,3D7-ROS-3,REV-2,DR2_ROS_RECAP_Plate_02,3,E,,97,deborah,DR2,2.0,98,3D7,CTL,ROS,3,3D7
3,3D7-ROS-4,190517,Odd,FW-44,3D7-ROS-4,REV-4,DR2_ROS_RECAP_Plate_01,3,M,,297,deborah,DR2,4.0,290,3D7,CTL,ROS,4,3D7
4,3D7-uganda-1,200316,Odd,FW-47,3D7-uganda-1,REV-1,Prism_recapture_plate3,1,I,,299,deborah,DR2,3.0,1,3D7,CTL,uganda,1,3D7


In [22]:
run_meta[["Original SID", "Sample ID", "Library Prep"]]

Unnamed: 0,Original SID,Sample ID,Library Prep
0,3D7-ROS-1,3D7-ROS-1,190517
1,3D7-ROS-2,3D7-ROS-2,190517
2,3D7-ROS-3,3D7-ROS-3,190517
3,3D7-ROS-4,3D7-ROS-4,190517
4,3D7-uganda-1,3D7-uganda-1,200316
...,...,...,...
1803,TO-04-46-uganda-1,TO-04-46-uganda-1,200316
1804,TO-04-47-uganda-1,TO-04-47-uganda-1,200316
1805,TO-04-48-uganda-1,TO-04-48-uganda-1,200316
1806,TO-04-49-uganda-1,TO-04-49-uganda-1,200316


"Dry" merge to see if anything funny is happening such as losing some entries from the fastq table.

In [23]:
fastq_df.shape

(7786, 10)

In [24]:
fastq_df.merge(run_meta[["Original SID", "Sample ID", "Library Prep"]]).shape

(7786, 11)
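
The dry merge can also be wrapped in a small helper that raises an error instead of relying on a visual comparison of shapes. The sketch below is not part of the original workflow; `merge_keep_rows` is a hypothetical name, and it assumes that each combination of key columns appears only once in the right-hand table, which pandas checks through `validate="many_to_one"`.

```python
import pandas as pd


def merge_keep_rows(left, right, **kwargs):
    """Inner-merge on common columns and raise if the left row count changes."""
    # validate="many_to_one" raises a MergeError if keys repeat in `right`.
    merged = left.merge(right, validate="many_to_one", **kwargs)
    if len(merged) != len(left):
        raise ValueError(
            "row count changed: {} -> {}".format(len(left), len(merged)))
    return merged


# Toy tables following the 3d7 example from the sample sheet section.
toy_fastq = pd.DataFrame({
    "Original SID": ["3d7-con-1", "3d7-con-1", "3d7-con-1"],
    "Library Prep": [190201, 190201, 190301],
    "Read Order": ["1", "2", "1"]})
toy_meta = pd.DataFrame({
    "Original SID": ["3d7-con-1", "3d7-con-1"],
    "Sample ID": ["3d7-con-1", "3d7-con-3"],
    "Library Prep": [190201, 190301]})

merged = merge_keep_rows(toy_fastq, toy_meta)
print(merged[["Original SID", "Library Prep", "Sample ID"]])
```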

Merge to add the final Sample ID to fastq table.

In [25]:
fastq_df = fastq_df.merge(run_meta[["Original SID", "Sample ID", "Library Prep"]])

In [26]:
fastq_df.head()

Unnamed: 0,Original SID,Read Order,File Name,File Path,Run Name,Run ID,sample_name,sample_set,replicate,Library Prep,Sample ID
0,RXS480-ROS-1,2,RXS480-ROS-1_S953_R2_001.fastq.gz,/opt/work/190312_nextseq/fastq/RXS480-ROS-1_S9...,190312_nextseq,190312,RXS480,ROS,1,190312,RXS480-ROS-1
1,RXS480-ROS-1,1,RXS480-ROS-1_S953_R1_001.fastq.gz,/opt/work/190312_nextseq/fastq/RXS480-ROS-1_S9...,190312_nextseq,190312,RXS480,ROS,1,190312,RXS480-ROS-1
2,RXS480-ROS-1,2,RXS480-ROS-1_S1004_R2_001.fastq.gz,/opt/work/190321_nextseq/fastq/RXS480-ROS-1_S1...,190321_nextseq,190321,RXS480,ROS,1,190312,RXS480-ROS-1
3,RXS480-ROS-1,1,RXS480-ROS-1_S1004_R1_001.fastq.gz,/opt/work/190321_nextseq/fastq/RXS480-ROS-1_S1...,190321_nextseq,190321,RXS480,ROS,1,190312,RXS480-ROS-1
4,RXS480-ROS-1,2,RXS480-ROS-1_S2699_R2_001.fastq.gz,/opt/work/190405_nextseq/fastq/RXS480-ROS-1_S2...,190405_nextseq,190405,RXS480,ROS,1,190312,RXS480-ROS-1


### Rename files
Non-unique fastq file names may be generated when multiple sequencing runs are performed for the same libraries. However, Sample IDs are unique within each run, so we'll rename the files by combining the Sample ID, the run ID and the read order to make every file name unique.

In [30]:
fastq_df["New File Name"] = (fastq_df["Sample ID"] + "-" + fastq_df["Run ID"].astype(str) 
                                 + "_R" + fastq_df["Read Order"].astype(str) + ".fastq.gz")
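
To illustrate the naming scheme, the sketch below maps original file names (taken from the table above) to their new names. It parses the read order with a regular expression instead of splitting on underscores; like the notebook code, it assumes the Sample ID part of the file name contains no underscore. `FASTQ_RE` and `new_file_name` are hypothetical names.

```python
import re

# Matches names such as "RXS480-ROS-1_S953_R1_001.fastq.gz".
FASTQ_RE = re.compile(r"^(?P<sid>[^_]+)_S\d+_R(?P<read>[12])_001\.fastq\.gz$")


def new_file_name(file_name, sample_id, run_id):
    """Build '<Sample ID>-<Run ID>_R<read>.fastq.gz' from an original file name."""
    match = FASTQ_RE.match(file_name)
    if match is None:
        raise ValueError("unexpected fastq file name: " + file_name)
    return "{}-{}_R{}.fastq.gz".format(sample_id, run_id, match.group("read"))


# The same library sequenced in two runs yields two distinct new names.
names = [
    new_file_name("RXS480-ROS-1_S953_R2_001.fastq.gz", "RXS480-ROS-1", 190312),
    new_file_name("RXS480-ROS-1_S1004_R2_001.fastq.gz", "RXS480-ROS-1", 190321),
]
print(names)
# ['RXS480-ROS-1-190312_R2.fastq.gz', 'RXS480-ROS-1-190321_R2.fastq.gz']
```

In the notebook itself, `fastq_df["New File Name"].is_unique` gives a quick confirmation that no two files share a new name.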

### Copy fastq files
Create a fastq directory within the project specific data directory and copy all fastq files into it.  
Inspect the process output to make sure the command completes with returncode 0 and no errors.

In [79]:
project_fastq_dir = os.path.join(data_dir, "fastq")
subprocess.run(["mkdir", "-p", project_fastq_dir], 
               stdout=subprocess.PIPE, stderr=subprocess.PIPE)

CompletedProcess(args=['mkdir', '-p', '/opt/data/project_data/fastq'], returncode=0, stdout=b'', stderr=b'')

Add new file path which is the project fastq directory and the new file name.

In [35]:
fastq_df["New File Path"] = fastq_df["New File Name"].apply(
    lambda a: os.path.join(project_fastq_dir, a))

Copy files using rsync.

In [36]:
res = fastq_df.apply(lambda a: subprocess.run(
    ["rsync", "-a", a["File Path"], a["New File Path"]],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE), axis=1)

Check that all processes completed with returncode 0. If the sum below is greater than 0, the completed process objects in res must be inspected.

In [39]:
res.apply(lambda a: a.returncode).sum()

0
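
If the sum is not 0, the failed processes and their error messages need to be pulled out. The sketch below shows the idea on two stand-in commands (one succeeds, one fails) instead of the real rsync calls. With the notebook's `res` Series, the same filter could be written as `res[res.apply(lambda a: a.returncode) != 0]`.

```python
import subprocess
import sys

# Stand-in commands: the first succeeds, the second writes to stderr and exits with 3.
commands = [
    [sys.executable, "-c", "pass"],
    [sys.executable, "-c",
     "import sys; sys.stderr.write('copy failed'); sys.exit(3)"],
]
results = [subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
           for cmd in commands]

# Keep only the processes that did not finish cleanly and show why.
failed = [r for r in results if r.returncode != 0]
for r in failed:
    print("returncode:", r.returncode, "stderr:", r.stderr.decode())
```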

### Link BioSample accession numbers
All samples used in the study must have been submitted to the NIH BioSample database. Place the sample metadata file in the project's data directory and load it.

In [40]:
sample_meta_sra = pd.read_table(os.path.join(data_dir, "sra_sample_meta.tsv"))
sample_meta_sra.head()

Unnamed: 0,sample_name,organism,isolate,host,isolation_source,collection_date,geo_loc_name,sample_type,accession
0,RXS1,Plasmodium falciparum,RXS1,Homo sapiens,Dried Blood Spot,2018,Uganda: Agago,Dried Blood Spot,SAMN15749034
1,RXS2,Plasmodium falciparum,RXS2,Homo sapiens,Dried Blood Spot,2018,Uganda: Agago,Dried Blood Spot,SAMN15749035
2,RXS3,Plasmodium falciparum,RXS3,Homo sapiens,Dried Blood Spot,2018,Uganda: Agago,Dried Blood Spot,SAMN15749036
3,RXS4,Plasmodium falciparum,RXS4,Homo sapiens,Dried Blood Spot,2018,Uganda: Agago,Dried Blood Spot,SAMN15749037
4,RXS5,Plasmodium falciparum,RXS5,Homo sapiens,Dried Blood Spot,2018,Uganda: Agago,Dried Blood Spot,SAMN15749038


Include the control sample metadata

In [41]:
control_meta_sra = pd.read_table(os.path.join(data_dir, "parasite_control_meta_sra.tsv"))
control_meta_sra.head()

Unnamed: 0,Control Type,DNA Concentration (ng/ul),sample_name,organism,isolate,collected_by,collection_date,geo_loc_name,host,host_disease,isolation_source,lat_lon,accession
0,POS,1.25e-05,D0,Plasmodium Falciparum,D0,not applicable,not applicable,not applicable,Homo sapiens,Malaria,Laboratory,not applicable,SAMN15747964
1,POS,2.5e-05,D1,Plasmodium Falciparum,D1,not applicable,not applicable,not applicable,Homo sapiens,Malaria,Laboratory,not applicable,SAMN15747965
2,POS,5e-05,D2,Plasmodium Falciparum,D2,not applicable,not applicable,not applicable,Homo sapiens,Malaria,Laboratory,not applicable,SAMN15747966
3,POS,0.0001,D3,Plasmodium Falciparum,D3,not applicable,not applicable,not applicable,Homo sapiens,Malaria,Laboratory,not applicable,SAMN15747967
4,POS,0.000175,D4,Plasmodium Falciparum,D4,not applicable,not applicable,not applicable,Homo sapiens,Malaria,Laboratory,not applicable,SAMN15747968


In [42]:
sample_meta = pd.concat([sample_meta_sra, control_meta_sra], ignore_index=True)
sample_meta = sample_meta[["sample_name" ,"accession"]]
sample_meta.shape

(1443, 2)

In [44]:
sample_meta.rename(columns={"accession": "biosample_accession"},
                   inplace=True)

Dry merge

In [45]:
fastq_df.shape

(7786, 13)

In [46]:
fastq_df.merge(sample_meta).shape

(7786, 14)

Merge

In [47]:
fastq_df = fastq_df.merge(sample_meta)

### Convert the fastq table to SRA format
SRA submission requires each library on a single row and all fastq files for a specific library to be listed as filename, filename2, filename3, etc.  

We'll start with getting the maximum number of files per library.

In [48]:
max_file_num = fastq_df.groupby("Sample ID").size().max()
max_file_num

6

Define a short function that takes one group from a groupby operation (all rows for one library) and creates a single table row containing all files for the library, padded with empty strings for libraries that have fewer than the maximum number of files.

In [49]:
def get_files(g):
    files = g["New File Name"].tolist()
    count = len(files)
    empty = max_file_num - count
    cols = ["filename"] + ["filename" + str(i) for i in range(2, max_file_num+1)]
    for i in range(empty):
        files.append("")
    return pd.Series(files, index=cols)

Create a files table in SRA format

In [50]:
files = fastq_df.groupby(["Sample ID", "biosample_accession", "Library Prep"]).apply(get_files).reset_index()
print(files.shape)
files.head()

(1787, 9)


Unnamed: 0,Sample ID,biosample_accession,Library Prep,filename,filename2,filename3,filename4,filename5,filename6
0,3D7-ROS-1,SAMN15747989,190517,3D7-ROS-1-190517_R1.fastq.gz,3D7-ROS-1-190517_R2.fastq.gz,,,,
1,3D7-ROS-2,SAMN15747989,190517,3D7-ROS-2-190517_R2.fastq.gz,3D7-ROS-2-190517_R1.fastq.gz,,,,
2,3D7-ROS-3,SAMN15747989,190517,3D7-ROS-3-190517_R2.fastq.gz,3D7-ROS-3-190517_R1.fastq.gz,,,,
3,3D7-ROS-4,SAMN15747989,190517,3D7-ROS-4-190517_R2.fastq.gz,3D7-ROS-4-190517_R1.fastq.gz,,,,
4,3D7-uganda-1,SAMN15747989,200316,3D7-uganda-1-200316_R2.fastq.gz,3D7-uganda-1-200316_R1.fastq.gz,3D7-uganda-1-200609_R1.fastq.gz,3D7-uganda-1-200609_R2.fastq.gz,,
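
For comparison, the same long-to-wide reshaping can be sketched with `cumcount` and `pivot`. This is an alternative shown on toy data, not the method used in this notebook; it numbers the files within each library and spreads them over the filename columns.

```python
import pandas as pd

# Toy table: library A has two files, library B has four.
toy = pd.DataFrame({
    "Sample ID": ["A", "A", "B", "B", "B", "B"],
    "New File Name": ["A-1_R1.fastq.gz", "A-1_R2.fastq.gz",
                      "B-1_R1.fastq.gz", "B-1_R2.fastq.gz",
                      "B-2_R1.fastq.gz", "B-2_R2.fastq.gz"]})

# Number the files within each library: 0, 1, 2, ...
toy["n"] = toy.groupby("Sample ID").cumcount()
# One row per library, one column per file number; missing files become "".
wide = toy.pivot(index="Sample ID", columns="n",
                 values="New File Name").fillna("")
# SRA column names: filename, filename2, filename3, ...
wide.columns = ["filename" if n == 0 else "filename{}".format(n + 1)
                for n in wide.columns]
wide = wide.reset_index()
print(wide)
```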


### SRA metadata
SRA requires certain metadata. Most can be left as is below, but a description of the project, title, and most importantly previously generated **BioProject ID** is required.

In [54]:
desc_string = ("Plasmodium falciparum drug resistance loci in samples "
               "from Uganda collected in 2018-2019 were "
               "captured using molecular inversion probes and sequenced using "
               "Illumina NextSeq platform.")
title = "Plasmodium falciparum targeted sequencing of drug resistance loci in Uganda."
bioproject_accession = "PRJNA655702"
instrument_model = "NextSeq 550"
sra_meta = {
    "bioproject_accession": bioproject_accession,
    "title": title,
    "library_strategy": "Targeted-Capture",
    "library_source": "GENOMIC",
    "library_selection": "Padlock probes capture method",
    "library_layout": "Paired",
    "platform": "ILLUMINA",
    "instrument_model": instrument_model,
    "design_description": desc_string,
    "filetype": "fastq",
    "assembly": ""
    }
sra_meta_df = pd.DataFrame(sra_meta, index = [0])
sra_meta_df

Unnamed: 0,bioproject_accession,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,assembly
0,PRJNA655702,Plasmodium falciparum targeted sequencing of d...,Targeted-Capture,GENOMIC,Padlock probes capture method,Paired,ILLUMINA,NextSeq 550,Plasmodium falciparum drug resistance loci in ...,fastq,


Add the sra metadata to the files table

In [55]:
sra_meta_df["Temp"] = "Temp"
files["Temp"] = "Temp"
sra_meta_df.merge(files).drop("Temp", axis=1).shape

(1787, 20)

In [56]:
sra_meta_df = sra_meta_df.merge(files).drop("Temp", axis=1)
sra_meta_df.head()

Unnamed: 0,bioproject_accession,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,assembly,Sample ID,biosample_accession,Library Prep,filename,filename2,filename3,filename4,filename5,filename6
0,PRJNA655702,Plasmodium falciparum targeted sequencing of d...,Targeted-Capture,GENOMIC,Padlock probes capture method,Paired,ILLUMINA,NextSeq 550,Plasmodium falciparum drug resistance loci in ...,fastq,,3D7-ROS-1,SAMN15747989,190517,3D7-ROS-1-190517_R1.fastq.gz,3D7-ROS-1-190517_R2.fastq.gz,,,,
1,PRJNA655702,Plasmodium falciparum targeted sequencing of d...,Targeted-Capture,GENOMIC,Padlock probes capture method,Paired,ILLUMINA,NextSeq 550,Plasmodium falciparum drug resistance loci in ...,fastq,,3D7-ROS-2,SAMN15747989,190517,3D7-ROS-2-190517_R2.fastq.gz,3D7-ROS-2-190517_R1.fastq.gz,,,,
2,PRJNA655702,Plasmodium falciparum targeted sequencing of d...,Targeted-Capture,GENOMIC,Padlock probes capture method,Paired,ILLUMINA,NextSeq 550,Plasmodium falciparum drug resistance loci in ...,fastq,,3D7-ROS-3,SAMN15747989,190517,3D7-ROS-3-190517_R2.fastq.gz,3D7-ROS-3-190517_R1.fastq.gz,,,,
3,PRJNA655702,Plasmodium falciparum targeted sequencing of d...,Targeted-Capture,GENOMIC,Padlock probes capture method,Paired,ILLUMINA,NextSeq 550,Plasmodium falciparum drug resistance loci in ...,fastq,,3D7-ROS-4,SAMN15747989,190517,3D7-ROS-4-190517_R2.fastq.gz,3D7-ROS-4-190517_R1.fastq.gz,,,,
4,PRJNA655702,Plasmodium falciparum targeted sequencing of d...,Targeted-Capture,GENOMIC,Padlock probes capture method,Paired,ILLUMINA,NextSeq 550,Plasmodium falciparum drug resistance loci in ...,fastq,,3D7-uganda-1,SAMN15747989,200316,3D7-uganda-1-200316_R2.fastq.gz,3D7-uganda-1-200316_R1.fastq.gz,3D7-uganda-1-200609_R1.fastq.gz,3D7-uganda-1-200609_R2.fastq.gz,,
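
The temporary "Temp" column is a way of pairing the single metadata row with every row of the files table. Assuming pandas 1.2 or later is available, the same result can be obtained with `how="cross"`, as in this small sketch on toy tables.

```python
import pandas as pd

# One-row metadata table and a toy files table.
meta = pd.DataFrame({"bioproject_accession": ["PRJNA655702"],
                     "platform": ["ILLUMINA"]})
toy_files = pd.DataFrame({"Sample ID": ["A", "B"],
                          "filename": ["A_R1.fastq.gz", "B_R1.fastq.gz"]})

# Cross join: every metadata row is paired with every files row.
combined = meta.merge(toy_files, how="cross")
print(combined)
```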


Rename Sample ID to library_ID as per SRA format and remove the now unnecessary Library Prep column.

In [69]:
sra_meta_df.rename(
    columns={"Sample ID": "library_ID"}, inplace=True)
sra_meta_df.drop("Library Prep", axis=1, inplace=True)

### Split the submission into multiple submissions, each with <1000 libraries
A single submission is limited to 999 samples, so we'll create a metadata file for each sub-submission and move the files for that sub-submission to a new location below.

In [90]:
mv_results = []
mkdir_results = []
for i in range(((sra_meta_df.shape[0] - 1) // 999) + 1):
    start_index = i * 999
    end_index = start_index + 999
    sra_meta_df.iloc[start_index:end_index].to_csv(
        os.path.join(data_dir, "sra_meta_" + str(i+1) + ".tsv"), sep="\t", index=False)
    new_fastq_dir = os.path.join(project_fastq_dir, str(i+1))
    res = subprocess.run(["mkdir", new_fastq_dir],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    mkdir_results.append(res)
    res = fastq_df.loc[fastq_df["Sample ID"].isin(sra_meta_df.iloc[start_index:end_index]["library_ID"]),
            "New File Path"].apply(lambda a: subprocess.run(
                                    ["mv", a, new_fastq_dir],
                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE))
    mv_results.append(res)

Check the subprocess outputs to make sure there are no errors. We expect the returncode sums to be 0.

In [91]:
for res in mv_results:
    print(res.apply(lambda a: a.returncode).sum())

0

In [None]:
for res in mkdir_results:
    print(res.returncode)

Now we have generated an SRA metadata file in the project directory and a fastq directory within the project fastq directory for **each sub-submission**. The fastqs will be in "/opt/data/project_data/fastq/1" for the first sub-submission, for example.
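
The index arithmetic used for the split can be checked in isolation. The helper below is a small illustration (`chunk_bounds` is a hypothetical name); for the 1787 libraries in this submission it yields two sub-submissions.

```python
def chunk_bounds(n_rows, chunk_size=999):
    """Return (start, end) row positions for consecutive chunks of at most chunk_size rows."""
    n_chunks = (n_rows - 1) // chunk_size + 1
    return [(i * chunk_size, min((i + 1) * chunk_size, n_rows))
            for i in range(n_chunks)]


print(chunk_bounds(1787))  # [(0, 999), (999, 1787)]
```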

Let's return to the submission procedure outlined at the top of this notebook (steps 1 to 5).

### Upload SRA metadata
We'll pick up from step 3, where we upload the metadata file generated for each sub-submission.

### Upload fastq files
Following the instructions in step 4, we'll:
  * navigate to "/opt/data/project_data/fastq/1"
  * start a screen
  * connect to NIH ftp
  * upload the files
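
The upload was done with the command line ftp client as described above. As an alternative, the same steps could be scripted with Python's standard `ftplib` module. The sketch below is untested against the NIH server. The user name, password and remote folder are placeholders that must be taken from the submission portal, and `upload_directory` is a hypothetical helper.

```python
import os
from ftplib import FTP


def upload_directory(ftp, local_dir, remote_dir):
    """Upload every regular file in local_dir to remote_dir on an open FTP session."""
    ftp.cwd(remote_dir)
    uploaded = []
    for name in sorted(os.listdir(local_dir)):
        path = os.path.join(local_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        with open(path, "rb") as handle:
            ftp.storbinary("STOR " + name, handle)
        uploaded.append(name)
    return uploaded


# Usage (not run here). Only the host name comes from this guide; the
# credentials and remote folder are placeholders from the submission portal.
# with FTP("ftp-private.ncbi.nlm.nih.gov") as ftp:
#     ftp.login(user="USERNAME_FROM_PORTAL", passwd="PASSWORD_FROM_PORTAL")
#     upload_directory(ftp, "/opt/data/project_data/fastq/1",
#                      "REMOTE_FOLDER_FROM_PORTAL")
```

The function returns the list of uploaded file names, which can be compared against the file names listed in the metadata file for that sub-submission.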