# NIH biosample submission
This notebook was used to create an NIH BioSample submission for Plasmodium falciparum MIP sequencing data in collaboration with the Rosenthal Lab. It also serves as a reference for falciparum sample submissions.

Submissions are made through the NIH website: https://submit.ncbi.nlm.nih.gov/subs/biosample/

The submission portal has five steps for sample submissions, listed below. **The main purpose of this notebook is to create the file needed at step 4**.

Step 1: Submitter details

Step 2: General Info
  * Choose data release time (immediate or future)
  * Single sample vs batch: this guide is for batch submissions
  
Step 3: Sample Type
  * Although _pathogen affecting public health_ seems like the natural choice for falciparum, it is the wrong one here. Apparently only bacteria and viruses are in this category. We go with **Microbe**.
  
Step 4: Attributes
  * Upload a file using Excel or text format (tab-delimited) that includes the attributes for each of your BioSamples

Step 5: Review and submit.

### Starting the notebook
This notebook was started with the below commands:
```bash
base_resources=~/git/MIPTools/base_resources
data_dir=~/processed/analysis/2020-06/ROS_200612/project_data
analysis_dir=~/processed/analysis/2020-06/ROS_200612/sra_submission
container=~/shared_bin/miptools_20200728.sif

mkdir -p $analysis_dir

singularity run --app jupyter \
    -B $base_resources:/opt/resources \
    -B $data_dir:/opt/data \
    -B $analysis_dir:/opt/analysis \
    $container
```

### Sample meta data
We need the sample meta data for the submission. The minimum information needed is a unique name for each sample, the collection date, geographical information, and the sample type (whole blood, DBS, cell culture, etc.).

A full description of the meta data fields can be found on the submission portal: https://submit.ncbi.nlm.nih.gov/biosample/template/?package=Microbe.1.0&action=definition

It is best to have a dedicated data folder for each project where sequencing data, sample data, etc. are stored. In this guide the location of that folder is "/opt/data/", and the sample meta data file should be placed there before starting. With the MIPTools Singularity container, this means binding the local data directory to /opt/data/ when starting the Jupyter notebook (see above). New files will be saved to the /opt/analysis folder.

Import required libraries

In [2]:
import sys
sys.path.append("/opt/src")
import os
import subprocess
import pandas as pd

Provide meta data file name and data directory. Import the sample meta data file which is a comma separated text file in this guide.

In [4]:
data_dir = "/opt/data/"
sample_meta_file = "combined_sample_meta.csv"
sample_meta = pd.read_csv(os.path.join(data_dir, sample_meta_file))
sample_meta.head()

Unnamed: 0.1,Unnamed: 0,Age (years),Barcode,Code,Comments,Control Type,DNA Concentration (ng/ul),Date of collection,District,Human DNA,...,Unnamed: 5,Well,Well position,Year,extraction,level_0,level_1,plate,sample ID,volume ul
0,0,10.0,,,,,,5/18/2018,Agago,,...,,,A01,2018.0,chelex,,,,AG-3-01,50.0
1,1,2.0,,,,,,5/18/2018,Agago,,...,,,A02,2018.0,chelex,,,,AG-3-02,50.0
2,2,7.0,,,,,,5/19/2018,Agago,,...,,,A03,2018.0,chelex,,,,AG-3-03,50.0
3,3,10.0,,,,,,5/22/2018,Agago,,...,,,A04,2018.0,chelex,,,,AG-3-04,50.0
4,4,1.3,,,,,,5/23/2018,Agago,,...,,,A05,2018.0,chelex,,,,AG-3-05,50.0


Below are the required headers for the submission.

In [7]:
sra_meta_keys = ["sample_name", "organism", "isolate", "host",
                 "isolation_source", "collection_date", "geo_loc_name",
                 "sample_type"]

List the headers present in the sample meta file.

In [3]:
sample_meta.columns

Index(['Unnamed: 0', 'Age (years)', 'Barcode', 'Code', 'Comments',
       'Control Type', 'DNA Concentration (ng/ul)', 'Date of collection',
       'District', 'Human DNA', 'MIPS plate ID', 'Parasite Density', 'Plate ',
       'Rosenthal Lab Sample ID', 'S/N', 'SITE', 'Sample Name', 'Site',
       'Specimen Type', 'Study Short Code', 'Survey round', 'Test', 'UID',
       'Unnamed: 5', 'Well', 'Well position', 'Year', 'extraction', 'level_0',
       'level_1', 'plate', 'sample ID', 'volume ul'],
      dtype='object')
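
A quick way to see which required headers still have to be created is to compare the two lists. The snippet below is a minimal sketch on a toy table (`toy_meta` and its values are hypothetical stand-ins for the real sample meta data); only the header names come from this notebook.

```python
import pandas as pd

# Required BioSample headers, as listed earlier in this notebook.
sra_meta_keys = ["sample_name", "organism", "isolate", "host",
                 "isolation_source", "collection_date", "geo_loc_name",
                 "sample_type"]

# Toy stand-in for the real sample meta table (hypothetical values).
toy_meta = pd.DataFrame({"Sample Name": ["S1", "S2"],
                         "District": ["Agago", "Kabale"],
                         "Year": [2018.0, 2019.0]})

# Headers that are not yet present and must be derived from other columns.
missing_keys = [key for key in sra_meta_keys if key not in toy_meta.columns]
print(missing_keys)
```

In this notebook none of the required headers exist in the raw file, which is why each one is built explicitly in the sections that follow.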

**Important:** A user can submit a given sample name to the BioSample database only once. If you have already submitted any of the samples used in this project, you'll need to remove them here.

This meta file contains control samples as well and those were submitted to the database before, so we'll remove those.

In [5]:
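# Keep non-control samples; .copy() avoids chained-assignment warnings
# when new columns are added to this table below.
sample_meta = sample_meta.loc[sample_meta["Control Type"].isnull()].copy()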
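
Because each sample name can be submitted only once, it can help to check for repeated names and for names that were already submitted before building the file. This is a minimal sketch on toy data; `previously_submitted` is a hypothetical set that you would have to assemble from your own earlier submissions.

```python
import pandas as pd

# Toy sample table (hypothetical names).
toy_meta = pd.DataFrame({"Sample Name": ["S1", "S2", "S2", "CTRL1"]})

# Hypothetical set of names that were submitted in an earlier project.
previously_submitted = {"CTRL1"}

# Names that occur more than once within this table.
is_repeated = toy_meta["Sample Name"].duplicated(keep=False)
duplicated_names = toy_meta.loc[is_repeated, "Sample Name"].unique().tolist()

# Drop samples whose names were already submitted.
already_submitted = toy_meta["Sample Name"].isin(previously_submitted)
new_samples = toy_meta.loc[~already_submitted]

print(duplicated_names)
print(new_samples["Sample Name"].tolist())
```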

### sample_name and isolate
In our lab, we use the "Sample Name" header as a sample's unique name. This corresponds to the "sample_name" field for the submission.

The isolate header refers to the "identification or description of the specific individual from which this sample was obtained", as per the submission portal. For our purposes, this is the same as the sample name. However, if we had two longitudinal samples from the same human individual, those two samples would have the same isolate ID.

So we'll use "Sample Name" for both of these fields.

In [None]:
sample_meta["sample_name"] = sample_meta["isolate"] = sample_meta["Sample Name"]
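
For the longitudinal case described above, the isolate could be taken from a per-individual identifier instead of the sample name. The sketch below assumes a hypothetical "Individual ID" column, which does not exist in this project's meta file, and falls back to the sample name when no individual ID is recorded.

```python
import pandas as pd

# Toy table: S1 and S2 are two time points from the same person (hypothetical).
toy_meta = pd.DataFrame({
    "Sample Name": ["S1", "S2", "S3"],
    "Individual ID": ["P01", "P01", None],  # hypothetical column
})

toy_meta["sample_name"] = toy_meta["Sample Name"]
# Use the individual ID where available, otherwise the sample name.
toy_meta["isolate"] = toy_meta["Individual ID"].fillna(toy_meta["Sample Name"])

print(toy_meta[["sample_name", "isolate"]])
```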

### collection_date
Our meta file has a "Date of collection" field, but not all samples have that information. All samples have a "Year" field, so we'll use that instead. The Year field is stored as a float in this table because of the control samples, so the values will be converted to integers. The last entries in the meta data, shown below, illustrate the missing dates.

In [12]:
sample_meta.tail()

Unnamed: 0.1,Unnamed: 0,Age (years),Barcode,Code,Comments,Control Type,DNA Concentration (ng/ul),Date of collection,District,Human DNA,...,Unnamed: 5,Well,Well position,Year,extraction,level_0,level_1,plate,sample ID,volume ul
1401,1401,,8037052000.0,KB,,,,,Kabale,,...,,F09,,2019.0,,plate 8,68.0,plate 8,,
1402,1402,,8037052000.0,KB,,,,,Kabale,,...,,F10,,2019.0,,plate 8,69.0,plate 8,,
1403,1403,,8037051000.0,KB,,,,,Kabale,,...,,F11,,2019.0,,plate 8,70.0,plate 8,,
1404,1404,,8037051000.0,KB,,,,,Kabale,,...,,F12,,2019.0,,plate 8,71.0,plate 8,,
1405,1405,,8037051000.0,KB,,,,,Kabale,,...,,G01,,2019.0,,plate 8,72.0,plate 8,,


In [None]:
sample_meta["collection_date"] = sample_meta["Year"].astype(int)
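
If you would rather keep the full date where it exists, one option is to parse "Date of collection" and fall back to the year only when the date is missing. This is a sketch, not what was done for this submission. It assumes the dates are month/day/year strings as in the table preview above and that the portal accepts ISO-style dates (YYYY-MM-DD) alongside a bare year; check the attribute definitions linked earlier before relying on it.

```python
import pandas as pd

# Toy table mimicking the two relevant columns (values are illustrative).
toy_meta = pd.DataFrame({
    "Date of collection": ["5/18/2018", None, "5/23/2018"],
    "Year": [2018.0, 2019.0, 2018.0],
})

# Parse month/day/year strings; missing or unparseable entries become NaT.
parsed = pd.to_datetime(toy_meta["Date of collection"],
                        format="%m/%d/%Y", errors="coerce")

full_date = parsed.dt.strftime("%Y-%m-%d")
year_only = toy_meta["Year"].astype(int).astype(str)

# Keep the full date where parsing succeeded, otherwise use the year.
toy_meta["collection_date"] = full_date.where(parsed.notna(), year_only)
print(toy_meta["collection_date"].tolist())
```

Note that this variant produces strings, so the year-based split later in this notebook, which compares collection_date to integers, would need to be adjusted.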

### organism and host
For this submission the organism is Plasmodium falciparum (Pf) and the host is human.

In [None]:
sample_meta["organism"] = "Plasmodium falciparum"
sample_meta["host"] = "Homo sapiens"

### isolation_source and sample_type
Dried blood spot describes both the isolation source and sample type

In [None]:
sample_meta["isolation_source"] = sample_meta["sample_type"] = "Dried Blood Spot"

### geo_loc_name
Locations are given at the district level, so we can use the "Country: District" notation suggested on the submission portal.

In [None]:
sample_meta["geo_loc_name"] = "Uganda: " + sample_meta["District"]
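
One thing to keep in mind with this concatenation is that a missing district produces a missing geo_loc_name rather than the text "Uganda: ". A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy table with one missing district (illustrative values).
toy_meta = pd.DataFrame({"District": ["Agago", np.nan, "Kabale"]})

# String concatenation propagates missing values.
geo = "Uganda: " + toy_meta["District"]
n_missing = int(geo.isnull().sum())

print(geo.tolist())
print(n_missing)
```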

Get the required columns only

In [40]:
sra_meta = sample_meta[sra_meta_keys]
sra_meta.shape

(1406, 8)

Check for any missing values.

In [42]:
sra_meta.loc[sra_meta.isnull().any(axis=1)]

Unnamed: 0,sample_name,organism,isolate,host,isolation_source,collection_date,geo_loc_name,sample_type
484,RXS485,Plasmodium falciparum,RXS485,Homo sapiens,Dried Blood Spot,2018,,Dried Blood Spot
485,RXS486,Plasmodium falciparum,RXS486,Homo sapiens,Dried Blood Spot,2018,,Dried Blood Spot
486,RXS487,Plasmodium falciparum,RXS487,Homo sapiens,Dried Blood Spot,2018,,Dried Blood Spot
487,RXS488,Plasmodium falciparum,RXS488,Homo sapiens,Dried Blood Spot,2018,,Dried Blood Spot


Four samples are missing the district name, so their geo_loc_name is missing as well. We'll replace those values with "not available".

In [43]:
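# Reassign rather than calling fillna(inplace=True) on a column selection,
# which may act on a temporary copy and leave sra_meta unchanged.
sra_meta = sra_meta.fillna({"geo_loc_name": "not available"})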

### Maximum number of samples
The database allows a maximum of 999 samples per submission. We have over 1400, so we'll divide them into two (one per year) and perform two submissions.
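
When a per-year split is not possible, or a single year still exceeds the limit, the table can be cut into fixed-size chunks instead. The following is a minimal sketch on toy data; the 999 limit is the one stated above, and the output file name in the comment is hypothetical.

```python
import pandas as pd

MAX_SAMPLES = 999  # per-submission limit stated above

# Toy table with 2500 samples (hypothetical names).
toy = pd.DataFrame({"sample_name": [f"S{i}" for i in range(1, 2501)]})

# Consecutive row slices of at most MAX_SAMPLES rows each.
chunks = [toy.iloc[start:start + MAX_SAMPLES]
          for start in range(0, len(toy), MAX_SAMPLES)]

for part, chunk in enumerate(chunks, start=1):
    # A real run could write each chunk to its own file, e.g.:
    # chunk.to_csv(f"sra_sample_meta_part{part}.tsv", sep="\t", index=False)
    print(part, len(chunk))
```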

In [46]:
sra_meta.loc[sra_meta["collection_date"] == 2018].to_csv(
    "/opt/data/project_data/sra_sample_meta_2018.tsv", sep="\t", index=False)

In [47]:
sra_meta.loc[sra_meta["collection_date"] == 2019].to_csv(
    "/opt/data/project_data/sra_sample_meta_2019.tsv", sep="\t", index=False)

### Adding biosample_accession numbers
Once the files above are created, we go ahead with the sample submission on the submission portal. In a minute or two, the submission should be processed and an "attributes file with BioSample accessions" becomes available on the portal. We'll download that file to the data directory so that we can add the BioSample accessions to the sample meta data.

For this submission, the attribute files were saved to the data dir as "sra_output_2018.tsv" and "sra_output_2019.tsv". We'll load those files.

In [48]:
output_file = os.path.join(data_dir, "sra_output_2018.tsv")
sra_output_2018 = pd.read_table(output_file)
sra_output_2018.head()

Unnamed: 0,accession,message,sample_name,organism,isolate,host,isolation_source,collection_date,geo_loc_name,sample_type
0,SAMN15749034,Successfully loaded,RXS1,Plasmodium falciparum,RXS1,Homo sapiens,Dried Blood Spot,2018,Uganda: Agago,Dried Blood Spot
1,SAMN15749035,Successfully loaded,RXS2,Plasmodium falciparum,RXS2,Homo sapiens,Dried Blood Spot,2018,Uganda: Agago,Dried Blood Spot
2,SAMN15749036,Successfully loaded,RXS3,Plasmodium falciparum,RXS3,Homo sapiens,Dried Blood Spot,2018,Uganda: Agago,Dried Blood Spot
3,SAMN15749037,Successfully loaded,RXS4,Plasmodium falciparum,RXS4,Homo sapiens,Dried Blood Spot,2018,Uganda: Agago,Dried Blood Spot
4,SAMN15749038,Successfully loaded,RXS5,Plasmodium falciparum,RXS5,Homo sapiens,Dried Blood Spot,2018,Uganda: Agago,Dried Blood Spot


In [49]:
output_file = os.path.join(data_dir, "sra_output_2019.tsv")
sra_output_2019 = pd.read_table(output_file)
sra_output_2019.head()

Unnamed: 0,accession,message,sample_name,organism,isolate,host,isolation_source,collection_date,geo_loc_name,sample_type
0,SAMN15749698,Successfully loaded,TO-04-01,Plasmodium falciparum,TO-04-01,Homo sapiens,Dried Blood Spot,2019,Uganda: Tororo,Dried Blood Spot
1,SAMN15749699,Successfully loaded,TO-04-02,Plasmodium falciparum,TO-04-02,Homo sapiens,Dried Blood Spot,2019,Uganda: Tororo,Dried Blood Spot
2,SAMN15749700,Successfully loaded,TO-04-03,Plasmodium falciparum,TO-04-03,Homo sapiens,Dried Blood Spot,2019,Uganda: Tororo,Dried Blood Spot
3,SAMN15749701,Successfully loaded,TO-04-04,Plasmodium falciparum,TO-04-04,Homo sapiens,Dried Blood Spot,2019,Uganda: Tororo,Dried Blood Spot
4,SAMN15749702,Successfully loaded,TO-04-05,Plasmodium falciparum,TO-04-05,Homo sapiens,Dried Blood Spot,2019,Uganda: Tororo,Dried Blood Spot


Concatenate the two files from the BioSample submission and merge them with the meta data table we already have.

First, check that the merge produces a table of the expected size.

In [50]:
sra_meta.shape

(1406, 8)

In [51]:
sra_meta.merge(pd.concat([sra_output_2018, sra_output_2019])).shape

(1406, 10)

In [52]:
sra_meta = sra_meta.merge(pd.concat([sra_output_2018, sra_output_2019]))
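
Beyond comparing shapes, pandas can validate the merge itself. The sketch below uses toy tables with placeholder accessions (all values hypothetical): `validate="one_to_one"` raises an error if the merge keys are duplicated on either side, and a left join with `indicator=True` names the samples that did not receive an accession.

```python
import pandas as pd

# Toy versions of the meta table and the portal output (hypothetical values).
toy_meta = pd.DataFrame({
    "sample_name": ["S1", "S2", "S3"],
    "organism": ["Plasmodium falciparum"] * 3,
})
toy_output = pd.DataFrame({
    "accession": ["ACC_0001", "ACC_0002"],  # placeholder accessions
    "message": ["Successfully loaded"] * 2,
    "sample_name": ["S1", "S2"],
    "organism": ["Plasmodium falciparum"] * 2,
})

# Merge on the shared columns, keeping every row of the meta table.
merged = toy_meta.merge(toy_output, how="left",
                        validate="one_to_one", indicator=True)

# Samples present in the meta table but absent from the portal output.
unmatched = merged.loc[merged["_merge"] == "left_only", "sample_name"].tolist()
print(unmatched)
```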

The "message" field holds the submission outcome for each sample. All samples should have the value "Successfully loaded". Let's check that there are no other values.

In [53]:
sra_meta["message"].unique()

array(['Successfully loaded'], dtype=object)
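
If other messages did show up, the affected samples could be listed as follows. This is a sketch on toy data; the non-successful message text is made up for illustration.

```python
import pandas as pd

# Toy merged table (hypothetical values).
toy = pd.DataFrame({
    "sample_name": ["S1", "S2", "S3"],
    "message": ["Successfully loaded", "Successfully loaded",
                "Some other message"],
})

# Rows whose submission outcome is anything other than success.
failed = toy.loc[toy["message"] != "Successfully loaded"]
print(failed["sample_name"].tolist())
```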

Remove the "message" field and save the final table.

In [54]:
sra_meta.drop("message", axis=1, inplace=True)

In [56]:
output_file = "sra_sample_meta.tsv"
output_path = os.path.join(data_dir, output_file)
sra_meta.to_csv(output_path, sep="\t", index=False)