Uploading FASTQ and BAM files to NDAR

The Python package [nda-tools](https://github.com/NDAR/nda-tools) has been upgraded with `pip3 install --upgrade nda-tools`.  Its command-line validation tool `vtcmd` is essential for the upload.

In [1]:
%%bash
which vtcmd

/home/attila/.local/bin/vtcmd
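
The same check can be done from Python, which is handy when later cells depend on `vtcmd` being installed. This is a minimal sketch using only the standard library; the helper name `find_tool` is hypothetical and not part of nda-tools.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is missing."""
    path = shutil.which(name)
    if path is None:
        # Hypothetical hint; the install command is the one quoted in the text above.
        print(f"{name} not found on PATH; try `pip3 install --upgrade nda-tools`")
    return path


vtcmd_path = find_tool("vtcmd")
print(vtcmd_path)
```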


In [2]:
import synapseclient
import pandas as pd
import os
import sys
import cmc_submit2ndar as cmc

Welcome, Attila Gulyás-Kovács!



In [3]:
%%bash
cd /projects/bsm/attila/results/
export bn=2019-02-19-upload-to-ndar
if test ! -d $bn; then mkdir $bn; fi
echo $bn

2019-02-19-upload-to-ndar


## Template manifest files

### brain and tissue bank (nichd_btb02)

[This Synapse folder](https://www.synapse.org/#!Synapse:syn12128752) (syn12128752) contains two manifest files for all CMC subjects.  The first one is a *brain and tissue bank* file:

In [4]:
btb_temp, btb_syn = cmc.get_manifest("syn12154562")
btb_temp.head()

Manifest file path: /tmp/nichd_btb02_U01MH106891_Chess.csv


Unnamed: 0,subjectkey,src_subject_id,interview_age,interview_date,gender,race,ethnic_group,grade_highed,disorder,cdeathoff,...,mskelzyn,respzyn,scordzyn,urinzyn,otherzyn,systxyn,mcomments,frozentissue,fixedbrain,adi_r_score
0,NDAR_INVDVXZZ5G0,CMC_MSSM_295,744,4/13/18,M,White,,,,,...,,,,,,,,,,
1,NDAR_INVDVXZZ5G0,CMC_MSSM_295,744,4/13/18,M,White,,,,,...,,,,,,,,,,
2,NDAR_INVY3TCVYKD,CMC_PITT_101,504,4/13/18,M,White,,,,,...,,,,,,,,,,
3,NDAR_INVEUUEDMKH,CMC_MSSM_304,912,4/13/18,M,White,,,,,...,,,,,,,,,,
4,NDAR_INVEUUEDMKH,CMC_MSSM_304,912,4/13/18,M,White,,,,,...,,,,,,,,,,


Each of its rows corresponds to a tissue sample, so a `src_subject_id` is not unique if multiple samples have been taken from the same subject/individual.

In [5]:
btb_temp.loc[:, ["src_subject_id", "sample_id_original"]].head()

Unnamed: 0,src_subject_id,sample_id_original
0,CMC_MSSM_295,MSSM_295.DLPFC_1178.np1
1,CMC_MSSM_295,MSSM_295.TMPR_69114.mu1
2,CMC_PITT_101,PITT_101.DRPC700.np1
3,CMC_MSSM_304,MSSM_304.DLPFC_1163.np1
4,CMC_MSSM_304,MSSM_304.TMPR_69091.mu1
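
To make the non-uniqueness concrete, the sketch below rebuilds the five rows shown above as a small stand-alone DataFrame (so it does not need the Synapse download) and counts the samples per subject. The variable names are illustrative only.

```python
import pandas as pd

# Toy copy of the five rows displayed above
toy_btb = pd.DataFrame({
    "src_subject_id": ["CMC_MSSM_295", "CMC_MSSM_295", "CMC_PITT_101",
                       "CMC_MSSM_304", "CMC_MSSM_304"],
    "sample_id_original": ["MSSM_295.DLPFC_1178.np1", "MSSM_295.TMPR_69114.mu1",
                           "PITT_101.DRPC700.np1", "MSSM_304.DLPFC_1163.np1",
                           "MSSM_304.TMPR_69091.mu1"],
})

# Number of distinct samples recorded for each subject
samples_per_subject = toy_btb.groupby("src_subject_id")["sample_id_original"].nunique()

# Subjects that appear in more than one row
multi_sample_subjects = samples_per_subject[samples_per_subject > 1]
print(multi_sample_subjects)
```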


### genomics subjects (genomics_subject02)

The second manifest is the *genomics subjects* file.  Each row is a subject/individual with clinical information such as gender, race, and phenotype (control or schizophrenia).

In [6]:
gsub_temp, gsub_syn = cmc.get_manifest("syn12128754")
gsub_temp.head()

Manifest file path: /tmp/genomics_subject02_U01MH106891_Chess.csv


Unnamed: 0,subjectkey,src_subject_id,interview_date,interview_age,gender,race,ethnic_group,phenotype,phenotype_description,twins_study,...,sample_taken,sample_id_original,sample_description,biorepository,patient_id_biorepository,sample_id_biorepository,cell_id_original,cell_id_biorepository,adi_dx,ados_dx
0,NDAR_INV0971H4H4,CMC_MSSM_033,4/13/18,972,F,African American,,control,No,No,...,Yes,MSSM_033.DLPFC_1355.np1,PFC,MSBB,,,,,,
1,NDAR_INV0UA2YLF3,CMC_MSSM_046,4/13/18,1080,F,White,,control,No,No,...,Yes,MSSM_046.DLPFC_1339.np1,PFC,MSBB,,,,,,
2,NDAR_INV1VPUF5CL,CMC_MSSM_056,4/13/18,804,F,White,,control,No,No,...,Yes,MSSM_056.DLPFC_1181.np1,PFC,MSBB,,,,,,
3,NDAR_INV2459CJE1,CMC_MSSM_061,4/13/18,816,M,White,,control,No,No,...,Yes,MSSM_061.DLPFC_1188.np1,PFC,MSBB,,,,,,
4,NDAR_INV27XJ4YKX,CMC_MSSM_065,4/13/18,1080,F,White,,control,No,No,...,Yes,MSSM_065.DLPFC_1334.np1,PFC,MSBB,,,,,,


### genomics samples (genomics_sample03)

The third manifest is the *genomics samples* file, which is missing from Synapse folder [syn12128752](https://www.synapse.org/#!Synapse:syn12128752).  Its template file and its definitions, however, are available on Synapse from the [Data Submission Instructions](https://www.synapse.org/#!Synapse:syn5902559/wiki/408697) Wiki.

In [7]:
gsam_temp, gsam_syn = cmc.get_manifest("syn8464096")
gsam_def, gsam_def_syn = cmc.get_manifest("syn7896813", skiprows=0)
gsam_def

Manifest file path: /tmp/genomics_sample03_template.csv
Manifest file path: /tmp/genomics_sample03_definitions.csv


Unnamed: 0,ElementName,DataType,Size,Required,ElementDescription,ValueRange,Notes,Aliases
0,subjectkey,GUID,,Required,The NDAR Global Unique Identifier (GUID) for r...,NDAR*,,
1,experiment_id,Integer,,Required,ID for the Experiment/settings/run,,,
2,src_subject_id,String,20.0,Required,Subject ID how it's defined in lab/project,,,
3,interview_age,Integer,,Required,Age in months at the time of the interview/tes...,0 :: 1260,Age is rounded to chronological month. If the ...,
4,interview_date,Date,,Required,Date on which the interview/genetic test/sampl...,,Required field,
5,sample_description,String,3500.0,Required,"Sample description: tissue type, i.e. blood, s...",whole blood; saliva; brain; urine; serum; plas...,,
6,sample_id_original,String,100.0,Required,"Original, user-defined Sample ID",,,
7,organism,String,50.0,Required,Organism,,,
8,sample_amount,Float,,Required,Sample amount,,,
9,sample_unit,String,50.0,Required,Measurement unit for Sample,,,


*genomics samples* is not readily available for CMC subjects/samples because it depends on the data files derived from those subjects/samples.  However, some of the required fields of *genomics samples* are also present in *genomics subjects*, so these fields can be filled in from `genomics_subject02_U01MH106891_Chess.csv`.

In [8]:
# columns of the genomics samples template that also occur in genomics subjects
shared_columns = gsam_temp.columns[gsam_temp.columns.isin(gsub_temp.columns)]
shared_columns

The remaining required fields of *genomics samples* must be filled in from other sources; these fields are listed below.

In [None]:
# columns of the genomics samples template that are absent from genomics subjects
gsam_specific_columns = gsam_temp.columns[~gsam_temp.columns.isin(gsub_temp.columns)]
# names of all required elements according to the definitions file
gsam_required_columns = gsam_def.loc[gsam_def["Required"] == "Required", "ElementName"]
print("Columns that are both required for and specific to the 'genomics samples' manifest")
gsam_specific_required_columns = gsam_required_columns.loc[gsam_required_columns.isin(gsam_specific_columns)]
gsam_def.loc[gsam_def["ElementName"].isin(gsam_specific_required_columns), :]
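
The column bookkeeping in the last two cells can be illustrated without the downloaded files. The sketch below uses two hand-written column lists, each a subset of the column names displayed earlier for the *genomics samples* definitions and the *genomics subjects* manifest, so its result describes only these toy lists, not the full manifests.

```python
import pandas as pd

# Subset of genomics_sample03 element names shown in the definitions table
toy_gsam_cols = pd.Index([
    "subjectkey", "experiment_id", "src_subject_id", "interview_age",
    "interview_date", "sample_description", "sample_id_original",
    "organism", "sample_amount", "sample_unit",
])

# Subset of genomics_subject02 column names shown in the manifest preview
toy_gsub_cols = pd.Index([
    "subjectkey", "src_subject_id", "interview_date", "interview_age",
    "gender", "race", "phenotype", "sample_id_original",
    "sample_description", "biorepository",
])

# Boolean mask: which sample columns also exist among the subject columns
is_shared = toy_gsam_cols.isin(toy_gsub_cols)

shared = toy_gsam_cols[is_shared]      # can be copied from genomics subjects
specific = toy_gsam_cols[~is_shared]   # must come from another source

print("shared:  ", list(shared))
print("specific:", list(specific))
```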

For our purposes, `data_file2` and `data_file2_type` will also be needed because we have paired-end sequencing data.

## Creating manifest files

The subjects of interest are MSSM_106 and PITT_118; the cells below create the manifests for MSSM_106.

In [12]:
sel_subj = "MSSM_106"
sel_subj_cmc = "CMC_" + sel_subj

# brain and tissue bank manifest
btb1 = btb_temp.loc[btb_temp["src_subject_id"] == sel_subj_cmc, :]
cmc.write_manifest(btb1, template_path="/tmp/nichd_btb02_U01MH106891_Chess.csv", target_path="nichd_btb02-" + sel_subj_cmc + ".csv")
btb1

Unnamed: 0,subjectkey,src_subject_id,interview_age,interview_date,gender,race,ethnic_group,grade_highed,disorder,cdeathoff,...,mskelzyn,respzyn,scordzyn,urinzyn,otherzyn,systxyn,mcomments,frozentissue,fixedbrain,adi_r_score
55,NDAR_INV3F2A6YAC,CMC_MSSM_106,1080,4/13/18,F,White,,,,,...,,,,,,,,,,
56,NDAR_INV3F2A6YAC,CMC_MSSM_106,1080,4/13/18,F,White,,,,,...,,,,,,,,,,
57,NDAR_INV3F2A6YAC,CMC_MSSM_106,1080,4/13/18,F,White,,,,,...,,,,,,,,,,


In [14]:
# genomics subjects manifest
gsub1 = gsub_temp.loc[gsub_temp["src_subject_id"] == sel_subj_cmc, :]
cmc.write_manifest(gsub1, template_path="/tmp/genomics_subject02_U01MH106891_Chess.csv", target_path="genomics_subject02-" + sel_subj_cmc + ".csv")
gsub1

Unnamed: 0,subjectkey,src_subject_id,interview_date,interview_age,gender,race,ethnic_group,phenotype,phenotype_description,twins_study,...,sample_taken,sample_id_original,sample_description,biorepository,patient_id_biorepository,sample_id_biorepository,cell_id_original,cell_id_biorepository,adi_dx,ados_dx
10,NDAR_INV3F2A6YAC,CMC_MSSM_106,4/13/18,1080,F,White,,control,No,No,...,Yes,MSSM_106.DLPFC_1399.np1,PFC,MSBB,,,,,,
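
`cmc.write_manifest` comes from the author's own `cmc_submit2ndar` module, whose source is not shown here. The following is only a guess at what such a helper might do, under the assumption that the template's first line is a structure header that has to precede the column names (the `skiprows=0` argument, used above only for the definitions file, hints at this). The function name `write_manifest_sketch` and the header text `genomics_subject,02` are hypothetical.

```python
import pandas as pd


def write_manifest_sketch(df, template_text):
    """Return CSV text: the template's first line followed by df as CSV.

    Assumption: the first line of the template is a header naming the
    data structure, and the column names start on the second line.
    """
    first_line = template_text.splitlines()[0]
    return first_line + "\n" + df.to_csv(index=False)


# Hypothetical two-line template and a one-row selection for MSSM_106
template_text = "genomics_subject,02\nsubjectkey,src_subject_id\n"
toy_gsub1 = pd.DataFrame({
    "subjectkey": ["NDAR_INV3F2A6YAC"],
    "src_subject_id": ["CMC_MSSM_106"],
})

manifest_text = write_manifest_sketch(toy_gsub1, template_text)
print(manifest_text)
```

Returning the text instead of writing a file keeps the sketch free of side effects; a real helper would write it to `target_path`.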


```
#! /usr/bin/env bash

#d=/home/attila/projects/bsm/ndar/benchmark/vt-python/test0
d=$(dirname "$0")
f1=$d/nichd_btb02.csv
f2=$d/genomics_subject02.csv
f3=$d/genomics_sample03.csv

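# NDA_PASSWORD is an assumed environment variable, set outside this script,
# so that the password is not stored in plain text.
vtcmd \
"$f1" "$f2" "$f3" \
-u attilagk \
-p "$NDA_PASSWORD" \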
-l /projects/bsm/alignments/ceph-benchmark/ /projects/bsm/reads/2018-01-10-Benchmark-DV-X10/ \
-a BSMN-S3 \
-t "U01MH106891, reference_tissue, benchmark/mixin" \
-d "FASTQs and BAMs for Benchmark (CEPH/Utah DNA mixes), Chess lab" \
-b
```
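
For use inside the notebook, the same call could be assembled in Python. This is a sketch only: it builds the argument list with the flags that appear in the script above and prints a masked version, but does not run `vtcmd`. The function name `build_vtcmd_args` and the environment variable `NDA_PASSWORD` are assumptions made for this example.

```python
import os
import shlex


def build_vtcmd_args(manifests, username, password, list_dirs,
                     endpoint, title, description):
    """Assemble the vtcmd argument list using the flags from the script above."""
    return [
        "vtcmd", *manifests,
        "-u", username,
        "-p", password,
        "-l", *list_dirs,
        "-a", endpoint,
        "-t", title,
        "-d", description,
        "-b",
    ]


args = build_vtcmd_args(
    manifests=["nichd_btb02.csv", "genomics_subject02.csv", "genomics_sample03.csv"],
    username="attilagk",
    # Assumed environment variable; keeps the password out of the notebook
    password=os.environ.get("NDA_PASSWORD", ""),
    list_dirs=["/projects/bsm/alignments/ceph-benchmark/",
               "/projects/bsm/reads/2018-01-10-Benchmark-DV-X10/"],
    endpoint="BSMN-S3",
    title="U01MH106891, reference_tissue, benchmark/mixin",
    description="FASTQs and BAMs for Benchmark (CEPH/Utah DNA mixes), Chess lab",
)

# Mask the value that follows -p before printing the command
masked = ["****" if i > 0 and args[i - 1] == "-p" else a
          for i, a in enumerate(args)]
print(shlex.join(masked))
```

Passing `args` to `subprocess.run(args, check=True)` would then execute the command; a list of arguments avoids the shell quoting problems that the multi-word title and description could otherwise cause.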