In [1]:
%load_ext autoreload
%autoreload 2

# Uploading FASTQ and BAM files to NDAR

The Python package [nda-tools](https://github.com/NDAR/nda-tools) has been upgraded with `pip3 install --upgrade nda-tools`.  Its command-line validation tool, `vtcmd`, is essential for the upload.

In [2]:
%%bash
which vtcmd

/home/attila/.local/bin/vtcmd
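
The same check can be made from Python, which may be convenient if a later cell should stop early when `vtcmd` is not installed. This is a minimal sketch using only the standard library; the helper name `require_tool` is hypothetical and not part of nda-tools.

```python
import shutil


def require_tool(name):
    """Return the full path of an executable found on PATH, or raise if it is missing."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(f"{name} not found on PATH; is nda-tools installed?")
    return path


if __name__ == "__main__":
    try:
        print(require_tool("vtcmd"))
    except FileNotFoundError as err:
        print(err)
```

On the machine used here the call would be expected to print the same path as `which vtcmd`.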


In [3]:
import synapseclient
import pandas as pd
import os
import sys
import glob
import cmc_submit2ndar as cmc
syn = synapseclient.login()

Welcome, Attila Gulyás-Kovács!



In [4]:
%%bash
cd /projects/bsm/attila/results/
export bn=2019-02-19-upload-to-ndar
if test ! -d $bn; then mkdir $bn; fi
echo $bn

2019-02-19-upload-to-ndar


## Template manifest files

### brain and tissue bank (nichd_btb02)

[This Synapse folder](https://www.synapse.org/#!Synapse:syn12128752) (syn12128752) contains two manifest files for all CMC subjects.  The first one is a *brain and tissue bank* file:

In [5]:
btb_temp, btb_syn = cmc.get_manifest("syn12154562", syn)
btb_temp.head()

Unnamed: 0,subjectkey,src_subject_id,interview_age,interview_date,gender,race,ethnic_group,grade_highed,disorder,cdeathoff,...,mskelzyn,respzyn,scordzyn,urinzyn,otherzyn,systxyn,mcomments,frozentissue,fixedbrain,adi_r_score
0,NDAR_INVDVXZZ5G0,CMC_MSSM_295,744,4/13/18,M,White,,,,,...,,,,,,,,,,
1,NDAR_INVDVXZZ5G0,CMC_MSSM_295,744,4/13/18,M,White,,,,,...,,,,,,,,,,
2,NDAR_INVY3TCVYKD,CMC_PITT_101,504,4/13/18,M,White,,,,,...,,,,,,,,,,
3,NDAR_INVEUUEDMKH,CMC_MSSM_304,912,4/13/18,M,White,,,,,...,,,,,,,,,,
4,NDAR_INVEUUEDMKH,CMC_MSSM_304,912,4/13/18,M,White,,,,,...,,,,,,,,,,


Each of its rows corresponds to a tissue sample, so a `src_subject_id` is not unique if multiple samples have been taken from the same subject/individual.

In [6]:
btb_temp.loc[:, ["src_subject_id", "sample_id_original"]].head()

Unnamed: 0,src_subject_id,sample_id_original
0,CMC_MSSM_295,MSSM_295.DLPFC_1178.np1
1,CMC_MSSM_295,MSSM_295.TMPR_69114.mu1
2,CMC_PITT_101,PITT_101.DRPC700.np1
3,CMC_MSSM_304,MSSM_304.DLPFC_1163.np1
4,CMC_MSSM_304,MSSM_304.TMPR_69091.mu1
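
To make the one-row-per-sample structure explicit, the number of samples per subject can be counted. The sketch below works on a toy table whose values are copied from the five rows shown above; `btb_toy` is an illustrative stand-in for `btb_temp`, not the full manifest.

```python
import pandas as pd

# Toy excerpt of the brain and tissue bank manifest (values copied from the output above)
btb_toy = pd.DataFrame({
    "src_subject_id": ["CMC_MSSM_295", "CMC_MSSM_295", "CMC_PITT_101",
                       "CMC_MSSM_304", "CMC_MSSM_304"],
    "sample_id_original": ["MSSM_295.DLPFC_1178.np1", "MSSM_295.TMPR_69114.mu1",
                           "PITT_101.DRPC700.np1", "MSSM_304.DLPFC_1163.np1",
                           "MSSM_304.TMPR_69091.mu1"],
})

# Number of distinct tissue samples per subject
samples_per_subject = btb_toy.groupby("src_subject_id")["sample_id_original"].nunique()

# Subjects that occur in more than one row
multi_sample = samples_per_subject[samples_per_subject > 1].index.tolist()

print(samples_per_subject)
print(multi_sample)
```

In this excerpt the sample IDs are unique while two of the three subject IDs repeat.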


### genomics subjects (genomics_subject02)

The second manifest is the *genomics subjects* file.  Each row is a subject/individual with clinical information such as gender, race, and phenotype (control or schizophrenia).

In [7]:
gsub_temp, gsub_syn = cmc.get_manifest("syn12128754", syn)
gsub_temp.head()

Unnamed: 0,subjectkey,src_subject_id,interview_date,interview_age,gender,race,ethnic_group,phenotype,phenotype_description,twins_study,...,sample_taken,sample_id_original,sample_description,biorepository,patient_id_biorepository,sample_id_biorepository,cell_id_original,cell_id_biorepository,adi_dx,ados_dx
0,NDAR_INV0971H4H4,CMC_MSSM_033,4/13/18,972,F,African American,,control,No,No,...,Yes,MSSM_033.DLPFC_1355.np1,PFC,MSBB,,,,,,
1,NDAR_INV0UA2YLF3,CMC_MSSM_046,4/13/18,1080,F,White,,control,No,No,...,Yes,MSSM_046.DLPFC_1339.np1,PFC,MSBB,,,,,,
2,NDAR_INV1VPUF5CL,CMC_MSSM_056,4/13/18,804,F,White,,control,No,No,...,Yes,MSSM_056.DLPFC_1181.np1,PFC,MSBB,,,,,,
3,NDAR_INV2459CJE1,CMC_MSSM_061,4/13/18,816,M,White,,control,No,No,...,Yes,MSSM_061.DLPFC_1188.np1,PFC,MSBB,,,,,,
4,NDAR_INV27XJ4YKX,CMC_MSSM_065,4/13/18,1080,F,White,,control,No,No,...,Yes,MSSM_065.DLPFC_1334.np1,PFC,MSBB,,,,,,


### genomics samples (genomics_sample03)

The third manifest is the *genomics samples* file, which is missing from Synapse folder [syn12128752](https://www.synapse.org/#!Synapse:syn12128752).  Its template file and its definitions, however, are available on Synapse from the [Data Submission Instructions](https://www.synapse.org/#!Synapse:syn5902559/wiki/408697) Wiki.

In [8]:
gsam_temp, gsam_syn = cmc.get_manifest("syn8464096", syn)
gsam_def, gsam_def_syn = cmc.get_manifest("syn7896813", syn, skiprows=0)
gsam_def

Unnamed: 0,ElementName,DataType,Size,Required,ElementDescription,ValueRange,Notes,Aliases
0,subjectkey,GUID,,Required,The NDAR Global Unique Identifier (GUID) for r...,NDAR*,,
1,experiment_id,Integer,,Required,ID for the Experiment/settings/run,,,
2,src_subject_id,String,20.0,Required,Subject ID how it's defined in lab/project,,,
3,interview_age,Integer,,Required,Age in months at the time of the interview/tes...,0 :: 1260,Age is rounded to chronological month. If the ...,
4,interview_date,Date,,Required,Date on which the interview/genetic test/sampl...,,Required field,
5,sample_description,String,3500.0,Required,"Sample description: tissue type, i.e. blood, s...",whole blood; saliva; brain; urine; serum; plas...,,
6,sample_id_original,String,100.0,Required,"Original, user-defined Sample ID",,,
7,organism,String,50.0,Required,Organism,,,
8,sample_amount,Float,,Required,Sample amount,,,
9,sample_unit,String,50.0,Required,Measurement unit for Sample,,,


No ready-made *genomics samples* manifest exists for the CMC subjects/samples because its content depends on the data files generated from those samples.  However, some of its required fields are also present in *genomics subjects*, so these fields can be filled in from `genomics_subject02_U01MH106891_Chess.csv`.

In [9]:
shared_columns = gsam_temp.columns[gsam_temp.columns.isin(gsub_temp.columns)]
shared_columns

Index(['subjectkey', 'src_subject_id', 'interview_age', 'interview_date',
       'sample_description', 'sample_id_original', 'biorepository',
       'patient_id_biorepository', 'sample_id_biorepository',
       'cell_id_original', 'cell_id_biorepository'],
      dtype='object')
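
One way to carry the shared fields over is to reindex a *genomics subjects* row onto the *genomics samples* column layout, which copies the shared columns and leaves the rest empty. This is only a sketch on toy data: the column names come from the outputs above, the subset of columns is chosen for illustration, and it is not necessarily how `cmc.make_manifests` works.

```python
import pandas as pd

# Toy stand-in for one row of the genomics subjects manifest
gsub_toy = pd.DataFrame({
    "subjectkey": ["NDAR_INV0971H4H4"],
    "src_subject_id": ["CMC_MSSM_033"],
    "interview_age": [972],
    "gender": ["F"],
    "sample_id_original": ["MSSM_033.DLPFC_1355.np1"],
})

# Illustrative subset of the genomics samples columns, in template order
gsam_columns = ["subjectkey", "experiment_id", "src_subject_id",
                "interview_age", "sample_id_original", "organism"]

# reindex keeps shared columns, drops subject-only ones, and adds the rest as NaN
gsam_toy = gsub_toy.reindex(columns=gsam_columns)

# Columns that are still empty must be filled from another source
still_missing = [c for c in gsam_columns if gsam_toy[c].isna().all()]

print(gsam_toy)
print(still_missing)
```

The columns reported as still missing correspond to the sample-specific fields discussed next.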

The remaining required fields of *genomics samples* must be filled in from other sources of information; these fields are listed below.

In [10]:
gsam_specific_columns = gsam_temp.columns[~gsam_temp.columns.isin(gsub_temp.columns)]
gsam_required_columns = gsam_def.loc[gsam_def["Required"] == "Required", "ElementName"]
gsam_specific_required_columns = gsam_required_columns[gsam_required_columns.isin(gsam_specific_columns)]
print("Columns that are both required for and specific to the 'genomics samples' manifest")
gsam_def.loc[gsam_def["ElementName"].isin(gsam_specific_required_columns), :]

Columns that are both required for and specific to the 'genomics samples' manifest


Unnamed: 0,ElementName,DataType,Size,Required,ElementDescription,ValueRange,Notes,Aliases
1,experiment_id,Integer,,Required,ID for the Experiment/settings/run,,,
7,organism,String,50.0,Required,Organism,,,
8,sample_amount,Float,,Required,Sample amount,,,
9,sample_unit,String,50.0,Required,Measurement unit for Sample,,,
11,data_file1_type,String,100.0,Required,type of data file,,,
12,data_file1,File,,Required,Data file,,,
19,storage_protocol,String,255.0,Required,Description of Storage Protocol,,,
20,data_file_location,String,50.0,Required,dbGaP; NDAR; NIMH Genetics; AGRE; Sfari,,,


For our purposes `data_file2` and `data_file2_type` are also needed because we have paired-end sequencing data.
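
As an illustration of how paired-end files map onto these four fields, the sketch below pairs mate files into one row each. The file names are hypothetical, the `_R1`/`_R2` naming convention is an assumption, and so is the value `"FASTQ"` for the type fields; the helper `pair_fastq` is not part of `cmc_submit2ndar`.

```python
import re
import pandas as pd

# Hypothetical file names; the _R1/_R2 suffix convention is an assumption
fastq_files = [
    "sampleA_L001_R1.fastq.gz",
    "sampleA_L001_R2.fastq.gz",
    "sampleA_L002_R1.fastq.gz",
    "sampleA_L002_R2.fastq.gz",
]


def pair_fastq(files):
    """Group files that differ only in their _R1/_R2 tag into one row per pair."""
    mates = {}
    for name in sorted(files):
        m = re.search(r"_R([12])(?=\.fastq)", name)
        if m is None:
            raise ValueError(f"no _R1/_R2 tag in {name}")
        key = name[:m.start()] + name[m.end():]  # file name with the mate tag removed
        mates.setdefault(key, {})[m.group(1)] = name
    rows = []
    for key, pair in mates.items():
        if set(pair) != {"1", "2"}:
            raise ValueError(f"incomplete pair for {key}")
        rows.append({"data_file1": pair["1"], "data_file1_type": "FASTQ",  # type value assumed
                     "data_file2": pair["2"], "data_file2_type": "FASTQ"})
    return pd.DataFrame(rows)


pairs = pair_fastq(fastq_files)
print(pairs)
```

Raising on an incomplete pair is a deliberate choice here, so that a missing mate file is noticed before a manifest is written.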

## Creating manifest files

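The manifests are first built for a single subject, MSSM_118, before the remaining subjects (such as MSSM_106 and PITT_118) are processed.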

In [44]:
sel_subj = "MSSM_118"
target_dir = "/projects/bsm/attila/results/2019-02-19-upload-to-ndar"
btb, gsubj, gsam = cmc.make_manifests(sel_subj, syn, target_dir=target_dir)
btb

Unnamed: 0,subjectkey,src_subject_id,interview_age,interview_date,gender,race,ethnic_group,grade_highed,disorder,cdeathoff,...,mskelzyn,respzyn,scordzyn,urinzyn,otherzyn,systxyn,mcomments,frozentissue,fixedbrain,adi_r_score
12,NDAR_INV42DGPMAB,CMC_MSSM_118,648,2018-04-13,M,White,,,,,...,,,,,,,,,,
13,NDAR_INV42DGPMAB,CMC_MSSM_118,648,2018-04-13,M,White,,,,,...,,,,,,,,,,


In [45]:
gsubj

Unnamed: 0,subjectkey,src_subject_id,interview_date,interview_age,gender,race,ethnic_group,phenotype,phenotype_description,twins_study,...,sample_taken,sample_id_original,sample_description,biorepository,patient_id_biorepository,sample_id_biorepository,cell_id_original,cell_id_biorepository,adi_dx,ados_dx
14,NDAR_INV42DGPMAB,CMC_MSSM_118,2018-04-13,648,M,White,,schizophrenia,No,No,...,Yes,MSSM_118.DLPFC_1236.np1,brain,MSBB,CMC_MSSM_118,CMC_MSSM_118,,,,


In [46]:
gsam

Unnamed: 0,subjectkey,experiment_id,src_subject_id,interview_age,interview_date,sample_description,sample_id_original,organism,sample_amount,sample_unit,...,patient_id_biorepository,sample_id_biorepository,cell_id_original,cell_id_biorepository,comments_misc,site,rat280,rat230,gqn,seq_batch
0,NDAR_INV42DGPMAB,1223,CMC_MSSM_118,648,2018-04-13,brain,MSSM_118.DLPFC_1236.np1,human,104.0,ng,...,CMC_MSSM_118,MSSM_DNA_PFC_1236,,,,,,,,
0,NDAR_INV42DGPMAB,1223,CMC_MSSM_118,648,2018-04-13,brain,MSSM_118.DLPFC_1236.np1,human,104.0,ng,...,CMC_MSSM_118,MSSM_DNA_PFC_1236,,,,,,,,
0,NDAR_INV42DGPMAB,1223,CMC_MSSM_118,648,2018-04-13,brain,MSSM_118.DLPFC_1236.np1,human,104.0,ng,...,CMC_MSSM_118,MSSM_DNA_PFC_1236,,,,,,,,
0,NDAR_INV42DGPMAB,1223,CMC_MSSM_118,648,2018-04-13,brain,MSSM_118.DLPFC_1236.np1,human,104.0,ng,...,CMC_MSSM_118,MSSM_DNA_PFC_1236,,,,,,,,
0,NDAR_INV42DGPMAB,1223,CMC_MSSM_118,648,2018-04-13,brain,MSSM_118.DLPFC_1236.np1,human,104.0,ng,...,CMC_MSSM_118,MSSM_DNA_PFC_1236,,,,,,,,
0,NDAR_INV42DGPMAB,1223,CMC_MSSM_118,648,2018-04-13,brain,MSSM_118.DLPFC_1236.np1,human,104.0,ng,...,CMC_MSSM_118,MSSM_DNA_PFC_1236,,,,,,,,
0,NDAR_INV42DGPMAB,1223,CMC_MSSM_118,648,2018-04-13,brain,MSSM_118.DLPFC_1236.np1,human,104.0,ng,...,CMC_MSSM_118,MSSM_DNA_PFC_1236,,,,,,,,
0,NDAR_INV42DGPMAB,1223,CMC_MSSM_118,648,2018-04-13,brain,MSSM_118.DLPFC_1236.np1,human,104.0,ng,...,CMC_MSSM_118,MSSM_DNA_PFC_1236,,,,,,,,
0,NDAR_INV42DGPMAB,1223,CMC_MSSM_118,648,2018-04-13,brain,MSSM_118.DLPFC_1236.np1,human,104.0,ng,...,CMC_MSSM_118,MSSM_DNA_PFC_1236,,,,,,,,
0,NDAR_INV42DGPMAB,1223,CMC_MSSM_118,648,2018-04-13,brain,MSSM_118.DLPFC_1236.np1,human,104.0,ng,...,CMC_MSSM_118,MSSM_DNA_PFC_1236,,,,,,,,


In [178]:
import re

with open('/projects/bsm/attila/results/2018-09-12-sequenced-individuals/sequenced-individuals', 'r') as f:
    subjects = f.readlines()

# Each line starts with "CMC_<subject ID><TAB><Control|SCZ>"; group 1 is the subject ID, group 3 the diagnosis
pattern = r'^CMC_((MSSM|PITT)_[0-9]{3})\t(Control|SCZ).*$\n'
subjects = {re.sub(pattern, r'\1', i): re.sub(pattern, r'\3', i) for i in subjects}
print(subjects)

{'MSSM_056': 'Control', 'MSSM_106': 'Control', 'MSSM_109': 'Control', 'MSSM_118': 'SCZ', 'MSSM_175': 'Control', 'MSSM_179': 'Control', 'MSSM_183': 'Control', 'MSSM_215': 'Control', 'MSSM_295': 'SCZ', 'MSSM_304': 'SCZ', 'MSSM_331': 'SCZ', 'MSSM_369': 'Control', 'MSSM_373': 'SCZ', 'MSSM_391': 'Control', 'PITT_010': 'Control', 'PITT_064': 'Control', 'PITT_091': 'SCZ', 'PITT_118': 'SCZ'}
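
Note that `re.sub` returns a line unchanged when the pattern does not match, so an unexpected line would end up in the dictionary as its own key and value. A variant based on `re.match` can skip such lines instead. The sketch below uses two records taken from the printed dictionary plus one made-up non-matching line; the exact layout of the input file beyond the leading two fields is an assumption.

```python
import re

# Two records consistent with the dictionary above, plus one line that does not match
lines = ["CMC_MSSM_056\tControl\n", "CMC_PITT_118\tSCZ\n", "# comment\n"]

# Group 1: subject ID without the CMC_ prefix; group 2: diagnosis
pattern = re.compile(r"^CMC_((?:MSSM|PITT)_[0-9]{3})\t(Control|SCZ)")

subjects = {}
for line in lines:
    m = pattern.match(line)
    if m is None:
        continue  # skip lines that are not subject records
    subjects[m.group(1)] = m.group(2)

print(subjects)
```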


In [179]:
manifests = {s: cmc.make_manifests(s, syn, target_dir=target_dir) for s in subjects.keys()}

In [214]:
%%bash
seqind=/projects/bsm/attila/results/2018-09-12-sequenced-individuals/sequenced-individuals
while read subject Dx; do ./submit.sh $subject $Dx; done < $seqind


Validating files...
Validation report output to: /home/attila/NDAValidationResults/validation_results_20190304T155013.csv


All files have finished validating.

The following files passed validation:
UUID 4b6626dc-fa68-4a33-9af1-23b25e7c679f: /projects/bsm/attila/results/2019-02-19-upload-to-ndar//CMC_MSSM_056-genomics_sample03_U01MH106891_Chess.csv
UUID 34de45a0-60d3-4144-8481-12f80a2aca7b: /projects/bsm/attila/results/2019-02-19-upload-to-ndar//CMC_MSSM_056-genomics_subject02_U01MH106891_Chess.csv
UUID 4817ba33-d385-4ce9-95a4-14a3600e0ee8: /projects/bsm/attila/results/2019-02-19-upload-to-ndar//CMC_MSSM_056-nichd_btb02_U01MH106891_Chess.csv

Validating files...
Validation report output to: /home/attila/NDAValidationResults/validation_results_20190304T155015.csv


All files have finished validating.

The following files passed validation:
UUID 4142bfd9-1748-4645-8854-f2e177ca04dd: /projects/bsm/attila/results/2019-02-19-upload-to-ndar//CMC_MSSM_106-genomics_subject02_U01MH106891_Ches
