Here I extend the cmc_submit2ndar.py script with functionality to upload data from an AWS S3 bucket.  This work begins with a look at metadata from the CommonMind Consortium (CMC) and at the format of the manifests required by NDA.  Then I introduce new functions in the cmc_submit2ndar Python module that produce new manifests for the data in the s3://chesslab-bsmn bucket.  Finally, I validate these manifests with NDA's validation tool.

In [1]:
%load_ext autoreload
%autoreload 2
%reload_ext autoreload
import synapseclient
import cmc_submit2ndar as s2n
import pandas as pd
import numpy as np
import re

## Reviewing metadata and standards
### Getting metadata

In [2]:
syn = synapseclient.login()

Welcome, Attila Jones!



NDA manifest template files

In [3]:
wdir = '~/projects/bsm/results/2020-04-22-upload-to-ndar-from-s3/'
gsub, gsub_syn = s2n.get_manifest(s2n.manifest_template_synids['genomics_subject02'], syn, download_dir=wdir)
btb, btb_syn = s2n.get_manifest(s2n.manifest_template_synids['nichd_btb02'], syn, download_dir=wdir)
gsam, gsam_syn = s2n.get_manifest(s2n.manifest_template_synids['genomics_sample03'], syn, download_dir=wdir)

CMC files

In [4]:
# CMC_Human_WGS_metadata_working.csv
#syn.get('syn17021773')
# CMC_Human_clinical_metadata.csv
cmc_clinical_syn = syn.get('syn2279441', downloadLocation=wdir, ifcollision='overwrite.local')
cmc_clinical = pd.read_csv(cmc_clinical_syn.path, index_col='Individual ID')
# CMC_Human_brainRegion_metadata.csv
cmc_brainreg_syn = syn.get('syn21446693', downloadLocation=wdir, ifcollision='overwrite.local')
cmc_brainreg = pd.read_csv(cmc_brainreg_syn.path)
# CMC_Human_isolation_metadata_DNA.csv
cmc_dnaisol_syn = syn.get('syn2279444', downloadLocation=wdir, ifcollision='overwrite.local')
cmc_dnaisol = pd.read_csv(cmc_dnaisol_syn.path, index_col='Institution Dissection ID')

This sheet was created by Chaggai.  It had a few missing entries in the `PFC #` column, which I filled in manually with the matching `Institution Dissection ID` from `CMC_Human_brainRegion_metadata.csv`.

In [5]:
genewiz_serialn_syn = syn.get('syn21982509', downloadLocation=wdir, ifcollision='overwrite.local')
genewiz_serialn = pd.read_csv(genewiz_serialn_syn.path, index_col='CMC_simple_id')
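
The manual fill-in described above could also be scripted.  The following is a minimal sketch on toy data, not the procedure actually used.  It assumes that the index of the sheet holds the same identifiers as the `Individual ID` column of the brain region metadata and that each individual has a single dissection.  The helper `fill_missing_pfc`, the frames `sheet` and `brainreg`, and all ID values are hypothetical; only the column names come from the files above.

```python
import pandas as pd

def fill_missing_pfc(sheet, brainreg, col='PFC #'):
    """Fill missing entries of `col` with Institution Dissection IDs.

    Assumption (hypothetical): the index of `sheet` matches the
    'Individual ID' column of `brainreg`, one dissection per individual.
    """
    lookup = (brainreg.drop_duplicates('Individual ID')
              .set_index('Individual ID')['Institution Dissection ID'])
    filled = sheet.copy()
    # Map each index label to its dissection ID, then fill only the gaps
    candidates = filled.index.to_series().map(lookup)
    filled[col] = filled[col].fillna(candidates)
    return filled

# Toy data, not taken from the real files
sheet = pd.DataFrame({'PFC #': ['DIS_1', None]}, index=['IND_A', 'IND_B'])
brainreg = pd.DataFrame({'Individual ID': ['IND_A', 'IND_B'],
                         'Institution Dissection ID': ['DIS_1', 'DIS_2']})
filled = fill_missing_pfc(sheet, brainreg)
```

Existing entries are left untouched; only missing values are taken from the lookup.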

### Inspecting metadata

In [6]:
cmc_clinical.columns

Index(['Individual Notes', 'Institution', 'Brain ID', 'SCZ Pair', 'BP Pair',
       'Changed (used Affy phenotype)', 'Reported Gender', 'Sex', 'Ethnicity',
       'Race', 'Genotype Inferred Ancestry', 'ageOfDeath', 'Date of Death',
       'Time of Death', 'Time of Death (Military)', 'Autopsy ID',
       'Brain Weight (in grams)', 'PMI (in hours)', 'pH', 'Dx',
       'primaryDiagnosisDetail', 'Presence or Absence of Dementia (Y/N)',
       'CDR', 'Year of Autopsy', 'Neuropath', 'Neuropath desc',
       'Gross Diagnosis', 'Benzodiazepines', 'Anticonvulsants', 'AntipsychTyp',
       'AntipsychAtyp', 'Antidepress', 'Lithium', 'Tobacco', 'Tobacco (Past)',
       'Alcohol', 'Illicitsub', 'causeOfDeath', 'DescDeath', 'Hyperten',
       'DiabetesInsDep', 'DiabetesNonInsDep', 'ECT', 'Seizures', 'Braak Stage',
       'H/O Head Inj.', 'H/O COPD', 'H/O Stroke', 'H/O PD.AD.LBD.Pick',
       'Cardiovascular Disease', 'Lobotomy', 'BMI (Traditional)',
       'BMI (cm/kg)', 'Height (cm)', 'Weight (kg)'],
      dtype='object')

In [7]:
cmc_brainreg.columns

Index(['Individual Notes', 'Individual ID', 'Institution Dissection ID',
       'Institution Source ID', 'Brodmann Area', 'Hemisphere',
       'Tissue Amount (grams)', 'Operator', 'Date Dissected', 'Brain Region'],
      dtype='object')

In [8]:
cmc_dnaisol.columns

Index(['Sample DNA ID', 'Initial Tissue State', 'DNA Prep Date',
       'DNA Prep Operator', 'Dneasy Kit ID#', 'Total DNA (ug)', '260/280',
       '260/230', 'GQN', 'Brain Region', 'Cell Type', 'Nuclei Frozen',
       'Number of Nuclei'],
      dtype='object')

In [9]:
pd.set_option('display.max_columns', None)
gsub.iloc[0]
gsub

Unnamed: 0,subjectkey,src_subject_id,interview_date,interview_age,gender,race,ethnic_group,phenotype,phenotype_description,twins_study,sibling_study,family_study,family_user_def_id,subjectkey_mother,subjectkey_father,subjectkey_sibling1,sibling_type1,subjectkey_sibling2,sibling_type2,subjectkey_sibling3,sibling_type3,subjectkey_sibling4,sibling_type4,zygosity,sample_taken,sample_id_original,sample_description,biorepository,patient_id_biorepository,sample_id_biorepository,cell_id_original,cell_id_biorepository,adi_dx,ados_dx
0,NDAR_INV0971H4H4,CMC_MSSM_033,4/13/18,972,F,African American,,control,No,No,No,,,,,,,,,,,,,,Yes,MSSM_033.DLPFC_1355.np1,PFC,MSBB,,,,,,
1,NDAR_INV0UA2YLF3,CMC_MSSM_046,4/13/18,1080,F,White,,control,No,No,No,,,,,,,,,,,,,,Yes,MSSM_046.DLPFC_1339.np1,PFC,MSBB,,,,,,
2,NDAR_INV1VPUF5CL,CMC_MSSM_056,4/13/18,804,F,White,,control,No,No,No,,,,,,,,,,,,,,Yes,MSSM_056.DLPFC_1181.np1,PFC,MSBB,,,,,,
3,NDAR_INV2459CJE1,CMC_MSSM_061,4/13/18,816,M,White,,control,No,No,No,,,,,,,,,,,,,,Yes,MSSM_061.DLPFC_1188.np1,PFC,MSBB,,,,,,
4,NDAR_INV27XJ4YKX,CMC_MSSM_065,4/13/18,1080,F,White,,control,No,No,No,,,,,,,,,,,,,,Yes,MSSM_065.DLPFC_1334.np1,PFC,MSBB,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88,NDAR_INVYJJTJCR3,CMC_PITT_118,4/13/18,852,F,White,,schizophrenia,No,No,No,,,,,,,,,,,,,,Yes,PITT_118.DRPC917.np1,PFC,UPittNBB,,,,,,
89,NDAR_INVYV5TNUZA,CMC_PITT_123,4/13/18,984,M,White,,control,No,No,No,,,,,,,,,,,,,,Yes,PITT_123.DRPC988.np1,PFC,UPittNBB,,,,,,
90,NDAR_INVBP413PJE,CMC_MSSM_168,12/12/19,876,M,White,,schizophrenia,No,No,No,,,,,,,,,,,,,,Yes,MSSM_168.DLPFC_1279.np1,PFC,MSBB,,,,,,
91,NDAR_INVUB953NGH,CMC_MSSM_327,12/12/19,972,F,African American,,schizophrenia,No,No,No,,,,,,,,,,,,,,Yes,MSSM_327.DLPFC_1350.np1,PFC,MSBB,,,,,,


In [10]:
btb.iloc[0]
btb

Unnamed: 0,subjectkey,src_subject_id,interview_age,interview_date,gender,race,ethnic_group,grade_highed,disorder,cdeathoff,death027,pminterval,ph,sample_id_original,celltype,br_reg,rindlpfc,agedays,bmi,historyrec,surgoraut,ageyears,adi_r,hbsag,hiv,mravail,npavail,adiravail,rincortex,rincbell,s201,s203,s205,s207,s209,s211,s213,s215,s217,s219,s221,s223,s225,s227,s229,s231,s232,s234,s236,s238,s250,s240,s242,s244,s246,s248,s189,s190,s191,s187,s188,s166,s196,s197,s101,s102,s103,s104,s105,s106,s107,s108,s109,s110,s111,s112,s113,s114,s115,s116,s117,s118,s119,s120,s121,s122,s123,s124,s125,s126,s127,s128,s129,s130,s131,s132,s133,s134,s135,s136,s137,s138,s139,s140,s141,s142,s143,s144,s145,s146,s147,s148,s149,s150,s151,s152,s153,s154,s155,s156,s157,s158,s159,s160,s161,s162,s163,s164,s165,s202,s204,s206,s208,s210,s212,s214,s216,s218,s220,s222,s224,s226,s228,s230,s233,s235,s237,s239,s251,s241,s243,s245,s247,s249,s89,s90,s87,s88,s66,s96,s97,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21,s22,s23,s24,s25,s26,s27,s28,s29,s30,s31,s32,s33,s34,s35,s36,s37,s38,s39,s40,s41,s42,s43,s44,s45,s46,s47,s48,s49,s50,s51,s52,s53,s54,s55,s56,s57,s58,s59,s60,s62,s64,brainzyn,brainxyn,cardzyn,endoczyn,gastrzyn,genitzyn,hematzyn,integzyn,mskelzyn,respzyn,scordzyn,urinzyn,otherzyn,systxyn,mcomments,frozentissue,fixedbrain,adi_r_score
0,NDAR_INVDVXZZ5G0,CMC_MSSM_295,744,4/13/18,M,White,,,,,,,,MSSM_295.DLPFC_1178.np1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,NDAR_INVDVXZZ5G0,CMC_MSSM_295,744,4/13/18,M,White,,,,,,,,MSSM_295.TMPR_69114.mu1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,NDAR_INVY3TCVYKD,CMC_PITT_101,504,4/13/18,M,White,,,,,,,,PITT_101.DRPC700.np1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,NDAR_INVEUUEDMKH,CMC_MSSM_304,912,4/13/18,M,White,,,,,,,,MSSM_304.DLPFC_1163.np1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,NDAR_INVEUUEDMKH,CMC_MSSM_304,912,4/13/18,M,White,,,,,,,,MSSM_304.TMPR_69091.mu1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
211,NDAR_INV7V1JWUWT,CMC_MSSM_193,840,4/13/18,M,White,,,,,,,,MSSM_193.DLPFC_1164.np1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
212,NDAR_INVV02H1WYK,CMC_PITT_048,612,4/13/18,F,White,,,,,,,,PITT_048.DRPC1391.np1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
213,NDAR_INVBP413PJE,CMC_MSSM_168,876,12/12/19,M,White,,,,,,,,MSSM_168.DLPFC_1279.np1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
214,NDAR_INVUB953NGH,CMC_MSSM_327,972,12/12/19,F,African American,,,,,,,,MSSM_327.DLPFC_1350.np1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


## Producing new manifests

In [11]:
wdir = '/home/attila/projects/bsm/results/2020-04-22-upload-to-ndar-from-s3/'
gsub_s3, btb_s3, gsam_s3 = s2n.make_manif_s3(wdir)

Welcome, Attila Jones!

nichd_btb02 written to /home/attila/projects/bsm/results/2020-04-22-upload-to-ndar-from-s3/2020-06-08-nichd_btb02.csv
genomics_subject02 written to /home/attila/projects/bsm/results/2020-04-22-upload-to-ndar-from-s3/2020-06-08-genomics_subject02.csv
genomics_sample03 written to /home/attila/projects/bsm/results/2020-04-22-upload-to-ndar-from-s3/2020-06-08-genomics_sample03.csv


### Making sample_list for bsmn-pipeline

In [12]:
def gsam2sample_list(data_file='data_file1'):
    # Take the data file column twice: once for file_name, once for location
    df = gsam_s3.loc[:, ['src_subject_id', data_file, data_file]].copy()
    df.columns = ['#sample_id', 'file_name', 'location']
    # CMC_<id> becomes <id>_NeuN_pl
    df['#sample_id'] = [re.sub(r'^CMC_(.+)$', r'\1_NeuN_pl', y) for y in df['#sample_id']]
    df['location'] = ['s3://chesslab-bsmn/' + y for y in df['location']]
    # Strip the directory part, keeping only the base name of the file
    df['file_name'] = [re.sub(r'^.*/', '', y) for y in df['file_name']]
    return df

sample_list = pd.concat([gsam2sample_list(y) for y in ['data_file1', 'data_file2']])
sample_list = sample_list.sort_values(by=['#sample_id', 'file_name'])
slist_path = '/big/results/bsm/2020-04-22-upload-to-ndar-from-s3/sample_list'
sample_list.to_csv(slist_path, sep='\t', header=True, index=False)
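
To make the two regular expressions in `gsam2sample_list` concrete, the sketch below applies the same transformations to a single made-up manifest row.  The subject ID and the S3 key are invented for illustration; only the bucket name and the patterns come from the function above.

```python
import re
import pandas as pd

# One made-up row standing in for gsam_s3 (values are hypothetical)
toy = pd.DataFrame({'src_subject_id': ['CMC_XXXX_001'],
                    'data_file1': ['some/prefix/sample_R1.fastq.gz']})

# Same transformations as in gsam2sample_list
sample_id = [re.sub(r'^CMC_(.+)$', r'\1_NeuN_pl', y) for y in toy['src_subject_id']]
location = ['s3://chesslab-bsmn/' + y for y in toy['data_file1']]
file_name = [re.sub(r'^.*/', '', y) for y in toy['data_file1']]

toy_list = pd.DataFrame({'#sample_id': sample_id,
                         'file_name': file_name,
                         'location': location})
print(toy_list.to_csv(sep='\t', index=False))
```

The `#sample_id` drops the `CMC_` prefix and gains the `_NeuN_pl` suffix, `file_name` keeps only the base name, and `location` is the full S3 URI.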

## Submission
### Validating the manifests

This is validation without submission (i.e., without building the submission package).

In [13]:
%%bash
cd ~/projects/bsm/results/2020-04-22-upload-to-ndar-from-s3
validate="vtcmd -t title -d description -u $NDA_USER -p $NDA_PASSWORD -c 2965 -ak $AWS_ACCESSKEY -sk $AWS_SECRETACCESSKEY -w -s3 chesslab-bsmn -pre GENEWIZ/30-317737003"
manifests="$(date +%Y-%m-%d)*.csv"
$validate $manifests

Running NDATools Version 0.2.0
Opening log: /home/attila/NDAValidationResults/debug_log_20200608T132528.txt

Validating files...
Validation report output to: /home/attila/NDAValidationResults/validation_results_20200608T132528.csv

All files have finished validating.

The following files passed validation:
UUID 8b6bc5bd-fc81-4200-b7c2-95d5beebc087: 2020-06-08-genomics_subject02.csv
UUID ee76d27d-49b6-49da-b8bc-5e15a2b56b75: 2020-06-08-nichd_btb02.csv
UUID 25a67ad5-b2e8-488e-8f73-0bdae8f5b168: 2020-06-08-genomics_sample03.csv


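
The `validate` string in the bash cell relies on unquoted variable expansion and on the shell to expand the date-stamped glob.  A hedged alternative is to assemble the same command as an argument list in Python.  The flags below are copied verbatim from the bash cell; the helpers `vtcmd_args` and `todays_manifests` and their parameters are hypothetical, and the command is only built here, not run.

```python
import datetime
import glob
import os

def vtcmd_args(manifests, build_package=False, env=os.environ):
    """Return the vtcmd argument list (flags copied from the bash cell)."""
    args = ['vtcmd', '-t', 'title', '-d', 'description',
            '-u', env.get('NDA_USER', ''), '-p', env.get('NDA_PASSWORD', ''),
            '-c', '2965',
            '-ak', env.get('AWS_ACCESSKEY', ''),
            '-sk', env.get('AWS_SECRETACCESSKEY', ''),
            '-w', '-s3', 'chesslab-bsmn', '-pre', 'GENEWIZ/30-317737003']
    if build_package:
        args.append('-b')  # the flag that builds the submission package
    return args + list(manifests)

def todays_manifests(directory):
    """List the manifests in `directory` whose names start with today's date."""
    stamp = datetime.date.today().strftime('%Y-%m-%d')
    pattern = os.path.join(os.path.expanduser(directory), stamp + '*.csv')
    return sorted(glob.glob(pattern))

# Example with a placeholder environment instead of real credentials
args = vtcmd_args(['2020-06-08-genomics_subject02.csv'], env={'NDA_USER': 'user'})
```

Passing the resulting list to `subprocess.run(args, check=True)` would run the tool without going through the shell, so credentials containing spaces or special characters would not be split or reinterpreted.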


### Building the submission package

Now let's add the `--buildPackage` flag (`-b` for short) to build the submission package!

In [14]:
%%bash
if false; then
cd ~/projects/bsm/results/2020-04-22-upload-to-ndar-from-s3
validate="vtcmd -t title -d description -u $NDA_USER -p $NDA_PASSWORD -c 2965 -ak $AWS_ACCESSKEY -sk $AWS_SECRETACCESSKEY -w -s3 chesslab-bsmn -pre GENEWIZ/30-317737003"
manifests="$(date +%Y-%m-%d)*.csv"
$validate -b $manifests
fi

### Resubmitting on Ada

The original submission from `attila-ThinkPad` was interrupted, canceled, and restarted on `Ada`.  A day later the NDA credentials expired and the submission had to be resumed with the `vtcmd -r` command **including** the NDA credentials like this:
```
vtcmd -r 32164 -ak $AWS_ACCESSKEY -sk $AWS_SECRETACCESSKEY -s3 chesslab-bsmn -pre GENEWIZ/30-317737003
```

