# Downloading the ABCD imaging data

In the previous data exercises you downloaded and interacted with the [ABCD 3.0 release](https://nda.nih.gov/abcd/query/abcd-curated-annual-release-3.0.html). While there are many measures derived from the imaging data within the pre-packaged tabulated data, you may have noticed that the full set of MRI images is not included in this release.

As stated on [NDA's website](https://nda.nih.gov/abcd/query/abcd-curated-annual-release-3.0.html): 

"The raw MRI images and the minimally processed imaging files are over 100TB in size which may make data transfer difficult."

The data are stored on [Amazon Simple Storage Service (s3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) servers. 

There are multiple routes to obtaining the full imaging data. We'll focus on the following two:
1. Using links from the [fmriresults01](https://nda.nih.gov/data_structure.html?short_name=fmriresults01) structure
2. Using the [nda-abcd-s3-downloader](https://github.com/DCAN-Labs/nda-abcd-s3-downloader)

Both routes involve creating a data package through the NDA, downloading a manifest file, parsing the manifest file, and finally downloading the data.

For brevity, the exercises in this notebook will guide you through downloading the resting state and T1w data from 5 subjects using each of the above download methods. You will need active NDA credentials and an ABCD DUC to download the data.

**A Note about GUIDs and BIDS**

[From the NDA](https://nda.nih.gov/s/guid/nda-guid.html): "The Global Unique Identifier (GUID) is a subject ID allowing researchers to share data specific to a study participant without exposing personally identifiable information (PII) and match participants across labs and research data repositories."

The GUID's format is `NDAR_INVXXXXXXXX`, where `XXXXXXXX` is a random string of numbers and uppercase letters. The standard GUID format is *not* [BIDS compatible](https://bids-specification.readthedocs.io/en/stable/02-common-principles.html#file-name-structure). In BIDS, the underscore character is reserved to separate key:value entities (eg, `key1-value1_key2-value2`, `sub-01_task-rest`). For the BIDS imaging data on the NDA, the underscore in the GUID has been removed (ie, `NDARINVXXXXXXXX`), but be aware that you might need to do a string replace operation to remove the underscore from the GUIDs in the tabulated data to match the GUIDs in the BIDS imaging data.
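The string replacement mentioned above can be sketched as follows. This is a minimal illustration, not part of the NDA tooling: the helper names `to_bids_guid` and `to_bids_subject` are made up for this example, and the two GUIDs are taken from the `fmriresults01.txt` preview shown later in this notebook.

```python
def to_bids_guid(guid: str) -> str:
    """Drop the underscore from a tabulated-data GUID (NDAR_INV... -> NDARINV...)."""
    return guid.replace('_', '')


def to_bids_subject(guid: str) -> str:
    """Build the BIDS subject label (sub-NDARINV...) from a tabulated-data GUID."""
    return 'sub-' + to_bids_guid(guid)


# GUIDs as they appear in the tabulated data
tabulated_guids = ['NDAR_INVZVW3HKWN', 'NDAR_INVXEYPED0K']

bids_guids = [to_bids_guid(g) for g in tabulated_guids]
bids_subjects = [to_bids_subject(g) for g in tabulated_guids]
print(bids_guids)
print(bids_subjects)
```

If the GUIDs live in a pandas column, the equivalent operation would be something like `df['subjectkey'].str.replace('_', '', regex=False)`.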

***

## Downloading the data using the fmriresults01 structure

The general workflow on the NDA is to add data to your Filter Cart and then create a Data Package from the filter. Here we will create a Data Package from the *fmriresults01* data structure. See Getting Image Volumes [here](https://nda.nih.gov/abcd/query/abcd-release-faqs.html) for more info on the *fmriresults01* structure.

**NOTE**: The `fmriresults01.txt` file is distributed in the ABCD 3.0 Release. If you have already downloaded that release, you can use its copy of the file and skip to step 12 (adjusting the file path to wherever your copy lives).

1. Let's begin at the [NDA's front page](https://nda.nih.gov/). Select **Get Data** > **Get Data**

<img src="./screenshots/nda_frontpage.png" width="900" />

***

2. On the **NDA Query Tool**'s menu, select **Data Structures**. Then enter "fmriresults01" into the Text Search field and hit enter.

<img src="./screenshots/nda_query.png" width="900" />

***

3. Click the **Processed MRI Data** link, which will open the structure. Then select **Add to Filter Cart** in the lower left corner.

<img src="./screenshots/add_filter_cart.png" width="150" />

Your Filter Cart will take a few minutes to update. Make yourself some tea. Once it is finished, you should see the following.

<img src="./screenshots/filter_cart.png" width="400" />

(Sample size may vary depending on when you are working through this exercise)

***

4. In the Filter Cart, select **Create Data Package/Add Data to Study**, which will take you to the Data Packaging Page.

5. On the Data Packaging Page, select **Create Data Package**.

<img src="./screenshots/create_data_package.png" width="200" />

6. If you are not logged into the NDA, this will prompt you to log in with your credentials. Afterward, you will see a menu to define your Data Package. Give it a short name and ensure that **Include Associated Data Files** is *unchecked*. Otherwise, the Data Package will contain all of the images in *fmriresults01*. It will be faster and more flexible to download only the pointers to the data and not the data itself. When you are finished entering this information, click **Create Data Package**.

<img src="./screenshots/create_menu.png" width="300" />

***

7. You will see a window that confirms that your package was initiated. Click the link to navigate to your Dashboard.

<img src="./screenshots/package_created.png" width="350" />

***

8. In the drop down menu on the Data Package Dashboard, select **My Data Packages**. You should see the Data Package you just created. It will take a few minutes to move from the "Creating Package" status to "Ready to Download". Maybe refill your tea. In the below image **ABCDndar** is the Data Package we just created. **ABCDdcan** will be created in the second section of this exercise.

<img src="./screenshots/create_dash.png" width="350" />

<img src="./screenshots/ready_dash.png" width="350" />

***

9. Once the Data Package is ready to download, we can use the [NDA tools](https://github.com/NDAR/nda-tools) to download it. The NDA tools are already installed and ready to use on the ABCD-ReproNim JupyterHub. The relevant command will be `downloadcmd`. Let's see what options `downloadcmd` has.

In [2]:
! downloadcmd -h

Running NDATools Version 0.2.3
Opening log: /home/jovyan/NDAValidationResults/debug_log_20210305T010219.txt
usage: downloadcmd <S3_path_list>

This application allows you to enter a list of aws S3 paths and will download
the files to your drive in your home folder. Alternatively, you may enter a
packageID, an NDA data structure file or a text file with s3 links, and the
client will download all files from the S3 links listed. Please note, the
maximum transfer limit of data is 5TB at one time.

positional arguments:
  <S3_path_list>        Will download all S3 files to your local drive

optional arguments:
  -h, --help            show this help message and exit
  -dp, --package        Flags to download all S3 files in package.
  -t, --txt             Flags that a text file has been entered from where to
                        download S3 files.
  -ds, --datastructure  Flags that a data structure text file has been entered
                        from where to download S3 files.
  -u <a

Recall that the `!` in the code cell of a Jupyter notebook means to execute that command using shell.

***

10. Our first usage of `downloadcmd` will use the Package ID to download the associated package files. Let's put the ABCDndar package into its own directory. If you have already set up your NDA credentials to download the ABCD 3.0 Release, then `downloadcmd` will use the already stored credentials.

In [3]:
! mkdir /home/jovyan/ABCDndar3
! downloadcmd 1186291 -dp -d /home/jovyan/ABCDndar3 # number is the data package ID from ndar

mkdir: cannot create directory ‘/home/jovyan/ABCDndar3’: File exists
Running NDATools Version 0.2.3
Opening log: /home/jovyan/NDAValidationResults/debug_log_20210305T010225.txt


***

11. Once the download is complete, we can list the files. The relevant file is `fmriresults01.txt`, which contains information about each image in this structure.

In [4]:
! ls /home/jovyan/ABCDndar3

experiments	   guid_pseudoguid.txt	README.pdf
fmriresults01.txt  package_info.txt


***

12. `fmriresults01.txt` is a tab-separated table that contains information about the corresponding image files. Let's read this table into Python so that we can parse it and choose only the image files we want.

In [5]:
import pandas as pd
fmri = pd.read_csv('/home/jovyan/ABCDndar3/fmriresults01.txt', sep='\t', low_memory=False)

*** 

13. Let's look at the structure and contents of the `fmriresults01.txt`. 

In [6]:
fmri.head()

Unnamed: 0,collection_id,fmriresults01_id,dataset_id,subjectkey,src_subject_id,origin_dataset_id,interview_date,interview_age,sex,experiment_id,...,qc_outcome,derived_files,scan_type,img03_id2,file_source2,session_det,image_history,manifest,image_description,collection_title
0,collection_id,fmriresults01_id,dataset_id,The NDAR Global Unique Identifier (GUID) for r...,Subject ID how it's defined in lab/project,Origin dataset Id,Date on which the interview/genetic test/sampl...,Age in months at the time of the interview/tes...,Sex of the subject,ID for the Experiment/settings/run,...,Provide information on the conclusion of the q...,An archive of the files produced by the pipeline,Type of Scan,Corresponds to row_id in image03 data structur...,"File name/location, 2",session details,"Image history,f.e. transformations steps and o...",,"Image description, i.e. DTI, fMRI, Fast SPGR, ...",collection_title
1,2573,1140322,34648,NDAR_INVZVW3HKWN,NDAR_INVZVW3HKWN,17077,02/03/2017,131,M,650,...,pass,s3://NDAR_Central_4/submission_32739/NDARINVZV...,fMRI,,,ABCD-MPROC-SST,"motion correction, B0 inhomogeneity correction...",,,Adolescent Brain Cognitive Development Study (...
2,2573,1140323,34648,NDAR_INVXEYPED0K,NDAR_INVXEYPED0K,17049,11/29/2017,126,M,649,...,pass,s3://NDAR_Central_4/submission_32739/NDARINVXE...,fMRI,,,ABCD-MPROC-REST,"motion correction, B0 inhomogeneity correction...",,,Adolescent Brain Cognitive Development Study (...
3,2573,1140324,34648,NDAR_INVZ3638RYW,NDAR_INVZ3638RYW,17077,06/16/2017,125,M,649,...,pass,s3://NDAR_Central_4/submission_32739/NDARINVZ3...,fMRI,,,ABCD-MPROC-REST,"motion correction, B0 inhomogeneity correction...",,,Adolescent Brain Cognitive Development Study (...
4,2573,1140325,34648,NDAR_INVWVFZTJX6,NDAR_INVWVFZTJX6,24209,05/21/2019,147,M,651,...,pass,s3://NDAR_Central_4/submission_32739/NDARINVWV...,fMRI,,,ABCD-MPROC-NBACK,"motion correction, B0 inhomogeneity correction...",,,Adolescent Brain Cognitive Development Study (...


We can see that the first row contains a detailed description of each column. We won't need to include this in our parsing, so we can drop it.

In [7]:
fmri = fmri.drop([0])
fmri.head()

Unnamed: 0,collection_id,fmriresults01_id,dataset_id,subjectkey,src_subject_id,origin_dataset_id,interview_date,interview_age,sex,experiment_id,...,qc_outcome,derived_files,scan_type,img03_id2,file_source2,session_det,image_history,manifest,image_description,collection_title
1,2573,1140322,34648,NDAR_INVZVW3HKWN,NDAR_INVZVW3HKWN,17077,02/03/2017,131,M,650,...,pass,s3://NDAR_Central_4/submission_32739/NDARINVZV...,fMRI,,,ABCD-MPROC-SST,"motion correction, B0 inhomogeneity correction...",,,Adolescent Brain Cognitive Development Study (...
2,2573,1140323,34648,NDAR_INVXEYPED0K,NDAR_INVXEYPED0K,17049,11/29/2017,126,M,649,...,pass,s3://NDAR_Central_4/submission_32739/NDARINVXE...,fMRI,,,ABCD-MPROC-REST,"motion correction, B0 inhomogeneity correction...",,,Adolescent Brain Cognitive Development Study (...
3,2573,1140324,34648,NDAR_INVZ3638RYW,NDAR_INVZ3638RYW,17077,06/16/2017,125,M,649,...,pass,s3://NDAR_Central_4/submission_32739/NDARINVZ3...,fMRI,,,ABCD-MPROC-REST,"motion correction, B0 inhomogeneity correction...",,,Adolescent Brain Cognitive Development Study (...
4,2573,1140325,34648,NDAR_INVWVFZTJX6,NDAR_INVWVFZTJX6,24209,05/21/2019,147,M,651,...,pass,s3://NDAR_Central_4/submission_32739/NDARINVWV...,fMRI,,,ABCD-MPROC-NBACK,"motion correction, B0 inhomogeneity correction...",,,Adolescent Brain Cognitive Development Study (...
5,2573,1140326,34648,NDAR_INVXFURZ24F,NDAR_INVXFURZ24F,17050,01/08/2018,109,M,649,...,pass,s3://NDAR_Central_4/submission_32739/NDARINVXF...,fMRI,,,ABCD-MPROC-REST,"motion correction, B0 inhomogeneity correction...",,,Adolescent Brain Cognitive Development Study (...


***

14. We do not need most of the information in this table. The relevant columns are `file_source`, which contains the s3 links to the raw DICOM images, and `derived_files`, which contains the s3 links to the minimally processed images. Here we will focus on downloading the minimally processed files in `derived_files`. We will create a dataframe that contains the s3 links and other relevant fields so that we can filter the s3 links. Explore the other columns of this table to see the processing steps that have been applied to the `derived_files`.

In [8]:
# create a new data frame from the derived_files column
s3_derv = fmri.loc[:,['derived_files']] 
# view dataframe
s3_derv.head()

Unnamed: 0,derived_files
1,s3://NDAR_Central_4/submission_32739/NDARINVZV...
2,s3://NDAR_Central_4/submission_32739/NDARINVXE...
3,s3://NDAR_Central_4/submission_32739/NDARINVZ3...
4,s3://NDAR_Central_4/submission_32739/NDARINVWV...
5,s3://NDAR_Central_4/submission_32739/NDARINVXF...


Let's look at the format of the s3 links to see how we could parse this for filtering:

*s3://NDAR_Central_4/submission_32739/NDARINVXXXXXXXX_baselineYear1Arm1_ABCD-MPROC-SST-fMRI_XXXXXXXXXXXXXX.tgz*

- *s3://NDAR_Central_4/submission_32739* is the location of the data on the s3 server
- *NDARINVXXXXXXXX* is the GUID
- *baselineYear1Arm1* is the session
- *ABCD-MPROC-SST-fMRI* is the scan type information
- The number at the end of the file name is the acquisition date/time
- *.tgz* is the file extension of a gzip-compressed TAR archive

We can use Python's ability to split strings to parse these links so that we can filter by GUID, session, and scan type. Let's see an example:

In [9]:
example = 's3://NDAR_Central_4/submission_32739/NDARINVXXXXXXXX_baselineYear1Arm1_ABCD-MPROC-SST-fMRI_XXXXXXXXXXXXXX.tgz'
example.split('/')

['s3:',
 '',
 'NDAR_Central_4',
 'submission_32739',
 'NDARINVXXXXXXXX_baselineYear1Arm1_ABCD-MPROC-SST-fMRI_XXXXXXXXXXXXXX.tgz']

The above code splits the `example` string into a list of strings at every occurrence of `/`.

`.split` operates on a single string, but we have an entire column of strings we want to split. Here we can leverage Python's list comprehension to iterate through each string.

For example:

In [10]:
test_split = [i.split('/') for i in s3_derv['derived_files']]
test_split[0:3]

[['s3:',
  '',
  'NDAR_Central_4',
  'submission_32739',
  'NDARINVZVW3HKWN_baselineYear1Arm1_ABCD-MPROC-SST-fMRI_20170203160134.tgz'],
 ['s3:',
  '',
  'NDAR_Central_4',
  'submission_32739',
  'NDARINVXEYPED0K_baselineYear1Arm1_ABCD-MPROC-rsfMRI_20171129140732.tgz'],
 ['s3:',
  '',
  'NDAR_Central_4',
  'submission_32739',
  'NDARINVZ3638RYW_baselineYear1Arm1_ABCD-MPROC-rsfMRI_20170709114733.tgz']]

The above code applies the same split operation to every item in the `s3_derv['derived_files']` column we created above. You could also do this with a regular `for` loop, but a list comprehension is cleaner and typically more efficient.
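To make the comparison concrete, here is a small sketch that runs both versions on two of the links from the output above and confirms they give the same result. The variable names `links`, `loop_split`, and `comp_split` are chosen for this illustration only.

```python
# two links copied from the output above
links = [
    's3://NDAR_Central_4/submission_32739/NDARINVZVW3HKWN_baselineYear1Arm1_ABCD-MPROC-SST-fMRI_20170203160134.tgz',
    's3://NDAR_Central_4/submission_32739/NDARINVXEYPED0K_baselineYear1Arm1_ABCD-MPROC-rsfMRI_20171129140732.tgz',
]

# version 1: a regular for loop that builds the list one item at a time
loop_split = []
for link in links:
    loop_split.append(link.split('/'))

# version 2: the same operation written as a list comprehension
comp_split = [link.split('/') for link in links]

# both approaches produce identical lists
same_result = loop_split == comp_split
print(same_result)
```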

We can leverage string splitting and list comprehension to parse each s3 link into a corresponding GUID, session, and scan type.

In [13]:
s3_derv['guid'] = [i.split('/')[-1].split('_')[0] for i in s3_derv['derived_files']] # get the GUID
s3_derv['session'] = [i.split('/')[-1].split('_')[1] for i in s3_derv['derived_files']] # get the session
s3_derv['scan'] = [i.split('/')[-1].split('_')[2].split('-',1)[-1] for i in s3_derv['derived_files']] # get the scan type

s3_derv

Unnamed: 0,derived_files,guid,session,scan
1,s3://NDAR_Central_4/submission_32739/NDARINVZV...,NDARINVZVW3HKWN,baselineYear1Arm1,MPROC-SST-fMRI
2,s3://NDAR_Central_4/submission_32739/NDARINVXE...,NDARINVXEYPED0K,baselineYear1Arm1,MPROC-rsfMRI
3,s3://NDAR_Central_4/submission_32739/NDARINVZ3...,NDARINVZ3638RYW,baselineYear1Arm1,MPROC-rsfMRI
4,s3://NDAR_Central_4/submission_32739/NDARINVWV...,NDARINVWVFZTJX6,2YearFollowUpYArm1,MPROC-nBack-fMRI
5,s3://NDAR_Central_4/submission_32739/NDARINVXF...,NDARINVXFURZ24F,baselineYear1Arm1,MPROC-rsfMRI
...,...,...,...,...
200204,s3://NDAR_Central_4/submission_32739/NDARINV18...,NDARINV18KYVZ3C,2YearFollowUpYArm1,MPROC-nBack-fMRI
200205,s3://NDAR_Central_4/submission_32739/NDARINV1K...,NDARINV1KZTEZF5,2YearFollowUpYArm1,MPROC-nBack-fMRI
200206,s3://NDAR_Central_4/submission_32739/NDARINV0P...,NDARINV0PKCP1P3,2YearFollowUpYArm1,MPROC-rsfMRI
200207,s3://NDAR_Central_4/submission_32739/NDARINV27...,NDARINV279N3WC9,2YearFollowUpYArm1,MPROC-SST-fMRI


The above list comprehension and string splitting code looks complicated, so let's break down the code for parsing the scan type:

- `[i for i in s3_derv['derived_files']]` loops through every string in `s3_derv['derived_files']`. `i` is the string in the current iteration.
- `i.split('/')[-1]` gives us the last (`[-1]`) item in the list (the filename) once you split the full s3 link by the `/` character.
- The second `.split('_')[2]` splits the filename by `_`. `[2]` chooses the third item in that list (because of 0 indexing). This is `ABCD-MPROC-SST-fMRI` in the above example.
- The third `.split('-',1)[-1]` splits `ABCD-MPROC-SST-fMRI` by `-`, but only at the first occurrence of `-`. `[-1]` means that we grab the last item in that two-item list (`MPROC-SST-fMRI`).
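The same steps can also be collected into one small function, which some readers find easier to follow than three separate list comprehensions. This is only a sketch: `parse_s3_link` is a hypothetical helper written for this notebook, and it assumes every link follows the `GUID_session_scaninfo_datetime.tgz` pattern described above.

```python
def parse_s3_link(link: str) -> dict:
    """Split one s3 link into its GUID, session, and scan type."""
    # the filename is the last piece of the link
    filename = link.split('/')[-1]
    # the first three underscore-separated fields are GUID, session, scan info;
    # this raises ValueError if the filename has fewer than three fields
    guid, session, scan_info = filename.split('_')[:3]
    # drop the leading 'ABCD-' by splitting at the first '-' only
    scan = scan_info.split('-', 1)[-1]
    return {'guid': guid, 'session': session, 'scan': scan}


example = 's3://NDAR_Central_4/submission_32739/NDARINVZVW3HKWN_baselineYear1Arm1_ABCD-MPROC-SST-fMRI_20170203160134.tgz'
parsed = parse_s3_link(example)
print(parsed)
```

A function like this could be applied to the whole column with a list comprehension, for example `[parse_s3_link(i)['scan'] for i in s3_derv['derived_files']]`.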

Now we have a dataframe where we can filter s3 links by GUID, session, and scan type! Let's see what scan types we have:

In [12]:
s3_derv['scan'].value_counts()

MPROC-rsfMRI        62398
MPROC-MID-fMRI      28864
MPROC-nBack-fMRI    28331
MPROC-SST-fMRI      28231
MPROC-DTI           18589
MPROC-T1            17349
MPROC-T2            16446
Name: scan, dtype: int64

***

15. Let's specify our filtering criteria. Choose 5 subject GUIDs (you can choose 5 random GUIDs from your work on the ABCD 3.0 Release), only the `baselineYear1Arm1` session, and scan types of `MPROC-T1` and `MPROC-rsfMRI`.

In [14]:
subjs = ['NDARINVZVW3HKWN', 'NDARINVXEYPED0K', 'NDARINVZ3638RYW', 
        'NDARINVXFURZ24F', 'NDARINV0PKCP1P3'] # enter 5 GUIDs.
runs = ['MPROC-T1', 'MPROC-rsfMRI'] # need to match the scan types in s3_derv
ses = ['baselineYear1Arm1'] # session

# filter the s3_derv data frame using the above filters
sub_s3derv = s3_derv[s3_derv['guid'].isin(subjs) & s3_derv['scan'].isin(runs) & s3_derv['session'].isin(ses)]
sub_s3derv.sort_values(['guid', 'scan']) # sort to make it pretty

Unnamed: 0,derived_files,guid,session,scan
183141,s3://NDAR_Central_4/submission_32739/NDARINV0P...,NDARINV0PKCP1P3,baselineYear1Arm1,MPROC-T1
193027,s3://NDAR_Central_4/submission_32739/NDARINV0P...,NDARINV0PKCP1P3,baselineYear1Arm1,MPROC-rsfMRI
193029,s3://NDAR_Central_4/submission_32739/NDARINV0P...,NDARINV0PKCP1P3,baselineYear1Arm1,MPROC-rsfMRI
195322,s3://NDAR_Central_4/submission_32739/NDARINV0P...,NDARINV0PKCP1P3,baselineYear1Arm1,MPROC-rsfMRI
197927,s3://NDAR_Central_4/submission_32739/NDARINV0P...,NDARINV0PKCP1P3,baselineYear1Arm1,MPROC-rsfMRI
164645,s3://NDAR_Central_4/submission_32739/NDARINVXE...,NDARINVXEYPED0K,baselineYear1Arm1,MPROC-T1
2,s3://NDAR_Central_4/submission_32739/NDARINVXE...,NDARINVXEYPED0K,baselineYear1Arm1,MPROC-rsfMRI
3238,s3://NDAR_Central_4/submission_32739/NDARINVXE...,NDARINVXEYPED0K,baselineYear1Arm1,MPROC-rsfMRI
3878,s3://NDAR_Central_4/submission_32739/NDARINVXE...,NDARINVXEYPED0K,baselineYear1Arm1,MPROC-rsfMRI
3880,s3://NDAR_Central_4/submission_32739/NDARINVXE...,NDARINVXEYPED0K,baselineYear1Arm1,MPROC-rsfMRI


Let's see a count of how many s3 links met the filter criteria.

In [25]:
sub_s3derv['scan'].value_counts()

MPROC-rsfMRI    20
MPROC-T1         5
Name: scan, dtype: int64
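Before writing the links out, it can be worth checking that every GUID you asked for actually has at least one matching link, since a mistyped GUID is silently dropped by the filter. The sketch below does this in plain Python on a short list of links; `requested` and `filtered_links` are stand-ins for `subjs` and `sub_s3derv['derived_files']`, and the two links are copied from earlier output.

```python
from collections import Counter

# GUIDs we asked for (stand-in for `subjs`)
requested = ['NDARINVZVW3HKWN', 'NDARINVXEYPED0K', 'NDARINVZ3638RYW']

# links that survived the filter (stand-in for sub_s3derv['derived_files'])
filtered_links = [
    's3://NDAR_Central_4/submission_32739/NDARINVXEYPED0K_baselineYear1Arm1_ABCD-MPROC-rsfMRI_20171129140732.tgz',
    's3://NDAR_Central_4/submission_32739/NDARINVZ3638RYW_baselineYear1Arm1_ABCD-MPROC-rsfMRI_20170709114733.tgz',
]

# count how many links each GUID has
found = Counter(link.split('/')[-1].split('_')[0] for link in filtered_links)

# any requested GUID with zero links is reported here
missing = [guid for guid in requested if guid not in found]
print(found)
print(missing)
```

On the real dataframe, `sub_s3derv.groupby(['guid', 'scan']).size()` gives a similar per-subject breakdown.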

***

16. Great! Now we can write the filtered s3 links to a text file. `s3_derv_links_5subj.txt` will be a simple text file that only contains the relevant s3 links.

In [26]:
with open('/home/jovyan/ABCDndar3/s3_derv_links_5subj.txt', 'w') as f:
    f.write('\n'.join(sub_s3derv['derived_files']))
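Because the download can take a while, a quick sanity check of the links file may save a failed run. The sketch below reads a links file back in and flags any line that does not look like an s3 link to a `.tgz` archive. `check_links_file` is a hypothetical helper, and the demo writes a throwaway file in a temporary directory (with one deliberately bad line) instead of touching your real file.

```python
import os
import tempfile


def check_links_file(path):
    """Return (all non-empty lines, lines that do not look like s3 .tgz links)."""
    with open(path) as f:
        links = [line.strip() for line in f if line.strip()]
    bad = [link for link in links
           if not (link.startswith('s3://') and link.endswith('.tgz'))]
    return links, bad


# demo on a temporary file; the last entry is intentionally malformed
demo_links = [
    's3://NDAR_Central_4/submission_32739/NDARINVXEYPED0K_baselineYear1Arm1_ABCD-MPROC-rsfMRI_20171129140732.tgz',
    's3://NDAR_Central_4/submission_32739/NDARINVZ3638RYW_baselineYear1Arm1_ABCD-MPROC-rsfMRI_20170709114733.tgz',
    'not-a-link',
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 's3_links_demo.txt')
    with open(demo_path, 'w') as f:
        f.write('\n'.join(demo_links))
    links, bad = check_links_file(demo_path)

print(len(links), bad)
```

To check the real file, you would call `check_links_file('/home/jovyan/ABCDndar3/s3_derv_links_5subj.txt')` and expect an empty `bad` list.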

***

17. Now we can use `downloadcmd` to download the actual data! Let's also make a directory to store the downloaded files. The download will take a few minutes. Brew some more tea.

In [27]:
! mkdir /home/jovyan/ABCDndar3/tar_files
! downloadcmd -d /home/jovyan/ABCDndar3/tar_files -t /home/jovyan/ABCDndar3/s3_derv_links_5subj.txt

mkdir: cannot create directory ‘/home/jovyan/ABCDndar3/tar_files’: File exists
Running NDATools Version 0.2.3
Opening log: /home/jovyan/NDAValidationResults/debug_log_20210305T011756.txt


***

18. Let's list out the files we've downloaded. You'll notice that the data was downloaded into a `submission_XXXXX` directory. You will also notice that the files are in `.tgz` format. The last step will be to unzip the files. The unzipping and `datalad save` steps will take a few minutes. Fourth tea refill is a charm!

The `%%bash` magic at the top of a cell tells Jupyter to run the entire cell in bash.

In [28]:
! ls /home/jovyan/ABCDndar3/tar_files

submission_32739


In [30]:
%%bash

# let's use datalad to track the unzipped dataset
datalad create /home/jovyan/ABCDndar3/image_files

# now unzip the files; globbing across submission_* avoids a nested cd,
# which would fail on a second submission directory
cd /home/jovyan/ABCDndar3/tar_files
for f in submission_*/*.tgz; do
    tar zxf "$f" --directory /home/jovyan/ABCDndar3/image_files
done

create(error): /home/jovyan/ABCDndar3/image_files (dataset) [will not create a dataset in a non-empty directory, use `force` option to ignore]


[ERROR] will not create a dataset in a non-empty directory, use `force` option to ignore [create(/home/jovyan/ABCDndar3/image_files)] 
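If you prefer to stay in Python, the extraction loop above could be sketched with the standard library's `tarfile` module. This is an alternative under stated assumptions, not what the notebook ran: `extract_archives` is a hypothetical helper, and the demo builds a tiny throwaway archive in a temporary directory (named `submission_00000`, a made-up name) so the example runs without any NDA data.

```python
import tarfile
import tempfile
from pathlib import Path


def extract_archives(tar_root, out_dir):
    """Extract every submission_*/*.tgz under tar_root into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = sorted(Path(tar_root).glob('submission_*/*.tgz'))
    # use the safer 'data' extraction filter where this Python version supports it
    kwargs = {'filter': 'data'} if hasattr(tarfile, 'data_filter') else {}
    for archive in archives:
        with tarfile.open(archive, 'r:gz') as tar:
            tar.extractall(out_dir, **kwargs)
    return archives


# demo with a throwaway archive instead of real NDA downloads
with tempfile.TemporaryDirectory() as tmp:
    sub_dir = Path(tmp, 'tar_files', 'submission_00000')
    sub_dir.mkdir(parents=True)
    src = Path(tmp, 'dataset_description.json')
    src.write_text('{}')
    with tarfile.open(sub_dir / 'demo.tgz', 'w:gz') as tar:
        tar.add(src, arcname='dataset_description.json')
    done = extract_archives(Path(tmp, 'tar_files'), Path(tmp, 'image_files'))
    extracted = sorted(p.name for p in Path(tmp, 'image_files').iterdir())

print(len(done), extracted)
```

For the real data you would call `extract_archives('/home/jovyan/ABCDndar3/tar_files', '/home/jovyan/ABCDndar3/image_files')`.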


In [31]:
%%bash

# track the changes in datalad
cd /home/jovyan/ABCDndar3/image_files
datalad save -m 'add unzipped files from NDA' .

add(ok): dataset_description.json (file)
add(ok): sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/anat/sub-NDARINV0PKCP1P3_ses-baselineYear1Arm1_run-01_T1w.json (file)
add(ok): sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/anat/sub-NDARINV0PKCP1P3_ses-baselineYear1Arm1_run-01_T1w.nii (file)
add(ok): sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/func/sub-NDARINV0PKCP1P3_ses-baselineYear1Arm1_task-rest_run-01_bold.json (file)
add(ok): sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/func/sub-NDARINV0PKCP1P3_ses-baselineYear1Arm1_task-rest_run-01_bold.nii (file)
add(ok): sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/func/sub-NDARINV0PKCP1P3_ses-baselineYear1Arm1_task-rest_run-01_motion.tsv (file)
add(ok): sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/func/sub-NDARINV0PKCP1P3_ses-baselineYear1Arm1_task-rest_run-02_bold.json (file)
add(ok): sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/func/sub-NDARINV0PKCP1P3_ses-baselineYear1Arm1_task-rest_run-02_bold.nii (file)
add(ok): sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/func/sub

Let's look at the log to see the changes we've made to this dataset.

In [32]:
%%bash

cd /home/jovyan/ABCDndar3/image_files
git log

commit 636a705825ce96aff996f716fcbc5e337b4897db
Author: Anna <anna.vannucci@columbia.edu>
Date:   Fri Mar 5 01:40:03 2021 +0000

    add unzipped files from NDA

commit a1fedb66c33bbcf339884625c003cea777d0b613
Author: Anna <anna.vannucci@columbia.edu>
Date:   Fri Mar 5 01:28:21 2021 +0000

    [DATALAD] new dataset


### Success!!
# 🎉🎉🎉

***

## Downloading the data using the nda-abcd-s3-downloader

The [Developmental Cognition and Neuroimaging Lab (DCAN)](https://www.ohsu.edu/school-of-medicine/developmental-cognition-and-neuroimaging-lab) at Oregon Health & Science University has created a handy tool to make downloading easier. In addition, they have uploaded preprocessed derivatives to facilitate quick analysis. More information about the specific contents and preprocessing pipeline can be found at the [Collection 3165 documentation page](https://collection3165.readthedocs.io/en/stable/release_notes/). The procedure for preparing the final download is similar to the procedure above.

First you need to create a package and download the list of files contained in Collection 3165. Below are the instructions for using this tool from the [`nda-abcd-s3-downloader` README](https://github.com/DCAN-Labs/nda-abcd-s3-downloader). 

1. Navigate to the [NDA website](https://nda.nih.gov/general-query.html?q=query=collections%20~and~%20searchTerm=DCAN%20Labs%20ABCD-BIDS%20MRI%20pipeline%20inputs%20and%20derivatives%20~and~%20orderBy=id%20~and~%20orderDirection=Ascending) 
2. Under "Get Data" select "Data from Labs"
3. Search for "DCAN Labs ABCD-BIDS MRI pipeline inputs and derivatives"
4. After clicking on the Collection Title select "Shared Data"
5. Click "Add to Cart" at the bottom
6. It will take a minute to update the "Filter Cart" in the upper right corner, but when that is done select "Package/Add to Study". Select "Create Package", name your package accordingly, and click "Create Package"
- IMPORTANT: Make sure "Include associated data files" is deselected or it will automatically attempt to download all the data through the NDA's package manager which is unreliable on such a large dataset. That is why we've created this downloader for you.
7. Now download the "Download Manager" to actually download the package or use the NDA's nda-tools to download the package from the command line. This may take several minutes.
8. After the download is complete, find the "datastructure_manifest.txt" in the downloaded directory. This is the input S3 file that contains AWS S3 links to every input and derivative for all of the selected subjects. You will need to give the path to this file when calling download.py

***
Hopefully most of the steps to download the manifest sound familiar. Once you have created the Data Package in the NDA, you can proceed.

9. Let's make a directory and download the data package.

In [33]:
! mkdir /home/jovyan/ABCDdcan
! downloadcmd 1186298 -dp -d /home/jovyan/ABCDdcan # numbers are the data package ID from nda

Running NDATools Version 0.2.3
Opening log: /home/jovyan/NDAValidationResults/debug_log_20210305T035615.txt


In [34]:
! ls /home/jovyan/ABCDdcan

Collection_Documents	    fmriresults01.txt	     README.pdf
datastructure_manifest.txt  imagingcollection01.txt
experiments		    package_info.txt


***

10. We should see `datastructure_manifest.txt` in `/home/jovyan/ABCDdcan`. This is the file that contains all of the s3 links for the input and derivative data. As before, we'll need to filter the larger file, but we'll do so in a different way. But first, we need to clone the [nda-abcd-s3-downloader](https://github.com/DCAN-Labs/nda-abcd-s3-downloader).

In [35]:
%%bash

cd /home/jovyan/ABCDdcan
git clone https://github.com/DCAN-Labs/nda-abcd-s3-downloader.git

Cloning into 'nda-abcd-s3-downloader'...


***

11. The `nda-abcd-s3-downloader` accepts a `data_subsets.txt` file in which we can specify which subsets of the data we want. Let's download some raw (input) data and some derivative data, but keep them separate. You can see the list of data subsets [here](https://github.com/DCAN-Labs/nda-abcd-s3-downloader/blob/master/data_subsets.txt). First, we'll download the inputs.

In [36]:
%%bash

cd /home/jovyan/ABCDdcan
# write inputs of interest to inputs.txt
# ('>' starts a fresh file so re-running the cell does not duplicate lines; '>>' appends)
echo inputs.anat.T1w > inputs.txt
echo inputs.func.task-rest >> inputs.txt

***

12. Now that you have the data subsets, you'll also need to create a subject subset file. You can use the same subjects as you did for the above exercise. Create a file called `5subjects.txt` that contains the following format. You can `echo` the GUIDs to a text file as we did with the inputs above, or you can create a text file outside of this notebook.

```
sub-NDARINVXXXXXXXX
sub-NDARINVXXXXXXXX
sub-NDARINVXXXXXXXX
sub-NDARINVXXXXXXXX
sub-NDARINVXXXXXXXX
```

In [47]:
%%bash

cd /home/jovyan/ABCDdcan

# create .txt file
touch 5subjects.txt 

# add 5 guids to the .txt file
echo "sub-NDARINVZVW3HKWN" > 5subjects.txt 
echo "sub-NDARINVXEYPED0K" >> 5subjects.txt
echo "sub-NDARINVZ3638RYW" >> 5subjects.txt
echo "sub-NDARINVXFURZ24F" >> 5subjects.txt
echo "sub-NDARINV0PKCP1P3" >> 5subjects.txt

# view contents of .txt file
cat 5subjects.txt

sub-NDARINVZVW3HKWN
sub-NDARINVXEYPED0K
sub-NDARINVZ3638RYW
sub-NDARINVXFURZ24F
sub-NDARINV0PKCP1P3
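The subject file could also be written from Python, reusing the list of GUIDs from the first exercise so the two download routes stay in sync. This is a sketch only: `write_subject_file` is a hypothetical helper, and the demo writes to a temporary directory instead of `/home/jovyan/ABCDdcan`.

```python
import os
import tempfile

# the same five GUIDs used in the first exercise
subjs = ['NDARINVZVW3HKWN', 'NDARINVXEYPED0K', 'NDARINVZ3638RYW',
         'NDARINVXFURZ24F', 'NDARINV0PKCP1P3']


def write_subject_file(guids, path):
    """Write one 'sub-' prefixed GUID per line and return the labels written."""
    labels = [g if g.startswith('sub-') else 'sub-' + g for g in guids]
    with open(path, 'w') as f:
        f.write('\n'.join(labels) + '\n')
    return labels


# demo in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, '5subjects.txt')
    labels = write_subject_file(subjs, demo_path)
    with open(demo_path) as f:
        contents = f.read()

print(contents)
```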


***

13. Now we are ready to download the input data! This step will take a few minutes. You're probably out of tea at this point.

#### **NOTE**: The first time you run `download.py`, you'll need to run it from your terminal because it will ask for your NDA credentials

In [48]:
%%bash

# let's use datalad to track the unzipped dataset
datalad create /home/jovyan/ABCDdcan/image_files

## NOTE: the first time you run download.py, you'll need to run it from your terminal because it will ask for your NDA credentials
cd /home/jovyan/ABCDdcan/

# run the downloader
nda-abcd-s3-downloader/download.py -i datastructure_manifest.txt -o image_files -s 5subjects.txt -d inputs.txt

create(error): /home/jovyan/ABCDdcan/image_files (dataset) [will not create a dataset in a non-empty directory, use `force` option to ignore]
Derivatives downloader called at 2021:03:05 04:17 with:

Enter your NIMH Data Archives username: 

[ERROR] will not create a dataset in a non-empty directory, use `force` option to ignore [create(/home/jovyan/ABCDdcan/image_files)] 
Traceback (most recent call last):
  File "nda-abcd-s3-downloader/download.py", line 386, in <module>
    _cli()
  File "nda-abcd-s3-downloader/download.py", line 109, in _cli
    make_nda_token(args.credentials)
  File "nda-abcd-s3-downloader/download.py", line 304, in make_nda_token
    username = input("\nEnter your NIMH Data Archives username: ")
EOFError: EOF when reading a line


CalledProcessError: Command 'b"\n# let's use datalad to track the unzipped dataset\ndatalad create /home/jovyan/ABCDdcan/image_files\n\n## NOTE: the first time you run download.py, you'll need to run it from your terminal because it will ask for your NDA credentials\ncd /home/jovyan/ABCDdcan/\n\n# run the downloader\nnda-abcd-s3-downloader/download.py -i datastructure_manifest.txt -o image_files -s 5subjects.txt -d inputs.txt\n"' returned non-zero exit status 1.

***

14. Let's check to see if the input data we downloaded are in proper BIDS format. To do this we'll use the [bids-validator](https://github.com/bids-standard/bids-validator). We'll build a singularity container from the docker image so that we can run the validator on the JupyterHub.

In [2]:
! singularity build bids_validator-1.6.1.simg docker://bids/validator:v1.6.1

2021/03/05 19:08:03 bufio.Scanner: token too long
Build target 'bids_validator-1.6.1.simg' already exists and will be deleted during the build process. Do you want to continue? [N/y]^C
FATAL:   While checking build target: stopping build


Now let's see if the input data are in BIDS.

In [3]:
! singularity run /home/jovyan/bids_validator-1.6.1.simg /home/jovyan/ABCDdcan/image_files

bids-validator@1.6.1

	1: [WARN] You should define 'SliceTiming' for this file. If you don't provide this information slice time correction will not be possible. 'Slice Timing' is the time at which each slice was acquired within each volume (frame) of the acquisition. Slice timing is not slice order -- rather, it is a list of times containing the time (in seconds) of each slice acquisition in relation to the beginning of volume acquisition. (code: 13 - SLICE_TIMING_NOT_DEFINED)
		./sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/func/sub-NDARINV0PKCP1P3_ses-baselineYear1Arm1_task-rest_run-01_bold.nii.gz
		./sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/func/sub-NDARINV0PKCP1P3_ses-baselineYear1Arm1_task-rest_run-02_bold.nii.gz
		./sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/func/sub-NDARINV0PKCP1P3_ses-baselineYear1Arm1_task-rest_run-03_bold.nii.gz
		./sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/func/sub-NDARINV0PKCP1P3_ses-baselineYear1Arm1_task-rest_run-04_bold.nii.gz
		./sub-NDARINVXEYP

It looks like the data are in BIDS! (Warnings do not make the dataset invalid, but you should be aware of them.)

***

15. Now let's download some derivatives! You can use the same subjects file.

In [5]:
%%bash

cd /home/jovyan/ABCDdcan

# make subsets file
# ('>' starts a fresh file so re-running the cell does not duplicate lines; '>>' appends)
grep derivatives ./nda-abcd-s3-downloader/data_subsets.txt | grep rest > derivatives.txt
grep derivatives ./nda-abcd-s3-downloader/data_subsets.txt | grep T1w >> derivatives.txt
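The `grep` filtering above can be mirrored in Python, which makes it easy to inspect the selection before writing `derivatives.txt`. In the sketch below, `select_subsets` is a hypothetical helper and `sample_lines` is a made-up stand-in for the contents of `data_subsets.txt`: only the two `inputs.*` names come from this notebook, and the `derivatives.*` names are invented purely to illustrate the matching.

```python
# Hypothetical sample of data_subsets.txt. Only the two 'inputs' names appear in
# this notebook; the 'derivatives' names are made up for illustration.
sample_lines = [
    'inputs.anat.T1w',
    'inputs.func.task-rest',
    'derivatives.anat.example_T1w',
    'derivatives.func.example_task-rest',
    'derivatives.func.example_task-nback',
]


def select_subsets(lines, required, keywords):
    """Keep lines containing `required` and at least one keyword (substring match)."""
    selected = []
    for keyword in keywords:
        for line in lines:
            # unlike the two grep calls, a line matching both keywords is kept once
            if required in line and keyword in line and line not in selected:
                selected.append(line)
    return selected


selected = select_subsets(sample_lines, 'derivatives', ['rest', 'T1w'])
print(selected)
```

With the real file, you would read its lines first (for example with `open(...).read().splitlines()`) and pass them in place of `sample_lines`.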

***

16. Now download the derivatives. `download.py` automatically adds the derivatives directory. So we'll have to force a subdataset with datalad after the download.

In [6]:
%%bash

cd /home/jovyan/ABCDdcan/

# run the downloader
nda-abcd-s3-downloader/download.py -i datastructure_manifest.txt -o image_files/derivatives -s 5subjects.txt -d derivatives.txt

Process is terminated.


We ran the download commands in a terminal instead, which makes it easier to monitor progress.

Sample of the s3:// file path for each participant: 
`s3://NDAR_Central_2/submission_23337/derivatives/abcd-hcp-pipeline/sub-NDARINVZVW3HKWN/ses-baselineYear1Arm1/func/sub-NDARINVZVW3HKWN_ses-baselineYear1Arm1_task-rest_bold_atlas-Gordon2014FreeSurferSubcortical_desc-filtered_timeseries.ptseries.nii`
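Because these derivative file names follow the BIDS `key-value` convention described at the top of this notebook, their entities can be pulled apart with the same string-splitting approach used earlier. The sketch below is a simplified illustration (`parse_bids_entities` is a hypothetical helper, not a full BIDS parser): it keeps only the underscore-separated parts that contain a `-` and ignores suffixes such as `bold`.

```python
def parse_bids_entities(path: str) -> dict:
    """Return the key-value entities found in a BIDS-style file name."""
    filename = path.split('/')[-1]
    # drop everything from the first '.' onward (the file extension)
    stem = filename.split('.', 1)[0]
    entities = {}
    for part in stem.split('_'):
        # entities look like key-value; parts without '-' are suffixes
        if '-' in part:
            key, value = part.split('-', 1)
            entities[key] = value
    return entities


sample = ('s3://NDAR_Central_2/submission_23337/derivatives/abcd-hcp-pipeline/'
          'sub-NDARINVZVW3HKWN/ses-baselineYear1Arm1/func/'
          'sub-NDARINVZVW3HKWN_ses-baselineYear1Arm1_task-rest_bold_'
          'atlas-Gordon2014FreeSurferSubcortical_desc-filtered_timeseries.ptseries.nii')
entities = parse_bids_entities(sample)
print(entities)
```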

***

17. Now use datalad to create a subdataset and save the subdataset's state.

In [8]:
%%bash

cd /home/jovyan/ABCDdcan/image_files
# force the creation of a subdataset
datalad create -d derivatives --force

# now save the subdataset
cd derivatives
datalad save -m 'add T1w and rest derivatives'

create(ok): . (dataset)
add(ok): derivatives/abcd-hcp-pipeline/sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/anat/sub-NDARINV0PKCP1P3_ses-baselineYear1Arm1_hemi-L_space-T1w_mesh-native_midthickness.surf.gii (file)
add(ok): derivatives/abcd-hcp-pipeline/sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/anat/sub-NDARINV0PKCP1P3_ses-baselineYear1Arm1_hemi-R_space-T1w_mesh-fsLR32k_midthickness.surf.gii (file)
add(ok): derivatives/abcd-hcp-pipeline/sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/anat/sub-NDARINV0PKCP1P3_ses-baselineYear1Arm1_hemi-R_space-T1w_mesh-native_midthickness.surf.gii (file)
add(ok): derivatives/abcd-hcp-pipeline/sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/func/sub-NDARINV0PKCP1P3_ses-baselineYear1Arm1_task-rest_bold_atlas-Gordon2014FreeSurferSubcortical_desc-filtered_timeseries.ptseries.nii (file)
add(ok): derivatives/abcd-hcp-pipeline/sub-NDARINV0PKCP1P3/ses-baselineYear1Arm1/func/sub-NDARINV0PKCP1P3_ses-baselineYear1Arm1_task-rest_bold_atlas-HCP2016FreeSurferSubcortical_desc-filtered

[INFO] Creating a new annex repo at /home/jovyan/ABCDdcan/image_files/derivatives 


Finally, we can look at the log to see the changes we've made to this dataset.

In [9]:
%%bash

cd /home/jovyan/ABCDdcan/image_files
git log

commit 6f0d548efaa9a45d0f9087cfddaed8f1ffd3e305
Author: Anna <anna.vannucci@columbia.edu>
Date:   Fri Mar 5 04:11:37 2021 +0000

    [DATALAD] new dataset


### Success!!
# 🎉🎉🎉

You should probably buy more tea... ☕