# Downloading the ABCD imaging data

In the previous data exercises you downloaded and interacted with the [ABCD 3.0 release](https://nda.nih.gov/abcd/query/abcd-curated-annual-release-3.0.html). While the pre-packaged tabulated data include many measures derived from the imaging data, you may have noticed that the full set of MRI images is not included in this release.

As stated on [NDA's website](https://nda.nih.gov/abcd/query/abcd-curated-annual-release-3.0.html): 

"The raw MRI images and the minimally processed imaging files are over 100TB in size which may make data transfer difficult. "

The data are stored on [Amazon Simple Storage Service (s3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) servers. 

There are multiple routes to obtaining the full imaging data. We'll focus on the following two:
1. Using links from the [fmriresults01](https://nda.nih.gov/data_structure.html?short_name=fmriresults01) structure
2. Using the [nda-abcd-s3-downloader](https://github.com/DCAN-Labs/nda-abcd-s3-downloader)

Both routes involve creating a data package through the NDA, downloading a manifest file, parsing the manifest file, and finally downloading the data.

For brevity, the exercises in this notebook will guide you through downloading the resting state and T1w data from 5 subjects using each of the above download methods. You will need active NDA credentials and an ABCD Data Use Certification (DUC) to download the data.

**A Note about GUIDs and BIDS**

[From the NDA](https://nda.nih.gov/s/guid/nda-guid.html): "The Global Unique Identifier (GUID) is a subject ID allowing researchers to share data specific to a study participant without exposing personally identifiable information (PII) and match participants across labs and research data repositories."

The GUID's format is `NDAR_INVXXXXXXXX`, where `XXXXXXXX` is a random string of numbers and uppercase letters. The standard GUID format is *not* [BIDS compatible](https://bids-specification.readthedocs.io/en/stable/02-common-principles.html#file-name-structure). In BIDS, the underscore character is reserved to separate key-value entities (e.g., `key1-value1_key2-value2`, `sub-01_task-rest`). For the BIDS imaging data on the NDA, the underscore in the GUID has been removed (i.e., `NDARINVXXXXXXXX`), so you may need a string replace operation to remove the underscore from the GUIDs in the tabulated data before they will match the GUIDs in the BIDS imaging data.
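
The string replace mentioned above can be sketched as follows. This is a minimal illustration, not part of the NDA tooling: the GUIDs are placeholders and `to_bids_guid` is a hypothetical helper name.

```python
# Placeholder GUIDs in the tabulated-data format (not real participants)
tabulated_guids = ["NDAR_INVXXXXXXXX", "NDAR_INVYYYYYYYY"]

def to_bids_guid(guid):
    """Hypothetical helper: remove the underscore so the GUID matches the BIDS imaging data."""
    return guid.replace("_", "")

bids_guids = [to_bids_guid(g) for g in tabulated_guids]
print(bids_guids)  # ['NDARINVXXXXXXXX', 'NDARINVYYYYYYYY']
```

For a pandas column of GUIDs, `column.str.replace('_', '', regex=False)` should give the same result without an explicit loop.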

***

## Downloading the data using the fmriresults01 structure

The general workflow on the NDA is to add data to your Filter Cart and then create a Data Package from the filter. Here we will create a Data Package from the *fmriresults01* data structure. See Getting Image Volumes [here](https://nda.nih.gov/abcd/query/abcd-release-faqs.html) for more info on the *fmriresults01* structure.

**NOTE**: The `fmriresults01.txt` file is distributed in the ABCD 3.0 Release. If you've already downloaded that release, you can use that file and skip to step 12.

1. Let's begin at the [NDA's front page](https://nda.nih.gov/). Select **Get Data** > **Get Data**

<img src="./screenshots/nda_frontpage.png" width="900" />

***

2. On the **NDA Query Tool**'s menu, select **Data Structures**. Then enter "fmriresults01" into the Text Search field and hit enter.

<img src="./screenshots/nda_query.png" width="900" />

***

3. Click the **Processed MRI Data** link, which will open the structure. Then select **Add to Filter Cart** in the lower left corner.

<img src="./screenshots/add_filter_cart.png" width="150" />

Your Filter Cart will take a few minutes to update. Make yourself some tea. Once it is finished, you should see the following.

<img src="./screenshots/filter_cart.png" width="400" />

(Sample size may vary depending on when you are working through this exercise)

***

4. In the Filter Cart, select **Create Data Package/Add Data to Study**, which will take you to the Data Packaging Page.

5. On the Data Packaging Page, select **Create Data Package**.

<img src="./screenshots/create_data_package.png" width="200" />

6. If you are not logged into the NDA, this will prompt you to log in with your credentials. Afterwards, you will see a menu to define your Data Package. Give it a short name and ensure that **Include Associated Data Files** is *unchecked*. Otherwise, the Data Package will contain all of the images in *fmriresults01*. It will be faster and more flexible to download only the pointers to the data and not the data itself. When you are finished entering this information, click **Create Data Package**.

<img src="./screenshots/create_menu.png" width="300" />

***

7. You will see a window that confirms that your package was initiated. Click the link to navigate to your Dashboard.

<img src="./screenshots/package_created.png" width="350" />

***

8. In the drop down menu on the Data Package Dashboard, select **My Data Packages**. You should see the Data Package you just created. It will take a few minutes to move from the "Creating Package" status to "Ready to Download". Maybe refill your tea. In the below image **ABCDndar** is the Data Package we just created. **ABCDdcan** will be created in the second section of this exercise.

<img src="./screenshots/create_dash.png" width="350" />

<img src="./screenshots/ready_dash.png" width="350" />

***

9. Once the Data Package is ready to download, we can use the [NDA tools](https://github.com/NDAR/nda-tools) to download it. The NDA tools are already installed and ready to use on the ABCD-ReproNim JupyterHub. The relevant command will be `downloadcmd`. Let's see what options `downloadcmd` has.

In [1]:
! downloadcmd -h

Running NDATools Version 0.2.3
Opening log: /home/jovyan/NDAValidationResults/debug_log_20210304T184400.txt
usage: downloadcmd <S3_path_list>

This application allows you to enter a list of aws S3 paths and will download
the files to your drive in your home folder. Alternatively, you may enter a
packageID, an NDA data structure file or a text file with s3 links, and the
client will download all files from the S3 links listed. Please note, the
maximum transfer limit of data is 5TB at one time.

positional arguments:
  <S3_path_list>        Will download all S3 files to your local drive

optional arguments:
  -h, --help            show this help message and exit
  -dp, --package        Flags to download all S3 files in package.
  -t, --txt             Flags that a text file has been entered from where to
                        download S3 files.
  -ds, --datastructure  Flags that a data structure text file has been entered
                        from where to download S3 files.
  -u <a

Recall that a `!` at the start of a line in a Jupyter notebook code cell runs that command in the shell.

***

10. Our first usage of `downloadcmd` will use the Package ID to download the associated package files. Let's put the ABCDndar package into its own directory. If you have already set up your NDA credentials to download the ABCD 3.0 Release, then `downloadcmd` will use the stored credentials.

In [14]:
! mkdir /home/jovyan/ABCDndar
! downloadcmd 1186278 -dp -d /home/jovyan/ABCDndar # number is the data package ID from ndar

mkdir: cannot create directory ‘/home/jovyan/ABCDndar’: File exists
Running NDATools Version 0.2.3
Opening log: /home/jovyan/NDAValidationResults/debug_log_20210304T185714.txt
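
The `mkdir: cannot create directory ... File exists` message above only means the directory was already there from an earlier run. If you prefer to create the directory from Python, the sketch below avoids the message. It assumes a temporary directory in place of the home directory; `base` and `package_dir` are illustrative names.

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for the home directory in this sketch
base = Path(tempfile.mkdtemp())
package_dir = base / "ABCDndar"

package_dir.mkdir(parents=True, exist_ok=True)  # creates the directory
package_dir.mkdir(parents=True, exist_ok=True)  # second call does nothing and raises no error

print(package_dir.is_dir())  # True
```

In the shell, `mkdir -p` has the same effect.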


***

11. Once the download is complete, we can list the files. The relevant file is `fmriresults01.txt`, which contains information about each image in this structure.

In [16]:
! ls /home/jovyan/ABCDndar

datastructure_manifest.txt  fmriresults01.txt	 package_info.txt
experiments		    guid_pseudoguid.txt  README.pdf


***

12. `fmriresults01.txt` is a tab-separated table that contains information about the corresponding image files. Let's read this table into Python so that we can parse it and choose only the image files we want.

In [17]:
import pandas as pd
fmri = pd.read_csv('/home/jovyan/ABCDndar/fmriresults01.txt', sep='\t', low_memory=False)

*** 

13. Let's look at the structure and contents of `fmriresults01.txt`.

In [18]:
fmri.head()

Unnamed: 0,collection_id,fmriresults01_id,dataset_id,subjectkey,src_subject_id,origin_dataset_id,interview_date,interview_age,sex,experiment_id,...,qc_outcome,derived_files,scan_type,img03_id2,file_source2,session_det,image_history,manifest,image_description,collection_title
0,collection_id,fmriresults01_id,dataset_id,The NDAR Global Unique Identifier (GUID) for r...,Subject ID how it's defined in lab/project,Origin dataset Id,Date on which the interview/genetic test/sampl...,Age in months at the time of the interview/tes...,Sex of the subject,ID for the Experiment/settings/run,...,Provide information on the conclusion of the q...,An archive of the files produced by the pipeline,Type of Scan,Corresponds to row_id in image03 data structur...,"File name/location, 2",session details,"Image history,f.e. transformations steps and o...",,"Image description, i.e. DTI, fMRI, Fast SPGR, ...",collection_title
1,2099,448,10764,NDARUC212ALG,10235701,,01/25/2016,300,F,,...,pass,s3://NDAR_Central_1/submission_11664/home-nfs/...,MR structural (T1),,,,,,,"RDoC Constructs: Neural Substrates, Heritabili..."
2,2099,449,10764,NDARLB157TKZ,10235702,,01/25/2016,300,F,,...,pass,s3://NDAR_Central_1/submission_11664/home-nfs/...,MR structural (T1),,,,,,,"RDoC Constructs: Neural Substrates, Heritabili..."
3,2099,450,10764,NDARHA308FFG,10383701,,01/28/2016,360,M,,...,pass,s3://NDAR_Central_1/submission_11664/home-nfs/...,MR structural (T1),,,,,,,"RDoC Constructs: Neural Substrates, Heritabili..."
4,2099,451,10764,NDARBE334JDM,10383702,,01/28/2016,360,M,,...,pass,s3://NDAR_Central_1/submission_11664/home-nfs/...,MR structural (T1),,,,,,,"RDoC Constructs: Neural Substrates, Heritabili..."


We can see that the first row contains a detailed description of each column. We won't need this row for parsing, so we can drop it.

In [19]:
fmri = fmri.drop([0])
fmri.head()

Unnamed: 0,collection_id,fmriresults01_id,dataset_id,subjectkey,src_subject_id,origin_dataset_id,interview_date,interview_age,sex,experiment_id,...,qc_outcome,derived_files,scan_type,img03_id2,file_source2,session_det,image_history,manifest,image_description,collection_title
1,2099,448,10764,NDARUC212ALG,10235701,,01/25/2016,300,F,,...,pass,s3://NDAR_Central_1/submission_11664/home-nfs/...,MR structural (T1),,,,,,,"RDoC Constructs: Neural Substrates, Heritabili..."
2,2099,449,10764,NDARLB157TKZ,10235702,,01/25/2016,300,F,,...,pass,s3://NDAR_Central_1/submission_11664/home-nfs/...,MR structural (T1),,,,,,,"RDoC Constructs: Neural Substrates, Heritabili..."
3,2099,450,10764,NDARHA308FFG,10383701,,01/28/2016,360,M,,...,pass,s3://NDAR_Central_1/submission_11664/home-nfs/...,MR structural (T1),,,,,,,"RDoC Constructs: Neural Substrates, Heritabili..."
4,2099,451,10764,NDARBE334JDM,10383702,,01/28/2016,360,M,,...,pass,s3://NDAR_Central_1/submission_11664/home-nfs/...,MR structural (T1),,,,,,,"RDoC Constructs: Neural Substrates, Heritabili..."
5,2099,452,10764,NDARUP674KX9,10119501,,01/30/2016,288,F,,...,pass,s3://NDAR_Central_1/submission_11664/home-nfs/...,MR structural (T1),,,,,,,"RDoC Constructs: Neural Substrates, Heritabili..."
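
As an alternative to dropping the description row after loading, the row can be skipped while the file is read. The sketch below assumes a toy two-column table in place of the real file; `toy_txt` and `toy` are illustrative names, and the GUIDs are placeholders.

```python
import io
import pandas as pd

# Toy stand-in for fmriresults01.txt: header line, description line, then data
toy_txt = (
    "subjectkey\tscan_type\n"
    "The NDAR Global Unique Identifier\tType of Scan\n"
    "NDARINVXXXXXXXX\tMR structural (T1)\n"
    "NDARINVYYYYYYYY\tMR structural (T1)\n"
)

# skiprows=[1] skips the second line of the file (the description row);
# line 0 is still used as the header
toy = pd.read_csv(io.StringIO(toy_txt), sep="\t", skiprows=[1])
print(toy.shape)  # (2, 2)
```

One difference from `drop([0])`: the row index then starts at 0 rather than 1.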


***

14. We do not need most of the information in this table. The relevant columns are `file_source`, which contains the s3 links to the raw DICOM images, and `derived_files`, which contains the s3 links to the minimally processed images. Here we will focus on downloading the minimally processed files in `derived_files`. We will create a dataframe that contains the s3 links and other relevant fields so that we can filter the s3 links. Explore the other columns of this table to see the processing steps that have been applied to the `derived_files`.

In [20]:
# create a new data frame from the derived_files column
s3_derv = fmri.loc[:,['derived_files']] 
# view dataframe
s3_derv.head()

Unnamed: 0,derived_files
1,s3://NDAR_Central_1/submission_11664/home-nfs/...
2,s3://NDAR_Central_1/submission_11664/home-nfs/...
3,s3://NDAR_Central_1/submission_11664/home-nfs/...
4,s3://NDAR_Central_1/submission_11664/home-nfs/...
5,s3://NDAR_Central_1/submission_11664/home-nfs/...


In [9]:
# number of rows in the derived_files dataframe
total_rows = len(s3_derv)
print(total_rows) # 642628 rows
# the end of the table has a lot of NaN values
s3_derv.tail()

642628
Unnamed: 0,derived_files
642624,NaN
642625,NaN
642626,NaN
642627,NaN
642628,NaN
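
The row total includes rows whose `derived_files` entry is missing (`NaN`). A minimal sketch of how the two could be counted separately, using a toy column with made-up links in place of the real table:

```python
import numpy as np
import pandas as pd

# Toy column with hypothetical links and two missing entries
toy = pd.DataFrame({"derived_files": [
    "s3://bucket/a.tgz", "s3://bucket/b.tgz", np.nan, np.nan]})

n_rows = len(toy)                                   # all rows, including NaN
n_links = int(toy["derived_files"].count())         # non-missing entries only
n_missing = int(toy["derived_files"].isna().sum())  # missing entries

print(n_rows, n_links, n_missing)  # 4 2 2
```

Note that `count` is a method: without the parentheses, pandas returns the method object itself instead of a number.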


Let's look at the format of the s3 links to see how we could parse this for filtering:

*s3://NDAR_Central_4/submission_32739/NDARINVXXXXXXXX_baselineYear1Arm1_ABCD-MPROC-SST-fMRI_XXXXXXXXXXXXXX.tgz*

- *s3://NDAR_Central_4/submission_32739* is the location of the data on the s3 server
- *NDARINVXXXXXXXX* is the GUID
- *baselineYear1Arm1* is the session
- *ABCD-MPROC-SST-fMRI* is the scan type information
- The number at the end of the file name is the acquisition date/time
- *.tgz* is the TAR archive file extension

We can use Python's string splitting to parse these links so that we can filter by GUID, session, and scan type. Let's see an example:

In [25]:
example = 's3://NDAR_Central_4/submission_32739/NDARINVXXXXXXXX_baselineYear1Arm1_ABCD-MPROC-SST-fMRI_XXXXXXXXXXXXXX.tgz'
example.split('/')

['s3:',
 '',
 'NDAR_Central_4',
 'submission_32739',
 'NDARINVXXXXXXXX_baselineYear1Arm1_ABCD-MPROC-SST-fMRI_XXXXXXXXXXXXXX.tgz']

The above code splits the `example` string into a list of strings at every occurrence of `/`.
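
Splitting the last element again, this time at `_`, separates the fields described in the list above. A minimal sketch on the same placeholder link (the variable names are illustrative):

```python
example = ('s3://NDAR_Central_4/submission_32739/'
           'NDARINVXXXXXXXX_baselineYear1Arm1_ABCD-MPROC-SST-fMRI_XXXXXXXXXXXXXX.tgz')

filename = example.split('/')[-1]                 # last path component
guid, session, scan, stamp = filename.split('_')  # four underscore-separated fields
scan = scan.split('-', 1)[-1]                     # drop the leading 'ABCD-'
stamp = stamp.split('.')[0]                       # drop the '.tgz' extension

print(guid, session, scan)  # NDARINVXXXXXXXX baselineYear1Arm1 MPROC-SST-fMRI
```

The four-way unpacking assumes the file name has exactly four underscore-separated fields, as in the format shown above; a link that deviates from that pattern would raise a `ValueError`.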

`.split` only operates on strings, but we have an entire column of strings we want to split. Here we can leverage python's list comprehension to iterate through each string.

For example:

In [26]:
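# skip missing entries (NaN), which are floats and have no .split method
test_split = [i.split('/') for i in s3_derv['derived_files'].dropna()]
test_split[0:3]

The above code applies the same split operation to every non-missing item in the `s3_derv['derived_files']` column we created above. The `.dropna()` call is needed because rows without a link are stored as `NaN`, which is a float; calling `.split` on one raises `AttributeError: 'float' object has no attribute 'split'`. You could also complete this with a regular `for` loop, but a list comprehension is more concise.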
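
Because the `derived_files` column contains missing values, any per-item split has to tolerate entries that are not strings. The sketch below shows two ways this might be handled on a toy column; the link and its `ABCD-MPROC-T1` label are placeholders, not values taken from the real table.

```python
import numpy as np
import pandas as pd

# Toy column: one hypothetical link and one missing entry
toy = pd.DataFrame({"derived_files": [
    "s3://bucket/submission_1/NDARINVXXXXXXXX_baselineYear1Arm1_ABCD-MPROC-T1_XXXXXXXXXXXXXX.tgz",
    np.nan,
]})

# Option 1: keep only string entries inside the list comprehension
split_links = [i.split('/') for i in toy['derived_files'] if isinstance(i, str)]

# Option 2: the vectorized pandas string method leaves NaN in place
split_series = toy['derived_files'].str.split('/')

print(len(split_links))  # 1
```

Option 1 returns a shorter list than the column, so it cannot be assigned straight back to the dataframe; Option 2 keeps one result per row.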

We can leverage string splitting and list comprehension to parse each s3 link into a corresponding GUID, session, and scan type.

In [27]:
# drop rows without a link first; NaN values are floats and cannot be split
s3_derv = s3_derv.dropna(subset=['derived_files']).copy()
s3_derv['guid'] = [i.split('/')[-1].split('_')[0] for i in s3_derv['derived_files']] # get the GUID
s3_derv['session'] = [i.split('/')[-1].split('_')[1] for i in s3_derv['derived_files']] # get the session
s3_derv['scan'] = [i.split('/')[-1].split('_')[2].split('-',1)[-1] for i in s3_derv['derived_files']] # get the scan type

s3_derv.head()
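
Once the links are parsed into `guid`, `session`, and `scan` columns, narrowing them to a handful of subjects and scan types is a matter of boolean filtering. The sketch below uses a toy table: the GUIDs, the bucket path, and the scan labels `MPROC-rsfMRI` and `MPROC-T1` are assumptions for illustration, so check the labels that actually appear in your own table.

```python
import pandas as pd

# Toy table with hypothetical GUIDs and scan labels (assumptions, not real data)
rows = []
for n in range(7):
    guid = f"NDARINV{n:08d}"
    for scan in ["MPROC-rsfMRI", "MPROC-T1", "MPROC-SST-fMRI"]:
        link = (f"s3://bucket/submission_0/{guid}_baselineYear1Arm1_"
                f"ABCD-{scan}_XXXXXXXXXXXXXX.tgz")
        rows.append({"derived_files": link, "guid": guid,
                     "session": "baselineYear1Arm1", "scan": scan})
toy = pd.DataFrame(rows)

wanted_scans = ["MPROC-rsfMRI", "MPROC-T1"]      # assumed labels
wanted_guids = sorted(toy["guid"].unique())[:5]  # first five subjects

# keep rows that match both the subject list and the scan list
subset = toy[toy["guid"].isin(wanted_guids) & toy["scan"].isin(wanted_scans)]

# one link per line, the layout a text file of s3 links would have
links_text = "\n".join(subset["derived_files"])
print(len(subset))  # 10
```

Written to a text file, a list like `links_text` is the kind of input that the `-t` flag in the `downloadcmd` help output above describes.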