# Open MRI Datasets

For this workshop and the fMRI and DWI workshops that follow, we will be using a subset of a publicly available dataset, ds000030, from [openneuro.org](https://openneuro.org/datasets/ds000030). This dataset, like all others hosted on OpenNeuro, is structured according to the Brain Imaging Data Structure (BIDS).

## OpenNeuro
- client-side BIDS validation
- resumable uploads
- running BIDS apps

## Downloading Data

### Datalad

`Datalad` installs the dataset, which means that we get the "small" data (i.e. the text files) and the download instructions for the larger files. We can now navigate the dataset as if it were a file system and plan our analysis.

In [1]:
import datalad.api as dl



In [2]:
ds = dl.install('../../data/ds000030', source='///openfmri/ds000030')
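Because only the small files are present after installation, it can help to check which files are actually available locally. The sketch below assumes the dataset uses git-annex's default symlink mode (typical on Linux and macOS), where a file that has not been fetched appears as a symlink whose target is missing. The helper name `has_local_content` and the demo files are hypothetical; the demo builds a throwaway directory rather than touching a real dataset.

```python
import tempfile
from pathlib import Path


def has_local_content(path):
    """Return True if the file's content is available locally.

    Assumption: un-fetched annexed files are symlinks whose target does
    not exist yet, so Path.exists() (which follows symlinks) is False.
    """
    return Path(path).exists()


# Demo on a throwaway directory that mimics a freshly installed dataset.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    small = root / "participants.tsv"
    small.write_text("participant_id\tage\n")
    # Stand-in for an annexed file whose content was never downloaded.
    large = root / "sub-10159_T1w.nii.gz"
    large.symlink_to(root / ".git" / "annex" / "objects" / "missing-key")

    availability = {p.name: has_local_content(p) for p in sorted(root.iterdir())}

print(availability)
```

Files reported as unavailable are the ones that still need to be fetched before they can be opened.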

#### Getting and dropping data

In [3]:
ds.get('../../data/ds000030/sub-10159')

[{'action': 'get',
  'path': '/mnt/tigrlab/projects/mjoseph/tutorials/carpentry/SDC-BIDS-IntroMRI/data/ds000030/sub-10159',
  'type': 'directory',
  'refds': '/mnt/tigrlab/projects/mjoseph/tutorials/carpentry/SDC-BIDS-IntroMRI/data/ds000030',
  'status': 'notneeded',
  'message': ('nothing to get from %s',
   '/mnt/tigrlab/projects/mjoseph/tutorials/carpentry/SDC-BIDS-IntroMRI/data/ds000030/sub-10159')}]

In [28]:
ds.get('../../data/ds000030/*json')

IncompleteResultsError: Command did not complete successfully [{'action': 'get', 'path': '/mnt/tigrlab/projects/mjoseph/tutorials/carpentry/SDC-BIDS-IntroMRI/data/ds000030/*json', 'refds': '/mnt/tigrlab/projects/mjoseph/tutorials/carpentry/SDC-BIDS-IntroMRI/data/ds000030', 'raw_input': True, 'orig_request': '../../data/ds000030/*json', 'state': 'absent', 'parentds': '/mnt/tigrlab/projects/mjoseph/tutorials/carpentry/SDC-BIDS-IntroMRI/data/ds000030', 'status': 'impossible', 'message': 'path does not exist'}]
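The error above reports `path does not exist` for the literal string `*json`, which suggests the wildcard was not expanded. A possible workaround, sketched here with the standard library only, is to expand the pattern in Python first and pass the resulting list of paths on. The helper `expand_pattern` is hypothetical, and the demo directory stands in for the real dataset.

```python
import tempfile
from pathlib import Path


def expand_pattern(dataset_dir, pattern):
    """Expand a wildcard inside dataset_dir into a sorted list of path strings."""
    return sorted(str(p) for p in Path(dataset_dir).glob(pattern))


# Demo on a throwaway directory that stands in for the dataset root.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("dataset_description.json", "task-rest_bold.json", "README"):
        (Path(tmp) / name).write_text("")
    json_names = [Path(p).name for p in expand_pattern(tmp, "*json")]

print(json_names)
```

With a real dataset, the expanded list could then be handed to `ds.get(...)`, assuming `get` accepts a list of paths in the installed Datalad version.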

In [4]:
ds.drop('../../data/ds000030/sub-10159')

[{'action': 'drop',
  'type': 'file',
  'refds': '/mnt/tigrlab/projects/mjoseph/tutorials/carpentry/SDC-BIDS-IntroMRI/data/ds000030',
  'status': 'ok',
  'path': '/mnt/tigrlab/projects/mjoseph/tutorials/carpentry/SDC-BIDS-IntroMRI/data/ds000030/sub-10159/anat/sub-10159_T1w.json',
  'annexkey': 'MD5E-s1180--903e6d1d0a4aba79b3317fbad68c13aa.json',
  'message': 'checking http://openneuro.s3.amazonaws.com/ds000030/ds000030_R1.0.3/uncompressed/sub-10159/anat/sub-10159_T1w.json?versionId=lmDP2N7EM4VLM2Qu98_PMBKILnNheGF_...'},
 {'action': 'drop',
  'type': 'file',
  'refds': '/mnt/tigrlab/projects/mjoseph/tutorials/carpentry/SDC-BIDS-IntroMRI/data/ds000030',
  'status': 'ok',
  'path': '/mnt/tigrlab/projects/mjoseph/tutorials/carpentry/SDC-BIDS-IntroMRI/data/ds000030/sub-10159/anat/sub-10159_T1w.nii.gz',
  'annexkey': 'MD5E-s11637742--75cab8005361c2504ba5a7f02ecbacd7.nii.gz',
  'message': 'checking http://openneuro.s3.amazonaws.com/ds000030/ds000030_R1.0.3/uncompressed/sub-10159/anat/sub-1015

### Amazon Web Services (AWS)

In [None]:
!aws s3 sync --no-sign-request \
  s3://openneuro/ds000030/ds000030_R1.0.5/uncompressed/sub-10159 \
  data/ds000030/sub-10159

In [None]:
!aws s3 sync --no-sign-request \
  s3://openneuro/ds000030/ds000030_R1.0.5/uncompressed/sub-10159 \
  data/ds000030/sub-10159 \
  --exclude '*' \
  --include '*task-rest_bold*'
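The second command combines `--exclude '*'` with `--include '*task-rest_bold*'`. In the AWS CLI, filters are applied in the order given and later filters take precedence, so this pair means "skip everything, then add back the resting-state files". The sketch below only imitates that ordering with Python's `fnmatch`; it is not the AWS CLI's implementation, and the file names are made up for illustration.

```python
from fnmatch import fnmatch


def apply_filters(keys, filters):
    """Imitate ordered exclude/include filters; the last matching filter wins.

    filters is a list of (kind, pattern) pairs, with kind being
    "exclude" or "include". Files are included by default.
    """
    selected = []
    for key in keys:
        keep = True
        for kind, pattern in filters:
            if fnmatch(key, pattern):
                keep = kind == "include"
        if keep:
            selected.append(key)
    return selected


# Hypothetical file names relative to the subject directory.
keys = [
    "anat/sub-10159_T1w.nii.gz",
    "func/sub-10159_task-rest_bold.json",
    "func/sub-10159_task-rest_bold.nii.gz",
    "func/sub-10159_task-bart_bold.nii.gz",
]
filters = [("exclude", "*"), ("include", "*task-rest_bold*")]
selected = apply_filters(keys, filters)
print(selected)
```

Swapping the two filters would exclude everything, since the broader `exclude` pattern would then be applied last.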

## Exploring Data

Below is a tree diagram showing the folder structure of a single MR session within ds000030. This was obtained by using the bash command `tree`.  
`!tree data/ds000030`

```
ds000030
├── CHANGES
├── dataset_description.json
├── derivatives
│   └── fmriprep
├── participants.tsv
├── README
├── sub-50083
│   ├── anat
│   │   ├── sub-50083_T1w.json
│   │   └── sub-50083_T1w.nii.gz
│   └── func
│       ├── sub-50083_task-rest_bold.json
│       └── sub-50083_task-rest_bold.nii.gz
└── task-rest_bold.json
```
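The file names in the tree follow the BIDS naming pattern: `key-value` pairs separated by underscores, ending in a suffix (such as `T1w` or `bold`) and an extension. As a rough illustration, the hypothetical helper below splits a file name into those parts. It is a simplified sketch and does not validate names against the BIDS specification.

```python
def parse_bids_name(filename):
    """Split a BIDS-style file name into entities, suffix and extension."""
    # Everything after the first dot is treated as the extension (e.g. "nii.gz").
    stem, _, extension = filename.partition(".")
    # The last underscore-separated chunk is the suffix; the rest are key-value pairs.
    *pairs, suffix = stem.split("_")
    entities = dict(pair.split("-", 1) for pair in pairs)
    return {"entities": entities, "suffix": suffix, "extension": "." + extension}


parsed = parse_bids_name("sub-50083_task-rest_bold.nii.gz")
print(parsed)
```

Here `sub` and `task` are the entity keys, `bold` is the suffix, and `.nii.gz` is the extension.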

The `participants.tsv` file describes demographic information for each participant in the study (e.g. age, handedness, sex). Let's take a look at the `participants.tsv` file to see what's been included in this dataset.

In order to load the data into Python, we'll need to import the `pandas` package. The `pandas` **dataframe** is Python's equivalent to an Excel spreadsheet.

In [5]:
import pandas as pd

We'll use the `read_csv()` function. It requires us to specify the name of the file we want to import and the separator that is used to distinguish each column in our file (`\t` since we're working with a `.tsv` file).

In [9]:
participant_metadata = pd.read_csv('../../data/ds000030/participants.tsv', sep='\t')
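To see what the `sep` argument does without needing the dataset on disk, the sketch below reads a small tab-separated table from an in-memory string. The rows are illustrative values trimmed to four columns, and `io.StringIO` simply stands in for a file.

```python
import io

import pandas as pd

# A tiny tab-separated table with three participants and four columns.
tsv_text = (
    "participant_id\tdiagnosis\tage\tgender\n"
    "sub-10159\tCONTROL\t30\tF\n"
    "sub-10171\tCONTROL\t24\tM\n"
    "sub-10189\tCONTROL\t49\tM\n"
)

# sep='\t' tells pandas that columns are separated by tabs.
demo = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(demo.shape)

# With the default separator (a comma), each line lands in a single column.
wrong = pd.read_csv(io.StringIO(tsv_text))
print(wrong.shape)
```

A dataframe with a single, oddly named column is usually a sign that the separator was not specified correctly.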

In order to get a glimpse of our data, we'll use the `head()` function. By default, `head` prints the first 5 rows of our dataframe.

In [10]:
participant_metadata.head()

Unnamed: 0,participant_id,diagnosis,age,gender,bart,bht,dwi,pamenc,pamret,rest,scap,stopsignal,T1w,taskswitch,ScannerSerialNumber
0,sub-10159,CONTROL,30,F,1.0,,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
1,sub-10171,CONTROL,24,M,1.0,1.0,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
2,sub-10189,CONTROL,49,M,1.0,,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
3,sub-10193,CONTROL,40,M,1.0,,1.0,,,,,,1.0,,35343.0
4,sub-10206,CONTROL,21,M,1.0,,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0


We can view any number of rows by passing `n=<number of rows>` as an argument to `head()`.  
If we want to select particular rows within the dataframe, we can use the `loc[]` indexer and identify the rows we want based on their index label (the numbers in the left-most column).

In [11]:
participant_metadata.loc[[6, 10, 12]]

Unnamed: 0,participant_id,diagnosis,age,gender,bart,bht,dwi,pamenc,pamret,rest,scap,stopsignal,T1w,taskswitch,ScannerSerialNumber
6,sub-10225,CONTROL,35,M,1.0,1.0,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
10,sub-10249,CONTROL,28,M,1.0,1.0,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
12,sub-10271,CONTROL,41,F,1.0,1.0,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0


**EXERCISE:** Select the first 5 rows of the dataframe using `loc[]`.

In [12]:
participant_metadata.loc[:4]

Unnamed: 0,participant_id,diagnosis,age,gender,bart,bht,dwi,pamenc,pamret,rest,scap,stopsignal,T1w,taskswitch,ScannerSerialNumber
0,sub-10159,CONTROL,30,F,1.0,,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
1,sub-10171,CONTROL,24,M,1.0,1.0,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
2,sub-10189,CONTROL,49,M,1.0,,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
3,sub-10193,CONTROL,40,M,1.0,,1.0,,,,,,1.0,,35343.0
4,sub-10206,CONTROL,21,M,1.0,,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
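Note that `loc[:4]` returns five rows because label-based slices include the end label, unlike ordinary Python slices. The sketch below contrasts `loc` with the position-based `iloc` on a small made-up dataframe whose index labels do not start at zero.

```python
import pandas as pd

# Made-up data with index labels 10..15 so labels and positions differ.
demo = pd.DataFrame({"age": [30, 24, 49, 40, 21, 33]}, index=range(10, 16))

by_label = demo.loc[:12]     # labels up to and including 12
by_position = demo.iloc[:3]  # positions 0, 1 and 2
first_five = demo.iloc[:5]   # first five rows, whatever the labels are

print(len(by_label), len(by_position), len(first_five))
```

When the index labels are not a clean 0, 1, 2, ... sequence (for example after filtering), `iloc` is the safer way to ask for "the first n rows".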


**EXERCISE:** How many participants do we have in total?

In [13]:
participant_metadata.shape

(272, 15)

There are 2 different methods of selecting columns in a dataframe:  
*  `participant_metadata['<column_name>']` (this is similar to selecting a key in a Python dictionary)  
*  `participant_metadata.<column_name>`  

Another way to see how many participants are in the study is to select the `participant_id` column and use the `count()` function.

In [14]:
participant_metadata['participant_id'].count()

272
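One caveat: `count()` reports the number of non-missing values, so it matches the number of rows only when the column has no gaps. A small illustration with made-up values:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "participant_id": ["sub-01", "sub-02", "sub-03"],
        "bht": [1.0, np.nan, 1.0],  # one missing value
    }
)

n_rows = demo.shape[0]                  # number of rows
n_ids = demo["participant_id"].count()  # non-missing participant IDs
n_bht = demo["bht"].count()             # non-missing bht values

print(n_rows, n_ids, n_bht)
```

Counting `participant_id` works for this purpose because every row is expected to have an ID.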

**EXERCISE:** Which diagnosis groups are part of the study?  
*Hint: use the* `unique()` *function.*

In [15]:
participant_metadata['diagnosis'].unique()

array(['CONTROL', 'SCHZ', 'BIPOLAR', 'ADHD'], dtype=object)

If we want to count the number of participants in each diagnosis group, we can use the `value_counts()` function.

In [16]:
participant_metadata['diagnosis'].value_counts()

CONTROL    130
SCHZ        50
BIPOLAR     49
ADHD        43
Name: diagnosis, dtype: int64

**EXERCISE:** How many males and females are in the study? How many are in each diagnosis group?

In [17]:
participant_metadata['gender'].value_counts()

M    155
F    117
Name: gender, dtype: int64

In [18]:
participant_metadata.groupby(['diagnosis', 'gender']).size()

diagnosis  gender
ADHD       F         22
           M         21
BIPOLAR    F         21
           M         28
CONTROL    F         62
           M         68
SCHZ       F         12
           M         38
dtype: int64
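The same kind of counts can also be laid out as a two-way table with `pd.crosstab`, which some find easier to read. The sketch uses a handful of made-up rows rather than the real participants file.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "diagnosis": ["CONTROL", "CONTROL", "SCHZ", "SCHZ", "SCHZ", "ADHD"],
        "gender": ["F", "M", "M", "M", "F", "F"],
    }
)

# Rows are diagnosis groups, columns are genders, cells are counts.
table = pd.crosstab(demo["diagnosis"], demo["gender"])
print(table)
```

Combinations that never occur show up as 0 in the table, whereas `groupby(...).size()` simply omits them.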

When looking at the participant dataframe, we noticed that there is a column called `ghost_NoGhost`. We should look at the README file that comes with the dataset to find out more about this.

In [20]:
!cat ../../data/ds000030/README

## UCLA Consortium for Neuropsychiatric Phenomics LA5c Study

## Feedback and Discussion of Dataset DS000030

Feedback, discussions, and comments are welcome. For information on how and where to discuss this data set, please see the OpenfMRI FAQ: https://openfmri.org/faq/ item "**Is there a place to discuss these datasets with the larger community?**"

## Data Organization

The data set is organized in BIDS version 1.0.0rc3 (http://bids.neuroimaging.io) format.

## Subjects / Participants
The participants.tsv file contains subject IDs with demographic informations as well as an inventory of the scans that are included for each subject.

## Dataset Derivatives (/derivatives)
The /derivaties folder contains summary information that reflects the data and its contents:

1. Final_Scan_Count.pdf - Plot showing the over all scan inclusion, for quick reference.
2. parameter_plots/ - Folder contains many scan parameters plotted over time. Plot symbols are color coded by imaging site. Intended t

For this tutorial, we're just going to work with participants that are either CONTROL or SCHZ (`diagnosis`) and have both a T1w (`T1w == 1`) and rest (`rest == 1`) scan.

**EXERCISE:** Filter `participant_metadata` so that only participants who meet the above conditions remain.

In [21]:
participant_metadata = participant_metadata[(participant_metadata.diagnosis.isin(['CONTROL', 'SCHZ'])) & 
                                            (participant_metadata.T1w == 1) & 
                                            (participant_metadata.rest == 1)]
participant_metadata

Unnamed: 0,participant_id,diagnosis,age,gender,bart,bht,dwi,pamenc,pamret,rest,scap,stopsignal,T1w,taskswitch,ScannerSerialNumber
0,sub-10159,CONTROL,30,F,1.0,,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
1,sub-10171,CONTROL,24,M,1.0,1.0,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
2,sub-10189,CONTROL,49,M,1.0,,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
4,sub-10206,CONTROL,21,M,1.0,,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
5,sub-10217,CONTROL,33,F,1.0,,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
175,sub-50077,SCHZ,29,M,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,35426.0
176,sub-50080,SCHZ,29,M,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,35426.0
177,sub-50081,SCHZ,32,M,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,35426.0
178,sub-50083,SCHZ,40,M,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,35426.0
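The filter combines three boolean masks with `&`; each condition needs its own parentheses because `&` binds more tightly than `==`. The sketch below builds the masks separately on made-up rows, which can make a long filter easier to read.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "participant_id": ["sub-01", "sub-02", "sub-03", "sub-04"],
        "diagnosis": ["CONTROL", "ADHD", "SCHZ", "SCHZ"],
        "T1w": [1.0, 1.0, 1.0, None],   # None becomes NaN (scan missing)
        "rest": [1.0, 1.0, None, 1.0],
    }
)

in_groups = demo["diagnosis"].isin(["CONTROL", "SCHZ"])
has_t1w = demo["T1w"] == 1
has_rest = demo["rest"] == 1

# A row is kept only if all three masks are True for it.
kept = demo[in_groups & has_t1w & has_rest]
print(kept["participant_id"].tolist())
```

Missing values compare as `False`, so participants without a recorded scan drop out of the result.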


To simplify the analysis and shorten the download time, we're just going to use scans from 10 randomly sampled CONTROL and 10 randomly sampled SCHZ participants.

In [22]:
diagnosis_groups = participant_metadata.groupby('diagnosis')
filtered_participant_metadata = diagnosis_groups.apply(lambda x: x.sample(n = 10))
filtered_participant_metadata

diagnosis,,participant_id,diagnosis,age,gender,bart,bht,dwi,pamenc,pamret,rest,scap,stopsignal,T1w,taskswitch,ScannerSerialNumber
CONTROL,19,sub-10304,CONTROL,23,F,1.0,1.0,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
CONTROL,46,sub-10506,CONTROL,25,M,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,35343.0
CONTROL,43,sub-10487,CONTROL,31,M,1.0,1.0,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
CONTROL,47,sub-10517,CONTROL,21,F,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,35343.0
CONTROL,67,sub-10692,CONTROL,28,F,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,35343.0
CONTROL,62,sub-10672,CONTROL,29,M,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,35343.0
CONTROL,8,sub-10228,CONTROL,40,F,1.0,1.0,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
CONTROL,40,sub-10460,CONTROL,40,M,1.0,1.0,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
CONTROL,106,sub-11061,CONTROL,45,F,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,35426.0
CONTROL,20,sub-10316,CONTROL,29,M,1.0,1.0,1.0,,,1.0,1.0,1.0,1.0,1.0,35343.0
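Because `sample()` draws at random, rerunning the cell gives a different set of participants each time. If a repeatable selection is wanted, one option is to pass `random_state`. The sketch below uses made-up data and `DataFrameGroupBy.sample`, which assumes pandas 1.1 or newer.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "participant_id": [f"sub-{i:02d}" for i in range(12)],
        "diagnosis": ["CONTROL"] * 6 + ["SCHZ"] * 6,
    }
)

# Draw 2 participants per group; a fixed random_state makes the draw repeatable.
first = demo.groupby("diagnosis").sample(n=2, random_state=0)
second = demo.groupby("diagnosis").sample(n=2, random_state=0)

group_sizes = first["diagnosis"].value_counts().to_dict()
same_draw = first["participant_id"].tolist() == second["participant_id"].tolist()
print(group_sizes, same_draw)
```

The value `0` is arbitrary; any fixed integer gives a selection that can be reproduced later.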


In [23]:
participant_list = filtered_participant_metadata.participant_id.tolist()
participant_list

['sub-10304',
 'sub-10506',
 'sub-10487',
 'sub-10517',
 'sub-10692',
 'sub-10672',
 'sub-10228',
 'sub-10460',
 'sub-11061',
 'sub-10316',
 'sub-50038',
 'sub-50022',
 'sub-50064',
 'sub-50033',
 'sub-50005',
 'sub-50014',
 'sub-50069',
 'sub-50054',
 'sub-50049',
 'sub-50080']
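A list like this can be written to a plain text file, one ID per line, which is a convenient layout for command-line tools such as `xargs`. This is only a sketch of how such a file might be produced; the list is shortened and the output path is a temporary file chosen for the demo.

```python
import tempfile
from pathlib import Path

participant_list = ["sub-10304", "sub-10506", "sub-50038"]  # shortened example

with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "download_list"
    # One participant ID per line, with a trailing newline.
    list_file.write_text("\n".join(participant_list) + "\n")
    read_back = list_file.read_text().splitlines()

print(read_back)
```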

We've already randomly sampled 10 CONTROL and 10 SCHZ participants and placed the participant list in the `../download_list` text file. Let's download that data now.

In [None]:
# # download T1w scans
# !cat ../download_list | \
#   xargs -I '{}' aws s3 sync --no-sign-request \
#   s3://openneuro/ds000030/ds000030_R1.0.5/uncompressed/{}/anat \
#   ../data/ds000030/{}/anat

# # download resting state fMRI scans
# !cat ../download_list | \
#   xargs -I '{}' aws s3 sync --no-sign-request \
#   s3://openneuro/ds000030/ds000030_R1.0.5/uncompressed/{}/func \
#   ../data/ds000030/{}/func \
#   --exclude '*' \
#   --include '*task-rest_bold*'

# # download fmriprep preprocessed anat data
# !cat ../download_list | \
#   xargs -I '{}' aws s3 sync --no-sign-request \
#   s3://openneuro/ds000030/ds000030_R1.0.5/uncompressed/derivatives/fmriprep/{}/anat \
#   ../data/ds000030/derivatives/fmriprep/{}/anat

# # download fmriprep preprocessed func data
# !cat ../download_list | \
#   xargs -I '{}' aws s3 sync --no-sign-request \
#   s3://openneuro/ds000030/ds000030_R1.0.5/uncompressed/derivatives/fmriprep/{}/func \
#   ../data/ds000030/derivatives/fmriprep/{}/func \
#   --exclude '*' \
#   --include '*task-rest_bold*'

## Querying a BIDS Dataset

[pybids](https://bids-standard.github.io/pybids/) is a Python API for querying, summarizing and manipulating the BIDS folder structure.

In [24]:
from bids.layout import BIDSLayout

In [25]:
layout = BIDSLayout('../../data/ds000030')

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/tigrlab/projects/mjoseph/tutorials/carpentry/SDC-BIDS-IntroMRI/data/ds000030/phenotype/acds_adult.json'

The pybids layout object lets you query your BIDS dataset according to a number of parameters by using a `get_*()` method.  
We can get a list of the subjects we've downloaded from the dataset.

In [26]:
layout.get_subjects()

NameError: name 'layout' is not defined

To get a list of all of the files, just use `get()`. 

In [None]:
layout.get()

There are many arguments we can use to filter down this list. Any BIDS-defined keyword can be passed on as a constraint. In `pybids`, these keywords are known as **entities**. For a complete list of possibilities:

In [None]:
layout.entities

For example, if we only want the file paths of all of our resting state fMRI scans,

In [None]:
layout.get(datatype='func', suffix='bold', task='rest', extension='.nii.gz', return_type='file')

**EXERCISE:** Retrieve the file paths of any scan where the subject is '10292' or '50081' and the `RepetitionTime` is 2 seconds.

In [None]:
layout.get(subject=['10292', '50081'], RepetitionTime=2, return_type='file')

Let's save the first file from our list of file paths to a variable and pull the metadata from its associated JSON file using the `get_metadata()` function.

In [None]:
fmri_file = layout.get(subject=['10292', '50081'], RepetitionTime=2, return_type='file')[0]
layout.get_metadata(fmri_file)
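In the tree shown earlier, metadata lives both in a dataset-level `task-rest_bold.json` and in per-subject sidecar files. Under the BIDS inheritance principle, values from the more specific file take precedence over dataset-level ones. The sketch below imitates that merge with plain dictionaries; it is a simplified illustration with made-up values, not pybids' actual implementation.

```python
import json

# Made-up contents of a dataset-level sidecar and a subject-level sidecar.
dataset_level = json.loads('{"RepetitionTime": 2, "TaskName": "rest"}')
subject_level = json.loads('{"EchoTime": 0.03, "TaskName": "rest (run 1)"}')

# Later dictionaries win, so subject-level values override dataset-level ones.
merged = {**dataset_level, **subject_level}
print(merged)
```

This is why a query on `RepetitionTime` can match a scan even when that key appears only in the dataset-level JSON file.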

We can even collect the metadata for all of our fMRI scans into a list and convert this into a dataframe.

In [None]:
metadata_list = []
all_fmri_files = layout.get(datatype='func', suffix='bold', return_type='file', extension='.nii.gz')
for fmri_file in all_fmri_files:
    fmri_metadata = layout.get_metadata(fmri_file)
    metadata_list.append(fmri_metadata)
df = pd.DataFrame.from_records(metadata_list)
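To show what the final step produces without needing the dataset, the sketch below converts a list of made-up metadata dictionaries into a dataframe. Keys missing from one record become `NaN` in that row.

```python
import pandas as pd

# Made-up metadata records standing in for the output of get_metadata().
metadata_list = [
    {"RepetitionTime": 2, "EchoTime": 0.03, "TaskName": "rest"},
    {"RepetitionTime": 2, "TaskName": "rest"},  # EchoTime missing
]

demo = pd.DataFrame.from_records(metadata_list)
print(demo)
```

Each dictionary becomes one row and each metadata key becomes a column, which makes it easy to spot scans whose acquisition parameters differ.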
df