## MRI Datasets

For this workshop and the fMRI and DWI workshops that follow, we will be using a subset of a publicly available dataset, ds000030, from [openneuro.org](https://openneuro.org/datasets/ds000030). This dataset, like all others hosted on OpenNeuro, is structured according to BIDS.

### OpenNeuro
- client-side BIDS validation
- resumable uploads
- running BIDS apps

### Downloading Data

#### Datalad

`Datalad` installs the dataset, which means that we get the "small" data (i.e. the text files) and the download instructions for the larger files. We can now navigate the dataset like it's a file system and plan our analysis.

In [9]:
import datalad.api as dl



In [11]:
ds = dl.install('../../data/ds000030', source='///openfmri/ds000030')

[INFO] Cloning http://datasets.datalad.org/openfmri/ds000030 [1 other candidates] into '/mnt/tigrlab/projects/mjoseph/tutorials/carpentry/SDC-BIDS-IntroMRI/data/ds00030' 



[INFO] pyenv: git-annex-remote-datalad-archives: command not found 
[INFO]  
[INFO] The `git-annex-remote-datalad-archives' command exists in these Python versions: 
[INFO]   3.7.3/envs/carpentry_venv 
[INFO]   3.7.3/envs/tigrlab_venv 
[INFO]   carpentry_venv 
[INFO]   tigrlab_venv 
[INFO] pyenv: git-annex-remote-datalad-archives: command not found 
[INFO] The `git-annex-remote-datalad-archives' command exists in these Python versions: 
[INFO]   3.7.3/envs/carpentry_venv 
[INFO]   3.7.3/envs/tigrlab_venv 
[INFO]   carpentry_venv 
[INFO]   tigrlab_venv 
[INFO]   external special remote protocol error, unexpectedly received "" (unable to parse command) 


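Getting and dropping data: `get()` downloads the content of the requested files, and `drop()` removes that content from local storage again while keeping the files listed in the dataset.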

In [None]:
# relative paths passed to a Dataset method are resolved from the dataset root
ds.get('sub-10159')

In [None]:
ds.drop('sub-10159')
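
After `get()` or `drop()`, it can be useful to check which files actually have their content on disk. The sketch below assumes the default git-annex layout, in which a large file whose content has not been fetched is a symbolic link pointing to a missing target. The helper names `content_is_present` and `is_unfetched_annex_link` are hypothetical and are not part of the DataLad API.

```python
import os


def content_is_present(path):
    """Return True if `path` exists and its content is available locally.

    Assumption: the dataset uses git-annex's default symlink layout, so a
    file without local content is a symlink whose target does not exist.
    """
    # os.path.exists() follows symlinks, so a dangling link yields False
    return os.path.exists(path)


def is_unfetched_annex_link(path):
    """Return True if `path` is a symlink whose target is missing."""
    return os.path.islink(path) and not os.path.exists(path)
```

Calling `content_is_present()` on a scan inside `sub-10159` before and after `ds.get()` should show the difference, provided the dataset was cloned in that default layout.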

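#### Amazon Web Services (AWS)

The dataset is also stored in a public S3 bucket, so it can be downloaded with the AWS command line interface without credentials (`--no-sign-request`). The first command below downloads everything for one subject; the second uses `--exclude` and `--include` to fetch only the resting state scans.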

In [None]:
!aws s3 sync --no-sign-request \
  s3://openneuro/ds000030/ds000030_R1.0.5/uncompressed/sub-10159 \
  data/ds000030/sub-10159

In [None]:
!aws s3 sync --no-sign-request \
  s3://openneuro/ds000030/ds000030_R1.0.5/uncompressed/sub-10159 \
  data/ds000030/sub-10159 \
  --exclude '*' \
  --include '*task-rest_bold*'

### Exploring Data

Below is a tree diagram showing the folder structure of a single MR session within ds000030. This was obtained by using the bash command `tree`.  
`!tree data/ds000030`

```
ds000030
├── CHANGES
├── dataset_description.json
├── derivatives
│   └── fmriprep
├── participants.tsv
├── README
├── sub-50083
│   ├── anat
│   │   ├── sub-50083_T1w.json
│   │   └── sub-50083_T1w.nii.gz
│   └── func
│       ├── sub-50083_task-rest_bold.json
│       └── sub-50083_task-rest_bold.nii.gz
└── task-rest_bold.json
```
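
The file names in the tree follow the BIDS naming pattern: underscore-separated `key-value` entities, then a suffix, then the extension. As a rough illustration of how that pattern can be taken apart, here is a minimal sketch; `parse_bids_filename` is a hypothetical helper written for this example and does not validate names against the BIDS specification.

```python
def parse_bids_filename(filename):
    """Split a BIDS file name into its entities, suffix and extension."""
    # Everything from the first "." onwards is the extension (e.g. ".nii.gz")
    stem, dot, ext = filename.partition('.')
    parts = stem.split('_')
    # The last underscore-separated chunk is the suffix (e.g. "bold")
    parsed = {'suffix': parts[-1], 'extension': dot + ext}
    # The remaining chunks are "key-value" entities (e.g. "sub-50083")
    for chunk in parts[:-1]:
        key, _, value = chunk.partition('-')
        parsed[key] = value
    return parsed


print(parse_bids_filename('sub-50083_task-rest_bold.nii.gz'))
print(parse_bids_filename('task-rest_bold.json'))
```

The second call shows that the dataset-level `task-rest_bold.json` carries no `sub` entity, which is why it applies to every subject.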

The `participants.tsv` file describes demographic information for each participant in the study (e.g., age, handedness, sex). Let's take a look at the `participants.tsv` file to see what's been included in this dataset.

In order to load the data into Python, we'll need to import the `pandas` package. The `pandas` **dataframe** is Python's equivalent to an Excel spreadsheet.

In [None]:
import pandas as pd

We'll use the `read_csv()` function. It requires us to specify the name of the file we want to import and the separator that is used to distinguish each column in our file (`\t` since we're working with a `.tsv` file).

In [None]:
participant_metadata = pd.read_csv('../../data/ds000030/participants.tsv', sep='\t')

In order to get a glimpse of our data, we'll use the `head()` method. By default, `head()` returns the first 5 rows of our dataframe.

In [None]:
participant_metadata.head()

We can view any number of rows by passing `n` as an argument to `head()`, e.g. `head(n=10)`.  
If we want to select particular rows within the dataframe, we can use the `loc[]` indexer and identify the rows we want based on their index label (the numbers in the left-most column).

In [None]:
participant_metadata.loc[[6, 10, 12]]

**EXERCISE**: Select the first 5 rows of the dataframe using `loc[]`.

In [None]:
participant_metadata.loc[:4]
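
Note that `loc[]` slices by label and includes the end label, which is why `loc[:4]` returns 5 rows. The sketch below contrasts this with position-based `iloc[]` on a toy dataframe; the participant IDs are made up for illustration.

```python
import pandas as pd

# Toy dataframe standing in for participant_metadata (made-up IDs)
toy = pd.DataFrame({'participant_id': [f'sub-{i:02d}' for i in range(8)]})

by_label = toy.loc[:4]      # label-based: includes the row labelled 4
by_position = toy.iloc[:5]  # position-based: stops before position 5

# With the default 0..n-1 index both selections hold the same 5 rows
print(by_label.equals(by_position))

# After filtering, labels and positions no longer line up
subset = toy[toy.index % 2 == 0]       # keeps labels 0, 2, 4, 6
print(subset.loc[:4].index.tolist())   # labels up to and including 4
print(subset.iloc[:2].index.tolist())  # first two rows by position
```

This distinction matters later in the notebook, once rows have been filtered out and the index labels are no longer consecutive.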

**EXERCISE:** How many participants do we have in total?

In [None]:
participant_metadata.shape

There are two different ways of selecting columns in a dataframe:  
*  `participant_metadata['<column_name>']` (this is similar to selecting a key in a Python dictionary)
*  `participant_metadata.<column_name>`

Another way to see how many participants are in the study is to select the `participant_id` column and use the `count()` method, which counts the non-missing values in that column.

In [None]:
participant_metadata['participant_id'].count()

**EXERCISE:** Which diagnosis groups are part of the study?  
*Hint: use the* `unique()` *function.*

In [None]:
participant_metadata['diagnosis'].unique()

If we want to count the number of participants in each diagnosis group, we can use the `value_counts()` function.

In [None]:
participant_metadata['diagnosis'].value_counts()

**EXERCISE:** How many males and females are in the study? How many are in each diagnosis group?

In [None]:
participant_metadata['gender'].value_counts()

In [None]:
participant_metadata.groupby(['diagnosis', 'gender']).size()
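
The same two-way count can also be laid out as a table with `pd.crosstab()`, which some people find easier to read than the stacked output of `groupby().size()`. A small sketch on toy data follows; the values, including the `F`/`M` coding, are made up and may not match how `participants.tsv` encodes them.

```python
import pandas as pd

# Toy data standing in for participant_metadata (values are made up)
toy = pd.DataFrame({
    'diagnosis': ['CONTROL', 'CONTROL', 'SCHZ', 'SCHZ', 'SCHZ'],
    'gender': ['F', 'M', 'F', 'F', 'M'],
})

# Rows are diagnosis groups, columns are gender categories
counts = pd.crosstab(toy['diagnosis'], toy['gender'])
print(counts)
```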

When looking at the participant dataframe, we noticed that there is a column called `ghost_NoGhost`. We should look at the README file that comes with the dataset to find out more about this.

In [None]:
!cat ../data/ds000030/README

For this tutorial, we're just going to work with participants that are either CONTROL or SCHZ (`diagnosis`) and have both a T1w (`T1w == 1`) and rest (`rest == 1`) scan. Also, we'll only use data without a ghosting artifact in the T1w scan (`ghost_NoGhost == 'No_ghost'`).

**EXERCISE:** Filter `participant_metadata` so that only rows meeting the above conditions remain.

In [None]:
participant_metadata = participant_metadata[(participant_metadata.diagnosis.isin(['CONTROL', 'SCHZ'])) & 
                                            (participant_metadata.T1w == 1) & 
                                            (participant_metadata.rest == 1) & 
                                            (participant_metadata.ghost_NoGhost == 'No_ghost')]
participant_metadata

To simplify the analysis and reduce the time required to download the data, we're just going to use scans from 10 randomly sampled CONTROL and 10 randomly sampled SCHZ participants.

In [None]:
diagnosis_groups = participant_metadata.groupby('diagnosis')
filtered_participant_metadata = diagnosis_groups.apply(lambda x: x.sample(n = 10))
filtered_participant_metadata
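
Because `sample()` draws different rows on every run, the selection above is not reproducible. One option, sketched here on toy data and assuming pandas 1.1 or newer (which provides `sample()` directly on grouped dataframes), is to fix the seed with `random_state`. The IDs and the seed value are arbitrary choices for this example.

```python
import pandas as pd

# Toy data standing in for the filtered participant_metadata (made-up IDs)
toy = pd.DataFrame({
    'participant_id': [f'sub-{i:02d}' for i in range(12)],
    'diagnosis': ['CONTROL'] * 6 + ['SCHZ'] * 6,
})

# random_state fixes the seed so the same rows are drawn on every run
sample_a = toy.groupby('diagnosis').sample(n=3, random_state=42)
sample_b = toy.groupby('diagnosis').sample(n=3, random_state=42)

print(sample_a['diagnosis'].value_counts())
print(sample_a.equals(sample_b))
```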

In [None]:
participant_list = filtered_participant_metadata.participant_id.tolist()
participant_list
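
A list like this can be saved to a plain text file with one ID per line, which is the layout that `xargs -I` reads line by line. The sketch below is one way to do that; the function names and the example IDs are hypothetical, and it writes to a temporary directory rather than to the workshop's own files.

```python
import tempfile
from pathlib import Path


def write_download_list(ids, path):
    """Write one participant ID per line."""
    Path(path).write_text('\n'.join(ids) + '\n')


def read_download_list(path):
    """Read the IDs back, skipping blank lines."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


# Made-up IDs; in the notebook this would be `participant_list`
example_ids = ['sub-10159', 'sub-50083']
with tempfile.TemporaryDirectory() as tmp_dir:
    list_path = Path(tmp_dir) / 'download_list'
    write_download_list(example_ids, list_path)
    round_trip = read_download_list(list_path)
print(round_trip)
```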

We've already randomly sampled 10 CONTROL and 10 SCHZ participants and placed the participant list in the `../download_list` text file. Let's download that data now.

In [None]:
# # download T1w scans
# !cat ../download_list | \
#   xargs -I '{}' aws s3 sync --no-sign-request \
#   s3://openneuro/ds000030/ds000030_R1.0.5/uncompressed/{}/anat \
#   ../data/ds000030/{}/anat

# # download resting state fMRI scans
# !cat ../download_list | \
#   xargs -I '{}' aws s3 sync --no-sign-request \
#   s3://openneuro/ds000030/ds000030_R1.0.5/uncompressed/{}/func \
#   ../data/ds000030/{}/func \
#   --exclude '*' \
#   --include '*task-rest_bold*'

# # download fmriprep preprocessed anat data
# !cat ../download_list | \
#   xargs -I '{}' aws s3 sync --no-sign-request \
#   s3://openneuro/ds000030/ds000030_R1.0.5/uncompressed/derivatives/fmriprep/{}/anat \
#   ../data/ds000030/derivatives/fmriprep/{}/anat

# # download fmriprep preprocessed func data
# !cat ../download_list | \
#   xargs -I '{}' aws s3 sync --no-sign-request \
#   s3://openneuro/ds000030/ds000030_R1.0.5/uncompressed/derivatives/fmriprep/{}/func \
#   ../data/ds000030/derivatives/fmriprep/{}/func \
#   --exclude '*' \
#   --include '*task-rest_bold*'
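
The `xargs` pipelines above can also be expressed in Python, which may be easier to adapt or debug. The sketch below only builds the argument lists and prints them; `build_sync_command` is a hypothetical helper, and the bucket path and local folder are copied from the commands above.

```python
def build_sync_command(subject, datatype, include=None):
    """Return the `aws s3 sync` argument list for one subject folder."""
    remote = ('s3://openneuro/ds000030/ds000030_R1.0.5/uncompressed/'
              f'{subject}/{datatype}')
    local = f'../data/ds000030/{subject}/{datatype}'
    cmd = ['aws', 's3', 'sync', '--no-sign-request', remote, local]
    if include is not None:
        # Exclude everything first, then re-include the matching files
        cmd += ['--exclude', '*', '--include', include]
    return cmd


# Made-up subject list; in practice this would come from the download list
for subject in ['sub-10159', 'sub-50083']:
    print(' '.join(build_sync_command(subject, 'func', '*task-rest_bold*')))
```

To actually run a command, each list could be passed to `subprocess.run(cmd, check=True)`, assuming the AWS CLI is installed.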

### Querying a BIDS Dataset

[pybids](https://bids-standard.github.io/pybids/) is a Python API for querying, summarizing and manipulating the BIDS folder structure.

In [None]:
from bids.layout import BIDSLayout

In [None]:
layout = BIDSLayout('../data/ds000030')

The pybids layout object lets you query your BIDS dataset according to a number of parameters by using a `get_*()` method.  
We can get a list of the subjects we've downloaded from the dataset.

In [None]:
layout.get_subjects()

To get a list of all of the files, just use `get()`. 

In [None]:
layout.get()

There are many arguments we can use to filter down this list. Any BIDS-defined keyword can be passed on as a constraint. In `pybids`, these keywords are known as **entities**. For a complete list of possibilities:

In [None]:
layout.entities

For example, to get only the file paths of all of our resting state fMRI scans:

In [None]:
# recent pybids releases name this filter `extension` (older ones used `extensions`)
layout.get(datatype='func', suffix='bold', task='rest', extension='.nii.gz', return_type='file')

**EXERCISE**: Retrieve the file paths of any scan where the subject is '10292' or '50081' and the `RepetitionTime` is 2 seconds.

In [None]:
layout.get(subject=['10292', '50081'], RepetitionTime=2, return_type='file')

Let's save the first file from our list of file paths to a variable and pull the metadata from its associated JSON file using the `get_metadata()` function.

In [None]:
fmri_file = layout.get(subject=['10292', '50081'], RepetitionTime=2, return_type='file')[0]
layout.get_metadata(fmri_file)
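
The folder tree shown earlier contains a dataset-level `task-rest_bold.json` as well as a sidecar JSON file for each subject. Under the BIDS inheritance principle, fields from both apply to a scan, with the more specific file taking precedence; `get_metadata()` is understood to return the merged result. A rough sketch of that merge follows, using made-up field values. `merge_sidecars` is a hypothetical helper, not a pybids function.

```python
def merge_sidecars(*sidecars):
    """Merge metadata dicts ordered from least to most specific."""
    merged = {}
    for sidecar in sidecars:
        # Later (more specific) sidecars overwrite earlier values
        merged.update(sidecar)
    return merged


# Made-up values for illustration only
dataset_level = {'TaskName': 'rest', 'RepetitionTime': 2}
subject_level = {'EchoTime': 0.03}
print(merge_sidecars(dataset_level, subject_level))
```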

We can even collect the metadata for all of our fMRI scans into a list and convert this into a dataframe.

In [None]:
metadata_list = []
all_fmri_files = layout.get(datatype='func', suffix='bold', return_type='file', extension='.nii.gz')
for fmri_file in all_fmri_files:
    fmri_metadata = layout.get_metadata(fmri_file)
    metadata_list.append(fmri_metadata)
df = pd.DataFrame.from_records(metadata_list)
df