# Open MRI datasets

---

In this lesson, we will be using a subset of a publicly available dataset, **ds000030**, from [openneuro.org](https://openneuro.org/datasets/ds000030). All of the datasets on OpenNeuro are already structured according to BIDS.

## OpenNeuro

- client-side BIDS validation
- resumable uploads

## Downloading data

### DataLad

`DataLad` installs the data, which for a dataset means that we get the "small" data (i.e. the text files) and the download instructions for the larger files. We can now navigate the dataset as if it were a file system and plan our analysis.

We'll switch to the terminal for this part.

Navigate to the folder where you'd like to download the dataset.

In [None]:
!cd ../data && datalad install https://github.com/OpenNeuroDatasets/ds000030.git

Getting and dropping data

In [None]:
!datalad get ../data/ds000030/sub-10788  
!datalad drop ../data/ds000030/sub-10788
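
It can be handy to check from Python whether a file's content is actually present. On file systems that support symlinks, annexed files in a DataLad dataset are typically symlinks whose target only exists once `datalad get` has fetched the content. The sketch below relies on that assumption; `is_content_present` is a hypothetical helper name, and the demo uses a throwaway directory rather than ds000030.

```python
import os
import tempfile


def is_content_present(path):
    """Return True if `path` resolves to real content.

    Assumes the git-annex symlink layout: a file whose content has not
    been fetched is a symlink whose target does not exist.
    """
    # os.path.exists follows symlinks, so it is False for a dangling link.
    return os.path.exists(path)


# Demo on a throwaway directory (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "annex_object")
    link = os.path.join(tmp, "sub-XX_T1w.nii.gz")
    os.symlink(target, link)           # dangling link: content not fetched
    before = is_content_present(link)
    with open(target, "wb") as f:      # content arrives, as after `datalad get`
        f.write(b"\x00")
    after = is_content_present(link)

print(before, after)
```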

Removing data

In [None]:
!datalad remove ../data/ds000030

## Exploring data

Below is a tree diagram showing the folder structure of a single MR session within ds000030.

In [None]:
!tree ../data/ds000030

The `participants.tsv` file describes demographic information on each participant within your study (e.g. age, handedness, sex). Let's take a look at the `participants.tsv` file to see what's been included in this dataset.

In order to load the data into Python, we'll need to import the `pandas` package. The `pandas` **dataframe** is Python's equivalent to an Excel spreadsheet.

In [None]:
import pandas as pd

We'll use the `read_csv()` function. It requires us to specify the name of the file we want to import and the separator that is used to distinguish each column in our file (`\t` since we're working with a `.tsv` file).

In [None]:
participant_metadata = pd.read_csv('../data/ds000030/participants.tsv', sep='\t')
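
To see why the separator matters, here is a minimal sketch that reads the same tab-separated text with and without `sep='\t'`. The rows are made up for illustration and are not real ds000030 values.

```python
import io

import pandas as pd

# Made-up rows in a tab-separated layout; not real ds000030 values.
tsv_text = (
    "participant_id\tdiagnosis\tage\n"
    "sub-01\tCONTROL\t30\n"
    "sub-02\tSCHZ\t41\n"
)

# With the right separator, each field lands in its own column.
tabbed = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# With the default separator (a comma), every row collapses into one column.
default = pd.read_csv(io.StringIO(tsv_text))

print(tabbed.shape)
print(default.shape)
```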

To get a glimpse of our data, we'll use the `head()` method. By default, `head()` returns the first 5 rows of our dataframe.

In [None]:
participant_metadata.head()

We can view any number of rows by passing `n` as an argument to `head()` (e.g. `head(n=10)`).  
If we want to select particular rows within the dataframe, we can use the `loc[]` indexer and identify the rows we want based on their index label (the numbers in the left-most column).

In [None]:
participant_metadata.loc[[6, 10, 12]]

**EXERCISE**: Select the first 5 rows of the dataframe using `loc[]`.

In [None]:
participant_metadata.loc[:4]
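
Because `loc[]` works on index labels, its slices include the end label, unlike ordinary Python slicing. The sketch below contrasts it with the position-based `iloc[]` indexer on a small made-up dataframe (a stand-in for `participant_metadata`), and shows that labels and positions stop lining up once rows have been dropped.

```python
import pandas as pd

# Hypothetical stand-in for participant_metadata.
df = pd.DataFrame({"participant_id": [f"sub-{i:02d}" for i in range(6)]})

# Label-based slicing includes the end label: labels 0..4 give 5 rows.
by_label = df.loc[:4]
# Position-based slicing excludes the end position: positions 0..3 give 4 rows.
by_position = df.iloc[:4]

# After dropping rows, labels and positions no longer line up.
subset = df[df.index % 2 == 0]                   # keeps labels 0, 2, 4
label_2 = subset.loc[2, "participant_id"]        # the row labelled 2
position_2 = subset.iloc[2]["participant_id"]    # the third remaining row

print(len(by_label), len(by_position), label_2, position_2)
```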

**EXERCISE:** How many participants do we have in total?

In [None]:
participant_metadata.shape

There are two ways to select a column in a dataframe:

* `participant_metadata['<column_name>']` (this is similar to selecting a key in a Python dictionary)
* `participant_metadata.<column_name>`

Another way to see how many participants are in the study is to select the `participant_id` column and use the `count()` method, which counts the non-missing values in that column.

In [None]:
participant_metadata['participant_id'].count()
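
One caveat worth keeping in mind: `count()` skips missing values, so it only matches the number of rows when the column is complete. A small sketch on made-up data, which also shows that the two column-selection styles return the same values:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing age.
df = pd.DataFrame({
    "participant_id": ["sub-01", "sub-02", "sub-03"],
    "age": [30, np.nan, 41],
})

n_rows = df.shape[0]                      # number of rows in the dataframe
n_ids = df["participant_id"].count()      # non-missing participant IDs
n_ages = df["age"].count()                # non-missing ages

# Bracket and attribute access select the same column.
same_column = df["age"].equals(df.age)

print(n_rows, n_ids, n_ages, same_column)
```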

**EXERCISE:** Which diagnosis groups are part of the study?  
*Hint: use the* `unique()` *function.*

In [None]:
participant_metadata['diagnosis'].unique()

If we want to count the number of participants in each diagnosis group, we can use the `value_counts()` function.

In [None]:
participant_metadata['diagnosis'].value_counts()
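
`value_counts()` also accepts a couple of useful options. The sketch below, on a made-up series, uses `dropna=False` to include missing entries in the tally and `normalize=True` to report proportions instead of raw counts.

```python
import pandas as pd

# Made-up diagnosis labels with one missing entry.
diagnosis = pd.Series(["CONTROL", "CONTROL", "SCHZ", None, "CONTROL"])

counts = diagnosis.value_counts()                     # missing values skipped
with_missing = diagnosis.value_counts(dropna=False)   # missing values tallied too
proportions = diagnosis.value_counts(normalize=True)  # fractions of non-missing

print(counts)
print(proportions)
```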

**EXERCISE:** How many males and females are in the study? How many are in each diagnosis group?

In [None]:
participant_metadata['gender'].value_counts()

In [None]:
participant_metadata.groupby(['diagnosis', 'gender']).size()
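
The grouped counts come back in "long" form, with one row per combination. If a table is easier to read, the same numbers can be pivoted with `unstack()` or computed directly with `pd.crosstab()`. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up participants.
df = pd.DataFrame({
    "diagnosis": ["CONTROL", "CONTROL", "SCHZ", "SCHZ", "SCHZ"],
    "gender": ["F", "M", "F", "F", "M"],
})

# Long form: one count per (diagnosis, gender) pair.
long_counts = df.groupby(["diagnosis", "gender"]).size()

# Wide form: diagnoses as rows, genders as columns.
wide_counts = long_counts.unstack(fill_value=0)

# crosstab builds the same table in one step.
table = pd.crosstab(df["diagnosis"], df["gender"])

print(table)
```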

When looking at the participant dataframe, we noticed that there is a column called `ghost_NoGhost`. We should look at the README file that comes with the dataset to find out more about this.

In [None]:
!cat ../data/ds000030/README

For this tutorial, we're just going to work with participants that are either CONTROL or SCHZ (`diagnosis`) and have both a T1w (`T1w == 1`) and rest (`rest == 1`) scan.

**EXERCISE:** Filter `participant_metadata` so that only participants meeting the above conditions remain.

In [None]:
participant_metadata = participant_metadata[(participant_metadata.diagnosis.isin(['CONTROL', 'SCHZ'])) & 
                                            (participant_metadata.T1w == 1) & 
                                            (participant_metadata.rest == 1)]
participant_metadata
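
The same kind of filter can also be written with `DataFrame.query()`, which some people find easier to read. The sketch below applies both styles to a small made-up dataframe and resets the index afterwards, since filtering leaves gaps in the index labels.

```python
import pandas as pd

# Made-up stand-in for participant_metadata.
df = pd.DataFrame({
    "participant_id": ["sub-01", "sub-02", "sub-03", "sub-04"],
    "diagnosis": ["CONTROL", "SCHZ", "ADHD", "CONTROL"],
    "T1w": [1, 1, 1, 0],
    "rest": [1, 1, 1, 1],
})

# Boolean-mask style: combine the three conditions with &.
mask = df["diagnosis"].isin(["CONTROL", "SCHZ"]) & (df["T1w"] == 1) & (df["rest"] == 1)
filtered = df[mask].reset_index(drop=True)

# query() style: the same conditions as a string expression.
queried = df.query("diagnosis in ['CONTROL', 'SCHZ'] and T1w == 1 and rest == 1")

print(filtered)
```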

## Querying a BIDS dataset

[pybids](https://bids-standard.github.io/pybids/) is a Python API for querying, summarizing and manipulating the BIDS folder structure.

In [None]:
from bids.layout import BIDSLayout

In [None]:
layout = BIDSLayout("../data/ds000030")

Indexing a dataset can take a really long time, especially if you have many subjects, modalities, scan types, etc. `pybids` has an option to save the indexed results to a SQLite database. This database can then be re-used the next time you want to query the same dataset.

In [None]:
layout.save("../data/ds000030/.db")

In [None]:
layout = BIDSLayout("../data/ds000030", database_path="../data/ds000030/.db")

The pybids layout object lets you query your BIDS dataset according to a number of parameters by using a `get_*()` method.  
We can get a list of the subjects we've downloaded from the dataset.

In [None]:
layout.get_subjects()

To get a list of all of the files, just use `get()`. 

In [None]:
layout.get()

There are many arguments we can use to filter down this list. Any BIDS-defined keyword can be passed on as a constraint. In `pybids`, these keywords are known as **entities**. For a complete list of possibilities:

In [None]:
layout.entities
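
To make the idea of entities concrete: in a BIDS filename, entities are the `key-value` pairs separated by underscores, followed by a suffix and an extension. The sketch below pulls them apart with plain string handling. `parse_entities` is a hypothetical helper written for this illustration, not part of `pybids`, and it keeps the short keys found in the filename (`sub`, `task`), whereas `pybids` exposes them under longer names such as `subject` and `task`.

```python
def parse_entities(filename):
    """Split a BIDS-style filename into entities, suffix and extension."""
    # Everything from the first dot onwards is the extension (e.g. ".nii.gz").
    stem, dot, ext = filename.partition(".")
    parts = stem.split("_")
    # key-value chunks such as "sub-10159" become dictionary items.
    entities = dict(part.split("-", 1) for part in parts if "-" in part)
    # The last underscore-separated chunk is taken to be the suffix.
    entities["suffix"] = parts[-1]
    entities["extension"] = dot + ext
    return entities


parsed = parse_entities("sub-10159_task-rest_bold.nii.gz")
print(parsed)
```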

For example, to get only the file paths of all of our resting-state fMRI scans:

In [None]:
layout.get(datatype="func", suffix="bold", task="rest", extension=[".nii.gz"], return_type="file")
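
For intuition about what such a query matches, here is a rough equivalent written with `pathlib` globbing instead of `pybids`. It is only a sketch: it assumes a dataset without session folders, uses a throwaway directory with made-up subject labels, and lacks the metadata awareness that `pybids` provides.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Empty placeholder files; subject labels are made up.
    for name in [
        "sub-01/func/sub-01_task-rest_bold.nii.gz",
        "sub-01/func/sub-01_task-rest_bold.json",
        "sub-01/func/sub-01_task-bart_bold.nii.gz",
        "sub-01/anat/sub-01_T1w.nii.gz",
        "sub-02/func/sub-02_task-rest_bold.nii.gz",
    ]:
        path = root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()

    # Mirrors datatype="func", task="rest", suffix="bold", extension=".nii.gz".
    matches = sorted(
        p.relative_to(root).as_posix()
        for p in root.glob("sub-*/func/*_task-rest_bold.nii.gz")
    )

print(matches)
```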

**EXERCISE**: Retrieve the file paths of any scan where the subject is '10292' or '50081' and the `RepetitionTime` is 2 seconds.

In [None]:
layout.get(subject=["10292", "50081"], RepetitionTime=2, return_type="file")

Let's save the first file from our list of file paths to a variable and pull the metadata from its associated JSON file using the `get_metadata()` method.

In [None]:
fmri_file = layout.get(subject=["10292", "50081"], RepetitionTime=2, return_type="file")[0]
layout.get_metadata(fmri_file)
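
Under the hood, this metadata lives in JSON "sidecar" files that share a name with the image they describe. The sketch below reads one such file with the standard `json` module. The file and its values are made up, and note that `get_metadata()` does more than this: it also merges in fields from JSON files higher up in the dataset, following the BIDS inheritance principle.

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # A made-up sidecar for a made-up scan.
    sidecar = Path(tmp) / "sub-01_task-rest_bold.json"
    sidecar.write_text(json.dumps({"RepetitionTime": 2, "TaskName": "rest"}))

    # Reading the sidecar back gives a plain Python dictionary.
    metadata = json.loads(sidecar.read_text())

tr = metadata["RepetitionTime"]
print(metadata)
```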

We can even collect the metadata for all of our fMRI scans into a list and convert this into a dataframe.

In [None]:
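import pandas as pd

# Paths of every functional (BOLD) image in the dataset
all_fmri_files = layout.get(datatype="func", suffix="bold", extension=[".nii.gz"], return_type="file")

# Collect the metadata of each scan as one dictionary per file
metadata_list = []
for fmri_file in all_fmri_files:
    fmri_metadata = layout.get_metadata(fmri_file)
    metadata_list.append(fmri_metadata)

# One row per scan, one column per metadata field
df = pd.DataFrame.from_records(metadata_list)
df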
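
Not every scan necessarily reports the same metadata fields. When the dictionaries have different keys, `from_records()` builds a column for every key it sees and fills the gaps with missing values, as this sketch on made-up metadata shows.

```python
import pandas as pd

# Made-up metadata: the second scan reports an extra field.
metadata_list = [
    {"RepetitionTime": 2, "TaskName": "rest"},
    {"RepetitionTime": 2, "TaskName": "bart", "EchoTime": 0.03},
]

# One row per dictionary; keys absent from a dictionary become NaN.
df = pd.DataFrame.from_records(metadata_list)

# Count how many scans are missing each field.
n_missing = df.isna().sum()

print(df)
print(n_missing)
```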