[ENH] BEP031 - New entity: sample and samples.tsv file #812

Merged: 9 commits, Jul 26, 2021
7 changes: 7 additions & 0 deletions src/02-common-principles.md
@@ -32,6 +32,13 @@ misunderstanding we clarify them here.
context, a session may also indicate a group of related scans,
taken in one or more visits.

1. **Sample** - a sample pertaining to a subject such as tissue, primary cell
or cell-free sample.
The `sample-<label>` key/value pair is used to distinguish between different
samples from the same subject.
The label MUST be unique per subject and is RECOMMENDED to be unique
throughout the dataset.

1. **Data acquisition** - a continuous uninterrupted block of time during which
a brain scanning instrument was acquiring data according to particular
scanning sequence/protocol.
66 changes: 66 additions & 0 deletions src/03-modality-agnostic-files.md
@@ -255,6 +255,72 @@ to date of birth.
}
```

## Samples file

Template:

```Text
samples.tsv
samples.json
```

The purpose of this file is to describe properties of samples, indicated by the `sample` entity.
This file is REQUIRED if `sample-<label>` is present in any file name within the dataset.
If this file exists, it MUST contain the following three columns:

- `sample_id`: MUST consist of `sample-<label>` values identifying one row
for each sample

- `participant_id`: MUST consist of `sub-<label>`

- `sample_type`: MUST be one of the following sample type values: `cell line`, `in vitro differentiated cells`,
`primary cell`, `cell-free sample`, `cloning host`, `tissue`, `whole organisms`, `organoid` or
`technical sample`, taken from [ENCODE Biosample Type](https://www.encodeproject.org/profiles/biosample_type)

Other optional columns MAY be used to describe the samples.
Each sample MUST be described by one and only one row.
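
The column rules above can be checked mechanically. The following is a minimal sketch, not part of the specification or of any BIDS validator: the function name `check_samples_tsv` and the alphanumeric label patterns are assumptions made for illustration, and a "duplicate" is assumed to mean a repeated `participant_id`/`sample_id` pair, since labels only need to be unique per subject.

```python
import csv
import io
import re

REQUIRED_COLUMNS = ("sample_id", "participant_id", "sample_type")

# Allowed values copied from the sample_type list above.
SAMPLE_TYPES = {
    "cell line", "in vitro differentiated cells", "primary cell",
    "cell-free sample", "cloning host", "tissue", "whole organisms",
    "organoid", "technical sample",
}

# Assumption: labels are alphanumeric strings.
SAMPLE_ID = re.compile(r"sample-[0-9a-zA-Z]+")
PARTICIPANT_ID = re.compile(r"sub-[0-9a-zA-Z]+")


def check_samples_tsv(text):
    """Return a list of problems found in the content of a samples.tsv file."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return ["missing required column(s): " + ", ".join(missing)]

    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        sample_id = row["sample_id"] or ""
        participant_id = row["participant_id"] or ""
        sample_type = row["sample_type"] or ""
        if not SAMPLE_ID.fullmatch(sample_id):
            problems.append(f"line {line_no}: bad sample_id {sample_id!r}")
        if not PARTICIPANT_ID.fullmatch(participant_id):
            problems.append(f"line {line_no}: bad participant_id {participant_id!r}")
        if sample_type not in SAMPLE_TYPES:
            problems.append(f"line {line_no}: unknown sample_type {sample_type!r}")
        # Assumption: a sample is identified by (participant_id, sample_id).
        key = (participant_id, sample_id)
        if key in seen:
            problems.append(f"line {line_no}: duplicate row for {sample_id} of {participant_id}")
        seen.add(key)
    return problems
```

An empty list means none of the checks sketched here failed; it does not imply the file is valid in every respect.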

Commonly used *optional* columns in `samples.tsv` files are `pathology` and
`derived_from`. We RECOMMEND using these columns and, when they are used,
we RECOMMEND the following values for them:

- `pathology`: string value describing the pathology of the sample or type of control.
When different from `healthy`, pathology SHOULD be specified in `samples.tsv`.
The pathology MAY instead be specified in [Sessions files](06-longitudinal-and-multi-site-studies.md#sessions-file)
if it changes over time.

- `derived_from`: `sample-<label>` key/value pair of the sample from which this sample is derived,
for example a slice of tissue (`sample-02`) derived from a block of tissue (`sample-01`),
as illustrated in the example below.

`samples.tsv` example:

```Text
sample_id participant_id sample_type derived_from
sample-01 sub-01 tissue n/a
sample-02 sub-01 tissue sample-01
sample-03 sub-01 tissue sample-01
sample-04 sub-02 tissue n/a
sample-05 sub-02 tissue n/a
```
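
The `derived_from` column in the example above forms a parent/child relation between rows. As a hedged sketch (the function name `unresolved_derivations` is hypothetical, and resolving the parent within the same `participant_id` is an assumption, not something the text above states), one way to find references that point at no existing row:

```python
import csv
import io

# The example table above, with explicit tab separators.
EXAMPLE = (
    "sample_id\tparticipant_id\tsample_type\tderived_from\n"
    "sample-01\tsub-01\ttissue\tn/a\n"
    "sample-02\tsub-01\ttissue\tsample-01\n"
    "sample-03\tsub-01\ttissue\tsample-01\n"
    "sample-04\tsub-02\ttissue\tn/a\n"
    "sample-05\tsub-02\ttissue\tn/a\n"
)


def unresolved_derivations(text):
    """Return (participant_id, sample_id, derived_from) for dangling parents."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    known = {(r["participant_id"], r["sample_id"]) for r in rows}
    unresolved = []
    for r in rows:
        # A missing column or an empty cell is treated like "n/a".
        parent = r.get("derived_from") or "n/a"
        if parent == "n/a":
            continue
        # Assumption: the parent sample belongs to the same participant.
        if (r["participant_id"], parent) not in known:
            unresolved.append((r["participant_id"], r["sample_id"], parent))
    return unresolved
```

For the example table every `derived_from` value resolves, so the function returns an empty list.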

It is RECOMMENDED to accompany each `samples.tsv` file with a sidecar
`samples.json` file to describe the TSV column names and properties of their values
(see also the [section on tabular files](02-common-principles.md#tabular-files)).

`samples.json` example:

```JSON
{
    "sample_type": {
        "Description": "type of sample from ENCODE Biosample Type (https://www.encodeproject.org/profiles/biosample_type)"
    },
    "derived_from": {
        "Description": "sample_id from which the sample is derived"
    }
}
```
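
The condition stated at the start of this section (`samples.tsv` is REQUIRED once any file name carries `sample-<label>`) can also be checked from a list of dataset paths. This is a rough sketch under stated assumptions: the helper names are hypothetical, paths are assumed to be POSIX-style and relative to the dataset root, labels are assumed to be alphanumeric, and the file name in the usage note is made up for illustration.

```python
import re
from pathlib import PurePosixPath

# Matches a `sample-<label>` entity at the start of a file name or after "_".
# Assumption: labels are alphanumeric.
SAMPLE_ENTITY = re.compile(r"(?:^|_)sample-[0-9a-zA-Z]+(?=[_.]|$)")


def uses_sample_entity(paths):
    """True if any file name (not directory name) carries sample-<label>."""
    return any(SAMPLE_ENTITY.search(PurePosixPath(p).name) for p in paths)


def samples_file_missing(paths):
    """True if sample-<label> is used but samples.tsv is absent at the root."""
    paths = list(paths)  # allow any iterable, since it is scanned twice
    has_samples_tsv = any(
        PurePosixPath(p) == PurePosixPath("samples.tsv") for p in paths
    )
    return uses_sample_entity(paths) and not has_samples_tsv
```

For instance, `samples_file_missing(["sub-01/micr/sub-01_sample-01_SEM.png"])` evaluates to `True`, and adding `"samples.tsv"` to the list makes it `False`.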

## Phenotypic and assessment data

Template:
11 changes: 11 additions & 0 deletions src/schema/entities.yaml
@@ -25,6 +25,17 @@ session:
    often in the case of some intervention between sessions
    (for example, training).
  format: label
sample:
  name: Sample
  entity: sample
  description: |
    A sample pertaining to a subject such as tissue, primary cell
    or cell-free sample.
    The `sample-<label>` key/value pair is used to distinguish between different
    samples from the same subject.
    The label MUST be unique per subject and is RECOMMENDED to be unique
    throughout the dataset.
  format: label
task:
  name: Task
  entity: task
5 changes: 5 additions & 0 deletions src/schema/top_level_files.yaml
@@ -24,3 +24,8 @@ participants:
  extensions:
    - .tsv
    - .json
samples:
  required: false
  extensions:
    - .tsv
    - .json