# Create ISA-API Investigation from Study Design configuration

In this notebook I will show you how to use a study design configuration in JSON format to generate a single-study ISA investigation, and how to then serialise it in JSON and tabular (ISA-Tab) format.

Our study design configuration consists of:
* two study arms (treatment and control)
* blood and liver samples are taken during the treatment phase
* blood samples are used to perform Metabolite Profiling with two assays: mass spectrometry and nuclear magnetic resonance (NMR).

## 1. Setup

Let's import all the required libraries:

In [1]:
from time import time
import os
import json

## ISA-API related imports
from isatools.model import Investigation, Study

## ISA-API create mode related imports
from isatools.create.models import StudyDesign
from isatools.create.connectors import generate_isa_study_design_from_datascriptor_config

# serializer from ISA Investigation to JSON
from isatools.isajson import ISAJSONEncoder

# ISA-Tab serialisation
from isatools import isatab



## 2. Load the Study Design JSON configuration

First of all, we load the study design configuration:

In [2]:
with open(os.path.abspath(os.path.join("config", "study-design-with-two-arms-datascriptor.json")), "r") as config_file:
    study_design_config = json.load(config_file)
study_design_config

{'type': 'here you can describe your type of design',
 'elements': [{'id': '#element/screen',
   'name': 'screen',
   'type': 'screen',
   'duration': 7,
   'durationUnit': 'days'},
  {'id': '#element/first_treatment',
   'name': 'first treatment',
   'type': 'chemical intervention',
   'agent': 'test drug',
   'intensity': 10,
   'intensityUnit': 'mg/day',
   'duration': 30,
   'durationUnit': 'days'},
  {'id': '#element/surgical_treatment',
   'name': 'surgical treatment',
   'type': 'surgical intervention',
   'agent': 'surgery'},
  {'id': '#element/follow-up',
   'name': 'follow-up',
   'type': 'follow-up',
   'duration': 3,
   'durationUnit': 'months'},
  {'id': '#element/control_treatment',
   'name': 'control treatment',
   'type': 'chemical intervention',
   'agent': 'placebo',
   'intensity': 10,
   'intensityUnit': 'mg/day',
   'duration': 30,
   'durationUnit': 'days'}],
 'events': [{'id': '#event/sampling_0',
   'action': 'sampling',
   'input': 'subject',
   'output': 'liv
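
To get a quick overview of the design elements without scrolling through the raw dictionary, a small helper can print one line per element. This is only a minimal sketch: the `elements` list below is copied from the output shown above, and `describe_element` is a hypothetical helper, not part of the ISA-API.

```python
# Excerpt of the 'elements' list from the configuration printed above.
elements = [
    {"name": "screen", "type": "screen", "duration": 7, "durationUnit": "days"},
    {"name": "first treatment", "type": "chemical intervention", "agent": "test drug",
     "intensity": 10, "intensityUnit": "mg/day", "duration": 30, "durationUnit": "days"},
    {"name": "surgical treatment", "type": "surgical intervention", "agent": "surgery"},
    {"name": "follow-up", "type": "follow-up", "duration": 3, "durationUnit": "months"},
    {"name": "control treatment", "type": "chemical intervention", "agent": "placebo",
     "intensity": 10, "intensityUnit": "mg/day", "duration": 30, "durationUnit": "days"},
]


def describe_element(element):
    """Hypothetical helper: summarise one design element on a single line."""
    summary = "{} [{}]".format(element["name"], element["type"])
    if "duration" in element:
        summary += ", {} {}".format(element["duration"], element["durationUnit"])
    else:
        # Some elements (e.g. the surgical intervention) carry no duration.
        summary += ", no duration given"
    return summary


summaries = [describe_element(element) for element in elements]
for line in summaries:
    print(line)
```

In the notebook itself the same helper could be applied to `study_design_config['elements']` instead of the hard-coded excerpt.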

## 3. Generate the ISA Study Design from the JSON configuration
To perform the conversion we just need to call the function `generate_isa_study_design_from_datascriptor_config`. (The name may be subject to change: should we drop the "isa" and "datascriptor" qualifiers?)

In [3]:
study_design = generate_isa_study_design_from_datascriptor_config(study_design_config)
assert isinstance(study_design, StudyDesign)

{}
pv_combination: ()
count: 0, prev_node: extraction_000
count: 0, prev_node: extraction_000
{}
pv_combination: ()
count: 0, prev_node: extract_000_000
count: 1, prev_node: extract_001_000
count: 0, prev_node: labelling_000_000
count: 1, prev_node: labelling_000_001
{isatools.model.OntologyAnnotation(term='instrument', term_source=None, term_accession='', comments=[]): ['Agilent QTQF 6510'], isatools.model.OntologyAnnotation(term='injection_mode', term_source=None, term_accession='', comments=[]): ['FIA', 'LC'], isatools.model.OntologyAnnotation(term='acquisition_mode', term_source=None, term_accession='', comments=[]): ['positive mode']}
pv_combination: ('Agilent QTQF 6510', 'FIA', 'positive mode')
count: 0, prev_node: labelled_extract_000_000
count: 1, prev_node: labelled_extract_000_001
pv_combination: ('Agilent QTQF 6510', 'LC', 'positive mode')
count: 0, prev_node: labelled_extract_000_000
count: 1, prev_node: labelled_extract_000_001
count: 0, prev_node: mass_spectrometry_000_00

## 4. Generate the ISA Study from the StudyDesign and embed it into an ISA Investigation

The `StudyDesign.generate_isa_study()` method returns the complete ISA-API `Study` object.

In [4]:
start = time()
study = study_design.generate_isa_study()
end = time()
print('The generation of the study design took {:.2f} s.'.format(end - start))
assert isinstance(study, Study)
investigation = Investigation(studies=[study])



Sampling protocol is Protocol(
    name=sample collection
    protocol_type=sample_collection
    uri=
    version=
    parameters=2 ProtocolParameter objects
    components=0 OntologyAnnotation objects
    comments=0 Comment objects
)
The generation of the study design took 2.13 s.
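
The `start = time()` / `end = time()` pattern is repeated in several cells of this notebook. As a possible alternative, the timing logic can be wrapped in a small context manager. The sketch below uses only the standard library; `timed` is a hypothetical helper and the summation is a stand-in for the real call to `study_design.generate_isa_study()`.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results):
    """Time the enclosed block and store the elapsed seconds under `label`."""
    start = perf_counter()
    try:
        yield
    finally:
        # Record the duration even if the timed block raises.
        results[label] = perf_counter() - start


timings = {}
with timed("toy workload", timings):
    # Stand-in for an expensive call such as study generation.
    total = sum(range(100_000))

print("The toy workload took {:.4f} s.".format(timings["toy workload"]))
```

`perf_counter` is used here because it is a monotonic, high-resolution clock intended for measuring short durations.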


## 5. Serialize and save the JSON representation of the generated ISA Investigation

In [5]:
start = time()
inv_json = json.dumps(investigation, cls=ISAJSONEncoder, sort_keys=True, indent=4, separators=(',', ': '))
end = time()
print('The JSON serialisation of the ISA investigation took {:.2f} s.'.format(end - start))

The JSON serialisation of the ISA investigation took 0.48 s.
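
The `cls=ISAJSONEncoder` argument tells `json.dumps` which encoder class to use for objects it cannot serialise natively. The toy example below illustrates that general mechanism with the standard library only. The `Sample` and `SampleEncoder` classes are hypothetical and say nothing about how `ISAJSONEncoder` is implemented internally.

```python
import json


class Sample:
    """Hypothetical stand-in for a model object that json cannot serialise."""

    def __init__(self, name, organism_part):
        self.name = name
        self.organism_part = organism_part


class SampleEncoder(json.JSONEncoder):
    """Custom encoder: json calls default() for every unsupported object."""

    def default(self, obj):
        if isinstance(obj, Sample):
            return {"name": obj.name, "organismPart": obj.organism_part}
        # Fall back to the base class, which raises TypeError.
        return super().default(obj)


encoded = json.dumps([Sample("sample_001", "blood")], cls=SampleEncoder, sort_keys=True)
print(encoded)
```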


In [6]:
directory = os.path.abspath('output')
# Create the output directory if it does not exist yet
os.makedirs(directory, exist_ok=True)
with open(os.path.join(directory, 'isa-investigation-2-arms-nmr-ms.json'), 'w') as out_fp:
    json.dump(json.loads(inv_json), out_fp)

## 6. Dump the ISA Investigation to ISA-Tab

In [7]:
start = time()
isatab.dump(investigation, os.path.abspath(os.path.join('output')))
end = time()
print('The Tab serialisation of the ISA investigation took {:.2f} s.'.format(end - start))

2020-05-07 14:29:16,049 [INFO]: model.py(graph:1544) >> Building graph for object: Study(
    identifier=
    filename=s_study_01.txt
    title=
    description=
    submission_date=
    public_release_date=
    contacts=0 Person objects
    design_descriptors=0 OntologyAnnotation objects
    publications=0 Publication objects
    factors=3 StudyFactor objects
    protocols=16 Protocol objects
    assays=5 Assay objects
    sources=18 Source objects
    samples=64 Sample objects
    process_sequence=64 Process objects
    other_material=0 Material objects
    characteristic_categories=0 OntologyAnnots
    comments=0 Comment objects
    units=0 Unit objects
)
2020-05-07 14:29:16,095 [INFO]: model.py(graph:1544) >> Building graph for object: Study(
    identifier=
    filename=s_study_01.txt
    title=
    description=
    submission_date=
    public_release_date=
    contacts=0 Person objects
    design_descriptors=0 OntologyAnnotation objects
    publications=0 Publication objects
    

2020-05-07 14:29:29,037 [INFO]: model.py(graph:1544) >> Building graph for object: Assay(
    measurement_type=metabolite profiling
    technology_type=mass spectrometry
    technology_platform=
    filename=a_CELL_treatment_3_ASSAY_GRAPH_000_metabolite profiling_mass spectrometry.txt
    data_files=960 DataFile objects
    samples=0 Sample objects
    process_sequence=630 Process objects
    other_material=180 Material objects
    characteristic_categories=0 OntologyAnnots
    comments=0 Comment objects
    units=0 Unit objects
)
2020-05-07 14:29:35,518 [INFO]: isatab.py(_all_end_to_end_paths:1152) >> Found 480 paths!
2020-05-07 14:29:35,579 [INFO]: model.py(graph:1544) >> Building graph for object: Assay(
    measurement_type=metabolite profiling
    technology_type=nmr spectroscopy
    technology_platform=
    filename=a_CELL_treatment_3_ASSAY_GRAPH_001_metabolite profiling_nmr spectroscopy.txt
    data_files=480 DataFile objects
    samples=0 Sample objects
    process_sequence=510

The Tab serialisation of the ISA investigation took 24.83 s.


To work with the tables inside the notebook, we can also dump them to pandas DataFrames, using the `dump_tables_to_dataframes` function instead of `dump`:

In [8]:
dataframes = isatab.dump_tables_to_dataframes(investigation)

2020-05-07 15:21:29,637 [INFO]: model.py(graph:1544) >> Building graph for object: Study(
    identifier=
    filename=s_study_01.txt
    title=
    description=
    submission_date=
    public_release_date=
    contacts=0 Person objects
    design_descriptors=0 OntologyAnnotation objects
    publications=0 Publication objects
    factors=3 StudyFactor objects
    protocols=16 Protocol objects
    assays=5 Assay objects
    sources=18 Source objects
    samples=64 Sample objects
    process_sequence=64 Process objects
    other_material=0 Material objects
    characteristic_categories=0 OntologyAnnots
    comments=0 Comment objects
    units=0 Unit objects
)
2020-05-07 15:21:29,680 [INFO]: model.py(graph:1544) >> Building graph for object: Study(
    identifier=
    filename=s_study_01.txt
    title=
    description=
    submission_date=
    public_release_date=
    contacts=0 Person objects
    design_descriptors=0 OntologyAnnotation objects
    publications=0 Publication objects
    

2020-05-07 15:21:42,377 [INFO]: model.py(graph:1544) >> Building graph for object: Assay(
    measurement_type=metabolite profiling
    technology_type=mass spectrometry
    technology_platform=
    filename=a_CELL_treatment_3_ASSAY_GRAPH_000_metabolite profiling_mass spectrometry.txt
    data_files=960 DataFile objects
    samples=0 Sample objects
    process_sequence=630 Process objects
    other_material=180 Material objects
    characteristic_categories=0 OntologyAnnots
    comments=0 Comment objects
    units=0 Unit objects
)
2020-05-07 15:21:48,466 [INFO]: isatab.py(_all_end_to_end_paths:1152) >> Found 480 paths!
2020-05-07 15:21:48,535 [INFO]: model.py(graph:1544) >> Building graph for object: Assay(
    measurement_type=metabolite profiling
    technology_type=nmr spectroscopy
    technology_platform=
    filename=a_CELL_treatment_3_ASSAY_GRAPH_001_metabolite profiling_nmr spectroscopy.txt
    data_files=480 DataFile objects
    samples=0 Sample objects
    process_sequence=510

In [9]:
len(dataframes)

6

## 7. Check the correctness of the ISA-Tab DataFrames 

We have 1 study file and 5 assay files:
* 1 assay file for Mass. Spec. (treatment arm, third epoch: surgery)
* 2 assay files for Mass. Spec. (both arms, fourth epoch: follow-up)
* 2 assay files for NMR (both arms, fourth epoch: follow-up)

In [14]:
for key in dataframes.keys():
    display(key)

's_study_01.txt'

'a_CELL_control_3_ASSAY_GRAPH_000_metabolite profiling_mass spectrometry.txt'

'a_CELL_treatment_3_ASSAY_GRAPH_001_metabolite profiling_nmr spectroscopy.txt'

'a_CELL_treatment_3_ASSAY_GRAPH_000_metabolite profiling_mass spectrometry.txt'

'a_CELL_treatment_2_ASSAY_GRAPH_000_metabolite profiling_mass spectrometry.txt'

'a_CELL_control_3_ASSAY_GRAPH_001_metabolite profiling_nmr spectroscopy.txt'
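
The assay file names listed above encode the arm, the epoch index and the assay type, so the breakdown by technology can be checked programmatically. The following is a sketch under assumptions: the regular expression is inferred from the five file names in this notebook only (it is not an official ISA-API naming specification), and the epoch index appears to be zero-based, since `treatment_2` corresponds to the third epoch.

```python
import re
from collections import Counter

# The five assay file names printed above.
assay_filenames = [
    "a_CELL_control_3_ASSAY_GRAPH_000_metabolite profiling_mass spectrometry.txt",
    "a_CELL_treatment_3_ASSAY_GRAPH_001_metabolite profiling_nmr spectroscopy.txt",
    "a_CELL_treatment_3_ASSAY_GRAPH_000_metabolite profiling_mass spectrometry.txt",
    "a_CELL_treatment_2_ASSAY_GRAPH_000_metabolite profiling_mass spectrometry.txt",
    "a_CELL_control_3_ASSAY_GRAPH_001_metabolite profiling_nmr spectroscopy.txt",
]

# Pattern inferred from the names above (an assumption, not a specification).
ASSAY_FILENAME = re.compile(
    r"^a_CELL_(?P<arm>[a-z]+)_(?P<epoch>\d+)_ASSAY_GRAPH_(?P<graph>\d+)"
    r"_(?P<measurement>[^_]+)_(?P<technology>[^_]+)\.txt$"
)


def parse_assay_filename(filename):
    """Split an assay file name into arm, epoch, graph, measurement, technology."""
    match = ASSAY_FILENAME.match(filename)
    if match is None:
        raise ValueError("Unexpected assay file name: {}".format(filename))
    fields = match.groupdict()
    fields["epoch"] = int(fields["epoch"])
    return fields


parsed = [parse_assay_filename(name) for name in assay_filenames]
files_per_technology = Counter(p["technology"] for p in parsed)
files_per_arm_and_epoch = Counter((p["arm"], p["epoch"]) for p in parsed)
print(files_per_technology)
print(files_per_arm_and_epoch)
```

In the notebook, `assay_filenames` could be replaced by the keys of `dataframes` that start with `a_`.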

We have 8 subjects in the control arm and 24 samples have been collected (3 blood samples per subject during the follow-up epoch).

We have 10 subjects in the treatment arm and 40 samples have been collected (3 blood samples per subject during the follow-up epoch and 1 liver sample per subject during the surgery epoch).

In [23]:
study_frame = dataframes['s_study_01.txt']
count_control_samples = len(study_frame[study_frame['Source Name'].apply(lambda el: 'control' in el)])
count_treatment_samples = len(study_frame[study_frame['Source Name'].apply(lambda el: 'treatment' in el)])
print("There are {} samples in the control arm (i.e. group)".format(count_control_samples))
print("There are {} samples in the treatment arm (i.e. group)".format(count_treatment_samples))

There are 24 samples in the control arm (i.e. group)
There are 40 samples in the treatment arm (i.e. group)
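
The two counts can be cross-checked against the design with a few lines of arithmetic. This is a minimal sketch: the `arms` dictionary is a hypothetical structure that simply restates the subject numbers and per-subject sampling described in this section.

```python
# Subjects and samples collected per subject, as described in this section.
arms = {
    "control": {"subjects": 8, "samples_per_subject": {"blood": 3}},
    "treatment": {"subjects": 10, "samples_per_subject": {"blood": 3, "liver": 1}},
}

# Expected samples per arm = subjects x total samples collected per subject.
expected_samples = {
    arm: spec["subjects"] * sum(spec["samples_per_subject"].values())
    for arm, spec in arms.items()
}

for arm, count in expected_samples.items():
    print("Expected {} samples in the {} arm".format(count, arm))
```

The totals (24 and 40) add up to the 64 `Sample` objects reported in the study log output.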


Each control sample is fractionated and 2 labelled extracts are produced (i.e. biological replicates):

```
[
    "labelling",
    {
        "#replicates": 2
    }
]
```

There are 2 possible combinations of mass spectrometry parameter values specified in our configuration template, and 2 technical replicates are produced for each combination:

```
[
    "mass spectrometry",
    {
        "#replicates": 2,
        "instrument": [
            "Agilent QTQF 6510"
        ],
        "injection_mode": [
            "FIA",
            "LC"
        ],
        "acquisition_mode": [
            "positive mode"
        ]
    }
]
```
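
The 2 combinations are the Cartesian product of the listed parameter values (1 instrument, 2 injection modes, 1 acquisition mode). The sketch below reproduces them with `itertools.product`. It mirrors the `pv_combination` tuples printed during study design generation, but it is only an illustration and not the library's own code.

```python
from itertools import product

# Parameter values copied from the mass spectrometry configuration above.
ms_parameters = {
    "instrument": ["Agilent QTQF 6510"],
    "injection_mode": ["FIA", "LC"],
    "acquisition_mode": ["positive mode"],
}

# One tuple per combination, in the order the parameters are declared.
pv_combinations = list(product(*ms_parameters.values()))

for combination in pv_combinations:
    print(combination)
```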

Two raw spectral data files are produced as the result of each run:

```
[
    "raw spectral data file",
    [
        {
            "node_type": "data file",
            "size": 2,
            "is_input_to_next_protocols": false
        }
    ]
]
```

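In total, with $N_{samples} = 8 \times 3 = 24$ control samples and $N_{files} = 2$ raw data files per run, we expect

$$ N_{rows} = N_{samples} \times N_{biorepl} \times N_{combinations} \times N_{techrepl} \times N_{files} = 24 \times 2 \times 2 \times 2 \times 2 = 24 \times 16 = 384 $$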
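
The same count can be re-derived step by step in code. This is a sketch assuming the factors described in this section: 24 control samples, 2 labelled extracts per sample, 2 parameter combinations, 2 technical replicates per combination and 2 raw data files per run. The variable names are illustrative only.

```python
from math import prod  # available from Python 3.8

# Multiplicative factors taken from the design described in this section.
row_factors = {
    "control samples (8 subjects x 3 blood samples)": 8 * 3,
    "labelled extracts per sample": 2,
    "parameter value combinations": 2,
    "technical replicates per combination": 2,
    "raw data files per run": 2,
}

expected_rows = prod(row_factors.values())
print("Expected number of rows: {}".format(expected_rows))
```

In the notebook, this value could then be compared with `len(...)` of the corresponding assay DataFrame.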

We can verify this count on the mass spectrometry assay file of the control group.

In [15]:
dataframes['a_CELL_control_3_ASSAY_GRAPH_000_metabolite profiling_mass spectrometry.txt']

Unnamed: 0,Sample Name,Protocol REF,Performer,Extract Name,Characteristics[extract type],Protocol REF.1,Performer.1,Labeled Extract Name,Protocol REF.2,Parameter Value[instrument],Parameter Value[injection_mode],Parameter Value[acquisition_mode],MS Assay Name,Performer.2,Raw Data File,Raw Data File.1
0,GRP-control.SBJ-001.CEL-CELL_control_3.SMP-.bl...,extraction,Ellipsis,extract_12-30,lipids,labelling,Ellipsis,labelled extract_12-32,mass spectrometry,Agilent QTQF 6510,LC,positive mode,mass spectrometry_12-36,Ellipsis,raw spectral data file_12-38,raw spectral data file_12-38
1,GRP-control.SBJ-001.CEL-CELL_control_3.SMP-.bl...,extraction,Ellipsis,extract_12-30,lipids,labelling,Ellipsis,labelled extract_12-32,mass spectrometry,Agilent QTQF 6510,FIA,positive mode,mass spectrometry_12-42,Ellipsis,raw spectral data file_12-44,raw spectral data file_12-44
2,GRP-control.SBJ-001.CEL-CELL_control_3.SMP-.bl...,extraction,Ellipsis,extract_12-1,polar fraction,labelling,Ellipsis,labelled extract_12-17,mass spectrometry,Agilent QTQF 6510,FIA,positive mode,mass spectrometry_12-27,Ellipsis,raw spectral data file_12-29,raw spectral data file_12-29
3,GRP-control.SBJ-001.CEL-CELL_control_3.SMP-.bl...,extraction,Ellipsis,extract_12-30,lipids,labelling,Ellipsis,labelled extract_12-32,mass spectrometry,Agilent QTQF 6510,LC,positive mode,mass spectrometry_12-33,Ellipsis,raw spectral data file_12-35,raw spectral data file_12-35
4,GRP-control.SBJ-001.CEL-CELL_control_3.SMP-.bl...,extraction,Ellipsis,extract_12-1,polar fraction,labelling,Ellipsis,labelled extract_12-17,mass spectrometry,Agilent QTQF 6510,LC,positive mode,mass spectrometry_12-18,Ellipsis,raw spectral data file_12-20,raw spectral data file_12-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
379,GRP-control.SBJ-008.CEL-CELL_control_3.SMP-.bl...,extraction,Ellipsis,extract_2-30,lipids,labelling,Ellipsis,labelled extract_2-32,mass spectrometry,Agilent QTQF 6510,FIA,positive mode,mass spectrometry_2-39,Ellipsis,raw spectral data file_2-41,raw spectral data file_2-41
380,GRP-control.SBJ-008.CEL-CELL_control_3.SMP-.bl...,extraction,Ellipsis,extract_2-1,polar fraction,labelling,Ellipsis,labelled extract_2-17,mass spectrometry,Agilent QTQF 6510,LC,positive mode,mass spectrometry_2-21,Ellipsis,raw spectral data file_2-23,raw spectral data file_2-23
381,GRP-control.SBJ-008.CEL-CELL_control_3.SMP-.bl...,extraction,Ellipsis,extract_2-1,polar fraction,labelling,Ellipsis,labelled extract_2-17,mass spectrometry,Agilent QTQF 6510,LC,positive mode,mass spectrometry_2-18,Ellipsis,raw spectral data file_2-20,raw spectral data file_2-20
382,GRP-control.SBJ-008.CEL-CELL_control_3.SMP-.bl...,extraction,Ellipsis,extract_2-30,lipids,labelling,Ellipsis,labelled extract_2-32,mass spectrometry,Agilent QTQF 6510,LC,positive mode,mass spectrometry_2-33,Ellipsis,raw spectral data file_2-35,raw spectral data file_2-35
