# Create ISA-API Investigation from Datascriptor Study Design configuration with chained protocols in the assay workflow

In this notebook I will show how to use a study design configuration in JSON format, as produced by Datascriptor (https://gitlab.com/datascriptor/datascriptor), to generate a single-study ISA investigation, and how to then serialise it in JSON and tabular (i.e. ISA-Tab) format.

Our study design configuration consists of:
- a 4-arm repeated measures (crossover) study design
- 4 subject groups, obtained by crossing sex ("M" and "F") with condition ("healthy" and "diseased")
- 2 treatments in crossover: a drug treatment with ibuprofen and a biological treatment with injection of KpJH46Φ2 bacteriophage
- 2 samples collected: blood and derma samples
- three assays performed: mass spectrometry on blood samples, Chip-seq and NMR on derma samples. Please note that the Chip-seq assay contains two chained protocols in its workflow.
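
The four subject groups in the list above come from crossing the two observational factors. A minimal sketch of that enumeration, using only the factor values listed here; the variable names are illustrative and not part of the Datascriptor format:

```python
from itertools import product

# Observational factor values as listed in the study design above
sexes = ["M", "F"]
conditions = ["healthy", "diseased"]

# Each subject group is one (sex, condition) combination
subject_groups = list(product(sexes, conditions))
for sex, condition in subject_groups:
    print(sex, condition)
```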

## 1. Setup

Let's import all the required libraries

In [1]:
from time import time
import os
import json

In [2]:
## ISA-API related imports
from isatools.model import Investigation, Study

In [3]:
## ISA-API create mode related imports
from isatools.create.model import StudyDesign
from isatools.create.connectors import generate_study_design

# serializer from ISA Investigation to JSON
from isatools.isajson import ISAJSONEncoder

# ISA-Tab serialisation
from isatools import isatab

In [4]:
## ISA-API create mode related imports
from isatools.create import model
from isatools import isajson

## 2. Load the Study Design JSON configuration

First of all we load the study design configuration

In [5]:
with open(os.path.abspath(os.path.join(
    "config", "crossover-study-design-4-arms-blood-derma-nmr-ms-chipseq.json"
)), "r") as config_file:
    study_design_config = json.load(config_file)
study_design_config

{'name': 'Chained protocols study',
 'subjectType': 'Homo sapiens',
 'subjectSize': 10,
 'designType': {'term': 'crossover design',
  'id': 'OBI:0500003',
  'iri': 'http://purl.obolibrary.org/obo/OBI_0500003',
  'label': 'Study subjects receive repeated treatments',
  'value': 'crossover'},
 'observationalFactors': [{'name': 'sex',
   'values': ['M', 'F'],
   'isQuantitative': False,
   'unit': None},
  {'name': 'condition',
   'values': ['healthy', 'diseased'],
   'isQuantitative': False,
   'unit': None}],
 'subjectGroups': {'selected': [{'name': 'SubjectGroup_0',
    'type': 'Homo sapiens',
    'characteristics': [{'name': 'sex',
      'value': 'M',
      'unit': None,
      'isQuantitative': False},
     {'name': 'condition',
      'value': 'diseased',
      'unit': None,
      'isQuantitative': False}]},
   {'name': 'SubjectGroup_1',
    'type': 'Homo sapiens',
    'characteristics': [{'name': 'sex',
      'value': 'M',
      'unit': None,
      'isQuantitative': False},
     {'na
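
Because the printed configuration is long, it can help to condense each subject group to its characteristics. The sketch below works on a small excerpt copied from the output above (only the first subject group is fully visible there); `summarise_groups` is a hypothetical helper written for this illustration, not an ISA-API function.

```python
# Excerpt of the configuration printed above (first subject group only)
config_excerpt = {
    "name": "Chained protocols study",
    "subjectType": "Homo sapiens",
    "subjectSize": 10,
    "subjectGroups": {
        "selected": [
            {
                "name": "SubjectGroup_0",
                "type": "Homo sapiens",
                "characteristics": [
                    {"name": "sex", "value": "M", "unit": None, "isQuantitative": False},
                    {"name": "condition", "value": "diseased", "unit": None, "isQuantitative": False},
                ],
            }
        ]
    },
}


def summarise_groups(config):
    """Map each selected subject group name to a {characteristic: value} dict."""
    return {
        group["name"]: {c["name"]: c["value"] for c in group["characteristics"]}
        for group in config["subjectGroups"]["selected"]
    }


summary = summarise_groups(config_excerpt)
print(summary)
```

The same helper should work on the full `study_design_config`, assuming every selected group follows the structure shown in the excerpt.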

## 3. Generate the ISA Study Design from the JSON configuration
To perform the conversion we just need to call the `generate_study_design` function imported above from `isatools.create.connectors` (the name may be subject to change).

In [6]:
study_design = generate_study_design(study_design_config)
assert isinstance(study_design, StudyDesign)

## 4. Generate the ISA Study from the StudyDesign and embed it into an ISA Investigation

The `StudyDesign.generate_isa_study()` method returns the complete ISA-API `Study` object.

In [7]:
start = time()
study = study_design.generate_isa_study()
end = time()
print('The generation of the study design took {:.2f} s.'.format(end - start))
assert isinstance(study, Study)
investigation = Investigation(studies=[study])

The generation of the study design took 7.56 s.
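
The start/end timing pattern recurs in the following cells. One possible way to factor it out is a small context manager; this is only a sketch, `timed` is a hypothetical helper, and the `sleep` call stands in for the real ISA-API call.

```python
from contextlib import contextmanager
from time import perf_counter, sleep


@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took; optionally store it in `results`."""
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print('The {} took {:.2f} s.'.format(label, elapsed))


timings = {}
with timed('generation of the study', timings):
    sleep(0.01)  # stand-in for study_design.generate_isa_study()
```

`perf_counter` is used here instead of `time` because it is a monotonic clock meant for measuring intervals.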


## 5. Serialize and save the JSON representation of the generated ISA Investigation

In [8]:
start = time()
inv_json = json.dumps(investigation, cls=ISAJSONEncoder, sort_keys=True, indent=4, separators=(',', ': '))
end = time()
print('The JSON serialisation of the ISA investigation took {:.2f} s.'.format(end - start))

The JSON serialisation of the ISA investigation took 1.58 s.


In [9]:
directory = os.path.abspath(os.path.join('output', 'crossover-bio+drug-treatment'))
os.makedirs(directory, exist_ok=True)
with open(os.path.abspath(os.path.join(directory, 'isa-investigation-crossover-bio+drug.json')), 'w') as out_fp:
    json.dump(json.loads(inv_json), out_fp)
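
Note that `json.dump(json.loads(inv_json), out_fp)` re-serialises the data with the default settings, so the indentation chosen in the previous cell is not kept in the file. If the pretty-printed form is wanted on disk, the string can be written as-is. A minimal sketch, using a temporary directory and a tiny stand-in for the real investigation JSON (the `identifier` value is made up):

```python
import json
import os
import tempfile

# Stand-in for the serialised investigation (the real string comes from json.dumps above)
inv_json = json.dumps({"studies": [{"identifier": "s_01"}]}, sort_keys=True, indent=4)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'investigation.json')
    # Writing the string directly preserves the indentation chosen at serialisation time
    with open(out_path, 'w') as out_fp:
        out_fp.write(inv_json)
    # Reading the file back confirms that it contains valid, equivalent JSON
    with open(out_path) as in_fp:
        on_disk = in_fp.read()
    reloaded = json.loads(on_disk)

print(reloaded == json.loads(inv_json))
```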

## 6. Dump the ISA Investigation to ISA-Tab

Expect this to take up to a few minutes, depending on your machine (on mine, with an 8th-gen i7 and 16 GB of RAM, it took roughly 90 to 115 s).

In [10]:
start = time()
isatab.dump(investigation, directory)
end = time()
print('The Tab serialisation of the ISA investigation took {:.2f} s.'.format(end - start))

The Tab serialisation of the ISA investigation took 114.08 s.


To work with the tables inside the notebook, we can instead dump them to pandas DataFrames with the `dump_tables_to_dataframes` function rather than `dump`.

In [11]:
dataframes = isatab.dump_tables_to_dataframes(investigation)

In [12]:
len(dataframes)

4

## 7. Check the correctness of the ISA-Tab DataFrames 

We have 1 study file and 3 assay files, one per assay type in our assay plan:

In [13]:
for key in dataframes.keys():
    display(key)

's_study_01.txt'

'a_AT16_protein-DNA-binding-site-identification_nucleic-acid-sequencing.txt'

'a_AT2_metabolite-profiling_NMR-spectroscopy.txt'

'a_AT0_metabolite-profiling_mass-spectrometry.txt'
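
The assay file names above appear to follow the pattern `a_<assay id>_<measurement type>_<technology type>.txt`. This is inferred from the three names in the output, not from a documented guarantee, so the sketch below (with the hypothetical helper `parse_assay_filename`) should be read under that assumption.

```python
def parse_assay_filename(filename):
    """Split an assay file name into (assay id, measurement type, technology type)."""
    stem = filename.rsplit('.', 1)[0]  # drop the .txt extension
    prefix, assay_id, measurement, technology = stem.split('_')
    if prefix != 'a':
        raise ValueError('not an assay file: {}'.format(filename))
    return assay_id, measurement, technology


# The three assay file names listed in the output above
assay_files = [
    'a_AT16_protein-DNA-binding-site-identification_nucleic-acid-sequencing.txt',
    'a_AT2_metabolite-profiling_NMR-spectroscopy.txt',
    'a_AT0_metabolite-profiling_mass-spectrometry.txt',
]
parsed = [parse_assay_filename(name) for name in assay_files]
for entry in parsed:
    print(entry)
```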

We have 10 subjects in each of the four selected arms (Arm_0, Arm_2, Arm_5, Arm_7), and 11 samples have been collected per subject (9 blood samples per subject, 6 of which during the follow-up epoch, plus 3 derma samples collected during the follow-up phase)

In [14]:
study_frame = dataframes['s_study_01.txt']
# Count the rows (samples) whose source name carries each selected group tag
for group in ('GRP0', 'GRP2', 'GRP5', 'GRP7'):
    count = len(study_frame[study_frame['Source Name'].apply(lambda el: group in el)])
    print("There are {} samples in the {} arm (i.e. group)".format(count, group))

There are 110 samples in the GRP0 arm (i.e. group)
There are 110 samples in the GRP2 arm (i.e. group)
There are 110 samples in the GRP5 arm (i.e. group)
There are 110 samples in the GRP7 arm (i.e. group)
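
The per-arm count follows from the figures stated in this section: 10 subjects per arm times 11 samples per subject. A quick arithmetic check, using only those figures (the variable names are illustrative):

```python
subjects_per_arm = 10      # 'subjectSize' in the configuration
samples_per_subject = 11   # as stated in the text above
selected_arms = 4          # the four selected arms

samples_per_arm = subjects_per_arm * samples_per_subject
total_samples = samples_per_arm * selected_arms
print(samples_per_arm, total_samples)
```

Under these figures the four arms together would account for 440 sample rows.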


### 7.1 Chip-seq assay table

The two chained protocols have been serialised successfully.

In [15]:
dataframes['a_AT16_protein-DNA-binding-site-identification_nucleic-acid-sequencing.txt']

Unnamed: 0,Sample Name,Comment[study step with treatment],Protocol REF,Parameter Value[cross linking],Parameter Value[DNA fragmentation],Parameter Value[DNA fragment size],Parameter Value[immunoprecipitation antibody],Performer,Extract Name,Characteristics[extract type],Protocol REF.1,Parameter Value[library orientation],Parameter Value[library selection],Parameter Value[library strategy],Performer.1,Protocol REF.2,Parameter Value[sequencing instrument],Assay Name,Performer.2,Raw Data File
0,GRP0_SBJ01_A0E5_SMP-derma-1,NO,extraction,formaldehyde,nebulization,100 nm,monoclonal,Unknown,AT16-S19-Extract-R1,GENOMIC,library preparation,single-end,ChIP,Chip-Seq,Unknown,nucleic acid sequencing,Illumina HiSeq 4000,AT16-S19-nucleic-acid-sequencing-Acquisition-R7,Unknown,AT16-S19-raw_data_file-R7.raw
1,GRP0_SBJ01_A0E5_SMP-derma-1,NO,extraction,formaldehyde,nebulization,100 nm,monoclonal,Unknown,AT16-S19-Extract-R1,GENOMIC,library preparation,single-end,PCR,Chip-Seq,Unknown,nucleic acid sequencing,Illumina HiSeq 4000,AT16-S19-nucleic-acid-sequencing-Acquisition-R4,Unknown,AT16-S19-raw_data_file-R4.raw
2,GRP0_SBJ01_A0E5_SMP-derma-1,NO,extraction,uv-light,nebulization,100 nm,monoclonal,Unknown,AT16-S99-Extract-R1,GENOMIC,library preparation,single-end,PCR,Chip-Seq,Unknown,nucleic acid sequencing,Ion Torrent S5 XL,AT16-S99-nucleic-acid-sequencing-Acquisition-R8,Unknown,AT16-S99-raw_data_file-R8.raw
3,GRP0_SBJ01_A0E5_SMP-derma-1,NO,extraction,formaldehyde,nebulization,100 nm,monoclonal,Unknown,AT16-S19-Extract-R1,GENOMIC,library preparation,single-end,PCR,Chip-Seq,Unknown,nucleic acid sequencing,Ion Torrent S5 XL,AT16-S19-nucleic-acid-sequencing-Acquisition-R3,Unknown,AT16-S19-raw_data_file-R3.raw
4,GRP0_SBJ01_A0E5_SMP-derma-1,NO,extraction,uv-light,nebulization,100 nm,monoclonal,Unknown,AT16-S99-Extract-R1,GENOMIC,library preparation,single-end,ChIP,Chip-Seq,Unknown,nucleic acid sequencing,Illumina HiSeq 4000,AT16-S99-nucleic-acid-sequencing-Acquisition-R1,Unknown,AT16-S99-raw_data_file-R1.raw
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1275,GRP7_SBJ10_A3E5_SMP-derma-2,NO,extraction,formaldehyde,nebulization,100 nm,monoclonal,Unknown,AT16-S66-Extract-R1,GENOMIC,library preparation,single-end,PCR,Chip-Seq,Unknown,nucleic acid sequencing,Ion Torrent S5 XL,AT16-S66-nucleic-acid-sequencing-Acquisition-R1,Unknown,AT16-S66-raw_data_file-R1.raw
1276,GRP7_SBJ10_A3E5_SMP-derma-2,NO,extraction,formaldehyde,nebulization,100 nm,monoclonal,Unknown,AT16-S66-Extract-R1,GENOMIC,library preparation,single-end,ChIP,Chip-Seq,Unknown,nucleic acid sequencing,Illumina HiSeq 4000,AT16-S66-nucleic-acid-sequencing-Acquisition-R7,Unknown,AT16-S66-raw_data_file-R7.raw
1277,GRP7_SBJ10_A3E5_SMP-derma-2,NO,extraction,formaldehyde,nebulization,100 nm,monoclonal,Unknown,AT16-S66-Extract-R1,GENOMIC,library preparation,single-end,PCR,Chip-Seq,Unknown,nucleic acid sequencing,Illumina HiSeq 4000,AT16-S66-nucleic-acid-sequencing-Acquisition-R4,Unknown,AT16-S66-raw_data_file-R4.raw
1278,GRP7_SBJ10_A3E5_SMP-derma-2,NO,extraction,formaldehyde,nebulization,100 nm,monoclonal,Unknown,AT16-S66-Extract-R1,GENOMIC,library preparation,single-end,ChIP,Chip-Seq,Unknown,nucleic acid sequencing,Illumina HiSeq 4000,AT16-S66-nucleic-acid-sequencing-Acquisition-R5,Unknown,AT16-S66-raw_data_file-R5.raw
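
The chaining is visible in the repeated `Protocol REF` columns of the table above. As a sketch, the ordered protocol names can be read off a single row; the header and first row below are copied verbatim from that output, and only the standard library is used so that the snippet runs without the notebook state.

```python
import csv

# Header and first data row copied from the table above
header_line = ("Unnamed: 0,Sample Name,Comment[study step with treatment],Protocol REF,"
               "Parameter Value[cross linking],Parameter Value[DNA fragmentation],"
               "Parameter Value[DNA fragment size],Parameter Value[immunoprecipitation antibody],"
               "Performer,Extract Name,Characteristics[extract type],Protocol REF.1,"
               "Parameter Value[library orientation],Parameter Value[library selection],"
               "Parameter Value[library strategy],Performer.1,Protocol REF.2,"
               "Parameter Value[sequencing instrument],Assay Name,Performer.2,Raw Data File")
row_line = ("0,GRP0_SBJ01_A0E5_SMP-derma-1,NO,extraction,formaldehyde,nebulization,100 nm,"
            "monoclonal,Unknown,AT16-S19-Extract-R1,GENOMIC,library preparation,single-end,"
            "ChIP,Chip-Seq,Unknown,nucleic acid sequencing,Illumina HiSeq 4000,"
            "AT16-S19-nucleic-acid-sequencing-Acquisition-R7,Unknown,AT16-S19-raw_data_file-R7.raw")

header, first_row = list(csv.reader([header_line, row_line]))

# pandas disambiguates repeated column names with .1, .2, ... suffixes,
# so the part before the first dot identifies every "Protocol REF" column
chain = [value for column, value in zip(header, first_row)
         if column.split('.')[0] == 'Protocol REF']
print(' -> '.join(chain))
```

In the notebook itself, the same idea could be applied to the DataFrame by selecting the columns whose name starts with `Protocol REF`.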


### 7.2 NMR assay table

In [16]:
dataframes['a_AT2_metabolite-profiling_NMR-spectroscopy.txt']

Unnamed: 0,Sample Name,Comment[study step with treatment],Protocol REF,Performer,Extract Name,Characteristics[extract type],Protocol REF.1,Parameter Value[instrument],Parameter Value[acquisition_mode],Parameter Value[pulse_sequence],Performer.1,Free Induction Decay Data File
0,GRP0_SBJ01_A0E5_SMP-derma-1,NO,extraction,Unknown,AT2-S19-Extract-R2,supernatant,nmr_spectroscopy,Bruker Avance II 1 GHz,2D 13C-13C NMR,TOCSY,Unknown,AT2-S19-raw_spectral_data_file-R12.raw
1,GRP0_SBJ01_A0E5_SMP-derma-1,NO,extraction,Unknown,AT2-S19-Extract-R2,supernatant,nmr_spectroscopy,Bruker Avance II 1 GHz,2D 13C-13C NMR,TOCSY,Unknown,AT2-S19-raw_spectral_data_file-R11.raw
2,GRP0_SBJ01_A0E5_SMP-derma-1,NO,extraction,Unknown,AT2-S19-Extract-R1,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,2D 13C-13C NMR,TOCSY,Unknown,AT2-S19-raw_spectral_data_file-R7.raw
3,GRP0_SBJ01_A0E5_SMP-derma-1,NO,extraction,Unknown,AT2-S19-Extract-R1,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,2D 13C-13C NMR,TOCSY,Unknown,AT2-S19-raw_spectral_data_file-R8.raw
4,GRP0_SBJ01_A0E5_SMP-derma-1,NO,extraction,Unknown,AT2-S19-Extract-R2,supernatant,nmr_spectroscopy,Bruker Avance II 1 GHz,2D 13C-13C NMR,watergate,Unknown,AT2-S19-raw_spectral_data_file-R14.raw
...,...,...,...,...,...,...,...,...,...,...,...,...
1275,GRP7_SBJ10_A3E5_SMP-derma-2,NO,extraction,Unknown,AT2-S66-Extract-R2,supernatant,nmr_spectroscopy,Bruker Avance II 1 GHz,2D 13C-13C NMR,watergate,Unknown,AT2-S66-raw_spectral_data_file-R13.raw
1276,GRP7_SBJ10_A3E5_SMP-derma-2,NO,extraction,Unknown,AT2-S66-Extract-R1,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,2D 13C-13C NMR,watergate,Unknown,AT2-S66-raw_spectral_data_file-R4.raw
1277,GRP7_SBJ10_A3E5_SMP-derma-2,NO,extraction,Unknown,AT2-S66-Extract-R2,supernatant,nmr_spectroscopy,Bruker Avance II 1 GHz,2D 13C-13C NMR,TOCSY,Unknown,AT2-S66-raw_spectral_data_file-R12.raw
1278,GRP7_SBJ10_A3E5_SMP-derma-2,NO,extraction,Unknown,AT2-S66-Extract-R1,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 13C NMR,TOCSY,Unknown,AT2-S66-raw_spectral_data_file-R5.raw
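
The `Sample Name` values in these tables look structured: group, subject, a third token, then sample type and index. The sketch below splits them with a regular expression. The pattern is inferred from the names visible in the tables, and the third token (e.g. `A0E5`) is deliberately kept as an opaque tag because its meaning is not stated here; `parse_sample_name` is a hypothetical helper.

```python
import re

# Pattern inferred from the sample names shown in the tables above
SAMPLE_NAME = re.compile(
    r'^(?P<group>GRP\d+)_(?P<subject>SBJ\d+)_(?P<tag>[^_]+)'
    r'_SMP-(?P<sample_type>[a-z]+)-(?P<index>\d+)$'
)


def parse_sample_name(name):
    """Return the components of a sample name as a dict, or raise ValueError."""
    match = SAMPLE_NAME.match(name)
    if match is None:
        raise ValueError('unexpected sample name: {}'.format(name))
    parts = match.groupdict()
    parts['index'] = int(parts['index'])
    return parts


first = parse_sample_name('GRP0_SBJ01_A0E5_SMP-derma-1')
last = parse_sample_name('GRP7_SBJ10_A3E5_SMP-derma-2')
print(first)
print(last)
```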


### 7.3 Mass Spectrometry assay table

In [17]:
dataframes['a_AT0_metabolite-profiling_mass-spectrometry.txt']

Unnamed: 0,Sample Name,Comment[study step with treatment],Protocol REF,Performer,Extract Name,Characteristics[extract type],Protocol REF.1,Performer.1,Labeled Extract Name,Label,Protocol REF.2,Parameter Value[instrument],Parameter Value[injection_mode],Parameter Value[acquisition_mode],MS Assay Name,Performer.2,Raw Spectral Data File
0,GRP0_SBJ01_A0E1_SMP-blood-1,NO,extraction,Unknown,AT0-S20-Extract-R2,non-polar fraction,labeling,Unknown,AT0-S20-LE-R2,label_0,mass spectrometry,Agilent QTQF 6510,LC,positive mode,AT0-S20-mass-spectrometry-Acquisition-R7,Unknown,AT0-S20-raw-spectral-data-file-R7.raw
1,GRP0_SBJ01_A0E1_SMP-blood-1,NO,extraction,Unknown,AT0-S19-Extract-R2,non-polar fraction,labeling,Unknown,AT0-S19-LE-R2,label_0,mass spectrometry,Agilent QTQF 6510,FIA,positive mode,AT0-S19-mass-spectrometry-Acquisition-R5,Unknown,AT0-S19-raw-spectral-data-file-R5.raw
2,GRP0_SBJ01_A0E1_SMP-blood-1,NO,extraction,Unknown,AT0-S19-Extract-R2,non-polar fraction,labeling,Unknown,AT0-S19-LE-R2,label_0,mass spectrometry,Agilent QTQF 6510,LC,positive mode,AT0-S19-mass-spectrometry-Acquisition-R7,Unknown,AT0-S19-raw-spectral-data-file-R7.raw
3,GRP0_SBJ01_A0E1_SMP-blood-1,NO,extraction,Unknown,AT0-S20-Extract-R2,non-polar fraction,labeling,Unknown,AT0-S20-LE-R2,label_0,mass spectrometry,Agilent QTQF 6510,FIA,positive mode,AT0-S20-mass-spectrometry-Acquisition-R6,Unknown,AT0-S20-raw-spectral-data-file-R6.raw
4,GRP0_SBJ01_A0E1_SMP-blood-1,NO,extraction,Unknown,AT0-S19-Extract-R1,polar fraction,labeling,Unknown,AT0-S19-LE-R1,label_0,mass spectrometry,Agilent QTQF 6510,LC,positive mode,AT0-S19-mass-spectrometry-Acquisition-R4,Unknown,AT0-S19-raw-spectral-data-file-R4.raw
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5115,GRP7_SBJ10_A3E5_SMP-blood-6,NO,extraction,Unknown,AT0-S556-Extract-R1,polar fraction,labeling,Unknown,AT0-S556-LE-R1,label_0,mass spectrometry,Agilent QTQF 6510,FIA,positive mode,AT0-S556-mass-spectrometry-Acquisition-R1,Unknown,AT0-S556-raw-spectral-data-file-R1.raw
5116,GRP7_SBJ10_A3E5_SMP-blood-6,NO,extraction,Unknown,AT0-S555-Extract-R2,non-polar fraction,labeling,Unknown,AT0-S555-LE-R2,label_0,mass spectrometry,Agilent QTQF 6510,LC,positive mode,AT0-S555-mass-spectrometry-Acquisition-R7,Unknown,AT0-S555-raw-spectral-data-file-R7.raw
5117,GRP7_SBJ10_A3E5_SMP-blood-6,NO,extraction,Unknown,AT0-S556-Extract-R2,non-polar fraction,labeling,Unknown,AT0-S556-LE-R2,label_0,mass spectrometry,Agilent QTQF 6510,FIA,positive mode,AT0-S556-mass-spectrometry-Acquisition-R6,Unknown,AT0-S556-raw-spectral-data-file-R6.raw
5118,GRP7_SBJ10_A3E5_SMP-blood-6,NO,extraction,Unknown,AT0-S555-Extract-R2,non-polar fraction,labeling,Unknown,AT0-S555-LE-R2,label_0,mass spectrometry,Agilent QTQF 6510,LC,positive mode,AT0-S555-mass-spectrometry-Acquisition-R8,Unknown,AT0-S555-raw-spectral-data-file-R8.raw
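
In the rows shown above, each `Raw Spectral Data File` name shares its prefix (e.g. `AT0-S20`) and run tag (e.g. `R7`) with the `MS Assay Name` on the same row. A small consistency check along those lines is sketched below, on three pairs copied from the table; the naming pattern is an observation from this output, not a documented rule, and `run_key` is a hypothetical helper.

```python
import re

# (MS Assay Name, Raw Spectral Data File) pairs copied from rows of the table above
pairs = [
    ('AT0-S20-mass-spectrometry-Acquisition-R7', 'AT0-S20-raw-spectral-data-file-R7.raw'),
    ('AT0-S19-mass-spectrometry-Acquisition-R5', 'AT0-S19-raw-spectral-data-file-R5.raw'),
    ('AT0-S556-mass-spectrometry-Acquisition-R1', 'AT0-S556-raw-spectral-data-file-R1.raw'),
]

# Leading "AT<n>-S<n>" prefix and trailing "R<n>" tag, with an optional .raw extension
NAME = re.compile(r'^(AT\d+-S\d+)-.*-(R\d+)(?:\.raw)?$')


def run_key(name):
    """Extract the (prefix, run tag) pair that identifies an acquisition."""
    match = NAME.match(name)
    if match is None:
        raise ValueError('unexpected name: {}'.format(name))
    return match.group(1), match.group(2)


consistent = all(run_key(assay) == run_key(data_file) for assay, data_file in pairs)
print(consistent)
```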
