# Create ISA-API Investigation from Datascriptor Study Design configuration

In this notebook I will show you how to use a study design configuration in JSON format, as produced by Datascriptor (https://gitlab.com/datascriptor/datascriptor), to generate a single-study ISA investigation, and how to then serialise it in JSON and tabular (ISA-Tab) format.

Our study design configuration consists of:
- a 4-arm study design. Each arm has 10 subjects
- there is an observational factor, named "condition" with two values: "healthy" and "diseased"
- a crossover of two treatments, a drug treatment and a biological treatment
- three non-treatment phases: screen, washout and follow-up
- two sample types collected: blood and derma
- two assay types: 
    - metabolite profiling through mass spectrometry on the derma samples. The mass spec will be run on an "Agilent QTQF 6510" instrument, testing both "FIA" and "LC" injection modes, and "positive" acquisition mode.
    - metabolite profiling through NMR spectroscopy on the blood samples. The NMR will be run on a "Bruker Avance II 1 GHz" instrument, in "1D 1H NMR" acquisition mode, testing both "CPMG" and "TOCSY" pulse sequences.

## 1. Setup

Let's import all the required libraries

In [1]:
from time import time
import os
import json

In [2]:
## ISA-API related imports
from isatools.model import Investigation, Study

In [3]:
## ISA-API create mode related imports
from isatools.create.model import StudyDesign
from isatools.create.connectors import generate_study_design

# serializer from ISA Investigation to JSON
from isatools.isajson import ISAJSONEncoder

# ISA-Tab serialisation
from isatools import isatab

In [4]:
## ISA-API create mode related imports
from isatools.create import model
from isatools import isajson

## 2. Load the Study Design JSON configuration

First of all, we load the study design configuration with all the specs defined above.

In [5]:
with open(os.path.abspath(os.path.join(
    "config", "crossover-study-design-4-arms-healthy-diseased-nmr-ms.json"
)), "r") as config_file:
    study_design_config = json.load(config_file)
study_design_config

{'name': 'Crossover study with observational factors ',
 'description': 'Or study design configuration consists of:\na 4-arm (i.e. groups) study design. Each arm has 10 subjects\nthere is an observational factor , named "condition" with two values: "healthy" and "diseased"\na crossover of two treatments: a drug treatment (ibuprofen) and a biological treatment (KpJH46Φ2 phage injection)\nthree non-treatment phases: screen, washout and follow-up\ntwo sample types colllected: blood and derma\ntwo assay types: (1) metabolite profiling through mass spectrometry on the saliva sample. The mass spec will be run on a "Agilent QTQF 6510" instrument, testing both "FIA" and "LC" injection modes, and "positive" acquisition mode.(2) metabolite profiling through NMR spectroscopy on the blood samples. The NMR will be run on a "Bruker Avance II 1 GHz" instrument, on "1D 1H NMR" acquisition mode, testing both "CPGM" amd "TOCSY" pulse sequences.',
 'subjectType': 'Homo sapiens',
 'subjectSize': 10,
 'des

## 3. Generate the ISA Study Design from the JSON configuration
To perform the conversion we just need to call the `generate_study_design` function, imported above from `isatools.create.connectors` (the name may be subject to change).

In [6]:
study_design = generate_study_design(study_design_config)
assert isinstance(study_design, StudyDesign)

## 4. Generate the ISA Study from the StudyDesign and embed it into an ISA Investigation

The `StudyDesign.generate_isa_study()` method returns the complete ISA-API `Study` object.

In [7]:
start = time()
study = study_design.generate_isa_study()
end = time()
print('The generation of the study design took {:.2f} s.'.format(end - start))
assert isinstance(study, Study)
investigation = Investigation(identifier='inv01', studies=[study])

The generation of the study design took 2.29 s.
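
The start/end timing pattern used in this notebook can be wrapped in a small helper. The following is only a sketch: `timed` is a hypothetical helper name, not part of ISA-API, and the workload is a stand-in for `study_design.generate_isa_study()`.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label):
    """Print how long the enclosed block took, labelled with `label`."""
    start = perf_counter()
    result = {}
    try:
        yield result
    finally:
        # Store the elapsed time so the caller can reuse it after the block
        result['seconds'] = perf_counter() - start
        print('{} took {:.2f} s.'.format(label, result['seconds']))


# Stand-in workload; in the notebook this would be study_design.generate_isa_study()
with timed('The stand-in workload') as timing:
    total = sum(range(100_000))
```

`perf_counter` is used here instead of `time` because it is a monotonic clock intended for measuring short durations.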


## 5. Serialize and save the JSON representation of the generated ISA Investigation

In [8]:
start = time()
inv_json = json.dumps(investigation, cls=ISAJSONEncoder, sort_keys=True, indent=4, separators=(',', ': '))
end = time()
print('The JSON serialisation of the ISA investigation took {:.2f} s.'.format(end - start))

The JSON serialisation of the ISA investigation took 0.80 s.


In [9]:
directory = os.path.abspath(os.path.join('output', 'crossover-4-arms'))
# Create the output directory if it does not exist yet
os.makedirs(directory, exist_ok=True)
with open(os.path.join(directory, 'isa-investigation-crossover-4-arms.json'), 'w') as out_fp:
    json.dump(json.loads(inv_json), out_fp)
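
A quick way to check that the saved file is intact is to read it back and compare it with the in-memory document. The sketch below uses a small stand-in dictionary and a temporary directory rather than the real investigation, so the content and the file name are assumptions made for illustration only.

```python
import json
import os
import tempfile

# Stand-in for the serialised investigation; in the notebook this is `inv_json`
inv_json = json.dumps(
    {'identifier': 'inv01', 'studies': [{'filename': 's_study_01.txt'}]},
    sort_keys=True, indent=4
)

with tempfile.TemporaryDirectory() as directory:
    out_path = os.path.join(directory, 'isa-investigation.json')
    # Writing the string directly preserves the indentation chosen in json.dumps
    with open(out_path, 'w') as out_fp:
        out_fp.write(inv_json)
    # Read the file back and parse it
    with open(out_path) as in_fp:
        reloaded = json.load(in_fp)

# The reloaded document should match the in-memory one
round_trip_ok = reloaded == json.loads(inv_json)
print('Round trip OK:', round_trip_ok)
```

Note that `json.dump(json.loads(inv_json), out_fp)` writes compact JSON without the indentation, whereas writing the string itself keeps the pretty-printed layout. Either form parses to the same document.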

## 6. Dump the ISA Investigation to ISA-Tab

In [10]:
start = time()
isatab.dump(investigation, os.path.abspath(os.path.join('output', 'crossover-4-arms')))
end = time()
print('The Tab serialisation of the ISA investigation took {:.2f} s.'.format(end - start))

The Tab serialisation of the ISA investigation took 29.21 s.


To work with the tables inside the notebook, we can also dump them to pandas DataFrames, using the `dump_tables_to_dataframes` function instead of `dump`.

In [12]:
dataframes = isatab.dump_tables_to_dataframes(investigation)

In [13]:
len(dataframes)

3

## 7. Check the correctness of the ISA-Tab DataFrames 

We have 1 study file and 2 assay files (one for MS and one for NMR). Let's check the names:

In [14]:
for key in dataframes.keys():
    display(key)

's_study_01.txt'

'a_AT2_metabolite-profiling_NMR-spectroscopy.txt'

'a_AT1_metabolite-profiling_mass-spectrometry.txt'
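
Since ISA-Tab file names start with `s_` for study tables and `a_` for assay tables, the expected "1 study, 2 assays" split can be checked from the keys alone. This is a minimal sketch in which the file names are copied from the output above; in the notebook one would iterate over `dataframes.keys()` instead.

```python
# File names copied from the output above; in the notebook use dataframes.keys()
table_names = [
    's_study_01.txt',
    'a_AT2_metabolite-profiling_NMR-spectroscopy.txt',
    'a_AT1_metabolite-profiling_mass-spectrometry.txt',
]

# ISA-Tab convention: "s_" marks study tables, "a_" marks assay tables
study_tables = sorted(name for name in table_names if name.startswith('s_'))
assay_tables = sorted(name for name in table_names if name.startswith('a_'))

print('study tables:', study_tables)
print('assay tables:', assay_tables)
```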

### 7.1 Count of subjects and samples

We have 10 subjects in each of the four arms, for a total of 40 subjects. The assay tables below report 360 blood samples (9 per subject), which undergo the NMR assay, and 160 derma samples (4 per subject: 1 during screen and 3 during follow-up), which undergo the mass spectrometry assay. Each arm therefore contributes 10 × 13 = 130 samples to the study table.

In [15]:
study_frame = dataframes['s_study_01.txt']
# Count the sample rows belonging to each arm (the arm is encoded in the Source Name)
for arm in ('GRP0', 'GRP1', 'GRP2', 'GRP3'):
    count_arm_samples = len(study_frame[study_frame['Source Name'].apply(lambda el: arm in el)])
    print("There are {} samples in the {} arm (i.e. group)".format(count_arm_samples, arm))

There are 130 samples in the GRP0 arm (i.e. group)
There are 130 samples in the GRP1 arm (i.e. group)
There are 130 samples in the GRP2 arm (i.e. group)
There are 130 samples in the GRP3 arm (i.e. group)
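
The per-arm counts printed above can be cross-checked with simple arithmetic against the unique sample counts reported in the two assay tables further down (360 blood and 160 derma samples). This sketch hard-codes those figures from the notebook outputs; it does not query the DataFrames.

```python
arms = 4
subjects_per_arm = 10
blood_samples_total = 360   # unique Sample Name values in the NMR assay table
derma_samples_total = 160   # unique Sample Name values in the mass spectrometry assay table

subjects = arms * subjects_per_arm
samples_total = blood_samples_total + derma_samples_total
samples_per_arm = samples_total // arms
samples_per_subject = samples_total // subjects

print('subjects:', subjects)
print('samples per arm:', samples_per_arm)
print('samples per subject:', samples_per_subject)
```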


In [26]:
study_frame

Unnamed: 0,Source Name,Characteristics[Study Subject],Characteristics[condition],Protocol REF,Parameter Value[Sampling order],Parameter Value[Study cell],Date,Performer,Sample Name,Characteristics[organism part],Comment[Treatment?],Factor Value[Sequence Order],Factor Value[AGENT],Factor Value[DURATION],Unit,Factor Value[INTENSITY],Unit.1
0,GRP0_SBJ01,Homo sapiens,healthy,sample collection,122,A0E4,2021-03-23,Unknown,GRP0_SBJ01_A0E4_SMP-blood-4,blood,NO,4,,90,days,,
1,GRP0_SBJ01,Homo sapiens,healthy,sample collection,009,A0E0,2021-03-23,Unknown,GRP0_SBJ01_A0E0_SMP-blood-1,blood,NO,0,,7,days,,
2,GRP0_SBJ01,Homo sapiens,healthy,sample collection,119,A0E4,2021-03-23,Unknown,GRP0_SBJ01_A0E4_SMP-blood-1,blood,NO,4,,90,days,,
3,GRP0_SBJ01,Homo sapiens,healthy,sample collection,120,A0E4,2021-03-23,Unknown,GRP0_SBJ01_A0E4_SMP-blood-2,blood,NO,4,,90,days,,
4,GRP0_SBJ01,Homo sapiens,healthy,sample collection,039,A0E3,2021-03-23,Unknown,GRP0_SBJ01_A0E3_SMP-blood-1,blood,YES,3,KpJH46Φ2 phage,14,days,3.0,injections/day
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
515,GRP3_SBJ10,Homo sapiens,diseased,sample collection,450,A3E4,2021-03-23,Unknown,GRP3_SBJ10_A3E4_SMP-derma-2,derma,NO,4,,90,days,,
516,GRP3_SBJ10,Homo sapiens,diseased,sample collection,502,A3E4,2021-03-23,Unknown,GRP3_SBJ10_A3E4_SMP-blood-6,blood,NO,4,,90,days,,
517,GRP3_SBJ10,Homo sapiens,diseased,sample collection,499,A3E4,2021-03-23,Unknown,GRP3_SBJ10_A3E4_SMP-blood-3,blood,NO,4,,90,days,,
518,GRP3_SBJ10,Homo sapiens,diseased,sample collection,451,A3E4,2021-03-23,Unknown,GRP3_SBJ10_A3E4_SMP-derma-3,derma,NO,4,,90,days,,


In [16]:
dataframes['a_AT1_metabolite-profiling_mass-spectrometry.txt']

Unnamed: 0,Sample Name,Comment[Treatment?],Protocol REF,Performer,Extract Name,Characteristics[extract type],Term Accession Number,Protocol REF.1,Performer.1,Labeled Extract Name,...,Protocol REF.2,Parameter Value[instrument],Term Accession Number.1,Parameter Value[injection_mode],Term Accession Number.2,Parameter Value[acquisition_mode],Term Accession Number.3,MS Assay Name,Performer.2,Raw Spectral Data File
0,GRP0_SBJ01_A0E0_SMP-derma-1,NO,extraction,Unknown,AT1-S9-Extract-R1,polar fraction,polar fraction,labeling,Unknown,AT1-S9-LE-R2,...,mass spectrometry,Agilent QTQF 6510,http://purl.obolibrary.org/obo/MS_1000676,LC,,positive mode,http://purl.obolibrary.org/obo/MS_1002807,AT1-S9-mass-spectrometry-Acquisition-R3,Unknown,AT1-S9-raw-spectral-data-file-R3.raw
1,GRP0_SBJ01_A0E0_SMP-derma-1,NO,extraction,Unknown,AT1-S9-Extract-R1,polar fraction,polar fraction,labeling,Unknown,AT1-S9-LE-R1,...,mass spectrometry,Agilent QTQF 6510,http://purl.obolibrary.org/obo/MS_1000676,FIA,http://purl.obolibrary.org/obo/MS_1000058,positive mode,http://purl.obolibrary.org/obo/MS_1002807,AT1-S9-mass-spectrometry-Acquisition-R2,Unknown,AT1-S9-raw-spectral-data-file-R2.raw
2,GRP0_SBJ01_A0E0_SMP-derma-1,NO,extraction,Unknown,AT1-S9-Extract-R1,polar fraction,polar fraction,labeling,Unknown,AT1-S9-LE-R1,...,mass spectrometry,Agilent QTQF 6510,http://purl.obolibrary.org/obo/MS_1000676,LC,,positive mode,http://purl.obolibrary.org/obo/MS_1002807,AT1-S9-mass-spectrometry-Acquisition-R1,Unknown,AT1-S9-raw-spectral-data-file-R1.raw
3,GRP0_SBJ01_A0E0_SMP-derma-1,NO,extraction,Unknown,AT1-S9-Extract-R1,polar fraction,polar fraction,labeling,Unknown,AT1-S9-LE-R2,...,mass spectrometry,Agilent QTQF 6510,http://purl.obolibrary.org/obo/MS_1000676,FIA,http://purl.obolibrary.org/obo/MS_1000058,positive mode,http://purl.obolibrary.org/obo/MS_1002807,AT1-S9-mass-spectrometry-Acquisition-R4,Unknown,AT1-S9-raw-spectral-data-file-R4.raw
4,GRP0_SBJ01_A0E4_SMP-derma-1,NO,extraction,Unknown,AT1-S35-Extract-R1,polar fraction,polar fraction,labeling,Unknown,AT1-S35-LE-R1,...,mass spectrometry,Agilent QTQF 6510,http://purl.obolibrary.org/obo/MS_1000676,FIA,http://purl.obolibrary.org/obo/MS_1000058,positive mode,http://purl.obolibrary.org/obo/MS_1002807,AT1-S35-mass-spectrometry-Acquisition-R2,Unknown,AT1-S35-raw-spectral-data-file-R2.raw
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
635,GRP3_SBJ10_A3E4_SMP-derma-2,NO,extraction,Unknown,AT1-S150-Extract-R1,polar fraction,polar fraction,labeling,Unknown,AT1-S150-LE-R1,...,mass spectrometry,Agilent QTQF 6510,http://purl.obolibrary.org/obo/MS_1000676,FIA,http://purl.obolibrary.org/obo/MS_1000058,positive mode,http://purl.obolibrary.org/obo/MS_1002807,AT1-S150-mass-spectrometry-Acquisition-R2,Unknown,AT1-S150-raw-spectral-data-file-R2.raw
636,GRP3_SBJ10_A3E4_SMP-derma-3,NO,extraction,Unknown,AT1-S151-Extract-R1,polar fraction,polar fraction,labeling,Unknown,AT1-S151-LE-R1,...,mass spectrometry,Agilent QTQF 6510,http://purl.obolibrary.org/obo/MS_1000676,LC,,positive mode,http://purl.obolibrary.org/obo/MS_1002807,AT1-S151-mass-spectrometry-Acquisition-R1,Unknown,AT1-S151-raw-spectral-data-file-R1.raw
637,GRP3_SBJ10_A3E4_SMP-derma-3,NO,extraction,Unknown,AT1-S151-Extract-R1,polar fraction,polar fraction,labeling,Unknown,AT1-S151-LE-R2,...,mass spectrometry,Agilent QTQF 6510,http://purl.obolibrary.org/obo/MS_1000676,FIA,http://purl.obolibrary.org/obo/MS_1000058,positive mode,http://purl.obolibrary.org/obo/MS_1002807,AT1-S151-mass-spectrometry-Acquisition-R4,Unknown,AT1-S151-raw-spectral-data-file-R4.raw
638,GRP3_SBJ10_A3E4_SMP-derma-3,NO,extraction,Unknown,AT1-S151-Extract-R1,polar fraction,polar fraction,labeling,Unknown,AT1-S151-LE-R2,...,mass spectrometry,Agilent QTQF 6510,http://purl.obolibrary.org/obo/MS_1000676,LC,,positive mode,http://purl.obolibrary.org/obo/MS_1002807,AT1-S151-mass-spectrometry-Acquisition-R3,Unknown,AT1-S151-raw-spectral-data-file-R3.raw


In [17]:
dataframes['a_AT2_metabolite-profiling_NMR-spectroscopy.txt']

Unnamed: 0,Sample Name,Comment[Treatment?],Protocol REF,Performer,Extract Name,Characteristics[extract type],Protocol REF.1,Parameter Value[instrument],Parameter Value[acquisition_mode],Parameter Value[pulse_sequence],NMR Assay Name,Performer.1,Free Induction Decay Data File
0,GRP0_SBJ01_A0E0_SMP-blood-1,NO,extraction,Unknown,AT2-S9-Extract-R1,pellet,NMR spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,TOCSY,AT2-S9-NMR-spectroscopy-Acquisition-R4,Unknown,AT2-S9-raw_spectral_data_file-R4.raw
1,GRP0_SBJ01_A0E0_SMP-blood-1,NO,extraction,Unknown,AT2-S9-Extract-R2,supernatant,NMR spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,CPMG,AT2-S9-NMR-spectroscopy-Acquisition-R5,Unknown,AT2-S9-raw_spectral_data_file-R5.raw
2,GRP0_SBJ01_A0E0_SMP-blood-1,NO,extraction,Unknown,AT2-S9-Extract-R2,supernatant,NMR spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,TOCSY,AT2-S9-NMR-spectroscopy-Acquisition-R8,Unknown,AT2-S9-raw_spectral_data_file-R8.raw
3,GRP0_SBJ01_A0E0_SMP-blood-1,NO,extraction,Unknown,AT2-S9-Extract-R1,pellet,NMR spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,CPMG,AT2-S9-NMR-spectroscopy-Acquisition-R1,Unknown,AT2-S9-raw_spectral_data_file-R1.raw
4,GRP0_SBJ01_A0E0_SMP-blood-1,NO,extraction,Unknown,AT2-S9-Extract-R2,supernatant,NMR spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,CPMG,AT2-S9-NMR-spectroscopy-Acquisition-R6,Unknown,AT2-S9-raw_spectral_data_file-R6.raw
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2875,GRP3_SBJ10_A3E4_SMP-blood-6,NO,extraction,Unknown,AT2-S342-Extract-R1,pellet,NMR spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,CPMG,AT2-S342-NMR-spectroscopy-Acquisition-R2,Unknown,AT2-S342-raw_spectral_data_file-R2.raw
2876,GRP3_SBJ10_A3E4_SMP-blood-6,NO,extraction,Unknown,AT2-S342-Extract-R2,supernatant,NMR spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,CPMG,AT2-S342-NMR-spectroscopy-Acquisition-R6,Unknown,AT2-S342-raw_spectral_data_file-R6.raw
2877,GRP3_SBJ10_A3E4_SMP-blood-6,NO,extraction,Unknown,AT2-S342-Extract-R1,pellet,NMR spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,TOCSY,AT2-S342-NMR-spectroscopy-Acquisition-R3,Unknown,AT2-S342-raw_spectral_data_file-R3.raw
2878,GRP3_SBJ10_A3E4_SMP-blood-6,NO,extraction,Unknown,AT2-S342-Extract-R2,supernatant,NMR spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,TOCSY,AT2-S342-NMR-spectroscopy-Acquisition-R7,Unknown,AT2-S342-raw_spectral_data_file-R7.raw


### 7.2 Overview of the Mass Spec assay table

For the mass spec assay table, we have 160 (derma) samples, 160 extracts (1 per sample, "polar" fraction), 320 labeled extracts (2 per extract, as "#replicates" is 2) and 640 mass spec runs + 640 output files (2 per labeled extract, as we do 1 technical replicate with 2 protocol parameter combinations: `["Agilent QTQF 6510", "FIA", "positive mode"]` and `["Agilent QTQF 6510", "LC", "positive mode"]`).

In [18]:
dataframes['a_AT1_metabolite-profiling_mass-spectrometry.txt'].nunique(axis=0, dropna=True)

Sample Name                          160
Comment[Treatment?]                    1
Protocol REF                           1
Performer                              1
Extract Name                         160
Characteristics[extract type]          1
Term Accession Number                  1
Protocol REF.1                         1
Performer.1                            1
Labeled Extract Name                 320
Label                                  1
Protocol REF.2                         1
Parameter Value[instrument]            1
Term Accession Number.1                1
Parameter Value[injection_mode]        2
Term Accession Number.2                2
Parameter Value[acquisition_mode]      1
Term Accession Number.3                1
MS Assay Name                        640
Performer.2                            1
Raw Spectral Data File               640
dtype: int64

### 7.3 Overview of the NMR assay table

For the NMR assay table, we have 360 (blood) samples, 720 extracts (2 per sample, a single replicate of the "supernatant" and "pellet" fractions) and 2880 NMR runs + 2880 output files (4 per extract, as we do 2 technical replicates with 2 protocol parameter combinations: `["Bruker Avance II 1 GHz", "1D 1H NMR", "CPMG"]` and `["Bruker Avance II 1 GHz", "1D 1H NMR", "TOCSY"]`).

In [19]:
dataframes['a_AT2_metabolite-profiling_NMR-spectroscopy.txt'].nunique(axis=0, dropna=True)

Sample Name                           360
Comment[Treatment?]                     2
Protocol REF                            1
Performer                               1
Extract Name                          720
Characteristics[extract type]           2
Protocol REF.1                          1
Parameter Value[instrument]             1
Parameter Value[acquisition_mode]       1
Parameter Value[pulse_sequence]         2
NMR Assay Name                       2880
Performer.1                             1
Free Induction Decay Data File       2880
dtype: int64
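
The unique counts in both assay tables follow from multiplying the number of samples by the fan-out of each step. The sketch below reproduces that arithmetic; `fan_out` is a hypothetical helper written for this illustration, and the multipliers are the ones described in the text (fractions, replicates and parameter combinations).

```python
def fan_out(samples, multipliers):
    """Return the number of nodes after each step, starting from the samples."""
    counts = [samples]
    for multiplier in multipliers:
        counts.append(counts[-1] * multiplier)
    return counts


# Mass spec: 1 extract per sample, 2 labeled extracts per extract,
# 2 runs per labeled extract (1 technical replicate x 2 injection modes)
ms_counts = fan_out(160, [1, 2, 2])

# NMR: 2 extracts per sample (supernatant and pellet),
# 4 runs per extract (2 technical replicates x 2 pulse sequences)
nmr_counts = fan_out(360, [2, 4])

print('mass spec:', ms_counts)
print('NMR:', nmr_counts)
```

The last element of each list is the number of assay runs, which should match the number of unique output files in the corresponding table.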

## 8. Replacing the value of one of the additional Characteristics of the source



In [20]:
a_source = investigation.studies[0].sources[7]

In [21]:
a_source.characteristics

[isatools.model.Characteristic(category=isatools.model.OntologyAnnotation(term='Study Subject', term_source=isatools.model.OntologySource(name='NCIT', file='http://purl.obolibrary.org/obo/ncit/releases/2019-03-02/ncit.owl', version='19.02d', description='NCI Thesaurus OBO Edition', comments=[]), term_accession='http://purl.obolibrary.org/obo/NCIT_C41189', comments=[]), value='Homo sapiens', unit=None, comments=[]),
 isatools.model.Characteristic(category=isatools.model.OntologyAnnotation(term='condition', term_source=None, term_accession='', comments=[]), value=isatools.model.OntologyAnnotation(term='healthy', term_source=None, term_accession='', comments=[]), unit=None, comments=[])]

In [22]:
a_source.characteristics[1].value

isatools.model.OntologyAnnotation(term='healthy', term_source=None, term_accession='', comments=[])

Let's replace this value with a plain string (`'ciao'`) instead of an `OntologyAnnotation`, and then dump the DataFrames in ISA-Tab format again. This breaks the serialisation.

In [23]:
from isatools.model import OntologyAnnotation
a_source.characteristics[1].value = 'ciao'
a_source.characteristics[1].value

'ciao'

In [24]:
a_source.characteristics[1].value

'ciao'

In [25]:
dataframes = isatab.dump_tables_to_dataframes(investigation)

KeyError: 'Source Name.Characteristics[condition].Term Source REF'
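
One plausible reading of this error (an assumption, not something confirmed by the ISA-API source) is that the serialiser expects every source to use the same kind of value for a given characteristic, so a single plain string among `OntologyAnnotation` values leaves the `Term Source REF` column undefined. Under that assumption, a check run before dumping could flag mixed value types. The classes below are hypothetical stand-ins that only mimic the attributes used in this notebook, not the real `isatools.model` classes.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Annotation:          # hypothetical stand-in for OntologyAnnotation
    term: str


@dataclass
class Characteristic:      # hypothetical stand-in for isatools' Characteristic
    category: Annotation
    value: object


@dataclass
class Source:              # hypothetical stand-in for isatools' Source
    name: str
    characteristics: list = field(default_factory=list)


def mixed_value_types(sources):
    """Return the characteristic categories whose values use more than one type."""
    types_by_category = defaultdict(set)
    for source in sources:
        for characteristic in source.characteristics:
            category = characteristic.category.term
            types_by_category[category].add(type(characteristic.value).__name__)
    return {cat: sorted(types) for cat, types in types_by_category.items() if len(types) > 1}


sources = [
    Source('source_a', [Characteristic(Annotation('condition'), Annotation('healthy'))]),
    Source('source_b', [Characteristic(Annotation('condition'), 'ciao')]),
]
problems = mixed_value_types(sources)
print(problems)
```

Because the function only relies on the `characteristics`, `category.term` and `value` attributes, the same idea should carry over to `investigation.studies[0].sources`, although that has not been verified here.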