# Create ISA-API Investigation from Datascriptor Study Design configuration
# Factorial Study on Painkillers in Humans

In this notebook I will show you how to use a study design configuration in JSON format, as produced by Datascriptor (https://gitlab.com/datascriptor/datascriptor), to generate a single-study ISA investigation, and how to then serialise it in JSON and tabular (ISA-Tab) format.

Our study design configuration consists of:
- a 6-arm study design. Each arm has 8 subjects (`subjectSize` in the configuration loaded below)
- subjects are humans divided into two age groups: <50 years, and 50+ years
- a single-treatment factorial design. Three drugs are being supplied (painkillers: tramadol, acetaminophen, oxycodone) at three different dosages (5, 10, 15 mg/day), for three different durations (20, 30, 40 days)
- a follow-up phase after treatment 
- two sample types collected: adipose tissue, blood
- two assay types: 
    - NMR for adipose tissue
    - cell counting for blood samples
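
To make the size of the design space concrete, here is a minimal sketch that enumerates the full 3x3x3 factorial from the factor levels listed above. The variable names are illustrative only and are not part of the ISA-API; the six arms actually used are defined in the configuration file and are not reproduced here.

```python
from itertools import product

# Factor levels as listed in the study description above.
agents = ["tramadol", "acetaminophen", "oxycodone"]
doses_mg_per_day = [5, 10, 15]
durations_days = [20, 30, 40]

# Every (agent, dose, duration) combination of the full factorial design.
treatments = list(product(agents, doses_mg_per_day, durations_days))

print("Full factorial size:", len(treatments))
print("First combination:", treatments[0])
```

Only 6 of these 27 combinations are kept as study arms in this example.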

## 1. Setup

Let's import all the required libraries

In [1]:
from time import time
import os
import json

In [2]:
## ISA-API related imports
from isatools.model import Investigation, Study

In [3]:
## ISA-API create mode related imports
from isatools.create.model import StudyDesign
from isatools.create.connectors import generate_study_design

# serializer from ISA Investigation to JSON
from isatools.isajson import ISAJSONEncoder

# ISA-Tab serialisation
from isatools import isatab

In [4]:
## ISA-API create mode related imports
from isatools.create import model
from isatools import isajson

## 2. Load the Study Design JSON configuration

First of all we load the study design configuration with all the specs defined above.

In [5]:
with open(os.path.abspath(os.path.join(
    "config", "painkillers-factorial-study-design.json"
)), "r") as config_file:
    study_design_config = json.load(config_file)
study_design_config

{'name': 'Factorial Study on 3 Painkillers',
 'description': 'Example of a 3x3x3  factorial design.  We have human subjects with two age groups (>50 years, 50+ years). Three drugs are being supplied (painkillers: tramadol, acetaminophen, oxycodone) at three different dosages (5, 10, 15 mg/day)), for three different durations (20, 30, 40 days). Only 6  study groups (i.e. arms) are extracted from the full factorial study design. Two sample types are collected: adipose tissue and blood sample. The first one undergoes an NMR assay, while cell counting with cytofluorimetry is run on the blood sample cells.',
 'subjectType': 'Homo sapiens',
 'subjectSize': 8,
 'designType': {'term': 'full factorial design',
  'id': 'STATO:0000270',
  'iri': 'http://purl.obolibrary.org/obo/STATO_0000270',
  'label': 'Study subjects receive a single treatment',
  'value': 'fullFactorial'},
 'observationalFactors': [{'name': 'age group',
   'values': ['<50', '50+'],
   'isQuantitative': True,
   'unit': 'years'
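
A quick way to sanity-check a configuration before converting it is to pull out a few headline fields. The sketch below works on a hand-written excerpt that mirrors only the keys visible in the output above; `config_excerpt` and `summarise_config` are hypothetical names, and the real file contains more keys than shown here.

```python
# Hypothetical excerpt mirroring the fields displayed above; the real file has more keys.
config_excerpt = {
    "name": "Factorial Study on 3 Painkillers",
    "subjectType": "Homo sapiens",
    "subjectSize": 8,
    "designType": {"term": "full factorial design", "id": "STATO:0000270"},
    "observationalFactors": [
        {"name": "age group", "values": ["<50", "50+"],
         "isQuantitative": True, "unit": "years"}
    ],
}


def summarise_config(config):
    """Return a few headline facts about a study design configuration."""
    factors = {
        factor["name"]: factor["values"]
        for factor in config.get("observationalFactors", [])
    }
    return {
        "name": config["name"],
        "subject_size": config["subjectSize"],
        "design": config["designType"]["term"],
        "observational_factors": factors,
    }


summary = summarise_config(config_excerpt)
print(summary)
```

In the notebook, the same function could be called on `study_design_config` directly, assuming the loaded file carries these keys.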

## 3. Generate the ISA Study Design from the JSON configuration
To perform the conversion we just need to call the function `generate_study_design()`, imported above from `isatools.create.connectors` (the name may be subject to change).

In [6]:
study_design = generate_study_design(study_design_config)
assert isinstance(study_design, StudyDesign)

## 4. Generate the ISA Study from the StudyDesign and embed it into an ISA Investigation

The `StudyDesign.generate_isa_study()` method returns the complete ISA-API `Study` object.

In [7]:
start = time()
study = study_design.generate_isa_study()
end = time()
print('The generation of the study design took {:.2f} s.'.format(end - start))
assert isinstance(study, Study)
investigation = Investigation(identifier='inv01', studies=[study])

The generation of the study design took 0.37 s.
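
The start/end timing pattern is repeated for each step of this notebook. As an optional alternative, the sketch below wraps it in a small context manager; `timed` is a hypothetical helper, not part of the ISA-API, and the summation is only a stand-in workload.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label):
    """Print how long the enclosed block took, using a monotonic clock."""
    start = perf_counter()
    try:
        yield
    finally:
        print("{} took {:.2f} s.".format(label, perf_counter() - start))


# Stand-in workload; in the notebook this would wrap a call such as
# study_design.generate_isa_study().
with timed("Summing a range"):
    total = sum(range(1_000_000))
```

`perf_counter` is used instead of `time` because it is designed for measuring short intervals.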


## 5. Serialize and save the JSON representation of the generated ISA Investigation

In [8]:
start = time()
inv_json = json.dumps(investigation, cls=ISAJSONEncoder, sort_keys=True, indent=4, separators=(',', ': '))
end = time()
print('The JSON serialisation of the ISA investigation took {:.2f} s.'.format(end - start))

The JSON serialisation of the ISA investigation took 0.18 s.
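
The `cls=` argument of `json.dumps` is what lets a custom encoder turn model objects into plain dictionaries. The sketch below illustrates that mechanism with a toy class; `Sample` and `SampleEncoder` are hypothetical and say nothing about how `ISAJSONEncoder` is implemented internally.

```python
import json


class Sample:
    """Hypothetical toy class standing in for a model object."""

    def __init__(self, name, organism_part):
        self.name = name
        self.organism_part = organism_part


class SampleEncoder(json.JSONEncoder):
    """Toy encoder: json calls default() for objects it cannot serialise natively."""

    def default(self, o):
        if isinstance(o, Sample):
            return {"name": o.name, "organismPart": o.organism_part}
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(o)


sample_json = json.dumps(
    Sample("SMP-1", "blood"), cls=SampleEncoder, sort_keys=True, indent=4
)
print(sample_json)
```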


In [9]:
directory = os.path.abspath(os.path.join('output', 'painkillers-factorial'))
os.makedirs(directory, exist_ok=True)
with open(os.path.abspath(os.path.join(directory, 'isa-investigation-painkillers-factorial.json')), 'w') as out_fp:
    json.dump(json.loads(inv_json), out_fp)

## 6. Dump the ISA Investigation to ISA-Tab

In [10]:
start = time()
isatab.dump(investigation, directory)
end = time()
print('The Tab serialisation of the ISA investigation took {:.2f} s.'.format(end - start))

The Tab serialisation of the ISA investigation took 4.90 s.


To work with the tables in the notebook, we can also dump them to pandas DataFrames, using the `dump_tables_to_dataframes` function rather than `dump`.

In [11]:
dataframes = isatab.dump_tables_to_dataframes(investigation)

In [12]:
len(dataframes)

3

## 7. Check the correctness of the ISA-Tab DataFrames 

We have 1 study file and 2 assay files (one for cell counting by flow cytometry and one for NMR). Let's check the names:

In [13]:
for key in dataframes.keys():
    display(key)

's_study_01.txt'

'a_AT15_cell-counting_flow-cytometry.txt'

'a_AT2_metabolite-profiling_NMR-spectroscopy.txt'
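
The assay file names above appear to follow the pattern `a_<id>_<measurement>_<technology>.txt`. Assuming that convention holds (it is inferred from the two names shown, not taken from the ISA-API documentation), a small helper can split a name into its parts; `parse_assay_filename` is a hypothetical function.

```python
def parse_assay_filename(filename):
    """Split 'a_<id>_<measurement>_<technology>.txt' into its parts (assumed convention)."""
    stem = filename.rsplit(".", 1)[0]
    prefix, assay_id, measurement, technology = stem.split("_", 3)
    if prefix != "a":
        raise ValueError("not an assay file name: {}".format(filename))
    return {
        "id": assay_id,
        "measurement": measurement.replace("-", " "),
        "technology": technology.replace("-", " "),
    }


file_names = [
    "s_study_01.txt",
    "a_AT15_cell-counting_flow-cytometry.txt",
    "a_AT2_metabolite-profiling_NMR-spectroscopy.txt",
]

# Keep only assay tables, which start with the 'a_' prefix.
parsed = [parse_assay_filename(name) for name in file_names if name.startswith("a_")]
for entry in parsed:
    print(entry)
```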

### 7.1 Count of subjects and samples

We have 8 subjects in each of the six arms, for a total of 48 subjects. Three blood samples are collected per subject (1 during the treatment phase and 2 during the follow-up phase), for a total of 144 blood samples. These will undergo the cell counting (flow cytometry) assay. One adipose tissue sample is collected per subject during the treatment phase, for a total of 48 adipose tissue samples. These will undergo the NMR assay.

In [14]:
study_frame = dataframes['s_study_01.txt']
# Count the study-table rows (one per sample) whose source name contains each arm label.
# Note: `in` is a substring test, so 'GRP2' also matches labels such as 'GRP20'.
for arm in ('GRP0', 'GRP1', 'GRP2', 'GRP3'):
    arm_rows = study_frame[study_frame['Source Name'].apply(lambda el: arm in el)]
    print("There are {} samples in the {} arm (i.e. group)".format(len(arm_rows), arm))

There are 32 samples in the GRP0 arm (i.e. group)
There are 32 samples in the GRP1 arm (i.e. group)
There are 64 samples in the GRP2 arm (i.e. group)
There are 32 samples in the GRP3 arm (i.e. group)
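
The cell above selects rows with a substring test, so a label such as `GRP2` also matches any longer label that starts with it. This may be why GRP2 reports twice as many samples as the other arms, although the truncated table does not let us confirm it. The sketch below contrasts the substring test with an exact match on the group label, using hypothetical source names that follow the `GRP<n>_SBJ<m>` pattern seen in the study table.

```python
import re
from collections import Counter

# Hypothetical source names following the GRP<n>_SBJ<m> pattern.
source_names = ["GRP2_SBJ1", "GRP2_SBJ2", "GRP20_SBJ1", "GRP0_SBJ1"]

# Substring test: 'GRP2' also matches 'GRP20_SBJ1'.
substring_count = sum("GRP2" in name for name in source_names)

# Exact match on the group label that precedes the first underscore.
group_pattern = re.compile(r"^(GRP\d+)_")
exact_counts = Counter(
    group_pattern.match(name).group(1) for name in source_names
)

print("Substring matches for GRP2:", substring_count)
print("Exact matches per group:", dict(exact_counts))
```

With pandas, a comparable exact count could be obtained with `study_frame['Source Name'].str.split('_').str[0].value_counts()`.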


In [15]:
study_frame

Unnamed: 0,Source Name,Characteristics[Study Subject],Characteristics[age group],Protocol REF,Parameter Value[Sampling order],Parameter Value[Study cell],Date,Performer,Sample Name,Characteristics[organism part],Comment[study step with treatment],Factor Value[Sequence Order],Factor Value[AGENT],Factor Value[DURATION],Unit,Factor Value[INTENSITY],Unit.1
0,GRP0_SBJ1,Homo sapiens,<50,sample collection,029,A0E1,2021-04-14,Unknown,GRP0_SBJ1_A0E1_SMP-Blood-Sample-1,Blood Sample,NO,1,,60,days,,
1,GRP0_SBJ1,Homo sapiens,<50,sample collection,015,A0E0,2021-04-14,Unknown,GRP0_SBJ1_A0E0_SMP-Blood-Sample-1,Blood Sample,YES,0,tramadol,20,days,15.0,mg/day
2,GRP0_SBJ1,Homo sapiens,<50,sample collection,030,A0E1,2021-04-14,Unknown,GRP0_SBJ1_A0E1_SMP-Blood-Sample-2,Blood Sample,NO,1,,60,days,,
3,GRP0_SBJ1,Homo sapiens,<50,sample collection,007,A0E0,2021-04-14,Unknown,GRP0_SBJ1_A0E0_SMP-Adipose-Tissue-1,Adipose Tissue,YES,0,tramadol,20,days,15.0,mg/day
4,GRP0_SBJ2,Homo sapiens,<50,sample collection,013,A0E0,2021-04-14,Unknown,GRP0_SBJ2_A0E0_SMP-Blood-Sample-1,Blood Sample,YES,0,tramadol,20,days,15.0,mg/day
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
187,GRP49_SBJ7,Homo sapiens,50+,sample collection,182,A4E1,2021-04-14,Unknown,GRP49_SBJ7_A4E1_SMP-Blood-Sample-2,Blood Sample,NO,1,,60,days,,
188,GRP49_SBJ8,Homo sapiens,50+,sample collection,164,A4E0,2021-04-14,Unknown,GRP49_SBJ8_A4E0_SMP-Adipose-Tissue-1,Adipose Tissue,YES,0,oxycodone,30,days,10.0,mg/day
189,GRP49_SBJ8,Homo sapiens,50+,sample collection,183,A4E1,2021-04-14,Unknown,GRP49_SBJ8_A4E1_SMP-Blood-Sample-1,Blood Sample,NO,1,,60,days,,
190,GRP49_SBJ8,Homo sapiens,50+,sample collection,184,A4E1,2021-04-14,Unknown,GRP49_SBJ8_A4E1_SMP-Blood-Sample-2,Blood Sample,NO,1,,60,days,,


In [16]:
dataframes['a_AT15_cell-counting_flow-cytometry.txt']

Unnamed: 0,Sample Name,Comment[study step with treatment],Protocol REF,Performer,Labeled Extract Name,Label,Protocol REF.1,Parameter Value[instrument],Parameter Value[optical_path],Parameter Value[detector voltage],Scan Name,Performer.1,Raw Data File
0,GRP0_SBJ1_A0E0_SMP-Blood-Sample-1,YES,labeling,Unknown,AT15-S7-LE-R1,Biotin,data collection,Beckman Coulter,,,AT15-S7-data-collection-Acquisition-R1,Unknown,AT15-S7-raw_data_file-R1.raw
1,GRP0_SBJ1_A0E0_SMP-Blood-Sample-1,YES,labeling,Unknown,AT15-S7-LE-R2,Cy-3,data collection,Beckman Coulter,,,AT15-S7-data-collection-Acquisition-R2,Unknown,AT15-S7-raw_data_file-R2.raw
2,GRP0_SBJ1_A0E1_SMP-Blood-Sample-1,NO,labeling,Unknown,AT15-S21-LE-R1,Biotin,data collection,Beckman Coulter,,,AT15-S21-data-collection-Acquisition-R1,Unknown,AT15-S21-raw_data_file-R1.raw
3,GRP0_SBJ1_A0E1_SMP-Blood-Sample-1,NO,labeling,Unknown,AT15-S21-LE-R2,Cy-3,data collection,Beckman Coulter,,,AT15-S21-data-collection-Acquisition-R2,Unknown,AT15-S21-raw_data_file-R2.raw
4,GRP0_SBJ1_A0E1_SMP-Blood-Sample-2,NO,labeling,Unknown,AT15-S22-LE-R2,Cy-3,data collection,Beckman Coulter,,,AT15-S22-data-collection-Acquisition-R2,Unknown,AT15-S22-raw_data_file-R2.raw
...,...,...,...,...,...,...,...,...,...,...,...,...,...
283,GRP49_SBJ8_A4E0_SMP-Blood-Sample-1,YES,labeling,Unknown,AT15-S124-LE-R1,Biotin,data collection,Beckman Coulter,,,AT15-S124-data-collection-Acquisition-R1,Unknown,AT15-S124-raw_data_file-R1.raw
284,GRP49_SBJ8_A4E1_SMP-Blood-Sample-1,NO,labeling,Unknown,AT15-S135-LE-R1,Biotin,data collection,Beckman Coulter,,,AT15-S135-data-collection-Acquisition-R1,Unknown,AT15-S135-raw_data_file-R1.raw
285,GRP49_SBJ8_A4E1_SMP-Blood-Sample-1,NO,labeling,Unknown,AT15-S135-LE-R2,Cy-3,data collection,Beckman Coulter,,,AT15-S135-data-collection-Acquisition-R2,Unknown,AT15-S135-raw_data_file-R2.raw
286,GRP49_SBJ8_A4E1_SMP-Blood-Sample-2,NO,labeling,Unknown,AT15-S136-LE-R2,Cy-3,data collection,Beckman Coulter,,,AT15-S136-data-collection-Acquisition-R2,Unknown,AT15-S136-raw_data_file-R2.raw


In [17]:
dataframes['a_AT2_metabolite-profiling_NMR-spectroscopy.txt']

Unnamed: 0,Sample Name,Comment[study step with treatment],Protocol REF,Performer,Extract Name,Characteristics[extract type],Protocol REF.1,Parameter Value[instrument],Parameter Value[acquisition_mode],Parameter Value[pulse_sequence],Performer.1,Free Induction Decay Data File
0,GRP0_SBJ1_A0E0_SMP-Adipose-Tissue-1,YES,extraction,Unknown,AT2-S7-Extract-R1,supernatant,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,watergate,Unknown,AT2-S7-raw_spectral_data_file-R2.raw
1,GRP0_SBJ1_A0E0_SMP-Adipose-Tissue-1,YES,extraction,Unknown,AT2-S7-Extract-R2,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,watergate,Unknown,AT2-S7-raw_spectral_data_file-R4.raw
2,GRP0_SBJ1_A0E0_SMP-Adipose-Tissue-1,YES,extraction,Unknown,AT2-S7-Extract-R2,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,TOCSY,Unknown,AT2-S7-raw_spectral_data_file-R3.raw
3,GRP0_SBJ1_A0E0_SMP-Adipose-Tissue-1,YES,extraction,Unknown,AT2-S7-Extract-R1,supernatant,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,TOCSY,Unknown,AT2-S7-raw_spectral_data_file-R1.raw
4,GRP0_SBJ2_A0E0_SMP-Adipose-Tissue-1,YES,extraction,Unknown,AT2-S5-Extract-R2,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,watergate,Unknown,AT2-S5-raw_spectral_data_file-R4.raw
...,...,...,...,...,...,...,...,...,...,...,...,...
187,GRP49_SBJ7_A4E0_SMP-Adipose-Tissue-1,YES,extraction,Unknown,AT2-S43-Extract-R1,supernatant,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,TOCSY,Unknown,AT2-S43-raw_spectral_data_file-R1.raw
188,GRP49_SBJ8_A4E0_SMP-Adipose-Tissue-1,YES,extraction,Unknown,AT2-S44-Extract-R1,supernatant,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,TOCSY,Unknown,AT2-S44-raw_spectral_data_file-R1.raw
189,GRP49_SBJ8_A4E0_SMP-Adipose-Tissue-1,YES,extraction,Unknown,AT2-S44-Extract-R1,supernatant,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,watergate,Unknown,AT2-S44-raw_spectral_data_file-R2.raw
190,GRP49_SBJ8_A4E0_SMP-Adipose-Tissue-1,YES,extraction,Unknown,AT2-S44-Extract-R2,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,TOCSY,Unknown,AT2-S44-raw_spectral_data_file-R3.raw


### 7.2 Overview of the Flow Cytometry assay table
 

In [18]:
dataframes['a_AT15_cell-counting_flow-cytometry.txt'].nunique(axis=0, dropna=True)

Sample Name                           144
Comment[study step with treatment]      2
Protocol REF                            1
Performer                               1
Labeled Extract Name                  288
Label                                   2
Protocol REF.1                          1
Parameter Value[instrument]             1
Parameter Value[optical_path]           1
Parameter Value[detector voltage]       1
Scan Name                             288
Performer.1                             1
Raw Data File                         288
dtype: int64

### 7.3 Overview of the NMR assay table


In [19]:
dataframes['a_AT2_metabolite-profiling_NMR-spectroscopy.txt'].nunique(axis=0, dropna=True)

Sample Name                            48
Comment[study step with treatment]      1
Protocol REF                            1
Performer                               1
Extract Name                           96
Characteristics[extract type]           2
Protocol REF.1                          1
Parameter Value[instrument]             1
Parameter Value[acquisition_mode]       1
Parameter Value[pulse_sequence]         2
Performer.1                             1
Free Induction Decay Data File        192
dtype: int64
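
The unique counts reported in the two overviews can be cross-checked with simple arithmetic. The sketch below assumes 6 arms, 8 subjects per arm (`subjectSize` in the configuration), and 3 blood samples plus 1 adipose tissue sample per subject; the per-subject sample counts are inferred from the study table rows shown earlier rather than read from the configuration.

```python
arms = 6
subjects_per_arm = 8             # 'subjectSize' in the configuration
blood_samples_per_subject = 3    # assumption inferred from the study table
adipose_samples_per_subject = 1  # assumption inferred from the study table

blood_samples = arms * subjects_per_arm * blood_samples_per_subject
adipose_samples = arms * subjects_per_arm * adipose_samples_per_subject

# Flow cytometry: two labels (Biotin, Cy-3) per blood sample.
labeled_extracts = blood_samples * 2

# NMR: two extract types per adipose sample, two pulse sequences per extract.
nmr_extracts = adipose_samples * 2
fid_files = nmr_extracts * 2

print("Blood samples:", blood_samples)
print("Adipose tissue samples:", adipose_samples)
print("Labeled extracts:", labeled_extracts)
print("NMR extracts:", nmr_extracts)
print("FID files:", fid_files)
```

These values agree with the `Sample Name`, `Labeled Extract Name`, `Extract Name` and `Free Induction Decay Data File` counts printed above.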