# Create ISA-API Investigation from Datascriptor Study Design configuration
# Crossover Study with two dietary treatments on dogs

In this notebook I will show you how to use a study design configuration in JSON format, as produced by Datascriptor (https://gitlab.com/datascriptor/datascriptor), to generate a single-study ISA investigation, and how to then serialise it in JSON and tabular (i.e. CSV) format.

Our study design configuration consists of:
- a 4-arm study design. Each arm has 10 subjects
- subjects are dogs. There is an observational factor, named "health status", with two values: "healthy" and "with idiopathic epilepsy"
- a crossover of two dietary treatments, a control treatment ("control oil" 3 times/day for 90 days) and a proper treatment ("MCT oil supplement" 3 times/day for 90 days)
- four non-treatment phases: screen (14 days), run-in (5 days), washout (7 days) and follow-up (6 months)
- two sample types collected: nasopharyngeal samples and blood samples
- two assay types:
    - metabolite profiling through NMR spectroscopy on the blood samples
    - genomic sequencing on the nasopharyngeal samples
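
The sequence of phases that each subject goes through can be sketched in plain Python. This is only an illustration and not the Datascriptor configuration format: the `Epoch` class and the `arm_timeline` function are hypothetical names, the order of the phases is an assumption read from the `Factor Value[Sequence Order]` column of the study table shown later in this notebook, and the follow-up is kept in months because the source does not express it in days.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Epoch:
    """One phase of an arm (hypothetical helper, not an ISA-API class)."""
    name: str
    duration: int
    unit: str
    is_treatment: bool = False


def arm_timeline(first_treatment, second_treatment):
    """Return the assumed epoch order for one arm of the crossover."""
    return [
        Epoch("screen", 14, "days"),
        Epoch("run-in", 5, "days"),
        Epoch(first_treatment, 90, "days", is_treatment=True),
        Epoch("washout", 7, "days"),
        Epoch(second_treatment, 90, "days", is_treatment=True),
        Epoch("follow-up", 6, "months"),
    ]


timeline = arm_timeline("control oil", "MCT oil supplement")

# Days elapsed before the follow-up starts (the follow-up is given in months)
days_before_follow_up = sum(e.duration for e in timeline if e.unit == "days")
treatment_names = [e.name for e in timeline if e.is_treatment]

print(days_before_follow_up)
print(treatment_names)
```

Swapping the two arguments of `arm_timeline` gives the other treatment order of the crossover.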

## 1. Setup

Let's import all the required libraries

In [1]:
from time import time
import os
import json

In [2]:
## ISA-API related imports
from isatools.model import Investigation, Study

In [3]:
## ISA-API create mode related imports
from isatools.create.model import StudyDesign
from isatools.create.connectors import generate_study_design

# serializer from ISA Investigation to JSON
from isatools.isajson import ISAJSONEncoder

# ISA-Tab serialisation
from isatools import isatab

In [4]:
## ISA-API create mode related imports
from isatools.create import model
from isatools import isajson

## 2. Load the Study Design JSON configuration

First of all we load the study design configuration with all the specs defined above.

In [5]:
with open(os.path.abspath(os.path.join(
    "config", "crossover-study-dietary-dog.json"
)), "r") as config_file:
    study_design_config = json.load(config_file)
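
The loading step can be wrapped in a small helper that fails early with a readable message. This is a minimal sketch: `load_study_design_config` is a hypothetical name, the check that the top level is a JSON object is an assumption about the configuration, and the demo file uses placeholder keys rather than real Datascriptor fields.

```python
import json
import os
import tempfile


def load_study_design_config(path):
    """Read a JSON file and make sure its top level is a JSON object."""
    with open(path, "r", encoding="utf-8") as config_file:
        config = json.load(config_file)
    if not isinstance(config, dict):
        raise ValueError(
            "expected a JSON object at the top level of {}".format(path)
        )
    return config


# Demo on a throwaway file; the key below is a placeholder, not a Datascriptor field
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo-config.json")
    with open(demo_path, "w", encoding="utf-8") as demo_file:
        json.dump({"placeholder_key": "placeholder value"}, demo_file)
    demo_config = load_study_design_config(demo_path)

print(sorted(demo_config))
```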

## 3. Generate the ISA Study Design from the JSON configuration
To perform the conversion we just need to call the function `generate_study_design()` (the name may be subject to change).

In [6]:
study_design = generate_study_design(study_design_config)
assert isinstance(study_design, StudyDesign)

## 4. Generate the ISA Study from the StudyDesign and embed it into an ISA Investigation

The `StudyDesign.generate_isa_study()` method returns the complete ISA-API `Study` object.

In [7]:
start = time()
study = study_design.generate_isa_study()
end = time()
print('The generation of the study design took {:.2f} s.'.format(end - start))
assert isinstance(study, Study)
investigation = Investigation(identifier='inv01', studies=[study])

The generation of the study design took 2.76 s.
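
The `start = time()` / `end = time()` pattern is repeated in several cells of this notebook. A possible way to factor it out is a small context manager, sketched below. The name `timed` is hypothetical, and `perf_counter` is used instead of `time` because it is the standard-library clock intended for measuring intervals.

```python
from contextlib import contextmanager
from time import perf_counter, sleep


@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took; optionally store it in `results`."""
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print('{} took {:.2f} s.'.format(label, elapsed))


timings = {}
# A short sleep stands in for a slow call such as a serialisation step
with timed('A short sleep', timings):
    sleep(0.01)
```

Any of the timed cells could then be written as a single `with timed(...)` block around the call being measured.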


## 5. Serialize and save the JSON representation of the generated ISA Investigation

In [8]:
start = time()
inv_json = json.dumps(investigation, cls=ISAJSONEncoder, sort_keys=True, indent=4, separators=(',', ': '))
end = time()
print('The JSON serialisation of the ISA investigation took {:.2f} s.'.format(end - start))

The JSON serialisation of the ISA investigation took 1.03 s.


In [9]:
directory = os.path.abspath(os.path.join('output', 'crossover-2-treatments-mice'))
os.makedirs(directory, exist_ok=True)
with open(os.path.abspath(os.path.join(directory, 'isa-investigation-crossover-2-treatments-mice.json')), 'w') as out_fp:
    json.dump(json.loads(inv_json), out_fp)
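
Note that `json.dump(json.loads(inv_json), out_fp)` parses the string again and writes it with the default compact layout, so the indentation requested in the previous cell is not kept on disk. If the formatted text is wanted in the file, one option is to write the string directly and reload it as a sanity check. The sketch below uses a small stand-in dictionary rather than a real ISA investigation, and a temporary directory rather than the `output` folder.

```python
import json
import os
import tempfile

# Stand-in for the serialised investigation; this is not real ISA-JSON
record = {"identifier": "inv01", "studies": []}
text = json.dumps(record, sort_keys=True, indent=4, separators=(',', ': '))

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "example.json")
    # Write the already formatted string as-is
    with open(out_path, "w", encoding="utf-8") as out_fp:
        out_fp.write(text)
    # Read it back to check both the content and the layout
    with open(out_path, "r", encoding="utf-8") as in_fp:
        on_disk = in_fp.read()

round_trip_ok = json.loads(on_disk) == record
formatting_kept = on_disk == text
print(round_trip_ok, formatting_kept)
```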

## 6. Dump the ISA Investigation to ISA-Tab

In [10]:
start = time()
isatab.dump(investigation, directory)
end = time()
print('The Tab serialisation of the ISA investigation took {:.2f} s.'.format(end - start))

The Tab serialisation of the ISA investigation took 39.34 s.


To work with the tables inside the notebook, we can also dump them to pandas DataFrames, using the `dump_tables_to_dataframes` function rather than `dump`.

In [11]:
dataframes = isatab.dump_tables_to_dataframes(investigation)

In [12]:
len(dataframes)

3

## 7. Check the correctness of the ISA-Tab DataFrames 

We have 1 study file and 2 assay files (one for NMR spectroscopy and one for genome sequencing). Let's check the names:

In [13]:
for key in dataframes.keys():
    display(key)

's_study_01.txt'

'a_AT2_metabolite-profiling_NMR-spectroscopy.txt'

'a_AT8_genome-sequencing_nucleic-acid-sequencing.txt'

### 7.1 Count of subjects and samples

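The design has 4 arms. The last source in the study table is `GRP3_SBJ12`, which suggests 12 subjects per arm.

Two sample types are collected from each subject:
- blood samples (384 unique sample names in the NMR assay table)
- nasopharyngeal samples (384 unique sample names in the genome sequencing assay table)

Each row of the study table is one sample, so counting the rows per arm gives the number of samples per arm. Across the 4 study arms a total of 768 samples are collected (192 samples per arm).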

In [14]:
study_frame = dataframes['s_study_01.txt']
# Each row is one sample; the arm is encoded in the source name (GRP0 to GRP3)
for arm in ('GRP0', 'GRP1', 'GRP2', 'GRP3'):
    count_arm_samples = len(study_frame[study_frame['Source Name'].str.contains(arm)])
    print("There are {} samples in the {} arm (i.e. group)".format(count_arm_samples, arm))

There are 192 samples in the GRP0 arm (i.e. group)
There are 192 samples in the GRP1 arm (i.e. group)
There are 192 samples in the GRP2 arm (i.e. group)
There are 192 samples in the GRP3 arm (i.e. group)
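
The per-arm counts can be cross-checked against the number of unique sample names reported for the two assay tables later in this notebook. The sketch below only redoes the arithmetic with numbers copied from the notebook outputs; it does not query the DataFrames, and the variable names are illustrative.

```python
# Numbers copied from the outputs of this notebook
ARMS = ("GRP0", "GRP1", "GRP2", "GRP3")
samples_per_arm = 192
unique_samples_per_assay = {
    "metabolite profiling (blood)": 384,
    "genome sequencing (nasopharyngeal)": 384,
}

# Total seen from the study table, arm by arm
total_from_arms = len(ARMS) * samples_per_arm
# Total seen from the assay tables, sample type by sample type
total_from_assays = sum(unique_samples_per_assay.values())

print(total_from_arms, total_from_assays)
```

Both totals agree, which is consistent with every collected sample being used as input to exactly one of the two assays.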


### 7.2 Study Table Overview

The study table provides an overview of the subjects (sources) and samples

In [15]:
study_frame

Unnamed: 0,Source Name,Characteristics[Study Subject],Characteristics[health status],Protocol REF,Parameter Value[Sampling order],Parameter Value[Study cell],Date,Performer,Sample Name,Characteristics[organism part],Comment[study step with treatment],Factor Value[Sequence Order],Factor Value[INTENSITY],Unit,Factor Value[DURATION],Unit.1,Factor Value[AGENT]
0,GRP0_SBJ01,Dog,with idiopathic epilepsy,sample collection,115,A0E4,2021-06-29,Unknown,GRP0_SBJ01_A0E4_SMP-blood-1,blood,YES,4,3.0,times/day,90,days,MCT oil supplement
1,GRP0_SBJ01,Dog,with idiopathic epilepsy,sample collection,043,A0E1,2021-06-29,Unknown,GRP0_SBJ01_A0E1_SMP-blood-1,blood,NO,1,,,5,days,
2,GRP0_SBJ01,Dog,with idiopathic epilepsy,sample collection,079,A0E3,2021-06-29,Unknown,GRP0_SBJ01_A0E3_SMP-nasopharingeal-sample-1,nasopharingeal sample,NO,3,,,7,days,
3,GRP0_SBJ01,Dog,with idiopathic epilepsy,sample collection,177,A0E5,2021-06-29,Unknown,GRP0_SBJ01_A0E5_SMP-blood-3,blood,NO,5,,,6,months,
4,GRP0_SBJ01,Dog,with idiopathic epilepsy,sample collection,139,A0E5,2021-06-29,Unknown,GRP0_SBJ01_A0E5_SMP-nasopharingeal-sample-1,nasopharingeal sample,NO,5,,,6,months,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
763,GRP3_SBJ12,Dog,healthy,sample collection,763,A3E5,2021-06-29,Unknown,GRP3_SBJ12_A3E5_SMP-blood-1,blood,NO,5,,,6,months,
764,GRP3_SBJ12,Dog,healthy,sample collection,623,A3E1,2021-06-29,Unknown,GRP3_SBJ12_A3E1_SMP-blood-1,blood,NO,1,,,5,days,
765,GRP3_SBJ12,Dog,healthy,sample collection,695,A3E4,2021-06-29,Unknown,GRP3_SBJ12_A3E4_SMP-blood-1,blood,YES,4,3.0,times/day,90,days,control oil
766,GRP3_SBJ12,Dog,healthy,sample collection,635,A3E2,2021-06-29,Unknown,GRP3_SBJ12_A3E2_SMP-nasopharingeal-sample-1,nasopharingeal sample,YES,2,3.0,times/day,90,days,MCT oil supplement
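
The sample names in the table appear to follow a regular pattern: arm, subject, study cell, then sample type and a running index. The sketch below parses that pattern with a regular expression. The pattern is inferred from the rows shown above and is an assumption, not a documented ISA-API naming rule, and `parse_sample_name` is a hypothetical helper.

```python
import re
from collections import Counter

# Pattern inferred from names such as 'GRP0_SBJ01_A0E4_SMP-blood-1'
SAMPLE_NAME = re.compile(
    r"^(?P<arm>GRP\d+)_(?P<subject>SBJ\d+)_(?P<cell>A\d+E\d+)"
    r"_SMP-(?P<sample_type>.+)-(?P<index>\d+)$"
)


def parse_sample_name(name):
    """Split a sample name into its components, or raise ValueError."""
    match = SAMPLE_NAME.match(name)
    if match is None:
        raise ValueError("unexpected sample name: {!r}".format(name))
    return match.groupdict()


# A few names copied from the study table above (spelling as in the data)
names = [
    "GRP0_SBJ01_A0E4_SMP-blood-1",
    "GRP0_SBJ01_A0E3_SMP-nasopharingeal-sample-1",
    "GRP3_SBJ12_A3E5_SMP-blood-1",
]
parsed = [parse_sample_name(name) for name in names]
by_type = Counter(item["sample_type"] for item in parsed)

print(parsed[0])
print(by_type)
```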


### 7.3 First Assay: Metabolite Profiling using NMR spectroscopy

This assay takes blood samples as input.

In [19]:
dataframes['a_AT2_metabolite-profiling_NMR-spectroscopy.txt']

Unnamed: 0,Sample Name,Comment[study step with treatment],Protocol REF,Performer,Extract Name,Characteristics[extract type],Protocol REF.1,Parameter Value[instrument],Parameter Value[acquisition_mode],Parameter Value[pulse_sequence],Performer.1,Free Induction Decay Data File
0,GRP0_SBJ01_A0E0_SMP-blood-1,NO,extraction,Unknown,AT2-S7-Extract-R1,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,watergate,Unknown,AT2-S7-raw_spectral_data_file-R4.raw
1,GRP0_SBJ01_A0E0_SMP-blood-1,NO,extraction,Unknown,AT2-S7-Extract-R1,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,HOESY,Unknown,AT2-S7-raw_spectral_data_file-R1.raw
2,GRP0_SBJ01_A0E0_SMP-blood-1,NO,extraction,Unknown,AT2-S7-Extract-R1,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,HOESY,Unknown,AT2-S7-raw_spectral_data_file-R2.raw
3,GRP0_SBJ01_A0E0_SMP-blood-1,NO,extraction,Unknown,AT2-S7-Extract-R1,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,watergate,Unknown,AT2-S7-raw_spectral_data_file-R3.raw
4,GRP0_SBJ01_A0E1_SMP-blood-1,NO,extraction,Unknown,AT2-S19-Extract-R1,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,watergate,Unknown,AT2-S19-raw_spectral_data_file-R4.raw
...,...,...,...,...,...,...,...,...,...,...,...,...
1531,GRP3_SBJ12_A3E5_SMP-blood-2,NO,extraction,Unknown,AT2-S380-Extract-R1,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,HOESY,Unknown,AT2-S380-raw_spectral_data_file-R2.raw
1532,GRP3_SBJ12_A3E5_SMP-blood-3,NO,extraction,Unknown,AT2-S381-Extract-R1,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,HOESY,Unknown,AT2-S381-raw_spectral_data_file-R2.raw
1533,GRP3_SBJ12_A3E5_SMP-blood-3,NO,extraction,Unknown,AT2-S381-Extract-R1,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,watergate,Unknown,AT2-S381-raw_spectral_data_file-R3.raw
1534,GRP3_SBJ12_A3E5_SMP-blood-3,NO,extraction,Unknown,AT2-S381-Extract-R1,pellet,nmr_spectroscopy,Bruker Avance II 1 GHz,1D 1H NMR,watergate,Unknown,AT2-S381-raw_spectral_data_file-R4.raw


#### NMR Spectroscopy Stats

For this assay we have 384 blood samples. One extract (of type "pellet") is produced from each sample, for a total of 384 extracts. For each extract, 4 NMR spectroscopy runs are performed (using a Bruker Avance II 1 GHz, 1D 1H NMR acquisition mode, 2 replicates each for the watergate and HOESY pulse sequences), for a total of 1536 NMR processes and 1536 free induction decay data files.

In [20]:
dataframes['a_AT2_metabolite-profiling_NMR-spectroscopy.txt'].nunique(axis=0, dropna=True)

Sample Name                            384
Comment[study step with treatment]       2
Protocol REF                             1
Performer                                1
Extract Name                           384
Characteristics[extract type]            1
Protocol REF.1                           1
Parameter Value[instrument]              1
Parameter Value[acquisition_mode]        1
Parameter Value[pulse_sequence]          2
Performer.1                              1
Free Induction Decay Data File        1536
dtype: int64

### 7.4 Second Assay: Genome Sequencing

This assay takes nasopharyngeal samples as input.

In [21]:
dataframes['a_AT8_genome-sequencing_nucleic-acid-sequencing.txt']

Unnamed: 0,Sample Name,Comment[study step with treatment],Protocol REF,Performer,Extract Name,Characteristics[extract type],Protocol REF.1,Parameter Value[instrument],Parameter Value[library_orientation],Parameter Value[library_strategy],Performer.1,Raw Data File
0,GRP0_SBJ01_A0E0_SMP-nasopharingeal-sample-1,NO,extraction,Unknown,AT8-S7-Extract-R2,DNA,library_preparation,PromethION,single,WGS,Unknown,AT8-S7-raw_data_file-R5.raw
1,GRP0_SBJ01_A0E0_SMP-nasopharingeal-sample-1,NO,extraction,Unknown,AT8-S7-Extract-R2,DNA,library_preparation,PromethION,single,WGS,Unknown,AT8-S7-raw_data_file-R6.raw
2,GRP0_SBJ01_A0E0_SMP-nasopharingeal-sample-1,NO,extraction,Unknown,AT8-S7-Extract-R2,DNA,library_preparation,PromethION,paired,WGS,Unknown,AT8-S7-raw_data_file-R7.raw
3,GRP0_SBJ01_A0E0_SMP-nasopharingeal-sample-1,NO,extraction,Unknown,AT8-S7-Extract-R2,DNA,library_preparation,PromethION,paired,WGS,Unknown,AT8-S7-raw_data_file-R8.raw
4,GRP0_SBJ01_A0E0_SMP-nasopharingeal-sample-1,NO,extraction,Unknown,AT8-S7-Extract-R1,gDNA,library_preparation,PromethION,single,WGS,Unknown,AT8-S7-raw_data_file-R4.raw
...,...,...,...,...,...,...,...,...,...,...,...,...
3067,GRP3_SBJ12_A3E5_SMP-nasopharingeal-sample-3,NO,extraction,Unknown,AT8-S381-Extract-R2,DNA,library_preparation,PromethION,single,WGS,Unknown,AT8-S381-raw_data_file-R5.raw
3068,GRP3_SBJ12_A3E5_SMP-nasopharingeal-sample-3,NO,extraction,Unknown,AT8-S381-Extract-R1,gDNA,library_preparation,PromethION,paired,WGS,Unknown,AT8-S381-raw_data_file-R2.raw
3069,GRP3_SBJ12_A3E5_SMP-nasopharingeal-sample-3,NO,extraction,Unknown,AT8-S381-Extract-R1,gDNA,library_preparation,PromethION,paired,WGS,Unknown,AT8-S381-raw_data_file-R1.raw
3070,GRP3_SBJ12_A3E5_SMP-nasopharingeal-sample-3,NO,extraction,Unknown,AT8-S381-Extract-R2,DNA,library_preparation,PromethION,paired,WGS,Unknown,AT8-S381-raw_data_file-R8.raw


#### Genome Sequencing Stats

For this assay we use 384 nasopharyngeal samples. For each sample, two extracts are produced (one DNA and one gDNA), for a total of 768 extracts. For each extract, four sequencing runs are performed (using PromethION, WGS library strategy, 2 replicates each for single and paired library orientation), producing a total of 3072 raw data files.

In [18]:
dataframes['a_AT8_genome-sequencing_nucleic-acid-sequencing.txt'].nunique(axis=0, dropna=True)

Sample Name                              384
Comment[study step with treatment]         2
Protocol REF                               1
Performer                                  1
Extract Name                             768
Characteristics[extract type]              2
Protocol REF.1                             1
Parameter Value[instrument]                1
Parameter Value[library_orientation]       2
Parameter Value[library_strategy]          1
Performer.1                                1
Raw Data File                           3072
dtype: int64
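
The number of data files in each assay table can be checked with a simple product: samples × extracts per sample × parameter combinations × replicates. The sketch below applies it to both assays, with the factors read from the `nunique` outputs and the table previews in this notebook. The function name `expected_data_files` is illustrative, and the replicate count of 2 is inferred from the file names shown in the previews.

```python
def expected_data_files(samples, extracts_per_sample, parameter_values, replicates):
    """Number of raw data files produced by a fully crossed assay design."""
    return samples * extracts_per_sample * parameter_values * replicates


# NMR: 384 blood samples, 1 extract each, 2 pulse sequences, 2 replicates
nmr_files = expected_data_files(384, 1, 2, 2)

# Sequencing: 384 samples, 2 extract types each, 2 library orientations, 2 replicates
sequencing_files = expected_data_files(384, 2, 2, 2)

print(nmr_files, sequencing_files)
```

These products match the `Free Induction Decay Data File` and `Raw Data File` counts reported by `nunique` above.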