# Programmatically creating SED-ML and COMBINE archives from model files

To make it easy for investigators to work with a broad range of model formats, modeling frameworks, simulation types, simulation algorithms, and simulation tools, BioSimulators uses the [Simulation Experiment Description Markup Language (SED-ML)](http://sed-ml.org/) and [COMBINE/OMEX archive](https://combinearchive.org/) formats. 

BioSimulators uses SED-ML to describe simulation experiments. This includes:
* Which models to simulate
* How to modify models to simulate variants such as alternative initial conditions
* What type of simulations to execute (e.g., steady-state, time course)
* Which algorithms to use (e.g., CVODE, SSA)
* Which observables to record
* How to reduce the recorded values of the observables
* How to plot the observables
* How to export the observables to reports (e.g., CSV, HDF5)
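
To make the list above concrete, the following sketch parses a small, hand-written SED-ML skeleton with Python's standard library. The element names follow SED-ML Level 1 Version 3. The ids (`model`, `simulation`, `task`), the file name `model.xml`, and the choice of the KiSAO term for flux balance analysis are illustrative assumptions, not output produced by BioSimulators-utils.

```python
import xml.etree.ElementTree as ET

# Namespace of SED-ML Level 1 Version 3
NAMESPACES = {'sed': 'http://sed-ml.org/sed-ml/level1/version3'}

# Hand-written skeleton: one model, one steady-state simulation, one task
sedml_text = '''
<sedML xmlns="http://sed-ml.org/sed-ml/level1/version3" level="1" version="3">
  <listOfModels>
    <model id="model" language="urn:sedml:language:sbml" source="model.xml"/>
  </listOfModels>
  <listOfSimulations>
    <steadyState id="simulation">
      <algorithm kisaoID="KISAO:0000437"/>
    </steadyState>
  </listOfSimulations>
  <listOfTasks>
    <task id="task" modelReference="model" simulationReference="simulation"/>
  </listOfTasks>
</sedML>
'''

root = ET.fromstring(sedml_text.strip())

# Which models to simulate
model_ids = [model.get('id') for model in root.findall('sed:listOfModels/sed:model', NAMESPACES)]

# Which algorithm to use
algorithm = root.find('sed:listOfSimulations/sed:steadyState/sed:algorithm', NAMESPACES)

# Which simulation to apply to which model
task = root.find('sed:listOfTasks/sed:task', NAMESPACES)

print(model_ids, algorithm.get('kisaoID'), task.get('modelReference'), task.get('simulationReference'))
```

A complete document would also contain `listOfDataGenerators` and `listOfOutputs` sections, which correspond to the observables, plots, and reports in the list above.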

BioSimulators uses COMBINE/OMEX archives to bundle the multiple files typically involved in modeling projects into a single archive. These files include:
* Models (e.g., in CellML, SBML format)
* Simulation experiments (SED-ML files)
* Visualizations of simulation results (e.g., in [Vega format](https://vega.github.io/vega/))
* Supplementary files, such as data used to calibrate the model
* Metadata about the simulation project (RDF files that follow the [OMEX Metadata guidelines](http://co.mbine.org/standards/omex-metadata))
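
Under the hood, a COMBINE/OMEX archive is a ZIP file with a `manifest.xml` file that lists the format of each entry. The sketch below builds such an archive in memory with the standard library only. The file names and placeholder file contents are assumptions for illustration; the rest of this tutorial uses BioSimulators-utils to do this instead.

```python
import io
import zipfile

# Manifest listing the archive itself, the manifest, and three content files
MANIFEST = '''<?xml version="1.0" encoding="UTF-8"?>
<omexManifest xmlns="http://identifiers.org/combine.specifications/omex-manifest">
  <content location="." format="http://identifiers.org/combine.specifications/omex"/>
  <content location="manifest.xml" format="http://identifiers.org/combine.specifications/omex-manifest"/>
  <content location="model.xml" format="http://identifiers.org/combine.specifications/sbml"/>
  <content location="simulation.sedml" format="http://identifiers.org/combine.specifications/sed-ml" master="true"/>
  <content location="metadata.rdf" format="http://identifiers.org/combine.specifications/omex-metadata"/>
</omexManifest>
'''

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as archive:
    archive.writestr('manifest.xml', MANIFEST)
    # Placeholder contents; a real archive would hold the actual files
    for name in ('model.xml', 'simulation.sedml', 'metadata.rdf'):
        archive.writestr(name, '<!-- placeholder -->')

# Re-open the archive and list its entries
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()

print(names)
```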

[runBioSimulations](https://run.biosimulations.org/create) provides a simple web form for building COMBINE/OMEX archives with SED-ML files from model files (e.g., CellML, SBML). In addition, [BioSimulators-utils](https://github.com/biosimulators/Biosimulators_utils) provides a command-line application for building COMBINE/OMEX archives with SED-ML files from model files. Both tools support all of the modeling languages supported by BioSimulators. Instructions for using the command-line application are available at [https://docs.biosimulators.org](https://docs.biosimulators.org/Biosimulators_utils/).

These tools are easy to use. However, they give investigators little flexibility to customize the generated COMBINE/OMEX archives and SED-ML files, for example to bundle simulations of multiple model files into a single COMBINE/OMEX archive. The BioSimulators-utils Python API provides this additional flexibility.

This tutorial illustrates how to use the BioSimulators-utils API to programmatically create COMBINE/OMEX archives from model files by constructing a COMBINE/OMEX archive for a [flux balance model of the core metabolism of Escherichia coli](data/Escherichia-coli-core-metabolism.xml) encoded in SBML.

## 1. Use `get_parameters_variables_outputs_for_simulation` to introspect the model file

In [1]:
from biosimulators_utils.sedml.data_model import ModelLanguage, SteadyStateSimulation
from biosimulators_utils.sedml.model_utils import get_parameters_variables_outputs_for_simulation

In [2]:
model_filename = 'data/Escherichia-coli-core-metabolism.xml'
model_language = ModelLanguage.SBML
simulation_type = SteadyStateSimulation

In [3]:
params, simulations, vars, outputs = get_parameters_variables_outputs_for_simulation(
    model_filename, model_language, simulation_type, native_ids=True)

## 2. Sort the variables of the model into objectives and reaction fluxes

In [4]:
# select variables by the XPath prefix of their targets
obj_variables = [var for var in vars if var.target.startswith('/sbml:sbml/sbml:model/fbc:listOfObjectives/')]
rxn_flux_variables = [var for var in vars if var.target.startswith('/sbml:sbml/sbml:model/sbml:listOfReactions/')]
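
The selection above silently ignores any variable whose target matches neither prefix. The following sketch shows one way to group variables by prefix while also collecting the ones that match nothing. `Variable` is a stand-in for the objects returned by BioSimulators-utils, and the example targets are made up for illustration.

```python
from collections import namedtuple

# Stand-in for the library's variable objects; only `id` and `target` are needed here
Variable = namedtuple('Variable', ['id', 'target'])

example_vars = [
    Variable('obj', "/sbml:sbml/sbml:model/fbc:listOfObjectives/fbc:objective[@fbc:id='obj']"),
    Variable('R_PGI', "/sbml:sbml/sbml:model/sbml:listOfReactions/sbml:reaction[@id='R_PGI']"),
    Variable('R_PFK', "/sbml:sbml/sbml:model/sbml:listOfReactions/sbml:reaction[@id='R_PFK']"),
    Variable('glc', "/sbml:sbml/sbml:model/sbml:listOfSpecies/sbml:species[@id='glc']"),
]

PREFIXES = {
    'objectives': '/sbml:sbml/sbml:model/fbc:listOfObjectives/',
    'reaction_fluxes': '/sbml:sbml/sbml:model/sbml:listOfReactions/',
}

groups = {key: [] for key in PREFIXES}
unmatched = []
for var in example_vars:
    for key, prefix in PREFIXES.items():
        if var.target.startswith(prefix):
            groups[key].append(var)
            break
    else:
        # No prefix matched; keep the variable so that it can be reviewed
        unmatched.append(var)

print({key: [var.id for var in group] for key, group in groups.items()})
print([var.id for var in unmatched])
```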

## 3. Initialize a SED-ML document which describes a flux balance analysis simulation of the model

In [5]:
from biosimulators_utils.sedml.data_model import SedDocument, Model, Task, Report
import os.path

In [6]:
# initialize SED document
sed_doc = SedDocument()

# create model
model = Model(
    id='model',
    source=os.path.basename(model_filename),
    language=ModelLanguage.SBML.value,
    changes=params,
)
sed_doc.models.append(model)

# attach simulations
sed_doc.simulations = simulations

# create task
task = Task(
    id='task',
    model=model,
    simulation=simulations[0],
)
sed_doc.tasks.append(task)

## 4. Add reports to the SED-ML document to export the objective and reaction fluxes predicted by the simulation

In [7]:
from biosimulators_utils.sedml.data_model import Report, DataSet, DataGenerator

In [8]:
report = Report(
    id='objective',
    name='Objective',
)
sed_doc.outputs.append(report)
for var in obj_variables:
    var_id = var.id
    var_name = var.name

    var.id = 'variable_' + var_id
    var.name = None

    var.task = task
    data_gen = DataGenerator(
        id='data_generator_{}'.format(var_id),
        variables=[var],
        math=var.id,
    )
    sed_doc.data_generators.append(data_gen)
    report.data_sets.append(DataSet(
        id=var_id,
        label=var_id,
        name=var_name,
        data_generator=data_gen,
    ))

In [9]:
report = Report(
    id='reaction_fluxes',
    name='Reaction fluxes',
)
sed_doc.outputs.append(report)
for var in rxn_flux_variables:
    var_id = var.id
    var_name = var.name

    var.id = 'variable_' + var_id
    var.name = None

    var.task = task
    data_gen = DataGenerator(
        id='data_generator_{}'.format(var_id),
        variables=[var],
        math=var.id,
    )
    sed_doc.data_generators.append(data_gen)
    report.data_sets.append(DataSet(
        id=var_id,
        label=var_id,
        name=var_name if len(rxn_flux_variables) < 4000 else None,
        data_generator=data_gen,
    ))
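
The two cells above repeat the same loop and differ only in the report and the list of variables. As a possible refactoring, the sketch below factors the loop into a helper. The helper name `add_data_sets` and its parameters are hypothetical, and `SimpleNamespace` objects stand in for the library's classes so that the example runs on its own. In the notebook, one would presumably pass `DataGenerator` and `DataSet` as the two factories and `sed_doc.data_generators` as the list. The helper does not reproduce the conditional data set names used for the reaction fluxes.

```python
from types import SimpleNamespace


def add_data_sets(report, variables, task, data_generators,
                  make_data_generator, make_data_set):
    """Attach one data generator and one data set per variable to a report."""
    for var in variables:
        original_id, original_name = var.id, var.name

        # Rename the variable so that its id doesn't clash with the data set id
        var.id = 'variable_' + original_id
        var.name = None
        var.task = task

        data_gen = make_data_generator(
            id='data_generator_' + original_id,
            variables=[var],
            math=var.id,
        )
        data_generators.append(data_gen)
        report.data_sets.append(make_data_set(
            id=original_id,
            label=original_id,
            name=original_name,
            data_generator=data_gen,
        ))


# Stand-in objects with made-up ids and names
example_task = SimpleNamespace(id='task')
example_report = SimpleNamespace(id='reaction_fluxes', data_sets=[])
example_data_generators = []
example_variables = [
    SimpleNamespace(id='R_PGI', name='Glucose-6-phosphate isomerase', task=None),
    SimpleNamespace(id='R_PFK', name='Phosphofructokinase', task=None),
]

add_data_sets(example_report, example_variables, example_task, example_data_generators,
              make_data_generator=SimpleNamespace, make_data_set=SimpleNamespace)

print([data_set.id for data_set in example_report.data_sets])
```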

## 5. Create a temporary directory to collect the files for the COMBINE archive

In [10]:
import os
import tempfile
os.makedirs('tmp', exist_ok=True)
archive_dirname = tempfile.mkdtemp(dir='tmp/')

## 6. Copy the model file to the temporary directory

In [11]:
import shutil
shutil.copyfile(model_filename, os.path.join(archive_dirname, os.path.basename(model_filename)))

'tmp/tmpde5i_9i5/Escherichia-coli-core-metabolism.xml'

## 7. Export the SED document to a file

In [12]:
from biosimulators_utils.sedml.io import SedmlSimulationWriter

In [13]:
sedml_filename = os.path.join(archive_dirname, 'simulation.sedml')
SedmlSimulationWriter().run(sed_doc, sedml_filename)

  - Model `model` may be invalid.
    - The model file `Escherichia-coli-core-metabolism.xml` may be invalid.
      - In situations where a mathematical expression refers to a compartment, species or parameter, it is necessary to know the units of the object to establish unit consistency. In models where the units of an object have not been declared, libSBML does not yet have the functionality to accurately verify the consistency of the units in mathematical expressions referring to that object. 
         The units of the <compartment> 'c' cannot be fully checked. Unit consistency reported as either no errors or further unit errors related to this object may not be accurate.
        
      - If neither the attribute 'units' nor the attribute 'spatialDimensions' on a Compartment object is set, the unit associated with that compartment's size is undefined.
        Reference: L3V1 Section 4.5
         The <compartment> 'c' has no discernable units.
        
      - In situations where a m

## 8. Collect metadata about the model

In [14]:
import datetime

In [15]:
citation_doi = '10.1186/1752-0509-7-74'
now = datetime.datetime.now()
metadata = {
    "uri": '.',
    'title': 'Escherichia coli core metabolism',
    'abstract': 'Flux balance analysis model of the metabolism of Escherichia coli',
    'keywords': [
        'metabolism',
        'BiGG',
    ],
    'description': None,
    'taxa': [
        {
            'uri': 'http://identifiers.org/taxonomy:83333',
            'label': 'Escherichia coli K-12',
        },
    ],
    'encodes': [
        {
            'uri': 'http://identifiers.org/GO:0008152',
            'label': 'metabolic process',
        },
    ],
    'thumbnails': [
        ''
    ],
    'sources': [],
    'predecessors': [],
    'successors': [],
    'see_also': [],
    'creators': [{
        'uri': 'https://identifiers.org/github:opencobra/cobrapy',
        'label': 'COBRApy Team',
    }],
    'contributors': [{
        'uri': 'https://identifiers.org/orcid:0000-0002-2605-5080',
        'label': 'Jonathan Karr',
    }],
    'identifiers': [
        {
            'uri': 'http://identifiers.org/bigg.model:e_coli_core',
            'label': 'bigg.model:e_coli_core',
        },
    ],
    'citations': [
        {
            'uri': 'http://identifiers.org/doi:' + citation_doi,
            'label': None,
        },
    ],
    'license': {
        'uri': 'http://bigg.ucsd.edu/license',
        'label': 'BiGG',
    },
    'funders': [],
    'created': '{}-{:02d}-{:02d}'.format(now.year, now.month, now.day),
    'modified': [
        '{}-{:02d}-{:02d}'.format(now.year, now.month, now.day),
    ],
    'other': [],
}
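
The `created` and `modified` dates above are assembled by hand with `str.format`. As a side note, the standard library can produce the same `YYYY-MM-DD` strings directly. The sketch below compares the three spellings on a fixed example date, which is an arbitrary choice made so that the output is reproducible.

```python
import datetime

# Fixed example timestamp (arbitrary) instead of datetime.datetime.now()
example_now = datetime.datetime(2021, 3, 7, 14, 30)

# Manual formatting, as in the cell above
manual = '{}-{:02d}-{:02d}'.format(example_now.year, example_now.month, example_now.day)

# Equivalent standard-library spellings
iso = example_now.date().isoformat()
formatted = example_now.strftime('%Y-%m-%d')

print(manual, iso, formatted)
```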

## 9. Use `get_reference` to look up citation information for the reference

In [16]:
from biosimulators_utils.ref.utils import get_reference
import Bio.Entrez
Bio.Entrez.email = 'john.doe@university.edu'

In [17]:
citation = get_reference(doi=citation_doi)

## 10. Add the citation to the metadata

In [18]:
metadata['citations'][0]['label'] = citation.get_citation()

## 11. Use `get_pubmed_central_open_access_graphics` to retrieve thumbnail images for the model from the [open-access subset of PubMed Central](https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/)

In [19]:
from biosimulators_utils.ref.utils import get_pubmed_central_open_access_graphics

In [20]:
thumbnails = get_pubmed_central_open_access_graphics(
    citation.pubmed_central_id,
    archive_dirname,
)

## 12. Add the thumbnails to the metadata

In [21]:
metadata['thumbnails'] = [os.path.relpath(thumbnail.filename, archive_dirname) for thumbnail in thumbnails]

## 13. Save metadata about the model to an RDF file

In [22]:
from biosimulators_utils.omex_meta.io import BiosimulationsOmexMetaWriter

In [23]:
metadata_filename = os.path.join(archive_dirname, 'metadata.rdf')
BiosimulationsOmexMetaWriter().run([metadata], metadata_filename)

## 14. Create a description of the desired COMBINE/OMEX archive

In [24]:
from biosimulators_utils.combine.data_model import CombineArchive, CombineArchiveContent, CombineArchiveContentFormat

In [25]:
# initialize the archive
archive = CombineArchive()

# add the model file to the archive
archive.contents.append(CombineArchiveContent(
    location=os.path.basename(model_filename),
    format=CombineArchiveContentFormat.SBML.value,
))

# add the SED-ML file to the archive
archive.contents.append(CombineArchiveContent(
    location='simulation.sedml',
    format=CombineArchiveContentFormat.SED_ML.value,
    master=True,
))

# add the RDF metadata file to the archive
archive.contents.append(CombineArchiveContent(
    location='metadata.rdf',
    format=CombineArchiveContentFormat.OMEX_METADATA.value,
))
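
Because each `location` is a plain string, a misspelled file name is easy to overlook until the archive is executed. A small check that every listed location exists in the archive directory can catch this earlier. The helper `find_missing_locations` below is hypothetical, and the example directory and file names are placeholders. In the notebook, it could presumably be called with `[content.location for content in archive.contents]` and `archive_dirname`.

```python
import os
import tempfile


def find_missing_locations(locations, dirname):
    """Return the locations that do not exist as files inside dirname."""
    return [
        location for location in locations
        if not os.path.isfile(os.path.join(dirname, location))
    ]


with tempfile.TemporaryDirectory() as example_dirname:
    # Create placeholder files that mimic the contents of an archive directory
    for filename in ('model.xml', 'simulation.sedml', 'metadata.rdf'):
        with open(os.path.join(example_dirname, filename), 'w') as file:
            file.write('placeholder')

    missing_ok = find_missing_locations(
        ['model.xml', 'simulation.sedml', 'metadata.rdf'], example_dirname)
    missing_typo = find_missing_locations(
        ['model.xml', 'simulation.sedml', 'metdata.rdf'], example_dirname)

print(missing_ok, missing_typo)
```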

## 15. Export the COMBINE/OMEX archive to a file

In [26]:
from biosimulators_utils.combine.io import CombineArchiveWriter

In [27]:
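# mkstemp returns an open file descriptor; close it so that it isn't leaked
archive_fd, archive_filename = tempfile.mkstemp(dir='tmp/', suffix='.omex')
os.close(archive_fd)
CombineArchiveWriter().run(archive, archive_dirname, archive_filename)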

## 16. Inspect the COMBINE/OMEX archive

In [28]:
from IPython.display import FileLink, FileLinks

In [29]:
FileLinks(archive_dirname)

In [30]:
FileLink(archive_filename)
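
Since a COMBINE/OMEX archive is a ZIP file, its manifest can also be inspected without any BioSimulators tooling. The sketch below reads `manifest.xml` from an archive and reports each entry's location and whether it is flagged as the master file. The helper name `read_manifest` and the tiny example archive are assumptions for illustration. In the notebook, the helper could be pointed at `archive_filename`.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
import zipfile

OMEX_NS = {'omex': 'http://identifiers.org/combine.specifications/omex-manifest'}


def read_manifest(omex_filename):
    """Return (location, is_master) pairs for the entries in an archive's manifest."""
    with zipfile.ZipFile(omex_filename) as archive:
        root = ET.fromstring(archive.read('manifest.xml'))
    return [
        (content.get('location'), content.get('master') == 'true')
        for content in root.findall('omex:content', OMEX_NS)
    ]


# Build a tiny example archive so that the sketch is self-contained
example_manifest = (
    '<omexManifest xmlns="http://identifiers.org/combine.specifications/omex-manifest">'
    '<content location="." format="http://identifiers.org/combine.specifications/omex"/>'
    '<content location="simulation.sedml"'
    ' format="http://identifiers.org/combine.specifications/sed-ml" master="true"/>'
    '</omexManifest>'
)
with tempfile.TemporaryDirectory() as example_dirname:
    example_filename = os.path.join(example_dirname, 'example.omex')
    with zipfile.ZipFile(example_filename, 'w') as archive:
        archive.writestr('manifest.xml', example_manifest)
        archive.writestr('simulation.sedml', '<!-- placeholder -->')
    example_entries = read_manifest(example_filename)

print(example_entries)
```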

## 17. Verify the COMBINE/OMEX archive

Execute the archive with [CBMpy](https://biosimulators.org/simulators/cbmpy) and [COBRApy](https://biosimulators.org/simulators/cobrapy).

In [31]:
from biosimulators_utils.config import Config
import biosimulators_cbmpy
import biosimulators_cobrapy


INFO: No xlwt module available, Excel spreadsheet creation disabled



No module named 'cplex'



CPLEX not available

*****
Using GLPK
*****


INFO: No xlrd module available, Excel spreadsheet reading disabled

CBMPy environment
******************
Revision: r689


***********************************************************************
* Welcome to CBMPy (0.7.25) - PySCeS Constraint Based Modelling       *
*                http://cbmpy.sourceforge.net                         *
* Copyright(C) Brett G. Olivier 2014 - 2019                           *
* Dept. of Systems Bioinformatics                                     *
* Vrije Universiteit Amsterdam, Amsterdam, The Netherlands            *
* CBMPy is developed as part of the BeBasic MetaToolKit Project       *
* Distributed under the GNU GPL v 3.0 licence, see                    *
* LICENCE (supplied with this release) for details                    *
***********************************************************************



In [32]:
output_dirname = tempfile.mkdtemp(dir='tmp/')
config = Config(
    COLLECT_COMBINE_ARCHIVE_RESULTS=True,
    LOG=False,
)

results_cbmpy, _ = biosimulators_cbmpy.exec_sedml_docs_in_combine_archive(archive_filename, output_dirname, config=config)
results_cobrapy, _ = biosimulators_cobrapy.exec_sedml_docs_in_combine_archive(archive_filename, output_dirname, config=config)

  - Model `model` may be invalid.
    - The model file `Escherichia-coli-core-metabolism.xml` may be invalid.
      - In situations where a mathematical expression refers to a compartment, species or parameter, it is necessary to know the units of the object to establish unit consistency. In models where the units of an object have not been declared, libSBML does not yet have the functionality to accurately verify the consistency of the units in mathematical expressions referring to that object. 
         The units of the <compartment> 'c' cannot be fully checked. Unit consistency reported as either no errors or further unit errors related to this object may not be accurate.
        
      - If neither the attribute 'units' nor the attribute 'spatialDimensions' on a Compartment object is set, the unit associated with that compartment's size is undefined.
        Reference: L3V1 Section 4.5
         The <compartment> 'c' has no discernable units.
        
      - In situations where a m

Archive contains 1 SED-ML documents with 1 models, 1 simulations, 1 tasks, 2 reports, and 0 plots:
  simulation.sedml:
    Tasks (1):
      task
    Reports (2):
      objective: 1 data sets
      reaction_fluxes: 95 data sets

Executing SED-ML file 1: simulation.sedml ...
  Found 1 tasks and 2 outputs:
    Tasks:
      `task`
    Outputs:
      `objective`
      `reaction_fluxes`
  Executing task 1: `task`
    Executing simulation ...FBC version: 2
M.getNumReactions: 95
M.getNumSpecies: 72
FBC.getNumObjectives: 1
FBC.getNumParameters: 5
FBC.getNumGeneProducts: 137
Zero dimension compartment detected: c
Zero dimension compartment detected: e
FluxBounds process1: 0.006
INFO: Active objective: obj
Adding objective: obj
FluxBounds process2: 0.002

SBML3 load time: 0.191


gplk_constructLPfromFBA time: 0.013360738754272461


glpk_analyzeModel FBA --> LP time: 4.76837158203125e-07


analyzeModel objective value: 0.8739215069684909

Objective obj: "maximize"
 succeeded
    Generatin

Check that the predicted objectives are positive

In [33]:
assert results_cbmpy['simulation.sedml']['objective']['obj'] > 0
assert results_cobrapy['simulation.sedml']['objective']['obj'] > 0

Check that the simulators produced consistent objective values

In [34]:
import numpy.testing

In [35]:
numpy.testing.assert_allclose(
    results_cbmpy['simulation.sedml']['objective']['obj'],
    results_cobrapy['simulation.sedml']['objective']['obj'],
)
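
Different simulators may use different linear programming solvers, so their objective values can differ in the last few digits. `numpy.testing.assert_allclose` therefore compares with a relative tolerance, which defaults to `1e-7`. The sketch below illustrates the idea with the standard library's `math.isclose`, which applies a similar (symmetric) relative test. The first value is the CBMPy objective printed in the log above; the other two values are hypothetical and chosen only for illustration.

```python
import math

objective_cbmpy = 0.8739215069684909      # value printed in the log above
objective_cobrapy = 0.8739215069684303    # hypothetical value, for illustration only
objective_different = 0.874               # hypothetical value that is clearly different

# Tiny differences in the trailing digits pass a relative tolerance of 1e-7
agree = math.isclose(objective_cbmpy, objective_cobrapy, rel_tol=1e-7)

# A difference in the fourth significant digit does not
disagree = math.isclose(objective_cbmpy, objective_different, rel_tol=1e-7)

print(agree, disagree)
```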

## 18. Submit the archive to runBioSimulations

In [36]:
from biosimulators_utils.biosimulations.utils import submit_project_to_runbiosimulations

In [37]:
# run_id = submit_project_to_runbiosimulations(
#    name=metadata['title'], 
#    filename_or_url=archive_filename, 
#    simulator='cobrapy', 
#    public=False)

In [38]:
# print('https://run.biosimulations.org/simulations/' + run_id)