gmiclotte edited this page Oct 26, 2023 · 20 revisions

OMSim: simulating optical map read data

OMSim is a simulation tool for optical map reads of the Irys platform (BioNano Genomics).

Preliminary Notes

This wiki refers to the latest version on the master branch. Older versions may not be entirely compatible with this wiki.

The command line examples in this wiki are written for bash. If you use another shell, different syntax (e.g. for accessing subdirectories) might be required.

Input

OMSim takes as input a genome file in FASTA format and an XML file specifying the nicking enzymes. Both of these, and all other (optional) parameters, can be specified in a main XML file. Example XML files are provided in the example and test directories.

The output is a BNX file (per chip) containing the reads.

Dependencies

At the moment OMSim requires Python 3 and SciPy. To install SciPy, please follow the SciPy installation instructions. NumPy does not need to be installed with BLAS/LAPACK/ATLAS/MKL.

The GUI is no longer being maintained. GUI binaries for 64-bit Linux and Windows are available in the bin directory. For other platforms, the C++ source can be compiled; this requires wxWidgets 3.0.

Test run

A test data set has been provided in the test folder. Navigate to test/ecoli and run:

python ../../omsim/src/omsim/__main__.py example.xml  

This will produce the following files:
ecoli_output.label_0.1.bnx containing reads with recognition site GCTCTTC
ecoli_output.label_1.1.bnx containing reads with recognition site CACGAG
ecoli_output.bed containing the start and end positions on the reference of all generated reads

Additionally the following terminal output should be generated:

../../omsim/src/omsim/__main__.py example.xml  
Version: v0.2  
BNX version: 1.2  
Circular genome.  
Minimal molecule length: 20000 bp  
Average molecule length: 200000.0 bp  
Minimal coverage: 1x  
Chimera rate: 1.0%  
Random seed: 0  
  
Indexing sequence: gi|49175990|ref|NC_000913.2|  
Found 1490 nicks in 4639675bp.  
Generating reads on 1 chip, estimated coverage: 9698x.  
Finished processing E. coli.  
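
To confirm that the test run wrote the files listed above, a small check such as the following can be used. This is a minimal sketch, not part of OMSim: the helper name `missing_outputs` is hypothetical, and it assumes the output files are written to the directory passed in (by default the current directory, i.e. test/ecoli).

```python
from pathlib import Path

# File names as listed in this wiki for the E. coli test run.
EXPECTED_OUTPUTS = [
    "ecoli_output.label_0.1.bnx",
    "ecoli_output.label_1.1.bnx",
    "ecoli_output.bed",
]


def missing_outputs(directory="."):
    """Return the expected output files that are absent from `directory`.

    Hypothetical helper; assumes the outputs land in `directory`.
    """
    base = Path(directory)
    return [name for name in EXPECTED_OUTPUTS if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs(".")
    if missing:
        print("Missing output files:", ", ".join(missing))
    else:
        print("All expected output files are present.")
```

Run it from test/ecoli after the simulation has finished.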

A similar example can be found in the directory test/hsapiens. There, the index for hg38 has already been built, so the simulation can start right away.

Quick start

To jump right in with your own data, edit example/minimal.xml and replace "genome.fasta" with the location of your (non-circular) input data in fasta format. Running OMSim (with the predefined BspQI enzyme) then simply becomes:

python omsim/src/omsim/__main__.py example/minimal.xml

The output will be written to files named omsim_output.label_0.xxx.bnx.

Finetuning

Many parameters can be fine-tuned. True positive and false negative rates are enzyme dependent and have to be specified in the enzymes file (e.g. enzymes.xml). All other settings can be specified in the main XML file (see example.xml).

Number of molecules

The number of molecules that are simulated can be influenced in multiple ways:

  1. By specifying the 'chips' and 'scans_per_chip' settings. The final number of molecules then follows naturally from the simulated chips and scans.
  2. By specifying the 'min_num_mol' setting. A single chip is generated, and scans are generated on this chip until this minimum number is reached.
  3. By specifying the 'coverage' setting. The required number of chips is estimated, which reduces this to case 1.
  4. By specifying the 'max_num_mol' setting. The simulation stops as soon as this number is reached; this applies to all three previous scenarios. Note: if 'min_num_mol' is higher than 'max_num_mol', then 'max_num_mol' is set to 'min_num_mol'.

If unexpected combinations of these settings are provided, a warning is given and priority is given to scenarios 1, 2, and 3, in that order. Scenario 4 always applies.
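
To get a feeling for how a 'coverage' target relates to a number of molecules, the back-of-the-envelope calculation below may help. It is a rough sketch under stated assumptions, not OMSim's internal estimator: the function name `molecules_for_coverage` is hypothetical, coverage is taken as total simulated bases divided by genome size, every molecule is assumed to have the average length, and filtering by 'min_mol_len' and 'min_nicks' is ignored.

```python
import math


def molecules_for_coverage(coverage, genome_size, avg_mol_len):
    """Rough number of molecules needed to reach `coverage`.

    Assumes coverage = total simulated bases / genome size and that
    every molecule has length `avg_mol_len` (a simplification).
    """
    return math.ceil(coverage * genome_size / avg_mol_len)


# Values from the E. coli test run: 4639675 bp genome, 200000 bp average length.
print(molecules_for_coverage(1, 4_639_675, 200_000))  # prints 24
```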

If you want a specific number of molecules, set both 'min_num_mol' and 'max_num_mol' to the desired number.

Note: not all simulated molecules will appear in the final output. Some are filtered out by the 'min_mol_len' and 'min_nicks' settings. If you want exactly n molecules, set both 'min_num_mol' and 'max_num_mol' to n, and set 'min_mol_len' and 'min_nicks' to 0.
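
The rules for 'min_num_mol' and 'max_num_mol' described above can be summarised in a few lines. This is an illustrative sketch only, not OMSim's actual code: the function names are hypothetical, and the settings are represented as a plain dictionary rather than the XML file OMSim reads.

```python
def resolve_molecule_limits(min_num_mol=None, max_num_mol=None):
    """Apply the documented rule: if the minimum exceeds the maximum,
    the maximum is raised to the minimum. `None` means "not set"."""
    if (min_num_mol is not None and max_num_mol is not None
            and min_num_mol > max_num_mol):
        max_num_mol = min_num_mol
    return min_num_mol, max_num_mol


def exact_count_settings(n):
    """Setting values that should yield exactly n molecules in the output:
    the minimum equals the maximum, and both output filters are disabled."""
    return {
        "min_num_mol": n,
        "max_num_mol": n,
        "min_mol_len": 0,
        "min_nicks": 0,
    }


print(resolve_molecule_limits(500, 100))  # prints (500, 500)
print(exact_count_settings(1000))
```

The values returned by `exact_count_settings` still have to be entered in the main XML file by hand.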