IsoMut: a robust method for calling unique mutations (SNVs and small indels) from multiple isogenic samples
This is the implementation of the mutation calling algorithm described here (article to come ... ). The core parts are written in C, and the example script, and the wrapper for parallel execution are in python. Please note, that running the example python script does not require any knowledge in python programming, only pathnames and parameter values need to be changed.
For the impatient:
- compile IsoMut
- modify the pathnames for your data in isomut_example_script.py, and run the script
- check out test_isomut.ipynb, for necessities, and details
cd src/ gcc -O3 -c isomut_lib.c fisher.c -W -Wall gcc -O3 -o isomut isomut.c isomut_lib.o fisher.o -lm -W -Wall
For the standalone C application:
- src/isomut_lib.h src/isomut_lib.c the library
- src/fisher.c code for calculating the mutation quality score
- src/isomut.c the application
For the parallel wrapper:
- src/isomut_wrappers.py the python functions for parallel execution
For running IsoMut:
- ismout_example_script.py, example script to run IsoMut
Testing and demonstration:
- tests/test_isomut.ipynb jupyter notebook showing a parallel IsoMut test run
- tests/test_isomut_c_part.ipynb jupyter notebook showing a test run for the C program
- samtools needs to be installed
Understanding the output
- The output is a tab separated file, each line is a detected mutation. The columns are the following:
- #sample_name: the name of the bam file which contains the mutation.
- chr: the chromosome/contig name of the mutation
- pos: the 1-based position of the mutation for SNV, and the previous base before INS/DEL
- type: the mutation type: values can be: SNV/INS/DEL
- score: Quality score for the mutation
- ref: reference base/sequence of the mutation (- for insertions)
- mut: mutated base/sequence of the mutation (- for deletions)
- cov: coverage of the mutated position (after base quality filtering!)
- mut_freq: frequency of reads containing the mutation
- cleanliness: lowest reference nucleotide frequency among the other samples
Understading the mutation quality score
- We generalized the scoring used by VarScan , the Fisher's exact p-value for multiple samples. The score is calculated taking the counts in the mutated samples, and the counts from the noisiest sample among the other samples. This value is realted to the probability of the mutation being a false positive, but it has not direct phisical interpretation. Therefore, to avoid confusion we use the negative 10 based logarithm of the Fisher's exact p-value calculated. This means, that the higher the score, the better the mutation is.
- We have found that the score can be used to significantly improve the IsoMuts hard filters, and it allows for a very precise tuning between sensitivity and sepcificity of IsoMut.
- 1, IsoMut calls samtools mpileup wih -B flag for all samples together
- 2, It processes every line and selects the ones which meet the following criteria
- reliability: coverage is higher than a given threshold
- enough evidence: mutated base frequency is higher than a given threshold
- low noise: other samples minimum frequency is higher than a given threshold
- no gaps are found closer than a given radius
- 3, For efficient proximal indel induced false positive SNV filtering, IsoMut repeats the the above for the suspected SNVs without the -B flag.
- The samples have higher than 1000 per base read coverage. In this case our default settings might not use all the reads in the bam files. We call samtools with the --max-depth 1000 argument, to prevent possible memory issues with very large read depths. If you want to use all the reads, please change the SAMTOOLS_MAX_DEPTH variable at the beggining of isomut_wrappers.py to a number which is higher than highest coverage in any of the samples, and be careful with the memory usage.
- Number of samples is higher than 1000. We do not recommend using that many samples, but in this case please redefine MAXSAMPLE in the beggining of isomut_lib.h to a number high enough. Also please watch out for the potential implications of samtools mpileup memory usage if the overall coverage will be higher than 1M reads.
Main code was written by Dezso Ribli, and the whole research was done in the collaboration of:
- David Szuts, Janos Molnar (Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences)
- Istvan Csabai, Orsi Pipek, Dezso Ribli (Department of Physics of Complex Systems, Eotvos Lorand University)
- Zolan Szallasi (Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark)
Fisher's exact test from this github repo, ( thanks Christopher Chang ) :