Skip to content

eaton-lab/GOOGA

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

79 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Genome Order Optimization using a Genetic Algorithm (GOOGA) (now published: https://doi.org/10.1371/journal.pcbi.1006949)

This project implements a genetic algorithm (GA) designed to take genotype data from a segregating population and use it to identify an optimal scaffold order from a perhaps not so well assembled genome. It does this by permuting scaffold orders from the genome assembly and fitting a HMM with genotype data to estimate recombination rates and the ultimate likelihood of a particular order. Then this is fed into the GA to search for a near optimal order.

The code runs in python and was developed for python2.7 and requires the scipy package to be installed.

To run on the test data, first download this repo. Then you need to extract the test data:

gunzip data/LVR.cross.tar.gz
tar xf data/LVR.cross.tar

This gives several files, including the genotype files (e.g. g.F2.163.txt), which look like:

head g.F2.163.txt
F2.163	1	0	AB
F2.163	1	100000	AB
F2.163	1	200000	AB
F2.163	1	300000	AB
F2.163	1	400000	AB
F2.163	1	500000	AB

where the 4 columns are line_name, chromosome, interval, genotype (AA, AB, BB, or NN for missing/unknown)

Also the test data contains a list of all segregating lines you wish to consider for ordering scaffolds, it's called LVR.f2set.txt, an estimate of the intitial intra-scaffold recombinantion rates called LVR.isr.txt, and a file of error rates called LVR.er2.txt.

Finally, the genetic algorithm code is called parallel.genet.alg.final.py. It was written to be run on a multiprocessor system and can make use of parallelism. It takes several flags at runtime. To get help on these flags simply run:

python2 parallel.genet.alg.final.py --help

Usage: parallel.genet.alg.final.py marker_file [options]


Options:

  -h, --help            show this help message and exit

  -e ERROR_FILE         File with precaculated error rates -
                        default=error.rates.txt
                        
  -f F2_FILE            File with F2s to use in analysis -
                        default=test.f2group.txt
                        
  -i INTRASCAFF_RATES_FILE
                        File with precalculated intra-scaffold recombination
                        rates - default=intrascaff_rates.txt
                        
  -c NCPU               Number of CPUs - default=16
  
  -g NGEN               Number of generations to run the GA - default=10^10
  
  -l ELITE              Number of elite in each generation default=3
  
  -t TERMINATION        Number of generations with no improvement before
                        termination - default=1000 
  
  -r                    Use -r flag if your mapping pop is RILs - default is
                        F2 seg. pop

NCPU should be set according to your computer. The code was developed for a 16 core machine, hence the default setting.

ELITE designates the number of contigs orders (in a GA they are called individuals) to be carried over to the next generation. I've had good luck setting this between 2-4.

The GA will automatically select it's population size based on NCPU and ELITE. Specifically, it will be NCPU + ELITE. So for example, given the default settings, the GA population size will be 19.

You can also control how long the GA runs, using both NGEN and TERMINATION. NGEN simply sets a max number of generations for the GA, whereas TERMINATION allows you to tell the GA to stop after a specified number of generations where the best scaffold order (individual) has not be improved upon. For example, if NGEN is set to 500 and TERMINATION is set to 200, the GA will stop at 500 generations total or 200 gen without improvement, whichever comes first. As a default the NGEN is set astronomically high, to rely only on TERMINATION to end the run.

Finally the -r flag is a boolean, which when used swithces the HMM backend to accomadate a RIL population (no hets) rather than a standard seg pop, like a F2 (has hets). If you don't use this flag, the GA defaults to standard F2 seg. pop.

The only mandatory input file is a starting a genomic scaffold order ("marker_file"). There's an example in markers.2.v2.txt. This file correpsonds to Chromosome 2 in Mimulus guttatus. The GA uses this order as a starting place.

Then run the code:

python2 parallel.genet.alg.final.py markers.2.v2.txt -c 8 -l 1 -i LVR.isr.txt -f LVR.f2set.txt -e LVR.er2.txt > some.output.file

This runs on 8 CPUs with 1 elite on the test files noted above. And at each generation it will output various run statistics, including all current scaffold orders and likelihoods that the algorithm is grinding away on.

At some point it will stop, and you might want to run:

python2 run.stats.py some.output

This step requires that you have the matplotlib package installed. It gives a plot of various run statistics so you can see how it progressed. It gives some sense of whether you need to tune things or if it was pretty successful.

You'll also want to check out the last generation of the output file. This will contain the full likelihood of the best individual and the corresponding contig order.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 96.1%
  • Shell 2.2%
  • R 1.7%