Effective algorithms on genotype datasets
Genotype datasets are usually very large and are expected to grow rapidly. As they grow, data size and format increasingly affect program speed, so a fast framework and a compact data representation are necessary to get the best performance.
This paper discusses how simplifying the datasets and algorithms can improve program execution speed. We developed a data format (GMAP), a framework (GMap), and programs for analysing genotype datasets.
We compared the speed of our programs with Plink using different file formats. Testing showed a large performance improvement when using a binary file format (such as our GMAP and Plink's BED) instead of a text-based format (such as PED and TPED). Our programs also ran faster than Plink, but we cannot say definitively whether this is due to our data format or our algorithm implementation.
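To illustrate why a binary format can be so much smaller (and therefore faster to read) than a text one, here is a minimal sketch of 2-bit genotype packing. It is not the actual GMAP or BED layout: the code table, the bit order, and the function names pack and unpack are assumptions made for this example only.

```python
# Hypothetical 2-bit genotype packing; NOT the real GMAP or BED layout.
# The code table and bit order below are assumptions for illustration.
CODES = {"AA": 0, "AB": 1, "BB": 2, "NA": 3}
NAMES = {code: name for name, code in CODES.items()}


def pack(genotypes):
    """Pack genotype labels into bytes, four genotypes per byte."""
    out = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        # Genotype i occupies bits 2*(i % 4) and 2*(i % 4) + 1 of byte i // 4.
        out[i // 4] |= CODES[g] << (2 * (i % 4))
    return bytes(out)


def unpack(data, count):
    """Recover `count` genotype labels from packed bytes."""
    return [NAMES[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(count)]


genotypes = ["AA", "AB", "BB", "NA", "AB"]
packed = pack(genotypes)
text = " ".join(genotypes).encode("ascii")
print(len(packed), "bytes packed vs", len(text), "bytes as text")
```

Besides the size difference, a packed representation can be decoded with shifts and masks rather than by parsing text, which is one plausible source of the speedup observed with binary formats.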
Install gdc or dmd. Install dsss and rebuild (http://www.dsource.org/projects/dsss).
./src -- source
    ./src/gmap/ -- gmap libraries
./test -- testing scripts
    clean.sh -- deletes all automatically generated data
    generate_data.sh -- generates data for testing into the data folder; change it if you want more data
    test_gmap.sh -- runs gmap programs, timing data is in test_gmap.log
    test_plink.sh -- runs plink program, timing data is in test_plink.log
    _test_/ -- this folder holds all results and data generated by tests
./bin -- binary files
    gmapassoc -- does an association study
    gmapfreq -- outputs genotype frequencies
    gmaphardyweinberg -- tests for Hardy-Weinberg equilibrium
    gmaprandpheno -- generates random phenotype data
    gmapconvert -- converts ped to gmap
    gmapgenerate -- generates a random ped file
    gmappack -- packs gmap file
./obj -- object files
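
For readers unfamiliar with the calculations behind gmapfreq and gmaphardyweinberg, the sketch below shows genotype frequencies and a Pearson chi-square test for Hardy-Weinberg equilibrium at a single marker. This README does not say which statistic the programs actually use, so the choice of test and the function names genotype_frequencies and hardy_weinberg_chi2 are assumptions, not the project's implementation.

```python
# Illustrative only: the statistic used by gmaphardyweinberg is not
# documented here; a Pearson chi-square test is assumed.

def genotype_frequencies(n_aa, n_ab, n_bb):
    """Return the fraction of AA, AB and BB genotypes at one marker."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    return (n_aa / n, n_ab / n, n_bb / n)


def hardy_weinberg_chi2(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) for deviation from
    Hardy-Weinberg proportions p^2, 2pq, q^2."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expected count to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


print(genotype_frequencies(25, 50, 25))
print(hardy_weinberg_chi2(25, 50, 25))
```

A statistic near zero means the observed counts match the Hardy-Weinberg expectation; large values indicate deviation.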