Genotype datasets are usually very large and are expected to grow rapidly. As they grow, data size and format increasingly affect program speed, so a fast framework and an efficient data representation are necessary to get the best performance.
This paper discusses how simplifying datasets and algorithms can improve program execution speed. We developed a data format (GMAP), a framework (GMap), and programs for analysing genotype datasets.
We compared the speed of our programs with Plink using different file formats. Testing showed a large improvement in performance when using a binary file format (such as our GMAP or Plink's BED) instead of a text-based format (such as PED or TPED). Our programs also ran faster than Plink, but we cannot say definitively whether this is due to our data format or to our algorithm implementation.
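To illustrate why a binary genotype format can be much more compact than a text one, here is a minimal Python sketch that packs genotypes at two bits each and compares the size with a space-separated allele-pair text line. The 2-bit codes, the TEXT_FORM mapping, and the function names are hypothetical and chosen for illustration only; this is not the actual GMAP or BED layout, and Python is used here only for readability.

```python
# Illustrative only: hypothetical 2-bit genotype codes, not the GMAP or BED layout.
TEXT_FORM = {0: "A A", 1: "A G", 2: "G G", 3: "0 0"}  # code 3 = missing (assumed)


def pack_genotypes(codes):
    """Pack genotype codes (each 0..3) four to a byte, low bits first."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError(f"genotype code out of range: {code}")
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, count):
    """Recover the first `count` genotype codes from the packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]


codes = [0, 1, 2, 3, 2, 1, 0, 0, 1]
packed = pack_genotypes(codes)
text = " ".join(TEXT_FORM[c] for c in codes)

# Round trip: unpacking gives back the original codes.
assert unpack_genotypes(packed, len(codes)) == codes
print(len(text), "bytes as text,", len(packed), "bytes packed")
# prints: 35 bytes as text, 3 bytes packed
```

In this sketch the text form also has to be split into tokens before use, while the packed form is read with shifts and masks. That is one plausible contributor to the speed difference reported above, though the measurements themselves do not separate these effects.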
Install gdc or dmd. Install dsss and rebuild (http://www.dsource.org/projects/dsss).
./src -- source
    ./src/gmap/ -- gmap libraries
./test -- testing scripts
    clean.sh -- deletes all automatically generated data
    generate_data.sh -- generates data for testing into data folder
        change it if you want more data
    test_gmap.sh -- runs gmap programs, timing data is in test_gmap.log
    test_plink.sh -- runs plink program, timing data is in test_plink.log
    _test_/ -- this folder holds all results and data generated by tests
./bin -- binary files
    gmapassoc -- does an association study
    gmapfreq -- outputs genotype frequencies
    gmaphardyweinberg -- tests for Hardy-Weinberg equilibrium
    gmaprandpheno -- generates random phenotype data
    gmapconvert -- converts ped to gmap
    gmapgenerate -- generates a random ped file
    gmappack -- packs gmap file
./obj -- object files
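
For readers unfamiliar with the kind of test that gmaphardyweinberg is named after, the sketch below computes an allele frequency and a chi-square goodness-of-fit statistic for Hardy-Weinberg equilibrium from three genotype counts. The source does not say which statistic gmaphardyweinberg actually computes, so the choice of the chi-square test, the function name, and the example counts are assumptions made for illustration.

```python
def hardy_weinberg_chi2(n_aa, n_ab, n_bb):
    """Return (frequency of allele A, chi-square statistic) for genotype counts.

    Assumes a biallelic marker with counts for AA, AB and BB genotypes.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    # Genotype counts expected under Hardy-Weinberg equilibrium.
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic markers).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return p, chi2


# Hypothetical counts, for illustration only.
p, chi2 = hardy_weinberg_chi2(298, 489, 213)
print(f"p = {p:.4f}, chi2 = {chi2:.3f}")
```

The statistic is usually compared against a chi-square distribution with one degree of freedom, where values above about 3.84 indicate departure from equilibrium at the 0.05 level.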