Bachelor thesis on "Effective analysis of genotype datasets"

Effective algorithms on genotype dataset

Abstract

Genotype datasets are usually very large and are expected to grow rapidly. As data size and format begin to affect program speed, a fast framework and an efficient data representation become necessary to get the best performance.

This paper discusses how simplifying the datasets and algorithms can improve program execution speed. We developed a data format (GMAP), a framework (GMap), and programs for analysing genotype datasets.

We compared the speed of our programs with Plink using different file formats. Testing showed a large performance improvement when using a binary file format (such as our GMAP and Plink's BED) instead of a text-based format (such as PED and TPED). Our programs also ran faster than Plink, but we could not determine definitively whether this is due to our data format or our algorithm implementation.
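
To illustrate why a binary encoding can be much more compact than a text format such as PED, the sketch below packs biallelic genotypes into two bits each, four per byte. The 2-bit codes and the names `pack_genotypes` and `unpack_genotypes` are assumptions made for this example; this is not the actual GMAP or BED layout.

```python
# Hypothetical 2-bit genotype codes (assumption; not the GMAP or BED encoding).
CODES = {"AA": 0, "AB": 1, "BB": 2, "NA": 3}
NAMES = {code: name for name, code in CODES.items()}


def pack_genotypes(genotypes):
    """Pack genotype labels into bytes, four genotypes per byte."""
    out = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        # Each genotype occupies two bits; slot i % 4 inside byte i // 4.
        out[i // 4] |= CODES[g] << (2 * (i % 4))
    return bytes(out)


def unpack_genotypes(data, count):
    """Recover the first `count` genotype labels from packed bytes."""
    return [NAMES[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(count)]


sample = ["AA", "AB", "BB", "NA", "AB"]
packed = pack_genotypes(sample)
print(len(packed), "bytes for", len(sample), "genotypes")
```

A text format spends at least a few characters per genotype, while this packing needs a quarter of a byte and can be read without any parsing.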

Compiling

Install a D compiler (gdc or dmd). Install dsss and rebuild from http://www.dsource.org/projects/dsss. Then run:

make

Folder Structure

./src                  -- source
./src/gmap/            -- gmap libraries

./test                 -- testing scripts
    clean.sh           -- deletes all automatically generated data
    generate_data.sh   -- generates test data in the data folder;
                          edit it if you want more data
    test_gmap.sh       -- runs the gmap programs, timing data is in test_gmap.log
    test_plink.sh      -- runs the Plink program, timing data is in test_plink.log
    _test_/            -- this folder holds all results and data generated by tests

./bin                  -- binary files
    gmapassoc          -- performs an association study
    gmapfreq           -- outputs genotype frequencies
    gmaphardyweinberg  -- tests for Hardy-Weinberg equilibrium
    gmaprandpheno      -- generates random phenotype data
    gmapconvert        -- converts PED to GMAP
    gmapgenerate       -- generates a random PED file
    gmappack           -- packs a GMAP file

./obj                  -- object files
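
As a rough illustration of what a Hardy-Weinberg equilibrium test computes, the sketch below applies a chi-square goodness-of-fit statistic to the genotype counts of a single marker. This README does not say which test gmaphardyweinberg implements, so both the choice of statistic and the function name `hardy_weinberg_chi2` are assumptions.

```python
def hardy_weinberg_chi2(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed genotype counts
    with the counts expected under Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    # Allele frequency of A estimated from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expected count to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Counts that match the expected proportions exactly give a statistic of 0.
print(hardy_weinberg_chi2(25, 50, 25))
```

A larger statistic indicates a stronger departure from equilibrium; with two alleles it is usually compared against a chi-square distribution with one degree of freedom.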