Anything related to GenSel program from Iowa State. Writen for genomic selection and GWAS analysis for linear and categorical traits. This program only handles animals with phenotypes and genotypes. Please refer to Hao Cheng's Julia package JWAS here for more information on this program. That package will handle more complex models (maternal, repeatability, etc) and single-step analyses based on their GSE paper.
If you were wondering how to run gensel, I hope this helps. Here is an example of my run with gensel and how I set up a shell script to run all my traits back-to-back. I currently have around 500 columns in my dataset. Doing all these GWAS analyses would be quite difficult so shell scripting comes in handy (as always).
GenSel was created and written at Iowa State University by Rohan Fernando (here) and Dorian Garrick (now at Massey University in New Zealand, see here).
You should be able to contact Dorian Garrick (d.garrick@massey.ac.nz) for information on the download.
You need two more packages as well to run on a Mac/Linux (don't know about Windows). To get GenSel libraries you need to go to this link and get the correct version here.
I believe you need this link. But it's been awhile so I forget, but you need to download the gfortran I believe.
For Ubuntu, you need to create an environmental variable in your .bash_profile
that sets LD_LIBRARY_PATH. You will need to modify or create a .bash_profile
(create with touch
or vim
) that goes in your $HOME
directory. On Linux this should look like /home/yourname/
. I'm not sure about other flavors of Linux.
Depending on where you put the program, add this to your .bash_profile
:
LD_LIBRARY_PATH=$HOME/bin/GenSel/
export LD_LIBRARY_PATH
You will need to adjust the directory based on where you leave the binaries. I like to keep all mine within my $HOME/bin/
directory (folder) or within a subfolder of my $HOME/bin/
directory.
I also add this line to my .bash_profile
to find the directory where I put my binary program. I put them in my ~/bin/GenSel/
folder.
export PATH=$HOME/bin/GenSel:$PATH
Once you have it downloaded, added to your path, and any other software needed installed, you can begin to run GenSel programs.
You can get the manual here. It has some information in, but as a whole lacks a lot of information to get started such as file formats and that type of thing. The rest is learning how to tune GenSel to get the correct answers.
It seems like either tab or spaces work fine from what I understand. I'd keep it simple and just use spaces. Most other programs, this is easiest. Whichever you use, be consistent (like always).
IMPORTANT: You need the first word in the data file, map file (chromosome info file), and genotype file to all have the same word (although it doesn't make sense because for some it's a SNP for the map file and animal ID for the data and genotype files). I name this 'SNP' in my example.
An example space delimited (data_sub2.txt):
SNP | Phenotype | Group$ |
---|---|---|
0001 | 0.3647 | 1A |
0002 | 0.5974 | 1A |
... | ... | ... |
1340 | 0.5173 | 3G |
1341 | 0.3967 | 3G |
1343 | 0.2509 | 3G |
The first column needs to be the animal ID matching the genotype file. The 2nd column needs to be the phenotype of interest (any column name). The $
after Group tells GenSel that it's group/class/factor/etc not a covariate (linear). All columns in the datafile will be used, so don't add extra columns. I will try to write an external bash script or something to handle this I hope.
An example space delimited (ChrInfoFinal_GenSel.txt):
SNP | ALGP2_chr | ALGP2_pos |
---|---|---|
AX-116097596 | 1 | 5104 |
AX-116696855 | 1 | 11173 |
AX-116696856 | 1 | 11289 |
... | ... | ... |
AX-116627004 | 19 | 144283026 |
AX-116627005 | 19 | 144285591 |
AX-116799393 | 19 | 144287132 |
IMPORTANT: Remember to add the same beginning to the 2nd and 3rd column names (Chromosome and Position). (I used ALGP2, see parameter file below to see how to specify this)
I don't think the SNPs have to be in order. You should be able to include more SNP than you have in your genotype file. GenSel will only use those it finds in the genotype file. This is nice if you have different subsets. The SNP will be specified as the column header.
The first column name should match the other files (map and data files). The rest of the column names should be the SNP names from the map file. Genotypes need to be -10, 0, 10. Missing genotypes are not allowed (usually 5 or -1), but you can impute with the column average (or allele frequency) and GenSel will ignore them (I've been told, but I would suggest imputing with FImpute or something like this).
An example space delimited (genotypes_imp_GenSel.txt):
SNP | AX-116097596 | AX-116696855 | AX-116696856 | AX-116696857 | ... |
---|---|---|---|---|---|
0185 | 10 | 10 | 10 | 10 | ... |
0324 | 10 | 10 | 10 | 10 | ... |
0332 | 10 | 10 | 10 | 10 | ... |
.... | ... | ... | ... | ... | ... |
0994 | 10 | 10 | 10 | 10 | ... |
1007 | 10 | 10 | 10 | 10 | ... |
1014 | 10 | 10 | 10 | 10 | ... |
You need to have the same name in the top left (SNP for me). Then the SNP names cooresponding to the SNP in the map file as column names. ID's of animals should go in the first column.
Once you have ran the parameter file once, it will write out a binary file with the genotypes. This will be saved as old_file_name.txt.newbin
. Use the .newbin
file after you've read them in once. This will save some time (depending on the size of your file).
This is a simple space/tab delimited file to add the inputs for GenSel like most of the software ran on the command line. Always name with the .inp extension for GenSel and to find them later. One keyword per line.
These are pretty easy in GenSel. Just use the keywords from the manual. outputMarkerHeaderName
will indicate the first column name from the map, data, and genotype file. linkageMap
will tell GenSel what the beginning of the chromosome and position column name is in the map file (see above).
Example (MSE.inp):
# Files
phenotypeFileName data_sub2.txt # Phenotype file
markerFileName genotypes_imp_GenSel.txt # Genotype file
mapOrderFileName ChrInfoFinal_GenSel.txt # Map/Chr file
# column names
outputMarkerHeaderName SNP # Name of first column name in all 3 files
linkageMap ALGP2 # Beginning part of map file chromosome and position column names
# analysis
analysisType Bayes # Bayes analysis
bayesType BayesC # BayesC analysis (1 variance)
probFixed 0 # pi = 0 (all markers fit in the model)
chainLength 50000 # 50,000 iterations
burnin 10000 # 10,000 samples to discard at start
varGenotypic 0.00233 # prior for genetic variance (from GBLUP)
varResidual 0.00843 # prior for residual variance (from GBLUP)
# other
addMapInfoToMarkers yes # Not sure...
windowBV yes # Calculate Window variances (1 Mb by default)
This analysis will run BayesC with pi = 0 for 50,000 iterations and 10,000 burnin samples. You need to set good starting values for priors for the genetic and residual variances. We like to use the results from BayesC pi = 0 for the other Bayesian analyses.
GenSel outputs many different output files. It generally names them with a number behind it, in case you run multiple runs with the same parameter file. So if you title your parameter file ADG.inp
(you should always name it with the .inp extension) the output will be ADG.out1
, followed by ADG.out2
, ADG.out3
, etc. This will happen for all of the other output files as well (.mrkRes1
, .mrkRes2
, etc for example).
Main output files:
- base.cgrResn
- base.cgrResSampelsn
- base.ghatRELn
- base.mrkResn
- base.outn
- base.outSamplesBINn
- base.winQTLn
- base.winVarn
Where 'n' refers to the analysis number. They don't overwrite files like other programs, so they just keep numbering files 1,2,...,n.
- Remember to convert the file to unix if did the data or genotyping on Windows. Use
dos2unix
on the Linux command line if you have it available or find another way to remove the ^M from Windows.
As always, please email me if there are errors or you just have comments or suggestions about what to include in this help file.