###Fst estimation between cohorts and geographical location inference###
Fst is commonly used as a measure of genetic differentiation. This function calculates Fst from GWAS summary statistics and infers the relative geographical locations of cohorts.
Depending on the purpose of the study, the function can include two steps.
Step 1: estimation of Fst for each cohort from the reference cohorts;
Step 2: inference of relative geographical locations based on Fst.
Demonstration can be downloaded here.
Data format for GWAS summary statistics
The summary statistics should contain the following fields: "SNP", "EAF" (frequency of the reference allele, A1), "SE" (standard error of EAF), "A1", "A2", "CHR", "BP", and "P".
SNP | CHR | BP | A1 | A2 | RAF | RAF_SE | Pval |
---|---|---|---|---|---|---|---|
snp1 | 1 | 100 | G | T | 0.35 | 0.032 | 0.05 |
snp2 | 2 | 200 | T | A | 0.36 | 0.033 | 0.1 |
NOTE: when using the --key option, the keywords are case-insensitive but must otherwise match the field names in your data exactly. The keywords must be given in the order listed above, but the columns in the summary data may appear in any order. "A1" is the reference allele, and "A2" is the other allele. Other columns, such as the reference allele frequency and its standard error, can also be included.
The program automatically removes palindromic loci, i.e. loci with A/T or G/C alleles. In the example, the second row, which has the ambiguous alleles T/A, will be removed. Note that all summary statistic files must use the field names passed to --key, but the column order can differ from one file to another.
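The keyword matching and palindromic filtering described above can be sketched in Python. This is only an illustration of the idea, not GEAR's implementation: the function name `read_summary`, the tab delimiter, and the canonical field names are assumptions made for the example.

```python
import csv
import io

# Allele pairs that read the same on both strands (assumed meaning of "palindromic").
PALINDROMIC = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}

# Field order taken from the text above: SNP, EAF, SE, A1, A2, CHR, BP, P.
CANONICAL = ["SNP", "EAF", "SE", "A1", "A2", "CHR", "BP", "P"]


def read_summary(handle, keys, delimiter="\t"):
    """Yield one dict per SNP, keyed by canonical names, skipping palindromic loci."""
    if len(keys) != len(CANONICAL):
        raise ValueError("expected %d keywords" % len(CANONICAL))
    reader = csv.reader(handle, delimiter=delimiter)
    header = [name.strip().lower() for name in next(reader)]
    # Locate each keyword in the header without assuming any column order.
    index = {canon: header.index(key.lower()) for canon, key in zip(CANONICAL, keys)}
    for row in reader:
        record = {canon: row[i].strip() for canon, i in index.items()}
        pair = (record["A1"].upper(), record["A2"].upper())
        if pair in PALINDROMIC:
            continue  # strand-ambiguous locus, dropped
        yield record


# The example table from above, written as a tab-delimited file (hypothetical layout).
example = (
    "SNP\tCHR\tBP\tA1\tA2\tRAF\tRAF_SE\tPval\n"
    "snp1\t1\t100\tG\tT\t0.35\t0.032\t0.05\n"
    "snp2\t2\t200\tT\tA\t0.36\t0.033\t0.1\n"
)
keys = ["snp", "raf", "raf_se", "a1", "a2", "chr", "bp", "pval"]
kept = list(read_summary(io.StringIO(example), keys))
print([r["SNP"] for r in kept])
```

Only snp1 survives here, because snp2 carries the T/A allele pair.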
Step 1: estimating Fst for each cohort
Master command: sfst
Options
--meta-batch
Specify a file that lists the names of all summary data files to be used in a meta-analysis, one file per line:
~~~~~~
gwas1.txt
gwas2.txt
...
~~~~~~
--qt-size <arg>
Specify a file in which each line gives the sample size of the corresponding file in --meta-batch:
~~~~~~
100
200
...
~~~~~~
--cc-size <arg>
Specify a file in which each line gives the numbers of cases and controls of the corresponding cohort in --meta-batch, e.g.
~~~~~~
200 300
1000 800
...
~~~~~~
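Because the size files are matched to --meta-batch by position, it can help to check them before running sfst. The following is a minimal sketch, assuming one row per cohort and, for case-control data, two whitespace-separated numbers per row; `check_sizes` is a hypothetical helper, not part of GEAR.

```python
def check_sizes(batch_lines, size_lines, case_control=False):
    """Return {file name: sizes} or raise ValueError if the two lists do not line up."""
    files = [ln.strip() for ln in batch_lines if ln.strip()]
    rows = [ln.split() for ln in size_lines if ln.strip()]
    if len(files) != len(rows):
        raise ValueError("%d files but %d size rows" % (len(files), len(rows)))
    expected = 2 if case_control else 1  # cases and controls, or a single sample size
    sizes = []
    for name, row in zip(files, rows):
        if len(row) != expected:
            raise ValueError("%s: expected %d value(s), got %d" % (name, expected, len(row)))
        sizes.append(tuple(int(v) for v in row))
    return dict(zip(files, sizes))


qt = check_sizes(["gwas1.txt", "gwas2.txt"], ["100", "200"])
cc = check_sizes(["gwas1.txt", "gwas2.txt"], ["200 300", "1000 800"], case_control=True)
print(qt)
print(cc)
```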
--me <arg>
Specifies the number of markers that should be sampled for calculating F<sub>st</sub>.
It defaults to 30000 markers.
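To make the idea of estimating F<sub>st</sub> from a random subset of markers concrete, here is a small Python sketch. It uses a textbook Wright-style estimator for two equally weighted cohorts (between-cohort variance of allele frequency divided by p(1-p) at the mean frequency, summed over markers); this is an assumption for illustration and may differ from the estimator sfst actually uses. The function names and the seed parameter are hypothetical.

```python
import random


def pairwise_fst(freqs1, freqs2):
    """Ratio-of-sums Wright-style Fst for two equally weighted cohorts."""
    num = 0.0
    den = 0.0
    for p1, p2 in zip(freqs1, freqs2):
        p_bar = (p1 + p2) / 2.0
        num += (p1 - p2) ** 2 / 4.0   # variance of the frequency between the two cohorts
        den += p_bar * (1.0 - p_bar)  # p(1 - p) at the mean frequency
    return num / den if den > 0 else float("nan")


def sampled_fst(freqs1, freqs2, n_markers=30000, seed=0):
    """Estimate Fst from at most n_markers randomly chosen shared markers."""
    total = len(freqs1)
    rng = random.Random(seed)
    if total <= n_markers:
        picked = range(total)
    else:
        picked = rng.sample(range(total), n_markers)
    return pairwise_fst([freqs1[i] for i in picked], [freqs2[i] for i in picked])


a = [0.35, 0.50, 0.10, 0.80]
b = [0.35, 0.50, 0.10, 0.80]
c = [0.45, 0.40, 0.20, 0.70]
print(sampled_fst(a, b))               # identical frequencies give 0.0
print(sampled_fst(a, c, n_markers=2))  # estimate from two sampled markers
```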
--key <args>
The summary statistic files may contain the required columns under different names. For example, if the columns are named markerID, RAF, RAF_SE, Ref_Allele, Other_allele, Chromosome, BP, and Pval, the keywords should be
"--key markerID RAF RAF_SE Ref_Allele Other_allele Chromosome BP Pval"
Pval will be used to calculate the genomic inflation factor for each cohort.
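For readers unfamiliar with the genomic inflation factor, the sketch below shows the usual definition: each two-sided p-value is converted to a 1-df chi-square statistic, and the median of those statistics is divided by the median of the 1-df chi-square distribution (about 0.4549). Whether GEAR computes it in exactly this way is an assumption; `genomic_inflation` is a hypothetical name.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(pvalues):
    """Lambda_GC: median observed 1-df chi-square divided by its expected median."""
    normal = NormalDist()
    chi2 = []
    for p in pvalues:
        if not 0.0 < p <= 1.0:
            continue  # skip values for which the quantile is undefined
        z = normal.inv_cdf(p / 2.0)  # lower-tail quantile; its square is the chi-square
        chi2.append(z * z)
    if not chi2:
        raise ValueError("no usable p-values")
    return median(chi2) / CHI2_1DF_MEDIAN


uniform = [i / 1000.0 for i in range(1, 1000)]
print(genomic_inflation(uniform))             # close to 1 for well-behaved p-values
print(genomic_inflation([0.01, 0.02, 0.03]))  # larger than 1 when p-values are small
```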
--top <arg>
This option compares only the first X files listed in --meta-batch against all files. For example, if --meta-batch lists 10 summary statistic files and "--top 1" is used, only F<sub>st</sub> between the first file and each of the other files is calculated. Similarly, to calculate F<sub>st</sub> between our cohorts and the 1KG African, Asian, and European samples, list the three 1KG summary statistic files as the first three files in --meta-batch and use the "--top 3" option.
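The effect of --top on the number of comparisons can be illustrated with a short Python sketch. It assumes that each of the first X files is paired with every file listed after it, so pairs among the top files are counted once; how sfst treats those pairs internally is not stated here, and `fst_pairs` is a hypothetical helper.

```python
def fst_pairs(files, top=None):
    """List the file pairs compared when only the first `top` files act as anchors."""
    n = len(files)
    k = n if top is None else min(top, n)
    return [(files[i], files[j]) for i in range(k) for j in range(i + 1, n)]


files = ["gwas%d.txt" % i for i in range(1, 11)]
print(len(fst_pairs(files, top=1)))  # first file against the other nine
print(len(fst_pairs(files, top=3)))  # three reference files against the rest
print(len(fst_pairs(files)))         # all pairs when --top is not given
```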
--chr <arg>
Specify the chromosome to analyse. If omitted, all autosomes are used.
--verbose
This option writes detailed F<sub>st</sub> results for each selected SNP to a "*.fst.gz" file for each pair of cohorts.
Examples
java -jar gear.jar sfst --meta-batch metalist.txt --qt-size qt-sample-size.txt --key SNP EAF EAFSE A1 A2 CHR BP Pval --out test

java -jar gear.jar sfst --meta-batch metalist.txt --cc-size cc-sample-size.txt --key SNP EAF EAFSE A1 A2 CHR BP Pval --me 50000 --out test
***
**Step 2: inference for relative distance between cohorts**
Given the estimated F<sub>st</sub>, the geographical location of each cohort can be inferred using "fpc" subcommand.
1000 Genomes reference samples can be found at [*1000 Reference samples*](http://sourceforge.net/projects/gbchen/files/Demo/1kg_ref.zip/download)
Examples
java -jar gear.jar fpc --fst test.fst --out test

java -jar gear.jar fpc --fst test.fst --ref 9 3 1 --out test

test.fst is the F<sub>st</sub> matrix calculated by sfst in step 1. --ref specifies the three reference populations in test.fst. By default, the first three cohorts in test.fst are used as the reference populations.
The output is written to test.fpc, which has two columns representing coordinates of the inferred geographical location for each cohort.
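One way to picture how three reference populations can yield two-dimensional coordinates is plain trilateration. The sketch below treats the pairwise values as if they were Euclidean distances, which is a strong assumption, and it is not claimed to be the algorithm fpc uses; `place_references` and `locate` are hypothetical names.

```python
import math


def place_references(d_ab, d_ac, d_bc):
    """Put reference A at the origin, B on the x-axis, and C above the axis."""
    cx = (d_ac ** 2 - d_bc ** 2 + d_ab ** 2) / (2.0 * d_ab)
    cy = math.sqrt(max(d_ac ** 2 - cx ** 2, 0.0))
    return (0.0, 0.0), (d_ab, 0.0), (cx, cy)


def locate(d_xa, d_xb, d_xc, refs):
    """Trilaterate a cohort from its distances to the three references."""
    _, (bx, _), (cx, cy) = refs
    x = (d_xa ** 2 - d_xb ** 2 + bx ** 2) / (2.0 * bx)
    y_abs = math.sqrt(max(d_xa ** 2 - x ** 2, 0.0))
    # Two mirror-image solutions exist; keep the one that better matches the distance to C.
    candidates = ((x, y_abs), (x, -y_abs))
    return min(candidates,
               key=lambda pt: abs(math.hypot(pt[0] - cx, pt[1] - cy) - d_xc))


# Toy distances from a 3-4-5 triangle, not real Fst values.
refs = place_references(4.0, 3.0, 5.0)
point = locate(math.sqrt(2.0), math.sqrt(10.0), math.sqrt(5.0), refs)
print(refs)
print(point)
```

With real F<sub>st</sub> values the three distances are rarely exactly consistent, so any such placement is approximate.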
[Return to GEAR Home](https://github.com/gc5k/GEAR/wiki)