# Computing Population Genetics Statistics (*f*<sub>2</sub> and F<sub>ST</sub>)

Author: Mary T. Yohannes

## Index
1. [Set Default Paths](#1.-Set-Default-Paths)
2. [Read in Pre-QC Dataset and Apply Quality Control Filters](#2.-Read-in-Pre-QC-Dataset-and-Apply-Quality-Control-Filters)
3. [*f*<sub>2</sub> Analysis](#3.-f2-Analysis)
4. [F<sub>ST</sub>](#4.-F_ST)
    1. [F<sub>ST</sub> with PLINK](#4.a.-F_ST-with-PLINK)
        1. [PLINK Set up](#4.a.1.-PLINK-Set-up)
        2. [Files Set up](#4.a.2.-Files-Set-up)
        3. [Scripts Set up](#4.a.3.-Scripts-Set-up) 
        4. [Run Scripts](#4.a.4.-Run-Scripts)  

# General Overview 

The purpose of this notebook is to show two common population genetics analyses (*f*<sub>2</sub> and F<sub>ST</sub>) which are used to understand recent and deep history. 

*f*<sub>2</sub> analysis counts the SNVs that appear exactly twice in a dataset (doubletons) and compares how often they are shared among individuals. Because doubletons are rare variants, they tend to have arisen relatively recently, so they give us information about recent population history. In contrast, F<sub>ST</sub> is a “fixation index” that measures the extent of variation within versus between populations using SNVs across many frequencies. Because F<sub>ST</sub> analyses rely on common variants, which arose a long time ago, they give us information about older population history.

**This script contains information on how to:**
- Read in a matrix table (mt) and filter using sample IDs that were obtained from another matrix table 
- Separate a field into multiple fields
- Filter using the call rate field 
- Extract doubletons and check if they are the reference or alternate allele
- Count how many times a sample or a sample pair appears in a field 
- Combine two dictionaries and add up the values for identical keys
- Format list as pandas table 
- Export a matrix table as PLINK2 BED, BIM and FAM files 
- Set up a pair-wise comparison
- Drop certain mt/table fields
- Download a tool such as PLINK using a link and shell command 
- Run shell commands, set up a script & run it from inside a code block 
- Calculate F<sub>ST</sub> using PLINK 
- Go through a log file and extract needed information
- Write out results onto the Cloud 

In [2]:
import hail as hl
import pandas as pd

# Functions from gnomAD library to apply genotype filters   
from gnomad.utils.filtering import filter_to_adj

In [3]:
# Initializing Hail 
hl.init()

Running on Apache Spark version 3.1.3
SparkUI available at http://mty-m.c.diverse-pop-seq-ref.internal:42753
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.109-b71b065e4bb6
LOGGING: writing to /home/hail/hail-20230317-1456-0.2.109-b71b065e4bb6.log


# 1. Set Default Paths

These default paths can be edited by users as needed. It is recommended to run these tutorials without writing out datasets.

**By default, all of the dataset write out sections are shown as markdown cells. If you would like to write out your own dataset, you can copy the code and paste it into a new code cell. Don't forget to change the paths in the following cell accordingly.** 

[Back to Index](#Index)

In [4]:
# Beginning input file path for f2 analysis - HGDP+1kGP dataset prior to applying gnomAD QC filters
pre_qc_path = 'gs://gcp-public-data--gnomad/release/3.1.2/mt/genomes/gnomad.genomes.v3.1.2.hgdp_1kg_subset_dense.mt'

# Path for unrelated samples mt without outliers -  for subsetting purposes
unrelateds_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg/pca_results/unrelateds_without_outliers.mt'

# Path for final count table for the f2 analysis 
final_doubleton_count_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg/f2_fst/doubleton_count.csv'

In [5]:
# Beginning input file path for F_ST analysis - mt generated in Notebook 2: PCA and Ancestry Analyses after LD pruning
fst_input_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg/pca_preprocessing/ld_pruned.mt'

# Path for exporting the PLINK files 
# Include file prefix at the end of the path - here the prefix is 'hgdp_tgp'
plink_files_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg/f2_fst/hgdp_tgp'

# Path for final F_ST output  
final_mean_fst_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg/f2_fst/mean_fst.txt'

# 2. Read in Pre-QC Dataset and Apply Quality Control Filters

Since the post-QC mt was not written out, we run the same function as the previous tutorial notebooks to apply the quality control filters to the pre-QC dataset.

**To avoid errors, make sure to run the next two cells before running any code that includes the post-QC dataset.**

**If running the cell below results in an error, double check that you used the  `--packages gnomad` argument when starting your cluster.**  

- See the tutorials [README](https://github.com/atgu/hgdp_tgp/tree/master/tutorials#readme) for more information on how to start a cluster.

<br>
<details><summary> For more information on Hail methods and expressions click <u><span style="color:blue">here</span></u>.</summary> 
<ul>
<li><a href="https://hail.is/docs/0.2/methods/impex.html#hail.methods.read_matrix_table"> More on  <i> read_matrix_table() </i></a></li>
        
<li><a href="https://hail.is/docs/0.2/hail.expr.Expression.html#hail.expr.Expression.describe"> More on  <i> describe() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.count"> More on  <i> count() </i></a></li>
    
<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.filter_cols"> More on  <i> filter_cols() </i></a></li>

<li><a href="https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.filter_rows"> More on  <i> filter_rows() </i></a></li>
</ul>
</details>

[Back to Index](#Index)

In [None]:
# Set up function to apply gnomAD's sample, variant and genotype QC filters

def run_qc(mt):
    
    ## Apply sample QC filters to dataset 
    # This filters to only samples that passed gnomAD's sample QC hard filters  
    mt = mt.filter_cols(~mt.gnomad_sample_filters.hard_filtered) # removed 31 samples
    
    ## Apply variant QC filters to dataset
    # This subsets to only PASS variants - those which passed gnomAD's variant QC
    # PASS variants have an empty filters field, so rows with any filter entry are removed
    mt = mt.filter_rows(hl.len(mt.filters) != 0, keep=False)

    ## Apply genotype QC filters to the dataset
    # This is done using a function imported from gnomAD and is the last step in the QC process
    mt = filter_to_adj(mt)

    return mt

In [None]:
# Read in the HGDP+1kGP pre-QC mt
pre_qc_mt = hl.read_matrix_table(pre_qc_path)

# Run QC 
mt_filt = run_qc(pre_qc_mt)

<a id='3.-f2-Analysis'></a>

# 3. *f*<sub>2</sub> Analysis


We are running *f*<sub>2</sub> on unrelated samples only, so for subsetting we use the unrelateds mt, which was generated after removing outliers and rerunning PCA in <a href="https://nbviewer.org/github/atgu/hgdp_tgp/blob/master/tutorials/nb4.ipynb">Notebook 2</a>. After obtaining the desired samples, we run Hail's common variant statistics (<code>hl.variant_qc</code>) so we can separate out doubletons. Once we have the doubletons, we remove variants with a call rate of 0.05 or lower, i.e. variants that have a genotype call in no more than 5% of samples.

*Doubletons are variants that show up twice and only twice in a dataset and are useful for detecting rare variation and understanding recent history.
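
To make the definition concrete, here is a minimal pure-Python sketch (not the Hail code used in this notebook) that picks out doubletons from a toy genotype table. The sample IDs, variant names, and the `alt_allele_count` and `call_rate` helpers are made up for illustration. Genotypes are written as the number of alternate alleles per sample, with `None` for a missing call, and only alternate-allele doubletons are considered.

```python
# Toy genotypes: alt-allele dosage per sample (0, 1, 2) or None if missing.
# Sample IDs and variant names are hypothetical.
samples = ["S1", "S2", "S3", "S4"]
genotypes = {
    "var_a": [1, 1, 0, 0],     # alt allele seen twice, in two samples
    "var_b": [2, 0, 0, 0],     # alt allele seen twice, in one homozygous sample
    "var_c": [1, 0, 0, 0],     # singleton
    "var_d": [1, 1, 1, None],  # seen three times, one missing call
}

def alt_allele_count(dosages):
    """Sum alt-allele copies over non-missing calls."""
    return sum(d for d in dosages if d is not None)

def call_rate(dosages):
    """Fraction of samples with a non-missing genotype call."""
    return sum(d is not None for d in dosages) / len(dosages)

# Keep variants whose alt allele count is exactly 2 and record their carriers.
doubletons = {
    variant: {s for s, d in zip(samples, dosages) if d}
    for variant, dosages in genotypes.items()
    if alt_allele_count(dosages) == 2
}
print(doubletons)
```

A doubleton can therefore be carried by two heterozygous samples or by a single homozygous sample, which is why the later steps track both single samples and sample pairs.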


[Back to Index](#Index)

In [5]:
# This code chunk took 20min to run 

# Filter to only unrelated samples - 3378 samples 
mt_unrel = hl.read_matrix_table(unrelateds_path) 
unrel_samples = mt_unrel.s.collect() # collect sample IDs as a list 
unrel_samples = hl.literal(unrel_samples) # capture and broadcast the list as an expression 
mt_filt_unrel = mt_filt.filter_cols(unrel_samples.contains(mt_filt['s'])) # filter mt 
print(f'Num of samples after filtering (unrelated samples) = {mt_filt_unrel.count_cols()}') # 3378 samples

# Run common variant statistics (quality control metrics)  
mt_unrel_varqc = hl.variant_qc(mt_filt_unrel)

# Separate the AC array into individual fields and extract the doubletons  
mt_unrel_interm = mt_unrel_varqc.annotate_rows(AC1 = mt_unrel_varqc.variant_qc.AC[0], AC2 = mt_unrel_varqc.variant_qc.AC[1])
mt_unrel_only2 = mt_unrel_interm.filter_rows((mt_unrel_interm.AC1 == 2) | (mt_unrel_interm.AC2 == 2))
print(f'Num of variants that are doubletons = {mt_unrel_only2.count_rows()}') # 17279480 variants

# Keep variants with call rate > 0.05, i.e. called in more than 5% of samples
# (a filter on <= 5% missingness would instead use call_rate > 0.95)
mt_unrel_only2_filtered = mt_unrel_only2.filter_rows(mt_unrel_only2.variant_qc.call_rate > 0.05)
print(f'Num of variants > 0.05 = {mt_unrel_only2_filtered.count_rows()}') # 17229743 variants

Num of samples after filtering (unrelated samples) = 3378
Num of variants that are doubletons = 17279480
Num of variants > 0.05 = 17229743


The next step is to check which allele, reference (ref) or alternate (alt), is the doubleton. If the first element of the array in the allele frequency field (AF[0]) is less than the second element (AF[1]), then the doubleton is the 1st allele (ref). If the first element (AF[0]) is greater than the second element (AF[1]), then the doubleton is the 2nd allele (alt).
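
For a single biallelic variant, this comparison can be sketched as follows. `doubleton_allele` is a hypothetical helper and the allele counts are invented (6756 alleles corresponds to 3378 diploid samples with no missing calls). The sketch also makes the edge case explicit: a variant with equal frequencies would match neither of the two filters.

```python
def doubleton_allele(allele_counts):
    """Return 'ref' or 'alt' for a biallelic variant given [AC_ref, AC_alt]."""
    total = sum(allele_counts)
    af_ref, af_alt = (ac / total for ac in allele_counts)
    if af_ref < af_alt:
        return "ref"  # the reference allele is the rare one
    if af_ref > af_alt:
        return "alt"  # the alternate allele is the rare one
    return "tie"      # equal frequencies: belongs to neither group

print(doubleton_allele([6754, 2]))
print(doubleton_allele([2, 6754]))
```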

[Back to Index](#Index)

In [None]:
# This code chunk took 44min to run because of the print commands which we've commented out here

# Check allele frequency (AF) to see if the doubleton is the ref or alt allele 

# AF[0] < AF[1] - doubleton is 1st allele (ref)
mt_doubl_ref = mt_unrel_only2_filtered.filter_rows((mt_unrel_only2_filtered.variant_qc.AF[0] < mt_unrel_only2_filtered.variant_qc.AF[1]))
#print(f'Num of variants where the 1st allele (ref) is the doubleton = {mt_doubl_ref.count_rows()}') # 2979 variants


# AF[0] > AF[1] - doubleton is 2nd allele (alt)
mt_doubl_alt = mt_unrel_only2_filtered.filter_rows((mt_unrel_only2_filtered.variant_qc.AF[0] > mt_unrel_only2_filtered.variant_qc.AF[1]))
#print(f'Num of variants where the 2nd allele (alt) is the doubleton = {mt_doubl_alt.count_rows()}') # 17226764 variants

# Validity check: should print True
#mt_doubl_ref.count_rows() + mt_doubl_alt.count_rows() == mt_unrel_only2_filtered.count_rows() # True
#print(2979 + 17226764 == 17229743) 

Once we've figured out which allele is the doubleton and divided the doubleton matrix table accordingly, the next step is to find the samples that carry doubletons, compile them in a set, and annotate that set onto the mt as a new row field. This is done for each mt separately by looking at the genotype call (GT) field. When the doubleton is the 1st allele (ref), the genotype calls of interest are 0|1 and 0|0. When the doubleton is the 2nd allele (alt), they are 0|1 and 1|1. 

We chose a set for the results instead of a list because a list isn't hashable and the next step wouldn't run. After the annotation of the new row field in each mt, we then count how many times a sample or a sample pair appears within that field and store the results in a dictionary. Once we have the two dictionaries (one for the ref and one for the alt), we merge them into one and add up the values for identical keys.

If you want to do a validity check at this point, you can add up the count of the two dictionaries and then subtract the number of keys that intersect between the two. The value that you get should be equal to the length of the combined dictionary.  
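
The merge and the validity check can be sketched on toy data like this. The counts and sample IDs are hypothetical, and `collections.Counter` is used here as an alternative to the dictionary comprehension in the cell below, not as the notebook's own method. The keys are `frozenset` objects because a set used as a dictionary key must be hashable.

```python
from collections import Counter

# Hypothetical counts keyed by the set of samples carrying each doubleton.
ref_counts = {frozenset(["S1"]): 1, frozenset(["S1", "S2"]): 2}
alt_counts = {frozenset(["S1", "S2"]): 5, frozenset(["S3"]): 4}

# Counter addition sums the values of keys present in both dictionaries.
combined = Counter(ref_counts) + Counter(alt_counts)

# Validity check: total keys minus the keys shared by both dictionaries.
shared_keys = ref_counts.keys() & alt_counts.keys()
assert len(combined) == len(ref_counts) + len(alt_counts) - len(shared_keys)
print(dict(combined))
```

Note that `Counter` addition drops keys whose summed value is zero or negative, which is not a concern here because every count is at least 1.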

[Back to Index](#Index)

In [7]:
# This code chunk took 21min to run

# For each mt, find the samples that have doubletons, compile them in a set, and add as a new row field 

# Doubleton is 1st allele (ref) - 0|1 & 0|0
# If there is one sample in the new row field then it's 0|0
# If there are two samples, then it's 0|1
mt_ref_collected = mt_doubl_ref.annotate_rows(
    samples_with_doubletons=hl.agg.filter(
        (mt_doubl_ref.GT == hl.call(0, 1)) | (mt_doubl_ref.GT == hl.call(0, 0)), hl.agg.collect_as_set(mt_doubl_ref.s)))

# Doubleton is 2nd allele (alt) - 0|1 & 1|1
# If there is one sample in the new row field then it's 1|1
# If there are two samples, then it's 0|1
mt_alt_collected = mt_doubl_alt.annotate_rows(
    samples_with_doubletons=hl.agg.filter(
        (mt_doubl_alt.GT == hl.call(0, 1)) | (mt_doubl_alt.GT == hl.call(1, 1)), hl.agg.collect_as_set(mt_doubl_alt.s)))

# Count how many times a sample or a sample pair appears in the "samples_with_doubletons" field - returns a dictionary
ref_doubl_count = mt_ref_collected.aggregate_rows(hl.agg.counter(mt_ref_collected.samples_with_doubletons))
alt_doubl_count = mt_alt_collected.aggregate_rows(hl.agg.counter(mt_alt_collected.samples_with_doubletons))

# Combine the two dictionaries and add up the values for identical keys  
all_doubl_count = {k: ref_doubl_count.get(k, 0) + alt_doubl_count.get(k, 0) for k in set(ref_doubl_count) | set(alt_doubl_count)}
print(f'Length of dictionary = {len(all_doubl_count)}') # 2989787

# Validity check 
#len(all_doubl_count) == (len(ref_doubl_count) + len(alt_doubl_count)) - # of keys that intersect b/n the two dictionaries  

Length of dictionary = 2989787


For the next step, we get a list of samples that are in the doubleton mt and also create sample pairs out of them. We also divide the combined dictionary into two: one for when a sample is a key by itself (len(key) == 1) and the other for when the dictionary key is a pair of samples (len(key) != 1). 

We then go through the lists of samples obtained from the mt and see if any of them are keys in their respective doubleton dictionaries:
- list of samples by themselves is compared against the dictionary that has a single sample as a key 
- the list with sample pairs is compared against the dictionary where the key is a pair of samples 
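
A small sketch of these two preparation steps on toy data follows. The sample IDs and counts are hypothetical, and `itertools.combinations` is one possible way to enumerate unordered pairs; the cell below uses a nested `enumerate` loop instead.

```python
from itertools import combinations

sample_ids = ["S1", "S2", "S3", "S4"]  # hypothetical sample IDs

# All unordered pairs: n(n-1)/2 of them.
sample_pairs = [frozenset(pair) for pair in combinations(sample_ids, 2)]
n = len(sample_ids)
assert len(sample_pairs) == n * (n - 1) // 2

# Hypothetical doubleton counts keyed by carrier set.
counts = {frozenset(["S1"]): 3, frozenset(["S1", "S2"]): 7, frozenset(["S3"]): 1}

# Split by key size: single samples versus sample pairs.
single_counts = {k: v for k, v in counts.items() if len(k) == 1}
pair_counts = {k: v for k, v in counts.items() if len(k) != 1}
print(len(sample_pairs), len(single_counts), len(pair_counts))
```

With the 3378 samples used in this notebook, the same formula gives 3378 × 3377 / 2 = 5703753 pairs.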

[Back to Index](#Index)

In [9]:
# Get list of samples from mt - 3378 samples 
mt_sample_list = mt_unrel_only2_filtered.s.collect()

# Make pairs from sample list: n(n-1)/2 - 5703753 pairs 
mt_sample_pairs = [{x,y} for i, x in enumerate(mt_sample_list) for j,y in enumerate(mt_sample_list) if i<j]

# Subset dict to only keys with length of 1 - one sample 
dict_single_samples = {x:all_doubl_count[x] for x in all_doubl_count if len(x) == 1}

# Subset dict to keys with sample pairs (not just 1)
dict_pair_samples = {x:all_doubl_count[x] for x in all_doubl_count if len(x) != 1}

# Validity check 
print(len(dict_single_samples) + len(dict_pair_samples) == len(all_doubl_count)) # True

# Are the samples in the list the same as the dict keys?
print(len(mt_sample_list) == len(dict_single_samples)) # True 

# Are the sample pairs obtained from the mt equal to what's in the pair dict? 
print(len(mt_sample_pairs) == len(dict_pair_samples)) # False - there are more sample pairs obtained from the mt

True
True
False


If a single sample is a key in the single-sample-key dictionary, we record the sample ID twice and its corresponding value from the dictionary. If it is not a key, we record the sample ID twice and set the value to 0. 

[Back to Index](#Index)

In [13]:
# Single sample comparison 
single_sample_final_list = [[s, s, 0] if dict_single_samples.get(frozenset([s])) is None else [s, s, dict_single_samples[frozenset([s])]] for s in mt_sample_list]

# Validity check 
# For the single samples, the length should be consistent across dict, mt sample list, and final list
print(len(single_sample_final_list) == len(mt_sample_list) == len(dict_single_samples)) # True 

True


If a sample pair is a key in the sample-pair-key dictionary, we record the two sample IDs and the corresponding value from the dictionary. If that is not the case, we record the two sample IDs and set the value to 0. 

In [14]:
# Sample pair comparison
sample_pair_final_list = [[list(s)[0], list(s)[1], 0] if dict_pair_samples.get(frozenset(list(s))) is None else [list(s)[0], list(s)[1], dict_pair_samples[frozenset(list(s))]] for s in mt_sample_pairs]

# Validity check 
# Length of final list should be equal to the length of the sample pair list obtained from the mt 
print(len(sample_pair_final_list) == len(mt_sample_pairs)) # True

True
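
The two comparisons above follow the same pattern: look up a key and fall back to 0 when it is missing. A compact sketch of that pattern on toy data is shown here, using `dict.get` with a default value. The sample IDs and counts are hypothetical.

```python
from itertools import combinations

sample_ids = ["S1", "S2", "S3"]  # hypothetical sample IDs
counts = {frozenset(["S1"]): 3, frozenset(["S1", "S2"]): 7}

# dict.get with a default of 0 covers the "not a key" case in one step.
single_rows = [[s, s, counts.get(frozenset([s]), 0)] for s in sample_ids]
pair_rows = [[a, b, counts.get(frozenset([a, b]), 0)]
             for a, b in combinations(sample_ids, 2)]
print(single_rows + pair_rows)
```

Iterating over ordered tuples rather than sets keeps the order of the two sample IDs in each row deterministic, while the `frozenset` lookup still ignores order.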


The last step is to combine the two lists obtained from the comparisons, convert the result into a pandas table, format it as needed, and write it out as a tab-delimited file (saved here with a .csv extension) so the values can be plotted as a heat map in R. 

In [34]:
final_list = single_sample_final_list + sample_pair_final_list

# Validity check 
len(final_list) == len(single_sample_final_list) + len(sample_pair_final_list) # True

# Format list as pandas table 
df = pd.DataFrame(final_list)
df.rename({0:'sample1', 1:'sample2', 2:'count'}, axis=1, inplace=True) # rename column names 

- Write out table to the Cloud so it can be plotted in R 

```python3
df.to_csv(final_doubleton_count_path, index=False, sep='\t')
```

### The sample-level *f*<sub>2</sub> heatmap was plotted in R using this [code](https://github.com/atgu/hgdp_tgp/blob/master/F2_heatmap.Rmd).

[Back to Index](#Index)

<a id='4.-F_ST'></a>
# 4. F<sub>ST</sub>

F<sub>ST</sub> detects genetic divergence from common variation, allowing us to understand deep population history. Similar to the *f*<sub>2</sub> analysis, we are running F<sub>ST</sub> on unrelated samples only. However, here we start with the filtered and LD-pruned dataset instead of the pre-QC dataset; outliers are removed when we subset to the unrelated samples.

**Something to note**: Since the mt we are starting with is filtered and pruned, running <code>hl.variant_qc</code> and filtering to variants with <code>call rate > 0.05 </code> (similar to what we did for the *f*<sub>2</sub> analysis) doesn't make a difference to the number of variants.
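
As a conceptual illustration only, the sketch below computes F<sub>ST</sub> for one biallelic SNV in two equally weighted populations using Wright's definition, (H<sub>T</sub> − H<sub>S</sub>) / H<sub>T</sub>. This is a simplification and is not the estimator PLINK applies (to our understanding, PLINK 1.9's `--fst` follows Weir and Cockerham's method). The function name `wright_fst` and the allele frequencies are made up.

```python
def wright_fst(p1, p2):
    """Wright's F_ST for one biallelic site and two equally weighted populations."""
    # Mean expected heterozygosity within each population
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity of the pooled population
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    return (h_t - h_s) / h_t if h_t > 0 else 0.0

print(round(wright_fst(0.2, 0.8), 2))  # strongly differentiated frequencies
print(wright_fst(0.5, 0.5))            # identical frequencies
```

Identical allele frequencies give a value of 0, and the value approaches 1 as the two populations become fixed for different alleles.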


[Back to Index](#Index)

In [38]:
# Read-in the filtered and pruned mt with outliers - 199974 variants and 4120 samples
mt_FST_initial = hl.read_matrix_table(fst_input_path) 

# Filter to only unrelated samples - 3378 samples (also excludes outliers)
mt_unrel = hl.read_matrix_table(unrelateds_path) 
unrel_samples = mt_unrel.s.collect() # collect sample IDs as a list 
unrel_samples = hl.literal(unrel_samples) # capture and broadcast the list as an expression 
mt_FST_unrel = mt_FST_initial.filter_cols(unrel_samples.contains(mt_FST_initial['s'])) # filter mt
print(f'Num of samples after filtering (unrelated samples) = {mt_FST_unrel.count_cols()}') # 3378 samples

Num of samples after filtering (unrelated samples) = 3378


<a id='4.a.-F_ST-with-PLINK'></a>
## 4.a. F<sub>ST</sub> with PLINK

In order to calculate F<sub>ST</sub> using PLINK, we first need to export the mt as PLINK files.

After exporting the files to PLINK format, the rest of the analysis is done using shell commands within the notebook. 

**Place <code>!</code> before the command you want to run and proceed as if you were running commands in a terminal.** You can use <code>! ls</code> after each run to check for outputs in the directory and see if commands have run correctly. 
 
**Every time you start a new cluster, you will need to download PLINK to run the F<sub>ST</sub> analysis since downloads and files are discarded when a cluster is stopped.**  

**Something to note**: when running the notebook on the Cloud, the shell commands still run even if we didn't use <code>!</code>. The same is true for when running the notebook locally. If you are having issues with the shell commands, we recommend trying both ways. 

[Back to Index](#Index)

- Export mt as PLINK2 BED, BIM and FAM files - store on Google Cloud 

```python3
hl.export_plink(mt_FST_unrel, plink_files_path, fam_id=mt_FST_unrel.hgdp_tgp_meta.population)
```

### 4.a.1. PLINK Set up 

[Back to Index](#Index)

In [3]:
# Download PLINK using a link from the PLINK website (linux - recent version - 64 bit - stable) 
! wget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20210606.zip
    
# Unzip the ".zip" file: 
! unzip plink_linux_x86_64_20210606.zip

# A documentation output when you run this command indicates that PLINK has been installed properly 
! ./plink 



### 4.a.2. Files Set up 

Because F<sub>ST</sub> is computed among groups, we need to create a list of all pairs of populations.
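
The pairing logic can be sketched in Python as follows; the cell below does the same thing with shell commands. The FAM rows and population names here are hypothetical, and Python's `sorted` may order names differently from `sort -u` depending on the shell locale.

```python
from itertools import combinations

# Hypothetical FAM rows: family ID (population), sample ID, then the other four columns.
fam_lines = [
    "PopA S1 0 0 0 -9",
    "PopB S2 0 0 0 -9",
    "PopA S3 0 0 0 -9",
    "PopC S4 0 0 0 -9",
]

# Unique population codes from the first column, then every unordered pair.
populations = sorted({line.split()[0] for line in fam_lines})
pop_pairs = list(combinations(populations, 2))
print(pop_pairs)
```

With 78 populations, this yields 78 × 77 / 2 = 3003 pairs, which matches the line count checked in the next cell.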

[Back to Index](#Index)

In [4]:
# Copy the PLINK files that are stored on the Cloud to the current session directory
! gsutil cp {plink_files_path}.fam . # fam 
! gsutil cp {plink_files_path}.bim . # bim
! gsutil cp {plink_files_path}.bed . # bed

# Obtain FID - in this case, it is the 78 populations in the first column of the FAM file
! awk '{print $1}' hgdp_tgp.fam | sort -u > pop.codes

# Make all possible combinations of pairs using the 78 populations 
! for i in `seq 78`; do for j in `seq 78`; do if [ $i -lt $j ]; then VAR1=`sed "${i}q;d" pop.codes`; VAR2=`sed "${j}q;d" pop.codes`; echo $VAR1 $VAR2; fi; done; done > pop.combos

# Validity check 
! wc -l pop.combos # 3003

# Create directories for intermediate files and F_ST results 
! mkdir within_files # intermediate files
! mkdir FST_results # F_ST results

Copying gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg/f2_fst/hgdp_tgp.fam...
/ [1 files][ 74.8 KiB/ 74.8 KiB]                                                
Operation completed over 1 objects/74.8 KiB.                                     
Copying gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg/f2_fst/hgdp_tgp.bim...
/ [1 files][  7.9 MiB/  7.9 MiB]                                                
Operation completed over 1 objects/7.9 MiB.                                      
Copying gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg/f2_fst/hgdp_tgp.bed...
| [1 files][161.2 MiB/161.2 MiB]                                                
Operation completed over 1 objects/161.2 MiB.                                    


### 4.a.3. Scripts Set up 

[Back to Index](#Index)

In [45]:
### Script 1 ####

# For each population pair, set up a bash script to create a "within" file and run F_ST 
# Files to be produced: 
#    [pop_pairs].within will be saved in the "within_files" directory
#    [pop_pairs].fst, [pop_pairs].log, and [pop_pairs].nosex will be saved in "FST_results" directory

fst_script = '''    
#!/bin/bash

# set variables
for i in `seq 3003`
do
    POP1=`sed "${i}q;d" pop.combos | awk '{ print $1 }'`
    POP2=`sed "${i}q;d" pop.combos | awk '{ print $2 }'`

# create "within" files for each population pair using the FAM file (columns 1,2 and 1 again)
    awk -v r1=$POP1 -v r2=$POP2 '$1 == r1 || $1 == r2' hgdp_tgp.fam | awk '{ print $1, $2, $1 }' > within_files/${POP1}_${POP2}.within

# run F_st
    ./plink --bfile hgdp_tgp --fst --within within_files/${POP1}_${POP2}.within --out FST_results/${POP1}_${POP2}
done'''

with open('run_fst.py', mode='w') as file:
    file.write(fst_script)

**Script 1 has to be run for script 2 to run** 

In [46]:
### Script 2 ### 

# Use the "[pop_pairs].log" file produced from script 1 above (located in "FST_results" directory) to get the "Mean F_st estimate" for each population pair 
# Then compile all of the values in a single file for F_st heat map generation

extract_mean_script = ''' 
#!/bin/bash

# set variables
for i in `seq 3003`
do
    POP1=`sed "${i}q;d" pop.combos | awk '{ print $1 }'`
    POP2=`sed "${i}q;d" pop.combos | awk '{ print $2 }'`
    mean_FST=$(tail -n4 FST_results/${POP1}_${POP2}.log | head -n 1 | awk -F: '{print $2}' | awk '{$1=$1};1')
    printf "%-20s\t%-20s\t%-20s\n" ${POP1} ${POP2} $mean_FST >> mean_fst_sum.txt
done'''

with open('extract_mean.py', mode='w') as file:
    file.write(extract_mean_script)
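
Script 2 locates the estimate by its position in the log (fourth line from the end). As a hedged alternative, the sketch below searches for the label instead, which does not depend on how many lines follow it. The function name `mean_fst_from_log` is hypothetical, and the log excerpt is an invented example that assumes the line is labelled `Mean Fst estimate:`; check a real `.log` file from the `FST_results` directory before relying on it.

```python
import re

def mean_fst_from_log(log_text):
    """Return the mean F_ST reported in a PLINK log, or None if it is absent."""
    match = re.search(r"^Mean Fst estimate:\s*(\S+)", log_text, flags=re.MULTILINE)
    return float(match.group(1)) if match else None

# Illustrative log excerpt; the label format is an assumption and the numbers are made up.
example_log = """\
Mean Fst estimate: 0.0123
Weighted Fst estimate: 0.0234
"""
print(mean_fst_from_log(example_log))
```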

### 4.a.4. Run Scripts  

[Back to Index](#Index)

In [47]:
# Run script 1 (a shell script, despite its .py extension) and direct the run log into a file (~20min to run) 
! sh run_fst.py > fst_script.log

# Validity check 
! cd within_files/; ls | wc -l # 3003
! cd FST_results/; ls | wc -l # 3003 * 3 = 9009 

3003
9009


**Script 2 requires script 1 to be run first**

In [None]:
# Run script 2, also a shell script despite its .py extension (~1min to run)
! sh extract_mean.py 

- Copy the output from script 2 to the Cloud for heat map plotting in R 

```python3
! gsutil cp mean_fst_sum.txt {final_mean_fst_path}
```

### The population-level F<sub>ST</sub> heatmap was plotted in R using this [code](https://github.com/atgu/hgdp_tgp/blob/master/FST_heatmap.Rmd).

[Back to Index](#Index)