# Practical 6

## Assessing genetic ancestry III - Outgroup f3-statistics

This week, you’ll use the tool `qp3Pop` from the `AdmixTools` package [1] to calculate outgroup f<sub>3</sub>-statistics of the form <i>f<sub>3</sub></i>(Mbuti.HO; Mystery Genome, <i>Test</i>). Here the population Mbuti.HO serves as the outgroup, and <i>Test</i> is a placeholder for each of the other populations in your working dataset [2,3] that you’ll compare your mystery genome to. Outgroup f<sub>3</sub>-statistics measure how closely related two populations (in this case your mystery genome and <i>Test</i>) are to one another, as measured against a distant outgroup (in this case, Mbuti.HO).
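
For reference, the statistic can be written in terms of allele frequencies, following the definition in [1]. The symbols here are chosen for illustration: $o$, $m$ and $t$ are the allele frequencies at a SNP in the outgroup, the mystery genome and the <i>Test</i> population, and the expectation is taken across SNPs.

$$
f_3(\text{Outgroup};\ \text{Mystery},\ \textit{Test}) = \mathbb{E}\left[(o - m)(o - t)\right]
$$

A larger value means the mystery genome and <i>Test</i> share more genetic drift since they diverged from the outgroup, which is why higher values are read as closer relatedness.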

### Getting Started

<b>If you haven't already done so, start an interactive session</b>

- Sign in to https://ood.huit.harvard.edu/ 
- Navigate to `Interactive Apps → Jupyter Lab`
- Launch a Jupyter Lab session with the following parameters:
    - Number of hours: 2
    - Number of CPUs: 2
- When the session is ready, click “Connect to Jupyter”

<b>Create a working directory (called "practical_6") from which you will run commands and in which you will store any files that you generate</b>

```bash
mkdir practical_6
cd practical_6
```

<b>Copy these practical instructions to your working directory and open them as a Jupyter Notebook</b>

```bash
cp ~/139860/practical_instructions/Practical6.ipynb ./
```

Then navigate to the `practical_6` directory in the sidebar and click on `Practical6.ipynb` to open it as a Jupyter Notebook.

### Part 1) Prepare a poplist file

In order to run `qp3Pop`, you’ll need to create a “poplist” file that specifies which outgroup f3-statistics should be computed. The poplist contains three tab-separated columns: (1) your population of interest (i.e. the population label of your mystery genome), (2) the “Test” population you are comparing against, and (3) the outgroup population. Each line of the file defines one statistic.

In this exercise, you’ll compare your mystery genome to all of the Human Origins and 1000 Genomes populations that are included in your working dataset. It will be up to you to identify which populations to include in your analysis by pulling this information from the working dataset anno file: https://docs.google.com/spreadsheets/d/1NJEPY-JPSjj3ERmM1SXkz7vYVafIaJ0gjpRQ-XLxAmk/edit?usp=sharing



Here’s an example of what the first few lines of your poplist file should look like:

```bash
{YOUR MYSTERY GENOME}	SouthAfrica_Tswana.HO	Mbuti.HO
{YOUR MYSTERY GENOME}	Ethiopia_BetaIsrael.HO	Mbuti.HO
{YOUR MYSTERY GENOME}	Ethiopia_Oromo.HO	Mbuti.HO
```

Using a method of your choosing, prepare your poplist and save it in a file named `poplist`. Be sure that your poplist doesn’t contain any duplicate lines and that you don’t use your outgroup population (Mbuti.HO) as a Test population (since it can't be both a Test and outgroup). 
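
One possible way to do this is sketched below in Python. It is only an illustration, not a required method: the function names and the label `MysteryGenome` are hypothetical, and you would replace the short example list with the Test populations you selected from the anno file.

```python
def build_poplist(mystery_label, test_populations, outgroup="Mbuti.HO"):
    """Return poplist lines: mystery genome, Test population, outgroup (tab-separated)."""
    seen = set()
    lines = []
    for pop in test_populations:
        pop = pop.strip()
        # Skip blank entries, repeated labels, and the outgroup itself.
        if not pop or pop in seen or pop == outgroup:
            continue
        seen.add(pop)
        lines.append(f"{mystery_label}\t{pop}\t{outgroup}")
    return lines


def write_poplist(lines, path="poplist"):
    """Write the poplist lines to a file, one statistic per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")


# Hypothetical input: note the duplicate label and the outgroup in the list.
example_lines = build_poplist(
    "MysteryGenome",
    ["SouthAfrica_Tswana.HO", "Mbuti.HO", "Ethiopia_Oromo.HO", "SouthAfrica_Tswana.HO"],
)
for line in example_lines:
    print(line)
```

Calling `write_poplist(example_lines)` would then save the result as `poplist` in the current directory.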

<b>Tips for identifying populations in the anno file:</b>

- <b>For the 1000 Genomes populations:</b>
You can identify 1000 Genomes populations by looking for populations that are labeled as coming from the 1KGPhase3 publication in the `Publication` column.
    - Note - There are two versions of each 1000 Genomes population, which are indicated with different suffixes. These two versions are described as follows by the AADR website:
        - SG: Samples with whole genome shotgun sequence data, randomly drawing a single read to represent each position in the genome
        - DG: Samples shotgun sequenced with high enough coverage to call diploid genotypes, allowing for heterozygous calls
    - Just choose one version to include in your analysis. I used .DG, but your results will likely be similar either way
- <b>For the Human Origins populations:</b>
You can identify Human Origins populations by looking at the column called `DataSource`.
    - Note - It is up to you whether you'd like to include non-human primate populations in your analysis. But you don't need to include them on the map that you create.
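
As a rough illustration of the 1000 Genomes tip, the sketch below keeps one version of each population from rows of the anno file. It assumes the rows have been loaded as dictionaries (for example with `csv.DictReader` on an exported copy of the spreadsheet). The name of the column holding the population label is not given here, so it is passed in as a parameter; `group_label` and the rows in the example are made up.

```python
def select_1kg_labels(anno_rows, label_column, suffix=".DG"):
    """Return unique 1000 Genomes population labels that end with the chosen suffix."""
    labels = []
    for row in anno_rows:
        label = row.get(label_column, "").strip()
        publication = row.get("Publication", "")
        # Keep rows attributed to the 1KGPhase3 publication, in one version only.
        if "1KGPhase3" in publication and label.endswith(suffix) and label not in labels:
            labels.append(label)
    return labels


# Made-up rows for illustration; real rows come from the anno file.
toy_rows = [
    {"group_label": "PopA.DG", "Publication": "1KGPhase3"},
    {"group_label": "PopA.SG", "Publication": "1KGPhase3"},
    {"group_label": "PopA.DG", "Publication": "1KGPhase3"},
    {"group_label": "PopC.DG", "Publication": "SomeOtherStudy"},
]
print(select_1kg_labels(toy_rows, "group_label"))
```

Passing `suffix=".SG"` selects the other version instead.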


### Part 2) Prepare a parameter file
Since `qp3Pop` is part of the `AdmixTools` package, you’ll need to create a parameter (or par) file to run it. Create a file called `outgroup_f3.par`, which includes the following:

```bash
genotypename: {POINTER TO YOUR WORKING DATASET GENO FILE}
snpname:      {POINTER TO YOUR WORKING DATASET SNP FILE}
indivname:    {POINTER TO YOUR WORKING DATASET IND FILE}
popfilename: poplist
inbreed: YES
printsd: YES
chrom: 1   
```
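
The labels in `poplist` need to match the population labels in the ind file named by `indivname`, so a quick check before running can save time. The sketch below assumes the usual EIGENSTRAT ind layout (sample ID, sex, population label, separated by whitespace). The function name and the example lines are hypothetical.

```python
def missing_populations(poplist_lines, ind_lines):
    """Return poplist labels that never appear as a population label in the ind file."""
    ind_labels = set()
    for line in ind_lines:
        fields = line.split()
        if len(fields) >= 3:
            ind_labels.add(fields[2])  # third column holds the population label
    missing = []
    for line in poplist_lines:
        for label in line.split():
            if label not in ind_labels and label not in missing:
                missing.append(label)
    return missing


# Made-up example: PopB.HO is requested but absent from the ind file.
toy_ind = [
    "Sample1 M Mbuti.HO",
    "Sample2 F PopA.DG",
    "Sample3 U MysteryGenome",
]
toy_poplist = [
    "MysteryGenome\tPopA.DG\tMbuti.HO",
    "MysteryGenome\tPopB.HO\tMbuti.HO",
]
print(missing_populations(toy_poplist, toy_ind))
```

With real files you would pass the lines read from `poplist` and from your ind file; an empty list means every label was found.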

### Part 3) Run qp3Pop

To run qp3Pop, you can use the following command (Note - This may take a few minutes to run, so it is up to you whether to submit it as a job using sbatch):

```bash
qp3Pop -p outgroup_f3.par > outgroup_f3.out
```

### Part 4) Plot your results

When qp3Pop is finished running, the results of your analysis will be available in a file called: `outgroup_f3.out`. 

The result of each outgroup f3-statistic is reported on a line that starts with the word “result:”, followed by 7 additional columns that specify: 
1. The name of your mystery genome
2. The name of the test population
3. The name of the outgroup population used (in this case Mbuti.HO)
4. The value of the outgroup-f3 statistic
5. The standard error associated with that statistic
6. The z-score associated with the statistic
7. The number of SNPs used in the calculation. 
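
The column layout above can be turned into a small parser. This is a minimal sketch that assumes each result line has exactly the 7 columns listed after “result:”; the function name, dictionary keys and example values are made up for illustration.

```python
def parse_qp3pop_results(lines):
    """Collect the 'result:' lines of a qp3Pop output into a list of dictionaries."""
    records = []
    for line in lines:
        fields = line.split()
        # Ignore everything that is not a complete result line.
        if len(fields) != 8 or fields[0] != "result:":
            continue
        records.append({
            "mystery": fields[1],
            "test": fields[2],
            "outgroup": fields[3],
            "f3": float(fields[4]),
            "std_err": float(fields[5]),
            "z": float(fields[6]),
            "n_snps": int(fields[7]),
        })
    return records


# Made-up output lines for illustration only.
toy_output = [
    "some header text printed before the results",
    " result:  MysteryGenome  PopA.DG  Mbuti.HO  0.250000  0.002000  125.000  50000",
]
toy_records = parse_qp3pop_results(toy_output)
print(toy_records)
```

For the real output you could pass `open("outgroup_f3.out")` as `lines`, since iterating over a file yields one line at a time.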

There are two ways that people commonly plot the results of outgroup-f3 analyses:
1. In a plot that shows all the test populations ordered from highest to lowest associated f-statistic, with error bars showing the associated standard error
2. On a map showing the geographic location of each test population, where marker color indicates the associated f3-statistic.

For this practical, you’ll plot your results on a map.

The following Jupyter Notebook provides an example of how you can plot your results on a map. Make a copy of this notebook and use it as a template for plotting your results. (Alternatively you are welcome to write your own script to do this):
`~/139860/practical_instructions/Practical6_map_plotting.ipynb`
  
To make your map, you’ll need to look up the geographic coordinates associated with each population included in your analysis. They can be found in the AADR “anno” file (https://docs.google.com/spreadsheets/d/1NJEPY-JPSjj3ERmM1SXkz7vYVafIaJ0gjpRQ-XLxAmk/edit?usp=sharing). 

*Note 1 - Sometimes different individuals from the same population have different coordinates listed in the AADR anno file. If that’s the case, don’t worry. For the purpose of this assignment, you can just pick one set of coordinates to represent each population's location.*

*Note 2 - Some populations don't have any coordinates associated with them. Based on the available location information, you are encouraged to choose an approximate location for them so that they can be included in your map. But it is also okay to exclude these populations from your plot.*
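
The two notes above can be handled in one pass. The sketch below keeps the first usable pair of coordinates for each population and leaves out populations with none. It assumes you have already extracted (population, latitude, longitude) values from the anno file as text; the function name and the example rows are hypothetical.

```python
import math


def pick_coordinates(rows):
    """Map each population to the first (latitude, longitude) pair that parses as numbers."""
    coords = {}
    for population, lat, lon in rows:
        if population in coords:
            continue  # one set of coordinates per population is enough
        try:
            lat_value, lon_value = float(lat), float(lon)
        except (TypeError, ValueError):
            continue  # missing or non-numeric entry
        if math.isfinite(lat_value) and math.isfinite(lon_value):
            coords[population] = (lat_value, lon_value)
    return coords


# Made-up rows: PopA.DG has one unusable entry and two usable ones, PopB.HO has none.
toy_coordinate_rows = [
    ("PopA.DG", "..", ".."),
    ("PopA.DG", "51.5", "-0.1"),
    ("PopA.DG", "52.0", "0.0"),
    ("PopB.HO", "", ""),
]
toy_coords = pick_coordinates(toy_coordinate_rows)
print(toy_coords)
```

Populations missing from the returned dictionary are the ones you would either place by hand or leave off the map.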

### Part 5) Make a table

Create a table that includes the top 25 results for populations that are most similar to your mystery genome, based on the computed f3-statistic. At a minimum, this table should include columns 2, 4, 5, 6 and 7 from the `outgroup_f3.out` results file. 

Note - Don’t forget that you made a file called `results_outgroup_f3.csv` during Part 4, which might make this easier. 
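
If you prefer to build the table from parsed results, the ranking step could look like the sketch below. It assumes each result is a dictionary with the hypothetical keys `test`, `f3`, `std_err`, `z` and `n_snps` (corresponding to columns 2, 4, 5, 6 and 7), and the values shown are made up.

```python
def top_results(records, n=25):
    """Return the n results with the highest f3 value as (test, f3, std_err, z, n_snps) rows."""
    ranked = sorted(records, key=lambda record: record["f3"], reverse=True)
    return [
        (r["test"], r["f3"], r["std_err"], r["z"], r["n_snps"])
        for r in ranked[:n]
    ]


# Made-up results for illustration only.
toy_results = [
    {"test": "PopA.DG", "f3": 0.25, "std_err": 0.002, "z": 125.0, "n_snps": 50000},
    {"test": "PopB.HO", "f3": 0.30, "std_err": 0.003, "z": 100.0, "n_snps": 48000},
    {"test": "PopC.HO", "f3": 0.10, "std_err": 0.002, "z": 50.0, "n_snps": 51000},
]
top_two = top_results(toy_results, n=2)
for row in top_two:
    print(*row, sep="\t")
```

Sorting from highest to lowest f3 puts the populations that share the most drift with the mystery genome at the top of the table.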

## When you are finished

### Be sure to include the following in your report: 
<b>Methods section</b>: <br>
A description of how you computed the outgroup f<sub>3</sub>-statistics included in this report.

<i>Remember:</i>
- <i>Be sure to describe the statistic that you computed and which populations you included in your list of <i>Test</i> populations, but you don't need to describe how you made your poplist.</i>
- <i>You don't need to describe how you plotted your results.</i>

<b>Results section</b>: <br>
Be sure to include the following in your results section:
- The map you made showing your outgroup f<sub>3</sub>-statistic results with an appropriate caption.
- A table showing the results for the 25 populations that are most similar to your mystery genome according to your outgroup f<sub>3</sub>-statistic analysis. 

<b>Conclusion section</b>: <br>
Be sure to address the following, based on the results of your analyses: 
- Which populations is your mystery genome most similar to? (Be sure to look at both the assigned f<sub>3</sub>-statistic and the associated standard error)
- Are your mystery genome’s top matches all concentrated in a single geographic region? What does this tell you about your mystery genome's likely ancestry? 

### Additional Questions to answer at the end of your report: 
- Why was Mbuti.HO chosen as the outgroup population for this analysis?
- When might Mbuti.HO not be an ideal outgroup population (think about the ancestry of the individual/population you might want to analyze)? What is an alternative outgroup that you could use in this circumstance?
- Why did you use the `inbreed: YES` parameter?

### References
1. Patterson N, et al. (2012) Ancient admixture in human history. Genetics 192(3), 1065-1093.
2. Mallick S, Reich D (2023) The Allen Ancient DNA Resource (AADR): A curated compendium of ancient human genomes. Harvard Dataverse, V8. https://doi.org/10.7910/DVN/FFIDCW
3. Mallick S, Micco A, Mah M, Ringbauer H, Lazaridis I, Olalde I, Patterson N, Reich D (2024) The Allen Ancient DNA Resource (AADR) a curated compendium of ancient human genomes. Sci Data 11, 182.


## Additional Resources

Here is a link to the qp3Pop README file on GitHub, which you can refer to if you want more information about how the tool works:
https://github.com/DReichLab/AdmixTools/blob/master/README.3PopTest

Here is the user guide for geopandas which you can refer to if you want to adjust your figure:
https://geopandas.org/en/stable/docs/user_guide.html 
