# Practical 5

## Assessing genetic ancestry II - ADMIXTURE

This week, you’ll use the tool `ADMIXTURE` [1] to explore the ancestry of your mystery genome by comparing it to 26 populations from across the globe, all taken from the 1000 Genomes dataset.

Before you begin this week’s practical, take a moment to familiarize yourself with the `ADMIXTURE` documentation (https://dalexander.github.io/admixture/admixture-manual.pdf). We’ll also be using the tool `plink` [2-3] to filter our data before we run the `ADMIXTURE` analysis. `plink` is another extremely powerful tool with many additional uses that we won’t explore in this practical; you can learn more about it here: https://www.cog-genomics.org/plink/

For this assignment, you’ll be replicating the ADMIXTURE plot from The 1000 Genomes Project Consortium phase 3 paper [4]. You’ll also include your mystery individual to see how they compare to each of the 26 1000 Genomes populations included in the original analysis. The method you use to create your ADMIXTURE plot will differ from the one described in the 1000 Genomes Project paper [4], so refer instead to the supplementary materials of Sedig et al. [5] for an example of how to describe your results.

Remember that while you are using data from the 1000 Genomes dataset, you are still running your analysis on the Human Origins SNP set, and you downloaded the data from the Allen Ancient DNA Resource (AADR) [6,7].

### Getting Started

<b>If you haven't already done so, start an interactive session</b>

- Sign in to https://ood.huit.harvard.edu/ 
- Navigate to `Interactive Apps → Jupyter Lab`
- Launch a Jupyter Lab session with the following parameters:
    - Number of hours: 2
    - Number of CPUs: 2
- When the session is ready, click “Connect to Jupyter”

<b>Create a working directory (called "practical_5") from which you will run commands and in which you will store any files that you generate</b>

```bash
mkdir practical_5
cd practical_5
```

<b>Copy these practical instructions to your working directory and open them as a Jupyter Notebook</b>

```bash
cp ~/139860/practical_instructions/Practical5.ipynb ./
```

Then navigate to the `practical_5` directory in the sidebar and click on `Practical5.ipynb` to open it as a Jupyter Notebook.

### Part 1) Convert your analysis dataset to ped format 

The program `ADMIXTURE` [1] cannot take input data in `PACKEDANCESTRYMAP` format, so the first step of your analysis will be to convert your working dataset (from Practical 0) into a format that it can operate on. We’ll use `ped` format (https://www.cog-genomics.org/plink/1.9/input#ped).

For this analysis, you’ll compare your mystery genome to the following populations from the 1000 Genomes dataset:

```bash
LWK.DG
ESN.DG
YRI.DG
MSL.DG
GWD.DG
ACB.DG
ASW.DG
CLM.DG
MXL.DG
PUR.DG
PEL.DG
TSI.DG
IBS.DG
GBR.DG
CEU.DG
FIN.DG
PJL.DG
GIH.DG
ITU.DG
STU.DG
BEB.DG
CDX.DG
KHV.DG
CHS.DG
CHB.DG
JPT.DG
```

`ADMIXTURE` uses all of the data in the input dataset. So, unlike last week, you cannot just change the population labels to “Ignore.” Instead, you’ll need to make a new version of your analysis dataset that contains only the individuals you want to analyze: the 26 populations listed above plus your mystery genome (make sure not to filter it out).

You can convert the file format and filter the individuals in a single step using the tool `convertf`, which is part of the `EIGENSOFT` [8] package (https://github.com/DReichLab/AdmixTools/blob/master/convertf/README).

As with `mergeit` and `smartpca`, you need to create a parameter (par) file to run `convertf`. Make a parameter file called `convertf.par` that includes the lines below. The final line, `chrom: 1`, restricts the output to SNPs on chromosome 1.

```bash
genotypename: {POINTER TO YOUR WORKING DATASET GENO FILE}
snpname:      {POINTER TO YOUR WORKING DATASET SNP FILE}
indivname:    {POINTER TO YOUR WORKING DATASET IND FILE}
outputformat:   PED
genooutfilename:   admixture_data.ped
snpoutfilename:    admixture_data.pedsnp
indoutfilename:    admixture_data.pedind
poplistname: {POINTER TO FILE THAT CONTAINS A LIST OF ALL THE POPS YOU WANT TO RETAIN. ONE PER LINE}
chrom: 1
```
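
The file named by `poplistname` is a plain text file with one population label per line. A minimal sketch of building it from the shell is shown here. The file name `poplist.txt` and the placeholder `MYSTERY_LABEL` are assumptions: replace the placeholder with the population label that your mystery individual carries in the third column of your `.ind` file.

```bash
# Sketch: write the 26 1000 Genomes labels plus the mystery label, one per line.
# MYSTERY_LABEL is a placeholder (assumption); substitute your individual's real label.
mystery_label="MYSTERY_LABEL"

printf '%s\n' \
    LWK.DG ESN.DG YRI.DG MSL.DG GWD.DG ACB.DG ASW.DG \
    CLM.DG MXL.DG PUR.DG PEL.DG \
    TSI.DG IBS.DG GBR.DG CEU.DG FIN.DG \
    PJL.DG GIH.DG ITU.DG STU.DG BEB.DG \
    CDX.DG KHV.DG CHS.DG CHB.DG JPT.DG \
    "$mystery_label" > poplist.txt

# Expect 27 lines: 26 reference populations + 1 mystery label.
wc -l < poplist.txt
```

You would then point `poplistname` in `convertf.par` at this file.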

To run `convertf`, submit the following command as a job using `sbatch`:

```bash
convertf -p convertf.par > convertf.out
```

Monitor your job using the `squeue` command or by checking what output has been written to `convertf.out`.
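
One way to submit the command is to wrap it in a small batch script. The sketch below only writes the script and does not submit it. The script name `run_convertf.sh` and the resource requests (time, memory, CPUs) are illustrative assumptions rather than course requirements, and your cluster may also require a partition or account option.

```bash
# Sketch: write a Slurm batch script that wraps the convertf command.
# Resource values are illustrative guesses (assumptions); adjust as needed.
cat > run_convertf.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=convertf
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1
#SBATCH --output=convertf_%j.log

convertf -p convertf.par > convertf.out
EOF

# Submit from the directory that contains convertf.par, then check the queue:
#   sbatch run_convertf.sh
#   squeue -u "$USER"
```

Because the job starts in the directory it was submitted from, relative paths such as `convertf.par` resolve the same way they would on the command line.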

Once `convertf` finishes running, you’ll need to run one more command to reformat the `.pedsnp` file that `convertf` produced into `.map` format:

```bash
awk -F" " '{print $1" "$2" "$3" "$4}' admixture_data.pedsnp > admixture_data.map
```
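
To see what this reformatting does, the sketch below applies the same `awk` command to a two-line toy file in a scratch directory. The SNP names and coordinates are made up for illustration, and the six-column layout (chromosome, SNP ID, genetic position, physical position, two alleles) is an assumption about what `convertf` writes. A PLINK `.map` file keeps only the first four of those columns.

```bash
# Toy demonstration in a scratch directory so no real files are touched.
demo_dir=$(mktemp -d)

# Two made-up SNP records (hypothetical values, not real data).
cat > "$demo_dir/toy.pedsnp" <<'EOF'
1 rs_toy1 0.010 1000 A G
1 rs_toy2 0.020 2000 C T
EOF

# Same awk command as above: keep columns 1-4 only.
awk -F" " '{print $1" "$2" "$3" "$4}' "$demo_dir/toy.pedsnp" > "$demo_dir/toy.map"

cat "$demo_dir/toy.map"

# Both files should have the same number of lines (one per SNP).
wc -l < "$demo_dir/toy.pedsnp"
wc -l < "$demo_dir/toy.map"
```

Running the same two `wc -l` checks on `admixture_data.pedsnp` and `admixture_data.map` is a quick way to confirm that no SNPs were lost.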

*Note - Since all you are doing in Part 1 is reformatting your data, not running any analyses, you don't need to describe these steps in your methods section, except to state which populations you've included in your analysis and which SNP set you are analyzing (including which chromosome you focused your analysis on).*

### Part 2) Prune out SNPs that are in linkage disequilibrium (LD)

For `ADMIXTURE` to run correctly, the input dataset should not contain SNPs that are in linkage disequilibrium with one another (i.e., SNPs that are non-randomly associated, which means that they tend to be inherited together). We can use the program `plink` to identify SNPs that are in LD (first command) and then to create a new dataset that excludes them (second command). In the first command, the three values passed to `--indep-pairwise` are the window size (200 SNPs), the step by which the window slides (25 SNPs), and the r² threshold (0.4) above which one SNP from each pair is pruned:

```bash
plink --file admixture_data --indep-pairwise 200 25 0.4
```

```bash
plink --file admixture_data --extract plink.prune.in --make-bed --out admixture_data_pruned
```
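
After the first command, `plink` writes the IDs of the SNPs it retains to `plink.prune.in` and the IDs of the SNPs it removes to `plink.prune.out`, one per line. Counting the lines in each file gives you numbers you can report when you describe LD pruning. The helper name `count_lines` in this sketch is hypothetical.

```bash
# Sketch: report how many SNPs were kept and removed by LD pruning.
# count_lines is a hypothetical helper that prints the number of lines in a file.
count_lines() {
    if [ ! -f "$1" ]; then
        echo "file not found: $1" >&2
        return 1
    fi
    wc -l < "$1" | tr -d ' '
}

echo "SNPs retained: $(count_lines plink.prune.in)"
echo "SNPs removed:  $(count_lines plink.prune.out)"
```

The retained count should match the number of lines in `admixture_data_pruned.bim` once the second command has finished.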

### Part 3) Run ADMIXTURE 

Now that your data is ready, you can run `ADMIXTURE`. For this exercise, you’ll be setting k=8, which means that all of the individuals will be assigned some amount of ancestry from 8 theoretical ancestral populations (i.e. there will be 8 colors in your ADMIXTURE plot). 

Unfortunately, `ADMIXTURE` doesn't let you pick the output file name or location, so if you want to run multiple replicates (which we do), you should run them from different folders. Make a folder called `rep1` and move into it before you start running `ADMIXTURE`. To do this, run the following commands (make sure you are in your `practical_5` directory first):

```bash
mkdir rep1
cd rep1
```

To run `ADMIXTURE`, submit the following command as a job using `sbatch` from within your `rep1` folder:

```bash
~/139860/tools/admixture/admixture -s time --cv ../admixture_data_pruned.bed 8 > runlog.k8.rep1.txt
```

### Part 4) Run another ADMIXTURE replicate

Typically researchers run multiple replicates of each `ADMIXTURE` analysis in order to avoid generating results that converge to a local rather than the global maximum (i.e. based on the random starting point used by the algorithm, it may yield wonky results). For the sake of time, you will only run 2 replicates, but in many publications researchers will choose the best of 10 or 20 replicates. 

Run another replicate (`rep2`) using the approach from Part 3, but make sure that you **don't overwrite any of the results from your first replicate** (e.g., by submitting the job from a new directory and renaming your run log).


### Part 5) Identify the best replicate 

To choose the best `ADMIXTURE` replicate, you’ll need to identify the replicate that is assigned the maximum (i.e. least negative) loglikelihood score. Look in the runlog files that you generated for each replicate to try to determine which replicate has the maximum loglikelihood score. This is the replicate you should use for plotting going forward.
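
Rather than scrolling through each log by hand, you can pull out the final value with a short shell helper. This is a sketch under two assumptions that you should check against your own files: that the final value appears on a line beginning with `Loglikelihood:`, and that your logs live in folders named `rep1`, `rep2`, and so on, with names matching `runlog*.txt`. The helper name `final_loglik` is hypothetical.

```bash
# Sketch: print the final log-likelihood recorded in each replicate's run log.
# Assumes (not verified against your logs) that the final value appears on a
# line starting with "Loglikelihood:"; final_loglik is a hypothetical helper.
final_loglik() {
    awk '/^Loglikelihood:/ {value = $2} END {if (value != "") print value}' "$1"
}

# The glob assumes logs live in rep1/, rep2/, ... and are named runlog*.txt.
for log in rep*/runlog*.txt; do
    [ -f "$log" ] || continue   # skip if the pattern matched nothing
    printf '%s\t%s\n' "$log" "$(final_loglik "$log")"
done
```

Run it from your `practical_5` directory and compare the printed values: the replicate whose value is closest to zero has the maximum log-likelihood.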

*Note - if your results look crazy when you plot them, you may have had the bad luck of running two replicates that converged at a local maximum rather than the global maximum. If you’ve ruled out all other possibilities, you may need to consider running another replicate.*


### Part 6) Plot your results

The results of your `ADMIXTURE` analysis will be stored in a file that ends with the suffix “.Q”. Each column in this file corresponds to one of the K ancestral components produced in your analysis (so in this case, there will be 8 columns). Each row in the file corresponds to a single individual in your analysis. Unfortunately, `ADMIXTURE` does not include the individual or population labels in this output file, so you’ll have to add this information by referring to the input files. The file called `admixture_data_pruned.fam` that you generated in Part 2 contains the individual IDs in the correct order, but to get the corresponding population labels you’ll need to refer to the original `.ind` file that you used in Part 1 or to the AADR anno file that is linked on Canvas.
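
The labelling step can be done in the shell by pasting the `.fam` file next to the `.Q` file (they have the same row order) and then looking up each ID in the `.ind` file. The sketch below assumes that the individual ID is in column 2 of the `.fam` file and that the `.ind` file has the ID in column 1 and the population label in column 3. The helper name `label_q` and the output file name are hypothetical, and the `.ind` path is a placeholder for your own file.

```bash
# Sketch: attach individual IDs and population labels to the rows of a .Q file.
# label_q is a hypothetical helper. Usage: label_q FAM IND QFILE
# Assumes column 2 of the .fam holds the individual ID, and that the .ind file
# has the ID in column 1 and the population label in column 3.
label_q() {
    paste -d' ' "$1" "$3" |
    awk 'NR == FNR { pop[$1] = $3; next }      # first input: the .ind file
         {
             label = ($2 in pop) ? pop[$2] : "UNKNOWN"
             printf "%s %s", $2, label
             for (i = 7; i <= NF; i++) printf " %s", $i   # fields 7 onward are Q values
             printf "\n"
         }' "$2" -
}

# Example call (paths are placeholders; run from inside your chosen replicate folder):
#   label_q ../admixture_data_pruned.fam {YOUR WORKING DATASET IND FILE} \
#       admixture_data_pruned.8.Q > labeled_Q.txt
```

Any row labelled `UNKNOWN` means an ID in the `.fam` file was not found in the `.ind` file, which is worth investigating before you plot.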

There are lots of fancy scripts for plotting `ADMIXTURE` plots, for instance*:
- Here’s a Jupyter Notebook-based plotting script (https://github.com/dportik/Pandas-for-Population-Structure-Barplots) 
- Here’s an R script (https://github.com/TCLamnidis/AdmixturePlotter) 

*Note: I have not tried either of these scripts, so this is not an endorsement of either of them*

However, I believe that the **best way** to plot `ADMIXTURE` plots is using the “stacked barplot” chart in Excel (or a "stacked column chart" in Google Sheets). It’s simple to set up and easy to edit.

You are welcome to choose whatever method you prefer for plotting your results, as long as you can do the following:
- Display your results as a stacked bar plot, where each of the k=8 ancestral components is represented by a different color (and the components sum to 100% for each individual).
- Arrange the individuals by their population label and then into larger geographic groups (i.e., place all of the African populations next to one another), and include population labels in your plot (and ideally, add additional labels for the larger geographic groups).
- Make the bar associated with your mystery individual wider than those associated with the 1000 Genomes populations so that it is easier to see.
- **For an extra challenge:**
    - Within each population group, order the individuals according to the amount of ancestry that they are assigned to one or more of the k=8 ancestral components so that the transition between individuals is as “smooth” as possible (see the example above).
    - Add spaces between populations so that they are clearly distinguishable.
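
If you prefer to do the ordering before you load the data into a spreadsheet, the grouping and within-population sorting can be sketched in the shell. This assumes you already have a space-separated table whose first two columns are the individual ID and population label, followed by the eight ancestry proportions, plus a text file listing the populations in the order you want them plotted (one per line). The helper name `sort_q` is hypothetical.

```bash
# Sketch: order rows by a population list, then by one ancestry component.
# sort_q is a hypothetical helper. Usage: sort_q POP_ORDER_FILE LABELED_Q [COLUMN]
# LABELED_Q is assumed to hold: ID, population label, then the Q values.
# COLUMN is the field to sort on within each population (default 3, the first Q value).
sort_q() {
    local col="${3:-3}"
    awk 'NR == FNR { rank[$1] = FNR; next }          # first input: desired population order
         { r = ($2 in rank) ? rank[$2] : 9999        # unlisted populations go last
           print r, $0 }' "$1" "$2" |
    LC_ALL=C sort -k1,1n -k"$((col + 1))","$((col + 1))"nr |  # +1: the rank was prepended
    cut -d' ' -f2-                                   # drop the helper rank column
}

# Example call (file names are placeholders):
#   sort_q population_order.txt labeled_Q.txt 3 > sorted_Q.txt
```

Sorting on the component that dominates within a population usually gives the smoothest transitions, so you may want to choose a different column for different geographic groups.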


## When you are finished

### Be sure to include the following in your report: 
<b>Methods section</b>: <br>
A description of how you ran your `ADMIXTURE` analysis. <br>
*Remember, rather than describing the exact steps that you took to prepare your input datasets, focus your explanation on which populations were included in your analysis (i.e. how many populations and which dataset they came from). You should describe any steps that fundamentally alter your dataset or affect how it will be analyzed (like LD pruning).*

<b>Results section</b>: <br>
Be sure to include the following in your results section:
- Your ADMIXTURE plot, with an appropriate caption. 
- Specify which replicate you chose and explain why.

<b>Conclusion section</b>: <br>
Be sure to address the following, based on the results of your analyses: 
- Which ancestral component(s) serve as the predominant source of your mystery genome’s ancestry?
- Which 1000 Genomes population does your mystery genome appear most similar to based on the ADMIXTURE results?

### Additional Questions to answer at the end of your report: 
- Why is it necessary to perform LD pruning before running ADMIXTURE?
- Why is it not advisable to only run ADMIXTURE once when performing this type of analysis?
- For each of your ADMIXTURE replicates, how did you set the random seed? What would have happened if you set the random seed to be equal to 100 for both replicates and why would you want to avoid doing this? *Hint: check the ADMIXTURE user manual*
- In this practical, we’ve run “unsupervised” ADMIXTURE. How does this differ from “supervised” ADMIXTURE? What would you need to change about the code used in Parts 3 and 4 to run your analysis in supervised mode instead? *Hint: check the ADMIXTURE user manual*

### References

1. Alexander, David H., John Novembre, and Kenneth Lange. "Fast model-based estimation of ancestry in unrelated individuals." Genome research 19.9 (2009): 1655-1664.
2. Chang, Christopher C., et al. "Second-generation PLINK: rising to the challenge of larger and richer datasets." Gigascience 4.1 (2015): s13742-015.
3. www.cog-genomics.org/plink/1.9/ 
4. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015). https://doi.org/10.1038/nature15393
5. Sedig, Jakob, et al. High levels of consanguinity in a child from Paquimé, Chihuahua, Mexico. Antiquity. 2024;98(400):1023-1039. doi:10.15184/aqy.2024.94
6. Mallick S, Reich D, 2023, "The Allen Ancient DNA Resource (AADR): A curated compendium of ancient human genomes", https://doi.org/10.7910/DVN/FFIDCW, Harvard Dataverse, V8
7. Mallick S, Micco A, Mah M, Ringbauer H, Lazaridis I, Olalde I, Patterson N, Reich D (2024) The Allen Ancient DNA Resource (AADR) a curated compendium of ancient human genomes. Sci Data 11, 182.
8. Patterson, Nick, Alkes L. Price, and David Reich. "Population structure and eigenanalysis." PLoS genetics 2.12 (2006): e190.
