## Practical 4

## Assessing genetic ancestry I - Principal Components Analysis

In this practical, you will project your mystery genome onto a PCA plot composed of present-day populations from around the world using the tool `smartpca` [1]. 

Before you begin this week’s practical, take a moment to familiarize yourself with the `smartpca` documentation: https://github.com/DReichLab/EIG/tree/master/POPGEN

### Getting Started

<b>If you haven't already done so, start an interactive session</b>

- Sign in to https://ood.huit.harvard.edu/ 
- Navigate to `Interactive Apps → Jupyter Lab - HEB 115`
- Launch a Jupyter Lab session with the following parameters:
    - Number of hours: 2
    - Number of CPUs: 4
- When the session is ready, click “Connect to Jupyter”

<b>Create a working directory (called "practical_4") from which you will run commands and in which you will store any files that you generate</b>

```bash
mkdir practical_4
cd practical_4
```

<b>Copy these practical instructions to your working directory and open them as a Jupyter Notebook</b>

```bash
cp ~/153784/practical_instructions/Practical4.ipynb ./
```

Then navigate to the `practical_4` directory on the sidebar and click on `Practical4.ipynb` to open it as a Jupyter Notebook.

### Part 1) Prepare your analysis dataset

Today you will be working with the merged dataset that you prepared during practical 0. It contains your mystery genome merged with data from 2,438 present-day individuals at 584,131 SNP positions that are included on the Human Origins genotyping array, accessed via the Allen Ancient DNA Resource (AADR) [2,3].

This dataset should be saved in your `~/practical_0` directory and each file should have the prefix `HO_HEB115_working_dataset`, followed by the suffixes `.geno`, `.ind` and `.snp`.

To double check that each of the files exists, you can run the following commands:

```bash
ls ~/practical_0/HO_HEB115_working_dataset.geno
ls ~/practical_0/HO_HEB115_working_dataset.ind
ls ~/practical_0/HO_HEB115_working_dataset.snp
```

To make the global PCA plot, you will project your mystery genome onto a PCA plot composed of present-day individuals from a predetermined set of 169 different populations that have been genotyped on the Human Origins array. You can find a list of these populations here:

`~/153784/data/reference_data/poplist_world_HO.txt`

By default, `smartpca` includes every individual in the input dataset in the PCA plot that it creates EXCEPT those whose population label is "Ignore". Since your working dataset contains more individuals than you need for your PCA plot, your first step is to create a new `.ind` file with updated population labels for all of the individuals that you don’t want to include in your analysis.

In other words, your first task is to change the population label to "Ignore" for every population that you don't want to include in your PCA analysis.

You can write your own script to do this, or use a simple one that I wrote (`~/153784/tools/change_to_ignore.py`). To learn how it works, use the command: 


```bash
python ~/153784/tools/change_to_ignore.py --help
```
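If you would rather write your own version, the relabelling step can be sketched with `awk`. This is only an illustration and not the course script: `relabel_ind` is a hypothetical helper, and it assumes the usual three-column EIGENSTRAT `.ind` layout (individual ID, sex, population label) and a population list with one label per line.

```bash
# Hypothetical helper: relabel_ind POPLIST IN_IND OUT_IND MYSTERY_ID
# Keeps the label of every population named in POPLIST and of the
# individual MYSTERY_ID; every other label becomes "Ignore".
relabel_ind() {
    local poplist="$1" in_ind="$2" out_ind="$3" mystery_id="$4"
    awk -v keep_id="$mystery_id" '
        # First file (the population list): remember each label to keep.
        NR == FNR { if (NF) keep[$1] = 1; next }
        # Second file (the .ind file): columns are ID, sex, population.
        {
            if (!($3 in keep) && $1 != keep_id) $3 = "Ignore"
            print $1, $2, $3
        }
    ' "$poplist" "$in_ind" > "$out_ind"
}

# Example call (the output name and the ID are placeholders):
# relabel_ind ~/153784/data/reference_data/poplist_world_HO.txt \
#     ~/practical_0/HO_HEB115_working_dataset.ind \
#     global_pca.ind {YOUR MYSTERY GENOME ID}
```

The helper writes a new file rather than editing the original `.ind` in place, so the practical 0 dataset stays untouched.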

<b><u>Remember!</u> In addition to keeping the labels of the present-day populations listed in poplist_world_HO.txt, your new ind file also needs to keep your mystery genome. Make sure you don't change the population label for your mystery genome to Ignore during this process.</b>

When you are done, take a moment to examine your newly created .ind file to make sure that the labels for populations that you don’t want to include in your analysis have been changed to “Ignore” while the ones you do want to retain (like that of your mystery genome) remain intact.
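One way to do this check from the command line is to tally the population labels and look up the label of a single individual. The sketch below assumes the three-column `.ind` layout (individual ID, sex, population label); `count_labels` and `label_of` are hypothetical helper names, not part of the course tools.

```bash
# Hypothetical helper: count_labels IND_FILE
# Prints "<count> <label>" for each population label, largest count first.
count_labels() {
    awk '{ n[$3]++ } END { for (p in n) print n[p], p }' "$1" | sort -k1,1nr -k2,2
}

# Hypothetical helper: label_of IND_FILE INDIVIDUAL_ID
# Prints the population label currently assigned to one individual.
label_of() {
    awk -v id="$2" '$1 == id { print $3 }' "$1"
}

# Example calls (file name and ID are placeholders):
# count_labels global_pca.ind | head
# label_of global_pca.ind {YOUR MYSTERY GENOME ID}
```

"Ignore" should appear in the tally, and `label_of` should return something other than "Ignore" for your mystery genome.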

**Note** - This part of the practical doesn’t involve any actual data analysis, so you don’t need to describe this script in your report. Only provide a general description of what populations were included in your smartpca analysis and where that dataset comes from (i.e. Human Origins populations from the AADR).  


### Part 2) Prepare your parameter file

In order to run smartpca, you’ll need to create a parameter file that contains information about the analysis that you want to run. Make a parameter file called `global_pca.par` that includes the following information (make sure to replace any placeholder text):


```bash 
genotypename: /shared/home/{YOUR USER ID}/practical_0/HO_HEB115_working_dataset.geno
snpname: /shared/home/{YOUR USER ID}/practical_0/HO_HEB115_working_dataset.snp
indivname: {POINTER TO THE NEW IND FILE YOU MADE IN PART 1}
evecoutname: global_pca.evec
evaloutname: global_pca.eval
lsqproject:  YES 
poplistname: /shared/home/{YOUR USER ID}/153784/data/reference_data/poplist_world_HO.txt
numoutevec: 2
numoutlieriter: 0
hashcheck: NO
chrom: 1
```
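Before running anything, it can save time to confirm that every input path in the parameter file points to a real file. The sketch below is a hypothetical helper (`check_par_paths` is not part of EIGENSOFT) and assumes each entry is written as `key: path` on a single line, as in the template above. A placeholder that was never replaced, such as `{YOUR USER ID}`, will be reported as not found.

```bash
# Hypothetical helper: check_par_paths PAR_FILE
# Reports whether each input file named in the par file exists.
# Returns 0 if all four inputs were found and 1 otherwise.
check_par_paths() {
    local par="$1" missing=0 key path
    for key in genotypename snpname indivname poplistname; do
        # Take the second whitespace-separated field of the "key:" line.
        path=$(awk -v k="$key:" '$1 == k { print $2 }' "$par")
        if [ -z "$path" ]; then
            echo "MISSING KEY  $key"
            missing=1
        elif [ ! -f "$path" ]; then
            echo "NOT FOUND    $key: $path"
            missing=1
        else
            echo "OK           $key: $path"
        fi
    done
    return "$missing"
}

# Example call, run from the directory that contains the par file:
# check_par_paths global_pca.par
```

Relative paths are checked against the current directory, so run the helper from the same directory in which you will run `smartpca`.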

### Part 3) Run smartpca 

Now you are ready to run `smartpca` using the following command:

```bash
~/153784/EIG/bin/smartpca -p global_pca.par > global_pca.out
```

<i><b>Hint</b> - This analysis will take some time to run, so submit it as a job using the sbatch command like you have done previously.</i>
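As a reminder of what such a job can look like, here is a minimal sketch that writes a batch script to disk without submitting it. The script name `run_global_pca.sh` and all resource values (time, memory, CPUs) are assumptions for illustration; reuse the settings, and any partition option, that worked for your earlier jobs.

```bash
# Write a hypothetical batch script; nothing is submitted at this point.
cat > run_global_pca.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=global_pca
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --output=global_pca_%j.log

# The job starts in the directory it was submitted from, so the relative
# names in global_pca.par (global_pca.evec, global_pca.eval) resolve there.
~/153784/EIG/bin/smartpca -p global_pca.par > global_pca.out
EOF

# Submit from inside practical_4 once the script looks right:
# sbatch run_global_pca.sh
```

The heredoc delimiter is quoted (`'EOF'`) so that nothing inside the script is expanded while the file is being written.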

When `smartpca` is finished, it will produce several output files, including the `.evec` file, which contains the information that you need to create your PCA plot. The first line of the `.evec` file is a header that lists the eigenvalues. Each following row describes one individual: the first column is the individual ID, the next two columns are that individual's coordinates on the first two principal components (PC1 and PC2), and the last column is the population label. (By default `smartpca` outputs the first ten principal components, but to save time and memory you specified, using the `numoutevec` parameter in your par file, that only two should be output.)


<i><b>Hint</b>: The default formatting of an evec file is a little messy. You might find the following code helpful if you’d like to remove all the extra white space before you start plotting:</i>

```bash
tr -s " " < {EVEC FILE NAME} | sed 's/^ *//g' > {OUTPUT FILE NAME}
```
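If you plan to plot in R, Excel or Google Sheets, a tab-separated table with a header row may be easier to import. The sketch below assumes an `.evec` file produced with `numoutevec: 2`, i.e. a `#eigvals` header line followed by rows of individual ID, PC1, PC2 and population label; `evec_to_tsv` is a hypothetical helper name.

```bash
# Hypothetical helper: evec_to_tsv EVEC_FILE
# Prints a tab-separated table with columns id, PC1, PC2, population.
evec_to_tsv() {
    awk '
        BEGIN { OFS = "\t"; print "id", "PC1", "PC2", "population" }
        /^[[:space:]]*#/ { next }               # skip the #eigvals header line
        NF >= 4 { print $1, $2, $3, $NF }       # ID, PC1, PC2, last column = label
    ' "$1"
}

# Example call (the output name is a placeholder):
# evec_to_tsv global_pca.evec > global_pca.tsv
```

Because `awk` splits on any run of whitespace, the leading spaces and uneven padding in the raw file need no separate clean-up here.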

### Part 4) Plot your results

Using the program of your choosing (e.g. Jupyter Notebook within the Jupyter Hub, or R, or even Excel or Google Sheets), create a scatterplot that shows the results of your PCA. 

<b>Requirements for your PCA plot:</b>
- Make sure that your mystery genome’s position is prominently displayed; it should not be buried beneath the markers for the present-day individuals in the scatter plot.
- Use text annotations, marker colors and/or marker shapes to label the present-day populations in your dataset according to geography.

You can learn more about the Human Origins populations that you included in your analysis by looking at the information in the AADR anno file (version 62.0), which you can view on Google Drive at this link: https://docs.google.com/spreadsheets/d/1U95pDtpNCFxYWNuutPJIAl1QM3frMnV8a0RRg7ah6DU/edit?usp=sharing

## When you are finished

### Be sure to include the following in your report: 
<b>Methods section</b>: <br>
A description of how you ran your smartpca analysis, highlighting any non-default parameters that you specified in your analysis. <br>
Remember, rather than describing the exact steps that you took to prepare your input datasets, focus your explanation on which populations were included in your analysis (i.e. how many populations there were and which dataset they came from).

<b>Results section</b>: <br>
Be sure to include the following in your results section:
- Your PCA plot, with an appropriate caption.
- Describe where your mystery genome clusters relative to the other Human Origins populations included in your PCA plot. Focus on continental-level ancestry groupings, rather than specific populations.

<b>Conclusion section</b>: <br>
Be sure to address the following, based on the results of your analyses: 
- What does your mystery genome's position in the PCA plot suggest about their ancestry?

### Additional Questions to answer at the end of your report: 
1) Why might you want to project your mystery genome onto the PCA plot rather than including it in the analysis? 
2) How did you specify that your mystery genome should be projected in the PCA plot that you created? Be sure to highlight any parameters that you specified and what information you provided to those parameters. 
3) What does the numoutevec parameter do in the par file that you created? How would the evec file that smartpca output have differed if you did not include this parameter? 

### References

1. Patterson N, Price AL, and Reich D. "Population structure and eigenanalysis." PLoS genetics 2.12 (2006): e190.
2. Mallick S and Reich D. "The Allen Ancient DNA Resource (AADR): A curated compendium of ancient human genomes." Harvard Dataverse, V9 data release (September 16, 2024). https://doi.org/10.7910/DVN/FFIDCW
3. Mallick S, Micco A, Mah M, Ringbauer H, Lazaridis I, Olalde I, Patterson N, and Reich D. "The Allen Ancient DNA Resource (AADR) a curated compendium of ancient human genomes." Sci Data 11 (2024): 182.
