# Practical 0

## Preparing your working dataset

The purpose of today's practical is to create a (compressed) eigenstrat format dataset that contains genotype information for your mystery genome AND an additional 7,857 present-day individuals at 584,131 SNP positions that are included on the Human Origins genotyping array.

In order to do this, we will first create pseudo-haploid genotype calls for your mystery genome at the SNPs of interest and save the results in eigenstrat format. Then we will merge this newly created eigenstrat dataset with data from the 7,857 present-day individuals whose genomes are included in the Allen Ancient DNA Resource (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FFIDCW).

### Getting Started

<b>If you haven't already done so, start an interactive session</b>

- Sign in to https://ood.huit.harvard.edu/ 
- Navigate to `Interactive Apps → Jupyter Lab - HEB 115`
- Launch a Jupyter Lab session with the following parameters:
    - Number of hours: 2
    - Number of CPUs: 1
- When the session is ready, click “Connect to Jupyter”

<b>From within your home directory, create a working directory (called "practical_0") in which you will run commands and store any files that you generate</b>

```bash
mkdir practical_0
cd practical_0
```

<b>Copy these practical instructions to your working directory and open them as a Jupyter Notebook</b>

```bash
cp ~/153784/practical_instructions/Practical0.ipynb ./
```

Then navigate to the `practical_0` directory in the sidebar and click on `Practical0.ipynb` to open it as a Jupyter Notebook.

### Part 1) Convert your mystery genome to eigenstrat format

<b>Create a position list file</b>

In order to convert your data to eigenstrat format, we first need to create a position list file that contains information about all of the SNPs that we want to include in our eigenstrat dataset. The file `~/153784/data/AADR_dataset/v62.0_HO_HEB115-subset.snp` contains information about these SNPs, but it isn't in the right format for `samtools mpileup`, one of the tools that we will be using to create genotype calls.

We can convert it into the format that we need using the following `awk` command:


```bash
awk '{print $2 "\t" $4}' ~/153784/data/AADR_dataset/v62.0_HO_HEB115-subset.snp > HO_position_list.txt
```

If you'd like to know what that `awk` command did, use the following command to take a look at the first few rows of the original SNP file:

```bash
head ~/153784/data/AADR_dataset/v62.0_HO_HEB115-subset.snp
```

And compare that to the new positions list file that you just made:

```bash
head HO_position_list.txt
```
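
To see the column selection in isolation, here is a minimal sketch that runs the same `awk` command on a tiny, made-up SNP file. The file names `toy.snp` and `toy_position_list.txt` and the three SNP records are invented for illustration. The column layout (SNP ID, chromosome, genetic position, physical position, reference allele, alternate allele) is the standard eigenstrat `.snp` layout.

```bash
# Build a tiny, made-up .snp file with the six standard eigenstrat columns:
# SNP ID, chromosome, genetic position, physical position, ref allele, alt allele
printf '%s\n' \
  'rs0001 1 0.020 752566 G A' \
  'rs0002 1 0.022 776546 A G' \
  'rs0003 2 0.001 12345 C T' > toy.snp

# Keep only column 2 (chromosome) and column 4 (physical position), tab-separated
awk '{print $2 "\t" $4}' toy.snp > toy_position_list.txt

cat toy_position_list.txt
```

Each output line holds a chromosome and a position separated by a tab, which is the layout that the `-l` option of `samtools mpileup` reads.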

<b>Submit a job to create your eigenstrat dataset</b>

Now we can use a single, multi-step command to create an eigenstrat format dataset from your mystery genome's bam file. 

In the first part of the command, we will use the `mpileup` function of `samtools` to convert your bam file into pileup format. Pileup format is a text-based format for summarizing the base calls of aligned reads to a reference sequence. We'll learn more about pileup format in future practicals. You can learn more about `samtools mpileup` here: https://www.htslib.org/doc/samtools-mpileup.html 

Then we will use the tool `pileupCaller` to convert that pileup format dataset into eigenstrat format by randomly selecting a single read to represent the (homozygous) genotype call at each position in your position list. Learn more about pileupCaller here: https://github.com/stschiff/sequenceTools

Replace the placeholder text in the following code and enter the command into a terminal window to run it (see if you can spot the breakpoint between the two steps in the command):

```bash
samtools mpileup -R -B -q30 -Q30 \
-l HO_position_list.txt \
-f ~/153784/data/reference_genomes/human_g1k_v37.fasta \
~/153784/data/mystery_genomes/{YOUR MYSTERY GENOME ALIAS}.bam | \
~/153784/tools/pileupCaller \
--randomHaploid \
--sampleNames {YOUR MYSTERY GENOME ALIAS} \
--samplePopName {YOUR MYSTERY GENOME ALIAS} \
-f ~/153784/data/AADR_dataset/v62.0_HO_HEB115-subset.snp \
-e mystery_genome_eigenstrat
```

<b> How to know that your job is running </b> 

When you submit your command, `samtools mpileup` will immediately return the following output:

`[mpileup] 1 samples in 1 input files`

Then, while the rest of your job is running, the prompt (the text at the left of each line in the terminal window) will disappear and the black cursor box will appear at the far left of the line. Once the command finishes, the prompt will reappear, signaling that the terminal is ready for the next input. 

<b>When your job is done</b>

Once your job is finished, pileupCaller will print out some useful summary statistics about your mystery genome. Make a note of these statistics because they <i>might</i> come in handy later.

You should also see that you've created three new files in your working directory:

- `mystery_genome_eigenstrat.geno`
- `mystery_genome_eigenstrat.snp`
- `mystery_genome_eigenstrat.ind`

These are the three files that make up the eigenstrat format. You can take a look at them, but remember that there will be nearly 600,000 rows in the .geno and .snp files, so consider using the `head` or `more` commands instead of `cat`.
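
If you would like a quick summary rather than scrolling through the file, the sketch below tallies the genotype codes in a single-individual `.geno` file. It uses a made-up five-SNP file (`toy.geno`, a hypothetical name) so that it runs anywhere. It assumes the plain-text eigenstrat layout, in which each line is one SNP and each character is one individual: `2` and `0` are the two homozygous calls and `9` means missing.

```bash
# Toy single-individual .geno file: one line per SNP, one character per individual
printf '%s\n' 2 0 9 2 9 > toy.geno

# Count how many SNPs carry each genotype code (2 and 0 = homozygous calls, 9 = missing)
awk '{count[$1]++} END {for (g in count) print g, count[g]}' toy.geno \
  | sort > toy_geno_counts.txt

cat toy_geno_counts.txt
```

To try this on your own data, swap `toy.geno` for `mystery_genome_eigenstrat.geno`. The number of `9` entries shows how many SNP positions had no usable read.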

### Part 2) Merge your mystery genome with the Human Origins Dataset

Now that your mystery genome is in eigenstrat format we can use the tool `mergeit` from the `EIGENSOFT` package to merge it with the Human Origins dataset. Learn more about `mergeit` here: https://github.com/DReichLab/EIG/tree/master/CONVERTF 

`mergeit` and the other tools in the `EIGENSOFT` package require a parameter file to run (often called a 'par file' for short). The par file is where you specify the information that is required to run `mergeit`, along with any optional parameters you might want to use.

<b>Create a `mergeit` par file</b>

Using the text editor of your choosing (such as the one called "Text File" that you can open from the launcher window in Jupyter Lab) create a file called `mergeit.par` that contains the following information:

```bash
geno1: /shared/home/{YOUR USER ID}/153784/data/AADR_dataset/v62.0_HO_HEB115-subset.geno
snp1:  /shared/home/{YOUR USER ID}/153784/data/AADR_dataset/v62.0_HO_HEB115-subset.snp
ind1:  /shared/home/{YOUR USER ID}/153784/data/AADR_dataset/v62.0_HO_HEB115-subset.ind
geno2: mystery_genome_eigenstrat.geno
snp2:  mystery_genome_eigenstrat.snp
ind2:  mystery_genome_eigenstrat.ind
genooutfilename: HO_HEB115_working_dataset.geno
snpoutfilename: HO_HEB115_working_dataset.snp
indoutfilename: HO_HEB115_working_dataset.ind
```

<i> Note - mergeit cannot interpret the ~ symbol that typically directs programs to your home directory, so you will need to provide the full path to the AADR dataset by replacing the placeholder text above with your user ID. </i>

To find your user ID, enter the following command in the terminal: `whoami`
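
As an alternative to typing the paths by hand, the following sketch writes the same par file from the terminal. It rests on one assumption that you should verify: that `$HOME` expands to your full home path (for example `/shared/home/` followed by your user ID). The variable name `aadr_prefix` is made up for this example. Because the shell expands `${HOME}` while the file is being written, `mergeit` only ever sees absolute paths.

```bash
# Assumption: $HOME is your full home path, e.g. /shared/home/<your user ID>
aadr_prefix="${HOME}/153784/data/AADR_dataset/v62.0_HO_HEB115-subset"

# The unquoted EOF lets the shell substitute ${aadr_prefix} as the file is written
cat > mergeit.par <<EOF
geno1: ${aadr_prefix}.geno
snp1:  ${aadr_prefix}.snp
ind1:  ${aadr_prefix}.ind
geno2: mystery_genome_eigenstrat.geno
snp2:  mystery_genome_eigenstrat.snp
ind2:  mystery_genome_eigenstrat.ind
genooutfilename: HO_HEB115_working_dataset.geno
snpoutfilename: HO_HEB115_working_dataset.snp
indoutfilename: HO_HEB115_working_dataset.ind
EOF

# Check the result: the first three lines should start with a full path, not ~
cat mergeit.par
```

Note that this overwrites any existing `mergeit.par` in the current directory.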

<b>Submit your mergeit job</b>

If we wanted to run mergeit within our interactive session, we would just need to run the following command:

`mergeit -p mergeit.par > mergeit.out` 

But since mergeit can take a while to run, we will instead submit it as a stand-alone job to the compute server using the `sbatch` command with the `--wrap` parameter, which lets us pass the command that we want to run directly on the command line.

So to run your job, enter the following command into your terminal window:

```bash
sbatch --wrap="mergeit -p mergeit.par > mergeit.out"
```

<i>Remember, any time you want to submit a stand-alone job to the compute server, you can use the sbatch command. Just replace the code between the quotation marks with whatever code you want to run!</i>
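
For comparison, the same job can also be described in a small batch script instead of a `--wrap` string. The sketch below only writes the script and checks its syntax, so it is safe to run anywhere. The file name `run_mergeit.sh` and the job name are made up for this example. It also assumes, as the `--wrap` version does, that `mergeit` is available on your path inside the job.

```bash
# Write a minimal batch script; the quoted 'EOF' keeps the text exactly as typed
cat > run_mergeit.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=mergeit
#SBATCH --output=slurm-%j.out
mergeit -p mergeit.par > mergeit.out
EOF

# Check that the script parses as valid bash without running it
bash -n run_mergeit.sh && echo "run_mergeit.sh parses cleanly"

# To submit it on the cluster, you would run:
#   sbatch run_mergeit.sh
```

The `#SBATCH` lines are read by `sbatch` as options. Here `%j` stands for the job ID, so the log file gets the same `slurm-{JOB ID}.out` name that `sbatch` uses by default.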

<b> Monitor your job </b>

You can check to see if your job is still running in a few ways:

- In your terminal window, run the command `squeue -u {YOUR USER ID}`. This will return a list of the jobs that you are currently running
- Navigate to the active jobs section of the Open OnDemand browser (https://ood.huit.harvard.edu/pun/sys/dashboard/activejobs) and confirm that it is listed as an active job.

<i>Note - the Jupyter Lab interactive session you created at the start of class is also considered an active job, so if your job is still running, you should see a total of two active jobs</i>
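
If you would rather not type your user ID, a small sketch like the following fills it in from `whoami`. The variable name `user_id` is made up for this example, and the `command -v` guard is only there so that the snippet prints a message instead of failing on a machine without Slurm.

```bash
# Look up your user ID once and reuse it
user_id="$(whoami)"

# Only call squeue if it is installed (i.e. you are on the cluster)
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$user_id"
else
    echo "squeue not found; run this on the cluster"
fi
```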

While your job is running, mergeit will write any logging information to the file that you specified, called `mergeit.out`. You can use the `more` or `cat` commands to see any output that has been added to that file.

Additionally, when you submit a job using sbatch, a logging file called `slurm-{JOB ID}.out` will also be created, where the job ID is the ID that was assigned to your job by Slurm, the system that schedules and runs each job. If something goes wrong with your job, the error message might be logged in this file.

<b>Take a look at your merged eigenstrat format dataset!</b>

Once your job is finished, you should see that you've created three new files in your working directory:

- `HO_HEB115_working_dataset.geno`
- `HO_HEB115_working_dataset.snp`
- `HO_HEB115_working_dataset.ind`

Like before, these are the three files that make up the eigenstrat format. You can take a look at them, but remember that there will be nearly 600,000 rows in the .geno and .snp files, so consider using the `head` or `more` commands instead of `cat`.

Your mystery genome should be listed in the very last position in the .ind file, so you can also use the `tail` command to view the last few lines of this file.
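
The check described above can be sketched on a made-up `.ind` file as follows. The file name `toy.ind`, the sample names, and the variable names are all invented for illustration. The three columns (sample ID, sex, population label) are the standard eigenstrat `.ind` layout.

```bash
# Toy .ind file: sample ID, sex (M/F/U), population label
printf '%s\n' \
  'SampleA M PopulationA' \
  'SampleB F PopulationB' \
  'MysteryX U MysteryX' > toy.ind

# Count the individuals and pull the sample ID from the last line
n_individuals=$(wc -l < toy.ind | tr -d ' ')
last_sample=$(tail -n 1 toy.ind | awk '{print $1}')

echo "$n_individuals individuals; last one is $last_sample"
```

On the real merged file, the same two commands applied to `HO_HEB115_working_dataset.ind` should report 7,858 individuals (7,857 plus your mystery genome) if the merge kept everyone, with your alias as the last sample.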

If you look at the .geno file, you'll notice that it is no longer in a human-readable format. That's because, by default, mergeit writes its output in `PACKEDANCESTRYMAP` format, which is a compressed format.

### When you are finished

Congrats! You just created a working dataset that we will use in future practicals to compare your mystery genome to other present-day individuals with known ancestries. 

You don't have to write a practical report this week, but take some time to familiarize yourself with the code that you ran and what it did, since you will need to describe this process in your final project report. 