## Practical 2

## Determining Mitochondrial Haplogroups

The purpose of today's practical is to determine your mystery genome's mitochondrial haplogroup. Mitochondrial haplogroups are inherited maternally, so they can reveal information about an individual's ancestral origins on their maternal lineage.

You will prepare a VCF file containing your mystery genome's mitochondrial variants, which you will then download and analyze using the tool Haplogrep3 (https://haplogrep.i-med.ac.at/) [1].

### Getting Started

<b>If you haven't already done so, start an interactive session</b>

- Sign in to https://ood.huit.harvard.edu/ 
- Navigate to `Interactive Apps → Jupyter Lab - HEB 115`
- Launch a Jupyter Lab session with the following parameters:
    - Number of hours: 2
    - Number of CPUs: 1
- When the session is ready, click “Connect to Jupyter”

<b>Create a working directory (called "practical_2") from which you will run commands and in which you will store any files that you generate</b>

```bash
mkdir practical_2
cd practical_2
```

<b>Copy these practical instructions to your working directory and open them as a Jupyter Notebook</b>

```bash
cp ~/153784/practical_instructions/Practical2.ipynb ./
```

Then navigate to the practical_2 directory in the sidebar and click on Practical2.ipynb to open it as a Jupyter Notebook.

### Part 1) Create a VCF file containing your mystery genome's mitochondrial DNA

You will extract your mystery genome's mitochondrial DNA sequences from your whole-genome aligned bam file and realign them to the RSRS mitochondrial reference genome. Then you will call haploid genotypes at each position in the mitochondrial genome and save them as a VCF file that you can download and analyze with haplogrep3.

<b>Part 1a) Create a symbolic link to your bam file </b>

You can't write new files to the shared course directory where your mystery genome's bam file is stored. And the next step in this analysis requires you to create an index for your bam file, which needs to be stored in the same directory as the bam. How can we get around this problem? With a symbolic link! 

A symbolic link looks like a file, but it is really just a pointer to the location where that file is stored. So let's make a symbolic link to your mystery genome. You can either place this symbolic link in your practical_2 directory or place it in your home directory so that it is easier to reference during future practicals. Just remember where you put it.

You can create a symbolic link with this command:

```bash
ln -s {POINTER TO YOUR MYSTERY GENOME} {POINTER TO WHERE YOU WANT YOUR SYMBOLIC LINK TO GO}
```

So for instance, if I want to put the symbolic link to my mystery genome in my current directory, I would use the command:

```bash
ln -s {POINTER TO YOUR MYSTERY GENOME} ./
```
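To see how a symbolic link behaves before pointing one at real data, the sketch below builds a throwaway example in a temporary directory. The directory and file names (`shared`, `work`, `toy.bam`) are made up for illustration and are not part of the course data.

```bash
# Toy demonstration of a symbolic link; all names here are hypothetical.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/shared" "$demo_dir/work"
echo "pretend this is a BAM file" > "$demo_dir/shared/toy.bam"

# Create the link inside "work", pointing at the file in "shared".
ln -s "$demo_dir/shared/toy.bam" "$demo_dir/work/"

# ls -l shows the link and its target; readlink prints only the target path.
ls -l "$demo_dir/work/toy.bam"
link_target=$(readlink "$demo_dir/work/toy.bam")
echo "Link points to: $link_target"

# Reading through the link returns the contents of the original file.
link_contents=$(cat "$demo_dir/work/toy.bam")
echo "$link_contents"
```

Any new file written next to the link, such as an index, lands in the directory that holds the link rather than in the directory that holds the original file.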

<b>Part 1b) Index your mystery bam file </b>


Now you are ready to index your bam file, using `samtools` [2]:

```bash
samtools index {POINTER TO YOUR MYSTERY GENOME - USE THE SYMBOLIC LINK}
```

Note - this will create a new file in the same directory as the symbolic link that you created, which will end with the suffix `.bai`. You won't actually use this file directly going forward, but samtools will be expecting it, and will error if it isn't there.
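A quick way to confirm that the index was written is to test for the `.bai` file next to the link. The sketch below wraps that test in a small helper; the function name `has_bam_index` and the toy file names are hypothetical, and empty files stand in for the real BAM and its index.

```bash
# Hypothetical helper: succeeds if an index named <bam>.bai exists.
has_bam_index() {
    [ -e "$1.bai" ]
}

# Toy stand-ins for a BAM file and its index (empty files, for illustration only).
idx_demo=$(mktemp -d)
touch "$idx_demo/toy.bam"

if has_bam_index "$idx_demo/toy.bam"; then before="indexed"; else before="not indexed"; fi
echo "Before creating the index file: $before"

touch "$idx_demo/toy.bam.bai"   # stands in for the output of samtools index

if has_bam_index "$idx_demo/toy.bam"; then after="indexed"; else after="not indexed"; fi
echo "After creating the index file: $after"
```

On real data, the same helper could be called with the path to your symbolic link.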

<b>Part 1c) Extract mitochondrial DNA reads</b>

Use `samtools` [2] to extract mitochondrial reads and save them in a new bam file. 

```bash
samtools view -b -h -o {MYSTERY GENOME ALIAS}_MT.bam {POINTER TO YOUR MYSTERY GENOME} MT
```
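The final argument, `MT`, is a region name and has to match a sequence name in the BAM header; some references name the mitochondrial contig `chrM` instead. On a real file the header can be printed with `samtools view -H`. The sketch below runs the same kind of check on a made-up header so that it works without any data; the header lines are illustrative only.

```bash
# Made-up SAM header lines (real ones come from: samtools view -H file.bam).
toy_header=$(printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:MT\tLN:16569\n')

# Pull the sequence names (the SN: tag) out of the @SQ lines.
contig_names=$(printf '%s\n' "$toy_header" |
    awk -F'\t' '$1 == "@SQ" { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4) }')
echo "$contig_names"

# Report which common mitochondrial contig name, if any, is present.
mito_name=$(printf '%s\n' "$contig_names" | grep -E -x 'MT|chrM|chrMT|M' | head -n 1)
echo "Mitochondrial contig name: ${mito_name:-none found}"
```

If the name reported for your own BAM file is not `MT`, the region argument in the extraction command would need to change to match.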

<b>Part 1d) Convert your mitochondrial BAM file to fastq format</b>

Use `samtools fastq` [2] to convert your mitochondrial BAM file to fastq format. Because fastq is an unaligned format, this conversion makes it possible to align the mitochondrial reads to a new reference genome in the next step.

```bash
samtools fastq {MYSTERY GENOME ALIAS}_MT.bam > {MYSTERY GENOME ALIAS}_MT.fastq
```
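Each read in a fastq file takes up four lines: a header starting with `@`, the bases, a `+` separator, and the base qualities. That makes it easy to count reads as a sanity check after the conversion. The sketch below does this on two made-up reads; on a real file, the same `awk` command could be pointed at `{MYSTERY GENOME ALIAS}_MT.fastq`.

```bash
# Two made-up reads in fastq format (4 lines per read), for illustration only.
toy_fastq=$(mktemp)
cat > "$toy_fastq" <<'EOF'
@read1
GATCACAGGTCTATCACCCT
+
IIIIIIIIIIIIIIIIIIII
@read2
ATTAACCACTCACGGGAGCT
+
IIIIIIIIIIFFFFFFFFFF
EOF

# Number of reads = number of lines divided by 4.
n_reads=$(awk 'END { print NR / 4 }' "$toy_fastq")
echo "Reads in file: $n_reads"
```

A count of zero on your real file would suggest that the extraction step found no reads for the requested region.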

<b>Part 1e) Realign the mitochondrial DNA reads to the RSRS Mitochondrial Reference Genome</b>

Next, you'll use the tool `bwa` [3] to realign your mystery genome's mitochondrial reads to the RSRS mitochondrial reference genome. You'll save the output as a SAM file (the uncompressed, plain-text version of a BAM file).

```bash
~/153784/tools/bwa/bwa mem  ~/153784/data/reference_genomes/mtdna_rsrs.fa {MYSTERY GENOME ALIAS}_MT.fastq > {MYSTERY GENOME ALIAS}_RSRS.sam
```
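Because SAM is plain tab-separated text, the alignments can be inspected with ordinary command-line tools. In each alignment line, column 2 is a bitwise FLAG, column 3 the reference name, column 4 the leftmost mapping position, and column 5 the mapping quality; FLAG bit 4 marks an unmapped read. The sketch below counts mapped and unmapped reads in a made-up three-read example (the read names, positions, and the reference name `toy_ref` are invented for illustration).

```bash
# Three made-up alignment lines with the 11 mandatory SAM columns.
toy_sam=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    read1 0  toy_ref 150 60 8M  '*' 0 0 GATCACAG IIIIIIII \
    read2 16 toy_ref 73  60 8M  '*' 0 0 ATTAACCA IIIIIIII \
    read3 4  '*'     0   0  '*' '*' 0 0 CCCTATTA IIIIIIII > "$toy_sam"

# Skip header lines (starting with @) and test FLAG bit 4 with integer arithmetic.
n_mapped=$(awk -F'\t' '!/^@/ && int($2 / 4) % 2 == 0 { n++ } END { print n + 0 }' "$toy_sam")
n_unmapped=$(awk -F'\t' '!/^@/ && int($2 / 4) % 2 == 1 { n++ } END { print n + 0 }' "$toy_sam")
echo "Mapped reads: $n_mapped"
echo "Unmapped reads: $n_unmapped"
```

Run against `{MYSTERY GENOME ALIAS}_RSRS.sam`, the same two `awk` commands would give a rough idea of how many reads aligned to the new reference.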

<b> Part 1f) Sort your SAM file </b>

Before you can proceed any further, you'll need to make sure that the reads in the SAM file you just created are in sorted order.

```bash
samtools sort -o {MYSTERY GENOME ALIAS}_RSRS_sorted.sam {MYSTERY GENOME ALIAS}_RSRS.sam
```
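Coordinate sorting means that the alignments appear in order of reference name and then leftmost position. As a rough check, the sketch below tests whether the positions in column 4 of a SAM file never decrease, using made-up alignment lines on a single reference. It illustrates the ordering only and is not a substitute for `samtools sort`; the helper name `check_pos_order` and the toy data are hypothetical.

```bash
# Hypothetical helper: prints "sorted" if POS (column 4) never decreases among
# mapped reads, else "unsorted". Assumes a single reference sequence.
check_pos_order() {
    awk -F'\t' '!/^@/ && int($2 / 4) % 2 == 0 {
        if ($4 + 0 < prev) bad = 1
        prev = $4 + 0
    } END { print (bad ? "unsorted" : "sorted") }' "$1"
}

# Two made-up alignment lines, deliberately out of order.
unsorted_sam=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    read1 0 toy_ref 150 60 8M '*' 0 0 GATCACAG IIIIIIII \
    read2 0 toy_ref 73  60 8M '*' 0 0 ATTAACCA IIIIIIII > "$unsorted_sam"

# Reorder the toy lines by position with the standard sort utility.
sorted_sam=$(mktemp)
sort -t "$(printf '\t')" -k4,4n "$unsorted_sam" > "$sorted_sam"

status_before=$(check_pos_order "$unsorted_sam")
status_after=$(check_pos_order "$sorted_sam")
echo "Before sorting: $status_before"
echo "After sorting: $status_after"
```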


<b> Part 1g) Call haploid genotypes </b>

Now you are ready to use the `mpileup` and `call` functions of `bcftools` [2] to create a VCF file that contains haploid genotype calls for your mystery genome at each mitochondrial position where it differs from the RSRS reference.

```bash
bcftools mpileup -f ~/153784/data/reference_genomes/mtdna_rsrs.fa --min-MQ 30 --min-BQ 20 {MYSTERY GENOME ALIAS}_RSRS_sorted.sam | bcftools call --ploidy 1 -mv -Ov -o {MYSTERY GENOME ALIAS}_RSRS.vcf
```
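A VCF file is also plain text: lines starting with `#` are header lines, and each remaining line describes one site, with the sample's genotype in the last column. With `--ploidy 1`, each genotype is a single allele index (for example `1`) rather than a pair such as `0/1`. The sketch below counts the records and lists them for a made-up, heavily abbreviated VCF; the positions, alleles, reference name, and sample name are invented, and a real bcftools VCF has many more header lines and INFO fields.

```bash
# Made-up, minimal VCF for illustration only.
toy_vcf=$(mktemp)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ttoy_sample\n'
    printf 'toy_ref\t100\t.\tA\tG\t225\t.\tDP=50\tGT\t1\n'
    printf 'toy_ref\t2000\t.\tC\tT\t225\t.\tDP=48\tGT\t1\n'
    printf 'toy_ref\t9000\t.\tT\tC\t225\t.\tDP=52\tGT\t1\n'
} > "$toy_vcf"

# Count the records (every line that is not a header line).
n_variants=$(grep -v -c '^#' "$toy_vcf")
echo "Variant records: $n_variants"

# Print position, reference allele, alternate allele, and genotype for each record.
variant_table=$(awk -F'\t' '!/^#/ { print $2, $4, $5, $10 }' "$toy_vcf")
echo "$variant_table"
```

Running the same two commands on `{MYSTERY GENOME ALIAS}_RSRS.vcf` gives a quick look at what will be uploaded in Part 2.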

When your VCF file is ready, you can download it by right-clicking on the file name in the sidebar and selecting Download.

### Part 2) Use haplogrep3 to determine the mitochondrial haplogroup of your mystery genome

In your web browser, navigate to https://haplogrep.i-med.ac.at/
- Upload the VCF file you created for your mystery genome
- Choose the following options:
    - File Format: Auto-Detect (Default)
    - Choose the phylogenetic tree: PhyloTree 17.1 (Not the default)
    - Distance function: Kulczynski (Default)
- Click Upload and Classify

Now it's time to explore the results produced for your mystery genome.

### Part 3) Learn more about your mystery genome's assigned haplogroup

Now it's time to do some research to learn more about your mystery genome's assigned haplogroup. See if you can find any academic papers or maps that describe the global distribution of this haplogroup. 

<i><b>Tip:</b> If your haplogroup is very specific, you may need to search for one of the broader haplogroups to which it belongs. For example, if your individual is assigned to haplogroup J1b1a1a and you can't find anything about that specific haplogroup, try searching for haplogroup J1b1a1, then J1b1a, and so on.</i>
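The list of broader names to search can be generated by trimming one character at a time from the end of the haplogroup name. This is a minimal sketch that assumes the simple alternating letter-and-digit naming seen in the example above; names containing multi-digit numbers or special characters would need extra handling.

```bash
haplogroup="J1b1a1a"   # example name from the tip above

# Repeatedly drop the last character to get each broader haplogroup.
broader_list=""
name="$haplogroup"
while [ "${#name}" -gt 1 ]; do
    name="${name%?}"
    broader_list="$broader_list $name"
done
broader_list="${broader_list# }"   # remove the leading space
echo "Broader haplogroups to try: $broader_list"
```

For `J1b1a1a` this lists `J1b1a1`, `J1b1a`, `J1b1`, `J1b`, `J1`, and `J`, from most to least specific.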

## When you are finished

### Be sure to include the following in your report: 
<b>Methods section</b>: <br>
A description of each of the analyses that you performed. 

<b>Results section</b>: <br>
Be sure to include the following in your results section:
- Your mystery genome's assigned haplogroup
- The broader cluster to which the assigned haplogroup belongs (according to haplogrep).
- The quality score associated with the assigned haplogroup. Were there any other haplogroups (i.e., Additional Hits) that received an equal quality score? If so, be sure to indicate what they were.
- A table detailing the Expected and Included Mutations, Expected But Not Included Mutations and any Remaining Mutations 

<b>Conclusion section</b>: <br>
Be sure to address the following, based on the results of your analyses: 
- How confident you are in your mystery genome's assigned mitochondrial haplogroup. 
- The population(s) where the cluster to which your mystery genome’s haplogroup belongs occurs at the highest frequency, along with the total number of haplogroups assigned to that cluster.
- Anything further that you have learned about your mystery genome’s assigned haplogroup.
  - Describe what the outside literature says about your assigned haplogroup.
  - Include any maps that you might have found showing the distribution of this haplogroup (or the broader cluster to which it belongs). 


### Additional Questions to answer at the end of your report: 
1) Why did we need to realign your mystery genome to the RSRS mitochondrial genome? 
2) What does each of the bcftools commands that you used to prepare your VCF file in Part 1g do?
3) What does Haplogrep3 mean by “expected” and “remaining” mutations?
4) Briefly explain how the quality score is calculated by the Kulczynski distance function.



## References

1) Schönherr, Sebastian, et al. "Haplogrep 3-an interactive haplogroup classification and analysis platform." Nucleic Acids Research (2023): gkad284. https://doi.org/10.1093/nar/gkad284
2) Danecek, Petr, et al. "Twelve years of SAMtools and BCFtools." Gigascience 10.2 (2021): giab008. https://doi.org/10.1093/gigascience/giab008
3) Li, Heng, and Richard Durbin. "Fast and accurate short read alignment with Burrows-Wheeler transform." Bioinformatics 25.14 (2009): 1754-1760. https://doi.org/10.1093/bioinformatics/btp324 [PMID: 19451168]