# Practical 7

## Predicting phenotypes 

This week, you will use the program `samtools` [1] to predict the genotype of your mystery genome at 47 SNPs of phenotypic interest, enabling you to make predictions about the individual's likely phenotypes. Many of the phenotypes that you will consider in today’s exercise are monogenic traits (i.e. controlled by a single locus) or are controlled by a small number of loci.


### Getting Started

<b>If you haven't already done so, start an interactive session</b>

- Sign in to https://ood.huit.harvard.edu/ 
- Navigate to `Interactive Apps → Jupyter Lab - HEB 115`
- Launch a Jupyter Lab session with the following parameters:
    - Number of hours: 2
    - Number of CPUs: 4
- When the session is ready, click “Connect to Jupyter”

<b>Create a working directory (called "practical_7") from which you will run commands and in which you will store any files that you generate</b>

```bash
mkdir practical_7
cd practical_7
```

<b>Copy these practical instructions to your working directory and open them as a Jupyter Notebook</b>

```bash
cp ~/153784/practical_instructions/Practical7.ipynb ./
```

Then navigate to the `practical_7` directory on the sidebar and click on `Practical7.ipynb` to open it as a Jupyter Notebook.

### Part 1) Prepare a positions file 

In today’s analysis, you will consider a set of 47 SNPs of phenotypic interest. These are a subset of the SNPs of phenotypic interest that were analyzed in a 2023 study of Ötzi the Iceman by Wang et al [2]. There is a copy of this SNP list in:

`~/153784/data/reference_data/Wang_2023_snplist.txt`

You will need to provide a position list file to samtools to indicate which SNPs should be included in your analysis. The expected format of the position list file is described in the samtools mpileup documentation as part of the description of the `-l` parameter (also called the `--positions` parameter).

Prepare the positions file using either of the methods described and save it in your `practical_7` working directory. 
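
As one possible sketch, a two-column position list (chromosome and 1-based position, tab-separated) could be built with `awk`. This assumes the SNP list is tab-delimited with a header row and that chromosome and position sit in columns 3 and 4. That layout is an assumption, so inspect the real file with `head` before adapting the column numbers. The toy file, its made-up identifiers, and the file names below are hypothetical stand-ins for `Wang_2023_snplist.txt`.

```bash
# Toy stand-in for the SNP list. The layout (ID, gene, chromosome, position) and
# the rsTOY identifiers are made up for illustration.
workdir=$(mktemp -d)
printf 'SNP\tGene\tChr\tPos\n'     >  "$workdir/toy_snplist.txt"
printf 'rsTOY1\tGENE1\t2\t1000\n'  >> "$workdir/toy_snplist.txt"
printf 'rsTOY2\tGENE2\t15\t2500\n' >> "$workdir/toy_snplist.txt"

# Skip the header row (NR > 1) and keep chromosome and position, tab-separated.
awk -F'\t' 'BEGIN { OFS = "\t" } NR > 1 { print $3, $4 }' \
    "$workdir/toy_snplist.txt" > "$workdir/toy_positions.txt"

cat "$workdir/toy_positions.txt"
```

Whatever method you use, the chromosome names in the positions file need to match the names used in your BAM file and reference genome.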

### Part 2) Count the number of reference and alternative alleles at each SNP position using samtools mpileup

Next, use the `mpileup` function from `samtools` [1] to count the number of reference and alternative alleles that align to each position of interest, with the following command:

```bash
samtools mpileup {POINTER TO YOUR MYSTERY BAM} -f ~/153784/data/reference_genomes/human_g1k_v37.fasta  -l {POINTER TO YOUR POSITIONS LIST FILE} -Q 30 -q 30 -B > samtools_mpileup_output.txt
```

This will create a file called `samtools_mpileup_output.txt` that contains information about the reads that align to each position of interest specified in your positions list. 

Be sure to review the samtools mpileup documentation (https://www.htslib.org/doc/samtools-mpileup.html) to interpret the mpileup file you created. 
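
To illustrate how the read-bases column can be interpreted, here is a minimal sketch that tallies reference and non-reference bases per line. It assumes the default single-sample layout (chromosome, position, reference base, depth, read bases, base qualities), in which `.` and `,` mark matches to the reference, letters mark mismatches, `^` plus one character marks a read start, `$` marks a read end, and `+N`/`-N` introduce indels. The toy input lines and variable names are made up for illustration.

```bash
# Toy mpileup lines (made up): chromosome, position, reference base, depth, read bases, qualities.
workdir=$(mktemp -d)
printf '1\t100\tA\t5\t%s\tIIIII\n' '.,^]G.-2AAg$' >  "$workdir/toy_mpileup.txt"
printf '1\t200\tC\t3\t%s\tIII\n'   ',,.'          >> "$workdir/toy_mpileup.txt"

counts=$(awk -F'\t' 'BEGIN { OFS = "\t" } {
    bases = $5
    gsub(/\^./, "", bases)   # drop read-start markers and their mapping-quality character
    gsub(/\$/, "", bases)    # drop read-end markers
    # Drop indel annotations: a sign, a length N, then N inserted or deleted bases.
    while (match(bases, /[+-][0-9]+/)) {
        n = substr(bases, RSTART + 1, RLENGTH - 1) + 0
        bases = substr(bases, 1, RSTART - 1) substr(bases, RSTART + RLENGTH + n)
    }
    ref = gsub(/[.,]/, "", bases)         # matches to the reference (either strand)
    alt = gsub(/[ACGTacgt]/, "", bases)   # any non-reference base (either strand)
    print $1, $2, $3, $4, ref, alt
}' "$workdir/toy_mpileup.txt")

echo "$counts"
```

Note that this sketch counts every non-reference base as "alternative". If more than one non-reference base appears at a position, compare the observed letters against the alternative allele listed for that SNP.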

You can learn more about Phred quality scores (shown in the last column of the results file) and the ASCII encoding system here: https://people.duke.edu/~ccc14/duke-hts-2018/bioinformatics/quality_scores.html 
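
As a small worked example of the encoding, each quality character can be converted to a Phred score by taking its ASCII code and subtracting 33, and a Phred score Q corresponds to an error probability of 10^(-Q/10). The helper name `phred_from_char` is hypothetical.

```bash
# Hypothetical helper: convert one quality character to its Phred score (ASCII code minus 33).
phred_from_char() {
    local ascii
    # A leading single quote makes printf report the character's numeric code.
    ascii=$(printf '%d' "'$1")
    echo $(( ascii - 33 ))
}

phred_from_char 'I'   # ASCII 73, so Phred 40
phred_from_char '?'   # ASCII 63, so Phred 30
```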


### Part 3) Create a table with your results 

Create a table that contains the following information for each of the positions of interest from the Wang et al [2] study. 

In this table you should include the following columns from the Wang et al [2] table:

- SNP ID
- Gene
- Chromosome
- Position
- Reference Allele 
- Alternative allele (if known)
- Effect allele (i.e. this is the allele that is associated with the phenotype)
- A brief phenotype description
- A detailed phenotype description (if available)
- Whether or not the SNP of interest is included in the 1240k array (Note - This column wasn't included in the original Wang et al [2] publication)

You should also add the following columns based on the results of your analyses:
- The number of reads that align to this position
- The number of reference alleles observed
- The number of alternative alleles observed
  

*Note - Be sure to report positions that have no overlapping reads for your mystery genome. These positions won't be reported in the output of samtools mpileup*
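
One way to make sure uncovered positions are not lost is to walk through the positions list and look each position up in the mpileup output, reporting a depth of 0 when it is absent. The sketch below assumes a two-column, tab-separated positions file and the default mpileup layout with depth in column 4. All file names and values are toy examples.

```bash
# Toy inputs (made up): three requested positions, two of which appear in the mpileup output.
workdir=$(mktemp -d)
printf '1\t100\n1\t150\n2\t300\n' > "$workdir/toy_positions.txt"
printf '1\t100\tA\t5\t.....\tIIIII\n' >  "$workdir/toy_mpileup.txt"
printf '2\t300\tG\t2\t.,\tII\n'       >> "$workdir/toy_mpileup.txt"

depths=$(awk -F'\t' 'BEGIN { OFS = "\t" }
    # First file (mpileup output): remember the depth at each chromosome/position pair.
    FILENAME == ARGV[1] { depth[$1 FS $2] = $4; next }
    # Second file (positions list): print the stored depth, or 0 if the position is missing.
    { key = $1 FS $2; print $1, $2, (key in depth ? depth[key] : 0) }
' "$workdir/toy_mpileup.txt" "$workdir/toy_positions.txt")

echo "$depths"
```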

### Part 4) Visualizing the sequence alignments at a SNP of interest

Based on your results from part 3, choose three SNPs of interest (that impact different phenotypes) that you want to investigate further. 

When making your choice, you are welcome to prioritize SNPs that interest you most based on the phenotypes they impact, but you should also make sure to choose SNPs that have (relatively) good coverage based on the read counts you report in your table from part 3. See below for more things to consider when selecting which SNPs to investigate.

For each SNP you choose, use `samtools tview` [1]  (https://www.htslib.org/doc/samtools-tview.html) to visualize it, using the following command:

```bash
samtools tview -d T -p {CHROMOSOME}:{STARTING POSITION} {POINTER TO YOUR MYSTERY BAM} ~/153784/data/reference_genomes/human_g1k_v37.fasta
```

*Note - The starting position that you specify will be the leftmost position output by samtools tview. Be sure to choose a starting position that places the position of interest near the middle of the screen, ideally showing the start and end position of each read that overlaps the position (although this may not be possible for particularly long reads).* 

Take a screenshot showing the output of samtools tview for each of your chosen SNPs of interest to include as a figure in your report.
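
The arithmetic for picking a starting position can be sketched as follows: subtract half of the display width from the SNP position. The 80-column width is an assumption (check how wide your own tview output is), and the helper name `tview_start` and the example coordinates are hypothetical.

```bash
# Hypothetical helper: starting position that puts a SNP near the middle of the display.
# Arguments: SNP position, display width in columns (defaults to an assumed 80).
tview_start() {
    local pos=$1
    local width=${2:-80}
    local start=$(( pos - width / 2 ))
    # Coordinates are 1-based, so never go below 1.
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    echo "$start"
}

# Example with made-up coordinates: a SNP at position 1000000 on chromosome 2.
echo "2:$(tview_start 1000000)"
```

The printed value can then be used as the `{CHROMOSOME}:{STARTING POSITION}` argument, adjusting it by hand if the reads of interest are cut off.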

## When you are finished

### Be sure to include the following in your report: 
<b>Methods section</b>: <br>
Be sure to describe each of the methods that you used to learn about the positions of interest in your mystery genome in parts 2 and 4. 
- You should not describe how you made your position list (although you should describe how many positions you include in your analysis and where they were chosen from, i.e. from the Wang et al [2] study)

<b>Results section</b>: <br>
Be sure to include the following in your results section:
- The table described in part 3
- A figure (i.e. a screenshot) showing the output of `samtools tview` for your chosen SNPs of interest. Overlay an arrow or a box to highlight where the SNP of interest is in the figure. 

<b>Conclusion section</b>: <br>
For each of the three SNPs you chose to visualize in part 4: 
- Briefly describe what is known about the SNP and its impact on phenotype. If there are multiple SNPs that impact this phenotype, be sure to indicate this and specify how these SNPs work together to determine phenotype. 
- Based on the number of reference and alternative alleles that aligned to your SNP of interest, make a prediction about your mystery genome's likely genotype at this position. How confident are you in this prediction? Be sure to explain your reasoning. 
- Based on your genotype prediction at this position (and any other SNP positions that you investigated during this exercise that also impact this phenotype, if relevant) make a prediction about your mystery individual’s likely phenotype. How confident are you in this prediction? Be sure to explain your reasoning.


*Keep in mind:*
- Check the “A brief phenotype description” column in  `Wang_2023_snplist.txt` to see if there are multiple SNPs that control the same phenotype. This list is sorted in order of chromosome position (not by grouping related phenotypes together), so be sure to check the entire list.
- When choosing which SNPs to investigate, keep in mind whether they are relevant to your individual, based on their known ancestry and number of reads that overlap the positions of interest. 
    - For example, if you have previously determined that your individual has entirely European ancestry it wouldn’t be very interesting to highlight that they are homozygous for the G allele at the rs1229984 SNP, which is associated with the alcohol flush reaction, as this allele is fixed (i.e. occurs at 100% frequency) in European populations. 
    - Similarly, it would not be very interesting to tell me that you observed zero reads aligning to a particular position of interest.
    - You might find that the most exciting SNPs to examine are those that have moderate sequence depth and that carry both the reference and alternative alleles. By examining these positions in more detail, you can investigate whether your mystery genome is heterozygous at these positions or whether there might be a sequencing error or other factor impacting the distribution of reads you observed.
- Your description of the role that the SNPs play in defining each phenotype should be based on your own research, and it should reference studies published in academic journals. You may use sites like Wikipedia and https://www.snpedia.com/ to help in your search for relevant references, but don’t cite them in your report.
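
When reasoning about heterozygosity versus sequencing error, it can help to compare how well each genotype explains the observed read counts. The sketch below is one simple illustration, not a required method. It assumes reads are independent and that each read shows the wrong allele with a fixed error rate (0.01 by default, an arbitrary choice), and it ignores other factors that can skew read counts. The helper name `best_genotype` is hypothetical.

```bash
# Hypothetical helper: most likely genotype given n reads, k of which carry the
# alternative allele, under a simple binomial model with per-read error rate e.
# Arguments: n, k, e (optional, defaults to an assumed 0.01).
best_genotype() {
    awk -v n="$1" -v k="$2" -v e="${3:-0.01}" 'BEGIN {
        # Probability that a single read shows the alternative allele under each genotype.
        name[1] = "hom_ref"; p[1] = e
        name[2] = "het";     p[2] = 0.5
        name[3] = "hom_alt"; p[3] = 1 - e
        for (i = 1; i <= 3; i++) {
            # Binomial log-likelihood. The binomial coefficient is identical for all
            # three genotypes, so it is left out of the comparison.
            ll = k * log(p[i]) + (n - k) * log(1 - p[i])
            if (i == 1 || ll > best_ll) { best_ll = ll; best = name[i] }
        }
        print best
    }'
}

best_genotype 10 5   # 5 of 10 reads carry the alternative allele
best_genotype 8 0    # 0 of 8 reads carry the alternative allele
```

With only a handful of reads, the likelihoods of the competing genotypes are much closer together than they are at higher depth, which is one reason to state your confidence alongside each prediction.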


### Additional Questions to answer at the end of your report: 
1) Identify and describe what each of the optional parameters that you used when running `samtools mpileup` did.
2) What do you notice about the relative number of sequences that align to positions that are included on the 1240k array versus those that are not? Provide a possible explanation for the pattern that you observe. How do you think this pattern might differ for the other mystery genomes that your classmates are analyzing? 
3) Describe a scenario in which it might be useful to visualize the sequences that align to a position of interest using `samtools tview`. 
4) When viewing the aligned reads using `samtools tview`, the total number of sequences that you see might be greater than the numbers output by the samtools mpileup and bcftools mpileup commands that you ran. What is a possible explanation for this discrepancy?


### References
1) Danecek, Petr, et al. "Twelve years of SAMtools and BCFtools." Gigascience 10.2 (2021): giab008. https://doi.org/10.1093/gigascience/giab008
2) Wang, Ke, et al. "High-coverage genome of the Tyrolean Iceman reveals unusually high Anatolian farmer ancestry." Cell Genomics 3.9 (2023).


## Additional Resources
1) The University of Chicago’s Geography of Genetic Variants Browser (https://popgen.uchicago.edu/ggv) is a handy tool for learning more about the geographic distribution of SNPs of interest. 
    - Make sure you pay attention to the key on the right side of the map, as the scale shown in the pie chart can vary
2) SNPedia (https://www.snpedia.com/) is a wiki that contains lots of information about SNPs. Like Wikipedia, it should not be used as a reference in your reports, but it can point you towards useful references from academic publications.
    - Sometimes the reference / alternative alleles listed on SNPedia can refer to the opposite strand and are therefore the complements of the alleles in the reference genome that your mystery genome is aligned to. So if the alleles don't seem to match your expectation, this might be the reason why.
