# Practical 1

## Characterizing your mystery genome

The purpose of today's practical is to characterize your mystery genome to learn more about its age, preparation method, and quality.

You will use a combination of custom scripts and published tools, including `samtools` (https://www.htslib.org/doc/samtools.html)[1] and `PMDtools` (https://github.com/pontussk/PMDtools)[2], to analyze your mystery genome. 

### Getting Started

<b>If you haven't already done so, start an interactive session</b>

- Sign in to https://ood.huit.harvard.edu/ 
- Navigate to `Interactive Apps → Jupyter Lab`
- Launch a Jupyter Lab session with the following parameters:
    - Number of hours: 2
    - Number of CPUs: 2
- When the session is ready, click “Connect to Jupyter”

<b>Create a working directory (called "practical_1") from which you will run commands and in which you will store any files that you generate</b>

```bash
mkdir practical_1
cd practical_1
```

<b>Copy these practical instructions to your working directory and open them as a Jupyter Notebook</b>

```bash
cp ~/139860/practical_instructions/Practical1.ipynb ./
```

Then navigate to the `practical_1` directory in the sidebar and click on `Practical1.ipynb` to open it as a Jupyter Notebook.

### Part 1) Calculate the read length distribution for your mystery genome

First we will use `samtools` and some simple `awk` and `bash` scripting to quantify the distribution of DNA segment lengths (or read lengths) in your mystery genome. 

In order to save time, we will only consider the first 1 million reads in your bam file. 

The following command will extract the first 1 million reads from your bam file, calculate the length of each read, and then create a file called `read_length_distribution.txt` that reports the number of sequences of each length among those reads.

<i><b>Tip</b>: Make sure that you provide the full path to your mystery genome's bam file (i.e., include information about the folder where it is located, not just the name of your bam)</i>

```bash
samtools view {POINTER TO YOUR MYSTERY BAM FILE} | head -n 1000000 | awk '{print length($10)}' | sort | uniq -c > read_length_distribution.txt
```

Next, the following command will format your results into an easy-to-read table called `read_length_table.txt` with appropriate headers:

```bash
awk 'BEGIN {print "Read_Length\tCount"} {print $2 "\t" $1}' read_length_distribution.txt | sort -n -k 1 > read_length_table.txt
```

Take a peek at the `read_length_table.txt` you just produced and see if it gives you an idea about the age of your mystery genome. Does the distribution seem more consistent with that of an ancient or present-day individual?
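
If you would like a quick numerical summary to go with your visual inspection, the sketch below computes the total read count, the mean read length, and the most common read length. It is not one of the course scripts: the function name `summarize_read_lengths` and the example rows are made up for illustration, and it assumes the two-column `Read_Length`/`Count` layout produced by the command above.

```python
def summarize_read_lengths(lines):
    """Summarize a Read_Length/Count table given as an iterable of text lines."""
    counts = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines and the header row (its first field is not a number).
        if len(fields) != 2 or not fields[0].isdigit():
            continue
        counts[int(fields[0])] = int(fields[1])
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no read length rows found")
    mean = sum(length * n for length, n in counts.items()) / total
    mode = max(counts, key=counts.get)
    return {"total_reads": total, "mean_length": mean, "most_common_length": mode}


# Made-up example rows, not real data.
example = ["Read_Length\tCount", "35\t2", "50\t6", "65\t2"]
print(summarize_read_lengths(example))
```

To run it on your own results, pass an open file instead of the example list, e.g. `summarize_read_lengths(open("read_length_table.txt"))`.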

You can proceed to part 2 for now, but in your final report, use the information in `read_length_table.txt` to create a histogram that displays your read length distribution similar to the one below:

```python
from IPython.display import IFrame
IFrame("practical_1_read_length_histogram.pdf", width=600, height=450)
```

<i><b>Tip</b>: You can use any tool of your choosing to create your plot, including Python (in which case you can make it right here in this Jupyter Notebook), R, or even Excel or Google Sheets. If you aren't sure where to start, try asking ChatGPT!</i>
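
If you choose Python, one possible starting point is sketched below. It assumes that `matplotlib` is installed in your Jupyter environment and that `read_length_table.txt` has the two-column layout created in this part. The function names and the output file name `read_length_histogram.png` are illustrative choices, not course requirements.

```python
import os


def parse_read_length_table(lines):
    """Return (lengths, counts) from Read_Length/Count rows, skipping the header."""
    lengths, counts = [], []
    for line in lines:
        fields = line.split()
        if len(fields) == 2 and fields[0].isdigit():
            lengths.append(int(fields[0]))
            counts.append(int(fields[1]))
    return lengths, counts


def plot_read_length_histogram(lengths, counts, out_path):
    """Draw one bar per read length and save the figure to out_path."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots(figsize=(6, 4.5))
    ax.bar(lengths, counts, width=1.0)
    ax.set_xlabel("Read length (bp)")
    ax.set_ylabel("Number of reads")
    ax.set_title("Read length distribution")
    fig.savefig(out_path, dpi=150)
    return fig


# Only runs when the table from this part is in the current directory.
if os.path.exists("read_length_table.txt"):
    with open("read_length_table.txt") as handle:
        lengths, counts = parse_read_length_table(handle)
    plot_read_length_histogram(lengths, counts, "read_length_histogram.png")
```

Because the table already holds one count per read length, a bar chart with a bar width of 1 gives the same picture as a histogram of the raw reads.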

### Part 2) Check to see if your mystery genome has signatures of ancient DNA damage (C-to-T misincorporations)

To assess ancient DNA damage patterns, we will use `PMDtools`, a handy Python-based package for analyzing ancient DNA data. To save time, we will again only consider the first 1 million reads in your bam file.

To run `PMDtools`, use the following command:

```bash
samtools view {POINTER TO YOUR MYSTERY BAM FILE} | head -n 1000000 | python ~/139860/tools/PMDtools/pmdtools.0.60.py --deamination > damage_pattern_table.txt
```

- <i><b>Note</b>: Some of you may see an error that contains the following message:<br><br>
  `cigar found:',cigar,'PMDtools only supports cigar operations M, I, S and D, the alignment has been excluded`<br><br>
  This happens because some of the reads in your bam file have been "hard clipped": the clipped bases have been removed from the stored read sequence, and the clipping is recorded as an "H" operation in the CIGAR string. `PMDtools` can't handle hard-clipped reads. If you see this message, filter out reads that have been hard clipped (i.e., ones that contain an "H" in their CIGAR string, which is stored in the 6th column of each alignment record). To do this, rerun the above command with the following extra `awk` step between the `samtools` and `head` commands: `awk '$6 !~ /H/'`. Make sure to add an extra pipe symbol (`|`) so that each step is separated from the next.</i>
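
To make the logic of that `awk` filter concrete, here is a rough Python equivalent for well-formed alignment records. The function name `drop_hard_clipped` and the two example records are invented for illustration. In the practical itself, the `awk` one-liner is all you need.

```python
def drop_hard_clipped(sam_lines):
    """Yield alignment records whose CIGAR (6th tab-separated column) has no 'H'."""
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        # fields[5] is the CIGAR string; records with too few columns are dropped.
        if len(fields) > 5 and "H" not in fields[5]:
            yield line


# Two invented records: read1 is fully aligned, read2 carries a hard clip (2H).
records = [
    "read1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tFFFF",
    "read2\t0\tchr1\t200\t37\t2H4M\t*\t0\t0\tACGT\tFFFF",
]
kept = list(drop_hard_clipped(records))
print(len(kept), "of", len(records), "records kept")
```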

When it is finished running, `PMDtools` should have created a file called `damage_pattern_table.txt`, which contains information about the rate of misincorporations at the first and last 30 bases in each read.
- <b>Note</b>: For the first 30 bases, it reports the rate of misincorporations at positions where it was expecting to see a "C", while for the last 30 bases it reports the rate of misincorporations at positions where it was expecting to see a "G".

This table isn't formatted in a very readable way, so let's reformat it with the following custom script, which uses a combination of `awk` commands:

```bash
bash ~/139860/tools/custom_scripts/reformat_pmdtools_output.sh damage_pattern_table.txt reformatted_damage_pattern_table.txt
```

You just created a more nicely formatted table called `reformatted_damage_pattern_table.txt` which contains information about the rate of misincorporations in the first and last 30 bases in each read. Take a look at this table, paying special attention to the C->T and G->A misincorporations.

Using the information you gathered from this table and from the previous read length distribution analysis, what conclusions can you draw about your mystery genome? 
- Do you think your mystery genome is from an ancient or present-day individual?
- If they are ancient, do you think the DNA sample underwent any UDG treatment during library preparation? 

Once you have answered the above questions, you can proceed to part 3, but for your final report, use the information in `reformatted_damage_pattern_table.txt` to create a "smiley" plot that displays the misincorporation rates in your mystery genome, similar to the one below, using a plotting tool of your choosing:

<i><b>Note:</b> Make sure that you don't zoom too far in on the Y-axis. You should see the pattern best with a maximum Y-axis value of 20-30%.</i>

```python
from IPython.display import IFrame
IFrame("practical_1_misincorporation_plot.pdf", width=800, height=350)
```
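
If you plot in Python, the sketch below shows one way to lay out a two-panel "smiley" plot. It rests on several assumptions you should check against your own file: that `reformatted_damage_pattern_table.txt` is tab-delimited with a header row, and that it has one position column plus separate C->T and G->A rate columns. The column names `Position`, `CT_5p` and `GA_3p`, and all numbers in the example, are hypothetical placeholders. `matplotlib` is assumed to be installed.

```python
import csv
import io


def read_damage_table(handle, position_col, ct_col, ga_col):
    """Read positions and two rate columns from a tab-delimited table with a header."""
    reader = csv.DictReader(handle, delimiter="\t")
    positions, ct_rates, ga_rates = [], [], []
    for row in reader:
        positions.append(int(row[position_col]))
        ct_rates.append(float(row[ct_col]))
        ga_rates.append(float(row[ga_col]))
    return positions, ct_rates, ga_rates


def plot_smiley(positions, ct_rates, ga_rates, out_path, y_max=0.3):
    """Plot C->T (left) and G->A (right, x-axis reversed) on a shared y-axis."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, (left, right) = plt.subplots(1, 2, sharey=True, figsize=(8, 3.5))
    left.plot(positions, ct_rates, color="red", marker="o")
    left.set_title("C->T, 5' end")
    left.set_xlabel("Distance from 5' end")
    left.set_ylabel("Misincorporation rate")
    # y_max=0.3 assumes rates are stored as fractions; use 30 for percentages.
    left.set_ylim(0, y_max)
    right.plot(positions, ga_rates, color="blue", marker="o")
    right.set_title("G->A, 3' end")
    right.set_xlabel("Distance from 3' end")
    right.invert_xaxis()  # so the read end sits on the right-hand side
    fig.savefig(out_path)
    return fig


# Hypothetical two-row table, used only to show the expected input shape.
example = io.StringIO("Position\tCT_5p\tGA_3p\n1\t0.25\t0.2\n2\t0.1\t0.08\n")
positions, ct_rates, ga_rates = read_damage_table(example, "Position", "CT_5p", "GA_3p")
print(positions, ct_rates, ga_rates)
```

For your own data, open the reformatted table in place of `example`, substitute the real column names from its header, and then call `plot_smiley` with an output file name of your choosing.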

### Part 3) Calculate the coverage of your mystery genome on different SNP sets

Finally, let's calculate the coverage of your mystery genome on three different SNP sets in order to try to determine whether your mystery genome underwent targeted enrichment capture or genome-wide ("shotgun") sequencing. 

If your mystery genome underwent targeted enrichment capture, its coverage on the positions that were enriched for will be significantly higher than the genome-wide coverage.

So let's check your mystery genome's coverage on the following three SNP sets:
- <b>Whole genome</b>: For samples that underwent 'shotgun' sequencing, instead of targeted enrichment capture, we report the genome-wide coverage. That's the coverage across all positions on the human genome. 
- <b>1240k SNP set</b>: The '1240k' array is a set of approximately 1.24 million SNPs that are included on a targeted enrichment capture array that is widely used by ancient DNA researchers. 
- <b>390k SNP set</b>: The '390k' array is a list of approximately 390,000 SNPs that were included on a targeted enrichment capture array that pre-dated the 1240k array. A small number of ancient genomes underwent targeted enrichment capture on this SNP set.

You'll use the `samtools depth` command to calculate the coverage of your mystery genome on these three SNP sets.
- For the 1240k and 390k SNP sets, we'll specify the SNP sets to focus on by providing a SNP list with the `-b` parameter.
- In all cases, we'll use the `-a` parameter to report counts at all of the SNPs of interest, not just those that have at least one overlapping read.

Use the following three commands to calculate coverage at each SNP set. These can take a while to run, so we'll submit each of them as a separate job to the compute cluster:

<b>Compute coverage on the whole genome:</b>

```bash
sbatch --wrap="samtools depth --min-MQ 10 --min-BQ 20 -a {POINTER TO YOUR MYSTERY BAM FILE} | awk '{sum+=\$3} END { print sum/NR}' > coverage_genome_wide.txt" 
```

<b>Compute coverage on the 1240k SNP set:</b>

```bash
sbatch --wrap="samtools depth --min-MQ 10 --min-BQ 20 -b ~/139860/data/reference_data/1240k_positions.bed -a {POINTER TO YOUR MYSTERY BAM FILE} | awk '{sum+=\$3} END { print sum/NR}' > coverage_1240k.txt"
```

<b>Compute coverage on the 390k SNP set:</b>

```bash
sbatch --wrap="samtools depth --min-MQ 10 --min-BQ 20 -b ~/139860/data/reference_data/390k_positions.bed -a {POINTER TO YOUR MYSTERY BAM FILE} | awk '{sum+=\$3} END { print sum/NR}' > coverage_390k.txt" 
```
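
The `awk` step in each command averages the third column of the `samtools depth` output (chromosome, position, depth) over all reported positions. The sketch below repeats that calculation in Python on a few invented rows and adds a simple ratio for comparing a SNP set against the genome-wide value. The function names `mean_depth` and `fold_enrichment` are illustrative only, and how large a ratio counts as evidence of capture is left for you to judge.

```python
def mean_depth(depth_lines):
    """Mean of the third column (depth) over all reported positions."""
    total = 0
    n_positions = 0
    for line in depth_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        total += int(fields[2])
        n_positions += 1
    if n_positions == 0:
        raise ValueError("no positions found")
    return total / n_positions


def fold_enrichment(target_coverage, genome_coverage):
    """Ratio of coverage on a SNP set to genome-wide coverage."""
    return target_coverage / genome_coverage


# Invented rows in the chromosome/position/depth layout.
genome_rows = ["chr1\t1\t0", "chr1\t2\t1", "chr1\t3\t0", "chr1\t4\t1"]
target_rows = ["chr1\t2\t3", "chr1\t4\t1"]
genome_cov = mean_depth(genome_rows)
target_cov = mean_depth(target_rows)
print(genome_cov, target_cov, fold_enrichment(target_cov, genome_cov))
```

Once your jobs have finished, each `coverage_*.txt` file should hold a single number, so you can read the values with, for example, `float(open("coverage_1240k.txt").read())` and compare them in the same way.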

### Part 4) Trim the terminal bases in your bam file based on its age and UDG treatment

Now that you know more about your mystery genome, you can modify your bam to make sure that your future analyses aren't affected by C-to-T (or G-to-A) misincorporations. For ancient genomes, it is standard practice to ignore bases at the ends of each read, as they are most susceptible to ancient DNA damage. How many bases you ignore depends on the UDG treatment:

- No UDG treatment: 10 bases
- Partial UDG treatment: 2 bases
- Full UDG treatment: 0 bases

For present-day genomes, you don't need to ignore any bases, since we don't expect to see any ancient DNA damage. 
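
The rules above amount to a small lookup, sketched here in Python. The labels `"none"`, `"partial"` and `"full"` and the function name `bases_to_softclip` are illustrative choices. The numbers are the ones listed above.

```python
# Bases to ignore at each read end, keyed by UDG treatment (values from the list above).
SOFTCLIP_BASES = {"none": 10, "partial": 2, "full": 0}


def bases_to_softclip(is_ancient, udg_treatment=None):
    """Return how many terminal bases to ignore for a given sample type."""
    if not is_ancient:
        return 0  # present-day genomes: no ancient DNA damage expected
    if udg_treatment not in SOFTCLIP_BASES:
        raise ValueError("udg_treatment must be 'none', 'partial' or 'full'")
    return SOFTCLIP_BASES[udg_treatment]


print(bases_to_softclip(True, "partial"))
```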

Next, you will use the `softclip` function of the `ADNA-Tools` package [3] (version 1.10.0; https://github.com/DReichLab/ADNA-Tools) to modify your bam so that the number of terminal bases you specify is excluded from future analyses. It works by lowering the quality score assigned to those bases. That way, you can still check what the full read looks like if you want to, but those bases won't be used in analyses where you specify a minimum quality score above that threshold. This is called softclipping.

You'll save the file in your home directory. That way you can easily reference it for future practicals. 


<b>Option 1: Softclipping </b><br>
If your bam comes from an ancient individual and was prepared using no or a partial UDG treatment, you can use the following code to softclip your bam. Be sure to specify the correct number of bases to softclip based on how you think your genome was prepared.

```bash 
sbatch --wrap="java -jar ~/139860/tools/adnascreen-1.10.0-SNAPSHOT.jar softclip -b \
-n {NUMBER OF POSITIONS TO SOFTCLIP} \
-i {POINTER TO YOUR MYSTERY BAM FILE} \
-o ~/{YOUR MYSTERY GENOME ALIAS}_softclipped.bam"
```

<i><b>Note:</b> The code will take some time to run, so you can submit it as a job that will finish before next week. Be sure to check back and make sure that your job finished without issues.</i>

<b>Option 2: No softclipping</b><br>
If your bam was prepared using a full UDG treatment or comes from a present-day individual, you don't need to softclip it. But let's still create a symbolic link that points to your file and has the same name that a softclipped file would have, so that you can use it for future analyses. You can do that using the following code:

```bash
ln -s {POINTER TO YOUR MYSTERY BAM FILE} ~/{YOUR MYSTERY GENOME ALIAS}_softclipped.bam
```

## When you are finished

### Be sure to include the following in your report: 
<b>Methods section</b>: <br>
A description of each of the analyses that you performed. Make sure to include a description of the data processing that you did in part 4, even though you won't use your softclipped bam file until future practicals.

<b>Results section</b>: <br>
Make sure that your results section includes the following results:
- Figures showing the histogram from part 1 and the “smiley” plot from part 2.
- Report (in words) the coverage you calculated in part 3. 

<b>Conclusion section</b>: <br>
Be sure to address the following questions. Based on the results of your analyses: 
- Do you think your mystery genome was sequenced from an ancient or present-day individual?
- Do you think your mystery genome underwent targeted enrichment capture or shotgun sequencing? If it underwent capture, what array do you think was used?
- If the individual was ancient, what type of UDG treatment do you think was used during processing: none (minus), partial (half) or full (plus)?

### Additional Questions to answer at the end of your report: 
1) In part 1, what is the purpose of the awk statement `awk '{print length($10)}'`? 
    - <i><b>Tip:</b> Try using the `head` command to view the input that the `samtools view` command is passing to this `awk` statement.</i>
2) In part 2, why does `PMDtools` output include more than just the rate of C-to-T misincorporations as part of this analysis? What other misincorporation rate may be impacted by ancient DNA damage?
3) In part 3, what do the parameters `--min-MQ` and `--min-BQ` specify in the `samtools depth` function? 
    - <i><b>Tip:</b> Check out the `samtools depth` documentation:</i> https://www.htslib.org/doc/samtools-depth.html
4) In part 4, why might we want to ignore SNPs that fall on the terminal ends of the DNA sequences for ancient DNA data? Why does the number of bases that you ignore vary depending on the UDG treatment used?


## References

1) Danecek, P, et al. (2021) Twelve years of SAMtools and BCFtools. GigaScience 10(2):giab008.
2) Skoglund, P, Northoff, BH, Shunkov, MV, Derevianko, A, Pääbo, S, Krause, J, Jakobsson, M (2014) Separating ancient DNA from modern contamination in a Siberian Neandertal. Proceedings of the National Academy of Sciences USA.
3) ADNA-Tools. https://github.com/DReichLab/ADNA-Tools

