# Phylogenetics


## What is phylogenetics

**Phylogenetics:** Using genetic data to study evolutionary history.

It describes relationships among:

- Species
- Individuals
- Genes

![](/Users/harperhipps/Documents/GitHub/UBIC-Workshops/phylogenetics/img/Tree2.jpg)


## Phylogenetic Trees
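Taxa that sit closer together on a tree share a more recent common ancestor and usually more genetic similarity.

**Clade**: A common ancestor and all of its descendants
- SARS-CoV-2, the virus that causes COVID-19, forms a clade within the family *Coronaviridae*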
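
To make the definition concrete, here is a minimal Python sketch. It assumes a toy tree stored as a dictionary of children; the tree shape, the taxon labels, and the helper name `leaves_under` are illustrative assumptions, not data from this workshop. Under that assumption, a clade is every tip reachable from one chosen ancestor.

```python
# Toy rooted tree: each internal node maps to its list of children.
# All node names are made up for illustration.
tree = {
    "root": ["ancestor_1", "ancestor_2"],
    "ancestor_1": ["human", "chimp"],
    "ancestor_2": ["mouse", "rat"],
}

def leaves_under(node, tree):
    """Return all leaf descendants of `node`, i.e. the tips of its clade."""
    children = tree.get(node)
    if not children:              # a node with no children is a leaf
        return [node]
    leaves = []
    for child in children:
        leaves.extend(leaves_under(child, tree))
    return leaves

clade = leaves_under("ancestor_1", tree)
print(clade)
```

Calling `leaves_under("root", tree)` returns every tip, since the root's clade is the whole tree.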

![](/Users/harperhipps/Documents/GitHub/UBIC-Workshops/phylogenetics/img/Tree1.jpg)



## What is a FASTA file?

FASTA is a text-based file format commonly used to store nucleotide or peptide sequences.

In [None]:
# Import necessary modules
from google.colab import files
from Bio import SeqIO

# Upload the FASTA file
uploaded = files.upload()  # Allows interactive file upload

Use `cat` to view the contents of the file.

In [None]:
cat filename.fasta

Scroll through the FASTA file interactively.

In [None]:
less filename.fasta

Display the first or last few lines of the FASTA file.

In [None]:
!head -n 10 filename.fasta  # First 10 lines
!tail -n 10 filename.fasta  # Last 10 lines

Use `head` and/or `tail` to explore the data more. (Hint: change the value after `-n`.)

Use `grep` to find specific sequences or headers.

In [None]:
!grep ">" filename.fasta  # List all sequence headers
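
The same header listing can be sketched in plain Python. The example below is a minimal sketch, not the workshop's method: the FASTA text, the record names, and the helper `read_fasta` are made up for illustration, and an in-memory string stands in for the uploaded file. With Biopython, `SeqIO.parse(path, "fasta")` reads records in a similar way.

```python
import io

# A tiny made-up FASTA text standing in for the uploaded file.
example_fasta = """>seq1 example record
ACGTACGT
ACGT
>seq2 another record
GGGTTTAA
"""

def read_fasta(handle):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue                    # skip blank lines
        if line.startswith(">"):
            header = line[1:]           # a '>' line starts a new record
            records[header] = ""
        elif header is not None:
            records[header] += line     # a sequence may span several lines
    return records

records = read_fasta(io.StringIO(example_fasta))
for header, sequence in records.items():
    print(header, len(sequence))
```

To try it on a real file, pass `open("filename.fasta")` instead of the `io.StringIO` object.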

Follow [this link](https://builtin.com/articles/grep-command) and scroll down to "13 grep Options to Know". Explore the options and try them on your own, filtering the searches to find only certain sequences.

What do you notice about this file? (Hint: Pay attention to the format and each column!)

What do you think this type of file could be used or analyzed for?

## How to build phylogenetic trees with small and large parsimony problems and alignment?


## Base Level Algorithm

In this example, our goal is to find the assignment of ancestral states that requires the fewest evolutionary changes in a tree.
Say we have the tree below, where each leaf (1, 2, 3, 4) carries one nucleotide and X, Y, and Root are unknown ancestors.
For instance:

    1 = A
    2 = G
    3 = A
    4 = C

         Root
        /   \
       X     Y
      / \   / \
     1   2  3  4

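We can score each possible change along a branch and record the totals in a cost table:

- No change: cost = 0
- Transition (purine ↔ purine, A ↔ G, or pyrimidine ↔ pyrimidine, C ↔ T): cost = 1
- Transversion (purine ↔ pyrimidine): cost = 2

With this scheme, we can build a cost table for each possible value of X. Each entry is the cost of the branch to leaf 1 (A) plus the cost of the branch to leaf 2 (G).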


| X Value  |  Cost    |
| -------- | -------  |
| A        |  0 + 1   |
| C        |  2 + 2   |
| G        |  1 + 0   |
| T        |  2 + 2   | 
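
The table for X can also be computed. This is a minimal sketch assuming the 0/1/2 cost scheme above; the function names `substitution_cost` and `cost_table` are hypothetical helpers introduced only for this illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_cost(a, b):
    """0 for no change, 1 for a transition, 2 for a transversion."""
    if a == b:
        return 0
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return 1 if same_class else 2

def cost_table(left_leaf, right_leaf):
    """Cost of assigning each base to the parent of two leaves."""
    return {
        base: substitution_cost(base, left_leaf) + substitution_cost(base, right_leaf)
        for base in "ACGT"
    }

x_costs = cost_table("A", "G")   # X is the parent of leaves 1 = A and 2 = G
print(x_costs)                   # {'A': 1, 'C': 4, 'G': 1, 'T': 4}
```

The printed totals match the sums in the X table (0 + 1, 2 + 2, 1 + 0, 2 + 2).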


Try building this table for the values of Y, whose children are leaf 3 (A) and leaf 4 (C).

| Y Value  |  Cost    |
| -------- | -------  |
| A        |          |
| C        |          |
| G        |          |
| T        |          |

Question: Based on the cost tables above, what is the most parsimonious state for X and for Y?
(Hint: The most parsimonious state is the one with the lowest total cost, meaning the fewest or cheapest changes.)

Challenge: How would you apply this to determine the root value?
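
One way to approach the challenge is to repeat the same table-building step at every internal node, passing each child's cost table up to its parent. This bottom-up recursion is commonly known as Sankoff's algorithm for the small parsimony problem. The sketch below is one possible implementation under the 0/1/2 cost scheme above, not a definitive one; the nested-tuple tree encoding and the names `substitution_cost` and `node_costs` are assumptions made for this illustration.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def substitution_cost(a, b):
    """0 for no change, 1 for a transition, 2 for a transversion."""
    if a == b:
        return 0
    # Same chemical class (both purines or both pyrimidines) is a transition.
    return 1 if (a in PURINES) == (b in PURINES) else 2

def node_costs(node):
    """Return {base: minimum cost of the subtree if `node` has that base}."""
    if isinstance(node, str):
        # Leaf: the observed base costs 0, every other base is ruled out.
        return {b: (0 if b == node else float("inf")) for b in BASES}
    totals = {b: 0 for b in BASES}
    for child in node:
        child_costs = node_costs(child)
        for b in BASES:
            # Cheapest way to reach this child: branch cost plus the child's own subtree cost.
            totals[b] += min(substitution_cost(b, c) + child_costs[c] for c in BASES)
    return totals

# Root = (X, Y), X = (leaf 1, leaf 2), Y = (leaf 3, leaf 4)
tree = (("A", "G"), ("A", "C"))
root_costs = node_costs(tree)
best_root = min(root_costs, key=root_costs.get)
print(root_costs, best_root)
```

Calling `node_costs(("A", "G"))` reproduces the X table, so the same function can be used to check your Y table before looking at the root.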

## Build tree:


## Applications of Phylogenetics - COVID
Phylogenetics is useful for tracking the emergence and spread of SARS-CoV-2 variants.
Rapid genome sequencing can help infer where someone was likely infected, based on the clade their virus belongs to.
- “The first four cases of COVID-19 in New South Wales, Australia, were found to be closely related to the dominant strain of SARS-CoV-2 found in Wuhan, and these first four cases were all in people who had recently returned from traveling in China” [(Source)](https://www.news-medical.net/health/Viral-Clades-of-SARS-CoV-2.aspx)
- Phylogenetics helped trace the origins of COVID-19 outbreaks and informed travel restrictions during the pandemic.

A tree of 10 million SARS-CoV-2 genome sequences from UC Santa Cruz, reported in 2022 as the largest tree of genomic sequences of a single species ever assembled [(Image Source)](https://news.ucsc.edu/2022/06/10-million-sequences.html)

![](/Users/harperhipps/Documents/GitHub/UBIC-Workshops/phylogenetics/img/UCSCtree.jpg)


## Applications of Phylogenetics at UCSD
**PanMAN Tool** (Pangenome Mutation-Annotated Network) - Turakhia Lab
- PanMAN is used to analyze and visualize pangenomes, which is especially useful for studying genetic mutations in viruses like SARS-CoV-2 and in other microbial datasets
    - Composed of mutation-annotated trees called **PanMATs** (Pangenome Mutation-Annotated Trees)
    - **Pangenome**: the entire set of genes from all strains within a clade
    
![](/Users/harperhipps/Documents/GitHub/UBIC-Workshops/phylogenetics/img/pangenome.jpg)


   
- Useful because:
    - Annotating mutations on the branches of the tree makes it easier to quickly identify and analyze genetic changes that could affect a virus's severity, resistance to vaccines, and transmissibility
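
The idea of annotating mutations on branches can be sketched with a toy data structure. This is only an illustration of the general concept, not PanMAN's or UShER's actual file format or API; every node name, mutation label, and helper function below is made up.

```python
# Toy mutation-annotated tree: each node stores its parent and the
# mutations placed on the branch leading to it. All values are invented.
nodes = {
    "root":     {"parent": None,       "mutations": []},
    "lineage1": {"parent": "root",     "mutations": ["A10G"]},
    "sample_a": {"parent": "lineage1", "mutations": ["C25T"]},
    "sample_b": {"parent": "lineage1", "mutations": []},
    "sample_c": {"parent": "root",     "mutations": ["G7A", "C25T"]},
}

def mutations_from_root(name, nodes):
    """Collect the mutations on the path from the root down to `name`."""
    path = []
    while name is not None:
        path.append(nodes[name]["mutations"])
        name = nodes[name]["parent"]
    # The walk went from leaf to root, so reverse it to get root-to-leaf order.
    return [m for branch in reversed(path) for m in branch]

def branches_with(mutation, nodes):
    """List the nodes whose incoming branch carries `mutation`."""
    return [name for name, node in nodes.items() if mutation in node["mutations"]]

print(mutations_from_root("sample_a", nodes))
print(branches_with("C25T", nodes))
```

Because each mutation is stored once on the branch where it arose, a query like `branches_with` can show at a glance whether a change appeared once or several times independently.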

    


Exploration of the SARS-CoV-2 mutational and evolutionary landscape using PanMAN vs. UShER-MAT [(Image Source)](https://turakhia.ucsd.edu/PPTs/PanMAN-Oxford-2024.pdf)
- UShER-MAT: another mutation-annotated tree (MAT) tool, which cannot represent complex mutations

![](/Users/harperhipps/Documents/GitHub/UBIC-Workshops/phylogenetics/img/PanMAN.jpg)


## Resources at UCSD

#### Online Resources
[Simple phylogenetic tree explanation](https://evolution.berkeley.edu/evolution-101/the-history-of-life-looking-at-the-patterns/understanding-phylogenies/)

[More in-depth explanation of creating trees in Python](https://taylor-lindsay.github.io/phylogenetics/)


#### Labs at UCSD
Turakhia Lab
 - __[Turakhia Lab Website](https://turakhia.ucsd.edu/)__
 - Led by Prof. Yatish Turakhia
 - Affiliated with Electrical and Computer Engineering
 - Develops automated solutions to construct large-scale phylogenies
 - Also studies
    - __Pangenomics__, studying genetic variation within a species by comparing multiple genomes (e.g., PanMAN, described above)
    - __Hardware acceleration__, using specialized hardware to speed up computation
    - __Outbreak analysis__, using real-time phylogenies to analyze pandemics



Mirarab Lab
 - __[Mirarab Lab Website](http://eceweb.ucsd.edu/~smirarab/)__
 - Led by Prof. Siavash Mirarab
 - Affiliated with Electrical and Computer Engineering
 - Focuses on reconstructing and utilizing phylogenetic trees
 - Also studies
    - __Metagenomics__, studying whole communities of microorganisms and identifying their taxonomic composition (how much of each species is present)
    - __HIV__, specifically transmission network reconstruction
    - __Multiple Sequence Alignment__, for which they've developed two new methods
