## Project Part 2 (a.k.a. Project 2 on Schedule)
While you are learning about evolutionary trees and distance matrices this week in lecture/lab, we can get started on pulling together the data we need for our COVID-19 analysis. We can also do some starter analysis. Our goal over the next few weeks will be to clean up and refine this work into a detailed, well-explained, reproducible analysis.

The current complete genomes can be downloaded from here:
https://covid19.galaxyproject.org/genomics/4-Variation/current_complete_ncov_genomes.fasta

Disclaimers: I fully expect the analysis below to change. This isn't a lab. This is a first cut to get us discussing. We need to be critical of this work and look for problems and ways to improve it. We need to be skeptical scientists. We need to dig into the literature, not just about the tools but about the biology itself, before putting out potentially misleading information.

Technical Disclaimers: I've written this to run on my system. I want you to use this notebook as motivation and guidance for approaching this week's project. Some of the things I've done below may or may not be in your data analysis/programming/data science wheelhouse. That's ok. There is a lot of room for variety. Try to approach this genuinely, based on what you are interested in contributing.

### Guidance (i.e., what you should do)
We've said this many times: this project isn't about everyone reaching the same point through a predetermined set of steps. It's about applying what we are learning in class to produce real data analysis for the community. It is about as *Learn by doing* as you could possibly get at Cal Poly. So what should you be doing this week for the project? Here is some guidance (but remember this is only to guide you, not to box you into specific tasks). The items are in no particular order.
* Consider what questions we want to ask from our evolutionary tree analysis. Think about what questions the book was trying to answer. Do we even have the data in this notebook to answer some of those questions? If not, spend time trying to find it, now that you know more about what format to look for. Do some literature searching and see what other work has been done for this virus and others.
* Research and try different evolutionary tree programs/frameworks. What I've done below is not the only game in town by far. Biopython itself has different options.
* Consider the alignment itself. Are there different ways to do this? Did we do it correctly?
* What about the sequences themselves? Are they all of the same quality? Should we exclude some?
* What about the virus alignment program? Did we use that correctly? Should we have aligned the entire sequence instead of using Spike as a reference? Should we try a different reference?
* Do we have more data available about the sequences? Part of the world, etc. Can we do some digging here to answer different questions?
* And I'm sure you can think of more to attempt... Think about what you want to do. Spend time working towards a well-thought-out goal. Document things as you go. Talk to everyone on Slack. Together we can do this!
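
As one way to start on the sequence-quality question above, here is a minimal sketch that flags sequences with many ambiguous bases or an unusually short length. The sequence names, the toy sequences, and both thresholds (`MAX_AMBIGUOUS`, `MIN_LENGTH`) are made up for illustration; sensible cutoffs for real genomes are something we would need to justify from the literature.

```python
def ambiguous_fraction(sequence):
    """Fraction of characters that are not an unambiguous A, C, G, or T."""
    seq = sequence.upper()
    if not seq:
        return 1.0
    unambiguous = sum(seq.count(base) for base in "ACGT")
    return 1 - unambiguous / len(seq)


# Hypothetical toy records standing in for entries read from the FASTA file
toy_sequences = {
    "seq_clean": "ATGCATGCAT",
    "seq_gappy": "ATGNNNNCAT",
    "seq_short": "ATG",
}

MAX_AMBIGUOUS = 0.05  # hypothetical threshold, not a recommendation
MIN_LENGTH = 5        # hypothetical threshold, not a recommendation

kept = [
    name
    for name, seq in toy_sequences.items()
    if len(seq) >= MIN_LENGTH and ambiguous_fraction(seq) <= MAX_AMBIGUOUS
]
print(kept)
```

Whether to drop flagged sequences or just annotate them is a judgment call worth discussing before it changes the tree.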

### Link to clone the repository
Here is a link to the project repository.

https://github.com/anderson-github-classroom/csc-448-project

The website can be viewed at https://anderson-github-classroom.github.io/csc-448-project/.

### First step is to get the data
We are going to rely on the Galaxy team to pull together our sequence data for now. We might change this later.

In [1]:
import wget
from os.path import isfile

FORCE_GENOME_DOWNLOAD = False

url = 'https://covid19.galaxyproject.org/genomics/4-Variation/current_complete_ncov_genomes.fasta'
file = '../../current_complete_ncov_genomes.fasta'

if FORCE_GENOME_DOWNLOAD or not isfile(file):
    wget.download(url, file)

### Virus Alignment
Using the alignment generated by Dr. A


### Read the sequences into pandas so we can process them

In [2]:
import pandas as pd
from time import time
from os.path import isfile

ALIGNMENT_PATH = '../../data/position_table.csv'

print("Reading alignment table into string dictionary")
start = time()

# Use the constant defined above so the path only has to change in one place
position_table = pd.read_csv(ALIGNMENT_PATH)

end = time()
print(f"Read alignment table in {round(end-start, 4)} seconds")

print("Alignment table stats:")
results = position_table.describe()
print(results)

Reading alignment table into string dictionary
Read alignment table in 1.1814 seconds
Alignment table stats:
             seqid S_1_1 S_1_2 S_1_3 S_2_1 S_2_2 S_2_3 S_3_1 S_3_2 S_3_3  ...  \
count          677   677   677   677   677   677   677   677   677   677  ...   
unique         677     1     1     1     1     1     1     1     1     1  ...   
top     MT325629.1     A     T     G     T     T     T     G     T     T  ...   
freq             1   677   677   677   677   677   677   677   677   677  ...   

       S_1270_3 S_1271_1 S_1271_2 S_1271_3 S_1272_1 S_1272_2 S_1272_3  \
count       677      677      677      677      677      677      677   
unique        1        1        1        1        1        1        1   
top           A        C        A        T        T        A        C   
freq        677      677      677      677      677      677      677   

       S_1273_1 S_1273_2 S_1273_3  
count       677      677      677  
unique        1        1        1  
top        

### Pull out the consensus sequence

In [3]:
concensus_seq = position_table.drop('seqid',axis=1).mode(axis=0).T[0]
concensus_seq

S_1_1       A
S_1_2       T
S_1_3       G
S_2_1       T
S_2_2       T
           ..
S_1272_2    A
S_1272_3    C
S_1273_1    A
S_1273_2    C
S_1273_3    A
Name: 0, Length: 3819, dtype: object
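
To make explicit what the column-wise mode is doing, here is a small pure-Python sketch of the same idea on made-up sequences. It also reports positions where two bases tie for most common. In the pandas version a tie produces extra rows in the `mode` result and `.T[0]` keeps only the first one, so the sketch breaks ties alphabetically to mirror that. The function name and toy data are assumptions for illustration.

```python
from collections import Counter


def column_consensus(rows):
    """Return (consensus string, list of tied positions) for equal-length rows."""
    consensus = []
    tied_positions = []
    for position, column in enumerate(zip(*rows)):
        counts = Counter(column)
        best = max(counts.values())
        winners = sorted(base for base, n in counts.items() if n == best)
        if len(winners) > 1:
            tied_positions.append(position)
        # Alphabetical tie-break, mirroring "take the first mode"
        consensus.append(winners[0])
    return "".join(consensus), tied_positions


# Made-up aligned toy sequences, one character per alignment column
toy_rows = ["ATG", "ATC", "ACC", "AGG"]
toy_consensus, toy_ties = column_consensus(toy_rows)
print(toy_consensus, toy_ties)
```

Positions that show up in the tie list are ones where the "consensus" base is arbitrary, which matters if we later measure distance from it.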

### Sort samples by distance from the consensus sequence

In [4]:
from time import time

position_table = position_table.set_index('seqid')
print("Calculating distances from consensus sequence: ", end='')
start = time()
distance_from_concensus_seq = position_table.apply(lambda row: sum(row != concensus_seq),axis=1)
end = time()
print(f"{round(end-start, 4)} seconds")



print("Sorting sequences by consensus distance: ", end='')
start = time()
distance_from_concensus_seq_sorted = distance_from_concensus_seq.sort_values(ascending=False)
end = time()
print(f"{round(end-start, 4)} seconds")

Calculating distances from consensus sequence: 0.832 seconds
Sorting sequences by consensus distance: 0.001 seconds
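
The row-wise `apply` above calls a Python lambda once per sequence. A possible alternative, sketched here on a made-up three-sequence table, is to compare the whole table against the consensus in one step with `DataFrame.ne` and then count mismatches per row. The toy table and its names are assumptions; I have not timed this against the real `position_table`, so treat any speedup as something to measure rather than assume.

```python
import pandas as pd

# Toy stand-in for position_table (already indexed by seqid); values are made up
toy_table = pd.DataFrame(
    {
        "S_1_1": ["A", "A", "A"],
        "S_1_2": ["T", "T", "C"],
        "S_1_3": ["G", "C", "C"],
    },
    index=["seq_a", "seq_b", "seq_c"],
)

# Most common base per column, same idea as the consensus cell above
toy_consensus = toy_table.mode(axis=0).iloc[0]

# Row-wise version, as in the notebook
row_wise = toy_table.apply(lambda row: sum(row != toy_consensus), axis=1)

# Whole-table version: align the consensus with the columns, then count mismatches per row
vectorized = toy_table.ne(toy_consensus, axis=1).sum(axis=1)

print(vectorized.sort_values(ascending=False))
```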


### Select 10 sequences to do our first analysis

In [5]:
print("10 most distant sequences from consensus")
print(distance_from_concensus_seq_sorted[0:10])
subset_seqs = distance_from_concensus_seq_sorted[:10].index

10 most distant sequences from consensus
seqid
MT233522.1    82
MT308696.1    71
MT308694.1    53
MT263453.1    48
MT259284.1    33
MT293180.1    24
MT263436.1    10
MT293224.1    10
MT326129.1    10
MT259277.1    10
dtype: int64


### Construct distance matrices for our sequences

To compare the effects of using different distance algorithms to generate the distance table, I'm going to apply as many as I can find! I'm using the textdistance package since it neatly packages many of the algorithms into objects that are easily interchangeable.
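
To see why the choice of algorithm matters, here is a small pure-Python sketch of the two distances used below. It is written from the textbook definitions rather than copied from textdistance, so edge-case behavior (for example, unequal lengths) may differ from the library. The toy strings are made up: the second is the first shifted by one base, which is the kind of difference an insertion or deletion produces.

```python
def hamming(a, b):
    """Count positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,             # delete x
                    current[j - 1] + 1,          # insert y
                    previous[j - 1] + (x != y),  # substitute, or match at no cost
                )
            )
        previous = current
    return previous[-1]


original = "ATGCATGC"
shifted = "TGCATGCA"  # same bases, shifted left by one

print(hamming(original, shifted), levenshtein(original, shifted))
```

Hamming only compares position by position, so a one-base shift makes every position look different, while Levenshtein accounts for it with two edits. The dynamic-programming table is also why Levenshtein is so much slower: it fills roughly length × length cells per pair instead of making a single pass.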



In [None]:
import textdistance as td
import pandas as pd
from pathlib import Path

LOCAL_DATA_DIR = "./data"
DISTANCE_CSV_SUFFIX = "_distances.csv"
FORCE_DISTANCE_REFRESH = False

# ensure the data folder exists
Path(LOCAL_DATA_DIR).mkdir(parents=True, exist_ok=True)

# We don't want to create objects quite yet so we can get their names
# Note: Hamming will complete relatively quickly, but other algorithms are MANY times slower
dist_algorithms = [td.Hamming, td.Levenshtein]

all_distances = {}

for dist_lib in dist_algorithms:
    this_csv_path = f"{LOCAL_DATA_DIR}/{dist_lib.__name__}{DISTANCE_CSV_SUFFIX}"

    # Only re-calculate distances if the csv doesn't exist or we want to force recalculation
    if FORCE_DISTANCE_REFRESH or not Path(this_csv_path).exists():
        dist_alg = dist_lib()
        print(f"Calculating distances using {dist_lib.__name__} algorithm: ", end='')
        distances = {}
        start = time()
        for i,seqid1 in enumerate(subset_seqs):
            distances[seqid1,seqid1]=0
            for j in range(i+1,len(subset_seqs)):
                seqid2 = subset_seqs[j]
                distances[seqid1,seqid2] = dist_alg.distance(list(position_table.loc[seqid1]), list(position_table.loc[seqid2]))
                distances[seqid2,seqid1] = distances[seqid1,seqid2]
        end = time()
        print(f"{round(end-start, 4)} seconds")
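        distances = pd.Series(distances).unstack()
        distances.to_csv(this_csv_path)
        print(distances.describe())
    else:
        # Reuse the cached table so `distances` is defined even when nothing is recalculated
        distances = pd.read_csv(this_csv_path, index_col=0)

    # Keep every algorithm's table; `distances` itself ends up holding the last one in the list
    all_distances[dist_lib.__name__] = distances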


Calculating distances using Hamming algorithm: 0.1057 seconds
       MT233522.1  MT259277.1  MT259284.1  MT263436.1  MT263453.1  MT293180.1  \
count   10.000000   10.000000   10.000000   10.000000   10.000000   10.000000   
mean    99.500000   36.900000   60.500000   36.900000   73.100000   51.700000   
std     41.317873   34.863065   34.961089   34.863065   37.218424   32.649826   
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   
25%     90.000000    2.000000   43.000000    2.000000   58.000000   32.000000   
50%     97.000000   37.500000   45.000000   37.500000   63.000000   39.500000   
75%    126.250000   61.750000   84.750000   61.750000   96.000000   74.750000   
max    151.000000   90.000000  115.000000   90.000000  130.000000  104.000000   

       MT293224.1  MT308694.1  MT308696.1  MT326129.1  
count   10.000000   10.000000   10.000000   10.000000  
mean    36.900000   67.300000   80.500000   36.900000  
std     34.863065   37.739016   43.55392

### Utilize biopython
For this analysis we'll use a package called biopython: ``pip install biopython``. 

It has its own formats, so we'll need to convert.

In [None]:
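import numpy as np
from Bio.Phylo.TreeConstruction import DistanceMatrix

# DistanceMatrix expects a lower-triangular list of lists (diagonal included),
# so zero out the upper triangle and trim row i to its first i+1 entries.
matrix = np.tril(distances.values).tolist()
for i in range(len(matrix)):
    matrix[i] = matrix[i][:i+1]
dm = DistanceMatrix(list(distances.index), matrix)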
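
The conversion above is easy to get subtly wrong, so here is a dependency-free sketch of the same reshaping on a made-up 3×3 matrix. It also checks that the input is square and symmetric, since taking only the lower triangle would silently hide an asymmetric table. The helper name `to_lower_triangle` is hypothetical, not part of Biopython.

```python
def to_lower_triangle(square):
    """Convert a symmetric square matrix into lower-triangular rows (diagonal included)."""
    n = len(square)
    for i, row in enumerate(square):
        if len(row) != n:
            raise ValueError("matrix is not square")
        for j in range(i):
            if square[i][j] != square[j][i]:
                raise ValueError(f"matrix is not symmetric at ({i}, {j})")
    # Row i keeps its first i + 1 entries
    return [list(square[i][: i + 1]) for i in range(n)]


# Made-up distances between three hypothetical sequences
toy_square = [
    [0, 3, 5],
    [3, 0, 4],
    [5, 4, 0],
]

toy_lower = to_lower_triangle(toy_square)
print(toy_lower)
```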

### Now construct our tree

In [None]:
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)
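
The `nj` call hides the algorithm, so here is a sketch of the first step of textbook neighbor joining: compute Q(i, j) = (n − 2)·d(i, j) − Σ d(i, k) − Σ d(j, k) for every pair, join the pair with the smallest Q, and split their distance into two limb lengths. The five-taxon matrix is made up for illustration, and Biopython's implementation may differ in details such as tie handling, so treat this as a way to reason about the output rather than a description of the library's internals.

```python
from itertools import combinations


def nj_first_join(names, d):
    """Pick the first pair neighbor joining would merge, with its Q value and limb lengths."""
    n = len(names)
    totals = [sum(row) for row in d]
    best_pair, best_q = None, None
    for i, j in combinations(range(n), 2):
        q = (n - 2) * d[i][j] - totals[i] - totals[j]
        if best_q is None or q < best_q:
            best_pair, best_q = (i, j), q
    i, j = best_pair
    # Distance from each joined leaf to the new internal node
    limb_i = d[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    limb_j = d[i][j] - limb_i
    return names[i], names[j], best_q, limb_i, limb_j


# Made-up symmetric distance matrix for five hypothetical taxa
toy_names = ["a", "b", "c", "d", "e"]
toy_d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

first_join = nj_first_join(toy_names, toy_d)
print(first_join)
```

Note that the pair with the smallest raw distance here (d and e, distance 3) is not the first pair joined. The Q correction for how far each taxon is from everything else is what distinguishes neighbor joining from simply clustering the closest pair.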

### Now draw our tree

In [None]:
%matplotlib inline

from Bio import Phylo
tree.ladderize()   # Flip branches so deeper clades are displayed at top
Phylo.draw(tree)

**Please see the guidance at the top of the page for what to try**