# Cartography: the full analysis in one place

## Modules to add to run this code:

- [BioPython][1]
- [Pandas][2]
- [Numpy][3]
- [Altair][4]
- [Seaborn][5]
- [Scikit-Learn][6]
- [UMAP][7]
- json (part of the Python standard library)
- nextstrain-augur
- statsmodels
- SciPy and Matplotlib (both imported below)

[1]:https://biopython.org/wiki/Download
[2]:https://pandas.pydata.org/pandas-docs/version/0.23.3/install.html
[3]:https://docs.scipy.org/doc/numpy/user/quickstart.html
[4]:https://altair-viz.github.io/getting_started/installation.html
[5]:https://seaborn.pydata.org/installing.html
[6]:https://scikit-learn.org/stable/install.html
[7]:https://umap-learn.readthedocs.io/en/latest/


# Imports Section 

In [None]:
import json
import re
import statistics
from pathlib import Path

import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# Import the submodule explicitly so that
# statsmodels.nonparametric.smoothers_lowess.lowess is available below.
import statsmodels.nonparametric.smoothers_lowess
import umap
from augur.utils import json_to_tree
from Bio import SeqIO
from scipy.spatial.distance import squareform, pdist
from scipy.stats import linregress
from sklearn.decomposition import PCA
from sklearn.manifold import MDS
from sklearn.manifold import TSNE

from Helpers import get_euclidean_data_frame, get_hamming_distances, linking_tree_with_plots_brush
from Helpers import linking_tree_with_plots_clickable
from Helpers import scatterplot_xyvalues, scatterplot_tooltips, scatterplot_with_tooltip_interactive

%matplotlib inline

## Pathogen-specific variables

Consider consolidating these into a single configuration file that can be passed to the notebook as a command line argument for more scriptable generation of these figures.
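
A minimal sketch of what such a configuration loader could look like, assuming a JSON file whose keys mirror the variable names used in the next cell. The function name `load_config`, the key list, and the file layout are illustrative assumptions, not part of this notebook or its `Helpers` module.

```python
import json

# Hypothetical keys, chosen to match the variables defined in the next cell.
REQUIRED_KEYS = (
    "path",
    "dropped_strains",
    "distance_matrix_file",
    "pathMeta",
    "tree_path",
    "clades_to_plot",
)


def load_config(config_path):
    """Read a JSON configuration file and check that every expected key is present."""
    with open(config_path) as fh:
        config = json.load(fh)

    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"missing configuration keys: {missing}")

    return config
```

With a loader like this, the pathogen-specific cell would reduce to reading values such as `config["path"]` and `config["clades_to_plot"]` from one file per pathogen.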

In [None]:
path = "Data/aligned_cdc_h3n2_ha_2y_cell_hi.fasta"
dropped_strains = ["A/Usl/3850/2019","A/Navarra/2284/2018", "A/Austria/1123467/2019", "A/Austria/1127203/2019", "A/Catalonia/NSVH100995626/2019", "A/Ontario/RV2296/2019", "A/Sydney/781/2019", 'A/SouthAustralia/1034/2019','A/Usk/3850/2019']
distance_matrix_file = "DistanceMatrixFlu.csv"
pathMeta = Path.cwd().joinpath("Data", "metadata_h3n2_ha.tsv")
tree_path = 'Data/flu_seasonal_h3n2_ha_2y_tree.json'

clades_to_plot = ['3c3.A', 'A1b/131K', 'A1b/135K', 'A2/re', 'A2', 'A3']

### Reading in the Fasta File
- I used BioPython to parse the FASTA file into two Python lists: `genomes` and `strains`.

In [None]:
#work on making this work
strains = []
genomes = []
for record in SeqIO.parse(path, "fasta"):
    if(record.id not in dropped_strains):
        strains.append(str(record.id))
        genomes.append(str(record.seq))

#### Checking that the file I picked is an aligned FASTA file and is the file I wanted

In [None]:
print(len(strains))
# Every sequence in an alignment must have the same length.
print(len(set(len(genome) for genome in genomes)) == 1)
print(len(genomes))

In [None]:
strains[:5]

# Creating the Distance Matrix
- I used Hamming distance to compute the pairwise distance between every pair of genomes, which produces a distance matrix.
    - In my Hamming distance method, I only counted a site as different if it was a mismatch between nucleotides (A, G, C, or T), not gaps (as gaps were throwing off the algorithm too much for shorter sequences).
- I then used Seaborn to generate a heatmap to make sure the matrix looked correct.
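
The gap-ignoring distance described above can be sketched in pure Python as follows. This is an illustration only, not the code in `Helpers.get_hamming_distances` (which is not shown in this notebook); the names `hamming_ignoring_gaps` and `pairwise_condensed` are hypothetical. It assumes uppercase sequences of equal length and returns one distance per pair (i < j), the condensed layout that `squareform` expands into a square matrix.

```python
from itertools import combinations

NUCLEOTIDES = frozenset("ACGT")


def hamming_ignoring_gaps(seq_a, seq_b):
    """Count sites where both sequences have A, C, G, or T and the bases differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a in NUCLEOTIDES and b in NUCLEOTIDES and a != b
    )


def pairwise_condensed(sequences):
    """Return distances for all pairs (i < j) in row-major order."""
    return [hamming_ignoring_gaps(a, b) for a, b in combinations(sequences, 2)]


example = ["ACGT-A", "ACGTTA", "AGGTNC"]
print(pairwise_condensed(example))  # prints [0, 2, 2]
```

The first pair differs only at a gap, so its distance is zero; sites containing `-` or `N` never contribute to the count.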

In [None]:
%%time
# Try to load an existing distance matrix. Create it, if it doesn't already exist.
try:
    # The index should be the first column and correspond to strain name for the row.
    similarity_matrix = pd.read_csv(distance_matrix_file, index_col=0)
    print("Loaded existing distance matrix")
except FileNotFoundError:
    print("Could not find existing distance matrix, creating it now", end="...")
    
    # Calculate Hamming distances.
    hamming_distances = get_hamming_distances(genomes)
    
    # Convert distinct pairwise distances into the more redundant but more interpretable square matrix.
    similarity_matrix = squareform(hamming_distances)
    
    # Convert the numpy matrix to a pandas data frame with strain annotations for rows and columns.
    similarity_matrix = pd.DataFrame(
        similarity_matrix,
        columns=strains,
        index=strains
    )
    
    # Write out the resulting data frame to cache distance calculations.
    # Keep the index in the output file, so it is immediately available on read.
    similarity_matrix.to_csv(distance_matrix_file)
    print("done!")

In [None]:
similarity_matrix.head()

In [None]:
sns.heatmap(similarity_matrix)

# Reading in the Metadata
- The metadata provide the region, country, etc. of each strain. These data are used to color the clusters.
- The metadata contain all of the sampled strains, so many of them will not appear among the genomes in the aligned file, probably because those sequences were corrupted, too short, etc.
- We filter the metadata down to the strains in the aligned file. The number of rows that remain should equal the number of strains in the aligned file.

In [None]:
#merging my final dataframe with their regions and strain names
metadata_df = pd.read_csv(pathMeta, delimiter='\t')

In [None]:
metadata_df.head()

In [None]:
metadata_df.shape

In [None]:
len(strains)

In [None]:
# Keep only the metadata for the current strains.
result_strains = metadata_df[metadata_df["strain"].isin(strains)].copy()

In [None]:
#checking that no strains were lost
result_strains.shape

In [None]:
result_strains.head()

In [None]:
# Confirm that all current strains have metadata.
# If this list is nonzero in length, metadata are missing.
assert len(np.setdiff1d(np.unique(strains), metadata_df['strain'].unique())) == 0

# Creating the Phylogenetic Tree in Altair
- I used Altair to make this tree (documentation linked [here][1]).
- I opened and imported the JSON from a Nextstrain build ([flu][2], [zika][3], etc.).
- The data from the JSON and the data from the tree are usually a little different, so you may get some errors after merging the two data frames.

[1]: https://altair-viz.github.io/index.html
[2]: https://github.com/nextstrain/seasonal-flu
[3]: https://github.com/nextstrain/zika
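
One way to see where two sources disagree before merging them is to compare their strain names as sets. The sketch below is a hypothetical helper (`compare_strain_sets` is not part of `Helpers`), and the strain names in the example are made up.

```python
def compare_strain_sets(left_strains, right_strains):
    """Report which strain names are shared and which appear on one side only."""
    left = set(left_strains)
    right = set(right_strains)

    return {
        "shared": sorted(left & right),
        "left_only": sorted(left - right),
        "right_only": sorted(right - left),
    }


# Made-up names for illustration.
report = compare_strain_sets(
    ["strain_a", "strain_b", "strain_c"],
    ["strain_b", "strain_c", "strain_d"],
)
print(report["left_only"], report["right_only"])
```

An inner merge on `strain` keeps only the `shared` names, so non-empty `left_only` or `right_only` lists explain any rows that disappear after merging.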

In [None]:
with open(tree_path) as fh:
    json_tree_handle = json.load(fh)

In [None]:
tree = json_to_tree(json_tree_handle)

In [None]:
tree

In [None]:
node_data = [
    {
        "strain": node.name,
        "date": node.attr["num_date"],
        "y": node.yvalue,
        "region": node.attr["region"],
        "country": node.attr["country"],
        # Fall back to the node's own values when it has no parent.
        # A conditional expression avoids the `and ... or` idiom, which picks the
        # wrong branch whenever the parent's value is falsy (for example, 0).
        "parent_date": node.parent.attr["num_date"] if node.parent is not None else node.attr["num_date"],
        "parent_y": node.parent.yvalue if node.parent is not None else node.yvalue,
        "clade_membership": node.attr["clade_membership"]
    }
    for node in tree.find_clades(terminal=True)
]

In [None]:
node_data[10]

In [None]:
node_df = pd.DataFrame(node_data)

In [None]:
node_df.head()

In [None]:
node_df["y"] = node_df["y"].max() - node_df["y"]

In [None]:
node_df["parent_y"] = node_df["parent_y"].max() - node_df["parent_y"]

In [None]:
node_df.shape

In [None]:
node_df.head()

In [None]:
node_df["region"].unique()

In [None]:
# Reannotate clades that we aren't interested in as "other" to simplify color assignment in visualizations.
node_df["clade_membership_color"] = node_df["clade_membership"].apply(lambda clade: clade if clade in clades_to_plot else "other")

In [None]:
node_df.head()

In [None]:
node_df.merge(metadata_df, on="strain")

## Checking for Outliers in Pairwise Distance

In [None]:
mean_distances = similarity_matrix.mean().reset_index(name="mean_distance").rename(columns={"index": "strain"})

In [None]:
mean_distances.head()

In [None]:
alt.Chart(mean_distances, height=150).mark_boxplot().encode(
    x = alt.X('mean_distance', title="mean of pairwise distances"),
    tooltip = ["strain"]
)

# Running PCA on Scaled and Centered Data
- I treated each nucleotide position as a "site", or dimension, and found the probability of having a certain nucleotide given the frequency of that letter at that site.
- I used [this paper][1] as my source.
- The equation is as follows, where C is the matrix of dimensions, M is the mean, and p is the frequency of a nucleotide at that given site.
![](https://journals.plos.org/plosgenetics/article/file?type=thumbnail&id=info:doi/10.1371/journal.pgen.0020190.e003)

[1]: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020190
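
A rough sketch of one reading of that normalization: center each site by its mean and scale it by sqrt(p(1 - p)). It assumes a haploid 0/1 encoding in which 1 marks a chosen allele at a site, so that the column mean equals the frequency p. Both the encoding and the function name `normalize_sites` are assumptions made for illustration; the cell that follows instead passes integer-coded nucleotides to PCA directly.

```python
import numpy as np


def normalize_sites(binary_matrix):
    """Center each polymorphic site and scale it by sqrt(p * (1 - p)).

    `binary_matrix` is a (samples x sites) array of 0/1 values, where 1 marks
    the presence of a chosen allele at that site (an assumed encoding).
    """
    matrix = np.asarray(binary_matrix, dtype=float)
    freq = matrix.mean(axis=0)             # p: allele frequency per site
    polymorphic = (freq > 0) & (freq < 1)  # sites without variation carry no signal
    matrix = matrix[:, polymorphic]
    freq = freq[polymorphic]

    return (matrix - freq) / np.sqrt(freq * (1 - freq))


sites = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
])
normalized = normalize_sites(sites)
print(normalized.shape)  # the first site is monomorphic and is dropped
```

After this step every retained site has mean zero and unit variance, so rare and common variants contribute on the same scale.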

In [None]:
numbers = genomes[:]
for i in range(0,len(genomes)):
    numbers[i] = re.sub(r'[^AGCT]', '5', numbers[i])
    numbers[i] = list(numbers[i].replace('A','1').replace('G','2').replace('C', '3').replace('T','4'))
    numbers[i] = [int(j) for j in numbers[i]]
genomes_df = pd.DataFrame(numbers)
genomes_df.columns = ["Site " + str(k) for k in range(0,len(numbers[i]))]

In [None]:
genomes_df.head()

In [None]:
#performing PCA on my pandas dataframe 
pca = PCA(n_components=10,svd_solver='full') # keep the first 10 components; pass n_components=None to keep all of them
principalComponents = pca.fit_transform(genomes_df)

In [None]:
# Create a data frame from the PCA embedding.
principalDf = pd.DataFrame(data = principalComponents, columns = ["PCA" + str(i) for i in range(1,11)])

# Annotate rows by their original strain names. PCA rows are in the same order as
# the `genomes` rows which are in the same order as the `strains` rows.
principalDf["strain"] = strains

In [None]:
df = pd.concat([pd.DataFrame(np.arange(1,11)), pd.DataFrame([round(pca.explained_variance_ratio_[i],4) for i in range(0,len(pca.explained_variance_ratio_))])], axis = 1)
df.columns = ['principal components','explained variance']
df

In [None]:
alt.Chart(df).mark_point().encode(
    x='principal components:Q',
    y='explained variance:Q'
)

In [None]:
merged_pca_df = principalDf.merge(node_df, on="strain")

In [None]:
merged_pca_df.head()

In [None]:
list_of_chart = linking_tree_with_plots_brush(merged_pca_df,['PCA1','PCA2','PCA3','PCA4'],
                                         ['PCA1 (Explained Variance : {}%'.format(round(pca.explained_variance_ratio_[0]*100,2)) + ")",
                                          'PCA2 (Explained Variance : {}%'.format(round(pca.explained_variance_ratio_[1]*100,2)) + ")",
                                          'PCA3 (Explained Variance : {}%'.format(round(pca.explained_variance_ratio_[2]*100,2)) + ")",
                                          'PCA4 (Explained Variance : {}%'.format(round(pca.explained_variance_ratio_[3]*100,2)) + ")"],
                                         "clade_membership:N",['strain','region'])
chart = list_of_chart[0]|list_of_chart[1]|list_of_chart[2]
chart.save("../Docs/PCAFluBrush.html")
chart

In [None]:
PCA_violin_df = get_euclidean_data_frame(merged_pca_df, "PCA1", "PCA2", "PCA")
g = sns.FacetGrid(
    PCA_violin_df,
    col="embedding",
    col_wrap=3,
    sharey=False,
    height=4
)
g = g.map(sns.violinplot, "clade_status", "distance", order=["within", "between"])
g.set_axis_labels("Clade status", "Distance")

In [None]:
total_df = scatterplot_xyvalues(strains, similarity_matrix, merged_pca_df, "PCA1", "PCA2", "PCA")
y_values = statsmodels.nonparametric.smoothers_lowess.lowess(
    total_df["euclidean"],
    total_df["genetic"],
    frac=0.6666666666666666,
    it=3,
    delta=0.0,
    is_sorted=False,
    missing='drop',
    return_sorted=True
)

PD_Y_values = pd.DataFrame(y_values)
PD_Y_values.columns = ["LOWESS_x", "LOWESS_y"]

regression = linregress(total_df["genetic"], total_df["euclidean"])
slope, intercept, r_value, p_value, std_err = regression

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(6, 6))

ax.plot(total_df["genetic"], total_df["euclidean"], "o", alpha=0.25)
ax.plot(PD_Y_values["LOWESS_x"], PD_Y_values["LOWESS_y"], label="LOESS")

ax.set_xlabel("Genetic distance")
ax.set_ylabel("Euclidean distance (PCA)")
ax.set_title("PCA Euclidean distance vs. genetic distance ($R^2=%.3f$)" % (r_value ** 2))

sns.despine()

# Running MDS on the Dataset

In [None]:
embedding = MDS(n_components=10,metric=True,dissimilarity='precomputed')
X_transformed = embedding.fit_transform(similarity_matrix)

In [None]:
raw_stress = embedding.stress_
normalized_stress = np.sqrt(raw_stress /((similarity_matrix.values.ravel() ** 2).sum() / 2))
print(normalized_stress.round(2))

In [None]:
MDS_df = pd.DataFrame(X_transformed,columns=['MDS' + str(i) for i in range(1,11)])

In [None]:
# Annotate rows by their original strain names. The same logic from PCA holds here
# and for later embeddings.
MDS_df["strain"] = strains

In [None]:
merged_df = MDS_df.merge(node_df, on="strain")

In [None]:
merged_df.head()

In [None]:
chart_12_mds = scatterplot_with_tooltip_interactive(merged_df,'MDS1','MDS2',"MDS1","MDS2",['strain','clade_membership'],'clade_membership_color')
chart_34_mds = scatterplot_with_tooltip_interactive(merged_df,'MDS3','MDS4',"MDS3","MDS4",['strain','clade_membership'],'clade_membership_color')
chart_56_mds = scatterplot_with_tooltip_interactive(merged_df,'MDS5','MDS6',"MDS5","MDS6",['strain','clade_membership'],'clade_membership_color')
chart_12_mds|chart_34_mds|chart_56_mds

In [None]:
chart_12_scatter = scatterplot_tooltips(strains, similarity_matrix, merged_df, "MDS1", "MDS2", "MDS", 4000)
chart_34_scatter = scatterplot_tooltips(strains, similarity_matrix, merged_df, "MDS3", "MDS4", "MDS", 4000)
chart_56_scatter = scatterplot_tooltips(strains, similarity_matrix, merged_df, "MDS5", "MDS6", "MDS", 4000)
chart_12_scatter | chart_34_scatter | chart_56_scatter

In [None]:
MDS_violin_df = get_euclidean_data_frame(merged_df, "MDS1", "MDS2", "MDS")
g = sns.FacetGrid(
    MDS_violin_df,
    col="embedding",
    col_wrap=3,
    sharey=False,
    height=4
)
g = g.map(sns.violinplot, "clade_status", "distance", order=["within", "between"])
g.set_axis_labels("Clade status", "Distance")

In [None]:
# TODO: replace with matplotlib plot as in PCA above.
#LOESS_scatterplot(strains, similarity_matrix, merged_df, "MDS1", "MDS2", "MDS", 4000)

In [None]:
list_of_data_and_titles = ['MDS1','MDS2','MDS3','MDS4','MDS5','MDS6']
list_of_chart = linking_tree_with_plots_brush(
    merged_df,
    list_of_data_and_titles,
    list_of_data_and_titles,
    'clade_membership_color',
    ["clade_membership","strain:N"]
)
chart = list_of_chart[0]|list_of_chart[1]|list_of_chart[2]|list_of_chart[3]
chart

In [None]:
chart.save("../Docs/MDSFluBrush.html")

# Running T-SNE on the Dataset 

In [None]:
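# init="random" is needed with metric='precomputed'; the PCA initialization that
# newer scikit-learn versions use by default cannot be combined with a distance matrix.
embedding = TSNE(n_components=2,metric='precomputed',perplexity = 25.95,init="random")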
X_transformed = embedding.fit_transform(similarity_matrix)

In [None]:
TSNE_df = pd.DataFrame(X_transformed,columns=['TSNE' + str(i) for i in range(1,3)])

In [None]:
TSNE_df["strain"] = strains

In [None]:
TSNE_df.head()

In [None]:
merged_df = TSNE_df.merge(node_df, on="strain")

In [None]:
chart = scatterplot_tooltips(strains, similarity_matrix, merged_df, "TSNE1", "TSNE2", "TSNE", 4000)
chart

In [None]:
scatterplot_with_tooltip_interactive(merged_df,'TSNE1','TSNE2',"TSNE1","TSNE2",['strain','clade_membership'],'clade_membership_color')

In [None]:
list_of_chart = linking_tree_with_plots_brush(
    merged_df,
    ['TSNE1','TSNE2'],
    ['TSNE1','TSNE2'],
    'clade_membership_color',
    ["clade_membership:N","strain:N"]
)
chart = list_of_chart[0]|list_of_chart[1]
chart

In [None]:
TSNE_violin_df = get_euclidean_data_frame(merged_df, "TSNE1", "TSNE2", "TSNE")
g = sns.FacetGrid(
    TSNE_violin_df,
    col="embedding",
    col_wrap=3,
    sharey=False,
    height=4
)
g = g.map(sns.violinplot, "clade_status", "distance", order=["within", "between"])
g.set_axis_labels("Clade status", "Distance")

In [None]:
# TODO: replace with matplotlib as in PCA
#LOESS_scatterplot(strains, similarity_matrix, merged_df, "TSNE1", "TSNE2", "TSNE", 4000)

# Running UMAP on the Dataset

In [None]:
reducer = umap.UMAP(n_neighbors=200,
        min_dist=.05,
        n_components=2,
        init="spectral")
embedding = reducer.fit_transform(similarity_matrix)

In [None]:
UMAP_df = pd.DataFrame(embedding,columns=['UMAP' + str(i) for i in range(1,3)])

In [None]:
UMAP_df["strain"] = strains

In [None]:
UMAP_df.head()

In [None]:
merged_df = UMAP_df.merge(node_df, on="strain")

In [None]:
merged_df.head()

In [None]:
merged_df.shape

In [None]:
chart = scatterplot_tooltips(strains, similarity_matrix, merged_df, "UMAP1", "UMAP2", "UMAP", 4000)
chart

In [None]:
scatterplot_with_tooltip_interactive(merged_df,'UMAP1','UMAP2',"UMAP1","UMAP2",['strain','clade_membership'],'clade_membership_color')

In [None]:
list_of_data_and_titles = ['UMAP1','UMAP2']
list_of_chart = linking_tree_with_plots_brush(
    merged_df,
    list_of_data_and_titles,
    list_of_data_and_titles,
    'clade_membership_color',
    ["clade_membership","strain:N"]
)
chart = list_of_chart[0]|list_of_chart[1]
chart

In [None]:
chart.save("../Docs/UMAPFluBrush.html")

In [None]:
UMAP_violin_df = get_euclidean_data_frame(merged_df, "UMAP1", "UMAP2", "UMAP")
g = sns.FacetGrid(
    UMAP_violin_df,
    col="embedding",
    col_wrap=3,
    sharey=False,
    height=4
)
g = g.map(sns.violinplot, "clade_status", "distance", order=["within", "between"])
g.set_axis_labels("Clade status", "Distance")

In [None]:
# TODO: replace with matplotlib as in PCA
#LOESS_scatterplot(strains, similarity_matrix, merged_df, "UMAP1", "UMAP2", "UMAP", 4000)

# Linking all plots together clickable with Tree

In [None]:
merged_df = node_df.merge(
    principalDf,
    on="strain"
).merge(
    MDS_df,
    on="strain"
).merge(
    TSNE_df,
    on="strain"
).merge(
    UMAP_df,
    on="strain"
)

In [None]:
merged_df.shape

In [None]:
merged_df.head()

In [None]:
data = linking_tree_with_plots_clickable(
    merged_df,
    ['MDS1', 'MDS2','TSNE1', 'TSNE2', 'PCA1', 'PCA2', 'UMAP1', 'UMAP2'],
    ['MDS1', 'MDS2', 'TSNE1', 'TSNE2','PCA1 (Explained Variance : {}%'.format(round(pca.explained_variance_ratio_[0]*100,2)) + ")",
    'PCA2 (Explained Variance : {}%'.format(round(pca.explained_variance_ratio_[1]*100,2)) + ")",'UMAP1','UMAP2'],
    'clade_membership_color:N',
    ['clade_membership'],
    ['strain','clade_membership']
)

In [None]:
PCAMDS = data[3]|data[1]|data[5]
TSNEUMAP = data[2]|data[4]
embeddings = alt.vconcat(PCAMDS,TSNEUMAP)
embeddings
fullChart = alt.hconcat(data[0],embeddings)
fullChart.save('../Docs/FullLinkedChartClickable.html')
fullChart

In [None]:
#PCA eigenvectors factoring in - how do I do that
# NOTE: scatterplot_tooltips_df is not among the Helpers imports at the top of
# this notebook and must be imported before this cell will run.
data_frames = [
    scatterplot_tooltips_df(similarity_matrix, merged_df, "PCA1", "PCA2", "PCA"),
    scatterplot_tooltips_df(similarity_matrix, merged_df, "MDS1", "MDS2", "MDS"),
    scatterplot_tooltips_df(similarity_matrix, merged_df, "TSNE1", "TSNE2", "TSNE"),
    scatterplot_tooltips_df(similarity_matrix, merged_df, "UMAP1", "UMAP2", "UMAP"),
]

In [None]:
merged_index = np.array(sorted(merged_df.index.values))

In [None]:
genetic_distances = squareform(similarity_matrix.values[merged_index][:, merged_index])

In [None]:
data_frames.append(pd.DataFrame({
    "distance": genetic_distances,
    "embedding":"genetic"
}))

In [None]:
len(data_frames)

In [None]:
euclidean_data_frame = pd.concat(data_frames, sort=False)

In [None]:
euclidean_data_frame.head()

In [None]:
euclidean_data_frame.shape

In [None]:
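# Use names that do not shadow the imported PCA, MDS, and TSNE classes, and call
# scatterplot_tooltips with the same arguments as in the sections above.
pca_scatter = scatterplot_tooltips(strains, similarity_matrix, merged_df, "PCA1", "PCA2", "PCA", 4000)
mds_scatter = scatterplot_tooltips(strains, similarity_matrix, merged_df, "MDS1", "MDS2", "MDS", 4000)
tsne_scatter = scatterplot_tooltips(strains, similarity_matrix, merged_df, "TSNE1", "TSNE2", "TSNE", 4000)
umap_scatter = scatterplot_tooltips(strains, similarity_matrix, merged_df, "UMAP1", "UMAP2", "UMAP", 4000)

chart = pca_scatter|mds_scatter|tsne_scatter|umap_scatter
chart.save('../Docs/FullScatterplot.html')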

In [None]:
chart

# Notes to Self:

- Collapse cells underneath Markdown headers
- Get docstrings above methods to show up when user presses SHIFT + TAB
- link back to the methods section for user each time method is used
- Run more Flu builds (3 years, 6 years, 12 years, H3, H1)
- Try algorithm on MERS
- Try algorithm on other bacterial genomes (unaligned / small snips of genomes)
- Make Zika clades or run automatic clade naming on 12y H3N2 flu with different cutoffs to standardize coloring for graphs
- Write a paper


## Within- and between-clade Euclidean distances for all embeddings

Use the complete embedding data frame to calculate pairwise Euclidean distances between samples and plot the results in a single figure.
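
The idea behind this comparison can be sketched without any of the notebook's helpers: compute the Euclidean distance for every pair of samples, label each pair as within or between clades, and compare the medians. The function names below are hypothetical and the four points are toy data; `get_euclidean_data_frame` in `Helpers` may be implemented differently.

```python
from itertools import combinations
from math import dist
from statistics import median


def pairwise_euclidean_by_clade(points, clades):
    """Return (distance, clade_status) for every pair of samples (i < j)."""
    records = []
    for i, j in combinations(range(len(points)), 2):
        status = "within" if clades[i] == clades[j] else "between"
        records.append((dist(points[i], points[j]), status))
    return records


def between_within_ratio(records):
    """Ratio of the median between-clade distance to the median within-clade distance."""
    within = [d for d, status in records if status == "within"]
    between = [d for d, status in records if status == "between"]
    return median(between) / median(within)


# Toy embedding: two tight clades that sit far apart.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
clades = ["A2", "A2", "A3", "A3"]
records = pairwise_euclidean_by_clade(points, clades)
print(between_within_ratio(records))
```

A ratio well above 1 means that samples from the same clade sit closer together in the embedding than samples from different clades.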

In [None]:
data_frames = [
    get_euclidean_data_frame(merged_df, "PCA1", "PCA2", "PCA"),
    get_euclidean_data_frame(merged_df, "MDS1", "MDS2", "MDS"),
    get_euclidean_data_frame(merged_df, "TSNE1", "TSNE2", "t-SNE"),
    get_euclidean_data_frame(merged_df, "UMAP1", "UMAP2", "UMAP"),
]

Extract pairwise genetic (Hamming) distances corresponding to the records sampled above. This step assumes that the original merged data frame is indexed from zero to N for N total samples in the same order as the similarity matrix.
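
The ordering assumption matters because a condensed distance vector stores one value per pair (i < j) in row-major order, which is the layout `squareform` produces from a square matrix. A small pure-Python sketch of that layout, using hypothetical helper names and a made-up 3 x 3 matrix:

```python
from itertools import combinations


def condensed_from_square(square):
    """Flatten the upper triangle of a square distance matrix in row-major order."""
    n = len(square)
    return [square[i][j] for i, j in combinations(range(n), 2)]


def pair_labels(names):
    """Name the pairs in the same order as the condensed distances."""
    return list(combinations(names, 2))


square = [
    [0, 1, 2],
    [1, 0, 3],
    [2, 3, 0],
]
names = ["a", "b", "c"]
print(condensed_from_square(square))  # prints [1, 2, 3]
print(pair_labels(names))
```

If the rows of the embedding data frame are in a different order from the rows of the distance matrix, the k-th genetic distance and the k-th Euclidean distance describe different pairs, and the `clade_status` labels borrowed from the first data frame no longer line up.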

In [None]:
genetic_distances = squareform(similarity_matrix)

In [None]:
data_frames.append(pd.DataFrame({
    "distance": genetic_distances,
    "clade_status": data_frames[0]["clade_status"].values,
    "embedding": "genetic"
}))

In [None]:
len(data_frames)

In [None]:
euclidean_data_frame = pd.concat(data_frames)

In [None]:
g = sns.FacetGrid(
    euclidean_data_frame,
    col="embedding",
    col_wrap=3,
    col_order=["genetic", "PCA", "MDS", "t-SNE", "UMAP"],
    sharey=False,
    height=4
)
g = g.map(sns.violinplot, "clade_status", "distance", order=["within", "between"])
g.set_axis_labels("Clade status", "Distance")

plt.savefig("../Docs/FullViolinPlot.png")

In [None]:
PCA_df = euclidean_data_frame[euclidean_data_frame.embedding == "PCA"]
MDS_df = euclidean_data_frame[euclidean_data_frame.embedding == "MDS"]
TSNE_df = euclidean_data_frame[euclidean_data_frame.embedding == "t-SNE"]
UMAP_df = euclidean_data_frame[euclidean_data_frame.embedding == "UMAP"]
genetic_df = euclidean_data_frame[euclidean_data_frame.embedding == "genetic"]

In [None]:
genetic = alt.Chart(genetic_df,width=300).mark_boxplot().encode(
    x="clade_status:N",
    y="distance:Q"
).properties(title="genetic")
PCA = alt.Chart(PCA_df,width=200).mark_boxplot().encode(
    x="clade_status:N",
    y="distance:Q"
).properties(title="PCA")
MDS = alt.Chart(MDS_df,width=200).mark_boxplot().encode(
    x="clade_status:N",
    y="distance:Q"
).properties(title="MDS",width=200)
TSNE = alt.Chart(TSNE_df).mark_boxplot().encode(
    x="clade_status:N",
    y="distance:Q"
).properties(title="TSNE",width=200)
UMAP = alt.Chart(UMAP_df,width=200).mark_boxplot().encode(
    x="clade_status:N",
    y="distance:Q"
).properties(title="UMAP")

chart = genetic|PCA|MDS|TSNE|UMAP

In [None]:
chart.save('../Docs/FullBoxplot.html')

In [None]:
# Ratio of within- vs. between-clade distances (within should be shorter than between); label which metrics and the significance level

In [None]:
# Select the numeric column explicitly; the other columns hold strings.
genetic_df["distance"].mean()

In [None]:
# Take the median of the numeric "distance" column only; the other columns hold strings.
median_genetic_within = genetic_df.loc[genetic_df.clade_status == "within", "distance"].median()
median_genetic_between = genetic_df.loc[genetic_df.clade_status == "between", "distance"].median()

In [None]:
median_PCA_within = PCA_df.loc[PCA_df.clade_status == "within", "distance"].median()
median_PCA_between = PCA_df.loc[PCA_df.clade_status == "between", "distance"].median()

In [None]:
median_MDS_within = MDS_df.loc[MDS_df.clade_status == "within", "distance"].median()
median_MDS_between = MDS_df.loc[MDS_df.clade_status == "between", "distance"].median()

In [None]:
median_TSNE_within = TSNE_df.loc[TSNE_df.clade_status == "within", "distance"].median()
median_TSNE_between = TSNE_df.loc[TSNE_df.clade_status == "between", "distance"].median()

In [None]:
median_UMAP_within = UMAP_df.loc[UMAP_df.clade_status == "within", "distance"].median()
median_UMAP_between = UMAP_df.loc[UMAP_df.clade_status == "between", "distance"].median()

In [None]:
def ratioFunction(num1, num2):
    """Return num1 / num2 truncated to an integer (for example, 2.9 becomes 2)."""
    ratio12 = int(num1/num2)
    return ratio12

In [None]:
genetic_ratio = ratioFunction(median_genetic_between,median_genetic_within)
print(genetic_ratio)

In [None]:
PCA_ratio = ratioFunction(median_PCA_between,median_PCA_within)
print(PCA_ratio)

In [None]:
MDS_ratio = ratioFunction(median_MDS_between,median_MDS_within)
print(MDS_ratio)

In [None]:
TSNE_ratio = ratioFunction(median_TSNE_between,median_TSNE_within)
print(TSNE_ratio)

In [None]:
UMAP_ratio = ratioFunction(median_UMAP_between,median_UMAP_within)
print(UMAP_ratio)