# Introduction to PhyloPandas

Let me introduce you to PhyloPandas, a Pandas DataFrame and interface for phylogenetics.

In [None]:
import pandas as pd
import phylopandas as ph

## Reading data

PhyloPandas comes with several `read_` functions that load phylogenetic data into a Pandas DataFrame.

Check out the various formats by hitting `tab` after `read` in the cell below.

In [None]:
ph.read

Try reading some of the sequence files in the `data` folder.

In [None]:
with open('data/PF08793_seed.fasta', 'r') as f:
    print(f.read())

In [None]:
ph.read_fasta('data/PF08793_seed.fasta')

In [None]:
ph.read_phylip('data/PF08793_seed.phylip')

In [None]:
ph.read_clustal('data/PF08793_seed.clustal')
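
To make the "one row per sequence" idea concrete, here is a minimal sketch of a FASTA reader in plain Python. It is not PhyloPandas' actual implementation; the function name `parse_fasta` and the keys `id`, `description`, and `sequence` are assumptions chosen for illustration.

```python
import io


def parse_fasta(handle):
    """Return one dict per FASTA record (hypothetical helper, not the PhyloPandas API)."""
    records = []
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line starts a new record; the first token is used as the id.
            header = line[1:]
            parts = header.split(None, 1)
            current = {
                "id": parts[0] if parts else "",
                "description": header,
                "sequence": "",
            }
            records.append(current)
        elif current is not None:
            # Sequence lines are concatenated until the next header.
            current["sequence"] += line
    return records


# A small in-memory example instead of a file on disk.
example = io.StringIO(">seq1 first example\nMKV-LA\nGGT\n>seq2\nMKVQLA\n")
rows = parse_fasta(example)
for row in rows:
    print(row)
```

Each dict here plays the role of one DataFrame row, which is why the different `read_` functions can all return the same kind of table.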

## Writing data

PhyloPandas attaches a `phylo` accessor to the standard Pandas DataFrame. Inside this accessor are writing methods that follow Pandas' `to_` naming, allowing you to write to a range of sequence formats.

To quickly see the writing functions, hit `tab` after `to_` in the cell below.

In [None]:
df = ph.read_fasta('data/PF08793_seed.fasta')
df

In [None]:
df.phylo.to_

Let's write the DataFrame back out to FASTA. If you don't give a filename, the method returns the FASTA text as a string.

In [None]:
s = df.phylo.to_fasta()
print(s)
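
If you are curious how an accessor like this can be attached to a DataFrame, here is a minimal sketch using Pandas' `register_dataframe_accessor`. The accessor name `toyphylo`, the class `ToyPhyloAccessor`, and its `to_fasta` signature are all made up for this example; PhyloPandas' real accessor may be organized differently.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("toyphylo")
class ToyPhyloAccessor:
    """Toy accessor (hypothetical) that mimics the 'return a string if no filename' behavior."""

    def __init__(self, pandas_obj):
        self._df = pandas_obj

    def to_fasta(self, filename=None, sequence_col="sequence"):
        # Build the FASTA text: a header line and a sequence line per row.
        lines = []
        for _, row in self._df.iterrows():
            lines.append(f">{row['id']}")
            lines.append(row[sequence_col])
        text = "\n".join(lines) + "\n"
        # Without a filename, hand the text back to the caller.
        if filename is None:
            return text
        # With a filename, write the text to disk instead.
        with open(filename, "w") as f:
            f.write(text)
        return None


toy = pd.DataFrame({"id": ["seq1", "seq2"], "sequence": ["MKVLA", "MKVQA"]})
fasta_text = toy.toyphylo.to_fasta()
print(fasta_text)
```

Registering the class under a name makes `toy.toyphylo` available on every DataFrame, which is the same pattern that lets `df.phylo` sit alongside the usual Pandas methods.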

## Converting between formats

Of course, this means you can easily convert between sequence formats: read one format into a DataFrame, then write it out as another.

In [None]:
df = ph.read_phylip('data/PF08793_seed.phylip')

fasta_str = df.phylo.to_fasta()

print(fasta_str)
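
The same read-then-write idea can be sketched without any library. The example below assumes a sequential, relaxed PHYLIP layout (a header line with the counts, then one whitespace-separated name and sequence per line); interleaved or strict 10-character-name files are not handled. The helper names `read_relaxed_phylip` and `records_to_fasta` are hypothetical.

```python
def read_relaxed_phylip(text):
    """Parse sequential, relaxed PHYLIP text into a list of records (illustrative only)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The first line holds the number of taxa and the alignment length.
    n_taxa, n_chars = (int(x) for x in lines[0].split())
    records = []
    for line in lines[1:1 + n_taxa]:
        name, sequence = line.split(None, 1)
        sequence = sequence.replace(" ", "")
        if len(sequence) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters, got {len(sequence)}")
        records.append({"id": name, "sequence": sequence})
    return records


def records_to_fasta(records):
    """Serialize records to FASTA text."""
    return "".join(f">{rec['id']}\n{rec['sequence']}\n" for rec in records)


phylip_text = "2 5\nseq1 MKVLA\nseq2 MKVQA\n"
records = read_relaxed_phylip(phylip_text)
converted = records_to_fasta(records)
print(converted)
```

The list of records in the middle stands in for the DataFrame: once the data is in that common shape, any writer can take over.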

## Reading Tree Data

PhyloPandas can also read in phylogenetic tree data.

In [None]:
with open('data/PF08793_seed.newick', 'r') as f:
    print(f.read())

In [None]:
ph.read_newick('data/PF08793_seed.newick')
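
How does a nested Newick string become a flat table? Here is a rough sketch of one way to do it: walk the string recursively and emit one row per node, each pointing at its parent. This is not PhyloPandas' parser. The function name `newick_to_rows`, the generated `node0`, `node1`, ... ids, and the `parent`, `label`, and `length` keys are assumptions, and quoted labels and comments are not handled.

```python
def newick_to_rows(newick):
    """Flatten a simple Newick string into one row (dict) per node (illustrative only)."""
    text = newick.strip().rstrip(";")
    rows = []
    pos = 0
    counter = 0

    def parse_node(parent):
        nonlocal pos, counter
        row = {"id": None, "type": "leaf", "parent": parent, "label": "", "length": None}
        rows.append(row)
        if pos < len(text) and text[pos] == "(":
            # Internal node: give it a generated id so children can refer to it.
            row["type"] = "node"
            row["id"] = f"node{counter}"
            counter += 1
            pos += 1  # skip "("
            while True:
                parse_node(row["id"])
                if pos < len(text) and text[pos] == ",":
                    pos += 1  # skip "," and parse the next child
                    continue
                break
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses in Newick string")
            pos += 1  # skip ")"
        # Optional label (leaf name, or a label on an internal node).
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        row["label"] = text[start:pos].strip()
        if row["type"] == "leaf":
            row["id"] = row["label"]
        # Optional branch length after ":".
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            row["length"] = float(text[start:pos])
        return row

    parse_node(None)
    return rows


tree_rows = newick_to_rows("(A:0.1,(B:0.2,C:0.3):0.4);")
for row in tree_rows:
    print(row)

# Once the tree is a table, picking out the leaves is a simple filter.
leaf_ids = [row["id"] for row in tree_rows if row["type"] == "leaf"]
```

Storing a `parent` reference on every row is what lets a tree live in a flat table while still being reconstructable.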

## Why is PhyloPandas useful? 

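We already have BioPython, DendroPy, ete3, and others, right? The difference is that the tree is now an ordinary DataFrame, so everyday Pandas operations apply. For example, selecting only the leaves is a one-line filter.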

In [None]:
df = ph.read_newick('data/PF08793_seed.newick')

df.loc[df.type == "leaf"]

# Here is where the real magic happens!

## Reading Sequence *and* Tree Data

PhyloPandas can combine sequence and tree data in a single DataFrame.

In [None]:
# Read sequences.
df = ph.read_fasta('data/PF08793_seed.fasta')

# Read tree.
df = df.phylo.read_newick('data/PF08793_seed.newick', combine_on='id')
df

This enables us to build phylogenetics tools around a single, core DataFrame.
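
To get a feel for what combining on `id` means, here is a small sketch with two hand-built tables and a plain Pandas `merge`. The toy data, the generated `node0`/`node1` ids, and the choice of a left join are assumptions for illustration; this is not necessarily how `combine_on` is implemented.

```python
import pandas as pd

# Toy sequence table: one row per sequence.
sequences = pd.DataFrame({
    "id": ["A", "B", "C"],
    "sequence": ["MKVLA", "MKVQA", "MRVQA"],
})

# Toy tree table: one row per node, leaves and internal nodes alike.
tree = pd.DataFrame({
    "id": ["node0", "A", "node1", "B", "C"],
    "type": ["node", "leaf", "node", "leaf", "leaf"],
    "parent": [None, "node0", "node0", "node1", "node1"],
    "length": [None, 0.1, 0.4, 0.2, 0.3],
})

# Keep every tree row and attach a sequence wherever the ids match.
combined = tree.merge(sequences, on="id", how="left")
print(combined)
```

Leaves pick up their sequences, while internal nodes keep an empty `sequence` entry that later steps (such as ancestral reconstruction) could fill in.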

# Views for PhyloPandas

We've created a simple, interactive tree viewer powered by Vega. This leverages the "grammar of phylogenetics" that PhyloPandas defines.

In [None]:
from phylovega import VegaTree

# Read data
df = ph.read_fasta('data/PF08793_seed.fasta')
df = df.phylo.read_newick('data/PF08793_seed.newick', combine_on='id')

# Show using VegaTree
VegaTree(df).display()

From the same DataFrame, we can also show the sequences.

In [None]:
from IPython.display import display

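def Fasta(data=''):
    """Display FASTA text using a custom MIME type, with plain text as a fallback."""
    # Build a MIME bundle: frontends that understand the FASTA type can render it,
    # and everything else falls back to 'text/plain'.
    bundle = {}
    bundle['application/vnd.fasta.fasta'] = data
    bundle['text/plain'] = data
    # raw=True tells IPython that the dict is already a MIME bundle.
    display(bundle, raw=True)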

Fasta(df.phylo.to_fasta())

# Introduction to Phylogenetics

The `phylogenetics` package is our attempt at building tools around PhyloPandas.

In [None]:
from phylogenetics import PhylogeneticsProject

Phylogenetics unites many external tools under a single interface and stores their data in a single PhyloPandas DataFrame.

In [None]:
# Define a working directory
working_dir = "project"

# Initialize a working project.
p = PhylogeneticsProject(working_dir, overwrite=True)

Phylogenetics starts with an alignment.

In [None]:
p.read_data('data/PF08793_seed.fasta', schema='fasta')

We can view that data with the `data` attribute.

In [None]:
p.data

## Compute Tree

The `phylogenetics` package greatly simplifies a typical phylogenetics workflow.

To compute a tree, simply call `compute_tree`.

In [None]:
p.compute_tree()

We can see how that changed the dataframe.

In [None]:
p.data

## What is happening under the hood?

Each method uses PhyloPandas to prepare the data for an external program, like PhyML or PAML. A subprocess call is made to run the program. Then PhyloPandas is used to read in the results.
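
The write, run, read pattern described above can be sketched in a few lines. To keep the example runnable anywhere, a tiny Python one-off script stands in for the external program (it just counts FASTA headers); no real PhyML or PAML command-line flags are shown or implied, and the function name `run_external_tool` is hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def run_external_tool(records):
    """Write input, call a stand-in external program, and read its output back."""
    with tempfile.TemporaryDirectory() as workdir:
        in_path = os.path.join(workdir, "input.fasta")
        out_path = os.path.join(workdir, "output.txt")

        # 1. Prepare the input file for the external program.
        with open(in_path, "w") as f:
            for rec in records:
                f.write(f">{rec['id']}\n{rec['sequence']}\n")

        # 2. Run the program in a subprocess. This stand-in script counts the
        #    header lines in the input and writes the count to the output file.
        script = (
            "import sys\n"
            "n = sum(1 for line in open(sys.argv[1]) if line.startswith('>'))\n"
            "open(sys.argv[2], 'w').write(str(n))\n"
        )
        subprocess.run([sys.executable, "-c", script, in_path, out_path], check=True)

        # 3. Read the results back in.
        with open(out_path) as f:
            return int(f.read())


n_sequences = run_external_tool([
    {"id": "seq1", "sequence": "MKVLA"},
    {"id": "seq2", "sequence": "MKVQA"},
])
print(n_sequences)
```

Using `check=True` makes a failing external program raise an error instead of silently producing no results, which matters when the output is about to be merged back into a shared table.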

## Reconstruct ancestors

To reconstruct ancestors, simply call `compute_reconstruction`.

In [None]:
p.compute_reconstruction()

Again, let's see how the DataFrame changed.

In [None]:
p.data

Let's just look at ancestors.

In [None]:
anc_df = p.data[p.data.type == 'node']
anc_df

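Let's write those ancestors out as FASTA, using the reconstructed sequences in the `ml_sequence` column. Since no filename is given, the result comes back as a string.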

In [None]:
fasta_str = anc_df.phylo.to_fasta(sequence_col='ml_sequence')

print(fasta_str)

Congratulations! You just reconstructed protein ancestors!