# Introduction to PhyloPandas

This notebook introduces PhyloPandas, a Pandas DataFrame and interface for phylogenetics.

In [1]:
import pandas as pd

In [2]:
import phylopandas as ph

## Reading data

PhyloPandas comes with various `read_` functions that load phylogenetic data into a Pandas DataFrame.

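Check out the supported formats by hitting `tab` after `read_` in the cell below. The prefix is only there for tab completion; executing the cell as written raises an `AttributeError`.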

In [3]:
ph.read_

AttributeError: module 'phylopandas' has no attribute 'read_'

Try reading some of the sequence files in the `data` folder.

In [None]:
with open('PF08793_seed.fasta', 'r') as f:
    print(f.read())

In [None]:
ph.read_fasta('PF08793_seed.fasta')

In [None]:
ph.read_phylip('PF08793_seed.phylip')

In [None]:
ph.read_clustal('PF08793_seed.clustal')
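
For intuition about what a `read_` function hands back, the sketch below parses FASTA text into one record per sequence using only the standard library. The `parse_fasta` helper and the two toy sequences are hypothetical and are not part of PhyloPandas; the real readers may delegate parsing to another library and return additional columns.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of {'id', 'sequence'} records."""
    records = []
    current_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current_id is not None:
                records.append({"id": current_id, "sequence": "".join(chunks)})
            header = line[1:].split()
            current_id = header[0] if header else ""
            chunks = []
        elif current_id is not None:
            # Sequence lines may be wrapped, so collect and join them later.
            chunks.append(line)
    if current_id is not None:
        records.append({"id": current_id, "sequence": "".join(chunks)})
    return records


# Made-up input, only for illustration.
toy_fasta = """>seq1
ACDE-
FGH
>seq2 an optional description
KLMNPQRS
"""

records = parse_fasta(toy_fasta)
for record in records:
    print(record["id"], record["sequence"])
```

Passing `records` to `pd.DataFrame` would give one row per sequence, which is the general shape of table the `read_` functions produce.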

## Writing data

PhyloPandas attaches a `phylo` accessor to the standard Pandas DataFrame. Inside this accessor are various writing methods, following Pandas syntax, allowing you to write to various sequence formats.

To quickly see the writing functions, hit `tab` after `to_` in the cell below.

In [None]:
df = ph.read_fasta('PF08793_seed.fasta')

In [None]:
df.phylo.to_

Let's write the DataFrame back out to FASTA. If you don't give a filename, `to_fasta` returns the FASTA text as a string.

In [None]:
s = df.phylo.to_fasta()
print(s)
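
To illustrate how an accessor like `phylo` can hang methods off a standard DataFrame, here is a minimal sketch using Pandas' public `register_dataframe_accessor` hook. The accessor name `toy_phylo`, the class `ToyPhyloAccessor`, and the assumed `id` and `sequence` columns are hypothetical; this is not the PhyloPandas implementation, only the same general pattern, including the "no filename returns a string" behavior.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("toy_phylo")
class ToyPhyloAccessor:
    """Hypothetical accessor; not the PhyloPandas implementation."""

    def __init__(self, pandas_obj):
        self._df = pandas_obj

    def to_fasta(self, filename=None):
        # Build one ">id" header line plus one sequence line per row.
        lines = []
        for seq_id, sequence in zip(self._df["id"], self._df["sequence"]):
            lines.append(f">{seq_id}")
            lines.append(sequence)
        text = "\n".join(lines) + "\n"
        if filename is None:
            # No filename given: hand the FASTA text back to the caller.
            return text
        with open(filename, "w") as handle:
            handle.write(text)
        return None


# Made-up sequences, only for illustration.
toy = pd.DataFrame({"id": ["seq1", "seq2"], "sequence": ["ACDE", "KLMN"]})
fasta_text = toy.toy_phylo.to_fasta()
print(fasta_text)
```

Registering the accessor once makes `toy_phylo` available on every DataFrame in the session, which is why the writing methods can follow the familiar `df.<accessor>.to_<format>()` shape.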

## Converting between formats

Because reading and writing are separate steps, converting between sequence formats takes two calls: read one format, then write another.

In [4]:
df = ph.read_phylip('PF08793_seed.phylip')

fasta_str = df.phylo.to_fasta()

print(fasta_str)

>5PosaPz1Ba
KCIAFDK----ND-KINPFTGRPINENNDTYRMIYSMCHG
>nTtjWXLcTL
ACALYYD----DP-TVNPFTDEPLRRYSPIDDLLYRNCES
>Nk9EoqE14T
YCTNFHR----DE-SRNPLTGKKLVPTSPIRKAWHKMCSG
>KrryYldJzG
LCAEYKR----SP-RYNPWTDRTLAPGSPKHNLISGMCGG
>8sH15yS2LJ
VCNDLALCSQHTD-TYNPWTDRALLPDSPVHDMIDYVCNT
>38EkV6VtF1
VCERFAA----DP-TRNPVTGSPLSRNDPLYTDLMEICKG
>goe9RcxcQY
TCEAFCR----DP-TRNPVTGQKMRRNGIEYQMFAEECDC
>zBbStiY22V
KCDEWEKIRLNSS-PKNPFTKRNVKKDGPTYKKIDLICKH
>gUstHy3NWv
KCYEWDIAKKKSPLPKSPLTGRKLKQHGPTWKKITAECAT
>pJSzBTSdyJ
KCSKWHE----QP-LINPLTNRKIKKNGPTYKELERECGP
>hHqmLdOzYk
LCSKWKA----NP-LVNPATGRKIKKDGPVYEKIQKKCS-
>9PhikwhdAD
YCDEFER----NP-TRNPRTGRTIKRGGPVFRALERECSD
>YIM7zb5VSh
-CPEFAR----DP-TRNPRTGRTIKRGGPTYRALEAECAD
>hhFPHo9QRt
ECEQWLA----NK-GINPRTGKAIKIGGPTYKKLEMECKE
>1UAjmKxk2o
VCKKFLA----NK-TVSPYSGRPIKPGKKLYNDLEKHCSG
>AxcIhHg3sO
QCRAFEE----NP-DVNPNTGRRISPTGPIASSMRRRCMN
>yuLFxOOfPi
KCNQLRN----NRYTVNPVSNRAIAPRGDTANTLRRICEQ
>URSmxyxeaW
QCETFKR----NKQAVSPLTNCPIDKFGRTAARFRKECD-



## Reading Tree Data

PhyloPandas can also read phylogenetic tree data.

In [5]:
with open('PF08793_seed.newick', 'r') as f:
    print(f.read())

(Q8QUQ5_ISKNN/45-79:0.38376442,Q8QUQ6_ISKNN/37-75:0.93473288,(Q8QUQ5_ISKNN/123-157:1.14582942,(Q0E553_SFAVA/142-176:0.94308689,(Q0E553_SFAVA/184-218:0.98977147,(Q0E553_SFAVA/60-94:0.95706148,(((019R_FRG3G/5-39:0.06723315,(019R_FRG3G/139-172:0.05690376,(019R_FRG3G/249-283:0.95772959,019R_FRG3G/302-336:0.58361302)2.745285:0.61968795)1.680162:0.12814819)8.545520:0.30724093,((VF232_IIV6/64-98:0.77338949,((VF380_IIV6/7-45:0.56133629,VF380_IIV3/8-47:0.64307079)7.484104:0.37367018,(VF378_IIV6/4-38:0.31530205,O41158_PBCV1/63-96:0.46076842)1.909391:0.20522645)0.218717:0.09388521)2.531435:0.20551347,Q0E553_SFAVA/14-48:1.58834786)0.265099:0.00027193)6.209727:0.37908212,(Q8QUQ5_ISKNN/164-198:0.63907222,Q8QUQ5_ISKNN/7-42:0.96743219)2.806276:0.362965)0.677978:0.20054193)0.718698:0.20642561)2.503850:0.27168922)1.162623:0.15868612)6.040602:0.48939921);



In [6]:
ph.read_newick('PF08793_seed.newick')

Unnamed: 0,type,id,parent,length,label,distance,uid
0,root,0,,0.0,0,0.0,kCIjFBZKXZ
1,leaf,Q8QUQ5_ISKNN/45-79,0.0,0.383764,Q8QUQ5_ISKNN/45-79,0.383764,wKP5pcfIok
2,leaf,Q8QUQ6_ISKNN/37-75,0.0,0.934733,Q8QUQ6_ISKNN/37-75,0.934733,Wi6ARQAOcw
3,node,1,0.0,0.489399,1,0.489399,iKoRLPGtl6
4,leaf,Q8QUQ5_ISKNN/123-157,1.0,1.145829,Q8QUQ5_ISKNN/123-157,1.635229,RbLr5Hi2L9
5,node,2,1.0,0.158686,2,0.648085,pR3f9C8Ort
6,leaf,Q0E553_SFAVA/142-176,2.0,0.943087,Q0E553_SFAVA/142-176,1.591172,8wbvqaG3jg
7,node,3,2.0,0.271689,3,0.919775,sCUs3pJLK8
8,leaf,Q0E553_SFAVA/184-218,3.0,0.989771,Q0E553_SFAVA/184-218,1.909546,Lov4UJif6D
9,node,4,3.0,0.206426,4,1.1262,5yDZXG1tyd
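
Judging from the rows above, the `distance` column appears to be the cumulative branch length from the root, while `length` is the branch leading to each row's `parent`. The sketch below checks that reading on a few rows copied from the table. The `root_distance` helper is hypothetical, parents are written as id strings for readability, and the interpretation is an inference from the output rather than a documented guarantee.

```python
# (id, parent, length) triples copied from the first rows of the table.
rows = [
    ("0", None, 0.0),
    ("1", "0", 0.489399),
    ("Q8QUQ5_ISKNN/123-157", "1", 1.145829),
    ("2", "1", 0.158686),
    ("Q0E553_SFAVA/142-176", "2", 0.943087),
]

parent_of = {node_id: parent for node_id, parent, _ in rows}
length_of = {node_id: length for node_id, _, length in rows}


def root_distance(node_id):
    """Sum branch lengths while walking from node_id up to the root."""
    total = 0.0
    while node_id is not None:
        total += length_of[node_id]
        node_id = parent_of[node_id]
    return total


for node_id, _, _ in rows:
    print(node_id, round(root_distance(node_id), 6))
```

The printed sums can be compared against the `distance` column; small differences in the last digit come from the rounding in the displayed table.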


## Why is PhyloPandas useful? 

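We already have BioPython, DendroPy, ete3, etc., right? The difference is that PhyloPandas stores the tree in an ordinary DataFrame, so standard Pandas operations apply directly. For example, boolean indexing selects only the leaf rows.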

In [7]:
df = ph.read_newick('PF08793_seed.newick')

df2 = df.loc[df.type == "leaf"]

In [8]:
df

Unnamed: 0,type,id,parent,length,label,distance,uid
0,root,0,,0.0,0,0.0,9x4F7nTLnY
1,leaf,Q8QUQ5_ISKNN/45-79,0.0,0.383764,Q8QUQ5_ISKNN/45-79,0.383764,bhUZpMzqaw
2,leaf,Q8QUQ6_ISKNN/37-75,0.0,0.934733,Q8QUQ6_ISKNN/37-75,0.934733,AGoLMJy4qb
3,node,1,0.0,0.489399,1,0.489399,PEr58Pk7IB
4,leaf,Q8QUQ5_ISKNN/123-157,1.0,1.145829,Q8QUQ5_ISKNN/123-157,1.635229,CQmpogxXrH
5,node,2,1.0,0.158686,2,0.648085,4fGJ1yqAd6
6,leaf,Q0E553_SFAVA/142-176,2.0,0.943087,Q0E553_SFAVA/142-176,1.591172,W89uwOl3sK
7,node,3,2.0,0.271689,3,0.919775,xCOwZZkfi5
8,leaf,Q0E553_SFAVA/184-218,3.0,0.989771,Q0E553_SFAVA/184-218,1.909546,gDFNACm9Vx
9,node,4,3.0,0.206426,4,1.1262,BngfjtGSGI


In [9]:
df2

Unnamed: 0,type,id,parent,length,label,distance,uid
1,leaf,Q8QUQ5_ISKNN/45-79,0,0.383764,Q8QUQ5_ISKNN/45-79,0.383764,bhUZpMzqaw
2,leaf,Q8QUQ6_ISKNN/37-75,0,0.934733,Q8QUQ6_ISKNN/37-75,0.934733,AGoLMJy4qb
4,leaf,Q8QUQ5_ISKNN/123-157,1,1.145829,Q8QUQ5_ISKNN/123-157,1.635229,CQmpogxXrH
6,leaf,Q0E553_SFAVA/142-176,2,0.943087,Q0E553_SFAVA/142-176,1.591172,W89uwOl3sK
8,leaf,Q0E553_SFAVA/184-218,3,0.989771,Q0E553_SFAVA/184-218,1.909546,gDFNACm9Vx
10,leaf,Q0E553_SFAVA/60-94,4,0.957061,Q0E553_SFAVA/60-94,2.083262,fRZuaBG9S3
14,leaf,019R_FRG3G/5-39,7,0.067233,019R_FRG3G/5-39,2.080298,mclkZI6LJJ
16,leaf,019R_FRG3G/139-172,8,0.056904,019R_FRG3G/139-172,2.198117,6qtDyUu3Xx
18,leaf,019R_FRG3G/249-283,9,0.95773,019R_FRG3G/249-283,3.718631,ZM0EOpcIQT
19,leaf,019R_FRG3G/302-336,9,0.583613,019R_FRG3G/302-336,3.344514,WQi85K0XJ9
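
Since the tree is a plain DataFrame, further Pandas operations chain naturally onto the leaf filter. The sketch below rebuilds a few rows of the tree table by hand (so it runs without the Newick file) and then sorts and summarizes the leaves. The variable names are illustrative only.

```python
import pandas as pd

# A few rows copied by hand from the tree table shown earlier.
tree = pd.DataFrame(
    {
        "type": ["root", "leaf", "leaf", "node", "leaf"],
        "id": [
            "0",
            "Q8QUQ5_ISKNN/45-79",
            "Q8QUQ6_ISKNN/37-75",
            "1",
            "Q8QUQ5_ISKNN/123-157",
        ],
        "length": [0.0, 0.383764, 0.934733, 0.489399, 1.145829],
        "distance": [0.0, 0.383764, 0.934733, 0.489399, 1.635229],
    }
)

# Same boolean-indexing filter as in the cell above.
leaves = tree.loc[tree["type"] == "leaf"]

# Ordinary Pandas calls now work on the tree: sort, select, aggregate.
farthest = leaves.sort_values("distance", ascending=False).iloc[0]
print(len(leaves), "leaves; farthest from root:", farthest["id"])
print("mean leaf branch length:", leaves["length"].mean())
```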


# Here is where the real magic happens!

## Reading Sequence *and* Tree Data

PhyloPandas can combine sequence and tree data in a single DataFrame.

In [10]:
# Read sequences.
df = ph.read_fasta('PF08793_seed.fasta')

# Read the tree and combine it with the sequences, matching rows on the `id` column.
df = df.phylo.read_newick('PF08793_seed.newick', combine_on='id')
#df

This lets us build phylogenetic tools around a single core DataFrame.
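
Conceptually, combining on `id` resembles a plain Pandas merge of a sequence table with a tree table. The sketch below shows that idea with toy data. The taxon names and sequences are made-up placeholders, and the choice of an outer merge is an assumption; how PhyloPandas actually joins the two tables internally is not shown in this notebook.

```python
import pandas as pd

# Toy sequence table; the sequences are made-up placeholders.
seqs = pd.DataFrame(
    {"id": ["taxonA", "taxonB"], "sequence": ["ACDE", "KLMN"]}
)

# Toy tree table; the root row has no matching sequence.
tree = pd.DataFrame(
    {
        "id": ["0", "taxonA", "taxonB"],
        "type": ["root", "leaf", "leaf"],
        "length": [0.0, 0.1, 0.2],
    }
)

# An outer merge keeps tree rows without sequences (and vice versa).
combined = seqs.merge(tree, on="id", how="outer")
print(combined)
```

Internal nodes end up with a missing `sequence` value, which leaves room to fill them in later while keeping everything in one table.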

## View an interactive Tree

**You must have PhyloVega installed!**

https://github.com/Zsailer/phylovega