In [1]:
!ls -l data/

total 7190328
-rw-------@ 1 fobermey  staff       50752 Aug 30 21:22 County_8_11_2020.csv
-rw-------@ 1 fobermey  staff  3681260418 Aug 30 21:29 GTR4G_posterior.trees
-r--------  0 fobermey  staff           0 Dec 31  1969 Icon?
-rw-------  1 fobermey  staff          78 Sep  1 17:45 README.txt
-rw-------@ 1 fobermey  staff        4649 Aug 31 09:39 demographics_by_zip_2013_ACS.tsv
-rw-------@ 1 fobermey  staff       16126 Aug 30 21:24 sample_county.tsv
-rw-------@ 1 fobermey  staff       47319 Aug 30 21:24 trimmed_alignment.fasta_ML_tree_from_iqtree.contree


- `GTR4G_posterior.trees` appears to be in NEXUS format
- `trimmed_alignment.fasta_ML_tree_from_iqtree.contree` appears to be in Newick format
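
One way to check these guesses without loading the files is to look at the first few bytes: NEXUS files begin with a `#NEXUS` header, while a bare Newick file begins with an opening parenthesis. The helper below is a minimal sketch; the name `sniff_tree_format` and the `num_bytes` parameter are hypothetical and are not part of `Bio.Phylo` or `pyrophylo`.

```python
def sniff_tree_format(path, num_bytes=64):
    """Guess whether a tree file is NEXUS or Newick from its first bytes.

    Hypothetical helper for illustration only; it reads at most
    ``num_bytes`` bytes, so it is safe to call on multi-gigabyte files.
    """
    with open(path, "rb") as f:
        head = f.read(num_bytes).lstrip()
    # NEXUS files start with a "#NEXUS" token (case-insensitive).
    if head[:6].upper() == b"#NEXUS":
        return "nexus"
    # A bare Newick tree starts with an opening parenthesis.
    if head.startswith(b"("):
        return "newick"
    return "unknown"
```

Calling `sniff_tree_format("data/GTR4G_posterior.trees")` should then return `"nexus"` if the guess above is right.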

In [2]:
import pandas as pd
from Bio import Phylo

In [3]:
incidence_df = pd.read_csv("data/County_8_11_2020.csv")
incidence_df.head()

Unnamed: 0,Date,County,Count,Deaths
0,4/17/2020,Barnstable,582.0,
1,4/17/2020,Berkshire,390.0,
2,4/17/2020,Bristol,1693.0,
3,4/17/2020,Dukes,14.0,
4,4/17/2020,Essex,4668.0,


In [4]:
sample_df = pd.read_csv("data/sample_county.tsv", sep="\t")
sample_df.head()

Unnamed: 0,sample_id,county
0,MA_MGH_00003,Middlesex
1,MA_MGH_00004,Middlesex
2,MA_MGH_00005,Norfolk
3,MA_MGH_00006,Suffolk
4,MA_MGH_00013,Middlesex


In [5]:
demo_df = pd.read_csv("data/demographics_by_zip_2013_ACS.tsv", sep="\t")
demo_df.head()

Unnamed: 0,region,value1,total_population,percent_white,percent_black,percent_asian,percent_hispanic,per_capita_income,median_rent,median_age
0,1704,1,,,,,,,,
1,1719,1,5048.0,81.0,0.0,17.0,1.0,61461.0,842.0,43.7
2,1730,4,13649.0,81.0,2.0,12.0,4.0,48886.0,1445.0,43.9
3,1748,1,15271.0,92.0,0.0,5.0,2.0,56899.0,1165.0,39.5
4,1752,1,38842.0,75.0,2.0,5.0,11.0,35335.0,977.0,39.5


In [6]:
ml_tree = Phylo.read("data/trimmed_alignment.fasta_ML_tree_from_iqtree.contree",
                     format="newick")
ml_tree.count_terminals()

772

The posterior tree file (about 3.7 GB) is too large to load with ``Bio.Phylo``, so we implement a custom parser.
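
The idea behind a streaming parser can be sketched as follows. This is not the `pyrophylo` implementation; it is a simplified illustration that assumes each tree sits on a single line of the form `tree NAME ... = ... (...);` inside a `Begin trees;` block, as BEAST-style writers usually produce. The function name `iter_newick_strings` is hypothetical, and the `Translate` table that maps numeric ids to taxon names is ignored.

```python
from itertools import islice


def iter_newick_strings(path):
    """Lazily yield one Newick string per tree statement in a NEXUS file.

    Illustrative sketch only: assumes one tree per line and that the
    first "(" on a tree line starts the Newick string. Only one line is
    held in memory at a time.
    """
    in_trees_block = False
    with open(path) as f:
        for line in f:
            stripped = line.strip()
            lowered = stripped.lower()
            if lowered.startswith("begin trees"):
                in_trees_block = True
            elif lowered.startswith("end"):
                in_trees_block = False
            elif in_trees_block and lowered.startswith("tree ") and "(" in stripped:
                # Skip the tree name and any bracketed comments before the "(".
                yield stripped[stripped.index("("):]


def first_trees(path, max_num_trees):
    """Return at most ``max_num_trees`` Newick strings without reading the rest."""
    return list(islice(iter_newick_strings(path), max_num_trees))
```

Because the generator is lazy, `first_trees(path, 10)` stops reading after ten trees. Each yielded string could then be parsed individually, for example with `Phylo.read(io.StringIO(newick), "newick")`.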

In [7]:
from pyrophylo.io import read_nexus_trees
posterior_trees = read_nexus_trees("data/GTR4G_posterior.trees", format="newick",
                                   max_num_trees=10)

In [8]:
tree = next(posterior_trees)
tree.count_terminals()

772

We also implement a custom conversion to PyTorch. To convert all trees once, run:
```sh
python preprocess_trees.py
```
We can then load the trees as a batched ``Phylogeny`` object.

In [9]:
!ls -lsh results/

total 4822712
4822712 -rw-r--r--  1 fobermey  staff   2.3G Sep  1 17:42 GTR4G_posterior.pt


In [10]:
import torch
phylo = torch.load("results/GTR4G_posterior.pt")

In [11]:
print(type(phylo))
print(len(phylo))
print(phylo.batch_shape)
print(phylo.num_nodes)
print(phylo.num_leaves)

<class 'pyrophylo.phylo.Phylogeny'>
100001
torch.Size([100001])
1543
772
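
As a sanity check on these numbers: a rooted, strictly bifurcating tree with n leaves has n - 1 internal nodes and therefore 2n - 1 nodes in total. Assuming the posterior trees are of that kind (the output above is consistent with it, but the source does not state it), the reported counts agree. The helper name below is hypothetical.

```python
def num_nodes_in_binary_tree(num_leaves):
    """Total node count of a rooted, strictly bifurcating tree.

    Such a tree has num_leaves - 1 internal nodes, so the total is
    num_leaves + (num_leaves - 1) = 2 * num_leaves - 1.
    """
    return 2 * num_leaves - 1


# Matches the values printed above: 772 leaves, 1543 nodes.
assert num_nodes_in_binary_tree(772) == 1543
```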
