This script finds the co-diversifying clades within the Moeller et al. supplemental FASTA by building a phylogeny and selecting the subtrees that match the topology of Figure 1.

In [2]:
import pandas as pd
import os
from Bio import SeqIO
from Bio import Seq

In [3]:
os.chdir('/Volumes/AHN/captive_ape_microbiome')

### Generate phylogeny

In [9]:
%%bash
mkdir -pv results/gyrb/processing/moeller_sup

#copy reference fasta
cp results/gyrb_bt_gtdbtk_ref/blastdb/gtdbtk_gyrb_Bt.faa results/gyrb/processing/moeller_sup/gtdbtk_gyrb_Bt.faa
cp results/gyrb_bt_gtdbtk_ref/blastdb/gtdbtk_gyrb_Bt.fasta results/gyrb/processing/moeller_sup/gtdbtk_gyrb_Bt.fasta

#trim primer sequences
cutadapt  -g CGGAGGTAARTTCGAYAAAGG  --overlap 21 -e .15 --discard-untrimmed  -o  results/gyrb/processing/moeller_sup/Bacteroidaceae.fna  data/gyrb/moeller_sup/Bacteroidaceae.fna

cd results/gyrb/processing/moeller_sup
#translate gyrb amplicon seqs
transeq -frame 2 -sequence Bacteroidaceae.fna  -outseq Bacteroidaceae.faa

#combine gyrb amplicon and ref seqs
cat gtdbtk_gyrb_Bt.faa Bacteroidaceae.faa > Bacteroidaceae_ref.faa
cat gtdbtk_gyrb_Bt.fasta Bacteroidaceae.fna > Bacteroidaceae_ref.fna

#align seqs
mafft --auto --quiet Bacteroidaceae_ref.faa > Bacteroidaceae_ref.faa.aln
tranalign -asequence  Bacteroidaceae_ref.fna -bsequence Bacteroidaceae_ref.faa.aln -outseq Bacteroidaceae_ref.fna.aln

#select range of sites because some seqs have different 

This is cutadapt 2.5 with Python 3.7.7
Command line parameters: -g CGGAGGTAARTTCGAYAAAGG --overlap 21 -e .15 --discard-untrimmed -o results/gyrb/processing/moeller_sup/Bacteroidaceae.fna data/gyrb/moeller_sup/Bacteroidaceae.fna
Processing reads on 1 core in single-end mode ...
Finished in 0.04 s (53 us/read; 1.13 M reads/minute).

=== Summary ===

Total reads processed:                     724
Reads with adapters:                       724 (100.0%)
Reads written (passing filters):           724 (100.0%)

Total basepairs processed:       217,200 bp
Total written (filtered):        200,991 bp (92.5%)

=== Adapter 1 ===

Sequence: CGGAGGTAARTTCGAYAAAGG; Type: regular 5'; Length: 21; Trimmed: 724 times.

No. of allowed errors:
0-5 bp: 0; 6-12 bp: 1; 13-19 bp: 2; 20-21 bp: 3

Overview of removed sequences
length	count	expect	max.err	error counts
21	194	0.0	3	194
22	192	0.0	3	192
23	201	0.0	3	201
24	137	0.0	3	137


cp: results/gyrb_bt_gtdbtk_ref/blastdb/gtdbtk_gyrb_Bt.faa: No such file or directory
cp: results/gyrb_bt_gtdbtk_ref/blastdb/gtdbtk_gyrb_Bt.fasta: No such file or directory
Translate nucleic acid sequences
Generate an alignment of nucleic coding regions from aligned proteins
Error: Guide protein sequence Gorilla3374716_2 not found in nucleic sequence Gorilla3374716
Error: Guide protein sequence Human3058374_2 not found in nucleic sequence Human3058374
Error: Guide protein sequence Human9403092_2 not found in nucleic sequence Human9403092
Error: Guide protein sequence Chimp10413648_2 not found in nucleic sequence Chimp10413648
Error: Guide protein sequence Chimp10757854_2 not found in nucleic sequence Chimp10757854
Error: Guide protein sequence Chimp10796222_2 not found in nucleic sequence Chimp10796222
Error: Guide protein sequence Chimp11562025_2 not found in nucleic sequence Chimp11562025
Error: Guide protein sequence Chimp3146215_2 not found in nucleic sequence Chimp3146215
Error: Gu
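
The next cell reads `Bacteroidaceae_ref.fna.aln.selection`, but the step that produced it is cut off above. A minimal sketch of restricting an aligned FASTA to a range of columns follows. The column range, the helper names `read_fasta` and `select_columns`, and the choice to drop sequences that are all gaps in the window are assumptions for illustration, not the values or method used in the original analysis.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def select_columns(records, start, end, gap="-"):
    """Keep alignment columns [start, end) (0-based, end exclusive).

    Sequences consisting only of gaps inside the window are dropped
    (an assumption; the original filtering rule is not shown).
    """
    kept = []
    for name, seq in records:
        window = seq[start:end]
        if window and any(char != gap for char in window):
            kept.append((name, window))
    return kept


# Toy alignment; the sequence names and the range 2..8 are hypothetical.
toy = read_fasta(">a\n--ATGAAA--\n>b\nCCATG---GG\n>c\nCC------GG\n".splitlines())
selection = select_columns(toy, 2, 8)
for name, seq in selection:
    print(f">{name}\n{seq}")
```

Slicing by column keeps every remaining sequence the same length, which tree builders that take an alignment as input generally require.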

In [8]:
!fasttree -nt -gtr < results/gyrb/processing/moeller_sup/Bacteroidaceae_ref.fna.aln.selection > results/gyrb/processing/moeller_sup/Bacteroidaceae_ref.fna.aln.selection.tre

FastTree Version 2.1.10 Double precision (No SSE3)
Alignment: standard input
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
Initial topology in 7.30 seconds
Refining topology: 42 rounds ME-NNIs, 2 rounds ME-SPRs, 21 rounds ML-NNIs
Total branch-length 107.195 after 27.43 sec
ML-NNI round 1: LogLk = -963821.905 NNIs 422 max delta 74.59 Time 37.58
GTR Frequencies: 0.2797 0.2408 0.2599 0.2196
GTR rates(ac ag at cg ct gt) 1.8762 2.6509 1.7148 1.7302 5.5709 1.0000
Switched to using 20 rate categories (CAT approximation)
Rate categories were divided by 1.149 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable acros

In [None]:
#view tree in figtree, manually select clades with topology matching Figure 1 in Moeller et al. 
#save subtrees separately
#save sequence names separately, annotate with clade designations
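
The manual step above ends with a file of clade designations and sequence names. As a rough sketch of how that file could be generated from saved subtrees instead of typed by hand, the code below pulls leaf names out of Newick strings and emits space-separated `clade header` lines, the layout the next cell reads. It assumes plain Newick with unquoted leaf names; the helper names, clade labels, and toy trees are hypothetical.

```python
import re


def newick_leaf_names(newick):
    """Return leaf names from a Newick string.

    A leaf name is taken to be the token that directly follows '(' or ','.
    Internal node labels follow ')' and are therefore skipped.
    Assumes unquoted names without spaces.
    """
    return re.findall(r"(?<=[(,])\s*([^():,;\s]+)", newick)


def names_file_lines(subtrees):
    """Build 'clade header' lines from a {clade_label: newick_string} mapping."""
    lines = []
    for clade, newick in subtrees.items():
        for name in newick_leaf_names(newick):
            lines.append(f"{clade} {name}")
    return lines


# Hypothetical subtrees; real ones would be the files saved from the tree viewer.
toy_subtrees = {
    "clade_X": "((Human1:0.1,Chimp2:0.2)0.95:0.3,Gorilla3:0.4);",
    "clade_Y": "(Human4:0.1,Chimp5:0.2);",
}
print("\n".join(names_file_lines(toy_subtrees)))
```

If the viewer exports Nexus or quotes the names, the regular expression would need adjusting, or a tree parser could be used instead.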

In [10]:
codiv_asvs = pd.read_csv('results/moeller_sup/Bt_all_codiv_lineages_names.txt',sep=' ',index_col=None,header=None)
codiv_asvs.columns = ['codiv_clade','header']
codivDict = dict(zip(codiv_asvs['header'], codiv_asvs['codiv_clade']))
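
A quick sanity check on a mapping like `codivDict` is to tally which hosts appear in each clade, since the sequence names start with the host (`Gorilla`, `Human`, `Chimp`). The sketch below assumes the host is the leading alphabetic prefix of the header. The function name and the clade labels in the toy mapping are made up for illustration and do not reflect the real clade assignments.

```python
import re
from collections import Counter, defaultdict


def hosts_per_clade(header_to_clade):
    """Count sequences per host within each clade.

    The host is read from the leading letters of the header,
    e.g. 'Gorilla10693345' gives 'Gorilla' (an assumed naming convention).
    """
    counts = defaultdict(Counter)
    for header, clade in header_to_clade.items():
        match = re.match(r"[A-Za-z]+", header)
        host = match.group(0) if match else "unknown"
        counts[clade][host] += 1
    return {clade: dict(counter) for clade, counter in counts.items()}


# Toy mapping in the same header -> clade orientation as codivDict;
# the clade labels are hypothetical.
toy_map = {
    "Gorilla10693345": "clade_X",
    "Human10045430": "clade_X",
    "Chimp10218980": "clade_Y",
}
print(hosts_per_clade(toy_map))
```

Calling `hosts_per_clade(codivDict)` in the notebook would give the same kind of summary for the real assignments.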

In [17]:

# write only the sequences assigned to a co-diversifying clade,
# storing the clade designation in the FASTA description
with open('results/moeller_sup/Bacteroidaceae.fna') as original:
    with open('results/moeller_sup/Bacteroidaceae_codiv.fna', 'w') as codiv_out:
        for record in SeqIO.parse(original, 'fasta'):
            if record.id in codivDict:
                print(record.id)
                record.description = codivDict[record.id]
                SeqIO.write(record, codiv_out, 'fasta')


Gorilla10693345
Gorilla10893151
Gorilla12208365
Gorilla1454159
Gorilla3280123
Gorilla3748318
Gorilla4351822
Gorilla4447500
Gorilla5764570
Gorilla6306341
Gorilla7248598
Gorilla7611566
Gorilla7655534
Gorilla8443136
Gorilla8497213
Gorilla8907336
Gorilla9193943
Gorilla9805772
Human10045430
Human10283488
Human10889793
Human10912985
Human10912992
Human12108738
Human1330489
Human1330495
Human1781917
Human237218
Human2572122
Human2935124
Human3034958
Human3367149
Human3418197
Human3961325
Human4630891
Human4639158
Human4639166
Human5110045
Human5110051
Human5465528
Human5838232
Human6390749
Human6405675
Human663970
Human663977
Human6649477
Human6700665
Human6700667
Human7250453
Human7563063
Human7563070
Human7859459
Human8146958
Human8295127
Human8829943
Human8842535
Human9177678
Human9518949
Human9825266
Chimp10218980
Chimp1067447
Chimp11049685
Chimp11943611
Chimp1322981
Chimp1324508
Chimp1493190
Chimp1528180
Chimp168543
Chimp1862511
Chimp2164238
Chimp2506335
Chimp2629519
Chimp2757936
Chimp29