This script finds the co-diversifying clades within the Moeller et al. supplemental fasta by building a phylogeny and selecting the subtrees that match the topology of Figure 1.

Inputs: fasta of all Bacteroidaceae seqs, ref_gyrb_gtdbtk Bacteroidales reference seqs (fna and faa)

Outputs: fasta of just co-div seqs with lineage and clade labels

In [1]:
import pandas as pd
import os
from Bio import SeqIO
from Bio import Seq
from ete3 import Tree

In [2]:
os.chdir('/Volumes/AHN/captive_ape_microbiome')
outdir = 'results/gyrb/processing/moeller_sup'

### Generate phylogeny

In [3]:
%%bash
outdir=results/gyrb/processing/moeller_sup
ref_faa=results/gyrb/processing/ref_gyrb_gtdbtk/gyrb_fastas/gtdbtk_gyrb_Bt.faa
ref_fna=results/gyrb/processing/ref_gyrb_gtdbtk/gyrb_fastas/gtdbtk_gyrb_Bt.fasta

mkdir -pv $outdir

#trim primer sequences
cutadapt  -g CGGAGGTAARTTCGAYAAAGG  --overlap 21 -e .15 --discard-untrimmed  -o  $outdir/Bacteroidaceae.fna  data/gyrb/moeller_sup/Bacteroidaceae.fna

#translate gyrb amplicon seqs
transeq -frame 2 -sequence $outdir/Bacteroidaceae.fna  -outseq $outdir/Bacteroidaceae.faa

#combine gyrb amplicon and ref seqs
cat $ref_faa $outdir/Bacteroidaceae.faa > $outdir/Bacteroidaceae_ref.faa
cat $ref_fna $outdir/Bacteroidaceae.fna > $outdir/Bacteroidaceae_ref.fna

#align seqs
mafft --auto --quiet $outdir/Bacteroidaceae_ref.faa > $outdir/Bacteroidaceae_ref.faa.aln
tranalign -asequence  $outdir/Bacteroidaceae_ref.fna -bsequence $outdir/Bacteroidaceae_ref.faa.aln -outseq $outdir/Bacteroidaceae_ref.fna.aln



This is cutadapt 2.5 with Python 3.7.7
Command line parameters: -g CGGAGGTAARTTCGAYAAAGG --overlap 21 -e .15 --discard-untrimmed -o results/gyrb/processing/moeller_sup/Bacteroidaceae.fna data/gyrb/moeller_sup/Bacteroidaceae.fna
Processing reads on 1 core in single-end mode ...
Finished in 0.46 s (638 us/read; 0.09 M reads/minute).

=== Summary ===

Total reads processed:                     724
Reads with adapters:                       724 (100.0%)
Reads written (passing filters):           724 (100.0%)

Total basepairs processed:       217,200 bp
Total written (filtered):        200,991 bp (92.5%)

=== Adapter 1 ===

Sequence: CGGAGGTAARTTCGAYAAAGG; Type: regular 5'; Length: 21; Trimmed: 724 times.

No. of allowed errors:
0-5 bp: 0; 6-12 bp: 1; 13-19 bp: 2; 20-21 bp: 3

Overview of removed sequences
length	count	expect	max.err	error counts
21	194	0.0	3	194
22	192	0.0	3	192
23	201	0.0	3	201
24	137	0.0	3	137


Translate nucleic acid sequences
Generate an alignment of nucleic coding regions from aligned proteins
Error: Guide protein sequence Gorilla3374716_2 not found in nucleic sequence Gorilla3374716
Error: Guide protein sequence Human3058374_2 not found in nucleic sequence Human3058374
Error: Guide protein sequence Human9403092_2 not found in nucleic sequence Human9403092
Error: Guide protein sequence Chimp10413648_2 not found in nucleic sequence Chimp10413648
Error: Guide protein sequence Chimp10757854_2 not found in nucleic sequence Chimp10757854
Error: Guide protein sequence Chimp10796222_2 not found in nucleic sequence Chimp10796222
Error: Guide protein sequence Chimp11562025_2 not found in nucleic sequence Chimp11562025
Error: Guide protein sequence Chimp3146215_2 not found in nucleic sequence Chimp3146215
Error: Guide protein sequence Chimp8175258_2 not found in nucleic sequence Chimp8175258
Error: Guide protein sequence Bonobo2919026_2 not found in nucleic sequence Bonobo2919026
Err
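
The primer passed to cutadapt contains the degenerate IUPAC symbols R and Y, which cutadapt treats as wildcards in the adapter sequence by default. The sketch below is a simplified illustration of that kind of matching, not cutadapt's actual algorithm: it only counts substitutions (cutadapt also allows indels), and `trim_5prime_primer`, `max_offset`, and the example read are hypothetical names and values chosen for illustration. The removed lengths of 21 to 24 in the log above suggest that up to three bases precede the primer, which is why `max_offset=3` is assumed here.

```python
# IUPAC nucleotide codes: each primer symbol maps to the bases it accepts.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def count_mismatches(primer, segment):
    """Count positions where the read base is not allowed by the primer symbol."""
    return sum(base not in IUPAC[sym] for sym, base in zip(primer, segment))

def trim_5prime_primer(read, primer, max_error_rate=0.15, max_offset=3):
    """Return the read with the primer (and any bases before it) removed,
    or None if no acceptable match is found.
    Assumption: substitutions only, no indels."""
    max_errors = int(max_error_rate * len(primer))  # 0.15 * 21 -> 3
    for offset in range(max_offset + 1):
        segment = read[offset:offset + len(primer)]
        if len(segment) == len(primer) and count_mismatches(primer, segment) <= max_errors:
            return read[offset + len(primer):]
    return None

primer = "CGGAGGTAARTTCGAYAAAGG"
# hypothetical read: two leading bases, the primer (R=G, Y=C), then the amplicon
read = "AT" + "CGGAGGTAAGTTCGACAAAGG" + "ACGTACGT"
print(trim_5prime_primer(read, primer))
```

Returning None for reads without a match mirrors the effect of `--discard-untrimmed`, which drops reads in which the primer was not found.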

In [4]:
#trim start of gyrB gene: keep alignment columns start_pos:end_pos (0-based, end-exclusive)
def trim_aln(original_fasta,trim_fasta,start_pos,end_pos):
    #use separate handle names so the path arguments are not shadowed
    with open(original_fasta) as in_handle:
        with open(trim_fasta, 'w') as out_handle:
            records = SeqIO.parse(in_handle, 'fasta')
            for record in records:
                record.seq = record.seq[start_pos:end_pos]
                SeqIO.write(record, out_handle, 'fasta')
trim_aln(f'{outdir}/Bacteroidaceae_ref.fna.aln',f'{outdir}/Bacteroidaceae_ref.fna.aln.trim',429,2952)
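
Python slicing silently clips an out-of-range end position, so a trimming window that runs past the alignment would not raise an error. A minimal sketch of a sanity check that could run before trimming is shown below. It works on a plain dict of aligned strings rather than on SeqIO records, and `trim_columns` and the toy alignment are hypothetical, for illustration only.

```python
def trim_columns(aligned, start_pos, end_pos):
    """Return a dict of sequences cut to columns start_pos:end_pos.

    aligned: dict mapping sequence name -> aligned sequence string.
    Raises ValueError if the rows differ in length or the window
    does not fit inside the alignment.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError(f"rows have different lengths: {sorted(lengths)}")
    (aln_len,) = lengths
    if not 0 <= start_pos < end_pos <= aln_len:
        raise ValueError(f"window {start_pos}:{end_pos} outside alignment of length {aln_len}")
    return {name: seq[start_pos:end_pos] for name, seq in aligned.items()}

# toy alignment (hypothetical sequences)
toy = {"seqA": "--ACGT--", "seqB": "TTACGTAA"}
print(trim_columns(toy, 2, 6))
```

The same two checks could be applied to the real alignment before calling `trim_aln` with 429 and 2952.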

In [5]:
%%bash
fasttree -nt -gtr < results/gyrb/processing/moeller_sup/Bacteroidaceae_ref.fna.aln.trim > results/gyrb/processing/moeller_sup/Bacteroidaceae_ref.fna.aln.trim.tre


FastTree Version 2.1.10 Double precision (No SSE3)
Alignment: standard input
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
      0.17 seconds: Top hits for    121 of   1529 seqs (at seed    100)
      0.30 seconds: Top hits for    298 of   1529 seqs (at seed    200)
      0.45 seconds: Top hits for    460 of   1529 seqs (at seed    300)
      0.59 seconds: Top hits for    637 of   1529 seqs (at seed    400)
      0.70 seconds: Top hits for    798 of   1529 seqs (at seed    500)
      0.81 seconds: Top hits for    939 of   1529 seqs (at seed    600)
      0.92 seconds: Top hits for   1091 of   1529 seqs (at seed    700)
      1.06 seconds: Top hits for   1203 of   1529 seqs (at seed    900)
      1.46 seconds: Joined    100 of   1526
      1.81 seconds: Joined    200 of  

In [6]:
#view tree in FigTree
#root based on outgroups
#select clades with topology matching Figure 1 in Moeller et al.
#save subtrees separately
#save sequence names separately, annotate with clade designations

In [7]:
in_tree = 'results/gyrb/processing/moeller_sup/Bacteroidaceae_ref.fna.aln.trim.tre'   
rooted_tree = 'results/gyrb/processing/moeller_sup/Bacteroidaceae_ref.fna.aln.trim.rooted.tre'  
Bt1_tree_file = 'results/gyrb/processing/moeller_sup/moeller_codiv_lin_Bt1.tree'  
Bt2_tree_file = 'results/gyrb/processing/moeller_sup/moeller_codiv_lin_Bt2.tree'  
Bt3_tree_file = 'results/gyrb/processing/moeller_sup/moeller_codiv_lin_Bt3.tree'  

tree = Tree(in_tree, format=0)
outgroup_taxa=[leaf.name for leaf in tree.get_leaves() if 
               'p__Gemmatimonadota' in leaf.name or 
               'o__Chlorobiales' in leaf.name or 
               'o__SJA-28' in leaf.name]
outgroup_MRCA = tree.get_common_ancestor(outgroup_taxa)
tree.set_outgroup(outgroup_MRCA)
tree.write(outfile=rooted_tree,format=2)

def get_lineage_ASVs(in_tree,listTaxa):
    #makes sure we don't miss any ASVs that didn't match in the blast search
    #but are descended from the MRCA of those that did
    tree = Tree(in_tree, format=0)
    lineage_MRCA = tree.get_common_ancestor(listTaxa)
    lineage_taxa = [x.name for x in lineage_MRCA.get_leaves()]
    tree.prune(lineage_taxa)
    return tree

def rename_taxa(lin_tree,listTaxa,cladeName):
    lineage_MRCA = lin_tree.get_common_ancestor(listTaxa)
    for taxa in lineage_MRCA.get_leaves():    
        taxa.name = taxa.name + ' ' + cladeName

Bt1_tree = get_lineage_ASVs(rooted_tree,['Gorilla5764570','Bonobo8808876'])
Bt1_taxa_oldnames = [leaf.name for leaf in Bt1_tree.get_leaves()]        
rename_taxa(Bt1_tree,['Gorilla5764570','Gorilla10693345'],'Bt1_clade1_gorilla')
rename_taxa(Bt1_tree,['Chimp6685161','Chimp5888147'],'Bt1_clade1_chimp')
rename_taxa(Bt1_tree,['Bonobo11821974','Bonobo8808876'],'Bt1_clade1_bonobo')
Bt1_taxa = [leaf.name for leaf in Bt1_tree.get_leaves()] 
Bt1_tree.write(outfile=Bt1_tree_file,format=2) 

Bt2_tree = get_lineage_ASVs(rooted_tree,['GCF_002933775.1p__Bacteroidotao__Bacteroidalesf__Bacteroidaceaes__Prevotella_sp002933775',
                                         'Bonobo5991239'])
Bt2_taxa_oldnames = [leaf.name for leaf in Bt2_tree.get_leaves()]        
rename_taxa(Bt2_tree,['Gorilla7248598','Gorilla1454159'],'Bt2_clade1_gorilla')
rename_taxa(Bt2_tree,['Chimp8269086','Chimp3094078'],'Bt2_clade1_chimp')
rename_taxa(Bt2_tree,['Bonobo11725448','Bonobo5991239'],'Bt2_clade1_bonobo')
rename_taxa(Bt2_tree,['Chimp1322981','Chimp3450568'],'Bt2_clade2_chimp')
rename_taxa(Bt2_tree,['Bonobo11213703','Bonobo6937590'],'Bt2_clade2_bonobo')
Bt2_taxa = [leaf.name for leaf in Bt2_tree.get_leaves()] 
Bt2_tree.write(outfile=Bt2_tree_file,format=2) 

Bt3_tree = get_lineage_ASVs(rooted_tree,['Human1330495','Bonobo4232320'])
Bt3_taxa_oldnames = [leaf.name for leaf in Bt3_tree.get_leaves()]        
rename_taxa(Bt3_tree,['Human1330495','Human12108738'],'Bt3_clade1_human')
rename_taxa(Bt3_tree,['Chimp2629519','Chimp873638'],'Bt3_clade1_chimp')
rename_taxa(Bt3_tree,['Bonobo6212174','Bonobo4232320'],'Bt3_clade1_bonobo')
Bt3_taxa = [leaf.name for leaf in Bt3_tree.get_leaves()] 
Bt3_tree.write(outfile=Bt3_tree_file,format=2) 
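
The cell above selects each lineage by naming two tips and keeping every leaf that descends from their most recent common ancestor (MRCA). The toy sketch below illustrates that idea on a nested-tuple tree in plain Python. It is a stand-in for the concept, not for ete3's implementation, and `leaves`, `mrca_subtree`, and the tip names are hypothetical.

```python
def leaves(node):
    """List the leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    out = []
    for child in node:
        out.extend(leaves(child))
    return out

def mrca_subtree(node, targets):
    """Return the smallest subtree containing every target leaf, or None."""
    targets = set(targets)
    if not targets <= set(leaves(node)):
        return None
    if isinstance(node, str):
        return node
    for child in node:
        sub = mrca_subtree(child, targets)
        if sub is not None:
            return sub  # a child already holds all targets: descend into it
    return node  # targets are split across children: this node is the MRCA

# toy rooted tree (hypothetical tip names)
toy_tree = (
    "Outgroup",
    (
        ("Gorilla_a", ("Gorilla_b", "Gorilla_c")),
        (("Chimp_a", "Chimp_b"), ("Bonobo_a", "Bonobo_b")),
    ),
)

# Gorilla_b is recovered even though only Gorilla_a and Gorilla_c were named
print(leaves(mrca_subtree(toy_tree, ["Gorilla_a", "Gorilla_c"])))
# spanning tips pull in the whole ingroup but not the outgroup
print(leaves(mrca_subtree(toy_tree, ["Gorilla_a", "Bonobo_b"])))
```

This is why two well-chosen tips per clade are enough: any sequence that sits between them in the tree is included automatically.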



In [19]:
codiv_taxa_oldnames = Bt1_taxa_oldnames + Bt2_taxa_oldnames + Bt3_taxa_oldnames
codiv_taxa = Bt1_taxa + Bt2_taxa + Bt3_taxa
codivDict = dict(zip(codiv_taxa_oldnames, codiv_taxa))
codivDF = pd.DataFrame(list(zip(codiv_taxa,codiv_taxa_oldnames)),columns=['codiv_taxa','codiv_taxa_oldnames'])
codivDF['moeller_seq'] = codivDF['codiv_taxa'].apply(lambda x: 'moeller' if 'p__Bacteroidota' not in x else 'ref')
codivDF = codivDF[codivDF['moeller_seq'] == 'moeller']
codivDF['codiv_clade'] = codivDF['codiv_taxa'].apply(lambda x: x.split(' ')[1])
codivDF['codiv_lineage'] = codivDF['codiv_clade'].apply(lambda x: x.split('_')[0])
codivDF['hostSpecies'] = codivDF['codiv_clade'].apply(lambda x: 'wild_'+x.split('_')[-1])
codivDF = codivDF[['codiv_taxa','codiv_clade','codiv_lineage','hostSpecies']]
codivDF.to_csv(f'{outdir}/moeller_codiv_HRclades.txt',index=False,sep='\t')
codivDF.head()

Unnamed: 0,codiv_taxa,codiv_clade,codiv_lineage,hostSpecies
0,Gorilla5764570 Bt1_clade1_gorilla,Bt1_clade1_gorilla,Bt1,wild_gorilla
1,Gorilla3748318 Bt1_clade1_gorilla,Bt1_clade1_gorilla,Bt1,wild_gorilla
2,Gorilla12208365 Bt1_clade1_gorilla,Bt1_clade1_gorilla,Bt1,wild_gorilla
3,Gorilla4447500 Bt1_clade1_gorilla,Bt1_clade1_gorilla,Bt1,wild_gorilla
4,Gorilla8497213 Bt1_clade1_gorilla,Bt1_clade1_gorilla,Bt1,wild_gorilla
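
The table columns are derived by splitting each renamed leaf label on the space and then on underscores. The sketch below shows the same parsing on a single label without pandas. `parse_label` is a hypothetical helper, and returning None for an unlabelled name is an assumption about how such leaves might be handled; in the cell above, `x.split(' ')[1]` would raise an IndexError on a name that contains no space.

```python
def parse_label(label):
    """Split 'SeqID Lineage_cladeN_host' into its parts.

    Returns None when the label carries no clade designation.
    """
    parts = label.split(" ", 1)
    if len(parts) < 2:
        return None
    seq_id, clade = parts
    fields = clade.split("_")
    return {
        "seq_id": seq_id,
        "codiv_clade": clade,
        "codiv_lineage": fields[0],            # e.g. 'Bt1'
        "hostSpecies": "wild_" + fields[-1],   # e.g. 'wild_gorilla'
    }

print(parse_label("Gorilla5764570 Bt1_clade1_gorilla"))
print(parse_label("Gorilla5764570"))
```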


In [39]:
#write only the co-diversifying seqs, renamed with their clade labels
with open(f'{outdir}/Bacteroidaceae.fna') as original:
    with open(f'{outdir}/moeller_codiv_Bacteroidaceae.fna', 'w') as codiv:
        records = SeqIO.parse(original, 'fasta')
        for record in records:
            if record.id in codiv_taxa_oldnames:
                print(record.id)
                record.id = codivDict[record.id]
                record.description = ''
                SeqIO.write(record, codiv, 'fasta')


Gorilla10693345
Gorilla10893151
Gorilla12208365
Gorilla1454159
Gorilla3280123
Gorilla3748318
Gorilla4351822
Gorilla4447500
Gorilla5764570
Gorilla6306341
Gorilla7248598
Gorilla7611566
Gorilla7655534
Gorilla8443136
Gorilla8497213
Gorilla8907336
Gorilla9193943
Gorilla9805772
Human10045430
Human10283488
Human10889793
Human10912985
Human10912992
Human12108738
Human1330489
Human1330495
Human1781917
Human237218
Human2572122
Human2935124
Human3034958
Human3367149
Human3418197
Human3961325
Human4630891
Human4639158
Human4639166
Human5110045
Human5110051
Human5465528
Human5838232
Human6390749
Human6405675
Human663970
Human663977
Human6649477
Human6700665
Human6700667
Human7250453
Human7563063
Human7563070
Human7859459
Human8146958
Human8295127
Human8829943
Human8842535
Human9177678
Human9518949
Human9825266
Chimp10218980
Chimp1067447
Chimp11049685
Chimp11943611
Chimp1322981
Chimp1324508
Chimp1493190
Chimp1528180
Chimp168543
Chimp1862511
Chimp2164238
Chimp2506335
Chimp2629519
Chimp2757936
Chimp29