This script finds the co-diversifying clades in the Moeller et al. supplemental FASTA by building a phylogeny and selecting the subtrees that match the topology of Figure 1.

Inputs: fasta of all Bacteroidaceae seqs, ref_gyrb_gtdbtk Bacteroidales reference seqs (fna and faa)

Outputs: fasta of just co-div seqs with lineage and clade labels

In [74]:
import pandas as pd
import os
from Bio import SeqIO
from Bio import Seq
from ete3 import Tree

In [98]:
os.chdir('/Volumes/AHN/captive_ape_microbiome')
outdir = 'results/gyrb/processing/moeller_sup'
final_outdir = 'results/gyrb/inputs'

In [99]:
os.system(f'mkdir -pv {outdir}')
os.system(f'mkdir -pv {final_outdir}')

0

### subset ref sequences

In [82]:
genus_count = {}
with open('ref_seqs/gtdbtk_gyrb.fasta') as original_fasta:
    with open(outdir+'/gtdbtk_gyrb.fasta', 'w') as subset_fasta:
        records = SeqIO.parse(original_fasta, 'fasta')
        for record in records:
            if 'o__Bacteroidales' in record.description or 'c__Chlorobia' in record.description or 'c__Ignavibacteria' in record.description:
                record.id = record.description.replace(' ','').replace(';','')
                record.description = ''
                SeqIO.write(record, subset_fasta, 'fasta')
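
The filter above keeps a reference record when its header mentions the order Bacteroidales or one of the two outgroup classes (Chlorobia, Ignavibacteria), then collapses the header into a single ID with no spaces or semicolons. A minimal, dependency-free sketch of that logic on plain header strings is given here. The helper names `keep_header` and `clean_id` and the example headers are made up for illustration and are not part of the notebook.

```python
# Hypothetical helpers mirroring the filter in the cell above.
KEEP_TAGS = ('o__Bacteroidales', 'c__Chlorobia', 'c__Ignavibacteria')

def keep_header(description):
    """Return True if the header names Bacteroidales or an outgroup class."""
    return any(tag in description for tag in KEEP_TAGS)

def clean_id(description):
    """Drop spaces and semicolons so the whole header becomes one ID token."""
    return description.replace(' ', '').replace(';', '')

# Made-up headers in a GTDB-like style (an assumption, not real records).
headers = [
    'GCF_000000001 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales',
    'GCF_000000002 d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales',
]
kept = [clean_id(h) for h in headers if keep_header(h)]
print(kept)
```

Only the first header survives, and its taxonomy string is fused onto the accession, which is why later cells can search leaf names for substrings such as `c__Chlorobia`.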

### Generate phylogeny of moeller sup and gtdbtk ref seqs

In [83]:
%%bash
outdir=results/gyrb/processing/moeller_sup

#trim primer sequences
cutadapt  -g CGGAGGTAARTTCGAYAAAGG  --overlap 21 -e .15 --discard-untrimmed  -o  $outdir/Bacteroidaceae.fna  data/gyrb/moeller_sup/Bacteroidaceae.fna

#translate gyrb amplicon seqs
transeq -frame 2 -sequence $outdir/Bacteroidaceae.fna  -outseq $outdir/Bacteroidaceae.faa

#translate ref seqs
transeq -frame 1 -sequence $outdir/gtdbtk_gyrb.fasta  -outseq $outdir/gtdbtk_gyrb.faa

#combine gyrb amplicon and ref seqs
cat $outdir/gtdbtk_gyrb.faa $outdir/Bacteroidaceae.faa > $outdir/Bacteroidaceae_ref.faa
cat $outdir/gtdbtk_gyrb.fasta  $outdir/Bacteroidaceae.fna > $outdir/Bacteroidaceae_ref.fna

#align seqs
mafft --auto --quiet $outdir/Bacteroidaceae_ref.faa > $outdir/Bacteroidaceae_ref.faa.aln
tranalign -asequence  $outdir/Bacteroidaceae_ref.fna -bsequence $outdir/Bacteroidaceae_ref.faa.aln -outseq $outdir/Bacteroidaceae_ref.fna.aln


This is cutadapt 2.5 with Python 3.7.7
Command line parameters: -g CGGAGGTAARTTCGAYAAAGG --overlap 21 -e .15 --discard-untrimmed -o results/gyrb/processing/moeller_sup/Bacteroidaceae.fna data/gyrb/moeller_sup/Bacteroidaceae.fna
Processing reads on 1 core in single-end mode ...
Finished in 0.02 s (21 us/read; 2.84 M reads/minute).

=== Summary ===

Total reads processed:                     724
Reads with adapters:                       724 (100.0%)
Reads written (passing filters):           724 (100.0%)

Total basepairs processed:       217,200 bp
Total written (filtered):        200,991 bp (92.5%)

=== Adapter 1 ===

Sequence: CGGAGGTAARTTCGAYAAAGG; Type: regular 5'; Length: 21; Trimmed: 724 times.

No. of allowed errors:
0-5 bp: 0; 6-12 bp: 1; 13-19 bp: 2; 20-21 bp: 3

Overview of removed sequences
length	count	expect	max.err	error counts
21	194	0.0	3	194
22	192	0.0	3	192
23	201	0.0	3	201
24	137	0.0	3	137


Translate nucleic acid sequences
Translate nucleic acid sequences
Generate an alignment of nucleic coding regions from aligned proteins
Error: Guide protein sequence Gorilla3374716_2 not found in nucleic sequence Gorilla3374716
Error: Guide protein sequence Human3058374_2 not found in nucleic sequence Human3058374
Error: Guide protein sequence Human9403092_2 not found in nucleic sequence Human9403092
Error: Guide protein sequence Chimp10413648_2 not found in nucleic sequence Chimp10413648
Error: Guide protein sequence Chimp10757854_2 not found in nucleic sequence Chimp10757854
Error: Guide protein sequence Chimp10796222_2 not found in nucleic sequence Chimp10796222
Error: Guide protein sequence Chimp11562025_2 not found in nucleic sequence Chimp11562025
Error: Guide protein sequence Chimp3146215_2 not found in nucleic sequence Chimp3146215
Error: Guide protein sequence Chimp8175258_2 not found in nucleic sequence Chimp8175258
Error: Guide protein sequence Bonobo2919026_2 not found in n
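
The primer passed to cutadapt with `-g` contains two IUPAC degenerate bases (R = A/G, Y = C/T), and a regular 5' adapter removes the primer together with any bases in front of it, which is consistent with the removed lengths of 21 to 24 bp in the report. The sketch below illustrates that behaviour with a regular expression. It is an exact-match simplification: it does not reproduce cutadapt's error-tolerant alignment (`-e .15`), and the function names and the example read are hypothetical.

```python
import re

# IUPAC codes that occur in this primer; R = A/G, Y = C/T.
IUPAC = {'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T', 'R': '[AG]', 'Y': '[CT]'}
PRIMER = 'CGGAGGTAARTTCGAYAAAGG'

def primer_regex(primer):
    """Expand a degenerate primer into a compiled regular expression."""
    return re.compile(''.join(IUPAC[base] for base in primer))

def trim_5prime(seq, pattern):
    """Remove the primer and everything before it; None if there is no exact match."""
    match = pattern.search(seq)
    return seq[match.end():] if match else None

pattern = primer_regex(PRIMER)
# Made-up read: one leading base, the primer with R=G and Y=C, then a short insert.
read = 'T' + 'CGGAGGTAAGTTCGACAAAGG' + 'ATGACC'
print(trim_5prime(read, pattern))
```

Returning `None` for reads without a match plays the role of `--discard-untrimmed` in this toy version.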

In [85]:
#trim start of gyrB gene
def trim_aln(original_fasta,trim_fasta,start_pos,end_pos):
    with open(original_fasta) as original_fasta:
        with open(trim_fasta, 'w') as trim_fasta:
            records = SeqIO.parse(original_fasta, 'fasta')
            for record in records:
                record.seq = record.seq[start_pos:end_pos] 
                SeqIO.write(record, trim_fasta, 'fasta')
trim_aln(f'{outdir}/Bacteroidaceae_ref.fna.aln',f'{outdir}/Bacteroidaceae_ref.fna.aln.trim',469,3039)
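
Because `record.seq[start_pos:end_pos]` is an ordinary Python slice, the coordinates are 0-based and the end position is excluded. The small sketch below, which uses a made-up toy alignment row and a hypothetical helper `trim_columns`, shows which columns are kept and how many columns the call above can retain at most.

```python
def trim_columns(aligned_seq, start_pos, end_pos):
    """Keep alignment columns start_pos .. end_pos - 1 (0-based, end excluded)."""
    return aligned_seq[start_pos:end_pos]

# Toy aligned row (made up); gaps are kept like any other column.
toy_row = 'ACGT--ACGT'
print(trim_columns(toy_row, 2, 6))

# Number of columns retained by trim_aln(..., 469, 3039), assuming the
# alignment is at least 3039 columns long.
n_columns = len(range(469, 3039))
print(n_columns)
```

If the alignment is shorter than `end_pos`, slicing silently returns fewer columns instead of raising an error, so it can be worth checking the alignment length before trimming.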

In [86]:
%%bash
fasttree -nt -gtr < results/gyrb/processing/moeller_sup/Bacteroidaceae_ref.fna.aln.trim > results/gyrb/processing/moeller_sup/Bacteroidaceae_ref.fna.aln.trim.tre


FastTree Version 2.1.10 Double precision (No SSE3)
Alignment: standard input
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
      0.15 seconds: Top hits for    149 of   1579 seqs (at seed    100)
      0.29 seconds: Top hits for    353 of   1579 seqs (at seed    200)
      0.39 seconds: Top hits for    450 of   1579 seqs (at seed    300)
      0.54 seconds: Top hits for    613 of   1579 seqs (at seed    400)
      0.65 seconds: Top hits for    714 of   1579 seqs (at seed    500)
      0.77 seconds: Top hits for    923 of   1579 seqs (at seed    600)
      0.88 seconds: Top hits for   1090 of   1579 seqs (at seed    700)
      0.99 seconds: Top hits for   1155 of   1579 seqs (at seed    900)
      1.09 seconds: Top hits for   1578 of   1579 seqs (at seed   1200)
      1.41

### extract codiv clades from tree
View the tree in FigTree,
root it on the outgroups,
select clades whose topology matches Figure 1 in Moeller et al.,
save the subtrees separately,
save the sequence names separately, and annotate them with clade designations.

In [101]:
in_tree = f'{outdir}/Bacteroidaceae_ref.fna.aln.trim.tre'   
rooted_tree = f'{outdir}/Bacteroidaceae_ref.fna.aln.trim.rooted.tre'
Bt1_tree_file = f'{outdir}/moeller_codiv_lin_Bt1.tree'  
Bt2_tree_file = f'{outdir}/moeller_codiv_lin_Bt2.tree'  
Bt3_tree_file = f'{outdir}/moeller_codiv_lin_Bt3.tree'  

tree = Tree(in_tree, format=0)
outgroup_taxa=[leaf.name for leaf in tree.get_leaves() if 
               'c__Ignavibacteria' in leaf.name or 
               'c__Chlorobia' in leaf.name]
outgroup_MRCA = tree.get_common_ancestor(outgroup_taxa)
tree.set_outgroup(outgroup_MRCA)
tree.write(outfile=rooted_tree,format=2)

def get_lineage_ASVs(in_tree,listTaxa):
    #makes sure we don't miss any ASVs that didn't match in the blast search
    #but are descended from the MRCA of those that did
    tree = Tree(in_tree, format=0)
    lineage_MRCA = tree.get_common_ancestor(listTaxa)
    lineage_taxa = [x.name for x in lineage_MRCA.get_leaves()]
    tree.prune(lineage_taxa)
    return tree

def rename_taxa(lin_tree,listTaxa,cladeName):
    lineage_MRCA = lin_tree.get_common_ancestor(listTaxa)
    for taxa in lineage_MRCA.get_leaves():    
        taxa.name = taxa.name + ' ' + cladeName

Bt1_tree = get_lineage_ASVs(rooted_tree,['Gorilla5764570','Bonobo7657683'])
Bt1_taxa_oldnames = [leaf.name for leaf in Bt1_tree.get_leaves()]        
rename_taxa(Bt1_tree,['Gorilla5764570','Gorilla10693345'],'Bt1_clade1_gorilla')
rename_taxa(Bt1_tree,['Chimp6685161','Chimp5888147'],'Bt1_clade1_chimp')
rename_taxa(Bt1_tree,['Bonobo11821974','Bonobo7657683'],'Bt1_clade1_bonobo')
Bt1_taxa = [leaf.name for leaf in Bt1_tree.get_leaves()] 
Bt1_tree.write(outfile=Bt1_tree_file,format=2) 

Bt2_tree = get_lineage_ASVs(rooted_tree,['Gorilla7248598',
                                         'Bonobo6937590'])
Bt2_taxa_oldnames = [leaf.name for leaf in Bt2_tree.get_leaves()]        
rename_taxa(Bt2_tree,['Gorilla7248598','Gorilla1454159'],'Bt2_clade1_gorilla')
rename_taxa(Bt2_tree,['Chimp7736944','Chimp3970093'],'Bt2_clade1_chimp')
rename_taxa(Bt2_tree,['Bonobo11725448','Bonobo4214590'],'Bt2_clade1_bonobo')
rename_taxa(Bt2_tree,['Chimp3781815','Chimp3450568'],'Bt2_clade2_chimp')
rename_taxa(Bt2_tree,['Bonobo11213703','Bonobo6937590'],'Bt2_clade2_bonobo')
Bt2_taxa = [leaf.name for leaf in Bt2_tree.get_leaves()] 
Bt2_tree.write(outfile=Bt2_tree_file,format=2) 

Bt3_tree = get_lineage_ASVs(rooted_tree,['Human5838232','Bonobo10623458'])
Bt3_taxa_oldnames = [leaf.name for leaf in Bt3_tree.get_leaves()]        
rename_taxa(Bt3_tree,['Human5838232','Human2935124'],'Bt3_clade1_human')
rename_taxa(Bt3_tree,['Chimp1528180','Chimp873638'],'Bt3_clade1_chimp')
rename_taxa(Bt3_tree,['Bonobo6212174','Bonobo4232320'],'Bt3_clade1_bonobo')
Bt3_taxa = [leaf.name for leaf in Bt3_tree.get_leaves()] 
Bt3_tree.write(outfile=Bt3_tree_file,format=2) 
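
The cell above defines each lineage by two anchor tips: it finds their most recent common ancestor (MRCA) and keeps every tip descended from it, so sequences that were never named explicitly are still captured. The following dependency-free sketch shows the same idea on a toy tree stored as a child-to-parent map. It does not use ete3, and all node names and helper functions are made up for illustration.

```python
# Toy rooted tree as a child -> parent map (made-up names, not real ASVs).
PARENT = {
    'n1': 'root', 'outgroup': 'root',
    'n2': 'n1', 'Human1': 'n1',
    'Gorilla1': 'n2', 'n3': 'n2',
    'Chimp1': 'n3', 'Bonobo1': 'n3',
}

def ancestors(node):
    """Path from a node up to the root, including the node itself."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def mrca(taxa):
    """Most recent node shared by the root paths of all taxa."""
    shared = set(ancestors(taxa[0]))
    for taxon in taxa[1:]:
        shared &= set(ancestors(taxon))
    # Walking up from the first taxon, the first shared node is the MRCA.
    return next(node for node in ancestors(taxa[0]) if node in shared)

def leaves_under(node):
    """All tips descended from a node (tips never appear as a parent)."""
    internal = set(PARENT.values())
    return sorted(tip for tip in PARENT
                  if tip not in internal and node in ancestors(tip))

clade_root = mrca(['Gorilla1', 'Bonobo1'])
print(clade_root, leaves_under(clade_root))
```

Here `Chimp1` is returned even though only `Gorilla1` and `Bonobo1` were given as anchors, which is the behaviour `get_lineage_ASVs` relies on.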

### create table with all codiv sequences, clade, and lineage info

In [96]:
codiv_taxa_oldnames = Bt1_taxa_oldnames + Bt2_taxa_oldnames + Bt3_taxa_oldnames
codiv_taxa = Bt1_taxa + Bt2_taxa + Bt3_taxa
codivDict = dict(zip(codiv_taxa_oldnames, codiv_taxa))

codivDF = pd.DataFrame(list(zip(codiv_taxa,codiv_taxa_oldnames)),columns=['codiv_taxa','codiv_taxa_oldnames'])

codivDF['moeller_seq'] = codivDF['codiv_taxa'].apply(lambda x: 'moeller' if 'p__Bacteroidota' not in x else 'ref')
codivDF = codivDF[codivDF['moeller_seq'] == 'moeller']
codivDF['codiv_clade'] = codivDF['codiv_taxa'].apply(lambda x: x.split(' ')[1])
codivDF['codiv_lineage'] = codivDF['codiv_clade'].apply(lambda x: x.split('_')[0])
codivDF['hostSpecies'] = codivDF['codiv_clade'].apply(lambda x: x.split('_')[-1])
codivDF = codivDF[['codiv_taxa','codiv_clade','codiv_lineage','hostSpecies']]
codivDF.to_csv(f'{outdir}/moeller_codiv_HRclades.txt',index=False,sep='\t')
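
The table columns are derived purely from the renamed tip labels, which have the form `<sequence name> <lineage>_<clade>_<host>`. A minimal sketch of that parsing without pandas is shown here. The helper names are hypothetical, the third label is made up, and the `has_clade` guard is an added assumption: a tip that was never renamed has no space in its label, so `x.split(' ')[1]` would raise an `IndexError` for it.

```python
def parse_label(label):
    """Split a renamed tip label the same way the table cell does."""
    clade = label.split(' ')[1]
    return {
        'codiv_taxa': label,
        'codiv_clade': clade,
        'codiv_lineage': clade.split('_')[0],
        'hostSpecies': clade.split('_')[-1],
    }

def has_clade(label):
    """Labels without a space were never renamed and carry no clade."""
    return ' ' in label

labels = [
    'Gorilla5764570 Bt1_clade1_gorilla',
    'Chimp3781815 Bt2_clade2_chimp',
    'Chimp0000000',  # made-up tip that falls outside every named clade
]
rows = [parse_label(label) for label in labels if has_clade(label)]
for row in rows:
    print(row)
```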

### subset supplement fasta to only have codiv sequences

In [97]:
with open(f'{outdir}/Bacteroidaceae.fna') as original:
    with open(f'{outdir}/moeller_codiv_Bacteroidaceae.fna', 'w') as codiv:
        records = SeqIO.parse(original, 'fasta')
        for record in records:
            #keep only sequences that fall in a codiv lineage, relabeled with clade names
            if record.id in codivDict:
                record.id = codivDict[record.id]
                record.description = ''
                SeqIO.write(record, codiv, 'fasta')


### copy over output files used by analysis scripts 

In [102]:
os.system(f'cp {outdir}/moeller_codiv_Bacteroidaceae.fna {final_outdir}/moeller_codiv_Bacteroidaceae.fna')
os.system(f'cp {outdir}/moeller_codiv_HRclades.txt {final_outdir}/moeller_codiv_HRclades.txt')
os.system(f'cp {outdir}/moeller_codiv_lin_Bt1.tree {final_outdir}/moeller_codiv_lin_Bt1.tree')
os.system(f'cp {outdir}/moeller_codiv_lin_Bt2.tree {final_outdir}/moeller_codiv_lin_Bt2.tree')
os.system(f'cp {outdir}/moeller_codiv_lin_Bt3.tree {final_outdir}/moeller_codiv_lin_Bt3.tree')

0