In [1]:
#Libraries
import re

## README: Removing the French Polynesian outgroup from the posterior distribution of trees

To properly root our phylogenies of Zika in the Americas, we included a French Polynesian genome as an outgroup. In the BEAST XML we've mandated that American ZIKV genomes be monophyletic, with this French Polynesian isolate as the outgroup. However, since we don't want French Polynesia as a deme in our phylogeographic analysis, we need to remove the French Polynesian genome from the posterior sample of trees before using those trees. This is quite simple to do because the French Polynesian genome is always the outgroup. We therefore just have to crop the French Polynesian genome, as well as the branch leading to the American MRCA, from each tree. I'm doing this cropping by matching regex patterns and substituting the matches out of the Newick string (an empty string for the outgroup, a bare `;` for the trailing branch).

This regex: `\(172\:\[&rate=[0-9]+\.[0-9]+\][0-9]+\.[0-9]+\,` matches all of the information for taxon #172 (which maps to the French Polynesian genome in my Nexus file). It is found right at the beginning of each Newick tree string.

This regex `\[&rate=[0-9]+\.[0-9]+\][0-9]+\.[0-9]+\)\;` matches the branch information for the branch that leads from the outgroup MRCA to the American monophyly MRCA. It sits at the very end of each Newick tree string and also needs to be removed.
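As a quick illustration of the two crops, here is a minimal sketch on a made-up three-taxon tree. The toy tree string, its rates and branch lengths, and the use of taxon `3` as the outgroup are all assumptions for demonstration only; the second pattern includes the leading `:` so that the whole branch annotation is replaced by a bare `;`, as in the Step 2 code.

```python
import re

# Hypothetical toy tree: taxon 3 stands in for the outgroup, taxa 1 and 2 for the American clade.
toy_line = ("tree STATE_0 [&R] "
            "(3:[&rate=0.001]5.2,"
            "(1:[&rate=0.001]1.5,2:[&rate=0.001]1.5):[&rate=0.001]3.7);")

outgroup = "3"  # assumed outgroup integer name for this toy example

# Crop 1: the outgroup tip, its annotation, its branch length, and the trailing comma.
front_pattern = r'\({}:\[&rate=[0-9]+\.[0-9]+\][0-9]+\.[0-9]+,'.format(outgroup)
front_cropped = re.sub(front_pattern, '', toy_line)

# Crop 2: the branch from the root to the American MRCA, plus the now unmatched ')'.
back_pattern = r':\[&rate=[0-9]+\.[0-9]+\][0-9]+\.[0-9]+\);'
cropped = re.sub(back_pattern, ';', front_cropped)

print(cropped)
```

After both substitutions only the American clade remains, and it becomes the root of the tree.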

#### Please note that this code is adapted from Baltic, a Python package written by Gytis Dudas for parsing BEAST `.trees` files. Shout out to him.

#### Step 1: Parse the non-Newick portions of the `.trees` file

In the first part of the file I'm going to make a dictionary linking the numerical taxon name (which BEAST assigns) with the string taxon name as it appears in the FASTA. I'm also going to write these lines back into the output `.trees` file, since that file needs the same header information. No Newick string parsing yet.
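To make the idea concrete, here is a minimal sketch of building that dictionary from a toy header. The header lines and taxon names below are hypothetical and only assume the usual layout of a `Translate` block (one `integer name` pair per line, closed by a `;`); the variable names are illustrative, not the ones used in the real cell.

```python
import re

# Hypothetical header lines; real files have many more taxa and longer names.
toy_lines = [
    "Begin taxa;",
    "\tDimensions ntax=3;",
    "End;",
    "Begin trees;",
    "\tTranslate",
    "\t\t1 brazil_2015,",
    "\t\t2 usvi_2016,",
    "\t\t3 french_polynesia_2013",
    "\t\t;",
]

n_taxa = None
tip_name_map = {}
in_translate = False

for line in toy_lines:
    # The taxon count comes from the 'Dimensions ntax=...;' line.
    count_match = re.search(r'Dimensions ntax=([0-9]+);', line)
    if count_match is not None:
        n_taxa = int(count_match.group(1))

    if 'Translate' in line:
        in_translate = True   # integer-to-name mapping starts on the next line
        continue

    if in_translate:
        if line.strip() == ';':
            in_translate = False  # a lone ';' closes the Translate block
            continue
        mapping = re.search(r"([0-9]+) (['\"A-Za-z0-9?|\-_./]+)", line)
        if mapping is not None:
            tip_name_map[mapping.group(1)] = mapping.group(2)

# Look up which integer name BEAST gave to the outgroup.
outgroup_int_name = [k for k, v in tip_name_map.items() if 'french_polynesia' in v][0]

print(tip_name_map)
print(outgroup_int_name)
```

Comparing `len(tip_name_map)` against `n_taxa` is a cheap way to confirm the regex caught every taxon name.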

In [33]:
with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/scripts/chain1AmZika-StrictClock-SkyGrid100-1Bstates.trees','r') as infile: # plain 'r': the 'U' flag was removed in Python 3.11
    
    taxaTranslation = False
    treeCounter = 0
   
    with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/beast/chain1AmZika-StrictClock-SkyGrid100_cropped.trees','w') as outfile:
        
        for line in infile: ## iterate through each line
            if 'state' not in line.lower(): #going to grab all the interesting stuff in the .trees file prior to the newick tree strings
                matchTaxonCount = re.search(r'Dimensions ntax\=([0-9]+)\;',line) ## Extract the taxon count from the bits preceding the actual trees (raw string keeps the regex escapes intact).
                if matchTaxonCount is not None:
                    nTaxa = int(matchTaxonCount.group(1))
            
            if 'Translate' in line: #int to full taxon name mapping follows in file
                taxaTranslation = True
                tipNameMap = {}
            
            if taxaTranslation:
                # str.replace returns a new string, so reassign to actually keep the expanded names
                line = line.replace('usa','united_states')
                line = line.replace('usvi','united_states_virgin_islands')
                matchMapping = re.search(r'([0-9]+) ([\'\"A-Za-z0-9\?\|\-\_\.\/]+)',line)
                if matchMapping is not None:
                    tipNameMap[matchMapping.group(1)] = matchMapping.group(2)
            
            if 'tree STATE_' in line:
                treeCounter += 1
            elif 'french_polynesia' in line:
                continue
            elif 'usa' in line:
                outfile.write(line.replace('usa','united_states'))
            elif 'usvi' in line:
                outfile.write(line.replace('usvi','united_states_virgin_islands'))
            else:
                outfile.write(line) #output trees file needs to have all the same initial headers etc as input file did
    #Checks
    assert len(tipNameMap) == nTaxa, 'not all tips read in by regex'
                

for key in tipNameMap.keys():
    if 'french_polynesia' in tipNameMap[key]:
        outgroupIntName = key

burnin = int(round(treeCounter*0.1)) #10% burnin, adapt as necessary

print('Found {} logged trees in .trees file.'.format(treeCounter))
print('First {} logged trees will be removed from output file as they are burn-in.'.format(burnin))
print('The outgroup integer name is {}.'.format(outgroupIntName))

Found 1276 logged trees in .trees file.
First 128 logged trees will be removed from output file as they are burn-in.
The outgroup integer name is 172.


#### Step 2: Crop Newick strings down to remove outgroup, appending these to the output `.trees` file.

In [34]:
with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/scripts/chain1AmZika-StrictClock-SkyGrid100-1Bstates.trees','r') as infile: # plain 'r': the 'U' flag was removed in Python 3.11
    with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/beast/chain1AmZika-StrictClock-SkyGrid100_cropped.trees','a') as outfile: #appending so as not to overwrite
        newickStringCounter = 0
        for line in infile:
            if 'tree STATE_' in line: #sampled tree strings now
                newickStringCounter += 1
                #crudMatch = re.match('tree\sSTATE\_([0-9]+).+\[\&R\]\s',line) #.match is always at beginning of string
                # raw strings keep the regex escapes intact; both patterns allow scientific notation (e.g. 8.5E-4)
                front_cropped_newick = re.sub(r'\({}\:\[&rate=[0-9]+\.[0-9]+(E\-[0-9]*)?\][0-9]+\.[0-9]+(E\-[0-9]*)?\,'.format(outgroupIntName), '', line) # match outgroup taxon
                back_cropped_newick = re.sub(r'\:\[&rate=[0-9]+\.[0-9]+(E\-[0-9]*)?\][0-9]+\.[0-9]+(E\-[0-9]*)?\)\;', ';', front_cropped_newick) #match branch length to Americas mrca
                if newickStringCounter > burnin: #only write out trees that are logged after burnin period.
                    outfile.write(back_cropped_newick)
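
Because the cropping is done with regexes rather than a tree parser, a light sanity check on the output can be worthwhile. The sketch below is not part of the original workflow: `check_cropped_newick` is a hypothetical helper that tests whether a tree line has balanced parentheses, ends in `;`, and no longer contains the outgroup tip. It assumes the annotations in square brackets never contain parentheses.

```python
import re

def check_cropped_newick(line, outgroup_int_name):
    """Return True if the Newick string is balanced, ends with ';', and lacks the outgroup tip."""
    newick = line[line.index('('):].strip()  # drop the 'tree STATE_n [&R] ' prefix

    depth = 0
    for ch in newick:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:      # a ')' with no matching '(' means the back crop failed
                return False

    # A tip label always follows '(' or ',' and is followed by ':'.
    tip_pattern = r'[(,]{}:'.format(re.escape(outgroup_int_name))
    has_outgroup = re.search(tip_pattern, newick) is not None

    return depth == 0 and newick.endswith(';') and not has_outgroup

# Hypothetical toy lines, with taxon 3 as the outgroup.
good = "tree STATE_0 [&R] (1:[&rate=0.001]1.5,2:[&rate=0.001]1.5);"
half_cropped = "tree STATE_0 [&R] (1:[&rate=0.001]1.5,2:[&rate=0.001]1.5):[&rate=0.001]3.7);"
uncropped = ("tree STATE_0 [&R] (3:[&rate=0.001]5.2,"
             "(1:[&rate=0.001]1.5,2:[&rate=0.001]1.5):[&rate=0.001]3.7);")

print(check_cropped_newick(good, "3"))
print(check_cropped_newick(half_cropped, "3"))
print(check_cropped_newick(uncropped, "3"))
```

Running a check like this over each `back_cropped_newick` before writing it out would flag any tree where one of the two patterns failed to match.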