# 2/1/2019 Slack Update

## Raw tree vs. time-resolved tree

The two trees that augur makes (the raw tree and the time-resolved tree inferred with TreeTime) look different from each other. The time-resolved tree appeared to have fewer tips, so I wanted to compare the two directly to see whether we were, indeed, losing tips.

### Raw tree

In [22]:
from ete3 import Tree

tree_raw = Tree("../spatial_coupling/zika-tutorial/results/tree_raw.nwk")
print(tree_raw)


   /-PAN/CDC_259359_V1_V3/2015
  |
  |      /-COL/FLR_00024/2015
  |   /-|
  |--|   \-COL/FLR_00008/2015
  |  |
  |   \-VEN/UF_1/2016
  |
  |   /-Colombia/2016/ZC204Se
  |  |
  |  |      /-Brazil/2015/ZBRC301
  |  |     |
  |  |     |            /-Brazil/2015/ZBRA105
--|  |     |           |
  |  |     |           |               /-USA/2016/FL022
  |  |     |           |            /-|
  |  |     |           |         /-|   \-USA/2016/FLWB042
  |  |     |           |        |  |
  |  |     |         /-|      /-|   \-DOM/2016/BB_0059
  |  |     |        |  |     |  |
  |  |     |        |  |     |  |   /-USA/2016/FLUR022
  |  |     |        |  |   /-|   \-|
  |  |     |        |  |  |  |      \-Aedes_aegypti/USA/2016/FL05
  |  |     |        |  |  |  |
  |  |     |      /-|   \-|  |   /-DOM/2016/BB_0183
  |  |     |     |  |     |   \-|
  |  |     |     |  |     |      \-DOM/2016/MA_WGS16_011
   \-|   /-|     |  |     |
     |  |  |     |  |      \-DOM/2016/BB_0433
     |  |  |     |  

### Time-resolved tree

In [23]:
tree = Tree("../spatial_coupling/zika-tutorial/results/tree.nwk", format=1)
print(tree)


      /-Thailand/1610acTw
     |
   /-|   /-SG_018
  |  |  |
  |   \-|   /-SG_056
  |     |  |
  |      \-|--SG_027
  |        |
--|         \-SG_074
  |
  |      /-ZKC2/2016
  |   /-|
  |  |   \-SMGC_1
  |  |
  |  |   /-1_0087_PF
   \-|  |
     |  |   /-1_0181_PF
     |  |  |
     |  |  |--1_0199_PF
      \-|  |
        |  |      /-PRVABC59
        |  |   /-|
        |  |  |  |   /-Brazil/2016/ZBRC16
        |  |  |   \-|
        |  |  |     |   /-V8375
         \-|  |      \-|
           |  |        |   /-HND/2016/HU_ME59
           |  |         \-|
           |  |            \-Nica1_16
           |  |
           |  |   /-Brazil/2015/ZBRC301
           |  |  |
           |  |  |   /-Brazil/2015/ZBRC303
           |  |--|--|
            \-|  |   \-BRA/2016/FC_6706
              |  |
              |  |   /-Colombia/2016/ZC204Se
              |   \-|
              |     |   /-PAN/CDC_259359_V1_V3/2015
              |      \-|
              |        |   /-VEN/UF_1/2016
              |  

### Build lists of the nodes on both trees

In [24]:
# List of node names in the raw tree
# (traverse() visits internal nodes as well as tips)
strains_raw = []
for node in tree_raw.traverse("postorder"):
    strains_raw.append(node.name)

# List of node names in the time-resolved tree
strains_tree = []
for node in tree.traverse("postorder"):
    strains_tree.append(node.name)



### Comparing the time-resolved tree to the raw tree

The `overlap_TR` list contains all of the nodes that are in both trees.

The `excluded_TR` list contains all of the nodes that are in the time-resolved tree but not in the raw tree.

In [25]:
# For TR-tree in Raw
overlap_TR = []
excluded_TR = []
ignore = 0
for TR in strains_tree:
    for raw in strains_raw:
        if raw == TR:
            overlap_TR.append(TR)
            ignore = 1
    if ignore == 0 and TR not in excluded_TR:
        excluded_TR.append(TR)
    ignore = 0
    
print("TR-tree > Raw overlap: ", overlap_TR, "\n\n", "TR-tree > Raw exclusion: ", excluded_TR)

TR-tree > Raw overlap:  ['Thailand/1610acTw', 'SG_018', 'SG_056', 'SG_027', 'SG_074', 'ZKC2/2016', 'SMGC_1', '1_0087_PF', '1_0181_PF', '1_0199_PF', 'PRVABC59', 'Brazil/2016/ZBRC16', 'V8375', 'HND/2016/HU_ME59', 'Nica1_16', 'Brazil/2015/ZBRC301', 'Brazil/2015/ZBRC303', 'BRA/2016/FC_6706', 'Colombia/2016/ZC204Se', 'PAN/CDC_259359_V1_V3/2015', 'VEN/UF_1/2016', 'COL/FLR_00024/2015', 'COL/FLR_00008/2015', 'EcEs062_16', 'Brazil/2015/ZBRA105', 'DOM/2016/BB_0433', 'DOM/2016/BB_0183', 'DOM/2016/MA_WGS16_011', 'USA/2016/FLUR022', 'Aedes_aegypti/USA/2016/FL05', 'DOM/2016/BB_0059', 'USA/2016/FL022', 'USA/2016/FLWB042'] 

 TR-tree > Raw exclusion:  ['NODE_0000016', 'NODE_0000015', 'NODE_0000014', 'NODE_0000013', 'NODE_0000035', 'NODE_0000021', 'NODE_0000020', 'NODE_0000019', 'NODE_0000005', 'NODE_0000002', 'NODE_0000001', 'NODE_0000000', 'NODE_0000003', 'NODE_0000006', 'NODE_0000029', 'NODE_0000031', 'NODE_0000030', 'NODE_0000028', 'NODE_0000026', 'NODE_0000025', 'NODE_0000024', 'NODE_0000023', 'NO
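The nested loops above amount to a set intersection and a set difference. Here is a minimal sketch of the same comparison with Python sets, using a toy subset of the names (the real lists come from `tree.traverse()`; the variable names `raw_names`, `tr_names`, `only_in_tr` and `only_in_raw` are mine). Note that sets drop duplicates and ordering, so this only answers "which names are shared / missing".

```python
# Toy subsets of the two name lists built from the trees.
strains_raw = ["PAN/CDC_259359_V1_V3/2015", "VEN/UF_1/2016", "", "KX369547.1"]
strains_tree = ["PAN/CDC_259359_V1_V3/2015", "VEN/UF_1/2016", "NODE_0000001"]

raw_names = set(strains_raw)
tr_names = set(strains_tree)

overlap = sorted(raw_names & tr_names)       # names present in both trees
only_in_tr = sorted(tr_names - raw_names)    # time-resolved tree only
only_in_raw = sorted(raw_names - tr_names)   # raw tree only

print("overlap:", overlap)
print("only in time-resolved tree:", only_in_tr)
print("only in raw tree:", only_in_raw)
```

Because the intersection is symmetric, there is no need to compute the overlap twice (once per direction) and then check that the two results agree.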

### Comparing the raw tree to the time-resolved tree

The `overlap_raw` list contains all of the nodes that are in both trees.

The `excluded_raw` list contains all of the nodes that are in the raw tree but not in the time-resolved tree.

In [26]:
# For Raw in TR-tree
overlap_raw = []
excluded_raw = []
ignore = 0
for raw in strains_raw:
    for TR in strains_tree:
        if raw == TR:
            overlap_raw.append(raw)
            ignore = 1
    if ignore == 0 and raw not in excluded_raw:
        excluded_raw.append(raw)
    ignore = 0
    
print("Raw > TR-tree overlap: ", overlap_raw, "\n\n", "Raw > TR-tree exclusion: ", excluded_raw, "\n")
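# list.sort() sorts in place and returns None, so comparing the return values
# of .sort() is always True; compare sorted copies instead
print("Is overlap_raw == overlap _TR?: ", sorted(overlap_raw) == sorted(overlap_TR))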

Raw > TR-tree overlap:  ['PAN/CDC_259359_V1_V3/2015', 'COL/FLR_00024/2015', 'COL/FLR_00008/2015', 'VEN/UF_1/2016', 'Colombia/2016/ZC204Se', 'Brazil/2015/ZBRC301', 'Brazil/2015/ZBRA105', 'USA/2016/FL022', 'USA/2016/FLWB042', 'DOM/2016/BB_0059', 'USA/2016/FLUR022', 'Aedes_aegypti/USA/2016/FL05', 'DOM/2016/BB_0183', 'DOM/2016/MA_WGS16_011', 'DOM/2016/BB_0433', 'EcEs062_16', 'HND/2016/HU_ME59', 'V8375', 'Nica1_16', 'Brazil/2016/ZBRC16', 'PRVABC59', 'ZKC2/2016', 'SMGC_1', 'SG_027', 'SG_074', 'SG_056', 'SG_018', 'Thailand/1610acTw', '1_0087_PF', '1_0181_PF', '1_0199_PF', 'Brazil/2015/ZBRC303', 'BRA/2016/FC_6706'] 

 Raw > TR-tree exclusion:  ['', 'KX369547.1'] 

Is overlap_raw == overlap _TR?:  True


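This looks like we expect it to. Both comparisons yield the same overlapping tips. From **raw** to **time-resolved** we lose the tip "KX369547.1", which is fine: it was removed by the `--clock-filter-iqd 4` option of `augur refine`, which discards tips that deviate from the root-to-tip vs. time regression by more than 4 interquartile ranges. The empty string `''` in the exclusion list is presumably the name of the raw tree's unnamed internal nodes rather than a lost tip.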
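To make the interquartile-range rule concrete for myself, here is a rough sketch of that kind of filter. This is NOT TreeTime's or augur's actual code; it is just one plausible reading of the rule, assuming each tip has a residual from the clock regression. The function name `iqd_outliers`, the tip names, and the residual values are all made up for illustration.

```python
from statistics import quantiles

def iqd_outliers(residuals, n_iqd=4):
    """Return the names whose |residual| exceeds n_iqd interquartile distances."""
    q1, _, q3 = quantiles(list(residuals.values()), n=4)
    iqd = q3 - q1  # interquartile distance of the residuals
    return [name for name, r in residuals.items() if abs(r) > n_iqd * iqd]

# Hypothetical residuals (arbitrary units): eight well-behaved tips and one outlier.
residuals = {
    "tip_a": -0.3, "tip_b": -0.2, "tip_c": -0.1, "tip_d": 0.0,
    "tip_e": 0.1, "tip_f": 0.2, "tip_g": 0.3, "tip_h": 0.4,
    "outlier_tip": 5.0,
}

flagged = iqd_outliers(residuals)
print(flagged)
```

With only a handful of tips, a single extreme value can inflate the interquartile distance enough to hide itself, so a filter like this only behaves sensibly when most tips follow the clock.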

We see several "NODE_####" nodes in the time-resolved tree, which is a little bit confusing, since all we did was run `augur refine`, which takes our `metadata.tsv` information (with columns: strain, virus, accession, date, region, country, division, city, db, segment, authors, url, title, journal, paper_url) and turns our raw tree into our time-resolved tree. These are not tips: `traverse` also visits internal nodes, and those appear to be unnamed in the raw tree. Maybe `augur refine` names them so that the `branch_lengths.json` file given to us by `--output-node-data zika-tutorial/results/branch_lengths.json` can refer to them?
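To keep internal nodes out of the tip comparison, the name lists can be split before comparing. A minimal sketch, assuming (as the outputs above suggest, but I haven't confirmed) that internal nodes are either unnamed or start with `NODE_`; the helper `split_tips_and_internal` is hypothetical.

```python
def split_tips_and_internal(names, internal_prefix="NODE_"):
    """Split node names into (tips, internal) using the naming assumption above."""
    tips, internal = [], []
    for name in names:
        # Assumption: an empty name or a NODE_ prefix marks an internal node.
        if name == "" or name.startswith(internal_prefix):
            internal.append(name)
        else:
            tips.append(name)
    return tips, internal

# Toy list mixing tip names and internal-node names from the outputs above.
names = ["SG_018", "NODE_0000016", "SG_056", "", "NODE_0000015"]
tips, internal = split_tips_and_internal(names)
print(len(tips), "tips;", len(internal), "internal nodes")
```

With the ete3 trees themselves, a more direct route is probably to iterate over `tree.iter_leaves()` (or check `node.is_leaf()` inside the `traverse` loop), which doesn't depend on how the nodes are named.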

## Do the mutation .json files make sense?



### We have: `nt_muts.json` :

In [27]:
import json
with open("../spatial_coupling/zika-tutorial/results/nt_muts.json") as nt_muts:
    data = json.load(nt_muts)

# Nucleotide mutations in first node as an example    
print(json.dumps(next(iter(data['nodes'])), indent=4), json.dumps(data['nodes']['1_0087_PF']['muts'],indent=4))

# To print the whole file:
# print(json.dumps(data['nodes'], indent=4))

"1_0087_PF" [
    "C3433T",
    "A3639G",
    "G8478A"
]


### and `aa_muts.json` : 

In [28]:
with open("../spatial_coupling/zika-tutorial/results/aa_muts.json") as aa_muts:
    data = json.load(aa_muts)

# This is kind of a crappy example since there are no mutations in the first node
print(json.dumps(next(iter(data['nodes'])), indent=4), 
      json.dumps(next(iter(data['nodes'].values())), indent=4))

# How is it possible that so many strains don't have mutations?
# Full output of the file:
# print(json.dumps(data['nodes'], indent=4))

"1_0087_PF" {
    "aa_muts": {
        "2K": [],
        "CA": [],
        "ENV": [],
        "MP": [],
        "NS1": [],
        "NS2A": [],
        "NS2B": [],
        "NS3": [],
        "NS4A": [],
        "NS4B": [],
        "NS5": [],
        "PRO": []
    }
}
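To see at a glance which nodes carry any amino acid mutations, the nested structure can be flattened into one count per node. A small sketch on a toy dictionary that mirrors the `node -> "aa_muts" -> protein -> list` layout shown above; the node names and mutation strings here are hypothetical, and `count_aa_muts` is my own helper.

```python
# Toy stand-in for data["nodes"] from aa_muts.json (values are made up).
nodes = {
    "node_a": {"aa_muts": {"ENV": [], "NS1": []}},
    "node_b": {"aa_muts": {"ENV": ["A123V"], "NS1": []}},
    "node_c": {"aa_muts": {"ENV": ["A123V"], "NS1": ["K45R", "T200I"]}},
}

def count_aa_muts(node_entry):
    """Total number of amino acid mutations across all proteins for one node."""
    return sum(len(muts) for muts in node_entry["aa_muts"].values())

counts = {name: count_aa_muts(entry) for name, entry in nodes.items()}
with_muts = [name for name, n in counts.items() if n > 0]

print(counts)
print("nodes with at least one mutation:", with_muts)
```

On the real file, replacing `nodes` with `data['nodes']` should show how many nodes have an empty mutation list for every protein.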


### What do the mutations listed in this `.json` file refer to? If the answer is "mutations adopted from the parent node" why don't we see more mutations throughout?
I've just printed the first node as an example here, but it seems odd that there are no amino acid mutations listed for this (and many other) nodes. Sarah and I initially wanted to compare the "aa_muts" dictionaries between all nodes to confirm that we see more divergence as we move down the tree, but I had trouble iterating through the nested dictionaries, since the dictionaries aren't ordered by position in the tree.

As we discussed the problem, we decided a quicker sanity check would be to simply ask whether the nucleotide mutations listed for node `1_0087_PF` can be found in its sequence. We then weren't sure whether the notation `"C3433T"` refers to nucleotide position 3433 in the whole **sequence** or within a **protein** (like "ENV" or "MP"). I now realize the answer is obvious: it's the **sequence**, because the `nt_muts.json` file doesn't nest the mutations within proteins like the `aa_muts.json` file does.

Still, we decided it'd be worthwhile to switch to HA now and get the augur pipeline working on our virus of interest before we 1) spend time trying to answer questions that may be irrelevant to influenza, and/or 2) overlook more prominent questions that may arise when we look at a virus whose tree structure is quite different from that of Zika virus.
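For when we come back to the sanity check, here is a rough sketch of how it could go. It assumes the notation is `<original base><position><new base>` with 1-based positions, and that the node's sequence uses the same coordinates as the mutation list; I haven't verified either against the augur docs. The helpers `parse_mutation` and `carries_mutation` and the toy sequence are made up.

```python
import re

# One letter (or gap), a position, one letter (or gap), e.g. "C3433T".
MUT_PATTERN = re.compile(r"^([A-Z\-])(\d+)([A-Z\-])$")

def parse_mutation(mut):
    """Split a mutation string like 'C3433T' into (original, position, new)."""
    match = MUT_PATTERN.match(mut)
    if match is None:
        raise ValueError(f"Unrecognized mutation string: {mut!r}")
    original, position, new = match.groups()
    return original, int(position), new

def carries_mutation(sequence, mut):
    """True if the sequence has the mutated base at that (assumed 1-based) position."""
    _, position, new = parse_mutation(mut)
    return sequence[position - 1] == new

print(parse_mutation("C3433T"))

toy_sequence = "ACGTGCGT"  # hypothetical 8-base sequence, not real Zika data
print(carries_mutation(toy_sequence, "A5G"))  # position 5 is G
print(carries_mutation(toy_sequence, "C2T"))  # position 2 is still C
```

One caveat: a `False` here wouldn't necessarily mean the file is wrong, since a later mutation at the same site would overwrite the earlier one.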


## What next?

1. Decide what `augur` settings to keep from the zika tutorial
2. Create my influenza HA tree
3. Follow up on my recent "sanity check" questions 