## A notebook to compare strains between the public & private builds and converge on the appropriate dataset

Private names are 


Note: Private datasets use different strain names. Before using this notebook, run the script `./scripts/convert_private_json_names_to_match_public.py` to produce a private build whose names are in the same format as the public build. This makes tanglegrams and name matching straightforward.
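
Name matching in this notebook comes down to set operations on strain names, which only work once all builds spell a given strain identically. The toy example below sketches that idea; the dataset labels and single-letter names are placeholders, and `unique_to_each` is a hypothetical helper rather than code from this repo.

```python
def unique_to_each(datasets):
    """For each dataset, return the names that appear in no other dataset."""
    result = {}
    for name, members in datasets.items():
        others = [m for other, m in datasets.items() if other != name]
        # set().union(*others) also works when `others` is empty
        result[name] = members - set().union(*others)
    return result

# Placeholder strain names, not real samples
datasets = {
    "private": {"A", "B", "C"},
    "public": {"B", "C", "D"},
    "ms": {"B", "C", "E"},
}

shared = set.intersection(*datasets.values())
unique = unique_to_each(datasets)
print(sorted(shared))
print({k: sorted(v) for k, v in unique.items()})
```

If the private build still carried its original names, every private strain would show up as unique and the intersection would be empty, which is why the renaming step comes first.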

## How to get the private JSON

1. Use the private git repo at 2f3c4de (commit msg: "updated to sequencing run 2020-07-28")
2. Convert to auspice v2 format (see below)
3. Convert the JSON into a format where the names match the public & manuscript builds

```sh
python ./scripts/convert_private_json_names_to_match_public.py ignore/private_2f3c4de.json > ignore/private_2f3c4de_stripped-names.json
```
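
The conversion script itself is not reproduced in this notebook. As a rough sketch of the kind of pass it might perform, the snippet below walks an auspice v2 tree and rewrites each tip name with a caller-supplied function. The names `rename_tips` and `strip_prefix`, and the assumption that private names look like `<prefix>|<strain>`, are hypothetical and for illustration only; the real naming rule lives in the script.

```python
import json

def rename_tips(node, rename):
    """Recursively apply `rename` to every tip (leaf) name in a tree of nested dicts."""
    children = node.get("children", [])
    if not children:
        node["name"] = rename(node["name"])
    for child in children:
        rename_tips(child, rename)
    return node

def strip_prefix(name):
    # HYPOTHETICAL rule: keep only the text after the last "|"
    return name.rsplit("|", 1)[-1]

# Toy tree standing in for the "tree" entry of an auspice v2 JSON
toy = {"name": "NODE_0000001", "children": [
    {"name": "lab|TIP_A"},
    {"name": "NODE_0000002", "children": [{"name": "lab|TIP_B"}]},
]}
renamed = rename_tips(toy, strip_prefix)
print(json.dumps(renamed, indent=2))
```

Passing the rule in as a function keeps the tree traversal separate from whatever naming convention the private build actually uses.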

## How to get the public JSON

1. Convert the public build JSON via `auspice convert --v1 <meta json> <tree json> --output ./ignore/public_5891e6e.json`


## How to get the manuscript JSON

1. `auspice convert --v1 auspice/ebola-narrative-ms_meta.json auspice/ebola-narrative-ms_tree.json --output ignore/ms_4e33e36.json`
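
Before loading a converted file, it can help to confirm it has the layout the cells below rely on: a top-level `tree` object whose root node carries a `name`. The helper below is a minimal sketch; `looks_like_auspice_v2` is a hypothetical name, and the check is deliberately shallow rather than a full schema validation.

```python
def looks_like_auspice_v2(data):
    """Shallow check: a dict with a dict-valued 'tree' whose root has a 'name'."""
    if not isinstance(data, dict):
        return False
    tree = data.get("tree")
    return isinstance(tree, dict) and "name" in tree

# Toy stand-ins: one with the expected top-level "tree", one without it
with_tree = {"meta": {}, "tree": {"name": "NODE_0000000", "children": []}}
without_tree = {"name": "NODE_0000000", "children": []}
print(looks_like_auspice_v2(with_tree), looks_like_auspice_v2(without_tree))
```

In practice the argument would be the result of `json.load` on one of the converted files.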

In [1]:
private_json_path = "../ignore/private_2f3c4de_stripped-names.json"
private_dropped_strains_path = "../ignore/private_2f3c4de_dropped-trains.txt"

public_json_path = "../ignore/public_5891e6e.json"
public_dropped_strains_path = "../ignore/public_5891e6e_dropped-strains.txt"

ms_json_path = "../auspice/ebola_nord-kivu_manuscript.json"
ms_dropped_strains_path = "../config/dropped_strains.txt"
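
All six paths are relative to the notebook's working directory, so a missing file would otherwise only surface later as a `FileNotFoundError`. A pre-flight check along these lines can report everything that is missing at once; `find_missing` is a hypothetical helper, not part of the repo, and the demo uses throwaway files rather than the real inputs.

```python
import os
import tempfile
from pathlib import Path

def find_missing(paths):
    """Return the subset of `paths` that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]

# Demo with one file that exists (created here) and one that does not
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "present.json")
    Path(present).write_text("{}")
    absent = os.path.join(tmp, "absent.json")
    missing = find_missing([present, absent])
    print([os.path.basename(m) for m in missing])
```

In the notebook this could be called with the six `*_path` variables defined above.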



In [2]:
import json
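def get_names_from_auspice_v2_json(p):
    """Map each tip name in an auspice v2 JSON to its node_attrs."""
    s = {}
    def collect(n):
        # Skip nodes whose names start with "NODE" (internal nodes); keep tips
        if not n['name'].startswith("NODE"):
            s[n['name']] = n['node_attrs']
        if 'children' in n:
            for c in n['children']:
                collect(c)
    with open(p) as fh:
        d = json.load(fh)
    collect(d['tree'])
    return s

def get_names_from_exclude_list(p):
    """Map each strain in a dropped-strains file to the rest of its line."""
    s = {}
    with open(p) as fh:
        for line in fh:
            # Ignore comment lines and blank lines
            if not line.startswith('#') and line.strip():
                parts = line.split()
                # First token is the strain name; the remainder is kept as a note
                s[parts[0]] = " ".join(parts[1:])
    return s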

strains = {
    'private_dataset': get_names_from_auspice_v2_json(private_json_path),
    'public_dataset': get_names_from_auspice_v2_json(public_json_path),
    'ms_dataset': get_names_from_auspice_v2_json(ms_json_path),

    'private_dropped': get_names_from_exclude_list(private_dropped_strains_path),
    'public_dropped': get_names_from_exclude_list(public_dropped_strains_path),
    'ms_dropped': get_names_from_exclude_list(ms_dropped_strains_path),
}

def compare_datasets(*names):
    # One set of strain names per requested dataset
    strain_sets = [set(strains[n]) for n in names]
    for n, strain_set in zip(names, strain_sets):
        print(f"{n} has {len(strain_set)} strains")
    print()
    
    ## intersection
    strains_in_common = set.intersection(*strain_sets)
    print(f"n(strains in common): {len(strains_in_common)}")
    print()

    ## get unique to each dataset
    for i in range(0, len(names)):
        print(f"Strains unique to {names[i]}:")
        # set().union(...) also works when there are no other datasets
        for n in strain_sets[i] - set().union(*[s for j, s in enumerate(strain_sets) if j!=i]):
            print(f"\t{n}")

    print()
    ## get strains shared by every dataset except one; with n=3 datasets, leaving one out leaves a pair
    for i in range(0, len(names)):
        print(f"Strains unique to {' & '.join([names[j] for j in range(0, len(names)) if i!=j])}, in other words excluded from {names[i]}")
        for n in set.intersection(*[s for j, s in enumerate(strain_sets) if j!=i]) - strain_sets[i]:
            # Flag strains that the left-out dataset explicitly dropped
            postscript = f"(in {names[i]} dropped list)" if n in strains[names[i].replace('dataset', 'dropped')] else ""
            print(f"\t{n} {postscript}")
    
compare_datasets("private_dataset", "public_dataset", "ms_dataset")


private_dataset has 787 strains
public_dataset has 742 strains
ms_dataset has 792 strains

n(strains in common): 738

Strains unique to private_dataset:
	MAN5036
	CSF2
	CSF1
	Blood2
Strains unique to public_dataset:
	BEN40781
Strains unique to ms_dataset:
	BEN1543
	MWG016
	BEN4627
	MAM1133
	BTB485
	BTB9406
	BTB41015
	BTB513

Strains unique to public_dataset & ms_dataset, in other words excluded from private_dataset
	KOM2796 (in private_dataset dropped list)
	MAN035 (in private_dataset dropped list)
Strains unique to private_dataset & ms_dataset, in other words excluded from public_dataset
	BEN28454 
	BEN24751 
	BEN30737 
	BEN20822 
	MAN5973 
	BEN32140 
	BEN24739 
	BEN23744 
	BEN27614 
	BEN23697 
	BEN22653 
	MAN6208 
	BEN25515 
	BTB26602 
	BEN25705 
	MAN5078 
	MAN5940 
	MAN5602 
	BEN28238 
	BEN28552 
	MAN6173 
	BEN28142 
	BEN22643 
	BTB27764 
	MAN5750 
	BEN26260 
	BEN24640 
	BEN23856 
	BEN23895 
	MAN6566 
	MAN6448 
	BEN25101 
	BEN23438 
	BTB27289 
	BEN22364 
	BEN25704 
	BEN28389 
	MAN