# Tree from most closely related sequences

This notebook extracts the most closely related sequences from a large tree of diverse SARS-CoV-2 sequences spanning multiple VOCs, clades and lineages, including closely related sequences identified by UShER analysis. A maximum-likelihood tree was then generated with IQ-TREE from a MAFFT multiple-sequence alignment of the Ontario WTD and human sequences plus their most closely related sequences, 157 sequences in total. This 157-taxon tree was used to generate the tree in Figure 3 after post-processing to collapse nodes with identical amino acid mutation patterns relative to the ON WTD sequences.

In [96]:
import pandas as pd

Read metadata for all sequences in the 9163-taxon tree

In [97]:
df = pd.read_table('metadata.tsv')

In [98]:
df = df.set_index('sample')

In [99]:
df.lineage.value_counts()

B.1          4604
B.1.311      3817
B.1.2          83
AY.44          32
B.1.160        26
             ... 
B.1.1.221       1
B.1.452         1
B.1.1.362       1
B.1.1.39        1
B.1.239         1
Name: lineage, Length: 161, dtype: int64

Read tree with BioPython's Phylo module

In [100]:
from Bio import Phylo

In [101]:
tree = Phylo.read('results-B.1.311/iqtree/iqtree-MN908947.3-GTR.treefile', format='newick')

In [102]:
tree.count_terminals()

9163

Find one of the ON WTD taxa

In [103]:
node_4662 = tree.find_any(name='4662')

Retrieve the largest subtree around sample 4662 that contains at most 300 taxa

In [104]:
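node_under_300 = None
# get_path() runs from the root side down to the tip, so reverse it to walk upward
for n in tree.get_path(node_4662)[::-1]:
    if n.is_terminal():
        continue  # skip the tip itself
    if n.count_terminals() <= 300:
        node_under_300 = n  # largest ancestral clade so far with at most 300 tips
    else:
        break  # the next ancestor is too large, stop here
    print(n.branch_length, n.name, n.count_terminals())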

2.0138e-06 None 6
0.0016228185 None 8
0.0004953373 None 10
0.0002341633 None 68
3.3443e-05 None 69
1e-06 None 122
1e-06 None 157
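
The ancestor walk above can be wrapped in a small reusable function. The sketch below is only an illustration on a toy tree: `Node` is a hypothetical stand-in for a tree clade (it is not the `Bio.Phylo` API), and `largest_ancestor_within` and `max_tips` are names introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Minimal stand-in for a tree clade (hypothetical, not Bio.Phylo)."""
    name: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

    def count_terminals(self) -> int:
        if not self.children:
            return 1
        return sum(c.count_terminals() for c in self.children)


def largest_ancestor_within(tip: Node, max_tips: int) -> Optional[Node]:
    """Walk from `tip` toward the root and return the last ancestor whose
    clade holds at most `max_tips` terminals (None if the parent is already too big)."""
    best = None
    node = tip.parent
    while node is not None and node.count_terminals() <= max_tips:
        best = node
        node = node.parent
    return best


# Toy tree: ((a, b), c, (d, e, f)) with 6 tips in total.
root = Node("root")
ab = root.add(Node("ab"))
a = ab.add(Node("a"))
ab.add(Node("b"))
root.add(Node("c"))
d_e_f = root.add(Node("def"))
for tip_name in "def":
    d_e_f.add(Node(tip_name))

print(largest_ancestor_within(a, 2).name)  # ab
print(largest_ancestor_within(a, 6).name)  # root
```

Returning `None` when even the direct parent exceeds the limit makes the "nothing small enough" case explicit, which the loop in the notebook leaves as an unassigned `node_under_300`.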


In [105]:
related_samples = [x.name for x in node_under_300.get_terminals()]

Obtained subtree of 157 taxa

In [106]:
len(related_samples)

157

Subset metadata table to get metadata for 157 subtree taxa

In [107]:
df_subtree = df.loc[list(set(related_samples) & set(df.index)),:]

In [109]:
df_subtree.lineage.value_counts()

B.1        144
B.1.311      8
B.1.516      2
B.1.582      2
B.1.2        1
Name: lineage, dtype: int64

Remove major outliers identified by visual inspection in Dendroscope

In [120]:
outliers = '''
Nigeria/MRL-0S-0446/2021
'''.strip().split('\n')

In [121]:
outliers

['Nigeria/MRL-0S-0446/2021']

In [122]:
for outlier in outliers:
    tree.prune(target=outlier)

In [123]:
tree.count_terminals()

9162

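Write the tree without the outlier(s) to file. Passing `format_branch_length="%f"` writes branch lengths with six decimal places, so BioPython trims fewer digits than with its default format (`%1.5f`); branch lengths below about 1e-06 are still rounded.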

In [66]:
Phylo.write(tree, 'global-tree-without-outlier.newick', 'newick', format_branch_length="%f")

1
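
To see how much precision a given format string keeps, the sketch below applies a few `%`-style formats to branch lengths copied from the output earlier in this notebook. It uses only plain Python string formatting, assuming that the Newick writer applies `format_branch_length` in the same way and that its default is `%1.5f`; the `"%.10f"` variant is a suggestion added here, not something the notebook used.

```python
# Branch lengths copied from the ancestor-walk output earlier in the notebook.
branch_lengths = [2.0138e-06, 0.0016228185, 1e-06]


def format_all(fmt, values=branch_lengths):
    """Format every branch length with a printf-style format string."""
    return [fmt % v for v in values]


for fmt in ("%1.5f", "%f", "%.10f"):
    print(f"{fmt:>6}", format_all(fmt))
```

With `%1.5f` the shortest branches collapse to `0.00000`, with `%f` they survive as `0.000002` and `0.000001`, and `%.10f` keeps all the digits shown in the IQ-TREE output.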

## Add WHO Clade Metadata

Info about which lineages belong to which WHO clades was retrieved from https://cov-lineages.org/lineage_list.html

In [186]:
who_clades = dict(
    Delta=r'B\.1\.617\.2|AY\..*',
    Omicron=r'BA\..*|B\.1\.1\.529',
    Eta=['B.1.525'],
    Iota=[
    'B.1.526',
    'B.1.526.1',
    'B.1.526.2',
    'B.1.526.3',],
    Beta=[
    'B.1.351',
    'B.1.351.1',
    'B.1.351.2',
    'B.1.351.3',
    'B.1.351.4',
    'B.1.351.5',],
    Gamma=[
    'P.1',
    'P.1.1',
    'P.1.2',
    'P.1.3',
    'P.1.4',
    'P.1.5',
    'P.1.6',
    'P.1.7',
    'P.1.7.1',
    'P.1.8',
    'P.1.9',
    'P.1.10',
    'P.1.10.1',
    'P.1.10.2',
    'P.1.11',
    'P.1.12',
    'P.1.12.1',
    'P.1.13',
    'P.1.14',
    'P.1.15',
    'P.1.16',
    'P.1.17',
    'P.1.17.1',],
    Epsilon=[
    'B.1.429',
    'B.1.429.1',
    'B.1.427',],
    Alpha=[
    'B.1.1.7',
    'Q.1',
    'Q.2',
    'Q.3',
    'Q.4',
    'Q.5',
    'Q.6',
    'Q.7',
    'Q.8',],
    Mu=[
    'B.1.621',
    'B.1.621.1',
    'B.1.621.2',
    'BB.1',
    'BB.2',],
    Lambda=[
        'C.37',
        'C.37.1',],
)

In [187]:
df['VOC'] = None
for who_clade, lineages in who_clades.items():
    print(who_clade)
    if isinstance(lineages, str):
        mask = df.lineage.str.match(lineages)
    else: # isinstance(lineages, list):
        mask = df.lineage.isin(lineages)
    print(df.loc[mask,'host'].value_counts())
    df.loc[mask, 'VOC'] = who_clade
    print('='*120)

Delta
Human                       85
Panthera leo                37
Panthera tigris             18
Canis lupus familiaris      18
Felis catus                 15
Gorilla gorilla             12
Environment                  8
Panthera tigris tigris       5
Panthera uncia               4
Aonyx cinereus               3
Neovison vison               2
Hippopotamus amphibius       1
Crocuta crocuta              1
Mustela furo                 1
Arctictis binturong          1
Panthera tigris sondaica     1
Prionailurus viverrinus      1
Name: host, dtype: int64
Omicron
Human                   19
Mesocricetus auratus    12
Environment              1
Name: host, dtype: int64
Eta
Human    10
Name: host, dtype: int64
Iota
Human                     10
Canis lupus familiaris     2
Felis catus                1
Name: host, dtype: int64
Beta
Human    20
Name: host, dtype: int64
Gamma
Human          20
Felis catus     1
Name: host, dtype: int64
Epsilon
Human          15
Felis catus     2
Name: host, dtype

In [126]:
df.VOC.value_counts()

Delta      213
Omicron     32
Gamma       21
Beta        20
Epsilon     17
Iota        13
Eta         10
Name: VOC, dtype: int64
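
The assignment logic above treats a string entry as a regular expression and a list entry as a set of exact lineage names. The sketch below shows the same rule for a single lineage using only the standard library, with a subset of the `who_clades` mapping; `classify` and its `strict` flag are names introduced here. pandas' `str.match` uses `re.match`, which anchors only at the start of the string, so the `strict=True` variant with `re.fullmatch` is an optional tightening, not what the notebook ran.

```python
import re
from typing import Optional

# Subset of the `who_clades` mapping: a str is a regex, a list is an exact set.
who_clades = {
    "Delta": r"B\.1\.617\.2|AY\..*",
    "Omicron": r"BA\..*|B\.1\.1\.529",
    "Eta": ["B.1.525"],
    "Iota": ["B.1.526", "B.1.526.1", "B.1.526.2", "B.1.526.3"],
}


def classify(lineage: str, strict: bool = False) -> Optional[str]:
    """Return the WHO clade for a Pango lineage, or None if unassigned."""
    for clade, spec in who_clades.items():
        if isinstance(spec, str):
            matcher = re.fullmatch if strict else re.match
            if matcher(spec, lineage):
                return clade
        elif lineage in spec:
            return clade
    return None


print(classify("AY.44"))      # Delta
print(classify("B.1.526.1"))  # Iota
print(classify("B.1"))        # None
```

A prefix match means a hypothetical name such as `B.1.617.21` would also be labelled Delta; `classify("B.1.617.21", strict=True)` returns `None` instead.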

# Prune away most B.1.311 taxa

Pangolin appears to misclassify the ON WTD sequences as B.1.311: they do not seem to be as closely related to B.1.311 as they are to B.1.

Most B.1.311 taxa are therefore pruned away, keeping only a small subsample for comparison.

In [127]:
df.loc['USA/CA-CDPH-2000006330/2020',:]

type               betacoronavirus
accession          EPI_ISL_3105838
collection_date         2020-12-09
host                         Human
clade                           GH
lineage                    B.1.311
region               North America
country                        USA
division                California
city                 Sonoma County
VOC                           None
Name: USA/CA-CDPH-2000006330/2020, dtype: object

In [128]:
b_1_311_sample = 'USA/CA-CDPH-2000006330/2020'
node_b_1_311 = tree.find_any(name=b_1_311_sample)

In [129]:
clade_mostly_b1311 = None
cumul_branch_length = 0.0
for n in tree.get_path(node_b_1_311)[::-1]:
    cumul_branch_length += n.branch_length
    if n.is_terminal():
        continue
    clade_sample_names = [x.name for x in n.get_terminals()]
    print(len(clade_sample_names))
    print(cumul_branch_length)
    print(df.loc[clade_sample_names, 'lineage'].value_counts())
    print('='*80)
    if len(clade_sample_names) >= 3717:
        clade_mostly_b1311 = n
        break

4
2e-06
B.1.311    4
Name: lineage, dtype: int64
5
3e-06
B.1.311    5
Name: lineage, dtype: int64
8
4e-06
B.1.311    8
Name: lineage, dtype: int64
9
4.9999999999999996e-06
B.1.311    9
Name: lineage, dtype: int64
10
5.999999999999999e-06
B.1.311    10
Name: lineage, dtype: int64
11
6.999999999999999e-06
B.1.311    11
Name: lineage, dtype: int64
24
8e-06
B.1.311    24
Name: lineage, dtype: int64
25
9e-06
B.1.311    25
Name: lineage, dtype: int64
43
1e-05
B.1.311    43
Name: lineage, dtype: int64
92
1.1000000000000001e-05
B.1.311    92
Name: lineage, dtype: int64
135
1.2000000000000002e-05
B.1.311    135
Name: lineage, dtype: int64
153
1.3000000000000003e-05
B.1.311    153
Name: lineage, dtype: int64
155
1.4000000000000003e-05
B.1.311    155
Name: lineage, dtype: int64
166
1.5000000000000004e-05
B.1.311    166
Name: lineage, dtype: int64
167
4.8443000000000004e-05
B.1.311    167
Name: lineage, dtype: int64
182
8.188600000000001e-05
B.1.311    182
Name: lineage, dtype: int64
199
8.2886e-0

In [130]:
clade_mostly_b1311.count_terminals()

3717

In [131]:
def prune_other(tree, names):
    for node in tree.get_terminals():
        if node.name not in names:
            tree.prune(node)

In [132]:
b1311_samples_to_subsample = pd.Series(x.name for x in clade_mostly_b1311.get_terminals())

In [133]:
b1311_samples_to_subsample

0       USA/IL-IDPH-SAN-S-0001226/2020
1       USA/IL-IDPH-CAS-S-0001075/2020
2              USA/NC-UNCC-000121/2020
3                USA/VA-DCLS-1758/2020
4            USA/NC-CDC-LC0010876/2021
                     ...              
3712        USA/TX-HMH-MCoV-42782/2020
3713         USA/MI-MDHHS-SC21506/2020
3714         USA/MI-MDHHS-SC20539/2020
3715         USA/MI-MDHHS-SC20546/2020
3716       USA/CA-CDPH-2000011986/2020
Length: 3717, dtype: object

In [134]:
b1311_sub50 = b1311_samples_to_subsample.sample(n=50)

In [135]:
b1311_to_prune = list(set(b1311_samples_to_subsample) - set(b1311_sub50))

In [136]:
len(b1311_to_prune)

3667
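
The random draw above is not seeded, so rerunning the notebook keeps a different set of 50 taxa each time. pandas' `sample` accepts a `random_state` argument for this; the sketch below shows the same idea with only the standard library. The sample names and the seed value are hypothetical, and only the counts (3717 candidates, 50 kept) come from the notebook.

```python
import random


def subsample(names, n, seed):
    """Return (kept, pruned) with `n` names kept, reproducibly for a given seed."""
    ordered = sorted(names)  # fix the order so the seed fully determines the draw
    kept = set(random.Random(seed).sample(ordered, n))
    pruned = [x for x in ordered if x not in kept]
    return kept, pruned


# Hypothetical names; only the count matches the notebook.
names = [f"sample-{i}" for i in range(3717)]
kept, pruned = subsample(names, n=50, seed=42)
print(len(kept), len(pruned))  # 50 3667
```

Sorting before sampling matters when the candidates come from a `set`, whose iteration order is not guaranteed to be stable between runs.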

In [137]:
sample_node = {n.name: n for n in tree.get_terminals()}

In [138]:
for name in b1311_to_prune:
    n = sample_node.get(name, None)
    if n is not None:
        tree.prune(n)

In [139]:
tree.count_terminals()

5495

## Prune sample with no metadata

Some samples in the tree have no metadata entry, possibly because of an ASCII vs. Unicode mismatch in the sample names.

In [143]:
tree.prune(sample_node['Cote_d_Ivoire/BKE1402-bc14/2020'])

Clade(branch_length=0.0001337897)

In [144]:
tree.count_terminals()

5494

In [140]:
sample_node = {n.name: n for n in tree.get_terminals()}

In [146]:
samples_set = set(df.index)
df = df.loc[[x for x in sample_node.keys() if x in samples_set],:]

In [147]:
df.shape

(5493, 11)
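
One way to check whether tips without metadata are caused by an ASCII vs. Unicode mismatch is to compare names after stripping diacritics. The sketch below is a hedged illustration only: the sample names are hypothetical, and `ascii_fold` and `missing_metadata` are helper names introduced here.

```python
import unicodedata


def ascii_fold(name: str) -> str:
    """Strip diacritics so that an accented letter compares equal to its base letter."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def missing_metadata(tip_names, metadata_names):
    """Map each tip absent from the metadata to an ASCII-folded match, or None."""
    folded = {ascii_fold(m): m for m in metadata_names}
    known = set(metadata_names)
    return {t: folded.get(ascii_fold(t)) for t in tip_names if t not in known}


# Hypothetical names for illustration only.
tips = ["Cote_d_Ivoire/X1/2020", "USA/X2/2020", "Mali/X3/2020"]
meta = ["C\u00f4te_d_Ivoire/X1/2020", "USA/X2/2020"]
for tip, match in missing_metadata(tips, meta).items():
    print(tip, "->", match)
```

A tip that maps to a metadata name differs only in accents; a tip that maps to `None` is missing from the metadata for some other reason.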

In [151]:
b1_human_to_subsample_from = set(df.query('lineage == "B.1" and host == "Human"').index) - set(df_subtree.index)

In [153]:
len(b1_human_to_subsample_from)

4385

In [157]:
b1_human_to_keep = set(pd.Series(list(b1_human_to_subsample_from)).sample(n=200))

In [159]:
len(b1_human_to_keep)

200

In [160]:
b1_human_to_prune = list(set(b1_human_to_subsample_from) - set(b1_human_to_keep))

In [161]:
len(b1_human_to_prune)

4185

In [163]:
len(sample_node)

5495

In [164]:
tree.count_terminals()

5494

In [166]:
for b1_sample in b1_human_to_prune:
    b1_node = sample_node.get(b1_sample, None)
    if b1_node is None:
        print(f'{b1_sample} not found in dict!')
        b1_node = tree.find_any(name=b1_sample)
        if b1_node is None:
            print(f'{b1_sample} not found in tree')
            continue
    tree.prune(b1_node)

In [167]:
tree.count_terminals()

1309
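
As a sanity check, the tip counts reported so far can be reproduced from the numbers of taxa pruned at each step. This is plain arithmetic on values printed by the cells above; the variable names are introduced here.

```python
tips = 9163             # tree.count_terminals() for the full tree
tips -= 1               # one outlier pruned
after_outlier = tips
tips -= 3717 - 50       # B.1.311 clade: 3717 tips, 50 kept
after_b1311 = tips
tips -= 1               # one tip without metadata pruned
after_no_metadata = tips
tips -= 4385 - 200      # B.1 human candidates: 4385, 200 kept
after_b1_human = tips

print(after_outlier, after_b1311, after_no_metadata, after_b1_human)
# 9162 5495 5494 1309
```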

In [168]:
sample_node = {n.name: n for n in tree.get_terminals()}

In [169]:
samples_set = set(df.index)
df = df.loc[[x for x in sample_node.keys() if x in samples_set],:]

In [170]:
df.shape

(1308, 11)

In [172]:
df.VOC.value_counts()

Delta      213
Omicron     32
Gamma       21
Beta        20
Epsilon     17
Iota        13
Eta         10
Name: VOC, dtype: int64

In [195]:
tree.count_terminals()

1309

Write the global SARS-CoV-2 tree after trimming non-essential taxa by clade/lineage/host subsampling

In [173]:
Phylo.write(tree, 'tree-subsampled.newick', 'newick', format_branch_length="%f")

1

The Iota clade does not appear to be monophyletic, so the Iota taxa that do not cluster with the rest are designated `Iota*`

In [188]:
iota_samples = set(df.loc[df.VOC == 'Iota',].index)

In [189]:
iota_samples

{'USA/CT-Yale-3208/2021',
 'USA/FL-CDC-ASC210070904/2021',
 'USA/FL-CDC-QDX24365418/2021',
 'USA/FL-CDC-STM-000038522/2021',
 'USA/MA-CDC-LC0037326/2021',
 'USA/MA-MASPHL-03041/2021',
 'USA/MN-MDH-8124/2021',
 'USA/NY-MSHSPSP-PV27528/2021',
 'USA/NY-PRL-2021_03_31_00G12/2021',
 'USA/NY-PRL-2021_0624_52C08/2021',
 'cat/USA/NJ-21-007630-001/2021',
 'dog/USA/CT-21-007025-001/2021',
 'dog/USA/FL-21-002342-001/2021'}

In [190]:
samples = '''
USA/AZ-TG935237/2020
USA/CT-Yale-3208/2021
USA/NY-PRL-2021_03_31_00G12/2021
USA/MA-CDCBI-CRSP_IFMMLNLSKYVF35FI/2021
USA/NY-PRL-2021_03_15_01H03/2021
USA/NY-PRL-2021_03_11_00B06/2021
USA/NY-PRL-2021_03_12_00I20/2021
USA/NY-PRL-2021_03_15_00F14/2021
USA/PA-CDC-STM-000040743/2021
USA/CO-CDC-FG-018482/2021
'''.strip().split('\n')
df.loc[list(set(samples) & set(df.index)), :]

| sample | type | accession | collection_date | host | clade | lineage | region | country | division | city | VOC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| USA/CT-Yale-3208/2021 | betacoronavirus | EPI_ISL_1587409 | 2021-03-26 | Human | GH | B.1.526 | North America | USA | Connecticut | | Iota |
| USA/NY-PRL-2021_03_31_00G12/2021 | betacoronavirus | EPI_ISL_1471625 | 2021-03-28 | Human | GH | B.1.526 | North America | USA | New York | New York City | Iota |
| USA/NY-PRL-2021_03_15_00F14/2021 | betacoronavirus | EPI_ISL_1306557 | 2021-03-12 | Human | GH | B.1 | North America | USA | New York | New York City | |


In [191]:
iota_2_samples = '''
USA/CT-Yale-3208/2021
USA/NY-PRL-2021_03_31_00G12/2021
'''.strip().split('\n')
df.loc[iota_2_samples, 'VOC'] = 'Iota*'

In [192]:
df.lineage.value_counts()

B.1        417
B.1.311    152
B.1.2       83
AY.44       32
B.1.1.7     26
          ... 
A            1
AY.119       1
B.1.264      1
Q.1          1
AY.85        1
Name: lineage, Length: 160, dtype: int64

In [193]:
df.loc[b1311_sub50, 'VOC'] = 'B.1.311'

In [194]:
df.to_csv('metadata-subsampled.tsv', sep='\t')

## Figure 3 - phylogenetic analysis of ON WTD+human and most closely related sequences

Phylogenetic analysis of the 157 ON WTD, human and most closely related sequences was performed with a MAFFT multiple-sequence alignment followed by IQ-TREE maximum-likelihood tree inference. The resulting tree, with UFBoot support values, is the Figure 3 zoom-in of the global tree.

In [199]:
with open('subtree-samples.txt') as f:
    subtree_samples = {line.strip() for line in f if line.strip()}

In [201]:
len(subtree_samples)

157

In [216]:
df_subtree = df.loc[list(subtree_samples),:]

In [217]:
df_subtree.shape

(157, 11)

In [226]:
with open('subtree-dates-for-iqtree-time-tree.tsv', 'w') as fout:
    for sample, dt in zip(df_subtree.index, pd.to_datetime(df_subtree.collection_date)):
        fout.write(f'{sample}\t{dt.date()}\n')

In [202]:
from Bio.SeqIO.FastaIO import SimpleFastaParser

In [203]:
sample_seqs = {}
with open('results-B.1.311/gisaid/gisaid_sequences.filtered.fasta') as f:
    for h,s in SimpleFastaParser(f):
        if h in subtree_samples:
            sample_seqs[h] = s

In [204]:
len(sample_seqs)

157

In [205]:
with open('subtree.fasta', 'w') as fout:
    for sample, seq in sample_seqs.items():
        fout.write(f'>{sample}\n{seq}\n')

In [207]:
!wc  subtree.fasta

    314     314 4690198 subtree.fasta
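
The `wc` output shows 314 lines, which matches 157 records written as two lines each (header plus sequence). The sketch below shows the same subset-and-write step with a plain-Python FASTA reader and an explicit check for names that were not found. `read_fasta` and `write_subset` are names introduced here, and the toy records are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_subset(src, dst, wanted):
    """Write the wanted records as two lines each; return the names not found."""
    found = set()
    for header, seq in read_fasta(src):
        if header in wanted:
            dst.write(f">{header}\n{seq}\n")
            found.add(header)
    return set(wanted) - found


# Toy input with one multi-line record (hypothetical data).
src = io.StringIO(">s1\nACGT\nAC\n>s2\nGG\n>s3\nTT\n")
dst = io.StringIO()
missing = write_subset(src, dst, {"s1", "s3", "s9"})
lines = dst.getvalue().splitlines()
print(len(lines), sorted(missing))  # 4 ['s9']
```

Returning the missing names makes it easy to confirm that every requested sample was present in the source FASTA, rather than relying on the record count alone.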


In [208]:
!mafft --version

v7.471 (2020/Jul/3)


In [209]:
!mamba install -y mafft=7.490



Looking for: ['mafft=7.490']


In [210]:
!mafft --version

v7.490 (2021/Oct/30)


In [211]:
!mafft --help

------------------------------------------------------------------------------
  MAFFT v7.490 (2021/Oct/30)
  https://mafft.cbrc.jp/alignment/software/
  MBE 30:772-780 (2013), NAR 30:3059-3066 (2002)
------------------------------------------------------------------------------
High speed:
  % mafft in > out
  % mafft --retree 1 in > out (fast)

High accuracy (for <~200 sequences x <~2,000 aa/nt):
  % mafft --maxiterate 1000 --localpair  in > out (% linsi in > out is also ok)
  % mafft --maxiterate 1000 --genafpair  in > out (% einsi in > out)
  % mafft --maxiterate 1000 --globalpair in > out (% ginsi in > out)

If unsure which option to use:
  % mafft --auto in > out

--op # :         Gap opening penalty, default: 1.53
--ep # :         Offset (works like gap extension penalty), default: 0.0
--maxiterate # : Maximum number of iterative refinement, default: 0
--clustalout :   Output: clustal format, default: fasta
--reorder :      Outorder: aligned, default: input 

In [212]:
!mafft --thread -1 --auto subtree.fasta > subtree.mafft.fasta

OS = linux
The number of physical cores =  8
nthread = 8
nthreadpair = 8
nthreadtb = 8
ppenalty_ex = 0
stacksize: 8192 kb
generating a scoring matrix for nucleotide (dist=200) ... done
Gap Penalty = -1.53, +0.00, +0.00



Making a distance matrix ..

There are 35536 ambiguous characters.
  101 / 157 (thread    7)
done.

Constructing a UPGMA tree (efffree=0) ... 
  150 / 157
done.

Progressive alignment 1/2... 
STEP   126 / 156 (thread    1)
Reallocating..done. *alloclen = 60827
STEP   156 / 156 (thread    3)
done.

Making a distance matrix from msa.. 
  100 / 157 (thread    2)
done.

Constructing a UPGMA tree (efffree=1) ... 
  150 / 157
done.

Progressive alignment 2/2... 
STEP   148 / 156 (thread    1)
Reallocating..done. *alloclen = 60830
STEP   156 / 156 (thread    6)
done.

disttbfast (nuc) Version 7.490
alg=A, model=DNA200 (2), 1.53 (4.59), -0.00 (-0.00), noshift, amax=0.0
8 thread(s)


Strategy:
 FFT-NS-2 (Fast but rough)
 Progressive method (guide trees were built 2 times.)

If

In [229]:
!mafft --thread -1 --auto subtree-with-ref.fasta > subtree-with-ref.mafft.fasta

OS = linux
The number of physical cores =  8
nthread = 8
nthreadpair = 8
nthreadtb = 8
ppenalty_ex = 0
stacksize: 8192 kb
generating a scoring matrix for nucleotide (dist=200) ... done
Gap Penalty = -1.53, +0.00, +0.00



Making a distance matrix ..

There are 35536 ambiguous characters.
  101 / 158 (thread    2)
done.

Constructing a UPGMA tree (efffree=0) ... 
  150 / 158
done.

Progressive alignment 1/2... 
STEP   120 / 157 (thread    7)
Reallocating..done. *alloclen = 60827
STEP   157 / 157 (thread    5)
done.

Making a distance matrix from msa.. 
  100 / 158 (thread    6)
done.

Constructing a UPGMA tree (efffree=1) ... 
  150 / 158
done.

Progressive alignment 2/2... 
STEP   150 / 157 (thread    6)
Reallocating..done. *alloclen = 60830
STEP   157 / 157 (thread    0)
done.

disttbfast (nuc) Version 7.490
alg=A, model=DNA200 (2), 1.53 (4.59), -0.00 (-0.00), noshift, amax=0.0
8 thread(s)


Strategy:
 FFT-NS-2 (Fast but rough)
 Progressive method (guide trees were built 2 times.)

If

In [214]:
!iqtree --version

IQ-TREE multicore version 2.2.0-beta COVID-edition for Linux 64-bit built Dec 21 2021 built Dec 21 2021
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams, Ly Trong Nhan.



# IQ-TREE phylogenetic analysis

In [222]:
!iqtree -s subtree.mafft.fasta --prefix subtree.mafft.iqtree -T 16 -m GTR -B 1000

IQ-TREE multicore version 2.2.0-beta COVID-edition for Linux 64-bit built Dec 21 2021 built Dec 21 2021
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams, Ly Trong Nhan.

Host:    ncfadlaptop (AVX2, FMA3, 62 GB RAM)
Command: iqtree -s subtree.mafft.fasta --prefix subtree.mafft.iqtree -T 16 -m GTR -B 1000
Seed:    485643 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Thu Feb 17 11:51:36 2022
Kernel:  AVX+FMA - 16 threads (16 CPU cores detected)

Reading alignment file subtree.mafft.fasta ... Fasta format detected
Reading fasta file: done in 0.046123 secs using 91.71% CPU
Alignment most likely contains DNA/RNA sequences
Constructing alignment: done in 0.0319899 secs using 565.7% CPU
Alignment has 157 sequences with 29928 columns, 1352 distinct patterns
152 parsimony-informative, 150 singleton sites, 29626 constant sites
                                        Gap/Ambiguity  Composition  p-va

Identifying sites to remove: done in 0.0177862 secs using 1169% CPU
NOTE: 16 identical sequences (see below) will be ignored for subsequent analysis
NOTE: mink/USA/MI-CDC-3886941-001/2020 (identical to mink/USA/MI-CDC-3886613-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886720-001/2020 (identical to mink/USA/MI-CDC-3886613-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886641-001/2020 (identical to mink/USA/MI-CDC-3886724-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886693-001/2020 (identical to mink/USA/MI-CDC-3886724-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886707-001/2020 (identical to mink/USA/MI-CDC-3886724-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886719-001/2020 (identical to mink/USA/MI-CDC-3886724-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886727-001/2020 (identical to mink/USA/MI-CDC-3886724-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-

Optimizing NNI: done in 0.245482 secs using 1580% CPU
Optimizing NNI: done in 0.179862 secs using 1572% CPU
Optimizing NNI: done in 0.310166 secs using 1560% CPU
Optimizing NNI: done in 0.203498 secs using 1567% CPU
Optimizing NNI: done in 0.322135 secs using 1552% CPU
Iteration 70 / LogL: -43652.200 / Time: 0h:0m:27s (0h:0m:12s left)
Optimizing NNI: done in 0.185395 secs using 1585% CPU
Optimizing NNI: done in 0.114686 secs using 1562% CPU
Optimizing NNI: done in 0.205381 secs using 1581% CPU
Optimizing NNI: done in 0.270787 secs using 1550% CPU
Optimizing NNI: done in 0.217734 secs using 1593% CPU
Optimizing NNI: done in 0.142412 secs using 1551% CPU
Optimizing NNI: done in 0.246825 secs using 1588% CPU
Optimizing NNI: done in 0.214893 secs using 1574% CPU
Optimizing NNI: done in 0.154705 secs using 1558% CPU
Optimizing NNI: done in 0.0919931 secs using 1548% CPU
Iteration 80 / LogL: -43725.028 / Time: 0h:0m:29s (0h:0m:7s left)
Optimizing NNI: done in 0.171745 secs using 1551% CPU
Op

Optimizing NNI: done in 0.409281 secs using 1561% CPU
Optimizing NNI: done in 0.270389 secs using 1564% CPU
Optimizing NNI: done in 0.182206 secs using 1578% CPU
Optimizing NNI: done in 0.250414 secs using 1575% CPU
Iteration 200 / LogL: -43649.709 / Time: 0h:1m:0s (0h:0m:0s left)
Log-likelihood cutoff on original alignment: -43708.471
NOTE: Bootstrap correlation coefficient of split occurrence frequencies: 0.991
TREE SEARCH COMPLETED AFTER 200 ITERATIONS / Time: 0h:1m:1s

--------------------------------------------------------------------
|                    FINALIZING TREE SEARCH                        |
--------------------------------------------------------------------
Performs final model parameters optimization
Estimate model parameters (epsilon = 0.010)
1. Initial log-likelihood: -43649.350
2. Current log-likelihood: -43649.189
Optimal log-likelihood: -43649.182
Rate parameters:  A-C: 0.21583  A-G: 1.06857  A-T: 0.13440  C-G: 0.07338  C-T: 4.44336  G-T: 1.00000
Base frequenci
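
The alignment summary reported by IQ-TREE above (152 parsimony-informative, 150 singleton and 29626 constant sites) adds up to the 29928 alignment columns. The sketch below classifies columns of a toy alignment using the usual textbook definitions: a column is parsimony-informative when at least two states each occur at least twice. How IQ-TREE treats gaps and ambiguity codes may differ from the simple `ignore` rule assumed here, and the function names are introduced for illustration.

```python
from collections import Counter


def classify_column(column, ignore="-N?"):
    """Label one alignment column as constant, singleton or parsimony-informative."""
    counts = Counter(c for c in column.upper() if c not in ignore)
    if len(counts) <= 1:
        return "constant"
    # Informative: at least two different states each seen in two or more sequences.
    if sum(1 for n in counts.values() if n >= 2) >= 2:
        return "parsimony-informative"
    return "singleton"


def summarize(alignment):
    """Count site classes over all columns of equally long aligned sequences."""
    columns = zip(*alignment)  # transpose rows into columns
    return Counter(classify_column("".join(col)) for col in columns)


# Toy alignment (hypothetical data).
alignment = [
    "ACGTA",
    "ACGTA",
    "ATGCA",
    "ATG-C",
]
summary = summarize(alignment)
print(dict(summary))

# Totals from the IQ-TREE log above.
print(152 + 150 + 29626)  # 29928
```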

In [230]:
!iqtree -s subtree-with-ref.mafft.fasta -o MN908947.3 --prefix subtree-with-ref.mafft.iqtree -T 16 -m GTR -B 1000

IQ-TREE multicore version 2.2.0-beta COVID-edition for Linux 64-bit built Dec 21 2021 built Dec 21 2021
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams, Ly Trong Nhan.

Host:    ncfadlaptop (AVX2, FMA3, 62 GB RAM)
Command: iqtree -s subtree-with-ref.mafft.fasta -o MN908947.3 --prefix subtree-with-ref.mafft.iqtree -T 16 -m GTR -B 1000
Seed:    84630 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Fri Feb 18 08:30:55 2022
Kernel:  AVX+FMA - 16 threads (16 CPU cores detected)

Reading alignment file subtree-with-ref.mafft.fasta ... Fasta format detected
Reading fasta file: done in 0.046182 secs using 94.05% CPU
Alignment most likely contains DNA/RNA sequences
Constructing alignment: done in 0.0308127 secs using 633.7% CPU
Alignment has 158 sequences with 29928 columns, 1357 distinct patterns
152 parsimony-informative, 158 singleton sites, 29618 constant sites
                                

Identifying sites to remove: done in 0.01758 secs using 1055% CPU
NOTE: 16 identical sequences (see below) will be ignored for subsequent analysis
NOTE: mink/USA/MI-CDC-3886941-001/2020 (identical to mink/USA/MI-CDC-3886613-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886720-001/2020 (identical to mink/USA/MI-CDC-3886613-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886641-001/2020 (identical to mink/USA/MI-CDC-3886724-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886693-001/2020 (identical to mink/USA/MI-CDC-3886724-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886707-001/2020 (identical to mink/USA/MI-CDC-3886724-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886719-001/2020 (identical to mink/USA/MI-CDC-3886724-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886727-001/2020 (identical to mink/USA/MI-CDC-3886724-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CD

Optimizing NNI: done in 0.232464 secs using 1549% CPU
Optimizing NNI: done in 0.291536 secs using 1540% CPU
Optimizing NNI: done in 1.00673 secs using 1491% CPU
Optimizing NNI: done in 0.393266 secs using 1527% CPU
Optimizing NNI: done in 0.36937 secs using 1522% CPU
Optimizing NNI: done in 0.406698 secs using 1544% CPU
Optimizing NNI: done in 0.267307 secs using 1552% CPU
Iteration 70 / LogL: -43736.643 / Time: 0h:0m:34s (0h:0m:16s left)
Optimizing NNI: done in 0.231187 secs using 1550% CPU
Optimizing NNI: done in 0.338862 secs using 1539% CPU
Optimizing NNI: done in 0.239519 secs using 1557% CPU
Optimizing NNI: done in 0.324815 secs using 1548% CPU
Optimizing NNI: done in 0.223746 secs using 1553% CPU
Optimizing NNI: done in 0.323373 secs using 1541% CPU
Optimizing NNI: done in 0.401909 secs using 1518% CPU
Optimizing NNI: done in 0.141644 secs using 1554% CPU
Optimizing NNI: done in 0.198499 secs using 1547% CPU
Optimizing NNI: done in 0.23431 secs using 1548% CPU
Iteration 80 / Log

Inferred a "time tree" with IQ-TREE; note that UFBoot and classical bootstrap values cannot be computed in this mode

In [228]:
!iqtree -s subtree.mafft.fasta --prefix subtree.mafft.iqtree-time-tree -T 16 -m GTR --date subtree-dates-for-iqtree-time-tree.tsv

IQ-TREE multicore version 2.2.0-beta COVID-edition for Linux 64-bit built Dec 21 2021 built Dec 21 2021
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams, Ly Trong Nhan.

Host:    ncfadlaptop (AVX2, FMA3, 62 GB RAM)
Command: iqtree -s subtree.mafft.fasta --prefix subtree.mafft.iqtree-time-tree -T 16 -m GTR --date subtree-dates-for-iqtree-time-tree.tsv
Seed:    801360 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Thu Feb 17 11:55:31 2022
Kernel:  AVX+FMA - 16 threads (16 CPU cores detected)

Reading alignment file subtree.mafft.fasta ... Fasta format detected
Reading fasta file: done in 0.0480866 secs using 96.95% CPU
Alignment most likely contains DNA/RNA sequences
Constructing alignment: done in 0.0302765 secs using 707.8% CPU
Alignment has 157 sequences with 29928 columns, 1352 distinct patterns
152 parsimony-informative, 150 singleton sites, 29626 constant sites
                       

Identifying sites to remove: done in 0.0170207 secs using 1198% CPU
NOTE: 16 identical sequences (see below) will be ignored for subsequent analysis
NOTE: mink/USA/MI-CDC-3886941-001/2020 (identical to mink/USA/MI-CDC-3886613-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886720-001/2020 (identical to mink/USA/MI-CDC-3886613-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886641-001/2020 (identical to mink/USA/MI-CDC-3886724-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886693-001/2020 (identical to mink/USA/MI-CDC-3886724-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886707-001/2020 (identical to mink/USA/MI-CDC-3886724-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886719-001/2020 (identical to mink/USA/MI-CDC-3886724-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-CDC-3886727-001/2020 (identical to mink/USA/MI-CDC-3886724-001/2020) is ignored but added at the end
NOTE: mink/USA/MI-

Optimizing NNI: done in 0.248491 secs using 1516% CPU
Optimizing NNI: done in 0.164849 secs using 1540% CPU
Optimizing NNI: done in 0.14969 secs using 1518% CPU
Optimizing NNI: done in 0.131995 secs using 1516% CPU
Iteration 70 / LogL: -43649.534 / Time: 0h:0m:14s (0h:0m:9s left)
Optimizing NNI: done in 0.218556 secs using 1529% CPU
Optimizing NNI: done in 0.131357 secs using 1517% CPU
Optimizing NNI: done in 0.150211 secs using 1548% CPU
Optimizing NNI: done in 0.203826 secs using 1503% CPU
Optimizing NNI: done in 0.191314 secs using 1548% CPU
Optimizing NNI: done in 0.224507 secs using 1523% CPU
Optimizing NNI: done in 0.105082 secs using 1556% CPU
Optimizing NNI: done in 0.184628 secs using 1552% CPU
Optimizing NNI: done in 0.173856 secs using 1528% CPU
Optimizing NNI: done in 0.20618 secs using 1487% CPU
Iteration 80 / LogL: -43652.323 / Time: 0h:0m:16s (0h:0m:7s left)
Optimizing NNI: done in 0.146264 secs using 1536% CPU
Optimizing NNI: done in 0.160736 secs using 1548% CPU
Optimi