# Download PANGO Lineages
**[Work in progress]**

This notebook downloads the current PANGO lineages and builds a tree structure of the lineages.

Data sources: [PANGO Lineage Designations](https://github.com/cov-lineages/pango-designation)

Reference:
Rambaut A, et al. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nature Microbiology (2020). [doi:10.1038/s41564-020-0770-5](https://doi.org/10.1038/s41564-020-0770-5).

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import numpy as np
import pandas as pd
import io
import dateutil
import re
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import
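
Note that `os.getenv` returns `None` when the variable is unset, and `Path(None)` then raises a `TypeError`. A minimal sketch of a guarded lookup follows; the helper name `import_dir`, its `environ` parameter, and the error message are hypothetical and not part of this notebook.

```python
import os
from pathlib import Path


def import_dir(var_name="NEO4J_IMPORT", environ=None):
    """Return the directory named by an environment variable as a Path.

    Hypothetical helper: raises a descriptive error instead of the
    TypeError that Path(None) produces when the variable is unset.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(var_name)
    if not value:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return Path(value)
```

Passing a plain dictionary as `environ` makes the helper easy to try out without changing the real environment.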


## Get PANGO lineages

In [4]:
pango_url = 'https://raw.githubusercontent.com/cov-lineages/pango-designation/master/lineage_notes.txt'

In [5]:
pango = pd.read_csv(pango_url, sep='\t', skiprows=1, dtype=str, names=['lineage', 'description'])

Strip leading and trailing whitespace from the lineage column

In [6]:
pango['lineage'] = pango['lineage'].str.strip()

Remove withdrawn lineages (start with a "*")

In [7]:
pango = pango[~pango['lineage'].str.startswith('*')]
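
The two cleaning steps above can be sketched in plain Python to show why stripping comes first: a withdrawn name with leading whitespace would otherwise slip past the `startswith('*')` test. The function name `clean_lineages` and the input values are made up for illustration and are not taken from the lineage file.

```python
def clean_lineages(names):
    """Strip whitespace, then drop withdrawn lineages (names starting with '*')."""
    stripped = [name.strip() for name in names]
    return [name for name in stripped if not name.startswith("*")]


# Made-up input for illustration only
print(clean_lineages(["B.1.1.37 ", " *A.99", " M.3"]))
# prints ['B.1.1.37', 'M.3']
```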

In [8]:
pango.sample(10)

| | lineage | description |
|---|---|---|
| 295 | B.1.1.343 | Canadian lineage |
| 439 | B.1.12 | Luxembourg lineage (also England, Belgium etc.) |
| 353 | B.1.1.402 | USA (Texas), split from B.1.1.177 |
| 962 | B.1.506 | USA (NY, split from B.1.3) |
| 56 | B.1.1.37 | UK lineage |
| 418 | B.1.1.483 | Belgian lineage |
| 250 | B.1.1.294 | Russian and UK lineage |
| 1210 | M.3 | Alias of B.1.1.294.3, English Lineage |
| 961 | B.1.505 | Israel and england (was B.1.3.4) |
| 867 | B.1.406 | Germany/ UK lineage |


Extract alias from description

In [9]:
pattern = re.compile(r'Alias of ([\S]*?),', re.IGNORECASE)  # raw string avoids an invalid-escape warning for \S

In [10]:
def get_alias(row):
    match = pattern.findall(str(row.description))
    if len(match) > 0:
        return match[0]
    else:
        return ''

In [11]:
pango['alias'] = pango.apply(get_alias, axis=1)
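
As a quick check of the alias pattern, the sketch below applies the same regular expression to single description strings outside of pandas. The names `ALIAS_PATTERN` and `extract_alias` are hypothetical; the first two descriptions are copied from the sample output above.

```python
import re

# Same pattern as in the notebook, written as a raw string
ALIAS_PATTERN = re.compile(r"Alias of (\S*?),", re.IGNORECASE)


def extract_alias(description):
    """Return the lineage named after 'Alias of', or '' if there is none."""
    match = ALIAS_PATTERN.search(description)
    return match.group(1) if match else ""


print(extract_alias("Alias of B.1.1.294.3, English Lineage"))  # prints B.1.1.294.3
print(repr(extract_alias("Canadian lineage")))                 # prints ''
```

The pattern requires a comma after the alias, so a description such as `Alias of B.1.1.7` with no trailing comma would yield an empty string.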

In [12]:
pango['predecessor'] = pango['alias'].str.rsplit('.', n=1, expand=True)[0]  # n must be passed by keyword in pandas >= 2.0
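
The predecessor is the alias with its last dot-separated component removed. A minimal plain-Python sketch of the same operation on one string (the function name `predecessor` is hypothetical):

```python
def predecessor(alias):
    """Drop the last dot-separated component of a lineage name."""
    return alias.rsplit(".", 1)[0]


print(predecessor("B.1.1.39.1"))  # prints B.1.1.39
print(repr(predecessor("")))      # prints ''
```

A name without a dot, such as `B`, is returned unchanged, and an empty alias stays empty.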

In [13]:
pango.tail(5)

| | lineage | description | alias | predecessor |
|---|---|---|---|---|
| 1275 | AQ.1 | Alias of B.1.1.39.1, Finland lineage | B.1.1.39.1 | B.1.1.39 |
| 1276 | AQ.2 | Alias of B.1.1.39.2, Denmark lineage | B.1.1.39.2 | B.1.1.39 |
| 1277 | AS.1 | Alias of B.1.1.317.1, UK lineage | B.1.1.317.1 | B.1.1.317 |
| 1278 | AS.2 | Alias of B.1.1.317.2, UK lineage | B.1.1.317.2 | B.1.1.317 |
| 1279 | AT.1 | Alias of B.1.1.370.1, Russia and Finland linea... | B.1.1.370.1 | B.1.1.370 |


### Split into sublineages

In [14]:
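def split_lineage(row):
    # Return the lineage and its ancestors, most specific first,
    # e.g. 'B.1.1.267' -> ['B.1.1.267', 'B.1.1', 'B.1', 'B'].
    # Slots beyond the top-level lineage are filled with ''.
    lineage = row['lineage']
    lineages = np.empty(4, dtype=object)

    for i in range(lineages.size):
        lineages[i] = lineage
        lineage = lineage.rpartition('.')[0]  # drop the last component

    return lineages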

In [15]:
pango[['l0', 'l1', 'l2', 'l3']] = pango.apply(split_lineage, axis=1, result_type='expand')
pango['levels'] = pango['lineage'].str.count(r'\.') + 1  # raw string: match a literal dot
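
To make the `l0` to `l3` columns and the `levels` count concrete, here is a plain-Python sketch of the same logic applied to single lineage names. The function names `split_levels` and `count_levels` and the `depth` parameter are hypothetical.

```python
def split_levels(lineage, depth=4):
    """Return the lineage followed by its ancestors, padded with '' to `depth` entries."""
    levels = []
    for _ in range(depth):
        levels.append(lineage)
        # rpartition returns ('', '', name) when no dot is left, so [0] becomes ''
        lineage = lineage.rpartition(".")[0]
    return levels


def count_levels(lineage):
    """Number of dot-separated components in a lineage name."""
    return lineage.count(".") + 1


print(split_levels("B.1.1.267"))  # prints ['B.1.1.267', 'B.1.1', 'B.1', 'B']
print(split_levels("B.1.143"))    # prints ['B.1.143', 'B.1', 'B', '']
print(count_levels("B.1.143"))    # prints 3
```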

In [16]:
pango.sample(5)

| | lineage | description | alias | predecessor | l0 | l1 | l2 | l3 | levels |
|---|---|---|---|---|---|---|---|---|---|
| 523 | B.1.143 | Indian lineage | | | B.1.143 | B.1 | B | | 3 |
| 1029 | B.1.572 | USA, split from B.1.325 | | | B.1.572 | B.1 | B | | 3 |
| 671 | B.1.195 | UAE | | | B.1.195 | B.1 | B | | 3 |
| 727 | B.1.254 | Scotland | | | B.1.254 | B.1 | B | | 3 |
| 229 | B.1.1.267 | Irish lineage | | | B.1.1.267 | B.1.1 | B.1 | B | 4 |


In [17]:
pango.to_csv(NEO4J_IMPORT / "00b-PANGOLineage.csv", index=False)