# Downloads Publication Information for PANGO Lineages from the CORD-19 Data Set
**[Work in progress]**

This notebook text-mines [PANGO lineage](https://cov-lineages.org/) mentions in the titles and abstracts of publications and preprints from the CORD-19 data set. Note that the text-mined results may contain false positives.

Data sources: [PANGO Lineage Designations](https://github.com/cov-lineages/pango-designation), 
[CORD-19](https://allenai.org/data/cord-19)

References:

Rambaut A, et al., A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology (2020) Nature Microbiology [doi:10.1038/s41564-020-0770-5](https://doi.org/10.1038/s41564-020-0770-5).

Lucy Lu Wang, et al., CORD-19: The COVID-19 Open Research Dataset (2020) [arXiv:2004.10706v4](https://arxiv.org/abs/2004.10706).

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import pandas as pd
import io
import dateutil.parser  # import the submodule explicitly; `import dateutil` alone does not guarantee it is loaded
import re
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = Path(os.environ['NEO4J_IMPORT'])  # raises a KeyError if the variable is not set
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import


## Get PANGO lineages

In [6]:
pango = pd.read_csv(NEO4J_IMPORT / "00b-PANGOLineage.csv", dtype=str)

In [7]:
pango.sample(5)

Unnamed: 0,lineage,description,alias,predecessor,l0,l1,l2,l3,levels
1208,M.1,"Alias of B.1.1.294.1, Israel",B.1.1.294.1,B.1.1.294,M.1,M,,,2
966,B.1.510,"Iceland, Sweden, North America (could be disso...",,,B.1.510,B.1,B,,3
622,B.1.177.53,UK lineage,,,B.1.177.53,B.1.177,B.1,B,4
1158,C.2,"Alias of B.1.1.1.2, South Africa and some Euro...",B.1.1.1.2,B.1.1.1,C.2,C,,,2
529,B.1.153,"English lineage with some Australian, New Zeal...",,,B.1.153,B.1,B,,3


In [8]:
lineages = set(pango['lineage'].unique())

## Get CORD-19 Metadata

In [9]:
CACHE = NEO4J_IMPORT / 'cache/cord19/metadata.csv'

In [10]:
metadata = pd.read_csv(CACHE, dtype='str')

In [11]:
metadata.fillna('', inplace=True)
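# Derive the publication year and a parsed date from publish_time.
# Note: year-only values such as '2021' have length 4, fail the `> 4` test, and get an empty year.
metadata['year'] = metadata['publish_time'].apply(lambda d: d[:4] if len(d) > 4 else '')
metadata['date'] = metadata['publish_time'].apply(lambda d: dateutil.parser.parse(d) if len(d) > 0 else '')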
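
`dateutil.parser.parse` fills in missing fields from the current date, so a year-only `publish_time` such as `2021` receives the month and day on which the notebook was run (the `2021-04-23` values in the sample output further down are consistent with this). Below is a minimal standard-library sketch of a stricter alternative. It assumes `publish_time` is either empty, `YYYY`, or `YYYY-MM-DD`; the helper name `parse_publish_time` is hypothetical and not part of this notebook.

```python
from datetime import date, datetime
from typing import Optional


def parse_publish_time(value: str) -> Optional[date]:
    """Parse 'YYYY-MM-DD' or 'YYYY'; year-only values map to January 1."""
    value = value.strip()
    if not value:
        return None
    for fmt in ("%Y-%m-%d", "%Y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None  # unrecognized format (assumption: treat as missing)


print(parse_publish_time("2021-02-09"))
print(parse_publish_time("2021"))
```

Mapping year-only values to January 1 is a design choice made for this sketch; keeping them as missing would be equally defensible.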

In [12]:
print("Total number of papers", metadata.shape[0])

Total number of papers 536817


## Extract a list of PANGO lineages

Remove special characters to simplify parsing of lineages in parentheses, comma-separated lists, etc.

In [13]:
metadata['title'] = metadata['title'].replace('[()/,]', ' ', regex=True)
metadata['abstract'] = metadata['abstract'].replace('[()/,]', ' ', regex=True)
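
To illustrate the effect of this replacement on a single string, here is a small sketch that uses `re.sub` from the standard library as a stand-in for the `Series.replace(..., regex=True)` calls. The sample sentence is made up for illustration.

```python
import re

# Made-up sample text containing parentheses, a comma, and a slash
text = "Variants of concern (B.1.1.7, B.1.351/P.1) were detected"

# Same character class as in the notebook cell: each match becomes a space
cleaned = re.sub(r"[()/,]", " ", text)

print(cleaned.split())
```

After the replacement every lineage name is delimited by whitespace, which is what the space-delimited patterns in the next step rely on.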

Match PANGO patterns and check against the list of known lineages.

In [14]:
# Raw strings avoid invalid-escape warnings for \d; one pattern per lineage depth
pattern1 = re.compile(r' [A-Z]{1,2}[.]\d+ ')
pattern2 = re.compile(r' [A-Z]{1,2}[.]\d+[.]\d+ ')
pattern3 = re.compile(r' [A-Z]{1,2}[.]\d+[.]\d+[.]\d+ ')

In [15]:
def get_lineages(row):
    text = ' ' + row.title + ' ' + row.abstract + ' '
    lin = pattern1.findall(text) + pattern2.findall(text) + pattern3.findall(text)
    u_lin = set()
    
    for l in lin:
        l = l.strip()
        # check if lineage is valid (e.g., not a withdrawn lineage or false positive)
        if l in lineages:
            u_lin.add(l)
            
    return ";".join(u_lin)

In [16]:
metadata['lineages'] = metadata.apply(get_lineages, axis=1)
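
Because each pattern consumes the surrounding spaces and `findall` returns non-overlapping matches, two lineages of the same depth separated by a single space (for example `B.1.351 B.1.525`) yield only the first one, and a lineage followed directly by a sentence-ending period is not matched. Below is a hedged sketch of an alternative that uses a single pattern with lookarounds. `LINEAGE_PATTERN`, `find_lineages`, and the small `known` set are illustrative names and are not part of this notebook.

```python
import re

# One pattern for one to three numeric levels. The lookarounds inspect the
# neighbouring characters without consuming them, so adjacent lineages and a
# trailing period do not block a match.
LINEAGE_PATTERN = re.compile(r"(?<![\w.])[A-Z]{1,2}(?:\.\d+){1,3}(?![\w]|\.\d)")


def find_lineages(text, known):
    """Return the sorted known lineages mentioned in text."""
    return sorted(set(LINEAGE_PATTERN.findall(text)) & known)


known = {"B.1.1.7", "B.1.351", "B.1.525", "P.1"}  # illustrative subset
print(find_lineages("Spread of B.1.351 B.1.525 and P.1.", known))
```

Candidates are still validated against the set of known lineages, as in `get_lineages`, and sorting the result makes the joined string reproducible between runs.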

Keep only papers that mention at least one known PANGO lineage.

In [17]:
hits = metadata[metadata['lineages'].str.len() > 0].copy()

### Assign CURIEs from [Identifiers.org](https://identifiers.org)

In [18]:
hits['doi'] = hits['doi'].apply(lambda x: 'doi:' + x if len(x) > 0 else '')
hits['pubmed_id'] = hits['pubmed_id'].apply(lambda x: 'pubmed:' + x if len(x) > 0 else '')
hits['pmcid'] = hits['pmcid'].apply(lambda x: 'pmc:' + x if len(x) > 0 else '')
hits['arxiv_id'] = hits['arxiv_id'].apply(lambda x: 'arxiv:' + x if len(x) > 0 else '')
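
A CURIE (compact URI) has the form `prefix:local_id`. The four lambdas above follow the same rule: prefix non-empty values and leave missing ones as empty strings. A minimal sketch of that rule as one helper; the name `to_curie` is hypothetical and not part of this notebook.

```python
def to_curie(prefix: str, local_id: str) -> str:
    """Prefix a non-empty identifier; keep missing values as ''."""
    return f"{prefix}:{local_id}" if local_id else ""


print(to_curie("doi", "10.1017/ice.2021.59"))
print(to_curie("pubmed", ""))
```

Keeping missing values as empty strings matters here because the later id selection tests each column against `''`.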

In [19]:
#hits.sort_values(by=['publish_time'], ascending=False, inplace=True)

In [20]:
print("Number of matches", hits.shape[0])

Number of matches 592


In [21]:
def create_id(row):
    """Return the first available identifier in priority order: DOI, PubMed, PMC, arXiv, URL."""
    if row.doi != '':
        return row.doi
    elif row.pubmed_id != '':
        return row.pubmed_id
    elif row.pmcid != '':
        return row.pmcid
    elif row.arxiv_id != '':
        return row.arxiv_id
    elif row.url != '':
        return row.url
    else:
        # TODO deal with WHO papers here?
        return ''

In [22]:
hits['id'] = hits.apply(create_id, axis=1)
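
The priority rule in `create_id` can also be written as a loop over an ordered tuple of column names, which keeps the order in one place. This is a sketch on plain dictionaries; `ID_PRIORITY` and `first_available_id` are hypothetical names and not part of this notebook.

```python
# Column names in the same priority order as create_id
ID_PRIORITY = ("doi", "pubmed_id", "pmcid", "arxiv_id", "url")


def first_available_id(record: dict) -> str:
    """Return the first non-empty identifier, or '' if none is present."""
    for field in ID_PRIORITY:
        value = record.get(field, "")
        if value:
            return value
    return ""


record = {"doi": "", "pubmed_id": "pubmed:33630820", "url": "https://example.org"}
print(first_available_id(record))
```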

In [23]:
hits.sample(20)

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id,year,date,lineages,id
286738,itikugkk,,WHO,An Observational Laboratory-Based Assessment o...,,,,unk,Information on severe acute respiratory syndro...,2021,"Sander, Anna-Lena; Yadouleton, Anges; Moreira-...",MSphere,,#1029965,,,,,231605197.0,,2021-04-23,A.4;B.1,
247931,ftvatk0a,,WHO,Immunoinformatic analysis of structural and ep...,,,,unk,A newly emerged strain of SARS-CoV-2 of B.1.1....,2021,"Hussain, Mushtaq; Shabbir, Sanya; Amanullah, A...",J. med. virol,,#1126508,,,,,232196602.0,,2021-04-23,B.1.1.7,
16254,g157h5sn,cc129c4cd6dc4523658db923b90f269e11d9d09a,PMC,Could it be that the B.1.1.7 lineage is more d...,doi:10.1017/ice.2021.59,pmc:PMC7948098,pubmed:33557960,no-cc,,2021-02-09,"Kow, Chia Siang; Hasan, Syed Shahzad",,,,,document_parses/pdf_json/cc129c4cd6dc4523658db...,document_parses/pmc_json/PMC7948098.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,,2021.0,2021-02-09,B.1.1.7,doi:10.1017/ice.2021.59
420424,vw3cuiof,69fa13b80bb407e8bca177cfb906bbf604ffcf63; c0ee...,Elsevier; Medline; PMC,Genomic characteristics and clinical effect of...,doi:10.1016/s1473-3099(21)00170-5,pmc:PMC8041359,pubmed:33857406,no-cc,BACKGROUND: Emergence of variants with specifi...,2021-04-12,"Frampton, Dan; Rampling, Tommy; Cross, Aidan; ...",Lancet Infect Dis,,,,document_parses/pdf_json/69fa13b80bb407e8bca17...,document_parses/pmc_json/PMC8041359.xml.json,https://www.sciencedirect.com/science/article/...,233215323.0,2021.0,2021-04-12,B.1.1.7,doi:10.1016/s1473-3099(21)00170-5
395962,wun5rwsk,3966d03d72440c939928b54f0d024f49181db116,Elsevier; Medline; PMC,Emergence and rapid transmission of SARS-CoV-2...,doi:10.1016/j.cell.2021.03.052,pmc:PMC8009040,pubmed:33861950,els-covid,The highly transmissible B.1.1.7 variant of SA...,2021-03-30,"Washington, Nicole L.; Gangavarapu, Karthik; Z...",Cell,,,,document_parses/pdf_json/3966d03d72440c939928b...,,https://www.ncbi.nlm.nih.gov/pubmed/33861950/;...,232412011.0,2021.0,2021-03-30,B.1.1.7,doi:10.1016/j.cell.2021.03.052
96776,k3qp2lxc,,Medline,Detection of B.1.351 SARS-CoV-2 Variant Strain...,doi:10.15585/mmwr.mm7008e2,,pubmed:33630820,unk,The first laboratory-confirmed cases of corona...,2021-02-26,"Mwenda, Mulenga; Saasa, Ngonda; Sinyange, Nyam...",MMWR. Morbidity and mortality weekly report,,,,,,https://doi.org/10.15585/mmwr.mm7008e2; https:...,232058475.0,2021.0,2021-02-26,B.1.351,doi:10.15585/mmwr.mm7008e2
477204,g9pwxfd2,5adfe79458623b8b71e1c1b3ccf370a81445a654,Elsevier; Medline; PMC,Characterizing SARS-CoV-2 genome diversity cir...,doi:10.1016/j.ijid.2021.02.073,pmc:PMC7895695,pubmed:33618008,els-covid,Objectives To evaluate the genomic diversity a...,2021-02-20,"Muñoz, Marina; Patiño, Luz H.; Ballesteros, Na...",Int J Infect Dis,,,,document_parses/pdf_json/5adfe79458623b8b71e1c...,,https://api.elsevier.com/content/article/pii/S...,231965833.0,2021.0,2021-02-20,B.1.351;B.1.1.7;P.1,doi:10.1016/j.ijid.2021.02.073
447927,3ztgrafg,,MedRxiv; WHO,SARS-CoV-2 B.1.1.7 lineage-related perceptions...,doi:10.1101/2021.01.19.21250111,,,medrxiv,Background: Healthcare workers' HCWs' travel...,2021-01-21,"Temsah, M.-H.; Barry, M.; Aljamaan, F.; Alhuza...",,,,,,,http://medrxiv.org/cgi/content/short/2021.01.1...,231653830.0,2021.0,2021-01-21,B.1.1.7,doi:10.1101/2021.01.19.21250111
169766,h5e8bi73,,WHO,Identification of B.1.346 Lineage of SARS-CoV-...,,,,unk,SARS-CoV-2 whole-genome sequencing of samples ...,2021,"Abe, Kodai; Shimura, Takako; Takenouchi, Toshi...",Keio j. med,,#1183786,,,,,231729271.0,,2021-04-23,B.1.1.284;B.1.1.214;B.1.346,
483453,ccqq5734,38be32b65618c66a2e3d41db5be9cda55a01b324,BioRxiv; WHO,A human antibody with blocking activity to RBD...,doi:10.1101/2021.02.07.429299,,,biorxiv,Severe acute respiratory syndrome coronavirus-...,2021-02-08,"Gu, Chunyin; Cao, Xiaodan; Wang, Zongda; Hu, X...",bioRxiv,,,,document_parses/pdf_json/38be32b65618c66a2e3d4...,,https://doi.org/10.1101/2021.02.07.429299,231885348.0,2021.0,2021-02-08,B.1.351,doi:10.1101/2021.02.07.429299


WHO-only records seem to be copies of articles that are already present in the dataset under another source (see the example below). They have an empty `id` here and will be ignored for now.

In [24]:
hits[hits['abstract'].str.contains('INTRODUCTION: Venezuela and')]

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id,year,date,lineages,id
278446,84bhzaqw,,WHO,SARS-CoV-2 spread across the Colombian-Venezue...,,,,unk,INTRODUCTION: Venezuela and Colombia both adop...,2020,"Paniz-Mondolfi, Alberto; Muñoz, Marina; Florez...",Infect Genet Evol,,#907154,,,,,220444484,,2020-04-23,B.1.13,
518529,0vtredu8,d2d3b8b04378bebdde024760e49c297a24dd2af7,Elsevier; Medline; PMC,SARS-CoV-2 spread across the Colombian-Venezue...,doi:10.1016/j.meegid.2020.104616,pmc:PMC7609240,pubmed:33157300,no-cc,INTRODUCTION: Venezuela and Colombia both adop...,2020-11-04,"Paniz-Mondolfi, Alberto; Muñoz, Marina; Florez...",Infect Genet Evol,,,,document_parses/pdf_json/d2d3b8b04378bebdde024...,document_parses/pmc_json/PMC7609240.xml.json,https://www.ncbi.nlm.nih.gov/pubmed/33157300/;...,226238553,2020.0,2020-11-04,B.1.13,doi:10.1016/j.meegid.2020.104616
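
The pair above suggests one way to check for such copies: look for records without an id whose title also occurs on a record that has one. Below is a rough sketch on plain dictionaries with made-up records. `normalize_title` and `find_copies` are hypothetical helpers, and exact title matching after lower-casing is an assumption that will miss copies with edited titles.

```python
def normalize_title(title):
    """Lower-case a title and collapse whitespace for comparison."""
    return " ".join(title.lower().split())


def find_copies(records):
    """Return id-less records whose title also appears on a record with an id."""
    titles_with_id = {normalize_title(r["title"]) for r in records if r["id"]}
    return [
        r for r in records
        if not r["id"] and normalize_title(r["title"]) in titles_with_id
    ]


# Made-up records for illustration only
records = [
    {"title": "Example study A", "source": "WHO", "id": ""},
    {"title": "Example  Study A", "source": "PMC", "id": "doi:10.0000/example"},
    {"title": "Example study B", "source": "WHO", "id": ""},
]
print(find_copies(records))
```

Records without an id that do not match any other title, like the third one here, would need a separate decision.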


In [25]:
hits.query('id != ""', inplace=True)

In [26]:
print("Total number of matches", hits.shape[0])

Total number of matches 449


In [27]:
hits.to_csv(NEO4J_IMPORT / "01h-CORDLineages.csv", index=False)