# Downloads Publication Information for Pangolin Lineages from the CORD-19 Data Set
**[Work in progress]**

This notebook identifies publications and preprints in the CORD-19 data set that mention PANGO lineages.

Data sources: [PANGO Lineage Designations](https://github.com/cov-lineages/pango-designation), 
[CORD-19](https://allenai.org/data/cord-19)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import pandas as pd
import io
import dateutil.parser  # import the submodule explicitly; `import dateutil` alone does not guarantee it is loaded
import re
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import


## Get PANGO lineages

In [4]:
pango_url = 'https://raw.githubusercontent.com/cov-lineages/pango-designation/master/lineage_notes.txt'

In [5]:
pangolin = pd.read_csv(pango_url, sep='\t', skiprows=1, dtype=str, names=['lineage', 'description'])

In [6]:
pangolin['lineage'] = pangolin['lineage'].str.strip()

In [7]:
pangolin.sample(10)

Unnamed: 0,lineage,description
18,A.16,Japanese lineage
56,B.1.1.37,UK lineage
1292,*B.1.1.76,Withdrawn: USA lineage
333,B.1.1.381,Peruvian
1194,D.5,"Alias of B.1.1.25.5, Swedish/ Denmark lineage"
1314,*B.1.1.156,Withdrawn: South African lineage
1209,N.4,"Alias of B.1.1.33.4, Chilean"
1461,*B.1.283,Withdrawn: USA lineage
17,A.15,Sweden/ Denmark lineage
1436,*B.1.204,Withdrawn: Reassigned in the current tree. USA...


In [8]:
lineages = set(pangolin['lineage'].unique())

## Get CORD-19 Metadata

In [9]:
CACHE = Path(NEO4J_IMPORT / 'cache/cord19/metadata.csv')

In [10]:
metadata = pd.read_csv(CACHE, dtype='str')

In [11]:
metadata.fillna('', inplace=True)
# derive a year string and a parsed date from publish_time
metadata['year'] = metadata['publish_time'].apply(lambda d: d[:4] if len(d) > 4 else '')
metadata['date'] = metadata['publish_time'].apply(lambda d: dateutil.parser.parse(d) if len(d) > 0 else '')
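
In the sampled output further down, records whose `publish_time` is only a year (for example `2021`) show a `date` of `2021-04-22`. This is consistent with `dateutil.parser.parse` filling missing month and day fields from the current date. A minimal standard-library sketch that maps year-only values to January 1 instead is shown here. The helper name `parse_publish_time` is hypothetical, and the sketch assumes that `publish_time` only contains `YYYY-MM-DD`, `YYYY`, or empty strings.

```python
from datetime import datetime

def parse_publish_time(value):
    """Parse 'YYYY-MM-DD' or 'YYYY'; return None for anything else.

    Hypothetical helper, not part of the notebook. Year-only values
    become January 1 of that year rather than depending on today's date.
    """
    for fmt in ('%Y-%m-%d', '%Y'):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None

print(parse_publish_time('2021-04-06'))  # 2021-04-06 00:00:00
print(parse_publish_time('2021'))        # 2021-01-01 00:00:00
print(parse_publish_time(''))            # None
```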

In [12]:
print("Total number of papers", metadata.shape[0])

Total number of papers 536817


## Extract a list of PANGO lineages

Replace special characters with spaces to simplify parsing of lineages in parentheses, comma-separated lists, etc.

In [13]:
metadata['title'] = metadata['title'].replace('[()/,]', ' ', regex=True)
metadata['abstract'] = metadata['abstract'].replace('[()/,]', ' ', regex=True)
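
To illustrate the effect of this replacement on a single string, the sketch below applies the same character class with `re.sub`. The title is made up for illustration. After the replacement, every lineage is delimited by whitespace.

```python
import re

# Hypothetical title, made up for illustration only
title = "Variants of concern (B.1.1.7, B.1.351/P.1) in 2021"

# Same character class as in the notebook cell: ( ) / ,
cleaned = re.sub(r'[()/,]', ' ', title)

# Splitting on whitespace now yields clean tokens
tokens = cleaned.split()
print(tokens)
```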

Match PANGO patterns and check against the list of known lineages.

In [14]:
# Raw strings avoid invalid-escape warnings for \d
pattern1 = re.compile(r' [A-Z]{1,2}[.]\d+ ')
pattern2 = re.compile(r' [A-Z]{1,2}[.]\d+[.]\d+ ')
pattern3 = re.compile(r' [A-Z]{1,2}[.]\d+[.]\d+[.]\d+ ')

In [15]:
def get_lineages(row):
    text = ' ' + row.title + ' ' + row.abstract + ' '
    lin = pattern1.findall(text) + pattern2.findall(text) + pattern3.findall(text)
    u_lin = set()
    for l in lin:
        l = l.strip()
        if l in lineages:
            u_lin.add(l)
    return ";".join(u_lin)

In [16]:
metadata['lineages'] = metadata.apply(get_lineages, axis=1)
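
One limitation of patterns that consume a space on both sides: when two lineages of the same depth are separated by a single space, `findall` uses that space for the first match and skips the second lineage. The sketch below shows a possible alternative, not the notebook's method. It uses lookarounds that require whitespace on both sides without consuming it, and a single pattern for one to three numeric levels. The names `known_lineages`, `lineage_pattern`, and `find_lineages` are hypothetical, and the text is made up.

```python
import re

# Hypothetical subset of known lineages, for illustration only
known_lineages = {'B.1.1.7', 'B.1.351', 'B.1.429', 'P.1'}

# (?<!\S) and (?!\S) assert whitespace or a string boundary on each side
# without consuming it, so adjacent lineages can both match
lineage_pattern = re.compile(r'(?<!\S)[A-Z]{1,2}(?:[.]\d+){1,3}(?!\S)')

def find_lineages(text):
    """Return the sorted known lineages mentioned in text."""
    candidates = lineage_pattern.findall(text)
    return sorted(set(candidates) & known_lineages)

text = 'Neutralization of B.1.351 B.1.429 and P.1 compared with B.1.1.7'
found = find_lineages(text)
print(found)  # ['B.1.1.7', 'B.1.351', 'B.1.429', 'P.1']

# For comparison: the space-delimited two-level pattern finds only the first
spaced = re.compile(r' [A-Z]{1,2}[.]\d+[.]\d+ ').findall(' B.1.351 B.1.429 ')
print(spaced)  # [' B.1.351 ']
```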

Keep only papers that map to PANGO lineages

In [17]:
hits = metadata[metadata['lineages'].str.len() > 0].copy()

### Assign CURIEs from [Identifiers.org](https://identifiers.org)

In [18]:
hits['doi'] = hits['doi'].apply(lambda x: 'doi:' + x if len(x) > 0 else '')
hits['pubmed_id'] = hits['pubmed_id'].apply(lambda x: 'pubmed:' + x if len(x) > 0 else '')
hits['pmcid'] = hits['pmcid'].apply(lambda x: 'pmc:' + x if len(x) > 0 else '')
hits['arxiv_id'] = hits['arxiv_id'].apply(lambda x: 'arxiv:' + x if len(x) > 0 else '')
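
The four assignments above share one rule: prefix non-empty identifiers and leave missing ones empty. A minimal sketch of that rule as a helper follows. The name `to_curie` is hypothetical and not part of the notebook.

```python
def to_curie(prefix, value):
    """Prefix a non-empty identifier; keep missing (empty) values empty."""
    return f'{prefix}:{value}' if value else ''

print(to_curie('doi', '10.1002/jmv.26823'))  # doi:10.1002/jmv.26823
print(to_curie('pubmed', ''))                # prints an empty line
```

Keeping missing values as empty strings matters later, because `create_id` tests each identifier against `''`.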

In [19]:
#hits.sort_values(by=['publish_time'], ascending=False, inplace=True)

In [20]:
print("Number of matches", hits.shape[0])

Number of matches 592


In [21]:
def create_id(row):
    """Creates a unique id using the most commonly available id in priority order"""
    if row.doi != '':
        return row.doi
    elif row.pubmed_id != '':
        return row.pubmed_id
    elif row.pmcid != '':
        return row.pmcid
    elif row.arxiv_id != '':
        return row.arxiv_id
    elif row.url != '':
        return row.url
    else:
        # TODO deal with WHO papers here?
        return ''

In [22]:
hits['id'] = hits.apply(create_id, axis=1)

In [23]:
hits.sample(20)

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id,year,date,lineages,id
383903,46pqwsqw,b729ed05d200a4c90dcf2a384f93f8ed84ca0ca8,BioRxiv; Medline; PMC; WHO,A new SARS-CoV-2 lineage that shares mutations...,doi:10.1101/2021.04.05.438352,pmc:PMC8043452,pubmed:33851162,biorxiv,We report a SARS-CoV-2 lineage that shares N50...,2021-04-06,"Thornlow, Bryan; Hinrichs, Angie S.; Jain, Mit...",bioRxiv,,,,document_parses/pdf_json/b729ed05d200a4c90dcf2...,document_parses/pmc_json/PMC8043452.xml.json,https://doi.org/10.1101/2021.04.05.438352; htt...,233175293,2021.0,2021-04-06,B.1.1.7,doi:10.1101/2021.04.05.438352
48848,y41z0l47,,Medline,Response to: COVID-19 re-infection. Vaccinated...,doi:10.1111/eci.13544,,pubmed:33725359,unk,Vaccination against SARS-CoV-2 has shown to of...,2021-03-16,"Schiavone, Marco; Gasperetti, Alessio; Mitacch...",European journal of clinical investigation,,,,,,https://doi.org/10.1111/eci.13544; https://www...,232261925,2021.0,2021-03-16,P.1;B.1.351;B.1.1.7,doi:10.1111/eci.13544
230447,ylhoydgr,,WHO,The variant gambit: COVID-19's next move,,,,unk,More than a year after its emergence COVID-19...,2021,"Plante, Jessica A; Mitchell, Brooke M; Plante,...",Cell host microbe,,#1116431,,,,,232081438,,2021-04-22,P.1;B.1.429;B.1.351;B.1.1.7,
496497,it1lbk8q,3c63f6f65d70209f54329313fd9136ae2cafb8d4,BioRxiv; Medline; WHO,Effect of natural mutations of SARS-CoV-2 on s...,doi:10.1101/2021.03.11.435037,,pubmed:33758838,biorxiv,New SARS-CoV-2 variants that have accumulated ...,2021-03-15,"Gobeil, Sophie M-C.; Janowska, Katarzyna; McDo...",bioRxiv,,,,document_parses/pdf_json/3c63f6f65d70209f54329...,,https://www.ncbi.nlm.nih.gov/pubmed/33758838/;...,232224525,2021.0,2021-03-15,B.1.1.28;B.1.351;B.1.1.7,doi:10.1101/2021.03.11.435037
423146,v115xwc3,eb1e5c04651fa8e660474feafe29cf6667316a74,BioRxiv; WHO,Emerging variants of concern in SARS-CoV-2 mem...,doi:10.1101/2021.03.11.434758,,,biorxiv,Mutations in the SARS-CoV-2 Membrane M gene ...,2021-03-11,"Shen, Lishuang; Bard, Jennifer Dien; Triche, T...",bioRxiv,,,,document_parses/pdf_json/eb1e5c04651fa8e660474...,,https://doi.org/10.1101/2021.03.11.434758,232224629,2021.0,2021-03-11,B.1.429;B.1;B.1.1.7,doi:10.1101/2021.03.11.434758
273960,fdxvruj2,,WHO,VOC 202012 01 Variant Is Effectively Neutraliz...,,,,unk,The coronavirus disease 2019 Covid-19 pandem...,2021,"Rondinone, Valeria; Pace, Lorenzo; Fasanella, ...",Viruses,,#1079723,,,,,232075133,,2021-04-22,B.1;B.1.1.7,
426052,yxo2cooa,fba9a188a93615d3f36205ad150e67d23d893bb9,Medline; PMC,Will the emergent SARS‐CoV2 B.1.1.7 lineage af...,doi:10.1002/jmv.26823,pmc:PMC8013853,pubmed:33506970,no-cc,As the coronavirus disease 2019 pandemic keep ...,2021-02-09,"Ramírez, Juan D.; Muñoz, Marina; Patiño, Luz H...",J Med Virol,,,,document_parses/pdf_json/fba9a188a93615d3f3620...,document_parses/pmc_json/PMC8013853.xml.json,https://doi.org/10.1002/jmv.26823; https://www...,231756849,2021.0,2021-02-09,B.1.1.7,doi:10.1002/jmv.26823
380251,g9mzoyw9,31ac1241c8ee1021ecdd8923fd9efb219ec85896,MedRxiv; WHO,Detection of the Novel SARS-CoV-2 European Lin...,doi:10.1101/2020.11.30.20241265,,,medrxiv,Background: Travel-related dissemination of SA...,2020-12-02,"Guthrie, J. L.; Teatero, S.; Zittermann, S.; C...",,,,,document_parses/pdf_json/31ac1241c8ee1021ecdd8...,,https://doi.org/10.1101/2020.11.30.20241265; h...,227247303,2020.0,2020-12-02,B.1.177,doi:10.1101/2020.11.30.20241265
453462,ole0fpmo,2de3e960ae944f08ff04bae8edd063e229c4ea6c,BioRxiv; WHO,Molecular Mechanism of the N501Y Mutation for ...,doi:10.1101/2021.01.04.425316,,,biorxiv,Coronavirus disease 2019 COVID-19 has been a...,2021-01-05,"Luan, Binquan; Wang, Haoran; Huynh, Tien",bioRxiv,,,,document_parses/pdf_json/2de3e960ae944f08ff04b...,,https://doi.org/10.1101/2021.01.04.425316,231615049,2021.0,2021-01-05,B.1.1.7,doi:10.1101/2021.01.04.425316
50608,q0ysdgsf,,Medline,The first SARS-CoV-2 genetic variants of conce...,doi:10.1016/j.advms.2021.03.005,,pubmed:33827042,unk,PURPOSE We analyzed the SARS-CoV-2 genome usin...,2021-03-30,"Charkiewicz, Radosław; Nikliński, Jacek; Biece...",Advances in medical sciences,,,,,,https://doi.org/10.1016/j.advms.2021.03.005; h...,233183354,2021.0,2021-03-30,B.1.351;B.1.1.7,doi:10.1016/j.advms.2021.03.005


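WHO-only records in the sample above have no DOI, PubMed, PMC, or arXiv identifier and no URL, so `create_id` returns an empty `id` for them. They appear to be copies of articles that are already present in the data set, as the example below illustrates, and are ignored for now.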

In [24]:
hits[hits['abstract'].str.contains('INTRODUCTION: Venezuela and')]

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id,year,date,lineages,id
278446,84bhzaqw,,WHO,SARS-CoV-2 spread across the Colombian-Venezue...,,,,unk,INTRODUCTION: Venezuela and Colombia both adop...,2020,"Paniz-Mondolfi, Alberto; Muñoz, Marina; Florez...",Infect Genet Evol,,#907154,,,,,220444484,,2020-04-22,B.1.13,
518529,0vtredu8,d2d3b8b04378bebdde024760e49c297a24dd2af7,Elsevier; Medline; PMC,SARS-CoV-2 spread across the Colombian-Venezue...,doi:10.1016/j.meegid.2020.104616,pmc:PMC7609240,pubmed:33157300,no-cc,INTRODUCTION: Venezuela and Colombia both adop...,2020-11-04,"Paniz-Mondolfi, Alberto; Muñoz, Marina; Florez...",Infect Genet Evol,,,,document_parses/pdf_json/d2d3b8b04378bebdde024...,document_parses/pmc_json/PMC7609240.xml.json,https://www.ncbi.nlm.nih.gov/pubmed/33157300/;...,226238553,2020.0,2020-11-04,B.1.13,doi:10.1016/j.meegid.2020.104616
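
Whether every record without an `id` really has a counterpart could be checked by comparing titles. The sketch below works on a toy stand-in for `hits`. All values are made up, and exact title equality is an assumption that may miss copies with slightly different titles.

```python
import pandas as pd

# Toy stand-in for `hits`; all values are made up for illustration
toy = pd.DataFrame({
    'cord_uid': ['u1', 'u2', 'u3'],
    'source_x': ['WHO', 'Elsevier; Medline; PMC', 'WHO'],
    'title': ['Paper A', 'Paper A', 'Paper B'],
    'id': ['', 'doi:10.0000/a', ''],
})

# Rows whose title also occurs in at least one other row
shared_title = toy.duplicated(subset='title', keep=False)
no_id = toy['id'] == ''

# Records without an id that have a same-title counterpart
copies = toy[no_id & shared_title]['cord_uid'].tolist()
# Records without an id and without a counterpart; filtering on id drops them
unique_without_id = toy[no_id & ~shared_title]['cord_uid'].tolist()

print(copies, unique_without_id)  # ['u1'] ['u3']
```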


In [25]:
hits.query('id != ""', inplace=True)

In [26]:
print("Total number of matches", hits.shape[0])

Total number of matches 449


In [27]:
hits.to_csv(NEO4J_IMPORT / "01h-CORDLineages.csv", index=False)