## 2. Analyze CORD-19 Datasets
### COVID-19 Open Research Dataset Challenge (CORD-19) Working Notebooks

This is a working notebook for the [COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) to help you jump-start your analysis of the CORD-19 dataset.

<img src="https://miro.medium.com/max/3648/1*596Ur1UdO-fzQsaiGPrNQg.png" width="700"/>

Attributions:
* The licenses for each dataset used in this workbook can be found in the `all_sources_metadata` CSV file, which is included in the [downloaded dataset](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/download).
* For the 2020-03-13 dataset:
  * `comm_use_subset`: Commercial use subset (includes PMC content) -- 9000 papers, 186 MB
  * `noncomm_use_subset`: Non-commercial use subset (includes PMC content) -- 1973 papers, 36 MB
  * `biorxiv_medrxiv`: bioRxiv/medRxiv subset (pre-prints that are not peer reviewed) -- 803 papers, 13 MB
* When using Databricks or Databricks Community Edition, a copy of this dataset has been made available at `/databricks-datasets/COVID/CORD-19`
* This notebook is freely available to share, licensed under [CC BY 3.0](https://creativecommons.org/licenses/by/3.0/us/)

#### Configure Parquet Path Variables
The data is saved in Parquet format under `/tmp/dennylee/COVID/CORD-19/2020-03-13/`; the variables below point to those files.

In [3]:
# Configure Parquet Paths in Python
comm_use_subset_pq_path = "/tmp/dennylee/COVID/CORD-19/2020-03-13/comm_use_subset.parquet"
noncomm_use_subset_pq_path = "/tmp/dennylee/COVID/CORD-19/2020-03-13/noncomm_use_subset.parquet"
biorxiv_medrxiv_pq_path = "/tmp/dennylee/COVID/CORD-19/2020-03-13/biorxiv_medrxiv/biorxiv_medrxiv.parquet"
json_schema_path = "/databricks-datasets/COVID/CORD-19/2020-03-13/json_schema.txt"

# Configure Paths as Shell Environment Variables (readable from %sh cells)
import os
os.environ['comm_use_subset_pq_path'] = comm_use_subset_pq_path
os.environ['noncomm_use_subset_pq_path'] = noncomm_use_subset_pq_path
os.environ['biorxiv_medrxiv_pq_path'] = biorxiv_medrxiv_pq_path
os.environ['json_schema_path'] = json_schema_path
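
The paths above share a common layout, so they could also be derived from a single base directory. The following is only a minimal sketch: `PARQUET_BASE` and `parquet_path` are hypothetical names that do not appear in the original notebook. The bioRxiv/medRxiv path has an extra directory level, which the sketch passes explicitly.

```python
import os
import posixpath

# Hypothetical base directory, copied from the paths configured above
PARQUET_BASE = "/tmp/dennylee/COVID/CORD-19/2020-03-13"

def parquet_path(*parts):
    """Join path parts under PARQUET_BASE using POSIX-style separators."""
    return posixpath.join(PARQUET_BASE, *parts)

paths = {
    "comm_use_subset_pq_path": parquet_path("comm_use_subset.parquet"),
    "noncomm_use_subset_pq_path": parquet_path("noncomm_use_subset.parquet"),
    "biorxiv_medrxiv_pq_path": parquet_path("biorxiv_medrxiv", "biorxiv_medrxiv.parquet"),
}

# Export each path so that shell cells can read it as $name
for name, value in paths.items():
    os.environ[name] = value
```

Keeping the names in one dictionary means the Python variables and the shell environment variables cannot drift apart.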

#### Read Parquet Files
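The source files are correctly formed JSON, which you can read with `spark.read.json` (note that you need to specify the *multiline* option). Because the data has already been saved as Parquet at the paths configured above, this notebook reads it back with `spark.read.format("parquet")`.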

In [5]:
# Reread files
comm_use_subset = spark.read.format("parquet").load(comm_use_subset_pq_path)
noncomm_use_subset = spark.read.format("parquet").load(noncomm_use_subset_pq_path)
biorxiv_medrxiv = spark.read.format("parquet").load(biorxiv_medrxiv_pq_path)
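
On the *multiline* option mentioned above: by default Spark's JSON reader expects one JSON object per line, while a pretty-printed document spans many lines. The pure-Python sketch below illustrates that difference without Spark. The record is made up (the `paper_id` and title are placeholders, not CORD-19 data), and `parse_line_by_line` is a hypothetical helper.

```python
import json

# A made-up, pretty-printed JSON document spanning several lines
doc = """{
  "paper_id": "example-id",
  "metadata": {
    "title": "Example title"
  }
}"""

def parse_line_by_line(text):
    """Mimic a one-object-per-line reader: each line must parse on its own."""
    records, bad_lines = [], 0
    for line in text.splitlines():
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines += 1
    return records, bad_lines

# No single line is a complete JSON value, so nothing is recovered
records, bad_lines = parse_line_by_line(doc)

# Parsing the whole document at once succeeds
whole = json.loads(doc)
```

Once the data is stored as Parquet this concern disappears, since the schema and records are stored in a binary columnar layout.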

In [6]:
# Count number of records
comm_use_subset_cnt = comm_use_subset.count()
noncomm_use_subset_cnt = noncomm_use_subset.count()
biorxiv_medrxiv_cnt = biorxiv_medrxiv.count()

# Print out
print("comm_use_subset: %s, noncomm_use_subset: %s, biorxiv_medrxiv: %s" % (comm_use_subset_cnt, noncomm_use_subset_cnt, biorxiv_medrxiv_cnt))

In [7]:
%sh 
cat /dbfs$json_schema_path

In [8]:
comm_use_subset.createOrReplaceTempView("comm_use_subset")
comm_use_subset.printSchema()

#### Extract Authors
To determine the source geographic location of these papers, let's extract the author metadata to create the `paperAuthorLocation` temporary view.

In [10]:
%sql
select paper_id, metadata.title, metadata.authors, metadata from comm_use_subset limit 10

paper_id,title,authors,metadata
64b4ec9158c8f378000f3d15492f317f19baeafb,Lipid-Based Particles: versatile Delivery Systems for Mucosal vaccination against infection,"List(List(List(null, null, null), , Rajko, Reljic, List(), ), List(List(null, null, null), , St, George', List(), ), List(List(null, null, null), , Vijay, Panchanathan, List(), ), List(List(null, null, null), , Beatrice, Jahn-Schmid, List(), ), List(List(null, null, null), gilles.bioley@chuv.ch, Gilles, Bioley, List(), ), List(List(null, null, null), , Blaise, Corthésy, List(), ))","List(List(List(List(null, null, null), , Rajko, Reljic, List(), ), List(List(null, null, null), , St, George', List(), ), List(List(null, null, null), , Vijay, Panchanathan, List(), ), List(List(null, null, null), , Beatrice, Jahn-Schmid, List(), ), List(List(null, null, null), gilles.bioley@chuv.ch, Gilles, Bioley, List(), ), List(List(null, null, null), , Blaise, Corthésy, List(), )), Lipid-Based Particles: versatile Delivery Systems for Mucosal vaccination against infection)"
61b90922be286db0340b0543233488c3764c611d,biomolecules Chemical and Conformational Diversity of Modified Nucleosides Affects tRNA Structure and Function,"List(List(List(State University of New York, , List(null, USA, null, 12222, NY, Albany)), , Ville, Väre, List(Y P), ), List(List(State University of New York, , List(null, USA, null, 12222, NY, Albany)), eeruysal@albany.edue.r.e., Emily, Eruysal, List(R), ), List(List(State University of New York, , List(null, USA, null, 12222, NY, Albany)), anarendran@albany.edua.n., Amithi, Narendran, List(), ), List(List(State University of New York, , List(null, USA, null, 12222, NY, Albany)), ksarachan@albany.eduk.l.s., Kathryn, Sarachan, List(L), ), List(List(State University of New York, , List(null, USA, null, 12222, NY, Albany)), pagris@albany.edu, Paul, Agris, List(F), ))","List(List(List(List(State University of New York, , List(null, USA, null, 12222, NY, Albany)), , Ville, Väre, List(Y P), ), List(List(State University of New York, , List(null, USA, null, 12222, NY, Albany)), eeruysal@albany.edue.r.e., Emily, Eruysal, List(R), ), List(List(State University of New York, , List(null, USA, null, 12222, NY, Albany)), anarendran@albany.edua.n., Amithi, Narendran, List(), ), List(List(State University of New York, , List(null, USA, null, 12222, NY, Albany)), ksarachan@albany.eduk.l.s., Kathryn, Sarachan, List(L), ), List(List(State University of New York, , List(null, USA, null, 12222, NY, Albany)), pagris@albany.edu, Paul, Agris, List(F), )), biomolecules Chemical and Conformational Diversity of Modified Nucleosides Affects tRNA Structure and Function)"
061eb24a1462aba7b53e6c099fe5d4bd046dd56a,molecules G-Quadruplex-Based Fluorescent Turn-On Ligands and Aptamers: From Development to Applications,"List(List(List(City University of Hong Kong, , List(null, China, null, null, null, Kowloon Tong, Hong Kong SAR)), , Mubarak, Umar, List(I), ), List(List(City University of Hong Kong, , List(null, China, null, null, null, Kowloon Tong, Hong Kong SAR)), , Danyang, Ji, List(), ), List(List(City University of Hong Kong, , List(null, China, null, null, null, Kowloon Tong, Hong Kong SAR)), , Chun-Yin, Chan, List(), ), List(List(City University of Hong Kong, , List(null, China, null, null, null, Kowloon Tong, Hong Kong SAR)), , Chun, Kwok, List(Kit), ))","List(List(List(List(City University of Hong Kong, , List(null, China, null, null, null, Kowloon Tong, Hong Kong SAR)), , Mubarak, Umar, List(I), ), List(List(City University of Hong Kong, , List(null, China, null, null, null, Kowloon Tong, Hong Kong SAR)), , Danyang, Ji, List(), ), List(List(City University of Hong Kong, , List(null, China, null, null, null, Kowloon Tong, Hong Kong SAR)), , Chun-Yin, Chan, List(), ), List(List(City University of Hong Kong, , List(null, China, null, null, null, Kowloon Tong, Hong Kong SAR)), , Chun, Kwok, List(Kit), )), molecules G-Quadruplex-Based Fluorescent Turn-On Ligands and Aptamers: From Development to Applications)"
33cf03ea72e45ea6d6beeccc91c5145807544728,molecules Developing Novel G-Quadruplex Ligands: From Interaction with Nucleic Acids to Interfering with Nucleic Acid-Protein Interaction,"List(List(List(Sun Yat-sen University, , List(null, China, null, 510006, null, Guangzhou)), , Zhi-Yin, Sun, List(), ), List(List(Sun Yat-sen University, , List(null, China, null, 510006, null, Guangzhou)), , Xiao-Na, Wang, List(), ), List(List(Sun Yat-sen University, , List(null, China, null, 510006, null, Guangzhou)), , Sui-Qi, Cheng, List(), ), List(List(Sun Yat-sen University, , List(null, China, null, 510006, null, Guangzhou)), suxx@mail2.sysu.edu.cnx.-x.s.*correspondence:outianm@mail.sysu.edu.cn, Xiao-Xuan, Su, List(), ), List(List(Sun Yat-sen University, , List(null, China, null, 510006, null, Guangzhou)), , Tian-Miao, Ou, List(), ))","List(List(List(List(Sun Yat-sen University, , List(null, China, null, 510006, null, Guangzhou)), , Zhi-Yin, Sun, List(), ), List(List(Sun Yat-sen University, , List(null, China, null, 510006, null, Guangzhou)), , Xiao-Na, Wang, List(), ), List(List(Sun Yat-sen University, , List(null, China, null, 510006, null, Guangzhou)), , Sui-Qi, Cheng, List(), ), List(List(Sun Yat-sen University, , List(null, China, null, 510006, null, Guangzhou)), suxx@mail2.sysu.edu.cnx.-x.s.*correspondence:outianm@mail.sysu.edu.cn, Xiao-Xuan, Su, List(), ), List(List(Sun Yat-sen University, , List(null, China, null, 510006, null, Guangzhou)), , Tian-Miao, Ou, List(), )), molecules Developing Novel G-Quadruplex Ligands: From Interaction with Nucleic Acids to Interfering with Nucleic Acid-Protein Interaction)"
78ef6e101522626b8c7c7621ec50146ab57cf488,a section of the journal Frontiers in Immunology inducible Bronchus-Associated Lymphoid Tissue: Taming inflammation in the Lung,"List(List(List(null, null, null), , Andreas, Habenicht, List(), ), List(List(null, null, null), , Sanjiv, Luther, List(A), ), List(List(null, null, null), randallt@uab.edu, Troy, Randall, List(D), ), List(List(null, null, null), , Ji, Hwang, List(Young), ), List(List(null, null, null), , Aaron, Silva-Sanchez, List(), ))","List(List(List(List(null, null, null), , Andreas, Habenicht, List(), ), List(List(null, null, null), , Sanjiv, Luther, List(A), ), List(List(null, null, null), randallt@uab.edu, Troy, Randall, List(D), ), List(List(null, null, null), , Ji, Hwang, List(Young), ), List(List(null, null, null), , Aaron, Silva-Sanchez, List(), )), a section of the journal Frontiers in Immunology inducible Bronchus-Associated Lymphoid Tissue: Taming inflammation in the Lung)"
a6f36e3233319626ec737895c1d52dedd7aac0bf,Recombination in Eukaryotic Single Stranded DNA Viruses,"List(List(List(University of Cape Town, Computational Biology Group, List(null, South Africa, null, 4579, null, Cape Town)), darrenpatrickmartin@gmail.com, Darren, Martin, List(P), ), List(List(Université de la Méditerranée, UMR CNRS 6578 Anthropologie Bioculturelle, Equipe ""Emergence et co-évolution virale"", Etablissement Français du Sang Alpes-Méditerranée, List(27 Bd. Jean Moulin, France, null, 13005, null, Marseille)), philippe.biagini@univmed.fr, Philippe, Biagini, List(), ), List(List(CIRAD, UMR 53 PVBMT CIRAD-Université de la Réunion, Pôle de Protection des Plantes, Ligne Paradis, List(Saint Pierre, La Réunion, France, null, 97410, null, null)), pierre.lefeuvre@gmail.com, Pierre, Lefeuvre, List(), ), List(List(University of Cape Town, Computational Biology Group, List(null, South Africa, null, 4579, null, Cape Town)), , Michael, Golden, List(), ), List(List(, CIRAD, UMR BGPI, TA A-54/K, Campus International de Montferrier-Baillarguet, List(null, France, null, 34398, null, Montpellier)), philippe.roumagnac@cirad.fr, Philippe, Roumagnac, List(), ), List(List(University of Cape Town, , List(null, South Africa, null, 7701, null, Rondebosch, Cape Town)), arvind.varsani@canterbury.ac.nz, Arvind, Varsani, List(), ))","List(List(List(List(University of Cape Town, Computational Biology Group, List(null, South Africa, null, 4579, null, Cape Town)), darrenpatrickmartin@gmail.com, Darren, Martin, List(P), ), List(List(Université de la Méditerranée, UMR CNRS 6578 Anthropologie Bioculturelle, Equipe ""Emergence et co-évolution virale"", Etablissement Français du Sang Alpes-Méditerranée, List(27 Bd. 
Jean Moulin, France, null, 13005, null, Marseille)), philippe.biagini@univmed.fr, Philippe, Biagini, List(), ), List(List(CIRAD, UMR 53 PVBMT CIRAD-Université de la Réunion, Pôle de Protection des Plantes, Ligne Paradis, List(Saint Pierre, La Réunion, France, null, 97410, null, null)), pierre.lefeuvre@gmail.com, Pierre, Lefeuvre, List(), ), List(List(University of Cape Town, Computational Biology Group, List(null, South Africa, null, 4579, null, Cape Town)), , Michael, Golden, List(), ), List(List(, CIRAD, UMR BGPI, TA A-54/K, Campus International de Montferrier-Baillarguet, List(null, France, null, 34398, null, Montpellier)), philippe.roumagnac@cirad.fr, Philippe, Roumagnac, List(), ), List(List(University of Cape Town, , List(null, South Africa, null, 7701, null, Rondebosch, Cape Town)), arvind.varsani@canterbury.ac.nz, Arvind, Varsani, List(), )), Recombination in Eukaryotic Single Stranded DNA Viruses)"
54458ee775b72a18830ed116a31336638b909fa7,viruses The Innate Antiviral Response in Animals: An Evolutionary Perspective from Flagellates to Humans,"List(List(List(Université de Strasbourg, , List(null, France, null, U1110, 67000, null, Strasbourg)), , Karim, Majzoub, List(), ), List(List(Université de Strasbourg, , List(null, France, null, U1110, 67000, null, Strasbourg)), , Florian, Wrensch, List(), ), List(List(Université de Strasbourg, , List(null, France, null, U1110, 67000, null, Strasbourg)), thomas.baumert@unistra.frt.f.b., Thomas, Baumert, List(F), ))","List(List(List(List(Université de Strasbourg, , List(null, France, null, U1110, 67000, null, Strasbourg)), , Karim, Majzoub, List(), ), List(List(Université de Strasbourg, , List(null, France, null, U1110, 67000, null, Strasbourg)), , Florian, Wrensch, List(), ), List(List(Université de Strasbourg, , List(null, France, null, U1110, 67000, null, Strasbourg)), thomas.baumert@unistra.frt.f.b., Thomas, Baumert, List(F), )), viruses The Innate Antiviral Response in Animals: An Evolutionary Perspective from Flagellates to Humans)"
bc763a4c1cb0175d61d1332af8d137d0482df73b,viruses When Dendritic Cells Go Viral: The Role of Siglec-1 in Host Defense and Dissemination of Enveloped Viruses,"List(List(List(IrsiCaixa AIDS Research Institute, , List(Ctra. de Canyet s/n, Spain, null, 08916, null, Badalona)), , Daniel, Perez-Zsolt, List(), ), List(List(IrsiCaixa AIDS Research Institute, , List(Ctra. de Canyet s/n, Spain, null, 08916, null, Badalona)), , Javier, Martinez-Picado, List(), ), List(List(IrsiCaixa AIDS Research Institute, , List(Ctra. de Canyet s/n, Spain, null, 08916, null, Badalona)), , Nuria, Izquierdo-Useros, List(), ))","List(List(List(List(IrsiCaixa AIDS Research Institute, , List(Ctra. de Canyet s/n, Spain, null, 08916, null, Badalona)), , Daniel, Perez-Zsolt, List(), ), List(List(IrsiCaixa AIDS Research Institute, , List(Ctra. de Canyet s/n, Spain, null, 08916, null, Badalona)), , Javier, Martinez-Picado, List(), ), List(List(IrsiCaixa AIDS Research Institute, , List(Ctra. de Canyet s/n, Spain, null, 08916, null, Badalona)), , Nuria, Izquierdo-Useros, List(), )), viruses When Dendritic Cells Go Viral: The Role of Siglec-1 in Host Defense and Dissemination of Enveloped Viruses)"
474dc8b54f9110f60b129220549c532355fe10b2,"The dipeptidyl peptidase family, prolyl oligopeptidase, and prolyl carboxypeptidase in the immune system and inflammatory disease, including atherosclerosis","List(List(List(null, null, null), , Heidi, Noels, List(), ), List(List(null, null, null), , Jürgen, Bernhagen, List(), ), List(List(null, null, null), , Rafael, Franco, List(), ), List(List(null, null, null), , Catherine, Abbott, List(Anne), ), List(List(null, null, null), , Mark, Gorrell, List(), ), List(List(null, null, null), ingrid.demeester@uantwerpen.be, Ingrid, De Meester, List(), ), List(List(null, null, null), , Yannick, Waumans, List(), ), List(List(null, null, null), , Lesley, Baerts, List(), ), List(List(null, null, null), , Kaat, Kehoe, List(), ), List(List(null, null, null), , Anne-Marie, Lambeir, List(), ))","List(List(List(List(null, null, null), , Heidi, Noels, List(), ), List(List(null, null, null), , Jürgen, Bernhagen, List(), ), List(List(null, null, null), , Rafael, Franco, List(), ), List(List(null, null, null), , Catherine, Abbott, List(Anne), ), List(List(null, null, null), , Mark, Gorrell, List(), ), List(List(null, null, null), ingrid.demeester@uantwerpen.be, Ingrid, De Meester, List(), ), List(List(null, null, null), , Yannick, Waumans, List(), ), List(List(null, null, null), , Lesley, Baerts, List(), ), List(List(null, null, null), , Kaat, Kehoe, List(), ), List(List(null, null, null), , Anne-Marie, Lambeir, List(), )), The dipeptidyl peptidase family, prolyl oligopeptidase, and prolyl carboxypeptidase in the immune system and inflammatory disease, including atherosclerosis)"
55dc3ccae37d88301441558752efbc4700c116e3,PEDV and PDCoV Pathogenesis: The Interplay Between Host Innate Immune Responses and Porcine Enteric Coronaviruses,"List(List(List(null, null, null), , Dongbo, Sun, List(), ), List(List(null, null, null), , Daniel, Marc, List(), ), List(List(null, null, null), surapong.koo@biotec.or.th, Surapong, Koonpaew, List(), ), List(List(null, null, null), , Samaporn, Teeravechyan, List(), ), List(List(null, null, null), , Phanramphoei, Namprachan Frantz, List(), ), List(List(null, null, null), , Thanathom, Chailangkarn, List(), ), List(List(null, null, null), , Anan, Jongkaewwattana, List(), ))","List(List(List(List(null, null, null), , Dongbo, Sun, List(), ), List(List(null, null, null), , Daniel, Marc, List(), ), List(List(null, null, null), surapong.koo@biotec.or.th, Surapong, Koonpaew, List(), ), List(List(null, null, null), , Samaporn, Teeravechyan, List(), ), List(List(null, null, null), , Phanramphoei, Namprachan Frantz, List(), ), List(List(null, null, null), , Thanathom, Chailangkarn, List(), ), List(List(null, null, null), , Anan, Jongkaewwattana, List(), )), PEDV and PDCoV Pathogenesis: The Interplay Between Host Innate Immune Responses and Porcine Enteric Coronaviruses)"


In [11]:
paperAuthorLocation = spark.sql("""
select paper_id, 
       title,  
       authors.affiliation.location.addrLine as addrLine, 
       authors.affiliation.location.country as country, 
       authors.affiliation.location.postBox as postBox,
       authors.affiliation.location.postCode as postCode,
       authors.affiliation.location.region as region,
       authors.affiliation.location.settlement as settlement
  from (
    select a.paper_id, a.metadata.title as title, b.authors
      from comm_use_subset a
        left join (
            select paper_id, explode(metadata.authors) as authors from comm_use_subset 
            ) b
           on b.paper_id = a.paper_id  
  ) x
""")
paperAuthorLocation.createOrReplaceTempView("paperAuthorLocation")
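
To make the shape of `paperAuthorLocation` concrete, here is a pure-Python sketch of what the query above does: `explode` yields one row per author, and the left join keeps papers that have no authors as a single row of nulls. The sample papers are made up, and `paper_author_location` is a hypothetical helper rather than part of the notebook.

```python
# Made-up papers shaped like the fields used in the query (values are placeholders)
papers = [
    {"paper_id": "p1", "metadata": {"title": "Paper one", "authors": [
        {"affiliation": {"location": {"country": "Norway", "settlement": "Bergen"}}},
        {"affiliation": {"location": {"country": "Norway", "settlement": "Bergen"}}},
    ]}},
    {"paper_id": "p2", "metadata": {"title": "Paper two", "authors": []}},
]

LOCATION_FIELDS = ["addrLine", "country", "postBox", "postCode", "region", "settlement"]

def paper_author_location(papers):
    """One row per (paper, author); papers without authors keep one all-null row."""
    rows = []
    for paper in papers:
        base = {"paper_id": paper["paper_id"], "title": paper["metadata"]["title"]}
        # explode() emits nothing for an empty array; the left join restores the paper
        authors = paper["metadata"]["authors"] or [None]
        for author in authors:
            location = ((author or {}).get("affiliation") or {}).get("location") or {}
            rows.append({**base, **{f: location.get(f) for f in LOCATION_FIELDS}})
    return rows

rows = paper_author_location(papers)
```

Because there is one row per author, a paper with several authors at the same institution produces repeated, identical location rows.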

#### Author Country Data Issues
There are some issues with the `authors.affiliation.location.country` information, such as a value of `USA, USA, USA, USA`.

In [13]:
%sql
select *
  from (
    select paper_id, metadata.title as title, explode(metadata.authors) as authors from comm_use_subset 
  ) a
where authors.affiliation.location.country like '%USA, USA, USA, USA%'

paper_id,title,authors
2a6a9de82dc0494f32530e1ee8ee7509367a04fd,Building International Genomics Collaboration for Global Health Security,"List(List(Blood Systems Research Institute, Los Alamos National Laboratory, List(null, USA, USA, USA, USA, null, null, NM, Metabiota, Los Alamos)), , Nathan, Wolfe, List(), )"
2a6a9de82dc0494f32530e1ee8ee7509367a04fd,Building International Genomics Collaboration for Global Health Security,"List(List(Blood Systems Research Institute, Los Alamos National Laboratory, List(null, USA, USA, USA, USA, null, null, NM, Metabiota, Los Alamos)), , Paras, Jain, List(), )"
2a6a9de82dc0494f32530e1ee8ee7509367a04fd,Building International Genomics Collaboration for Global Health Security,"List(List(Blood Systems Research Institute, Los Alamos National Laboratory, List(null, USA, USA, USA, USA, null, null, NM, Metabiota, Los Alamos)), , Eric, Delwart, List(), )"
2a6a9de82dc0494f32530e1ee8ee7509367a04fd,Building International Genomics Collaboration for Global Health Security,"List(List(Blood Systems Research Institute, Los Alamos National Laboratory, List(null, USA, USA, USA, USA, null, null, NM, Metabiota, Los Alamos)), hhcui@lanl.gov, Helen, Cui, List(H), )"
2a6a9de82dc0494f32530e1ee8ee7509367a04fd,Building International Genomics Collaboration for Global Health Security,"List(List(Blood Systems Research Institute, Los Alamos National Laboratory, List(null, USA, USA, USA, USA, null, null, NM, Metabiota, Los Alamos)), , Tracy, Erkkila, List(), )"
2a6a9de82dc0494f32530e1ee8ee7509367a04fd,Building International Genomics Collaboration for Global Health Security,"List(List(Blood Systems Research Institute, Los Alamos National Laboratory, List(null, USA, USA, USA, USA, null, null, NM, Metabiota, Los Alamos)), , Patrick, Chain, List(S G), )"
2a6a9de82dc0494f32530e1ee8ee7509367a04fd,Building International Genomics Collaboration for Global Health Security,"List(List(Blood Systems Research Institute, Los Alamos National Laboratory, List(null, USA, USA, USA, USA, null, null, NM, Metabiota, Los Alamos)), , Momchilo, Vuyisich, List(), )"


### Clean Up the Data
Let's work on cleaning up the author country data.

#### Review paperAuthorLocation
A quick review of the `paperAuthorLocation` temporary view.

In [16]:
%sql
select * from paperAuthorLocation limit 200

paper_id,title,addrLine,country,postBox,postCode,region,settlement
0a1533470817bc5ef0d0d0af56386a96b505dc0d,BMC Molecular Biology Evaluation of potential reference genes in real-time RT-PCR studies of Atlantic salmon,Nordnesboder 2,Norway,,N-5005,,Bergen
0a1533470817bc5ef0d0d0af56386a96b505dc0d,BMC Molecular Biology Evaluation of potential reference genes in real-time RT-PCR studies of Atlantic salmon,Nordnesboder 2,Norway,,N-5005,,Bergen
0a1533470817bc5ef0d0d0af56386a96b505dc0d,BMC Molecular Biology Evaluation of potential reference genes in real-time RT-PCR studies of Atlantic salmon,Nordnesboder 2,Norway,,N-5005,,Bergen
0a1533470817bc5ef0d0d0af56386a96b505dc0d,BMC Molecular Biology Evaluation of potential reference genes in real-time RT-PCR studies of Atlantic salmon,Thormøhlensgate 55,Norway,,N-5020,,Bergen
0a1533470817bc5ef0d0d0af56386a96b505dc0d,BMC Molecular Biology Evaluation of potential reference genes in real-time RT-PCR studies of Atlantic salmon,Thormøhlensgate 55,Norway,,N-5020,,Bergen
0ddcfc9bedfb0a87a7221dd2448bd41d3ba9cc51,How Can Viral Dynamics Models Inform Endpoint Measures in Clinical Trials of Therapies for Acute Viral Infections?,,United Kingdom,,,,London
0ddcfc9bedfb0a87a7221dd2448bd41d3ba9cc51,How Can Viral Dynamics Models Inform Endpoint Measures in Clinical Trials of Therapies for Acute Viral Infections?,,,,,,
0ddcfc9bedfb0a87a7221dd2448bd41d3ba9cc51,How Can Viral Dynamics Models Inform Endpoint Measures in Clinical Trials of Therapies for Acute Viral Infections?,,United Kingdom,,,,London
0ddcfc9bedfb0a87a7221dd2448bd41d3ba9cc51,How Can Viral Dynamics Models Inform Endpoint Measures in Clinical Trials of Therapies for Acute Viral Infections?,,United Kingdom,,,,London
0ddcfc9bedfb0a87a7221dd2448bd41d3ba9cc51,How Can Viral Dynamics Models Inform Endpoint Measures in Clinical Trials of Therapies for Acute Viral Infections?,,United Kingdom,,,,London


In [17]:
%sql
select count(1), count(distinct paper_id) as papers from paperAuthorLocation

count(1),papers
67709,9000


#### Extract country data
Extract country data (`paperCountries`) from the `paperAuthorLocation` temporary view.

In [19]:
paperCountries = spark.sql("""select distinct country from paperAuthorLocation""")
paperCountries.createOrReplaceTempView("paperCountries")

#### Use pycountry
Use `pycountry` to look up the alpha_3 code for each country; values that cannot be matched are marked with `-1`.

In [21]:
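# Import pycountry for country lookups
import pycountry

# Look up alpha_3 country code (using pycountry); returns "-1" when there is no match
def get_alpha_3(country):
    try_alpha_3 = "-1"
    try:
        try_alpha_3 = pycountry.countries.search_fuzzy(country)[0].alpha_3
    except Exception:
        # Covers null input as well as names that pycountry cannot match
        print("Unknown Country")
    return try_alpha_3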

# Register UDF
spark.udf.register("get_alpha_3", get_alpha_3)

In [23]:
%sql
select country, get_alpha_3(country) as alpha_3 from paperCountries

country,alpha_3
Utah,USA
"Spain, UNITED STATES",-1
United Kingdom A R,-1
"Ghana, Kenya",-1
Russia,RUS
USa,USA
Paraguay,PRY
"The Netherlands, The Netherlands",-1
"France., France",-1
israel,ISR


#### Steps to clean up country data

In [25]:
# Step 1: Extract alpha_3 for easily identifiable countries
paperCountries_s01 = spark.sql("""select country, get_alpha_3(country) as alpha_3 from paperCountries""")
paperCountries_s01.cache()
paperCountries_s01.createOrReplaceTempView("paperCountries_s01")

In [26]:
# Step 2: Extract alpha_3 for splittable identifiable countries (e.g. "USA, USA, USA", "Sweden, Norway", etc.)
paperCountries_s02 = spark.sql("""
select country, splitCountry as country_cleansed, get_alpha_3(ltrim(rtrim(splitCountry))) as alpha_3
  from (
select country, explode(split(regexp_replace(country, "[^a-zA-Z, ]+", ""), ',')) as splitCountry
  from paperCountries_s01
 where alpha_3 = '-1'
 ) x
""")
paperCountries_s02.cache()
paperCountries_s02.createOrReplaceTempView("paperCountries_s02")
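
The cleaning logic of step 2 can be sketched in plain Python: remove every character that is not an ASCII letter, comma, or space, split on commas, trim, and look each part up. This is only an illustration. `KNOWN` is a tiny hypothetical stand-in for the `pycountry` fuzzy search, and `lookup_alpha_3` and `split_and_lookup` are names introduced here, not part of the notebook.

```python
import re

# Hypothetical stand-in for pycountry's fuzzy search
KNOWN = {"spain": "ESP", "united states": "USA", "france": "FRA"}

def lookup_alpha_3(name):
    """Return the alpha_3 code for a name, or "-1" when it is not recognized."""
    return KNOWN.get(name.strip().lower(), "-1")

def split_and_lookup(country):
    """Mirror step 2: clean with the same regular expression, split on commas, trim, look up."""
    cleaned = re.sub(r"[^a-zA-Z, ]+", "", country)
    return [(part.strip(), lookup_alpha_3(part)) for part in cleaned.split(",")]

france = split_and_lookup("France., France")
spain_usa = split_and_lookup("Spain, UNITED STATES")
unresolved = split_and_lookup("United Kingdom A R")
```

Here `france` resolves to `FRA` twice, `spain_usa` yields one row for `ESP` and one for `USA`, and `unresolved` still carries `-1`, which is why later steps fall back to the settlement. One caveat of this regular expression is that it also strips accented letters and apostrophes, so a name such as `Côte d'Ivoire` becomes `Cte dIvoire` before the lookup.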

In [27]:
# Step 3: Extract yet to be identified countries (per steps 1 and 2) 
paperCountries_s03 = spark.sql("""select country, ltrim(rtrim(country_cleansed)) as country_cleansed, get_alpha_3(country_cleansed) from paperCountries_s02 where alpha_3 = '-1'""")
paperCountries_s03.cache()
paperCountries_s03.createOrReplaceTempView("paperCountries_s03")

In [28]:
# Step 4: Identify country by settlement
paperCountries_s04 = spark.sql("""
select distinct m.country_cleansed, f.settlement, get_alpha_3(f.settlement) as alpha_3
  from paperAuthorLocation f
    inner join paperCountries_s03 m
      on m.country = f.country
""")
paperCountries_s04.cache()
paperCountries_s04.createOrReplaceTempView("paperCountries_s04")

In [29]:
# Step 5: Build new mapping
map_country_cleansed = spark.sql("""select distinct country_cleansed, alpha_3 from paperCountries_s04 where alpha_3 <> '-1'""")
map_country_cleansed.cache()
map_country_cleansed.createOrReplaceTempView("map_country_cleansed")

In [30]:
# Step 6: Update paperCountries_s03 using the mapping from step 5
paperCountries_s06 = spark.sql("""
select f.country, f.country_cleansed, m.alpha_3
  from paperCountries_s03 f
    left join map_country_cleansed m
      on m.country_cleansed = f.country_cleansed
 where m.alpha_3 is not null      
""")
paperCountries_s06.cache()
paperCountries_s06.createOrReplaceTempView("paperCountries_s06")
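
Steps 3 to 6 amount to a fallback: when the country string cannot be resolved, try the author's settlement instead. The sketch below walks through that idea in plain Python with made-up rows. `SETTLEMENT_CODES` is a hypothetical stand-in for calling `get_alpha_3` on a settlement, and none of these variable names come from the notebook.

```python
# Hypothetical stand-in for get_alpha_3 applied to settlements
SETTLEMENT_CODES = {"london": "GBR", "salt lake city": "USA"}

def settlement_alpha_3(settlement):
    """Return the alpha_3 code for a settlement, or "-1" when unknown or missing."""
    if settlement is None:
        return "-1"
    return SETTLEMENT_CODES.get(settlement.strip().lower(), "-1")

# Step 3 analogue: country strings still unresolved after steps 1 and 2
unresolved = [{"country": "United Kingdom A R", "country_cleansed": "United Kingdom A R"}]

# Made-up author rows (country, settlement), shaped like paperAuthorLocation
author_rows = [("United Kingdom A R", "London"), ("United Kingdom A R", None)]

# Steps 4 and 5: look up each settlement and keep only the successful matches
mapping = set()
for item in unresolved:
    for country, settlement in author_rows:
        if country == item["country"]:
            code = settlement_alpha_3(settlement)
            if code != "-1":
                mapping.add((item["country_cleansed"], code))

# Step 6: join the recovered codes back; rows without a match are dropped
resolved = [
    {**item, "alpha_3": code}
    for item in unresolved
    for cleansed, code in sorted(mapping)
    if cleansed == item["country_cleansed"]
]
```

As in the SQL version, a cleansed country that matches settlements in more than one country would produce more than one row in the result.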

#### Build up map_country 
Build up `map_country` by combining the results of the previous steps.

In [32]:
map_country = spark.sql("""
select country, alpha_3 from paperCountries_s01 where alpha_3 <> '-1'
union all
select country, alpha_3 from paperCountries_s02 where alpha_3 <> '-1'
union all
select country, alpha_3 from paperCountries_s06
""")
map_country.cache()
map_country.createOrReplaceTempView("map_country")

#### Build paperCountryMapped
Put this all together to map each paper to its alpha_3 geo location.

In [34]:
paperCountryMapped = spark.sql("""
select p.paper_id, p.title, p.addrLine, p.country, p.postBox, p.postCode, p.region, p.settlement, m.alpha_3
 from paperAuthorLocation p
   left outer join map_country m
     on m.country = p.country
""")
paperCountryMapped.cache()
paperCountryMapped.createOrReplaceTempView("paperCountryMapped")
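
One consequence of joining on the raw `country` string is worth spelling out: a string that was split into several countries has one mapping row per country, so every matching author row is repeated once per code. That would explain alternating codes for the same paper in the output of the next cell. The plain-Python sketch below uses made-up rows, and `left_join` is a hypothetical helper.

```python
# Mapping rows: one input string can map to several codes
map_country = [
    ("Spain, UNITED STATES", "ESP"),
    ("Spain, UNITED STATES", "USA"),
    ("Utah", "USA"),
]

# Made-up author rows: (paper_id, country)
author_rows = [
    ("paper-a", "Spain, UNITED STATES"),
    ("paper-a", "Spain, UNITED STATES"),
    ("paper-b", "Utah"),
    ("paper-c", None),
]

def left_join(author_rows, map_country):
    """Left outer join on country; unmatched author rows keep alpha_3 = None."""
    joined = []
    for paper_id, country in author_rows:
        # A null country never matches, as in SQL
        matches = [code for name, code in map_country if country is not None and name == country]
        for code in matches or [None]:
            joined.append((paper_id, country, code))
    return joined

joined = left_join(author_rows, map_country)
```

The four author rows become six joined rows, so row counts on `paperCountryMapped` should not be read as author counts.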

In [35]:
%sql
select * from paperCountryMapped limit 100

paper_id,title,addrLine,country,postBox,postCode,region,settlement,alpha_3
d259c80b55cb69486ef75e66d13fe60688a1e028,Middle East Respiratory Coronavirus Accessory Protein 4a Inhibits PKR-Mediated Antiviral Stress Responses,Campus Universidad Autonoma de Madrid,"Spain, UNITED STATES",,,,Madrid,ESP
d259c80b55cb69486ef75e66d13fe60688a1e028,Middle East Respiratory Coronavirus Accessory Protein 4a Inhibits PKR-Mediated Antiviral Stress Responses,Campus Universidad Autonoma de Madrid,"Spain, UNITED STATES",,,,Madrid,USA
d259c80b55cb69486ef75e66d13fe60688a1e028,Middle East Respiratory Coronavirus Accessory Protein 4a Inhibits PKR-Mediated Antiviral Stress Responses,Campus Universidad Autonoma de Madrid,"Spain, UNITED STATES",,,,Madrid,ESP
d259c80b55cb69486ef75e66d13fe60688a1e028,Middle East Respiratory Coronavirus Accessory Protein 4a Inhibits PKR-Mediated Antiviral Stress Responses,Campus Universidad Autonoma de Madrid,"Spain, UNITED STATES",,,,Madrid,USA
d259c80b55cb69486ef75e66d13fe60688a1e028,Middle East Respiratory Coronavirus Accessory Protein 4a Inhibits PKR-Mediated Antiviral Stress Responses,Campus Universidad Autonoma de Madrid,"Spain, UNITED STATES",,,,Madrid,ESP
d259c80b55cb69486ef75e66d13fe60688a1e028,Middle East Respiratory Coronavirus Accessory Protein 4a Inhibits PKR-Mediated Antiviral Stress Responses,Campus Universidad Autonoma de Madrid,"Spain, UNITED STATES",,,,Madrid,USA
d0a7af58aa5e272f1c7aef4e6908dd3059d9173e,Proteasome inhibition in cancer is associated with enhanced tumor targeting by the adeno-associated virus/phage,,United Kingdom A R,,,,London,GBR
d0a7af58aa5e272f1c7aef4e6908dd3059d9173e,Proteasome inhibition in cancer is associated with enhanced tumor targeting by the adeno-associated virus/phage,,United Kingdom A R,,,,London,GBR
c7d60067e11331d3c5e1f9b1d79e70caacb13f25,,,Utah,,,,Salt Lake City,USA
c7d60067e11331d3c5e1f9b1d79e70caacb13f25,,,Utah,,,,Salt Lake City,USA


#### paperCountryMapped Descriptive Statistics

In [37]:
(ep_no, edp_no) = spark.sql("select count(1), count(distinct paper_id) from paperCountryMapped where country is null and settlement is null").collect()[0]
(ep_geo, edp_geo) = spark.sql("select count(1), count(distinct paper_id) from paperCountryMapped where country is not null or settlement is not null").collect()[0]
(ep_a3, edp_a3) = spark.sql("select count(1), count(distinct paper_id) from paperCountryMapped where alpha_3 is not null").collect()[0]
print("Distinct Papers with No Geographic Information: %s" % edp_no)
print("Distinct Papers with Some Geographic Information: %s" % edp_geo)
print("Distinct Papers with Identified Alpha_3 codes: %s" % edp_a3)
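
When reading these three numbers, keep in mind that the filters apply to author rows while the counts are of distinct papers, so one paper can fall into more than one category. A small plain-Python sketch with made-up rows shows the effect; `distinct_papers` is a hypothetical helper.

```python
# Made-up rows: (paper_id, country, settlement, alpha_3)
rows = [
    ("p1", "Norway", "Bergen", "NOR"),
    ("p1", None, None, None),
    ("p2", None, None, None),
    ("p3", "Atlantis", None, None),
]

def distinct_papers(rows, predicate):
    """Count distinct paper_ids among the rows that satisfy predicate."""
    return len({r[0] for r in rows if predicate(r)})

no_geo = distinct_papers(rows, lambda r: r[1] is None and r[2] is None)
some_geo = distinct_papers(rows, lambda r: r[1] is not None or r[2] is not None)
with_a3 = distinct_papers(rows, lambda r: r[3] is not None)
total = len({r[0] for r in rows})
```

In this toy data `no_geo` and `some_geo` are both 2 although there are only 3 papers, because `p1` has one author with a location and one without.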

### Visualize Paper Country Mapping
Map out the author countries for each paper. Note that a paper whose authors are in several countries is counted once for each of those countries, so the per-country totals involve some double counting.

In [39]:
%sql
select alpha_3, count(distinct paper_id) 
  from paperCountryMapped 
 where alpha_3 is not null
 group by alpha_3

alpha_3,count(DISTINCT paper_id)
HTI,2
PSE,1
LVA,2
POL,37
JAM,2
BRA,124
MOZ,2
CUB,3
JOR,7
FRA,284

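
The per-country aggregation above can be sketched in plain Python to show where the double counting comes from. The pairs below are made up, and the variable names are illustrative only.

```python
from collections import defaultdict

# Made-up (paper_id, alpha_3) pairs; None means the country was not resolved
pairs = [("p1", "FRA"), ("p1", "FRA"), ("p1", "BRA"), ("p2", "FRA"), ("p3", None)]

# Equivalent of: count(distinct paper_id) ... where alpha_3 is not null group by alpha_3
papers_by_country = defaultdict(set)
for paper_id, alpha_3 in pairs:
    if alpha_3 is not None:
        papers_by_country[alpha_3].add(paper_id)

counts = {code: len(ids) for code, ids in papers_by_country.items()}
papers_with_code = len({p for p, code in pairs if code is not None})
```

Repeated authors from the same country do not inflate a country's count, but the counts sum to 3 while only 2 papers have a resolved code, because `p1` is counted under both `FRA` and `BRA`.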