COVID-19 Open Research Dataset (work in progress)
Latest commit

dgrtwo Adjustments to data and documentation
Documentation changes:

* Documented the cord19_paper_citations dataset
* Expanded the README
* Changed seealso to source

Data changes:

* Filtered out some of the citations that weren't to actual papers (e.g. "All rights reserved")
* Changed the section titles to title case

Other changes:

* Removed packages used only in data-raw from Suggests
Latest commit 4f68b3d Mar 19, 2020

cord19

The cord19 package shares the COVID-19 Open Research Dataset (CORD-19) in a tidy form that is easily analyzed within R.

Installation

Install the package from GitHub as follows:

remotes::install_github("dgrtwo/cord19")

Papers

The package turns the CORD-19 dataset into a set of tidy tables.

For example, the paper metadata is stored in cord19_papers.

library(dplyr)
library(cord19)

cord19_papers
#> # A tibble: 12,503 x 14
#>    paper_id source title doi   pmcid pubmed_id license abstract publish_time
#>    <chr>    <chr>  <chr> <chr> <lgl>     <dbl> <chr>   <chr>           <dbl>
#>  1 210a892… CZI    Incu… 10.3… NA           NA cc-by   The geo…         2020
#>  2 e3b40cc… CZI    Char… 10.3… NA     32093211 cc-by   In Dece…         2020
#>  3 0df0d52… CZI    An u… 10.1… NA           NA cc-by-… The bas…         2020
#>  4 f242425… CZI    Real… 10.1… NA           NA cc-by-… The ini…         2020
#>  5 e1b336d… CZI    COVI… 10.1… NA           NA cc-by-… Cruise …         2020
#>  6 e923910… CZI    Dist… 10.1… NA           NA cc-by   Coronav…         2020
#>  7 469ed0f… CZI    Firs… 10.1… NA           NA cc-by   Similar…         2020
#>  8 4e550e0… CZI    Effe… 10.2… NA           NA cc-by   We simu…         2020
#>  9 4bbb0c5… CZI    Geno… 10.1… NA     32108862 cc-by-… SUMMARY…         2020
#> 10 c821803… CZI    Case… 10.3… NA           NA cc-by-… Since m…         2020
#> # … with 12,493 more rows, and 5 more variables: authors <chr>, journal <chr>,
#> #   microsoft_academic_paper_id <dbl>, who_number_covidence <chr>,
#> #   has_full_text <lgl>

# Learn how many papers came from each journal
cord19_papers %>%
    count(journal, sort = TRUE)
#> # A tibble: 1,300 x 2
#>    journal              n
#>    <chr>            <int>
#>  1 PLoS One          1560
#>  2 Emerg Infect Dis   726
#>  3 Viruses            545
#>  4 <NA>               503
#>  5 Sci Rep            485
#>  6 PLoS Pathog        357
#>  7 Virol J            357
#>  8 BMC Infect Dis     246
#>  9 Front Immunol      210
#> 10 Front Microbiol    202
#> # … with 1,290 more rows

Full text

Most usefully, cord19_paragraphs has the full text of the papers, with one observation for each paragraph.

cord19_paragraphs
#> # A tibble: 364,755 x 4
#>    paper_id               paragraph section text                               
#>    <chr>                      <int> <chr>   <chr>                              
#>  1 0015023cc06b5362d332b…         1 <NA>    VP3, and VP0 (which is further pro…
#>  2 0015023cc06b5362d332b…         2 70      The FMDV 5′ UTR is the largest kno…
#>  3 0015023cc06b5362d332b…         3 120     To introduce mutations into the PK…
#>  4 0015023cc06b5362d332b…         4 120     132 133 author/funder. All rights …
#>  5 0015023cc06b5362d332b…         5 120     The copyright holder for this prep…
#>  6 0015023cc06b5362d332b…         6 135     Mutations were then introduced int…
#>  7 0015023cc06b5362d332b…         7 136     To assess the effects of truncatio…
#>  8 0015023cc06b5362d332b…         8 144     Transcription reactions to produce…
#>  9 0015023cc06b5362d332b…         9 144     The copyright holder for this prep…
#> 10 0015023cc06b5362d332b…        10 144     The copyright holder for this prep…
#> # … with 364,745 more rows

# What are the most common section titles?
cord19_paragraphs %>%
    count(section, sort = TRUE)
#> # A tibble: 79,531 x 2
#>    section                   n
#>    <chr>                 <int>
#>  1 Discussion            41868
#>  2 Introduction          24128
#>  3 <NA>                  12503
#>  4 Results               11317
#>  5 Background             6709
#>  6 Conclusions            5328
#>  7 Methods                4167
#>  8 Materials And Methods  3677
#>  9 Conclusion             2872
#> 10 Statistical Analysis   2689
#> # … with 79,521 more rows

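This one-row-per-paragraph format lends itself to text analysis with a package like tidytext, for example counting the most common words in a random sample of papers.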

library(tidytext)
set.seed(2020)

# Sample 100 random papers
paper_words <- cord19_paragraphs %>%
    filter(paper_id %in% sample(unique(paper_id), 100)) %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words, by = "word")

paper_words %>%
    count(word, sort = TRUE)
#> # A tibble: 21,612 x 2
#>    word          n
#>    <chr>     <int>
#>  1 1          1556
#>  2 2          1366
#>  3 cells      1300
#>  4 virus      1184
#>  5 infection  1033
#>  6 3           920
#>  7 cell        854
#>  8 study       848
#>  9 viral       830
#> 10 data        773
#> # … with 21,602 more rows
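
The top of the word counts above is dominated by bare numbers such as 1, 2 and 3. One way to drop them is to filter out tokens made up only of digits before counting. The following is a minimal sketch on a small made-up table rather than the real data: toy_paragraphs and its rows are hypothetical stand-ins for cord19_paragraphs.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for cord19_paragraphs (hypothetical rows, not real data)
toy_paragraphs <- tibble(
    paper_id = c("a", "a", "b"),
    text = c("Figure 1 shows 2 infected cells",
             "The virus infected 3 cells",
             "Viral load in 10 cells")
)

toy_words <- toy_paragraphs %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words, by = "word") %>%
    # Drop tokens that consist entirely of digits
    filter(!grepl("^[0-9]+$", word))

toy_words %>%
    count(word, sort = TRUE)
```

The same filter() step could be added to the paper_words pipeline above. Whether to also drop tokens such as "1a" or "2.5" depends on the analysis.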

Citations

The package also includes the articles cited by each paper, in cord19_paper_citations.

cord19_paper_citations
#> # A tibble: 605,650 x 9
#>    paper_id       ref_id title            venue  volume issn  pages  year doi  
#>    <chr>          <chr>  <chr>            <chr>  <chr>  <chr> <chr> <int> <chr>
#>  1 0015023cc06b5… b0     Genetic economy… PLOS … 13     ""    ""     2017 <NA> 
#>  2 0015023cc06b5… b2     A universal pro… BMC G… 604    ""    ""     2014 <NA> 
#>  3 0015023cc06b5… b3     Library prepara… Nat P… 9      ""    1760…  2014 <NA> 
#>  4 0015023cc06b5… b4     IDBA-UD: a de n… ""     ""     ""    ""     2012 <NA> 
#>  5 0015023cc06b5… b6     Basic local ali… J Mol… 215    ""    403-…  1990 <NA> 
#>  6 0015023cc06b5… b7     Genetically eng… J 614… 67     ""    5139…  1993 <NA> 
#>  7 0015023cc06b5… b9     Both cis and tr… J Vir… 90     ""    6864…  2016 <NA> 
#>  8 0015023cc06b5… b10    Mutational anal… J Vir… 620    ""    2027…  1996 <NA> 
#>  9 0015023cc06b5… b12    Figure 3. The p… ""     ""     ""    ""       NA <NA> 
#> 10 0015023cc06b5… b13    A replicon 650 … ""     ""     ""    ""       NA <NA> 
#> # … with 605,640 more rows

What are the most commonly cited articles?

cord19_paper_citations %>%
    count(title, sort = TRUE)
#> # A tibble: 417,863 x 2
#>    title                                                                      n
#>    <chr>                                                                  <int>
#>  1 Isolation of a novel coronavirus from a man with pneumonia in Saudi A…   397
#>  2 Submit your next manuscript to BioMed Central and take full advantage…   295
#>  3 Identification of a novel coronavirus in patients with severe acute r…   236
#>  4 A novel coronavirus associated with severe acute respiratory syndrome    226
#>  5 Global trends in emerging infectious diseases                            193
#>  6 Bats are natural reservoirs of SARS-like coronaviruses                   177
#>  7 Coronavirus as a possible cause of severe acute respiratory syndrome     164
#>  8 Characterization of a novel coronavirus associated with severe acute …   149
#>  9 Severe acute respiratory syndrome coronavirus-like virus in Chinese h…   140
#> 10 Identification of a new human coronavirus                                137
#> # … with 417,853 more rows

We can use the widyr package to find which articles tend to be cited together by the same paper.

library(widyr)

filtered_citations <- cord19_paper_citations %>%
    add_count(title) %>%
    filter(n >= 25)

# What papers are often cited by the same paper?
filtered_citations %>%
    pairwise_cor(title, paper_id, sort = TRUE)
#> # A tibble: 244,530 x 3
#>    item1                            item2                           correlation
#>    <chr>                            <chr>                                 <dbl>
#>  1 Small molecule inhibitors revea… Ebola virus entry requires the…       0.776
#>  2 Ebola virus entry requires the … Small molecule inhibitors reve…       0.776
#>  3 VISA is an adapter protein requ… IPS-1, an adaptor triggering R…       0.765
#>  4 IPS-1, an adaptor triggering RI… VISA is an adapter protein req…       0.765
#>  5 Identification of a novel polyo… Identification of a third huma…       0.735
#>  6 Identification of a third human… Identification of a novel poly…       0.735
#>  7 The IFITM proteins mediate cell… Distinct patterns of IFITM-med…       0.727
#>  8 Distinct patterns of IFITM-medi… The IFITM proteins mediate cel…       0.727
#>  9 Cardif is an adaptor protein in… VISA is an adapter protein req…       0.698
#> 10 VISA is an adapter protein requ… Cardif is an adaptor protein i…       0.698
#> # … with 244,520 more rows
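
When no value column is supplied, pairwise_cor appears to treat each title as simply present or absent in each citing paper, so the correlation between two titles should correspond to the Pearson correlation of two 0/1 indicator vectors (the phi coefficient). Below is a minimal base R sketch of that calculation on a made-up citation table. toy_citations is hypothetical, and this illustrates the idea rather than widyr's actual implementation.

```r
# Toy citation table: which paper cites which title (hypothetical data)
toy_citations <- data.frame(
    paper_id = c("p1", "p1", "p2", "p2", "p3", "p4"),
    title    = c("A",  "B",  "A",  "B",  "C",  "A")
)

# Paper-by-title indicator matrix: 1 if the paper cites the title, else 0
incidence <- unclass(table(toy_citations$paper_id, toy_citations$title)) > 0
incidence <- incidence + 0

# Pearson correlation between the title columns
phi <- cor(incidence)
round(phi, 3)
```

In this toy table, A and B share two citing papers and come out positively correlated, while A and C never share a citing paper and come out negatively correlated.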