Adjustments to data and documentation
Documentation changes:

* Documented the cord19_paper_citations dataset
* Expanded the README
* Changed seealso to source

Data changes:

* Filtered out some of the citations that weren't to actual papers (e.g. "All rights reserved")
* Changed the section titles to title case

Other changes:

* Removed packages used only in data-raw from Suggests
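The citation filtering described above could be sketched roughly as follows. This is a hypothetical illustration only: the actual data-raw script is not part of this commit, and the `raw_citations` name and the exact patterns are assumptions.

``` r
# Hypothetical sketch of the citation filtering described in the commit
# message; `raw_citations` and the patterns below are assumptions, not
# the actual data-raw code.
library(dplyr)
library(stringr)

# Boilerplate strings that indicate a "citation" is not a real paper
non_paper_patterns <- regex("all rights reserved|^copyright",
                            ignore_case = TRUE)

cord19_paper_citations <- raw_citations %>%
  filter(!str_detect(title, non_paper_patterns))
```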
dgrtwo committed Mar 19, 2020
1 parent 7512b67 commit 4f68b3d
Showing 17 changed files with 255 additions and 90 deletions.
1 change: 1 addition & 0 deletions .Rbuildignore
@@ -2,3 +2,4 @@
^\.Rproj\.user$
^data-raw$
^README\.Rmd$
^README-cache
1 change: 1 addition & 0 deletions .gitignore
@@ -2,3 +2,4 @@
.Rhistory
.RData
.Ruserdata
README-cache/
19 changes: 9 additions & 10 deletions DESCRIPTION
@@ -4,20 +4,19 @@ Title: COVID-19 Open Research Dataset
Version: 0.0.0.9000
Authors@R: c(person("David", "Robinson", email = "admiral.david@gmail.com", role = c("aut", "cre")))
Maintainer: David Robinson <admiral.david@gmail.com>
Description: Shares the data from the COVID-19 Open Research Dataset Challenge
hosted by Kaggle, in a format easily analyzed within R. See here for more:
Description: Data from the COVID-19 Open Research Dataset Challenge
hosted by Kaggle, in a format easily analyzed within R. It includes datasets
of paper metadata, of the full text, and of the citations.
See here for more:
https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
License: file LICENSE
Encoding: UTF-8
LazyData: true
Suggests:
Depends:
R (>= 2.10)
Suggests:
dplyr,
purrr,
tidyr,
readr,
stringr,
janitor,
jsonlite,
tidytext,
usethis
widyr,
tidytext
RoxygenNote: 6.1.1
24 changes: 24 additions & 0 deletions R/paper_citations.R
@@ -0,0 +1,24 @@
#' Link papers to the full details of citations
#'
#' One observation for each combination of a paper and citation. Includes
#' only the ones in \code{\link{cord19_papers}} (thus, deduplicated and
#' filtered). Can be joined with \code{\link{cord19_paragraph_citations}} with
#' \code{paper_id} and \code{ref_id}, or with \code{cord19_papers} using
#' \code{paper_id}.
#'
#' @format A tibble with variables:
#' \describe{
#' \item{paper_id}{Unique identifier that can link to metadata and citations.
#' SHA of the paper PDF.}
#' \item{ref_id}{Reference ID, can be used to join to
#' \code{\link{cord19_paragraph_citations}}}
#' \item{venue}{Journal}
#' \item{volume}{Volume number}
#'   \item{issn}{International Standard Serial Number (ISSN) of the journal}
#' \item{pages}{Pages}
#' \item{year}{Year}
#' \item{doi}{Digital Object Identifier}
#' }
#'
#' @source \url{https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge}
"cord19_paper_citations"
11 changes: 5 additions & 6 deletions R/papers.R
@@ -3,8 +3,8 @@
#' Metadata such as titles, authors, journal, and publication IDs for each
#' paper in the CORD-19 dataset. This comes from the
#' \code{all_sources_metadata_DATE.csv} file in the decompressed dataset.
#' Note that duplicate papers (based on paper_id, doi, or title) have been
#' deduplicated, and papers without a paper_id or title have been removed.
#' Note that the papers have been deduplicated based on paper_id, doi, or
#' title, and papers without a paper_id or title have been removed.
#'
#' @format A tibble with one observation for each paper, and the following columns:
#' \describe{
@@ -33,15 +33,14 @@
#' cord19_papers %>%
#' count(journal, sort = TRUE)
#'
#' # What are the most common words in titles?
#' # What are the most common words in titles (or abstracts)?
#' library(tidytext)
#'
#' cord19_papers %>%
#' unnest_tokens(word, title) %>%
#' count(word, sort = TRUE) %>%
#' anti_join(stop_words, by = "word")
#'
#' # Could also look at abstracts
#'
#' @seealso \url{https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge}
#' @source \url{https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge},
#' specifically the \code{all_sources_metadata_DATE.csv} file.
"cord19_papers"
2 changes: 1 addition & 1 deletion R/paragraph_citations.R
@@ -18,5 +18,5 @@
#'     \code{\link{cord19_paper_citations}}.}
#' }
#'
#' @seealso \url{https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge}
#' @source \url{https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge}
"cord19_paragraph_citations"
5 changes: 3 additions & 2 deletions R/paragraphs.R
@@ -9,7 +9,8 @@
#' \item{paper_id}{Unique identifier that can link to metadata and citations.
#' SHA of the paper PDF.}
#' \item{paragraph}{Index of the paragraph within the paper (1, 2, 3)}
#' \item{section}{Section (e.g. Introduction, Results, Discussion)}
#' \item{section}{Section (e.g. Introduction, Results, Discussion). The
#' casing is standardized to title case.}
#' \item{text}{Full text}
#' }
#'
@@ -22,5 +23,5 @@
#' cord19_paragraphs %>%
#' count(section = str_to_lower(section), sort = TRUE)
#'
#' @seealso \url{https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge}
#' @source \url{https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge}
"cord19_paragraphs"
58 changes: 44 additions & 14 deletions README.Rmd
@@ -5,7 +5,9 @@ knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
out.width = "100%",
cache = TRUE,
cache.path = "README-cache/"
)
```

@@ -14,23 +16,24 @@
<!-- badges: start -->
<!-- badges: end -->

(WORK IN PROGRESS)

The cord19 package shares the [COVID-19 Open Research Dataset (CORD-19)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge#all_sources_metadata_2020-03-13.csv) in a tidy form that is easily analyzed within R.

## Installation

Install the package from GitHub here:
Install the package from GitHub as follows:

``` r
remotes::install_github("dgrtwo/cord19")
```

## Example
## Papers

The package turns the CORD-19 dataset into a set of tidy tables.

The package turns the CORD-19 dataset into a set of tidy tables. For example, the paper metadata is stored in `cord19_papers`:
For example, the paper metadata is stored in `cord19_papers`.

```{r example}
```{r cord19_papers}
library(dplyr)
library(cord19)
cord19_papers
@@ -40,32 +43,59 @@ cord19_papers %>%
count(journal, sort = TRUE)
```

Most usefully, it has the full text of the papers in `cord19_paragraphs`.
### Full text

Most usefully, `cord19_paragraphs` has the full text of the papers, with one observation for each paragraph.

```{r}
cord19_paragraphs
# What are the most common sections?
cord19_paragraphs %>%
count(section, sort = TRUE)
```

This allows for some mining with a package like tidytext.
This allows for some analysis with a package like tidytext.

```{r}
library(tidytext)
set.seed(2020)
# Sample 1000 random paragraphs
cord19_paragraphs %>%
sample_n(1000) %>%
# Sample 100 random papers
paper_words <- cord19_paragraphs %>%
filter(paper_id %in% sample(unique(paper_id), 100)) %>%
unnest_tokens(word, text) %>%
count(word, sort = TRUE) %>%
anti_join(stop_words, by = "word")
paper_words %>%
count(word, sort = TRUE)
```

### Citations

The package also includes the articles cited by each paper.

```{r}
# What are the most commonly cited articles?
cord19_paper_citations
```

What are the most commonly cited articles?

```{r}
cord19_paper_citations %>%
count(title, sort = TRUE)
```

We could use the [widyr](https://github.com/dgrtwo/widyr) package to find which papers are often cited *by* the same paper.

```{r}
library(widyr)
filtered_citations <- cord19_paper_citations %>%
add_count(title) %>%
filter(n >= 25)
# What papers are often cited by the same paper?
filtered_citations %>%
pairwise_cor(title, paper_id, sort = TRUE)
```
