Adjustments to data and documentation
Documentation changes:

* Documented the cord19_paper_citations dataset
* Expanded the README
* Changed seealso to source

Data changes:

* Filtered out some of the citations that weren't to actual papers (e.g. "All rights reserved")
* Changed the section titles to title case

Other changes:

* Removed packages used only in data-raw from Suggests
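The citation filtering described above could be sketched roughly as follows. This is a hypothetical illustration only: the actual data-raw script is not part of this commit, and the `raw_citations` name and the exact patterns are assumptions.

``` r
# Hypothetical sketch of the citation filtering described in the commit
# message; `raw_citations` and the patterns below are assumptions, not
# the actual data-raw code.
library(dplyr)
library(stringr)

# Boilerplate strings that indicate a "citation" is not a real paper
non_paper_patterns <- regex("all rights reserved|^copyright",
                            ignore_case = TRUE)

cord19_paper_citations <- raw_citations %>%
  filter(!str_detect(title, non_paper_patterns))
```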
dgrtwo committed Mar 19, 2020
1 parent 7512b67 commit 4f68b3d
Showing 17 changed files with 255 additions and 90 deletions.
1 change: 1 addition & 0 deletions .Rbuildignore
@@ -2,3 +2,4 @@
^\.Rproj\.user$
^data-raw$
^README\.Rmd$
^README-cache
1 change: 1 addition & 0 deletions .gitignore
@@ -2,3 +2,4 @@
.Rhistory
.RData
.Ruserdata
README-cache/
19 changes: 9 additions & 10 deletions DESCRIPTION
@@ -4,20 +4,19 @@ Title: COVID-19 Open Research Dataset
Version: 0.0.0.9000
Authors@R: c(person("David", "Robinson", email = "admiral.david@gmail.com", role = c("aut", "cre")))
Maintainer: David Robinson <admiral.david@gmail.com>
Description: Shares the data from the COVID-19 Open Research Dataset Challenge
hosted by Kaggle, in a format easily analyzed within R. See here for more:
Description: Data from the COVID-19 Open Research Dataset Challenge
hosted by Kaggle, in a format easily analyzed within R. It includes datasets
of paper metadata, of the full text, and of the citations.
See here for more:
https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
License: file LICENSE
Encoding: UTF-8
LazyData: true
Suggests:
Depends:
R (>= 2.10)
Suggests:
dplyr,
purrr,
tidyr,
readr,
stringr,
janitor,
jsonlite,
tidytext,
usethis
widyr,
tidytext
RoxygenNote: 6.1.1
24 changes: 24 additions & 0 deletions R/paper_citations.R
@@ -0,0 +1,24 @@
#' Link papers to the full details of citations
#'
#' One observation for each combination of a paper and citation. Includes
#' only the ones in \code{\link{cord19_papers}} (thus, deduplicated and
#' filtered). Can be joined with \code{\link{cord19_paragraph_citations}} with
#' \code{paper_id} and \code{ref_id}, or with \code{cord19_papers} using
#' \code{paper_id}.
#'
#' @format A tibble with variables:
#' \describe{
#' \item{paper_id}{Unique identifier that can link to metadata and citations.
#' SHA of the paper PDF.}
#' \item{ref_id}{Reference ID, can be used to join to
#' \code{\link{cord19_paragraph_citations}}}
#' \item{venue}{Journal}
#' \item{volume}{Volume number}
#'   \item{issn}{International Standard Serial Number (ISSN) of the journal}
#' \item{pages}{Pages}
#' \item{year}{Year}
#' \item{doi}{Digital Object Identifier}
#' }
#'
#' @source \url{https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge}
"cord19_paper_citations"
11 changes: 5 additions & 6 deletions R/papers.R
@@ -3,8 +3,8 @@
#' Metadata such as titles, authors, journal, and publication IDs for each
#' paper in the CORD-19 dataset. This comes from the
#' \code{all_sources_metadata_DATE.csv} file in the decompressed dataset.
#' Note that duplicate papers (based on paper_id, doi, or title) have been
#' deduplicated, and papers without a paper_id or title have been removed.
#' Note that the papers have been deduplicated based on paper_id, doi, or
#' title, and papers without a paper_id or title have been removed.
#'
#' @format A tibble with one observation for each paper, and the following columns:
#' \describe{
@@ -33,15 +33,14 @@
#' cord19_papers %>%
#' count(journal, sort = TRUE)
#'
#' # What are the most common words in titles?
#' # What are the most common words in titles (or abstracts)?
#' library(tidytext)
#'
#' cord19_papers %>%
#' unnest_tokens(word, title) %>%
#' count(word, sort = TRUE) %>%
#' anti_join(stop_words, by = "word")
#'
#' # Could also look at abstracts
#'
#' @seealso \url{https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge}
#' @source \url{https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge},
#' specifically the \code{all_sources_metadata_DATE.csv} file.
"cord19_papers"
2 changes: 1 addition & 1 deletion R/paragraph_citations.R
@@ -18,5 +18,5 @@
#'     \code{\link{cord19_paper_citations}}.}
#' }
#'
#' @seealso \url{https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge}
#' @source \url{https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge}
"cord19_paragraph_citations"
5 changes: 3 additions & 2 deletions R/paragraphs.R
@@ -9,7 +9,8 @@
#' \item{paper_id}{Unique identifier that can link to metadata and citations.
#' SHA of the paper PDF.}
#' \item{paragraph}{Index of the paragraph within the paper (1, 2, 3)}
#' \item{section}{Section (e.g. Introduction, Results, Discussion)}
#' \item{section}{Section (e.g. Introduction, Results, Discussion). The
#' casing is standardized to title case.}
#' \item{text}{Full text}
#' }
#'
@@ -22,5 +23,5 @@
#' cord19_paragraphs %>%
#' count(section = str_to_lower(section), sort = TRUE)
#'
#' @seealso \url{https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge}
#' @source \url{https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge}
"cord19_paragraphs"
58 changes: 44 additions & 14 deletions README.Rmd
@@ -5,7 +5,9 @@ knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
out.width = "100%",
cache = TRUE,
cache.path = "README-cache/"
)
```

@@ -14,23 +16,24 @@
<!-- badges: start -->
<!-- badges: end -->

(WORK IN PROGRESS)

The cord19 package shares the [COVID-19 Open Research Dataset (CORD-19)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge#all_sources_metadata_2020-03-13.csv) in a tidy form that is easily analyzed within R.

## Installation

Install the package from GitHub here:
Install the package from GitHub as follows:

``` r
remotes::install_github("dgrtwo/cord19")
```

## Example
## Papers

The package turns the CORD-19 dataset into a set of tidy tables.

The package turns the CORD-19 dataset into a set of tidy tables. For example, the paper metadata is stored in `cord19_papers`:
For example, the paper metadata is stored in `cord19_papers`.

```{r example}
```{r cord19_papers}
library(dplyr)
library(cord19)
cord19_papers
@@ -40,32 +43,59 @@ cord19_papers %>%
count(journal, sort = TRUE)
```

Most usefully, it has the full text of the papers in `cord19_paragraphs`.
### Full text

Most usefully, `cord19_paragraphs` has the full text of the papers, with one observation for each paragraph.

```{r}
cord19_paragraphs
# What are the most common sections?
cord19_paragraphs %>%
count(section, sort = TRUE)
```

This allows for some mining with a package like tidytext.
This allows for some analysis with a package like tidytext.

```{r}
library(tidytext)
set.seed(2020)
# Sample 1000 random paragraphs
cord19_paragraphs %>%
sample_n(1000) %>%
# Sample 100 random papers
paper_words <- cord19_paragraphs %>%
filter(paper_id %in% sample(unique(paper_id), 100)) %>%
unnest_tokens(word, text) %>%
count(word, sort = TRUE) %>%
anti_join(stop_words, by = "word")
paper_words %>%
count(word, sort = TRUE)
```

### Citations

The package also includes the articles cited by each paper.

```{r}
# What are the most commonly cited articles?
cord19_paper_citations
```

What are the most commonly cited articles?

```{r}
cord19_paper_citations %>%
count(title, sort = TRUE)
```

We could use the [widyr](https://github.com/dgrtwo/widyr) package to find which papers are often cited *by* the same paper.

```{r}
library(widyr)
filtered_citations <- cord19_paper_citations %>%
add_count(title) %>%
filter(n >= 25)
# What papers are often cited by the same paper?
filtered_citations %>%
pairwise_cor(title, paper_id, sort = TRUE)
```
