Merge pull request #13 from ecohealthalliance/release/0.2.0
Release/0.2.0
deanmarchiori committed Apr 17, 2024
2 parents 82bd908 + 9768e80 commit ccb4962
Showing 16 changed files with 616 additions and 20 deletions.
3 changes: 3 additions & 0 deletions .Rbuildignore
@@ -3,3 +3,6 @@
^README\.Rmd$
^LICENSE\.md$
^\.github$
^_pkgdown\.yml$
^docs$
^pkgdown$
48 changes: 48 additions & 0 deletions .github/workflows/pkgdown.yaml
@@ -0,0 +1,48 @@
# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
# Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
on:
  push:
    branches: [main, master]
  pull_request:
    branches: [main, master]
  release:
    types: [published]
  workflow_dispatch:

name: pkgdown

jobs:
  pkgdown:
    runs-on: ubuntu-latest
    # Only restrict concurrency for non-PR jobs
    concurrency:
      group: pkgdown-${{ github.event_name != 'pull_request' || github.run_id }}
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
    permissions:
      contents: write
    steps:
      - uses: actions/checkout@v4

      - uses: r-lib/actions/setup-pandoc@v2

      - uses: r-lib/actions/setup-r@v2
        with:
          use-public-rspm: true

      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          extra-packages: any::pkgdown, local::.
          needs: website

      - name: Build site
        run: pkgdown::build_site_github_pages(new_process = FALSE, install = FALSE)
        shell: Rscript {0}

      - name: Deploy to GitHub pages 🚀
        if: github.event_name != 'pull_request'
        uses: JamesIves/github-pages-deploy-action@v4.5.0
        with:
          clean: false
          branch: gh-pages
          folder: docs
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@
.RData
.Ruserdata
inst/doc
docs
3 changes: 2 additions & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Package: ohcleandat
Type: Package
Title: One Health Data Cleaning and Quality Checking Package
Version: 0.1.1
Version: 0.2.0
Authors@R: c(
person("Collin", "Schwantes", email = "schwantes@ecohealthalliance.org", role = c("cre", "aut"), comment = c(ORCID = "0000-0003-4014-4896")),
person("Johana", "Teigen", email = "teigen@ecohealthalliance.org", role = "aut", comment = c(ORCID = "0000-0002-6209-2321")),
@@ -42,3 +42,4 @@ Remotes:
ecohealthalliance/containerTemplateUtils,
fcampelo/rdrop2,
ropensci/ruODK
URL: https://ecohealthalliance.github.io/ohcleandat/
8 changes: 8 additions & 0 deletions README.Rmd
@@ -29,3 +29,11 @@ You can install the development version of ohcleandat from [GitHub](https://gith
# install.packages("devtools")
devtools::install_github("ecohealthalliance/ohcleandat")
```

## Getting Started

For help guides, check out the package vignettes.


## Getting Help
If you encounter a clear bug, please file a minimal reproducible example on [github](https://github.com/ecohealthalliance/ohcleandat/issues).
9 changes: 9 additions & 0 deletions README.md
@@ -20,3 +20,12 @@ You can install the development version of ohcleandat from
# install.packages("devtools")
devtools::install_github("ecohealthalliance/ohcleandat")
```

## Getting Started

For help guides, check out the package vignettes.

## Getting Help

If you encounter a clear bug, please file a minimal reproducible example
on [github](https://github.com/ecohealthalliance/ohcleandat/issues).
4 changes: 4 additions & 0 deletions _pkgdown.yml
@@ -0,0 +1,4 @@
url: https://ecohealthalliance.github.io/ohcleandat/
template:
  bootstrap: 5

54 changes: 54 additions & 0 deletions vignettes/idcheck.Rmd
@@ -0,0 +1,54 @@
---
title: "ID Correction and Autobot"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{ID Correction and Autobot}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(ohcleandat)
```

The data cleaning and validation pipeline provides a way to specify rules that are applied to data in order to produce a validation log for manual corrections. However, in some cases, particularly with ID columns, certain corrections can be made automatically because the errors are purely formatting errors.

Examples include missing prefixes, incorrect case, or non-standard formatting in columns that should follow a predictable, fixed format. Here we want an automated cleaning step that makes these corrections to the data but still produces a validation log for our records. This is done in two steps.

The first step applies the automatic corrections through an `id_check()` function (or a family of checking functions). These operate on the semi-clean data set to produce a new proposed column containing the automated corrections. Such functions are designed and implemented by users based on their own requirements.
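
For illustration, a checker for the `farm_id` format used in the example further below might look like the following sketch. The regular expression and the character substitutions are assumptions made up for this example, not functions shipped with ohcleandat; real checkers encode whatever rules your identifiers require.

```{r eval=FALSE}
library(dplyr)
library(stringr)

# Hypothetical user-written checker: propose a corrected farm_id_new column
id_check_farm <- function(data) {
  data |>
    mutate(
      farm_id_new = farm_id |>
        str_to_upper() |>              # fix letter case
        str_replace_all("O", "0"),     # common O/0 transcription error
      # anything still not matching the expected pattern is set to NA
      # so it can be flagged for manual review
      farm_id_new = if_else(
        str_detect(farm_id_new, "^\\d{3}[A-Z]{3}\\d{3}$"),
        farm_id_new,
        NA_character_
      )
    )
}
```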

Once these corrections are made, both the original ID column and the new corrected ID column are passed to the `autobot()` function in the pipeline. The `autobot()` function compares the two columns and keeps only the records where the original and corrected values differ, indicating that some form of automatic correction has been made.

A validation log is generated in exactly the same format as the other validation logs; the key difference is that this log does not require manual review by a human reviewer. The proposed changes are accepted automatically by the autobot. The log is still produced so that the changes persist and there is a record of how IDs were altered by the automatic corrections.

## Example

Below is an example of a (fake) `farm_id` identifier. The ID checker functions have corrected an 'O' to '0' in record 2 and fixed the letter case in record 3; records that still do not conform to the required pattern after corrections are set to NA for manual review.

```
# A tibble: 6 × 2
farm_id farm_id_new
<chr> <chr>
1 123ABC0007 NA
2 1O3ABC010 103ABC010
3 143abc010 143ABC010
4 13DEFH005 NA
5 243DLF803 243DLF803
6 243DPF911 243DPF911
```

```
> ohcleandat::autobot(data = test, old_col = "farm_id", new_col = "farm_id_new", key = "farm_id")
# A tibble: 2 × 8
entry field issue old_value is_valid new_val user_initials comments
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1O3ABC010 farm_id Automated field format check failed 1O3ABC010 FALSE 103ABC0… autobot ""
2 143abc010 farm_id Automated field format check failed 143abc010 FALSE 143ABC0… autobot ""
```
Binary file added vignettes/img/erd.png
Binary file added vignettes/img/html.png
Binary file added vignettes/img/pipeline.png
Binary file added vignettes/img/targets.png
62 changes: 62 additions & 0 deletions vignettes/integration.Rmd
@@ -0,0 +1,62 @@
---
title: "Integrating Different Datasets"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Integrating Different Datasets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(ohcleandat)
```

## Integrating Data Sets

Once an individual data set has passed through the cleaning and validation process, it may need to be combined (or joined) with other data sets. This process is handled by the {targets} pipeline during the *integration* steps. There are two phases: the first performs checks and the second performs the join operation.

The checks test that the records in both data sets are compatible and match the expected relationship between the two data sets. The integration itself is then performed with an SQL-style join operation, typically a left join, right join, inner join or full join.

The type of join operation depends on the relationship between the data sets. The critical pieces of information are the primary key (unique identifier) of the base table and the foreign key of the table to be joined, which is the attribute that should match the primary key. The cardinality of the relationship is also important for understanding the expected result of the join. Below is an entity relationship diagram showing the relationship between two example data sets. Crow's foot notation is used to illustrate that there is an optional 1:many relationship from the left table to the right table, and a mandatory 1:1 relationship from the right table to the left table.

![](img/erd.png)

Below is the relevant target that performs the join operation and integrates these data sets.

```{r eval=FALSE}
tar_target(integrated_mosq_field,
left_join(
x = fs_mosquito_field_semiclean,
y = longitudinal_identification_semiclean,
by = c("Batch_ID" = "batch_id")
)
)
```

It is critical that the data validation steps are performed correctly for the integration of multiple data sets to succeed. Where there are missing, malformed or duplicate primary keys, the expectations about the relationship type will not hold.
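
As a rough sketch of what such checks might look like before the join above runs, one could test key uniqueness and foreign key coverage directly. The key columns below mirror the example target shown earlier, and these ad hoc checks are illustrative only, not helpers provided by ohcleandat.

```{r eval=FALSE}
library(dplyr)

# Primary key of the base table should be non-missing and unique
stopifnot(
  !any(is.na(fs_mosquito_field_semiclean$Batch_ID)),
  !any(duplicated(fs_mosquito_field_semiclean$Batch_ID))
)

# Rows whose foreign key has no matching primary key ("orphans");
# for a mandatory relationship we expect zero of these
orphans <- anti_join(
  longitudinal_identification_semiclean,
  fs_mosquito_field_semiclean,
  by = c("batch_id" = "Batch_ID")
)
nrow(orphans)
```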

## Types of Data

Throughout the data cleaning pipeline, we take in raw data and convert it to some form of clean data. There are several intermediate steps, and a standard terminology has been adopted to describe them.

- **raw data**: data read in directly from the source systems.
- **combined data**: if the raw data are spread across multiple files or data frames, the compatible, united data set built from them is termed 'combined'.
- **semi-clean**: data are semi-clean once they have been corrected using the values provided in the validation log.
- **integrated**: data are integrated when they have been joined to other data sets.
- **clean**: data are clean when they are integrated and records still pending validation in the logs have been removed, leaving only a clean subset of validated data.

## Tips for data management

Over the course of a long data collection exercise, standards and formats can diverge. This makes the data cleaning steps more difficult and slows down the integration of data as described above. Some general strategies can help mitigate these risks:

- Design and enforce a meaningful, immutable primary key or unique identifier for each data set.
- Think about storing data in a 'tidy' format where possible (a short sketch follows this list). See here: <https://www.jstatsoft.org/article/view/v059i10>
- Store raw data in a machine readable format (e.g. CSV).
- Set some metadata standards at the start of the project for columns and data types. These might change over time, but having standards in place helps plan how to accommodate changes without breaking existing work.
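
As a small illustration of the 'tidy' format point above, reshaping wide, visit-per-column data into one row per observation might look like this (the column names are made up for the example):

```{r eval=FALSE}
library(tidyr)
library(tibble)

# Wide layout: one column per sampling visit
wide <- tibble(
  site_id = c("S1", "S2"),
  visit_1 = c(10, 3),
  visit_2 = c(7, 5)
)

# Tidy layout: one row per site per visit
long <- pivot_longer(
  wide,
  cols = starts_with("visit_"),
  names_to = "visit",
  values_to = "count"
)
```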