Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert mdl units #57

Merged
merged 17 commits into from
Feb 19, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 4 additions & 3 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ Type: Package
Package: rems
Title: Get Data From British Columbia's Environmental Monitoring
System
Version: 0.6.1
Version: 0.6.1.9000
Authors@R:
c(person(given = "Andy",
family = "Teucher",
Expand All @@ -12,7 +12,7 @@ Authors@R:
role = "cph"))
Description: Download and parse data from B.C.'s Environmental
Monitoring System.
License: file LICENSE
License: Apache License (>= 2)
Depends:
R (>= 3.4)
Imports:
Expand All @@ -31,7 +31,8 @@ Imports:
tibble (>= 1.2),
utils,
xml2 (>= 1.0.0),
zip (>= 2.0.4)
zip (>= 2.0.4),
units
Suggests:
bit64,
devtools,
Expand Down
1 change: 1 addition & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ export(lt_lake_sites)
export(read_historic_data)
export(remove_data_cache)
export(save_ems_data)
export(standardize_mdl_units)
import(httr)
import(rappdirs)
import(readr)
Expand Down
4 changes: 4 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,7 @@
# rems (development version)

* Added new helper function `standardize_mdl_units()` to detect when `MDL_UNIT` and `UNIT` are different, and convert `METHOD_DETECTION_LIMIT` to the same unit as `RESULT` (and update the `MDL_UNIT` column accordingly). (https://github.com/bcgov/wqbc/issues/158, #57)

# rems 0.6.1

## Bug fixes
Expand Down
31 changes: 31 additions & 0 deletions R/onLoad.R
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,37 @@

._remsCache_ <- NULL

register_ems_units <- function() {

try_test <- try(
units::install_conversion_constant("S", "mho", 1),
silent = TRUE
)

if (!inherits(try_test, "try-error")) {
units::install_symbolic_unit("MPN")
units::install_conversion_constant("MPN", "CFU", 1)
units::install_symbolic_unit("NTU")
units::install_conversion_constant("NTU", "JTU", 1)
units::install_conversion_constant("US_liquid_gallon", "USG", 1)
units::install_conversion_constant("UK_liquid_gallon", "IG", 1)
# These petroleum measures are actually only ever used
# as volumes (1m3 = 0.001 E3m3 = 1e-6), but can't use E3m3
# to install the unit so have to do it as m, units takes
# care of the rest when it is m3 <=> E3m3 etc.
units::install_conversion_constant("t", "adt", 1)
units::install_conversion_constant("m", "E3m", .1)
units::install_conversion_constant("m", "E6m", .01)
units::install_conversion_constant("L", "E6L", 1e-6)
units::install_conversion_constant("UK_liquid_gallon", "E6IG", 1e-6)
#> set_units(set_units(1, "m3"), "E3m3")
# 0.001 [E3m3]
#> set_units(set_units(1, "m3"), "E6m3")
# 1e-06 [E6m3]
}
}

.onLoad <- function(libname, pkgname) {
write_cache()
register_ems_units()
}
105 changes: 105 additions & 0 deletions R/units.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
# Copyright 2021 Province of British Columbia
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and limitations under the License.

#' Standardize MDL
#'
#' There are many cases in EMS where the UNIT of the RESULT is
#' displayed differently than the METHOD_DETECTION_LIMIT.
#' This function uses the `units` package to convert the
#' `METHOD_DETECTION_LIMIT` values to the same unit as `RESULT`.
#'
#' @param data an ems data frame containing at least the
#' columns `"UNIT", "METHOD_DETECTION_LIMIT", "MDL_UNIT"`
#'
#' @return data frame with MDLs standardized to UNITs (where possible)
#' @importFrom rlang .data
#' @export
standardize_mdl_units <- function(data) {
if (!all(c("UNIT", "METHOD_DETECTION_LIMIT", "MDL_UNIT") %in% names(data))) {
stop("'data' must contain columns 'UNIT', 'METHOD_DETECTION_LIMIT', 'MDL_UNIT'")
}

if (!any(data[["UNIT"]] != data[["MDL_UNIT"]])) return(data)

data <- dplyr::group_by(data, .data$MDL_UNIT, .data$UNIT)

unique_units <- unique(data[, c("UNIT", "MDL_UNIT"), drop = FALSE])

are_convertible <- mapply(
# TODO: when new units is released, this function will be exported so can drop the :::
units:::ud_are_convertible,
unique_units$MDL_UNIT,
unique_units$UNIT,
USE.NAMES = FALSE
)

not_convertible <- unique_units[!are_convertible, , drop = FALSE]

if (nrow(not_convertible)) {
warning("There were ", nrow(not_convertible),
" unit combinations that were not convertible:\n * ",
paste(not_convertible$MDL_UNIT, "to",
not_convertible$UNIT, collapse = "\n * "),
call. = FALSE)
}

data <- dplyr::mutate(
data,
converted_val = convert_unit_values(.data$METHOD_DETECTION_LIMIT,
.data$MDL_UNIT[1],
.data$UNIT[1])
)

fixed <- !is.na(data[["converted_val"]])

message("Successfully converted units in ", sum(fixed), " rows.")

# update MDL and MDL_UNIT for those that were converted
# and remove the temporary converted_val column
data[["METHOD_DETECTION_LIMIT"]][fixed] <- data[["converted_val"]][fixed]
data[["MDL_UNIT"]][fixed] <- data[["UNIT"]][fixed]
data[["converted_val"]] <- NULL

data
}

convert_unit_values <- function(x, from, to) {
stopifnot(length(from) == 1)
stopifnot(length(to) == 1)

clean_to <- clean_unit(to)
clean_from <- clean_unit(from)

# only return a non-NA value for those that are converted
if (
any(is.na(c(clean_from, clean_to))) ||
clean_from == clean_to ||
# TODO: when new units is released, this function will be exported so can drop the :::
!units:::ud_are_convertible(clean_from, clean_to)
) {
return(NA_real_)
}

ret <- units::set_units(
units::set_units(x, clean_from, mode = "standard"),
clean_to, mode = "standard"
)

as.numeric(ret)
}

clean_unit <- function(x) {
x[x == "N/A"] <- NA_character_
# Remove trailing A, W, wet etc. as well as percent type (V/V, W/W, Mortality)
# Assuming they are not important in the unit conversion??
gsub("\\s*(W|wet|A|\\(?W/W\\)?|\\(?V/V\\)?|\\(?Mortality\\)?)\\s*$", "", x)
}
20 changes: 18 additions & 2 deletions README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -132,10 +132,26 @@ You can combine the previously imported historic and two_year data sets using `b

```{r}
all_data <- bind_ems_data(filtered_2yr, filtered_historic)
nrow(all_data)
head(all_data)
```

For more advanced filtering, selecting, and summarizing, I recommend using the `dplyr` package.
## Units

There are many cases in EMS data where the unit of the `RESULT` (in the `UNIT` column) is different from that of `METHOD_DETECTION_LIMIT` (`MDL_UNIT` column). The `standardize_mdl_units()` function converts the `METHOD_DETECTION_LIMIT` values to the same unit as `RESULT`, and updates the `MDL_UNIT` column accordingly:

```{r}
# look at data with mismatched units:
filter(all_data, UNIT != MDL_UNIT) %>%
select(RESULT, UNIT, METHOD_DETECTION_LIMIT, MDL_UNIT) %>%
head()

all_data <- standardize_mdl_units(all_data)

# Check again
filter(all_data, UNIT != MDL_UNIT) %>%
select(RESULT, UNIT, METHOD_DETECTION_LIMIT, MDL_UNIT) %>%
head()
```

Then you can plot your data with ggplot2:

Expand Down
79 changes: 63 additions & 16 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,17 +1,14 @@

<!-- README.md is generated from README.Rmd. Please edit that file -->

# rems 0.6.0
# rems 0.6.1.9000

<!-- badges: start -->

[![Codecov test
coverage](https://codecov.io/gh/bcgov/rems/branch/master/graph/badge.svg)](https://codecov.io/gh/bcgov/rems?branch=master)
[![Codecov test coverage](https://codecov.io/gh/bcgov/rems/branch/master/graph/badge.svg)](https://codecov.io/gh/bcgov/rems?branch=master)
[![License](https://img.shields.io/badge/License-Apache2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![CRAN
status](https://www.r-pkg.org/badges/version/rems)](https://cran.r-project.org/package=rems)
[![R build
status](https://github.com/bcgov/rems/workflows/R-CMD-check/badge.svg)](https://github.com/bcgov/rems/actions)
[![CRAN status](https://www.r-pkg.org/badges/version/rems)](https://cran.r-project.org/package=rems)
[![R build status](https://github.com/bcgov/rems/workflows/R-CMD-check/badge.svg)](https://github.com/bcgov/rems/actions)
[![img](https://img.shields.io/badge/Lifecycle-Maturing-007EC6)](https://github.com/bcgov/repomountie/blob/master/doc/lifecycle-badges.md)
<!-- badges: end -->

Expand Down Expand Up @@ -53,12 +50,9 @@ data):
``` r
library(rems)
two_year <- get_ems_data(which = "2yr", ask = FALSE)
#> Downloading latest '2yr' EMS data from BC Data Catalogue (url: https://pub.data.gov.bc.ca/datasets/949f2233-9612-4b06-92a9-903e817da659/ems_sample_results_current_expanded.csv)
#> Reading data from file...
#> Caching data on disk...
#> Loading data...
#> Fetching data from cache...
nrow(two_year)
#> [1] 874045
#> [1] 951551
head(two_year)
#> # A tibble: 6 x 23
#> EMS_ID MONITORING_LOCA… LATITUDE LONGITUDE LOCATION_TYPE COLLECTION_START
Expand Down Expand Up @@ -167,16 +161,69 @@ glimpse(filtered_historic2)
#> $ EMS_ID <chr> "0121580", "0121580", "0121580", "0121580", "0121580…
#> $ PARAMETER <chr> "Turbidity", "Aluminum Total", "Turbidity", "Turbidi…
#> $ COLLECTION_START <dttm> 2010-04-12 08:50:00, 2010-08-23 09:12:00, 2016-08-0…
#> $ RESULT <dbl> 0.900000, 0.013800, 0.320000, 0.620000, 1.600000, 0.…
#> $ RESULT <dbl> 0.900000, 0.013800, 0.320000, 0.300000, 0.700000, 0.…
```

You can combine the previously imported historic and two\_year data sets
using `bind_ems_data`:

``` r
all_data <- bind_ems_data(filtered_2yr, filtered_historic)
nrow(all_data)
#> [1] 2481
head(all_data)
#> # A tibble: 6 x 23
#> EMS_ID MONITORING_LOCA… LATITUDE LONGITUDE LOCATION_TYPE COLLECTION_START
#> <chr> <chr> <dbl> <dbl> <chr> <dttm>
#> 1 01215… ENGLISHMAN RIVE… 49.3 -124. RIVER,STREAM… 2010-04-12 08:50:00
#> 2 01215… ENGLISHMAN RIVE… 49.3 -124. RIVER,STREAM… 2010-08-23 09:12:00
#> 3 01215… ENGLISHMAN RIVE… 49.3 -124. RIVER,STREAM… 2010-07-20 09:47:00
#> 4 01215… ENGLISHMAN RIVE… 49.3 -124. RIVER,STREAM… 2009-09-29 09:25:00
#> 5 01215… ENGLISHMAN RIVE… 49.3 -124. RIVER,STREAM… 2003-02-06 13:30:00
#> 6 01215… ENGLISHMAN RIVE… 49.3 -124. RIVER,STREAM… 2007-11-20 09:15:00
#> # … with 17 more variables: LOCATION_PURPOSE <chr>, PERMIT <chr>,
#> # SAMPLE_CLASS <chr>, SAMPLE_STATE <chr>, SAMPLE_DESCRIPTOR <chr>,
#> # PARAMETER_CODE <chr>, PARAMETER <chr>, ANALYTICAL_METHOD_CODE <chr>,
#> # ANALYTICAL_METHOD <chr>, RESULT_LETTER <chr>, RESULT <dbl>, UNIT <chr>,
#> # METHOD_DETECTION_LIMIT <dbl>, MDL_UNIT <chr>, QA_INDEX_CODE <chr>,
#> # UPPER_DEPTH <dbl>, LOWER_DEPTH <dbl>
```

## Units

There are many cases in EMS data where the unit of the `RESULT` (in the
`UNIT` column) is different from that of `METHOD_DETECTION_LIMIT`
(`MDL_UNIT` column). The `standardize_mdl_units()` function converts the
`METHOD_DETECTION_LIMIT` values to the same unit as `RESULT`, and
updates the `MDL_UNIT` column accordingly:

``` r
# look at data with mismatched units:
filter(all_data, UNIT != MDL_UNIT) %>%
select(RESULT, UNIT, METHOD_DETECTION_LIMIT, MDL_UNIT) %>%
head()
#> # A tibble: 6 x 4
#> RESULT UNIT METHOD_DETECTION_LIMIT MDL_UNIT
#> <dbl> <chr> <dbl> <chr>
#> 1 0.0138 mg/L 0.5 ug/L
#> 2 0.000002 mg/L 0.001 ug/L
#> 3 0.00074 mg/L 0.02 ug/L
#> 4 0.0947 mg/L 0.2 ug/L
#> 5 0.00072 mg/L 0.02 ug/L
#> 6 0.00068 mg/L 0.02 ug/L

all_data <- standardize_mdl_units(all_data)
#> Successfully converted units in 1629 rows.

# Check again
filter(all_data, UNIT != MDL_UNIT) %>%
select(RESULT, UNIT, METHOD_DETECTION_LIMIT, MDL_UNIT) %>%
head()
#> # A tibble: 3 x 4
#> # Groups: MDL_UNIT, UNIT [1]
#> RESULT UNIT METHOD_DETECTION_LIMIT MDL_UNIT
#> <dbl> <chr> <dbl> <chr>
#> 1 0.00065 mg/L NA ug/L
#> 2 0.000005 mg/L NA ug/L
#> 3 0.0122 mg/L NA ug/L
```

For more advanced filtering, selecting, and summarizing, I recommend
Expand All @@ -192,7 +239,7 @@ ggplot(all_data, aes(x = COLLECTION_START, y = RESULT)) +
facet_grid(PARAMETER ~ EMS_ID, scales = "free_y")
```

![](fig/README-unnamed-chunk-11-1.png)<!-- -->
![](fig/README-unnamed-chunk-12-1.png)<!-- -->

When you are finished querying the historic database, you should close
the database connection using `disconnect_historic_db()`:
Expand Down
Binary file modified fig/README-unnamed-chunk-12-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
21 changes: 21 additions & 0 deletions man/standardize_mdl_units.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading