more details on built in a_* and i_* functions

bluefoxr · Nov 27, 2023 · 79a8c2f
1 parent 4ffda78

Showing 13 changed files with 47 additions and 8 deletions.
diff --git a/R/aggregate.R b/R/aggregate.R
@@ -66,7 +66,7 @@ Aggregate.purse <- function(x, dset, f_ag = NULL, w = NULL, f_ag_para = NULL, da
 #'
 #' Note that COINr has a number of aggregation functions built in,
 #' all of which are of the form `a_*()`, e.g. [a_amean()], [a_gmean()] and friends. To see a list browse COINr functions alphabetically or
-#' type `a_` in the R Studio console and press the tab key (after loading COINr).
+#' type `a_` in the R Studio console and press the tab key (after loading COINr), or see the [online documentation](https://bluefoxr.github.io/COINr/articles/aggregate.html#coinr-aggregation-functions).
 #'
 #' Optionally, a data availability threshold can be assigned below which the aggregated value will return
 #' `NA` (see `dat_thresh` argument). If `by_df = TRUE`, this will however be ignored because aggregation is not
@@ -313,7 +313,7 @@ Aggregate.coin <- function(x, dset, f_ag = NULL, w = NULL, f_ag_para = NULL, dat
 #'
 #' Note that COINr has a number of aggregation functions built in,
 #' all of which are of the form `a_*()`, e.g. [a_amean()], [a_gmean()] and friends. To see a list browse COINr functions alphabetically or
-#' type `a_` in the R Studio console and press the tab key (after loading COINr).
+#' type `a_` in the R Studio console and press the tab key (after loading COINr), or see the [online documentation](https://bluefoxr.github.io/COINr/articles/aggregate.html#coinr-aggregation-functions).
 #'
 #' Optionally, a data availability threshold can be assigned below which the aggregated value will return
 #' `NA` (see `dat_thresh` argument). If `by_df = TRUE`, this will however be ignored because aggregation is not

diff --git a/R/impute.R b/R/impute.R
@@ -135,6 +135,9 @@ Impute.purse <- function(x, dset, f_i = NULL, f_i_para = NULL, impute_by = "colu
 #' it except for `NA` values, which can be replaced. The function `f_i` is not required to replace *all* `NA`
 #' values.
 #'
+#' COINr has several built-in imputation functions of the form `i_*()` for vectors which can be called by [Impute()]. See the
+#' [online documentation](https://bluefoxr.github.io/COINr/articles/imputation.html#data-frames) for more details.
+#'
 #' When imputing row-wise, prior normalisation of the data is recommended. This is because imputation
 #' will use e.g. the mean of the unit values over all indicators (columns). If the indicators are on
 #' very different scales, the result will likely make no sense. If the indicators are normalised first,
@@ -332,6 +335,9 @@ Impute.coin <- function(x, dset, f_i = NULL, f_i_para = NULL, impute_by = "colum
 #' it except for `NA` values, which can be replaced. The function `f_i` is not required to replace *all* `NA`
 #' values.
 #'
+#' COINr has several built-in imputation functions of the form `i_*()` for vectors which can be called by [Impute()]. See the
+#' [online documentation](https://bluefoxr.github.io/COINr/articles/imputation.html#data-frames) for more details.
+#'
 #' When imputing row-wise, prior normalisation of the data is recommended. This is because imputation
 #' will use e.g. the mean of the unit values over all indicators (columns). If the indicators are on
 #' very different scales, the result will likely make no sense. If the indicators are normalised first,
@@ -566,6 +572,9 @@ Impute.data.frame <- function(x, f_i = NULL, f_i_para = NULL, impute_by = "colum
 #' values found in `x`. By default, `f_i = "i_mean()"`, which simply imputes `NA`s with the mean of the
 #' non-`NA` values in `x`.
 #'
+#' COINr has several built-in imputation functions of the form `i_*()` for vectors which can be called by [Impute()]. See the
+#' [online documentation](https://bluefoxr.github.io/COINr/articles/imputation.html#data-frames) for more details.
+#'
 #' You could also use one of the imputation functions directly (such as [i_mean()]). However, this
 #' function offers a few extra advantages, such as checking the input and output formats, and making
 #' sure the resulting imputed vector agrees with the input. It will also skip imputation entirely if

diff --git a/R/normalise.R b/R/normalise.R
@@ -109,7 +109,8 @@ Normalise.purse <- function(x, dset, global_specs = NULL, indiv_specs = NULL,
 #' normalised with individual specifications using the `indiv_specs` argument. If indicators should have their
 #' directions reversed, this can be specified using the `directions` argument. Non-numeric columns are ignored
 #' automatically by this function. By default, this function normalises each indicator using the "min-max" method, scaling indicators to lie between
-#' 0 and 100. This calls the [n_minmax()] function. Note, all COINr normalisation functions are of the form `n_*()`.
+#' 0 and 100. This calls the [n_minmax()] function. COINr has a number of built-in normalisation functions of the form `n_*()`. See [online documentation](https://bluefoxr.github.io/COINr/articles/normalise.html#built-in-normalisation-functions)
+#' for details.
 #'
 #' ## Global specification
 #'
@@ -251,7 +252,8 @@ Normalise.coin <- function(x, dset, global_specs = NULL, indiv_specs = NULL,
 #' normalised with individual specifications using the `indiv_specs` argument. If variables should have their
 #' directions reversed, this can be specified using the `directions` argument. Non-numeric columns are ignored
 #' automatically by this function. By default, this function normalises each indicator using the "min-max" method, scaling indicators to lie between
-#' 0 and 100. This calls the [n_minmax()] function. Note, all COINr normalisation functions are of the form `n_*()`.
+#' 0 and 100. This calls the [n_minmax()] function. COINr has a number of built-in normalisation functions of the form `n_*()`. See [online documentation](https://bluefoxr.github.io/COINr/articles/normalise.html#built-in-normalisation-functions)
+#' for details.
 #'
 #' ## Global specification
 #'
@@ -401,6 +403,9 @@ Normalise.data.frame <- function(x, global_specs = NULL, indiv_specs = NULL,
 #' first argument is `x`, a numeric vector, and it returns a numeric vector of the same length. See [n_minmax()]
 #' for an example.
 #'
+#' COINr has a number of built-in normalisation functions of the form `n_*()`. See [online documentation](https://bluefoxr.github.io/COINr/articles/normalise.html#built-in-normalisation-functions)
+#' for details.
+#'
 #' `f_n_para` is *required* to be a named list. So e.g. if we define a function `f1(x, arg1, arg2)` then we should
 #' specify `f_n = "f1"`, and `f_n_para = list(arg1 = val1, arg2 = val2)`, where `val1` and `val2` are the
 #' values assigned to the arguments `arg1` and `arg2` respectively.

diff --git a/man/Aggregate.coin.Rd b/man/Aggregate.coin.Rd
diff --git a/man/Aggregate.data.frame.Rd b/man/Aggregate.data.frame.Rd
diff --git a/man/Impute.coin.Rd b/man/Impute.coin.Rd
diff --git a/man/Impute.data.frame.Rd b/man/Impute.data.frame.Rd
diff --git a/man/Impute.numeric.Rd b/man/Impute.numeric.Rd
diff --git a/man/Normalise.coin.Rd b/man/Normalise.coin.Rd
diff --git a/man/Normalise.data.frame.Rd b/man/Normalise.data.frame.Rd
diff --git a/man/Normalise.numeric.Rd b/man/Normalise.numeric.Rd
diff --git a/vignettes/aggregate.Rmd b/vignettes/aggregate.Rmd
@@ -136,6 +136,8 @@ Let's now explore some of the options of the `Aggregate()` function. Like other
 * `a_hmean()`: the weighted harmonic mean
 * `a_copeland()`: the Copeland method (note: requires `by_df = TRUE`)
 
+For details of these methods, see [Approaches] above and the function documentation of each of the functions listed.
+
 By default, the arithmetic mean is called but we can easily change this to the geometric mean, for example. However here we run into a problem: the geometric mean will fail if any values to aggregate are less than or equal to zero. So to use the geometric mean we have to re-do the normalisation step to avoid this. Luckily this is straightforward in COINr:
 
 ```{r}

diff --git a/vignettes/imputation.Rmd b/vignettes/imputation.Rmd
@@ -94,6 +94,15 @@ i_mean(x)
 
 The key concept here is that the simple function `i_mean()` is applied by `Impute()` to each column. This idea of passing simpler functions is used in several key COINr functions, and allows great flexibility because more sophisticated imputation methods can be used from other packages, for example.
 
+In COINr there are currently four basic imputation functions which impute a numeric vector:
+
+* `i_mean()`: substitutes missing values with the mean of the remaining values
+* `i_median()`: substitutes missing values with the median of the remaining values (this will be more robust to outliers)
+* `i_mean_grp()`: substitutes missing values with the mean of a subset of the remaining values, defined by a separate grouping vector
+* `i_median_grp()`: substitutes missing values with the median of a subset of the remaining values, defined by a separate grouping vector
+
+These are very simple imputation methods, but more sophisticated options can be used by calling functions from other packages. The group imputation functions above are useful in an indicator context: for example, in country-level indicator analysis we can substitute missing values by the mean/median within the same GDP/capita group, which is often a better approach than a flat mean across all countries. Whether it is better depends on the context, however.
+
 For now let's explore the options native to COINr. We can also apply the `i_median()` function in the same way to substitute with the indicator median. Adding a little complexity, we can also impute by mean or median, but within unit (row) groups. Let's assume that the first five rows in our data frame belong to a group "a", and the remaining five to a different group "b". In practice, these could be e.g. GDP, population or wealth groups for countries - we might hypothesise that it is better to replace `NA` values with the median inside a group, rather than the overall median, because countries within groups are more similar.
 
 To do this on a data frame we can use the `i_median_grp()` function, which requires an additional argument `f`: a grouping variable. This is passed through `Impute()` using the `f_i_para` argument, which takes any additional parameters to `f_i` apart from the data to be imputed.
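
The idea of group-wise median imputation described above can be sketched in base R. This is only an illustration under stated assumptions: `impute_median_by_group()` is a hypothetical helper written for this example, not COINr's own `i_median_grp()` implementation, and the data and the grouping vector `f` (five units in group "a", five in group "b") are made up.

```r
# Hypothetical helper (NOT part of COINr): replace each NA in x with the
# median of the non-NA values belonging to the same group in f.
impute_median_by_group <- function(x, f) {
  stopifnot(is.numeric(x), length(x) == length(f))
  # ave() evaluates the function within each group of f and returns a
  # vector aligned with x, so every position holds its own group's median
  grp_median <- ave(x, f, FUN = function(v) median(v, na.rm = TRUE))
  missing <- is.na(x)
  x[missing] <- grp_median[missing]
  x
}

# Made-up data: first five values in group "a", remaining five in group "b"
x <- c(1, 2, NA, 4, 100, 10, NA, 30, 20, 40)
f <- c(rep("a", 5), rep("b", 5))

x_imputed <- impute_median_by_group(x, f)
x_imputed
# 1 2 3 4 100 10 25 30 20 40
```

The missing value in group "a" is replaced by 3 (the median of 1, 2, 4 and 100) and the one in group "b" by 25, whereas a flat median over all eight observed values would have given 15 for both. Non-missing values are left unchanged.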