diff --git a/R/aggregate.R b/R/aggregate.R index 75ac7941..d803e1d1 100644 --- a/R/aggregate.R +++ b/R/aggregate.R @@ -66,7 +66,7 @@ Aggregate.purse <- function(x, dset, f_ag = NULL, w = NULL, f_ag_para = NULL, da #' #' Note that COINr has a number of aggregation functions built in, #' all of which are of the form `a_*()`, e.g. [a_amean()], [a_gmean()] and friends. To see a list browse COINr functions alphabetically or -#' type `a_` in the R Studio console and press the tab key (after loading COINr). +#' type `a_` in the R Studio console and press the tab key (after loading COINr), or see the [online documentation](https://bluefoxr.github.io/COINr/articles/aggregate.html#coinr-aggregation-functions). #' #' Optionally, a data availability threshold can be assigned below which the aggregated value will return #' `NA` (see `dat_thresh` argument). If `by_df = TRUE`, this will however be ignored because aggregation is not @@ -313,7 +313,7 @@ Aggregate.coin <- function(x, dset, f_ag = NULL, w = NULL, f_ag_para = NULL, dat #' #' Note that COINr has a number of aggregation functions built in, #' all of which are of the form `a_*()`, e.g. [a_amean()], [a_gmean()] and friends. To see a list browse COINr functions alphabetically or -#' type `a_` in the R Studio console and press the tab key (after loading COINr). +#' type `a_` in the R Studio console and press the tab key (after loading COINr), or see the [online documentation](https://bluefoxr.github.io/COINr/articles/aggregate.html#coinr-aggregation-functions). #' #' Optionally, a data availability threshold can be assigned below which the aggregated value will return #' `NA` (see `dat_thresh` argument). 
If `by_df = TRUE`, this will however be ignored because aggregation is not diff --git a/R/impute.R b/R/impute.R index b05c8a15..90e8d60c 100644 --- a/R/impute.R +++ b/R/impute.R @@ -135,6 +135,9 @@ Impute.purse <- function(x, dset, f_i = NULL, f_i_para = NULL, impute_by = "colu #' it except for `NA` values, which can be replaced. The function `f_i` is not required to replace *all* `NA` #' values. #' +#' COINr has several built-in imputation functions of the form `i_*()` for vectors which can be called by [Impute()]. See the +#' [online documentation](https://bluefoxr.github.io/COINr/articles/imputation.html#data-frames) for more details. +#' #' When imputing row-wise, prior normalisation of the data is recommended. This is because imputation #' will use e.g. the mean of the unit values over all indicators (columns). If the indicators are on #' very different scales, the result will likely make no sense. If the indicators are normalised first, @@ -332,6 +335,9 @@ Impute.coin <- function(x, dset, f_i = NULL, f_i_para = NULL, impute_by = "colum #' it except for `NA` values, which can be replaced. The function `f_i` is not required to replace *all* `NA` #' values. #' +#' COINr has several built-in imputation functions of the form `i_*()` for vectors which can be called by [Impute()]. See the +#' [online documentation](https://bluefoxr.github.io/COINr/articles/imputation.html#data-frames) for more details. +#' #' When imputing row-wise, prior normalisation of the data is recommended. This is because imputation #' will use e.g. the mean of the unit values over all indicators (columns). If the indicators are on #' very different scales, the result will likely make no sense. If the indicators are normalised first, @@ -566,6 +572,9 @@ Impute.data.frame <- function(x, f_i = NULL, f_i_para = NULL, impute_by = "colum #' values found in `x`. By default, `f_i = "i_mean()"`, which simply imputes `NA`s with the mean of the #' non-`NA` values in `x`. 
#' +#' COINr has several built-in imputation functions of the form `i_*()` for vectors which can be called by [Impute()]. See the +#' [online documentation](https://bluefoxr.github.io/COINr/articles/imputation.html#data-frames) for more details. +#' #' You could also use one of the imputation functions directly (such as [i_mean()]). However, this #' function offers a few extra advantages, such as checking the input and output formats, and making #' sure the resulting imputed vector agrees with the input. It will also skip imputation entirely if diff --git a/R/normalise.R b/R/normalise.R index 9dcffb17..79a75325 100644 --- a/R/normalise.R +++ b/R/normalise.R @@ -109,7 +109,8 @@ Normalise.purse <- function(x, dset, global_specs = NULL, indiv_specs = NULL, #' normalised with individual specifications using the `indiv_specs` argument. If indicators should have their #' directions reversed, this can be specified using the `directions` argument. Non-numeric columns are ignored #' automatically by this function. By default, this function normalises each indicator using the "min-max" method, scaling indicators to lie between -#' 0 and 100. This calls the [n_minmax()] function. Note, all COINr normalisation functions are of the form `n_*()`. +#' 0 and 100. This calls the [n_minmax()] function. COINr has a number of built-in normalisation functions of the form `n_*()`. See [online documentation](https://bluefoxr.github.io/COINr/articles/normalise.html#built-in-normalisation-functions) +#' for details. #' #' ## Global specification #' @@ -251,7 +252,8 @@ Normalise.coin <- function(x, dset, global_specs = NULL, indiv_specs = NULL, #' normalised with individual specifications using the `indiv_specs` argument. If variables should have their #' directions reversed, this can be specified using the `directions` argument. Non-numeric columns are ignored #' automatically by this function. 
By default, this function normalises each indicator using the "min-max" method, scaling indicators to lie between -#' 0 and 100. This calls the [n_minmax()] function. Note, all COINr normalisation functions are of the form `n_*()`. +#' 0 and 100. This calls the [n_minmax()] function. COINr has a number of built-in normalisation functions of the form `n_*()`. See [online documentation](https://bluefoxr.github.io/COINr/articles/normalise.html#built-in-normalisation-functions) +#' for details. #' #' ## Global specification #' @@ -401,6 +403,9 @@ Normalise.data.frame <- function(x, global_specs = NULL, indiv_specs = NULL, #' first argument is `x`, a numeric vector, and it returns a numeric vector of the same length. See [n_minmax()] #' for an example. #' +#' COINr has a number of built-in normalisation functions of the form `n_*()`. See [online documentation](https://bluefoxr.github.io/COINr/articles/normalise.html#built-in-normalisation-functions) +#' for details. +#' #' `f_n_para` is *required* to be a named list. So e.g. if we define a function `f1(x, arg1, arg2)` then we should #' specify `f_n = "f1"`, and `f_n_para = list(arg1 = val1, arg2 = val2)`, where `val1` and `val2` are the #' values assigned to the arguments `arg1` and `arg2` respectively. diff --git a/man/Aggregate.coin.Rd b/man/Aggregate.coin.Rd index 7614fa1b..3aa0cf07 100644 --- a/man/Aggregate.coin.Rd +++ b/man/Aggregate.coin.Rd @@ -71,7 +71,7 @@ a numeric vector (the result of aggregating the whole data frame in one go). Note that COINr has a number of aggregation functions built in, all of which are of the form \verb{a_*()}, e.g. \code{\link[=a_amean]{a_amean()}}, \code{\link[=a_gmean]{a_gmean()}} and friends. To see a list browse COINr functions alphabetically or -type \code{a_} in the R Studio console and press the tab key (after loading COINr). 
+type \code{a_} in the R Studio console and press the tab key (after loading COINr), or see the \href{https://bluefoxr.github.io/COINr/articles/aggregate.html#coinr-aggregation-functions}{online documentation}. Optionally, a data availability threshold can be assigned below which the aggregated value will return \code{NA} (see \code{dat_thresh} argument). If \code{by_df = TRUE}, this will however be ignored because aggregation is not diff --git a/man/Aggregate.data.frame.Rd b/man/Aggregate.data.frame.Rd index 935ce473..d358a032 100644 --- a/man/Aggregate.data.frame.Rd +++ b/man/Aggregate.data.frame.Rd @@ -51,7 +51,7 @@ a numeric vector (the result of aggregating the whole data frame in one go). Note that COINr has a number of aggregation functions built in, all of which are of the form \verb{a_*()}, e.g. \code{\link[=a_amean]{a_amean()}}, \code{\link[=a_gmean]{a_gmean()}} and friends. To see a list browse COINr functions alphabetically or -type \code{a_} in the R Studio console and press the tab key (after loading COINr). +type \code{a_} in the R Studio console and press the tab key (after loading COINr), or see the \href{https://bluefoxr.github.io/COINr/articles/aggregate.html#coinr-aggregation-functions}{online documentation}. Optionally, a data availability threshold can be assigned below which the aggregated value will return \code{NA} (see \code{dat_thresh} argument). If \code{by_df = TRUE}, this will however be ignored because aggregation is not diff --git a/man/Impute.coin.Rd b/man/Impute.coin.Rd index 8617b56c..c219bc69 100644 --- a/man/Impute.coin.Rd +++ b/man/Impute.coin.Rd @@ -75,6 +75,9 @@ frame. Moreover, this function should return a vector or data frame identical to it except for \code{NA} values, which can be replaced. The function \code{f_i} is not required to replace \emph{all} \code{NA} values. +COINr has several built-in imputation functions of the form \verb{i_*()} for vectors which can be called by \code{\link[=Impute]{Impute()}}. 
See the +\href{https://bluefoxr.github.io/COINr/articles/imputation.html#data-frames}{online documentation} for more details. + When imputing row-wise, prior normalisation of the data is recommended. This is because imputation will use e.g. the mean of the unit values over all indicators (columns). If the indicators are on very different scales, the result will likely make no sense. If the indicators are normalised first, diff --git a/man/Impute.data.frame.Rd b/man/Impute.data.frame.Rd index 436449d6..af2c5ffb 100644 --- a/man/Impute.data.frame.Rd +++ b/man/Impute.data.frame.Rd @@ -57,6 +57,9 @@ frame. Moreover, this function should return a vector or data frame identical to it except for \code{NA} values, which can be replaced. The function \code{f_i} is not required to replace \emph{all} \code{NA} values. +COINr has several built-in imputation functions of the form \verb{i_*()} for vectors which can be called by \code{\link[=Impute]{Impute()}}. See the +\href{https://bluefoxr.github.io/COINr/articles/imputation.html#data-frames}{online documentation} for more details. + When imputing row-wise, prior normalisation of the data is recommended. This is because imputation will use e.g. the mean of the unit values over all indicators (columns). If the indicators are on very different scales, the result will likely make no sense. If the indicators are normalised first, diff --git a/man/Impute.numeric.Rd b/man/Impute.numeric.Rd index 6270e881..e97d1ef9 100644 --- a/man/Impute.numeric.Rd +++ b/man/Impute.numeric.Rd @@ -28,6 +28,9 @@ This calls the function \code{f_i()}, with optionally further arguments \code{f_ values found in \code{x}. By default, \code{f_i = "i_mean()"}, which simply imputes \code{NA}s with the mean of the non-\code{NA} values in \code{x}. +COINr has several built-in imputation functions of the form \verb{i_*()} for vectors which can be called by \code{\link[=Impute]{Impute()}}. 
See the +\href{https://bluefoxr.github.io/COINr/articles/imputation.html#data-frames}{online documentation} for more details. + You could also use one of the imputation functions directly (such as \code{\link[=i_mean]{i_mean()}}). However, this function offers a few extra advantages, such as checking the input and output formats, and making sure the resulting imputed vector agrees with the input. It will also skip imputation entirely if diff --git a/man/Normalise.coin.Rd b/man/Normalise.coin.Rd index 7fb5bed4..9c7db3a1 100644 --- a/man/Normalise.coin.Rd +++ b/man/Normalise.coin.Rd @@ -52,7 +52,8 @@ Creates a normalised data set using specifications specified in \code{global_spe normalised with individual specifications using the \code{indiv_specs} argument. If indicators should have their directions reversed, this can be specified using the \code{directions} argument. Non-numeric columns are ignored automatically by this function. By default, this function normalises each indicator using the "min-max" method, scaling indicators to lie between -0 and 100. This calls the \code{\link[=n_minmax]{n_minmax()}} function. Note, all COINr normalisation functions are of the form \verb{n_*()}. +0 and 100. This calls the \code{\link[=n_minmax]{n_minmax()}} function. COINr has a number of built-in normalisation functions of the form \verb{n_*()}. See \href{https://bluefoxr.github.io/COINr/articles/normalise.html#built-in-normalisation-functions}{online documentation} +for details. } \details{ \subsection{Global specification}{ diff --git a/man/Normalise.data.frame.Rd b/man/Normalise.data.frame.Rd index 0d7f8b92..c9503b51 100644 --- a/man/Normalise.data.frame.Rd +++ b/man/Normalise.data.frame.Rd @@ -31,7 +31,8 @@ Normalises a data frame using specifications specified in \code{global_specs}. C normalised with individual specifications using the \code{indiv_specs} argument. 
If variables should have their directions reversed, this can be specified using the \code{directions} argument. Non-numeric columns are ignored automatically by this function. By default, this function normalises each indicator using the "min-max" method, scaling indicators to lie between -0 and 100. This calls the \code{\link[=n_minmax]{n_minmax()}} function. Note, all COINr normalisation functions are of the form \verb{n_*()}. +0 and 100. This calls the \code{\link[=n_minmax]{n_minmax()}} function. COINr has a number of built-in normalisation functions of the form \verb{n_*()}. See \href{https://bluefoxr.github.io/COINr/articles/normalise.html#built-in-normalisation-functions}{online documentation} +for details. } \details{ \subsection{Global specification}{ diff --git a/man/Normalise.numeric.Rd b/man/Normalise.numeric.Rd index 74a2606b..d32b5fba 100644 --- a/man/Normalise.numeric.Rd +++ b/man/Normalise.numeric.Rd @@ -34,6 +34,9 @@ further arguments to \code{f_n}. This means that any function can be passed to \ first argument is \code{x}, a numeric vector, and it returns a numeric vector of the same length. See \code{\link[=n_minmax]{n_minmax()}} for an example. +COINr has a number of built-in normalisation functions of the form \verb{n_*()}. See \href{https://bluefoxr.github.io/COINr/articles/normalise.html#built-in-normalisation-functions}{online documentation} +for details. + \code{f_n_para} is \emph{required} to be a named list. So e.g. if we define a function \code{f1(x, arg1, arg2)} then we should specify \code{f_n = "f1"}, and \code{f_n_para = list(arg1 = val1, arg2 = val2)}, where \code{val1} and \code{val2} are the values assigned to the arguments \code{arg1} and \code{arg2} respectively. diff --git a/vignettes/aggregate.Rmd b/vignettes/aggregate.Rmd index b911c574..7e5a1cb5 100644 --- a/vignettes/aggregate.Rmd +++ b/vignettes/aggregate.Rmd @@ -136,6 +136,8 @@ Let's now explore some of the options of the `Aggregate()` function. 
Like other * `a_hmean()`: the weighted harmonic mean * `a_copeland()`: the Copeland method (note: requires `by_df = TRUE`) +For details of these methods, see [Approaches] above and the documentation of each function listed. + By default, the arithmetic mean is called but we can easily change this to the geometric mean, for example. However here we run into a problem: the geometric mean will fail if any values to aggregate are less than or equal to zero. So to use the geometric mean we have to re-do the normalisation step to avoid this. Luckily this is straightforward in COINr: ```{r} diff --git a/vignettes/imputation.Rmd b/vignettes/imputation.Rmd index fa053f95..5194347d 100644 --- a/vignettes/imputation.Rmd +++ b/vignettes/imputation.Rmd @@ -94,6 +94,15 @@ i_mean(x) The key concept here is that the simple function `i_mean()` is applied by `Impute()` to each column. This idea of passing simpler functions is used in several key COINr functions, and allows great flexibility because more sophisticated imputation methods can be used from other packages, for example. +In COINr there are currently four basic imputation functions which impute a numeric vector: + +* `i_mean()`: substitutes missing values with the mean of the remaining values +* `i_median()`: substitutes missing values with the median of the remaining values (this will be more robust to outliers) +* `i_mean_grp()`: substitutes missing values with the mean of a subset of the remaining values, defined by a separate grouping vector +* `i_median_grp()`: substitutes missing values with the median of a subset of the remaining values, defined by a separate grouping vector + +These are very simple imputation methods, but more sophisticated options can be used by calling functions from other packages. 
The group imputation functions above are useful in an indicator context: for example, in country-level indicator analysis we can substitute missing values by the mean/median within the same GDP/capita group, which is often a better approach than a flat mean across all countries. This is, however, context-dependent. + For now let's explore the options native to COINr. We can also apply the `i_median()` function in the same way to substitute with the indicator median. Adding a little complexity, we can also impute by mean or median, but within unit (row) groups. Let's assume that the first five rows in our data frame belong to a group "a", and the remaining five to a different group "b". In practice, these could be e.g. GDP, population or wealth groups for countries - we might hypothesise that it is better to replace `NA` values with the median inside a group, rather than the overall median, because countries within groups are more similar. To do this on a data frame we can use the `i_median_grp()` function, which requires an additional argument `f`: a grouping variable. This is passed through `Impute()` using the `f_i_para` argument which takes any additional parameters top `f_i` apart from the data to be imputed.
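
To make the group-median idea concrete, here is a minimal base R sketch. It is not COINr's implementation of `i_median_grp()`: the helper name `impute_median_by_group()` is hypothetical and only illustrates the technique of replacing each `NA` with the median of the non-`NA` values in the same group.

```r
# Hypothetical helper (not part of COINr): replace each NA in x with the
# median of the non-NA values that share its group label in f.
impute_median_by_group <- function(x, f) {
  stopifnot(is.numeric(x), length(x) == length(f))
  # ave() applies FUN within each group of f and returns a vector aligned
  # with x, so every position holds the median of its own group
  grp_median <- ave(x, f, FUN = function(v) median(v, na.rm = TRUE))
  # only the missing positions are overwritten; observed values are kept
  x[is.na(x)] <- grp_median[is.na(x)]
  x
}

# Toy data: two groups of three units, one missing value in each group
x <- c(1, NA, 3, 10, NA, 30)
f <- c("a", "a", "a", "b", "b", "b")
x_imp <- impute_median_by_group(x, f)
print(x_imp)
```

A function of this shape (data vector first, grouping vector as a named extra argument) is the kind of function that the text above describes passing to `Impute()` via `f_i`, with the grouping vector supplied through `f_i_para`.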