`statsExpressions`: Tidy dataframes and expressions with statistical details


Introduction

The statsExpressions package has two key aims:

to provide a consistent syntax to do statistical analysis with tidy data (in a pipe-friendly manner),
to provide statistical expressions (pre-formatted in-text statistical results) for plotting functions.

Statistical packages exhibit substantial diversity in terms of their syntax and expected input type. This can make it difficult to switch from one statistical approach to another. For example, some functions expect vectors as inputs, while others expect dataframes. Depending on whether it is a repeated measures design or not, different functions might expect data to be in wide or long format. Some functions can internally omit missing values, while other functions error in their presence. Furthermore, if someone wishes to utilize the objects returned by these packages downstream in their workflow, this is not straightforward either because even functions from the same package can return a list, a matrix, an array, a dataframe, etc., depending on the function.

This is where statsExpressions comes in: It can be thought of as a unified portal through which most of the functionality in these underlying packages can be accessed, with a simpler interface and no requirement to change data format.

This package forms the statistical processing backend for the ggstatsplot package.

Installation

Type	Source	Command
Release	CRAN	`install.packages("statsExpressions")`
Development	GitHub	`remotes::install_github("IndrajeetPatil/statsExpressions")`

Citation

The package can be cited as:

citation("statsExpressions")

  Patil, I., (2021). statsExpressions: R Package for Tidy Dataframes
  and Expressions with Statistical Details. Journal of Open Source
  Software, 6(61), 3236, https://doi.org/10.21105/joss.03236

A BibTeX entry for LaTeX users is

  @Article{,
    doi = {10.21105/joss.03236},
    url = {https://doi.org/10.21105/joss.03236},
    year = {2021},
    publisher = {{The Open Journal}},
    volume = {6},
    number = {61},
    pages = {3236},
    author = {Indrajeet Patil},
    title = {{statsExpressions: {R} Package for Tidy Dataframes and Expressions with Statistical Details}},
    journal = {{Journal of Open Source Software}},
  }

General Workflow

Summary of types of statistical analyses

Here is a tabular summary of available tests:

Test	Function
one-sample t-test	`one_sample_test`
two-sample t-test	`two_sample_test`
one-way ANOVA	`oneway_anova`
correlation analysis	`corr_test`
contingency table analysis	`contingency_table`
meta-analysis	`meta_analysis`

The table below summarizes all the different types of analyses currently supported in this package:

Description	Parametric	Non-parametric	Robust	Bayesian
Between group/condition comparisons	✅	✅	✅	✅
Within group/condition comparisons	✅	✅	✅	✅
Distribution of a numeric variable	✅	✅	✅	✅
Correlation between two variables	✅	✅	✅	✅
Association between categorical variables	✅	✅	❌	✅
Equal proportions for categorical variable levels	✅	✅	❌	✅
Random-effects meta-analysis	✅	❌	✅	✅

Summary of Bayesian analysis

Analysis	Hypothesis testing	Estimation
(one/two-sample) t-test	✅	✅
one-way ANOVA	✅	✅
correlation	✅	✅
(one/two-way) contingency table	✅	✅
random-effects meta-analysis	✅	✅

Tidy dataframes from statistical analysis

To illustrate the simplicity of this syntax, let’s say we want to run a one-way ANOVA. If we first run a non-parametric ANOVA and then decide to run a robust ANOVA instead, the syntax remains the same and the statistical approach can be modified by changing a single argument:

library(statsExpressions)

mtcars %>% oneway_anova(cyl, wt, type = "nonparametric") 
#> # A tibble: 1 x 14
#>   parameter1 parameter2 statistic df.error   p.value
#>   <chr>      <chr>          <dbl>    <int>     <dbl>
#> 1 wt         cyl             22.8        2 0.0000112
#>   method                       estimate conf.level conf.low conf.high
#>   <chr>                           <dbl>      <dbl>    <dbl>     <dbl>
#> 1 Kruskal-Wallis rank sum test    0.736       0.95    0.619     0.841
#>   effectsize      conf.method          conf.iterations expression
#>   <chr>           <chr>                          <int> <list>    
#> 1 Epsilon2 (rank) percentile bootstrap             100 <language>

mtcars %>% oneway_anova(cyl, wt, type = "robust")
#> # A tibble: 1 x 11
#>   statistic    df df.error p.value estimate conf.level conf.low conf.high
#>       <dbl> <dbl>    <dbl>   <dbl>    <dbl>      <dbl>    <dbl>     <dbl>
#> 1      12.7     2     12.2 0.00102     1.04       0.95    0.822      1.57
#>   effectsize                        
#>   <chr>                             
#> 1 Explanatory measure of effect size
#>   method                                            expression
#>   <chr>                                             <list>    
#> 1 A heteroscedastic one-way ANOVA for trimmed means <language>

All possible output dataframes from functions are tabulated here: https://indrajeetpatil.github.io/statsExpressions/articles/web_only/dataframe_outputs.html
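The single-argument switch described above can also be driven programmatically. The following is a minimal sketch, assuming the installed version of statsExpressions accepts the same `type` values used in this README and returns a `method` column for each of them; the names `approaches` and `results` are illustrative only.

```r
library(statsExpressions)
set.seed(123) # some confidence intervals are bootstrapped

# the approaches differ only in the value passed to `type`
approaches <- c("parametric", "nonparametric", "robust")

# run the same one-way ANOVA once per approach
results <- lapply(approaches, function(approach) {
  oneway_anova(mtcars, cyl, wt, type = approach)
})
names(results) <- approaches

# which test was actually run for each approach
vapply(results, function(df) df$method[[1]], character(1))
```

Each element of `results` is a one-row dataframe, so the outputs can be inspected or compared with the usual dataframe tools.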

Because the output is a dataframe, it can also be passed to the kable function to generate a table:

# setup
library(statsExpressions)
set.seed(123)

# one-sample robust t-test
# we will leave `expression` column out; it's not needed for using only the dataframe
mtcars %>%
  one_sample_test(wt, test.value = 3, type = "robust") %>%
  dplyr::select(-expression) %>%
  knitr::kable()

statistic	p.value	method	estimate	conf.low	conf.high	conf.level	effectsize
1.179181	0.22	Bootstrap-t method for one-sample test	3.197	2.872163	3.521837	0.95	Trimmed mean

These functions are also compatible with other popular data manipulation packages.

For example, let’s say we want to run a one-sample t-test for all levels of a certain grouping variable. We can use dplyr to do so:

# for reproducibility
set.seed(123)
library(dplyr)

# grouped operation
# running one-sample test for all levels of grouping variable `cyl`
mtcars %>%
  group_by(cyl) %>%
  group_modify(~ one_sample_test(.x, wt, test.value = 3), .keep = TRUE) %>%
  ungroup()
#> # A tibble: 3 x 15
#>     cyl    mu statistic df.error  p.value method            alternative estimate
#>   <dbl> <dbl>     <dbl>    <dbl>    <dbl> <chr>             <chr>          <dbl>
#> 1     4     3    -4.16        10 0.00195  One Sample t-test two.sided     -1.16 
#> 2     6     3     0.870        6 0.418    One Sample t-test two.sided      0.286
#> 3     8     3     4.92        13 0.000278 One Sample t-test two.sided      1.24 
#>   conf.level conf.low conf.high effectsize conf.method conf.distribution
#>        <dbl>    <dbl>     <dbl> <chr>      <chr>       <chr>            
#> 1       0.95   -1.97     -0.422 Hedges' g  ncp         t                
#> 2       0.95   -0.419     1.01  Hedges' g  ncp         t                
#> 3       0.95    0.565     1.98  Hedges' g  ncp         t                
#>   expression
#>   <list>    
#> 1 <language>
#> 2 <language>
#> 3 <language>

Using expressions in custom plots

Note that expression here means a pre-formatted in-text statistical result. In addition to other details contained in the dataframe, there is also a column titled expression, which contains an expression with the statistical details that can be displayed in a plot.

For all statistical test expressions, the default template attempts to follow the gold standard for statistical reporting.

For example, here are results from Welch’s t-test:
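As a minimal sketch of how such an expression can be obtained, the call below reuses the `two_sample_test()` invocation that appears later in this README under the title "Two-Sample Welch's t-test". It assumes the installed version still returns an `expression` list column and a `method` column, as in the outputs shown above.

```r
library(statsExpressions)

# `var.equal` is not set here, matching the Welch's t-test example in this README
res <- two_sample_test(ToothGrowth, supp, len)

# name of the test that was run
res$method

# the pre-formatted expression, ready to be passed to a plot subtitle
res$expression[[1]]
```

The extracted object is an R language object, so it can be passed directly to arguments such as `subtitle` in `ggplot2::labs()`.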

Example: Expressions for one-way ANOVAs

Between-subjects design

Let’s say we want to check differences in sepal length across iris species and wish to carry out a robust trimmed-means ANOVA:

# setup
set.seed(123)
library(ggplot2)
library(statsExpressions)
library(ggridges)

# create a ridgeplot
ggplot(iris, aes(x = Sepal.Length, y = Species)) +
  geom_density_ridges(
    jittered_points = TRUE, quantile_lines = TRUE,
    scale = 0.9, vline_size = 1, vline_color = "red",
    position = position_raincloud(adjust_vlines = TRUE)
  ) + # use the expression in the dataframe to display results in the subtitle
  labs(
    title = "A heteroscedastic one-way ANOVA for trimmed means",
    subtitle = oneway_anova(iris, Species, Sepal.Length, type = "robust")$expression[[1]]
  )

Within-subjects design

Let’s now see an example of a repeated measures one-way ANOVA.

# setup
set.seed(123)
library(ggplot2)
library(WRS2)
library(ggbeeswarm)
library(statsExpressions)

ggplot2::ggplot(WineTasting, aes(Wine, Taste, color = Wine)) +
  geom_quasirandom() +
  labs(
    title = "Friedman's rank sum test",
    subtitle = oneway_anova(
      WineTasting,
      Wine,
      Taste,
      paired = TRUE,
      subject.id = Taster,
      type = "np"
    )$expression[[1]]
  )

Example: Expressions for two-sample tests

Between-subjects design

# setup
set.seed(123)
library(ggplot2)
library(gghalves)
library(ggbeeswarm)
library(hrbrthemes)
library(statsExpressions)

# create a plot
ggplot(ToothGrowth, aes(supp, len)) +
  geom_half_boxplot() +
  geom_beeswarm(beeswarmArgs = list(side = 1)) +
  theme_ipsum_rc() +
  # adding a subtitle with results from Welch's t-test
  labs(
    title = "Two-Sample Welch's t-test",
    subtitle = two_sample_test(ToothGrowth, supp, len)$expression[[1]]
  )

Within-subjects design

We can also have a look at a repeated measures design and the related expressions.

# setup
set.seed(123)
library(ggplot2)
library(statsExpressions)
library(tidyr)
library(PairedData)
data(PrisonStress)

# plot
paired.plotProfiles(PrisonStress, "PSSbefore", "PSSafter", subjects = "Subject") +
  # `statsExpressions` needs data in the tidy format
  labs(
    title = "Two-sample Wilcoxon paired test",
    subtitle = two_sample_test(
      data = pivot_longer(PrisonStress, starts_with("PSS"), names_to = "PSS", values_to = "stress"),
      x = PSS,
      y = stress,
      paired = TRUE,
      subject.id = Subject,
      type = "np"
    )$expression[[1]]
  )

Example: Expressions for one-sample tests

# setup
set.seed(123)
library(ggplot2)
library(statsExpressions)

# creating a histogram plot
ggplot(mtcars, aes(wt)) +
  geom_histogram(alpha = 0.5) +
  geom_vline(xintercept = mean(mtcars$wt), color = "red") +
  # adding a subtitle with a non-parametric one-sample test
  labs(
    title = "One-Sample Wilcoxon Signed Rank Test",
    subtitle = one_sample_test(mtcars, wt, test.value = 3, type = "nonparametric")$expression[[1]]
  )

Example: Expressions for correlation analyses

Let’s look at another example where we want to run correlation analysis:

# setup
set.seed(123)
library(ggplot2)
library(statsExpressions)

# create a scatter plot
ggplot(mtcars, aes(mpg, wt)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(
    title = "Spearman's rank correlation coefficient",
    subtitle = corr_test(mtcars, mpg, wt, type = "nonparametric")$expression[[1]]
  )

Example: Expressions for contingency table analysis

For categorical/nominal data - one-sample:

# setup
set.seed(123)
library(ggplot2)
library(statsExpressions)

# basic pie chart
ggplot(as.data.frame(table(mpg$class)), aes(x = "", y = Freq, fill = factor(Var1))) +
  geom_bar(width = 1, stat = "identity") +
  theme(axis.line = element_blank()) +
  # cleaning up the chart and adding results from one-sample proportion test
  coord_polar(theta = "y", start = 0) +
  labs(
    fill = "Class",
    x = NULL,
    y = NULL,
    title = "Pie Chart of class (type of car)",
    subtitle = contingency_table(as.data.frame(table(mpg$class)), Var1, counts = Freq)$expression[[1]],
    caption = "One-sample goodness of fit proportion test"
  )

You can also use these functions to get the expression itself, without having to display it in a plot:

# setup
set.seed(123)
library(ggplot2)
library(statsExpressions)

# Pearson's chi-squared test of independence
contingency_table(mtcars, am, cyl)$expression[[1]]
#> paste(chi["Pearson"]^2, "(", "2", ") = ", "8.74", ", ", italic("p"), 
#>     " = ", "0.013", ", ", widehat(italic("V"))["Cramer"], " = ", 
#>     "0.46", ", CI"["95%"], " [", "0.00", ", ", "0.78", "], ", 
#>     italic("n")["obs"], " = ", "32")

Example: Expressions for meta-analysis

# setup
set.seed(123)
library(metaviz)
library(ggplot2)
library(metaplus)
library(statsExpressions)

# meta-analysis forest plot with results from a random-effects meta-analysis
viz_forest(
  x = mozart[, c("d", "se")],
  study_labels = mozart[, "study_name"],
  xlab = "Cohen's d",
  variant = "thick",
  type = "cumulative"
) + # use `statsExpressions` to create expression containing results
  labs(
    title = "Meta-analysis of Pietschnig, Voracek, and Formann (2010) on the Mozart effect",
    subtitle = meta_analysis(dplyr::rename(mozart, estimate = d, std.error = se))$expression[[1]]
  ) +
  theme(text = element_text(size = 12))

Customizing details to your liking

Sometimes you may not wish to include so many details in the subtitle. In that case, you can extract the expression and copy-paste only the part you wish to include. For example, here only the test statistic and the p-value are included:

# setup
set.seed(123)
library(ggplot2)
library(statsExpressions)

# extracting detailed expression
(res_expr <- oneway_anova(iris, Species, Sepal.Length, var.equal = TRUE)$expression[[1]])
#> paste(italic("F")["Fisher"], "(", "2", ",", "147", ") = ", "119.26", 
#>     ", ", italic("p"), " = ", "1.67e-31", ", ", widehat(omega["p"]^2), 
#>     " = ", "0.61", ", CI"["95%"], " [", "0.52", ", ", "0.68", 
#>     "], ", italic("n")["obs"], " = ", "150")

# adapting the details to your liking
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  labs(subtitle = ggplot2::expr(paste(
    NULL, italic("F"), "(", "2",
    ",", "147", ") = ", "119.26", ", ",
    italic("p"), " = ", "1.67e-31"
  )))
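Instead of copy-pasting, the same trimming can be sketched programmatically, since a `paste(...)` expression is a call whose arguments can be subset like a list. To stay self-contained, the example below rebuilds by hand the expression printed above; the cut-off of 12 elements is an assumption that holds only for this exact layout, and other package versions may format the expression differently.

```r
# the expression as printed above, rebuilt by hand so the sketch is self-contained
res_expr <- quote(paste(
  italic("F")["Fisher"], "(", "2", ",", "147", ") = ", "119.26",
  ", ", italic("p"), " = ", "1.67e-31", ", ", widehat(omega["p"]^2),
  " = ", "0.61", ", CI"["95%"], " [", "0.52", ", ", "0.68",
  "], ", italic("n")["obs"], " = ", "150"
))

# element 1 of the call is `paste`; elements 2 to 12 are the arguments
# covering the test statistic and the p-value
short_expr <- as.call(as.list(res_expr)[1:12])
short_expr
```

The shortened call can then be passed to `labs(subtitle = short_expr)` in the same way as the full expression.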

Summary of tests and effect sizes

This section provides a go-to summary of the statistical test carried out and the effect size returned by each function. It should be useful if you need to find out more about how an argument is resolved in the underlying package or if you wish to browse the source code. For example, if you want to know more about how the one-way (between-subjects) ANOVA is carried out, you can run ?stats::oneway.test in your R console.

`two_sample_test` + `oneway_anova`

No. of groups: 2 => two_sample_test
No. of groups: > 2 => oneway_anova
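The rule above can be sketched as a small helper that counts the groups and names the matching function. Note that `pick_test()` is a hypothetical helper written for illustration only; it is not part of statsExpressions.

```r
# hypothetical helper (not part of statsExpressions): name the function
# that matches the number of groups in a grouping column
pick_test <- function(data, group) {
  n_groups <- length(unique(stats::na.omit(data[[group]])))
  if (n_groups < 2) {
    stop("need at least two groups to compare")
  }
  if (n_groups == 2) "two_sample_test" else "oneway_anova"
}

pick_test(ToothGrowth, "supp") # `supp` has 2 levels
pick_test(iris, "Species") # `Species` has 3 levels
```

The helper only returns the function name as a string, which keeps the choice of test separate from the call that runs it.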

between-subjects

Hypothesis testing

Type	No. of groups	Test	Function used
Parametric	> 2	Fisher’s or Welch’s one-way ANOVA	`stats::oneway.test`
Non-parametric	> 2	Kruskal–Wallis one-way ANOVA	`stats::kruskal.test`
Robust	> 2	Heteroscedastic one-way ANOVA for trimmed means	`WRS2::t1way`
Bayes Factor	> 2	Fisher’s ANOVA	`BayesFactor::anovaBF`
Parametric	2	Student’s or Welch’s t-test	`stats::t.test`
Non-parametric	2	Mann–Whitney U test	`stats::wilcox.test`
Robust	2	Yuen’s test for trimmed means	`WRS2::yuen`
Bayesian	2	Student’s t-test	`BayesFactor::ttestBF`

Effect size estimation

Type	No. of groups	Effect size	CI?	Function used
Parametric	> 2	$\eta_{p}^2$ , $\omega_{p}^2$	✅	`effectsize::omega_squared`, `effectsize::eta_squared`
Non-parametric	> 2	$\epsilon_{ordinal}^2$	✅	`effectsize::rank_epsilon_squared`
Robust	> 2	$\xi$ (Explanatory measure of effect size)	✅	`WRS2::t1way`
Bayes Factor	> 2	$R_{Bayesian}^2$	✅	`performance::r2_bayes`
Parametric	2	Cohen’s d, Hedges’ g	✅	`effectsize::cohens_d`, `effectsize::hedges_g`
Non-parametric	2	r (rank-biserial correlation)	✅	`effectsize::rank_biserial`
Robust	2	$\delta_{R}^{AKP}$ (Algina-Keselman-Penfield robust standardized difference)	✅	`WRS2::akp.effect`
Bayesian	2	$\delta_{posterior}$	✅	`bayestestR::describe_posterior`

within-subjects

Hypothesis testing

Type	No. of groups	Test	Function used
Parametric	> 2	One-way repeated measures ANOVA	`afex::aov_ez`
Non-parametric	> 2	Friedman rank sum test	`stats::friedman.test`
Robust	> 2	Heteroscedastic one-way repeated measures ANOVA for trimmed means	`WRS2::rmanova`
Bayes Factor	> 2	One-way repeated measures ANOVA	`BayesFactor::anovaBF`
Parametric	2	Student’s t-test	`stats::t.test`
Non-parametric	2	Wilcoxon signed-rank test	`stats::wilcox.test`
Robust	2	Yuen’s test on trimmed means for dependent samples	`WRS2::yuend`
Bayesian	2	Student’s t-test	`BayesFactor::ttestBF`

Effect size estimation

Type	No. of groups	Effect size	CI?	Function used
Parametric	> 2	$\eta_{p}^2$ , $\omega_{p}^2$	✅	`effectsize::omega_squared`, `effectsize::eta_squared`
Non-parametric	> 2	$W_{Kendall}$ (Kendall’s coefficient of concordance)	✅	`effectsize::kendalls_w`
Robust	> 2	$\delta_{R-avg}^{AKP}$ (Algina-Keselman-Penfield robust standardized difference average)	✅	`WRS2::wmcpAKP`
Bayes Factor	> 2	$R_{Bayesian}^2$	✅	`performance::r2_bayes`
Parametric	2	Cohen’s d, Hedges’ g	✅	`effectsize::cohens_d`, `effectsize::hedges_g`
Non-parametric	2	r (rank-biserial correlation)	✅	`effectsize::rank_biserial`
Robust	2	$\delta_{R}^{AKP}$ (Algina-Keselman-Penfield robust standardized difference)	✅	`WRS2::wmcpAKP`
Bayesian	2	$\delta_{posterior}$	✅	`bayestestR::describe_posterior`

`one_sample_test`

Hypothesis testing

Type	Test	Function used
Parametric	One-sample Student’s t-test	`stats::t.test`
Non-parametric	One-sample Wilcoxon test	`stats::wilcox.test`
Robust	Bootstrap-t method for one-sample test	`trimcibt` (custom)
Bayesian	One-sample Student’s t-test	`BayesFactor::ttestBF`

Effect size estimation

Type	Effect size	CI?	Function used
Parametric	Cohen’s d, Hedges’ g	✅	`effectsize::cohens_d`, `effectsize::hedges_g`
Non-parametric	r (rank-biserial correlation)	✅	`effectsize::rank_biserial`
Robust	trimmed mean	✅	`trimcibt` (custom)
Bayes Factor	$\delta_{posterior}$	✅	`bayestestR::describe_posterior`

`corr_test`

Hypothesis testing and Effect size estimation

Type	Test	CI?	Function used
Parametric	Pearson’s correlation coefficient	✅	`correlation::correlation`
Non-parametric	Spearman’s rank correlation coefficient	✅	`correlation::correlation`
Robust	Winsorized Pearson correlation coefficient	✅	`correlation::correlation`
Bayesian	Pearson’s correlation coefficient	✅	`correlation::correlation`

`contingency_table`

two-way table

Hypothesis testing

Type	Design	Test	Function used
Parametric/Non-parametric	Unpaired	Pearson’s $\chi^2$ test	`stats::chisq.test`
Bayesian	Unpaired	Bayesian Pearson’s $\chi^2$ test	`BayesFactor::contingencyTableBF`
Parametric/Non-parametric	Paired	McNemar’s $\chi^2$ test	`stats::mcnemar.test`
Bayesian	Paired	❌	❌

Effect size estimation

Type	Design	Effect size	CI?	Function used
Parametric/Non-parametric	Unpaired	Cramer’s V	✅	`effectsize::cramers_v`
Bayesian	Unpaired	Cramer’s V	✅	`effectsize::cramers_v`
Parametric/Non-parametric	Paired	Cohen’s g	✅	`effectsize::cohens_g`
Bayesian	Paired	❌	❌	❌

one-way table

Hypothesis testing

Type	Test	Function used
Parametric/Non-parametric	Goodness of fit $\chi^2$ test	`stats::chisq.test`
Bayesian	Bayesian Goodness of fit $\chi^2$ test	(custom)

Effect size estimation

Type	Effect size	CI?	Function used
Parametric/Non-parametric	Cramer’s V	✅	`bayestestR::describe_posterior`
Bayesian	❌	❌	❌

`meta_analysis`

Hypothesis testing and Effect size estimation

Type	Test	Effect size	CI?	Function used
Parametric	Meta-analysis via random-effects models	$\beta$	✅	`metafor::rma`
Robust	Meta-analysis via robust random-effects models	$\beta$	✅	`metaplus::metaplus`
Bayes	Meta-analysis via Bayesian random-effects models	$\beta$	✅	`metaBMA::meta_random`

Usage in `ggstatsplot`

Note that these functions were initially written to display results from statistical tests on ready-made ggplot2 plots implemented in ggstatsplot.

For detailed documentation, see the package website: https://indrajeetpatil.github.io/ggstatsplot/

In ggstatsplot plots, these expressions are displayed in the subtitle.

Acknowledgments

The hexsticker and the schematic illustration of general workflow were generously designed by Sarah Otterstetter (Max Planck Institute for Human Development, Berlin).

Contributing

I’m happy to receive bug reports, suggestions, questions, and (most of all) contributions to fix problems and add features. I personally prefer using the GitHub issues system over trying to reach out to me in other ways (personal e-mail, Twitter, etc.). Pull Requests for contributions are encouraged.

Here are some simple ways in which you can contribute (in increasing order of commitment):

Read and correct any inconsistencies in the documentation
Raise issues about bugs or wanted features
Review code
Add new functionality (in the form of new plotting functions or helpers for preparing subtitles)

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
