Full implementation of the MAD workflow #177

rempsyc · 2022-06-12T23:36:33Z

rempsyc
Jun 12, 2022
Maintainer Sponsor

MAD workflow & package overlap

Like many others, my colleagues and I have over time developed a couple of functions for data processing—before we knew about the easyverse and datawizard.

I have realized that some of these functions, which I have integrated in the rempsyc package, overlap with the easyverse. Of course, ideally one would rely entirely on the easyverse to avoid redundancy and scattering (also, less maintenance).

For example, one common workflow taught in our R stats university class is to standardize data, identify outliers, and finally winsorize, all using the MAD (median absolute deviation), based on this publication:

Leys, C., Ley, C., Klein, O., Bernard, P., & Licata, L. (2013). Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49(4), 764–766. https://doi.org/10.1016/j.jesp.2013.03.013

Here I would like to compare this workflow between the two packages, see what could be deprecated/removed from rempsyc, and what would require further implementation/refinement in datawizard.

Standardize based on MAD

(a <- datawizard::standardize(iris[1:10, "Petal.Length"], robust = TRUE))
#>  [1]  0.000000  0.000000 -1.348982  1.348982  0.000000  4.046945  0.000000
#>  [8]  1.348982  0.000000  1.348982
#> attr(,"center")
#> [1] 1.4
#> attr(,"scale")
#> [1] 0.07413
#> attr(,"robust")
#> [1] TRUE
(b <- rempsyc::scale_mad(iris[1:10, "Petal.Length"]))
#>  [1]  0.000000  0.000000 -1.348982  1.348982  0.000000  4.046945  0.000000
#>  [8]  1.348982  0.000000  1.348982
a == b
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Created on 2022-06-12 by the reprex package (v2.0.1)

Conclusion: They provide the same results (except for the extra attributes in datawizard). That's a perfect scenario. It suggests I could deprecate/remove rempsyc::scale_mad and point users/colleagues to datawizard::standardize instead. That's a good start!
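
For reference, the robust standardization compared above can be reproduced in base R. This is a minimal sketch, assuming the usual definition (center on the median, scale by the MAD, with the default consistency constant of 1.4826 applied by stats::mad()); it is not the internal code of either package. It also compares with all.equal() rather than ==, since exact equality on floating-point results can fail for reasons unrelated to the method.

```r
x <- iris[1:10, "Petal.Length"]

# Robust z-score: center on the median, scale by the MAD.
# stats::mad() multiplies the raw MAD by 1.4826 by default.
center <- median(x)
spread <- mad(x)
z_manual <- (x - center) / spread

center  # corresponds to the "center" attribute printed above
spread  # corresponds to the "scale" attribute printed above

# Tolerance-based comparison against the values printed above
z_printed <- c(0, 0, -1.348982, 1.348982, 0, 4.046945, 0, 1.348982, 0, 1.348982)
all.equal(z_manual, z_printed, tolerance = 1e-5)
```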

Find outliers based on MAD

(Here we have to reach for performance instead of datawizard)

performance::check_outliers(mtcars$mpg, method = "zscore_robust", threshold = 2)
#> Warning: 4 outliers detected (cases 18, 19, 20, 28).

rempsyc::find_mad(mtcars, "mpg", criteria = 2)
#> 4 outlier(s) based on 2 median absolute deviations for variable(s): 
#>  mpg,  
#> 
#> Outliers per variable:
#> $mpg
#>   Row  mpg
#> 1  18 32.4
#> 2  19 30.4
#> 3  20 33.9
#> 4  28 30.4

Same result. Fantastic! So far, so good. However, their functionality differs a bit when they are given several variables.

mtcars2 <- cbind(car = row.names(mtcars), mtcars)[1:10, 1:5]

performance::check_outliers(mtcars2, method = "zscore_robust", threshold = 3)
#> Warning: 2 outliers detected (cases 5, 7).

rempsyc::find_mad(data = mtcars2, names(mtcars2)[2:5], criteria = 3, ID = "car")
#> 2 outlier(s) based on 3 median absolute deviations for variable(s): 
#>  mpg,  cyl,  disp,  hp,  
#> 
#> The following participants were considered outliers for more than one variable: 
#> 
#>   Row               car n
#> 1   5 Hornet Sportabout 2
#> 2   7        Duster 360 2
#> 
#> Outliers per variable:
#> 
#> $disp
#>   Row               car disp
#> 1   5 Hornet Sportabout  360
#> 2   7        Duster 360  360
#> 
#> $hp
#>   Row               car  hp
#> 1   5 Hornet Sportabout 175
#> 2   7        Duster 360 245

Whereas check_outliers returns a single undifferentiated list of outliers across all variables (so you don't know which variable each case was an outlier on), find_mad lists the outliers for each variable and also counts how many times each row/observation is identified as an outlier. This can be useful, for example, to identify whether one participant provided particularly bad data, which shows up as being an outlier on several variables.
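
To make the per-variable idea concrete, here is a rough base-R sketch of that kind of output. flag_mad_outliers() is a hypothetical helper written for this illustration, not a function from performance or rempsyc, and it assumes the same robust z-score rule as above (absolute deviation from the median divided by the MAD, compared against a threshold).

```r
# Hypothetical helper, for illustration only (not part of any package)
flag_mad_outliers <- function(data, vars, threshold = 3) {
  # One logical column per variable: TRUE where the robust z-score exceeds the threshold
  flags <- sapply(data[vars], function(x) {
    abs(x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE) > threshold
  })
  rownames(flags) <- rownames(data)
  list(
    per_variable = lapply(as.data.frame(flags), which),  # row indices per variable
    n_per_row = rowSums(flags, na.rm = TRUE)             # how often each row is flagged
  )
}

res <- flag_mad_outliers(mtcars[1:10, ], c("mpg", "cyl", "disp", "hp"), threshold = 3)
res$per_variable
res$n_per_row[res$n_per_row > 0]
```

Note that a variable with a MAD of zero would produce infinite or missing z-scores here; a real implementation would need to handle that case explicitly.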

Perhaps a possible area of improvement? (At the same time, it might not be possible to modify performance::check_outliers like this if it's being used in specific ways under the hood for other functions.) No big deal though.

Winsorize based on MAD

(a <- datawizard::winsorize(iris$Sepal.Length, threshold = 0.2) |> head())
#> [1] 5.1 5.0 5.0 5.0 5.0 5.4

(b <- rempsyc::winsorize_mad(iris$Sepal.Length, criteria = 3) |> head())
#> [1] 5.1 4.9 4.7 4.6 5.0 5.4

a == b
#> [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE

This time, the results differ, and for good reason. rempsyc::winsorize_mad uses the MAD, whereas datawizard::winsorize doesn't (my understanding from the documentation is that the threshold refers to the percentile). @mattansb did suggest allowing more thresholds for winsorizing in #49 (e.g., fixed values, relative Z score, relative robust Z score). I would suggest also adding the possibility to use the MAD.
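
As an illustration of what a MAD-based option could look like, here is a minimal base-R sketch. winsorize_mad_sketch() is a hypothetical name, and the rule it applies (clamping values to median ± criteria × MAD) is an assumption about the intended behavior, not the actual code of rempsyc::winsorize_mad or a proposed datawizard API.

```r
# Hypothetical helper, for illustration only
winsorize_mad_sketch <- function(x, criteria = 3) {
  med <- median(x, na.rm = TRUE)
  spread <- mad(x, na.rm = TRUE)  # includes the default 1.4826 constant
  lower <- med - criteria * spread
  upper <- med + criteria * spread
  # Pull values outside [lower, upper] back to the nearest bound
  pmin(pmax(x, lower), upper)
}

# A toy vector with one extreme value: only that value is pulled in
winsorize_mad_sketch(c(1, 2, 3, 4, 100))

# No Sepal.Length value lies beyond 3 MADs, so the data are returned unchanged,
# which is consistent with the rempsyc output shown above
head(winsorize_mad_sketch(iris$Sepal.Length))
```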

Once that is done, I will be able to change my scripts and workflow to stick only to the easyverse for processing data, and encourage others to do the same.

DominiqueMakowski · 2022-06-13T01:41:49Z

DominiqueMakowski
Jun 13, 2022
Maintainer

Thanks @rempsyc for your message (and on the team's behalf for your sponsorship 💌), and yes, we'll gladly look into areas of possible overlap & improvement.

As for check_outliers, that's a function that could benefit from some love and improvements. In general, the path for change looks like this:

The return object: the base class of the object that the function returns. This is the hardest thing to change. For instance, if a function returns, say, a vector of row names, changing its output to a data frame is risky because it could lead to a breaking change somewhere else in easystats or in users' scripts. So this requires weighing the pros and cons.
The attributes: as you know, we use attributes quite a lot. That system is one great feature of R, as it allows us to attach arbitrary additional information without changing the return object. For check_outliers, my guess is that most of the improvements would find their way here.
The printing: the printing of check_outliers is currently super basic, so we could definitely improve it to be more detailed. Something like what find_mad does is a great start, so if you want to take it in this direction, feel free to open a PR. The printing can use the info stored in the attributes.
The methods: sometimes we don't want to change the return object, but we would like easy access to either an attribute or some transformation of it, so we can always add methods (either from base or from easystats: summary(), predict(), as.data.frame(), report(), ...) that can alter the return object.
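
A small sketch of how these four layers can fit together in plain S3 follows. The class name mad_outliers_sketch and the constructor are hypothetical, invented for illustration; this is not how check_outliers is implemented. The return object stays a plain integer vector of row indices, the extra information travels as attributes, and the print and as.data.frame methods only read those attributes.

```r
# Hypothetical constructor: the return object is still an integer vector of row indices
new_mad_outliers_sketch <- function(x, threshold = 3) {
  z <- abs(x - median(x)) / mad(x)
  out <- which(z > threshold)
  attr(out, "threshold") <- threshold  # extra info stored as attributes
  attr(out, "z") <- z
  class(out) <- c("mad_outliers_sketch", class(out))
  out
}

# Printing: more detailed output built from the attributes
print.mad_outliers_sketch <- function(x, ...) {
  cat(length(x), "outlier(s) beyond", attr(x, "threshold"),
      "MAD (rows:", paste(as.integer(x), collapse = ", "), ")\n")
  invisible(x)
}

# Method: a different view of the same object, without changing what the function returns
as.data.frame.mad_outliers_sketch <- function(x, ...) {
  rows <- as.integer(x)
  data.frame(Row = rows, z = attr(x, "z")[rows])
}

out <- new_mad_outliers_sketch(mtcars$mpg, threshold = 2)
print(out)
as.data.frame(out)
```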

All that to say, given that 1) standardizing is done and 3) winsorizing is WIP (I'm sure @mattansb wouldn't mind you giving #49 a go too, though), do not hesitate to open a PR in performance for 2) to improve check_outliers(). Perhaps adding a vignette on outlier detection would also be something to look into.

4 replies

rempsyc Jun 13, 2022
Maintainer Author Sponsor

Fantastic! I am not familiar (yet) with methods and attributes (or, I'm ashamed to say, pull requests), but this is the perfect opportunity to develop those skills! Thank you!

DominiqueMakowski Jun 13, 2022
Maintainer

don't worry we'll assist you along the way ☺️

rempsyc Aug 25, 2022
Maintainer Author Sponsor

Standardize based on MAD
Find outliers based on MAD
Winsorize based on MAD

@DominiqueMakowski OFFICIALLY DONE BOOM WOOT WOOT

DominiqueMakowski Aug 25, 2022
Maintainer

🥇 🥳
