Thanks @rempsyc for your message (and on the team's behalf for your sponsorship 💌), and yes, we'll gladly look into areas of possible overlap & improvement. As for check_outliers, that's a function that could benefit from some love and improvements. In general, the path for change looks like this:
All that to say, given that 1) is done and 3) is WIP (I'm sure @mattansb wouldn't mind you giving #49 a go too, though), do not hesitate to open a PR in performance for 2) to improve check_outliers(). Perhaps adding a vignette on outlier detection would also be worth looking into.
MAD workflow & package overlap
Like many others, my colleagues and I have over time developed a couple of functions for data processing, before we knew about the easyverse and datawizard.
I have realized that some of these functions, which I have integrated in the rempsyc package, overlap with the easyverse. Of course, ideally one would rely entirely on the easyverse to avoid redundancy and scattering (also, less maintenance).
For example, one common workflow taught in our R stats university class is to standardize data, identify outliers, and finally winsorize, all using the MAD (median absolute deviation), based on this publication:
Here I would like to compare this workflow between the two packages, see what could be deprecated/removed from rempsyc, and what would require further implementation/refinement in datawizard.
Standardize based on MAD
Created on 2022-06-12 by the reprex package (v2.0.1)
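For readers without either package at hand, MAD standardization can be sketched in base R. This is only a minimal illustration, not the reprex from the post: scale_mad_sketch and the sample vector are hypothetical, and the sketch assumes the usual robust z-score, (x - median) / MAD, with the default consistency constant of stats::mad() (1.4826).

```r
# Hypothetical helper (not part of rempsyc or datawizard):
# robust z-scores based on the median and the MAD
# instead of the mean and the standard deviation.
scale_mad_sketch <- function(x) {
  # stats::mad() applies the 1.4826 consistency constant by default
  (x - stats::median(x, na.rm = TRUE)) / stats::mad(x, na.rm = TRUE)
}

# Illustrative data, not taken from the original post
x <- c(2, 4, 4, 5, 6, 7, 40)
z <- scale_mad_sketch(x)
round(z, 2)
```

After this transformation the variable has a median of 0 and a MAD of 1, which is what makes a fixed cut-off on the result meaningful in the later steps.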
Conclusion: They provide the same results (except for the extra attributes in datawizard). That's a perfect scenario. It suggests I could deprecate/remove rempsyc::scale_mad and point users/colleagues to datawizard::standardize instead. That's a good start!
Find outliers based on MAD
(Here we have to reach for performance instead of datawizard)
Created on 2022-06-12 by the reprex package (v2.0.1)
Same result. Fantastic! So far, so good. However, their functionality differs a bit when feeding several variables.
Created on 2022-06-12 by the reprex package (v2.0.1)
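To make the multi-variable case concrete, here is a rough base-R sketch of per-variable flagging plus a per-row count. It is not the implementation of find_mad or check_outliers: flag_mad_sketch, the data, and the cut-off of 3 MADs are all assumptions chosen for illustration.

```r
# Hypothetical helper: TRUE where a value lies more than `criteria`
# MADs away from the median of its variable.
flag_mad_sketch <- function(x, criteria = 3) {
  z <- (x - stats::median(x, na.rm = TRUE)) / stats::mad(x, na.rm = TRUE)
  abs(z) > criteria
}

# Illustrative data, not taken from the original post
dat <- data.frame(
  a = c(1, 2, 3, 4, 5, 6, 100),
  b = c(10, 11, 12, 13, 14, 15, 500),
  c = c(-80, 2, 3, 4, 5, 6, 7)
)

# Logical matrix: one row per observation, one column per variable
flags <- vapply(dat, flag_mad_sketch, logical(nrow(dat)))

# Which rows are outliers on each variable
outliers_by_variable <- lapply(as.data.frame(flags), which)

# How many variables flag each row
times_flagged <- rowSums(flags)
```

Keeping the logical matrix around is the design choice that matters here: both the per-variable listing and the per-row count are derived from it, so neither piece of information is lost.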
Whereas check_outliers provides an indiscriminate list of outliers for all variables (so you don't know for which variable each person was an outlier), find_mad specifies each outlier per variable, but also counts how many times each row/observation is identified as an outlier. This can be useful, for example, to identify whether one participant provided particularly bad data, which manifests as being an outlier on several variables.
Perhaps a possible area of improvement? (At the same time, it might not be possible to modify performance::check_outliers like this if it's being used in specific ways under the hood for other functions.) No big deal though.
Winsorize based on MAD
Created on 2022-06-12 by the reprex package (v2.0.1)
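As a point of reference for the comparison, MAD-based winsorizing can be sketched in base R as clamping values to median ± k × MAD. This is an assumption-laden illustration rather than the code of rempsyc::winsorize_mad: winsorize_mad_sketch, the sample vector, and the default of k = 3 are hypothetical.

```r
# Hypothetical helper: pull values lying beyond `criteria` MADs from
# the median back to that boundary; values inside are left untouched.
winsorize_mad_sketch <- function(x, criteria = 3) {
  med <- stats::median(x, na.rm = TRUE)
  half_width <- criteria * stats::mad(x, na.rm = TRUE)
  pmin(pmax(x, med - half_width), med + half_width)
}

# Illustrative data, not taken from the original post
x <- c(1, 2, 3, 4, 5, 6, 100)
w <- winsorize_mad_sketch(x)
w
```

A percentile-based approach instead replaces a fixed share of the values in each tail, regardless of how extreme they are, which is why the two kinds of threshold can give different results on the same data.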
This time, the results differ, and for good reason. rempsyc::winsorize_mad uses the MAD, whereas datawizard::winsorize doesn't (my understanding from the documentation is that the threshold refers to the percentile). @mattansb did suggest allowing more thresholds for winsorizing in #49 (e.g., fixed values, relative Z score, relative robust Z score). I would suggest also adding the possibility to use the MAD.
Once that is done, I will be able to change my scripts and workflow to stick only to the easyverse for processing data, and encourage others to do the same.