Support list columns #9

Open
richierocks opened this issue Mar 29, 2017 · 3 comments

@richierocks

List columns can cause errors in clean().

library(tibble)
library(dataMaid)  # provides clean()
# A tibble whose only column, x, is a list column
d <- data_frame(x = as.list(rep(1:2, 5)))
clean(d, replace = TRUE)
## Error in UseMethod("check") : 
##   no applicable method for 'check' applied to an object of class "list"
## Error in `row.names<-.data.frame`(`*tmp*`, value = value) : 
##   invalid 'row.names' length
## Data cleaning is finished. Please wait while your output file is being rendered.

@annennenne
Collaborator

I am having a hard time coming up with relevant checks and summaries that would apply to all lists. The core idea of dataMaid is to perform a standard suite of checks for each variable class. Do you have suggestions for checks that would be relevant for lists? Or did you have a specific example in mind when you opened this issue?

@richierocks
Author

If you have a list column in a data frame, you typically want every element to have the same form. For example, strsplit() returns a list of character vectors, and you might want to store that list as a column in a data frame.
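
To make the strsplit() case concrete, here is a minimal base R sketch. The data and the column names (`id`, `raw`, `parts`) are made up for illustration and do not come from the issue.

```r
# Illustrative data only; the column names are assumptions.
d <- data.frame(
  id  = 1:3,
  raw = c("a,b", "c", "d,e,f"),
  stringsAsFactors = FALSE
)

# strsplit() returns a list with one character vector per input string,
# so assigning it creates a list column.
d$parts <- strsplit(d$raw, ",", fixed = TRUE)

is.list(d$parts)         # the new column is a list
sapply(d$parts, class)   # each element is a character vector
lengths(d$parts)         # the elements differ in length
```

Every element here has the same class, but the lengths vary from row to row, which is the kind of structure a list-column check would have to describe.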

So some useful checks on list columns are "Does each element have the same class/typeof/length/dim?".
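
A rough sketch of what such a check could look like in base R. The function name `list_column_consistency` is hypothetical and is not part of dataMaid; it only reports whether all elements agree on each property.

```r
# Hypothetical helper, not part of dataMaid.
# Reports, per property, whether all elements of a list column agree.
list_column_consistency <- function(x) {
  stopifnot(is.list(x))
  # TRUE when property function f gives the same value for every element
  same <- function(f) length(unique(lapply(x, f))) <= 1L
  c(
    class  = same(class),
    typeof = same(typeof),
    length = same(length),
    dim    = same(dim)
  )
}

# A list column where every element has the same form
homogeneous <- strsplit(c("a b", "c d"), " ", fixed = TRUE)
# A list column whose elements differ in every property
mixed <- list(1:3, "a", matrix(1:4, 2))

list_column_consistency(homogeneous)
list_column_consistency(mixed)
```

Returning one flag per property, rather than a single pass/fail, would let a user decide which of the four properties matter for a given column.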

@annennenne
Collaborator

I do see the point in your concrete example, but I'm concerned that other people use lists differently in their datasets. Personally, I would usually choose to store something in a list (rather than a vector) precisely because the entries have different data types or varying lengths. Even though that does not carry over directly to the role of lists in data.frames, I imagine there are others who think like me. So if clean() tests, for example, that all elements in a list variable have the same class, length and dimensions, I can easily imagine that nearly every list variable would be marked as problematic, since one rarely wants all of those features simultaneously. The list is such a flexible class that I'm afraid standardized problem flagging is simply not a suitable strategy for finding mistakes in it.

I will consider implementing a list extension for dataMaid, possibly with no default check/summarize/visualize functions, so that it is up to each (advanced) user to implement the tools they need. In either case, I will soon implement a check for variable class, so that the error is replaced by an informative message in the output report.
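
As an illustration of the class check described above, here is a minimal base R sketch. It is not dataMaid's implementation: the function name `describe_unsupported`, the set of supported classes, and the message wording are all assumptions.

```r
# Hypothetical sketch, not dataMaid's code.
# Screens each variable by class and returns a message for unsupported ones
# instead of letting a later check fail with an error.
describe_unsupported <- function(data,
                                 supported = c("character", "factor", "numeric",
                                               "integer", "logical", "Date")) {
  msgs <- character(0)
  for (name in names(data)) {
    cls <- class(data[[name]])
    if (!any(cls %in% supported)) {
      msgs[name] <- sprintf(
        "Variable '%s' has class '%s', which is not supported; it was skipped.",
        name, paste(cls, collapse = "/")
      )
    }
  }
  msgs
}

d <- data.frame(n = 1:2)
d$x <- list(1, "a")   # a list column
describe_unsupported(d)
```

The returned messages could then be written into the report for the affected variables, while the remaining variables are checked as usual.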
