Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle data.tables with column names containing spaces #149

Closed
wants to merge 4 commits into from

Conversation

stefanfritsch
Copy link

Hi,

I've made some changes to combine_cols_data.table() so it handles column names with spaces correctly.

It should also work with all other special characters as they're now converted to symbols before being evaluated.

  1. It adds a dependency on rlang (I've not inserted it in DESCRIPTION) but the package is already required by a number of suggests (e.g. ggplot2). All alternatives that I'm aware of are also considerably uglier.
  2. It modifies the input data.table. The existing function already did that (in dt[, eval(qq)]) but made assignments like this was unintended. I don't know if anything in tsbox requires an unmodified input dt; then we'd have to make a deep copy of the input.

Ciao,
Stefan

@christophsax
Copy link
Collaborator

Thanks.

rlang is not really an option here, I want to keep the dependency list short. I am sure there are ways to do this in base.

@stefanfritsch
Copy link
Author

I remembered that data.table nowadays supports mget. That makes it a lot less painful.

@codecov
Copy link

codecov bot commented May 24, 2019

Codecov Report

Merging #149 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #149      +/-   ##
==========================================
- Coverage   87.27%   87.25%   -0.03%     
==========================================
  Files          43       43              
  Lines        1635     1632       -3     
==========================================
- Hits         1427     1424       -3     
  Misses        208      208
Impacted Files Coverage Δ
R/data.table_helpers.R 84.61% <100%> (-1.6%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7b51d96...6e17587. Read the comment docs.

@stefanfritsch
Copy link
Author

Hi,

is there anything else you want me to change?

Best regards,
Stefan

@christophsax
Copy link
Collaborator

Hi Stefan, sorry for the delay. I am doing similar problematic eval(parse(text = ) things at other places, and wanted to check these first. do.call("paste", mget(cols)) may be a solution here, but perhaps there is a straightforward way to construct an expression evaluate in a data.table.

Will work on an update soon.

@stefanfritsch
Copy link
Author

You can just turn a list into a call. But stuff like that sometimes is problematic with data.table (as mget() used to be) because it does some trickery when parsing j.

x <- 1:13

c("paste",
  "letters",
  "LETTERS",
  "x") %>%
  map(as.name) %>%
  as.call %>%
  eval

For the current use case (i.e. calling a command on a list of arguments) I'd use do.call().

@christophsax
Copy link
Collaborator

Hi Stefan,

As I suspected, the problem exists everywhere in tsbox, see #151. But then it turns out that simply backticking the names resolves the problem, and works for any non-ASCII characters too.

Here is my solution: #155

Thanks for contributing.

@stefanfritsch
Copy link
Author

Your solution does not work if the column name itself contains backticks.

I'd advise against constructing calls as text and then using parse because it's so hard to know if and when it will break (cough - although security is obviously not an issue here). As far as I can see from a quick glance all your adjustments could be done with a do.call() and sometimes even end up shorter than they are now.

do_on_cols <- function(.function, .cols, ...) {
    env <- parent.frame()
    do.call(.function, c(as.list(.cols), ...), envir = env)
}

DT <- data.table(iris)
cols <- names(DT)[4:5]

DT[, do_on_cols("paste", mget(cols), sep = "_")]
DT[, do_on_cols("paste", .SD, sep = "_"), .SDcols = cols]

@christophsax
Copy link
Collaborator

I certainly agree that eval(parse(text = )) is not a very elegant way, but it gave a lot of flexibility.

do.call and mget does too much, I think: mget returns the data, which maybe ok in the case of the paste operation (but maybe not, because you may loose some of data.table optimizations).

In other cases, as for the by expression, I really just want to manipulate the expression.

So here is an attempt for a more elegant solution, by directly manipulating the expression>

by_expr2 <- function(x) {
  as.call(c(quote(list), lapply(x, as.name)))
}

by_expr2(c("hello", "fs%%df"))
#> list(hello, `fs%%df`)
by_expr2(c("hello", "fs%```%df"))
#> list(hello, `fs%\`\`\`%df`)
by_expr2(character(0))
#> list()


library(data.table)
.by <- by_expr2("Species")
as.data.table(iris)[, .SD[1], by = .by]
#>       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1:     setosa          5.1         3.5          1.4         0.2
#> 2: versicolor          7.0         3.2          4.7         1.4
#> 3:  virginica          6.3         3.3          6.0         2.5

Created on 2019-07-23 by the reprex package (v0.3.0)

Handles all the quoting stuff correctly, and does not produce any overhead.

Still objections?

@christophsax
Copy link
Collaborator

combine_cols_data.table now becomes:

combine_cols_data.table <- function(dt, cols, sep = '_') {
  paste_sep <- function(...) paste(..., sep = sep)
  qq <- as.call(c(quote(paste_sep), lapply(cols, as.name)))
  z <- dt[, id := eval(qq)]
  z[, (setdiff(cols, "id")) := NULL] # but this is the right way to do it
  setcolorder(z, c("id", setdiff(names(z), "id")))
  z[]
}

@christophsax
Copy link
Collaborator

Fix is on CRAN now, feel free to re-open if there are still issues

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants