Multiple Columns as Group ID #3

Closed
recleev opened this issue Feb 7, 2019 · 6 comments

recleev commented Feb 7, 2019

I wanted to do something like this

library(cdata)

control_table <- qchar_frame(
  Part   , Measure , Value        |
  "Sepal", "Length", Sepal.Length |
  "Sepal", "Width" , Sepal.Width  |
  "Petal", "Length", Petal.Length |
  "Petal", "Width" , Petal.Width
)

rowrecs_to_blocks(iris, control_table)

But I get this error

Error in rowrecs_to_blocks.default(iris, control_table) : 
  cdata::rowrecs_to_blocks all control table group ids must be distinct

checkControlTable() assumes that the first column is the one and only id column, so it throws an error whenever the values in that column are not distinct.
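
To make the failure concrete, here is a minimal base-R sketch of such a distinctness check. The helper `keys_are_distinct()` is hypothetical and written only for illustration; it is not cdata's `checkControlTable()`. The control table is built with `data.frame()` so the snippet runs without cdata installed.

```r
# Control table from the example above, built with base R only.
control_table <- data.frame(
  Part    = c("Sepal", "Sepal", "Petal", "Petal"),
  Measure = c("Length", "Width", "Length", "Width"),
  Value   = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of cdata): TRUE when the chosen key
# columns identify every control-table row uniquely.
keys_are_distinct <- function(ct, key_cols = 1) {
  anyDuplicated(ct[, key_cols, drop = FALSE]) == 0
}

# Only the first column as key: "Sepal" and "Petal" each appear twice.
first_col_only <- keys_are_distinct(control_table)

# Part and Measure together as a composite key.
composite_key <- keys_are_distinct(control_table, c("Part", "Measure"))
```

With the first column alone the check fails, while the combination of Part and Measure passes, which is the behavior the proposal below asks for.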

I think cdata should be able to support a combination of multiple columns as ids. In my example above, the combination of Part and Measure constitutes a group id. Maybe an extra argument to specify the id cols in the control table can make this work?

Something like this?

rowrecs_to_blocks(iris, control_table, keyColumns = c("Part", "Measure"))

keyColumns (like the one in blocks_to_rowrecs()) could also take a vector of column indices to specify which columns to use as group ids. The default should be 1 to keep the current behavior.

Here is a workaround with {data.table}:

control_table_2 <- qchar_frame(
  Part.Measure  , Value        |
  "Sepal.Length", Sepal.Length |
  "Sepal.Width" , Sepal.Width  |
  "Petal.Length", Petal.Length |
  "Petal.Width" , Petal.Width
)

iris_long <- rowrecs_to_blocks(iris, control_table_2)

library(data.table)

iris_long <- as.data.table(iris_long)

iris_long[, c("Part", "Measure") := tstrsplit(Part.Measure, split = "\\.")]

# > iris_long
# Part.Measure      Value  Part  Measure
# 1: Sepal.Length     5.1 Sepal  Length
# 2: Sepal.Width      3.5 Sepal  Width
# 3: Petal.Length     1.4 Petal  Length
# 4: Petal.Width      0.2 Petal  Width
# 5: Sepal.Length     4.9 Sepal  Length
# ---                                 
# 596: Petal.Width    2.3 Petal  Width
# 597: Sepal.Length   5.9 Sepal  Length
# 598: Sepal.Width    3.0 Sepal  Width
# 599: Petal.Length   5.1 Petal  Length
# 600: Petal.Width    1.8 Petal  Width

Splitting columns is easy, but I was just wondering if it can be avoided.

JohnMount commented Feb 7, 2019

Good (and interesting) point. Thank you for taking the time to think about it and share it here.

As you commented, the current control table definition treats the first column in a special way: it is the record portion key. You are correct that more could be done if some set of columns in the control table were allowed to be the record portion keys. blocks_to_rowrecs() does allow some additional key columns from the incoming table; I need to think a bit about whether there is a way to extend rowrecs_to_blocks() as you describe (and what the consequences would be).

This is an asymmetry: column names are single strings (though I think Pandas can be more general than this), whereas, as you point out, composite keys are very natural for rows.

Your current workaround is in fact what I would have suggested, but perhaps we can automate this.

recleev commented Feb 7, 2019

Thanks. I use {cdata} whenever I can in my work. controlTable makes data transformation almost automatic and intuitive.

I have not thought of any consequences yet, but I hope there is nothing major and that the benefits will outweigh them.

JohnMount commented Feb 8, 2019

I'll probably add the feature soon. The nice thing is that if cdata itself does it, there is no string manipulation, as all the transforms are in the control table. I had thought about things like this, but decided they might be too advanced/baroque for the initial versions. But now at least one user has thought of it, and I think it regularizes the teaching by making things more uniform.

I have started to lay down infrastructure to handle the feature and entered a (failing) test to track progress on it.

JohnMount commented

Got it! (This is in the development version, for local ops; I will probably get to the database versions later.)

library(cdata)
library(rqdatatable)
#> Loading required package: rquery

d <- iris
d$id <- seq_len(nrow(d))

control_table <- qchar_frame(
  Part,  Measure, Value       |
  Sepal, Length, Sepal.Length |
  Sepal, Width, Sepal.Width   |
  Petal, Length, Petal.Length |
  Petal, Width, Petal.Width   )

d %.>%
  rowrecs_to_blocks(
    .,
    control_table,
    controlTableKeys = c("Part", "Measure"),
    columnsToCopy = c("id", "Species")) %.>%
  orderby(., c("id", "Part", "Measure")) %.>%
  head(.)
#>    id Species  Part Measure Value
#> 1:  1  setosa Petal  Length   1.4
#> 2:  1  setosa Petal   Width   0.2
#> 3:  1  setosa Sepal  Length   5.1
#> 4:  1  setosa Sepal   Width   3.5
#> 5:  2  setosa Petal  Length   1.4
#> 6:  2  setosa Petal   Width   0.2
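
For readers who want to see what the composite-key transform does conceptually, the following is a rough base-R sketch under stated assumptions. The helper `to_blocks()` and its argument names are hypothetical and do not reflect cdata's internals or API; it simply emits one block of rows per control-table row, fills the key columns with that row's constants, and reads the value from the column named in the `Value` cell.

```r
# Sketch only: mimics the row-records-to-blocks idea with base R.
d <- iris
d$id <- seq_len(nrow(d))

control_table <- data.frame(
  Part    = c("Sepal", "Sepal", "Petal", "Petal"),
  Measure = c("Length", "Width", "Length", "Width"),
  Value   = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of cdata): one output block per control-table row.
to_blocks <- function(wide, ct, key_cols, value_col, copy_cols) {
  pieces <- lapply(seq_len(nrow(ct)), function(i) {
    block <- wide[, copy_cols, drop = FALSE]
    for (k in key_cols) {
      block[[k]] <- ct[[k]][i]  # key cells are constants from the control table
    }
    # The value cell names the source column in the wide table.
    block[[value_col]] <- wide[[ct[[value_col]][i]]]
    block
  })
  out <- do.call(rbind, pieces)
  # Sort by the first copied column, then by the key columns.
  sort_cols <- unname(as.list(out[, c(copy_cols[1], key_cols), drop = FALSE]))
  out <- out[do.call(order, sort_cols), , drop = FALSE]
  rownames(out) <- NULL
  out
}

iris_blocks <- to_blocks(
  d, control_table,
  key_cols  = c("Part", "Measure"),
  value_col = "Value",
  copy_cols = c("id", "Species")
)
head(iris_blocks)
```

Because both key columns come straight from the control table, no string splitting is needed, which is the point of the `controlTableKeys` argument shown above.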

recleev commented Feb 8, 2019

Wow. What a quick update. Will try it out as soon as I can, although I may wait for the CRAN update. Thanks!

JohnMount commented

I found it an interesting possibility. Your example is now written up as a vignette.
