Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Dataset] in expressions don't work with >1 partition levels #25668

Closed
asfimport opened this issue Jul 31, 2020 · 4 comments
Closed

[C++][Dataset] in expressions don't work with >1 partition levels #25668

asfimport opened this issue Jul 31, 2020 · 4 comments

Comments

@asfimport
Copy link

When filtering nested partitions using %in%, no rows are returned, both for Hive and non-Hive partitioning. == and other comparison operators do work, and the problem also goes away when only one partition level is declared in the schema.

This is not caused by the dplyr wrappers, the lower-level functions have the same problem.

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

## Write files
pqdir <- file.path(tempdir(), paste(sample(letters, 6), collapse = ""))

for (foo in 0:1) {
  for (faa in 0:1) {
    fdir <- file.path(pqdir, letters[foo + 1], letters[faa + 1])
    dir.create(fdir, recursive = TRUE)
    rng <- (foo * 5 + faa + 1):(foo * 5 + faa + 5)
    write_parquet(data.frame(col = letters[rng]),
                         file.path(fdir, "file.parquet"))
  }
}

## What doesn't work: using %in% with both partitions defined
ds <- open_dataset(pqdir,
                   partitioning = schema(foo = string(), faa = string()))

collect(filter(ds, foo %in% "a"))
#> # A tibble: 0 x 3
#> # ... with 3 variables: col <chr>, foo <chr>, faa <chr>

## == does work
collect(filter(ds, foo == "a"))
#> # A tibble: 10 x 3
#>    col   foo   faa  
#>    <chr> <chr> <chr>
#>  1 a     a     a    
#>  2 b     a     a    
#>  3 c     a     a    
#>  4 d     a     a    
#>  5 e     a     a    
#>  6 b     a     b    
#>  7 c     a     b    
#>  8 d     a     b    
#>  9 e     a     b    
#> 10 f     a     b

## Declaring only one partition does work
ds <- open_dataset(pqdir, partitioning = schema(foo = string()))
collect(filter(ds, foo %in% "a"))
#> # A tibble: 10 x 2
#>    col   foo  
#>    <chr> <chr>
#>  1 a     a    
#>  2 b     a    
#>  3 c     a    
#>  4 d     a    
#>  5 e     a    
#>  6 b     a    
#>  7 c     a    
#>  8 d     a    
#>  9 e     a    
#> 10 f     a

## The lower-level API has the same problem
ds <- open_dataset(pqdir,
                   partitioning = schema(foo = string(), faa = string()))

flt <- Expression$in_(Expression$field_ref("foo"), Array$create("a"))

sc <- Scanner$create(ds, filter = flt)
sc$ToTable()
#> Table
#> 0 rows x 3 columns
#> $col <string>
#> $foo <string>
#> $faa <string>

Environment: This is using the latest Github version using windows, but I also reproduce using the CRAN version and using Linux.

{code}
sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19041)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United Kingdom.1252
#> [2] LC_CTYPE=English_United Kingdom.1252
#> [3] LC_MONETARY=English_United Kingdom.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United Kingdom.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.0.0 arrow_1.0.0.9000
{code}
Reporter: Maarten Demeyer
Assignee: Ben Kietzman / @bkietz

PRs and other links:

Note: This issue was originally created as ARROW-9606. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Maarten Demeyer:
I should add that this doesn't happen when the partition column is also part of the table. Perhaps the file structure in this example is just incorrectly created then - but the behaviour is still inconsistent without any warning.

@asfimport
Copy link
Author

Neal Richardson / @nealrichardson:
Thanks a lot for this detailed report. We'll take a look.

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
I see this as well with a Python reproducer:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# works OK with a single partition field
table = pa.table({'part': ['a', 'b']*10, 'col': np.arange(20)}) 
pq.write_to_dataset(table, "test_partitioned_filter_in", partition_cols=["part"])   
dataset = ds.dataset("test_partitioned_filter_in/", partitioning="hive")                                                                                                                                  

In [41]: dataset.to_table(filter=ds.field("part").isin(["a"])).to_pandas()                                                                                                                                         
Out[41]: 
   col part
0    0    a
1    2    a
2    4    a
3    6    a
4    8    a
5   10    a
6   12    a
7   14    a
8   16    a
9   18    a

# fails with multiple partition columns
table = pa.table({'part1': ['a', 'b']*10, 'part2': [1, 1, 2, 2]*5, 'col': np.arange(20)})  
pq.write_to_dataset(table, "test_partitioned_filter_in2", partition_cols=["part1", "part2"])  
dataset2 = ds.dataset("test_partitioned_filter_in2/", partitioning="hive")                                                                                                                                

In [45]: dataset2.to_table(filter=ds.field("part1").isin(["a"])).to_pandas()                                                                                                                                       
Out[45]: 
Empty DataFrame
Columns: [col, part1, part2]
Index: []

In [46]: dataset2.to_table(filter=ds.field("part1") == "a").to_pandas()                                                                                                                                            
Out[46]: 
   col part1  part2
0    0     a      1
1    4     a      1
2    8     a      1
3   12     a      1
4   16     a      1
5    2     a      2
6    6     a      2
7   10     a      2
8   14     a      2
9   18     a      2

@asfimport
Copy link
Author

Neal Richardson / @nealrichardson:
Issue resolved by pull request 7911
#7911

@asfimport asfimport added this to the 1.0.1 milestone Jan 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants