[C++][Dataset] in expressions don't work with >1 partition levels #25668

asfimport · 2020-07-31T08:46:23Z

When filtering nested partitions using %in%, no rows are returned, both for Hive and non-Hive partitioning. == and other comparison operators do work, and the problem also goes away when only one partition level is declared in the schema.

This is not caused by the dplyr wrappers, the lower-level functions have the same problem.

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

## Write files
pqdir <- file.path(tempdir(), paste(sample(letters, 6), collapse = ""))

for (foo in 0:1) {
  for (faa in 0:1) {
    fdir <- file.path(pqdir, letters[foo + 1], letters[faa + 1])
    dir.create(fdir, recursive = TRUE)
    rng <- (foo * 5 + faa + 1):(foo * 5 + faa + 5)
    write_parquet(data.frame(col = letters[rng]),
                         file.path(fdir, "file.parquet"))
  }
}

## What doesn't work: using %in% with both partitions defined
ds <- open_dataset(pqdir,
                   partitioning = schema(foo = string(), faa = string()))

collect(filter(ds, foo %in% "a"))
#> # A tibble: 0 x 3
#> # ... with 3 variables: col <chr>, foo <chr>, faa <chr>

## == does work
collect(filter(ds, foo == "a"))
#> # A tibble: 10 x 3
#>    col   foo   faa  
#>    <chr> <chr> <chr>
#>  1 a     a     a    
#>  2 b     a     a    
#>  3 c     a     a    
#>  4 d     a     a    
#>  5 e     a     a    
#>  6 b     a     b    
#>  7 c     a     b    
#>  8 d     a     b    
#>  9 e     a     b    
#> 10 f     a     b

## Declaring only one partition does work
ds <- open_dataset(pqdir, partitioning = schema(foo = string()))
collect(filter(ds, foo %in% "a"))
#> # A tibble: 10 x 2
#>    col   foo  
#>    <chr> <chr>
#>  1 a     a    
#>  2 b     a    
#>  3 c     a    
#>  4 d     a    
#>  5 e     a    
#>  6 b     a    
#>  7 c     a    
#>  8 d     a    
#>  9 e     a    
#> 10 f     a

## The lower-level API has the same problem
ds <- open_dataset(pqdir,
                   partitioning = schema(foo = string(), faa = string()))

flt <- Expression$in_(Expression$field_ref("foo"), Array$create("a"))

sc <- Scanner$create(ds, filter = flt)
sc$ToTable()
#> Table
#> 0 rows x 3 columns
#> $col <string>
#> $foo <string>
#> $faa <string>

Environment: This is using the latest Github version using windows, but I also reproduce using the CRAN version and using Linux.

{code}
sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19041)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United Kingdom.1252
#> [2] LC_CTYPE=English_United Kingdom.1252
#> [3] LC_MONETARY=English_United Kingdom.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United Kingdom.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.0.0 arrow_1.0.0.9000
{code}
Reporter: Maarten Demeyer
Assignee: Ben Kietzman / @bkietz

PRs and other links:

GitHub Pull Request #7911

_{Note: This issue was originally created as ARROW-9606. Please see the migration documentation for further details.}

asfimport · 2020-07-31T10:33:55Z

Maarten Demeyer:
I should add that this doesn't happen when the partition column is also part of the table. Perhaps the file structure in this example is just incorrectly created then - but the behaviour is still inconsistent without any warning.

asfimport · 2020-07-31T16:03:16Z

Neal Richardson / @nealrichardson:
Thanks a lot for this detailed report. We'll take a look.

asfimport · 2020-08-04T08:36:44Z

Joris Van den Bossche / @jorisvandenbossche:
I see this as well with a Python reproducer:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# works OK with a single partition field
table = pa.table({'part': ['a', 'b']*10, 'col': np.arange(20)}) 
pq.write_to_dataset(table, "test_partitioned_filter_in", partition_cols=["part"])   
dataset = ds.dataset("test_partitioned_filter_in/", partitioning="hive")                                                                                                                                  

In [41]: dataset.to_table(filter=ds.field("part").isin(["a"])).to_pandas()                                                                                                                                         
Out[41]: 
   col part
0    0    a
1    2    a
2    4    a
3    6    a
4    8    a
5   10    a
6   12    a
7   14    a
8   16    a
9   18    a

# fails with multiple partition columns
table = pa.table({'part1': ['a', 'b']*10, 'part2': [1, 1, 2, 2]*5, 'col': np.arange(20)})  
pq.write_to_dataset(table, "test_partitioned_filter_in2", partition_cols=["part1", "part2"])  
dataset2 = ds.dataset("test_partitioned_filter_in2/", partitioning="hive")                                                                                                                                

In [45]: dataset2.to_table(filter=ds.field("part1").isin(["a"])).to_pandas()                                                                                                                                       
Out[45]: 
Empty DataFrame
Columns: [col, part1, part2]
Index: []

In [46]: dataset2.to_table(filter=ds.field("part1") == "a").to_pandas()                                                                                                                                            
Out[46]: 
   col part1  part2
0    0     a      1
1    4     a      1
2    8     a      1
3   12     a      1
4   16     a      1
5    2     a      2
6    6     a      2
7   10     a      2
8   14     a      2
9   18     a      2

asfimport · 2020-08-10T17:41:50Z

Neal Richardson / @nealrichardson:
Issue resolved by pull request 7911
#7911

asfimport closed this as completed Aug 10, 2020

asfimport assigned bkietz Jan 10, 2023

asfimport added this to the 1.0.1 milestone Jan 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++][Dataset] in expressions don't work with >1 partition levels #25668

[C++][Dataset] in expressions don't work with >1 partition levels #25668

asfimport commented Jul 31, 2020

asfimport commented Jul 31, 2020

asfimport commented Jul 31, 2020

asfimport commented Aug 4, 2020

asfimport commented Aug 10, 2020

[C++][Dataset] in expressions don't work with >1 partition levels #25668

[C++][Dataset] in expressions don't work with >1 partition levels #25668

Comments

asfimport commented Jul 31, 2020

PRs and other links:

asfimport commented Jul 31, 2020

asfimport commented Jul 31, 2020

asfimport commented Aug 4, 2020

asfimport commented Aug 10, 2020