[C++][Dataset] Handling of duplicate columns in Dataset factory and scanning #24407

@asfimport

Description

While testing duplicate column names, I ran into multiple issues:

  • The factory fails if there are duplicate column names, even for a single file
  • In addition, we should fix and/or test that the factory works for duplicate column names when the schemas are equal
  • Once a Dataset with duplicate column names is created, scanning without any column projection fails

My Python reproducer:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.fs

# create single parquet file with duplicated column names
table = pa.table([pa.array([1, 2, 3]), pa.array([4, 5, 6]), pa.array([7, 8, 9])], names=['a', 'b', 'a'])
pq.write_table(table, "data_duplicate_columns.parquet")

Factory fails:

dataset = ds.dataset("data_duplicate_columns.parquet", format="parquet")
...
~/scipy/repos/arrow/python/pyarrow/dataset.py in dataset(paths_or_factories, filesystem, partitioning, format)
    346 
    347     factories = [_ensure_factory(f, **kwargs) for f in paths_or_factories]
--> 348     return UnionDatasetFactory(factories).finish()
    349 
    350 

ArrowInvalid: Can't unify schema with duplicate field names.
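
The behaviour asked for in the second bullet above (accept duplicate names when all schemas are equal) can be sketched in plain Python. This is only an illustration under stated assumptions, not Arrow's implementation: schemas are modelled as lists of (name, type) tuples, and `duplicate_names` and `unify_schemas_sketch` are hypothetical helpers invented for this example.

```python
from collections import Counter


def duplicate_names(schema):
    """Return field names that occur more than once, in first-seen order."""
    counts = Counter(name for name, _ in schema)
    return [name for name in counts if counts[name] > 1]


def unify_schemas_sketch(schemas):
    """Hypothetical unification rule (not Arrow's actual algorithm).

    - If every schema is identical, return it unchanged, duplicates included.
    - Otherwise refuse schemas with duplicate names, because a by-name merge
      would be ambiguous, and merge the remaining fields by name.
    """
    if not schemas:
        raise ValueError("no schemas to unify")

    first = schemas[0]
    if all(s == first for s in schemas):
        # Equal schemas need no name-based merging at all.
        return list(first)

    for s in schemas:
        dups = duplicate_names(s)
        if dups:
            raise ValueError(
                f"cannot unify differing schemas with duplicate field names: {dups}"
            )

    merged = {}
    for s in schemas:
        for name, typ in s:
            # Keep the first type seen for a name; reject conflicting types.
            if merged.setdefault(name, typ) != typ:
                raise ValueError(f"conflicting types for field {name!r}")
    return list(merged.items())
```

With this rule, a single file whose schema is `[('a', 'int64'), ('b', 'int64'), ('a', 'int64')]` unifies trivially with itself, while mixing it with a different schema is still rejected.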

And when creating a Dataset manually:

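import pathlib
# Assumption: basedir (undefined in the original report) is the directory holding the file written above.
basedir = pathlib.Path(".")
schema = pa.schema([('a', 'int64'), ('b', 'int64'), ('a', 'int64')])
dataset = ds.FileSystemDataset(
    schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
    [str(basedir / "data_duplicate_columns.parquet")], [ds.ScalarExpression(True)])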

then scanning fails:

>>> dataset.to_table()
...
ArrowInvalid: Multiple matches for FieldRef.Name(a) in a: int64
b: int64
a: int64
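
The error message suggests that the default projection resolves each column by name, which is ambiguous when a name appears twice. The following plain-Python sketch illustrates that failure mode and one possible way around it (projecting by position). It is an assumption about the mechanism, not the actual scanner code; `resolve_field`, `project_all_by_name`, and `project_all_by_index` are hypothetical names.

```python
def resolve_field(names, name):
    """Resolve a field name to its single index, or raise if it is ambiguous."""
    matches = [i for i, n in enumerate(names) if n == name]
    if not matches:
        raise KeyError(f"no match for field {name!r}")
    if len(matches) > 1:
        raise ValueError(f"multiple matches for field {name!r} at indices {matches}")
    return matches[0]


def project_all_by_name(names):
    """Select every column by looking up its name: fails on duplicate names."""
    return [resolve_field(names, n) for n in names]


def project_all_by_index(names):
    """Select every column by position: unaffected by duplicate names."""
    return list(range(len(names)))


names = ["a", "b", "a"]  # same layout as the schema in the reproducer
try:
    project_all_by_name(names)
except ValueError as exc:
    print("by name:", exc)
print("by index:", project_all_by_index(names))
```

A positional projection keeps both `a` columns distinct, which is why a scan with no explicit projection would not need name lookup at all in this sketch.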

Reporter: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-8210. Please see the migration documentation for further details.
