Parquet column arrow type does not determine DataFusion column data_type #576

@jwimberl

Describe the bug
I have produced two Parquet files with the same schema and compression scheme (according to pyarrow), but which, contrary to my (perhaps mistaken) expectation, yield columns of different types when added to a DataFusion session.

To Reproduce
Unfortunately, the history of the interactive session in which I produced the two Parquet files has been lost. However, the files themselves -- a.parquet and b.parquet -- are attached in the tarball df_repro.tgz. According to pyarrow, they have the same number of rows and the same schemas:

>>> import pyarrow.parquet as pq
>>> path_a = "a.parquet"
>>> path_b = "b.parquet"
>>> a = pq.ParquetFile(path_a)
>>> b = pq.ParquetFile(path_b)
>>> a.schema == b.schema
True
>>> a.schema_arrow == b.schema_arrow
True

There is in particular a column Gene with arrow schema Gene: dictionary<values=string, indices=int32, ordered=0> in each case.

However, when the files are read into a DataFusion session, the Gene column ends up with a different type in each of the two resulting tables:

>>> import datafusion as df
>>> ctx = df.SessionContext()
>>> ctx.sql(f"CREATE EXTERNAL TABLE a STORED AS PARQUET LOCATION '{path_a}'")
DataFrame()
++
++
>>> ctx.sql(f"CREATE EXTERNAL TABLE b STORED AS PARQUET LOCATION '{path_b}'")
DataFrame()
++
++
>>> ctx.sql("SELECT table_name, data_type FROM information_schema.columns WHERE column_name='Gene'")
DataFrame()
+------------+-------------------------------+
| table_name | data_type                     |
+------------+-------------------------------+
| a          | Dictionary(Int32, Utf8)       |
| b          | Dictionary(UInt32, LargeUtf8) |
+------------+-------------------------------+

I regret that I can't reproduce the exact pyarrow operations that produced a.parquet and b.parquet from their common source; however, my expectation was that there would be a one-to-one (or many-to-one) mapping of arrow data types in the Parquet files to column types in DataFusion.

Interestingly, some modifications of b.parquet return it to a state where the Gene column has type Dictionary(Int32, Utf8) via DataFusion:

  • take-ing rows (hence I was unable to produce small input Parquet files for the reproduction):
br = b.read()
pq.write_table(br.take([i for i in range(br.num_rows)]), ...)
  • dropping all columns but Gene:
pq.write_table(br.drop_columns([col for col in br.column_names if col != "Gene"]), ...)

Expected behavior
Quite possibly this is a misunderstanding on my part, but my expectation was that there would be a one-to-one (or possibly many-to-one) mapping between the Parquet files' Arrow data types and the resulting types of the columns in the DataFusion session. If the reproducible behavior above is expected, it is not clear to me from the documentation (https://arrow.apache.org/datafusion/user-guide/sql/data_types.html) which properties of the two Parquet files are responsible for the different behaviors.

Additional context

  • Operating system: Rocky 8
  • Python: 3.10.4
  • datafusion module: 34.0.0
  • pyarrow module: 15.0.0

df_repro.tgz
