Describe the bug
I have produced two Parquet files with the same schema and compression scheme (according to pyarrow) which, contrary to my (perhaps mistaken) expectation, yield columns of different types when added to a DataFusion session.
To Reproduce
Unfortunately the history of the interactive session in which I produced the two Parquet files has been lost. However, the files themselves, a.parquet and b.parquet, are attached in the tarball df_repro.tgz. According to pyarrow, they have the same number of rows and the same schemas:
>>> import pyarrow.parquet as pq
>>> path_a = "a.parquet"
>>> path_b = "b.parquet"
>>> a = pq.ParquetFile(path_a)
>>> b = pq.ParquetFile(path_b)
>>> a.schema == b.schema
True
>>> a.schema_arrow == b.schema_arrow
True
There is in particular a column Gene with arrow schema Gene: dictionary<values=string, indices=int32, ordered=0> in each case.
However, when the files are read into a DataFusion session, the Gene columns from the two Parquet files become table columns with different types:
>>> import datafusion as df
>>> ctx = df.SessionContext()
>>> ctx.sql(f"CREATE EXTERNAL TABLE a STORED AS PARQUET LOCATION '{path_a}'")
DataFrame()
++
++
>>> ctx.sql(f"CREATE EXTERNAL TABLE b STORED AS PARQUET LOCATION '{path_b}'")
DataFrame()
++
++
>>> ctx.sql("SELECT table_name, data_type FROM information_schema.columns WHERE column_name='Gene'")
DataFrame()
+------------+-------------------------------+
| table_name | data_type |
+------------+-------------------------------+
| a | Dictionary(Int32, Utf8) |
| b | Dictionary(UInt32, LargeUtf8) |
+------------+-------------------------------+
I regret that I can't reproduce the exact pyarrow operations that produced a.parquet and b.parquet from their common source; however, my expectation was that there would be a one-to-one (or many-to-one) mapping of arrow data types in the Parquet files to column types in DataFusion.
Interestingly, some modifications of b.parquet return it to a state where the Gene column has type Dictionary(Int32, Utf8) via DataFusion:
- take-ing rows (hence I was unable to produce small input Parquet files for the reproduction):
br = b.read()
pq.write_table(br.take(list(range(br.num_rows))), ...)
- dropping all columns but Gene:
pq.write_table(br.drop_columns([col for col in br.column_names if col != "Gene"]), ...)
Expected behavior
Quite possibly this is a misunderstanding on my part, but my expectation was that there would be a one-to-one (or possibly many-to-one) mapping between the Parquet files' Arrow data types and the resulting types of the columns in the DataFusion session. If the reproducible behavior above is expected, it is not clear to me from the documentation (https://arrow.apache.org/datafusion/user-guide/sql/data_types.html) which properties of the two Parquet files are responsible for the different behaviors.
Additional context
- Operating system: Rocky 8
- Python: 3.10.4
- datafusion module: 34.0.0
- pyarrow module: 15.0.0
df_repro.tgz