Parquet column arrow type does not determine DataFusion column data_type #576

**Describe the bug**
I have produced two Parquet files with the same schema and compression scheme (according to `pyarrow`) which, contrary to my (perhaps mistaken) expectation, yield columns of different types when added to a DataFusion session.

**To Reproduce**
Unfortunately the history of the interactive session in which I produced the two Parquet files has been lost. However, the files themselves, `a.parquet` and `b.parquet`, are attached in the tarball `df_repro.tgz`. According to `pyarrow`, they have the same number of rows and the same schema:
```python
>>> import pyarrow.parquet as pq
>>> path_a = "a.parquet"
>>> path_b = "b.parquet"
>>> a = pq.ParquetFile(path_a)
>>> b = pq.ParquetFile(path_b)
>>> a.schema == b.schema
True
>>> a.schema_arrow == b.schema_arrow
True
```
There is in particular a column `Gene` with arrow schema `Gene: dictionary<values=string, indices=int32, ordered=0>` in each case.

However, when the files are read into a DataFusion session, the `Gene` column from each Parquet file becomes a table column of a different type:

```python
>>> import datafusion as df
>>> ctx = df.SessionContext()
>>> ctx.sql(f"CREATE EXTERNAL TABLE a STORED AS PARQUET LOCATION '{path_a}'")
DataFrame()
++
++
>>> ctx.sql(f"CREATE EXTERNAL TABLE b STORED AS PARQUET LOCATION '{path_b}'")
DataFrame()
++
++
>>> ctx.sql("SELECT table_name, data_type FROM information_schema.columns WHERE column_name='Gene'")
DataFrame()
+------------+-------------------------------+
| table_name | data_type                     |
+------------+-------------------------------+
| a          | Dictionary(Int32, Utf8)       |
| b          | Dictionary(UInt32, LargeUtf8) |
+------------+-------------------------------+
```

I regret that I can't reproduce the exact `pyarrow` operations that produced `a.parquet` and `b.parquet` from their common source; however, my expectation was that there would be a one-to-one (or many-to-one) mapping of arrow data types in the Parquet files to column types in DataFusion.

Interestingly, some modifications of `b.parquet` return it to a state where the `Gene` column has type `Dictionary(Int32, Utf8)` via DataFusion:

- `take`-ing rows (hence I was unable to produce small input Parquet files for the reproduction):
```python
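# `b` is the pq.ParquetFile opened above; read it fully into a pyarrow Table.
br = b.read()
# Take every row in its original order, then write the result back out.
pq.write_table(br.take(list(range(br.num_rows))), ...)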
```
- dropping all columns but `Gene`:
```python
# Keep only `Gene` by dropping every other column of the table read above.
pq.write_table(br.drop_columns([col for col in br.column_names if col != "Gene"]), ...)
```

**Expected behavior**
Quite possibly this is a misunderstanding on my part, but my expectation was that there would be a one-to-one (or possibly many-to-one) mapping between the Parquet files' arrow data types and the resulting types of the columns in the DataFusion session. If the reproducible behavior above is expected, it is not clear to me from the documentation (https://arrow.apache.org/datafusion/user-guide/sql/data_types.html) which properties of the two Parquet files are responsible for the different behaviors.
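
To state that expectation precisely: a one-to-one or many-to-one mapping means no single arrow type is ever observed mapping to two different DataFusion types. The sketch below checks this property on the two observations reported above; `mapping_violations` is a hypothetical helper written for illustration, not part of `pyarrow` or `datafusion`.

```python
from collections import defaultdict


def mapping_violations(pairs):
    """Return source types observed mapping to more than one target type."""
    seen = defaultdict(set)
    for source_type, target_type in pairs:
        seen[source_type].add(target_type)
    return {src: sorted(tgts) for src, tgts in seen.items() if len(tgts) > 1}


# The arrow type pyarrow reports for `Gene` in both files.
ARROW_GENE = "dictionary<values=string, indices=int32, ordered=0>"

# (arrow type, DataFusion data_type) pairs as reported in this issue.
observed = [
    (ARROW_GENE, "Dictionary(Int32, Utf8)"),        # table a
    (ARROW_GENE, "Dictionary(UInt32, LargeUtf8)"),  # table b
]

violations = mapping_violations(observed)
print(violations)
```

A non-empty result, as here, is exactly the situation I did not expect: the same arrow type yielding two distinct DataFusion column types.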

**Additional context**

- Operating system: Rocky 8
- Python: 3.10.4
- datafusion module: 34.0.0
- pyarrow module: 15.0.0

[df_repro.tgz](https://github.com/apache/arrow-datafusion-python/files/14103940/df_repro.tgz)
