[Python] Unable to read a ParquetDataset when schema validation is on. #23967

Closed
asfimport opened this issue Jan 30, 2020 · 5 comments

@asfimport

I was trying to read a subset of my Parquet files using the ParquetDataset object with a predefined schema. When it tries to validate the schema, to_arrow_schema is called, and my schema object does not support it. I don't know what is happening; here is a sample.

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

schema = pa.schema([
    pa.field("field1", pa.string()),
    pa.field("field2", pa.string()),
    pa.field("field3", pa.string()),
])

 ...

pq_dataset = pq.ParquetDataset(file_groups[0], schema=schema)

AttributeError: 'pyarrow.lib.Schema' object has no attribute 'to_arrow_schema'

If we check the type of the schema as defined above we get:

type(schema)
pyarrow.lib.Schema

But the required type according to the docs is pyarrow.parquet.Schema. I don't know how to produce an object of this type, since we are forbidden to use the Schema constructor directly.

If we check the implementation on GitHub, we land directly on this line:

dataset_schema = self.schema.to_arrow_schema()

Is this a problem in the schema builder or the parquet dataset object?

Environment:
_libgcc_mutex 0.1 main
arrow-cpp 0.15.1 py37h982ac2c_6 conda-forge
attrs 19.3.0 py_0 conda-forge
backcall 0.1.0 py_0 conda-forge
bleach 3.1.0 py_0 conda-forge
boost-cpp 1.70.0 h8e57a91_2 conda-forge
brotli 1.0.7 he1b5a44_1000 conda-forge
bzip2 1.0.8 h516909a_2 conda-forge
c-ares 1.15.0 h516909a_1001 conda-forge
ca-certificates 2019.11.28 hecc5488_0 conda-forge
certifi 2019.11.28 py37_0 conda-forge
decorator 4.4.1 py_0 conda-forge
defusedxml 0.6.0 py_0 conda-forge
double-conversion 3.1.5 he1b5a44_2 conda-forge
entrypoints 0.3 py37_1000 conda-forge
gflags 2.2.2 he1b5a44_1002 conda-forge
glog 0.4.0 he1b5a44_1 conda-forge
grpc-cpp 1.25.0 h213be95_2 conda-forge
icu 64.2 he1b5a44_1 conda-forge
importlib_metadata 1.4.0 py37_0 conda-forge
inflect 4.0.0 py37_1 conda-forge
ipykernel 5.1.4 py37h5ca1d4c_0 conda-forge
ipython 7.11.1 py37h5ca1d4c_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jaraco.itertools 5.0.0 py_0 conda-forge
jedi 0.16.0 py37_0 conda-forge
jinja2 2.10.3 py_0 conda-forge
jsonschema 3.2.0 py37_0 conda-forge
jupyter_client 5.3.4 py37_1 conda-forge
jupyter_core 4.6.1 py37_0 conda-forge
ld_impl_linux-64 2.33.1 h53a641e_7
libblas 3.8.0 14_openblas conda-forge
libcblas 3.8.0 14_openblas conda-forge
libedit 3.1.20181209 hc058e9b_0
libevent 2.1.10 h72c5cf5_0 conda-forge
libffi 3.2.1 hd88cf55_4
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_4 conda-forge
liblapack 3.8.0 14_openblas conda-forge
libopenblas 0.3.7 h5ec1e0e_6 conda-forge
libprotobuf 3.11.0 h8b12597_0 conda-forge
libsodium 1.0.17 h516909a_0 conda-forge
libstdcxx-ng 9.1.0 hdf63c60_0
lz4-c 1.8.3 he1b5a44_1001 conda-forge
markupsafe 1.1.1 py37h516909a_0 conda-forge
mistune 0.8.4 py37h516909a_1000 conda-forge
more-itertools 8.1.0 py_0 conda-forge
nbconvert 5.6.1 py37_0 conda-forge
nbformat 5.0.4 py_0 conda-forge
ncurses 6.1 he6710b0_1
notebook 6.0.3 py37_0 conda-forge
numpy 1.17.5 py37h95a1406_0 conda-forge
openssl 1.1.1d h516909a_0 conda-forge
pandas 0.25.3 py37hb3f55d8_0 conda-forge
pandoc 2.9.1.1 0 conda-forge
pandocfilters 1.4.2 py_1 conda-forge
parquet-cpp 1.5.1 2 conda-forge
parso 0.6.0 py_0 conda-forge
pexpect 4.8.0 py37_0 conda-forge
pickleshare 0.7.5 py37_1000 conda-forge
pip 20.0.2 py37_0
prometheus_client 0.7.1 py_0 conda-forge
prompt_toolkit 3.0.2 py_0 conda-forge
ptyprocess 0.6.0 py_1001 conda-forge
pyarrow 0.15.1 py37h8b68381_1 conda-forge
pygments 2.5.2 py_0 conda-forge
pyrsistent 0.15.7 py37h516909a_0 conda-forge
python 3.7.6 h0371630_2
python-dateutil 2.8.1 py_0 conda-forge
pytz 2019.3 py_0 conda-forge
pyzmq 18.1.1 py37h1768529_0 conda-forge
re2 2020.01.01 he1b5a44_0 conda-forge
readline 7.0 h7b6447c_5
send2trash 1.5.0 py_0 conda-forge
setuptools 45.1.0 py37_0
six 1.14.0 py37_0 conda-forge
snappy 1.1.7 he1b5a44_1003 conda-forge
sqlite 3.30.1 h7b6447c_0
terminado 0.8.3 py37_0 conda-forge
testpath 0.4.4 py_0 conda-forge
thrift-cpp 0.12.0 hf3afdfd_1004 conda-forge
tk 8.6.8 hbc83047_0
tornado 6.0.3 py37h516909a_0 conda-forge
traitlets 4.3.3 py37_0 conda-forge
uriparser 0.9.3 he1b5a44_1 conda-forge
wcwidth 0.1.8 py_0 conda-forge
webencodings 0.5.1 py_1 conda-forge
wheel 0.33.6 py37_0
xz 5.2.4 h14c3975_4
zeromq 4.3.2 he1b5a44_2 conda-forge
zipp 2.1.0 py_0 conda-forge
zlib 1.2.11 h7b6447c_3
zstd 1.4.4 h3b9ef0a_1 conda-forge
Reporter: Otávio Vasques

Note: This issue was originally created as ARROW-7727. Please see the migration documentation for further details.

@asfimport
Author

Kouhei Sutou / @kou:
@kszucs @wesm Is this a 0.16.0 blocker?

@asfimport
Author

Wes McKinney / @wesm:
No. This parameter is required to be a Parquet schema, not an Arrow schema.

https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L977

@asfimport
Author

Wes McKinney / @wesm:
This isn't a bug. The intention of the schema parameter is to pass a Parquet schema object obtained from the metadata of a particular file.

@asfimport
Author

Krisztian Szucs / @kszucs:
It is not a regression, and the docstring indicates that a ParquetSchema must be passed, so I wouldn't consider it a blocker.

@asfimport
Author

Otávio Vasques:
Is it possible to read a Parquet file by converting an Arrow schema to a Parquet schema?

Or is my only option to read the file as is?

@asfimport asfimport added this to the 0.16.0 milestone Jan 11, 2023