Closed
Description
Apache Iceberg version
1.1.0 (latest release)
Query engine
Other
Please describe the bug 🐞
I start the Spark/Iceberg Docker containers (as explained here) and use the Getting Started notebook to create and populate the table nyc.taxis.
Then I use PyIceberg to access the data. I can list the tables and get basic info about the table, but when I query it I get the following error:
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/xyz/data/data-research/apache-iceberg/src/micro.py", line 16, in <module>
results = [task.file.file_path for task in scan.plan_files()]
File "/Users/xyz/data/data-research/apache-iceberg/src/micro.py", line 16, in <listcomp>
results = [task.file.file_path for task in scan.plan_files()]
File "/Users/xyz/data/data-research/venv/lib/python3.10/site-packages/pyiceberg/table/__init__.py", line 320, in plan_files
for manifest_file in snapshot.manifests(io)
File "/Users/xyz/data/data-research/venv/lib/python3.10/site-packages/pyiceberg/table/snapshots.py", line 116, in manifests
return list(read_manifest_list(file))
File "/Users/xyz/data/data-research/venv/lib/python3.10/site-packages/pyiceberg/manifest.py", line 153, in read_manifest_list
with AvroFile(input_file) as reader:
File "/Users/xyz/data/data-research/venv/lib/python3.10/site-packages/pyiceberg/avro/file.py", line 133, in __enter__
self.input_stream = BufferedReader(self.input_file.open())
File "/Users/xyz/data/data-research/venv/lib/python3.10/site-packages/pyiceberg/io/pyarrow.py", line 153, in open
input_file = self._filesystem.open_input_file(self._path)
File "pyarrow/_fs.pyx", line 770, in pyarrow._fs.FileSystem.open_input_file
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When reading information for key 'wh/nyc/taxis/metadata/snap-6907359110454980554-1-72c446bf-de84-4800-aa24-73b2dd64a259.avro' in bucket 'warehouse': AWS Error UNKNOWN (HTTP status 301) during HeadObject operation: No response body.
I would be very grateful if someone could take a look and give me some hints :)
Info about my system:
Docker info
docker images
REPOSITORY                TAG      IMAGE ID       CREATED       SIZE
minio/mc                  latest   8e2b3ca6225f   9 hours ago   139MB
minio/minio               latest   107801c34719   9 hours ago   246MB
tabulario/spark-iceberg   latest   731f180d545e   7 days ago    3.8GB
tabulario/iceberg-rest    0.2.0    d33a2980abc4   13 days ago   442MB
Pip freeze
pip freeze
appnope==0.1.3
asttokens==2.2.1
backcall==0.2.0
certifi==2022.12.7
cfgv==3.3.1
charset-normalizer==2.1.1
click==8.1.3
commonmark==0.9.1
decorator==5.1.1
distlib==0.3.6
executing==1.2.0
filelock==3.9.0
fsspec==2022.10.0
identify==2.5.16
idna==3.4
ipython==8.9.0
jedi==0.18.2
matplotlib-inline==0.1.6
mmhash3==3.0.1
nodeenv==1.7.0
numpy==1.24.1
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
platformdirs==2.6.2
pre-commit==3.0.2
prompt-toolkit==3.0.36
ptyprocess==0.7.0
pure-eval==0.2.2
py4j==0.10.9.5
pyarrow==11.0.0
pydantic==1.10.2
Pygments==2.14.0
pyiceberg==0.2.1
pyspark==3.3.1
PyYAML==6.0
requests==2.28.1
rich==12.6.0
six==1.16.0
stack-data==0.6.2
traitlets==5.8.1
typing_extensions==4.4.0
urllib3==1.26.14
virtualenv==20.17.1
wcwidth==0.2.6
zstandard==0.19.0
Code I'm running:
import os
from pyiceberg.catalog import load_catalog
os.environ["AWS_ACCESS_KEY_ID"] = "admin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "password"
os.environ["AWS_REGION"] = "us-east-1"
catalog = load_catalog("demo_catalog", uri="http://localhost:8181")
table = catalog.load_table("nyc.taxis")
print(table.identifier)
print(table.location())
print(table.schema())
scan = table.scan(selected_fields=("trip_distance", ))
results = [task.file.file_path for task in scan.plan_files()]
Full output:
python -m src.micro
('demo_catalog', 'nyc', 'taxis')
s3a://warehouse/wh/nyc/taxis
table {
1: VendorID: optional long
2: tpep_pickup_datetime: optional timestamptz
3: tpep_dropoff_datetime: optional timestamptz
4: passenger_count: optional double
5: trip_distance: optional double
6: RatecodeID: optional double
7: store_and_fwd_flag: optional string
8: PULocationID: optional long
9: DOLocationID: optional long
10: payment_type: optional long
11: fare_amount: optional double
12: extra: optional double
13: mta_tax: optional double
14: tip_amount: optional double
15: tolls_amount: optional double
16: improvement_surcharge: optional double
17: total_amount: optional double
18: congestion_surcharge: optional double
19: airport_fee: optional double
}
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/xyz/data/data-research/apache-iceberg/src/micro.py", line 16, in <module>
results = [task.file.file_path for task in scan.plan_files()]
File "/Users/xyz/data/data-research/apache-iceberg/src/micro.py", line 16, in <listcomp>
results = [task.file.file_path for task in scan.plan_files()]
File "/Users/xyz/data/data-research/venv/lib/python3.10/site-packages/pyiceberg/table/__init__.py", line 320, in plan_files
for manifest_file in snapshot.manifests(io)
File "/Users/xyz/data/data-research/venv/lib/python3.10/site-packages/pyiceberg/table/snapshots.py", line 116, in manifests
return list(read_manifest_list(file))
File "/Users/xyz/data/data-research/venv/lib/python3.10/site-packages/pyiceberg/manifest.py", line 153, in read_manifest_list
with AvroFile(input_file) as reader:
File "/Users/xyz/data/data-research/venv/lib/python3.10/site-packages/pyiceberg/avro/file.py", line 133, in __enter__
self.input_stream = BufferedReader(self.input_file.open())
File "/Users/xyz/data/data-research/venv/lib/python3.10/site-packages/pyiceberg/io/pyarrow.py", line 153, in open
input_file = self._filesystem.open_input_file(self._path)
File "pyarrow/_fs.pyx", line 770, in pyarrow._fs.FileSystem.open_input_file
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When reading information for key 'wh/nyc/taxis/metadata/snap-1143379398124310344-1-b8503783-03fc-4eed-9290-110e65ddf9a1.avro' in bucket 'warehouse': AWS Error UNKNOWN (HTTP status 301) during HeadObject operation: No response body.
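One thing that may be worth checking (not confirmed here): the script sets AWS credentials and a region but no S3 endpoint, so the PyArrow S3 client may be contacting real AWS S3 rather than the local MinIO container. An HTTP 301 on HeadObject is typical of a request that reached the wrong endpoint or region for a bucket. A minimal sketch of passing the endpoint through the catalog properties follows. It assumes MinIO is published on `http://localhost:9000` (the port is not stated in this report) and that this PyIceberg version forwards the `s3.endpoint`, `s3.access-key-id` and `s3.secret-access-key` properties to its FileIO. The helper names `build_catalog_properties` and `plan_file_paths` are hypothetical.

```python
from typing import Dict, List


def build_catalog_properties(
    rest_uri: str = "http://localhost:8181",
    s3_endpoint: str = "http://localhost:9000",  # assumption: MinIO port, not stated in the report
    access_key: str = "admin",
    secret_key: str = "password",
) -> Dict[str, str]:
    """Collect the REST catalog URI and the S3 FileIO settings in one dict."""
    return {
        "uri": rest_uri,
        "s3.endpoint": s3_endpoint,
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
    }


def plan_file_paths(properties: Dict[str, str]) -> List[str]:
    """Load nyc.taxis and return its data file paths.

    Needs pyiceberg installed and the docker containers running.
    """
    # Imported lazily so build_catalog_properties has no third-party dependency.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("demo_catalog", **properties)
    table = catalog.load_table("nyc.taxis")
    scan = table.scan(selected_fields=("trip_distance",))
    return [task.file.file_path for task in scan.plan_files()]
```

With the containers up, `plan_file_paths(build_catalog_properties())` would replace the last lines of the script above. If the 301 persists, the endpoint value or the port mapping of the MinIO container is the next thing to verify.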