Python: Integration tests #6398

Merged on Mar 15, 2023 (29 commits)

Commits
d4e1916  Integration tests (Fokko, Dec 5, 2022)
9c68ca2  Merge branch 'master' of github.com:apache/iceberg into fd-integratio… (Fokko, Dec 9, 2022)
05b8aed  First version (Fokko, Dec 9, 2022)
79b8e36  Add caching (Fokko, Dec 11, 2022)
58af0c3  Add caching (Fokko, Dec 11, 2022)
084ea4d  Merge branch 'master' of github.com:apache/iceberg into fd-integratio… (Fokko, Dec 20, 2022)
9b7fc33  Restore pyproject (Fokko, Dec 20, 2022)
b81b45f  WIP (Fokko, Jan 31, 2023)
cb2741b  Merge branch 'master' of github.com:apache/iceberg into fd-integratio… (Fokko, Feb 10, 2023)
3ff7427  NaN seems to be broken (Fokko, Feb 12, 2023)
cffa6cd  WIP (Fokko, Feb 13, 2023)
e3e70ae  Coming along (Fokko, Feb 13, 2023)
9f13128  Merge branch 'master' of github.com:apache/iceberg into fd-integratio… (Fokko, Feb 13, 2023)
ff08efc  Cleanup (Fokko, Feb 13, 2023)
3b564d0  Install duckdb (Fokko, Feb 13, 2023)
8cb8b9c  Cleanup (Fokko, Feb 13, 2023)
099d720  Revert changes to poetry (Fokko, Feb 14, 2023)
2b9836a  Merge branch 'master' of github.com:apache/iceberg into fd-integratio… (Fokko, Feb 14, 2023)
0f19e2f  Make it even nicer (Fokko, Feb 14, 2023)
c2635cf  Merge branch 'master' of github.com:apache/iceberg into fd-integratio… (Fokko, Feb 17, 2023)
0bc6861  Revert unneeded change (Fokko, Feb 23, 2023)
843e5f0  Merge branch 'master' of github.com:apache/iceberg into fd-integratio… (Fokko, Feb 23, 2023)
8972f22  Update Spark version (Fokko, Feb 25, 2023)
3516159  Make test passing (Fokko, Feb 28, 2023)
8205f34  Merge branch 'master' of github.com:apache/iceberg into fd-integratio… (Fokko, Mar 3, 2023)
6d857ba  Merge branch 'master' of github.com:apache/iceberg into fd-integratio… (Fokko, Mar 8, 2023)
89bf8f7  Merge branch 'master' of github.com:apache/iceberg into fd-integratio… (Fokko, Mar 12, 2023)
b1ec6a5  Merge branch 'master' of github.com:apache/iceberg into fd-integratio… (Fokko, Mar 15, 2023)
bf1d59a  comments (Fokko, Mar 15, 2023)
87 changes: 87 additions & 0 deletions .github/workflows/python-integration.yml
@@ -0,0 +1,87 @@
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

name: "Python CI"
on:
  push:
    branches:
    - 'master'
    - '0.**'
    tags:
    - 'apache-iceberg-**'
  pull_request:
    paths:
    - '.github/workflows/python-ci.yml'
    - 'python/**'

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}

jobs:
  integration-test:
    runs-on: ubuntu-20.04

    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 2
      - shell: pwsh
        id: check_file_changed
        run: |
          $diff = git diff --name-only HEAD^ HEAD
          $SourceDiff = $diff | Where-Object { $_ -match '^python/dev/Dockerfile$' }
          $HasDiff = $SourceDiff.Length -gt 0
          Write-Host "::set-output name=docs_changed::$HasDiff"
      - name: Restore image
        id: cache-docker
        uses: actions/cache@v3
        with:
          path: ci/cache/docker/python
          key: cache-mintegration
      - name: Update Image Cache if cache miss
        if: steps.cache-docker.outputs.cache-hit != 'true' || steps.check_file_changed.outputs.docs_changed == 'True'
        run: |
          docker build -t python-integration python/dev/ && \
          mkdir -p ci/cache/docker/python && \
          docker image save python-integration --output ./ci/cache/docker/python/python-integration.tar
      - name: Use Image Cache if cache hit
        if: steps.cache-docker.outputs.cache-hit == 'true'
        run: docker image load --input ./ci/cache/docker/python/python-integration.tar
      - name: Run Apache-Spark setup
        working-directory: ./python
        run: |
          docker-compose -f dev/docker-compose-integration.yml up -d
          sleep 10
      - name: Install poetry
        run: pip install poetry
      - uses: actions/setup-python@v4
        with:
          python-version: '3.9'
          cache: poetry
          cache-dependency-path: ./python/poetry.lock
      - name: Install
        working-directory: ./python
        run: make install
      - name: Tests
        working-directory: ./python
        run: make test-integration
      - name: Show debug logs
        if: ${{ failure() }}
        run: docker-compose -f python/dev/docker-compose.yml logs
21 changes: 11 additions & 10 deletions python/Makefile
@@ -17,7 +17,7 @@

install:
	pip install poetry
-	poetry install -E pyarrow -E hive -E s3fs -E glue -E adlfs
+	poetry install -E pyarrow -E hive -E s3fs -E glue -E adlfs -E duckdb

check-license:
	./dev/check-license
@@ -26,21 +26,22 @@ lint:
	poetry run pre-commit run --all-files

test:
-	poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m "not s3 and not adlfs" ${PYTEST_ARGS}
+	poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m unmarked ${PYTEST_ARGS}
	poetry run coverage report -m --fail-under=90
	poetry run coverage html
	poetry run coverage xml

test-s3:
	sh ./dev/run-minio.sh
-	poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m "not adlfs" ${PYTEST_ARGS}
-	poetry run coverage report -m --fail-under=90
-	poetry run coverage html
-	poetry run coverage xml
+	poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m s3 ${PYTEST_ARGS}
+
+test-integration:
+	docker-compose -f dev/docker-compose-integration.yml kill
+	docker-compose -f dev/docker-compose-integration.yml build
+	docker-compose -f dev/docker-compose-integration.yml up -d
+	sleep 20
+	poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m integration ${PYTEST_ARGS}

test-adlfs:
	sh ./dev/run-azurite.sh
-	poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m "not s3" ${PYTEST_ARGS}
-	poetry run coverage report -m --fail-under=90
-	poetry run coverage html
-	poetry run coverage xml
+	poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m adlfs ${PYTEST_ARGS}
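
The new test-integration target selects tests by the pytest marker `integration` (pytest tests/ -m integration). The PR's actual test modules are not shown in this excerpt, so the following is only a minimal sketch of what such a marked test could look like; the connection properties are taken from dev/docker-compose-integration.yml (REST catalog on localhost:8181, MinIO on localhost:9000), and the fixture and test names are illustrative.

import pytest

from pyiceberg.catalog import Catalog, load_catalog


@pytest.fixture()
def catalog() -> Catalog:
    # Connection properties mirror the services defined in docker-compose-integration.yml.
    return load_catalog(
        "integration",
        **{
            "type": "rest",
            "uri": "http://localhost:8181",
            "s3.endpoint": "http://localhost:9000",
            "s3.access-key-id": "admin",
            "s3.secret-access-key": "password",
        },
    )


@pytest.mark.integration
def test_default_namespace_exists(catalog: Catalog) -> None:
    # provision.py creates the `default` database, so the REST catalog should list it.
    assert ("default",) in catalog.list_namespaces()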
67 changes: 67 additions & 0 deletions python/dev/Dockerfile
@@ -0,0 +1,67 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM python:3.9-bullseye

RUN apt-get -qq update && \
    apt-get -qq install -y --no-install-recommends \
      sudo \
      curl \
      vim \
      unzip \
      openjdk-11-jdk \
      build-essential \
      software-properties-common \
      ssh && \
    apt-get -qq clean && \
    rm -rf /var/lib/apt/lists/*

# Optional env variables
ENV SPARK_HOME=${SPARK_HOME:-"/opt/spark"}
ENV HADOOP_HOME=${HADOOP_HOME:-"/opt/hadoop"}
ENV PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.5-src.zip:$PYTHONPATH

RUN mkdir -p ${HADOOP_HOME} && mkdir -p ${SPARK_HOME} && mkdir -p /home/iceberg/spark-events
WORKDIR ${SPARK_HOME}

ENV SPARK_VERSION=3.3.2

RUN curl -s https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz -o spark-${SPARK_VERSION}-bin-hadoop3.tgz \
&& tar xzf spark-${SPARK_VERSION}-bin-hadoop3.tgz --directory /opt/spark --strip-components 1 \
&& rm -rf spark-${SPARK_VERSION}-bin-hadoop3.tgz

# Download iceberg spark runtime
RUN curl -s https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.1.0/iceberg-spark-runtime-3.3_2.12-1.1.0.jar -Lo iceberg-spark-runtime-3.3_2.12-1.1.0.jar \
&& mv iceberg-spark-runtime-3.3_2.12-1.1.0.jar /opt/spark/jars

# Download Java AWS SDK
RUN curl -s https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.17.165/bundle-2.17.165.jar -Lo bundle-2.17.165.jar \
&& mv bundle-2.17.165.jar /opt/spark/jars

# Download URL connection client required for S3FileIO
RUN curl -s https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.17.165/url-connection-client-2.17.165.jar -Lo url-connection-client-2.17.165.jar \
&& mv url-connection-client-2.17.165.jar /opt/spark/jars

COPY spark-defaults.conf /opt/spark/conf
ENV PATH="/opt/spark/sbin:/opt/spark/bin:${PATH}"

RUN chmod u+x /opt/spark/sbin/* && \
chmod u+x /opt/spark/bin/*

COPY entrypoint.sh .
COPY provision.py .

ENTRYPOINT ["./entrypoint.sh"]
CMD ["notebook"]
76 changes: 76 additions & 0 deletions python/dev/docker-compose-integration.yml
@@ -0,0 +1,76 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
version: "3"

services:
  spark-iceberg:
    image: python-integration
    container_name: pyiceberg-spark
    build: .
    depends_on:
      - rest
      - minio
    volumes:
      - ./warehouse:/home/iceberg/warehouse
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    ports:
      - 8888:8888
      - 8080:8080
    links:
      - rest:rest
      - minio:minio
  rest:
    image: tabulario/iceberg-rest:0.2.0
    container_name: pyiceberg-rest
    # Review comment (Contributor): Where does this store the underlying catalog metadata?
    # Reply (Author): An in-memory SQLite database.
    ports:
      - 8181:8181
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3a://warehouse/wh/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
  minio:
    image: minio/minio
    container_name: pyiceberg-minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    ports:
      - 9001:9001
      - 9000:9000
    command: [ "server", "/data", "--console-address", ":9001" ]
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: pyiceberg-mc
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc mb minio/warehouse;
      /usr/bin/mc policy set public minio/warehouse;
      tail -f /dev/null
      "
25 changes: 25 additions & 0 deletions python/dev/entrypoint.sh
@@ -0,0 +1,25 @@
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

start-master.sh -p 7077
start-worker.sh spark://spark-iceberg:7077
start-history-server.sh

python3 ./provision.py
98 changes: 98 additions & 0 deletions python/dev/provision.py
@@ -0,0 +1,98 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

print("Create database")

spark.sql(
    """
    CREATE DATABASE IF NOT EXISTS default;
    """
)

spark.sql(
    """
    use default;
    """
)

spark.sql(
    """
    DROP TABLE IF EXISTS test_null_nan;
    """
)

spark.sql(
    """
    CREATE TABLE test_null_nan
    USING iceberg
    AS SELECT
        1            AS idx,
        float('NaN') AS col_numeric
    UNION ALL SELECT
        2            AS idx,
        null         AS col_numeric
    UNION ALL SELECT
        3            AS idx,
        1            AS col_numeric
    """
)

spark.sql(
    """
    CREATE TABLE test_null_nan_rewritten
    USING iceberg
    AS SELECT * FROM test_null_nan
    """
)

spark.sql(
    """
    DROP TABLE IF EXISTS test_deletes;
    """
)

spark.sql(
    """
    CREATE TABLE test_deletes
    USING iceberg
    TBLPROPERTIES (
        'write.delete.mode'='merge-on-read',
        'write.update.mode'='merge-on-read',
        'write.merge.mode'='merge-on-read'
    )
    AS SELECT
        1     AS idx,
        True  AS deleted
    UNION ALL SELECT
        2     AS idx,
        False AS deleted;
    """
)

spark.sql(
    """
    DELETE FROM test_deletes WHERE deleted = True;
    """
)

# Keep the container running so the Spark services stay available for the tests.
while True:
    time.sleep(1)
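
provision.py seeds the tables that the integration tests read back through pyiceberg, for example test_null_nan with one NaN, one null, and one regular float value (the "NaN seems to be broken" commit suggests exactly this case needed coverage). The PR's test code is not included in this excerpt; as an illustration only, a scan with a NaN row filter against the provisioned table could look like the sketch below, reusing the hypothetical catalog fixture from the earlier example.

import pytest

from pyiceberg.expressions import IsNaN


@pytest.mark.integration
def test_scan_nan(catalog) -> None:
    table = catalog.load_table("default.test_null_nan")
    # Only the row with idx = 1 stores float('NaN') in col_numeric.
    arrow_table = table.scan(row_filter=IsNaN("col_numeric")).to_arrow()
    assert arrow_table["idx"].to_pylist() == [1]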