Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python: Integration tests #6398

Merged
merged 29 commits into from
Mar 15, 2023
Merged

Python: Integration tests #6398

merged 29 commits into from
Mar 15, 2023

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Dec 9, 2022

This is the first version of a framework to read Iceberg tables, produced by Spark, using PyIceberg. This makes it easier to run end-to-end tests and also validate the behavior of PyArrow and DuckDB.

python-version: '3.9'
cache: poetry
cache-dependency-path: |
./python/poetry.lock
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be more than just the lock file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change the dependencies, you need to regenerate the lock file. So that should be enough


spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.catalog-impl org.apache.iceberg.rest.RESTCatalog
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type=rest?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Less is more, thanks!

- minio:minio
rest:
image: tabulario/iceberg-rest:0.2.0
container_name: pyiceberg-rest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this store the underlying catalog metadata?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An in-memory SQLite

@@ -428,11 +428,11 @@ def visit_not_in(self, term: BoundTerm[pc.Expression], literals: Set[Any]) -> pc

def visit_is_nan(self, term: BoundTerm[Any]) -> pc.Expression:
ref = pc.field(term.ref().field.name)
return ref.is_null(nan_is_null=True) & ref.is_valid()
return pc.is_nan(ref)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably shouldn't be in this PR right? Seems like an update with a new version of pyarrow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually to make the CI pass. I've created a PR to allow ref.is_nan() as well, but this is not released yet.

@@ -331,7 +331,7 @@ def __init__(
self,
table: Table,
row_filter: Union[str, BooleanExpression] = ALWAYS_TRUE,
selected_fields: Tuple[str] = ("*",),
selected_fields: Tuple[str, ...] = ("*",),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also seems like a separate PR change, but good cleanup.

arrow_table = table_test_null_nan.scan(row_filter=IsNaN("col_numeric"), selected_fields=("idx", "col_numeric")).to_arrow()
assert len(arrow_table) == 1
assert arrow_table[0][0].as_py() == 1
assert math.isnan(arrow_table[1][0].as_py())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be easier to read these tests if you called as_py() to produce rows and validated the rows. It looks like there's just one row, but the row/column indexes are backward because this is columnar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me rewrite those tests a bit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed it into assert math.isnan(arrow_table["col_numeric"][0].as_py())

def test_duckdb_nan(table_test_null_nan_rewritten: Table) -> None:
con = table_test_null_nan_rewritten.scan().to_duckdb("table_test_null_nan")
result = con.query("SELECT idx FROM table_test_null_nan WHERE isnan(col_numeric)").fetchone()
assert result == (1,)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't return NaN?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it does :)

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the changes look like a good start.

@Fokko Fokko merged commit 0807857 into apache:master Mar 15, 2023
@Fokko Fokko deleted the fd-integration-tests branch March 15, 2023 19:04
@Fokko
Copy link
Contributor Author

Fokko commented Mar 15, 2023

Thanks for the review @rdblue we can more tests later on

krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023
* Integration tests

* First version

* Add caching

* Add caching

* Restore pyproject

* WIP

* NaN seems to be broken

* WIP

* Coming along

* Cleanup

* Install duckdb

* Cleanup

* Revert changes to poetry

* Make it even nicer

* Revert unneeded change

* Update Spark version

* Make test passing

* comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants