
Python: Add positional deletes #6775

Merged · 56 commits · Jun 20, 2023

Conversation

@Fokko (Contributor) commented Feb 8, 2023

Closes #6568

@github-actions github-actions bot added the python label Feb 8, 2023
@Fokko Fokko force-pushed the fd-positional-deletes branch 5 times, most recently from a3fb95a to fc49b8b Compare February 9, 2023 15:34
@Fokko Fokko added this to the Python 0.4.0 release milestone Feb 10, 2023
@github-actions github-actions bot added the core label Feb 22, 2023
@Fokko Fokko marked this pull request as draft February 24, 2023 16:31
@Fokko Fokko marked this pull request as ready for review February 27, 2023 12:30
]

def _match_deletes_to_datafile(self, data_file: DataFile, positional_delete_files: List[DataFile]) -> Set[DataFile]:
return set(
Contributor commented:

I think this is going to over-match delete files to data files. An older delete file could match a new data file because the range of file_path could be large and not helpful.

I think all you need to do is to check this preliminary set by comparing the sequence number of the data file and the matching delete files.

Contributor (author) replied:

Ahh, I see. Yes, I agree. I also recall this from the Java implementation, but I was confused by the sequence numbers at the manifest-list level.

I've added a sorted list that collects the deletes by sequence number, so we can efficiently bisect it and select only the deletes that came after the data file.
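A minimal sketch of the sorted-list idea (the tuple layout and names are illustrative, not PyIceberg's actual structures): keep delete entries sorted by sequence number, then bisect to find only the deletes that can apply to a given data file.

```python
from bisect import bisect_left

# Hypothetical entries: (sequence_number, delete_file_path), kept sorted
# by sequence number.
deletes = [
    (1, "d1-deletes.parquet"),
    (3, "d2-deletes.parquet"),
    (5, "d3-deletes.parquet"),
]

def deletes_for(data_sequence_number: int) -> list:
    # A positional delete only applies to data written at or before its own
    # sequence number, so we bisect instead of scanning the whole list.
    start = bisect_left(deletes, (data_sequence_number,))
    return [path for _, path in deletes[start:]]
```

For a data file with sequence number 3 this selects `d2-deletes.parquet` and `d3-deletes.parquet` and skips the older delete file, at O(log n) per lookup instead of a full scan.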

@@ -28,12 +28,16 @@
import os
from abc import ABC, abstractmethod
from functools import lru_cache, singledispatch
from heapq import merge
Contributor (author) commented:

Thanks! Are there any specific changes that you would like to see in a separate PR? The heapq is used for merging the different positional deletes.
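For reference, a tiny sketch of what `heapq.merge` does here: it lazily combines several already-sorted streams of delete positions (one per positional delete file) into a single sorted iterator, and it preserves duplicates.

```python
from heapq import merge

# Two positional delete files, each already sorted by row position.
file_a = [1, 4, 7]
file_b = [2, 4, 9]

# merge() yields a single sorted stream without materializing the inputs;
# note that the duplicate position 4 is emitted twice.
merged = list(merge(file_a, file_b))
# -> [1, 2, 4, 4, 7, 9]
```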

for pos in range(fn_rows()):
if deleted_pos == pos:
try:
deleted_pos = next(sorted_deleted).as_py() # type: ignore
Contributor (author) replied:

Oof, nice catch!

>>> sorted_deleted = iter([1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 5, 6])
>>> deleted_pos = next(sorted_deleted)
>>> for pos in range(10):
...     if deleted_pos == pos:
...         while deleted_pos == pos:
...             try:
...                 deleted_pos = next(sorted_deleted)
...             except StopIteration:
...                 deleted_pos = -1
...     else:
...         print(f"yield {pos}")
...         
... 
yield 0
yield 3
yield 7
yield 8
yield 9

Comment on lines 532 to 547
def _create_positional_deletes_indices(positional_deletes: List[pa.ChunkedArray], fn_rows: Callable[[], int]) -> pa.Array:
sorted_deleted = merge(*positional_deletes)

def generator() -> Generator[int, None, None]:
deleted_pos = next(sorted_deleted).as_py() # type: ignore
for pos in range(fn_rows()):
if deleted_pos == pos:
try:
deleted_pos = next(sorted_deleted).as_py() # type: ignore
except StopIteration:
deleted_pos = -1
else:
yield pos

# Filter on the positions
return pa.array(generator(), type=pa.int64())
Contributor (author) commented:

@jorisvandenbossche Let me know if you're interested in providing this from the PyArrow side :) Would be very welcome.

@@ -786,15 +865,39 @@ def project_table(
rows_counter = multiprocessing.Value("i", 0)

with ThreadPool() as pool:
# Fetch the deletes
deletes_per_file: Dict[str, List[ChunkedArray]] = {}
Contributor (author) commented:

Good one, I like that a lot!

@@ -259,7 +262,7 @@ def projection(self) -> Schema:
return snapshot_schema.select(*self.selected_fields, case_sensitive=self.case_sensitive)

@abstractmethod
def plan_files(self) -> Iterator[ScanTask]:
def plan_files(self) -> Iterable[ScanTask]:
Contributor (author) commented:

It won't break the code of anyone using this, but it might alarm the type checker.
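A small illustration of why the widening is source-compatible for callers that only iterate (every `Iterator` is an `Iterable`), while code that calls `next()` on the result directly would now be flagged:

```python
from typing import Iterable, Iterator

def plan_files_old() -> Iterator[int]:
    yield 1
    yield 2

def plan_files_new() -> Iterable[int]:
    # Returning a plain list still satisfies Iterable, which the
    # narrower Iterator annotation would not allow.
    return [1, 2]

# Iteration works identically for both:
assert list(plan_files_old()) == list(plan_files_new())
# But `next(plan_files_new())` would be rejected by a type checker,
# since Iterable does not guarantee __next__.
```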

@@ -401,9 +423,38 @@ def plan_files(self) -> Iterator[FileScanTask]:
metrics_evaluator,
)
for manifest in manifests
if (manifest.content is None or manifest.content == ManifestContent.DATA)
Contributor (author) commented:

I don't like this; I'm going to separate it out into a function.

data_datafiles.append(datafile)
elif datafile.content == DataFileContent.POSITION_DELETES:
deletes_positional.append(datafile)
elif datafile.content == DataFileContent.EQUALITY_DELETES:
Contributor (author) commented:

I left it out for now, since this is already quite a hefty PR.

python/tests/io/test_pyarrow.py (resolved)

columns=[col.name for col in file_project_schema.columns],
)

if positional_deletes:
# In the case of a mask, it is a bit awkward because we first
Contributor (author) replied:

> If I understand correctly, the problem is that we are relying on the Arrow result corresponding 1-to-1 with the records in the file, so that we can use a row's position in the DataFrame as its position in the file.

Yes, this is correct.

> But if we need to read deletes, we don't want to read the entire file, which could mean reading whole row groups unnecessarily.

Based on the row-group statistics, yes.

> I don't know if Arrow supports this, but it would need to.

I don't think Arrow supports this today. We could implement it on the PyIceberg side, but I don't think we should, because:

  • It is internal to PyArrow.
  • It would pull a lot of hot code into Python, where the GIL would slow us down.

The points you raise are correct. On the Arrow side we're looking into implementing this: apache/arrow#35301

The last comment there was about adding an internal index column that can be used for this purpose. That way we can combine the filters and push them down to PyArrow (and also simplify things on the PyIceberg side; all the if-else branches make the code error-prone).

Fokko added a commit to Fokko/iceberg that referenced this pull request May 22, 2023
I was doing some work on the Python side:

apache#6775

But ran into an issue when creating some integration tests
for testing the positional deletes. I ended up with double
slashes:

s3://warehouse/default/test_positional_mor_deletes/data//00000-32-70be11f7-3c4b-40e0-b35a-334e97ef6554-00001-deletes.parquet

It looks like the struct is non-null but the table is unpartitioned;
a partitioned path is still created, and with the empty struct we end
up with a double slash `//` that MinIO doesn't like.

OutputFileFactory.java
```java
  public EncryptedOutputFile newOutputFile(PartitionSpec spec, StructLike partition) { // partition is a StructCopy
    String newDataLocation = locations.newDataLocation(spec, partition, generateFilename());
    OutputFile rawOutputFile = io.newOutputFile(newDataLocation);
    return encryptionManager.encrypt(rawOutputFile);
  }
```

ClusteredWriter.java
```java
      // copy the partition key as the key object may be reused
      this.currentPartition = StructCopy.copy(partition);  // partition is a StructProjection
      this.currentWriter = newWriter(currentSpec, currentPartition);
```

I still have to dig into why there is a StructProjection.

Resolves apache#7678
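The symptom can be reproduced with a trivial sketch of the path construction (the function and layout are illustrative, not Iceberg's actual LocationProvider): an unpartitioned spec yields an empty partition path, and a naive join produces `//`.

```python
def new_data_location(table_location: str, partition_path: str, filename: str) -> str:
    # Naive join: fine for a path like "dt=2023-01-01", but an
    # unpartitioned spec yields an empty partition_path and therefore
    # a double slash in the resulting location.
    return f"{table_location}/data/{partition_path}/{filename}"

new_data_location("s3://warehouse/default/t", "", "00000-deletes.parquet")
# -> "s3://warehouse/default/t/data//00000-deletes.parquet"
```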
@Fokko (Contributor, Author) commented May 28, 2023

This PR has grown very big. Let's split some of it into smaller PRs to make reviewing easier: start with support for initial-default, then add inheritance, which deserves its own PR.

@Fokko Fokko marked this pull request as draft May 28, 2023 20:18
Fokko added a commit to Fokko/iceberg that referenced this pull request May 30, 2023
@Fokko Fokko marked this pull request as ready for review June 6, 2023 21:42
@rdblue (Contributor) left a comment:

Overall, looks good to me. I'd say let's merge it and improve as we get new features into Arrow.

@Fokko (Contributor, Author) commented Jun 20, 2023

@rdblue Sounds like a plan! I'll follow up on this once we get __row_index etc.

@Fokko Fokko merged commit 9ffb762 into apache:master Jun 20, 2023
7 checks passed
@Fokko Fokko deleted the fd-positional-deletes branch August 7, 2023 19:00

Successfully merging this pull request may close these issues.

Python: Table scan returning deleted data
3 participants