[Python] Flaky test test_write_dataset_max_open_files #30918

Closed
asfimport opened this issue Jan 25, 2022 · 14 comments
Found during 7.0.0 verification


pyarrow/tests/test_dataset.py::test_write_dataset_max_open_files FAILED                                            [ 30%]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
tempdir = PosixPath('/tmp/pytest-of-root/pytest-1/test_write_dataset_max_open_fi0')

    def test_write_dataset_max_open_files(tempdir):
        directory = tempdir / 'ds'
        file_format = "parquet"
        partition_column_id = 1
        column_names = ['c1', 'c2']
        record_batch_1 = pa.record_batch(data=[[1, 2, 3, 4, 0, 10],
                                               ['a', 'b', 'c', 'd', 'e', 'a']],
                                         names=column_names)
        record_batch_2 = pa.record_batch(data=[[5, 6, 7, 8, 0, 1],
                                               ['a', 'b', 'c', 'd', 'e', 'c']],
                                         names=column_names)
        record_batch_3 = pa.record_batch(data=[[9, 10, 11, 12, 0, 1],
                                               ['a', 'b', 'c', 'd', 'e', 'd']],
                                         names=column_names)
        record_batch_4 = pa.record_batch(data=[[13, 14, 15, 16, 0, 1],
                                               ['a', 'b', 'c', 'd', 'e', 'b']],
                                         names=column_names)
    
        table = pa.Table.from_batches([record_batch_1, record_batch_2,
                                       record_batch_3, record_batch_4])
    
        partitioning = ds.partitioning(
            pa.schema([(column_names[partition_column_id], pa.string())]),
            flavor="hive")
    
        data_source_1 = directory / "default"
    
        ds.write_dataset(data=table, base_dir=data_source_1,
                         partitioning=partitioning, format=file_format)
    
        # Here we consider the number of unique partitions created when
        # partitioning column contains duplicate records.
        #   Returns: (number_of_files_generated, number_of_partitions)
        def _get_compare_pair(data_source, record_batch, file_format, col_id):
            num_of_files_generated = _get_num_of_files_generated(
                base_directory=data_source, file_format=file_format)
            number_of_partitions = len(pa.compute.unique(record_batch[col_id]))
            return num_of_files_generated, number_of_partitions
    
        # CASE 1: when max_open_files=default & max_open_files >= num_of_partitions
        #         In case of a writing to disk via partitioning based on a
        #         particular column (considering row labels in that column),
        #         the number of unique rows must be equal
        #         to the number of files generated
    
        num_of_files_generated, number_of_partitions \
            = _get_compare_pair(data_source_1, record_batch_1, file_format,
                                partition_column_id)
        assert num_of_files_generated == number_of_partitions
    
        # CASE 2: when max_open_files > 0 & max_open_files < num_of_partitions
        #         the number of files generated must be greater than the number of
        #         partitions
    
        data_source_2 = directory / "max_1"
    
        max_open_files = 3
    
        ds.write_dataset(data=table, base_dir=data_source_2,
                         partitioning=partitioning, format=file_format,
                         max_open_files=max_open_files)
    
        num_of_files_generated, number_of_partitions \
            = _get_compare_pair(data_source_2, record_batch_1, file_format,
                                partition_column_id)
>       assert num_of_files_generated > number_of_partitions
E       assert 5 > 5

pyarrow/tests/test_dataset.py:3807: AssertionError
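For reference, the `_get_num_of_files_generated` helper used by the test is not shown in the traceback. A minimal sketch of what such a helper could look like (the recursive glob on the format extension is an assumption for illustration, not the actual test code):

```python
import pathlib

def _get_num_of_files_generated(base_directory, file_format):
    # Assumed behaviour: count every data file written under the dataset
    # directory, relying on the files using the format name as their
    # extension (e.g. "*.parquet").
    return len(list(pathlib.Path(base_directory).rglob(f"*.{file_format}")))
```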
 

Reporter: David Li / @lidavidm
Assignee: Vibhatha Lakmal Abeykoon / @vibhatha

PRs and other links:

Note: This issue was originally created as ARROW-15438. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
This was added in ARROW-15019 (cc @vibhatha)

Vibhatha Lakmal Abeykoon / @vibhatha:
@jorisvandenbossche, I will take a look.

Vibhatha Lakmal Abeykoon / @vibhatha:
I created a PR.

@westonpace could this be happening because the data size is so small? I increased the data size, lowered max_open_files, and tested it; it seems to work, but can we fix this more robustly? Am I missing something here?

Antoine Pitrou / @pitrou:
I also see this regularly on my local computer.

Vibhatha Lakmal Abeykoon / @vibhatha:
@pitrou could it be due to the small data size?

Vibhatha Lakmal Abeykoon / @vibhatha:
Perhaps because the initial data size was too small? What I did was increase it a little and reduce max_open_files.

Antoine Pitrou / @pitrou:
I have no idea. It may be timing-dependent?

Antoine Pitrou / @pitrou:
Is parallel writing enabled by default? If so, disabling it would probably make the test more robust.

Vibhatha Lakmal Abeykoon / @vibhatha:
Yes, most probably. I set use_threads=False and reduced max_open_files to 1; that mitigates the problem of too few files being generated, at least to an extent.
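A sketch of what that change could look like in the test (the exact parameter values here are illustrative, not the final patch):

```python
ds.write_dataset(data=table, base_dir=data_source_2,
                 partitioning=partitioning, format=file_format,
                 max_open_files=1,    # force files to be closed and reopened
                 use_threads=False)   # scan the source serially
```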

 

Weston Pace / @westonpace:
Threading makes sense. The WriteNode's InputReceived will be called 4 times (one per batch). Each call will generate 5 calls to the dataset writer (one per partition in that batch).

We'd get more than 5 files if the writes arrive in an order like (BnPm denotes partition m of batch n):

B1P1, B1P2, B1P3, B1P4, B1P5, B2P1, ..., B4P5

However, we'd only get 5 files if we get something like:

B1P1, B2P1, B3P1, B4P1, B1P2, ... B4P5

Since we scan the source in parallel, both orders are possible.

> Is parallel writing enabled by default? If so, disabling it would probably make the test more robust.

There is no easy way in the dataset writer to disable parallel writing (the CPU path is completely serial but it submits I/O tasks for each batch so you would need to shrink the I/O thread pool to size 1).

> I set use_threads=False and reduced max_open_files to 1; that mitigates the problem of too few files being generated, at least to an extent.

This will disable parallel scanning, which should be enough to prevent the flakiness (unless I am misunderstanding how the error is generated). I'll try to set up a reproduction.
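For completeness, if one did want to force the write side itself to be serial from Python, a rough sketch could look like the following. Note that `pa.set_io_thread_count` changes the process-wide I/O pool, so this is a blunt instrument and only an illustration, not the proposed fix:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Shrink the global I/O thread pool so the dataset writer's I/O tasks run
# one at a time; combined with use_threads=False the scan is serial as well.
pa.set_io_thread_count(1)

ds.write_dataset(data=table, base_dir=data_source_2,
                 partitioning=partitioning, format=file_format,
                 max_open_files=3, use_threads=False)
```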

Weston Pace / @westonpace:
I was able to reproduce fairly regularly and confirmed the issue was the order in which the batches were delivered to the dataset writer. Unfortunately, we were also not completely respecting the use_threads option in write_dataset. If use_threads=False then we would scan in serial but still send batches into the exec plan in parallel. I added a new PR that sets use_threads=False (based on Vibhatha's PR) and also updates the Write method to be serial.

I don't know if this bug should block the RC, but if we cut another RC it would probably be nice to include the fix.

Antoine Pitrou / @pitrou:
This is for the most part just a flaky test, so while the fix is nice to have, it shouldn't block the release IMHO.

Antoine Pitrou / @pitrou:
Issue resolved by pull request #12263.
