[Python] Write_dataset() run time does not scale linearly with dataset size #39768

lmocsi · 2024-01-23T19:09:01Z

Describe the bug, including details regarding any error messages, version, and platform.

I'd like to create a sample hive-partitioned dataset in Parquet format.
The parameter "b" controls the amount of data: customer ids run from a = 1000000 up to, but not including, b.
b = 1010000 -> 10000 customer ids / 7440000 records -> runs in 2 seconds
b = 1050000 -> 50000 customer ids / 37200000 records -> runs in 8 minutes
b = 1100000 -> 100000 customer ids / 74400000 records -> runs in 63 minutes
The run time grows much faster than the record count. Shouldn't it scale linearly?
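To put a number on "not linear", here is a rough sketch that recomputes the record counts from the script's dimensions (186 dates, 4 rows in df3) and estimates an empirical scaling exponent from the three reported timings. The helper names (`record_count`, `scaling_exponent`) are made up for illustration, and the timings are the single measurements quoted above, so the result is only indicative.

```python
import math

DAYS = 186          # number of dates generated by rrule in the script
ROWS_PER_PAIR = 4   # number of rows in df3

# (customer ids, reported run time in seconds), taken from the figures above
runs = [(10_000, 2), (50_000, 8 * 60), (100_000, 63 * 60)]


def record_count(customers):
    """Rows produced by the two cross joins for a given number of customer ids."""
    return customers * DAYS * ROWS_PER_PAIR


def scaling_exponent(small, large):
    """Exponent k in time ~ n**k, estimated from two (n, seconds) measurements."""
    (n1, t1), (n2, t2) = small, large
    return math.log(t2 / t1) / math.log(n2 / n1)


counts = [record_count(n) for n, _ in runs]
exp_low = scaling_exponent(runs[0], runs[1])   # 10k -> 50k customer ids
exp_high = scaling_exponent(runs[1], runs[2])  # 50k -> 100k customer ids

print(counts)
print(round(exp_low, 2), round(exp_high, 2))
```

Linear scaling would give an exponent close to 1. With these timings both estimates come out near 3, i.e. closer to cubic growth in the number of customer ids, although the timings cover the whole script rather than the write step alone.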

# Windows 11
#!pip install --upgrade polars==0.20.5
#!pip install --upgrade pyarrow==15.0.0
 
import polars as pl
import pyarrow.dataset as ds
 
from dateutil import rrule
from datetime import datetime

def ido():
    return datetime.now().strftime('%Y.%m.%d. %H:%M:%S')

print(ido(),'started')

a = 1000000
b = 1010000 # Runs in 2 seconds
#b = 1050000 # Runs in 8 minutes
#b = 1100000 # Runs in 63 minutes
df1 = pl.DataFrame({'PARTY_ID': [i for i in range(a, b)]})
df2 = pl.DataFrame({'CALENDAR_DATE': [datetime.strftime(i,'%Y-%m-%d %H:%M:%S') for i in list(rrule.rrule(rrule.DAILY, count=186, dtstart=datetime(2023, 7, 21)))]})
df3 = pl.DataFrame({'CREDIT_FL': ['Y','N','Y', 'Y'],
                    'AMOUNT': [123, 789, 22, 44]})
 
df4 = (df1.join(df2,
                how='cross'
                )
          .join(df3,
                how='cross'
                )
       )
print(ido(),'data created')
print(ido(),df4.shape)

dft = df4.to_arrow()
print(ido(),'data converted to arrow')

ds.write_dataset(
        dft,
        'my_table',
        format="parquet",
        partitioning=["CALENDAR_DATE"],
        partitioning_flavor="hive",
        existing_data_behavior="delete_matching",
    )

print(ido(),'finished')
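The script prints wall-clock timestamps between stages, so the time spent in each stage has to be worked out by hand. As a possible aid for narrowing down which stage is slow, here is a minimal sketch of a timing helper that records elapsed seconds per stage. The `timed` helper and the stage labels are hypothetical and not part of the original report, and the stand-in workload only demonstrates the mechanics.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}

# Stand-in stages; in the script above these would wrap the cross joins,
# the to_arrow() conversion and the ds.write_dataset() call respectively.
with timed('data created', results):
    data = list(range(100_000))

with timed('data summed', results):
    total = sum(data)

for label, seconds in results.items():
    print(f'{label}: {seconds:.4f} s')
```

Wrapping each stage of the reproduction in its own `with timed(...)` block for each value of b would show whether the extra time is really spent in write_dataset() or in the data preparation.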

Component(s)

Python

lmocsi added the Type: bug label Jan 23, 2024

github-actions bot added the Component: Python label Jan 23, 2024

lmocsi changed the title ~~Write_dataset() does not scale linearly with dataset size~~ [Python] Write_dataset() does not scale linearly with dataset size Jan 23, 2024

lmocsi mentioned this issue Jan 29, 2024

Performance degradation in scan_parquet() vs. scan_pyarrow_dataset() pola-rs/polars#13908

Closed

2 tasks

lmocsi changed the title ~~[Python] Write_dataset() does not scale linearly with dataset size~~ [Python] Write_dataset() run time does not scale linearly with dataset size Jan 30, 2024

lmocsi mentioned this issue Apr 7, 2024

Enable dataset writer to write hive partitioned parquet datasets pola-rs/polars#11500

Closed

lmocsi mentioned this issue Jun 18, 2024

hive partitioning predicate isn't applied before reading pola-rs/polars#17045

Open
