[Python] Write_dataset() run time does not scale linearly with dataset size #39768

Open
lmocsi opened this issue Jan 23, 2024 · 0 comments

Describe the bug, including details regarding any error messages, version, and platform.

I'd like to create a sample hive-partitioned dataset in Parquet format.
Parameter "b" (the upper bound of the PARTY_ID range) controls the amount of data:
b = 1010000 -> 10,000 customer ids / 7,440,000 records -> runs in 2 seconds
b = 1050000 -> 50,000 customer ids / 37,200,000 records -> runs in 8 minutes
b = 1100000 -> 100,000 customer ids / 74,400,000 records -> runs in 63 minutes
Shouldn't the run time scale roughly linearly with the number of records?
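
To put numbers on the gap, here is a small sketch that recomputes the record counts and compares data growth with run-time growth. It only uses the figures quoted above; the run times are the rounded values as reported, and the helper name `scaling_factors` is made up for illustration.

```python
# Rows per customer id: 186 daily dates x 4 CREDIT_FL/AMOUNT rows.
ROWS_PER_ID = 186 * 4

# (customer ids, reported run time in seconds), taken from the list above.
REPORTED_RUNS = [(10_000, 2), (50_000, 8 * 60), (100_000, 63 * 60)]


def scaling_factors(runs):
    """Return (records, data factor, time factor) relative to the first run."""
    base_ids, base_secs = runs[0]
    return [
        (ids * ROWS_PER_ID, ids / base_ids, secs / base_secs)
        for ids, secs in runs
    ]


for records, data_x, time_x in scaling_factors(REPORTED_RUNS):
    print(f"{records:>12,} records: {data_x:4.0f}x data, {time_x:6.0f}x time")
```

With linear scaling the two factors would stay close to each other; with the reported times, 5x the data takes about 240x as long and 10x the data about 1890x as long.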

# Windows 11
#!pip install --upgrade polars==0.20.5
#!pip install --upgrade pyarrow==15.0.0
 
import polars as pl
import pyarrow.dataset as ds
 
from dateutil import rrule
from datetime import datetime

def ido():
    return datetime.now().strftime('%Y.%m.%d. %H:%M:%S')

print(ido(),'started')

a = 1000000   # first PARTY_ID
b = 1010000   # 10,000 ids: runs in 2 seconds
#b = 1050000  # 50,000 ids: runs in 8 minutes
#b = 1100000  # 100,000 ids: runs in 63 minutes
df1 = pl.DataFrame({'PARTY_ID': list(range(a, b))})
# 186 consecutive days starting 2023-07-21, stored as strings
df2 = pl.DataFrame({'CALENDAR_DATE': [d.strftime('%Y-%m-%d %H:%M:%S') for d in rrule.rrule(rrule.DAILY, count=186, dtstart=datetime(2023, 7, 21))]})
df3 = pl.DataFrame({'CREDIT_FL': ['Y', 'N', 'Y', 'Y'],
                    'AMOUNT': [123, 789, 22, 44]})
 
# Cross join: every PARTY_ID x every CALENDAR_DATE x every CREDIT_FL/AMOUNT row
df4 = df1.join(df2, how='cross').join(df3, how='cross')
print(ido(),'data created')
print(ido(),df4.shape)

dft = df4.to_arrow()
print(ido(),'data converted to arrow')

# Hive-style partitioning: one directory per distinct CALENDAR_DATE (186 in total)
ds.write_dataset(
    dft,
    'my_table',
    format="parquet",
    partitioning=["CALENDAR_DATE"],
    partitioning_flavor="hive",
    existing_data_behavior="delete_matching",
)

print(ido(),'finished')
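
To check whether the slowdown comes from write_dataset() itself rather than from the polars joins or the Arrow conversion, the write step could be timed on its own at several sizes. The following is only a sketch under assumptions: the helper names (`date_strings`, `write_once`, `time_sizes`) are hypothetical, the table is built with numpy and pyarrow in a row order intended to mirror the cross joins above, and the output goes to a temporary directory instead of 'my_table'.

```python
import tempfile
import time
from datetime import datetime, timedelta


def date_strings(n_dates, start=datetime(2023, 7, 21)):
    """Daily timestamps formatted like CALENDAR_DATE in the script above."""
    return [(start + timedelta(days=d)).strftime("%Y-%m-%d %H:%M:%S")
            for d in range(n_dates)]


def write_once(n_ids, n_dates=186):
    """Build the table directly in Arrow and time only write_dataset()."""
    # Imported here so the other helpers work without these packages installed.
    import numpy as np
    import pyarrow as pa
    import pyarrow.dataset as ds

    flags = pa.array(["Y", "N", "Y", "Y"])
    amounts = pa.array([123, 789, 22, 44])
    n_flags = len(flags)

    # Index arrays that expand ids x dates x flag rows into full columns.
    ids = np.repeat(np.arange(1_000_000, 1_000_000 + n_ids), n_dates * n_flags)
    date_idx = pa.array(np.tile(np.repeat(np.arange(n_dates), n_flags), n_ids))
    flag_idx = pa.array(np.tile(np.arange(n_flags), n_ids * n_dates))

    table = pa.table({
        "PARTY_ID": pa.array(ids),
        "CALENDAR_DATE": pa.array(date_strings(n_dates)).take(date_idx),
        "CREDIT_FL": flags.take(flag_idx),
        "AMOUNT": amounts.take(flag_idx),
    })

    with tempfile.TemporaryDirectory() as base_dir:
        t0 = time.perf_counter()
        ds.write_dataset(
            table,
            base_dir,
            format="parquet",
            partitioning=["CALENDAR_DATE"],
            partitioning_flavor="hive",
            existing_data_behavior="delete_matching",
        )
        return time.perf_counter() - t0


def time_sizes(sizes, fn):
    """Call fn(n) for each size; return (n, seconds, seconds per 1000 ids)."""
    results = []
    for n in sizes:
        seconds = fn(n)
        results.append((n, seconds, seconds / n * 1000))
    return results
```

Calling `time_sizes([2_500, 5_000, 10_000, 20_000], write_once)` and printing the rows would show whether the seconds-per-1000-ids column stays roughly flat (linear scaling) or grows with size. The sizes here are arbitrary and kept small so each run fits comfortably in memory.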

Component(s)

Python

lmocsi changed the title from "Write_dataset() does not scale linearly with dataset size" to "[Python] Write_dataset() does not scale linearly with dataset size" on Jan 23, 2024
lmocsi changed the title from "[Python] Write_dataset() does not scale linearly with dataset size" to "[Python] Write_dataset() run time does not scale linearly with dataset size" on Jan 30, 2024