Describe the bug, including details regarding any error messages, version, and platform.
I'd like to create a sample hive-partitioned dataset in Parquet format.
The parameter "b" in the script below controls the amount of data:
b = 1010000 -> 10000 customer ids / 7440000 records -> Runs in 2 seconds
b = 1050000 -> 50000 customer ids / 37200000 records -> Runs in 8 minutes
b = 1100000 -> 100000 customer ids / 74400000 records -> Runs in 63 minutes
Shouldn't the run time scale roughly linearly with the number of records?
# Windows 11
#!pip install --upgrade polars==0.20.5
#!pip install --upgrade pyarrow==15.0.0
import polars as pl
import pyarrow.dataset as ds
from dateutil import rrule
from datetime import datetime

def ido():
    # Current timestamp, used to prefix the progress messages
    return datetime.now().strftime('%Y.%m.%d. %H:%M:%S')

print(ido(), 'started')

# PARTY_ID values cover range(a, b); b controls the data volume
a = 1000000
b = 1010000 # Runs in 2 seconds
#b = 1050000 # Runs in 8 minutes
#b = 1100000 # Runs in 63 minutes

df1 = pl.DataFrame({'PARTY_ID': list(range(a, b))})
# 186 consecutive days starting 2023-07-21, stored as strings
df2 = pl.DataFrame({'CALENDAR_DATE': [d.strftime('%Y-%m-%d %H:%M:%S') for d in rrule.rrule(rrule.DAILY, count=186, dtstart=datetime(2023, 7, 21))]})
df3 = pl.DataFrame({'CREDIT_FL': ['Y', 'N', 'Y', 'Y'],
                    'AMOUNT': [123, 789, 22, 44]})
# Cross joins: every PARTY_ID x every date x every row of df3
df4 = (df1.join(df2, how='cross')
          .join(df3, how='cross'))
print(ido(), 'data created')
print(ido(), df4.shape)

dft = df4.to_arrow()
print(ido(), 'data converted to arrow')

# Write one hive-style directory per distinct CALENDAR_DATE value
ds.write_dataset(
    dft,
    'my_table',
    format="parquet",
    partitioning=["CALENDAR_DATE"],
    partitioning_flavor="hive",
    existing_data_behavior="delete_matching",
)
print(ido(), 'finished')
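To make the non-linearity concrete, the figures quoted above can be checked with a few lines of arithmetic. This is only a sketch based on the numbers in this report: the record count is (b - a) ids x 186 dates x 4 rows of df3, and the run times are the approximate values listed above, not new measurements. The names `reported`, `record_count` and `summary` are made up for this illustration.

```python
# Values taken from the report above; run times are approximate
A = 1_000_000        # lower bound of the PARTY_ID range
DAYS = 186           # number of CALENDAR_DATE values
ROWS_PER_PAIR = 4    # rows in df3

# (upper bound b, reported run time in seconds)
reported = [
    (1_010_000, 2),        # 2 seconds
    (1_050_000, 8 * 60),   # 8 minutes
    (1_100_000, 63 * 60),  # 63 minutes
]

def record_count(b, a=A):
    """Rows produced by the two cross joins for PARTY_ID in range(a, b)."""
    return (b - a) * DAYS * ROWS_PER_PAIR

base_rows = record_count(reported[0][0])
base_secs = reported[0][1]

# (records, data growth factor, run-time growth factor) per run
summary = []
for b, secs in reported:
    rows = record_count(b)
    summary.append((rows, rows / base_rows, secs / base_secs))
    print(f"b={b}: {rows} records, "
          f"{rows / base_rows:.0f}x data, {secs / base_secs:.0f}x run time")
```

With linear scaling the two growth factors would be about equal. With the reported timings, 5x the data takes about 240x as long and 10x the data about 1890x as long.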
Component(s)
Python
lmocsi changed the title from "Write_dataset() does not scale linearly with dataset size" to "[Python] Write_dataset() does not scale linearly with dataset size" on Jan 23, 2024
lmocsi changed the title from "[Python] Write_dataset() does not scale linearly with dataset size" to "[Python] Write_dataset() run time does not scale linearly with dataset size" on Jan 30, 2024