How to improve table loading with pandas? #673
-
Hello 👋 I have a question regarding the way tables are populated from a pandas dataframe. Currently, on a DEV environment, we have one CSV file of 440 MB that is loaded into one of our tables and our BaseStore. After the explode, it has 53,848,761 entries (10 columns for the BaseStore and 11 for the other table), but it takes around 6 minutes to load. With cProfile, I found that about 60% of our loading time comes from 3 methods called by `load_pandas`. Is there a way to optimize this, or is it expected? Also, would it be better to use `load_spark` in my situation? Many thanks, Eric
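For anyone wanting to reproduce the measurement, here is a minimal sketch of such a profiling run; the `table` and `df` names are placeholders, not the actual code from this setup:

```python
import cProfile
import pstats

# Hypothetical setup: `table` is an existing atoti table and `df` is the
# exploded pandas dataframe described above.
profiler = cProfile.Profile()
profiler.enable()
table.load_pandas(df)
profiler.disable()

# Print the 10 calls that dominate the cumulative loading time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```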
-
Hi Eric, Could I have more details on your pipeline? So far, this is what I understood:
Would it be possible to get more details about the dataframe you obtain after your ETL, its size and/or the main types of your columns? With this, we could roughly estimate the loading time into atoti and tell you which loading method is the most convenient. Thanks,
-
Hi Mailys, Thanks for your fast reply! Correct, but from the same CSV I basically end up with 2 dataframes of 54,000,000 entries each (one loaded into my BaseStore and one into one of my tables). For the BaseStore I drop one column, so that dataframe has 10 columns while the other has 11. The overall ETL workflow looks like this:
Dataframe loaded into the BaseStore:
Size (rows × columns): 538,487,610
Dataframe loaded into the other table:
Size (rows × columns): 592,336,371
Note: for those 2 dataframes, all the columns are loaded into the corresponding table. Please let me know if you need other details. Many thanks Mailys! Eric
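To make the workflow concrete, here is a minimal sketch of what such a pipeline could look like; the CSV path, the exploded `values` column, the dropped `extra` column, and the table names are all hypothetical, as the thread does not show the actual ETL code:

```python
import pandas as pd

# Hypothetical reconstruction: read the ~440 MB CSV, then explode a
# list-like column into one row per element (~53.8 million rows).
raw = pd.read_csv("data.csv")
exploded = raw.explode("values")

# The dataframe for the other table keeps all 11 columns; the BaseStore
# dataframe drops one column, leaving 10.
other_df = exploded
base_df = exploded.drop(columns=["extra"])

base_store.load_pandas(base_df)    # 53,848,761 rows x 10 columns
other_table.load_pandas(other_df)  # 53,848,761 rows x 11 columns
```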
-
Hi Eric, Thanks for your detailed answer, I'm coming back to you after some investigation. First of all, after loading dataframes similar to yours with fewer records (10,000,000 records with 10 columns), I got approximately the same latency for `load_pandas()`, so this is an expected behaviour. However, the loading method `load_parquet()` (see the documentation) is more efficient and much faster than `load_pandas()` if your dataset is partitioned; on my test dataframes, it was 3 times faster. You can load your data as follows:

```python
import numpy as np

# Add a partition column to your dataframe if necessary.
df["partition"] = np.random.choice(range(partition_count), size=df.shape[0])
df.to_parquet("df.parquet", index=False, partition_cols=["partition"])

with session.start_transaction():
    table.load_parquet("./df.parquet")
```

Finally, let me know if that decreases your table loading times.
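A note on the design choice: the speedup likely comes from the fact that a partitioned parquet dataset is written as one file per partition, which can be read and loaded in parallel, while `load_pandas()` pushes a single in-memory dataframe through one conversion path. The snippet above does not define `partition_count`; as an assumption, a value on the order of the number of available CPU cores is a reasonable starting point.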