How to improve table loading with pandas? #673
-
Hello 👋 I have a question regarding the way tables are populated from a pandas dataframe. Currently, on a DEV environment, we have one CSV file of 440 MB that is loaded into one of our tables and our BaseStore. After the explode, it has 53,848,761 entries (10 columns for the BaseStore and 11 for the other table), but it takes around 6 minutes to load. With cProfile, I found that about 60% of our loading time comes from 3 methods called by `load_pandas`. Is there a way to optimize this, or is it expected? Also, would it be better to use `load_spark` in my situation? Many thanks, Eric
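For anyone wanting to reproduce the measurement, here is a minimal sketch of such a profiling run; the `table` and `df` names are placeholders, not the actual code from this setup:

```python
import cProfile
import pstats

# Hypothetical setup: `table` is an existing atoti table and `df` is the
# exploded pandas dataframe described above.
profiler = cProfile.Profile()
profiler.enable()
table.load_pandas(df)
profiler.disable()

# Print the 10 calls that dominate the cumulative loading time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```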
-
Hi Eric, Could I have more details on your pipeline? So far, this is what I understood:
Would it be possible to get more details about the dataframe you obtain after your ETL, its size and/or the main types of your columns? With this, we could roughly estimate the loading time into atoti and tell you which loading method is the most convenient. Thanks,
-
Hi Mailys, Thanks for your fast reply! Correct, but from the same CSV I basically end up with 2 dataframes of 54,000,000 entries each (one loaded into my BaseStore and one into one of my tables). For the BaseStore I drop one column, so that dataframe has 10 columns while the other has 11. The overall ETL workflow looks like this:
Dataframe loaded into the BaseStore:
Size (rows × columns): 538,487,610
Dataframe loaded into the other table:
Size (rows × columns): 592,336,371
Note: for those 2 dataframes, all the columns are loaded into the corresponding table. Please let me know if you need other details. Many thanks Mailys! Eric
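To make the workflow concrete, here is a minimal sketch of what such a pipeline could look like; the CSV path, the exploded `values` column, the dropped `extra` column, and the table names are all hypothetical, as the thread does not show the actual ETL code:

```python
import pandas as pd

# Hypothetical reconstruction: read the ~440 MB CSV, then explode a
# list-like column into one row per element (~53.8 million rows).
raw = pd.read_csv("data.csv")
exploded = raw.explode("values")

# The dataframe for the other table keeps all 11 columns; the BaseStore
# dataframe drops one column, leaving 10.
other_df = exploded
base_df = exploded.drop(columns=["extra"])

base_store.load_pandas(base_df)    # 53,848,761 rows x 10 columns
other_table.load_pandas(other_df)  # 53,848,761 rows x 11 columns
```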
-
Hi Eric, Thanks for your detailed answer, I'm coming back to you after some investigation. First of all, after loading dataframes similar to yours with fewer records (10,000,000 records with 10 columns), I got approximately the same latency for `load_pandas()`, so this is an expected behaviour. However, the loading method `load_parquet()` (see the documentation) is more efficient and much faster than `load_pandas()` if your dataset is partitioned; on my test dataframes, it was 3 times faster. You can load your data as follows:

```python
import numpy as np

# Add a partition column to your dataframe if necessary.
df["partition"] = np.random.choice(range(partition_count), size=df.shape[0])
df.to_parquet("df.parquet", index=False, partition_cols=["partition"])

with session.start_transaction():
    table.load_parquet("./df.parquet")
```

Finally, let me know if that decreases your table loading times.
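A note on the design choice: the speedup likely comes from the fact that a partitioned parquet dataset is written as one file per partition, which can be read and loaded in parallel, while `load_pandas()` pushes a single in-memory dataframe through one conversion path. The snippet above does not define `partition_count`; as an assumption, a value on the order of the number of available CPU cores is a reasonable starting point.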