How to improve table loading with pandas? #673

Answered by mailys-ds
EricDallAgnol asked this question in Q&A

Hi Eric,

Thanks for your detailed answer; I'm coming back to you after some investigation.

First of all, after loading dataframes similar to yours but with fewer records (10,000,000 records with 10 columns), I got approximately the same latency for load_pandas(). So this is expected behaviour.
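
As a reference point, here is a rough sketch of how such a baseline measurement could be reproduced on a synthetic dataframe of that size. The table object is a placeholder (it is not defined in this thread); replace it with whatever object exposes the load_pandas() method in your setup.

import time

import numpy as np
import pandas as pd

# Synthetic dataframe matching the test size: 10,000,000 rows x 10 columns.
df = pd.DataFrame(
    np.random.rand(10_000_000, 10),
    columns=[f"col_{i}" for i in range(10)],
)

start = time.perf_counter()
table.load_pandas(df)  # placeholder call: use your own table object here
print(f"load_pandas() took {time.perf_counter() - start:.1f} s")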

However, the load_parquet() method (see the documentation) is more efficient and much faster than load_pandas() when your dataset is partitioned. On my test dataframes, it was 3 times faster.
You can load your data as follows:

df["partition"] = [np.random.choice(range(partition_count), size=df.shape[0])] # add a partition column to your dataframe if necessary
df.to_parquet("df.parquet", 
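
Putting the whole flow together, here is a minimal end-to-end sketch; df is your existing dataframe. partition_count and the table object are assumptions rather than something defined in this thread: pick a partition count that suits your data volume, and call load_parquet() on your own table object, as mentioned above.

import numpy as np
import pandas as pd

partition_count = 8  # assumed value: choose a count that fits your data volume

# Add a partition column if the dataframe does not already have a natural one.
df["partition"] = np.random.choice(range(partition_count), size=df.shape[0])

# Write a partitioned Parquet dataset: one sub-directory of files per partition value.
df.to_parquet("df.parquet", partition_cols=["partition"])

# Assumed call shape: load the partitioned dataset instead of the in-memory dataframe.
table.load_parquet("df.parquet")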

Answer selected by EricDallAgnol