
Postgres: execute_batch still significantly slower than copy command #47

Status: Open
gvenzl opened this issue Jun 14, 2020 · 3 comments
Labels: enhancement (New feature or request), Postgres (Postgres related issues)

gvenzl (Collaborator) commented Jun 14, 2020

As discovered as part of #44, the execute_batch facility in psycopg2, used for Postgres, still leaves a lot of room for improvement. In a local test, the native Postgres COPY command loaded the same amount of data about 40x faster (1m51.6s vs. 2.8s) than batch inserts using execute_batch.

Using execute_batch – 1m51s:

$ wc -l 201701-citibike-tripdata.csv
  726677 201701-citibike-tripdata.csv

$ ls -lh 201701-citibike-tripdata.csv
   117M Jun 13 12:35 201701-citibike-tripdata.csv

$ time csv2db load -f 201701-citibike-tripdata.csv -o postgres -u test -d test -t test

Loading file 201701-citibike-tripdata.csv
File loaded.

real	1m51.615s
user	0m17.277s
sys	0m1.015s
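For context, the batch path presumably follows the standard psycopg2 execute_batch pattern. A minimal sketch, assuming the "test" table, the column list, and the connection are placeholders (not csv2db's actual code):

```python
# Sketch only: psycopg2 must be installed, and conn is an assumed
# live connection; the helper itself is hypothetical.
# from psycopg2.extras import execute_batch

def build_insert(table, columns):
    """Build the parameterized INSERT statement execute_batch expects:
    one %s placeholder per column."""
    placeholders = ", ".join(["%s"] * len(columns))
    return "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), placeholders)

# Usage sketch (requires a live connection):
# with conn.cursor() as cur:
#     execute_batch(cur, build_insert("test", cols), rows, page_size=1000)
#     conn.commit()
```

execute_batch reduces round trips by concatenating page_size statements per server call, but each row is still an individual INSERT, which is why it cannot approach COPY.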

Using COPY – 2.7s:

test=> \timing
Timing is on.
test=> \copy test from '/tests/201701-citibike-tripdata.csv' DELIMITER ',' CSV HEADER;
COPY 726676
Time: 2755.240 ms (00:02.755)
@gvenzl gvenzl added the Postgres Postgres related issues label Jun 14, 2020
@gvenzl gvenzl self-assigned this Jun 14, 2020
@gvenzl gvenzl added the enhancement New feature or request label Jun 14, 2020
gvenzl (Collaborator, Author) commented Jun 14, 2020

Issue psycopg/psycopg2#491 (comment) acknowledges that:

Adding functions into the extras module to implement the ideas discussed here and in the ML thread without changing the executemany() implementation.

@altaurog has reported the following speedups over a transatlantic connection:

  • cur.executemany(): 761.322767019 sec
  • execute_batch(): 14.6529989243 sec
  • execute_values(): 12.4037430286 sec
  • pgcopy: 2.85529208183 sec
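For reference, execute_values (one of the extras functions the quoted comment refers to) takes a statement with a single VALUES %s placeholder that the helper expands into a multi-row VALUES list, rather than per-row placeholders. A sketch, with table and column names as assumed placeholders:

```python
# Sketch of the psycopg2.extras.execute_values call pattern; the
# connection and "test" table are assumed, and the helper is hypothetical.
# from psycopg2.extras import execute_values

def values_statement(table, columns):
    """Build the statement execute_values() expects: a single %s that
    it replaces with a multi-row VALUES list."""
    return "INSERT INTO {} ({}) VALUES %s".format(table, ", ".join(columns))

# Usage sketch (requires a live connection):
# with conn.cursor() as cur:
#     execute_values(cur, values_statement("test", cols), rows, page_size=1000)
#     conn.commit()
```

Packing many rows into one statement is what makes execute_values edge out execute_batch in the numbers above, though both remain well behind pgcopy.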

gvenzl (Collaborator, Author) commented Jun 14, 2020

It seems there is no easy wrapper function in psycopg2 that gives simple batch INSERT statements the same performance mechanism that COPY provides.

However, the driver does support the cursor.copy_from() and cursor.copy_expert() functions, which allow the client to tap into the COPY command instead. This could be used to load the data via in-memory io.StringIO or io.BytesIO buffers, speeding up the data load to the numbers shown above.

A question still to be answered is how to deal with the extra encoding/decoding of values already read from the file.
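The copy_expert() route could look roughly like the following. This is a sketch only: the "test" table, the already-parsed rows, and the connection are assumptions, and the buffer-building helper is hypothetical.

```python
import csv
import io

def rows_to_csv_buffer(rows):
    """Serialize already-parsed rows into an in-memory CSV buffer
    suitable for feeding to cursor.copy_expert() as STDIN."""
    buf = io.StringIO()
    writer = csv.writer(buf)  # handles quoting/escaping of field values
    writer.writerows(rows)
    buf.seek(0)  # rewind so copy_expert reads from the start
    return buf

# Usage sketch (requires a live psycopg2 connection):
# with conn.cursor() as cur:
#     cur.copy_expert(
#         "COPY test FROM STDIN WITH (FORMAT csv)",
#         rows_to_csv_buffer(rows))
#     conn.commit()
```

Note this re-serializes values that were just parsed from the source file, which is exactly the extra encoding/decoding overhead mentioned above.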

gvenzl (Collaborator, Author) commented Jun 14, 2020

After some more research, it looks like psycopg2 will never offer an executemany() DBAPI method that gets anywhere close to the performance of the COPY command (see this long email thread). There is a chance this will happen with psycopg3, but that is still in development, and it is far from clear whether it will even implement the DBAPI, see https://www.varrazzo.com/blog/2020/03/06/thinking-psycopg3/
