As discovered as part of #44, the execute_batch facility in psycopg2 used for Postgres still leaves a lot of room for improvement. The native Postgres COPY command loads the same amount of data locally about 55x faster than batch inserts using execute_batch.
Using execute_batch – 1m51s:
$ wc -l 201701-citibike-tripdata.csv
726677 201701-citibike-tripdata.csv
$ ls -lh 201701-citibike-tripdata.csv
117M Jun 13 12:35 201701-citibike-tripdata.csv
$ time csv2db load -f 201701-citibike-tripdata.csv -o postgres -u test -d test -t test
Loading file 201701-citibike-tripdata.csv
File loaded.
real 1m51.615s
user 0m17.277s
sys 0m1.015s
Using COPY – 2.7s:
test=> \timing
Timing is on.
test=> \copy test from '/tests/201701-citibike-tripdata.csv' DELIMITER ',' CSV HEADER;
COPY 726676
Time: 2755.240 ms (00:02.755)
It seems like there is no easy wrapper function in psycopg to leverage the same performance mechanism that COPY provides for simple batch INSERT statements.
However, the driver does support the cursor.copy_from() and cursor.copy_expert() functions, which allow the client to tap into the COPY command instead. This could be used to load the data via in-memory StringIO or BytesIO buffers to reach load speeds like the numbers shown above.
A question to be answered is how to deal with extra encoding/decoding of the values already read from the file.
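As a rough sketch of the idea (assuming psycopg2 is installed and conn is an open connection; rows_to_csv_buffer and copy_rows are hypothetical helper names, not part of csv2db), already-parsed rows could be serialized into an in-memory buffer and streamed to the server with cursor.copy_expert():

```python
# Hypothetical sketch: stream already-parsed rows through COPY instead of
# executing batched INSERTs. Helper names are illustrative, not part of csv2db.
import csv
import io

def rows_to_csv_buffer(rows):
    """Serialize rows (sequences of values) into an in-memory CSV buffer.

    Values read from the source file are already text, so csv.writer only
    re-applies CSV quoting; there is no extra decode/encode round trip.
    """
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    return buf

def copy_rows(conn, table, rows):
    """Load rows into `table` via COPY FROM STDIN on a psycopg2 connection.

    `table` must be a trusted identifier: it is interpolated directly
    into the COPY statement, which takes no bind parameters.
    """
    with conn.cursor() as cur:
        cur.copy_expert(
            "COPY {} FROM STDIN WITH (FORMAT csv)".format(table),
            rows_to_csv_buffer(rows),
        )
    conn.commit()
```

Since the values coming out of the CSV reader are plain strings, writing them straight back into the buffer sidesteps most of the encoding/decoding question; only quoting and the delimiter need to match what the COPY statement expects.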
After some more research, it looks like psycopg2 will never offer an executemany() DBAPI method that gets anywhere close to the performance of the COPY command (see this long email thread). There is a chance this will happen with psycopg3, but that is still in development, and it is far from clear whether it will even implement the DBAPI; see https://www.varrazzo.com/blog/2020/03/06/thinking-psycopg3/