
Postgres fast load #44

Open
sunilred opened this issue Jun 4, 2020 · 8 comments

@sunilred commented Jun 4, 2020

Hi Gerald,

First off this is a great tool. Thanks very much.

My use case is to upload CSV files into an AWS Postgres DB for data analytics (Windows environment). Your tool works great; however, it is a bit slow when dealing with my data volumes. I get files with a million rows that need to be uploaded to Postgres on a daily basis.

For testing, I loaded 20000 rows and it took 2-3 mins, while using pgsql COPY load it took 2 seconds.

I was wondering if there is a way to tap into using pgsql for a faster upload?

I am assuming your process, in order to be generic, uses INSERT statements behind the scenes?

Also, I was wondering if we can add a few more options to "csv2db" that I think other users may find useful - and hopefully you see value in them.

Additional useful options for "csv2db load":

  1. Truncate table and load (optional)
  2. Append to target table (default)

Additional useful option for "csv2db generate":

  3. Create table in the target database (the default for now, i.e. to print the CREATE TABLE script)

Let me know what you think.

Thanks
Sunil

gvenzl self-assigned this Jun 6, 2020
gvenzl added the question (Further information is requested) label Jun 6, 2020
@gvenzl (Collaborator) commented Jun 6, 2020

Hi Sunil,

First of all, thank you for using csv2db!

For testing, I loaded 20000 rows and it took 2-3 mins, while using pgsql COPY load it took 2 seconds.

Could you please elaborate on how you loaded the data with csv2db? For example, what batch size were you using? I don't expect it to be as fast as the pgsql COPY command; you are correct that csv2db uses generic INSERT INTO (batch) statements to load the data.

Truncate table and load (optional)

That does sound indeed like a good enhancement, thank you very much! I have created ER #45 for that.

Append to target table (default)

Can you elaborate on this one, please? Given that INSERT statements always append to a table, I assume you mean something like the Oracle INSERT /*+ APPEND */ functionality?

Create table in the target database (the default for now i.e. to print the CREATE TABLE script)

There are several reasons why csv2db doesn't create tables automatically but just provides this as a print-out. It's meant to be a little helper for when the CSV file structure is unknown and users want to get the data into the database (i.e. a staging table) quickly and cleanse the data further from there. The questions of which data types are the correct ones to use for each column, whether or not columns should have CHECK or referential integrity constraints, be NOT NULL or the PRIMARY KEY, and what storage parameters should be used for the table itself (the latter differs widely across the different databases supported) are left for the user to decide. The assumption is that the CREATE TABLE statement should only be needed very rarely (for quickly getting such a staging table), not on a regular basis for loading data into a well-defined target table.

@sunilred (Author) commented Jun 7, 2020

Hi Sunil,

First of all, thank you for using csv2db!

For testing, I loaded 20000 rows and it took 2-3 mins, while using pgsql COPY load it took 2 seconds.

Could you please elaborate on how you loaded the data with csv2db? For example, what batch size were you using? I don't expect it to be as fast as the pgsql COPY command; you are correct that csv2db uses generic INSERT INTO (batch) statements to load the data.

I have not tried a batch size or other parameters. Below is the csv2db command I used. I have redacted the username, password, host and db name for obvious reasons.

csv2db load -f 'C:\ReddReports\pgsql\bin\demo_bike_share_tripdata_temp.csv' -v -t demo_bikeshare_tripdata_temp -o postgres -u -p -m -n 5432 -d -s

Truncate table and load (optional)

That does sound indeed like a good enhancement, thank you very much! I have created ER #45 for that.

Append to target table (default)

Can you elaborate on this one, please? Given that INSERT statements always append to a table, I assume you mean something like the Oracle INSERT /*+ APPEND */ functionality?

I meant that the default for csv2db is to append (insert) data. If we add a parameter such as "-T" for truncate, then the user can choose to truncate the target table and reload.

Create table in the target database (the default for now i.e. to print the CREATE TABLE script)

There are several reasons why csv2db doesn't create tables automatically but just provides this as a print-out. It's meant to be a little helper for when the CSV file structure is unknown and users want to get the data into the database (i.e. a staging table) quickly and cleanse the data further from there. The questions of which data types are the correct ones to use for each column, whether or not columns should have CHECK or referential integrity constraints, be NOT NULL or the PRIMARY KEY, and what storage parameters should be used for the table itself (the latter differs widely across the different databases supported) are left for the user to decide. The assumption is that the CREATE TABLE statement should only be needed very rarely (for quickly getting such a staging table), not on a regular basis for loading data into a well-defined target table.

You are absolutely right. I am looking at staging data (the load step in an ELT process) only, so there are no issues with dropping or truncating the target table.
Basically, I am looking for a low-maintenance process for uploading CSV data into a staging database. If the source CSV is updated, i.e. a new column is added, the process simply drops the target table and recreates the schema instead of requiring the CREATE TABLE definition to be updated by hand.

This could be an option in csv2db with a flag such as -R, which drops the target table, recreates it from the CSV header with a set column length (e.g. VARCHAR(500)) and then loads the file into the staging table. The beauty of this process is that CSV file changes are automatically captured in the staging database.

Hopefully this makes sense :)

@sunilred (Author) commented Jun 7, 2020

Regarding bulk loads, you could tap into the bulk load capability of the database clients.

For PostgreSQL it is the "COPY" (or "\copy") command. Happy to share what the script looks like.

Similarly for MySQL and SQL Server.

This will make csv2db super valuable for data analytics projects where we are uploading a million+ rows on a daily basis. The prerequisite is that the client machine needs to have the database client libraries installed.
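
For PostgreSQL, such a client-side load script could look roughly like this (host, user and database are placeholders; the table and file names are the ones from my command above):

psql -h <host> -p 5432 -U <user> -d <database> -c "\copy demo_bikeshare_tripdata_temp FROM 'C:\ReddReports\pgsql\bin\demo_bike_share_tripdata_temp.csv' DELIMITER ',' CSV HEADER"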

@gvenzl (Collaborator) commented Jun 10, 2020

Thanks for your response, @sunilred!

Regarding bulk loads, you could tap into the bulk load capability of the database clients.

csv2db already taps into the bulk load facilities of the clients (see here and here) :)

You may want to try bumping up the batch size; the default is 10,000 rows. Depending on how long your rows are, you may want to increase that accordingly.
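
For example, something along these lines (with placeholder connection details; please check csv2db load --help for the exact name of the batch-size option on your version):

csv2db load -f data.csv -o postgres -u <user> -d <database> -t <table> --batch 50000   # batch-size option name assumed here; verify with --help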

I meant the default for csv2db is to append (insert) data. If we add a parameter such as "-T" which is truncate, then the user can choose to truncate the target table and reload.

That makes sense now, thank you!

This could be an option with csv2db with a flag such as -R which is drop target table, recreate using CSV header format with set column length (e.g. VARCHAR(500)) and then load the file into the staging table. Beauty of this process is that CSV file changes are automatically captured in the staging database.

That makes sense to me, thank you, but that could also be easily accommodated by a shell script around csv2db, i.e. execute DROP TABLE, run the CREATE TABLE generation and pipe it into the SQL prompt, or alternatively spool it into a .sql file and execute that .sql file.
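
A minimal sketch of such a wrapper script (table, file and connection details are placeholders, and it assumes csv2db generate accepts the same -f/-t options as csv2db load):

# drop and recreate the staging table from the CSV header
psql -U <user> -d <database> -c 'DROP TABLE IF EXISTS staging'
csv2db generate -f data.csv -t staging | psql -U <user> -d <database>

# load the file into the freshly created staging table
csv2db load -f data.csv -o postgres -u <user> -d <database> -t staging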

gvenzl changed the title from "Postgre fast load" to "Postgres fast load" Jun 10, 2020
@sunilred (Author) commented
I tried changing the batch size and it does not seem to make a huge difference. As per the below screenshot, I uploaded 50,000 rows to PostgreSQL (on AWS). I ran csv2db with the default 10K batch size and then again with a 25K batch size. As shown below, the time taken was more or less the same.

[screenshot: 2020-06-12 at 9:01 AM]

I tried a 50K batch size and the performance was the same.
[screenshot: 2020-06-12 at 9:08 AM]

Finally, I used the Postgres COPY command and it took about 5 seconds.
[screenshot: 2020-06-12 at 9:09 AM]

Am I missing something?

@gvenzl (Collaborator) commented Jun 13, 2020

Thanks a lot, @sunilred!

No, that all looks correct. Given that the time doesn't improve with different batch sizes, it seems that your batches of 10k rows are already big enough to avoid network round-trip related issues, which is good.

I just did some experiments with a local Postgres database myself.

For example, loading 50k rows (8MB of uncompressed file size) into the database takes me less than 8 seconds (with the default batch size of 10k):

$ wc -l 201701-citibike-tripdata.50k.csv
   50001 201701-citibike-tripdata.50k.csv

$ ls -alh 201701-citibike-tripdata.50k.csv
   8.0M Jun 13 12:39 201701-citibike-tripdata.50k.csv

$ time csv2db load -f 201701-citibike-tripdata.50k.csv -o postgres -u test -d test -t test

Loading file 201701-citibike-tripdata.50k.csv
File loaded.


real	0m7.366s
user	0m1.204s
sys	0m0.092s
$

Comparatively, loading 720k rows (117MB of uncompressed file size) into the database takes me about 1 minute 52 seconds:

$ wc -l 201701-citibike-tripdata.csv
  726677 201701-citibike-tripdata.csv

$ ls -lh 201701-citibike-tripdata.csv
   117M Jun 13 12:35 201701-citibike-tripdata.csv

$ time csv2db load -f 201701-citibike-tripdata.csv -o postgres -u test -d test -t test

Loading file 201701-citibike-tripdata.csv
File loaded.


real	1m51.615s
user	0m17.277s
sys	0m1.015s

However, comparing that with the COPY command of psql also shows a significant difference.

With the COPY command, the 50k rows are loaded in under a second:

test=> \timing
Timing is on.
test=> \copy test from '/201701-citibike-tripdata.50k.csv' DELIMITER ',' CSV HEADER;
COPY 50000
Time: 190.792 ms

And the 720k rows within less than 3(!) seconds:

test=> \copy test from '/tests/201701-citibike-tripdata.csv' DELIMITER ',' CSV HEADER;
COPY 726676
Time: 2755.240 ms (00:02.755)

It seems that the Postgres driver could handle batch loads much more efficiently than it currently does, or perhaps there is a different, better and faster way of performing data loads. I will reach out to the psycopg Postgres driver team and raise this observation with them and see whether they have any recommendations. I will track that in a new issue.

@sunilred (Author) commented
Hi. I think it is probably due to the fact that I am using AWS Aurora PostgreSQL. There may be some inefficiencies in that. I am going to run similar tests against SQL Server and see how that performs.

@andygrunwald commented
I guess/believe that the COMMIT operation is handled differently for a COPY than for a bulk INSERT.
Here is an additional explanation: https://stackoverflow.com/questions/46715354/how-does-copy-work-and-why-is-it-so-much-faster-than-insert
