Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV-290 - Fix the wrong assumptions in PostgreSQL formats #265

Merged
merged 3 commits into from Oct 15, 2022

Conversation

angusdev
Copy link
Contributor

I tested in psql 14.5 Homebrew in Mac M1.

CSVFormat.POSTGRESQL_CSV - special characters are not escaped.
CSVFormat.POSTGRESQL_TEXT - values are not quoted.

drop table COMMONS_CSV_PSQL_TEST;
create table COMMONS_CSV_PSQL_TEST (ID INTEGER, COL1 VARCHAR, COL2 VARCHAR, COL3 VARCHAR, COL4 VARCHAR);
insert into COMMONS_CSV_PSQL_TEST select 1, 'abc', 'test line 1' || chr(10) || 'test line 2', null, '';
insert into COMMONS_CSV_PSQL_TEST select 2, 'xyz', '\b:' || chr(8) || ' \n:' || chr(10) || ' \r:' || chr(13), 'a', 'b';
insert into COMMONS_CSV_PSQL_TEST values (3, 'a', 'b,c,d', '"quoted"', 'e');
copy COMMONS_CSV_PSQL_TEST TO '/tmp/psql.csv' WITH (FORMAT CSV);
copy COMMONS_CSV_PSQL_TEST TO '/tmp/psql.tsv';
cat /tmp/psql.csv
1,abc,"test line 1
test line 2",,""
2,xyz,"\b:^H \n:
\r:^M",a,b
3,a,"b,c,d","""quoted""",e
cat /tmp/psql.tsv
1    abc    test line 1\ntest line 2               \N
2    xyz    \\b:\b \\n:\n \\r:\r       a           b
3    a      b,c,d                      "quoted"    e

CSVFormat.POSTGRESQL_CSV - special characters are not escaped.
CSVFormat.POSTGRESQL_TEXT - values are not quoted.
@codecov-commenter
Copy link

Codecov Report

Merging #265 (d992e26) into master (048d507) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##             master     #265   +/-   ##
=========================================
  Coverage     97.34%   97.34%           
  Complexity      535      535           
=========================================
  Files            11       11           
  Lines          1169     1169           
  Branches        205      205           
=========================================
  Hits           1138     1138           
  Misses           18       18           
  Partials         13       13           
Impacted Files Coverage Δ
...rc/main/java/org/apache/commons/csv/CSVFormat.java 97.16% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@garydgregory
Copy link
Member

Hi @angusdev
Thank you for the PR.
Does the version of PostgreSQL matter?

@angusdev
Copy link
Contributor Author

I can't test previous versions of PostgreSQL. But it is pretty safe to say that it applies to all versions.

Postgresql support export to CVS since version 8.0 (year 2005, https://www.postgresql.org/docs/8.0/release-8-0.html)
In the documentation of 8.1 (https://www.postgresql.org/docs/8.1/sql-copy.html), it didn't state clearly but implied that the CSV export will not escape special characters. The backslash escape is used for import data from text file (COPY FROM) only.

For text format (tab delimited), there is no reason to quote the text.

@angusdev
Copy link
Contributor Author

Tested the behaviour of import and export are consistent.

Test case: export csv/tsv from PostgreSQL, read by commons-cvs and write to new csv/tsv, import to PostgreSQL, export csv/tsv again. compare the 1st and 2nd export file

drop table COMMONS_CSV_PSQL_TEST;
create table COMMONS_CSV_PSQL_TEST (ID INTEGER, COL1 VARCHAR, COL2 VARCHAR, COL3 VARCHAR, COL4 VARCHAR);
insert into COMMONS_CSV_PSQL_TEST select 1, 'abc', 'test line 1' || chr(10) || 'test line 2', null, '';
insert into COMMONS_CSV_PSQL_TEST select 2, 'xyz', '\b:' || chr(8) || ' \n:' || chr(10) || ' \r:' || chr(13), 'a', 'b';
insert into COMMONS_CSV_PSQL_TEST values (3, 'a', 'b,c,d', '"quoted"', 'e');
copy COMMONS_CSV_PSQL_TEST to '/tmp/psql.csv' with (FORMAT CSV);
copy COMMONS_CSV_PSQL_TEST to '/tmp/psql.tsv';

use commons-csv to read '/tmp/psql.csv' and write to '/tmp/outpsql.csv', same for 'psql.tsv'

truncate table COMMONS_CSV_PSQL_TEST;
copy COMMONS_CSV_PSQL_TEST(ID, COL1, COL2, COL3, COL4) from '/tmp/outpsql.csv' with (FORMAT CSV);
copy COMMONS_CSV_PSQL_TEST to '/tmp/psql2.csv' with (FORMAT CSV);

truncate table COMMONS_CSV_PSQL_TEST;
copy COMMONS_CSV_PSQL_TEST(ID, COL1, COL2, COL3, COL4) from '/tmp/outpsql.tsv';
copy COMMONS_CSV_PSQL_TEST to '/tmp/psql2.tsv';

diff /tmp/psql.csv /tmp/psql2.csv
(no difference)

diff /tmp/psql.tsv /tmp/psql2.tsv
(no difference)

@garydgregory
Copy link
Member

Hello @angusdev

Thank you for updating your PR.

(1) I think you need to test for tab characters (ASCII 9) in values.

(2) In the PG docs I read

QUOTE
Specifies the quoting character to be used when a data value is quoted. The default is double-quote. This must be a single one-byte character. This option is allowed only when using CSV format.

ESCAPE
Specifies the character that should appear before a data character that matches the QUOTE value. The default is the same as the QUOTE value (so that the quoting character is doubled if it appears in the data). This must be a single one-byte character. This option is allowed only when using CSV format.

Please help me understand why the git master code does not match this definition.
TY!

@angusdev
Copy link
Contributor Author

angusdev commented Sep 26, 2022

Added test for tab characters (ASCII 9) in values.

For QUOTE and ESCAPE, see below example

insert into COMMONS_CSV_PSQL_TEST select 1, '"quoted"', '|quoted2|', 'a,b', null;
INSERT 0 1
postgres=# copy COMMONS_CSV_PSQL_TEST to STDOUT CSV;
1,"""quoted""",|quoted2|,"a,b",
postgres=# copy COMMONS_CSV_PSQL_TEST to STDOUT CSV QUOTE '|';
1,"quoted",|||quoted2|||,|a,b|,
postgres=# copy COMMONS_CSV_PSQL_TEST to STDOUT CSV ESCAPE '~';
1,"~"quoted~"",|quoted2|,"a,b",
postgres=# copy COMMONS_CSV_PSQL_TEST to STDOUT CSV QUOTE '|' ESCAPE '~';
1,"quoted",|~|quoted2~||,|a,b|,
postgres=# copy COMMONS_CSV_PSQL_TEST to STDOUT QUOTE '|';
ERROR:  COPY quote available only in CSV mode
postgres=# copy COMMONS_CSV_PSQL_TEST to STDOUT ESCAPE '~';
ERROR:  COPY escape available only in CSV mode

In PG (CSV), ESCAPE is used to escape the quote char, while in COMMONS_CSV, ESCAPE is to escape delimiter and special char

In PG (TEXT), QUOTE is not needed as it is tab-delimited and the delimiter (tab) is escaped by '\t'

@garydgregory garydgregory merged commit 41a063d into apache:master Oct 15, 2022
@angusdev angusdev deleted the CSV-290 branch October 15, 2022 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants