Add option to output numeric data types as string. #255

Merged
merged 2 commits into eulerto:master from add-numerics-as-strings-option on Nov 22, 2022

Contributor

@shubhamdhama shubhamdhama commented Nov 4, 2022

Data types like numeric, real, and double precision support Infinity, -Infinity, and NaN values. Currently, these values are output as null because the JSON specification does not recognize them as valid numeric values. This creates problems for wal2json users who need these values to maintain data integrity.
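
To illustrate why these values cannot travel as JSON numbers, here is a minimal Python sketch (Python is used only for illustration; wal2json itself is written in C). It shows a strict JSON encoder rejecting NaN and Infinity, and the same values surviving a round trip once they are carried as strings. The keys a, b, and c are hypothetical.

```python
import json
import math
from decimal import Decimal

# Strict JSON has no literal for NaN or +/-Infinity, so a conforming
# encoder has to refuse them (or fall back to something like null).
for special in (math.nan, math.inf, -math.inf):
    try:
        json.dumps(special, allow_nan=False)
    except ValueError as exc:
        print(f"cannot encode {special!r}: {exc}")

# Carried as strings, the same values are valid JSON and can be turned
# back into a numeric type on the consumer side.
payload = json.dumps({"a": "Infinity", "b": "-Infinity", "c": "NaN"})
decoded = {k: Decimal(v) for k, v in json.loads(payload).items()}
print(decoded)
```

Decimal is used here because it accepts all three special values as text; a consumer that only needs doubles could use float() in the same way.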

Tests
Added tests and tested against Postgres 9.6, 10, 11, 12, 13, and 14.

Fixes: #245

@shubhamdhama
Contributor Author

I tested this option numerics-as-string with pgcopydb and it fixes dimitri/pgcopydb#127 (comment)

Before this change

Start pgcopydb clone --follow in another window. While the initial load is running, check the source and insert some rows for prefetch:

$ psql $PGCOPYDB_SOURCE_PGURI
psql (14.5 (Ubuntu 14.5-2.pgdg20.04+2), server 12.12 (Ubuntu 12.12-1.pgdg20.04+1))

postgres=# select * from table_integer;
 a  |   b    |      c      |          d
----+--------+-------------+----------------------
  9 |  32767 |  2147483647 |  9223372036854775807
 10 | -32768 | -2147483648 | -9223372036854775808
(2 rows)

postgres=# select * from table_decimal ;
     a     |         b         |                    c
-----------+-------------------+------------------------------------------
  Infinity |          Infinity |
 -Infinity |         -Infinity |
       NaN |               NaN |                                      NaN
   123.456 |  123456789.012345 |  1234567890987654321.1234567890987654321
  -123.456 | -123456789.012345 | -1234567890987654321.1234567890987654321
(5 rows)

postgres=# INSERT INTO table_integer (b, c, d) VALUES(32767, 2147483647, 9223372036854775807);
INSERT 0 1
postgres=# INSERT INTO table_integer (b, c, d) VALUES(-32768, -2147483648, -9223372036854775808);
INSERT 0 1
postgres=# INSERT INTO table_decimal (a, b) VALUES('Infinity', 'Infinity');
INSERT 0 1
postgres=# INSERT INTO table_decimal (a, b) VALUES('-Infinity', '-Infinity');
INSERT 0 1
postgres=# INSERT INTO table_decimal (a, b, c) VALUES('NaN', 'NaN', 'NaN');
INSERT 0 1
postgres=# INSERT INTO table_decimal (a, b, c) VALUES(123.456, 123456789.012345, 1234567890987654321.1234567890987654321);
INSERT 0 1
postgres=# INSERT INTO table_decimal (a, b, c) VALUES(-123.456, -123456789.012345, -1234567890987654321.1234567890987654321);
INSERT 0 1

After the initial load has finished (in the other window) and before apply has started, check the target. This should reflect the initial load data:

$ psql $PGCOPYDB_TARGET_PGURI
psql (14.5 (Ubuntu 14.5-2.pgdg20.04+2), server 13.8 (Ubuntu 13.8-1.pgdg20.04+1))

postgres=# select * from table_integer;
 a  |   b    |      c      |          d
----+--------+-------------+----------------------
  9 |  32767 |  2147483647 |  9223372036854775807
 10 | -32768 | -2147483648 | -9223372036854775808
(2 rows)

postgres=# select * from table_decimal ;
     a     |         b         |                    c
-----------+-------------------+------------------------------------------
  Infinity |          Infinity |
 -Infinity |         -Infinity |
       NaN |               NaN |                                      NaN
   123.456 |  123456789.012345 |  1234567890987654321.1234567890987654321
  -123.456 | -123456789.012345 | -1234567890987654321.1234567890987654321
(5 rows)

Set endpos to force applying the SQL file generated during the online phase:

shdhama@shdhama:~
$ $PGCOPYDB/src/bin/pgcopydb/pgcopydb stream sentinel set endpos --current -v
13:21:07 28812 INFO   Running pgcopydb version 0.9.13.g897182b.dirty from "/home/shdhama/pg/pgcopydb/src/bin/pgcopydb/pgcopydb"
13:21:07 28812 NOTICE [SOURCE] BEGIN;
13:21:07 28812 NOTICE [SOURCE] select current_setting('server_version'),        current_setting('server_version_num')::integer;
13:21:07 28812 NOTICE [SOURCE] Postgres version 12.12 (Ubuntu 12.12-1.pgdg20.04+1) (120012)
13:21:07 28812 NOTICE [SOURCE] update pgcopydb.sentinel set endpos = pg_current_wal_flush_lsn();
13:21:07 28812 NOTICE [SOURCE] COMMIT;
13:21:07 28812 NOTICE [SOURCE] select startpos, endpos, apply, write_lsn, flush_lsn, replay_lsn   from pgcopydb.sentinel;
13:21:07 28812 INFO   pgcopydb sentinel endpos has been set to 6/4E092878
6/4E092878

Now check the target again:

shdhama@shdhama:~
$ psql $PGCOPYDB_TARGET_PGURI
psql (14.5 (Ubuntu 14.5-2.pgdg20.04+2), server 13.8 (Ubuntu 13.8-1.pgdg20.04+1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

postgres=# select * from table_decimal ;
     a     |         b         |                    c
-----------+-------------------+------------------------------------------
  Infinity |          Infinity |
 -Infinity |         -Infinity |
       NaN |               NaN |                                      NaN
   123.456 |  123456789.012345 |  1234567890987654321.1234567890987654321
  -123.456 | -123456789.012345 | -1234567890987654321.1234567890987654321
           |                   |
           |                   |
           |                   |
   123.456 |         123457000 |                      1234570000000000000
  -123.456 |        -123457000 |                     -1234570000000000000
(10 rows)

postgres=# select * from table_integer;
 a  |   b    |      c      |          d
----+--------+-------------+----------------------
  9 |  32767 |  2147483647 |  9223372036854775807
 10 | -32768 | -2147483648 | -9223372036854775808
 11 |  32767 |  2147480000 |  9223370000000000000
 12 | -32768 | -2147480000 | -9223370000000000000
(4 rows)

In table_decimal, the first 5 rows were migrated during the initial load and the last 5 rows were migrated online. We can see that NaN, Infinity, and -Infinity arrive as null.
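
The truncated online values above (for example 2147480000 instead of 2147483647) match what one would get if each number were read into a floating-point value and printed with six significant digits, as C's default %g does. Whether that is what actually happened in this pipeline is an assumption, not something stated in this thread; the Python sketch below only reproduces the arithmetic with a hypothetical helper.

```python
from decimal import Decimal

def through_six_digits(text: str) -> Decimal:
    """Simulate a consumer that parses a JSON number as a double and
    prints it with six significant digits (like C's default %g).
    Hypothetical model only, not code from pgcopydb or wal2json."""
    return Decimal(format(float(text), "g"))

source_values = [
    "2147483647",
    "9223372036854775807",
    "123456789.012345",
    "123.456",
]

for text in source_values:
    lossy = through_six_digits(text)
    exact = Decimal(text)  # what a string-typed value allows instead
    print(f"{text} -> lossy {lossy:f}, exact {exact}")
```

Values that fit in six significant digits, such as 123.456 and 32767, pass through unchanged, which is consistent with the rows shown above.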

After

Start pgcopydb clone --follow in another window. While the initial load is running, check the source and insert some rows for prefetch:

$ psql $PGCOPYDB_SOURCE_PGURI
psql (14.5 (Ubuntu 14.5-2.pgdg20.04+2), server 12.12 (Ubuntu 12.12-1.pgdg20.04+1))

postgres=# select * from table_decimal ;
     a     |         b         |                    c
-----------+-------------------+------------------------------------------
  Infinity |          Infinity |
 -Infinity |         -Infinity |
       NaN |               NaN |                                      NaN
   123.456 |  123456789.012345 |  1234567890987654321.1234567890987654321
  -123.456 | -123456789.012345 | -1234567890987654321.1234567890987654321
(5 rows)

postgres=# select * from table_integer;
 a |   b    |      c      |          d
---+--------+-------------+----------------------
 5 |  32767 |  2147483647 |  9223372036854775807
 6 | -32768 | -2147483648 | -9223372036854775808
(2 rows)

postgres=# INSERT INTO table_integer (b, c, d) VALUES(32767, 2147483647, 9223372036854775807);
INSERT 0 1
postgres=# INSERT INTO table_integer (b, c, d) VALUES(-32768, -2147483648, -9223372036854775808);
INSERT 0 1
postgres=# INSERT INTO table_decimal (a, b) VALUES('Infinity', 'Infinity');
INSERT 0 1
postgres=# INSERT INTO table_decimal (a, b) VALUES('-Infinity', '-Infinity');
INSERT 0 1
postgres=# INSERT INTO table_decimal (a, b, c) VALUES('NaN', 'NaN', 'NaN');
INSERT 0 1
postgres=# INSERT INTO table_decimal (a, b, c) VALUES(123.456, 123456789.012345, 1234567890987654321.1234567890987654321);
INSERT 0 1
postgres=# INSERT INTO table_decimal (a, b, c) VALUES(-123.456, -123456789.012345, -1234567890987654321.1234567890987654321);
INSERT 0 1

After the initial load has finished (in the other window) and before apply has started, check the target. This should reflect the initial load data:

shdhama@shdhama:~
$ psql $PGCOPYDB_TARGET_PGURI
psql (14.5 (Ubuntu 14.5-2.pgdg20.04+2), server 13.8 (Ubuntu 13.8-1.pgdg20.04+1))

postgres=# select * from table_decimal ;
     a     |         b         |                    c
-----------+-------------------+------------------------------------------
  Infinity |          Infinity |
 -Infinity |         -Infinity |
       NaN |               NaN |                                      NaN
   123.456 |  123456789.012345 |  1234567890987654321.1234567890987654321
  -123.456 | -123456789.012345 | -1234567890987654321.1234567890987654321
(5 rows)

postgres=# select * from table_integer;
 a |   b    |      c      |          d
---+--------+-------------+----------------------
 5 |  32767 |  2147483647 |  9223372036854775807
 6 | -32768 | -2147483648 | -9223372036854775808
(2 rows)

Set endpos to force applying the SQL file generated during the online phase:

shdhama@shdhama:~
$ $PGCOPYDB/src/bin/pgcopydb/pgcopydb stream sentinel set endpos --current -v
13:14:41 28041 INFO   Running pgcopydb version 0.9.13.g897182b.dirty from "/home/shdhama/pg/pgcopydb/src/bin/pgcopydb/pgcopydb"
13:14:41 28041 INFO   pgcopydb sentinel endpos has been set to 6/4E072488
6/4E072488

Now check the target again:

shdhama@shdhama:~
$ psql $PGCOPYDB_TARGET_PGURI
psql (14.5 (Ubuntu 14.5-2.pgdg20.04+2), server 13.8 (Ubuntu 13.8-1.pgdg20.04+1))

postgres=# select * from table_decimal ;
     a     |         b         |                    c
-----------+-------------------+------------------------------------------
  Infinity |          Infinity |
 -Infinity |         -Infinity |
       NaN |               NaN |                                      NaN
   123.456 |  123456789.012345 |  1234567890987654321.1234567890987654321
  -123.456 | -123456789.012345 | -1234567890987654321.1234567890987654321
  Infinity |          Infinity |
 -Infinity |         -Infinity |
       NaN |               NaN |                                      NaN
   123.456 |  123456789.012345 |  1234567890987654321.1234567890987654321
  -123.456 | -123456789.012345 | -1234567890987654321.1234567890987654321
(10 rows)

postgres=# select * from table_integer;
 a |   b    |      c      |          d
---+--------+-------------+----------------------
 5 |  32767 |  2147483647 |  9223372036854775807
 6 | -32768 | -2147483648 | -9223372036854775808
 7 |  32767 |  2147483647 |  9223372036854775807
 8 | -32768 | -2147483648 | -9223372036854775808
(4 rows)

NaN, Infinity, and -Infinity are replicated properly 🎉

@eulerto
Owner

eulerto commented Nov 5, 2022

Based on the discussion we had in #245, this PR needs some adjustments. It should cover all numeric data types. Use only one test file. There are some unrelated changes (blank spaces); remove them. A good name for the new parameter is numeric-data-types-as-string. The README should mention it, such as:

numeric-data-types-as-string: use strings for numeric data types. JSON specification does not recognize Infinity and NaN as valid numeric values. There might be potential interoperability problems for double precision numbers. Default is false.
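
On the consumer side, string-typed values can be converted back according to the column type. The Python sketch below assumes a format-version 2 change record whose columns carry name, type, and value keys. The sample payload is hand-written for illustration rather than captured from wal2json, and both the helper name and the set of type names are assumptions to adjust for a real schema.

```python
import json
from decimal import Decimal

# Type names treated as numeric here are an assumption for this sketch.
NUMERIC_TYPES = {"smallint", "integer", "bigint", "numeric",
                 "real", "double precision"}

def decode_columns(change_json: str) -> dict:
    """Return {column name: value}, converting string-encoded numerics
    to Decimal so that NaN, Infinity and long numerics survive."""
    change = json.loads(change_json)
    row = {}
    for col in change.get("columns", []):
        value = col["value"]
        base_type = col["type"].split("(")[0]  # drop typmod, e.g. numeric(10,2)
        if isinstance(value, str) and base_type in NUMERIC_TYPES:
            value = Decimal(value)
        row[col["name"]] = value
    return row

# Hand-written sample, not real plugin output.
sample = json.dumps({
    "action": "I", "schema": "public", "table": "table_decimal",
    "columns": [
        {"name": "a", "type": "real", "value": "NaN"},
        {"name": "c", "type": "numeric",
         "value": "1234567890987654321.1234567890987654321"},
    ],
})
print(decode_columns(sample))
```

Keying the conversion on the declared column type, rather than on whether a string happens to look numeric, keeps text columns that contain digits untouched.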

@eulerto eulerto added the feature label Nov 5, 2022
Data types like `numeric`, `real`, `double precision` support `Infinity`,
`-Infinity` and `NaN` values. Currently these values output as `null` because
JSON specification does not recognize them as valid numeric values. This will
create problems for the users of wal2json who need these values to maintain
data integrity.
@shubhamdhama
Contributor Author

shubhamdhama commented Nov 7, 2022

@eulerto I have updated the PR, please review.

It should cover all numeric data types.

I think the current switch case covers all numeric data types. Please correct me if I'm wrong.

There are some unrelated changes (blank spaces), remove them. A good name for the new parameter is numeric-data-types-as-string. The README should mention

Fixed.

@shubhamdhama
Contributor Author

hi @eulerto, gentle ping on this. Thanks!

i tsvector
);

SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'wal2json');

@nakatlam nakatlam Nov 15, 2022

You may want to add illustrations/outputs of these queries with and without using the option 'numeric-data-types-as-string'.

Contributor Author

Done.

wal2json.c Outdated
@@ -1257,8 +1272,9 @@ tuple_to_stringinfo(LogicalDecodingContext *ctx, TupleDesc tupdesc, HeapTuple tu
* Data types are printed with quotes unless they are number, true,

update comment

Contributor Author

Done.

COMMIT;

SELECT data FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL, 'format-version', '1', 'pretty-print', '1', 'numeric-data-types-as-string', '1');
SELECT data FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL, 'format-version', '2', 'pretty-print', '1', 'numeric-data-types-as-string', '1');
Owner

The option pretty-print has no effect for v2. Remove it. You should duplicate both pg_logical_slot_peek_changes calls to provide the output without this option. It is also recommended to drop the tables you created at the end.

SELECT data FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL, 'format-version', '1', 'pretty-print', '1', 'numeric-data-types-as-string', '1');
SELECT data FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL, 'format-version', '1', 'pretty-print', '1');
SELECT data FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL, 'format-version', '2', 'numeric-data-types-as-string', '1');
SELECT data FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL, 'format-version', '2');
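
For clients that issue such calls programmatically, the plugin options are passed as alternating name/value text arguments, as in the queries above. A minimal Python sketch of building a parameterized call follows. The helper name is hypothetical, the %s placeholder style is an assumption borrowed from common PostgreSQL drivers, and the function only builds text without connecting to a database.

```python
def peek_changes_query(slot: str, options: dict) -> tuple:
    """Build a parameterized pg_logical_slot_peek_changes call.

    Each option becomes two text parameters (name, then value), which
    mirrors the variadic argument list used in the SQL tests.
    """
    params = [slot]
    for name, value in options.items():
        params.extend([name, str(value)])
    sql = "SELECT data FROM pg_logical_slot_peek_changes(%s, NULL, NULL"
    option_count = len(params) - 1  # everything after the slot name
    if option_count:
        sql += ", " + ", ".join(["%s"] * option_count)
    sql += ")"
    return sql, params

sql, params = peek_changes_query(
    "regression_slot",
    {"format-version": 2, "numeric-data-types-as-string": 1},
)
print(sql)
print(params)
```

Passing values as parameters instead of formatting them into the SQL string avoids quoting mistakes in option names and values.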

Contributor Author

Makes sense, done.

@eulerto
Owner

eulerto commented Nov 22, 2022

I have a few additional comments:

  • README: use strings for numeric... -> use string for numeric ...
  • README: there is an additional line that should be removed: output numeric data types as string. Default is false.
  • tests: remove table table_others. It doesn't add anything. It is already covered by other test cases.
  • comment: "Data types are printed..." You changed this comment, but the change doesn't add anything. Revert it. Keep the last paragraph "Except to this ...".
  • spaces: don't mix spaces and tabs. JsonDecodingData and other variables should be aligned with tabs.

@shubhamdhama
Contributor Author

@eulerto thank you for the review, I've addressed all the comments.

@eulerto eulerto merged commit bb7cd50 into eulerto:master Nov 22, 2022
@shubhamdhama shubhamdhama deleted the add-numerics-as-strings-option branch November 22, 2022 13:45
alejandrosanchezcabana added a commit to streamsets/wal2json that referenced this pull request Jan 18, 2023
Add option to output numeric data types as string. (eulerto#255)
@thecatontheflat

This fix is not available in the Debian/Ubuntu package, right? Does it need to be built from source?

@eulerto
Owner

eulerto commented May 21, 2023

No. It will be in the next release.

@hanefi

hanefi commented Nov 30, 2023

@eulerto Any timeline of when the next release will be?

hanefi added a commit to hanefi/pgcopydb that referenced this pull request Dec 13, 2023
This change adds a new option `--wal2json-numeric-as-string` that
changes wal2json plugin output format to print numeric data types as
strings. This is accomplished by passing the
`--numeric-data-types-as-string` option to wal2json plugin.

This is useful to prevent precision loss when using wal2json
plugin to stream changes from a database that uses numeric data types.

wal2json plugin version that supports `--numeric-data-types-as-string`
option is required to use this pgcopydb option. As of today there is no
official wal2json release that supports this option, but it is available
on master branch of the project.

Relevant changes in wal2json plugin is at
eulerto/wal2json#255
dimitri pushed a commit to dimitri/pgcopydb that referenced this pull request Dec 15, 2023
* Add option to output numeric as string on wal2json

This change adds a new option `--wal2json-numeric-as-string` that
changes wal2json plugin output format to print numeric data types as
strings. This is accomplished by passing the
`--numeric-data-types-as-string` option to wal2json plugin.

This is useful to prevent precision loss when using wal2json
plugin to stream changes from a database that uses numeric data types.

wal2json plugin version that supports `--numeric-data-types-as-string`
option is required to use this pgcopydb option. As of today there is no
official wal2json release that supports this option, but it is available
on master branch of the project.

Relevant changes in wal2json plugin is at
eulerto/wal2json#255

* Add env to output numeric as string on wal2json

PGCOPYDB_WAL2JSON_NUMERIC_AS_STRING can be set to a boolean value that
will be used to determine if pgcopydb should set the wal2json option
`--numeric-data-types-as-string`.

In passing, also add the PGCOPYDB_OUTPUT_PLUGIN env variable to all
relevant pages of our documentation.
@apduvuri

@eulerto Any timeline for when the next release will be? Most customers want to install wal2json from the Linux repositories for their production workloads, and this is a blocker for them.

@hsinghjkaur

@eulerto - I would really appreciate it if you could share the timeline for the upcoming release that includes this fix, as many customers are affected and it is long awaited.

@eulerto
Owner

eulerto commented Jan 11, 2024

New version will be released after #273 is fixed.

@hsinghjkaur

New version will be released after #273 is fixed.

Thank you for the update! Do you have any ETA?

@eulerto
Owner

eulerto commented Apr 25, 2024

It took some time but the new version was released including this feature. Enjoy!

Development

Successfully merging this pull request may close these issues.

Output NUMERIC type as string in JSON
7 participants