Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Charset conversion mysql-connector-j-8.0.31.jar #26

Closed
djamadeus opened this issue Nov 16, 2022 · 26 comments · Fixed by #33
Closed

Charset conversion mysql-connector-j-8.0.31.jar #26

djamadeus opened this issue Nov 16, 2022 · 26 comments · Fixed by #33
Labels
bug Unwanted / harmful behavior

Comments

@djamadeus
Copy link

When reading data from virtual schema I get a converison-error from UTF8 to ASCII,. Reading directly from the jdbc-connection works though.

EXA_DB.VIRTUAL_JOINTDB_OFFICE> Select * from(
IMPORT FROM JDBC
AT MYSQL_JOINTDB_JDBC_CONNECTION
STATEMENT 'SELECT * from office.erstattungen_positionen limit 100')
[2022-11-16 10:13:59] 100 rows retrieved starting from 1 in 1 s 85 ms (execution: 1 s 53 ms, fetching: 32 ms)
EXA_DB.VIRTUAL_JOINTDB_OFFICE> Select * from virtual_jointDB_office."erstattungen_positionen" limit 100
[2022-11-16 10:21:58] [42636] ETL-3009: [Column=23 Row=0] [Charset conversion from 'UTF-8' to 'ASCII' failed [Invalid or incomplete multibyte or wide character]] (Session: 1749641517751205888).

On the Source-DB the charset is latin1 (yes i know :(, not my fault^^).

@djamadeus
Copy link
Author

the column contains german letters like "Ü". The Problem here seems to be, that the virtual schema adapter gets latin1 fromt he column definition so it expects latin1, the mysql JDBC driver already converts this to utf-8 though.

@Relativity74205
Copy link

We have a similar error message, however it is independent of the mysql driver version. However, going back to the previous virtual schema version (virtual-schema-dist-9.0.5-mysql-3.0.0.jar) solves the issue.

@djamadeus
Copy link
Author

Jep switching back to 9.0.5 of the adapter script solved the issue for me as well

@ckunki
Copy link
Contributor

ckunki commented Nov 21, 2022

Many thanks for reporting this issue. Please excuse that I answer only now. We will have a look at it.

@ckunki
Copy link
Contributor

ckunki commented Nov 21, 2022

To summarize: version virtual-schema-dist-9.0.5-mysql-3.0.0.jar did not show the behavior described above while virtual-schema-dist-10.0.1-mysql-4.0.0.jar does so, right?

Hi @djamadeus could you please add the result of describing the table? (at least for one of the columns in question of being either ASCII / latin1 / or UTF-8)

In Exasol DB SQL you get this from DESCRIBE <schema>.<table-name>; with <schema> being the name of the virtual schema you created and <table-name> replaced by the name of your table, in total probably office.erstattungen_positionen, then.

After that I would like to exclude potential effects due to caching the schema description. We can do so by refreshing the virtual schema with ALTER VIRTUAL SCHEMA <schema> REFRESH; followed by an additional DESCRIBE <schema>.<table-name>;, see also https://docs.exasol.com/db/latest/sql/alter_schema.htm.

Based on that I will add more information.

@djamadeus
Copy link
Author

@ckunki yes that only happens with version 10.0.1
Descripttion-result with 9.0.5-adapter:

COLUMN_NAME SQL_TYPE
id DECIMAL(10,0)
datum TIMESTAMP
datum_ursprung TIMESTAMP
userid VARCHAR(16) ASCII
orderid VARCHAR(32) ASCII
rekla_id VARCHAR(10) ASCII
quelle CHAR(15) ASCII
quelle_id VARCHAR(15) ASCII
art CHAR(19) ASCII
grund VARCHAR(25) ASCII
mitarbeiter VARCHAR(50) ASCII
bemerkung VARCHAR(160) ASCII
artnr VARCHAR(25) ASCII
sellid VARCHAR(15) ASCII
bezeichnung VARCHAR(50) ASCII
bezeichnung_zusatz VARCHAR(50) ASCII
anzahl DECIMAL(10,0)
einzelpreis DECIMAL(8,2)
brutto_netto CHAR(6) ASCII
steuersatz DECIMAL(4,1)
datum_bestellung TIMESTAMP
zahlungsart_bestellung VARCHAR(32) ASCII
prioritaet CHAR(6) ASCII
status CHAR(11) ASCII
status_bemerkung VARCHAR(160) ASCII
id_gutschrift DECIMAL(10,0)

@djamadeus
Copy link
Author

do you want me to test the refresh-flag with the 10 or 9.x adapter?

@ckunki
Copy link
Contributor

ckunki commented Nov 23, 2022

Thank you very much for the additional information. Please apply the refresh for the new version 10.
Could please also tell me the version of the Exasol database you are using?

@ckunki
Copy link
Contributor

ckunki commented Nov 23, 2022

Command DESCRIBE <schema>.<table-name>; shows that Exasol DB expects ASCII, which matches the error message above:
ETL-3009: [Charset conversion from 'UTF-8' to 'ASCII' failed [Invalid or incomplete multibyte or wide character]]. The prefix ETL- of the error code indicates the message being generated by the ExaLoader which is a core component of the database.

Could you help me to get insight into the perspective of the virtual schema adapter (another component different from ExaLoader)?

You can do so by using EXPLAIN VIRTUAL <sql command>;, see https://docs.exasol.com/db/latest/sql/explain_virtual.htm.
Please use the same SQL command that created the initial error message.

EXPLAIN VIRTUAL returns a table and if you could attach column "PUSHDOWN_SQL" to the current ticket, that will hopefully show which component indicated a UTF-8 encoding, which probably is incorrect and therefore cannot be converted to ASCII.

@djamadeus
Copy link
Author

`
CREATE VIRTUAL SCHEMA virtual_jointDB_office_test
USING system_adapter_scripts.ADAPTER_SCRIPT_MYSQL
WITH
CONNECTION_NAME = 'MYSQL_JOINTDB_JDBC_CONNECTION'
CATALOG_NAME = 'office';

Select * from virtual_jointDB_office_test."erstattungen_positionen";
[42636] ETL-3009: [Column=23 Row=0] [Charset conversion from 'UTF-8' to 'ASCII' failed [Invalid or incomplete multibyte or wide character]] (Session: 1750109444829478912)

ALTER VIRTUAL SCHEMA virtual_jointDB_office_test REFRESH;

Select * from virtual_jointDB_office_test."erstattungen_positionen";
[Column=23 Row=0] [Charset conversion from 'UTF-8' to 'ASCII' failed [Invalid or incomplete multibyte or wide character]] (Session: 1750109444829478912)

explain virtual Select * from virtual_jointDB_office_test."erstattungen_positionen";
Pushdown_SQL:IMPORT INTO (c1 DECIMAL(10, 0), c2 TIMESTAMP, c3 TIMESTAMP, c4 VARCHAR(16) ASCII, c5 VARCHAR(32) ASCII, c6 VARCHAR(10) ASCII, c7 CHAR(15) ASCII, c8 VARCHAR(15) ASCII, c9 CHAR(19) ASCII, c10 VARCHAR(25) ASCII, c11 VARCHAR(50) ASCII, c12 VARCHAR(160) ASCII, c13 VARCHAR(25) ASCII, c14 VARCHAR(15) ASCII, c15 VARCHAR(50) ASCII, c16 VARCHAR(50) ASCII, c17 DECIMAL(10, 0), c18 DECIMAL(8, 2), c19 CHAR(6) ASCII, c20 DECIMAL(4, 1), c21 TIMESTAMP, c22 VARCHAR(32) ASCII, c23 CHAR(6) ASCII, c24 CHAR(11) ASCII, c25 VARCHAR(160) ASCII, c26 DECIMAL(10, 0)) FROM JDBC AT MYSQL_JOINTDB_JDBC_CONNECTION STATEMENT 'SELECT erstattungen_positionen.id, erstattungen_positionen.datum, erstattungen_positionen.datum_ursprung, erstattungen_positionen.userid, erstattungen_positionen.orderid, erstattungen_positionen.rekla_id, erstattungen_positionen.quelle, erstattungen_positionen.quelle_id, erstattungen_positionen.art, erstattungen_positionen.grund, erstattungen_positionen.mitarbeiter, erstattungen_positionen.bemerkung, erstattungen_positionen.artnr, erstattungen_positionen.sellid, erstattungen_positionen.bezeichnung, erstattungen_positionen.bezeichnung_zusatz, erstattungen_positionen.anzahl, erstattungen_positionen.einzelpreis, erstattungen_positionen.brutto_netto, erstattungen_positionen.steuersatz, erstattungen_positionen.datum_bestellung, erstattungen_positionen.zahlungsart_bestellung, erstattungen_positionen.prioritaet, erstattungen_positionen.status, erstattungen_positionen.status_bemerkung, erstattungen_positionen.id_gutschrift FROM office.erstattungen_positionen'`

@djamadeus
Copy link
Author

when I run the pushdown command:
IMPORT INTO (c1 DECIMAL(10, 0), c2 TIMESTAMP, c3 TIMESTAMP, c4 VARCHAR(16) ASCII, c5 VARCHAR(32) ASCII, c6 VARCHAR(10) ASCII, c7 CHAR(15) ASCII, c8 VARCHAR(15) ASCII, c9 CHAR(19) ASCII, c10 VARCHAR(25) ASCII, c11 VARCHAR(50) ASCII, c12 VARCHAR(160) ASCII, c13 VARCHAR(25) ASCII, c14 VARCHAR(15) ASCII, c15 VARCHAR(50) ASCII, c16 VARCHAR(50) ASCII, c17 DECIMAL(10, 0), c18 DECIMAL(8, 2), c19 CHAR(6) ASCII, c20 DECIMAL(4, 1), c21 TIMESTAMP, c22 VARCHAR(32) ASCII, c23 CHAR(6) ASCII, c24 CHAR(11) ASCII, c25 VARCHAR(160) ASCII, c26 DECIMAL(10, 0)) FROM JDBC AT MYSQL_JOINTDB_JDBC_CONNECTION STATEMENT 'SELECT erstattungen_positionen.id, erstattungen_positionen.datum, erstattungen_positionen.datum_ursprung, erstattungen_positionen.userid, erstattungen_positionen.orderid, erstattungen_positionen.rekla_id, erstattungen_positionen.quelle, erstattungen_positionen.quelle_id, erstattungen_positionen.art, erstattungen_positionen.grund, erstattungen_positionen.mitarbeiter, erstattungen_positionen.bemerkung, erstattungen_positionen.artnr, erstattungen_positionen.sellid, erstattungen_positionen.bezeichnung, erstattungen_positionen.bezeichnung_zusatz, erstattungen_positionen.anzahl, erstattungen_positionen.einzelpreis, erstattungen_positionen.brutto_netto, erstattungen_positionen.steuersatz, erstattungen_positionen.datum_bestellung, erstattungen_positionen.zahlungsart_bestellung, erstattungen_positionen.prioritaet, erstattungen_positionen.status, erstattungen_positionen.status_bemerkung, erstattungen_positionen.id_gutschrift FROM office.erstattungen_positionen limit 100' [2022-11-23 12:30:26] [42636] ETL-3009: [Column=23 Row=0] [Charset conversion from 'UTF-8' to 'ASCII' failed [Invalid or incomplete multibyte or wide character]] (Session: 1750109444829478912)

@djamadeus
Copy link
Author

djamadeus commented Nov 23, 2022

If I counted right, the problem is erstattungen_positionen.prioritaet wich is an enum.
Probably always using UTF8 for enums in the newer adapter?

@djamadeus
Copy link
Author

descriptiontable with vs-10.0.1-adapter:

COLUMN_NAME SQL_TYPE NULLABLE DISTRIBUTION_KEY PARTITION_KEY
id DECIMAL(10,0) NULL NULL NULL
datum TIMESTAMP NULL NULL NULL
datum_ursprung TIMESTAMP NULL NULL NULL
userid VARCHAR(16) ASCII NULL NULL NULL
orderid VARCHAR(32) ASCII NULL NULL NULL
rekla_id VARCHAR(10) ASCII NULL NULL NULL
quelle CHAR(15) ASCII NULL NULL NULL
quelle_id VARCHAR(15) ASCII NULL NULL NULL
art CHAR(19) ASCII NULL NULL NULL
grund VARCHAR(25) ASCII NULL NULL NULL
mitarbeiter VARCHAR(50) ASCII NULL NULL NULL
bemerkung VARCHAR(160) ASCII NULL NULL NULL
artnr VARCHAR(25) ASCII NULL NULL NULL
sellid VARCHAR(15) ASCII NULL NULL NULL
bezeichnung VARCHAR(50) ASCII NULL NULL NULL
bezeichnung_zusatz VARCHAR(50) ASCII NULL NULL NULL
anzahl DECIMAL(10,0) NULL NULL NULL
einzelpreis DECIMAL(8,2) NULL NULL NULL
brutto_netto CHAR(6) ASCII NULL NULL NULL
steuersatz DECIMAL(4,1) NULL NULL NULL
datum_bestellung TIMESTAMP NULL NULL NULL
zahlungsart_bestellung VARCHAR(32) ASCII NULL NULL NULL
prioritaet CHAR(6) ASCII NULL NULL NULL
status CHAR(11) ASCII NULL NULL NULL
status_bemerkung VARCHAR(160) ASCII NULL NULL NULL
id_gutschrift DECIMAL(10,0) NULL NULL NULL

@djamadeus
Copy link
Author

Exasol version is 7.1.15 btw

@ckunki
Copy link
Contributor

ckunki commented Nov 23, 2022

Thank you very much for all the information. I am confident that this will enable us to find the correct answer and will come back in short.

@djamadeus
Copy link
Author

My pleasure, Good Luck! :)

@ckunki
Copy link
Contributor

ckunki commented Nov 23, 2022

If I counted right, the problem is erstattungen_positionen.prioritaet wich is an enum.

Thanks for this information. I counted the same and agree to your suspicion.
Enum is not among the generic JDBC base types but a proprietary extension of MySql. Looking at the output of EXPLAIN VIRTUAL and DESCRIBE TABLE JDBC seems to map this to CHAR(6).

As EXPLAIN VIRTUAL and DESCRIBE TABLE identically show datatype CHAR(6) ASCII and in combination with the prefix ETL-3009 of the error message we now assume ExaLoader to not only generate the error message but also to trigger the failed conversion.

If this holds true then the virtual schema adapter cannot influence this behavior.

The only thing that still makes me frown is that the current ticket reports a change of behavior from version virtual-schema-dist-9.0.5-mysql-3.0.0.jar to virtual-schema-dist-10.0.1-mysql-4.0.0.jar. Did you make the observations all with the same database version Exasol 7.1.15?

@djamadeus
Copy link
Author

jep.
Using this as adapter on same DB works without problems:
CREATE OR REPLACE JAVA ADAPTER SCRIPT system_adapter_scripts.ADAPTER_SCRIPT_MYSQL_ASCII AS %scriptclass com.exasol.adapter.RequestDispatcher; %jar /buckets/bucketfs1/bucket1/virtual-schema-dist-9.0.5-mysql-3.0.0.jar; %jar /buckets/bucketfs1/bucket1/mysql-connector-j-8.0.31.jar; / ;
while this creates the reported behavior:
CREATE OR REPLACE JAVA ADAPTER SCRIPT system_adapter_scripts.ADAPTER_SCRIPT_MYSQL AS %scriptclass com.exasol.adapter.RequestDispatcher; %jar /buckets/bucketfs1/bucket1/virtual-schema-dist-10.0.1-mysql-4.0.0.jar; %jar /buckets/bucketfs1/bucket1/mysql-connector-j-8.0.31.jar; / ;
IN the changes to version 10 it says something about, not relying on analyzing the received data anymore but using the table meta-information instead...
That sounds a lot like the actual culprit to me ;)

@ckunki
Copy link
Contributor

ckunki commented Nov 23, 2022

changes to version 10 ... using the table meta-information instead...
That sounds a lot like the actual culprit to me ;)

Thanks for mentioning that as this constantly goes through my mind, too.
I only now found out that this already has been backported to Exasol version 7.1.14. So yes, this is still a candidate.

What I am still wondering about: How could the information be different from what we get from DESCRIBE TABLE?
And actually we must assume that the meta-information includes an indication for UTF-8 while all the other places don't. Not mentioning any encoding would mean the same as UTF-8 is the default.

Would it be OK to ask for two additional log samples?

  1. from the data base / ExaLoader
  2. from the virtual schema and the meta-information

For (1) the relevant log files have names like *EtlJdbc_<etl_proc_num>_<job_num>*.
Here is an example for mapping a JDBC data type to Exasol:

2022-11-22 23:07:42.438 debu: Mapped Column: types[0]=93, internalTypes[0]=INTERNALTYPES_TIMESTAMP

For (2) see https://docs.exasol.com/db/latest/database_concepts/virtual_schema/logging.htm

@djamadeus
Copy link
Author

How would I access the file from (1) ? running a community version instance here. no root access to that blackbox as far as I know

@ckunki
Copy link
Contributor

ckunki commented Nov 24, 2022

Sorry for assuming you might be able to access these logs. In the meantime I could do some deeper experiments and I am close to a proof of our current assumption which is matching your initial diagnosis.

VSMYSQL version 3.x inspected only the data to detect the data type and got UTF-8 which is accepted by Exasol DB when importing UTF-8 encoded values.

VSMYSQL version 4.x inspects the metadata of the JDBC connection to detect the data type of each database column. In our case this is probably CHAR with octet length 1, i.e. 1 byte per character which is more or less correct for encoding latin1. VSMYSQL maps this to Exasol data type ASCII.

When retrieving actual values in a select statement the JDBC driver converts these character to UTF-8 and Exasol database reports an error as it cannot convert these data to data type ASCII.

image

The question is now how to overcome this.
I see the following options

  • (O1) continue to use the old version of the VSMYSQL
  • (O2) say that the German umlaut "Ü" is not strict "ASCII" and recommend to use UTF-8 encoding in MySQL
  • (O3) enhance VSMYSQL to support a parameter to always assume UTF-8
  • (O4) enhance VSMYSQL to support a parameter to ignore connection metadata for data type detection
  • (O5) Find a way to suppress JDBC driver to (silently) convert values to UTF-8

I will ponder these options and discuss with my colleagues.
Please tell me if you have a preference or if you see additional options.

@djamadeus
Copy link
Author

djamadeus commented Nov 24, 2022 via email

@ckunki
Copy link
Contributor

ckunki commented Nov 25, 2022

Thanks for contact information and feeback.
BTW. the image was wrong - I replaced it 🥲

There were a number of reasons for the change in VSMYSQL:

  • Improved performance of queries to virtual schema by avoiding one query for each push down
  • Enhanced accuracy of data type mapping
  • Makes data type mapping simpler and easier to extend
  • Fixes wrong data types in some use cases
  • Supports additional use cases and correct data types in special situations

Also guessing the data types from the values in the result set appeared to be inappropriate and risky. Although, we must must admit, that in your special case the old behavior seems to be preferable. 😃

In the meanwhile we did some experiments with settings of the connection which did not change the behavior.

The bug filed for MySQL Connector J: https://bugs.mysql.com/bug.php?id=69328, submitted: 27 May 2013, status Verified i.e. still open further narrows down the options ☹️.

In summary we are currently focusing option (O4) as proposed by you.

@ckunki
Copy link
Contributor

ckunki commented Nov 30, 2022

To ensure correct understanding of the problem I investigated the behavior of the MySQL JDBC driver:

CREATE TABLE T (c1 CHAR(1), c2  ENUM('A', 'Ü')) CHARACTER SET latin1;
INSERT INTO T VALUES ('Ü', 'Ü');

The following results have been retrieved by using the MySQL JDBC driver com.mysql:mysql-connector-j:8.0.31.

Column Metadata for MySQL table 'T' provided by JDBC driver:

  • Column ‘c1': from adapter notes: { JdbcDataType = 1 (CHAR), typeName = 'CHAR' }, type = CHAR(1) ASCII.
  • Column 'c2': from adapter notes: { JdbcDataType = 1 (CHAR), typeName = 'ENUM' }, type = CHAR(1) ASCII.

Column descriptions from query select * from T, as provided by JDBC driver: c1 CHAR(1) UTF8, c2 CHAR(1) UTF8.

I will also enhance the integration tests of VSMYSQL to reproduce this output.

ckunki added a commit that referenced this issue Dec 5, 2022
* updated version, pk fix, artifact references and added documentation
* #26: Enabled to use MySQL database with character set `latin1` and characters not strictly ASCII.
* updated to latest PR of VSCJDBC
* removed repository maven.exasol.com
* upgraded to exasol-testcontainers 6.4.0
* excluded vulnerablity
Co-authored-by: Christoph Pirkl <christoph.pirkl@exasol.com>
@djamadeus
Copy link
Author

glad to see this fixed. thanks alot!

@djamadeus
Copy link
Author

btw same some problem occurs with postgres adapter...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Unwanted / harmful behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants