Latin Extended-A characters being converted to ASCII on INSERT #39

Closed · keitherskine opened this issue Jan 31, 2017 · 24 comments

@keitherskine (Contributor)

If I insert a Latin Extended-A character into my MS SQL Server database, it appears to be getting converted to its ASCII equivalent character.

Here's my setup:

  • Python 3.4.4, running on CentOS 6.6
  • MS SQL Server 2008 R2
  • Microsoft ODBC Driver 11 for SQL Server on Linux, with unixODBC 2.3.2

create table dbo.turbodbc (
  col_int int null,
  col_str_a varchar(100) null,  -- ascii (single-byte per character) column
  col_str_u nvarchar(100) null  -- unicode (actually UCS2 LE, two bytes per character) column
)
# the following string has three test characters separated by spaces:
#   a normal "ascii" character c, hex value \x63
#   c-cedilla, from "Latin-1 Supplement", still a one-byte value \xe7
#   c-with-acute, from "Latin Extended-A", i.e. needs two bytes, value \u0107
# see http://inamidst.com/stuff/unidata/ for reference
string = 'c \xe7 \u0107'
row = [1, string, string]
sql = "insert into dbo.turbodbc (col_int, col_str_a, col_str_u) values (?, ?, ?)"
crsr = conn.cursor()
crsr.execute(sql, row)
conn.commit()

If I query for the hex values of the two columns in the table, I get 0x6320E72063 and 0x63002000E70020006300. As you can see, the third test character has been converted from c-with-acute to plain old c in both columns. I can sort of understand this happening in the first (ascii) column, since it can only hold single-byte characters, but the second "unicode" column should be able to handle it, and should therefore contain 0x63002000E70020000701, i.e. end with 0701 rather than 6300 (note the little-endianness).
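
For reference, the stored bytes can be inspected with something along these lines (a sketch against the connection and table from the snippet above; CONVERT with style 1 renders varbinary as a "0x..." string):

crsr = conn.cursor()
crsr.execute("select convert(varchar(max), convert(varbinary(100), col_str_a), 1), "
             "convert(varchar(max), convert(varbinary(100), col_str_u), 1) "
             "from dbo.turbodbc")
print(crsr.fetchall())  # e.g. [['0x6320E72063', '0x63002000E70020006300']]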

Somewhere along the line, the "Latin Extended-A" character \u0107 is being translated to \u0063, even if the target column is a unicode column. Would this be happening within turbodbc? Or somewhere else? Is turbodbc designed to work only with single-byte characters?

@MathMagique (Member)

I'll have to check if this is some bug in turbodbc. I have a test case for selecting unicode characters:
https://github.com/blue-yonder/turbodbc/blob/master/python/turbodbc_test/test_cursor_select.py#L61
This works with all databases I test with (MySQL and PostgreSQL on Travis, plus Exasol on my development machine).

The complementary test for inserts sadly lacks unicode characters; that's something I'll have to add:
https://github.com/blue-yonder/turbodbc/blob/master/python/turbodbc_test/test_cursor_insert.py#L45

Even with this, it is possible that turbodbc is missing some MS SQL oddities (every ODBC driver seems to have peculiarities of one sort or another). I don't have access to an MS SQL database myself, so if I don't find anything with the existing databases, debugging will be very hard for me.

MathMagique added the bug label Feb 1, 2017
@MathMagique (Member)

In any case, thanks for reporting!

@keitherskine (Contributor, Author)

Thanks Michael! By the way, executemany with buffering works a treat: a million records in 30 seconds for me. It's the one thing missing from pyodbc.
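
The pattern is just the standard executemany (a sketch against the dbo.turbodbc table above; the row contents here are made up):

rows = [[i, 'test', 'test'] for i in range(1000000)]
crsr = conn.cursor()
crsr.executemany(
    "insert into dbo.turbodbc (col_int, col_str_a, col_str_u) values (?, ?, ?)",
    rows)
conn.commit()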

Also, just in case you weren't aware, MS SQL Server is available on Ubuntu (and other Linux flavours) right now: https://docs.microsoft.com/en-us/sql/linux/sql-server-linux-setup-ubuntu. Here's the main page: https://www.microsoft.com/en-us/sql-server/sql-server-vnext-including-Linux

@MathMagique (Member)

Thanks for the numbers and associated praise :-). I was not aware of MS SQL on Linux, and even for free! I think I have a new project...

@keitherskine (Contributor, Author)

On the question of handling Unicode, you might want to read Michael Kleehammer's thoughts on this. He recently did a fairly major update of pyodbc to handle all the different RDBMSs and drivers, adding "encoding" parameters to the connect function. Here are some links:
http://mkleehammer.github.io/pyodbc/unicode.html
https://github.com/mkleehammer/pyodbc/wiki/Connecting-to-MySQL
https://github.com/mkleehammer/pyodbc/wiki/Connecting-to-PostgreSQL

I notice that turbodbc doesn't use SQLDriverConnectW, SQLPrepareW, or SQLExecDirectW. I'm guessing this might have something to do with this issue. Here's an example of their use in pyodbc:
https://github.com/mkleehammer/pyodbc/blob/09072a08b68a2a2b06cb8bb638341c7cc62f1511/src/cursor.cpp#L594-L597

@keitherskine (Contributor, Author)

Just FYI, here's the log from unixODBC when making the database connection:

[ODBC][31227][1486063361.104827][SQLDriverConnect.c][728]
                Entry:
                        Connection = 0x1598c70
                        Window Hdl = (nil)
                        Str In = [database=xxx;uid=xxx;pwd=xxx;dsn=xxx][length = 68]
                        Str Out = 0x1596a00
                        Str Out Max = 1024
                        Str Out Ptr = 0x7fff6199c058
                        Completion = 0
                UNICODE Using encoding ASCII 'UTF8' and UNICODE 'UTF16LE'

                DIAG [01000] [Microsoft][ODBC Driver 11 for SQL Server][SQL Server]Changed database context to 'xxx'.

                DIAG [01000] [Microsoft][ODBC Driver 11 for SQL Server][SQL Server]Changed language setting to us_english.

@MathMagique (Member)

I've installed MSSQL today and can reproduce the Unicode issues. This may require a more thorough investigation.

@MathMagique (Member)

Digging around with this issue a little turned up some more oddities. Inserting Unicode (UTF-8 encoded) strings into MSSQL failed with both Microsoft's ODBC driver and FreeTDS. Querying Unicode from the database, however, works with Microsoft's ODBC driver. It also works with FreeTDS, but only if I set the ClientCharset option to UTF-8.

So the driver definitely plays a role. Still, the available MSSQL drivers seem to have problems with UTF-8 encoded strings.

@keitherskine (Contributor, Author)

Yes, Microsoft SQL Server drivers don't really do UTF-8. Most of the time they use UCS-2 instead (i.e. two bytes per character, fixed width), together with the "wide" versions of the ODBC functions (SQLExecDirectW, etc.). See here for some more info.
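
To see the difference in plain Python (no database involved), here is the test string from the first post under each encoding:

# UTF-8 is variable width; UCS-2/UTF-16LE uses two bytes per character.
string = 'c \xe7 \u0107'
print(string.encode('utf-8'))      # b'c \xc3\xa7 \xc4\x87'
print(string.encode('utf-16-le'))  # b'c\x00 \x00\xe7\x00 \x00\x07\x01'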

@dirkjonker (Contributor) commented Feb 28, 2017

There also seems to be some truncation happening with unicode characters.

>>> cursor.execute("SELECT N'hęīhœøōõ'")
>>> cursor.fetchall()
[['hęīhœ']]

However, that does not happen with an explicit cast:

>>> cursor.execute("SELECT CAST(N'hęīhœøōõ' AS NVARCHAR)")
>>> cursor.fetchall()
[['hęīhœøōõ']]

pyodbc returns the correct string in both cases.

@MathMagique (Member)

There's no way around the wide character functions for MSSQL, I fear.

I'll try to figure out how to safely determine the driver's preferred way of transferring text. Many drivers seem to cope with UTF-8 just fine, so I don't want to force them to use two-byte UCS-2 when most of the characters they transfer are plain old ASCII.

@MathMagique (Member)

@keitherskine I have started implementing unicode support for MSSQL and other databases that do not really support UTF-8. You can find the code in the unicode branch.

The build still breaks on old compilers thanks to overly optimistic usage of standard library features not implemented in GCC 4.8. Retrieving wide characters works, and so does using unicode characters in SQL commands. Parameter support is still missing. To benefit from the new features, you need to pass the parameter prefer_unicode=True to turbodbc.connect().

It's still work in progress, but I hope there is something releasable by the end of the week.

@MathMagique (Member)

I just merged all the unicode fixes to master. Here's how to use it:

>>> import turbodbc

>>> options = turbodbc.make_options(prefer_unicode=True)
>>> con = turbodbc.connect(dsn="MSSQL", uid="SA", pwd="StrongPassword1", turbodbc_options=options)

>>> cursor = con.cursor()
>>> cursor.execute("INSERT INTO test_unicode VALUES (?)", [u'I \u2665 unicode'])
>>> print cursor.execute("SELECT * FROM test_unicode").fetchall()
[[u'I \u2665 unicode']]

@keitherskine (Contributor, Author)

Hi @MathMagique, has version 1.1 been released to PyPI yet? I tried installing the latest version of turbodbc using pip install git+https://github.com/blue-yonder/turbodbc.git --upgrade but got the error error: package directory 'turbodbc' does not exist. I also tried building from scratch (on Ubuntu 16.04) but couldn't get past the error CMake Error: The following variables are used in this project, but they are set to NOTFOUND. ... Numpy_INCLUDE_DIR. Any and all help appreciated!

@xhochy (Collaborator) commented Mar 18, 2017

For building from scratch, you need numpy installed (in your virtualenv). If it is installed but the scripts are not detecting it even after a clean build, it would be helpful if you could post the full output of CMake.

@MathMagique (Member)

And it has not been released yet ;-)

@keitherskine (Contributor, Author)

Thank you @xhochy for your suggestion. I managed to create a build in the end. But for the sake of the other kids at the back of the class like me, it might be helpful to update the build instructions to start off as follows:

  1. Create a Python virtual environment and install numpy into it (pip install numpy).
  2. Clone turbodbc into the virtual environment somewhere (git clone https://github.com/blue-yonder/turbodbc.git).
  3. Run git submodule update --init --recursive to pull pybind11 into the git repository.

After that, the build process should run smoothly; it did for me.

@keitherskine (Contributor, Author)

I've been trying the new build, but I don't seem to be able to insert any string into a table, even ascii. Here is my setup:

  • Ubuntu 16.04 64-bit
  • Python 3.5.2
  • turbodbc built from the latest master branch
  • connecting to "SQL Server vNext (CTP1.3) - 14.0.304.138 (X64)" using unixODBC 2.3.1

Here's the code I'm running:

options = turbodbc.make_options(prefer_unicode=True)
conn = turbodbc.connect(dsn='xxx', uid='xxx', pwd='xxx', database='testdb', turbodbc_options=options)
crsr = conn.cursor()
string = 'c'
crsr.execute("INSERT INTO dbo.turbodbc VALUES (?, ?)", (string, string))

This is the result:

Traceback (most recent call last):
  File "turbodbc_test.py", line 17, in <module>
    crsr.execute("INSERT INTO dbo.turbodbc VALUES (?, ?)", (string, string))
  File "/home/keith/dev/pyvenvs/turbodbc/lib/python3.5/site-packages/turbodbc/exceptions.py", line 31, in wrapper
    return f(*args, **kwds)
  File "/home/keith/dev/pyvenvs/turbodbc/lib/python3.5/site-packages/turbodbc/cursor.py", line 70, in execute
    buffer.add_set(parameters)
RuntimeError: Could not convert python value to C++

It doesn't seem to matter whether I set prefer_unicode to True or False; I get the same result either way. I can INSERT values into the table only if they are hard-coded (i.e. not passed as parameters).

@keitherskine (Contributor, Author)

One other thing on the build instructions: it says tests can be executed with ctest, but ctest seems to be looking for a py.test file which doesn't exist, as far as I can tell. Are there some parameters to ctest that should be included? A specific test file, perhaps?

@MathMagique (Member)

@keitherskine Regarding the build instructions: that would be a valuable pull request. As for the py.test stuff, you need to pip install the packages pytest and mock. This should get ctest to a stage where it will attempt to run the integration tests. The integration tests require the variable TURBODBC_TEST_CONFIGURATION_FILES to be set to a comma-separated list of the JSON files you can find in the integration test folder. These files contain various fixtures for the database tests, and also a data source name and/or connection data.
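
For example, something along these lines before invoking ctest (the fixture file name here is just an illustration; use whichever JSON files match your setup):

import os
import subprocess

# Hypothetical fixture file name; point this at the JSON file(s) for your databases.
os.environ['TURBODBC_TEST_CONFIGURATION_FILES'] = 'query_fixtures_mssql.json'
subprocess.run(['ctest', '--output-on-failure'], check=True)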

The integration tests work on Travis using Ubuntu 14 and OSX. They also work with my local setup, which uses an Ubuntu 16.04 VM and Microsoft's ODBC driver.

The error you experience could be caused by a too-old version of pybind11. pybind11 only recently started to add support for C++'s u16string type. If the driver reports to turbodbc that the parameter is meant for an NVARCHAR column, turbodbc will use the unicode path in https://github.com/blue-yonder/turbodbc/blob/master/cpp/turbodbc_python/Library/src/determine_parameter_type.cpp#L97. The latest version of pybind11 is not released yet (which also blocks a turbodbc release). I will try to test things again when pybind11 is released.

@keitherskine (Contributor, Author)

Thank you for the feedback @MathMagique. I managed to get the ctest tests for MS SQL to run, and they did indeed pass. I tried updating my local pybind11 to the latest version to get my separate Python test to pass, but it didn't seem to make any difference. As you say, it looks like I'll have to wait until the next version of pybind11 is released.

As suggested, I have raised PR #61 for the docs update.

@keitherskine (Contributor, Author)

Hi @MathMagique, just wanted to make sure you were aware that pybind11 released version 2.1.0 yesterday, so it may be possible to release v1.1 of turbodbc now.

@MathMagique (Member)

Hey @keitherskine! Thanks for the note. I was aware of it already. I am hoping to include #57, and I am planning a release for this weekend.

@keitherskine (Contributor, Author)

Excellent, many thanks @MathMagique ! I look forward to trying out turbodbc on Windows.
