Latin Extended-A characters being converted to ASCII on INSERT #39

Closed · keitherskine opened this issue Jan 31, 2017 · 24 comments

@keitherskine (Contributor)

If I insert a Latin Extended-A character into my MS SQL Server database, it appears to be getting converted to its ASCII equivalent character.

Here's my setup:

  • Python 3.4.4, running on CentOS 6.6
  • MS SQL Server 2008 R2
  • Microsoft ODBC Driver 11 for SQL Server on Linux, with unixODBC 2.3.2

create table dbo.turbodbc (
  col_int int null,
  col_str_a varchar(100) null,  -- ascii (single-byte per character) column
  col_str_u nvarchar(100) null  -- unicode (actually UCS2 LE, two bytes per character) column
)
# the following string has three test characters separated by spaces:
#   a normal "ascii" character c, hex value \x63
#   c-cedilla, from "Latin-1 Supplement", still a one-byte value \xe7
#   c-with-acute, from "Latin Extended-A", i.e. needs two bytes, value \u0107
# see http://inamidst.com/stuff/unidata/ for reference
string = 'c \xe7 \u0107'
row = [1, string, string]
sql = "insert into dbo.turbodbc (col_int, col_str_a, col_str_u) values (?, ?, ?)"
crsr = conn.cursor()
crsr.execute(sql, row)
conn.commit()

If I query for the hex values of the two columns in the table, I get 0x6320E72063 and 0x63002000E70020006300. As you can see, the third test character has been converted from c-with-acute to plain old c in both columns. I can sort of understand this happening in the first (ascii) column, since it can only hold single-byte characters, but the second "unicode" column should be able to handle it, and should therefore contain 0x63002000E70020000701, i.e. end with 0701 rather than 6300 (note the little-endianness).
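
For reference, the stored bytes can be inspected with something along these lines (a sketch against the connection and table from the snippet above; CONVERT with style 1 renders varbinary as a "0x..." string):

crsr = conn.cursor()
crsr.execute("select convert(varchar(max), convert(varbinary(100), col_str_a), 1), "
             "convert(varchar(max), convert(varbinary(100), col_str_u), 1) "
             "from dbo.turbodbc")
print(crsr.fetchall())  # e.g. [['0x6320E72063', '0x63002000E70020006300']]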

Somewhere along the line, the "Latin Extended-A" character \u0107 is being translated to \u0063, even if the target column is a unicode column. Would this be happening within turbodbc? Or somewhere else? Is turbodbc designed to work only with single-byte characters?

@MathMagique (Member)

I'll have to check if this is some bug in turbodbc. I have a test case for selecting unicode characters:
https://github.com/blue-yonder/turbodbc/blob/master/python/turbodbc_test/test_cursor_select.py#L61
This works with all databases I test with (MySQL and PostgreSQL on Travis, plus Exasol on my development machine).

The complementary test for inserts sadly lacks unicode characters; that's something I'll have to add:
https://github.com/blue-yonder/turbodbc/blob/master/python/turbodbc_test/test_cursor_insert.py#L45

Even with this, it is possible that turbodbc is missing some MS SQL oddities (every ODBC driver seems to have peculiarities of one sort or another). I don't have access to an MS SQL database myself, so if I don't find anything with the existing databases, debugging will be very hard for me.

MathMagique added the bug label Feb 1, 2017
@MathMagique (Member)

In any case, thanks for reporting!

@keitherskine (Contributor, Author)

Thanks Michael! By the way, executemany with buffering works a treat: a million records in 30 seconds for me. It's the one thing missing from pyodbc.
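
The pattern is just the standard executemany (a sketch against the dbo.turbodbc table above; the row contents here are made up):

rows = [[i, 'test', 'test'] for i in range(1000000)]
crsr = conn.cursor()
crsr.executemany(
    "insert into dbo.turbodbc (col_int, col_str_a, col_str_u) values (?, ?, ?)",
    rows)
conn.commit()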

Also, just in case you weren't aware, MS SQL Server is available on Ubuntu (and other Linux flavours) right now: https://docs.microsoft.com/en-us/sql/linux/sql-server-linux-setup-ubuntu. Here's the main page: https://www.microsoft.com/en-us/sql-server/sql-server-vnext-including-Linux

@MathMagique (Member)

Thanks for the numbers and associated praise :-). I was not aware of MS SQL on Linux, and even for free! I think I have a new project...

@keitherskine (Contributor, Author)

On the question of handling Unicode, you might want to read Michael Kleehammer's thoughts on this. He recently did a fairly major update of pyodbc to handle all the different RDBMSs and drivers, adding "encoding" parameters to the connect function. Here are some links:
http://mkleehammer.github.io/pyodbc/unicode.html
https://github.com/mkleehammer/pyodbc/wiki/Connecting-to-MySQL
https://github.com/mkleehammer/pyodbc/wiki/Connecting-to-PostgreSQL

I notice that turbodbc doesn't use SQLDriverConnectW, SQLPrepareW, or SQLExecDirectW. I'm guessing this might have something to do with this issue. Here's an example of their use in pyodbc:
https://github.com/mkleehammer/pyodbc/blob/09072a08b68a2a2b06cb8bb638341c7cc62f1511/src/cursor.cpp#L594-L597

@keitherskine (Contributor, Author)

Just FYI, here's the log from unixODBC when making the database connection:

[ODBC][31227][1486063361.104827][SQLDriverConnect.c][728]
                Entry:
                        Connection = 0x1598c70
                        Window Hdl = (nil)
                        Str In = [database=xxx;uid=xxx;pwd=xxx;dsn=xxx][length = 68]
                        Str Out = 0x1596a00
                        Str Out Max = 1024
                        Str Out Ptr = 0x7fff6199c058
                        Completion = 0
                UNICODE Using encoding ASCII 'UTF8' and UNICODE 'UTF16LE'

                DIAG [01000] [Microsoft][ODBC Driver 11 for SQL Server][SQL Server]Changed database context to 'xxx'.

                DIAG [01000] [Microsoft][ODBC Driver 11 for SQL Server][SQL Server]Changed language setting to us_english.

@MathMagique (Member)

I've installed MSSQL today and can reproduce the Unicode issues. This may require a more thorough investigation.

@MathMagique (Member)

Digging around with this issue a little turned up some more oddities. Inserting Unicode (UTF-8 encoded) strings into MSSQL failed with both Microsoft's ODBC driver and FreeTDS. Querying Unicode from the database, however, works with Microsoft's ODBC driver. It also works with FreeTDS, but only if I set the ClientCharset option to UTF-8.

So the driver definitely plays a role. Still, the available MSSQL drivers seem to have problems with UTF-8 encoded strings.

@keitherskine (Contributor, Author)

Yes, Microsoft SQL Server drivers don't really do UTF-8. Most of the time they use UCS-2 instead (i.e. two bytes per character, fixed width), together with the "wide" versions of the ODBC functions (SQLExecDirectW, etc.). See here for some more info.
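
To see the difference in plain Python (no database involved), here is the test string from the first post under each encoding:

# UTF-8 is variable width; UCS-2/UTF-16LE uses two bytes per character.
string = 'c \xe7 \u0107'
print(string.encode('utf-8'))      # b'c \xc3\xa7 \xc4\x87'
print(string.encode('utf-16-le'))  # b'c\x00 \x00\xe7\x00 \x00\x07\x01'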

@dirkjonker (Contributor) commented Feb 28, 2017

There also seems to be some truncation happening with unicode characters.

>>> cursor.execute("SELECT N'hęīhœøōõ'")
>>> cursor.fetchall()
[['hęīhœ']]

However, that does not happen with an explicit cast:

>>> cursor.execute("SELECT CAST(N'hęīhœøōõ' AS NVARCHAR)")
>>> cursor.fetchall()
[['hęīhœøōõ']]

pyodbc returns the correct string in both cases.

@MathMagique (Member)

There's no way around the wide character functions for MSSQL, I fear.

I'll try to figure out how to safely determine the driver's preferred way of transferring text. Many drivers seem to cope with UTF-8 just fine, so I don't want to force them to use two-byte UCS-2 when most of the characters they transfer are plain old ASCII.

@MathMagique (Member)

@keitherskine I have started implementing unicode support for MSSQL and other databases that do not really support UTF-8. You can find the code in the unicode branch.

The build still breaks on old compilers thanks to overly optimistic usage of standard library features not implemented in GCC 4.8. Retrieving wide characters works, and so does using unicode characters in SQL commands. Parameter support is still missing. To benefit from the new features, you need to pass the parameter prefer_unicode=True to turbodbc.connect().

It's still work in progress, but I hope there is something releasable by the end of the week.

@MathMagique (Member)

I just merged all the unicode fixes to master. Here's how to use it:

>>> import turbodbc

>>> options = turbodbc.make_options(prefer_unicode=True)
>>> con = turbodbc.connect(dsn="MSSQL", uid="SA", pwd="StrongPassword1", turbodbc_options=options)

>>> cursor = con.cursor()
>>> cursor.execute("INSERT INTO test_unicode VALUES (?)", [u'I \u2665 unicode'])
>>> print cursor.execute("SELECT * FROM test_unicode").fetchall()
[[u'I \u2665 unicode']]

@keitherskine (Contributor, Author)

Hi @MathMagique, has version 1.1 been released to PyPI yet? I tried installing the latest version of turbodbc using pip install git+https://github.com/blue-yonder/turbodbc.git --upgrade but got the error error: package directory 'turbodbc' does not exist. I also tried building from scratch (on Ubuntu 16.04) but couldn't get past the error CMake Error: The following variables are used in this project, but they are set to NOTFOUND. ... Numpy_INCLUDE_DIR. Any and all help appreciated!

@xhochy (Collaborator) commented Mar 18, 2017

For building from scratch, you need numpy installed (in your virtualenv). If it is installed but the scripts are not detecting it even after a clean build, it would be helpful if you could post the full output of CMake.

@MathMagique (Member)

And it has not been released yet ;-)

@keitherskine (Contributor, Author)

Thank you @xhochy for your suggestion. I managed to create a build in the end. But for the sake of the other kids at the back of the class like me, it might be helpful to update the build instructions to start off as follows:

  1. Create a Python virtual environment and install numpy into it (pip install numpy).
  2. Clone turbodbc into the virtual environment somewhere (git clone https://github.com/blue-yonder/turbodbc.git).
  3. Run git submodule update --init --recursive to pull pybind11 into the git repository.

After that, the build process should run smoothly; it did for me.

@keitherskine (Contributor, Author)

I've been trying the new build, but I don't seem to be able to insert any string into a table, even ascii. Here is my setup:

  • Ubuntu 16.04 64-bit
  • Python 3.5.2
  • turbodbc built from the latest master branch
  • connecting to "SQL Server vNext (CTP1.3) - 14.0.304.138 (X64)" using unixODBC 2.3.1

Here's the code I'm running:

options = turbodbc.make_options(prefer_unicode=True)
conn = turbodbc.connect(dsn='xxx', uid='xxx', pwd='xxx', database='testdb', turbodbc_options=options)
crsr = conn.cursor()
string = 'c'
crsr.execute("INSERT INTO dbo.turbodbc VALUES (?, ?)", (string, string))

This is the result:

Traceback (most recent call last):
  File "turbodbc_test.py", line 17, in <module>
    crsr.execute("INSERT INTO dbo.turbodbc VALUES (?, ?)", (string, string))
  File "/home/keith/dev/pyvenvs/turbodbc/lib/python3.5/site-packages/turbodbc/exceptions.py", line 31, in wrapper
    return f(*args, **kwds)
  File "/home/keith/dev/pyvenvs/turbodbc/lib/python3.5/site-packages/turbodbc/cursor.py", line 70, in execute
    buffer.add_set(parameters)
RuntimeError: Could not convert python value to C++

It doesn't seem to matter whether I set prefer_unicode to True or False; I get the same result either way. I can INSERT values into the table only if they are hard-coded (i.e. not passed as parameters).

@keitherskine (Contributor, Author)

One other thing on the build instructions: it says tests can be executed with ctest, but ctest seems to be looking for a py.test file which doesn't exist, as far as I can tell. Are there some parameters to ctest that should be included? A specific test file, perhaps?

@MathMagique (Member)

@keitherskine Regarding the build instructions: that would be a valuable pull request. As for the py.test stuff, you need to pip install the packages pytest and mock. This should get ctest to a stage where it will attempt to run the integration tests. The integration tests require the variable TURBODBC_TEST_CONFIGURATION_FILES to be set to a comma-separated list of the JSON files you can find in the integration test folder. These files contain various fixtures for the database tests, and also a data source name and/or connection data.
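
For example, something along these lines before invoking ctest (the fixture file name here is just an illustration; use whichever JSON files match your setup):

import os
import subprocess

# Hypothetical fixture file name; point this at the JSON file(s) for your databases.
os.environ['TURBODBC_TEST_CONFIGURATION_FILES'] = 'query_fixtures_mssql.json'
subprocess.run(['ctest', '--output-on-failure'], check=True)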

The integration tests work on Travis using Ubuntu 14 and OSX. They also work with my local setup, which uses an Ubuntu 16.04 VM and Microsoft's ODBC driver.

The error you experience could be caused by a too-old version of pybind11. pybind11 only recently started to add support for C++'s u16string type. If the driver reports to turbodbc that the parameter is meant for an NVARCHAR column, turbodbc will use the unicode path in https://github.com/blue-yonder/turbodbc/blob/master/cpp/turbodbc_python/Library/src/determine_parameter_type.cpp#L97. The latest version of pybind11 is not released yet (which also blocks a turbodbc release). I will try to test things again when pybind11 is released.

@keitherskine (Contributor, Author)

Thank you for the feedback @MathMagique. I managed to get the ctest tests for MS SQL to run, and they did indeed pass. I tried updating my local pybind11 to the latest version to get my separate Python test to pass, but it didn't seem to make any difference. As you say, it looks like I'll have to wait until the next version of pybind11 is released.

As suggested, I have raised PR #61 for the docs update.

@keitherskine (Contributor, Author)

Hi @MathMagique, just wanted to make sure you were aware that pybind11 released version 2.1.0 yesterday, so it may be possible to release v1.1 of turbodbc now.

@MathMagique (Member)

Hey @keitherskine! Thanks for the note. I was aware of it already. I am hoping to include #57, and I am planning a release for this weekend.

@keitherskine (Contributor, Author)

Excellent, many thanks @MathMagique ! I look forward to trying out turbodbc on Windows.
