This repository has been archived by the owner on Nov 19, 2021. It is now read-only.

UTF8 / Unicode encoding for csv/dataframe load not storing correctly in DB #68

Closed
ausiddiqui opened this issue Jan 5, 2019 · 7 comments · Fixed by #75

Comments


ausiddiqui commented Jan 5, 2019

I believe this is the same as (or similar to) #47, as I'm having the same issue.

giraffez 2.0.24
macOS 10.14.2 / Ubuntu 16
python 3.5 (on macOS) / python 3.6 (on Ubuntu box)
Teradata ODBC / TTU 16.10
Teradata Server 15.10
SQL client: dbeaver / Aqua Data Studio (tried both)
JDBC for SQL client: TeraJDBC_16.20.00.02

import giraffez as g
import pandas as pd

with g.Cmd() as cmd:
    cmd.execute("""
CREATE SET TABLE mydb.mytable ,NO FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
      location VARCHAR(250) CHARACTER SET UNICODE NOT CASESPECIFIC
     )
PRIMARY INDEX ( location );""")

df = pd.DataFrame({'location': ['Düsseldorf','Marché Saint-Germain','İstanbul']})
df.to_csv('location.csv', index=False)

with g.BulkLoad(table='mydb.mytable') as load:
    load.from_file('location.csv', table='mydb.mytable', delimiter=",", null='')
select *
from mydb.mytable

location             
---------------------
Düsseldorf          
Marché Saint-Germain
Ä°stanbul            

I get the same result if I use the dataframe's values and iterate over the rows, as the OP did in #47. The character set of my connection to Teradata when I run the SELECT is UTF-8, and I'm using a GUI client. I am able to see these specific Unicode and non-Latin characters in other tables on the same database where the column's character set is UNICODE in the CREATE TABLE statement. So I believe this is isolated to the load path and not related to either my SQL client or the Teradata server itself.
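For illustration (my own sketch, not code from this thread): the garbled value in the query output above is exactly what appears when the UTF-8 bytes of a string are decoded with a single-byte codec such as Latin-1:

```python
# Sketch (illustrative, not giraffez code): decoding UTF-8 bytes with a
# single-byte codec such as Latin-1 reproduces the garbled query output.
original = "İstanbul"
utf8_bytes = original.encode("utf-8")   # b'\xc4\xb0stanbul'
garbled = utf8_bytes.decode("latin-1")  # each byte becomes one character
print(garbled)  # Ä°stanbul
```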

Originally posted by @ausiddiqui in #47 (comment)


ChrisRx commented Jan 8, 2019

Could you post the output of hexdump location.csv? Just want to rule that out up front.

Alternatively, I see here that the load path isn't setting the character set the way export does, so this might be a case of different platforms defaulting to different character sets (since Unicode works on my end). There is also a "Unicode Passthrough" option in the TPTAPI programming guide, but I haven't found information about the default behavior yet.


ausiddiqui commented Jan 8, 2019

Ran this on my desktop (macOS) using Terminal.

> head -n 5 location.csv
location
Düsseldorf
Marché Saint-Germain
İstanbul
> hexdump location.csv
0000000 6c 6f 63 61 74 69 6f 6e 0a 44 c3 bc 73 73 65 6c
0000010 64 6f 72 66 0a 4d 61 72 63 68 c3 a9 20 53 61 69
0000020 6e 74 2d 47 65 72 6d 61 69 6e 0a c4 b0 73 74 61
0000030 6e 62 75 6c 0a
0000035
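As a sanity check (my own sketch, not from the thread), the bytes in that hexdump decode cleanly as UTF-8, which confirms the CSV file itself is fine and the corruption happens during the load:

```python
# Sketch: the exact bytes from the hexdump above, decoded as UTF-8.
data = bytes.fromhex(
    "6c6f636174696f6e0a"                            # "location\n"
    "44c3bc7373656c646f72660a"                      # "Düsseldorf\n"
    "4d61726368c3a9205361696e742d4765726d61696e0a"  # "Marché Saint-Germain\n"
    "c4b07374616e62756c0a"                          # "İstanbul\n"
)
print(data.decode("utf-8"))
```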

Yeah, that's interesting compared to the export path (which has no problem), where the character set is explicitly specified.


ChrisRx commented Jan 17, 2019

Check out the code in the ext-cleanup branch, in particular this change, which, as you said, wasn't explicitly being set for the Update driver. I have tested it locally, and removing that line reproduces the issue you are experiencing while keeping it fixes it, but I just want to confirm that this is the case for you as well.

@ausiddiqui (Author)

Unfortunately, this didn't do the trick. I didn't check out the full ext-cleanup branch; I just added that one line after the Mload_Init and recompiled. I'll try the whole branch and see if that works.


ChrisRx commented Jan 24, 2019

Hmmm, I was really hoping that would do it, because it worked for me locally. When I comment that line out, it produces the exact same messed-up output you were getting (which is the UTF-8 byte sequence incorrectly decoded as a single-byte encoding, so that makes sense). When I recompile with that line in place, it loads and queries the correct data. Depending on how you are compiling, it is possible it isn't loading the right version; it might be worth doing a make clean and making sure you are using the newly built shared objects.
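That explanation can be checked with a quick round trip (my own sketch): re-encoding the garbled value byte-for-byte with a single-byte codec recovers the original text, which confirms the bytes themselves survived and were only mis-decoded:

```python
# Sketch: the mojibake is reversible, so the underlying bytes are intact
# and only the character-set interpretation was wrong.
garbled = "Ä°stanbul"
recovered = garbled.encode("latin-1").decode("utf-8")
print(recovered)  # İstanbul
```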

@ausiddiqui (Author)

This is now working. I clean installed it:

pip install git+https://github.com/capitalone/giraffez.git@ext-cleanup

Would you suggest using the ext-cleanup version in general? Any plans to merge back to master anytime soon? Thanks.


istvan-fodor commented May 17, 2019

Hi,
Is there a plan to merge ext-cleanup? I am running into the same issue, but ext-cleanup fixed it on Ubuntu.
