This repository has been archived by the owner on Nov 19, 2021. It is now read-only.

UTF8 / Unicode encoding for csv/dataframe load not storing correctly in DB #68

Closed
ausiddiqui opened this issue Jan 5, 2019 · 7 comments · Fixed by #75

Comments


ausiddiqui commented Jan 5, 2019

I believe this is the same as (or similar to) #47, as I'm having the same issue.

giraffez 2.0.24
macOS 10.14.2 / Ubuntu 16
python 3.5 (on macOS) / python 3.6 (on Ubuntu box)
Teradata ODBC / TTU 16.10
Teradata Server 15.10
SQL client: dbeaver / Aqua Data Studio (tried both)
JDBC for SQL client: TeraJDBC_16.20.00.02

import giraffez as g
import pandas as pd

with g.Cmd() as cmd:
    cmd.execute("""
CREATE SET TABLE mydb.mytable ,NO FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
      location VARCHAR(250) CHARACTER SET UNICODE NOT CASESPECIFIC
     )
PRIMARY INDEX ( location );""")

df = pd.DataFrame({'location': ['Düsseldorf','Marché Saint-Germain','İstanbul']})
df.to_csv('location.csv', index=False)

with g.BulkLoad(table='mydb.mytable') as load:
    load.from_file('location.csv', table='mydb.mytable', delimiter=",", null='')
select *
from mydb.mytable

location             
---------------------
Düsseldorf          
Marché Saint-Germain
Ä°stanbul            

I get the same result if I use the dataframe's values and iterate over the rows, as the OP did in #47. The character set of my connection to Teradata when I run the SELECT is UTF-8, and I'm using a GUI client. I am able to see these specific Unicode and non-Latin characters in other tables on the same database where the column's character set is UNICODE in the CREATE TABLE statement. So I believe this is isolated to the load path and not related to either my SQL client or the Teradata server itself.
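For illustration (my own sketch, not code from this thread): the garbled value in the query output above is exactly what appears when the UTF-8 bytes of a string are decoded with a single-byte codec such as Latin-1:

```python
# Sketch (illustrative, not giraffez code): decoding UTF-8 bytes with a
# single-byte codec such as Latin-1 reproduces the garbled query output.
original = "İstanbul"
utf8_bytes = original.encode("utf-8")   # b'\xc4\xb0stanbul'
garbled = utf8_bytes.decode("latin-1")  # each byte becomes one character
print(garbled)  # Ä°stanbul
```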

Originally posted by @ausiddiqui in #47 (comment)


ChrisRx commented Jan 8, 2019

Could you post the output of hexdump location.csv? Just want to rule that out up front.

Alternatively, I see here that the load path isn't setting the character set the way export does, so this might be a case of different platforms defaulting to different character sets (since Unicode works on my end). There is also a "Unicode Passthrough" option in the TPTAPI programming guide, but I haven't found information about the default behavior yet.


ausiddiqui commented Jan 8, 2019

Ran this on my desktop (macOS) using Terminal.

> head -n 5 location.csv
location
Düsseldorf
Marché Saint-Germain
İstanbul
> hexdump location.csv
0000000 6c 6f 63 61 74 69 6f 6e 0a 44 c3 bc 73 73 65 6c
0000010 64 6f 72 66 0a 4d 61 72 63 68 c3 a9 20 53 61 69
0000020 6e 74 2d 47 65 72 6d 61 69 6e 0a c4 b0 73 74 61
0000030 6e 62 75 6c 0a
0000035
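As a sanity check (my own sketch, not from the thread), the bytes in that hexdump decode cleanly as UTF-8, which confirms the CSV file itself is fine and the corruption happens during the load:

```python
# Sketch: the exact bytes from the hexdump above, decoded as UTF-8.
data = bytes.fromhex(
    "6c6f636174696f6e0a"                            # "location\n"
    "44c3bc7373656c646f72660a"                      # "Düsseldorf\n"
    "4d61726368c3a9205361696e742d4765726d61696e0a"  # "Marché Saint-Germain\n"
    "c4b07374616e62756c0a"                          # "İstanbul\n"
)
print(data.decode("utf-8"))
```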

Yeah, that's interesting compared to the export path (which has no problem), where the character set is explicitly specified.


ChrisRx commented Jan 17, 2019

Check out the code in the ext-cleanup branch, in particular this change, which, as you said, wasn't explicitly being set for the Update driver. I have tested it locally, and removing that line reproduces the issue you are experiencing while keeping it fixes it, but I just want to confirm that this is the case for you as well.

@ausiddiqui (Author)

Unfortunately, this didn't do the trick. I didn't check out the full ext-cleanup branch; I just added that one line after the Mload_Init and recompiled. I'll try the whole branch and see if that works.


ChrisRx commented Jan 24, 2019

Hmmm, I was really hoping that would do it, because it worked for me locally. When I comment that line out, it produces the exact same messed-up output you were getting (which is the UTF-8 byte sequence incorrectly decoded as a single-byte encoding, so that makes sense). When I recompile with that line in place, it loads and queries the correct data. Depending on how you are compiling, it is possible it isn't loading the right version; it might be worth doing a make clean and making sure you are using the newly built shared objects.
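That explanation can be checked with a quick round trip (my own sketch): re-encoding the garbled value byte-for-byte with a single-byte codec recovers the original text, which confirms the bytes themselves survived and were only mis-decoded:

```python
# Sketch: the mojibake is reversible, so the underlying bytes are intact
# and only the character-set interpretation was wrong.
garbled = "Ä°stanbul"
recovered = garbled.encode("latin-1").decode("utf-8")
print(recovered)  # İstanbul
```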

@ausiddiqui (Author)

This is now working. I clean installed it:

pip install git+https://github.com/capitalone/giraffez.git@ext-cleanup

Would you suggest using the ext-cleanup version in general? Any plans to merge back to master anytime soon? Thanks.


istvan-fodor commented May 17, 2019

Hi,
Is there a plan to merge ext-cleanup? I am running into the same issue, but ext-cleanup fixed it on Ubuntu.
