UnicodeDecodeError when using athena.read_sql_query #1156

Closed
Chintan-D opened this issue Feb 3, 2022 · 7 comments
Labels: bug (Something isn't working)

Chintan-D commented Feb 3, 2022

Describe the bug

Hi,
I am using the latest version of the awswrangler library to extract a bunch of tables from Athena. For one of my tables, the function athena.read_sql_query fails with this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 230232: character maps to <undefined>

Here is the part of the code that raises this error:
df = wr.athena.read_sql_query(query, database=database, boto3_session=session, ctas_approach=False)

The same code works fine for the other tables.

Here is the detailed error trace:

  File "D:\PythonProjects\venvPy392\lib\site-packages\awswrangler\athena\_read.py", line 897, in read_sql_query
    return _resolve_query_without_cache(
  File "D:\PythonProjects\venvPy392\lib\site-packages\awswrangler\athena\_read.py", line 519, in _resolve_query_without_cache
    return _resolve_query_without_cache_regular(
  File "D:\PythonProjects\venvPy392\lib\site-packages\awswrangler\athena\_read.py", line 425, in _resolve_query_without_cache_regular
    return _fetch_csv_result(
  File "D:\PythonProjects\venvPy392\lib\site-packages\awswrangler\athena\_read.py", line 161, in _fetch_csv_result
    ret = s3.read_csv(
  File "D:\PythonProjects\venvPy392\lib\site-packages\awswrangler\s3\_read_text.py", line 294, in read_csv
    return _read_text(
  File "D:\PythonProjects\venvPy392\lib\site-packages\awswrangler\s3\_read_text.py", line 149, in _read_text
    ret = _read_text_file(
  File "D:\PythonProjects\venvPy392\lib\site-packages\awswrangler\s3\_read_text.py", line 91, in _read_text_file
    df: pd.DataFrame = parser_func(f, **pandas_kwargs)
  File "D:\PythonProjects\venvPy392\lib\site-packages\pandas\io\parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "D:\PythonProjects\venvPy392\lib\site-packages\pandas\io\parsers.py", line 468, in _read
    return parser.read(nrows)
  File "D:\PythonProjects\venvPy392\lib\site-packages\pandas\io\parsers.py", line 1057, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "D:\PythonProjects\venvPy392\lib\site-packages\pandas\io\parsers.py", line 2036, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 827, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 1943, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 230232: character maps to <undefined>

How to Reproduce

Create the following view in Athena:

create view vw_error_test as 
select '“Noto Emoji”' as col

Query this view using awswrangler:

query = 'select * from vw_error_test'
df = wr.athena.read_sql_query(query, database=database, boto3_session=session, ctas_approach=False)
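
The failure can probably be reproduced locally without Athena. The sketch below assumes (this is not confirmed in the thread) that the query result is stored as UTF-8 text and that the reader falls back to the Windows default code page cp1252, which Python implements as a 'charmap' codec. The variable names are illustrative only.

```python
# Assumption: the result file holds UTF-8 bytes and is decoded as cp1252.
text = "“Noto Emoji”"
raw = text.encode("utf-8")

# The closing quote U+201D is encoded in UTF-8 as E2 80 9D.
closing_quote_bytes = "”".encode("utf-8")

decode_error = None
try:
    raw.decode("cp1252")  # byte 0x9D has no mapping in cp1252
except UnicodeDecodeError as exc:
    decode_error = exc

# Decoding with the encoding the bytes were written in round-trips cleanly.
round_trip = raw.decode("utf-8")

if decode_error is not None:
    print(decode_error)
```

The opening quote (E2 80 9C) happens to decode under cp1252 without an error, which is why only byte 0x9d is reported.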

Environment

OS

Windows

Python version

3.9.2

AWS DataWrangler version

2.14.0

Chintan-D added the bug label on Feb 3, 2022

Chintan-D commented Feb 4, 2022

I was able to narrow the issue down to a particular column in the Athena table.
Values in this column contain special characters such as superscripts and Asian characters.

Since I am calling athena.read_sql_query with ctas_approach=False, I suspect that the underlying function is not opening the result file with the correct encoding.

Is it possible to pass an encoding parameter to this function, which would then be used when reading the file (e.g. encoding='utf8')?
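
To illustrate what an explicit encoding parameter would change, here is a minimal standard-library sketch. It is not awswrangler's implementation: `read_csv_bytes` is a hypothetical helper, and the sample bytes assume the result file is UTF-8 encoded CSV.

```python
import csv
import io


def read_csv_bytes(raw: bytes, encoding: str) -> list:
    """Hypothetical helper: decode CSV bytes with an explicit encoding, then parse rows."""
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline="")
    return list(csv.DictReader(stream))


# Assumed shape of a one-column result file, written as UTF-8.
raw_result = '"col"\n"“Noto Emoji”"\n'.encode("utf-8")

# With the encoding stated explicitly, the platform default no longer matters.
rows = read_csv_bytes(raw_result, encoding="utf-8")

# Forcing the Windows default code page reproduces the failure.
failed_with_cp1252 = False
try:
    read_csv_bytes(raw_result, encoding="cp1252")
except UnicodeDecodeError:
    failed_with_cp1252 = True
```

When no encoding is given, Python's text layer uses a locale-dependent default, which would explain why the same code can behave differently on Windows than elsewhere.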

@NickCorbett
Contributor

Hi @Chintan-D - thanks for reaching out. What is the data type of the column in Athena?

Chintan-D commented Feb 9, 2022

> Hi @Chintan-D - thanks for reaching out. What is the data type of the column in Athena?

varchar

CREATE OR REPLACE VIEW vw_error_test AS 
SELECT U&'\201CNoto Emoji\201D' col
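
The escaped form of the view yields the same value as the original reproduction. The Python check below is only an illustration (the variable names are not from the thread); it confirms that code points 201C and 201D are the curly double quotes.

```python
import unicodedata

# Python spelling of the SQL literal U&'\201CNoto Emoji\201D'.
value = "\u201cNoto Emoji\u201d"

# Look up the official Unicode names of the first and last characters.
quote_names = [unicodedata.name(value[0]), unicodedata.name(value[-1])]
print(quote_names)
```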

@cotrariello84

Has the bug been fixed?

@malachi-constant
Contributor

@cotrariello84 Yes, the fix was released in 2.16.0.

@cotrariello84

Hi @malachi-constant, I still have the bug in version 2.17.0. I had to use pyathena to work around it, but I'd like to use wrangler.

@malachi-constant
Contributor

> Hi @malachi-constant, I still have the bug in version 2.17.0. I had to use pyathena to work around it, but I'd like to use wrangler.

See comment
