
OSError: Not yet implemented: Unsupported encoding. #442

Closed
csy08 opened this issue Nov 4, 2020 · 5 comments
Labels: bug (Something isn't working), wont implement (Will not be implemented)

Comments

csy08 commented Nov 4, 2020

When trying to read in an AWS Cost and Usage Report (CUR) in Parquet format, I get an encoding error on a particular file when I request the column identity_line_item_id (it just seems to be this column that has the issue).

The column identity_line_item_id should just be alphanumeric, so I don't understand why it's causing an issue. Examples:

qcoyzdqfgurbdckd37f4wybt7r4qkqscpdgtd4dqcekbj2nnwuzq
us76r32alymcndaayvqepufnqkj6z37wlcrffuxyl3o2nkiogngq
hkotcwyxeamaqxg2ta4yzaqusz7pbzcptzrznzbjv435gqmp4rta
wh6qai5rprlcxso23cl2xg43s2hspzj3mfcwxetyhvs6gmwz3s3a
vpuafqfpu3ljjh3aewrmw32yaccyncksbphxx2kg53uyi3ywra2a
ndhwhrhjgasimuknnrjewkdfbfwz35oegqbqa7llt3dqziedotlq
s3nzxioecg6icyndf4qxn5ac4n74ige56pkjwdclyo6ywg73c7ja
k6sgtmfb5wid3g5cfmpyexuozzs34rw4qqoc3d25qw2elpitomfq
ron3lc6cp2767zzvjhnplkapnwkpmqgwezmpbnmiwcm6urbysrsq
5ei7xalvxwhzyqulcvl6vl6i67neep3qgj2ydkftz2zrqewkymoa
q3suwaqrg35kb2zbgoj4qnip4bw7d6i4jhjv5ukxn7vgfm76quca

Code

import awswrangler as wr

bucket = 'bucket_name'
key = 'test-cur-00001.snappy.parquet'

if __name__ == '__main__':
    df = wr.s3.read_parquet(path=f"s3://{bucket}/", path_suffix=key, dataset=True, columns=[
        "identity_line_item_id"
    ])
Traceback (most recent call last):
  File "/Users/csy08/Documents/Development/playground/parquets/awswranlger_read_parquet.py", line 8, in <module>
    "identity_line_item_id",
  File "/Users/csy08/Documents/Development/playground/parquets/venv/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 567, in read_parquet
    return _read_parquet(path=paths[0], **args)
  File "/Users/csy08/Documents/Development/playground/parquets/venv/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 389, in _read_parquet
    use_threads=use_threads,
  File "/Users/csy08/Documents/Development/playground/parquets/venv/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 346, in _read_parquet_file
    return pq_file.read(columns=columns, use_threads=False, use_pandas_metadata=False)
  File "/Users/csy08/Documents/Development/playground/parquets/venv/lib/python3.7/site-packages/pyarrow/parquet.py", line 328, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1121, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Not yet implemented: Unsupported encoding.

I can use the following code to export to CSV or JSON without a problem:

import boto3

s3c = boto3.client('s3')

bucket = 'bucket_name'
key = 'test-cur-00001.snappy.parquet'

if __name__ == '__main__':
    response = s3c.select_object_content(
        Bucket=bucket,
        Key=key,
        Expression='SELECT * FROM s3object',
        ExpressionType='SQL',
        InputSerialization={
            'Parquet': {}
        },
        OutputSerialization={
            'CSV': {}
        },
        RequestProgress={'Enabled': True},
    )

    end_event_received = False

    with open('output.csv', 'wb') as f:
        for event in response['Payload']:
            if 'Records' in event:
                data = event['Records']['Payload']
                f.write(data)
            elif 'Progress' in event:
                print(event['Progress']['Details'])
            elif 'End' in event:
                print('Result is complete')
                end_event_received = True

    if not end_event_received:
        raise Exception("End event not received, request incomplete.")
csy08 added the bug (Something isn't working) label Nov 4, 2020
igorborgest (Contributor) commented

Hi @csy08! Thanks for reporting it.

CUR currently writes your report using the “V2” Parquet schema, and Apache Arrow (the Wrangler dependency that handles low-level Parquet functionality) doesn’t yet support the encodings used by this schema.

So, unfortunately, Wrangler/Apache Arrow will not be able to read columns encoded with DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, and DELTA_BYTE_ARRAY for now.
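
If you want to confirm which column chunks are affected, here is a rough sketch using pyarrow's metadata API on a local copy of the file (the file name below is just the one from your report; reading the footer metadata works even when the data pages use unsupported encodings):

import pyarrow.parquet as pq

# Hypothetical local copy of one of the CUR Parquet files.
meta = pq.ParquetFile("test-cur-00001.snappy.parquet").metadata

for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        # Column chunks reporting DELTA_* encodings are the ones Arrow cannot decode yet.
        print(chunk.path_in_schema, chunk.encodings)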

My recommendation is to use CSV/JSON, or to skip the unsupported columns if you can.
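
For example, a minimal sketch of skipping the problematic column, assuming the remaining columns use supported encodings (the bucket name is a placeholder):

import awswrangler as wr

path = "s3://bucket_name/test-cur-00001.snappy.parquet"

# The footer metadata is still readable, so the column list can be built from it
# and the DELTA-encoded column dropped before reading the data pages.
columns_types, _ = wr.s3.read_parquet_metadata(path=path)
wanted = [c for c in columns_types if c != "identity_line_item_id"]

df = wr.s3.read_parquet(path=path, columns=wanted)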

Reference: https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-strings-delta_byte_array--7

P.S. Let's block this issue until we have a CUR or Apache Arrow update.

igorborgest self-assigned this Nov 5, 2020
igorborgest added the blocked (Something is blocking the development) label Nov 5, 2020
igorborgest removed their assignment Nov 5, 2020
csy08 (Author) commented Nov 5, 2020

@igorborgest Thank you for getting back to me. Understood; I shall try to avoid the Parquet files for now.

clemthi commented Nov 27, 2020

Hi @igorborgest,
I have a similar issue, but there is something I don't understand: the CUR Parquet datasets I'm using are usually between 15 and 20 files. I'm able to load them all except the last file (usually around 450 KB).
Why is this last file encoded in a different way?

igorborgest (Contributor) commented

This will be supported naturally once CUR or Apache Arrow overcomes this incompatibility.

igorborgest added the wont implement (Will not be implemented) label and removed the blocked (Something is blocking the development) label Jan 27, 2021
mattboyd-aws (Contributor) commented

One way to work around this is to use Athena CTAS statements to manipulate the data instead. As a bonus, this offloads the CPU- and memory-intensive operations to Athena. Here is an example that uses CTAS to write CUR joined with AWS Account metadata: https://github.com/aws-samples/glue-enrich-cost-and-usage.
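
For example, a rough sketch using Wrangler's Athena integration (read_sql_query with the default ctas_approach=True runs a CTAS behind the scenes); the database and table names below are placeholders and assume the CUR report is already registered in the Glue catalog:

import awswrangler as wr

# Athena materializes the result via CTAS and Wrangler reads what Athena wrote,
# so the DELTA-encoded source files are never parsed client-side.
df = wr.athena.read_sql_query(
    sql="SELECT identity_line_item_id FROM cur_report LIMIT 1000",
    database="cur_database",
    ctas_approach=True,
)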
