
OSError: Not yet implemented: Unsupported encoding. #442

Closed
csy08 opened this issue Nov 4, 2020 · 5 comments
Labels: bug (Something isn't working), wont implement (Will not be implemented)

Comments

csy08 commented Nov 4, 2020

When trying to read in an AWS Cost and Usage Report (CUR) in Parquet format, I get an encoding error on a particular file when I request the column identity_line_item_id (it just seems to be this column that has the issue).

The column identity_line_item_id should just be alphanumeric, so I don't understand why it's causing an issue. Examples:

qcoyzdqfgurbdckd37f4wybt7r4qkqscpdgtd4dqcekbj2nnwuzq
us76r32alymcndaayvqepufnqkj6z37wlcrffuxyl3o2nkiogngq
hkotcwyxeamaqxg2ta4yzaqusz7pbzcptzrznzbjv435gqmp4rta
wh6qai5rprlcxso23cl2xg43s2hspzj3mfcwxetyhvs6gmwz3s3a
vpuafqfpu3ljjh3aewrmw32yaccyncksbphxx2kg53uyi3ywra2a
ndhwhrhjgasimuknnrjewkdfbfwz35oegqbqa7llt3dqziedotlq
s3nzxioecg6icyndf4qxn5ac4n74ige56pkjwdclyo6ywg73c7ja
k6sgtmfb5wid3g5cfmpyexuozzs34rw4qqoc3d25qw2elpitomfq
ron3lc6cp2767zzvjhnplkapnwkpmqgwezmpbnmiwcm6urbysrsq
5ei7xalvxwhzyqulcvl6vl6i67neep3qgj2ydkftz2zrqewkymoa
q3suwaqrg35kb2zbgoj4qnip4bw7d6i4jhjv5ukxn7vgfm76quca

Code

import awswrangler as wr

bucket = 'bucket_name'
key = 'test-cur-00001.snappy.parquet'

if __name__ == '__main__':
    df = wr.s3.read_parquet(path=f"s3://{bucket}/", path_suffix=key, dataset=True, columns=[
        "identity_line_item_id"
    ])
Traceback (most recent call last):
  File "/Users/csy08/Documents/Development/playground/parquets/awswranlger_read_parquet.py", line 8, in <module>
    "identity_line_item_id",
  File "/Users/csy08/Documents/Development/playground/parquets/venv/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 567, in read_parquet
    return _read_parquet(path=paths[0], **args)
  File "/Users/csy08/Documents/Development/playground/parquets/venv/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 389, in _read_parquet
    use_threads=use_threads,
  File "/Users/csy08/Documents/Development/playground/parquets/venv/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 346, in _read_parquet_file
    return pq_file.read(columns=columns, use_threads=False, use_pandas_metadata=False)
  File "/Users/csy08/Documents/Development/playground/parquets/venv/lib/python3.7/site-packages/pyarrow/parquet.py", line 328, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1121, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Not yet implemented: Unsupported encoding.

I can use the following code to export to CSV or JSON without a problem:

import boto3

s3c = boto3.client('s3')

bucket = 'bucket_name'
key = 'test-cur-00001.snappy.parquet'

if __name__ == '__main__':
    response = s3c.select_object_content(
        Bucket=bucket,
        Key=key,
        Expression='SELECT * FROM s3object',
        ExpressionType='SQL',
        InputSerialization={
            'Parquet': {}
        },
        OutputSerialization={
            'CSV': {}
        },
        RequestProgress={'Enabled': True},
    )

    end_event_received = False

    with open('output.csv', 'wb') as f:
        for event in response['Payload']:
            if 'Records' in event:
                data = event['Records']['Payload']
                f.write(data)
            elif 'Progress' in event:
                print(event['Progress']['Details'])
            elif 'End' in event:
                print('Result is complete')
                end_event_received = True

    if not end_event_received:
        raise Exception("End event not received, request incomplete.")
csy08 added the bug (Something isn't working) label Nov 4, 2020
igorborgest (Contributor) commented

Hi @csy08! Thanks for reporting it.

CUR currently writes your report using the “V2” Parquet schema, and Apache Arrow (the Wrangler dependency that handles low-level Parquet functionality) doesn’t yet support the encodings used by this schema.

So, unfortunately, Wrangler/Apache Arrow will not be able to read columns encoded with DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, and DELTA_BYTE_ARRAY for now.
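
If you want to confirm which column chunks are affected, here is a rough sketch using pyarrow's metadata API on a local copy of the file (the file name below is just the one from your report; reading the footer metadata works even when the data pages use unsupported encodings):

import pyarrow.parquet as pq

# Hypothetical local copy of one of the CUR Parquet files.
meta = pq.ParquetFile("test-cur-00001.snappy.parquet").metadata

for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        # Column chunks reporting DELTA_* encodings are the ones Arrow cannot decode yet.
        print(chunk.path_in_schema, chunk.encodings)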

My recommendation is to use CSV/JSON, or to skip the unsupported columns if you can.
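
For example, a minimal sketch of skipping the problematic column, assuming the remaining columns use supported encodings (the bucket name is a placeholder):

import awswrangler as wr

path = "s3://bucket_name/test-cur-00001.snappy.parquet"

# The footer metadata is still readable, so the column list can be built from it
# and the DELTA-encoded column dropped before reading the data pages.
columns_types, _ = wr.s3.read_parquet_metadata(path=path)
wanted = [c for c in columns_types if c != "identity_line_item_id"]

df = wr.s3.read_parquet(path=path, columns=wanted)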

Reference: https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-strings-delta_byte_array--7

P.S. Let's block this issue until we have a CUR or Apache Arrow update.

igorborgest self-assigned this Nov 5, 2020
igorborgest added the blocked (Something is blocking the development) label Nov 5, 2020
igorborgest removed their assignment Nov 5, 2020
csy08 (Author) commented Nov 5, 2020

@igorborgest Thank you for getting back to me. Understood; I shall try to avoid the Parquet files for now.

clemthi commented Nov 27, 2020

Hi @igorborgest,
I have a similar issue, but there is something I don't understand: the CUR Parquet datasets I'm using are usually between 15 and 20 files. I'm able to load them all except the last file (usually around 450 KB).
Why is this last file encoded in a different way?

igorborgest (Contributor) commented

This will be supported naturally once CUR or Apache Arrow overcomes this incompatibility.

igorborgest added the wont implement (Will not be implemented) label and removed the blocked (Something is blocking the development) label Jan 27, 2021
mattboyd-aws (Contributor) commented

One way to work around this is to use Athena CTAS statements to manipulate the data instead. As a bonus, this offloads the CPU- and memory-intensive operations to Athena. Here is an example that uses CTAS to write CUR joined with AWS Account metadata: https://github.com/aws-samples/glue-enrich-cost-and-usage.
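
For example, a rough sketch using Wrangler's Athena integration (read_sql_query with the default ctas_approach=True runs a CTAS behind the scenes); the database and table names below are placeholders and assume the CUR report is already registered in the Glue catalog:

import awswrangler as wr

# Athena materializes the result via CTAS and Wrangler reads what Athena wrote,
# so the DELTA-encoded source files are never parsed client-side.
df = wr.athena.read_sql_query(
    sql="SELECT identity_line_item_id FROM cur_report LIMIT 1000",
    database="cur_database",
    ctas_approach=True,
)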
