Can't output blended JSON as Parquet #6938

@philrz

When subjected to blend, this GitHub archive data can no longer be output as Parquet.

$ curl -s -O https://data.gharchive.org/2023-02-08-0.json.gz &&
  super -f parquet -o 2023-02-08-0.parquet -c 'blend' 2023-02-08-0.json.gz

parquetio: unsupported type: not implemented: support for DENSE_UNION

I suspect this is due at least partially to the introduction of the none type, as this smaller example fails the same way.

$ echo '{a:[1]} {a:[]} {a:[2]}' | super -f parquet -o tmp.parquet -c 'blend' -

parquetio: unsupported type: not implemented: support for DENSE_UNION

Details

Repro is with super commit f0f35f7.

I bumped into this issue because this GitHub archive data is commonly used in our benchmarks, including a permutation where it's converted to Parquet and queried in that form, since this allows an apples-to-apples comparison with SQL databases that also support querying Parquet directly. To conform to Parquet's single-schema constraint, the data was previously made columnar with the fuse operator, but the mission of fuse changed slightly (#6713), so that form of the data can't currently be output directly as Parquet (#6765). I was advised to use blend instead, and this did indeed work with this data up through commit f2bbdd1.

$ super -version
Version: v0.3.0-45-gf2bbdd1ed

$ super -f parquet -o 2023-02-08-0.parquet -c 'blend' 2023-02-08-0.json.gz &&
  super -c 'count()' 2023-02-08-0.parquet
172049

That broke at the next commit f2b63b5, which is associated with the merge of #6819.

$ super -version
Version: v0.3.0-46-gf2b63b5bd

$ super -f parquet -o 2023-02-08-0.parquet -c 'blend' 2023-02-08-0.json.gz
parquetio: unsupported type: none

That error message remained until commit f0f35f7, which is associated with the merge of #6881.

$ super -version
Version: v0.3.0-91-gf0f35f760

$ super -f parquet -o 2023-02-08-0.parquet -c 'blend' 2023-02-08-0.json.gz
parquetio: unsupported type: not implemented: support for DENSE_UNION

That error message remains up through the current tip of main (040ecf9 at the moment).
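For re-checking this at other commits, the pass/fail decision from the runs above could be scripted. The following is only a minimal sketch: the function name `classify_parquet_result` is hypothetical, and the commented usage assumes a locally built `./super` binary in the checkout. It treats any output containing `parquetio: unsupported type` as bad, which covers both error messages seen in this issue, and returns the 0 (good) / 1 (bad) exit codes that `git bisect run` expects.

```shell
# Hypothetical helper: decide good/bad from the captured output of a Parquet
# write attempt. Returns 0 (good) when no parquetio "unsupported type" error
# is present, 1 (bad) otherwise.
classify_parquet_result() {
  case "$1" in
    *"parquetio: unsupported type"*) return 1 ;;
    *) return 0 ;;
  esac
}

# Possible use in a bisect script (assumes ./super was built at this commit):
#   err=$(echo '{a:[1]} {a:[]} {a:[2]}' |
#     ./super -f parquet -o tmp.parquet -c 'blend' - 2>&1)
#   classify_parquet_result "$err"
```

Matching on the shared `parquetio: unsupported type` prefix rather than the full message keeps the check stable across the change from `none` to `DENSE_UNION` in the error text.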

Beyond the first error message tipping me off, diffing the SUP representation of the blended type then vs. now shows that the none type stands out.

$ super -S -c 'blend | by typeof(this)' 2023-02-08-0.json.gz | head -19
<
  {
    id: string,
    type: string,
    actor: {
      id: int64,
      login: string,
      display_login: string,
      gravatar_id: string,
      url: string,
      avatar_url: string
    },
    repo: {
      id: int64,
      name: string,
      url: string
    },
    payload: {
      ref?: string|null|none,
