When subjected to blend, this GitHub archive data can no longer be output as Parquet.
$ curl -s -O https://data.gharchive.org/2023-02-08-0.json.gz &&
super -f parquet -o 2023-02-08-0.parquet -c 'blend' 2023-02-08-0.json.gz
parquetio: unsupported type: not implemented: support for DENSE_UNION
I suspect this is due at least partially to the introduction of the none type, as the following minimal example fails in the same way.
$ echo '{a:[1]} {a:[]} {a:[2]}' | super -f parquet -o tmp.parquet -c 'blend' -
parquetio: unsupported type: not implemented: support for DENSE_UNION
Details
Repro is with super commit f0f35f7.
I bumped into this issue because this GitHub archive data is commonly used in our benchmarks, including a permutation where it's turned into Parquet and queried in that form, since this creates an apples-to-apples comparison with SQL databases that also support directly querying Parquet. To conform to the single-schema constraint of Parquet, the data was previously made columnar with the fuse operator, but the mission of fuse changed slightly (#6713), so that form of the data can't currently be output directly as Parquet (#6765). I was advised to use blend instead, and this did indeed work with this data up through commit f2bbdd1.
$ super -version
Version: v0.3.0-45-gf2bbdd1ed
$ super -f parquet -o 2023-02-08-0.parquet -c 'blend' 2023-02-08-0.json.gz &&
super -c 'count()' 2023-02-08-0.parquet
172049
That broke at the next commit f2b63b5, which is associated with the merge of #6819.
$ super -version
Version: v0.3.0-46-gf2b63b5bd
$ super -f parquet -o 2023-02-08-0.parquet -c 'blend' 2023-02-08-0.json.gz
parquetio: unsupported type: none
That error message remained until commit f0f35f7, which is associated with the merge of #6881.
$ super -version
Version: v0.3.0-91-gf0f35f760
$ super -f parquet -o 2023-02-08-0.parquet -c 'blend' 2023-02-08-0.json.gz
parquetio: unsupported type: not implemented: support for DENSE_UNION
That error message remains up through the current tip of main (040ecf9 at the moment).
Beyond the first error message tipping me off, I could see by diffing the SUP representation of the blended type then vs. now that the none type sticks out.
$ super -S -c 'blend | by typeof(this)' 2023-02-08-0.json.gz | head -19
<
{
id: string,
type: string,
actor: {
id: int64,
login: string,
display_login: string,
gravatar_id: string,
url: string,
avatar_url: string
},
repo: {
id: int64,
name: string,
url: string
},
payload: {
ref?: string|null|none,