Exported Parquet Files are incompatible with Hive due to capital letters in column names #37

Closed
belenaj opened this issue Jun 23, 2019 · 0 comments · Fixed by #164
belenaj commented Jun 23, 2019

When using the EXPORT_PATH script to export an Exasol table, the generated Parquet files have a schema whose column names are in capital letters. The reason is probably that Exasol stores its metadata in upper case.

This is not a problem by itself, but issues appear when these files are stored as a Hive table and Hive and Spark share a common metastore.

As explained here, Hive is case-insensitive, while Parquet is case-sensitive:

Hive stores table and field names in lower case in the Hive Metastore.
Spark preserves the case of field names in DataFrames and Parquet files.
When a table is created or accessed using Spark SQL, Spark preserves case sensitivity by storing the details in the table properties (in the Hive Metastore). This results in unexpected behavior when Parquet records are accessed through Spark SQL using the Hive Metastore.
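
The mismatch described above can be illustrated with a small sketch. This is a simplified model of name resolution, not the actual Hive or Spark code: the function `resolve_columns` and the column names are hypothetical and exist only for illustration.

```python
def resolve_columns(metastore_columns, parquet_columns, case_sensitive):
    """Map each metastore column to the matching Parquet field name, or None."""
    if case_sensitive:
        # Exact match only: 'sales_id' does not match 'SALES_ID'.
        lookup = {name: name for name in parquet_columns}
        return {col: lookup.get(col) for col in metastore_columns}
    # Case-insensitive match: compare lower-cased names on both sides.
    lookup = {name.lower(): name for name in parquet_columns}
    return {col: lookup.get(col.lower()) for col in metastore_columns}


# Hypothetical example: the export writes upper-case field names,
# while the metastore holds the same columns in lower case.
parquet_fields = ["SALES_ID", "POSITION_ID", "AMOUNT"]
hive_columns = ["sales_id", "position_id", "amount"]

strict = resolve_columns(hive_columns, parquet_fields, case_sensitive=True)
relaxed = resolve_columns(hive_columns, parquet_fields, case_sensitive=False)

print(strict)   # every column is unresolved (None)
print(relaxed)  # every column maps to its upper-case Parquet field
```

In this model, a reader that matches names exactly finds none of the metastore columns in the file, which is one way the behavior described above can show up.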

Therefore, as a user of cloud-storage-etl-udfs,
I want to be able to export Parquet files with lower-case column names to maximize compatibility with Hive and Spark, for example through an export option such as LOWERCASE_SCHEMA:

EXPORT SALES_POSITIONS
INTO SCRIPT ETL.EXPORT_PATH WITH
  BUCKET_PATH    = 's3a://bucket-path/parquet/retail/sales_positions/'
  S3_ACCESS_KEY  = 'MY_AWS_ACCESS_KEY'
  S3_SECRET_KEY  = 'MY_AWS_SECRET_KEY'
  S3_ENDPOINT    = 's3.MY_REGION.amazonaws.com'
  PARALLELISM    = 'iproc(), floor(random()*4)'
  LOWERCASE_SCHEMA = 'true';
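
What the requested option would need to do can be sketched as follows. This is only an illustration under stated assumptions, not the project's implementation: `lowercase_schema` is a hypothetical helper, and the collision check assumes that a source table may contain quoted identifiers that differ only by case.

```python
def lowercase_schema(column_names):
    """Return the column names in lower case, rejecting ambiguous results."""
    seen = {}
    lowered_names = []
    for name in column_names:
        lowered = name.lower()
        if lowered in seen:
            # Two source columns would collapse into one lower-case name.
            raise ValueError(
                f"columns {seen[lowered]!r} and {name!r} both map to {lowered!r}"
            )
        seen[lowered] = name
        lowered_names.append(lowered)
    return lowered_names


# Hypothetical column names for the SALES_POSITIONS table.
exported = lowercase_schema(["SALES_ID", "POSITION_ID", "AMOUNT"])
print(exported)
```

Failing loudly on a collision is a design choice in this sketch; an implementation could also choose to rename or skip the conflicting columns.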
morazow added the feature and timeline:long-term labels Nov 30, 2020
morazow added a commit that referenced this issue Jul 22, 2021
Fixes #37
Fixes #145.


Co-authored-by: Anastasiia Sergienko <46891819+AnastasiiaSergienko@users.noreply.github.com>