parquet/iceberg data types #96
Comments
Hi @samdaviestvg, thanks for opening an issue. Regarding the decimal type: yes, you're right, this can currently only be expressed by stating the plain `decimal` type. Regarding the time zone issue: could you provide a small example parquet file and the accompanying datacontract.yaml? That would help me reproduce the error exactly, as I have not been able to do so on my machine. In particular, which type did you encode in the parquet file for the timestamp_tz column?
Thank you. The timestamp is encoded as `required int64 field_id=4 createddate (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false))`. Uploading yaml and parquet file types is not supported on GitHub, so please see the raw text versions (parquet as json). If they are of no use, please let me know.
Can you send me the parquet file via email, please? simon.harrer@innoq.com
Thank you. I could reproduce the issue. The tool basically loads the parquet file via DuckDB and uses soda-core to execute quality checks. The import using DuckDB leads to the following SQL schema:

| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| id | VARCHAR | YES | | | |
| isdeleted | BOOLEAN | YES | | | |
| name | VARCHAR | YES | | | |
| createddate | TIMESTAMP WITH TIME ZONE | YES | | | |

This means we need to convert "timestamp_tz" to "TIMESTAMP WITH TIME ZONE", the type used by DuckDB. This is a bug, and we should fix it. Basically, we need to add a type mapping based on the SQL dialect used (in this case the SQL dialect of DuckDB + parquet).
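Such a dialect-specific type mapping could be sketched roughly like this (a minimal, hypothetical sketch; the function name and the exact mapping table are assumptions for illustration, not the project's actual code):

```python
# Hypothetical mapping from data contract logical types to DuckDB SQL
# type names. The entries shown are illustrative.
DUCKDB_TYPE_MAPPING = {
    "string": "VARCHAR",
    "text": "VARCHAR",
    "varchar": "VARCHAR",
    "boolean": "BOOLEAN",
    "timestamp": "TIMESTAMP WITH TIME ZONE",
    "timestamp_tz": "TIMESTAMP WITH TIME ZONE",
    "timestamp_ntz": "TIMESTAMP",
    "date": "DATE",
    "decimal": "DECIMAL",
    "integer": "INTEGER",
    "long": "BIGINT",
    "double": "DOUBLE",
}


def convert_to_duckdb_type(logical_type: str) -> str:
    """Map a logical type to DuckDB's SQL type name.

    Falls back to returning the input unchanged when no mapping is known.
    """
    return DUCKDB_TYPE_MAPPING.get(logical_type.lower(), logical_type)
```

With such a mapping in place, a contract field declared as `timestamp_tz` would be compared against DuckDB's `TIMESTAMP WITH TIME ZONE` instead of failing on the literal name difference.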
Here's the Python script I used to inspect the DuckDB SQL schema of this parquet file:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL parquet;")
con.execute("LOAD parquet;")
con.execute("CREATE VIEW test AS (SELECT * FROM 'testfile.parquet')")
print(con.execute("DESCRIBE test").df().to_dict())
```
I had to add a similar mapping for CSV files: https://github.com/datacontract/datacontract-cli/blob/main/datacontract/export/csv_type_converter.py |
Added a fix. Targeted for release v0.10.4 |
I'm trying to write a data contract to read parquet files from S3. I tried using the type decimal(6,2):

```yaml
cost__c:
  type: decimal(6,2)
```

But I get the error:

```
ERROR:root:Run operation failed: [lint] Check that data contract YAML is valid - None - failed - data.models.complaintcost_c.fields.cost__c.type must be one of ['number', 'decimal', 'numeric', 'int', 'integer', 'long', 'bigint', 'float', 'double', 'string', 'text', 'varchar', 'boolean', 'timestamp', 'timestamp_tz', 'timestamp_ntz', 'date', 'array', 'object', 'record', 'struct', 'bytes', 'null'] - datacontract
```
So, changing to just decimal:

```yaml
cost__c:
  type: decimal
```

I get the error:

```
Column,Event,Details
cost__c,:icon-fail: Type Mismatch, Expected Type: decimal; Actual Type: decimal(6,2)
```
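As an aside: if I understand the Data Contract Specification correctly (worth verifying against the version in use), precision and scale are modeled as separate field attributes rather than as part of the type name, so the field could be declared like this:

```yaml
cost__c:
  type: decimal
  precision: 6
  scale: 2
```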
A similar thing happens with the types "timestamp with time zone" and "timestamp_tz":

```
Column,Event,Details
lastreferenceddate,:icon-fail: Type Mismatch, Expected Type: timestamp_tz; Actual Type: timestamp with time zone
```
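One way a checker can avoid mismatches like these is to normalize both type names to a canonical form before comparing (a hypothetical sketch of the idea, not the tool's actual implementation; all names here are made up):

```python
import re

# Hypothetical canonicalization table: SQL spellings and logical names
# that should be treated as the same type.
CANONICAL = {
    "timestamp with time zone": "timestamp_tz",
    "timestamp_tz": "timestamp_tz",
    "timestamp without time zone": "timestamp_ntz",
    "timestamp_ntz": "timestamp_ntz",
}


def normalize(type_name: str) -> str:
    """Lowercase, strip a parenthesized suffix like "(6,2)", and map
    known spellings to a canonical name."""
    t = type_name.strip().lower()
    t = re.sub(r"\(.*\)$", "", t).strip()
    return CANONICAL.get(t, t)


def types_match(expected: str, actual: str) -> bool:
    return normalize(expected) == normalize(actual)
```

Under this scheme, `decimal` matches `decimal(6,2)` and `timestamp_tz` matches `TIMESTAMP WITH TIME ZONE`, while genuinely different types still fail the check.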
Should I be able to use these data types as part of the contract?