Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Destination acceptance tests: verify that a destination is able to write any ISO8601-compliant date string #5712

Closed
Tracked by #6996 ...
sherifnada opened this issue Aug 27, 2021 · 3 comments · Fixed by #5814 or #9816
Assignees
Labels
area/connectors Connector related issues normalization

Comments

@sherifnada
Copy link
Contributor

sherifnada commented Aug 27, 2021

Implementation hints

the destination acceptance test suite, among other things, inserts pre-defined records into each destination to verify that it works correctly. See this directory for the records it inserts and the catalogs that define the records/stream schemas. As part of this, we want to enumerate all compliant ISO8601 formats and insert a few records which conform to each formats (one or two per format is fine). Then we should expect destinations which aren't able to handle all formats (like snowflake) to start failing acceptance tests. To do this we'll probably want to do something like the following:

  1. Verify snowflake destination is passing acceptance tests before beginning by running ./gradlew :airbyte-integrations:connectors:destination-snowflake:integrationTest (you'll need to place the config in lastpass in the secrets/ directory for this to work)
  2. Identify all the acceptable ISO 8601 json schema date and date-time date string formats (i.e; what are all the acceptable string patterns for each of these formats?). It is probably helpful to list them in a text file in the directory linked above, just for reference e.g: just write out all the formats in a formats_reference.txt file.
  3. Add one stream to the edge_cases_catalog which has two string fields whose format fields are set to date and date-time respectively.
  4. For each format you listed in step 2, add a few records in the records file. Now the destination acceptance test (DAT) will try to insert them into every destination.
  5. Verify that the snowflake tests are now failing.
  6. open a PR with your changes and merge it into master after approval.
  7. Find out which other destinations are failing acceptance tests after this change. Create a ticket to fix each of them.
@bazarnov
Copy link
Collaborator

bazarnov commented Sep 2, 2021

STEP 1: ✅

./gradlew :airbyte-integrations:connectors:destination-snowflake:integrationTest Is passing before begining.

STEP 2: ✅ Identified + made the Snowflake edge cases tests with date & datetimes, the results are down bellow.

ISO8601 possible date & datetime formats

General Information

Based on the official documentation for JSONSchema about Date & Time formats:
Dates and times are represented in RFC 3339, section 5.6. This is a subset of the date format also commonly known as ISO8601 format. Which could be represented in the table down bellow:

VALID DATE-TIME formats

Table with VALID Date-Time formats acceptable for JsonSchema

Format Data Type JsonSchema Type JsonSchema Format Example
YYYY-M-D string string date "2021-1-1"
YYYY-M-D hh:mm:ss string string date-time "2021-1-1 01:01:01"
YYYY-M-D hh:mm:ss Z string string date-time "2021-1-1 01:01:01 +1"
YYYY-M-DThh:mm:ss Z string string date-time "2021-1-1T01:01:01 +1:00"
======= ======= ======= ======= =======
YYYY-MM-DD string string date "2021-01-01"
YYYY-MM-DD hh:mm:ss string string date-time "2021-01-01 01:01:01"
YYYY-MM-DD hh:mm:ss Z string string date-time "2021-01-01 01:01:01 +0000"
YYYY-MM-DDThh:mm:ssZ string string date-time "2021-01-01T01:01:01Z"
"2021-01-01T01:01:01-01:00"
YYYY-MM-DDThh:mm:ss Z string string date-time "2021-01-01T01:01:01 +1:00"

INVALID DATE-TIME formats

Table with INVALID Date-Time formats for JsonSchema

This examples will lead to validation error in json schema, as well as Timestamp 'your_timestamp' is not recognized inside Snowflake as a destination.

Format Data Type JsonSchema Type JsonSchema Format Example
YYYY-MM-DD hh:mm:ss Z string string date-time "2021-01-01 01:01:01 UTC"
YYYY-MM-DDThh:mm:ss Z string string date-time "2021-01-01T01:01:01 UTC"
"2021-01-01T01:01:01 +1"
"2021-01-01T01:01:01 +0000"
YYYY-MM-DDThh:mm:ssZ string string date-time "2021-01-01T01:01:01+0000"
"2021-01-01T01:01:01UTC"
"2021-01-01T01:01:01+1"

TESTS ON Snowflake Destination

The tests are made with the single conversion statement:
SELECT to_timestamp_tz('input_string_with_possible_datetime') and the output is the result of the test, indicated in the tables bellow.

Table with tests on Snowflake as a destination

VALID DATE-TIME formats

Input String Format Snowflake Output
"2021-1-1" YYYY-M-D 2021-01-01 00:00:00.000 -0800
"2021-1-1 01:01:01" YYYY-M-D hh:mm:ss 2021-01-01 01:01:01.000 -0800
"2021-1-1 01:01:01 +1" YYYY-M-D hh:mm:ss Z 2021-01-01 01:01:01.000 +0100
"2021-1-1T01:01:01 +1:00" YYYY-M-DThh:mm:ss Z 2021-01-01 01:01:01.000 +0100
"2021-01-01" YYYY-MM-DD 2021-01-01 00:00:00.000 -0800
"2021-01-01 01:01:01" YYYY-MM-DD hh:mm:ss 2021-01-01 01:01:01.000 -0800
"2021-01-01 01:01:01 +0000" YYYY-MM-DD hh:mm:ss Z 2021-01-01 01:01:01.000 +0000
"2021-01-01T01:01:01Z" YYYY-MM-DDThh:mm:ssZ 2021-01-01 01:01:01.000 +0000
"2021-01-01T01:01:01-01:00" YYYY-MM-DDThh:mm:ssZ 2021-01-01 01:01:01.000 -0100
"2021-01-01T01:01:01 +1:00" YYYY-MM-DDThh:mm:ss Z 2021-01-01 01:01:01.000 +0100

INVALID DATE-TIME formats

Input String Format Snowflake Output
"2021-01-01 01:01:01 UTC" YYYY-MM-DD hh:mm:ss Z Timestamp '2021-01-01 01:01:01 UTC' is not recognized
"2021-01-01T01:01:01 UTC" YYYY-MM-DDThh:mm:ss Z Timestamp '2021-01-01T01:01:01 UTC' is not recognized
"2021-01-01T01:01:01 +1" YYYY-MM-DDThh:mm:ss Z Timestamp '2021-01-01T01:01:01 +1' is not recognized
"2021-01-01T01:01:01 +0000" YYYY-MM-DDThh:mm:ss Z Timestamp '2021-01-01T01:01:01 +0000' is not recognized
"2021-01-01T01:01:01+0000" YYYY-MM-DDThh:mm:ssZ Timestamp '2021-01-01T01:01:01+0000' is not recognized
"2021-01-01T01:01:01UTC" YYYY-MM-DDThh:mm:ssZ Timestamp '2021-01-01T01:01:01UTC' is not recognized
"2021-01-01T01:01:01+1" YYYY-MM-DDThh:mm:ssZ Timestamp '2021-01-01T01:01:01+1' is not recognized

STEP 3, 4, 5: ❌

  1. Snowflake DestinationAcceptanceTest fails on date assertion with the following messages:

    SnowflakeInsertDestination:

    SnowflakeInsertDestinationAcceptanceTest > [2] edge_case_messages.txt, edge_case_catalog.json FAILED
    org.opentest4j.AssertionFailedError: expected: <"2021-01-01"> but was: <"2021-01-01T00:00:00Z">
    

    SnowflakeGcsCopyDestination:

    SnowflakeGcsCopyDestinationAcceptanceTest > [2] edge_case_messages.txt, edge_case_catalog.json FAILED
    org.opentest4j.AssertionFailedError: expected: <"2021-01-01"> but was: <"2021-01-01T00:00:00Z">
    

    SnowflakeS3CopyDestination:

    SnowflakeS3CopyDestinationAcceptanceTest > [2] edge_case_messages.txt, edge_case_catalog.json FAILED
    org.opentest4j.AssertionFailedError: expected: <"2021-01-01"> but was: <"2021-01-01T00:00:00Z">
    

Which corresponds to the input of:
"2021-01-01" or "2021-1-1" of format: "date"

Perhaps, the underlying convertion tries to use Snowflakes to_timestamp_tz() method, instead of to_date() in this case or it's something else inside of DestinationAcceptanceTest.

  1. Normalisation on Snowflake fails with the following messages:

    SnowflakeInsertDestination:

    SnowflakeInsertDestinationAcceptanceTest > [2] edge_case_messages.txt, edge_case_catalog.json FAILED
    io.airbyte.workers.WorkerException: Normalization Failed.
    

    SnowflakeGcsCopyDestination:

    SnowflakeGcsCopyDestinationAcceptanceTest > [2] edge_case_messages.txt, edge_case_catalog.json FAILED
    io.airbyte.workers.WorkerException: Normalization Failed.
    

    SnowflakeS3CopyDestination:

    SnowflakeS3CopyDestinationAcceptanceTest > [2] edge_case_messages.txt, edge_case_catalog.json FAILED
    io.airbyte.workers.WorkerException: Normalization Failed.
    

THE FULL HTML FILE WITH COLLECTED LOGS IS HERE:

SnowflakeDestination_IntegrationTest.zip

STEP 6: ✅ PR is here: #5814

STEP 7: In progress.

@yurii-bidiuk
Copy link
Contributor

The following integration tests are failed on the normalization phase with error like: invalid datetime '1/1/2021':

BigQueryDestinationAcceptanceTest.testSyncWithNormalization()
MySQLDestinationAcceptanceTest.testSyncWithNormalization()
PostgresDestinationAcceptanceTest.testSyncWithNormalization()
RedshiftInsertDestinationAcceptanceTest.testSyncWithNormalization()
SnowflakeS3CopyDestinationAcceptanceTest.testSyncWithNormalization()
SnowflakeInsertDestinationAcceptanceTest.testSyncWithNormalization()
SnowflakeGcsCopyDestinationAcceptanceTest.testSyncWithNormalization()

MSSQLDestinatuionAcceptanceTest.testSyncWithNormalization -> normalization returns null for values when value couldn't be normalized

These two tests are failed with the message like: Invalid format: "20210101T0930" is malformed at "30". It should be fixed on the Java destination side (io.airbyte.integrations.destination.bigquery.BigQueryUtils#transformJsonDateTimeToBigDataFormat):

BigQueryDenormalizedDestinationAcceptanceTest.testSyncNormalizedWithoutNormalization()
BigQueryDenormalizedDestinationAcceptanceTest.testSync()

@tuliren
Copy link
Contributor

tuliren commented Dec 3, 2021

Are we sure that all the valid formats listed here are actually valid? For example: 2021-1-1 01:01:01 +1 does not seem a legit string following YYYY-M-D hh:mm:ss Z. The Z stands for offset-Z that look like this: +0000; -0800; -08:00;, according to Java DateTimeFormatter.

When we add these datetime strings back, please make sure that there is a Java or Python parser that can correctly distinguish between the valid and invalid ones.

By the way @VitaliiMaltsev already implemented one that is capable to handle lots of them in here with various tests. We should have something similar in the main repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment