✨CAT: Add test to ensure all file types covered #33746

Merged
askarpets merged 15 commits into master from cat-test-all-supported-file-types-present
Jan 26, 2024

Conversation

askarpets
Contributor

@askarpets askarpets commented Dec 22, 2023

What

Add a CAT (connector acceptance test) to ensure all file types are covered
Resolves #33363

How

This test will run if the connector's metadata has the following params:

  • connectorSubtype: file
  • ab_internal.ql >= 400

(i.e. certified file-based connectors) and verify that the sandbox account for this connector contains the following:

  • all supported structured file types: .avro, .csv, .jsonl, .parquet
  • at least one of the supported unstructured file types: .pdf, .doc, .docx, .ppt, .pptx, .md

unless otherwise specified in the connector's tests config (acceptance-test-config.yml).
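
The rule described above can be sketched in a few lines of Python. This is only an illustration, not the code merged in this PR: the function names `should_run` and `coverage_gaps` are hypothetical, and the nesting of `connectorSubtype` and `ab_internal.ql` under a top-level `data` key is an assumption about the metadata layout.

```python
from typing import Any, Iterable, Mapping, Set, Tuple

# Extension lists copied from the PR description.
STRUCTURED_TYPES = frozenset({".avro", ".csv", ".jsonl", ".parquet"})
UNSTRUCTURED_TYPES = frozenset({".pdf", ".doc", ".docx", ".ppt", ".pptx", ".md"})


def should_run(metadata: Mapping[str, Any]) -> bool:
    """Hypothetical gate: run only for certified (ql >= 400) file-based connectors.

    Assumes the fields live under a top-level "data" key.
    """
    data = metadata.get("data", {})
    quality_level = data.get("ab_internal", {}).get("ql", 0)
    return data.get("connectorSubtype") == "file" and quality_level >= 400


def coverage_gaps(found_extensions: Iterable[str]) -> Tuple[Set[str], bool]:
    """Return (missing structured types, whether any unstructured type was found)."""
    # Compare case-insensitively so ".CSV" counts as ".csv".
    found = {ext.casefold() for ext in found_extensions}
    missing_structured = set(STRUCTURED_TYPES - found)
    has_unstructured = bool(UNSTRUCTURED_TYPES & found)
    return missing_structured, has_unstructured
```

Under these assumptions, a sandbox account passes when `coverage_gaps` returns an empty set and `True`.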

If the connector does not support some of the file types listed above, checks for them can be disabled in the unsupported_types section of the config:

acceptance_tests:
  basic_read:
    tests:
      - config_path: secrets/config.json
        expect_records:
          path: integration_tests/expected_records/csv.jsonl
          exact_order: true
        file_types:
          unsupported_types:
            - extension: .csv
              bypass_reason: "Optional reason of why this type is not supported"
            - extension: .avro

Another option is to skip the test altogether:

acceptance_tests:
  basic_read:
    tests:
      - config_path: secrets/config.json
        expect_records:
          path: integration_tests/expected_records/csv.jsonl
          exact_order: true
        file_types:
          skip_test: true
          bypass_reason: "Optional reason of why this test is skipped"
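
To show how the two config options above might interact, here is a minimal sketch using plain dataclasses. The real model in config.py is `FileTypesConfig`; the classes `FileTypesSettings` and `UnsupportedFileType` and the helper `types_to_check` below are hypothetical stand-ins, and only the field names are taken from the YAML examples.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set


@dataclass
class UnsupportedFileType:
    # Mirrors one entry of the unsupported_types list in the YAML example.
    extension: str
    bypass_reason: Optional[str] = None


@dataclass
class FileTypesSettings:
    # Hypothetical stand-in for the file_types section of a basic_read test.
    skip_test: bool = False
    bypass_reason: Optional[str] = None
    unsupported_types: List[UnsupportedFileType] = field(default_factory=list)


def types_to_check(settings: FileTypesSettings, all_types: Iterable[str]) -> Set[str]:
    """Return the extensions that still need a sample file in the test account."""
    if settings.skip_test:
        return set()  # the whole test is bypassed
    # Normalize case so ".CSV" in the config excludes ".csv".
    unsupported = {t.extension.casefold() for t in settings.unsupported_types}
    return {ext for ext in all_types if ext not in unsupported}
```

With the first YAML example, which lists .csv and .avro as unsupported, this sketch would leave only .jsonl and .parquet among the structured types to verify.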

Important note

This test collects available files and their corresponding types in TestBasicRead.test_read (to avoid extra API calls), so if your connector has multiple basic_read configs for different file types (e.g. source-s3), please bypass all configs except the last one.

Next steps

The following connectors have connectorSubtype: file and ab_internal.ql >= 400 and their configs need to be reviewed:

  • source-file
  • source-google-sheets
  • source-s3

Recommended reading order

  1. test_core.py
  2. config.py

🚨 User Impact 🚨

No breaking changes

Pre-merge Actions

Updating the Python CDK

Airbyter

Before merging:

  • Pull Request description explains what problem it is solving
  • Code change is unit tested
  • Build and mypy check pass
  • Smoke test the change on at least one affected connector
    • On GitHub: Run this workflow, passing --use-local-cdk --name=source-<connector> as options
    • Locally: airbyte-ci connectors --use-local-cdk --name=source-<connector> test
  • PR is reviewed and approved

After merging:

  • Publish the CDK
    • The CDK does not follow proper semantic versioning. Choose minor if the change has significant user impact or is a breaking change. Choose patch otherwise.
    • Write a thoughtful changelog message so we know what was updated.
  • Merge the platform PR that was auto-created for updating the Connector Builder's CDK version
    • This step is optional if the change does not affect the connector builder or declarative connectors.

@askarpets askarpets self-assigned this Dec 22, 2023
Contributor

Warning

Soft code freeze is in effect until 2024-01-02. Please avoid merging to master. #freedom-and-responsibility

@askarpets askarpets marked this pull request as ready for review December 27, 2023 15:49
@askarpets askarpets requested a review from a team December 27, 2023 15:49
Collaborator

@lazebnyi lazebnyi left a comment

LGTM

Contributor

@clnoll clnoll left a comment

Thanks @askarpets - just a few requests.

In addition to the requests inline in the code, can you please update the PR description with:

  • An example of how a user can set unsupported file types
  • An example of how a user can bypass this test
  • A list of which connectors we expect to run this test by default (i.e. a list of file-based, certified connectors). If we expect any to fail because the test config needs to be updated to list the unsupported file types, please state that here and either open a ticket or go ahead and update the config.

Comment on lines +175 to +178
file_types: Optional[FileTypesConfig] = Field(
    default_factory=FileTypesConfig,
    description="For file-based connectors, unsupported by source file types can be configured or a test can be skipped at all",
)
Contributor

I'm wondering if this should just be called unsupported_file_types, and should be a list/set (similar to empty_streams). WDYT?

Contributor Author

Hi @clnoll, thanks for the review!
I added this section as a complex object mostly because we need the ability to skip this test altogether, and to do so, a user just needs to set FileTypesConfig.skip_test = True instead of listing all available file types. Do you think there could be a better approach?

Contributor

Okay, I see what you mean now. LGTM!

@staticmethod
def _get_file_extension(file_name: str) -> str:
    _, file_extension = splitext(file_name)
    return file_extension
Contributor

Suggested change
- return file_extension
+ return file_extension.casefold()

Contributor Author

Good suggestion, thanks! I also added this to _get_unsupported_file_types method to avoid issues when users list unsupported file types in uppercase for some reason.
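
For reference, the effect of the suggested change can be sketched as a standalone function. It uses only `os.path.splitext` and `str.casefold` from the standard library; the free function `get_file_extension` is a hypothetical counterpart of the static method under review, not the merged code.

```python
from os.path import splitext


def get_file_extension(file_name: str) -> str:
    """Return the lowercased extension (with leading dot), or "" if there is none."""
    _, file_extension = splitext(file_name)
    # casefold() makes ".CSV" and ".csv" compare equal.
    return file_extension.casefold()
```

Applying the same normalization to the extensions listed under unsupported_types, as the author describes, keeps both sides of the comparison case-insensitive.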

Comment on lines 1193 to 1194
f"Please make sure you added files with all of supported structured types {structured_types} "
f"and at least one with unstructured type {unstructured_types} to the test account."
Contributor

Would you mind updating this to 1) list out the ones that are missing, and 2) provide instructions for marking a file type as unsupported?

Contributor Author

Updated, please take a look

Contributor

Thank you! Looks great.
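
The updated message is not shown in this thread. As a rough sketch of what the reviewer asked for (naming the missing types and pointing to the opt-out), a message builder might look like the following. The function `build_failure_message` and its wording are hypothetical; only the config key `file_types.unsupported_types` and the file name acceptance-test-config.yml come from the PR description.

```python
from typing import Iterable


def build_failure_message(
    missing_structured: Iterable[str],
    unstructured_missing: bool,
    unstructured_types: Iterable[str],
) -> str:
    """Compose a failure message; returns "" when nothing is missing."""
    parts = []
    missing = sorted(missing_structured)
    if missing:
        parts.append(f"Missing structured file types: {', '.join(missing)}.")
    if unstructured_missing:
        expected = ", ".join(sorted(unstructured_types))
        parts.append(f"No unstructured file found; expected at least one of: {expected}.")
    if not parts:
        return ""
    # Tell the user how to opt out for types the connector does not support.
    parts.append(
        "Add such files to the test account, or list the types under "
        "file_types.unsupported_types in acceptance-test-config.yml."
    )
    return " ".join(parts)
```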

@katmarkham katmarkham removed the request for review from a team January 17, 2024 18:05
@askarpets askarpets requested a review from clnoll January 25, 2024 18:37
Contributor

@clnoll clnoll left a comment

LGTM!

@askarpets askarpets merged commit 3592ee9 into master Jan 26, 2024
29 of 30 checks passed
@askarpets askarpets deleted the cat-test-all-supported-file-types-present branch January 26, 2024 10:52
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 21, 2024
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024
Development

Successfully merging this pull request may close these issues.

File-based CDK: Add a CAT to ensure all file types covered
4 participants