Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File CDK: update spec & config with new CSV options #28133

Closed
2 tasks
clnoll opened this issue Jul 11, 2023 · 2 comments
Closed
2 tasks

File CDK: update spec & config with new CSV options #28133

clnoll opened this issue Jul 11, 2023 · 2 comments

Comments

@clnoll
Copy link
Contributor

clnoll commented Jul 11, 2023

The existing S3 connector offers a number of options for configuring connections for CSV file types. To ensure backwards compatibility, we'll want to update the AbstractFileBasedSpec and config adapter to handle them.

In #28131, we're creating a new S3 FileBasedConfig object. This ticket involves extending that object to handle CSV-specific options, and will also require the creation of a custom parser that handles the old options.

  • Verify that all options that we still support are appropriately mapped to the name in the file-based CDK:

    • delimiter
    • quote_char
    • escape_char
    • encoding
    • double_quote
    • newlines_in_values
  • Handle options that we don't offer in the new file-based CDK

    • infer_datatypes: Configures whether a schema for the source should be inferred from the current data or not. If this is set to True, we'll want to infer & cast types even if the user has not provided a schema. Unfortunately this is set to True for the vast majority of connectors so it does feel like we should handle it as opposed to letting this be a breaking change.
    • additional_reader_options: Options provided to the CSV reader. There are only a handful of connectors with these set. The file-based CDK will be updated to support the following
      • strings_can_be_null: this should always be True.
      • null_values: this should be offered as a CSV-specific config option, so the spec should be updated accordingly.
      • We should confirm that these options are not necessary: autogenerate_column_names, compression, include_missing_columns, and check_utf8.
      • One connector is using {"column_types":{"Zipcode": "string"}}. Because this is a single connector we should consider deprecating this option.
      • Double-check to verify that we have a plan to either support or deprecate all additional_reader_options that are in use by connectors in cloud.
    • advanced_options: Options provided to Pyarrow, used by a handful of connectors. Instead of blindly passing these options to pyarrow, we should deliberately surface those that we want to support, and deprecate the rest, as follows.
      • column_types: this allows us to support headerless CSVs. We should surface it as an option in the CSV-specific section of the spec.
      • skip_rows & skip_rows_after_names: select one of these and offer it as a CSV-specific config option. (For existing connectors, we should be able to support both by calculating the value for skip_rows based on skip_rows_after_names or vice-versa.)
      • Verify that encoding is already handled, and that compression will be handled by the stream reader without requiring additional config options.
      • Double-check to verify that we have a plan to either support or deprecate alladvanced_options that are in use by connectors in cloud.

Acceptance Criteria

  • The existing CSV config options are mapped and handled appropriately by the S3 connector.
  • Any options that we cannot support are identified, along with the connectors that will be impacted.
@clnoll
Copy link
Contributor Author

clnoll commented Jul 11, 2023

Grooming notes:

  • To avoid breaking OSS connectors we should handle all options instead of trying to only support the existing ones.

@clnoll clnoll changed the title File CDK: S3 config adapter (CSV options) File CDK: update spec & config with new CSV options Jul 13, 2023
@girarda girarda self-assigned this Jul 18, 2023
@girarda
Copy link
Contributor

girarda commented Jul 27, 2023

End of sprint update:

  • PR for missing options except infer_datatypes
  • infer_datatypes can be tackled in a separate PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants