Add CSV options to the CSV parser #28491

girarda · 2023-07-19T20:30:09Z

What

Handle the following CSV options that are supported by pyarrow
- true_values: Define which string values convert to True
- false_values: Define which string values convert to False
- skip_rows (renamed to skip_rows_before_header): Number of rows to skip before header
- skip_rows_after_names (renamed to skip_rows_after_header): Number of rows to skip after the header (the ticket says we can combine these two into a single config field, but pyarrow's skip_rows means "skip rows before the header", not a global number of rows to skip)
- newlines in values (only valid if the values are quoted). Newlines in non-quoted strings will result in unparseable records
- autogenerate_column_names
Added tests for
- strings_can_be_null (always true)

These options are not handled. I bolded the ones that would cause breaking changes:

infer_datatypes. We'll need to write our own type detection. That can/should be done separately to make reviewing easier (this PR is already quite large)
compression: This will need to be implemented at the source level as part of the file reader
encoding: This will need to be implemented at the source level as part of the file reader
include_missing_columns: This field doesn't make sense since we're always supporting schema evolution. The fields will show up as null in the destination if they are not set
column_types: Users can pass an input schema if they want to decide on the schema
check_utf8: This should be part of the encoding (not sure we'll need to support this though)
decimal_point: Seems fairly low value to allow users to use any character as decimal point
quoted_strings_can_be_null: Always setting this to true seems like a reasonable default
include_columns: Users can use column selection to decide what to sync
auto_dict_encode: I don’t know what this means in a non-pyarrow world. There is no encoded dicts type. This will be a breaking change
auto_dict_max_cardinality: Same as above. There is no dict type since we're not converting to a pyarrow object
timestamp_parsers: We write the timestamps as strings instead of converting them into pyarrow timestamp objects

How

Add config fields to the CsvFormat object: null_values, skip_rows_before_header, skip_rows_after_header, autogenerate_column_names, true_values, false_values
Update the FileTypeParser.parse_records so it can return None if a row cannot be parsed
Update the CsvParser.parse_records so it returns None if a row cannot be parsed
Update CsvParser.parse_records to skip rows before and after the header if specified in the config
Update CsvParser to autogenerate the header if specified in the config
Update the cast_types method to convert bool and null values according to the config
Update DefaultFileBasedStream to handle None records. These records should never pass validation: the skip policy returns False, while the others raise an Exception
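
For readers who want a concrete picture of the skip and autogenerate behaviour described above, here is a minimal sketch using only the standard library. It is not the CDK implementation: `read_rows` is a hypothetical helper, and the `f0`, `f1`, ... naming is an assumption borrowed from pyarrow's autogenerated column names.

```python
import csv
import io
from typing import Any, Dict, Iterator


def read_rows(
    text: str,
    skip_rows_before_header: int = 0,
    skip_rows_after_header: int = 0,
    autogenerate_column_names: bool = False,
) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: yield one dict per data row, applying the options."""
    if skip_rows_before_header > 0 and autogenerate_column_names:
        # The PR rejects this combination at config validation time.
        raise ValueError("Cannot skip rows before the header when it is autogenerated")

    stream = io.StringIO(text, newline="")
    # Drop raw lines that precede the header (e.g. a title or export banner).
    for _ in range(skip_rows_before_header):
        stream.readline()

    fieldnames = None
    if autogenerate_column_names:
        # No header row in the file: peek at the first row to count columns,
        # then rewind so that row is still emitted as data.
        position = stream.tell()
        first_row = next(csv.reader(stream), [])
        stream.seek(position)
        fieldnames = [f"f{i}" for i in range(len(first_row))]

    reader = csv.DictReader(stream, fieldnames=fieldnames)
    for i, row in enumerate(reader):
        # Rows right after the header are counted from zero and dropped.
        if i < skip_rows_after_header:
            continue
        yield row
```

Calling `read_rows("banner\nid,flag\nunits,-\n1,true\n", skip_rows_before_header=1, skip_rows_after_header=1)` would drop the banner line and the `units,-` row and keep only the last row.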

Recommended reading order

airbyte-cdk/python/airbyte_cdk/sources/file_based/config/csv_format.py
airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/file_type_parser.py
airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/csv_parser.py
airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/jsonl_parser.py
airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/parquet_parser.py
airbyte-cdk/python/airbyte_cdk/sources/file_based/schema_validation_policies/abstract_schema_validation_policy.py
airbyte-cdk/python/airbyte_cdk/sources/file_based/schema_validation_policies/default_schema_validation_policies.py
airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/abstract_file_based_stream.py
airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/default_file_based_stream.py

girarda · 2023-07-19T22:24:34Z

airbyte-cdk/python/unit_tests/sources/file_based/scenarios/csv_scenarios.py

@@ -922,7 +941,6 @@
                            "escape_char": "@",
                            "double_quote": True,
                            "newlines_in_values": False,
-                            "quoting_behavior": "Quote All"


removed this option from the test since it was not impacting the output

clnoll · 2023-08-02T15:21:54Z

airbyte-cdk/python/unit_tests/sources/file_based/config/test_csv_format.py

+        with pytest.raises(ValidationError):
+            CsvFormat(skip_rows_before_header=1, autogenerate_column_names=True)
+    else:
+        CsvFormat(skip_rows_before_header=1, autogenerate_column_names=True)


Suggested change

-        with pytest.raises(ValidationError):
-            CsvFormat(skip_rows_before_header=1, autogenerate_column_names=True)
-    else:
-        CsvFormat(skip_rows_before_header=1, autogenerate_column_names=True)
+        with pytest.raises(ValidationError):
+            CsvFormat(skip_rows_before_header=skip_rows_before_header, autogenerate_column_names=autogenerate_column_names)
+    else:
+        CsvFormat(skip_rows_before_header=skip_rows_before_header, autogenerate_column_names=autogenerate_column_names)
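
To illustrate why the suggestion matters, here is a rough sketch of a table-driven check in which every case actually feeds its own parameters into the constructor. `CsvFormatSketch` is a hypothetical stand-in for `CsvFormat` (the real class is a pydantic model that raises `ValidationError`), and the case table is an assumption based on the rule "fail validation if skipping rows before header and header is autogenerated".

```python
from dataclasses import dataclass


@dataclass
class CsvFormatSketch:
    """Hypothetical stand-in for CsvFormat with only the two relevant fields."""

    skip_rows_before_header: int = 0
    autogenerate_column_names: bool = False

    def __post_init__(self) -> None:
        # Skipping rows before a header that does not exist is contradictory.
        if self.skip_rows_before_header > 0 and self.autogenerate_column_names:
            raise ValueError("Cannot skip rows before the header when it is autogenerated")


# (skip_rows_before_header, autogenerate_column_names, expect_error)
CASES = [
    (0, False, False),
    (1, False, False),
    (0, True, False),
    (1, True, True),
]


def check_case(skip_rows_before_header: int, autogenerate_column_names: bool, expect_error: bool) -> bool:
    """Return True when construction succeeds or fails as the case expects."""
    try:
        CsvFormatSketch(
            skip_rows_before_header=skip_rows_before_header,
            autogenerate_column_names=autogenerate_column_names,
        )
    except ValueError:
        return expect_error
    return not expect_error
```

With hardcoded arguments, as in the original diff, the three valid cases would all construct the invalid combination and the test would not exercise what its parameters claim.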

maxi297

LGTM! For my personal knowledge, do we need to cast values within a dict or a list?

maxi297 · 2023-08-03T13:02:46Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/csv_parser.py

@@ -178,5 +221,17 @@ def cast_types(row: Dict[str, str], property_types: Dict[str, Any], logger: logg
    return result


+def _value_to_bool(value: str, true_values: List[str], false_values: List[str]) -> bool:


Should those be sets instead of lists for performance purposes?

maxi297 · 2023-08-03T13:02:58Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/csv_parser.py

@@ -178,5 +221,17 @@ def cast_types(row: Dict[str, str], property_types: Dict[str, Any], logger: logg
    return result


+def _value_to_bool(value: str, true_values: List[str], false_values: List[str]) -> bool:


Should this check be case-insensitive? Thinking about this, I feel it provides more flexibility if not, but the UX might be annoying if I want to validate "true" as case-insensitive (i.e. I'll need to provide "True", "TRUE", "tRue", "tRUe", etc...)

The same question probably applies to null values

I think they should be case-sensitive for backwards compatibility. We can add the option for case-insensitivity if needed.

Documented that the strings are case-sensitive in the spec
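
As a rough illustration of where the two threads above landed (sets for lookup, case-sensitive matching), a boolean conversion could look like the sketch below. This is not the merged code: the function name, the error type, and the error message are assumptions.

```python
from typing import Set


def value_to_bool(value: str, true_values: Set[str], false_values: Set[str]) -> bool:
    """Convert a CSV cell to a bool using exact, case-sensitive matching."""
    # Set membership is O(1) on average, versus O(n) for a list.
    if value in true_values:
        return True
    if value in false_values:
        return False
    # Anything not listed in either set is rejected rather than guessed at.
    raise ValueError(f"Value {value!r} is not a valid boolean value")
```

Under these assumptions, `value_to_bool("y", {"y", "yes"}, {"n", "no"})` returns `True`, while `"Y"` raises unless the user lists it explicitly, which is the UX trade-off raised in the review.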

maxi297 · 2023-08-03T13:03:43Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/csv_parser.py


-def cast_types(row: Dict[str, str], property_types: Dict[str, Any], logger: logging.Logger) -> Dict[str, Any]:
+
+def cast_types(row: Dict[str, str], property_types: Dict[str, Any], config_format: CsvFormat, logger: logging.Logger) -> Dict[str, Any]:


I know this was like this before, but why is this public? It does not seem to be used elsewhere

maxi297 · 2023-08-03T13:09:03Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/csv_parser.py

+            if i < config_format.skip_rows_after_header:
+                continue
+            # The row was not properly parsed if any of  the values are None
+            if any(val is None for val in row.values()):


Do we have a test for this? It seems odd and I feel it would be worth documenting

This is implicitly tested by test_read[csv_newline_in_values_not_quoted], and I added a unit test because I agree it's surprising
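
For context on why a None value signals a bad row, here is a small standard-library sketch. It assumes the parser is built on `csv.DictReader`, which fills fields missing from a short row with `None` (its default `restval`); `is_fully_parsed` is a hypothetical helper name.

```python
import csv
import io
from typing import Dict, Optional

# Row 1 has a quoted newline; the second record has an unquoted one,
# so the reader sees it as two separate physical rows.
TEXT = 'id,comment\n1,"quoted\nnewline"\n2,unquoted\nnewline\n'


def is_fully_parsed(row: Dict[str, Optional[str]]) -> bool:
    """A row with fewer fields than the header gets None for the missing ones."""
    return not any(val is None for val in row.values())


rows = list(csv.DictReader(io.StringIO(TEXT, newline="")))
```

Here the quoted newline is kept inside a single value, while the unquoted one produces a trailing fragment row whose `comment` is `None`. Note that in this particular input the first half of the broken record (`2,unquoted`) still has both fields, so the check only catches the fragment.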

girarda · 2023-08-03T15:53:58Z

LGTM! For my personal knowledge, do we need to cast values within a dict or a list?

Excellent question!

I played around with pyarrow, and conversions of List[bool] and map[str, bool] aren't supported, so legacy S3 converted these to strings.

pyarrow.lib.ArrowNotImplementedError: CSV conversion to list<item: bool> is not supported

In general, we shouldn't have to worry about casting nested values since destination v2 won't normalize the data

* remove invalid legacy option
* remove unused option
* the tests pass but this is quite messy
* very slight clean up
* Add skip options to csv format
* fix some of the typing issues
* fixme comment
* remove extra log message
* fix typing issues
* skip before header
* skip after header
* format
* add another test
* Automated Commit - Formatting Changes
* auto generate column names
* delete dead code
* update title and description
* true and false values
* Update the tests
* Add comment
* missing test
* rename
* update expected spec
* move to method
* Update comment
* fix typo
* remove unused import
* Add a comment
* None records do not pass the WaitForDiscoverPolicy
* format
* remove second branch to ensure we always go through the same processing
* Raise an exception if the record is None
* reset
* Update tests
* handle unquoted newlines
* Automated Commit - Formatting Changes
* Update test case so the quoting is explicit
* Update comment
* Automated Commit - Formatting Changes
* Fail validation if skipping rows before header and header is autogenerated
* always fail if a record cannot be parsed
* format
* set write line_no in error message
* remove none check
* Automated Commit - Formatting Changes
* enable autogenerate test
* remove duplicate test
* missing unit tests
* Update
* remove branching
* remove unused none check
* Update tests
* remove branching
* format
* extract to function
* comment
* missing type
* type annotation
* use set
* Document that the strings are case-sensitive
* public -> private
* add unit test
* newline

---------

Co-authored-by: girarda <girarda@users.noreply.github.com>

girarda added 3 commits July 18, 2023 19:11

remove invalid legacy option

f744e2c

remove unused option

fb5a57d

the tests pass but this is quite messy

3230205

octavia-squidington-iii added area/connectors Connector related issues CDK Connector Development Kit connectors/source/s3 labels Jul 19, 2023

very slight clean up

f6a67db

octavia-squidington-iii removed the area/connectors Connector related issues label Jul 19, 2023

girarda added 4 commits July 19, 2023 14:11

Add skip options to csv format

d01200b

fix some of the typing issues

b271a9e

fixme comment

7add1c7

remove extra log message

e8c88be

girarda commented Jul 19, 2023

fix typing issues

9e73b51

girarda force-pushed the alex/csv_options branch from f6b2d32 to 9e73b51 Compare July 25, 2023 22:32

girarda and others added 14 commits July 25, 2023 15:51

merge

84cabeb

skip before header

79f7748

skip after header

0ae95da

format

6324257

add another test

0fd42ca

Automated Commit - Formatting Changes

8b54aff

auto generate column names

b9a4a71

Merge branch 'alex/csv_options' of github.com:airbytehq/airbyte into alex/csv_options

9982834

delete dead code

32844ce

update title and description

cd48738

true and false values

43ce434

Update the tests

df47586

Add comment

ce9a672

missing test

2c03349

girarda added 6 commits August 1, 2023 17:21

Merge branch 'alex/csv_options' of github.com:airbytehq/airbyte into alex/csv_options

06157dc

remove duplicate test

e5a1c0e

missing unit tests

9c9dc72

Update

146680a

remove branching

4cfd721

remove unused none check

6f10047

clnoll reviewed Aug 2, 2023

girarda added 10 commits August 2, 2023 08:29

Merge branch 'master' into alex/csv_options

e4986e8

Update tests

1f57507

remove branching

0441c28

format

c2b3a37

extract to function

d8538f9

comment

16df89d

Merge branch 'master' into alex/csv_options

7d7f6dd

missing type

cef6a41

Merge branch 'alex/csv_options' of github.com:airbytehq/airbyte into alex/csv_options

cec32dc

type annotation

05067a7

girarda requested a review from clnoll August 2, 2023 16:26

maxi297 approved these changes Aug 3, 2023

girarda added 5 commits August 3, 2023 08:01

use set

bdbd413

Document that the strings are case-sensitive

bf525b4

public -> private

d32a94f

add unit test

69240b0

newline

bfe4d47

girarda merged commit 641a65a into master Aug 3, 2023
16 checks passed

girarda deleted the alex/csv_options branch August 3, 2023 15:59

		@@ -178,5 +221,17 @@ def cast_types(row: Dict[str, str], property_types: Dict[str, Any], logger: logg
		return result


		def _value_to_bool(value: str, true_values: List[str], false_values: List[str]) -> bool:


		def cast_types(row: Dict[str, str], property_types: Dict[str, Any], logger: logging.Logger) -> Dict[str, Any]:

		def cast_types(row: Dict[str, str], property_types: Dict[str, Any], config_format: CsvFormat, logger: logging.Logger) -> Dict[str, Any]:
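
For readers following the diff, here is a minimal sketch of how a helper with the `_value_to_bool` signature above might behave. It is not the merged implementation: the name `value_to_bool` is hypothetical, the `Set[str]` parameters follow the later "use set" commit rather than the `List[str]` shown in the hunk, and raising `ValueError` for unrecognized strings is an assumption. The case-sensitive matching reflects the "Document that the strings are case-sensitive" commit.

```python
from typing import Set


def value_to_bool(value: str, true_values: Set[str], false_values: Set[str]) -> bool:
    """Hypothetical stand-in for the PR's _value_to_bool helper."""
    # Exact membership checks, so matching is case-sensitive:
    # "Yes" does not match a configured "yes".
    if value in true_values:
        return True
    if value in false_values:
        return False
    # Assumption: unrecognized strings are treated as a casting error.
    raise ValueError(f"{value!r} is neither a configured true value nor a false value")


if __name__ == "__main__":
    # Illustrative option values, not defaults taken from the PR.
    configured_true = {"y", "yes", "1"}
    configured_false = {"n", "no", "0"}
    for raw in ("yes", "0"):
        print(raw, "->", value_to_bool(raw, configured_true, configured_false))
```

Using sets keeps each lookup constant-time, which matters when the check runs for every boolean cell of every row.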
