
🐛 Source SalesForce: Fix date filter behaviour #34107

Closed
wants to merge 2 commits

Conversation

artem1205
Collaborator

What

Resolve https://github.com/airbytehq/oncall/issues/3838

How

Change the date comparison in the WHERE clause.
Before (returns duplicates):

{self.cursor_field} >= {start_date} AND {self.cursor_field} < {end_date}

After:

{self.cursor_field} > {start_date} AND {self.cursor_field} <= {end_date}
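As an illustration, the updated filter could be assembled like this (a hypothetical helper, not the connector's actual code in streams.py; `SystemModstamp` in the usage below is just an example cursor field):

```python
def build_soql_where(cursor_field: str, start_date: str, end_date: str) -> str:
    # Exclusive lower bound, inclusive upper bound: a record whose cursor
    # equals a slice boundary is picked up by exactly one slice.
    return f"{cursor_field} > {start_date} AND {cursor_field} <= {end_date}"

clause = build_soql_where(
    "SystemModstamp", "2024-01-01T00:00:00.000Z", "2024-01-02T00:00:00.000Z"
)
```

With the old `>= start AND < end` bounds, a record sitting exactly on a slice boundary was matched by two consecutive slices, which produced the duplicates.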

Recommended reading order

  1. airbyte-integrations/connectors/source-salesforce/source_salesforce/streams.py

🚨 User Impact 🚨

No breaking changes.

Pre-merge Actions

Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Unit & integration tests added

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.


Contributor

github-actions bot commented Jan 10, 2024

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • You've updated the connector's metadata.yaml file with any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete, but the CI check is failing,

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

@artem1205 artem1205 self-assigned this Jan 10, 2024
@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Jan 10, 2024
@artem1205 artem1205 marked this pull request as ready for review January 10, 2024 15:55
@octavia-squidington-iv octavia-squidington-iv requested a review from a team January 10, 2024 15:57
@maxi297
Contributor

maxi297 commented Jan 10, 2024

Can we do this? Can we guarantee that if one record with a specific datetime is returned, all the records with this datetime will be returned? Given the granularity is "seconds", could we have records that were created at 2024-01-01T00:00:00.000000 and 2024-01-01T00:00:00.999999 and we sync only the first one and not the second during the sync?

Why isn't it sufficient to ask the user to use dedup?

@artem1205
Collaborator Author

@maxi297,

Can we do this?

I don't see any reason not to do it.

Can we guarantee that if one record with a specific datetime is returned, all the records with this datetime will be returned?

Absolutely. SOQL date-time filtering works similarly to relational DB queries.

Given the granularity is "seconds", could we have records that were created at 2024-01-01T00:00:00.000000 and 2024-01-01T00:00:00.999999 and we sync only the first one and not the second during the sync?

We have millisecond granularity (3 fractional digits) in our connector. We can confirm this from the Salesforce docs and our expected records.
Another option to solve this would have been to add this granularity (0.001 s) to the initial date in the stream_slices method.
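The stream_slices alternative mentioned above could look roughly like this (a sketch under the assumption that slice bounds are datetime objects; the helper name is made up, not the connector's actual code):

```python
from datetime import datetime, timedelta

GRANULARITY = timedelta(milliseconds=1)  # Salesforce cursor precision

def next_slice_start(previous_end: datetime) -> datetime:
    # Shift the next slice's start by one granularity step so that
    # inclusive >=/<= bounds on adjacent slices never overlap.
    return previous_end + GRANULARITY

start = next_slice_start(datetime(2024, 1, 1))
```

This keeps both bounds inclusive while still making the slices disjoint, at the cost of hard-coding the cursor's precision.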

Why isn't it sufficient to ask the user to use dedup?

The user was already asked to use dedup.

@maxi297
Contributor

maxi297 commented Jan 15, 2024

The problem with datetime cursors is that they do not point to one record only; they point to buckets of records. The fear I have is that the bucket we are pointing to is not filled entirely when we do the query. The following sequence of events could occur:

  • Record #1 is made available at 2024-01-01T00:00:00.000000
  • We query the API at 2024-01-01T00:00:00.000500
  • Record #2 is made available at 2024-01-01T00:00:00.000999

In that case, we would lose record #2 because we assume that, since we have one record in the bucket 2024-01-01T00:00:00.000, we have all the records. Can you confirm that this situation is not possible and explain why? Repro steps would also be nice.
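The race described above can be simulated in a few lines (a hypothetical sketch, not connector code; the timestamps follow the bullet list, and the truncation helper is made up):

```python
from datetime import datetime

def to_cursor_ms(ts: datetime) -> datetime:
    # Salesforce cursors carry millisecond precision, so microsecond
    # digits beyond the third are truncated away.
    return ts.replace(microsecond=(ts.microsecond // 1000) * 1000)

record_1 = datetime(2024, 1, 1, 0, 0, 0, 0)    # available at .000000
record_2 = datetime(2024, 1, 1, 0, 0, 0, 999)  # available at .000999
query_time = datetime(2024, 1, 1, 0, 0, 0, 500)

# The first sync only sees records already available at query time.
seen = [r for r in (record_1, record_2) if r <= query_time]

# Both records collapse into the same millisecond bucket, so a strict `>`
# lower bound on the next sync would skip record_2 entirely.
missed = [r for r in (record_1, record_2)
          if r > query_time and to_cursor_ms(r) <= to_cursor_ms(query_time)]
```

The sketch shows the concern: the cursor value alone cannot distinguish a fully-drained bucket from one that is still being filled.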

@artem1205
Collaborator Author

Can you confirm that this situation is not possible and explain why? Repro steps would be also nice

This is not possible, and here are the reasons:

  1. We use the same granularity level in requests and records (milliseconds)
  2. Salesforce uses the SOQL query language, which works very similarly to traditional relational databases. This means that a record with a timestamp smaller than our request time cannot appear in the database later.

Closing this PR for now as this behaviour is expected for our sources:
https://docs.airbyte.com/using-airbyte/core-concepts/sync-modes/incremental-append#:~:text=When%20replicating%20data%20incrementally%2C%20Airbyte,cursor%20is%20not%20very%20granular.

@artem1205 artem1205 closed this Jan 15, 2024
@maxi297
Contributor

maxi297 commented Jan 15, 2024

We use the same granularity level in requests and records (milliseconds)

Yes, but real-world time is more granular than the Salesforce API, which means we can have multiple records that share the same cursor value but were inserted at different times. In the example I shared earlier, both records would have a cursor value of 2024-01-01T00:00:00.000, but that does not mean they were inserted at the same time, and hence it does not mean that querying for 2024-01-01T00:00:00.000 will return both.

EDIT: I feel this concept is better understood when the granularity of the API is bigger. Let's assume the granularity were in months and that on Jan 15th 2024 I queried records up until 2024-01; it would be possible for new records to be added to January 2024 on later days. Hence, on the next incremental sync, we should query for created_at >= 2024-01 and not created_at > 2024-01. Does that make sense? I understand that it is less likely when the granularity is in milliseconds, but I don't see anything preventing this from happening.
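The month-granularity scenario can be illustrated directly (hypothetical values, not connector code; cursor buckets are modeled as "YYYY-MM" strings that compare lexicographically):

```python
# State saved after the Jan 15 sync: the last cursor bucket we saw.
state = "2024-01"
# Records arriving afterwards: a late January record plus a February one.
new_records = ["2024-01", "2024-02"]

# Exclusive lower bound (created_at > state) drops the late January record:
exclusive = [r for r in new_records if r > state]

# Inclusive lower bound (created_at >= state) re-reads the open bucket,
# relying on dedup to discard records that were already synced:
inclusive = [r for r in new_records if r >= state]
```

The trade-off is exactly the one debated in this thread: an inclusive bound re-reads the boundary bucket (duplicates, handled by dedup), while an exclusive bound risks losing late arrivals in that bucket.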

@artem1205 artem1205 deleted the artem1205/source-salesforce-OC-3838 branch March 1, 2024 15:38
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/source/salesforce