Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Source Salesforce: "QUERY_TIMEOUT" #17503

Closed
marcosmarxm opened this issue Oct 3, 2022 · 4 comments · Fixed by #19869
Closed

Source Salesforce: "QUERY_TIMEOUT" #17503

marcosmarxm opened this issue Oct 3, 2022 · 4 comments · Fixed by #19869

Comments

@marcosmarxm
Copy link
Member

This Github issue is synchronized with Zendesk:

Ticket ID: #2512
Priority: normal
Group: Community Assistance Engineer
Assignee: Nataly Merezhuk

Original ticket description:

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: Ubuntu, MacOS
  • Memory / Disk: 8Gb / 500 Gb
  • Deployment: GKE (kubernetes)
  • Airbyte Version: 0.39.34-alpha
  • Source name/version: Salesforce / 1.0.20
  • Destination name/version: BigQuery / 1.2.0
  • Step: during sync
  • Description:
    I am attempting to transfer data in Salesforce.
    When I try to transfer a large object, I get the following error and want the correct transfer to take place.
2022-09-16 05:33:10 e[44msourcee[0m > Syncing stream: Contact 
2022-09-16 05:35:06 e[44msourcee[0m > [{"message":"Your query request was running for too long.","errorCode":"QUERY_TIMEOUT"}]
2022-09-16 05:35:06 e[44msourcee[0m > Cannot receive data for stream 'Contact', error message: 'Your query request was running for too long.'
2022-09-16 05:35:06 e[44msourcee[0m > Read 0 records from Contact stream

Do you have any suggestions? Thanks.

[Discourse post]

@marcosmarxm
Copy link
Member Author

Comment made from Zendesk by Marcos Marx on 2022-10-03 at 09:48:

This issue has been discussed for some time.
The solution is to simplify SOQL. (e.g., filter by WHERE clause, limit number of items by limit, etc.)

However, this solution is not suitable for Airbyte’s architecture.Because Airbyte is ELT.
If the errorCode is QUERY_TIMEOUT, the solution would be to shift the date condition specified in the WHERE clause of the query in request_params and send the request.

Also, I discovered this problem too late.
Because the worker terminates normally even if the above error occurs. I believe there is a bug in the if statement below and the exception is not just raised.
The first condition of the if statement causes all error codes to return instead of raise.

  <a href="https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-salesforce/source_salesforce/streams.py#L128" rel="noopener nofollow ugc noreferrer" style="color: #006699; font-weight: bold; text-decoration: none">github.com</a>

airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-salesforce/source_salesforce/streams.py#L128

<pre style="max-width: 694px; word-wrap: break-word"><code style="background-color: #f9f9f9; display: block; overflow: auto; padding: 2px 5px; white-space: normal">
  <ol style="counter-reset: li-counter 117" dir="auto">
      <li style="padding-bottom: 10px; white-space: pre">            There are several types of Salesforce sobjects that require additional processing:</li>
      <li style="padding-bottom: 10px; white-space: pre">              1. Sobjects for which the user, after setting up the data using Airbyte, restricted access,</li>
      <li style="padding-bottom: 10px; white-space: pre">                 and we will receive 403 HTTP errors.</li>
      <li style="padding-bottom: 10px; white-space: pre">              2. There are streams that do not allow you to make a sample using Salesforce `query` or `queryAll`.</li>
      <li style="padding-bottom: 10px; white-space: pre">                 And since we use a dynamic method of generating streams for Salesforce connector - at the stage of discover,</li>
      <li style="padding-bottom: 10px; white-space: pre">                 we cannot filter out these streams, so we catch them at the stage of reading data.</li>
      <li style="padding-bottom: 10px; white-space: pre">            """</li>
      <li style="padding-bottom: 10px; white-space: pre">            error_data = error.response.json()[0]</li>
      <li style="padding-bottom: 10px; white-space: pre">            if error.response.status_code in [codes.FORBIDDEN, codes.BAD_REQUEST]:</li>
      <li style="padding-bottom: 10px; white-space: pre">                error_code = error_data.get("errorCode", "")</li>
      <li style="padding-bottom: 10px; white-space: pre">                if error_code != "REQUEST_LIMIT_EXCEEDED" or error_code == "INVALID_TYPE_FOR_OPERATION":</li>
      <li style="padding-bottom: 10px; white-space: pre">                    self.logger.error(f"Cannot receive data for stream '{self.name}', error message: '{error_data.get('message')}'")</li>
      <li style="padding-bottom: 10px; white-space: pre">                    return</li>
      <li style="padding-bottom: 10px; white-space: pre">            raise error</li>
      <li style="padding-bottom: 10px; white-space: pre">
      </li>
  • class BulkSalesforceStream(SalesforceStream):
  • page_size = 15000
  • DEFAULT_WAIT_TIMEOUT_SECONDS = 86400 # 24-hour bulk job running time
  • MAX_CHECK_INTERVAL_SECONDS = 2.0
  • MAX_RETRY_NUMBER = 3
  • [Discourse post]

    @marcosmarxm
    Copy link
    Member Author

    Comment made from Zendesk by Nataly Merezhuk on 2022-10-03 at 13:41:

    Hi @Nakachi-S, thanks for bringing this to our attention! I have created a GitHub issue connected to your forum post and triaged it, you can find it and comment more here:
    #17503

    @PierreKerschgens
    Copy link
    Contributor

    PierreKerschgens commented Nov 22, 2022

    Hey everbody,

    I'm facing this issue as well

    2022-11-15 17:45:06 �[44msource�[0m > Syncing stream: AccountHistory 
    2022-11-15 17:47:07 �[44msource�[0m > [{"message":"Your query request was running for too long.","errorCode":"QUERY_TIMEOUT"}]
    2022-11-15 17:47:07 �[44msource�[0m > Cannot receive data for stream 'AccountHistory', error message: 'Your query request was running for too long.'
    2022-11-15 17:47:07 �[44msource�[0m > Read 0 records from AccountHistory stream
    2022-11-15 17:47:07 �[44msource�[0m > Finished syncing AccountHistory
    

    The bad thing for me is that the sync job didn't mark as failed even though this error occured. The incremental loads have the same error.

    Is there any workaround I can do to get that data?

    @alafanechere
    Copy link
    Contributor

    alafanechere commented Nov 24, 2022

    @davydov-d , @PierreKerschgens is expressing a potential regression after this PR. The AccountHistory stream might support Bulk API queries. It's contrary to what's been documented by Salesforce here in 2020. The situation might have changed since then... So we might want to remove AccountHistory from the unsupported bulk API stream list ...

    @davydov-d davydov-d linked a pull request Nov 29, 2022 that will close this issue
    @davydov-d davydov-d self-assigned this Nov 29, 2022
    davydov-d added a commit that referenced this issue Nov 29, 2022
    * #17503 source salesforce: make AccountHistory support BULK API
    
    * #17503 source salesforce: upd changelog
    
    * auto-bump connector version
    
    Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging a pull request may close this issue.

    6 participants