Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Source Freshness not running on renamed source tables #4525

Closed
1 task done
ericalouie opened this issue Dec 20, 2021 · 7 comments
Closed
1 task done

[Bug] Source Freshness not running on renamed source tables #4525

ericalouie opened this issue Dec 20, 2021 · 7 comments
Labels
bug Something isn't working packages Functionality for interacting with installed packages

Comments

@ericalouie
Copy link

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

resource: our slack convo (dbt Labs)

Looks like any source table that has been renamed (i.e. has the identifier argument) doesn’t pull in the correct loaded_at timestamp. Rather, it automatically defaults to the current_timestamp()

However, when I switch the lead source to reference lead__c rather than lead, it looks like dbt will correctly pull the _sdc_batched_at field.

Example of source file:


version: 2

sources:
  - name: salesforce
    description: '{{ doc("src_salesforce") }}'
    loader: stitch
    loaded_at_field: _sdc_batched_at
    database: raw

    freshness:
      warn_after: {count: 4, period: hour}
      error_after: {count: 8, period: hour}

    tables:
      - name: account
      - name: campaign
      - name: campaignmember
      - name: contact
      - name: dbt_cloud_account
        identifier: dbt_cloud_account__c
      - name: dbt_cloud_user
        identifier: dbt_cloud_user__c
      - name: forecasts
        identifier: forecast__c
      - name: goals
        identifier: goal__c
      - name: invoices
        identifier: blng__invoice__c
      - name: invoice_lines
        identifier: blng__invoiceline__c
      - name: lead
        identifier: lead__c
      - name: opportunity
      - name: opportunitylineitem
      - name: opportunity_forecast
        identifier: opportunity_forecast__c
      - name: opportunity_history
        identifier: opportunityhistory
      - name: quarters
        identifier: quarter__c
      - name: partners
        identifier: partner__c
      - name: products
        identifier: product2
      - name: rep_quota_periods
        identifier: rep_commission_period__c

Expected Behavior

If the identifier argument exists, then dbt should choose the identifier argument, not the table name. And then it can pull the correct loaded_at field.

Steps To Reproduce

  1. Create a source.yml file
  2. Create one source with the identifier argument + one source without an identifier argument
  3. evoke dbt source freshness
  4. Check the logs

Relevant log output

2021-12-20 17:54:36.747150 (Thread-8): Using snowflake connection "source.fishtown_internal_analytics.salesforce.opportunity_stage".
2021-12-20 17:54:36.747561 (Thread-6): On source.fishtown_internal_analytics.salesforce.lead: /* {"app": "dbt", "dbt_version": "0.21.0", "profile_name": "garage-snowflake", "target_name": "dev", "node_id": "source.fishtown_internal_analytics.salesforce.lead"} */
    select
       
      convert_timezone('UTC', current_timestamp())  as max_loaded_at,
      convert_timezone('UTC', current_timestamp()) as snapshotted_at

Environment

- OS: MacOS
- Python: 3.7.2
- dbt: 0.21.0

What database are you using dbt with?

snowflake

Additional Context

No response

@ericalouie ericalouie added bug Something isn't working triage labels Dec 20, 2021
@jtcohen6
Copy link
Contributor

I think this has everything to do with the fivetran_utils package overriding the collect_freshness macro, as a means of "disabling" sources from checking freshness if they don't have a meta field is_enabled set to True.

A couple of thoughts:

  1. It seems like a bug with the package's custom collect_freshness logic if it affects unintended sources, i.e. those without meta.is_enabled defined. They should add another, more specific check to ensure this logic only runs for Fivetran package sources.
  2. The macro override for source_freshness seems to be subtly different from other macro override, where an arbitrary package macro cannot take precedence over built-in behavior. This privilege ought to rest solely with the root project, for just this reason, whether by direct override or dispatch.
  3. Rather than relying on an override of the collect_freshness macro, I'd much rather that Fivetran disable source freshness by disabling sources. More on unintended side-effects of this approach in Package project yml configs overriding root project configs #3698). That's a bit tricky to do today, so one thing we can do to help is improving behavior around var usability in dbt_project.yml (Defining vars, folder-level configs outside dbt_project.yml #2955) and source configs ([CT-201] Reconcile configs + properties for sources #3662).

reproduction

If I run this in a separate unrelated project:

version: 2

sources:
  - name: salesforce
    description: 'my cool description'
    loader: stitch
    loaded_at_field: _sdc_batched_at
    database: raw

    freshness:
      warn_after: {count: 4, period: hour}
      error_after: {count: 8, period: hour}

    tables:
      - name: account
      - name: forecasts
        identifier: forecast__c

In the logs I see:

2021-12-22 04:58:12.469649 (Thread-1): On source.testy.salesforce.account: /* {"app": "dbt", "dbt_version": "0.21.1", "profile_name": "garage-snowflake", "target_name": "dev", "node_id": "source.testy.salesforce.account"} */
select
      max(_sdc_batched_at) as max_loaded_at,
      convert_timezone('UTC', current_timestamp()) as snapshotted_at
    from raw.salesforce.account
2021-12-22 04:58:12.469754 (Thread-2): On source.testy.salesforce.forecasts: /* {"app": "dbt", "dbt_version": "0.21.1", "profile_name": "garage-snowflake", "target_name": "dev", "node_id": "source.testy.salesforce.forecasts"} */
select
      max(_sdc_batched_at) as max_loaded_at,
      convert_timezone('UTC', current_timestamp()) as snapshotted_at
    from raw.salesforce.forecast__c

However, as soon as I install the Fivetran package (running dbt deps with):

# packages.yml
packages:
  - package: fivetran/fivetran_utils
    version: 0.3.0

Then I see:

On source.testy.salesforce.account: /* {"app": "dbt", "dbt_version": "0.21.1", "profile_name": "garage-snowflake", "target_name": "dev", "node_id": "source.testy.salesforce.account"} */

    
      
    
  
    
  
  

    select
      
      max(_sdc_batched_at)
       as max_loaded_at,
      convert_timezone('UTC', current_timestamp()) as snapshotted_at

    
    from raw.salesforce.account
2021-12-22 05:00:44.420039 (Thread-1): Opening a new connection, currently in state init
2021-12-22 05:00:44.422370 (Thread-2): Using snowflake connection "source.testy.salesforce.forecasts".
2021-12-22 05:00:44.422486 (Thread-2): On source.testy.salesforce.forecasts: /* {"app": "dbt", "dbt_version": "0.21.1", "profile_name": "garage-snowflake", "target_name": "dev", "node_id": "source.testy.salesforce.forecasts"} */

    
  
    
  
  

    select
       
      convert_timezone('UTC', current_timestamp())  as max_loaded_at,
      convert_timezone('UTC', current_timestamp()) as snapshotted_at

@jtcohen6 jtcohen6 added packages Functionality for interacting with installed packages Team: Execution and removed triage labels Dec 22, 2021
@fivetran-joemarkiewicz
Copy link
Contributor

Hey @jtcohen6 and @ericalouie I have been able to look into this on my end and see where the issue resides within our collect_freshness macro that is causing the bug. I have applied a quick fix in the PR above for dbt_fivetran_utils that should not have different behavior for source tables that have an identifier. I am planning on merging this in a new release early tomorrow.

That being said, I agree that I would prefer the dbt_fivetran_utils package to not impact sources outside of Fivetran packages. I am trying to conceptualize how we would add logic to the macro to only check for a Fivetran source 🤔. Unfortunately, I can't seem to think of how I would do this. Nevertheless, I think avoiding the risk of possibly overriding the base freshness macro entirely is the optimal path here in my opinion. I am still reading through the Issues you linked about the unforeseen consequences of disabling sources.

Let me know if you have any immediate thoughts on how this can be bet tackled.

@jtcohen6
Copy link
Contributor

I am trying to conceptualize how we would add logic to the macro to only check for a Fivetran source

Yeah, there's no great way to do this. I suppose you could check whether source.loader == 'Fivetran'... but that's a metadata-only field today, not intended for functional use.

Nevertheless, I think avoiding the risk of possibly overriding the base freshness macro entirely is the optimal path here in my opinion.

Agree, agree, agree. We need to actually disable these sources (which also disables their freshness checks), rather than trying to "pass over" them at runtime.

The best approach for this, IMO, is to give you the ability to disable sources by means of variables. I'd really like to do this via #3662 (which I just prioritized for v1.1), so that you can replace this logic with:

sources:
  - name: hubspot
    tables:
    ...
      - name: contact
        description: Each record represents a contact in Hubspot.
        config:
          enabled: "{{ var('hubspot_marketing_enabled', true) }} and {{ var('hubspot_contact_enabled', true) }}"

@fivetran-joemarkiewicz
Copy link
Contributor

Thanks so much @jtcohen6! 🙌

Switching away from the current process to the source override logic you have above will be perfect. I have made a note for myself to check back in once v1.1 is live to make these respective updates to our packages. I am happy to test this once an RC available in the future, or help in any other way I can.

@jtcohen6
Copy link
Contributor

@fivetran-joemarkiewicz FYI The capability mentioned above is included in v1.1 (currently available as a release candidate):

sources:
  - name: specific_source
    tables:
      - name: specific_source_table
        config:
          enabled: False

In the case of your packages, like:

sources:
  - name: hubspot
    tables:
    ...
      - name: contact
        description: Each record represents a contact in Hubspot.
        config:
          enabled: "{{ var('hubspot_marketing_enabled', true) }} and {{ var('hubspot_contact_enabled', true) }}"

That should get us away from a need for a custom collect_freshness macro override, since the source itself will be disabled.

@fivetran-joemarkiewicz
Copy link
Contributor

Thanks so much for this update @jtcohen6 🙌

This will be a great addition! I will test this out on the RC and let you know if I have any issues or comments. Looking forward to moving away from the source freshness override!

@jtcohen6
Copy link
Contributor

After reviewing this issue, there's one thing we didn't get to the bottom of: Why did the package definition of collect_freshness silently take precedence over the global/default version in dbt-core, without the end user / root project "opting into" the override? I'd think they'd need to specify dispatch order like:

dispatch:
  - macro_namespace: collect_freshness
    search_order: ['fivetran_utils', 'dbt']

I'm going to open that in a new issue, since the other parts of this issue have since been resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working packages Functionality for interacting with installed packages
Projects
None yet
Development

No branches or pull requests

3 participants