Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization #46

Closed
wants to merge 56 commits into from
Closed

Optimization #46

wants to merge 56 commits into from

Conversation

dgitis
Copy link
Collaborator

@dgitis dgitis commented Aug 3, 2022

Description & motivation

Addressing issue #40.

This PR makes substantial changes to reduce the amount of processing required when using this package in production.

It takes the static_incremental_days variable and defaults it to 1 making all runs of base_ga4__events reprocess yesterday's data. The static_incremental_days variable also restricts stg_ga4__events so that, if there's you are not doing a --full-refresh, the package only processes yesterday's data by default (configurable by setting a different number for static_incremental_days in dbt_project.yml).

As a result, all downstream models are either Views or some variety of Incremental materialization. Custom marts Tables will all need to be converted to Incremental models. The Views only hold the most recent static_incremental_days worth of data.

A stg_ga4__first_sessions table was created to incrementally add data to when ga_session_number = 1.

The stg_ga4__user_first_last_* tables were changed to stg_ga4__user_last_* to hold new session data in the window set by stg_ga4__events. These new tables should be merged with stg_ga4__first_sessions to replace the first_last functionality.

Checklist

  • I have verified that these changes work locally
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have run dbt test and python -m pytest . to validate exists tests

@dgitis
Copy link
Collaborator Author

dgitis commented Aug 5, 2022

I had set partitions_to_replace to replace from today to static_incremental_days back worth of partitions. Someone removed the today part of that. Was there are reason for that removal @adamribaudo-velir?

If we are using either streaming or batch+streaming, then we want to include today's data.

@dgitis
Copy link
Collaborator Author

dgitis commented Aug 5, 2022

The dim_ga4__users model tripped a BigQuery complexity error in development. I've managed to change it to avoid the error but it I'm concerned that lots of user_properties might cause this error to surface again.

Simplifying the model a little more would be a good idea and there are a few areas where I think there might be some better ways of accomplishing what the model needs to do, but I just don't know them.

@dgitis dgitis linked an issue Aug 5, 2022 that may be closed by this pull request
@dgitis
Copy link
Collaborator Author

dgitis commented Aug 5, 2022

A weakness of this PR as it is currently designed is that if the dbt fails for more days than is in partitions_to_replace, then users have to manually increase partitions_to_replace to catch up on processing.

It would be nice if we detected this and caught up without user intervention. I'd love to hear ideas on how we might do this.

@adamribaudo-velir
Copy link
Collaborator

I had set partitions_to_replace to replace from today to static_incremental_days back worth of partitions. Someone removed the today part of that. Was there are reason for that removal @adamribaudo-velir?

If we are using either streaming or batch+streaming, then we want to include today's data.

No, there wasn't a principled reason for that removal of today. Just an artifact of that branch bouncing from you and then to me to implement the for loop.

@dgitis dgitis closed this Jan 23, 2023
@dgitis
Copy link
Collaborator Author

dgitis commented Jan 23, 2023

Some of the issues fixed in this branch are still worth considering, but this branch is too far separated from the main to base a PR request off.

@dgitis dgitis deleted the first-session branch March 29, 2023 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants