Multi-site support #107

Closed
dgitis opened this issue Jan 13, 2023 · 15 comments

Comments

@dgitis
Collaborator

dgitis commented Jan 13, 2023

I'm currently deploying a customized multi-site installation and want to integrate it into the main package so that the site can update.

The architecture I currently have set up looks like this.

image

In this draft, base_ga4__events is a view that unions all of the per-site tables. Each site gets its own base_ga4__events_* incremental table.

I'll also need to integrate intraday tables and configure the sessions and users tables to flag sessions and users from each site. That shouldn't be necessary with pages and other tables as the page_location column should identify the separate sites.
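
As a rough illustration of the union step described above, here is a minimal Python sketch that only builds the SQL text for such a view. The function name, the site_id column, and the assumption that each per-site table is named base_ga4__events_<site> are hypothetical and not something the package provides; in a real dbt model the table names would come from ref() calls rather than bare identifiers.

```python
def build_union_view_sql(site_ids, table_prefix="base_ga4__events_"):
    """Return a SELECT that unions one per-site table per entry in site_ids.

    Hypothetical helper: table_prefix and the site_id column are assumptions.
    """
    if not site_ids:
        raise ValueError("at least one site id is required")
    # Tag each row with its site so sessions and users can be flagged per site.
    selects = [
        f"select '{site}' as site_id, * from {table_prefix}{site}"
        for site in site_ids
    ]
    return "\nunion all\n".join(selects)


if __name__ == "__main__":
    print(build_union_view_sql(["267562037", "267582302"]))
```

Adding a literal site column at union time is one way to flag sessions and users per site without relying on page_location.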

@adamribaudo-velir
Collaborator

adamribaudo-velir commented Jan 14, 2023

Doesn't each stream get a unique ID that comes in with the events? See stream_id. That would be a more structured way of identifying which events come from which property/stream.

What does your source .yml file look like? Can you share?

@adamribaudo-velir
Collaborator

dbt has namespacing and cross-project references on the roadmap, which might be the preferred way to handle this: dbt-labs/dbt-core#5244

Not sure when it's coming, obviously, but I'd be cautious about going too deep on this feature and veering away from what might become best practice.

@dgitis
Collaborator Author

dgitis commented Jan 14, 2023

I had mentally identified variable-driven source.yml as an issue. I hand-coded my source.yml with different datasets in the same project while ignoring the variable. Presumably, we would prefer to support datasets in any project.

I put this in its src_ga4.yml.

version: 2

sources:
  - name: ga4_267562037
    database: "{{var('project')}}" 
    schema: "analytics_267562037" 
    tables:
      - name: events
        identifier: events_* # Scan across all sharded event tables. Use the 'start_date' variable to limit this scan
        description: Main events table exported by GA4. Sharded by date. 
  - name: ga4_267582302
    database: "{{var('project')}}" 
    schema: "analytics_267582302" 
    tables:
      - name: events
        identifier: events_* # Scan across all sharded event tables. Use the 'start_date' variable to limit this scan
        description: Main events table exported by GA4. Sharded by date. 
  - name: ga4_267590219
    database: "{{var('project')}}" 
    schema: "analytics_267590219" 
    tables:
      - name: events
        identifier: events_* # Scan across all sharded event tables. Use the 'start_date' variable to limit this scan
        description: Main events table exported by GA4. Sharded by date. 
  - name: ga4_267594719
    database: "{{var('project')}}" 
    schema: "analytics_267594719" 
    tables:
      - name: events
        identifier: events_* # Scan across all sharded event tables. Use the 'start_date' variable to limit this scan
        description: Main events table exported by GA4. Sharded by date. 
  - name: ga4_267596014
    database: "{{var('project')}}" 
    schema: "analytics_267596014" 
    tables:
      - name: events
        identifier: events_* # Scan across all sharded event tables. Use the 'start_date' variable to limit this scan
        description: Main events table exported by GA4. Sharded by date. 

@dgitis
Collaborator Author

dgitis commented Jan 14, 2023

Worst-case scenario is that we move from a variable-driven source file to a hand-coded one.

@dgitis
Collaborator Author

dgitis commented Jan 14, 2023

There are some interesting concepts in the linked PR and in the items linked from that PR. Namespacing dbt resources could be very useful if we could link tagging with namespacing (namespace var() tag wherever you find it), or something similar, allowing us to namespace the entire package, just the models leading up to stg_ga4__events, or whatever else you decide to tag.

However, I don't, based on a limited reading of the various PRs, see a complete solution to this particular problem.

@dgitis
Collaborator Author

dgitis commented Jan 16, 2023

I'm playing with this right now. It seems like the easiest route is to do the reverse of what I have in the screenshot: immediately union all the source models and then write to the base_ga4__events model. I prefer keeping the incremental base models separate as it makes it easier to undo everything, but I'm not completely against the reverse.

The reason for doing the union all first is that it isn't easy to dynamically create models (and it may not be possible). As I understand it, to dynamically create a model for each base_ga4__events_x table, we'd need to create our own materialization, which requires a greater understanding of dbt_core than I have. The BigQuery incremental macro was easy to find and copy, but it doesn't have a name block and I'd need to dig deeper into dbt_core looking for something that may not even exist.

The remaining issue when doing union all first is generating the src_ga4.yml file. We can't use Jinja control structures, but we can use variables.

version: 2

sources:
  - &ga4_base
    name: ga4
    database: "{{var('project')}}"  # Default to single-site config
    schema: "{{var('dataset')}}" 
    tables:
      - name: events
        identifier: events_* # Scan across all sharded event tables. Use the 'start_date' variable to limit this scan
        description: Main events table exported by GA4. Sharded by date. 
      - name: events_intraday
        identifier: events_intraday_*
        description: Intraday events table which is optionally exported by GA4. Always contains events from the current day.

{% if var('datasets', none) is not none %}
{% for ds in var('datasets') %}
  - <<: *ga4_base
    name: "ga4_{{ ds }}"
    schema: "{{ ds }}"
{% endfor %}
{% endif %}

YAML does have some interesting properties (anchors and merge keys) that let you expand on a base dictionary, but the if and for blocks aren't allowed. Since we can't know how many sites we need to add, we're stuck with a few options.

  1. Add the codegen package as a dependency and have users follow extra setup instructions to generate the YAML file and copy and paste the codegen output to the src_ga4.yml file.
  2. Have users edit the src_ga4.yml file themselves.
  3. Create a src_ga4.yml file with an arbitrary number of dataset expansions and have the package break when exceeding that number (and hope it doesn't break when you have too few).

@adamribaudo-velir, do you have any preferences? I'm leaning towards number 2. I think we can have a default, single-site config and then have multi-site users override the settings.
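
Whichever option is chosen, the file that ends up in src_ga4.yml has the same repeating shape. As a minimal sketch of generating it outside of dbt (this is plain Python, not the codegen package, and the function name render_sources_yaml and its parameters are hypothetical), something like the following could produce one source entry per dataset for users to paste in:

```python
def render_sources_yaml(datasets, project_expr="{{var('project')}}"):
    """Render src_ga4.yml text with one source per dataset name.

    Hypothetical helper: source names follow the ga4_<dataset> pattern
    from the template above; project_expr is left as a dbt var() call.
    """
    lines = ["version: 2", "", "sources:"]
    for ds in datasets:
        lines += [
            f"  - name: ga4_{ds}",
            f'    database: "{project_expr}"',
            f'    schema: "{ds}"',
            "    tables:",
            "      - name: events",
            "        identifier: events_*",
            "      - name: events_intraday",
            "        identifier: events_intraday_*",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example dataset names taken from the hand-coded file earlier in the thread.
    print(render_sources_yaml(["analytics_267562037", "analytics_267582302"]))
```

The output is static YAML, so it avoids the if and for blocks that the source file can't contain, at the cost of an extra setup step for multi-site users.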

@adamribaudo-velir
Collaborator

Still need to review your suggestion, but is this request the same as #19 ? I feel like we discussed this (though maybe there are new ideas to consider)

@adamribaudo-velir
Collaborator

In Slack someone ended up using macros as part of the pre/post run hooks which seems like an interesting path to pursue: https://getdbt.slack.com/archives/C2JRRQDTL/p1665516903423559?thread_ts=1656004021.576959&cid=C2JRRQDTL

@adamribaudo-velir
Collaborator

Looping back to your suggestions...

For suggestions 1 and 2, would we move src_ga4.yml out of our package and require that the end-user create it within their project? I haven't tested that myself, so just want to make sure that our package models can reference a source defined in the parent project.

Because if we let them override the package src_ga4.yml then it will reset every time they update versions which isn't great.

@adamribaudo-velir
Collaborator

Another idea would be to break this up into multiple projects (Fivetran does something similar, but they don't use this to solve the N sources problem)

image

@willbryant
Contributor

#52 is a bit related too, as far as the part about source dataset name being hardcoded to one project variable.

I like the idea of using something that defaults like it does now, but can be more easily overridden in the project that's embedding the package. I believe both macros and views can be overridden; can sources be overridden too?

@dgitis
Collaborator Author

dgitis commented Jan 17, 2023

I need to experiment with source overrides, @willbryant. I hope that sources can be overridden. It's also possible that unused sources don't cause a problem, so that switching between a dataset: 'analytics_x' variable and a datasets: ['analytics_x','analytics_y'] list won't affect functionality.

I think for suggestions 1 and 2, @adamribaudo-velir, I would experiment with the source overrides to understand my options before committing. I'm not sure that this will work yet, but I think we default to single source in the package and disable the single source files when it detects that the datasets variable is configured.

Maybe to prevent variable collisions similar to #52, we look for a ga4_datasets variable configured at the project level and disable the model associated with the ga4.dataset model.

@dgitis
Collaborator Author

dgitis commented Jan 17, 2023

This is related to #19. I have a client with the need and some time to put into it whereas, in that initial issue, I only had the outline of a solution.

I didn't look at YAML macros. I'll explore that option.

@willbryant
Contributor

Sorry, to clarify: I was talking about dbt macros, not YAML macros. Or did you have YAML macros in mind already?

@adamribaudo-velir
Collaborator

adamribaudo-velir commented Jan 19, 2023

I'm not sure what a YAML macro is. I was referring to something like this:

on-run-start:
  - "{{ create_our_sources() }}"
