Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: support embedded page_view_id #52

Closed
wants to merge 22 commits into from

Conversation

jtcohen6
Copy link
Contributor

Background

Many of our recent Snowplow installations have resulted in a single event stream table, with a schema matching Snowplow's canonical event model.

In these cases, we do not need to look up the page_view_id in a separate table containing web page context; it just needs to be un-arrayed and un-nested from the contexts object on the main events table. This change also enables a more fully incremental build, since page_view_id and collector_tstamp are united from the start.

N.B. OTF = "On The Fence" = I considered multiple approaches and picked one without being sure it's the best. Open to input.

Changelog

  • Add functionality whereby a package installer can set 'snowplow:context:web_page': false. The package will expect to see a column called page_view_id directly within the event model.
  • Add snowplow_web_events_tmp model/macro. IFF the web page context is disabled, this model performs the deduplication of snowplow_web_page_context—throwing away all events that have multiple page view IDs—directly on top of base events. All subsequent models (snowplow_web_events, snowplow_web_events_time, snowplow_web_events_scroll_depth) build on top of this one.
  • Add canonical_event and canonical_event_update seeds that are exact replicas of event/event_update merged with web_page/web_page_update. OTF: Whether to include brand-new seeds or to just make the join happen in base_event. Adding more seeds is definitely duplicate code, but it's also in accordance with our integration test practice (?) of having seeds represent the expected format of raw data.
  • Add new tests on the three covered adapters where 'snowplow:context:web_page': false.

Comments

  • This PR assumes that the package installer has already plucked the page_view_id from contexts and added it to 'snowplow:events' as a column by the same name.
  • OTF: Whether to include a cross-db macro to grab values from Snowplow contexts, or to include page-view plucking by default in the snowplow_web_events_tmp, or to do neither and leave it up to the installer (status quo).

@jtcohen6 jtcohen6 requested a review from drewbanin March 31, 2019 21:52
@drewbanin
Copy link
Collaborator

OTF Add canonical_event and canonical_event_update seeds that are exact replicas of event/event_update merged with web_page/web_page_update

Yep, I think this is appropriate. Good call.

OTF ** Whether to include a cross-db macro to grab values from Snowplow contexts, or to include page-view plucking by default in the snowplow_web_events_tmp, or to do neither and leave it up to the installer (status quo).**

I think we should leave this up to the user, but conceivably it will make sense to provide helper models / macros for very typical use cases (like Snowflake or Spectrum nested fields).

Copy link
Collaborator

@drewbanin drewbanin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good, but I have some questions about performance and I still haven't totally wrapped my head around snowplow_web_events_tmp. Can you update the README to include usage instructions for the new usage of snowplow:context:web_page?


{% macro default__snowplow_web_events_tmp() %}

{% if var('snowplow:context:web_page', False) %}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

 {{
    config(
        enabled=var('snowplow:context:web_page', False),
        materialized='incremental',
        sort='page_view_id',
        dist='page_view_id',
        unique_key='event_id'
    )
}}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to materialize this model incrementally? I'm weary of making another full copy of the events table, even if the work is done incrementally. Could we instead make a view that implements this logic? Curious to hear what you think, as I don't feel super strongly about this, but did want to call it out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played with making this model ephemeral instead, but then we don't have a way to include the is_incremental() logic, which I still wanted to consolidate. It's now a macro called by the main slew of Snowplow event models.

@jtcohen6 jtcohen6 added this to the 0.8.0 milestone May 16, 2019
@drewbanin
Copy link
Collaborator

@jtcohen6 do you want me to re-review this one?

@jtcohen6
Copy link
Contributor Author

@drewbanin Yessir. I've made a few more changes—related though likely beyond the initial scope of this PR—in order to support my experimentation with external tables. Namely:

  • Allow the definition of a partition column + datepart precision. This is collector_tstamp + second by default.
  • Use get_most_recent_record() macro instead of where ts > (select max(ts) from {{this}}) paradigm, since BQ + Spectrum + Snowflake external cannot prune scanning based on dynamic partition filters.

I believe these changes are relevant. To my mind, the primary use case for this PR's functionality is when Snowplow data is loaded or queried, in its canonical event structure, directly from external storage.

Failing tests

I would also appreciate your eye on the failing CircleCI tests, whose operative error appears to be:

ERROR: google-api-core 1.14.2 has requirement setuptools>=34.0.0, but you'll have setuptools 28.8.0 which is incompatible.

All tests are passing for me locally.

@jtcohen6 jtcohen6 modified the milestones: 0.8.0, 0.9.0 Oct 21, 2019
@jtcohen6 jtcohen6 changed the base branch from dev/0.8.0 to dev/0.9.0 October 21, 2019 15:21
@jtcohen6 jtcohen6 modified the milestones: 0.9.0, 0.10.0 Mar 27, 2020
@jtcohen6 jtcohen6 changed the base branch from dev/0.9.0 to master April 21, 2020 18:04
@jtcohen6 jtcohen6 removed this from the 0.11 milestone Jun 12, 2020
@jtcohen6 jtcohen6 closed this Nov 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants