Feature: support embedded page_view_id #52
> OTF: Add canonical_event and canonical_event_update seeds that are exact replicas of event/event_update merged with web_page/web_page_update
Yep, I think this is appropriate. Good call.
> OTF: Whether to include a cross-db macro to grab values from Snowplow contexts, or to include page-view plucking by default in the snowplow_web_events_tmp, or to do neither and leave it up to the installer (status quo).
I think we should leave this up to the user, but conceivably it will make sense to provide helper models / macros for very typical use cases (like Snowflake or Spectrum nested fields).
I think this looks good, but I have some questions about performance and I still haven't totally wrapped my head around snowplow_web_events_tmp. Can you update the README to include usage instructions for the new usage of snowplow:context:web_page?
{% macro default__snowplow_web_events_tmp() %}
{% if var('snowplow:context:web_page', False) %}
How about:
{{
    config(
        enabled=var('snowplow:context:web_page', False),
        materialized='incremental',
        sort='page_view_id',
        dist='page_view_id',
        unique_key='event_id'
    )
}}
Do we need to materialize this model incrementally? I'm wary of making another full copy of the events table, even if the work is done incrementally. Could we instead make a view that implements this logic? Curious to hear what you think, as I don't feel super strongly about this, but I did want to call it out.
I played with making this model ephemeral instead, but then we don't have a way to include the is_incremental() logic, which I still wanted to consolidate. It's now a macro called by the main slew of Snowplow event models.
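For readers less familiar with what the consolidated is_incremental() logic buys, here is a rough Python sketch of the usual high-water-mark pattern: on an incremental run, only events newer than the latest collector_tstamp already in the target are processed. The function name `select_new_events` and the dict-based rows are hypothetical and purely illustrative; the macro in this PR is Jinja/SQL and may filter differently.

```python
from datetime import datetime


def select_new_events(source_events, target_events):
    """Return the source events an incremental run would process.

    Hypothetical sketch: rows are dicts with a 'collector_tstamp' datetime.
    """
    # First run (empty target): behave like a full refresh.
    if not target_events:
        return list(source_events)

    # Later runs: only keep events newer than what the target already holds.
    high_water_mark = max(e["collector_tstamp"] for e in target_events)
    return [e for e in source_events if e["collector_tstamp"] > high_water_mark]
```

With a view or ephemeral model there is no existing target to read a high-water mark from, which is the trade-off discussed above.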
@jtcohen6 do you want me to re-review this one?
@drewbanin Yessir. I've made a few more changes—related though likely beyond the initial scope of this PR—in order to support my experimentation with external tables. Namely:
I believe these changes are relevant. To my mind, the primary use case for this PR's functionality is when Snowplow data is loaded or queried, in its canonical event structure, directly from external storage.
Failing tests
I would also appreciate your eye on the failing CircleCI tests, whose operative error appears to be:
All tests are passing for me locally.
…to feature/adapter-model-configs
…configs Feature: adapter model configs
Background
Many of our recent Snowplow installations have resulted in a single event stream table, with a schema matching Snowplow's canonical event model.
In these cases, we do not need to look up the page_view_id in a separate table containing web page context; it just needs to be un-arrayed and un-nested from the contexts object on the main events table. This change also enables a more fully incremental build, since page_view_id and collector_tstamp are united from the start.
N.B. OTF = "On The Fence" = I considered multiple approaches and picked one without being sure it's the best. Open to input.
Changelog
'snowplow:context:web_page': false. The package will expect to see a column called page_view_id directly within the event model.
snowplow_web_events_tmp model/macro. IFF the web page context is disabled, this model performs the deduplication of snowplow_web_page_context (throwing away all events that have multiple page view IDs) directly on top of base events. All subsequent models (snowplow_web_events, snowplow_web_events_time, snowplow_web_events_scroll_depth) build on top of this one.
Add canonical_event and canonical_event_update seeds that are exact replicas of event/event_update merged with web_page/web_page_update. OTF: Whether to include brand-new seeds or to just make the join happen in base_event. Adding more seeds is definitely duplicate code, but it's also in accordance with our integration test practice (?) of having seeds represent the expected format of raw data.
'snowplow:context:web_page': false.
Comments
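page_view_id from contexts and added it to 'snowplow:events' as a column by the same name.
OTF: Whether to include a cross-db macro to grab values from Snowplow contexts, or to include page-view plucking by default in the snowplow_web_events_tmp, or to do neither and leave it up to the installer (status quo).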
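As a rough illustration of the deduplication described in the changelog (dropping every event that carries more than one page view ID), the logic might look like the Python sketch below. The function name `drop_ambiguous_events` and the dict-based rows are hypothetical. Collapsing exact repeats of an event to a single row is an extra assumption of this sketch, and the package's SQL may handle that differently.

```python
from collections import defaultdict


def drop_ambiguous_events(rows):
    """Keep one row per event_id; drop event_ids with more than one page_view_id.

    Hypothetical sketch: rows are dicts with 'event_id' and 'page_view_id'.
    """
    # Collect the distinct page view IDs seen for each event.
    page_views_by_event = defaultdict(set)
    for row in rows:
        page_views_by_event[row["event_id"]].add(row["page_view_id"])

    kept = []
    seen = set()
    for row in rows:
        event_id = row["event_id"]
        # Skip events with conflicting page view IDs, and repeats of kept events.
        if len(page_views_by_event[event_id]) != 1 or event_id in seen:
            continue
        seen.add(event_id)
        kept.append(row)
    return kept
```

Dropping ambiguous events outright, rather than picking one page view ID arbitrarily, favors correctness of page view attribution over completeness.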