duplicated as (
    select
        id
    from prep
    group by 1
    having count(*) > 1
)
The next step excludes them all:
    where page_view_id not in (select id from duplicated)
Let's assume that we have 3 rows with the same id. All 3 of them will be dropped, but wouldn't the correct approach be to filter out 2 of them and keep 1 in the result set?
Did you find this because of a problem with your data? Or were you just scanning the code?
I think the point here is that there are two deduplication steps:
1. squash identical/duplicate records into a single row
2. exclude records for page views with multiple, different user agents
There's a subtle distinction here. The first step squashes identical/duplicate records into a single row. The second step will intentionally throw away context for pageviews with multiple, different user agent contexts.
If a browser is reporting that a single pageview is represented by two different user agents, then there's probably something bad happening either in the browser or in the tracking implementation.
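To make the two steps concrete, here is a minimal sketch that runs the same pattern against an in-memory SQLite database. The table name `ua_context`, the column `useragent_family`, and the sample rows are hypothetical, and `select distinct` is only an assumed stand-in for however the real `prep` step squashes identical records.

```python
import sqlite3

# Hypothetical sample rows: (id, useragent_family). Not real Snowplow data.
rows = [
    ("a", "Chrome"),   # single clean record
    ("b", "Firefox"),  # exact duplicate of the next row
    ("b", "Firefox"),
    ("c", "Chrome"),   # same page view id, conflicting user agents
    ("c", "Safari"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table ua_context (id text, useragent_family text)")
conn.executemany("insert into ua_context values (?, ?)", rows)

query = """
with prep as (
    -- step 1 (assumed): squash identical records into a single row
    select distinct id, useragent_family from ua_context
),
duplicated as (
    -- ids that still have more than one row after step 1
    select id from prep group by id having count(*) > 1
)
-- step 2: drop every row for ids with conflicting records
select id, useragent_family
from prep
where id not in (select id from duplicated)
order by id
"""
result = conn.execute(query).fetchall()
print(result)
```

In this sketch the exact duplicate `b` survives as one row, while `c` is removed entirely because its two records disagree.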
I think the alternative of just picking one is also a reasonable approach, but it's not how the original Snowplow web data models were built, so I'm hesitant to change it without good reason! I would however be super happy to have better documentation around some of this code.
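For comparison, the "just pick one" alternative could look roughly like the sketch below. It is not what the package does; the table and column names are again hypothetical, and taking `min(useragent_family)` is just one arbitrary but deterministic way to choose a survivor.

```python
import sqlite3

# Hypothetical sample rows: (id, useragent_family). Not real Snowplow data.
rows = [
    ("a", "Chrome"),
    ("b", "Firefox"),
    ("b", "Firefox"),
    ("c", "Chrome"),
    ("c", "Safari"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table ua_context (id text, useragent_family text)")
conn.executemany("insert into ua_context values (?, ?)", rows)

# Keep exactly one row per id instead of discarding conflicting ids.
# min() is an arbitrary tie-breaker chosen only for this illustration.
query = """
select id, min(useragent_family) as useragent_family
from ua_context
group by id
order by id
"""
kept = conn.execute(query).fetchall()
print(kept)
```

Here `c` stays in the result set, but the user agent attached to it is a guess, which is the trade-off the current models avoid by excluding such page views.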
@drewbanin I found it just by scanning the code. Your reference to the original Snowplow model is enough for me to understand why it is implemented this way. Thanks!
I'm not sure, but this may not be the correct way to filter duplicates:
https://github.com/fishtown-analytics/snowplow/blob/274c7969d61a552cdc996d00801d7f43cc9b6c03/models/page_views/optional/snowplow_web_ua_parser_context.sql#L43
The model first searches for rows that share the same id.