May be it is incorrect way of filtering duplicates? #42

Closed
sphinks opened this issue Jan 10, 2019 · 2 comments

@sphinks
Contributor

sphinks commented Jan 10, 2019

I'm not sure, but this may not be the correct way to filter duplicates:
https://github.com/fishtown-analytics/snowplow/blob/274c7969d61a552cdc996d00801d7f43cc9b6c03/models/page_views/optional/snowplow_web_ua_parser_context.sql#L43
The model first finds ids that appear in more than one row:

duplicated as (
select
    id

from prep

group by 1
having count(*) > 1

)

The next step excludes all of them:
where page_view_id not in (select id from duplicated)
Suppose we have 3 rows with the same id. All 3 will be dropped. Wouldn't the correct approach be to filter out 2 of them and keep 1 in the result set?

@drewbanin
Collaborator

Hey @sphinks - we adopted this logic from the snowplow web data model.

Did you find this because of a problem with your data? Or were you just scanning the code? I think the point here is that there are two deduplication steps:

  1. group by every column to "distinct" the dataset
  2. exclude records for page views with multiple, different user agents

There's a subtle distinction here. The first step squashes identical/duplicate records into a single row. The second step will intentionally throw away context for pageviews with multiple, different user agent contexts.
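To make the two steps concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `raw_context`, the `useragent` column, and the sample rows are assumptions made up for illustration; this is not the package's actual model, only the same pattern on toy data.

```python
import sqlite3

# Hypothetical toy data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("create table raw_context (id text, useragent text)")
conn.executemany(
    "insert into raw_context values (?, ?)",
    [
        ("pv1", "Firefox"),  # pv1: three identical rows
        ("pv1", "Firefox"),
        ("pv1", "Firefox"),
        ("pv2", "Chrome"),   # pv2: two different user agents
        ("pv2", "Safari"),
        ("pv3", "Edge"),     # pv3: a single row
    ],
)

query = """
with prep as (
    -- step 1: group by every column, collapsing identical rows
    select id, useragent
    from raw_context
    group by id, useragent
),
duplicated as (
    -- step 2: ids that still have more than one row after step 1,
    -- i.e. ids with several different user agents
    select id
    from prep
    group by id
    having count(*) > 1
)
select id, useragent
from prep
where id not in (select id from duplicated)
order by id
"""

rows = conn.execute(query).fetchall()
print(rows)  # [('pv1', 'Firefox'), ('pv3', 'Edge')]
```

In this sketch the three identical `pv1` rows survive as a single row, and only `pv2`, which reports two different user agents, is dropped entirely.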

If a browser is reporting that a single pageview is represented by two different user agents, then there's probably something bad happening either in the browser or in the tracking implementation.

I think the alternative of just picking one is also a reasonable approach, but it's not how the original Snowplow web data models were built, so I'm hesitant to change it without good reason! I would however be super happy to have better documentation around some of this code.
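For comparison, a "just pick one" variant might look roughly like the sketch below, again with Python's sqlite3 and made-up data. The `prep` table contents and the choice of `min(useragent)` as the tie-break are assumptions for illustration only; neither this package nor the original Snowplow models are confirmed to do this.

```python
import sqlite3

# Hypothetical already-"distinct" data; names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("create table prep (id text, useragent text)")
conn.executemany(
    "insert into prep values (?, ?)",
    [
        ("pv2", "Chrome"),  # pv2 reports two different user agents
        ("pv2", "Safari"),
        ("pv3", "Edge"),
    ],
)

# Keep exactly one row per id. min() is an arbitrary but deterministic
# tie-break chosen for this sketch.
pick_one = """
select id, min(useragent) as useragent
from prep
group by id
order by id
"""

picked = conn.execute(pick_one).fetchall()
print(picked)  # [('pv2', 'Chrome'), ('pv3', 'Edge')]
```

The trade-off is that `pv2` keeps a user agent context here, but which one it keeps is decided by the tie-break rather than by the data.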

Let me know if all of this makes sense.

@sphinks
Contributor Author

sphinks commented Jan 10, 2019

@drewbanin I found it just by scanning the code. Your reference to the original Snowplow model is enough for me to understand why it is implemented this way. Thanks!

@sphinks sphinks closed this as completed Jan 10, 2019