Ready for 0.9.0: all-around upgrade #79

Merged 9 commits on Apr 21, 2020
34 changes: 25 additions & 9 deletions .circleci/config.yml
@@ -5,6 +5,7 @@ jobs:
build:
docker:
- image: circleci/python:3.6.2-stretch
- image: circleci/postgres:9.6.5-alpine-ram

steps:
- checkout
@@ -28,19 +29,20 @@ jobs:
cp integration_tests/ci/sample.profiles.yml ~/.dbt/profiles.yml

- run:
name: "Run Tests - BigQuery"
name: "Run Tests - Postgres"
environment:
GCLOUD_SERVICE_KEY_PATH: "/home/circleci/gcloud-service-key.json"

CI_DBT_USER: root
CI_DBT_PASS: ''
CI_DBT_PORT: 5432
CI_DBT_DBNAME: circle_test
command: |
. venv/bin/activate
echo `pwd`
cd integration_tests
dbt deps --target bigquery
dbt seed --target bigquery --full-refresh
dbt run --target bigquery --full-refresh --vars 'update: false'
dbt run --target bigquery --vars 'update: true'
dbt test --target bigquery
dbt deps --target postgres
dbt seed --target postgres --full-refresh
dbt run --target postgres --full-refresh --vars 'update: false'
dbt run --target postgres --vars 'update: true'
dbt test --target postgres

- run:
name: "Run Tests - Redshift"
@@ -66,6 +68,20 @@ jobs:
dbt run --target snowflake --vars 'update: true'
dbt test --target snowflake

- run:
name: "Run Tests - BigQuery"
environment:
GCLOUD_SERVICE_KEY_PATH: "/home/circleci/gcloud-service-key.json"

command: |
. venv/bin/activate
echo `pwd`
cd integration_tests
dbt deps --target bigquery
dbt seed --target bigquery --full-refresh
dbt run --target bigquery --full-refresh --vars 'update: false'
dbt run --target bigquery --vars 'update: true'
dbt test --target bigquery

- save_cache:
key: deps1-{{ .Branch }}
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -0,0 +1 @@
* @jtcohen6
56 changes: 56 additions & 0 deletions .github/bug_report.md
@@ -0,0 +1,56 @@
---
name: Bug report
about: Report a bug or an issue you've found with this package
title: ''
labels: bug, triage
assignees: ''

---

### Describe the bug
<!---
A clear and concise description of what the bug is. What command did you run? What happened?
--->

### Steps To Reproduce
<!---
In as much detail as possible, please provide steps to reproduce the issue. Sample data that triggers the issue, example model code, etc is all very helpful here.
--->

### Expected results
<!---
A clear and concise description of what you expected to happen.
--->

### Actual results
<!---
A clear and concise description of what actually happened.
--->

### Screenshots and log output
<!---
If applicable, add screenshots or log output to help explain your problem.
--->

### System information
**Which database are you using dbt with?**
- [ ] postgres
- [ ] redshift
- [ ] bigquery
- [ ] snowflake
- [ ] other (specify: ____________)


**The output of `dbt --version`:**
```
<output goes here>
```

**The operating system you're using:**

**The output of `python --version`:**

### Additional context
<!---
Add any other context about the problem here.
--->
4 changes: 4 additions & 0 deletions .github/pull_request_template.md
@@ -0,0 +1,4 @@
## Description & motivation
<!---
Describe your changes, and why you're making them.
-->
47 changes: 39 additions & 8 deletions README.md
@@ -1,6 +1,11 @@
# Snowplow sessionization

dbt data models for sessionizing Snowplow data. Adapted from Snowplow's [web model](https://github.com/snowplow/web-data-model).
This dbt package:
* Rolls up `page_view` and `page_ping` events into page views and sessions
* Performs "user stitching" to tie all historical events associated with an
anonymous cookie (`domain_userid`) to the same `user_id`

Adapted from Snowplow's [web model](https://github.com/snowplow/web-data-model).

### Models ###

@@ -14,6 +19,29 @@ several intermediate models used to create these two models.

![snowplow graph](/etc/snowplow_graph.png)


## Prerequisites

This package takes the Snowplow JavaScript tracker as its foundation. It assumes
that all Snowplow events are sent with a
[`web_page` context](https://github.com/snowplow/snowplow/wiki/1-General-parameters-for-the-Javascript-tracker#webPage).

### Mobile

It _is_ possible to sessionize mobile (app) events by including two predefined contexts with all events:
* [`client_session`](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.snowplow/client_session/jsonschema/1-0-1) ([iOS](https://docs.snowplowanalytics.com/docs/collecting-data/collecting-from-own-applications/objective-c-tracker/objective-c-1-2-0/#session-tracking), [Android](https://github.com/snowplow/snowplow/wiki/Android-Tracker#12-client-sessions))
* [`screen`](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.mobile/screen/jsonschema/1-0-0)

As long as all events are associated with an anonymous user, a session, and a
screen/page view, they can be made to fit the same canonical data model as web
events fired from the JavaScript tracker. Whether this is the desired outcome
will vary significantly; mobile-first analytics often makes different
assumptions about user identity, engagement, referral, and inactivity cutoffs.

For specific implementation details:
* [iOS](https://docs.snowplowanalytics.com/docs/collecting-data/collecting-from-own-applications/objective-c-tracker/)
* [Android trackers](https://docs.snowplowanalytics.com/docs/collecting-data/collecting-from-own-applications/android-tracker/)

## Installation Instructions
Check [dbt Hub](https://hub.getdbt.com/fishtown-analytics/snowplow/latest/) for
the latest installation instructions, or [read the docs](https://docs.getdbt.com/docs/package-management)
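
For reference, installing from dbt Hub generally amounts to adding an entry like the following to your project's `packages.yml` and running `dbt deps` (the version range shown is illustrative for the 0.9.x series):

```yml
packages:
  - package: fishtown-analytics/snowplow
    version: [">=0.9.0", "<0.10.0"]
```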
@@ -65,32 +93,35 @@ models:
* Redshift
* Snowflake
* BigQuery
* Postgres, with the creation of [these UDFs](pg_udfs.sql)
* Postgres

### Contributions ###

Additional contributions to this package are very welcome! Please create issues
or open PRs against `master`.
or open PRs against `master`. Check out
[this post](https://discourse.getdbt.com/t/contributing-to-a-dbt-package/657)
on the best workflow for contributing to a package.

Much of tracking can be the Wild West. Snowplow's canonical event model is a major
asset in our ability to perform consistent analysis atop predictably structured
data, but any detailed implementation is bound to diverge.

To that end, we aim to keep this package rooted in a garden-variety Snowplow web
deployment. All PRs should seek to add or improve functionality that is contained
within a plurality of snowplow deployments.
within a plurality of Snowplow deployments.

If you need to change implementation-specific details, you have two avenues:

* Override models from this package with versions that feature your custom logic.
Create a model with the same name locally (e.g. `snowplow_id_map`) and disable the `snowplow`
package's version in `dbt_project.yml`:
Create a model with the same name locally (e.g. `snowplow_id_map`) and disable
the `snowplow` package's version in `dbt_project.yml` (see the sketch after this list):

```yml
snowplow:
...
identification:
snowplow_id_map:
enabled: false
default:
snowplow_id_map:
enabled: false
```
* Fork this repository :)
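
To illustrate the first option above, a minimal local override of `snowplow_id_map` might look like the sketch below. The source relation and the stitching logic are assumptions — adapt them to your own events model.

```sql
-- models/snowplow_id_map.sql (hypothetical local override)
-- One row per anonymous cookie, mapped to a known user_id.
select
    domain_userid,
    max(user_id) as user_id  -- simplistic: pick one non-null user_id per cookie
from {{ ref('snowplow_base_events') }}  -- assumed source relation
where user_id is not null
group by 1
```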
13 changes: 13 additions & 0 deletions data/snowplow_seeds.yml
@@ -0,0 +1,13 @@
version: 2

seeds:
- name: country_codes
description: >
English names for countries based on their two-letter ISO code, which is
stored in the `geo_country` column of `snowplow_page_views` and
`snowplow_sessions`. Not directly used in any of the snowplow package's
sessionization logic.
columns:
- name: name
- name: two_letter_iso_code
- name: three_letter_iso_code
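
For example, the seed can be joined to the sessions model to add readable country names (a sketch, assuming the seed has been loaded with `dbt seed`):

```sql
select
    s.*,
    c.name as geo_country_name
from {{ ref('snowplow_sessions') }} as s
left join {{ ref('country_codes') }} as c
    on s.geo_country = c.two_letter_iso_code
```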
2 changes: 2 additions & 0 deletions dbt_project.yml
@@ -11,6 +11,8 @@ analysis-paths: ["analysis"]
data-paths: ["data"]
macro-paths: ["macros"]

require-dbt-version: ">=0.16.0"

models:
snowplow:
base:
Binary file modified etc/snowplow_graph.png
2 changes: 1 addition & 1 deletion integration_tests/Makefile
@@ -22,5 +22,5 @@ test-bigquery:
dbt run --target bigquery --vars 'update: true'
dbt test --target bigquery

test-all: test-redshift test-snowflake test-bigquery
test-all: test-postgres test-redshift test-snowflake test-bigquery
echo "Completed successfully"
13 changes: 12 additions & 1 deletion integration_tests/ci/sample.profiles.yml
@@ -7,8 +7,19 @@ config:
use_colors: True

integration_tests:
target: redshift
target: postgres
outputs:

postgres:
type: postgres
host: localhost
user: "{{ env_var('CI_DBT_USER') }}"
pass: "{{ env_var('CI_DBT_PASS') }}"
port: "{{ env_var('CI_DBT_PORT') }}"
dbname: "{{ env_var('CI_DBT_DBNAME') }}"
schema: snowplow_integration_tests_postgres
threads: 1

redshift:
type: redshift
host: "{{ env_var('CI_REDSHIFT_DBT_HOST') }}"
14 changes: 13 additions & 1 deletion integration_tests/dbt_project.yml
@@ -30,4 +30,16 @@ models:
'snowplow:context:performance_timing': FALSE
'snowplow:context:useragent': FALSE
'snowplow:pass_through_columns': ['test_add_col']


seeds:
quote_columns: false
snowplow_integration_tests:
snowplow:
sp_event_update:
column_types:
collector_tstamp: timestamp
derived_tstamp: timestamp
sp_event:
column_types:
collector_tstamp: timestamp
derived_tstamp: timestamp
11 changes: 11 additions & 0 deletions macros/adapters/convert_timezone.sql
@@ -0,0 +1,11 @@
{%- macro convert_timezone(in_tz, out_tz, in_timestamp) -%}
{{ adapter_macro('convert_timezone', in_tz, out_tz, in_timestamp) }}
{%- endmacro -%}

{% macro default__convert_timezone(in_tz, out_tz, in_timestamp) %}
convert_timezone({{in_tz}}, {{out_tz}}, {{in_timestamp}})
{% endmacro %}

{% macro postgres__convert_timezone(in_tz, out_tz, in_timestamp) %}
({{in_timestamp}} at time zone {{in_tz}} at time zone {{out_tz}})::timestamptz
{% endmacro %}
68 changes: 68 additions & 0 deletions macros/adapters/get_start_ts.sql
@@ -0,0 +1,68 @@

{%- macro get_max_sql(relation, field = 'collector_tstamp') -%}

select

coalesce(
max({{field}}),
'0001-01-01' -- a long, long time ago
) as start_ts

from {{ relation }}

{%- endmacro -%}


{%- macro get_most_recent_record(relation, field = 'collector_tstamp') -%}

{%- set result = run_query(get_max_sql(relation, field)) -%}

{% if execute %}
{% set start_ts = result.columns['start_ts'].values()[0] %}
{% else %}
{% set start_ts = '' %}
{% endif %}

{{ return(start_ts) }}

{%- endmacro -%}


{%- macro get_start_ts(relation, field = 'collector_tstamp') -%}
{{ adapter_macro('get_start_ts', relation, field) }}
{%- endmacro -%}


{%- macro default__get_start_ts(relation, field = 'collector_tstamp') -%}
({{get_max_sql(relation, field)}})
{%- endmacro -%}


{%- macro bigquery__get_start_ts(relation, field = 'collector_tstamp') -%}

{%- set partition_by = config.get('partition_by', none) -%}
{%- set partitions = config.get('partitions', none) -%}

{%- set start_ts -%}
{%- if config.incremental_strategy == 'insert_overwrite' -%}

{%- if partitions -%} least({{partitions|join(',')}})
{%- elif partition_by.data_type == 'date' -%} _dbt_max_partition
{%- else -%} date(_dbt_max_partition)
{%- endif -%}

{%- else -%}

{%- set rendered -%}
{%- if partition_by.data_type == 'date' -%} {{partition_by.field}}
{%- else -%} date({{partition_by.field}}) {%- endif -%}
{%- endset -%}
{%- set record = get_most_recent_record(relation, rendered) -%}
'{{record}}'

{%- endif -%}
{%- endset -%}

{%- do return(start_ts) -%}

{%- endmacro -%}
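
As an illustration, an incremental model might filter new events against this macro's output; on default adapters it renders as a scalar subquery over the existing table, while on BigQuery it renders an expression suited to partition pruning (the model and source names below are assumptions):

```sql
-- models/snowplow_page_views.sql (illustrative)
select *
from {{ ref('snowplow_base_events') }}  -- assumed source relation
{% if is_incremental() %}
  -- only scan events newer than the most recent record already loaded
where collector_tstamp > {{ get_start_ts(this, 'collector_tstamp') }}
{% endif %}
```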
35 changes: 35 additions & 0 deletions macros/adapters/is_adapter.sql
@@ -0,0 +1,35 @@
{% macro set_default_adapters() %}

{% set default_adapters = ['postgres', 'redshift', 'snowflake'] %}

{% do return(default_adapters) %}

{% endmacro %}

{% macro is_adapter(adapter='default') %}

{#-
This logic means that if you add your own macro named `set_default_adapters`
to your project, that will be used, giving you the flexibility of overriding
which target types use the default implementation of Snowplow models.
-#}

{% if context.get(ref.config.project_name, {}).get('set_default_adapters') %}
{% set default_adapters=context[ref.config.project_name].set_default_adapters() %}
{% else %}
{% set default_adapters=snowplow.set_default_adapters() %}
{% endif %}

{% if adapter == 'default' %}
{% set adapters = default_adapters %}
{% elif adapter is string %}
{% set adapters = [adapter] %}
{% else %}
{% set adapters = adapter %}
{% endif %}

{% set result = (target.type in adapters) %}

{{return(result)}}

{% endmacro %}
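
As the inline comment notes, a project can redefine which targets count as "default" by shipping its own `set_default_adapters` macro — a minimal sketch of such an override:

```sql
-- macros/set_default_adapters.sql in the root project (hypothetical)
{% macro set_default_adapters() %}
    {# treat only Postgres and Redshift as default-implementation targets #}
    {% do return(['postgres', 'redshift']) %}
{% endmacro %}
```

Models can then branch on, for example, `{{ snowplow.is_adapter('default') }}` or `{{ snowplow.is_adapter('bigquery') }}`.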
File renamed without changes.