Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Break up daily scheduled service index #2076

Merged
merged 8 commits into from
Dec 15, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 10 additions & 2 deletions .github/workflows/build-warehouse-image.yml
Original file line number Diff line number Diff line change
Expand Up @@ -6,11 +6,19 @@ on:
- 'main'
paths:
- '.github/workflows/build-warehouse-image.yml'
- 'warehouse/**'
- 'warehouse/Dockerfile'
- 'warehouse/packages.yml'
- 'warehouse/poetry.lock'
- 'warehouse/pyproject.yml'
- 'warehouse/requirements.txt'
pull_request:
paths:
- '.github/workflows/build-warehouse-image.yml'
- 'warehouse/**'
- 'warehouse/Dockerfile'
- 'warehouse/packages.yml'
- 'warehouse/poetry.lock'
- 'warehouse/pyproject.yml'
- 'warehouse/requirements.txt'

concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
Expand Down
109 changes: 91 additions & 18 deletions warehouse/models/intermediate/gtfs/_int_gtfs.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -88,30 +88,39 @@ models:
- name: int_gtfs_schedule__long_calendar
description: |
This table transforms the raw GTFS calendar.txt format (where each row corresponds to a `service_id` and
each day of the week is a column and service indicators are entered in a "wide" fashion) into a long format,
where a row is identified by `feed_key`, `service_id`, and `day_name`.
tests:
- dbt_utils.unique_combination_of_columns:
combination_of_columns:
- feed_key
- service_id
- day_name
each day of the week is a column and service indicators are entered in a "wide" fashion)
into a long format, where a row is identified by `feed_key`, `service_id`, and `date`.
columns:
- name: base64_url
- &schedule_key
name: key
description:
Synthetic key from `service_id`, `feed_key`, and `service_date`.
tests:
- unique
- not_null
- name: feed_key
description: Foreign key for `dim_schedule_feeds`.
- name: service_id
description: '{{ doc("gtfs_calendar__service_id") }}'
- name: start_date
description: '{{ doc("gtfs_calendar__start_date") }}'
- name: end_date
description: '{{ doc("gtfs_calendar_dates__date") }}'
- name: day_name
description: String name of day of the week like "monday"
- name: has_service
- name: service_date
description: |
Date on which this service was active (i.e., this date is betweem the
`start_date` and `end_date` for this service).
- name: day_num
description: |
Boolean indicating whether there is service for this `service_id` / `day_name` pair
between `start_date` and `end_date`.
Day of week as number (Sunday = 1, Saturday = 7).
- name: service_bool
lauriemerrell marked this conversation as resolved.
Show resolved Hide resolved
description: |
Boolean indicating whether there is service for this `service_id` / `date` pair.
tests:
- not_null
- name: calendar_key
description: |
Foreign key to dim_calendar.
tests:
- relationships:
to: ref('dim_calendar')
field: key
- name: int_gtfs_schedule__daily_scheduled_service_index
description: |
An index listing date, feed, and `service_id` combinations for which service was scheduled (i.e.,
Expand Down Expand Up @@ -141,6 +150,22 @@ models:
description: Foreign key for `dim_schedule_feeds`.
tests:
- not_null
- name: calendar_key
description: |
Foreign key for `dim_calendar`. If null, the service for this date was
defined exclusively via `calendar_dates`.
tests:
- relationships:
to: ref('dim_calendar')
field: key
- name: calendar_dates_key
description: |
Foreign key for `dim_calendar_dates`. If null, the service for this date was
defined exclusively via `calendar`.
tests:
- relationships:
to: ref('dim_calendar_dates')
field: key
- name: service_date
description: Date on which service was scheduled.
tests:
Expand Down Expand Up @@ -375,3 +400,51 @@ models:
RT data type: "GTFS Alerts", "GTFS VehiclePositions", "GTFS TripUpdates".
tests:
- not_null
- name: int_gtfs_schedule__all_scheduled_service
description: |
**Use with caution: This table lists service for all feeds, regardless of whether the given feed
was actually "active" on the given day. To see only service for active feeds,
consult `int_gtfs_schedule__daily_scheduled_service_index`.**
Lists date, feed, and `service_id` combinations for which service was scheduled (i.e.,
the `service_id` is "in effect" and says that service occurred).
Essentially, it takes `calendar` and `calendar_dates` for a given feed, takes the dates for which that
feed was "in effect", and combines those into a long list of all service_ids that were in effect for a given
date, then filters down to only those where service was actually scheduled on that date.
For example, a row in this table with `feed_key = A`, `service_date = 2022-10-01`, `service_id = 1` indicates
that:
* Service ID 1 covers the date 2022-10-01, i.e., if service ID 1 is defined in `calendar.txt`,
`start_date <= 2022-10-01 <= end_date` and if service ID 1 is defined in `calendar_dates.txt`,
`2022-10-01` is listed as a `date` within that file.
* The service indicator is `true` for 2022-10-01 (a Saturday). So, if this service was defined in `calendar.txt`,
`saturday = 1` for `service_id = 1`, and there is no `exception_type = 2` in `calendar_dates.txt` for this service and date.
If this service is defined exclusively in `calendar_dates.txt`, then `exception_type = 1` is listed for `2022-10-01` in that file.
columns:
- *schedule_key
- name: feed_key
description: Foreign key for `dim_schedule_feeds`.
tests:
- not_null
- name: calendar_key
description: |
Foreign key for `dim_calendar`. If null, the service for this date was
defined exclusively via `calendar_dates`.
tests:
- relationships:
to: ref('dim_calendar')
field: key
- name: calendar_dates_key
description: |
Foreign key for `dim_calendar_dates`. If null, the service for this date was
defined exclusively via `calendar`.
tests:
- relationships:
to: ref('dim_calendar_dates')
field: key
- name: service_date
description: Date on which service was scheduled.
tests:
- not_null
- name: service_id
description: Service identifier from calendar and/or calendar_dates.
tests:
- not_null
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
{{ config(materialized='table') }}

WITH dim_calendar_dates AS (
SELECT *
FROM {{ ref('dim_calendar_dates') }}
),

int_gtfs_schedule__long_calendar AS (
SELECT *
FROM {{ ref('int_gtfs_schedule__long_calendar') }}
),

boolean_calendar_dates AS (
SELECT
-- at time of writing, this will be identical to `calendar_dates_key`, but just in case?
{{ dbt_utils.surrogate_key(['feed_key', 'service_id', 'date']) }} AS key,
date AS service_date,
feed_key,
key AS calendar_dates_key,
service_id,
CASE
WHEN exception_type = 1 THEN TRUE
WHEN exception_type = 2 THEN FALSE
END AS service_bool
FROM dim_calendar_dates
),

daily_services AS (
SELECT
-- these values will be identical so doesn't matter which is first in coalesce
COALESCE(long_cal.key, cal_dates.key) AS key,
COALESCE(long_cal.service_date, cal_dates.service_date) AS service_date,
COALESCE(long_cal.feed_key, cal_dates.feed_key) AS feed_key,
COALESCE(long_cal.service_id, cal_dates.service_id) AS service_id,
-- calendar_dates takes precedence if present: it can modify calendar
-- if no calendar_dates, use calendar
-- if neither, no service
COALESCE(cal_dates.service_bool, long_cal.service_bool) AS service_bool,
calendar_key,
calendar_dates_key
FROM int_gtfs_schedule__long_calendar AS long_cal
FULL OUTER JOIN boolean_calendar_dates AS cal_dates
USING (key)
),

int_gtfs_schedule__all_scheduled_service AS (
SELECT
key,
service_date,
feed_key,
calendar_key,
calendar_dates_key,
service_id
FROM daily_services
WHERE service_bool
)

SELECT * FROM int_gtfs_schedule__all_scheduled_service
Original file line number Diff line number Diff line change
@@ -1,83 +1,28 @@
{{ config(materialized='table') }}

WITH dim_calendar_dates AS (
SELECT *
FROM {{ ref('dim_calendar_dates') }}
),

int_gtfs_schedule__long_calendar AS (
SELECT *
FROM {{ ref('int_gtfs_schedule__long_calendar') }}
),

fct_daily_schedule_feeds AS (
WITH fct_daily_schedule_feeds AS (
SELECT
*,
EXTRACT(DAYOFWEEK FROM date) AS day_num
FROM {{ ref('fct_daily_schedule_feeds') }}
),

boolean_calendar_dates AS (
SELECT
date,
feed_key,
service_id,
CASE
WHEN exception_type = 1 THEN TRUE
WHEN exception_type = 2 THEN FALSE
END AS has_service
FROM dim_calendar_dates
),

-- decide that exception type 2 trumps exception type 1
-- i.e., if same date appears twice with two exception types
-- the cancelation wins and we say no service on that date
-- (this generally shouldn't happen)
summarize_calendar_dates AS (
SELECT
date,
feed_key,
service_id,
LOGICAL_AND(has_service) AS has_service
FROM boolean_calendar_dates
GROUP BY date, feed_key, service_id
),

daily_services AS (

SELECT
daily_feeds.date AS service_date,
cal_dates.date AS cd_date,
daily_feeds.feed_key,
long_cal.service_id AS calendar_service_id,
long_cal.has_service AS calendar_has_service,
cal_dates.service_id AS calendar_dates_service_id,
cal_dates.has_service AS calendar_dates_has_service,
COALESCE(long_cal.service_id, cal_dates.service_id) AS service_id,
-- calendar_dates takes precedence if present: it can modify calendar
-- if no calendar_dates, use calendar
-- if neither, no service
COALESCE(cal_dates.has_service, long_cal.has_service, FALSE) AS has_service
FROM fct_daily_schedule_feeds AS daily_feeds
LEFT JOIN int_gtfs_schedule__long_calendar AS long_cal
ON daily_feeds.feed_key = long_cal.feed_key
AND daily_feeds.day_num = long_cal.day_num
AND daily_feeds.date BETWEEN long_cal.start_date AND long_cal.end_date
LEFT JOIN summarize_calendar_dates AS cal_dates
ON daily_feeds.feed_key = cal_dates.feed_key
AND daily_feeds.date = cal_dates.date
AND (long_cal.service_id = cal_dates.service_id OR long_cal.service_id IS NULL)
all_scheduled_service AS (
SELECT *
FROM {{ ref('int_gtfs_schedule__all_scheduled_service') }}
),

int_gtfs_schedule__daily_scheduled_service_index AS (
SELECT
service_date,
cd_date,
feed_key,
service_id
FROM daily_services
WHERE service_id IS NOT NULL
AND has_service
fct_daily_schedule_feeds.feed_key,
service_id,
calendar_key,
calendar_dates_key
FROM all_scheduled_service
INNER JOIN fct_daily_schedule_feeds
ON all_scheduled_service.feed_key = fct_daily_schedule_feeds.feed_key
AND all_scheduled_service.service_date = fct_daily_schedule_feeds.date
)

SELECT * FROM int_gtfs_schedule__daily_scheduled_service_index
Original file line number Diff line number Diff line change
@@ -1,30 +1,32 @@
{{ config(materialized='table') }}

-- TODO: make an intermediate calendar and use that instead of the dimension
WITH dim_calendar AS (
SELECT *
FROM {{ ref('dim_calendar') }}
),

-- TODO: see if this can be refactored using UNPIVOT (logic inherited from v1 warehouse, wondering if it should be revisited)
int_gtfs_schedule__long_calendar AS (
-- Note that you can unnest values easily in SQL, but getting the column names
-- is weirdly hard. To work around this, we just UNION ALL.
{% for dow in [("monday", 2), ("tuesday", 3), ("wednesday", 4), ("thursday", 5), ("friday", 6), ("saturday", 7), ("sunday", 0)] %}

{% if not loop.first %}
UNION ALL
{% endif %}

SELECT
base64_url,
feed_key,
service_id,
start_date,
end_date,
"{{ dow[0] }}" AS day_name,
{{ dow[1] }} AS day_num,
CAST({{ dow[0] }} AS boolean) AS has_service
FROM dim_calendar
{% endfor %}
SELECT
feed_key,
{{ dbt_utils.surrogate_key(['feed_key', 'service_id', 'dt']) }} AS key,
service_id,
dt AS service_date,
EXTRACT(DAYOFWEEK FROM dt) AS day_num,
CASE
lauriemerrell marked this conversation as resolved.
Show resolved Hide resolved
WHEN EXTRACT(DAYOFWEEK FROM dt) = 1 THEN CAST(sunday AS bool)
WHEN EXTRACT(DAYOFWEEK FROM dt) = 2 THEN CAST(monday AS bool)
WHEN EXTRACT(DAYOFWEEK FROM dt) = 3 THEN CAST(tuesday AS bool)
WHEN EXTRACT(DAYOFWEEK FROM dt) = 4 THEN CAST(wednesday AS bool)
WHEN EXTRACT(DAYOFWEEK FROM dt) = 5 THEN CAST(thursday AS bool)
WHEN EXTRACT(DAYOFWEEK FROM dt) = 6 THEN CAST(friday AS bool)
WHEN EXTRACT(DAYOFWEEK FROM dt) = 7 THEN CAST(saturday AS bool)
END AS service_bool,
key AS calendar_key
FROM dim_calendar
-- one row per day between calendar service start and end date
-- https://stackoverflow.com/questions/38694040/how-to-generate-date-series-to-occupy-absent-dates-in-google-biqquery/58169269#58169269
LEFT JOIN UNNEST(GENERATE_DATE_ARRAY(start_date, LEAST(end_date, DATE_ADD(CURRENT_DATE(), INTERVAL 1 YEAR)))) AS dt
)

SELECT *
Expand Down
Loading