Skip to content

Commit

Permalink
Break up daily scheduled service index (#2076)
Browse files Browse the repository at this point in the history
* mock up all scheduled service

* add calendar key

* wip

* working refactor that has discrepancies with prod

* yaml and cleanup

* add more tests, some per pr review

* only build warehouse image on actual changes to image

* I forgot to export requirements

Co-authored-by: Andrew Vaccaro <atvaccaro@gmail.com>
  • Loading branch information
lauriemerrell and atvaccaro committed Dec 15, 2022
1 parent 43fb1ed commit 333455b
Show file tree
Hide file tree
Showing 6 changed files with 211 additions and 125 deletions.
12 changes: 10 additions & 2 deletions .github/workflows/build-warehouse-image.yml
Original file line number Diff line number Diff line change
Expand Up @@ -6,11 +6,19 @@ on:
- 'main'
paths:
- '.github/workflows/build-warehouse-image.yml'
- 'warehouse/**'
- 'warehouse/Dockerfile'
- 'warehouse/packages.yml'
- 'warehouse/poetry.lock'
- 'warehouse/pyproject.yml'
- 'warehouse/requirements.txt'
pull_request:
paths:
- '.github/workflows/build-warehouse-image.yml'
- 'warehouse/**'
- 'warehouse/Dockerfile'
- 'warehouse/packages.yml'
- 'warehouse/poetry.lock'
- 'warehouse/pyproject.yml'
- 'warehouse/requirements.txt'

concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
Expand Down
109 changes: 91 additions & 18 deletions warehouse/models/intermediate/gtfs/_int_gtfs.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -88,30 +88,39 @@ models:
- name: int_gtfs_schedule__long_calendar
description: |
This table transforms the raw GTFS calendar.txt format (where each row corresponds to a `service_id` and
each day of the week is a column and service indicators are entered in a "wide" fashion) into a long format,
where a row is identified by `feed_key`, `service_id`, and `day_name`.
tests:
- dbt_utils.unique_combination_of_columns:
combination_of_columns:
- feed_key
- service_id
- day_name
each day of the week is a column and service indicators are entered in a "wide" fashion)
into a long format, where a row is identified by `feed_key`, `service_id`, and `date`.
columns:
- name: base64_url
- &schedule_key
name: key
description:
Synthetic key from `service_id`, `feed_key`, and `service_date`.
tests:
- unique
- not_null
- name: feed_key
description: Foreign key for `dim_schedule_feeds`.
- name: service_id
description: '{{ doc("gtfs_calendar__service_id") }}'
- name: start_date
description: '{{ doc("gtfs_calendar__start_date") }}'
- name: end_date
description: '{{ doc("gtfs_calendar_dates__date") }}'
- name: day_name
description: String name of day of the week like "monday"
- name: has_service
- name: service_date
description: |
Date on which this service was active (i.e., this date is betweem the
`start_date` and `end_date` for this service).
- name: day_num
description: |
Boolean indicating whether there is service for this `service_id` / `day_name` pair
between `start_date` and `end_date`.
Day of week as number (Sunday = 1, Saturday = 7).
- name: service_bool
description: |
Boolean indicating whether there is service for this `service_id` / `date` pair.
tests:
- not_null
- name: calendar_key
description: |
Foreign key to dim_calendar.
tests:
- relationships:
to: ref('dim_calendar')
field: key
- name: int_gtfs_schedule__daily_scheduled_service_index
description: |
An index listing date, feed, and `service_id` combinations for which service was scheduled (i.e.,
Expand Down Expand Up @@ -141,6 +150,22 @@ models:
description: Foreign key for `dim_schedule_feeds`.
tests:
- not_null
- name: calendar_key
description: |
Foreign key for `dim_calendar`. If null, the service for this date was
defined exclusively via `calendar_dates`.
tests:
- relationships:
to: ref('dim_calendar')
field: key
- name: calendar_dates_key
description: |
Foreign key for `dim_calendar_dates`. If null, the service for this date was
defined exclusively via `calendar`.
tests:
- relationships:
to: ref('dim_calendar_dates')
field: key
- name: service_date
description: Date on which service was scheduled.
tests:
Expand Down Expand Up @@ -375,3 +400,51 @@ models:
RT data type: "GTFS Alerts", "GTFS VehiclePositions", "GTFS TripUpdates".
tests:
- not_null
- name: int_gtfs_schedule__all_scheduled_service
description: |
**Use with caution: This table lists service for all feeds, regardless of whether the given feed
was actually "active" on the given day. To see only service for active feeds,
consult `int_gtfs_schedule__daily_scheduled_service_index`.**
Lists date, feed, and `service_id` combinations for which service was scheduled (i.e.,
the `service_id` is "in effect" and says that service occurred).
Essentially, it takes `calendar` and `calendar_dates` for a given feed, takes the dates for which that
feed was "in effect", and combines those into a long list of all service_ids that were in effect for a given
date, then filters down to only those where service was actually scheduled on that date.
For example, a row in this table with `feed_key = A`, `service_date = 2022-10-01`, `service_id = 1` indicates
that:
* Service ID 1 covers the date 2022-10-01, i.e., if service ID 1 is defined in `calendar.txt`,
`start_date <= 2022-10-01 <= end_date` and if service ID 1 is defined in `calendar_dates.txt`,
`2022-10-01` is listed as a `date` within that file.
* The service indicator is `true` for 2022-10-01 (a Saturday). So, if this service was defined in `calendar.txt`,
`saturday = 1` for `service_id = 1`, and there is no `exception_type = 2` in `calendar_dates.txt` for this service and date.
If this service is defined exclusively in `calendar_dates.txt`, then `exception_type = 1` is listed for `2022-10-01` in that file.
columns:
- *schedule_key
- name: feed_key
description: Foreign key for `dim_schedule_feeds`.
tests:
- not_null
- name: calendar_key
description: |
Foreign key for `dim_calendar`. If null, the service for this date was
defined exclusively via `calendar_dates`.
tests:
- relationships:
to: ref('dim_calendar')
field: key
- name: calendar_dates_key
description: |
Foreign key for `dim_calendar_dates`. If null, the service for this date was
defined exclusively via `calendar`.
tests:
- relationships:
to: ref('dim_calendar_dates')
field: key
- name: service_date
description: Date on which service was scheduled.
tests:
- not_null
- name: service_id
description: Service identifier from calendar and/or calendar_dates.
tests:
- not_null
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
{{ config(materialized='table') }}

WITH dim_calendar_dates AS (
SELECT *
FROM {{ ref('dim_calendar_dates') }}
),

int_gtfs_schedule__long_calendar AS (
SELECT *
FROM {{ ref('int_gtfs_schedule__long_calendar') }}
),

boolean_calendar_dates AS (
SELECT
-- at time of writing, this will be identical to `calendar_dates_key`, but just in case?
{{ dbt_utils.surrogate_key(['feed_key', 'service_id', 'date']) }} AS key,
date AS service_date,
feed_key,
key AS calendar_dates_key,
service_id,
CASE
WHEN exception_type = 1 THEN TRUE
WHEN exception_type = 2 THEN FALSE
END AS service_bool
FROM dim_calendar_dates
),

daily_services AS (
SELECT
-- these values will be identical so doesn't matter which is first in coalesce
COALESCE(long_cal.key, cal_dates.key) AS key,
COALESCE(long_cal.service_date, cal_dates.service_date) AS service_date,
COALESCE(long_cal.feed_key, cal_dates.feed_key) AS feed_key,
COALESCE(long_cal.service_id, cal_dates.service_id) AS service_id,
-- calendar_dates takes precedence if present: it can modify calendar
-- if no calendar_dates, use calendar
-- if neither, no service
COALESCE(cal_dates.service_bool, long_cal.service_bool) AS service_bool,
calendar_key,
calendar_dates_key
FROM int_gtfs_schedule__long_calendar AS long_cal
FULL OUTER JOIN boolean_calendar_dates AS cal_dates
USING (key)
),

int_gtfs_schedule__all_scheduled_service AS (
SELECT
key,
service_date,
feed_key,
calendar_key,
calendar_dates_key,
service_id
FROM daily_services
WHERE service_bool
)

SELECT * FROM int_gtfs_schedule__all_scheduled_service
Original file line number Diff line number Diff line change
@@ -1,83 +1,28 @@
{{ config(materialized='table') }}

WITH dim_calendar_dates AS (
SELECT *
FROM {{ ref('dim_calendar_dates') }}
),

int_gtfs_schedule__long_calendar AS (
SELECT *
FROM {{ ref('int_gtfs_schedule__long_calendar') }}
),

fct_daily_schedule_feeds AS (
WITH fct_daily_schedule_feeds AS (
SELECT
*,
EXTRACT(DAYOFWEEK FROM date) AS day_num
FROM {{ ref('fct_daily_schedule_feeds') }}
),

boolean_calendar_dates AS (
SELECT
date,
feed_key,
service_id,
CASE
WHEN exception_type = 1 THEN TRUE
WHEN exception_type = 2 THEN FALSE
END AS has_service
FROM dim_calendar_dates
),

-- decide that exception type 2 trumps exception type 1
-- i.e., if same date appears twice with two exception types
-- the cancelation wins and we say no service on that date
-- (this generally shouldn't happen)
summarize_calendar_dates AS (
SELECT
date,
feed_key,
service_id,
LOGICAL_AND(has_service) AS has_service
FROM boolean_calendar_dates
GROUP BY date, feed_key, service_id
),

daily_services AS (

SELECT
daily_feeds.date AS service_date,
cal_dates.date AS cd_date,
daily_feeds.feed_key,
long_cal.service_id AS calendar_service_id,
long_cal.has_service AS calendar_has_service,
cal_dates.service_id AS calendar_dates_service_id,
cal_dates.has_service AS calendar_dates_has_service,
COALESCE(long_cal.service_id, cal_dates.service_id) AS service_id,
-- calendar_dates takes precedence if present: it can modify calendar
-- if no calendar_dates, use calendar
-- if neither, no service
COALESCE(cal_dates.has_service, long_cal.has_service, FALSE) AS has_service
FROM fct_daily_schedule_feeds AS daily_feeds
LEFT JOIN int_gtfs_schedule__long_calendar AS long_cal
ON daily_feeds.feed_key = long_cal.feed_key
AND daily_feeds.day_num = long_cal.day_num
AND daily_feeds.date BETWEEN long_cal.start_date AND long_cal.end_date
LEFT JOIN summarize_calendar_dates AS cal_dates
ON daily_feeds.feed_key = cal_dates.feed_key
AND daily_feeds.date = cal_dates.date
AND (long_cal.service_id = cal_dates.service_id OR long_cal.service_id IS NULL)
all_scheduled_service AS (
SELECT *
FROM {{ ref('int_gtfs_schedule__all_scheduled_service') }}
),

int_gtfs_schedule__daily_scheduled_service_index AS (
SELECT
service_date,
cd_date,
feed_key,
service_id
FROM daily_services
WHERE service_id IS NOT NULL
AND has_service
fct_daily_schedule_feeds.feed_key,
service_id,
calendar_key,
calendar_dates_key
FROM all_scheduled_service
INNER JOIN fct_daily_schedule_feeds
ON all_scheduled_service.feed_key = fct_daily_schedule_feeds.feed_key
AND all_scheduled_service.service_date = fct_daily_schedule_feeds.date
)

SELECT * FROM int_gtfs_schedule__daily_scheduled_service_index
Original file line number Diff line number Diff line change
@@ -1,30 +1,32 @@
{{ config(materialized='table') }}

-- TODO: make an intermediate calendar and use that instead of the dimension
WITH dim_calendar AS (
SELECT *
FROM {{ ref('dim_calendar') }}
),

-- TODO: see if this can be refactored using UNPIVOT (logic inherited from v1 warehouse, wondering if it should be revisited)
int_gtfs_schedule__long_calendar AS (
-- Note that you can unnest values easily in SQL, but getting the column names
-- is weirdly hard. To work around this, we just UNION ALL.
{% for dow in [("monday", 2), ("tuesday", 3), ("wednesday", 4), ("thursday", 5), ("friday", 6), ("saturday", 7), ("sunday", 0)] %}

{% if not loop.first %}
UNION ALL
{% endif %}

SELECT
base64_url,
feed_key,
service_id,
start_date,
end_date,
"{{ dow[0] }}" AS day_name,
{{ dow[1] }} AS day_num,
CAST({{ dow[0] }} AS boolean) AS has_service
FROM dim_calendar
{% endfor %}
SELECT
feed_key,
{{ dbt_utils.surrogate_key(['feed_key', 'service_id', 'dt']) }} AS key,
service_id,
dt AS service_date,
EXTRACT(DAYOFWEEK FROM dt) AS day_num,
CASE
WHEN EXTRACT(DAYOFWEEK FROM dt) = 1 THEN CAST(sunday AS bool)
WHEN EXTRACT(DAYOFWEEK FROM dt) = 2 THEN CAST(monday AS bool)
WHEN EXTRACT(DAYOFWEEK FROM dt) = 3 THEN CAST(tuesday AS bool)
WHEN EXTRACT(DAYOFWEEK FROM dt) = 4 THEN CAST(wednesday AS bool)
WHEN EXTRACT(DAYOFWEEK FROM dt) = 5 THEN CAST(thursday AS bool)
WHEN EXTRACT(DAYOFWEEK FROM dt) = 6 THEN CAST(friday AS bool)
WHEN EXTRACT(DAYOFWEEK FROM dt) = 7 THEN CAST(saturday AS bool)
END AS service_bool,
key AS calendar_key
FROM dim_calendar
-- one row per day between calendar service start and end date
-- https://stackoverflow.com/questions/38694040/how-to-generate-date-series-to-occupy-absent-dates-in-google-biqquery/58169269#58169269
LEFT JOIN UNNEST(GENERATE_DATE_ARRAY(start_date, LEAST(end_date, DATE_ADD(CURRENT_DATE(), INTERVAL 1 YEAR)))) AS dt
)

SELECT *
Expand Down

0 comments on commit 333455b

Please sign in to comment.