Assessed organization & service indices (#2087)
* draft of daily services & documentation

* intermediate work on assessed orgs

* remove pseudo type 2 stuff for gtfs datasets; use as-is

* dataset and service data history

* non-working wip

* add type to gtfs datasets for convenience

* fix renamed key reference

* cast organizations assessment status to bool, not sure why services comes in natively as bool and orgs doesn't

* working-ish daily entities

* fix key column name

* fix wrong test definition

* only coalesce customer_facing for dates after that field was added

* use type instead of data in dataset history

* rename, get working

* update yaml

* remove date from key hash

* mart guidelines table

* add yaml for organization level checks

* service level check aggregation

* add comment per pr review

* sort arrays
lauriemerrell committed Dec 20, 2022
1 parent a3d3bae commit 8a8c35d
Showing 22 changed files with 755 additions and 49 deletions.
Original file line number Diff line number Diff line change
@@ -40,7 +40,86 @@ models:
combination_of_columns:
- code
- severity

- name: int_gtfs_quality__daily_assessment_candidate_entities
description: |
A row here is a combination of an organization, service, service / GTFS dataset relationship,
GTFS dataset, and schedule feed that existed as of the given date. This combined entity
represents a candidate for GTFS guideline adherence assessment. The `assessed` column
indicates whether the candidate was actually in scope for assessment on the given date, based on whether:
* The organization had `reporting_category` = 'Core' or 'Other Public Transit'
* The service was `currently_operating`
* The service had `service_type` 'fixed-route'
* The service/GTFS dataset relationship was `customer_facing` or had `category` = 'primary'
(i.e., according to our data, was this dataset ingested by trip planners on this date
to represent this service)
Entities that fail any of the criteria will *not* be included as `assessed` for the given date.
The data in this table comes directly from historical extracts from the Transit Database. This means
that it is subject to historical data quality issues. On days when an extract failed, entities
may not resolve correctly.
tests:
- dbt_utils.unique_combination_of_columns:
combination_of_columns:
- key
- date
columns:
- name: key
description: |
Synthetic key from `organization_key`, `service_key`, `gtfs_service_data_key`, `gtfs_dataset_key`,
and `schedule_feed_key`.
tests:
- not_null
- name: date
description: |
Date on which this combination of records is present in our data.
- name: organization_name
- name: service_name
- name: gtfs_dataset_name
- name: gtfs_dataset_type
- name: assessed
description: |
Boolean indicator for whether this combined entity met the criteria to be "assessed"
on the given date. Applies "and" logic to `organization_assessed`, `service_assessed`,
and `gtfs_service_data_assessed`.
- name: organization_assessed
description: |
Boolean indicator for whether the given organization record met the criteria to be "assessed"
on the given date, i.e., had `reporting_category` = 'Core' or 'Other Public Transit'.
- name: service_assessed
description: |
Boolean indicator for whether the given service record met the criteria to be "assessed"
on the given date, i.e., was `currently_operating` and had `service_type` 'fixed-route'.
- name: gtfs_service_data_assessed
description: |
Boolean indicator for whether the given service/GTFS dataset (`gtfs_service_data`) record met
the criteria to be "assessed" on the given date, i.e., was `customer_facing` or had `category` = 'primary'.
- name: base64_url
- name: organization_key
description: |
Foreign key to `dim_organizations`; note that this is a *historical* key value and
so joins may result in some unexpected behavior (for example, deleted records will
fail to join.)
- name: service_key
description: |
Foreign key to `dim_services`; note that this is a *historical* key value and
so joins may result in some unexpected behavior (for example, deleted records will
fail to join.)
- name: gtfs_service_data_key
description: |
Foreign key to `dim_gtfs_service_data`; note that this is a *historical* key value and
so joins may result in some unexpected behavior (for example, deleted records will
fail to join.)
- name: gtfs_dataset_key
description: |
Foreign key to `dim_gtfs_datasets`; note that this is a *historical* key value and
so joins may result in some unexpected behavior (for example, deleted records will
fail to join.)
- name: schedule_feed_key
description: |
Foreign key to `dim_schedule_feeds`. Because `dim_schedule_feeds` is a slowly
changing dimension, joins using this key are not subject to the historical
limitations noted for the other foreign keys in this table.
- name: int_gtfs_quality__no_schedule_validation_errors
tests: &stg_gtfs_guideline_tests_schedule
- dbt_utils.unique_combination_of_columns:
@@ -0,0 +1,129 @@
{{ config(materialized='table') }}

WITH orgs AS (
SELECT *
FROM {{ ref('int_transit_database__organizations_history') }}
),

services AS (
SELECT *
FROM {{ ref('int_transit_database__services_history') }}
),

service_data AS (
SELECT *
FROM {{ ref('int_transit_database__gtfs_service_data_history') }}
),

datasets AS (
SELECT *
FROM {{ ref('int_transit_database__gtfs_datasets_history') }}
),

feeds AS (
SELECT *
FROM {{ ref('fct_daily_schedule_feeds') }}
-- fct_daily_schedule_feeds includes future dates; keep only dates up to today
WHERE date <= CURRENT_DATE()
),

full_join AS (
SELECT
COALESCE(orgs.date,
services.date,
service_data.date,
datasets.date,
feeds.date) AS date,
COALESCE(orgs.organization_key, services.provider_organization_key) AS organization_key,
COALESCE(orgs.mobility_services_managed_service_key,
services.service_key,
service_data.service_key)
AS service_key,
COALESCE(service_data.gtfs_dataset_key,
datasets.key) AS gtfs_dataset_key,

orgs.name AS organization_name,
orgs.assessment_status AS organization_raw_assessment_status,
orgs.reporting_category AS reporting_category,
COALESCE(
orgs.assessment_status,
(orgs.reporting_category = "Core") OR (orgs.reporting_category = "Other Public Transit"),
FALSE
) AS organization_assessed,

services.name AS service_name,
services.assessment_status AS services_raw_assessment_status,
services.currently_operating AS service_currently_operating,
services.service_type_str,
COALESCE(
services.assessment_status,
services.currently_operating
AND CONTAINS_SUBSTR(services.service_type_str, "fixed-route"),
FALSE
) AS service_assessed,

service_data.gtfs_service_data_key,
service_data.customer_facing AS gtfs_service_data_customer_facing,
service_data.category AS gtfs_service_data_category,
COALESCE(
service_data.customer_facing,
service_data.category = "primary",
FALSE
) AS gtfs_service_data_assessed,

datasets.name AS gtfs_dataset_name,
datasets.type AS gtfs_dataset_type,
datasets.regional_feed_type,
datasets.base64_url,

feeds.feed_key AS schedule_feed_key
FROM orgs
FULL OUTER JOIN services
ON orgs.date = services.date
AND orgs.mobility_services_managed_service_key = services.service_key
AND orgs.organization_key = services.provider_organization_key
FULL OUTER JOIN service_data
ON services.date = service_data.date
AND services.service_key = service_data.service_key
FULL OUTER JOIN datasets
ON service_data.date = datasets.date
AND service_data.gtfs_dataset_key = datasets.key
FULL OUTER JOIN feeds
ON datasets.date = feeds.date
AND datasets.base64_url = feeds.base64_url
),

int_gtfs_quality__daily_assessment_candidate_entities AS (
SELECT
{{ dbt_utils.surrogate_key([
'organization_key',
'service_key',
'gtfs_service_data_key',
'gtfs_dataset_key',
'schedule_feed_key']) }} AS key,
date,
organization_name,
service_name,
gtfs_dataset_name,
gtfs_dataset_type,

(organization_assessed
AND service_assessed
AND gtfs_service_data_assessed) AS assessed,


organization_assessed,
service_assessed,
gtfs_service_data_assessed,

base64_url,

organization_key,
service_key,
gtfs_service_data_key,
gtfs_dataset_key,
schedule_feed_key
FROM full_join
)

SELECT * FROM int_gtfs_quality__daily_assessment_candidate_entities
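
The three `*_assessed` flags in this model share one pattern: COALESCE takes the raw status column (`assessment_status`, or `customer_facing` for service data) when it is present, otherwise falls back to the attribute-based rule, and finally to FALSE. The following is a minimal standalone sketch of that pattern in BigQuery SQL for the organization flag only. The `sample_orgs` CTE and its rows are hypothetical illustration data, not values from the Transit Database.

```sql
-- Hypothetical sample rows to illustrate the fallback order; not real data.
WITH sample_orgs AS (
    SELECT *
    FROM UNNEST([
        -- no raw status recorded, category qualifies
        STRUCT(CAST(NULL AS BOOL) AS assessment_status, 'Core' AS reporting_category),
        -- raw status explicitly FALSE, category would otherwise qualify
        STRUCT(FALSE AS assessment_status, 'Core' AS reporting_category),
        -- neither value recorded
        STRUCT(CAST(NULL AS BOOL) AS assessment_status, CAST(NULL AS STRING) AS reporting_category)
    ])
)

SELECT
    assessment_status,
    reporting_category,
    -- COALESCE returns the first non-NULL argument, evaluated left to right
    COALESCE(
        assessment_status,
        reporting_category IN ('Core', 'Other Public Transit'),
        FALSE
    ) AS organization_assessed
FROM sample_orgs
```

Under this pattern an explicit FALSE in `assessment_status` takes precedence over a qualifying `reporting_category`, and a row with both values NULL resolves to FALSE rather than NULL.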
@@ -15,15 +15,11 @@ int_gtfs_rt__daily_url_index AS (
configs.dt,
url_to_encode AS string_url,
base64_url,
- CASE
- WHEN data = "GTFS Alerts" THEN "service_alerts"
- WHEN data = "GTFS VehiclePositions" THEN "vehicle_positions"
- WHEN data = "GTFS TripUpdates" THEN "trip_updates"
- END AS type
+ type
FROM int_gtfs_rt__distinct_download_configs AS configs
LEFT JOIN stg_transit_database__gtfs_datasets AS datasets
ON configs._config_extract_ts = datasets.ts
- WHERE data IN ("GTFS Alerts", "GTFS VehiclePositions", "GTFS TripUpdates")
+ WHERE type IN ("service_alerts", "vehicle_positions", "trip_updates")
QUALIFY RANK() OVER (PARTITION BY configs.dt, url_to_encode, base64_url ORDER BY _config_extract_ts DESC) = 1
)

@@ -76,3 +76,137 @@ models:
description: Foreign key to dim_gtfs_datasets, for a record with data = "GTFS VehiclePositions."
- name: gtfs_dataset_key_trip_updates
description: Foreign key to dim_gtfs_datasets, for a record with data = "GTFS TripUpdates."
- name: int_transit_database__services_history
description: |
Daily list of services, unnested to the organization/service relationship level,
with attributes required to check whether they were assessed by Cal-ITP on the given date.
This table is different than other Transit Database data because it
captures the historical state of various attributes. This historical
data is meant to be used for quality assessment purposes.
For the most up to date information about services, use dim_services.
tests:
- dbt_utils.unique_combination_of_columns:
combination_of_columns:
- key
- date
columns:
- name: date
description: |
Date on which this service was in our Transit Database data.
- name: key
description: |
Synthetic key, hash of `service_key` and `provider_organization_key`.
- name: service_key
description: |
Service record key.
- name: name
- name: assessment_status
description: |
Assessment status from raw Transit Database data as of date.
- name: currently_operating
description: |
Currently operating value from raw Transit Database data as of date.
- name: service_type_str
description: |
Service types from raw Transit Database data as of date, concatenated together
into a comma-delimited string.
- name: provider
description: |
Array of organization key(s) for organization(s) managing this service as of this date.
- name: int_transit_database__organizations_history
description: |
Daily list of organizations, unnested to the organization/service relationship level,
with attributes required to check whether they were assessed by Cal-ITP on the given date.
This table is different than other Transit Database data because it
captures the historical state of various attributes. This historical
data is meant to be used for quality assessment purposes.
For the most up to date information about organizations, use dim_organizations.
tests:
- dbt_utils.unique_combination_of_columns:
combination_of_columns:
- key
- date
columns:
- name: date
description: |
Date on which this organization was in our Transit Database data.
- name: key
description: |
Synthetic key, hash of `organization_key` and `mobility_services_managed_service_key`.
- name: organization_key
description: |
Organization record key.
- name: name
- name: assessment_status
description: |
Assessment status from raw Transit Database data as of date.
- name: reporting_category
description: |
Reporting category value from raw Transit Database data as of date.
- name: mobility_services_managed
description: |
Array of service keys for managed services as of this date.
- name: int_transit_database__gtfs_service_data_history
description: |
Daily list of GTFS dataset/service relationships with attributes required to check
whether they were assessed by Cal-ITP on the given date.
This table is different than other Transit Database data because it
captures the historical state of various attributes. This historical
data is meant to be used for quality assessment purposes.
For the most up to date information about GTFS dataset/service relationships, use dim_gtfs_service_data.
tests:
- dbt_utils.unique_combination_of_columns:
combination_of_columns:
- key
- date
columns:
- name: date
description: |
Date on which this GTFS dataset/service relationship was in our Transit Database data.
- name: key
description: |
Synthetic key, hash of `gtfs_service_data_key`, `service_key` and `gtfs_dataset_key`.
- name: gtfs_service_data_key
description: |
GTFS service data record key.
- name: service_key
description: |
Service record key.
- name: gtfs_dataset_key
description: |
GTFS dataset record key.
- name: name
- name: customer_facing
- name: category
- name: int_transit_database__gtfs_datasets_history
description: |
Daily list of GTFS datasets with attributes required to check
whether they were assessed by Cal-ITP on the given date.
This table is different than other Transit Database data because it
captures the historical state of various attributes. This historical
data is meant to be used for quality assessment purposes.
For the most up to date information about GTFS datasets, use dim_gtfs_datasets.
tests:
- dbt_utils.unique_combination_of_columns:
combination_of_columns:
- key
- date
columns:
- name: date
description: |
Date on which this GTFS dataset was in our Transit Database data.
- name: key
description: |
GTFS dataset record key.
- name: name
- name: type
- name: regional_feed_type
- name: base64_url
@@ -0,0 +1,19 @@
{{ config(materialized='table') }}

WITH stg_transit_database__gtfs_datasets AS (
SELECT *
FROM {{ ref('stg_transit_database__gtfs_datasets') }}
),

int_gtfs_quality__gtfs_datasets_history AS (
SELECT
calitp_extracted_at AS date,
key,
name,
type,
regional_feed_type,
base64_url
FROM stg_transit_database__gtfs_datasets
)

SELECT * FROM int_gtfs_quality__gtfs_datasets_history
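
The YAML for this model declares a `dbt_utils.unique_combination_of_columns` test on `key` and `date`. A rough manual equivalent, assuming the model is built under the name `int_transit_database__gtfs_datasets_history` as the YAML suggests, could be run as a dbt analysis along these lines. It is a sketch for ad hoc debugging, not the test's actual implementation.

```sql
-- Lists (key, date) pairs that occur more than once in the history model.
-- Assumption: the model name below matches the one declared in the YAML.
SELECT
    key,
    date,
    COUNT(*) AS n_rows
FROM {{ ref('int_transit_database__gtfs_datasets_history') }}
GROUP BY key, date
HAVING COUNT(*) > 1
ORDER BY n_rows DESC
```

If the uniqueness assumption holds, the query returns no rows. Any rows it does return identify days on which a single dataset key appears more than once in the extract.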