
Bug: zero trips showing up in fct_scheduled_trips for select agencies #3383

Closed

evansiroky opened this issue Jun 27, 2024 · 9 comments

@evansiroky (Member)

Describe the bug

There are zero trips in the fct_scheduled_trips table for:

  • B-Line after May 17
  • Long Beach after June 14

To Reproduce
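
A query along these lines should show the trip counts dropping to zero after the dates above (this is a sketch; gtfs_dataset_name and service_date are assumed column names in fct_scheduled_trips):

-- Sketch: count scheduled trips per service date for one agency.
-- gtfs_dataset_name and service_date are assumptions about the table's schema.
SELECT
  gtfs_dataset_name,
  service_date,
  COUNT(*) AS n_trips
FROM cal-itp-data-infra.mart_gtfs.fct_scheduled_trips
WHERE gtfs_dataset_name LIKE '%B-Line%'
  AND service_date >= DATE '2024-05-01'
GROUP BY gtfs_dataset_name, service_date
ORDER BY service_date;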

Expected behavior

There should be scheduled trips appearing in this table for all active dates.

Additional context

B-Line and Long Beach both publish their schedules using the calendar-dates-only style. However, I'm not sure whether this is the cause of the problem.
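
To confirm the calendar-dates-only pattern, a comparison like this could work (a sketch; dim_calendar and dim_calendar_dates are assumed mart tables keyed by feed_key):

-- Sketch: a calendar-dates-only feed defines service in calendar_dates
-- with no matching calendar rows. dim_calendar / dim_calendar_dates are
-- assumptions about the mart schema.
SELECT
  cd.feed_key,
  COUNT(DISTINCT cd.service_id) AS calendar_dates_services,
  COUNT(DISTINCT c.service_id) AS calendar_services
FROM cal-itp-data-infra.mart_gtfs.dim_calendar_dates cd
LEFT JOIN cal-itp-data-infra.mart_gtfs.dim_calendar c
  ON c.feed_key = cd.feed_key
GROUP BY cd.feed_key;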

@vevetron (Contributor) commented Jul 2, 2024

This query seems okay:

SELECT * FROM cal-itp-data-infra.staging.stg_transit_database__gtfs_datasets
WHERE name LIKE '%B-Line Schedule%'
ORDER BY dt DESC;

But...
B-Line appears to have a new GTFS feed with dates that start on 20240519.

By this query:

SELECT
  key,
  name,
  base64_url,
  data_quality_pipeline,
  _valid_from,
  _valid_to
FROM cal-itp-data-infra.mart_transit_database.dim_gtfs_datasets
WHERE name LIKE '%B-Line%' AND type = 'schedule'
ORDER BY _valid_from;

We are still using the GTFS that starts on 3-12.

Maybe we are still using the previous gtfs and haven't properly ingested the new one?
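
One way to check what we've actually ingested would be to look at the schedule-feed dimension directly (a sketch; dim_schedule_feeds with base64_url, _valid_from, and _valid_to is an assumption about the mart):

-- Sketch: list the feed versions ingested for this URL, newest first.
-- dim_schedule_feeds and its columns are assumptions about the mart schema.
SELECT
  feed_key,
  base64_url,
  _valid_from,
  _valid_to
FROM cal-itp-data-infra.mart_gtfs.dim_schedule_feeds
WHERE SAFE_CONVERT_BYTES_TO_STRING(SAFE.FROM_BASE64(base64_url)) LIKE '%bline%'  -- hypothetical URL fragment
ORDER BY _valid_from DESC;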

@vevetron (Contributor)

Sacramento Schedule is also affected; it cuts out around the same date, 5/13.
[image]

This ends up in the reports website as well:
[image]

@evansiroky evansiroky changed the title Bug: zero trips not showing up in fct_scheduled_trips for select agencies Bug: zero trips showing up in fct_scheduled_trips for select agencies Jul 11, 2024
@vevetron vevetron self-assigned this Jul 15, 2024
@vevetron (Contributor) commented Jul 18, 2024

So there are a decent number of stop_times that didn't get processed, while the stops did get processed.

If you compare the count of the unique feeds used for stops vs the unique feeds used for stop times, there is an incongruity.

WITH s AS (
  SELECT base64_url, COUNT(DISTINCT feed_key) AS feed_key
  FROM cal-itp-data-infra.mart_gtfs.dim_stops
  GROUP BY base64_url
),
st AS (
  SELECT COUNT(DISTINCT st.feed_key) AS feed_key, st.base64_url
  FROM cal-itp-data-infra.mart_gtfs.dim_stop_times st
  GROUP BY base64_url
)
SELECT
  s.feed_key,
  st.feed_key AS st_key_count,
  s.base64_url,
  CASE
    WHEN REGEXP_CONTAINS(s.base64_url, r'^[A-Za-z0-9+/=]*$')  -- check for valid Base64 characters
    THEN SAFE_CONVERT_BYTES_TO_STRING(FROM_BASE64(s.base64_url))
    ELSE NULL  -- or handle the invalid case differently (e.g., return an error message)
  END AS decoded_column
FROM s
LEFT JOIN st
ON s.base64_url = st.base64_url;

[image]

There is a slight pattern that makes me think the error is caused by this issue:
#3368

@vevetron (Contributor) commented Jul 22, 2024

So I've thought of two potential fixes for this issue:

  • Wait till Bug: unzip_and_validated_gtfs - negsignal.SIGKILL #3368 is fixed.
  • Find the bad Airflow jobs and have them re-run in production, either manually or scripted. I'm not sure how to do this, but it's probably doable.
    Pros: Should fix all the bad Airflow jobs at the same time, and the Airflow records will be fixed as well.
    Cons: Slow, hard.

Another option:

Identify the affected feeds and manually re-run the processing / zipping. And do the same for the shapes that might be missing.

Pros: Potentially easier and faster.
Cons: We'd still have bad Airflow records, but the data would be complete.

@vevetron (Contributor)

Based on a discussion with @mjumbewu, we'll hold off on option two and wait for #3368 to get resolved. Then we will go into the Airflow console and manually re-run the failed jobs. There seem to be only about 15, so this is doable.

@vevetron (Contributor) commented Sep 6, 2024

I reran all the failed jobs (slowly, painfully, manually) and got all but 2 to work; those 2 consistently fail. I ran a query to see if this fixed things, but I think I need to wait for a Sunday full refresh for that data to populate.

Not sure if it's worth it to fix those last two.

@vevetron (Contributor)

A better query:

WITH s AS (
  SELECT base64_url, COUNT(DISTINCT feed_key) AS feed_key
  FROM cal-itp-data-infra.mart_gtfs.dim_stops
  GROUP BY base64_url
),
st AS (
  SELECT COUNT(DISTINCT st.feed_key) AS feed_key, st.base64_url
  FROM cal-itp-data-infra.mart_gtfs.dim_stop_times st
  GROUP BY base64_url
)
SELECT
  s.feed_key,
  st.feed_key AS st_key_count,
  s.feed_key - st.feed_key AS difference,
  s.base64_url,
  CASE
    WHEN REGEXP_CONTAINS(s.base64_url, r'^[A-Za-z0-9+/=]*$')  -- check for valid Base64 characters
    THEN SAFE_CONVERT_BYTES_TO_STRING(FROM_BASE64(s.base64_url))
    ELSE NULL  -- or handle the invalid case differently (e.g., return an error message)
  END AS decoded_column,
  IF(s.feed_key = st.feed_key, 1, 0) AS key_match  -- added: 1 when the feed counts match
FROM s
LEFT JOIN st
ON s.base64_url = st.base64_url
ORDER BY difference DESC;

The data looks waaay better now: 18 agencies each have 1 unprocessed GTFS.
[image]

@evansiroky (Member, Author)

I spoke with @vevetron, and I think this one can be marked as closed, as we should have all the needed data now.

@vevetron (Contributor)

Both B-Line and Long Beach are resolved for this issue; Sacramento is not. Sacramento's issue is likely because they publish their GTFS ahead of schedule without merging in the previous GTFS, and we don't handle that scenario.
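
For what it's worth, that ahead-of-schedule pattern might be detectable with something like this (a sketch; dim_feed_info with feed_start_date and _valid_from is an assumption about the mart):

-- Sketch: flag feed versions whose declared start date is after the date we
-- ingested them, i.e. feeds published ahead of schedule. dim_feed_info and
-- its columns are assumptions about the mart schema.
SELECT
  feed_key,
  feed_start_date,
  DATE(_valid_from) AS ingested_date
FROM cal-itp-data-infra.mart_gtfs.dim_feed_info
WHERE feed_start_date > DATE(_valid_from)
ORDER BY feed_start_date DESC;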
