# Feeds with No Critical Validation Errors

Ref: [GH issue](https://github.com/cal-itp/data-infra/issues/513)

Because the number of validation errors varies widely from feed to feed, I think we should count a feed only if it has 0 validation errors that Cal-ITP considers critical. For static GTFS, the validation notices are in `views.validation_fact_daily_feed_notices`. Group the notices by day and filter out the codes that are **not** included in the `views.validation_code_descriptions` table. Feeds with 0 remaining validation errors should be added to the overall count.
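The counting rule described here can be sketched in plain pandas. This is a minimal illustration on made-up data, not the notebook's actual query: the `notices` rows, the `critical_codes` list, and the helper name `feeds_without_critical_errors` are all assumptions chosen for the example; only the column names mirror the notices table.

```python
import pandas as pd

# Toy notices: one row per (feed, day, notice code). Values are invented.
notices = pd.DataFrame({
    "feed_key": [1, 1, 2, 3],
    "date": ["2021-05-29"] * 4,
    "code": ["too_fast_travel", "unknown_column",
             "unknown_column", "too_fast_travel"],
})

# Hypothetical stand-in for the codes listed in validation_code_descriptions.
critical_codes = ["too_fast_travel"]


def feeds_without_critical_errors(notices, critical_codes):
    """Count, per day, the feeds that have zero critical notice codes."""
    # Every feed seen on each day, regardless of which notices it has.
    all_feeds = notices[["date", "feed_key"]].drop_duplicates()

    # Keep only critical notices, then count distinct codes per feed-day.
    critical = notices[notices["code"].isin(critical_codes)]
    n_critical = (
        critical.groupby(["date", "feed_key"])["code"]
        .nunique()
        .rename("critical_errors")
        .reset_index()
    )

    # Left merge so feeds with no critical notices survive with a count of 0.
    merged = all_feeds.merge(n_critical, on=["date", "feed_key"], how="left")
    merged["critical_errors"] = merged["critical_errors"].fillna(0).astype(int)

    clean = merged[merged["critical_errors"] == 0]
    return (
        clean.groupby("date")["feed_key"]
        .nunique()
        .rename("feeds_no_critical_errors")
        .reset_index()
    )


result = feeds_without_critical_errors(notices, critical_codes)
print(result)
```

In this toy data only feed 2 counts, because its single notice is not in the critical list. A day on which no feed is clean would be absent from the result rather than shown as 0.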

### Findings
For this one we should do 3 graphs as follows: 
1. Static GTFS errors 
1. GTFS-RT errors 
1. Static + GTFS-RT 

The GTFS-RT aspect of this is blocked until the pipeline to validate GTFS-RT has been completed.

In [1]:
import altair as alt
import pandas as pd
import os
os.environ["CALITP_BQ_MAX_BYTES"] = str(100_000_000_000)

import calitp
from calitp.tables import tbl
from siuba import *

from shared_utils import altair_utils
from shared_utils import geography_utils

alt.themes.enable("fivethirtyeight")
alt.renderers.enable('altair_saver', fmts=['png'])



RendererRegistry.enable('altair_saver')

In [2]:
'''
code_descriptions = (tbl.views.validation_code_descriptions()
 >> collect()
)

code_descriptions.to_parquet("./validation_code_descriptions.parquet")
'''
code_descriptions = pd.read_parquet("./validation_code_descriptions.parquet")

In [3]:
# All of these are critical for reporting...52 errors total
code_descriptions.critical_for_reporting.value_counts()

Series([], Name: critical_for_reporting, dtype: int64)

In [4]:
critical_errors = list(code_descriptions.code.unique())

'''
daily_validations = (
    tbl.views.validation_fact_daily_feed_notices()
    >> distinct(_.feed_key, _.calitp_itp_id, _.calitp_url_number,
                _.date, _.code)
    >> collect()
    >> filter(_.code.isin(critical_errors))
)

daily_validations.to_parquet("./daily_validations.parquet")

# Count how many times each error code occurs within each feed-day.
# The earlier df only captures which distinct error codes appear.
num_errors = (
    tbl.views.validation_fact_daily_feed_notices()
    >> group_by(_.feed_key, _.calitp_itp_id, _.calitp_url_number,
                _.date, _.code)
    >> count(_.code)
    >> collect()
    >> filter(_.code.isin(critical_errors))
)

num_errors.to_parquet("./num_errors.parquet")
'''

daily_validations = pd.read_parquet("./daily_validations.parquet")
num_errors = pd.read_parquet("./num_errors.parquet")

In [5]:
GROUP_COLS = ["feed_key", "date", "calitp_itp_id", "calitp_url_number"]

daily_validations = daily_validations.merge(
    num_errors.rename(
    columns = {"n": "num_errors"}), 
    on = GROUP_COLS + ["code"],
    how = "inner",
    validate = "1:1"
)

daily_validations.head(2)

Unnamed: 0,feed_key,calitp_itp_id,calitp_url_number,date,code,num_errors
0,-7847305992506683453,48,0,2021-05-29,stop_time_timepoint_without_times,34422
1,72313453925071107,120,0,2021-06-08,stop_time_timepoint_without_times,40530


In [6]:
daily_errors = (
    geography_utils.aggregate_by_geography(daily_validations, 
                                           group_cols = GROUP_COLS,
                                           sum_cols = ["num_errors"],
                                           count_cols = ["code"])
    .rename(columns = {"code": "critical_errors"})
)

daily_errors.head(2)

Unnamed: 0,feed_key,date,calitp_itp_id,calitp_url_number,num_errors,critical_errors
0,-7847305992506683453,2021-05-29,48,0,34422,1
1,72313453925071107,2021-06-08,120,0,40532,3


In [7]:
# For a given day, grab all the feeds (exclude info about errors)
'''
daily_feed = (
    tbl.views.validation_fact_daily_feed_notices()
    >> distinct(_.feed_key, _.date,
                _.calitp_itp_id, _.calitp_url_number,
               )
    >> collect()
)

daily_feed.to_parquet("./daily_feed.parquet")
'''
daily_feed = pd.read_parquet("./daily_feed.parquet")

In [8]:
df = pd.merge(
    daily_feed,
    daily_errors,
    on = GROUP_COLS,
    how = "left",
    validate = "1:m"
)

df = df.assign(
    critical_errors = df.critical_errors.fillna(0).astype(int),
    num_errors = df.num_errors.fillna(0).astype(int),
    date = pd.to_datetime(df.date),
    total_feeds = df.groupby("date")["feed_key"].transform("count")
)

df.head()

Unnamed: 0,feed_key,date,calitp_itp_id,calitp_url_number,num_errors,critical_errors,total_feeds
0,-1796609641282825154,2021-09-18,112,0,3619,3,178
1,7817634405777406718,2021-10-07,284,0,2782,5,177
2,-1796609641282825154,2021-09-30,112,0,3619,3,177
3,2508272846412219500,2021-10-28,474,0,15425,2,180
4,2415344669691788437,2021-10-07,200,0,6486,9,177


### Feeds with No Critical Errors

In [9]:
no_errors = (
    geography_utils.aggregate_by_geography(
        df[df.critical_errors==0],
        group_cols = ["date"],
        nunique_cols = ["feed_key"]
    )
)

In [10]:
axis_date_format ="%-m/%-d/%y"

def base_line_chart(df):
    chart = (alt.Chart(df)
             .mark_line()
             .encode(
                 x=alt.X("date", axis=alt.Axis(format=axis_date_format))
             )
            )
    return chart

In [11]:
chart = base_line_chart(no_errors)
             
chart = (chart
         .encode(
             y=alt.Y("feed_key:Q", title="# feeds"),
             color=alt.value(altair_utils.FIVETHIRTYEIGHT_CATEGORY_COLORS[0])
         ).properties(title="# Feeds with No Critical Errors")
        )

chart = altair_utils.preset_chart_config(chart)
chart.save("./no_errors.png")

There is a sudden drop. Why? Is it because feeds expired?

Also add a 2nd chart showing the % of feeds with no critical errors.

![no_errors](./no_errors.png)
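The percentage version of this chart needs one extra column. A minimal sketch of that calculation follows, assuming a frame shaped like `df` above with exactly one row per feed per day; the sample values and the helper name `pct_feeds_no_critical_errors` are invented for illustration.

```python
import pandas as pd

# Toy feed-day table: one row per feed per day (values are made up).
feed_days = pd.DataFrame({
    "date": pd.to_datetime(["2021-10-07"] * 4 + ["2021-10-08"] * 4),
    "feed_key": [1, 2, 3, 4, 1, 2, 3, 4],
    "critical_errors": [0, 0, 3, 5, 0, 2, 3, 5],
})


def pct_feeds_no_critical_errors(df):
    """Share of each day's feeds that have zero critical errors."""
    out = (
        df.assign(no_errors=df["critical_errors"].eq(0))
        .groupby("date")
        .agg(
            total_feeds=("feed_key", "nunique"),
            # Summing booleans counts the feed-days flagged True.
            feeds_no_errors=("no_errors", "sum"),
        )
        .reset_index()
    )
    out["pct_no_errors"] = out["feeds_no_errors"] / out["total_feeds"]
    return out


pct = pct_feeds_no_critical_errors(feed_days)
print(pct)
```

The resulting `pct_no_errors` column could then be plotted with the same line-chart helper, using a percent axis format as in the later charts.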

### Unique Errors by Day

* Total number of unique errors over all the operators.
* For each operator, count the number of unique errors. 
* Sum across the entire day.
* Add a chart of the number of unique error codes across the day (not per operator-day).
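The two counts in this list are easy to confuse, so here is a small sketch of the difference on made-up data. The sample rows and the helper name `unique_errors_by_day` are assumptions for illustration only.

```python
import pandas as pd

# Toy table of distinct critical codes per feed-day (values are invented).
validations = pd.DataFrame({
    "date": ["2021-09-30"] * 5,
    "feed_key": [1, 1, 2, 2, 3],
    "code": [
        "too_fast_travel", "feed_expiration_date",
        "too_fast_travel", "duplicate_route_name",
        "too_fast_travel",
    ],
})


def unique_errors_by_day(df):
    """Return both daily measures side by side."""
    # Measure 1: unique codes per feed, then summed over the day.
    per_feed = df.groupby(["date", "feed_key"])["code"].nunique()
    summed = (
        per_feed.groupby(level="date").sum()
        .rename("sum_of_feed_level_unique_errors")
    )

    # Measure 2: unique codes across the whole day, ignoring the feed.
    distinct = df.groupby("date")["code"].nunique().rename("distinct_codes")

    return pd.concat([summed, distinct], axis=1).reset_index()


daily = unique_errors_by_day(validations)
print(daily)
```

Here the feed-level sum is 5 (2 + 2 + 1), while only 3 distinct codes appear on the day, because `too_fast_travel` is shared by all three feeds.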

In [12]:
chart = base_line_chart((df[df.critical_errors > 0]
                         .groupby("date")
                         .agg({"critical_errors": "sum"})
                         .reset_index()
                        ))
chart = (chart
         .encode(
             y=alt.Y("critical_errors:Q", title="Total # Unique Errors"),
             color=alt.value(altair_utils.FIVETHIRTYEIGHT_CATEGORY_COLORS[0])
         ).properties(title="Total Validation Errors Across Operators")
        )

chart = altair_utils.preset_chart_config(chart)
chart.save("./total_validation_errors.png")

![total_unique_errors](./total_validation_errors.png)

### Feeds by Error Type
Percentage of daily feeds with each type of error.

In [13]:
errors_by_type = (
    geography_utils.aggregate_by_geography(
        daily_validations,
        group_cols = ["date", "code"],
        nunique_cols = ["feed_key"]
    ).rename(columns = {"feed_key": "num_feeds"})
)

errors_by_type.head()

Unnamed: 0,date,code,num_feeds
0,2021-05-29,stop_time_timepoint_without_times,24
1,2021-06-08,stop_time_timepoint_without_times,25
2,2021-09-30,stop_time_timepoint_without_times,22
3,2021-09-30,decreasing_or_equal_shape_distance,31
4,2021-06-28,stop_time_timepoint_without_times,23


In [14]:
errors_by_type2 = pd.merge(
    errors_by_type.assign(date = pd.to_datetime(errors_by_type.date)),
    df[["date", "total_feeds"]].drop_duplicates().reset_index(drop=True), 
    on = "date",
    how = "inner",
    validate = "m:1"
)

errors_by_type2 = errors_by_type2.assign(
    pct_error = errors_by_type2.num_feeds.divide(errors_by_type2.total_feeds)
)

errors_by_type2.head(2)

Unnamed: 0,date,code,num_feeds,total_feeds,pct_error
0,2021-05-29,stop_time_timepoint_without_times,24,175,0.137143
1,2021-05-29,decreasing_or_equal_shape_distance,28,175,0.16


In [15]:
chart = base_line_chart(errors_by_type2)

chart = (chart
 .encode(
     y=alt.Y("pct_error:Q", title="% feeds", axis=alt.Axis(format="%")),
     color=alt.Color("code:N", title="Validation Error", 
                    scale=alt.Scale(range=altair_utils.FIVETHIRTYEIGHT_CATEGORY_COLORS))
 ).properties(title="% Daily Feeds by Critical Validation Error Types")
)

chart = (altair_utils.preset_chart_config(chart)
        )
chart.save("./pct_feeds_by_error.png")

![feeds_by_error](./pct_feeds_by_error.png)

In [16]:
# Better chart: pick out the most common errors (by total feeds affected)
# and plot only those, since 52 codes is too many to show at once
top_errors = (errors_by_type2.groupby(["code"])
              .agg({"num_feeds": "sum"})
              .reset_index()
              .sort_values("num_feeds", ascending=False)
              .reset_index(drop=True)
)

TOP_ERRORS = list(top_errors.code.iloc[:10])
TOP_ERRORS

['duplicate_route_name',
 'decreasing_or_equal_shape_distance',
 'too_fast_travel',
 'stop_time_timepoint_without_times',
 'duplicate_fare_rule_zone_id_fields',
 'feed_expiration_date',
 'route_short_name_too_long',
 'stop_too_far_from_trip_shape',
 'decreasing_or_equal_stop_time_distance',
 'same_name_and_description_for_route']

In [17]:
chart = base_line_chart(
    errors_by_type2[errors_by_type2.code.isin(TOP_ERRORS)])

chart = (chart
 .encode(
     y=alt.Y("pct_error:Q", title="% feeds", axis=alt.Axis(format="%")),
     color=alt.Color("code:N", title="Validation Error", 
                    scale=alt.Scale(range=altair_utils.FIVETHIRTYEIGHT_CATEGORY_COLORS))
 ).properties(title="% Daily Feeds by Top 10 Critical Validation Error Types")
)

chart = (altair_utils.preset_chart_config(chart)
        )

chart.save("./pct_feeds_by_error_top10.png")

![top10_errors](./pct_feeds_by_error_top10.png)

Metabase question: which feeds are expired? Click on a validation error to see which operators have it.
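Until that Metabase question exists, the same lookup can be approximated in pandas. The sketch below is an assumption-laden illustration: the sample rows and the helper `operators_with_error` are hypothetical, and it treats the `feed_expiration_date` notice as a proxy for an expired feed, which may not match the validator's exact definition.

```python
import pandas as pd

# Toy feed-day-code table (values are made up).
validations = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-10-27", "2021-10-28", "2021-10-28", "2021-10-28"]
    ),
    "calitp_itp_id": [48, 48, 120, 200],
    "calitp_url_number": [0, 0, 0, 0],
    "code": [
        "feed_expiration_date", "feed_expiration_date",
        "too_fast_travel", "feed_expiration_date",
    ],
})


def operators_with_error(df, code, on_date=None):
    """List the operators whose feed shows `code` on a given day.

    Defaults to the most recent date in the table.
    """
    if on_date is None:
        on_date = df["date"].max()
    mask = (df["code"] == code) & (df["date"] == pd.Timestamp(on_date))
    return (
        df.loc[mask, ["calitp_itp_id", "calitp_url_number"]]
        .drop_duplicates()
        .sort_values(["calitp_itp_id", "calitp_url_number"])
        .reset_index(drop=True)
    )


expired = operators_with_error(validations, "feed_expiration_date")
print(expired)
```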